Using captured Cookies in Power Query

Unfortunately, Power Query often does not reliably maintain the cookies needed to parse data from complex sites. However, it is possible to sidestep this issue by passing the cookies through manually.

In a previous post, I demonstrated how to parse Internet Explorer cookie files and bring them into Excel.

In this post, we'll transform those cookies into a cookie string and add it to a query.

Cookies need to be specified in the following format:

Cookie1=value1; Cookie2=value2; Cookie3=value3;

This is quite different from the table of cookies that we had at the conclusion of my previous post.

But we can use Power Query to fix that.

Start with a new Blank Query and refer to the #”IE Cookies” query that we had at the end of the previous exercise.

Source = #"IE Cookies"

Then filter to just the cookies needed.

Do this by filtering the Host/path column for the cookies corresponding to the domain of the website that you are parsing, e.g. flipboard.com.

(NB. This is only going to work for sites that maintain cookies between sessions.  Often these sites will have a “Remember Me” box that you can tick.)

[Screenshot: Query1 - Query Editor]

Having done this, remove all columns except Cookie Name and Cookie Value.

[Screenshot: Query1 - Query Editor]

Now to convert our table into a cookie string for our query:

Step 1:

Merge the columns using “=” as our separator and remove other columns

[Screenshot]

 

Resulting in:

[Screenshot: Query1 - Query Editor]

Step 2:

Transpose the remaining Merged column into a single row

Step 3:

Merge these columns using “; ” (note the space) as the separator and remove other columns

[Screenshot: Merged - Query Editor]

Step 4:

Drill down:

[Screenshot: Merged - Query Editor]
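Putting steps 1 to 4 together, the whole query comes out something like this. This is just a sketch: the Host/path, Cookie Name and Cookie Value column names are assumed from the previous post's output, and flipboard.com is only the example domain from above.

let
    Source = #"IE Cookies",
    // Keep only the cookies for the site we want to query
    FilteredCookies = Table.SelectRows(Source, each Text.Contains([#"Host/path"], "flipboard.com")),
    NameAndValue = Table.SelectColumns(FilteredCookies, {"Cookie Name", "Cookie Value"}),
    // Step 1: merge name and value using "=" as the separator
    MergedPairs = Table.CombineColumns(NameAndValue, {"Cookie Name", "Cookie Value"},
        Combiner.CombineTextByDelimiter("=", QuoteStyle.None), "Merged"),
    // Step 2: transpose the Merged column into a single row
    Transposed = Table.Transpose(MergedPairs),
    // Step 3: merge the resulting columns using "; " as the separator
    Combined = Table.CombineColumns(Transposed, Table.ColumnNames(Transposed),
        Combiner.CombineTextByDelimiter("; ", QuoteStyle.None), "Cookies"),
    // Step 4: drill down to the single text value
    CookieString = Combined{0}[Cookies]
in
    CookieString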

So now we have a cookie text string in the format we’ll need for a query, using cookies that we have harvested from Internet Explorer.

Using the cookies in an actual web query is quite straightforward.
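For example, here is a minimal sketch, assuming the cookie-string query above has been named CookieString and using a placeholder URL:

let
    // Pass the harvested cookies in the request header
    Source = Web.Contents("https://flipboard.com/", [Headers = [Cookie = CookieString]]),
    Page = Web.Page(Source)
in
    Page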

In the next post, I’ll demonstrate how to use this with a real web site.

How to use Youtube API from Power Query – Part 1

[Screenshot: Book1 - Excel]

Google offer quite a comprehensive API for Youtube, but I haven't seen any examples of people using Power Query to explore it.

So let’s do that – we’re going to use Power Query to access Youtube’s API and display a simple table of popular videos such as the above.

Before you can do this, you will need an API key for accessing the Youtube API.

Google provide instructions for this here.

The key steps are:

  1. Create a new project in the Google Developers Console.
  2. Turn on access for the YouTube Data API v3.
  3. Create a browser API key.

The end result should be a long string of random characters constituting your key for accessing the Youtube Data API.

Now, we can get into Excel and Power Query.

Usually when you are accessing an API in Power Query you would use Web API authentication. This doesn't seem to work for the Youtube Data API, as it expects the API key to be provided in the query string rather than the header, so we will need to specify our API key manually.

Start with a Blank Query and paste the following into the Advanced Editor:
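Something along these lines will do, based on the equivalent URL shown further down (a sketch; paste in your own API key):

let
    apiKey = "<your api key here>",
    // Search for the 50 most-viewed videos matching "power query"
    Source = Json.Document(Web.Contents(
        "https://www.googleapis.com/youtube/v3/search?key=" & apiKey
        & "&q=power%20query&part=snippet,id&maxResults=50&order=viewCount"))
in
    Source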

Update 28/3/2016

You may get a response indicating that you need to specify how to connect or otherwise enter credentials.  If you have already entered credentials you may need to edit them in order for future steps to work.

[Screenshot: Query2 - Query Editor]

Click to Edit Credentials.

Then you need to indicate that you want to connect anonymously.  This is because we are providing our credentials (our API key) as part of our Query string not via the Query header.

So when the below window appears, make sure you select Anonymous on the left, select the Youtube search hyperlink radio button on the right, and finally hit Connect.

[Screenshot: Query2 - Query Editor]

Expected result:

[Screenshot: query result]

The Youtube API documentation explains in detail what our query actually does, but essentially we are performing a search for the top 50 (by number of views) videos containing the term “power query”.

You could actually enter this directly into your browser if you wanted to. It is equivalent to:

https://www.googleapis.com/youtube/v3/search?key=<your api key here>&q=power%20query&part=snippet,id&maxResults=50&order=viewCount

 

Now, if you click on the items List, Power Query will return a list of the results as records.

[Screenshot: Youtube Search List Merge - Query Editor]

They will be easier to work with as a table, so next we'll convert the list to a table.

[Screenshot: Youtube Search List Merge - Query Editor]

Returning this:

[Screenshot: Youtube Search List Merge - Query Editor]

If you then expand out the column, you can start to see all of our data:

[Screenshot: Youtube Search List Merge - Query Editor]

Next expand out the id and snippet:

[Screenshot: Youtube Search List Merge - Query Editor]

As you can see, we now have a list of videos that all look to be related to Power Query!

If you remove all columns except video id, publishedAt, title and channelTitle, you should end up with a result something like this:

[Screenshot: Book1 - Excel]

We can immediately learn that ExcelIsFun is a very popular Youtube channel among Power Query enthusiasts – it has 4 videos in the top 10.  I definitely recommend checking it out.

Unfortunately, we don’t have any information about the number of views, or likes, or comments.

In order to add this information we will need to perform another query to the API asking for statistics.

I’ll be covering this in a future post.

Here’s the full code to date:
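Reconstructing the steps above, the full query looks roughly like this (a sketch; again, substitute your own API key):

let
    apiKey = "<your api key here>",
    Source = Json.Document(Web.Contents(
        "https://www.googleapis.com/youtube/v3/search?key=" & apiKey
        & "&q=power%20query&part=snippet,id&maxResults=50&order=viewCount")),
    // The items list holds one record per video
    items = Source[items],
    // Convert the list of records into a single-column table
    ItemsTable = Table.FromList(items, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Expand the record column, then the id and snippet records inside it
    Expanded = Table.ExpandRecordColumn(ItemsTable, "Column1", {"id", "snippet"}, {"id", "snippet"}),
    ExpandedId = Table.ExpandRecordColumn(Expanded, "id", {"videoId"}, {"videoId"}),
    Result = Table.ExpandRecordColumn(ExpandedId, "snippet",
        {"publishedAt", "title", "channelTitle"}, {"publishedAt", "title", "channelTitle"})
in
    Result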

A follow-up post has been added here.

Handling non tabular HTML content in Power Query – parsing Science Fiction Awards Database

Power Query is poor at extracting information from web pages that host non-tabular data.

I’ve been experimenting with a few techniques for addressing this.

For a bit of fun, we're going to look at information held in the Science Fiction Awards Database.

This is a wonderful resource that records the nominees and winners for every major science fiction and fantasy award.

The Hugos are arguably the most famous of all. You can see the results for 2015 here.

The format is great to read, but not if you want to pull the results into a spreadsheet for further analysis (e.g. alignment to GoodReads ratings).

[Screenshot: sfadb: Hugo Awards 2015 - Internet Explorer]

This doesn’t look like a table and Power Query won’t like it.

If we create a Web query using Power Query and point it at this page we’ll get an ugly HTML document.

Looking at the query in the Advanced Editor we see this:

Source = Web.Page(Web.Contents("http://www.sfadb.com/Hugo_Awards_2015"))

To make it useful, let’s view it as text instead.

Change the code in the advanced editor to be this instead (note I’ve stripped out the Web.Page).
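That is, the Source step becomes:

Source = Web.Contents("http://www.sfadb.com/Hugo_Awards_2015")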

Once you refresh, a little icon representing the downloaded web page will appear.

Right click on this and select Text.

[Screenshot: Document - Query Editor]

You should see the whole HTML source for the web page without any attempt by Power Query to parse the HTML.

If you scroll right down, you’ll find the sections with each of the winners and nominees:

[Screenshot: Document - Query Editor]

Definitely not tabular form 🙁

So let's modify the HTML so that it contains tables, and then send it back for HTML table parsing.

Here’s the code:

In the first section, we locate each of our award categories and add a new column which contains an HTML table header with the award category as an id.

Then we change all of our list items into table entries instead.

Finally, we convert our new HTML back into a binary file and pass it back to Web.Page. This time Power Query will find some tables, and here is the result:

[Screenshot: Awards (2) - Query Editor]

There’s a lot more we could now do to improve the format, such as split the title/author/publisher. However, the intent here is simply to show the general approach of manipulating the HTML in text form and then sending back to the HTML parser.
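For reference, a stripped-down sketch of that approach looks like this. It simply turns every HTML list into a table and every list item into a row, using generic ul/li markers rather than sfadb's actual markup, and it skips the category-id step described above, so treat it as an illustration of the pattern rather than the full solution.

let
    Url = "http://www.sfadb.com/Hugo_Awards_2015",
    // Grab the raw HTML as text
    RawHtml = Text.FromBinary(Web.Contents(Url)),
    // Turn lists into tables and list items into rows/cells
    WithTables = Text.Replace(Text.Replace(RawHtml, "<ul", "<table"), "</ul>", "</table>"),
    WithRows = Text.Replace(Text.Replace(WithTables, "<li", "<tr><td"), "</li>", "</td></tr>"),
    // Hand the modified HTML back to the HTML table parser
    Parsed = Web.Page(Text.ToBinary(WithRows))
in
    Parsed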

How to parse Internet Explorer cookies using Power Query

One valuable approach when dealing with complex authentication schemes is to use the cookies that your browser has already captured.

This post will concentrate on Internet Explorer.  I will do future posts on Chrome and Firefox as they are a bit different.

Under Windows 7, IE stores its cookies as a series of text files in this directory:

C:\Users\<username>\AppData\Roaming\Microsoft\Windows\Cookies\

Windows 8 and Windows 8.1 store the cookies here:

C:\Users\<username>\AppData\Local\Microsoft\Windows\INetCookies

I haven’t installed Windows 10 yet and can’t find a definitive answer on the web regarding where it stores its cookies, although I suspect the file format is identical.

Let’s start by writing a Folder Query and pointing it at that folder.

Power Query->From File->From Folder

[Screenshot: IE Cookies - Query Editor]

There can be other files in this folder, so it is best to filter out any files that do not have an extension of .txt. You can do this by clicking on the Filter button for the Extension column and unselecting anything that isn't .txt. I had some .dat files in my folder.

We only need the Name and Folder Path columns.  Remove the others by selecting Name and Folder Path and using the Remove Other Columns button from the ribbon.

Finally, merge Name and Folder Path into one column using Transform->Merge Columns.

The result should look like this:

[Screenshot: IE Cookies - Query Editor]

Now we need to write a function which will take each cookie file and translate it into a table, ready to be combined into a single spreadsheet.

Let’s pull in one of the files and see what it looks like.

Power Query-> From File-> From Text

Enter the appropriate folder e.g.

C:\Users\<username>\AppData\Roaming\Microsoft\Windows\Cookies\

and select one of the files.

[Screenshot: 0YZW00TD - Query Editor]

 

What we've got is a series of stacked tables with 9 rows each. We need to turn those into a more regular format with columns and rows. There are a number of methods for transposing these in Power Query, but let's use the method that Ken Puls and Miguel Escobar describe in their book "M is for Data Monkey".

There are 5 steps:

  1. Add an index column starting at 0
  2. Add a custom column that returns the modulo of the index field by the number of rows in each table, in our case 9.
  3. Then carefully Pivot by this field, selecting Do Not Aggregate in the advanced options.
  4. Fill upwards
  5. Remove null entries in your first column

I refer you to Ken and Miguel’s book for a detailed explanation of this approach.  However, the code looks like this:
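As a sketch, using a placeholder file path and assuming the file loads as a single text column called Column1, those five steps come out something like this:

let
    // Placeholder path: point this at one of your own cookie files
    RawLines = Lines.FromBinary(File.Contents("C:\Users\<username>\AppData\Roaming\Microsoft\Windows\Cookies\example.txt")),
    Source = Table.FromList(RawLines, Splitter.SplitByNothing(), {"Column1"}),
    // 1. Add an index column starting at 0
    Indexed = Table.AddIndexColumn(Source, "Index", 0, 1),
    // 2. Modulo of the index by the number of rows per record (9)
    Modulo = Table.AddColumn(Indexed, "Modulo", each Number.Mod([Index], 9)),
    ModuloText = Table.TransformColumnTypes(Modulo, {{"Modulo", type text}}),
    // 3. Pivot by the Modulo column with no aggregation
    Pivoted = Table.Pivot(ModuloText, List.Distinct(ModuloText[Modulo]), "Modulo", "Column1"),
    // 4. Fill the value columns upwards so each record lands on one row
    FilledUp = Table.FillUp(Pivoted, {"1", "2", "3", "4", "5", "6", "7", "8"}),
    // 5. Keep only the rows where the first field (column "0") is populated
    Records = Table.SelectRows(FilledUp, each [#"0"] <> null)
in
    Records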

A quick bit of web research explains what each of these fields actually represents:

  1. Cookie name
  2. Cookie value
  3. Issuing Web site
  4. Flags
  5. Expiration Time (most significant)
  6. Expiration Time (least significant)
  7. Creation Time (most significant)
  8. Creation Time (least significant)
  9. Record delimiter

The date/time fields require a bit of work to be useful.

The two fields for Expiration Date need to be combined in order to represent a complete date field in Microsoft's FILETIME format. We can do this with the following formula:

([#"Expiration time (high)"] * Number.Power(2,32)) + [#"Expiration time (low)"]

We multiply the high value by 2^32 and add the low value. This returns a value that represents the full datetime. We can then convert it to a regular datetime using DateTime.FromFileTime.
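For example, as a custom column (where PreviousStep is a placeholder for whatever your preceding step is called):

Table.AddColumn(PreviousStep, "Expiration Date", each DateTime.FromFileTime(([#"Expiration time (high)"] * Number.Power(2,32)) + [#"Expiration time (low)"]))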

We need to repeat this for the cookie’s creation date.

The Flags field is a bitmask for various cookie options. Unfortunately, as far as I can tell, Power Query doesn't provide a simple way of decoding the integer value into its individual flags. I may take a look at this at a future date, but it is not critical for our purpose.

We now have a Query that can parse a cookie file.

If we turn it into a function, we can then use it with our Cookie Folder query to produce a complete spreadsheet of all of our IE cookies.

We do this by adding a Custom column to our Folder Query which calls our Cookie Parse function.

The end result should look something like this:

[Screenshot: IE Cookies - Query Editor]

And here’s the full solution code:

First the qProcessCookie function:
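Here is a sketch of what that function can look like. It wraps the reshape steps from earlier in the post and adds the FILETIME conversions; the parameter name and column labels are my own (I've used Host/path for the issuing web site, and mapped the high/low halves according to the field list above), so adjust them to match your own queries.

let
    qProcessCookie = (filePath as text) as table =>
    let
        // Load the cookie file as one line per row
        RawLines = Lines.FromBinary(File.Contents(filePath)),
        Source = Table.FromList(RawLines, Splitter.SplitByNothing(), {"Column1"}),
        // Reshape the stacked 9-row records into one row per cookie
        Indexed = Table.AddIndexColumn(Source, "Index", 0, 1),
        Modulo = Table.AddColumn(Indexed, "Modulo", each Number.Mod([Index], 9)),
        ModuloText = Table.TransformColumnTypes(Modulo, {{"Modulo", type text}}),
        Pivoted = Table.Pivot(ModuloText, List.Distinct(ModuloText[Modulo]), "Modulo", "Column1"),
        FilledUp = Table.FillUp(Pivoted, {"1", "2", "3", "4", "5", "6", "7", "8"}),
        Records = Table.SelectRows(FilledUp, each [#"0"] <> null),
        // Name the fields as per the list above (swap high/low if your dates look wrong)
        Renamed = Table.RenameColumns(Records, {
            {"0", "Cookie Name"}, {"1", "Cookie Value"}, {"2", "Host/path"}, {"3", "Flags"},
            {"4", "Expiration time (high)"}, {"5", "Expiration time (low)"},
            {"6", "Creation time (high)"}, {"7", "Creation time (low)"}}),
        // Drop the index and the record delimiter column
        Trimmed = Table.RemoveColumns(Renamed, {"Index", "8"}),
        Typed = Table.TransformColumnTypes(Trimmed, {
            {"Expiration time (high)", type number}, {"Expiration time (low)", type number},
            {"Creation time (high)", type number}, {"Creation time (low)", type number}}),
        // Combine the high/low halves and convert from FILETIME
        WithExpiry = Table.AddColumn(Typed, "Expires", each DateTime.FromFileTime(
            ([#"Expiration time (high)"] * Number.Power(2,32)) + [#"Expiration time (low)"])),
        WithCreated = Table.AddColumn(WithExpiry, "Created", each DateTime.FromFileTime(
            ([#"Creation time (high)"] * Number.Power(2,32)) + [#"Creation time (low)"]))
    in
        WithCreated
in
    qProcessCookie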

And then the IE Cookies folder query:
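And a matching sketch of the folder query. The path is the Windows 7 location from earlier (adjust it for your system), and it assumes the function above has been saved as qProcessCookie:

let
    Source = Folder.Files("C:\Users\<username>\AppData\Roaming\Microsoft\Windows\Cookies"),
    // Keep only the cookie text files
    TxtOnly = Table.SelectRows(Source, each [Extension] = ".txt"),
    KeepColumns = Table.SelectColumns(TxtOnly, {"Folder Path", "Name"}),
    // Merge into a full path (Folder Path already ends with a backslash)
    FullPath = Table.CombineColumns(KeepColumns, {"Folder Path", "Name"},
        Combiner.CombineTextByDelimiter("", QuoteStyle.None), "Merged"),
    // Parse each file with the qProcessCookie function
    Parsed = Table.AddColumn(FullPath, "Cookies", each qProcessCookie([Merged])),
    // Expand the parsed cookie tables into one combined table
    Expanded = Table.ExpandTableColumn(Parsed, "Cookies",
        {"Cookie Name", "Cookie Value", "Host/path", "Flags", "Expires", "Created"},
        {"Cookie Name", "Cookie Value", "Host/path", "Flags", "Expires", "Created"})
in
    Expanded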

In my next post, I’ll describe how you can load your cookies into a query that you send to a web page in Power Query.

This is going to allow us to download data from web sites that have more complex authentication requirements than Power Query can currently handle.

Any comments or questions are welcome.

Power Query Custom Web Proxies

Power Query is a very useful tool for quickly manipulating data from a variety of different (and growing) sources.

Unfortunately, it is currently not very good at dealing with non-tabular Web content or Web content that requires more than a basic level of authentication or cookies.

It is, however, possible to get around these restrictions by writing your own Custom Web Proxy in a language such as Python.

Essentially, you write a small web server in Python that acts as the middleman between Power Query and the web site you would like to pull data from. The web server does the heavy lifting: it accesses the web site, handles the authentication, possibly performs some data manipulation and passes the results back to Power Query.
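From the Power Query side, the call then becomes an ordinary web query against your proxy; for example (the address, port and path are placeholders, and this assumes the proxy returns JSON):

Source = Json.Document(Web.Contents("http://localhost:8080/mydata"))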

The advantage of this approach is that you can leverage existing Python libraries that are specifically designed for this kind of work including requests, BeautifulSoup and pandas.

Over the next few weeks, I'll be writing a series of posts describing how to use Power Query to access different kinds of complex web content, using custom Web proxies and other techniques.

If there are any particular web sites or content types that you’d like me to tackle, feel free to add a comment to this post.