Handling non tabular HTML content in Power Query – parsing Science Fiction Awards Database

Power Query is poor at extracting information from web pages that host non-tabular data.

I’ve been experimenting with a few techniques for addressing this.

For a bit of fun, we’re going to look at information held in the Science Fiction Awards Database

This is a wonderful resources that records the nominees and winners for every major science fiction and fantasy award.

The Hugos are arguably the most famous of all. You can see the results for 2015 here.

The format is great to read, but not if you want to pull them into a spreadsheet for further analysis (e.g. alignment to GoodReads rating).

2016-01-17 10_29_52-sfadb_ Hugo Awards 2015 - Internet Explorer

This doesn’t look like a table and Power Query won’t like it.

If we create a Web query using Power Query and point it at this page we’ll get an ugly HTML document.

Looking at the query in the Advanced Editor we see this:

Source = Web.Page(Web.Contents("http://www.sfadb.com/Hugo_Awards_2015"))

To make it useful, let’s view it as text instead.

Change the code in the advanced editor to be this instead (note I’ve stripped out the Web.Page).

Once you refresh, a little icon will appear with the web page.

Right click on this and select Text.

2016-01-17 10_39_18-Document - Query Editor

You should see the whole HTML source for the web page without any attempt by Power Query to parse the HTML.

If you scroll right down, you’ll find the sections with each of the winners and nominees:

2016-01-17 10_41_16-Document - Query Editor

Definitely not tabular form 🙁

So let’s modify the HTML into tables and then send it back for HTML table parsing.

Here’s the code:

In the first section, we locate each of our award categories and add a new column which contains an HTML table header with the award category as an id.

Then we change all of our list items into table entries instead.

Finally, we convert our new HTML back into a binary file and pass it back to Web.Page. This time Power Query will find some tables and the result:

2016-01-17 11_16_37-Awards (2) - Query Editor

There’s a lot more we could now do to improve the format, such as split the title/author/publisher. However, the intent here is simply to show the general approach of manipulating the HTML in text form and then sending back to the HTML parser.