Power Query, Proxies – Parsing tables from Microsoft Word documents

Our proposal letters are Microsoft Word documents which include embedded pricing tables.  It is useful for me to have this information in Excel.

I’m going to explore an approach for pulling tables from Word documents into Excel using the small Python web server used in previous posts and Power Query.

As per previous post, you need to have Python (both 2.x and 3.x should work) installed.  This time, we’ll be using the python-docx library to do the heavy lifting for us.  Of course, will also need the bottle web framework.

I won’t cover the installation of these here, but happy to assist if you run into trouble.

Here’s the code for the Python web server:

This Python code will run a small web server that listens on port 8080 and will respond to requests sent to http://localhost:8080/extractWordTables

It is expecting to be sent a query string with a Word document’s filename e.g. http://localhost:8080/extractWordTables?filename=myworddocument.docx

Once it has a Word document it attempts to parse it and look for tables.  To do this it uses Steve Canny’s fantastic python-docx library.

In using it, the process steps through the tables, row by row, cell by cell and converts the data into an HTML document containing one or more tables.

To use this from Power Query you need to execute a Power Query -> From Web query.  Then enter something like:

http://localhost:8080/extractWordTables?filename=WordDocWithTables.docx

You may need to tell it to convert the content to a Web Page, but you should end up with M code looking something like this:

You can then select the table that you want to work further on.

I have adapted this approach to work with multiple files, so you could essentially bulk parse a set of Word documents.  However, it relies on the documents containing tables with the same number of columns etc.  I’ve also added additional features to the Web server to allow me to filter what tables are returned based on the column headings.

Can provide further detail if there is interest.

Finally, I’m exploring other options for doing all of this directly in Power Query, but I do not have a working solution just yet.

Custom web proxies and Power Query – decrypting Google Chrome cookies

It turns out that decrypting Google Chrome cookies is a non-trivial process, so I thought I’d use it as an introduction to the use of custom web proxies.

There are functions available in the .NET framework that allow you to decrypt Google Chrome cookies, however, they are not directly accessible via Power Query (as far as I can tell) and even with VBA you would need to write a COM wrapper.

Instead, I decided to write a very short script in Python which would decrypt the cookies and send the result back to Excel/Power Query.  How to do that?  Well, by calling the running Python script from a Power Query Web contents query.

The solution below has two parts, a Python web server and Power Query.  Power Query imports the Google cookies database.  It then executes a Web query to contact the Python web server and send it a list of encrypted cookies.  The Python script receives the encrypted cookies, decrypts them and send back the result.  Power Query takes the decrypted cookies and combines it with the other data it has for the cookies.

To make this work, you need to have Python (both 2.x and 3.x should work) installed.  You’ll also need the Python extensions for Windows (for the win32crypt library) and the bottle web framework.

I won’t cover the installation of these here, but happy to assist if you run into trouble.

So basically, this short script starts a tiny local web server.

Assuming you run it from your desktop PC it will respond to requests sent to http://localhost:8080/decryptCookie

And if we send it a list of encrypted cookies, it will decrypt them using Microsoft’s libraries and send back the decrypted cookies.

For the purposes of this exercise, we will manually launch it, but we could trigger it in a more sophisticated fashion directly from Excel.

The M code on the Power Query side of things, builds on the post from yesterday which showed how to access the Google Chrome cookie SQLite database.

I did run into a strange issue where the query I put together yesterday was truncating the encrypted cookie field from the cookies database.   For some reason, this was corrected by adding a SQL query – select * from cookies rather than simply connecting to the table.  I would have expected this to produce identical results but it didn’t.

And here is the annotated M query:

Happy to walkthrough this in more detail if necessary or respond to questions.

This data can now be used much like the “harvested” Internet Explorer cookies from other posts.

Bear in mind, that there are countless other uses for custom Web proxies with Power Query. For example, you can use them to perform more complex web authentication, web scraping with BeautifulSoup (no more crazy HTML/XML guesswork), data manipulation using Python libraries such as pandas, execute other applications on your computer,  etc. etc.

Of course, on the flipside, you need to have Python installed and have the Python script running before it will work.  On your home PC, this probably doesn’t matter.  In a corporate environment, you would want this to be hosted on a proper web server.