Extracting tables from Microsoft Word documents using just Power Query

A few weeks ago, I wrote a post demonstrating how to extract tables from Word documents using a combination of Power Query and a Python web server.

Today I want to revisit that solution and show how to do the same thing using only Power Query.

To achieve this we are going to leverage the fact that Microsoft Word .docx files are actually ZIP files containing a group of XML files.  We will decompress the ZIP file and parse the XML to pull information into Power Query.

We will take a simple Word document containing this table:

2016-02-28 12_00_17-example.docx - WordAnd import it into Power Query: 2016-02-28 12_00_36-Extract Word Table - Query Editor

Here are the high level steps:

  1. Extract the document.xml file from the Word document using the DecompressFiles function I provided in a previous post.
  2. Replace the Word table XML tags with special tags.  Note I found that some table row tags had attributes and I had to write a function ReplaceTag to locate the closing “>”.
  3. Remove all other XML tags.  I have borrowed a solution described by Bill Szysz (thank you!)
  4. Replace the special table tags we added with standard HTML table tags
  5. Use List.Accumulate to turn it back into a single text string (this helps with table cells that are split due to paragraphs etc.
  6. And finally tell Power Query to parse our output as if it was a Web page

 

Here is the commented code:

I also call my DecompressFiles function which I have as a separate query:

To make it easier, here is a copy of the workbook.

If you have any difficulty with this, let me know and I can walk through in more detail or clarify any steps.

I’ve tried this on a few different Word documents, but your mileage may vary.  I am using it to pull information out of pricing tables in our proposal letters.

If your intention is to parse tables from a lot of Word documents I would recommend the Python web server approach as you will get much better performance.

You can find more information about the XML schema used in Word documents here.

How to extract data from a ZIP file using just Power Query

Power Query doesn’t have inbuilt support for ZIP files.  However, it does have the ability to decompress deflated data held in gzip files.  Chris Webb wrote an article about this a couple of months back but not about .zip files.

In this post, I’m going to describe a method for extracting content from a standard .zip file.

A few constraints to be aware of upfront  – this will only work for ZIP files, not RAR, ZIPX, 7-Zip, bzip2, etc.

It does work with every ZIP file that I created using WinZip and WinRAR, including “Super Fast”, “Enhanced Deflate” etc.

It doesn’t work with “ZIP: No Compression” (although it would only take a couple of extra lines to accommodate).

In my experience, it works with most ZIP files that I have encountered.

The Wikipedia article on the ZIP file format was invaluable in putting this solution together.

At a high level, a ZIP file is comprised of:

File #1 Header
File #1 Compressed Data
File #2 Header
File #2 Compressed Data
Central Directory

In other words, the content is at the beginning and the “directory” for that content is at the end.

I elected to ignore the Directory and parse the content directly.

And here is my M code (sorry, you can’t do this via the GUI):

So how does it work?

The function DecompressFiles takes four parameters:

ZIPfile : the ZIP file contents
Position : byte offset within the file, this is initially zero
FileToExtract : you can optionally pass in a specific file to find within the ZIP file – passing a blank string will return all files
DataSoFar : if you are decompressing multiple files from the ZIP file, then this holds previous decompressed files – it is initially blank

There are two key steps in the code:

Firstly, it reads the zip file content and parses some key values from a file header – namely, the length of the compressed file (i.e. how many bytes of compressed data are there), the length of the filename, and the length of any extra values. Everything else has a fixed length and so isn’t important to us.

Once we have extracted these values we know exactly where the file’s compressed data is located and how long it is.

With this information, we parse the zip file again, this time with full knowledge of where the compressed data is.  This allows us to extract it out into a single field – Data.

Then it’s a simple process to use Binary.Decompress to get the decompressed content.

You should end up output looking like this:

2016-02-14 11_42_27-Decompress CSV - Query Editor

Some other things to note about the code:

  1. I’ve wrapped most of the lines in try … otherwise… blocks.  This is because I have not attempted to determine when I reach the end of the files and hit the Central Directory.  That’s one my “to do” list.
  2. The function is recursive (yes, I know that isn’t preferred).  I need to take a look at how to change that to List.Generate or similar.  I haven’t run into any performance issues however.
  3. I maintain the files as a Table of binaries to allow me both to show the filename and in case the content type varies e.g. some .csv, some .xml etc.
  4. Reading .zip files in this way is not the officially accepted approach.  You are supposed to read the Central Directory and find the files that way.  This is because the .zip file may, in theory, contain deleted files or earlier versions, which you won’t stumble across if you only look for files found in the Central Directory.  I didn’t see this as a big risk, but again it is on my “to do” list.

I’m sure there are cleverer ways of writing this code, so any feedback is welcome.

Here’s an Excel workbook containing the query: DecompressZIP

You will need to update

Source = File.Contents("C:\CompressedData.zip"),

to point to your compressed data for it to work.

In my next post, I’ll show how to use this approach to extract data previously inaccessible to Power Query such as hyperlinks and comments from an Excel .xlsx file (which is, of course, a set of zipped .xml files).

Power Query, Proxies – Parsing tables from Microsoft Word documents

Our proposal letters are Microsoft Word documents which include embedded pricing tables.  It is useful for me to have this information in Excel.

I’m going to explore an approach for pulling tables from Word documents into Excel using the small Python web server used in previous posts and Power Query.

As per previous post, you need to have Python (both 2.x and 3.x should work) installed.  This time, we’ll be using the python-docx library to do the heavy lifting for us.  Of course, will also need the bottle web framework.

I won’t cover the installation of these here, but happy to assist if you run into trouble.

Here’s the code for the Python web server:

This Python code will run a small web server that listens on port 8080 and will respond to requests sent to http://localhost:8080/extractWordTables

It is expecting to be sent a query string with a Word document’s filename e.g. http://localhost:8080/extractWordTables?filename=myworddocument.docx

Once it has a Word document it attempts to parse it and look for tables.  To do this it uses Steve Canny’s fantastic python-docx library.

In using it, the process steps through the tables, row by row, cell by cell and converts the data into an HTML document containing one or more tables.

To use this from Power Query you need to execute a Power Query -> From Web query.  Then enter something like:

http://localhost:8080/extractWordTables?filename=WordDocWithTables.docx

You may need to tell it to convert the content to a Web Page, but you should end up with M code looking something like this:

You can then select the table that you want to work further on.

I have adapted this approach to work with multiple files, so you could essentially bulk parse a set of Word documents.  However, it relies on the documents containing tables with the same number of columns etc.  I’ve also added additional features to the Web server to allow me to filter what tables are returned based on the column headings.

Can provide further detail if there is interest.

Finally, I’m exploring other options for doing all of this directly in Power Query, but I do not have a working solution just yet.

Custom web proxies and Power Query – decrypting Google Chrome cookies

It turns out that decrypting Google Chrome cookies is a non-trivial process, so I thought I’d use it as an introduction to the use of custom web proxies.

There are functions available in the .NET framework that allow you to decrypt Google Chrome cookies, however, they are not directly accessible via Power Query (as far as I can tell) and even with VBA you would need to write a COM wrapper.

Instead, I decided to write a very short script in Python which would decrypt the cookies and send the result back to Excel/Power Query.  How to do that?  Well, by calling the running Python script from a Power Query Web contents query.

The solution below has two parts, a Python web server and Power Query.  Power Query imports the Google cookies database.  It then executes a Web query to contact the Python web server and send it a list of encrypted cookies.  The Python script receives the encrypted cookies, decrypts them and send back the result.  Power Query takes the decrypted cookies and combines it with the other data it has for the cookies.

To make this work, you need to have Python (both 2.x and 3.x should work) installed.  You’ll also need the Python extensions for Windows (for the win32crypt library) and the bottle web framework.

I won’t cover the installation of these here, but happy to assist if you run into trouble.

So basically, this short script starts a tiny local web server.

Assuming you run it from your desktop PC it will respond to requests sent to http://localhost:8080/decryptCookie

And if we send it a list of encrypted cookies, it will decrypt them using Microsoft’s libraries and send back the decrypted cookies.

For the purposes of this exercise, we will manually launch it, but we could trigger it in a more sophisticated fashion directly from Excel.

The M code on the Power Query side of things, builds on the post from yesterday which showed how to access the Google Chrome cookie SQLite database.

I did run into a strange issue where the query I put together yesterday was truncating the encrypted cookie field from the cookies database.   For some reason, this was corrected by adding a SQL query – select * from cookies rather than simply connecting to the table.  I would have expected this to produce identical results but it didn’t.

And here is the annotated M query:

Happy to walkthrough this in more detail if necessary or respond to questions.

This data can now be used much like the “harvested” Internet Explorer cookies from other posts.

Bear in mind, that there are countless other uses for custom Web proxies with Power Query. For example, you can use them to perform more complex web authentication, web scraping with BeautifulSoup (no more crazy HTML/XML guesswork), data manipulation using Python libraries such as pandas, execute other applications on your computer,  etc. etc.

Of course, on the flipside, you need to have Python installed and have the Python script running before it will work.  On your home PC, this probably doesn’t matter.  In a corporate environment, you would want this to be hosted on a proper web server.

How to access an SQLite database from Power Query

SQLite is an increasingly common database format for both mobile applications and small desktop applications.  Many applications on iOS/Android use them and Google/Mozilla use them for their browsers.

Reading an SQLite database using Power Query isn’t difficult but does require a few initial steps.

To begin with, you need to install an ODBC SQLite3 driver.

I recommend using the one located here:

http://www.ch-werner.de/sqliteodbc/

 

2016-02-01 11_47_35-SQLite ODBC Driver - Internet Explorer

Make sure that you install the correct version – either 32 bit or 64 bit.

It would also be a good idea to close Excel and re-open before going any further.

Now we create an ODBC query, by selecting From Other Sources->From ODBC:

2016-02-01 11_37_38-chrome cookies.xlsx - Excel

The next step is to specify a connection string.

Here’s an example that will read the Google Chrome Cookies database (assuming you have Chrome installed and specify your username):

DRIVER=SQLite3 ODBC Driver;Database=C:\Users\<your username>\AppData\Local\Google\Chrome\User Data\Default\Cookies;LongNames=0;Timeout=1000;NoTXN=0;

You will then get asked what credentials to use.  In our case, we just want the default of no credentials:

2016-02-01 11_43_28-

You will then see a dialog box asking you to select a table.  For our example, select the “cookies” table

2016-02-01 11_56_42-

and – hopefully – you will see a table containing all the cookies from Google Chrome:2016-02-01 12_14_49-cookies - Query Editor

You will note that most of them are encrypted and therefore not immediately available for us to use.

That’s a problem for another day :)

Although this example used the Google Chrome Cookies sqlite3 database, this approach should work with minimal change for most sqlite3 databases.