Extracting tables from Microsoft Word documents using just Power Query

A few weeks ago, I wrote a post demonstrating how to extract tables from Word documents using a combination of Power Query and a Python web server.

Today I want to revisit that solution and show how to do the same thing using only Power Query.

To achieve this, we are going to rely on the fact that Microsoft Word .docx files are actually ZIP archives containing a group of XML files. We will decompress the archive and parse the XML to pull the table into Power Query.
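To make the "a .docx is a ZIP of XML files" point concrete outside Power Query, here is a minimal Python sketch. It is an illustration only, not the query from this post: it builds a tiny stand-in archive in memory (the helper names `make_sample_docx` and `read_document_xml` and the sample XML are assumptions made up for the example) and then reads `word/document.xml` back out, which is the part that holds the document body.

```python
import io
import zipfile

# Made-up, minimal WordprocessingML body with a single one-row table.
# A real .docx contains more parts (styles, relationships, content types).
DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:tbl>"
    "<w:tr><w:tc><w:p><w:r><w:t>Item</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Price</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl></w:body></w:document>"
)


def make_sample_docx() -> bytes:
    """Build an in-memory ZIP that mimics the layout of a .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def read_document_xml(docx_bytes: bytes) -> str:
    """Open the bytes as a ZIP archive and return word/document.xml as text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


sample = make_sample_docx()
print(zipfile.is_zipfile(io.BytesIO(sample)))  # the "document" is a ZIP archive
print("<w:tbl>" in read_document_xml(sample))  # and the table lives in the XML
```

To try it on a real file, pass the bytes of any .docx to `read_document_xml` in place of the generated sample.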

We will take a simple Word document containing this table:

[Screenshot: example.docx open in Word]

And import it into Power Query:

[Screenshot: Extract Word Table in the Query Editor]

Here are the high-level steps:

  1. Extract the document.xml file from the Word document using the DecompressFiles function I provided in a previous post.
  2. Replace the Word table XML tags with special tags. Some table row tags had attributes, so I had to write a function, ReplaceTag, to locate the closing “>”.
  3. Remove all other XML tags. I have borrowed a solution described by Bill Szysz (thank you!).
  4. Replace the special table tags we added with standard HTML table tags.
  5. Use List.Accumulate to turn it back into a single text string (this helps with table cells that are split due to paragraphs, etc.).
  6. Finally, tell Power Query to parse our output as if it were a web page.
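Steps 2 to 6 can be sketched in Python to show the idea end to end. This is not the Power Query code from this post: the `||TABLE||`-style marker tokens, the function names, and the sample XML are all assumptions for illustration, and the tag stripping uses a plain regular expression rather than the borrowed technique mentioned in step 3. Because Python keeps the text as one string throughout, there is no separate recombining step like step 5.

```python
import re
from html.parser import HTMLParser

# Hypothetical marker tokens; they only need to survive the tag-stripping step.
MARKERS = {"tbl": "TABLE", "tr": "ROW", "tc": "CELL"}
HTML_TAGS = {"TABLE": "table", "ROW": "tr", "CELL": "td"}

# Made-up sample: a row tag with an attribute and a cell split over two paragraphs.
SAMPLE_XML = (
    '<w:tbl><w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>'
    '<w:tr w:rsidR="00AB12CD"><w:tc><w:p><w:r><w:t>Item</w:t></w:r></w:p></w:tc>'
    "<w:tc><w:p><w:r><w:t>Price</w:t></w:r></w:p></w:tc></w:tr>"
    "<w:tr><w:tc><w:p><w:r><w:t>Widget</w:t></w:r></w:p>"
    "<w:p><w:r><w:t> (large)</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>10</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl>"
)


def replace_tag(xml: str, word_tag: str, marker: str) -> str:
    """Swap <w:tag ...> and </w:tag> for ||MARKER|| and ||/MARKER||."""
    # The opening tag may carry attributes, so match up to the closing ">".
    # Requiring ">" or whitespace right after the name avoids matching <w:trPr>.
    xml = re.sub(rf"<w:{word_tag}(\s[^>]*)?>", f"||{marker}||", xml)
    return xml.replace(f"</w:{word_tag}>", f"||/{marker}||")


def word_xml_to_html(xml: str) -> str:
    for word_tag, marker in MARKERS.items():  # step 2: mark the table tags
        xml = replace_tag(xml, word_tag, marker)
    text = re.sub(r"<[^>]+>", "", xml)  # step 3: drop every other tag
    for marker, html_tag in HTML_TAGS.items():  # step 4: markers -> HTML tags
        text = text.replace(f"||{marker}||", f"<{html_tag}>")
        text = text.replace(f"||/{marker}||", f"</{html_tag}>")
    return text


class TableRows(HTMLParser):
    """Collect the text of each <td>, grouped by <tr> (step 6)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html: str) -> list:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


print(parse_rows(word_xml_to_html(SAMPLE_XML)))
```

In the sample, the second row's first cell is split across two paragraphs in the XML; once the paragraph tags are stripped, the two pieces end up in one cell, which is the same effect step 5 is after.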

Here is the commented code:

I also call my DecompressFiles function, which I keep as a separate query:

To make it easier, here is a copy of the workbook.

If you have any difficulty with this, let me know and I can walk through in more detail or clarify any steps.

I’ve tried this on a few different Word documents, but your mileage may vary.  I am using it to pull information out of pricing tables in our proposal letters.

If you intend to parse tables from a lot of Word documents, I would recommend the Python web server approach, as you will get much better performance.

You can find more information about the XML schema used in Word documents here.

3 thoughts on “Extracting tables from Microsoft Word documents using just Power Query”

  1. Hi Ken,
    Thank you for your string solution.
    I have a Word document with carriage returns in the table. When I import the Word document, they disappear.
    I tried to add some code, but it did not work.


    ReplaceRowEndTag2 = Table.ReplaceValue(ReplaceRowEndTag,"","||ROW SOFTEND||",Replacer.ReplaceText,{"Column1"}),

    aReplaceStr4 = Table.ReplaceValue(aReplaceStr3,"||ROW SOFTEND||","",Replacer.ReplaceText,{"ID"}),

    Can you help me?
    Thanks a lot
