How to extract embedded Excel worksheets from PowerPoint using just Power Query

I’ve been using my solution to extract tables from Powerpoint a lot recently.  However, I’ve come across some situations where what looks like a table is actually an embedded Excel worksheet.

This can be quite useful in Powerpoint as it provides the ability to create subtotals and apply other formulae.

You can add these sort of tables by Insert->Table->Excel Spreadsheet or Insert->Object->Microsoft Excel Worksheet or if you Copy and Paste->Embed.

Of course, it also has the side effect of completely breaking my approach for extracting the data using Power Query.

Initially I thought I was going to have to find the reference in the XML, track it through the relationship file and then to the actual embedded workbook.  However, I quickly realised that there wasn’t much point.

The embedded worksheets always end up in the same folder within the zip file – ppt\embeddings –  so we can just grab them from there.

The code then becomes pretty simple.

  1. Unzip the .pptx file using the UnzipContents function from previous postings
  2. Filter for just files within the embedded folder
  3. Tell PowerQuery to treat the content as an Excel Workbook.

Unfortunately, it didn’t work.

It turns out that, for reasons unclear to me, Powerpoint doesn’t compress the content of embedded Excel worksheets.  This means that when the UnzipContents function tries to decompress them it fails and produces null instead.  I’ve noticed it seems to apply the same approach to embedded images such as JPGs.

There was a quick fix to this though.  I made a slight modification to the UnzipContents function and told it to return the raw binary data instead of null when it fails to decompress.


And voila, PowerQuery was able to open it as an Excel workbook, confirming that the content wasn’t compressed.

The code then becomes:


And you can call it from a Query like this:

This should yield a result like this:

2016-08-06 14_36_51-Query1 (2) - Query Editor

From here you can select the table you are interested in and drilldown the sheet:

2016-08-06 14_40_33-Query1 (2) - Query Editor

Here’s a sample and a workbook with a working example:


Extract Powerpoint embedded v2

How to extract tables from Powerpoint using just Power Query

The organization that I work for produces a lot of Powerpoint presentations.  Sometimes they can be the sole source of data for certain workflows.

2016-06-13 12_30_15-Sample.pptx - PowerPoint

So, I needed to develop a solution to extract the data from a Powerpoint presentation.

In order to achieve this using Power Query, I leveraged my previous solution for extracting tables from Microsoft Word documents as the internal formats are very similar.  If you refer to that post, you will find some additional information regarding how the code below works.

There are a few key things to note:

    1. I’ve adopted Mark White’s UnzipContents function in preference to my own.  His is an improved solution.  Note, I’ve made some minor changes to it by adding some rough comments and adding a facility to extract a specified file from a zip.
    2. Powerpoint slide XML uses <a:tbl> whereas Word document XML uses <w:tbl>
    3. I’ve applied PromoteHeaders and RemoveColumns to unexpanded tables, not strictly necessary but an approach I have had to use in other situations.
    4. In order to get at the actual table in the output, you will need to click on one of the tables located.

To use this function you will need to write a Query like this:

You can expect an output like this:

2016-06-13 12_25_49-Query1 - Query Editor

And then if you click on one of the tables within the Clean Table column, an output such as this:

2016-06-13 12_27_36-Query1 - Query Editor

You will need to create a Query called ExtractTablesFromPowerpoint with the following content:

You will also need a Query called UnzipContents, following is my slightly modified and roughly commented version of Mark’s code (any errors or misrepresentation are mine):

Let me know if you run into any issues or have any questions about the approach.

Any feedback is very welcome.

Here is a workbook and a sample file to make applying this easier:

Extract Powerpoint Table v3


How to extract Queries from a workbook using just Power Query

Had success today in developing a way of extracting Power Query queries from a workbook using just Power Query:

2016-05-29 12_14_24-Extract Queries v4.xlsx - Excel
(All the query text is in there, but Excel is limited to a row height of 409 which isn’t quite enough here)

I’ve been meaning to look into how query text is stored in Excel workbooks for a while.  I had assumed that they were found somewhere in the data model.

However, I was able to determine that they were actually stored, albeit in a rather convoluted form, in a connections.xml file within the Excel workbook .xlsx file.

Grab any Excel workbook containing Power Query queries and rename it to something .zip.

Then locate the connections.xml file in the xl folder.

If you have a look at it you’ll see what mostly looks like binary data.  But it’s in a particular form that I suspected was Base64.

So I converted the Base64 into binary to see what it was.  Turned out to be a zip file.  So I decompressed that, and hidden away inside was a text file called Section1.m containing all the queries in the workbook!

So – to recap, the text for your queries is stored in a text file in a zip file that has been converted to Base64 and placed inside an XML file that has then been turned into a zip file.  Or something like that.

And here is the code – all in Power Query – that can pull out all the queries from a specified workbook into an Excel table: (Yes – it needs a bit of a clean up!! – but gives you an insight into how I approach things)

Note – I kept running into a weird bug where it would mess up the line containing #”Replaced Value5″ whenever I opened it up in the Advanced Editor.

It would replace it, incorrectly, with:

and then helpfully report an error!

I had to keep replacing it afresh with:

This shouldn’t be an issue if you don’t open up this function in the Advanced Editor.

I also used a version of my DecompressFiles function which I hacked a bit to return the data as text rather than XML.  Sorry, it needs a bit of tidy up too but it did the job:

And to call the ExtractQueries function I had another little query like this:

Now, this should work in most cases, but I haven’t tested heavily.

To make it easier, I’ve uploaded a workbook below pre-loaded with the functions and queries and also the sample file they use.

Let me know how you go!

Workbook with function and queries: Extract Queries v5

Sample file it successfully queries: Compare-Tables

Power Query – Converting a hexadecimal number to decimal

There was an excellent blog post by Greg Deckler a few weeks back entitled Using Recursion to Solve Hex to Decimal Conversion. I thoroughly recommend reading to get some additional context.

However, I thought I’d try an alternative approach to converting hex to decimal that does not use recursion.

Converting a hexadecimal value into decimal in Power Query is actually pretty straightforward, for certain test cases.

For example, you can write:

And this will return the correct answer of 32154.

Which is great – but of course, it will only work for hex value comprising 4 characters.

So I’ve produced a more generalised solution:

Let me know if you would like a walkthrough of exactly how this works.

How to use Youtube API from Power Query – Part 3a (Mini Post)

Unfortunately, I’m not able to produce a full blog post today.  However, I’ve had a number of people requesting information regarding how to get statistics for multiple Youtube videos at once.

The basic principle is to send a list of video ids to the Youtube API separated by commas.

Below I have produced some “M” code that combines the output of my two previous blog posts.  That is, it –

1) Looks for the top 50 most popular videos
2) Produces a list of the video ids for these videos
3) Requests the statistics for these videos
4) Combines the original table with the statistics

I will do a full detailed walkthrough of how to produce this code in a future post.  But, if you’re feeling adventurous, create a Blank Query and copy and paste the below into the Advanced Editor.  You will need to copy your API key into two locations in the code.

With any luck, you should end up with output looking similar to this:

2016-05-15 12_29_03-youtube v2.xlsx - Excel

Let me know how you go. And if you have any issues don’t hesitate to ask.

I’ll try to get a full blog post up on this as soon as I can.

How to use Youtube API from Power Query – Part 2

This post follows on from my previous post regarding the use of the Youtube API from Power Query.

You will need to follow the steps from the previous post to ensure that you have a browser API key to the YouTube Data API v3.

In this post, we’ll be exploring how to get the number of views, likes, dislikes and comments for a video.

To achieve this we will use a videos query rather than a search query.

I’ve picked a video at random:

Its video id can be taken from the URL and I have highlighted it in bold below:

You can easily get this value for any video on YouTube.

Now, we can get into Excel and Power Query.

Start with a Blank Query and paste the following into the Advanced Editor:

Expected result:
2016-05-01 11_00_08-Query3 - Query Editor

This query requests the statistics for the Duck Song video (which has the id of MtN1YnoL46Q).

You can review the Youtube API documentation for more detailed information.

You could actually enter this directly into your browser if you wanted to. It is equivalent to:<your api key here>&id=MtN1YnoL46Q&part=id,statistics

Now, if you click on the items ListPower Query will return a list of the results as records.

2016-05-01 11_06_35-Query3 - Query Editor

They will be easier to work with as a Table so next we’ll convert to a Table.

2016-01-24 11_42_07-Youtube Search List Merge - Query Editor

Returning this:

2016-05-01 11_10_13-Query3 - Query Editor

If you then expand out the column, you can start to see all of our data:

2016-05-01 11_11_05-Query3 - Query EditorThen expand the statistics column and here we go – all the information we wanted regarding views, likes, dislikes and comments:

2016-05-01 11_12_12-Query3 - Query Editor

So at time of writing, this video has been viewed by 212 million people! 731,916 liked it, 95,318 disliked it and it has 197,897 comments.

Here’s all the code:

It is relatively easily to request statistics for multiple videos/channels at once. Can cover that in a future post if there is interest.

Please let me know if you encounter any issues or need further clarification.


Find the data sheet in a workbook using Power Query

A lot of the reports that I receive come in a workbook to which sheets have been added and the raw data will be on a randomly named sheet (“Extract”, “Sheet2”, “Raw Data” etc.) somewhere in the workbook.

This can make it difficult to have a repeatable process to import the data into a Data Model.

To address this I wrote a script that produces a summary of the workbook.  It lists all the sheets and indicates the number of rows and columns.  Here is the output:

2016-03-28 11_15_40-Workbook Profile - Query Editor

And here’s the script:


The key trick is to add custom columns which call Table.RowCount() and Table.ColumnCount() on the Data column (which happens to be a table).

In my case, I realized that the sheet with the most rows (although a close call in this example!) was always going to the sheet that I needed, so I wrote a quick function in Power Query to find and return the data from that sheet.

The function returns the data from the sheet with the most rows – which is likely to be the one with the raw data.

Here is GetSheetWithMostRows():

This allows me to automatically find and import the raw data from the report regardless of what the sheet is called or how many other sheets may have been added to the workbook.

Compare two tables or reports using Power Query

25/08/16 – updated to correct minor errors that may stopped this solution from working on your reports 

I receive a lot of reports on a daily, weekly or monthly basis.  Usually the reports are little changed from the previous version but there is no easy way to work out just what has changed.  I suspect that this is a pretty common issue so I’ve written a general solution in Power Query.

This function accepts two tables/reports as input and provides a report of just what has changed.

The functions should work on practically any report and could easily be adapted to compare sheets in separate workbooks, but to demonstrate, I’m going to compare this simple table:

2016-03-20 11_12_33-Compare Tables v12.xlsx - Excel

with this table:

2016-03-20 11_13_00-Compare Tables v12.xlsx - Excel

and produce this output:

2016-03-20 11_13_15-Compare Tables v12.xlsx - Excel

In comparing two tables, there are a 5 kinds of variances we need to look for:

  1. Added rows – rows in second table but not in the first (Chris Masters in our example)
  2. Removed rows – rows in first table but not in the second (eg. Penny Bradley)
  3. Added columns – columns in second table not in the first (eg. Status column)
  4. Removed columns – columns in first table not in the second (eg. Loyalty Rating column)
  5. Changed rows – rows that are in both tables but with one of more changed values

The first 4 variances are fairly easy to detect.

For rows, you can use Table.NestedJoin() to JOIN the tables and, using a joinKind of LeftAnti, return rows that either only exist in one table or only exist in the other.

For columns, you can using the List functions to compare a list of Table.ColumnNames from each table and find those which are unique to one or the other.

Detecting changed fields within rows is a bit more complicated…

The approach I took was to merge the two tables so that each row contained both the old data for the row and the new data for the row.  Then I converted the rows to lists.

I then compared the old value for each field with the new value for each field and used Power Query’s Swiss army knife List.Generate() (which I described in a blog post here) to produce a list of all the changes.  I haven’t commented this in detail, so you may wish to refer to the blog post to get a better sense of what I am doing.

In order to be able to compare the two tables, they need to have a common unique column or primary key.  If this is not the case, you may need to create one by combining two or more columns.  In some cases, it may be OK to use a numeric index field.

The function also accepts a DontCompare parameter. This is a list of fields that you want to exclude from the comparison.  This may include fields that you are not interested in or that always change e.g. Report Date.   In this example I have excluded the Last Updated field.

There is also an AlwaysInclude parameter. This is a list of up to 4 fields that you want to always include in the output.  This is to allow you to see more easily what rows have changed.  In this example, I included both the Customer Id and the Customer’s Last Name.

Here is the commented code for the two functions (the parent function CompareTables that accepts two tables and the child function CompareRows that compares fields in a single row):



And here’s an example of how to use:

Please feel free to ask questions or make comments below.

To make this easier to follow, please find a copy of the example workbook here.  Example workbook updated 25/8/16

Compare-Tables – fixed

Fun with List.Generate() – Tips and producing a Date Range


There have been a couple of blog posts about List.Generate()one by Chris Webb and another recently by Andrew Todd on the PowerPivotPro blog.

I recommend both of these articles as starting points for exploring List.Generate().  This blog post is not an introductory one.

List.Generate() makes lists.  It can make a straight list of scalar values, or a list of records, or a list of tables, or a list of records that contain lists etc.

The list that is returned is defined by the algorithm that you enter.  That is, you can, with a bit of work get it to produce pretty much any list that you like.

To achieve this, List.Generate iteratively produces records (well, they don’t have to be records, but they usually are).  These records are basically its mental scratchpad, which it (you) uses to decide both what item to add to the list that it ultimately generates and how to produce the next record (to produce the next item in its list and so on).

List.Generate accepts 4 parameters:

  1. A starting condition.  Usually this is a function returning a record.  For example, ()=>[A=1,B=1].  This means that on the first iteration, you will have a record containing two fields, A and B, where A=1 and B=1.
  2. When to stop.  This is some kind of conditional statement telling Power Query when to stop iterating.  e.g. each [A] < 5.  Note that when checking this condition, we will have received a record of the same format as the starting condition e.g. [A=something, B=something].  It then checks the value of A within this record and while it is less than 5 it keeps going.
  3. How to make the next record.  This is where the work happens.  It receives a copy of the current iteration’s record (think mental scratchpad, current values) and returns the record for the next iteration e.g.:

    Now, the [A] and [B] inside the outer square brackets are the values of A and B in the current iteration‘s record.  The A and B without the square brackets refer to A and B in the next iteration‘s record that we are making here.  One really important thing to note is that, in producing the next iteration’s record we must produce a record with the same fields (perhaps with rare exceptions).    If you don’t, then List.Generate will fail because A or B will not be defined in the next iteration.  So you couldn’t, for example, return a record here such as each [B=7]
  4. What to put into the list.  This the actual item we’re going to add to the list this iteration.  It’s not the record we’ve been passing around, at least not necessarily. It could literally be anything e.g. each 6. But usually it is based on the current record e.g. each [A]. Would return a list with the value of A from every iteration.  Or we could return a record with both values e.g. [A=[A], B=[B]] (this means we are, in this instance, returning the same record we’ve been using as our mental scratchpad).  It returns a record with the values of A and B for each iteration.

Note – while you can get away with returning a record of [[A],[B]], you can’t return a record of [[A], [B], [A]+[B]].  Why?  Because records need fields to have names and [A]+[B] doesn’t have a name unless you give it one:

Let’s look at a more complex example of using List.Generate().

I have some data where staff on a graduate programme move between teams. The data is a table of name, date, team:

2016-03-06 10_57_28-Graduates.xlsx - Excel

We will use List.Generate to turn this into date ranges.

2016-03-06 10_57_50-Employee Range.xlsx - Excel

Our starting condition is this:

So the record we’ll be using as our mental scratchpad will start with a blank Employee, Team, Date and a Counter starting at 0.

End condition:

Keep going while our Counter variable is less than the number of rows in our InputData (i.e. our list of Employees)

OK, here’s where the real work occurs.  This is how the next record in each iteration is produced:

First thing to note, we use our Counter to refer to fields in our InputData table.  So Employee=InputData{[Counter]}[#”Employee Name”], grabs the Employee Name from the Counterth row in our InputData.

Next thing to be aware of.  To calculate the value of Date in our next iteration, we use a function.  Date will be equal to whatever this function produces.  You can see that within the function we use the same approach as above to refer to elements within our InputData.  This function could do anything functions usually do (including calling other functions), what matters is that it returns a value.

Now what the function returns isn’t actually a date.  It’s a record.

So it will set the Date value in the record we are iterating to be equal to a record (so we’ve got a record within a record).

The rest of the code takes the generated list of records, expands it out to a table.  It then adds a special Index to allow us to clean pivot the output into the required output.

Here’s the full code:

There are ways to get this same output without using List.Generate(). The point here is to explore some other ways you can employ List.Generate() in your own projects.

Extracting tables from Microsoft Word documents using just Power Query

A few weeks ago, I wrote a post demonstrating how to extract tables from Word documents using a combination of Power Query and a Python web server.

Today I want to revisit that solution and show how to do the same thing using only Power Query.

To achieve this we are going to leverage the fact that Microsoft Word .docx files are actually ZIP files containing a group of XML files.  We will decompress the ZIP file and parse the XML to pull information into Power Query.

We will take a simple Word document containing this table:

2016-02-28 12_00_17-example.docx - WordAnd import it into Power Query: 2016-02-28 12_00_36-Extract Word Table - Query Editor

Here are the high level steps:

  1. Extract the document.xml file from the Word document using the DecompressFiles function I provided in a previous post.
  2. Replace the Word table XML tags with special tags.  Note I found that some table row tags had attributes and I had to write a function ReplaceTag to locate the closing “>”.
  3. Remove all other XML tags.  I have borrowed a solution described by Bill Szysz (thank you!)
  4. Replace the special table tags we added with standard HTML table tags
  5. Use List.Accumulate to turn it back into a single text string (this helps with table cells that are split due to paragraphs etc.
  6. And finally tell Power Query to parse our output as if it was a Web page


Here is the commented code:

I also call my DecompressFiles function which I have as a separate query:

To make it easier, here is a copy of the workbook.

If you have any difficulty with this, let me know and I can walk through in more detail or clarify any steps.

I’ve tried this on a few different Word documents, but your mileage may vary.  I am using it to pull information out of pricing tables in our proposal letters.

If your intention is to parse tables from a lot of Word documents I would recommend the Python web server approach as you will get much better performance.

You can find more information about the XML schema used in Word documents here.