How to extract Queries from a workbook using just Power Query

Had success today in developing a way of extracting Power Query queries from a workbook using just Power Query:

[Screenshot: the extracted query text loaded to an Excel table (Extract Queries v4.xlsx)]
(All the query text is in there, but Excel is limited to a row height of 409 which isn’t quite enough here)

I’ve been meaning to look into how query text is stored in Excel workbooks for a while.  I had assumed it was found somewhere in the data model.

However, I was able to determine that the queries were actually stored, albeit in a rather convoluted form, in a connections.xml file within the Excel workbook .xlsx file.

Grab any Excel workbook containing Power Query queries and rename it to something .zip.

Then locate the connections.xml file in the xl folder.

If you have a look at it you’ll see what mostly looks like binary data.  But it’s in a particular form that I suspected was Base64.

So I converted the Base64 into binary to see what it was.  Turned out to be a zip file.  So I decompressed that, and hidden away inside was a text file called Section1.m containing all the queries in the workbook!

So – to recap, the text for your queries is stored in a text file in a zip file that has been converted to Base64 and placed inside an XML file that has then been turned into a zip file.  Or something like that.

And here is the code – all in Power Query – that can pull out all the queries from a specified workbook into an Excel table: (Yes – it needs a bit of a clean up!! – but gives you an insight into how I approach things)
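In outline it works like this – a sketch rather than the original listing, leaning on the DecompressFiles function described below (the line that isolates the Base64 payload is a placeholder; the real parsing is messier):

(WorkbookPath as text) as table =>
let
    EmptyTable = #table({"FileName", "Data"}, {}),
    // The .xlsx is itself a ZIP file - pull out xl/connections.xml as text
    ConnectionsXml = Text.FromBinary(
        DecompressFiles(File.Contents(WorkbookPath), 0, "xl/connections.xml", EmptyTable){0}[Data]),
    // Isolate the Base64 payload inside the XML (placeholder delimiters)
    Base64Text = Text.BetweenDelimiters(ConnectionsXml, ">", "<"),
    // Decode it - the result is another ZIP file
    InnerZip = Binary.FromText(Base64Text, BinaryEncoding.Base64),
    // Inside that ZIP, Section1.m holds the M source for every query
    Section1 = Text.FromBinary(
        DecompressFiles(InnerZip, 0, "Section1.m", EmptyTable){0}[Data]),
    // One row per line of query text
    Result = Table.FromColumns({Lines.FromText(Section1)}, {"Query Text"})
in
    Result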

Note – I kept running into a weird bug where it would mess up the line containing #"Replaced Value5" whenever I opened it up in the Advanced Editor.

It would replace it, incorrectly, with:

and then helpfully report an error!

I had to keep replacing it afresh with:

This shouldn’t be an issue if you don’t open up this function in the Advanced Editor.

I also used a version of my DecompressFiles function which I hacked a bit to return the data as text rather than XML.  Sorry, it needs a bit of tidy up too but it did the job:
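The hack boils down to wrapping the decompressed output so that text comes back instead – roughly this one-line change (illustrative; the full function is in my ZIP post further down):

// was:  Data = Binary.Decompress(Parsed[Data], Compression.Deflate)
Data = Text.FromBinary(Binary.Decompress(Parsed[Data], Compression.Deflate))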

And to call the ExtractQueries function I had another little query like this:
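Something like this (the path and file name are illustrative):

let
    Source = ExtractQueries("C:\Temp\Compare-Tables.xlsx")
in
    Source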

Now, this should work in most cases, but I haven’t tested heavily.

To make it easier, I’ve uploaded a workbook below pre-loaded with the functions and queries and also the sample file they use.

Let me know how you go!

Workbook with function and queries: Extract Queries v5

Sample file it successfully queries: Compare-Tables

Find the data sheet in a workbook using Power Query

A lot of the reports that I receive come in workbooks to which sheets have been added, with the raw data sitting on a randomly named sheet (“Extract”, “Sheet2”, “Raw Data” etc.) somewhere in the workbook.

This can make it difficult to have a repeatable process to import the data into a Data Model.

To address this I wrote a script that produces a summary of the workbook.  It lists all the sheets and indicates the number of rows and columns.  Here is the output:

[Screenshot: the Workbook Profile output in the Query Editor, listing each sheet with its row and column counts]

And here’s the script:
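A sketch of it (the path is illustrative):

let
    // Point this at the workbook you want to profile
    Source = Excel.Workbook(File.Contents("C:\Reports\SomeReport.xlsx"), null, true),
    // Keep only the worksheets (ignore tables and named ranges)
    Sheets = Table.SelectRows(Source, each [Kind] = "Sheet"),
    // The Data column holds each sheet's contents as a table
    AddedRows = Table.AddColumn(Sheets, "Rows", each Table.RowCount([Data])),
    AddedColumns = Table.AddColumn(AddedRows, "Columns", each Table.ColumnCount([Data]))
in
    AddedColumns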


The key trick is to add custom columns which call Table.RowCount() and Table.ColumnCount() on the Data column (which happens to be a table).

In my case, I realized that the sheet with the most rows (although a close call in this example!) was always going to be the sheet that I needed, so I wrote a quick function in Power Query to find and return the data from that sheet.

The function returns the data from the sheet with the most rows – which is likely to be the one with the raw data.

Here is GetSheetWithMostRows():
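A sketch of the function (the parameter name is illustrative):

(WorkbookPath as text) as table =>
let
    Source = Excel.Workbook(File.Contents(WorkbookPath), null, true),
    Sheets = Table.SelectRows(Source, each [Kind] = "Sheet"),
    // Count the rows on every sheet, then sort so the biggest comes first
    AddedRows = Table.AddColumn(Sheets, "Rows", each Table.RowCount([Data])),
    Sorted = Table.Sort(AddedRows, {{"Rows", Order.Descending}}),
    // Return the data from the sheet with the most rows
    SheetData = Sorted{0}[Data]
in
    SheetData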

This allows me to automatically find and import the raw data from the report regardless of what the sheet is called or how many other sheets may have been added to the workbook.

Compare two tables or reports using Power Query

25/08/16 – updated to correct minor errors that may have stopped this solution from working on your reports

I receive a lot of reports on a daily, weekly or monthly basis.  Usually the reports are little changed from the previous version but there is no easy way to work out just what has changed.  I suspect that this is a pretty common issue so I’ve written a general solution in Power Query.

This function accepts two tables/reports as input and provides a report of just what has changed.

The functions should work on practically any report and could easily be adapted to compare sheets in separate workbooks, but to demonstrate, I’m going to compare this simple table:

[Screenshot: the first table (Compare Tables v12.xlsx)]

with this table:

[Screenshot: the second table (Compare Tables v12.xlsx)]

and produce this output:

[Screenshot: the comparison output (Compare Tables v12.xlsx)]

In comparing two tables, there are 5 kinds of variances we need to look for:

  1. Added rows – rows in second table but not in the first (Chris Masters in our example)
  2. Removed rows – rows in first table but not in the second (eg. Penny Bradley)
  3. Added columns – columns in second table not in the first (eg. Status column)
  4. Removed columns – columns in first table not in the second (eg. Loyalty Rating column)
  5. Changed rows – rows that are in both tables but with one or more changed values

The first 4 variances are fairly easy to detect.

For rows, you can use Table.NestedJoin() to JOIN the tables and, using a joinKind of LeftAnti, return rows that either only exist in one table or only exist in the other.
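For example (a sketch using the key column from the example above):

// Rows in Table2 with no match in Table1, i.e. added rows;
// swap the table order to find removed rows instead
AddedRows = Table.NestedJoin(Table2, {"Customer Id"}, Table1, {"Customer Id"}, "Matches", JoinKind.LeftAnti)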

For columns, you can use the List functions to compare a list of Table.ColumnNames from each table and find those which are unique to one or the other.
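Along these lines (a sketch):

AddedColumns = List.Difference(Table.ColumnNames(Table2), Table.ColumnNames(Table1)),
RemovedColumns = List.Difference(Table.ColumnNames(Table1), Table.ColumnNames(Table2))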

Detecting changed fields within rows is a bit more complicated…

The approach I took was to merge the two tables so that each row contained both the old data for the row and the new data for the row.  Then I converted the rows to lists.

I then compared the old value for each field with the new value for each field and used Power Query’s Swiss army knife List.Generate() (which I described in a blog post here) to produce a list of all the changes.  I haven’t commented this in detail, so you may wish to refer to the blog post to get a better sense of what I am doing.
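The core of the idea looks something like this – a condensed sketch where OldValues, NewValues and FieldNames stand in for the lists built from the merged row (the real CompareRows does more):

Changes = List.Generate(
    () => 0,
    each _ < List.Count(FieldNames),
    each _ + 1,
    // Emit a change record when old and new differ, otherwise null
    each if OldValues{_} <> NewValues{_}
        then [Field = FieldNames{_}, Old = OldValues{_}, New = NewValues{_}]
        else null),
RealChanges = List.RemoveNulls(Changes)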

In order to be able to compare the two tables, they need to have a common unique column or primary key.  If this is not the case, you may need to create one by combining two or more columns.  In some cases, it may be OK to use a numeric index field.

The function also accepts a DontCompare parameter. This is a list of fields that you want to exclude from the comparison.  This may include fields that you are not interested in or that always change e.g. Report Date.   In this example I have excluded the Last Updated field.

There is also an AlwaysInclude parameter. This is a list of up to 4 fields that you want to always include in the output.  This is to allow you to see more easily what rows have changed.  In this example, I included both the Customer Id and the Customer’s Last Name.

Here is the commented code for the two functions (the parent function CompareTables that accepts two tables and the child function CompareRows that compares fields in a single row):

CompareTables:

CompareRows:

And here’s an example of how to use:
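Something like this (a sketch – the parameter order shown is an assumption):

let
    Table1 = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content],
    Table2 = Excel.CurrentWorkbook(){[Name = "Table2"]}[Content],
    // Key on Customer Id, ignore Last Updated, always show Id and Last Name
    Result = CompareTables(Table1, Table2, "Customer Id", {"Last Updated"}, {"Customer Id", "Last Name"})
in
    Result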

Please feel free to ask questions or make comments below.

To make this easier to follow, please find a copy of the example workbook here.  Example workbook updated 25/8/16

Compare-Tables – fixed

Fun with List.Generate() – Tips and producing a Date Range


There have been a couple of blog posts about List.Generate() – one by Chris Webb and another recently by Andrew Todd on the PowerPivotPro blog.

I recommend both of these articles as starting points for exploring List.Generate().  This blog post is not an introductory one.

List.Generate() makes lists.  It can make a straight list of scalar values, or a list of records, or a list of tables, or a list of records that contain lists etc.

The list that is returned is defined by the algorithm that you enter.  That is, you can, with a bit of work get it to produce pretty much any list that you like.

To achieve this, List.Generate iteratively produces records (well, they don’t have to be records, but they usually are).  These records are basically its mental scratchpad, which it (you) uses to decide both what item to add to the list that it ultimately generates and how to produce the next record (to produce the next item in its list and so on).

List.Generate accepts 4 parameters:

  1. A starting condition.  Usually this is a function returning a record.  For example, ()=>[A=1,B=1].  This means that on the first iteration, you will have a record containing two fields, A and B, where A=1 and B=1.
  2. When to stop.  This is some kind of conditional statement telling Power Query when to stop iterating.  e.g. each [A] < 5.  Note that when checking this condition, we will have received a record of the same format as the starting condition e.g. [A=something, B=something].  It then checks the value of A within this record and while it is less than 5 it keeps going.
  3. How to make the next record.  This is where the work happens.  It receives a copy of the current iteration’s record (think mental scratchpad, current values) and returns the record for the next iteration, e.g. each [A = [A] + 1, B = [A] + [B]]

    Now, the [A] and [B] inside the outer square brackets are the values of A and B in the current iteration’s record.  The A and B without the square brackets refer to A and B in the next iteration’s record that we are making here.  One really important thing to note is that, in producing the next iteration’s record, we must produce a record with the same fields (perhaps with rare exceptions).  If you don’t, then List.Generate will fail because A or B will not be defined in the next iteration.  So you couldn’t, for example, return a record here such as each [B=7].
  4. What to put into the list.  This is the actual item we’re going to add to the list this iteration.  It’s not the record we’ve been passing around, at least not necessarily.  It could literally be anything, e.g. each 6.  But usually it is based on the current record, e.g. each [A], which would return a list with the value of A from every iteration.  Or we could return a record with both values, e.g. [A=[A], B=[B]] (this means we are, in this instance, returning the same record we’ve been using as our mental scratchpad).  That returns a record with the values of A and B for each iteration.

Note – while you can get away with returning a record of [[A],[B]], you can’t return a record of [[A], [B], [A]+[B]].  Why?  Because records need fields to have names and [A]+[B] doesn’t have a name unless you give it one, e.g. [A = [A], B = [B], Sum = [A] + [B]].

Let’s look at a more complex example of using List.Generate().

I have some data where staff on a graduate programme move between teams. The data is a table of name, date, team:

[Screenshot: the input table of name, date and team (Graduates.xlsx)]

We will use List.Generate to turn this into date ranges.

[Screenshot: the date-range output (Employee Range.xlsx)]

Our starting condition is this:
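In a form like this (a sketch – blank here means null):

() => [Employee = null, Team = null, Date = null, Counter = 0]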

So the record we’ll be using as our mental scratchpad will start with a blank Employee, Team, Date and a Counter starting at 0.

End condition:
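A sketch:

each [Counter] < Table.RowCount(InputData)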

Keep going while our Counter variable is less than the number of rows in our InputData (i.e. our list of Employees)

OK, here’s where the real work occurs.  This is how the next record in each iteration is produced:
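In a shape like this (a sketch – column names per the input table, and the inline function that builds the Date record is simplified):

each [
    Employee = InputData{[Counter]}[#"Employee Name"],
    Team = InputData{[Counter]}[Team],
    // Date is set by an inline function that returns a record
    // (a record within a record - see below)
    Date = let ThisDate = InputData{[Counter]}[Date],
               NextDate = try InputData{[Counter] + 1}[Date] otherwise null
           in [From = ThisDate, To = NextDate],
    Counter = [Counter] + 1
]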

First thing to note, we use our Counter to refer to fields in our InputData table.  So Employee=InputData{[Counter]}[#"Employee Name"] grabs the Employee Name from the Counter-th row in our InputData.

Next thing to be aware of.  To calculate the value of Date in our next iteration, we use a function.  Date will be equal to whatever this function produces.  You can see that within the function we use the same approach as above to refer to elements within our InputData.  This function could do anything functions usually do (including calling other functions), what matters is that it returns a value.

Now what the function returns isn’t actually a date.  It’s a record.

So it will set the Date value in the record we are iterating to be equal to a record (so we’ve got a record within a record).

The rest of the code takes the generated list of records and expands it out to a table.  It then adds a special Index to allow us to cleanly pivot the output into the required shape.

Here’s the full code:
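Assembled, it looks something like this – a simplified sketch (the table name Graduates is an assumption, the starting record here already holds the first row so that the end condition emits every row, and the real query goes on to expand, index and pivot the result):

let
    InputData = Excel.CurrentWorkbook(){[Name = "Graduates"]}[Content],
    Generated = List.Generate(
        () => [Employee = InputData{0}[#"Employee Name"],
               Team = InputData{0}[Team],
               Date = InputData{0}[Date],
               Counter = 0],
        each [Counter] < Table.RowCount(InputData),
        each [Employee = InputData{[Counter] + 1}[#"Employee Name"],
              Team = InputData{[Counter] + 1}[Team],
              Date = InputData{[Counter] + 1}[Date],
              Counter = [Counter] + 1],
        each [Employee = [Employee], Team = [Team], Date = [Date]]),
    AsTable = Table.FromRecords(Generated)
in
    AsTable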

There are ways to get this same output without using List.Generate(). The point here is to explore some other ways you can employ List.Generate() in your own projects.

Extracting tables from Microsoft Word documents using just Power Query

A few weeks ago, I wrote a post demonstrating how to extract tables from Word documents using a combination of Power Query and a Python web server.

Today I want to revisit that solution and show how to do the same thing using only Power Query.

To achieve this we are going to leverage the fact that Microsoft Word .docx files are actually ZIP files containing a group of XML files.  We will decompress the ZIP file and parse the XML to pull information into Power Query.

We will take a simple Word document containing this table:

[Screenshot: the table in example.docx in Word]

And import it into Power Query:

[Screenshot: the imported table in the Query Editor]

Here are the high level steps:

  1. Extract the document.xml file from the Word document using the DecompressFiles function I provided in a previous post.
  2. Replace the Word table XML tags with special tags.  Note I found that some table row tags had attributes and I had to write a function ReplaceTag to locate the closing “>”.
  3. Remove all other XML tags.  I have borrowed a solution described by Bill Szysz (thank you!)
  4. Replace the special table tags we added with standard HTML table tags
  5. Use List.Accumulate to turn it back into a single text string (this helps with table cells that are split due to paragraphs etc.)
  6. And finally tell Power Query to parse our output as if it was a Web page


Here is the commented code:
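A heavily condensed sketch of those steps (the real code also uses the ReplaceTag helper for row tags carrying attributes, strips the remaining XML tags and reassembles split cells with List.Accumulate):

let
    EmptyTable = #table({"FileName", "Data"}, {}),
    // 1. Pull word/document.xml out of the .docx (a ZIP file)
    DocXml = Text.FromBinary(
        DecompressFiles(File.Contents("C:\Docs\example.docx"), 0, "word/document.xml", EmptyTable){0}[Data]),
    // 2 and 4. Map Word's table tags to HTML table tags
    Tbl = Text.Replace(DocXml, "<w:tbl>", "<table>"),
    TblEnd = Text.Replace(Tbl, "</w:tbl>", "</table>"),
    Tr = Text.Replace(TblEnd, "<w:tr>", "<tr>"),
    TrEnd = Text.Replace(Tr, "</w:tr>", "</tr>"),
    Tc = Text.Replace(TrEnd, "<w:tc>", "<td>"),
    TcEnd = Text.Replace(Tc, "</w:tc>", "</td>"),
    // 6. Parse the result as if it were a web page and take the first table
    AsPage = Web.Page(TcEnd),
    FirstTable = AsPage{0}[Data]
in
    FirstTable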

I also call my DecompressFiles function which I have as a separate query:

To make it easier, here is a copy of the workbook.

If you have any difficulty with this, let me know and I can walk through in more detail or clarify any steps.

I’ve tried this on a few different Word documents, but your mileage may vary.  I am using it to pull information out of pricing tables in our proposal letters.

If your intention is to parse tables from a lot of Word documents I would recommend the Python web server approach as you will get much better performance.

You can find more information about the XML schema used in Word documents here.

How to extract data from a ZIP file using just Power Query

Power Query doesn’t have inbuilt support for ZIP files.  However, it does have the ability to decompress deflated data held in gzip files.  Chris Webb wrote an article about this a couple of months back but not about .zip files.

In this post, I’m going to describe a method for extracting content from a standard .zip file.

A few constraints to be aware of upfront  – this will only work for ZIP files, not RAR, ZIPX, 7-Zip, bzip2, etc.

It does work with every ZIP file that I created using WinZip and WinRAR, including “Super Fast”, “Enhanced Deflate” etc.

It doesn’t work with “ZIP: No Compression” (although it would only take a couple of extra lines to accommodate).

In my experience, it works with most ZIP files that I have encountered.

The Wikipedia article on the ZIP file format was invaluable in putting this solution together.

At a high level, a ZIP file is comprised of:

File #1 Header
File #1 Compressed Data
File #2 Header
File #2 Compressed Data
Central Directory

In other words, the content is at the beginning and the “directory” for that content is at the end.

I elected to ignore the Directory and parse the content directly.

And here is my M code (sorry, you can’t do this via the GUI):
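The shape of it is below – a sketch of the published function (parameter names follow the description underneath; error handling is thinner than the real thing):

let
    DecompressFiles = (ZIPfile as binary, Position as number, FileToExtract as text, DataSoFar as table) as table =>
    let
        UInt16 = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger16, ByteOrder.LittleEndian),
        UInt32 = BinaryFormat.ByteOrder(BinaryFormat.UnsignedInteger32, ByteOrder.LittleEndian),
        // A local file header starts with the signature PK\3\4;
        // anything else means we have reached the Central Directory
        Signature = BinaryFormat.Record([
            Skip = BinaryFormat.Binary(Position),
            Sig = BinaryFormat.Binary(4)])(ZIPfile)[Sig],
        IsFileHeader = try Signature = #binary({0x50, 0x4B, 0x03, 0x04}) otherwise false,
        // Pass 1: read the three variable lengths out of the 30-byte header
        Header = BinaryFormat.Record([
            Skip = BinaryFormat.Binary(Position + 18),
            CompressedSize = UInt32,
            UncompressedSize = UInt32,
            FileNameLen = UInt16,
            ExtraLen = UInt16])(ZIPfile),
        // Pass 2: with the lengths known, read the filename and the deflated data
        Parsed = BinaryFormat.Record([
            Skip = BinaryFormat.Binary(Position + 30),
            FileName = BinaryFormat.Text(Header[FileNameLen]),
            Extra = BinaryFormat.Binary(Header[ExtraLen]),
            Data = BinaryFormat.Binary(Header[CompressedSize])])(ZIPfile),
        // Keep this file if it is the one asked for (blank means keep everything)
        NewData = if FileToExtract = "" or Parsed[FileName] = FileToExtract
            then Table.InsertRows(DataSoFar, Table.RowCount(DataSoFar),
                {[FileName = Parsed[FileName], Data = Binary.Decompress(Parsed[Data], Compression.Deflate)]})
            else DataSoFar,
        NextPosition = Position + 30 + Header[FileNameLen] + Header[ExtraLen] + Header[CompressedSize]
    in
        // Recurse until we run out of local file headers
        if IsFileHeader then @DecompressFiles(ZIPfile, NextPosition, FileToExtract, NewData) else DataSoFar
in
    DecompressFiles

You would call it along the lines of DecompressFiles(Source, 0, "", #table({"FileName", "Data"}, {})) to pull out every file.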

So how does it work?

The function DecompressFiles takes four parameters:

ZIPfile : the ZIP file contents
Position : byte offset within the file, this is initially zero
FileToExtract : you can optionally pass in a specific file to find within the ZIP file – passing a blank string will return all files
DataSoFar : if you are decompressing multiple files from the ZIP file, then this holds previous decompressed files – it is initially blank

There are two key steps in the code:

Firstly, it reads the zip file content and parses some key values from a file header – namely, the length of the compressed file (i.e. how many bytes of compressed data there are), the length of the filename, and the length of any extra values.  Everything else has a fixed length and so isn’t important to us.

Once we have extracted these values we know exactly where the file’s compressed data is located and how long it is.

With this information, we parse the zip file again, this time with full knowledge of where the compressed data is.  This allows us to extract it out into a single field – Data.

Then it’s a simple process to use Binary.Decompress to get the decompressed content.

You should end up with output looking like this:

[Screenshot: the decompressed CSV output in the Query Editor]

Some other things to note about the code:

  1. I’ve wrapped most of the lines in try … otherwise… blocks.  This is because I have not attempted to determine when I reach the end of the files and hit the Central Directory.  That’s on my “to do” list.
  2. The function is recursive (yes, I know that isn’t preferred).  I need to take a look at how to change that to List.Generate or similar.  I haven’t run into any performance issues however.
  3. I maintain the files as a Table of binaries, both to let me show the filename and to handle content types that vary, e.g. some .csv, some .xml etc.
  4. Reading .zip files in this way is not the officially accepted approach.  You are supposed to read the Central Directory and find the files that way.  This is because the .zip file may, in theory, contain deleted files or earlier versions, which you won’t stumble across if you only look for files found in the Central Directory.  I didn’t see this as a big risk, but again it is on my “to do” list.

I’m sure there are cleverer ways of writing this code, so any feedback is welcome.

Here’s an Excel workbook containing the query: DecompressZIP

You will need to update

Source = File.Contents("C:\CompressedData.zip"),

to point to your compressed data for it to work.

In my next post, I’ll show how to use this approach to extract data previously inaccessible to Power Query such as hyperlinks and comments from an Excel .xlsx file (which is, of course, a set of zipped .xml files).

Power Query, Proxies – Parsing tables from Microsoft Word documents

Our proposal letters are Microsoft Word documents which include embedded pricing tables.  It is useful for me to have this information in Excel.

I’m going to explore an approach for pulling tables from Word documents into Excel using the small Python web server used in previous posts and Power Query.

As per the previous post, you need to have Python (both 2.x and 3.x should work) installed.  This time, we’ll be using the python-docx library to do the heavy lifting for us.  Of course, you will also need the bottle web framework.

I won’t cover the installation of these here, but happy to assist if you run into trouble.

Here’s the code for the Python web server:
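It went something like this – a sketch following the description below (the route and query-string parameter match the post; everything else is illustrative):

from bottle import route, run, request
from docx import Document

@route('/extractWordTables')
def extract_word_tables():
    # e.g. http://localhost:8080/extractWordTables?filename=myworddocument.docx
    filename = request.query.filename
    doc = Document(filename)
    # Walk every table, row by row, cell by cell, emitting HTML tables
    html = ['<html><body>']
    for table in doc.tables:
        html.append('<table>')
        for row in table.rows:
            html.append('<tr>')
            for cell in row.cells:
                html.append('<td>%s</td>' % cell.text)
            html.append('</tr>')
        html.append('</table>')
    html.append('</body></html>')
    return ''.join(html)

run(host='localhost', port=8080)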

This Python code will run a small web server that listens on port 8080 and will respond to requests sent to http://localhost:8080/extractWordTables

It is expecting to be sent a query string with a Word document’s filename e.g. http://localhost:8080/extractWordTables?filename=myworddocument.docx

Once it has a Word document it attempts to parse it and look for tables.  To do this it uses Steve Canny’s fantastic python-docx library.

In using it, the process steps through the tables, row by row, cell by cell and converts the data into an HTML document containing one or more tables.

To use this from Power Query you need to execute a Power Query -> From Web query.  Then enter something like:

http://localhost:8080/extractWordTables?filename=WordDocWithTables.docx

You may need to tell it to convert the content to a Web Page, but you should end up with M code looking something like this:
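For instance (a sketch):

let
    Source = Web.Page(Web.Contents("http://localhost:8080/extractWordTables?filename=WordDocWithTables.docx")),
    Data0 = Source{0}[Data]
in
    Data0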

You can then select the table that you want to work further on.

I have adapted this approach to work with multiple files, so you could essentially bulk parse a set of Word documents.  However, it relies on the documents containing tables with the same number of columns etc.  I’ve also added additional features to the Web server to allow me to filter what tables are returned based on the column headings.

Can provide further detail if there is interest.

Finally, I’m exploring other options for doing all of this directly in Power Query, but I do not have a working solution just yet.

Custom web proxies and Power Query – decrypting Google Chrome cookies

It turns out that decrypting Google Chrome cookies is a non-trivial process, so I thought I’d use it as an introduction to the use of custom web proxies.

There are functions available in the .NET framework that allow you to decrypt Google Chrome cookies, however, they are not directly accessible via Power Query (as far as I can tell) and even with VBA you would need to write a COM wrapper.

Instead, I decided to write a very short script in Python which would decrypt the cookies and send the result back to Excel/Power Query.  How to do that?  Well, by calling the running Python script from a Power Query Web contents query.

The solution below has two parts, a Python web server and Power Query.  Power Query imports the Google cookies database.  It then executes a Web query to contact the Python web server and send it a list of encrypted cookies.  The Python script receives the encrypted cookies, decrypts them and sends back the result.  Power Query takes the decrypted cookies and combines them with the other data it has for the cookies.

To make this work, you need to have Python (both 2.x and 3.x should work) installed.  You’ll also need the Python extensions for Windows (for the win32crypt library) and the bottle web framework.

I won’t cover the installation of these here, but happy to assist if you run into trouble.
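Here is a sketch of such a script (win32crypt.CryptUnprotectData does the actual decryption; the request/response format – one Base64-encoded value per line – is an assumption for illustration):

from bottle import route, run, request
import win32crypt, base64

@route('/decryptCookie', method='POST')
def decrypt_cookie():
    # One Base64-encoded encrypted cookie value per line in the request body
    encrypted = request.body.read().decode('utf-8').splitlines()
    decrypted = []
    for value in encrypted:
        blob = base64.b64decode(value)
        # CryptUnprotectData returns a (description, decrypted_bytes) tuple
        decrypted.append(win32crypt.CryptUnprotectData(blob, None, None, None, 0)[1].decode('utf-8'))
    return '\n'.join(decrypted)

run(host='localhost', port=8080)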

So basically, this short script starts a tiny local web server.

Assuming you run it from your desktop PC it will respond to requests sent to http://localhost:8080/decryptCookie

And if we send it a list of encrypted cookies, it will decrypt them using Microsoft’s libraries and send back the decrypted cookies.

For the purposes of this exercise, we will manually launch it, but we could trigger it in a more sophisticated fashion directly from Excel.

The M code on the Power Query side of things builds on the post from yesterday, which showed how to access the Google Chrome cookie SQLite database.

I did run into a strange issue where the query I put together yesterday was truncating the encrypted cookie field from the cookies database.   For some reason, this was corrected by adding a SQL query – select * from cookies rather than simply connecting to the table.  I would have expected this to produce identical results but it didn’t.

And here is the annotated M query:
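In outline it does this (a sketch matching the toy server above – the real query is more careful):

let
    // Read the cookies table via the SQL query mentioned above
    Cookies = Odbc.Query("DRIVER=SQLite3 ODBC Driver;Database=C:\Users\<your username>\AppData\Local\Google\Chrome\User Data\Default\Cookies;", "select * from cookies"),
    // Base64-encode each encrypted value and POST the batch to the Python server
    Payload = Text.Combine(List.Transform(Cookies[encrypted_value], each Binary.ToText(_, BinaryEncoding.Base64)), "#(lf)"),
    Response = Text.FromBinary(Web.Contents("http://localhost:8080/decryptCookie", [Content = Text.ToBinary(Payload)])),
    // Pair each decrypted value back up with its row
    Decrypted = Lines.FromText(Response),
    WithDecrypted = Table.FromColumns(Table.ToColumns(Cookies) & {Decrypted}, Table.ColumnNames(Cookies) & {"decrypted_value"})
in
    WithDecrypted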

Happy to walk through this in more detail if necessary or respond to questions.

This data can now be used much like the “harvested” Internet Explorer cookies from other posts.

Bear in mind that there are countless other uses for custom Web proxies with Power Query.  For example, you can use them for more complex web authentication, web scraping with BeautifulSoup (no more crazy HTML/XML guesswork), data manipulation using Python libraries such as pandas, executing other applications on your computer, and so on.

Of course, on the flipside, you need to have Python installed and have the Python script running before it will work.  On your home PC, this probably doesn’t matter.  In a corporate environment, you would want this to be hosted on a proper web server.

How to access an SQLite database from Power Query

SQLite is an increasingly common database format for both mobile applications and small desktop applications.  Many applications on iOS/Android use it, and Google and Mozilla use it in their browsers.

Reading an SQLite database using Power Query isn’t difficult but does require a few initial steps.

To begin with, you need to install an ODBC SQLite3 driver.

I recommend using the one located here:

http://www.ch-werner.de/sqliteodbc/


[Screenshot: the SQLite ODBC Driver download page]

Make sure that you install the correct version – either 32 bit or 64 bit.

It would also be a good idea to close Excel and re-open before going any further.

Now we create an ODBC query by selecting From Other Sources->From ODBC:

[Screenshot: From Other Sources->From ODBC in Excel]

The next step is to specify a connection string.

Here’s an example that will read the Google Chrome Cookies database (assuming you have Chrome installed and specify your username):

DRIVER=SQLite3 ODBC Driver;Database=C:\Users\<your username>\AppData\Local\Google\Chrome\User Data\Default\Cookies;LongNames=0;Timeout=1000;NoTXN=0;

You will then get asked what credentials to use.  In our case, we just want the default of no credentials:

[Screenshot: the credentials dialog]

You will then see a dialog box asking you to select a table.  For our example, select the “cookies” table

[Screenshot: the table selection dialog]

and – hopefully – you will see a table containing all the cookies from Google Chrome:

[Screenshot: the cookies table in the Query Editor]

You will note that most of them are encrypted and therefore not immediately available for us to use.

That’s a problem for another day :)

Although this example used the Google Chrome Cookies sqlite3 database, this approach should work with minimal change for most sqlite3 databases.
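For reference, the M that the GUI generates boils down to something like this:

let
    Source = Odbc.DataSource("DRIVER=SQLite3 ODBC Driver;Database=C:\Users\<your username>\AppData\Local\Google\Chrome\User Data\Default\Cookies;LongNames=0;Timeout=1000;NoTXN=0;"),
    cookies = Source{[Name = "cookies"]}[Data]
in
    cookies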


How to use Youtube API from Power Query – Part 1

[Screenshot: a table of popular Power Query videos from the Youtube API]

Google offer quite a comprehensive API for Youtube but I haven’t seen any examples of people using Power Query to explore it.

So let’s do that – we’re going to use Power Query to access Youtube’s API and display a simple table of popular videos such as the above.

Before you can do this, you will need an API key for accessing the Youtube API.

Google provide instructions for this here.

The key steps are:

  1. Create a new project in the Google Developers Console.
  2. Turn on access for the YouTube Data API v3.
  3. Create a browser API key.

The end result should be a long string of random characters constituting your key for accessing the Youtube Data API.

Now, we can get into Excel and Power Query.

Usually when you are accessing an API in Power Query you would use Web API authentication.  This doesn’t seem to work for Youtube Data API, as it expects the API key to be provided in the query not the header, so we will need to specify our API key manually.

Start with a Blank Query and paste the following into the Advanced Editor:
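Something along these lines (substitute your own key):

let
    // Your API key from the Google Developers Console
    APIKey = "<your api key here>",
    Source = Json.Document(Web.Contents(
        "https://www.googleapis.com/youtube/v3/search?key=" & APIKey
        & "&q=power%20query&part=snippet,id&maxResults=50&order=viewCount"))
in
    Source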

Update 28/3/2016

You may get a response indicating that you need to specify how to connect or otherwise enter credentials.  If you have already entered credentials you may need to edit them in order for future steps to work.

[Screenshot: the Edit Credentials prompt in the Query Editor]

Click to Edit Credentials.

Then you need to indicate that you want to connect anonymously.  This is because we are providing our credentials (our API key) as part of our Query string not via the Query header.

So when the below window appears, make sure you select Anonymous on the left and select just the Youtube search hyperlink radio button on the right and finally hit Connect.

[Screenshot: the access settings dialog, with Anonymous selected]

Expected result:

[Screenshot: the query result in the Query Editor]

The Youtube API documentation explains in detail what our query actually does, but essentially we are performing a search for the top 50 (by number of views) videos containing the term “power query”.

You could actually enter this directly into your browser if you wanted to. It is equivalent to:

https://www.googleapis.com/youtube/v3/search?key=<your api key here>&q=power%20query&part=snippet,id&maxResults=50&order=viewCount


Now, if you click on the items List, Power Query will return a list of the results as records.

[Screenshot: the items list in the Query Editor]

They will be easier to work with as a Table, so next we’ll convert the list to a Table.

[Screenshot: the To Table conversion in the Query Editor]

Returning this:

[Screenshot: the resulting table]

If you then expand out the column, you can start to see all of our data:

[Screenshot: the expanded Column1 in the Query Editor]

Next expand out the id and snippet:

[Screenshot: the expanded id and snippet columns]

As you can see, we now have a list of videos that all look to be related to Power Query!

If you remove all columns except video id, publishedAt, title and channelTitle, you should end up with a result something like this:

[Screenshot: the final table of video id, publishedAt, title and channelTitle]

We can immediately learn that ExcelIsFun is a very popular Youtube channel among Power Query enthusiasts – it has 4 videos in the top 10.  I definitely recommend checking it out.

Unfortunately, we don’t have any information about the number of views, or likes, or comments.

In order to add this information we will need to perform another query to the API asking for statistics.

I’ll be covering this in a future post.

Here’s the full code to date:
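Consolidating the steps above gives something like this (a sketch of the finished query):

let
    APIKey = "<your api key here>",
    Source = Json.Document(Web.Contents(
        "https://www.googleapis.com/youtube/v3/search?key=" & APIKey
        & "&q=power%20query&part=snippet,id&maxResults=50&order=viewCount")),
    items = Source[items],
    // Convert the list of records to a table and expand the fields we want
    ToTable = Table.FromList(items, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    Expanded = Table.ExpandRecordColumn(ToTable, "Column1", {"id", "snippet"}),
    ExpandedId = Table.ExpandRecordColumn(Expanded, "id", {"videoId"}),
    ExpandedSnippet = Table.ExpandRecordColumn(ExpandedId, "snippet", {"publishedAt", "title", "channelTitle"})
in
    ExpandedSnippet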

A follow up post has been added here.