How to extract data from a ZIP file using just Power Query

Power Query doesn’t have inbuilt support for ZIP files.  However, it does have the ability to decompress deflated data held in gzip files.  Chris Webb wrote an article about this a couple of months back but not about .zip files.

In this post, I’m going to describe a method for extracting content from a standard .zip file.

A few constraints to be aware of upfront  – this will only work for ZIP files, not RAR, ZIPX, 7-Zip, bzip2, etc.

It does work with every ZIP file that I created using WinZip and WinRAR, including “Super Fast”, “Enhanced Deflate” etc.

It doesn’t work with “ZIP: No Compression” (although it would only take a couple of extra lines to accommodate).

In my experience, it works with most ZIP files that I have encountered.

The Wikipedia article on the ZIP file format was invaluable in putting this solution together.

At a high level, a ZIP file is comprised of:

File #1 Header
File #1 Compressed Data
File #2 Header
File #2 Compressed Data
Central Directory

In other words, the content is at the beginning and the “directory” for that content is at the end.

I elected to ignore the Directory and parse the content directly.

And here is my M code (sorry, you can’t do this via the GUI):

So how does it work?

The function DecompressFiles takes four parameters:

ZIPfile : the ZIP file contents
Position : byte offset within the file, this is initially zero
FileToExtract : you can optionally pass in a specific file to find within the ZIP file – passing a blank string will return all files
DataSoFar : if you are decompressing multiple files from the ZIP file, then this holds previous decompressed files – it is initially blank

There are two key steps in the code:

Firstly, it reads the zip file content and parses some key values from a file header – namely, the length of the compressed file (i.e. how many bytes of compressed data are there), the length of the filename, and the length of any extra values. Everything else has a fixed length and so isn’t important to us.

Once we have extracted these values we know exactly where the file’s compressed data is located and how long it is.

With this information, we parse the zip file again, this time with full knowledge of where the compressed data is.  This allows us to extract it out into a single field – Data.

Then it’s a simple process to use Binary.Decompress to get the decompressed content.

You should end up output looking like this:

2016-02-14 11_42_27-Decompress CSV - Query Editor

Some other things to note about the code:

  1. I’ve wrapped most of the lines in try … otherwise… blocks.  This is because I have not attempted to determine when I reach the end of the files and hit the Central Directory.  That’s one my “to do” list.
  2. The function is recursive (yes, I know that isn’t preferred).  I need to take a look at how to change that to List.Generate or similar.  I haven’t run into any performance issues however.
  3. I maintain the files as a Table of binaries to allow me both to show the filename and in case the content type varies e.g. some .csv, some .xml etc.
  4. Reading .zip files in this way is not the officially accepted approach.  You are supposed to read the Central Directory and find the files that way.  This is because the .zip file may, in theory, contain deleted files or earlier versions, which you won’t stumble across if you only look for files found in the Central Directory.  I didn’t see this as a big risk, but again it is on my “to do” list.

I’m sure there are cleverer ways of writing this code, so any feedback is welcome.

Here’s an Excel workbook containing the query: DecompressZIP

You will need to update

Source = File.Contents("C:\CompressedData.zip"),

to point to your compressed data for it to work.

In my next post, I’ll show how to use this approach to extract data previously inaccessible to Power Query such as hyperlinks and comments from an Excel .xlsx file (which is, of course, a set of zipped .xml files).

2 thoughts on “How to extract data from a ZIP file using just Power Query”

  1. Hi Ken,

    Thank you for a brilliant post – it inspired me to learn more about M.
    The solution above suffers from a performance problem which stems from it’s recursive nature – causing the binary data of the Zip file to be read many times.

    I’ve written an alternative function that takes full advantage of M’s BinaryFormat feature, allowing the Zip file to be decompressed in a single pass. This makes the downstream consumption of unzipped files faster.
    For more details, see: http://sql10.blogspot.co.za/2016/06/reading-zip-files-in-powerquery-m.html

    1. Hi Mark,

      That’s a great solution. I’d always intended to revisit mine and do a non-recursive version but never quite found the time to get back to it.

      So I’m looking forward to exploring your approach.

      I have noticed the performance impact you mention quite a bit in my recent blog post on extracting queries. It hadn’t really been an issue for me previously as I was dealing primarily with small files.

      Cheers

      Ken

Leave a Reply

Your email address will not be published. Required fields are marked *