Find the data sheet in a workbook using Power Query

A lot of the reports that I receive come in a workbook to which sheets have been added and the raw data will be on a randomly named sheet (“Extract”, “Sheet2”, “Raw Data” etc.) somewhere in the workbook.

This can make it difficult to have a repeatable process to import the data into a Data Model.

To address this I wrote a script that produces a summary of the workbook.  It lists all the sheets and indicates the number of rows and columns.  Here is the output:

2016-03-28 11_15_40-Workbook Profile - Query Editor

And here’s the script:

 

The key trick is to add custom columns which call Table.RowCount() and Table.ColumnCount() on the Data column (which happens to be a table).

In my case, I realized that the sheet with the most rows (although a close call in this example!) was always going to the sheet that I needed, so I wrote a quick function in Power Query to find and return the data from that sheet.

The function returns the data from the sheet with the most rows – which is likely to be the one with the raw data.

Here is GetSheetWithMostRows():

This allows me to automatically find and import the raw data from the report regardless of what the sheet is called or how many other sheets may have been added to the workbook.

Compare two tables or reports using Power Query

25/08/16 – updated to correct minor errors that may stopped this solution from working on your reports 

I receive a lot of reports on a daily, weekly or monthly basis.  Usually the reports are little changed from the previous version but there is no easy way to work out just what has changed.  I suspect that this is a pretty common issue so I’ve written a general solution in Power Query.

This function accepts two tables/reports as input and provides a report of just what has changed.

The functions should work on practically any report and could easily be adapted to compare sheets in separate workbooks, but to demonstrate, I’m going to compare this simple table:

2016-03-20 11_12_33-Compare Tables v12.xlsx - Excel

with this table:

2016-03-20 11_13_00-Compare Tables v12.xlsx - Excel

and produce this output:

2016-03-20 11_13_15-Compare Tables v12.xlsx - Excel

In comparing two tables, there are a 5 kinds of variances we need to look for:

  1. Added rows – rows in second table but not in the first (Chris Masters in our example)
  2. Removed rows – rows in first table but not in the second (eg. Penny Bradley)
  3. Added columns – columns in second table not in the first (eg. Status column)
  4. Removed columns – columns in first table not in the second (eg. Loyalty Rating column)
  5. Changed rows – rows that are in both tables but with one of more changed values

The first 4 variances are fairly easy to detect.

For rows, you can use Table.NestedJoin() to JOIN the tables and, using a joinKind of LeftAnti, return rows that either only exist in one table or only exist in the other.

For columns, you can using the List functions to compare a list of Table.ColumnNames from each table and find those which are unique to one or the other.

Detecting changed fields within rows is a bit more complicated…

The approach I took was to merge the two tables so that each row contained both the old data for the row and the new data for the row.  Then I converted the rows to lists.

I then compared the old value for each field with the new value for each field and used Power Query’s Swiss army knife List.Generate() (which I described in a blog post here) to produce a list of all the changes.  I haven’t commented this in detail, so you may wish to refer to the blog post to get a better sense of what I am doing.

In order to be able to compare the two tables, they need to have a common unique column or primary key.  If this is not the case, you may need to create one by combining two or more columns.  In some cases, it may be OK to use a numeric index field.

The function also accepts a DontCompare parameter. This is a list of fields that you want to exclude from the comparison.  This may include fields that you are not interested in or that always change e.g. Report Date.   In this example I have excluded the Last Updated field.

There is also an AlwaysInclude parameter. This is a list of up to 4 fields that you want to always include in the output.  This is to allow you to see more easily what rows have changed.  In this example, I included both the Customer Id and the Customer’s Last Name.

Here is the commented code for the two functions (the parent function CompareTables that accepts two tables and the child function CompareRows that compares fields in a single row):

CompareTables:

CompareRows:

And here’s an example of how to use:

Please feel free to ask questions or make comments below.

To make this easier to follow, please find a copy of the example workbook here.  Example workbook updated 25/8/16

Compare-Tables – fixed

Fun with List.Generate() – Tips and producing a Date Range

 

There have been a couple of blog posts about List.Generate()one by Chris Webb and another recently by Andrew Todd on the PowerPivotPro blog.

I recommend both of these articles as starting points for exploring List.Generate().  This blog post is not an introductory one.

List.Generate() makes lists.  It can make a straight list of scalar values, or a list of records, or a list of tables, or a list of records that contain lists etc.

The list that is returned is defined by the algorithm that you enter.  That is, you can, with a bit of work get it to produce pretty much any list that you like.

To achieve this, List.Generate iteratively produces records (well, they don’t have to be records, but they usually are).  These records are basically its mental scratchpad, which it (you) uses to decide both what item to add to the list that it ultimately generates and how to produce the next record (to produce the next item in its list and so on).

List.Generate accepts 4 parameters:

  1. A starting condition.  Usually this is a function returning a record.  For example, ()=>[A=1,B=1].  This means that on the first iteration, you will have a record containing two fields, A and B, where A=1 and B=1.
  2. When to stop.  This is some kind of conditional statement telling Power Query when to stop iterating.  e.g. each [A] < 5.  Note that when checking this condition, we will have received a record of the same format as the starting condition e.g. [A=something, B=something].  It then checks the value of A within this record and while it is less than 5 it keeps going.
  3. How to make the next record.  This is where the work happens.  It receives a copy of the current iteration’s record (think mental scratchpad, current values) and returns the record for the next iteration e.g.:

    Now, the [A] and [B] inside the outer square brackets are the values of A and B in the current iteration‘s record.  The A and B without the square brackets refer to A and B in the next iteration‘s record that we are making here.  One really important thing to note is that, in producing the next iteration’s record we must produce a record with the same fields (perhaps with rare exceptions).    If you don’t, then List.Generate will fail because A or B will not be defined in the next iteration.  So you couldn’t, for example, return a record here such as each [B=7]
  4. What to put into the list.  This the actual item we’re going to add to the list this iteration.  It’s not the record we’ve been passing around, at least not necessarily. It could literally be anything e.g. each 6. But usually it is based on the current record e.g. each [A]. Would return a list with the value of A from every iteration.  Or we could return a record with both values e.g. [A=[A], B=[B]] (this means we are, in this instance, returning the same record we’ve been using as our mental scratchpad).  It returns a record with the values of A and B for each iteration.

Note – while you can get away with returning a record of [[A],[B]], you can’t return a record of [[A], [B], [A]+[B]].  Why?  Because records need fields to have names and [A]+[B] doesn’t have a name unless you give it one:

Let’s look at a more complex example of using List.Generate().

I have some data where staff on a graduate programme move between teams. The data is a table of name, date, team:

2016-03-06 10_57_28-Graduates.xlsx - Excel

We will use List.Generate to turn this into date ranges.

2016-03-06 10_57_50-Employee Range.xlsx - Excel

Our starting condition is this:

So the record we’ll be using as our mental scratchpad will start with a blank Employee, Team, Date and a Counter starting at 0.

End condition:

Keep going while our Counter variable is less than the number of rows in our InputData (i.e. our list of Employees)

OK, here’s where the real work occurs.  This is how the next record in each iteration is produced:

First thing to note, we use our Counter to refer to fields in our InputData table.  So Employee=InputData{[Counter]}[#”Employee Name”], grabs the Employee Name from the Counterth row in our InputData.

Next thing to be aware of.  To calculate the value of Date in our next iteration, we use a function.  Date will be equal to whatever this function produces.  You can see that within the function we use the same approach as above to refer to elements within our InputData.  This function could do anything functions usually do (including calling other functions), what matters is that it returns a value.

Now what the function returns isn’t actually a date.  It’s a record.

So it will set the Date value in the record we are iterating to be equal to a record (so we’ve got a record within a record).

The rest of the code takes the generated list of records, expands it out to a table.  It then adds a special Index to allow us to clean pivot the output into the required output.

Here’s the full code:

There are ways to get this same output without using List.Generate(). The point here is to explore some other ways you can employ List.Generate() in your own projects.