Archive

Posts Tagged ‘Fill’

Power Query Fill Down; Practical Example

January 30, 2014 Leave a comment

Links related to this post:
Retro Sheet
Microsoft Office Fill Down Reference

Power Query Overview
The guts behind Power Query is the “M” Language. Once we specify our source data, the Power Query engine iterates through user-defined steps/formulas to transform it into a useful data set. The process is much like building a macro where each step depends on the previous step. Each step is a formula written in the “M” language that can be automatically generated from Power Query Designer toolbar buttons and drop down menus or, the sequence of steps can be written by hand if you need functionality not available in the user interface.

Practical Problem
You’re tasked with extracting, transforming and loading flat files of loosely structured data into a Data Warehouse. Typically, a flat file will have a header row and each detail record represents the same type of data as the rest until the end of the file. Our flat file has no header row and each row can be of a different type. Luckily, the data is consistent and the first column in each row tells us the type of row it is. In the image below, an “id” record tells us the identifier for 1 baseball game. The “version” record tells us the data format in which the game was recorded. And, the “info” record gives us 1 piece of information about the game specified in the closest “id” record above. There are multiple games in the 1951.EDA file, so we need to somehow match the game id “DET195106102”, to each “info” record below it, but not past the next “id” record:

And we have to do this using Power Query.

Practical Solution
The key to the solution here is the Table.FillDown function. This function can also be automatically written via drop down menu when right-clicking on a column:

Practical Example

The FillDown function is the key piece of code in our solution, but there are many steps needed beforehand to get us setup. In the link at the top of the page, you can download these files from RetroSheet.org, just make sure you read their terms and conditions. There are many files on this site and this particular problem file is an Event file. Once you have those downloaded somewhere you can access them from your Excel 2013 instance, follow these steps to play along at work (or at home if you’re REALLY bored):

1. Browse to your Power Query tab in Excel and click “From File”>”From Folder”:

2. Specify the folder in which your baseball files are located (yes, Power Query iterates through multiple files!) and click OK:

3. You should see a list of all files in the folder appear in a tabular format. We only want to concern ourselves with 4 different types of file extensions, but before we filter, we need to click the “Load more” link in the filter/sort dialog to get all distinct file extensions. Otherwise, we could get unexpected results. See link below:

4. Now that we have all of our distinct file extensions, we can feel confident about select the 4 file extensions that we are concerned with “.EDA”, “.EDN”, “.EVA” and “.EVN”:

5. Now that we have filtered the files, let’s clean up a bit removing all of the unnecessary columns (yes, we can remove the filtered column without losing our filter condition, don’t worry, it’s all stored in memory ;)):

6. This leaves us with the Content columns which, by default, stores the content of each file in a binary column. We can expand each row in each file by simply clicking the double-down arrow in the right corner of the Content column:

7. Voila! Now we have a table with every record… from every file. Next, we need to separate out the record type column by delimiting by the leftmost comma:

8. Set your delimiter parameters as such:

9. Since we’re trying to build some referential integrity into our data set, we need to add a new column that will explicitly link all of our “info” and “version” records types to the proper game id. To do that, we’ll need to click on Insert Custom Column from the Home ribbon and fill in the following formula (if [Column1.1] = “id” then [Column1.2] else null):

10. It’s important to specify “null” for the non “id” records to allow the FillDown function to work it’s magic. Click on the Fill Down menu item:

11. Voila again. If we scroll down the file, we can see that the Fill Down function fills null values with the last non empty value in that column sorted from 1st record to last:

Now you can proceed with loading a more properly structured data warehouse.

Here is the “M” language formula code that was generated by our user interface actions:

let
Source = Folder.Files(“D:\Data\RetroSheet\RawData”),
FilteredRows = Table.SelectRows(Source, each ([Extension] = “.EDA” or [Extension] = “.EDN” or [Extension] = “.EVA” or [Extension] = “.EVN”)),
RemovedColumns = Table.RemoveColumns(FilteredRows,{“Name”, “Extension”, “Date accessed”, “Date modified”, “Date created”, “Attributes”, “Folder Path”}),
CombinedBinaries = Binary.Combine(RemovedColumns[Content]),
ImportedText = Table.FromColumns({Lines.FromBinary(CombinedBinaries,null,null,1252)}),
SplitColumnDelimiter = Table.SplitColumn(ImportedText,”Column1″,Splitter.SplitTextByEachDelimiter({“,”}, null, false),{“Column1.1”, “Column1.2”}),
ChangedType = Table.TransformColumnTypes(SplitColumnDelimiter,{{“Column1.1”, type text}, {“Column1.2”, type text}}),
InsertedCustom = Table.AddColumn(ChangedType, “Custom”, each if [Column1.1] = “id” then [Column1.2] else null),
FillDown = Table.FillDown(InsertedCustom,{“Custom”})
in
FillDown

Cheers,
Nick.