Education data often come in annual snapshots. Each year, students are able to identify anew, and while student identification numbers may stay the same, names, race, and gender can often change. Sometimes, even data that probably should not change, like a date of birth, is altered at some point. While I could spend all day talking about data collection processes and automated validation that should assist with maintaining clean data, most researchers face multiple characteristics per student, unsure of which one is accurate.
While it is true that identity is fluid, and sex/gender or race identifications are not inherently stable overtime, it is often necessary to “choose” a single value for each student when presenting data. The Strategic Data Project does a great job of defining the business rules for these cases in its diagnostic toolkits.
If more than one [attribute value is] observed, report the modal [attribute value]. If multiple modes are observed, report the most recent [attribute value] recorded.
This is their rule for all attributes considered time-invariant for analysis purposes. I think it is a pretty good one.
Implementing this rule turned out to be more complex than it appeared using R, especially with performant code. In fact, it was this business rule that led me to learn how to use the
First, I developed a small test set of data to help me make sure my code accurately reflected the expected results based on the business rule:
The test data generated by that code looks like this:
And the results should be:
My first attempts at solving this problem using
data.table resulted in a pretty complex set of code.
This approached seemed “natural” in
data.table, although it took me a while to refine and debug since it was my first time using the package 1. Essentially, I use
rle, a nifty function I used in the past for my Net-Stacked Likert code to count the number of instances of an attribute each student had in their record. I then subset the data to only the max count value for each student and merge these values back to the original data set. Then I order the data by student id and year in order to select only the last observation per student.
I get a quick, accurate answer when I run the test data through this function. Unfortunately, when I ran the same code on approximately 57,000 unique student IDs and 211,000 total records, the results were less inspiring. My Macbook Air’s fans spin up to full speed and timings are terrible:
Data cleaning tasks like this one are often only run a few times. Once I have the attributes I need for my analysis, I can save them to a new table in a database, CSV, or similar and never run it again. But ideally, I would like to be able to build a document presenting my data completely from the raw delivered data, including all cleaning steps, accurately. So while I may use a cached, clean data set for some the more sophisticated analysis while I am building up a report, in the final stages I begin running the entire analyses process, including data cleaning, each time I produce the report.
With the release of
dplyr, I wanted to reexamine this particular function because it is one of the slowest steps in my analysis. I thought with fresh eyes and a new way of expressing R code, I may be able to improve on the original function. Even if its performance ended up being fairly similar, I hoped the
dplyr code would be easier to maintain since I frequently use
dplyr and only turn to
data.table in specific, sticky situations where performance matters.
In about a tenth the time it took to develop the original code, I came up with this new function:
At least to my eyes, this code is far more expressive and elegant. First, I generate a
data.frame with only the rows that have the most common attribute per student by grouping on student and attribute, counting the size of those groups, and filtering to most common group per student. Then, I do join on the original data and remove any records without a count from the previous step, finding the maximum count per student ID. This recovers the year value for each of the students so that in the next step I can just choose the rows with the highest year.
There are a few funky things (note the use of regroup and grouping, which are related to
dplyr’s poor handling of strings as arguments), but for the most part I have shorter, clearer code that closely resembles the plain-English stated business rule.
But was this code more performant? Imagine my glee when this happened:
That is a remarkable increase in performance!
Now, I realize that I may have cheated. My
data.table code isn’t very good and could probably follow a pattern closer to what I did in
dplyr. The results might be much closer in the hands of a more adept developer. But the take home message for me was that
dplyr enabled me to write the more performant code naturally because of its expressiveness. Not only is my code faster and easier to understand, it is also simpler and took far less time to write.
It is not every day that a tool provides powerful expressiveness and yields greater performance.
I have made some improvements to this function to simplify things. I will be maintaining this code in my PPSDCollegeReadiness repository.
It was over a year ago that I first wrote this code. ↩︎