Jason Becker
November 1, 2014

November marks the start of National Novel Writing Month (NaNoWriMo). The quick version is folks band together and support each other to write 50,000 words in November.

I would love to write a novel one day. I am not sure I could do it well, but I am pretty sure I could hit 50,000-80,000 words if I dedicated time to tell a story.

I don’t have a story to tell.

So this year, I have decided to not feel guilty about skipping out on another NaNoWriMo (always the reader, never the author), and instead I am modifying it to meet my needs. With no story to tell and no experience tackling a single project the size of a novel, I am going to tackle a smaller problem– this blog.

Instead of 50,000 words in 30 days, I am going to try to write 1,000 words a day for the next four weeks. I will not hold myself to a topic. I will not even hold myself to non-fiction. I will not hold myself to a number of posts or the size of the posts I write. I will not even hold myself to a true daily count, instead reviewing where I stand at the end of each week.

I am hoping that the practice of simply writing will grease my knuckles and start the avalanche that leads to writing more. A small confession– I write two or three blog posts every week that never leave my drafts. I find myself unable to hit publish because the ideas tend to be far larger or far smaller than I anticipate when I set out to write and share my frustrations. I also get nervous, particularly when writing about things I do professionally, about not writing the perfect post that’s clear, heavily researched, and expresses my views definitively and completely. This month, I say goodbye to that anxiety and start simply hitting publish.

I will leave you with several warnings.

  1. Things might get topically wacky. I might suddenly become a food blogger, or write about more personal issues, or write a short story and suddenly whiplash to talking about programming, education policy, or the upcoming election. If high volume, random topics aren’t your thing, you should probably unsubscribe from my RSS feed and check back in a month.
  2. I might write terrible arguments that are poorly supported and don’t reflect my views. This month, I will not accept my most common excuses for not publishing, which boil down to fear that people will hold me to the views I express in my first-draft thinking. I am going to make mistakes this month in public and print the dialog I am having with myself. The voices I allow room to speak as I struggle with values, beliefs, and opinions may shock and offend. This month, this blog is my internal dialog. Please read it as a struggle, however definitive the tone.
  3. I am often disappointed that the only things I publish are smaller ideas written hastily with poor editing. Again, this month I embrace the reality that almost everything I write that ends up published is the result of 20 minutes of furious typing with no looking back, rather than the work of a strong writer with a strong viewpoint and strong support.

I hope that by the end of this month I will have written at least a couple of pieces I feel proud of, and hopefully, I will have a little less fear of hitting publish in the future.

October 5, 2014

A terrible thing is happening this year. Women all across the internet are finding themselves the target of violence, simply for existing. Women are being harassed for talking about video games, women are being harassed for talking about the technology industry, women are being harassed for talking, women are being harassed.

A terrible thing is happening. Women are finding themselves the target of violence.

A terrible thing has always happened.


I remember being a 16 year old posting frequently on internet forums. One in particular focused on guitar equipment. I loved playing in a band, and I loved the technology of making guitar sounds. Many people on the forum were between 16 and 24, although it was frequented by quite a few “adults” in their 30s, 40s, and 50s. It was a wonderful opportunity to interact as an adult, with adults.

Every week members created a new thread where they posted hundreds of photos of women. Most of them were professional photographs taken at various night clubs as patrons entered. Some were magazine clippings or fashion modeling. I remember taking part, both in gazing and supplying the occasional photograph from the internet. We were far from the early days of the world wide web, this being around 2003, but this was also before social media matured and online identity was well understood by the general public.

This thread became controversial. A change from private to corporate ownership of this forum led to increased moderation, and the weekly post with photos of women was one of the targets.

I did not understand.

In the debates about the appropriateness of the content and its place within our online community, I took the side of those who wanted the post to remain alive. I was not its most ardent supporter, nor was I moved to some of the extremes in language and entitlement that typically surround these conversations. However, my views were clear and easy. These were public photographs, largely taken with permission (often for compensation). And, of course, none of the pictures were pornographic.

Appropriateness for me at 16 was defined by pornography. I did not understand.


My parents did not raise me to be misogynist. One of the most influential moments in my life came on a car ride to the dentist. I was also around 16 or 17. I think it was on my way to get my wisdom teeth removed. I had been dating the same girl for a while, and it was time for my father to give me the talk. All he said to me was, “Women deserve your respect.”

That was it.


We were in college, and my friends and I were all internet natives. We had used the web for over ten years. We grew up in AOL chatrooms and forums. The backwaters of the internet at this time shifted from Something Awful to 4Chan. This was the height of some of the most prolific and hilarious memes: lolcats, Xzibit, advice dogs (a favorite was bachelor frog, which seemed to understand our worst impulses expressed in only modest exaggeration).

There was also violence.

It was not uncommon to see names, phone numbers, and addresses that 4chan was supposed to harass because someone said so. Various subcultures seemed to be alternately mocked and harassed endlessly in the very place that had first embraced, supported, and connected people under the guise of radical anonymity. The most famous of the “Rules of the Internet” was Rule 34 – if you can think of it, there is porn of it – and its follow-up, Rule 35 – if you cannot find porn of it, you should make it. 4chan seemed determined to make this a reality. But really, the most troublesome thing was the attitude toward women. Nothing was as unacceptable to 4chan as suggesting that women are anything but objects for the male gaze. In a place sometimes filled with radically liberal (if more left-libertarian than left-progressive) politics that would spawn groups like Anonymous, nothing brought out as much criticism as suggesting our culture has a problem with women.

My response was largely to fade from this part of the internet. I had only reached the point of being uncomfortable with this behavior. It would take more time for me to understand. It still felt like this was a problem of ignorant people.


I am rarely jealous of intelligence. I am rarely jealous of wealth. I am rarely jealous of experiences. What I am most often jealous of is what seems to me to be a preternatural maturity of others, particularly around issues of ethics and human rights.

Fully grappling with privilege is not something that happens in a moment; it is a sensitivity to be developed over a lifetime. We are confronted with media that builds and reinforces a culture that is fundamentally intolerant and conservative. There are countless microaggressions that are modeled everywhere for our acceptance as normal. It has taken me a decade of maturation, hard conversations, and self-examination to only begin to grow from being fully complicit in and participating in the objectification of women to what I would now consider the most basic level of human decency.

The internet has gone from enabling my own aggression toward women to exposing me to a level of misogyny and violence that deeply disturbs and disgusts me, shattering any notion that my past offenses were harmless or victimless. The ugly underside of our culture is constantly on display, making it all the more obvious how what felt like isolated events on the “ok” side of the line were actually creating a space that supported and nurtured the worst compulsions of men.


I often think about my own journey when I see disgusting behavior on the internet. I wonder whether I am facing a deeply ugly person or myself at 16. I try to parse the difference between naïvety, ignorance, and hate and to understand whether they each require a unique response.

Mostly, I struggle with what would happen if Jason Today spoke to Jason 16.

Jason 16 could not skip over a decade of growth simply for having met Jason Today. It took me conversations with various folks playing the role of Jason Today over and over again, year after year. I wish I believed there was another way to reach the Jason 16s out there. I wish I knew how to help them become preternaturally aware of their actions. All I know how to do is try to be compassionate to those who hate while firmly correcting, try to meet the heightened expectations I place on myself, try to apologize when I need to, and try to support those that seem more equipped to push the conversation forward.

Along this path, I never leapt to agreement so much as paused. Each time I heard a convincing point, I paused and considered. Growth came in a series of all too brief pauses.

Pauses are often private and quiet, their discoveries never on direct display.

If pauses are the best anyone can expect, then working to change our culture of violence toward women will rarely feel like much more than shouting at the void.

June 12, 2014

The Vergara v. California case has everyone in education talking. Key teacher tenure provisions in California are on the ropes, presumably because of the disparate impact on teacher, and therefore education, quality for students who are less fortunate.

I have fairly loosely held views about the practice of tenure itself and the hiring and firing of teachers. However, I have strongly held views that unions made a mistake with their efforts to move a lot of rules about the teaching labor market into state laws across the country. Deep rules and restrictions are better left to contracts, even from a union perspective. At worst, these things should be a part of regulation, which can be more easily adapted and waived.

That said, here are a collection of interesting thoughts on tenure post-Vergara:

John Merrow, reacting to Vergara:

Tenure and due process are essential, in my view, but excessive protectionism (70+ steps to remove a teacher?) alienates the general public and the majority of effective teachers, particularly young teachers who are still full of idealism and resent seeing their union spend so much money defending teachers who probably should have been counseled out of the profession years ago.

With the modal ‘years of experience’ of teachers dropping dramatically, from 15 years in 1987 to 1 or 2 years today, young teachers are a force to be reckoned with. If a significant number of them abandon the familiar NEA/AFT model, or if they develop and adopt a new form of teacher unionism, public education and the teaching profession will be forever changed.

San Jose Mercury News reporting on the state thwarting a locally negotiated change to tenure:

With little discussion, the board rejected the request, 7 to 2. The California Teachers Association, one of the most powerful lobbies in Sacramento, had opposed granting a two-year waiver from the state Education Code – even though one of the CTA’s locals had sought the exemption… …San Jose Teachers Association President Jennifer Thomas, whose union had tediously negotiated with the district an agreement to improve teacher evaluations and teaching quality, called the vote frustrating… San Jose Unified and the local teachers association sought flexibility to grant teachers tenure after one year or to keep a teacher on probation for three years.

The district argued that 18 months – the point in a teacher’s career at which districts must make a tenure decision – sometimes doesn’t allow time to fairly evaluate a candidate for what can be a lifetime job.

Now, Thomas said, when faced with uncertainty over tenure candidates, administrators will err on the side of releasing them, which then leaves a stain on their records.

Kevin Welner summarizing some of the legal implications of Vergara:

Although I can’t help but feel troubled by the attack on teachers and their hard-won rights, and although I think the court’s opinion is quite weak, legally as well as logically, my intent here is not to disagree with that decision. In fact, as I explain below, the decision gives real teeth to the state’s Constitution, and that could be a very good thing. It’s those teeth that I find fascinating, since an approach like that used by the Vergara judge could put California courts in a very different role —as a guarantor of educational equality—than we have thus far seen in the United States… …To see why this is important, consider an area of education policy that I have researched a great deal over the years: tracking (aka “ability grouping”). There are likely hundreds of thousands of children in California who are enrolled in low-track classes, where the expectations, curricula and instruction are all watered down. These children are denied equal educational opportunities; the research regarding the harms of these low-track classes is much stronger and deeper than the research about teachers Judge Treu found persuasive in the Vergara case. That is, plaintiffs’ attorneys would easily be able to show a “real and appreciable impact” on students’ fundamental right to equality of education. Further, the harm from enrollment in low-track classes falls disproportionately on lower-income students and students of color. (I’ll include some citations to tracking research from myself and others at the end of this post.)

Welner also repeats a common refrain from the education-left that tenure and insulating teachers from evaluations is critical for attracting quality people into the teaching profession. This is an argument that the general equilibrium impact on the broader labor market is both larger in magnitude and in the opposite direction of any assumed positive impacts from easier dismissal of poor performing teachers:

This more holistic view is important because the statutes are central to the larger system of teacher employment. That is, one would expect that a LIFO statute or a due process statute or tenure statute would shape who decides to become a teacher and to stay in the profession. These laws, in short, influence the nature of teaching as a profession. The judge here omits any discussion of the value of stability and experience in teaching that tenure laws, however imperfectly, were designed to promote in order to attract and retain good teachers. By declining to consider the complexity of the system, the judge has started to pave a path that looks more narrowly at defined, selected, and immediate impact—which could potentially be of great benefit to future education rights plaintiffs.

Adam Ozimek of Modeled Behavior:

I can certainly imagine it is possible in some school districts they will find it optimal to fire very few teachers. But why isn’t it enough for administrators to simply rarely fire people, and for districts to cultivate reputations as places of stable employment? One could argue that administrators can’t be trusted to actually do this, but such distrust of administrators brings back a fundamental problem with this model of public education: if your administrators are too incompetent to cultivate a reputation that is optimal for student outcomes then banning tenure is hardly the problem, and imposing tenure is hardly a solution. This is closely related to a point I made yesterday: are we supposed to believe administrators fire sub-optimally but hire optimally

His piece from today (and this one from yesterday) argues that Welner’s take could be applied to just about any profession, and furthermore, requires accepting a far deeper, more fundamental structural problem in education that should be unacceptable. If administrators would broadly act so foolishly as to decimate the market for quality teaching talent and be wholly unable to successfully staff their schools, we have far bigger problems. And, says Ozimek, there is no reason to believe that tenure is at all a response to this issue.

Dana Goldstein would likely take a more historical view on the usefulness of tenure against administrator abuse.

But, writing for The Atlantic, she focuses instead on tenure as a red herring:

The lesson here is that California’s tenure policies may be insensible, but they aren’t the only, or even the primary, driver of the teacher-quality gap between the state’s middle-class and low-income schools. The larger problem is that too few of the best teachers are willing to work long-term in the country’s most racially isolated and poorest neighborhoods. There are lots of reasons why, ranging from plain old racism and classism to the higher principal turnover that turns poor schools into chaotic workplaces that mature teachers avoid. The schools with the most poverty are also more likely to focus on standardized test prep, which teachers dislike. Plus, teachers tend to live in middle-class neighborhoods and may not want a long commute.

May 19, 2014

I have never found dictionaries or even a thesaurus particularly useful as part of the writing process. I like to blame this on my lack of creative, careful writing.

But just maybe, I have simply been using the wrong dictionaries. It is hard not to be seduced by the seeming superiority of Webster’s original style. A dictionary that is one-part explanatory and one-part exploratory provides a much richer experience of English as an enabler of ideas that transcend meager vocabulary.

May 12, 2014

I had never thought of a use for Brett Terpstra’s Marky the Markdownifier before listening to today’s Systematic. Why would I want to turn a webpage into Markdown?

When I heard that Marky has an API, I was inspired. Pinboard has a “description” field that allows up to 65,000 characters. I never know what to put in this box. Wouldn’t it be great to put the full content of the page in Markdown into this field?

I set out to write a quick Python script to:

  1. Grab recent Pinboard links.
  2. Check to see if the URLs still resolve.
  3. Send the link to Marky and collect a Markdown version of the content.
  4. Post an updated link to Pinboard with the Markdown in the description field.

If all went well, I would release this script on Github as Pindown, a great way to put Markdown page content into your Pinboard links.

The script below is far from well-constructed. I would have spent more time cleaning it up with things like better error handling and a more complete CLI to give more granular control over which links receive Markdown content.

Unfortunately, I found that Pinboard consistently returns a 414 error code because the URLs are too long. Why is this a problem? Pinboard, in an attempt to maintain compatibility with the del.icio.us API, uses only GET requests, whereas this kind of request would typically use a POST endpoint. As a result, I cannot send along a data payload.

So I’m sharing this just for folks who are interested in playing with Python, RESTful APIs, and Pinboard. I’m also posting it for my own posterity since a non-del.icio.us-compatible version 2 of the Pinboard API is coming.

import requests
import json
import yaml


def getDataSet(call):
  r = requests.get('https://api.pinboard.in/v1/posts/recent' + call)
  data_set = json.loads(r.text)
  return data_set

def checkURL(url=""):
  newurl = requests.get(url)
  if newurl.status_code == 200:
    return newurl.url
  else:
    raise ValueError('URL did not resolve', newurl.status_code)

def markyCall(url=""):
  # Marky the Markdownifier returns the page content converted to Markdown.
  r = requests.get('http://heckyesmarkdown.com/go/?u=' + url)
  return r.text

def process_site(call):
  data_set = getDataSet(call)
  processed_site = []
  errors = []
  for site in data_set['posts']:
    try:
      url = checkURL(site['href'])
    except ValueError:
      errors.append(site['href'])
      continue
    description = markyCall(url)
    site['extended'] = description
    processed_site.append(site)
  print(errors)
  return processed_site

def write_pinboard(site, auth_token):
  stem = 'https://api.pinboard.in/v1/posts/add?format=json&auth_token='
  payload = {}
  payload['url'] = site.get('href')
  payload['description'] = site.get('description', '')
  payload['extended'] = site.get('extended', '')
  payload['tags'] = site.get('tags', '')
  payload['shared'] = site.get('shared', 'no')
  payload['toread'] = site.get('toread', 'no')
  r = requests.get(stem + auth_token, params=payload)
  print(site['href'] + '\t\t' + str(r.status_code))

def main():
  with open('AUTH.yaml') as settings:
    identity = yaml.safe_load(settings)
  auth_token = identity['user_name'] + ':' + identity['token']
  valid_sites = process_site('?format=json&auth_token=' + auth_token)
  for site in valid_sites:
    write_pinboard(site, auth_token)

if __name__ == '__main__':
  main()

April 1, 2014

I frequently work with private data. Sometimes, it lives on my personal machine rather than on a database server. Sometimes, even if it lives on a remote database server, it is better that I use locally cached data than query the database each time I want to do analysis on the data set. I have always dealt with this by creating encrypted disk images with secure passwords (stored in 1Password). This is a nice extra layer of protection for private data served on a laptop, and it adds little complication to my workflow. I just have to remember to mount and unmount the disk images.

However, it can be inconvenient from a project perspective to refer to data in a distant location like /Volumes/ClientData/Entity/facttable.csv. In most cases, I would prefer the data “reside” in data/ or cache/ “inside” of my project directory.

Luckily, there is a great way that allows me to point to data/facttable.csv in my R code without actually having facttable.csv reside there: symlinking.

A symlink is a symbolic link file that sits in the preferred location and references the file path to the actual file. This way, when I refer to data/facttable.csv the file system knows to direct all of that activity to the actual file in /Volumes/ClientData/Entity/facttable.csv.

From the command line, a symlink can be generated with a simple command:

ln -s target_path link_path

R offers a function that does the same thing:

file.symlink(target_path, link_path)

where target_path and link_path are both strings surrounded by quotation marks.

One of the first things I do when setting up a new analysis is add common data storage file extensions like .csv and .xls to my .gitignore file so that I do not mistakenly put any data in a remote repository. The second thing I do is set up symlinks to the mount location of the encrypted data.
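As a rough sketch of that setup in R (the paths are the illustrative ones from above, and the exact .gitignore entries will vary by project):

# Keep raw data files out of the repository (entries are examples).
cat("*.csv\n*.xls\n", file = ".gitignore", append = TRUE)

# Point data/facttable.csv at the file on the mounted encrypted image.
file.symlink("/Volumes/ClientData/Entity/facttable.csv", "data/facttable.csv")

# Analysis code can now read from the project-local path.
facts <- read.csv("data/facttable.csv")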

March 9, 2014

Education data often come in annual snapshots. Each year, students are able to identify anew, and while student identification numbers may stay the same, names, race, and gender can often change. Sometimes, even data that probably should not change, like a date of birth, is altered at some point. While I could spend all day talking about data collection processes and automated validation that should assist with maintaining clean data, most researchers face multiple characteristics per student, unsure of which one is accurate.

While it is true that identity is fluid, and sex/gender or race identifications are not inherently stable over time, it is often necessary to “choose” a single value for each student when presenting data. The Strategic Data Project does a great job of defining the business rules for these cases in its diagnostic toolkits.

If more than one [attribute value is] observed, report the modal [attribute value]. If multiple modes are observed, report the most recent [attribute value] recorded.

This is their rule for all attributes considered time-invariant for analysis purposes. I think it is a pretty good one.

Implementing this rule in R turned out to be more complex than it appeared, especially in performant code. In fact, it was this business rule that led me to learn how to use the data.table package.

First, I developed a small test set of data to help me make sure my code accurately reflected the expected results based on the business rule:

# Generate test data for modal_person_attribute().
modal_test <- data.frame(sasid = c('1000', '1001', '1000', '1000', '1005', 
                                   '1005', rep('1006', 4)),
                         race = c('Black', 'White', 'Black', 'Hispanic',
                                  'White', 'White', rep('Black', 2), 
                                  rep('Hispanic', 2)),
                         year = c(2006, 2006, 2007, 2008,
                                  2010, 2011, 2007, 2008,
                                  2010, 2011))

The test data generated by that code looks like this:

sasid race year
1000 Black 2006
1001 White 2006
1000 Black 2007
1000 Hispanic 2008
1005 White 2010
1005 White 2011
1006 Black 2007
1006 Black 2008
1006 Hispanic 2010
1006 Hispanic 2011

And the results should be:

sasid race
1000 Black
1001 White
1005 White
1006 Hispanic

My first attempts at solving this problem using data.table resulted in a pretty complex set of code.

# Calculate the modal attribute using data.table
modal_person_attribute_dt <- function(df, attribute){
  # df: rbind of all person tables from all years
  # attribute: vector name to calculate the modal value
  # Calculate the number of instances an attribute is associated with an id
  dt <- data.table(df, key='sasid')
  mode <- dt[, rle(as.character(.SD[[attribute]])), by=sasid]
  setnames(mode, c('sasid', 'counts', as.character(attribute)))
  setkeyv(mode, c('sasid', 'counts'))
  # Only include attributes with the maximum values. This is equivalent to the
  # mode with two records when there is a tie.
  mode <- mode[,subset(.SD, counts==max(counts)), by=sasid]
  mode[,counts:=NULL]
  setnames(mode, c('sasid', attribute))
  setkeyv(mode, c('sasid',attribute))
  # Produce the maximum year value associated with each ID-attribute 
  # pairing    
  setkeyv(dt, c('sasid',attribute))
  mode <- dt[,list(schoolyear=max(schoolyear)), by=c("sasid", attribute)][mode]
  setkeyv(mode, c('sasid', 'schoolyear'))
  # Select the last observation for each ID, which is equivalent to the highest
  # schoolyear value associated with the most frequent attribute.
  result <- mode[,lapply(.SD, tail, 1), by=sasid]
  # Remove the schoolyear to clean up the result
  result <- result[,schoolyear:=NULL]
  return(as.data.frame(result))
}

This approach seemed “natural” in data.table, although it took me a while to refine and debug since it was my first time using the package 1. Essentially, I use rle, a nifty function I used in the past for my Net-Stacked Likert code, to count the number of instances of an attribute each student had in their record. I then subset the data to only the max count value for each student and merge these values back to the original data set. Then I order the data by student id and year in order to select only the last observation per student.

I get a quick, accurate answer when I run the test data through this function. Unfortunately, when I ran the same code on approximately 57,000 unique student IDs and 211,000 total records, the results were less inspiring. My MacBook Air’s fans spun up to full speed and the timings were terrible:

> system.time(modal_person_attribute_dt(all_years, 'sex'))
 user  system elapsed 
 40.452   0.246  41.346 

Data cleaning tasks like this one are often only run a few times. Once I have the attributes I need for my analysis, I can save them to a new table in a database, a CSV, or similar and never run it again. But ideally, I would like to be able to build a document presenting my data completely from the raw delivered data, including all cleaning steps, accurately. So while I may use a cached, clean data set for some of the more sophisticated analysis while I am building up a report, in the final stages I begin running the entire analysis process, including data cleaning, each time I produce the report.

With the release of dplyr, I wanted to reexamine this particular function because it is one of the slowest steps in my analysis. I thought that with fresh eyes and a new way of expressing R code, I might be able to improve on the original function. Even if its performance ended up being fairly similar, I hoped the dplyr code would be easier to maintain since I frequently use dplyr and only turn to data.table in specific, sticky situations where performance matters.

In about a tenth the time it took to develop the original code, I came up with this new function:

modal_person_attribute <- function(x, sid, attribute, year){
  grouping <- lapply(list(sid, attribute), as.symbol)
  original <- x
  max_attributes <- x %.% 
                    regroup(grouping) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count))
  recent_max <- left_join(original, max_attributes) %.%
                regroup(list(grouping[[1]])) %.%
                filter(!is.na(count) & count == max(count))
  results <- recent_max %.% 
             regroup(list(grouping[[1]])) %.%
             filter(year == max(year))
  return(results[,c(sid, attribute)])
}

At least to my eyes, this code is far more expressive and elegant. First, I generate a data.frame with only the rows that have the most common attribute per student by grouping on student and attribute, counting the size of those groups, and filtering to the most common group per student. Then, I do a join on the original data and remove any records without a count from the previous step, finding the maximum count per student ID. This recovers the year value for each of the students so that in the next step I can just choose the rows with the highest year.

There are a few funky things (note the use of regroup and grouping, which are related to dplyr’s poor handling of strings as arguments), but for the most part I have shorter, clearer code that closely resembles the plain-English stated business rule.

But was this code more performant? Imagine my glee when this happened:

> system.time(modal_person_attribute(all_years, sid='sasid', 
+             attribute='sex', year='schoolyear'))
Joining by: c("sasid", "sex")
   user  system elapsed 
  1.657   0.087   1.852 

That is a remarkable increase in performance!

Now, I realize that I may have cheated. My data.table code isn’t very good and could probably follow a pattern closer to what I did in dplyr. The results might be much closer in the hands of a more adept developer. But the take home message for me was that dplyr enabled me to write the more performant code naturally because of its expressiveness. Not only is my code faster and easier to understand, it is also simpler and took far less time to write.

It is not every day that a tool provides powerful expressiveness and yields greater performance.

Update

I have made some improvements to this function to simplify things. I will be maintaining this code in my PPSDCollegeReadiness repository.

modal_person_attribute <- function(x, sid, attribute, year){
  # Select only the important columns
  x <- x[,c(sid, attribute, year)]
  names(x) <- c('sid', 'attribute', 'year')
  # Clean up years
  if(TRUE %in% grepl('_', x$year)){
    x$year <- gsub(pattern='[0-9]{4}_([0-9]{4})', '\\1', x$year)
  }  
  # Calculate the count for each person-attribute combo and select max
  max_attributes <- x %.% 
                    group_by(sid, attribute) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count)) %.%
                    select(sid, attribute)
  # Find the max year for each person-attribute combo
  results <- max_attributes %.% 
             left_join(x) %.%
             group_by(sid) %.%
             filter(year == max(year)) %.%
             select(sid, attribute)
  names(results) <- c(sid, attribute)
  return(results)
}
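
As a quick sanity check, running the updated function on the test data from earlier (still using the old %.% dplyr syntax) should reproduce the expected results table:

modal_person_attribute(modal_test, sid='sasid', attribute='race', year='year')
# Should return one row per student, matching the expected results above:
# 1000 Black, 1001 White, 1005 White, and 1006 Hispanic (the tie for 1006
# is broken by the most recent year, 2011).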

  1. It was over a year ago that I first wrote this code. ↩︎

February 26, 2014

We burden Latinos (and other traditionally underserved communities) with expensive housing because of the widespread practice of using homestead exemptions in Rhode Island. By lowering the real estate tax rate, typically by 50%, for owner occupied housing, we dramatically inflate the tax rate paid by Rhode Islanders who are renting.

Echoing a newly filed lawsuit in New York City over discriminatory real estate tax regimes, this new report emphasizes the racist incentives built into our property tax.

Homestead exemptions are built on the belief that renters are non-permanent residents of communities, care less for the properties they occupy and neighborhoods they live in, and are worse additions than homeowners. Frankly, it is an anti-White flight measure meant to assure people that only those with the means to purchase and the intent to stay will join their neighborhoods. Wealthy, largely White, property owners see homestead exemptions as fighting an influx of “slum lords”, which is basically the perception of anyone who purchases a home or builds apartments and rents them out.

Rather than encouraging denser communities with higher land utilization and more housing to reduce the cost of living in dignity, we subsidize low value (per acre) construction that maintains inflated housing costs.

Full disclosure: I own a condo in Providence and receive a 50% discount on my taxes. In fact, living in a condo Downcity, my home value is depressed because of the limited ways that I can use it. I could rent my current condo at market rate and lose money because of the doubling in taxes that I would endure versus turning a small monthly profit at the same rent with higher taxes. The flexibility to use my property as my own residence or as a rental unit more than pays for higher taxes.

So while I do have personal reasons to support removing the homestead exemption, even if I lived in a single family home on the East Side that was not attractive as a rental property, I would still think this situation is absurd. Homeowners’ taxes should easily be 20% higher to tax renters 30% less. Maybe some of our hulking, vacant infrastructure could be more viably converted into housing stock and lower the cost for all residents. Maybe we could even see denser development because there will actually be a market for renters at the monthly rates that would need to be charged to recoup expenses. At least the rent wouldn’t be so damn high for too many people of color and people living in or near poverty.

February 17, 2014

Hadley Wickham has once again1 made R ridiculously better. Not only is dplyr incredibly fast, but the new syntax allows for some really complex operations to be expressed in a ridiculously beautiful way.

Consider a data set, course, with a student identifier, sid, a course identifier, courseno, a quarter, quarter, and a grade on a scale of 0 to 4, gpa. What if I wanted to know the number of courses a student has failed over the entire year, as defined by having an overall grade of less than a 1.0?

In dplyr:

course %.% 
group_by(sid, courseno) %.%
summarise(gpa = mean(gpa)) %.%
filter(gpa <= 1.0) %.%
summarise(fails = n())
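
For concreteness, here is a tiny hypothetical course table (the column names come from the description above; the values are invented) that the pipeline could be run against:

# Hypothetical data: student A fails MATH1 overall (mean gpa of 0.75),
# so the pipeline should return a single row with sid A and fails = 1.
# Student B has no failing courses and drops out at the filter() step.
course <- data.frame(sid = c('A', 'A', 'A', 'A', 'B', 'B'),
                     courseno = c('MATH1', 'MATH1', 'ENG1', 'ENG1',
                                  'MATH1', 'MATH1'),
                     quarter = c(1, 2, 1, 2, 1, 2),
                     gpa = c(0.5, 1.0, 3.0, 4.0, 2.0, 3.0))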

I refuse to even sully this post with the way I would have solved this problem in the past.


  1. Seriously, how many of the packages he has managed/written are indispensable to using R today? It is no exaggeration to say that the world would have many more Stata, SPSS, and SAS users if not for Hadleyverse. ↩︎

February 9, 2014

These quotes are absolutely striking, in that they give a clear glimpse into the ideological commitments of the Republican Party. From Sen. Blunt and Rep. Cole, we get the revelation that— for conservatives— the only “work” worth acknowledging is wage labor. To myself, and many others, someone who retires early to volunteer— or leaves a job to care for their children— is still working, they’re just outside the formal labor market. And indeed, their labor is still valuable— it just isn’t compensated with cash.

One of the greatest benefits of wealth is that it can liberate people to pursue happiness. When we tie a basic need for living complete lives of dignity to full-time employment, people will find themselves willing to make many sacrifices to meet that need. In our nation of great wealth, with liberty and freedom as core values, it is hard to believe that the GOP would decry the liberating effect of ending the contingency of health care on work.

There is no work rule, regulation, or union that empowers workers more in their relationship with their employers than removing the threat of losing health care from the table. An increasingly libertarian right should be celebrating this as a key victory, rather than celebrate the existing coercive impact that health care has in our lives.

Republicans aren’t as worried as the idle rich, who— I suppose— have earned the right to avoid a life of endless toil. Otherwise— if Republicans really wanted everyone to work as much as possible— they’d support confiscatory tax rates. After all, nothing will drive an investment banker back to the office like the threat of losing 70 percent of her income to Uncle Sam.

Oh yeah, I forgot. For all their claims to loving liberty and freedom, what the GOP really stands for is protecting liberty and freedom for the existing “deserving” wealthy. They will fight tooth and nail to remove estate taxes because inheritance is a legitimate source of liberty. Removing the fear of entering a hospital uninsured after being unable to access preventive care is what deprives folks of “dignity”.

February 5, 2014

My Democracy Prep colleague Lindsay Malanga and I often say we should start an organization called the Coalition of Pretty Good Schools. We’d start with the following principles.

  1. Every child must have a safe, warm, disruption-free classroom as a non-negotiable, fundamental right.
  2. All children should be taught to read using phonics-based instruction.
  3. All children must master basic computational skills with automaticity before moving on to higher mathematics.
  4. Every child must be given a well-rounded education that includes science, civics, history, geography, music, the arts, and physical education.
  5. Accountability is an important safeguard of public funds, but must not drive or dominate a child’s education. Class time must not be used for standardized test preparation.

We have no end of people ready to tell you about their paradigmatic shift that will fix education overnight. There has been plenty of philosophizing about the goals, purpose, and means of education. Everyone is ready to pull out tropes about the “factory model” of education our system is built on.

The reality is that the education system too often fails at very basic delivery, period. I would love to see more folks draw a line in the sand of their minimum basic requirements, and not in an outrageous, political winky-wink where they are wrapping their ideal in the language of the minimum. Let’s have a deep discussion right now about the minimum basic requirements and let’s get relentless about making that happen without the distraction of the dream. Frankly, whatever your dream is, so long as it involves kids going somewhere to learn 1, if we can’t deliver on the basics it will be dead on arrival.


  1. Of course, for a group of folks who are engaged in Dreamschooling, we cannot take for granted that schools will be places or that children will be students in any traditional sense of the word. However, I believe that if we have a frank conversation about the minimum expectations for education I suspect this will not be a particularly widely held sentiment. If our technofuturism does complete its mindmeld with the anarcho-____ movements on the left and right to lead to a dramatically different conceptualization of childhood in the developed world in my lifetime… ↩︎

January 6, 2014

James over at TransportPVD has a great post today talking about a Salt Lake City ordinance that makes property owners responsible for providing a bond that funds the landscaping and maintenance of vacant lots left after demolition. I love this as much as he does and would probably add several other provisions (like forfeiting any tax breaks on that property or any other property in the city and potentially forfeiture of the property itself if a demolition was approved based on site plans that are not adhered to within a given time frame). Ultimately, I do think the best solution to surface parking where it doesn’t belong, of either the temporary or permanent (and isn’t it all actually permanent?) kind, is a land value tax.

James goes one step further and suggests that we should adopt some similar rules around ALL parking developments and proposes a few. His hope was that a mayoral candidate would chime in. For now, he will have to make do with me.

His recommendations are built somewhat specific to the commission looking at building a state-funded parking garage in front of the Garrahy Complex in Downcity, about which many urbanists and transit advocates have expressed reservations or outright rejection. They are:

  1. The garage is parking neutral. As many spots need to be removed from the downtown as are added.
  2. An added bonus would be if some of the spots removed were on-street ones, to create protected bike lanes or transit lanes with greenery separating them from car traffic.
  3. The garage has the proposed bus hub.
  4. There are ground-level shops.
  5. The garage is left open 24-hours so that it can limit the need for other lots (this happens when a garage is used only during the day, or only at night, instead of letting it serve both markets).
  6. Cars pay full market price to park.

(Note: I’ve numbered rather than kept the bullets of the original to make responding easier.)

I disagree with the first and second point, which are really one and the same. We are in a district that has tremendously underutilized land. We want that space to be developed, and as a result of that development we expect there to be a much increased need for transit capacity. The goal should be both to increase accessibility and to increase the share of transit capacity offered by walking, biking, or riding a bus or light rail. This does not require that we demand a spot-for-spot trade when building a public garage. I agree with the sentiment but disagree with the degree. Part of building rules and policies like this is to ensure comprehensive consideration of the transit context when developing parking. I see no reason to a priori assume that garages should only be permitted if they eliminate the same number of spaces they create.

The reason I combine these two points is because the city does not have the ability to remove off-street parking that is not publicly owned. Investing in garages with smaller footprints that have to be built taller and provide no change in capacity probably makes no sense at all. If we’re going to build any kind of public garage at all, it should be with the goal of consolidating parking into infrastructure with reasonable land utilization. We would rather have 3 or 4 large, properly located garages than all of the current lots. Limiting their size because of the flexibility available from reducing on-street parking or the footprint of existing lots doesn’t achieve that and doesn’t factor in the orders-of-magnitude changes in capacity we should expect to need for all transit modes in the next 20 years.

On point three, I am skeptical. I like the idea of improving bus infrastructure when building parking infrastructure in general. In fact, I voted against the \$40M Providence road paving bond even though that was much needed maintenance. My rationale was purely ideological– we should not use debt to pay for car maintenance without also investing in ways to reduce future maintenance costs through better utilization of those roads. However, I have a hard time believing that the Garrahy location is any good as a bus hub. If RIPTA did a great job identifying the need for an additional bus hub that the Garrahy location met the criteria for, I think it’s a reasonable idea. Short of that, it feels like throwing the transit community a wasteful bone.

I mostly agree on point four, but I doubt it works at the scale James would like to see. I think an appropriate level is probably not that different from the recently erected Johnson and Wales garage. The reality is that street-level retail is the right form, but there isn’t sufficient foot traffic to support it right now and won’t be for some time. There has to be street-level activation of any garage built in this area, but the square footage is likely fairly timid.

I absolutely agree with point five, without qualification. Not a dime should be spent on a public parking spot that is closed at any point in time, anywhere in the city. I would actually ditto this for surface parking lots on commercial properties of any kind after business hours. Not only should they have to be open, they should have to provide signs indicating the hours of commercial activity when parking is restricted and the hours when parking is available to the public. These hours of operations should require board approval. Owners could choose to charge during these off hours, but cars must be able to access the lot.

And point six should be a given for any public parking.

The real problem with Garrahy, in my opinion, is the cost is absurd, likely to be at least \$35,000 per space. There is plenty of existing parking, suggesting the demand right now is illusory and market rate for those spots right now means the investment is unlikely to ever be recovered. In a world with limited capacity for government spending on transit as a public good, I would rather subsidize transit infrastructure that benefits the poor and directly impacts the share of non-car transit as it increases capacity. Spending limited funds on parking infrastructure is ludicrous when demand isn’t sufficient to recover the investment. We already more than sufficiently subsidize parking in the area. And of course, the “study commission” is not really a study– it’s a meeting convened by those who want the project to happen putting the required usual suspects in the room to tepidly rubber stamp it. At least that’s my cynical take.

December 9, 2013

We find that public schools offered practically zero return education on the margin, yet they did enjoy significant political and financial support from local political elites, if they taught in the “right” language of instruction.

One thing that both progressives and libertarians agree upon is that the social goals of education are woefully underappreciated and underconsidered in the current school reform discussion. Both school choice and local, democratic control of schools are reactions to centralization resulting in “elites… [selecting] the ‘right’ language of instruction.”

I am inclined to agree with neither.

December 3, 2013

Update

Turns out the original code below was pretty messed up. All kinds of little errors I didn’t catch. I’ve updated it below. There are a lot of options to refactor this further that I’m currently considering. Sometimes it is really hard to know just how flexible something this big really should be. I think I am going to wait until I start developing tests to see where I land. I have a feeling moving toward a more test-driven work flow is going to force me toward a different structure.

I recently updated the function I posted about back in June that calculates the difference between two dates in days, months, or years in R. It is still surprising to me that difftime can only return units from seconds up to weeks. I suspect this has to do with the challenge of properly defining a “month” or “year” as a unit of time, since these are variable.

While there was nothing wrong with the original function, it did irk me that it always returned an integer. In other words, the function returned only complete months or years. If the start date was 2012-12-13 and the end date was 2013-12-03, the function would return 0 years. Most of the time, this is the behavior I expect when calculating age. But it is completely reasonable to want to include partial years or months, e.g. in the aforementioned example returning 0.9724605.
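
Those same dates show why difftime alone cannot get there:

# difftime() tops out at weeks; 'months' and 'years' are not valid units.
difftime(as.Date('2013-12-03'), as.Date('2012-12-13'), units='weeks')
## Time difference of 50.71429 weeks
# difftime(as.Date('2013-12-03'), as.Date('2012-12-13'), units='months')  # errors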

So after several failed attempts because of silly errors in my algorithm, here is the final code. It will be released as part of eeptools 0.3, which should be available on CRAN soon 1.

age_calc <- function(dob, enddate=Sys.Date(), units='months', precise=TRUE){
  if (!inherits(dob, "Date") | !inherits(enddate, "Date")){
    stop("Both dob and enddate must be Date class objects")
  }
  start <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)
  if(precise){
    start_is_leap <- ifelse(start$year %% 400 == 0, TRUE, 
                        ifelse(start$year %% 100 == 0, FALSE,
                               ifelse(start$year %% 4 == 0, TRUE, FALSE)))
    end_is_leap <- ifelse(end$year %% 400 == 0, TRUE, 
                        ifelse(end$year %% 100 == 0, FALSE,
                               ifelse(end$year %% 4 == 0, TRUE, FALSE)))
  }
  if(units=='days'){
    result <- difftime(end, start, units='days')
  }else if(units=='months'){
    months <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='months', SIMPLIFY=FALSE), 
                     length) - 1
    # length(seq(start, end, by='month')) - 1
    if(precise){
      # Check the leap-year case first so that February can be 29 days.
      month_length_end <- ifelse(end$mon==1 & end_is_leap, 29,
                                 ifelse(end$mon==1, 28,
                                        ifelse(end$mon %in% c(3, 5, 8, 10), 
                                               30, 31)))
      month_length_prior <- ifelse((end$mon-1)==1 & start_is_leap, 29,
                                   ifelse((end$mon-1)==1, 28,
                                          ifelse((end$mon-1) %in% c(3, 5, 8, 
                                                                    10), 
                                                 30, 31)))
      month_frac <- ifelse(end$mday > start$mday,
                           (end$mday-start$mday)/month_length_end,
                           ifelse(end$mday < start$mday, 
                            (month_length_prior - start$mday) / 
                                month_length_prior + 
                                end$mday/month_length_end, 0.0))
      result <- months + month_frac
    }else{
      result <- months
    }
  }else if(units=='years'){
    years <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='years', SIMPLIFY=FALSE), 
                     length) - 1
    if(precise){
      start_length <- ifelse(start_is_leap, 366, 365)
      end_length <- ifelse(end_is_leap, 366, 365)
      year_frac <- ifelse(start$yday < end$yday,
                          (end$yday - start$yday)/end_length,
                          ifelse(start$yday > end$yday, 
                                 (start_length-start$yday) / start_length +
                                end$yday / end_length, 0.0))
      result <- years + year_frac
    }else{
      result <- years
    }
  }else{
    stop("Unrecognized units. Please choose years, months, or days.")
  }
  return(result)
}
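
Running it on the example dates from earlier in the post shows the difference the precise option makes:

age_calc(as.Date('2012-12-13'), as.Date('2013-12-03'), units='years')
## about 0.9724605 years
age_calc(as.Date('2012-12-13'), as.Date('2013-12-03'), units='years', precise=FALSE)
## 0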

  1. I should note that my mobility function will also be included in eeptools 0.3. I know I still owe a post on the actual code, but it is such a complex function I have been having a terrible time trying to write clearly about it. ↩︎

December 2, 2013

PISA Results

I wanted to call attention to these interesting PISA results. It turns out that student anxiety in the United States is lower than the OECD average and belief in ability is higher 1. I thought that all of the moves in education since the start of standards-based reform were supposed to be generating tremendous anxiety and failing to produce students with a high sense of self-efficacy?

It is also worth noting that students in the United States were more likely to skip out on school, and this had a higher than typical impact on student performance. One interpretation of this could be that students are less engaged, but also that schooling activities do have a large impact on students rather than schools being of lesser importance than student inputs.

I have always had a hard time reconciling the calls for higher teacher pay and better work conditions and evidence that missing even just 10% of schooling has a huge impact on student outcomes with the belief that addressing other social inequities is the key way to achieve better outcomes for kids.

This is all an exercise in nonsense. It is incredibly difficult to transfer findings from surveys across dramatic cultural differences. It is also hard to imagine what can be learned about the delivery of education in the dramatically different contexts that exist. The whole international comparison game seems like one big Rorschach test where the price of admission is leaving any understanding of culture, context, and external validity at the door.

P.S.: The use of color in this visualization is awful. There is a sense that they are trying to be “value neutral” with data that is ordinal in nature (above, same, or below), and in doing so chose two colors that are very difficult to distinguish between. Yuck.


  1. The site describes prevalence of anxiety as, “proportion of students who feel helpless when faced with math problems” and belief in ability as, “proportion of students who feel confident in their math abilities”. Note, based on these definitions, one might also think that either curricula were not so misaligned with international benchmarks or that we are already seeing the fruits of a partial transition to Common Core. Not knowing the trend for this data, or some of the specifics about the collection instrument, makes that difficult to assess. ↩︎

November 22, 2013

Although it clocks in at 40+ pages, this is a worthwhile and relatively fast read for anyone in education policy who cares about the future of assessment if we’re serious about college and career readiness. There is a ton to unpack, with a fair amount I agree with and a lot I am quite a bit less sure about.

I think this paper is meant for national and state level policy-makers, and so my major quibble is I think this is much more valuable for a district-level audience. I am less bullish on the state’s role in building comprehensive assessment systems. That’s just my initial reaction.

The accountability section is both less rich and less convincing than the assessment portion. I have long heard cries for so-called reciprocal accountability, but it is still entirely unclear to me what this means, what it looks like, and what the implications are for current systems.

November 20, 2013

“We are trying to work towards late-exit ELL programs so (students) can learn the concepts in (their) native language,” Lusi said. Administrative goals have recently shifted to a focus on proficiency in both languages because bilingual education is preferred, she added.

But instituting district-wide bilingual education would require funding to hire teachers certified in both languages and to buy dual-language materials, she said.

I am pretty sure this is new. I am surprised there has not been a stronger effort to pass a legislative package in Rhode Island that provides both the policy framework and the funding necessary to achieve universal bilingual education for English language learners in RI schools.

One of the great advantages of transitioning to common standards1 is there should be greater availability of curricular materials in languages other than English. I suspect most of what is needed for bilingual education is start-up money for materials, curriculum supports and development, and assessment materials. There are a few policy things that need to be in place, possibly around state exams, but also rules around flexible teacher assignment, hiring, and dismissal as staffing needs dramatically change.

Someone should be putting this package together. I suspect there would be broad support.


  1. Note, this is not necessarily a feature of the Common Core State Standards, just having standards in common with many other states. ↩︎

November 19, 2013

De Blasio and his advisers are still figuring out how much rent to charge well-funded charter schools, his transition team told me. “It would depend on the resources of the charter school or charter network,” he told WNYC, in early October. “Some are clearly very, very well resourced and have incredible wealthy backers. Others don’t. So my simple point was that programs that can afford to pay rent should be paying rent.” (In an October debate with the Republican candidate Joseph Lhota, he put it more bluntly: “I simply wouldn’t favor charters the way Mayor Bloomberg did because, in the end, our city rises or falls on our traditional public schools.”)

My impression of DeBlasio was that he went around collecting every plausible complaint from every interest group that was mad at Bloomberg and promised whatever they wanted. There didn’t really seem to be a coherent theory or any depth whatsoever to his policy prescriptions.

Already working hard to confirm this impression.

November 18, 2013

To recap, the first study discussed above established that children from disadvantaged backgrounds know less about a topic (i.e., birds) than their middle-class peers. Next, in study two, the researchers showed that differences in domain knowledge influenced children’s ability to understand words out of context, and to comprehend a story. Moreover, poor kids — who also had more limited knowledge — perform worse on these tasks than did their middle class peers. But could additional knowledge be used to level the playing field for children from less affluent backgrounds?

In study three, the researchers held the children’s prior knowledge constant by introducing a fictitious topic — i.e., a topic that was sure to be unknown to both groups. When the two groups of children were assessed on word learning and comprehension related to this new domain, the researchers found no significant differences in how poor and middle-class children learned words, comprehended a story or made inferences.

One of the “old” divides in education, from before the current crop of “edreform”, is whether or not content matters. Broadly, there are two camps, let’s call them the “Facts” and “Skills”, with the “Skills” camp clearly ahead in terms of mind share.

“Skills” is based on a fundamentally intuitive insight– students need to know how to do things, not know about the things themselves. In many ways it is built on our common experience of forgetting facts over time. We need 21st century skills, not an accumulation of specific, privileged knowledge that fades over time. With each new technology, from encyclopedias to calculators through to Google, each generation decides that the tools adults use end the necessity of knowing about things rather than knowing how to find them.

This is very attractive. It seems to match our adult experiences accumulating knowledge and using it in our work. It seems to address students’ boredom with learning irrelevant information. It leaves space for groups to advocate for teaching whatever content they want since everyone can argue that content is fundamentally limited in value.

In classic “turns out” fashion, however, the evidence keeps mounting that one must teach from the “Facts” approach to achieve the goals of the “Skills” position.

Turns out: skills and knowledge do not transfer well across domains. There is little evidence that learning how to read literary fiction translates to reading technical manuals with comprehension. In other words, critical thinking is not really an independent ability free of domain context 1. In fact, experts are able to learn more quickly, but only in their domain and only when they have prior knowledge to use as scaffolding 2.

Turns out: reading comprehension is strongly connected to whether or not students have prior knowledge (“Facts”) about the topic of the passage 3. Reading techniques only provide modest assistance for comprehension.

Turns out: privileging skills over content may have a serious differential impact on disadvantaged children. A well-intentioned goal of achieving equity through equality has led many to argue that we do a disservice to children of color and children in poverty because their schools have not as completely embraced a “Skills” world and are too focused on “Facts”. The problem is that the deep disparities we see when these students enter schooling point to them having less prior knowledge than their peers 4.

What is remarkable, and tragic, is that the “Skills” camp has maintained its dominance through the demonization of “Facts”, with dramatic misinterpretations like:

  1. The “Facts” folks are just White colonialists seeking to maintain existing power structures through teaching the information of privilege.
  2. The “Facts” folks privilege memorization, rote learning, and recall-based assessment over other pedagogy that is more engaging and authentic.
  3. The “Facts” folks can only ever teach what was important yesterday; “Skills” camp can teach what matters to become a lifelong learner for tomorrow’s world.

None of these are true.

This post is largely brought to you by: E.D. Hirsch, Dan T. Willingham, and Malcolm Gladwell via Merlin Mann.


  1. http://www.aft.org/pdfs/americaneducator/summer2007/Crit_Thinking.pdf ↩︎

  2. http://www.ncbi.nlm.nih.gov/pubmed/11550744 ↩︎

  3. http://www.aft.org/newspubs/periodicals/ae/spring2006/willingham.cfm ↩︎

  4. This has pretty much been the thrust behind E.D. Hirsch’s work, who has been accused of being on the far right in education, despite his consistent belief that education equity is one of the most important goals to achieve. His firm belief, and I am mostly convinced, is that explicit factual content is the key tool for how teaching can dramatically improve educational equity. ↩︎

  1. More schooling, reoriented calendar
  2. Wider range of higher education
  3. Cheaper four-year degrees
  4. Eliminate property tax-based public education

This is an interesting list. I don’t agree with number four. There are several benefits to using property taxes, not the least of which is their stability and lagged response during traditional economic downturns. However, there are many things we should do to reform our revenue system for education. I am keen on more taxes on “property”, using land value taxes that are levied either statewide or regionally to address some of the inequities that traditional, highly localized property taxes can lead to.

November 17, 2013

If I had to point to the key fissure in the education policy and research community it would be around poverty. Some seem to view it as an inexorable obstacle, deeply believing that the key improvement strategy is to decrease inequity of inputs. Some seem to view it as an obstacle that can be overcome by systems functioning at peak efficacy, deeply believing the great challenge is achieving that efficacy sustainably at scale. Both positions seem to grossly simplify causes and suggest policy structures and outcomes that are unachievable.

Paraphrasing Merlin Mann, always be skeptical of “turns out” research. In this case, are the results really that surprising? If they are, I might suggest that you have been focusing too much on the partial equilibrium impact of poverty and ignoring the bigger picture.

Not that I think integration is likely, easy, quick, or magically fixes things.

October 7, 2013

I spent most of high school writing, practicing, and performing music. I played guitar in two separate bands, was the lead vocalist in one of them, and played trumpet in various wind ensembles and the jazz band at school. When I wasn’t a part of the creation process myself, there is a pretty good chance I was listening to music. Back then, it seemed trivial to find a new artist or album to obsess over.

Despite being steeped in music, I have always found it hard to write about. The truth is, I have limited ability to use words to explain just what makes a particular piece of music so wonderful. Oh sure, I could discuss structure, point out a particular hook in a particular section and how it sits in the mix. I could talk about the tone of the instrument or about quality of the performance or any number of other things. The problem with this language is it reduces what is great about this piece of music to a description that could easily fit some other piece of music. Verbalizing the experience of music projects a woefully flattened artifact of something breathtaking.

Now it might seem that recorded music has greatly diminished this challenge. After all, the experience of recorded music can scale– anyone can listen. Unfortunately, I found this to be completely untrue. When I play music for other people, it actually sounds different than when I experience it for myself. Little complexities that seem crucial to the mix seem to cower and hide rather than loom large in the presence of others. It is not really feasible to point out what makes the song so great while listening, because it disrupts the experience. Worst of all, no one else seems to experience what I experience when I listen.

Of course, all of this may seem obvious to someone who has read about aesthetics. I have not.

September 22, 2013

In a couple of previous posts, I outlined the importance of documenting business rules for common education statistics and described my take on how to best calculate student mobility. In this post, I will be sharing two versions of an R function I wrote to implement this mobility calculation, reviewing their different structures and methods to reveal how I achieved an order of magnitude speed up between the two versions. 1 At the end of this post, I will propose several future routes for optimization that I believe should lead to the ability to handle millions of student records in seconds.

Version 0: Where Do I Begin?

The first thing I tend to do is whiteboard the rules I want to use through careful consideration and constant referral back to real data sets. By staying grounded in the data, I am less likely to encounter unexpected situations during my quality control. It also makes it much easier to develop test data, since I seek out outlier records in actual data during the business rule process.

Developing test data is a key part of the development process. Without a compact, but sufficiently complex, set of data to try with a newly developed function, there is no way to know whether or not it does what I intend.

Recall the business rules for mobility that I have proposed, all of which came out of this whiteboarding process:

  1. Entering the data with an enroll date after the start of the year counts as one move.
  2. Leaving the data with an exit date before the end of the year counts as one move.
  3. Changing schools sometime during the year without a large gap in enrollment counts as one move.
  4. Changing schools sometime during the year with a large gap in enrollment counts as two moves.
  5. Adjacent enrollment records for the same student in the same school without a large gap in enrollment do not count as moving.

Test data needs to represent each of these situations so that I can confirm the function is properly implementing each rule.

Below is a copy of my test data. As an exercise, I recommend determining the number of “moves” each of these students should be credited with after applying the above stated business rules.

| Unique Student ID | School Code | Enrollment Date | Exit Date |
|---|---|---|---|
| 1000000 | 10101 | 2012-10-15 | 2012-11-15 |
| 1000000 | 10103 | 2013-01-03 | 2013-03-13 |
| 1000000 | 10103 | 2013-03-20 | 2013-05-13 |
| 1000001 | 10101 | 2012-09-01 | 2013-06-15 |
| 1000002 | 10102 | 2012-09-01 | 2013-01-23 |
| 1000003 | 10102 | 2012-09-15 | 2012-11-15 |
| 1000003 | 10102 | 2013-03-15 | 2013-06-15 |
| 1000004 | 10103 | 2013-03-15 | NA |
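For anyone following along in R, here is one way this test data might be constructed. The data frame name test_df and the column names are my own assumptions, chosen to match the default parameter values of the function described below.

# A sketch of the test data above, using column names that match the
# defaults assumed by the function below (sid, schid, enroll_date, exit_date).
test_df <- data.frame(
  sid   = c(1000000, 1000000, 1000000, 1000001, 1000002, 1000003, 1000003, 1000004),
  schid = c(10101, 10103, 10103, 10101, 10102, 10102, 10102, 10103),
  enroll_date = as.Date(c("2012-10-15", "2013-01-03", "2013-03-20", "2012-09-01",
                          "2012-09-01", "2012-09-15", "2013-03-15", "2013-03-15")),
  exit_date   = as.Date(c("2012-11-15", "2013-03-13", "2013-05-13", "2013-06-15",
                          "2013-01-23", "2012-11-15", "2013-06-15", NA))
)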

Version 1: A Naïve Implementation

Once I have developed business rules and a test data set, I like to quickly confirm that I can produce the desired results. That’s particularly true when it comes to implementing new, fairly complex business rules. My initial implementation of a new algorithm does not need to be efficient, easily understood, or maintainable. My goal is simply to follow my initial hunch on how to accomplish a task and get it working. Sometimes this naïve implementation turns out to be pretty close to my final implementation, but sometimes it can be quite far off. The main things I tend to improve with additional work are extensibility, readability, and performance.

In the case of this mobility calculation, I knew almost immediately that my initial approach was not going to have good performance characteristics. Here is a step by step discussion of Version 1.

Function Declaration: Parameters

moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sid', 
                       schid='schid',
                       enroll_date='enroll_date',
                       exit_date='exit_date'){

I named my function moves_calc() to match the style of age_calc() which was submitted and accepted to the eeptools package. This new function has eight parameters.

df: a data.frame containing the required data to do the mobility calculation.

enrollby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the start of the school year. Students whose first enrollment is after this date will have an additional move under the assumption that they enrolled somewhere prior to the first enrollment record in the data. This does not (and likely should not) match the actual first day of the school year.

exitby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the end of the school year. Students whose last exit is before this date will have an additional move under the assumption that they enrolled somewhere after this exit record that is excluded in the data. This date does not (and likely should not) match the actual last day of the school year.

gap: an atomic vector of type numeric that signifies how long a gap must exist between student records to record an additional move for that student under the assumption that they enrolled somewhere in between the two records in the data that is not recorded.

sid: an atomic vector of type character that represents the name of the vector in df that contains the unique student identifier. The default value is 'sid'.

schid: an atomic vector of type character that represents the name of the vector in df that contains the unique school identifier. The default value is schid.

enroll_date: an atomic vector of type character that represents the name of the vector in df that contains the enrollment date for each record. The default value is enroll_date.

exit_date: an atomic vector of type character that represents the name of the vector in df that contains the exit date for each record. The default value is exit_date.

Most of these parameters are about providing flexibility around the naming of attributes in the data set. Although I often write functions for my own work which accept data.frames, I cannot help but feel this is a bad practice. Assuming particular data attributes of the right name and type does not make for generalizable code. To make up for my shortcoming in this area, I have done my best to allow other users to enter whatever data column names they want, so long as they contain the right information to run the algorithm.

The next portion of the function loads some of the required packages and is common to many of my custom functions:

if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
require(data.table)

if("plyr" %in% rownames(installed.packages()) == FALSE){
    install.packages("plyr")
  } 
require(plyr)

Type Checking and Programmatic Defaults

Next, I do extensive type-checking to make sure that df is structured the way I expect it to be in order to run the algorithm. I do my best to supply humane warning() and stop() messages when things go wrong, and in some cases, set default values that may help the function run even if the function is not called properly.

if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
    stop("Both enroll_date and exit_date must be Date objects")

The enroll_date and exit_date both have to be Date objects. I could have attempted to coerce those vectors into Date types using as.Date(), but I would rather not assume something like the date format. Since enroll_date and exit_date are the most critical attributes of each student, the function will stop() if they are the incorrect type, informing the analyst to clean up the data.
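If the raw extract stores dates as strings, the fix is a one-time coercion before calling the function. A minimal sketch, assuming the dates arrive as "YYYY-MM-DD" strings (the raw data frame here is hypothetical):

# Hypothetical prep step: coerce character date columns to Date objects
# before calling the function, assuming "YYYY-MM-DD" strings.
raw <- data.frame(enroll_date = "2012-09-01", exit_date = "2013-06-15",
                  stringsAsFactors = FALSE)
raw$enroll_date <- as.Date(raw$enroll_date, format = "%Y-%m-%d")
raw$exit_date   <- as.Date(raw$exit_date, format = "%Y-%m-%d")
inherits(raw$enroll_date, "Date")  # TRUE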

if(missing(enrollby)){
   enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
     enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                               '-09-15', sep=''), format='%Y-%m-%d')
     warning(paste("enrollby must be a string with format %Y-%m-%d,",
                   "defaulting to", 
                   enrollby, sep=' '))
  }else{
    enrollby <- as.Date(enrollby, format="%Y-%m-%d")
  }
}
if(missing(exitby)){
  exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                          '-06-01', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
    exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                              '-06-01', sep=''), format='%Y-%m-%d')
    warning(paste("exitby must be a string with format %Y-%m-%d,",
                  "defaulting to", 
                  exitby, sep=' '))
  }else{
    exitby <- as.Date(exitby, format="%Y-%m-%d")
  }
}
if(!is.numeric(gap)){
  gap <- 14
  warning("gap was not a number, defaulting to 14 days")
}

For maximum flexibility, I have parameterized the enrollby, exitby, and gap used by the algorithm to determine student moves. An astute observer of the function declaration may have noticed I did not set default values for enrollby or exitby. This is because these dates are naturally going to be different with each year of data. As a result, I want to enforce their explicit declaration.

However, we all make mistakes. So when I check to see if enrollby or exitby are missing(), I do not stop the function if it returns TRUE. Instead, I set the value of enrollby to September 15 in the year that matches the minimum (first) enrollment record and exitby to June 1 in the year that matches the maximum (last) exit record. I then pop off a warning() that informs the user of the expected values for each parameter and what values I have defaulted them to. I chose to use warning() because many R users set their environment to halt at warnings(). Warnings are generally not good and should be pursued and fixed. No one should depend upon the defaulting process I use in the function. But the defaults that can be determined programmatically are sensible enough that I did not feel the need to always halt the function in its place.

I also check to see if gap is, in fact, defined as a number. If not, I throw a warning() after setting gap equal to the default value of 14.

Is this all of the type and error-checking I could have included? Probably not, but I think this represents a very sensible set that makes this function much more generalizable outside of my coding environment. This kind of checking may be overkill for a project that is worked on independently and with a single data set, but colleagues, including your future self, will likely be thankful for its inclusion if any of your code is to be reused.

Initializing the Results

output <- data.frame(id = as.character(unique(df[[sid]])),
                     moves = vector(mode = 'numeric', 
                                    length = length(unique(df[[sid]]))))
output <- data.table(output, key='id')
df <- arrange(df, sid, enroll_date)

My naïve implementation uses a lot of for loops, a no-no when it comes to R performance. One way to make for loops a lot worse, and this is true in any language, is to reassign a variable within the loop. This means that each iteration has the overhead of creating and assigning that object. Especially when we are building up results for each observation, it is silly to do this. We know exactly how big the data will be and therefore only need to create the object once. We can then assign a much smaller part of that object (in this case, one value in a vector) rather than the whole object (a honking data.table).
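As a toy illustration of the difference (not from the original function), compare growing a vector inside a loop to filling a pre-allocated one:

# Growing an object inside the loop reallocates the whole object on every pass.
n <- 10000
grow <- c()
for(i in 1:n) grow <- c(grow, i)

# Pre-allocating once and assigning into it only writes one element per pass.
prealloc <- vector(mode = "numeric", length = n)
for(i in 1:n) prealloc[i] <- i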

Our output object is what the function returns. It is a simple data.table containing all of the unique student identifiers and the number of moves recorded for each student.

The last line in this code chunk ensures that the data are arranged by the unique student identifier and enrollment date. This is key since the for loops assume that they are traversing a student’s record sequentially.

Business Rule 1: The Latecomer

for(i in 1:(length(df[[sid]])-1)){
  if(i>1 && df[sid][i,]!=df[sid][(i-1),]){
    if(df[['enroll_date']][i]>enrollby){
      output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }else if(i==1){
    if(df[['enroll_date']][i]>enrollby){
    output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }

The first bit of logic checks if sid in row i is not equal to the sid in the i-1 row. In other words, is this the first time we are observing this student? If it is, then row i is the first observation for that student and therefore has the minimum enrollment date. The enroll_date is checked against enrollby. When enroll_date is after enrollby, then the moves attribute for that sid is incremented by 1. 2

Now, I didn’t really mention the conditional that i>1. This is needed because there is no i-1 observation for the very first row of the data.table. Therefore, i==1 is a special case where we once again perform the same check for enroll_date and enrollby. The i>1 condition is before the && operator, which ensures the statement after the && is not evaluated when the first conditional is FALSE. This avoids an “out of bounds”-type error where R tries to check df[0].
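A tiny illustration of that short-circuit behavior (this snippet is mine, not part of the function):

# && stops evaluating as soon as the result is known, so the right-hand side
# is never reached when i == 1.
i <- 1
i > 1 && stop("never evaluated")  # returns FALSE instead of erroring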

Business Rule 5: The Feint

Yeah, yeah– the business rule list above doesn’t match the order of my function. That’s ok. Remember, sometimes giving instructions to a computer does not follow the way you would organize instructions for humans.

Remember, the function is traversing through our data.frame one row at a time. First I checked to see if the function is at the first record for a particular student. Now I check to see if there are any records after the current record.

  if(df[sid][i,]==df[sid][(i+1),]){
    if(as.numeric(difftime(df[['enroll_date']][i+1], 
                           df[['exit_date']][i], units='days')) < gap &
       df[schid][(i+1),]==df[schid][i,]){
        next
    }else if ...

For the case where the i+1 record has the same sid, then the enroll_date of i+1 is subtracted from the exit_date of i and checked against gap. If it is both less than gap and the schid of i+1 is the same as i, then next, which basically breaks out of this conditional and moves on without altering moves. In other words, students who are in the same school with only a few days between the time they exited are not counted as having moved.

The ... above is not the special ... in R, rather, I’m continuing that line below.

Business Rule 3: The Smooth Mover

  }else if(as.numeric(difftime(df[['enroll_date']][i+1], 
                               df[['exit_date']][i], 
                               units='days')) < gap){
    output[as.character(df[[sid]][i]), moves:=moves+1L] 
  }else{ ...

Here we have the simple case where a student has moved to another school (recall, this is still within the if conditional where the next record is the same student as the current record) with a very short period of time between the exit_date at the current record and the enroll_date of the next record. This is considered a “seamless” move from one school to another, and therefore that student’s moves are incremented by 1.

Business Rule 4: The Long Hop

Our final scenario for a student moving between schools is when the gap between the exit_date at the i school and the enroll_date at the i+1 school is large, defined as > gap. In this scenario, the assumption is that the student moved to a jurisdiction outside of the data set, such as out of district for district-level data or out of state for state level data, and enrolled in at least one school not present in their enrollment record. The result is these students receive 2 moves– one out from the i school to a missing school and one in to the i+1 school from the missing school.

The code looks like this (again a repeat from the else{... above which was not using the ... character):

  }else{
    output[as.character(df[[sid]][i]), moves:=moves+2L] 
  }
}else...

This ends with a } which closes the if conditional that checked if the i+1 student was the same as the i student, leaving only one more business rule to check.

Business Rule 2: The Early Summer

}else{
  if(is.na(df[['exit_date']][i])){
    next
  }else if(df[['exit_date']][i] < exitby){
        output[as.character(df[[sid]][i]), moves:=moves+1L]
  }
}

Recall that this else block is only called if the sid of the i+1 record is not the same as i. This means that this is the final entry for a particular student. First, I check to see if that student has a missing exit_date and, if so, charge no move to the student, using the next statement to break out of this iteration of the loop. Students never have a missing enroll_date in any of the data I have seen over 8 years. This is because most systems minimally autogenerate the enroll_date for the current date when a student first enters a student information system. However, sometimes districts forget to properly exit a student and are unable to supply an accurate exit_date. In a very small number of cases I have seen these missing dates. So I do not want the function to fail in this scenario. My solution here was simply to break out and move to the next iteration of the loop.

Finally, I apply the last rule, which compares the final exit_date for a student to exitby, incrementing moves if the student left prior to the end of the year and likely enrolled elsewhere before the summer.

The last step is to close the for loop and return our result:

  }
  return(output)
}

Version 2: 10x Speed And More Readable

The second version of this code is vastly quicker.

The opening portion of the code, including the error checking, is essentially a repeat of before, as is the initialization of the output.

moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sasid', 
                       schid='schno',
                       enroll_date='enroll_date',
                       exit_date='exit_date'){
  if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
  require(data.table)
  if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
      stop("Both enroll_date and exit_date must be Date objects")
  if(missing(enrollby)){
    enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
      enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                                '-09-15', sep=''), format='%Y-%m-%d')
      warning(paste("enrollby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    enrollby, sep=' '))
    }else{
      enrollby <- as.Date(enrollby, format="%Y-%m-%d")
    }
  }
  if(missing(exitby)){
    exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                            '-06-01', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
      exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                                '-06-01', sep=''), format='%Y-%m-%d')
      warning(paste("exitby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    exitby, sep=' '))
    }else{
      exitby <- as.Date(exitby, format="%Y-%m-%d")
    }
  }
  if(!is.numeric(gap)){
    gap <- 14
    warning("gap was not a number, defaulting to 14 days")
  }
  output <- data.frame(id = as.character(unique(df[[sid]])),
                       moves = vector(mode = 'numeric', 
                                      length = length(unique(df[[sid]]))))

Where things start to get interesting is in the calculation of the number of student moves.

Handling Missing Data

One of the clever bits of code I forgot about when I initially tried to refactor Version 1 appears under “Business Rule 2: The Early Summer”. When the exit_date is missing, this code simply breaks out of the loop:

  if(is.na(df[['exit_date']][i])){
    next

Because the new code will not be utilizing for loops or really any more of the basic control flow, I had to devise a different way to treat missing data. The steps to apply the business rules that I present below will fail spectacularly with missing data.

So the first thing that I do is select the students who have missing data, assign the moves in the output to NA, and then subset the data to exclude these students.

incomplete <- df[!complete.cases(df[, c(enroll_date, exit_date)]), ]
if(dim(incomplete)[1]>0){
  output[which(output[['id']] %in% incomplete[[sid]]),][['moves']] <- NA
}
output <- data.table(output, key='id')
df <- df[complete.cases(df[, c(enroll_date, exit_date)]), ]
dt <- data.table(df, key=sid)

Woe with data.table

Now with the data complete and in a data.table, I have to do a little bit of work to get around my frustrations with data.table. Because data.table does a lot of work with the [ operator, I find it very challenging to use a string argument to reference a column in the data. So I just gave up and internally rename these attributes.

dt$sasid <- as.factor(as.character(dt$sasid))
setnames(dt, names(dt)[which(names(dt) %in% enroll_date)], "enroll_date")
setnames(dt, names(dt)[which(names(dt) %in% exit_date)], "exit_date")

Magic with data.table: Business Rules 1 and 2 in two lines each

Despite my challenges with the way that data.table re-imagines [, it does allow for clear, simple syntax for complex processes. Gone are the for loops and conditional blocks. How does data.table allow me to quickly identify whether or not a student’s first or last enrollment is before or after my cutoffs?

first <- dt[, list(enroll_date=min(enroll_date)), by=sid]
output[id %in% first[enroll_date>enrollby][[sid]], moves:=moves+1L]
last <- dt[, list(exit_date=max(exit_date)), by=sid]  
output[id %in% last[exit_date<exitby][[sid]], moves:=moves+1L]

Line 1 creates a data.table with the student identifier and a new enroll_date column that is equal to the minimum enroll_date for that student.

The second line is very challenging to parse if you’ve never used data.table. The first argument for [ in data.table is a subset/select function. In this case,

id %in% first[enroll_date>enrollby][[sid]]

means,

Select the rows in first where the enroll_date attribute (which was previously assigned as the minimum enroll_date) is after the global function argument enrollby, and check whether the id of output is in that set of student identifiers.

So output is being subset to only include those records that meet that condition, in other words, the students who should have a move because they entered the school year late.

The second argument of [ for data.tables is explained in this footnote 2 if you’re not familiar with it.
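For readers who have never used data.table, here is a toy version of the subset-then-assign idiom used above. The table and values are made up purely for illustration:

library(data.table)

# A keyed toy table standing in for output.
toy <- data.table(id = c("a", "b", "c"), moves = c(0L, 0L, 0L), key = "id")

# The first argument subsets rows; := then updates moves by reference,
# so only the selected students are incremented.
toy[id %in% c("a", "c"), moves := moves + 1L]
toy  # a and c now have moves == 1, b is untouched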

Recursion. Which is also known as recursion.

The logic for Business Rules 3-5 is substantially more complex. At first it was not plainly obvious how to avoid a slow for loop for this process. Each of the rules on switching schools requires an awareness of context– how does one record of a student compare to the very next record for that student?

The breakthrough was thinking back to my single semester of computer science and the concept of recursion. I created a new function inside of this function that can count how many moves are associated with a set of enrollment records, ignoring the considerations in Business Rules 1 and 2. Here’s my solution. I decided to include inline comments because I think it’s easier to understand that way.

school_switch <- function(dt, x=0){
  # This function accepts a data.table dt and initializes the output to 0.
    if(dim(dt)[1]<2){
    # When there is only one enrollment record, there are no school changes to
    # apply rules 3-5. Therefore, the function returns the value of x. If the
    # initial data.table contains a student with just one enrollment record, 
    # this function will return 0 since we initialize x as 0.
      return(x)
    }else{
      # More than one record, find the minimum exit_date which is the "first"
      # record
      exit <- min(dt[, exit_date])
      # Find out which school the "first" record was at.
      exit_school <- dt[exit_date==exit][[schid]]
      # Select which rows come after the "first" record and only keep them
      # in the data.table
      rows <- dt[, enroll_date] > exit
      dt <- dt[rows,]
      # Find the minimum enrollment date in the subsetted table. This is the
      # enrollment that follows the identified exit record
      enroll <- min(dt[, enroll_date])
      # Find the school associated with that enrollment date
      enroll_school <- dt[enroll_date==enroll][[schid]]
      # When the difference between the enrollment and exit dates are less than
      # the gap and the schools are the same, there are no moves. We assign y,
      # our count of moves to x, whatever the number of moves were in this call
      # of school_switch
      if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap &
         exit_school==enroll_school){
        y = x
      # When the difference in days is less than the gap (and the schools are
      # different), then our number of moves are incremented by 1.
      }else if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap){
        y = x + 1L
      }else{
      # Whenever the dates are separated by more than the gap, regardless of which
      # school a student is enrolled in at either point, we increment by two.
        y = x + 2L
      }
      # Explained below outside of the code block.
      school_switch(dt, y)
    }
  }

The recursive aspect of this method is calling school_switch within school_switch once the function reaches its end. Because I subset out the row with the minimum exit_date, the data.table has one fewer row to process with each iteration of school_switch. By passing the number of moves, y, back into school_switch, I am “saving” my work from each iteration. Only when a single row remains for a particular student does the function return a value.

This function is called using data.table’s special .SD object, which accesses the subset of the full data.table when using the by argument.

dt[, moves:= school_switch(.SD), by=sid]

This calls school_switch after splitting the data.table by each sid and then stitches the work back together, split-apply-combine style, resulting in a data.table with a set of moves per student identifier. With a little bit of clean up, I can simply add these moves to those recorded earlier in output based on Business Rules 1 and 2.

  dt <- dt[,list(switches=unique(moves)), by=sid]
  output[dt, moves:=moves+switches]
  return(output)
}

Quick and Dirty system.time
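A quick sketch of how the two versions might be compared with system.time(); the function names moves_calc_v1() and moves_calc_v2() are placeholders for the two implementations above, and test_df is the small test set constructed earlier:

# Hypothetical timing comparison of the two implementations described above.
system.time(v1 <- moves_calc_v1(test_df, enrollby = "2012-09-15",
                                exitby = "2013-06-01", gap = 14,
                                sid = "sid", schid = "schid"))
system.time(v2 <- moves_calc_v2(test_df, enrollby = "2012-09-15",
                                exitby = "2013-06-01", gap = 14,
                                sid = "sid", schid = "schid"))

The toy test set is far too small to show a meaningful difference; the 10x figure above comes from running both versions on tens of thousands of real enrollment records.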


  1. On a mid-2012 Macbook Air, the current mobility calculation is very effective with tens of thousands of student records and practical for use in the low-hundreds of thousands of records range. ↩︎

  2. I thought I was going to use data.table for some of its speedier features as I wrote this initial function. I didn’t in this go (though I do in Version 2). However, I do find the data.table syntax for assigning values to be really convenient, particularly the := operator which is common in several other languages. In data.table, the syntax dt[,name:=value] assigns value to an existing (or new) column called name. Because of the keyed select operator in data.table, I can just use dt[id,moves:=moves+1L] to select only the rows where the table key, in this case sid, matches id, and then increment moves. Nice. ↩︎ ↩︎

September 16, 2013

How do we calculate student mobility? I am currently soliciting responses from other data professionals across the country. But when I needed to produce mobility numbers for some of my work a couple of months ago, I decided to develop a set of business rules without any exposure to how the federal government, states, or other existing systems define mobility. 1

I am fairly proud of my work on mobility. This post will review how I defined student mobility. I am hopeful that it matches or bests current techniques for calculating the number of schools a student has attended. In my next post, I will share the first two major versions of my implementation of these mobility business rules in R. 2 Together, these posts will represent the work I referred to in my previous post on the importance of documenting business rules and sharing code.

The Rules

Working with district data presents a woefully incomplete picture of the education mobile students receive. Particularly in a state like Rhode Island, where our districts are only a few miles wide, there is substantial interdistrict mobility. When a student moves across district lines, their enrollment is not recorded in local district data. However, even with state level data, highly mobile students cross state lines and present incomplete data. A key consideration for calculating how many schools a student has attended in a particular year is capturing “missing” data sensibly.

The typical structure of enrollment records looks something like this:

| Unique Student ID | School Code | Enrollment Date | Exit Date |
|---|---|---|---|
| 1000000 | 10101 | 2012-09-01 | 2012-11-15 |
| 1000000 | 10103 | 2012-11-16 | 2013-06-15 |

A compound key for this data consists of the Unique Student ID, School Code, and Enrollment Date, meaning that each row must be a unique combination of these three factors. The data above shows a simple case of a student enrolling at the start of the school year, switching schools once with no gap in enrollment, and continuing at the new school until the end of the school year. For the purposes of mobility, I would define the above as having moved one time.

But it is easy to see how some very complex scenarios could quickly arise. What if student 1000000’s record looked like this?

| Unique Student ID | School Code | Enrollment Date | Exit Date |
|---|---|---|---|
| 1000000 | 10101 | 2012-10-15 | 2012-11-15 |
| 1000000 | 10103 | 2013-01-03 | 2013-03-13 |
| 1000000 | 10103 | 2013-03-20 | 2013-05-13 |

There are several features that make it challenging to assign a number of “moves” to this student. First, the student does not enroll in school until October 15, 2012. This is nearly six weeks into the typical school year in the Northeastern United States. Should we assume that this student has enrolled in no school at all prior to October 15th or should we assume that the student was enrolled in a school that was outside of this district and therefore missing in the data? Next, we notice the enrollment gap between November 15, 2012 and January 3, 2013. Is it right to assume that the student has moved only once in this period of time with a gap of enrollment of over a month and a half? Then we notice that the student exited school 10103 on March 13, 2013 but was re-enrolled in the same school a week later on March 20, 2013. Has the student truly “moved” in this period? Lastly, the student exits the district on May 13, 2013 for the final time. This is nearly a month before the end of school. Has this student moved to a different school?

There is an element missing that most enrollment data has which can enrich our understanding of this student’s record. All districts collect an exit type, which explains whether a student is leaving to enroll in another school within the district, another school in a different district in the same state, another school in a different state, a private school, etc. It also defines whether a student is dropping out, graduating, or has entered the juvenile justice system, for example. However, it has been my experience that this data is reported inconsistently and unreliably. Frequently a student will be reported as changing schools within the district without a subsequent enrollment record, or reported as leaving the district but enroll within the same district a few days later. Therefore, I think that we should try and infer the number of schools that a student has attended using solely the enrollment date, exit date, and school code for each student record. This data is far more reliable for a host of reasons, and, ultimately, provides us with all the information we need to make intelligent decisions.

My proposed set of business rules examines school code, enrollment date, and exit date against three parameters: enrollment by, exit by, and gap. Each student’s minimum enrollment date is compared to enrollment by. If that student entered the data set for the first time before the enrollment by date, the assumption is that this record represents the first time the student enrolls in any school for that year, and therefore the student has 0 moves. If the student enrolls for the first time after enrollment by, then the record is considered the second school a student has attended and their moves attribute is incremented by 1. Similarly, if a student’s maximum exit date is after exit by, then this is considered to be the student’s last school enrolled in for the year and they are credited with 0 moves, but if the exit date is prior to exit by, then that student’s moves is incremented by 1.

That takes care of the “ends”, but what happens as students switch schools in the “middle”? I proposed that each exit date is compared to the subsequent enrollment date. If enrollment date occurs within gap days of the previous exit date, and the school code of enrollment is not the same as the school code of exit, then a student’s moves are incremented by 1. If the school codes are identical and the difference between dates is less than gap, then the student is said to have not moved at all. If the difference between the enrollment date and the previous exit date is greater than gap, then the student’s moves is incremented by 2, the assumption being that the student likely attended a different school between the two observations in the data.
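To make that concrete, consider student 1000000’s record above with an enrollment by date of September 15, an exit by date of June 1, and a 14-day gap (the defaults used in my implementation): the October 15 first enrollment comes after the cutoff (one move), the gap from November 15 to January 3 is far longer than 14 days (two moves), the March 13 to March 20 re-enrollment in the same school falls within the gap (no moves), and the final May 13 exit comes before June 1 (one move), for a total of four moves.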

Whereas calculating student mobility may have seemed a simple matter of counting the number of records in the enrollment file, clearly there is a level of complexity this would fail to capture.

Check back in a few days to see my next post, where I will share my initial implementation of these business rules and how I achieved a 10x speed up with a massive code refactor.


  1. My ignorance was intentional. It is good to stretch those brain muscles that think through sticky problems like developing business rules for a key statistic. I can’t be sure that I have developed the most considered and complete set of rules for mobility, which is why I’m now soliciting others’ views, but I am hopeful my solution is at least as good. ↩︎

  2. I think showing my first two implementations of these business rules is an excellent opportunity to review several key design considerations when programming in R. From version 1 to version 2 I achieved a 10x speedup due to a complete refactor that avoided for loops, used data.table, and included some clever use of recursion. ↩︎

September 12, 2013

One of the most challenging aspects of being a data analyst is translating programmatic terms like “student mobility” into precise business rules. Almost any simple statistic involves a series of decisions that are often opaque to the ultimate users of that statistic.

Documentation of business rules is a critical aspect of a data analyst’s job that, in my experience, is often regrettably overlooked. If you have ever tried to reproduce someone else’s analysis, asked different people for the same statistic, or tried to compare data from multiple years, you have probably encountered difficulties getting a consistent answer on standard statistics, e.g. how many students were proficient in math, how many students graduated in four years, what proportion of students were chronically absent? All too often documentation of business rules is poor or non-existent. The result is that two analysts with the same data will produce inconsistent statistics. This is not because of something inherent in the quality of the data or an indictment of the analysts’ skills. In most cases, the undocumented business rules are essentially trivial, in that the results of any decision have a small impact on the final result and any of the decisions made by the analysts are equally defensible.

This major problem of lax or non-existent documentation is one of the main reasons I feel that analysts, and in particular analysts working in the public sector, should extensively use tools for code sharing and version control like Github, use free tools whenever possible, and generally adhere to best practices in reproducible research.

I am trying to put as much of my code on Github as I can these days. Much of what I write is still very disorganized and, frankly, embarrassing. A lot of what is in my Github repositories is old, abandoned code written as I was learning my craft. A lot of it is written to work with very specific, private data. Most of it is poorly documented because I am the only one who has ever had to use it, I don’t interact with anyone through practices like code reviews, and frankly I am lazy when pressed with a deadline. But that’s not really the point, is it? The worst documented code is code that is hidden away on a personal hard drive, written for an expensive proprietary environment most people and organizations cannot use, or worse, is not code at all but rather a series of destructive data edits and manipulations. 1

One way that I have been trying to improve the quality and utility of the code I write is by contributing to an open source R package, eeptools. This is a package written and maintained by Jared Knowles, an employee of the Wisconsin Department of Public Instruction, whom I met at a Strategic Data Project convening. eeptools is consolidating several functions in R for common tasks education data analysts are faced with. Because this package is available on CRAN, the primary repository for R packages, any education analyst can have access to its functions in one line:

install.packages('eeptools'); require(eeptools)

Submitting code to a CRAN package reinforces several habits. First, I get to practice writing R documentation, explaining how to use a function, and therefore, articulating the assumptions and business rules I am applying. Second, I have to write my code with a wider tolerance for input data. One of the easy pitfalls of a beginning analyst is writing code that is too specific to the dataset in front of you. Most of the errors I have found in analyses during quality control stem from assumptions embedded in code that were perfectly reasonable with a single data set that lead to serious errors when using different data. One way to avoid this issue is through test-driven development, writing a good testing suite that tests a wide range of unexpected inputs. I am not quite there yet, personally, but thinking about how my code would have to work with arbitrary inputs and ensuring it fails gracefully 2 is an excellent side benefit of preparing a pull request 3 . Third, it is an opportunity to write code for someone other than myself. Because I am often the sole analyst with my skillset working on a project, it is easy to not consider things like style, optimizations, clarity, etc. This can lead to large build-ups of technical debt, complacency toward learning new techniques, and general sloppiness. Submitting a pull request feels like publishing. The world has to read this, so it better be something I am proud of that can stand up to the scrutiny of third-party users.
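As a sketch of what that kind of testing might look like, here is a testthat-style check against the mobility function discussed below. This is not code from eeptools itself; the tiny data set is made up, and the expectations simply restate behavior described in these posts (one row of output per student, and a hard stop when the date columns are not Date objects).

library(testthat)

# A sketch, not eeptools code: a tiny data set plus two checks.
tiny <- data.frame(sid = c(1, 1), schid = c(10101, 10103),
                   enroll_date = as.Date(c("2012-09-01", "2012-11-16")),
                   exit_date   = as.Date(c("2012-11-15", "2013-06-15")))

test_that("moves_calc returns one row per student and requires Date columns", {
  result <- moves_calc(tiny, enrollby = "2012-09-15", exitby = "2013-06-01")
  expect_equal(nrow(result), length(unique(tiny$sid)))

  bad <- tiny
  bad$enroll_date <- as.character(bad$enroll_date)
  expect_error(moves_calc(bad, enrollby = "2012-09-15", exitby = "2013-06-01"))
})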

My first pull request, which was accepted into the package, calculates age in years, months, or days at an arbitrary date based on date of birth. While even a beginning R programmer can develop a similar function, it is the perfect example of an easily compartmentalized component, with a broad set of applications, that can be accessed frequently.

Today I submitted my second pull request that I hope will be accepted. This time I covered a much more complex task– calculating student mobility. To be honest, I am completely unaware of existing business rules and algorithms used to produce the mobility numbers that are federally reported. I wrote this function from scratch thinking through how I would calculate the number of schools attended by a student in a given year. I am really proud of both the business rules I have developed and the code I wrote to apply those rules. My custom function can accept fairly arbitrary inputs, fails gracefully when it finds data it does not expect, and is pretty fast. The original version of my code took close to 10 minutes to run on ~30,000 rows of data. I have reduced that with a complete rewrite prior to submission to 16 seconds.

While I am not sure if this request will be accepted, I will be thrilled if it is. Mobility is a tremendously important statistic in education research and a standard, reproducible way to calculate it would be a great help to researchers. How great would it be if eeptools becomes one of the first packages education data analysts load and my mobility calculations are used broadly by researchers and analysts? But even if it’s not accepted because it falls out of scope, the process of developing the business rules, writing an initial implementation of those rules, and then refining that code to be far simpler, faster, and less error prone was incredibly rewarding.

My next post will probably be a review of that process and some parts of my moves_calc function that I’m particularly proud of.


  1. Using a spreadsheet program, such as Excel, encourages directly manipulating and editing the source data. Each change permanently changes the data. Even if you keep an original version of the data, there is no record of exactly what was done to change the data to produce your results. Reproducibility is all but impossible for any significant analysis done using spreadsheet software. ↩︎

  2. Instead of halting the function with a hard-to-understand error when things go wrong, I do my best to “correct” easily anticipated errors or report back to users in a plain way what needs to be fixed. See also: fault-tolerant systems. ↩︎

  3. A pull request is when you submit your additions, deletions, or any other modifications to be incorporated in someone else’s repository. ↩︎