Jason Becker
February 26, 2014

We burden Latinos (and other traditionally underserved communities) with expensive housing because of the widespread practice of using homestead exemptions in Rhode Island. By lowering the real estate tax rate, typically by 50%, for owner-occupied housing, we dramatically inflate the tax rate paid by Rhode Islanders who are renting.

Echoing a newly filed lawsuit in New York City over discriminatory real estate tax regimes, this new report emphasizes the racist incentives built into our property tax.

Homestead exemptions are built on the belief that renters are non-permanent residents of communities, care less for the properties they occupy and the neighborhoods they live in, and are worse additions than homeowners. Frankly, it is an anti-White-flight measure meant to assure people that only those with the means to purchase and the intent to stay will join their neighborhoods. Wealthy, largely White, property owners see homestead exemptions as fighting an influx of “slum lords”, which is basically their perception of anyone who purchases a home or builds apartments and rents them out.

Rather than encouraging denser communities with higher land utilization and more housing to reduce the cost of living in dignity, we subsidize low-value (per acre) construction that maintains inflated housing costs.

Full disclosure: I own a condo in Providence and receive a 50% discount on my taxes. In fact, living in a condo Downcity, my home value is depressed because of the limited ways that I can use it. I could rent my current condo at market rate and lose money, because of the doubling in taxes I would endure, versus turning a small monthly profit at the same rent under a uniformly higher tax rate. The flexibility to use my property as my own residence or as a rental unit would more than pay for higher taxes.

So while I do have personal reasons to support removing the homestead exemption, even if I lived in a single-family home on the East Side that was not attractive as a rental property, I would still think this situation is absurd. Homeowners’ taxes could easily rise 20% in order to tax renters 30% less. Maybe some of our hulking, vacant infrastructure could be more viably converted into housing stock, lowering the cost for all residents. Maybe we could even see denser development because there would actually be a market for renters at the monthly rates that would need to be charged to recoup expenses. At least the rent wouldn’t be so damn high for too many people of color and people living in or near poverty.

February 17, 2014

Hadley Wickham has once again 1 made R ridiculously better. Not only is dplyr incredibly fast, but the new syntax allows for some really complex operations to be expressed in a ridiculously beautiful way.

Consider a data set, course, with a student identifier, sid, a course identifier, courseno, a quarter, quarter, and a grade on a scale of 0 to 4, gpa. What if I wanted to know the number of courses a student has failed over the entire year, as defined by having an overall grade of 1.0 or lower?

In dplyr:

course %.% 
group_by(sid, courseno) %.%
summarise(gpa = mean(gpa)) %.%
filter(gpa <= 1.0) %.%
summarise(fails = n())

I refuse to even sully this post with the way I would have solved this problem in the past.


  1. Seriously, how many of the packages he has managed/written are indispensable to using R today? It is no exaggeration to say that the world would have many more Stata, SPSS, and SAS users if not for the Hadleyverse. ↩︎

February 9, 2014

These quotes are absolutely striking, in that they give a clear glimpse into the ideological commitments of the Republican Party. From Sen. Blunt and Rep. Cole, we get the revelation that— for conservatives— the only “work” worth acknowledging is wage labor. To me, and many others, someone who retires early to volunteer— or leaves a job to care for their children— is still working; they are just outside the formal labor market. And indeed, their labor is still valuable— it just isn’t compensated with cash.

One of the greatest benefits of wealth is that it can liberate people to pursue happiness. When we tie a basic need for living complete lives of dignity to full-time employment, people will find themselves willing to make many sacrifices to secure that need. In our nation of great wealth, with liberty and freedom as core values, it is hard to believe that the GOP would decry the liberating effect of ending the contingency of health care on work.

There is no work rule, regulation, or union that empowers workers more in their relationship with their employers than removing the threat of losing health care from the table. An increasingly libertarian right should be celebrating this as a key victory, rather than celebrating the existing coercive impact that health care has on our lives.

Republicans aren’t as worried about the idle rich, who— I suppose— have earned the right to avoid a life of endless toil. Otherwise— if Republicans really wanted everyone to work as much as possible— they’d support confiscatory tax rates. After all, nothing will drive an investment banker back to the office like the threat of losing 70 percent of her income to Uncle Sam.

Oh yeah, I forgot. For all their claims to loving liberty and freedom, what the GOP really stands for is protecting liberty and freedom for the existing “deserving” wealthy. They will fight tooth and nail to remove estate taxes because inheritance is a legitimate source of liberty. Removing the fear of entering a hospital uninsured after being unable to access preventive care is what deprives folks of “dignity”.

February 5, 2014

My Democracy Prep colleague Lindsay Malanga and I often say we should start an organization called the Coalition of Pretty Good Schools. We’d start with the following principles.

  1. Every child must have a safe, warm, disruption-free classroom as a non-negotiable, fundamental right.
  2. All children should be taught to read using phonics-based instruction.
  3. All children must master basic computational skills with automaticity before moving on to higher mathematics.
  4. Every child must be given a well-rounded education that includes science, civics, history, geography, music, the arts, and physical education.
  5. Accountability is an important safeguard of public funds, but must not drive or dominate a child’s education. Class time must not be used for standardized test preparation.

We have no end of people ready to tell you about their paradigmatic shift that will fix education overnight. There has been plenty of philosophizing about the goals, purpose, and means of education. Everyone is ready to pull out tropes about the “factory model” of education our system is built on.

The reality is that the education system too often fails at very basic delivery, period. I would love to see more folks draw a line in the sand of their minimum basic requirements, and not in an outrageous, political winky-wink where they wrap their ideal in the language of the minimum. Let’s have a deep discussion right now about the minimum basic requirements and let’s get relentless about making that happen without the distraction of the dream. Frankly, whatever your dream is, so long as it involves kids going somewhere to learn 1, if we can’t deliver on the basics it will be dead on arrival.


  1. Of course, for a group of folks who are engaged in Dreamschooling, we cannot take for granted that schools will be places or that children will be students in any traditional sense of the word. However, I believe that if we have a frank conversation about the minimum expectations for education I suspect this will not be a particularly widely held sentiment. If our technofuturism does complete its mindmeld with the anarcho-____ movements on the left and right to lead to a dramatically different conceptualization of childhood in the developed world in my lifetime… ↩︎

January 6, 2014

James over at TransportPVD has a great post today talking about a Salt Lake City ordinance that makes property owners responsible for providing a bond that funds the landscaping and maintenance of vacant lots left after demolition. I love this as much as he does and would probably add several other provisions (like forfeiting any tax breaks on that property or any other property in the city and potentially forfeiture of the property itself if a demolition was approved based on site plans that are not adhered to within a given time frame). Ultimately, I do think the best solution to surface parking where it doesn’t belong, of either the temporary or permanent (and isn’t it all actually permanent?) kind, is a land value tax.

James goes one step further and suggests that we should adopt some similar rules around ALL parking developments and proposes a few. His hope was that a mayoral candidate would chime in. For now, he will have to make do with me.

His recommendations are somewhat specific to the commission looking at building a state-funded parking garage in front of the Garrahy Complex in Downcity, about which many urbanists and transit advocates have expressed reservations or outright rejection. They are:

  1. The garage is parking neutral. As many spots need to be removed from the downtown as are added.
  2. An added bonus would be if some of the spots removed were on-street ones, to create protected bike lanes or transit lanes with greenery separating them from car traffic.
  3. The garage has the proposed bus hub.
  4. There are ground-level shops.
  5. The garage is left open 24-hours so that it can limit the need for other lots (this happens when a garage is used only during the day, or only at night, instead of letting it serve both markets).
  6. Cars pay full market price to park.

(Note: I’ve numbered rather than kept the bullets of the original to make responding easier.)

I disagree with the first and second points, which are really one and the same. We are in a district that has tremendously underutilized land. We want that space to be developed, and as a result of that development we expect there to be a much increased need for transit capacity. The goal should be both to increase accessibility and to increase the share of transit capacity offered by walking, biking, or riding a bus or light rail. This does not require that we demand a spot-for-spot swap when building a public garage. I agree with the sentiment but disagree with the degree. Part of building rules and policies like this is to ensure comprehensive consideration of the transit context when developing parking. I see no reason to assume a priori that garages should only be permitted if they eliminate the same number of spaces they create.

The reason I combine these two points is that the city does not have the ability to remove off-street parking that is not publicly owned. Investing in garages with smaller footprints that have to be built taller and provide no change in capacity probably makes no sense at all. If we’re going to build any kind of public garage at all, it should be with the goal of consolidating parking into infrastructure with reasonable land utilization. We would rather have 3 or 4 large, properly located garages than all of the current lots. Limiting their size to the flexibility gained by reducing on-street parking or reusing the footprint of existing lots doesn’t achieve that, and it doesn’t account for the orders-of-magnitude changes in capacity we should expect to need across all transit modes in the next 20 years.

On point three, I am skeptical. I like the idea of improving bus infrastructure when building parking infrastructure in general. In fact, I voted against the $40M Providence road paving bond even though that was much needed maintenance. My rationale was purely ideological– we should not use debt to pay for car maintenance without also investing in ways to reduce future maintenance costs through better utilization of those roads. However, I have a hard time believing that the Garrahy location is any good as a bus hub. If RIPTA had done a great job identifying the need for an additional bus hub and the Garrahy location met the criteria, I would think it a reasonable idea. Short of that, it feels like throwing the transit community a wasteful bone.

I mostly agree on point four, but I doubt it will be at the scale James would like to see. I think an appropriate level is probably not that different from the recently erected Johnson and Wales garage. The reality is that street-level retail is the right form, but there isn’t sufficient foot traffic to support it right now and won’t be for some time. There has to be street-level activation of any garage built in this area, but the square footage is likely fairly timid.

I absolutely agree with point five, without qualification. Not a dime should be spent on a public parking spot that is closed at any point in time, anywhere in the city. I would actually ditto this for surface parking lots on commercial properties of any kind after business hours. Not only should they have to be open, they should have to post signs indicating the hours of commercial activity when parking is restricted and the hours when parking is available to the public. These hours of operation should require board approval. Owners could choose to charge during these off hours, but cars must be able to access the lot.

And point six should be a given for any public parking.

The real problem with Garrahy, in my opinion, is that the cost is absurd, likely to be at least $35,000 per space. There is plenty of existing parking, suggesting the demand right now is illusory, and the market rate for those spots means the investment is unlikely ever to be recovered. In a world with limited capacity for government spending on transit as a public good, I would rather subsidize transit infrastructure that benefits the poor and directly increases the share of non-car transit as it adds capacity. Spending limited funds on parking infrastructure is ludicrous when demand isn’t sufficient to recover the investment. We already more than sufficiently subsidize parking in the area. And of course, the “study commission” is not really a study– it’s a meeting convened by those who want the project to happen, putting the required usual suspects in the room to tepidly rubber-stamp it. At least that’s my cynical take.

December 9, 2013

We find that public schools offered practically zero return education on the margin, yet they did enjoy significant political and financial support from local political elites, if they taught in the “right” language of instruction.

One thing that both progressives and libertarians agree upon is that the social goals of education are woefully underappreciated and under-considered in the current school reform discussion. Both school choice and local, democratic control of schools are reactions to centralization resulting in “elites… [selecting] the ‘right’ language of instruction.”

I am inclined to agree with neither.

December 3, 2013

Update

Turns out the original code below was pretty messed up. All kinds of little errors I didn’t catch. I’ve updated it below. There are a lot of options to refactor this further that I’m currently considering. Sometimes it is really hard to know just how flexible something this big really should be. I think I am going to wait until I start developing tests to see where I land. I have a feeling moving toward a more test-driven work flow is going to force me toward a different structure.

I recently updated the function I posted about back in June that calculates the difference between two dates in days, months, or years in R. It is still surprising to me that difftime can only return units from seconds up to weeks. I suspect this has to do with the challenge of properly defining a “month” or “year” as a unit of time, since these are variable.

While there was nothing wrong with the original function, it did irk me that it always returned an integer. In other words, the function returned only complete months or years. If the start date was 2012-12-13 and the end date was 2013-12-03, the function would return 0 years. Most of the time, this is the behavior I expect when calculating age. But it is completely reasonable to want to include partial years or months, e.g. in the aforementioned example returning 0.9724605.

So after several failed attempts because of silly errors in my algorithm, here is the final code. It will be released as part of eeptools 0.3, which should be available on CRAN soon 1.

age_calc <- function(dob, enddate=Sys.Date(), units='months', precise=TRUE){
  if (!inherits(dob, "Date") | !inherits(enddate, "Date")){
    stop("Both dob and enddate must be Date class objects")
  }
  start <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)
  if(precise){
    start_is_leap <- ifelse(start$year %% 400 == 0, TRUE, 
                        ifelse(start$year %% 100 == 0, FALSE,
                               ifelse(start$year %% 4 == 0, TRUE, FALSE)))
    end_is_leap <- ifelse(end$year %% 400 == 0, TRUE, 
                        ifelse(end$year %% 100 == 0, FALSE,
                               ifelse(end$year %% 4 == 0, TRUE, FALSE)))
  }
  if(units=='days'){
    result <- difftime(end, start, units='days')
  }else if(units=='months'){
    months <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='months', SIMPLIFY=FALSE), 
                     length) - 1
    # length(seq(start, end, by='month')) - 1
    if(precise){
      month_length_end <- ifelse(end$mon==1 & end_is_leap, 29,
                                 ifelse(end$mon==1, 28,
                                        ifelse(end$mon %in% c(3, 5, 8, 10), 
                                               30, 31)))
      month_length_prior <- ifelse((end$mon-1)==1 & start_is_leap, 29,
                                   ifelse((end$mon-1)==1, 28,
                                          ifelse((end$mon-1) %in% c(3, 5, 8, 10), 
                                                 30, 31)))
      month_frac <- ifelse(end$mday > start$mday,
                           (end$mday-start$mday)/month_length_end,
                           ifelse(end$mday < start$mday, 
                            (month_length_prior - start$mday) / 
                                month_length_prior + 
                                end$mday/month_length_end, 0.0))
      result <- months + month_frac
    }else{
      result <- months
    }
  }else if(units=='years'){
    years <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='years', SIMPLIFY=FALSE), 
                     length) - 1
    if(precise){
      start_length <- ifelse(start_is_leap, 366, 365)
      end_length <- ifelse(end_is_leap, 366, 365)
      year_frac <- ifelse(start$yday < end$yday,
                          (end$yday - start$yday)/end_length,
                          ifelse(start$yday > end$yday, 
                                 (start_length-start$yday) / start_length +
                                end$yday / end_length, 0.0))
      result <- years + year_frac
    }else{
      result <- years
    }
  }else{
    stop("Unrecognized units. Please choose years, months, or days.")
  }
  return(result)
}
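As a quick check against the example above, calls to the finished function should look roughly like this (the expected values are the ones quoted earlier in this post, not independently verified):

dob <- as.Date('2012-12-13')
end <- as.Date('2013-12-03')

age_calc(dob, end, units = 'years', precise = FALSE)  # 0 complete years
age_calc(dob, end, units = 'years', precise = TRUE)   # roughly 0.9724605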

  1. I should note that my mobility function will also be included in eeptools 0.3. I know I still owe a post on the actual code, but it is such a complex function I have been having a terrible time trying to write clearly about it. ↩︎

December 2, 2013

PISA Results

I wanted to call attention to these interesting PISA results. It turns out that student anxiety in the United States is lower than the OECD average and belief in ability is higher 1. I thought that all of the moves in education since the start of standards-based reform were supposed to be generating tremendous anxiety and failing to produce students with a high sense of self-efficacy?

It is also worth noting that students in the United States were more likely to skip out on school, and this had a higher-than-typical impact on student performance. One interpretation could be that students are less engaged, but another is that schooling activities have a large impact on students, rather than schools being of lesser importance than student inputs.

I have always had a hard time reconciling the calls for higher teacher pay and better working conditions, and the evidence that missing even just 10% of schooling has a huge impact on student outcomes, with the belief that addressing other social inequities is the key way to achieve better outcomes for kids.

This is all an exercise in nonsense. It is incredibly difficult to transfer findings from surveys across dramatic cultural differences. It is also hard to imagine what can be learned about the delivery of education in the dramatically different contexts that exist. The whole international comparison game seems like one big Rorschach test where the price of admission is leaving any understanding of culture, context, and external validity at the door.

P.S.: The use of color in this visualization is awful. There is a sense that they are trying to be “value neutral” with data that is ordinal in nature (above, same, or below), and in doing so chose two colors that are very difficult to distinguish between. Yuck.


  1. The site describes prevalence of anxiety as the “proportion of students who feel helpless when faced with math problems” and belief in ability as the “proportion of students who feel confident in their math abilities”. Note, based on these definitions, one might also think that either curricula were not so misaligned with international benchmarks or that we are already seeing the fruits of the partial transition to Common Core. Not knowing the trend for this data, or some of the specifics about the collection instrument, makes that difficult to assess. ↩︎

November 22, 2013

Although it clocks in at 40+ pages, this is a worthwhile and relatively fast read for anyone in education policy on the future of assessment if we’re serious about college and career readiness. There is a ton to unpack, with a fair amount I agree with and a lot I am quite a bit less sure about.

I think this paper is meant for national and state-level policy-makers, and so my major quibble is that I think it is much more valuable for a district-level audience. I am less bullish on the state’s role in building comprehensive assessment systems. That’s just my initial reaction.

The accountability section is both less rich and less convincing than the assessment portion. I have long heard cries for so-called reciprocal accountability, but it is still entirely unclear to me what this means, what it looks like, and what the implications are for current systems.

November 20, 2013

“We are trying to work towards late-exit ELL programs so (students) can learn the concepts in (their) native language,” Lusi said. Administrative goals have recently shifted to a focus on proficiency in both languages because bilingual education is preferred, she added.

But instituting district-wide bilingual education would require funding to hire teachers certified in both languages and to buy dual-language materials, she said.

I am pretty sure this is new. I am surprised there has not been a stronger effort to pass a legislative package in Rhode Island that provides both the policy framework and the funding necessary to achieve universal bilingual education for English language learners in RI schools.

One of the great advantages of transitioning to common standards 1 is that there should be greater availability of curricular materials in languages other than English. I suspect most of what is needed for bilingual education is start-up money for materials, curriculum supports and development, and assessment materials. There are a few policy things that need to be in place, possibly around state exams, but also rules around flexible teacher assignment, hiring, and dismissal as staffing needs dramatically change.

Someone should be putting this package together. I suspect there would be broad support.


  1. Note, this is not necessarily a feature of the Common Core State Standards, just having standards in common with many other states. ↩︎

November 19, 2013

De Blasio and his advisers are still figuring out how much rent to charge well-funded charter schools, his transition team told me. “It would depend on the resources of the charter school or charter network,” he told WNYC, in early October. “Some are clearly very, very well resourced and have incredible wealthy backers. Others don’t. So my simple point was that programs that can afford to pay rent should be paying rent.” (In an October debate with the Republican candidate Joseph Lhota, he put it more bluntly: “I simply wouldn’t favor charters the way Mayor Bloomberg did because, in the end, our city rises or falls on our traditional public schools.”)

My impression of de Blasio was that he went around collecting every plausible complaint from every interest group that was mad at Bloomberg and promised whatever they wanted. There didn’t really seem to be a coherent theory or any depth whatsoever to his policy prescriptions.

Already working hard to confirm this impression.

November 18, 2013

To recap, the first study discussed above established that children from disadvantaged backgrounds know less about a topic (i.e., birds) than their middle-class peers. Next, in study two, the researchers showed that differences in domain knowledge influenced children’s ability to understand words out of context, and to comprehend a story. Moreover, poor kids — who also had more limited knowledge — perform worse on these tasks than did their middle class peers. But could additional knowledge be used to level the playing field for children from less affluent backgrounds?

In study three, the researchers held the children’s prior knowledge constant by introducing a fictitious topic — i.e., a topic that was sure to be unknown to both groups. When the two groups of children were assessed on word learning and comprehension related to this new domain, the researchers found no significant differences in how poor and middle-class children learned words, comprehended a story or made inferences.

One of the “old” divides in education, from before the current crop of “edreform”, is whether or not content matters. Broadly, there are two camps, let’s call them “Facts” and “Skills”, with the “Skills” camp clearly ahead in terms of mind share.

“Skills” is based on a fundamentally intuitive insight– students need to know how to do things, not about the things themselves. In many ways it is built on our common experience of forgetting facts over time. We need 21st century skills, not an accumulation of specific, privileged knowledge that fades over time. Whatever the latest technology, from encyclopedias to calculators through to Google, each generation decides that the tools adults use end the necessity of knowing about things rather than knowing how to find things.

This is very attractive. It seems to match our adult experiences accumulating knowledge and using it in our work. It seems to address students’ boredom with learning irrelevant information. It leaves space for groups to advocate for teaching whatever content they want since everyone can argue that content is fundamentally limited in value.

In classic “turns out” fashion, however, the evidence keeps mounting that one must teach from the “Facts” approach to achieve the goals of the “Skills” position.

Turns out: skills and knowledge do not transfer well across domains. There is little evidence that learning how to read literary fiction translates to reading technical manuals with comprehension. In other words, critical thinking is not really an independent ability free of domain context 1. In fact, experts are able to learn more quickly, but only in their domain and only when they have prior knowledge to use as scaffolding 2.

Turns out: reading comprehension is strongly connected to whether or not students have prior knowledge (“Facts”) about the topic of the passage 3. Reading techniques only provide modest assistance for comprehension.

Turns out: privileging skills over content may have a serious differential impact on disadvantaged children. A well-intentioned goal of achieving equity through equality has led many to argue that we do a disservice to children of color and children in poverty because their schools have not as completely embraced a “Skills” world and are too focused on “Facts”. The problem is that the deep disparities we see when these students enter schooling point to their having less prior knowledge than their peers 4.

What is remarkable, and tragic, is that the “Skills” camp has maintained its dominance through the demonization of “Facts”, with dramatic misinterpretations like:

  1. The “Facts” folks are just White colonialists seeking to maintain existing power structures through teaching the information of privilege.
  2. The “Facts” folks privilege memorization, rote learning, and recall-based assessment over other pedagogy that is more engaging and authentic.
  3. The “Facts” folks can only ever teach what was important yesterday; “Skills” camp can teach what matters to become a lifelong learner for tomorrow’s world.

None of these are true.

This post is largely brought to you by: E.D. Hirsch, Dan T. Willingham, and Malcolm Gladwell via Merlin Mann.


  1. http://www.aft.org/pdfs/americaneducator/summer2007/Crit_Thinking.pdf ↩︎

  2. http://www.ncbi.nlm.nih.gov/pubmed/11550744 ↩︎

  3. http://www.aft.org/newspubs/periodicals/ae/spring2006/willingham.cfm ↩︎

  4. This has pretty much been the thrust behind the work of E.D. Hirsch, who has been accused of being on the far right in education despite his consistent belief that education equity is one of the most important goals to achieve. His firm belief, and I am mostly convinced, is that explicit factual content is the key tool by which teaching can dramatically improve educational equity. ↩︎

  1. More schooling, reoriented calendar
  2. Wider range of higher education
  3. Cheaper four-year degrees
  4. Eliminate property tax-based public education

This is an interesting list. I don’t agree with number four. There are several benefits to using property taxes, not the least of which is their stability and lagged response during traditional economic downturns. However, there are many things we should do to reform our revenue system for education. I am keen on more taxes on “property”, using land value taxes that are levied either statewide or regionally to address some of the inequities traditional, highly localized property taxes can lead to.

November 17, 2013

If I had to point to the key fissure in the education policy and research community it would be around poverty. Some seem to view it as an inexorable obstacle, deeply believing that the key improvement strategy is to decrease inequity of inputs. Some seem to view it as an obstacle that can be overcome by systems functioning at peak efficacy, deeply believing the great challenge is achieving that efficacy sustainably at scale. Both positions seem to grossly simplify causes and suggest policy structures and outcomes that are unachievable.

Paraphrasing Merlin Mann, always be skeptical of “turns out” research. In this case, are the results really that surprising? If they are, I might suggest that you have been focusing too much on the partial equilibrium impact of poverty and ignoring the bigger picture.

Not that I think integration is likely, easy, quick, or magically fixes things.

October 7, 2013

I spent most of high school writing, practicing, and performing music. I played guitar in two separate bands, was the lead vocalist in one of them, and played trumpet in various wind ensembles and the jazz band at school. When I wasn’t a part of the creation process myself, there is a pretty good chance I was listening to music. Back then, it seemed trivial to find a new artist or album to obsess over.

Despite being steeped in music, I have always found it hard to write about. The truth is, I have limited ability to use words to explain just what makes a particular piece of music so wonderful. Oh sure, I could discuss structure, point out a particular hook in a particular section and how it sits in the mix. I could talk about the tone of the instrument or about quality of the performance or any number of other things. The problem with this language is it reduces what is great about this piece of music to a description that could easily fit some other piece of music. Verbalizing the experience of music projects a woefully flattened artifact of something breathtaking.

Now it might seem that recorded music has greatly diminished this challenge. After all, the experience of recorded music can scale– anyone can listen. Unfortunately, I found this to be completely untrue. When I play music for other people, it actually sounds different than when I experience it for myself. Little complexities that seem crucial to the mix seem to cower and hide rather than loom large in the presence of others. It is not really feasible to point out what makes the song so great while listening, because it disrupts the experience. Worst of all, no one else seems to experience what I experience when I listen.

Of course, all of this may seem obvious to someone who has read about aesthetics. I have not.

September 22, 2013

In a couple of previous posts, I outlined the importance of documenting business rules for common education statistics and described my take on how to best calculate student mobility. In this post, I will be sharing two versions of the R function I wrote to implement this mobility calculation, reviewing their different structures and methods to reveal how I achieved an order-of-magnitude speed-up between the two versions. 1 At the end of this post, I will propose several future routes for optimization that I believe should lead to the ability to handle millions of student records in seconds.

Version 0: Where Do I Begin?

The first thing I tend to do is whiteboard the rules I want to use through careful consideration and constant referral back to real data sets. By staying grounded in the data, I am less likely to encounter unexpected situations during my quality control. It also makes it much easier to develop test data, since I seek out outlier records in actual data during the business rule process.

Developing test data is a key part of the development process. Without a compact, but sufficiently complex, set of data to try with a newly developed function, there is no way to know whether or not it does what I intend.

Recall the business rules for mobility that I have proposed, all of which came out of this whiteboarding process:

  1. Entering the data with an enroll date after the start of the year counts as one move.
  2. Leaving the data with an exit date before the end of the year counts as one move.
  3. Changing schools sometime during the year without a large gap in enrollment counts as one move.
  4. Changing schools sometime during the year with a large gap in enrollment counts as two moves.
  5. Adjacent enrollment records for the same student in the same school without a large gap in enrollment does not count as moving.

Test data needs to represent each of these situations so that I can confirm the function is properly implementing each rule.

Below is a copy of my test data. As an exercise, I recommend determining the number of “moves” each of these students should be credited with after applying the above stated business rules.

| Unique Student ID | School Code | Enrollment Date | Exit Date  |
|-------------------|-------------|-----------------|------------|
| 1000000           | 10101       | 2012-10-15      | 2012-11-15 |
| 1000000           | 10103       | 2012-01-03      | 2013-03-13 |
| 1000000           | 10103       | 2012-03-20      | 2013-05-13 |
| 1000001           | 10101       | 2012-09-01      | 2013-06-15 |
| 1000002           | 10102       | 2012-09-01      | 2013-01-23 |
| 1000003           | 10102       | 2012-09-15      | 2012-11-15 |
| 1000003           | 10102       | 2013-03-15      | 2013-06-15 |
| 1000004           | 10103       | 2013-03-15      | NA         |
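If you want to follow along in R, the table above can be reconstructed as a data.frame along these lines (the column names here match the Version 1 defaults discussed below and are my choice, not part of the original data):

test_data <- data.frame(
  sid   = c(1000000, 1000000, 1000000, 1000001, 1000002, 1000003, 1000003, 1000004),
  schid = c(10101, 10103, 10103, 10101, 10102, 10102, 10102, 10103),
  enroll_date = as.Date(c('2012-10-15', '2012-01-03', '2012-03-20', '2012-09-01',
                          '2012-09-01', '2012-09-15', '2013-03-15', '2013-03-15')),
  exit_date   = as.Date(c('2012-11-15', '2013-03-13', '2013-05-13', '2013-06-15',
                          '2013-01-23', '2012-11-15', '2013-06-15', NA))
)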

Version 1: A Naïve Implementation

Once I have developed business rules and a test data set, I like to quickly confirm that I can produce the desired results. That’s particularly true when it comes to implementing a new, fairly complex set of business rules. My initial implementation of a new algorithm does not need to be efficient, easily understood, or maintainable. My goal is simply to follow my initial hunch on how to accomplish a task and get it working. Sometimes this naïve implementation turns out to be pretty close to my final implementation, but sometimes it can be quite far off. The main things I tend to improve with additional work are extensibility, readability, and performance.

In the case of this mobility calculation, I knew almost immediately that my initial approach was not going to have good performance characteristics. Here is a step by step discussion of Version 1.

Function Declaration: Parameters

moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sid', 
                       schid='schid',
                       enroll_date='enroll_date',
                       exit_date='exit_date'){

I named my function moves_calc() to match the style of age_calc() which was submitted and accepted to the eeptools package. This new function has eight parameters.

df: a data.frame containing the required data to do the mobility calculation.

enrollby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the start of the school year. Students whose first enrollment is after this date will have an additional move under the assumption that they enrolled somewhere prior to the first enrollment record in the data. This date need not (and likely should not) match the actual first day of the school year.

exitby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the end of the school year. Students whose last exit is before this date will have an additional move under the assumption that they enrolled somewhere after this exit record that is excluded from the data. This date need not (and likely should not) match the actual last day of the school year.

gap: an atomic vector of type numeric that signifies how long a gap must exist between student records to record an additional move for that student under the assumption that they enrolled somewhere in between the two records in the data that is not recorded.

sid: an atomic vector of type character that represents the name of the vector in df that contains the unique student identifier. The default value is 'sid'.

schid: an atomic vector of type character that represents the name of the vector in df that contains the unique school identifier. The default value is 'schid'.

enroll_date: an atomic vector of type character that represents the name of the vector in df that contains the enrollment date for each record. The default value is 'enroll_date'.

exit_date: an atomic vector of type character that represents the name of the vector in df that contains the exit date for each record. The default value is 'exit_date'.

Most of these parameters are about providing flexibility around the naming of attributes in the data set. Although I often write functions for my own work which accept data.frames, I cannot help but feel this is a bad practice. Assuming particular data attributes of the right name and type does not make for generalizable code. To make up for my shortcoming in this area, I have done my best to allow other users to enter whatever data column names they want, so long as they contain the right information to run the algorithm.
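For example, a call against a data set with differently named columns might look like the following sketch (the data.frame enr and its column names are hypothetical, purely for illustration):

moves <- moves_calc(enr,
                    enrollby = '2012-09-15',
                    exitby = '2013-06-01',
                    gap = 14,
                    sid = 'student_id',
                    schid = 'school_code',
                    enroll_date = 'entry_date',
                    exit_date = 'withdrawal_date')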

The next portion of the function loads some of the required packages and is common to many of my custom functions:

if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
require(data.table)

if("plyr" %in% rownames(installed.packages()) == FALSE){
    install.packages("plyr")
  } 
require(plyr)

Type Checking and Programmatic Defaults

Next, I do extensive type-checking to make sure that df is structured the way I expect it to be in order to run the algorithm. I do my best to supply humane warning() and stop() messages when things go wrong, and in some cases I set default values that may help the function run even if it is not called properly.

if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
    stop("Both enroll_date and exit_date must be Date objects")

The enroll_date and exit_date both have to be Date objects. I could have attempted to coerce those vectors into Date types using as.Date(), but I would rather not assume something like the date format. Since enroll_date and exit_date are the most critical attributes of each student, the function will stop() if they are the incorrect type, informing the analyst to clean up the data.

if(missing(enrollby)){
   enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
     enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                               '-09-15', sep=''), format='%Y-%m-%d')
     warning(paste("enrollby must be a string with format %Y-%m-%d,",
                   "defaulting to", 
                   enrollby, sep=' '))
  }else{
    enrollby <- as.Date(enrollby, format="%Y-%m-%d")
  }
}
if(missing(exitby)){
  exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                          '-06-01', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
    exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                              '-06-01', sep=''), format='%Y-%m-%d')
    warning(paste("exitby must be a string with format %Y-%m-%d,",
                  "defaulting to", 
                  exitby, sep=' '))
  }else{
    exitby <- as.Date(exitby, format="%Y-%m-%d")
  }
}
if(!is.numeric(gap)){
  gap <- 14
  warning("gap was not a number, defaulting to 14 days")
}

For maximum flexibility, I have parameterized the enrollby, exitby, and gap used by the algorithm to determine student moves. An astute observer of the function declaration may have noticed I did not set default values for enrollby or exitby. This is because these dates are naturally going to be different with each year of data. As a result, I want to enforce their explicit declaration.

However, we all make mistakes. So when I check to see if enrollby or exitby are missing(), I do not stop the function if it returns TRUE. Instead, I set the value of enrollby to September 15 in the year that matches the minimum (first) enrollment record and exitby to June 1 in the year that matches the maximum (last) exit record. I then pop off a warning() that informs the user of the expected format for each parameter and the values I have defaulted them to. I chose to use warning() because many R users set their environment to halt on warnings. Warnings are generally not good and should be pursued and fixed. No one should depend upon the defaulting process I use in the function. But the defaults that can be determined programmatically are sensible enough that I did not feel the need to always halt the function in its place.

I also check to see if gap is, in fact, defined as a number. If not, I throw a warning() after setting gap equal to the default value of 14.

Is this all of the type and error-checking I could have included? Probably not, but I think this represents a very sensible set that make this function much more generalizable outside of my coding environment. This kind of checking may be overkill for a project that is worked on independently and with a single data set, but colleagues, including your future self, will likely be thankful for their inclusion if any of your code is to be reused.
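The missing()-plus-warning() pattern itself is generic. A minimal toy sketch of the same idea (not code from the function):

f <- function(x, cutoff) {
  if (missing(cutoff)) {
    cutoff <- 10
    warning("cutoff not supplied, defaulting to 10")
  }
  x > cutoff
}
f(12)      # TRUE, with a warning about the default
f(12, 20)  # FALSE, no warning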

Initializing the Results

output <- data.frame(id = as.character(unique(df[[sid]])),
                     moves = vector(mode = 'numeric', 
                                    length = length(unique(df[[sid]]))))
output <- data.table(output, key='id')
df <- arrange(df, sid, enroll_date)

My naïve implementation uses a lot of for loops, a no-no when it comes to R performance. One way to make for loops a lot worse, and this is true in any language, is to grow and reassign an object within the loop. This means each iteration carries the overhead of creating and assigning that object. Especially when we are building up results for each observation, it is silly to do this. We know exactly how big the result will be and therefore only need to create the object once. We can then assign a much smaller part of that object (in this case, one value in a vector) rather than the whole object (a honking data.table).

Our output object is what the function returns. It is a simple data.table containing all of the unique student identifiers and the number of moves recorded for each student.

The last line in this code chunk ensures that the data are arranged by the unique student identifier and enrollment date. This is key since the for loops assume that they are traversing a student’s record sequentially.
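The pre-allocation point is general enough to show with a toy comparison (mine, not from the post) of growing a result versus filling a pre-sized one:

n <- 1e4
grow <- c()
for (i in 1:n) grow <- c(grow, i * 2)  # reallocates and copies the vector every pass

fill <- numeric(n)                     # allocate the full result once
for (i in 1:n) fill[i] <- i * 2        # assign into the pre-sized vector
identical(grow, fill)                  # TRUE, but the second loop is far faster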

Business Rule 1: The Latecomer

for(i in 1:(length(df[[sid]])-1)){
  if(i>1 && df[sid][i,]!=df[sid][(i-1),]){
    if(df[['enroll_date']][i]>enrollby){
      output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }else if(i==1){
    if(df[['enroll_date']][i]>enrollby){
    output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }

The first bit of logic checks whether sid in row i is not equal to the sid in row i-1. In other words, is this the first time we are observing this student? If it is, then row i is the first observation for that student and therefore has the minimum enrollment date. The enroll_date is checked against enrollby. When enroll_date is after enrollby, the moves attribute for that sid is incremented by 1. 2

Now, I didn’t really mention the conditional that i>1. This is needed because there is no i-1 observation for the very first row of the data.table. Therefore, i==1 is a special case where we once again perform the same check of enroll_date against enrollby. The i>1 condition comes before the && operator, which ensures the statement after the && is not evaluated when the first conditional is FALSE. This avoids an “out of bounds”-type error where R tries to check df[0].
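A toy illustration (mine, not from the post) of why the short-circuiting && matters here:

x <- c(10, 20, 30)
i <- 1
i > 1 && x[i] != x[i - 1]  # FALSE; the right-hand side, and thus x[0], is never evaluated
i > 1 &  x[i] != x[i - 1]  # logical(0); & evaluates both sides, so x[0] is touched

Inside an if() statement, the second form would fail with an “argument is of length zero” error, which is exactly the kind of problem the && ordering avoids.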

Business Rule 5: The Feint

Yeah, yeah– the business rule list above doesn’t match the order of my function. That’s ok. Remember, sometimes giving instructions to a computer does not follow the way you would organize instructions for humans.

Remember, the function is traversing through our data.frame one row at a time. First I checked to see if the function is at the first record for a particular student. Now I check to see if there are any records after the current record.

  if(df[sid][i,]==df[sid][(i+1),]){
    if(as.numeric(difftime(df[['enroll_date']][i+1], 
                           df[['exit_date']][i], units='days')) < gap &
       df[schid][(i+1),]==df[schid][i,]){
        next
    }else if ...

For the case where the i+1 record has the same sid, the enroll_date of i+1 minus the exit_date of i is checked against gap. If it is less than gap and the schid of i+1 is the same as that of i, then next, which skips to the next iteration of the loop without altering moves. In other words, students who return to the same school with only a few days between exiting and re-enrolling are not counted as having moved.

The ... above is not the special ... in R, rather, I’m continuing that line below.

Business Rule 3: The Smooth Mover

  }else if(as.numeric(difftime(df[['enroll_date']][i+1], 
                               df[['exit_date']][i], 
                               units='days')) < gap){
    output[as.character(df[[sid]][i]), moves:=moves+1L] 
  }else{ ...

Here we have the simple case where a student has moved to another school (recall, this is still within the if conditional where the next record is the same student as the current record) with a very short period of time between the exit_date at the current record and the enroll_date of the next record. This is considered a “seamless” move from one school to another, and therefore that student’s moves are incremented by 1.

Business Rule 4: The Long Hop

Our final scenario for a student moving between schools is when the gap between the exit_date at the i school and the enroll_date at the i+1 school is large, defined as > gap. In this scenario, the assumption is that the student moved to a jurisdiction outside of the data set, such as out of district for district-level data or out of state for state level data, and enrolled in at least one school not present in their enrollment record. The result is these students receive 2 moves– one out from the i school to a missing school and one in to the i+1 school from the missing school.

The code looks like this (again a repeat from the else{... above which was not using the ... character):

  }else{
    output[as.character(df[[sid]][i]), moves:=moves+2L] 
  }
}else...

This ends with a } which closes the if conditional that checked if the i+1 student was the same as the i student, leaving only one more business rule to check.

Business Rule 2: The Early Summer

}else{
  if(is.na(df[['exit_date']][i])){
    next
  }else if(df[['exit_date']][i] < exitby){
        output[as.character(df[[sid]][i]), moves:=moves+1L]
  }
}

Recall that this else block is only reached if the sid of the i+1 record is not the same as that of i. This means this is the final entry for a particular student. First, I check to see if that student has a missing exit_date and, if so, charge no move to the student, using the next statement to skip to the next iteration of the loop. Students never have a missing enroll_date in any of the data I have seen over 8 years. This is because most systems minimally autogenerate the enroll_date from the current date when a student first enters a student information system. However, sometimes districts forget to properly exit a student and are unable to supply an accurate exit_date. In a very small number of cases I have seen these missing dates, so I do not want the function to fail in this scenario. My solution here was simply to skip to the next iteration of the loop.

Finally, I apply the last rule, which compares the final exit_date for a student to exitby, incrementing moves if the student left prior to the end of the year and likely enrolled elsewhere before the summer.

The last step is to close the for loop and return our result:

  }
  return(output)
}

Version 2: 10x Speed And More Readable

The second version of this code is vastly quicker.

The opening portion of the code, including the error checking is essentially a repeat of before, as is the initialization of the output.

moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sasid', 
                       schid='schno',
                       enroll_date='enroll_date',
                       exit_date='exit_date'){
  if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
  require(data.table)
  if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
      stop("Both enroll_date and exit_date must be Date objects")
  if(missing(enrollby)){
    enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
      enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                                '-09-15', sep=''), format='%Y-%m-%d')
      warning(paste("enrollby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    enrollby, sep=' '))
    }else{
      enrollby <- as.Date(enrollby, format="%Y-%m-%d")
    }
  }
  if(missing(exitby)){
    exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                            '-06-01', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
      exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                                '-06-01', sep=''), format='%Y-%m-%d')
      warning(paste("exitby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    exitby, sep=' '))
    }else{
      exitby <- as.Date(exitby, format="%Y-%m-%d")
    }
  }
  if(!is.numeric(gap)){
    gap <- 14
    warning("gap was not a number, defaulting to 14 days")
  }
  output <- data.frame(id = as.character(unique(df[[sid]])),
                       moves = vector(mode = 'numeric', 
                                      length = length(unique(df[[sid]]))))

Where things start to get interesting is in the calculation of the number of student moves.

Handling Missing Data

One of the clever bits of code I forgot about when I initially tried to refactor Version 1 appears under “Business Rule 2: The Early Summer”. When the exit_date is missing, this code simply breaks out of the loop:

  if(is.na(df[['exit_date']][i])){
    next

Because the new code will not be using for loops or really much of the basic control flow at all, I had to devise a different way to treat missing data. The steps to apply the business rules that I present below would fail spectacularly with missing data.

So the first thing that I do is select the students who have missing data, assign the moves in the output to NA, and then subset the data to exclude these students.

incomplete <- df[!complete.cases(df[, c(enroll_date, exit_date)]), ]
if(dim(incomplete)[1]>0){
  output[which(output[['id']] %in% incomplete[[sid]]),][['moves']] <- NA
}
output <- data.table(output, key='id')
df <- df[complete.cases(df[, c(enroll_date, exit_date)]), ]
dt <- data.table(df, key=sid)
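complete.cases() simply flags rows with no missing values in the selected columns. A tiny example (not from the post):

toy <- data.frame(enroll_date = as.Date(c('2012-09-01', '2013-03-15')),
                  exit_date   = as.Date(c('2013-06-15', NA)))
complete.cases(toy[, c('enroll_date', 'exit_date')])  # TRUE FALSE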

Woe with data.table

Now with the data complete and in a data.table, I have to do a little bit of work to assist with my frustrations with data.table. Because data.table does a lot of work with the [ operator, I find it very challenging to use a string argument to reference a column in the data. So I just gave up and internally rename these attributes.

dt$sasid <- as.factor(as.character(dt$sasid))
setnames(dt, names(dt)[which(names(dt) %in% enroll_date)], "enroll_date")
setnames(dt, names(dt)[which(names(dt) %in% exit_date)], "exit_date")

Magic with data.table: Business Rules 1 and 2 in two lines each

Despite my challenges with the way that data.table re-imagines [, it does allow for clear, simple syntax for complex processes. Gone are the for loops and conditional blocks. How does data.table allow me to quickly identify whether or not a student’s first or last enrollment is before or after my cutoffs?

first <- dt[, list(enroll_date=min(enroll_date)), by=sid]
output[id %in% first[enroll_date>enrollby][[sid]], moves:=moves+1L]
last <- dt[, list(exit_date=max(exit_date)), by=sid]  
output[id %in% last[exit_date<exitby][[sid]], moves:=moves+1L]

The first line creates a data.table with the student identifier and a new enroll_date column that is equal to the minimum enroll_date for that student.

The second line is very challenging to parse if you’ve never used data.table. The first argument for [ in data.table is a subset/select function. In this case,

id %in% first[enroll_date>enrollby][[sid]]

means,

Select the rows in first where the enroll_date attribute (which was previously assigned as the minimum enroll_date) is after the function argument enrollby, and check whether the id of output is in the resulting sid vector.

So output is being subset to only include those records that meet that condition, in other words, the students who should have a move because they entered the school year late.

The second argument of [ for data.tables is explained in this footnote 2 if you’re not familiar with it.

Recursion. Which is also known as recursion.

The logic for Business Rules 3-5 is substantially more complex. At first it was not plainly obvious how to avoid a slow for loop for this process. Each of the rules on switching schools requires an awareness of context– how does one record for a student compare to the very next record for that student?

The breakthrough was thinking back to my single semester of computer science and the concept of recursion. I created a new function inside of this function that can count how many moves are associated with a set of enrollment records, ignoring the considerations in Business Rules 1 and 2. Here’s my solution. I decided to include inline comments because I think it’s easier to understand that way.

school_switch <- function(dt, x=0){
  # This function accepts a data.table dt and initializes the output to 0.
    if(dim(dt)[1]<2){
    # When there is only one enrollment record, there are no school changes to
    # apply rules 3-5. Therefore, the function returns the value of x. If the
    # initial data.table contains a student with just one enrollment record, 
    # this function will return 0 since we initialize x as 0.
      return(x)
    }else{
      # More than one record, find the minimum exit_date which is the "first"
      # record
      exit <- min(dt[, exit_date])
      # Find out which school the "first" record was at.
      exit_school <- dt[exit_date==exit][[schid]]
      # Select which rows come after the "first" record and only keep them
      # in the data.table
      rows <- dt[, enroll_date] > exit
      dt <- dt[rows,]
      # Find the minimum enrollment date in the subsetted table. This is the
      # enrollment that follows the identified exit record
      enroll <- min(dt[, enroll_date])
      # Find the school associated with that enrollment date
      enroll_school <- dt[enroll_date==enroll][[schid]]
      # When the difference between the enrollment and exit dates is less than
      # the gap and the schools are the same, there is no move. We assign y, our
      # running count of moves, the value of x, whatever the count was when this
      # call of school_switch began.
      if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap &
         exit_school==enroll_school){
        y = x
      # When the difference in days is less than the gap (and the schools are
      # different), then our number of moves is incremented by 1.
      }else if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap){
        y = x + 1L
      }else{
      # Whenever the dates are separated by more than the gap, regardless of which
      # school a student is enrolled in at either point, we increment by two.
        y = x + 2L
      }
      # Explained below outside of the code block.
      school_switch(dt, y)
    }
  }

The recursive aspect of this method is calling school_switch within school_switch once the function reaches its end. Because I subset out the record with the minimum exit_date, the data.table shrinks by one processed row with each iteration of school_switch. By passing the number of moves, y, back into school_switch, I am “saving” my work from each iteration. Only when a single row remains for a particular student does the function return a value.
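
If recursion with an accumulator is unfamiliar, here is a stripped-down sketch of the same pattern on a plain vector (a toy function invented for this post, not the mobility code): each call handles one element, passes the running total forward, and only the final call returns a value.

count_transitions <- function(x, acc = 0) {
  # One element (or none) left: nothing more to compare, return the running total.
  if (length(x) < 2) return(acc)
  # Otherwise drop the first element and recurse with an updated total.
  count_transitions(x[-1], acc + 1)
}
count_transitions(c("10101", "10103", "10103"))  # 2 transitions among 3 records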

school_switch is called using data.table’s special .SD object, which accesses the subset of the full data.table for each group when using the by argument.

dt[, moves:= school_switch(.SD), by=sid]

This calls school_switch after splitting the data.table by each sid and then stitches the work back together, split-apply-combine style, resulting in a data.table with a set of moves per student identifier. With a little bit of clean up, I can simply add these moves to those recorded earlier in output based on Business Rules 1 and 2.

  dt <- dt[,list(switches=unique(moves)), by=sid]
  output[dt, moves:=moves+switches]
  return(output)
}

Quick and Dirty system.time


  1. On a mid-2012 Macbook Air, the current mobility calculation is very effective with tens of thousands of student records and practical for use in the low-hundreds of thousands of records range. ↩︎

  2. I thought I was going to use data.table for some of its speedier features as I wrote this initial function. I didn’t in this go (though I do in Version 2). However, I do find the data.table syntax for assigning values to be really convenient, particularly the := operator, which is common in several other languages. In data.table, the syntax dt[,name:=value] assigns value to an existing (or new) column called name. Because of how selection works inside [ in data.table, I can just use dt[id,moves:=moves+1L] to select only the rows where the table key, in this case sid, matches id, and then increment moves. Nice. ↩︎ ↩︎

September 16, 2013

How do we calculate student mobility? I am currently soliciting responses from other data professionals across the country. But when I needed to produce mobility numbers for some of my work a couple of months ago, I decided to develop a set of business rules without any exposure to how the federal government, states, or other existing systems define mobility. 1

I am fairly proud of my work on mobility. This post will review how I defined student mobility. I am hopeful that it matches or bests current techniques for calculating the number of schools a student has attended. In my next post, I will share the first two major versions of my implementation of these mobility business rules in R. 2 Together, these posts will represent the work I referred to in my previous post on the importance of documenting business rules and sharing code.

The Rules

Working with district data presents a woefully incomplete picture of the education mobile students receive. Particularly in a state like Rhode Island, where our districts are only a few miles wide, there is substantial interdistrict mobility. When a student moves across district lines, their enrollment is not recorded in local district data. However, even with state level data, highly mobile students cross state lines and present incomplete data. A key consideration for calculating how many schools a student has attended in a particular year is capturing “missing” data sensibly.

The typical structure of enrollment records looks something like this:

Unique Student ID | School Code | Enrollment Date | Exit Date
1000000           | 10101       | 2012-09-01      | 2012-11-15
1000000           | 10103       | 2012-11-16      | 2013-06-15

A compound key for this data consists of the Unique Student ID, School Code, and Enrollment Date, meaning that each row must be a unique combination of these three factors. The data above shows a simple case of a student enrolling at the start of the school year, switching schools once with no gap in enrollment, and continuing at the new school until the end of the school year. For the purposes of mobility, I would define the above as having moved one time.

But it is easy to see how some very complex scenarios could quickly arise. What if student 1000000’s record looked like this?

Unique Student ID | School Code | Enrollment Date | Exit Date
1000000           | 10101       | 2012-10-15      | 2012-11-15
1000000           | 10103       | 2013-01-03      | 2013-03-13
1000000           | 10103       | 2013-03-20      | 2013-05-13

There are several features that make it challenging to assign a number of “moves” to this student. First, the student does not enroll in school until October 15, 2012. This is nearly six weeks into the typical school year in the Northeastern United States. Should we assume that this student has enrolled in no school at all prior to October 15th or should we assume that the student was enrolled in a school that was outside of this district and therefore missing in the data? Next, we notice the enrollment gap between November 15, 2012 and January 3, 2013. Is it right to assume that the student has moved only once in this period of time with a gap of enrollment of over a month and a half? Then we notice that the student exited school 10103 on March 13, 2013 but was re-enrolled in the same school a week later on March 20, 2013. Has the student truly “moved” in this period? Lastly, the student exits the district on May 13, 2013 for the final time. This is nearly a month before the end of school. Has this student moved to a different school?

There is an element missing that most enrollment data has which can enrich our understanding of this student’s record. All districts collect an exit type, which explains if a student is leaving to enroll in another school within the district, another school in a different district in the same state, another school in a different state, a private school, etc. It also defines whether a student is dropping out, graduating, or has entered the juvenile justice system, for example. However, it has been my experience that this data is reported inconsistently and unreliably. Frequently a student will be reported as changing schools within the district without a subsequent enrollment record, or reported as leaving the district only to enroll within the same district a few days later. Therefore, I think that we should try to infer the number of schools that a student has attended using solely the enrollment date, exit date, and school code for each student record. This data is far more reliable for a host of reasons, and, ultimately, provides us with all the information we need to make intelligent decisions.

My proposed set of business rules examines school code, enrollment date, and exit date against three parameters: enrollment by, exit by, and gap. Each student’s minimum enrollment date is compared to enrollment by. If that student entered the data set for the first time before enrollment by, the assumption is that this record represents the first time the student enrolls in any school for that year, and therefore the student has 0 moves. If the student enrolls for the first time after enrollment by, then the record is considered the second school a student has attended and their moves attribute is incremented by 1. Similarly, if a student’s maximum exit date is after exit by, then this is considered to be the student’s last school enrolled in for the year and they are credited with 0 moves, but if exit date is prior to exit by, then that student’s moves is incremented by 1.

That takes care of the “ends”, but what happens as students switch schools in the “middle”? I proposed that each exit date is compared to the subsequent enrollment date. If enrollment date occurs within gap days of the previous exit date, and the school code of enrollment is not the same as the school code of exit, then a student’s moves are incremented by 1. If the school codes are identical and the difference between dates is less than gap, then the student is said to have not moved at all. If the difference between the enrollment date and the previous exit date is greater than gap, then the student’s moves is incremented by 2, the assumption being that the student likely attended a different school between the two observations in the data.
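
To make these rules concrete, here is a quick sketch that walks the second example record above through them. The parameter values are assumptions chosen only for illustration (enrollment by October 1, exit by June 1, and a 14 day gap); they are not prescribed defaults.

enroll <- as.Date(c("2012-10-15", "2013-01-03", "2013-03-20"))
exit   <- as.Date(c("2012-11-15", "2013-03-13", "2013-05-13"))
school <- c("10101", "10103", "10103")
enrollby <- as.Date("2012-10-01"); exitby <- as.Date("2013-06-01"); gap <- 14

moves <- 0
if (min(enroll) > enrollby) moves <- moves + 1  # enrolled late: assume an unobserved earlier school
if (max(exit) < exitby)     moves <- moves + 1  # exited early: assume an unobserved later school
for (i in 2:length(enroll)) {
  days <- as.numeric(difftime(enroll[i], exit[i - 1], units = "days"))
  if (days > gap) {
    moves <- moves + 2                      # long gap: assume an unobserved school in between
  } else if (school[i] != school[i - 1]) {
    moves <- moves + 1                      # quick switch to a different school
  }                                         # same school within the gap: no move
}
moves

Under those assumptions the student is credited with four moves: one for the late start, two for the long winter gap, none for the one-week return to school 10103, and one for the early exit.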

Whereas calculating student mobility may have seemed a simple matter of counting the number of records in the enrollment file, clearly there is a level of complexity this would fail to capture.

Check back in a few days to see my next post where I will share my initial implementation of these business rules and how I achieved a 10x speed up with a massive code refactor.


  1. My ignorance was intentional. It is good to stretch those brain muscles that think through sticky problems like developing business rules for a key statistic. I can’t be sure that I have developed the most considered and complete set of rules for mobility, which is why I’m now soliciting others’ views, but I am hopeful my solution is at least as good. ↩︎

  2. I think showing my first two implementations of these business rules is an excellent opportunity to review several key design considerations when programming in R. From version 1 to version 2 I achieved a 10x speedup due to a complete refactor that avoided for loops, used data.table, and included some clever use of recursion. ↩︎

September 12, 2013

One of the most challenging aspects of being a data analyst is translating programmatic terms like “student mobility” into precise business rules. Almost any simple statistic involves a series of decisions that are often opaque to the ultimate users of that statistic.

Documentation of business rules is a critical aspect of a data analyst’s job that, in my experience, is often regrettably overlooked. If you have ever tried to reproduce someone else’s analysis, asked different people for the same statistic, or tried to compare data from multiple years, you have probably encountered difficulties getting a consistent answer on standard statistics, e.g. how many students were proficient in math, how many students graduated in four years, what proportion of students were chronically absent? All too often documentation of business rules is poor or non-existent. The result is that two analysts with the same data will produce inconsistent statistics. This is not because of something inherent in the quality of the data or an indictment of the analyst’s skills. In most cases, the undocumented business rules are essentially trivial, in that the result of any one decision has a small impact on the final figure and any of the decisions made by the analysts are equally defensible.

This major problem of lax or non-existent documentation is one of the main reasons I feel that analysts, and in particular analysts working in the public sector, should extensively use tools for code sharing and version control like Github, use free tools whenever possible, and generally adhere to best practices in reproducible research.

I am trying to put as much of my code on Github as I can these days. Much of what I write is still very disorganized and, frankly, embarrassing. A lot of what is in my Github repositories is old, abandoned code written as I was learning my craft. A lot of it is written to work with very specific, private data. Most of it is poorly documented because I am the only one who has ever had to use it, I don’t interact with anyone through practices like code reviews, and frankly I am lazy when pressed with a deadline. But that’s not really the point, is it? The worst documented code is code that is hidden away on a personal hard drive, written for an expensive proprietary environment most people and organizations cannot use, or worse, is not code at all but rather a series of destructive data edits and manipulations. 1

One way that I have been trying to improve the quality and utility of the code I write is by contributing to an open source R package, eeptools. This is a package written and maintained by Jared Knowles, an employee of the Wisconsin Department of Public Instruction, whom I met at a Strategic Data Project convening. eeptools is consolidating several functions in R for common tasks education data analysts are faced with. Because this package is available on CRAN, the primary repository for R packages, any education analyst can have access to its functions in one line:

install.packages('eeptools'); require(eeptools)

Submitting code to a CRAN package reinforces several habits. First, I get to practice writing R documentation, explaining how to use a function, and therefore, articulating the assumptions and business rules I am applying. Second, I have to write my code with a wider tolerance for input data. One of the easy pitfalls of a beginning analyst is writing code that is too specific to the dataset in front of you. Most of the errors I have found in analyses during quality control stem from assumptions embedded in code that were perfectly reasonable with a single data set but lead to serious errors when using different data. One way to avoid this issue is through test-driven development, writing a good testing suite that tests a wide range of unexpected inputs. I am not quite there yet, personally, but thinking about how my code would have to work with arbitrary inputs and ensuring it fails gracefully 2 is an excellent side benefit of preparing a pull request 3. Third, it is an opportunity to write code for someone other than myself. Because I am often the sole analyst with my skillset working on a project, it is easy to not consider things like style, optimizations, clarity, etc. This can lead to large build-ups of technical debt, complacency toward learning new techniques, and general sloppiness. Submitting a pull request feels like publishing. The world has to read this, so it better be something I am proud of that can stand up to the scrutiny of third-party users.
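
As a concrete illustration of what I mean by failing gracefully, here is a small sketch of the pattern (a toy function written for this post, not code from eeptools): repair the easily anticipated problem, warn the user, and only stop, with a plain message, when nothing sensible can be done.

coerce_dates <- function(x) {
  # Accept anything that can reasonably be read as a date.
  if (!inherits(x, "Date")) {
    converted <- as.Date(as.character(x), format = "%Y-%m-%d")
    if (any(is.na(converted))) {
      # No sensible repair available: tell the user exactly what is expected.
      stop("Some values could not be read as dates; please supply YYYY-MM-DD.")
    }
    warning("Input was not of class Date; converted with as.Date().")
    x <- converted
  }
  x
}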

My first pull request, which was accepted into the package, calculates age in years, months, or days at an arbitrary date based on date of birth. While even a beginning R programmer can develop a similar function, it is the perfect example of an easily compartmentalized component, with a broad set of applications, that can be accessed frequently.

Today I submitted my second pull request that I hope will be accepted. This time I covered a much more complex task– calculating student mobility. To be honest, I am completely unaware of existing business rules and algorithms used to produce the mobility numbers that are federally reported. I wrote this function from scratch thinking through how I would calculate the number of schools attended by a student in a given year. I am really proud of both the business rules I have developed and the code I wrote to apply those rules. My custom function can accept fairly arbitrary inputs, fails gracefully when it finds data it does not expect, and is pretty fast. The original version of my code took close to 10 minutes to run on ~30,000 rows of data. I have reduced that with a complete rewrite prior to submission to 16 seconds.

While I am not sure if this request will be accepted, I will be thrilled if it is. Mobility is a tremendously important statistic in education research and a standard, reproducible way to calculate it would be a great help to researchers. How great would it be if eeptools becomes one of the first packages education data analysts load and my mobility calculations are used broadly by researchers and analysts? But even if it’s not accepted because it falls out of scope, the process of developing the business rules, writing an initial implementation of those rules, and then refining that code to be far simpler, faster, and less error prone was incredibly rewarding.

My next post will probably be a review of that process and some parts of my moves_calc function that I’m particularly proud of.


  1. Using a spreadsheet program, such as Excel, encourages directly manipulating and editing the source data. Each change permanently alters the data. Even if you keep an original version of the data, there is no record of exactly what was done to change the data to produce your results. Reproducibility is all but impossible for any significant analysis done using spreadsheet software. ↩︎

  2. Instead of halting the function with a hard-to-understand error when things go wrong, I do my best to “correct” easily anticipated errors or report back to users in a plain way what needs to be fixed. See also fault-tolerant system. ↩︎

  3. A pull request is when you submit your additions, deletions, or any other modifications to be incorporated in someone else’s repository. ↩︎

August 21, 2013

In December 2009, the education department head, Professor Kenneth K. Wong, another graduate student, and I were part of a three-person team consulting the Rhode Island Department of Education (RIDE) on how to establish a new state funding formula. We worked with finance and legal staff at the department to develop the legislation for the 2010 session that would establish a state funding formula for the first time in 15 years.
The Board of Regents had already passed a resolution with its policy priorities that they wanted enshrined in the formula. Additionally, there had been many attempts over the past 5-10 years to pass a new formula that failed for various reasons, chief among them that all previously proposed formulas were accompanied by a call to increase state funding for education 30-50%, with some even envisioning nearly doubling the state education funding. Our task was to research funding formulas, both in practice in other states and in the research literature on school finance, and achieve the goals of the Board of Regents without proposing a mammoth increase in state aid that would sink the entire endeavor 1. The general sense was that while more state aid had the potential to improve the progressiveness of education expenditures, the reality is the overall spending level in Rhode Island is high, and introducing new money was less important than redistributing state aid. I share this belief, particularly because I think adding money to the right places is simple once there is already a way to equitably distribute those funds. Tying up the increase in funding alongside a distribution method is a recipe for political horse-trading that can result in all kinds of distortions that prevent aid from flowing where needed.


My role in this process was primarily to create Excel simulators that would allow us to immediately track the impacts of changing different parts of the formula. I also helped RIDE staff interpret the meaning of changes to the math behind the funding formula and understand what levers existed to change the formula and how these changes impacted both the resulting distribution and policy.


We had three months.


There are a lot of people who are unhappy about the results of the formula that ultimately passed in June 2010. Because we are redistributing essentially the same amount of state aid, there are some districts that are losing money while others are gaining funds 2. Some dislike the fact that we used only a “weight” for free and reduced price lunch status. Alternative formulas (and formulas in other states) typically include numerous weights, from limited English proficiency and special education status, to gifted and talented and career and technical education 3. And yet others were displeased that many costs, including transportation and facilities maintenance, were excluded from the state-base of education aid. Then there are those who think the transition is all off– five years is too long to wait to get the increases the formula proposes, and ten years is far too fast to lose the money the formula proposes. 4

A Good State Aid System


In the end, I am proud of the formula produced for several reasons.
 First, it passed successfully and has been fully funded (and sometimes more than fully funded) each year of implementation throughout a period of massive structural budget deficits. This is no small accomplishment. The advocacy community rightfully pushes us to build ideal systems, but in our role of policy entrepreneurs we are faced with the reality that a policy that does not become law and is not supported as law may as well not exist. Producing a formula that has some of the other positive qualities discussed below, passing that formula, and implementing the formula with little fanfare is not a small accomplishment.


Second, the formula is highly progressive, sending as much as 20 times more aid to some of our poorest communities in Rhode Island compared to the wealthiest. I am not positive how this compares to other states– that’s a topic I certainly want to work on for a future post– but with just 39 cities and towns, it seems to show a high preference for vertical equity, treating different cities and towns differently. There are communities on both ends of the distribution who want substantially more state funding, and our state aid formula is not sufficient to effectively crowd out local capacity for education spending and ensure that our poorest communities are spending more than our wealthier ones 5, but it’s a very strong start.


Third, the formula is relatively simple. While I do not necessarily agree that it is a virtue to have fewer weights and a simple formula in perpetuity, the experience with other states and other formula-based programs show that weights and complexities are very easy to add and very hard to take away. Once a particular policy preference is enshrined in the distribution method, it had better be right because a community of advocacy will maintain that weight long into the future. Personally, I felt it critical to start with the very simple “core” formula that could be adjusted over time. I have some ideas on how I might modify/add to this core that I will be sharing in this post, but I firmly believe that starting with a simple core was the right move. It is also worth noting that because of the need to ensure the transition is smoothed out so that gains in some districts equal the total losses in the others meant that even a more progressive weighting scheme would not impact school funding until the far back end of the transition period (which we proposed as 7 years but was pushed to 5 years during the legislative process), since communities were already gaining funds as fast as we could move them. For this reason, not only was a simple core preferable from my technocratic perspective, but it also was not likely to have any immediate downside.


Fourth, we removed the long-term regionalization bonuses. Rhode Island had sought to reduce the ridiculous number of school districts by providing a bonus for regionalizing in the early 90s. Unfortunately, because of the timing of the abandonment of the previous state aid formula, the districts that did choose to regionalize had their base funding locked in at a level 6-8% higher than it should have been, because they were receiving a bonus that was meant to fade away over the course of several years. I could justify a small increase in state funding to pay some of the transition expenses of regionalizing districts, but long term funding increases? Part of the goal of regionalization is the reduction of overhead that allows for decreased costs (or increased services at the same costs). There is no ongoing need to supply a massive state bonus for regionalizing.


Now just because I am proud of this work does not mean that I think we have “solved” education funding in Rhode Island. Personally, I believe there are other defensible ways to distribute funding in Rhode Island, each of which represents slightly different policy preferences. There is no hard and fast “right” or “wrong” way to do this, within certain guidelines. As I see it, so long as the formula is progressive and moving toward a greater chance of seeing a day where Providence has the highest paid staff in the state 5, we are on the right path. I don’t believe that Rhode Island will have a truly “great” education finance climate without a substantial growth in the economy or a huge new tax that dramatically lowers the ability of municipalities to generate school funding while bolstering state aid. However, I think we have a great foundation and a “good” system.


For the remainder of this post, I would like to propose a few ideas that could help move Rhode Island from “good” to “very good” that I think are feasible within the next five years 6.


A Very Good State Aid Program


After a little over three years since its establishment, I think we are ready to tackle several additional aspects of state education funding in Rhode Island. One thing you may notice is that few of these ideas impact the original formula. Part of why that is comes from my aforementioned preference for a simple formula, and part is because these include some non-formula issues that were not pursued in 2010 in an effort to keep the focus on the main policy matters.
First, and perhaps the most consequential change that can be made to state funding, is how we handle teacher pension fund payments. Currently, the state and local districts split the cost of teacher pension contributions 60/40. This is a flat split, regardless of the wealth of the community. I think it’s absurd to ignore community wealth for such a large portion of state education expenditures. Using the Adjusted Equalized Weighted Assessment Values (AEWAV) to determine the reimbursement rates would be a big improvement on the progressiveness of school funding.


Second, I would make a slight change to the way that we fund charter schools. When we were developing the formula, there was broad agreement among policymakers that the “money should follow the child”. In one sense, this is the system we proposed since school district funding is based on enrollments. However, I think an irrational desire to not “double count” students, alongside the need to keep funding as flat as possible, pushed the formula a bit too far when it comes to charters. The old way of funding charter schools allowed districts to hold back 5% of the total per pupil expenditure from their charter school tuitions. This meant charter schools received 5% less funding than traditional public schools, but it also recognized that there are some fixed costs in districts that are not immediately recoverable when students leave on the margins. I think the state should return to this practice, however only if the state is willing to pay the withheld 5% to charters. I do think it’s fair to take into account some fixed costs, but I don’t believe it’s fair that charter schools receive less funding as a result.


Third, we excluded all building maintenance costs from the base amount of state aid. This was largely because the formula was supposed to represent only the marginal instructional costs associated with each student. I don’t necessarily think that these costs have to be added into the base amount. However, I would like to see the state contribute to the maintenance of buildings more directly. I think the state should provide a flat dollar amount, say $100,000, per building in each district, provided that key criteria are met. The buildings should be at 90% occupancy/utilization, should have a minimum size set based on the research on efficiency (roughly 300 students at the elementary level and 600 students for high schools), and there should be some minimum standard for building systems quality and upkeep. These requirements are mostly about making sure this flat fund, which is really about the fixed costs of maintaining buildings, doesn’t create incentives to build more. It may seem inconsequential, but I think stating the preference for well-sized, occupied 7, and maintained buildings is worthwhile.


I think it’s wrong that the minimum reimbursement rate for school construction aid was raised to 40% during the funding formula debates in the General Assembly. This amounts to a massive subsidy for suburban schools and the previous 30% minimum is part of why we have such stark facilities inequities in the state. We should remove the minimums on construction reimbursement and simply use AEWAV to determine the reimbursement rate. Also, we need to establish a revolving facilities loan fund, much like the one used for sewers (and now roads and bridges). Access to lower interest bonds should not be dependent on city finances.


Fourth, one thing we did not include in the original funding formula that has come under considerable criticism is a special weight for students who are labeled English language learners. There are a few reasons we made this decision. The districts that have ELLs are the same districts that have high levels of poverty. In fact, the five communities that had more than 5% of their students classified as ELLs were, in order, also the top five districts with regards to free and reduced price lunch eligibility. Combined with a transition plan that was already increasing funding to these districts as rapidly as could be afforded, there were virtually no short-term consequences of not including an ELL weight. It’s worth noting that formula dollars are not categorical funds– there are no restrictions on how districts should spend this money, and there are no guarantees that an ELL weight would have any impact on ELL spending.


We were also concerned with incentivizing over-identification and failing to exit students who should no longer be classified as ELLs. I am also personally concerned about mistaking the additional supports we want to target for supports needed only for English language acquisition; that framing would not only inspire the wrong policies and supports for these students, but also fail to recognize a host of needs that persist for these students well beyond English acquisition.


During the funding formula hearings at House and Senate Finance Committees we discussed the need for further study on this issue. I think that the next weight in the formula should be based on the Census and American Communities Survey. By using these data sets, classification of students who are eligible for the weight would not be dependent on the school district itself. Rather than focus on child language acquisition, I think we should broaden this weight to be applied based on the percentage of households that speak a language other than English in the home, where English is spoken at a level below “very well” 8. This would ensure that students who live in language minority households receive additional supports throughout their education, regardless of their language acquisition status. I would make this weight lower than some in the literature because it would apply to a broader set of students, probably somewhere around 40% like the poverty weight. For reference, the latest five-year estimate from the ACS data shows that 24.3% of households fit this definition in the city of Providence. With a 40% weight, at 22,500 students, with a foundation amount of around $9,000 per student, this weight would increase funding to Providence by a little over $16,000,000. Similar to other formula aid, these funds would be unrestricted.


Now, while I think that $16,000,000 is no small potatoes, and I am happy to express our policy preference to drive funding into communities where families are not using English in the home, some perspective is warranted. Providence will receive almost $240,000,000 in state aid when the formula is fully transitioned, compared to about $190,000,000 before. Adding this weight would only represent a 6% increase in state aid from the full formula amount. It’s an important increase, but I hope you’ll forgive me if I felt it was not grossly unfair to exclude it in the first iteration of the funding formula, especially considering we still have not fully transitioned to those higher dollar amounts sent to districts that would benefit from these funds.


It Takes Money


Each of these recommendations, in my view, would improve the way that Rhode Island distributes education aid. Some of the changes are technical, others address areas that are currently not considered, and some are purely about increasing the progressiveness of aid. All of these changes will require an even greater state contribution to education aid, but these increases would be an order of magnitude lower than what it would take to increase the state aid to covering 50-60% of all education expenditures. While I would support some pretty radical changes to drive more money into the state aid system, I think that each of these improvements are worth doing on the path to increased aid.



  1. I should note that few people I spoke to were not in favor of raising the amount of state aid. We all want more money to come from the state because those dollars are far more progressive. However, Rhode Island was deep in its recession at this point in time and the dollar amounts to make a real dent in the state to local share in education are just staggering. Rhode Island currently funds just short of 40% of total school expenditures at the state level. To increase that to 60%, which is closer to the national average, they would have to contribute $500M more– a roughly 60% increase from the current level. Just for some context, the main tax fight of Rhode Island progressives has been to repeal tax cuts for higher income individuals that were instituted starting in 2006 in an attempt to move toward a flat income tax rate in Rhode Island. The impact of this repeal would be an increase in revenues that would cover roughly 10% of the increase in school funding required to move from 40% to 60% state aid. Of course, those dollars are supposed to pay for some portion of restoring pension benefits, so it’s already spoken for. ↩︎

  2. Hold harmless provisions, when introduced in other states, serve to dramatically distort the redistributive properties of state aid and almost always require a huge influx of funds. In fact, a hold harmless provision in Rhode Island would have required a doubling of state aid, which ultimately would have guaranteed that wealthy communities continue to receive too much state aid while less wealthy communities are stuck fighting year after year for tremendous revenue increases through taxation just to get their fair share. Essentially, hold harmless would ensure that you never reach formula-level spending and guarantee that state aid would not be very progressive. ↩︎

  3. One very popular progressive member of the Rhode Island General Assembly had been working for years to pass a new funding formula and had five or six such weights in her version. Interestingly, with the glaring exception of sending $0 to Newport in state aid, the difference in the overall distribution of funds by district in Rhode Island using this formula and our formula was tiny, almost always <5%. ↩︎

  4. Smoothing the “gains” and “losses” over time was important to keep the formula as close to revenue neutral as possible. Of course, there are increases due to inflation and other factors each year as a part of the base, but our goal was to truly redistribute the funds such that not only is the end number not a big increase in total state aid but that getting through the transition period did not have huge costs. If it did, there is no way we could feel confident we would ever reach the point where the formula actually dictated state aid, much like the hold harmless provision prevents a full transition. Modeling various transition plans was a nightmare for me. ↩︎

  5. Many people forget that education spending is about competition within a single market. Overall spending matters less within this market than how you spend compared to others. The trick is that an urban school primarily working with traditionally underserved families needs to be able to pay not just for more material supplies, but mostly for higher quality teachers and staff (and perhaps quantity). Because of compensating wage differentials, even hiring teachers and staff that are the same quality as wealthy communities costs more. ↩︎ ↩︎

  6. Perhaps I will write a future post on some ideas of how to push Rhode Island to “great”, even though I view all of those solutions as politically impossible. ↩︎

  7. I would include any leased space as occupied. We should encourage full utilization of the buildings, whether that includes charter schools, central office use, city government, or private companies. ↩︎

  8. This definition is clunky, but it’s how the ACS and Census track these things. We could verify the data using the data reported by districts about language spoken in the home. I would recommend using this data point to assist with whether or not to include these weights for charter schools. For example, approximately half of those families that do not speak English in the home also speak English very poorly. Therefore, I might apply half of the weight to each individual child whose family reports speaking a language other than English at home. Of course, the actual proportion of the weight should be specific to the ratio of speakers of a language other than English to non-very-well speakers of English by community. ↩︎

August 14, 2013

This post originally appeared on my old blog on January 2, 2013 but did not make the transition to this site due to error. I decided to repost it with a new date after recovering it from a cached version on the web.

Rhode Island passed sweeping pension reform last fall, angering the major labor unions and progressives throughout the state. These reforms have significantly decreased both the short and long-run costs to the state, while decreasing the benefits of both current and future retirees.

One of the most controversial measures in the pension reform package was suspending annual raises 1 for current retirees. I have noticed two main critiques of this element. The first criticism was that ending this practice constitutes a decrease in benefits to existing retirees who did not consent to these changes, constituting a breach of contract and assault on property rights. This critique is outside of the scope of this post. What I would like to address is the second criticism, that annual raises are critical to retirement security due to inflation, especially for the most vulnerable pensioners who earn near-poverty level wages from their pensions.

While I am broadly supportive of the changes made to the pension system in Rhode Island, I also believe that it is important to recognize the differential impact suspending annual raises has on a retired statehouse janitor who currently earns $22,000 a year from their pension and a former state department director earning $70,000 a year from their pension. Protecting the income of those most vulnerable to inflation is a worthy goal 2.

I have a simple recommendation that I think can have a substantial, meaningful impact on the most vulnerable retirees at substantially less cost than annual raises. This recommendation will be attractive to liberals and conservatives, as well as the “business elite” that have long called for increasing Rhode Island’s competitiveness with neighboring states. It is time that Rhode Island leaves the company of just three other states– Minnesota, Nebraska, and Vermont– that have no tax exemptions for retirement income 3. Rhode Island should exempt all income from pensions and social security up to 200% of the federal poverty level from state income taxes. This would go a long way to ensuring retirement security for those who are the most in need. It would also bring greater parity between our tax code and popular retirement destination states, potentially decreasing the impulse to move to New Hampshire, North Carolina, and Florida.

It’s a progressive win. It’s a decrease in taxes that conservatives should like. It shouldn’t have a serious impact on revenues, especially if it goes a long way toward quelling the union and progressive rancor about the recent reforms. And it’s far from unprecedented– in fact, some form of retirement income tax exemption exists in virtually every other state.

We should not be proud of taking away our most vulnerable pensioners’ annual raises, even if it was necessary. Instead of ignoring the clear impact of this provision, my hope for 2013 is that we address it, while keeping an overall pretty good change to Rhode Island’s state retirement system.


  1. Not a cost-of-living adjustment, or COLA, as some call them. ↩︎

  2. Interestingly, increases in food prices have largely slowed and the main driver of inflation is healthcare costs. I wonder to what extent Medicare/Medicaid and Obamacare shield retirees from rising healthcare costs. ↩︎

  3. www.ncsl.org/documents… ↩︎

July 28, 2013

One of the most interesting discussions I had in class during graduate school was about how to interpret the body of evidence that existed about Teach for America. At the time, Kane, Rockoff and Staiger (KRS) had just published “What does certification tell us about teacher effectiveness? Evidence from New York City” in Economics of Education Review. KRS produced value-added estimates for teachers and analyzed whether their initial certification described any variance in teacher effectiveness at raising student achievement scores. The results were, at least to me, astonishing. All else being equal, there was little difference whether teachers were uncertified, traditionally certified, NYC teaching fellows, or TFA corps members.

Most people viewed these results as a positive finding for TFA. With minimal training, TFA teachers were able to compete with teachers hired by other means. Is this not a vindication that the selection process minimally ensures an equal quality workforce?

I will not be discussing the finer points of

[points out: scholasticadministrator.typepad.com/thisweeki…

July 22, 2013

CCSSI Mathematics posted a scathing look at the items released by the Smarter Balanced Assessment Consortium (SBAC). While the rest of the internet seems to be obsessed over Georgia leaving the Partnership for Assessment of Readiness for College and Careers (PARCC)1, the real concern should be over the quality of these test items.

Although CCSSI also diligently points out questions that are not well aligned to the standards, this is the least of my worries. Adjusting the difficulty of items and improving alignment are things that testing companies know how to do and deal with all the time. Computerized testing is the new ground and a big part of why states are, rightfully, excited about the consortium.

The problem with the SBAC items is they represent the worst of computerized assessment. Rather than demonstrating more authentic and complex tasks, they present convoluted scenarios and even more convoluted input methods. Rather than present multimedia in a way that is authentic to the tasks, we see heavy language describing how to input what amounts to multiple choice or fill-in-the-blank answers. What I see here is not worth the investment in time and equipment that states are being asked to make, and it is hardly a “next generation” set of items that will allow us to attain more accurate measures of achievement.

SBAC looks poised to set up students to fail because of the machinations of test taking. This is not only tragic at face value, but assures an increase in test-prep as the items are less authentic.


  1. There was a lot of concern trolling over Georgia leaving PARCC by Andy Smarick on Twitter and Flypaper. I don’t really see this as devastating, nor do I think some kind of supplication to the Tea Party could have changed this. Short of federal mandating of common tests and standards, Georgia was never going to stay aligned with a consortium that includes Massachusetts. Georgia has an incredibly inexpensive testing program, because they have built really poor assessments that are almost entirely multiple choice. They also have some of the lowest proficiency standards in the country. There was no way this state would move up to a testing regime that costs more than twice as much (but is around the country median) that is substantially more complex and will have a much higher standard for proficiency. Georgia is one of those states that clearly demonstrates some of the “soft bigotry of low expectations” by hiding behind inflated proficiency due to low standards. ↩︎

This summer has been very productive for my fiction reading backlog. Here are just some of the things I have read since Memorial Day. 1

Novels

The Name of the Wind by Patrick Rothfuss

I picked up The Name of the Wind on a whim while cruising through the bookstore. I was glad I did. This book tells a classic story– a precocious young wizard learns to use his powers, building toward being the most important person in the world. The book is framed around an innkeeper and his apprentice who are more than they seem. When a man claiming to be the most famous storyteller in the land enters the inn, we learn that our innkeeper has a past filled with spectacular exploits that our bard wants to record. Lucky for our reader, Kvothe, in addition to being a warrior-wizard of extraordinary talent, is a narcissist who decides to tell his whole story just this once to this most famous of all chroniclers. 2 Although I have spoken to several folks who found Kvothe to be utterly unlikeable because of both his sly form of arrogance and Rothfuss’s decision to seemingly make Kvothe worthy of such high self-worth, I loved this book.

In this first book of The Kingkiller Chronicle (as these things tend to be named), we learn all about Kvothe’s formative years. We spend substantial time exploring dark times in Kvothe’s life when he endures tragedy, trauma, and horrible poverty before finally beginning to learn how to truly use his talents. It is a fair critique that Kvothe seems almost “too good”, but much of the story is about how skill, luck, and folly all contribute to his success and fame, much of which is based on exaggerated tellings of true events.

If you are a fan of this sort of fantasy, with magic, destiny, love, power, and coming of age, I recommend picking up this book. Rothfuss has a gift. The sequel, The Wise Man’s Fear is already available, and I will certainly be reading it before the end of the summer.

Endymion and The Rise of Endymion by Dan Simmons

Endymion and The Rise of Endymion are the much anticipated (15 years ago) follow up to Dan Simmons’s brilliant Hyperion and Fall of Hyperion. I strongly recommend the originals, which together tell one of the greatest tales in all of science fiction. I also recommend creating some distance between reading each set of books. Six years separate the publishing of these duologies. Each story is so rich, I think it is hard to appreciate if you read all four books in one go. Yet, the narrative is so compelling it might be hard to resist. I waited about one year between reading the original Cantos and this follow up and I was glad I did.

Set 272 years after the events of the original books, Endymion and The Rise of Endymion serve as crucial stories that satisfyingly close loops I did not even realize were open at the end of the originals. What was once a glimpse at future worlds and great cosmic powers now unfurl as major players, their primary motivations unveiled.

These books are so entwined with the original that I will not say anything about their plot so that there are no spoilers. What I can offer is the following. Whereas books 1 and 2 play with story structure to captivating effect, these books do not. Instead, we are treated to a uniquely omniscient narrator, who is both truly omniscient and integral to the events of the story. How he gains this omniscience is a major plot point that’s pulled off effortlessly. The first two books are framed as an epic poem, known as the Hyperion Cantos, written by one of the major characters in those events. Another thing we learn is that the original Cantos is not entirely reliable. Their author, who was not omniscient, had to fill in some blanks to complete the story, and also failed to understand some of the “heady” aspects of what happened and was sloppy in their explanations. Thus, we are treated both to key future events and simultaneously charged with a new reading of the original novels as written by a less than reliable narrator. What is true and what is not will all be told in this excellent follow up.

A word to the wise– Simmons may feel a bit “mushy” in his message for some “hard” science fiction readers. I think there is both profound depth and beautiful presentation of ideas, both complex enough to “earn” this treatment and some simpler than the story seems to warrant.

The Rook

The Rook is a fantastically fun debut novel3 written by an Australian bureaucrat. I learned about this book from one of my favorite podcasts, The Incomparable. Episode 128: Bureaucracy was Her Superpower is an excellent discussion that you should listen to after reading this book. I feel the hosts of that show captured perfectly what made this book great– it was completely honest and fair to its reader.

It is not giving anything away to say that The Rook centers around Myfanwy (mispronounced even by the main character as Miffany, like Tiffany with an M) who suddenly becomes aware of her troublesome surroundings but with complete amnesia. It would be easy to dismiss the memory loss as a trite plot driver, used as a cheap way to trick our characters and readers. But O’Malley is brilliant in his use of Myfanwy’s memory loss. This book does not lie to its reader or its characters. Memory loss does not conceal some simple literary irony. Instead, it serves to create a fascinating experience for a reader who learns to understand and love a character as she creates, understands, and learns to love herself.

Myfanwy is not just an ordinary young woman with memory loss. She’s a high ranking official in what can best be described as the British X-Men who run MI-5. And she knew her memory loss was imminent. As such, she prepared letters for her future, new self to learn all about her life and her attempts to uncover the plot that would lead to her own memory loss. Again, the letters could be seen as cheap opportunities for exposition and to create false tension, but O’Malley never holds too tight to their use as a structure. We read more letters at the beginning of the story, and fewer later on as the reader is availed of facts and back story as they become relevant, without a poorly orchestrated attempt to withhold information from the main character. Instead of assuming Myfanwy is reading along with us, we easily slip into an understanding that shortly after our story begins, Myfanwy actually takes the time to read all the letters and we, thankfully, are not dragged along for the ride blow by blow.

The Rook manages to tread space in both story and structure that should feel wholly unoriginal and formulaic without ever becoming either. The powers of the various individuals are fascinating, original, and consequential. The structure of the book is additive, but the plot itself is not dependent on its machinations.

Most of all, The Rook is completely fun and totally satisfying. That’s not something we say often in a post-Sopranos, post-Batman Begins world.

The Ocean at the End of the Lane

Speaking of delightful, Neil Gaiman is at his best with The Ocean at the End of the Lane. Gaiman is the master of childhood, which is where I think he draws his power as a fantasy writer. He is able to so capture the imagination of a child in beautiful prose that it is as though I am transformed into an 8-year-old boy reading by flashlight in bed late at night, anxious and frightened.

The Ocean at the End of the Lane is a beautiful, dark fairy tale. Our narrator has recently experienced a loss in the family that has affected him profoundly, such that he is driven to detour back to the home he grew up in. Most of us can appreciate how deep sadness can drive us toward spending some time alone in a nostalgic place, both mentally and physically, as we work through our feelings.

There, we are greeted with the resurfacing of memories from childhood when events most unnatural conspired to do harm against him and his family.

I really don’t want to say much from this book except that it is heartbreakingly beautiful in a way that only someone like Gaiman can manage. This is a book that should be read in just one or two sittings. It is profoundly satisfying for anyone who loves to read books that transform who and where they are. Gaiman achieves this completely.

Comics

Locke and Key

Joe Hill is a master of his craft. Over Memorial Day weekend there was a great Comixology sale that dramatically reduced the price of getting in on Locke and Key and I jumped right on board.

I have rarely cared so much for a set of characters, regardless of the medium.

Our main characters, the Locke family (three young children and their mother), are faced with tragedy in the very first panels of Welcome to Lovecraft, the opening volume of this six-part series. I think what makes Locke and Key unique is rather than use tragedy simply as the opportunity to produce heroism, our protagonists are faced with real, long lasting, deep, and horrifying consequences.

All the while, we are thrust into the fascinating world of Key House, the Locke family home where our main characters’ father grew up. Key House is home to magical keys, each of which can open one locked door. Step through that door, and there are fantastical consequences, like dying and becoming a spirit free to float around the house until you return through the door. One door might bring great strength, another flight.

It is not surprising that the tragedy that drives the Locke family back to Key House is deeply connected to the mysterious home’s history, and the very source of its magic. What is brilliant is how Joe Hill quietly reveals the greater plot through the every day misadventures of children who are dealing with a massive life change. These characters are rich, their world is fully realized, and the story is quite compelling. A must read.

Don’t believe me? The Incomparable strikes again with a great episode on the first volume of Locke and Key.

American Vampire

I was turned on to American Vampire by Dan Benjamin. Wow. Phenomenal. These are real vampires.

East of West

Saga


  1. Affiliate links throughout, if that kind of thing bugs you. If that kind of thing does bug you, could you shoot me an email and explain why? I admit to not getting all the rage around affiliate linking. ↩︎

  2. Learning more about our main character, I am somewhat dubious that this is the only time he has told of his exploits, although this older Kvothe may have become a lot less inclined to boasting. ↩︎

  3. Actually, The Rook is the second debut on this list. The Name of the Wind was Rothfuss’s first. ↩︎

June 19, 2013

The Economic Policy Institute has released a short issue brief on the Rhode Island Retirement Security Act (RIRSA) by Robert Hiltonsmith that manages to get all of the details right but the big picture entirely wrong.

The EPI Issue Brief details the differences between the retirement system for state workers before and after the passage of RIRSA as accurately and clearly as I have ever seen. Mr. Hiltonsmith has done a notable job explaining the differences between the new system and the old system.

The brief, unfortunately, fails by engaging in two common fallacies to support its broader conclusions. The first is the straw man fallacy: Mr. Hiltonsmith takes a limited set of the objectives of the entire RIRSA legislation and says defined contribution plans do not meet those objectives. That is true, but it ignores the other objectives RIRSA does accomplish, which were also part of the motivation behind the law. The second is circular reasoning: Mr. Hiltonsmith states that the reason for the low funding ratio is that the state did not put 100% of its paper liability into the pension fund. This is a tautology, not in dispute, and it should not be trumpeted as a conclusion of analysis.

Here are the three main points he believes make RIRSA bad policy:

  1. The defined contribution plan does not save the state money from its annual pension contributions.
  2. The defined contribution plan is likely to earn lower returns and therefore result in lower benefits for retirees.
  3. The defined contribution plan does not solve the low funding ratio of the pension plan which exists because law makers did not make required contributions.

Of course, the defined contribution portion of RIRSA was not designed to do any of these three things. The purpose of including a defined contribution plan in the new state pension system is to create stability in annual budget allocations and avoid locking the government into promises it has demonstrated it fails to keep. Defined benefit plans require the state to change pension contributions when there are market fluctuations, leading to anti-cyclical costs: the state is forced to put substantially more resources into pensions when revenues are lowest and spending on social welfare is most important. The defined contribution plan keeps the payments required of the state consistent and highly predictable, which is far preferable from a budget perspective.

It is unfortunate that there are lower returns to defined contribution plans which may lead to a decrease in overall benefits. It is my opinion that the unions in Rhode Island should be pushing for a substantially better match on the defined contribution portion of their plan that more closely resembles private sector match rates. This could more than alleviate the difference in benefits while maintaining the predictability, for budgeting purposes, of the defined contribution plan. I doubt this policy would have much hope of passing while Rhode Island slowly crawls out of a deep recession, but it is certainly a reasonable matter for future legislatures.

There are only two ways to decrease the current pension fund shortfalls: increase payments to the fund or decrease benefits. There is no structural magic sauce to get around this. Structural changes in the pension system are aimed at reducing the likelihood that the state will reproduce its current situation, with liabilities well outstripping funds. It is true that the “savings” largely came from cutting benefits. I have not heard anyone claim otherwise. The only alternative was to put a big lump sum into the pension fund. That clearly was not a part of RIRSA.

It is absurd to judge RIRSA on the ability of defined contribution plans to achieve policy objectives that are unrelated to the purpose of this structural change.

Perhaps the most troubling conclusion of this brief was that,

The shortfall in Rhode Island’s pension plan for public employees is largely due not to overly generous benefits, but to the failure of state and local government employers to pay their required share of pensions’ cost.

I read that and expected to see evidence of skipped payments or a discussion of overly ambitious expectations for investment returns. Instead, it seems that this conclusion is based simply on the fact that the benefits in Rhode Island were not deemed outrageously large, and therefore Rhode Island should simply fill the liability hole. The “failure” here is predicated entirely on the idea that the pensions as offered should be met, period, whatever the cost to the government. This is the “required share,” which, of course, is technically true without a change in the law, but feels disingenuous. It is essentially a wholesale agreement with the union interpretation of the state pension system as an immutable contract. The courts will likely resolve whether or not this is true. My objection is that Mr. Hiltonsmith makes a definitive statement on this rationale without describing it. In such a lucid description of how the retirement system has changed, it seems this could only be an intentional omission meant to support a predetermined conclusion rather than illuminate the unconvinced.

Mr. Hiltonsmith also claims that, “Over the long term, RIRSA may cost the state upwards of $15 million a year in additional contributions while providing a smaller benefit for the average full-career worker.” I am not 100% certain, but based on his use of the normal cost 1 in these calculations, it appears this conclusion is based only on the marginal contributions for current employees. In other words, if we completely ignore the existing liability, the new plan costs the state more money marginally while potentially decreasing benefits for employees. It is my opinion that Mr. Hiltonsmith is intentionally creating the perception that RIRSA costs more than the old plan while providing fewer benefits. Again, this is true for future liabilities, but it ignores that RIRSA also dramatically decreased the unfunded liability by cutting existing retiree benefits. So the overall cost of the act is far less, while the marginal cost was increased with the objective of decreasing the instability in government appropriations.

We can have a serious debate about whether there is value in the state's goals for a defined contribution plan. In my view, the switch to this structure is about:

  1. Portability of plans for more mobile workers, potentially serving to attract younger and more highly skilled employees.
  2. Stability in government expenditures on retiree benefits from year to year that are less susceptible to market forces. This includes avoiding the temptation to reduce payments when there are strong market returns as well as the crushing difficulty of increasing payments when the market (and almost certainly government receipts) are down.
  3. Insulating workers from a government that perpetually writes checks it cannot cash, as was the case with the old system.

This paper does not address any of these objectives or others I might have forgotten. In essence, the brief looks at only one subset of the perceived costs of this structural change, but it is far from a comprehensive analysis of the potential universe of both costs and benefits. In fact, it fails to even address the most commonly cited benefits. That is why I view it as heavily biased and flawed, even if I might draw similar conclusions from a more thorough analysis.


  1. Definition: Active participants earn new benefits each year. Actuaries call that the normal cost. The normal cost is always reflected in the cash and accounting cost of the plan. (Source) In other words, the normal cost only looks at the new benefits added to the liability, not the existing liability. ↩︎

June 11, 2013

A few months back I wrote some code to calculate age from a date of birth and an arbitrary end date. It is not a particularly tricky task, but it is certainly one that comes up often when doing research on individual-level data.

I was a bit surprised to only find bits and pieces of code and advice on how to best go about this task. After reading through some old R-help and Stack Overflow responses on various ways to do date math in R, this is the function I wrote 1:

age_calc <- function(dob, enddate=Sys.Date(), units='months'){
  if (!inherits(dob, "Date") | !inherits(enddate, "Date"))
    stop("Both dob and enddate must be Date class objects")
  start <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)
  
  years <- end$year - start$year
  if(units=='years'){
    result <- ifelse((end$mon < start$mon) | 
                      ((end$mon == start$mon) & (end$mday < start$mday)),
                      years - 1, years)    
  }else if(units=='months'){
    # completed months: full years in months plus the month difference,
    # minus one if the day of the month in enddate has not yet been reached
    months <- years * 12 + (end$mon - start$mon)
    result <- ifelse(end$mday < start$mday, months - 1, months)
  }else if(units=='days'){
    result <- difftime(end, start, units='days')
  }else{
    stop("Unrecognized units. Please choose years, months, or days.")
  }
  return(result)
}

A few notes on proper usage and the choices I made in writing this function:

  • The parameters dob and enddate expect data that is already in one of the various classes that minimally inherits the base class Date.
  • This function takes advantage of the way that R treats vectors, so both dob and enddate can be single- or multi-element vectors. For example, if enddate is a single date, as is the default, the function will return a vector of differences between that single date and each element of dob. If dob and enddate are both vectors with n > 1, the returned vector will contain the element-wise differences between dob and enddate. When the vectors are of different lengths, the shorter vector is repeated until it reaches the length of the longer one. This is known as recycling, and it is the default behavior in R (see the short usage sketch after this list).
  • This function always returns an integer. Calculating age in years will never return, say, 26.2. Instead, it assumes that the correct behavior for age calculations is something like a floor function. For example, the function will only return 27 if enddate is at least your 27th birthday. Up until that day you are considered 26. The same is true for age in months.
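A minimal usage sketch of the function defined above; the birth dates are invented purely for illustration:

dobs <- as.Date(c("1985-03-12", "1990-11-02", "2001-06-30"))  # hypothetical birth dates
age_calc(dobs, enddate = as.Date("2013-06-11"), units = 'years')
# the single enddate is recycled across all three elements of dobs
age_calc(dobs, units = 'months')
# with no enddate supplied, ages in months are calculated as of Sys.Date()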

This is probably the first custom function I have written in almost 3 years of using R that is truly generalizable. I was inspired by three factors. First, this is a truly frequent task that I will have to apply to many data sets in the future, and I don't want to have to revisit it. Second, a professional acquaintance, Jared Knowles, is putting together a CRAN package with various convenience functions for folks who are new to R and using it to analyze education data 2. This seemed like an appropriate addition to that package, so I wanted to write it to that standard. In fact, it was my first (and to date, only) submitted and accepted pull request on Github. Third, it is a tiny, simple function, so it was easy to wrap my head around and write well. I will let you be the judge of my success or failure 3.


  1. I originally used Sys.time() not realizing there was a Sys.Date() function. Thanks to Jared Knowles for that edit in preparation for a CRAN check. ↩︎

  2. Check out eeptools on Github. ↩︎

  3. Thanks to Matt’s Stats n Stuff for getting me to write this post. When I saw another age calculation function pop up on the r-bloggers feed I immediately thought of this function. Matt pointed out that it was quite hard to Google for age calculations in R, lamenting that Google doesn’t meaningfully crawl Github, where I had previously linked to my code. So this post is mostly about providing some help to less experienced R folks who are frantically Googling, as both Matt and I did when faced with this need. ↩︎