Jason Becker
November 1, 2014

November marks the start of National Novel Writing Month (NaNoWriMo). The quick version is folks band together and support each other to write 50,000 words in November.

I would love to write a novel one day. I am not sure I could do it well, but I am pretty sure I could hit 50,000-80,000 words if I dedicated time to tell a story.

I don’t have a story to tell.

So this year, I have decided to not feel guilty about skipping out on another NaNoWriMo (always the reader, never the author), and instead I am modifying it to meet my needs. With no story to tell and no experience tackling a single project the size of a novel, I am going to tackle a smaller problem– this blog.

Instead of 50,000 words in 30 days, I am going to try to write 1,000 words a day for the next four weeks. I will not hold myself to a topic. I will not even hold myself to non-fiction. I will not hold myself to a number of posts or the size of the posts I write. I will not even hold myself to a true daily count, instead reviewing where I stand at the end of each week.

I am hoping that the practice of simply writing will grease my knuckles and start the avalanche that leads to writing more. A small confession– I write two or three blog posts every week that never leave my drafts. I find myself unable to hit publish because the ideas tend to be far larger or far smaller than I anticipate when I set out to write and share my frustrations. I also get nervous, particularly when writing about things I do professionally, about not writing the perfect post that’s clear, heavily researched, and expresses my views definitively and completely. This month, I say goodbye to that anxiety and start simply hitting publish.

I will leave you with several warnings.

  1. Things might get topically wacky. I might suddenly become a food blogger, or write about more personal issues, or write a short story and suddenly whiplash to talking about programming, education policy, or the upcoming election. If high volume, random topics aren’t your thing, you should probably unsubscribe from my RSS feed and check back in a month.
  2. I might write terrible arguments that are poorly supported and don’t reflect my views. This month, I will not accept my most common excuses for not publishing, which boil down to fear that people will hold me to the views I express in my first-draft thinking. I am going to make mistakes this month in public and print the dialog I am having with myself. The voices I allow room to speak as I struggle with values, beliefs, and opinions may shock and offend. This month, this blog is my internal dialog. Please read it as a struggle, however definitive the tone.
  3. I am often disappointed that the only things I publish are smaller ideas written hastily with poor editing. Again, this month I embrace the reality that almost everything I write that ends up published is the result of 20 minutes of furious typing with no looking back, rather than the work of a strong writer with a strong viewpoint and strong support.

I hope that by the end of this month I will have written at least a couple of pieces I feel proud of, and hopefully, I will have a little less fear of hitting publish in the future.

October 5, 2014

A terrible thing is happening this year. Women all across the internet are finding themselves the target of violence, simply for existing. Women are being harassed for talking about video games, women are being harassed for talking about the technology industry, women are being harassed for talking, women are being harassed.

A terrible thing is happening. Women are finding themselves the target of violence.

A terrible thing has always happened.


I remember being a 16 year old posting frequently on internet forums. One in particular focused on guitar equipment. I loved playing in a band, and I loved the technology of making guitar sounds. Many people on the forum were between 16 and 24, although it was frequented by quite a few “adults” in their 30s, 40s, and 50s. It was a wonderful opportunity to interact as an adult, with adults.

Every week members created a new thread where they posted hundreds of photos of women. Most of them were professional photographs taken at various night clubs as patrons entered. Some were magazine clippings or fashion modeling. I remember taking part, both in gazing and supplying the occasional photograph from the internet. We were far from the early days of the world wide web, this being around 2003, but this was also before social media matured and online identity was well understood by the general public.

This thread became controversial. A change from private to corporate ownership of this forum led to increased moderation, and the weekly post with photos of women was one of the targets.

I did not understand.

In the debates about the appropriateness of the content and its place within our online community, I took the side of those who wanted the post to remain alive. I was not its most ardent supporter, nor was I moved to some of the extremes in language and entitlement that typically surround these conversations. However, my views were clear and easy. These were public photographs, largely taken with permission (often for compensation). And, of course, none of the pictures were pornographic.

Appropriateness for me at 16 was defined by pornography. I did not understand.


My parents did not raise me to be misogynist. One of the most influential moments in my life came on a car ride to the dentist. I was also around 16 or 17. I think it was on my way to get my wisdom teeth removed. I had been dating the same girl for a while, and it was time for my father to give me the talk. All he said to me was, “Women deserve your respect.”

That was it.


We were in college, and my friends and I were all internet natives. We had used the web for over ten years. We grew up in AOL chatrooms and forums. The backwaters of the internet at this time shifted from Something Awful to 4Chan. This was the height of some of the most prolific and hilarious memes: lolcats, Xzibit, advice dogs (a favorite was bachelor frog, which seemed to understand our worst impulses expressed in only modest exaggeration).

There was also violence.

It was not uncommon to see names, phone numbers, and addresses that 4chan was supposed to harass because someone said so. Various subcultures seemed to be alternately mocked and harassed endlessly in the very place that had first embraced, supported, and connected people under the guise of radical anonymity. The most famous of the “Rules of the Internet” was Rule 34 – if you can think of it, there is porn of it – and its follow-up, Rule 35 – if you cannot find porn of it, you should make it. 4chan seemed determined to make this a reality. But really, the most troublesome thing was the attitude toward women. Nothing was as unacceptable to 4chan as suggesting that women are anything but objects for the male gaze. In a place sometimes filled with radically liberal (if more left-libertarian than left-progressive) politics that would spawn groups like Anonymous, nothing brought out as much criticism as suggesting our culture has a problem with women.

My response was largely to fade from this part of the internet. I had only reached the point of being uncomfortable with this behavior. It would take more time for me to understand. It still felt like this was a problem of ignorant people.


I am rarely jealous of intelligence. I am rarely jealous of wealth. I am rarely jealous of experiences. What I am most often jealous of is what seems to me to be a preternatural maturity of others, particularly around issues of ethics and human rights.

Fully grappling with privilege is not something that happens in a moment; it is a sensitivity to be developed over a lifetime. We are confronted with media that builds and reinforces a culture that is fundamentally intolerant and conservative. There are countless microaggressions that are modeled everywhere for our acceptance as normal. It has taken me a decade of maturation, hard conversations, and self-examination to only begin to grow from being fully complicit in and participating in the objectification of women to what I would now consider the most basic level of human decency.

The internet has gone from enabling my own aggression toward women to exposing me to a level of misogyny and violence that deeply disturbs and disgusts me, shattering any notion that my past offenses were harmless or victimless. The ugly underside of our culture is constantly on display, making it all the more obvious how what felt like isolated events on the “ok” side of the line were actually creating a space that supported and nurtured the worst compulsions of men.


I often think about my own journey when I see disgusting behavior on the internet. I wonder whether I am facing a deeply ugly person or myself at 16. I try to parse the difference between naïvety, ignorance, and hate and to understand whether they each require a unique response.

Mostly, I struggle with what would happen if Jason Today spoke to Jason 16.

Jason 16 could not skip over a decade of growth simply for having met Jason Today. It took me conversations with various folks playing the role of Jason Today over and over again, year after year. I wish I believed there was another way to reach the Jason 16s out there. I wish I knew how to help them become preternaturally aware of their actions. All I know how to do is try to be compassionate to those who hate while firmly correcting, try to meet the heightened expectations I place on myself, try to apologize when I need to, and try to support those that seem more equipped to push the conversation forward.

Along this path, I never leapt to agreement so much as paused. Each time I heard a convincing point, I paused and considered. Growth came in a series of all too brief pauses.

Pauses are often private and quiet, their discoveries never on direct display.

If pauses are the best anyone can expect, then working to change our culture of violence toward women will rarely feel like much more than shouting at the void.

June 12, 2014

The Vergara v. California case has everyone in education talking. Key teacher tenure provisions in California are on the ropes, presumably because of the disparate impact on teacher, and therefore education, quality for students who are less fortunate.

I have fairly loosely held views about the practice of tenure itself and the hiring and firing of teachers. However, I have strongly held views that unions made a mistake with their efforts to move a lot of rules about the teaching labor market into state laws across the country. Deep rules and restrictions are better left to contracts, even from a union perspective. At worst, these things should be a part of regulation, which can be more easily adapted and waived.

That said, here are a collection of interesting thoughts on tenure post-Vergara:

John Merrow, reacting to Vergara:

Tenure and due process are essential, in my view, but excessive protectionism (70+ steps to remove a teacher?) alienates the general public and the majority of effective teachers, particularly young teachers who are still full of idealism and resent seeing their union spend so much money defending teachers who probably should have been counseled out of the profession years ago.

With the modal ‘years of experience’ of teachers dropping dramatically, from 15 years in 1987 to 1 or 2 years today, young teachers are a force to be reckoned with. If a significant number of them abandon the familiar NEA/AFT model, or if they develop and adopt a new form of teacher unionism, public education and the teaching profession will be forever changed.

San Jose Mercury News reporting on the state thwarting a locally negotiated change to tenure:

With little discussion, the board rejected the request, 7 to 2. The California Teachers Association, one of the most powerful lobbies in Sacramento, had opposed granting a two-year waiver from the state Education Code – even though one of the CTA’s locals had sought the exemption… …San Jose Teachers Association President Jennifer Thomas, whose union had tediously negotiated with the district an agreement to improve teacher evaluations and teaching quality, called the vote frustrating… San Jose Unified and the local teachers association sought flexibility to grant teachers tenure after one year or to keep a teacher on probation for three years.

The district argued that 18 months – the point in a teacher’s career at which districts must make a tenure decision – sometimes doesn’t allow time to fairly evaluate a candidate for what can be a lifetime job.

Now, Thomas said, when faced with uncertainty over tenure candidates, administrators will err on the side of releasing them, which then leaves a stain on their records.

Kevin Welner summarizing some of the legal implications of Vergara:

Although I can’t help but feel troubled by the attack on teachers and their hard-won rights, and although I think the court’s opinion is quite weak, legally as well as logically, my intent here is not to disagree with that decision. In fact, as I explain below, the decision gives real teeth to the state’s Constitution, and that could be a very good thing. It’s those teeth that I find fascinating, since an approach like that used by the Vergara judge could put California courts in a very different role —as a guarantor of educational equality—than we have thus far seen in the United States… …To see why this is important, consider an area of education policy that I have researched a great deal over the years: tracking (aka “ability grouping”). There are likely hundreds of thousands of children in California who are enrolled in low-track classes, where the expectations, curricula and instruction are all watered down. These children are denied equal educational opportunities; the research regarding the harms of these low-track classes is much stronger and deeper than the research about teachers Judge Treu found persuasive in the Vergara case. That is, plaintiffs’ attorneys would easily be able to show a “real and appreciable impact” on students’ fundamental right to equality of education. Further, the harm from enrollment in low-track classes falls disproportionately on lower-income students and students of color. (I’ll include some citations to tracking research from myself and others at the end of this post.)

Welner also repeats a common refrain from the education-left that tenure and insulating teachers from evaluations is critical for attracting quality people into the teaching profession. This is an argument that the general equilibrium impact on the broader labor market is both larger in magnitude and in the opposite direction of any assumed positive impacts from easier dismissal of poor performing teachers:

This more holistic view is important because the statutes are central to the larger system of teacher employment. That is, one would expect that a LIFO statute or a due process statute or tenure statute would shape who decides to become a teacher and to stay in the profession. These laws, in short, influence the nature of teaching as a profession. The judge here omits any discussion of the value of stability and experience in teaching that tenure laws, however imperfectly, were designed to promote in order to attract and retain good teachers. By declining to consider the complexity of the system, the judge has started to pave a path that looks more narrowly at defined, selected, and immediate impact—which could potentially be of great benefit to future education rights plaintiffs.

Adam Ozimek of Modeled Behavior:

I can certainly imagine it is possible in some school districts they will find it optimal to fire very few teachers. But why isn’t it enough for administrators to simply rarely fire people, and for districts to cultivate reputations as places of stable employment? One could argue that administrators can’t be trusted to actually do this, but such distrust of administrators brings back a fundamental problem with this model of public education: if your administrators are too incompetent to cultivate a reputation that is optimal for student outcomes then banning tenure is hardly the problem, and imposing tenure is hardly a solution. This is closely related to a point I made yesterday: are we supposed to believe administrators fire sub-optimally but hire optimally

His piece from today (and this one from yesterday) argues that Welner’s take could be applied to just about any profession, and furthermore, requires accepting a far deeper, more fundamental structural problem in education that should be unacceptable. If administrators would broadly act so foolishly as to decimate the market for quality teaching talent and be wholly unable to successfully staff their schools, we have far bigger problems. And, says Ozimek, there is no reason to believe that tenure is at all a response to this issue.

Dana Goldstein would likely take a more historical view on the usefulness of tenure against administrator abuse.

But, writing for The Atlantic, she focuses instead on tenure as a red herring:

The lesson here is that California’s tenure policies may be insensible, but they aren’t the only, or even the primary, driver of the teacher-quality gap between the state’s middle-class and low-income schools. The larger problem is that too few of the best teachers are willing to work long-term in the country’s most racially isolated and poorest neighborhoods. There are lots of reasons why, ranging from plain old racism and classism to the higher principal turnover that turns poor schools into chaotic workplaces that mature teachers avoid. The schools with the most poverty are also more likely to focus on standardized test prep, which teachers dislike. Plus, teachers tend to live in middle-class neighborhoods and may not want a long commute.

May 19, 2014

I have never found dictionaries or even a thesaurus particularly useful as part of the writing process. I like to blame this on my lack of creative, careful writing.

But just maybe, I have simply been using the wrong dictionaries. It is hard not to be seduced by the seeming superiority of Webster’s original style. A dictionary that is one-part explanatory and one-part exploratory provides a much richer experience of English as an enabler of ideas that transcend meager vocabulary.

May 12, 2014

I had never thought of a use for Brett Terpstra’s Marky the Markdownifier before listening to today’s Systematic. Why would I want to turn a webpage into Markdown?

When I heard that Marky has an API, I was inspired. Pinboard has a “description” field that allows up to 65,000 characters. I never know what to put in this box. Wouldn’t it be great to put the full content of the page in Markdown into this field?

I set out to write a quick Python script to:

  1. Grab recent Pinboard links.
  2. Check to see if the URLs still resolve.
  3. Send the link to Marky and collect a Markdown version of the content.
  4. Post an updated link to Pinboard with the Markdown in the description field.

If all went well, I would release this script on Github as Pindown, a great way to put Markdown page content into your Pinboard links.

The script below is far from well-constructed. I would have spent more time cleaning it up with things like better error handling and a more complete CLI to give more granular control over which links receive Markdown content.

Unfortunately, I found that Pinboard consistently returns a 414 error code because the URLs are too long. Why is this a problem? Pinboard, in an attempt to maintain compatibility with the del.icio.us API, uses only GET requests, whereas this kind of request would typically use a POST endpoint. As a result, I cannot send along a data payload.

So I’m sharing this just for folks who are interested in playing with Python, RESTful APIs, and Pinboard. I’m also posting it for my own posterity since a non-del.icio.us-compatible version 2 of the Pinboard API is coming.

import requests
import json
import yaml


def getDataSet(call):
  r = requests.get('https://api.pinboard.in/v1/posts/recent' + call)
  data_set = json.loads(r.text)
  return data_set

def checkURL(url=""):
  newurl = requests.get(url)
  if newurl.status_code == 200:
    return newurl.url
  else:
    raise ValueError('URL did not resolve', newurl.status_code)

def markyCall(url=""):
  # Marky the Markdownifier returns the page content converted to Markdown.
  r = requests.get('http://heckyesmarkdown.com/go/?u=' + url)
  return r.text

def process_site(call):
  data_set = getDataSet(call)
  processed_site = []
  errors = []
  for site in data_set['posts']:
    try:
      url = checkURL(site['href'])
    except ValueError:
      errors.append(site['href'])
      continue
    description = markyCall(url)
    site['extended'] = description
    processed_site.append(site)
  print(errors)
  return processed_site

def write_pinboard(site, auth_token):
  stem = 'https://api.pinboard.in/v1/posts/add?format=json&auth_token='
  payload = {}
  payload['url'] = site.get('href')
  payload['description'] = site.get('description', '')
  payload['extended'] = site.get('extended', '')
  payload['tags'] = site.get('tags', '')
  payload['shared'] = site.get('shared', 'no')
  payload['toread'] = site.get('toread', 'no')
  r = requests.get(stem + auth_token, params=payload)
  print(site['href'] + '\t\t' + str(r.status_code))

def main():
  with open('AUTH.yaml') as settings:
    identity = yaml.safe_load(settings)
  auth_token = identity['user_name'] + ':' + identity['token']
  valid_sites = process_site('?format=json&auth_token=' + auth_token)
  for site in valid_sites:
    write_pinboard(site, auth_token)

if __name__ == '__main__':
  main()

April 1, 2014

I frequently work with private data. Sometimes, it lives on my personal machine rather than on a database server. Sometimes, even if it lives on a remote database server, it is better that I use locally cached data than query the database each time I want to do analysis on the data set. I have always dealt with this by creating encrypted disk images with secure passwords (stored in 1Password). This is a nice extra layer of protection for private data served on a laptop, and it adds little complication to my workflow. I just have to remember to mount and unmount the disk images.

However, it can be inconvenient from a project perspective to refer to data in a distant location like /Volumes/ClientData/Entity/facttable.csv. In most cases, I would prefer the data “reside” in data/ or cache/ “inside” of my project directory.

Luckily, there is a great way that allows me to point to data/facttable.csv in my R code without actually having facttable.csv reside there: symlinking.

A symlink is a symbolic link file that sits in the preferred location and references the file path to the actual file. This way, when I refer to data/facttable.csv the file system knows to direct all of that activity to the actual file in /Volumes/ClientData/Entity/facttable.csv.

From the command line, a symlink can be generated with a simple command:

ln -s target_path link_path

R offers a function that does the same thing:

file.symlink(target_path, link_path)

where target_path and link_path are both strings surrounded by quotation marks.

One of the first things I do when setting up a new analysis is add common data storage file extensions like .csv and .xls to my .gitignore file so that I do not mistakenly put any data in a remote repository. The second thing I do is set up symlinks to the mount location of the encrypted data.
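As a rough sketch of that setup in R (the paths are the illustrative ones from above, and the exact .gitignore entries will vary by project):

# Keep raw data files out of the repository (entries are examples).
cat("*.csv\n*.xls\n", file = ".gitignore", append = TRUE)

# Point data/facttable.csv at the file on the mounted encrypted image.
file.symlink("/Volumes/ClientData/Entity/facttable.csv", "data/facttable.csv")

# Analysis code can now read from the project-local path.
facts <- read.csv("data/facttable.csv")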

March 9, 2014

Education data often come in annual snapshots. Each year, students are able to identify anew, and while student identification numbers may stay the same, names, race, and gender can often change. Sometimes, even data that probably should not change, like a date of birth, is altered at some point. While I could spend all day talking about data collection processes and automated validation that should assist with maintaining clean data, most researchers face multiple characteristics per student, unsure of which one is accurate.

While it is true that identity is fluid, and sex/gender or race identifications are not inherently stable over time, it is often necessary to “choose” a single value for each student when presenting data. The Strategic Data Project does a great job of defining the business rules for these cases in its diagnostic toolkits.

If more than one [attribute value is] observed, report the modal [attribute value]. If multiple modes are observed, report the most recent [attribute value] recorded.

This is their rule for all attributes considered time-invariant for analysis purposes. I think it is a pretty good one.

Implementing this rule in R turned out to be more complex than it appeared, especially in performant code. In fact, it was this business rule that led me to learn how to use the data.table package.

First, I developed a small test set of data to help me make sure my code accurately reflected the expected results based on the business rule:

# Generate test data for modal_person_attribute().
modal_test <- data.frame(sasid = c('1000', '1001', '1000', '1000', '1005', 
                                   '1005', rep('1006', 4)),
                         race = c('Black', 'White', 'Black', 'Hispanic',
                                  'White', 'White', rep('Black', 2), 
                                  rep('Hispanic', 2)),
                         year = c(2006, 2006, 2007, 2008,
                                  2010, 2011, 2007, 2008,
                                  2010, 2011))

The test data generated by that code looks like this:

sasid race year
1000 Black 2006
1001 White 2006
1000 Black 2007
1000 Hispanic 2008
1005 White 2010
1005 White 2011
1006 Black 2007
1006 Black 2008
1006 Hispanic 2010
1006 Hispanic 2011

And the results should be:

sasid race
1000 Black
1001 White
1005 White
1006 Hispanic

My first attempts at solving this problem using data.table resulted in a pretty complex set of code.

# Calculate the modal attribute using data.table
modal_person_attribute_dt <- function(df, attribute){
  # df: rbind of all person tables from all years
  # attribute: vector name to calculate the modal value
  # Calculate the number of instances an attribute is associated with an id
  dt <- data.table(df, key='sasid')
  mode <- dt[, rle(as.character(.SD[[attribute]])), by=sasid]
  setnames(mode, c('sasid', 'counts', as.character(attribute)))
  setkeyv(mode, c('sasid', 'counts'))
  # Only include attributes with the maximum values. This is equivalent to the
  # mode with two records when there is a tie.
  mode <- mode[,subset(.SD, counts==max(counts)), by=sasid]
  mode[,counts:=NULL]
  setnames(mode, c('sasid', attribute))
  setkeyv(mode, c('sasid',attribute))
  # Produce the maximum year value associated with each ID-attribute 
  # pairing    
  setkeyv(dt, c('sasid',attribute))
  mode <- dt[,list(schoolyear=max(schoolyear)), by=c("sasid", attribute)][mode]
  setkeyv(mode, c('sasid', 'schoolyear'))
  # Select the last observation for each ID, which is equivalent to the highest
  # schoolyear value associated with the most frequent attribute.
  result <- mode[,lapply(.SD, tail, 1), by=sasid]
  # Remove the schoolyear to clean up the result
  result <- result[,schoolyear:=NULL]
  return(as.data.frame(result))
}

This approach seemed “natural” in data.table, although it took me a while to refine and debug since it was my first time using the package 1. Essentially, I use rle, a nifty function I used in the past for my Net-Stacked Likert code, to count the number of instances of an attribute each student had in their record. I then subset the data to only the max count value for each student and merge these values back to the original data set. Then I order the data by student id and year in order to select only the last observation per student.

I get a quick, accurate answer when I run the test data through this function. Unfortunately, when I ran the same code on approximately 57,000 unique student IDs and 211,000 total records, the results were less inspiring. My MacBook Air’s fans spun up to full speed and the timings were terrible:

> system.time(modal_person_attribute_dt(all_years, 'sex'))
 user  system elapsed 
 40.452   0.246  41.346 

Data cleaning tasks like this one are often only run a few times. Once I have the attributes I need for my analysis, I can save them to a new table in a database, a CSV, or similar and never run it again. But ideally, I would like to be able to build a document presenting my data completely from the raw delivered data, including all cleaning steps, accurately. So while I may use a cached, clean data set for some of the more sophisticated analysis while I am building up a report, in the final stages I begin running the entire analysis process, including data cleaning, each time I produce the report.

With the release of dplyr, I wanted to reexamine this particular function because it is one of the slowest steps in my analysis. I thought that with fresh eyes and a new way of expressing R code, I might be able to improve on the original function. Even if its performance ended up being fairly similar, I hoped the dplyr code would be easier to maintain since I frequently use dplyr and only turn to data.table in specific, sticky situations where performance matters.

In about a tenth the time it took to develop the original code, I came up with this new function:

modal_person_attribute <- function(x, sid, attribute, year){
  grouping <- lapply(list(sid, attribute), as.symbol)
  original <- x
  max_attributes <- x %.% 
                    regroup(grouping) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count))
  recent_max <- left_join(original, max_attributes) %.%
                regroup(list(grouping[[1]])) %.%
                filter(!is.na(count) & count == max(count))
  results <- recent_max %.% 
             regroup(list(grouping[[1]])) %.%
             filter(year == max(year))
  return(results[,c(sid, attribute)])
}

At least to my eyes, this code is far more expressive and elegant. First, I generate a data.frame with only the rows that have the most common attribute per student by grouping on student and attribute, counting the size of those groups, and filtering to the most common group per student. Then, I do a join on the original data and remove any records without a count from the previous step, finding the maximum count per student ID. This recovers the year value for each of the students so that in the next step I can just choose the rows with the highest year.

There are a few funky things (note the use of regroup and grouping, which are related to dplyr’s poor handling of strings as arguments), but for the most part I have shorter, clearer code that closely resembles the plain-English stated business rule.

But was this code more performant? Imagine my glee when this happened:

> system.time(modal_person_attribute(all_years, sid='sasid', 
+             attribute='sex', year='schoolyear'))
Joining by: c("sasid", "sex")
   user  system elapsed 
  1.657   0.087   1.852 

That is a remarkable increase in performance!

Now, I realize that I may have cheated. My data.table code isn’t very good and could probably follow a pattern closer to what I did in dplyr. The results might be much closer in the hands of a more adept developer. But the take home message for me was that dplyr enabled me to write the more performant code naturally because of its expressiveness. Not only is my code faster and easier to understand, it is also simpler and took far less time to write.

It is not every day that a tool provides powerful expressiveness and yields greater performance.

Update

I have made some improvements to this function to simplify things. I will be maintaining this code in my PPSDCollegeReadiness repository.

modal_person_attribute <- function(x, sid, attribute, year){
  # Select only the important columns
  x <- x[,c(sid, attribute, year)]
  names(x) <- c('sid', 'attribute', 'year')
  # Clean up years
  if(TRUE %in% grepl('_', x$year)){
    x$year <- gsub(pattern='[0-9]{4}_([0-9]{4})', '\\1', x$year)
  }  
  # Calculate the count for each person-attribute combo and select max
  max_attributes <- x %.% 
                    group_by(sid, attribute) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count)) %.%
                    select(sid, attribute)
  # Find the max year for each person-attribute combo
  results <- max_attributes %.% 
             left_join(x) %.%
             group_by(sid) %.%
             filter(year == max(year)) %.%
             select(sid, attribute)
  names(results) <- c(sid, attribute)
  return(results)
}
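
As a quick sanity check, running the updated function on the test data from earlier (still using the old %.% dplyr syntax) should reproduce the expected results table:

modal_person_attribute(modal_test, sid='sasid', attribute='race', year='year')
# Should return one row per student, matching the expected results above:
# 1000 Black, 1001 White, 1005 White, and 1006 Hispanic (the tie for 1006
# is broken by the most recent year, 2011).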

  1. It was over a year ago that I first wrote this code. ↩︎

February 26, 2014

We burden Latinos (and other traditionally underserved communities) with expensive housing because of the widespread practice of using homestead exemptions in Rhode Island. By lowering the real estate tax rate, typically by 50%, for owner occupied housing, we dramatically inflate the tax rate paid by Rhode Islanders who are renting.

Echoing a newly filed lawsuit in New York City over discriminatory real estate tax regimes, this new report emphasizes the racist incentives built into our property tax.

Homestead exemptions are built on the belief that renters are non-permanent residents of communities, care less for the properties they occupy and neighborhoods they live in, and are worse additions than homeowners. Frankly, it is an anti-White flight measure meant to assure people that only those with the means to purchase and the intent to stay will join their neighborhoods. Wealthy, largely White, property owners see homestead exemptions as fighting an influx of “slum lords”, which is basically the perception of anyone who purchases a home or builds apartments and rents them out.

Rather than encouraging denser communities with higher land utilization and more housing to reduce the cost of living in dignity, we subsidize low value (per acre) construction that maintains inflated housing costs.

Full disclosure: I own a condo in Providence and receive a 50% discount on my taxes. In fact, living in a condo Downcity, my home value is depressed because of the limited ways that I can use it. I could rent my current condo at market rate and lose money because of the doubling in taxes that I would endure versus turning a small monthly profit at the same rent with higher taxes. The flexibility to use my property as my own residence or as a rental unit more than pays for higher taxes.

So while I do have personal reasons to support removing the homestead exemption, even if I lived in a single family home on the East Side that was not attractive as a rental property, I would still think this situation is absurd. Homeowners’ taxes should easily be 20% higher to tax renters 30% less. Maybe some of our hulking, vacant infrastructure could be more viably converted into housing stock and lower the cost for all residents. Maybe we could even see denser development because there will actually be a market for renters at the monthly rates that would need to be charged to recoup expenses. At least the rent wouldn’t be so damn high for too many people of color and people living in or near poverty.

February 17, 2014

Hadley Wickham has once again1 made R ridiculously better. Not only is dplyr incredibly fast, but the new syntax allows for some really complex operations to be expressed in a ridiculously beautiful way.

Consider a data set, course, with a student identifier, sid, a course identifier, courseno, a quarter, quarter, and a grade on a scale of 0 to 4, gpa. What if I wanted to know the number of courses a student has failed over the entire year, as defined by having an overall grade of less than a 1.0?

In dplyr:

course %.% 
group_by(sid, courseno) %.%
summarise(gpa = mean(gpa)) %.%
filter(gpa <= 1.0) %.%
summarise(fails = n())
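
For concreteness, here is a tiny hypothetical course table (the column names come from the description above; the values are invented) that the pipeline could be run against:

# Hypothetical data: student A fails MATH1 overall (mean gpa of 0.75),
# so the pipeline should return a single row with sid A and fails = 1.
# Student B has no failing courses and drops out at the filter() step.
course <- data.frame(sid = c('A', 'A', 'A', 'A', 'B', 'B'),
                     courseno = c('MATH1', 'MATH1', 'ENG1', 'ENG1',
                                  'MATH1', 'MATH1'),
                     quarter = c(1, 2, 1, 2, 1, 2),
                     gpa = c(0.5, 1.0, 3.0, 4.0, 2.0, 3.0))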

I refuse to even sully this post with the way I would have solved this problem in the past.


  1. Seriously, how many of the packages he has managed/written are indispensable to using R today? It is no exaggeration to say that the world would have many more Stata, SPSS, and SAS users if not for Hadleyverse. ↩︎

February 9, 2014

These quotes are absolutely striking, in that they give a clear glimpse into the ideological commitments of the Republican Party. From Sen. Blunt and Rep. Cole, we get the revelation that— for conservatives— the only “work” worth acknowledging is wage labor. To myself, and many others, someone who retires early to volunteer— or leaves a job to care for their children— is still working, they’re just outside the formal labor market. And indeed, their labor is still valuable— it just isn’t compensated with cash.

One of the greatest benefits of wealth is that it can liberate people to pursue happiness. When we tie a basic need for living complete lives of dignity to full-time employment, people will find themselves willing to make many sacrifices to meet that need. In our nation of great wealth, with liberty and freedom as core values, it is hard to believe that the GOP would decry the liberating effect of ending the contingency of health care on work.

There is no work rule, regulation, or union that empowers workers more in their relationship with their employers than removing the threat of losing health care from the table. An increasingly libertarian right should be celebrating this as a key victory, rather than celebrate the existing coercive impact that health care has in our lives.

Republicans aren’t as worried as the idle rich, who— I suppose— have earned the right to avoid a life of endless toil. Otherwise— if Republicans really wanted everyone to work as much as possible— they’d support confiscatory tax rates. After all, nothing will drive an investment banker back to the office like the threat of losing 70 percent of her income to Uncle Sam.

Oh yeah, I forgot. For all their claims to loving liberty and freedom, what the GOP really stands for is protecting liberty and freedom for the existing “deserving” wealthy. They will fight tooth and nail to remove estate taxes because inheritance is a legitimate source of liberty. Removing the fear of entering a hospital uninsured after being unable to access preventive care is what deprives folks of “dignity”.

February 5, 2014

My Democracy Prep colleague Lindsay Malanga and I often say we should start an organization called the Coalition of Pretty Good Schools. We’d start with the following principles.

  1. Every child must have a safe, warm, disruption-free classroom as a non-negotiable, fundamental right.
  2. All children should be taught to read using phonics-based instruction.
  3. All children must master basic computational skills with automaticity before moving on to higher mathematics.
  4. Every child must be given a well-rounded education that includes science, civics, history, geography, music, the arts, and physical education.
  5. Accountability is an important safeguard of public funds, but must not drive or dominate a child’s education. Class time must not be used for standardized test preparation.

We have no end of people ready to tell you about their paradigmatic shift that will fix education overnight. There has been plenty of philosophizing about the goals, purpose, and means of education. Everyone is ready to pull out tropes about the “factory model” of education our system is built on.

The reality is that the education system too often fails at very basic delivery, period. I would love to see more folks draw a line in the sand of their minimum basic requirements, and not in an outrageous, political winky-wink where they are wrapping their ideal in the language of the minimum. Let’s have a deep discussion right now about the minimum basic requirements and let’s get relentless about making that happen without the distraction of the dream. Frankly, whatever your dream is, so long as it involves kids going somewhere to learn 1, if we can’t deliver on the basics it will be dead on arrival.


  1. Of course, for a group of folks who are engaged in Dreamschooling, we cannot take for granted that schools will be places or that children will be students in any traditional sense of the word. However, I believe that if we have a frank conversation about the minimum expectations for education I suspect this will not be a particularly widely held sentiment. If our technofuturism does complete its mindmeld with the anarcho-____ movements on the left and right to lead to a dramatically different conceptualization of childhood in the developed world in my lifetime… ↩︎

January 6, 2014

James over at TransportPVD has a great post today talking about a Salt Lake City ordinance that makes property owners responsible for providing a bond that funds the landscaping and maintenance of vacant lots left after demolition. I love this as much as he does and would probably add several other provisions (like forfeiting any tax breaks on that property or any other property in the city and potentially forfeiture of the property itself if a demolition was approved based on site plans that are not adhered to within a given time frame). Ultimately, I do think the best solution to surface parking where it doesn’t belong, of either the temporary or permanent (and isn’t it all actually permanent?) kind, is a land value tax.

James goes one step further and suggests that we should adopt some similar rules around ALL parking developments and proposes a few. His hope was that a mayoral candidate would chime in. For now, he will have to make do with me.

His recommendations are built somewhat specific to the commission looking at building a state-funded parking garage in front of the Garrahy Complex in Downcity, about which many urbanists and transit advocates have expressed reservations or outright rejection. They are:

  1. The garage is parking neutral. As many spots need to be removed from the downtown as are added.
  2. An added bonus would be if some of the spots removed were on-street ones, to create protected bike lanes or transit lanes with greenery separating them from car traffic.
  3. The garage has the proposed bus hub.
  4. There are ground-level shops.
  5. The garage is left open 24-hours so that it can limit the need for other lots (this happens when a garage is used only during the day, or only at night, instead of letting it serve both markets).
  6. Cars pay full market price to park.

(Note: I’ve numbered rather than kept the bullets of the original to make responding easier.)

I disagree with the first and second point, which are really one and the same. We are in a district that has tremendously underutilized land. We want that space to be developed, and as a result of that development we expect there to be a much increased need for transit capacity. The goal should be both to increase accessibility and to increase the share of transit capacity offered by walking, biking, or riding a bus or light rail. This does not require that we demand a spot-for-spot trade when building a public garage. I agree with the sentiment but disagree with the degree. Part of building rules and policies like this is to ensure comprehensive consideration of the transit context when developing parking. I see no reason to a priori assume that garages should only be permitted if they eliminate the same number of spaces they create.

The reason I combine these two points is because the city does not have the ability to remove off-street parking that is not publicly owned. Investing in garages with smaller footprints that have to be built taller and provide no change in capacity probably makes no sense at all. If we’re going to build any kind of public garage at all, it should be with the goal of consolidating parking into infrastructure with reasonable land utilization. We would rather have 3 or 4 large, properly located garages than all of the current lots. Limiting their size because of the flexibility available from reducing on-street parking or the footprint of existing lots doesn’t achieve that and doesn’t factor in the orders-of-magnitude changes in capacity we should expect to need for all transit modes in the next 20 years.

On point three, I am skeptical. I like the idea of improving bus infrastructure when building parking infrastructure in general. In fact, I voted against the \$40M Providence road paving bond even though that was much needed maintenance. My rationale was purely ideological– we should not use debt to pay for car maintenance without also investing in ways to reduce future maintenance costs through better utilization of those roads. However, I have a hard time believing that the Garrahy location is any good as a bus hub. If RIPTA did a great job identifying the need for an additional bus hub that the Garrahy location met the criteria for, I think it’s a reasonable idea. Short of that, it feels like throwing the transit community a wasteful bone.

I mostly agree on point four, but I doubt it works at the scale James would like to see. I think an appropriate level is probably not that different from the recently erected Johnson and Wales garage. The reality is that street-level retail is the right form, but there isn’t sufficient foot traffic to support it right now and won’t be for some time. There has to be street-level activation of any garage built in this area, but the square footage is likely fairly timid.

I absolutely agree with point five, without qualification. Not a dime should be spent on a public parking spot that is closed at any point in time, anywhere in the city. I would actually ditto this for surface parking lots on commercial properties of any kind after business hours. Not only should they have to be open, they should have to provide signs indicating the hours of commercial activity when parking is restricted and the hours when parking is available to the public. These hours of operations should require board approval. Owners could choose to charge during these off hours, but cars must be able to access the lot.

And point six should be a given for any public parking.

The real problem with Garrahy, in my opinion, is the cost is absurd, likely to be at least \$35,000 per space. There is plenty of existing parking, suggesting the demand right now is illusory and market rate for those spots right now means the investment is unlikely to ever be recovered. In a world with limited capacity for government spending on transit as a public good, I would rather subsidize transit infrastructure that benefits the poor and directly impacts the share of non-car transit as it increases capacity. Spending limited funds on parking infrastructure is ludicrous when demand isn’t sufficient to recover the investment. We already more than sufficiently subsidize parking in the area. And of course, the “study commission” is not really a study– it’s a meeting convened by those who want the project to happen putting the required usual suspects in the room to tepidly rubber stamp it. At least that’s my cynical take.

December 9, 2013

We find that public schools offered practically zero return education on the margin, yet they did enjoy significant political and financial support from local political elites, if they taught in the “right” language of instruction.

One thing that both progressives and libertarians agree upon is that the social goals of education are woefully underappreciated and underconsidered in the current school reform discussion. Both school choice and local, democratic control of schools are reactions to centralization resulting in “elites… [selecting] the ‘right’ language of instruction.”

I am inclined to agree with neither.

December 3, 2013

Update

Turns out the original code below was pretty messed up. All kinds of little errors I didn’t catch. I’ve updated it below. There are a lot of options to refactor this further that I’m currently considering. Sometimes it is really hard to know just how flexible something this big really should be. I think I am going to wait until I start developing tests to see where I land. I have a feeling moving toward a more test-driven work flow is going to force me toward a different structure.

I recently updated the function I posted about back in June that calculates the difference between two dates in days, months, or years in R. It is still surprising to me that difftime can only return units from seconds up to weeks. I suspect this has to do with the challenge of properly defining a “month” or “year” as a unit of time, since these are variable.

While there was nothing wrong with the original function, it did irk me that it always returned an integer. In other words, the function returned only complete months or years. If the start date was 2012-12-13 and the end date was 2013-12-03, the function would return 0 years. Most of the time, this is the behavior I expect when calculating age. But it is completely reasonable to want to include partial years or months, e.g. in the aforementioned example returning 0.9724605.
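
Those same dates show why difftime alone cannot get there:

# difftime() tops out at weeks; 'months' and 'years' are not valid units.
difftime(as.Date('2013-12-03'), as.Date('2012-12-13'), units='weeks')
## Time difference of 50.71429 weeks
# difftime(as.Date('2013-12-03'), as.Date('2012-12-13'), units='months')  # errors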

So after several failed attempts because of silly errors in my algorithm, here is the final code. It will be released as part of eeptools 0.3, which should be available on CRAN soon 1.

age_calc <- function(dob, enddate=Sys.Date(), units='months', precise=TRUE){
  if (!inherits(dob, "Date") | !inherits(enddate, "Date")){
    stop("Both dob and enddate must be Date class objects")
  }
  start <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)
  if(precise){
    start_is_leap <- ifelse(start$year %% 400 == 0, TRUE, 
                        ifelse(start$year %% 100 == 0, FALSE,
                               ifelse(start$year %% 4 == 0, TRUE, FALSE)))
    end_is_leap <- ifelse(end$year %% 400 == 0, TRUE, 
                        ifelse(end$year %% 100 == 0, FALSE,
                               ifelse(end$year %% 4 == 0, TRUE, FALSE)))
  }
  if(units=='days'){
    result <- difftime(end, start, units='days')
  }else if(units=='months'){
    months <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='months', SIMPLIFY=FALSE), 
                     length) - 1
    # length(seq(start, end, by='month')) - 1
    if(precise){
      # Check the leap-year case first so that February can be 29 days.
      month_length_end <- ifelse(end$mon==1 & end_is_leap, 29,
                                 ifelse(end$mon==1, 28,
                                        ifelse(end$mon %in% c(3, 5, 8, 10), 
                                               30, 31)))
      month_length_prior <- ifelse((end$mon-1)==1 & start_is_leap, 29,
                                   ifelse((end$mon-1)==1, 28,
                                          ifelse((end$mon-1) %in% c(3, 5, 8, 
                                                                    10), 
                                                 30, 31)))
      month_frac <- ifelse(end$mday > start$mday,
                           (end$mday-start$mday)/month_length_end,
                           ifelse(end$mday < start$mday, 
                            (month_length_prior - start$mday) / 
                                month_length_prior + 
                                end$mday/month_length_end, 0.0))
      result <- months + month_frac
    }else{
      result <- months
    }
  }else if(units=='years'){
    years <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='years', SIMPLIFY=FALSE), 
                     length) - 1
    if(precise){
      start_length <- ifelse(start_is_leap, 366, 365)
      end_length <- ifelse(end_is_leap, 366, 365)
      year_frac <- ifelse(start$yday < end$yday,
                          (end$yday - start$yday)/end_length,
                          ifelse(start$yday > end$yday, 
                                 (start_length-start$yday) / start_length +
                                end$yday / end_length, 0.0))
      result <- years + year_frac
    }else{
      result <- years
    }
  }else{
    stop("Unrecognized units. Please choose years, months, or days.")
  }
  return(result)
}
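
Running it on the example dates from earlier in the post shows the difference the precise option makes:

age_calc(as.Date('2012-12-13'), as.Date('2013-12-03'), units='years')
## about 0.9724605 years
age_calc(as.Date('2012-12-13'), as.Date('2013-12-03'), units='years', precise=FALSE)
## 0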

  1. I should note that my mobility function will also be included in eeptools 0.3. I know I still owe a post on the actual code, but it is such a complex function I have been having a terrible time trying to write clearly about it. ↩︎

December 2, 2013

PISA Results

I wanted to call attention to these interesting PISA results. It turns out that student anxiety in the United States is lower than the OECD average and belief in ability is higher 1. I thought that all of the moves in education since the start of standards-based reform were supposed to be generating tremendous anxiety and failing to produce students with a high sense of self-efficacy?

It is also worth noting that students in the United States were more likely to skip out on school, and this had a higher than typical impact on student performance. One interpretation of this could be that students are less engaged, but also that schooling activities do have a large impact on students rather than schools being of lesser importance than student inputs.

I have always had a hard time reconciling the calls for higher teacher pay and better work conditions and evidence that missing even just 10% of schooling has a huge impact on student outcomes with the belief that addressing other social inequities is the key way to achieve better outcomes for kids.

This is all an exercise in nonsense. It is incredibly difficult to transfer findings from surveys across dramatic cultural differences. It is also hard to imagine what can be learned about the delivery of education in the dramatically different contexts that exist. The whole international comparison game seems like one big Rorschach test where the price of admission is leaving any understanding of culture, context, and external validity at the door.

P.S.: The use of color in this visualization is awful. There is a sense that they are trying to be “value neutral” with data that is ordinal in nature (above, same, or below), and in doing so chose two colors that are very difficult to distinguish between. Yuck.


  1. The site describes prevalence of anxiety as, “proportion of students who feel helpless when faced with math problems” and belief in ability as, “proportion of students who feel confident in their math abilities”. Note, based on these definitions, one might also think that either curricula were not so misaligned with international benchmarks or that we are already seeing the fruits of a partial transition to Common Core. Not knowing the trend for this data, or some of the specifics about the collection instrument, makes that difficult to assess. ↩︎

November 22, 2013

Although it clocks in at 40+ pages, this is a worthwhile and relatively fast read for anyone in education policy who cares about the future of assessment if we’re serious about college and career readiness. There is a ton to unpack, with a fair amount I agree with and a lot I am quite a bit less sure about.

I think this paper is meant for national and state level policy-makers, and so my major quibble is I think this is much more valuable for a district-level audience. I am less bullish on the state’s role in building comprehensive assessment systems. That’s just my initial reaction.

The accountability section is both less rich and less convincing than the assessment portion. I have long heard cries for so-called reciprocal accountability, but it is still entirely unclear to me what this means, what it looks like, and what the implications are for current systems.

November 20, 2013

“We are trying to work towards late-exit ELL programs so (students) can learn the concepts in (their) native language,” Lusi said. Administrative goals have recently shifted to a focus on proficiency in both languages because bilingual education is preferred, she added.

But instituting district-wide bilingual education would require funding to hire teachers certified in both languages and to buy dual-language materials, she said.

I am pretty sure this is new. I am surprised there has not been a stronger effort to pass a legislative package in Rhode Island that provides both the policy framework and the funding necessary to achieve universal bilingual education for English language learners in RI schools.

One of the great advantages of transitioning to common standards1 is there should be greater availability of curricular materials in languages other than English. I suspect most of what is needed for bilingual education is start-up money for materials, curriculum supports and development, and assessment materials. There are a few policy things that need to be in place, possibly around state exams, but also rules around flexible teacher assignment, hiring, and dismissal as staffing needs dramatically change.

Someone should be putting this package together. I suspect there would be broad support.


  1. Note, this is not necessarily a feature of the Common Core State Standards, just having standards in common with many other states. ↩︎

November 19, 2013

De Blasio and his advisers are still figuring out how much rent to charge well-funded charter schools, his transition team told me. “It would depend on the resources of the charter school or charter network,” he told WNYC, in early October. “Some are clearly very, very well resourced and have incredible wealthy backers. Others don’t. So my simple point was that programs that can afford to pay rent should be paying rent.” (In an October debate with the Republican candidate Joseph Lhota, he put it more bluntly: “I simply wouldn’t favor charters the way Mayor Bloomberg did because, in the end, our city rises or falls on our traditional public schools.”)

My impression of DeBlasio was that he went around collecting every plausible complaint from every interest group that was mad at Bloomberg and promised whatever they wanted. There didn’t really seem to be a coherent theory or any depth whatsoever to his policy prescriptions.

Already working hard to confirm this impression.

November 18, 2013

To recap, the first study discussed above established that children from disadvantaged backgrounds know less about a topic (i.e., birds) than their middle-class peers. Next, in study two, the researchers showed that differences in domain knowledge influenced children’s ability to understand words out of context, and to comprehend a story. Moreover, poor kids — who also had more limited knowledge — perform worse on these tasks than did their middle class peers. But could additional knowledge be used to level the playing field for children from less affluent backgrounds?

In study three, the researchers held the children’s prior knowledge constant by introducing a fictitious topic — i.e., a topic that was sure to be unknown to both groups. When the two groups of children were assessed on word learning and comprehension related to this new domain, the researchers found no significant differences in how poor and middle-class children learned words, comprehended a story or made inferences.

One of the “old” divides in education, from before the current crop of “edreform”, is whether or not content matters. Broadly, there are two camps, let’s call them the “Facts” and “Skills”, with the “Skills” camp clearly ahead in terms of mind share.

“Skills” is based on a fundamentally intuitive insight– students need to know how to do things, not know about the things themselves. In many ways it is built on our common experience of forgetting facts over time. We need 21st century skills, not an accumulation of specific, privileged knowledge that fades over time. With each new technology, from encyclopedias to calculators through to Google, each generation decides that the tools adults use end the necessity of knowing about things rather than knowing how to find them.

This is very attractive. It seems to match our adult experiences accumulating knowledge and using it in our work. It seems to address students’ boredom with learning irrelevant information. It leaves space for groups to advocate for teaching whatever content they want since everyone can argue that content is fundamentally limited in value.

In classic “turns out” fashion, however, the evidence keeps mounting that one must teach from the “Facts” approach to achieve the goals of the “Skills” position.

Turns out: skills and knowledge do not transfer well across domains. There is little evidence that learning how to read literary fiction translates to reading technical manuals with comprehension. In other words, critical thinking is not really an independent ability free of domain context 1. In fact, experts are able to learn more quickly, but only in their domain and only when they have prior knowledge to use as scaffolding 2.

Turns out: reading comprehension is strongly connected to whether or not students have prior knowledge (“Facts”) about the topic of the passage 3. Reading techniques only provide modest assistance for comprehension.

Turns out: privileging skills over content may have a serious differential impact on disadvantaged children. A well-intentioned goal of achieving equity through equality has led many to argue that we do a disservice to children of color and children in poverty because their schools have not as completely embraced a “Skills” world and are too focused on “Facts”. The problem is that the deep disparities we see when these students enter schooling point to them having less prior knowledge than their peers 4.

What is remarkable, and tragic, is that the “Skills” camp has maintained its dominance through the demonization of “Facts”, with dramatic misinterpretations like:

  1. The “Facts” folks are just White colonialists seeking to maintain existing power structures through teaching the information of privilege.
  2. The “Facts” folks privilege memorization, rote learning, and recall-based assessment over other pedagogy that is more engaging and authentic.
  3. The “Facts” folks can only ever teach what was important yesterday; “Skills” camp can teach what matters to become a lifelong learner for tomorrow’s world.

None of these are true.

This post is largely brought to you by: E.D. Hirsch, Dan T. Willingham, and Malcolm Gladwell via Merlin Mann.


  1. http://www.aft.org/pdfs/americaneducator/summer2007/Crit_Thinking.pdf ↩︎

  2. http://www.ncbi.nlm.nih.gov/pubmed/11550744 ↩︎

  3. http://www.aft.org/newspubs/periodicals/ae/spring2006/willingham.cfm ↩︎

  4. This has pretty much been the thrust behind E.D. Hirsch’s work, who has been accused of being on the far right in education, despite his consistent belief that education equity is one of the most important goals to achieve. His firm belief, and I am mostly convinced, is that explicit factual content is the key tool for how teaching can dramatically improve educational equity. ↩︎

  1. More schooling, reoriented calendar
  2. Wider range of higher education
  3. Cheaper four-year degrees
  4. Eliminate property tax-based public education

This is an interesting list. I don’t agree with number four. There are several benefits to using property taxes, not the least of which is their stability and lagged response during traditional economic downturns. However, there are many things we should do to reform our revenue system for education. I am keen on more taxes on “property”, using land value taxes that are levied either statewide or regionally to address some of the inequities that traditional, highly localized property taxes can lead to.

November 17, 2013

If I had to point to the key fissure in the education policy and research community it would be around poverty. Some seem to view it as an inexorable obstacle, deeply believing that the key improvement strategy is to decrease inequity of inputs. Some seem to view it as an obstacle that can be overcome by systems functioning at peak efficacy, deeply believing the great challenge is achieving that efficacy sustainably at scale. Both positions seem to grossly simplify causes and suggest policy structures and outcomes that are unachievable.

Paraphrasing Merlin Mann, always be skeptical of “turns out” research. In this case, are the results really that surprising? If they are, I might suggest that you have been focusing too much on the partial equilibrium impact of poverty and ignoring the bigger picture.

Not that I think integration is likely, easy, quick, or magically fixes things.

October 7, 2013

I spent most of high school writing, practicing, and performing music. I played guitar in two separate bands, was the lead vocalist in one of them, and played trumpet in various wind ensembles and the jazz band at school. When I wasn’t a part of the creation process myself, there is a pretty good chance I was listening to music. Back then, it seemed trivial to find a new artist or album to obsess over.

Despite being steeped in music, I have always found it hard to write about. The truth is, I have limited ability to use words to explain just what makes a particular piece of music so wonderful. Oh sure, I could discuss structure, point out a particular hook in a particular section and how it sits in the mix. I could talk about the tone of the instrument or about quality of the performance or any number of other things. The problem with this language is it reduces what is great about this piece of music to a description that could easily fit some other piece of music. Verbalizing the experience of music projects a woefully flattened artifact of something breathtaking.

Now it might seem that recorded music has greatly diminished this challenge. After all, the experience of recorded music can scale– anyone can listen. Unfortunately, I found this to be completely untrue. When I play music for other people, it actually sounds different than when I experience it for myself. Little complexities that seem crucial to the mix seem to cower and hide rather than loom large in the presence of others. It is not really feasible to point out what makes the song so great while listening, because it disrupts the experience. Worst of all, no one else seems to experience what I experience when I listen.

Of course, all of this may seem obvious to someone who has read about aesthetics. I have not.

September 22, 2013

In a couple of previous posts, I outlined the importance of documenting business rules for common education statistics and described my take on how to best calculate student mobility. In this post, I will be sharing two versions of an R function I wrote to implement this mobility calculation, reviewing their different structures and methods to reveal how I achieved an order of magnitude speed up between the two versions. 1 At the end of this post, I will propose several future routes for optimization that I believe should lead to the ability to handle millions of student records in seconds.

Version 0: Where Do I Begin?

The first thing I tend to do is whiteboard the rules I want to use through careful consideration and constant referral back to real data sets. By staying grounded in the data, I am less likely to encounter unexpected situations during my quality control. It also makes it much easier to develop test data, since I seek out outlier records in actual data during the business rule process.

Developing test data is a key part of the development process. Without a compact, but sufficiently complex, set of data to try with a newly developed function, there is no way to know whether or not it does what I intend.

Recall the business rules for mobility that I have proposed, all of which came out of this whiteboarding process:

  1. Entering the data with an enroll date after the start of the year counts as one move.
  2. Leaving the data with an exit date before the end of the year counts as one move.
  3. Changing schools sometime during the year without a large gap in enrollment counts as one move.
  4. Changing schools sometime during the year with a large gap in enrollment counts as two moves.
  5. Adjacent enrollment records for the same student in the same school without a large gap in enrollment do not count as moving.

Test data needs to represent each of these situations so that I can confirm the function is properly implementing each rule.

Below is a copy of my test data. As an exercise, I recommend determining the number of “moves” each of these students should be credited with after applying the above stated business rules.

| Unique Student ID | School Code | Enrollment Date | Exit Date |
|---|---|---|---|
| 1000000 | 10101 | 2012-10-15 | 2012-11-15 |
| 1000000 | 10103 | 2013-01-03 | 2013-03-13 |
| 1000000 | 10103 | 2013-03-20 | 2013-05-13 |
| 1000001 | 10101 | 2012-09-01 | 2013-06-15 |
| 1000002 | 10102 | 2012-09-01 | 2013-01-23 |
| 1000003 | 10102 | 2012-09-15 | 2012-11-15 |
| 1000003 | 10102 | 2013-03-15 | 2013-06-15 |
| 1000004 | 10103 | 2013-03-15 | NA |
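For anyone following along in R, here is one way this test data might be constructed. The data frame name test_df and the column names are my own assumptions, chosen to match the default parameter values of the function described below.

# A sketch of the test data above, using column names that match the
# defaults assumed by the function below (sid, schid, enroll_date, exit_date).
test_df <- data.frame(
  sid   = c(1000000, 1000000, 1000000, 1000001, 1000002, 1000003, 1000003, 1000004),
  schid = c(10101, 10103, 10103, 10101, 10102, 10102, 10102, 10103),
  enroll_date = as.Date(c("2012-10-15", "2013-01-03", "2013-03-20", "2012-09-01",
                          "2012-09-01", "2012-09-15", "2013-03-15", "2013-03-15")),
  exit_date   = as.Date(c("2012-11-15", "2013-03-13", "2013-05-13", "2013-06-15",
                          "2013-01-23", "2012-11-15", "2013-06-15", NA))
)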

Version 1: A Naïve Implementation

Once I have developed business rules and a test data set, I like to quickly confirm that I can produce the desired results. That’s particularly true when it comes to implementing new, fairly complex business rules. My initial implementation of a new algorithm does not need to be efficient, easily understood, or maintainable. My goal is simply to follow my initial hunch on how to accomplish a task and get it working. Sometimes this naïve implementation turns out to be pretty close to my final implementation, but sometimes it can be quite far off. The main things I tend to improve with additional work are extensibility, readability, and performance.

In the case of this mobility calculation, I knew almost immediately that my initial approach was not going to have good performance characteristics. Here is a step by step discussion of Version 1.

Function Declaration: Parameters

moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sid', 
                       schid='schid',
                       enroll_date='enroll_date',
                       exit_date='exit_date'){

I named my function moves_calc() to match the style of age_calc() which was submitted and accepted to the eeptools package. This new function has eight parameters.

df: a data.frame containing the required data to do the mobility calculation.

enrollby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the start of the school year. Students whose first enrollment is after this date will have an additional move under the assumption that they enrolled somewhere prior to the first enrollment record in the data. This does not (and likely should not) match the actual first day of the school year.

exitby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the end of the school year. Students whose last exit is before this date will have an additional move under the assumption that they enrolled somewhere after this exit record that is excluded in the data. This date does not (and likely should not) match the actual last day of the school year.

gap: an atomic vector of type numeric that signifies how long a gap must exist between student records to record an additional move for that student under the assumption that they enrolled somewhere in between the two records in the data that is not recorded.

sid: an atomic vector of type character that represents the name of the vector in df that contains the unique student identifier. The default value is 'sid'.

schid: an atomic vector of type character that represents the name of the vector in df that contains the unique school identifier. The default value is schid.

enroll_date: an atomic vector of type character that represents the name of the vector in df that contains the enrollment date for each record. The default value is enroll_date.

exit_date: an atomic vector of type character that represents the name of the vector in df that contains the exit date for each record. The default value is exit_date.

Most of these parameters are about providing flexibility around the naming of attributes in the data set. Although I often write functions for my own work which accept data.frames, I cannot help but feel this is a bad practice. Assuming particular data attributes of the right name and type does not make for generalizable code. To make up for my shortcoming in this area, I have done my best to allow other users to enter whatever data column names they want, so long as they contain the right information to run the algorithm.

The next portion of the function loads some of the required packages and is common to many of my custom functions:

if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
require(data.table)

if("plyr" %in% rownames(installed.packages()) == FALSE){
    install.packages("plyr")
  } 
require(plyr)

Type Checking and Programmatic Defaults

Next, I do extensive type-checking to make sure that df is structured the way I expect it to be in order to run the algorithm. I do my best to supply humane warning() and stop() messages when things go wrong, and in some cases, set default values that may help the function run even if the function is not called properly.

if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
    stop("Both enroll_date and exit_date must be Date objects")

The enroll_date and exit_date both have to be Date objects. I could have attempted to coerce those vectors into Date types using as.Date(), but I would rather not assume something like the date format. Since enroll_date and exit_date are the most critical attributes of each student, the function will stop() if they are the incorrect type, informing the analyst to clean up the data.
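If the raw extract stores dates as strings, the fix is a one-time coercion before calling the function. A minimal sketch, assuming the dates arrive as "YYYY-MM-DD" strings (the raw data frame here is hypothetical):

# Hypothetical prep step: coerce character date columns to Date objects
# before calling the function, assuming "YYYY-MM-DD" strings.
raw <- data.frame(enroll_date = "2012-09-01", exit_date = "2013-06-15",
                  stringsAsFactors = FALSE)
raw$enroll_date <- as.Date(raw$enroll_date, format = "%Y-%m-%d")
raw$exit_date   <- as.Date(raw$exit_date, format = "%Y-%m-%d")
inherits(raw$enroll_date, "Date")  # TRUE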

if(missing(enrollby)){
   enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
     enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                               '-09-15', sep=''), format='%Y-%m-%d')
     warning(paste("enrollby must be a string with format %Y-%m-%d,",
                   "defaulting to", 
                   enrollby, sep=' '))
  }else{
    enrollby <- as.Date(enrollby, format="%Y-%m-%d")
  }
}
if(missing(exitby)){
  exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                          '-06-01', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
    exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                              '-06-01', sep=''), format='%Y-%m-%d')
    warning(paste("exitby must be a string with format %Y-%m-%d,",
                  "defaulting to", 
                  exitby, sep=' '))
  }else{
    exitby <- as.Date(exitby, format="%Y-%m-%d")
  }
}
if(!is.numeric(gap)){
  gap <- 14
  warning("gap was not a number, defaulting to 14 days")
}

For maximum flexibility, I have parameterized the enrollby, exitby, and gap used by the algorithm to determine student moves. An astute observer of the function declaration may have noticed I did not set default values for enrollby or exitby. This is because these dates are naturally going to be different with each year of data. As a result, I want to enforce their explicit declaration.

However, we all make mistakes. So when I check to see if enrollby or exitby are missing(), I do not stop the function if it returns TRUE. Instead, I set the value of enrollby to September 15 in the year that matches the minimum (first) enrollment record and exitby to June 1 in the year that matches the maximum (last) exit record. I then pop off a warning() that informs the user of the expected values for each parameter and what values I have defaulted them to. I chose to use warning() because many R users set their environment to halt at warnings(). Warnings are generally not good and should be pursued and fixed. No one should depend upon the defaulting process I use in the function. But the defaults that can be determined programmatically are sensible enough that I did not feel the need to always halt the function in its place.

I also check to see if gap is, in fact, defined as a number. If not, I throw a warning() after setting gap equal to the default value of 14.

Is this all of the type and error-checking I could have included? Probably not, but I think this represents a very sensible set that makes this function much more generalizable outside of my coding environment. This kind of checking may be overkill for a project that is worked on independently and with a single data set, but colleagues, including your future self, will likely be thankful for its inclusion if any of your code is to be reused.

Initializing the Results

output <- data.frame(id = as.character(unique(df[[sid]])),
                     moves = vector(mode = 'numeric', 
                                    length = length(unique(df[[sid]]))))
output <- data.table(output, key='id')
df <- arrange(df, sid, enroll_date)

My naïve implementation uses a lot of for loops, a no-no when it comes to R performance. One way to make for loops a lot worse, and this is true in any language, is to reassign a variable within the loop. This means that each iteration has the overhead of creating and assigning that object. Especially when we are building up results for each observation, it is silly to do this. We know exactly how big the data will be and therefore only need to create the object once. We can then assign a much smaller part of that object (in this case, one value in a vector) rather than the whole object (a honking data.table).
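As a toy illustration of the difference (not from the original function), compare growing a vector inside a loop to filling a pre-allocated one:

# Growing an object inside the loop reallocates the whole object on every pass.
n <- 10000
grow <- c()
for(i in 1:n) grow <- c(grow, i)

# Pre-allocating once and assigning into it only writes one element per pass.
prealloc <- vector(mode = "numeric", length = n)
for(i in 1:n) prealloc[i] <- i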

Our output object is what the function returns. It is a simple data.table containing all of the unique student identifiers and the number of moves recorded for each student.

The last line in this code chunk ensures that the data are arranged by the unique student identifier and enrollment date. This is key since the for loops assume that they are traversing a student’s record sequentially.

Business Rule 1: The Latecomer

for(i in 1:(length(df[[sid]])-1)){
  if(i>1 && df[sid][i,]!=df[sid][(i-1),]){
    if(df[['enroll_date']][i]>enrollby){
      output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }else if(i==1){
    if(df[['enroll_date']][i]>enrollby){
    output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }

The first bit of logic checks if sid in row i is not equal to the sid in the i-1 row. In other words, is this the first time we are observing this student? If it is, then row i is the first observation for that student and therefore has the minimum enrollment date. The enroll_date is checked against enrollby. When enroll_date is after enrollby, then the moves attribute for that sid is incremented by 1. 2

Now, I didn’t really mention the conditional that i>1. This is needed because there is no i-1 observation for the very first row of the data.table. Therefore, i==1 is a special case where we once again perform the same check for enroll_date and enrollby. The i>1 condition is before the && operator, which ensures the statement after the && is not evaluated when the first conditional is FALSE. This avoids an “out of bounds”-type error where R tries to check df[0].
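A tiny illustration of that short-circuit behavior (this snippet is mine, not part of the function):

# && stops evaluating as soon as the result is known, so the right-hand side
# is never reached when i == 1.
i <- 1
i > 1 && stop("never evaluated")  # returns FALSE instead of erroring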

Business Rule 5: The Feint

Yeah, yeah– the business rule list above doesn’t match the order of my function. That’s ok. Remember, sometimes giving instructions to a computer does not follow the way you would organize instructions for humans.

Remember, the function is traversing through our data.frame one row at a time. First I checked to see if the function is at the first record for a particular student. Now I check to see if there are any records after the current record.

  if(df[sid][i,]==df[sid][(i+1),]){
    if(as.numeric(difftime(df[['enroll_date']][i+1], 
                           df[['exit_date']][i], units='days')) < gap &
       df[schid][(i+1),]==df[schid][i,]){
        next
    }else if ...

For the case where the i+1 record has the same sid, then the enroll_date of i+1 is subtracted from the exit_date of i and checked against gap. If it is both less than gap and the schid of i+1 is the same as i, then next, which basically breaks out of this conditional and moves on without altering moves. In other words, students who are in the same school with only a few days between the time they exited are not counted as having moved.

The ... above is not the special ... in R, rather, I’m continuing that line below.

Business Rule 3: The Smooth Mover

  }else if(as.numeric(difftime(df[['enroll_date']][i+1], 
                               df[['exit_date']][i], 
                               units='days')) < gap){
    output[as.character(df[[sid]][i]), moves:=moves+1L] 
  }else{ ...

Here we have the simple case where a student has moved to another school (recall, this is still within the if conditional where the next record is the same student as the current record) with a very short period of time between the exit_date at the current record and the enroll_date of the next record. This is considered a “seamless” move from one school to another, and therefore that student’s moves are incremented by 1.

Business Rule 4: The Long Hop

Our final scenario for a student moving between schools is when the gap between the exit_date at the i school and the enroll_date at the i+1 school is large, defined as > gap. In this scenario, the assumption is that the student moved to a jurisdiction outside of the data set, such as out of district for district-level data or out of state for state level data, and enrolled in at least one school not present in their enrollment record. The result is these students receive 2 moves– one out from the i school to a missing school and one in to the i+1 school from the missing school.

The code looks like this (again a repeat from the else{... above which was not using the ... character):

  }else{
    output[as.character(df[[sid]][i]), moves:=moves+2L] 
  }
}else...

This ends with a } which closes the if conditional that checked if the i+1 student was the same as the i student, leaving only one more business rule to check.

Business Rule 2: The Early Summer

}else{
  if(is.na(df[['exit_date']][i])){
    next
  }else if(df[['exit_date']][i] < exitby){
        output[as.character(df[[sid]][i]), moves:=moves+1L]
  }
}

Recall that this else block is only called if the sid of the i+1 record is not the same as i. This means that this is the final entry for a particular student. First, I check to see if that student has a missing exit_date and, if so, charge no move to the student, using the next statement to break out of this iteration of the loop. Students never have a missing enroll_date in any of the data I have seen over 8 years. This is because most systems minimally autogenerate the enroll_date for the current date when a student first enters a student information system. However, sometimes districts forget to properly exit a student and are unable to supply an accurate exit_date. In a very small number of cases I have seen these missing dates. So I do not want the function to fail in this scenario. My solution here was simply to break out and move to the next iteration of the loop.

Finally, I apply the last rule, which compares the final exit_date for a student to exitby, incrementing moves if the student left prior to the end of the year and likely enrolled elsewhere before the summer.

The last step is to close the for loop and return our result:

  }
  return(output)
}

Version 2: 10x Speed And More Readable

The second version of this code is vastly quicker.

The opening portion of the code, including the error checking, is essentially a repeat of before, as is the initialization of the output.

moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sasid', 
                       schid='schno',
                       enroll_date='enroll_date',
                       exit_date='exit_date'){
  if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
  require(data.table)
  if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
      stop("Both enroll_date and exit_date must be Date objects")
  if(missing(enrollby)){
    enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
      enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                                '-09-15', sep=''), format='%Y-%m-%d')
      warning(paste("enrollby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    enrollby, sep=' '))
    }else{
      enrollby <- as.Date(enrollby, format="%Y-%m-%d")
    }
  }
  if(missing(exitby)){
    exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                            '-06-01', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
      exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                                '-06-01', sep=''), format='%Y-%m-%d')
      warning(paste("exitby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    exitby, sep=' '))
    }else{
      exitby <- as.Date(exitby, format="%Y-%m-%d")
    }
  }
  if(!is.numeric(gap)){
    gap <- 14
    warning("gap was not a number, defaulting to 14 days")
  }
  output <- data.frame(id = as.character(unique(df[[sid]])),
                       moves = vector(mode = 'numeric', 
                                      length = length(unique(df[[sid]]))))

Where things start to get interesting is in the calculation of the number of student moves.

Handling Missing Data

One of the clever bits of code I forgot about when I initially tried to refactor Version 1 appears under “Business Rule 2: The Early Summer”. When the exit_date is missing, this code simply breaks out of the loop:

  if(is.na(df[['exit_date']][i])){
    next

Because the new code will not be utilizing for loops or really any more of the basic control flow, I had to devise a different way to treat missing data. The steps to apply the business rules that I present below will fail spectacularly with missing data.

So the first thing that I do is select the students who have missing data, assign the moves in the output to NA, and then subset the data to exclude these students.

incomplete <- df[!complete.cases(df[, c(enroll_date, exit_date)]), ]
if(dim(incomplete)[1]>0){
  output[which(output[['id']] %in% incomplete[[sid]]),][['moves']] <- NA
}
output <- data.table(output, key='id')
df <- df[complete.cases(df[, c(enroll_date, exit_date)]), ]
dt <- data.table(df, key=sid)

Woe with data.table

Now with the data complete and in a data.table, I have to do a little bit of work to get around my frustrations with data.table. Because data.table does a lot of work with the [ operator, I find it very challenging to use a string argument to reference a column in the data. So I just gave up and internally rename these attributes.

dt$sasid <- as.factor(as.character(dt$sasid))
setnames(dt, names(dt)[which(names(dt) %in% enroll_date)], "enroll_date")
setnames(dt, names(dt)[which(names(dt) %in% exit_date)], "exit_date")

Magic with data.table: Business Rules 1 and 2 in two lines each

Despite my challenges with the way that data.table re-imagines [, it does allow for clear, simple syntax for complex processes. Gone are the for loops and conditional blocks. How does data.table allow me to quickly identify whether or not a student’s first or last enrollment is before or after my cutoffs?

first <- dt[, list(enroll_date=min(enroll_date)), by=sid]
output[id %in% first[enroll_date>enrollby][[sid]], moves:=moves+1L]
last <- dt[, list(exit_date=max(exit_date)), by=sid]  
output[id %in% last[exit_date<exitby][[sid]], moves:=moves+1L]

Line 1 creates a data.table with the student identifier and a new enroll_date column that is equal to the minimum enroll_date for that student.

The second line is very challenging to parse if you’ve never used data.table. The first argument for [ in data.table is a subset/select function. In this case,

id %in% first[enroll_date>enrollby][[sid]]

means,

Select the rows in first where the enroll_date attribute (which was previously assigned as the minimum enroll_date) is after the global function argument enrollby, and check whether the id of output is in that set of student identifiers.

So output is being subset to only include those records that meet that condition, in other words, the students who should have a move because they entered the school year late.

The second argument of [ for data.tables is explained in this footnote 2 if you’re not familiar with it.
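For readers who have never used data.table, here is a toy version of the subset-then-assign idiom used above. The table and values are made up purely for illustration:

library(data.table)

# A keyed toy table standing in for output.
toy <- data.table(id = c("a", "b", "c"), moves = c(0L, 0L, 0L), key = "id")

# The first argument subsets rows; := then updates moves by reference,
# so only the selected students are incremented.
toy[id %in% c("a", "c"), moves := moves + 1L]
toy  # a and c now have moves == 1, b is untouched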

Recursion. Which is also known as recursion.

The logic for Business Rules 3-5 is substantially more complex. At first it was not plainly obvious how to avoid a slow for loop for this process. Each of the rules on switching schools requires an awareness of context– how does one record of a student compare to the very next record for that student?

The breakthrough was thinking back to my single semester of computer science and the concept of recursion. I created a new function inside of this function that can count how many moves are associated with a set of enrollment records, ignoring the considerations in Business Rules 1 and 2. Here’s my solution. I decided to include inline comments because I think it’s easier to understand that way.

school_switch <- function(dt, x=0){
  # This function accepts a data.table dt and initializes the output to 0.
    if(dim(dt)[1]<2){
    # When there is only one enrollment record, there are no school changes to
    # apply rules 3-5. Therefore, the function returns the value of x. If the
    # initial data.table contains a student with just one enrollment record, 
    # this function will return 0 since we initialize x as 0.
      return(x)
    }else{
      # More than one record, find the minimum exit_date which is the "first"
      # record
      exit <- min(dt[, exit_date])
      # Find out which school the "first" record was at.
      exit_school <- dt[exit_date==exit][[schid]]
      # Select which rows come after the "first" record and only keep them
      # in the data.table
      rows <- dt[, enroll_date] > exit
      dt <- dt[rows,]
      # Find the minimum enrollment date in the subsetted table. This is the
      # enrollment that follows the identified exit record
      enroll <- min(dt[, enroll_date])
      # Find the school associated with that enrollment date
      enroll_school <- dt[enroll_date==enroll][[schid]]
      # When the difference between the enrollment and exit dates are less than
      # the gap and the schools are the same, there are no moves. We assign y,
      # our count of moves to x, whatever the number of moves were in this call
      # of school_switch
      if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap &
         exit_school==enroll_school){
        y = x
      # When the difference in days is less than the gap (and the schools are
      # different), then our number of moves are incremented by 1.
      }else if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap){
        y = x + 1L
      }else{
      # Whenever the dates are separated by more than the gap, regardless of which
      # school a student is enrolled in at either point, we increment by two.
        y = x + 2L
      }
      # Explained below outside of the code block.
      school_switch(dt, y)
    }
  }

The recursive aspect of this method is calling school_switch within school_switch once the function reaches its end. Because I subset out the row with the minimum exit_date, the data.table has one fewer row to process with each iteration of school_switch. By passing the number of moves, y, back into school_switch, I am “saving” my work from each iteration. Only when a single row remains for a particular student does the function return a value.

This function is called using data.table’s special .SD object, which accesses the subset of the full data.table when using the by argument.

dt[, moves:= school_switch(.SD), by=sid]

This calls school_switch after splitting the data.table by each sid and then stitches the work back together, split-apply-combine style, resulting in a data.table with a set of moves per student identifier. With a little bit of clean up, I can simply add these moves to those recorded earlier in output based on Business Rules 1 and 2.

  dt <- dt[,list(switches=unique(moves)), by=sid]
  output[dt, moves:=moves+switches]
  return(output)
}

Quick and Dirty system.time
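A quick sketch of how the two versions might be compared with system.time(); the function names moves_calc_v1() and moves_calc_v2() are placeholders for the two implementations above, and test_df is the small test set constructed earlier:

# Hypothetical timing comparison of the two implementations described above.
system.time(v1 <- moves_calc_v1(test_df, enrollby = "2012-09-15",
                                exitby = "2013-06-01", gap = 14,
                                sid = "sid", schid = "schid"))
system.time(v2 <- moves_calc_v2(test_df, enrollby = "2012-09-15",
                                exitby = "2013-06-01", gap = 14,
                                sid = "sid", schid = "schid"))

The toy test set is far too small to show a meaningful difference; the 10x figure above comes from running both versions on tens of thousands of real enrollment records.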


  1. On a mid-2012 Macbook Air, the current mobility calculation is very effective with tens of thousands of student records and practical for use in the low-hundreds of thousands of records range. ↩︎

  2. I thought I was going to use data.table for some of its speedier features as I wrote this initial function. I didn’t in this go (though I do in Version 2). However, I do find the data.table syntax for assigning values to be really convenient, particularly the := operator which is common in several other languages. In data.table, the syntax dt[,name:=value] assigns value to an existing (or new) column called name. Because of the keyed select operator in data.table, I can just use dt[id,moves:=moves+1L] to select only the rows where the table key, in this case sid, matches id, and then increment moves. Nice. ↩︎ ↩︎

September 16, 2013

How do we calculate student mobility? I am currently soliciting responses from other data professionals across the country. But when I needed to produce mobility numbers for some of my work a couple of months ago, I decided to develop a set of business rules without any exposure to how the federal government, states, or other existing systems define mobility. 1

I am fairly proud of my work on mobility. This post will review how I defined student mobility. I am hopeful that it matches or bests current techniques for calculating the number of schools a student has attended. In my next post, I will share the first two major versions of my implementation of these mobility business rules in R. 2 Together, these posts will represent the work I referred to in my previous post on the importance of documenting business rules and sharing code.

The Rules

Working with district data presents a woefully incomplete picture of the education mobile students receive. Particularly in a state like Rhode Island, where our districts are only a few miles wide, there is substantial interdistrict mobility. When a student moves across district lines, their enrollment is not recorded in local district data. However, even with state level data, highly mobile students cross state lines and present incomplete data. A key consideration for calculating how many schools a student has attended in a particular year is capturing “missing” data sensibly.

The typical structure of enrollment records looks something like this:

| Unique Student ID | School Code | Enrollment Date | Exit Date |
|---|---|---|---|
| 1000000 | 10101 | 2012-09-01 | 2012-11-15 |
| 1000000 | 10103 | 2012-11-16 | 2013-06-15 |

A compound key for this data consists of the Unique Student ID, School Code, and Enrollment Date, meaning that each row must be a unique combination of these three factors. The data above shows a simple case of a student enrolling at the start of the school year, switching schools once with no gap in enrollment, and continuing at the new school until the end of the school year. For the purposes of mobility, I would define the above as having moved one time.

But it is easy to see how some very complex scenarios could quickly arise. What if student 1000000’s record looked like this?

| Unique Student ID | School Code | Enrollment Date | Exit Date |
|---|---|---|---|
| 1000000 | 10101 | 2012-10-15 | 2012-11-15 |
| 1000000 | 10103 | 2013-01-03 | 2013-03-13 |
| 1000000 | 10103 | 2013-03-20 | 2013-05-13 |

There are several features that make it challenging to assign a number of “moves” to this student. First, the student does not enroll in school until October 15, 2012. This is nearly six weeks into the typical school year in the Northeastern United States. Should we assume that this student has enrolled in no school at all prior to October 15th or should we assume that the student was enrolled in a school that was outside of this district and therefore missing in the data? Next, we notice the enrollment gap between November 15, 2012 and January 3, 2013. Is it right to assume that the student has moved only once in this period of time with a gap of enrollment of over a month and a half? Then we notice that the student exited school 10103 on March 13, 2013 but was re-enrolled in the same school a week later on March 20, 2013. Has the student truly “moved” in this period? Lastly, the student exits the district on May 13, 2013 for the final time. This is nearly a month before the end of school. Has this student moved to a different school?

There is an element missing that most enrollment data has which can enrich our understanding of this student’s record. All districts collect an exit type, which explains whether a student is leaving to enroll in another school within the district, another school in a different district in the same state, another school in a different state, a private school, etc. It also defines whether a student is dropping out, graduating, or has entered the juvenile justice system, for example. However, it has been my experience that this data is reported inconsistently and unreliably. Frequently a student will be reported as changing schools within the district without a subsequent enrollment record, or reported as leaving the district but enroll within the same district a few days later. Therefore, I think that we should try and infer the number of schools that a student has attended using solely the enrollment date, exit date, and school code for each student record. This data is far more reliable for a host of reasons, and, ultimately, provides us with all the information we need to make intelligent decisions.

My proposed set of business rules examines school code, enrollment date, and exit date against three parameters: enrollment by, exit by, and gap. Each student’s minimum enrollment date is compared to enrollment by. If that student entered the data set for the first time before the enrollment by date, the assumption is that this record represents the first time the student enrolls in any school for that year, and therefore the student has 0 moves. If the student enrolls for the first time after enrollment by, then the record is considered the second school a student has attended and their moves attribute is incremented by 1. Similarly, if a student’s maximum exit date is after exit by, then this is considered to be the student’s last school enrolled in for the year and they are credited with 0 moves, but if the exit date is prior to exit by, then that student’s moves is incremented by 1.

That takes care of the “ends”, but what happens as students switch schools in the “middle”? I proposed that each exit date is compared to the subsequent enrollment date. If enrollment date occurs within gap days of the previous exit date, and the school code of enrollment is not the same as the school code of exit, then a student’s moves are incremented by 1. If the school codes are identical and the difference between dates is less than gap, then the student is said to have not moved at all. If the difference between the enrollment date and the previous exit date is greater than gap, then the student’s moves is incremented by 2, the assumption being that the student likely attended a different school between the two observations in the data.
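To make that concrete, consider student 1000000’s record above with an enrollment by date of September 15, an exit by date of June 1, and a 14-day gap (the defaults used in my implementation): the October 15 first enrollment comes after the cutoff (one move), the gap from November 15 to January 3 is far longer than 14 days (two moves), the March 13 to March 20 re-enrollment in the same school falls within the gap (no moves), and the final May 13 exit comes before June 1 (one move), for a total of four moves.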

Whereas calculating student mobility may have seemed a simple matter of counting the number of records in the enrollment file, clearly there is a level of complexity this would fail to capture.

Check back in a few days to see my next post, where I will share my initial implementation of these business rules and how I achieved a 10x speed up with a massive code refactor.


  1. My ignorance was intentional. It is good to stretch those brain muscles that think through sticky problems like developing business rules for a key statistic. I can’t be sure that I have developed the most considered and complete set of rules for mobility, which is why I’m now soliciting others’ views, but I am hopeful my solution is at least as good. ↩︎

  2. I think showing my first two implementations of these business rules is an excellent opportunity to review several key design considerations when programming in R. From version 1 to version 2 I achieved a 10x speedup due to a complete refactor that avoided for loops, used data.table, and included some clever use of recursion. ↩︎

September 12, 2013

One of the most challenging aspects of being a data analyst is translating programmatic terms like “student mobility” into precise business rules. Almost any simple statistic involves a series of decisions that are often opaque to the ultimate users of that statistic.

Documentation of business rules is a critical aspect of a data analyst’s job that, in my experience, is often regrettably overlooked. If you have ever tried to reproduce someone else’s analysis, asked different people for the same statistic, or tried to compare data from multiple years, you have probably encountered difficulties getting a consistent answer on standard statistics, e.g. how many students were proficient in math, how many students graduated in four years, what proportion of students were chronically absent? All too often documentation of business rules is poor or non-existent. The result is that two analysts with the same data will produce inconsistent statistics. This is not because of something inherent in the quality of the data or an indictment of the analysts’ skills. In most cases, the undocumented business rules are essentially trivial, in that the results of any decision have a small impact on the final result and any of the decisions made by the analysts are equally defensible.

This major problem of lax or non-existent documentation is one of the main reasons I feel that analysts, and in particular analysts working in the public sector, should extensively use tools for code sharing and version control like Github, use free tools whenever possible, and generally adhere to best practices in reproducible research.

I am trying to put as much of my code on Github as I can these days. Much of what I write is still very disorganized and, frankly, embarrassing. A lot of what is in my Github repositories is old, abandoned code written as I was learning my craft. A lot of it is written to work with very specific, private data. Most of it is poorly documented because I am the only one who has ever had to use it, I don’t interact with anyone through practices like code reviews, and frankly I am lazy when pressed with a deadline. But that’s not really the point, is it? The worst documented code is code that is hidden away on a personal hard drive, written for an expensive proprietary environment most people and organizations cannot use, or worse, is not code at all but rather a series of destructive data edits and manipulations. 1

One way that I have been trying to improve the quality and utility of the code I write is by contributing to an open source R package, eeptools. This is a package written and maintained by Jared Knowles, an employee of the Wisconsin Department of Public Instruction, whom I met at a Strategic Data Project convening. eeptools is consolidating several functions in R for common tasks education data analysts are faced with. Because this package is available on CRAN, the primary repository for R packages, any education analyst can have access to its functions in one line:

install.packages('eeptools'); require(eeptools)

Submitting code to a CRAN package reinforces several habits. First, I get to practice writing R documentation, explaining how to use a function, and therefore, articulating the assumptions and business rules I am applying. Second, I have to write my code with a wider tolerance for input data. One of the easy pitfalls of a beginning analyst is writing code that is too specific to the dataset in front of you. Most of the errors I have found in analyses during quality control stem from assumptions embedded in code that were perfectly reasonable with a single data set that lead to serious errors when using different data. One way to avoid this issue is through test-driven development, writing a good testing suite that tests a wide range of unexpected inputs. I am not quite there yet, personally, but thinking about how my code would have to work with arbitrary inputs and ensuring it fails gracefully 2 is an excellent side benefit of preparing a pull request 3 . Third, it is an opportunity to write code for someone other than myself. Because I am often the sole analyst with my skillset working on a project, it is easy to not consider things like style, optimizations, clarity, etc. This can lead to large build-ups of technical debt, complacency toward learning new techniques, and general sloppiness. Submitting a pull request feels like publishing. The world has to read this, so it better be something I am proud of that can stand up to the scrutiny of third-party users.
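As a sketch of what that kind of testing might look like, here is a testthat-style check against the mobility function discussed below. This is not code from eeptools itself; the tiny data set is made up, and the expectations simply restate behavior described in these posts (one row of output per student, and a hard stop when the date columns are not Date objects).

library(testthat)

# A sketch, not eeptools code: a tiny data set plus two checks.
tiny <- data.frame(sid = c(1, 1), schid = c(10101, 10103),
                   enroll_date = as.Date(c("2012-09-01", "2012-11-16")),
                   exit_date   = as.Date(c("2012-11-15", "2013-06-15")))

test_that("moves_calc returns one row per student and requires Date columns", {
  result <- moves_calc(tiny, enrollby = "2012-09-15", exitby = "2013-06-01")
  expect_equal(nrow(result), length(unique(tiny$sid)))

  bad <- tiny
  bad$enroll_date <- as.character(bad$enroll_date)
  expect_error(moves_calc(bad, enrollby = "2012-09-15", exitby = "2013-06-01"))
})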

My first pull request, which was accepted into the package, calculates age in years, months, or days at an arbitrary date based on date of birth. While even a beginning R programmer can develop a similar function, it is the perfect example of an easily compartmentalized component, with a broad set of applications, that can be accessed frequently.

Today I submitted my second pull request that I hope will be accepted. This time I covered a much more complex task– calculating student mobility. To be honest, I am completely unaware of existing business rules and algorithms used to produce the mobility numbers that are federally reported. I wrote this function from scratch thinking through how I would calculate the number of schools attended by a student in a given year. I am really proud of both the business rules I have developed and the code I wrote to apply those rules. My custom function can accept fairly arbitrary inputs, fails gracefully when it finds data it does not expect, and is pretty fast. The original version of my code took close to 10 minutes to run on ~30,000 rows of data. I have reduced that with a complete rewrite prior to submission to 16 seconds.

While I am not sure if this request will be accepted, I will be thrilled if it is. Mobility is a tremendously important statistic in education research and a standard, reproducible way to calculate it would be a great help to researchers. How great would it be if eeptools becomes one of the first packages education data analysts load and my mobility calculations are used broadly by researchers and analysts? But even if it’s not accepted because it falls out of scope, the process of developing the business rules, writing an initial implementation of those rules, and then refining that code to be far simpler, faster, and less error prone was incredibly rewarding.

My next post will probably be a review of that process and some parts of my moves_calc function that I’m particularly proud of.


  1. Using a spreadsheet program, such as Excel, encourages directly manipulating and editing the source data. Each change permanently changes the data. Even if you keep an original version of the data, there is no record of exactly what was done to change the data to produce your results. Reproducibility is all but impossible for any significant analysis done using spreadsheet software. ↩︎

  2. Instead of halting the function with a hard-to-understand error when things go wrong, I do my best to “correct” easily anticipated errors or report back to users in a plain way what needs to be fixed. See also: fault-tolerant systems. ↩︎

  3. A pull request is when you submit your additions, deletions, or any other modifications to be incorporated in someone else’s repository. ↩︎