Jason Becker
March 2, 2017

Here’s a fun common task. I have a data set that has a bunch of codes like:

Name Abbr Code
Alabama AL 01
Alaska AK 02
Arizona AZ 04
Arkansas AR 05
California CA 06
Colorado CO 08
Connecticut CT 09
Delaware DE 10

All of your data is labeled with the code value. In this case, you want to do a join so that you can use the actual names because it’s 2017 and we’re not animals.

But what if your data, like the accounting data we deal with at Allovue, has lots of code fields? You probably have either one table that contains all of the lookups in “long” format, with a column that indicates which field in your data each code belongs to, like this:

code type name
01 fips Alabama
02 fips Alaska

Alternatively, you may have a lookup table per data element (so one called fips, one called account, one called function, etc).
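
If you start from the single long table instead, one way to get a list with one lookup per data element is split(). A small sketch, assuming the code/type/name columns shown above (the file name is made up):

# Read the long lookup table and split it into one data.frame per code type
lookup_long <- read.csv("data/lookups_long.csv", stringsAsFactors = FALSE)
lookups <- split(lookup_long, lookup_long$type)
names(lookups)  # e.g. "account" "fips" "function"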

I bet most folks do the following in this scenario:

account <- left_join(account, account_lookup)
account <- left_join(account, fips)

## Maybe this instead ##
account %<>%
  left_join(account_lookup) %>%
  left_join(fips)

I want to encourage you to do this a little differently using purrr. Here’s some annotated code that uses reduce_right to make magic.

# Load a directory of .csv files that has each of the lookup tables
library(purrr)
library(dplyr)
lookups <- map(dir('data/lookups', full.names = TRUE), read.csv,
               stringsAsFactors = FALSE)
# Alternatively, if you have a single lookup table with code_type as the
# data attribute you're looking up:
# lookups <- split(lookup_table, lookup_table$code_type)
lookups$real_data <- read.csv('data/real_data.csv',
                              stringsAsFactors = FALSE)
real_data <- reduce_right(lookups, left_join)

Boom, now you went from data with attributes like funds_code, function_code, state_code to data that also has funds_name, function_name, state_name 1. What’s great is that this same code can be reused no matter how many fields require a lookup. I’m often dealing with accounting data where the accounts are defined by a different number of data fields, but my code doesn’t care at all.
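
As an aside, the joins work without an explicit by argument because left_join() defaults to joining on every column name the two tables share, which is exactly what the naming convention in the footnote below buys you. A minimal sketch with made-up column names:

# left_join() joins on shared column names by default, so each lookup only
# needs its code column to match the column name in the data.
# These tables and column names are made up for illustration.
fips_lookup <- data.frame(state_code = c("01", "02"),
                          state_name = c("Alabama", "Alaska"),
                          stringsAsFactors = FALSE)
toy_data <- data.frame(state_code = c("02", "01", "02"),
                       amount = c(100, 200, 300),
                       stringsAsFactors = FALSE)
left_join(toy_data, fips_lookup)  # message: Joining, by = "state_code"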


  1. My recommendation is to use consistent naming conventions like _code and _name so that knowing how to do the lookups is really straightforward. This is not unlike the convention with Microsoft SQL where the primary key of a table is named Id and a foreign key to that table is named TableNameId. Anything that helps you figure out how to put things together without thinking is worth it. ↩︎

March 1, 2017

One of my goals for 2017 is to contribute more to the R open source community. At the beginning of last year, I spent a little time helping to refactor rio. It was one of the more rewarding things I did in all of 2016. It wasn’t a ton of work, and I feel like I gained a lot of confidence in writing R packages and using S3 methods. I wrote code that R users download and use thousands of times a month.

I have been on the lookout for a Javascript powered interactive charting library since ggvis was announced in 2014. But ggvis seems to have stalled out in favor of other projects (for now) and the evolution of rCharts into htmlwidgets left me feeling like there were far too many options and no clear choices.

What I was looking for was a plotting library to make clean, attractive graphics with tool tips that came with clear documentation and virtually no knowledge of Javascript required. Frankly, all of the htmlwidgets stuff was very intimidating. From my vantage point skimming blog posts and watching stuff come by on Twitter, htmlwidgets-based projects all felt very much directed at Javascript polyglots.

Vega and Vega-Lite had a lot of the qualities I sought in a plotting library. Reading and writing JSON is very accessible compared to learning Javascript, especially with R’s excellent translation from lists to JSON. And although I know almost no Javascript, I found that both Vega and Vega-Lite use easy-to-understand documents that feel a lot like building grammar of graphics 1 plots.

So I decided to take the plunge– there was a vegalite package and the examples didn’t look so bad. It was time to use my first htmlwidgets package.

Things went great. I had some simple data and I wanted to make a bar chart. I wrote:

vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar()

A bar chart was made! But then I wanted to use the font Lato, which is what we use at Allovue. No worries, Vega-Lite has a property called titleFont for axes. So I went to do:

vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar() %>%
axis_x(titleFont = 'Lato')

Bummer. It didn’t work. I almost stopped there, experiment over. But then I remembered my goal and I thought, maybe I need to learn to contribute to a package that is an htmlwidget and not simply use an htmlwidget-based package. I should at least look at the code.

What I found surprised me. Under the hood, all the R package does is build up lists. It makes so much sense– pass JSON to Javascript to process and do what’s needed.
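
To make that concrete, here is a rough sketch of the idea (not vegalite’s actual internals): a nested R list that serializes straight into a Vega-Lite spec.

# Rough sketch only: nested R lists map cleanly onto a Vega-Lite JSON spec
library(jsonlite)
spec <- list(
  data = list(values = data.frame(schools = c("A", "B"),
                                  per_pupil = c(9500, 12000))),
  mark = "bar",
  encoding = list(
    x = list(field = "schools", type = "nominal"),
    y = list(field = "per_pupil", type = "quantitative")
  )
)
cat(toJSON(spec, auto_unbox = TRUE, pretty = TRUE))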

So it turned out, vegalite for R was a bit behind the current version of Vega-Lite and didn’t have the titleFont property yet. And with that, I made my first commit. All I had to do was update the function definition and add the new arguments to the axis data like so:

if (!is.null(titleFont))    vl$x$encoding[[chnl]]$axis$titleFont <- titleFont

But why stop there? I wanted to update all of vegalite to use the newest available arguments. Doing so looked like a huge pain, though. The original package author made these great functions like axis_x and axis_y. They both had the same arguments; the only difference was that the “channel” was preset as x or y based on which function was called. The problem was that all of the arguments, all of the assignments, and all of the documentation had to be copied twice. It was worse with encode and scale, which had many, many functions that were similar or identical in their “signature”. No wonder the package was missing so many Vega-Lite features– they were a total pain to add.

So as a final step, I decided I would do a light refactor across the whole package. For each of the core function families, like encode and axis, I would write a single general function like encode_vl() that would hold all of the possible arguments for the encoding portion of Vega-Lite. Then the specific functions like encode_x could become wrapper functions that internally call encode_vl like so:

encode_x <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "x", ...)
  vl
}

encode_y <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "y", ...)
  vl
}

encode_color <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "color", ...)
  vl
}

Now, in order to update the documentation and the arguments for encoding, I just have to update the encode_vl function. It’s a really nice demonstration, in my opinion, of the power of R’s ... syntax. All of the wrapper functions can just pass whatever additional arguments the caller wants to encode_vl without having to explicitly list them each time.
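
For what it’s worth, here is a stripped-down, hypothetical skeleton of what such a shared worker can look like; the real encode_vl() in vegalite takes many more arguments and does more validation:

# Hypothetical skeleton, only to illustrate the `...` pattern; not the
# actual implementation in vegalite
encode_vl <- function(vl, chnl = "x", field = NULL, type = NULL, ...) {
  # gather the named arguments plus anything passed through `...`,
  # dropping NULLs so only supplied settings land in the spec
  args <- c(list(field = field, type = type), list(...))
  args <- args[!vapply(args, is.null, logical(1))]
  vl$x$encoding[[chnl]] <- args
  vl
}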

This greatly reduced duplication in the code and made it far easier to update vegalite to the newest version of Vega-Lite, which I also decided to do.

Now Vega-Lite itself is embarking on a 2.0 release that I have a feeling will have some pretty big changes in store. I’m not sure if I’ll be the one to update vegalite– in the end, I think that Vega-Lite is too simple for the visualizations I need to do– but I am certain whoever does the update will have a much easier go of it now than they would have just over a month ago.

Thanks to Bob Rudis for making vegalite and giving me free range after a couple of commits to go hog-wild on his package!


  1. The gg in ggplot2. ↩︎

February 27, 2017

You can check my Goodreads profile. I love science fiction and fantasy. And I know it’s 2017 and everyone has already observed the rise of “geek culture”, with the dominance of Disney properties from Marvel and now Star Wars. Hell, Suicide Squad won a goddamn Oscar.

Powell's Best Selling Fiction, 2017-02-26

But I never felt like SFF was all that mainstream. SyFy might have made (and renewed) a TV series based on The Magicians, but I still feel like the disaffected entitled shit who held onto his love of genre fiction too long when I crawl into bed and hide in speculative fiction (thank you, Quentin, for so completely capturing what a shit I was at 14).

Yesterday, I was confronted with the reality of SFF going mainstream at Powell’s City of Books. I was fully unprepared to see the contents of their Best Selling Fiction shelf.

By my count, at least 16 of the top 42 are SFF. The Name of the Wind, The Left Hand of Darkness, The Fifth Season, 2312, and Uprooted are some of the best books I’ve read in the last four or five years. To think of these books as best sellers when they don’t have a TV show coming out (like American Gods, The Handmaid’s Tale, The Man in the High Castle, and The Magicians) and aren’t assigned in high school classrooms (1984, Slaughterhouse-Five) is just shocking. In my mind, these aren’t best sellers, they’re tiny nods between myself and other quiet bookshoppers that we are kin.

I am not sad though. I am thrilled. I want to live in a world where I can just assume acquaintances are reading The Fifth Season and Uprooted.

January 13, 2017

“I just got an Amazon Echo and surprisingly, I really love it.”

In one form or another, this story has repeated again and again across the internet. So while the recent headline seemed to be “Amazon’s Alexa is everywhere at CES 2017”, it really feels like this year was the Amazon Alexa year.

I have an Amazon Echo. I bought it around a year ago during a sale, as the buzz seemed to have peaked 1. My experience with the Amazon Echo has mostly been “I don’t get it.”

The Echo was kind of fun with Philips Hue lights or for a timer or unit conversion in the kitchen from time to time, but not much else. I was not much of a Siri user, and it turned out I was not much of an Amazon Echo user.

But I just bought an Echo Dot.

HomeKit and Siri Can’t Compete

Siri and HomeKit should be a match made in heaven, but if I say “Hey Siri, turn off the bedroom lights,” the most common response is “Sorry, I cannot find device ‘bedroom lights’.” I then repeat the same command and the lights go off. That kind of failure has literally happened once in almost a year of Echo ownership, but it happened nearly every time with Siri.

Apple is miserable at sharing 2, and that means that even if Siri worked perfectly, HomeKit is built on a bad model.

My Philips Hue lights are HomeKit compatible. I use Family Sharing so that my partner Elsa can have access to all the apps I buy. Yet it took months before the email inviting her to my home would actually send and work. And really, why should I have to invite someone to anything to turn on the lights in my home? Apple knows all about proximity, with its awesome use of iBeacons in its own stores. If being within reach of my light switches or thermostats were enough security to control my devices before, why is Apple making it so hard for people to access them via HomeKit? 3

A Better Model for HomeKit

HomeKit’s insistence that all devices have the same security profile and complex approval has meant that devices are rarely HomeKit compatible while everything is compatible with Alexa’s simple skills program.

Imagine if the HomeKit app and excellent Control Center interface 4 existed at the launch of HomeKit. Imagine if instead of a complex encryption setup that required new hardware, Apple had tiered security requirements, where door locks and video surveillance lay on one side with heavy security, but lights and thermostats lay on the other. Imagine if HomeKit sharing was a simple check box interface that the primary account with Family Sharing could click on and off. Imagine if controlling low security profile devices with Siri worked for anyone within a certain proximity using iBeacons.

This is a world where Apple’s Phil Schiller is right that “having my iPhone with me” could provide a better experience than a device designed to live in one place.

That’s not the product we have. And even if Apple gets into the “tower with microphones plugged into a wall” game, I don’t see them producing an Echo Dot-like product that makes sure your voice assistant is everywhere. My iPhone might be with me. I may be able to turn on and off the lights with my voice. But without something like iBeacons in the picture, if someone comes to stay in my guest room they’re back to using the switch on the wall. If a family member uses Android, they are out of luck. If I have a child under the age where a cell phone is appropriate, they are back to living in a dumb home. The inexpensive Echo Dot means you can sprinkle Alexa everywhere the devices you want to control are, for anyone to interact with. 5 Apple doesn’t do inexpensive well.

I’m not sure they can resolve the product decisions around HomeKit that make it less appealing to hardware manufacturers. Worse, some of Alexa’s best skills are entirely software. Alexa’s skills can seemingly shoot off requests to API endpoints all over the place. So instead of needing to buy a new Logitech Harmony Hub with complex encryption, specialized SiriKit/HomeKit support, and tight integration, my existing Harmony Hub, which has an API already used by Logitech’s own application, supports Alexa skills. An Alexa skill can be built in a day. Apple is allergic to doing simple services like this well, even though the entire web runs on them.

My Dot

Our new bedroom in Baltimore does not have recessed lighting, much like our old bedroom. We’re using one of those inexpensive, black tower lamps in the bedroom. I don’t have a switch for the outlets in there. Philips doesn’t make any Hue bulbs that provide enough light to light the room with one lamp.

I needed an instant way to get the light on and off. That’s when I remembered I had an old WeMo bought from before the days of HomeKit. I used that WeMo to have a simple nightly schedule (turn some lights on at sundown and off at midnight each night) and never really thought about it. The WeMo was perfect, and lo and behold, it works with Alexa. Our Echo is a bit far from the bedroom though, and I don’t want to shout across the house to turn off the lights.

Not only was the inexpensive Echo Dot perfect for sprinkling a little voice in the bedroom, it also meant our master bathroom can have Hue lights that are controlled with voice again. And now I have a way to get Spotify onto my Kanto YU5 speakers in the bedroom without fussing with Bluetooth 6 from my phone, by just leaving a 1/8" phono plug connected permanently.

Now we say “Alexa, turn on the bedroom light” and “Alexa, play Jazz for Sleep”. It’s great. It always works. If we had a guest bedroom with the same setup, anyone who stayed there would be able to use it just as easily. No wonder the Wynn is putting the Amazon Echo in their hotel. Apple literally can’t do that.

Whither Voice Control

Amazon, Apple, and Google seem locked in a battle over voice 7. I can think of four main times I want to use voice:

  1. Walking with headphones
  2. Driving in the car
  3. Cooking in the kitchen
  4. Sharing an interface to shared devices

For Phil Schiller, and by extension, Apple, the killer feature of Siri is you always have it. 8 In a recent interview, Schiller is quoted as saying:

Personally, I still think the best intelligent assistant is the one that’s with you all the time. Having my iPhone with me as the thing I speak to is better than something stuck in my kitchen or on a wall somewhere.

Apple AirPods are all about maximizing (1). Siri works pretty well when I’m out with the dogs and need a quick bit of information or to make a call. But the truth is, I don’t really need to learn a lot from Siri on those dog walks. When I’m out walking, Siri is better than pulling out my phone, but once I’ve got a podcast or music going I don’t really need anything from Siri in those moments.

Driving is another great context for voice control. I can’t look much at a screen and shouldn’t anyway. Ignoring CarPlay, Apple’s real move here is Bluetooth interfaces, which places Siri in most cars. But again, what is my real use for voice control in this scenario? Reading SMS and iMessages makes for a cool demo, but not really something I need. Getting directions to a location by name is probably the best use here, but Apple’s location database is shit for this. Plus, most of the time I choose where I am going when I get in the car, when I can still use my screen and would prefer to. The most important use of voice control in the car is calling a contact, which Voice Control, the on-device precursor to Siri, did just fine. And now Alexa is entering the car space too.

While cooking, it is great to be able to set timers, convert measurement units, and change music or podcasts while my hands are occupied. This is why so many people place their Amazon Echo in the kitchen– it works great for these simple tasks. “Hey Siri” and a Bluetooth speaker is a terrible solution in comparison. In fact, one thing that the Amazon Echo has done is cause me to wear my headphones less while cooking or doing the dishes, since the Amazon Echo works better and doesn’t mean I can’t hear Elsa in the other room. This isn’t a killer feature though. Early adopters may be all about the $180 kitchen timer with a modest speaker, but the Echo won’t be a mass product if this is its best value proposition.

Shared Interface to Shared Devices

There is a reason why home automation is where the Echo shines. Our homes are full of technology: light switches, appliances, televisions and all the boxes that plug into them, and everyone who enters the home has cause to control them. We expect that basically all home technology is reasonably transparent to anyone inside. Everyone knows how to lock and unlock doors, turn on the TV, turn on and off lights, or adjust the thermostat. Home automation has long been a geek honey pot that angers every cohabitant and visitor, but voice control offers an interface as easy and common as a light switch.

Home automation is the early adopter use case that reveals how and why voice control is a compelling user interface. Turning on the bedroom lights means saying “Alexa, turn on the bedroom lights.” There is no pause for Siri to be listening. There is no taking out my phone or lifting up my watch. There is no invite process. There is no worrying about guests being confused. Anyone inside my home has always been able to hit a light switch. Anyone inside my home can say “Alexa, turn on the living room lights.” That’s why Apple erred by not having a lower security, proximity based way to control HomeKit devices.

Voice control is great because it provides a shared interface to devices that are shared by multiple people. Computers, smartphones, and really most screen-based interfaces that we use are the aberration, pushing toward and suggesting that technology is becoming more personal. The world we actually live in is filled with artifacts that are communal, and as computer and information technology grow to infuse into the material technologies of our lives, we need communal interfaces. Amazon is well positioned to be a part of this future, but I don’t think Apple has a shot with its existing product strategy.


  1. It hadn’t. ↩︎

  2. We still don’t have multi-user iPads or iPhones. I have a new AppleTV, but all the TV app recommendations don’t work because two people watch TV. Unlike Netflix, we can’t have separate profiles. And the Apple Watch is billed as their most personal device yet. Where Amazon moves into the world of ever-present, open communal interfaces, Apple is looking toward individual, private worlds. ↩︎

  3. Ok, here come the critiques about how HomeKit can be used to open door locks or activate video surveillance, etc. Great– those are cool uses of technology that also have mostly proximity based security, but fine, I can see a case for heavy encryption and complex sharing setups for those devices. But the truth is, most of the internet of things aren’t these kinds of devices. A good product would easily scale to different security profiles. ↩︎

  4. An unconscionable amount of the time I see “No Response” in Control Center under my devices. Worse, I have to sit and wait for Apple to realize those devices are there because eventually they pop on. Instant interfaces matter, and they matter even more when trying to replace a light switch. ↩︎

  5. There’s probably a good critique about privilege here, assuming that you have multiple rooms that would need a separate assistant device. But I would like to remind you that we’re talking about spending hundreds of dollars on cylinders plugged into walls that you talk to to control things that cost 4x their traditional counterparts. For the foreseeable future, we are addressing a market of rich people and this technology will succeed or fail there long before we get to ubiquity. Plus, who cares what Apple has to say about any of this if we’re not talking about rich people? Apple’s market is rich people and that isn’t going to change. Affordable luxury is the brand and target, where affordable in the global scheme means fairly well off. ↩︎

  6. Bluetooth is a dumpster fire when you have two phones, two sets of wireless headphones, a Bluetooth speaker in the bathroom, a Bluetooth speaker in the bedroom, and a shared car with Bluetooth. All of these things will choose at various times to ~conveniently~ connect to whatever phone they want if you’re not diligent about powering things down all the time. Bluetooth audio is a shit experience. ↩︎

  7. Cortana isn’t anywhere that matters, so it doesn’t matter yet. ↩︎

  8. Apple Watch is about extending Siri wherever you are, but I don’t use Siri on my Watch much, because it’s not great in any of those four contexts. If I can raise my hand, I have hands, and I’d rather use my phone. ↩︎

December 28, 2016

A lot of the data I work with uses numeric codes rather than text to describe features of each record. For example, financial data often has a fund code that represents the account’s source of dollars and an object code that signals what is bought (e.g. salaries, benefits, supplies). This is a little like the factor data type in R, which, to the frustration of many modern analysts, is internally an integer mapped to a character label (called a level), with a fixed number of possible values.

I am often looking at data stored like this:

fund_code object_code debit credit
1000 2121 0 10000
1000 2122 1000 0

with the labels stored in another set of tables:

fund_code fund_name
1000 General

and

object_code object_name
2121 Social Security
2122 Life Insurance

Before purrr, I might have done a series of dplyr::left_join or merge to combine these data sets and get the labels in the same data.frame as my data.

But no longer!

Now, I can just create a list, add all the data to it, and use purrr::reduce to bring the data together. Incredibly convenient when up to 9 codes might exist for a single record!

# Assume each code-name pairing is in a CSV file in a directory
data_codes <- lapply(dir('codes/are/here/', full.names = TRUE),
                     readr::read_csv)
data_codes$transactions <- readr::read_csv('my_main_data_table.csv')
transactions <- purrr::reduce_right(data_codes, dplyr::left_join)
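
Written out by hand, and assuming exactly two lookup CSVs were read above, that reduce_right() call amounts to roughly this: the fold starts with the main table (the last element of the list) and left-joins each lookup onto it in turn.

# Roughly equivalent, spelled out by hand for two lookup tables
transactions <- dplyr::left_join(
  dplyr::left_join(data_codes$transactions, data_codes[[2]]),
  data_codes[[1]]
)
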
December 27, 2016

I have written a lot on the internet. This isn’t a surprise, I’ve been here since the mid-90s. But the truth is, most of what I write on the internet doesn’t make me proud. It hasn’t made the world any better. It certainly hasn’t made me feel any better. Most of this terrible writing is easy to associate with me, because a long time ago, I chose to use my real name as my online identity. Using my real name was supposed to make sure that I would stand by what I said, but the truth is that I am not always my better self on internet forums, Facebook, Twitter, or other places I get to write.

My personal blog is a bit different. The writing I’ve done over the years at my own domains has been… less embarrassing. I don’t mean to say that the quality of the writing is any better (it’s not); it’s just that the extra thought involved in opening up a text editor, writing something in Markdown, and taking the steps to post it has resulted in fewer emotional tirades. I do a much better job of deleting or never completing posts on my blog than I ever did writing somewhere someone else owned. It’s too bad the audience here is much smaller and harder to come by.

My blog has always been a testing ground. It’s where I’ve learned how to use Wordpress, Pelican, and now Hugo. It’s been a place to think about templating, structure, CSS, shared hosting, Github pages, server management, nginx and the like. This is where I try different types of writing like link posts, lists, professional-ish informational posts, public versions of notes to myself, images, and more. This blog hasn’t had a topic or a format. I’m not convinced it ever will. For me, a self-hosted blog is meant to be a personal lab bench.

I hope today I am starting what I consider to be the final version of this blog. I feel confident in the domain and name. I feel comfortable with Hugo and the flexibility of my fully custom theme. I feel great about not having comments.

The look and content will change many times again, but I feel good that from here forward I’ll be using and evolving json.blog. This is my home on the web.

July 8, 2016

When I entered high school, video games were beginning to lose their appeal. So I sold my four video game systems and all their games at a garage sale, and that money, plus some Chanukkah money, bought me my first guitar and amp. I had just joined a band as a singer with a couple of guys I knew from school. I didn’t know anything about playing guitar. In fact, it took me a while to figure out what distortion was and why my guitar didn’t sound like Kris Roe from The Ataris.

In the beginning, being in a band was rough. We had a lot of fun playing together, but just keeping time and making it all the way through a song was a slog. We had agreed I wouldn’t try playing guitar with the band until I had been playing at least three months, self-taught. But this was 2001 and I lived on Long Island and we were playing pop-punk, so it didn’t take too long to catch up. Soon I was writing music, actual original music. To this day, I don’t really enjoy playing other people’s music, because from the moment I picked up a guitar it was an instrument for creating new music with other people.

Writing music was a kind of awful torture I was addicted to. For years, I was absorbing how music sounded. I used to listen so intently that I memorized whole compositions. I wanted to hear every strike of a kick drum, every open string ringing out, every tapped bass note, and every quiet piano layered in deep below the mix. But now that I was writing music, I became obsessed with its construction. All the nuances I worked hard to hear in music took on a whole new layer of depth as I tried to unravel how the song was made. I never heard music the same way again. But my appreciation for the craft of songwriting far exceeded the meager results a few months of having a Stratocaster glued to my hands could produce. I would meet and practice with just one of the other members of the band for hours-long writing sessions where we would struggle to create something good enough to bring to the rest of the guys and flesh out into a full song. But eventually, within a couple of years, we wrote about 15 songs, at least six or seven of which I was pretty proud of. It was so difficult to write those first songs. It took so many hours at home alone, then working hard with one or two other guys to write new parts and determine a structure, and then eventually months of practice with four guys sweating in a basement practicing the same music over and over again.

I wanted so badly to write my song.

Every band has one. Their song. The real one. The song that every musician who hears your album recognizes immediately as the song that transcends the talent of the individuals involved and is just plain better. It’s not the most complex song. It may not even be the most overtly emotional. It’s probably not your single. But it’s the song that stands out as a proud announcement to the people like me, the musicians who absorb every sound and experience the very structure of the music. Transcendent, to repeat myself, is really the best explanation for it. These are the songs that shook my soul, and I wanted to find mine.

I never did write my song. I ended up quitting that first band after two and a half years and playing with a different set of guys for a bit over a year chasing “my song”. I hoped a different writing experience with different musicians might help. Throughout college, I still played guitar all the time, but I never got comfortable writing without collaborators and I never found the right people to fulfill that role. Nowadays I pick up a guitar so rarely. I hear a phrase in a song I love and immediately know I can play it and sometimes get the urge to actually prove that to myself. Once a year, the foggy edges of a song appear in the distance, enticing me to chase it for a short while, and I record a small phrase to add to the library of forgotten riffs and lyrics.

I still listen to music, though not as often and not really the same kinds anymore. And I still can’t listen the way I used to, the way it was before I picked up a guitar and tried singing into a microphone. That part of me is permanently broken in a way I expect only musicians can understand.

I learned something important about myself in my time as a musician. When I’m chasing something I truly love, I don’t feel some great pleasure. Writing music was about throwing myself into an agonizing chase for the impossible. It was the euphoria of the small accomplishments– a good song performed on stage in front of a crowd that actually responds to your creation, or cracking how to transition from a verse to a chorus– that kept me going. And it was the imprint on my life, mind and soul, that brought me true joy from being a musician as time went on.

Working on product at Allovue feels like writing music. I have never done something this hard, but I do know what it is like to experience a profound need so deeply. There are moments of real euphoria, like when a user describes their experience with Balance in a way that so perfectly aligns with our vision that I triple check they are not a plant. And there are moments of agony, like almost every time I start to “listen” to our product and deconstruct it, and feel the weight of a decade’s worth of ideas on what our product needs to match the vision I have had since the first time Jess told me what she’s trying to do.

It feels like for the first time, I just might be writing my song. The real one. And I’m terrified I’m not good enough or strong enough or just plain enough to see it through.

April 23, 2016

I read a lot of science fiction and fantasy, genres filled with long running book series. Until the last couple of years, I mostly avoided any series that wasn’t already complete. First, I don’t like truly “epic” sci-fi fantasy. On-going series without an end in sight, or series that go beyond roughly 3,000 to 4,000 pages, never end well for me 1. I simply lose interest. Second, I worry that series won’t actually reach completion, either because the books are not successful enough or the author gets writer’s block 2, or because I get caught up in waiting way too long between books 3. Third, I like to actually remember what happened, especially in the kind of complex stories I like to read.

Some series do really well with sequels. I recently read through Kelly McCullough’s Fallen Blade series, and although it is complete and I did read the books in succession, they always made a clear attempt to reintroduce everything about the novel and the necessary bits of past events. In fact, McCullough was so good at this, it was almost obnoxious to read the series all in one go 4.

But other books seem to provide no help at all. And I am now deeply invested in several series that have not yet completed. Right now I’m finally reading Poseidon’s Wake, the third and final novel in Alastair Reynolds’ Poseidon’s Children trilogy. Because it had been so long, I had forgotten critical parts of the earlier two novels that I enjoyed so much. Now, nearly 40% through the book and thoroughly engrossed, most of the key information has miraculously come back to me. But I found it difficult to get through the first 5% or so of the novel if for no other reason than I was trying to remember what was in Blue Remembered Earth and what was in Kim Stanley Robinson’s 2312 5.

I must admit, I am often impressed with my own ability to recall details of a story I read years earlier when encountering a sequel, because I seem to remember far more of it than expected. But I wonder, what must the editing process on a sequel be like? How do authors and editors decide what can be assumed and what cannot?


  1. See Wizard’s First Rule, Dune, and A Song of Ice and Fire. ↩︎

  2. See Patrick Rothfuss. ↩︎

  3. I think I really learned this waiting for the conclusion of His Dark Materials, which felt like it took a goddamn life time. ↩︎

  4. I assume these books must be geared toward young adults and that this impacted the “hand holding” involved in moving from book to book. I’m not sure if they’re considered YA fiction, but the writing certainly had that feel. Still, they were wonderfully fun quick reads. I read all six books from November 23rd through December 14th. ↩︎

  5. A novel I did not enjoy nearly as much, but which seemed to have very similar themes and setting and which I read three months prior to Blue Remembered Earth and On the Steel Breeze. ↩︎

July 5, 2015

I have had a Tumblr site for a long time but never knew what to do with it. What is Tumblr exactly? Is it a hosted blog? Is it a hosted blog for hipsters? Is it a social network? Why should I put content into Tumblr?

I have this blog, but I barely use it. I don’t have a Facebook page, because I don’t trust Facebook and how it repeatedly changed and confused privacy settings and after college, I rarely found that Facebook was a net positive in my life. Recently I crossed 1000 followers on Twitter.

I like the sense of control offered by owning where I put content. But the barrier to posting a blog post has always felt high to me. A blog feels somewhat permanent. It’s something I want my current and future employers and friends to read. It’s a record of ideas that felt worthy of memorializing. I have tried over and over again to lower this perceived barrier to blogging and failed.

At the same time, I find the quick ability to favorite/like, retweet/re-broadcast, and respond on Twitter to be addicting. It is so easy to give feedback and join a conversation. As a result, I’ve probably written more, 140 characters at a time, on Twitter than I ever have on this blog.

For me, Twitter is an ephemeral medium. It is about instant conversation and access. What I dump into Twitter doesn’t have any lasting power, which is why it’s so easy to toss out thoughts. Twitter is my new IRC, not a microblog.

Writing on Twitter in 140 characters often seems to attract the worst in people. It’s not just #gamergate, it’s me. My ideas are more sarcastic, more acerbic, and less well considered because Twitter feels like an off the cuff conversation among friends. But it’s not a conversation among friends. It’s not really even a conversation. It’s a bunch of people shouting at each other in the same room. Twitter is less a late night dorm room debate and more the floor of the New York Stock Exchange.

Which brings me to Tumblr, a service I think I finally understand. Tumblr is Twitter, but for people who take a breath before shouting. It has the same rich post types that Twitter has implemented through Cards. It has the same ability to magnify posts I find interesting through its reblogging feature. It also has the same ability to send a bit of encouragement and acknowledgement through its hearts. But Tumblr also doesn’t have the limitation of 140 characters, so I can spread my thoughts out just a bit further. And Tumblr does have a reply/conversation mechanism, but it’s just slightly “heavier” feeling than a Twitter reply so I’m less likely to just shoot off my mouth with the first thoughts that come to mind. Though Tumblr is a hosted service, it also has a fairly good API that can be used to export posts and the ability to use a custom URL. I could generate more post types on my Pelican blog, but a self-hosted blog lacks some of the social features that are just fun. And the truth is, do I really want to just put a link to a song I’m listening to right now on my blog? Is that kind of ephemera really worthy of a blog post? Maybe, but that’s not the kind of blog I want.

So I am going back to Tumblr. I have been experimenting for a couple of days and I really like having a place to dump a link or a funny picture. I don’t want Tumblr to host my blog, but I do want Tumblr to eat into some of my Twitter posting. I can easily syndicate Tumblr posts over to Twitter, so why not take a little more space and breathe before deciding it is worth sharing something.

Please follow me on Tumblr. I think it’s going to be really fun.

Cross-posted on both my blog and my Tumblr

July 4, 2015

How many times have you written R functions that start with a bunch of code that looks like this?

my_funct <- function(dob, enddate = as.Date("2015-07-05")) {
  if (!inherits(dob, "Date") || !inherits(enddate, "Date")) {
    stop("Both dob and enddate must be Date class objects")
  }
...
}

Because R was designed to be interactive, it is incredibly tolerant to bad user input. Functions are not type safe, meaning function arguments do not have to conform to specified data types. But most of my R code is not run interactively. I have to trust my code to run on servers on schedules or on demand as a part of production systems. So I find myself frequently writing code like the above– manually writing type checks for safety.

There has been some great action in the R community around assertive programming, as you can see in the link. My favorite development, by far, is type-safe functions in the ensurer package. The above function definition can now be written like this:

my_funct <- function_(dob ~ Date, enddate ~ Date: as.Date("2015-07-05"), {
  ...
})

All the type-checking is done.

I really like the reuse of the formula notation ~ and the use of : to indicate default values.
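
Here is a quick usage sketch, with a hypothetical function in place of the elided body above, just to show what the type checking buys you:

library(ensurer)
# Hypothetical example: days between a date of birth and an end date
age_days <- function_(dob ~ Date, enddate ~ Date: as.Date("2015-07-05"), {
  as.numeric(enddate - dob)
})
age_days(as.Date("1985-03-02"))  # works; enddate falls back to its default
try(age_days("1985-03-02"))      # a character, not a Date, so this errors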

Along with packages like testthat, R is really growing up and modernizing.

June 14, 2015

When discussing policy in Rhode Island, I almost always encounter two bizarre arguments.

  1. Rhode Island is completely unique. Ideas from other places don’t adequately take into account our local context. What is working there either won’t work here or isn’t really comparable to our situation here.
  2. What is happening nationally is directly applicable to Rhode Island. We can make broad sweeping statements about a set of policies, ideas, or institutions currently in play in Rhode Island without any knowledge of how things are going locally and how it’s different from other places. We can simply graft a broader national narrative onto Rhode Island regardless of whether it makes any sense with our facts on the ground.

These seemingly conflicting points of view are often employed by the same actors.

It is probably not unique to Rhode Island, but that won’t stop me from calling it Rhode Island Disease.

April 16, 2015

An initial proposal has been made to the city of Providence and state of Rhode Island to keep the PawSox in Rhode Island and move them to a new stadium along the river in Providence.

The team is proposing that they privately finance all of the construction costs of the stadium while the land remains state (or city? I am not clear) owned. The state will lease the land underneath the stadium (the real value) with an option to buy for 30 years at $1 a year. The state will also pay $5,000,000 rent for the stadium itself annually for 30 years. The PawSox will then lease back the stadium at $1,000,000 per year. The net result will be the stadium is built and Rhode Island pays the PawSox owners $4,000,000 a year for 30 years.

The Good

Privately financing the upfront cost of the stadium puts risks of construction delays and cost overruns on the PawSox. Already they are underestimating the cost of moving a gas line below the park grounds. Whatever the cost of construction, whatever the impact on the team of a late opening, the costs to the state are fixed. There is essentially no risk in this plan for taxpayers, defining risk as a technical term for uncertainty. We know what this deal means: $120,000,000 over 30 years.

The interest rate is pretty low. Basically, although the risk is privatized, we should view this stadium as the PawSox providing the state of Rhode Island a loan of $85,000,000 which we will pay back at a rate of approximately 1.15% 1. Now just because the interest is low doesn’t mean we should buy…
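
If you want to check that number, a quick back-of-the-envelope calculation in R using the compound interest formula from the footnote:

# A = P * (1 + r/n)^(n * t) with n = 1; solve for the implied rate r
A <- 120e6  # total payments: $4,000,000 a year for 30 years
P <- 85e6   # construction cost fronted by the team
t <- 30
r <- (A / P)^(1 / t) - 1
round(100 * r, 2)  # about 1.16%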

The stadium design is largely attractive, even if the incorporated lighthouse is drawing ire. I don’t mind it, but I do like the idea of replacing it with an anchor, as some Greater City Providence commenters have recommended. Overall, I think the design fits with the neighborhood. It’s easy to get caught up in pretty renderings.

The pedestrian bridge remains and is accessible. As someone who lives in Downcity, I am very much looking forward to this dramatic improvement to my personal transit. I think the bridge’s importance for transit is underrated, although admittedly we could make Point Street Bridge friendlier to pedestrians and bike riders instead.

Brown University seems interested in hosting events, like football games, at the stadium. The plan also seems to give the state a lot of leeway in holding various events in the space when it’s not used for the baseball season. It could really be a great event space from mid-April until early November each year.

The Bad

Even the team’s own economic estimates only foresee $2,000,000 in increased tax revenues. Although they claim this estimate is conservative, I would take that with a huge grain of salt. You do not lead with a plan that straight up says the taxpayers will be out $60,000,000 over 30 years unless you don’t have a better foot to put forward. I am going to go ahead and assume this estimate is about right. It’s certainly in the ballpark. (Ugh.) But what that means is that Rhode Islanders should understand this is not an investment. This is not like building transit infrastructure or tax stabilization agreements to spur private construction. This deal is more akin to building schools. We do not, in fact cannot, expect that the economic impact makes this project a net positive for revenues. With $12,000,000 expected in direct spending, the project could be net positive for GDP, but even then it is obvious this is not the best annual investment to grow the economy. It is easy to come up with a laundry list of projects that cost less than this that could create more economic activity and/or more revenue to the state and city. Therefore, the project should be viewed primarily on use value. Will Rhode Islanders get $4,000,000 a year in value from the pleasure of using (and seeing) this stadium and its surrounding grounds? In school construction, we expect the benefits to be short term job creation, long term impacts on student health and well-being, ability to learn, and our ability to attract quality teachers. But most of those benefits are diffuse and hard to capture. Ultimately, we mostly support school construction because of the use benefits the kids and teachers see each year.

The time line is crazy. If they’re serious about a June decision, they’re nuts. We have a complicated budget process ongoing right now. We have a teacher contract in Providence to negotiate. We have a brand new I-195 Commission trying to make their mark and get cranes in the sky. There’s no way a negotiation in good faith can be completed in 60 days unless they agree to every counter. If this is essentially a “final best offer” due to the time line, then it is disingenuous.

What happens in 30 years? We don’t have any guarantees of being whole in 30 years, and the same threats and challenges posed by the PawSox today will come up again in 30 years. Are we committed to a series of handouts until the team is of no monetary or cultural value?

Other cities are likely going to come into play. The PawSox don’t have to negotiate a deal that’s fair for Rhode Island. They just have to negotiate to a deal that’s comparable to an offer they think someone else will make. Rhode Island’s position is weak, provided that anyone else is willing to make a deal.

The Strange

The PawSox are asking for a 30-year property tax exemption. There’s a lot to think through here. First, there are at least two parcels that were meant to be tax generating that are a part of this plan– the land Brown’s Continuing Education building currently sits on and the small develop-able parcel that was cut out from the park for a high value hotel or similar use. The stadium wants both of these parcels in addition to the park. I think City Council President Aponte is being a bit silly talking about being “made whole” over this deal, unless he’s talking about those two parcels. The park land was never going to generate city tax revenue and was actually going to cost the city money to maintain. Part of my openness to any proposal on this park land is my lack of confidence that the city will invest appropriately to maintain a world-class park space along the waterfront. There’s very little “whole” to be made.

It is also possible that Providence will have to designate additional park space if the stadium is built. If that’s true and it’s coming off the tax rolls, then the PawSox absolutely should have to pay property taxes, period. There’s one possible exception I’ll address below…

I also feel very strongly about having a single process for tax stabilization across all I-195 land that is not politically driven but instead a matter of administrative decision. Exceptions for a big project break the major benefit of a single tax stabilization agreement ruling all the I-195 land, which is our need to send a signal that all players are equal, all developers are welcome, and political cronyism is not the path required to build. While some of those $2,000,000 in tax benefits will accrue to Providence through increased surrounding land value, many costs associated with the stadium will as well. There are police details, road wear and tear, fire and emergency services, and more to consider.

My Counter

I don’t think this deal is dead, but I am not sure that the PawSox, city, or state would accept my counter. I have struggled with whether I should share what I want to happen versus what I think a deal that would happen looks like. I would be tempted to personally just let the PawSox walk. But if Rhode Island really wants them to stay, here’s a plausible counter:

  1. The PawSox receive the same tax stabilization agreement all other developers get from the city of Providence. Terms for a fair valuation of the property are agreed upon up front that are derived from some portion of an average of annual revenues.
  2. The lease terms should be constructed such that the net cost (excluding the anticipated increase in tax receipts) is equal to the tax dollars owed to the city of Providence. Therefore, the state essentially pays for the $85,000,000 of principal and the city taxes. This could be through a PILOT, but I’d prefer that amount go to the PawSox and the PawSox transfer the dollars to the city. It’s just accounting, but I prefer the symbol of them paying property taxes. I don’t think it’s a terrible precedent for the state to offer PILOT payments to cover a gap between the city TSA in I-195 with a developer’s ask, if the state sees there is substantial public interest in that investment, but still better to actually get developers used to writing a check to the city.
  3. If the city has to make additional green space equivalent to the park we are losing, I foresee two options. First is the PawSox paying full load on whatever that land value is. The second is probably better, but harder to make happen. Brown should give up the Brown Stadium land to the city. They can make it into a park without reducing the footprint of taxable property in the city. If they did this, Brown should essentially get free use of the stadium with no fees (except police detail or similar that they would pay for their games on the East Side) in perpetuity. They should get first rights after the PawSox games themselves.
  4. The stadium itself will be reverted to ownership by the Rhode Island Convention Center Authority if the option to buy the land is not exercised in 30 years. This way the whole stadium and its land are state owned, since the state paid for it. The possible exception would be if Brown has to give up its stadium to park land, in which case I might prefer some arrangement be made with them.
  5. The PawSox ownership agrees to pay a large penalty to the state and the city if they move the team out of Rhode Island in the next 99 years.
  6. PawSox maintenance staff will be responsible for maintaining the Riverwalk park, stadium grounds, and the green-way that has been proposed for the I-195 district. Possibly we could expand this to something like the Downcity Improvement District (or perhaps just have them pay to expand the DID into the Knowledge District). This will help ensure this creates more permanent jobs and reduces costs to the city for maintaining its public spaces that contribute to the broader attractiveness of the stadium.
  7. There should be a revenue share deal for any non-PawSox game events with the city and/or state for concession purchases and parking receipts.
  8. The stadium should not be exempt from future TIF assessments for infrastructure in the area.

I am not sure that I would pay even that much for the stadium, but this would be a far better deal overall. I can absolutely think of better ways to spend state dollars, but I also realize that the trade-off is not that simple. Rhode Island is not facing a windfall of $85,000,000 and trying to decide what to do with it. A stadium that keeps the PawSox in Rhode Island inspires emotion. The willingness to create these dollars for this purpose may be far higher than alternative uses. The correct counterfactual is not necessarily supporting 111 Westminster (a better plan for less). It is not necessarily better school buildings. It is not necessarily meaningful tax benefits for rooftop solar power. It is not lowering taxes, building a fund to provide seed capital to local startups, a streetcar, dedicated bus and/or bike lanes, or tax benefits to fill vacant properties and homes. The correct counterfactual could be nothing. It could be all of these things, but in much smaller measure. It is very hard to fully evaluate this proposal because we are not rational actors with a fixed budget line making marginal investment decisions. Ultimately, with big flashy projects like this, I lean toward evaluating them on their own merits. Typically, and I think this case is no exception, even evaluating a stadium plan on its own merits without considering alternative investments makes it clear these projects are bad deals. Yet cities and states make them over and over again. We would be wise to look at this gap in dollars and cents and our collective, repeated actions not as fits of insanity but instead as stark reminders of our inability to simply calculate the total benefits that all people receive.

In my day job, I get to speak to early stage investors. There I learned an important tidbit– a company can name whatever valuation they want if an investor can control the terms. That’s my feeling with the PawSox. The cash is important, it’s not nothing. But any potential plan should be judged by the terms.

Here’s hoping Rhode Island isn’t willing to accept bad terms at a high cost.


  1. $A = P(1 + \frac{r}{n})^{nt}$ where $A = \$120{,}000{,}000$, $P = \$85{,}000{,}000$, $n = 1$, and $t = 30$. I’ll leave you to the algebra. ↩︎

January 22, 2015

I keep this on my desktop.

Install:

brew install postgresql
initdb /usr/local/var/postgres -E utf8
gem install lunchy
### Start postgres with lunchy
mkdir -p ~/Library/LaunchAgents
cp /usr/local/Cellar/postgresql/9.3.3/homebrew.mxcl.postgresql.plist ~/Library/LaunchAgents/

Setup DB from SQL file:

### Setup DB
lunchy start postgres
createdb $DBNAME
psql -d $DBNAME -f '/path/to/file.sql'
lunchy stop postgres

Starting and Stopping PostgreSQL

lunchy start postgres
lunchy stop postgres

may run into trouble with local socket… try this:

rm /usr/local/var/postgres/postmaster.pid

Connecting with R

# make sure `lunchy start postgres` was run in the terminal first
require(dplyr)
db <- src_postgres(dbname = "DBNAME")  # use the same database name as $DBNAME above
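
If the connection works, tables can then be referenced lazily with dplyr (the table name here is just a placeholder):

my_table <- tbl(db, "some_table")
head(my_table)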

Inspired by seeing this post and thought I should toss out what I do.

January 10, 2015

Severing My Daemon

When I was in high school, I piggy-backed on a friend’s website to host a page for my band. We could post pictures, show locations and dates, share lyrics, and pretend like we produced music people cared about. It was mostly a fun way for me to play with the web and something to show folks when I said I played guitar and sang in a band. One day, my friend canceled his hosting. He wasn’t using his site for anything and he forgot that I had been using the site. I was 18, I never thought about backups, and I had long deleted all those pesky photos taking up space on my memory cards and small local hard drive.

Four years of photos from some of the best experiences of my life are gone. No one had copies. Everyone was using the site. In the decade since, no set of pictures has ever been as valuable as the ones I lost that day.

Who controls the past…

As you can imagine, this loss has had a profound effect on how I think about both my data and the permanence of the internet. Today, I have a deep system of backups for any digital data I produce, and I am far more likely to err on keeping data than discarding it. Things still sometimes go missing. 1

Perhaps the more lasting impact is my desire to maintain some control over all of my data. I use Fastmail for my email, even after over 10 years of GMail use. 2 I like knowing that I am storing some of my most important data in a standard way that easily syncs locally and backs up. I like that I pay directly for such an important service so that all of the incentive for my email provider is around making email work better for me. I am the customer. I use Bittorrent Sync for a good chunk of my data. I want redundancy across multiple machines and syncing, but I don’t want all of my work and all of my data to depend on being on a third party server like it is with Dropbox. 3. I also use a Transporter so that some of my files are stored on a local hard drive.

Raison D’être

Why does this blog exist? I have played with Tumblr in the past and I like its social and discovery tools, but I do not like the idea of pouring my thoughts into someone else’s service with no guarantee of easy or clean exit. I tried using Wordpress on a self-hosted blog for a while, but I took one look at the way my blog posts were being stored in the Wordpress database and kind of freaked out. All those convenient plugins and short codes were transforming the way my actual text was stored in hard-to-recover ways. Plus, I didn’t really understand how my data was stored well enough to be comfortable I had solid back ups. I don’t want to lose my writing like I lost those pictures.

This blog exists, built on Pelican, because I needed a place to write my thoughts in plain text that was as easy to back up as it was to share with the world. I don’t write often, and I feel I rarely write the “best” of my thoughts, but if I am going to take the time to put something out in the world I want to be damn sure that I control it.

Bag End

I recently began a journey that I thought was about simplifying tools. I began using vim a lot more for text editing, including writing prose like this post. But I quickly found that my grasping for new ways to do work was less about simplifying and more about better control. I want to be able to work well, with little interruption, on just about any computer. I don’t want to use anything that’s overly expensive or available only on one platform if I can avoid it. I want to strip away dependencies as much as possible. And while much of what I already use is free software, I didn’t feel like I was in control.

For example, git has been an amazing change for how I do all my work since about 2011. Github is a major part of my daily work and has saved me a bunch of money by allowing me to host this site for free. But I started getting frustrated with limitations of not having an actual server and not really having access to the power and control that a real server provides. So I recently moved this site off of Github and on to a Digital Ocean droplet. This is my first experiment with running a Linux VPS. Despite using desktop Linux for four years full time, I have never administered a server. It feels like a skill I should have and I really like the control.

Quentin’s Land

This whole blog is about having a place I control where I can write things. I am getting better at the control part, but I really need to work on the writing things part.

Here’s what I hope to do in the next few months. I am going to choose (or write) a new theme for the site that’s responsive and has a bit more detail. I am probably going to write a little bit about the cool, simple things I learned about nginx and how moving to my own server is helping me run this page (and other experiments) with a lot more flexibility. I am also going to try to shift some of my writing from tweetstorms to short blog posts. If I am truly trying to control my writing, I need to do a better job of thinking out loud in this space versus treating those thoughts as disposable and packing them onto Twitter. I will also be sharing more code snippets and ideas and fewer thoughts on policy and local (Rhode Island) politics. The code/statistics/data stuff feels easier to write and has always gotten more views and comments.

That’s the plan for 2015. Time to execute.


  1. I recently found some rare music missing and had to retrieve it through heroic efforts that included Archive.org and (successfully) stalking someone from an online forum that no longer exists. ↩︎

  2. I was a very early adopter of Gmail. ↩︎

  3. I still use Dropbox. I’m not an animal. But I like having an alternative. ↩︎

November 27, 2014

A few thoughts:

  1. This is a very interesting way to take advantage of a number of existing Amazon technologies–primarily their payment processing and review system.
  2. Services are an increasingly important part of the economy and are less subject to commoditization. This is Amazon dipping into a massive growth area by commoditizing discovery and payment. It also offloads some of the risk from both sides of the transaction. It’s very bold, possibly brilliant.
  3. If you have tried to find a reliable carpenter, electrician, plumber, house cleaning service, etc lately, it should be obvious the value that Amazon can provide. Even as a subscriber to Angie’s List, which has been invaluable, finding reliable, affordable, and quality services is still a frustrating experience.
  4. This is why technology companies get huge valuations. It is hard to anticipate just how the technologies built to become the first online bookseller will lead to a massive number of accounts with credit cards and a strongly trusted brand. It is hard to anticipate how book reviews and powerful search and filtering become the way you find people to come into your home and fix a toilet. But truly, it’s hard to anticipate the limits of a company with massive reach into people’s wallets that scales.

It has been said a thousand times before, but I feel the need to say it again. So much of what Star Wars got right was creating a fully realized, fascinating world. As much as the stunning visual effects that have largely stood the test of time were a part of that story, it was how Star Wars sounded that was most remarkable.

Watch that trailer. It has moments that look an awful lot like Star Wars– vast dunes in the middle of the desert, the Millennium Falcon speeding along, flipping at odd angles emphasizing its unique flat structure. But it also has a lot of elements that are decidedly modern and not Star Wars-like. 1 I think what’s most remarkable is I can close my eyes and just listen. Immediately I can hear Star Wars. The sounds of Star Wars are not just iconic, they are deeply embedded in my psyche and imbued with profound meaning.

The first time I had the opportunity to see Star Wars on the big screen was during the release of the “Special Editions”. There is nothing like hearing Star Wars in a theater.


  1. Shakey-cam is the primary culprit. ↩︎

November 18, 2014

Because of the primacy of equity as a goal in school finance system design, the formulas disproportionately benefit less wealthy districts and those with high concentrations of needier students. … because of the universal impact on communities, school finance legislation requires broad political buy-in.

I think it is worth contrasting the political realities of constructing school finance law with the need and justification for state funding of education in the first place.

The state is in the business of funding schools for redistributive purposes. If that weren’t required, there would be little reason not to trade away an inefficient pass-through of sales and income tax dollars to communities; lower (or no) state sales and income taxes could be replaced with local sales, income, and property taxes and fees. We come together as states to solve problems that extend beyond parochial boundaries, and our political unions exist to tackle problems we’re not better off tackling alone.

There are limits to redistributive policy. Support for the needs of other communities might wane, leading to challenges to and reductions in the rights of children through new laws or legal battles, serious political consequences for supporters of redistribution, and decreases in economic activity (in education, property values). These are real pressures that need to be combated both by convincing voters and through policy success 1. There are also considerations around the ethics of “bailing out” communities that made costly mistakes, like constructing too many buildings or offering far too generous rights to staff in contracts that they cannot afford to maintain. We struggle as policy experts not to create opportunities for moral hazard as we push to support children who need our help today.

Policy experts and legal experts cannot dismiss the needs of children today, nor can they fail to face the limits of support for redistribution or the risks of incentivizing bad adult behavior.


  1. I don’t doubt that support for redistributive policy goes south when it appears that our efforts to combat poverty and provide equal opportunities appear to fail, over and over again, and in many cases may actually make things worse. ↩︎

November 12, 2014

There are some basic facts about the teacher labor market that are inconvenient for many folks working to improve education. I am going to go through a few premises that I think should be broadly accepted and several lemmas and contentions that I hope clarify my own view on education resources and human capital management.

Teaching in low performing schools is challenging.

If I am looking for a job, all else being equal, I will generally not choose the more challenging one.

Some may object to the idea that teachers would not accept a position that offers a greater opportunity to make a difference, for example, teaching at an inner city school, over one that was less likely to have an impact, like teaching in a posh, suburban neighborhood. It is certainly true that some teachers, if not most, place value on making a greater impact. However, the question is how great is that preference? How much less compensation (not just wage) would the median teacher be willing to take to work in a more challenging environment?

I contend that it is atypical for teachers to accept lower compensation for a more challenging job. I would further suggest that even if there were a sufficient number of teachers to staff all urban schools with those that would accept lower compensation for a position in those schools, the gap in compensation that they would accept is low.

There are large gaps in non-pecuniary compensation between high performing schools and low performing schools that are difficult to overcome.

Let us suppose that it is true that there are large parts of the teacher workforce that would accept lower compensation (wage and non-wage) to teach in urban schools. There are real benefits to taking on a role where the potential for impact is great.

However, we can consider this benefit as part of the hedonic wages supplied by a teaching role. Other forms of non-monetary compensation that teachers may experience include: a comfortable physical work environment with sufficient space, lighting, and climate control; sufficient supplies to teach effectively; support and acceptance from their students, their families, and the broader school communities; a safe work environment; job security; alignment to a strong, unified school culture; and strong self-efficacy.

Some of these features could be easily replicated in many low performing schools. It is possible to have better quality physical schools and sufficient funding for supplies. Other features can be replicated, but not nearly as easily. Low performing schools where students have complex challenges inside and outside of the classroom are not environments where everyone has a strong sense of self-efficacy. Even the initial sense that making a difference is within reach erodes for many after facing a challenging environment day after day, year after year. A safe environment and a strong school culture are well within reach, but hardly easy and hardly universal. These things should be universal. They require funding, leadership, and broadly successful organizations.

The key is not that all high performing schools always have these features and no low performing schools can or do have these features. What is important is that many of these features are less often found in low performing, particularly urban schools.

I contend that the typical gap in non-pecuniary compensation between high and low performing schools is large enough to wipe out any negative compensating wage differential that may exist due to a desire for greater impact.

The primary mechanism to get “more” education is increasing the quality or quantity of teaching.

Let us take the leap of suggesting that teaching is a key part of the production of education. If we want to improve educational equity and address the needs of low performing schools, we need some combination of more and higher quality teaching. This is a key driver of policies like extended learning time (more), smaller class sizes (more), professional development (better), and teacher evaluation and support systems (better). It is what is behind improving teacher preparation programs (better), alternative certification (better), and progressive support programs like RTI (more and better).

November 1, 2014

November marks the start of National Novel Writing Month (NaNoWriMo). The quick version is folks band together and support each other to write 50,000 words in November.

I would love to write a novel one day. I am not sure I could do it well, but I am pretty sure I could hit 50,000-80,000 words if I dedicated time to tell a story.

I don’t have a story to tell.

So this year, I have decided to not feel guilty about skipping out on another NaNoWriMo (always the reader, never the author), and instead I am modifying it to meet my needs. With no story to tell and no experience tackling a single project the size of a novel, I am going to tackle a smaller problem– this blog.

Instead of 50,000 words in 30 days, I am going to try and write 1000 words a day for the next four weeks. I will not hold myself to a topic. I will not even hold myself to non-fiction. I will not hold myself to a number of posts or the size of the posts I write. I will not even hold myself to a true daily count, instead reviewing where I stand at the end of each week.

I am hoping that the practice of simply writing will grease my knuckles and start the avalanche that leads to writing more. A small confession– I write two or three blog posts every week that never leave my drafts. I find myself unable to hit publish because the ideas tend to be far larger or far smaller than I anticipate when I set out to write and share my frustrations. I also get nervous, particularly when writing about things I do professionally, about not writing the perfect post that’s clear, heavily researched, and expresses my views definitively and completely. This month, I say goodbye to that anxiety and start simply hitting publish.

I will leave you with several warnings.

  1. Things might get topically wacky. I might suddenly become a food blogger, or write about more personal issues, or write a short story and suddenly whiplash to talking about programming, education policy, or the upcoming election. If high volume, random topics aren’t your thing, you should probably unsubscribe from my RSS feed and check back in a month.
  2. I might write terrible arguments that are poorly supported and don’t reflect my views. This month, I will not accept my most common excuses for not publishing, which boil down to fear that people will hold me to the views I express in my first-draft thinking. I am going to make mistakes this month in public and print the dialog I am having with myself. The voices I allow room to speak as I struggle with values, beliefs, and opinions may shock and offend. This month, this blog is my internal dialog. Please read it as a struggle, however definitive the tone.
  3. I am often disappointed that the only things I publish are smaller ideas written hastily with poor editing. Again, this month I embrace the reality that almost everything I write that ends up published is the result of 20 minutes of furious typing with no looking back, rather than trying to be a strong writer with a strong viewpoint and strong support.

I hope that by the end of this month I will have written at least a couple of pieces I feel proud of, and hopefully, I will have a little less fear of hitting publish in the future.

October 5, 2014

A terrible thing is happening this year. Women all across the internet are finding themselves the target of violence, simply for existing. Women are being harassed for talking about video games, women are being harassed for talking about the technology industry, women are being harassed for talking, women are being harassed.

A terrible thing is happening. Women are finding themselves the target of violence.

A terrible thing has always happened.


I remember being a 16 year old posting frequently on internet forums. One in particular focused on guitar equipment. I loved playing in a band, and I loved the technology of making guitar sounds. Many people on the forum were between 16 and 24, although it was frequented by quite a few “adults” in their 30s, 40s, and 50s. It was a wonderful opportunity to interact as an adult, with adults.

Every week members created a new thread where they posted hundreds of photos of women. Most of them were professional photographs taken at various night clubs as patrons entered. Some were magazine clippings or fashion modeling. I remember taking part, both in gazing and supplying the occasional photograph from the internet. We were far from the early days of the world wide web, this being around 2003, but this was also before social media matured and online identity was well understood by the general public.

This thread became controversial. A change from private to corporate ownership of this forum led to increased moderation, and the weekly post with photos of women was one of the targets.

I did not understand.

In the debates about the appropriateness of the content and its place within our online community, I took the side of those who wanted the post to remain alive. I was not its most ardent supporter, nor was I moved to some of the extremes in language and entitlement that typically surround these conversations. However, my views were clear and easy. These were public photographs, largely taken with permission (often for compensation). And, of course, none of the pictures were pornographic.

Appropriateness for me at 16 was defined by pornography. I did not understand.


My parents did not raise me to be misogynist. One of the most influential moments in my life came on a car ride to the dentist. I was also around 16 or 17. I think it was on my way to get my wisdom teeth removed. I had been dating the same girl for a while, and it was time for my father to give me the talk. All he said to me was, “Women deserve your respect.”

That was it.


We were in college, and my friends and I were all internet natives. We had used the web for over ten years. We grew up in AOL chatrooms and forums. The backwaters of the internet at this time shifted from Something Awful to 4Chan. This was the height of some of the most prolific and hilarious memes: lolcats, Xzibit, advice dogs (a favorite was bachelor frog, which seemed to understand our worst impulses expressed in only modest exaggeration).

There was also violence.

It was not uncommon to see names, phone numbers, and addresses that 4chan was supposed to harass because someone said so. Various subcultures seemed to be alternately mocked and harassed endlessly in the very place that had first embraced, supported, and connected people under the guise of radical anonymity. The most famous of the “Rules of the Internet” was Rule 34 – if you can think of it, there is porn of it – and its follow-up, Rule 35 – if you cannot find porn of it, you should make it. 4chan seemed determined to make this a reality. But really the most troublesome thing was the attitude toward women. Nothing was as unacceptable to 4chan as suggesting that women are anything but objects for the male gaze. In a place sometimes filled with radically liberal (if more left-libertarian than left-progressive) politics that would spawn groups like Anonymous, nothing brought out as much criticism as suggesting our culture has a problem with women.

My response was largely to fade from this part of the internet. I had only reached the point of being uncomfortable with this behavior. It would take more time for me to understand. It still felt like this was a problem of ignorant people.


I am rarely jealous of intelligence. I am rarely jealous of wealth. I am rarely jealous of experiences. What I am most often jealous of is what seems to me to be a preternatural maturity of others, particularly around issues of ethics and human rights.

Fully grappling with privilege is not something that happens over a moment, it is a sensitivity to be developed over a lifetime. We are confronted with media that builds and reinforces a culture that is fundamentally intolerant and conservative. There are countless microaggressions that are modeled everywhere for our acceptance as normal. It has taken me a decade of maturation, hard conversations, and self-examination to only begin to grow from fully complicit and participating in objectification of women to what I would now consider to be the most basic level of human decency.

The internet has gone from enabling my own aggression toward women to exposing me to a level of misogyny and violence that deeply disturbs and disgusts me, shattering any notion that my past offenses were harmless or victimless. The ugly underside of our culture is constantly on display, making it all the more obvious how what felt like isolated events on the “ok” side of the line were actually creating a space that supported and nurtured the worst compulsions of men.


I often think about my own journey when I see disgusting behavior on the internet. I wonder whether I am facing a deeply, ugly person or myself at 16. I try to parse the difference between naïvety, ignorance, and hate and to understand if they require a unique response.

Mostly, I struggle with what would happen if Jason Today spoke to Jason 16.

Jason 16 could not skip over a decade of growth simply for having met Jason Today. It took me conversations with various folks playing the role of Jason Today over and over again, year after year. I wish I believed there was another way to reach the Jason 16s out there. I wish I knew how to help them become preternaturally aware of their actions. All I know how to do is try to be compassionate to those who hate while firmly correcting, try to meet the heightened expectations I place on myself, try to apologize when I need to, and try to support those that seem more equipped to push the conversation forward.

Along this path, I never leapt to agreement so much as paused. Each time I heard a convincing point, I paused and considered. Growth came in a series of all too brief pauses.

Pauses are often private and quiet, their discoveries never on direct display.

If pauses are the best anyone can expect, then working to change our culture of violence toward women will rarely feel like much more than shouting at the void.

June 12, 2014

The Vergara v. California case has everyone in education talking. Key teacher tenure provisions in California are on the ropes, presumably because of the disparate impact on teacher, and therefore education, quality for students who are less fortunate.

I have fairly loosely held views about the practice of tenure itself and the hiring and firing of teachers. However, I have strongly held views that unions made a mistake with their efforts to move a lot of rules about the teaching labor market into state laws across the country. Deep rules and restrictions are better left to contracts, even from a union perspective. At worst, these things should be a part of regulation, which can be more easily adapted and waived.

That said, here are a collection of interesting thoughts on tenure post-Vergara:

John Merrow, reacting to Vergara:

Tenure and due process are essential, in my view, but excessive protectionism (70+ steps to remove a teacher?) alienates the general public and the majority of effective teachers, particularly young teachers who are still full of idealism and resent seeing their union spend so much money defending teachers who probably should have been counseled out of the profession years ago.

With the modal ‘years of experience’ of teachers dropping dramatically, from 15 years in 1987 to 1 or 2 years today, young teachers are a force to be reckoned with. If a significant number of them abandon the familiar NEA/AFT model, or if they develop and adopt a new form of teacher unionism, public education and the teaching profession will be forever changed.

San Jose Mercury News reporting on the state thwarting a locally negotiated change to tenure:

With little discussion, the board rejected the request, 7 to 2. The California Teachers Association, one of the most powerful lobbies in Sacramento, had opposed granting a two-year waiver from the state Education Code – even though one of the CTA’s locals had sought the exemption… …San Jose Teachers Association President Jennifer Thomas, whose union had tediously negotiated with the district an agreement to improve teacher evaluations and teaching quality, called the vote frustrating… San Jose Unified and the local teachers association sought flexibility to grant teachers tenure after one year or to keep a teacher on probation for three years.

The district argued that 18 months – the point in a teacher’s career at which districts must make a tenure decision – sometimes doesn’t allow time to fairly evaluate a candidate for what can be a lifetime job.

Now, Thomas said, when faced with uncertainty over tenure candidates, administrators will err on the side of releasing them, which then leaves a stain on their records.

Kevin Welner summarizing some of the legal implications of Vergara:

Although I can’t help but feel troubled by the attack on teachers and their hard-won rights, and although I think the court’s opinion is quite weak, legally as well as logically, my intent here is not to disagree with that decision. In fact, as I explain below, the decision gives real teeth to the state’s Constitution, and that could be a very good thing. It’s those teeth that I find fascinating, since an approach like that used by the Vergara judge could put California courts in a very different role —as a guarantor of educational equality—than we have thus far seen in the United States… …To see why this is important, consider an area of education policy that I have researched a great deal over the years: tracking (aka “ability grouping”). There are likely hundreds of thousands of children in California who are enrolled in low-track classes, where the expectations, curricula and instruction are all watered down. These children are denied equal educational opportunities; the research regarding the harms of these low-track classes is much stronger and deeper than the research about teachers Judge Treu found persuasive in the Vergara case. That is, plaintiffs’ attorneys would easily be able to show a “real and appreciable impact” on students’ fundamental right to equality of education. Further, the harm from enrollment in low-track classes falls disproportionately on lower-income students and students of color. (I’ll include some citations to tracking research from myself and others at the end of this post.)

Welner also repeats a common refrain from the education-left that tenure and insulating teachers from evaluations is critical for attracting quality people into the teaching profession. This is an argument that the general equilibrium impact on the broader labor market is both larger in magnitude and in the opposite direction of any assumed positive impacts from easier dismissal of poor performing teachers:

This more holistic view is important because the statutes are central to the larger system of teacher employment. That is, one would expect that a LIFO statute or a due process statute or tenure statute would shape who decides to become a teacher and to stay in the profession. These laws, in short, influence the nature of teaching as a profession. The judge here omits any discussion of the value of stability and experience in teaching that tenure laws, however imperfectly, were designed to promote in order to attract and retain good teachers. By declining to consider the complexity of the system, the judge has started to pave a path that looks more narrowly at defined, selected, and immediate impact—which could potentially be of great benefit to future education rights plaintiffs.

Adam Ozimek of Modeled Behavior:

I can certainly imagine it is possible in some school districts they will find it optimal to fire very few teachers. But why isn’t it enough for administrators to simply rarely fire people, and for districts to cultivate reputations as places of stable employment? One could argue that administrators can’t be trusted to actually do this, but such distrust of administrators brings back a fundamental problem with this model of public education: if your administrators are too incompetent to cultivate a reputation that is optimal for student outcomes then banning tenure is hardly the problem, and imposing tenure is hardly a solution. This is closely related to a point I made yesterday: are we supposed to believe administrators fire sub-optimally but hire optimally

His piece from today (and this one from yesterday) argues that Welner’s take could be applied to just about any profession, and furthermore, requires accepting a far deeper, more fundamental structural problem in education that should be unacceptable. If administrators would broadly act so foolishly as to decimate the market for quality teaching talent and be wholly unable to successfully staff their schools, we have far bigger problems. And, says Ozimek, there is no reason to believe that tenure is at all a response to this issue.

Dana Goldstein would likely take a more historical view on the usefulness of tenure against administrator abuse.

But, writing for The Atlantic, she focuses instead on tenure as a red herring:

The lesson here is that California’s tenure policies may be insensible, but they aren’t the only, or even the primary, driver of the teacher-quality gap between the state’s middle-class and low-income schools. The larger problem is that too few of the best teachers are willing to work long-term in the country’s most racially isolated and poorest neighborhoods. There are lots of reasons why, ranging from plain old racism and classism to the higher principal turnover that turns poor schools into chaotic workplaces that mature teachers avoid. The schools with the most poverty are also more likely to focus on standardized test prep, which teachers dislike. Plus, teachers tend to live in middle-class neighborhoods and may not want a long commute.

May 19, 2014

I have never found dictionaries or even a thesaurus particularly useful as part of the writing process. I like to blame this on my lack of creative, careful writing.

But just maybe, I have simply been using the wrong dictionaries. It is hard not to be seduced by the seeming superiority of Webster’s original style. A dictionary that is one-part explanatory and one-part exploratory provides a much richer experience of English as an enabler of ideas that transcend meager vocabulary.

May 12, 2014

I had never thought of a use for Brett Terpstra’s Marky the Markdownifier before listening to today’s Systematic. Why would I want to turn a webpage into Markdown?

When I heard that Marky has an API, I was inspired. Pinboard has a “description” field that allows up to 65,000 characters. I never know what to put in this box. Wouldn’t it be great to put the full content of the page in Markdown into this field?

I set out to write a quick Python script to:

  1. Grab recent Pinboard links.
  2. Check to see if the URLs still resolve.
  3. Send the link to Marky and collect a Markdown version of the content.
  4. Post an updated link to Pinboard with the Markdown in the description field.

If all went well, I would release this script on Github as Pindown, a great way to put Markdown page content into your Pinboard links.

The script below is far from well-constructed. I would have spent more time cleaning it up with things like better error handling and a more complete CLI to give more granular control over which links receive Markdown content.

Unfortunately, I found that Pinboard consistently returns a 414 error code because the URLs are too long. Why is this a problem? Pinboard, in an attempt to maintain compatibility with the del.icio.us API, uses only GET requests, whereas this kind of request would typically use a POST endpoint. As a result, I cannot send along a data payload.

So I’m sharing this just for folks who are interested in playing with Python, RESTful APIs, and Pinboard. I’m also posting it for my own posterity since a non-del.icio.us-compatible version 2 of the Pinboard API is coming.

import requests
import json
import yaml


def getDataSet(call):
  # Fetch recent posts from the Pinboard API.
  r = requests.get('https://api.pinboard.in/v1/posts/recent' + call)
  data_set = json.loads(r.content)
  return data_set

def checkURL(url=""):
  # Make sure the bookmarked URL still resolves.
  newurl = requests.get(url)
  if newurl.status_code==200:
    return newurl.url
  else:
    raise ValueError('URL did not resolve', newurl.status_code)

def markyCall(url=""):
  # Ask Marky the Markdownifier for a Markdown version of the page.
  r = requests.get('http://heckyesmarkdown.com/go/?u=' + url)
  return r.content

def process_site(call):
  data_set = getDataSet(call)
  processed_site = []
  errors = []
  for site in data_set['posts']:
    try:
      url = checkURL(site['href'])
    except ValueError:
      errors.append(site['href'])
      continue
    description = markyCall(url)
    site['extended'] = description
    processed_site.append(site)
  print errors
  return processed_site

def write_pinboard(site, auth_token):
  stem = 'https://api.pinboard.in/v1/posts/add?format=json&auth_token='
  payload = {}
  payload['url'] = site.get('href')
  payload['description'] = site.get('description', '')
  payload['extended'] = site.get('extended', '')
  payload['tags'] = site.get('tags', '')
  payload['shared'] = site.get('shared', 'no')
  payload['toread'] = site.get('toread', 'no')
  r = requests.get(stem + auth_token, params = payload)
  print(site['href'] + '\t\t' + str(r.status_code))

def main():
  settings = file('AUTH.yaml', 'r')
  identity = yaml.load(settings)
  auth_token = identity['user_name'] + ':' + identity['token']
  valid_sites = process_site('?format=json&auth_token=' + auth_token)
  for site in valid_sites:
    write_pinboard(site, auth_token)

if __name__ == '__main__':
  main()

April 1, 2014

I frequently work with private data. Sometimes, it lives on my personal machine rather than on a database server. Sometimes, even if it lives on a remote database server, it is better that I use locally cached data than query the database each time I want to do analysis on the data set. I have always dealt with this by creating encrypted disk images with secure passwords (stored in 1Password). This is a nice extra layer of protection for private data served on a laptop, and it adds little complication to my workflow. I just have to remember to mount and unmount the disk images.

However, it can be inconvenient from a project perspective to refer to data in a distant location like /Volumes/ClientData/Entity/facttable.csv. In most cases, I would prefer the data “reside” in data/ or cache/ “inside” of my project directory.

Luckily, there is a great way that allows me to point to data/facttable.csv in my R code without actually having facttable.csv reside there: symlinking.

A symlink is a symbolic link file that sits in the preferred location and references the file path to the actual file. This way, when I refer to data/facttable.csv the file system knows to direct all of that activity to the actual file in /Volumes/ClientData/Entity/facttable.csv.

From the command line, a symlink can be generated with a simple command:

ln -s target_path link_path

R offers a function that does the same thing:

file.symlink(target_path, link_path)

where target_path and link_path are both strings surrounded by quotation marks.

One of the first things I do when setting up a new analysis is add common data storage file extensions like .csv and .xls to my .gitignore file so that I do not mistakenly put any data in a remote repository. The second thing I do is set up symlinks to the mount location of the encrypted data.
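
Concretely, the setup amounts to a few lines of R; here is a rough sketch using the example paths from above (adjust the ignore patterns and the mount point to your own project):

# Keep common data file formats out of the repository
cat("*.csv\n*.xls\n", file = ".gitignore", append = TRUE)

# Point data/facttable.csv at the real file on the mounted, encrypted volume
dir.create("data", showWarnings = FALSE)
file.symlink("/Volumes/ClientData/Entity/facttable.csv", "data/facttable.csv")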

March 9, 2014

Education data often come in annual snapshots. Each year, students are able to identify anew, and while student identification numbers may stay the same, names, race, and gender can often change. Sometimes, even data that probably should not change, like a date of birth, is altered at some point. While I could spend all day talking about data collection processes and automated validation that should assist with maintaining clean data, most researchers face multiple characteristics per student, unsure of which one is accurate.

While it is true that identity is fluid, and sex/gender or race identifications are not inherently stable over time, it is often necessary to “choose” a single value for each student when presenting data. The Strategic Data Project does a great job of defining the business rules for these cases in its diagnostic toolkits.

If more than one [attribute value is] observed, report the modal [attribute value]. If multiple modes are observed, report the most recent [attribute value] recorded.

This is their rule for all attributes considered time-invariant for analysis purposes. I think it is a pretty good one.

Implementing this rule in R turned out to be more complex than it appeared, especially with performant code. In fact, it was this business rule that led me to learn how to use the data.table package.

First, I developed a small test set of data to help me make sure my code accurately reflected the expected results based on the business rule:

# Generate test data for modal_person_attribute().
modal_test <- data.frame(sasid = c('1000', '1001', '1000', '1000', '1005', 
                                   '1005', rep('1006', 4)),
                         race = c('Black', 'White', 'Black', 'Hispanic',
                                  'White', 'White', rep('Black', 2), 
                                  rep('Hispanic', 2)),
                         year = c(2006, 2006, 2007, 2008,
                                  2010, 2011, 2007, 2008,
                                  2010, 2011))

The test data generated by that code looks like this:

sasid race year
1000 Black 2006
1001 White 2006
1000 Black 2007
1000 Hispanic 2008
1005 White 2010
1005 White 2011
1006 Black 2007
1006 Black 2008
1006 Hispanic 2010
1006 Hispanic 2011

And the results should be:

sasid race
1000 Black
1001 White
1005 White
1006 Hispanic

My first attempts at solving this problem using data.table resulted in a pretty complex set of code.

# Calculate the modal attribute using data.table
modal_person_attribute_dt <- function(df, attribute){
  # df: rbind of all person tables from all years
  # attribute: vector name to calculate the modal value
  # Calculate the number of instances an attribute is associated with an id
  dt <- data.table(df, key='sasid')
  mode <- dt[, rle(as.character(.SD[[attribute]])), by=sasid]
  setnames(mode, c('sasid', 'counts', as.character(attribute)))
  setkeyv(mode, c('sasid', 'counts'))
  # Only include attributes with the maximum values. This is equivalent to the
  # mode with two records when there is a tie.
  mode <- mode[,subset(.SD, counts==max(counts)), by=sasid]
  mode[,counts:=NULL]
  setnames(mode, c('sasid', attribute))
  setkeyv(mode, c('sasid',attribute))
  # Produce the maximum year value associated with each ID-attribute 
  # pairing    
  setkeyv(dt, c('sasid',attribute))
  mode <- dt[,list(schoolyear=max(schoolyear)), by=c("sasid", attribute)][mode]
  setkeyv(mode, c('sasid', 'schoolyear'))
  # Select the last observation for each ID, which is equivalent to the highest
  # schoolyear value associated with the most frequent attribute.
  result <- mode[,lapply(.SD, tail, 1), by=sasid]
  # Remove the schoolyear to clean up the result
  result <- result[,schoolyear:=NULL]
  return(as.data.frame(result))
}

This approach seemed “natural” in data.table, although it took me a while to refine and debug since it was my first time using the package 1. Essentially, I use rle, a nifty function I used in the past for my Net-Stacked Likert code, to count the number of instances of an attribute each student had in their record. I then subset the data to only the max count value for each student and merge these values back to the original data set. Then I order the data by student id and year in order to select only the last observation per student.

I get a quick, accurate answer when I run the test data through this function. Unfortunately, when I ran the same code on approximately 57,000 unique student IDs and 211,000 total records, the results were less inspiring. My MacBook Air’s fans spun up to full speed and the timings were terrible:

> system.time(modal_person_attribute_dt(all_years, 'sex'))
 user  system elapsed 
 40.452   0.246  41.346 

Data cleaning tasks like this one are often only run a few times. Once I have the attributes I need for my analysis, I can save them to a new table in a database, CSV, or similar and never run it again. But ideally, I would like to be able to build a document presenting my data completely from the raw delivered data, including all cleaning steps, accurately. So while I may use a cached, clean data set for some of the more sophisticated analysis while I am building up a report, in the final stages I begin running the entire analysis process, including data cleaning, each time I produce the report.

With the release of dplyr, I wanted to reexamine this particular function because it is one of the slowest steps in my analysis. I thought with fresh eyes and a new way of expressing R code, I may be able to improve on the original function. Even if its performance ended up being fairly similar, I hoped the dplyr code would be easier to maintain since I frequently use dplyr and only turn to data.table in specific, sticky situations where performance matters.

In about a tenth the time it took to develop the original code, I came up with this new function:

modal_person_attribute <- function(x, sid, attribute, year){
  grouping <- lapply(list(sid, attribute), as.symbol)
  original <- x
  max_attributes <- x %.% 
                    regroup(grouping) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count))
  recent_max <- left_join(original, max_attributes) %.%
                regroup(list(grouping[[1]])) %.%
                filter(!is.na(count) & count == max(count))
  results <- recent_max %.% 
             regroup(list(grouping[[1]])) %.%
             filter(year == max(year))
  return(results[,c(sid, attribute)])
}

At least to my eyes, this code is far more expressive and elegant. First, I generate a data.frame with only the rows that have the most common attribute per student by grouping on student and attribute, counting the size of those groups, and filtering to the most common group per student. Then, I join the original data and remove any records without a count from the previous step, finding the maximum count per student ID. This recovers the year value for each of the students so that in the next step I can just choose the rows with the highest year.

There are a few funky things (note the use of regroup and grouping, which are related to dplyr’s poor handling of strings as arguments), but for the most part I have shorter, clearer code that closely resembles the plain-English stated business rule.
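
(For anyone reading this later with a much newer version of dplyr, where %.% and regroup are gone, the same rule can be sketched with the pipe and plain group_by. This version hard-codes the column names from the test data above rather than passing them as strings, so treat it as an illustration rather than a drop-in replacement.)

library(dplyr)

modal_race <- modal_test %>%
  # one row per student/race pair: how often and how recently it appears
  group_by(sasid, race) %>%
  summarize(count = n(), year = max(year)) %>%
  # keep the modal value(s) per student, then break ties by recency
  group_by(sasid) %>%
  filter(count == max(count)) %>%
  filter(year == max(year)) %>%
  ungroup() %>%
  select(sasid, race)

On the test data, this returns the same four rows as the expected results above.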

But was this code more performant? Imagine my glee when this happened:

> system.time(modal_person_attribute(all_years, sid='sasid', 
+             attribute='sex', year='schoolyear'))
Joining by: c("sasid", "sex")
   user  system elapsed 
  1.657   0.087   1.852 

That is a remarkable increase in performance!

Now, I realize that I may have cheated. My data.table code isn’t very good and could probably follow a pattern closer to what I did in dplyr. The results might be much closer in the hands of a more adept developer. But the take home message for me was that dplyr enabled me to write the more performant code naturally because of its expressiveness. Not only is my code faster and easier to understand, it is also simpler and took far less time to write.

It is not every day that a tool provides powerful expressiveness and yields greater performance.

Update

I have made some improvements to this function to simplify things. I will be maintaining this code in my PPSDCollegeReadiness repository.

modal_person_attribute <- function(x, sid, attribute, year){
  # Select only the important columns
  x <- x[,c(sid, attribute, year)]
  names(x) <- c('sid', 'attribute', 'year')
  # Clean up years
  if(TRUE %in% grepl('_', x$year)){
    x$year <- gsub(pattern='[0-9]{4}_([0-9]{4})', '\\1', x$year)
  }  
  # Calculate the count for each person-attribute combo and select max
  max_attributes <- x %.% 
                    group_by(sid, attribute) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count)) %.%
                    select(sid, attribute)
  # Find the max year for each person-attribute combo
  results <- max_attributes %.% 
             left_join(x) %.%
             group_by(sid) %.%
             filter(year == max(year)) %.%
             select(sid, attribute)
  names(results) <- c(sid, attribute)
  return(results)
}

  1. It was over a year ago that I first wrote this code. ↩︎