January 22, 2023

Delimited files are hell-- a comparison of methods to read bad files

Here’s a real live example of why I am stuck with R and stuck with data.table. In my work, I often receive various delimited files from customers. Mostly, these delimited files are created from Oracle or MS SQL and have all kinds of gnarly things going on. Without sharing too much, here’s a partial example of the last few fields of one line in one of those files:

1

|25-AUG-22|"SAN81803 EXPO® White Board CARE Dry Erase Surface Cleaner 8 oz Spray Bottle|54

Do you see the problem? I have a " character in a field, but the field itself is not quoted and the quote is not escaped.

Let’s compare how different systems handle this file. Before we do so, it’s important to know how many rows are in this data set:

1
2


wc -l my_file.txt
  239167 my_file.txt

DuckDB

The new hot thing, DuckDB is an in-memory database like sqlite, but optimized for analytics and reporting. It has support to automatically create a table from a CSV file. What happens when I try and do this?

1
2


D create table df as select * from read_csv_auto('my_file.txt');
Error: Invalid Input Error: Error in file "my_file.txt" on line 50614: quote should be followed by end of value, end of row or another quote. (DELIMITER='|' (auto detected), QUOTE='"' (auto detected), ESCAPE='' (auto detected), HEADER=1 (auto detected), SAMPLE_SIZE=20480, IGNORE_ERRORS=0, ALL_VARCHAR=0)

Yup, you’re right DuckDB. That’s absolutely the line with the problem. Guess I’m stuck.

Python/Pandas

How about Python and pandas, somehow the king of modern data science and data frames.

1
2


>>> import pandas as pd
>>> df = pd.read_csv('my_file.txt', sep = '|', low_memory=False)

Hey, so far so good! This file get read without any messages, warnings, or errors.

1
2
3


>>> df
...
[238701 rows x 22 columns]

Uh oh. The data has only 238,701 rows, which is quite a bit less than 239,167 (well, 239,166 since this file does have a header row). This may not be a problem, because it’s possible that new lines exist in a text delimited file that is not a new record (if properly quoted). At least now that I have the data loaded, I can check for the sum of a column called “Amount”, because this is financial data. We can compare this to other methods later in addition to the row count so we can be sure the full data set was ready by pandas.

1
2


>>> sum(df.Amount)
196848446.45999622

R - readr

I am a full on tidyverse apologist. So of course I’m going to reach for readr::read_delim to get this file loaded.

1
2


> library(readr)
> df <- read_delim('my_file.txt', delim = '|', show_col_types = FALSE))

Awesome. Like pandas, readr had no messages, warnings, or errors. Let’s see how many rows there are and the sum of that amount column.

1
2
3
4


> df |> nrow()
[1] 238609
> sum(df$Amount)
[1] 196828725

Uh oh again. It seems that readr::read_delim also doesn’t reach 239,166 lines, but instead has only 238,609 lines. That’s almost 100 less than pandas, and the sum is off by almost $20,000. I don’t know at this stage if pandas is right, but it seems pretty likely that readr just silently gave up on some lines that it shouldn’t have.

R - data.table

Let’s try the package data.table which has a function fread to read delimited files.

1
2
3
4


> df <- fread("my_file.txt")
Warning message:
In fread("my_file.txt") :
  Found and resolved improper quoting out-of-sample. First healed line 50614: <<25-AUG-22|"SAN81803 EXPO® White Board CARE Dry Erase Surface Cleaner 8 oz Spray Bottle|54>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

That’s interesting! When using fread, I get a warning that points out the very line I mentioned above. It even prints the line (I removed most of it) and says that it resolved the issue. It also recommends that I might want to try and specify quote = "" if, like on my file, fields are not quoted. We’ll come back to that.

1
2


> nrow(df)
[1] 239166

Well, would you look at that? Exactly the amount of lines I’d expect if there is one header row and no new lines. Let’s check that amount column.

1
2


> sum(df$Amount)
[1] 196926161

That’s almost $80,000 more than pandas. That’s a lot of money to have missing.

Just for fun, instead of expecting fread to figure everything out about my file on its own, what if I follow its suggestion and tell fread that there is no quote character and no fields are quoted?

1
2
3
4
5


> df <- fread('my_file.txt',  quote = "")
> nrow(df)
[1] 239166
> sum(df$Amount)
[1] 196926161

By giving fread just a little information about the file, I get no warning about resolved lines, it just reads the file correctly with the same results.

So now, because of fread, I have some idea of what the problem was. Almost everything that reads a delimited file expects fields to be quoted, at least optionally, at least some of the time. They hate a quote at the start of a field that is actually in the raw data (understandably). Maybe if I tell these other tools that the quote character is nothing they’ll work better.

PostgreSQL

Let’s try PostgreSQL using COPY. Note, I’m not including the create table statement to avoid revealing more about the data.

1
2
3
4
5
6


jason=# \copy df from 'my_file.txt' CSV HEADER DELIMITER '|';
COPY 130426
jason=# select sum(amount) from df;
     sum
-------------
 97411927.64

Well, that’s not right. I should have nearly twice the number of lines. And sure enough, I’ve got only half the dollars. No warnings, no errors, no messages. Nothing. Silent failure.

Can I fix it?

Now that I know the issue is the quote character, let’s see if I can fix all the methods that failed to load this file.

readr

1
2
3
4
5


> df <- read_delim('my_file.txt', delim = '|', show_col_types = FALSE, quote = '')
> nrow(df)
[1] 239166
> sum(df$Amount)
[1] 196926161

pandas

1
2
3
4
5


>>> df = pd.read_csv('my_file.txt', sep = '|', low_memory=False, quoting=3)
>>> df
[239166 rows x 22 columns]
>>> sum(df.Amount)
196926160.57999647

duckdb

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


D create table df as select * from read_csv_auto('my_file.txt', quote='');
D select count(*) from df;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       239166 │
└──────────────┘
D select sum(amount) from df;
┌───────────────────┐
│    sum(amount)    │
│      double       │
├───────────────────┤
│ 196926160.5800201 │
└───────────────────┘

postgresql

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


jason=# \copy df from 'my_file.txt' CSV HEADER DELIMITER '|' QUOTE '';
ERROR:  COPY quote must be a single one-byte character
jason=# \copy df from 'my_file.txt' DELIMITER '|';
ERROR:  invalid input syntax for type integer: "col1"
CONTEXT:  COPY df, line 1, column col1: "col1"
jason=# \copy df from program 'tail -n +2 my_file.txt' DELIMITER '|';
COPY 239166
jason=# select sum(amount) from df;
     sum
--------------
 196926160.58
(1 row)

Would you look at that? Everyone does just fine when you tell them there is no quote character or quoted fields. ¹

All of these methods are essentially instant on my computer. Performance is not the issue. What’s incredible is that data.table::fread can identify file issues and resolve them. In this case, it turns out that data.table::fread was also able to describe the problem well enough that I could fix every other method of reading the file successfully. I will say, going back and reading the error for duckdb may have given me some hints, but `pandas, readr, and PostgreSQL completely failed to even notify me something was wrong. In an automated pipeline, I would have no indication that hundreds of rows, or in PostgreSQL’s case hundreds of thousands of rows, were just gone.

I was able to fix pandas, readr, duckdb, and PostgreSQL, but I have run into many scenarios where this is not the case. For example, what if I had this same row in a file that sometimes did quote certain fields. The fread function can handle this no problem, resolving the issue, warning me, and moving on. Every other method just wouldn’t work.

I don’t control this data. I don’t control producing it. It comes in on a scheduled basis, often hourly, and I need to accurately process it.

Silent failures are terrible.

The only way out is to embrace open, documented, binary data formats. But until all major RDBMS (and probably Excel) have native export and import, flat files will continue to be the de facto standard. In the meantime, it would be nice if PostgreSQL’s COPY, bcp, and spool could at least try and do things like quote fields and escape characters by default when targeting delimited files.

Some additional testing

I was asked to check vroom and arrow in R as well as polars via Explorer in Elixir. The results were, not good.

vroom

1
2
3
4
5


library(vroom)
> df <- vroom('my_file.txt', delim = '|')
Rows: 238609 Columns: 22
> sum(df$Amount)
[1] 196828725

arrow

1
2
3
4
5


> df <- arrow::read_delim_arrow('df.txt', delim = '|')
> sum(df$Amount)
[1] 196848446
> nrow(df)
[1] 238701

explorer / polars

1
2
3
4
5
6
7
8


iex(1)> Mix.install([
...(1)>   {:explorer, "~> 0.5.0"}
...(1)> ])
iex(2)> df = Explorer.DataFrame.from_csv(filename = "my_file.txt",
...(2)> delimiter: "|", infer_schema_length: nil)
{:error,
 {:polars,
  "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor  adding `OTHER` to the `null_values` list."}}

I wanted to show the pain of doing this in PostgreSQL. Only CSVs skip the first row with the HEADER option. But QUOTE can’t be set to blank/none/null. And using the default TEXT format, PostgreSQL can’t deal with the header row. So instead I had to use the PROGRAM option, which lets me run a bash script as the input and skip the first row, which then succeeds. ↩︎

April 26, 2022

Shortcuts -- who can use this thing?

10:01PM

Maybe Shortcuts is great, but every single time I think, “This is a great use for Shortcuts!” I absolutely bang my head up against it for hours before failing. I write code every day, but I can’t make Shortcuts work.

I want to run an R script every week and email the results to someone. I can do this entirely within R with something like the gmailr package, but I thought I’d see what Shortcuts can do. There’s a really great “Run Script” Shortcuts action. Awesome– I setup what I want to do to run perfectly with Rscript path_to_my_R_file.R. It generates a file in the same working directory and pumps the fully specified file path to stdout. Now I want to grab that file path and use it to attach that file to an email and send it.

But… you can’t supply a text based file path (or at least I can’t) to any of the Finder actions and successfully find the file. And I can’t really figure out how to get Mail to see that file and add it as an attachment for the Send Mail action.

I didn’t even get up to the part where I try and say, “Run this Shortcut every Monday at 8am” and already I’ve given up. Back to cron/launchd and scripting my email in R.

June 23, 2021

rJava on arm64 for R 4.1

9:42PM

In case you need this, if you install Java via homebrew, e.g. brew install openjdk, you’ll need to add this to your ~/.zshrc

1

export JAVA_HOME="/opt/homebrew/opt/openjdk/libexec/openjdk.jdk/Contents/Home"

and then run the typical sudo R CMD javareconf to get hooked back up.

May 31, 2021

Revisiting old code-- the power of across and tidyselect

9:17PM

Tonight I was revisiting an R script for cleaning up data that I wrote a couple of years ago. This script still runs nightly to process some data dropped on an SFTP server for some work we do at Allovue.

As I was reading through the file, I noticed that a lot of my code was repetitive. With some new powers I didn’t have back then, namely dplyr::across and many other tidyselect features, I wrote the following:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


clean <- map(raw,
             ~mutate(.x,
                     across(any_of(supplements), as.character),
                     across(where(is.character), na_if, ""),
                     across(where(is.character), na_if, "NULL"),
                     across(any_of('po_number'), na_if, "0"),
                     across(where(is.character), remove_non_ascii),
                     across(where(is.character), 
                            str_trim, side = "both"),
                     across(where(is.character),
                            str_replace,
                            pattern = "\\r\\n",
                            replacement = "\\n"),
                     across(where(is.character) & any_of('date'),
                            as.Date, format =  "%m/%d/%Y")))

My data is already stored in a list (raw) because I read the data using map with a custom function (extract_data), which knows how to process a yaml file and load data from various sources, like databases or flat files of various types. This allows me to automate executing the same script with different configuration files to collect data from many different sources. I mostly leave the data in a list so that I can use walk to export all the data once clean as text files with consistent formatting with a single call.

Until now, I never took advantage of having a list of data.frames for the T part of ETL. Without being able to use across with predicates based on data type or with tidyselect functions like any_of, it was harder to safely do mutations that were not attached to the underlying data structure. The code above can execute on any list of data.frames as long as a character vector named supplements exists. By using any_of, if those columns don’t exist in my data, they are ignored. And by using where on type definitions (largely characters in this case), I don’t have to worry about names of columns in the data I want to remove Windows-style line endings from– just take care of it on any character columns.

So while it’s not fancy, I’m tickled by removing four lines that mutated dates in 4 of 7 data.frames where that was relevant, and how many different places I added na_if before. Or how about the number of times I had to cast a specific value (like check_number) to character from some other format (usually integer because they’re quasi-numeric or logical because they are entirely missing from one data set) so that I can do a join or bind_rows. This is a classic case of time, experience, and an ever-improving tidyverse API reducing code complexity while making my code more generic and reusable.

September 13, 2020

Simpler Way to Fill Missing Data?

6:04PM

Hey #rstats, so I do a pattern like this all the time— essentially fill in missing values from some other data source. I feel like there must be some *_join or other incantation that does this cleaner… is there or should I just write my own utility function?

1
2
3
4


transactions %>%
  left_join(po_vendors, by = "po") %>%
  mutate(vendor = coalesce(vendor.x, vendor.y)) %>%
  select(-matches("\\..$"))

January 24, 2018

Stack of Glue

7:00PM

I have some text, but I want the content of that text to be dynamic based on data. This is a case for string interpolation. Lots of languages have the ability to write something like

1
2


pet = "dog"
puts "This is my {#pet}"

1
2


pet = "dog"
print(f"This is my {pet}")

There have been ways to do this in R, but I’ve mostly hated them until glue came along. Using glue in R should look really familiar now:

1
2


pet <- "dog"
glue("This is my {pet}")

Awesome! Now I have a way to make text bend to my bidding using data. But this is pretty simple, and we could have just used something like paste("This is my", pet) and been done with it.

Let me provide a little motivation in the form of data.frames, glue_data, and some purrr.

Pretend we have a field in a database called notes. I want to set the notes for each entity to follow the same pattern, but use other data to fill in the blanks. Like maybe something like this:

1

notes <- "This item price is valid through {end_date} and will then increase {price_change} to {new_price}."

This is a terrible contrived example, but we can imagine displaying this note to someone with different content for each item. Now in most scenarios, the right thing to do for an application is to produce this content dynamically based on what’s in the database, but let’s pretend no one looked far enough ahead to store this data or like notes can serve lots of different purposes using different data. So there is no place for the application to find end_date, price_change, or new_price in its database. Instead, this was something prepared by sales in Excel yesterday and they want these notes added to all items to warn their customers.

Here’s how to take a table that has item_id, end_date, price_change, and new_price as columns and turn it into a table with item_id, and notes as columns, with your properly formatted note for each item to be updated in a database.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


library(glue)
library(purrr)

item_notes <- data.frame(
  item_id = seq_len(10),
  end_date = c(rep(as.Date('2018-03-01', format = '%Y-%m-%d'), 5),
               rep(as.Date('2018-03-05', format = '%Y-%m-%d'), 3),
               rep(as.Date('2018-03-09', format = '%Y-%m-%d'), 2)),
  price_change = sample(x = seq_len(5),replace = TRUE,size = 10),
  new_price = sample(x = 10:20,replace = TRUE,size = 10)
)

template <- "This item price is valid through {end_date} and will then increase {price_change} to {new_price}."

map_chr(split(item_notes, item_notes$item_id), 
    glue_data, 
    template) %>% 
stack() %>% 
rename(item_id = ind,
       notes = values)

What’s going on here? First, I want to apply my glue technique to rows of a data.frame, so I split the data into a list using item_id as the identifier. That’s because at the end of all this I want to preserve that id to match back up in a database. ¹ The function glue_data works like glue, but it accepts things that are “listish” as it’s first argument (like data.frames and named lists). So with handy map over my newly created list of “listish” data, I create a named list with the text I wanted to generate. I then use a base R function that’s new to me stack, which will take a list and make each element a row in a data.frame with ind as the name of the list element and values as the value.

Now I’ve got a nice data.frame, ready to be joined with any table that has item_id so it can have the attached note!

You can split on row.names if you don’t have a similar identifer and just want to go from data.frame to a list of your rows. ↩︎

January 2, 2018

Naming Manual Colors with ggplot2

7:00PM

I have been using ggplot2 for 7 years I think. In all that time, I’ve been frustrated that I can never figure out what order to put my color values in for scale_*_manual. Not only is the order mapping seemingly random to me, I know that sometimes if I change something about how I’m treating the data, the order switches up.

Countless hours could have been saved if I knew that this one, in hindsight, obvious thing was possible.

Whenever using scale_*_manual, you can directly reference a color using a character vector and then name your value in the scale_ call like so:

1
2
3


geom_blah(aes(color = 'good')) +
geom_blah(aes(color = 'bad')) +
scale_blah_manual(values = c(good = 'green', bad = 'red'))

Obviously this is a toy example, but holy game changer.

May 28, 2017

Functions as Arguments in R

7:00PM

Sometimes, silly small things about code I write just delight me. There are lots of ways to time things in R. ¹ Tools like microbenchmark are great for profiling code, but what I do all the time is log how long database queries that are scheduled to run each night are taking.

It is really easy to use calls to Sys.time and difftime when working interactively, but I didn’t want to pepper all of my code with the same log statements all over the place. So instead, I wrote a function.

Almost all of timing is straightforward to even a novice R user. I record what time it is using Sys.time, do a little formatting work to make things look the way I want for reading logs, and pass in an optional message.

The form of timing was easy for me to sketch out: ²

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


timing <- function(STUFF, msg = '') {
  start_time <- format(Sys.time(), '%a %b %d %X %Y')
  start_msg <- paste('Starting', msg,
                     'at:', start_time, '\n')
  cat(start_msg)
  # Call my function here
  end_time <- format(Sys.time(), '%a %b %d %X %Y')
  end_msg <- paste('Completed', 'at:', end_time, '\n')
  cat(end_msg)
  elapsed <- difftime(as.POSIXlt(end_time, format = '%a %b %d %X %Y'),
                      as.POSIXlt(start_time, format = '%a %b %d %X %Y'))
  cat('Elapsed Time: ', format(unclass(elapsed), digits = getOption('digits')),
      ' ', attr(elapsed, 'units'), '\n\n\n', sep = '')
  result
}

The thing I needed to learn when I wrote timing a few years back was how to fill in STUFF and # Call my function here.

Did you know that you can pass a function as an argument in another function in R? I had been using *apply with its FUN argument all over the place, but never really thought about it until I wrote timing. Of course in R you can pass a function name, and I even know how to pass arguments to that function– just like apply, just declare a function with the magical ... and pass that along to the fucntion being passed in.

So from there, it was clear to see how I’d want my function declartion to look. It would definitely have the form function(f, ..., msg = ''), where f was some function and ... were the arguments for that function. What I didn’t know was how to properly call that function. Normally, I’d write something like mean(...), but I don’t know what f is in this case!

As it turns out, the first thing I tried worked, much to my surprise. R actually makes this super easy– you can just write f(...), and f will be replaced with whatever the argument is to f! This just tickles me. It’s stupid elegant to my eyes.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


timing <- function(f, ..., msg = '') {
  start_time <- format(Sys.time(), '%a %b %d %X %Y')
  start_msg <- paste('Starting', msg,
                     'at:', start_time, '\n')
  cat(start_msg)
  x <- f(...)
  end_time <- format(Sys.time(), '%a %b %d %X %Y')
  end_msg <- paste('Completed', 'at:', end_time, '\n')
  cat(end_msg)
  elapsed <- difftime(as.POSIXlt(end_time, format = '%a %b %d %X %Y'),
                      as.POSIXlt(start_time, format = '%a %b %d %X %Y'))
  cat('Elapsed Time: ', format(unclass(elapsed), digits = getOption('digits')),
      ' ', attr(elapsed, 'units'), '\n\n\n', sep = '')
  x
}

Now I can monitor the run time of any function by wrapping it in timing. For example:

1

timing(read.csv, 'my_big_file.csv', header = TRUE, stringsAsFactors = FALSE)`

And here’s an example of the output from a job that ran this morning:

1
2
3


Starting queries/accounts.sql at: Mon May 29 06:24:12 AM 2017
Completed at: Mon May 29 06:24:41 AM 2017
Elapsed Time: 29 secs

tictoc is new to me, but I’m glad it is. I would have probably never written the code in this post if it existed, and then I would be sad and this blog post wouldn’t exist. ↩︎
Yes, I realize that having the calls to paste and cat after setting start_time technically add those calls to the stack of stuff being timed and both of those things could occur after function execution. For my purposes, the timing does not have to be nearly that precise and the timing of those functions will contribute virtually nothing. So I opted for what I think is the clearer style of code as well as ensuring that live monitoring would inform me of what’s currently running. ↩︎

May 17, 2017

Making Sense of dplyr 0.6

7:00PM

Non-standard evaluation is one of R’s best features, and also one of it’s most perplexing. Recently I have been making good use of wrapr::let to allow me to write reusable functions without a lot of assumptions about my data. For example, let’s say I always want to group_by schools when adding up dollars spent, but that sometimes my data calls what is conceptually a school schools, school, location, cost_center, Loc.Name, etc. What I have been doing is storing a set of parameters in a list that mapped the actual names in my data to consistent names I want to use in my code. Sometimes that comes from using params in an Rmd file. So the top of my file may say something like:

1
2
3
4


params:
    school: "locations"
    amount: "dollars"
    enrollment: n

In my code, I may want to write a chain like

1
2
3
4
5


create_per_pupil <- . %>%
                    group_by(school) %>%
                    summarize(per_pupil = sum(amount) / n)
pp <- district_data %>%
      create_per_pupil

Only my problem is that school isn’t always school. In this toy case, you could use group_by_(params$school), but it’s pretty easy to run into limitations with the _ functions in dplyr when writing functions.

Using wrapr::let, I can easily use the code above:

1
2
3
4
5
6
7
8


let(alias = params, {
    create_per_pupil <- . %>%
                        group_by(school) %>%
                        summarize(per_pupil = sum(amount)/n)
})

pp <- district_data %>%
      create_per_pupil

The core of wrapr::let is really scary.

1
2
3
4
5
6
7
8
9


body <- strexpr
for (ni in names(alias)) {
    value <- as.character(alias[[ni]])
    if (ni != value) {
        pattern <- paste0("\\b", ni, "\\b")
        body <- gsub(pattern, value, body)
    }
}
parse(text = body)

Basically let is holidng onto the code block contained within it, iterating over the list of key-value pairs that are provided, and then runs a gsub on word boundaries to replace all instances of the list names with their values. Yikes.

This works, I use it all over, but I have never felt confident about it.

The New World of tidyeval

The release of dplyr 0.6 along with tidyeval brings wtih it a ton of features to making programming over dplyr functions far better supported. I am going to read this page by Hadley Wickham at least 100 times. There are all kinds of new goodies (!!! looks amazing).

So how would I re-write the chain above sans let?

1
2
3


create_per_pupil <- . %>%
                    group_by(!!sym(school)) %>%
                    summarize(per_pupil = sum(amount)/n)

If I understand tidyeval, then this is what’s going on.

sym evaluates school and makes the result a symbol
and !! says, roughly “evaluate that symbol now”.

This way with params$school having the value "school_name", sym(school) creates evaulates that to "school_name" and then makes it an unquoted symbol school_name. Then !! tells R “You can evaluate this next thing in place as it is.”

I originally wrote this post trying to understand enquo, but I never got it to work right and it makes no sense to me yet. What’s great is that rlang::sym and rlang::syms with !! and !!! respectively work really well so far. There is definitely less flexibility– with the full on quosure stuff you can have very complex evaluations. But I’m mostly worried about having very generic names for my data so sym and syms seems to work great.

April 2, 2017

Foreign Keys and assertr

4:19PM

I have been fascinated with assertive programming in R since 2015 ¹. Tony Fischetti wrote a great blog post to announce assertr 2.0’s release on CRAN that really clarified the package’s design.

UseRs often do crazy things that no sane developer in another language would do. Today I decided to build a way to check foreign key constraints in R to help me learn the assertr package.

What do you mean, foreign key constraints?

Well, in many ways this is an extension of my last post on using purrr::reduce. I have a set of data with codes (like FIPS codes, or user ids, etc) and I want to make sure that all of those codes are “real” codes (as in I have a defintion for that value). So I may have a FIPS code data.frame with fips_code and name as the columns or a user data.frame with columns id, fname, lname, email.

In a database, I might have a foreign key constraint on my table that just has codes so that I could not create a row that uses an id or code value or whatever that did not exist in my lookup table. Of course in R, our data is disconnected and non-relational. New users may exist in my dataset that weren’t there the last time I downloaded the users table, for example.

Ok, so these are just collections of enumerated values

Yup! That’s right! In some ways like R’s beloved factors, I want to have problems when my data contains values that don’t have a corresponding row in another data.frame, just like trying to insert a value into a factor that isn’t an existing level.

assertr anticipates just this, with the in_set helper. This way I can assert that my data is in a defined set of values or get an error.

1
2
3
4
5
6


my_df <- data.frame(x = c(0,1,1,2))
assert(my_df, in_set(0,1), x)
# Column 'x' violates assertion 'in_set(0, 1)' 1 time
#   index value
# 1     4     2
# Error: assertr stopped execution

Please Don’t stop()

By default, assert raises an error with an incredibly helpful message. It tells you which column the assertion was on, what the assertion was, how many times that assertion failed, and then returns the column index and value of the failed cases.

Even better, assert has an argument for error_fun, which, combined with some built in functions, can allow for all kinds of fun behavior when an assertion fails. What if, for example, I actually want to collect that error message for later and not have a hard stop if an assertion failed?

By using error_append, assert will return the original data.frame when there’s a failure with a special attribute called assertr_errors that can be accessed later with all the information about failed assertions.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


my_df %<>%
  assert(in_set(0,1), x, error_fun = error_append) %>%
  verify(x == 1, error_fun = error_append)
my_df
#   x
# 1 0
# 2 1
# 3 1
# 4 2
attr(my_df, 'assertr_errors')
# [[1]]
# Column 'x' violates assertion 'in_set(0, 1)' 1 time
#   index value
# 1     4     2
# 
# [[2]]
# verification [x == 1] failed! (2 failures)

(Ok I cheated there folks. I used verify, a new function from assertr and a bunch of magrittr pipes like %<>%)

Enough with the toy examples

Ok, so here’s the code I wrote today. This started as a huge mess I ended up turning into two functions. First is_valid_fk provides a straight forward way to get TRUE or FALSE on whether or not all of your codes/ids exist in a lookup data.frame.

1
2
3
4
5
6
7
8


is_valid_fk <- function(data, key, values,
                        error_fun = error_logical,
                        success_fun = success_logical){

  assert_(data, in_set(values), key,
          error_fun = error_fun, success_fun = success_fun)

}

The first argument data is your data.frame, the second argument key is the foreign key column in data, and values are all valide values for key. Defaulting the error_fun and success_fun to *_logical means a single boolean is the expected response.

But I don’t really want to do these one column at a time. I want to check if all of the foreign keys in a table are good to go. I also don’t want a boolean, I want to get back all the errors in a useable format. So I wrote all_valid_fk.

Let’s take it one bit at a time.

1

all_valid_fk <- function(data, fk_list, id = 'code') {

data is the data.frame we’re checking foreign keys in.
fk_list is a list of data.frames. Each element is named for the key that it looks up; each data.frame contains the valid values for that key named…
id, the name of the column in each data.frame in the list fk_list that corresponds to the valid keys.

1

verify(data, do.call(has_all_names, as.list(names(fk_list))))

Right away, I want to know if my data has all the values my fk_list says it should. I have to do some do.call magic because has_all_names wants something like has_all_names('this', 'that', 'the_other') not has_all_names(c('this', 'that', 'the_other').

The next part is where the magic happens.

1
2
3
4
5
6
7
8


accumulated_errors <- map(names(fk_list),
                            ~ is_valid_fk(data,
                                          key = .x,
                                          values = fk_list[[.x]][[id]],
                                          error_fun = error_append,
                                          success_fun = success_continue)) %>%
                        map(attr, 'assertr_errors') %>%
                        reduce(append)

Using map, I am able to call is_valid_fk on each of the columns in data that have a corresponding lookup table in fk_list. The valid values are fk_list[[.x]][[id]], where .x is the name of the data.frame in fk_list (which corresponds to the name of the code we’re looking up in data and exists for sure, thanks to that verify call) and id is the name of the key in that data.frame as stated earlier. I’ve replaced error_fun and success_fun so that the code does not exist map as soon there are any problems. Instead, the data is returned for each assertion with the error attribute if one exists. ² Immediately, map is called on the resulting list of data.frames to collect the assertr_errors, which are reduced using append into a flattened list.

If there are no errors accumulated, accumulated_errors is NULL, and the function exits early.

1

if(is.null(accumulated_errors)) return(list())

I could have stopped here and returned all the messages in accumulated_errors. But I don’t like all that text, I want something neater to work with later. The structure I decided on was a list of data.frames, with each element named for the column with the failed foreign key assertion and the contents being the index and value that failed the constraint.

By calling str on data.frames returned by assertion, I was able to see that the index and value tables printed in the failed assert messages are contained in error_df. So next I extract each of those data.frames into a single list.

1
2
3


reporter <- accumulated_errors %>%
            map('error_df') %>%
            map(~ map_df(.x, as.character)) # because factors suck

I’m almost done. I have no way of identifying which column created each of those error_df in reporter. So to name each element based on the column that failed the foreign key contraint, I have to extract data from the message attribute. Here’s what I came up with.

1
2
3
4


names(reporter) <- accumulated_errors %>%
                   map_chr('message') %>%
                   gsub("^Column \'([a-zA-Z]+)\' .*$", '\\1', x = .)
reporter

So let’s create some fake data and run all_valid_fk to see the results:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


> df <- data.frame(functions = c('1001','1002', '3001', '3002'),
                   objects = c('100','102', '103', '139'),
                   actuals = c(10000, 2431, 809, 50000),
                   stringsAsFactors = FALSE)

> chart <- list(functions = data.frame(code = c('1001', '1002', '3001'),
                                       name = c('Foo', 'Bar', 'Baz'),
                                       stringsAsFactors = FALSE),
                objects =   data.frame(code = c('100', '102', '103'),
                                       name = c('Mom', 'Dad', 'Baby'),
                                       stringsAsFactors = FALSE))
> all_valid_fk(data = df, fk_list = chart, id = 'code')
$functions
# A tibble: 1 × 2
  index value
  <chr> <chr>
1     4  3002

$objects
# A tibble: 1 × 2
  index value
  <chr> <chr>
1     4   139

Beautiful!

And here’s all_valid_fk in one big chunk.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


all_valid_fk <- function(data, fk_list, id = 'code') {
  verify(data, do.call(has_all_names, as.list(names(fk_list))))

  accumulated_errors <- map(names(fk_list),
                            ~ is_valid_fk(data,
                                          key = .x,
                                          values = fk_list[[.x]][[id]],
                                          error_fun = error_append,
                                          success_fun = success_continue)) %>%
                        map(attr, 'assertr_errors') %>%
                        reduce(append)

  if(is.null(accumulated_errors)) return(list())

  reporter <- accumulated_errors %>%
              map('error_df') %>%
              map(~ map_df(.x, as.character))

  names(reporter) <- accumulated_errors %>%
                     map_chr('message') %>%
                     gsub('Column \'(\\S*?)\'.*$', '\\1', x = .)
  reporter
}

My thanks to Jonathan Carroll who was kind enough to read this post closely and actually tried to run the code. As a result, I’ve fixed a couple of typos and now have an improved regex pattern above.

I appear to have forgotten to build link post types into my Hugo blog, so the missing link from that post is here. ↩︎
I am a little concerned about memory here. Eight assertions would mean, at least briefly, eight copies of the same data.frame copied here without the need for that actual data. There is probably a better way. ↩︎

March 2, 2017

Labeling Data with purrr

7:00PM

Here’s a fun common task. I have a data set that has a bunch of codes like:

Name	Abbr	Code
Alabama	AL	01
Alaska	AK	02
Arizona	AZ	04
Arkansas	AR	05
California	CA	06
Colorado	CO	08
Connecticut	CT	09
Delaware	DE	10

All of your data is labeled with the code value. In this case, you want to do a join so that you can use the actual names because it’s 2017 and we’re not animals.

But what if your data, like the accounting data we deal with at Allovue, has lots of code fields. You probably either have one table that contains all of the look ups in “long” format, where there is a column that represents which column in your data the code is for like this:

code	type	name
01	fips	Alabama
02	fips	Alaska

Alternatively, you may have a lookup table per data element (so one called fips, one called account, one called function, etc).

I bet most folks do the following in this scenario:

1
2
3
4
5
6
7


account <- left_join(account, account_lookup)
account <- left_join(account, fips)

## Maybe this instead ##
account %<>%
  left_join(account_lookup) %>%
  left_join(fips)

I want to encourage you to do this a little different using purrr. Here’s some annotated code that uses reduce_right to make magic.

1
2
3
4
5
6
7
8


# Load a directory of .csv files that has each of the lookup tables
lookups <- map(dir('data/lookups'), read.csv, stringsAsFactors = FALSE)
# Alternatively if you have a single lookup table with code_type as your
# data attribute you're looking up
# lookups <- split(lookups, code_type)
lookups$real_data <- read.csv('data/real_data.csv', 
                              stringsAsFactors = FALSE)
real_data <- reduce_right(lookups, left_join)

Boom, now you went from data with attributes like funds_code, function_code, state_code to data that also has funds_name, function_name, state_name¹. What’s great is that this same code can be reused no matter how many fields require a hookup. I’m oftent dealing with accounting data where the accounts are defined by a different number of data fields, but my code doesn’t care at all.

My recommendation is to use consistent naming conventions like _code and _name so that knowing how to do the lookups is really straightforward. This is not unlike the convention with Microsoft SQL where the primary key of a table is named Id and a foreign key to that table is named TableNameId. Anything that helps you figure out how to put things together without thinking is worth it. ↩︎

March 1, 2017

Contributing to vegalite

7:00PM

One of my goals for 2017 is to contribute more to the R open source community. At the beginning of last year, I spent a little time helping to refactor rio. It was one of the more rewarding things I did in all of 2016. It wasn’t a ton of work, and I feel like I gained a lot of confidence in writing R packages and using S3 methods. I wrote code that R users download and use thousands of times a month.

I have been on the lookout for a Javascript powered interactive charting library since ggvis was announced in 2014. But ggvis seems to have stalled out in favor of other projects (for now) and the evolution of rCharts into htmlwidgets left me feeling like there were far too many options and no clear choices.

What I was looking for was a plotting library to make clean, attractive graphics with tool tips that came with clear documentation and virtually no knowledge of Javascript required. Frankly, all of the htmlwidgets stuff was very intimidating. From my vantage point skimming blog posts and watching stuff come by on Twitter, htmlwidgets-based projects all felt very much directed at Javascript polyglots.

Vega and Vega-Lite had a lot of the qualities I sought in a plotting library. Reading and writing JSON is very accessible compared to learning Javascript, especially with R’s excellent translation from lists to JSON. And although I know almost no Javascript, I found in both Vega and Vega-Lite easy to understand documents that felt a lot like building grammar of graphics ¹ plots.

So I decided to take the plunge– there was a vegalite package and the examples didn’t look so bad. It was time to use my first htmlwidgets package.

Things went great. I had some simple data and I wanted to make a bar chart. I wrote:

1
2
3
4
5


vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar()

A bar chart was made! But then I wanted to use the font Lato, which is what we use at Allovue. No worries, Vega-Lite has a property called titleFont for axes. So I went to do:

1
2
3
4
5
6


vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar() %>%
axis_x(titleFont = 'Lato')

Bummer. It didn’t work. I almost stopped there, experiment over. But then I remembered my goal and I thought, maybe I need to learn to contribute to a package that is an htmlwidget and not simply use an htmlwidget-based package. I should at least look at the code.

What I found surprised me. Under the hood, all the R package does is build up lists. It makes so much sense– pass JSON to Javascript to process and do what’s needed.

So it turned out, vegalite for R was a bit behind the current version of vegalite and didn’t have the titleFont property yet. And with that, I made my first commit. All I had to do was update the function definition and add the new arguments to the axis data like so:

1

if (!is.null(titleFont))    vl$x$encoding[[chnl]]$axis$titleFont <- titleFont

But why stop there? I wanted to update all of vegalite to use the newest available arguments. Doing so looked like a huge pain though. The original package author made these great functions like axis_x and axis_y. They both had the same arguments, the only difference was the “channel” was preset as x or y based on which function was called. Problem was that all of the arguments, all of the assignments, and all of the documentation had to be copied twice. It was worse with encode and scale which had many, many functions that are similar or identical in their “signature”. No wonder the package was missing so many Vega-Lite features– they were a total pain to add.

So as a final step, I decided I would do a light refactor across the whole package. In each of the core functions, like encode and axis, I would write a single generic function like encode_vl() that would hold all of the possible arguments for the encoding portion of Vega-Lite. Then the specific functions like encode_x could become wrapper functions that internally call encode_vl like so:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


encode_x <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "x", ...)
  vl
}

encode_y <- function(vl, ...) {
  vl <- encode_vl(vl, chnl ="y", ...)
  vl
}

encode_color <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "color", ...)
  vl
}

Now, in order to update the documentation and the arguments for encoding, I just have to update the encode_vl function. It’s a really nice demonstration, in my opinion, of the power of R’s ... syntax. All of the wrapper functions can just pass whatever additional arguments the caller wants to encode_vl without having to explicitly list them each time.

This greatly reduced duplication in the code and made it far easier to update vegalite to the newest version of Vega-Lite, which I also decided to do.

Now Vega-Lite itself is embarking on a 2.0 release that I have a feeling will have some pretty big changes in store. I’m not sure if I’ll be the one to update vegalite– in the end, I think that Vega-Lite is too simple for the visualizations I need to do– but I am certain whoever does the update will have a much easier go of it now than they would have just over a month ago.

Thanks to Bob Rudis for making vegalite and giving me free range after a couple of commits to go hog-wild on his package!

The gg in ggplot2. ↩︎

December 28, 2016

A Gateway Drug to purrr

7:00PM

A lot of the data I work with uses numeric codes rather than text to describe features of each record. For example, financial data often has a fund code that represents the account’s source of dollars and an object code that signals what is bought (e.g. salaries, benefits, supplies). This is a little like the factor data type in R, which to the frustration of many modern analysts is internally an integer that mapped to a character label (which is a level) with a fixed number of possible values.

I am often looking at data stored like this:

fund_code	object_code	debit	credit
1000	2121	0	10000
1000	2122	1000	0

with the labels stored in another set of tables:

fund_code	fund_name
1000	General

and

object_code	object_name
2121	Social Security
2122	Life Insurance

Before purrr, I might have done a series of dplyr::left_join or merge to combine these data sets and get the labels in the same data.frame as my data.

But no longer!

Now, I can just create a list, add all the data to it, and use purrr:reduce to bring the data together. Incredibly convenient when up to 9 codes might exist for a single record!

1
2
3
4
5


# Assume each code-name pairing is in a CSV file in a directory
data_codes <- lapply(dir('codes/are/here/', full.names = TRUE ), 
                     readr::read_csv)
data_codes$transactions <- readr::read_csv('my_main_data_table.csv')
transactions <- purrr:reduce_right(data_codes, dplyr::left_join)

July 4, 2015

News on Assertions in R

7:00PM

How many times have you written R functions that start with a bunch of code that looks like this?

1
2
3
4
5
6


my_funct <- function(dob, enddate = "2015-07-05"){
if (!inherits(dob, "Date") | !inherits(enddate, "Date")){
    stop("Both dob and enddate must be Date class objects")
  } 
...
}

Because R was designed to be interactive, it is incredibly tolerant to bad user input. Functions are not type safe, meaning function arguments do not have to conform to specified data types. But most of my R code is not run interactively. I have to trust my code to run on servers on schedules or on demand as a part of production systems. So I find myself frequently writing code like the above– manually writing type checks for safety.

There has been some great action in the R community around assertive programming, as you can see in the link. My favorite development, by far, are type-safe functions in the ensurer package. The above function definition can now be written like this:

1
2
3


my_funct <- function_(dob ~ Date, enddate ~ Date: as.Date("2015-07-05"), {
  ...
})

All the type-checking is done.

I really like the reuse of the formula notation ~ and the use of : to indicate default values.

Along with packages like testthat, R is really growing up and modernizing.

January 22, 2015

A PostgreSQL Cheat Sheet for OSX and R

7:00PM

I keep this on my desktop.

Install:

1
2
3
4
5
6


brew install postgresql
initdb /usr/local/var/postgres -E utf8
gem install lunchy
### Start postgres with lunchy
mkdir -p ~/Library/LaunchAgents
cp /usr/local/Cellar/postgresql/9.3.3/homebrew.mxcl.postgresql.plist ~/Library/LaunchAgents/

Setup DB from SQL file:

1
2
3
4
5


### Setup DB
lunchy start postgres
created $DBNAME
psql -d $DBNAME -f '/path/to/file.sql'
lunchy stop postgres

Starting and Stopping PostgreSQL

1
2


lunchy start postgres
lunchy stop postgres

may run into trouble with local socket… try this:

1

rm /usr/local/var/postgres/postmaster.pid

Connecting with R

1
2
3


# make sure lunch start postgres in terminal first)
require(dplyr)
db <- src_postgres(dbname=$DBNAME)

Inspired by seeing this post and thought I should toss out what I do.

April 1, 2014

Symlinking Your Data

7:00PM

I frequently work with private data. Sometimes, it lives on my personal machine rather than on a database server. Sometimes, even if it lives on a remote database server, it is better that I use locally cached data than query the database each time I want to do analysis on the data set. I have always dealt with this by creating encrypted disk images with secure passwords (stored in 1Password). This is a nice extra layer of protection for private data served on a laptop, and it adds little complication to my workflow. I just have to remember to mount and unmount the disk images.

However, it can be inconvenient from a project perspective to refer to data in a distant location like /Volumes/ClientData/Entity/facttable.csv. In most cases, I would prefer the data “reside” in data/ or cache/ “inside” of my project directory.

Luckily, there is a great way that allows me to point to data/facttable.csv in my R code without actually having facttable.csv reside there: symlinking.

A symlink is a symbolic link file that sits in the preferred location and references the file path to the actual file. This way, when I refer to data/facttable.csv the file system knows to direct all of that activity to the actual file in /Volumes/ClientData/Entity/facttable.csv.

From the command line, a symlink can be generated with a simple command:

1

ln -s target_path link_path

R offers a function that does the same thing:

1

file.symlink(target_path, link_path)

where target_path and link_path are both strings surrounded by quotation marks.

One of the first things I do when setting up a new analysis is add common data storage file extensions like .csv and .xls to my .gitignore file so that I do not mistakenly put any data in a remote repository. The second thing I do is set up symlinks to the mount location of the encrypted data.

March 9, 2014

Expressiveness Counts

7:00PM

Education data often come in annual snapshots. Each year, students are able to identify anew, and while student identification numbers may stay the same, names, race, and gender can often change. Sometimes, even data that probably should not change, like a date of birth, is altered at some point. While I could spend all day talking about data collection processes and automated validation that should assist with maintaining clean data, most researchers face multiple characteristics per student, unsure of which one is accurate.

While it is true that identity is fluid, and sex/gender or race identifications are not inherently stable overtime, it is often necessary to “choose” a single value for each student when presenting data. The Strategic Data Project does a great job of defining the business rules for these cases in its diagnostic toolkits.

If more than one [attribute value is] observed, report the modal [attribute value]. If multiple modes are observed, report the most recent [attribute value] recorded.

This is their rule for all attributes considered time-invariant for analysis purposes. I think it is a pretty good one.

Implementing this rule turned out to be more complex than it appeared using R, especially with performant code. In fact, it was this business rule that led me to learn how to use the data.table package.

First, I developed a small test set of data to help me make sure my code accurately reflected the expected results based on the business rule:

1
2
3
4
5
6
7
8
9


# Generate test data for modal_attribute().
modal_test <- data.frame(sid = c('1000', '1001', '1000', '1000', '1005', 
                                 '1005', rep('1006',4)),
                         race = c('Black', 'White', 'Black', 'Hispanic',
                                  'White', 'White', rep('Black',2), 
                                  rep('Hispanic',2)),
                         year = c(2006, 2006, 2007, 2008,
                                  2010, 2011, 2007, 2008,
                                  2010, 2011))

The test data generated by that code looks like this:

sasid	race	year
1000	Black	2006
1001	White	2006
1000	Black	2007
1000	Hispanic	2008
1005	White	2010
1005	White	2011
1006	Black	2007
1006	Black	2008
1006	Hispanic	2010
1006	Hispanic	2011

And the results should be:

sasid	race
1000	Black
1001	White
1005	White
1006	Hispanic

My first attempts at solving this problem using data.table resulted in a pretty complex set of code.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


# Calculate the modal attribute using data.table
modal_person_attribute_dt <- function(df, attribute){
  # df: rbind of all person tables from all years
  # attribute: vector name to calculate the modal value
  # Calculate the number of instances an attributed is associated with an id
  dt <- data.table(df, key='sasid')
  mode <- dt[, rle(as.character(.SD[[attribute]])), by=sasid]
  setnames(mode, c('sasid', 'counts', as.character(attribute)))
  setkeyv(mode, c('sasid', 'counts'))
  # Only include attributes with the maximum values. This is equivalent to the
  # mode with two records when there is a tie.
  mode <- mode[,subset(.SD, counts==max(counts)), by=sasid]
  mode[,counts:=NULL]
  setnames(mode, c('sasid', attribute))
  setkeyv(mode, c('sasid',attribute))
  # Produce the maximum year value associated with each ID-attribute 
  # pairing    
  setkeyv(dt, c('sasid',attribute))
  mode <- dt[,list(schoolyear=max(schoolyear)), by=c("sasid", attribute)][mode]
  setkeyv(mode, c('sasid', 'schoolyear'))
  # Select the last observation for each ID, which is equivalent to the highest
  # schoolyear value associated with the most frequent attribute.
  result <- mode[,lapply(.SD, tail, 1), by=sasid]
  # Remove the schoolyear to clean up the result
  result <- result[,schoolyear:=NULL]
  return(as.data.frame(result))
}

This approached seemed “natural” in data.table, although it took me a while to refine and debug since it was my first time using the package ¹. Essentially, I use rle, a nifty function I used in the past for my Net-Stacked Likert code to count the number of instances of an attribute each student had in their record. I then subset the data to only the max count value for each student and merge these values back to the original data set. Then I order the data by student id and year in order to select only the last observation per student.

I get a quick, accurate answer when I run the test data through this function. Unfortunately, when I ran the same code on approximately 57,000 unique student IDs and 211,000 total records, the results were less inspiring. My Macbook Air’s fans spin up to full speed and timings are terrible:

1
2
3


> system.time(modal_person_attribute(all_years, 'sex'))
 user  system elapsed 
 40.452   0.246  41.346 

Data cleaning tasks like this one are often only run a few times. Once I have the attributes I need for my analysis, I can save them to a new table in a database, CSV, or similar and never run it again. But ideally, I would like to be able to build a document presenting my data completely from the raw delivered data, including all cleaning steps, accurately. So while I may use a cached, clean data set for some the more sophisticated analysis while I am building up a report, in the final stages I begin running the entire analyses process, including data cleaning, each time I produce the report.

With the release of dplyr, I wanted to reexamine this particular function because it is one of the slowest steps in my analysis. I thought with fresh eyes and a new way of expressing R code, I may be able to improve on the original function. Even if its performance ended up being fairly similar, I hoped the dplyr code would be easier to maintain since I frequently use dplyr and only turn to data.table in specific, sticky situations where performance matters.

In about a tenth the time it took to develop the original code, I came up with this new function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


modal_person_attribute <- function(x, sid, attribute, year){
  grouping <- lapply(list(sid, attribute), as.symbol)
  original <- x
  max_attributes <- x %.% 
                    regroup(grouping) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count))
  recent_max <- left_join(original, max_attributes) %.%
                regroup(list(grouping[[1]])) %.%
                filter(!is.na(count) & count == max(count))
  results <- recent_max %.% 
             regroup(list(grouping[[1]])) %.%
             filter(year == max(year))
  return(results[,c(sid, attribute)])
}

At least to my eyes, this code is far more expressive and elegant. First, I generate a data.frame with only the rows that have the most common attribute per student by grouping on student and attribute, counting the size of those groups, and filtering to most common group per student. Then, I do join on the original data and remove any records without a count from the previous step, finding the maximum count per student ID. This recovers the year value for each of the students so that in the next step I can just choose the rows with the highest year.

There are a few funky things (note the use of regroup and grouping, which are related to dplyr’s poor handling of strings as arguments), but for the most part I have shorter, clearer code that closely resembles the plain-English stated business rule.

But was this code more performant? Imagine my glee when this happened:

1
2
3
4
5


> system.time(modal_person_attribute_dplyr(all_years, sid='sasid', 
> attribute='sex', year='schoolyear'))
Joining by: c("sasid", "sex")
   user  system elapsed 
  1.657   0.087   1.852 

That is a remarkable increase in performance!

Now, I realize that I may have cheated. My data.table code isn’t very good and could probably follow a pattern closer to what I did in dplyr. The results might be much closer in the hands of a more adept developer. But the take home message for me was that dplyr enabled me to write the more performant code naturally because of its expressiveness. Not only is my code faster and easier to understand, it is also simpler and took far less time to write.

It is not every day that a tool provides powerful expressiveness and yields greater performance.

Update

I have made some improvements to this function to simplify things. I will be maintaining this code in my PPSDCollegeReadiness repository.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


modal_person_attribute <- function(x, sid, attribute, year){
  # Select only the important columns
  x <- x[,c(sid, attribute, year)]
  names(x) <- c('sid', 'attribute', 'year')
  # Clean up years
  if(TRUE %in% grepl('_', x$year)){
    x$year <- gsub(pattern='[0-9]{4}_([0-9]{4})', '\\1', x$year)
  }  
  # Calculate the count for each person-attribute combo and select max
  max_attributes <- x %.% 
                    group_by(sid, attribute) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count)) %.%
                    select(sid, attribute)
  # Find the max year for each person-attribute combo
  results <- max_attributes %.% 
             left_join(x) %.%
             group_by(sid) %.%
             filter(year == max(year)) %.%
             select(sid, attribute)
  names(results) <- c(sid, attribute)
  return(results)
}

It was over a year ago that I first wrote this code. ↩︎

February 17, 2014

Appreciating the Beauty of dplyr

7:00PM

Hadley Wickham has once again¹ made R ridiculously better. Not only is dplyr incredibly fast, but the new syntax allows for some really complex operations to be expressed in a ridiculously beautiful way.

Consider a data set, course, with a student identifier, sid, a course identifier, courseno, a quarter, quarter, and a grade on a scale of 0 to 4, gpa. What if I wanted to know the number of a courses a student has failed over the entire year, as defined by having an overall grade of less than a 1.0?

In dplyr:

1
2
3
4
5


course %.% 
group_by(sid, courseno) %.%
summarise(gpa = mean(gpa)) %.%
filter(gpa <= 1.0) %.%
summarise(fails = n())

I refuse to even sully this post with the way I would have solved this problem in the past.

Seriously, how many of the packages he has managed/written are indispensable to using R today? It is no exaggeration to say that the world would have many more Stata, SPSS, and SAS users if not for Hadleyverse. ↩︎

December 3, 2013

Calculating Age with Precision in R

7:00PM

Update

Turns out the original code below was pretty messed up. All kinds of little errors I didn’t catch. I’ve updated it below. There are a lot of options to refactor this further that I’m currently considering. Sometimes it is really hard to know just how flexible something this big really should be. I think I am going to wait until I start developing tests to see where I land. I have a feeling moving toward a more test-driven work flow is going to force me toward a different structure.

I recently updated the function I posted about back in June that calculates the difference between two dates in days, months, or years in R. It is still surprising to me that difftime can only return units from seconds up until weeks. I suspect this has to do with the challenge of properly defining a “month” or “year” as a unit of time, since these are variable.

While there was nothing wrong with the original function, it did irk me that it always returned an integer. In other words, function returned only complete months or years. If the start date was on 2012-12-13 and the end date was on 2013-12-03, the function would return 0 years. Most of the time, this is the behavior I expect when calcuating age. But it is completely reasonable to want to include partial years or months, e.g. in the aforementioned example returning 0.9724605.

So after several failed attempts because of silly errors in my algorithm, here is the final code. It will be released as part of eeptools 0.3 which should be avialable on CRAN soon ¹.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62


age_calc <- function(dob, enddate=Sys.Date(), units='months', precise=TRUE){
  if (!inherits(dob, "Date") | !inherits(enddate, "Date")){
    stop("Both dob and enddate must be Date class objects")
  }
  start <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)
  if(precise){
    start_is_leap <- ifelse(start$year %% 400 == 0, TRUE, 
                        ifelse(start$year %% 100 == 0, FALSE,
                               ifelse(start$year %% 4 == 0, TRUE, FALSE)))
    end_is_leap <- ifelse(end$year %% 400 == 0, TRUE, 
                        ifelse(end$year %% 100 == 0, FALSE,
                               ifelse(end$year %% 4 == 0, TRUE, FALSE)))
  }
  if(units=='days'){
    result <- difftime(end, start, units='days')
  }else if(units=='months'){
    months <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='months', SIMPLIFY=FALSE), 
                     length) - 1
    # length(seq(start, end, by='month')) - 1
    if(precise){
      month_length_end <- ifelse(end$mon==1, 28,
                                 ifelse(end$mon==1 & end_is_leap, 29,
                                        ifelse(end$mon %in% c(3, 5, 8, 10), 
                                               30, 31)))
      month_length_prior <- ifelse((end$mon-1)==1, 28,
                                     ifelse((end$mon-1)==1 & start_is_leap, 29,
                                            ifelse((end$mon-1) %in% c(3, 5, 8, 
                                                                      10), 
                                                   30, 31)))
      month_frac <- ifelse(end$mday > start$mday,
                           (end$mday-start$mday)/month_length_end,
                           ifelse(end$mday < start$mday, 
                            (month_length_prior - start$mday) / 
                                month_length_prior + 
                                end$mday/month_length_end, 0.0))
      result <- months + month_frac
    }else{
      result <- months
    }
  }else if(units=='years'){
    years <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='years', SIMPLIFY=FALSE), 
                     length) - 1
    if(precise){
      start_length <- ifelse(start_is_leap, 366, 365)
      end_length <- ifelse(end_is_leap, 366, 365)
      year_frac <- ifelse(start$yday < end$yday,
                          (end$yday - start$yday)/end_length,
                          ifelse(start$yday > end$yday, 
                                 (start_length-start$yday) / start_length +
                                end$yday / end_length, 0.0))
      result <- years + year_frac
    }else{
      result <- years
    }
  }else{
    stop("Unrecognized units. Please choose years, months, or days.")
  }
  return(result)
}

I should note that my mobility function will also be included in eeptools 0.3. I know I still owe a post on the actual code, but it is such a complex function I have been having a terrible time trying to write clearly about it. ↩︎

September 22, 2013

Using R to Calculate Student Moblity

7:00PM

In a couple of previous posts, I outlined the importance of documenting business rules for common education statistics and described my take on how to best calculate student mobility. In this post, I will be sharing two versions of R function I wrote to implement this mobility calculation, reviewing their different structure and methods to reveal how I achieved an order of magnitude speed up between the two versions. ¹ At the end of this post, I will propose several future routes for optimization that I believe should lead to the ability to handle millions of student records in seconds.

Version 0: Where Do I Begin?

The first thing I tend to do is whiteboard the rules I want to use through careful consideration and constant referal back to real data sets. By staying grounded in the data, I am less likely to encounter unexpected situations during my quality control. It also makes it much easier to develop test data, since I seek out outlier records in actual data during the business rule process.

Developing test data is a key part of the development process. Without a compact, but sufficiently complex, set of data to try with a newly developed function, there is no way to know whether or not it does what I intend.

Recall the business rules for mobility that I have proposed, all of which came out of this whiteboarding process:

Entering the data with an enroll date after the start of the year counts as one move.
Leaving the data with an exit date before the end of the year counts as one move.
Changing schools sometime during the year without a large gap in enrollment counts as one move.
Changing schools sometime during the year with a large gap in enrollment counts as two moves.
Adjacent enrollment records for the same student in the same school without a large gap in enrollment does not count as moving.

Test data needs to represent each of these situations so that I can confirm the function is properly implementing each rule.

Below is a copy of my test data. As an exercise, I recommend determining the number of “moves” each of these students should be credited with after applying the above stated business rules.

Unique Student ID	School Code	Enrollment Date	Exit Date
1000000	10101	2012-10-15	2012-11-15
1000000	10103	2012-01-03	2013-03-13
1000000	10103	2012-03-20	2013-05-13
1000001	10101	2012-09-01	2013-06-15
1000002	10102	2012-09-01	2013-01-23
1000003	10102	2012-09-15	2012-11-15
1000003	10102	2013-03-15	2013-06-15
1000004	10103	2013-03-15	NA

Version 1: A Naïve Implementation

Once I have developed business rules and a test data set, I like to quickly confirm that I can produce the desired results. That’s particularly true when it comes to implementing a new, fairly complex business rules. My initial implementation of a new algorithm does not need to be efficient, easily understood, or maintainable. My goal is simply to follow my initial hunch on how to accomplish a task and get it working. Sometimes this naïve implementation turns out to be pretty close to my final implementation, but sometimes it can be quite far off. The main things I tend to improve with additional work are extensibility, readability, and performance.

In the case of this mobility calculation, I knew almost immediately that my initial approach was not going to have good performance characteristics. Here is a step by step discussion of Version 1.

Function Declaration: Parameters

1
2
3
4
5
6
7
8


moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sid', 
                       schid='schid',
                       enroll_date='enroll_date',
                       exit_date='exit_date')){

I named my function moves_calc() to match the style of age_calc() which was submitted and accepted to the eeptools package. This new function has eight parameters.

df: a data.frame containing the required data to do the mobility calculation.

enrollby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the start of the school year. Students whose first enrollment is after this date will have an additional move under the assumption that they enrolled somewhere prior to the first enrollment record in the data. This does not (and likely should not) match the actual first day of the school year.

exitby: an atomic vector of type character or Date in the format YYYY-MM-DD. This parameter signifies the end of the school year. Students whose last exit is before this date will have an additional move under the assumption that they enrolled somewhere after this exit record that is excluded in the data. This date does not (and likely should not) match the actual last day of the school year.

gap: an atomic vector of type numeric that signifies how long a gap must exist between student records to record an additional move for that student under the assumption that they enrolled somewhere in between the two records in the data that is not recorded.

sid: an atomic vector of type character that represents the name of the vector in df that contains the unique student identifier. The default value is 'sid'.

schid: an atomic vector of type character that represents the name of the vector in df that contains the unique school identifier. The default value is schid.

enroll_date: an atomic vector of type character that represents the name of the vector in df that contains the enrollment date for each record. The default value is enroll_date.

exit_date: an atomic vector of type character that represents the name of the vector in df that contains the exit date for each record. The default value is exit_date.

Most of these parameters are about providing flexibility around the naming of attributes in the data set. Although I often write functions for my own work which accept data.frames, I can not help but to feel this is a bad practice. Assuming particular data attributes of the right name and type does not make for generalizable code. To make up for my shortcoming in this area, I have done my best to allow other users to enter whatever data column names they want, so long as they contain the right information to run the algorithm.

The next portion of the function loads some of the required packages and is common to many of my custom functions:

1
2
3
4
5
6
7
8
9


if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
require(data.table)

if("plyr" %in% rownames(installed.packages()) == FALSE){
    install.packages("plyr")
  } 
require(plyr)

Type Checking and Programmatic Defaults

Next, I do extensive type-checking to make sure that df is structured the way I expect it to be in order to run the algorithm. I do my best to supply humane warning() and stop() messages when things go wrong, and in some cases, set default values that may help the function run even if function is not called properly.

1
2


if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
    stop("Both enroll_date and exit_date must be Date objects")

The enroll_date and exit_date both have to be Date objects. I could have attempted to coerce those vectors into Date types using as.Date(), but I would rather not assume something like the date format. Since enroll_date and exit_date are the most critical attributes of each student, the function will stop() if they are the incorrect type, informing the analyst to clean up the data.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


if(missing(enrollby)){
   enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
     enrollby <- as.Date(paste(year(min(df$enroll_date, na.rm=TRUE)),
                               '-09-15', sep=''), format='%Y-%m-%d'
     warning(paste("enrollby must be a string with format %Y-%m-%d,",
                   "defaulting to", 
                   enrollby, sep=' '))
  }else{
    enrollby <- as.Date(enrollby, format="%Y-%m-%d")
  }
}
if(missing(exitby)){
  exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                          '-06-01', sep=''), format='%Y-%m-%d')
}else{
  if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
    exitby <- as.Date(paste(year(max(df$exit_date, na.rm=TRUE)),
                              '-06-01', sep=''), format='%Y-%m-%d')
    warning(paste("exitby must be a string with format %Y-%m-%d,",
                  "defaulting to", 
                  exitby, sep=' '))
  }else{
    exitby <- as.Date(exitby, format="%Y-%m-%d")
  }
}
if(!is.numeric(gap)){
  gap <- 14
  warning("gap was not a number, defaulting to 14 days")
}

For maximum flexibility, I have parameterized the enrollby, exitby, and gap used by the algorithm to determine student moves. An astute observer of the function declaration may have noticed I did not set default values for enrollby or exitby. This is because these dates are naturally going to be different which each year of data. As a result, I want to enforce their explicit declaration.

However, we all make mistakes. So when I check to see if enrollby or exitby are missing(), I do not stop the function if it returns TRUE. Instead, I set the value enrollby to September 15 in the year that matches the minimum (first) enrollment record and exitby to June 1 in the year that matches the maximum (last) exit record. I then pop off a warning() that informs the user the expected values for each parameter and what values I have defaulted them to. I chose to use warning() because many R users set their environment to halt at warnings(). They are generally not good and should be pursued and fixed. No one should depend upon the defaulting process I use in the function. But the defaults that can be determined programmatically are sensible enough that I did not feel the need to always halt the function in its place.

I also check to see if gap is, in fact, defined as a number. If not, I also throw a warning() after setting gap equal to default value of 14.

Is this all of the type and error-checking I could have included? Probably not, but I think this represents a very sensible set that make this function much more generalizable outside of my coding environment. This kind of checking may be overkill for a project that is worked on independently and with a single data set, but colleagues, including your future self, will likely be thankful for their inclusion if any of your code is to be reused.

Initializing the Results

1
2
3
4
5


output <- data.frame(id = as.character(unique(df[[sid]])),
                     moves = vector(mode = 'numeric', 
                                    length = length(unique(df[[sid]]))))
output <- data.table(output, key='id')
df <- arrange(df, sid, enroll_date)

My naïve implementation uses a lot of for loops, a no-no when it comes to R performance. One way to make for loops a lot worse, and this is true in any language, is to reassign a variable within the loop. This means that each iteration has the overhead of creating and assigning that object. Especially when we are building up results for each observation, it is silly to do this. We know exactly how big the data will be and therefore only need to create the object once. We can then assign a much smaller part of that object (in this case, one value in a vector) rather than the whole object (a honking data.table).

Our output object is what the function returns. It is a simple data.table containing all of the unique student identifiers and the number of moves recorded for each student.

The last line in this code chunk ensures that the data are arranged by the unique student identifier and enrollment date. This is key since the for loops assume that they are traversing a student’s record sequentially.

Business Rule 1: The Latecomer

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


for(i in 1:(length(df[[sid]])-1)){
  if(i>1 && df[sid][i,]!=df[sid][(i-1),]){
    if(df[['enroll_date']][i]>enrollby){
      output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }else if(i==1){
    if(df[['enroll_date']][i]>enrollby){
    output[as.character(df[[sid]][i]), moves:=moves+1L]
    }
  }

The first bit of logic checks if sid in row i is not equal to the sid in the i-1 row. In other words, is this the first time we are observing this student? If it is, then row i is the first observation for that student and therefore has the minimum enrollment date. The enroll_date is checked against enrollby. When enroll_date is after enrollby, then the moves attribute for that sid is incremented by 1. ²

Now, I didn’t really mention the conditional that i>1. This is needed because there is no i-1 observation for the very first row of the data.table. Therefore, i==1 is a special case where we once again perform the same check for enroll_date and enrollby. The i>1 condition is before the && operator, which ensures the statement after the && is not evaluated when the first conditional is FALSE. This avoids an “out of bounds”-type error where R tries to check df[0].

Business Rule 5: The Feint

Yeah, yeah– the business rule list above doesn’t match the order of my function. That’s ok. Remember, sometimes giving instructions to a computer does not follow the way you would organize instructions for humans.

Remember, the function is traversing through our data.frame one row at a time. First I checked to see if the function is at the first record for a particular student. Now I check to see if there are any records after the current record.

1
2
3
4
5
6


  if(df[sid][i,]==df[sid][(i+1),]){
    if(as.numeric(difftime(df[['enroll_date']][i+1], 
                           df[['exit_date']][i], units='days')) < gap &
       df[schid][(i+1),]==df[schid][i,]){
        next
    }else if ...

For the case where the i+1 record has the same sid, then the enroll_date of i+1 is subtracted from the exit_date of i and checked against gap. If it is both less than gap and the schid of i+1 is the same as i, then next, which basically breaks out of this conditional and moves on without altering moves. In other words, students who are in the same school with only a few days between the time they exited are not counting has having moved.

The ... above is not the special ... in R, rather, I’m continuing that line below.

Business Rule 3: The Smooth Mover

1
2
3
4
5


  }else if(as.numeric(difftime(df[['enroll_date']][i+1], 
                               df[['exit_date']][i], 
                               units='days')) < gap){
    output[as.character(df[[sid]][i]), moves:=moves+1L] 
  }else{ ...

Here we have the simple case where a student has moved to another school (recall, this is still within the if conditional where the next record is the same student as the current record) with a very short period of time between the exit_date at the current record and the enroll_date of the next record. This is considered a “seamless” move from one school to another, and therefore that student’s moves are incremented by 1.

Business Rule 4: The Long Hop

Our final scenario for a student moving between schools is when the gap between the exit_date at the i school and the enroll_date at the i+1 school is large, defined as > gap. In this scenario, the assumption is that the student moved to a jurisdiction outside of the data set, such as out of district for district-level data or out of state for state level data, and enrolled in at least one school not present in their enrollment record. The result is these students receive 2 moves– one out from the i school to a missing school and one in to the i+1 school from the missing school.

The code looks like this (again a repeat from the else{... above which was not using the ... character):

1
2
3
4


  }else{
    output[as.character(df[[sid]][i]), moves:=moves+2L] 
  }
}else...

This ends with a } which closes the if conditional that checked if the i+1 student was the same as the i student, leaving only one more business rule to check.

Business Rule 2: The Early Summer

1
2
3
4
5
6
7


}else{
  if(is.na(df[['exit_date']][i])){
    next
  }else if(df[['exit_date']][i] < exitby){
        output[as.character(df[[sid]][i]), moves:=moves+1L]
  }
}

Recall that this else block is only called if sid of the i+1 record is not the same as i. This means that this is the final entry for a particular student. First, I check to see if that student has a missing exit_date and if so charges no move to the student implementing the next statement to break out of this iteration of the loop. Students never have missing enroll_date for any of the data I have seen over 8 years. This is because most systems minimally autogenerate the enroll_date for the current date when a student first enters a student information system. However, sometimes districts forget to properly exit a student and are unable to supply an accurate exit_date. In a very small number of cases I have seen these missing dates. So I do not want the function to fail in this scenario. My solution here was simply to break out and move to the next iteration of the loop.

Finally, I apply the last rule, which compares the final exit_date for a student to exitby, incrementing moves if the student left prior to the end of the year and likely enrolled elsewhere before the summer.

The last step is to close the for loop and return our result:

1
2
3


  }
  return(output)
}

Version 2: 10x Speed And More Readable

The second version of this code is vastly quicker.

The opening portion of the code, including the error checking is essentially a repeat of before, as is the initialization of the output.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49


moves_calc <- function(df, 
                       enrollby,
                       exitby,
                       gap=14,
                       sid='sasid', 
                       schid='schno',
                       enroll_date='enroll_date',
                       exit_date='exit_date'){
  if("data.table" %in% rownames(installed.packages()) == FALSE){
    install.packages("data.table")
  } 
  require(data.table)
  if (!inherits(df[[enroll_date]], "Date") | !inherits(df[[exit_date]], "Date"))
      stop("Both enroll_date and exit_date must be Date objects")
  if(missing(enrollby)){
    enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                              '-09-15', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(enrollby, format="%Y-%m-%d"))){
      enrollby <- as.Date(paste(year(min(df[[enroll_date]], na.rm=TRUE)),
                                '-09-15', sep=''), format='%Y-%m-%d')
      warning(paste("enrollby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    enrollby, sep=' '))
    }else{
      enrollby <- as.Date(enrollby, format="%Y-%m-%d")
    }
  }
  if(missing(exitby)){
    exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                            '-06-01', sep=''), format='%Y-%m-%d')
  }else{
    if(is.na(as.Date(exitby, format="%Y-%m-%d"))){
      exitby <- as.Date(paste(year(max(df[[exit_date]], na.rm=TRUE)),
                                '-06-01', sep=''), format='%Y-%m-%d')
      warning(paste("exitby must be a string with format %Y-%m-%d,",
                    "defaulting to", 
                    exitby, sep=' '))
    }else{
      exitby <- as.Date(exitby, format="%Y-%m-%d")
    }
  }
  if(!is.numeric(gap)){
    gap <- 14
    warning("gap was not a number, defaulting to 14 days")
  }
  output <- data.frame(id = as.character(unique(df[[sid]])),
                       moves = vector(mode = 'numeric', 
                                      length = length(unique(df[[sid]]))))

Where things start to get interesting is in the calculation of the number of student moves.

Handling Missing Data

One of the clever bits of code I forgot about when I initially tried to refactor Version 1 appears under “Business Rule 2: The Early Summer”. When the exit_date is missing, this code simply breaks out of the loop:

1
2


  if(is.na(df[['exit_date']][i])){
    next

Because the new code will not be utilizing for loops or really any more of the basic control flow, I had to device a different way to treat missing data. The steps to apply the business rules that I present below will fail spectacularly with missing data.

So the first thing that I do is select the students who have missing data, assign the moves in the output to NA, and then subset the data to exclude these students.

1
2
3
4
5
6
7


incomplete <- df[!complete.cases(df[, c(enroll_date, exit_date)]), ]
if(dim(incomplete)[1]>0){
  output[which(output[['id']] %in% incomplete[[sid]]),][['moves']] <- NA
}
output <- data.table(output, key='id')
df <- df[complete.cases(df[, c(enroll_date, exit_date)]), ]
dt <- data.table(df, key=sid)

Woe with `data.table`

Now with the data complete and in a data.table, I have to do a little bit of work to assist with my frustrations with data.table. Because data.table does a lot of work with the [ operator, I find it very challenging to use a string argument to reference a column in the data. So I just gave up and internally rename these attributes.

1
2
3


dt$sasid <- as.factor(as.character(dt$sasid))
setnames(dt, names(dt)[which(names(dt) %in% enroll_date)], "enroll_date")
setnames(dt, names(dt)[which(names(dt) %in% exit_date)], "exit_date")

Magic with `data.table`: Business Rules 1 and 2 in two lines each

Despite by challenges with the way that data.table re-imagines [, it does allow for clear, simple syntax for complex processes. Gone are the for loops and conditional blocks. How does data.table allow me to quickly identified whether or not a students first or last enrollment are before or after my cutoffs?

1
2
3
4


first <- dt[, list(enroll_date=min(enroll_date)), by=sid]
output[id %in% first[enroll_date>enrollby][[sid]], moves:=moves+1L]
last <- dt[, list(exit_date=max(exit_date)), by=sid]  
output[id %in% last[exit_date<exitby][[sid]], moves:=moves+1L]

Line 1 creates a data.table with the student identifier and a new enroll_date column that is equal to the minimum enroll_date for that student.

The second line is very challenging to parse if you’ve never used data.table. The first argument for [ in data.table is a subset/select function. In this case,

1

id %in% first[enroll_date>enrollby][[sid]]

means,

Select the rows in first where the enroll_date attribute (which was previously assigned as the minimum enroll_date) is less than the global function argument enrollby and check if the id of output is in the sid vector.

So output is being subset to only include those records that meet that condition, in other words, the students who should have a move because they entered the school year late.

The second argument of [ for data.tables is explained in this footnote ² if you’re not familiar with it.

Recursion. Which is also known as recursion.

The logic for Business Rules 3-5 are substantially more complex. At first it was not plainly obvious how to avoid a slow for loop for this process. Each of the rules on switching schools requires an awareness of context– how does one record of a student compare to the very next record for that student?

The breakthrough was thinking back to my single semester of computer science and the concept of recursion. I created a new function inside of this function that can count how many moves are associated with a set of enrollment records, ignoring the considerations in Business Rules 1 and 2. Here’s my solution. I decided to include inline comments because I think it’s easier to understand that way.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43


school_switch <- function(dt, x=0){
  # This function accepts a data.table dt and initializes the output to 0.
    if(dim(dt)[1]<2){
    # When there is only one enrollment record, there are no school changes to
    # apply rules 3-5. Therefore, the function returns the value of x. If the
    # initial data.table contains a student with just one enrollment record, 
    # this function will return 0 since we initialize x as 0.
      return(x)
    }else{
      # More than one record, find the minimum exit_date which is the "first"
      # record
      exit <- min(dt[, exit_date])
      # Find out which school the "first" record was at.
      exit_school <- dt[exit_date==exit][[schid]]
      # Select which rows come after the "first" record and only keep them
      # in the data.table
      rows <- dt[, enroll_date] > exit
      dt <- dt[rows,]
      # Find the minimum enrollment date in the subsetted table. This is the
      # enrollment that follows the identified exit record
      enroll <- min(dt[, enroll_date])
      # Find the school associated with that enrollment date
      enroll_school <- dt[enroll_date==enroll][[schid]]
      # When the difference between the enrollment and exit dates are less than
      # the gap and the schools are the same, there are no moves. We assign y,
      # our count of moves to x, whatever the number of moves were in this call
      # of school_switch
      if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap &
         exit_school==enroll_school){
        y = x
      # When the difference in days is less than the gap (and the schools are
      # different), then our number of moves are incremented by 1.
      }else if(difftime(min(dt[, enroll_date], na.rm=TRUE), exit) < gap){
        y = x + 1L
      }else{
      # Whenever the dates are separated by more than the gap, regardless of which
      # school a student is enrolled in at either point, we increment by two.
        y = x + 2L
      }
      # Explained below outside of the code block.
      school_switch(dt, y)
    }
  }

The recursive aspect of this method is calling school_switch within school_switch once the function reaches its end. Because I subset out the row with the minimum exit_date, the data.table has one row processed with each iteration of school_switch. By passing the number of moves, y back into school_switch, I am “saving” my work from each iteration. Only when a single row remains for a particular student does the function return a value.

This function is called using data.table’s special .SD object, which accesses the subset of the full data.table when using the by argument.

1

dt[, moves:= school_switch(.SD), by=sid]

This calls school_switch after splitting the data.table by each sid and then stitches the work back together, split-apply-combine style, resulting in a data.table with a set of moves per student identifier. With a little bit of clean up, I can simply add these moves to those recorded earlier in output based on Business Rules 1 and 2.

1
2
3
4


  dt <- dt[,list(switches=unique(moves)), by=sid]
  output[dt, moves:=moves+switches]
  return(output)
}

Quick and Dirty `system.time`

On a mid-2012 Macbook Air, the current mobility calculation is very effective with tens of thousands of student records and practical for use in the low-hundreds of thousands of records range. ↩︎
I thought I was going to use data.table for some of its speedier features as I wrote this initial function. I didn’t in this go (though I do in Version 2). However, I do find the data.table syntax for assigning values to be really convenient, particularly the := operator which is common in several other languages. In data.table, the syntax dt[,name:=value] assigns value to an exist (or new) column called name. Because of the need select operator in data.table, I can just use dt[id,moves:=moves+1L] to select only the rows where the table key, in this case sid, matches id, and then increment moves. Nice. ↩︎ ↩︎

September 12, 2013

Documentation of Business Rules and Analysis

7:00PM

One of the most challenging aspects of being a data analyst is translating programmatic terms like “student mobility” into precise business rules. Almost any simple statistic involves a series of decisions that are often opaque to the ultimate users of that statistic.

Documentation of business rules is a critical aspect of a data analysts job that, in my experience, is often regrettably overlooked. If you have ever tried to reproduce someone else’s analysis, asked different people for the same statistic, or tried to compare data from multiple years, you have probably encountered difficulties getting a consistent answer on standard statistics, e.g. how many students were proficient in math, how many students graduated in four years, what proportion of students were chronically absent? All too often documentation of business rules is poor or non-existent. The result is that two analysts with the same data will produce inconsistent statistics. This is not because of something inherent in the quality of the data or an indictment of the analyst’s skills. In most cases, the undocumented business rules are essentially trivial, in that the results of any decision has a small impact on the final result and any of the decisions made by the analysts are equally defensible.

This major problem of lax or non-existent documentation is one of the main reasons I feel that analysts, and in particular analysts working in the public sector, should extensively use tools for code sharing and version control like Github, use free tools whenever possible, and generally adhere to best practices in reproducible research.

I am trying to put as much of my code on Github as I can these days. Much of what I write is still very disorganized and, frankly, embarrassing. A lot of what is in my Github repositories is old, abandoned code written as I was learning my craft. A lot of it is written to work with very specific, private data. Most of it is poorly documented because I am the only one who has ever had to use it, I don’t interact with anyone through practices like code reviews, and frankly I am lazy when pressed with a deadline. But that’s not really the point, is it? The worst documented code is code that is hidden away on a personal hard drive, written for an expensive proprietary environment most people and organizations cannot use, or worse, is not code at all but rather a series of destructive data edits and manipulations. ¹

One way that I have been trying to improve the quality and utility of the code I write is by contributing to an open source R package, eeptools. This is a package written and maintained by Jared Knowles, an employee of the Wisconsin Department of Public Instruction, whom I met at a Strategic Data Project convening. eeptools is consolidating several functions in R for common tasks education data analysts are faced with. Because this package is available on CRAN, the primary repository for R packages, any education analyst can have access to its functions in one line:

1

install.packages('eeptools'); require(eeptools)

Submitting code to a CRAN package reinforces several habits. First, I get to practice writing R documentation, explaining how to use a function, and therefore, articulating the assumptions and business rules I am applying. Second, I have to write my code with a wider tolerance for input data. One of the easy pitfalls of a beginning analyst is writing code that is too specific to the dataset in front of you. Most of the errors I have found in analyses during quality control stem from assumptions embedded in code that were perfectly reasonable with a single data set that lead to serious errors when using different data. One way to avoid this issue is through test-driven development, writing a good testing suite that tests a wide range of unexpected inputs. I am not quite there yet, personally, but thinking about how my code would have to work with arbitrary inputs and ensuring it fails gracefully ² is an excellent side benefit of preparing a pull request ³ . Third, it is an opportunity to write code for someone other than myself. Because I am often the sole analyst with my skillset working on a project, it is easy to not consider things like style, optimizations, clarity, etc. This can lead to large build-ups of technical debt, complacency toward learning new techniques, and general sloppiness. Submitting a pull request feels like publishing. The world has to read this, so it better be something I am proud of that can stand up to the scrutiny of third-party users.

My first pull request, which was accepted into the package, calculates age in years, months, or days at an arbitrary date based on date of birth. While even a beginning R programmer can develop a similar function, it is the perfect example of an easily compartmentalized component, with a broad set of applications, that can be accessed frequently .

Today I submitted by second pull request that I hope will be accepted. This time I covered a much more complex task– calculating student mobility. To be honest, I am completely unaware of existing business rules and algorithms used to produce the mobility numbers that are federally reported. I wrote this function from scratch thinking through how I would calculate the number of schools attended by a student in a given year. I am really proud of both the business rules I have developed and the code I wrote to apply those rules. My custom function can accept fairly arbitrary inputs, fails gracefully when it finds data it does not expect, and is pretty fast. The original version of my code took close to 10 minutes to run on ~30,000 rows of data. I have reduced that with a complete rewrite prior to submission to 16 seconds.

While I am not sure if this request will be accepted, I will be thrilled if it is. Mobility is a tremendously important statistic in education research and a standard, reproducible way to calculate it would be a great help to researchers. How great would it be if eeptools becomes one of the first packages education data analysts load and my mobility calculations are used broadly by researchers and analysts? But even if it’s not accepted because it falls out of scope, the process of developing the business rules, writing an initial implementation of those rules, and then refining that code to be far simpler, faster, and less error prone was incredibly rewarding.

My next post will probably be a review of that process and some parts of my moves_calc function that I’m particularly proud of.

Using a spreadsheet program, such as Excel, encourages directly manipulating and editing the source data. Each change permanently changes the data. Even if you keep an original version of the data, there is no recording of exactly what was done to change the data to produce your results. Reproducibility is all but impossible of any significant analysis done using spreadsheet software. ↩︎
Instead of halting the function with hard to understand error when things go wrong, I do my best to “correct” easily anticipated errors or report back to users in a plain way what needs to be fixed. See also fault-tolerant system. ↩︎
A pull request is when you submit your additions, deletions, or any other modifications to be incorporated in someone else’s repository. ↩︎

June 11, 2013

Calculating Age in R

7:00PM

A few months back I wrote some code to calculate age from a date of birth and arbitrary end date. It is not a real tricky task, but it is certainly one that comes up often when doing research on individual-level data.

I was a bit surprised to only find bits and pieces of code and advice on how to best go about this task. After reading through some old R-help and Stack Overflow responses on various ways to do date math in R, this is the function I wrote ¹:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


age_calc <- function(dob, enddate=Sys.Date(), units='months'){
  if (!inherits(dob, "Date") | !inherits(enddate, "Date"))
    stop("Both dob and enddate must be Date class objects")
  start <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)
  
  years <- end$year - start$year
  if(units=='years'){
    result <- ifelse((end$mon < start$mon) | 
                      ((end$mon == start$mon) & (end$mday < start$mday)),
                      years - 1, years)    
  }else if(units=='months'){
    months <- (years-1) * 12
    result <- months + start$mon
  }else if(units=='days'){
    result <- difftime(end, start, units='days')
  }else{
    stop("Unrecognized units. Please choose years, months, or days.")
  }
  return(result)
}

A few notes on proper usage and the choices I made in writing this function:

The parameters dob and enddate expect data that is already in one of the various classes that minimally inherits the base class Date.
This function takes advantage of the way that R treats vectors, so both dob and enddate can be a single or multi-element vector. For example enddate is a single date, as is the default, then the function will return a vector that calculates the difference between dob and that single date for each element in dob. If dob and enddate are both vectors with n>1, then the returned vector will contain the element-wise difference between dob and enddate. When the vectors are of different sizes, the shorter vector will be repeated over until it reaches the same length as the longer vector. This is known as recycling, and it is the default behavior in R.
This function always returns an integer. Calculating age in years will never return, say, 26.2. Instead, it assumes that the correct behavior for age calculations is something like a floor function. For examle, the function will only return 27 if enddate is minimally your 27th birthday. Up until that day you are considered 26. The same is true for age in months.

This is probably the first custom function in almost 3 years using R that I wrote to be truly generalizable. I was inspired by three factors. First, this is a truly frequent task that I will have to apply to many data sets in the future that I don’t want to have to revisit. Second, a professional acquaintance, Jared Knowles, is putting together a CRAN package with various convenience functions for folks who are new to R and using it to analyze education data ². This seemed like an appropriate addition to that package, so I wanted to write it to that standard. In fact, it was my first (and to date, only) submitted and accepted pull request on Github. Third, it is a tiny, simple function so it was easy to wrap my head around and write it well. I will let you be the judge of my success or failure ³.

I originally used Sys.time() not realizing there was a Sys.Date() function. Thanks to Jared Knowles for that edit in preparation for a CRAN check. ↩︎
Check out eeptools on Github. ↩︎
Thanks to Matt’s Stats n Stuff for getting me to write this post. When I saw another age calculation function pop up on the r-bloggers feed I immediately thought of this function. Matt pointed out that it was quite hard to Google for age calculations in R, lamenting that Google doesn’t meaningfully crawl Github where I linked to find my code. So this post is mostly about providing some help to less experience R folks who are frantically Googling as both Matt and I did when faced with this need. ↩︎

August 21, 2012

There must be an easier way... survey questions in R

7:00PM

So I have this great little custom function I’ve used when looking at survey data in R. I call this function pull(). The goal of pull() is to quickly produce frequency tables with n sizes from individual-level survey data.

Before using pull(), I create a big table that includes information about the survey questions I want to pull. The data are structured like this:

quest	survey	year	break
ss01985	elementary	2011\_12	schoolcode

quest represents the question coding in the raw survey data.
survey is the name of the survey (in my case, the elementary school students, middle school students, high school students, parents, teachers, or administrators).
year is the year that the survey data are collected.
break is the ID I want to aggregate on like schoolcode or districtcode.

They key is that paste(survey, year,sep='') produces the name of the data.frame where I store the relevant survey data. Both quest and break are columns in the survey data.frame. Using a data.frame with this data allows me to apply through the rows and produce the table for all the relevant questions at once. pull() does the work of taking one row of this data.frame and producing the output that I’m looking for. I also use pull() one row at a time to save a data.frame that contains these data and do other things (like the visualizations in this post).

In some sense, pull() is really just a fancy version of prop.table that takes in passed paramaters and adds an “n” to each row and adding a “total” row. I feel as though there must be an implementation of an equivalent function in a popular package (or maybe even base) that I should be using rather than this technique. It would probably be more maintainable and easier for collaborators to work with this more common implementation, but I have no idea where to find it. So, please feel free to use the code below, but I’m actually hoping that someone will chime in and tell me I’ve wasted my time and I should just be using some function foo::bar.

P.S. This post is a great example of why I really need to change this blog to Markdown/R-flavored Markdown. All those inline references to functions, variables, or code should really be formatted in-line which the syntax highlighter plug-in used on this blog does not support. I’m nervous that using WP-Markdown plugin will botch formatting on older posts, so I may just need to setup a workflow where I pump out HTML from the Markdown and upload the posts from there. If anyone has experience with Markdown + Wordpress, advice is appreciated.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62


pull <- function(rows){
  # Takes in a vector with all the information required to create crosstab with
  # percentages for a specific question for all schools.
  # Args:
  #  rows: Consists of a vector with four objects.
  #        quest: the question code from SurveyWorks
  #        level: the "level" of the survey, i.e.: elem, midd, high, teac, admn,
  #        pare, etc.
  #        year: the year the survey was administered, i.e. 2011_12
  #        sch_lea: the "break" indicator, i.e. schoolcode, districtcode, etc.
  # Returns:
  # A data.frame with a row for each "break", i.e. school, attributes for
  # each possible answer to quest, i.e. Agree and Disagree, and N size for each
  # break based on how many people responded to that question, not the survey as
  # a whole, i.e. 

  # Break each component of the vector rows into separate single-element vectors
  # for convenience and clarity.
  quest <- as.character(rows[1])
  survey <- as.character(rows[2])
  year  <- as.character(rows[3])
  break <- as.character(rows[4])
  data <- get(paste(level,year,sep=''))
  # Data is an alias for the data.frame described by level and year.
  # This alias reduces the number of "get" calls to speed up code and increase
  # clarity.
  results <- with(data,
                  dcast(data.frame(prop.table(table(data[[break]],
                                                    data[[quest]]),
                                              1))
                        ,Var1~Var2,value.var='Freq'))
  # Produces a table with the proportions for each response in wide format.
  n <- data.frame(Var1=rle(sort(
    subset(data, 
           is.na(data[[quest]])==F & is.na(data[[break]])==F)[[break]]))$values,
                  n=rle(sort(
                    subset(data,
                           is.na(data[[quest]])==F &
                             is.na(data[[break]])==F)[[break]]))$lengths)
  # Generates a data frame with each break element and the "length" of that break
  # element. rle counts the occurrences of a value in a vector in order. So first
  # you sort the vector so all common break values are adjacent then you use rle
  # to count their uninterupted appearance. The result is an rle object with 
  # two components: [[values]] which represent the values in the original, sorted
  # vector and [[length]] which is the count of their uninterupted repeated
  # appearance in that vector.
  results <- merge(results, n, by='Var1')
  # Combines N values with the results table.

  state <- data.frame(t(c(Var1='Rhode Island', 
                          prop.table(table(data[[quest]])),
                          n=dim(subset(data,is.na(data[[quest]])==F))[1])))
  names(state) <- names(results)
  for(i in 2:dim(state)[2]){
    state[,i] <- as.numeric(as.character(state[,i]))
  }
  # Because the state data.frame has only one row, R coerces to type factor.
  # If I rbind() a factor to a numeric attribute, R will coerce them both to
  # characters and refuses to convert back to type numeric.
  results <- rbind(results, state)
  results
}   

July 11, 2012

Using the Common Core Data on NCES

7:00PM

My analysis on Nesi’s Notes depended entirely on the National Center for Education Statistics’ Common Core Data. The per pupil amounts reported to NCES may look a bit different from state sources of this information. There are several explanations of this. First, the enrollment counts used to generate per pupil amounts are based on an October 1st headcount. In Rhode Island, we use something called “average daily membership” (ADM) as the denominator and not a headcount. The ADM of a district is calculated by taking all the students who attended the district at any point in the year and adding up the number of school days they were enrolled for. The total membership (i.e. all the student*days, for those who like to think about this in units) is divided by the number of school days per year, almost always 180 (so student*days / days/year = students/year). Additionally, NCES does not record the final three digits on most financial data. These rounding issues will also make the per pupil data seem different from state reports.

I wanted to use the NCES to make sure that the data in my post was easily reproducible by any member of the public. I also thought using NCES would serve as a great learning opportunity for the wonks and nerds out there who never even realized how much rich data about schools and school finance are available through the federal government. That being said, I do believe that the state reported numbers are far more accurate than those available from the federal government. That is not to say that the federal data is bad. On the contrary, that data is substantially vetted and validated and is very useful for research. My concern was only that some of the tiny differences in the NCES data that deviated from what I would consider to be ideal data might reach the level where they affected the validity of the conclusions I wanted to draw.

Although I was writing as a private citizen without the support of the Rhode Island Department of Education, I did use my access to RIDE data to ensure that differences in the federal reports were not significant enough to call into question my analysis. I found that both the direction and magnitude of all the trends that I describe in the Nesi’s Notes post held up with the state data. While all of that information is publicly available, it is less easily accessible than NCES data and doesn’t provide the same opportunity for analysis outside of financial data. For these reasons, I decided to stick with NCES.

So how do you reproduce the data I used?

First, go to the NCES Common Core Data Build a Table site. On the drop down, select “District” as the row variable and select the last fifteen years excluding 2009-10 (since there is no current financial data available for that year).

After clicking next, hit “I Agree” on the pop-up.

Now select “Finance Per Pupil Ratios” for the first column.

Click the green arrow that selects all years for local sources per student and state sources per student.

Click “Next>>” on the top right. Now select only RI-Rhode Island for your row variable.

Finally, click view table to see the results. I recommend downloading the Test (.csv) file to work with.

And finally, here’s the R code to reshape/rejigger the data I used and produce the graphics from the Nesi’s Notes post.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232


## Using NCES data to analyze education finances to Woonsocket over 15 years.
## Initialize required packages
require(plyr)
require(reshape2)
require(ggplot2)
require(scales)
## Best to ignore this function-- it's mostly magic to me too. Essentially,
## multiplot takes in a bunch of plots and then puts them into one image
## arranging them by columns equal to a paramter cols. Credit to:
## [wiki.stdout.org/rcookbook...](http://wiki.stdout.org/rcookbook/Graphs/Multiple%20graphs%20on%20one%20page%20()    ggplot2)/
multiplot <- function(..., plotlist=NULL, cols) {
  require(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # Make the panel
  plotCols = cols                          # Number of columns of plots
  plotRows = ceiling(numPlots/plotCols) # Number of rows needed, calculated from #    of cols
  # Set up the page
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(plotRows, plotCols)))
  vplayout <- function(x, y)
    viewport(layout.pos.row = x, layout.pos.col = y)
  # Make each plot, in the correct location
  for (i in 1:numPlots) {
    curRow = ceiling(i/plotCols)
    curCol = (i-1) %% plotCols + 1
    print(plots[[i]], vp = vplayout(curRow, curCol ))
  }
}
## Load data from the modified CSV. I made the following changes from the NCES
## downloaded file: 1) I removed all of the description header so that row one
## of the CSV is the attribute names; 2) I pasted the transposed state values
## to the final observation so that I have a state observation row analogous to
## the other LEA rows.

raw_data <- read.csv('rawdata.csv')
## Change name of first column to make things easier for later.
names(raw_data)[1] <- c('distname')
## Creating Time Series Data for each community of interest.
## I'm going to use a custom function to automate the steps required to create
## district level data in a time series.

create_ts <- function(name){
  # First create a column vector with the local funding
  # A few things to note: First, t() is the transpose function and helps to
  # make my "wide" data (lots of columns) "long" (lots of rows). Second, R
  # has a funny behavior that is very covenient for data anaylsts. It performs
  # many common mathematical operations element-wise, so the simple division
  # of two vectors below actually divides element by element through the
  # vector, e.g. column 17 is divided by column 2 to provide the first element
  # in the resulting vector. This makes calculating per pupil amounts very
  # convenient.
  local <- t(subset(raw_data,distname==name)[,c(17:31)]/
             subset(raw_data,distname==name)[,c(2:16)])
  # Performing the same operation for state per pupil amounts.
  state <- t(subset(raw_data,distname==name)[,c(32:46)]/
             subset(raw_data,distname==name)[,c(2:16)])
  # Putting state and local data together and getting rid of the nasty
  # attribute names from NCES by just naming the rows with a sequence
  # of integers.
  results <- data.frame(local,state,row.names=seq(1,15,1))
  # Naming my two attributes
  names(results) <- c('local','state')
  # Generating the year attribute
  results[['year']] <- seq(1995, 2009, 1)
  # This command is a bit funky, but basically it makes my data as long as
  # possible so that each line has an ID (year in this case) and one value
  # (the dollars in this case). I also have a label that describes that value,
  # which is local or state.
  results <- melt(results, id.vars='year')
  # Returning my "results" object
  results
}

## Create the Woonsocket data-- note that R is case sensitive so I must use all
## capitals to match the NCES convention.
woonsocket <- create_ts('WOONSOCKET')
pawtucket <- create_ts('PAWTUCKET')
providence <- create_ts('PROVIDENCE')
westwarwick <- create_ts('WEST WARWICK')
state <- create_ts('STATE')

## Developing a plot of JUST local revenues for the selected communities
## First I create a percentage change data frame. I think that looking at
## percent change overtime is generally more fair. While the nominal dollar
## changes are revealing, my analysis is drawing attention to the trend rather
## than the initial values.

## First, I pull out just the local dollars.
perwoonlocal <- subset(woonsocket,variable=='local')
## Now I modify the value to be divided by the starting value - 100%
perwoonlocal[['value']] <- with(perwoonlocal, (value/value[1])-1)
## A little renaming for the combining step later
names(perwoonlocal) <-c('year','disname','value')
perwoonlocal[['disname']]<-'Woonsocket'

## I repeat this procedure for all the districts of interest.
perpawlocal <- subset(pawtucket,variable=='local')
perpawlocal[['value']] <- with(perpawlocal, (value/value[1])-1)
names(perpawlocal) <-c('year','disname','value')
awlocal[['disname']]<-'Pawtucket'

perprolocal <- subset(providence,variable=='local')
perprolocal[['value']] <- with(perprolocal, (value/value[1])-1)
names(perprolocal) <-c('year','disname','value')
perprolocal[['disname']]<-'Providence'

perwwlocal <- subset(westwarwick, variable=='local')
perwwlocal[['value']] <- with(perwwlocal, (value/value[1])-1)
names(perwwlocal) <-c('year','disname','value')
perwwlocal[['disname']]<-'West Warwick'

perrilocal <- subset(state,variable=='local')
perrilocal[['value']] <- with(perrilocal, (value/value[1])-1)
names(perrilocal) <-c('year','disname','value')
perrilocal[['disname']]<-'State Average'

## The same process can be used for state data
perwoonstate <- subset(woonsocket,variable=='state')
## Now I modify the value to be divided by the starting value - 100%
perwoonstate[['value']] <- with(perwoonstate, (value/value[1])-1)
## A little renaming for the combining step later
names(perwoonstate) <-c('year','disname','value')
perwoonstate[['disname']]<-'Woonsocket'

## I repeat this procedure for all the districts of interest.
perpawstate <- subset(pawtucket,variable=='state')
perpawstate[['value']] <- with(perpawstate, (value/value[1])-1)
names(perpawstate) <-c('year','disname','value')
perpawstate[['disname']]<-'Pawtucket'

perprostate <- subset(providence,variable=='state')
perprostate[['value']] <- with(perprostate, (value/value[1])-1)
names(perprostate) <-c('year','disname','value')
perprostate[['disname']]<-'Providence'

perwwstate <- subset(westwarwick, variable=='state')
perwwstate[['value']] <- with(perwwstate, (value/value[1])-1)
names(perwwstate) <-c('year','disname','value')
perwwstate[['disname']]<-'West Warwick'

perristate <- subset(state,variable=='state')
perristate[['value']] <- with(perristate, (value/value[1])-1)
names(perristate) <-c('year','disname','value')
perristate[['disname']]<-'State Average'

## Pull together the data sets for the overall picture.
localfunding <- rbind(perwoonlocal, perpawlocal,perprolocal,perwwlocal,perrilocal)
statefunding <- rbind(perwoonstate, perpawstate,perprostate,perwwstate,perristate)

## A little ggplot2 line plot magic...
localperplot <- ggplot(localfunding,aes(year, value, color=disname)) +
                geom_line() +
                geom_text(data=subset(localfunding, year==2009),
                          mapping=aes(year,value,
                                      label=paste(100*round(value,3),'%',sep='')),
                          vjust=-.4) +
                scale_y_continuous('Percent Change from FY1995',
                                   label=percent) +
                scale_x_continuous('Year') +
                opts(title='Percent Change in Local Per Pupil Revenue, FY1995-    FY2009') +
                opts(plot.title=theme_text(size=16,face='bold')) +
                opts(legend.title=theme_blank()) +
                opts(legend.position=c(.08,.82))
stateperplot <- ggplot(statefunding,aes(year, value, color=disname)) +
                geom_line() +
                geom_text(data=subset(statefunding, year==2008 | year==2009),
                          mapping=aes(year,value,
                          label=paste(100*round(value,3),'%',sep='')),
                          vjust=-.4) +
                scale_y_continuous('Percent Change from FY1995',
                                   label=percent) +
                scale_x_continuous('Year') +
                opts(title='Percent Change in State Per Pupil Revenue, FY1995-    FY2009') +
                opts(plot.title=theme_text(size=16,face='bold')) +
                opts(legend.title=theme_blank()) +
                opts(legend.position=c(.08,.82))
ggsave('localperplot.png',localperplot,width=10,height=8,units='in',dpi=72)
ggsave('stateperplot.png',stateperplot,width=10,height=8,units='in',dpi=72)
    
## Proportion of Aid
proportion <- function(data){
  # This reshapes the data so that there is a year, local, and state column.
  # The mean function has no purpose, because this data is unique by year
  # variable combinations.
  prop <- dcast(data,year~variable,mean)
  # Adding local and state get our total non-federal dollars
  prop[['total']] <- apply(prop[,2:3],1,sum)
  prop[['perlocal']] <- with(prop, local/total)
  prop
}


## Prepare new data frames for proportion graphs

propwoon <- as.data.frame(c(disname='Woonsocket',
                            proportion(woonsocket)))
proppaw <- as.data.frame(c(disname='Pawtucket',
                           proportion(pawtucket)))
propprov <- as.data.frame(c(disname='Providence',
                            proportion(providence)))
propww <- as.data.frame(c(disname='West Warwick',
                          proportion(westwarwick)))
propri <- as.data.frame(c(disname='State Average',
                          proportion(state)))

## Note, I could have called proportion() inside of the rbind(), but I wanted
## my code to be clearer and felt there may be some use for the independent
## proportion data frames in further analysis. Sometimes more lines of code
## and more objects is easier to maintain and more flexible for exploratory,
## non-production code. This is especially true when handling such small
## data sets that there is no impact on performance.

locprop <- rbind(propwoon, proppaw,propprov,propww,propri)

## Some ggplot2 magic time!

localpropplot <- ggplot(locprop,aes(year, perlocal, color=disname)) +
  geom_line() +
  geom_text(data=subset(locprop, year==1995 | year==2008 |     year==2009),
            mapping=aes(year,perlocal,
                        label=paste(100*round(perlocal,3),'%',sep='')),
            vjust=-.4) +
  scale_y_continuous('Percent Change from FY1995',
                     label=percent) +
  scale_x_continuous('Year') +
  opts(title='Percent Change in Local Proportion of Per Pupil    Revenue\n Excluding Federal Funding, FY1995-FY2009') +
  opts(plot.title=theme_text(size=16,face='bold')) +
  opts(legend.title=theme_blank()) +
  opts(legend.position=c(.9,.65))
ggsave('localpropplot.png',localpropplot,width=10,height=8,units='in',dpi=72)

This post is the third of a three-part follow up on my guest post for Nesi’s Notes. Parts I and II can be found here.

July 9, 2012

Ranked Likert-Scale Visualization

7:00PM

Update

See below for more information now that Ethan Brown has weighed in with some great code.

A recent post I came across on r-bloggers asked for input on visualizing ranked Likert-scale data.

I happen to be working on a substantial project using very similarly structured data so I thought I would share some code. In my efforts to be generic as possible, I decided to generate some fake data from scratch. As I peeled away the layers of context-specific aspects of my nearing-production level code, I ran into all kinds of trouble. So I apologize for the somewhat sloppy and unfinished code¹.

My preferred method for visualizing Likert-scale data from surveys is using net stacked distribution graphs. There are two major benefits of these kinds of graphs. First, they immediately draw attention to how strongly respondents feel about a question, particularly when multiple questions are visualized at once. The total width of any bar is equal to the total number of responded who had a non-neutral answer. Second, these graphs make it very easy to distinguish between positive and negative responses. In some cases, it is critical to view the distribution of data to visualize the differences in responses to one question or another. However, most of the time it is informative enough to simply know how positive or negative responses are. I find this is particularly true with 3, 4, and 5-point Likert scales, the most common I come across in education research.

Anyway, without further ado, some starter code for producing net stacked distribution graphs.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95


require(ggplot2)
require(scales)
require(plyr)
dataSet <- data.frame(
  q1=as.ordered(round(runif(1500, 1, 15) + runif(1500,1,15))),
  q2=as.ordered(round(runif(1500, 1, 15) + runif(1500,1,15))))
dataSet[['q1']] <- with(dataSet, ifelse(q1<7,1,
                                       ifelse(q1>=7 & q1<13,2,
                                             ifelse(q1>=13 & q1<20,3,
                                                   ifelse(q1>=20 & q1<26,4,5)))))
dataSet[['q2']] <- with(dataSet, ifelse(q2<3,1,
                                       ifelse(q2>=3 & q2<14,2,
                                             ifelse(q2>=14 & q2<26,3,
                                                   ifelse(q2>=26 & q2<28,4,5)))))
dataSet[['q1']] <- as.ordered(dataSet[['q1']])
dataSet[['q2']] <- as.ordered(dataSet[['q2']])
levels(dataSet[['q1']]) <- c('Strongly Disagree',
                             'Disagree',
                             'Neither Agree or Disagree',
                             'Agree',
                             'Strongly Agree')
levels(dataSet[['q2']]) <- c('Strongly Disagree',
                             'Disagree',
                             'Neither Agree or Disagree',
                             'Agree',
                             'Strongly Agree')
# Convert the integer levels to have meaning.
q1Proportions <- data.frame(Name='q1', prop.table(table(dataSet[['q1']])))
q2Proportions <- data.frame(Name='q2', prop.table(table(dataSet[['q2']])))
# Produces a data frame with the proportions of respondents in each level.

# ggplot2 function for graphs
visualize <- function(data,
                      responses=c('Strongly Disagree',
                                  'Disagree',
                                  'Neither Agree or Disagree',
                                  'Agree',
                                  'Strongly Agree'),
                      desc='Title',
                      rm.neutral=TRUE){
  # This function will create net stacked distribution graphs. These are
  # a particularly useful visualization of Likert data when there is a neutral
  # option available and/or when emphasizing the difference between positive and
  # negative responses is a goal.
  # Args:
  #   data: This is a dataframe with percentages labeled with responses.
  #   responses: This is a vector with the response labels.
  #   desc: This is the title of the output ggplot2 graphic, typically the
  #         question text.
  #   rm.neutral: This is a single element logical vector that determines if the
  #               neutral response should be removed from the data. The default
  #               value is TRUE.
  for(i in 1:ceiling(length(responses)/2)-1){
      # This loop negates all the negative, non-neutral responses regardless of
      # the number of possible responses. This will center the non-neutral
      # responses around 0.
      data[i,3] <- -data[i,3]
  }
  if(rm.neutral==T){
    data <- ddply(data,.(Name), function(x) x[-(ceiling(length(responses)/2)),])
    responses <- responses[-(ceiling(length(responses)/2))]
  }
  else{

  }
  print(data)
  stackedchart <- ggplot() +
                  layer(data=data[1:2,],
                        mapping=aes(Name,Freq,fill=Var1,order=-as.numeric(Var1)),
                        geom='bar',
                        position='stack',
                        stat='identity')
  stackedchart <- stackedchart +
                  layer(data=data[3:4,],
                        mapping=aes(Name,Freq,fill=Var1,order=Var1),
                        geom='bar',
                        position='stack',
                        stat='identity')
  stackedchart <- stackedchart +
                  geom_hline(yintercept=0) +
                  opts(legend.title=theme_blank()) +
                  opts(axis.title.x=theme_blank()) +
                  opts(axis.title.y=theme_blank()) +
                  opts(title=desc) +
                  scale_y_continuous(labels=percent,
                                     limits=c(-1,1),
                                     breaks=seq(-1,1,.2)) +
                  scale_fill_manual(limits=responses,
                                    values=c('#AA1111',
                                             '#BB6666',
                                             '#66BB66',
                                             '#11AA11')) +
                  coord_flip()
  stackedchart
}

And the results of all that?

UPDATE:

So now that Ethan has weighed in with his code I thought I would add some things to make this post better reflect my production code. Below, I have included my comment on his blog as well as an actual copy of my current production code (which definitely is not sufficiently refactored for easy use across multiple projects). Again, excuse what I consider to be incomplete work on my part. I do intend on refactoring this code and eventually including it in my broader set of custom functions available across all of my projects. I suspect along that path that I will be “stealing” some of Ethan’s ideas.

Comment

Hi Ethan! Super excited to see this post. This is exactly why I put up my code– so others could run with it. There are a few things that you do here that I actually already had implemented into my code and removed in an attempt to be more neutral to scale that I really like.

For starters, in my actual production code I also separate out the positive and negative responses. In my code, I have a parameter called scaleName that allows me to switch between all of the scales that are available in my survey data. This includes Strongly Disagree to Strongly Agree (scaleName=='sdsa'), Never -> Always (scaleName=='sdsa') and even simple yes/no (scaleName=='ny'). This is not ideal because it does require 1) knowing all possible scales and including some work in the function to treat them differently 2) including an additional parameter. However, because I use this work to analyze just a few surveys, the upfront work of including this as a parameter has made this very flexible in dealing with multiple scales. As a result, I do not need to require that the columns are ordered in any particular way, just that the titles match existing scales. So I have a long set of if elseif statements that look something like this:

1
2
3
4
5


 if(scaleName=='sdsa'){
 scale <- c('Strongly Disagree','Disagree','Agree','Strongly Agree')
 pos <- c('Agree','Strongly Agree')
 neg <- c('Strongly Disagree','Disagree')
 }

This is actually really helpful for producing negative values and including some scales in my function which do not have values that are negative (so that it can be used for general stacked charts instead of just net-stacked):

1
2


if(length(neg)>0){
quest[,names(quest) %in% c(neg)] <- -(quest[,names(quest) %in% c(neg)])

(Recall that quest is what I call the dataframe and is equivalent to x in your code)

Another neat trick that I have instituted is having dynamic x-axis limits rather than always going from -100 to 100. I generally like to keep my scales representing the full logical range of data (0 - 100 for percentages, etc) so I might consider this a manipulation. However, after getting many charts with stubby centers, I found I was not really seeing sufficient variation by sticking to my -100 to 100 setup. So I added this:

1
2
3
4
5
6


pos_lims =0)+1))[1,]),
sum(subset(quest,select=c(which(quest[2,-1]>=0)+1))[2,])))
neg_lims <- max(abs(c(sum(subset(quest, 
                                 select=c(which(quest[1,-1]<=0) + 1))[1,]),
sum(subset(quest,select=c(which(quest[2,-1]<=0)+1))[2,]))))
x_axis_lims <- max(pos_lims,neg_lims)

Which helps to determine the value furthest from 0 in either direction across the data frame (I have to admit, this code looks a bit like magic reading it back. My comments actually are quite helpful:

1
2
3
4


# pos_lims and neg_lims subset each row of the data based on sign, then
# sums the values that remain (gettting the total positive or negative
# percentage for each row). Then, the max of the rows is saved as a candidate
# for the magnitude of the axis.

To make this more generalizable (my production code always compares two bars at once) , it would be fairly trivial to loop over all the rows (or use the apply functions which I’m still trying to get a hang of).

I then pad the x_limits value by some percent inside the limits attribute.

In my production code I also have the scale_fill_manual attribute added separately to the ggplot object. However, rather than add this after the fact like at the point of rendering, I include this in my function again set by scaleName. However, I think the best organization is probably to have a separate function that makes it easy to select the color scheme you want and apply it so that your final call could be something like colorNetStacked(net_stacked(x), 'blues').

My actual final return looks like this:

1
2
3


return(stackedchart + 
       scale_fill_manual(limits=scale, values=colors) + 
       coord_flip())

Where colors is set by a line like: colors <- brewer.pal(name='Blues',n=7)[3:7]

Seriously though, I am super excited you found my post and thought it was useful and improved what I presented!

Current production code:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149


visualize <- function(quest,scaleName='sdsa',desc){
  # Produces the main net-stacked Likert graphic used for survey data in the
  # diagnostic tool
  # Args:
  #  quest: data.frame from pull() or pullByLevel() output
  #  scaleName: string code for the type of scale that is used for the question.
  #  desc: string for the title that will be displayed on the graphic.
  # Returns:
  # Net-Stacked Likert chart with a bar/row for each row in quest. Most scales
  # center around 0 with a distinct positive and negative set of responses.
  # The graphs are custom colored based on what best reflects the scale.
  # The x-axis limits are set dynamically based on a 10% buffer of the largest
  # magnitude to either the positive or negative responses.

  if(scaleName=='sdsa'){
    scale <- c('Strongly Disagree','Disagree','Agree','Strongly Agree')
    pos   <- c('Agree','Strongly Agree')
    neg   <- c('Strongly Disagree','Disagree')
  }else if(scaleName=='da'){
    scale <- c('Disagree','Agree')
    pos   <- c('Agree')
    neg   <- c('Disagree')
  }else if(scaleName=='neal'){
    scale <- c('Never','Sometimes','Usually','Always')
    pos   <- c('Usually','Always')
    neg   <- c('Never','Sometimes')
  }else if(scaleName=='noalot'){
    scale <- c('None','A Little','Some','A Lot')
    pos   <- c('None','A Little','Some','A Lot')
    neg   <- c()
  }else if(scaleName=='noall'){
    scale <- c('None of them','Some','Most','All of them')
    pos   <- c('None of them','Some','Most','All of them')
    neg   <- c()
  }else if(scaleName=='neda'){
    scale <- c('Never','A Few Times a Year','Monthly','Weekly','Daily')
    pos   <- c('Never','A Few Times a Year','Monthly','Weekly','Daily')
    neg   <- c()
  }else if(scaleName=='ny'){
    scale <- c('No','Yes')
    pos   <- c('Yes')
    neg   <- c('No')
  }else{
    print('Unrecognized Scale Name')
  }
  # Remove neutral and non-response based values in the pull tables like
  # n-size, Not Applicable, etc.
  quest <- quest[,!names(quest) %in%
    c('n','Not Applicable',"I don't know")]

  # Produce values less than 0 for negative responses
  if(length(neg)>0){
  quest[,names(quest) %in% c(neg)] <-
    -(quest[,names(quest) %in% c(neg)])
  # pos_lims and neg_lims subset each row of the data based on sign, then
  # sums the values that remain (gettting the total positive or negative
  # percentage for each row). Then, the max of the rows is saved as a candidate
  # for the magnitude of the axis.
  pos_lims <- max(c(sum(subset(quest,select=c(which(quest[1,-1]>=0)+1))[1,]),
                    sum(subset(quest,select=c(which(quest[2,-1]>=0)+1))[2,])))
  neg_lims <- max(abs(c(sum(subset(quest,select=c(which(quest[1,-1]<=0)+1))[1,]),
                        sum(subset(quest,select=c(which(quest[2,-1]<=0)+1))[2,]))))

  # The actual magnitude of the axis is the largest magnitude listed in pos_lims
  # or neg_lims, and will be inflated by .1 in each direction in the scale later
  x_axis_lims <- max(pos_lims,neg_lims)
  }else{

  }
  # Reshape the data so that each row has one value with a variable label.
  quest <- melt(quest,id.vars='Var1')

  # Factoring and ordering the response label ensures they are listed in the
  # proper order in the legend and on the stacked chart, i.e. strongly disagree
  # is furthest left and strongly agree is furthest right.
  quest[['variable']] <- factor(quest[['variable']],
                                levels=scale,
                                ordered=TRUE)

  # Build the plot using ggplot(). Layers are used so that positive and negative
  # can be drawn separately. This is important because the order of the negative
  # values needs to be switched.

  ##### Control flow required to change the behavior for the questions that
  ##### business requirements call for 0-100 scale with no indication of
  ##### positive or negative, i.e. the neda, noalot, and noall scaleName.
  stackedchart <- ggplot() +
    layer(data=subset(quest,
                      variable %in% pos),
          mapping=aes(Var1,
                      value,
                      fill=factor(variable)),
          geom='bar',
          stat='identity',
          position='stack') +
    geom_hline(yintercept=0) +
    opts(legend.title=theme_blank()) +
    opts(axis.title.x=theme_blank()) +
    opts(axis.title.y=theme_blank()) +
    opts(title=desc)
  if(length(neg)>0){
    stackedchart <- stackedchart +
      layer(data=subset(quest,
                        variable %in% neg),
            mapping=aes(Var1,
                        value,
                        fill=factor(variable),
                        order=-as.numeric(variable)),
            geom='bar',
            stat='identity',
            position='stack')
  }else{

  }
  if(scaleName %in% c('sdsa','neal')){
    colors <- c('#AA1111','#BB6666','#66BB66','#11AA11')
    stackedchart <-  stackedchart +
      scale_y_continuous(labels=percent,
                         limits=c(-x_axis_lims-.1, x_axis_lims+.1),
                         breaks=seq(-round(x_axis_lims,1)-.1,
                                    round(x_axis_lims,1)+.1,
                                    .2))
  }else if(scaleName %in% c('ny','da')){
    colors <- c('#BB6666','#66BB66')
    stackedchart <-  stackedchart +
      scale_y_continuous(labels=percent,
                         limits=c(-x_axis_lims-.1, x_axis_lims+.1),
                         breaks=seq(-round(x_axis_lims,1)-.1,
                                    round(x_axis_lims,1)+.1,
                                    .2))
  }else if(scaleName %in% c('noalot','noall')){
    colors <- brewer.pal(name='Blues',n=6)[3:6]
    stackedchart <-  stackedchart +
      scale_y_continuous(labels=percent,
                         limits=c(0,1.05),
                         breaks=seq(0,1,.1))
  }else if(scaleName %in% c('neda')){
    colors <- brewer.pal(name='Blues',n=7)[3:7]
    stackedchart <-  stackedchart +
      scale_y_continuous(labels=percent,
                         limits=c(0,1.05),
                         breaks=seq(0,1,.1))
  }else{
    print('Unrecognized scaleName')
  }
  return(stackedchart + scale_fill_manual(limits=scale,
                                          values=colors) +
                        coord_flip())
}

Mainly, I would like to abstract this code further. I am only about halfway there to assuring that I can use Likert-scale data of any size. I also would like to take in more than one question simultaneously with the visualize function. The latter is already possible in my production code and is particularly high impact for these kinds of graphics ↩︎

Older Posts→

DuckDB

Python/Pandas

R - readr

R - data.table

PostgreSQL

Can I fix it?

readr

pandas

duckdb

postgresql

Some additional testing

vroom

arrow

explorer / polars

The New World of tidyeval

What do you mean, foreign key constraints?

Ok, so these are just collections of enumerated values

Please Don’t stop()

Enough with the toy examples

Update

Version 0: Where Do I Begin?

Version 1: A Naïve Implementation

Function Declaration: Parameters

Type Checking and Programmatic Defaults

Initializing the Results

Business Rule 1: The Latecomer

Business Rule 5: The Feint

Business Rule 3: The Smooth Mover

Business Rule 4: The Long Hop

Business Rule 2: The Early Summer

Version 2: 10x Speed And More Readable

Handling Missing Data

Woe with data.table

Magic with data.table: Business Rules 1 and 2 in two lines each

Recursion. Which is also known as recursion.

Quick and Dirty system.time

Update

UPDATE:

Comment

Current production code:

Woe with `data.table`

Magic with `data.table`: Business Rules 1 and 2 in two lines each

Quick and Dirty `system.time`