UseRs often do crazy things that no sane developer in another language would do. Today I decided to build a way to check foreign key constraints in R to help me learn the
What do you mean, foreign key constraints?
Well, in many ways this is an extension of my last post on using
purrr::reduce. I have a set of data with codes (like FIPS codes, or user ids, etc) and I want to make sure that all of those codes are “real” codes (as in I have a defintion for that value). So I may have a FIPS code
name as the columns or a user
data.frame with columns
In a database, I might have a foreign key constraint on my table that just has codes so that I could not create a row that uses an
code value or whatever that did not exist in my lookup table. Of course in R, our data is disconnected and non-relational. New users may exist in my dataset that weren’t there the last time I downloaded the
users table, for example.
Ok, so these are just collections of enumerated values
Yup! That’s right! In some ways like R’s beloved
factors, I want to have problems when my data contains values that don’t have a corresponding row in another
data.frame, just like trying to insert a value into a
factor that isn’t an existing level.
assertr anticipates just this, with the
in_set helper. This way I can
assert that my data is in a defined set of values or get an error.
Please Don’t stop()
assert raises an error with an incredibly helpful message. It tells you which column the assertion was on, what the assertion was, how many times that assertion failed, and then returns the column index and value of the failed cases.
assert has an argument for
error_fun, which, combined with some built in functions, can allow for all kinds of fun behavior when an assertion fails. What if, for example, I actually want to collect that error message for later and not have a hard stop if an assertion failed?
assert will return the original
data.frame when there’s a failure with a special attribute called
assertr_errors that can be accessed later with all the information about failed assertions.
(Ok I cheated there folks. I used
verify, a new function from
assertr and a bunch of
magrittr pipes like
Enough with the toy examples
Ok, so here’s the code I wrote today. This started as a huge mess I ended up turning into two functions. First
is_valid_fk provides a straight forward way to get
FALSE on whether or not all of your codes/ids exist in a lookup
The first argument
data is your
data.frame, the second argument
key is the foreign key column in
values are all valide values for
key. Defaulting the
*_logical means a single boolean is the expected response.
But I don’t really want to do these one column at a time. I want to check if all of the foreign keys in a table are good to go. I also don’t want a boolean, I want to get back all the errors in a useable format. So I wrote
Let’s take it one bit at a time.
data.framewe’re checking foreign keys in.
fk_listis a list of
data.frames. Each element is named for the
keythat it looks up; each
data.framecontains the valid values for that
id, the name of the column in each
data.framein the list
fk_listthat corresponds to the valid
Right away, I want to know if my data has all the values my
fk_list says it should. I have to do some
do.call magic because
has_all_names wants something like
has_all_names('this', 'that', 'the_other') not
has_all_names(c('this', 'that', 'the_other').
The next part is where the magic happens.
map, I am able to call
is_valid_fk on each of the columns in
data that have a corresponding lookup table in
fk_list. The valid values are
.x is the name of the
fk_list (which corresponds to the name of the code we’re looking up in
data and exists for sure, thanks to that
verify call) and
id is the name of the key in that
data.frame as stated earlier. I’ve replaced
success_fun so that the code does not exist
map as soon there are any problems. Instead, the data is returned for each assertion with the error attribute if one exists. 2 Immediately,
map is called on the resulting list of
data.frames to collect the
assertr_errors, which are
append into a flattened list.
If there are no errors accumulated,
NULL, and the function exits early.
I could have stopped here and returned all the messages in
accumulated_errors. But I don’t like all that text, I want something neater to work with later. The structure I decided on was a list of
data.frames, with each element named for the column with the failed foreign key assertion and the contents being the index and value that failed the constraint.
data.frames returned by assertion, I was able to see that the
value tables printed in the failed
assert messages are contained in
error_df. So next I extract each of those
data.frames into a single list.
I’m almost done. I have no way of identifying which column created each of those
reporter. So to name each element based on the column that failed the foreign key contraint, I have to extract data from the
message attribute. Here’s what I came up with.
So let’s create some fake data and run
all_valid_fk to see the results:
all_valid_fk in one big chunk.
My thanks to Jonathan Carroll who was kind enough to read this post closely and actually tried to run the code. As a result, I’ve fixed a couple of typos and now have an improved regex pattern above.
I am a little concerned about memory here. Eight assertions would mean, at least briefly, eight copies of the same
data.framecopied here without the need for that actual data. There is probably a better way. ↩︎