I have been fascinated with assertive programming in R since 2015 1. Tony Fischetti wrote a great blog post to announce assertr
2.0’s release on CRAN that really clarified the package’s design.
UseRs often do crazy things that no sane developer in another language would do. Today I decided to build a way to check foreign key constraints in R to help me learn the assertr
package.
What do you mean, foreign key constraints?
Well, in many ways this is an extension of my last post on using purrr::reduce
. I have a set of data with codes (like FIPS codes, or user ids, etc) and I want to make sure that all of those codes are “real” codes (as in I have a defintion for that value). So I may have a FIPS code data.frame
with fips_code
and name
as the columns or a user data.frame
with columns id
, fname
, lname
, email
.
In a database, I might have a foreign key constraint on my table that just has codes so that I could not create a row that uses an id
or code
value or whatever that did not exist in my lookup table. Of course in R, our data is disconnected and non-relational. New users may exist in my dataset that weren’t there the last time I downloaded the users
table, for example.
Ok, so these are just collections of enumerated values
Yup! That’s right! In some ways like R’s beloved factors
, I want to have problems when my data contains values that don’t have a corresponding row in another data.frame
, just like trying to insert a value into a factor
that isn’t an existing level.
assertr
anticipates just this, with the in_set
helper. This way I can assert
that my data is in a defined set of values or get an error.
|
|
Please Don’t stop()
By default, assert
raises an error with an incredibly helpful message. It tells you which column the assertion was on, what the assertion was, how many times that assertion failed, and then returns the column index and value of the failed cases.
Even better, assert
has an argument for error_fun
, which, combined with some built in functions, can allow for all kinds of fun behavior when an assertion fails. What if, for example, I actually want to collect that error message for later and not have a hard stop if an assertion failed?
By using error_append
, assert
will return the original data.frame
when there’s a failure with a special attribute called assertr_errors
that can be accessed later with all the information about failed assertions.
|
|
(Ok I cheated there folks. I used verify
, a new function from assertr
and a bunch of magrittr
pipes like %<>%
)
Enough with the toy examples
Ok, so here’s the code I wrote today. This started as a huge mess I ended up turning into two functions. First is_valid_fk
provides a straight forward way to get TRUE
or FALSE
on whether or not all of your codes/ids exist in a lookup data.frame
.
|
|
The first argument data
is your data.frame
, the second argument key
is the foreign key column in data
, and values
are all valide values for key
. Defaulting the error_fun
and success_fun
to *_logical
means a single boolean is the expected response.
But I don’t really want to do these one column at a time. I want to check if all of the foreign keys in a table are good to go. I also don’t want a boolean, I want to get back all the errors in a useable format. So I wrote all_valid_fk
.
Let’s take it one bit at a time.
|
|
data
is thedata.frame
we’re checking foreign keys in.fk_list
is a list ofdata.frames
. Each element is named for thekey
that it looks up; eachdata.frame
contains the valid values for thatkey
named…id
, the name of the column in eachdata.frame
in the listfk_list
that corresponds to the validkeys
.
|
|
Right away, I want to know if my data has all the values my fk_list
says it should. I have to do some do.call
magic because has_all_names
wants something like has_all_names('this', 'that', 'the_other')
not has_all_names(c('this', 'that', 'the_other')
.
The next part is where the magic happens.
|
|
Using map
, I am able to call is_valid_fk
on each of the columns in data
that have a corresponding lookup table in fk_list
. The valid values are fk_list[[.x]][[id]]
, where .x
is the name of the data.frame
in fk_list
(which corresponds to the name of the code we’re looking up in data
and exists for sure, thanks to that verify
call) and id
is the name of the key in that data.frame
as stated earlier. I’ve replaced error_fun
and success_fun
so that the code does not exist map
as soon there are any problems. Instead, the data is returned for each assertion with the error attribute if one exists. 2 Immediately, map
is called on the resulting list of data.frame
s to collect the assertr_errors
, which are reduce
d using append
into a flattened list.
If there are no errors accumulated, accumulated_errors
is NULL
, and the function exits early.
|
|
I could have stopped here and returned all the messages in accumulated_errors
. But I don’t like all that text, I want something neater to work with later. The structure I decided on was a list of data.frame
s, with each element named for the column with the failed foreign key assertion and the contents being the index and value that failed the constraint.
By calling str
on data.frame
s returned by assertion, I was able to see that the index
and value
tables printed in the failed assert
messages are contained in error_df
. So next I extract each of those data.frame
s into a single list.
|
|
I’m almost done. I have no way of identifying which column created each of those error_df
in reporter
. So to name each element based on the column that failed the foreign key contraint, I have to extract data from the message
attribute. Here’s what I came up with.
|
|
So let’s create some fake data and run all_valid_fk
to see the results:
|
|
Beautiful!
And here’s all_valid_fk
in one big chunk.
|
|
My thanks to Jonathan Carroll who was kind enough to read this post closely and actually tried to run the code. As a result, I’ve fixed a couple of typos and now have an improved regex pattern above.
-
I appear to have forgotten to build link post types into my Hugo blog, so the missing link from that post is here. ↩︎
-
I am a little concerned about memory here. Eight assertions would mean, at least briefly, eight copies of the same
data.frame
copied here without the need for that actual data. There is probably a better way. ↩︎