Tonight I was revisiting an R script for cleaning up data that I wrote a couple of years ago. This script still runs nightly to process some data dropped on an SFTP server for some work we do at Allovue.
As I was reading through the file, I noticed that a lot of my code was repetitive. With some new powers I didn’t have back then, namely dplyr::across
and many other tidyselect
features, I wrote the following:
clean <- map(raw,
~mutate(.x,
across(any_of(supplements), as.character),
across(where(is.character), na_if, ""),
across(where(is.character), na_if, "NULL"),
across(any_of('po_number'), na_if, "0"),
across(where(is.character), remove_non_ascii),
across(where(is.character),
str_trim, side = "both"),
across(where(is.character),
str_replace,
pattern = "\\r\\n",
replacement = "\\n"),
across(where(is.character) & any_of('date'),
as.Date, format = "%m/%d/%Y")))
My data is already stored in a list (raw
) because I read the data using map
with a custom function (extract_data
), which knows how to process a yaml
file and load data from various sources, like databases or flat files of various types. This allows me to automate executing the same script with different configuration files to collect data from many different sources. I mostly leave the data in a list so that I can use walk
to export all the data once clean as text files with consistent formatting with a single call.
Until now, I never took advantage of having a list of data.frame
s for the T part of ETL. Without being able to use across
with predicates based on data type or with tidyselect
functions like any_of
, it was harder to safely do mutations that were not attached to the underlying data structure. The code above can execute on any list of data.frame
s as long as a character vector named supplements
exists. By using any_of
, if those columns don’t exist in my data, they are ignored. And by using where
on type definitions (largely character
s in this case), I don’t have to worry about names of columns in the data I want to remove Windows-style line endings from– just take care of it on any character
columns.
So while it’s not fancy, I’m tickled by removing four lines that mutate
d date
s in 4 of 7 data.frame
s where that was relevant, and how many different places I added na_if
before. Or how about the number of times I had to cast a specific value (like check_number
) to character from some other format (usually integer
because they’re quasi-numeric or logical
because they are entirely missing from one data set) so that I can do a join
or bind_rows
. This is a classic case of time, experience, and an ever-improving tidyverse
API reducing code complexity while making my code more generic and reusable.