Tonight I was revisiting an R data-cleaning script I wrote a couple of years ago. The script still runs nightly, processing data dropped on an SFTP server for some work we do at Allovue.

As I read through the file, I noticed that a lot of my code was repetitive. With some new powers I didn’t have back then, namely dplyr::across combined with tidyselect helpers, I wrote the following:

clean <- map(raw, \(df)
  mutate(df,
         # coerce the listed supplement columns (if present) to character
         across(any_of(supplements), as.character),
         # empty strings and literal "NULL" become real NAs
         across(where(is.character), \(x) na_if(x, "")),
         across(where(is.character), \(x) na_if(x, "NULL")),
         # a po_number of "0" is also missing
         across(any_of("po_number"), \(x) na_if(x, "0")),
         across(where(is.character), remove_non_ascii),
         across(where(is.character), \(x) str_trim(x, side = "both")),
         # normalize Windows-style line endings
         across(where(is.character),
                \(x) str_replace(x, "\\r\\n", "\\n")),
         # parse the date column, if it exists and is still character
         across(where(is.character) & any_of("date"),
                \(x) as.Date(x, format = "%m/%d/%Y"))))

My data is already stored in a list (raw) because I read it using map with a custom function (extract_data), which knows how to parse a yaml config file and load data from various sources, such as databases or flat files of several types. This lets me run the same script with different configuration files to collect data from many different sources. I mostly leave the data in a list so that, once it’s clean, I can export everything as consistently formatted text files with a single walk call.
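A minimal sketch of that pattern, with made-up data standing in for the real yaml-driven extract_data:

```r
library(purrr)
library(readr)

# Sketch with in-memory stand-ins: the real script builds `raw` by
# mapping a custom extract_data() over yaml configs. Names and data
# here are hypothetical.
raw <- list(
  invoices = data.frame(id = 1:2, amount = c("10", "")),
  vendors  = data.frame(id = 1:2, name = c("Acme", "NULL"))
)

# iwalk() iterates for side effects, passing each element and its
# name, so one call exports every data.frame in the list.
iwalk(raw, \(df, nm) write_csv(df, file.path(tempdir(), paste0(nm, ".csv"))))
```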

Until now, I had never taken advantage of having a list of data.frames for the T part of ETL. Without across, with its predicates based on data type and tidyselect functions like any_of, it was harder to safely write mutations that were not tied to the underlying data structure. The code above can run on any list of data.frames as long as a character vector named supplements exists. With any_of, columns that don’t exist in my data are simply ignored. And by using where with type predicates (largely is.character in this case), I don’t have to know the names of the columns I want to strip Windows-style line endings from; it just happens on every character column.
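A toy example of those two behaviors (hypothetical columns; check_number is listed in supplements but absent from the data, so any_of skips it):

```r
library(dplyr)

supplements <- c("po_number", "check_number")  # check_number is absent

df <- data.frame(po_number = c("0", "123"), amount = c(5, 10))

out <- df |>
  mutate(
    # coerces only the supplement columns that actually exist
    across(any_of(supplements), as.character),
    # targets columns by type, regardless of their names
    across(where(is.character), \(x) na_if(x, "0"))
  )
# out$po_number becomes NA, "123"; the numeric amount column is untouched
```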

So while it’s not fancy, I’m tickled to have removed four lines that mutated dates in the four of seven data.frames where that was relevant, and to have consolidated the many different places I had sprinkled na_if before. Or consider the number of times I had to cast a specific column (like check_number) to character from some other type (usually integer, because the values are quasi-numeric, or logical, because the column is entirely missing from one data set) just so a join or bind_rows would succeed. This is a classic case of time, experience, and an ever-improving tidyverse API reducing code complexity while making my code more generic and reusable.
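That casting problem looks something like this (hypothetical data: one source stores check_number as integer, another as character because of leading zeros, and bind_rows refuses to combine the two types):

```r
library(dplyr)

# Hypothetical extracts of the same table from two sources.
a <- data.frame(check_number = c(101L, 102L))        # integer
b <- data.frame(check_number = c("00103", "00104"))  # character

# bind_rows(a, b) alone errors: can't combine <integer> and <character>.
# Casting everything to character up front makes the bind safe.
combined <- bind_rows(
  mutate(a, across(any_of("check_number"), as.character)),
  mutate(b, across(any_of("check_number"), as.character))
)
```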