Here’s a real live example of why I am stuck with R and stuck with data.table
. In my work, I often receive various delimited files from customers. Mostly, these delimited files are created from Oracle or MS SQL and have all kinds of gnarly things going on. Without sharing too much, here’s a partial example of the last few fields of one line in one of those files:
1
|
|25-AUG-22|"SAN81803 EXPO® White Board CARE Dry Erase Surface Cleaner 8 oz Spray Bottle|54
|
Do you see the problem? I have a "
character in a field, but the field itself is not quoted and the quote is not escaped.
Let’s compare how different systems handle this file. Before we do so, it’s important to know how many rows are in this data set:
1
2
|
wc -l my_file.txt
239167 my_file.txt
|
DuckDB
The new hot thing, DuckDB is an in-memory database like sqlite, but optimized for analytics and reporting. It has support to automatically create a table from a CSV file. What happens when I try and do this?
1
2
|
D create table df as select * from read_csv_auto('my_file.txt');
Error: Invalid Input Error: Error in file "my_file.txt" on line 50614: quote should be followed by end of value, end of row or another quote. (DELIMITER='|' (auto detected), QUOTE='"' (auto detected), ESCAPE='' (auto detected), HEADER=1 (auto detected), SAMPLE_SIZE=20480, IGNORE_ERRORS=0, ALL_VARCHAR=0)
|
Yup, you’re right DuckDB. That’s absolutely the line with the problem. Guess I’m stuck.
Python/Pandas
How about Python and pandas
, somehow the king of modern data science and data frames.
1
2
|
>>> import pandas as pd
>>> df = pd.read_csv('my_file.txt', sep = '|', low_memory=False)
|
Hey, so far so good! This file get read without any messages, warnings, or errors.
1
2
3
|
>>> df
...
[238701 rows x 22 columns]
|
Uh oh. The data has only 238,701 rows, which is quite a bit less than 239,167 (well, 239,166 since this file does have a header row). This may not be a problem, because it’s possible that new lines exist in a text delimited file that is not a new record (if properly quoted). At least now that I have the data loaded, I can check for the sum of a column called “Amount”, because this is financial data. We can compare this to other methods later in addition to the row count so we can be sure the full data set was ready by pandas
.
1
2
|
>>> sum(df.Amount)
196848446.45999622
|
R - readr
I am a full on tidyverse
apologist. So of course I’m going to reach for readr::read_delim
to get this file loaded.
1
2
|
> library(readr)
> df <- read_delim('my_file.txt', delim = '|', show_col_types = FALSE))
|
Awesome. Like pandas
, readr
had no messages, warnings, or errors. Let’s see how many rows there are and the sum of that amount column.
1
2
3
4
|
> df |> nrow()
[1] 238609
> sum(df$Amount)
[1] 196828725
|
Uh oh again. It seems that readr::read_delim
also doesn’t reach 239,166 lines, but instead has only 238,609 lines. That’s almost 100 less than pandas
, and the sum is off by almost $20,000. I don’t know at this stage if pandas
is right, but it seems pretty likely that readr
just silently gave up on some lines that it shouldn’t have.
R - data.table
Let’s try the package data.table
which has a function fread
to read delimited files.
1
2
3
4
|
> df <- fread("my_file.txt")
Warning message:
In fread("my_file.txt") :
Found and resolved improper quoting out-of-sample. First healed line 50614: <<25-AUG-22|"SAN81803 EXPO® White Board CARE Dry Erase Surface Cleaner 8 oz Spray Bottle|54>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
|
That’s interesting! When using fread
, I get a warning that points out the very line I mentioned above. It even prints the line (I removed most of it) and says that it resolved the issue. It also recommends that I might want to try and specify quote = ""
if, like on my file, fields are not quoted. We’ll come back to that.
1
2
|
> nrow(df)
[1] 239166
|
Well, would you look at that? Exactly the amount of lines I’d expect if there is one header row and no new lines. Let’s check that amount column.
1
2
|
> sum(df$Amount)
[1] 196926161
|
That’s almost $80,000 more than pandas
. That’s a lot of money to have missing.
Just for fun, instead of expecting fread
to figure everything out about my file on its own, what if I follow its suggestion and tell fread
that there is no quote character and no fields are quoted?
1
2
3
4
5
|
> df <- fread('my_file.txt', quote = "")
> nrow(df)
[1] 239166
> sum(df$Amount)
[1] 196926161
|
By giving fread
just a little information about the file, I get no warning about resolved lines, it just reads the file correctly with the same results.
So now, because of fread
, I have some idea of what the problem was. Almost everything that reads a delimited file expects fields to be quoted, at least optionally, at least some of the time. They hate a quote at the start of a field that is actually in the raw data (understandably). Maybe if I tell these other tools that the quote character is nothing they’ll work better.
PostgreSQL
Let’s try PostgreSQL using COPY
. Note, I’m not including the create table
statement to avoid revealing more about the data.
1
2
3
4
5
6
|
jason=# \copy df from 'my_file.txt' CSV HEADER DELIMITER '|';
COPY 130426
jason=# select sum(amount) from df;
sum
-------------
97411927.64
|
Well, that’s not right. I should have nearly twice the number of lines. And sure enough, I’ve got only half the dollars. No warnings, no errors, no messages. Nothing. Silent failure.
Can I fix it?
Now that I know the issue is the quote character, let’s see if I can fix all the methods that failed to load this file.
readr
1
2
3
4
5
|
> df <- read_delim('my_file.txt', delim = '|', show_col_types = FALSE, quote = '')
> nrow(df)
[1] 239166
> sum(df$Amount)
[1] 196926161
|
pandas
1
2
3
4
5
|
>>> df = pd.read_csv('my_file.txt', sep = '|', low_memory=False, quoting=3)
>>> df
[239166 rows x 22 columns]
>>> sum(df.Amount)
196926160.57999647
|
duckdb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
D create table df as select * from read_csv_auto('my_file.txt', quote='');
D select count(*) from df;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 239166 │
└──────────────┘
D select sum(amount) from df;
┌───────────────────┐
│ sum(amount) │
│ double │
├───────────────────┤
│ 196926160.5800201 │
└───────────────────┘
|
postgresql
1
2
3
4
5
6
7
8
9
10
11
12
|
jason=# \copy df from 'my_file.txt' CSV HEADER DELIMITER '|' QUOTE '';
ERROR: COPY quote must be a single one-byte character
jason=# \copy df from 'my_file.txt' DELIMITER '|';
ERROR: invalid input syntax for type integer: "col1"
CONTEXT: COPY df, line 1, column col1: "col1"
jason=# \copy df from program 'tail -n +2 my_file.txt' DELIMITER '|';
COPY 239166
jason=# select sum(amount) from df;
sum
--------------
196926160.58
(1 row)
|
Would you look at that? Everyone does just fine when you tell them there is no quote character or quoted fields.
All of these methods are essentially instant on my computer. Performance is not the issue. What’s incredible is that data.table::fread
can identify file issues and resolve them. In this case, it turns out that data.table::fread
was also able to describe the problem well enough that I could fix every other method of reading the file successfully. I will say, going back and reading the error for duckdb
may have given me some hints, but `pandas, readr, and PostgreSQL completely failed to even notify me something was wrong. In an automated pipeline, I would have no indication that hundreds of rows, or in PostgreSQL’s case hundreds of thousands of rows, were just gone.
I was able to fix pandas
, readr
, duckdb
, and PostgreSQL
, but I have run into many scenarios where this is not the case. For example, what if I had this same row in a file that sometimes did quote certain fields. The fread
function can handle this no problem, resolving the issue, warning me, and moving on. Every other method just wouldn’t work.
I don’t control this data. I don’t control producing it. It comes in on a scheduled basis, often hourly, and I need to accurately process it.
Silent failures are terrible.
The only way out is to embrace open, documented, binary data formats. But until all major RDBMS (and probably Excel) have native export and import, flat files will continue to be the de facto standard. In the meantime, it would be nice if PostgreSQL’s COPY
, bcp
, and spool
could at least try and do things like quote fields and escape characters by default when targeting delimited files.
Some additional testing
I was asked to check vroom
and arrow
in R as well as polars
via Explorer
in Elixir. The results were, not good.
vroom
1
2
3
4
5
|
library(vroom)
> df <- vroom('my_file.txt', delim = '|')
Rows: 238609 Columns: 22
> sum(df$Amount)
[1] 196828725
|
arrow
1
2
3
4
5
|
> df <- arrow::read_delim_arrow('df.txt', delim = '|')
> sum(df$Amount)
[1] 196848446
> nrow(df)
[1] 238701
|
explorer / polars
1
2
3
4
5
6
7
8
|
iex(1)> Mix.install([
...(1)> {:explorer, "~> 0.5.0"}
...(1)> ])
iex(2)> df = Explorer.DataFrame.from_csv(filename = "my_file.txt",
...(2)> delimiter: "|", infer_schema_length: nil)
{:error,
{:polars,
"Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor adding `OTHER` to the `null_values` list."}}
|