Delimited files are hell-- a comparison of methods to read bad files

Here’s a real live example of why I am stuck with R and stuck with data.table. In my work, I often receive various delimited files from customers. Mostly, these delimited files are created from Oracle or MS SQL and have all kinds of gnarly things going on. Without sharing too much, here’s a partial example of the last few fields of one line in one of those files:

1

|25-AUG-22|"SAN81803 EXPO® White Board CARE Dry Erase Surface Cleaner 8 oz Spray Bottle|54

Do you see the problem? I have a " character in a field, but the field itself is not quoted and the quote is not escaped.

Let’s compare how different systems handle this file. Before we do so, it’s important to know how many rows are in this data set:

1
2


wc -l my_file.txt
  239167 my_file.txt

DuckDB

The new hot thing, DuckDB is an in-memory database like sqlite, but optimized for analytics and reporting. It has support to automatically create a table from a CSV file. What happens when I try and do this?

1
2


D create table df as select * from read_csv_auto('my_file.txt');
Error: Invalid Input Error: Error in file "my_file.txt" on line 50614: quote should be followed by end of value, end of row or another quote. (DELIMITER='|' (auto detected), QUOTE='"' (auto detected), ESCAPE='' (auto detected), HEADER=1 (auto detected), SAMPLE_SIZE=20480, IGNORE_ERRORS=0, ALL_VARCHAR=0)

Yup, you’re right DuckDB. That’s absolutely the line with the problem. Guess I’m stuck.

Python/Pandas

How about Python and pandas, somehow the king of modern data science and data frames.

1
2


>>> import pandas as pd
>>> df = pd.read_csv('my_file.txt', sep = '|', low_memory=False)

Hey, so far so good! This file get read without any messages, warnings, or errors.

1
2
3


>>> df
...
[238701 rows x 22 columns]

Uh oh. The data has only 238,701 rows, which is quite a bit less than 239,167 (well, 239,166 since this file does have a header row). This may not be a problem, because it’s possible that new lines exist in a text delimited file that is not a new record (if properly quoted). At least now that I have the data loaded, I can check for the sum of a column called “Amount”, because this is financial data. We can compare this to other methods later in addition to the row count so we can be sure the full data set was ready by pandas.

1
2


>>> sum(df.Amount)
196848446.45999622

R - readr

I am a full on tidyverse apologist. So of course I’m going to reach for readr::read_delim to get this file loaded.

1
2


> library(readr)
> df <- read_delim('my_file.txt', delim = '|', show_col_types = FALSE))

Awesome. Like pandas, readr had no messages, warnings, or errors. Let’s see how many rows there are and the sum of that amount column.

1
2
3
4


> df |> nrow()
[1] 238609
> sum(df$Amount)
[1] 196828725

Uh oh again. It seems that readr::read_delim also doesn’t reach 239,166 lines, but instead has only 238,609 lines. That’s almost 100 less than pandas, and the sum is off by almost $20,000. I don’t know at this stage if pandas is right, but it seems pretty likely that readr just silently gave up on some lines that it shouldn’t have.

R - data.table

Let’s try the package data.table which has a function fread to read delimited files.

1
2
3
4


> df <- fread("my_file.txt")
Warning message:
In fread("my_file.txt") :
  Found and resolved improper quoting out-of-sample. First healed line 50614: <<25-AUG-22|"SAN81803 EXPO® White Board CARE Dry Erase Surface Cleaner 8 oz Spray Bottle|54>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

That’s interesting! When using fread, I get a warning that points out the very line I mentioned above. It even prints the line (I removed most of it) and says that it resolved the issue. It also recommends that I might want to try and specify quote = "" if, like on my file, fields are not quoted. We’ll come back to that.

1
2


> nrow(df)
[1] 239166

Well, would you look at that? Exactly the amount of lines I’d expect if there is one header row and no new lines. Let’s check that amount column.

1
2


> sum(df$Amount)
[1] 196926161

That’s almost $80,000 more than pandas. That’s a lot of money to have missing.

Just for fun, instead of expecting fread to figure everything out about my file on its own, what if I follow its suggestion and tell fread that there is no quote character and no fields are quoted?

1
2
3
4
5


> df <- fread('my_file.txt',  quote = "")
> nrow(df)
[1] 239166
> sum(df$Amount)
[1] 196926161

By giving fread just a little information about the file, I get no warning about resolved lines, it just reads the file correctly with the same results.

So now, because of fread, I have some idea of what the problem was. Almost everything that reads a delimited file expects fields to be quoted, at least optionally, at least some of the time. They hate a quote at the start of a field that is actually in the raw data (understandably). Maybe if I tell these other tools that the quote character is nothing they’ll work better.

PostgreSQL

Let’s try PostgreSQL using COPY. Note, I’m not including the create table statement to avoid revealing more about the data.

1
2
3
4
5
6


jason=# \copy df from 'my_file.txt' CSV HEADER DELIMITER '|';
COPY 130426
jason=# select sum(amount) from df;
     sum
-------------
 97411927.64

Well, that’s not right. I should have nearly twice the number of lines. And sure enough, I’ve got only half the dollars. No warnings, no errors, no messages. Nothing. Silent failure.

Can I fix it?

Now that I know the issue is the quote character, let’s see if I can fix all the methods that failed to load this file.

readr

1
2
3
4
5


> df <- read_delim('my_file.txt', delim = '|', show_col_types = FALSE, quote = '')
> nrow(df)
[1] 239166
> sum(df$Amount)
[1] 196926161

pandas

1
2
3
4
5


>>> df = pd.read_csv('my_file.txt', sep = '|', low_memory=False, quoting=3)
>>> df
[239166 rows x 22 columns]
>>> sum(df.Amount)
196926160.57999647

duckdb

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


D create table df as select * from read_csv_auto('my_file.txt', quote='');
D select count(*) from df;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       239166 │
└──────────────┘
D select sum(amount) from df;
┌───────────────────┐
│    sum(amount)    │
│      double       │
├───────────────────┤
│ 196926160.5800201 │
└───────────────────┘

postgresql

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


jason=# \copy df from 'my_file.txt' CSV HEADER DELIMITER '|' QUOTE '';
ERROR:  COPY quote must be a single one-byte character
jason=# \copy df from 'my_file.txt' DELIMITER '|';
ERROR:  invalid input syntax for type integer: "col1"
CONTEXT:  COPY df, line 1, column col1: "col1"
jason=# \copy df from program 'tail -n +2 my_file.txt' DELIMITER '|';
COPY 239166
jason=# select sum(amount) from df;
     sum
--------------
 196926160.58
(1 row)

Would you look at that? Everyone does just fine when you tell them there is no quote character or quoted fields. ¹

All of these methods are essentially instant on my computer. Performance is not the issue. What’s incredible is that data.table::fread can identify file issues and resolve them. In this case, it turns out that data.table::fread was also able to describe the problem well enough that I could fix every other method of reading the file successfully. I will say, going back and reading the error for duckdb may have given me some hints, but `pandas, readr, and PostgreSQL completely failed to even notify me something was wrong. In an automated pipeline, I would have no indication that hundreds of rows, or in PostgreSQL’s case hundreds of thousands of rows, were just gone.

I was able to fix pandas, readr, duckdb, and PostgreSQL, but I have run into many scenarios where this is not the case. For example, what if I had this same row in a file that sometimes did quote certain fields. The fread function can handle this no problem, resolving the issue, warning me, and moving on. Every other method just wouldn’t work.

I don’t control this data. I don’t control producing it. It comes in on a scheduled basis, often hourly, and I need to accurately process it.

Silent failures are terrible.

The only way out is to embrace open, documented, binary data formats. But until all major RDBMS (and probably Excel) have native export and import, flat files will continue to be the de facto standard. In the meantime, it would be nice if PostgreSQL’s COPY, bcp, and spool could at least try and do things like quote fields and escape characters by default when targeting delimited files.

Some additional testing

I was asked to check vroom and arrow in R as well as polars via Explorer in Elixir. The results were, not good.

vroom

1
2
3
4
5


library(vroom)
> df <- vroom('my_file.txt', delim = '|')
Rows: 238609 Columns: 22
> sum(df$Amount)
[1] 196828725

arrow

1
2
3
4
5


> df <- arrow::read_delim_arrow('df.txt', delim = '|')
> sum(df$Amount)
[1] 196848446
> nrow(df)
[1] 238701

explorer / polars

1
2
3
4
5
6
7
8


iex(1)> Mix.install([
...(1)>   {:explorer, "~> 0.5.0"}
...(1)> ])
iex(2)> df = Explorer.DataFrame.from_csv(filename = "my_file.txt",
...(2)> delimiter: "|", infer_schema_length: nil)
{:error,
 {:polars,
  "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor  adding `OTHER` to the `null_values` list."}}

I wanted to show the pain of doing this in PostgreSQL. Only CSVs skip the first row with the HEADER option. But QUOTE can’t be set to blank/none/null. And using the default TEXT format, PostgreSQL can’t deal with the header row. So instead I had to use the PROGRAM option, which lets me run a bash script as the input and skip the first row, which then succeeds. ↩︎