Reading and Parsing Large CSVs in Ruby

By Travis 2024-05-28

We work with a lot of CSVs at Image Relay. They serve as the "love language" of Product Information Management (PIM) because they're a simple, efficient way to move data around. As software engineers, we often turn our noses up at the format, but to the rest of the world CSVs are a powerful tool for exporting/importing and managing data.

Ruby has CSV support in the standard library, and for the most part reading, writing, and working with CSVs using it is fine. There are other solutions out there, but typically require 'csv' is all you need.

There are two primary methods for reading CSVs in the standard library:

# From a file: all at once
arr_of_rows = CSV.read("path/to/file.csv", **options)

# iterator-style:
CSV.foreach("path/to/file.csv", **options) do |row|
  # ...
end

These examples are lifted right from the standard library docs, and you'll notice the two comments: "From a file: all at once" and "iterator-style".

Those comments are deceptively important when you're working with large files, and it's a two-part problem. CSV.read first reads the entire file into memory and then calls CSV.parse on it to convert it into an array of rows. As the file size increases, memory usage goes up once for the read and then again for the parsing.
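Conceptually, that's roughly the same cost as doing both steps yourself (a simplification for illustration, not the actual stdlib source):

# A sketch of the two-part cost described above, not the real CSV.read source:
# the whole file sits in memory as a String, then the full array of parsed
# rows is built on top of it.
require 'csv'

data = File.read("path/to/file.csv")  # step one: entire file as one String
rows = CSV.parse(data)                # step two: entire array of rows as well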

CSV.foreach (the iterator style) reads the file line by line, only ever parsing one line at a time, which keeps the memory footprint much smaller and, on large files, is typically much faster.
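The practical win comes from doing your work per row and letting each row go as you stream. As a rough sketch (the headers: true option is real, but the "amount" column is made up for illustration and isn't in the example file used below):

# csv-foreach-sum.rb -- a sketch of per-row processing with a streaming read;
# "amount" is a hypothetical column, not one from the example file below
require 'csv'

total = 0
CSV.foreach('path/to/file.csv', headers: true) do |row|
  total += row['amount'].to_f  # each row is parsed, used, then discarded
end
puts total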

Highlighting Memory Usage

Reading any large file into memory has a cost, so let's demonstrate that. For this work I'll be using a CSV file from the New Zealand government's large datasets page, particularly this one. It's a good example because it's real-world data and has a little over 100,000 rows.

Note: I'm using macOS and Ruby 3.3; you may need to tweak the *nix bits if you're not on macOS.

Let's check the filesize and linecount of our example file:

> ls -lh example.csv
-rw-r--r--@ 1 travis  staff    22M May 28 19:35 example.csv

> wc -l example.csv
101997 example.csv

So it's 22 megs and 101,997 lines.

First off, let's just see what reading the file into memory costs. Here's the script:

# file-read.rb
csv = File.read('example.csv')

We'll run it and grab the maximum resident set size (the peak memory used) in MB using /usr/bin/time and some grep | awk magic.

> /usr/bin/time -l ruby file-read.rb 2>&1 | grep "maximum resident set size" | awk '{print $1/(1024*1024) " MB"}'
32.3867 MB

So there's already some overhead (and probably a little measurement fudging): our file is 22MB, and just reading it in puts us at 32MB of memory usage.
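If you'd rather check this from inside Ruby instead of wrapping the whole run in /usr/bin/time, you can shell out to ps for the current process's resident set size; a quick sketch (rss-check.rb and the rss_mb helper are just mine for illustration):

# rss-check.rb -- a sketch: ask ps for this process's resident set size
# (reported in KB) and print it in MB after reading the file
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

csv = File.read('example.csv')
puts "#{rss_mb} MB resident after reading #{csv.bytesize / (1024 * 1024)} MB of CSV"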


Now let's read it with CSV.read:

# csv-read.rb
require 'csv'

CSV.read('example.csv').each do |row|
  row
end

> /usr/bin/time -l ruby csv-read.rb 2>&1 | grep "maximum resident set size" | awk '{print $1/(1024*1024) " MB"}'
106.41 MB

Ouch: reading and parsing it with CSV.read uses more than three times the memory of the plain File.read.


Just for fun, let's make another script and read it into a string and then parse it with CSV.parse:

# csv-parse.rb
require 'csv'

csv = File.read('example.csv')

CSV.parse(csv).each do |row|
  row
end

> /usr/bin/time -l ruby csv-parse.rb 2>&1 | grep "maximum resident set size" | awk '{print $1/(1024*1024) " MB"}'
151.621 MB

Even worse... that's because CSV.parse actually calls CSV.new with the string, and the first line of initialize does this:

def initialize(data, ...)
  @io = data.is_a?(String) ? StringIO.new(data) : data
  # ...
end

It builds a StringIO object out of the string data – more memory.
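One way to avoid that extra copy is to hand CSV an IO instead of a String, so it pulls from the file as it parses rather than duplicating everything into a StringIO first. A rough sketch (and more or less what CSV.foreach, coming up next, arranges for you):

# csv-io.rb -- a sketch: give CSV the File object directly so no
# intermediate String (or StringIO wrapper) is ever built
require 'csv'

File.open('example.csv') do |io|
  CSV.new(io).each do |row|
    row
  end
end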


Okay, let's use the iterator style, line-by-line.

# csv-foreach.rb
require 'csv'

CSV.foreach('example.csv') do |row|
  row
end

> /usr/bin/time -l ruby csv-foreach.rb 2>&1 | grep "maximum resident set size" | awk '{print $1/(1024*1024) " MB"}'
13.5273 MB

Nice! That's less memory than the file itself takes on disk (remember, it's 22MB), because we're going line by line instead of parsing everything before we iterate. To be clear, we're not doing anything with these rows. If you build heavy objects and dump them into an array on each iteration, your memory will still climb as you go, but at least you're not starting out with the whole parsed file in memory.
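For example, something like this still streams the parse but keeps every row alive anyway, so memory climbs with each iteration (Record is just a stand-in for whatever object you'd build per row):

# A sketch of the trap: line-by-line parsing, but everything is retained
require 'csv'

Record = Struct.new(:row)  # stand-in for whatever heavy object you build

records = []
CSV.foreach('example.csv') do |row|
  records << Record.new(row)  # every row stays reachable, so memory grows
end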

Speed Comparison

Now that we have our implementations, let's run them through time and just compare the speeds.

> /usr/bin/time -p ruby csv-read.rb 2>&1 | grep real
real 1.89

> /usr/bin/time -p ruby csv-parse.rb 2>&1 | grep real
real 1.42

> /usr/bin/time -p ruby csv-foreach.rb 2>&1 | grep real
real 1.53

With speed, the results are a little more interesting for our scripts and the 100,000-line CSV: CSV.foreach still outperforms CSV.read, but is actually a little slower than CSV.parse.
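If you'd rather compare the three in-process, the standard library's Benchmark module works too; a rough sketch along the same lines as the scripts above:

# csv-benchmark.rb -- a sketch comparing the three approaches in one process
require 'csv'
require 'benchmark'

Benchmark.bm(12) do |x|
  x.report('CSV.read')    { CSV.read('example.csv').each { |row| row } }
  x.report('CSV.parse')   { CSV.parse(File.read('example.csv')).each { |row| row } }
  x.report('CSV.foreach') { CSV.foreach('example.csv') { |row| row } }
end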

More Rows More Problems

Just for fun, let's take a look at a much larger file from the collection mentioned above. Once unzipped, the file is 139MB with 6,457,054 rows.

I've put together a little script that uses ttyplot to plot the memory usage of the newest pid matching the process name "csv-profiling". It will wait for data before it plots anything. Here's the script:

#!/bin/bash

# Poll once a second for the newest pid named "csv-profiling" and emit
# its resident set size in MB, piping the stream into ttyplot.
while :
do
  pid=$(pgrep -n csv-profiling)

  if [ -n "$pid" ]
  then
    # ps reports rss in KB; convert to MB
    ps -o rss= -p "$pid" | awk '{print int($1 / 1024)}'
  fi
  sleep 1
done | ttyplot -u Mb -t "CSV.foreach vs CSV.read Memory Usage"
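For pgrep to find anything, the Ruby scripts have to show up under that name. One way to arrange that (an assumption about the setup here, not something the scripts above do on their own) is to set the process title at the top of each script; depending on your platform you may need pgrep -f rather than pgrep -n for the match:

# Added to the top of csv-foreach.rb / csv-read.rb for profiling runs
# (hypothetical wiring; `pgrep -f csv-profiling` may be needed to match)
Process.setproctitle('csv-profiling')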

I'll start the script, run our csv-foreach.rb script to get a memory baseline, and then run csv-read.rb to see how memory usage grows as the CSV is parsed. Here's the outcome:

So, to sum that up if you don't have the time to watch it all:

  • CSV.foreach used ~12MB of memory and stayed flat. It took 33 seconds to complete.
  • CSV.read's memory usage grew and reached a max of ~2,652MB. It took 5 minutes and 18 seconds to complete.

Looking back at our original speed comparison with the 100,000-row CSV, it's clearer now that the speed benefit of the line-by-line approach grows with the size of the file. The GC probably has a lot to do with this, but for now we'll leave it at that.

Conclusion

There are a few different ways to parse CSVs in Ruby, but they're not equivalent, and each implementation has its tradeoffs. There's usually not much concern when you're working with smaller files, but with bigger files the memory cost of the all-at-once approaches climbs quickly, and the line-by-line approach brings big speed benefits too.

This doesn't mean that CSV.foreach is right for everything, but it's probably the right choice if you have big CSVs with lots of lines or columns.

We hope that helped clear some things up about the CSV standard library. Look for our next post on writing and sending CSVs efficiently!