Bracing against the wind  

Thursday, February 10, 2011

Approximate Line Count for Very Large Files

When dealing with very large files, the Unix tool "wc" can be extremely slow, since it has to read the entire file. The alternative, looking at the file size in bytes, is often not what I want, especially when I'm trying to estimate the number of reads in a FASTQ file.

A good estimate (two significant figures) is what I need 90% of the time.

alc is my "approximate line count" tool. It reports the number of lines in a file, just like wc, except that it only samples the file in a series of segments. By seeking to a dozen places in the file and reading 200K from each, rather than reading the whole thing, I get a representative sample and an accurate-enough count.
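To make the sampling idea concrete, here is a minimal Python sketch of the approach. It is not the actual alc source: the function name approx_line_count, the evenly spaced sample offsets, and the exact-count fallback for small files are all my assumptions for illustration. Only the "a dozen segments of 200K" defaults come from the description above. The estimate is the newline density of the sampled bytes scaled up to the full file size.

```python
import os


def approx_line_count(path, samples=12, chunk_size=200 * 1024):
    """Estimate the number of lines in a file by sampling segments of it.

    Hypothetical helper, not the real alc code. Assumes lines are
    roughly uniform in length across the file.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0

    with open(path, "rb") as f:
        # Small file: sampling would cover all of it anyway, so count exactly.
        if size <= samples * chunk_size:
            return sum(block.count(b"\n")
                       for block in iter(lambda: f.read(1 << 20), b""))

        # Spread the sample offsets evenly from the start of the file
        # to the last position where a full chunk still fits.
        stride = (size - chunk_size) // max(samples - 1, 1)
        newlines = 0
        bytes_read = 0
        for i in range(samples):
            f.seek(i * stride)
            chunk = f.read(chunk_size)
            newlines += chunk.count(b"\n")
            bytes_read += len(chunk)

    # Scale the newline density of the sample up to the whole file.
    return round(newlines / bytes_read * size)
```

For a FASTQ file in the usual four-lines-per-record layout, dividing the result by 4 gives an approximate read count. The estimate is only as good as the assumption that line lengths are similar throughout the file; a file whose records change length drastically partway through would need more samples.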

