RFC 4180 and the CSV Edge Cases That Break Every Parser

June 10, 2026·10 min read

A Format Nobody Owns

CSV is the format every system in your stack agrees to support and none of them implement the same way. It predates the web by about two decades, has no single authoritative specification, and is so superficially simple that engineers routinely write their own parser in an afternoon, ship it, and discover six months later that it has been silently corrupting one row in every ten thousand. The format's strength — that it is human-readable text any tool can write — is also why it never settled on a tight standard the way JSON did with RFC 8259.

RFC 4180, published in October 2005 by Yakov Shafranovich, is the closest thing CSV has to a written specification, and even the RFC is categorized as "Informational" rather than as a standard. The document is candid about the situation: various specifications and implementations of CSV exist, no formal specification does, and that leaves wide room for interpretation. The RFC is an attempt to write down the format that most implementations seem to follow. Twenty years later, that common subset is still not what most files in the wild actually look like.

This piece is a working tour of the edge cases: what RFC 4180 actually says, where real CSV files diverge from it, and the specific patterns that quietly break data on import. If you have ever stared at JSON output that looks fine until row 4,217, where suddenly half the columns are shifted, this is the territory.

What RFC 4180 Actually Says

The full RFC is fewer than ten pages and worth reading directly, but the load-bearing rules fit on one screen. Records are separated by CRLF. Fields are separated by commas. A header row is optional but, if present, is the first record. Fields containing commas, double quotes, or line breaks should be enclosed in double quotes. A double quote inside a quoted field is escaped by doubling it. The MIME type is text/csv and the file extension is .csv. That is essentially the whole specification.

The RFC's most important sentence is the one many implementations ignore: "Each record is located on a separate line, delimited by a line break (CRLF)." CRLF, not LF. Not "whatever the operating system uses." The RFC does not spell out its reasoning, but CRLF is the canonical line break for text media types, and a single mandated terminator removes ambiguity when files travel between Windows, Mac, and Linux systems. In practice, nearly every CSV reader you will use accepts LF as well, because the format would be unusable otherwise. That gap between the strict rule and lenient readers is one reason you still see CSV files that mix line endings in pathological ways.

The other underappreciated detail is that quoting is conditional, not universal. A field is quoted if it contains a comma, a double quote, or a line break. Otherwise it is bare. This means a single CSV file legitimately contains both Alice and "Smith, Alice" in the same column, and any parser that assumes "all values are quoted" or "no values are quoted" is wrong out of the gate.
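As a small illustration of conditional quoting, Python's standard csv module quotes only the fields that need it by default (its QUOTE_MINIMAL setting). The sample names below are made up, and this is a sketch rather than a statement about how any particular exporter behaves.

```python
import csv
import io

# Write one row with a bare field and a field that contains a comma.
buf = io.StringIO()
writer = csv.writer(buf)  # defaults: QUOTE_MINIMAL quoting, "\r\n" terminator
writer.writerow(["Alice", "Smith, Alice"])
line = buf.getvalue()

# Only the second field is quoted, and the record ends with CRLF.
print(repr(line))  # 'Alice,"Smith, Alice"\r\n'

# Reading it back recovers both values unchanged.
fields = next(csv.reader(io.StringIO(line, newline="")))
print(fields)  # ['Alice', 'Smith, Alice']
```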

The Edge Cases That Actually Bite

The trouble starts when you put RFC 4180 down and open the CSV file your finance team just sent you. Below is the working list of patterns that quietly break parsers, ordered roughly by how often they show up in the wild.

Embedded Newlines Inside Quoted Fields

RFC 4180 explicitly permits line breaks inside quoted fields. So a single logical row can span ten physical lines in the file if one of its fields is a multi-paragraph comment. The naive parser everyone writes first reads the file line-by-line, splits on commas, and produces garbage the moment it encounters one of these fields. The correct approach is to tokenize the file as a stream of characters, tracking quote state, and only emit a record when you see the terminator outside a quoted region.

This single issue is responsible for more silent CSV corruption than any other. Spreadsheet exports from Salesforce, Zendesk, and most CRMs contain long-form text fields, and the moment a user pastes a multi-line address or a copy-pasted email body into one of those fields, every line-based parser downstream of that export starts shifting columns.
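The difference between line-based and quote-aware parsing can be sketched in a few lines of Python. The sample data is invented, and csv.reader stands in here for any tokenizer that tracks quote state.

```python
import csv
import io

# Two logical records; the second has a line break inside a quoted field.
data = 'id,comment\r\n1,"line one\r\nline two"\r\n'

# Naive approach: split on line breaks first, then on commas.
naive = [line.split(",") for line in data.splitlines()]

# Quote-aware approach: csv.reader carries quote state across physical lines.
# newline="" keeps line breaks inside quoted fields untranslated.
proper = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive))   # 3 physical lines, with the comment split in two
print(len(proper))  # 2 logical records
print(proper[1])    # ['1', 'line one\r\nline two']
```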

The Doubled-Quote Escape

To put a literal double quote inside a quoted field, you write two double quotes in a row. The value She said "hello" becomes the literal CSV field "She said ""hello""". That triple quote at the end is correct: the first two form the escaped literal quote after hello, and the third closes the field. Parsers that treat backslash as an escape character (because they were written by someone whose mental model was JSON or C strings) fail here. RFC 4180 does not define a backslash escape at all.

The doubled-quote rule also interacts badly with regex-based parsers. A "parse CSV with a single regex" approach can be made to work in theory, but the regex required to handle quoted fields with embedded quotes, embedded commas, and embedded newlines is long enough that no one reading it will ever be confident it is correct. Use a real tokenizer.
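A quick round trip shows the doubled-quote rule in action. This sketch uses Python's csv module, whose default dialect escapes quotes by doubling them.

```python
import csv
import io

value = 'She said "hello"'

# Writing: the embedded quotes are doubled and the field is wrapped in quotes.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow([value])
encoded = buf.getvalue()
print(encoded)  # "She said ""hello"""

# Reading: each doubled quote collapses back to one literal quote.
decoded = next(csv.reader(io.StringIO(encoded)))[0]
print(decoded)  # She said "hello"
```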

The UTF-8 BOM

Excel, when saving a file as "CSV UTF-8," prepends a three-byte byte-order mark (EF BB BF) to the file. The BOM is harmless to a UTF-8-aware reader that knows to strip it, and a quiet disaster to one that does not. The most common symptom is that the first column header of your file silently becomes \ufeffname (the word name with an invisible U+FEFF character in front of it) instead of name, and every subsequent lookup against the column name returns nothing. If your JSON output has one mysteriously-named first key, the BOM is almost always why.
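In Python, one way to handle this is the utf-8-sig codec, which strips a leading BOM when present and behaves like plain UTF-8 otherwise. The bytes below are a made-up example of what such an export looks like.

```python
import csv
import io

# A UTF-8 file that starts with the three BOM bytes EF BB BF.
raw = b"\xef\xbb\xbfname,age\r\nAlice,30\r\n"

# Plain "utf-8" keeps the BOM as U+FEFF glued to the first header.
with_bom = next(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))

# "utf-8-sig" drops the leading BOM before the parser ever sees it.
without_bom = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline="")))

print(repr(with_bom[0]))     # '\ufeffname'
print(repr(without_bom[0]))  # 'name'
```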

Excel's Regional Delimiter Game

Microsoft Excel, on machines configured for German, French, Italian, or several other locales, exports CSV files using a semicolon as the delimiter rather than a comma. This is because the comma is the decimal separator in those locales, and Excel's logic is that semicolons avoid the ambiguity. The file is still called a .csv file, there are no magic bytes to tell the variants apart, and only when you open it do you discover that all your "comma-separated values" are actually semicolon-separated. Any tool that converts CSV without letting you choose the delimiter will regularly get European data wrong.

This is exactly why our CSV to JSON converter exposes a delimiter dropdown rather than hardcoding the comma. You can pick comma, semicolon, tab, or pipe, which covers the realistic range of files you will encounter. It is also why the in-browser tool runs entirely on your device — if your CSV contains anything sensitive, you do not want it round-tripping through a server just to be reformatted.
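In code, the equivalent of a delimiter dropdown is an explicit delimiter argument. A minimal sketch with Python's csv module, using invented sample data with decimal commas:

```python
import csv
import io

# A semicolon-delimited export where the comma is the decimal separator.
data = "product;price\r\nWidget;12,50\r\n"

# Parsed with the default comma, the price is torn in half.
as_comma = list(csv.reader(io.StringIO(data, newline="")))

# Parsed with the right delimiter, each field survives intact.
as_semicolon = list(csv.reader(io.StringIO(data, newline=""), delimiter=";"))

print(as_comma[1])      # ['Widget;12', '50']
print(as_semicolon[1])  # ['Widget', '12,50']
```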

Tab-Separated Values Pretending to Be CSV

TSV files are sometimes saved with a .csv extension because the user typed it in the Save dialog without thinking. The file is structurally fine, but every parser that assumed comma-delimited will produce a single giant column. The fix is delimiter sniffing: Python's csv.Sniffer, for example, analyzes a sample of the file that you pass it and guesses the most likely delimiter. Sniffing is a heuristic, though, and it can fail on files where the sampled rows happen to contain commas but the actual delimiter is a tab or a pipe.
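A hedged sketch of sniffing with a fallback follows. The helper name guess_delimiter and the candidate list are choices made for this example, not part of the csv module; restricting the candidates tends to make the guess more predictable, though it remains a guess.

```python
import csv

def guess_delimiter(sample: str, default: str = ",") -> str:
    """Guess the delimiter from a text sample, falling back to a default."""
    try:
        # Only consider the delimiters we realistically expect to see.
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        # The sniffer could not decide; let the caller's default win.
        return default

tsv_sample = "name\tage\tcity\nAlice\t30\tParis\nBob\t25\tRome\n"
print(repr(guess_delimiter(tsv_sample)))  # '\t'
```

When the guess matters, treat it as a suggestion to confirm with the user or against a known column count rather than as a fact about the file.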

Inconsistent Quoting

Real CSV files often quote some fields and not others, in patterns that look arbitrary. The cause is usually that the exporter quotes only fields that need it (per RFC 4180), and most fields do not. Some exporters take the opposite approach and quote every field unconditionally, which is also valid. The pathological case is exporters that quote inconsistently within the same file — usually because the export logic has been edited by different people over time. A robust parser treats both quoted and unquoted fields as equally valid inputs.

Trailing Commas, Trailing Newlines, and Empty Rows

RFC 4180 allows the last record to end with or without a line break, and most parsers tolerate either. Two trailing newlines often produce a spurious empty row in the output, which then becomes a JSON object full of empty strings. Trailing commas in a row (Alice,30,) produce an empty final field; the RFC says the last field must not be followed by a comma, but whether that empty field is meaningful or a typo is impossible to answer without context. Empty rows in the middle of the file usually indicate that someone hand-edited the CSV in a text editor and pressed Enter accidentally.
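These cases are easy to see with Python's csv.reader, which yields an empty list for each blank line. The sketch below drops blank rows and flags rows whose field count differs from the header; the sample data is invented.

```python
import csv
import io

# A trailing comma on the Alice row, a blank line mid-file, extra blank lines at the end.
data = "name,age\nAlice,30,\n\nBob,25\n\n\n"

rows = list(csv.reader(io.StringIO(data, newline="")))
print(rows)  # [['name', 'age'], ['Alice', '30', ''], [], ['Bob', '25'], [], []]

# Drop blank rows, then compare every remaining row against the header width.
non_empty = [row for row in rows if row]
header = non_empty[0]
mismatched = [row for row in non_empty[1:] if len(row) != len(header)]
print(mismatched)  # [['Alice', '30', '']]
```

Whether to reject, trim, or keep the mismatched row is a decision for the calling code, since the parser cannot know if the empty field was intended.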

Mixed Line Endings

A file produced on a Windows machine, opened on a Mac, edited, and re-saved frequently ends up with CRLF on the original rows and LF on the edited ones. A strict CRLF-only parser will merge each LF-terminated row into the record that follows it; a strict LF-only parser will leave a stray CR character at the end of the last field of every original row. The right behavior is to accept any of CRLF, LF, or CR as a record terminator and normalize them.
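As an illustration, the sketch below feeds a file with all three terminators to a strict LF-only split and to Python's csv.reader. Opening the text with newline="" lets the reader see the original terminators and handle them itself; the sample data is made up.

```python
import csv
import io

# One file, three different record terminators: CRLF, LF, and bare CR.
data = "id,name\r\n1,Alice\n2,Bob\r3,Carol\r\n"

# Strict LF-only splitting leaves stray CRs and merges the CR-terminated row.
lf_only = [line.split(",") for line in data.split("\n") if line]
print(lf_only)  # [['id', 'name\r'], ['1', 'Alice'], ['2', 'Bob\r3', 'Carol\r']]

# csv.reader over a newline="" stream treats all three as terminators.
tolerant = list(csv.reader(io.StringIO(data, newline="")))
print(tolerant)  # [['id', 'name'], ['1', 'Alice'], ['2', 'Bob'], ['3', 'Carol']]
```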

Number-Like Strings That Are Not Numbers

CSV has no type system. Every field is text. When you convert CSV to JSON, the question of whether "007" is the string "007" or the number 7 is yours to answer, and the answer is almost always "leave it as a string." Phone numbers, ZIP codes, ISBNs, account numbers, and product SKUs all look numeric but are not. The classic data-corruption story is Excel auto-converting a column of gene names, including SEPT1 and MARCH1, into dates, which has been a documented headache in the bioinformatics literature since at least 2004.
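A short sketch of the "leave it as a string" policy when converting to JSON. The column names and values are invented; csv.DictReader returns every value as a string, and the example simply declines to coerce them.

```python
import csv
import io
import json

data = "agent,zip,gene\r\n007,02134,SEPT1\r\n"

# DictReader performs no type coercion: every value stays text.
records = list(csv.DictReader(io.StringIO(data, newline="")))
print(json.dumps(records))
# [{"agent": "007", "zip": "02134", "gene": "SEPT1"}]

# Coercing "helpfully" destroys information that cannot be recovered later.
print(int(records[0]["zip"]))  # 2134
```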

Encoding Confusion

A CSV file labeled UTF-8 that is actually Windows-1252 will mostly work right up until the first em-dash or curly quote, at which point you get mojibake or a decoding error. Files exported from older Windows systems are often Windows-1252; files from Mac systems older than about 2002 are sometimes MacRoman; files from Asian-language systems can be Shift-JIS, GB2312, or Big5. The HTTP Content-Type header is unreliable for files passed around as attachments. No detection method is fully reliable, because single-byte encodings such as Windows-1252 will decode almost any byte sequence without complaint. The practical approach is to test-decode strictly against a few candidate encodings, trying UTF-8 first because invalid UTF-8 fails loudly, and then to check the result for mojibake.
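The test-decode idea can be sketched as below. The function name decode_best_effort and the candidate order are assumptions made for this example; a real pipeline would choose candidates based on where its files come from.

```python
def decode_best_effort(raw, candidates=("utf-8-sig", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes strictly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
    # latin-1 maps every byte to a character, so it never fails (but may be wrong).
    return raw.decode("latin-1"), "latin-1"

# A Windows-1252 file containing an em dash (byte 0x97), which is invalid UTF-8.
raw = b"name,note\r\nAlice,yes \x97 maybe\r\n"
text, encoding = decode_best_effort(raw)
print(encoding)  # cp1252
```

The order matters: UTF-8 goes first because it rejects most non-UTF-8 input, while the single-byte encodings accept nearly anything and so can only serve as fallbacks.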

What a Defensive Parser Actually Does

Putting the edge cases together, a CSV parser you can trust does roughly the following. It reads bytes, not lines, and detects and strips the UTF-8 BOM if present. It accepts CRLF, LF, or CR as a record terminator and treats them uniformly. It maintains an explicit quote-state flag and only emits a record when it sees a terminator outside a quoted region. It treats doubled quotes inside quoted fields as a single literal quote. It does not impose its own type coercion; every output value is a string, and the calling code decides whether "2026-01-15" is a date or just a hyphenated number sequence.
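The checklist above can be sketched as a small character-level tokenizer. This is an illustrative implementation under stated assumptions, not a replacement for a tested library: the name parse_csv is made up, the input is assumed to be UTF-8, blank lines are skipped, and a quote appearing mid-field is leniently treated as opening a quoted region.

```python
def parse_csv(raw: bytes, delimiter: str = ",") -> list[list[str]]:
    """Tokenize CSV bytes into records of strings, with no type coercion."""
    text = raw.decode("utf-8-sig")  # strips a leading UTF-8 BOM if present
    records, record, field = [], [], []
    in_quotes = False
    started = False  # True once the current line has shown any field content
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # doubled quote means one literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # delimiters and line breaks are data here
        elif ch == '"':
            in_quotes = True
            started = True
        elif ch == delimiter:
            record.append("".join(field))
            field = []
            started = True
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as a single terminator
            if started or field:  # skip completely blank lines
                record.append("".join(field))
                records.append(record)
            record, field, started = [], [], False
        else:
            field.append(ch)
        i += 1
    if started or field:  # final record without a trailing terminator
        record.append("".join(field))
        records.append(record)
    return records

sample = b'\xef\xbb\xbfname,note\r\nAlice,"said ""hi"",\nthen left"\nBob,\r'
parsed = parse_csv(sample)
print(parsed)
# [['name', 'note'], ['Alice', 'said "hi",\nthen left'], ['Bob', '']]
```

The sample exercises the BOM, all three terminators, a doubled quote, and a quoted field containing both a comma and a line break.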

For interactive work, that level of rigor is usually overkill. For clean, well-formed CSV from a known source, a quick split-on-delimiter pass produces correct output in five lines of code. For arbitrary CSV from the wider world, especially data that came out of a spreadsheet someone hand-edited, the defensive parser is the only path that will not eventually corrupt a row you did not expect.
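For contrast, here is roughly what the quick split-on-delimiter pass looks like. It is only safe when the source is known to contain no quoted fields, embedded delimiters, or embedded line breaks; the function name quick_split is made up for this sketch.

```python
def quick_split(text: str, delimiter: str = ",") -> list[dict[str, str]]:
    """Split clean, unquoted CSV text into one dict per data row."""
    lines = [line for line in text.splitlines() if line]
    header = lines[0].split(delimiter)
    return [dict(zip(header, line.split(delimiter))) for line in lines[1:]]

people = quick_split("name,age\nAlice,30\nBob,25\n")
print(people)  # [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]
```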

The Operational Lesson

The deeper point about CSV is that it is a format defined by Postel's law in the worst possible way. Be conservative in what you send, liberal in what you accept. Every exporter is conservative in some direction the next importer is not expecting, and every importer is liberal in ways that swallow real errors instead of surfacing them. The result is a format that almost works, almost everywhere, and silently fails in ways that take weeks to diagnose.

The practical posture is to never trust a CSV file you have not validated. Open it in a text editor and look at the first ten lines before you trust your parser's output. Count the columns in the header and spot-check a few rows in the middle and at the end. If you are converting CSV to JSON for downstream processing, scan the output for empty objects, mysteriously-named keys (the BOM), and rows where the column count diverges from the header count. Those three checks catch most of the failures discussed here in under a minute.
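Those three checks are simple enough to automate. The sketch below assumes the file has already been parsed into a list of rows, each a list of strings; the function name audit_rows and the message wording are invented for this example.

```python
def audit_rows(rows: list[list[str]]) -> list[str]:
    """Return human-readable warnings for BOM keys, empty rows, and width mismatches."""
    if not rows:
        return ["file has no rows"]
    problems = []
    header = rows[0]
    if header and header[0].startswith("\ufeff"):
        problems.append("first header starts with a BOM")
    # Data rows are numbered from 2 so the numbers match a text editor's view
    # (only approximately, if quoted fields span several physical lines).
    for number, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"row {number} is empty")
        elif len(row) != len(header):
            problems.append(
                f"row {number} has {len(row)} fields, header has {len(header)}"
            )
    return problems

report = audit_rows([["\ufeffname", "age"], ["Alice", "30"], [], ["Bob", "25", "x"]])
for message in report:
    print(message)
# first header starts with a BOM
# row 3 is empty
# row 4 has 3 fields, header has 2
```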

RFC 4180 is short, it is twenty years old, and it is still the best reference we have. Read it once and keep a copy nearby. The edge cases will keep coming — vendors keep inventing new ways to misexport data — but the underlying logic of "tokenize first, split second, type-coerce last" remains the right shape. The CSV file on your desk almost certainly does not follow RFC 4180 exactly. Knowing how it deviates is the whole game.
