Why are columns shifted after extraction?

Many PDFs position text precisely rather than defining true columns. Extractors infer columns from spacing, which can vary across pages.

How can I tell if my PDF is scanned?

If you can't select text in the PDF (or selection highlights whole blocks strangely), it may be a scan. Scanned PDFs usually need OCR before table extraction can work.

Why do multi-line descriptions break rows?

PDFs don't store rows. A wrapped description is just multiple text fragments at different Y positions. Extractors may interpret those fragments as separate rows unless they can reliably group them.

Is exporting to XLSX more accurate than CSV?

The output format doesn't fix extraction errors. Whether you export CSV or XLSX, the key step is correctly interpreting the PDF's layout and validating the extracted rows.

PDF Tables: Why Extraction Fails (and How to Fix It)

If you've ever exported a "perfect-looking" PDF table and ended up with shuffled columns, duplicate headers, or missing rows, you're not alone. The frustrating part is that the PDF often looks immaculate on screen.

The root cause is simple: PDF is a presentation format, not a data format. A PDF stores instructions like "draw this text at X/Y," not "this is a row with 5 columns." Table extractors have to infer structure from spacing and alignment, and that inference can break in subtle ways.

This guide explains the most common failure modes and gives you a practical troubleshooting path. If you're converting bank statements specifically, pair this with the safe conversion workflow.

PDF isn't a spreadsheet

A spreadsheet stores a grid: rows, columns, and cell boundaries are explicit. A PDF stores drawing operations: text fragments, lines, and shapes placed at precise coordinates.

That difference explains why PDF conversion can produce:

columns that drift or merge together,
descriptions split into separate rows,
amounts separated from their dates,
headers repeated as if they were transactions.

Two kinds of PDFs: text vs scan

Most extraction problems become easier to diagnose once you figure out which type of PDF you have:

Text-based PDFs: you can select text, copy/paste, and search inside the document. Extraction is about layout inference.
Scanned PDFs: the "text" is actually pixels in an image. Extraction requires OCR first, and accuracy depends on scan quality.

If you suspect a scan, run OCR first (for example, with OCR to Text), then extract tables from the OCR result.
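If you already pull per-page text out of your PDFs with a library, a rough programmatic version of the "can I select text?" test is to check how many pages come back nearly empty. The sketch below assumes you pass in one extracted string per page; the function name and both thresholds are illustrative guesses, not established values.

```python
from typing import List


def likely_scanned(page_texts: List[str], min_chars: int = 20, threshold: float = 0.8) -> bool:
    """Guess whether a PDF is a scan from its per-page extracted text.

    page_texts: one string per page, produced by whatever extractor you use.
    min_chars:  pages with fewer non-whitespace characters count as "empty"
                (assumed cutoff).
    threshold:  fraction of empty pages needed to call the file a scan
                (assumed cutoff).
    """
    if not page_texts:
        return False  # nothing to judge
    empty_pages = sum(
        1 for text in page_texts if len("".join(text.split())) < min_chars
    )
    return empty_pages / len(page_texts) >= threshold
```

A file that fails this check can still contain bad or partial text layers, so treat the result as a hint about where to start, not as proof.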

Common failure modes

Layout-driven failures (no real columns)

Many statements "look" like a grid, but the PDF might be built from separate text chunks that only happen to align visually. If the bank's generator nudges spacing slightly on each page (or each row), your extractor's column detection can shift.

Typical symptoms:

amounts appear under the wrong header,
the "Description" column swallows adjacent fields,
the first page extracts fine, later pages drift.
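To see why drift happens, it helps to look at a toy version of column inference. The sketch below is not how any particular extractor works; it assumes each text fragment is a hypothetical (x position, text) pair and that columns are separated by gaps wider than an assumed min_gap.

```python
from typing import List, Tuple

Fragment = Tuple[float, str]  # (left-edge x position, text); hypothetical shape


def infer_column_starts(rows: List[List[Fragment]], min_gap: float = 15.0) -> List[float]:
    """Return one x position per inferred column.

    A new column starts wherever the gap between consecutive sorted
    x positions exceeds min_gap (an assumed tolerance, in PDF points).
    """
    xs = sorted(x for row in rows for x, _ in row)
    starts: List[float] = []
    previous = None
    for x in xs:
        if previous is None or x - previous > min_gap:
            starts.append(x)
        previous = x
    return starts


def assign_columns(row: List[Fragment], starts: List[float]) -> List[str]:
    """Put each fragment into the column whose start is nearest to it."""
    cells = [""] * len(starts)
    for x, text in sorted(row):
        idx = min(range(len(starts)), key=lambda i: abs(starts[i] - x))
        cells[idx] = (cells[idx] + " " + text).strip()
    return cells


page1 = [
    [(50.0, "01/01"), (120.0, "GROCERY"), (398.0, "-20.00")],
    [(50.0, "01/02"), (121.5, "RENT"), (400.0, "-900.00")],
]
# Same table on page 2, but the generator shifted everything 40 points right.
page2_row = [(90.0, "01/03"), (160.0, "COFFEE SHOP"), (440.0, "-4.50")]

starts_page1 = infer_column_starts(page1)
drifted = assign_columns(page2_row, starts_page1)  # reuses page 1 columns
fixed = assign_columns(page2_row, infer_column_starts([page2_row]))  # per-page columns
```

With page 1's columns reused, the page 2 date lands in the description cell, which is the "swallowing" symptom above. Inferring columns per page recovers three clean cells in this example, at the cost of needing enough rows on each page to infer from.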

Wrapping and multi-line cells

Transaction descriptions often wrap (merchant name + location + reference). In a spreadsheet, wrapping is still one cell. In a PDF, a wrapped cell is usually multiple text fragments at different Y positions. Extractors must guess which fragments belong together.

Typical symptoms:

one transaction becomes two or three rows,
a continuation line loses its date/amount,
running balances detach from their rows.
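If your extractor has already split wrapped descriptions into extra rows, one common repair is to fold "continuation" rows back into the row above. This is a minimal sketch under two assumptions: rows are dicts with hypothetical date, description, and amount keys, and a continuation row is one that has a description but no date and no amount.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_wrapped_rows(rows: List[Row]) -> List[Row]:
    """Append description-only rows to the previous row's description."""
    merged: List[Row] = []
    for row in rows:
        is_continuation = bool(
            not row.get("date", "").strip()
            and not row.get("amount", "").strip()
            and row.get("description", "").strip()
        )
        if is_continuation and merged:
            merged[-1]["description"] += " " + row["description"].strip()
        else:
            merged.append(dict(row))  # copy so the input list is left untouched
    return merged


extracted = [
    {"date": "01/03", "description": "ACME STORE", "amount": "-12.00"},
    {"date": "", "description": "SPRINGFIELD REF 123", "amount": ""},
    {"date": "01/04", "description": "SALARY", "amount": "1000.00"},
]
repaired = merge_wrapped_rows(extracted)
```

This rule breaks on layouts where the first line of a description sits above the line carrying the date and amount, so check a page by eye before applying it to a whole file.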

Scans and weak OCR

If the page is an image, OCR must decide where characters and words are. Small OCR errors can have big consequences for financial data:

"0" vs "O" in account references,
decimal separators misread or dropped,
minus signs missed (turning debits into credits),
dates misread (01/07 vs 07/01 depending on locale).

If you must OCR, validate more aggressively, and follow up with cleanup in a tool that can enforce a schema, such as Statement Converter.
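One way to "validate more aggressively" is to parse amounts strictly and reject anything that does not match the expected shape, instead of letting a lenient parser guess. The sketch below assumes amounts look like 1,234.56 with exactly two decimals and an optional leading or trailing minus; other locales need a different pattern, and parse_amount is a hypothetical helper name.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed shape: optional sign, digits with optional thousands commas,
# a mandatory ".dd" fraction, and an optional trailing minus ("45.00-").
_AMOUNT = re.compile(
    r"^(?P<sign>[-+]?)(?P<int>\d{1,3}(?:,\d{3})*|\d+)\.(?P<frac>\d{2})(?P<trail>-?)$"
)


def parse_amount(raw: str) -> Optional[Decimal]:
    """Return a Decimal, or None when the text does not look like an amount."""
    text = raw.strip().replace("\u2212", "-")  # normalize the Unicode minus sign
    match = _AMOUNT.match(text)
    if match is None:
        return None  # e.g. "1O.50" (letter O) or "1234" (dropped decimals)
    if match.group("sign") and match.group("trail"):
        return None  # a sign at both ends is ambiguous
    value = Decimal(match.group("int").replace(",", "") + "." + match.group("frac"))
    negative = match.group("sign") == "-" or match.group("trail") == "-"
    return -value if negative else value
```

Rows where parse_amount returns None are the ones worth checking against the PDF by hand. A strict parser cannot catch a minus sign that OCR dropped entirely; that kind of error only shows up in balance checks.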

Repeated headers and page breaks

Many statements repeat the header row on each page. A human knows it's a header. An extractor may treat it as a data row unless it detects patterns.

Page breaks can also split a single transaction across pages (especially with wrapped descriptions), which creates "phantom rows."
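Repeated headers are the easier of these two problems to clean up after the fact. A minimal sketch, assuming the first extracted row is the real header and that repeats differ only in case or spacing:

```python
from typing import List, Tuple


def _key(row: List[str]) -> Tuple[str, ...]:
    """Normalize a row for comparison: collapse whitespace and lowercase."""
    return tuple(" ".join(cell.split()).lower() for cell in row)


def drop_repeated_headers(rows: List[List[str]]) -> List[List[str]]:
    """Keep the first row as the header and drop later rows that match it."""
    if not rows:
        return []
    header = _key(rows[0])
    return [rows[0]] + [row for row in rows[1:] if _key(row) != header]


table = [
    ["Date", "Description", "Amount"],
    ["01/01", "GROCERY", "-20.00"],
    ["DATE ", " Description", "Amount"],  # header repeated at the top of page 2
    ["01/02", "RENT", "-900.00"],
]
cleaned = drop_repeated_headers(table)
```

If OCR garbles a repeated header (for example "Arnount"), an exact match like this will miss it, and you may need a looser comparison.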

Signs your extraction is untrustworthy

Amounts appear in the description column or vice versa.
Dates suddenly change format mid-file.
Many rows have empty amounts or empty dates.
The same header line appears as a "transaction" multiple times.
Row count is far off from what you see in the PDF.
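Several of these signs can be counted automatically. The following sketch assumes rows are dicts with hypothetical date, description, and amount keys holding raw strings, and that amounts look like -1,234.56; it only produces counts, and deciding what level is acceptable is left to you.

```python
import re
from collections import Counter
from typing import Dict, List

AMOUNT_LIKE = re.compile(r"^[-+]?[\d,]+\.\d{2}$")  # assumed amount shape


def date_shape(value: str) -> str:
    """Replace digits with 9 so 01/02/2024 and 2024-01-02 get different shapes."""
    return re.sub(r"\d", "9", value.strip())


def extraction_report(rows: List[Dict[str, str]]) -> Dict[str, object]:
    """Count the warning signs listed above for a list of extracted rows."""
    dates = [row.get("date", "").strip() for row in rows]
    amounts = [row.get("amount", "").strip() for row in rows]
    descriptions = [row.get("description", "").strip() for row in rows]
    return {
        "rows": len(rows),
        "empty_dates": sum(1 for d in dates if not d),
        "empty_amounts": sum(1 for a in amounts if not a),
        # More than one shape means the date format changes mid-file.
        "date_shapes": dict(Counter(date_shape(d) for d in dates if d)),
        "amount_in_description": sum(1 for t in descriptions if AMOUNT_LIKE.match(t)),
    }


sample = [
    {"date": "01/02/2024", "description": "GROCERY", "amount": "-20.00"},
    {"date": "2024-01-03", "description": "RENT", "amount": "-900.00"},
    {"date": "", "description": "REF 123", "amount": ""},
    {"date": "01/05/2024", "description": "-4.50", "amount": ""},
]
report = extraction_report(sample)
```

For the sample above, the report shows two date shapes, one missing date, two missing amounts, and one amount sitting in the description column.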

Practical fixes

Before extracting: quick prep

Identify whether it's scanned. If you can't reliably select and copy text, start with OCR.
Prefer the "transactions" export if available. Many banks offer a CSV or "download transactions" option separate from statements.
Focus on the table only. If you can crop or select a region, exclude logos, footers, and summary boxes.

During extraction: tools and tactics

Dedicated table extraction usually beats generic "PDF to Excel" conversion, because table tools are tuned for column inference and region selection.

Try a dedicated extractor like PDF Table Extractor and compare outputs.
Use region selection to isolate only the transaction table (avoid margins, page numbers, etc.).
If descriptions wrap, test a smaller range first (one page) and inspect whether wrapped lines are grouped correctly.
If the table has debit/credit columns, confirm the extractor doesn't merge them into one field.

After extraction: normalize and clean

Once you have "something" exported, the next goal is to enforce a consistent schema and remove junk rows.

Remove repeated headers and empty rows.
Normalize columns to a stable shape. For bank statements, the most reusable schema is typically: date, description, amount (or debit/credit), and optionally balance.
Standardize columns with Statement Converter so downstream tools receive consistent fields.
If you do cleanup in Excel, use careful import and cleaning practices; see cleaning bank statement data in Excel.
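If you normalize by hand rather than with a tool, the core of the job is mapping whatever columns you got onto the date, description, amount shape. The sketch below assumes the raw rows have hypothetical date, description, debit, and credit keys, and that debit and credit are unsigned decimal strings without thousands separators.

```python
from decimal import Decimal
from typing import Dict, List


def normalize_row(raw: Dict[str, str]) -> Dict[str, object]:
    """Map a debit/credit row onto date, description, and a signed amount."""
    debit = raw.get("debit", "").strip()
    credit = raw.get("credit", "").strip()
    if debit and credit:
        raise ValueError("row has both a debit and a credit")
    if debit:
        amount = -Decimal(debit)  # debits become negative amounts
    elif credit:
        amount = Decimal(credit)
    else:
        raise ValueError("row has no amount")
    return {
        "date": raw.get("date", "").strip(),
        "description": " ".join(raw.get("description", "").split()),
        "amount": amount,
    }


def normalize_rows(raw_rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Drop fully empty rows, then normalize the rest."""
    normalized = []
    for raw in raw_rows:
        if not any(str(value).strip() for value in raw.values()):
            continue  # junk row with nothing in it
        normalized.append(normalize_row(raw))
    return normalized
```

Raising on rows with no amount, or with both a debit and a credit, is a deliberate choice here: those rows usually point to an extraction problem that is better fixed upstream than silently patched.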

How to validate results

Sanity checks you can do fast

Spot-check 10 random rows: verify date, description, and amount align.
Scan for OCR artifacts: "O" vs "0", missing decimals, unexpected commas.
Sort by amount: unusually large or unusually tiny values can expose parsing issues.
Count rows: compare roughly to the number of transactions visible in the statement.
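The spot-check and sort-by-amount steps are easy to script once rows are normalized. A small sketch, assuming each row is a dict whose amount key already holds a Decimal; the helper names and the fixed seed are illustrative choices.

```python
import random
from decimal import Decimal
from typing import Dict, List, Tuple

Row = Dict[str, object]


def spot_check_sample(rows: List[Row], k: int = 10, seed: int = 0) -> List[Row]:
    """Pick up to k random rows to compare against the PDF by eye.

    A fixed seed makes the same sample come back on every run.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def amount_extremes(rows: List[Row], n: int = 3) -> Tuple[List[Row], List[Row]]:
    """Return the n smallest and n largest rows by absolute amount."""
    ordered = sorted(rows, key=lambda row: abs(row["amount"]))
    return ordered[:n], ordered[-n:]


rows = [
    {"description": "COFFEE", "amount": Decimal("-4.50")},
    {"description": "ROUNDING?", "amount": Decimal("0.01")},
    {"description": "RENT", "amount": Decimal("-900.00")},
    {"description": "MISSING DECIMAL?", "amount": Decimal("123456.00")},
]
smallest, largest = amount_extremes(rows, n=1)
```

Neither helper decides what is wrong; they only put the rows most likely to expose parsing issues in front of you.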

Reconciliation checks (when available)

If your statement includes opening and closing balance (or a running balance column), you can do stronger validation:

Verify that the opening balance matches the balance before the first transaction (if shown).
Check that the closing balance matches the statement's ending balance.
Ensure debits reduce balance and credits increase it (sign errors are common).
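These checks reduce to simple arithmetic: the opening balance plus the sum of signed amounts should equal the closing balance, and with a running balance column the same must hold row by row. A minimal sketch, assuming amounts are signed Decimals (debits negative) and using hypothetical function names:

```python
from decimal import Decimal
from typing import List, Optional, Tuple


def reconciles(opening: Decimal, closing: Decimal, amounts: List[Decimal]) -> bool:
    """True when opening + sum(amounts) equals the closing balance exactly."""
    return opening + sum(amounts, Decimal("0")) == closing


def first_balance_break(
    opening: Decimal, rows: List[Tuple[Decimal, Decimal]]
) -> Optional[int]:
    """Return the index of the first row whose running balance does not
    follow from the previous balance, or None if every row is consistent.

    rows: (signed amount, running balance printed on that row) pairs.
    """
    expected = opening
    for index, (amount, balance) in enumerate(rows):
        expected += amount
        if expected != balance:
            return index
        # Continue from the printed balance so one bad row is reported once.
        expected = balance
    return None


statement = [
    (Decimal("-20.00"), Decimal("80.00")),
    (Decimal("30.00"), Decimal("50.00")),  # debit extracted without its minus sign
    (Decimal("25.00"), Decimal("75.00")),
]
bad_row = first_balance_break(Decimal("100.00"), statement)
```

Using Decimal rather than float keeps the comparison exact, so a mismatch points to an extraction error and not to rounding. In the example, the second row (index 1) is flagged because 80.00 + 30.00 does not equal the printed 50.00.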

When to stop and choose another format

Sometimes the best fix is not "try harder," but "change inputs." If you keep seeing shifted columns and broken rows after multiple attempts, look for an alternate export path:

Download transactions as CSV if your bank offers it.
Try a different PDF period (some months have different templates).
Ask the bank whether a machine-readable export exists; it is sometimes hidden behind a generic "Download" option.

If you want a broader understanding of file formats banks use, see common bank statement formats explained.

FAQ

Extraction is a game of "good enough + validation." The goal isn't to produce a pretty spreadsheet; it's to produce data you can trust for the workflow you're about to run.