2026-04-04 · Ben Severn

10 Data Problems Every Pipeline Hits (and the One-Liner Fixes)

The same 10 data quality issues show up in every dataset. Here's what they look like and how to fix each in one line.

data-quality · data-cleaning · python · goldenflow · data-engineering · etl · data-transformation

Every data engineer writes throwaway scripts to fix the same problems. Phone numbers in 15 formats. Dates that aren't dates. "N/A" and "null" and "" all meaning the same thing. The scripts are always slightly different, never reusable, and break when the data changes.

Here are the 10 problems we see in every dataset, what they actually look like, and how to fix each one.

1. Phone numbers in 15 formats

Your CRM has phone numbers entered by humans over a decade. No two look the same.

Before → After
(555) 123-4567 → +15551234567
555.123.4567 → +15551234567
+1-555-123-4567 → +15551234567
5551234567 → +15551234567
1 (555) 123-4567 → +15551234567
goldenflow transform contacts.csv

Zero-config mode detects phone columns and normalizes to E.164 automatically. Every downstream system — Twilio, Salesforce, your matching pipeline — expects E.164. Do it once at the source.
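Under the hood, E.164 normalization for US-style numbers amounts to stripping punctuation and prepending the country code. Here's a rough standalone sketch of the idea (the function name `to_e164` and the US-only logic are illustrative, not GoldenFlow's actual implementation):

```python
import re

def to_e164(raw, default_country="1"):
    """Rough sketch of E.164 normalization for US-style numbers.

    Illustrative only: keeps digits, prepends the country code for
    bare 10-digit numbers, and leaves anything unrecognized alone.
    """
    digits = re.sub(r"\D", "", raw)            # drop everything but digits
    if len(digits) == 10:                      # bare US number
        digits = default_country + digits
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return raw                                 # unrecognized: pass through
```

The pass-through on unrecognized input mirrors the behavior described throughout this post: bad values stay visible instead of being silently mangled.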

2. Mojibake from bad CSV exports

Someone exported from Excel on a Windows machine, someone else opened it on a Mac, and now your text looks like this:

Before → After
cafÃ© → café
it’s → it's
rÃ©sumÃ© → résumé
naÃ¯ve → naïve

This happens when UTF-8 bytes get decoded as Latin-1. The fix is to re-encode:

import goldenflow
from goldenflow import GoldenFlowConfig, TransformSpec

config = GoldenFlowConfig(transforms=[
    TransformSpec(column="description", ops=["fix_mojibake"]),
])
result = goldenflow.transform_df(df, config=config)

fix_mojibake tries the Latin-1 → UTF-8 round-trip. If the text wasn't actually garbled, it returns the original unchanged.
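Conceptually, that round-trip is just re-encoding as Latin-1 and decoding as UTF-8, falling back to the original on failure. A minimal sketch (not GoldenFlow's source):

```python
def fix_mojibake(s):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    If the round-trip fails, the text wasn't garbled that way,
    so return it unchanged.
    """
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s
```

Text that was never garbled, like `naïve` typed correctly, fails the round-trip and comes back untouched, which is exactly the safe default you want.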

3. European vs US decimal formats

You're merging data from a German office and a US office. One writes 1.234,56 (one thousand two hundred thirty-four and 56 cents). The other writes 1,234.56. Same number, incompatible formats.

Before (European) → After
1.234,56 → 1234.56
99,99 → 99.99
10.000,00 → 10000.0
3.14 → 3.14

comma_decimal is smart about this — if there's no comma in the value, it parses as-is (so 3.14 stays 3.14, not 314). It only applies European conversion when a comma is actually present.
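The comma-presence heuristic is simple enough to sketch in a few lines (this is an illustration of the rule described above, not GoldenFlow's code):

```python
def comma_decimal(value):
    """Parse a numeric string, treating it as European-formatted
    only when a comma is present; otherwise parse as-is."""
    if "," in value:
        # European: "." is a thousands separator, "," is the decimal point
        value = value.replace(".", "").replace(",", ".")
    return float(value)
```

Because `3.14` contains no comma, it parses as-is rather than being misread as three hundred fourteen.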

4. Gmail dots and +tags hiding duplicates

Gmail ignores dots in the local part and everything after a +. These are all the same inbox:

john.smith@gmail.com
johnsmith@gmail.com
j.o.h.n.smith@gmail.com
john.smith+promo@gmail.com

Your dedup pipeline sees four different emails. Your customer sees one inbox and four of the same marketing email.

from goldenflow import GoldenFlowConfig, TransformSpec

config = GoldenFlowConfig(transforms=[
    TransformSpec(column="email", ops=["email_normalize"]),
])

This only strips dots for Gmail/Googlemail domains — other providers treat dots as significant.
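The canonicalization itself is a two-step rewrite of the local part, applied only for Google domains. A standalone sketch of that rule (illustrative, not GoldenFlow's implementation):

```python
def email_normalize(email):
    """Canonicalize Gmail addresses: drop +tags and dots in the
    local part, but only for Gmail/Googlemail domains."""
    local, _, domain = email.strip().lower().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0]   # everything after + is a tag
        local = local.replace(".", "")   # Gmail ignores dots
    return f"{local}@{domain}"
```

Addresses on other providers only get lowercased, since those providers may treat `j.smith` and `jsmith` as different inboxes.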

5. "Bob" and "Robert" are the same person

You're matching records across systems. One has "Bob Smith" and the other has "Robert Smith". Fuzzy matching gives you a 60% score — not enough to auto-match, too high to ignore. You end up manually reviewing thousands of maybe-matches.

Before → After
Bob → Robert
Bill → William
Jim → James
Mike → Michael
Dick → Richard
Liz → Elizabeth
Peggy → Margaret

nickname_standardize maps 50+ common English nicknames to their formal equivalents. Run it before dedup and that 60% match becomes 100%. This was built specifically to improve GoldenMatch results — normalizing nicknames before blocking and scoring eliminates an entire class of false negatives.
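At its core this is a lookup table applied before matching. A sketch with a handful of entries (the real transform covers 50+; this dict is illustrative only):

```python
NICKNAMES = {
    # tiny illustrative subset of the real nickname map
    "bob": "Robert", "bill": "William", "jim": "James",
    "mike": "Michael", "dick": "Richard",
    "liz": "Elizabeth", "peggy": "Margaret",
}

def nickname_standardize(name):
    """Map a common nickname to its formal equivalent;
    unknown names pass through unchanged."""
    return NICKNAMES.get(name.strip().lower(), name)
```

Run it on both sides of a match before scoring, so "Bob Smith" and "Robert Smith" compare as identical first names.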

6. Country names in 6 variants

Your international dataset has country names entered by humans in different languages, abbreviations, and conventions:

Before → After
United States → US
United States of America → US
USA → US
United Kingdom → GB
Great Britain → GB
Deutschland → DE
country_standardize maps common country names, abbreviations, and local names to ISO 3166-1 alpha-2 codes. Unknown values pass through unchanged — you can spot them by filtering for values longer than 2 characters.
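The same lookup-table shape works here, keyed on a lowercased name and falling back to pass-through. A sketch with a tiny subset of the real mapping (illustrative only):

```python
COUNTRIES = {
    # tiny illustrative subset; the real transform covers many more
    "united states": "US", "united states of america": "US", "usa": "US",
    "united kingdom": "GB", "great britain": "GB",
    "deutschland": "DE",
}

def country_standardize(name):
    """Map country names and local spellings to ISO 3166-1 alpha-2.
    Unknown values pass through unchanged."""
    return COUNTRIES.get(name.strip().lower(), name)
```

The pass-through makes unmapped values easy to audit afterwards: anything longer than two characters didn't match.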

7. SSNs stored as raw digits

Someone dumped SSNs into a spreadsheet without formatting. Now they're sitting in a column as 123456789 — no dashes, no masking, ready to be accidentally emailed to the wrong person.

Transform  | Before      | After
ssn_format | 123456789   | 123-45-6789
ssn_mask   | 123456789   | ***-**-6789
ssn_mask   | 123-45-6789 | ***-**-6789
goldenflow transform employees.csv -c config.yaml
# config.yaml:
# transforms:
#   - column: ssn
#     ops: [ssn_mask]

ssn_format normalizes to the standard dashed format. ssn_mask redacts everything except the last four digits. Both handle raw digits, dashed formats, and spaced formats. Invalid SSNs (wrong digit count) are preserved as-is so you can find and fix them.
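Both transforms boil down to: strip separators, check for exactly nine digits, then reassemble. A standalone sketch of that behavior (illustrative, not GoldenFlow's implementation):

```python
import re

def _ssn_digits(value):
    """Strip dashes and spaces so raw, dashed, and spaced forms all match."""
    return re.sub(r"[-\s]", "", value)

def ssn_format(value):
    d = _ssn_digits(value)
    if not re.fullmatch(r"\d{9}", d):
        return value                      # wrong digit count: preserve as-is
    return f"{d[:3]}-{d[3:5]}-{d[5:]}"

def ssn_mask(value):
    d = _ssn_digits(value)
    if not re.fullmatch(r"\d{9}", d):
        return value                      # invalid values stay visible
    return f"***-**-{d[-4:]}"
```

Preserving invalid values instead of masking them means a malformed SSN still stands out in the output for manual follow-up.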

8. Dates that aren't dates

Your date column has actual dates, free text someone typed into a date field, and blanks masquerading as dates:

Before         | date_validate | date_iso8601
2024-03-15     | true          | 2024-03-15
March 15, 2024 | true          | 2024-03-15
03/15/2024     | true          | 2024-03-15
not a date     | false         | not a date
TBD            | false         | TBD

Run date_validate first to flag the garbage, then date_iso8601 to normalize the real dates. The invalid values pass through unchanged — they don't silently become null or today's date.
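The normalize-or-pass-through behavior can be sketched as trying a list of known formats and returning the original string when none match (the format list here is a small illustrative subset, not GoldenFlow's):

```python
from datetime import datetime

# illustrative subset of recognized formats
FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")

def date_iso8601(value):
    """Normalize recognized date strings to ISO 8601;
    unparseable values pass through unchanged."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            pass
    return value  # "TBD", "not a date", etc. stay visible
```

Because failures pass through, a filter for values that still don't look like `YYYY-MM-DD` after the transform gives you the garbage rows to fix.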

9. HTML tags and emoji in scraped data

You scraped product descriptions or customer reviews. The raw data is full of markup and emoji that will break your NLP pipeline:

Before → After
<p>Great product!</p> → Great product!
<b>Bold</b> claim → Bold claim
Love this! 😀👍 → Love this!
Check <a href="...">here</a> → Check here

remove_html_tags strips all HTML tags. remove_emojis strips emoji characters. Chain them: ops: ["remove_html_tags", "remove_emojis", "collapse_whitespace"].
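Chained together, the three steps look roughly like this (a sketch of the idea; the emoji range here is a rough approximation, and none of this is GoldenFlow's source):

```python
import re

def clean_text(s):
    """Strip HTML tags and emoji, then collapse leftover whitespace."""
    s = re.sub(r"<[^>]+>", "", s)          # remove_html_tags
    s = re.sub(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]", "", s)  # remove_emojis (rough range)
    return re.sub(r"\s+", " ", s).strip()  # collapse_whitespace
```

The whitespace collapse at the end matters: stripping tags and emoji often leaves doubled spaces behind.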

10. "N/A", "null", "none", "-", ""

The classic. Five different ways to say "this field is empty" and none of them are actual nulls. Your aggregations count them as values. Your joins fail silently.

Before → After
N/A → null
null → null
none → null
NA → null
- → null
(empty string) → null
import goldenflow

# Zero-config handles this automatically
result = goldenflow.transform_df(df)

null_standardize is one of the auto-apply transforms — zero-config mode runs it on every string column without you asking. It catches the variants that trip up every pipeline.
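The transform is essentially a membership test against a set of known "empty" spellings. A standalone sketch of the behavior on string values (the token set is illustrative, not GoldenFlow's exact list):

```python
NULL_TOKENS = {"n/a", "null", "none", "na", "-", ""}

def null_standardize(value):
    """Map the usual 'empty' spellings to a real None;
    everything else passes through."""
    if value is None or value.strip().lower() in NULL_TOKENS:
        return None
    return value
```

Once these are real nulls, aggregations skip them and joins fail loudly instead of matching on the literal string "N/A".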

All of this is in GoldenFlow v1.1.0

pip install goldenflow
goldenflow transform your_data.csv

v1.1.0 ships 76 transforms across 11 categories — including the email, identifier, and URL modules that are new in this release. If you're already using GoldenCheck for data profiling, the --from-findings flag now actually works (it was silently broken in v1.0.0 — the finding-to-transform mapping used the wrong check names).

Full release notes: v1.1.0 on GitHub