No dataset is ever clean. Not one. It doesn’t matter whether it comes from a scientific lab, an aircraft engine, a public source, or a sensor network reporting to an agency. Data always carries noise, bias, and quirks. That noise isn’t random; it’s domain dependent, a consequence of how the measurement was made. Every dataset inherits the imperfections of the system that created it, and that’s exactly why cleaning data isn’t just a technical task: it’s design work built on knowledge of the domain. If you ever do get a clean dataset, silently thank its creator for taking the time to clean it up before publishing.
Every dataset tells a story about its origin. Maybe timestamps drift because sensors went out of sync. Maybe duplicate records appear because systems re-ingested old events. Maybe half the fields are empty because the people collecting the data skipped what didn’t seem important. None of this means the data is bad, but sometimes it is, and we need to accept that. I have dozens of examples from my career where the domain expert attaches a lot of importance to the data but not so much to how it was created or what it actually means (that part is left to somebody else). Yes, it took a lot of time and hard work to build, but just because it exists does not mean it is good or even valuable. A lot of data could safely be binned, yet it often ends up in scientific papers, data sources, or products.
Here is one of my cherished learnings:
Data is not valuable by itself; data + metadata + a model/understanding is what makes it valuable.
Cleaning data is about learning to read that story. I feel this is closely related to Domain-Driven understanding: your understanding of the domain should shape how you interpret, clean, and transform the data. You can’t separate the structure of the dataset from the structure of the world it represents. Metadata plays a massive role, and the domain expert plays an even bigger one in understanding any data. I feel that if the domain expert does not do their job of adding understanding, or a model, to the data, they are just creating another form of technical debt, a data debt that gets pushed forward or onto somebody else’s head.
Understanding this debt isn’t just debugging; it is studying how the system behaves, how people interact with it, and what the data really measures. In doing so, you end up understanding the domain far more deeply than you did at the start. Every dataset I have worked with ended up teaching me something. The more I work with messy datasets, the more I realize: cleaning is how you learn the domain. You start by learning how the data was collected, what the business rules are, and where the real-world variability comes from. Data cleaning is the process of discovery that makes everything else in data science possible. You’re actually uncovering how the system works. But this is also what makes time estimates for data science work so hard: the uncertainty in how long it takes to understand somebody else’s data.
## A few examples I’ve seen
- Collection drift: A field that once represented shipment weight quietly starts recording total container weight. Nobody updates the documentation, and months later you’re wondering why all your trade data looks off. Small upstream changes like this can cascade into serious downstream errors.
- System updates: The pipeline truncates decimals or silently switches from kilograms to pounds. Sometimes it’s a schema migration, sometimes an ETL refactor. Either way, unit changes without clear versioning create chaos. Always sanity check magnitudes after system updates.
- Human entry: spelling mistakes, inconsistent abbreviations, or shorthand notations that make perfect sense to the person entering them but not to anyone else. It is partly a form of laziness, and you can’t fully prevent it at the source. With text, NLP has helped a lot, and LLM tools now fix many of these, but no fix will ever be 100%.
- Business rules changing midstream: What counts as a valid sale or a complete transaction shifts as policies evolve, teams change, or time passes. This is especially painful when analyzing data that spans multiple years or decades. The meaning of a metric can drift while the column name stays the same.
- Missing data: A very common problem if you are working with time series. Just because a value does not exist in the dataset does not mean it was not happening in the actual process; maybe the sensor went offline or the log didn’t capture it. It can be fairly easy to fix if you know the inherent statistical distribution or structure of the data: you can do gap filling with, say, mean/median fill, zero fill, or interpolation (see the gap-filling sketch after this list). You should always think about what the imputation implies and what the result will convey in the end.
- Merging data sources: two tables that look similar but define their metrics differently. For example, I get one table from an upstream source and another table of manual additions from a domain expert. Unless I am 100% sure they define things the same way, I should not just concatenate the two. Best to have a mandatory data-source column (see the second sketch after this list).
- Duplicate records: a duplicate could be an error, or a genuine extra record somebody added because the value really did occur twice. So it is not just about dropping duplicates and calling it a day; do that blindly and you drop real data and change the distribution.
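To make the missing-data point concrete, here is a minimal gap-filling sketch. The series, column name, and gap limits are made up for illustration; the right fill is the one that matches what you believe about the underlying process.

```python
import pandas as pd
import numpy as np

# Hypothetical hourly sensor series with gaps (names and values are illustrative).
idx = pd.date_range("2024-01-01", periods=12, freq="h")
temps = [4.1, 4.3, np.nan, np.nan, 4.8, 5.0, np.nan, 5.3, 5.2, np.nan, np.nan, np.nan]
df = pd.DataFrame({"sensor_temp": temps}, index=idx)

# Keep an audit trail of which values were imputed.
df["was_imputed"] = df["sensor_temp"].isna()

# Pick the fill that matches your assumption about the process:
df["ffill"] = df["sensor_temp"].ffill(limit=2)                             # "the last reading still holds" (short gaps only)
df["interp"] = df["sensor_temp"].interpolate(limit=2)                      # "the process changes smoothly between readings"
df["median_fill"] = df["sensor_temp"].fillna(df["sensor_temp"].median())   # "gaps look like typical behaviour"

print(df)
```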
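And for the merging and duplicates points, a small sketch of what a mandatory data-source column and a duplicate inspection might look like; the table and column names are invented for the example.

```python
import pandas as pd

# Two hypothetical tables that look alike but come from different places.
upstream = pd.DataFrame({"shipment_id": [1, 2, 3], "weight_kg": [1200, 950, 1100]})
manual = pd.DataFrame({"shipment_id": [3, 4], "weight_kg": [1100, 800]})

# Tag every row with its origin before concatenating, so provenance survives the merge.
upstream["data_source"] = "upstream_feed"
manual["data_source"] = "expert_manual"
combined = pd.concat([upstream, manual], ignore_index=True)

# Inspect duplicates before deciding what to do with them:
# is shipment 3 an error, or a genuinely repeated record?
dupes = combined[combined.duplicated(subset=["shipment_id", "weight_kg"], keep=False)]
print(dupes)
```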
## Tools That Make It Easier in Python, and Building Your Own
Tools like Great Expectations make documenting and validating your assumptions part of the pipeline itself. You can literally encode your “domain rules” (ranges, valid categories, units) into automated tests. Your expectations become living documentation of how the data should behave, and you get immediate feedback when reality drifts.
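As a rough sketch (this uses the older pandas-backed Great Expectations API; the newer Fluent API is structured differently, and the column names and category set are assumptions borrowed from the container example later in this post):

```python
import great_expectations as ge

# Wrap an ordinary DataFrame so expectations can be attached to it (legacy API).
gdf = ge.from_pandas(df)  # df: a cleaned container DataFrame, as in the example below

# Encode domain rules as executable expectations.
gdf.expect_column_values_to_not_be_null("container_id")
gdf.expect_column_values_to_be_between("temperature_c", min_value=-5, max_value=40)
gdf.expect_column_values_to_be_in_set("container_type", ["reefer", "dry", "tank"])  # assumed categories

# Run everything and get a machine-readable report when reality drifts.
results = gdf.validate()
print(results.success)
```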
For projects where schema consistency and data typing matter (especially in AI and ML pipelines), I’ve also found Pydantic incredibly useful. It enforces structure right at the data model level, catching subtle issues like a string sneaking into a numeric column, long before they contaminate your downstream models. The result is cleaner, more predictable datasets and far fewer surprises.
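For instance, a minimal sketch with Pydantic (v2 syntax), using field names borrowed from the container example below; treat the model itself as illustrative:

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError

class ContainerReading(BaseModel):
    # Field names mirror the container example; adjust to your real schema.
    timestamp: datetime
    container_id: str
    temperature_c: float
    humidity: Optional[float] = None

# A valid row parses cleanly, with types coerced where sensible.
ok = ContainerReading(timestamp="2024-05-01T12:00:00Z", container_id="ABC123", temperature_c=4.5)

# A string sneaking into a numeric column fails loudly instead of silently propagating.
try:
    ContainerReading(timestamp="2024-05-01T12:05:00Z", container_id="ABC123", temperature_c="n/a")
except ValidationError as e:
    print(e)
```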
What happens if you don’t have fancy libraries at your disposal? You can still build structure and quality into your data work. The key is adopting the principles behind these tools, even if you implement them manually.
- Enforcing types. Always coerce your data to the correct type explicitly; don’t rely on defaults. Treat unexpected types as red flags worth investigating.
- Having data tests and validations. Simple assertions or `if` checks go a long way: for example, ensure temperatures are within physical limits or IDs are unique (see the sketch after this list).
- Outliers and data ranges. Define what’s “normal” for your domain. Outliers aren’t just errors; they often point to real phenomena or process issues.
- I feel database principles can teach us a lot about good data design. Think in terms of primary keys, unique constraints, and referential integrity. Even in a CSV file, these ideas bring order.
- Document domain rules: Write down the expected ranges, categories, and relationships. These rules are the human-readable version of your data contract.
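Here is a minimal sketch of what those hand-rolled checks might look like, assuming the container-style columns used in the example below; the limits and key columns are illustrative:

```python
import pandas as pd

def validate_readings(df: pd.DataFrame) -> pd.DataFrame:
    """A hand-rolled data contract: fail fast instead of silently passing bad rows along."""
    df = df.copy()

    # Enforce types explicitly rather than trusting whatever read_csv inferred.
    df["temperature_c"] = pd.to_numeric(df["temperature_c"], errors="raise")
    df["container_id"] = df["container_id"].astype(str)

    # Domain range checks (the limits here are illustrative).
    assert df["temperature_c"].between(-40, 60).all(), "temperature outside physical limits"
    assert df["humidity"].dropna().between(0, 100).all(), "humidity should be a percentage"

    # Database-style constraints, even though this is just a DataFrame:
    # (container_id, timestamp) should behave like a primary key.
    assert not df.duplicated(subset=["container_id", "timestamp"]).any(), "duplicate key rows"

    return df
```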
## A Pandas Example of Some Common Transformations
Here’s an example of how I usually start cleaning this kind of data with Python and pandas. Say I’m dealing with a sensor column from inside a container ship. First, I’ll plot the data distribution to see what we’re working with. Then I’ll get a feel for what looks like “real” data and what we can do with values that don’t fit the distribution. If possible, I’ll cross-check against weather and sea-air temperatures along the ship’s route to see whether those changes also show up in the container’s temperature patterns. Just transformations guided by context. Without knowing the domain, I would never guess what -80°F might be doing in the temperature column.
```python
import pandas as pd
import numpy as np

# Load the raw export and work on a copy so the original stays untouched.
df_raw = pd.read_csv("cargo_containers.csv")
df = df_raw.copy()

# Normalize column names: strip whitespace, lower-case, snake_case.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Make sure the columns we rely on exist, even if this export dropped some.
for col in ["timestamp", "container_id", "container_type", "temperature", "humidity"]:
    if col not in df:
        df[col] = np.nan

# Parse timestamps; anything unparseable becomes NaT instead of raising.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)
df = df.sort_values("timestamp").reset_index(drop=True)

def _extract_number(x):
    """Pull a numeric value out of messy entries like '5.2 C' or '41F'."""
    if pd.isna(x):
        return np.nan
    if isinstance(x, (int, float)):
        return float(x)
    s = "".join(ch for ch in str(x) if ch.isdigit() or ch in ".-")
    try:
        return float(s)
    except ValueError:
        return np.nan

df["temperature_num"] = df["temperature"].map(_extract_number)
df["humidity"] = pd.to_numeric(df["humidity"], errors="coerce")

def f_to_c(f):
    return (f - 32.0) * (5.0 / 9.0)

# Domain heuristic: values above 60, or an explicit 'F' in the raw string,
# are probably Fahrenheit readings that need converting.
is_prob_f = (df["temperature_num"] > 60) | (df["temperature"].astype(str).str.upper().str.contains("F"))
df["temperature_c"] = np.where(is_prob_f, f_to_c(df["temperature_num"]), df["temperature_num"])

# Physical plausibility window for this cargo; flag and drop anything outside it.
phys_ok = (df["temperature_c"] > -5.0) & (df["temperature_c"] < 40.0)
df["qc_flag"] = np.where(phys_ok, "ok", "phys_out_of_range")
df_clean = df.loc[phys_ok].copy()

# Impute missing humidity with the per-container-type median, then a global fallback.
df_clean["humidity"] = df_clean.groupby("container_type")["humidity"] \
    .transform(lambda x: x.fillna(x.median()))
df_clean["humidity"] = df_clean["humidity"].fillna(df_clean["humidity"].median())

# Drop exact duplicate rows (after inspecting them, as discussed above).
df_clean = df_clean.drop_duplicates()

def resample_container(g, freq="5min"):
    """Put each container on a regular 5-minute grid, averaging within each bin."""
    g = g.set_index("timestamp").sort_index()
    out = g.resample(freq).agg({
        "container_id": "first",
        "container_type": "first",
        "temperature_c": "mean",
        "humidity": "mean",
    })
    # Carry identifiers across short gaps only (at most 3 bins).
    out["container_id"] = out["container_id"].ffill(limit=3)
    out["container_type"] = out["container_type"].ffill(limit=3)
    return out.reset_index()

df_rs = df_clean.groupby("container_id", group_keys=False).apply(resample_container)

def robust_z(x):
    """Median/MAD-based z-score, far less sensitive to outliers than mean/std."""
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    if not np.isfinite(mad) or mad == 0:
        return pd.Series(np.zeros(len(x)), index=x.index)
    return 0.6745 * (x - med) / mad

df_rs["robust_z_temp"] = df_rs.groupby("container_type")["temperature_c"].transform(robust_z)

def rolling_jump_flags(g, window=6, thresh=3.0):
    """Flag sudden jumps: step changes much larger than the local rolling IQR."""
    g = g.sort_values("timestamp").copy()
    g["roll_med"] = g["temperature_c"].rolling(window, min_periods=1, center=True).median()
    g["delta"] = g["temperature_c"].diff()
    roll_q1 = g["temperature_c"].rolling(window, min_periods=1, center=True).quantile(0.25)
    roll_q3 = g["temperature_c"].rolling(window, min_periods=1, center=True).quantile(0.75)
    iqr = (roll_q3 - roll_q1).replace(0, np.nan)
    iqr = iqr.fillna(iqr.median() if iqr.median() > 0 else 1.0)
    g["jump_flag"] = np.where(np.abs(g["delta"]) > (thresh * iqr), 1, 0)
    return g

df_rs = df_rs.groupby("container_id", group_keys=False).apply(rolling_jump_flags)

# Combine the quality flags into a single column.
df_rs["qc_flag"] = "ok"
df_rs.loc[np.abs(df_rs["robust_z_temp"]) > 4.0, "qc_flag"] = "dist_outlier"
jump = df_rs["jump_flag"] == 1
df_rs.loc[jump, "qc_flag"] = np.where(
    df_rs.loc[jump, "qc_flag"].eq("ok"), "sudden_jump", df_rs.loc[jump, "qc_flag"] + "|sudden_jump"
)

def winsorize(col, lower=0.01, upper=0.99):
    """Clip the extreme tails per container type instead of deleting them."""
    lo, hi = col.quantile([lower, upper])
    return col.clip(lo, hi)

df_winz = df_rs.copy()
df_winz["temperature_c_winz"] = df_rs.groupby("container_type")["temperature_c"].transform(winsorize)

# Cross-check against route weather if we have it, matching on the nearest timestamp
# (assumes the weather file also carries container_id and ambient_temp_c columns).
try:
    wx = pd.read_csv("route_weather.csv")
    wx["timestamp"] = pd.to_datetime(wx["timestamp"], errors="coerce", utc=True)
    merged = pd.merge_asof(
        df_rs.sort_values("timestamp"),
        wx.sort_values("timestamp"),
        on="timestamp",
        by="container_id",
        direction="nearest",
        tolerance=pd.Timedelta("15min"),
    )
    merged["delta_vs_ambient"] = merged["temperature_c"] - merged["ambient_temp_c"]
except FileNotFoundError:
    merged = df_rs

# Quick visual sanity checks: one container's trace, and distributions by type.
ax = df_rs[df_rs["container_id"] == "ABC123"].plot(x="timestamp", y="temperature_c", figsize=(10, 4))
df_rs.hist(column="temperature_c", by="container_type", bins=50, figsize=(10, 6))

# Keep a tidy, documented subset of columns for downstream work.
cols_out = [
    "timestamp", "container_id", "container_type",
    "temperature_c", "humidity", "robust_z_temp", "jump_flag", "qc_flag",
]
final = merged[[c for c in cols_out if c in merged.columns]].copy()
final.to_csv("containers_cleaned.csv", index=False)
```
