Artifact 4 · Data challenges in ML

The Data Problem

Your model is only as good as what you feed it. Here's what goes wrong — and why it matters.

Data pipeline — messy raw data flows through cleaning stages into organized structured data

Garbage In, Garbage Out

ML models learn from data. If the data is biased, incomplete, noisy, or poorly labeled, the model inherits every flaw. No amount of tuning fixes bad inputs.

Three core quality issues break ML models more than anything else: imbalanced data, where the model ignores minority classes (often the ones you care about most: fraud, disease, defects); class overlap, where different categories occupy the same region of feature space; and underrepresented subconcepts, where rare patterns within a class appear too seldom for the model to learn them.
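To make the imbalance problem concrete, here is a minimal sketch using made-up labels (990 legitimate transactions, 10 fraudulent). The "model" is a stand-in that always predicts the majority class, and the helper names `accuracy` and `recall` are illustrative, not from any library. It shows how a headline accuracy number can hide a model that never catches the class you care about.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples the model actually found."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical toy labels: 990 legitimate transactions, 10 fraudulent ones.
y_true = ["legit"] * 990 + ["fraud"] * 10

# A stand-in "model" that always predicts the majority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
fraud_recall = recall(y_true, y_pred, positive="fraud")
print(f"accuracy={acc:.3f}  fraud recall={fraud_recall:.3f}")
# prints: accuracy=0.990  fraud recall=0.000
```

The model looks 99% accurate and catches zero fraud, which is why per-class metrics matter more than overall accuracy on imbalanced data.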

Good data vs bad data — scattered question marks versus organized bar chart

Getting Data Right Is Hard Work

Beyond quality, there's the operational reality. Labeling is slow, expensive, and error-prone — especially in vision and NLP. Privacy regulations (GDPR, HIPAA) limit what you can collect and use. Data distributions drift over time, making last year's model fail on today's inputs.
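Drift can be measured rather than guessed at. One common approach (an assumption here, not something the source prescribes) is the Population Stability Index, which compares how a feature is distributed in a reference sample versus a current one. The sketch below is a simplified version with equal-width bins; the function name `psi`, the bin count, and the toy data are all illustrative.

```python
import math

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Hypothetical feature values: last year's data vs. a shifted version of it.
reference = [i % 100 for i in range(1000)]
shifted = [v + 30 for v in reference]

print("no drift:", psi(reference, reference))
print("shifted: ", psi(reference, shifted))
```

A PSI of zero means the two samples fall into the bins identically. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, though the right threshold depends on the feature and the application.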

Real projects combine structured, unstructured, and real-time data from multiple sources. Versioning matters for reproducibility. Scale requires distributed processing and significant compute. Every step adds cost and complexity.
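Versioning for reproducibility can start small. As a minimal sketch, assuming the dataset fits in memory as a list of JSON-serializable rows, a content hash gives every dataset state a fingerprint you can log next to each trained model. The function name `dataset_fingerprint` and the sample rows are hypothetical. Dedicated data-versioning tools handle large files and history, which this does not attempt.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that changes whenever any row changes."""
    digest = hashlib.sha256()
    # Serialize each row with sorted keys so dict key order does not matter,
    # then sort the serialized rows so row order does not matter either.
    for line in sorted(json.dumps(row, sort_keys=True) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

# Hypothetical dataset versions.
v1 = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
v1_reordered = [{"label": "dog", "id": 2}, {"label": "cat", "id": 1}]
v2 = [{"id": 1, "label": "cat"}, {"id": 2, "label": "cat"}]  # one relabeled row

print(dataset_fingerprint(v1) == dataset_fingerprint(v1_reordered))  # True
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))            # False
```

Ignoring row order is a design choice: it treats a reshuffled file as the same dataset. Drop the `sorted` call if order is meaningful for your training run.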

Data processing pipeline illustration

Model-Centric vs Data-Centric AI

The field is moving from "build a better model" to "build better data." The old approach treated data as fixed and tuned the model. The new approach keeps the model fixed and iterates on the data — cleaning labels, balancing classes, removing noise.
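Here is what one data-centric iteration might look like in miniature. The sketch assumes a toy dataset of (text, label) pairs. It flags identical inputs that carry conflicting labels, drops them pending review, then balances the remaining classes by random oversampling. The helper names `find_label_conflicts` and `oversample` are illustrative, and random oversampling is only one of several balancing techniques.

```python
import random
from collections import Counter, defaultdict

def find_label_conflicts(examples):
    """Map each input that appears with more than one label to its labels."""
    labels_by_input = defaultdict(set)
    for text, label in examples:
        labels_by_input[text].add(label)
    return {t: sorted(ls) for t, ls in labels_by_input.items() if len(ls) > 1}

def oversample(examples, seed=0):
    """Randomly duplicate minority-class examples until all classes are equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical toy dataset of (text, label) pairs.
data = [
    ("refund not received", "complaint"),
    ("refund not received", "question"),  # same text, different label
    ("love the new update", "praise"),
    ("app crashes on login", "complaint"),
    ("where is my order", "question"),
    ("checkout keeps failing", "complaint"),
]

conflicts = find_label_conflicts(data)
clean = [(text, label) for text, label in data if text not in conflicts]
balanced = oversample(clean)

print("conflicts:", conflicts)
print("class counts:", Counter(label for _, label in balanced))
```

The model never changes in this loop; only the data does. In practice, oversample the training split only, so duplicated examples do not leak into evaluation.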

Understanding data quality is now the competitive edge. Better data beats better algorithms, every time.

Model-Centric                          | Data-Centric
Fix the model when accuracy drops      | Fix the data when accuracy drops
Add more layers, tune hyperparameters  | Clean labels, balance classes, remove noise
Treat data as fixed                    | Iterate on data systematically

The models are only as good as what we feed them.

About This Project

Introduction
This artifact explores why data, not models, is the primary bottleneck in ML, covering quality issues, operational challenges, and the shift toward data-centric AI.

Description
A visual explainer that breaks down data challenges into three categories: quality problems (imbalance, overlap, underrepresentation), operational hurdles (labeling, privacy, drift), and the paradigm shift from model-centric to data-centric AI.

Objective
Make the case that data quality is the most important, and most overlooked, factor in building reliable ML systems.

Process
Synthesized course materials and research (Santos et al., 2023) into a visual narrative, then designed comparison layouts and flow diagrams to communicate each concept without dense text.

Tools
HTML, CSS, JavaScript (IntersectionObserver), course research materials, ChatGPT for content refinement.

Value
Demonstrates understanding of the full ML pipeline: not just model architecture, but the data engineering that makes or breaks production systems.

Unique Value
Frames data challenges as an engineering problem, not an academic one. Uses the model-centric vs data-centric comparison to show a practical mental model for approaching real-world ML.

Relevance
Every ML engineer spends more time on data than on models. This artifact demonstrates the data literacy that separates research prototypes from production systems.

References
Santos, M. et al. (2023). "A Unifying View of Class Overlap and Imbalance." Information Fusion 89.
Krawczyk, B. (2016). "Learning from Imbalanced Data." Progress in AI.
Course materials from AIML-500, Indiana Wesleyan University.