Before choosing an algorithm, start with the data. Machine learning models learn from examples. If those examples are incomplete, inconsistent, mislabeled, or poorly formatted, the model can produce unreliable predictions.
Data quality and exploratory data analysis, or EDA, are the first real skills to build in practical machine learning.
Quick Answer
Data quality means making sure the dataset is accurate, complete, consistent, timely, and usable. EDA means exploring the dataset with summaries and visualizations so you can find missing values, outliers, correlations, patterns, and potential modeling problems before training.
Key Takeaways
- Data quality problems often become model quality problems.
- Missing values need a deliberate strategy.
- Date, text, categorical, and numeric fields often need cleaning.
- EDA helps identify patterns, anomalies, and influential variables.
- Cleaning and exploration are usually iterative, not one-time steps.
What Good Data Quality Looks Like
| Attribute | What it means | Example check |
|---|---|---|
| Accuracy | Values match reality | Does the recorded date or amount make sense? |
| Consistency | Values follow the same format | Are categories spelled the same way? |
| Timeliness | Data is current enough for the use case | Is the data stale? |
| Completeness | Required fields are present | How many missing labels or features exist? |
These checks are not just administrative. They directly affect whether the model can learn useful relationships.
Common Data Quality Problems
Missing values
Missing values can appear because a field was not collected, a system failed, a customer skipped a form, or a value is not applicable.
Common strategies:
- remove rows when missingness is rare and dropping them does not bias the sample,
- fill numeric values with a median or domain-based default,
- fill categories with `unknown`,
- create a missingness flag,
- investigate whether missingness itself is predictive.
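The fill and flag strategies above can be sketched in Pandas. This is a minimal illustration, not a recommended default: the `income` and `segment` columns and their values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 48000.0, np.nan],
    "segment": ["retail", None, "retail", "business", "retail"],
})

# Create the missingness flag first, so the information survives the fill.
df["income_missing"] = df["income"].isna()

# Fill numeric gaps with the median of the observed values.
df["income"] = df["income"].fillna(df["income"].median())

# Fill missing categories with an explicit "unknown" level.
df["segment"] = df["segment"].fillna("unknown")

# Share of rows where income was originally missing.
missing_share = df["income_missing"].mean()
print(df)
print(f"income was missing in {missing_share:.0%} of rows")
```

When the data is later split for training, the median should be computed on the training portion only and then reused for validation and test data, so the fill does not leak information across the split.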
Wrong data types
Dates stored as text, numbers stored as strings, and categories mixed with numeric codes can break analysis.
Check:
- date columns,
- currency columns,
- boolean fields,
- categorical fields,
- numeric ranges.
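One way to run these type checks is to convert each column explicitly and count what fails to parse. The sketch below assumes a small made-up table in which every column arrived as text; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw export where every column arrived as text.
raw = pd.DataFrame({
    "signup_date": ["2023-01-15", "2023-02-30", "2023-03-01"],  # Feb 30 is invalid
    "amount": ["19.99", "n/a", "5"],
    "is_active": ["true", "false", "true"],
})

clean = pd.DataFrame({
    # errors="coerce" turns unparseable values into NaT/NaN instead of raising.
    "signup_date": pd.to_datetime(raw["signup_date"], format="%Y-%m-%d", errors="coerce"),
    "amount": pd.to_numeric(raw["amount"], errors="coerce"),
    # Map text booleans explicitly; unmapped values would become NaN.
    "is_active": raw["is_active"].map({"true": True, "false": False}),
})

# Number of values per column that could not be converted.
failed = clean.isna().sum()
print(clean.dtypes)
print(failed)
```

Coercing rather than raising keeps the pipeline running, but the failure counts should be reviewed: a large count usually points to a format problem, not to genuinely missing data.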
Unwanted characters
Data may include symbols, prefixes, whitespace, inconsistent casing, or special markers.
Examples:
- `<2006` in a year field,
- `N/A` mixed with real categories,
- extra spaces in category names,
- inconsistent capitalization.
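A rough sketch of how such values might be cleaned in Pandas follows. The `year` and `fuel` columns are invented for illustration, and reading `<2006` as "before 2006" is an assumption about what the marker means in a given dataset.

```python
import pandas as pd

# Hypothetical messy columns; names and values are illustrative only.
df = pd.DataFrame({
    "year": ["2010", "<2006", "2015", "N/A"],
    "fuel": [" Petrol", "petrol ", "DIESEL", "N/A"],
})

# Keep the "<" marker as a flag before stripping it (assumed to mean "before 2006").
df["year_censored"] = df["year"].str.startswith("<")

# Strip the marker and parse; "N/A" cannot be parsed and becomes NaN.
df["year_num"] = pd.to_numeric(df["year"].str.lstrip("<"), errors="coerce")

# Normalize whitespace and casing, then treat the "n/a" marker as missing.
fuel = df["fuel"].str.strip().str.lower()
df["fuel"] = fuel.mask(fuel == "n/a")

print(df)
```

Keeping a flag for the stripped marker avoids silently treating `<2006` as an exact year.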
Categorical values
Many ML models require numeric inputs. Categorical columns often need encoding.
Common approaches:
- one-hot encoding,
- ordinal (label) encoding when the categories have a natural order,
- grouping rare categories,
- using embeddings for high-cardinality categories.
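The first three approaches can be sketched with plain Pandas. The columns, the category order, and the rarity threshold of 2 are all assumptions chosen for this toy example; embeddings are not shown because they depend on the modeling framework.

```python
import pandas as pd

# Hypothetical data; the columns and the rarity threshold are illustrative.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "city": ["Pune", "Delhi", "Pune", "Kochi"],
})

# Integer codes are only meaningful here because the sizes have a natural order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

# Group rare categories: cities seen fewer than 2 times become "other".
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")

# One-hot encode the grouped, unordered category.
encoded = pd.get_dummies(df[["size_code", "city_grouped"]], columns=["city_grouped"])
print(encoded)
```

Grouping rare values before one-hot encoding keeps the number of generated columns manageable.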
What EDA Does
Exploratory data analysis helps you understand what the dataset is telling you before you build a model.
EDA helps answer:
- What values are common?
- What values are missing?
- Are there outliers?
- Are columns correlated?
- Which features may influence the target?
- Are there unusual groups?
- Does the label look usable?
Useful EDA Methods
| Data type | Numerical EDA | Visual EDA |
|---|---|---|
| Numeric | describe(), mean, median, standard deviation | histogram, box plot, scatter plot |
| Categorical | value counts, crosstab | count plot, grouped bar chart |
| Relationship | correlation matrix, grouped summary | heatmap, joint plot, pair plot |
The goal is not to make beautiful charts. The goal is to learn whether the data can support the prediction task.
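As a rough illustration of the numerical column in the table, the following sketch runs the standard Pandas summaries on a small made-up dataset. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical customer data; income is in thousands.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30, 42, 65, 70, 50, 36],
    "plan": ["basic", "basic", "pro", "pro", "basic", "basic"],
    "churned": [1, 0, 0, 0, 1, 1],
})

# Numeric: count, mean, standard deviation, min, quartiles, max.
numeric_summary = df[["age", "income"]].describe()

# Categorical: how often each value appears.
plan_counts = df["plan"].value_counts()

# Categorical vs. target: counts for every plan/churn combination.
plan_vs_churn = pd.crosstab(df["plan"], df["churned"])

# Relationship: pairwise Pearson correlation between numeric columns.
corr = df[["age", "income", "churned"]].corr()

print(numeric_summary, plan_counts, plan_vs_churn, corr, sep="\n\n")
```

The same tables can feed the visual methods: a Seaborn heatmap, for example, can be drawn from a correlation matrix like `corr`.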
Univariate And Bivariate Analysis
Univariate analysis looks at one variable at a time.
Use it to find:
- distribution shape,
- extreme values,
- missing values,
- category imbalance.
Bivariate analysis looks at relationships between two variables.
Use it to find:
- feature-target relationships,
- correlations,
- category differences,
- possible interaction effects.
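A small sketch of both kinds of analysis is shown below, on invented order data. The 1.5 × IQR rule used here is one common convention for flagging extreme values, not the only option.

```python
import pandas as pd

# Hypothetical order data; names and values are illustrative only.
df = pd.DataFrame({
    "amount": [12, 15, 14, 13, 16, 15, 120],
    "channel": ["web", "web", "store", "store", "web", "store", "web"],
    "returned": [0, 0, 1, 0, 0, 1, 1],
})

# Univariate: flag extreme values with the 1.5 * IQR rule.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

# Univariate: category balance as a share of all rows.
channel_share = df["channel"].value_counts(normalize=True)

# Bivariate: how the target (return rate) differs between categories.
return_rate = df.groupby("channel")["returned"].mean()

print(outliers)
print(channel_share)
print(return_rate)
```

A flagged value is a prompt to investigate, not an instruction to delete: it may be an entry error or a genuine but rare case.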
Example EDA Checklist
Use this checklist before model training:
- Confirm the target label is present.
- Count missing values by column.
- Check numeric ranges.
- Review category counts.
- Convert dates into usable date/time features.
- Look for duplicates.
- Visualize the target distribution.
- Check correlations between numeric fields.
- Compare features against the target.
- Note data issues before training.
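Several items on this checklist can be bundled into a small helper. The function below, `quick_data_report`, is a hypothetical name introduced for this sketch and covers only the checks that need no plotting.

```python
import pandas as pd

def quick_data_report(df: pd.DataFrame, target: str) -> dict:
    """Collect basic pre-training checks (hypothetical helper, not a library API)."""
    if target not in df.columns:
        raise KeyError(f"target column {target!r} is not in the dataset")
    return {
        "n_rows": len(df),
        "missing_target": int(df[target].isna().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        # Min and max for every numeric column, to spot impossible values.
        "numeric_ranges": df.select_dtypes(include="number").agg(["min", "max"]).to_dict(),
        # Class shares, to spot an imbalanced or unusable label.
        "target_distribution": df[target].value_counts(normalize=True).to_dict(),
    }

# Hypothetical toy data with one missing age and one duplicated row.
df = pd.DataFrame({
    "age": [25, 32, 32, None],
    "plan": ["basic", "pro", "pro", "basic"],
    "churned": [1, 0, 0, 1],
})

report = quick_data_report(df, target="churned")
for name, value in report.items():
    print(name, "->", value)
```

Saving a report like this alongside each training run is one way to satisfy the last item on the checklist, since the data issues are recorded before the model is built.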
EDA In BigQuery And Python
Python is useful for visual exploration with tools such as Pandas, Matplotlib, and Seaborn.
BigQuery is useful when the dataset is large or already stored in a warehouse. SQL can help you count missing values, group categories, inspect ranges, and create training tables.
In real projects, both are common:
- use SQL to select and aggregate,
- use Python to visualize and experiment,
- return to SQL for repeatable data preparation.
FAQ
Should data cleaning happen before or after EDA?
Both. Start with basic cleaning so the data can be inspected, then use EDA to discover deeper cleaning needs.
Is EDA only for data scientists?
No. Analysts, engineers, product owners, and ML practitioners all benefit from understanding the dataset before trusting a model.
Can AutoML fix bad data?
AutoML can automate model training, but it cannot fully fix unclear labels, missing business context, or poor data quality.
Bottom Line
Good models start with trustworthy data. Data quality checks and EDA help you find problems before they become expensive model failures.