Choosing features is one of the most practical skills in machine learning. The model can only learn from the signals you give it, so weak or misleading features can hurt even a strong algorithm.
Use this guide as a checklist when reviewing raw data before building a model.
Quick Answer
Choose machine learning features by checking whether each candidate feature is related to the target, available at prediction time, ethical to use, represented correctly, and supported by enough examples.
Key Takeaways
- Good features must relate to the business objective.
- Features must be available when the model makes predictions.
- Avoid features that leak the answer.
- Treat a value as a number only if its magnitude is meaningful.
- Rare categories may need grouping, hashing, or embeddings.
Feature Quality Checklist
| Check | Question |
|---|---|
| Relevance | Does the feature help explain the target? |
| Availability | Will this value exist at prediction time? |
| Legality | Are we allowed to collect and use it? |
| Ethics | Could this feature create unfair or sensitive outcomes? |
| Representation | Is the value encoded in a useful way? |
| Coverage | Do we have enough examples? |
| Stability | Will this feature behave similarly in production? |
| Evaluation | Does it improve the model metric? |
1. Is The Feature Related To The Objective?
Start with the target. If the goal is to predict taxi fare, features like distance, pickup location, dropoff location, time of day, and toll amount may be relevant.
Features that are not related to the prediction goal add noise. They can also make the model harder to explain and maintain.
Ask:
- Why might this feature affect the outcome?
- Would a domain expert expect it to matter?
- Does the data show a useful pattern?
- Does the model improve when the feature is added?
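A quick first screen for the "useful pattern" question is to measure how strongly each candidate moves with the target. The sketch below uses plain Pearson correlation on made-up taxi data; the column names and values are hypothetical, and correlation only detects linear relationships, so treat a low score as a prompt to look closer rather than a verdict.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical taxi trips: values invented for illustration only.
fare = [7.5, 12.0, 20.5, 9.0, 31.0, 15.5]
candidates = {
    "trip_distance_km": [2.0, 4.1, 8.0, 2.9, 12.5, 5.6],
    "receipt_last_digit": [3, 7, 3, 1, 1, 7],
}

# Score every candidate against the target and list the strongest first.
scores = {name: pearson(values, fare) for name, values in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {score:+.2f}")
```

A screen like this does not replace the last question in the list: the real test is still whether the validation metric improves when the feature is added.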
2. Is The Feature Available At Prediction Time?
This is one of the most important checks.
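A feature is not useful if its value is only known after the event you are trying to predict. Training on it creates leakage: the model looks accurate offline, then fails in production because the value does not exist yet when a real prediction is made.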
Examples:
- Do not use final delivery time to predict delivery delay before delivery happens.
- Do not use post-purchase behavior to predict whether a customer will buy.
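- Do not use the resolved ticket category to route a support ticket that has just been opened.
If a value arrives hours or days late, account for that delay in your training data: build each training row only from values that would already have been available at the moment of prediction.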
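One way to enforce this is to store, next to every feature value, the time it became available, and to filter on that time when assembling training rows. The sketch below assumes a hypothetical feature log with an `available_at` field; the field names and the delivery example are illustrative, not a prescribed schema.

```python
from datetime import datetime

# Hypothetical feature log: each value records when it became available.
feature_log = [
    {"order_id": 1, "name": "distance_km", "value": 4.2,
     "available_at": datetime(2024, 3, 1, 9, 0)},
    {"order_id": 1, "name": "courier_rating", "value": 4.8,
     "available_at": datetime(2024, 3, 1, 8, 30)},
    # Only known after the delivery is finished, so it must not leak in.
    {"order_id": 1, "name": "final_delivery_minutes", "value": 52,
     "available_at": datetime(2024, 3, 1, 10, 5)},
]

def features_as_of(log, order_id, prediction_time):
    """Keep only the values that already existed at prediction_time."""
    return {
        row["name"]: row["value"]
        for row in log
        if row["order_id"] == order_id
        and row["available_at"] <= prediction_time
    }

# The delay prediction is made at 09:15, before the delivery completes.
prediction_time = datetime(2024, 3, 1, 9, 15)
training_row = features_as_of(feature_log, 1, prediction_time)
print(training_row)
```

Because the filter uses the same cutoff the production system would face, the final delivery time never reaches the training row.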
3. Is It Legal And Ethical To Use?
Some data may be technically available but inappropriate to use.
Review:
- personal data,
- protected attributes,
- financial information,
- health information,
- employee records,
- sensitive location data,
- consent requirements.
Good feature engineering also includes deciding what not to use.
4. Does The Numeric Value Have Meaningful Magnitude?
Some values look numeric but should not be treated as quantities.
Examples:
- ZIP code,
- customer ID,
- product ID,
- phone number,
- category code.
These values may be identifiers or labels. Treating them as numbers can mislead the model because the distance between two IDs is usually meaningless.
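A common alternative is to keep such values as strings and encode them as categories. The sketch below is a minimal one-hot encoder written for illustration; `one_hot` is a hypothetical helper, not a library function, and most ML libraries ship their own encoders.

```python
def one_hot(values):
    """Give each distinct label its own 0/1 column."""
    vocabulary = sorted(set(values))
    index = {label: i for i, label in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        row[index[value]] = 1
        rows.append(row)
    return vocabulary, rows

# ZIP codes stay strings: no arithmetic is implied, and leading zeros survive.
zip_codes = ["10001", "94105", "10001", "02139"]
vocabulary, encoded = one_hot(zip_codes)

print(vocabulary)
for code, row in zip(zip_codes, encoded):
    print(code, row)
```

With this encoding the model sees "same ZIP" or "different ZIP" and nothing else, which is usually all an identifier can honestly say.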
5. Do We Have Enough Examples?
Rare categories can be hard for a model to learn.
If a feature has thousands of unique categories with very few examples each, consider:
- grouping rare values into `Other`,
- bucketizing,
- hashing,
- embeddings,
- using higher-level categories,
- removing the feature if it does not help.
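Two of these options, grouping and hashing, can be sketched in a few lines. The merchant names, the `min_count` threshold, and the bucket count below are all assumptions chosen for illustration; sensible values depend on your data.

```python
import hashlib
from collections import Counter

def group_rare(values, min_count=2, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

def hash_bucket(value, num_buckets=8):
    """Map a category to a stable bucket index.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would change between runs.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical merchant column with two rare values.
merchants = ["acme", "acme", "zenith", "acme", "bolt", "zenith", "quux"]

grouped = group_rare(merchants, min_count=2)
buckets = [hash_bucket(m) for m in merchants]

print(grouped)
print(buckets)
```

Grouping keeps the frequent categories readable, while hashing caps the number of columns at the cost of occasional collisions between unrelated categories.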
6. Does The Feature Survive Production?
Training data can look clean, while production data is messy.
Check:
- missing values,
- late-arriving data,
- changed category names,
- different formats,
- time zone issues,
- measurement drift,
- inconsistent source systems.
Features must work in the real prediction workflow.
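Several of these checks can run as a small validation step in front of the model. The sketch below is one possible shape for it, assuming hypothetical field names and a hypothetical training vocabulary for the weather category; it normalizes timestamps to UTC for consistency, although local time may be the better choice when the feature is meant to capture daily patterns.

```python
from datetime import datetime, timedelta, timezone

KNOWN_WEATHER = {"clear", "rain", "snow"}  # hypothetical training vocabulary

def prepare_row(raw):
    """Return (clean_row, problems) for one incoming prediction request."""
    problems = []
    clean = {}

    # Missing values: record the problem instead of failing silently.
    distance = raw.get("trip_distance_km")
    if distance is None:
        problems.append("trip_distance_km missing")
    clean["trip_distance_km"] = distance

    # Changed names and formats: normalize, then check the vocabulary.
    weather = (raw.get("weather") or "").strip().lower()
    if weather not in KNOWN_WEATHER:
        problems.append(f"unknown weather category: {weather!r}")
        weather = "unknown"
    clean["weather"] = weather

    # Time zones: refuse naive timestamps, convert the rest to UTC.
    pickup = raw.get("pickup_time")
    if pickup is None or pickup.tzinfo is None:
        problems.append("pickup_time missing or without time zone")
        clean["pickup_hour_utc"] = None
    else:
        clean["pickup_hour_utc"] = pickup.astimezone(timezone.utc).hour

    return clean, problems

raw_row = {
    "trip_distance_km": None,
    "weather": " Rain ",
    "pickup_time": datetime(2024, 3, 1, 23, 30,
                            tzinfo=timezone(timedelta(hours=-5))),
}
clean_row, problems = prepare_row(raw_row)
print(clean_row)
print(problems)
```

Logging the `problems` list over time also gives an early signal of drift, for example a new category name that starts appearing in one source system.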
Example Feature Review
| Candidate feature | Good? | Why |
|---|---|---|
| Trip distance | Yes | Related to the fare, with meaningful magnitude |
| Pickup hour | Yes | Captures time pattern |
| Final fare amount | No | This is the target, so using it is leakage |
| Driver ID | Maybe | Could overfit or create fairness issues |
| Customer phone number | No | Identifier and sensitive |
| Weather category | Maybe | Useful if available at prediction time |
FAQ
What makes a machine learning feature good?
A good machine learning feature is relevant to the prediction goal, available at prediction time, represented correctly, ethically usable, and supported by enough training examples.
What is feature leakage?
Feature leakage happens when a model is trained with information that would not be available when the model makes a real prediction.
Bottom Line
Good feature selection is a disciplined review process. Choose features that are relevant, available, ethical, stable, and proven by evaluation.