Training a model is only one part of machine learning. After a model is trained and validated, it must serve predictions. Then the team must monitor whether the model continues to behave well in production.
This guide explains batch prediction, online prediction, and model monitoring in simple terms.
Quick Answer
Use batch prediction when you need many predictions at once and do not need an instant response. Use online prediction when an application needs a fast prediction through an endpoint. Use model monitoring to detect changes in production data, training-serving skew, drift, and behavior that may reduce model quality.
Key Takeaways
- Batch prediction is better for offline or scheduled prediction jobs.
- Online prediction is better for real-time applications.
- Pre-built containers can simplify prediction serving.
- Custom containers are useful when serving needs custom logic.
- Model monitoring helps detect skew and drift after launch.
- Alert thresholds should match the business risk of the model.
Batch Prediction
Batch prediction is used when many prediction requests can be processed together.
Examples:
- score all customers overnight,
- predict demand for next week,
- classify a large set of documents,
- update risk scores in a table,
- generate recommendations for many users.
Batch prediction is usually asynchronous. You submit a job and review results when the job finishes.
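To make the pattern concrete, here is a minimal sketch of a batch scoring job in plain Python. It is not tied to any particular platform: `toy_model`, the `spend` field, and the chunk size are hypothetical stand-ins chosen for illustration. The idea it shows is that records are scored in chunks and the results are collected for later review, rather than returned to a waiting caller.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence

Record = Dict[str, Any]


def chunked(records: Sequence[Record], size: int) -> Iterator[Sequence[Record]]:
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def toy_model(chunk: Sequence[Record]) -> List[float]:
    """Hypothetical stand-in for a real model: spend / 100, capped at 1.0."""
    return [min(record["spend"] / 100.0, 1.0) for record in chunk]


def run_batch_job(
    records: Sequence[Record],
    predict_fn: Callable[[Sequence[Record]], List[float]],
    chunk_size: int = 1000,
) -> List[Record]:
    """Score every record chunk by chunk and return one result row per record."""
    results: List[Record] = []
    for chunk in chunked(records, chunk_size):
        scores = predict_fn(chunk)
        if len(scores) != len(chunk):
            raise ValueError("model returned a different number of scores than inputs")
        for record, score in zip(chunk, scores):
            results.append({"id": record["id"], "score": score})
    return results


# Example: score 25 customers in chunks of 10.
customers = [{"id": i, "spend": i * 10} for i in range(25)]
scored = run_batch_job(customers, toy_model, chunk_size=10)
```

In a real job the results would typically be written to a table or file instead of kept in memory, and the chunk size would be tuned for throughput and cost.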
Online Prediction
Online prediction is used when an application needs a quick response.
Examples:
- show a recommendation while a user is on a site,
- classify a support request as it arrives,
- predict fraud risk during a transaction,
- estimate delivery time during checkout.
Online prediction usually uses a deployed model endpoint: the application sends a request and receives the prediction in the same call.
Batch vs Online Prediction
| Question | Batch prediction | Online prediction |
|---|---|---|
| Response needed immediately? | No | Yes |
| Works well for many records? | Yes | Sometimes |
| Used by applications? | Usually indirectly | Yes |
| Common pattern | Scheduled job | API endpoint |
| Main concern | Throughput and cost | Latency and availability |
Serving Containers
Vertex AI can use pre-built containers or custom containers for serving.
Pre-built containers are helpful when the model format and framework are supported. They reduce setup work.
Custom containers are useful when:
- the model needs custom preprocessing,
- the serving logic goes beyond a single model call, such as post-processing outputs or combining several models,
- the framework is not supported by a pre-built container,
- the container must handle custom health checks or prediction routes.
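As a rough sketch of what a custom container runs, the example below is a small HTTP server with a health route and a prediction route, built only on the Python standard library. It assumes the `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` environment variables and the `instances` / `predictions` JSON shape described in the Vertex AI custom container documentation; check the current documentation before relying on them. The `preprocess` and `predict` functions are hypothetical placeholders, not a real model.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed to be injected by the serving platform; the defaults are for local runs.
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))


def preprocess(instance: dict) -> float:
    """Hypothetical custom preprocessing: treat a missing amount as 0."""
    return float(instance.get("amount") or 0.0)


def predict(instances: list) -> list:
    """Placeholder model: flags amounts above 1000 as high risk."""
    return [{"high_risk": preprocess(item) > 1000.0} for item in instances]


class Handler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        # Health check: report ready only on the configured route.
        if self.path == HEALTH_ROUTE:
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self) -> None:
        if self.path != PREDICT_ROUTE:
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            predictions = predict(payload["instances"])
        except (ValueError, KeyError, TypeError, AttributeError):
            self._send_json(400, {"error": "expected JSON with an 'instances' list"})
            return
        self._send_json(200, {"predictions": predictions})


def main() -> None:
    """Start the server; call this from the container entrypoint."""
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Keeping preprocessing inside the container means the same transformation runs for every request, which helps avoid one source of training-serving skew.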
What Model Monitoring Checks
Model monitoring tracks whether production inputs, predictions, and service health change after launch.
Important signals:
- training-serving skew (production inputs differ from the training data),
- feature drift (production inputs change over time),
- prediction distribution changes,
- missing input values,
- unusual input categories,
- data quality issues,
- prediction volume changes,
- latency and error rate.
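A few of these signals can be computed with very little code. The sketch below compares a categorical feature in a training baseline against a window of serving data. It uses the L-infinity distance (the largest change in any category's share) as the drift score, which is one common choice among several. The feature values and the function names are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def category_shares(values: List[Optional[str]]) -> Dict[str, float]:
    """Share of each category among the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    total = len(present)
    return {category: count / total for category, count in Counter(present).items()}


def linf_distance(baseline: Dict[str, float], current: Dict[str, float]) -> float:
    """Largest absolute difference in share across all categories."""
    keys = set(baseline) | set(current)
    return max((abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys), default=0.0)


def missing_rate(values: List[Optional[str]]) -> float:
    """Fraction of values that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def unseen_categories(baseline: Dict[str, float], current: Dict[str, float]) -> Set[str]:
    """Categories that appear in serving data but never appeared in training."""
    return set(current) - set(baseline)


# Hypothetical payment-method feature: training data vs. one day of serving data.
training = ["card"] * 70 + ["bank"] * 30
serving = ["card"] * 40 + ["bank"] * 30 + ["wallet"] * 20 + [None] * 10

base, live = category_shares(training), category_shares(serving)
skew_score = linf_distance(base, live)       # compared against training: skew
new_values = unseen_categories(base, live)   # unusual input categories
gaps = missing_rate(serving)                 # missing input values
```

Comparing serving data against the training baseline measures skew. Running the same comparison between two serving windows, such as this week and last week, measures drift.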
Alert Thresholds
Alert thresholds should not be copied blindly. They depend on the use case.
A model used for marketing recommendations may tolerate more drift than a model used for risk review or safety-sensitive decisions.
Set thresholds based on:
- business impact,
- data volatility,
- model importance,
- review capacity,
- past monitoring results,
- acceptable false alarms.
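One way to express this is a per-model threshold table with separate warning and paging levels. The sketch below is illustrative only: the model names and numeric thresholds are hypothetical and should come from your own data and risk review, not from this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Threshold:
    warn: float  # drift score that opens a ticket for review
    page: float  # drift score that notifies the model owner right away


# Hypothetical values: the higher-risk model gets tighter thresholds.
THRESHOLDS: Dict[str, Threshold] = {
    "marketing_recs": Threshold(warn=0.30, page=0.50),
    "fraud_risk": Threshold(warn=0.10, page=0.20),
}


def classify(drift_score: float, threshold: Threshold) -> str:
    """Map a drift score to 'ok', 'warn', or 'page'."""
    if drift_score >= threshold.page:
        return "page"
    if drift_score >= threshold.warn:
        return "warn"
    return "ok"


# The same drift score leads to different actions for different models.
score = 0.25
marketing_action = classify(score, THRESHOLDS["marketing_recs"])  # "ok"
fraud_action = classify(score, THRESHOLDS["fraud_risk"])          # "page"
```

Separating the warning level from the paging level lets the team keep sensitive thresholds without paging someone on every small shift.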
Practical Monitoring Workflow
- Define the model owner.
- Decide what features should be monitored.
- Set initial thresholds.
- Capture serving inputs and prediction outputs.
- Review alerts regularly.
- Investigate drift or skew.
- Retrain, roll back, or update preprocessing when needed.
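The capture step is what makes the later steps possible, so it is worth sketching. The example below writes one JSON line per prediction with a timestamp and model version. The field names and the `log_prediction` helper are assumptions made for illustration; a managed platform may offer its own request and response logging that replaces this.

```python
import io
import json
from datetime import datetime, timezone
from typing import IO, Any, Dict


def log_prediction(
    sink: IO[str],
    model_version: str,
    features: Dict[str, Any],
    prediction: Any,
) -> None:
    """Append one serving input/output pair to `sink` as a JSON line."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")


# Example: log to an in-memory buffer; a real service would use a file or log stream.
buffer = io.StringIO()
log_prediction(buffer, "v3", {"amount": 120.5, "channel": None}, 0.82)
```

Logging the model version alongside each record makes it easier to tell whether a shift came from the data or from a new deployment.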
Common Mistakes
- deploying without monitoring,
- ignoring input data changes,
- using online prediction when batch would be simpler,
- setting alert thresholds too tight or too loose,
- not assigning alert ownership,
- monitoring technical metrics but not business impact.
Bottom Line
Prediction makes the model useful, but monitoring keeps it trustworthy. Choose batch or online prediction based on the workflow, then monitor production data and behavior so the team knows when the model needs attention.