Training a model is only one part of machine learning. After a model is trained and validated, it must serve predictions. Then the team must monitor whether the model continues to behave well in production.

This guide explains batch prediction, online prediction, and model monitoring in simple terms.

Quick Answer

Use batch prediction when you need many predictions at once and do not need an instant response. Use online prediction when an application needs a fast prediction through an endpoint. Use model monitoring to detect changes in production data, training-serving skew, drift, and behavior that may reduce model quality.

Key Takeaways

  • Batch prediction is better for offline or scheduled prediction jobs.
  • Online prediction is better for real-time applications.
  • Pre-built containers can simplify prediction serving.
  • Custom containers are useful when serving needs custom logic.
  • Model monitoring helps detect skew and drift after launch.
  • Alert thresholds should match the business risk of the model.

Batch Prediction

Batch prediction is used when many prediction requests can be processed together.

Examples:

  • score all customers overnight,
  • predict demand for next week,
  • classify a large set of documents,
  • update risk scores in a table,
  • generate recommendations for many users.

Batch prediction is usually asynchronous. You submit a job and review results when the job finishes.
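
The submit-then-review pattern can be sketched in a few lines of Python. This is a minimal illustration, not a managed batch job: `run_batch_job`, `chunked`, and `toy_model` are hypothetical names, and the toy model stands in for whatever trained model would do the scoring.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(records: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def run_batch_job(
    records: Sequence[dict],
    predict_fn: Callable[[Sequence[dict]], List[float]],
    batch_size: int = 1000,
) -> List[dict]:
    """Score every record and return each input joined with its score."""
    results = []
    for batch in chunked(records, batch_size):
        scores = predict_fn(batch)  # one model call per chunk
        for record, score in zip(batch, scores):
            results.append({**record, "score": score})
    return results


# Hypothetical stand-in model: scores a customer by scaled spend.
def toy_model(batch):
    return [min(r["spend"] / 100.0, 1.0) for r in batch]


customers = [{"id": i, "spend": 10 * i} for i in range(5)]
scored = run_batch_job(customers, toy_model, batch_size=2)
```

Processing in chunks keeps memory use bounded and lets throughput be tuned through `batch_size`, which matters more here than the latency of any single prediction.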

Online Prediction

Online prediction is used when an application needs a quick response.

Examples:

  • show a recommendation while a user is on a site,
  • classify a support request as it arrives,
  • predict fraud risk during a transaction,
  • estimate delivery time during checkout.

Online prediction usually uses a deployed model endpoint.
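
From the application's side, an online call might look like the sketch below. It assumes a hypothetical `call_endpoint` function that wraps the real endpoint client; the fallback score and the two stand-in endpoints are illustrative only.

```python
import time
from typing import Callable, Dict


def score_transaction(
    transaction: Dict,
    call_endpoint: Callable[[Dict], float],
    fallback_score: float = 0.5,
) -> Dict:
    """Request one prediction and fall back to a default if the call fails."""
    started = time.perf_counter()
    try:
        score = call_endpoint(transaction)
        source = "model"
    except Exception:  # endpoint unavailable, timed out, or returned an error
        score = fallback_score
        source = "fallback"
    latency_ms = (time.perf_counter() - started) * 1000.0
    return {"score": score, "source": source, "latency_ms": latency_ms}


# Hypothetical stand-ins for a healthy and an unavailable endpoint.
def healthy_endpoint(txn):
    return 0.9 if txn["amount"] > 1000 else 0.1


def broken_endpoint(txn):
    raise ConnectionError("endpoint unavailable")


ok = score_transaction({"amount": 2500}, healthy_endpoint)
degraded = score_transaction({"amount": 2500}, broken_endpoint)
```

Because the user is waiting, the caller records latency and decides in advance what to do when the endpoint does not answer.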

Batch vs Online Prediction

Question                       Batch prediction      Online prediction
Response needed immediately?   No                    Yes
Works well for many records?   Yes                   Sometimes
Used by applications?          Usually indirectly    Yes
Common pattern                 Scheduled job         API endpoint
Main concern                   Throughput and cost   Latency and availability

Serving Containers

Vertex AI can use pre-built containers or custom containers for serving.

Pre-built containers are helpful when the model format and framework are supported. They reduce setup work.

Custom containers are useful when:

  • the model needs custom preprocessing,
  • the serving logic goes beyond a standard predict call,
  • the framework is not supported by a pre-built container,
  • the container must handle custom health checks or prediction routes.
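
As a rough sketch of what a custom container might run, the server below answers a health route and a prediction route using only the Python standard library. It assumes the `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` environment variables that Vertex AI supplies to custom containers, and the `instances` / `predictions` JSON shape. The default values, the `preprocess` step, and the cutoff "model" are hypothetical.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Fallback values are assumptions for local runs only.
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))


def preprocess(instance: dict) -> float:
    """Hypothetical custom preprocessing: clean a raw amount such as '2,500'."""
    return float(str(instance["amount"]).replace(",", ""))


def route(method: str, path: str, body: bytes) -> Tuple[int, dict]:
    """Return (status_code, response_dict) for one request."""
    if method == "GET" and path == HEALTH_ROUTE:
        return 200, {"status": "ok"}
    if method == "POST" and path == PREDICT_ROUTE:
        try:
            instances = json.loads(body)["instances"]
            amounts = [preprocess(i) for i in instances]
        except (ValueError, KeyError, TypeError):
            return 400, {"error": "expected JSON with an 'instances' list"}
        # Toy model: flag amounts above a fixed cutoff.
        return 200, {"predictions": [1 if a > 1000 else 0 for a in amounts]}
    return 404, {"error": "unknown route"}


class Handler(BaseHTTPRequestHandler):
    def _reply(self, method: str) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, payload = route(method, self.path, self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self) -> None:
        self._reply("GET")

    def do_POST(self) -> None:
        self._reply("POST")


def serve() -> None:
    """Start the server; this call blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Keeping the routing logic in a plain function (`route`) makes it testable without starting a server; calling `serve()` would be the container's entry point.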

What Model Monitoring Checks

Model monitoring helps track whether production behavior changes.

Important signals:

  • training-serving skew,
  • feature drift,
  • prediction distribution changes,
  • missing input values,
  • unusual input categories,
  • data quality issues,
  • prediction volume changes,
  • latency and error rate.
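
A few of these signals are easy to compute by hand, which helps when reading a monitoring report. The sketch below compares a training sample with a serving sample for one categorical feature. It uses the largest per-category frequency gap (an L-infinity distance) as the drift score, which is one common choice and not the only one; the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def distribution(values: List[str]) -> Dict[str, float]:
    """Relative frequency of each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def linf_distance(baseline: List[str], serving: List[str]) -> float:
    """Largest absolute difference in category frequency between two samples."""
    p, q = distribution(baseline), distribution(serving)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def missing_rate(values: List[Optional[str]]) -> float:
    """Share of values that are None."""
    return sum(v is None for v in values) / len(values)


def unseen_categories(baseline: List[str], serving: List[str]) -> Set[str]:
    """Categories that appear in serving data but never in the baseline."""
    return set(serving) - set(baseline)


# Hypothetical samples of a payment_method feature.
training = ["card"] * 6 + ["cash"] * 4
serving = ["card"] * 3 + ["cash"] * 4 + ["crypto"] * 3

skew_score = linf_distance(training, serving)
new_values = unseen_categories(training, serving)
null_share = missing_rate(["card", None, "cash", None])
```

Comparing serving data against the training sample measures training-serving skew; comparing this week's serving data against last week's with the same function would measure drift.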

Alert Thresholds

Alert thresholds should not be copied blindly. They depend on the use case.

A model used for marketing recommendations may tolerate more drift than a model used for risk review or safety-sensitive decisions.

Set thresholds based on:

  • business impact,
  • data volatility,
  • model importance,
  • review capacity,
  • past monitoring results,
  • acceptable false alarms.
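
One way to make this concrete is to store a threshold and an owner per model, so the same drift score can trigger an alert for one model and not for another. The policy values, model names, and team names below are illustrative assumptions, not recommended settings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertPolicy:
    threshold: float  # a drift score above this value triggers an alert
    owner: str        # who reviews the alert


# Hypothetical policies: the risk model tolerates less drift than marketing.
POLICIES = {
    "marketing_recs": AlertPolicy(threshold=0.30, owner="growth-team"),
    "fraud_risk": AlertPolicy(threshold=0.10, owner="risk-team"),
}


def alerts_for(model: str, drift_scores: Dict[str, float]) -> List[str]:
    """Return one alert message per feature whose score exceeds the policy."""
    policy = POLICIES[model]
    return [
        f"{model}: {feature} drift {score:.2f} > {policy.threshold:.2f} "
        f"(notify {policy.owner})"
        for feature, score in sorted(drift_scores.items())
        if score > policy.threshold
    ]


scores = {"payment_method": 0.20, "country": 0.05}
marketing_alerts = alerts_for("marketing_recs", scores)
fraud_alerts = alerts_for("fraud_risk", scores)
```

With these example values, the same scores raise no alert for the marketing model and one alert for the fraud model.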

Practical Monitoring Workflow

  1. Define the model owner.
  2. Decide which features should be monitored.
  3. Set initial thresholds.
  4. Capture serving inputs and prediction outputs.
  5. Review alerts regularly.
  6. Investigate drift or skew.
  7. Retrain, roll back, or update preprocessing when needed.
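
Step 4 is the one that most often gets skipped, because the model works without it. A minimal sketch, assuming a hypothetical `with_capture` wrapper and an in-memory log in place of a real file or table:

```python
import io
import json
import time
from typing import Callable, Dict, TextIO


def with_capture(
    predict_fn: Callable[[Dict], float], sink: TextIO
) -> Callable[[Dict], float]:
    """Wrap a predict function so every call is written to `sink` as a JSON line."""
    def wrapped(features: Dict) -> float:
        prediction = predict_fn(features)
        record = {"ts": time.time(), "features": features, "prediction": prediction}
        sink.write(json.dumps(record) + "\n")
        return prediction
    return wrapped


log = io.StringIO()  # stand-in sink; in practice a file or table writer
predict = with_capture(lambda f: 0.9 if f["amount"] > 1000 else 0.1, log)

predict({"amount": 2500})
predict({"amount": 40})

captured = [json.loads(line) for line in log.getvalue().splitlines()]
```

The captured records are what later steps compare against the training data when an alert needs investigating.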

Common Mistakes

  • deploying without monitoring,
  • ignoring input data changes,
  • using online prediction when batch would be simpler,
  • setting alert thresholds too tight or too loose,
  • not assigning alert ownership,
  • monitoring technical metrics but not business impact.

Bottom Line

Prediction makes the model useful, but monitoring keeps it trustworthy. Choose batch or online prediction based on the workflow, then monitor production data and behavior so the team knows when the model needs attention.