Good Concerns for ML System Production

Today a colleague recommended an article published by Google about good concerns for ML system production. Most of the points in the article are genuinely helpful in real production. I decided to summarize the bullets here so I can review them easily and remind myself how to design and implement ML-related products with less technical debt. I will keep adding my own notes under individual bullets as I encounter similar concerns in my work.

Data

  1. Feature expectations are captured in a schema.
  2. All features are beneficial.
  3. No feature's cost is too much.
  4. Features adhere to meta-level requirements.
  5. The data pipeline has appropriate privacy controls.
  6. New features can be added quickly.
  7. All feature code is tested.
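To make the first bullet concrete, here is a minimal sketch of capturing feature expectations in a schema and validating incoming rows against it. The schema format, feature names, and helper function are my own illustration, not from the article.

```python
# Hypothetical schema: feature name -> (expected type, optional (min, max) bounds).
SCHEMA = {
    "age":     (int,   (0, 130)),
    "income":  (float, (0.0, None)),
    "country": (str,   None),
}

def validate_row(row, schema=SCHEMA):
    """Return a list of violations for one feature row (empty list = valid)."""
    problems = []
    for name, (expected_type, bounds) in schema.items():
        if name not in row:
            problems.append(f"missing feature: {name}")
            continue
        value = row[name]
        if not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}, "
                            f"got {type(value).__name__}")
            continue
        if bounds is not None:
            lo, hi = bounds
            if lo is not None and value < lo:
                problems.append(f"{name}: {value} below minimum {lo}")
            if hi is not None and value > hi:
                problems.append(f"{name}: {value} above maximum {hi}")
    return problems
```

Running this check in the data pipeline turns silent feature drift (an upstream change sending ages of 999, say) into a visible, testable failure.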

Model

  1. Every model specification undergoes a code review and is checked in to a repository.
  2. Offline proxy metrics correlate with actual online impact metrics.
  3. All hyperparameters have been tuned.
  4. The impact of model staleness is known.
  5. A simpler model is not better.
  6. Model quality is sufficient on all important data slices.
  7. The model has been tested for considerations of inclusion.
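Bullet 6 above (quality on all important slices) can be checked with a simple per-slice metric computation. This sketch uses accuracy and a flat threshold for illustration; the slice keys and threshold are assumptions, and a real system would pick metrics and thresholds per product.

```python
from collections import defaultdict

def accuracy_by_slice(examples, threshold=0.8):
    """Compute accuracy per data slice and flag slices below `threshold`.

    `examples` is an iterable of (slice_key, label, prediction) tuples.
    Returns (accuracy_per_slice, failing_slices).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_key, label, prediction in examples:
        total[slice_key] += 1
        correct[slice_key] += int(label == prediction)
    accuracy = {k: correct[k] / total[k] for k in total}
    failing = {k: v for k, v in accuracy.items() if v < threshold}
    return accuracy, failing
```

The point of slicing is that a model can look fine on aggregate metrics while being badly wrong for one country, device type, or user segment.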

Infrastructure

  1. Training is reproducible.
  2. Model specification code is unit tested.
  3. The full ML pipeline is integration tested.
  4. Model quality is validated before attempting to serve it.
  5. The model allows debugging by observing the step-by-step computation of training or inference on a simple example.
  6. Models are tested via canary process before they enter production serving environments.
  7. Models can be quickly and safely rolled back to a previous serving version.
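Bullet 4 above (validate quality before serving) is often implemented as a promotion gate: the candidate model ships only if it does not regress against the current serving baseline. The metric names and tolerance below are hypothetical, a sketch of the idea rather than any particular system.

```python
def should_promote(candidate_metrics, baseline_metrics, max_regression=0.01):
    """Return True only if no tracked metric regresses by more than
    `max_regression` relative to the current serving baseline.
    Missing metrics fail closed: the candidate is rejected.
    """
    for name, baseline in baseline_metrics.items():
        candidate = candidate_metrics.get(name)
        if candidate is None:
            return False  # candidate did not report this metric
        if candidate < baseline - max_regression:
            return False  # regression beyond tolerance
    return True
```

Pairing a gate like this with canarying (bullet 6) and fast rollback (bullet 7) means a bad model is caught either before serving or very shortly after.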

Monitoring

  1. Dependency changes result in notification.
  2. Data invariants hold in training and serving inputs.
  3. Training and serving features compute the same value.
  4. Models are not too stale.
  5. The model is numerically stable.
  6. The model has not experienced dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage.
  7. The model has not experienced a regression in prediction quality on served data.
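Bullet 3 above (training and serving features compute the same value) is usually monitored as training/serving skew. A minimal sketch, assuming summary statistics from each pipeline are available: compare per-feature means and flag large relative differences. A real monitor would compare full distributions, not just means; the tolerance is an illustrative assumption.

```python
def detect_skew(train_stats, serve_stats, rel_tolerance=0.05):
    """Return the features whose training-pipeline mean differs from the
    serving-pipeline mean by more than `rel_tolerance` (relative), or
    which are missing from the serving side entirely.
    """
    skewed = []
    for name, train_mean in train_stats.items():
        serve_mean = serve_stats.get(name)
        if serve_mean is None:
            skewed.append(name)  # feature missing at serving time
            continue
        denom = max(abs(train_mean), 1e-12)  # avoid division by zero
        if abs(train_mean - serve_mean) / denom > rel_tolerance:
            skewed.append(name)
    return skewed
```

Skew like this is easy to introduce whenever feature code is duplicated between a batch training path and an online serving path, which is why the checklist calls it out explicitly.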