AI Platform
- What are Jobs? Jobs are long-running operations that are processed asynchronously. AI Platform Training offers training jobs and batch prediction jobs.
- What is a model? In AI Platform Training, a model is a logical container for individual versions of your machine learning solution. Depending on context, "model" can also refer to a trained model, a saved model, or a model version.
- What does it mean to stage a training application? You stage your training application package in a Cloud Storage bucket that your project has access to. This enables the training service to access the package and copy it to all of the training instances.
- What does it mean to stage a saved model? If you deploy a saved model that was trained elsewhere, you stage it in a Cloud Storage bucket that your project has access to. This enables the online prediction service to access the model and deploy it. If you deploy custom code for prediction (beta), you additionally stage the custom code package in Cloud Storage so the online prediction service can access it during deployment.
- What does it mean to deploy a saved model? You deploy a model version when you create a version resource. You specify an exported model (a saved model directory) and a model resource to assign the version to, and AI Platform Prediction hosts the version so that you can send prediction requests to it.
- What types of Jobs does AI Platform offer? Training jobs and batch prediction jobs.
- What is a long-running operation? AI Platform Training has three long-running operations: (1) creating a version, (2) deleting a model, and (3) deleting a version. Of these, only creating a version is likely to take much time to complete. Deleting a model or a version typically completes in near real time.
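A client usually handles a long-running operation by polling it until it reports completion. The following is a minimal sketch of that pattern, not the AI Platform client itself: `fetch_operation` is a hypothetical callable standing in for whatever API call returns the operation's current state, and the `done`, `error`, and `response` fields are assumed to follow the general shape of Google Cloud long-running operation resources.

```python
import time


def wait_for_operation(fetch_operation, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll until the operation reports done, then return its response.

    fetch_operation is a hypothetical zero-argument callable that returns
    the operation's current state as a dict.
    """
    waited = 0.0
    while True:
        operation = fetch_operation()
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"Operation failed: {operation['error']}")
            return operation.get("response")
        if waited >= timeout:
            raise TimeoutError("Operation did not finish within the timeout")
        sleep(poll_interval)
        waited += poll_interval


# Simulated operation (illustration only): not done for two polls, then done.
_states = iter([
    {"name": "create-version", "done": False},
    {"name": "create-version", "done": False},
    {"name": "create-version", "done": True, "response": {"state": "READY"}},
])
result = wait_for_operation(lambda: next(_states), poll_interval=0.0)
```

Creating a version is the case where a loop like this matters most, since the two delete operations usually finish almost immediately.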
- What are the components of AI Platform? Training Service, Prediction Service, Data Labeling Service
- What is Data Labeling Service? It lets you request human labeling for a dataset that you plan to use to train a custom machine learning model. You can submit a request to label your video, image, or text data.
- What is AI Explanations? AI Explanations helps you understand your model's outputs for classification and regression tasks. Whenever you request a prediction on AI Platform, AI Explanations tells you how much each feature in the data contributed to the predicted result.
- What methods does AI Platform offer for feature attributions? Sampled Shapley, integrated gradients, and XRAI
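To make the idea behind Sampled Shapley concrete, here is a minimal sketch of the general permutation-sampling approximation of Shapley values. It is an illustration under stated assumptions, not the AI Explanations implementation: `toy_model`, the all-zero `baseline`, and the `num_paths` parameter name are all chosen here for the example.

```python
import random


def sampled_shapley(model, instance, baseline, num_paths=50, seed=0):
    """Approximate each feature's Shapley value by sampling feature orderings."""
    rng = random.Random(seed)
    n = len(instance)
    attributions = [0.0] * n
    for _ in range(num_paths):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)          # start from the baseline input
        previous_output = model(current)
        for i in order:
            current[i] = instance[i]      # switch feature i to its real value
            output = model(current)
            attributions[i] += output - previous_output  # marginal contribution
            previous_output = output
    return [a / num_paths for a in attributions]


def toy_model(x):
    """Hypothetical linear model used only for this example."""
    return 2.0 * x[0] + 3.0 * x[1] - 1.0 * x[2]


attributions = sampled_shapley(
    toy_model, instance=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0]
)
```

For a linear model each feature's contribution does not depend on the ordering, so the attributions come out as weight times (value minus baseline). They also sum to the difference between the model's output on the instance and on the baseline.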
- What is continuous evaluation? Continuous evaluation regularly samples prediction input and output from trained machine learning models that you have deployed to AI Platform Prediction. AI Platform Data Labeling Service then assigns human reviewers to provide ground truth labels for your prediction input; alternatively, you can provide your own ground truth labels. Data Labeling Service compares your models' predictions with the ground truth labels to provide continual feedback on how your model is performing over time.
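The comparison step in continuous evaluation can be pictured with a toy example. The sketch below assumes plain accuracy as a stand-in metric and uses made-up sample data; the metrics the service actually reports may differ.

```python
def accuracy(predictions, ground_truth):
    """Fraction of sampled predictions that match the ground truth labels."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        raise ValueError("at least one sampled prediction is required")
    matches = sum(p == t for p, t in zip(predictions, ground_truth))
    return matches / len(predictions)


# Hypothetical samples from two evaluation periods: (predictions, ground truth).
daily_samples = {
    "day_1": (["cat", "dog", "dog", "cat"], ["cat", "dog", "cat", "cat"]),
    "day_2": (["dog", "dog", "cat", "cat"], ["cat", "dog", "dog", "cat"]),
}
accuracy_over_time = {
    day: accuracy(preds, truth) for day, (preds, truth) in daily_samples.items()
}
```

Tracking a value like this per sampling period is what lets you see whether a deployed model version is drifting.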
- Which performance metrics for your model versions can you monitor using AI Platform? Number of predictions, error rate, and latency.
- Where does batch prediction output its predictions? A Cloud Storage location that you specify.
- What does a training cluster look like? A master node, worker nodes, and parameter servers.
- What are the three training strategies on AI Platform?
  - Data-parallel training with synchronous updates.
  - Data-parallel training with asynchronous updates.
  - Model-parallel training.
- In data-parallel processing, what is shared with all worker nodes? The model
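Data-parallel training with synchronous updates can be sketched in a few lines: every worker holds the same model, computes a gradient on its own shard of the data, and the gradients are averaged before a single shared update. The example below is a toy illustration in plain Python with a one-parameter model and two hypothetical workers; it is not how the training service is implemented.

```python
def shard_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one worker's shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(weight, shards, learning_rate):
    """One synchronous update: all workers use the same weight, then gradients are averaged."""
    gradients = [shard_gradient(weight, shard) for shard in shards]  # one per worker
    averaged = sum(gradients) / len(gradients)
    return weight - learning_rate * averaged


# Toy data following y = 3x, split evenly across two hypothetical workers.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

weight = 0.0
for _ in range(100):
    weight = synchronous_step(weight, shards, learning_rate=0.01)
```

With equally sized shards, the averaged gradient equals the gradient over the full dataset, so the result matches single-machine training. In the asynchronous variant, each worker would apply its gradient as soon as it is ready instead of waiting for the others.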
- What are the requirements for using GPUs for prediction? A Compute Engine (N1) machine type, a TensorFlow SavedModel, and a region where GPUs are available.
- What can you do to prevent VM restarts from messing with the training job? To ensure that your training job is resilient to these restarts, save model checkpoints regularly and configure your job to restore the most recent checkpoint.
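The checkpoint-and-restore pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: it writes JSON files to a local temporary directory (a real job would more likely write to Cloud Storage through its training framework), and `save_checkpoint`, `restore_latest`, and `train` are hypothetical helpers named for this example.

```python
import json
import os
import tempfile


def save_checkpoint(checkpoint_dir, step, state):
    """Write the training state for a step to its own checkpoint file."""
    path = os.path.join(checkpoint_dir, f"ckpt-{step:08d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)


def restore_latest(checkpoint_dir):
    """Return the most recent checkpoint, or None if training starts fresh."""
    names = sorted(n for n in os.listdir(checkpoint_dir) if n.startswith("ckpt-"))
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1])) as f:
        return json.load(f)


def train(checkpoint_dir, total_steps, checkpoint_every=10, stop_after=None):
    """Toy loop: resumes from the latest checkpoint and saves regularly."""
    checkpoint = restore_latest(checkpoint_dir)
    step = checkpoint["step"] if checkpoint else 0
    state = checkpoint["state"] if checkpoint else {"loss": 100.0}
    while step < total_steps:
        step += 1
        state["loss"] *= 0.9  # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(checkpoint_dir, step, state)
        if stop_after is not None and step >= stop_after:
            return step  # simulate the VM restarting mid-training
    return step


checkpoint_dir = tempfile.mkdtemp()
train(checkpoint_dir, total_steps=50, stop_after=25)   # interrupted at step 25
resumed_from = restore_latest(checkpoint_dir)["step"]  # latest saved step
final_step = train(checkpoint_dir, total_steps=50)     # resumes and finishes
```

The work done between the last checkpoint and the restart is lost and repeated, which is why the checkpoint interval is a trade-off between overhead and lost progress.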