Use Vertex AI TensorBoard with custom training

When you use custom training to train models, you can set up your training job to automatically upload your TensorBoard logs to Vertex AI TensorBoard.

You can use this integration to monitor your training in near real time, because Vertex AI TensorBoard streams in the logs as they are written to Cloud Storage.

For initial setup, see Set up for Vertex AI TensorBoard.

Changes to your training script

Your training script must write TensorBoard logs to a Cloud Storage bucket. The Vertex AI training service automatically makes the location of this bucket available through the predefined environment variable AIP_TENSORBOARD_LOG_DIR.

You can usually do this by passing os.environ['AIP_TENSORBOARD_LOG_DIR'] as the log directory to the open source TensorBoard log-writing APIs. The location of AIP_TENSORBOARD_LOG_DIR is typically set with the staging_bucket variable.
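As a minimal sketch of that lookup, the helper below reads AIP_TENSORBOARD_LOG_DIR and falls back to a local directory when the variable is unset, for example when you run the script outside a Vertex AI training job. The function name resolve_log_dir and the fallback path ./logs are illustrative assumptions, not part of any Vertex AI API.

```python
import os


def resolve_log_dir(default: str = "./logs") -> str:
    """Return the directory that TensorBoard logs should be written to.

    Hypothetical helper: AIP_TENSORBOARD_LOG_DIR is read from the
    environment; the local default is an assumption for runs where the
    variable is not set.
    """
    log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "").strip()
    # An unset or empty variable falls back to the local default.
    return log_dir or default


if __name__ == "__main__":
    print(f"Writing TensorBoard logs to: {resolve_log_dir()}")
```

You could then pass the returned value wherever your logging API expects a log directory.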

To configure your training script in TensorFlow 2.x, create a TensorBoard callback and set its log_dir argument to os.environ['AIP_TENSORBOARD_LOG_DIR']. Then include the TensorBoard callback in the callbacks list that you pass to model.fit.

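  import os

  import tensorflow as tf

  # Write TensorBoard logs to the location that Vertex AI provides.
  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
      histogram_freq=1,
  )

  # model, x_train, y_train, x_test, y_test, and epochs are defined
  # elsewhere in your training script.
  model.fit(
      x=x_train,
      y=y_train,
      epochs=epochs,
      validation_data=(x_test, y_test),
      callbacks=[tensorboard_callback],
  )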

Learn more about how Vertex AI sets environment variables in your custom training environment.

Create a custom training job

The following example shows how to create your own custom training job.

For a detailed example of how to create a custom training job, see Hello custom training. For steps to build custom training containers, see Create a custom container image for training.

To create a custom training job, use either the Vertex AI SDK for Python or REST.

Python


from typing import List, Optional, Union

from google.cloud import aiplatform


def create_training_pipeline_custom_job_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    model_serving_container_image_uri: str,
    dataset_id: Optional[str] = None,
    model_display_name: Optional[str] = None,
    args: Optional[List[Union[str, float, int]]] = None,
    replica_count: int = 0,
    machine_type: str = "n1-standard-4",
    accelerator_type: str = "ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count: int = 0,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
    sync: bool = True,
    tensorboard_resource_name: Optional[str] = None,
    service_account: Optional[str] = None,
):
    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)

    job = aiplatform.CustomTrainingJob(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        model_serving_container_image_uri=model_serving_container_image_uri,
    )

    # This example uses an ImageDataset, but you can use another type
    dataset = aiplatform.ImageDataset(dataset_id) if dataset_id else None

    model = job.run(
        dataset=dataset,
        model_display_name=model_display_name,
        args=args,
        replica_count=replica_count,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        training_fraction_split=training_fraction_split,
        validation_fraction_split=validation_fraction_split,
        test_fraction_split=test_fraction_split,
        sync=sync,
        tensorboard=tensorboard_resource_name,
        service_account=service_account,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    print(model.uri)
    return model

  • project: Your project ID. You can find this ID on the Google Cloud console welcome page.
  • location: The region to run the CustomJob in. This should be the same region as the provided TensorBoard instance.
  • staging_bucket: The Cloud Storage bucket to stage artifacts during API calls, including TensorBoard logs.
  • display_name: Display name of the custom training job.
  • script_path: The path, relative to the working directory on your local file system, to the script that is the entry point for your training code.
  • container_uri: The URI of the training container image. This can be a Vertex AI prebuilt training container or a custom container.
  • model_serving_container_image_uri: The URI of the model serving container suitable for serving the model produced by the training script.
  • dataset_id: The ID number for the dataset to use for training.
  • model_display_name: Display name of the trained model.
  • args: Command line arguments to be passed to the Python script.
  • replica_count: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
  • machine_type: The type of VM to use. For a list of supported VMs, see Machine types.
  • accelerator_type: The type of GPU to attach to each VM in the resource pool. For a list of supported GPUs, see GPUs.
  • accelerator_count: The number of GPUs to attach to each VM in the resource pool. In this sample, the default value is 0.
  • training_fraction_split: The fraction of the dataset to use to train your model.
  • validation_fraction_split: The fraction of the dataset to use to validate your model.
  • test_fraction_split: The fraction of the dataset to use to evaluate your model.
  • sync: Whether to execute this method synchronously.
  • tensorboard_resource_name: The resource name of the Vertex AI TensorBoard instance to which the CustomJob uploads TensorBoard logs.
  • service_account: Required when running with TensorBoard. See Create a service account with required permissions.
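Because location should match the region of the TensorBoard instance, it can help to check this before you submit the job. The following sketch assumes the full resource name format projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID; the function name check_tensorboard_region and the sample values are hypothetical and are not part of the Vertex AI SDK.

```python
import re

# Assumed resource name format:
# projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID
_TENSORBOARD_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/tensorboards/(?P<tensorboard_id>[^/]+)$"
)


def check_tensorboard_region(tensorboard_resource_name: str, location: str) -> str:
    """Return the resource name if its region equals `location`.

    Hypothetical pre-flight check; raises ValueError on a malformed name
    or a region mismatch.
    """
    match = _TENSORBOARD_NAME.match(tensorboard_resource_name)
    if match is None:
        raise ValueError(
            f"Unexpected TensorBoard resource name: {tensorboard_resource_name!r}"
        )
    if match.group("location") != location:
        raise ValueError(
            f"TensorBoard region {match.group('location')!r} does not match "
            f"job region {location!r}"
        )
    return tensorboard_resource_name


if __name__ == "__main__":
    # Placeholder values for illustration only.
    name = "projects/my-project/locations/us-central1/tensorboards/1234567890"
    print(check_tensorboard_region(name, "us-central1"))
```

The validated name could then be passed as tensorboard_resource_name to the sample function above.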

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: Your region.
  • PROJECT_ID: Your project ID.
  • TENSORBOARD_INSTANCE_NAME: (Required) The full name of the existing Vertex AI TensorBoard instance that stores your Vertex AI TensorBoard logs:
    projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID
    Note: If the TensorBoard instance doesn't exist, creating the custom job fails with a 404 error.
  • GCS_BUCKET_NAME: "${PROJECT_ID}-tensorboard-logs-${LOCATION_ID}"
  • USER_SA_EMAIL: (Required) The service account created in previous steps, or your own service account. "USER_SA_NAME@${PROJECT_ID}.iam.gserviceaccount.com"
  • TRAINING_CONTAINER: The URI of the training container image.
  • INVOCATION_TIMESTAMP: "$(date +'%Y%m%d-%H%M%S')"
  • JOB_NAME: "tensorboard-example-job-${INVOCATION_TIMESTAMP}"
  • BASE_OUTPUT_DIR: (Required) The Cloud Storage path where all the output of the training is written. "gs://$GCS_BUCKET_NAME/$JOB_NAME"
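The derived values in this list follow simple string patterns. The sketch below reproduces them in Python, assuming the same naming scheme as the shell-style expressions above; the function name build_job_values and the sample project and region are hypothetical.

```python
from datetime import datetime
from typing import Dict, Optional


def build_job_values(
    project_id: str, location_id: str, now: Optional[datetime] = None
) -> Dict[str, str]:
    """Derive the bucket name, job name, and output directory placeholders.

    Hypothetical helper that mirrors the patterns in the list above.
    """
    now = now or datetime.now()
    # Same format as: date +'%Y%m%d-%H%M%S'
    invocation_timestamp = now.strftime("%Y%m%d-%H%M%S")
    gcs_bucket_name = f"{project_id}-tensorboard-logs-{location_id}"
    job_name = f"tensorboard-example-job-{invocation_timestamp}"
    return {
        "GCS_BUCKET_NAME": gcs_bucket_name,
        "INVOCATION_TIMESTAMP": invocation_timestamp,
        "JOB_NAME": job_name,
        "BASE_OUTPUT_DIR": f"gs://{gcs_bucket_name}/{job_name}",
    }


if __name__ == "__main__":
    # Placeholder project and region for illustration only.
    for key, value in build_job_values("my-project", "us-central1").items():
        print(f"{key}={value}")
```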

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": "TRAINING_CONTAINER"
        }
      }
    ],
    "baseOutputDirectory": {
      "outputUriPrefix": "BASE_OUTPUT_DIR"
    },
    "serviceAccount": "USER_SA_EMAIL",
    "tensorboard": "TENSORBOARD_INSTANCE_NAME"
  }
}
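If you prefer to assemble the request programmatically, the following sketch builds the URL and JSON body from the placeholder values. It only constructs the request and does not send it or handle authentication. The function name build_custom_job_request is hypothetical, and the body uses the camelCase field names that appear in the response later in this section.

```python
import json
from typing import Any, Dict, Tuple


def build_custom_job_request(
    project_id: str,
    location_id: str,
    job_name: str,
    training_container: str,
    base_output_dir: str,
    user_sa_email: str,
    tensorboard_instance_name: str,
) -> Tuple[str, Dict[str, Any]]:
    """Return the customJobs URL and request body (hypothetical helper)."""
    # Google API endpoints are served over HTTPS.
    url = (
        f"https://{location_id}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location_id}/customJobs"
    )
    body = {
        "displayName": job_name,
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "replicaCount": "1",
                    "machineSpec": {"machineType": "n1-standard-8"},
                    "containerSpec": {"imageUri": training_container},
                }
            ],
            "baseOutputDirectory": {"outputUriPrefix": base_output_dir},
            "serviceAccount": user_sa_email,
            "tensorboard": tensorboard_instance_name,
        },
    }
    return url, body


if __name__ == "__main__":
    # Placeholder values for illustration only.
    url, body = build_custom_job_request(
        "my-project", "us-central1", "tensorboard-example-job",
        "TRAINING_CONTAINER", "gs://my-bucket/tensorboard-example-job",
        "my-sa@my-project.iam.gserviceaccount.com",
        "projects/my-project/locations/us-central1/tensorboards/123",
    )
    print(url)
    print(json.dumps(body, indent=2))
```

Serializing the dictionary with json.dumps avoids hand-written JSON mistakes such as trailing commas or unquoted strings.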

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID",
  "displayName": "DISPLAY_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "replicaCount": "1",
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "IMAGE_URI"
        }
      }
    ],
    "serviceAccount": "SERVICE_ACCOUNT",
    "baseOutputDirectory": {
      "outputUriPrefix": "OUTPUT_URI_PREFIX"
    },
    "tensorboard": "projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/tensorboard-id"
  },
  "state": "JOB_STATE_PENDING",
  "createTime": "CREATE-TIME",
  "updateTime": "UPDATE-TIME"
}
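A response like this can be reduced to the few fields you usually need for follow-up calls. The sketch below extracts the job ID from the name field and reports whether the job has reached a final state. The function name summarize_custom_job is hypothetical, and the set of states treated as final is an assumption made for illustration, so check it against the API reference before relying on it.

```python
from typing import Any, Dict

# Assumption: states treated as final for the purposes of this sketch.
FINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def summarize_custom_job(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the job ID, state, and TensorBoard name from a response."""
    # The job ID is the last path segment of the "name" field.
    custom_job_id = response["name"].rsplit("/", 1)[-1]
    state = response.get("state", "JOB_STATE_UNSPECIFIED")
    return {
        "custom_job_id": custom_job_id,
        "state": state,
        "is_final": state in FINAL_STATES,
        "tensorboard": response.get("jobSpec", {}).get("tensorboard"),
    }


if __name__ == "__main__":
    # Trimmed-down version of the response above, with placeholder values.
    sample = {
        "name": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID",
        "jobSpec": {"tensorboard": "TENSORBOARD_INSTANCE_NAME"},
        "state": "JOB_STATE_PENDING",
    }
    print(summarize_custom_job(sample))
```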

What's next