Try Gemini 1.5 Pro, our most advanced multimodal model in Vertex AI, and see what you can build with a 1M token context window. Try Gemini 1.5 Pro, our most advanced multimodal model in Vertex AI, and see what you can build with a 1M token context window.

Use Vertex AI TensorBoard with custom training

To see examples of enabling TensorBoard for a custom training job that uses custom containers, run the following Jupyter notebooks in the environment of your choice:

"Vertex AI TensorBoard custom training with prebuilt containers":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench user-managed notebooks | View on GitHub
"Vertex AI TensorBoard custom training with custom containers":
Open in Colab | Open in Colab Enterprise | Open in Vertex AI Workbench user-managed notebooks | View on GitHub

When using custom training to train models, you can set up your training job to automatically upload your Vertex AI TensorBoard logs to Vertex AI TensorBoard.

You can use this integration to monitor your training in near real time as Vertex AI TensorBoard streams in Vertex AI TensorBoard logs as they are written to Cloud Storage.

For initial setup see Set up for Vertex AI TensorBoard.

Changes to your training script

Your training script must be configured to write TensorBoard logs to the Cloud Storage bucket, the location of which the Vertex AI Training Service will automatically make available through a predefined environment variable AIP_TENSORBOARD_LOG_DIR.

This can usually be done by providing os.environ['AIP_TENSORBOARD_LOG_DIR'] as the log directory to the open source TensorBoard log writing APIs. The location of the AIP_TENSORBOARD_LOG_DIR is typically set with the staging_bucket variable.

To configure your training script in TensorFlow 2.x, create a TensorBoard callback and set the log_dir variable to os.environ['AIP_TENSORBOARD_LOG_DIR'] The TensorBoard callback is then included in the TensorFlow model.fit callbacks list.

  tensorboard_callback = tf.keras.callbacks.TensorBoard(
       log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
       histogram_freq=1
  )
  
  model.fit(
       x=x_train,
       y=y_train,
       epochs=epochs,
       validation_data=(x_test, y_test),
       callbacks=[tensorboard_callback],
  )

Learn more about how Vertex AI sets environment variables in your custom training environment.

Create a custom training job

The following example shows how to create your own custom training job.

For a detailed example of how to create a custom training job, see Hello custom training. For steps to build custom training containers, see Create a custom container image for training.

To create a custom training job use either Vertex AI SDK for Python or REST.

Python

def create_training_pipeline_custom_job_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    script_path: str,
    container_uri: str,
    model_serving_container_image_uri: str,
    dataset_id: Optional[str] = None,
    model_display_name: Optional[str] = None,
    args: Optional[List[Union[str, float, int]]] = None,
    replica_count: int = 0,
    machine_type: str = "n1-standard-4",
    accelerator_type: str = "ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count: int = 0,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
    sync: bool = True,
    tensorboard_resource_name: Optional[str] = None,
    service_account: Optional[str] = None,
):
    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)

    job = aiplatform.CustomTrainingJob(
        display_name=display_name,
        script_path=script_path,
        container_uri=container_uri,
        model_serving_container_image_uri=model_serving_container_image_uri,
    )

    # This example uses an ImageDataset, but you can use another type
    dataset = aiplatform.ImageDataset(dataset_id) if dataset_id else None

    model = job.run(
        dataset=dataset,
        model_display_name=model_display_name,
        args=args,
        replica_count=replica_count,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
        training_fraction_split=training_fraction_split,
        validation_fraction_split=validation_fraction_split,
        test_fraction_split=test_fraction_split,
        sync=sync,
        tensorboard=tensorboard_resource_name,
        service_account=service_account,
    )

    model.wait()

    print(model.display_name)
    print(model.resource_name)
    print(model.uri)
    return model

project: Your project ID. You can find these IDs in the Google Cloud console welcome page.
location: The region to run the CustomJob in. This should be the same region as the provided TensorBoard instance.
staging_bucket: The Cloud Storage bucket to stage artifacts during API calls, including TensorBoard logs.
display_name: Display name of the custom training job.
script_path: The path, relative to the working directory on your local file system, to the script that is the entry point for your training code.
container_uri: The URI of the training container image can be Vertex AI. prebuilt training container or a custom container.
model_serving_container_image_uri: The URI of the model serving container suitable for serving the model produced by the training script.
dataset_id: The ID number for the dataset to use for training.
model_display_name: Display name of the trained model.
args: Command line arguments to be passed to the Python script.
replica_count: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
machine_type: The type of VM to use. For a list of supported VMs, see Machine types
accelerator_type: The type of GPU to attach to each VM in the resource pool. For a list of supported GPUs, see GPUs.
accelerator_count The number of GPUs to attach to each VM in the resource pool. The default the value is 1.
training_fraction_split: The fraction of the dataset to use to train your model.
validation_fraction_split: The fraction of the dataset to use to validate your model.
test_fraction_split: The fraction of the dataset to use to evaluate your model.
sync: Whether to execute this method synchronously.
tensorboard_resource_name: The resource name of the Vertex TensorBoard instance to which CustomJob will upload TensorBoard logs.
service_account: Required when running with TensorBoard. See Create a service account with required permissions.

REST

Before using any of the request data, make the following replacements:

LOCATION_ID: Your region.
PROJECT_ID: Your project ID.
TENSORBOARD_INSTANCE_NAME: (Obligatory) The full name of the existing Vertex AI TensorBoard instance storing your Vertex AI TensorBoard logs:
projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID
Note: If the tensorboard instance is not an existing one, the customJobs creation throws a 404.
GCS_BUCKET_NAME: "${PROJECT_ID}-tensorboard-logs-${LOCATION}"
USER_SA_EMAIL: (Obligatory) The service account created in previous steps, or your own service account. "USER_SA_NAME@${PROJECT_ID}.iam.gserviceaccount.com"
TRAINING_CONTAINER: TRAINING_CONTAINER.
INVOCATION_TIMESTAMP: "$(date +'%Y%m%d-%H%M%S')"
JOB_NAME: "tensorboard-example-job-${INVOCATION_TIMESTAMP}"
BASE_OUTPUT_DIR: (Obligatory) the Google Cloud path where all the output of the training is written to. "gs://$GCS_BUCKET_NAME/$JOB_NAME"

HTTP method and URL:

POST http://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs

Request JSON body:

{
"displayName": JOB_NAME,
"jobSpec":{
"workerPoolSpecs":[
  {
    "replicaCount": "1",
     "machineSpec": {
        "machineType": "n1-standard-8",
      },
      "containerSpec": {
        "imageUri": TRAINING_CONTAINER,
      }
    }
  ],
  
  "base_output_directory": {
  "output_uri_prefix": BASE_OUTPUT_DIR,
   },
  "serviceAccount": USER_SA_EMAIL,
  "tensorboard": TENSORBOARD_INSTANCE_NAME,
  }
}

To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login , or by using Cloud Shell, which automatically logs you into the gcloud CLI . You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
"displayName": JOB_NAME,
"jobSpec":{
"workerPoolSpecs":[
  {
    "replicaCount": "1",
     "machineSpec": {
        "machineType": "n1-standard-8",
      },
      "containerSpec": {
        "imageUri": TRAINING_CONTAINER,
      }
    }
  ],
  
  "base_output_directory": {
  "output_uri_prefix": BASE_OUTPUT_DIR,
   },
  "serviceAccount": USER_SA_EMAIL,
  "tensorboard": TENSORBOARD_INSTANCE_NAME,
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "http://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login . You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
"displayName": JOB_NAME,
"jobSpec":{
"workerPoolSpecs":[
  {
    "replicaCount": "1",
     "machineSpec": {
        "machineType": "n1-standard-8",
      },
      "containerSpec": {
        "imageUri": TRAINING_CONTAINER,
      }
    }
  ],
  
  "base_output_directory": {
  "output_uri_prefix": BASE_OUTPUT_DIR,
   },
  "serviceAccount": USER_SA_EMAIL,
  "tensorboard": TENSORBOARD_INSTANCE_NAME,
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "http://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID",
  "displayName": "DISPLAY_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "replicaCount": "1",
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "IMAGE_URI"
        }
      }
    ],
    "serviceAccount": "SERVICE_ACCOUNT",
    "baseOutputDirectory": {
      "outputUriPrefix": "OUTPUT_URI_PREFIX"
    },
    "tensorboard": "projects//locations/LOCATION_ID/tensorboards/tensorboard-id"
  },
  "state": "JOB_STATE_PENDING",
  "createTime": "CREATE-TIME",
  "updateTime": "UPDATE-TIME"
}

What's next

Check out View Vertex AI TensorBoard.
Learn how to optimize the performance of your custom training jobs using Vertex AI TensorBoard Profiler.