Creating datasets and importing images

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

  1. Create a dataset and specify whether to allow multiple labels on each item.
  2. Import data items into the dataset.
  3. Label the items.

When you import items with already-assigned labels, steps 2 and 3 are combined.

Creating a dataset

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model. When you create a dataset, you specify the type of classification you want your custom model to perform:

  • MULTICLASS assigns a single label to each classified image
  • MULTILABEL allows an image to be assigned multiple labels
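The choice between the two types ends up as a single field in the create request. A minimal sketch of building that request body is shown here; the helper name `build_create_dataset_body` and its `multi_label` flag are hypothetical, and only the JSON field names come from the request body shown in this guide.

```python
import json


def build_create_dataset_body(display_name: str, multi_label: bool) -> dict:
    """Return a create-dataset request body for image classification.

    Hypothetical helper: it only assembles the JSON payload and does not
    call the AutoML API.
    """
    classification_type = "MULTILABEL" if multi_label else "MULTICLASS"
    return {
        "displayName": display_name,
        "imageClassificationDatasetMetadata": {
            "classificationType": classification_type,
        },
    }


# Example: a single-label (MULTICLASS) dataset body, ready to save as request.json.
print(json.dumps(build_create_dataset_body("flowers", multi_label=False), indent=2))
```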

In v1 of the AutoML API, this request returns the ID of a long-running operation.

After the long-running operation completes, you can import images into the dataset. The newly created dataset doesn't contain any data until you do.

Save the dataset ID of the new dataset (from the response) for use with other operations, such as importing images into your dataset and training a model.
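As a rough illustration of saving the dataset ID, the sketch below splits a dataset resource name into its parts. It assumes the name form `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}` that appears in the samples in this guide; the function name `parse_dataset_name` is hypothetical.

```python
import re

# Assumed resource-name layout, taken from the samples in this guide.
_DATASET_NAME = re.compile(
    r"^projects/([^/]+)/locations/([^/]+)/datasets/([^/]+)$"
)


def parse_dataset_name(name: str) -> dict:
    """Split a dataset resource name into project, location, and dataset ID."""
    match = _DATASET_NAME.match(name)
    if match is None:
        raise ValueError(f"not a dataset resource name: {name!r}")
    project_id, location, dataset_id = match.groups()
    return {
        "project_id": project_id,
        "location": location,
        "dataset_id": dataset_id,
    }


parts = parse_dataset_name(
    "projects/my-project/locations/us-central1/datasets/ICN5496445433112696489"
)
print(parts["dataset_id"])
```

Validating the whole name, rather than only taking the last path element, catches the case where an operation name is passed in by mistake.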

Web UI

  1. Open the Vision Dashboard.

    You can also access this page from the console via the left navigation menu item Artificial Intelligence > Vision. This will take you to the integrated Vision dashboard. Select the AutoML Vision card.

  2. Select Datasets from the left navigation menu.

  3. Select the New Dataset button at the top, update the dataset name (optional), and select single-label or multi-label classification based on the data you have.

  4. After specifying the classification type, select Create Dataset.

  5. On the Create Dataset page you can choose a CSV file from Google Cloud Storage, or local image files to import into the dataset.

    Select Continue to begin importing images into your dataset. While the import runs, the dataset shows the status Running: Importing images.

  6. You receive an email when the import has finished.


REST

The following example creates a dataset that supports one label per item (see MULTICLASS).

The newly created dataset doesn't contain any data until you import items into it.

Save the "name" of the new dataset (from the response) for use with other operations, such as importing items into your dataset and training a model.

Before using any of the request data, make the following replacements:

  • project-id: your GCP project ID.
  • display-name: a string display name of your choosing.

HTTP method and URL:


Request JSON body:

{
  "displayName": "DISPLAY_NAME",
  "imageClassificationDatasetMetadata": {
    "classificationType": "MULTICLASS"
  }
}

To send your request, choose one of these options:


curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: project-id" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \


PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "project-id" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "" | Select-Object -Expand Content

You should see output similar to the following. You can use the operation ID (ICN3819960680614725486, in this case) to get the status of the task. For an example, see Working with long-running operations:

{
  "name": "projects/PROJECT_ID/locations/us-central1/operations/ICN3819960680614725486",
  "metadata": {
    "@type": "",
    "createTime": "2019-11-14T16:49:13.667526Z",
    "updateTime": "2019-11-14T16:49:13.667526Z",
    "createDatasetDetails": {}
  }
}

After the long-running operation completes, you can get the dataset's ID with the same operation status request. The response should look similar to the following:

{
  "name": "projects/PROJECT_ID/locations/us-central1/operations/ICN3819960680614725486",
  "metadata": {
    "@type": "",
    "createTime": "2019-11-14T16:49:13.667526Z",
    "updateTime": "2019-11-14T16:49:17.975314Z",
    "createDatasetDetails": {}
  },
  "done": true,
  "response": {
    "@type": "",
    "name": "projects/PROJECT_ID/locations/us-central1/datasets/ICN5496445433112696489"
  }
}
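Reading the dataset name out of an operation status response can be sketched as follows. This is an illustrative helper only, not part of any client library: the function name `dataset_name_from_operation` is hypothetical, and it assumes the response has already been parsed into a dictionary with the `done`, `error`, and `response` fields used in this guide.

```python
from typing import Optional


def dataset_name_from_operation(operation: dict) -> Optional[str]:
    """Return the created dataset's name, or None if the operation is still running.

    Raises RuntimeError if the operation finished with an error.
    """
    if not operation.get("done", False):
        return None  # Still running: request the status again later.
    if "error" in operation:
        error = operation["error"]
        raise RuntimeError(
            f"operation failed ({error.get('code')}): {error.get('message')}"
        )
    return operation["response"]["name"]


finished = {
    "name": "projects/PROJECT_ID/locations/us-central1/operations/ICN3819960680614725486",
    "done": True,
    "response": {
        "name": "projects/PROJECT_ID/locations/us-central1/datasets/ICN5496445433112696489"
    },
}
print(dataset_name_from_operation(finished))
```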


Go

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	"cloud.google.com/go/automl/apiv1/automlpb"
)

// visionClassificationCreateDataset creates a dataset for image classification.
func visionClassificationCreateDataset(w io.Writer, projectID string, location string, datasetName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetName := "dataset_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	req := &automlpb.CreateDatasetRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Dataset: &automlpb.Dataset{
			DisplayName: datasetName,
			DatasetMetadata: &automlpb.Dataset_ImageClassificationDatasetMetadata{
				ImageClassificationDatasetMetadata: &automlpb.ImageClassificationDatasetMetadata{
					// Specify the classification type:
					// - MULTILABEL: Multiple labels are allowed for one example.
					// - MULTICLASS: At most one label is allowed per example.
					ClassificationType: automlpb.ClassificationType_MULTILABEL,
				},
			},
		},
	}

	// CreateDataset returns a long-running operation.
	op, err := client.CreateDataset(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDataset: %w", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	// Block until the operation completes and returns the new dataset.
	dataset, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("Wait: %w", err)
	}

	fmt.Fprintf(w, "Dataset name: %v\n", dataset.GetName())

	return nil
}


Java

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.ClassificationType;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.ImageClassificationDatasetMetadata;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.OperationMetadata;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

class VisionClassificationCreateDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String displayName = "YOUR_DATASET_NAME";
    createDataset(projectId, displayName);
  }

  // Create a dataset
  static void createDataset(String projectId, String displayName)
      throws IOException, ExecutionException, InterruptedException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // A resource that represents Google Cloud Platform location.
      LocationName projectLocation = LocationName.of(projectId, "us-central1");

      // Specify the classification type
      // Types:
      // MultiLabel: Multiple labels are allowed for one example.
      // MultiClass: At most one label is allowed per example.
      ClassificationType classificationType = ClassificationType.MULTILABEL;
      ImageClassificationDatasetMetadata metadata =
          ImageClassificationDatasetMetadata.newBuilder()
              .setClassificationType(classificationType)
              .build();
      Dataset dataset =
          Dataset.newBuilder()
              .setDisplayName(displayName)
              .setImageClassificationDatasetMetadata(metadata)
              .build();
      OperationFuture<Dataset, OperationMetadata> future =
          client.createDatasetAsync(projectLocation, dataset);

      Dataset createdDataset = future.get();

      // Display the dataset information.
      System.out.format("Dataset name: %s\n", createdDataset.getName());
      // Other methods require the dataset ID, which you have to parse out of the `name` field.
      // Name Form: `projects/{project_id}/locations/{location_id}/datasets/{dataset_id}`
      String[] names = createdDataset.getName().split("/");
      String datasetId = names[names.length - 1];
      System.out.format("Dataset id: %s\n", datasetId);
    }
  }
}


Node.js

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const displayName = 'YOUR_DISPLAY_NAME';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1;

// Instantiates a client
const client = new AutoMlClient();

async function createDataset() {
  // Construct request
  // Specify the classification type
  // Types:
  // MultiLabel: Multiple labels are allowed for one example.
  // MultiClass: At most one label is allowed per example.
  const request = {
    parent: client.locationPath(projectId, location),
    dataset: {
      displayName: displayName,
      imageClassificationDatasetMetadata: {
        classificationType: 'MULTILABEL',
      },
    },
  };

  // Create dataset
  const [operation] = await client.createDataset(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();

  console.log(`Dataset name: ${response.name}`);
  // The dataset ID is the last element of the dataset name.
  console.log(`Dataset id: ${response.name.split('/').pop()}`);
}

createDataset();



Python

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = "YOUR_PROJECT_ID"
# display_name = "your_datasets_display_name"

client = automl.AutoMlClient()

# A resource that represents Google Cloud Platform location.
project_location = f"projects/{project_id}/locations/us-central1"
# Specify the classification type
# Types:
# MultiLabel: Multiple labels are allowed for one example.
# MultiClass: At most one label is allowed per example.
metadata = automl.ImageClassificationDatasetMetadata(
    classification_type=automl.ClassificationType.MULTILABEL
)
dataset = automl.Dataset(
    display_name=display_name,
    image_classification_dataset_metadata=metadata,
)

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(
    parent=project_location, dataset=dataset, timeout=300
)

# Wait for the long-running operation to finish.
created_dataset = response.result()

# Display the dataset information
print(f"Dataset name: {created_dataset.name}")
print("Dataset id: {}".format(created_dataset.name.split("/")[-1]))

Importing items into a dataset

After you have created a dataset, you can import item URIs and labels for items from a CSV file stored in a Google Cloud Storage bucket. For details on preparing your data and creating a CSV file for import, see Preparing your training data.

You can import items into an empty dataset or import additional items into an existing dataset.
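To make the CSV input more concrete, here is a small sketch that writes import rows from (URI, labels) pairs. The row layout, a `gs://` image URI followed by zero or more labels, is an assumption not stated on this page; check Preparing your training data for the authoritative format. The function name `build_import_csv` is hypothetical.

```python
import csv
import io
from typing import Iterable, Sequence, Tuple


def build_import_csv(items: Iterable[Tuple[str, Sequence[str]]]) -> str:
    """Render (uri, labels) pairs as CSV text, one image per row.

    Assumed row layout: gs://bucket/path,label1,label2,...
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for uri, labels in items:
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a Google Cloud Storage URI, got {uri!r}")
        writer.writerow([uri, *labels])
    return buffer.getvalue()


csv_text = build_import_csv(
    [
        ("gs://my-bucket/daisy1.jpg", ["daisy"]),
        ("gs://my-bucket/unlabeled.jpg", []),  # Can be labeled later in the UI.
    ]
)
print(csv_text)
```

The resulting text would still need to be uploaded to a Cloud Storage bucket before it can be referenced in an import request.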

Web UI

The AutoML Vision UI enables you to create a new dataset and import items into it from the same page; see Creating a dataset. The steps below import items into an existing dataset.

  1. Open the Vision Dashboard and select the dataset from the Datasets page.

  2. On the Images page, click Add items in the title bar and select the import method from the drop-down list.

    You can:

    • Upload a .csv file that contains the training images and their associated category labels from your local computer or from Google Cloud Storage.

    • Upload .txt or .zip files that contain the training images from your local computer.

  3. Select the file(s) to import.


REST

Before using any of the request data, make the following replacements:

  • project-id: your GCP project ID.
  • dataset-id: the ID of your dataset. The ID is the last element of the name of your dataset. For example:
    • dataset name: projects/project-id/locations/location-id/datasets/3104518874390609379
    • dataset id: 3104518874390609379
  • input-storage-path: the path to a CSV file stored on Google Cloud Storage. The requesting user must have at least read permission to the bucket.

HTTP method and URL:


Request JSON body:

{
  "inputConfig": {
    "gcsSource": {
      "inputUris": [INPUT_STORAGE_PATH]
    }
  }
}
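The import request body can also be assembled programmatically. The sketch below mirrors the client samples in this guide, which split a comma-separated path string into several input URIs; the helper name `build_import_body` is hypothetical and the `gs://` check is an added safeguard, not an API requirement stated here.

```python
def build_import_body(paths: str) -> dict:
    """Build an import-data request body from comma-separated Cloud Storage paths."""
    uris = [part.strip() for part in paths.split(",") if part.strip()]
    if not uris:
        raise ValueError("at least one input URI is required")
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a gs:// URI, got {uri!r}")
    return {"inputConfig": {"gcsSource": {"inputUris": uris}}}


body = build_import_body("gs://my-bucket/train.csv, gs://my-bucket/extra.csv")
print(body["inputConfig"]["gcsSource"]["inputUris"])
```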

To send your request, choose one of these options:


curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: project-id" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \


PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "project-id" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "" | Select-Object -Expand Content

You should see output similar to the following. You can use the operation ID (OPERATION_ID in this example) to get the status of the task. For an example, see Working with long-running operations.

{
  "name": "projects/PROJECT_ID/locations/us-central1/operations/OPERATION_ID",
  "metadata": {
    "@type": "",
    "createTime": "2018-10-29T15:56:29.176485Z",
    "updateTime": "2018-10-29T15:56:29.176485Z",
    "importDataDetails": {}
  }
}


Go

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	"cloud.google.com/go/automl/apiv1/automlpb"
)

// importDataIntoDataset imports data into a dataset.
func importDataIntoDataset(w io.Writer, projectID string, location string, datasetID string, inputURI string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetID := "TRL123456789..."
	// inputURI := "gs://BUCKET_ID/path_to_training_data.csv"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	req := &automlpb.ImportDataRequest{
		Name: fmt.Sprintf("projects/%s/locations/%s/datasets/%s", projectID, location, datasetID),
		InputConfig: &automlpb.InputConfig{
			Source: &automlpb.InputConfig_GcsSource{
				GcsSource: &automlpb.GcsSource{
					InputUris: []string{inputURI},
				},
			},
		},
	}

	// ImportData returns a long-running operation.
	op, err := client.ImportData(ctx, req)
	if err != nil {
		return fmt.Errorf("ImportData: %w", err)
	}
	fmt.Fprintf(w, "Processing operation name: %q\n", op.Name())

	// Block until the import finishes.
	if err := op.Wait(ctx); err != nil {
		return fmt.Errorf("Wait: %w", err)
	}

	fmt.Fprintf(w, "Data imported.\n")

	return nil
}


Java

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.DatasetName;
import com.google.cloud.automl.v1.GcsSource;
import com.google.cloud.automl.v1.InputConfig;
import com.google.cloud.automl.v1.OperationMetadata;
import com.google.protobuf.Empty;
import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ImportDataset {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "YOUR_PROJECT_ID";
    String datasetId = "YOUR_DATASET_ID";
    String path = "gs://BUCKET_ID/path_to_training_data.csv";
    importDataset(projectId, datasetId, path);
  }

  // Import a dataset
  static void importDataset(String projectId, String datasetId, String path)
      throws IOException, ExecutionException, InterruptedException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // Get the complete path of the dataset.
      DatasetName datasetFullId = DatasetName.of(projectId, "us-central1", datasetId);

      // Get multiple Google Cloud Storage URIs to import data from
      GcsSource gcsSource =
          GcsSource.newBuilder().addAllInputUris(Arrays.asList(path.split(","))).build();

      // Import data from the input URI
      InputConfig inputConfig = InputConfig.newBuilder().setGcsSource(gcsSource).build();
      System.out.println("Processing import...");

      // Start the import job
      OperationFuture<Empty, OperationMetadata> operation =
          client.importDataAsync(datasetFullId, inputConfig);

      System.out.format("Operation name: %s%n", operation.getName());

      // If you want to wait for the operation to finish, adjust the timeout appropriately. The
      // operation will still run if you choose not to wait for it to complete. You can check the
      // status of your operation using the operation's name.
      Empty response = operation.get(45, TimeUnit.MINUTES);
      System.out.format("Dataset imported. %s%n", response);
    } catch (TimeoutException e) {
      System.out.println("The operation's polling period was not long enough.");
      System.out.println("You can use the Operation's name to get the current status.");
      System.out.println("The import job is still running and will complete as expected.");
      throw e;
    }
  }
}


Node.js

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const datasetId = 'YOUR_DATASET_ID';
// const path = 'gs://BUCKET_ID/path_to_training_data.csv';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1;

// Instantiates a client
const client = new AutoMlClient();

async function importDataset() {
  // Construct request
  const request = {
    name: client.datasetPath(projectId, location, datasetId),
    inputConfig: {
      gcsSource: {
        // A comma-separated path string becomes a list of input URIs.
        inputUris: path.split(','),
      },
    },
  };

  // Import dataset
  console.log('Processing import');
  const [operation] = await client.importData(request);

  // Wait for operation to complete.
  const [response] = await operation.promise();
  console.log(`Dataset imported: ${response}`);
}

importDataset();



Python

Before trying this sample, follow the setup instructions for this language on the Client Libraries page.

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# project_id = "YOUR_PROJECT_ID"
# dataset_id = "YOUR_DATASET_ID"
# path = "gs://YOUR_BUCKET_ID/path/to/data.csv"

client = automl.AutoMlClient()
# Get the full path of the dataset.
dataset_full_id = client.dataset_path(project_id, "us-central1", dataset_id)
# Get the multiple Google Cloud Storage URIs
input_uris = path.split(",")
gcs_source = automl.GcsSource(input_uris=input_uris)
input_config = automl.InputConfig(gcs_source=gcs_source)
# Import data from the input URI
response = client.import_data(name=dataset_full_id, input_config=input_config)

print("Processing import...")
print(f"Data imported. {response.result()}")

Labeling training items

To be useful for training a model, each item in a dataset must have at least one category label assigned to it. AutoML Vision ignores items without a category label. You can provide labels for your training items in three ways:

  1. Include labels in your .csv file
  2. Label your items in the AutoML Vision UI
  3. Request labeling from a human labeling service, such as Google AI Platform Data Labeling Service.
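Before training, it can help to check which items would actually be used. The following sketch sorts (URI, labels) pairs into usable and unlabeled items, and, for a MULTICLASS dataset, flags items that carry more than one label. The function name `check_labels` and the in-memory item format are hypothetical; this is not how AutoML Vision itself performs the check.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def check_labels(
    items: Iterable[Tuple[str, Sequence[str]]], multi_label: bool
) -> Dict[str, List[str]]:
    """Group item URIs by whether their labels make them usable for training."""
    report: Dict[str, List[str]] = {
        "usable": [],
        "unlabeled": [],
        "too_many_labels": [],
    }
    for uri, labels in items:
        if not labels:
            # Items without a category label are ignored during training.
            report["unlabeled"].append(uri)
        elif not multi_label and len(labels) > 1:
            # MULTICLASS allows at most one label per item.
            report["too_many_labels"].append(uri)
        else:
            report["usable"].append(uri)
    return report


items = [
    ("gs://my-bucket/a.jpg", ["daisy"]),
    ("gs://my-bucket/b.jpg", []),
    ("gs://my-bucket/c.jpg", ["rose", "red"]),
]
print(check_labels(items, multi_label=False))
```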

Labeling in the UI

Web UI

To label items in the AutoML Vision UI, select the dataset from the Datasets listing page to see its details.

The side bar summarizes the number of labeled and unlabeled items. Here you can filter the item list by label, or select Add new label to create a new label.

From this screen, you can also select an image to add or change its label.

Request labeling

You can use Google's AI Platform Data Labeling Service to label your images. See the product documentation for more information.

Working with long-running operations

You can get the status of a long-running operation by using the following code samples.
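Checking status repeatedly until the operation is done amounts to a polling loop. A minimal sketch follows, assuming a caller-supplied `fetch` function that returns the operation as a dictionary (for example, by wrapping one of the status requests in this section). The names `wait_for_operation`, `fetch`, `poll_seconds`, and `timeout_seconds` are all hypothetical.

```python
import time
from typing import Callable


def wait_for_operation(
    fetch: Callable[[str], dict],
    name: str,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll `fetch(name)` until the operation reports done, then return it.

    Raises RuntimeError if the finished operation carries an error, and
    TimeoutError if it is still running after `timeout_seconds`.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        operation = fetch(name)
        if operation.get("done"):
            if "error" in operation:
                raise RuntimeError(f"operation failed: {operation['error']}")
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {name} did not finish in time")
        sleep(poll_seconds)


# Example with a fake fetch function that finishes on the second poll.
_fake_states = iter([{"name": "op"}, {"name": "op", "done": True, "response": {}}])
result = wait_for_operation(lambda _: next(_fake_states), "op", sleep=lambda _: None)
print(result["done"])
```

Passing `fetch` and `sleep` in as parameters keeps the loop independent of any particular HTTP or client library and makes it easy to test.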


REST

Before using any of the request data, make the following replacements:

  • project-id: your GCP project ID.
  • operation-id: the ID of your operation. The ID is the last element of the name of your operation. For example:
    • operation name: projects/project-id/locations/location-id/operations/IOD5281059901324392598
    • operation id: IOD5281059901324392598

HTTP method and URL:


To send your request, choose one of these options:


curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "x-goog-user-project: project-id" \


PowerShell

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "project-id" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "" | Select-Object -Expand Content

You should see output similar to the following for a completed import operation:

{
  "name": "projects/PROJECT_ID/locations/us-central1/operations/OPERATION_ID",
  "metadata": {
    "@type": "",
    "createTime": "2018-10-29T15:56:29.176485Z",
    "updateTime": "2018-10-29T16:10:41.326614Z",
    "importDataDetails": {}
  },
  "done": true,
  "response": {
    "@type": ""
  }
}

You should see output similar to the following for a completed create model operation:

{
  "name": "projects/PROJECT_ID/locations/us-central1/operations/OPERATION_ID",
  "metadata": {
    "@type": "",
    "createTime": "2019-07-22T18:35:06.881193Z",
    "updateTime": "2019-07-22T19:58:44.972235Z",
    "createModelDetails": {}
  },
  "done": true,
  "response": {
    "@type": "",
    "name": "projects/PROJECT_ID/locations/us-central1/models/MODEL_ID"
  }
}


Go

Before trying this sample, follow the setup instructions for this language on the APIs & Reference > Client Libraries page.

import (
	"context"
	"fmt"
	"io"

	automl "cloud.google.com/go/automl/apiv1"
	"cloud.google.com/go/automl/apiv1/automlpb"
)

// getOperationStatus gets an operation's status.
// This sample starts a model-creation operation and then waits on it.
func getOperationStatus(w io.Writer, projectID string, location string, datasetID string, modelName string) error {
	// projectID := "my-project-id"
	// location := "us-central1"
	// datasetID := "ICN123456789..."
	// modelName := "model_display_name"

	ctx := context.Background()
	client, err := automl.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	req := &automlpb.CreateModelRequest{
		Parent: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		Model: &automlpb.Model{
			DisplayName: modelName,
			DatasetId:   datasetID,
			ModelMetadata: &automlpb.Model_ImageClassificationModelMetadata{
				ImageClassificationModelMetadata: &automlpb.ImageClassificationModelMetadata{
					TrainBudgetMilliNodeHours: 1000, // 1000 milli-node hours are 1 hour
				},
			},
		},
	}

	op, err := client.CreateModel(ctx, req)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Name: %v\n", op.Name())

	// Wait for the long-running operation to complete.
	resp, err := op.Wait(ctx)
	if err != nil && !op.Done() {
		fmt.Println("failed to fetch operation status", err)
		return err
	}
	if err != nil && op.Done() {
		fmt.Println("operation completed with error", err)
		return err
	}
	fmt.Fprintf(w, "Response: %v\n", resp)

	return nil
}


Java

Before trying this sample, follow the setup instructions for this language on the APIs & Reference > Client Libraries page.

import com.google.cloud.automl.v1.AutoMlClient;
import com.google.longrunning.Operation;
import java.io.IOException;

class GetOperationStatus {

  static void getOperationStatus() throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String operationFullId = "projects/[projectId]/locations/us-central1/operations/[operationId]";
    getOperationStatus(operationFullId);
  }

  // Get the status of an operation
  static void getOperationStatus(String operationFullId) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (AutoMlClient client = AutoMlClient.create()) {
      // Get the latest state of a long-running operation.
      Operation operation = client.getOperationsClient().getOperation(operationFullId);

      // Display operation details.
      System.out.println("Operation details:");
      System.out.format("\tName: %s\n", operation.getName());
      System.out.format("\tMetadata Type Url: %s\n", operation.getMetadata().getTypeUrl());
      System.out.format("\tDone: %s\n", operation.getDone());
      if (operation.hasResponse()) {
        System.out.format("\tResponse Type Url: %s\n", operation.getResponse().getTypeUrl());
      }
      if (operation.hasError()) {
        System.out.println("\tResponse:");
        System.out.format("\t\tError code: %s\n", operation.getError().getCode());
        System.out.format("\t\tError message: %s\n", operation.getError().getMessage());
      }
    }
  }
}


Node.js

Before trying this sample, follow the setup instructions for this language on the APIs & Reference > Client Libraries page.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'us-central1';
// const operationId = 'YOUR_OPERATION_ID';

// Imports the Google Cloud AutoML library
const {AutoMlClient} = require('@google-cloud/automl').v1;

// Instantiates a client
const client = new AutoMlClient();

async function getOperationStatus() {
  // Construct request
  const request = {
    name: `projects/${projectId}/locations/${location}/operations/${operationId}`,
  };

  const [response] = await client.operationsClient.getOperation(request);

  console.log(`Name: ${response.name}`);
  console.log('Operation details:');
  console.log(response);
}

getOperationStatus();



Python

Before trying this sample, follow the setup instructions for this language on the APIs & Reference > Client Libraries page.

from google.cloud import automl

# TODO(developer): Uncomment and set the following variables
# operation_full_id = \
#     "projects/[projectId]/locations/us-central1/operations/[operationId]"

client = automl.AutoMlClient()
# Get the latest state of a long-running operation.
response = client._transport.operations_client.get_operation(operation_full_id)

print(f"Name: {response.name}")
print("Operation details:")
print(response)

