gcloud dataproc batches submit pyspark

gcloud dataproc batches submit pyspark - submit a PySpark batch job
gcloud dataproc batches submit pyspark MAIN_PYTHON_FILE [--archives=[ARCHIVE,…]] [--async] [--batch=BATCH] [--container-image=CONTAINER_IMAGE] [--deps-bucket=DEPS_BUCKET] [--files=[FILE,…]] [--history-server-cluster=HISTORY_SERVER_CLUSTER] [--jars=[JAR,…]] [--kms-key=KMS_KEY] [--labels=[KEY=VALUE,…]] [--metastore-service=METASTORE_SERVICE] [--properties=[PROPERTY=VALUE,…]] [--py-files=[PY,…]] [--region=REGION] [--request-id=REQUEST_ID] [--service-account=SERVICE_ACCOUNT] [--staging-bucket=STAGING_BUCKET] [--tags=[TAGS,…]] [--ttl=TTL] [--version=VERSION] [--network=NETWORK | --subnet=SUBNET] [GCLOUD_WIDE_FLAG] [-- JOB_ARG …]
Submit a PySpark batch job.
To submit a PySpark batch job called "my-batch" that runs "my-pyspark.py", run:
gcloud dataproc batches submit pyspark my-pyspark.py --batch=my-batch --deps-bucket=gs://my-bucket --region=us-central1 --py-files='path/to/my/python/script.py'
MAIN_PYTHON_FILE
URI of the main Python file to use as the Spark driver. Must be a .py file.
[-- JOB_ARG …]
Arguments to pass to the driver.

The '--' argument must be specified between gcloud specific args on the left and JOB_ARG on the right.
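
The placement of the '--' separator can be illustrated with a small Python sketch that assembles the argument list before handing it to a process launcher. The helper name build_submit_command and the driver arguments --input and --mode are hypothetical; only the gcloud flags come from the synopsis above.

```python
import shlex

def build_submit_command(main_file, gcloud_flags, job_args):
    """Assemble an argv list for `gcloud dataproc batches submit pyspark`."""
    argv = ["gcloud", "dataproc", "batches", "submit", "pyspark", main_file]
    # gcloud-specific flags go on the left of the separator.
    for name, value in gcloud_flags.items():
        argv.append(f"--{name}={value}")
    # Everything to the right of `--` is a JOB_ARG passed to the driver.
    if job_args:
        argv.append("--")
        argv.extend(job_args)
    return argv

argv = build_submit_command(
    "my-pyspark.py",
    {"batch": "my-batch", "region": "us-central1", "deps-bucket": "gs://my-bucket"},
    ["--input", "gs://my-bucket/in", "--mode", "full"],  # hypothetical driver args
)
print(shlex.join(argv))
```

Building a list rather than a single string avoids shell-quoting problems when the list is passed to something like subprocess.run.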

--archives=[ARCHIVE,…]
Archives to be extracted into the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
--async
Return immediately without waiting for the operation in progress to complete.
--batch=BATCH
The ID of the batch job to submit. The ID must contain only lowercase letters (a-z), numbers (0-9), and hyphens (-), and must be between 4 and 63 characters long. If this argument is not provided, a randomly generated UUID is used.
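
A batch ID can be checked locally before submission. The following is a minimal Python sketch of the character and length rules stated above; the function name is_valid_batch_id is hypothetical, and the service may enforce further constraints that are not described here.

```python
import re

# Lowercase letters, digits, and hyphens only; 4 to 63 characters in total.
_BATCH_ID_RE = re.compile(r"[a-z0-9-]{4,63}")

def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the documented character and length rules."""
    return _BATCH_ID_RE.fullmatch(batch_id) is not None

print(is_valid_batch_id("my-batch"))
print(is_valid_batch_id("My_Batch"))
```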
--container-image=CONTAINER_IMAGE
Optional custom container image to use for the batch/session runtime environment. If not specified, a default container image will be used. The value should follow the container image naming format: {registry}/{repository}/{name}:{tag}, for example, gcr.io/my-project/my-image:1.2.3
--deps-bucket=DEPS_BUCKET
A Cloud Storage bucket to upload workload dependencies.
--files=[FILE,…]
Files to be placed in the working directory.
--history-server-cluster=HISTORY_SERVER_CLUSTER
Spark History Server configuration for the batch/session job. Resource name of an existing Dataproc cluster to act as a Spark History Server for the workload in the format: "projects/{project_id}/regions/{region}/clusters/{cluster_name}".
--jars=[JAR,…]
Comma-separated list of jar files to be provided to the classpaths.
--kms-key=KMS_KEY
Cloud KMS key to use for encryption.
--labels=[KEY=VALUE,…]
List of label KEY=VALUE pairs to add.

Keys must start with a lowercase character and contain only hyphens (-), underscores (_), lowercase characters, and numbers. Values must contain only hyphens (-), underscores (_), lowercase characters, and numbers.
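
The label rules above can be sketched as a small Python helper that validates a mapping and renders it as a --labels argument. The name format_labels is hypothetical, and the sketch assumes "lowercase characters" means ASCII a-z; it does not check any length limits the service may apply.

```python
import re

# Keys: start with a lowercase letter, then lowercase letters, digits, '_' or '-'.
_KEY_RE = re.compile(r"[a-z][a-z0-9_-]*")
# Values: lowercase letters, digits, '_' or '-' only.
_VALUE_RE = re.compile(r"[a-z0-9_-]*")

def format_labels(labels: dict) -> str:
    """Validate labels and render them as a single --labels=KEY=VALUE,... argument."""
    for key, value in labels.items():
        if not _KEY_RE.fullmatch(key):
            raise ValueError(f"invalid label key: {key!r}")
        if not _VALUE_RE.fullmatch(value):
            raise ValueError(f"invalid label value: {value!r}")
    return "--labels=" + ",".join(f"{k}={v}" for k, v in labels.items())

print(format_labels({"env": "dev", "team": "data_eng"}))
```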

--metastore-service=METASTORE_SERVICE
Name of a Dataproc Metastore service to be used as an external metastore in the format: "projects/{project-id}/locations/{region}/services/{service-name}".
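
The two resource-name formats (history server cluster and metastore service) differ in their second path segment: the cluster name uses "regions" while the metastore name uses "locations". A minimal Python sketch that builds both follows; the function names and the example project, cluster, and service IDs are hypothetical.

```python
def history_server_cluster(project_id: str, region: str, cluster_name: str) -> str:
    """Resource name for --history-server-cluster (note the 'regions' segment)."""
    return f"projects/{project_id}/regions/{region}/clusters/{cluster_name}"

def metastore_service(project_id: str, region: str, service_name: str) -> str:
    """Resource name for --metastore-service (note the 'locations' segment)."""
    return f"projects/{project_id}/locations/{region}/services/{service_name}"

# Hypothetical identifiers, for illustration only.
print(history_server_cluster("my-project", "us-central1", "my-phs-cluster"))
print(metastore_service("my-project", "us-central1", "my-metastore"))
```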
--properties=[PROPERTY=VALUE,…]
Specifies configuration properties for the workload. See Dataproc Serverless for Spark documentation for the list of supported properties.
--py-files=[PY,…]
Comma-separated list of Python scripts to be passed to the PySpark framework. Supported file types: .py, .egg and .zip.
Region resource - Dataproc region to use. Each Dataproc region constitutes an independent resource namespace constrained to deploying instances into Compute Engine zones inside the region. This represents a Cloud resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.

To set the project attribute:

  • provide the argument --region on the command line with a fully specified name;
  • set the property dataproc/region with a fully specified name;
  • provide the argument --project on the command line;
  • set the property core/project.
--region=REGION
ID of the region or fully qualified identifier for the region.

To set the region attribute:

  • provide the argument --region on the command line;
  • set the property dataproc/region.
--request-id=REQUEST_ID
A unique ID that identifies the request. If the service receives two batch create requests with the same request_id, the second request is ignored and the operation that corresponds to the first batch created and stored in the backend is returned. Recommendation: Always set this value to a UUID. The value must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). The maximum length is 40 characters.
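
Following the recommendation to use a UUID, a request ID can be generated with Python's standard uuid module. The sketch below also checks the documented character set and 40-character limit; the helper names new_request_id and is_valid_request_id are hypothetical.

```python
import re
import uuid

# Letters, digits, underscores, and hyphens; at most 40 characters.
_REQUEST_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,40}")

def new_request_id() -> str:
    """Return a random UUID string (36 characters: hex digits and hyphens)."""
    return str(uuid.uuid4())

def is_valid_request_id(request_id: str) -> bool:
    """Return True if request_id satisfies the documented constraints."""
    return _REQUEST_ID_RE.fullmatch(request_id) is not None

request_id = new_request_id()
print(request_id, is_valid_request_id(request_id))
```

Reusing the same request ID when retrying a submission lets the service recognize the retry as a duplicate instead of creating a second batch.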
--service-account=SERVICE_ACCOUNT
The IAM service account to be used for a batch/session job.
--staging-bucket=STAGING_BUCKET
The Cloud Storage bucket to use to store job dependencies, config files, and job driver console output. If not specified, the default staging bucket (https://cloud.google.com/dataproc-serverless/docs/concepts/buckets) is used.
--tags=[TAGS,…]
Network tags for traffic control.
--ttl=TTL
The duration after which the workload will be unconditionally terminated, for example, '20m' or '1h'. Run gcloud topic datetimes for information on duration formats.
--version=VERSION
Optional runtime version. If not specified, a default version will be used.
At most one of these can be specified:
--network=NETWORK
URI of the network to connect the workload to.
--subnet=SUBNET
URI of the subnetwork to connect the workload to. The subnet must have Private Google Access enabled.
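
Because at most one of --network and --subnet may be given, a wrapper script can reject the conflicting combination before invoking gcloud. This is a minimal Python sketch; the function name network_flags and the example values are hypothetical.

```python
def network_flags(network=None, subnet=None):
    """Return the network-related flags, allowing at most one of the two options."""
    if network is not None and subnet is not None:
        raise ValueError("specify at most one of --network and --subnet")
    if network is not None:
        return [f"--network={network}"]
    if subnet is not None:
        return [f"--subnet={subnet}"]
    # Neither given: emit no flag and let gcloud apply its default behavior.
    return []

print(network_flags(subnet="my-subnet"))  # hypothetical subnet name
```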
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

This variant is also available:
gcloud beta dataproc batches submit pyspark