---
subcategory: "Dataproc"
description: |-
Manages a Cloud Dataproc cluster resource.
---
# google\_dataproc\_cluster
Manages a Cloud Dataproc cluster resource within GCP.
* [API documentation](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters)
* How-to Guides
* [Official Documentation](https://cloud.google.com/dataproc/docs)
!> **Warning:** Due to limitations of the API, all arguments except
`labels`, `cluster_config.worker_config.num_instances`, and `cluster_config.preemptible_worker_config.num_instances` are non-updatable. Changes to `cluster_config.worker_config.min_num_instances` will be ignored. Changing any other argument will cause the
whole cluster to be recreated!
## Example Usage - Basic
```hcl
resource "google_dataproc_cluster" "simplecluster" {
name = "simplecluster"
region = "us-central1"
}
```
## Example Usage - Advanced
```hcl
resource "google_service_account" "default" {
account_id = "service-account-id"
display_name = "Service Account"
}
resource "google_dataproc_cluster" "mycluster" {
name = "mycluster"
region = "us-central1"
graceful_decommission_timeout = "120s"
labels = {
foo = "bar"
}
cluster_config {
staging_bucket = "dataproc-staging-bucket"
master_config {
num_instances = 1
machine_type = "e2-medium"
disk_config {
boot_disk_type = "pd-ssd"
boot_disk_size_gb = 30
}
}
worker_config {
num_instances = 2
machine_type = "e2-medium"
min_cpu_platform = "Intel Skylake"
disk_config {
boot_disk_size_gb = 30
num_local_ssds = 1
}
}
preemptible_worker_config {
num_instances = 0
}
# Override or set some custom properties
software_config {
image_version = "2.0.35-debian10"
override_properties = {
"dataproc:dataproc.allow.zero.workers" = "true"
}
}
gce_cluster_config {
tags = ["foo", "bar"]
# Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
service_account = google_service_account.default.email
service_account_scopes = [
"cloud-platform"
]
}
# You can define multiple initialization_action blocks
initialization_action {
script = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
timeout_sec = 500
}
}
}
```
## Example Usage - Using a GPU accelerator
```hcl
resource "google_dataproc_cluster" "accelerated_cluster" {
name = "my-cluster-with-gpu"
region = "us-central1"
cluster_config {
gce_cluster_config {
zone = "us-central1-a"
}
master_config {
accelerators {
accelerator_type = "nvidia-tesla-k80"
accelerator_count = "1"
}
}
}
}
```
## Argument Reference
* `name` - (Required) The name of the cluster, unique within the project and
zone.
- - -
* `project` - (Optional) The ID of the project in which the `cluster` will exist. If it
is not provided, the provider project is used.
* `region` - (Optional) The region in which the cluster and associated nodes will be created.
Defaults to `global`.
* `labels` - (Optional) The list of labels (key/value pairs) configured on the resource through Terraform and to be applied to
instances in the cluster.
**Note**: This field is non-authoritative, and will only manage the labels present in your configuration. Please refer to the field `effective_labels` for all of the labels present on the resource.
* `terraform_labels` -
The combination of labels configured directly on the resource and default labels configured on the provider.
* `effective_labels` - (Computed) The list of labels (key/value pairs) to be applied to
instances in the cluster. GCP generates some itself including `goog-dataproc-cluster-name`
which is the name of the cluster.
* `virtual_cluster_config` - (Optional) Allows you to configure a virtual Dataproc on GKE cluster.
Structure [defined below](#nested_virtual_cluster_config).
* `cluster_config` - (Optional) Allows you to configure various aspects of the cluster.
Structure [defined below](#nested_cluster_config).
* `graceful_decommission_timeout` - (Optional) Allows graceful decommissioning when you change the number of worker nodes directly through a terraform apply.
Does not affect autoscaling decommissioning from an autoscaling policy.
Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress.
Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs).
Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is 1 day. (see JSON representation of
[Duration](https://developers.google.com/protocol-buffers/docs/proto3#json)).
Only supported on Dataproc image versions 1.2 and higher.
For more context see the [docs](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/patch#query-parameters)
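As a minimal sketch of how the label fields interact (the resource and output names here are illustrative), `labels` only manages the pairs you configure, while `effective_labels` also exposes GCP-generated labels such as `goog-dataproc-cluster-name`:
```hcl
resource "google_dataproc_cluster" "labeled" {
  name   = "labeled-cluster"
  region = "us-central1"

  # Non-authoritative; only these pairs are managed by Terraform.
  labels = {
    env = "staging"
  }
}

# Surfaces both user-defined and GCP-generated labels.
output "all_cluster_labels" {
  value = google_dataproc_cluster.labeled.effective_labels
}
```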
- - -
<a name="nested_virtual_cluster_config"></a>The `virtual_cluster_config` block supports:
```hcl
virtual_cluster_config {
  auxiliary_services_config { ... }
  kubernetes_cluster_config { ... }
}
```
* `staging_bucket` - (Optional) The Cloud Storage staging bucket used to stage files,
such as Hadoop jars, between client machines and the cluster.
Note: If you don't explicitly specify a `staging_bucket`
then GCP will auto create / assign one for you. However, you are not guaranteed
an auto generated bucket which is solely dedicated to your cluster; it may be shared
with other clusters in the same region/zone also choosing to use the auto generation
option.
* `auxiliary_services_config` (Optional) Configuration of auxiliary services used by this cluster.
Structure [defined below](#nested_auxiliary_services_config).
* `kubernetes_cluster_config` (Required) The configuration for running the Dataproc cluster on Kubernetes.
Structure [defined below](#nested_kubernetes_cluster_config).
- - -
<a name="nested_auxiliary_services_config"></a>The `auxiliary_services_config` block supports:
```hcl
virtual_cluster_config {
  auxiliary_services_config {
    metastore_config {
      dataproc_metastore_service = google_dataproc_metastore_service.metastore_service.id
    }

    spark_history_server_config {
      dataproc_cluster = google_dataproc_cluster.dataproc_cluster.id
    }
  }
}
```
* `metastore_config` (Optional) The Hive Metastore configuration for this workload.
* `dataproc_metastore_service` (Required) Resource name of an existing Dataproc Metastore service.
* `spark_history_server_config` (Optional) The Spark History Server configuration for the workload.
* `dataproc_cluster` (Optional) Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.
- - -
<a name="nested_kubernetes_cluster_config"></a>The `kubernetes_cluster_config` block supports:
```hcl
virtual_cluster_config {
  kubernetes_cluster_config {
    kubernetes_namespace = "foobar"

    kubernetes_software_config {
      component_version = {
        "SPARK" : "3.1-dataproc-7"
      }

      properties = {
        "spark:spark.eventLog.enabled" : "true"
      }
    }

    gke_cluster_config {
      gke_cluster_target = google_container_cluster.primary.id

      node_pool_target {
        node_pool = "dpgke"
        roles     = ["DEFAULT"]

        node_pool_config {
          autoscaling {
            min_node_count = 1
            max_node_count = 6
          }

          config {
            machine_type     = "n1-standard-4"
            preemptible      = true
            local_ssd_count  = 1
            min_cpu_platform = "Intel Sandy Bridge"
          }

          locations = ["us-central1-c"]
        }
      }
    }
  }
}
```
* `kubernetes_namespace` (Optional) A namespace within the Kubernetes cluster to deploy into.
If this namespace does not exist, it is created.
If it exists, Dataproc verifies that another Dataproc VirtualCluster is not installed into it.
If not specified, the name of the Dataproc Cluster is used.
* `kubernetes_software_config` (Required) The software configuration for this Dataproc cluster running on Kubernetes.
* `component_version` (Required) The components that should be installed in this Dataproc cluster. The key must be a string from the
KubernetesComponent enumeration. The value is the version of the software to be installed. At least one entry must be specified.
* **NOTE**: `component_version[SPARK]` must be set, or creation of the cluster will fail.
* `properties` (Optional) The properties to set on daemon config files. Property keys are specified in prefix:property format,
for example spark:spark.kubernetes.container.image.
* `gke_cluster_config` (Required) The configuration for running the Dataproc cluster on GKE.
* `gke_cluster_target` (Optional) A target GKE cluster to deploy to. It must be in the same project and region as the Dataproc cluster
(the GKE cluster can be zonal or regional)
* `node_pool_target` (Optional) GKE node pools where workloads will be scheduled. At least one node pool must be assigned the `DEFAULT`
GkeNodePoolTarget.Role. If a GkeNodePoolTarget is not specified, Dataproc constructs a `DEFAULT` GkeNodePoolTarget.
Each role can be given to only one GkeNodePoolTarget. All node pools must have the same location settings.
* `node_pool` (Required) The target GKE node pool.
* `roles` (Required) The roles associated with the GKE node pool.
One of `"DEFAULT"`, `"CONTROLLER"`, `"SPARK_DRIVER"` or `"SPARK_EXECUTOR"`.
* `node_pool_config` (Input only) The configuration for the GKE node pool.
If specified, Dataproc attempts to create a node pool with the specified shape.
If one with the same name already exists, it is verified against all specified fields.
If a field differs, the virtual cluster creation will fail.
* `autoscaling` (Optional) The autoscaler configuration for this node pool.
The autoscaler is enabled only when a valid configuration is present.
* `min_node_count` (Optional) The minimum number of nodes in the node pool. Must be >= 0 and <= maxNodeCount.
* `max_node_count` (Optional) The maximum number of nodes in the node pool. Must be >= minNodeCount, and must be > 0.
* `config` (Optional) The node pool configuration.
* `machine_type` (Optional) The name of a Compute Engine machine type.
* `local_ssd_count` (Optional) The number of local SSD disks to attach to the node,
which is limited by the maximum number of disks allowable per zone.
* `preemptible` (Optional) Whether the nodes are created as preemptible VM instances.
Preemptible nodes cannot be used in a node pool with the CONTROLLER role or in the DEFAULT node pool if the
CONTROLLER role is not assigned (the DEFAULT node pool will assume the CONTROLLER role).
* `min_cpu_platform` (Optional) Minimum CPU platform to be used by this instance.
The instance may be scheduled on the specified or a newer CPU platform.
Specify the friendly names of CPU platforms, such as "Intel Haswell" or "Intel Sandy Bridge".
* `spot` (Optional) Spot flag for enabling Spot VM, which is a rebrand of the existing preemptible flag.
* `locations` (Optional) The list of Compute Engine zones where node pool nodes associated
with a Dataproc on GKE virtual cluster will be located.
- - -
<a name="nested_cluster_config"></a>The `cluster_config` block supports:
```hcl
cluster_config {
  gce_cluster_config { ... }
  master_config { ... }
  worker_config { ... }
  preemptible_worker_config { ... }
  software_config { ... }

  # You can define multiple initialization_action blocks
  initialization_action { ... }

  encryption_config { ... }
  endpoint_config { ... }
  metastore_config { ... }
}
```
* `staging_bucket` - (Optional) The Cloud Storage staging bucket used to stage files,
such as Hadoop jars, between client machines and the cluster.
Note: If you don't explicitly specify a `staging_bucket`
then GCP will auto create / assign one for you. However, you are not guaranteed
an auto generated bucket which is solely dedicated to your cluster; it may be shared
with other clusters in the same region/zone also choosing to use the auto generation
option.
* `temp_bucket` - (Optional) The Cloud Storage temp bucket used to store ephemeral cluster
and jobs data, such as Spark and MapReduce history files.
Note: If you don't explicitly specify a `temp_bucket` then GCP will auto create / assign one for you.
* `gce_cluster_config` (Optional) Common config settings for resources of Google Compute Engine cluster
instances, applicable to all instances in the cluster. Structure [defined below](#nested_gce_cluster_config).
* `master_config` (Optional) The Google Compute Engine config settings for the master instances
in a cluster. Structure [defined below](#nested_master_config).
* `worker_config` (Optional) The Google Compute Engine config settings for the worker instances
in a cluster. Structure [defined below](#nested_worker_config).
* `preemptible_worker_config` (Optional) The Google Compute Engine config settings for the additional
instances in a cluster. Structure [defined below](#nested_preemptible_worker_config).
* **NOTE** : `preemptible_worker_config` is
an alias for the api's [secondaryWorkerConfig](https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#InstanceGroupConfig). The name doesn't necessarily mean it is preemptible and is named as
such for legacy/compatibility reasons.
* `software_config` (Optional) The config settings for software inside the cluster.
Structure [defined below](#nested_software_config).
* `security_config` (Optional) Security related configuration. Structure [defined below](#nested_security_config).
* `autoscaling_config` (Optional) The autoscaling policy config associated with the cluster.
Note that once set, if `autoscaling_config` is the only field set in `cluster_config`, it can
only be removed by setting `policy_uri = ""`, rather than removing the whole block.
Structure [defined below](#nested_autoscaling_config).
* `initialization_action` (Optional) Commands to execute on each node after config is completed.
You can specify multiple `initialization_action` blocks. Structure [defined below](#nested_initialization_action).
* `encryption_config` (Optional) The Customer managed encryption keys settings for the cluster.
Structure [defined below](#nested_encryption_config).
* `lifecycle_config` (Optional) The settings for auto deletion cluster schedule.
Structure [defined below](#nested_lifecycle_config).
* `endpoint_config` (Optional) The config settings for port access on the cluster.
Structure [defined below](#nested_endpoint_config).
* `dataproc_metric_config` (Optional) The config for collecting Dataproc OSS metrics from the cluster.
Structure [defined below](#nested_dataproc_metric_config).
* `auxiliary_node_groups` (Optional) A Dataproc NodeGroup resource is a group of Dataproc cluster nodes that execute an assigned role.
Structure [defined below](#nested_auxiliary_node_groups).
* `metastore_config` (Optional) The config setting for metastore service with the cluster.
Structure [defined below](#nested_metastore_config).
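As a sketch under illustrative assumptions (the bucket names and region are placeholders), you can manage the staging and temp buckets yourself instead of relying on the auto-created ones:
```hcl
resource "google_storage_bucket" "dataproc_staging" {
  name          = "my-project-dataproc-staging" # hypothetical bucket name
  location      = "US-CENTRAL1"
  force_destroy = true
}

resource "google_storage_bucket" "dataproc_temp" {
  name          = "my-project-dataproc-temp" # hypothetical bucket name
  location      = "US-CENTRAL1"
  force_destroy = true
}

resource "google_dataproc_cluster" "with_buckets" {
  name   = "cluster-with-buckets"
  region = "us-central1"

  cluster_config {
    # Stages job dependencies such as Hadoop jars.
    staging_bucket = google_storage_bucket.dataproc_staging.name
    # Holds ephemeral cluster and job data such as Spark history files.
    temp_bucket = google_storage_bucket.dataproc_temp.name
  }
}
```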
- - -
<a name="nested_gce_cluster_config"></a>The `cluster_config.gce_cluster_config` block supports:
```hcl
cluster_config {
  gce_cluster_config {
    zone = "us-central1-a"

    # One of the below to hook into a custom network / subnetwork
    network    = google_compute_network.dataproc_network.name
    subnetwork = google_compute_subnetwork.dataproc_subnetwork.name

    tags = ["foo", "bar"]
  }
}
```
* `zone` - (Optional, Computed) The GCP zone where your data is stored and used (i.e. where
the master and the worker nodes will be created). If `region` is set to 'global' (default)
then `zone` is mandatory, otherwise GCP is able to make use of [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/auto-zone)
to determine this automatically for you.
Note: This setting additionally determines and restricts
which computing resources are available for use with other configs such as
`cluster_config.master_config.machine_type` and `cluster_config.worker_config.machine_type`.
* `network` - (Optional, Computed) The name or self_link of the Google Compute Engine
network the cluster will be part of. Conflicts with `subnetwork`.
If neither is specified, this defaults to the "default" network.
* `subnetwork` - (Optional) The name or self_link of the Google Compute Engine
subnetwork the cluster will be part of. Conflicts with `network`.
* `service_account` - (Optional) The service account to be used by the Node VMs.
If not specified, the "default" service account is used.
* `service_account_scopes` - (Optional, Computed) The set of Google API scopes
to be made available on all of the node VMs under the `service_account`
specified. Both OAuth2 URLs and gcloud
short names are supported. To allow full access to all Cloud APIs, use the
`cloud-platform` scope. See a complete list of scopes [here](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/instances/set-scopes#--scopes).
* `tags` - (Optional) The list of instance tags applied to instances in the cluster.
Tags are used to identify valid sources or targets for network firewalls.
* `internal_ip_only` - (Optional) By default, clusters are not restricted to internal IP addresses,
and will have ephemeral external IP addresses assigned to each instance. If set to true, all
instances in the cluster will only have internal IP addresses. Note: Private Google Access
(also known as `privateIpGoogleAccess`) must be enabled on the subnetwork that the cluster
will be launched in.
* `metadata` - (Optional) A map of the Compute Engine metadata entries to add to all instances
(see [Project and instance metadata](https://cloud.google.com/compute/docs/storing-retrieving-metadata#project_and_instance_metadata)).
* `reservation_affinity` - (Optional) Reservation Affinity for consuming zonal reservation.
* `consume_reservation_type` - (Optional) Corresponds to the type of reservation consumption.
* `key` - (Optional) Corresponds to the label key of reservation resource.
* `values` - (Optional) Corresponds to the label values of reservation resource.
* `node_group_affinity` - (Optional) Node Group Affinity for sole-tenant clusters.
* `node_group_uri` - (Required) The URI of a sole-tenant node group resource that the cluster will be created on.
* `shielded_instance_config` (Optional) Shielded Instance Config for clusters using [Compute Engine Shielded VMs](https://cloud.google.com/security/shielded-cloud/shielded-vm).
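A minimal sketch, assuming a pre-existing subnetwork (the subnetwork path and metadata values are illustrative), of a cluster restricted to internal IP addresses with custom instance metadata:
```hcl
cluster_config {
  gce_cluster_config {
    # The subnetwork must have Private Google Access enabled.
    subnetwork       = "projects/my-project/regions/us-central1/subnetworks/private-subnet" # hypothetical
    internal_ip_only = true

    # Added to every instance in the cluster.
    metadata = {
      "enable-oslogin" = "true"
    }
  }
}
```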
- - -
The `cluster_config.gce_cluster_config.shielded_instance_config` block supports:
```hcl
cluster_config {
  gce_cluster_config {
    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }
  }
}
```
* `enable_secure_boot` - (Optional) Defines whether instances have Secure Boot enabled.
* `enable_vtpm` - (Optional) Defines whether instances have the [vTPM](https://cloud.google.com/security/shielded-cloud/shielded-vm#vtpm) enabled.
* `enable_integrity_monitoring` - (Optional) Defines whether instances have integrity monitoring enabled.
- - -
<a name="nested_master_config"></a>The `cluster_config.master_config` block supports:
```hcl
cluster_config {
  master_config {
    num_instances    = 1
    machine_type     = "e2-medium"
    min_cpu_platform = "Intel Skylake"

    disk_config {
      boot_disk_type    = "pd-ssd"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
```
* `num_instances`- (Optional, Computed) Specifies the number of master nodes to create.
If not specified, GCP will default to a predetermined computed value (currently 1).
* `machine_type` - (Optional, Computed) The name of a Google Compute Engine machine type
to create for the master. If not specified, GCP will default to a predetermined
computed value (currently `n1-standard-4`).
* `min_cpu_platform` - (Optional, Computed) The name of a minimum generation of CPU family
for the master. If not specified, GCP will default to a predetermined computed value
for each zone. See [the guide](https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform)
for details about which CPU families are available (and defaulted) for each zone.
* `image_uri` (Optional) The URI for the image to use for the master nodes. See [the guide](https://cloud.google.com/dataproc/docs/guides/dataproc-images)
for more information.
* `disk_config` (Optional) Disk Config
* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.
* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each node, specified
in GB. The primary disk contains the boot volume and system libraries, and the
smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
* `num_local_ssds` - (Optional) The amount of local SSD disks that will be
attached to each master cluster node. Defaults to 0.
* `accelerators` (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified multiple times.
* `accelerator_type` - (Required) The short name of the accelerator type to expose to this instance. For example, `nvidia-tesla-k80`.
* `accelerator_count` - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of `1`, `2`, `4`, or `8`.
~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select
zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check [accelerator availability by zone](https://cloud.google.com/compute/docs/reference/rest/v1/acceleratorTypes/list)
if you are trying to use accelerators in a given zone.
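A short sketch, assuming the chosen accelerator is available in the pinned zone (verify this yourself), of avoiding the Auto Zone Placement issue described above by setting the zone explicitly:
```hcl
resource "google_dataproc_cluster" "gpu_master" {
  name   = "gpu-master-cluster"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # Pin a zone where the requested accelerator type is offered.
      zone = "us-central1-a"
    }

    master_config {
      accelerators {
        accelerator_type  = "nvidia-tesla-t4" # assumed to be available in us-central1-a
        accelerator_count = 1
      }
    }
  }
}
```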
- - -
<a name="nested_worker_config"></a>The `cluster_config.worker_config` block supports:
```hcl
cluster_config {
  worker_config {
    num_instances     = 3
    machine_type      = "e2-medium"
    min_cpu_platform  = "Intel Skylake"
    min_num_instances = 2

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
```
* `num_instances`- (Optional, Computed) Specifies the number of worker nodes to create.
If not specified, GCP will default to a predetermined computed value (currently 2).
There is currently a beta feature which allows you to run a
[Single Node Cluster](https://cloud.google.com/dataproc/docs/concepts/single-node-clusters).
In order to take advantage of this you need to set
`"dataproc:dataproc.allow.zero.workers" = "true"` in
`cluster_config.software_config.override_properties` (see the single-node sketch after this list).
* `machine_type` - (Optional, Computed) The name of a Google Compute Engine machine type
to create for the worker nodes. If not specified, GCP will default to a predetermined
computed value (currently `n1-standard-4`).
* `min_cpu_platform` - (Optional, Computed) The name of a minimum generation of CPU family
for the worker nodes. If not specified, GCP will default to a predetermined computed value
for each zone. See [the guide](https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform)
for details about which CPU families are available (and defaulted) for each zone.
* `disk_config` (Optional) Disk Config
* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.
* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each worker node, specified
in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
* `num_local_ssds` - (Optional) The amount of local SSD disks that will be
attached to each worker cluster node. Defaults to 0.
* `image_uri` (Optional) The URI for the image to use for this worker. See [the guide](https://cloud.google.com/dataproc/docs/guides/dataproc-images)
for more information.
* `min_num_instances` (Optional) The minimum number of primary worker instances to create. If `min_num_instances` is set, cluster creation will succeed if the number of primary workers created is at least equal to the `min_num_instances` number.
* `accelerators` (Optional) The Compute Engine accelerator configuration for these instances. Can be specified multiple times.
* `accelerator_type` - (Required) The short name of the accelerator type to expose to this instance. For example, `nvidia-tesla-k80`.
* `accelerator_count` - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of `1`, `2`, `4`, or `8`.
~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select
zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check [accelerator availability by zone](https://cloud.google.com/compute/docs/reference/rest/v1/acceleratorTypes/list)
if you are trying to use accelerators in a given zone.
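A minimal single-node sketch (the cluster name and machine type are illustrative), tying together the `dataproc.allow.zero.workers` property mentioned in the `num_instances` note above:
```hcl
resource "google_dataproc_cluster" "single_node" {
  name   = "single-node-cluster"
  region = "us-central1"

  cluster_config {
    # Allowing zero workers and omitting worker_config yields a single-node cluster.
    software_config {
      override_properties = {
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }

    master_config {
      num_instances = 1
      machine_type  = "e2-medium"
    }
  }
}
```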
- - -
<a name="nested_preemptible_worker_config"></a>The `cluster_config.preemptible_worker_config` block supports:
```hcl
cluster_config {
  preemptible_worker_config {
    num_instances = 1

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }

    instance_flexibility_policy {
      instance_selection_list {
        machine_types = ["n2-standard-2", "n1-standard-2"]
        rank          = 1
      }

      instance_selection_list {
        machine_types = ["n2d-standard-2"]
        rank          = 3
      }
    }
  }
}
```
Note: Unlike `worker_config`, you cannot set the `machine_type` value directly. This
will be set for you based on whatever was set for the `worker_config.machine_type` value.
* `num_instances`- (Optional) Specifies the number of preemptible nodes to create.
Defaults to 0.
* `preemptibility` - (Optional) Specifies the preemptibility of the secondary workers. The default value is `PREEMPTIBLE`.
Accepted values are:
* `PREEMPTIBILITY_UNSPECIFIED`
* `NON_PREEMPTIBLE`
* `PREEMPTIBLE`
* `SPOT`
* `disk_config` (Optional) Disk Config
* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each preemptible worker node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.
* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each preemptible worker node, specified
in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
* `num_local_ssds` - (Optional) The amount of local SSD disks that will be
attached to each preemptible worker node. Defaults to 0.
* `instance_flexibility_policy` (Optional) Instance flexibility Policy allowing a mixture of VM shapes and provisioning models.
* `instance_selection_list` - (Optional) List of instance selection options that the group will use when creating new VMs.
* `machine_types` - (Optional) Full machine-type names, e.g. `"n1-standard-16"`.
* `rank` - (Optional) Preference of this instance selection. A lower number means higher preference. Dataproc will first try to create a VM based on the machine-type with priority rank and fallback to next rank based on availability. Machine types and instance selections with the same priority have the same preference.
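A short sketch (the instance count is illustrative) of running the secondary worker group on Spot VMs via `preemptibility`:
```hcl
cluster_config {
  preemptible_worker_config {
    num_instances = 2
    # Secondary workers use Spot VMs instead of standard preemptible VMs.
    preemptibility = "SPOT"
  }
}
```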
- - -
<a name="nested_software_config"></a>The `cluster_config.software_config` block supports:
```hcl
cluster_config {
  # Override or set some custom properties
  software_config {
    image_version = "2.0.35-debian10"

    override_properties = {
      "dataproc:dataproc.allow.zero.workers" = "true"
    }
  }
}
```
* `image_version` - (Optional, Computed) The Cloud Dataproc image version to use
for the cluster - this controls the sets of software versions
installed onto the nodes when you create clusters. If not specified, defaults to the
latest version. For a list of valid versions see
[Cloud Dataproc versions](https://cloud.google.com/dataproc/docs/concepts/dataproc-versions)
* `override_properties` - (Optional) A list of override and additional properties (key/value pairs)
used to modify various aspects of the common configuration files used when creating
a cluster. For a list of valid properties please see
[Cluster properties](https://cloud.google.com/dataproc/docs/concepts/cluster-properties)
* `optional_components` - (Optional) The set of optional components to activate on the cluster. See [Available Optional Components](https://cloud.google.com/dataproc/docs/concepts/components/overview#available_optional_components).
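A sketch (the component choice is illustrative; check that the components are valid for your image version) of enabling optional components:
```hcl
cluster_config {
  software_config {
    image_version = "2.0.35-debian10"
    # Must be valid components for the chosen image version.
    optional_components = ["JUPYTER", "ZEPPELIN"]
  }
}
```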
- - -
<a name="nested_security_config"></a>The `cluster_config.security_config` block supports:
```hcl
cluster_config {
  security_config {
    kerberos_config {
      kms_key_uri                 = "projects/projectId/locations/locationId/keyRings/keyRingId/cryptoKeys/keyId"
      root_principal_password_uri = "bucketId/o/objectId"
    }
  }
}
```
* `kerberos_config` (Required) Kerberos Configuration
* `cross_realm_trust_admin_server` - (Optional) The admin server (IP or hostname) for the
remote trusted realm in a cross realm trust relationship.
* `cross_realm_trust_kdc` - (Optional) The KDC (IP or hostname) for the
remote trusted realm in a cross realm trust relationship.
* `cross_realm_trust_realm` - (Optional) The remote realm the Dataproc on-cluster KDC will
trust, should the user enable cross realm trust.
* `cross_realm_trust_shared_password_uri` - (Optional) The Cloud Storage URI of a KMS
encrypted file containing the shared password between the on-cluster Kerberos realm
and the remote trusted realm, in a cross realm trust relationship.
* `enable_kerberos` - (Optional) Flag to indicate whether to Kerberize the cluster.
* `kdc_db_key_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file containing
the master key of the KDC database.
* `key_password_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file containing
the password to the user provided key. For the self-signed certificate, this password
is generated by Dataproc.
* `keystore_uri` - (Optional) The Cloud Storage URI of the keystore file used for SSL encryption.
If not provided, Dataproc will provide a self-signed certificate.
* `keystore_password_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file containing
the password to the user provided keystore. For the self-signed certificated, the password
is generated by Dataproc.
* `kms_key_uri` - (Required) The URI of the KMS key used to encrypt various sensitive files.
* `realm` - (Optional) The name of the on-cluster Kerberos realm. If not specified, the
uppercased domain of hostnames will be the realm.
* `root_principal_password_uri` - (Required) The Cloud Storage URI of a KMS encrypted file
containing the root principal password.
* `tgt_lifetime_hours` - (Optional) The lifetime of the ticket granting ticket, in hours.
* `truststore_password_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file
containing the password to the user provided truststore. For the self-signed
certificate, this password is generated by Dataproc.
* `truststore_uri` - (Optional) The Cloud Storage URI of the truststore file used for
SSL encryption. If not provided, Dataproc will provide a self-signed certificate.
- - -
<a name="nested_autoscaling_config"></a>The `cluster_config.autoscaling_config` block supports:
```hcl
cluster_config {
  autoscaling_config {
    policy_uri = "projects/projectId/locations/region/autoscalingPolicies/policyId"
  }
}
```
* `policy_uri` - (Required) The autoscaling policy used by the cluster.
Only resource names including projectid and location (region) are valid. Examples:
`https://www.googleapis.com/compute/v1/projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]`
`projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]`
Note that the policy must be in the same project and Cloud Dataproc region.
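A hedged sketch (the policy ID and worker bounds are illustrative) of referencing an autoscaling policy managed in the same configuration via the `google_dataproc_autoscaling_policy` resource:
```hcl
resource "google_dataproc_autoscaling_policy" "asp" {
  policy_id = "dataproc-policy" # hypothetical policy ID
  location  = "us-central1"     # must match the cluster's region

  worker_config {
    max_instances = 3
  }

  basic_algorithm {
    yarn_config {
      graceful_decommission_timeout = "30s"
      scale_up_factor               = 0.5
      scale_down_factor             = 0.5
    }
  }
}

resource "google_dataproc_cluster" "autoscaled" {
  name   = "autoscaled-cluster"
  region = "us-central1"

  cluster_config {
    autoscaling_config {
      policy_uri = google_dataproc_autoscaling_policy.asp.name
    }
  }
}
```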
- - -
<a name="nested_initialization_action"></a>The `initialization_action` block (Optional) can be specified multiple times and supports:
```hcl
cluster_config {
  # You can define multiple initialization_action blocks
  initialization_action {
    script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
    timeout_sec = 500
  }
}
```
* `script`- (Required) The script to be executed during initialization of the cluster.
The script must be a GCS file with a gs:// prefix.
* `timeout_sec` - (Optional, Computed) The maximum duration (in seconds) which `script` is
allowed to take to execute its action. GCP will default to a predetermined
computed value if not set (currently 300).
- - -
<a name="nested_encryption_config"></a>The `encryption_config` block supports:
```hcl
cluster_config {
  encryption_config {
    kms_key_name = "projects/projectId/locations/region/keyRings/keyRingName/cryptoKeys/keyName"
  }
}
```
* `kms_key_name` - (Required) The Cloud KMS key name to use for PD disk encryption for
all instances in the cluster.
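A sketch (the key ring and key names are illustrative) of referencing a customer-managed key defined in the same configuration; note that the Dataproc service agent also needs permission to use the key, which is not shown here:
```hcl
resource "google_kms_key_ring" "dataproc" {
  name     = "dataproc-keyring" # hypothetical
  location = "us-central1"
}

resource "google_kms_crypto_key" "dataproc" {
  name     = "dataproc-key" # hypothetical
  key_ring = google_kms_key_ring.dataproc.id
}

resource "google_dataproc_cluster" "encrypted" {
  name   = "encrypted-cluster"
  region = "us-central1"

  cluster_config {
    encryption_config {
      # Full resource name of the CMEK key used for PD disk encryption.
      kms_key_name = google_kms_crypto_key.dataproc.id
    }
  }
}
```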
- - -
<a name="nested_dataproc_metric_config"></a>The `dataproc_metric_config` block supports:
```hcl
dataproc_metric_config {
  metrics {
    metric_source    = "HDFS"
    metric_overrides = ["yarn:ResourceManager:QueueMetrics:AppsCompleted"]
  }
}
```
* `metrics` - (Required) Metrics sources to enable.
* `metric_source` - (Required) A source for the collection of Dataproc OSS metrics (see [available OSS metrics](https://cloud.google.com/dataproc/docs/guides/monitoring#available_oss_metrics)).
* `metric_overrides` - (Optional) One or more [available OSS metrics](https://cloud.google.com/dataproc/docs/guides/monitoring#available_oss_metrics) to collect for the metric source.
- - -
<a name="nested_auxiliary_node_groups"></a>The `auxiliary_node_groups` block supports:
```hcl
auxiliary_node_groups {
  node_group {
    roles = ["DRIVER"]

    node_group_config {
      num_instances    = 2
      machine_type     = "n1-standard-2"
      min_cpu_platform = "AMD Rome"

      disk_config {
        boot_disk_size_gb = 35
        boot_disk_type    = "pd-standard"
        num_local_ssds    = 1
      }

      accelerators {
        accelerator_count = 1
        accelerator_type  = "nvidia-tesla-t4"
      }
    }
  }
}
```
* `node_group` - (Required) Node group configuration.
* `roles` - (Required) Node group roles.
One of `"DRIVER"`.
* `name` - (Optional) The Node group resource name.
* `node_group_config` - (Optional) The node group instance group configuration.
* `num_instances`- (Optional, Computed) Specifies the number of instances in the node group.
Must be greater than 0; a node group must have at least 1 instance.
* `machine_type` - (Optional, Computed) The name of a Google Compute Engine machine type
to create for the node group. If not specified, GCP will default to a predetermined
computed value (currently `n1-standard-4`).
* `min_cpu_platform` - (Optional, Computed) The name of a minimum generation of CPU family
for the node group. If not specified, GCP will default to a predetermined computed value
for each zone. See [the guide](https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform)
for details about which CPU families are available (and defaulted) for each zone.
* `disk_config` (Optional) Disk Config
* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.
* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each node, specified
in GB. The primary disk contains the boot volume and system libraries, and the
smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.
* `num_local_ssds` - (Optional) The amount of local SSD disks that will be attached to each node in the node group.
Defaults to 0.
* `accelerators` (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified
multiple times.
* `accelerator_type` - (Required) The short name of the accelerator type to expose to this instance. For example, `nvidia-tesla-k80`.
* `accelerator_count` - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of `1`, `2`, `4`, or `8`.
- - -
<a name="nested_lifecycle_config"></a>The `lifecycle_config` block supports:
```hcl
cluster_config {
  lifecycle_config {
    idle_delete_ttl  = "10m"
    auto_delete_time = "2120-01-01T12:00:00.01Z"
  }
}
```
* `idle_delete_ttl` - (Optional) The duration to keep the cluster alive while idling
(no jobs running). After this TTL, the cluster will be deleted. Valid range: [10m, 14d].
* `auto_delete_time` - (Optional) The time when cluster will be auto-deleted.
A timestamp in RFC3339 UTC "Zulu" format, accurate to nanoseconds.
Example: "2014-10-02T15:01:23.045123456Z".
- - -
<a name="nested_endpoint_config"></a>The `endpoint_config` block (Optional, Computed, Beta) supports:
```hcl
cluster_config {
  endpoint_config {
    enable_http_port_access = "true"
  }
}
```
* `enable_http_port_access` - (Optional) The flag to enable http access to specific ports
on the cluster from external sources (aka Component Gateway). Defaults to false.
- - -
<a name="nested_metastore_config"></a>The `metastore_config` block (Optional, Computed, Beta) supports:
```hcl
cluster_config {
  metastore_config {
    dataproc_metastore_service = "projects/projectId/locations/region/services/serviceName"
  }
}
```
* `dataproc_metastore_service` - (Required) Resource name of an existing Dataproc Metastore service.
Only resource names including projectid and location (region) are valid. Examples:
`projects/[projectId]/locations/[dataproc_region]/services/[service-name]`
## Attributes Reference
In addition to the arguments listed above, the following computed attributes are
exported:
* `cluster_config.0.master_config.0.instance_names` - List of master instance names which
have been assigned to the cluster.
* `cluster_config.0.worker_config.0.instance_names` - List of worker instance names which have been assigned
to the cluster.
* `cluster_config.0.preemptible_worker_config.0.instance_names` - List of preemptible instance names which have been assigned
to the cluster.
* `cluster_config.0.bucket` - The name of the Cloud Storage bucket ultimately used to house the staging data
for the cluster. If `staging_bucket` is specified, this will equal that value; otherwise
it will be the name of the auto-generated bucket.
* `cluster_config.0.software_config.0.properties` - A list of the properties used to set the daemon config files.
This will include any values supplied by the user via `cluster_config.software_config.override_properties`
* `cluster_config.0.lifecycle_config.0.idle_start_time` - Time when the cluster became idle
(most recent job finished) and became eligible for deletion due to idleness.
* `cluster_config.0.endpoint_config.0.http_ports` - The map of port descriptions to URLs. Will only be populated if
`enable_http_port_access` is true.
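As an illustrative sketch (the output names are arbitrary and reference the `mycluster` resource from the advanced example above), these computed attributes can be surfaced with `output` blocks:
```hcl
output "dataproc_staging_bucket" {
  # Bucket ultimately used to house the staging data for the cluster.
  value = google_dataproc_cluster.mycluster.cluster_config[0].bucket
}

output "dataproc_master_instance_names" {
  # Master instance names assigned to the cluster.
  value = google_dataproc_cluster.mycluster.cluster_config[0].master_config[0].instance_names
}
```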
## Import
This resource does not support import.
## Timeouts
This resource provides the following
[Timeouts](https://developer.hashicorp.com/terraform/plugin/sdkv2/resources/retries-and-customizable-timeouts) configuration options:
- `create` - Default is 45 minutes.
- `update` - Default is 45 minutes.
- `delete` - Default is 45 minutes.