---
subcategory: "Dataproc"
description: |-
  Manages a Cloud Dataproc cluster resource.
---

# google\_dataproc\_cluster

Manages a Cloud Dataproc cluster resource within GCP.

* [API documentation](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters)
* How-to Guides
    * [Official Documentation](https://cloud.google.com/dataproc/docs)

!> **Warning:** Due to limitations of the API, all arguments except
`labels`, `cluster_config.worker_config.num_instances` and `cluster_config.preemptible_worker_config.num_instances` are non-updatable. Changes to `cluster_config.worker_config.min_num_instances` are ignored. Changing any other argument will cause recreation of the
whole cluster!

## Example Usage - Basic

```hcl
resource "google_dataproc_cluster" "simplecluster" {
  name   = "simplecluster"
  region = "us-central1"
}
```

## Example Usage - Advanced

```hcl
resource "google_service_account" "default" {
  account_id   = "service-account-id"
  display_name = "Service Account"
}

resource "google_dataproc_cluster" "mycluster" {
  name                          = "mycluster"
  region                        = "us-central1"
  graceful_decommission_timeout = "120s"
  labels = {
    foo = "bar"
  }

  cluster_config {
    staging_bucket = "dataproc-staging-bucket"

    master_config {
      num_instances = 1
      machine_type  = "e2-medium"
      disk_config {
        boot_disk_type    = "pd-ssd"
        boot_disk_size_gb = 30
      }
    }

    worker_config {
      num_instances    = 2
      machine_type     = "e2-medium"
      min_cpu_platform = "Intel Skylake"
      disk_config {
        boot_disk_size_gb = 30
        num_local_ssds    = 1
      }
    }

    preemptible_worker_config {
      num_instances = 0
    }

    # Override or set some custom properties
    software_config {
      image_version = "2.0.35-debian10"
      override_properties = {
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }

    gce_cluster_config {
      tags = ["foo", "bar"]
      # Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
      service_account = google_service_account.default.email
      service_account_scopes = [
        "cloud-platform"
      ]
    }

    # You can define multiple initialization_action blocks
    initialization_action {
      script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
      timeout_sec = 500
    }
  }
}
```

## Example Usage - Using a GPU accelerator

```hcl
resource "google_dataproc_cluster" "accelerated_cluster" {
  name   = "my-cluster-with-gpu"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      zone = "us-central1-a"
    }

    master_config {
      accelerators {
        accelerator_type  = "nvidia-tesla-k80"
        accelerator_count = "1"
      }
    }
  }
}
```

## Argument Reference

* `name` - (Required) The name of the cluster, unique within the project and
zone.

- - -

* `project` - (Optional) The ID of the project in which the `cluster` will exist. If it
is not provided, the provider project is used.

* `region` - (Optional) The region in which the cluster and associated nodes will be created.
Defaults to `global`.

* `labels` - (Optional) The list of labels (key/value pairs) configured on the resource through Terraform and to be applied to
instances in the cluster.
**Note**: This field is non-authoritative, and will only manage the labels present in your configuration. Please refer to the field `effective_labels` for all of the labels present on the resource.

* `terraform_labels` -
The combination of labels configured directly on the resource and default labels configured on the provider.

* `effective_labels` - (Computed) The list of labels (key/value pairs) to be applied to
instances in the cluster. GCP generates some itself including `goog-dataproc-cluster-name`
which is the name of the cluster.

* `virtual_cluster_config` - (Optional) Allows you to configure a virtual Dataproc on GKE cluster.
Structure [defined below](#nested_virtual_cluster_config).

* `cluster_config` - (Optional) Allows you to configure various aspects of the cluster.
Structure [defined below](#nested_cluster_config).

* `graceful_decommission_timeout` - (Optional) Allows graceful decommissioning when you change the number of worker nodes directly through a terraform apply.
Does not affect automatic decommissioning triggered by an autoscaling policy.
Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress.
The timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs).
The default timeout is 0 (for forceful decommission), and the maximum allowed timeout is 1 day (see the JSON representation of
[Duration](https://developers.google.com/protocol-buffers/docs/proto3#json)).
Only supported on Dataproc image versions 1.2 and higher.
For more context see the [docs](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/patch#query-parameters).
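
For example, a minimal sketch (the cluster name, region, and timeout value here are illustrative) pairing `graceful_decommission_timeout` with a worker count that can later be lowered in place:

```hcl
resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  # Wait up to one hour for in-progress jobs before forcefully
  # removing workers when num_instances is lowered in a later apply.
  graceful_decommission_timeout = "3600s"

  cluster_config {
    worker_config {
      num_instances = 4
    }
  }
}
```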
- - -

<a name="nested_virtual_cluster_config"></a>The `virtual_cluster_config` block supports:

```hcl
virtual_cluster_config {
  auxiliary_services_config { ... }
  kubernetes_cluster_config { ... }
}
```

* `staging_bucket` - (Optional) The Cloud Storage staging bucket used to stage files,
such as Hadoop jars, between client machines and the cluster.
Note: If you don't explicitly specify a `staging_bucket`
then GCP will auto create / assign one for you. However, you are not guaranteed
an auto generated bucket which is solely dedicated to your cluster; it may be shared
with other clusters in the same region/zone also choosing to use the auto generation
option.

* `auxiliary_services_config` (Optional) Configuration of auxiliary services used by this cluster.
Structure [defined below](#nested_auxiliary_services_config).

* `kubernetes_cluster_config` (Required) The configuration for running the Dataproc cluster on Kubernetes.
Structure [defined below](#nested_kubernetes_cluster_config).
- - -

<a name="nested_auxiliary_services_config"></a>The `auxiliary_services_config` block supports:

```hcl
virtual_cluster_config {
  auxiliary_services_config {
    metastore_config {
      dataproc_metastore_service = google_dataproc_metastore_service.metastore_service.id
    }

    spark_history_server_config {
      dataproc_cluster = google_dataproc_cluster.dataproc_cluster.id
    }
  }
}
```

* `metastore_config` (Optional) The Hive Metastore configuration for this workload.

* `dataproc_metastore_service` (Required) Resource name of an existing Dataproc Metastore service.

* `spark_history_server_config` (Optional) The Spark History Server configuration for the workload.

* `dataproc_cluster` (Optional) Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.
- - -

<a name="nested_kubernetes_cluster_config"></a>The `kubernetes_cluster_config` block supports:

```hcl
virtual_cluster_config {
  kubernetes_cluster_config {
    kubernetes_namespace = "foobar"

    kubernetes_software_config {
      component_version = {
        "SPARK" : "3.1-dataproc-7"
      }

      properties = {
        "spark:spark.eventLog.enabled" : "true"
      }
    }

    gke_cluster_config {
      gke_cluster_target = google_container_cluster.primary.id

      node_pool_target {
        node_pool = "dpgke"
        roles     = ["DEFAULT"]

        node_pool_config {
          autoscaling {
            min_node_count = 1
            max_node_count = 6
          }

          config {
            machine_type     = "n1-standard-4"
            preemptible      = true
            local_ssd_count  = 1
            min_cpu_platform = "Intel Sandy Bridge"
          }

          locations = ["us-central1-c"]
        }
      }
    }
  }
}
```

* `kubernetes_namespace` (Optional) A namespace within the Kubernetes cluster to deploy into.
If this namespace does not exist, it is created.
If it exists, Dataproc verifies that another Dataproc VirtualCluster is not installed into it.
If not specified, the name of the Dataproc Cluster is used.

* `kubernetes_software_config` (Required) The software configuration for this Dataproc cluster running on Kubernetes.

* `component_version` (Required) The components that should be installed in this Dataproc cluster. The key must be a string from the
KubernetesComponent enumeration. The value is the version of the software to be installed. At least one entry must be specified.
  * **NOTE**: `component_version[SPARK]` must be set, or the creation of the cluster will fail.

* `properties` (Optional) The properties to set on daemon config files. Property keys are specified in prefix:property format,
for example spark:spark.kubernetes.container.image.

* `gke_cluster_config` (Required) The configuration for running the Dataproc cluster on GKE.

* `gke_cluster_target` (Optional) A target GKE cluster to deploy to. It must be in the same project and region as the Dataproc cluster
(the GKE cluster can be zonal or regional).

* `node_pool_target` (Optional) GKE node pools where workloads will be scheduled. At least one node pool must be assigned the `DEFAULT`
GkeNodePoolTarget.Role. If a GkeNodePoolTarget is not specified, Dataproc constructs a `DEFAULT` GkeNodePoolTarget.
Each role can be given to only one GkeNodePoolTarget. All node pools must have the same location settings.

* `node_pool` (Required) The target GKE node pool.

* `roles` (Required) The roles associated with the GKE node pool.
One of `"DEFAULT"`, `"CONTROLLER"`, `"SPARK_DRIVER"` or `"SPARK_EXECUTOR"`.

* `node_pool_config` (Input only) The configuration for the GKE node pool.
If specified, Dataproc attempts to create a node pool with the specified shape.
If one with the same name already exists, it is verified against all specified fields.
If a field differs, the virtual cluster creation will fail.

* `autoscaling` (Optional) The autoscaler configuration for this node pool.
The autoscaler is enabled only when a valid configuration is present.

* `min_node_count` (Optional) The minimum number of nodes in the node pool. Must be >= 0 and <= maxNodeCount.

* `max_node_count` (Optional) The maximum number of nodes in the node pool. Must be >= minNodeCount, and must be > 0.

* `config` (Optional) The node pool configuration.

* `machine_type` (Optional) The name of a Compute Engine machine type.

* `local_ssd_count` (Optional) The number of local SSD disks to attach to the node,
which is limited by the maximum number of disks allowable per zone.

* `preemptible` (Optional) Whether the nodes are created as preemptible VM instances.
Preemptible nodes cannot be used in a node pool with the CONTROLLER role or in the DEFAULT node pool if the
CONTROLLER role is not assigned (the DEFAULT node pool will assume the CONTROLLER role).

* `min_cpu_platform` (Optional) Minimum CPU platform to be used by this instance.
The instance may be scheduled on the specified or a newer CPU platform.
Specify the friendly names of CPU platforms, such as "Intel Haswell" or "Intel Sandy Bridge".

* `spot` (Optional) Spot flag for enabling Spot VM, which is a rebrand of the existing preemptible flag.

* `locations` (Optional) The list of Compute Engine zones where node pool nodes associated
with a Dataproc on GKE virtual cluster will be located.
- - -

<a name="nested_cluster_config"></a>The `cluster_config` block supports:

```hcl
cluster_config {
  gce_cluster_config { ... }
  master_config { ... }
  worker_config { ... }
  preemptible_worker_config { ... }
  software_config { ... }

  # You can define multiple initialization_action blocks
  initialization_action { ... }
  encryption_config { ... }
  endpoint_config { ... }
  metastore_config { ... }
}
```

* `staging_bucket` - (Optional) The Cloud Storage staging bucket used to stage files,
such as Hadoop jars, between client machines and the cluster.
Note: If you don't explicitly specify a `staging_bucket`
then GCP will auto create / assign one for you. However, you are not guaranteed
an auto generated bucket which is solely dedicated to your cluster; it may be shared
with other clusters in the same region/zone also choosing to use the auto generation
option.

* `temp_bucket` - (Optional) The Cloud Storage temp bucket used to store ephemeral cluster
and jobs data, such as Spark and MapReduce history files.
Note: If you don't explicitly specify a `temp_bucket` then GCP will auto create / assign one for you.

* `gce_cluster_config` (Optional) Common config settings for resources of Google Compute Engine cluster
instances, applicable to all instances in the cluster. Structure [defined below](#nested_gce_cluster_config).

* `master_config` (Optional) The Google Compute Engine config settings for the master instances
in a cluster. Structure [defined below](#nested_master_config).

* `worker_config` (Optional) The Google Compute Engine config settings for the worker instances
in a cluster. Structure [defined below](#nested_worker_config).

* `preemptible_worker_config` (Optional) The Google Compute Engine config settings for the additional
instances in a cluster. Structure [defined below](#nested_preemptible_worker_config).
  * **NOTE**: `preemptible_worker_config` is
an alias for the API's [secondaryWorkerConfig](https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#InstanceGroupConfig). Despite the name, these workers are not necessarily preemptible; the field is named as
such for legacy/compatibility reasons.

* `software_config` (Optional) The config settings for software inside the cluster.
Structure [defined below](#nested_software_config).

* `security_config` (Optional) Security related configuration. Structure [defined below](#nested_security_config).

* `autoscaling_config` (Optional) The autoscaling policy config associated with the cluster.
Note that once set, if `autoscaling_config` is the only field set in `cluster_config`, it can
only be removed by setting `policy_uri = ""`, rather than removing the whole block.
Structure [defined below](#nested_autoscaling_config).

* `initialization_action` (Optional) Commands to execute on each node after config is completed.
You can specify multiple versions of these. Structure [defined below](#nested_initialization_action).

* `encryption_config` (Optional) The Customer managed encryption keys settings for the cluster.
Structure [defined below](#nested_encryption_config).

* `lifecycle_config` (Optional) The settings for auto deletion cluster schedule.
Structure [defined below](#nested_lifecycle_config).

* `endpoint_config` (Optional) The config settings for port access on the cluster.
Structure [defined below](#nested_endpoint_config).

* `dataproc_metric_config` (Optional) The config for collecting Dataproc OSS metrics from the cluster.
Structure [defined below](#nested_dataproc_metric_config).

* `auxiliary_node_groups` (Optional) A Dataproc NodeGroup resource is a group of Dataproc cluster nodes that execute an assigned role.
Structure [defined below](#nested_auxiliary_node_groups).

* `metastore_config` (Optional) The config setting for metastore service with the cluster.
Structure [defined below](#nested_metastore_config).
- - -

<a name="nested_gce_cluster_config"></a>The `cluster_config.gce_cluster_config` block supports:

```hcl
cluster_config {
  gce_cluster_config {
    zone = "us-central1-a"

    # One of the below to hook into a custom network / subnetwork
    network    = google_compute_network.dataproc_network.name
    subnetwork = google_compute_network.dataproc_subnetwork.name

    tags = ["foo", "bar"]
  }
}
```

* `zone` - (Optional, Computed) The GCP zone where your data is stored and used (i.e. where
the master and the worker nodes will be created). If `region` is set to 'global' (default)
then `zone` is mandatory, otherwise GCP is able to make use of [Auto Zone Placement](https://cloud.google.com/dataproc/docs/concepts/auto-zone)
to determine this automatically for you.
Note: This setting additionally determines and restricts
which computing resources are available for use with other configs such as
`cluster_config.master_config.machine_type` and `cluster_config.worker_config.machine_type`.

* `network` - (Optional, Computed) The name or self_link of the Google Compute Engine
network the cluster will be part of. Conflicts with `subnetwork`.
If neither is specified, this defaults to the "default" network.

* `subnetwork` - (Optional) The name or self_link of the Google Compute Engine
subnetwork the cluster will be part of. Conflicts with `network`.

* `service_account` - (Optional) The service account to be used by the Node VMs.
If not specified, the "default" service account is used.

* `service_account_scopes` - (Optional, Computed) The set of Google API scopes
to be made available on all of the node VMs under the `service_account`
specified. Both OAuth2 URLs and gcloud
short names are supported. To allow full access to all Cloud APIs, use the
`cloud-platform` scope. See a complete list of scopes [here](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/instances/set-scopes#--scopes).

* `tags` - (Optional) The list of instance tags applied to instances in the cluster.
Tags are used to identify valid sources or targets for network firewalls.

* `internal_ip_only` - (Optional) By default, clusters are not restricted to internal IP addresses,
and will have ephemeral external IP addresses assigned to each instance. If set to true, all
instances in the cluster will only have internal IP addresses. Note: Private Google Access
(also known as `privateIpGoogleAccess`) must be enabled on the subnetwork that the cluster
will be launched in.

* `metadata` - (Optional) A map of the Compute Engine metadata entries to add to all instances
(see [Project and instance metadata](https://cloud.google.com/compute/docs/storing-retrieving-metadata#project_and_instance_metadata)).

* `reservation_affinity` - (Optional) Reservation Affinity for consuming zonal reservation (see the sketch after this list).
  * `consume_reservation_type` - (Optional) Corresponds to the type of reservation consumption.
  * `key` - (Optional) Corresponds to the label key of reservation resource.
  * `values` - (Optional) Corresponds to the label values of reservation resource.

* `node_group_affinity` - (Optional) Node Group Affinity for sole-tenant clusters.
  * `node_group_uri` - (Required) The URI of a sole-tenant node group resource that the cluster will be created on.

* `shielded_instance_config` (Optional) Shielded Instance Config for clusters using [Compute Engine Shielded VMs](https://cloud.google.com/security/shielded-cloud/shielded-vm).
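
As a rough sketch of how some of these fields fit together (the metadata entry, reservation consumption type, and reservation name below are illustrative assumptions, not defaults):

```hcl
cluster_config {
  gce_cluster_config {
    # Requires Private Google Access on the chosen subnetwork.
    internal_ip_only = true

    metadata = {
      "enable-oslogin" = "true"
    }

    # Consume a specific zonal reservation (hypothetical reservation name).
    reservation_affinity {
      consume_reservation_type = "SPECIFIC_RESERVATION"
      key                      = "compute.googleapis.com/reservation-name"
      values                   = ["my-reservation"]
    }
  }
}
```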

- - -


The `cluster_config.gce_cluster_config.shielded_instance_config` block supports:

```hcl
cluster_config {
  gce_cluster_config {
    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }
  }
}
```

* `enable_secure_boot` - (Optional) Defines whether instances have Secure Boot enabled.

* `enable_vtpm` - (Optional) Defines whether instances have the [vTPM](https://cloud.google.com/security/shielded-cloud/shielded-vm#vtpm) enabled.

* `enable_integrity_monitoring` - (Optional) Defines whether instances have integrity monitoring enabled.

- - -

<a name="nested_master_config"></a>The `cluster_config.master_config` block supports:

```hcl
cluster_config {
  master_config {
    num_instances    = 1
    machine_type     = "e2-medium"
    min_cpu_platform = "Intel Skylake"

    disk_config {
      boot_disk_type    = "pd-ssd"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
```

* `num_instances`- (Optional, Computed) Specifies the number of master nodes to create.
If not specified, GCP will default to a predetermined computed value (currently 1).

* `machine_type` - (Optional, Computed) The name of a Google Compute Engine machine type
to create for the master. If not specified, GCP will default to a predetermined
computed value (currently `n1-standard-4`).

* `min_cpu_platform` - (Optional, Computed) The name of a minimum generation of CPU family
for the master. If not specified, GCP will default to a predetermined computed value
for each zone. See [the guide](https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform)
for details about which CPU families are available (and defaulted) for each zone.

* `image_uri` (Optional) The URI for the image to use for this master. See [the guide](https://cloud.google.com/dataproc/docs/guides/dataproc-images)
for more information.

* `disk_config` (Optional) Disk Config

* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.

* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each node, specified
in GB. The primary disk contains the boot volume and system libraries, and the
smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.

* `num_local_ssds` - (Optional) The amount of local SSD disks that will be
attached to each master cluster node. Defaults to 0.

* `accelerators` (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified multiple times.

* `accelerator_type` - (Required) The short name of the accelerator type to expose to this instance. For example, `nvidia-tesla-k80`.

* `accelerator_count` - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of `1`, `2`, `4`, or `8`.

~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select
zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check [accelerator availability by zone](https://cloud.google.com/compute/docs/reference/rest/v1/acceleratorTypes/list)
if you are trying to use accelerators in a given zone.

- - -

<a name="nested_worker_config"></a>The `cluster_config.worker_config` block supports:

```hcl
cluster_config {
  worker_config {
    num_instances     = 3
    machine_type      = "e2-medium"
    min_cpu_platform  = "Intel Skylake"
    min_num_instances = 2
    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
```

* `num_instances`- (Optional, Computed) Specifies the number of worker nodes to create.
If not specified, GCP will default to a predetermined computed value (currently 2).
There is currently a beta feature which allows you to run a
[Single Node Cluster](https://cloud.google.com/dataproc/docs/concepts/single-node-clusters).
In order to take advantage of this you need to set
`"dataproc:dataproc.allow.zero.workers" = "true"` in
`cluster_config.software_config.properties`

* `machine_type` - (Optional, Computed) The name of a Google Compute Engine machine type
to create for the worker nodes. If not specified, GCP will default to a predetermined
computed value (currently `n1-standard-4`).

* `min_cpu_platform` - (Optional, Computed) The name of a minimum generation of CPU family
for the worker nodes. If not specified, GCP will default to a predetermined computed value
for each zone. See [the guide](https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform)
for details about which CPU families are available (and defaulted) for each zone.

* `disk_config` (Optional) Disk Config

* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.

* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each worker node, specified
in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.

* `num_local_ssds` - (Optional) The amount of local SSD disks that will be
attached to each worker cluster node. Defaults to 0.

* `image_uri` (Optional) The URI for the image to use for this worker. See [the guide](https://cloud.google.com/dataproc/docs/guides/dataproc-images)
for more information.

* `min_num_instances` (Optional) The minimum number of primary worker instances to create. If `min_num_instances` is set, cluster creation will succeed if the number of primary workers created is at least equal to the `min_num_instances` number.

* `accelerators` (Optional) The Compute Engine accelerator configuration for these instances. Can be specified multiple times.

* `accelerator_type` - (Required) The short name of the accelerator type to expose to this instance. For example, `nvidia-tesla-k80`.

* `accelerator_count` - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of `1`, `2`, `4`, or `8`.

~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select
zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check [accelerator availability by zone](https://cloud.google.com/compute/docs/reference/rest/v1/acceleratorTypes/list)
if you are trying to use accelerators in a given zone.

- - -

<a name="nested_preemptible_worker_config"></a>The `cluster_config.preemptible_worker_config` block supports:

```hcl
cluster_config {
  preemptible_worker_config {
    num_instances = 1

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
    instance_flexibility_policy {
      instance_selection_list {
        machine_types = ["n2-standard-2", "n1-standard-2"]
        rank          = 1
      }
      instance_selection_list {
        machine_types = ["n2d-standard-2"]
        rank          = 3
      }
    }
  }
}
```

Note: Unlike `worker_config`, you cannot set the `machine_type` value directly. This
will be set for you based on whatever was set for the `worker_config.machine_type` value.

* `num_instances`- (Optional) Specifies the number of preemptible nodes to create.
Defaults to 0.

* `preemptibility` - (Optional) Specifies the preemptibility of the secondary workers (see the sketch after this list). The default value is `PREEMPTIBLE`.
Accepted values are:
  * PREEMPTIBILITY_UNSPECIFIED
  * NON_PREEMPTIBLE
  * PREEMPTIBLE
  * SPOT

* `disk_config` (Optional) Disk Config

* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each preemptible worker node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.

* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each preemptible worker node, specified
in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.

* `num_local_ssds` - (Optional) The amount of local SSD disks that will be
attached to each preemptible worker node. Defaults to 0.

* `instance_flexibility_policy` (Optional) Instance flexibility Policy allowing a mixture of VM shapes and provisioning models.

* `instance_selection_list` - (Optional) List of instance selection options that the group will use when creating new VMs.
  * `machine_types` - (Optional) Full machine-type names, e.g. `"n1-standard-16"`.

* `rank` - (Optional) Preference of this instance selection. A lower number means higher preference. Dataproc will first try to create a VM from the machine types with the lowest rank and fall back to the next rank based on availability. Machine types and instance selections with the same priority have the same preference.
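
A brief sketch of requesting Spot VMs for the secondary workers (the instance count here is arbitrary):

```hcl
cluster_config {
  preemptible_worker_config {
    # Run the secondary workers as Spot VMs instead of the default PREEMPTIBLE.
    num_instances  = 2
    preemptibility = "SPOT"
  }
}
```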

- - -

<a name="nested_software_config"></a>The `cluster_config.software_config` block supports:

```hcl
cluster_config {
  # Override or set some custom properties
  software_config {
    image_version = "2.0.35-debian10"

    override_properties = {
      "dataproc:dataproc.allow.zero.workers" = "true"
    }
  }
}
```

* `image_version` - (Optional, Computed) The Cloud Dataproc image version to use
for the cluster - this controls the sets of software versions
installed onto the nodes when you create clusters. If not specified, defaults to the
latest version. For a list of valid versions see
[Cloud Dataproc versions](https://cloud.google.com/dataproc/docs/concepts/dataproc-versions)

* `override_properties` - (Optional) A list of override and additional properties (key/value pairs)
used to modify various aspects of the common configuration files used when creating
a cluster. For a list of valid properties please see
[Cluster properties](https://cloud.google.com/dataproc/docs/concepts/cluster-properties)

* `optional_components` - (Optional) The set of optional components to activate on the cluster. See [Available Optional Components](https://cloud.google.com/dataproc/docs/concepts/components/overview#available_optional_components).
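
For instance, a minimal sketch enabling a couple of optional components alongside a pinned image version (the components chosen here are just examples):

```hcl
cluster_config {
  software_config {
    image_version = "2.0.35-debian10"

    # Activate additional components on the cluster nodes.
    optional_components = ["JUPYTER", "ZEPPELIN"]
  }
}
```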

- - -

<a name="nested_security_config"></a>The `cluster_config.security_config` block supports:

```hcl
cluster_config {
  # Override or set some custom properties
  security_config {
    kerberos_config {
      kms_key_uri                 = "projects/projectId/locations/locationId/keyRings/keyRingId/cryptoKeys/keyId"
      root_principal_password_uri = "bucketId/o/objectId"
    }
  }
}
```

* `kerberos_config` (Required) Kerberos Configuration

* `cross_realm_trust_admin_server` - (Optional) The admin server (IP or hostname) for the
remote trusted realm in a cross realm trust relationship.

* `cross_realm_trust_kdc` - (Optional) The KDC (IP or hostname) for the
remote trusted realm in a cross realm trust relationship.

* `cross_realm_trust_realm` - (Optional) The remote realm the Dataproc on-cluster KDC will
trust, should the user enable cross realm trust.

* `cross_realm_trust_shared_password_uri` - (Optional) The Cloud Storage URI of a KMS
encrypted file containing the shared password between the on-cluster Kerberos realm
and the remote trusted realm, in a cross realm trust relationship.

* `enable_kerberos` - (Optional) Flag to indicate whether to Kerberize the cluster.

* `kdc_db_key_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file containing
the master key of the KDC database.

* `key_password_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file containing
the password to the user provided key. For the self-signed certificate, this password
is generated by Dataproc.

* `keystore_uri` - (Optional) The Cloud Storage URI of the keystore file used for SSL encryption.
If not provided, Dataproc will provide a self-signed certificate.

* `keystore_password_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file containing
the password to the user provided keystore. For the self-signed certificate, the password
is generated by Dataproc.

* `kms_key_uri` - (Required) The URI of the KMS key used to encrypt various sensitive files.

* `realm` - (Optional) The name of the on-cluster Kerberos realm. If not specified, the
uppercased domain of hostnames will be the realm.

* `root_principal_password_uri` - (Required) The Cloud Storage URI of a KMS encrypted file
containing the root principal password.

* `tgt_lifetime_hours` - (Optional) The lifetime of the ticket granting ticket, in hours.

* `truststore_password_uri` - (Optional) The Cloud Storage URI of a KMS encrypted file
containing the password to the user provided truststore. For the self-signed
certificate, this password is generated by Dataproc.

* `truststore_uri` - (Optional) The Cloud Storage URI of the truststore file used for
SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

- - -

<a name="nested_autoscaling_config"></a>The `cluster_config.autoscaling_config` block supports:

```hcl
cluster_config {
  # Override or set some custom properties
  autoscaling_config {
    policy_uri = "projects/projectId/locations/region/autoscalingPolicies/policyId"
  }
}
```

* `policy_uri` - (Required) The autoscaling policy used by the cluster.

Only resource names including projectid and location (region) are valid. Examples:

`https://www.googleapis.com/compute/v1/projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]`
`projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]`
Note that the policy must be in the same project and Cloud Dataproc region.

- - -

<a name="nested_initialization_action"></a>The `initialization_action` block (Optional) can be specified multiple times and supports:

```hcl
cluster_config {
  # You can define multiple initialization_action blocks
  initialization_action {
    script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
    timeout_sec = 500
  }
}
```

* `script`- (Required) The script to be executed during initialization of the cluster.
The script must be a GCS file with a gs:// prefix.

* `timeout_sec` - (Optional, Computed) The maximum duration (in seconds) which `script` is
allowed to take to execute its action. GCP will default to a predetermined
computed value if not set (currently 300).

- - -

<a name="nested_encryption_config"></a>The `encryption_config` block supports:

```hcl
cluster_config {
  encryption_config {
    kms_key_name = "projects/projectId/locations/region/keyRings/keyRingName/cryptoKeys/keyName"
  }
}
```

* `kms_key_name` - (Required) The Cloud KMS key name to use for PD disk encryption for
all instances in the cluster.

- - -

<a name="nested_dataproc_metric_config"></a>The `dataproc_metric_config` block supports:

```hcl
dataproc_metric_config {
  metrics {
    metric_source    = "HDFS"
    metric_overrides = ["yarn:ResourceManager:QueueMetrics:AppsCompleted"]
  }
}
```

* `metrics` - (Required) Metrics sources to enable.

* `metric_source` - (Required) A source for the collection of Dataproc OSS metrics (see [available OSS metrics](https://cloud.google.com/dataproc/docs/guides/monitoring#available_oss_metrics)).

* `metric_overrides` - (Optional) One or more [available OSS metrics](https://cloud.google.com/dataproc/docs/guides/monitoring#available_oss_metrics) to collect for the metric source.

- - -

<a name="nested_auxiliary_node_groups"></a>The `auxiliary_node_groups` block supports:

```hcl
auxiliary_node_groups {
  node_group {
    roles = ["DRIVER"]
    node_group_config {
      num_instances    = 2
      machine_type     = "n1-standard-2"
      min_cpu_platform = "AMD Rome"
      disk_config {
        boot_disk_size_gb = 35
        boot_disk_type    = "pd-standard"
        num_local_ssds    = 1
      }
      accelerators {
        accelerator_count = 1
        accelerator_type  = "nvidia-tesla-t4"
      }
    }
  }
}
```

* `node_group` - (Required) Node group configuration.

* `roles` - (Required) Node group roles.
One of `"DRIVER"`.

* `name` - (Optional) The Node group resource name.

* `node_group_config` - (Optional) The node group instance group configuration.

* `num_instances` - (Optional, Computed) Specifies the number of instances to create in this node group.
Must be greater than 0; a node group must have at least 1 instance.

* `machine_type` - (Optional, Computed) The name of a Google Compute Engine machine type
to create for the node group. If not specified, GCP will default to a predetermined
computed value (currently `n1-standard-4`).

* `min_cpu_platform` - (Optional, Computed) The name of a minimum generation of CPU family
for the node group. If not specified, GCP will default to a predetermined computed value
for each zone. See [the guide](https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform)
for details about which CPU families are available (and defaulted) for each zone.

* `disk_config` (Optional) Disk Config

* `boot_disk_type` - (Optional) The disk type of the primary disk attached to each node.
One of `"pd-ssd"` or `"pd-standard"`. Defaults to `"pd-standard"`.

* `boot_disk_size_gb` - (Optional, Computed) Size of the primary disk attached to each node, specified
in GB. The primary disk contains the boot volume and system libraries, and the
smallest allowed disk size is 10GB. GCP will default to a predetermined
computed value if not set (currently 500GB). Note: If SSDs are not
attached, it also contains the HDFS data blocks and Hadoop working directories.

* `num_local_ssds` - (Optional) The amount of local SSD disks that will be attached to each node in the node group.
Defaults to 0.

* `accelerators` (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified
multiple times.

* `accelerator_type` - (Required) The short name of the accelerator type to expose to this instance. For example, `nvidia-tesla-k80`.

* `accelerator_count` - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of `1`, `2`, `4`, or `8`.

- - -

<a name="nested_lifecycle_config"></a>The `lifecycle_config` block supports:

```hcl
cluster_config {
  lifecycle_config {
    idle_delete_ttl  = "10m"
    auto_delete_time = "2120-01-01T12:00:00.01Z"
  }
}
```

* `idle_delete_ttl` - (Optional) The duration to keep the cluster alive while idling
(no jobs running). After this TTL, the cluster will be deleted. Valid range: [10m, 14d].

* `auto_delete_time` - (Optional) The time when cluster will be auto-deleted.
A timestamp in RFC3339 UTC "Zulu" format, accurate to nanoseconds.
Example: "2014-10-02T15:01:23.045123456Z".

- - -

<a name="nested_endpoint_config"></a>The `endpoint_config` block (Optional, Computed, Beta) supports:

```hcl
cluster_config {
  endpoint_config {
    enable_http_port_access = "true"
  }
}
```

* `enable_http_port_access` - (Optional) The flag to enable http access to specific ports
on the cluster from external sources (aka Component Gateway). Defaults to false.


<a name="nested_metastore_config"></a>The `metastore_config` block (Optional, Computed, Beta) supports:

```hcl
cluster_config {
  metastore_config {
    dataproc_metastore_service = "projects/projectId/locations/region/services/serviceName"
  }
}
```

* `dataproc_metastore_service` - (Required) Resource name of an existing Dataproc Metastore service.

Only resource names including projectid and location (region) are valid. Examples:

`projects/[projectId]/locations/[dataproc_region]/services/[service-name]`

## Attributes Reference

In addition to the arguments listed above, the following computed attributes are
exported:

* `cluster_config.0.master_config.0.instance_names` - List of master instance names which
have been assigned to the cluster.

* `cluster_config.0.worker_config.0.instance_names` - List of worker instance names which have been assigned
to the cluster.

* `cluster_config.0.preemptible_worker_config.0.instance_names` - List of preemptible instance names which have been assigned
to the cluster.

* `cluster_config.0.bucket` - The name of the cloud storage bucket ultimately used to house the staging data
for the cluster. If `staging_bucket` is specified, it will contain this value, otherwise
it will be the auto generated name.

* `cluster_config.0.software_config.0.properties` - A list of the properties used to set the daemon config files.
This will include any values supplied by the user via `cluster_config.software_config.override_properties`

* `cluster_config.0.lifecycle_config.0.idle_start_time` - Time when the cluster became idle
(most recent job finished) and became eligible for deletion due to idleness.

* `cluster_config.0.endpoint_config.0.http_ports` - The map of port descriptions to URLs. Will only be populated if
`enable_http_port_access` is true.
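
For example, assuming a cluster named `google_dataproc_cluster.mycluster` as in the advanced example above, these attributes can be surfaced as outputs:

```hcl
# Staging bucket ultimately used by the cluster (auto generated if not set).
output "dataproc_staging_bucket" {
  value = google_dataproc_cluster.mycluster.cluster_config[0].bucket
}

# Names of the master instances assigned to the cluster.
output "dataproc_master_instance_names" {
  value = google_dataproc_cluster.mycluster.cluster_config[0].master_config[0].instance_names
}
```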

## Import

This resource does not support import.

## Timeouts

This resource provides the following
[Timeouts](https://developer.hashicorp.com/terraform/plugin/sdkv2/resources/retries-and-customizable-timeouts) configuration options:

- `create` - Default is 45 minutes.
- `update` - Default is 45 minutes.
- `delete` - Default is 45 minutes.
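
A short sketch of overriding these defaults on the resource (the values here are arbitrary):

```hcl
resource "google_dataproc_cluster" "simplecluster" {
  name   = "simplecluster"
  region = "us-central1"

  # Extend the create/update windows and shorten delete.
  timeouts {
    create = "60m"
    update = "60m"
    delete = "30m"
  }
}
```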