
 # How Terraform Uses Unicode

 The Terraform language uses the Unicode standards as the basis of various
 different features. The Unicode Consortium publishes new versions of those
 standards periodically, and we aim to adopt those new versions in new
 minor releases of Terraform in order to support additional characters added
 in those new versions.

 Unfortunately, because those features rely on multiple external libraries,
 adopting a new version of Unicode is not as simple as just updating a version
 number somewhere. This document describes the steps required to adopt a new
 version of Unicode in Terraform.

 We typically aim to be consistent across all of these dependencies as to which
 major version of Unicode we currently conform to. The usual initial driver
 for a Unicode upgrade is switching to a new version of the Go runtime library
 which itself uses a new version of Unicode, because Go itself does not provide
 any way to select Unicode versions independently from Go versions. Therefore,
 we typically upgrade to a new Unicode version only in conjunction with
 upgrading to a new Go version.

 ## Unicode tables in the Go standard library

 Several Terraform language features are implemented in terms of functions in
 [the Go `strings` package](https://pkg.go.dev/strings),
 [the Go `unicode` package](https://pkg.go.dev/unicode), and other supporting
 packages in the Go standard library.

 The Go team maintains the Go standard library features to support a particular
 Unicode version for each Go version. The specific Unicode version for a
 particular Go version is available in
 [`unicode.Version`](https://pkg.go.dev/unicode#Version).

 We adopt a new version of Go by editing the `.go-version` file in the root
 of this repository. Although it's typically possible to build Terraform with
 other versions of Go, that file documents the version we intend to use for
 official releases and thus the primary version we use for development and
 testing. Adopting a new Go version typically also implies other behavior
 changes inherited from the Go standard library, so it's important to review the
 relevant version changelog(s) to note any behavior changes we'll need to pass
 on to our own users via the Terraform changelog.

 The other subsystems described below should always be set up to match
 `unicode.Version`. In some cases those libraries automatically try to align
 themselves with `unicode.Version` and generate an error if they cannot, but
 that isn't true of all of them.

 ## Unicode Identifier Rules in HCL

 _Identifier and Pattern Syntax_ (TR31) is a Unicode standards annex which
 describes a set of rules for tokenizing "identifiers", such as variable names
 in a programming language.

 HCL uses a superset of that specification for its own identifier tokenization
 rules, and so it includes some code derived from the TR31 data tables that
 describe which characters belong to the "ID_Start" and "ID_Continue" classes.
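
To make the two classes concrete, here is a rough sketch of the default
identifier rule, approximating "ID_Start" and "ID_Continue" from the tables in
the Go standard library. It is not HCL's scanner (which is generated with
Ragel) and does not model HCL's extensions beyond the annex; the function names
are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// isPatternChar reports whether r is reserved for syntax or white space and
// is therefore excluded from both identifier classes.
func isPatternChar(r rune) bool {
	return unicode.In(r, unicode.Pattern_Syntax, unicode.Pattern_White_Space)
}

// isIDStart approximates ID_Start: letters, letter numbers, and
// Other_ID_Start, minus the pattern characters.
func isIDStart(r rune) bool {
	return !isPatternChar(r) && unicode.In(r, unicode.L, unicode.Nl, unicode.Other_ID_Start)
}

// isIDContinue approximates ID_Continue: everything in ID_Start plus marks,
// decimal digits, connector punctuation, and Other_ID_Continue.
func isIDContinue(r rune) bool {
	if isPatternChar(r) {
		return false
	}
	return isIDStart(r) || unicode.In(r, unicode.Mn, unicode.Mc, unicode.Nd, unicode.Pc, unicode.Other_ID_Continue)
}

// isIdentifier reports whether s is one ID_Start character followed by zero
// or more ID_Continue characters.
func isIdentifier(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		if i == 0 {
			if !isIDStart(r) {
				return false
			}
			continue
		}
		if !isIDContinue(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"foo_bar", "café", "1abc", "a b"} {
		fmt.Printf("%q: %v\n", s, isIdentifier(s))
	}
}
```

Because the result depends on the standard library tables, the set of accepted
characters changes with the Go version, which is the reason the generated HCL
tables need to be kept in step.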

 Since Terraform is the primary user of HCL, it's typically Terraform's adoption
 of a new Unicode version which drives HCL to adopt one. To update the Unicode
 tables to a new version:

 * Edit `hclsyntax/generate.go`'s line which runs `unicode2ragel.rb` to specify
   the URL of the `DerivedCoreProperties.txt` data file for the intended Unicode
   version.
 * Run `go generate ./hclsyntax` to run the generation code to update both
   `unicode_derived.rl` and, indirectly, `scan_tokens.go`. (You will need both
   a Ruby interpreter and the Ragel state machine compiler on your system in
   order to complete this step.)
 * Run all the tests to check for regressions: `go test ./...`
 * If all looks good, commit all of the changes and open a PR to HCL.
 * Once that PR is merged and released, update Terraform to use the new version
   of HCL.
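
For the first step above, the data file URL depends on the full Unicode
version. The sketch below builds it from `unicode.Version`, assuming the
directory layout currently used on unicode.org; the helper name is
hypothetical, and the result should be compared against the URL already
present in `hclsyntax/generate.go` before use.

```go
package main

import (
	"fmt"
	"unicode"
)

// derivedCorePropertiesURL returns the assumed unicode.org location of
// DerivedCoreProperties.txt for a full Unicode version such as "15.0.0".
// (Hypothetical helper; the URL layout is an assumption.)
func derivedCorePropertiesURL(version string) string {
	return fmt.Sprintf("https://www.unicode.org/Public/%s/ucd/DerivedCoreProperties.txt", version)
}

func main() {
	// Print the URL matching the Unicode version of the current Go toolchain.
	fmt.Println(derivedCorePropertiesURL(unicode.Version))
}
```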

 ## Unicode Text Segmentation

 _Text Segmentation_ (TR29) is a Unicode standards annex which describes
 algorithms for breaking strings into smaller units such as sentences, words,
 and grapheme clusters.

 Several Terraform language features make use of the _grapheme cluster_
 algorithm in particular, because it provides a practical definition of
 individual visible characters, taking into account combining sequences such
 as Latin letters with separate diacritics or Emoji characters with gender
 presentation and skin tone modifiers.
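
The following standard-library-only sketch shows why neither bytes nor runes
are a good proxy for visible characters: a Latin letter followed by a combining
accent is displayed as one character but is made of two code points. Counting
it as a single grapheme cluster requires a text segmentation implementation,
which the standard library does not provide. The helper name is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts returns the length of s in UTF-8 bytes and in Unicode
// code points (runes). Neither is a count of grapheme clusters.
func byteAndRuneCounts(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as a
	// single accented letter.
	s := "e\u0301"
	b, r := byteAndRuneCounts(s)
	fmt.Printf("bytes=%d runes=%d\n", b, r)
}
```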

 The text segmentation algorithms rely on supplementary data tables that are
 not part of the core set encoded in the Go standard library's `unicode`
 packages, and so instead we rely on the third-party module
 [`github.com/apparentlymart/go-textseg`](http://pkg.go.dev/github.com/apparentlymart/go-textseg)
 to provide those tables and a Go implementation of the grapheme cluster
 segmentation algorithm in terms of the tables.

 The `go-textseg` library is designed to allow calling programs to potentially
 support multiple Unicode versions at once, by offering a separate module major
 version for each Unicode major version. For example, the full module path for
 the Unicode 13 implementation is `github.com/apparentlymart/go-textseg/v13`.
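
The naming scheme can be sketched as follows. The helper `textsegModulePath`
is hypothetical, and a path produced this way is only usable if the library has
actually published that major version.

```go
package main

import "fmt"

// textsegModulePath returns the module path that would correspond to the
// given Unicode major version under the naming scheme described above.
// (Hypothetical helper; it does not check that the version exists.)
func textsegModulePath(unicodeMajor int) string {
	return fmt.Sprintf("github.com/apparentlymart/go-textseg/v%d", unicodeMajor)
}

func main() {
	fmt.Println(textsegModulePath(13))
}
```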

 If that external library doesn't yet have support for the Unicode version we
 intend to adopt then we'll first need to open a pull request to contribute
 support for that version. The details of how to do this will unfortunately vary
 depending on how significantly the Text Segmentation annex has changed since
 the most recently-supported Unicode version, but in many cases it can be
 just a matter of editing that library's `make_tables.go`, `make_test_tables.go`,
 and `generate.go` files to point to the URLs where the Unicode Consortium
 published new tables and then running `go generate` to rebuild the files
 derived from those data sources. As long as the new Unicode version has only
 changed the data tables and not also changed the algorithm, often no further
 changes are needed.

 Once a new Unicode version is included, the maintainer of that library will
 typically publish a new major version that we can depend on. Two different
 codebases included in Terraform both depend directly on the `go-textseg` module
 for parts of their functionality:

 * [`hashicorp/hcl`](https://github.com/hashicorp/hcl) uses text
   segmentation as part of producing visual column offsets in source ranges
   returned by the tokenizer and parser. Terraform in turn uses that library
   for the underlying syntax of the Terraform language, and so it passes on
   those source ranges to the end-user as part of diagnostic messages.
 * The third-party module [`github.com/zclconf/go-cty`](https://github.com/zclconf/go-cty)
   provides several of the Terraform language built-in functions, including
   functions like `substr` and `length` which need to count grapheme clusters
   as part of their implementation.

 As part of upgrading Terraform's Unicode support we therefore typically also
 open pull requests against these other codebases, and then adopt the new
 versions that result. Terraform work often drives the adoption of new Unicode
 versions in those codebases, with other dependencies following along when they
 next upgrade.

 At the time of writing Terraform itself doesn't _directly_ depend on
 `go-textseg`, and so there are no specific changes required in this Terraform
 codebase aside from the `go.sum` file update that always follows from
 changes to transitive dependencies.

 The `go-textseg` library does have a different "auto-version" mechanism which
 selects an appropriate module version based on the current Go language version,
 but neither HCL nor cty uses it. The auto-version package will not compile for
 any Go version that doesn't have a corresponding Unicode version explicitly
 recorded in that repository. That would be too harsh a constraint for libraries
 like HCL, which have many callers that don't care strongly about Unicode
 support and that may wish to upgrade Go before the text segmentation library
 has been updated.