//
// Copyright (c) 2020 Contributors to the Eclipse Foundation
//
[appendix]
== Binding XML Names to Java Identifiers
=== Overview
This section provides default mappings from:
* XML Name to Java identifier
* Model group to Java identifier
* Namespace URI to Java package name
=== The Name to Identifier Mapping Algorithm
Java identifiers typically follow three simple, well-known conventions:
* Class and interface names always begin with
an upper-case letter. The remaining characters are either digits,
lower-case letters, or upper-case letters. Upper-case letters within a
multi-word name serve to identify the start of each non-initial word, or
sometimes to stand for acronyms.
* Method names and components of a package
name always begin with a lower-case letter, and otherwise are exactly
like class and interface names.
* Constant names are entirely in upper case,
with each pair of words separated by the underscore character (‘_’,
\u005F, LOW LINE).
XML names, however, are much richer than Java
identifiers: They may include not only the standard Java identifier
characters but also various punctuation and special characters that are
not permitted in Java identifiers. Like most Java identifiers, most XML
names are in practice composed of more than one natural-language word.
Non-initial words within an XML name typically start with an upper-case
letter followed by a lower-case letter, as in the Java language, or are
prefixed by punctuation characters, which is not usual in the Java
language and, for most punctuation characters, is in fact illegal.
In order to map an arbitrary XML name into a
Java class, method, or constant identifier, the XML name is first broken
into a _word list_. For the purpose of constructing word lists from XML
names we use the following definitions:
* A _punctuation character_ is one of the following:
** A hyphen (‘-’, \u002D, HYPHEN-MINUS),
** A period (‘.’, \u002E, FULL STOP),
** A colon (‘:’, \u003A, COLON),
** A dot (‘·’, \u00B7, MIDDLE DOT),
** \u0387, GREEK ANO TELEIA,
** \u06DD, ARABIC END OF AYAH,
** \u06DE, ARABIC START OF RUB EL HIZB, or
** An underscore (‘\_’, \u005F, LOW LINE), with the following exceptionfootnote:exc[Exception case:
Underscore is not considered a punctuation mark for schema customization
_<jaxb:globalBindings underscoreHandling="asCharInWord"/>_ specified in
<<Underscore Handling>>. For this
customization, underscore is considered a special letter that never
results in a word break as defined in <<xmlWordBreaks>>
and it is definitely not considered an uncased letter.
See example bindings in <<asCharInWord>>.]
These are all legal characters in XML names.
* A _letter_ is a character for which the
`Character.isLetter` method returns `true`, _i.e._, a letter according
to the Unicode Standard. Every letter is a legal Java identifier
character, both initial and non-initial.
* A _digit_ is a character for which the
`Character.isDigit` method returns `true`, _i.e._, a digit according
to the Unicode Standard. Every digit is a legal non-initial Java
identifier character.
* A _mark_ is a character that is in none of
the previous categories but for which the
`Character.isJavaIdentifierPart` method returns `true`. This category
includes numeric letters, combining marks, non-spacing marks, and
ignorable control characters.
Every XML name character falls into one of
the above categories. We further divide letters into three
subcategories:
* An _upper-case letter_ is a letter for which the `Character.isUpperCase` method returns `true`,
* A _lower-case letter_ is a letter for which the `Character.isLowerCase` method returns `true`, and
* All other letters are _uncased_.
An XML name is split into a word list by
removing any leading and trailing punctuation characters and then
searching for _word breaks_. A word break is defined by three regular
expressions: A prefix, a separator, and a suffix. The prefix matches
part of the word that precedes the break, the separator is not part of
any word, and the suffix matches part of the word that follows the
break. The word breaks are defined as:
.XML Word Breaks
[[xmlWordBreaks]]
[cols=",,,",options="header"]
|===
| Prefix | Separator | Suffix | Example
| `[^punct]` | `punct+` footnote:exc[] | `[^punct]` | `foo{vbar}--{vbar}bar`
| `digit` | | `[^digit]` | `foo{vbar}22{vbar}bar`
| `[^digit]` | | `digit` | `foo{vbar}22`
| `lower` | | `[^lower]` | `foo{vbar}Bar`
| `upper` | | `upper lower` | `FOO{vbar}Bar`
| `letter` | | `[^letter]` | `Foo{vbar}\u2160`
| `[^letter]` | | `letter` | `\u2160{vbar}Foo`
| `uncased` | | `[^uncased]` |
| `[^uncased]` | | `uncased` |
|===
(The character `\u2160` is ROMAN NUMERAL ONE, a numeric letter.)
After splitting, if a word begins with a
lower-case character then its first character is converted to upper
case. The final result is a word list in which each word is either
* A string of upper- and lower-case letters,
the first character of which is upper case (this includes the underscore, ‘\_’, in the
exception casefootnote:exc[]),
* A string of digits, or
* A string of uncased letters and marks.
Given an XML name in word-list form, each of
the three types of Java identifiers is constructed as follows:
* A class or interface identifier is
constructed by concatenating the words in the list.
* A method identifier is constructed by
concatenating the words in the list. A prefix verb (`get`, `set`,
_etc._) is prepended to the result.
* A constant identifier is constructed by
converting each word in the list to upper case; the words are then
concatenated, separated by underscores.
This algorithm will not change an XML name
that is already a legal and conventional Java class, method, or constant
identifier, except perhaps to add an initial verb in the case of a
property access method.
To improve the user experience with default
binding, the automated resolution of a frequent naming collision is
specified in <<Standardized Name Collision Resolution>>.
*_Example_*
.XML Names and derived Java Class, Method, and Constant Names
[[jcmcn]]
[cols=",,,",options="header"]
|===
| XML Name | Class Name | Method Name | Constant Name
| mixedCaseName | MixedCaseName | getMixedCaseName | MIXED_CASE_NAME
| Answer42 | Answer42 | getAnswer42 | ANSWER_42
| name-with-dashes | NameWithDashes | getNameWithDashes | NAME_WITH_DASHES
| other_punct-chars | OtherPunctChars | getOtherPunctChars | OTHER_PUNCT_CHARS
|===
.XML Names and derived Java Class, Method, and Constant Names when <jaxb:globalBindings underscoreHandling="asCharInWord">
[[asCharInWord]]
[cols=",,,",options="header"]
|===
| XML Name | Class Name | Method Name | Constant Name
| other_punct-chars | Other_punctChars | getOther_punctChars | OTHER_PUNCT_CHARS
| name_with_underscore | Name_with_underscore | getName_with_underscore | NAME_WITH_UNDERSCORE
|===
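
The two tables above can be reproduced with a short sketch of the word-list algorithm. This is an illustration under stated assumptions, not the schema compiler's implementation: the class `NameMappingSketch`, its method names, and the `underscoreAsChar` flag (standing in for `underscoreHandling="asCharInWord"`) are hypothetical, and supplementary characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; the class and method names are hypothetical. */
class NameMappingSketch {

    // Punctuation per the definitions above; '_' is punctuation unless asCharInWord applies.
    static boolean isPunct(char c, boolean underscoreAsChar) {
        if (c == '_') {
            return !underscoreAsChar;
        }
        return "-.:\u00B7\u0387\u06DD\u06DE".indexOf(c) >= 0;
    }

    static boolean isUpper(char c) { return Character.isLetter(c) && Character.isUpperCase(c); }
    static boolean isLower(char c) { return Character.isLetter(c) && Character.isLowerCase(c); }
    static boolean isUncased(char c) { return Character.isLetter(c) && !isUpper(c) && !isLower(c); }

    // True when a word break falls between s[i-1] and s[i]; neither is punctuation here.
    static boolean isBreak(String s, int i) {
        char prev = s.charAt(i - 1);
        char cur = s.charAt(i);
        if (prev == '_' || cur == '_') {
            return false; // only reachable under asCharInWord: '_' never breaks a word
        }
        if (Character.isDigit(prev) != Character.isDigit(cur)) return true;
        if (isLower(prev) && !isLower(cur)) return true;
        if (isUpper(prev) && isUpper(cur) && i + 1 < s.length() && isLower(s.charAt(i + 1))) return true;
        if (Character.isLetter(prev) != Character.isLetter(cur)) return true;
        return isUncased(prev) != isUncased(cur);
    }

    static List<String> toWordList(String name, boolean underscoreAsChar) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean punct = isPunct(c, underscoreAsChar);
            if (punct || (word.length() > 0 && isBreak(name, i))) {
                flush(words, word);
            }
            if (!punct) {
                word.append(c); // punctuation is a separator and never part of a word
            }
        }
        flush(words, word);
        return words;
    }

    // Capitalizes the first character of the pending word and adds it to the list.
    private static void flush(List<String> words, StringBuilder word) {
        if (word.length() == 0) return;
        word.setCharAt(0, Character.toUpperCase(word.charAt(0)));
        words.add(word.toString());
        word.setLength(0);
    }

    static String toClassName(String name, boolean underscoreAsChar) {
        return String.join("", toWordList(name, underscoreAsChar));
    }

    static String toMethodName(String verb, String name, boolean underscoreAsChar) {
        return verb + toClassName(name, underscoreAsChar);
    }

    static String toConstantName(String name, boolean underscoreAsChar) {
        List<String> upper = new ArrayList<>();
        for (String w : toWordList(name, underscoreAsChar)) {
            upper.add(w.toUpperCase(Locale.ROOT));
        }
        return String.join("_", upper);
    }
}
```

For instance, `NameMappingSketch.toConstantName("Answer42", false)` yields `ANSWER_42`, and `NameMappingSketch.toClassName("other_punct-chars", true)` yields `Other_punctChars`.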
==== Collisions and conflicts
It is possible that the name-mapping
algorithm will map two distinct XML names to the same word list. These
cases will result in a _collision_ if, and only if, the same Java
identifier is constructed from the word list and is used to name two
distinct generated classes or two distinct methods or constants in the
same generated class. It is also possible, if two or more namespaces are
customized to map to the same Java package, that XML names that are unique
due to belonging to distinct namespaces could be mapped to the same Java
class identifier. Collisions are not permitted by the schema compiler
and are reported as errors; they may be repaired by revising the XML names
within the source schema or by specifying a customized binding that maps
one of the two XML names to an alternative Java identifier.
A class name must not conflict with the
generated JAXB class, `ObjectFactory`, <<Java Package>>,
that occurs in each schema-derived Java package. Method
names are forbidden to conflict with Java keywords or literals, with
methods declared in `java.lang.Object`, or with methods declared in the
binding-framework classes. Such conflicts are reported as errors and may
be repaired by revising the appropriate schema or by specifying an
appropriate customized binding that resolves the name collision.
===== Standardized Name Collision Resolution
Given the frequency of an XML element or
attribute with the name `class` or `Class` resulting in a naming
collision with the inherited method `java.lang.Object.getClass()`,
method name mapping automatically resolves this conflict by mapping
these XML names to the Java method identifier `getClazz`.
[NOTE]
.Design Note
====
The likelihood of collisions, and the difficulty of working around them
when they occur, depends upon the source schema, the schema language
in which it is written, and the binding declarations. In general, however,
we expect that the combination of the identifier-construction rules given above,
together with good schema-design practices, will make collisions relatively uncommon.
The capitalization conventions embodied in the identifier-construction
rules will tend to reduce collisions as long as names with shared mappings
are used in schema constructs that map to distinct sorts of Java constructs.
An attribute named `foo` is unlikely to collide with an element type named `foo`
because the first maps to a set of property access methods (`getFoo`, `setFoo`, _etc._)
while the second maps to a class name (`Foo`).
Good schema-design practices also make collisions less likely. When writing a schema
it is inadvisable to use, in identical roles, names that are distinguished only by
punctuation or case. Suppose a schema declares two attributes of a single element type,
one named `Foo` and the other named `foo`. Their generated access methods,
namely `getFoo` and `setFoo`, will collide. This situation would best be handled by
revising the source schema, which would not only eliminate the collision
but also improve the readability of the source schema and documents that use it.
====
=== Deriving a legal Java identifier from an enum facet value
Given that an enum facet value is not
restricted to an XML name, the XML Name to Java identifier algorithm is
not applicable to generating a Java identifier from an enum facet
value. The following algorithm maps an enum facet value to a valid Java
constant identifier name.
* For each character in the enum facet value,
copy the character to a string representation `javaId` when
`java.lang.Character.isJavaIdentifierPart()` is `true`.
** To follow Java constant naming convention,
each valid lower case character must be copied as its upper case
equivalent.
* There is no derived Java constant identifier when any of the following occur:
** `javaId.length() == 0`
** `java.lang.Character.isJavaIdentifierStart(javaId.charAt(0)) == false`
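
The rules above translate almost directly into code. The following is a minimal sketch; the class name `EnumConstantSketch`, the method name, and the choice of returning `null` to signal "no derived identifier" are assumptions made for illustration.

```java
/** Illustrative sketch only; the class and method names are hypothetical. */
class EnumConstantSketch {

    /** Returns the derived constant name, or null when no identifier can be derived. */
    static String toConstantName(String facetValue) {
        StringBuilder javaId = new StringBuilder();
        for (int i = 0; i < facetValue.length(); i++) {
            char c = facetValue.charAt(i);
            // Keep only characters that are legal inside a Java identifier,
            // upper-casing them to follow the constant naming convention.
            if (Character.isJavaIdentifierPart(c)) {
                javaId.append(Character.toUpperCase(c));
            }
        }
        // No identifier when nothing survived or the first character cannot start one.
        if (javaId.length() == 0 || !Character.isJavaIdentifierStart(javaId.charAt(0))) {
            return null;
        }
        return javaId.toString();
    }
}
```

With this sketch, the facet value `light blue` maps to `LIGHTBLUE`, while `4x4` has no derived identifier because `4` cannot start a Java identifier.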
=== Deriving an identifier for a model group
XML Schema has the concept of a group of
element declarations. Occasionally, it is convenient to bind the
grouping as a Java content property or a Java value class. When a
semantically meaningful name for the group is not provided within the
source schema or via a binding declaration customization, it is
necessary to generate a Java identifier from the grouping. Below is an
algorithm to generate such an identifier.
A name is computed for an unnamed model group
by concatenating together the first 3 element declarations and/or
wildcards that occur within the model group. Each XML _{name}_ is mapped
to a Java identifier for a method using the XML Name to Java Identifier
Mapping algorithm. Since a wildcard does not have a _{name}_ property, it
is represented as the Java identifier `"Any"`. The Java identifiers
are concatenated together with the separator `"And"` for sequence and
all compositors and `"Or"` for choice compositors. For example, a
sequence of element `foo` and element `bar` would map to `"FooAndBar"`
and a choice of element `foo` and element `bar` would map to
`"FooOrBar"`. Lastly, a sequence of wildcard and element `bar` would
map to the Java identifier `"AnyAndBar"`.
*_Example:_* +
Given XML Schema fragment:
[source,xml,indent="2"]
----
<xs:choice>
  <xs:sequence>
    <xs:element ref="A"/>
    <xs:any processContents="strict"/>
  </xs:sequence>
  <xs:element ref="C"/>
</xs:choice>
----
The generated Java identifier would be `AAndAnyOrC`.
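
The naming rule can be sketched as a small recursive function over a model-group tree. Everything here is hypothetical scaffolding: the `ModelGroupNameSketch` class, its particle types, and its factory methods are not part of any API. The sketch assumes that element names have already been mapped to Java identifiers, and that the limit of three element declarations or wildcards is counted in document order across nested groups, which the text above does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; all types and method names are hypothetical. */
class ModelGroupNameSketch {

    interface Particle {}

    static final class Element implements Particle {
        final String javaName; // already mapped, e.g. "Foo" for element foo
        Element(String javaName) { this.javaName = javaName; }
    }

    static final class Wildcard implements Particle {}

    static final class Group implements Particle {
        final boolean choice; // false covers both sequence and all compositors
        final List<Particle> children;
        Group(boolean choice, Particle... children) {
            this.choice = choice;
            this.children = Arrays.asList(children);
        }
    }

    static Element element(String javaName) { return new Element(javaName); }
    static Wildcard any() { return new Wildcard(); }
    static Group sequence(Particle... children) { return new Group(false, children); }
    static Group choice(Particle... children) { return new Group(true, children); }

    static String nameOf(Group group) {
        return build(group, new int[] {3}); // at most three elements or wildcards
    }

    private static String build(Particle p, int[] remaining) {
        if (remaining[0] == 0) {
            return "";
        }
        if (p instanceof Element) {
            remaining[0]--;
            return ((Element) p).javaName;
        }
        if (p instanceof Wildcard) {
            remaining[0]--;
            return "Any";
        }
        Group g = (Group) p;
        List<String> parts = new ArrayList<>();
        for (Particle child : g.children) {
            String part = build(child, remaining);
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return String.join(g.choice ? "Or" : "And", parts);
    }
}
```

Building the schema fragment above as `choice(sequence(element("A"), any()), element("C"))` and passing it to `nameOf` gives `AAndAnyOrC`.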
=== Generating a Java package name
This section describes how to generate a
package name to hold the derived Java representation. The motivation for
specifying a default means to generate a Java package name is to
increase the chances that a schema can be processed by a schema compiler
without requiring the user to specify customizations.
If a schema has a target namespace, the next
subsection describes how to map the URI into a Java package name. If the
schema has no target namespace, there is a section that describes an
algorithm to generate a Java package name from the schema filename.
==== Mapping from a Namespace URI
An XML namespace is represented by a URI.
Since an XML namespace will be mapped to a Java package, it is necessary to
specify a default mapping from a URI to a Java package name. The URI
format is described in [RFC2396].
The following steps describe how to map a URI
to a Java package name. The example URI,
`http://www.acme.com/go/espeak.xsd`, is used to illustrate each step.
. Remove the scheme and `":"` part from the
beginning of the URI, if present. +
Since there is no formal syntax to identify the optional URI scheme,
restrict the schemes to be removed to case insensitive checks for
schemes `"http"` and `"urn"`.
+
[source]
----
//www.acme.com/go/espeak.xsd
----
. Remove the trailing file type, one of `.??` or `.???` or `.html`.
+
[source]
----
//www.acme.com/go/espeak
----
. Parse the remaining string into a list of
strings using `'/'` and `':'` as separators. Treat consecutive
separators as a single separator.
+
[source]
----
{"www.acme.com", "go", "espeak"}
----
. For each string in the list produced by
the previous step, unescape each escape sequence octet.
+
[source]
----
{"www.acme.com", "go", "espeak"}
----
. If the scheme is a `"urn"`, replace all
dashes, `"-"`, occurring in the first component with
`"."`.footnote:[Sample URN
"urn:hl7-org:v3" {"hl7-org", "v3"} transforms to {"hl7.org", "v3"}.]
. Apply the algorithm described in Section 7.7,
“Unique Package Names,” in [JLS] to derive a unique package name from the
potential internet domain name contained within the first component. The
internet domain name is reversed, component by component. Note that a
leading `"www."` is not considered part of an internet domain name and
must be dropped.
+
If the first component does not contain
one of the top-level domain names (for example, com, gov, net,
org, or edu) or one of the English two-letter codes identifying countries
as specified in ISO Standard 3166, 1981, this step must be skipped.
+
[source]
----
{"com", "acme", "go", "espeak"}
----
. Convert each string in the list to lower case.
+
[source]
----
{"com", "acme", "go", "espeak"}
----
. For each string remaining, the following
conventions are adopted from [JLS] Section 7.7, “Unique Package Names.”
.. If the string component contains a hyphen,
or any other special character not allowed in an identifier, convert it
into an underscore.
.. If any of the resulting package name
components are keywords, then append an underscore to them.
.. If any of the resulting package name
components start with a digit, or any other character that is not
allowed as an initial character of an identifier, have an underscore
prefixed to the component.
+
[source]
----
{"com", "acme", "go", "espeak"}
----
. Concatenate the resultant list of strings
using `"."` as a separating character to produce a package name.
+
[source]
----
Final package name: "com.acme.go.espeak".
----
<<Collisions and conflicts>> specifies what to do when the above algorithm results in
an invalid Java package name.
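
The steps above can be sketched in a single method. This is a hedged illustration, not the reference implementation: the class `PackageNameSketch` and its method names are hypothetical. It makes three simplifying assumptions: the file-type suffix is removed from the last component only, and only when there is more than one component, so that a bare host name such as `acme.com` keeps its `.com`; any two-letter final label is treated as a country code; and `java.net.URLDecoder` is used for unescaping, which also turns `+` into a space (both end up as underscores in step 8).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import javax.lang.model.SourceVersion;

/** Illustrative sketch only; the class and method names are hypothetical. */
class PackageNameSketch {

    private static final List<String> GENERIC_TLDS = Arrays.asList("com", "gov", "net", "org", "edu");

    static String toPackageName(String uri) {
        // Step 1: strip an "http:" or "urn:" scheme (case-insensitive check).
        String lower = uri.toLowerCase(Locale.ROOT);
        boolean urn = lower.startsWith("urn:");
        String rest = urn ? uri.substring(4) : lower.startsWith("http:") ? uri.substring(5) : uri;

        // Step 3: split on '/' and ':', treating consecutive separators as one.
        List<String> parts = new ArrayList<>();
        for (String p : rest.split("[/:]+")) {
            if (!p.isEmpty()) parts.add(p);
        }
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("no usable components in: " + uri);
        }

        // Step 2 (assumption: applied to the last component only): drop .??, .??? or .html.
        if (parts.size() > 1) {
            int last = parts.size() - 1;
            parts.set(last, parts.get(last).replaceFirst("\\.(?:[^.]{2,3}|html)$", ""));
        }

        // Step 4: unescape %XX octets.
        for (int i = 0; i < parts.size(); i++) {
            parts.set(i, unescape(parts.get(i)));
        }

        // Step 5: for URNs, dashes in the first component become dots.
        if (urn) parts.set(0, parts.get(0).replace('-', '.'));

        // Step 6: reverse the domain name, dropping a leading "www"; skipped without a TLD.
        List<String> labels = new ArrayList<>();
        for (String l : parts.get(0).split("\\.")) {
            if (!l.isEmpty()) labels.add(l);
        }
        String tld = labels.isEmpty() ? "" : labels.get(labels.size() - 1).toLowerCase(Locale.ROOT);
        if (GENERIC_TLDS.contains(tld) || tld.matches("[a-z]{2}")) {
            if (labels.size() > 1 && labels.get(0).equalsIgnoreCase("www")) labels.remove(0);
            Collections.reverse(labels);
            parts.remove(0);
            parts.addAll(0, labels);
        }

        // Steps 7 and 8: lower-case each string, then make it a legal identifier.
        List<String> result = new ArrayList<>();
        for (String p : parts) {
            if (p.isEmpty()) continue;
            StringBuilder sb = new StringBuilder();
            for (char c : p.toLowerCase(Locale.ROOT).toCharArray()) {
                sb.append(Character.isJavaIdentifierPart(c) ? c : '_');
            }
            if (SourceVersion.isKeyword(sb)) sb.append('_');
            if (!Character.isJavaIdentifierStart(sb.charAt(0))) sb.insert(0, '_');
            result.add(sb.toString());
        }

        // Step 9: join with '.'.
        return String.join(".", result);
    }

    private static String unescape(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

For the example URI, `PackageNameSketch.toPackageName("http://www.acme.com/go/espeak.xsd")` returns `com.acme.go.espeak`, and the sample URN `urn:hl7-org:v3` maps to `org.hl7.v3`.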
=== Conforming Java Identifier Algorithm
This section describes how to convert a legal
Java identifier which may not conform to Java naming conventions to a
Java identifier that conforms to the standard naming conventions.
<<Customized Name Mapping>> discusses when
this algorithm is applied to customization names.
Since a legal Java identifier is also an XML
name, this algorithm is the same as <<The Name to Identifier Mapping Algorithm>>
with the following exception:
constant names must not be mapped to a Java constant that conforms to
the Java naming convention for a constant.