blob: 423abd14509abbba3928a6a271e18e6d4fd8ccc3 [file] [log] [blame]
//
// Copyright (c) 2020 Contributors to the Eclipse Foundation
//
[appendix]
== [[a4649]]Binding XML Names to Java Identifiers
=== Overview
This section provides default mappings from:
* XML Name to Java identifier
* Model group to Java identifier
* Namespace URI to Java package name
=== [[a4656]]The Name to Identifier Mapping Algorithm
Java identifiers typically follow three
simple, well-known conventions:
* Class and interface names always begin with
an upper-case letter. The remaining characters are either digits,
lower-case letters, or upper-case letters. Upper-case letters within a
multi-word name serve to identify the start of each non-initial word, or
sometimes to stand for acronyms.
* Method names and components of a package
name always begin with a lower-case letter, and otherwise are exactly
like class and interface names.
* Constant names are entirely in upper case,
with each pair of words separated by the underscore character (‘_’,
\u005F, LOW LINE).
XML names, however, are much richer than Java
identifiers: They may include not only the standard Java identifier
characters but also various punctuation and special characters that are
not permitted in Java identifiers. Like most Java identifiers, most XML
names are in practice composed of more than one natural-language word.
Non-initial words within an XML name typically start with an upper-case
letter followed by a lower-case letter, as in Java language, or are
prefixed by punctuation characters, which is not usual in the Java
language and, for most punctuation characters, is in fact illegal.
In order to map an arbitrary XML name into a
Java class, method, or constant identifier, the XML name is first broken
into a _word list_ . For the purpose of constructing word lists from XML
names we use the following definitions:
* A _punctuation character_ is one of the
following:
* A hyphen (’-’, \u002D, HYPHEN-MINUS),
* A period (‘.’, \u002E, FULL STOP),
* A colon (’:’, \u003A, COLON),
* A dot (‘.’, \u00B7, MIDDLE DOT),
* \u0387, GREEK ANO TELEIA,
* \u06DD, ARABIC END OF AYAH, or
* \u06DE, ARABIC START OF RUB EL HIZB.
* An underscore (’_’, \u005F, LOW LINE) with
following exceptionlink:#a5380[29]
These are all legal characters in XML names.
* A _letter_ is a character for which the
_Character.isLetter_ method returns _true_ , _i.e._ , a letter according
to the Unicode standard. Every letter is a legal Java identifier
character, both initial and non-initial.
* A _digit_ is a character for which the
_Character.isDigit_ method returns _true_ , _i.e._ , a digit according
to the Unicode Standard. Every digit is a legal non-initial Java
identifier character.
* A _mark_ is a character that is in none of
the previous categories but for which the
_Character.isJavaIdentifierPart_ method returns _true_ . This category
includes numeric letters, combining marks, non-spacing marks, and
ignorable control characters.
Every XML name character falls into one of
the above categories. We further divide letters into three
subcategories:
* An _upper-case letter_ is a letter for
which the _Character.isUpperCase_ method returns _true_ ,
* A _lowercase letter_ is a letter for which
the _Character.isLowerCase_ method returns _true_ , and
* All other letters are _uncased_ .
An XML name is split into a word list by
removing any leading and trailing punctuation characters and then
searching for _word breaks_ . A word break is defined by three regular
expressions: A prefix, a separator, and a suffix. The prefix matches
part of the word that precedes the break, the separator is not part of
any word, and the suffix matches part of the word that follows the
break. The word breaks are defined as:
=== [[a4681]]XML Word Breaks
Prefix
Separator
Suffix
Example
{empty}[^punct]
punct+1
{empty}[^punct]
foo|--|bar
digit
{empty}[^ _digit_ ]
foo|22|bar
{empty}[^digit]
digit
foo|22
lower
{empty}[^lower]
foo|Bar
upper
upper lower
FOO|Bar
letter
{empty}[^letter]
Foo|\u2160
{empty}[^letter]
letter
\u2160|Foo
uncased
{empty}[^uncased]
{empty}[^uncased]
uncased
(The character _\u2160_ is ROMAN NUMERAL ONE,
a numeric letter.)
After splitting, if a word begins with a
lower-case character then its first character is converted to upper
case. The final result is a word list in which each word is either
* A string of upper- and lower-case letters,
the first character of which is upper case (includes underscore,’_’, for
exception case1).
* A string of digits, or
* A string of uncased letters and marks.
Given an XML name in word-list form, each of
the three types of Java identifiers is constructed as follows:
* A class or interface identifier is
constructed by concatenating the words in the list,
* A method identifier is constructed by
concatenating the words in the list. A prefix verb ( _get_ , _set_ ,
_etc._ ) is prepended to the result.
* A constant identifier is constructed by
converting each word in the list to upper case; the words are then
concatenated, separated by underscores.
This algorithm will not change an XML name
that is already a legal and conventional Java class, method, or constant
identifier, except perhaps to add an initial verb in the case of a
property access method.
To improve user experience with default
binding, the automated resolution of frequent naming collision is
specified in link:jaxb.html#a4770[See Standardized Name
Collision Resolution]“.
=== Example
=== [[a4734]]XML Names and derived Java Class, Method, and Constant Names
XML Name
Class Name
Method Name
Constant Name
mixedCaseName
MixedCaseName
getMixedCaseName
MIXED_CASE_NAME
Answer42
Answer42
getAnswer42
ANSWER_42
name-with-dashes
NameWithDashes
getNameWithDashes
NAME_WITH_DASHES
other_punct-chars
OtherPunctChars
getOtherPunctChars
OTHER_PUNCT_CHARS
=== [[a4755]]XML Names and derived Java Class, Method, and Constant Names when <jaxb:globalBindings underscoreHandling=”asCharInWord”>
[width="100%",cols="25%,25%,25%,25%",options="header",]
|===
|XML Name |Class
Name |Method Name
|Constant Name
|other_punct-chars
|Other_punctChars
|getOther_punctChars
|OTHER_PUNCT_CHARS
|name_with_underscore
|Name_with_underscore
|name_with_underscore
|NAME_WITH_UNDERSCORE
|===
=== [[a4767]]Collisions and conflicts
It is possible that the name-mapping
algorithm will map two distinct XML names to the same word list.These
cases will result in a _collision_ if, and only if, the same Java
identifier is constructed from the word list and is used to name two
distinct generated classes or two distinct methods or constants in the
same generated class. It is also possible if two or more namespaces are
customized to map to the same Java package, XML names that are unique
due to belonging to distinct namespaces could mapped to the same Java
Class identifier. Collisions are not permitted by the schema compiler
and are reported as errors; they may be repaired by revising XML name
within the source schema or by specifying a customized binding that maps
one of the two XML names to an alternative Java identifier.
A class name must not conflict with the
generated JAXB class, _ObjectFactory_ , link:jaxb.html#a482[See
Java Package], that occurs in each schema-derived Java package. Method
names are forbidden to conflict with Java keywords or literals, with
methods declared in _java.lang.Object_ , or with methods declared in the
binding-framework classes. Such conflicts are reported as errors and may
be repaired by revising the appropriate schema or by specifying an
appropriate customized binding that resolves the name collision.
=== [[a4770]]Standardized Name Collision Resolution
Given the frequency of an XML element or
attribute with the name class or Class resulting in a naming
collision with the inherited method _java.lang.Object.getClass()_ ,
method name mapping automatically resolves this conflict by mapping
these XML names to the java method identifier getClazz”.
*
=== [[a4773]]Deriving a legal Java identifier from an enum facet value
Given that an enum facets value is not
restricted to an XML name, the XML Name to Java identifier algorithm is
not applicable to generating a Java identifier from an enum facets
value. The following algorithm maps an enum facet value to a valid Java
constant identifier name.
* For each character in enum facet value, +
copy the character to a string representation _javaId_ when
_java.lang.Character.isJavaIdentifierPart()_ is _true_ .
* To follow Java constant naming convention,
each valid lower case character must be copied as its upper case
equivalent.
* There is no derived Java constant
identifier when any of the following occur:
* _javaId.length() == 0_
*
_java.lang.Character.isJavaIdentifierStart(javaId.get(0)) == false_
=== [[a4780]]Deriving an identifier for a model group
XML Schema has the concept of a group of
element declarations. Occasionally, it is convenient to bind the
grouping as a Java content property or a Java value class. When a
semantically meaningful name for the group is not provided within the
source schema or via a binding declaration customization, it is
necessary to generate a Java identifier from the grouping. Below is an
algorithm to generate such an identifier.
A name is computed for an unnamed model group
by concatenating together the first 3 element declarations and/or
wildcards that occur within the model group. Each XML \{name} is mapped
to a Java identifier for a method using the XML Name to Java Identifier
Mapping algorithm. Since wildcard does not have a \{name} property, it
is represented as the Java identifier _Any_ ”. The Java identifiers
are concatenated together with the separator _And_ for sequence and
all compositor and _Or_ for choice compositors. For example, a
sequence of element _foo_ and element _bar_ would map to __ _FooAndBar_
__ and a choice of element _foo_ and element _bar_ maps to __
_FooOrBar_ _._ Lastly, a sequence of wildcard and element _bar_ would
map to the Java identifier __ _AnyAndBar_ __ .
=== Example:
Given XML Schema fragment:
<xs:choice> +
<xs:sequence> +
<xs:element ref="A"/> +
<xs:any processContents="strict"/> +
</xs:sequence> +
<xs:element ref="C"/>
</xs:choice>
The generated Java identifier would be
_AAndAnyOrC_ .
=== [[a4788]]Generating a Java package name
This section describes how to generate a
package name to hold the derived Java representation. The motivation for
specifying a default means to generate a Java package name is to
increase the chances that a schema can be processed by a schema compiler
without requiring the user to specify customizations.
If a schema has a target namespace, the next
subsection describes how to map the URI into a Java package name. If the
schema has no target namespace, there is a section that describes an
algorithm to generate a Java package name from the schema filename.
=== Mapping from a Namespace URI
An XML namespace is represented by a URI.
Since XML Namespace will be mapped to a Java package, it is necessary to
specify a default mapping from a URI to a Java package name. The URI
format is described in [RFC2396].
The following steps describe how to map a URI
to a Java package name. The example URI,
_http://www.acme.com/go/espeak.xsd_ , is used to illustrate each step.
. Remove the scheme and _":"_ part from the
beginning of the URI, if present. +
Since there is no formal syntax to identify the optional URI scheme,
restrict the schemes to be removed to case insensitive checks for
schemes _http_ and _urn_ ”.
_//www.acme.com/go/espeak.xsd_
. Remove the trailing file type, one of _.??_
or _.???_ or _.html_ .
_//www.acme.com/go/espeak_
. Parse the remaining string into a list of
strings using _’/’_ and _‘:’_ as separators. Treat consecutive
separators as a single separator.
_\{"www.acme.com", "go", "espeak" }_
. For each string in the list produced by
previous step, unescape each escape sequence octet.
_\{"www.acme.com", "go", "espeak" }_
. If the scheme is a urn”, replace all
dashes, “-”, occurring in the first component with
“.”.link:#a5381[30]
. Apply algorithm described in Section 7.7
Unique Package Names in [JLS] to derive a unique package name from the
potential internet domain name contained within the first component. The
internet domain name is reversed, component by component. Note that a
leading _www_ .” is not considered part of an internet domain name and
must be dropped.
If the first component does not contain
either one of the top-level domain names, for example, com, gov, net,
org, edu, or one of the English two-letter codes identifying countries
as specified in ISO Standard 3166, 1981, this step must be skipped.
_\{com”, acme”, go”, espeak”}_
. For each string in the list, convert each
string to be all lower case.
_\{"com”, “acme”, "go", "espeak" }_
. For each string remaining, the following
conventions are adopted from [JLS] Section 7.7, “Unique Package Names.”
. If the sting component contains a hyphen,
or any other special character not allowed in an identifier, convert it
into an underscore.
. If any of the resulting package name
components are keywords then append underscore to them.
. If any of the resulting package name
components start with a digit, or any other character that is not
allowed as an initial character of an identifier, have an underscore
prefixed to the component.
_\{"com”, acme”, "go", "espeak" }_
. Concatenate the resultant list of strings
using _’.’_ as a separating character to produce a package name.
_Final package name: "com.acme.go.espeak"._
link:jaxb.html#a4767[See Collisions
and conflicts], specifies what to do when the above algorithm results in
an invalid Java package name.
=== [[a4816]]Conforming Java Identifier Algorithm
This section describes how to convert a legal
Java identifier which may not conform to Java naming conventions to a
Java identifier that conforms to the standard naming conventions.
link:jaxb.html#a1608[See Customized Name Mapping]“discusses when
this algorithm is applied to customization names.
Since a legal Java identifier is also a XML
name, this algorithm is the same as link:jaxb.html#a4656[See The
Name to Identifier Mapping Algorithm]” with the following exception:
constant names must not be mapped to a Java constant that conforms to
the Java naming convention for a constant.