| // |
| // Copyright (c) 2020, 2021 Contributors to the Eclipse Foundation |
| // |
| |
| [appendix] |
| == Binding XML Names to Java Identifiers |
| |
| === Overview |
| |
| This section provides default mappings from: |
| |
| * XML Name to Java identifier |
| * Model group to Java identifier |
| * Namespace URI to Java package name |
| |
| === The Name to Identifier Mapping Algorithm |
| |
| Java identifiers typically follow three simple, well-known conventions: |
| |
| * Class and interface names always begin with |
| an upper-case letter. The remaining characters are either digits, |
| lower-case letters, or upper-case letters. Upper-case letters within a |
| multi-word name serve to identify the start of each non-initial word, or |
| sometimes to stand for acronyms. |
| * Method names and components of a package |
| name always begin with a lower-case letter, and otherwise are exactly |
| like class and interface names. |
| * Constant names are entirely in upper case, |
| with each pair of words separated by the underscore character (‘_’, |
| \u005F, LOW LINE). |
| |
| XML names, however, are much richer than Java |
| identifiers: They may include not only the standard Java identifier |
| characters but also various punctuation and special characters that are |
| not permitted in Java identifiers. Like most Java identifiers, most XML |
| names are in practice composed of more than one natural-language word. |
| Non-initial words within an XML name typically start with an upper-case |
| letter followed by a lower-case letter, as in Java language, or are |
| prefixed by punctuation characters, which is not usual in the Java |
| language and, for most punctuation characters, is in fact illegal. |
| |
| In order to map an arbitrary XML name into a |
| Java class, method, or constant identifier, the XML name is first broken |
| into a _word list_. For the purpose of constructing word lists from XML |
| names we use the following definitions: |
| |
| * A _punctuation character_ is one of the following: |
| * A hyphen (’-’, \u002D, HYPHEN-MINUS), |
| * A period (‘.’, \u002E, FULL STOP), |
| * A colon (’:’, \u003A, COLON), |
| * A dot (‘.’, \u00B7, MIDDLE DOT), |
| * \u0387, GREEK ANO TELEIA, |
| * \u06DD, ARABIC END OF AYAH, or |
| * \u06DE, ARABIC START OF RUB EL HIZB. |
| * An underscore (’\_’, \u005F, LOW LINE) with following exceptionfootnote:exc[Exception case: |
| Underscore is not considered a punctuation mark for schema customization |
| _<jaxb:globalBindings underscoreHandling="asCharInWord"/>_ specified in |
| <<Underscore Handling>>. For this |
| customization, underscore is considered a special letter that never |
| results in a word break as defined in <<xmlWordBreaks>> |
| and it is definitely not considered an uncased letter. |
| See example bindings in <<asCharInWord>>.] |
| |
| These are all legal characters in XML names. |
| |
| * A _letter_ is a character for which the |
| `Character.isLetter` method returns `true`, _i.e._ , a letter according |
| to the Unicode standard. Every letter is a legal Java identifier |
| character, both initial and non-initial. |
| * A _digit_ is a character for which the |
| `Character.isDigit` method returns `true`, _i.e._ , a digit according |
| to the Unicode Standard. Every digit is a legal non-initial Java |
| identifier character. |
| * A _mark_ is a character that is in none of |
| the previous categories but for which the |
| `Character.isJavaIdentifierPart` method returns `true`. This category |
| includes numeric letters, combining marks, non-spacing marks, and |
| ignorable control characters. |
| |
| Every XML name character falls into one of |
| the above categories. We further divide letters into three |
| subcategories: |
| |
| * An _upper-case letter_ is a letter for which the `Character.isUpperCase` method returns `true`, |
| * A _lowercase letter_ is a letter for which the `Character.isLowerCase` method returns `true`, and |
| * All other letters are _uncased_. |
| |
| An XML name is split into a word list by |
| removing any leading and trailing punctuation characters and then |
| searching for _word breaks_. A word break is defined by three regular |
| expressions: A prefix, a separator, and a suffix. The prefix matches |
| part of the word that precedes the break, the separator is not part of |
| any word, and the suffix matches part of the word that follows the |
| break. The word breaks are defined as: |
| |
| .XML Word Breaks |
| [[xmlWordBreaks]] |
| [cols=",,,",options="header"] |
| |=== |
| | Prefix | Separator | Suffix | Example |
| | `[^punct]` | `punct+` footnote:exc[] | `[^punct]` | `foo{vbar}--{vbar}bar` |
| | `digit` | | `[^digit]` | `foo{vbar}22{vbar}bar` |
| | `[^digit]` | | `digit` | `foo{vbar}22` |
| | `lower` | | `[^lower]` | `foo{vbar}Bar` |
| | `upper` | | `upper lower` | `FOO{vbar}Bar` |
| | `letter` | | `[^letter]` | `Foo{vbar}\u2160` |
| | `[^letter]` | | `letter` | `\u2160{vbar}Foo` |
| | `uncased` | | `[^uncased]` | |
| | `[^uncased]` | | `uncased` | |
| |=== |
| |
| (The character `\u2160` is ROMAN NUMERAL ONE, a numeric letter.) |
| |
| After splitting, if a word begins with a |
| lower-case character then its first character is converted to upper |
| case. The final result is a word list in which each word is either |
| |
| * A string of upper- and lower-case letters, |
| the first character of which is upper case (includes underscore, ’_’, for |
| exception casefootnote:exc[]). |
| * A string of digits, or |
| * A string of uncased letters and marks. |
| |
| Given an XML name in word-list form, each of |
| the three types of Java identifiers is constructed as follows: |
| |
| * A class or interface identifier is |
| constructed by concatenating the words in the list, |
| * A method identifier is constructed by |
| concatenating the words in the list. A prefix verb (`get`, `set`, |
| _etc._) is prepended to the result. |
| * A constant identifier is constructed by |
| converting each word in the list to upper case; the words are then |
| concatenated, separated by underscores. |
| |
| This algorithm will not change an XML name |
| that is already a legal and conventional Java class, method, or constant |
| identifier, except perhaps to add an initial verb in the case of a |
| property access method. |
| |
| To improve user experience with default |
| binding, the automated resolution of frequent naming collision is |
| specified in <<Standardized Name Collision Resolution>>. |
| |
| *_Example_* |
| |
| .XML Names and derived Java Class, Method, and Constant Names |
| [[jcmcn]] |
| [cols=",,,",options="header"] |
| |=== |
| | XML Name | Class Name | Method Name | Constant Name |
| | mixedCaseName | MixedCaseName | getMixedCaseName | MIXED_CASE_NAME |
| | Answer42 | Answer42 | getAnswer42 | ANSWER_42 |
| | name-with-dashes | NameWithDashes | getNameWithDashes | NAME_WITH_DASHES |
| | other_punct-chars | OtherPunctChars | getOtherPunctChars | OTHER_PUNCT_CHARS |
| |=== |
| |
| .XML Names and derived Java Class, Method, and Constant Names when <jaxb:globalBindings underscoreHandling=”asCharInWord”> |
| [[asCharInWord]] |
| [cols=",,,",options="header"] |
| |=== |
| | XML Name | Class Name | Method Name | Constant Name |
| | other_punct-chars | Other_punctChars | getOther_punctChars | OTHER_PUNCT_CHARS |
| | name_with_underscore | Name_with_underscore | name_with_underscore | NAME_WITH_UNDERSCORE |
| |=== |
| |
| ==== Collisions and conflicts |
| |
| It is possible that the name-mapping |
| algorithm will map two distinct XML names to the same word list.These |
| cases will result in a _collision_ if, and only if, the same Java |
| identifier is constructed from the word list and is used to name two |
| distinct generated classes or two distinct methods or constants in the |
| same generated class. It is also possible if two or more namespaces are |
| customized to map to the same Java package, XML names that are unique |
| due to belonging to distinct namespaces could mapped to the same Java |
| Class identifier. Collisions are not permitted by the schema compiler |
| and are reported as errors; they may be repaired by revising XML name |
| within the source schema or by specifying a customized binding that maps |
| one of the two XML names to an alternative Java identifier. |
| |
| A class name must not conflict with the |
| generated JAXB class, `ObjectFactory`, <<Java Package>>, |
| that occurs in each schema-derived Java package. Method |
| names are forbidden to conflict with Java keywords or literals, with |
| methods declared in `java.lang.Object`, or with methods declared in the |
| binding-framework classes. Such conflicts are reported as errors and may |
| be repaired by revising the appropriate schema or by specifying an |
| appropriate customized binding that resolves the name collision. |
| |
| ===== Standardized Name Collision Resolution |
| |
| Given the frequency of an XML element or |
| attribute with the name “class” or “Class” resulting in a naming |
| collision with the inherited method `java.lang.Object.getClass()`, |
| method name mapping automatically resolves this conflict by mapping |
| these XML names to the java method identifier “getClazz”. |
| |
| [NOTE] |
| .Design Note |
| ==== |
| The likelihood of collisions, and the difficulty of working around them |
| when they occur, depends upon the source schema, the schema language |
| in which it is written, and the binding declarations. In general, however, |
| we expect that the combination of the identifier-construction rules given above, |
| together with good schema-design practices, will make collisions relatively uncommon. |
| |
| The capitalization conventions embodied in the identifier-construction |
| rules will tend to reduce collisions as long as names with shared mappings |
| are used in schema constructs that map to distinct sorts of Java constructs. |
| Anattribute named `foo` is unlikely to collide with an element type named `foo` |
| because the first maps to a set of property access methods (`getFoo`, `setFoo`, _etc._) |
| while the second maps to a class name (`Foo`). |
| |
| Good schema-design practices also make collisions less likely. When writing a schema |
| it is inadvisable to use, in identical roles, names that are distinguished only by |
| punctuation or case. Suppose a schema declares two attributes of a single element type, |
| one named `Foo` and the other named `foo`. Their generated access methods, |
| namely `getFoo` and `setFoo`, will collide. This situation would best be handled by |
| revising the source schema, which would not only eliminate the collision |
| but also improve the readability of the source schema and documents that use it. |
| |
| ==== |
| |
| === Deriving a legal Java identifier from an enum facet value |
| |
| Given that an enum facet’s value is not |
| restricted to an XML name, the XML Name to Java identifier algorithm is |
| not applicable to generating a Java identifier from an enum facet’s |
| value. The following algorithm maps an enum facet value to a valid Java |
| constant identifier name. |
| |
| * For each character in enum facet value, |
| copy the character to a string representation `javaId` when |
| `java.lang.Character.isJavaIdentifierPart()` is `true`. |
| ** To follow Java constant naming convention, |
| each valid lower case character must be copied as its upper case |
| equivalent. |
| * There is no derived Java constant identifier when any of the following occur: |
| ** `javaId.length() == 0` |
| ** `java.lang.Character.isJavaIdentifierStart(javaId.get(0)) == false` |
| |
| === Deriving an identifier for a model group |
| |
| XML Schema has the concept of a group of |
| element declarations. Occasionally, it is convenient to bind the |
| grouping as a Java content property or a Java value class. When a |
| semantically meaningful name for the group is not provided within the |
| source schema or via a binding declaration customization, it is |
| necessary to generate a Java identifier from the grouping. Below is an |
| algorithm to generate such an identifier. |
| |
| A name is computed for an unnamed model group |
| by concatenating together the first 3 element declarations and/or |
| wildcards that occur within the model group. Each XML _{name}_ is mapped |
| to a Java identifier for a method using the XML Name to Java Identifier |
| Mapping algorithm. Since wildcard does not have a _{name}_ property, it |
| is represented as the Java identifier `"Any"`. The Java identifiers |
| are concatenated together with the separator `"And"` for sequence and |
| all compositor and `"Or"` for choice compositors. For example, a |
| sequence of element `foo` and element `bar` would map to `"FooAndBar"` |
| and a choice of element `foo` and element `bar` maps to |
| `"FooOrBar"` Lastly, a sequence of wildcard and element `bar` would |
| map to the Java identifier `"AnyAndBar"`. |
| |
| *_Example:_* + |
| Given XML Schema fragment: |
| |
| [source,xml,indent="2"] |
| ---- |
| <xs:choice> |
| <xs:sequence> |
| <xs:element ref="A"/> |
| <xs:any processContents="strict"/> |
| </xs:sequence> |
| <xs:element ref="C"/> |
| </xs:choice> |
| ---- |
| |
| The generated Java identifier would be `AAndAnyOrC`. |
| |
| === Generating a Java package name |
| |
| This section describes how to generate a |
| package name to hold the derived Java representation. The motivation for |
| specifying a default means to generate a Java package name is to |
| increase the chances that a schema can be processed by a schema compiler |
| without requiring the user to specify customizations. |
| |
| If a schema has a target namespace, the next |
| subsection describes how to map the URI into a Java package name. If the |
| schema has no target namespace, there is a section that describes an |
| algorithm to generate a Java package name from the schema filename. |
| |
| ==== Mapping from a Namespace URI |
| |
| An XML namespace is represented by a URI. |
| Since XML Namespace will be mapped to a Java package, it is necessary to |
| specify a default mapping from a URI to a Java package name. The URI |
| format is described in [RFC2396]. |
| |
| The following steps describe how to map a URI |
| to a Java package name. The example URI, |
| `http://www.acme.com/go/espeak.xsd`, is used to illustrate each step. |
| |
| . Remove the scheme and `":"` part from the |
| beginning of the URI, if present. + |
| Since there is no formal syntax to identify the optional URI scheme, |
| restrict the schemes to be removed to case insensitive checks for |
| schemes `"http"`, `"https"` and `"urn"`. |
| + |
| [source] |
| ---- |
| //www.acme.com/go/espeak.xsd |
| ---- |
| . Remove the trailing file type, one of `.??` or `.???` or `.html`. |
| + |
| [source] |
| ---- |
| //www.acme.com/go/espeak |
| ---- |
| . Parse the remaining string into a list of |
| strings using `'/'` and `':'` as separators. Treat consecutive |
| separators as a single separator. |
| + |
| [source] |
| ---- |
| {"www.acme.com", "go", "espeak"} |
| ---- |
| . For each string in the list produced by |
| previous step, unescape each escape sequence octet. |
| + |
| [source] |
| ---- |
| {"www.acme.com", "go", "espeak"} |
| ---- |
| . If the scheme is a `"urn"`, replace all |
| dashes, `"-"`, occurring in the first component with |
| `"."`.footnote:[Sample URN |
| "urn:hl7-org:v3" {"h17-org", "v3"} transforms to {"h17.org", "v3"}.] |
| . Apply algorithm described in Section 7.7 |
| “Unique Package Names” in [JLS] to derive a unique package name from the |
| potential internet domain name contained within the first component. The |
| internet domain name is reversed, component by component. Note that a |
| leading `"www."` is not considered part of an internet domain name and |
| must be dropped. |
| + |
| If the first component does not contain |
| either one of the top-level domain names, for example, com, gov, net, |
| org, edu, or one of the English two-letter codes identifying countries |
| as specified in ISO Standard 3166, 1981, this step must be skipped. |
| + |
| [source] |
| ---- |
| {"com", "acme", "go", "espeak"} |
| ---- |
| . For each string in the list, convert each string to be all lower case. |
| + |
| [source] |
| ---- |
| {"com", "acme", "go", "espeak"} |
| ---- |
| . For each string remaining, the following |
| conventions are adopted from [JLS] Section 7.7, “Unique Package Names.” |
| .. If the sting component contains a hyphen, |
| or any other special character not allowed in an identifier, convert it |
| into an underscore. |
| .. If any of the resulting package name |
| components are keywords then append underscore to them. |
| .. If any of the resulting package name |
| components start with a digit, or any other character that is not |
| allowed as an initial character of an identifier, have an underscore |
| prefixed to the component. |
| |
| + |
| [source] |
| ---- |
| {"com", "acme", "go", "espeak"} |
| ---- |
| |
| . Concatenate the resultant list of strings |
| using `"."` as a separating character to produce a package name. |
| + |
| [source] |
| ---- |
| Final package name: "com.acme.go.espeak". |
| ---- |
| |
| <<Collisions and conflicts>> specifies what to do when the above algorithm results in |
| an invalid Java package name. |
| |
| === Conforming Java Identifier Algorithm |
| |
| This section describes how to convert a legal |
| Java identifier which may not conform to Java naming conventions to a |
| Java identifier that conforms to the standard naming conventions. |
| <<Customized Name Mapping>> discusses when |
| this algorithm is applied to customization names. |
| |
| Since a legal Java identifier is also a XML |
| name, this algorithm is the same as <<The Name to Identifier Mapping Algorithm>> |
| with the following exception: |
| constant names must not be mapped to a Java constant that conforms to |
| the Java naming convention for a constant. |