| \htmlhr |
| \chapterAndLabel{Regex Checker for regular expression syntax}{regex-checker} |
| |
| The Regex Checker prevents, at compile-time, use of syntactically invalid |
| regular expressions and access of invalid capturing groups. |
| |
| A regular expression, or regex, is a pattern for matching certain strings |
| of text. In Java, a programmer writes a regular expression as a string. |
| The syntax of regular expressions is complex, so it is easy to make a |
| mistake. It is also easy to accidentally use a regex feature from another |
| language that is not supported by Java (see section ``Comparison to Perl |
| 5'' in the \sunjavadoc{java.base/java/util/regex/Pattern.html}{Pattern} Javadoc). |
| These problems cause run-time errors. |
| |
| Regular expressions in Java also have capturing groups, which |
| are delimited by parentheses and allow for extraction from text. |
| If a programmer uses an incorrect index (larger than the number of |
| capturing groups), an \<IndexOutOfBoundsException> is thrown. |
| |
| The Regex Checker warns about these problems at compile time, guaranteeing |
| that your program does not crash due to incorrect use of regular expressions. |
| |
| For further details, including case studies, see a paper about the Regex |
| Checker~\cite{SpishakDE2012}. |
| |
| To run the Regex Checker, supply the |
| \code{-processor org.checkerframework.checker.regex.RegexChecker} |
| command-line option to javac. |
| |
| |
| \sectionAndLabel{Regex annotations}{regex-annotations} |
| |
| These qualifiers make up the Regex type system: |
| |
| \begin{description} |
| |
| \item[\refqualclass{checker/regex/qual}{Regex}] |
| indicates that the run-time value is a valid regular expression |
| \code{String}. If the optional parameter is supplied to the qualifier, |
| then the number of capturing groups in the regular expression is at least |
| that many. If not provided, the parameter defaults to 0. |
| For example, if an expression's type is \<@Regex(1) String>, then its |
| run-time value could be \<"colo(u?)r"> or \<"(brown|beige)"> but not |
| \<"colou?r"> nor a non-regex string such as \<"1) first point">. |
| |
| \item[\refqualclass{checker/regex/qual}{PolyRegex}] |
| indicates qualifier polymorphism. |
| For a description of qualifier polymorphism, see |
| Section~\ref{method-qualifier-polymorphism}. |
| |
| \end{description} |
| |
| The subtyping hierarchy of the Regex Checker's qualifiers is shown in |
| Figure~\ref{fig-regex-hierarchy}. |
| |
| \begin{figure} |
| \includeimage{regex}{9cm} |
| \caption{The subtyping relationship of the Regex Checker's qualifiers. |
| The type qualifiers are applicable to \<CharSequence> and its subtypes. |
| Because the parameter to a \<@Regex> qualifier is at least the number of |
| capturing groups in a regular expression, a \<@Regex> qualifier with more |
| capturing groups is a subtype of a \<@Regex> qualifier with fewer capturing |
| groups. Qualifiers in gray are used internally by the type |
| system but should never be written by a programmer.} |
| \label{fig-regex-hierarchy} |
| \end{figure} |
| |
| \sectionAndLabel{Annotating your code with \code{@Regex}}{annotating-with-regex} |
| |
| |
| \subsectionAndLabel{Implicit qualifiers}{regex-implicit-qualifiers} |
| |
| The Regex Checker adds |
| implicit qualifiers, reducing the number of annotations that must appear |
| in your code (see Section~\ref{effective-qualifier}). |
| If a \code{String} literal is a valid regex, |
| the checker implicitly adds the \code{@Regex} qualifier with |
| the argument set to the correct number of capturing groups. |
| The Regex Checker allows |
| the \code{null} literal to be assigned to any type qualified with the |
| \code{Regex} qualifier. |
| |
| |
| \subsectionAndLabel{Capturing groups}{regex-capturing-groups} |
| |
| The Regex Checker validates that a legal capturing group number is passed |
| to \sunjavadoc{java.base/java/util/regex/Matcher.html}{Matcher}'s |
| \sunjavadoc{java.base/java/util/regex/Matcher.html\#group(int)}{group}, |
| \sunjavadoc{java.base/java/util/regex/Matcher.html\#start(int)}{start} and |
| \sunjavadoc{java.base/java/util/regex/Matcher.html\#end(int)}{end} methods. To do this, |
| the type of \<Matcher> must be qualified with a \<@Regex> annotation |
| with the number of capturing groups in the regular expression. This is |
| handled implicitly by the Regex Checker for local variables (see |
| Section~\ref{type-refinement}), but you may need to add \<@Regex> annotations |
| with a capturing group count to \<Pattern> and \<Matcher> fields and |
| parameters. |
| |
| |
| \subsectionAndLabel{Concatenation of partial regular expressions}{regex-partial-regex} |
| |
| \begin{figure} |
| \begin{Verbatim} |
| public @Regex String parenthesize(@Regex String regex) { |
| return "(" + regex + ")"; // Even though the parentheses are not @Regex Strings, |
| // the whole expression is a @Regex String |
| } |
| \end{Verbatim} |
| \caption{An example of the Regex Checker's support for concatenation |
| of non-regular-expression Strings to produce valid regular expression Strings.} |
| \label{fig-regex-partial} |
| \end{figure} |
| |
| In general, concatenating a non-regular-expression String with any other |
| string yields a non-regular-expression String. The Regex Checker can |
| sometimes determine that concatenation of non-regular-expression Strings |
| will produce valid regular expression Strings. For an example see |
| Figure~\ref{fig-regex-partial}. |
| |
| |
| \subsectionAndLabel{Testing whether a string is a regular expression}{regexutil-methods} |
| |
| Sometimes, the Regex Checker cannot infer whether a particular expression |
| is a regular expression --- and sometimes your code cannot either! In |
| these cases, you can use the \<isRegex> method to perform such a test, and |
| other helper methods to provide useful error messages. A |
| common use is for user-provided regular expressions (such as ones passed |
| on the command-line). |
| Figure~\ref{fig-regex-util-example} gives an |
| example of the intended use of the \code{RegexUtil} methods. |
| |
| \begin{description} |
| |
| \item[\refmethod{checker/regex/util}{RegexUtil}{isRegex}{-java.lang.String-}] |
| returns \<true> if its argument is a valid regular expression. |
| |
| \item[\refmethod{checker/regex/util}{RegexUtil}{regexError}{-java.lang.String-}] |
| returns a \<String> error message if its argument is not a valid regular |
| expression, or \<null> if its argument is a valid regular expression. |
| |
| \item[\refmethod{checker/regex/util}{RegexUtil}{regexException}{-java.lang.String-}] |
| returns the |
| \sunjavadoc{java.base/java/util/regex/PatternSyntaxException.html}{Pattern\-Syntax\-Exception} |
| that \sunjavadoc{java.base/java/util/regex/Pattern.html\#compile(java.lang.String)}{Pattern.compile(String)} |
| throws when compiling an invalid regular expression. It returns \<null> |
| if its argument is a valid regular expression. |
| |
| \end{description} |
| |
| An additional version of each of these methods is also provided that takes |
| an additional group count parameter. The |
| \refmethod{checker/regex/util}{RegexUtil}{isRegex}{-java.lang.String-int-} method |
| verifies that the argument has at least the given number of groups. The |
| \refmethod{checker/regex/util}{RegexUtil}{regexError}{-java.lang.String-int-} and |
| \refmethod{checker/regex/util}{RegexUtil}{regexException}{-java.lang.String-int-} |
| methods return a \<String> error message and \<Pattern\-Syntax\-Exception>, |
| respectively, detailing why the given String is not a syntactically valid |
| regular expression with at least the given number of capturing groups. |
| |
| \begin{sloppypar} |
| If you detect that a \<String> is not a valid regular expression but would like |
| to report the error higher up the call stack (potentially where you can |
| provide a more detailed error message) you can throw a |
| \refclass{checker/regex/util}{RegexUtil.CheckedPatternSyntaxException}. This exception is |
| functionally the same as a |
| \sunjavadoc{java.base/java/util/regex/PatternSyntaxException.html}{Pattern\-Syntax\-Exception} |
| except it is checked to guarantee that the error will be handled up the |
| call stack. For more details, see the Javadoc for |
| \refclass{checker/regex/util}{RegexUtil.CheckedPatternSyntaxException}. |
| \end{sloppypar} |
| |
| To use the \<RegexUtil> class, the \<checker-util.jar> file |
| must be on the classpath at run time. |
| |
| \begin{figure} |
| %BEGIN LATEX |
| \begin{smaller} |
| %END LATEX |
| \begin{Verbatim} |
| String regex = getRegexFromUser(); |
| if (! RegexUtil.isRegex(regex)) { |
| throw new RuntimeException("Error parsing regex " + regex, RegexUtil.regexException(regex)); |
| } |
| Pattern p = Pattern.compile(regex); |
| \end{Verbatim} |
| %BEGIN LATEX |
| \end{smaller} |
| %END LATEX |
| \caption{Example use of \code{RegexUtil} methods.} |
| \label{fig-regex-util-example} |
| \end{figure} |
| |
| \subsectionAndLabel{Suppressing warnings}{regex-suppressing-warnings} |
| |
| If you are positive that a particular string that is being used as a |
| regular expression is syntactically valid, but the Regex Checker cannot |
| conclude this and issues a warning about possible use of an invalid regular |
| expression, then you can use the |
| \refmethod{checker/regex/util}{RegexUtil}{asRegex}{-java.lang.String-} method to suppress the |
| warning. |
| |
| You can think of this method |
| as a cast: it returns its argument unchanged, but with the type |
| \code{@Regex String} if it is a valid regular expression. It throws an |
| error if its argument is not a valid regular expression, but you should |
| only use it when you are sure it will not throw an error. |
| |
| There is an additional \refmethod{checker/regex/util}{RegexUtil}{asRegex}{-java.lang.String-int-} |
| method that takes a capturing group parameter. This method works the same as |
| described above, but returns a \code{@Regex String} with the parameter on the |
| annotation set to the value of the capturing group parameter passed to the method. |
| |
| The use case shown in Figure~\ref{fig-regex-util-example} should support most cases |
| so the \<asRegex> method should be used rarely. |
| |
| |
| |
| |
| % LocalWords: Regex regex quals PolyRegex isRegex RegexUtil regexError asRegex |
| % LocalWords: regexException PatternSyntaxException Matcher java qual |
| % LocalWords: CheckedPatternSyntaxException colo colou CharSequence |
| %% LocalWords: regexutil |