\htmlhr
\chapterAndLabel{Regex Checker for regular expression syntax}{regex-checker}
The Regex Checker prevents, at compile time, use of syntactically invalid
regular expressions and access of invalid capturing groups.
A regular expression, or regex, is a pattern for matching certain strings
of text. In Java, a programmer writes a regular expression as a string.
The syntax of regular expressions is complex, so it is easy to make a
mistake. It is also easy to accidentally use a regex feature from another
language that is not supported by Java (see section ``Comparison to Perl
5'' in the \sunjavadoc{java.base/java/util/regex/Pattern.html}{Pattern} Javadoc).
These problems cause run-time errors.
Regular expressions in Java also have capturing groups, which
are delimited by parentheses and allow for extraction from text.
If a programmer uses an incorrect index (larger than the number of
capturing groups), an \<IndexOutOfBoundsException> is thrown.
The Regex Checker warns about these problems at compile time, guaranteeing
that your program does not crash due to incorrect use of regular expressions.
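The two run-time failures described above can be reproduced with plain
\<java.util.regex> code. The following is a minimal sketch; the class and
method names (\<RegexRuntimeErrors>, \<failsToCompile>, \<groupIndexFails>)
are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexRuntimeErrors {
  /** Returns true if compiling the string throws PatternSyntaxException. */
  public static boolean failsToCompile(String s) {
    try {
      Pattern.compile(s);
      return false;
    } catch (PatternSyntaxException e) {
      return true;
    }
  }

  /** Returns true if requesting the given group throws IndexOutOfBoundsException. */
  public static boolean groupIndexFails(String regex, String input, int group) {
    Matcher m = Pattern.compile(regex).matcher(input);
    if (!m.matches()) {
      throw new IllegalArgumentException("input does not match regex");
    }
    try {
      m.group(group);
      return false;
    } catch (IndexOutOfBoundsException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // Syntactically invalid regex: the group is never closed.
    System.out.println(failsToCompile("(unclosed"));
    // The regex has one capturing group, so index 1 is legal and index 2 is not.
    System.out.println(groupIndexFails("colo(u?)r", "colour", 1));
    System.out.println(groupIndexFails("colo(u?)r", "colour", 2));
  }
}
```

Without the Regex Checker, both mistakes surface only when the offending
line executes.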
For further details, including case studies, see a paper about the Regex
Checker~\cite{SpishakDE2012}.
To run the Regex Checker, supply the
\code{-processor org.checkerframework.checker.regex.RegexChecker}
command-line option to javac.
\sectionAndLabel{Regex annotations}{regex-annotations}
These qualifiers make up the Regex type system:
\begin{description}
\item[\refqualclass{checker/regex/qual}{Regex}]
indicates that the run-time value is a valid regular expression
\code{String}. If the optional parameter is supplied to the qualifier,
then the number of capturing groups in the regular expression is at least
that many. If not provided, the parameter defaults to 0.
For example, if an expression's type is \<@Regex(1) String>, then its
run-time value could be \<"colo(u?)r"> or \<"(brown|beige)"> but not
\<"colou?r"> nor a non-regex string such as \<"1) first point">.
\item[\refqualclass{checker/regex/qual}{PolyRegex}]
indicates qualifier polymorphism.
For a description of qualifier polymorphism, see
Section~\ref{method-qualifier-polymorphism}.
\end{description}
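To make the meaning of the \<@Regex> parameter concrete, the sketch below
computes the number of capturing groups of the example strings above using
only \<java.util.regex>. The class \<GroupCounts> and its method
\<groupCount> are hypothetical helpers, not part of the Checker Framework.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class GroupCounts {
  /**
   * Returns the number of capturing groups in s,
   * or -1 if s is not a syntactically valid regular expression.
   */
  public static int groupCount(String s) {
    try {
      return Pattern.compile(s).matcher("").groupCount();
    } catch (PatternSyntaxException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    System.out.println(groupCount("colo(u?)r"));      // 1
    System.out.println(groupCount("(brown|beige)"));  // 1
    System.out.println(groupCount("colou?r"));        // 0
    System.out.println(groupCount("1) first point")); // -1 (unmatched parenthesis)
  }
}
```

A string whose count is 1 or more could have type \<@Regex(1) String>;
a string whose count is 0 could only have type \<@Regex String>.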
The subtyping hierarchy of the Regex Checker's qualifiers is shown in
Figure~\ref{fig-regex-hierarchy}.
\begin{figure}
\includeimage{regex}{9cm}
\caption{The subtyping relationship of the Regex Checker's qualifiers.
The type qualifiers are applicable to \<CharSequence> and its subtypes.
Because the parameter to a \<@Regex> qualifier is a lower bound on the number of
capturing groups in a regular expression, a \<@Regex> qualifier with more
capturing groups is a subtype of a \<@Regex> qualifier with fewer capturing
groups. Qualifiers in gray are used internally by the type
system but should never be written by a programmer.}
\label{fig-regex-hierarchy}
\end{figure}
\sectionAndLabel{Annotating your code with \code{@Regex}}{annotating-with-regex}
\subsectionAndLabel{Implicit qualifiers}{regex-implicit-qualifiers}
The Regex Checker adds
implicit qualifiers, reducing the number of annotations that must appear
in your code (see Section~\ref{effective-qualifier}).
If a \code{String} literal is a valid regex,
the checker implicitly adds the \code{@Regex} qualifier with
the argument set to the correct number of capturing groups.
The Regex Checker allows
the \code{null} literal to be assigned to any type qualified with the
\code{@Regex} qualifier.
\subsectionAndLabel{Capturing groups}{regex-capturing-groups}
The Regex Checker validates that a legal capturing group number is passed
to \sunjavadoc{java.base/java/util/regex/Matcher.html}{Matcher}'s
\sunjavadoc{java.base/java/util/regex/Matcher.html\#group(int)}{group},
\sunjavadoc{java.base/java/util/regex/Matcher.html\#start(int)}{start} and
\sunjavadoc{java.base/java/util/regex/Matcher.html\#end(int)}{end} methods. To do this,
the type of \<Matcher> must be qualified with a \<@Regex> annotation
with the number of capturing groups in the regular expression. This is
handled implicitly by the Regex Checker for local variables (see
Section~\ref{type-refinement}), but you may need to add \<@Regex> annotations
with a capturing group count to \<Pattern> and \<Matcher> fields and
parameters.
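As an illustration, the following sketch accesses a capturing group through a
\<Pattern> stored in a field. The class name \<GroupAccess>, the field
\<KEY\_VALUE>, and the method \<valueOf> are hypothetical. The annotation
appears only in a comment so that the sketch compiles without the Checker
Framework on the classpath.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupAccess {
  // When using the Regex Checker, a field like this one might be declared as
  // "@Regex(2) Pattern" to record that the regex has at least two groups.
  static final Pattern KEY_VALUE = Pattern.compile("(\\w+)=(\\w+)");

  /** Returns the value part of a "key=value" entry, or null if there is no match. */
  public static String valueOf(String entry) {
    Matcher m = KEY_VALUE.matcher(entry);
    if (!m.matches()) {
      return null;
    }
    // Group 2 exists because the regex has two capturing groups.
    // Asking for group 3 would throw IndexOutOfBoundsException at run time.
    return m.group(2);
  }

  public static void main(String[] args) {
    System.out.println(valueOf("color=red")); // red
  }
}
```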
\subsectionAndLabel{Concatenation of partial regular expressions}{regex-partial-regex}
\begin{figure}
\begin{Verbatim}
public @Regex String parenthesize(@Regex String regex) {
  return "(" + regex + ")"; // Even though the parentheses are not @Regex Strings,
                            // the whole expression is a @Regex String
}
\end{Verbatim}
\caption{An example of the Regex Checker's support for concatenation
of non-regular-expression Strings to produce valid regular expression Strings.}
\label{fig-regex-partial}
\end{figure}
In general, concatenating a non-regular-expression String with any other
string yields a non-regular-expression String. The Regex Checker can
sometimes determine that concatenation of non-regular-expression Strings
will produce valid regular expression Strings. For an example, see
Figure~\ref{fig-regex-partial}.
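The run-time behavior behind this reasoning can be checked with plain
\<java.util.regex> code. In the sketch below, the class
\<PartialRegexDemo> and the helper \<isValid> are hypothetical, and the
\<@Regex> annotations are omitted so that the code compiles without the
Checker Framework.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PartialRegexDemo {
  /** Returns true if s is a syntactically valid regular expression. */
  public static boolean isValid(String s) {
    try {
      Pattern.compile(s);
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  /** Wraps a regex in a capturing group. */
  public static String parenthesize(String regex) {
    return "(" + regex + ")";
  }

  public static void main(String[] args) {
    System.out.println(isValid("("));                     // false
    System.out.println(isValid(")"));                     // false
    System.out.println(isValid(parenthesize("colou?r"))); // true
  }
}
```

Neither parenthesis is a regular expression by itself, but the concatenation
is one whenever the argument is.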
\subsectionAndLabel{Testing whether a string is a regular expression}{regexutil-methods}
Sometimes, the Regex Checker cannot infer whether a particular expression
is a regular expression --- and sometimes your code cannot either! In
these cases, you can use the \<isRegex> method to perform such a test, and
other helper methods to provide useful error messages. A
common use is for user-provided regular expressions (such as ones passed
on the command line).
Figure~\ref{fig-regex-util-example} gives an
example of the intended use of the \code{RegexUtil} methods.
\begin{description}
\item[\refmethod{checker/regex/util}{RegexUtil}{isRegex}{-java.lang.String-}]
returns \<true> if its argument is a valid regular expression.
\item[\refmethod{checker/regex/util}{RegexUtil}{regexError}{-java.lang.String-}]
returns a \<String> error message if its argument is not a valid regular
expression, or \<null> if its argument is a valid regular expression.
\item[\refmethod{checker/regex/util}{RegexUtil}{regexException}{-java.lang.String-}]
returns the
\sunjavadoc{java.base/java/util/regex/PatternSyntaxException.html}{Pattern\-Syntax\-Exception}
that \sunjavadoc{java.base/java/util/regex/Pattern.html\#compile(java.lang.String)}{Pattern.compile(String)}
throws when compiling an invalid regular expression. It returns \<null>
if its argument is a valid regular expression.
\end{description}
Each of these methods also has an overloaded version that takes
a group count parameter. The
\refmethod{checker/regex/util}{RegexUtil}{isRegex}{-java.lang.String-int-} method
verifies that the argument has at least the given number of groups. The
\refmethod{checker/regex/util}{RegexUtil}{regexError}{-java.lang.String-int-} and
\refmethod{checker/regex/util}{RegexUtil}{regexException}{-java.lang.String-int-}
methods return a \<String> error message and \<Pattern\-Syntax\-Exception>,
respectively, detailing why the given String is not a syntactically valid
regular expression with at least the given number of capturing groups.
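The behavior of the group-count versions can be approximated with plain
\<java.util.regex> code. The sketch below is an illustration under that
assumption, not the actual \<RegexUtil> implementation. The names
\<RegexCheckSketch>, \<isRegexSketch>, and \<regexErrorSketch> are
hypothetical, and the wording of the error message is invented.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexCheckSketch {
  /** Returns true if s is a valid regex with at least the given number of groups. */
  public static boolean isRegexSketch(String s, int groups) {
    Pattern p;
    try {
      p = Pattern.compile(s);
    } catch (PatternSyntaxException e) {
      return false;
    }
    return p.matcher("").groupCount() >= groups;
  }

  /** Returns an error message, or null if s is valid and has enough groups. */
  public static String regexErrorSketch(String s, int groups) {
    try {
      int actual = Pattern.compile(s).matcher("").groupCount();
      if (actual < groups) {
        // Hypothetical message text, for illustration only.
        return "regex has " + actual + " groups, but " + groups + " are needed";
      }
      return null;
    } catch (PatternSyntaxException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(isRegexSketch("colo(u?)r", 1)); // true
    System.out.println(isRegexSketch("colo(u?)r", 2)); // false
    System.out.println(regexErrorSketch("colo(u?)r", 2));
  }
}
```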
\begin{sloppypar}
If you detect that a \<String> is not a valid regular expression but would like
to report the error higher up the call stack (potentially where you can
provide a more detailed error message) you can throw a
\refclass{checker/regex/util}{RegexUtil.CheckedPatternSyntaxException}. This exception is
functionally the same as a
\sunjavadoc{java.base/java/util/regex/PatternSyntaxException.html}{Pattern\-Syntax\-Exception}
except that it is a checked exception, which guarantees that the error is handled up the
call stack. For more details, see the Javadoc for
\refclass{checker/regex/util}{RegexUtil.CheckedPatternSyntaxException}.
\end{sloppypar}
To use the \<RegexUtil> class, the \<checker-util.jar> file
must be on the classpath at run time.
\begin{figure}
%BEGIN LATEX
\begin{smaller}
%END LATEX
\begin{Verbatim}
String regex = getRegexFromUser();
if (! RegexUtil.isRegex(regex)) {
  throw new RuntimeException("Error parsing regex " + regex, RegexUtil.regexException(regex));
}
Pattern p = Pattern.compile(regex);
\end{Verbatim}
%BEGIN LATEX
\end{smaller}
%END LATEX
\caption{Example use of \code{RegexUtil} methods.}
\label{fig-regex-util-example}
\end{figure}
\subsectionAndLabel{Suppressing warnings}{regex-suppressing-warnings}
If you are positive that a particular string that is being used as a
regular expression is syntactically valid, but the Regex Checker cannot
conclude this and issues a warning about possible use of an invalid regular
expression, then you can use the
\refmethod{checker/regex/util}{RegexUtil}{asRegex}{-java.lang.String-} method to suppress the
warning.
You can think of this method
as a cast: it returns its argument unchanged, but with the type
\code{@Regex String} if it is a valid regular expression. It throws an
error if its argument is not a valid regular expression, but you should
only use it when you are sure it will not throw an error.
There is an additional \refmethod{checker/regex/util}{RegexUtil}{asRegex}{-java.lang.String-int-}
method that takes a capturing group parameter. This method works the same as
described above, but returns a \code{@Regex String} with the parameter on the
annotation set to the value of the capturing group parameter passed to the method.
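The cast-like behavior can be pictured with the following sketch, which uses
only \<java.util.regex>. It is an approximation for illustration, not the
\<RegexUtil> source; the names \<AsRegexSketch> and \<asRegexSketch> are
hypothetical, and the choice of \<Error> as the thrown type is an assumption.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class AsRegexSketch {
  /**
   * Returns its argument unchanged if it is a valid regular expression.
   * Otherwise throws an Error, much as a failed cast throws an exception.
   */
  public static String asRegexSketch(String s) {
    try {
      Pattern.compile(s);
      return s;
    } catch (PatternSyntaxException e) {
      throw new Error(e);
    }
  }

  public static void main(String[] args) {
    String regex = asRegexSketch("colo(u?)r");
    System.out.println(Pattern.compile(regex).matcher("color").matches()); // true
  }
}
```

Because an invalid argument terminates the program with an error rather than
producing a recoverable result, this style suits only strings that are known
to be valid.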
The approach shown in Figure~\ref{fig-regex-util-example} handles most cases,
so the \<asRegex> method should rarely be needed.
% LocalWords: Regex regex quals PolyRegex isRegex RegexUtil regexError asRegex
% LocalWords: regexException PatternSyntaxException Matcher java qual
% LocalWords: CheckedPatternSyntaxException colo colou CharSequence
%% LocalWords: regexutil