blob: 581f47c9a42ebd9b4abb2eb786f60bbfa233a858 [file] [log] [blame]
This document describes a \emph{Dataflow Framework} for the Java
Programming Language. The framework is used in the
\href{}{Checker Framework},
\href{}{Error Prone},
at Facebook, and in other contexts.
The primary purpose of the Dataflow Framework is to estimate values:
for each line of source code, it determines what
values each variable might contain.
The Dataflow Framework's result (\autoref{sec:analysis_result_class})
is an abstract value for each expression (an estimate of the
expression's run-time value) and a store at each program point. A
store maps variables and other expressions to abstract
values. As a pre-pass, the Dataflow Framework transforms an input
AST into a control flow graph (\autoref{sec:cfg}) consisting of basic
blocks made up of nodes representing single operations. An analysis
performs iterative data flow
analysis over the control flow graph. The effect of a single node on
the dataflow store is represented by a transfer function, which takes
an input store and a node and produces an output store. Once the
analysis reaches a fix point, the result can be accessed by client
In the Checker Framework, the abstract values to be computed are
annotated types. An individual checker can customize its analysis by
extending the abstract value class and by overriding the behavior of
the transfer function for particular node types.
The Dataflow Framework was
designed with several goals in mind. First, to encourage use
beyond the Checker Framework, it is written as a separate package that can be
built and used with no dependence on the Checker Framework. Second,
the framework supports analysis but not
transformation, so it provides information that can be used by a type
checker or an IDE, but it does not support optimization. Third, the
framework aims to minimize the burden on developers who build on top
of it. In particular, the hierarchy of analysis classes is designed
to reduce the effort required to implement a new flow-sensitive type
checker in the Checker Framework. The
\href{}{Dataflow User's Guide}
gives an introduction to customizing dataflow to add checker-specific
Paragraphs colored in gray with a gray bar on the left side (just
like this one) contain questions, additional comments or indicate
missing parts. Eventually, these paragraphs will be removed.
\paragraph{Potential version conflicts if you export the Dataflow Framework}
Suppose that two tools both utilize the Dataflow Framework. If neither one
exports the Dataflow Framework's classes or the Checker Framework
annotations, both tools can be run at the same time. If both export the
classes, then it may be impossible for users to run both tools, because
they may require a different version of the Dataflow Framework. The
Checker Framework necessarily exports the Dataflow Framework. If your tool
cannot shadow the Checker Framework and Dataflow Framework classes, and you
cannot conform to the Checker Framework release schedule, then instead of
using the \code{dataflow} artifact at
you can use the \code{dataflow-shaded} artifact at
It contains the Dataflow Framework in package
\code{org.checkerframework.shaded.dataflow} and also its dependencies in
their usual packages, namely the Checker Framework
qualifiers and the plume-util project
(There may still be a problem if two different non-Checker-Framework tools
both export different, incompatible versions of the \code{dataflow-shaded}
% TODO: Update
The source code of the combined Checker Framework and Dataflow
Framework is divided into multiple projects: \code{javacutil},
\code{dataflow}, \code{framework}, and \code{checker},
which can be built into distinct jar files. \code{checker.jar} is a fat
jar that contains all of these, plus the Stub Parser.
\code{javacutil} provides convenient interfaces to routines in
Oracle's javac library. There are utility classes for interacting
with annotations, elements, trees and types, as well as
\code{InternalUtils}, which gives direct access to internal features
of javac that are not part of a supported interface. There are
interfaces or abstract classes for reporting errors, for processing
types in an AST, and for providing the annotations present on an
The \code{org.checkerframework.javacutil.trees} package provides a
class to parse expressions into javac Trees (\code{TreeParser}), a
class to build new Trees from scratch (\code{TreeBuilder}), and a
class to represent newly introduced variables that are not part of an
input program (\code{DetachedVarSymbol}).
\code{dataflow} contains the classes to represent and construct
control flow graphs and the base classes required for flow analysis.
These classes are described in detail in \autoref{sec:node_classes}.
\code{framework} contains the framework aspects of the Checker
Framework, including the derived classes for flow analysis of
annotated types which are described later in this document.
\code{checker} contains the type system-specific checkers.
The \code{dataflow} project depends only on \code{javacutil}.
This section gives an overview of the major Java classes and
interfaces in the implementation of the Dataflow Framework and the
flow-sensitive type checking feature of the Checker Framework. It
includes both the base classes in the \code{dataflow} project and the
derived classes in the \code{framework} project. The class and
interface declarations are given with full package names to indicate
which project they belong to.
Dataflow doesn't actually work on trees; it works on Nodes.
A Node class represents an individual operation of a program,
including arithmetic operations, logical operations, method calls,
variable references, array accesses, etc.
simplify writing a dataflow analysis by separating the dataflow
analysis from the original source code.
\autoref{tab:nodes} on page~\pageref{tab:nodes} lists
the Node types.
package org.checkerframework.dataflow.cfg.node;
abstract class Node
class *Node extends Node
The Block
classes represent basic blocks.
package org.checkerframework.dataflow.cfg.block;
interface Block
abstract class BlockImpl implements Block
interface SingleSuccessorBlock extends Block
abstract class SingleSuccessorBlockImpl extends BlockImpl implements SingleSuccessorBlock
A RegularBlock contains no exception-raising operations and has a
single control-flow successor.
package org.checkerframework.dataflow.cfg.block;
interface RegularBlock extends SingleSuccessorBlock
class RegularBlockImpl extends SingleSuccessorBlockImpl implements RegularBlock
An ExceptionBlock contains a single operation that may raise an
exception, with one or more exceptional successors and a single normal
control-flow successor.
package org.checkerframework.dataflow.cfg.block;
interface ExceptionBlock extends SingleSuccessorBlock
class ExceptionBlockImpl extends SingleSuccessorBlockImpl implements ExceptionBlock
A SpecialBlock represents method entry or exit, including exceptional
exit which is represented separately from normal exit.
package org.checkerframework.dataflow.cfg.block;
interface SpecialBlock extends SingleSuccessorBlock
class SpecialBlockImpl extends SingleSuccessorBlockImpl implements SpecialBlock
A ConditionalBlock contains no operations at all. It represents a
control-flow split to either a `then' or an `else' successor based on
the immediately preceding boolean-valued Node.
package org.checkerframework.dataflow.cfg.block;
interface ConditionalBlock extends Block
class ConditionalBlockImpl extends BlockImpl implements ConditionalBlock
A ControlFlowGraph represents the body of a method or an initializer
expression as a graph of Blocks with distinguished entry, exit, and
exceptional exit SpecialBlocks. ControlFlowGraphs are produced by the
CFGBuilder classes and are treated as immutable once they are built.
package org.checkerframework.dataflow.cfg;
class ControlFlowGraph
The CFGBuilder classes visit an AST and produce a corresponding
ControlFlowGraph as described in \autoref{sec:ast_to_cfg_translation}.
package org.checkerframework.dataflow.cfg;
class CFGBuilder
The Checker Framework derives from CFGBuilder in order to desugar
enhanced for loops that make explicit use of type annotations provided
by the checker in use.
package org.checkerframework.framework.flow;
class CFCFGBuilder extends CFGBuilder
The CFGVisualizeLauncher generates a DOT or String representation of
the control flow graph for a given method in a given class.
package org.checkerframework.dataflow.cfg;
class CFGVisualizeLauncher
The Dataflow Framework records the abstract values of certain
expressions, called JavaExpressions: local variables, field accesses,
array accesses, references to \code{this}, and pure method calls.
JavaExpressions are keys in the store of abstract values.
package org.checkerframework.dataflow.analysis;
class JavaExpressions
Java expressions that appear in method pre- and postconditions are
parsed into JavaExpressions using helper routines in
AbstractValue is the internal representation of dataflow information
produced by an analysis. An AbstractValue is an estimate about the
run-time values that an expression may evaluate to. The client of the
Dataflow Framework defines the abstract value, so the information may
vary widely among different users of the Dataflow Framework, but they
share a common feature that one can compute the least upper bound of
two AbstractValues.
package org.checkerframework.dataflow.analysis;
interface AbstractValue<V extends AbstractValue<V>>
For the Checker Framework, abstract values are essentially
package org.checkerframework.framework.flow;
abstract class CFAbstractValue<V extends CFAbstractValue<V>> implements AbstractValue<V>
class CFValue extends CFAbstractValue<CFValue>
For the Nullness Checker, abstract values additionally track the
meaning of PolyNull, which may be either Nullable or NonNull. The
meaning of PolyNull can change when a PolyNull value is compared to
the null literal, which is specific to the NullnessChecker. Other
checkers often also support a Poly* qualifier, but only the
NullnessChecker tracks the meaning of its poly qualifier using the
dataflow analysis.
package org.checkerframework.checker.nullness;
class NullnessValue extends CFAbstractValue<NullnessValue>
A Store is a set of dataflow facts computed by an analysis, so it is a
mapping from JavaExpressions to AbstractValues. As with
AbstractValues, one can take the least upper bound of two Stores.
package org.checkerframework.dataflow.analysis;
interface Store<S extends Store<S>>
The Checker Framework store restricts the type of abstract values it
may contain.
package org.checkerframework.framework.flow;
abstract class CFAbstractStore<V extends CFAbstractValue<V>,
S extends CFAbstractStore<V, S>>
implements Store<S>
class CFStore extends CFAbstractStore<CFValue, CFStore>
An InitializationStore tracks which fields of the `self' reference
have been initialized.
package org.checkerframework.checker.initialization;
class InitializationStore<V extends CFAbstractValue<V>,
S extends InitializationStore<V, S>>
extends CFAbstractStore<V, S>
A NullnessStore additionally tracks the meaning of PolyNull.
package org.checkerframework.checker.nullness;
class NullnessStore extends InitializationStore<NullnessValue, NullnessStore>
\subsubsection{Transfer functions}
A transfer function (\autoref{sec:transfer_function_classes}) is
explicitly represented as a node visitor that takes a TransferInput
(\autoref{sec:transfer_input_classes}) and produces a TransferResult
The TransferInput represents the set of dataflow facts known to be
true immediately before the node to be analyzed. A TransferInput may
contain a single store, or a pair of `then' and `else' stores when
following a boolean-valued expression.
package org.checkerframework.dataflow.analysis;
class TransferInput<V extends AbstractValue<V>,
S extends Store<S>>
A TransferResult is the output of a transfer function. In other
words, it is the set of dataflow facts known to be true immediately
after a node. A Boolean-valued expression produces a
ConditionalTransferResult that contains both a `then' and an `else'
store, while most other Nodes produce a RegularTransferResult with a
single store.
package class org.checkerframework.dataflow.analysis;
abstract TransferResult<V extends AbstractValue<V>,
S extends Store<S>>
class ConditionalTransferResult<V extends AbstractValue<V>,
S extends Store<S>>
extends TransferResult<A, S>
class RegularTransferResult<V extends AbstractValue<V>,
S extends Store<S>>
extends TransferResult<A, S>
A TransferFunction is a NodeVisitor that takes an input and produces an output.
package org.checkerframework.dataflow.analysis;
interface TransferFunction<V extends AbstractValue<V>,
S extends Store<S>>
extends NodeVisitor<TransferResult<V, S>, TransferInput<V, S>>
interface ForwardTransferFunction<V extends AbstractValue<V>,
S extends Store<S>>
extends TransferFunction<V, S>
interface BackwardTransferFunction<V extends AbstractValue<V>,
S extends Store<S>>
extends TransferFunction<V, S>
The Checker Framework defines a derived class of TransferFunction to
serve as the default for most checkers. The class constrains the type
of abstract values and it overrides many node visitor methods to
refine the abstract values in their TransferResults.
package org.checkerframework.framework.flow;
abstract class CFAbstractTransfer<V extends CFAbstractValue<V>,
S extends CFAbstractStore<V, S>,
T extends CFAbstractTransfer<V, S, T>>
extends AbstractNodeVisitor<TransferResult<V, S>, TransferInput<V, S>>
implements ForwardTransferFunction<V, S>
class CFTransfer extends CFAbstractTransfer<CFValue, CFStore, CFTransfer>
The Initialization Checker's transfer function tracks which fields of
the `self' reference have been initialized.
package org.checkerframework.checker.initialization;
class InitializationTransfer<V extends CFAbstractValue<V>,
T extends InitializationTransfer<V, T, S>,
S extends InitializationStore<V, S>>
extends CFAbstractTransfer<V, S, T>
The Regex Checker's transfer function overrides visitMethodInvocation
to special-case the \code{isRegex} and \code{asRegex} methods.
package org.checkerframework.checker.regex;
class RegexTransfer extends CFAbstractTransfer<CFValue, CFStore, RegexTransfer>
An Analysis performs iterative dataflow analysis over a control flow
graph using a given transfer function. Both forward and backward
analyses are supported.
package org.checkerframework.dataflow.analysis;
interface Analysis<V extends AbstractValue<V>,
S extends Store<S>,
T extends TransferFunction<V, S>>
abstract class AbstractAnalysis<V extends AbstractValue<V>,
S extends Store<S>,
T extends TransferFunction<V, S>>
implements Analysis<V, S, T>
interface ForwardAnalysis<V extends AbstractValue<V>,
S extends Store<S>,
T extends ForwardTransferFunction<V, S>>
extends Analysis<V, S, T>
interface BackwardAnalysis<V extends AbstractValue<V>,
S extends Store<S>,
T extends BackwardTransferFunction<V, S>>
extends Analysis<V, S, T>
The Checker Framework defines a derived class of Analysis for use as
the default analysis of most checkers. This class adds information
about the type hierarchy being analyzed and acts as a factory for
abstract values, stores, and the transfer function.
package org.checkerframework.framework.flow;
abstract class CFAbstractAnalysis<V extends CFAbstractValue<V>,
S extends CFAbstractStore<V, S>,
T extends CFAbstractTransfer<V, S, T>>
extends ForwardAnalysisImpl<V, S, T>
class CFAnalysis extends CFAbstractAnalysis<CFValue, CFStore, CFTransfer>
The Nullness Checkers' analysis overrides the factory methods for
abstract values, stores, and the transfer function.
package org.checkerframework.checker.nullness;
class NullnessAnalysis extends CFAbstractAnalysis<NullnessValue,
NullnessStore, NullnessTransfer>
The RegexChecker's analysis overrides the factory methods for abstract
values, stores, and the transfer function.
package org.checkerframework.checker.regex;
class RegexAnalysis extends CFAbstractAnalysis<CFValue, CFStore, RegexTransfer>
An AnalysisResult preserves the dataflow information computed by an
Analysis for later use by clients. The information consists of an
AbstractValue for each node in the CFG and a Store that is valid at
the start of each Block. The AnalysisResult class can return
AbstractValues for either Nodes or Trees and it can re-run the
transfer function to compute Stores that are valid immediately before
or after any Tree.
package org.checkerframework.dataflow.analysis;
class AnalysisResult<V extends AbstractValue<V>,
S extends Store<S>>
AnnotatedTypeFactorys are not part of the Dataflow Framework, per se,
but they are parameterized by the Dataflow Framework classes that they
package org.checkerframework.framework.type;
class AnnotatedTypeFactory implements AnnotationProvider
In the Checker Framework, dataflow analysis is performed on demand,
one class at a time, the first time that a ClassTree is passed to
getAnnotatedType. This is implemented in the abstract class
GenericAnnotatedTypeFactory with concrete implementation in
package org.checkerframework.framework.type;
abstract class GenericAnnotatedTypeFactory<Checker extends BaseTypeChecker<?>,
Value extends CFAbstractValue<Value>,
Store extends CFAbstractStore<Value, Store>,
TransferFunction extends CFAbstractTransfer<Value, Store, TransferFunction>,
FlowAnalysis extends CFAbstractAnalysis<Value, Store, TransferFunction>>
extends AnnotatedTypeFactory
package org.checkerframework.common.basetype;
class BaseAnnotatedTypeFactory
extends GenericAnnotatedTypeFactory<CFValue, CFStore, CFTransfer, CFAnalysis>
\section{The Control-Flow Graph}
A control-flow graph (CFG) represents a single method or field
initialization. (The Dataflow Framework performs an intra-procedural
analysis. This analysis is modular and every method is considered in
This section also describes the translation from the abstract syntax tree
(AST) to the CFG\@.
We start with a simple example, then give a more formal
definition of the CFG and its properties, and finally describe the
translation from the AST to the CFG.
As is standard, a control-flow graph is a set of
basic blocks that are linked by control-flow edges. Possibly less
standard, every basic block consists of a sequence of so-called nodes,
each of which represents a minimal Java operation or expression.
\flow{CFGSimple}{.33}{1.1}{A simple Java code snippet to introduce the CFG.
In CFG visualizations, special basic blocks are shown as ovals;
conditional basic blocks are polygons with eight sides; and regular and exception
basic blocks are rectangles.}
Consider the method \code{test} of \autoref{fig:CFGSimple}. The if
conditional got translated to a \emph{conditional basic block}
(octagon) with two successors. There are also two special basic blocks
(ovals) to denote the entry and exit point of the method.
\subsection{Formal Definition of the Control-Flow Graph}
The control-flow graph models all paths that can possibly be taken by an
execution of the method.
\begin{definition}[Control-Flow Graph]
A \emph{control-flow graph} consists of a set of \emph{basic
blocks} and a set of directed edges between these basic blocks,
some of which are labeled.
\begin{definition}[Basic Block]
A \emph{basic block} is a sequence of \emph{nodes}, where the only
control flow between the nodes inside
the basic block is sequential. Furthermore, there is no
control flow occurring between those nodes and nodes of other basic
blocks, except between the last node of one block $b_1$ and the first node
of another block $b_2$, if $b_2$ is a successor of $b_1$. A basic
block may have multiple successors.
\begin{definition}[Types of Basic Blocks]
There are four \emph{types} of basic blocks in a control-flow graph:
\item \textbf{Regular basic block.} A \emph{regular basic
block} contains any non-empty sequence of nodes and has
exactly one successor. None of the nodes in the block can
throw an exception at runtime.
\item \textbf{Special basic blocks.} A \emph{special basic
block} contains the empty sequence of nodes (i.e., is empty)
and denotes either the entry or one of the exit blocks of a
method. There are three types of special basic blocks:
\item Entry block. This basic block is the (only) entry
point of the method and thus is the only basic block
without predecessors.
\item Exit block. This basic block denotes the (normal)
exit of a method, and it has no successors.
\item Exceptional exit block, which indicates exceptional
termination of the method. As an exit block, this block
has no successors.
Every method has exactly one entry block, zero or one exit blocks,
and zero or one exceptional exit blocks. There is always
either an exit block, an exceptional exit block, or both.
\item \textbf{Exception basic block.} An \emph{exception basic
block} contains exactly one node that \emph{might} throw an
exception at runtime (e.g., a method call). There are zero
or one non-exceptional successors (only a basic block containing a
\code{throw} statement does not have a non-exceptional
successor). There are one or more
exceptional successors (see \autoref{def:edges}). In all
cases there is at least one successor (regular or
\item \textbf{Conditional basic block.} A \emph{conditional
basic block} does not contain any nodes and is used as a
\emph{split point} after the execution of a node of boolean
type. It has exactly two successors (both non-exceptional):
the \emph{then} successor that is reached when the previous node
evaluates to true and the \emph{else} successor that is reached
when the previous node evaluates to false. There is always
exactly a single predecessor block for every conditional
basic block, which is either a regular basic block or an
exception basic block. In both cases, the last node in the
predecessor will be of boolean type and the boolean value
controls which successor of the conditional block is
The Java implementation of the four block types above is described in
\begin{definition}[Control-Flow Graph Edges]
The basic blocks of a control-flow graph are connected by directed
\emph{edges}. If $b_1$ and $b_2$ are connected by a directed edge
$(b_1,b_2)$, we call $b_1$ a predecessor of $b_2$, and we call $b_2$
a successor of $b_1$. In a control-flow graph, there are three
types of edges:
\item \textbf{Exceptional edges}. An \emph{exceptional edge}
connects an exception basic block with its exceptional
successors, and it is labeled by the most general exception that
might cause execution to take this edge during run time. Note
that the outgoing exceptional edges of a basic block do not need
to have mutually exclusive labels; the semantics is that the
control flow follows the most specific edge. For instance, if
one edge is labeled with type \code{A} and another is labeled
with type \code{B} where \code{B} is a subtype of \code{A}, then
the execution only takes the first edge if the exception is of a
subtype of \code{A}, but not a subtype of \code{B}.
There is not necessarily a most specific exception type in the program
text; in that case, does the translation add a most specific case that will
never be executed at run time?
In general, what is the relation of the ordering in source code to the
one here?
There is at most one successor for every exception type.
\item \textbf{Conditional edges}. A \emph{conditional edge} is a
non-exceptional edge that connects a conditional basic block
with one of its successors, and is labeled with either ``true''
or ``false''.
\item \textbf{Regular, non-conditional edge.} Any other edge is a
\emph{regular edge}, and does not carry a label. Only regular
basic blocks, the entry basic block, and exception basic blocks
have outgoing regular edges.
A \emph{node} is a minimal Java operation or expression. It is
minimal in the sense that it cannot be decomposed further into
subparts between which control flow occurs. Examples for such
nodes include integer literals, an addition node (which performs
the mathematical addition of two nodes) or a method call. Control
flow such as \code{if} and \code{break} are not represented as
nodes. The full list of nodes is given in \autoref{tab:nodes} and
several of them are described in more detail in
It is important to note that, even though nodes can contain
references to other nodes, it is only the ``top-level'' node which
is considered at that point in the basic block. In the example of
the addition node, this means that only the addition operation is
to be executed, and its operands would occur earlier in the
control-flow graph (as they are evaluated first, before performing
the addition).
In the visualization, a string representation of the node is used,
followed by the node type in square brackets. Note that the string
representation often also includes more than just the ``top-level''
node. For instance, the addition expression \code{a + b[0];} will
appear as ``a + b[0] [ NumericalAddition ]'' rather than ``a'' plus
some temporary variable. This is done for clarity, so that it is easy
to see what expressions are summed up and because we don't create
internal names for expression results.
\autoref{tab:nodes} lists all node types in the framework. We use the
Java class name of the implementation, but leave out the suffix
\code{Node}, which is present for every type.
All classes are in package \code{org.checkerframework.dataflow.cfg.node}.
\multicolumn{3}{c}{\autoref{tab:nodes}: All node types in the Dataflow Framework.} \\ \\
\textbf{Node type} & \textbf{Notes} & \textbf{Example} \\ \midrule \endfirsthead
\textbf{Node type} & \textbf{Notes} & \textbf{Example} \\ \midrule \endhead
\hline \multicolumn{3}{|c|}{{Continued on next page}} \\ \hline \endfoot
\code{Node} & The base class of all nodes. & \\
\code{ValueLiteral} & The base class of literal value nodes. & \\
\code{BooleanLiteral} & & \code{true} \\
\code{CharacterLiteral} & & \code{'c'} \\
\code{DoubleLiteral} & & \code{3.14159} \\
\code{FloatLiteral} & & \code{1.414f} \\
\code{IntegerLiteral} & & \code{42} \\
\code{LongLiteral} & & \code{1024L} \\
\code{NullLiteral} & & \code{null} \\
\code{ShortLiteral} & & \code{512} \\
\code{StringLiteral} & & \code{"memo"} \\
& Accessor expressions & \\
\code{ArrayAccess} & & \code{args[i]} \\
\code{FieldAccess} & & \code{f}, \code{obj.f} \\
\code{MethodAccess} & & \code{obj.hashCode} \\
\code{This} & Base class of references to \code{this} & \\
\code{ExplicitThis} & Explicit use of \code{this} in an expression & \\
\code{ImplicitThis} & Implicit use of \code{this} in an expression & \\
\code{Super} & Explicit use of \code{super} in expression. & \code{super(x, y)} \\
\code{LocalVariable} & Use of a local variable, either as l-value or r-value & \\
\code{MethodInvocation} & Note that access and invocation are distinct. & \code{hashCode()} \\
& Arithmetic and logical operations. & \\
\code{BitwiseAnd} & & a \& \code{b} \\
\code{BitwiseComplement} & & \verb|~b| \\
\code{BitwiseOr} & & \code{a | b} \\
\code{BitwiseXor} & & \code{a ^ b} \\
\code{ConditionalAnd} & Short-circuiting. & a \&\& \code{b} \\
\code{ConditionalNot} & & \code{!a} \\
\code{ConditionalOr} & Short-circuiting. & \code{a || b} \\
\code{FloatingDivision} & & \code{1.0 / 2.0} \\
\code{FloatingRemainder} & & \code{13.0 \% 4.0} \\
\code{LeftShift} & & \code{x << 3} \\
\code{IntegerDivision} & & \code{3 / 2} \\
\code{IntegerRemainder} & & \code{13 \% 4} \\
\code{NumericalAddition} & & \code{x + y} \\
\code{NumericalMinus} & & \code{-x} \\
\code{NumericalMultiplication} & & \code{x * y} \\
\code{NumericalPlus} & & \code{+x} \\
\code{NumericalSubtraction} & & \code{x - y} \\
\code{SignedRightShift} & & \code{x >> 3} \\
\code{StringConcatenate} & & \code{s + ".txt"} \\
\code{TernaryExpression} & & \code{c ? t : f} \\
\code{UnsignedRightShift} & & \code{x >>> 5} \\
& Relational operations & \\
\code{EqualTo} & & \code{x == y} \\
\code{NotEqual} & & \code{x != y} \\
\code{GreaterThan} & & \code{x > y} \\
\code{GreaterThanOrEqual} & & \code{x >= y} \\
\code{LessThan} & & \code{x < y} \\
\code{LessThanOrEqual} & & \code{x <= y} \\
\code{Case} & Case of a switch. Acts as an equality test. & \\
\code{Assignment} & & \code{x = 1} \\
% \midrule
\code{StringConcatenateAssignment} & A compound assignment. & \code{s += ".txt"} \\
\code{ArrayCreation} & & \code{new double[]} \\
\code{ObjectCreation} & & \code{new Object()} \\
\code{TypeCast} & & \code{(float) 42} \\
\code{InstanceOf} & & \code{x instanceof Float} \\
& Conversion nodes. & \\
\code{NarrowingConversion} & Implicit conversion. & \\
\code{StringConversion} & Might be implicit. & \code{obj.toString()} \\
\code{WideningConversion} & Implicit conversion. & \\
& \textbf{Non-value nodes} & \\
& Types appearing in expressions, such as \code{MyType.class} & \\
\code{ArrayType} & & \\
\code{ParameterizedType} & & \\
\code{PrimitiveType} & & \\
\code{ClassName} & Identifier referring to Java class or interface. & \code{java.util.HashMap} \\
\code{PackageName} & Identifier referring to Java package. & \code{java.util} \\
\code{Throw} & Throw an exception. & \\
\code{Return} & Return from a method. & \\
\code{AssertionError} & & \code{assert x != null : "Hey"} \\
\code{Marker} & No-op nodes used to annotate a CFG with
information of the underlying Java source code. Mostly useful
for debugging and visualization. An example is indicating the
start/end of switch statements. & \\ \midrule
\code{NullChk} & Null checks inserted by javac & \\
\code{VariableDeclaration} & Declaration of a local variable & \\
\caption{All node types in the Dataflow Framework.}
In theory, nearly any statement can throw an \code{Error} such as
\code{OutOfMemoryError} or \code{NoSuchFieldError}. The Dataflow Framework
does not represent all that possible flow. It only creates the exceptional
edges shown in \autoref{tab:nodesWithException}.
\textbf{Node type} & \textbf{Exception type} \\ \hline
\code{ArrayAccess} & \code{NullPointerException}, \code{ArrayIndexOutOfBoundsException} \\
\code{FieldAccess} & \code{NullPointerException} \\
\code{MethodAccess} & \code{NullPointerException} \\
\code{MethodInvocation} & \code{Throwable}, types in throws clause of the signature \\
\code{IntegerDivision} & \code{ArithmeticException} \\
\code{IntegerRemainder} & \code{ArithmeticException} \\
\code{ObjectCreation} & \code{Throwable}, types in throws clause of the signature \\
\code{ArrayCreation} & \code{NegativeArraySizeException}, \code{OutOfMemoryError} \\
\code{TypeCast} & \code{ClassCastException} \\
\code{Throw} & Type of \code{e} when \code{throw e} \\
\code{AssertionError} & \code{AssertionError} \\
\code{ClassName} & \code{ClassCircularityError}, \code{ClassFormatError}, \\
& \code{NoClassDefFoundError}, \code{OutOfMemoryError} \\
\caption{All node types that could throw Exception, and the types
to be thrown.
Java class name of nodes are simplified as with \autoref{tab:nodes}.
All exception types are in package \code{java.lang}.}
Can \code{StringConversion} be implicit? I think so, but in any event discuss.
For each non-expression, explain its purpose, just like the explanation for
Marker that still needs to be fleshed out.
``Could be desugared'' on ``StringConcatenateAssignment'' and ``Conversion
nodes'' was confusing. What is the design rationale for
desugaring in the Dataflow Framework? Discuss that. Here, at least
forward-reference to \autoref{sec:desugaring}, if that's relevant. \\
More generally, for any cases that will be discussed in the text, add a
forward reference to the section with the discussion.
There is a \code{StringConcatenateAssignmentNode}.
What about other compound assignments? Are they desugared?
When we added Java~8 support, did we add additional nodes that are not
listed here? Cross-check with implementation.
Why is it \code{AssertionErrorNode} instead of \code{AssertNode}?
\subsection{Noteworthy Translations and Node Types}
In this section we mention any non-straightforward translations from the AST to
the CFG, or special properties about individual nodes.
\subsubsection{Program Structure}
Java programs are structured using high-level programming constructs
such as loops, if-then-else constructs,
try-catch-finally blocks or switch statements. During the translation
from the AST to the CFG some of this program structure is lost and all
non-sequential control flow is represented by two low-level
constructs: conditional basic blocks and control-flow edges between
basic blocks. For instance, a \code{while} loop is translated into its
condition followed by a conditional basic block that models the two
possible outcomes of the condition: either the control flow follows
the `true' branch and continues with the loop's body, or control goes to the
`false' successor and executes the first statement after the loop.
As described in \jlsref{15.26.1}, the execution of an assignment is in
general not strictly left-to-right. Rather, the right-hand side might
be evaluated even if the left-hand side of the assignment causes an
exception. This semantics is faithfully represented in the CFG
produced by the translation. An example of a field assignment
exhibiting this behavior is shown in \autoref{fig:CFGFieldAssignment}.
\flow{CFGFieldAssignment}{.33}{1}{Control flow for a field assignment is not strictly
left-to-right (cf.\ \jlsref{15.26.1}),
which is properly handled by the translation.}
\subsubsection{Postfix/Prefix Increment/Decrement}
Postfix and prefix increment and decrement have a side effect to
update the variable or field. To represent this side effect, the Dataflow
Framework creates an artificial assignment node like \code{n = n + 1}
for \code{++n} or \code{n++}. This artificial assignment node is stored
in \code{unaryAssignNodeLookup} of \code{ControlFlowGraph}. The assignment
node is also stored in \code{treeLookup} for prefix increment or decrement
so that the result of it is after the assignment. However, the node before
the assignment is stored in \code{treeLookup} for postfix increment or decrement
because the result of it should be before the assignment. For further information
about node-tree mapping, see \autoref{sec:conversions}.
\subsubsection{Conditional stores}
The Dataflow Framework extracts information from control-flow splits
that occur in \code{if}, \code{for}, \code{while}, and \code{switch} statements. In order to have
the information available at the split, we eagerly produce two stores
contained in a \code{ConditionalTransferResult} after certain
boolean-valued expressions. The stores are called the \emph{then} and
\emph{else} stores. So, for example, after the expression \code{x ==
null}, two different stores will be created. The Nullness Checker
would produce a then store that maps \code{x} to @Nullable and an else
store that maps \code{x} to @NonNull.
The Dataflow Framework allows a maximum of two stores and when there
are two distinct stores, they always refer to the most recent
boolean-valued expression. Stores are propagated through most nodes
and they are reversed for conditional not expressions. The transfer
functions for many nodes merge conditional stores back together
because they cannot maintain the distinction between them. Merging
just means taking the least upper bound of the two stores and it
happens automatically by calling \code{TransferInput.getRegularStore}.
The control flow graph represents all non-exceptional control-flow
splits, or branches, as \code{ConditionalBlock}s that contain no
nodes. If there is one store flowing into a conditional block, then
it is duplicated to both successors. If there are two stores flowing
into a conditional block, the then store is propagated to the block's
then successor and the else store is propagated to the block's else
Consider the control flow graph generated for the simple if statement
in \autoref{fig:CFGIfStatement}. The conditional expression \code{b1}
immediately precedes the \code{ConditionalBlock}, represented by the
octagonal node. The \code{ConditionalBlock} is followed by both a
then and an else successor block, after which control flow merges back
together at the exit block. The edge labels \code{EACH_TO_EACH},
\code{THEN_TO_BOTH}, and \code{ELSE_TO_BOTH} are flow rules described
in \autoref{sec:flow-rules}. As described above, the then store
propagates to (both stores of) the block's then successor according to
rule \code{THEN_TO_BOTH} and the else store propagates to (both stores
of) the block's else successor according to rule \code{ELSE_TO_BOTH}.
More precise rules are used to preserve dataflow information for
short-circuiting expressions, as described in \autoref{sec:cond-exp}.
\flow{CFGIfStatement}{.33}{1.25}{Example of an if statement translated into a
\subsubsection{Conditional Expressions}
The conditional and (\code{&&}, cf. \jlsref{15.23}) and the
conditional or (\code{||}, cf. \jlsref{15.24}) expressions are subject
to short-circuiting: if evaluating the left-hand side already
determines the result, then the right-hand side is not evaluated. This
semantics is faithfully represented in the constructed CFG and more
precise flow rules (\autoref{sec:flow-rules}) are used to preserve
additional dataflow information.
An example program using conditional or is shown in
\autoref{fig:CFGConditionalOr}. Note that the CFG correctly
represents short-circuiting. The expression \code{b2 || b3} is only
executed if \code{b1} is false and \code{b3} is only evaluated if
\code{b1} and \code{b2} are false.
Observe in \autoref{fig:CFGConditionalOr} that the flow rule between
the first conditional block and its then successor is
\code{THEN_TO_THEN}, rather than the default flow rule for such edges
\code{THEN_TO_BOTH}, which is present on the edge from the last
conditional block to its then successor. \code{THEN_TO_THEN} is a
more precise rule which propagates the then store from the predecessor
of the conditional block to the then store of the then successor and
leaves the else store of the successor untouched. This is a valid
rule for propagating information along the short-circuit edge of a
conditional or expression because \code{b1 || (b2 || b3)} being false
implies that \code{b1} is false, so dataflow information that obtains
when \code{b1} is true has no effect on the dataflow information
obtains when \code{b1 || (b2 || b3)} is false. To put it another way,
if control reaches the block containing \code{b1 || (b2 || b3)} and
that expression is false, then control must have flowed along the else
branches of both conditional blocks and only the facts that obtain
along those edges need to be kept in the else store of the block
containing \code{b1 || (b2 || b3)}.
\flow{CFGConditionalOr}{.33}{1.33}{Example of a conditional or expression
(\code{||}) with short-circuiting and more precise flow rules.}
\subsubsection{Implicit \code{this} access}
The Java compiler AST uses the same type (\code{IdentifierTree}) for
local variables and implicit field accesses (where \code{this.} is
left out). To relieve the user of the Dataflow Framework from
manually determining the two cases, we consistently use
\code{FieldAccessNode} for field accesses, where the receiver might be
an \code{ImplicitThisNode}.
\subsubsection{Assert statements}
Assert statements are treated specially by the CFG builder because it
is unknown at CFG construction time whether or not assertions will be
enabled when the program is run. When assertions are enabled, the
dataflow information gained by analyzing the assert statement can
improve precision and allow the programmer to avoid redundant
annotations. However, when assertions are disabled, it would be
unsound to assume that they had any effect on dataflow information.
The user of the Dataflow Framework may specify that
assertions are enabled or disabled. When assertions are assumed to be
disabled, no CFG Nodes are built for the assert statement.
When assertions are assumed to be enabled, CFG Nodes are built to
represent the condition of the assert statement and, in the else
successor of a ConditionalBlock, CFG Nodes are built to represent the
detail expression of the assert, if any.
If assertions are not assumed to be enabled or disabled, then
the CFG is conservative and represents the fact that the
assert statement may execute or may not. This takes the form of a
ConditionalBlock that branches on a fake variable. For example, see
\autoref{fig:CFGAssert}. The fake variable named
\code{assertionsEnabled#num0} controls the first ConditionalBlock.
The then successor of the ConditionalBlock is the same subgraph of CFG
Nodes that would be created if assertions were assumed to be enabled,
while the else successor of the ConditionalBlock is the same, empty,
subgraph of CFG Nodes that would be created if assertions were assumed
to be disabled.
\flow{CFGAssert}{.15}{2.9}{Example of an assert statement translated with
assertions neither assumed to be enabled nor assumed to be
\subsubsection{Varargs method invocation}
In Java, varargs in method or constructor invocation is compiled
as new array creation (cf.\ \jlsref{}). For example,
\code{m(1, 2, 3)} will be compiled as \code{m(new int[]\{1, 2, 3\})}
when the signature of \code{m} is \code{m(int... args)}. Dataflow
Framework creates an \code{ArrayCreationNode} with initializer for varargs
in the same way as the Java compiler does.
Note that it doesn't create an \code{ArrayCreationNode}
when the varargs is an array with the same depth as the type of
the formal parameter, or if \code{null} is given as the actual varargs argument.
\subsubsection{Default case and fall through for switch statement}
A switch statement is handled as a chain of \code{CaseNode} and nodes in
the case. \code{CaseNode} makes a branch by comparing the equality of
the expression of the switch statement and the expression of the case.
Note that the expression of a switch statement must be executed just only
once at the beginning of the switch statement. To refer to its value, a fake variable
is created and it is assigned to a fake variable. \code{THEN_TO_BOTH}
edge goes to nodes in the case and \code{ELSE_TO_BOTH} edge goes to next
\code{CaseNode}. When a next is default case, it goes to nodes in the default
case. If a break statement is in nodes, it creates an edge to next node of
the switch statement. If there is any possibility of fall-through, an edge
to the first of nodes in the next case is created after nodes in the case.
For example, see \autoref{fig:CFGSwitch}. The fake variable named \code{switch#num0}
is created and each of case nodes creates the branches.
\flow{CFGSwitch}{.21}{1.45}{Example of a switch statement with case, default and fall through.}
\subsubsection{Handling \code{finally} blocks}
Control flow statements, like \code{return}, \code{break}, and \code{continue},
within \code{try} blocks will cause execution of the \code{finally} block before
continuing at the target of the jump. The Dataflow Framework models this
behavior by adding a jump to a duplicate of the finally block before the
jump to the original target of the control flow statement.
\subsection{AST to CFG Translation}
This section gives a high-level overview of the translation process
from the abstract syntax tree to the control-flow graph as described
in \autoref{sec:cfg-formal}.
First, we define several entities, which will be used in the translation.
\begin{definition}[Extended Node]
In the translation process the data type \emph{extended node} is used.
An extended node can be one of four possibilities:
\item \textbf{Simple extended node.} An extended node can just be a
wrapper for a node that does not throw an exception,
as defined in Definition~\ref{def:node}.
\item \textbf{Exception extended node.} Similar to a simple
node, an exception extended node contains a node, but this
node might throw an exception at runtime.
\item \textbf{Unconditional jump.} An unconditional jump
indicates that control flow proceeds non-sequentially to a
location indicated by a target label.
\item \textbf{Conditional jump.} A conditional jump can follow
an extended node that contains a node of boolean type. It
contains two target labels, one if the node evaluates to
true and one for false.
\textbf{Comparison of nodes and extended nodes.}
Nodes themselves never contain control flow information; they only
represent computation.
An extended node is a wrapper around a node that represents control flow
information. It contains: a node, a label, a predecessor, and a
An extended node is temporarily used to keep track of some control flow
information. Later, the basic block data structures are created, and they
represent the control flow. (And at that point the extended nodes are
A \emph{label} is a marker that is used to refer to extended
nodes. It is used only temporarily during CFG construction.
% I agree, we should keep this in mind in the future.
%I find it useful to never name phases ``one'', ``two'', ``three'', but
%always to give them English names. This makes the meaning clearer to
%readers --- often because it forces the writer to think harder about the
%meaning of each. It can still be useful to number the phases even after
%they have more mnemonic names.
The process of translating an AST to a CFG proceeds in three distinct phases.
\item \textbf{Phase one.} In the first phase, a single linear
sequence of extended nodes is created. The control flow is
implicitly assumed to be sequential through the sequence of
extended nodes, until a (conditional or unconditional) jump is
encountered in an if, for, while, or switch statement, in which
case the jump target decides where execution proceeds.
The labels used as targets of jumps are associated with positions
in this sequence and are managed by maintaining a binding function
from labels to sequence positions. The advantage of having this
indirection is that one can create a label and associate with the
next free position in the sequence, without knowing which exact
extended node will be placed there. Furthermore, labels can be
created and used before they are actually bound to their correct
position in the sequence (e.g., when that position is not yet
known). At the end, the binding function can be used to resolve
labels to extended nodes.
Furthermore, phase one also computes a mapping from AST tree
elements to nodes, as well as a set of leaders. A \emph{leader} is
an extended node for which one of the following conditions
\item It is the first extended node in the sequence.
\item It is the target of a jump (i.e. there is a label bound to
the location of the node in the sequence).
\item It is the first node following a jump.
\item \textbf{Phase two.} Phase two translates the linear
representation to a control-flow graph by performing the
following transformations:
\item Simple extended nodes are translated to regular basic
blocks, where multiple nodes can be grouped in one regular
basic block.
\item Exception extended nodes are translated to exception
basic blocks with the correct edges.
\item Unconditional jumps are replaced with edges between the
correct basic blocks.
\item Conditional jumps are replaced by a conditional basic
To greatly simplify the implementation, phase two is allowed to
produce a degenerated control-flow graph. In particular, the
following deficiencies are possible:
\item Regular basic blocks might be empty.
\item Some conditional basic blocks might be unnecessary, in that
they have the same target for both the `then' as well as the
`else' branch.
\item Two consecutive, non-empty, regular basic blocks can exist,
even if the second block has only exactly one predecessor and
the two blocks could thus be merged.
\item \textbf{Phase three.} In the third and last phase, the
control-flow graph is transformed such that the deficiencies
remaining from phase two are removed. It is ensured that
removing one kind of deficiency does not create another
degenerate case.
Desugaring means replacing complicated source language constructs by
simpler ones, or removing syntactic sugar from an input program.
Originally, we intended for the control flow graph representation to
be as close as possible to the Java abstract syntax tree to simplify
the mapping from tree to CFG node and back and to reuse existing
checker code written in terms of trees. However, we ran into several
cases that were better handled by desugaring.
\item We decided to represent implicit conversion operations like
boxing, unboxing, widening, and narrowing as explicit CFG nodes
because they change the type of a value. For example, implicit
unboxing of an \code{Integer} will be translated into a call to
\code{Integer.intValue}. The pre-conversion type can be
associated with the original node and the post-conversion type
can be associated with the explicit conversion node. It also
makes it possible for the transfer function to operate on the
conversion nodes.
\item Enhanced for loops are defined in terms of a complicated
translation into simpler operations, including field accesses,
branches, and method calls that could affect dataflow
information. It would be prohibitively difficult for a checker
writer to write a transfer function that correctly accounted for
all of those operations, so we desugar enhanced for loops.
\item Once we decided to make conversion nodes explicit it made
sense to desugar compound assignments. A compound assignment
like \begin{verbatim}Integer i; i += 3;\end{verbatim} performs
both an unboxing and a boxing operation on \code{i}. Desugaring
all compound assignments greatly reduced the total number of
node classes.
In order to desugar code and still maintain the invariant that every
CFG node maps to a tree, we needed to create new AST tree nodes that
were not present in the input program. Javac allows us to do this
through non-supported APIs and we wrote some utility classes in
\code{javacutil} to make the process easier. The new trees are
created during CFG building and they persist as long as some CFG node
refers to them. However, the trees are not inserted into the AST, so
they are not type-checked or seen by other tree visitors. Their main
purpose is to carry Java types and to satisfy AnnotatedTypeFactory
A further complication is that these newly-introduced AST trees are
not part of the TreePath when visiting the AST. We work around this
problem by giving the AnnotatedTypeFactory a mapping, called the
\code{pathHack}, from newly-introduced trees to their containing
MethodTree and ClassTree.
Possibly even worse, we needed to create fake symbols for variables
created when desugaring enhanced for loops. Javac does not expose the
ability to create a symbol, so we created a new subclass of
Symbol.VarSymbol called \code{javacutil.tree.DetachedVarSymbol} for
this purpose. AnnotatedTypeFactory explicitly checks for
DetachedVarSymbols in its DeclarationFromElement method.
\subsubsection{Conversions and node-tree mapping}
As mentioned in \autoref{sec:desugaring}, we represent implicit Java
type conversions such as boxing, unboxing, widening, and narrowing by
explicit CFG nodes. This means that some AST tree nodes correspond to
multiple CFG nodes: a pre-conversion node and a post-conversion node.
We will describe how the conversions work and how the node-tree
mappings are implemented.
Boxing and unboxing are represented in terms of calls to Java standard
library methods. Boxing corresponds to a call to
\code{BoxedClass.valueOf} while unboxing corresponds to a call to
\code{BoxedClass.\*Value}. This allows annotations on the library
methods, as well as transfer functions for method invocations, to
apply to the conversions with no special work on the part of a checker
Widening and narrowing conversions are still represented as special
node types, although it would be more consistent to change them into
type casts.
Is the last point a to-do item?
We maintain the invariant that a CFG node maps to zero or one AST tree
and almost all of them map to a single tree. But we can't maintain a
unique inverse mapping because some trees have both pre- and
post-conversion nodes. Instead, we remember two mappings, one from
tree to pre-conversion node and, for those trees that were converted,
one from tree to post-conversion node. Both the CFGBuilder and the
ControlFlowGraph store two separate mappings. The Analysis class
explicitly stores the tree to pre-conversion node mapping as
\code{treeLookup} and it indirectly uses the tree to post-conversion
mapping in \code{Analysis.getValue(Tree)}. This has effectively
hidden the distinction between pre and post-conversion nodes from the
Checker Framework, but in the long run it may be necessary to expose
\section{Dataflow Analysis}
This section describes how the dataflow analysis over the control-flow
graph is performed and how to implement a particular analysis.
Roughly, a dataflow analysis in the framework works as follows. Given
the abstract syntax tree of a method, the framework computes the
corresponding control-flow graph as described in
\autoref{sec:cfg}. Then, a simple forward or backward iterative
algorithm is used to compute a fix-point, by iteratively applying a
set of transfer functions to the nodes in the CFG\@. These transfer
functions are specific to the particular analysis and are used to
approximate the runtime behavior of different statements and expressions.
\subsection{Managing Intermediate Results of the Analysis}
Conceptually, the dataflow analysis computes an abstract value for
every node and flow expression\footnote{Certain dataflow analyses
might choose not to produce an abstract value for every node. For
instance, a constant propagation analysis would only be concerned
with nodes of a numerical type, and could ignore other nodes.}. The
transfer function (\autoref{sec:transfer-fnc}) produces these abstract
values, taking as input the abstract values computed earlier for
sub-expressions. For instance, in a constant propagation analysis,
the transfer function for addition (\code{+}) would look at the
abstract values for the left and right operand, and determine that the
\code{AdditionNode} is a constant if and only if both operands are
An analysis result contains two parts:
The \emph{node-value mapping} (\code{Analysis.nodeValues}) maps \code{Node}s to their abstract
values. Only nodes that can take on an abstract value are
used as keys. For example, in the Checker Framework, the mapping is
from expression nodes to annotated types.
The framework consciously does not store the abstract value
directly in the node, to remove any coupling between the control-flow
graph and a particular analysis. This allows the control-flow graph
to be constructed only once, and then reused for different dataflow
A set of \emph{stores}. Each store maps a flow expression to an
abstract value. Each store is associated with a specific program point.
The stores tracked by an analysis implement the \code{Store}
interface, which defines the following operations:
\item Least upper bound: Compute the least upper bound of two stores
(e.g., at a merge-point in the control-flow graph).
\item Equivalence: Compare two stores if they are (semantically)
different, which is used to determine if a fix-point is reached in
the dataflow analysis. Note that reference-equality is most likely
not sufficient.
\item Copy mechanism: Clone a store to get an exact copy.
The store is analysis-dependent, but the framework provides a default
store implementation which can be reused. The default implementation
What information is tracked in the store depends on the analysis to be
performed. Some examples of stores include
Every store is associated with a particular point in the control-flow
graph, and all stores are managed by the framework. It saves an explicit store
for the start of each basic block.
When dataflow information
is requested for a later point in a block, the analysis applies the
transfer function to recompute it from the initial store.
After an analysis has iterated to a fix-point, the computed dataflow
information is maintained in an AnalysisResult, which can map either
nodes or trees to abstract values.
\subsection{Answering Questions}
After the flow analysis for a particular method has been computed,
there are two kinds of information that have been computed. Firstly,
the node-value mapping stores an abstract value for every node, and
secondly, the information maintained in various stores is available.
Two kinds of queries are possible to the dataflow analysis after the
analysis is complete:
\item For a given AST tree node, what is its abstract value? Both
pre- and postconversion values can be retrieved.
A discussion of conversions can be found in \autoref{sec:conversions}.
\item For a given AST tree node, what is the state right after
this AST tree node? Examples of questions include:
\item Which locks are currently held?
\item Are all fields of a given object initialized?
The store may first need to be (re-)computed, as the framework does not
store all intermediate stores but rather only those for key positions
as described in \autoref{sec:store-management}.
To support both kinds of queries, the framework builds a map from AST
tree nodes (of type \code{com.sun.source.tree.Tree}) to CFG nodes. To
answer questions of the first type it is then possible to go from the
AST tree node to the CFG node and look up its abstract value in the
node-value mapping (this is provided by the framework). By default,
the abstract value returned for a tree by
\code{Analysis.getValue(Tree)} includes any implicit conversions
because it uses the mapping from tree node to post-conversion CFG
node. To request the pre-conversion value, one currently uses the
\code{ControlFlowGraph.treelookup} map directly.
To support questions of the second kind, every node has a reference to
the basic block it is part of. Thus, for a given AST tree node, the
framework can determine the CFG node and thereby the CFG basic block,
and compute the necessary store to answer the question.
\subsection{Transfer Function}
A transfer function is an object that has a transfer method for
every \code{Node} type, and also a transfer method for procedure entry.
\item A transfer method for a \code{Node} type takes a store
and the node, and produces an updated store. This is achieved by
implementing the \code{NodeVisitor<S, S>} interface for the store
type \code{S}.
These transfer methods also get access to the abstract value of any
sub-node of the node \code n under consideration. This is not limited
to immediate children, but the abstract value for any node contained
in \code n can be queried.
\item A transfer method for procedure entry returns the initial store, given the
list of parameters (as \code{LocalVariableNode}s that represent the formal
parameters) and the
\code{MethodTree} (useful if the initial store depends on the procedure
signature, for instance).
\subsection{Flow Rules}
As mentioned in \autoref{sec:cond-stores}, dataflow analysis
conceptually maintains two stores for each program point, a then store
containing information valid when the previous boolean-valued
expression was true and an else store containing information valid
when the expression was false. In many cases, there is only a single
store because there is no boolean-valued expression to split on or
there was an expression, but it yielded no useful dataflow
information. However, any CFG edge may potentially have a predecessor
with two stores and a successor with two stores.
We could simply propagate information from both predecessor stores to
both successor stores, but that would throw away useful information,
so we define five flow rules that allow more precise propagation.
\item \code{EACH_TO_EACH} is the default rule for an edge with a
predecessor that is not a \code{ConditionalBlock}. It
propagates information from the then store of the predecessor to
the then store of the successor and from the else store of the
predecessor to the else store of the successor.
\item \code{THEN_TO_BOTH} is the default rule for an edge from a
\code{ConditionalBlock} to its then successor. It propagates
information from the then store of its predecessor to both the
then and else stores of the then successor, thereby splitting
the conditional store to take advantage of the fact that the
condition is known to be true.
\item \code{ELSE_TO_BOTH} is the default rule for an edge from a
\code{ConditionalBlock} to its else successor. It propagates
information from the else store of its predecessor to both the
then and else stores of the else successor, thereby splitting
the conditional store to take advantage of the fact that the
condition is known to be false.
\item \code{THEN_TO_THEN} is a special rule for a short-circuit edge
from a \code{ConditionalBlock} to its then successor. It only
propagates information from the then store of its predecessor to
the then store of its successor. This flow rule is used for
conditional or expressions because the else store of \code{a ||
b} is not influenced by the then store of \code{a}.
\item \code{ELSE_TO_ELSE} is a special rule for a short-circuit edge
from a \code{ConditionalBlock} to its else successor. It only
propagates information from the else store of its predecessor to
the else store of its successor. This flow rule is used for
conditional and expressions because the then store of \code{a &&
b} is not influenced by the else store of \code{a}.
Note that the more precise flow rules \code{THEN_TO_THEN} and
\code{ELSE_TO_ELSE} improve the precision of the store they do not
write to. In other words, \code{THEN_TO_THEN} yields a more precise
else store of its successor by not propagating information to the else
store which might conflict with information already there, and
conversely for \code{ELSE_TO_ELSE}.
See \autoref{sec:cond-exp} for more details and an example.
Currently, we only use flow rules for short-circuiting edges of
conditional ands and ors. The CFG builder sets the flow rule of each
short-circuiting edge as it builds the CFG for the conditional and/or
The dataflow analysis logic requires that both the then and the else
store of each block contain some information before the block is
analyzed, so it is a requirement that at least one predecessor block
writes the then store and at least one writes the else store.
By default, the Dataflow Framework analyzes the code as if it is
executed sequentially. This is unsound if the code is run
concurrently. Use the \code{-AconcurrentSemantics} command-line
option to enable concurrent semantics.
In the concurrent mode, the dataflow analysis cannot infer any
local information for fields. This is because after a local update,
another thread might change the field's value before the next use.
An exception to this are monotonic type properties, such as the
\code{@MonotonicNonNull} annotation of the nullness type system. The
meta-annotation \code{@MonotonicQualifier} declares that a qualifier
behaves monotonically, however it is not yet used to preserve dataflow
information about fields under concurrent semantics.
\section{Example: Constant Propagation}
As a proof-of-concept, I (Stefan) implemented a constant propagation
analysis for local variables and integer values. The main class is
\code{org.checkerframework.dataflow.cfg.playground.ConstantPropagationPlayground}. I
describe the most important aspects here.
\textbf{Abstract values.} A class \code{Constant} is used as an
abstract value, which can either be \emph{top} (more than one integer
value seen), \emph{bottom} (no value seen yet), or \emph{constant}
(exactly one value seen; in which case the value is also stored).
\textbf{The store.} The store maps \code{Node}s to \code{Constant},
where only \code{LocalVariableNode}s and \code{IntegerLiteralNode}s
are used as keys. Only those two nodes actually are of interest
(there is no addition/multiplication/etc. yet, and other constructs
like fields are not yet supported by the analysis).
Two different instances of \code{LocalVariableNode} can be uses of the
same local variable, and thus the \code{equals} method has been
implemented accordingly. Therefore, every local variable occurs at
most once in the store, even if multiple (equal)
\code{LocalVariableNode}s for it exist.
\textbf{The transfer function.} The transfer function is very
simple. The initial store contains \emph{top} for all parameters, as
any value could have been passed in. When an integer literal is
encountered, the store is extended to indicate what abstract value
this literal stands for. Furthermore, for an assignment, if the
left-hand side is a local variable, the transfer function updates its
abstract value in the store with the abstract value of the right-hand
side (which can be looked up in the store).
To illustrate how we can have different information in the then and
else block of a conditional, I also implemented another transfer
function that considers the \code{EqualToNode}, and if it is of the
form \code{a == e} for a local variable \code{a} and constant
\code{e}, passes the correct information to one of the branches. This
is also shown in \autoref{fig:ConstSimple}.
\text{Example.} A small example is shown in \autoref{fig:ConstSimple}.
\flow{ConstSimple}{.45}{1}{Simple sequential program to illustrate constant
propagation. Intermediate analysis results are shown.}
\section{Example: Live Variable}
A live variable analysis for local variables and fields was implemented
to show the backward analysis works properly. The main class is
\textbf{Abstract values.} A class \code{LiveVarValue} is a live
variable (which is represented by a node) wrapper turning node into
abstract value. A node can be \code{LocalVariableNode} or \code{FieldAccessNode}.
\textbf{The store.} The live variable store \code{LiveVarStore} has a field
\code{liveVarValueSet} which contains a set of \code{LiveVarValue}s. Only
\code{LocalVariableNode} or \code{FieldAccessNode} will be considered as a
live variable and added to \code{liveVarValueSet}. The store defines methods
\code{putLiveVar(LiveVarValue)} and \code{killLiveVar(LiveVarValue)} to add
and kill live variables.
\textbf{The transfer function.} The transfer function \code{LiveVarTransfer}
initializes empty stores at normal and exceptional exit blocks (because this
is a backward transfer function). The transfer function visits assignments to
update the information of live variables for each node in the stores.
\textbf{Example.} An example is shown in \autoref{fig:LiveSimple}.
\flow{LiveSimple}{.33}{1}{Simple sequential program to illustrate live variable. Intermediate analysis results are shown.}
\section{Default Analysis}
The default flow-sensitive analysis \code{org.checkerframework.framework.flow.CFAnalysis}
works for any checker defined in the
Checker Framework. This generality is both a strength and a weakness
because the default analysis can always run but the facts it can
deduce are limited. The default analysis is extensible so checkers
can add logic specific to their own qualifiers.
The default flow-sensitive analysis takes advantage of several forms
of control-flow to improve the precision of type qualifiers. It
tracks assignments to flow expressions, propagating type qualifiers
from the right-hand side of the assignment. It considers equality and
inequality tests to propagate the most precise qualifiers from the
left or right-hand side to the true (resp. false) successor. It also
applies type qualifiers from method postconditions after calls.
Preconditions are not mentioned at all in this manual. How are they handled?
\subsection{Interaction of the Checker Framework and the Dataflow Analysis}
This section describes how the dataflow analysis is integrated into the
Checker Framework to enable flow-sensitive type checking.
A main purpose of a type factory is to create an \code{AnnotatedTypeMirror}
based on an input tree node. Using the results of the dataflow analysis,
the type factory can return a more refined type than otherwise possible.
Type factories that extend from \code{GeneralAnnotatedTypeFactory}
and set the constructor parameter \code{useFlow} to true will automatically
run dataflow analysis and use the result of the analysis when creating an
\code{AnnotatedTypeMirror}. The first time that a \code{GenericAnnotatedTypeFactory}
instance visits a \code{ClassTree}, the type factory runs the dataflow analysis on all the field initializers of the class first, then the bodies of methods in
the class, and then finally the dataflow analysis is
ran recursively on the members of nested classes. The result of
dataflow analysis are stored in the \code{GenericAnnotatedTypeFactory} instance.
When creating an \code{AnnotatedTypeMirror} for a tree node, the type factory
queries the result of the dataflow analysis to determine if a more refined
type for the tree node was inferred by the analysis. This is the first
type of query described in \autoref{sec:answering-questions}.
I found the below section confusing. I had a hard time putting my finger
on it, but perhaps you could re-read the section. As one minor issue,
``very similar'': similar to what? ``intermediary nodes'': what are
those? ``we'': is that the analysis writer or the framework implementor
(or the runtime system)?
Dataflow itself uses the type factory to get the initial
\code{AnnotatedTypeMirror} for a tree node in the following way.
For a given node \code{n}
If it has no corresponding AST tree node, use ``top'' as its
abstract value.
If it has a corresponding AST tree node, ask the
\code{AnnotatedTypeFactory} about the type of the tree node. The
factory will then use its checker-dependent logic to compute this
type. A typical implementation will look at the type of sub-trees
and compute the overall type based on the information about these
Note that the factory recursively uses information provided by the
flow analysis to determine the types of sub-trees. There is a
check in \code{Analysis.getValue} that the node whose type is
being requested is a sub-node of the node to which the transfer
function is currently being applied. For other nodes, the
analysis will return \code{null} (i.e., no information) and the
factory will return the flow-insensitive annotated type.
The \code{AnnotatedTypeFactory} and helper classes need to be prepared to work
even when dataflow analysis is not finished yet. Code should either check
whether dataflow analysis is still in progress (using
\code{analysis.isRunning()}) or handle possible null values. The factory should
return conservative, flow-insensitive types if the analysis is still in
\subsection{The Checker Framework Store and Dealing with Aliasing}
Word of caution: The exact rules of what information is
retained may or may not be implemented exactly as described
here. This is a good starting point in any case, but if very
precise information is needed, then the source code is very
readable and well documented.
The Dataflow Framework provides a default implementation of a store
with the class \code{CFAbstractStore}, which is used (as
\code{CFStore}) as the default store if a checker does not provide its
own implementation. This implementation of a store tracks the
following information:
\item Abstract values of local variables.
\item Abstract values of fields where the receiver is an
access sequence compose of the following:
\item Field access.
\item Local variable.
\item Self reference (i.e., \code{this}).
\item Pure or non-pure method call.
The most challenging part is ensuring that the information about field
accesses is kept up to date in the face of incomplete aliasing
information. In particular, at method calls and assignments care
needs to be taken about which information is still valid afterwards.
The store maintains a mapping from field accesses (as defined in
Section~\ref{sec:field-access}) to abstract values, and the subsequent
sections describe the operations that keep this mapping up-to-date.
There hasn't been a good introduction of pure vs. non-pure methods.
Maybe this is a good location for a discussion?
Introduce the purity annotations somewhere.
\subsubsection{Internal Representation of field access}
To keep track of the abstract values of fields, the Dataflow Framework
uses its own representation of field accesses (that is different from
the \code{Node} type introduced earlier). This data type is defined
inductively as follows:
{\nonterminal{Receiver} \nonterminal{Field} \qquad Java field (identified by its \code{Element})}
{\nonterminal{SelfReference} \qquad \code{this}}
\altline{\nonterminal{LocalVariable} \qquad local variable (identified by its \code{Element})}
\altline{\nonterminal{MethodCall} \qquad Java method call of a method}
\altline{\nonterminal{Unknown} \qquad any other Java expression}
{\nonterminal{Receiver} \literal{.} \nonterminal{Method}
\literal{(} \nonterminal{Receiver}$^{,*}$ \literal{)}}
\nonterminal{Unknown} is only used to determine which
information needs to be removed (e.g., after an assignment), but no field
access that contains \nonterminal{Unknown} is stored in the mapping to
abstract values. For instance, \nonterminal{Unknown} could stand for
a non-pure method call, an array access, or a ternary expression.
\subsubsection{Updating Information in the Store}
In the following, let $o$ be any \nonterminal{Receiver}, $x$ a local
variable, $f$ a field, $m$ a pure method, and $e$ an expression.
Furthermore, we assume to have access to a predicate $\alias(o_1,o_2)$
that returns true if and only if~$o_1$ might alias~$o_2$; see
\subsubsubsection{At Field Assignments}
For a field update of the form $o_1.f_1 = e$, the
dataflow analysis first determines the abstract value $e_\text{val}$ for $e$.
Then, it updates the store $S$ as follows.
\item For every field access $o_2.f_2$ that is a key in $S$, remove its
information if $f_1 = f_2$ and
$o_2$ lexically contains a \nonterminal{Receiver} that \emph{might}
alias $o_1.f$ as determined by the $\alias$
Note that the ``lexically contains'' notion for pure method calls
includes both the receiver as well as the arguments.
This includes the case where $o_1 = o_2$ (that is, they are
syntactically the same) and the case where $o_2$ and $o_1.f$ might be
Should the two occurrences of field $f$ in this paragraph be $f_1$?
\item $S[o_1.f_1] = e_\text{val}$
\subsubsubsection{At Local Variable Assignments}
For a local variable assignment of the form $x = e$,
the dataflow analysis first determines the abstract value $e_\text{val}$ for
Then, it updates the store $S$ as follows.
\item For every field access $o_2.f_2$ that is a key in $S$,
remove its information if $o_2$ lexically contains a
\nonterminal{Receiver} that \emph{might} alias $x$ as determined
by the $\alias$ predicate.
\item $S[x] = e_\text{val}$
\subsubsubsection{At Other Assignments}
For any other assignment $z = e$ where the
assignment target $z$ is not a field or local variable,
the dataflow analysis first determines the abstract value $e_\text{val}$ for
Then, it updates the store $S$ as follows.
\item For every field access $o_2.f_2$, remove its information if
$o_2$ lexically contains a \nonterminal{Receiver} that \emph{might}
alias $z$ as determined by the $\alias$ predicate.
\subsubsubsection{At Non-Pure Method Calls}
A non-pure method call might modify the value of any field arbitrarily.
Therefore, at a method call, any information about fields is lost.
\subsubsubsection{Alias Information}
The Checker Framework does not include an aliasing analysis, which could
provide precise aliasing information. For this reason, we implement the
predicate $\alias$ as follows:
\[ \alias(o_1,o_2) :=
\left(\operatorname{type}(o_1) <: \operatorname{type}(o_2)
\operatorname{type}(o_2) <: \operatorname{type}(o_1) \right) \]
where $\operatorname{type}(o)$ determines the Java type of a reference $o$
and $<:$ denotes standard Java subtyping.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "dataflow"
%%% TeX-command-default: "PDF"
%%% End:
% LocalWords: pre cfg javacutils InternalUtils DivideByZero BlockImpl se
% LocalWords: SingleSuccessorBlock RegularBlock SingleSuccessorBlockImpl
% LocalWords: ExceptionBlock SpecialBlock ConditionalBlock SpecialBlocks
% LocalWords: ControlFlowGraph ControlFlowGraphs CFGBuilder ast CFValue
% LocalWords: JavaExpressions AbstractValue AbstractValues PolyNull Poly
% LocalWords: AnnotatedTypeMirrors CFAbstractValue NullnessValue CFStore
% LocalWords: NullnessCheckers CFAbstractStore InitializationStore Regex
% LocalWords: NullnessStore TransferInput TransferResult TransferInputs
% LocalWords: ConditionalTransferResult RegularTransferResult CFTransfer
% LocalWords: TransferFunction NodeVisitor TransferResults isRegex lst
% LocalWords: CFAbstractTransfer AbstractNodeVisitor asRegex ClassTree
% LocalWords: InitializationTransfer visitMethodInvocation RegexTransfer
% LocalWords: CFAbstractAnalysis NullnessTransfer RegexChecker's lp Uber
% LocalWords: AnalysisResult AnnotatedTypeFactory AnnotatedTypeFactorys
% LocalWords: AnnotationProvider getAnnotatedType BaseTypeChecker
% LocalWords: GenericAnnotatedTypeFactory FlowAnalysis CFAnalysis
% LocalWords: BaseAnnotatedTypeFactory CFGSimple ValueLiteral txt
% LocalWords: BooleanLiteral CharacterLiteral DoubleLiteral FloatLiteral
% LocalWords: IntegerLiteral LongLiteral NullLiteral ShortLiteral args
% LocalWords: StringLiteral Accessor ArrayAccess FieldAccess This
% LocalWords: MethodAccess ExplicitThis ImplicitThis java NullAway
% LocalWords: LocalVariable MethodInvocation BitwiseAnd BitwiseOr MyType
% LocalWords: BitwiseComplement BitwiseXor ConditionalAnd ConditionalNot
% LocalWords: ConditionalOr FloatingDivision FloatingRemainder LeftShift
% LocalWords: IntegerDivision IntegerRemainder NumericalAddition EqualTo
% LocalWords: NumericalMinus NumericalMultiplication NumericalPlus util
% LocalWords: NumericalSubtraction SignedRightShift StringConcatenate
% LocalWords: TernaryExpression UnsignedRightShift NotEqual GreaterThan
% LocalWords: GreaterThanOrEqual LessThan LessThanOrEqual ArrayCreation
% LocalWords: StringConcatenateAssignment ObjectCreation TypeCast cond
% LocalWords: InstanceOf instanceof NarrowingConversion StringConversion
% LocalWords: WideningConversion ArrayType ParameterizedType ClassName
% LocalWords: PrimitiveType PackageName AssertionError NullChk accessor
% LocalWords: VariableDeclaration desugared desugaring FieldAssignment
% LocalWords: CFGFieldAssignment getRegularStore CFGConditionalOr fnc
% LocalWords: CFGConditionalOr2 ConditionalOrNode ConditionalOr2 valueOf
% LocalWords: IdentifierTree FieldAccessNode ImplicitThisNode unboxing
% LocalWords: TreePath pathHack MethodTree VarSymbol DetachedVarSymbols
% LocalWords: DeclarationFromElement BoxedClass treeLookup getValue
% LocalWords: nodeValues AdditionNode postconversion treelookup desugar
% LocalWords: LocalVariableNode AconcurrentSemantics MonotonicNonNull
% LocalWords: MonotonicQualifier IntegerLiteralNode EqualToNode
% LocalWords: ConstSimple SelfReference PureMethodCall javacutil
% LocalWords: stubparser TreeParser TreeBuilder DetachedVarSymbol LiveSimple