% Make this conditional?
\tikzset{%
    every node/.style={
	inner sep=1pt
    },
    every loop/.style={}
}
\lettrine[nindent=0ex]A{fter} more than two decades of development of Haskell compilers, one has become slightly spoiled by the quality and power of optimisations performed by the compiler. For example, list fusion allows us to write concise and easy to understand code using combinators and list comprehensions and still get the efficiency of a tight loop that avoids allocating the intermediate lists.

Unfortunately, not all list-processing functions used to take part in list fusion. In particular, before my work, left folds like \hi{foldl}, \hi{foldl'}, \hi{length} and derived functions like \hi{sum} were not fusing, and an expression like \mbox{\hi{sum (filter f [42..2016])}} still allocated and traversed one list.

The issue is that in order to take part in list fusion, these need to be expressed as right folds, which requires higher-order parameters as in
\begin{hcode}
\> foldl k z xs = foldr (λv fn z → fn (k z v)) id xs z.
\end{hcode}
The resulting fused code would be allocating and calling function closures on the heap, causing the final program to run too slowly (see\@ \cref{sec:stg}).
%I inspect that example more closely in \cref{sec:list-fusion}.

Already Andrew Gill noted that eta-expansion based on an arity analysis would help here  \cite{gill}. Previous arity analyses, however, are not precise enough to allow for a fusing \hi{foldl}.

\pagebreak[4]

Why is this so hard? Consider the slightly contrived example in \cref{fig:tricky}:
Our goal is to eta-expand the definition of \hi{tA}. For that, we need to ensure that it is always called with one argument, which is not obvious: Syntactically, the only use of \hi{tA} is in \hi{goB}, and there it occurs without an argument. But we see that \hi{goB} is initially called with two arguments, and under that assumption calls itself with two arguments as well, and it therefore always calls \hi{tA} with one argument -- done.

\begin{figure}
\abovedisplayskip=0pt
\belowdisplayskip=0pt
\begin{hcode}
\> !let tA = !if f a !then ... !else ... \\
\> !in \> !let \=2  goA x  \=3 =  \=4 !if f (tB + x) !then goA (x+1) !else x \\
               \=2  tB     \=3 =  \=4 !let  goB y = !if f y !then goB (goA y) !else tA \\
                                  \=4 !in goB 0 1 \\
\>     \> !in goA (goA 1)
\end{hcode}
\caption{Is it safe to eta-expand \protect\hi{tA}?}
\label{fig:tricky}
\end{figure}

But \hi{tA} is a thunk -- i.e.~not in head normal form -- and even if there are many calls to \hi{tA}, the call to \hi{f a} is only evaluated once. If we were to eta-expand \hi{tA} we would be duplicating that possibly expensive work! So we are only allowed to eta-expand \hi{tA} if we know that it is called at most once. This is tricky: \hi{tA} is called from a recursive function \hi{goB}, which itself is called from the mutual recursion consisting of \hi{goA} and \hi{tB}, and that recursion is started multiple times!

Nevertheless we know that \hi{tA} is evaluated at most once: \hi{tB} is a thunk, so although it will be called multiple times by the outer recursion, its right-hand side is only evaluated once. Furthermore, the recursion involving \hi{goB} is started once and stops when the call to \hi{tA} happens. Together, this implies that we are allowed to eta-expand \hi{tA} without losing any work.

I have developed an analysis, dubbed \emph{Call Arity}, that is capable of this reasoning and correctly detects that \hi{tA} can be eta-expanded. It is a combination of a standard forward call arity analysis (\cite{gill}, \cite{arityanalysis}) with a novel cardinality analysis, dubbed \emph{co-call analysis}. The latter determines for an expression and two variables whether one evaluation of the expression can possibly call both variables and -- as a special case -- which variables it calls at most once. I found that this is just the right amount of information to handle tricky cases as those in \cref{fig:tricky}.
\goodbreak

In this chapter, which is based on the work that I presented at the Trends in Functional Programming conference in 2014 \cite{tfp}, I approach the analysis from the practical, empirical side. \Cref{sec:why} motivates the need for and the design of the Call Arity analysis. The following two sections describe the co-call graph data structure that is central to the analysis (\cref{sec:cocallgraphs}) and the analysis itself (\cref{sec:callarity}). \Cref{sec:implementation} describes a few aspects of the implementation (which is reproduced in its entirety in \cref{sec:call-arity-impl}). Finally, \cref{sec:call-arity-discussion} discusses the analysis and quantifies the performance improvements. Notes on related and future work follow.

% Since version 7.10, the Haskell compiler GHC includes this analysis, and the Haskell standard library now implements \hi{foldl} as a good consumer for list fusion. I quantify the performance improvements in \cref{sec:measurements}.

\section{The need for co-call analysis}
\label{sec:why}

The main contribution of this chapter is the description of the co-call cardinality analysis and its importance for arity analysis. I want to motivate the analysis based on a sequence of ever more complicated arity analysis puzzles.

\subsection{A syntactical analysis}

The simplest such puzzle is the following code, where a function is defined as taking one argument, but always called with two arguments:
\begin{hcode}
\> !let  f x = \ldots\\
\> !in f 1 2 + f 3 4.
\end{hcode}
Are we allowed to eta-expand \hi{f} by another argument? Yes! How would we find out about it? We would analyse each expression of the syntax tree and ask
\begin{quote}
“For each free variable, what is a lower bound on the number of arguments passed to it?”
\end{quote}
This will tell us that \hi{f} is always called with two arguments, so we eta-expand it.

\pagebreak[3]

\subsection{Incoming arity}
Here is a slightly more difficult puzzle:
\begin{hcode}
\> !let \> f \> x \> = ... \\
\>      \> g \> y \> = f (y+1) \\
\> !in g 1 2 + g 3 4.
\end{hcode}
Are we still allowed to eta-expand \hi{f}? The previous syntactic approach fails, as the right-hand side of \hi{g} mentions \hi{f} with only one argument. However, \hi{g} itself can be eta-expanded, and once that is done we would see that \hi{g}'s right hand side is called with one argument more. We could run the previous analysis, simplify the code, and run the analysis once more, but we can do better by asking, for every expression:
\begin{quote}
“If this expression is called with $n$ arguments, for each free variable, what is a lower bound on the number of arguments passed to it?”
\end{quote}
The body of the \hi{!let} will report to call \hi{g} with two arguments. The pattern on the left-hand side of the definition of \hi{g} consumes one of them, so we can analyse the right-hand side with an \emph{incoming arity} of 1, and thus find out that \hi{f} is always called with two arguments.

For recursive functions this is more powerful than just running the simpler variant multiple times. Consider
\begin{hcode}
\> !let \> f \> x \> = \ldots \\
\>      \> g \> y \> = !if y > 10 !then f y !else g (y + 1) \\
\> !in g 1 2 + g 3 4.
\end{hcode}
A purely syntactical approach will never be able to eta-expand \hi{g} or \hi{f}. But by assuming an incoming arity we can handle the recursive case: The body of the \hi{!let} reports that \hi{g} is called with two arguments. We initially assume that to be true for all calls to \hi{g}. Next we analyse the right-hand side of \hi{g} and will learn -- under our assumption -- that it calls \hi{g} with two arguments, too, so our assumption was justified and we can proceed.
\goodbreak

Of course, it may well be that the assumption is refuted by analysing the definition of the recursive function:
\begin{hcode}
\> !let \> f \> x \> = \ldots \\
\>      \> g \> y \> = !if y > 10 !then f y !else foo (g (y+1)) \\
\> !in g 1 2 + g 3 4.
\end{hcode}
The body still reports that it calls \hi{g} with two arguments, but -- even under that assumption -- the right-hand side of \hi{g} calls \hi{g} with only one argument. So we have to re-analyse \hi{g} with one argument, which in turn calls \hi{f} with one argument and no eta-expansion is possible here.

This corresponds to the analysis outlined in \cite{gill}.


\subsection{Called-once information}

So far we have only eta-expanded functions; for these the final analysis in the previous section is sufficient. But there is also the case of \emph{thunks}: If the expression bound to a variable $x$ is not in head-normal form, i.e. the outermost syntactic construct is a function call, case expression or let-binding, but not a lambda abstraction or constructor, then that expression is evaluated upon its first call, and the result is shared with further calls to~$x$.

If we were to eta-expand the expression, though, the expensive operation is hidden under a lambda and will be evaluated for every call to $x$.
Therefore, it is crucial that thunks are only eta-expanded if they are going to be called at most once. So we need to distinguish the situation
\begin{hcode}
\> !let t = foo x \\
\> !in !if x !then t 1 !else t 2.
\end{hcode}
where \hi{t} is called at most once and eta-expansion is allowed, from
\begin{hcode}
\> !let t = foo x \\
\> !in t 1 + t 2.
\end{hcode}
where \hi{t} is called multiple times and must not be eta-expanded.
\goodbreak

An analysis that could help us here would be answering this question:
\begin{quote}
“If this expression is called once with $n$ arguments, for each free variable, what is a lower bound on the number of arguments passed to it, and are we calling it at most once?”
\end{quote}

In the first example, both branches of the \hi{!if} would report to call \hi{t} only once (with one argument), so the whole body of the \hi{!let} calls \hi{t} only once and we can eta-expand \hi{t}. In the second example the two subexpressions \mbox{\hi{t 1}} and \mbox{\hi{t 2}} are both going to be evaluated. Combined they call \hi{t} twice and we cannot eta-expand \hi{t}.

\subsection{Mutually exclusive calls}

What can we say in the case of a thunk that is called from within a recursion, like in the following code?
\begin{hcode}
\> !let t = foo x \\
\> !in  \> !let g y = !if y > 10 !then t !else g (y+1) \\
\>      \> !in g 1 2
\end{hcode}
Clearly \hi{t} is called at most once, but the current state of the analysis does not see that: The right-hand side of \hi{g} reports to call \hi{t} and \hi{g} at most once. But
\begin{hcode}
\> !let t = foo x \\
\> !in \> !let g y = !if y > 10 !then id !else g (t y) \\
\>     \> !in g 1 2
\end{hcode}
would yield the same result, although \hi{t} is called many times!

How can we extend our analysis to distinguish these two cases? The crucial difference is that in the first code, \hi{g} calls \emph{either} \hi{t} \emph{or} \hi{g}, while the second one calls both of them together. So we would like to know, for each expression:

\begin{quote}
“If this expression is called once with $n$ arguments, for each free variable, what is a lower bound on the number of arguments passed to it? Additionally, what set of variables is called mutually exclusively and at most once?”
\end{quote}

In the first example, the right-hand side would report to call $\{\text{\hi{t}},\text{\hi{g}}\}$ mutually exclusively and this allows us to see that the call to \hi{t} does not lie on the recursive path, so there will be at most one call to \hi{t} in every run of the recursion. We also need the information that the body of the \hi{!let} (which reports $\{\text{\hi{g}}\}$) and the right-hand side of \hi{g} both call \hi{g} at most once; if the recursion were started multiple times, or were not linear, then we would get many calls to \hi{t} as well.

\subsection{Co-call analysis}

The final puzzle in this sequence is the code
\begin{hcode}
\> !let t1 = foo x \\
\> !in \> !let g x = \> !if x > 10 \\
\>     \>            \> !then \> t1 \\
\>     \>            \> !else \> !let t2 = bar x \\
\>     \>            \>       \> !in  \> !let h y = \> !if y > 10 \\
\>     \>            \>       \>      \>            \> !then \> g (t2 y) \\
\>     \>            \>       \>      \>            \> !else \> h (y+1) \\
\>     \>            \>       \>      \> !in h 1 x \\
\>     \> !in g 1 2. \\
\end{hcode}
which shows the shortcomings of the previous iteration and the strength of the actual co-call analysis.

Note that both recursions are well-behaved: They are entered once and each recursive function calls either itself once or calls the thunk \hi{t1} resp.\ \hi{t2} once. So we would like to see both \hi{t1} and \hi{t2} eta-expanded. Unfortunately, with the analysis above, we can only get one of them.

The problematic subexpression is \hi{g (t2 y)}: We need to know that \hi{g} is called at most once \emph{and} that \hi{t2} is called at most once. But we cannot return $\{\text{\hi{g}}, \text{\hi{t2}}\}$ as that is a lie -- they are not mutually exclusive -- and the best we can do is to arbitrarily return either $\{\text{\hi{g}}\}$ or $\{\text{\hi{t2}}\}$.

To avoid this dilemma we extend the analysis one last time, in order to preserve all valuable information.
\goodbreak

We now ask, for each expression:
\begin{quote}
“If this expression is called once with $n$ arguments, for each free variable, what is a lower bound on the number of arguments passed to it, and for each pair of free variables, can both be called during the same execution of the expression?”
\end{quote}

The latter tells us, as a special case, whether one variable may be called multiple times.

For the problematic expression \hi{g (t2 y)} we would find that \hi{g} might be called together with \hi{t2}, but neither of them is called twice. For the right-hand side of \hi{h} the analysis would tell us that either \hi{h} is called at most once and on its own, or \hi{g} and \hi{t2} are called together, but each at most once. The whole inner \hi{!let} therefore calls \hi{t2} and \hi{g} at most once, so we get to eta-expand \hi{t2} and learn that the outer recursion is well-behaved.

\section{The type of co-call graphs}
\label{sec:cocallgraphs}

\index{Graph@$\sGraph$\idxexpl{undirected graphs over the set of variables}}%
\index{dom@$\dom{\idxdummy}$\idxexpl{domain}!of a $\sGraph$}%
\index{.O2edge@$\edge\idxdummy\idxdummy$\idxexpl{edge in a $\sGraph$}}
This information -- i.e.\@ which pairs of variables can both be called during the same execution -- can be represented by a graph on the set of variables. These graphs are undirected, non-transitive and can have loops. I denote the set of such graphs with $\sGraph$, and the intuition is that
\begin{compactitem}
\item only the nodes of $G$ (denoted by $\dom G$) are called, and that
\item an edge $\edge x y\in G$ indicates that $x$ and $y$ can be called together, while the absence of an edge guarantees that calls to $x$ resp.\ $y$ are mutually exclusive.
\end{compactitem}
In particular, the absence of a loop, i.e.\ $\edge x x\in G$, implies that $x$ is called at most once.

\begin{example}
Consider the three graphs
\begin{align*}
G_1 &=
\begin{tikzpicture}[baseline={(0.base)}, xscale=0.7]
\node (0) {$x$};
\node (1) at (1,0) {$y$};
\draw (0) -- (1);
\end{tikzpicture},
\\
G_2 &=
\begin{tikzpicture}[baseline={(0.base)}, xscale=0.7]
\node (0) {$x$};
\node (1) at (1,0) {$y$};
\draw (0) -- (1);
\draw (1)  edge[loop right] (1);
\end{tikzpicture},\text{ and}
\\
G_3 &=
\begin{tikzpicture}[baseline={(0.base)}, xscale=0.7]
\node (0) {$x$};
\node (1) at (1,0) {$y$};
%\draw (0) -- (1);
\draw (1)  edge[loop right] (1);
\end{tikzpicture}.
\end{align*}
The first graph allows at most one call to $y$ and at most one call to $x$, both of which can occur together. In contrast, the graph $G_2$ allows any number of calls to $y$, together with at most one call to $x$, while $G_3$ describes that any execution performs \emph{either} at most one call to $x$ \emph{or} any number of calls to $y$.
\end{example}

\index{.O2times@$\idxdummy\times\idxdummy$\idxexpl{Cartesian product of two sets of variables}}%
I often identify the graphs with their set of edges, e.g.\@ in the definition of the Cartesian product of two sets of variables, which is
\[
V \times V' \coloneqq \{\edge x y \mid x\in V \wedge y \in V' \vee y \in V \wedge x\in V'\},
\]
and specify its set of nodes separately -- which in this case is given by $\dom (V\times V') \coloneqq \dom V \cup \dom V'$.

\index{.O1square@$\idxdummy^2$\idxexpl{complete graph on a set of variables}}%
I write $V^2 \coloneqq V \times V$ for the complete graph on the variables in the set $V$.

\index{.O2neigh@$\neighbors\idxdummy\idxdummy$\idxexpl{neighbours of a node in a $\sGraph$}}%
\index{.O2restr@$\idxdummy"|_\idxdummy$!on $\sGraph$}%
\index{.O2setminus@$\idxdummy\setminus\idxdummy$!on $\sGraph$}%
The set of neighbours of a variable is $\neighbors x G \coloneqq \{y \mid \edge x y \in G\}$. The graph $G \setminus V$ is $G$ with nodes in $V$ removed, while $\restr{G}{V}$ is $G$ with only nodes in $V$ retained.

The graphs are obviously partially ordered by inclusion, i.e.
\index{.R2sqsubset@$\idxdummy\sqsubseteq\idxdummy$!on $\sGraph$}
\[
G \sqsubseteq G' \iff  \dom G \subseteq \dom {{G'}} \wedge G \subseteq G',
\]
with the empty graph $\{\}$ being the least element.

\section{The Call Arity analysis}
\label{sec:callarity}

Thus having motivated the need for a co-call-based analysis in order to get a precise arity analysis, I devote this section to a formal description of it. I build on the syntax introduced in \cref{sec:syntax}, allowing expressions as arguments in function calls.

\subsection{The specification}
\label{sec:spec}

The goal of the analysis is to determine the \emph{call arity} of every variable $x$. As defined in \cref{sec:arity}, this is a natural number $\alpha_x$ indicating that the compiler can replace the binding $\keyword{let}~{x = e}$ with $\keyword{let}~{x = \sLam{z_1 \ldots z_\alpha}{\sApp e {z_1\ldots z_\alpha}}}$ without losing any sharing.

The bottom-up analysis considers each expression $e$ under the assumption of an \emph{incoming arity} $\alpha$ -- which is the number of arguments the expression is currently being applied to -- and determines with at least how many arguments $e$ calls its free variables, and which free variables can be called together. Separating these two aspects into two functions, we have
\index{aexp@$\aexp\idxdummy\idxdummy$\idxexpl{arity analysis}}
\index{gexp@$\ccexp\idxdummy\idxdummy$\idxexpl{co-call analysis}}
\begin{align*}
\aexp{}\alpha &\colon \sExp \to (\sVar \pfun \mathbb N) && \text{arity analysis} \\
\ccexp{}\alpha &\colon \sExp \to \sGraph && \text{co-call analysis}
\end{align*}
where $\pfun$ denotes a partial map and $\sGraph$ is the type of undirected graphs (with self-edges) over the set of variables.
\index{Graph@$\sGraph$\idxexpl{undirected graphs over the set of variables}}

The informal specifications for $\aexp{}\alpha$ and $\ccexp{}\alpha$ are
\begin{compactitem}
\item If $\aexp{e}\alpha\,x = m$, then every call from $e$ (applied to $\alpha$ arguments) to $x$ passes at least $m$ arguments.
\item If $x_1$ and $x_2$ are not adjacent in $\ccexp{e}{\alpha}$, then no execution of $e$ (applied to $\alpha$ arguments) will call both $x_1$ and $x_2$. In particular, if $\edge x x \notin \ccexp{e}{\alpha}$, then $x$ will be called at most once.
\end{compactitem}


We can define a partial order on the results that expresses the notion of precision:
If $x$ is correct and $x \sqsubseteq y$, then $y$ is also correct, but possibly less precise.

In particular for $A, A' \colon (\sVar \pfun \mathbb N)$ we have
\index{.R2sqsubset@$\idxdummy\sqsubseteq\idxdummy$!on $(\sVar\pfun \mathbb N)$}
\[
A \sqsubseteq A' \iff \forall x \in \dom(A).\, A\,x \ge A'\,x
\]
(note the contravariance), because it is always safe to assume that $x$ is called with fewer arguments.

The partial order on $\sGraph$ introduced in \cref{sec:cocallgraphs} is also compatible with this notion of precision: If we have $G \sqsubseteq G'$, then every behaviour that is allowed by $G$ is also allowed by $G'$, as it is always safe to pessimistically assume that any two variables are called together, or to assume that one variable is called multiple times.

Thus the always correct and least useful analysis result maps every variable to 0 (making no statements about the number of arguments passed to them), and returns the complete graph on all variables as the co-call graph (allowing everything to be called with everything else).

The bottom of the lattice, i.e.\ the best information, is the empty map and the empty graph. This is the analysis result we expect for closed values such as $(\sLam y y)$ or $\sCon{\bT}$.

\subsection{The equations}
\label{sec:equation}

\begin{figure}[t]
\abovedisplayskip=0pt
\belowdisplayskip=0pt
\begin{align*}
\aexp{x}\alpha     &= [x \mapsto \alpha] \\
\ccexp{x}{\alpha}        &= x \\[1ex]
\aexp{e_1\ e_2}\alpha &= \aexp{e_1}{\alpha+1} \sqcup \aexp{e_2}{0} \\
\ccexp{e_1\ e_2}{\alpha} &=
\begin{cases}
\ccexp{e_1}{\alpha+1} \sqcup \{x\}^2 \sqcup \fv{e_1} \times\{x\} & \text{if } e_2 = x\\
\ccexp{e_1}{\alpha+1} \sqcup \ccexp{e_2}{0} \sqcup \fv{e_1} \times\fv{e_2} & \text{otherwise}
\end{cases} \\[1ex]
\aexp{\sLam x e}{0} &= \aexp{e}{0} \\
\aexp{\sLam x e}{\alpha+1} &= \aexp{e}{\alpha} \\
\ccexp{\sLam x e}{0} &= (\fv{e})^2\\
\ccexp{\sLam x e}{\alpha+1} &= \ccexp{e}{\alpha}\\[1ex]
\aexp{\sITE{e}{e_1}{e_2}}\alpha &= \aexp{e}0 \sqcup \aexp{e_1}{\alpha} \sqcup \aexp{e_2}\alpha \\
\ccexp{\sITE{e}{e_1}{e_2}}{\alpha} &= \ccexp{e}{0} \sqcup \ccexp{e_1}{\alpha} \sqcup \ccexp{e_2}{\alpha} \sqcup \fv{e} \times (\fv{e_1} \cup \fv{e_2})
\end{align*}
\caption{The Call Arity analysis equations}
\label{fig:eqn1}
\end{figure}

From the specification we can derive equations for every syntactical construct, given in \cref{fig:eqn1,fig:eqn2,fig:eqn3}.

%By construction, for code without “obviously” dead code, i.e.\ unused let-bound variables, we will have $\dom{\aexp e \alpha} = \dom{\ccexp e \alpha} = \fv e$, so the node set of the graph is not interesting.

Note that from the above definition of $\sqsubseteq$, the least upper bound of two arity analysis results $A,A' \in (\sVar \pfun \mathbb N )$ is the pointwise minimum.

\subsubsection*{Case 1: Variables}

Evaluating a variable with an incoming arity of $\alpha$ yields a call to that variable with $\alpha$ arguments, so the arity analysis returns a singleton map. Because we are interested in the effect of \emph{one} call to the expression, we return $x$ as called at-most once, i.e.\ the graph has the node $x$, but no edges.

\subsubsection*{Case 2: Application}

In this case, the incoming arity is adjusted: If $e_1\ e_2$ is being called with $\alpha$ arguments, then $e_1$ is called with $\alpha+1$ arguments. On the other hand we do not know how many arguments $e_2$ is called with -- this analysis is not higher order (see~\cref{sec:sophisticated}) -- so we analyse it with an incoming arity of 0.

The co-call analysis reports all possible co-calls from both $e_1$ and $e_2$. Additionally, it reports that everything that may be called by $e_1$ can be called together with everything called by $e_2$.

In the evaluation of a Core program, an argument to a function is shared and thus evaluated only once. Therefore, the co-call information in $\ccexp{e_2}0$ can be used as is. There is, however, an exception: If the argument is trivial, i.e.\@ a variable $x$, the Core-to-STG transformation does not introduce an explicit binding for the argument, and no sharing happens at this point. So if $e_1$ uses its argument more than once, $x$ will itself be called multiple times. Hence the analysis pessimistically includes $\{x\}^2$ in the result. This corner case was not handled in an earlier version of Call Arity, see \cref{sec:formalisationgap-semantics} for a discussion of this bug.

\subsubsection*{Case 3: Lambda abstraction}

For lambda abstractions, we have to distinguish two cases. The good case is if
the incoming arity is nonzero, i.e.\ we want to know the behaviour of the expression when applied once to some arguments. In that case, we know that the body is evaluated once, applied to one argument less, and the co-call information from the body can be used directly.

If the incoming arity is zero we have to assume that the lambda abstraction is used as-is, for example as a parameter to a higher-order function, or stored in a data type. In particular, it is possible that it is going to be called multiple times. So while the incoming arity on the body of the lambda stays zero (which is always correct), we cannot obtain any useful co-call results and have to assume that every variable mentioned in $e$ is called with every other.

Naturally, there is no point in passing arity or co-call information about the abstracted variable out of its scope. In the interest of a concise presentation, this is not explicated in \cref{fig:eqn1}. \Cref{sec:callarityconcretely} contains a more pedantic formal presentation.

\pagebreak

\begin{example}
The expression $e = \sLam x {(\sITE{x_0}{x_1}{x_2})}$ will, when analysed with an incoming arity of 1 resp. 0 yield
\begin{center}
\hfill
$\ccexp{e}{1} = $
\begin{tikzpicture}[baseline={(0.base)}]
\node (0) {$x_0$};
\node at (1,0.3) (1) {$x_1$};
\node at (1,-0.3) (2) {$x_2$};
\draw (0) -- (1);
\draw (0) -- (2);
\end{tikzpicture},
\hfill
resp.
\hfill
$\ccexp{e}{0} = $
\begin{tikzpicture}[baseline={(0.base)}]
\node (0) {$x_0$};
\node at (1,0.3) (1) {$x_1$};
\node at (1,-0.3) (2) {$x_2$};
\draw (0) -- (1);
\draw (0) -- (2);
\draw (1) -- (2);
\draw (0) edge[loop left] (0);
\draw (1) edge[loop right] (1);
\draw (2) edge[loop right] (2);
\end{tikzpicture}.
\hfill
\strut
\exampleSymbol\NoEndMark
\end{center}
\end{example}

\subsubsection*{Case 4: Case analysis}

The arity analysis of a case expression is straightforward: The incoming arity is fed into each of the alternatives, while the scrutinee is analysed with an incoming arity of zero; the results are combined using $\sqcup$.

The co-call analysis proceeds likewise. Furthermore,
% and similar to function application,
extra co-call edges are added, connecting everything that may be called by the scrutinee with everything that may be called in the alternatives -- analogous to analysing applications.

This may be an over-approximation: The analysis will yield
\begin{center}
$\ccexp{(\sITE{z}{x_1}{x_2}) (\sITE{z}{x_3}{x_4})}0 = $
\begin{tikzpicture}[baseline={([yshift=-.5ex]current bounding box.center)}]
\node at (-1,0)   (z) {$z$};
\node at (0,0.3)  (0) {$x_1$};
\node at (0,-0.3) (1) {$x_2$};
\node at (1,0.3)  (2) {$x_3$};
\node at (1,-0.3) (3) {$x_4$};
\draw (0) -- (2);
\draw (0) -- (3);
\draw (1) -- (2);
\draw (1) -- (3);
\draw (z) edge[loop left] (z);
\draw (z) -- (0);
\draw (z) -- (1);
\draw (z) edge[bend right=5] (2);
\draw (z) edge[bend left=5] (3);
\end{tikzpicture}
\end{center}
which contains the edge $\edge{x_1}{x_4}$, although $x_1$ cannot be called together with $x_4$ (and analogously for $\edge{x_2}{x_3}$), as the conditionals will choose the same branch in both cases.

\subsubsection*{Case 5: Non-recursive let}

This case is slightly more complicated than the previous, so we describe it in multiple equations in \cref{fig:eqn2}.


We analyse the body of the let-expression first, using the incoming arity of the whole expression. Based on that we determine our main analysis result, the call arity of the variable. There are two cases:
\begin{enumerate}
\item If the right-hand side expression $e_1$ is a thunk and the body of the \keyword{let} may possibly call it twice, i.e.\ there is a self-loop in the co-call graph, then there is a risk of losing work when eta-expanding $e_1$, so we do not do that.
\item Otherwise, the call arity is the minimum number of arguments passed to $x$ by the code in $e_2$, as reported by $\aexp{e_2}\alpha$.
\end{enumerate}

Depending on this result we need to adjust the co-call information obtained from $e_1$. Again, there are two cases:
\begin{enumerate}
\item We can use the co-call graph from $e_1$ if $e_1$ is evaluated at most once. This is obviously the case if $x$ is called at most once in the first place. It is also the case if $e_1$ is (and stays!) a thunk, because its result will be shared and further calls to $x$ can be ignored here.
\item If $e_1$ may be evaluated multiple times we cannot get useful co-call information and therefore return the complete graph on everything that is possibly called by $e_1$.
\end{enumerate}

Finally we combine the results from the body and the right-hand side, and add the appropriate extra co-call edges. We can be more precise than in the application case because we can exclude variables that are not called together with $x$ from the complete bipartite graph.

Note again that we do not clutter the presentation here with removing the local variable from the final analysis results. The implementation removes $x$ from $A$ and $G$ before returning them.

\begin{figure}[t]
\abovedisplayskip=0pt
\belowdisplayskip=0pt
\begin{align*}
\alpha_x &=
\begin{cases}
0                    & \text{if } \edge x x \in \ccexp{e_2}{\alpha} \text { and } \neg \isVal{e_1} \\
\aexp{e_2}\alpha\,x & \text{otherwise} \\
\end{cases}\\
G_{\text{rhs}} &=
\begin{cases}
\ccexp{e_1}{\alpha_x} & \text{if } \edge x x \notin \ccexp{e_2}{\alpha} \text{ or $\alpha_x = 0$}\\
\fv{e_1}^2 & \text{otherwise}
\end{cases}\\
E &= \fv{e_1} \times \neighbors{x}{\ccexp{e_2}{\alpha}}\\
A &= \aexp{e_1}{\alpha_x} \sqcup \aexp{e_2}\alpha \\
G &= G_{\text{rhs}} \sqcup \ccexp{e_2}{\alpha} \sqcup E
\end{align*}
\vspace*{-\belowdisplayskip}
\vspace*{-\abovedisplayskip}
\begin{align*}
\aexp{\nonreclet}\alpha &= A  &  \ccexp{\nonreclet}{\alpha} &= G
\end{align*}
\caption{Equations for a non-recursive $\nonreclet$}
\label{fig:eqn2}
\end{figure}


\begin{example}
Consider the expression
\[
e = \sLet{z = (\sITE{x}{(\sLam y {x_2})}{x_3})}{\sLam \_ {(\sITE{x_1}{x_2}{\sApp z y})}}
\]
with an incoming arity of $1$. The co-call graph of the body is
\begin{center}
$\ccexp{\sLam \_(\sITE{x_1}{x_2}{z\ y})}1 = $
\begin{tikzpicture}[baseline={(0.base)}]
\node (0) {$x_1$};
\node at (1,0.3) (1) {$x_2$};
\node at (1,-0.3) (2) {$z$};
\node at (2,-0.3) (3) {$y$};
\draw (0) -- (1);
\draw (0) -- (2);
\draw (0) edge [bend left=5] (3);
\draw (2) edge (3);
\end{tikzpicture}
\end{center}
and $\aexp{\sLam \_(\sITE{x_1}{x_2}{z\ y})}{1}\,z = 1$. The right-hand side of $z$’s definition is a thunk, so we must be careful when eta-expanding it. But there is no self-loop at $z$ in the graph, so $z$ is called at most once. The call-arity of $z$ is thus $\alpha_z = 1$ and we analyse its right-hand side with an incoming arity of $1$ to obtain
\[
\ccexp{\sITE{x}{(\sLam y {x_2})}{x_3}}1 =
\begin{tikzpicture}[baseline={(0.base)}]
\node (0) {$x$};
\node at (1,0.3) (1) {$x_2$};
\node at (1,-0.3) (2) {$x_3$.};
\draw (0) -- (1);
\draw (0) -- (2);
\end{tikzpicture}
\]
The additional edges $E$ connect all free variables of the right-hand side ($\{x,x_2,x_3\}$) with everything called together with $z$ from the body ($\{x_1, y\}$) and the overall result (skipping the now out-of-scope $z$) is
\[
\ccexp{e}{1} = 
\begin{tikzpicture}[baseline={(0.base)}]
\node at (-180:0.5) (0) {$x$};
\node at (108:0.5) (1) {$x_2$};
\node at (-108:0.5) (2) {$x_3$};
\node at (36:0.5) (b0){$x_1$};
\node at (-36:0.5) (b2) {$y$.};
\draw (0) -- (1);
\draw (0) -- (2);
\draw (b0) -- (b2);
\draw (0) edge[bend right=5] (b0);
\draw (1) -- (b0);
\draw (2) -- (b0);
\draw (0) edge[bend left=5] (b2);
\draw (1) edge (b2);
\draw (2) edge (b2);
\end{tikzpicture}
\]
Note that although $x_2$ occurs in both the body and the right-hand side, there is no self-loop at $x_2$: The analysis has detected that $x_2$ is called at most once.

The results are very different if we analyse $e$ with an incoming arity of 0. The body is a lambda abstraction, so it may be called many times, and we have
\begin{center}
$\ccexp{\sLam \_(\sITE{x_1}{x_2}{z\ y})}0 = $
\begin{tikzpicture}[baseline={(0.base)}]
\node at (-180:0.5) (0) {$x_1$};
\node at (90:0.5) (1) {$x_2$};
\node at (-90:0.5) (2) {$z$};
\node at (0:0.5) (3) {$y$};
\draw (0) -- (1);
\draw (0) -- (2);
\draw (0) edge  (3);
\draw (1) edge (2);
\draw (2) edge (3);
\draw (1) edge (3);
\draw (0) edge [loop left] (0);
\draw (1) edge [loop right] (1);
\draw (2) edge [loop right] (2);
\draw (3) edge [loop right] (3);
\end{tikzpicture}.
\end{center}
This time there is a self-loop at $z$, and we need to set $\alpha_z = 0$ to be on the safe side. This also means that $z$ stays a thunk and we still get some useful information from the right-hand side:
\begin{center}
$\ccexp{\sITE{x}{(\sLam \_ {x_2})}{x_3}}0 = $
\begin{tikzpicture}[baseline={(0.base)}]
\node (0) {$x$};
\node at (1,0.3) (1) {$x_2$};
\node at (1,-0.3) (2) {$x_3$.};
\draw (0) -- (1);
\draw (0) -- (2);
\draw (1) edge [loop right] (1);
\end{tikzpicture}
\end{center}
Due to the lower incoming arity we can no longer rule out that  $x_2$ is called multiple times, as it is hidden inside a lambda abstraction.
The final graph now becomes quite large, because everything in the body is potentially called together with $z$:
\begin{center}
$\ccexp{e}{0} = $
\begin{tikzpicture}[baseline={(0.base)}]
\node at (0:0.5) (0) {$x_1$};
\node at (72:0.5) (1) {$x_2$};
\node at (-72:0.5) (3) {$y$};
\node at (144:0.5) (r0) {$x$};
\node at (216:0.5) (r1) {$x_3$};
\draw (0) -- (1);
\draw (0) -- (3);
\draw (1) -- (3);
\draw (0) edge [loop right] (0);
\draw (1) edge [loop right] (1);
\draw (3) edge [loop right] (3);
%
\draw (r0) -- (r1);
%
\draw (r0) -- (0);
\draw (r0) -- (1);
\draw (r0) -- (3);
%
\draw (r1) -- (0);
\draw (r1) -- (1);
\draw (r1) -- (3);
\end{tikzpicture}.
\end{center}
This is almost the complete graph, but it is still possible to derive that $x$ and $x_3$ are called at most once.
\end{example}

\subsubsection*{Case 6: Recursive let}
The final case is the most complicated. It is also the reason why the figures are labelled “Equations” and not “Definitions”: They are also mutually recursive and it is the task of the implementation to find a suitable solution strategy (see \cref{sec:fixpointing}).

The complication arises from the fact that the result of the analysis affects its parameters: If the right-hand side of a variable calls itself with a lower arity than the body, we need to use the lower arity as the call arity. Therefore, the final result ($A$ and $G$ in the equations) is also used to determine the basis for the call-arity and co-call information of the variables.

\begin{figure}[t]
\abovedisplayskip=0pt
\belowdisplayskip=0pt
\begin{align*}
A   &= \aexp{e}\alpha \sqcup \bigsqcup_i \aexp{e_1}{\alpha_{x_i}} \\
G   &= \ccexp{e}{\alpha} \sqcup \bigsqcup_i G^i \sqcup \bigsqcup_i E^i\\
\alpha_{x_i} &=
\begin{cases}
0      & \text{if } \neg \isVal{e_i} \\
A\,x_i & \text{otherwise}
\end{cases}\\
G^i &=
\begin{cases}
\ccexp{e_i}{\alpha_{x_i}} & \text{if } \edge {x_i}{x_i} \notin G \text{ or $\alpha_{x_i} = 0$}\\
\fv{e_i}^2 & \text{otherwise}
\end{cases}\\
E^i &=
\begin{cases}
\fv{e_i} \times N(\ccexp{e}{\alpha} \sqcup \bigsqcup_j G^j) & \text{if $\alpha_{x_i} \ne 0$} \\
\fv{e_i} \times N(\ccexp{e}{\alpha} \sqcup \bigsqcup_{j\ne i} G^j) & \text{if $\alpha_{x_i} = 0$}
\end{cases}\\
N(G) &= \{z \mid \edge z {x_i} \in G, i = 1\ldots\}
\end{align*}
\vspace*{-\belowdisplayskip}
\vspace*{-\abovedisplayskip}
\begin{align*}
\aexp{\letrec}\alpha &= A  &  \ccexp{\letrec}{\alpha} &= G
\end{align*}
\caption{Equations for a recursive $\letrec$}
\label{fig:eqn3}
\end{figure}

Thunks aside, we can think of one recursive binding $\sLet{x = e_1}{e_2}$ as an arbitrarily large number of nested non-recursive bindings
\label{infinite-unwrapping}
\begin{hcode}
\> !let $x$ = $e_1$ \\
\> !in \> !let $x$ = $e_1$ \\
\>     \> !in \> !let $x$ = $e_1$ \\
\>     \>     \> !in \> \phantom{!let } \> \\
\>     \>     \>     \> \rlap{\smash{$\ddots$}}       \> !let $x$ = $e_1$ \\
\>     \>     \>     \>       \> !in $e_2$.
\end{hcode}
The co-call information $G$ can be thought of the co-call information of this expression, and this is how $\edge{x_i}{x_i} \notin G$ has to interpreted: Not that there is at most one call to $x_i$ in the whole recursion (there probably are many, why else would there be a recursive \keyword{let}), but rather that when doing such an unrolling of the recursion, there is at most one call to $x_i$ leaving the scope of the outermost non-recursive \keyword{let}.

This analogy is flawed for thunks, where multiple nested non-recursive bindings would have a different sharing behaviour. Therefore, I set $\alpha_{x_i}=0$ for all thunks in a recursive \keyword{let}; this preserves sharing.

The formulas for the additional co-calls  $E^i$ are a bit more complicated than in the non-recursive case, and differ for thunks and non-thunks. Consider one execution that reaches a call to $x_i$. What other variables might have been called on the way? If the call came directly from the body $e$, then we need to consider everything that is adjacent to $x_i$ in $\ccexp{e}{\alpha}$. But it is also possible that the body has called some other $x_j$, $j\ne i$ and $e_j$ then has called $x_i$ -- in that case, we need to take those variables adjacent to $x_j$ in $\ccexp{e}{\alpha}$ and those adjacent to $x_i$ in $G^j$.

\pagebreak

In general, every call that can occur together with any recursive call in any of the expressions can occur together with whatever $x_i$ does.

For a thunk we can get slightly better information: A non-thunk $e_i$ can be evaluated multiple times during the recursion, so its free variables can be called together with variables on $e_i$'s own recursive path. A thunk, however, is evaluated at most once, even in a recursive group, so for the calculation of additional co-call edges it is sufficient to consider only the \emph{other} right-hand sides (and the body of the \hi{!let}, of course).


\begin{example} Consider the expression
\[
%e =
\begin{aligned}[t]
&\keyword{let}~
	\begin{aligned}[t]
		x_1 &= \sLam{y}{(\sITE{y_1}{x_2~y}{z_1})} \\
		x_2 &= \sLam{y}{(\sITE{y_2}{x_1~y}{z_2})}
	\end{aligned} \\
&\keyword{in}~\sLam{y}{x_1~y~y}
\end{aligned}
\]
with an incoming arity of 1. It is an example for a nice tail-call recursion as it is commonly produced by list fusion: The body has one call into the recursive group, and each function in the group also calls at most one of them.

The minimal solution to the equations in \cref{fig:eqn3} in this example is
\begin{align*}
\ccexp{e}{1} &= \{\} \\
\alpha_{x_1} = \alpha_{x_2} &= 2 \\
G^1 = \ccexp{e_1}{2} &= \{y_1\} \times \{x_2, z_1\}\\
G^2 = \ccexp{e_2}{2} &= \{y_2\} \times \{x_1, z_2\}\\
E^1 &= \{y_1, x_2, z_1\} \times \{y_1,y_2\}\\
E^2 &= \{y_2, x_1, z_2\} \times \{y_1,y_2\}
\end{align*}
and the final result is
\begin{center}
$G = $
\begin{tikzpicture}[baseline={(0.base)}]
\node[color=white] (0) {$x_1$};
\node at (-1,0.3)  (x1) {$x_1$};
\node at (-1,-0.3) (x2) {$x_2$};
\node at (0,0.6)   (y1) {$y_1$};
\node at (0,-0.6)  (y2) {$y_2$};
\node at (1,0.3)   (z1) {$z_1$};
\node at (1,-0.3)  (z2) {$z_2$};
\draw (x1) -- (y1);
\draw (x1) -- (y2);
\draw (x2) -- (y1);
\draw (x2) -- (y2);
\draw (y1) -- (z1);
\draw (y1) -- (z2);
\draw (y2) -- (z1);
\draw (y2) -- (z2);
\draw (y1) edge[loop right] (y1);
\draw (y2) edge[loop right] (y2);
\draw (y1) -- (y2);
\end{tikzpicture},
\end{center}
where we see that at most one of $z_1$ and $z_2$ is called by the recursive group, and neither of them twice.

% If the incoming arity is $0$ we have to assume multiple calls into the recursion, and get these results:
% \begin{align*}
% \ccexp{e}{0} &= \{x_1\}^2 \\
% \alpha_{x_1} = \alpha_{x_2} &= 2 \\
% G^1 &= \{y_1,x_2,z_1\}^2 \\
% G^2 &= \{y_2,x_1,z_2\}^2 \\
% E^1 &= \{y_1, x_2, z_1\} \times \{y_1,x_2,z_1,y_2,x_1,z_2\}\\
% E^2 &= \{y_2, x_1, z_2\} \times \{y_1,x_2,z_1,y_2,x_1,z_2\}\\
% \end{align*}
% and the final result is the complete graph on all variables, i.e.~$E=\{x_1,x_2,y_1,y_2,z_1,z_2\}^2$.

% For comparison, let one of the variables be (and remain) a thunk:
% \[
% e = \begin{aligned}[t]
% &|letrec|~ 
% 	\begin{aligned}[t]
% 		x_1 &= \sITE{y_1}{x_2~y}{z_1} \\
% 		x_2 &= \sLam{y}{(\sITE{y_2}{x_1~y}{z_2})}
% 	\end{aligned} \\
% &|in|~\sLam{y}{x_1~y~y}
% \end{aligned}
% \]
% This improves the result 
% 

In contrast consider this recursion which forks in $x_2$:
\[
%e =
\begin{aligned}[t]
&\keyword{let}~ 
	\begin{aligned}[t]
		x_1 &= \sLam{y}{(\sITE{y_1}{x_2~y}{z_1})} \\
		x_2 &= \sLam{y}{(\sITE{y_2}{x_1~(x_1~y~y)}{z_2})}
	\end{aligned} \\
&\keyword{in}~\sLam{y}{x_1~y~y.}
\end{aligned}
\]
We see that now $z_1$ and $z_2$ are possibly called together and multiple times. Indeed $\edge {x_1}{x_1} \in \ccexp{e_2}{2}$ causes $x_1 \in N(\ldots)$ in the equation for $E^i$, so especially $\edge {x_1}{x_1} \in E^2 \subseteq G$. Therefore, $G^1 = \fv{e_1}^2$ and we also have $\edge {x_2}{x_2}\in G$ and $G^2 = \fv{e_2}^2$. Eventually, we find that the result is the complete graph on all variables, i.e.~$E=\{x_1,x_2,y_1,y_2,z_1,z_2\}^2$, and in particular $\edge{z_1}{z_2} \in E$, as expected.
\end{example}

\section{The implementation}
\label{sec:implementation}

This section is of primary interest to those readers who want to understand the implementation of Call Arity in GHC. I explain some of the design choices and how the the code relates to the definitions in this thesis.

The Call Arity analysis is implemented in GHC as a separate Core-to-Core pass, where Core is GHC's typed intermediate language based on System $F_C$ (cf.\@ \cref{sec:ghc-core}). See \cref{sec:call-arity-impl} for the code of the analysis.

This pass does not actually do the eta-expansion; it merely annotates let-bound variables with their call arity. A subsequent pass of GHC's simplifier then performs the expansion, using the same code as for the regular, definition-based arity analysis, and immediately applies optimisations made possible by the eta-expansion. This separation of concerns keeps the Call Arity implementation concise and close to the formalisation presented here.

GHC Core obviously has more syntactical constructs than our toy lambda calculus, including literals, coercion values, casts, profiling annotations (“ticks”), type lambdas and type applications, but these are irrelevant for our purposes: For literals and coercion values Call Arity returns the bottom of the lattice; the others are transparent to the analysis. In particular type arguments are not counted towards the arity here, which coincides with the meaning of arity as returned by GHC's regular arity analysis.

I want the analysis to make one pass over the syntax tree (up to the iterative calculation of fixed points for recursive bindings, \cref{sec:fixpointing}). So instead of having two functions -- one for the arity analysis and one for the co-call analysis -- I defined one function \hi{callArityAnal} which returns a tuple \hi{(UnVarGraph, VarEnv Arity)}, where the \hi{UnVarGraph} is a data structure for undirected graphs on variable names (see \cref{sec:graph}) and \hi{VarEnv Arity} is a partial map from variable names to \hi{Arity}, which is a type synonym for \hi{Int}.

The equations refer to $\fv{e}$, the set of free variables of an expression. In the implementation, I do not use GHC's corresponding function \hi{exprFreeIds}, as this would require another traversal of the expression. Instead I use $\dom(\aexp{e}\alpha)$, which by construction happens to be the set of free variables of $e$, independent of $\alpha$, as at this stage in the compiler pipeline, “obviously” dead code has been removed.

In the sequence of Core-to-Core passes, I inserted Call Arity and its eta-expanding simplifier pass after the simplifier's phase 0, as that is when all the rewrite rules have been active \cite{rewriterules}, and before the strictness analyser. This way, the latter has a chance to unbox any new function parameters introduced by Call Arity, such as the accumulator in a call to \hi{sum}.

\subsection{Interesting variables}
\label{sec:interesting-variables}

The analysis as presented in the previous section would be too expensive if implemented as is. This can be observed when compiling GHC's \hi{DynFlags} module, which defines a record type with 157 elements. The Core code for a setter of one of the fields is
\begin{hcode}
\> setX\hspace{-1pt}42 \> x \=2 (DynFlags x1 {\ldots} x41 \> \_ \> x43 {\ldots} x157) \\
\> ~~ \> =  \=2 (DynFlags x1 {\ldots} x41 \> x \> x43 {\ldots} x157).
\end{hcode}
For the body of the function, the analysis would report that 157 variables are called with (at least) 0 arguments, and that all of them are co-called with every other, a graph with 12246 edges. And none of this information is useful: The variables come from function parameters or pattern matches and there is no definition that we can possibly eta-expand!

% Similarly consider thunk of non-function type, such as in
% \begin{code}
% let x = 24 + y in x * x
% \end{code}
% We'd analyse the body to find out that \hi{x} is called with (at least) 0 arguments. But \hi{x} has type \hi{Int}, so we already know that it won't be passed any arguments. We’d also find out that \hi{x} is called multiple times, but that information is also useless: In \cref{fig:eqn2} we know $\alpha_x = 0$ already as $e_1$ is not in head normal form, and this is sufficient to decide that $G_{\text{rhs}} = \ccexp{e_1}{\alpha_x}$.
% % What about $\edge v x \in G_{\text{body}}$ ? Ignore that? Heuristic?

Therefore, the code keeps track of the set of \emph{interesting variables}, and only returns information about them. Currently, interesting variables are all let-bound variables of function type, while function parameters and pattern match variables are not interesting.

Generally, considering fewer variables as interesting will trade precision for performance, but preserves soundness: It would be perfectly sound, for example, to consider the variables of a very large recursive group to be uninteresting.

The complete type signature of the analysis is therefore
\begin{hcode}
\>callArityAnal :: \>Arity → VarSet → CoreExpr → \\
\>                 \>((UnVarGraph, VarEnv Arity), CoreExpr)
\end{hcode}
where the arguments are
\begin{compactitem}
\item the incoming arity,
\item the set of interesting variables and
\item the expression to analyse
\end{compactitem}
and the return values are the co-call graph and arity information (both restricted to the set of interesting variables) and the expression with the Call Arity result annotation added.

\subsection{Finding the fixed points}
\label{sec:fixpointing}

The equations in the previous section specify the analysis, but do not provide an algorithm: In the case of a recursive \keyword{let} (\cref{fig:eqn3}), the equations are mutually recursive and the implementation has to employ a suitable strategy to find a solution.

The implementation finds the solution by iteratively approaching the fixpoint, using memorisation of intermediate results.
\begin{compactenum}
\item Initially, it sets $A = \aexp{e}\alpha$ and $G=\ccexp{e}{\alpha}$.
\item For every variable $x_i \in \dom A$ that has not been analysed before, or has been analysed before with different values for $\alpha_{x_i}$ or $\edge{x_i}{x_i}\in G$, it (re-)analyses it, remembering the parameters and memorising the result $\aexp{e_1}{\alpha_{x_i}}$ and $G^i$.
\item If any variable has been (re)analysed in this iteration, it recalculates $A$ and $G$ and repeats from step 2.
\end{compactenum}

This process will terminate, as shown by a simple standard argument: The variant that proves this consists of $\alpha_{x_i}$ and whether $\edge{x_i}{x_i} \in G$. The former starts at some natural number and decreases, the latter may start as not true, but once it is true, it stays true. Therefore, these parameters can change only a finite number of times, and the loop terminates once all of them are unchanged during one iteration. The monotonicity of the parameters follows from the monotonicity of the equations for $\aexp{}\alpha$ and $\ccexp{}\alpha$: We have that $\alpha \ge \alpha'$ implies $\aexp{e}\alpha \sqsubseteq \aexp{e}{\alpha'}$ and $\ccexp{e}{\alpha} \sqsubseteq \ccexp{e}{\alpha'}$.

\subsection{Top-level values}

GHC supports modular compilation. Therefore, for exported functions, the compiler does not have the call sites available to analyse. Nevertheless I do want it to be able to analyse and eta-expand at least non-exported top-level functions.

To solve this elegantly I treat a module
\begin{hcode}
\> !module Foo(foo) !where \\
\> bar  \> = {\ldots} \\
\> foo  \> = {\ldots}
\end{hcode}
as if it were a sequence of let-bindings
\begin{hcode}
\> !let bar  \> = {\ldots} !in\\
\> !let foo  \> = {\ldots} !in\\
\> e
\end{hcode}
where \hi{e} represents the external code for which I assume the worst: It calls all exported variables (\hi{foo} here) with 0 arguments and the co-call graph is the complete graph. This prevents unwanted expansion of \hi{foo}, but still allows us to eta-expand \hi{bar} based on how it is called by \hi{foo}.

Unfortunately, it also means that adding a top-level function to the export list of the module can prevent Call Arity from eta-expanding it and other functions in the module.
%If we export the local |showInt :: Int -> ([String] -> [String])| function in the module mentioned in \cref{sec:dlist}, we
If, for example, I export all difference-list producing functions in the code mentioned in \cref{sec:dlist}, then I lose the benefits from Call Arity. In this sense, Call Arity can be considered a whole program analysis that happens to be useful in a setting with separate compilation as well.


\subsection{The graph data structure}
\label{sec:graph}

The analysis often builds complete bipartite graphs and complete graphs between sets of variables. A usual graph representation like adjacency lists would be quadratic in size and too inefficient for this use.

Hence, the data type \hi{UnVarGraph} used in the implementation is specifically crafted for this purpose, see \cref{sec:co-call-graph-impl} for the code. It represents graphs symbolically, as multisets (“bags” in the lingua of GHC code) of complete bipartite and complete graphs:
\begin{hcode}
\> !data Gen\>[][c] = \> CBPG VarSet VarSet \\
\>          \>[][c] | \> CG  VarSet \\
\> !type UnVarGraph = Bag Gen
\end{hcode}
This allows for very quick, $O(1)$, creation and combination of graphs. The important query operation, calculating the set of neighbours of a node, is done by traversing the generating subgraphs.

One disadvantage of this data structure is that it does not normalise the representation. In particular, the union of a graph with itself is twice as large. I had to take that into account when I implemented the calculation of fixed points: It would be very inefficient to update $G$ by merging it with the new results in each iteration. Instead, $G$ is always reassembled from $\ccexp{e}{\alpha}$ and the -- new or memorised -- results from the bound expressions.

I experimented with simplifying the graph representation using identities like $S_1\times S_2 \cup S_1^2 \cup S_2^2 = (S_1\cup S_2)^2$, but it did not pay off, especially as deciding set equality can be expensive.


%Since all graphs constructed in the process are unions of such complete bipartite and complete graphs, we represent them as such (|Bag| a like an unordered list with $O(1)$ concatenation, i.e. a multiset):
%\begin{code}
%data Gen  = CBG  VarSet VarSet | CG  VarSet
%type UnVarGraph = Bag Gen
%\end{code}
%This representation allows the various graph-constructing operations to be fast and in size linear in the input:
%\begin{code}
%emptyUnVarGraph = []
%g1 `unionUnVarGraph` g2 = g1 ++ g2
%completeBipartiteGraph s1 s2 = [CPG s1 s2]
%completeGraph s = [CG s]
%\end{code}
%
%The important query operation that we need to support is to obtain the set of neighbors of a variable; this is calculated by traversing the list of generating subgraphs and collecting the neighbors there:
%\begin{code}
%neighbors x g = unionVarSets (map (neighborsGen x) g)
%neighborsGen x (CG s) =
%  if x `elem` S then s else emptyVarSet
%neighborsGen x (CBG s1 s2) =
%  case (x `elem` s1, x `elem` s2) of
%    (True,   True)   = s1 `unionVarSet` s2
%    (False,  True)   = s1
%    (True,   False)  = s2
%    (False,  False)  = emptyVarSet
%\end{code}
%
%
%One disadvantage of this data structure is that it does not normalize the representation. In particular, the graph \hi{g `unionUnVarGraph` g} is semantically identical to \hi{g}, but twice as large. This had to be taken into account when implementing the calculation of the fixedpoints: It would be very inefficient to update $G$ by unioning it with the new results in each iteration. Instead, $G$ is recalculated from $\ccexp{e}{\alpha}$ and the -- new or memorized -- results from the bound expressions.
%
%We experimented with ideas like simplifying a graph with representation \hi{[CBG s1 s2, CG s1, CG s2]} to the equivalent graph \hi{[CG (s1 `unionVarSet` s2)]}, but these checks were not worth it, especially as deciding set equality can be expensive.


\section{Discussion}
\label{sec:call-arity-discussion}

\subsection{Call Arity and list fusion}
\label{sec:list-fusion}

As hinted at in the introduction, I devised Call Arity mainly to allow for a fusing \hi{foldl}, i.e.\@ a definition of \hi{foldl} in terms of \hi{foldr} that takes part in list fusion while still producing good code. How exactly does Call Arity help here?

Consider the code \hi{sum (filter f [42..2016])}.
Previously, only \hi{filter} would fuse with the list comprehension, eliminating one intermediate list, but the call to \hi{sum}, being a left-fold, would remain:
Compiled with previous versions of GHC, this produces code roughly equivalent to
\begin{hcode}
\> !let go = λx → \> !let r = \> !if x == 2016 \\
\>                \>          \> !then \> \relax [] \\
\>                \>          \> !else \> go (x + 1)\\
\>                \> !in \=x !if f x !then x : r !else r \\
\> !in foldl (+) 0 (go 42).
\end{hcode}

If we changed the definition of \hi{foldl} to use \hi{foldr}, as in
\begin{hcode}
\> foldl k z xs = foldr (λv fn z → fn (k z v)) id xs z.
\end{hcode}
all lists are completely fused and we obtain the code
\begin{hcode}
\> !let go = λx → \> !let r = \> !if x == 2016 \\
\>                \>          \> !then  id \\
\>                \>          \> !else  go (x + 1) \\
\>                \> !in \=2 !if f x \\
\>                \>     \=2 !then λa → r (a + x) \\
\>                \>     \=2 !else  r \\
\> !in go 42 0.
\end{hcode}
Without Call Arity, this was the final code, and as such quite inefficient:
The recursive loop \hi{go} has become a function that takes one argument, then allocates a function closure for \hi{r} on the heap, and finally returns another heap-allocated function closure which will pop the next argument from the stack -- not the fastest way to evaluate this simple program.

With Call Arity the compiler detects that \hi{go} and \hi{r} can both safely be eta-expanded with another argument, yielding the code
\begin{hcode}
\> !let go = λ x a → \> !let r = λa → \> !if x == 2016 \\
\>                   \>               \> !then a \\
\>                   \>               \> !else  go (x + 1) a \\
\>                   \> !in \=2 !if f x \\
\>                   \>     \=2 !then r (a + x)\\
\>                   \>     \=2 !else r a \\
\> !in go 42 0
\end{hcode}
where the number of arguments passed matches the number of lambdas that are manifest on the outer level. This avoids allocations of function closures and allows the runtime to do fast calls \cite{fastcurry}, or even tail-recursive jumps.

\subsection{Limitations}
\label{sec:weakness}

A particularly tricky case is list fusion with generators with multiple (or non-linear) recursion. This arises when flattening a tree to a list. Consider the code
\begin{hcode}
\> !data Tree = Tip Int | Bin Tree Tree \\
\> \\
\> toList !:: Tree → [Int] \\
\> toList tree = build (toListFB tree) \\
\> \\
\> toListFB root cons nil = go root nil \\
\> \hwhere \\
\> \hind \> go \> (Tip x)   \> rest = \> cons x rest \\
\>       \> go \> (Bin l r) \> rest = \> go l (go r rest)
\end{hcode}
which is a good producer; for example \hi{filter f (toList t)} is compiled to
\begin{hcode}
\> !let go = λt rest → !case t !of \\
\> \hind\hind \> Tip x   \> → !if f x !then x : rest !else rest \\
\>            \> Bin l r \> → go l (go r rest) \\
\> !in go t [].
\end{hcode}

If we add a left-fold to the pipeline, i.e. \hi{foldl (+) 0 (filter f (toList t))}, where the \hi{foldl} is implemented via \hi{foldr}, the resulting code (before Call Arity) is
\begin{hcode}
\> !let go = λt fn → !case t !of \\
\> \hind\hind \> Tip x    \> → !if f x !then (λa → fn (x + a)) !else fn \\
\>            \> Bin l r  \> → go l (go r fn) \\
\> !in go t id 0.
\end{hcode}
Although \hi{go} is always being called with three arguments, my analysis does not see this. For that it would have to detect that \hi{go} calls its second parameter with one argument; as it is a forward analysis (in the nomenclature of \cite{arityanalysis}) it cannot do that.

And even if GHC could eta-expand it (in fact it can, due to the one-shot annotation discussed in \cref{sec:one-shot}), the result would not be much better: For the recursion, the runtime still needs to create a function closure for the unsaturated call \hi{go r fn}, which is then called slowly by \hi{go}, as explained in \cref{sec:stg}.

Things look better if we adjust the definition of \hi{toList} so that the worker is tail-recursive. This requires an explicit stack, keeping track of the branches of the tree that are yet to be visited:
\begin{hcode}
\> toListFB root cons nil = go root nil [] \\
\> \hwhere \\
\> \hind \> go \> (Tip x)   \> s = \> cons x (goS s) \\
\>       \> go \> (Bin l r) \> s = \> go l (r:s) \\
\>       \> goS \=x {}[]    \>   = nil\\
\>       \> goS \=x (x:xs)  \>   = go x xs
\end{hcode}
Now the resulting code is a nice tail-recursive loop, and it even allows GHC to unbox the accumulator, which usually provides a large performance benefit.


But the code that we would really want to see, and which we would write by hand, is
\begin{hcode}
\> !let go = λt a → !case t !of\\
\> \hind\hind \> Tip x   \> → !if f x !then a + x !else a \\
\>            \> Bin l r \> → go l (go r a) \\
\> !in go t 0
\end{hcode}
with no continuations or explicit stack whatsoever and just the accumulator (unboxed by GHC) is being passed through the recursion. Such a transformation would require much more involved changes to the code than just eta-expansion followed by simplification, and is out of scope for Call Arity.

%The wor\-ker-wrap\-per extension to list fusion proposed by Takano is able to solve this, but has other shortcomings (\cref{sec:takano}). % problem is able to produce this good code using just rewrite rules and without the help of special compiler optimizations.

We (the GHC developers) still decided to let \hi{foldl} take part in list fusion based on the benchmark results, presented in the next section. They indicate that the real-world benefits in the common case of linear recursion are larger than the penalty in the non-linear recursion, and if necessary, the producer can be adjusted to be linearly recursive. 

\subsection{Measurements}
\label{sec:measurements}

No work on compiler optimisations without some benchmark results! I compare four variants, all based on the GHC 7.10.3 codebase (revision \hi{97e7c29}):
\begin{enumerate}[(A)]
\item For the baseline, I removed the Call Arity analysis code and undid the changes to the library code, i.e. reverted \hi{foldl} to its naive, non-fusing definition.
\label{perf-base}
\item To measure the effect of Call Arity analysis alone I enable it again, but left \hi{foldl} with the naive definition.
\label{perf-call-arity}
\item The current, unmodified state of the compiler, with Call Arity enabled and \hi{foldl} implemented via \hi{foldr}, is the most relevant variant; in the table this column is highlighted.
\label{perf-call-arity-foldr}
\item To assess the importance of Call Arity for allowing \hi{foldl} to take part in list fusion, I also measure GHC without Call Arity, but with \hi{foldl} implemented via \hi{foldr}.
\label{perf-foldr}
% \item To assess the importance of the co-call component of the analysis, I measure how well an arity analysis without it, would have fared. Such an analysis does not eta-expand any thunks, corresponding to what is described in \cite{gill}.
\end{enumerate}


\subsubsection{Setup}

The ubiquitous benchmark suite for Haskell is nofib \cite{nofib}, a set of 100 example Haskell programs, ranging from small micro-benchmarks to “real” applications. Most benchmarks support different modes to make them run longer. My numbers all result from the “slow” mode.

The measurements are taken on an 8-core Intel i7-3770 machine with 16\,GB of RAM running Ubuntu 14.04 on Linux 3.13.0.

Initially, I attempted to use the actual run time measurements, but it turned out to be a mostly pointless endeavour. For example the \texttt{knights} benchmark would become 9\% \emph{slower} when enabling Call Arity (i.e. when comparing (\ref{perf-base}) to (\ref{perf-call-arity})), a completely unexpected result, given that the changes to the GHC Core code were reasonable. Further investigation using performance data obtained from the CPU indicated that with the changed code, the CPU’s instruction decoder was idling for more cycles, hinting at cache effects and/or bad program layout.

Indeed: When I compiled the code with the compiler flag \texttt{-g}, which includes debugging information in the resulting binary, but should otherwise not affect the relative performance characteristics much, the unexpected difference vanished. I conclude that non-local changes to the Haskell or Core code will change the layout of the generated program code in unpredictable ways and render such run time measurements mostly meaningless.

This conclusion has been drawn before \cite{wrong-data}, and recently, tools to mitigate this effect, e.g.\@ by randomising the code layout \cite{stabilizer}, were created. Unfortunately, these currently target specific C compilers, so I could not use them here.

In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed. This way, the variability in execution time due to code layout does not affect the results.
To obtain the instruction counts I employ valgrind \cite{valgrind}, which runs the benchmarks on a virtual CPU and thus produces more reliable and reproducible measurements.


\newcommand{\andmore}[1]{\totalname{\textit{\dots and #1 more}}}
\newcommand{\progname}[1]{\multicolumn{1}{|l|}{\texttt{#1}}}
\newcommand{\totalname}[1]{\multicolumn{1}{|l|}{#1}}

\newcommand{\columnname}[1]{\hfill(#1)\hfill\nobreak}
\newcommand{\yes}{\hfill\checkmark\hfill\nobreak}

\makeatletter
\def\expandableinput#1{\@@input #1 }
\makeatother
%\afterpage{%
%\begin{landscape}
\begin{table}[p]%
\caption{Nofib results, relative to (A)}
\label{tab:nofib}
\centering
\small
\begin{tabular}{l|r>{\bf}rr|r>{\bf}rr|}
\cline{2-7}
\bigstrut                         & \multicolumn{3}{c|}{Bytes allocated} & \multicolumn{3}{c|}{Instructions executed}\\
\hline
\totalname{\bigstrut[t]}
& \columnname B  & \columnname C  & \columnname D
& \columnname B  & \columnname C  & \columnname D \\
\totalname{Arity Analysis}
& \yes & \yes &
& \yes & \yes & \\
\totalname{Co-call Analysis}
& \yes & \yes &
& \yes & \yes & \\
\totalname{\hi{foldl} via \hi{foldr}}
&            & \yes & \yes 
&            & \yes & \yes \\
\hline
\hline
\expandableinput{benchmarks/call-arity-nofib-table.tex}
\hline
\end{tabular}
\end{table}
%\end{landscape}
%}

\subsubsection{Results}

My results are shown in \cref{tab:nofib}. The three columns correspond to the variants (\ref{perf-call-arity}), (\ref{perf-call-arity-foldr}) and (\ref{perf-foldr}) described above, and the given percentages are changes relative to (\ref{perf-base}). Negative numbers indicate improvement. I list those benchmarks where there a difference of 1\% or more is observed.

As expected, enabling Call Arity does not increase the number of dynamic allocations; if it did, something would be wrong -- see \cref{ch:provingcallarity} for a formal proof of that statement.

On its own, the analysis rarely has an effect: Programmers tend to give their functions the right arities in the first place, and this includes the code in nofib.
\goodbreak

Some of the improvements, e.g.\@ in \texttt{fibheaps}, can be attributed to the use of \hi{take}: Even before the introduction of Call Arity, it was set up to be a good consumer, producing similar higher-order code as \hi{foldl} would. Call Arity can successfully optimise that.

But the real strength of Call Arity can only be seen in combination with making \hi{foldl} a good consumer: Allocation improves considerably and without the analysis, this change to \hi{foldl} would actually degrade the run time performance.

The last column (\ref{perf-foldr}) shows that without Call Arity, making \hi{foldl} a good consumer is a bad idea, and that in some cases, the number of allocations and instructions go through the roof.

\subsubsection{Difference lists}

\label{sec:dlist}%
Another setting, besides list fusion, where non-expanded function definitions may emerge is when function types are hidden behind a type abstraction, and combined using abstract combinators. A good example for this is the type of \emph{difference lists}, which represent lists as functions of type \hi{[a] → [a]}. Module \hi{DList} in \cref{fig:dlist} contains a standard implementation of difference lists. Note that all combinators are eta-reduced: The argument providing the tail of the list is omitted.

\begin{figure}
\begin{hcode}
\> !module DList !where \\
\> \hind \=m !newtype DList a = DL ([a] → [a]) \\
\=m \\
\=m fromDList :: DList a → [a] \\
\=m fromDList (DL f) = f [] \\
\=m \\
\=m singleton :: a → DList a \\
\=m singleton c = DL (c:) \\
\=m \\
\=m empty :: DList a \\
\=m empty = DL id \\
\=m \\
\=m (<>) :: DList a → DList a → DList a \\
\=m DL f <> DL g = DL (f . g) \\
\=m \\
\> !module Bench (doIt) !where \\
\=m !import DList \\
\=m !import Data.Char (intToDigit) \\
\=m \\
\=m showInt :: Int → DList Char \\
\=m showInt n \=2 | n < 10    \>= singleton (intToDigit n) \\
\=m           \=2 | otherwise \>= showInt (n  `div` 10) <> showInt (n `mod` 10) \\
\=m \\
\=m go :: [Int] -> DList Char \\
\=m go []     \> = empty \\
\=m go (x:xs) \> =  showInt x <> singleton ' ' <> go xs\\
\=m \\
\=m doIt :: [Int] → String \\
\=m doIt xs = fromDList (go xs)
\end{hcode}
\caption{Difference lists}
\label{fig:dlist}
\end{figure}

As a micro-benchmark, the code in module \hi{Bench} in \cref{fig:dlist} converts a list of non-negative integers into their decimal representation, space separated.

\begin{table}
\caption{Difference list speedup}
\label{tab:dlist}
\centering
\begin{tabular}{l|rrrr|}
\cline{2-5}
&\multicolumn{4}{c|}{Running time} \\
\hline
\multicolumn{1}{|l|}{Call Arity}            &       & \yes  & \yes  & \yes \\
\multicolumn{1}{|l|}{\hi{showInt} exported} &       &       & \yes  & \yes \\
\multicolumn{1}{|l|}{\hi{go} exported}      &       &       &       & \yes \\
\hline
\hline
\expandableinput{code/dlist/dlist-table.tex}
\hline
\end{tabular}
\end{table}

\Cref{tab:dlist} lists the execution time of applying \hi{doIt} to a list of 1,000,000 integers, measured using \texttt{criterion} \cite{criterion}. We can see that without the help of Call Arity, the code is actually 16\% slower than the equivalent code using \hi{String} and string concatenation naively. With Call Arity enabled, the difference list code runs twice as fast, beating the \hi{String} code by 35\%.

The benefits of Call Arity are less pronounced if some of the involved functions are exported. In that case, the compiler has to make conservative assumptions about how often the function is called and Call Arity cannot eta-expand it. If \hi{showInt} or \hi{go} is added to the export list of module \hi{Bench} the performance advantage compared to the String version disappears in the noise.


\subsection{Compiler performance}

\begin{table}[b]%
\caption{Compiling nofib and GHC, relative to (A)}
\label{tab:compiletimes}
\centering
\small
\begin{tabular}{l|r>{\bf}rr|r>{\bf}rr|}
\cline{2-7}
\bigstrut                         & \multicolumn{3}{c|}{Bytes allocated} & \multicolumn{3}{c|}{Compile time}\\
\hline
\totalname{\bigstrut[t]}
& \columnname B  & \columnname C  & \columnname D
& \columnname B  & \columnname C  & \columnname D \\
\totalname{Arity Analysis}
& \yes & \yes &
& \yes & \yes & \\
\totalname{Co-call Analysis}
& \yes & \yes &
& \yes & \yes & \\
\totalname{\hi{foldl} via \hi{foldr}}
&            & \yes & \yes 
&            & \yes & \yes \\
\hline
\hline
\expandableinput{benchmarks/call-arity-nofib-comp-table.tex}
\expandableinput{benchmarks/call-arity-ghc-comp-table.tex}
\hline
\end{tabular}
\end{table}

Call Arity could affect the compile times in two ways: It increases them, because the compiler does more work. But it also reduces them, as the compiler itself has been optimised more. \Cref{tab:compiletimes} shows the change in allocations done and time spent by the compiler while compiling the nofib test suite, as well as while compiling GHC itself.

In the nofib row we can see that the latter is indeed happening -- enabling Call Arity reduces the number of allocations performed by the compiler, despite it doing more work -- but this does not incur a significant change in compiler run time.
The change to \hi{foldl} alone increases compile times slightly.

In the second row, there is a third factor: The code base itself increases, due to the addition of the Call Arity code itself, which is reflected by the benchmark results.

Note that the benchmark suite is \emph{not} designed to produce stable measurements of compile time, e.g.\@ the compiler is run only once, so the significance of these numbers should not be taken too serious.

Ağacan measured the contribution of individual compiler passes to the overall compilation time, by compiling selected, widely used Haskell libraries and their dependencies. He reported his findings on the GHC developer’s mailing list\footnote{March 29, 2016; \url{https://mail.haskell.org/pipermail/ghc-devs/2016-March/011651.html}}, and found that Call Arity is responsible for 1.1\% of the compilation time.


\section{Related work}

Andrew Gill mentions in his thesis on list fusion \cite{gill} that eta-expansion is required to make \hi{foldl} a good consumer that produces good code, and outlines a simple arity analysis.
It does not discuss thunks at all and is equivalent to the second refinement in \cref{sec:why}.

\subsection{GHC's arity analyses}

The compiler already comes with an arity analysis, which works complementary to Call Arity: It ignores how functions are being used and takes their definition into account. It traverses the syntax tree and for each expression returns its arity, i.e.\ the number of arguments the expression can be applied to before doing any real work. This allows the transformation to turn $\sITE{x}{(\sLam{y}{e_1})}{(\sLam{y}{e_2})}$ into $\sLam{y}{(\sITE x {e_1}{e_2})}$ on the grounds that the check whether $x$ is true or false is a negligible amount of work, and it is therefore better to eta-expand the expression, even if this means that the check is done repeatedly. But this is just a heuristics, and can lead to unwanted performance losses, e.g.\@ if the scrutinee does a deep pattern match.\footnote{See for example \url{http://ghc.haskell.org/trac/ghc/ticket/11029}.} Therefore, Call Arity would refrain from doing this unless it knows for sure that the expression is going to be called at most once.

This arity analyser can make use of one-shot annotations on lambda binders. Such an annotation indicates that the lambda will be called at most once, which allows the analysis to derive greater arities and expand thunks: If the lambdas in $\sITE{(f~x)}{(\sLam{y}{e_1})}{(\sLam{y}{e_2})}$ are annotated as one-shot, this would be expanded to $\sLam{y}{(\sITE {(f~x)} {e_1}{e_2})}$.

The working notes in \cite{arityanalysis} describe this analysis as the \emph{forward arity analysis}. Like Call Arity, it can only determine arities of let-bound expressions and will not make any use of arity information on parameters.
\goodbreak

Consider, for example,
\begin{hcode}
\> !let \> g   \> = \ldots \\
\>      \> s f \> = f 3 \\
\> !in \ldots (s g) \ldots
\end{hcode}
where we would have a chance to find out that \hi{g} is always called with at least one argument.

A \emph{backward arity analysis} capable of doing this is also described in \cite{arityanalysis}. This analysis calculates the \emph{arity transformer} of a function \hi{f}: A mapping from the number of arguments \hi{f} is called with to the number of arguments passed to \hi{f}’s parameters. It is not implemented in GHC as such, but subsumed by the combined strictness/demand/cardinality analyser: The function \hi{s} would have a strictness signature of \texttt{<C(S),1*C1(U)>}. The strictness information on the left  indicates that \hi{s} is strict in its first argument, and also in the value returned by calling its first argument as a function; the usage information on the right indicates that evaluating \hi{s f} will evaluate \hi{f} at most once, call it at most once, and the result of that call will be used. The latest description of this analyser can be found in \cite{demandanalysis}.

Neither of these two analyses is capable of transforming the bad code from \cref{fig:tricky} into the desired form: The former has to give up as the expression \hi{f a} might be expensive; the latter looks at the definition of \hi{goB} before analysing the body and is therefore unable to make use of the fact that \hi{goB} is always called with two arguments.

\subsection{Higher order sharing analyses}
\label{sec:sophisticated}

The role of the co-call analysis in this setting is to provide a simple form of sharing analysis (using the nomenclature of \cite{hage}), which is required to safely eta-expand thunks. Such analyses have been under investigation for a long time, e.g.\@ to avoid the updating of thunks that are used at most once, or to enforce uniqueness constraints. These systems often support a higher-order analysis in some way, e.g.\@ using detailed usage types \cite{demandanalysis}, possibly with polyvariance \cite{hage}.

It would be desirable to have such expressive usage types available in our analysis, and I do not foresee a problem in using them. It will, however, be hard to obtain them: The co-call analysis does not just analyse the code as it is, but rather anticipates its shape after eta-expansion based on the Call Arity result. So in order to determine a precise higher-order demand type for a function \hi{f}, we need to know its Call Arity. For that we need to analyse the scope of \hi{f} for how it is used, which is where we want to make use of the higher-order information on \hi{f}. Going this route would require a fixed-point iteration for every binding, which is prohibitively expensive.

This is also why integrating Call Arity directly into GHC’s existing demand analyser \cite{demandanalysis}, which analyses function bodies before their uses, would be difficult.


Another noteworthy difference to the cited analyses is that these either skip the discussion of recursive bindings, or treat them too imprecisely to handle code resulting from list fusion. It would be interesting to see if the concept of a co-call graph could be used in a stand-alone backward sharing analysis to improve precision in the presence of recursion.


\subsection{Explicit one-shot annotation}
\label{sec:one-shot}

%\afterpage{%
%\begin{landscape}
\begin{table}[p]%
\caption{Measuring the effect of one-shot annotations}
\label{tab:oneshot}
\centering
\small
\begin{tabular}{l|rr>{\bf}r|rr>{\bf}r|}
\cline{2-7}
\bigstrut                         & \multicolumn{3}{c|}{Bytes allocated} & \multicolumn{3}{c|}{Instructions executed}\\
\hline
\totalname{\bigstrut[t]Call Arity}
&      & \yes & \yes
&      & \yes & \yes \\
\totalname{\hi{oneShot}}
& \yes &      & \yes
& \yes &      & \yes \\
\hline
\hline
\expandableinput{benchmarks/oneshot-table.tex}
\hline
\end{tabular}
\end{table}
%\end{landscape}
%}

While I was pondering the issue of a well-fusing \hi{foldl}, I was pursuing also another way of solving the problem, besides Call Arity:

For that I created a magic, built-in function
\begin{hcode}
\> oneShot :: (a → b) → a → b.
\end{hcode}
It is semantically the identity, but the compiler may assume that the function \hi{oneShot f} is called at most once. I can use this function when implementing \hi{foldl} in terms of \hi{foldr}:
\begin{hcode}
\> foldl k z xs = foldr (λ v fn → oneShot (λz → fn (k z v))) id xs z
\end{hcode}
This solves our problem with the bad code generated for \hi{sum (filter f [42..2016])} from \cref{sec:list-fusion}:  The compiler sees
\begin{hcode}
\> !let go = λx → \> !let r = !if x == 2016 !then  id !else go (x + 1) \\
\>                \> !in !if f x !then oneShot (λa → r (a + x)) !else  r \\
\> !in go 42 0
\end{hcode}
and, because the \hi{λa} is marked as \hi{oneShot}, the existing arity analysis will happily eta-expand \hi{go}.

Note that \hi{oneShot} is unchecked: The programmer or library author has the full responsibility to ensure that the function is really applied only once. This is given in the case of \hi{foldl}, as we know the definition of \hi{foldr} and that it applies its argument at most once for each element of the list.

The GHC developers initially decided to go the Call Arity route because it turned out to be no less powerful than the explicit annotation, has the potential to optimise existing user code as well, and ensures the correctness of the transformation.

Later, the developers decided that it does not hurt to simply employ both approaches in GHC. The nofib benchmark suite does not exhibit such a case, but it is quite possible that there are instances out there, maybe similar to the tree example in \cref{sec:weakness}, where Call Arity fails and the \hi{oneShot} annotation might save the day.

\Cref{tab:oneshot} compares the performance of
\begin{compactitem}
\item using only \hi{oneShot},
\item using only Call Arity and
\item using both, as it is the case in the released version of GHC
\end{compactitem}
against the baseline of GHC 7.10.3 without \hi{oneShot} and Call Arity, but with \hi{foldl} implemented as \hi{foldr} (i.e.\@ variant (\ref{perf-foldr}) in \cref{sec:measurements}). It shows that in most cases, \hi{oneShot} and Call Arity yield the same performance gains, and only a few benchmarks (e.g.\@ \texttt{fibheaps}, \texttt{scs}) show that Call Arity is a bit more powerful. 

\subsection{unfoldr/destroy and stream fusion}

There are various contenders to \hi{foldr}/\hi{build}-based list fusion, such as \hi{unfoldr}/\hi{destroy} \cite{unfoldr} and stream fusion \cite{streamfusion}.  They have no problem fusing \hi{foldl}, but have their own shortcomings, such as difficulties fusing \hi{unzip}, \hi{filter} and/or \hi{concatMap}; a thorough comparison is contained in \cite{practicalfusion}. After two decades, this is still an area of active research \cite{hermit}.

These systems are in practical use in array libraries like \texttt{bytestring} and \texttt{vector}. For the typical uses of lists they were inferior to \hi{foldr}/\hi{build}-based fusion, and hence the latter is used for the standard Haskell list type. Given the recent advances on both fronts, a reevaluation of this choice is due.

\subsection{Worker-wrapper list fusion}
\label{sec:takano}

On the GHC mailing list, Takano suggested an extension to \hi{foldr}/\hi{build}-based list fusion that will generate good code for left folds directly \cite{ww-fusion}. The idea is that the consumer not only specifies what the generator should use instead of the list constructors \hi{(:)} and \hi{[]}, but also a pair of worker-wrapper functions.

Slightly simplified, he extends \hi{foldr} to \hi{foldrW}. This function takes two additional arguments, here called \hi{wrap} and \hi{unwrap}, which can be used by the consumer to specify the actual type of the recursion.
\begin{hcode}
\> foldrW  \=x[@{}c] :: \> (!forall e. f e → (e → b → b)) \\
\>         \=x[@{}c] →  \> (!forall e. (e → b → b) → f e) \\
\>         \=x[@{}c] →  \> (a → b → b) → b → [a] → b \\
\> foldrW wrap unwrap f z0 list0 = wrap go list0 z0 \\
\> \hwhere \\
\> \hind \> go = unwrap \$ λ list z' → !case list !of \> {}[] \> → z' \\
\>       \>                                           \> x:xs \> → f x (wrap go xs z')
\end{hcode}

Conversely, he extends \hi{build} to \hi{buildW}: Besides passing the list constructors, this also passes a wrapper that does not actually change the type:

\begin{hcode}
\> !newtype Simple b e = Simple \{ runSimple :: e → b → b \} \\
\>  \\
\> buildW \=x[@{}c@{}] :: \> (!forall b f .\> (!forall e. f e → (e → b → b)) \\
\>        \=x[@{}c@{}]    \>[][r]       → \> (!forall e. (e → b → b) → f e) \\
\>        \=x[@{}c@{}]    \>[][r]       → \> (a → b → b) → b → b)\\
\>        \=x[@{}c@{}] →  \>{} [a] \\
\> buildW g = g runSimple Simple (:) []
\end{hcode}

This way, he can specify a fusion rule similar to the \hi{foldr/build} rule:
\begin{hcode}
\> \{-\# !RULES \\
\> \hind \> "foldrW/buildW" !forall wrap unwrap f z g. \\
\>       \> \hind \> foldrW wrap unwrap f z (buildW g) = g wrap unwrap f z \\
\>  \> \#-\}
\end{hcode}


Now every list consuming function that wants to benefit from this system needs to specify a custom pair of wrapper and unwrapper functions. For example, his definition of \hi{foldl} in terms of the extended \hi{foldrW} becomes
\begin{hcode}
\> foldl !:: !forall a b. (b → a → b) → b → [a] → b \\
\> foldl f z = λxs → foldrW wrap unwrap g id xs z \\
\> \hwhere \\
\> \hind \> wrap :: !forall e. Simple b e → (e → (b → b) → (b → b))\\
\>       \> wrap s e k a = k (s e a) \\
\>       \> unwrap :: !forall e. (e → (b → b) → (b → b)) → Simple b e\\
\>       \> unwrap u = λe a → u e id a \\
\>       \> g x next acc = next (f acc x).
\end{hcode}

Conversely, list producing functions should be defined in terms of \hi{buildW}, and making sure that the wrappers are used to shape the recursion:
%eftFB :: Int -> Int -> (Wrap f r) -> (Int -> r -> r) -> r -> r
\begin{hcode}
\> \relax [from..to] = buildW (eftFB from to)\\
\> eftFB from to wrap unwrap c n = wrap go from n\\
\> \hwhere \\
\> \hind \> go = unwrap \$ λi rest → \> !if i <= to\\
\>       \>                          \> !then \> c i (wrap go (i + 1) rest)\\
\>       \>                          \> !else \> rest.
\end{hcode}

This proposal initially looked promising: It handles the case of fusing \hi{foldl} well, and appears to be more powerful in tricky cases like fusing \hi{foldl} with a list produced by \hi{treeToList} (see \cref{sec:weakness}).

Nevertheless, a thorough evaluation\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/9545}} by David Feuer and Dan Doel revealed that the system is rather unsafe: The above fusion rule is not universally correct, and with certain combinations of producers and consumers, this can yield wrong results. A way forward would be to identify sufficient conditions about the arguments to \hi{foldrW} resp.\@ \hi{buildW} that guarantee that fusion is safe, but so far, such conditions have not been found. Given these problems, the GHC developers decided to not pursue this approach any further for now.

\subsection{Control flow based analyses}

The Call Arity analysis uses domain theory to describe its result, and iteratively finds a fixed point in the case of recursive bindings. This suggests connections to the field of data flow analysis, where analysis results are commonly calculated on a control-flow graph representation of the program. It is not obvious how to represent a Core program as such a graph, and although there are approaches to control-flow analysis of functional programs (see \cite{functional-cfg} for a recent survey), they are not used in the Haskell compiler.

GHC does employ data flow analysis based transformations, but at a much later phase and lower level, namely in the code generator \cite{hoopl}. We do not want Call Arity to happen that late in the pipeline, as some Core-to-Core transformations benefit from the effect of Call Arity, e.g.\@ by unboxing the accumulator of \hi{sum} specialised to \hi{Int}.

\section{Future work}

As usual, there is always room for improvement, both in the analysis itself and in how it is used.

\subsection{Improvements to the analysis}

Call Arity does not fully exploit the behaviour of thunks in mutual recursion. Consider this example:
\begin{hcode}
\> !let \> go x \> = !if x > 10   \> !then x               \> !else go (t1 x) \\
\>      \> t1   \> = !if f (t2 a) \> !then (λy → go (y+1)) \> !else (λy → go (y+2)) \\
\>      \> t2   \> = !if f b      \> !then (λy → go (y+1)) \> !else (λy → go (y+2)) \\
\> !in go 1 2
\end{hcode}
Currently, Call Arity will refrain from eta-expanding \hi{t1} and \hi{t2}, as they are part of a recursive binding. But \hi{t2} is in fact called at most once! All calls to \hi{t2} are done by \hi{t1}, and \hi{t1}'s result will be shared.

It remains to be seen if such situations occur in the wild and whether the benefits justify the implementation overhead.

\subsection{Tighter integration into GHC}

As explained in \cref{sec:sophisticated}, Call Arity cannot be directly merged into GHC’s existing demand analyser \cite{demandanalysis}, as they need to process let-bindings in a different order.

There is, however, a potential for better cooperation of Call Arity with the existing analyses and transformations in both directions:

Call Arity could make some use of the strictness and demand annotation that happen to be already present in the code, e.g.\@ on imported identifiers. If, for example, the function \hi{f} in the expression \hi{f (λx. g x)} happens to be annotated with the information that it calls its first argument at most once with one argument, then we could improve the analysis result and report that \hi{g} is called at most once.

I am, however, reluctant to add this functionality: It would imply that some programs might be optimised \emph{better} by splitting them into more modules, which is a harsh violation of the principle of least surprise.

Similarly, the other passes could use the information that was found out by the Call Arity pass: Thunks that are determined by Call Arity to be called at most once can be marked as one-shot, even if no eta-expansion is possible, which would allow the code generator to omit the code that implements the updating.
\label{sec:update-avoidance}

If we were willing to pay the price to include function parameters in the set of interesting variables (\cref{sec:interesting-variables}), then although Call Arity cannot make use of the thus found information to eta-expand anything, it could create a preliminary demand signature for the function that might help the subsequent pass of the demand analyser to get more precise results, or at least to converge earlier.

Finally, the information that a let-bound function or thunk is called at most once from within a recursive function allows more aggressive inlining.

For example, currently GHC does not transform
\begin{hcode}
\> !let \=1a \>= f x0 \\
\>      \=1b \>= g x0 \\
\> !in \>!let \>go \>0 \>= a \\
\>     \>     \>go \>1 \>= b \\
\>     \>     \>go \>i  \>= go (h i) \\
\>     \>!in go n
\end{hcode}
into
\begin{hcode}
\> !let \>go \>0 \>= f x0 \\
\>      \>go \>1 \>= g x0 \\
\>      \>go \>i \>= go (h i) \\
\> !in go n
\end{hcode}
as in general, inlining into a recursive group can duplicate work \cite{inlinersecrets}. In this case, it would be safe, as \hi{a} and \hi{b} are called at most once, and Call Arity is able to determine that. In principle, extending GHC to do this is not a problem; practically, the so-called float-out pass will simply undo this change, because – again in general – floating things out of a recursive group is a good idea, as it can increase sharing. In this case, no sharing can be gained, as the expression \hi{f x0} is evaluated only once, but this fact is not visible to GHC after \hi{a} and \hi{b} have been inlined.\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/10918}}


%
% % Old introduction
% List fusion does not always live up to the optimistic programmer’s expectations, though.
% Consider the code
% \begin{code}
% sum (filter f [42..2016])
% \end{code}
% Compiled with GHC-7.6, this produces code roughly equivalent to
% \begin{code}
% let go = \x ->  let r = if x == 2016 then [] else go (x + 1)
%                 in if f x  then  x : r
%                            else  r
% in foldl (+) 0 (go 42)
% \end{code}
% so while the list passed to \hi{filter} has been eliminated, one list still remains.
%
% Why is that so? Recall that list fusion is based on the \hi{foldr}/\hi{build} rule, which fuses right folds with good producers like \hi{[n..m]} and \hi{filter}. But \hi{sum} has been defined in terms of \hi{foldl}\footnote{Actually \hi{foldl'}, but that should not worry us here.}, and there is no corresponding fusion rule.
%
% It is a first semester student's exercise to write \hi{foldl} as a right fold:
% \begin{code}
% foldl k z xs = foldr (\v fn z -> fn (k z v)) id xs z
% \end{code}
% Using that definition list fusion will indeed eliminate all lists in our example, but the resulting code is very unsatisfying:
% \begin{code}
% let go = \x ->  let r = if x == 2016  then  id
%                                       else  go (x + 1)
%                 in if f x  then  \a -> r (a + x)
% 		           else  r
% in go 42 0
% \end{code}
% The recursive loop \hi{go} has become a function that takes one argument, then allocates a function closure for \hi{r} on the heap, and finally returns another heap-allocated function closure which will pop the next argument from the stack -- not the fastest way to evaluate a functional program \cite{fastcurry}.
%
% The code that we would like to see is
% \begin{code}
% let go = \x a ->  let r = \a -> if x == 2016  then  a
%                                               else  go (x + 1) a
%                   in if f x  then  r (a + x)
% 		             else  r a
% in go 42 0
% \end{code}
% where the number of arguments passed to \hi{go} and \hi{r} matches the number of lambdas that are manifest on the outer level, which avoids allocations of function closures and allows the runtime to do fast calls, or even tail-recursive jumps.
%
% It is also clear how to get from the bad to the good code: The definitions of \hi{go} and \hi{r} need to be eta-expanded with one extra argument each. After that, the usual simplifications will provide us with the good code (and simplify it even further).
%
% What is not so clear is how the compiler can know that it is allowed to eta-expand \hi{go}. In general, it would be a bad idea: Assume the body of the let were \hi{(\g -> g 0 + g 1) (go 42)}. Then eta-expanding \hi{go} would prevent the sharing of \hi{f 42}, which might be an expensive operation.
%
% But in our case, the transformation is valid! The crucial observation is
% \begin{compactitem}
% \item \hi{go} is \emph{always} called with at least two arguments, so no work is lost if the \hi{\a -> } and the \hi{let r =} are permuted, and
% \item \hi{r} is called at most once, with at least one argument, so it is ok to turn this thunk into a function.
% \end{compactitem}
%
% In general, for a let-bound variable \hi{v}, we want to know
% \begin{compactenum}
% \item a lower bound to the number of arguments it is called with and
% \item for a thunk \hi{v}, whether it is called at most once.
% \end{compactenum}
%
% Especially the second bit is tricky. In the bad code above, there are two calls to \hi{r} and they are in the body of a lambda, and yet \hi{r} is called at most once. Even a recursive function might call another function only once, if the call is not on the recursive path.
%
% Therefore, given an expression \hi{e} and an incoming arity \hi{n}, the Call Arity analysis will determine with how many arguments \hi{e} calls each of its free variables, and -- to detect once-calls and the recursive path -- for every pair of variables whether \hi{e} may call both of them.
%
%
% % Old section on design decisions
%
% \subsection{Design Decisions}
% \label{sec:design}
%
% Call Arity started very much tailored to the code produced by \hi{foldl}-as-\hi{foldr} and list fusion. This code has the distinctive feature that the interesting calls (|go| and \hi{r} in the example) are tail-calls. In particular, each of them is called at most once, and at most one of them is called.
%
% Therefore, the initial version of the analysis returned, for an expression, a set of variables (together with the arity they are called with), guaranteeing that at most one of them is called, at most once. It worked for the motivating cases, but it seemed a waste to throw away the arity information for variables that are called more than once -- if the variable is not a thunk, we can still make good use of that information! Therefore, we extended the analysis to report the arity for all interesting free variables, in addition to one set of variables of which at most one is called exclusively.
%
% Speaking in terms of this work that version represented the co-call-graph as the complement of the complete graph on the set of returned variables, which is necessarily less expressive.
%
% An annoying wart of this approach is that for an application $e_1~e_2$ (and analogously for a case construct), only the result from one of the expressions could be used: If we find out that $e_1$ calls at most one of $\{x_1,x_2\}$, and $e_2$ calls at most one of $\{x_3, x_4\}$, then we can only return one of these two results, and have to guess which one might be more useful later in the analysis. This heuristic was unsatisfying and motivated the final design, where, by returning the full co-call-graph, no information is lost.
%
% In that sense, the Call Arity analysis is heuristics-free: No guesses have to be made during the analysis.
%