Welcome. In this session, we will see an example of static analysis called dataflow analysis. Let us consider this small program again, the computation of the greatest common divisor (GCD) of two numbers, and let us say we are interested in reasoning about some properties of this program. In particular, we want to see which definitions get used where, that is, which assignments of values to variables get used at which points in the program. This is one of the analyses done in typical compilers.

To perform this analysis, it is easier to first abstract the program into what is called a control-flow graph. A control-flow graph is essentially a set of vertices and edges, where each vertex, or node, is a basic block: a straight-line piece of code. The edges show the possible transfers of control from one block to another. In this case we have four nodes. The header node, A, is the root of the graph and represents the function header of GCD. The next node, B, is the while-loop test. Depending on how the test evaluates, we may either take the true branch into the loop body or take the false branch out of the loop; those possibilities are represented by the two edges, one going to C and the other going to D. The body, at the end of its execution, comes back to the head of the while loop.

We may want to abstract away details that are not of interest to our analysis and concentrate only on those that are. Let us say we are interested in the definitions of variables. The variables dividend and divisor are defined, in some sense, at the beginning of the procedure: at entry, values are assigned to the arguments when the function is called. Node B just uses divisor to compute the loop-entry test; it does not update any variable values. In the body, both variables are assigned new values, using the current values to compute the new ones, so there are both uses and definitions of these variables there. Finally, the return statement uses one of the variables.

Now, let us say our question of interest is: which definitions can reach node C? By that we mean: do the definitions of the variables in node A reach C, and do the definitions of the variables in node C itself reach C? For this simple program it is easy to determine the answer by visual inspection, but let us work through the details to understand how the flow analysis works.

The reaching-definitions problem can be stated as follows. At each node we want to compute what is coming in, and based on that we can determine what goes out as definitions to the subsequent nodes. The definitions reaching in are the union of all the definitions flowing out of the preceding nodes in the control-flow graph. We can state this formally as In(n) = union of Out(m) over all predecessor nodes m of n; this is called the JOIN rule. The definitions flowing out of a node are obtained by taking the definitions reaching in, removing the definitions of any variables whose values the node reassigns, and adding in the new definitions created in the node. We call the removed definitions the KILL set and the added ones the GEN set, so Out(n) = Gen(n) ∪ (In(n) − Kill(n)).
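To make this concrete, here is a minimal sketch in Python (not from the original lecture) that encodes the GCD control-flow graph and its GEN/KILL sets. The exact GCD source is not reproduced in this transcript, so the outline in the comments, the node names A through D, and the "variable@node" labels for definitions are assumptions for illustration.

```python
# Assumed shape of the procedure being analysed:
#
#   gcd(dividend, divisor):              # node A: entry, arguments defined
#       while divisor != 0:              # node B: loop test, uses divisor
#           remainder = dividend % divisor   # node C: loop body,
#           dividend = divisor               #   redefines both variables
#           divisor = remainder
#       return dividend                  # node D: exit, uses dividend

# Successor edges: A -> B, B -> C (true branch), B -> D (false branch),
# and the back edge C -> B.
SUCC = {"A": ["B"], "B": ["C", "D"], "C": ["B"], "D": []}

# GEN(n): definitions created in node n.
# KILL(n): definitions of the same variables made elsewhere, which the
# assignments in n overwrite.
GEN = {
    "A": {"dividend@A", "divisor@A"},   # entry: arguments receive values
    "B": set(),                         # the test only uses divisor
    "C": {"dividend@C", "divisor@C"},   # body: both variables reassigned
    "D": set(),                         # return only uses dividend
}
KILL = {
    "A": {"dividend@C", "divisor@C"},
    "B": set(),
    "C": {"dividend@A", "divisor@A"},
    "D": set(),
}
```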
So, essentially these equations capture how the information flows through each node. The flow-out equation is called the TRANSFER rule: it takes the input and transforms it into the output, making some changes. If we look at the reaching-definitions example again, the things that flow out of node A are the definition of dividend at A and the definition of divisor at A. For the things that flow out of B, using the transfer rule, we take the definitions coming in through the edge A–B and the edge C–B: we receive dividend-at-A and divisor-at-A from the edge A–B, and dividend-at-C and divisor-at-C from the edge C–B. Node B does not change the value of any of these variables, so it kills nothing, and it also generates nothing new. Finally, for C, we take the output of B flowing in along the B–C edge, remove what is killed, and add what is generated. Since C reassigns both variables, it kills the incoming definitions and flows out only the new definitions it creates for dividend and divisor, namely dividend-at-C and divisor-at-C.

So we get the idea, but how do we do this systematically? Here we cheated a little: we just looked at the graph and said what the answer should be. If we have to write a procedure that does this in a systematic fashion, there is some complexity involved, because these are mutually recursive equations: OUT is defined in terms of IN, and IN is defined in terms of the OUT of the predecessors, and in general the control-flow graph can contain loops. How do we solve this? This is where a standard algorithm called the worklist algorithm is used (a sketch appears at the end of this passage). Essentially, we start out by saying that we have no information; in the case of reaching definitions, no information means Out(n) is the empty set for every node. Then we maintain a worklist, add the root node of the control-flow graph to it, and repeatedly perform the following actions as long as the worklist is not empty. We remove a node from the worklist and compute its new OUT value. Initially everything is empty, so IN would be empty, and all we would do is add in the definitions created in that node to compute OUT. Once we do this, we add the node's successors to the worklist, because this node's output changed, and that could potentially change the outputs of its successors. We keep doing this until the worklist becomes empty.

There are some nice results which show that this approach works, in the sense that it eventually terminates with the worklist empty, and that is because of the structure of these rules. If you think about it in an abstract sense, each application of the JOIN and TRANSFER rules can only make the sets grow in one direction (or, for some other analyses, shrink in one direction). These are called monotonic functions, and because of the monotonicity of the TRANSFER and JOIN rules, this approach has guaranteed correctness and termination. At the end, the OUT of each node will contain exactly the definitions that flow out of that node, and the IN will contain the definitions that flow into it.

We can generalize this understanding into a general dataflow analysis algorithm, essentially by defining three things: the JOIN function, which says how to determine a node's input from the outputs of its neighbours; the TRANSFER function, which says how the input is transformed into the output; and the direction of information flow.
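Here is the promised sketch of the worklist algorithm for reaching definitions, reusing the SUCC, GEN and KILL tables from the previous snippet. This is a minimal illustration under those assumed names, not the lecture's exact procedure.

```python
def reaching_definitions(succ, gen, kill, root="A"):
    # Predecessor map, needed by the JOIN rule.
    pred = {n: [] for n in succ}
    for m, targets in succ.items():
        for n in targets:
            pred[n].append(m)

    # Start with no information: Out(n) is the empty set for every node.
    out = {n: set() for n in succ}
    worklist = [root]
    while worklist:
        n = worklist.pop()
        # JOIN: union of the OUT sets of all predecessors.
        in_n = set().union(*(out[m] for m in pred[n]))
        # TRANSFER: Out(n) = Gen(n) U (In(n) - Kill(n)).
        new_out = gen[n] | (in_n - kill[n])
        if new_out != out[n]:
            out[n] = new_out
            # This node's output changed, which may change its successors'
            # outputs, so revisit them.
            worklist.extend(succ[n])

    # Once OUT has stabilised, IN follows from one more application of JOIN.
    in_sets = {n: set().union(*(out[m] for m in pred[n])) for n in succ}
    return in_sets, out

in_sets, out_sets = reaching_definitions(SUCC, GEN, KILL)
print(sorted(in_sets["C"]))
# ['dividend@A', 'dividend@C', 'divisor@A', 'divisor@C']
```

Running this confirms the earlier hand computation: the definitions made in both A and C reach node C.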
For reaching definitions, the TRANSFER function removes the KILL set and adds the GEN set. In some analyses we may have to work backwards, from the exit point of the procedure to its entry. In that case we work backwards along the edges: we flip predecessors with successors and OUT with IN, and essentially run the same algorithm. So the direction is either forward or backward. Once we define the JOIN function, the TRANSFER function and the direction of information flow, the same algorithm can work for a variety of analyses (a parameterised sketch appears after this passage).

Some of these analyses are things like available expressions, where compilers often want to avoid unnecessary recomputation of expressions that have been computed previously and whose variables have not changed their values since. In this case, we want to be sure that irrespective of which path we take to a node, the expression is available, before declaring it available for use in that node. So instead of starting with the empty set, we start with the set of all possible expressions, then remove those that are not available, and eventually the analysis stabilizes and terminates with the expressions that are available at each node. This is again a forward analysis, but whereas reaching definitions was an any-path analysis, this is an all-paths analysis.

Live variables is another typical compiler analysis, where the compiler has to decide which variables must be retained in registers because they might potentially be used again within the procedure, or within a certain window. The liveness of a variable is determined by whether it is used at a later point, so this is a backward analysis.

And in the domain of security, an important analysis that is often performed is called taint analysis. It asks: is there a potential flow of untrusted data to some use where trusted data is expected? For example, when we are opening a file, the file name should come from a trusted source; otherwise we may expose information leaks or allow other bad things to happen to the program. If we take a file name from a web form and use it to open the file and perform some operations, a malicious user may try to open a file that should not be accessible to them. So we call data tainted when it comes from an untrusted source, and we check whether tainted data can flow to locations where we expect untainted data. This can be cast as a dataflow analysis problem. Other such analyses include checking whether a variable is read before initialization, or whether memory is used after it has been released. These are examples of analyses we may be interested in that can be cast in terms of the dataflow analysis we saw.
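Here is the parameterised sketch mentioned above: a generic solver taking the JOIN function, the TRANSFER function, and the direction, with live variables as a backward, any-path instance. The USE/DEF sets for the GCD nodes are assumptions for illustration, consistent with the earlier discussion.

```python
def solve(succ, transfer, join, direction="forward"):
    if direction == "backward":
        # Flip every edge; the rest of the algorithm is unchanged.
        flipped = {n: [] for n in succ}
        for m, targets in succ.items():
            for n in targets:
                flipped[n].append(m)
        succ = flipped
    pred = {n: [] for n in succ}
    for m, targets in succ.items():
        for n in targets:
            pred[n].append(m)

    result = {n: set() for n in succ}
    worklist = list(succ)                  # seed with every node
    while worklist:
        n = worklist.pop()
        new = transfer(n, join(result[m] for m in pred[n]))
        if new != result[n]:
            result[n] = new
            worklist.extend(succ[n])
    return result

# Live variables: USE(n) holds variables read in n, DEF(n) variables
# assigned in n. A variable is live if it may be used again before
# being redefined, so information flows backwards.
USE = {"A": set(), "B": {"divisor"},
       "C": {"dividend", "divisor"}, "D": {"dividend"}}
DEF = {"A": {"dividend", "divisor"},
       "B": set(), "C": {"dividend", "divisor"}, "D": set()}

live_in = solve(
    SUCC,
    transfer=lambda n, live_out: USE[n] | (live_out - DEF[n]),
    join=lambda sets: set().union(*sets),
    direction="backward",
)
```

The reaching-definitions analysis above is just the forward instance of this same solver, with transfer GEN[n] | (in_n - KILL[n]). An all-paths analysis such as available expressions would instead use set intersection as the JOIN and initialise every node to the full set of expressions rather than the empty set.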
There is, however, a limitation to this analysis: it is flow-sensitive but not path-sensitive. For example, looking at the GCD procedure, we may ask: can the definition of divisor in C reach C itself? It can, because there is a path from C back to itself through B, around the body of the loop, and the dataflow analysis does indeed give this answer: the reaching-definitions IN set for node C has divisor-at-C as one of its elements. However, suppose we ask whether that definition can reach C when the value of divisor is 0. That is something we cannot answer, and it may be important: for example, we may want to check whether the modulo operation performed in the very first statement within C, which computes the remainder, can result in a divide-by-zero. It can if divisor is 0. It is clear here that the while-loop condition ensures divisor is not 0 inside the body, but dataflow analysis cannot establish this (see the annotated sketch at the end of this section). We will look at other analyses that are path-sensitive, but as we increase the sensitivity, the complexity of the analysis grows, which means its scalability suffers. So there is always a trade-off between how precise and sensitive an analysis must be and how scalable it can be.
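To make the path-sensitivity gap concrete, here is a hedged reconstruction of the loop around nodes B and C (the lecture's exact code is not shown in this transcript, so the variable names are assumptions):

```python
while divisor != 0:                   # node B: on the true branch into C,
    remainder = dividend % divisor    # divisor is non-zero, so this modulo
    dividend = divisor                # can never divide by zero. But reaching
    divisor = remainder               # definitions only records that
                                      # divisor@C reaches C; it does not track
                                      # the condition guarding the B -> C edge.
```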