All right. Welcome to the fifth module of Plant Bioinformatics. In this week's lab, we're going to be exploring tools for functional classification and pathway visualization. So, oftentimes in an -omics kind of experiment, we'll come up with lists of genes, and we can do some analysis such as cluster analysis to come up with sets of genes that are similarly expressed. We did that, for instance, in the fourth lab of Bioinformatic Methods II. The question is, how do we make sense of these lists? One thing that we can do is ask: is there enrichment of a particular functional category or pathway as defined by Gene Ontology or AraCyc, et cetera? Then, we can analyze those lists for enrichment with tools that we'll be exploring in the lab, such as AgriGO. It's often useful to know if all of the genes in the list are part of a given pathway. In order to assess that, in addition to doing these enrichment tests, we can actually map expression information in a visual analytic way onto pathways with one of the BioCyc tools or with MapMan.
So we'll talk now a little bit about Gene Ontology. We explored Gene Ontology a little bit in Bioinformatic Methods II, lab 2, where we were looking at the interactors of a particular protein and asking whether or not they all belong to a similar biological process, and we'll explore this further today. So, an ontology is a controlled vocabulary for describing a knowledge system. In the case of Gene Ontology, there are three main organizing principles, and those are molecular function, biological process, and cellular component. There are similar categorization systems that are used by other groups or tools such as MapMan. Here's just a snapshot of the geneontology.org output for the DD (drill-down) browse tool. We see that in the case of biological process, we have subcategories. Under biological phase, we have cell cycle phase. Under cell cycle phase, we have, let's say, M phase.
Then, there are 96 genes associated with that particular M phase category. We can see that as we move towards the top of the categorization system, more and more genes are associated with each category. Gene Ontology was proposed in 1998 by Michael Ashburner for a conference in Montreal. The aims of GO were to develop a comprehensive shared vocabulary of terms describing aspects of molecular biology that are common to more than one life form. So this is a really important idea: we can actually map our ontology terms to genes in different species. It was also designed to describe gene products held in each contributing model organism database, to provide a scientific resource for access to the vocabularies, the annotations, and the associated data, and also to provide computational tools to assist in the curation of the GO terms themselves.
So, the structure of Gene Ontology is a directed acyclic graph. It's not a hierarchical tree, it's actually a directed acyclic graph. What this means, as you can see in this little snapshot of part of the Gene Ontology biological process graph, is that child terms such as DNA ligation can have multiple parent terms, such as DNA recombination in the one case or DNA repair in the other case. The other thing that you can see from this figure is that genes involved in a particular biological process such as DNA ligation can come from different species. So in the case of yeast, the gene for DNA ligation, or one of the genes for DNA ligation, is CDC9. The corresponding genes in Drosophila are DNA-lig I and DNA-lig II. It's Lig1 and Lig3 in mouse. Then, for every species, there'll be a set of genes that are all involved in DNA ligation, and this allows us to unify across biology, and that's why Gene Ontology was developed, now 20 years ago.
So, let's come back to our lists of genes.
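As a side note on the directed acyclic graph structure described above, here is a minimal Python sketch of a term with two parents, and of why gene counts grow as you move up the graph. Only the DNA ligation parents and the gene names come from the figure in the lecture; the higher-level edges, the dictionary layout, and the function names `ancestors` and `genes_per_term` are assumptions for illustration, not how any GO tool stores its data.

```python
# Toy fragment of the GO biological process graph: child term -> parent terms.
# The DNA ligation edges are from the lecture figure; the edges above them
# are a simplified assumption for illustration.
PARENTS = {
    "DNA ligation": ["DNA recombination", "DNA repair"],
    "DNA recombination": ["DNA metabolic process"],
    "DNA repair": ["DNA metabolic process"],
    "DNA metabolic process": ["biological process"],
    "biological process": [],
}

# Direct annotations named in the lecture (gene -> GO term).
ANNOTATIONS = {
    "CDC9": "DNA ligation",  # yeast
    "Lig1": "DNA ligation",  # mouse
    "Lig3": "DNA ligation",  # mouse
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def genes_per_term(annotations=ANNOTATIONS, parents=PARENTS):
    """Count genes per term, propagating each annotation to all ancestors."""
    counts = {}
    for gene, term in annotations.items():
        for t in {term} | ancestors(term, parents):
            counts[t] = counts.get(t, 0) + 1
    return counts


if __name__ == "__main__":
    print(sorted(ancestors("DNA ligation")))
    print(genes_per_term())
```

Because each term is visited only once per gene, a gene annotated to DNA ligation is counted a single time at the shared ancestor even though two paths lead there, which is the practical difference between a graph and a simple tree.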
We might be interested in knowing, for a particular list of genes, is there enrichment for any particular GO term? So in this case we have five genes, four of which are involved in rRNA processing, and these are yeast genes actually. How likely is it that such a list could have been generated by chance if, in the overall genome of yeast, 353 genes out of 6,441 are involved in rRNA processing? Just visually, that does seem to be considerable enrichment for rRNA processing genes relative to the overall set, but we can actually do sampling and generate random sets of genes containing five members. If you do the statistics, around 75 percent of the time you'd get sets that don't contain any rRNA processing genes out of the five randomly picked genes. Maybe around 20 percent of the time you'd get one. This is the null distribution. But if four out of five are rRNA processing genes, the probability of that occurring by chance is 4.2 times 10 to the minus five. So in fact, we can say that there does seem to be enrichment of rRNA processing genes, with a p-value of 4.2 times 10 to the minus five. So, here is a slide just showing how we can actually calculate the p-value using an Excel function if you want to, but the important thing from this slide is what we need as inputs. We need the number of genes in your list with a given function, which in the previous slide was four; the total number of genes in your list, which was five; the total number of genes with a given function, which was 353; and then the total number of genes, which would be the 6,441 number. By simply having those values, we can come up with a p-value for enrichment.
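The calculation above can also be sketched in a few lines of Python using the hypergeometric distribution, which describes drawing a list of genes at random without replacement. This is a minimal sketch using only the standard library; the function names `hypergeom_pmf` and `enrichment_p_value` are made up for illustration, and the transcript does not say which Excel function the slide uses, so this is not necessarily the same formula.

```python
from math import comb


def hypergeom_pmf(k, list_size, annotated, genome_size):
    """P(exactly k annotated genes) in a random list drawn without replacement."""
    return (
        comb(annotated, k)
        * comb(genome_size - annotated, list_size - k)
        / comb(genome_size, list_size)
    )


def enrichment_p_value(k, list_size, annotated, genome_size):
    """P(k or more annotated genes in the list) under the null distribution."""
    upper = min(list_size, annotated)
    return sum(
        hypergeom_pmf(i, list_size, annotated, genome_size)
        for i in range(k, upper + 1)
    )


if __name__ == "__main__":
    # Inputs from the lecture: 4 of 5 genes in the list, 353 of 6,441 in yeast.
    k, list_size, annotated, genome_size = 4, 5, 353, 6441
    print("P(0 of 5):", hypergeom_pmf(0, list_size, annotated, genome_size))
    print("P(1 of 5):", hypergeom_pmf(1, list_size, annotated, genome_size))
    print("P(4 of 5):", hypergeom_pmf(k, list_size, annotated, genome_size))
    print("P(>= 4 of 5):", enrichment_p_value(k, list_size, annotated, genome_size))
```

With the four inputs from the lecture, the sketch gives roughly 0.75 for a list with no rRNA processing genes and about 4.2 times 10 to the minus five for four out of five. Counting "four or more" instead of "exactly four" barely changes the result here, because getting five out of five by chance is rarer still.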
So typically, what we would do is organize our data. We would maybe call differentially expressed genes for certain samples, or we would take cutoffs from cluster analysis, and we would come up with groups of genes that all show, say, similar expression profiles across very many experiments. Then, we can subject those lists of genes to enrichment analysis and see, in this case (this is the stress response), that we have an enrichment for amino acid biosynthesis or metabolism for this set of genes. Specifically, for this set of genes, we have an enrichment for methionine metabolism. So it's a really nice way to be able to organize the data. Often, we adjust these p-values using a Bonferroni correction or Benjamini-Hochberg because of the issue of multiple testing. In the case of Gene Ontology tests, we're often doing hundreds if not thousands of tests, one for each category within the Gene Ontology.
There are many tools for doing Gene Ontology term enrichment analysis, and we explored one of these in Bioinformatic Methods II in lab two, which was called BiNGO. Here, we're looking at enrichment for sets of BRCA2 gene product interactors. The output of BiNGO is a graph network because that's what Gene Ontology is. Each node represents a Gene Ontology term. Connections represent the parent-child relationships. Then, the nodes themselves are coloured by p-value, with red being more enriched. BiNGO is also set up to analyze Arabidopsis genes, or we can load in a custom set if we want to. We're not going to be exploring that in the lab, but it is nice to have it available as a stand-alone tool within the Cytoscape framework if you're used to using Cytoscape.
One tool that we will be exploring today is AgriGO, out of Zhen Su's lab at China Agricultural University. The thing that I like about this tool is that it provides a depiction of the directed acyclic graph nature of Gene Ontology.
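Coming back briefly to the multiple-testing adjustment mentioned above, here is a minimal Python sketch of the two corrections, assuming one raw p-value per tested GO term. The function names and the example p-values are made up for illustration; enrichment tools such as AgriGO apply their own implementations, which may differ in detail.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up raw p-values for five hypothetical GO terms.
    raw = [0.001, 0.008, 0.039, 0.041, 0.6]
    print("Bonferroni:        ", bonferroni(raw))
    print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

Bonferroni controls the chance of even one false positive and becomes very strict with thousands of GO terms, while Benjamini-Hochberg controls the expected fraction of false positives among the terms called significant, so it usually keeps more terms.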
It's divided up into the three main components, in this case, biological process. We can see how one term is related to another term or to more than one term. Here's a term that has two parent terms. Then, the terms themselves are coloured according to the significance level of the enrichment for that particular term for the set of genes that you're interested in. So you can get a quick visual overview of which biological process or molecular function or cellular component might be enriched for your set of genes.
Another useful thing to do with a set of genes is to do a pathway analysis. We'll be exploring a tool called AraCyc today. AraCyc is part of the BioCyc framework, and there are dozens of plant species available within BioCyc, and hundreds of other species. What you can do with the BioCyc framework is overlay expression data onto pathways to be able to see if there are specific pathways that seem to be upregulated or downregulated within your gene set of interest. You provide the gene IDs, you provide the expression information, and then what AraCyc, or BioCyc, will do is create these images for you, highlighting which pathways are increased in expression. So that's really nice. Once you're a little bit used to seeing these glyphs, as they're called, you get to know which ones belong to which pathways, and nicely you can then compare two panels side by side under different conditions and see if there's differential regulation of certain pathways under different conditions.
Another thing you might consider is to create your own custom figures for pathways. This is starch biosynthesis. Here, the samples are broken down according to developmental time in different parts of the grain. Maybe in this way, you can see differential expression of certain isoforms within particular pathways.
Of course, it's a lot more work to produce this figure than one generated by AraCyc, but it might be useful for your own visualization purposes.
All right. In today's lab, we are going to look at AgriGO, as I mentioned, another tool called AmiGO, the Classification SuperViewer, and TAIR as tools for doing GO term enrichment analysis in Arabidopsis. We'll also explore an interesting tool called g:Profiler that actually allows you to see the evidence codes that are associated with each particular gene. Then again, we'll explore AraCyc, as I mentioned. We'll also check out MapMan, which is a Java-based tool for doing pathway enrichment analysis. MapMan actually has its own categorization system and its own curation of terms. Then finally, if you have time, you should also check out the very nice Plant Reactome, which is an open-source, open-access, manually curated, and peer-reviewed pathway database for dozens of plant species. That's it for the lecture. I hope you enjoy the lab. Thanks.