Welcome to Plant Bioinformatics! I'm your instructor, Nicholas Provart. I've been in the plant bioinformatics field since about 1998, and this course is based on a book chapter that I developed together with Miguel de Lucas and Siobhan Brady. It finally came out in 2014, and we've used this material in the annual Cold Spring Harbor Laboratory Frontiers and Techniques in Plant Science course since about 2012. The course material was developed by Yi Fei Huang, Cathy Cha, and myself, and the course was produced by Eddi Esteban, William Heikoop, and again myself. Please do use the Coursera tools to discuss lecture content and labs, and check out Bioinformatic Methods I and II on Coursera for additional labs of interest; I've highlighted those in the lab material.
So, the course syllabus and format are as follows. We're going to be looking at a bunch of tools that are of use to plant biologists, ranging from fairly straightforward genome browsers to more powerful tools for hypothesis generation. Most tools used for exploration are web based. The focus is not just on the reference plant species Arabidopsis thaliana, but rather on widely used plant bioinformatic resources across several plant classes.
The modules are as follows. In Module 1, we'll go over plant genome databases and useful sites for protein information. In Module 2, we'll cover expression analysis. In Module 3, we'll look at coexpression tools. In Module 4, we'll look at promoters. In Module 5, we'll explore tools for functional classification and pathway visualization. And then finally, in Module 6, we'll do some network exploration of protein-protein interaction networks, protein-DNA interaction networks, and gene regulatory networks.
Each module will consist of a two-minute lecturette, followed by a 20-minute mini-lecture covering some of the relevant theory, then a 90-minute hands-on lab, which is actually the main focus of each module.
During the lab, you'll answer three lab quiz questions based on results generated during the lab. There's also a lab discussion video of the lab results if you get stuck and need a hint. And then finally, there's a two-minute concluding lecturette. There will also be two sectional quizzes, one after the first three modules and one after the second three modules, and then a final assignment at the end of the course.
The first module deals with plant genome databases and useful sites for protein information. So what is bioinformatics? Bioinformatics is the development and application of computational tools for managing all kinds of biological data. It involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules such as DNA, RNA, and proteins, and to metabolites. It's generally limited to sequence, structural, and functional analysis of genes and genomes and their corresponding products, and it's sometimes called computational molecular biology. This field has really developed over the past decade to help manage the huge increase in data generated by genome sequencing projects, high-throughput technologies, and so on.
So this is just a graph showing you the growth of nucleotide information: the number of sequences in GenBank, or the number of bases in this case. We see that there are more than a trillion bases in GenBank and more than a billion sequences. That's a lot of information! And this is driven in part by the rapid drop in sequencing costs. In 2001, it cost about $10,000 to sequence a raw megabase of DNA sequence, and then, as sequencing technologies improved over the past 15 years, the cost to sequence a raw megabase of DNA sequence has fallen to pennies. The cost per genome has concomitantly dropped to around $1,000 per human genome. The decreased sequencing cost is in turn driven by a proliferation of technologies.
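As a rough check on how the per-megabase and per-genome figures above fit together, here is a minimal back-of-the-envelope sketch. The 30-fold sequencing depth, the round 3,000 Mb human genome size, and the use of one cent to stand in for "pennies" are assumptions made for illustration, and the function name is hypothetical. It counts raw bases only, not the other costs that go into a finished genome.

```python
# Back-of-the-envelope raw sequencing cost, using round numbers.
GENOME_SIZE_MB = 3_000   # human genome, roughly 3 billion bases (approximate)
COVERAGE = 30            # assumed sequencing depth, for illustration only


def raw_sequencing_cost(cost_per_mb, genome_mb=GENOME_SIZE_MB, coverage=COVERAGE):
    """Cost of the raw bases only: genome size x depth x price per megabase."""
    return genome_mb * coverage * cost_per_mb


# About $10,000 per raw megabase in 2001 (figure quoted in the lecture).
cost_2001 = raw_sequencing_cost(10_000)
# Assumption: one cent per megabase stands in for "pennies per megabase".
cost_now = raw_sequencing_cost(0.01)

print(f"2001 price: ${cost_2001:,.0f}")
print(f"Now:        ${cost_now:,.0f}")
```

Under these assumptions the raw bases alone would have cost hundreds of millions of dollars at the 2001 price, and land in the same range as the roughly $1,000 genome at a cent per megabase.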
So here's a graph from a nice review by Mike Snyder and colleagues showing the development of various sequencing technologies over time, since about 2005 with the introduction of the 454 pyrosequencer. The y-axis on this chart shows the machine output in megabases. So up here at this level, we're at about 100 times the human exome, and at this level we're at about 30 times the human genome of about 3 billion bases. Some of these machines can really pump out mind-bogglingly large amounts of sequence information.
This innovation in sequencing technology has led to a flourishing of methods for understanding biology. This graph nicely shows the publication date of various "seq"-based methods for determining biological information. The y-axis in this case denotes the log10 of the number of citations of that particular method, and the node size is the monthly citation rate. We see that RNA-seq, for instance, is hugely cited. These various methods fall into different categories: for determining gene expression, for instance, or for determining genome organization, like Hi-C and ChIA-PET. These methods are extremely powerful for gaining new insight into biological questions.
Oftentimes biological data are deposited in biological databases, and in the case of plant data, we have a lot of plant biological databases. So we'll just talk a little bit about why we need databases, then about accession numbers and identifiers, and finish up with pre-computed gene trees that we can get from these databases and the utility of those.
So here's a piece of sequence from Arabidopsis chromosome 3. This is the kind of information that you would get out of a sequencing machine. It's not very useful on its own, so we do need some annotation efforts to be able to know where certain regions are within the sequence. So for instance, the red is the 5' UTR here. This is coding sequence in the gold color. The ATG is the start methionine.
And the purple sequence here represents the first intron of that gene. It's really useful to know that information, because we can then computationally derive protein sequences from the primary nucleotide sequence.
We can also use plant biological databases to get at more interesting information, not just sequence information. If we have a gene of interest in which we've identified some mutations, we can see where there are additional insertional mutations using TAIR. We can study functional or cis-regulatory variants in other accessions using the 1001 Genomes site. If we look at the mutant and we don't necessarily see a mutant phenotype under normal conditions, we might then use expression databases to see under what conditions our gene of interest is expressed, using eFP Browser, or ePlant, or Genevestigator. These are all tools we'll talk about in the second module. We can look at the subcellular localization, to understand a bit more about a protein's function based on where it is located within the cell. And then we might also want to know if a gene acts in a given metabolic pathway and what metabolites may be affected. We can get that information from AraCyc or from Reactome, then maybe do some follow-up metabolite profiling.
It might also be useful to know which molecular network your gene of interest participates in. So we can ask: are there any transcription factors strongly co-expressed with my gene? For that we use co-expression tools, which we'll cover in the third week. In my list of co-expressed genes, are any cis-elements statistically over-represented in the promoters? We can use a couple of different tools for that, which we'll cover in Module 4. Are there any regulators of gene expression in terms of epigenetic modifications to genomes or small RNAs? We'll explore the MPSS database in the second week. Where is my protein expressed? Are there any alternate splice forms of my protein?
Is my protein post-translationally modified? We can get that information from TAIR or Araport. We can also ask: is the protein for my gene of interest in a particular metabolic pathway? Again we can use AraCyc and TAIR, as well as Gene Ontology apps that we'll explore during this course. Finally, we can also ask: are there any putative upstream factors, downstream factors, or interactors that have been identified? In the last module, we'll use tools such as VirtualPlant, GeneMANIA, TF2Network, and ePlant to answer those kinds of questions.
We'll just take a quick look at some of the formats of data. The main kind of record that you would get from GenBank is in the GenBank flat file format, where you can download sequence information as well as metadata that GenBank has accumulated over the years. We can see some information in the header, then there's the features section, and then there's the sequence section itself. The header has the identifier, the NCBI GenBank identifier. It tells you the length, the source, the type of molecule, the NCBI taxonomic group, and when the record was last updated. The features include things like the coding regions, the putative translations of a given mRNA molecule, and that kind of thing.
In terms of plant gene identifiers, we have several different naming systems, which can be confusing at times. Oftentimes for a given genome we'll have genome initiative identifiers. The identifiers in the case of Arabidopsis look like this. GenBank will then assign its own accession.version number to a sequence that is provided to it from Araport, in the case of the Araport11 sequence release. UniProt in turn will also assign its own accession number to the record. Other plant species often also have their own genome initiative identifiers. Again, GenBank will have its own corresponding GenBank identifiers, and the identifiers for UniProt are again different.
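To make the idea of a structured genome initiative identifier concrete, here is a minimal sketch that pulls the chromosome out of an Arabidopsis locus identifier. It assumes the usual AGI pattern ("AT", a chromosome code, "G" for gene, then five digits) and uses AT3G24650, the locus commonly given for ABI3, as the example. The function name `parse_agi` and the non-matching test string are hypothetical, and the pattern is not meant to cover every identifier type that TAIR or Araport use.

```python
import re

# Assumed AGI locus pattern: "AT", chromosome (1-5, C = chloroplast,
# M = mitochondrion), "G" for gene, then a five-digit number.
AGI_PATTERN = re.compile(r"^AT([1-5CM])G(\d{5})$", re.IGNORECASE)


def parse_agi(identifier):
    """Return (chromosome, number) for an AGI locus ID, or None if it does not match."""
    # Drop a gene-model suffix such as ".1" before matching.
    locus = identifier.strip().split(".")[0]
    match = AGI_PATTERN.match(locus)
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2))


print(parse_agi("AT3G24650"))    # locus commonly given for ABI3
print(parse_agi("At3g24650.1"))  # same locus, written with a gene-model suffix
print(parse_agi("NOT_AN_AGI"))   # hypothetical string that is not an AGI ID
```

GenBank and UniProt accessions follow different patterns, so a check like this only recognizes the genome initiative style of identifier; mapping between naming systems needs a lookup in the databases themselves.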
So just be aware of that when you're searching those databases. Sometimes these identifiers are informative and they'll tell you, for instance, which chromosome the gene is on: the third chromosome in the case of this ABI3 gene. In other cases they're just assigned arbitrarily to the sequences as they're processed.
How can we search GenBank and other sequence databases? We can use keywords to search, and those depend on the quality of the metadata as applied by the people who are entering the data into the databases. We can also search by sequence similarity using BLAST, and we actually explore BLAST in quite a lot of detail in Labs 1 and 2 of Bioinformatic Methods I, also on Coursera. The reason we have to use a specialized tool is that Google and other search engines don't actually handle sequence searches well. They don't put gaps into the sequences to identify partial matches, and they don't know, in the case of protein sequences, which amino acids have similar properties. So BLAST has been developed to very efficiently search protein and DNA sequences.
Plant genome browsers that we'll be exploring in this first lab can provide a useful first look. In the case of Arabidopsis, Araport maintains a nice JBrowse instance for Arabidopsis sequence information. You can get epigenomic information overlaid as tracks onto the JBrowse instance. You can look at non-coding RNAs, pseudogenes, and chromatin states. All sorts of really quite extensive and useful information can be found there.
So in terms of definitions, we want to think about those for a few minutes. Let's consider this early globin gene in an ancestral organism. Suppose there's a gene duplication event leading to an alpha chain globin and a beta chain globin in the ancestor of several species. That would be a gene duplication event; sometimes you can have whole-genome duplications or you can have segmental gene duplications.
We will end up with two copies of that gene in the common ancestor. Then, over time, there's speciation, and in this case we maybe get a frog species, a chick species, and a mouse species. Each of those species will have a copy of the alpha globin gene and a copy of the beta globin gene. The terminology that we use here is that the alpha and the beta versions are paralogs, genes derived by duplication, here within a species, and orthologs are genes in different species that are derived by speciation. All of those together are termed homologs.
When we go to the gene databases, we can often get precomputed gene trees, and we'll be exploring these in the first lab. They can tell us a lot about the evolutionary history of what's been going on in terms of genome organization: have there been duplication events? We can see that by looking at these nodes here. In the case of the ABI3 gene, which we're searching for in this Compara gene tree, we see that there's a close relative, a homolog, probably an ortholog, in Arabidopsis lyrata, and that there's been a gene duplication in Brassica napus and in Glycine max, for instance. That information is useful if we're considering other species and we want to use other species for translational purposes. Has there been a gene duplication event in these other species? And if there was, when did it occur? Is it clade specific? That kind of thing. And we will explore that in the lab.
So the last thing that I'll touch on is to ask the question: which database do we use for what? Oftentimes Araport and TAIR are good starting points for Arabidopsis sequence information. Then we can branch out to more specialized databases such as the BAR, Plant Metabolic Network, and Planteome, those kinds of things, to get more specific information. Those are the kinds of databases that we'll be exploring in further labs during this course. I hope you enjoy the lab, thanks!