Welcome to the course, Command Line Tools for Genomic Data Science. Today, in lecture three, we'll be talking about Alignment and Sequence Variation. We'll start with the definition of sequence variation: what it is, how it manifests itself, and what the consequences are. Then we're going to look at the workflow for determining sequence variation in genomes and exomes. And lastly, we will conclude with a demonstration of the relevant genomic tools for alignment and for calling variants. So let's get started.

Each individual's genome sequence is unique. The differences in the DNA sequences of individuals are ultimately responsible for differences in observable traits, such as height or eye color, as well as for hidden ones, for instance susceptibility to a disease, or how the cell functions. There is about a 1% difference between the genomes of humans and the genomes of chimps. And there is about a 0.1% difference between the genomes of any two humans. But if you think about it, out of a three billion base sequence, that amounts to roughly three million differences between any two individuals.

Now there have been a number of very large international projects that aim to sequence the genomes of thousands of individuals and to create a map of all the sequence variation along the human genome. And I should say those include HapMap and the 1000 Genomes Project. These projects capture most of the sequence variation along the genome that is common to a large number of people. Nevertheless, each person will still have a relatively small number of variants that are going to be rare, or relatively unique. And I should say at this point that a variant that occurs in at least 1% of the population is called a polymorphism. So we will be looking at sequence variation in this lecture.
And by that, we mean relatively small differences such as substitutions, insertions, or deletions, either of one base, or maybe at the level of small blocks of up to a thousand bases. There can also be structural variants that range from about one kilobase up to megabases of sequence, but they will not be addressed here.

So how do we identify sequence variation in an individual? We start by taking a sample, for instance a blood sample. We extract the DNA and sequence it with an instrument such as an Illumina sequencer, which will typically produce a large number of reads. The reads are then mapped to the genome using an alignment algorithm, and then the sequences are analyzed. Essentially, at every position in the genome, we're looking at all the reads that line up above it, that align to it. And we're looking at the base that each read contains at that particular position. So if you're looking here on the left hand side, you'll see that this individual has two alleles, an A allele, which is the same as the reference, and a C allele. And by the way, the number of reads that line up at a given position in the genome gives the so-called depth, or depth of coverage.

Now it might be expensive to sequence the genome from one end to the next, a procedure that we usually refer to as whole genome shotgun sequencing. In most cases, we expect that the variants that are going to be disease causing will be contained within genes, more specifically within the protein coding portions of the genes. Therefore, one other strategy would be to only sequence the DNA within the portions of the genome that correspond to genes, and we call that whole exome shotgun sequencing. In either case, the bioinformatics workflow for detecting sequence variation would start with alignment of the reads to the genome, followed by what we call variant calling.
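To make the idea of looking at the reads stacked over one position more concrete, here is a minimal Python sketch. It is only an illustration, not how a real variant caller works: the reference string, the reads, their start coordinates, and the function name `pileup_column` are all made up for this example, and the reads are assumed to be already aligned without gaps. It collects the bases that cover one position, then reports the depth of coverage and how many reads support each allele.

```python
from collections import Counter

def pileup_column(reads, position):
    """Collect the bases that aligned reads place over one reference position.

    reads: list of (start, sequence) pairs, with 0-based start coordinates.
    Alignments are assumed to be gap-free (no insertions or deletions),
    which keeps the sketch short but is not true of real data.
    """
    bases = []
    for start, seq in reads:
        offset = position - start
        if 0 <= offset < len(seq):  # does this read cover the position?
            bases.append(seq[offset])
    return bases

# Hypothetical toy data: a short reference and six already-aligned reads.
reference = "GATTACAGATTACA"
reads = [
    (0, "GATTACAG"),
    (2, "TTACAGAT"),
    (3, "TACCGATT"),
    (4, "ACCGATTA"),
    (5, "CAGATTAC"),
    (6, "CGATTACA"),
]

position = 6                              # 0-based position to inspect
column = pileup_column(reads, position)
depth = len(column)                       # depth of coverage at this position
allele_counts = Counter(column)           # reads supporting each base

print("reference base:", reference[position])
print("depth:", depth)
for base, count in sorted(allele_counts.items()):
    print(base, count)
```

In this toy example, three reads carry the reference base A and three carry a C, which is the kind of picture you would expect for an individual with one A allele and one C allele. A real variant caller would also weigh base qualities, mapping qualities, and sequencing error before making that call.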
In the following sections, we will be looking at the genomic tools that are employed to detect sequence variation.