In this session, we'll introduce the Biostrings package. This is a package that contains functionality for representing and manipulating biological strings and biological data. By biological strings we mean DNA strings, RNA strings, or amino acid strings. We construct a DNA string by using the DNAString constructor, which constructs a single string. This looks very much like a character vector, but behind the scenes there's some functionality that makes it efficient for dealing with either very long strings, for example whole chromosomes, or millions of strings, like we see for short reads.
There's also something called a DNAStringSet, which is a collection of DNA strings. These sequences don't have to have the same length, or width, and for DNA strings they are constrained to have elements in the IUPAC code. The IUPAC code is a way of representing nucleotides with a little bit more granularity. So, for example, you can see from this code that M means an A or a C nucleotide, and H means an A, C, or T nucleotide.
When we subset a DNAString, we get a subsequence out. When we subset a DNAStringSet, we get another DNAStringSet. We can also get a DNAString out of a DNAStringSet: we have to select a single element, and we have to use double brackets. This is very similar to what happens with lists in base R, where there is a difference between a list with one element and the element itself. We can put names onto a DNAStringSet and use these names to subset the set, as expected.
Now, there are a number of convenience functions for DNA strings. We have things such as width, which tells us the number of bases in each string. We can sort them; it's not clear what that's always useful for. And we can reverse them. Here you want to be careful with rev: rev on a DNAStringSet just gives you the strings in the reverse order, whereas rev on a DNAString actually reverses the string itself, as a biological operation.
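To make the IUPAC code mentioned above concrete, here is a minimal sketch in plain Python. It is not the Biostrings API. The letter-to-nucleotide mapping is the standard IUPAC ambiguity code, but the names `IUPAC_DNA`, `expand`, and `is_valid_dna` are hypothetical, and the sketch ignores gap characters.

```python
# Standard IUPAC nucleotide ambiguity codes.
# The names below are hypothetical and only serve this illustration.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand(code):
    """Return the set of nucleotides that one IUPAC letter stands for."""
    return set(IUPAC_DNA[code.upper()])


def is_valid_dna(seq):
    """Return True if every letter of seq is in the IUPAC DNA alphabet above."""
    return all(ch in IUPAC_DNA for ch in seq.upper())
```

For example, `expand("M")` gives the set containing A and C, and `expand("H")` gives A, C, and T. A string containing U fails `is_valid_dna`, because U belongs to the RNA alphabet.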
Another way to get the reverse is to use the reverse function, which gives you the true biological reverse, even for a DNAStringSet. We can use other biological operations such as the reverse complement. And we can do things that only make sense for a DNA string, such as translating it into a protein. In this case we get a warning message for two out of the three strings, because the widths of those strings are not multiples of three. We know that DNA nucleotides are translated into amino acids one codon, that is three nucleotides, at a time.
There's also a set of functionality for counting letters: nucleotides, dinucleotides, trinucleotides, and general oligonucleotides. I use this quite a bit. So, alphabetFrequency gives you a general frequency table, which for each string tells you how many occurrences of the different letters we see. Often we are interested in doing something simple, for example computing the GC content of a string. We can do that using letterFrequency, a function where we give it some input and then tell it which letters we want to count. Giving it GC means we are counting both G and C, so in other words we get the GC content.
We can also get higher-order nucleotides. So we can get the dinucleotide frequency, for dna2, where we now get all dinucleotides, and we can get trinucleotides, and then we can get more general oligonucleotides. The output starts to get pretty big when we start to have 8-mers or 12-mers or 32-mers.
Finally, there is also a function called consensusMatrix, which tells us, for each position, how many strings had a specific nucleotide at that position. In this case there are five positions, because the longest string had five elements. So we can see that all three strings started with an A.
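The counting ideas above can be sketched in a few lines of plain Python. This is an illustration of the logic only, not the Biostrings functions themselves. The names `letter_count` and `consensus_matrix` are hypothetical, and the three example strings are made up to match the description: three strings, the longest with five letters, all starting with A.

```python
def letter_count(seq, letters):
    """Count the positions of seq that hold any one of the given letters."""
    return sum(1 for ch in seq if ch in letters)


def consensus_matrix(seqs, alphabet="ACGT"):
    """For each position, count how many sequences carry each letter.

    Returns a dict mapping each letter to a list of counts, with one entry
    per position of the longest sequence. Shorter sequences contribute
    nothing past their end. Letters outside `alphabet` raise a KeyError.
    """
    n_pos = max(len(s) for s in seqs)
    matrix = {letter: [0] * n_pos for letter in alphabet}
    for s in seqs:
        for pos, ch in enumerate(s):
            matrix[ch][pos] += 1
    return matrix


# Made-up example strings (an assumption, not data from the session).
seqs = ["ACG", "ACGT", "ACGTT"]

# Counting G and C together gives the GC count of each string.
gc_counts = [letter_count(s, "GC") for s in seqs]

# One row per letter, one column per position of the longest string.
cm = consensus_matrix(seqs)
```

Here the row for A in `cm` has a 3 in the first position and zeros elsewhere, which is the "all three strings started with an A" observation. Dividing each GC count by the length of its string would turn the count into a proportion.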
This is useful when you build up position weight matrices or motifs for transcription factor binding sites.
So this was a small introduction to Biostrings. Besides DNAString, there's an RNAString and an AAString for amino acid strings. There's also a general BString, for biostring; a BString is a string over any kind of character. And finally, in the documentation you will sometimes read about an XString. An XString is just any type of string: a DNAString, RNAString, AAString, or BString.