In this session, we will discuss how to get your data into Bioconductor. We have seen that Bioconductor has a number of convenient data containers, such as ExpressionSet, SummarizedExperiment, and GRanges, that are useful for storing high-throughput data from one or more samples. But how do you actually get your data in there? That depends a lot on the file format you have, and bioinformatics has jokingly been referred to as the science of inventing new file formats. There is some truth to that: we are dealing with many, many different file formats, and sometimes different people misuse a file format to store types of data it was never meant for. This makes it hard to have a single easy way of getting data in, and how easy it is depends a lot on the application area you are dealing with.

Let's start by discussing microarrays, which are by far the easiest case. For microarrays you typically get raw data from the microarray vendor or the microarray scanner. The two big vendors (there are obviously other vendors that are also supported by Bioconductor) are Affymetrix and Illumina, and Bioconductor has a couple of packages, such as affyio, providing low-level parsing of these binary, vendor-specific file formats. As an end user you should not be using these low-level packages; you should be using high-level packages that take the data and put it into a container that is ready for analysis. The classic example is affy, an older package for analyzing gene expression data from Affymetrix arrays. oligo is a newer, more modern version of affy that supports both Affymetrix and NimbleGen arrays, covering expression arrays, SNP arrays, and expression-like arrays such as exon arrays. Then there are lumi and minfi for dealing with Illumina arrays; minfi only deals with DNA methylation arrays. There is usually one or more functions in these packages that simply read a directory full of files and put the data into a ready-to-use container (a sketch is shown below).

The other big application area is high-throughput sequencing, and here the situation is a little less clear. Raw reads, before alignment, are typically provided in the FASTQ format, and we have great support for reading FASTQ files using the ShortRead package. The next step, once you have the raw reads, is usually to align them to the genome using an aligner such as Bowtie, MAQ, or BWA, or perhaps a junction-aware aligner such as GSNAP or TopHat. For storing aligned reads we have the versatile SAM and BAM formats; BAM is a binary version of SAM, and we have access to these files through the Rsamtools package.

Now, FASTQ, BAM, and SAM files are fairly raw files; they are usually still a bit far away from giving you the data in a form that is easy to analyze. What happens after the BAM file is very domain specific: it is different for RNA sequencing, ChIP sequencing, and DNA sequencing, and within each of these areas it depends a lot on what you want to do with the data and what pipeline you are running. If you have done RNA sequencing, you may be interested in getting gene-level counts, or you may be interested in doing transcript assembly with a piece of software such as Cufflinks or StringTie, and how you deal with these outputs is quite different.
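To make the microarray case concrete, here is a minimal sketch using affy, assuming a directory of Affymetrix CEL files; the directory path is hypothetical:

```r
## Minimal sketch: read a directory of Affymetrix CEL files into an
## AffyBatch, a container that is ready for preprocessing and analysis.
## The path is a hypothetical example.
library(affy)
abatch <- ReadAffy(celfile.path = "path/to/celfiles")
abatch
```

The oligo and minfi packages follow the same pattern: one function call that reads a whole directory and returns an analysis-ready object.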
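For FASTQ files, here is a minimal sketch using ShortRead, with a hypothetical file name:

```r
## Minimal sketch: read a (possibly gzipped) FASTQ file with ShortRead.
library(ShortRead)
reads <- readFastq("sample1.fastq.gz")  # returns a ShortReadQ object
sread(reads)    # the read sequences, as a DNAStringSet
quality(reads)  # the per-base quality scores
```

For very large files, ShortRead also has FastqStreamer() for processing the file in chunks.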
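For BAM files, here is a minimal sketch; readGAlignments() from the GenomicAlignments package builds on Rsamtools. The file name is hypothetical, and an index (.bai) file is assumed to be present:

```r
## Minimal sketch: read aligned reads from an indexed BAM file,
## restricted to a region so we do not load the whole file into memory.
library(GenomicAlignments)  # also loads Rsamtools and GenomicRanges
param <- ScanBamParam(which = GRanges("chr1", IRanges(1, 1e6)))
gal <- readGAlignments("sample1.bam", param = param)
gal
```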
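And as a sketch of the RNA sequencing case, gene-level counts can be computed with summarizeOverlaps() from GenomicAlignments; the TxDb annotation package and the BAM file name here are assumptions for illustration:

```r
## Minimal sketch: count reads overlapping each gene, producing a
## SummarizedExperiment ready for differential expression analysis.
library(GenomicAlignments)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)  # example gene model
genes <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene")
se <- summarizeOverlaps(genes, BamFileList("sample1.bam"), mode = "Union")
assay(se)  # the count matrix, one row per gene
```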
The types of data these tools produce are quite different from each other. So what you need is a set of versatile tools that can read different file formats and put the data into the form you want. For next-generation sequencing, the rtracklayer package is usually very useful. It supports BigWig and BigBed files, which we have seen before, and there are many domains where these files feature heavily. Then we have VCF files, which are used a lot when you have done genotyping; for VCF and BCF there is a whole package called VariantAnnotation that deals with this type of file format.

Finally, we have text files. A lot of pipelines produce text files, and it is up to you as a user to write a parser or two to read them in. Hopefully, most of the time they are easy to read in, but sometimes it can be a real pain. The main workhorse for reading rectangular data from text files is base R's read.table(). It has a lot of arguments, and there are long help pages and plenty of material on the internet about it; you can usually read in files without a lot of pain with read.table(). I also want to draw your attention to two more recent developments: the readr package, which has a set of versatile reading functions, and the data.table package, which has a very fast reading function, fread(). The readr package is a little slower than data.table, but in my opinion it is a little easier to use and a bit more versatile, so that is what I would recommend.

Finally, you can get data from publicly available datasets and databases. If you are ambitious you can upload your own data there, and you can download data using a package such as GEOquery, which we have dealt with. There is also a package called SRAdb that does the same thing for the Short Read Archive, and finally the ArrayExpress package that interfaces with the ArrayExpress database hosted at the EBI. A few short sketches of these readers are shown below. This was a quick overview; we have discussed some of this functionality already, and we will discuss some of it in other sessions.
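First, a minimal sketch of rtracklayer's import() function; the file names are hypothetical:

```r
## Minimal sketch: rtracklayer's import() picks a parser based on the
## file extension and returns standard Bioconductor objects.
library(rtracklayer)
peaks <- import("peaks.bb")    # BigBed: returns a GRanges
cov   <- import("coverage.bw") # BigWig: GRanges with a score column
covRle <- import("coverage.bw", as = "RleList")  # or run-length-encoded coverage
```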
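Next, a minimal sketch of reading a VCF file with VariantAnnotation; the file name and genome label are hypothetical:

```r
## Minimal sketch: read a VCF file into a CollapsedVCF object.
library(VariantAnnotation)
vcf <- readVcf("calls.vcf.gz", genome = "hg19")
rowRanges(vcf)  # variant positions as a GRanges
geno(vcf)$GT    # genotype calls, assuming the file records a GT field
```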
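For text files, here is a minimal sketch of the three readers mentioned above, with a hypothetical tab-delimited file:

```r
## Minimal sketch: three ways to read a rectangular, tab-delimited file.
df1 <- read.table("results.txt", header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)  # the base R workhorse

library(readr)
df2 <- read_tsv("results.txt")  # versatile and convenient

library(data.table)
df3 <- fread("results.txt")     # very fast
```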
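And finally, a minimal sketch of pulling a public dataset from GEO with GEOquery; the accession number here is just an example:

```r
## Minimal sketch: download a GEO series as analysis-ready ExpressionSets.
library(GEOquery)
eList <- getGEO("GSE11675")  # a list of ExpressionSets, one per platform
eset <- eList[[1]]
eset
```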