
Mapping strategies for sequence reads
Ernest Turro
University of Cambridge
21 Oct 2013

Quantification
A basic aim in genomics is working out the contents of a biological sample.
1. What distinct elements are in the sample?
2. How many copies of each element are in the sample?
RNA-seq:
1. What is the sequence of each distinct RNA molecule?
2. What is the concentration of each RNA molecule?
ChIP-seq:
1. What is the sequence/location of each binding site?
2. How frequently is each site bound in a population of cells?

Motivation
In an ideal world...
• we would sequence each molecule of interest from start to finish without breaks
• there would be no errors in the sequences
... and there would be an excess supply of biostatisticians
In the real world...
• molecules of interest need to be selected
• DNA/RNA needs to be shattered into fragments
• fragments need to be amplified
• # reads from a fragment is hard to control (0, 1 or more times)
• different parts of a class of molecules may be sequenced different numbers of times (leads to variation in coverage)
• there are sequencing errors

Imperfect data
The data consist of
• 1 or 2 read sequences from each fragment
• base call qualities for each base in each read
• meta-data (e.g. read → cDNA library)
On their own, unprocessed, these data are not very useful!
We have accumulated (prior) biological knowledge, including
• reference genome sequences
• genome annotations (gene structures, binding motifs, etc)
We must label (or map) reads to relate them to existing knowledge
• We wish to measure quantities pertaining to features (transcripts, binding sites)
• Hence we map reads → features

Mapping by alignment
A common technique for mapping is alignment:
Read:      AGTCGACTGATGAG
Reference: ...GCAGCAGCGATCGAGTCAGTCAGTCGACTGACGAGCGCGCGCATACGACT...
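
To make the alignment idea concrete, here is a minimal sketch that slides the read along the reference excerpt and keeps the offset with the fewest mismatches. It is an illustration only: the function names `hamming` and `best_alignment` are assumptions introduced here, offsets are relative to the excerpt shown (not to a real genome coordinate), and gaps (indels) are ignored. Real aligners use indexed data structures rather than a brute-force scan.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_alignment(read, reference):
    """Return (offset, mismatches) of the ungapped placement of `read`
    on `reference` with the fewest mismatches (first one wins on ties)."""
    best_offset, best_mismatches = -1, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = hamming(read, window)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Read and reference excerpt taken from the example above.
read = "AGTCGACTGATGAG"
reference = "GCAGCAGCGATCGAGTCAGTCAGTCGACTGACGAGCGCGCGCATACGACT"

offset, mismatches = best_alignment(read, reference)
print(offset, mismatches)                      # 21 1
print(reference[offset:offset + len(read)])    # AGTCGACTGACGAG
```

In this example the best placement has a single mismatch (T in the read against C in the reference), so an aligner that demanded exact matches would fail to place the read at all.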
Not always easy:
• Reads are ~100 bp long
• Genome is ~3,000,000,000 bp long and rather repetitive
• Reference genome ≠ sample genome (SNPs, indels, structural variants)
• Reads prone to errors (if lucky 1/1000 base calls are wrong)

Mapping ChIP-seq reads

ChIP-seq protocol
Crosslink and shear.

ChIP-seq read mapping
Add protein-specific antibody and immunoprecipitate.
Sequence one end of each fragment.
Genome alignment: read → binding site (or thereabouts)
[Figure: reads around a binding site, labelled "aligns directly" and "reverse complement aligns"; strand ends marked 5' and 3']

Mapping RNA-seq reads

RNA-seq typical protocol
• Select RNAs of interest (e.g. mRNAs (polyadenylated))
• Fragment and reverse-transcribe to ds-cDNA
• Size-select, denature to ss-cDNA
• Sequence n bases from one/both ends of fragments (typically n ∈ (50, 100) for Illumina)
[Figure: density plot; x-axis: Fragment size, y-axis: Density]

read 1               read 2
ATCACTCTACTACGCGC    ATCTACTATCACTATCAC
TACTATCGACTACTCTAC   TTAACTCCTATGTATCTC
TACTATCGACTACTCTAC   ACCCGATACTCGACTCT
...                  ...

Gene expression
Different kinds of RNAs (tRNAs, rRNAs, mRNAs, other ncRNAs...). Messenger RNAs of particular interest as they code for proteins.
[Figure: gene locus of a protein-coding gene, labelled: intergenic region, exon, intron, intergenic region]
[Figure: paternal gene locus and maternal gene locus, with positions marked *]
No one-to-one gene → mRNA mapping:
1. Alternative isoforms have distinct sequences
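
Returning to the ChIP-seq read mapping step: a single-end read may align directly or as its reverse complement, so a mapper has to try both orientations. The sketch below shows this for exact matches only. The names `reverse_complement` and `find_exact`, and the "+" / "-" strand labels, are assumptions chosen for illustration, and the reference string is the short excerpt from the alignment example rather than a real genome.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_exact(read, reference):
    """Return (offset, strand) for the first exact match of `read`,
    trying the read as given ("+") and then its reverse complement ("-").
    Return None if neither orientation matches."""
    pos = reference.find(read)
    if pos != -1:
        return pos, "+"
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None


reference = "GCAGCAGCGATCGAGTCAGTCAGTCGACTGACGAGCGCGCGCATACGACT"

# Hypothetical reads drawn from the same 14-base window of the excerpt,
# one from each strand.
forward_read = reference[21:35]
reverse_read = reverse_complement(reference[21:35])

print(find_exact(forward_read, reference))  # (21, '+')
print(find_exact(reverse_read, reference))  # (21, '-')
```

Both reads report the same offset because the position is expressed on the reference strand; only the strand label differs.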
