Quality control of FASTQ data
Practical: exome detection
Outline
  Theory 1: the FASTQ format and quality encoding
  Practical session 1: quality conversion (file illumina.fastq)
  Theory 2: quality control and the FastQC tool
  Practical session 2: data cleaning (dataset pickrel exon chr12.fastq)
FASTQ
  1 sequence = 4 lines in the file
  Line 1 = sequence identifier
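
The 4-lines-per-sequence layout can be sketched in a few lines of Python. This is a minimal illustration, not the parser used by any particular tool; the names `FastqRecord` and `read_fastq` are hypothetical, and it assumes the usual convention that line 1 starts with `@`, line 2 holds the bases, line 3 starts with `+`, and line 4 holds one quality character per base.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, one character per base


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record for every group of 4 lines (hypothetical helper)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if quality is None:
            raise ValueError(f"truncated record starting at {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Usage on an in-memory example; a real file would be passed as open(path).
records = list(read_fastq(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```
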
Quality
  Line 4 = quality
  Quality = a computed score
  Two score calculations exist (Pe: estimated probability of error):
    Phred:  Q = -10 log10(Pe)
    Solexa: Q = -10 log10(Pe / (1 - Pe))
Encoding the quality
  Scores are encoded as ASCII characters (e.g. '%' => 37)
  Several encodings exist: S (Sanger), X (Solexa), I (Illumina 1.3+), J (Illumina 1.5+), L (Illumina 1.8+)
  Each encoding corresponds to a score computed with the Phred or the Solexa formula
  33 or 64 is added to this score
  The resulting value is converted to an ASCII character and written to the file
  Example for Sanger encoding, from probability to ASCII character:
    90% probability that the base is correct (Pe = 0.1) -> Phred score 10 -> 10 + 33 = 43 -> ASCII 43 is '+'
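
The worked example above can be checked with a short Python sketch. The function names are illustrative only; the formulas are the standard Phred and Solexa definitions, with Pe the estimated probability of error.

```python
import math


def phred_score(p_error: float) -> int:
    """Phred score: Q = -10 * log10(Pe), rounded to an integer."""
    return round(-10 * math.log10(p_error))


def solexa_score(p_error: float) -> int:
    """Solexa score: Q = -10 * log10(Pe / (1 - Pe)), rounded to an integer."""
    return round(-10 * math.log10(p_error / (1 - p_error)))


def encode(score: int, offset: int = 33) -> str:
    """Add the offset (33 or 64) and convert the value to its ASCII character."""
    return chr(score + offset)


# Sanger example: 90% chance the base is right, so Pe = 0.1.
sanger_char = encode(phred_score(0.1), offset=33)
```

With Pe = 0.1 the Phred score is 10, and 10 + 33 = 43 is the code of `'+'`, as on the slide.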
Several encodings exist, but tools accept only one format
  FASTQ Groomer: converts the qualities
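
To give an idea of what such a conversion involves, here is a minimal sketch that re-encodes a quality string from a Phred+64 encoding to Phred+33 (Sanger). It is not FASTQ Groomer's code, and `convert_offset` is a hypothetical name. It only shifts the offset, so it assumes both sides use Phred scores; Solexa scores would also need a change of formula.

```python
def convert_offset(quality: str, from_offset: int = 64, to_offset: int = 33) -> str:
    """Re-encode a quality string by changing the ASCII offset only."""
    scores = [ord(char) - from_offset for char in quality]
    if scores and min(scores) < 0:
        # A negative score means the input was not encoded with from_offset.
        raise ValueError("quality string is not compatible with from_offset")
    return "".join(chr(score + to_offset) for score in scores)


# 'h' is Q40 and 'B' is Q2 with offset 64; they become 'I' and '#' with offset 33.
converted = convert_offset("hhB")
```
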
Practical session 1
  Click Shared Data, then Published Histories
  Click TP-QC Olivier
  Click Import history
  View the contents of the file illumina.fastq (eye icon)
  Which encoding do these data use?
Checking the encoding
  In the tool search box, type FastQC
  Select the tool FastQC: Read QC
  Select the file illumina.fastq
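
Outside Galaxy, the encoding can also be guessed from the range of ASCII characters found on the quality lines. The sketch below is a rough heuristic under stated assumptions, not the method FastQC uses: Phred+33 strings can contain characters from `'!'` (33) upwards, while the +64 encodings never go below `';'` (59). The function name `guess_offset` is hypothetical.

```python
from typing import Iterable


def guess_offset(quality_strings: Iterable[str]) -> int:
    """Guess the ASCII offset (33 or 64) from the lowest character seen."""
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    # Characters below ';' (59) cannot occur in a +64 encoding.
    if lowest < 59:
        return 33
    # Assumption: no low character at all means a +64 encoding. A Phred+33 file
    # containing only very high scores would be misclassified here.
    return 64


offset_sanger = guess_offset(["IIII#", "IIIHG"])
offset_illumina = guess_offset(["hhhhB", "hhhgf"])
```
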
  In the tool search box, type FastQ Groomer
  Select FASTQ Groomer convert between various FASTQ quality formats
  For File to groom, select the file illumina.fastq
  What is the encoding of these data now?
FastQC
  A quality control tool for high throughput sequence data
  Quality control of sequenced data
  Several analyses are run on the data; each analysis is flagged as pass, warning, or failure
Per base sequence quality
  X = position in read
  Y = quality score
  Box-and-whisker plot
  Background colours: green, orange, red
  The quality of calls degrades as the run progresses
Per base sequence quality
  Warning: the lower quartile for any base is < 10, or the median for any base is < 25
  Failure: the lower quartile for any base is < 5, or the median for any base is < 20
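
The two threshold pairs above (the first read as the warning condition, the second as the failure condition) can be applied to decoded scores as follows. This is a simplified sketch, not FastQC's implementation: it assumes all reads have the same length and at least two reads are present, and the quartile method is an arbitrary choice. The name `per_base_quality_flag` is hypothetical.

```python
from statistics import median, quantiles  # quantiles requires Python 3.8+
from typing import Sequence


def per_base_quality_flag(reads_scores: Sequence[Sequence[int]]) -> str:
    """Return 'pass', 'warn' or 'fail' from per-position quality scores."""
    flag = "pass"
    # zip(*...) turns the list of reads into one column of scores per position.
    for column in zip(*reads_scores):
        lower_quartile = quantiles(column, n=4, method="inclusive")[0]
        middle = median(column)
        if lower_quartile < 5 or middle < 20:
            return "fail"
        if lower_quartile < 10 or middle < 25:
            flag = "warn"
    return flag


# Quality drops at the last position: median 23.5 there, hence a warning.
example_flag = per_base_quality_flag(
    [[30, 30, 30], [32, 31, 22], [34, 30, 24], [33, 29, 23]]
)
```
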
Per sequence quality scores
  X = quality score
  Y = number of sequences
  Shows whether a proportion of the sequences in a run have low quality, which would indicate a systematic problem (one end of a flowcell, ...)
Per sequence quality scores
  Warning: the most frequently observed mean quality is < 27 (0.2% error rate)
  Failure: the most frequently observed mean quality is < 20 (1% error rate)
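
The link between the scores 27 and 20 and the quoted error rates follows from Pe = 10^(-Q/10). The sketch below shows that conversion and a simplified version of the check; taking the integer part of each read's mean quality is an assumption of this sketch, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def error_rate(score: float) -> float:
    """Invert the Phred formula: Pe = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def per_sequence_quality_flag(reads_scores: Sequence[Sequence[int]]) -> str:
    """Flag the run from the most frequent mean quality per read."""
    means = [sum(scores) // len(scores) for scores in reads_scores]
    most_frequent, _count = Counter(means).most_common(1)[0]
    if most_frequent < 20:
        return "fail"
    if most_frequent < 27:
        return "warn"
    return "pass"


rate_q27 = error_rate(27)  # about 0.002, i.e. 0.2%
rate_q20 = error_rate(20)  # 0.01, i.e. 1%
```
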
Per base sequence content
  X = position in read
  Y = sequence content (%T, %C, %A, %G)
  In a random library: little to no difference between the different bases of a sequence run
  Detects overrepresented sequences (contamination)
Per base sequence content
  Warning: the difference between A and T, or between G and C, is > 10% at any position
  Failure: the difference between A and T, or between G and C, is > 20% at any position
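
As an illustration of this check, the following sketch computes the percentage of each base at every position and compares A with T and G with C. It assumes reads of equal length and treats the 10% and 20% thresholds as the warning and failure limits; `per_base_content_flag` is a hypothetical name, not a FastQC function.

```python
from typing import Sequence


def per_base_content_flag(sequences: Sequence[str]) -> str:
    """Return 'pass', 'warn' or 'fail' from per-position base percentages."""
    flag = "pass"
    for column in zip(*sequences):  # one tuple of bases per read position
        percent = {base: 100 * column.count(base) / len(column) for base in "ACGT"}
        difference = max(
            abs(percent["A"] - percent["T"]),
            abs(percent["G"] - percent["C"]),
        )
        if difference > 20:
            return "fail"
        if difference > 10:
            flag = "warn"
    return flag


# 20 single-base reads: A 35%, T 20%, C 25%, G 20%, so A and T differ by 15%.
example_flag = per_base_content_flag(["A"] * 7 + ["T"] * 4 + ["C"] * 5 + ["G"] * 4)
```
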
Per base GC content
  X = position in read
  Y = sequence content (%GC)
  In a random library: little to no difference between the different bases of a sequence run
  Detects overrepresented sequences (contamination)
Per base GC content
  Warning: the GC content of any base strays more than 5% from the mean GC content
  Failure: the GC content of any base strays more than 10% from the mean GC content
Per sequence GC content
  X = mean GC content
  Y = number of sequences
  A normal distribution is computed (blue)
  The raw data are plotted (red)
  An unusually shaped distribution could indicate a contaminated library or some other kind of biased subset
Per sequence GC content
  Warning: the sum of the deviations from the normal distribution represents > 15% of the reads
  Failure: the sum of the deviations from the normal distribution represents > 30% of the reads
Per base N content
  X = position in read
  Y = N content
  It is not unusual to see a very low proportion of Ns in a sequence, especially nearer the end of a sequence.
  However, if this proportion rises above a few percent, it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.
Per base N content
  Warning: any position shows an N content of > 5%
  Failure: any position shows an N content of > 20%
Sequence length distribution
  X = sequence length
  Y = number of sequences
  Detects sequences trimmed by the pipelines (to remove poor quality)
Sequence length distribution
  Warning: not all sequences are the same length
  Failure: any of the sequences has zero length
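
These two conditions are simple enough to express directly. The sketch below assumes the first condition triggers a warning and the second a failure; `length_distribution_flag` is a hypothetical name.

```python
from collections import Counter
from typing import Sequence


def length_distribution_flag(sequences: Sequence[str]) -> str:
    """Return 'pass', 'warn' or 'fail' from the distribution of read lengths."""
    lengths = Counter(len(sequence) for sequence in sequences)
    if 0 in lengths:
        return "fail"  # at least one empty sequence
    if len(lengths) > 1:
        return "warn"  # reads of different lengths, e.g. after trimming
    return "pass"


example_flag = length_distribution_flag(["ACGT", "ACG", "ACGT"])
```
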
Sequence duplication level
  X = sequence duplication level
  Y = proportion of non-unique vs. unique sequences
  A low level of duplication may indicate a very high level of coverage
  A high level of duplication indicates some kind of enrichment bias
Sequence duplication level
  Warning: non-unique sequences make up more than 20% of the total
  Failure: non-unique sequences make up more than 50% of the total
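
A rough way to reproduce this measure is to count the reads whose exact sequence occurs more than once. The sketch below makes that interpretation of "non-unique" explicit as an assumption and counts every read, whereas FastQC works on a sample of the file; `duplication_level` is a hypothetical name.

```python
from collections import Counter
from typing import Sequence, Tuple


def duplication_level(sequences: Sequence[str]) -> Tuple[float, str]:
    """Return the percentage of non-unique reads and the resulting flag."""
    counts = Counter(sequences)
    # Assumption: a read is non-unique when its exact sequence occurs > 1 time.
    non_unique = sum(count for count in counts.values() if count > 1)
    percent = 100 * non_unique / len(sequences)
    if percent > 50:
        return percent, "fail"
    if percent > 20:
        return percent, "warn"
    return percent, "pass"


# 2 of the 5 reads share the same sequence: 40% non-unique.
example_percent, example_flag = duplication_level(["AC", "AC", "CG", "GT", "TA"])
```
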
Overrepresented sequences
  Lists all of the sequences which make up more than 0.1% of the total
  Looks for matches in a database of common contaminants
Overrepresented sequences
  Warning: any sequence is found to represent more than 0.1% of the total
  Failure: any sequence is found to represent more than 1% of the total
Overrepresented k-mers
  Why k-mers (5-mers)?
  With long sequences of poor quality, sequencing errors reduce the counts for exactly duplicated sequences.
  A partial sequence which appears at a variety of places will not be seen by the per base content plot or by the duplicate sequence analysis.
Overrepresented k-mers
  A graph for the top 6 hits shows the enrichment of each k-mer across the length of your reads.
  This will show if you have a general enrichment, or if there is a pattern of bias at different points over your read length.
Overrepresented k-mers
  Based on the base content of the library, FastQC calculates an expected level at which each k-mer should have been seen
  It uses the actual count to calculate an observed/expected ratio for that k-mer
Overrepresented k-mers
  Warning: any k-mer is enriched more than 3 fold overall, or more than 5 fold at any individual position
  Failure: any k-mer is enriched more than 10 fold at any individual base position
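
The observed/expected ratio can be approximated as follows. This sketch only computes the overall ratio, not the per-position one, and it assumes the expected count of a k-mer is the product of the library-wide frequencies of its bases times the number of k-mers counted. It is not FastQC's code, and `kmer_obs_exp` is a hypothetical name.

```python
from collections import Counter
from math import prod  # requires Python 3.8+
from typing import Dict, Sequence


def kmer_obs_exp(sequences: Sequence[str], k: int = 5) -> Dict[str, float]:
    """Return the overall observed/expected ratio for every k-mer seen."""
    base_counts = Counter(base for sequence in sequences for base in sequence)
    total_bases = sum(base_counts.values())
    frequency = {base: count / total_bases for base, count in base_counts.items()}

    observed = Counter(
        sequence[i:i + k]
        for sequence in sequences
        for i in range(len(sequence) - k + 1)
    )
    total_kmers = sum(observed.values())

    return {
        kmer: count / (total_kmers * prod(frequency[base] for base in kmer))
        for kmer, count in observed.items()
    }


# Toy example with k=2: 'AA' is seen 3 times out of 6 k-mers, expected 1.5 times.
ratios = kmer_obs_exp(["AAAA", "CCCC"], k=2)
```
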
Practical session 2
  Warning: processing the chr12 exon data takes a considerable amount of time
  The results are available in the history groomer + fastqc on chr12
  Click Shared Data, then Published Histories
  Click groomer + fastqc on chr12
  Click Import history
  Which tool was used to produce dataset no. 2?
  View the results
Dataset
  Public data: exome sequenced by the International HapMap Project
  Single-end reads of 100 bp, Illumina Genome Analyzer IIx
  RNA-seq data for this exome are available (Pickrell et al., Nature, 2010)
  Starting from dataset 9, look at the tool that was used
  What does a window size of 1 mean?
  Why was the quality value 28 chosen?
  Identify trimmed reads
  Which quality values were removed?
FastQ Quality Trimmer
Simple trimming of the ends
  ATCCTTTATAAATAATTAATA   Min qual <= 28?
  ATCCTTTATAAATAATTAAT    Min qual <= 28?
  ATCCTTTATAAATAATTAA     Min qual <= 28?
  ...                     Min qual <= 28?
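
The loop shown on this slide can be sketched as below for a window of size 1. The comparison (remove the last base while its quality is <= 28) follows the test written on the slide and is an assumption about how the tool was configured; the Sanger offset of 33 and the name `trim_3prime` are also assumptions of this sketch, which is not the code of FastQ Quality Trimmer.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str, threshold: int = 28,
                offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end while the last base has quality <= threshold."""
    end = len(sequence)
    # Window of size 1: the minimum quality of the window is the base itself.
    while end > 0 and ord(quality[end - 1]) - offset <= threshold:
        end -= 1
    return sequence[:end], quality[:end]


# 18 bases at Q40 ('I'), then Q28 ('='), Q20 ('5') and Q2 ('#') at the 3' end.
read = "ATCCTTTATAAATAATTAATA"
trimmed_sequence, trimmed_quality = trim_3prime(read, "I" * 18 + "=5#")
```

In this example the last three bases are removed, because their qualities (28, 20 and 2) are all at or below the threshold.
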
Quality scores after trimming