File Formats
Base calling: The process of inferring the order of nucleotides in a template.
FASTA
gi|21434723
>NR_024570.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence
AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG
CAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGA
TAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGCACAAAGAGGGGGACCTTAGGGCCTCTT
GCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAG
CTGGTCTGAGAGGATGACCAGCAACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTG
GGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCNGCGTGTATGAAGAAGGCCTTCGGGTTGT
AAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAA
GCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGC
GTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTG
ATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCT
GGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGC
AAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCG
TGGCTTCCGGANNTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAA
TTGACGGGGGCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTC
TTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCT
GTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCA...
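The record layout above (a ">" header line followed by wrapped sequence lines) can be read with a few lines of Python. This is only a minimal sketch: the function name parse_fasta is an assumption made for illustration, the record key is assumed to be the first word of the header, and only the first two sequence lines of the example are used to keep it short.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word of each '>' header line."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].split()[0]  # e.g. "NR_024570.1"
            records[name] = ""
        elif name is not None:
            records[name] += line  # join wrapped sequence lines
        # lines that appear before the first header are ignored
    return records


# First two sequence lines of the record shown above (truncated for brevity).
example = [
    ">NR_024570.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence",
    "AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG",
    "CAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGA",
]
records = parse_fasta(example)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

For real work an established parser is usually a safer choice than a hand-rolled one; the sketch is only meant to make the format concrete.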
AB1 format
FASTQ
@HWUSI-EAS100R:6:73:941:1973#0/1
@InstrumentID : Flow-cell lane : Tile : Cluster x-coordinate : Cluster y-coordinate # Index / Read
@SRR22388518.1 1 length=251
NACTGAAAAACAACAAAAAGCGTTAATCAGTGCGTATAAAAGCGGATTTGACCCTAAAAATGCGGACAAAGTCGCTCAATATTGGCAAAACAAACCCACTAAAATAGACTTACATAAACCTATAAAAACTAAAGACTTCTTTAAAGGGAATACTAATATTTATAGGACACTTCGCAATTTATTTGGACAAAAATTTATGGATAGCTATATTGCTCCTAAAAGTGAAACCACAATGAAAGACTTTATGTCTA
+SRR22388518.1 1 length=251
#<<DDEHIICEHHIHIIIIIIHIHE<GHHIFHHIEHDHHIIHHIIIIIH?1CGHIHIIIIIGEHIIHEHFHIHIIIIDHHIIGHFHIEFEHEFGEGHIIFIGHHHIHIHIIIIIIIIHHIIIIGHIIHHIIIGHHIIHHHHHHHIIICEHHHHHHHIIIFHIHIIECHHHFCEHFHDHHHHFHE?FEHIGIIGHHEGFHGHIEEEHIIIIIBGHHHIHHIFHIFHHH6..6G@HEHHE8F.BFHHHFGHA8
@SRR22388518.2 2 length=251
ATATTCAAGCTATCGGTCCTCATGTAAGTGATCACCCCCATAACGCCTTGTGGGGTGGCTACGCCTTCATATAATTTTTGAGCGATACTCATGGTTTTTGTGGGCGAAAAGCCTAAAAGACTGGAAGCGCTTTGCTGTAAAGTAGAAGTCATGAAAGGGGGCGGTGTGGGGGATTTTTTAGACTTTTTAACGATACTAGAGATAGTGTAGCTTTCTTTTTCCAGTTCGTTTTTAATCTCTTGGGCTTTTTT
+SRR22388518.2 2 length=251
DDDDDIIIIIIIIIIIHIIIIIHIIIIIIIIIIIIIIIIGIIIIIIHIHHIHHHHIIIIIHIIIIIIIHHEHIIIHIHIHIIIHGIIIIIIIIH<D<EHGHGHIHIHHHIIIIHHHHIHIIIIIIHGHHGIIIIHHIIIIGHHIIIHICHH?HHGFCEHHGDHC?CEGIGHHHHE?G?HEFGEHBGGHHECHHHIGHIIIH?GHHH-8@HICHHHIHHII@@@FHIHHBHFHEH@GHIHGG?B@56FEHHH
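Each FASTQ record above occupies four lines: an "@" header, the sequence, a "+" separator, and a quality string of the same length as the sequence. A minimal reader could look like the sketch below. The name parse_fastq is an assumption, the sketch assumes unwrapped four-line records, and the example record is the first read above cut down to its first 10 bases for brevity.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:].split()[0], seq, qual


# First read from the example above, truncated to its first 10 bases.
example = [
    "@SRR22388518.1 1",
    "NACTGAAAAA",
    "+SRR22388518.1 1",
    "#<<DDEHIIC",
]
reads = list(parse_fastq(example))
print(reads)
```

The length check matters because the quality string can itself contain "@" and "+" characters, so records cannot be split reliably on those symbols alone.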
Phred Quality Score (Q Score)
| Phred Score | Probability of incorrect base call | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10000 | 99.99% |
| 50 | 1 in 100000 | 99.999% |
| 60 | 1 in 1000000 | 99.9999% |
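
The rows of the table follow from the standard Phred definition Q = -10 * log10(P), where P is the probability of an incorrect base call. The sketch below converts in both directions; the function names are chosen here for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Recompute the table rows.
for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {(1 - p) * 100:.4f}%")
```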
Encoding Q Scores
Quality scores are written as printable ASCII characters, decimal values 33 (!) to 126 (~):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
| Platform | Encoding | Typical scores | ASCII decimal values |
|---|---|---|---|
| Sanger | Phred+33 | raw reads typically (0, 40) | 33 to 73 |
| Solexa | Solexa+64 | raw reads typically (-5, 40) | 59 to 104 |
| Illumina 1.3+ | Phred+64 | raw reads typically (0, 40) | 64 to 104 |
| Illumina 1.5+ | Phred+64 | raw reads typically (3, 41) | 67 to 105 |
| Illumina 1.8+ | Phred+33 | raw reads typically (0, 41) | 33 to 74 |
| Nanopore | Phred+33 | duplex reads typically (0, 65 + 90) | 33 upward |
| PacBio | Phred+33 | HiFi reads typically (0, 93) | 33 to 126 |
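
Decoding a quality string amounts to subtracting the encoding offset (33 or 64) from each character's ASCII value. The sketch below assumes Phred+33 by default. The helper names are illustrative, and guess_offset is only a rough heuristic based on the ranges listed above: any character below ASCII 59 cannot come from a +64 encoding.

```python
from typing import List, Optional


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Turn a quality string into integer scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in qual]


def encode_quality(scores: List[int], offset: int = 33) -> str:
    """Inverse of decode_quality."""
    return "".join(chr(s + offset) for s in scores)


def guess_offset(qual: str) -> Optional[int]:
    """Return 33 when a character below ASCII 59 rules out +64; None if undecided."""
    return 33 if qual and min(map(ord, qual)) < 59 else None


# First 10 quality characters of read SRR22388518.1 from the FASTQ example.
print(decode_quality("#<<DDEHIIC"))  # [2, 27, 27, 35, 35, 36, 39, 40, 40, 34]
```

A string made up only of characters from ";" upward is ambiguous on its own, which is why the heuristic returns None rather than guessing.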
FAST5 format
SAM/BAM
| Col | Field | Type | Description |
|---|---|---|---|
| 1 | QNAME | String | Query template NAME |
| 2 | FLAG | Int | bitwise FLAG |
| 3 | RNAME | String | Reference sequence NAME |
| 4 | POS | Int | 1-based leftmost mapping POSition |
| 5 | MAPQ | Int | MAPping Quality |
| 6 | CIGAR | String | CIGAR string |
| 7 | RNEXT | String | Ref. name of the mate/next read |
| 8 | PNEXT | Int | Position of the mate/next read |
| 9 | TLEN | Int | observed Template LENgth |
| 10 | SEQ | String | segment SEQuence |
| 11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 |
BAM: Binary Alignment Map file, the compressed binary form of a SAM file
Header Section (Optional)
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
@RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
@PG ID:bwa VN:0.5.4
@PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, CycleCovariate, DinucCovariate, TileCovariate], default_read_group=null, default_platform=null, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, exception_if_no_tile=false, ignore_nocall_colorspace=false, pQ=5, maxQ=40, smoothing=1
Alignment Section
1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>
19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9
19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><<
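Each alignment line carries the 11 mandatory SAM fields in column order. A minimal sketch of splitting one line into named fields follows. It assumes whitespace-separated fields as displayed here (real SAM files are tab-delimited and may carry optional fields after column 11), and the name parse_alignment is made up for illustration.

```python
from typing import Dict, Union

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment(line: str) -> Dict[str, Union[str, int]]:
    """Map the 11 mandatory SAM columns to their field names."""
    parts = line.split()  # assumption: no spaces inside the mandatory fields
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 fields")
    record: Dict[str, Union[str, int]] = dict(zip(SAM_FIELDS, parts))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record


# Third alignment line from the example above.
line = ("19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 "
        "GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT "
        ";44999;499<8<8<<<8<<><<<<><7<;<<<>><<")
rec = parse_alignment(line)
print(rec["QNAME"], rec["FLAG"], rec["POS"], rec["CIGAR"], rec["TLEN"])
```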
| Field | Alignment 1 | Alignment 2 | Alignment 3 |
|---|---|---|---|
| QNAME | 1:497:R:-272+13M17D24M | 19:20389:F:275+18M2D19M | 19:20389:F:275+18M2D19M |
| FLAG | 113 | 99 | 147 |
| RNAME | 1 | 1 | 1 |
| POS | 497 | 17644 | 17919 |
| MAPQ | 37 | 0 | 0 |
| CIGAR | 37M | 37M | 18M2D19M |
| MRNM/RNEXT | 15 | = | = |
| MPOS/PNEXT | 100338662 | 17919 | 17644 |
| ISIZE/TLEN | 0 | 314 | -314 |
| SEQ | CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG | TATGACTGCTAATAATACCTACACATGTTAGAACCAT | GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT |
| QUAL | 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> | >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 | ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< |
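
The FLAG and CIGAR values in the table are compact encodings: FLAG is a sum of bit values defined by the SAM specification, and CIGAR is a run-length list of alignment operations. The sketch below unpacks both. The short label strings and the function names are chosen here for readability and are not official names.

```python
import re
from typing import List, Tuple

# Bit values from the SAM specification, lowest bit first; labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (operations M, D, N, = and X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


for flag in (113, 99, 147):
    print(flag, decode_flag(flag))
print(parse_cigar("18M2D19M"), reference_span("18M2D19M"))
```

As a consistency check on the example: alignment 3 starts at 17919 and spans 39 reference bases, so it ends at 17957, and 17957 - 17644 + 1 = 314, which matches the TLEN of the pair.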
GenBank (.gbk .gb)
GFF3
Terminology
| Term | Definition |
|---|---|
| Read | A sequence of nucleotides obtained by sequencing a fragmented genome |
| Insert Size | The actual length of DNA that is inserted between the adapters: Read1 + inner distance (if any) + Read2 |
| Mate-pairs | Reads constructed from DNA fragments with longer insert sizes (2-20 kb). Illumina mate-pairs are constructed from 3-5 kb DNA fragments. |
| Contigs & Scaffolds | Contig (contiguous sequence): when two sequences overlap at their ends (known as a "dove-tail" overlap), they can be collapsed into a single, non-redundant sequence. Scaffold (supercontig): formed when an association can be made between two contigs that have no sequence overlap. |
| Sequencing Coverage | The average number of reads that align to, or "cover," known reference bases. The Lander/Waterman equation is a method for computing genome coverage. The general equation is C = LN / G, where C is coverage, L is read length, N is the number of reads and G is the length of the genome. |
| N50 | The shortest contig/scaffold length at which 50% of the assembled bases reside in contigs/scaffolds of that length or longer. If an assembly has an N50 of 0.8 Mb, 50% of the assembled bases are in contigs/scaffolds of 0.8 Mb and above. Example: for 9 contigs with lengths of 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1 Mb, the total assembly length is 5.4 Mb (sum of all contigs), half of which is 2.7 Mb. The 3 largest contigs (1 + 0.9 + 0.8 = 2.7 Mb) hold 50% of the bases, so N50 = 0.8 Mb. L50 is the smallest number of contigs that comprise 50% of the assembly; here, L50 = 3. |
| Draft Genome | A genome sequence that is not yet finished but is of generally high quality. Usually more than 90% of its bases are high quality. May include fragments connected with Ns. |
| Gaps | Regions of the genome for which no sequence is currently available. Gaps may occur both within and between genomic scaffolds. |
| Genome Annotation | A multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements |
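
The coverage equation C = LN / G and the N50/L50 worked example from the table can be reproduced with the sketch below. Contig lengths are given in kb as integers to avoid floating-point rounding. The function names and the read count and genome size in the coverage call are hypothetical values picked only for illustration.

```python
from typing import List, Tuple


def coverage(read_length: int, num_reads: int, genome_length: int) -> float:
    """Lander/Waterman coverage: C = L * N / G."""
    return read_length * num_reads / genome_length


def n50_l50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    # Walk from the largest contig down until half of the bases are covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, count
    raise AssertionError("unreachable for a non-empty list")


# The 9 contigs from the N50 example, in kb (0.2 Mb ... 1 Mb).
contigs_kb = [200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(n50_l50(contigs_kb))  # (800, 3): N50 = 0.8 Mb, L50 = 3

# Hypothetical run: 150 bp reads, 1,000,000 reads, 5 Mb genome.
print(coverage(150, 1_000_000, 5_000_000))  # 30.0
```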