Course Homepage
Sept 2018
This project is maintained by BIG-SA
Reads that have been aligned to a reference are no longer stored in fastq format but are stored in either SAM or BAM format. These two formats are virtually identical, however the SAM format is a text file which is easily readable to human eyes, whilst a BAM file is the same information converted to binary. This conversion means that file sizes are smaller, and that computational processes can be performed more efficiently. Typically, we work with BAM files as these provide gains in storage space & analytic speed. The tools we use to inspect these files are provided in the package samtools, which has been installed on your VM.
The reads from the previous dataset which mapped to chrI of C. elegans are in the folder ~/WGS/02_trimmedData/fastq
cd ~/WGS/02_trimmedData/fastq
ls
The BAM format is much more convenient computationally, so we have converted our alignments into BAM format using samtools view
during the alignment process.
SAM files are plain text, whilst BAM files are compressed and much easier for the computer to read/write.
As BAM files are in binary format they will look like gibberish if we try to read them directly.
Instead we can inspect them by using samtools view
as in the line above.
There is also a header seection to each file which details the fasta files used in the alignments. To view this header we use
samtools view SRR2003569_chI.bam | head
As this data can easily spill across lines, it might be helpful to maximise your terminal to try and see the complete line structure.
If we understand what information is contained within a file, we can know what decisions to make as we progress with our analysis, so let’s have a look at what the data structure is for a SAM/BAM file.
A SAM/BAM file is tab-delimited
, which means that each field is separated by a tab, giving a data structure effectively consisting of columns (or fields).
In order, these are:
1 | QNAME | Query template/pair NAME |
2 | FLAG | bitwise FLAG |
3 | RNAME | Reference sequence NAME |
4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
5 | MAPQ | MAPping Quality (Phred-scaled) |
6 | CIGAR | extended CIGAR string |
7 | MRNM | Mate Reference sequence NaMe (`=’ if same as RNAME) |
8 | MPOS | 1-based Mate POSistion |
9 | TLEN | inferred Template LENgth (insert size) |
10 | SEQ | query SEQuence on the same strand as the reference |
11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
Several of these fields contain useful information, so looking the the first few lines which we displayed above, you can see that these reads are mapped in pairs as consecutive entries in the QNAME field are often (but not always) identical. Most of these fields are self-explanatory, but some require exploration in more detail.
These are quite useful pieces of information, but can be difficult at first look. Head to http://broadinstitute.github.io/picard/explain-flags.html to see a helpful description. The simplest way to understand these is that it is a bitwise system so that each description heading down the page increases ina binary fashion. The first has value 1, the second has value 2, the third has value 4 & so on until you reach the final value of 2048. The integer value contained in this file is the unique sum of whichever attributes the mapping has. For example, if the read is paired \& mapped in a proper pair, but no other attributes are set, the flag field would contain the value 3.
Things can easily begin to confuse people once you start searching for specific flags, but if you remember that each attribute is like an individual flag that is either on or off (i.e. it is actually a binary bit with values 0 or 1). If you searched for flags with the value 1, you wouldn’t obtain the alignments with the exact value 1, rather you would obtain the alignments for which the first flag is set & these can take a range of values.
Let’s try this using the command samtools view
with the option -f N
to include reads with a flag set and the option -F N
to exclude reads with a specific flag set.
Let’s get the first few reads which are mapped in a proper pair, so the flag 2
will be set.
samtools view -f 2 SRR2003569_chI.bam | head
Note that none of the flags actually have the value 2, but if you typed the values 99, 147 or 163 into the webpage, you’ll see that this flag is set for all of these values. Similarly if we wanted to extract only the reads which are NOT mapped in a proper pair we would change the option to a upper-case F.
samtools view -F 2 SRR2003569_chI.bam | head
Again, try entering a few of these sample values into the webpage and you will see that this flag is not set for any of these values.
This can be a very helpful tool for extract subsets of your aligned reads. For example, we can create a new BAM file with only the reads which were aligned in a proper pair by entering the following command.
samtools view -f 2 -bo SRR2003569_chI.bam
ls -lh
You can pull out highly specific combinations of alignments should you so choose
These give useful information about the type of alignment that has been performed on the read.
In the first few reads we called up earlier, most had the value ..M
where ..
is some number.
These are the perfect Matches, where the sequence has aligned exactly.
The other abbreviations in common use are I (insertion), D (deletion) & S (soft-clipping).
Soft-clipping is a strategy used by aligners to mask mismatches, so these are often analagous to substitutions.
Hard-clipping (H) is an alternative strategy, but the difference between the two is beyond the scope of today.
What is the interpretation of the first CIGAR
string in your set of alignments?