Next-generation sequencing (NGS) techniques have brought new insights into -omics data analysis, mostly thanks to their reliability in detecting biological variants. This reliability is usually measured using a value called the Phred quality score (or Q score).
The Phred score of a base is an integer value that represents the estimated probability of an error in base calling. Mathematically, a Q score is logarithmically related to the base-calling error probability P, and can be calculated using the following formula:
Q = -10 · log10(P)
In practice, a quality score of 20 means that there is a 1 in 100 chance that the base is incorrect; a quality score of 40 means that the chance of the base being called incorrectly is 1 in 10000.
Because the Phred score is inversely related to the error probability, a higher Q score means a more accurate and therefore more reliable base call. The following table shows this relationship:
|Phred quality score||Probability of incorrect base call||Base call accuracy|
|10||1 in 10||90%|
|20||1 in 100||99%|
|30||1 in 1000||99.9%|
|40||1 in 10000||99.99%|
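The relationship between Q and P can be checked with a few lines of Python. This is only a minimal sketch: the function names `phred_to_error_prob` and `error_prob_to_phred` are chosen here for illustration and do not come from any library.

```python
import math


def phred_to_error_prob(q):
    """Return the estimated base-calling error probability P for a Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Return the Phred score Q = -10 * log10(P) for an error probability P."""
    return -10 * math.log10(p)


# Recompute the rows of the table for the same four scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    accuracy = (1 - p) * 100
    print(f"Q{q}: error probability = {p:g}, accuracy = {accuracy:g}%")
```

Going in the other direction, `error_prob_to_phred(0.001)` gives a score of 30, up to floating-point rounding.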
In FASTQ files, Phred quality scores are encoded as ASCII characters, so that the quality score of each base takes up a single character. While older Illumina data used an offset of 64 (ASCII_BASE 64), nowadays the ASCII_BASE 33 table is the de facto standard for NGS data:
|Q Score||ASCII char||Q Score||ASCII char||Q Score||ASCII char||Q Score||ASCII char|
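The ASCII_BASE 33 mapping can also be regenerated programmatically, since each character is simply the one whose ASCII code is the Q score plus 33. The sketch below prints it in four column pairs; the helper name `phred_to_ascii` and the range of scores shown (0 to 41) are assumptions made for illustration.

```python
PHRED33_OFFSET = 33  # ASCII_BASE 33
PHRED64_OFFSET = 64  # ASCII_BASE 64, used by older Illumina data


def phred_to_ascii(q, offset=PHRED33_OFFSET):
    """Return the ASCII character that encodes Phred score q for a given offset."""
    return chr(q + offset)


# Assumed range of scores for this illustration: 0 to 41 inclusive.
scores = list(range(0, 42))
n_columns = 4
n_rows = -(-len(scores) // n_columns)  # ceiling division

# Each printed row holds up to four "Q score / ASCII char" pairs.
for row in range(n_rows):
    cells = [f"{q:>2} {phred_to_ascii(q)}" for q in scores[row::n_rows]]
    print("    ".join(cells))
```

With these offsets, a score of 40 is written as `I` under ASCII_BASE 33 and as `h` under ASCII_BASE 64, which is why the offset must be known before decoding a file.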
Even though there are plenty of Python, Biopython and stand-alone tools for dealing with Phred quality scores, a simple command to convert an ASCII character to its corresponding quality score is the following (from the terminal):
python3 -c 'print(ord("<ASCII>")-33)'
Or, when working in a Python 3 session:
print(ord("<ASCII>")-33)
In both cases, just replace <ASCII> with the actual ASCII character and that will do the trick.
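The same idea extends from a single character to a whole quality line of a FASTQ record. What follows is a minimal sketch: `decode_quality` is a hypothetical helper name, and the quality string is a made-up example rather than real sequencing data.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset is 33 for ASCII_BASE 33 and 64 for older ASCII_BASE 64 data.
    """
    scores = [ord(char) - offset for char in quality]
    # A negative score means the string cannot have been encoded with this offset.
    if any(score < 0 for score in scores):
        raise ValueError(f"quality string is not valid for offset {offset}")
    return scores


quality_line = "II?5+!"  # made-up example, not real data
print(decode_quality(quality_line))  # [40, 40, 30, 20, 10, 0]
```

The negative-score check is a cheap way to catch a wrong offset: characters such as `5` or `+` cannot occur in ASCII_BASE 64 data, so decoding them with `offset=64` raises an error instead of silently returning meaningless scores.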