Phred quality score

#python #analysis #bioinformatics

Next-generation sequencing (NGS) techniques have brought new insights into -omics data analysis, mostly thanks to their reliability in detecting biological variants. This reliability is usually measured using a value called the Phred quality score (or Q score).

The Phred score of a base is an integer value that represents the estimated probability of an error in base calling. Mathematically, a Q score is logarithmically related to the base-calling error probability P, and can be calculated using the following formula:

Q = -10 log10 P

In practice, a quality score of 20 means that there is a 1 in 100 chance that the base is incorrect; a quality score of 40 means that the chance that the base is called incorrectly is 1 in 10000.
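The formula and the two examples above can be checked with a minimal Python sketch. The function names `error_probability` and `phred_score` are made up for this illustration and do not come from any library; rounding to the nearest integer is an assumption based on the fact that Phred scores are stored as integers.

```python
import math


def error_probability(q):
    """Error probability P for a Phred score Q, from P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Phred score Q = -10 * log10(P), rounded to the nearest integer."""
    if not 0 < p <= 1:
        raise ValueError("p must be a probability in (0, 1]")
    return round(-10 * math.log10(p))


# Q20 is roughly 1 error in 100, Q40 roughly 1 error in 10000.
for q in (20, 40):
    print(q, error_probability(q))
```

Going the other way, `phred_score(0.001)` gives 30, which matches an error rate of 1 in 1000.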

Because the Phred score is inversely related to the error probability, a higher Q score means a higher base call accuracy and therefore a more reliable base call. Here is a useful table which shows this simple relationship:

Phred quality score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10000	99.99%

In FASTQ files, Phred quality scores are usually represented using ASCII characters, so that the quality score of each base can be specified with a single character: the character's ASCII code is the Q score plus a fixed offset. While older Illumina data used an offset of 64 (ASCII_BASE 64), the offset-33 table (ASCII_BASE 33) is now the de facto standard for NGS data:

Q Score	ASCII char	Q Score	ASCII char	Q Score	ASCII char	Q Score	ASCII char
0	!	11	,	22	7	32	A
1	"	12	-	23	8	33	B
2	#	13	.	24	9	34	C
3	$	14	/	25	:	35	D
4	%	15	0	26	;	36	E
5	&	16	1	27	<	37	F
6	'	17	2	28	=	38	G
7	(	18	3	29	>	39	H
8	)	19	4	30	?	40	I
9	*	20	5	31	@	41	J
10	+	21	6
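The table above is simply the Q score plus 33, read as an ASCII code. A minimal Python sketch of the mapping in both directions follows; `decode_quality` and `encode_quality` are hypothetical helper names chosen for this example, and the `offset` parameter is an assumption added so the same code also covers the older offset-64 encoding.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string into a list of integer Q scores."""
    return [ord(char) - offset for char in quality]


def encode_quality(scores, offset=33):
    """Convert a list of integer Q scores into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


print(decode_quality("!5?I"))           # [0, 20, 30, 40]
print(encode_quality([0, 20, 30, 40]))  # !5?I
```

For older offset-64 data, the same helper can be called as `decode_quality("h", offset=64)`, which gives `[40]`.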

Even though there are many Python libraries (such as Biopython) and stand-alone tools for dealing with Phred quality scores, a simple command to convert an ASCII character to its corresponding quality score is the following (from the terminal):