Toward the regularization of E value from BLAST similarity search into a dissimilarity measure as distance function, and the metrication of protein sequence space
Abstract
Sequence matching algorithms such as BLAST and FASTA have been widely used to search for the evolutionary origins and biological functions of newly discovered nucleic acid and protein sequences.
As parts of these search tools, alignment scores and E values are useful indicators of the quality of search results (and the relevance of the matches) from querying a database of annotated sequences, whereby a high alignment score (and inversely a low E value) reflects significant similarity between the query and the subject (target) sequences.
For cross-comparison of results from sufficiently different queries, however, interpreting the alignment score as a similarity measure and the E value as a dissimilarity measure becomes more nuanced, and calls for a careful distinction between different types of similarity.
Via a simulated formulation, we show that an adjustment of E value to account for self-matching of query and subject sequences corrects for certain ostensibly anomalous similarity comparisons, resulting in 'regularized' dissimilarity and similarity measures that would be more appropriate for cross-comparisons, as well as database applications, such as all-on-all sequence alignment or selection of diverse subsets.
In practice, the 'regularization' of the E value dissimilarity improves clustering and subset selection.
Both the E value and the 'regularized' E value satisfy two of the four axioms of a metric space, positivity and symmetry. The regularized E value is additionally reflexive and satisfies the triangle inequality, the remaining two axioms, and is thus itself an appropriate distance function for metricating protein sequence space.
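The four metric axioms named above can be checked mechanically on any pairwise dissimilarity matrix. The following is a minimal sketch in Python, not the authors' procedure: the function name `check_metric_axioms`, the tolerance `tol`, and both toy matrices are assumptions made for illustration, and the numbers are invented rather than taken from any BLAST run.

```python
def check_metric_axioms(d, tol=1e-12):
    """Test a square dissimilarity matrix d (list of lists) against
    the four metric axioms; returns one boolean per axiom."""
    idx = range(len(d))
    return {
        # d(i, j) >= 0 for every pair
        "positivity": all(d[i][j] >= -tol for i in idx for j in idx),
        # d(i, j) == d(j, i) for every pair
        "symmetry": all(abs(d[i][j] - d[j][i]) <= tol
                        for i in idx for j in idx),
        # d(i, i) == 0: each item is at zero distance from itself
        "reflexivity": all(abs(d[i][i]) <= tol for i in idx),
        # d(i, k) <= d(i, j) + d(j, k) for every triple
        "triangle": all(d[i][k] <= d[i][j] + d[j][k] + tol
                        for i in idx for j in idx for k in idx),
    }


# Hypothetical toy matrix standing in for an unadjusted dissimilarity:
# the self-match entries on the diagonal are nonzero, and the 0-2 entry
# exceeds the detour through item 1.
raw = [[0.5, 1.0, 4.0],
       [1.0, 0.2, 1.0],
       [4.0, 1.0, 0.1]]

# Hypothetical toy matrix standing in for an adjusted dissimilarity.
regular = [[0.0, 1.0, 2.0],
           [1.0, 0.0, 1.0],
           [2.0, 1.0, 0.0]]

for name, matrix in (("raw", raw), ("regular", regular)):
    print(name, check_metric_axioms(matrix))
```

The triple loop is cubic in the number of sequences, which is acceptable for a spot check on a small subset but would need sampling or vectorization for an all-on-all comparison of a large database.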