Toward the regularization of E value from BLAST similarity search into a dissimilarity measure as distance function, and the metrication of protein sequence space

arXiv Q-Bio

CC BY

Abstract

Sequence matching algorithms such as BLAST and FASTA are widely used to search for the evolutionary origins and biological functions of newly discovered nucleic acid and protein sequences.

As part of these search tools, alignment scores and E values are useful indicators of the quality of search results (and the relevance of the matches) when querying a database of annotated sequences: a high alignment score (and, inversely, a low E value) reflects significant similarity between the query and the subject (target) sequences.

For cross-comparison of results from sufficiently different queries, however, interpreting the alignment score as a similarity measure and the E value as a dissimilarity measure becomes more nuanced, and calls for a careful distinction between different types of similarity.

Using a simulated formulation, we show that adjusting the E value to account for self-matching of the query and subject sequences corrects certain ostensibly anomalous similarity comparisons. The resulting 'regularized' dissimilarity and similarity measures are more appropriate for cross-comparisons and for database applications such as all-on-all sequence alignment or selection of diverse subsets.

In actual practice, the 'regularization' of E value dissimilarity improves clustering and subset selection.

While both the E value and the 'regularized' E value satisfy two of the four axiomatic properties of a metric space, positivity and symmetry, the regularized E value additionally satisfies the remaining two, reflexivity and the triangle inequality, and is thus itself an appropriate distance function for metricating protein sequence space.
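
The four metric axioms named above can be checked numerically on any matrix of pairwise dissimilarities. The following is a minimal sketch and not the authors' procedure: the function name `check_metric_axioms` is hypothetical, and the two toy matrices are invented for illustration only (they are not E values or results from the paper).

```python
def check_metric_axioms(d, tol=1e-12):
    """Test a square matrix of pairwise dissimilarities against the four
    metric axioms. `d` is a list of lists; `tol` absorbs rounding error."""
    idx = range(len(d))
    return {
        # d(x, y) >= 0 for every pair
        "positivity": all(d[i][j] >= -tol for i in idx for j in idx),
        # d(x, y) == d(y, x)
        "symmetry": all(abs(d[i][j] - d[j][i]) <= tol for i in idx for j in idx),
        # d(x, x) == 0, the property the abstract calls reflexivity
        "reflexivity": all(abs(d[i][i]) <= tol for i in idx),
        # d(x, z) <= d(x, y) + d(y, z) for every triple
        "triangle_inequality": all(
            d[i][k] <= d[i][j] + d[j][k] + tol
            for i in idx for j in idx for k in idx
        ),
    }


# Invented toy values (assumption): a matrix that behaves like a metric.
toy_metric = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]

# Invented toy values (assumption): non-zero self-dissimilarity on the
# diagonal and one pair that is too far apart for the triangle inequality.
toy_non_metric = [
    [0.5, 1.0, 5.0],
    [1.0, 0.5, 1.0],
    [5.0, 1.0, 0.5],
]

metric_report = check_metric_axioms(toy_metric)
non_metric_report = check_metric_axioms(toy_non_metric)
print(metric_report)
print(non_metric_report)
```

The second matrix passes positivity and symmetry but fails the other two checks, which mirrors the pattern the abstract attributes to the unadjusted E value. The triangle check is cubic in the number of sequences, so for a large all-on-all comparison it would need sampling or vectorization.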
