Program Systems: Theory and Applications, 2016, Volume 7, Issue 4, Pages 347–358
Mathematical Foundations of Programming
Approximation of the longest common subsequence length for two long random strings
S. V. Znamenskij
Ailamazyan Program Systems Institute of RAS
The expected value $E$ of the longest common subsequence of letters in two random words is considered as a function of the $\alpha=|A|$ of alphabet and of words lengths $m$ and $n$.
It is assumed that each letter independently appears at any position with equal probability.
An approximate analitic expression for $E(\alpha, m, n)$ calculation is presented that allow to calculate the $E (m, n, \alpha) $ with an accuracy of $0.3$ percent for $ 64 \leqslant m + n \leqslant 65,536 $ and $ 1 <\alpha < 129$.
The coefficients are selected by hand and can be refined.
It is expected that the formula holds for each grater values of the argument with the same relative error.
Key words and phrases:
similarity of strings, sequence alignment, edit distance, LCS, Levenshtein metric.
PDF file (1547 kB)
MSC: 68T37; 68P10, 68W32
S. V. Znamenskij, “Approximation of the longest common subsequence length for two long random strings”, Program Systems: Theory and Applications, 7:4 (2016), 347–358
Citation in format AMSBIB
\paper Approximation of the longest common subsequence length for two long random strings
\jour Program Systems: Theory and Applications
Citing articles on Google Scholar:
Related articles on Google Scholar:
|Number of views:|