
New approach to searching for string median and visualization of string clusters
D. V. Gorbachev^{}, E. P. Ofitserov^{} ^{} Tula State University (Tula)
Abstract:
We consider the following problem of searching the median of a string set:
$$
c=\mathrm{argmin}_{a\in U}\sum_{b\in D}d(a,b),
$$
where $D\subset G^{*}$ is the finite set of strings over the alphabet $G$,
$U\subseteq G^{*}$, $d$ is the Levenshtein edit distance. This problem has
important applications, e.g., in bioinformatics in the analysis of protein
sequences. However, it is known that in the general case for $U=G^{*}$ the
median problem is NPhard. Therefore, for an approximate solution, heuristics
algorithms were proposed, in particular, the greedy algorithm. We give a new
flexible approach based on smooth Levenshtein distance approximation
$\tilde{d}$. It uses stochastic character encoding of sequences and the
following formula for the edit distance:
$$
d(X_{1},X_{2})=\min_{(X_{1}',X_{2}")_{e}\subseteq X_{1}\times X_{2}}
\{\frac{1}{2} \X_{1}'X_{2}"\_{1}+X_{1}X_{1}'+
X_{2}X_{2}"\},
$$
where the minimum is taken over all subsequences $(X_{1}',X_{2}")_{e}$ of
equal length. On the one hand, stochastic coding extends the class, over which
the extremum is searched. However, our main result shows that the median is the
same. On the other hand, now we can use smooth optimization methods, replacing
the minimum in the definition above with a smooth approximation. We give
implementation details based on the gradient descent, including the recursion
formulas for calculating $\tilde{d}$. Effective calculation of the approximate
median allows, e.g., to use the $k$means method for string clustering. We
propose an approach to visualize such clusters based on tSNE stochastic nesting
method.
Keywords:
Levenshtein edit distance, string median, smooth minimum, gradient descent, $k$means, string cluster visualization.
Funding Agency 

The results of the study are published with the financial support of Tulsu within the framework of the scientific project №: НИР_2018_28. 
DOI:
https://doi.org/10.22405/22268383201820293107
Full text:
PDF file (995 kB)
References:
PDF file
HTML file
UDC:
519.72 Received: 18.05.2019 Accepted:12.07.2019
Citation:
D. V. Gorbachev, E. P. Ofitserov, “New approach to searching for string median and visualization of string clusters”, Chebyshevskii Sb., 20:2 (2019), 93–107
Citation in format AMSBIB
\Bibitem{GorOfi19}
\by D.~V.~Gorbachev, E.~P.~Ofitserov
\paper New approach to searching for string median and visualization of~string clusters
\jour Chebyshevskii Sb.
\yr 2019
\vol 20
\issue 2
\pages 93107
\mathnet{http://mi.mathnet.ru/cheb755}
\crossref{https://doi.org/10.22405/22268383201820293107}
Linking options:
http://mi.mathnet.ru/eng/cheb755 http://mi.mathnet.ru/eng/cheb/v20/i2/p93
Citing articles on Google Scholar:
Russian citations,
English citations
Related articles on Google Scholar:
Russian articles,
English articles

Number of views: 
This page:  107  Full text:  27  References:  2 
