Numerical methods and programming

Num. Meth. Prog., 2015, Volume 16, Issue 2, Pages 215–234 (Mi vmp534)  

This article is cited in 1 scientific paper (total in 1 paper)

Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams

M. A. Nokel (a), N. V. Lukashevich (b)

(a) Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics
(b) Lomonosov Moscow State University, Research Computing Center

Abstract: This paper reports an experimental study of adding bigrams to topic models and of accounting for the similarity between bigrams and unigrams. A novel algorithm, PLSA-SIM, is proposed as a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm; it incorporates bigrams and takes into account the similarity between them and unigram components. Various word association measures are analyzed as a way of selecting top-ranked bigrams for integration into topic models. The target text collections are articles from various Russian electronic banking magazines, the English parts of the parallel corpora Europarl and JRC-Acquis, and the English digital archive of research papers in computational linguistics (ACL Anthology). The computational experiments show that there is a subgroup of the tested measures whose top-ranked bigrams, when included in the PLSA-SIM algorithm, significantly improve the quality of topic models on all collections. A novel unsupervised iterative algorithm, PLSA-ITER, is also proposed for adding the most relevant bigrams; the experiments show a further improvement in the quality of topic models compared with the PLSA algorithm.
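
As an illustration of what ranking bigrams by a word association measure can look like, the following is a minimal sketch that scores adjacent word pairs by pointwise mutual information (PMI). The abstract does not say which measures were tested, so the choice of PMI is an assumption, and the function name `rank_bigrams_by_pmi`, the `min_count` threshold, and the toy token list are hypothetical. This is not the authors' PLSA-SIM or PLSA-ITER procedure, only the candidate-ranking idea in its simplest form.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated from raw counts. Pairs seen fewer than `min_count` times
    are skipped, because PMI is unreliable for rare pairs.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigram_counts[w1] / n_unigrams
        p_y = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))

    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input (hypothetical), loosely in the spirit of a banking collection.
tokens = ("interest rate rises as the bank lifts the interest rate "
          "while the bank reports the rate").split()
ranked = rank_bigrams_by_pmi(tokens)
for bigram, score in ranked:
    print(bigram, round(score, 3))
```

In a pipeline of the kind the abstract describes, the top-ranked pairs from such a list would be the candidates added to the topic model's vocabulary as single terms; how many to add and how their similarity to unigrams is used is specific to the paper and not reproduced here.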

Keywords: PLSA (Probabilistic Latent Semantic Analysis), topic models, word association measures, bigrams, topic coherence, perplexity.

Full text: PDF file (331 kB)
UDC: 004.852
Received: 12.03.2015

Citation: M. A. Nokel, N. V. Lukashevich, “Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams”, Num. Meth. Prog., 16:2 (2015), 215–234

Citation in format AMSBIB
\Bibitem{NokLuk15}
\by M.~A.~Nokel, N.~V.~Lukashevich
\paper Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams
\jour Num. Meth. Prog.
\yr 2015
\vol 16
\issue 2
\pages 215--234
\mathnet{http://mi.mathnet.ru/vmp534}


Linking options:
  • http://mi.mathnet.ru/eng/vmp534
  • http://mi.mathnet.ru/eng/vmp/v16/i2/p215

This publication is cited in the following articles:
  1. I. S. Pavlovskii, P. P. Parkhomenko, “Indicators, models and methods for analysis and estimation of structures of conceptually connected texts”, Autom. Remote Control, 79:9 (2018), 1630–1642