
Computer Optics, 2017, Volume 41, Issue 3, Pages 461–471 (Mi co406)  

NUMERICAL METHODS AND DATA ANALYSIS

An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets

D. V. Mikhaylov, A. P. Kozlov, G. M. Emelyanov

Yaroslav-the-Wise Novgorod State University, Velikii Novgorod, Russia

Abstract: This paper addresses two interrelated problems: extracting knowledge units from a set of subject-oriented texts (a corpus), and assessing how completely the revealed actual knowledge is reflected in the initial phrases. The main practical goal is to find the most rational way to express a knowledge fragment in a given natural language, for subsequent inclusion in the thesaurus and ontology of a subject area. These problems matter when constructing systems for the processing, analysis, evaluation and understanding of information. The relevance of a text to the initial phrase, in terms of the described fragment of actual knowledge (including the forms of its expression in a given natural language), is measured by estimating the coupling strength of words from the initial phrase that occur jointly in phrases of the analyzed text, and by classifying these words according to their TF-IDF values relative to the text corpus. The paper extends links of words from traditional bigrams to three or more elements in order to reveal the constituents of an image of the initial phrase as combinations of related words. Links are revealed both with and without a database of known syntactic relations. To describe the fragment of expert knowledge revealed in the corpus texts more completely, sets of initial phrases that are mutually equivalent or complementary in sense and relate to the same image are considered. Compared with searching for components of the analyzed image on a syntactically marked text corpus, the proposed text selection method reduces, on average by a factor of 17, the output of phrases that are irrelevant to the initial ones in terms of either the knowledge fragment described or the forms of its expression in a given natural language.
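
The two measurements named in the abstract, TF-IDF classification of words and the coupling strength of words that occur jointly in phrases, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy corpus, and in particular the use of the Dice coefficient as the coupling measure, since the abstract does not state which formula the authors use. Only pairwise (bigram) links are shown; the extension to three or more elements is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is a list of phrases,
# each phrase is a list of lower-cased tokens.
CORPUS = [
    [["gradient", "descent", "minimizes", "loss"],
     ["gradient", "descent", "converges"]],
    [["loss", "function", "measures", "error"]],
    [["weather", "is", "sunny"]],
]
# Hypothetical initial phrase, reduced to its content words.
QUERY = ["gradient", "descent", "loss"]


def tf_idf(corpus, doc_index):
    """TF-IDF of every word of one document relative to the whole corpus."""
    flat = [[tok for phrase in doc for tok in phrase] for doc in corpus]
    tokens = flat[doc_index]
    counts = Counter(tokens)
    scores = {}
    for word, n in counts.items():
        # Document frequency is at least 1 because the document is in the corpus.
        df = sum(1 for doc in flat if word in doc)
        scores[word] = (n / len(tokens)) * math.log(len(flat) / df)
    return scores


def coupling(phrases, query_words):
    """Dice coefficient (an assumed stand-in measure) for each pair of
    query words, counted over joint occurrence within a phrase."""
    wanted = set(query_words)
    occur, joint = Counter(), Counter()
    for phrase in phrases:
        present = sorted(set(phrase) & wanted)
        occur.update(present)
        joint.update(combinations(present, 2))
    return {pair: 2 * k / (occur[pair[0]] + occur[pair[1]])
            for pair, k in joint.items()}


SCORES = tf_idf(CORPUS, 0)
LINKS = coupling(CORPUS[0], QUERY)
```

In this toy data "gradient" and "descent" occur together in every phrase that contains either of them, so their coupling is maximal, while "loss" is linked to each of them more weakly. A text could then be ranked by combining such link strengths with the TF-IDF class of each word, which is the general idea the abstract describes.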

Keywords: pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval.

Funding:
  • Ministry of Education and Science of the Russian Federation
  • Russian Foundation for Basic Research, grant No. 16-01-00004
The work was partially funded by the Ministry of Education and Science of the Russian Federation (the basic part of the state task) and the Russian Foundation for Basic Research, grant No. 16-01-00004.


DOI: https://doi.org/10.18287/2412-6179-2017-41-3-461-471

Full text: PDF file (312 kB)
Full text: http://www.computeroptics.smr.ru/.../410320.html
References: PDF file   HTML file

Received: 10.04.2017
Accepted: 01.06.2017

Citation: D. V. Mikhaylov, A. P. Kozlov, G. M. Emelyanov, “An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets”, Computer Optics, 41:3 (2017), 461–471

Citation in format AMSBIB
\Bibitem{MikKozEme17}
\by D.~V.~Mikhaylov, A.~P.~Kozlov, G.~M.~Emelyanov
\paper An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets
\jour Computer Optics
\yr 2017
\vol 41
\issue 3
\pages 461--471
\mathnet{http://mi.mathnet.ru/co406}
\crossref{https://doi.org/10.18287/2412-6179-2017-41-3-461-471}


Linking options:
  • http://mi.mathnet.ru/eng/co406
  • http://mi.mathnet.ru/eng/co/v41/i3/p461
