
 Computer Optics, 2017, Volume 41, Issue 3, Pages 461–471 (Mi co406)

NUMERICAL METHODS AND DATA ANALYSIS

An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets

D. V. Mikhaylov, A. P. Kozlov, G. M. Emelyanov

Yaroslav-the-Wise Novgorod State University, Velikii Novgorod, Russia

Abstract: This paper addresses two interrelated problems: extracting knowledge units from a set of subject-oriented texts (a corpus), and assessing how completely the revealed actual knowledge is reflected in the initial phrases. The main practical goal is to find the most rational way to express a knowledge fragment in a given natural language for subsequent inclusion in the thesaurus and ontology of a subject area. These problems matter when constructing systems for processing, analyzing, estimating and understanding information. The relevance of a text to the initial phrase, in terms of the described fragment of actual knowledge (including the forms of its expression in a given natural language), is measured by estimating the coupling strength of words from the initial phrase that co-occur in phrases of the analyzed text, together with classifying these words by their TF-IDF values relative to the text corpus. The paper extends links of words from traditional bigrams to three or more elements in order to reveal the constituents of the image of the initial phrase as combinations of related words. Variants of link detection with and without a database of known syntactic relations are considered. To describe the fragment of expert knowledge revealed in the corpus texts more completely, sets of initial phrases that are mutually equivalent or complementary in sense and relate to the same image are taken into consideration. Compared with searching for components of the analyzed image in a syntactically annotated text corpus, the proposed text selection method reduces, on average by a factor of 17, the output of phrases that are irrelevant to the initial ones in terms of either the knowledge fragment described or the forms of its expression in a given natural language.
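
The abstract names two ingredients of the relevance estimate: TF-IDF values of words relative to the corpus, and the coupling strength of words that co-occur in phrases. The abstract does not give the paper's formulas, so the following is only a minimal sketch under stated assumptions: it uses the textbook TF-IDF definition and substitutes the Dice coefficient as a stand-in coupling measure. All function names (`tf_idf`, `coupling_strength`, `phrase_relevance`) and the averaging step are hypothetical illustrations, not the authors' method.

```python
import math
from itertools import combinations


def tf_idf(word, doc_tokens, corpus):
    """Textbook TF-IDF: relative term frequency in one document
    times the log of the inverse document frequency in the corpus."""
    if not doc_tokens:
        return 0.0
    tf = doc_tokens.count(word) / len(doc_tokens)
    df = sum(1 for doc in corpus if word in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0


def coupling_strength(w1, w2, phrases):
    """ASSUMPTION: Dice coefficient over phrases, used here as a
    stand-in for the paper's coupling measure (not given in the abstract)."""
    n1 = sum(1 for p in phrases if w1 in p)
    n2 = sum(1 for p in phrases if w2 in p)
    n12 = sum(1 for p in phrases if w1 in p and w2 in p)
    return 2 * n12 / (n1 + n2) if (n1 + n2) else 0.0


def phrase_relevance(initial_phrase, text_phrases):
    """ASSUMPTION: score a text by the mean pairwise coupling of the
    initial phrase's words across the phrases of that text."""
    pairs = list(combinations(sorted(set(initial_phrase)), 2))
    if not pairs:
        return 0.0
    total = sum(coupling_strength(a, b, text_phrases) for a, b in pairs)
    return total / len(pairs)


# Toy data: each phrase is a list of already-normalized tokens.
phrases = [
    ["neural", "network", "training"],
    ["neural", "network"],
    ["network", "topology"],
]
corpus = [["a", "b", "a"], ["b", "c"]]

dice = coupling_strength("neural", "network", phrases)
score = phrase_relevance(["neural", "network"], phrases)
weight = tf_idf("a", corpus[0], corpus)
```

In this toy setup a word that appears in every document (such as `"b"`) gets a TF-IDF of zero, which is the property that makes TF-IDF usable for separating general vocabulary from subject-specific terms before coupling is estimated.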

Keywords: pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval.

Funding: The work was partially funded by the Russian Federation Ministry of Education and Science (the basic part of the state task) and the Russian Foundation for Basic Research, grant No. 16-01-00004.

DOI: https://doi.org/10.18287/2412-6179-2017-41-3-461-471

Full text: http://www.computeroptics.smr.ru/.../410320.html

Accepted: 01.06.2017

Citation: D. V. Mikhaylov, A. P. Kozlov, G. M. Emelyanov, “An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets”, Computer Optics, 41:3 (2017), 461–471

Citation in format AMSBIB
\Bibitem{MikKozEme17}
\by D.~V.~Mikhaylov, A.~P.~Kozlov, G.~M.~Emelyanov
\paper An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets
\jour Computer Optics
\yr 2017
\vol 41
\issue 3
\pages 461--471
\mathnet{http://mi.mathnet.ru/co406}
\crossref{https://doi.org/10.18287/2412-6179-2017-41-3-461-471}