A. O. Bogatenkova, I. S. Kozlov, O. V. Belyaeva, A. I. Perminov, “Logical structure extraction from scanned documents”, Proceedings of ISP RAS, 32:4 (2020), 175

Proceedings of the Institute for System Programming of the RAS


Proceedings of the Institute for System Programming of the RAS, 2020, Volume 32, Issue 4, Pages 175–188
DOI: https://doi.org/10.15514/ISPRAS-2020-32(4)-13 (Mi tisp533)

This article is cited in 3 scientific papers (total in 3 papers)

Logical structure extraction from scanned documents

A. O. Bogatenkova^a, I. S. Kozlov^b, O. V. Belyaeva^b, A. I. Perminov^a

^a Lomonosov Moscow State University
^b Ivannikov Institute for System Programming of the RAS


Abstract: Logical structure extraction from documents is a longstanding research topic because it underpins a wide range of practical applications. The great variety of document types, and consequently of possible document structures, makes the task particularly difficult. This work presents one way to represent and extract the structure of documents of a specific type: scanned documents without a text layer. The text in such documents cannot be selected, copied, or searched. Nevertheless, a huge number of scanned documents exist that people need to work with. Understanding the information in them is useful for analysis, e.g. for effective search within documents, navigation, and summarization. To cope with large collections, the task must be performed automatically. The paper describes a pipeline for processing scanned documents. The method is based on multiclass classification of document lines; the set of classes includes textual lines, headers, and lists. First, the text and bounding boxes of document lines are extracted using OCR methods; then features are generated for each line and passed to the classifier as input. We also release a dataset of documents with bounding boxes and labels for each document line, evaluate the effectiveness of our approach on this dataset, and describe possible future work in document processing.
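The line-classification idea in the abstract can be illustrated with a minimal sketch. It assumes OCR has already produced text and a bounding box for each line. The abstract does not specify the features or the classifier, so everything below is hypothetical: the `Line` type, the five features in `line_features`, and the nearest-centroid classifier (chosen only because it needs no third-party library) are stand-ins, and the sample lines are invented toy data rather than the published dataset.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One OCR line: recognized text plus its bounding box in pixels (assumed layout)."""
    text: str
    x: int
    y: int
    width: int
    height: int


# Hypothetical list-marker pattern: "1.", "2)", "-", "*", "•", "a)".
LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•]|[a-z]\))\s+")


def line_features(line, page_lines):
    """Illustrative features only; not the feature set used in the paper."""
    med_h = median(l.height for l in page_lines)
    min_x = min(l.x for l in page_lines)
    letters = [c for c in line.text if c.isalpha()]
    upper = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return [
        line.height / med_h,                           # relative line height
        (line.x - min_x) / med_h,                      # indentation in line heights
        1.0 if LIST_MARKER.match(line.text) else 0.0,  # starts with a list marker
        upper,                                         # share of uppercase letters
        1.0 if line.text.rstrip().endswith((".", ";", ":")) else 0.0,
    ]


class NearestCentroid:
    """Tiny multiclass classifier used here as a stand-in for the real model."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [
            min(self.centroids, key=lambda label: dist(row, self.centroids[label]))
            for row in X
        ]


# Invented labeled page (toy data), using the three classes named in the abstract.
train = [
    (Line("INTRODUCTION", 100, 50, 400, 40), "header"),
    (Line("Logical structure extraction is a long-standing task.", 100, 100, 900, 20), "text"),
    (Line("- scanned documents without a text layer", 140, 130, 700, 20), "list"),
    (Line("The text in such documents cannot be selected.", 100, 160, 900, 20), "text"),
    (Line("- documents with a text layer", 140, 190, 600, 20), "list"),
    (Line("METHOD", 100, 240, 300, 40), "header"),
]
train_lines = [line for line, _ in train]
model = NearestCentroid().fit(
    [line_features(line, train_lines) for line in train_lines],
    [label for _, label in train],
)

# Invented unlabeled page to classify.
page = [
    Line("RESULTS", 100, 50, 250, 40),
    Line("We evaluated the approach on the dataset.", 100, 100, 800, 20),
    Line("- headers", 140, 130, 200, 20),
    Line("- lists", 140, 160, 200, 20),
]
predicted = model.predict([line_features(line, page) for line in page])
for line, label in zip(page, predicted):
    print(f"{label:6s} | {line.text}")
```

Features are computed relative to the page (median line height, leftmost edge) so that they do not depend on scan resolution; that normalization is a design choice of this sketch, not a detail taken from the paper.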

Keywords: machine learning, document structure, natural language processing, OCR.

Document Type: Article

Language: Russian

Citation: A. O. Bogatenkova, I. S. Kozlov, O. V. Belyaeva, A. I. Perminov, “Logical structure extraction from scanned documents”, Proceedings of ISP RAS, 32:4 (2020), 175–188

Citation in format AMSBIB

\Bibitem{BogKozBel20}
\by A.~O.~Bogatenkova, I.~S.~Kozlov, O.~V.~Belyaeva, A.~I.~Perminov
\paper Logical structure extraction from scanned documents
\jour Proceedings of ISP RAS
\yr 2020
\vol 32
\issue 4
\pages 175--188
\mathnet{http://mi.mathnet.ru/tisp533}
\crossref{https://doi.org/10.15514/ISPRAS-2020-32(4)-13}

Linking options:

https://www.mathnet.ru/eng/tisp533

https://www.mathnet.ru/eng/tisp/v32/i4/p175

