Proceedings of the Institute for System Programming of the RAS
Proceedings of the Institute for System Programming of the RAS, 2024, Volume 36, Issue 5, Pages 153–162
DOI: https://doi.org/10.15514/ISPRAS-2024-36(5)-11
(Mi tisp929)
 

Automatic construction of information extraction rules for news websites

S. S. Dubovitskii (a), P. A. Bedrin (b, a), A. K. Yatskov (b, a), M. I. Varlamov (a)

a Ivannikov Institute for System Programming of the RAS
b Lomonosov Moscow State University
Abstract: This paper presents a method for automatically generating information extraction rules (sitemaps) for news websites. Given a set of news pages from a single site, the approach builds a sitemap that enables attribute extraction from arbitrary news pages on that site. The method applies a fine-tuned neural network model, MarkupLM, to extract information from individual web pages, then generalizes the model's predictions at the site level to create universal rules for attribute extraction. Experimental results show that sitemaps generated with the fine-tuned model outperform both existing open-source tools and the fine-tuned MarkupLM applied at the individual page level. The method can be extended to other domains if relevant data for model fine-tuning is available.
Keywords: information extraction, web scraping, news websites, neural networks.
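The site-level generalization described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes that a page-level model has already produced one XPath per attribute for each page, and it aggregates those predictions by simple majority voting. The function name `build_sitemap`, the `min_support` threshold, and the sample XPaths are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_sitemap(page_predictions, min_support=0.5):
    """Aggregate per-page predictions into one rule per attribute.

    page_predictions: list of dicts, one per page, mapping an attribute
    name to the XPath predicted for it on that page (hypothetical format).
    min_support: minimum share of pages that must agree on an XPath
    for it to become a site-level rule (an assumed heuristic).
    """
    votes = defaultdict(Counter)
    for prediction in page_predictions:
        for attribute, xpath in prediction.items():
            votes[attribute][xpath] += 1

    n_pages = len(page_predictions)
    sitemap = {}
    for attribute, counter in votes.items():
        # Take the most frequently predicted XPath for this attribute.
        xpath, count = counter.most_common(1)[0]
        if count / n_pages >= min_support:
            sitemap[attribute] = xpath
    return sitemap


# Hand-written stand-ins for page-level model output on four pages.
predictions = [
    {"title": "/html/body/main/h1", "date": "/html/body/main/time"},
    {"title": "/html/body/main/h1", "date": "/html/body/main/time"},
    {"title": "/html/body/main/h1", "date": "/html/body/main/div[2]/span"},
    {"title": "/html/body/aside/h2"},
]
sitemap = build_sitemap(predictions)
```

Under this voting scheme a single mispredicted page (the fourth one above) does not affect the resulting rule, which is one plausible reason site-level rules could be more robust than page-level predictions; the paper itself should be consulted for the actual aggregation procedure.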
Funding agency: Ministry of Science and Higher Education of the Russian Federation
The authors are grateful to the Ivannikov Institute for System Programming of the Russian Academy of Sciences.
Document Type: Article
Language: Russian
Citation: S. S. Dubovitskii, P. A. Bedrin, A. K. Yatskov, M. I. Varlamov, “Automatic construction of information extraction rules for news websites”, Proceedings of ISP RAS, 36:5 (2024), 153–162
Citation in AMSBIB format:
\Bibitem{DubBedYat24}
\by S.~S.~Dubovitskii, P.~A.~Bedrin, A.~K.~Yatskov, M.~I.~Varlamov
\paper Automatic construction of information extraction rules for news websites
\jour Proceedings of ISP RAS
\yr 2024
\vol 36
\issue 5
\pages 153--162
\mathnet{http://mi.mathnet.ru/tisp929}
\crossref{https://doi.org/10.15514/ISPRAS-2024-36(5)-11}
Linking options:
  • https://www.mathnet.ru/eng/tisp929
  • https://www.mathnet.ru/eng/tisp/v36/i5/p153