Doklady Rossijskoj Akademii Nauk. Mathematika, Informatika, Processy Upravlenia
Doklady Rossijskoj Akademii Nauk. Mathematika, Informatika, Processy Upravlenia, 2024, Volume 520, Number 2, Pages 216–227
DOI: https://doi.org/10.31857/S2686954324700589
(Mi danma601)
 

This article is cited in 1 scientific paper.

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

SciRus: tiny and powerful multilingual encoder for scientific texts

N. A. Gerasimenko^(a,b,c), A. C. Vatolin^(b,c), A. O. Yanina^(d), K. V. Vorontsov^(b,c,d)

a SberAI, Moscow, Russia
b Artificial Intelligence Institute, M. V. Lomonosov Moscow State University, Moscow, Russia
c Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, Moscow, Russia
d Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Moscow Region, Russia
Abstract: LLM-based representation learning is widely used to build effective information retrieval systems, including in scientific domains. To make science more open and accessible, it is important that such systems support multilingual (and cross-lingual) search and do not require significant computational power. To address this, we propose SciRus-tiny, a light multilingual encoder trained from scratch on 44M abstracts (15B tokens) of research papers and then tuned in a contrastive manner using citation data. SciRus-tiny outperforms SciNCL, the English-only SOTA model for scientific texts, on 13 of 24 tasks from the SciRepEval benchmark, achieving SOTA on 7. Furthermore, SciRus-tiny is much more efficient than SciNCL: it is almost 5x smaller (23M parameters vs. 110M), has approximately 2x smaller embeddings (312 vs. 768), and supports a 2x longer context (1024 vs. 512). In addition to the tiny model, we also propose SciRus-small (61M parameters, 768-dimensional embeddings), which is more powerful and can be used for complicated downstream tasks. We further study different ways of contrastive pre-training and demonstrate that nearly SOTA results can be achieved without citation information, operating with only title-abstract pairs.
Keywords: information retrieval, interpretability and analysis of NLP models, large language models, representation learning.
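The abstract describes contrastive tuning that pulls each paper's title and abstract (or citation-linked papers) together in embedding space while pushing apart non-matching pairs. The paper's exact loss function is not reproduced on this page; the sketch below shows a standard in-batch InfoNCE objective over toy title/abstract embedding vectors, which is one common way such title-abstract contrastive training is implemented. All names and the temperature value here are illustrative assumptions, not taken from the paper.

```python
import math

def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity, the usual scoring function for embedding retrieval.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def info_nce_loss(title_embs, abstract_embs, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: the title embedding of paper i
    should be closer to its own abstract than to every other abstract in
    the batch, which serve as implicit negatives."""
    n = len(title_embs)
    total = 0.0
    for i in range(n):
        # Scaled similarities of title i against all abstracts in the batch.
        sims = [cosine(title_embs[i], abstract_embs[j]) / temperature
                for j in range(n)]
        # Cross-entropy with the matching abstract as the correct "class".
        log_denom = math.log(sum(math.exp(s) for s in sims))
        total += -(sims[i] - log_denom)
    return total / n

# Toy batch of 2 papers with 2-dimensional embeddings: aligned pairs
# should yield a much lower loss than shuffled (mismatched) pairs.
titles = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce_loss(titles, aligned) < info_nce_loss(titles, shuffled))
```

In real training the embeddings would come from the encoder itself and the loss would be minimized by gradient descent; the in-batch-negatives trick is what lets title-abstract pairs alone (without citation data) act as a contrastive signal.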
Received: 27.09.2024
Accepted: 02.10.2024
English version:
Doklady Mathematics, 2024, Volume 110, Issue suppl. 1, Pages S193–S202
DOI: https://doi.org/10.1134/S1064562424602178
Document Type: Article
UDC: 004.048
Language: Russian
Citation: N. A. Gerasimenko, A. C. Vatolin, A. O. Yanina, K. V. Vorontsov, “SciRus: tiny and powerful multilingual encoder for scientific texts”, Dokl. RAN. Math. Inf. Proc. Upr., 520:2 (2024), 216–227; Dokl. Math., 110:suppl. 1 (2024), S193–S202
Citation in format AMSBIB
\Bibitem{GerVatYan24}
\by N.~A.~Gerasimenko, A.~C.~Vatolin, A.~O.~Yanina, K.~V.~Vorontsov
\paper SciRus: tiny and powerful multilingual encoder for scientific texts
\jour Dokl. RAN. Math. Inf. Proc. Upr.
\yr 2024
\vol 520
\issue 2
\pages 216--227
\mathnet{http://mi.mathnet.ru/danma601}
\elib{https://elibrary.ru/item.asp?id=80287449}
\transl
\jour Dokl. Math.
\yr 2024
\vol 110
\issue suppl. 1
\pages S193--S202
\crossref{https://doi.org/10.1134/S1064562424602178}
Linking options:
  • https://www.mathnet.ru/eng/danma601
  • https://www.mathnet.ru/eng/danma/v520/i2/p216