Doklady Rossijskoj Akademii Nauk. Mathematika, Informatika, Processy Upravlenia, 2024, Volume 520, Number 2, Pages 228–237 DOI: https://doi.org/10.31857/S2686954324700590
(Mi danma602)
SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES
Stack more LLMs: efficient detection of machine-generated texts via perplexity approximation
G. M. Gritsai (a, b), I. A. Khabutdinov (a, b), A. V. Grabovoy (a, b)
a Antiplagiat Company, Moscow, Russia
b Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Moscow Region
Abstract:
The development of large language models (LLMs) is currently attracting great interest, but every advance in text generation methods should be accompanied by a corresponding update of the methods for detecting machine-generated texts. Earlier work has shown that perplexity and log-probability values capture a measure of the difference between machine-generated and human-written texts. Building on this observation, we define a new criterion based on these two values to judge whether a passage was generated by a given LLM. In this paper, we propose a novel, efficient method that detects machine-generated fragments by approximating the LLM perplexity value with pre-collected statistical language models. The approximation also makes it possible to achieve high performance and quality on fragments from closed-weight LLMs. Using a large number of pre-collected statistical dictionaries increases the generalisation ability and makes it possible to cover text sequences found in the wild. Such an approach is easy to update: it only requires adding a new dictionary built from the latest model's text outputs. The presented method has high performance and achieves an average recall of 94% in detecting generated fragments among texts from various open-source LLMs. In addition, the method runs in milliseconds, which is faster than state-of-the-art models by a factor of thousands.
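The abstract does not spell out the criterion or how the dictionaries are built, so the following Python sketch only illustrates the general idea of approximating perplexity with a statistical language model instead of querying an LLM. Everything in it is an assumption made for illustration: the whitespace tokenizer, the add-one smoothed bigram model, the `BigramDictionary` and `approx_perplexity` names, and the choice to take the minimum perplexity over all dictionaries.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


class BigramDictionary:
    """Hypothetical statistical dictionary: an add-one smoothed bigram model
    built from sample outputs of one text generator."""

    def __init__(self, texts):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for text in texts:
            tokens = ["<s>"] + tokenize(text)
            vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1
        # One extra slot stands in for any token never seen in the samples.
        self.vocab_size = len(vocab) + 1

    def log_prob(self, prev, cur):
        # Add-one (Laplace) smoothing keeps unseen bigrams at non-zero probability.
        num = self.bigram_counts[(prev, cur)] + 1
        den = self.context_counts[prev] + self.vocab_size
        return math.log(num / den)

    def perplexity(self, text):
        tokens = ["<s>"] + tokenize(text)
        n = len(tokens) - 1
        if n == 0:
            raise ValueError("cannot score an empty text")
        total = sum(self.log_prob(p, c) for p, c in zip(tokens, tokens[1:]))
        # Perplexity is the exponential of the negative mean log-probability.
        return math.exp(-total / n)


def approx_perplexity(text, dictionaries):
    # Assumption: the lowest perplexity over all dictionaries is used as the score.
    return min(d.perplexity(text) for d in dictionaries.values())


# Toy usage: one dictionary per (hypothetical) generator.
dictionaries = {
    "model_a": BigramDictionary([
        "the model generates fluent text",
        "the model generates plausible text",
    ]),
}
score = approx_perplexity("the model generates fluent text", dictionaries)
```

Under these assumptions, covering a new generator amounts to adding one more entry to `dictionaries`, which mirrors the update path described in the abstract. A detector would then compare `score` with a threshold tuned on labelled data; no threshold value is given in the source.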
Keywords:
machine-generated text, natural language processing, perplexity, large language models, detection of generated texts.
Received: 27.09.2024 Accepted: 02.10.2024
Citation:
G. M. Gritsai, I. A. Khabutdinov, A. V. Grabovoy, “Stack more LLMs: efficient detection of machine-generated texts via perplexity approximation”, Dokl. RAN. Math. Inf. Proc. Upr., 520:2 (2024), 228–237; Dokl. Math., 110:suppl. 1 (2024), S203–S211
Linking options:
https://www.mathnet.ru/eng/danma602 https://www.mathnet.ru/eng/danma/v520/i2/p228