Probl. Peredachi Inf., 2001, Volume 37, Issue 2, Pages 96–109 (Mi ppi520)  

This article is cited in 39 scientific papers

Source Coding

Using Literal and Grammatical Statistics for Authorship Attribution

O. V. Kukushkina, A. A. Polikarpov, D. V. Khmelev

Abstract: Markov chains are used as a formal mathematical model for sequences of elements of a text. This model is applied for authorship attribution of texts. As elements of a text, we consider sequences of letters or sequences of grammatical classes of words. It turns out that the frequencies of occurrences of letter pairs and pairs of grammatical classes in a Russian text are rather stable characteristics of an author and, apparently, they could be used in disputed authorship attribution. A comparison of results for various modifications of the method using both letters and grammatical classes is given. Experimental research involves 385 texts of 82 writers. In the Appendix, the research of D. V. Khmelev is described, where data compression algorithms are applied to authorship attribution.
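The letter-pair idea in the abstract can be illustrated with a minimal sketch. It assumes a first-order Markov chain over letters, estimates transition probabilities from each candidate author's known text, and attributes a disputed text to the author whose chain gives it the lowest negative log-likelihood. The function names (`normalize`, `pair_counts`, `neg_log_likelihood`, `attribute`), the additive smoothing constant `alpha`, and the text normalization are assumptions made for illustration; the abstract does not specify the estimator the authors actually used.

```python
import math
from collections import Counter


def normalize(text):
    """Keep lowercase letters and spaces only (an assumed preprocessing step)."""
    return [c for c in text.lower() if c.isalpha() or c == " "]


def pair_counts(text):
    """Count adjacent letter pairs and how often each letter starts a pair."""
    letters = normalize(text)
    pairs = Counter(zip(letters, letters[1:]))
    firsts = Counter(letters[:-1])
    return pairs, firsts


def neg_log_likelihood(text, model, alphabet_size, alpha=1.0):
    """Negative log-likelihood of the text's letter transitions under a model.

    Transition probabilities use additive smoothing (alpha), which is an
    assumption of this sketch so that unseen pairs do not get probability 0.
    """
    pairs, firsts = model
    test_pairs, _ = pair_counts(text)
    total = 0.0
    for (a, b), n in test_pairs.items():
        p = (pairs[(a, b)] + alpha) / (firsts[a] + alpha * alphabet_size)
        total -= n * math.log(p)
    return total


def attribute(text, corpora, alpha=1.0):
    """Return the author whose letter-pair model best explains the text."""
    alphabet = set(normalize(text))
    for sample in corpora.values():
        alphabet.update(normalize(sample))
    models = {author: pair_counts(sample) for author, sample in corpora.items()}
    return min(
        models,
        key=lambda author: neg_log_likelihood(text, models[author], len(alphabet), alpha),
    )


if __name__ == "__main__":
    # Toy "authors" with visibly different letter-pair habits.
    corpora = {"A": "abababababab", "B": "aabbaabbaabb"}
    print(attribute("ababab", corpora))
    print(attribute("aabbaabb", corpora))
```

The same structure would apply to pairs of grammatical classes: replace `normalize` with a function that maps each word to its class label, which requires a tagger not shown here.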

English version:
Problems of Information Transmission, 2001, 37:2, 172–184

UDC: 621.391.1
Received: 08.08.2000
Revised: 11.01.2001

Citation: O. V. Kukushkina, A. A. Polikarpov, D. V. Khmelev, “Using Literal and Grammatical Statistics for Authorship Attribution”, Probl. Peredachi Inf., 37:2 (2001), 96–109; Problems Inform. Transmission, 37:2 (2001), 172–184

Citation in format AMSBIB
\by O.~V.~Kukushkina, A.~A.~Polikarpov, D.~V.~Khmelev
\paper Using Literal and Grammatical Statistics for Authorship Attribution
\jour Probl. Peredachi Inf.
\yr 2001
\vol 37
\issue 2
\pages 96--109
\jour Problems Inform. Transmission
\yr 2001
\vol 37
\issue 2
\pages 172--184

    This publication is cited in the following articles:
    1. Yu. M. Shtar'kov, “Joint Matrix Universal Coding of Sequences of Independent Symbols”, Problems Inform. Transmission, 38:2 (2002), 154–165  mathnet  crossref  mathscinet  zmath
    2. Benedetto D., Caglioti E., Loreto V., “Zipping out relevant information”, Computing in Science & Engineering, 5:1 (2003), 80–85  crossref  isi
    3. Puglisi A. Benedetto D. Caglioti E. Loreto V. Vulpiani A., “Data Compression and Learning in Time Sequences Analysis”, Physica D, 180:1-2 (2003), 92–107  crossref  mathscinet  zmath  adsnasa  isi
    4. Benedetto D., Caglioti E., Loreto V., “Comment on “Language Trees and Zipping” - Reply”, Phys. Rev. Lett., 90:8 (2003), 089804  crossref  adsnasa  isi
    5. Khmelev D., Teahan W., “Comment on “Language Trees and Zipping””, Phys. Rev. Lett., 90:8 (2003), 089803  crossref  adsnasa  isi
    6. Loreto V. Puglisi A., “Data Compression Approach to Sequence Analysis”, Modeling of Complex Systems, AIP Conference Proceedings, 661, ed. Garrido P. Marro J., Amer Inst Physics, 2003, 184–187  crossref  adsnasa  isi
    7. Baronchelli A., Caglioti E., Loreto V., Pizzi E., “Dictionary-Based Methods for Information Extraction”, Physica A, 342:1-2 (2004), 294–300  crossref  adsnasa  isi
    8. Celikel E., Dalkilic M., “Investigating the Effects of Recency and Size of Training Text on Author Recognition Problem”, Computer and Information Sciences - Iscis 2004, Proceedings, Lecture Notes in Computer Science, 3280, eds. Aykanat C., Dayar T., Korpeoglu I., Springer-Verlag Berlin, 2004, 21–30  crossref  isi
    9. B. Ya. Ryabko, V. A. Monarev, “Experimental Investigation of Forecasting Methods Based on Data Compression Algorithms”, Problems Inform. Transmission, 41:1 (2005), 65–69  mathnet  crossref  zmath  elib
    10. Uzuner, O, “A comparative study of language models for book and author recognition”, Natural Language Processing - Ijcnlp 2005, Proceedings, 3651 (2005), 969  crossref  isi
    11. Baronchelli, A, “Artificial sequences and complexity measures”, Journal of Statistical Mechanics-Theory and Experiment, 2005, P04002  crossref  mathscinet  isi
    12. Zakrevskaya N.S., “A multi-sample criterion for changepoint analysis of texts”, Korus 2005, Proceedings, 2005, 749–750  isi
    13. Marton Y., Wu N., Hellerstein L., “On Compression-Based Text Classification”, Advances in Information Retrieval, Lecture Notes in Computer Science, 3408, eds. Losada D., FernandezLuna J., Springer-Verlag Berlin, 2005, 300–314  crossref  isi
    14. Juola P., “Authorship attribution for electronic documents”, Advances in Digital Forensics II, International Federation for Information Processing, 222, 2006, 119–130  adsnasa  isi
    15. Uzuner O., “Linguistically informed digital fingerprints for text - art. no. 607208”, Security, Steganography, and Watermarking of Multimedia Contents VIII, Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE), 6072, 2006, 7208–7208  isi
    16. Malyutov M.B., “Authorship Attribution of Texts: a Review”, General Theory of Information Transfer and Combinatorics, Lecture Notes in Computer Science, 4123, eds. Ahlswede R., Baumer L., Cai N., Aydinian H., Blinovsky V., Deppe C., Mashurian H., Springer-Verlag Berlin, 2006, 362–380  crossref  mathscinet  zmath  isi
    17. Dalkilic M.M., Clark W.T., Costello J.C., Radivojac P., “Using Compression to Identify Classes of Inauthentic Texts”, Proceedings of the Sixth SIAM International Conference on Data Mining, SIAM Proceedings Series, eds. Ghosh J., Lambert D., Skillicorn D., Srivastava J., SIAM, 2006, 604–608  mathscinet  isi
    18. B. Ya. Ryabko, “Application of Data Compression Methods to Nonparametric Estimation of Characteristics of Discrete-Time Stochastic Processes”, Problems Inform. Transmission, 43:4 (2007), 367–379  mathnet  crossref  mathscinet  zmath  isi  elib  elib
    19. Tuerkoglu F., Diri B., Amasyali M.F., “Author attribution of Turkish texts by feature mining”, Advanced Intelligent Computing Theories and Applications: With Aspects of Theoretical and Methodological Issues, Lecture Notes in Computer Science, 4681, 2007, 1086–1093  crossref  isi
    20. Raskhodnikova S., Ron D., Rubinfeld R., Smith A., “Sublinear Algorithms for Approximating String Compressibility”, Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, Lecture Notes in Computer Science, 4627, eds. Charikar M., Reingold O., Jansen K., Rolim J., Springer-Verlag Berlin, 2007, 609–623  crossref  zmath  isi
    21. Melville J.L., Riley J.F., Hirst J.D., “Similarity by Compression”, J. Chem Inf. Model., 47:1 (2007), 25–33  crossref  mathscinet  isi  elib
    22. Basile, C, “An example of mathematical authorship attribution”, Journal of Mathematical Physics, 49:12 (2008), 125211  crossref  mathscinet  zmath  adsnasa  isi
    23. Luyckx K., “Corpus Linguistics Beyond the Word-Corpus Research from Phrase to Discourse”, Literary and Linguistic Computing, 23:1 (2008), 123–125  crossref  isi
    24. Stamatatos E., “A Survey of Modern Authorship Attribution Methods”, J. Am. Soc. Inf. Sci. Technol., 60:3 (2009), 538–556  crossref  isi  elib
    25. Lambers M., Veenman C.J., “Forensic Authorship Attribution Using Compression Distances to Prototypes”, Computational Forensics, Proceedings, Lecture Notes in Computer Science, 5718, eds. Geradts Z., Franke K., Veenman C., Springer-Verlag Berlin, 2009, 13–24  crossref  isi
    26. Koppel M., Schler J., Argamon Sh., “Computational Methods in Authorship Attribution”, J. Am. Soc. Inf. Sci. Technol., 60:1 (2009), 9–26  crossref  isi  elib
    27. Reicher T., Kristo I., Belsa I., Silic A., “Automatic Authorship Attribution for Texts in Croatian Language Using Combinations of Features”, Knowledge-Based and Intelligent Information and Engineering Systems, Pt II, Lecture Notes in Artificial Intelligence, 6277, no. Part ii, eds. Setchi R., Jordanov I., Howlett R., Jain L., Springer-Verlag Berlin, 2010, 21–30  isi
    28. Neelova N.V., “Model opredeleniya pervichnogo kontenta sredi mnozhestva web-dokumentov”, Nauchno-tekhnicheskie vedomosti Sankt-Peterburgskogo gosudarstvennogo politekhnicheskogo universiteta, 2011, no. 133, 13–17  elib
    29. Raskhodnikova S., Ron D., Rubinfeld R., Smith A., “Sublinear Algorithms for Approximating String Compressibility”, Algorithmica, 65:3 (2013), 685–709  crossref  mathscinet  zmath  isi  elib
    30. Stolerman A., Fridman A., Greenstadt R., Brennan P., Juola P., “Active Linguistic Authentication Using Real-Time Stylometric Evaluation For Multi-Modal Decision Fusion”, Advances in Digital Forensics X, Ifip Advances in Information and Communication Technology, 433, eds. Peterson G., Shenoi S., Springer-Verlag Berlin, 2014, 165–183  crossref  isi
    31. Segarra S., Eisen M., Ribeiro A., “Authorship Attribution Through Function Word Adjacency Networks”, IEEE Trans. Signal Process., 63:20 (2015), 5464–5478  crossref  mathscinet  isi
    32. Venckauskas A., Damasevicius R., Marcinkevicius R., Karpavicius A., “Problems of Authorship Identification of the National Language Electronic Discourse”, Information and Software Technologies, Icist 2015, Communications in Computer and Information Science, 538, eds. Dregvaite G., Damasevicius R., Springer-Verlag Berlin, 2015, 415–432  crossref  isi
    33. B. Ya. Ryabko, A. E. Gus'kov, I. V. Selivanova, “Information-theoretic method for classification of texts”, Problems Inform. Transmission, 53:3 (2017), 294–304  mathnet  crossref  isi  elib
    34. Selivanova I.V., Ryabko B.Y.A., Guskov A.E., “Classification By Compression: Application of Information-Theory Methods For the Identification of Themes of Scientific Texts”, Autom. Doc. Math. Linguist., 51:3 (2017), 120–126  crossref  mathscinet  isi
    35. Ryabko B., Guskov A., Selivanova I., “Using Data-Compressors For Statistical Analysis of Problems on Homogeneity Testing and Classification”, 2017 IEEE International Symposium on Information Theory (ISIT), IEEE International Symposium on Information Theory, IEEE, 2017, 121–125  isi
    36. Pal U., Nipu A.S., Ismail S., 2017 20Th International Conference of Computer and Information Technology (Iccit), IEEE, 2017  isi
    37. Rocha A., Scheirer W.J., Forstall Ch.W., Cavalcante T., Theophilo A., Shen B., Carvalho A.R.B., Stamatatos E., “Authorship Attribution For Social Media Forensics”, IEEE Trans. Inf. Forensic Secur., 12:1 (2017), 5–33  crossref  isi  scopus
    38. Akimushkin C., Amancio D.R., Oliveira Jr. Osvaldo N., “On the Role of Words in the Network Structure of Texts: Application to Authorship Attribution”, Physica A, 495 (2018), 49–58  crossref  isi  scopus
    39. Neal T., Sundararajan K., Fatima A., Yan Y., Xiang Y., Woodard D., “Surveying Stylometry Techniques and Applications”, ACM Comput. Surv., 50:6 (2018), 86  crossref  isi  scopus