M. I. Rudakov, A. N. Beznosikov, Y. A. Kholodov, A. V. Gasnikov, “Activations and gradients compression for model-parallel training”, Dokl. RAN. Math. Inf. Proc. Upr., 514:2 (2023), 126–137; Dokl. Math., 108:suppl. 2 (2023), S272–S281

Doklady Rossijskoj Akademii Nauk. Mathematika, Informatika, Processy Upravlenia


Doklady Rossijskoj Akademii Nauk. Mathematika, Informatika, Processy Upravlenia, 2023, Volume 514, Number 2, Pages 126–137
DOI: https://doi.org/10.31857/S2686954323601562 (Mi danma458)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

Activations and gradients compression for model-parallel training

M. I. Rudakov^ab, A. N. Beznosikov^ab, Y. A. Kholodov^ab, A. V. Gasnikov^ab

^a Innopolis University, Innopolis, Republic of Tatarstan
^b Moscow Institute of Physics and Technology, Moscow, Russia


Abstract: Training large neural networks requires large computing clusters. Model-parallel training, in which the model architecture is partitioned sequentially between workers, is a popular approach for training modern models. Compression can be applied to reduce communication time between workers, which is often a bottleneck in such systems. This work explores how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence. We analyze compression methods such as quantization and TopK compression, and also experiment with error compensation techniques. Moreover, we employ TopK with the AQ-SGD per-batch error feedback approach. We conduct experiments on image classification and language model fine-tuning tasks. Our findings demonstrate that gradients require milder compression rates than activations. We observe that $K=10\%$ is the strongest TopK compression level that does not severely harm model convergence. Experiments also show that models trained with TopK perform well only when compression is also applied during inference. We find that error feedback techniques do not improve model-parallel training compared to plain compression, but they allow model inference without compression with almost no quality drop. Finally, when combined with the AQ-SGD approach, TopK compression stronger than $K=30\%$ worsens model performance significantly.
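As a rough illustration of the two operations the abstract refers to, the NumPy sketch below shows TopK sparsification, which keeps only the fraction $K$ of entries with the largest magnitude, and a classic error-feedback wrapper, which adds the part dropped at the previous step back before compressing again. The names `topk_compress` and `ErrorFeedback` are hypothetical and are not taken from the paper. The wrapper is generic error feedback, not the AQ-SGD per-batch variant, and nothing here reproduces the authors' experimental setup.

```python
import numpy as np


def topk_compress(x, k_fraction):
    """Keep the k_fraction largest-magnitude entries of x and zero the rest."""
    flat = x.ravel()
    # Number of entries to keep: at least one, at most all of them.
    k = min(flat.size, max(1, int(round(k_fraction * flat.size))))
    # argpartition places the k largest magnitudes in the last k positions.
    idx = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(x.shape)


class ErrorFeedback:
    """Generic error feedback (an assumption, not AQ-SGD): carry the residual forward."""

    def __init__(self):
        self.residual = None

    def step(self, x, k_fraction):
        if self.residual is None:
            self.residual = np.zeros_like(x)
        corrected = x + self.residual         # add back what was dropped earlier
        sent = topk_compress(corrected, k_fraction)
        self.residual = corrected - sent      # remember what is dropped now
        return sent


# Usage: compress a toy "activation" vector at K = 10% and K = 30%.
x = np.array([0.1, -5.0, 0.2, 3.0, -0.3, 0.05, 1.0, -0.4, 0.0, 2.0])
sparse_10 = topk_compress(x, 0.10)
sparse_30 = topk_compress(x, 0.30)

ef = ErrorFeedback()
first = ef.step(x, 0.10)
second = ef.step(np.zeros_like(x), 0.10)  # the residual alone drives this step
```

In this sketch a smaller $K$ means stronger compression, so $K=10\%$ transmits a third as many entries as $K=30\%$. In a model-parallel pipeline the same operator would be applied to activations on the forward pass and to gradients on the backward pass at each partition boundary.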

Keywords: distributed learning, model parallelism, activation compression, gradient compression.

Funding agency	Grant number
Russian Science Foundation	23-11-00229
The research of A. Beznosikov was supported by the Russian Science Foundation (project no. 23-11-00229).

Presented: A. L. Semenov
Received: 01.09.2023
Revised: 15.09.2023
Accepted: 18.10.2023

English version:
Doklady Mathematics, 2023, Volume 108, Issue suppl. 2, Pages S272–S281
DOI: https://doi.org/10.1134/S1064562423701314

Document Type: Article

UDC: 517.54

Language: Russian

Citation: M. I. Rudakov, A. N. Beznosikov, Y. A. Kholodov, A. V. Gasnikov, “Activations and gradients compression for model-parallel training”, Dokl. RAN. Math. Inf. Proc. Upr., 514:2 (2023), 126–137; Dokl. Math., 108:suppl. 2 (2023), S272–S281

Citation in format AMSBIB

\Bibitem{RudBezKho23}
\by M.~I.~Rudakov, A.~N.~Beznosikov, Y.~A.~Kholodov, A.~V.~Gasnikov
\paper Activations and gradients compression for model-parallel training
\jour Dokl. RAN. Math. Inf. Proc. Upr.
\yr 2023
\vol 514
\issue 2
\pages 126--137
\mathnet{http://mi.mathnet.ru/danma458}
\crossref{https://doi.org/10.31857/S2686954323601562}
\elib{https://elibrary.ru/item.asp?id=56717792}
\transl
\jour Dokl. Math.
\yr 2023
\vol 108
\issue suppl. 2
\pages S272--S281
\crossref{https://doi.org/10.1134/S1064562423701314}

Linking options:

https://www.mathnet.ru/eng/danma458

https://www.mathnet.ru/eng/danma/v514/i2/p126
