
Russian Mathematical Surveys, 2024, Volume 79, Issue 6, Pages 1051–1091
DOI: https://doi.org/10.4213/rm10209e
(Mi rm10209)
 

Local methods with adaptivity via scaling

S. A. Chezhegov^{a,b,c}, S. N. Skorik^{b}, N. Khachaturov^{d}, D. S. Shalagin^{e}, A. A. Avetisyan^{f,b}, M. Takáč^{g}, Y. A. Kholodov^{e}, A. N. Beznosikov^{b,c,f}

a Moscow Institute of Physics and Technology (National Research University), Moscow, Russia
b Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow, Russia
c Sber AI Lab, Moscow, Russia
d Russian-Armenian University, Yerevan, Armenia
e Innopolis University, Innopolis, Russia
f Russian Presidential Academy of National Economy and Public Administration, Moscow, Russia
g Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Abstract: The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult to implement without leveraging multiple computing nodes in a distributed environment. Distributed optimization is also fundamental to emerging fields such as federated learning. Specifically, there is a need to organize the training process so as to minimize the time lost due to communication. A widely used and extensively researched technique to mitigate the communication bottleneck involves performing local training before communication. This approach is the focus of our paper. Concurrently, adaptive methods that incorporate scaling, notably led by Adam, have gained significant popularity in recent years. Therefore, this paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods. We consider the classical Local SGD method and enhance it with a scaling feature. A crucial aspect is that scaling is described generically, allowing us to analyze various approaches, including Adam, RMSProp, and OASIS, in a unified manner. In addition to the theoretical analysis, we validate the performance of our methods in practice by training a neural network.
Bibliography: 49 titles.
Keywords: convex optimization, distributed optimization, adaptive methods, preconditioning.
Funding agency: Ministry of Science and Higher Education of the Russian Federation. Grant number: 075-03-2024-214.
The work was done in the Laboratory of Federated Learning Problems of the Ivannikov Institute for System Programming of the Russian Academy of Sciences and supported by the Ministry of Science and Higher Education of the Russian Federation under appendix no. 2 to agreement no. 075-03-2024-214. The work was partially conducted while S. A. Chezhegov was a visiting research assistant in Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
Received: 15.08.2024
Published: 20.02.2025
Document Type: Article
UDC: 519.853.62
MSC: 68T07, 90C26
Language: English
Original paper language: English

1. Introduction

Distributed optimization

Today, there are numerous problem statements related to distributed optimization, with some of the most popular applications being machine learning [1] and deep learning [2]. The rapid advancement of artificial intelligence has enabled the resolution of a wide range of challenges, from object classification and credit scoring to machine translation. Consequently, the complexity of machine learning problems has been increasing steadily, necessitating the processing of ever larger data sets with increasingly large models [3]. Therefore, it is now challenging to envision learning processes without the use of parallel computing and, by extension, distributed algorithms [4]–[8].

In practice, various settings for distributed computing can be considered. One classical setting is optimization within a computing cluster: parallel computations expedite the learning process, and different workers can communicate with one another. This cluster setting closely resembles collaborative learning, where, instead of a single large cluster, there is a network of users with potentially smaller computational resources. These resources can be virtually combined into a substantial computational resource capable of addressing the overarching learning problem.

It is worth noting that, in these formulations, it can be assumed that the local data on each device are homogeneous — identical and originating from the same distribution — due to the artificial uniform partitioning of the dataset across computing devices. A more realistic setting occurs when the data are inherently distributed among the users, resulting in heterogeneity of local data across the workers. An example of a problem statement within this setting is federated learning [9]–[11], where each user’s data are private and often have a unique nature, coming from heterogeneous distributions.

Communication bottleneck

A central challenge of distributed optimization algorithms is organizing communications between computing devices or nodes. Consequently, numerous aspects need consideration, ranging from the organization of the communication process to its efficiency. In particular, communication costs can be significant due to the large volumes of information transmitted, especially in contemporary problems. As a result, total communication time in distributed learning becomes a bottleneck that must be addressed.

Various techniques exist to address the issue of total communication complexity [4], [12]–[14]. In this paper we concentrate on one such technique: local steps. The core of this approach is to enable devices to perform a certain number of local computations without the need for communication. Consequently, this reduces significantly the frequency of communications, which can in turn lower the final complexity of algorithms.

Methods with local steps

Local techniques have already been explored in various contexts. Among the first local methods, Parallel SGD was proposed in [15] and has since evolved, appearing under new identities such as Local SGD [16] and FedAvg [17]. Additionally, there have been studies introducing various modifications, including momentum [18], quantization [19], [20], and variance reduction [21], [22]. Nevertheless, new interpretations of local techniques continue to emerge. For instance, relatively recent works like SCAFFOLD [11] and ProxSkip [23], along with their modifications, have been introduced. The research path focusing on the Hessian similarity [24]–[27] of local functions, where local techniques are employed, is also worth mentioning, but in these cases most local steps are performed by a single/main device.

Although these approaches and techniques are widespread, they are all fundamentally based on variations of gradient descent, whether deterministic or stochastic. Recently, particularly in the field of machine learning, so-called adaptive methods have gained traction. These methods, which involve fitting adaptive parameters for individual components of a variable, have become increasingly popular due to research demonstrating their superior results in learning problems. This can be achieved by utilizing a scaling/preconditioning matrix that alters the vector direction of the descent. Among the most prominent methods incorporating this approach are Adam [28], RMSProp [29], and AdaGrad [30].

The choice of preconditioner

The structure of the preconditioner can vary significantly. For instance, it can be computed from the gradient at the current point, as exemplified by Adam and RMSProp. Alternatively, a scaling matrix structure based on the Hutchinson approximation [31], [32], such as that employed in OASIS [33], can be used. To simplify computation, recurrence relations (exponential smoothing) are typically introduced for the preconditioner. One of its most critical attributes is its diagonal form, which stems from the fact that using a preconditioner resembles quasi-Newton methods [34], [35]: the inverse of the scaling matrix must be calculated, and this task is simplified significantly when the matrix is diagonal.
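As a small illustration of this last point (ours, not from the original paper), with a diagonal preconditioner the "matrix inversion" in a scaled gradient step reduces to elementwise division:

```python
import numpy as np

# Hypothetical sketch: one preconditioned gradient step x <- x - gamma * D^{-1} g.
# Since D is diagonal (stored as the vector d_diag), inverting it is elementwise.
def scaled_step(x, grad, d_diag, gamma):
    return x - gamma * grad / d_diag

x = np.array([1.0, 2.0, 3.0])
grad = np.array([0.5, -1.0, 0.25])
d_diag = np.array([2.0, 4.0, 1.0])   # diagonal entries of the preconditioner D
print(scaled_step(x, grad, d_diag, gamma=0.1))
```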

The aforementioned techniques — local steps in distributed optimization and preconditioning — are widely used and have practical significance, yet their integration has not been explored extensively. Therefore, the objective of our research is to combine these two techniques in a single method and to study the result both theoretically and experimentally.

2. Contributions and related works

Our contributions are delineated into four main parts.

$\bullet$ Formulation of a new local method. This paper introduces a method that merges the Local SGD technique with preconditioning updates. In this approach, each device employs preconditioning to scale gradients on each node within the communication network. Devices perform multiple iterations to solve their local problems and, during a communication round, transmit information to the server, where the collected variables are averaged. Concurrently, the preconditioning matrix is updated and distributed by the server to all clients. The concept of Local SGD has extensively been explored in the literature [9], [16], [36]–[38], with [36] providing particularly tight and definitive results [39], which we leverage for our analysis.

$\bullet$ Unified assumption on the preconditioning matrix. To facilitate the convergence analysis of our novel algorithm, the exact structure of the scaling matrix need not be specified. We introduce a classical general assumption that allows us to examine a specific class of preconditioners simultaneously. This approach aligns with previous studies [40], [41], which have indicated that many adaptive methods, including Adam, RMSProp, and OASIS, adhere to this assumption.

$\bullet$ Setting for theoretical analysis. Diverging from prior research, we depart from certain assumptions, such as gradient boundedness and gradient similarity [42; Assumptions 2 and 3]. This shift enables us to address a broader class of problems.

$\bullet$ Interpretability of results. Prior research has examined the combination of the two heuristics mentioned above. Specifically, in [42] the authors explored local methods with scaling, such as FedAdaGrad and FedAdam, but we identified discrepancies in their results. Our study elucidates why our approach is exempt from these shortcomings (see § 5.2). Our findings offer greater interpretability, despite focusing on a more generalized theory in terms of preconditioning.

3. Preliminaries, requirements, and notation

In distributed optimization, we solve an optimization problem in the form

$$ \begin{equation} \min_{x \in \mathbb{R}^d}\biggl\{f(x) \overset{{\rm def}}{=} \frac{1}{M}\sum_{m=1}^{M} f_m (x)\biggr\}, \end{equation} \tag{1} $$
where
$$ \begin{equation*} f_m(x) \overset{{\rm def}}{=}\mathsf{E}_{z_m \sim \mathcal{D}_m}[f_m(x,z_m)] \end{equation*} \notag $$
is the loss function for the $m$th client, $m \in [M]$, $\mathcal{D}_m$ is the distribution of the data for the $m$th client, and $x$ denotes the parameters of the model. We also denote the solution of the problem (1) by $x_\ast$.

Given the diverse formulations of distributed learning, we differentiate between two scenarios: a homogeneous (identical) and a heterogeneous one. Formally, in the homogeneous case the equality of loss functions is guaranteed by the uniformity of the data:

$$ \begin{equation*} f_1(x)=\cdots=f_M(x). \end{equation*} \notag $$
Conversely, the heterogeneous case arises when such an equality does not hold.

To prove convergence, we introduce classical assumptions [43], [44] for objective functions: smoothness, unbiasedness and bounded variance, and the smoothness of the stochastic function. Our analysis, based on [36], adopts the same set of assumptions.

Assumption 1. Assume that each function $f_m$ is $\mu$-strongly convex for $\mu \geqslant 0$ and $L$-smooth. That is, for all $x,y \in \mathbb{R}^d$, we have

$$ \begin{equation*} \frac{\mu}{2}\,\|x-y\|^2 \leqslant f_m (x)-f_m (y)- \langle\nabla f_m (y),x-y\rangle \leqslant \frac{L}{2}\,\|x-y\|^2. \end{equation*} \notag $$
Also, we define $\kappa \overset{\rm def}{=} L/\mu$, the condition number.
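For a concrete illustration (ours, not from the original text), a convex quadratic satisfies Assumption 1 with $\mu$ and $L$ given by the extreme eigenvalues of its Hessian:
$$ \begin{equation*} f_m(x)=\frac{1}{2}\,x^{\top}Ax, \quad A \succ 0 \quad\Longrightarrow\quad \mu=\lambda_{\min}(A), \quad L=\lambda_{\max}(A), \quad \kappa=\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}. \end{equation*} \notag $$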

Next, we state two sets of assumptions about the problem’s stochastic nature, leading to varying convergence rates.

Assumption 2. Assume that $f_m$ satisfies the following: for all $x \in \mathbb{R}^d$ and for $z$ drawn independently and identically distributed (i.i.d.) in accordance with the distribution $\mathcal{D}_m$, we have

$$ \begin{equation*} \begin{gathered} \, \mathsf{E}_{z \sim \mathcal{D}_m}[\nabla f_m(x,z)]=\nabla f_m(x), \\ \mathsf{E}_{z \sim \mathcal{D}_m}\bigl[\|\nabla f_m(x,z)- \nabla f_m(x)\|^2\bigr] \leqslant \sigma^2. \end{gathered} \end{equation*} \notag $$

Assumption 2 traditionally controls stochasticity but often does not apply to finite-sum problems (for $\mu$-strongly convex, $\mu > 0$, objectives [45]). To encompass such cases, we introduce the following assumption.

Assumption 3. Assume that $f_m(\,\cdot\,,z)\colon \mathbb{R}^d \to \mathbb{R}$ is almost surely $L$-smooth and $\mu$-strongly convex.

Additionally, to yield more precise results in heterogeneous scenarios, we introduce the following definition:

$$ \begin{equation*} \sigma_{\rm dif}^2 \overset{{\rm def}}{=} \frac{1}{M} \sum_{m=1}^{M} \mathsf{E}_{z_m \sim \mathcal{D}_m}\bigl[\|\nabla f_m(x_\ast,z_m)\|^2\bigr]. \end{equation*} \notag $$

4. Preconditioning meets local method

We are now prepared to present our algorithm, introduce a unified assumption for the preconditioner along with its properties, and discuss theoretical results for convergence within two distinct regimes.

4.1. SAVIC

Let us describe an algorithm (Algorithm 1), which is based on Local SGD. Each client maintains its own variable $x_t^m$, where $m \in [M]$ represents the client index, $t$ denotes the iteration number, and $f_m$ is the loss function of the $m$th client. Additionally, we define a series of moments of time $t_1,t_2,\ldots$ , designated for communication. Depending on whether the current point in time aligns with a communication moment, a client either performs local steps or transmits its current information to the server, where averaging occurs. The innovation introduced in this algorithm is the matrix $\widehat{D}^{t_p}$, with $t_p$ indicating one of the synchronization iterations. In the algorithm described this modification acts as a preconditioner, highlighted in blue for emphasis. It is crucial to note that the matrix is updated exclusively during synchronization moments and remains consistent across all clients. The rationale behind these facts will be elucidated later, when we delve into the properties of the preconditioner.
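Since the Algorithm 1 listing is not reproduced here, the following is a minimal sketch (under our reading of the description above, not the paper's exact pseudocode) of its structure; `stochastic_grad` and `update_preconditioner` are hypothetical callbacks standing in for the clients' stochastic oracles and for the scaling rules of § 4.2.

```python
import numpy as np

# Minimal sketch of a local method with a globally shared diagonal preconditioner:
# clients take scaled local SGD steps; at synchronization moments t_p the server
# averages the iterates and refreshes the matrix \hat{D}, which is then broadcast.
def savic_sketch(x0, M, T, gamma, sync_times, stochastic_grad, update_preconditioner):
    x = [x0.copy() for _ in range(M)]      # local iterates x_t^m
    d_diag = np.ones_like(x0)              # diagonal of \hat{D}^{t_p}, same on all clients
    for t in range(T):
        for m in range(M):                 # local preconditioned step on client m
            g = stochastic_grad(m, x[m])
            x[m] = x[m] - gamma * g / d_diag
        if t + 1 in sync_times:            # communication round t_p
            x_avg = sum(x) / M             # server averages the local iterates
            d_diag = update_preconditioner(d_diag, x_avg)  # new \hat{D}^{t_p}, broadcast
            x = [x_avg.copy() for _ in range(M)]
    return sum(x) / M
```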

4.2. Preconditioning

In practice, we often encounter various recurrence relations between preconditioners for two adjacent iterations. One rule for updating the scaling matrix is given by the following expression:

$$ \begin{equation} (D^{t})^2=\beta_t(D^{t-1})^2+(1-\beta_t)(H^{t})^2, \end{equation} \tag{2} $$
where $\beta_t \in [0,1]$ is a preconditioning momentum parameter and $(H^t)^2$ is a diagonal matrix. This update mechanism is observed in Adam-based methods, where
$$ \begin{equation*} (H^t)^2=\operatorname{diag}\bigl(\nabla f(x_t,z_t) \odot \nabla f(x_t,z_t)\bigr); \end{equation*} \notag $$
here $z_t$ is an independent random variable. The original Adam [28] has $\beta_t=(\beta-\beta^{t+1})/(1-\beta^{t+1})$, while an earlier method, RMSProp [29], has $\beta_t \equiv \beta$. Furthermore, the update rule (2) is also applicable to AdaHessian [46], which relies on the Hutchinson method. For this we need to select the momentum parameter as $\beta_t=(\beta-\beta^{t+1})/(1-\beta^{t+1})$ and set
$$ \begin{equation*} (H^t)^2=\operatorname{diag}\bigl(v_t \odot \nabla^2 f(x_t,z_t)v_t\bigr)^2, \end{equation*} \notag $$
where $v_t$ has independent identically distributed elements which follow the Rademacher distribution.

Alternatively, the update for the preconditioning matrix can be represented as

$$ \begin{equation} D^{t}=\beta_tD^{t-1}+(1-\beta_t)H^{t}. \end{equation} \tag{3} $$
This approach is also commonly used, for instance, in methods like OASIS [33], where $\beta_t \equiv \beta$ and
$$ \begin{equation*} H^t=\operatorname{diag}\bigl(v_t \odot \nabla^2 f(x_t,z_t)v_t\bigr). \end{equation*} \notag $$

Despite the presence of Hessians in AdaHessian and OASIS, computing the matrix of second derivatives is unnecessary; it suffices to compute the gradient $\nabla f(x_t,z_t)$ and then calculate the gradient of $\langle \nabla f(x_t, z_t), v_t \rangle$ (that is, we just need to take a Hessian-vector product).
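To illustrate this remark, here is a short sketch (ours, using PyTorch autograd; not code from the paper) of a one-sample Hutchinson estimate of the diagonal of the Hessian built from a single Hessian-vector product:

```python
import torch

# Sketch: estimate diag(H) as v ⊙ (H v) with a Rademacher vector v, using only a
# Hessian-vector product obtained by differentiating <grad, v> (double backprop).
def hutchinson_diag(loss, params):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    vs = [(torch.randint(0, 2, g.shape) * 2 - 1).to(g.dtype) for g in grads]  # Rademacher
    dot = sum((g * v).sum() for g, v in zip(grads, vs))   # <∇f(x, z), v>
    hvs = torch.autograd.grad(dot, params)                # H v
    return [v * hv for v, hv in zip(vs, hvs)]             # v ⊙ (H v), as in AdaHessian/OASIS
```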

In practice, to avoid division by zero, a positive definite matrix is typically used. This can be achieved by the following modification of the matrix $D^t$:

$$ \begin{equation} (\widehat{D}^t)_{ii}=\max\{\alpha,|D^t_{ii}|\}. \end{equation} \tag{4} $$
Alternatively, the scaling matrix can be modified by adding $\alpha$ to each diagonal element: $(\widehat{D}^t)_{ii}=|D^t_{ii}|+\alpha$, preserving the core idea of ensuring that the preconditioner is positive definite.
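To make the update rules concrete, here is a small sketch (ours, not from the paper) of (2), (3), and (4) with diagonal matrices stored as vectors:

```python
import numpy as np

def update_squared(d_sq, h_sq, beta):
    """Rule (2): (D^t)^2 = beta (D^{t-1})^2 + (1 - beta) (H^t)^2 (Adam/RMSProp style)."""
    return beta * d_sq + (1.0 - beta) * h_sq

def update_linear(d, h, beta):
    """Rule (3): D^t = beta D^{t-1} + (1 - beta) H^t (OASIS style)."""
    return beta * d + (1.0 - beta) * h

def safeguard(d, alpha):
    """Rule (4): (D-hat^t)_{ii} = max(alpha, |D^t_{ii}|), ensuring positive definiteness."""
    return np.maximum(alpha, np.abs(d))

# Example: an RMSProp-style update built from a stochastic gradient g.
g = np.array([0.3, -0.7, 0.1])
d_sq = update_squared(np.ones_like(g), g * g, beta=0.999)   # (H^t)^2 = g ⊙ g
d_hat = safeguard(np.sqrt(d_sq), alpha=1e-8)                # diagonal of \hat{D}^t
```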

To summarize the approaches, we make the following assumption.

Assumption 4. Assume that $D^{0}$ and $H^t$ satisfy the relations below for some $\alpha > 0$ and $\Gamma \geqslant \alpha$, for all $t$:

$$ \begin{equation*} \alpha I \preceq D^{0} \preceq \Gamma I, \qquad \alpha I \preceq H^{t} \preceq \Gamma I, \end{equation*} \notag $$
where $I$ is an identity matrix.

The above assumption implies the following lemma.

Lemma 1 ([41], Lemma 1). Assume that $D^0$ and, for all $t$, the $H^t$ are diagonal matrices with elements not greater than $\Gamma$ in absolute value. Then the matrices $\widehat D^t$ obtained by the rules (2)–(4) satisfy $\alpha I \preceq \widehat{D}^{t} \preceq \Gamma I$ for all $t$.

It is straightforward to demonstrate (we refer to Table 1 and, for further details, to [40], Lemma 2.1, parts B and C) that classical and well-known preconditioners meet the conditions of Lemma 1. The results of Lemma 1 have already been validated in [41]. However, for a clear convergence analysis we need to formulate the following corollary.

Table 1. $\Gamma$ for various preconditioners ($G$ is the upper bound on the gradient norm; the presence of $G$ is typical for an analysis of RMSProp and Adam [47])

Method      OASIS           RMSProp   Adam
$\Gamma$    $\sqrt{d}\,L$   $G$       $G$

Corollary 1. Suppose $\{\widehat{x}_t \}_{t=0}^\infty$ are average points generated by Algorithm 1. Moreover, for any $t$ we have the corresponding scaling matrix $\widehat{D}^{t}$. Hence, according to the update rules (2), (3), and (4), we obtain

$$ \begin{equation*} \|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t+1}} \leqslant \bigl(1+(1-\beta_{t+1})C\bigr)\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^t}, \end{equation*} \notag $$
where $C$ depends on the preconditioner update setting. In particular, choosing $\beta_{t+1}$ in a certain way for each setting allows one to claim the following result:
$$ \begin{equation*} \|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t+1}} \leqslant \biggl(1+\frac{\gamma\mu}{2\Gamma}\biggr) \|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t}}, \end{equation*} \notag $$
where
$$ \begin{equation*} C=\begin{cases} \dfrac{\Gamma^2}{2\alpha^2} &\textit{for (2)},\vphantom{\biggl\}} \\ \dfrac{2\Gamma}{\alpha} &\textit{for (3)},\vphantom{\biggl\}} \end{cases}\quad\textit{which requires that}\quad \beta_{t+1} \geqslant\begin{cases} 1-\dfrac{\gamma\mu\alpha^2}{\Gamma^3} &\textit{for (2)}, \vphantom{\biggl\}} \\ 1-\dfrac{\gamma\mu\alpha}{4\Gamma^2} &\textit{for (3)}, \vphantom{\biggl\}} \end{cases} \end{equation*} \notag $$
and $\|x\|^2_{\widehat{D}^{t}}$ is the squared norm induced by the matrix, that is,
$$ \begin{equation*} \|x\|^2_{\widehat{D}^{t}}=\langle x,\widehat{D}^{t}x\rangle. \end{equation*} \notag $$

4.3. Convergence analysis

In the following subsections we present the convergence results of Algorithm 1 for different settings: identical and heterogeneous data.

4.3.1. Identical data

Below is the main result for the identical data case.

Theorem 1. Suppose that Assumptions 1, 2, and 4 hold for $\mu > 0$. Then for Algorithm 1 with identical data and constant stepsize $\gamma > 0$ such that $\gamma \leqslant \alpha/(4L)$, and $H \geqslant 1$ such that $\max_{p}|t_p-t_{p+1}| \leqslant H$, for all $T$ we have

$$ \begin{equation*} \mathsf{E}[f(\overline{x}_{T-1})-f(x_\ast)]\leqslant \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)^{T} \frac{2\Gamma\|x_0-x_\ast\|^2}{\gamma}+\frac{4\gamma\sigma^2}{\alpha M}+ \frac{8\gamma^2L\Gamma(H-1)\sigma^2}{\alpha^2}\,, \end{equation*} \notag $$
where
$$ \begin{equation*} \begin{alignedat}{2} \widehat{x}_t &\overset{{\rm def}}{=} \frac{1}{M}\sum_{m=1}^M x_t^m,&\qquad w_t&\overset{{\rm def}}{=}\biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)^{-(t+1)}, \\ W_{T-1} &\overset{{\rm def}}{=} \sum_{t=0}^{T-1} w_t,&\qquad \overline{x}_{T-1}&\overset{{\rm def}}{=}\frac{1}{W_{T-1}}\sum_{t=0}^{T-1} w_t\widehat{x}_t. \end{alignedat} \end{equation*} \notag $$

Using this theorem and selecting the parameters properly, we obtain the following result, which ensures convergence.

Corollary 2. Choosing

$$ \begin{equation*} \gamma=\min\biggl\{\frac{\alpha}{4L}\,,\frac{2\Gamma}{\mu T} \log\biggl(\max\biggl\{2, \frac{\mu^2\|x_0-x_\ast\|^2T^2}{4\Gamma^2c}\biggr\}\biggr)\biggr\}, \end{equation*} \notag $$
where
$$ \begin{equation*} c \overset{{\rm def}}{=} \frac{4\sigma^2}{\alpha M}\,, \end{equation*} \notag $$
in Theorem 1, we claim the following result for the convergence rate:
$$ \begin{equation*} \mathsf{E}[f(\overline{x}_{T-1})-f(x_\ast)]\,{=}\,\widetilde{\mathcal{O}} \biggl(\frac{L\Gamma\|x_0-x_\ast\|^2}{\alpha} \exp\biggl(-\frac{\mu\alpha T}{\Gamma L}\biggr)+ \frac{\Gamma\sigma^2}{\alpha\mu^2 MT}+ \frac{L\Gamma^2 \sigma^2(H-1)}{\mu T^2 \alpha^2}\biggr), \end{equation*} \notag $$
where $\widetilde{\mathcal{O}} (\,{\cdot}\,)$ omits polylogarithmic and constant factors.

Since the idea of the proof of Corollary 2 is the same as in the heterogeneous case, we omit it.

4.3.2. Heterogeneous data

Next, we show convergence guarantees for the heterogeneous case.

Theorem 2. Suppose that Assumptions 1, 3, and 4 hold. Then for Algorithm 1 with heterogeneous setting, $M \geqslant 2$, $\max_{p}|t_p-t_{p+1}| \leqslant H$, and $\gamma > 0$ such that $\gamma \leqslant \dfrac{\alpha}{10(H-1)L}$ , we have

$$ \begin{equation*} \mathsf{E}\bigl[f(\overline{x}_{T-1})-f(x_\ast)\bigr] \leqslant \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)^T \frac{\Gamma\|x_{0}-x_\ast\|^2}{\gamma}+\gamma\sigma_{\rm dif}^2 \biggl(\frac{9(H-1)}{2\alpha}+\frac{8}{M\alpha}\biggr), \end{equation*} \notag $$
where
$$ \begin{equation*} \begin{alignedat}{2} \widehat{x}_t &\overset{{\rm def}}{=} \frac{1}{M}\sum_{m=1}^M x_t^m,&\qquad w_t &\overset{{\rm def}}{=}\biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)^{-(t+1)}, \\ W_{T-1} &\overset{{\rm def}}{=} \sum_{t=0}^{T-1} w_t,&\qquad \overline{x}_{T-1} &\overset{{\rm def}}{=} \frac{1}{W_{T-1}} \sum_{t=0}^{T-1}w_t \widehat{x}_t. \end{alignedat} \end{equation*} \notag $$

The next corollary describes the convergence of Algorithm 1 in the heterogeneous case.

Corollary 3. Choosing

$$ \begin{equation*} \gamma=\min\biggl\{\frac{\alpha}{10(H-1)L}\,,\frac{2\Gamma}{\mu T} \log\biggl(\max\biggl\{2, \frac{\mu^2\|x_{0}-x_\ast\|^2 T^2}{4\Gamma c}\biggr\}\biggr)\biggr\} \end{equation*} \notag $$
in Theorem 2, where
$$ \begin{equation*} c \overset{{\rm def}}{=} \sigma_{\rm dif}^2\biggl(\frac{9(H-1)}{2\alpha}+ \frac{8}{M\alpha}\biggr), \end{equation*} \notag $$
we claim the following result for the convergence rate:
$$ \begin{equation*} \begin{aligned} \, &\mathsf{E}\bigl[f(\overline{x}_{T-1})-f(x_\ast)\bigr] \\ &\qquad=\widetilde{\mathcal{O}}\biggl(\frac{(H-1)L\Gamma}{\alpha}\, \|x_0-x_\ast\|^2\exp\biggl(-\frac{\mu T \alpha}{\Gamma(H-1)L}\biggr)+ \frac{\Gamma\sigma_{\rm dif}^2}{\alpha\mu T} \biggl((H-1)+\frac{1}{M}\biggr)\biggr). \end{aligned} \end{equation*} \notag $$

5. Discussion

In this section we discuss the results obtained by Algorithm 1 and compare these results qualitatively with the existing estimates for algorithms of similar structure.

5.1. Interpretation of our results

For a comprehensive understanding of the results obtained, it is essential to refer to both the theoretical analysis (see § 4.3) and experiments conducted (see § 6).

$\bullet$ Preservation of analysis structure. The structure of the estimates obtained during our analysis remains consistent with that found in [36], which served as the foundation for our analysis. Given that the estimates in the original paper [36] are shown to be optimal, as demonstrated in [39], our estimates also achieve optimal performance within a specific class of preconditioning matrices.

$\bullet$ Boundary behaviour. The primary distinction between our analysis and that of the unscaled version in [36] lies in the introduction of constants $\alpha$ and $\Gamma$ into our estimates. The impact of $\Gamma$ is generally not significant, often becoming apparent through certain assumptions or lemmas, as illustrated in [40]. However, $\alpha$ represents a parameter that can be adjusted in practice, similarly to implementations in Adam or OASIS. Thus, the sensitivity of our estimates to this parameter, which is typically quite small, is notably significant.

$\bullet$ The relationship between experiments and theory. The theoretical findings suggest a less favourable convergence rate for our method as compared to classical Local SGD, which can be attributed to an additional multiplicative factor $\Gamma/\alpha$ in our estimates. Conversely, experimental results indicate an enhanced convergence rate for our method over Local SGD. This discrepancy arises because our theorems rely on a unified assumption regarding the preconditioning matrix, whereas experiments employ specific scaling matrix structures. These structures, when incorporated into the theoretical analysis, could potentially reduce the algorithm’s complexity. Our findings do not necessarily indicate a theoretical improvement over the existing methods; rather, they confirm that convergence is achievable with the incorporation of adaptive structures through scaling. This opens avenues for future research, particularly in the exploration of specific types of preconditioning matrices, an area where a significant gap exists across all adaptive methods employing scaling.

5.2. Discussion of results from [42]

In this section we compare our approach with the algorithm (Algorithm 2) developed in [42].

The following assumptions were made in [42].

Assumption 5. Assume that each $f_m$ is $L$-smooth, that is,

$$ \begin{equation*} \|\nabla f_m(x)-\nabla f_m(y)\| \leqslant L \|x-y\| \quad\text{for all}\ \ x, y \in \mathbb{R}^d. \end{equation*} \notag $$

Assumption 6. Assume that the $\{f_m\}_{m=1}^M$ satisfy the following inequalities with $z_m \sim \mathcal{D}_m$:

$$ \begin{equation*} \begin{alignedat}{2} \mathsf{E}\bigl[\|[\nabla f_m(x,z_m)]_j-[\nabla f_m(x)]_j\|^2\bigr] &\leqslant \sigma_{{\rm l},j}^2 &&\quad \text{(local variance)}, \\ \frac{1}{M}\sum_{m=1}^M \|[\nabla f_m(x)]_j-[\nabla f(x)]_j\|^2 &\leqslant \sigma_{{\rm g},j}^2 &&\quad \text{(global variance)} \end{alignedat} \end{equation*} \notag $$
for all $x \in \mathbb{R}^d$ and $j \in [d]$.

Assumption 7. Assume that $f_m(x,z)$ have $G$-bounded gradients, that is, for any $m \in [M]$, $x \in \mathbb{R}^d$, and $z \sim \mathcal{D}_m$ we have

$$ \begin{equation*} |[\nabla f_m(x,z)]_j| \leqslant G \end{equation*} \notag $$
for all $j \in [d]$.

Based on these assumptions, the authors of [42] presented the following results.

Theorem 3. Suppose that Assumptions 5, 6, and 7 are satisfied. Then, for Algorithm 2 with $\sigma^2 \overset{{\rm def}}{=} \sigma_{\rm l}^2+6K\sigma_{\rm g}^2$ and $\eta_{\rm l}$ such that

$$ \begin{equation} \begin{aligned} \, \eta_{\rm l} \leqslant \begin{cases} \dfrac{1}{16K}\min\biggl\{\dfrac{1}{L}\,,\dfrac{1}{T^{1/6}} \biggl[\dfrac{\tau}{120L^2G}\biggr]^{1/3}\biggr\}, \\ \dfrac{1}{16K}\min\biggl\{\dfrac{\tau \eta L}{2G^2}\,,\dfrac{\tau}{4L\eta}\,, \dfrac{1}{T^{1/4}}\biggl[\dfrac{\tau^2}{GL\eta}\biggr]^{1/2}\biggr\}, \end{cases} \end{aligned} \end{equation} \tag{5} $$
it follows that for all $T$,
$$ \begin{equation*} \min_{0 \leqslant t \leqslant T-1}\mathsf{E}\bigl[\|\nabla f(x_t)\|^2\bigr] \leqslant \mathcal{O}\biggl(\biggl[\frac{G}{\sqrt{T}}+ \frac{\tau}{\eta_{\rm l}KT}\biggr](\Psi+{\Psi}_{\rm var})\biggr), \end{equation*} \notag $$
where
$$ \begin{equation*} \begin{aligned} \, \Psi&=\frac{f(x_0)-f(x^*)}{\eta}+\frac{5\eta_{\rm l}^3K^2L^2T}{2\tau}\sigma^2, \\ \Psi_{\rm var}&=\frac{2\eta_{\rm l}KG^2+\tau\eta L}{\tau^2} \biggl[\frac{2\eta_{\rm l}^2KT}{m}\sigma_{\rm l}^2+ 10\eta_{\rm l}^4K^3L^2T\sigma^2\biggr]. \end{aligned} \end{equation*} \notag $$

Using Theorem 3, we can set $\eta=\operatorname{const}$. Let us fix $T > 0$ and consider the case when $\tau$ tends to zero. Without loss of generality, according to (5), we obtain $\eta_{\rm l} \sim \tau$. Substituting such $\eta_{\rm l}=\tau\overline{\eta}_{\rm l}$ gives

$$ \begin{equation*} \begin{aligned} \, \min_{0 \leqslant t \leqslant T-1} \mathsf{E}\bigl[\|\nabla f(x_t)\|^2\bigr] &\leqslant \mathcal{O}\biggl(\biggl[\frac{G}{\sqrt{T}}+\frac{\tau}{\eta_{\rm l}KT}\biggr] (\Psi+{\Psi}_{\mathrm{var}})\biggr) \\ &=\mathcal{O}\biggl(\biggl[\frac{G}{\sqrt{T}}+ \frac{1}{\overline{\eta}_{\rm l}KT}\biggr] \biggl(\frac{f(x_0)-f(x^*)}{\eta}+ \frac{5\tau^2\overline{\eta}_{\rm l}^3L^2K^2T}{2}\,\sigma^2 \\ &\qquad+ \frac{2\overline{\eta}_{\rm l}KG^2+\eta L}{\tau} \biggl[\frac{2\tau^2\overline{\eta}_{\rm l}^2KT}{m} \sigma_{\rm l}^2+ 10\tau^4\overline{\eta}_{\rm l}^4K^3L^2T\sigma^2\biggr]\biggr)\biggr) \\ &=\mathcal{O}\biggl(\biggl[\frac{G}{\sqrt{T}}+ \frac{1}{\overline{\eta}_{\rm l}KT}\biggr]\biggl(\frac{f(x_0)-f(x^*)}{\eta}+ \tau\biggl(\frac{5\tau\overline{\eta}_{\rm l}^3L^2K^2T}{2}\,\sigma^2 \\ &\qquad+(2\overline{\eta}_{\rm l}KG^2+\eta L) \biggl[\frac{2\overline{\eta}_{\rm l}^2KT}{m}\,\sigma_{\rm l}^2+ 10\tau^2\overline{\eta}_{\rm l}^4K^3L^2T\sigma^2\biggr]\biggr)\biggr)\biggr) \\ &=\mathcal{O}\biggl(\biggl[\frac{G}{\sqrt{T}}+ \frac{1}{\overline{\eta}_{\rm l}KT}\biggr]\frac{f(x_0)-f(x^*)}{\eta}\biggr) \end{aligned} \end{equation*} \notag $$
because $\tau$ tends to zero.

Hence the result appears to be independent of $\sigma_{\rm l}^2$, suggesting that the algorithm operates at any noise level, which seems impractical. This issue can be related to an error in the proof of Theorem 1 in [42]. Specifically, at the end of Theorem 1, on p. 18 in [42], multiplication in Lines 3 and 4 introduces $\tau^3$ in the denominator, whereas the final estimate incorrectly indicates $\tau^2$. Therefore, a more accurate representation of ${\Psi}_{\mathrm{var}}$ would be

$$ \begin{equation*} {\Psi}_{\rm var}=\frac{2\eta_{\rm l}KG^2+ \tau\eta L}{\tau^3}\biggl[\frac{2\eta_{\rm l}^2KT}{m} \sigma_{\rm l}^2+10\eta_{\rm l}^4K^3L^2T\sigma^2\biggr], \end{equation*} \notag $$
with $\tau^3$ instead of $\tau^2$. However, even after addressing this error, the analysis still faces challenges. With Algorithm 2 allowing $\beta_1=0$, and by substituting in $v_{-1}=1$ (feasible as $\tau$ approaches zero), we arrive at the following chain of conclusions: since $\eta_{\rm l} \sim \tau$, the aggregated client update $\Delta_t$ is of order $\tau$, whereas $\sqrt{v_t} \approx \sqrt{v_{-1}}=1$, so that the server step $\eta\,\Delta_t/(\sqrt{v_t}+\tau)$ is also of order $\tau$ and the iterates barely change.

Then, as a direct consequence, the smaller $\tau$, the more iterations are needed for convergence, since the changes in iterations are becoming smaller due to reduction of $\tau$. This issue arises fundamentally from neglecting $v_{-1}$ in the final analysis. The claim that $\sqrt{v_{t-1,j}} \leqslant \eta_{\rm l} KG\sqrt{T}$ (p. 16 in [42]) does not involve $v_{-1}$, affecting the final convergence estimate. If $v_{-1} \sim \tau^2$, then $\Delta_t/(\sqrt{v_t}+\tau) \sim \operatorname{const}$, resolving the issue of minor changes in $x_t$ across iterations.

6. Experiments

In this section we describe the experimental setups and present the results.

6.1. Setup

Datasets

We utilize the CIFAR-10 dataset [48] in our experiments. We choose the number of clients $M$ equal to 10 (the same as the number of classes in CIFAR-10). We divide the data into training and test parts in the ratio 90%–10% and split the training sample among the devices into equal parts. To realize data heterogeneity, for each client we select one of the 10 classes as its 'main' class: 30%, 50%, or 70% of that client's samples come from the 'main' class, and the rest of its local data are added evenly from the remaining classes.
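A hedged sketch (ours, with illustrative sizes) of this heterogeneous partition might look as follows, where `labels` is the array of class labels of the training set and `p` is the 'main'-class fraction:

```python
import numpy as np

# Sketch of the split described above: client m receives a fraction p of its samples
# from its 'main' class m and the rest evenly (and disjointly) from the other classes.
def split_heterogeneous(labels, num_clients=10, per_client=4500, p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    class_idx = [list(rng.permutation(np.where(labels == c)[0])) for c in range(num_clients)]
    n_main = int(p * per_client)
    clients = [class_idx[m][:n_main] for m in range(num_clients)]   # 'main'-class part
    cursor = [n_main] * num_clients       # next unused sample of each class
    for m in range(num_clients):
        others = [c for c in range(num_clients) if c != m]
        n_rest = per_client - n_main
        for j, c in enumerate(others):    # spread the remaining quota evenly
            take = n_rest // len(others) + (1 if j < n_rest % len(others) else 0)
            clients[m] += class_idx[c][cursor[c]:cursor[c] + take]
            cursor[c] += take
    return [np.array(idx) for idx in clients]
```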

Metric

Since we solve the classification problem, we use standard metrics such as cross-entropy loss and accuracy.

Models

We choose the ResNet-18 model [49] for our analysis.

Optimization methods

For our experiments we implemented three different preconditioning matrices: the identity matrix (representing pure Local SGD with momentum), the matrix from Adam [28], and the matrix from OASIS [33]. In the cases when Adam or OASIS is used, we study two ways of updating the scaling matrix: global (as done in Algorithm 1, where all devices have the same matrix and update it at the time of synchronization) and local (where each device updates its own scaling matrix at each iteration; we do not present theoretical studies of this approach).

For all methods we chose a heavy-ball momentum $\beta_1$ of $0.9$, a scaling momentum $\beta_2$ of $0.999$, a batch size of $256$, and $18$ local iterations between communications (one epoch).

6.2. Results

The outcomes of the experiments are shown in Fig. 1. The results, contrary to theoretical expectations, demonstrate that methods with scaling achieve the required accuracy faster than those without it. This gap between theory and practice is typical and has a clear cause: the theoretical framework considers a general type of preconditioning matrix, which does not incorporate the specifics of adaptive scaling. Moreover, local scaling (for which we do not provide a theory, since it is a more complex case than global scaling) works better for Adam than the approach using Algorithm 1, but for OASIS the global scaling is no worse and sometimes even better.

7. Conclusion

In this paper we present a unified convergence analysis of a method that combines Local SGD with a preconditioning technique. We demonstrate that the theoretical convergence rate of the method is preserved, except for the introduction of a multiplicative factor, $\Gamma/\alpha$. This modification is due to our consideration of a general form for the scaling matrix. Additionally, we present experiments showing that Local SGD with scaling outperforms the version without it. Our paper also identifies areas for future work, suggesting that one could consider specific types of preconditioning matrices to demonstrate theoretical improvements in convergence. Also, an interesting question for future research is the construction of a theory of local individual scaling, which surpasses global scaling from Algorithm 1 in experiments.

Appendix A. Basic facts and auxiliary lemmas

We use notation similar to that of [16] and denote the sequence of time stamps when synchronization happens by $(t_{p})_{p=1}^{\infty}$. Given stochastic gradients $g_t^1,g_t^2,\dots,g_t^M$ at time $t \geqslant 0$, we define

$$ \begin{equation*} g_t \overset{{\rm def}}{=} \frac{1}{M}\sum_{m=1}^M g_t^m \end{equation*} \notag $$
and
$$ \begin{equation*} \overline{g}_t^m \overset{{\rm def}}{=} \mathsf{E}[g_t^m]=\begin{cases} \nabla f(x_t^m) & \text{for identical data}, \\ \nabla f_m(x_t^m) & \text{otherwise}, \end{cases} \qquad \overline{g}_t \overset{{\rm def}}{=} \mathsf{E}[g_t]. \end{equation*} \notag $$
Let us introduce two definitions, which are crucial for our analysis:
$$ \begin{equation*} V_t \overset{{\rm def}}{=}\frac{1}{M}\sum_{m=1}^M\|x_t^m- \widehat{x}_t\|^2_{\widehat{D}^{t_p}}\quad \text{for}\ \ t_p \leqslant t < t_{p+1}; \qquad \widehat{x}_t \overset{{\rm def}}{=} \frac{1}{M}\sum_{m=1}^M x_t^m. \end{equation*} \notag $$
Throughout the proofs, we use the variance decomposition that holds for any random vector $X$ with finite second moment:
$$ \begin{equation} \mathsf{E}\bigl[\|X\|^2\bigr]=\mathsf{E}\bigl[\|X-\mathsf{E}[X]\|^2\bigr]+ \|\mathsf{E}[X]\|^2. \end{equation} \tag{6} $$
In particular, applying it to a random vector that takes the values $X_1,\dots,X_M$ with equal probabilities gives
$$ \begin{equation} \frac{1}{M}\sum_{m=1}^M \|X_m\|^2=\frac{1}{M}\sum_{m=1}^M \biggl\|X_m- \frac{1}{M}\sum_{i=1}^M X_i\biggr\|^2+ \biggl\|\frac{1}{M}\sum_{m=1}^M X_m\biggr\|^2. \end{equation} \tag{7} $$
As a consequence of (6), we have
$$ \begin{equation} \mathsf{E}\bigl[\|X-\mathsf{E}[X]\|^2\bigr]\leqslant \mathsf{E}\bigl[\|X\|^2\bigr]. \end{equation} \tag{8} $$
For any convex function $f$ and any vectors $x^1,\dots,x^M$ we have Jensen’s inequality:
$$ \begin{equation} f\biggl(\frac{1}{M}\sum_{m=1}^M x^m\biggr)\leqslant \frac{1}{M}\sum_{m=1}^M f(x^m). \end{equation} \tag{9} $$
As a special case where $f(x)=\|x\|^2$, we obtain
$$ \begin{equation} \biggl\|\frac{1}{M}\sum_{m=1}^M x_m\biggr\|^2\leqslant \frac{1}{M}\sum_{m=1}^M \|x_m\|^2. \end{equation} \tag{10} $$
We denote the Bregman divergence associated with a function $f$ and arbitrary $x$ and $y$ by $D_f(x,y)$:
$$ \begin{equation*} D_f(x,y)\overset{{\rm def}}{=}f(x)-f(y)-\langle\nabla f(y),x-y\rangle. \end{equation*} \notag $$
If $f$ is $L$-smooth and convex, then for any $x$ and $y$ we have
$$ \begin{equation} \|\nabla f(x)-\nabla f(y)\|^2\leqslant 2LD_f(x,y). \end{equation} \tag{11} $$
If $f$ satisfies Assumption 1, then
$$ \begin{equation} f(x)+\langle\nabla f(y),x-y\rangle+\frac{\mu}{2}\,\|y-x\|^2 \leqslant f(y)\quad\text{for all} \ \ x,y \in \mathbb{R}^d. \end{equation} \tag{12} $$
We also use the following facts:
$$ \begin{equation} \|x+y\|^2 \leqslant 2\|x\|^2+2\|y\|^2, \end{equation} \tag{13} $$
$$ \begin{equation} 2\langle a,b\rangle \leqslant \zeta\|a\|^2+\zeta^{-1}\|b\|^2\quad \text{for all}\ \ a,b \in \mathbb{R}^d\ \ \text{and}\ \ \zeta > 0, \end{equation} \tag{14} $$
$$ \begin{equation} \biggl(1-\frac{p}{2}\biggr)^{-1} \leqslant (1+p)\quad \text{for all}\ \ p \in [0,1]. \end{equation} \tag{15} $$

Appendix B. Proof of Corollary 1

Let us consider the following two cases.

(1) $t_p \leqslant t < t_{p+1}-1$ for some $p \in \mathbb{N}$. In this case the matrices $\widehat{D}^{t}$ and $\widehat{D}^{t+1}$ are equal by construction. Hence the assertion of the corollary is obvious.

(2) $t=t_{p+1}-1$ for some $p \in \mathbb{N}$. Here the matrix which generates the norm changes (it is updated at the iteration $t_{p+1}$), and in this case the required estimate follows from Lemma 1.

Since there are no other cases, this completes the proof of Corollary 1.

Appendix C. Proofs for identical data

C.1. Auxiliary lemma

Lemma 2. Assume that for any $t$ such that $t_p \leqslant t < t_{p+1}$ the preconditioner is $\widehat{D}^{t_p}$. Under Assumptions 1, 2, and 4, for Algorithm 1 run with identical data, $\gamma \leqslant \alpha/(2L)$, and $|t_p-t_{p+1}| \leqslant H$ we have

$$ \begin{equation*} \mathsf{E}[V_{t}] \leqslant (H-1)\frac{\gamma^2\sigma^2}{\alpha}\,. \end{equation*} \notag $$

Proof. Let $t \in \mathbb{N}$ be such that $t_p \leqslant t \leqslant t_{p+1}-1$. Recall that for time $t$ such that $t_p \leqslant t < t_{p+1}$ we have
$$ \begin{equation*} x_{t+1}^m=x_t^m-\gamma (\widehat{D}^{t_p})^{-1}g_{t}^m\quad\text{and}\quad \widehat{x}_{t+1}=\widehat{x}_t-\gamma (\widehat{D}^{t_p})^{-1}g_t. \end{equation*} \notag $$
Hence for the expectation conditional on $x_t^1, x_t^2,\dots,x_t^M$ (we use the notation $\mathsf{E}[\,\cdot\,]=\mathsf{E}[\,\cdot\,\mid x_t^1,x_t^2,\dots,x_t^M]$ for brevity) we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|x_{t+1}^m-\widehat{x}_{t+1}\|^2_{\widehat{D}^{t_p}}\bigr]&= \mathsf{E}\bigl[\|x_t^m-\gamma(\widehat{D}^{t_p})^{-1}g_{t}^m-\widehat{x}_t+ \gamma(\widehat{D}^{t_p})^{-1}g_t\|^2_{\widehat{D}^{t_p}}\bigr] \\ &=\|x_t^m-\widehat{x}_t\|^2_{\widehat{D}^{t_p}} \\ &\qquad+\gamma^2\mathsf{E}\bigl[\|(\widehat{D}^{t_p})^{-1}\nabla f(x_t^m,z_m)- (\widehat{D}^{t_p})^{-1}g_t\|^2_{\widehat{D}^{t_p}}\bigr] \\ &\qquad-2\gamma\mathsf{E}\bigl[\langle x_t^m- \widehat{x}_t,(\widehat{D}^{t_p})^{-1}\nabla f(x_t^m, z_m)- (\widehat{D}^{t_p})^{-1}g_t\rangle_{\widehat{D}^{t_p}}\bigr] \\ &=\|x_t^m-\widehat{x}_t\|^2_{\widehat{D}^{t_p}} \\ &\qquad+\gamma^2\mathsf{E}\bigl[\|(\widehat{D}^{t_p})^{-1} \nabla f(x_t^m, z_m)-(\widehat{D}^{t_p})^{-1}g_t\|^2_{\widehat{D}^{t_p}}\bigr] \\ &\qquad-2\gamma\bigl\langle x_t^m-\widehat{x}_t, \mathsf{E}[\nabla f(x_t^m,z_m)-g_t]\bigr\rangle \\ &=\|x_t^m-\widehat{x}_t\|^2_{\widehat{D}^{t_p}} \\ &\qquad+\gamma^2\mathsf{E}\bigl[\|(\widehat{D}^{t_p})^{-1}\nabla f(x_t^m,z_m)- (\widehat{D}^{t_p})^{-1}g_t\|^2_{\widehat{D}^{t_p}}\bigr] \\ &\qquad-2\gamma\langle x_t^m-\widehat{x}_t,\nabla f(x_{t}^m)\rangle \\ &\qquad+2\gamma\langle x_t^m-\widehat{x}_t,\overline{g}_t\rangle. \end{aligned} \end{equation*} \notag $$
Averaging both sides over $M$ and noting that $V_t=\dfrac{1}{M}\displaystyle\sum_{m=1}^M \|x_{t}^m- \widehat{x}_t\|^2_{\widehat{D}^{t_p}}$, we have
$$ \begin{equation} \begin{aligned} \, \mathsf{E}[V_{t+1}]&=V_t+\frac{\gamma^2}{M}\sum_{m=1}^M \mathsf{E}\bigl[\|(\widehat{D}^{t_p})^{-1}\nabla f(x_t^m,z_m)- (\widehat{D}^{t_p})^{-1}g_t\|^2_{\widehat{D}^{t_p}}\bigr] \nonumber \\ &\qquad-\frac{2\gamma}{M} \sum_{m=1}^M\langle x_t^m- \widehat{x}_t, \nabla f(x_t^m)\rangle+2 \gamma \underbrace{\langle\widehat{x}_t-\widehat{x}_t,\overline{g}_t\rangle}_{=0} \nonumber \\ &=V_t+\frac{\gamma^2}{M}\sum_{m=1}^M\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad-\frac{2\gamma}{M}\sum_{m=1}^M \langle x_t^m-\widehat{x}_t,\nabla f(x_t^m)\rangle. \end{aligned} \end{equation} \tag{16} $$
Now observe that by expanding the square we have
$$ \begin{equation} \begin{aligned} \, \mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)-g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] &=\mathsf{E}\bigl[\|\nabla f(x_t^m, z_m)-\overline{g}_t+ \overline{g}_t-g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &=\mathsf{E}\bigl[\|\nabla f(x_t^m, z_m)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad+\mathsf{E}\bigl[\|\overline{g}_t- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad+2\mathsf{E}\bigl[\bigl\langle\nabla f(x_t^m, z_m)- \overline{g}_t,\overline{g}_t-g_t\bigr\rangle_{(\widehat{D}^{t_p})^{-1}}\bigr]. \end{aligned} \end{equation} \tag{17} $$
We decompose the first term in the last equality again by expanding the square and using that $\mathsf{E}[\nabla f(x_t^m,z_m)]=\nabla f(x_t^m)$:
$$ \begin{equation*} \begin{aligned} \, &\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad=\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\qquad+\|\nabla f(x_t^m)-\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad\qquad+2\mathsf{E}\bigl[\langle\nabla f(x_t^m, z_m)-\nabla f(x_t^m), \nabla f(x_t^m)-\overline{g}_t\rangle_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad=\mathsf{E}\bigl[\|\nabla f(x_t^m, z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\qquad+\|\nabla f(x_t^m)-\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad\qquad+2\bigl\langle\mathsf{E}[\nabla f(x_t^m,z_m)- \nabla f(x_t^m)],\nabla f(x_t^m)- \overline{g}_t\bigr\rangle_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad=\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\qquad+\|\nabla f(x_t^m)-\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad\qquad+2\underbrace{\bigl\langle\nabla f(x_t^m)-\nabla f(x_t^m), \nabla f(x_t^m)-\overline{g}_t\bigr\rangle_{(\widehat{D}^{t_p})^{-1}}}_{=0} \\ &\qquad=\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\qquad+\|\nabla f(x_t^m)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}}. \end{aligned} \end{equation*} \notag $$
Plugging this into (17) we can obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)-g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] &=\mathsf{E}\bigl[\|\nabla f(x_t^m, z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad+\|\nabla f(x_t^m)-\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad+\mathsf{E}\bigl[\|\overline{g}_t- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad+2\mathsf{E}\bigl[\langle\nabla f(x_t^m, z_m)-\overline{g}_t, \overline{g}_t-g_t\rangle_{(\widehat{D}^{t_p})^{-1}}\bigr]. \end{aligned} \end{equation*} \notag $$
Averaging over $M$ gives
$$ \begin{equation*} \begin{aligned} \, &\frac{1}{M}\sum_{m=1}^M \mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad=\frac{1}{M}\sum_{m=1}^M \mathsf{E}\bigl[\|\nabla f(x_t^m, z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\qquad+\frac{1}{M}\sum_{m=1}^M \|\nabla f(x_t^m)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad\qquad+\mathsf{E}\bigl[\|\overline{g}_t- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\qquad-2\mathsf{E}\bigl[\|\overline{g}_t- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr], \end{aligned} \end{equation*} \notag $$
where we have used the notation $g_t=\dfrac{1}{M}\displaystyle\sum_{m=1}^M \nabla f(x_t^m,z_m)$. Hence
$$ \begin{equation} \begin{aligned} \, &\frac{1}{M}\sum_{m=1}^M\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad\leqslant\frac{1}{M}\sum_{m=1}^M\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad\qquad+\frac{1}{M}\sum_{m=1}^M \|\nabla f(x_t^m)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}}. \end{aligned} \end{equation} \tag{18} $$
Now we can note that for the first term in (18) with Assumption 2, using the inequality $\|x\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant (1/\alpha)\|x\|^2$, we have
$$ \begin{equation} \mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \leqslant \frac{1}{\alpha}\,\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\|^2\bigr]\leqslant \frac{\sigma^2}{\alpha}\,. \end{equation} \tag{19} $$
For the second term in (18) we obtain
$$ \begin{equation*} \begin{aligned} \, \|\nabla f(x_t^m)-\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}}&= \|\nabla f(x_t^m)-\nabla f(\widehat{x}_t)\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad+\|\nabla f(\widehat{x}_t)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad+2\bigl\langle\nabla f(x_t^m)- \nabla f(\widehat{x}_t),\nabla f(\widehat{x}_t)- \overline{g}_t\bigr\rangle_{(\widehat{D}^{t_p})^{-1}}. \end{aligned} \end{equation*} \notag $$
Averaging over $M$ and using the notation $\overline{g}_t=\dfrac{1}{M}\displaystyle\sum_{m=1}^M \nabla f(x_t^m)$, we have
$$ \begin{equation*} \begin{aligned} \, \frac{1}{M}\sum_{m=1}^{M}\|\nabla f(x_t^m)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}}&= \frac{1}{M}\sum_{m=1}^M\|\nabla f(x_t^m)- \nabla f(\widehat{x}_t)\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad+\|\nabla f(\widehat{x}_t)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\qquad-2\|\nabla f(\widehat{x}_t)- \overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \\ &\leqslant \frac{1}{M}\sum_{m=1}^M\|\nabla f(x_t^m)- \nabla f(\widehat{x}_t)\|^2_{(\widehat{D}^{t_p})^{-1}}. \end{aligned} \end{equation*} \notag $$
Then, using the property of the matrix $\widehat{D}^{t_p}$ that $\|x\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant (1/\alpha)\|x\|^2$, the assumption about $L$-smoothness (see Assumption 1), and Jensen’s inequality, we obtain
$$ \begin{equation} \begin{aligned} \, &\frac{1}{M}\sum_{m=1}^M\|\nabla f(x_t^m)- \nabla f(\widehat{x}_t)\|^2_{(\widehat{D}^{t_p})^{-1}} \nonumber \\ &\qquad\,\,\leqslant \frac{1}{M\alpha}\sum_{m=1}^M \|\nabla f(x_{t}^m)- \nabla f(\widehat{x}_t)\|^2 \nonumber \\ &\qquad\overset{(11)}{\leqslant} \frac{1}{M\alpha}\sum_{m=1}^M 2L\bigl(f(\widehat x_t)-f(x_t^m)- \langle\widehat x_t-x_t^m,\nabla f(x_t^m)\rangle\bigr) \nonumber \\ &\qquad\,\,\leqslant \frac{2L}{M\alpha}\sum_{m=1}^M \langle x_t^m-\widehat x_t,\nabla f(x_t^m)\rangle. \end{aligned} \end{equation} \tag{20} $$
Plugging (20) and (19) into (18) gives
$$ \begin{equation} \frac{1}{M}\sum_{m=1}^M\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \leqslant \frac{\sigma^2}{\alpha}+\frac{2L}{M\alpha}\sum_{m=1}^M \langle x_t^m-\widehat x_t,\nabla f(x_t^m)\rangle. \end{equation} \tag{21} $$
Substituting (21) into (16), we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}[V_{t+1}] &\leqslant V_t+\frac{\gamma^2 \sigma^2}{\alpha}+ \frac{2L \gamma^2}{M\alpha}\sum_{m=1}^M \langle x_t^m-\widehat x_t,\nabla f(x_t^m)\rangle \\ &\qquad-\frac{2\gamma}{M}\sum_{m=1}^M \langle x_t^m-\widehat{x}_t, \nabla f(x_t^m)\rangle \\ &=V_t+\frac{\gamma^2\sigma^2}{\alpha}- \biggl(1-\frac{\gamma L}{\alpha}\biggr)\frac{2\gamma}{M} \sum_{m=1}^M\langle x_t^m-\widehat{x}_t,\nabla f(x_t^m)\rangle \\ &\leqslant V_t+\frac{\gamma^2\sigma^2}{\alpha}+ \biggl(1-\frac{\gamma L}{\alpha}\biggr)\frac{2\gamma}{M} \sum_{m=1}^M\biggl(f(\widehat{x}_t)-f(x_t^m)- \frac{\mu}{2}\,\|x_t^m-\widehat{x}_t\|^2\biggr) \\ &\leqslant V_t+\frac{\gamma^2\sigma^2}{\alpha}+ \biggl(1-\frac{\gamma L}{\alpha}\biggr)\frac{2\gamma}{M}\sum_{m=1}^M \biggl(f(\widehat{x}_t)-f(x_t^m)-\frac{\mu}{2\Gamma}\, \|x_t^m-\widehat{x}_t\|^2_{\widehat{D}^{t_p}}\biggr) \\ &\leqslant \biggl(1-\gamma\biggl(1-\frac{\gamma L}{\alpha}\biggr) \frac{\mu}{\Gamma}\biggr)V_t+\frac{\gamma^2\sigma^2}{\alpha}\,, \end{aligned} \end{equation*} \notag $$
where we have used $\mu$-strong convexity, the fact that $(1/\Gamma)\|x\|^2_{\widehat{D}^{t_p}} \leqslant \|x\|^2$, and Jensen’s inequality. Using that $\gamma \leqslant \alpha/(2L)$, we can conclude that
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}[V_{t+1}]&\leqslant \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)V_t+ \frac{\gamma^2\sigma^2}{\alpha} \\ &\leqslant V_t+\frac{\gamma^2\sigma^2}{\alpha}\,. \end{aligned} \end{equation*} \notag $$
We take the full expectation and iterate the above inequality:
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}[V_t] &\leqslant \mathsf{E}[V_{t_p}]+ \frac{\gamma^2\sigma^2}{\alpha}(t-t_p) \\ &\leqslant \mathsf{E}[V_{t_p}]+\frac{\gamma^2\sigma^2}{\alpha}(t_{p+1}-t_p-1) \\ &\leqslant \mathsf{E}[V_{t_p}]+\frac{\gamma^2\sigma^2}{\alpha}(H-1). \end{aligned} \end{equation*} \notag $$
It remains to notice that by the design of Algorithm 1 we have $V_{t_p}=0$. $\Box$

C.2. Other lemmas

Lemma 3. Let $(x_t^m)_{t \geqslant 0}$ be iterates generated by the run of Algorithm 1 with identical data. Suppose that $f$ satisfies Assumption 1, the matrix $\widehat{D}^{t_p}$ satisfies Assumption 4, and $\gamma$ is such that $\gamma \leqslant \alpha/(4L)$. Then, for any $t$ such that $t_p \leqslant t < t_{p+1}$ we have

$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] &\leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr) \mathsf{E}\|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{\gamma^2}{\alpha}\,\mathsf{E}\|g_t-\overline{g}_t\|^2 \\ &\qquad-\frac{\gamma}{2}\,\mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)]+ \frac{2\gamma L}{\alpha}\,V_t. \end{aligned} \end{equation*} \notag $$

Proof. From the update rule we obtain
$$ \begin{equation} \begin{aligned} \, \|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}&=\|\widehat{x}_t- \gamma(\widehat{D}^{t_p})^{-1}g_t-x_\ast\|^2_{\widehat{D}^{t_p}} \nonumber \\ &=\|\widehat{x}_t-\gamma (\widehat{D}^{t_p})^{-1}g_t -x_\ast- \gamma(\widehat{D}^{t_p})^{-1}\overline{g}_t+ \gamma(\widehat{D}^{t_p})^{-1} \overline{g}_t\|^2_{\widehat{D}^{t_p}} \nonumber \\ &=\|\widehat{x}_t-x_\ast-\gamma (\widehat{D}^{t_p})^{-1} \overline{g}_t\|^2_{\widehat{D}^{t_p}}+ \gamma^2\|g_t-\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \nonumber \\ &\qquad+2\gamma \bigl\langle\widehat{x}_t-x_\ast- \gamma(\widehat{D}^{t_p})^{-1} \overline{g}_t,\overline{g}_t-g_t\bigr\rangle. \end{aligned} \end{equation} \tag{22} $$
Observe that
$$ \begin{equation} \begin{aligned} \, \|\widehat{x}_t-x_\ast-\gamma(\widehat{D}^{t_p})^{-1} \overline{g}_t\|^2_{\widehat{D}^{t_p}}&= \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma^2\|\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \nonumber \\ &\qquad-2\gamma\langle\widehat{x}_t-x_\ast, \overline{g}_t\rangle \nonumber \\ &=\|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma^2\|\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \nonumber \\ &\qquad-\frac{2\gamma}{M}\sum_{m=1}^M \langle\widehat{x}_t-x_\ast,\nabla f(x_t^m)\rangle \nonumber \\ &\leqslant \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{\gamma^2}{M} \sum_{m=1}^M \|\nabla f(x_t^m)\|^2_{(\widehat{D}^{t_p})^{-1}} \nonumber \\ &\qquad-\frac{2\gamma}{M} \sum_{m=1}^M \langle\widehat{x}_t-x_t^m+x_t^m -x_\ast, \nabla f(x_t^m)\rangle \nonumber \\ &=\|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}} \nonumber \\ &\qquad+\frac{\gamma^2}{M}\sum_{m=1}^M\|\nabla f(x_t^m)- \nabla f(x_\ast)\|^2_{(\widehat{D}^{t_p})^{-1}} \nonumber \\ &\qquad-\frac{2\gamma}{M} \sum_{m=1}^M \langle x_t^m -x_\ast, \nabla f(x_t^m)\rangle \nonumber \\ &\qquad-\frac{2\gamma}{M}\sum_{m=1}^M \langle\widehat{x}_t-x_t^m,\nabla f(x_t^m)\rangle, \end{aligned} \end{equation} \tag{23} $$
where we have used the fact that
$$ \begin{equation*} \biggl\|\,\sum_{m=1}^M a_m\biggr\|^2 \leqslant M \sum_{m=1}^M \|a_m\|^2. \end{equation*} \notag $$
In view of the property of $\widehat{D}^{t_p}$ that $\|x\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant (1/\alpha)\|x\|^2$, we obtain
$$ \begin{equation} \|\nabla f(x_t^m)-\nabla f(x_\ast)\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant \frac{1}{\alpha}\,\|\nabla f(x_t^m)-\nabla f(x_\ast)\|^2. \end{equation} \tag{24} $$
By $L$-smoothness (Assumption 1, see also (11)),
$$ \begin{equation} \|\nabla f(x_t^m)-\nabla f(x_\ast)\|^2 \leqslant 2L(f(x_t^m)-f_\ast)\,, \end{equation} \tag{25} $$
and by $\mu$-strong convexity,
$$ \begin{equation} -\langle x_t^m-x_\ast, \nabla f(x_t^m)\rangle \leqslant -(f(x_t^m)- f_\ast)-\frac{\mu}{2}\,\|x_t^m-x_\ast\|^2. \end{equation} \tag{26} $$
To estimate the last term in (23) we use the inequality (14), that is, $2\langle a,b\rangle \leqslant \zeta\|a\|^2+\zeta^{-1}\|b\|^2$, with $\zeta=2L > 0$. This gives
$$ \begin{equation} \begin{aligned} \, -2\langle\widehat{x}_t-x_t^m,\nabla f(x_t^m)\rangle &\leqslant 2L\|\widehat{x}_t-x_t^m\|^2+\frac{1}{2L}\|\nabla f(x_t^m)\|^2 \nonumber \\ &= 2L\|\widehat{x}_t-x_t^m\|^2+ \frac{1}{2L}\|\nabla f(x_t^m)-\nabla f(x_\ast)\|^2 \nonumber \\ &\leqslant 2L\|\widehat{x}_t-x_t^m\|^2+(f(x_t^m)-f_\ast), \end{aligned} \end{equation} \tag{27} $$
where we used (25) in the last inequality. By applying (24)–(27) to (23), we obtain
$$ \begin{equation} \begin{aligned} \, &\|\widehat{x}_t-x_\ast- \gamma(\widehat{D}^{t_p})^{-1}\overline{g}_t\|^2_{\widehat{D}^{t_p}} \nonumber \\ &\qquad\leqslant \|\widehat{x_t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{2\gamma L}{M}\sum_{m=1}^M\|\widehat{x}_t-x_t^m\|^2 \nonumber \\ &\qquad\qquad+\frac{2\gamma}{M}\sum_{m=1}^M \biggl(\biggl(\frac{\gamma L}{\alpha}-\frac{1}{2}\biggr)\bigl(f(x_t^m)- f_\ast\bigr)-\frac{\mu}{2}\,\|x_t^m-x_\ast\|^2\biggr). \end{aligned} \end{equation} \tag{28} $$
For $\gamma \leqslant \alpha/(4L)$ we have $\gamma L/\alpha-1/2 \leqslant -1/4$. By applying Jensen’s inequality to the convex function $a\bigl(f(x)-f_\ast\bigr)+b\|x-x_\ast\|^2$ for $a=1/2-\gamma L/\alpha \geqslant 0$ and $b=\mu/2 \geqslant 0$, we have
$$ \begin{equation} -\frac{1}{M}\sum_{m=1}^M\bigl(a(f(x_t^m)-f_\ast)+b\|x_t^m-x_\ast\|^2\bigr) \leqslant -\bigl(a (f(\widehat{x}_t)-f_\ast)\bigr)- b\|\widehat{x}_t-x_\ast\|^2, \end{equation} \tag{29} $$
hence we can continue with (28) by substituting in (29) and using the inequalities $(1/\alpha)\|x\|^2_{\widehat{D}^{t_p}} \geqslant \|x\|^2 \geqslant (1/\Gamma)\|x\|^2_{\widehat{D}^{t_p}}$:
$$ \begin{equation} \begin{aligned} \, \|\widehat{x}_t-x_\ast-\gamma(\widehat{D}^{t_p})^{-1} \overline{g}_t\|^2_{\widehat{D}^{t_p}} &\leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr)\|\widehat{x}_t- x_\ast\|^2_{\widehat{D}^{t_p}}-\frac{\gamma}{2}(f(\widehat{x}_t)-f_\ast) \nonumber \\ &\qquad+\frac{2\gamma L}{M\alpha}\sum_{m=1}^M \|\widehat{x}_t-x_t^m\|^2_{\widehat{D}^{t_p}}. \end{aligned} \end{equation} \tag{30} $$
Plugging (30) into (22) and taking the full expectation, we obtain
$$ \begin{equation} \begin{aligned} \, \mathsf{E}\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}} &\leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr) \mathsf{E} \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma^2\|g_t-\overline{g}_t\|^2_{(\widehat{D}^{t_p})^{-1}} \nonumber \\ &\qquad-\frac{\gamma}{2}(f(\widehat{x}_t)-f_\ast)+ \frac{2\gamma L}{M\alpha}\sum_{m=1}^M \|\widehat{x}_t-x_t^m\|^2_{\widehat{D}^{t_p}} \nonumber \\ &\leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr) \mathsf{E} \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+\frac{\gamma^2}{\alpha} \mathsf{E}\|g_t-\overline{g}_t\|^2 \nonumber \\ &\qquad-\frac{\gamma}{2}\,\mathsf{E}[D_{f} (\widehat{x}_t,x_\ast)]+ \frac{2\gamma L}{M\alpha}\sum_{m=1}^M \mathsf{E} \|\widehat{x}_t-x_t^m\|^2_{\widehat{D}^{t_p}}. \end{aligned} \end{equation} \tag{31} $$
Using the definition of $V_t$, we obtain the claimed estimate. $\Box$

Lemma 4. Suppose that Assumption 2 holds. If Algorithm 1 runs with identical data, then we have

$$ \begin{equation*} \mathsf{E}\bigl[\|g_t-\overline{g}_t\|^2\bigr] \leqslant \frac{\sigma^2}{M}\,. \end{equation*} \notag $$

Proof. Since the stochastic gradients $\nabla f(x_t^m, z_m)$, $m=1,\dots,M$, are independent, by Assumption 2 we have
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|g_t-\overline{g}_t\|^2\bigr]&= \mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^M\bigl(\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\bigr)\biggr\|^2\biggr] \\ &=\frac{1}{M^2}\,\mathsf{E}\biggl[\biggl\|\,\sum_{m=1}^{M}\bigl(\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\bigr)\biggr\|^2\biggr] \\ &=\frac{1}{M^2}\sum_{m=1}^{M}\mathsf{E}\bigl[\|\nabla f(x_t^m,z_m)- \nabla f(x_t^m)\|^2\bigr] \leqslant \frac{\sigma^2}{M}\,.\quad\Box \end{aligned} \end{equation*} \notag $$
Lemma 4 is proved.
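The third equality in the display above holds because the cross terms vanish: for $m \neq m'$, by the independence of $z_m$ and $z_{m'}$ and the unbiasedness of the stochastic gradients (conditionally on the current iterates),
$$ \begin{equation*} \mathsf{E}\bigl[\bigl\langle\nabla f(x_t^m,z_m)-\nabla f(x_t^m),\, \nabla f(x_t^{m'},z_{m'})-\nabla f(x_t^{m'})\bigr\rangle\bigr]=0. \end{equation*} \notag $$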

C.3. Proof of Theorem 1

Combining Lemmas 3 and 4, we have

$$ \begin{equation} \begin{aligned} \, \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] &\leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr) \mathsf{E} \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{\gamma^2\sigma^2}{\alpha M} \nonumber \\ &\qquad-\frac{\gamma}{2}\,\mathsf{E}[D_{f} (\widehat{x}_t, x_\ast)]+ \frac{2\gamma L}{\alpha}\,V_t. \end{aligned} \end{equation} \tag{32} $$
Using Lemma 2, we can bound the term $\mathsf{E}[V_t]$ in (32) from above:
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] &\leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr)\mathsf{E} \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{\gamma^2\sigma^2}{\alpha M} \\ &\qquad-\frac{\gamma}{2}\,\mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)]+ \frac{2\gamma^3 L}{\alpha^2}(H-1)\sigma^2. \end{aligned} \end{equation*} \notag $$
Applying Corollary 1 and using the inequality $1+\gamma\mu/(2\Gamma) \leqslant 2$, we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t+1}}\bigr] &\leqslant \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr) \mathsf{E} \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t}}+ \frac{2\gamma^2\sigma^2}{\alpha M} \\ &\qquad-\frac{\gamma}{2}\,\mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)]+ \frac{4\gamma^3L}{\alpha^2}(H-1)\sigma^2. \end{aligned} \end{equation*} \notag $$

Taking the full expectation and summing with the weights $w_t$, we have

$$ \begin{equation*} \begin{aligned} \, \sum_{t=0}^{T-1}w_t\mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_*\|_{\widehat{D}^{t+1}}^2\bigr]&\leqslant \sum_{t=0}^{T-1}w_t\biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr) \mathsf{E}\bigl[\|\widehat{x}_t-x_*\|_{\widehat{D}^t}^2\bigr] \\ &\qquad-\sum_{t=0}^{T-1}\frac{\gamma w_t}{2}\, \mathsf{E}[D_f(\widehat{x}_t,x_*)] \\ &\qquad+\sum_{t=0}^{T-1}w_t\biggl(\frac{2\gamma^2\sigma^2}{\alpha M}+ \frac{4\gamma^3L(H-1)\sigma^2}{\alpha^2}\biggr). \end{aligned} \end{equation*} \notag $$
Rearranging terms and using that $w_t=(1-\gamma\mu/(2\Gamma))^{-(t+1)}$, we obtain
$$ \begin{equation*} \begin{aligned} \, \sum_{t=0}^{T-1}\frac{\gamma w_t}{2}\,\mathsf{E}[D_f(\widehat{x}_t,x_*)] &\leqslant\sum_{t=0}^{T-1}\bigl(w_{t-1}\mathsf{E}\bigl[\|\widehat{x}_t- x_*\|_{\widehat{D}^t}^2\bigr]-w_t\mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_*\|_{\widehat{D}^{t+1}}^2\bigr]\bigr) \\ &\qquad+\sum_{t=0}^{T-1}w_t\biggl(\frac{2\gamma^2\sigma^2}{\alpha M}+ \frac{4\gamma^3L(H-1)\sigma^2}{\alpha^2}\biggr) \\ &\leqslant\|x_0-x_*\|_{\widehat{D}^0}^2+ \sum_{t=0}^{T-1}w_t\biggl(\frac{2\gamma^2\sigma^2}{\alpha M}+ \frac{4\gamma^3L(H-1)\sigma^2}{\alpha^2}\biggr). \end{aligned} \end{equation*} \notag $$
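In the last inequality the sum telescopes: since $w_t(1-\gamma\mu/(2\Gamma))=w_{t-1}$, $w_{-1}=1$, and $\widehat{x}_0=x_0$, we have
$$ \begin{equation*} \sum_{t=0}^{T-1}\bigl(w_{t-1}\mathsf{E}\bigl[\|\widehat{x}_t- x_*\|_{\widehat{D}^t}^2\bigr]-w_t\mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_*\|_{\widehat{D}^{t+1}}^2\bigr]\bigr)= \|x_0-x_*\|_{\widehat{D}^0}^2- w_{T-1}\mathsf{E}\bigl[\|\widehat{x}_{T}-x_*\|_{\widehat{D}^{T}}^2\bigr] \leqslant \|x_0-x_*\|_{\widehat{D}^0}^2. \end{equation*} \notag $$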
Let us define
$$ \begin{equation*} \overline{x}_{T-1}\overset{\operatorname{def}}{=} \frac{1}{W_{T-1}}\sum_{t=0}^{T-1}w_t\widehat{x}_t,\quad\text{where}\ \ W_{T-1}\overset{\operatorname{def}}{=}\sum_{t=0}^{T-1}w_t. \end{equation*} \notag $$
Therefore, dividing both sides of the above inequality by $\gamma W_{T-1}/2$, we conclude that
$$ \begin{equation*} \frac{1}{W_{T-1}}\sum_{t=0}^{T-1}w_t\mathsf{E}[D_f(\widehat{x}_t,x_*)]\leqslant \frac{2\|x_0-x_*\|_{\widehat{D}^0}^2}{\gamma W_{T-1}}+ \frac{4\gamma\sigma^2}{\alpha M}+\frac{8\gamma^2L(H-1)\sigma^2}{\alpha^2} \end{equation*} \notag $$
and
$$ \begin{equation*} \mathsf{E}[f(\overline{x}_{T-1})-f(x_*)]\leqslant \frac{1}{W_{T-1}}\sum_{t=0}^{T-1}w_t\mathsf{E}[D_f(\widehat{x}_t,x_*)], \end{equation*} \notag $$
where we have used the definition of the Bregman divergence and Jensen’s inequality. To finish the proof, we use the bound $W_{T-1}\geqslant w_{T-1}=(1-\gamma\mu/(2\Gamma))^{-T}$ and the fact that $(1/\Gamma)\|x\|_{\widehat{D}^{t_p}}^2\leqslant\|x\|^2$. $\Box$
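For clarity, the two bounds invoked in the last step read
$$ \begin{equation*} \frac{1}{W_{T-1}}\leqslant\frac{1}{w_{T-1}}= \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)^{T}\quad\text{and}\quad \|x_0-x_*\|_{\widehat{D}^0}^2\leqslant\Gamma\|x_0-x_*\|^2, \end{equation*} \notag $$
and substituting them into the inequalities above yields the final estimate.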

Appendix D. Proofs for heterogeneous data

D.1. Auxiliary lemmas

Lemma 5. Suppose that Assumptions 1, 3, and 4 hold for $\mu \geqslant 0$. If Algorithm 1 runs for heterogeneous data with $M \geqslant 2$, then for any $t$ such that $t_p \leqslant t < t_{p+1}$ we have

$$ \begin{equation} \mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^M \nabla f_m(x_t^m,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \leqslant \frac{2L^2}{\alpha^2}\,V_t+\frac{8L}{\alpha} D_{f}(\widehat{x}_t,x_\ast)+ \frac{4\sigma_{\rm dif}^2}{M\alpha}. \end{equation} \tag{33} $$

Proof. Starting from the left-hand side, adding and subtracting $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}\nabla f_m(\widehat{x}_t,z_m)$, and using the inequality $\|a+b\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant 2\|a\|^2_{(\widehat{D}^{t_p})^{-1}}+2\|b\|^2_{(\widehat{D}^{t_p})^{-1}}$, we obtain
$$ \begin{equation} \begin{aligned} \, &\mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^M\nabla f_m(x_t^m,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\qquad\leqslant 2\mathsf{E}\biggl[\biggl\|\frac{1}{M} \sum_{m=1}^M \nabla f_m(x_t^m,z_m)- \frac{1}{M}\sum_{m=1}^{M}\nabla f_m(\widehat{x}_t,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\qquad\qquad+2\mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^{M} \nabla f_m(\widehat{x}_t,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr]. \end{aligned} \end{equation} \tag{34} $$
To bound the first term in (34), we use the $L$-smoothness of $f_m(\,\cdot\,,z_m)$ and the fact that $\|x\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant (1/\alpha)\|x\|^2$:
$$ \begin{equation} \begin{aligned} \, &2 \mathsf{E}\biggl[\biggr\|\frac{1}{M}\sum_{m=1}^{M} \bigl(\nabla f_m(x_t^m,z_m)-\nabla f_m (\widehat{x}_t,z_m)\bigr)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\qquad\leqslant \frac{2}{M}\sum_{m=1}^{M}\mathsf{E} \bigl[\|\nabla f_m(x_t^m,z_m)- \nabla f_m(\widehat{x}_t,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad\leqslant \frac{2}{M\alpha}\sum_{m=1}^{M} \mathsf{E}\bigl[\|\nabla f_m (x_t^m, z_m)- \nabla f_m (\widehat{x}_t, z_m)\|^2\bigr] \nonumber \\ &\qquad\leqslant \frac{2 L^2}{M\alpha^2}\sum_{m=1}^{M} \|x_t^m-\widehat{x}_t\|^2_{\widehat{D}^{t_p}}, \end{aligned} \end{equation} \tag{35} $$
where we have used Jensen’s inequality and the convexity of the function $\|x\|^2$. For the second term in (34) we have
$$ \begin{equation} \begin{aligned} \, &\mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^{M}\nabla f_m(\widehat{x}_t,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\qquad\overset{(6)}{=}\mathsf{E}\biggl[\biggl\|\frac{1}{M} \sum_{m=1}^{M}\bigl(\nabla f_m(\widehat{x}_t,z_m)- \nabla f_m(\widehat{x}_t)\bigr)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\qquad\quad\,\,+\biggl\|\frac{1}{M}\sum_{m=1}^{M}\nabla f_m(\widehat{x}_t)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}. \end{aligned} \end{equation} \tag{36} $$
For the first term on the right-hand side of (36), by the independence of the $z_m$ and the fact that $\|x\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant(1/\alpha)\|x\|^2$ we obtain
$$ \begin{equation*} \begin{aligned} \, &\mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^{M}\bigl(\nabla f_m(\widehat{x}_t,z_m)- \nabla f_m(\widehat{x}_t)\bigr)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \\ &\qquad=\frac{1}{M^2}\sum_{m=1}^{M}\mathsf{E} \bigl[\|\nabla f_m(\widehat{x}_t,z_m)- \nabla f_m(\widehat{x}_t)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\!\!\qquad\overset{(8)}{\leqslant} \frac{1}{M^2}\sum_{m=1}^{M}\mathsf{E} \bigl[\|\nabla f_m(\widehat{x}_t,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\leqslant \frac{2}{M^2}\sum_{m=1}^{M} \mathsf{E}\bigl[\|\nabla f_m (\widehat{x}_t,z_m)- \nabla f_m(x_\ast,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\qquad+\frac{2}{M^2} \sum_{m=1}^{M} \mathsf{E}\bigl[\|\nabla f_m(x_\ast,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad\leqslant \frac{2}{M^2\alpha}\sum_{m=1}^{M} \mathsf{E}\bigl[\|\nabla f_m(\widehat{x}_t,z_m)- \nabla f_m (x_\ast, z_m)\|^2\bigr] \\ &\qquad\qquad+\frac{2}{M^2\alpha}\sum_{m=1}^{M} \mathsf{E}\bigl[\|\nabla f_m (x_\ast, z_m)\|^2\bigr] \\ &\!\!\qquad\overset{(11)}{\leqslant} \frac{4L}{M^2\alpha}\sum_{m=1}^{M} D_{f_m}(\widehat{x}_t,x_\ast)+\frac{2\sigma_{\rm dif}^2}{M\alpha} \\ &\qquad=\frac{4L}{M\alpha}D_{f}(\widehat{x}_t,x_\ast)+ \frac{2\sigma_{\rm dif}^2}{M\alpha}\,. \end{aligned} \end{equation*} \notag $$
Substituting this into (36), one can obtain
$$ \begin{equation} \begin{aligned} \, &\mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^{M}\nabla f_m(\widehat{x}_t,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\qquad\leqslant \frac{4L}{M\alpha} D_{f}(\widehat{x}_t,x_\ast)+ \frac{2\sigma_{\rm dif}^2}{M\alpha}+ \mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^{M} \nabla f_m(\widehat{x}_t)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\qquad=\frac{4L}{M\alpha}D_{f}(\widehat{x}_t,x_\ast)+ \frac{2\sigma_{\rm dif}^2}{M\alpha}+ \|\nabla f(\widehat{x}_t)\|^2_{(\widehat{D}^{t_p})^{-1}}. \end{aligned} \end{equation} \tag{37} $$
Now notice that
$$ \begin{equation*} \begin{aligned} \, \|\nabla f(\widehat{x}_t)\|^2_{(\widehat{D}^{t_p})^{-1}} &\leqslant \frac{1}{\alpha}\|\nabla f(\widehat{x}_t)\|^2 \\ &=\frac{1}{\alpha}\|\nabla f(\widehat{x}_t)-\nabla f(x_\ast)\|^2 \leqslant \frac{2L}{\alpha}D_{f}(\widehat{x}_t,x_\ast). \end{aligned} \end{equation*} \notag $$
Using this in (37) gives
$$ \begin{equation*} \mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^{M}\nabla f_m(\widehat{x}_t,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \leqslant \frac{2L}{\alpha}\biggl(1+\frac{2}{M}\biggr)D_{f}(\widehat{x}_t,x_\ast)+ \frac{2\sigma_{\rm dif}^2}{M\alpha}\,. \end{equation*} \notag $$
Since $M \geqslant 2$, we have $1+2/M \leqslant 2$, and therefore
$$ \begin{equation} \mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^{M} \nabla f_m(\widehat{x}_t,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \leqslant \frac{4L}{\alpha}D_{f}(\widehat{x}_t,x_\ast)+ \frac{2\sigma_{\rm dif}^2}{M\alpha}\,. \end{equation} \tag{38} $$
Combining (35) and (38) in (34), we have
$$ \begin{equation*} \mathsf{E}\biggl[\biggl\|\frac{1}{M}\sum_{m=1}^M \nabla f_m(x_t^m,z_m)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \leqslant \frac{2L^2}{\alpha^2}\,V_t+\frac{8L}{\alpha}D_{f}(\widehat{x}_t,x_\ast)+ \frac{4\sigma_{\rm dif}^2}{M\alpha}\,. \quad\Box \end{equation*} \notag $$

Lemma 6. Suppose that Assumptions 1 and 4 hold for $\mu \geqslant 0$. Then, if Algorithm 1 runs with heterogeneous data, for any $t$ such that $t_p \leqslant t < t_{p+1}$ we have

$$ \begin{equation*} -\frac{2}{M}\sum_{m=1}^{M}\langle\widehat{x}_t-x_\ast,\nabla f_m(x_t^m)\rangle \leqslant -2D_{f}(\widehat{x}_t,x_\ast)-\frac{\mu}{\Gamma}\, \|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+\frac{L}{\alpha}\,V_t. \end{equation*} \notag $$

Proof. Starting with the left-hand side, we have
$$ \begin{equation} \begin{aligned} \, -2\langle\widehat{x}_t-x_\ast,\nabla f_m(x_t^m)\rangle&= -2\langle x_t^m-x_\ast,\nabla f_m (x_t^m)\rangle \nonumber \\ &\qquad-2\langle\widehat{x}_t-x_t^m,\nabla f_m (x_t^m)\rangle. \end{aligned} \end{equation} \tag{39} $$
The first term in (39) is bounded by strong convexity:
$$ \begin{equation} -\langle x_t^m-x_\ast,\nabla f_m (x_t^m)\rangle \leqslant f_m (x_\ast)-f_m (x_t^m)-\frac{\mu}{2}\,\|x_t^m-x_\ast\|^2. \end{equation} \tag{40} $$
For the second term we use $L$-smoothness:
$$ \begin{equation} -\langle\widehat{x}_t-x_t^m,\nabla f_m (x_t^m)\rangle \leqslant f_m(x_t^m)-f_m(\widehat{x}_t)+\frac{L}{2}\,\|x_t^m-\widehat{x}_t\|^2. \end{equation} \tag{41} $$
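Inequality (41) is simply the quadratic upper bound provided by the $L$-smoothness of $f_m$ (Assumption 1), written for the pair $(\widehat{x}_t,x_t^m)$:
$$ \begin{equation*} f_m(\widehat{x}_t)\leqslant f_m(x_t^m)+ \langle\nabla f_m(x_t^m),\widehat{x}_t-x_t^m\rangle+ \frac{L}{2}\,\|\widehat{x}_t-x_t^m\|^2; \end{equation*} \notag $$
moving the inner product to the left-hand side gives (41).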
Combining (41) and (40) in (39), we obtain
$$ \begin{equation*} \begin{aligned} \, -2\langle\widehat{x}_t-x_\ast,\nabla f_m (x_t^m) \rangle &\leqslant 2\biggl(f_m (x_\ast)-f_m (x_t^m)- \frac{\mu}{2}\,\|x_t^m-x_\ast\|^2\biggr) \\ &\qquad+2\biggl(f_m (x_t^m)-f_m(\widehat{x}_t)+ \frac{L}{2}\,\|x_t^m-\widehat{x}_t\|^2\biggr) \\ &=2\biggl(f_m (x_\ast)-f_m (\widehat{x}_t)-\frac{\mu}{2}\, \|x_t^m-x_\ast\|^2+\frac{L}{2}\,\|x_t^m-\widehat{x}_t\|^2\biggr). \end{aligned} \end{equation*} \notag $$
Averaging over $m \in [M]$ gives
$$ \begin{equation*} \begin{aligned} \, -\frac{2}{M}\sum_{m=1}^{M}\langle\widehat{x}_t- x_\ast,\nabla f_m(x_t^m)\rangle &\leqslant -2\bigl(f(\widehat{x}_t)-f(x_\ast)\bigr) \\ &\qquad-\frac{\mu}{M}\sum_{m=1}^{M}\|x_t^m-x_\ast\|^2+ \frac{L}{M}\sum_{m=1}^{M}\|x_t^m-\widehat{x}_t\|^2. \end{aligned} \end{equation*} \notag $$
Noting that $f(\widehat{x}_t)-f(x_\ast)$ in the first term is the Bregman divergence $D_{f}(\widehat{x}_t,x_\ast)$ and using Jensen’s inequality $-\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}\|x_t^m-x_\ast\|^2 \leqslant -\|\widehat{x}_t-x_\ast\|^2$, we have
$$ \begin{equation*} -\frac{2}{M}\sum_{m=1}^{M}\langle\widehat{x}_t-x_\ast, \nabla f_m(x_t^m)\rangle \leqslant -2D_{f}(\widehat{x}_t,x_\ast)- \mu\|\widehat{x}_t-x_\ast\|^2+ \frac{L}{M}\sum_{m=1}^{M}\|x_t^m-\widehat{x}_t\|^2. \end{equation*} \notag $$
In view of the fact that $(1/\Gamma)\|x\|^2_{\widehat{D}^{t_p}} \leqslant \|x\|^2 \leqslant (1/\alpha)\|x\|^2_{\widehat{D}^{t_p}}$ and the definition of $V_t$, one can obtain
$$ \begin{equation*} -\frac{2}{M}\sum_{m=1}^{M}\langle\widehat{x}_t-x_\ast, \nabla f_m (x_t^m)\rangle \leqslant -2D_{f}(\widehat{x}_t,x_\ast)- \frac{\mu}{\Gamma}\,\|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{L}{\alpha}\,V_t.\quad\Box \end{equation*} \notag $$

Lemma 7. Suppose that Assumptions 1, 3, and 4 hold. If Algorithm 1 runs for heterogeneous data with $\sup_{p}|t_p-t_{p+1}| \leqslant H$, then for any $t$ such that $t_p \leqslant t \leqslant t_{p+1}-1$, for $\gamma \leqslant \dfrac{\alpha}{3L(H-1)}$ and $w_j \overset{{\rm def}}{=} \biggl(1-\dfrac{\gamma\mu}{2\Gamma}\biggr)^{-(j+1)}$ we have

$$ \begin{equation*} \begin{aligned} \, \sum_{j=t_p}^t w_j\mathsf{E}[V_j] &\leqslant \frac{36\gamma^2(H-1)^2 L}{\alpha}\sum_{j=t_p}^t w_j \mathsf{E}[D_{f} (\widehat{x}_j, x_\ast)] \\ &\qquad+\frac{18\gamma^2(H-1)^2 \sigma_{\rm dif}^2}{\alpha} \sum_{j=t_p}^t w_j. \end{aligned} \end{equation*} \notag $$

Proof. Let $G_k \overset{{\rm def}}{=} \dfrac{1}{M}\displaystyle\sum_{m=1}^M(\widehat{D}^{t_p})^{-1} \nabla f_m(x_k^m,z_m)$. From the definition of $V_t$ we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}[V_t]&=\frac{1}{M}\!\sum_{m=1}^{M} \mathsf{E}\bigl[\|x_t^m-\widehat{x}_t\|^2_{\widehat{D}^{t_p}}\bigr] \\ &=\frac{1}{M}\!\sum_{m=1}^{M}\mathsf{E}\biggl[\biggl\|\biggl(x_{t_p}^{m}- \gamma\sum_{k=t_p}^{t-1}(\widehat{D}^{t_p})^{-1}\nabla f_m(x_k^m,z_m)\biggr)- \biggl(x_{t_p}\!-\gamma\sum_{k=t_p}^{t-1} G_k\biggr)\!\biggr\|^2_{\widehat{D}^{t_p}}\biggr]. \end{aligned} \end{equation*} \notag $$
Using that $x_{t_p}=x_{t_p}^{m}$ for all $m \in [M]$, we have
$$ \begin{equation} \begin{aligned} \, \mathsf{E}[V_t]&=\frac{\gamma^2}{M}\sum_{m=1}^{M}\mathsf{E} \biggl[\biggl\|\,\sum_{k=t_p}^{t-1}\bigl(\nabla f_m(x_k^m,z_m)- g_k\bigr)\biggr\|^2_{(\widehat{D}^{t_p})^{-1}}\biggr] \nonumber \\ &\!\!\overset{(10)}{\leqslant} \frac{\gamma^2(t-t_p)}{M} \sum_{m=1}^{M}\,\sum_{k=t_p}^{t-1}\mathsf{E}\bigl[\|\nabla f_m(x_k^m,z_m)- g_k\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\leqslant \frac{\gamma^2(t-t_p)}{M}\sum_{m=1}^{M}\,\sum_{k=t_p}^{t-1} \mathsf{E}\bigl[\|\nabla f_m(x_k^m, z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\big] \nonumber \\ &\leqslant \frac{\gamma^2(H-1)}{M}\sum_{m=1}^{M}\,\sum_{k=t_p}^{t-1} \mathsf{E}\bigl[\|\nabla f_m(x_k^m,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr], \end{aligned} \end{equation} \tag{42} $$
where in the third line we used the definition of $g_k$ in the following way:
$$ \begin{equation*} \frac{1}{M}\sum_{m=1}^M\mathsf{E}\bigl[\|\nabla f_m(x_k^m,z_m)- g_k\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr]\leqslant \frac{1}{M}\sum_{m=1}^M \mathsf{E}\bigl[\|\nabla f_m(x_k^m,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr], \end{equation*} \notag $$
and in the fourth line we used that $t-t_p \leqslant t_{p+1}-t_p-1 \leqslant H-1$. Decomposing the gradient norm via $\|a+b+c\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant 3\,\|a\|^2_{(\widehat{D}^{t_p})^{-1}}+3\,\|b\|^2_{(\widehat{D}^{t_p})^{-1}}+3\,\|c\|^2_{(\widehat{D}^{t_p})^{-1}}$, one can obtain
$$ \begin{equation} \begin{aligned} \, \mathsf{E}\bigl[\|\nabla f_m(x_k^m,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] &\leqslant 3\mathsf{E}\bigl[\|\nabla f_m(x_k^m,z_m)- \nabla f_m(\widehat{x}_k,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad+3\mathsf{E}\bigl[\|\nabla f_m(\widehat{x}_k,z_m)- \nabla f_m(x_\ast,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad+3\mathsf{E} \bigl[\|\nabla f_m(x_\ast,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr]. \end{aligned} \end{equation} \tag{43} $$
For the first term in (43) we have
$$ \begin{equation} \begin{aligned} \, &\mathsf{E}\bigl[\|\nabla f_m(x_k^m, z_m)- \nabla f_m(\widehat{x}_k,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad\leqslant \frac{1}{\alpha}\,\mathsf{E}\bigl[\|\nabla f_m(x_k^m,z_m)- \nabla f_m(\widehat{x}_k,z_m)\|^2\bigr] \nonumber \\ &\qquad\leqslant \frac{L^2}{\alpha}\, \mathsf{E}\bigl[\|x_k^m-\widehat{x}_k\|^2\bigr]. \end{aligned} \end{equation} \tag{44} $$
The second term can be bounded using smoothness and the property of $(\widehat{D}^{t_p})^{-1}$ that $\|x\|^2_{(\widehat{D}^{t_p})^{-1}} \leqslant (1/\alpha)\|x\|^2$:
$$ \begin{equation} \begin{aligned} \, \mathsf{E}\bigl[\|\nabla f_m (\widehat{x}_k, z_m)- \nabla f_m(x_\ast,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] &\leqslant \frac{1}{\alpha}\,\mathsf{E}\bigl[\|\nabla f_m(\widehat{x}_k,z_m)- \nabla f_m(x_\ast,z_m)\|^2\bigr] \nonumber \\ &\!\!\overset{(11)}{\leqslant} \frac{2L}{\alpha}\, \mathsf{E}[D_{f_m}(\widehat{x}_k,x_\ast)]. \end{aligned} \end{equation} \tag{45} $$
Substituting (45) and (44) into (43), averaging over $m$, using that $\alpha\|x\|^2 \leqslant \|x\|^2_{\widehat{D}^{t_p}}$, and recalling the definition of $\sigma_{\rm dif}^2$, we have
$$ \begin{equation} \begin{aligned} \, &\frac{1}{M}\sum_{m=1}^M\mathsf{E} \bigl[\|\nabla f_m(x_k^m,z_m)\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \nonumber \\ &\qquad\leqslant \frac{3L^2}{M\alpha}\sum_{m=1}^{M}\mathsf{E} \bigl[\|x_k^m-\widehat{x}_k\|^2\bigr]+\frac{6L}{\alpha}\, \mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)]+ \frac{3\sigma_{\rm dif}^2}{\alpha} \nonumber \\ &\qquad\leqslant \frac{3 L^2}{\alpha^2}\,\mathsf{E}[V_k]+ \frac{6L}{\alpha}\,\mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)]+ \frac{3\sigma_{\rm dif}^2}{\alpha}\,. \end{aligned} \end{equation} \tag{46} $$
Plugging (46) into (42) and summing the inequalities with the weights $w_j$ we obtain
$$ \begin{equation} \begin{aligned} \, \sum_{j=t_p}^t w_j\mathsf{E}[V_j] &\leqslant \sum_{j=t_p}^t w_j\gamma^2(H-1) \sum_{k=t_p}^{j-1}\biggl(\frac{3 L^2}{\alpha^2}\,\mathsf{E}[V_k]+ \frac{6L}{\alpha}\,\mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)]+ \frac{3\sigma_{\rm dif}^2}{\alpha}\biggr) \nonumber \\ &=\gamma^2(H-1)\sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}w_j \biggl(\frac{3 L^2}{\alpha^2}\,\mathsf{E}[V_k]+\frac{6L}{\alpha}\, \mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)]+ \frac{3\sigma_{\rm dif}^2}{\alpha}\biggr) \nonumber \\ &=\gamma^2(H-1)\sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1} w_j\,\frac{3 L^2}{\alpha^2}\,\mathsf{E}[V_k] \nonumber \\ &\qquad+\gamma^2(H-1)\sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1} w_j\,\frac{6 L}{\alpha}\,\mathsf{E}[D_{f}(\widehat{x}_k, x_\ast)] \nonumber \\ &\qquad+\gamma^2(H-1) \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1} w_j\,\frac{3\sigma_{\rm dif}^2}{\alpha}\,. \end{aligned} \end{equation} \tag{47} $$
Let us consider the sequence $\{w_k\}_{k=0}^{\infty}$. Recall that
$$ \begin{equation*} w_k=(1-\eta)^{-(k+1)},\quad\text{where}\ \ \eta \overset{{\rm def}}{=} \frac{\gamma\mu}{2\Gamma}\,. \end{equation*} \notag $$
Then for $j$ such that $0 \leqslant j \leqslant H-1$ and $\gamma \leqslant \alpha/(3(H-1)L)$ we obtain
$$ \begin{equation} \begin{aligned} \, w_k&=(1-\eta)^{-(k-j+1)}(1-\eta)^{-j} \leqslant (1-\eta)^{-(k-j+1)}(1+2\eta)^{j} \nonumber \\ &\leqslant w_{k-j}\biggl(1+\frac{\gamma\mu}{\Gamma}\biggr)^j \leqslant w_{k-j}\biggl(1+\frac{1}{3(H-1)}\biggr)^j \nonumber \\ &\leqslant w_{k-j}\exp{\biggl(\frac{j}{3(H-1)}\biggr)} \leqslant w_{k-j}\exp\biggl(\frac{1}{3}\biggr) \leqslant 2w_{k-j}. \end{aligned} \end{equation} \tag{48} $$
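The first inequality in (48) relies on the elementary bound $(1-\eta)^{-1}\leqslant 1+2\eta$, valid for $0\leqslant\eta\leqslant 1/2$, which applies here because (for $H\geqslant 2$, and using the standard facts $\mu\leqslant L$ and $\alpha\leqslant\Gamma$) the step-size restriction gives $\eta\leqslant 1/6$:
$$ \begin{equation*} (1+2\eta)(1-\eta)=1+\eta(1-2\eta)\geqslant 1\quad\text{for}\ \ 0\leqslant\eta\leqslant\frac{1}{2}\,,\qquad \eta=\frac{\gamma\mu}{2\Gamma}\leqslant \frac{\mu\alpha}{6(H-1)L\Gamma}\leqslant\frac{1}{6(H-1)}\leqslant\frac{1}{6}\,. \end{equation*} \notag $$
In particular, $\gamma\mu/\Gamma \leqslant 1/(3(H-1))$, which is used in the second line of (48).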
Using the result above, let us bound the terms in (47):
$$ \begin{equation} \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}w_j\mathsf{E}[V_k] = \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}w_{k+(j-k)}\mathsf{E}[V_k] \overset{(48)}{\leqslant}\sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}2w_{k} \mathsf{E}[V_k] \nonumber \end{equation} \notag $$
$$ \begin{equation} =\sum_{j=t_p+1}^t\,\sum_{k=t_p}^{j-1}2w_{k}\mathsf{E}[V_k] \leqslant 2(t-t_p) \sum_{k=t_p}^{t-1}w_{k}\mathsf{E}[V_k] \nonumber \end{equation} \notag $$
$$ \begin{equation} \leqslant 2(H-1)\sum_{j=t_p}^t w_{j}\mathsf{E}[V_j]; \end{equation} \tag{49} $$
$$ \begin{equation} \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}w_j\mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)] = \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}w_{k+(j-k)} \mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)] \nonumber \end{equation} \notag $$
$$ \begin{equation} \overset{(48)}{\leqslant} \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}2w_{k} \mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)] \nonumber \end{equation} \notag $$
$$ \begin{equation} =\sum_{j=t_p+1}^t\,\sum_{k=t_p}^{j-1}2w_{k} \mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)] \nonumber \end{equation} \notag $$
$$ \begin{equation} \leqslant 2(t-t_p) \sum_{k=t_p}^{t-1}w_{k} \mathsf{E}[D_{f}(\widehat{x}_k,x_\ast)] \nonumber \end{equation} \notag $$
$$ \begin{equation} \leqslant 2(H-1)\sum_{j=t_p}^t w_{j}\mathsf{E}[D_{f}(\widehat{x}_j,x_\ast)]; \end{equation} \tag{50} $$
$$ \begin{equation} \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}w_j \sigma_{\rm dif}^2 = \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1}w_{k+(j-k)}\sigma_{\rm dif}^2 \overset{(48)}{\leqslant} \sum_{j=t_p}^t\,\sum_{k=t_p}^{j-1} 2w_{k} \sigma_{\rm dif}^2 \nonumber \end{equation} \notag $$
$$ \begin{equation} =\sum_{j=t_p+1}^t\,\sum_{k=t_p}^{j-1}2w_{k} \sigma_{\rm dif}^2 \leqslant 2(t-t_p) \sum_{k=t_p}^{t-1}w_{k} \sigma_{\rm dif}^2 \nonumber \end{equation} \notag $$
$$ \begin{equation} \leqslant 2(H-1)\sum_{j=t_p}^t w_{j} \sigma_{\rm dif}^2. \end{equation} \tag{51} $$
Substituting (49), (50), and (51) into (47), we can obtain
$$ \begin{equation*} \begin{aligned} \, \sum_{j=t_p}^t w_j\mathsf{E}[V_j] &\leqslant \frac{6\gamma^2L^2(H-1)^2}{\alpha^2} \sum_{j=t_p}^t w_j\mathsf{E}[V_j] \\ &\qquad+\frac{12\gamma^2(H-1)^2 L}{\alpha}\sum_{j=t_p}^t w_j \mathsf{E}[D_{f} (\widehat{x}_j, x_\ast)] \\ &\qquad+\frac{6\gamma^2(H-1)^2\sigma_{\rm dif}^2}{\alpha}\sum_{j=t_p}^t w_j. \end{aligned} \end{equation*} \notag $$
The sum $\displaystyle\sum_{j=t_p}^t w_j\mathsf{E}[V_j]$ appears on both sides of this inequality, so it can be moved to the left-hand side. Then
$$ \begin{equation*} \begin{aligned} \, \biggl(1-\frac{6\gamma^2L^2(H-1)^2}{\alpha^2}\biggr)\sum_{j=t_p}^t w_j \mathsf{E}[V_j] &\leqslant \frac{12\gamma^2(H-1)^2 L}{\alpha} \sum_{j=t_p}^t w_j\mathsf{E}[D_{f}(\widehat{x}_j,x_\ast)] \\ &\qquad+\frac{6\gamma^2(H-1)^2\sigma_{\rm dif}^2}{\alpha}\sum_{j=t_p}^t w_j. \end{aligned} \end{equation*} \notag $$
Since $\gamma \leqslant \dfrac{\alpha}{3(H-1)L}$, we have $\dfrac{6\gamma^2L^2(H-1)^2}{\alpha^2} \leqslant \dfrac{2}{3}$ and thus $1-\dfrac{6\gamma^2L^2(H-1)^2}{\alpha^2} \geqslant \dfrac{1}{3}$. Dividing by this factor, we conclude that
$$ \begin{equation*} \begin{aligned} \, \sum_{j=t_p}^t w_j\mathsf{E}[V_j] &\leqslant \frac{36\gamma^2(H-1)^2 L}{\alpha}\sum_{j=t_p}^t w_j \mathsf{E}[D_{f}(\widehat{x}_j,x_\ast)] \\ &\qquad+\frac{18\gamma^2(H-1)^2\sigma_{\rm dif}^2}{\alpha}\sum_{j=t_p}^t w_j. \quad\Box \end{aligned} \end{equation*} \notag $$

Lemma 8. Suppose that Assumptions 1, 3, and 4 hold and that Algorithm 1 runs with heterogeneous data and $M \geqslant 2$. Then for any $\gamma \geqslant 0$ and any $t$ such that $t_p \leqslant t \leqslant t_{p+1}-1$ we have

$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] &\leqslant \biggl(1-\frac{\gamma \mu}{\Gamma}\biggr) \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{\gamma L}{\alpha}\biggl(1+\frac{2\gamma L}{\alpha}\biggr) V_t \\ &\qquad-2\gamma\biggl(1-\frac{4\gamma L}{\alpha}\biggr) D_{f}(\widehat{x}_t,x_\ast)+\frac{4\gamma^2\sigma_{\rm dif}^2}{M\alpha}\,. \end{aligned} \end{equation*} \notag $$
In particular, if $\gamma \leqslant \alpha/(8L)$, then
$$ \begin{equation*} \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] \leqslant \biggl(1-\frac{\gamma \mu}{\Gamma}\biggr) \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{5\gamma L}{4\alpha}\,V_t-\gamma D_{f}(\widehat{x}_t,x_\ast)+ \frac{4\gamma^2\sigma_{\rm dif}^2}{M\alpha}\,. \end{equation*} \notag $$

Proof. First we use the update rule $\widehat{x}_{t+1}=\widehat{x}_t-\gamma(\widehat{D}^{t_p})^{-1}g_t$:
$$ \begin{equation*} \begin{aligned} \, \|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}&= \bigl\|\widehat{x}_t-\gamma(\widehat{D}^{t_p})^{-1}g_t- x_\ast\bigr\|^2_{\widehat{D}^{t_p}} \\ &=\|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma^2\|g_t\|^2_{(\widehat{D}^{t_p})^{-1}}- 2\gamma\langle\widehat{x}_t-x_\ast,g_t\rangle \\ &=\|\widehat{x}_t-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma^2\|g_t\|^2_{(\widehat{D}^{t_p})^{-1}}- \frac{2\gamma}{M}\sum_{m=1}^{M} \langle\widehat{x}_t-x_\ast,\nabla f_m(x_t^m,z_m)\rangle. \end{aligned} \end{equation*} \notag $$
Taking the full expectation and using Lemmas 5 and 6 we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] &\leqslant \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma^2\mathsf{E}\bigl[\|g_t\|^2_{(\widehat{D}^{t_p})^{-1}}\bigr] \\ &\qquad-\frac{2\gamma}{M} \sum_{m=1}^{M} \langle\widehat{x}_t-x_\ast,\nabla f_m (x_t^m)\rangle \\ &\leqslant \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma^2\biggl(\frac{2L^2}{\alpha^2}\,V_t+ \frac{8L}{\alpha}D_{f}(\widehat{x}_t,x_\ast)+ \frac{4\sigma_{\rm dif}^2}{M\alpha}\biggr) \\ &\qquad- \frac{2\gamma}{M}\sum_{m=1}^{M} \langle\widehat{x}_t-x_\ast,\nabla f_m (x_t^m)\rangle \\ &\leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr) \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{\gamma L}{\alpha}\biggl(1+\frac{2\gamma L}{\alpha}\biggr) V_t \\ &\qquad-2\gamma\biggl(1-\frac{4\gamma L}{\alpha}\biggr) D_{f}(\widehat{x}_t,x_\ast)+\frac{4\gamma^2\sigma_{\rm dif}^2}{M\alpha}\,. \end{aligned} \end{equation*} \notag $$
This proves the first part of the lemma.

If $\gamma \leqslant \alpha/(8L)$, then

$$ \begin{equation*} 1-\frac{4\gamma L}{\alpha} \geqslant \frac{1}{2}\quad\text{and}\quad 1+\frac{2\gamma L}{\alpha} \leqslant \frac{5}{4}\,, \end{equation*} \notag $$
and, finally,
$$ \begin{equation*} \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] \leqslant \biggl(1-\frac{\gamma \mu}{\Gamma}\biggr) \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \frac{5\gamma L}{4\alpha}\,V_t-\gamma D_{f}(\widehat{x}_t,x_\ast)+ \frac{4\gamma^2\sigma_{\rm dif}^2}{M\alpha}\,.\ \ \Box \end{equation*} \notag $$

D.2. Proof of Theorem 2

We start with Lemma 8:

$$ \begin{equation*} \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t_p}}\bigr] \leqslant \biggl(1-\frac{\gamma\mu}{\Gamma}\biggr) \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t_p}}+ \gamma\biggl(\frac{5L}{4\alpha}V_t-D_{f}(\widehat{x}_t, x_\ast)\biggr)+ \frac{4\gamma^2\sigma_{\rm dif}^2}{M\alpha}\,. \end{equation*} \notag $$
Applying Corollary 1 and using the inequality $1+\gamma\mu/(2\Gamma) \leqslant 2$, we obtain
$$ \begin{equation*} \mathsf{E}\bigl[\|\widehat{x}_{t+1}-x_\ast\|^2_{\widehat{D}^{t+1}}\bigr] \leqslant \biggl(1-\frac{\gamma \mu}{2\Gamma}\biggr) \|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t}}+ \gamma\biggl(\frac{5L}{2\alpha}\,V_t-2D_{f}(\widehat{x}_t,x_\ast)\biggr)+ \frac{8\gamma^2\sigma_{\rm dif}^2}{M\alpha}\,. \end{equation*} \notag $$
Taking the full expectation and summing with the weights $w_t$ gives
$$ \begin{equation} \begin{aligned} \, \sum_{t=0}^{T-1}w_t\mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_\ast\|^2_{\widehat{D}^{t+1}}\bigr] &\leqslant \biggl(1-\frac{\gamma \mu}{2\Gamma}\biggr)\sum_{t=0}^{T-1} w_t \mathsf{E}\bigl[\|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t}}\bigr] \nonumber \\ &\qquad+ \gamma \sum_{t=0}^{T-1}\mathsf{E}\biggl[\frac{5L}{2\alpha} w_t V_t- 2 w_t D_{f}(\widehat{x}_t,x_\ast)\biggr] \nonumber \\ &\qquad+\frac{8\gamma^2\sigma_{\rm dif}^2}{M\alpha}\sum_{t=0}^{T-1} w_t. \end{aligned} \end{equation} \tag{52} $$
Using that $T=t_p$ for some $p \in \mathbb{N}$, we can decompose the second term and apply Lemma 7:
$$ \begin{equation*} \begin{aligned} \, \sum_{t=0}^{T-1} w_t\mathsf{E}\biggl[\frac{5L}{2\alpha}\,V_t-& 2 D_{f}(\widehat{x}_t,x_\ast)\biggr]=\sum_{k=1}^{p}\, \sum_{t=t_{k-1}}^{t_{k}-1} 2w_t\mathsf{E}\biggl[\frac{5L}{4\alpha}\,V_t- D_{f}(\widehat{x}_t,x_\ast)\biggr] \\ &\leqslant \sum_{k=1}^{p}\,\sum_{t=t_{k-1}}^{t_k-1} 2w_t \biggl(\frac{45L^2\gamma^2 (H-1)^2}{\alpha^2}-1\biggr) \mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)] \\ &\qquad+\frac{45L\gamma^2(H-1)^2\sigma_{\rm dif}^2}{\alpha^2} \sum_{k=1}^{p}\,\sum_{t=t_{k-1}}^{t_k-1}w_t. \end{aligned} \end{equation*} \notag $$
By the assumption that $\gamma \leqslant \alpha/(10(H-1)L)$ we have
$$ \begin{equation*} \frac{45L^2\gamma^2(H-1)^2}{\alpha^2}-1 \leqslant -\frac{1}{2}\,. \end{equation*} \notag $$
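Indeed, under this restriction on the step size,
$$ \begin{equation*} \frac{45L^2\gamma^2(H-1)^2}{\alpha^2}\leqslant \frac{45L^2(H-1)^2}{\alpha^2}\cdot \frac{\alpha^2}{100(H-1)^2L^2}=\frac{45}{100}\leqslant\frac{1}{2}\,, \end{equation*} \notag $$
which is equivalent to the displayed inequality.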
Using this, one can obtain
$$ \begin{equation*} \begin{aligned} \, \sum_{t=0}^{T-1}w_t\mathsf{E}\biggl[\frac{5L}{2\alpha}\,V_t- 2D_{f}(\widehat{x}_t,x_\ast)\biggr] &\leqslant-\sum_{k=1}^{p}\, \sum_{t=t_{k-1}}^{t_k-1}w_t\mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)] \\ &\qquad+\frac{45L\gamma^2(H-1)^2\sigma_{\rm dif}^2}{\alpha^2} \sum_{k=1}^{p}\,\sum_{t=t_{k-1}}^{t_k-1}w_t \\ &=-\sum_{t=0}^{T-1}w_t\mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)] \\ &\qquad+\frac{45L\gamma^2(H-1)^2\sigma_{\rm dif}^2}{\alpha^2} \sum_{t=0}^{T-1}w_t. \end{aligned} \end{equation*} \notag $$
Plugging this into (52) gives
$$ \begin{equation*} \begin{aligned} \, \sum_{t=0}^{T-1}w_t\mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_\ast\|^2_{\widehat{D}^{t+1}}\bigr] &\leqslant \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)\sum_{t=0}^{T-1} w_t\mathsf{E}\bigl[\|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t}}\bigr] \\ &\qquad-\gamma\sum_{t=0}^{T-1}w_t\mathsf{E}[D_{f} (\widehat{x}_t,x_\ast)] \\ &\qquad+\biggl(\frac{45L\gamma^3(H-1)^2\sigma_{\rm dif}^2}{\alpha^2}+ \frac{8\gamma^2\sigma_{\rm dif}^2}{M\alpha}\biggr)\sum_{t=0}^{T-1}w_t. \end{aligned} \end{equation*} \notag $$
Rearranging terms, we have
$$ \begin{equation*} \begin{aligned} \, &\gamma \sum_{t=0}^{T-1} w_t\mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)] \\ &\qquad\leqslant \sum_{t=0}^{T-1} \biggl(\biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr) w_t\mathsf{E}\bigl[\|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t}}\bigr]- w_t\mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_\ast\|^2_{\widehat{D}^{t+1}}\bigr]\biggr) \\ &\qquad\qquad+\biggl(\frac{45L\gamma^3(H-1)^2\sigma_{\rm dif}^2}{\alpha^2}+ \frac{8\gamma^2\sigma_{\rm dif}^2}{M\alpha}\biggr)\sum_{t=0}^{T-1} w_t. \end{aligned} \end{equation*} \notag $$
Setting $W_{T-1} \overset{{\rm def}}{=}\displaystyle\sum_{t=0}^{T-1} w_t$ and dividing both sides by $\gamma W_{T-1}$, we obtain
$$ \begin{equation*} \begin{aligned} \, &\frac{1}{W_{T-1}}\sum_{t=0}^{T-1} w_t\mathsf{E}[D_{f}(\widehat{x}_t,x_\ast)] \\ &\qquad\leqslant \frac{1}{\gamma W_{T-1}}\sum_{t=0}^{T-1}\biggl(\biggl(1- \frac{\gamma \mu}{2\Gamma}\biggr)w_t\mathsf{E}\bigl[\|\widehat{x}_{t}- x_\ast\|^2_{\widehat{D}^{t}}\bigr]-w_t \mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_\ast\|^2_{\widehat{D}^{t+1}}\bigr]\biggr) \\ &\qquad\qquad+\frac{1}{\gamma W_{T-1}}\biggl(\frac{45L\gamma^3(H-1)^2 \sigma_{\rm dif}^2}{\alpha^2}+ \frac{8\gamma^2\sigma_{\rm dif}^2}{M\alpha}\biggr)\sum_{t=0}^{T-1}w_t. \end{aligned} \end{equation*} \notag $$
Let us define
$$ \begin{equation*} \overline{x}_{T-1} \overset{{\rm def}}{=} \frac{1}{W_{T-1}} \sum_{t=0}^{T-1}w_t \widehat{x}_t. \end{equation*} \notag $$
Using this definition, the definition of Bregman divergence, the equality $w_{t-1}=w_t(1-\gamma \mu/(2\Gamma))$, the fact that
$$ \begin{equation*} W_{T-1} \geqslant w_{T-1}=(1-\gamma \mu/(2\Gamma))^{-T}, \end{equation*} \notag $$
and the upper bound on $\gamma$, telescoping the sum, and applying Jensen’s inequality to the left-hand side, we obtain the final result:
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[f(\overline{x}_{T-1})-f(x_\ast)\bigr] &\leqslant \frac{1}{\gamma W_{T-1}}\sum_{t=0}^{T-1} \Bigl(w_{t-1} \mathsf{E}\bigl[\|\widehat{x}_{t}-x_\ast\|^2_{\widehat{D}^{t}}\bigr]- w_t\mathsf{E}\bigl[\|\widehat{x}_{t+1}- x_\ast\|^2_{\widehat{D}^{t+1}}\bigr]\Bigr) \\ &\qquad+\frac{1}{\gamma W_{T-1}} \biggl(\frac{45L\gamma^3(H-1)^2\sigma_{\rm dif}^2}{\alpha^2}+ \frac{8\gamma^2\sigma_{\rm dif}^2}{M\alpha}\biggr)\sum_{t=0}^{T-1}w_t \\ &\leqslant \biggl(1-\frac{\gamma \mu}{2\Gamma}\biggr)^T \frac{\|x_{0}-x_\ast\|^2_{\widehat{D}^0}}{\gamma}+ \gamma\sigma_{\rm dif}^2\biggl(\frac{9(H-1)}{2\alpha}+ \frac{8}{M\alpha}\biggr) \\ &\leqslant \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)^T \frac{\Gamma\|x_{0}-x_\ast\|^2}{\gamma}+ \gamma\sigma_{\rm dif}^2\biggl(\frac{9(H-1)}{2\alpha}+ \frac{8}{M\alpha}\biggr). \quad \Box \end{aligned} \end{equation*} \notag $$
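In the second inequality of the display above, the coefficient $9(H-1)/(2\alpha)$ of the noise term comes from the identity $\sum_{t=0}^{T-1}w_t=W_{T-1}$ and the step-size restriction $\gamma \leqslant \alpha/(10(H-1)L)$:
$$ \begin{equation*} \frac{1}{\gamma W_{T-1}}\cdot \frac{45L\gamma^3(H-1)^2\sigma_{\rm dif}^2}{\alpha^2}\sum_{t=0}^{T-1}w_t= \frac{45L\gamma(H-1)}{\alpha}\cdot \frac{\gamma(H-1)\sigma_{\rm dif}^2}{\alpha}\leqslant \frac{9\gamma(H-1)\sigma_{\rm dif}^2}{2\alpha}\,. \end{equation*} \notag $$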

D.3. Proof of Corollary 3

We start with Theorem 2:

$$ \begin{equation*} \mathsf{E}\bigl[f(\overline{x}_{T-1})-f(x_\ast)\bigr] \leqslant \biggl(1-\frac{\gamma \mu}{2\Gamma}\biggr)^T \frac{\Gamma\|x_{0}-x_\ast\|^2}{\gamma}+\gamma\sigma_{\rm dif}^2 \biggl(\frac{9(H-1)}{2\alpha}+\frac{8}{M\alpha}\biggr). \end{equation*} \notag $$
With the new notation
$$ \begin{equation*} c \overset{{\rm def}}{=} \sigma_{\rm dif}^2 \biggl(\frac{9(H-1)}{2\alpha}+\frac{8}{M\alpha}\biggr) \end{equation*} \notag $$
we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\bigl[f(\overline{x}_{T-1})-f(x_\ast)\bigr] &\leqslant \biggl(1-\frac{\gamma \mu}{2\Gamma}\biggr)^T \frac{\Gamma\|x_{0}-x_\ast\|^2}{\gamma}+c\gamma \\ &\leqslant \exp\biggl(-\frac{\gamma\mu T}{2\Gamma}\biggr) \frac{\Gamma\|x_{0}-x_\ast\|^2}{\gamma}+c\gamma. \end{aligned} \end{equation*} \notag $$
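The second inequality above uses the standard estimate $1-x\leqslant e^{-x}$, valid for all real $x$, applied with $x=\gamma\mu/(2\Gamma)$ and raised to the power $T$:
$$ \begin{equation*} \biggl(1-\frac{\gamma\mu}{2\Gamma}\biggr)^{T}\leqslant \exp\biggl(-\frac{\gamma\mu T}{2\Gamma}\biggr). \end{equation*} \notag $$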
The restriction on $\gamma$ in Theorem 2 is $\gamma \leqslant \dfrac{\alpha}{10(H-1)L}$. Let us consider the following two cases.

$\bullet$ $\dfrac{\alpha}{10(H-1)L} \geqslant \dfrac{2\Gamma}{\mu T}\log\biggl(\max\biggl\{2, \dfrac{\mu^2\|x_{0}-x_\ast\|^2T^2}{4\Gamma c}\biggr\}\biggr)$. Hence, choosing

$$ \begin{equation*} \gamma=\frac{2\Gamma}{\mu T}\log\biggl(\max \biggl\{2,\frac{\mu^2\|x_{0}-x_\ast\|^2T^2}{4\Gamma c}\biggr\}\biggr), \end{equation*} \notag $$
we obtain
$$ \begin{equation*} \begin{aligned} \, &\widetilde{\mathcal{O}}\biggl(\mu\|x_{0}-x_\ast\|^2\,T \exp\biggl(-\log\biggl(\max\biggl\{2, \frac{\mu^2\|x_{0}-x_\ast\|^2 T^2}{4\Gamma c}\biggr\}\biggr)\biggr)\biggr)+ \widetilde{\mathcal{O}}\biggl(\frac{\Gamma c}{\mu T}\biggr) \\ &\qquad=\widetilde{\mathcal{O}}\biggl(\frac{\Gamma c}{\mu T}\biggr), \end{aligned} \end{equation*} \notag $$
where in the case of $2 \geqslant \dfrac{\mu^2\|x_{0}-x_\ast\|^2T^2}{4\Gamma c}$ we have $\mu\|x_{0}-x_\ast\|^2T \leqslant \dfrac{8\Gamma c}{\mu T}$ .
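The last implication is a direct rearrangement of the case condition: if $2 \geqslant \mu^2\|x_{0}-x_\ast\|^2T^2/(4\Gamma c)$, then
$$ \begin{equation*} \mu^2\|x_{0}-x_\ast\|^2T^2 \leqslant 8\Gamma c \quad\Longrightarrow\quad \mu\|x_{0}-x_\ast\|^2T \leqslant \frac{8\Gamma c}{\mu T}\,, \end{equation*} \notag $$
where the second inequality follows after dividing both sides by $\mu T>0$.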

$\bullet$ $\dfrac{\alpha}{10(H-1)L} \leqslant \dfrac{2\Gamma}{\mu T}\log\biggl(\max\biggl\{2, \dfrac{\mu^2\|x_{0}-x_\ast\|^2 T^2}{4\Gamma c}\biggr\}\biggr)$. Then we choose

$$ \begin{equation*} \gamma=\frac{\alpha}{10(H-1)L}\,. \end{equation*} \notag $$
Hence we obtain
$$ \begin{equation*} \begin{aligned} \, &\frac{10(H-1)L\Gamma\|x_{0}-x_\ast\|^2}{\alpha} \exp\biggl(-\frac{\alpha\mu T}{20\Gamma(H-1)L}\biggr)+ \frac{c\alpha}{10(H-1)L} \\ &\qquad\leqslant \frac{10(H-1)L\Gamma\|x_{0}-x_\ast\|^2}{\alpha} \exp\biggl(-\frac{\alpha\mu T}{20\Gamma(H-1)L}\biggr) \\ &\qquad\qquad+\frac{2\Gamma c\log(\max\{2, \mu^2\|x_{0}-x_\ast\|^2T^2/(4\Gamma c)\})}{\mu T} \\ &\qquad=\widetilde{\mathcal{O}}\biggl(\frac{(H-1)L\Gamma}{\alpha}\, \|x_0-x_\ast\|^2\exp\biggl(-\frac{\mu T\alpha}{\Gamma(H-1)L}\biggr)\biggr) \\ &\qquad\qquad+\widetilde{\mathcal{O}}\biggl(\frac{\Gamma c}{\mu T}\biggr). \end{aligned} \end{equation*} \notag $$

Substituting $c$ and combining the above results, we have the following estimate:

$$ \begin{equation*} \begin{aligned} \, &\mathsf{E}\bigl[f(\overline{x}_{T-1})-f(x_\ast)\bigr] \\ &\qquad=\widetilde{\mathcal{O}}\,\biggl(\frac{(H-1)L\Gamma}{\alpha}\, \|x_0-x_\ast\|^2\exp\biggl(-\frac{\mu T \alpha}{\Gamma(H-1)L}\biggr)+ \frac{\Gamma\sigma_{\rm dif}^2}{\alpha\mu T} \biggl((H-1)+\frac{1}{M}\biggr)\biggr), \end{aligned} \end{equation*} \notag $$
which finishes the proof of Corollary 3. $\Box$


Bibliography

1. S. Shalev-Shwartz and S. Ben-David, Understanding machine learning. From theory to algorithms, Cambridge Univ. Press, Cambridge, 2014, xvi+397 pp.  crossref  zmath
2. I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, Adapt. Comput. Mach. Learn., MIT Press, Cambridge, MA, 2016, xxii+775 pp.  mathscinet  zmath
3. S. Arora, N. Cohen, and E. Hazan, “On the optimization of deep networks: implicit acceleration by overparameterization”, Proceedings of the 35th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 80, 2018, 244–253 https://proceedings.mlr.press/v80/arora18a.html
4. V. Smith, S. Forte, Chenxin Ma, M. Takáč, M. I. Jordan, and M. Jaggi, “CoCoA: A general framework for communication-efficient distributed optimization”, J. Mach. Learn. Res., 18 (2018), 230, 49 pp.  mathscinet  zmath
5. K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik, Distributed learning with compressed gradient differences, 2023 (v1 – 2019), 59 pp., arXiv: 1901.09269
6. J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 33 pp.  crossref
7. S. Chraibi, A. Khaled, D. Kovalev, P. Richtárik, A. Salim, and M. Takáč, Distributed fixed point methods with compressed iterates, 2019, 15 pp., arXiv: 1912.09925
8. V. Pirau, A. Beznosikov, M. Takáč, V. Matyukhin, and A. Gasnikov, “Preconditioning meets biased compression for efficient distributed optimization”, Comput. Manag. Sci., 21:1 (2024), 14, 22 pp.  crossref  mathscinet  zmath
9. J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated learning: strategies for improving communication efficiency, 2017 (v1 – 2016), 10 pp., arXiv: 1610.05492
10. P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., “Advances and open problems in federated learning”, Found. Trends Mach. Learn., 14:1-2 (2021), 1–210  crossref  zmath
11. S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5132–5143 https://proceedings.mlr.press/v119/karimireddy20a.html
12. Y. Arjevani and O. Shamir, “Communication complexity of distributed convex learning and optimization”, NIPS'15: Proceedings of the 28th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 28, MIT Press, Cambridge, MA, 2015, 1756–1764 https://papers.nips.cc/paper_files/paper/2015/hash/7fec306d1e665bc9c748b5d2b99a6e97-Abstract.html
13. D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding”, NIPS'17: Proceedings of the 31st international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 30, Curran Associates, Inc., Red Hook, NY, 2017, 1707–1718 https://proceedings.neurips.cc/paper_files/paper/2017/hash/6c340f25839e6acdc73414517203f5f0-Abstract.html
14. A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan, On biased compression for distributed learning, 2024 (v1 – 2020), 50 pp., arXiv: 2002.12410
15. L. O. Mangasarian, “Parallel gradient distribution in unconstrained optimization”, SIAM J. Control Optim., 33:6 (1995), 1916–1925  crossref  mathscinet  zmath
16. S. U. Stich, Local SGD converges fast and communicates little, 2019 (v1 – 2018), 19 pp., arXiv: 1805.09767
17. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282 https://proceedings.mlr.press/v54/mcmahan17a
18. Jianyu Wang, V. Tantia, N. Ballas, and M. Rabbat, SlowMo: Improving communication-efficient distributed SGD with slow momentum, 2020 (v1 – 2019), 27 pp., arXiv: 1910.00643
19. D. Basu, D. Data, C. Karakus, and S. N. Diggavi, “Qsparse-local-SGD: distributed SGD with quantization, sparsification, and local computations”, IEEE J. Sel. Areas Inform. Theory, 1:1 (2020), 217–226  crossref
20. A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 2021–2031 https://proceedings.mlr.press/v108/reisizadeh20a.html
21. Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng, Variance reduced local SGD with lower communication complexity, 2019, 25 pp., arXiv: 1912.12844
22. P. Sharma, S. Kafle, P. Khanduri, S. Bulusu, K. Rajawat, and P. K. Varshney, Parallel restarted SPIDER–communication efficient distributed nonconvex optimization with optimal computation complexity, 2020 (v1 – 2019), 25 pp., arXiv: 1912.06036
23. K. Mishchenko, G. Malinovsky, S. Stich, and P. Richtárik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!”, Proceedings of the 39th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 162, 2022, 15750–15769 https://proceedings.mlr.press/v162/mishchenko22b
24. O. Shamir, N. Srebro, and Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008 https://proceedings.mlr.press/v32/shamir14.html
25. H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, and L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227 https://proceedings.mlr.press/v119/hendrikx20a.html
26. A. Beznosikov, G. Scutari, A. Rogozin, and A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS'21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184 https://proceedings.neurips.cc/paper_files/paper/2021/hash/44e65d3e9bc2f88b2b3d566de51a5381-Abstract.html
27. D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, and G. Scutari, Optimal gradient sliding and its application to distributed optimization under similarity, 2022, 24 pp., arXiv: 2205.15136
28. D. P. Kingma and J. L. Ba, Adam: a method for stochastic optimization, 2017 (v1 – 2014), 15 pp., arXiv: 1412.6980
29. T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”, COURSERA: Neural networks for machine learning, 4:2 (2012), 26–31
30. J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization”, J. Mach. Learn. Res., 12 (2011), 2121–2159  mathscinet  zmath
31. C. Bekas, E. Kokiopoulou, and Y. Saad, “An estimator for the diagonal of a matrix”, Appl. Numer. Math., 57:11-12 (2007), 1214–1229  crossref  mathscinet  zmath
32. A. Sadiev, A. Beznosikov, A. J. Almansoori, D. Kamzolov, R. Tappenden, and M. Takáč, “Stochastic gradient methods with preconditioned updates”, J. Optim. Theory Appl., 201:2 (2024), 471–489  crossref  mathscinet  zmath
33. M. Jahani, S. Rusakov, Zheng Shi, P. Richtárik, M. W. Mahoney, and M. Takáč, Doubly adaptive scaled algorithm for machine learning using second-order information, 2021, 44 pp., arXiv: 2109.05198
34. J. E. Dennis, Jr., and J. J. Moré, “Quasi-Newton methods, motivation and theory”, SIAM Rev., 19:1 (1977), 46–89  crossref  mathscinet  zmath
35. R. Fletcher, Practical methods of optimization, 2nd ed., Wiley-Intersci. [John Wiley & Sons], New York, 2013, xiv+436 pp.  mathscinet  zmath
36. A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529 https://proceedings.mlr.press/v108/bayoumi20a.html
37. A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393 https://proceedings.mlr.press/v119/koloskova20a.html
38. A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133 https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html
39. M. R. Glasgow, Honglin Yuan, and Tengyu Ma, “Sharp bounds for federated averaging (local SGD) and continuous perspective”, Proceedings of the 25th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 151, 2022, 9050–9090 https://proceedings.mlr.press/v151/glasgow22a
40. A. Sadiev, A. Beznosikov, A. J. Almansoori, D. Kamzolov, R. Tappenden, and M. Takáč, Stochastic gradient methods with preconditioned updates, 2024 (v1 – 2022), 40 pp., arXiv: 2206.00285
41. A. Beznosikov, A. Alanov, D. Kovalev, M. Takáč, and A. Gasnikov, On scaled methods for saddle point problems, 2023 (v1 – 2022), 54 pp., arXiv: 2206.08303
42. S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, Adaptive federated optimization, 2021 (v1 – 2020), 38 pp., arXiv: 2003.00295
43. Y. Nesterov, Introductory lectures on convex optimization. A basic course, Appl. Optim., 87, Kluwer Acad. Publ., Boston, MA, 2004, xviii+236 pp.  crossref  mathscinet  zmath
44. A. S. Nemirovsky and D. B. Yudin, Problem complexity and method efficiency in optimization, Wiley-Intersci. Ser. Discrete Math., Wiley-Intersci. Publ., John Wiley & Sons, Inc., New York, 1983, xv+388 pp.  mathscinet  zmath
45. Lam Nguyen, Phuong Ha Nguyen, M. Dijk, P. Richtárik, K. Scheinberg, and M. Takác, “SGD and Hogwild! Convergence without the bounded gradients assumption”, Proceedings of the 35th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 80, 2018, 3750–3758 https://proceedings.mlr.press/v80/nguyen18c
46. Zhewei Yao, A. Gholami, Sheng Shen, M. Mustafa, K. Keutzer, and M. Mahoney, “ADAHESSIAN: An adaptive second order optimizer for machine learning”, Proc. Int. AAAI Conf., 35:12 (2021), 10665–10673  crossref
47. A. Défossez, L. Bottou, F. Bach, and N. Usunier, A simple convergence proof of Adam and Adagrad, 2022 (v1 – 2020), 30 pp., arXiv: 2003.02395
48. A. Krizhevsky, Learning multiple layers of features from tiny images, Tech. rep., Univ. of Toronto, Toronto, ON, 2009, 60 pp. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
49. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition”, Proceedings of the 2016 IEEE conference on computer vision and pattern recognition (Las Vegas, NV), IEEE, 2016, 770–778  crossref
