Introduction

Once upon a time, models needed more data than parameters to provide a meaningful fit. Deep networks seem to avoid this basic constraint. In fact, more weights than data is the standard situation for deep-learning networks, which typically fit the data and still show good predictive performance on new data1. Of course, it has been known for some time that the key to good predictive performance is controlling the complexity of the network and not simply the raw number of its parameters. The complexity of the network depends on appropriate measures of the complexity of the space of functions it realizes, such as the VC dimension, covering numbers and Rademacher numbers. Complexity can be controlled during optimization by imposing a constraint, often in the form of a regularization penalty, on the norm of the weights, as all the notions of complexity listed above depend on it. The problem is that there is no obvious control of complexity in the training of deep networks! This has given an aura of magic to deep learning and has contributed to the belief that classical learning theory does not hold for deep networks.

In the case of regression for shallow linear networks such as kernel machines, it is well known from work on inverse problems and in machine learning (see refs. 2,3) that iterative gradient descent (GD) has a vanishing regularizing effect, with the iteration number t (for fixed step size) playing the role of \(\frac{1}{\lambda }\) (where λ is the corresponding regularization parameter): thus t → ∞ corresponds to λ → 0. The simplest example is least-square regression on a linear network, where vanishing regularization unifies the overdetermined and the underdetermined cases as follows

$$\mathop{\min }\limits_{w\in {R}^{d}}\frac{1}{n}||Y-Xw||^{2}+\lambda ||w||^{2}$$
(1)

yielding \({w}_{\lambda }={({X}^{T}X+\lambda nI)}^{-1}{X}^{T}Y\). It is noteworthy that \({\mathrm{lim}}_{\lambda \to 0}{w}_{\lambda }={w}^{\dagger }={X}^{\dagger }Y\), the minimum-norm (pseudoinverse) solution. In this case, iterative GD minimizing \(\frac{1}{n}||Y-Xw||^{2}\) performs an implicit regularization equivalent to taking λ → 0 in wλ above.
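As a concrete illustration of this limit, the following minimal NumPy sketch (synthetic data, sizes, and step size are invented for the example) runs plain GD on the unregularized least-squares loss from zero initialization and compares the result with the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                        # underdetermined: more parameters than data
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

w = np.zeros(d)                      # zero initialization keeps w in the row space of X
eta = 1e-2
for _ in range(200000):              # plain GD on (1/n)||Y - Xw||^2, no explicit penalty
    w -= eta * (2.0 / n) * X.T @ (X @ w - Y)

w_dagger = np.linalg.pinv(X) @ Y     # minimum-norm interpolant, the lambda -> 0 limit of ridge
print(np.linalg.norm(w - w_dagger))  # ~0: GD implicitly selects the pseudoinverse solution
```

With a nonzero initialization, GD instead converges to the interpolant closest to the starting point, which is why the initial conditions matter in this linear, square-loss setting.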

The question is whether an effective regularization may be similarly induced in nonlinear deep networks by GD and how. This paper addresses the question of implicit regularization in the specific case of training wrt exponential-type loss functions—such as cross-entropy or exponential loss. It is worth noting that cross-entropy is the loss function used in training deep networks for classification problems, and that most of the practical applications and successes of deep networks, at least so far, involve classification.

This paper answers this question in the affirmative: there is a hidden regularization in the typical training of deep networks. The basic observation is that the weights computed during minimization of the exponential loss do not matter by themselves and in fact they diverge during GD. As shown in the next section, in the case of classification—both binary and multiclass—only the normalized weights matter: thus, complexity control is implicit in defining the normalized weights Vk as the variables of interest. What is not completely obvious is that commonly used GD on the unnormalized weights induces a dynamics of the normalized weights that converges to a stationary point. We show that at any finite time of the dynamics, the stationary points of the flow of Vk satisfy the necessary and sufficient conditions for a minimum. This mechanism underlies regularization in deep networks for exponential losses and, as we discuss later, is likely to be the starting point for explaining their prediction performance.

Results

Loss functions

We consider for simplicity of exposition the case of binary classification. We call “loss” the measure of performance of the network f on a training set \(S=\{\left({x}_{1},{y}_{1}\right),\cdots \ ,\left({x}_{N},{y}_{N}\right)\}\). The most common loss optimized during training for binary classification is the logistic loss \(L(f)=\frac{1}{N}{\sum }_{n=1}^{N}\, {\mathrm{ln}}(1+{e}^{-{y}_{n}f({x}_{n})})\). We focus on the closely related, simpler exponential loss \(L(f(w))={\sum }_{n=1}^{N}{e}^{-f({x}_{n}){y}_{n}}\). We call classification “error” \(\frac{1}{N}{\sum }_{n=1}^{N}H(-{y}_{n}f({x}_{n}))\), where y is binary (y ∈ {−1, +1}) and H is the Heaviside function with H(−yf(x)) = 1 if −yf > 0, which corresponds to incorrect classification. We say that f separates the data if ynf(xn) > 0 ∀n. We will typically assume that GD at some t > T0 reaches an f that separates the data (which is usually the case for overparametrized deep networks). There is a close relation between the exponential or logistic loss and the classification error: both the exponential and the logistic losses are upper bounds for the classification error, so minimizing either loss also drives the classification error down. Minimization of the loss can be performed by GD techniques. In today’s praxis, stochastic GD (SGD) is used to train deep networks. We focus here on GD for simplicity; our main results should also hold for SGD.
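The relation between these losses and the classification error is easy to check numerically; the sketch below uses made-up margins mn = ynf(xn) rather than an actual network, and averages the exponential loss (the text uses a sum) so that the three quantities are directly comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
margins = rng.normal(0.5, 1.0, 1000)            # made-up margins m_n = y_n f(x_n)

exp_loss = np.mean(np.exp(-margins))            # exponential loss (averaged here for comparison)
log_loss = np.mean(np.log1p(np.exp(-margins)))  # logistic loss
cls_err  = np.mean(margins < 0)                 # H(-y f) = 1 exactly when y_n f(x_n) < 0

# Pointwise, 1[m < 0] <= exp(-m): the exponential loss upper-bounds the classification
# error, so driving the loss to zero drives the error to zero.
print(cls_err, log_loss, exp_loss)
assert cls_err <= exp_loss
```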

Deep networks

We define a deep network with K layers and coordinate-wise scalar activation functions σ(z): R → R as the set of functions \(f(W;x)=\sigma ({W}_{K}\sigma ({W}_{K-1}\cdots \sigma ({W}_{1}x)))\), where the input is x ∈ Rd and the weights are given by the matrices Wk, one per layer, with matching dimensions. The symbol W denotes the set of Wk matrices, k = 1, …, K. For simplicity, we consider here the case of binary classification in which f takes scalar values, implying that the last layer matrix is \({W}_{K}\in {{\bf{R}}}^{1\times {h}_{K-1}}\). As mentioned, the labels are yn ∈ {−1, 1}. The weights of hidden layer l are collected in a matrix of size hl × hl−1. There are no biases apart from the input layer, where the bias is instantiated by one of the input dimensions being a constant. The activation function is the Rectified Linear Unit (ReLU). For ReLU activations, the following important positive one-homogeneity property holds: \(\sigma (z)=\frac{\partial \sigma (z)}{\partial z}z\). For the network, homogeneity implies \(f(W;x)={\prod }_{k=1}^{K}{\rho }_{k}\; f ({V}_{1},\cdots \,,{V}_{K};x)\), where Wk = ρkVk.
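The homogeneity property is straightforward to verify numerically. The sketch below (layer sizes and random weights are purely illustrative) builds a small ReLU network as defined above, normalizes each layer in the Frobenius norm, and checks that \(f(W;x)={\prod }_{k}{\rho }_{k}\,f(V;x)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_net(weights, x):
    # f(W; x) = sigma(W_K sigma(W_{K-1} ... sigma(W_1 x))), ReLU activations, no biases
    h = x
    for W in weights:
        h = np.maximum(W @ h, 0.0)
    return h.item()

d, h1, h2 = 10, 32, 16                         # illustrative layer sizes
W = [rng.standard_normal((h1, d)),
     rng.standard_normal((h2, h1)),
     rng.standard_normal((1, h2))]

rho = [np.linalg.norm(Wk) for Wk in W]         # rho_k = ||W_k|| (Frobenius norm)
V = [Wk / rk for Wk, rk in zip(W, rho)]        # unit-norm matrices V_k

xs = rng.standard_normal((5, d))
lhs = [relu_net(W, x) for x in xs]
rhs = [np.prod(rho) * relu_net(V, x) for x in xs]
print(np.allclose(lhs, rhs))                   # True: f(W; x) = (prod_k rho_k) * f(V; x)
```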

The network is a function f(x) = f(W1, …, WK; x), where x is the input and the weight matrices Wk are the parameters. We define the normalized network \(\tilde{f}\), with weight matrices Vk satisfying ||Vk|| = 1, so that \(f=\rho \tilde{f}\) with \(\rho ={\prod }_{k=1}^{K}{\rho }_{k}\); \(\tilde{{\mathbb{F}}}\) is the associated class of “normalized neural networks” \(\tilde{f}(x)\). It is noteworthy that the definitions of ρk, Vk, and \(\tilde{f}\) all depend on the choice of the norm used in normalization. It is also worth noting that, because of the homogeneity of the ReLU network, \(f(x)=\rho \tilde{f}(x)\) and thus the signs of f and \(\tilde{f}\) are the same.

For simplicity of notation, we consider for each weight matrix Wk the corresponding “vectorized” representation, i.e., the vector obtained by stacking the entries \({W}_{k}^{i,j}\) of layer k; the same convention is used for Vk.

We use the following definitions and properties (for a vector w and the 2-norm), neglecting indices:

  • Define \(\frac{w}{||w||_{2}}=v\); thus \(w=||w||_{2}\,v\) with \(||v||_{2}=1\).

  • The following relations are easy to check (a numerical sketch follows this list):

    1. \(\frac{\partial ||w||_{2}}{\partial w}=v\)

    2. \(S=I-v{v}^{T}=I-\frac{w{w}^{T}}{||w||_{2}^{2}}\)

    3. \(\frac{\partial v}{\partial w}=\frac{S}{||w||_{2}}\)
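These identities can be verified by finite differences; the following sketch (dimension and tolerances are arbitrary) checks the first and third relations against the definition of S in the second.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w = rng.standard_normal(d)
v = w / np.linalg.norm(w)
S = np.eye(d) - np.outer(v, v)          # S = I - v v^T = I - w w^T / ||w||_2^2

eps = 1e-6
grad_norm = np.array([(np.linalg.norm(w + eps * e) - np.linalg.norm(w)) / eps
                      for e in np.eye(d)])
jac_v = np.column_stack([((w + eps * e) / np.linalg.norm(w + eps * e) - v) / eps
                         for e in np.eye(d)])

print(np.allclose(grad_norm, v, atol=1e-4))                   # d||w||_2/dw = v
print(np.allclose(jac_v, S / np.linalg.norm(w), atol=1e-4))   # dv/dw = S/||w||_2
```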

Training by unconstrained gradient descent

Consider the typical training of deep networks. The exponential loss (more generally, an exponential-type loss such as the cross-entropy) is minimized by GD. The gradient dynamics is given by

$${\dot{W}}_{k}^{i,j}=-\frac{\partial L}{\partial {W}_{k}^{i,j}}=\sum _{n=1}^{N}{y}_{n}\left[\frac{\partial f({x}_{n})}{\partial {W}_{k}^{i,j}}\right]{e}^{-{y}_{n}f({x}_{n})}.$$
(2)

Clearly, there is no explicit regularization or norm control in the typical GD dynamics of Eq. (2). Assuming that for t > T0 GD achieves separability, the empirical loss goes to zero and the norms of the weights \({\rho }_{k}=||{W}_{k}||_{2}\) grow to infinity ∀k. For classification, however, only the normalized network outputs matter because of the softmax operation.
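The divergence of the norm together with the convergence of the direction can already be seen in a one-layer (linear) toy example; the sketch below (synthetic separable data, illustrative step size and iteration count) runs the unconstrained dynamics of Eq. (2) for the exponential loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))        # linearly separable toy labels

w = 0.01 * rng.standard_normal(d)
eta, v_prev = 0.01, None
for t in range(1, 200001):
    margins = y * (X @ w)
    grad = -(X.T @ (y * np.exp(-margins)))     # gradient of L(w) = sum_n exp(-y_n w.x_n)
    w -= eta * grad
    if t % 50000 == 0:
        v = w / np.linalg.norm(w)
        drift = None if v_prev is None else np.linalg.norm(v - v_prev)
        print(t, np.linalg.norm(w), drift)
        v_prev = v
# ||w|| keeps growing (roughly like log t) while the direction v = w/||w|| settles down
```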

Training by constrained gradient descent

Let us contrast the typical GD above with a classical approach that uses complexity control. In this case the goal is to minimize \(L({f}_{W})={\sum }_{n=1}^{N}{e}^{-{f}_{W}({x}_{n}){y}_{n}}={\sum }_{n=1}^{N}{e}^{-\rho \tilde{f}({x}_{n}){y}_{n}}\), with ρ = ∏ρk, subject to \(||{V}_{k}||_{p}^{p}=1\,\forall k\), that is, under a unit norm constraint for the weight matrix at each layer (if p = 2, then \({\sum }_{i,j}{({V}_{k})}_{i,j}^{2}=1\) is the Frobenius norm). It is noteworthy that the relevant quantities are not f(x) and the associated Wk, but rather the normalized network \(\tilde{f}(x)\) and the associated Vk, as \(\tilde{f}\) is what matters for classification and other key properties such as the margin depend on it.

In terms of ρk and Vk, the unconstrained gradient of L gives

$$\dot{{\rho }_{k}}=-{V}_{k}^{T}\frac{\partial L}{\partial {W}_{k}}\quad \dot{{V}_{k}}=-{\rho }_{k}\frac{\partial L}{\partial {W}_{k}}$$
(3)

as \(\dot{{\rho }_{k}}=-\frac{\partial L}{\partial {\rho }_{k}}=-\frac{\partial {W}_{k}}{\partial {\rho }_{k}}\frac{\partial L}{\partial {W}_{k}}\) and \(\dot{{V}_{k}}=-\frac{\partial L}{\partial {V}_{k}}=-\frac{\partial {W}_{k}}{\partial {V}_{k}}\frac{\partial L}{\partial {W}_{k}}\).

There are several ways to enforce the unit norm constraint on the dynamics of Vk. The most obvious one uses Lagrange multipliers. We use an equivalent technique, also equivalent to the natural gradient, called the tangent gradient transformation4, which maps a gradient increment g(t) into Sg(t). For a unit L2 norm constraint, the appropriate projector is \(S=I-\frac{u{u}^{T}}{||u||_{2}^{2}}\). According to theorem 1 in ref. 4, the dynamical system \(\dot{u}=Sg\) with ||u(0)||2 = 1 describes the flow of a vector u that satisfies ||u(t)||2 = 1 for all t ≥ 0. Applying the tangent gradient transformation to \(\frac{\partial L}{\partial {V}_{k}}\) yields

$$\dot{{\rho }_{k}}=-{V}_{k}^{T}\frac{\partial L}{\partial {W}_{k}}\quad \dot{{V}_{k}}=-S{\rho }_{k}\frac{\partial L}{\partial {W}_{k}}$$
(4)

with \({S}_{k}=I-\frac{{V}_{k}{V}_{k}^{T}}{||{V}_{k}||_{2}^{2}}\). It is relatively easy to check (see ref. 5) that the dynamics of Eq. (4) is the same as that of the weight normalization algorithm6, originally defined for each layer in terms of \(w=g\frac{v}{||v||}\), as

$$\dot{g}=-\frac{{v}^{T}}{||v||}\frac{\partial L}{\partial w},\quad \dot{v}=-\frac{g}{||v||}S\frac{\partial L}{\partial w}$$
(5)

with \(S=I-\frac{v{v}^{T}}{||v||^{2}}\). The reason Eqs. (4) and (5) are equivalent is that ||v||2 in Eq. (5) can be shown to be constant in time5. Weight normalization and the closely related batch normalization technique are in common use for training deep networks. Empirically, they behave similarly to unconstrained GD, with some advantages especially for very deep networks. Our derivation, however, seems to suggest that they could be different, as weight normalization enforces an explicit, although so far unrecognized, unit norm constraint (on the Vk dynamics), which unconstrained GD (Eq. (2)) seems not to enforce.
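A minimal one-layer sketch of the weight-normalization dynamics, written as a plain Euler discretization of descent in the (g, v) variables on synthetic separable data (sizes, initialization, and step size are invented), illustrates both points: the projector S keeps ||v|| essentially at its initial value, while g, which plays the role of ρ, keeps growing.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))              # separable toy data

v = X.T @ y / n                 # initial direction correlated with the labels (for a clean run)
g = 1.0                         # weight normalization: w = g * v / ||v||
v0_norm = np.linalg.norm(v)
eta = 0.002
for _ in range(50000):
    w = g * v / np.linalg.norm(v)
    dLdw = -(X.T @ (y * np.exp(-y * (X @ w))))       # dL/dw for L(w) = sum_n exp(-y_n w.x_n)
    S = np.eye(d) - np.outer(v, v) / (v @ v)
    g -= eta * (v @ dLdw) / np.linalg.norm(v)        # descent step in g
    v -= eta * (g / np.linalg.norm(v)) * (S @ dLdw)  # descent step in v: tangent direction only

# ||v|| is conserved exactly by the continuous flow; any tiny change below is Euler drift
print(v0_norm, np.linalg.norm(v))
print(g)                        # g plays the role of rho and keeps growing after separation
```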

Implicit complexity control

The first step in solving the puzzle is to reparametrize Eq. (2) in terms of ρk and Vk, with \({W}_{k}^{i,j}={\rho }_{k}{V}_{k}^{i,j}\) and ||Vk||2 = 1. Following the chain rule for the time derivatives, the dynamics for Wk, Eq. (2), is identical to the following dynamics for ρk = ||Wk||2 and Vk (a numerical check of this change of variables is given at the end of this subsection):

$$\dot{{\rho }_{k}}={V}_{k}^{T}\dot{{W}_{k}}\quad \dot{{V}_{k}}=\frac{{S}_{k}}{{\rho }_{k}}\dot{{W}_{k}}$$
(6)

where \({S}_{k}=I-{V}_{k}{V}_{k}^{T}\) emerges this time from the change of variables, as \(\frac{\partial {V}_{k}}{\partial {W}_{k}}=\frac{{S}_{k}}{{\rho }_{k}}\). Inspection of the equation for \(\dot{{V}_{k}}\) shows that there is a unit constraint on the L2 norm of Vk because of the presence of Sk: in fact, a tangent gradient transformation on \(\dot{{V}_{k}}\) would not change the dynamics, as Sk is a projector and \({S}_{k}^{2}={S}_{k}\). Consistent with this conclusion, unconstrained GD has the same critical points for Vk as weight normalization but a somewhat different dynamics: in the one-layer case, weight normalization is

$$\dot{\rho }={v}^{T}\dot{w}\quad \dot{v}=S\rho \dot{w}$$
(7)

which has to be compared with the typical gradient equations in the ρ and v variables, given by

$$\dot{\rho }={v}^{T}\dot{w}\quad \dot{v}=\frac{S}{\rho }\dot{w}.$$
(8)

The two dynamical systems are thus quite similar, differing by a ρ2 factor in the \(\dot{v}\) equations. It is clear that the stationary points of the gradient for the v vectors—that is, the values for which \(\dot{v}=0\)—are the same in both cases, since ρ(t) > 0 for any t > 0.

Importantly, the near equivalence between constrained and unconstrained GD holds only when p = 2 in the unit Lp norm constraint5. In both cases the stationary points (for fixed but usually very large ρ, that is, after a very long time) are the same and are likely to be (local) minima. In the case of deep networks, we expect multiple such minima for any finite ρ.
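The change of variables behind Eq. (6) can be checked directly: for any increment \(\dot{W}_{k}\), the reparametrized velocities of Eq. (6) must reconstruct \(\dot{W}_{k}\) through the product rule. A short numerical check (dimension and increment are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
W = rng.standard_normal(d)              # vectorized weights of one layer
W_dot = rng.standard_normal(d)          # an arbitrary velocity, e.g. -dL/dW from Eq. (2)

rho = np.linalg.norm(W)
V = W / rho
S = np.eye(d) - np.outer(V, V)

rho_dot = V @ W_dot                     # first equation of Eq. (6)
V_dot = (S @ W_dot) / rho               # second equation of Eq. (6)

# product rule: d/dt (rho * V) must reproduce W_dot exactly
print(np.allclose(rho_dot * V + rho * V_dot, W_dot))   # True
```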

Convergence to minimum norm and maximum margin

Consider the flow of Vk in Eq. (6) for fixed ρ. If we assume that for t > T0 the network \(f(V;x)\) separates the data, i.e., \({y}_{n}\,f(V;{x}_{n})\, > \, 0\ \forall n\), then \(\frac{d}{dt}{\rho }_{k}\, > \, 0\), i.e., ρ diverges to ∞ with \({\mathrm{lim}}_{t\to \infty }\dot{\rho }=0\). In the one-layer network case, the dynamics yields \(\rho \approx {\mathrm{log}}\ t\) asymptotically. For deeper networks, the situation is different: Banburski et al.5 show that the product of the layer norms diverges faster than logarithmically, while each individual ρk diverges more slowly than in the one-layer case. They also show that \(\dot{{\rho }_{k}^{2}}\) (the rate of growth of \({\rho }_{k}^{2}\)) is independent of k.

The stationary points at which \(\dot{{V}_{k}}=0\) for fixed ρ, if they exist, satisfy the necessary condition for a minimum, i.e.,

$$\sum {e}^{-{y}_{n}\rho f_V ({x}_{n})}\left(\frac{\partial f_V({x}_{n})}{\partial {V}_{k}^{i,j}}-{V}_{k}^{i,j}f_V({x}_{n})\right)=0$$
(9)

that is, they are critical points of the loss for fixed ρ (since the domain is compact, minima exist). Let us call them \({V}_{k}(\rho )={\rm{arg}}\,{\min }_{||{V}_{k}||=1}{\sum }_{n}{e}^{-{y}_{n}\rho \tilde{f}({x}_{n})}\). Consider now the limit for ρ → ∞:

$$\mathop{\mathrm{lim}}\limits_{\rho \to \infty }\mathop{\min }\limits_{||{V}_{k}||=1}\sum _{n}{e}^{-{y}_{n}\rho \tilde{f}({x}_{n})}=\mathop{\mathrm{lim}}\limits_{\rho \to \infty }\mathop{\min }\limits_{||{V}_{k}||=1}{e}^{-\rho \tilde{f}({x}^{* })},$$
(10)

where x* corresponds to the training data that are support vectors, i.e., that have the smallest margin. This provides a solution Vk(∞) that corresponds to a maximum margin solution, because \({\mathrm{lim}}_{\rho \to \infty }{V}_{k}(\rho )={\rm{arg}}\,{\max }_{||{V}_{k}||=1}{\min }_{n}{y}_{n}\tilde{f}({x}_{n})\) (see also ref. 5). These maximum margin solutions are also minimum norm solutions in terms of the Wk for a fixed margin ≥ 1.
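The way the smallest margin dominates the loss as ρ grows can be seen with a few lines of arithmetic on made-up margins: the “soft minimum” \(-\frac{1}{\rho }{\mathrm{log}}{\sum }_{n}{e}^{-\rho {m}_{n}}\) converges to \({\min }_{n}{m}_{n}\), so minimizing the loss at large ρ amounts to maximizing the minimum margin.

```python
import numpy as np

margins = np.array([0.7, 1.3, 0.9, 2.0, 0.71])     # made-up margins m_n = y_n * f~(x_n)
for rho in [1, 10, 100, 1000]:
    soft_min = -np.log(np.sum(np.exp(-rho * margins))) / rho
    print(rho, soft_min)
# as rho grows, -(1/rho) log sum_n exp(-rho m_n) -> min_n m_n = 0.7: the loss is dominated
# by the points of smallest margin (the support vectors), so minimizing it at large rho
# is the same as maximizing the minimum margin over the unit-norm weights
```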

In summary, the dynamics of Vk for each fixed ρ converges to a critical point which is likely to be a minimum on the boundary of the ρ ball. For ρ → ∞, the stationary points of the full dynamical system Eq. (6) are reached. These points are degenerate equilibria5.

Discussion

The main conclusion of this analysis is that unconstrained GD in Wk, followed by normalization of Wk, is equivalent to imposing an L2 unit norm constraint on \({V}_{k}=\frac{{W}_{k}(t)}{||{W}_{k}(t)||}\) during GD. For binary and multiclass classification, only the normalized weights are needed. To provide some intuition, recall that GD is steepest descent wrt the L2 norm and that the steepest direction of the gradient depends on the norm. The fact that the direction of the weights converges to stationary points of the gradient under a constraint is the origin of the hidden complexity control, described as the implicit bias of GD by Srebro and colleagues7, who first showed its effect in the special case of linear networks under the exponential loss. In addition to the approach summarized here, another elegant theory8,9,10, leading to several of the same results, has been developed around the notion of margin, which is closely related, as in the case of support vector machines, to minimization of an exponential-type loss function under a norm constraint.

In summary, there is an implicit regularization in deep nonlinear networks trained on exponential-type loss functions, originating in the GD technique used for optimization. The solutions are in fact the same as those obtained by regularized optimization in the limit of vanishing regularization. This is thus similar to, but more robust than, the classical implicit regularization induced by iterative GD on linear networks under the square loss with appropriate initial conditions. In our case, the maximum margin solutions are independent of the initial conditions and do not require linearity of the network. The specific solutions, however, are not unique, unlike in the linear case: they depend on the trajectory of the gradient flow, each corresponding to one of multiple minima of the loss, each one being a margin maximizer. In general, each solution will show a different test performance. Characterizing the conditions that lead to the best among the margin maximizers is an important open problem.

The classical analysis of ERM algorithms studies their asymptotic behavior for the number of data n going to infinity. In this limiting classical regime, n > D, where D is the fixed number of weights; consistency (informally, the expected error of the empirical minimizer converges to the best in the class) and generalization (the empirical error of the minimizer converges to the expected error of the minimizer) are equivalent. The capacity control described in this paper implies that there is asymptotic generalization and consistency in deep networks in the classical regime (see Fig. 1). However, as we mentioned, it has been shown in the case of kernel regression that there are situations in which there is simultaneously interpolation of the training data and good expected error. This is a modern regime in which D > n but \(\gamma =\frac{D}{n}\) is constant. In the linear case, it corresponds to the λ → 0 limit of regularization, i.e., the pseudoinverse. It is likely that deep networks have a similar regime, in which case the implicit regularization described here is an important prerequisite for a full explanation, as is the case for kernel machines under the square loss. In fact, the maximum margin solutions we characterize here are equivalent to minimum norm solutions (for margin equal to 1, see ref. 5). Minimum norm of the weight matrices implies minimum uniform stability and thus suggests minimum expected error (see ref. 11). This argument would explain why deep networks trained with exponential losses predict well and why the classification error does not increase with overparametrization (see Fig. 2). It would also explain, in the case of kernel methods and square-loss regression, why the pseudoinverse solution provides good expected error and at the same time perfect interpolation of the training set12,13, with a data-dependent double-descent behavior.

Fig. 1: Classical generalization and consistency in deep networks.

a Unnormalized cross-entropy loss in CIFAR-10 for randomly labeled data. b Cross-entropy loss for the normalized network for randomly labeled data. c Generalization cross-entropy loss (difference between training and testing loss) for the normalized network for randomly labeled data as a function of the number of data N. The generalization loss converges to zero as a function of N but very slowly.

Fig. 2: No overfitting in deep networks.

Empirical and expected error in CIFAR-10 as a function of number of neurons in a 5-layer convolutional network. The expected classification error does not increase when increasing the number of parameters beyond the size of the training set.

As we mentioned, capacity control follows from the convergence of the normalized weights during GD to a regularized solution with vanishing λ. It is very likely that the same result we obtained for GD also holds for SGD in deep networks (the equivalence holds for linear networks14). The convergence of SGD usually follows from the convergence of GD, although the rates may differ. The Robbins–Siegmund theorem is a tool to establish almost sure convergence under surprisingly mild conditions.

It is not clear whether a similar effective regularization should also hold for deep networks with more than two layers trained with square loss. In fact, we have not been able to find a mechanism that could lead to a similarly robust regularization. Therefore, we conjecture that deep networks trained with square loss (with more than two layers and not reducible to kernels) do not converge to minimum norm solutions, unlike the same networks trained on exponential-type losses.

Interestingly, the theoretical observations we have described suggest that the dynamics of ρ may be controlled independently of GD on the Vk, possibly leading to faster and better algorithms for training deep networks. A hint of this possibility is given by an analysis for linear networks (see ref. 5) of the dynamics of weight normalization (Eq. (7)) vs. the dynamics of the unconstrained gradient (Eq. (8)). Under the same simplified assumptions on the training data, the weight normalization dynamics converges much faster (as \(\frac{1}{{t}^{1/2}{\mathrm{log}}\ t}\)) than the typical dynamics, which converges to the stationary point as \(\frac{1}{{\mathrm{log}}\ t}\). This prediction was verified with simulations. Together with the observation that the ρ(t) associated with Eq. (8) is monotonic in t after separability is reached, it suggests exploring a family of algorithms that consist of an independently driven forcing term ρ(t) coupled with the Vk equation from Eq. (8):

$$\dot{{V}_{k}}=\frac{\rho }{{\rho }_{k}^{2}}\sum _{n=1}^{N}{e}^{-\rho f_V({x}_{n})}\left(\frac{\partial f_V({x}_{n})}{\partial {V}_{k}}- V_{k}\;f_V({x}_{n})\right).$$
(11)

The open question is, of course, what the optimal ρ(t) schedule is for converging to the margin maximizer that is best in terms of expected error.
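As a purely illustrative sketch of such an algorithm, the code below integrates the one-layer form of Eq. (11) (where \(\rho /{\rho }_{k}^{2}=1/\rho\)) with an invented, externally driven schedule ρ(t); the labels yn are written out explicitly, and the data, schedule, and step sizes are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))       # separable toy data

v = rng.standard_normal(d)
v /= np.linalg.norm(v)

def rho_schedule(t):                          # hypothetical externally driven forcing term
    return 1.0 + 2.0 * np.log(1.0 + t)        # any increasing schedule could be tried here

dt = 1e-3
for step in range(1, 50001):
    rho = rho_schedule(step * dt)
    m = y * (X @ v)                           # normalized margins (labels y_n written out)
    grad = X.T @ (y * np.exp(-rho * m))       # sum_n y_n exp(-rho y_n v.x_n) x_n
    S = np.eye(d) - np.outer(v, v)
    v += dt * (S @ grad) / rho                # one-layer form of Eq. (11): rho/rho_k^2 = 1/rho
    v /= np.linalg.norm(v)                    # guard against Euler drift off the unit sphere
print(np.min(y * (X @ v)))                    # margin of the normalized solution (positive
                                              # once the data have been separated)
```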