{"title": "Backpropagation-Friendly Eigendecomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 3162, "page_last": 3170, "abstract": "Eigendecomposition (ED) is widely used in deep networks. However, the backpropagation of its results tends to be numerically unstable, whether using ED directly or approximating it with the Power Iteration method, particularly when dealing with large matrices. While this can be mitigated by partitioning the data in small and arbitrary groups, doing so has no theoretical basis and makes its impossible to exploit the power of ED to the full. In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. It can handle large matrices without requiring to split them. We demonstrate the better robustness of our approach over standard ED and PI for ZCA whitening, an alternative to batch normalization, and for PCA denoising, which we introduce as a new normalization strategy for deep networks, aiming to further denoise the network's features.", "full_text": "Backpropagation-Friendly Eigendecomposition\n\nWei Wang1, Zheng Dang2, Yinlin Hu1, Pascal Fua1, and Mathieu Salzmann1\n\n1CVLab, EPFL, CH-1015 Lausanne, Switzerland {first.last}@epfl.ch\n2Xi\u2019an Jiaotong University, China {dangzheng713@stu.xjtu.edu.cn}\n\nAbstract\n\nEigendecomposition (ED) is widely used in deep networks. However, the backprop-\nagation of its results tends to be numerically unstable, whether using ED directly\nor approximating it with the Power Iteration method, particularly when dealing\nwith large matrices. While this can be mitigated by partitioning the data in small\nand arbitrary groups, doing so has no theoretical basis and makes its impossible to\nexploit the power of ED to the full.\nIn this paper, we introduce a numerically stable and differentiable approach to\nleveraging eigenvectors in deep networks. 
It can handle large matrices without requiring them to be split. We demonstrate the better robustness of our approach over standard ED and PI for ZCA whitening, an alternative to batch normalization, and for PCA denoising, which we introduce as a new normalization strategy for deep networks, aiming to further denoise the network's features.

1 Introduction

In recent years, Eigendecomposition (ED) has been incorporated in deep learning algorithms to perform PCA whitening [1], ZCA whitening [2], second-order pooling [3], generative modeling [4, 5], keypoint detection and matching [6-8], pose estimation [9], camera self-calibration [10], and graph matching [11]. This requires backpropagating the loss through the ED, which can be done but is often unstable. This is because, as shown in [3], the partial derivatives of an ED-based loss depend on a matrix K̃ with elements

K̃_ij = { 1/(λi − λj),  if i ≠ j ;  0,  if i = j } ,   (1)

where λi denotes the ith eigenvalue of the matrix being processed. Thus, when two eigenvalues are very close, the partial derivatives become very large, causing arithmetic overflow.

The Power Iteration (PI) method [12] is one way around this problem. In its standard form, PI relies on an iterative procedure to approximate the dominant eigenvector of a matrix starting from an initial estimate of this vector. This has been successfully used for graph matching [11] and spectral normalization in generative models [4], which only require the largest eigenvalue. In theory, PI can be used in conjunction with a deflation procedure [13] to find all eigenvalues and eigenvectors. This involves computing the dominant eigenvector, removing the projection of the input matrix on this vector, and iterating.
Unfortunately, two eigenvalues being close to each other, or one being close to zero, can trigger large round-off errors that eventually accumulate and result in inaccurate eigenvector estimates. The results are also sensitive to the number of iterations and to how the vector is initialized at the start of each deflation step. Finally, the convergence speed decreases significantly when the ratio between the dominant eigenvalue and the others becomes close to one.

In short, both SVD [3] and PI [11] are unsuitable for use in a deep network that requires the computation of gradients to perform back-propagation. This is particularly true when dealing with large matrices, for which the chance of two eigenvalues being almost equal is larger than for small ones.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This is why another popular way to get around these difficulties is to use smaller matrices, for example by splitting the feature channels into smaller groups before computing covariance matrices, as in [2] for ZCA whitening [14, 15]. This, however, imposes arbitrary constraints on learning: they are not theoretically justified and may degrade performance.

In this paper, we therefore introduce a numerically stable and differentiable approach to performing ED within deep networks in such a way that their training is robust. To this end, we leverage the fact that the forward pass of ED is stable and yields accurate eigenvector estimates. As the aforementioned problems come from the backward pass, once the forward pass is complete we rely on PI for backpropagation purposes, leveraging the ED results for initialization.
We will show that our hybrid training procedure consistently outperforms both ED and PI in terms of stability for

• ZCA whitening [2]: an alternative to Batch Normalization [16], which linearly transforms the feature vectors so that their covariance matrix becomes the identity matrix and the features can be considered decoupled.

• PCA denoising [17]: a transformation that reduces the dimensionality of the input features by projecting them onto a subspace spanned by a subset of the eigenvectors of their covariance matrix, and then projects them back to the original space. Here, we introduce PCA denoising as a new normalization strategy for deep networks, aiming to remove the irrelevant signal from their feature maps.

Exploiting the full power of both these techniques requires performing ED on relatively large matrices and back-propagating the results, which our technique allows whereas competing ones tend to fail. The code is available at https://github.com/WeiWangTrento/Power-Iteration-SVD.

2 Numerically Stable Differentiable Eigendecomposition

Given an input matrix M, there are two standard ways to exploit the ED of M within a deep network:
1. Perform ED using SVD or QR decomposition and use analytical gradients for backpropagation.
2. Given randomly-chosen initial guesses for the eigenvectors, run a PI deflation procedure during the forward pass and compute the corresponding gradients for backpropagation purposes.

Unfortunately, as discussed above, the first option is prone to gradient explosion when two or more eigenvalues are close to each other, while the accuracy of the second strongly depends on the initial vectors and on the number of iterations in each step of the deflation procedure. This can be problematic because the PI convergence rate depends geometrically on the ratio |λ2/λ1| of the two largest eigenvalues.
Therefore, when this ratio is close to 1, the eigenvector estimate may be inaccurate when performing a limited number of iterations. This can lead to training divergence and eventual failure, as we will see in the results section.

Our solution is to rely on the following hybrid strategy:
1. Use SVD during the forward pass because, by relying on a divide-and-conquer strategy, it tends to be numerically more stable than QR decomposition [12].
2. Compute the gradients for backpropagation from the PI derivations, but using the SVD-computed vectors for initialization purposes.

In the remainder of this section, we show that the resulting PI gradients not only converge to the analytical ED ones, but are bounded from above by a factor depending on the number of PI iterations, thus preventing their explosion in practice. In the results section, we will empirically confirm this by showing that training an ED-based deep network using our approach consistently converges, whereas the standard SVD and PI algorithms often diverge.

2.1 Power Iteration Gradients

Let M be a covariance matrix, and therefore positive semi-definite and symmetric. We now focus on the leading eigenvector of M. Since the deflation procedure simply iteratively removes the projection of M on its leading eigenvector, the following derivations remain valid at any step of this procedure. To compute the leading eigenvector v of M, PI relies on the iterative update

v^(k) = M v^(k−1) / ‖M v^(k−1)‖ ,   (2)

where ‖·‖ denotes the ℓ2 norm.
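As an illustration, the update of Eq. 2 can be sketched in a few lines of numpy. This is our own minimal sketch, not the authors' released implementation; the helper name `power_iteration` and the toy matrix are ours.

```python
import numpy as np

def power_iteration(M, v0, num_iters=19):
    """Approximate the dominant eigenvector of a symmetric PSD matrix M
    by repeatedly applying the update of Eq. 2: v <- Mv / ||Mv||.
    The default num_iters=19 mirrors the K chosen in Sec. 2.4."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(num_iters):
        Mv = M @ v
        v = Mv / np.linalg.norm(Mv)
    return v

# A covariance-like (symmetric PSD) matrix, with eps*I added as in Sec. 2.3.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 32))
M = A @ A.T + 1e-4 * np.eye(8)
v = power_iteration(M, rng.standard_normal(8), num_iters=300)
lam = v @ M @ v  # Rayleigh quotient estimate of the dominant eigenvalue
```

The number of iterations needed for a given accuracy depends on the eigenvalue ratio, which is exactly the trade-off analyzed in Sec. 2.4.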
The PI gradients can then be computed as [18]

∂L/∂M = Σ_{k=0}^{K−1} [ (I − v^(k+1) v^(k+1)ᵀ) / ‖M v^(k)‖ ] (∂L/∂v^(k+1)) v^(k)ᵀ ,
∂L/∂v^(k) = M [ (I − v^(k+1) v^(k+1)ᵀ) / ‖M v^(k)‖ ] (∂L/∂v^(k+1)) .   (3)

Typically, to initialize PI, v^(0) is taken as a random vector such that ‖v^(0)‖ = 1. Here, however, we rely on SVD to compute the true eigenvector v. Because v is an accurate estimate of the eigenvector, feeding it as initial value in PI will yield v = v^(0) ≈ v^(1) ≈ v^(2) ≈ ··· ≈ v^(k) ≈ ··· ≈ v^(K). Exploiting this in Eq. 3 and introducing the explicit form of ∂L/∂v^(k), k = 1, 2, ..., K, into ∂L/∂M lets us write

∂L/∂M = [ (I − vvᵀ)/‖Mv‖ + M(I − vvᵀ)/‖Mv‖² + ··· + M^(K−1)(I − vvᵀ)/‖Mv‖^K ] (∂L/∂v^(K)) vᵀ .   (4)

The details of this derivation are provided in the supplementary material. In our experiments, Eq. 4 is the form we adopt to compute the ED gradients.

2.2 Relationship between PI and Analytical ED Gradients

We now show that when K goes to infinity, the PI gradients of Eq. 4 are the same as the analytical ED ones.

Power Iteration Gradients Revisited. To reformulate the PI gradients, we rely on the fact that

M^k = V Σ^k Vᵀ = λ1^k v1 v1ᵀ + λ2^k v2 v2ᵀ + ··· + λn^k vn vnᵀ ,   (5)

and that ‖Mv‖ = ‖λv‖ = λ, where v = v1 is the dominant eigenvector and λ = λ1 is the dominant eigenvalue. Introducing Eq. 5 into Eq. 4 lets us re-write the gradient as

∂L/∂M = [ (Σ_{i=2}^n vi viᵀ)/λ1 + (Σ_{i=2}^n λi vi viᵀ)/λ1² + ··· + (Σ_{i=2}^n λi^(K−1) vi viᵀ)/λ1^K ] (∂L/∂v^(K)) v1ᵀ
      = [ Σ_{i=2}^n ( 1/λ1 + (1/λ1)(λi/λ1)^1 + ··· + (1/λ1)(λi/λ1)^(K−1) ) vi viᵀ ] (∂L/∂v^(K)) v1ᵀ .   (6)

Eq. 6 defines a geometric progression. Given that 1 − (λi/λ1)^k → 1 when k → ∞, because |λi/λ1| ≤ 1, we have

(1/λ1) ( 1 + (λi/λ1) + ··· + (λi/λ1)^(k−1) ) = (1/λ1) (1 − (λi/λ1)^k) / (1 − λi/λ1) → 1/(λ1 − λi) , when k → ∞.   (7)

Introducing Eq. 7 into Eq. 6 yields

∂L/∂M = [ Σ_{i=2}^n vi viᵀ / (λ1 − λi) ] (∂L/∂v1) v1ᵀ .   (8)

Analytical ED Gradients. As shown in [3], the analytic form of the ED gradients can be written as

∂L/∂M = V ( K̃ᵀ ∘ (Vᵀ ∂L/∂V) + (∂L/∂Σ)_diag ) Vᵀ ,   (9)

where K̃ is given by Eq. 1. By making use of the same properties as before, this can be re-written as

∂L/∂M = [ Σ_{i=2}^n vi viᵀ / (λ1 − λi) ] (∂L/∂v1) v1ᵀ + Σ_i vi viᵀ (∂L/∂λi) ,   (10)

The last term of Eq. 10 can be ignored because λi is not involved in any computation during the forward pass. Instead, we compute the eigenvalue as the Rayleigh quotient vᵀMv/(vᵀv), which only depends on the eigenvector and thus only needs the gradients w.r.t. it. A detailed derivation is provided in the supplementary material.

This shows that the partial derivatives of v1 computed using PI have the same form as the analytical ED ones when k → ∞. Similar derivations can be done for vi, i = 2, 3, .... This justifies our use of PI to approximate the analytical ED gradients during backpropagation. We now turn to showing that the resulting gradient estimates are upper-bounded and can therefore not explode.

Figure 1: (a) shows how the value of (λi/λ1)^k changes w.r.t. the eigenvalue ratio λi/λ1 and the iteration number k. (b) shows the contour of the curved surface in (a).

λi/λ1                      0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.85  0.9  0.95  0.99  0.995  0.999
k = ⌈ln(0.05)/ln(λi/λ1)⌉     2    2    3    4    5    6    9   14    19   29    59   299    598   2995

Table 1: The minimum value of k we need to guarantee (λi/λ1)^k < 0.05.

2.3 Upper Bounding the PI Gradients

Recall that, when the input matrix has two equal eigenvalues, λ1 = λi, the gradients computed from Eq. 10 go to ±∞, as the denominator is 0. However, as Eq. 6 can be viewed as the truncated geometric series expansion of Eq. 10, it provides an upper bound on the gradient magnitude. Specifically, we can write

∂L/∂M ≈ [ Σ_{i=2}^n (1/λ1)( 1 + q^1 + ··· + q^(K−1) ) vi viᵀ ] (∂L/∂v1) v1ᵀ ,   (11)

where q = λi/λ1. Therefore,

‖∂L/∂M‖ ≤ Σ_{i=2}^n (K/λ1) ‖vi viᵀ‖ ‖∂L/∂v1‖ ‖v1ᵀ‖ ≤ (nK/λ1) ‖∂L/∂v1‖ .   (12)

This yields an upper bound of (nK/λ1) ‖∂L/∂v1‖. However, if λ1 = 0, this upper bound also becomes ∞. To avoid this, knowing that M is symmetric positive semi-definite, we modify it as M = M + εI, where I is the identity matrix. This guarantees that the eigenvalues of M + εI are greater than or equal to ε. In practice, we set ε = 10⁻⁴. Thus, we can write

‖∂L/∂(M + εI)‖ ≤ (nK/ε) ‖∂L/∂v1‖ ,   (13)

where n is the dimension of the matrix and K is the power iteration number, which means that choosing a specific value of K amounts to choosing an upper bound for the gradient magnitudes. We now provide an empirical approach to doing so.

2.4 Choosing an Appropriate Power Iteration Number K

Recall from Eqs. 7 and 8 that, for the PI gradients to provide a good estimate of the analytical ones, we need (λi/λ1)^k to go to zero. Fig. 1 shows how the value of (λi/λ1)^k evolves for different power iteration numbers k and ratios λi/λ1.
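The minimum number of iterations needed to drive (λi/λ1)^k below a tolerance (0.05 in Table 1) can be computed directly from the ⌈ln(0.05)/ln(λi/λ1)⌉ rule; a small sketch (the helper name is ours):

```python
import math

def min_power_iterations(ratio, tol=0.05):
    """Smallest k such that ratio**k <= tol, i.e. k = ceil(ln(tol)/ln(ratio))."""
    assert 0.0 < ratio < 1.0
    return math.ceil(math.log(tol) / math.log(ratio))

# Reproduces the entries of Table 1, including the K = 19 choice for ratios ~0.85.
for r in (0.5, 0.85, 0.9, 0.999):
    print(r, min_power_iterations(r))
# prints: 0.5 5 / 0.85 19 / 0.9 29 / 0.999 2995
```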
This suggests the need to select an appropriate k for each λi/λ1. To this end, we assume that (λi/λ1)^k ≤ 0.05 is a good approximation to (λi/λ1)^k = 0. Then we have

(λi/λ1)^k ≤ 0.05 ⇔ k ln(λi/λ1) ≤ ln(0.05) ⇔ k ≥ ln(0.05)/ln(λi/λ1) .   (14)

That is, the minimum value of k satisfying (λi/λ1)^k ≤ 0.05 is k = ⌈ln(0.05)/ln(λi/λ1)⌉.

Table 1 shows the minimum number of iterations required to guarantee that our assumption holds for different values of λi/λ1. Note that when two eigenvalues are very close to each other, e.g., λi/λ1 = 0.999, we need about 3000 iterations to achieve a good approximation. However, this is rare in practice. In CIFAR-10, using ResNet18, we observed that the average λi/λi−1 lies in the [0.84, 0.86] interval for the mini-batches. Given the values shown in Table 1, we therefore set the power iteration number to K = 19, which yields a good approximation in all these cases.

2.5 Practical Considerations

In practice, we found that other numerical imprecisions due to the computing platform itself, such as the use of single vs. double precision, could further affect training with eigendecomposition. Here, we discuss our practical approach to dealing with these issues.

The first issue we observed was that, in the forward pass, the eigenvalues computed using SVD may be inaccurate, with SVD sometimes crashing when the matrix is ill-conditioned. Recall that, to increase stability, we add a small value ε to the diagonal of the input matrix M. As a consequence, all the eigenvalues should be greater than or equal to ε. However, this is not the case when we use float instead of double precision. To solve this problem, we employ truncated SVD.
That is, we discard the eigenvectors whose eigenvalue λi ≤ ε, along with all subsequent ones.

The second issue is related to the precision of the eigenvalues computed as λ̃i = viᵀ M̃ vi / (viᵀ vi). Because of the round-off error in the deflation step M̃ = M̃ − M̃ vi viᵀ, λ̃i may be inaccurate and sometimes negative. To avoid using incorrect eigenvalues, we also truncate in this case.

These two practical issues correspond to the breaking conditions λi ≤ ε and |λ̃i − λi|/λi ≥ 0.1 defined in Alg. 1, which provides the pseudo-code for our ZCA whitening application. Furthermore, we add another constraint, γi ≥ (1 − 0.0001), implying that we also truncate the eigendecomposition if the remaining energy in M̃ is less than 0.0001.

Note that this last condition can be modified by using a different threshold. To illustrate this, we therefore also perform experiments with a higher threshold, thus leading to our PCA denoising layer, which aims to discard the noise in the network features.
As shown in our experiments, our PCA denoising layer achieves competitive performance compared with the ZCA one.

Algorithm 1: Forward Pass of ZCA Whitening in Practice.
Data: μ = Avg(X), X̃ = X − μ, M = X̃X̃ᵀ + εI, (X ∈ R^{c×n})
Input: Eμ ← 0, ES ← I, M̃ ← M, rank ← 1, momentum ← 0.1
Result: M = VΛVᵀ via SVD(M); Λ = diag(λ1, ..., λn); V = [v1, ..., vn]; γi = Σ_{k=1}^{i} λk / Σ_{k=1}^{n} λk
1 for i = 1 : n do
2   vi = PowerIteration(M̃, vi); λ̃i = viᵀ M̃ vi / (viᵀ vi); M̃ = M̃ − M̃ vi viᵀ;
3   if λi ≤ ε or |λ̃i − λi|/λi ≥ 0.1 or γi ≥ (1 − 0.0001) then
4     rank = i; Λ̃ = [λ̃1, ..., λ̃i];
5     break;
6   end
7 end
8 truncate eigenvector matrix: Ṽ ← [v1, ..., v_rank];
9 compute subspace: S ← Ṽ (Λ̃)^(−1/2) Ṽᵀ;
10 compute ZCA transformation: X̄ ← S X̃;
11 update running mean: Eμ = momentum · μ + (1 − momentum) · Eμ;
12 update running subspace: ES = momentum · S + (1 − momentum) · ES;
Output: compute affine transformation: X = γ X̄ + β

In Alg. 1, the operation vi = PowerIteration(M̃, vi) involves no computation during the forward pass. It only serves to save M̃ and vi, which will be used to compute the gradients during the backward pass. Furthermore, compared with standard batch normalization [16], we replace the running variance by a running subspace ES but keep the learnable parameters γ, β.
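For concreteness, the forward computation of Alg. 1 can be sketched in numpy. This is our own simplified sketch, not the authors' released code: it uses exact ED in the forward direction and truncates by the eigenvalue and energy criteria, but omits the Rayleigh-quotient check, the running statistics, the affine parameters, and the PI bookkeeping for the backward pass.

```python
import numpy as np

def zca_forward(X, eps=1e-4, energy=1 - 1e-4):
    """Simplified ZCA forward pass: whiten X (c x n) with a truncated
    eigendecomposition of M = Xc Xc^T + eps*I, as in Alg. 1."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    M = Xc @ Xc.T + eps * np.eye(X.shape[0])  # PSD by construction
    lam, V = np.linalg.eigh(M)                # exact ED in the forward pass
    lam, V = lam[::-1], V[:, ::-1]            # sort eigenvalues descending
    gamma = np.cumsum(lam) / lam.sum()        # cumulative energy gamma_i
    rank = len(lam)
    for i in range(len(lam)):                 # truncation conditions of Alg. 1
        if lam[i] <= eps or gamma[i] >= energy:
            rank = i + 1
            break
    S = V[:, :rank] @ np.diag(lam[:rank] ** -0.5) @ V[:, :rank].T
    return S @ Xc, S

X = np.random.default_rng(1).standard_normal((16, 256))
Xw, S = zca_forward(X)  # covariance of Xw is (approximately) the identity
```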
The running mean Eμ and running subspace ES are used in the testing phase.

3 Experiments

We now demonstrate the effectiveness of our eigendecomposition layer by using it to perform ZCA whitening and PCA denoising. ZCA whitening has been shown to improve classification performance over batch normalization [2]. We will demonstrate that we can deliver a further boost by making it possible to handle larger covariance matrices within the network. PCA denoising has been used for image denoising [17] but not in a deep learning context, presumably because it requires performing ED on large matrices. We will show that our approach solves this problem.

Methods              d = 4       d = 8       d = 16      d = 32      d = 64
SVD   Min            4.59        -           -           -           -
      Mean           4.54±0.08   -           -           -           -
      Success Rate   46.7%       0%          0%          0%          0%
PI    Min            4.44        6.28        -           -           -
      Mean           4.99±0.51   -           -           -           -
      Success Rate   100%        6.7%        0%          0%          0%
Ours  Min            4.59        4.43        4.40        4.46        4.44
      Mean           4.71±0.11   4.62±0.18   4.63±0.14   4.64±0.15   4.59±0.09
      Success Rate   100%        100%        100%        100%        100%

Table 2: Errors and success rates using ResNet18 with standard SVD, Power Iteration (PI), and our method on CIFAR10. d is the size of the feature groups we process individually.

Methods          BN          d = 4       d = 8       d = 16      d = 32      d = 64
ResNet18  Min    21.68       21.03       21.15       21.14       21.36       21.04
          Mean   21.85±0.14  21.39±0.23  21.58±0.27  21.45±0.25  21.56±0.35  21.51±0.28
ResNet50  Min    20.66       19.28       19.24       19.78       20.15       20.79
          Mean   21.62±0.65  19.94±0.44  19.54±0.23  19.92±0.12  20.59±0.58  20.98±0.31

Table 3: Error rates using ResNet18 or ResNet50 with Batch Normalization (BN) or our method on CIFAR100. d is the size of the feature groups we process individually.

In all the experiments below, we use either ResNet18 or ResNet50 [19] as our backbone.
We retain their original architectures but introduce an additional layer between the first convolutional layer and the first pooling layer. For both ZCA and PCA, the new layer computes the covariance matrix of the feature vectors, eigendecomposes it, and uses the eigenvalues and eigenvectors as described below. As discussed in Section 2.4, we use K = 19 power iterations when backpropagating the gradients unless otherwise specified. To accommodate this additional processing, we change the stride s and kernel sizes in the subsequent blocks to Conv1(3×3, s=1) -> Block1(s=1) -> Block2(s=2) -> Block3(s=2) -> Block4(s=2) -> AvgPool(4×4) -> FC.

3.1 ZCA Whitening

The ZCA whitening algorithm takes as input a d × n matrix X, where d represents the feature dimensionality and n the number of samples. It relies on eigenvalues and eigenvectors to compute a d × d matrix S such that the covariance matrix of SX is the d × d identity matrix, meaning that the transformed features are decorrelated. ZCA has shown great potential to boost the classification accuracy of deep networks [2], but only when d can be kept small to prevent the training procedure from diverging. To this end, the c output channels of a convolutional layer are partitioned into G groups so that each one contains only d = c/G features. ZCA whitening is then performed within each group independently. This can be understood as a block-diagonal approximation of the complete ZCA whitening. In [2], d is taken to be 3, and the resulting 3×3 covariance matrices are then less likely to have similar or zero eigenvalues. The pseudo-code for ZCA whitening is given in Alg. 1.

We first use CIFAR-10 [20] to compare the behavior of our approach with that of standard SVD and PI for different numbers of groups G, corresponding to different matrix dimensions d. Because of numerical instability, the training process of these methods often crashes.
In short, we observed that
1. For SVD, when G = 16, d = 4, 8 out of 15 trials failed; when G ≤ 8, d ≥ 8, the algorithm failed every time.
2. For PI, when G = 16, d = 4, all the trials succeeded; when G = 8, d = 8, only 1 out of 15 trials succeeded; when G ≤ 4, d ≥ 16, the algorithm failed every time.
3. For our algorithm, we never saw a training failure, independently of the matrix dimension, ranging from d = 4 to d = 64.

Figure 2: Training loss as a function of the number of epochs for d = 8 on the left and d = 4 on the right. In both cases, the loss of the standard PI method is very unstable while our method is very stable.

Figure 3: Performance of our PCA denoising layer as a function of (a) the percentage of preserved information, (b) the number e of eigenvectors retained, (c) the number of power iterations.

In Table 2, we report the mean classification error rates and their standard deviations over the trials in which they succeeded, along with their success rates. For d = 4, SVD unsurprisingly delivers the best accuracy because using the analytical derivatives is the best one can do when possible. However, even for such a small d, it succeeds in only 46.7% of the trials and systematically fails for larger values of d. PI can handle d = 8, but not consistently, as it succeeds only 6.7% of the time. Our approach succeeds for all values of d up to at least 64 and delivers the smallest error rate for d = 16, which confirms that increasing the size of the groups can boost performance.

We report equivalent results in Table 3 on CIFAR-100 using either ResNet18 or ResNet50 as the backbone. Our approach again systematically delivers a result. Being able to boost d to 64 for ResNet18 and to 32 for ResNet50 allows us to outperform batch normalization in terms of mean error in both cases.

In Fig. 2, we plot the training losses when using either PI or our method as a function of the number of epochs on both datasets.
Note how much smoother ours are.

3.2 PCA Denoising

We now turn to PCA denoising, which computes the c × c covariance matrix M of a c × n matrix X, finds the eigenvectors of M, projects X onto the subspace spanned by the e eigenvectors associated with the e largest eigenvalues, with e < n, and projects the result back to the original space. Before doing PCA, the matrix X needs to be standardized by removing the mean and scaling the data by its standard deviation, i.e., X = (X − μ)/σ.

During training, e can either be taken as a fixed number, or set dynamically for each individual batch. In the latter case, a standard heuristic is to choose it so that γe = Σ_{i=1}^{e} λi / Σ_{i=1}^{n} λi ≥ 0.95, where the λi are the eigenvalues, meaning that 95% of the variance present in the original data is preserved. We will discuss both approaches below. As before, we run our training scheme 5 times for each setting of the parameters and report the mean classification accuracy and its variance on CIFAR-10.

Percentage of Preserved Information vs Performance. We first set e to preserve a fixed percentage of the variance in each batch, as described above. In practice, this means that e is always much smaller than the channel number c. For instance, after the first convolutional layer, which has 64 channels, we observed that e remains smaller than 7 most of the time when preserving 85% of the variance, 15 when preserving 99%, and 31 when preserving 99.5%.
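The PCA denoising transformation described above can be sketched as follows. This is our own illustration, assuming the dynamic choice of e via the γe ≥ keep heuristic; the helper name `pca_denoise` and the `keep` parameter are ours.

```python
import numpy as np

def pca_denoise(X, keep=0.95):
    """Sketch of PCA denoising: standardize X (c x n), project onto the top-e
    eigenvectors of its covariance matrix (e chosen so that a fraction `keep`
    of the variance is preserved), and project back to the original space."""
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True) + 1e-8
    Xs = (X - mu) / sigma                      # standardization step
    M = Xs @ Xs.T / Xs.shape[1]                # c x c covariance matrix
    lam, V = np.linalg.eigh(M)
    lam, V = lam[::-1], V[:, ::-1]             # sort eigenvalues descending
    e = int(np.searchsorted(np.cumsum(lam) / lam.sum(), keep)) + 1
    P = V[:, :e] @ V[:, :e].T                  # projector onto retained subspace
    return P @ Xs, e

# A low-rank signal buried in noise: most variance lives in a 3-D subspace.
rng = np.random.default_rng(2)
X = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 500)) * 10 \
    + rng.standard_normal((16, 500))
Xd, e = pca_denoise(X, keep=0.95)
```

Because projection cannot increase energy, the denoised features never have a larger norm than the standardized input, and e stays well below c when the signal is concentrated in a few directions.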
As can be seen in Fig. 3(a) and Table 4(a), retaining less than 90% of the variance hurts performance, and the best results are obtained for 99%, with a mean error of 4.63. Our PCA denoising layer then outperforms Batch Normalization (BN), whose mean error is 4.81, as shown in Table 5.

(a) Preserved Information
Percentage (%)   Error (%)
85               5.60 ± 0.74
90               5.05 ± 0.07
95               4.67 ± 0.06
99               4.63 ± 0.11
99.5             4.72 ± 0.10
100              4.81 ± 0.19

(b) Eigenvector No.
Eigenvector No.  Error (%)
2                5.09 ± 0.12
4                4.82 ± 0.07
8                4.64 ± 0.18
16               4.63 ± 0.17
32               4.65 ± 0.08

(c) Power Iteration No.
Power Iteration No.  Error (%)
1                    4.72 ± 0.10
2                    4.63 ± 0.09
4                    4.67 ± 0.09
9                    4.61 ± 0.14
14                   4.69 ± 0.11
19                   4.63 ± 0.11
29                   4.61 ± 0.03

Table 4: Performance of the PCA denoising layer vs preserved information, eigenvector number, and PI number.

Methods      Error   Value
BN           Min     4.66
             Mean    4.81±0.19
PCA(PI)      Min     5.05
             Mean    5.35±0.25
PCA(SVD)     Min     NaN
             Mean    NaN
PCA(Ours)    Min     4.58
             Mean    4.63±0.11

Table 5: Final error rate for all four methods we tried.

Figure 4: Training loss as a function of the number of epochs.

Number of Eigenvectors vs Performance. We now set e to a fixed value. As shown in Fig. 3(b) and Table 4(b), the performance is stable for 8 ≤ e ≤ 32, and the optimum is reached for e = 16, which preserves approximately 99.9% of the variance on average. Note that the average accuracy is the same as in the previous scenario, but its standard deviation is a bit larger.

Number of Power Iterations vs Performance.
As preserving 99.9% of the information yields good results, we now dynamically set e accordingly, and report errors as a function of the number of power iterations during backpropagation in Table 4(c). Note that for our PCA denoising layer, 2 power iterations are enough to deliver an error rate that is very close to the best one obtained after 29 iterations, that is, 4.63% vs 4.61%. In this case, our method only incurs a small time penalty at run-time. In practice, on a single Titan XP GPU server, for one minibatch of size 128 with ResNet18 as the backbone, 2 power iterations take 104.8 ms vs 82.7 ms for batch normalization. Note that we implemented our method in PyTorch [21], with the backward pass written in Python, which leaves room for improvement.

Training Stability. Fig. 4 depicts the training curves for our method, PI, and standard BN. The loss values for our method and BN decrease smoothly, which indicates that the training process is very stable. By contrast, the PI training curve exhibits far more instability and ultimately lower performance. In Table 5, we report the final error rates, where the NaN values associated with SVD denote the fact that using the analytical SVD derivatives failed all five times we tried.

4 Discussion & Conclusion

In this paper, we have introduced a numerically stable differentiable eigendecomposition method that relies on the SVD during the forward pass and on Power Iterations to compute the gradients during the backward pass. Both the theory and the experimental results confirm the increased stability that our method brings compared with standard SVD or Power Iterations alone.
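For reference, the textbook Power Iteration with deflation [13] that underlies the backward pass can be sketched as follows. This is an illustrative snippet with an assumed deterministic starting vector, not the exact computation of Alg. 1:

```python
import numpy as np

def power_iteration(M, k, n_iters=100):
    """Top-k eigenpairs of a symmetric PSD matrix M via Power Iteration
    with deflation. Textbook illustration, not the paper's Algorithm 1."""
    M = M.copy()
    eigvals, eigvecs = [], []
    for _ in range(k):
        # Deterministic start; assumed not orthogonal to the dominant eigenvector.
        v = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
        for _ in range(n_iters):
            v = M @ v
            v /= np.linalg.norm(v)
        lam = v @ M @ v                 # Rayleigh quotient estimate
        eigvals.append(lam)
        eigvecs.append(v)
        M -= lam * np.outer(v, v)       # deflate the component just found
    return np.array(eigvals), np.stack(eigvecs, axis=1)
```

Convergence is governed by the ratio of consecutive eigenvalues, which is exactly why close eigenvalues make both PI and the analytical ED gradients fragile.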
In addition to performing ZCA more effectively than before, this has enabled us to introduce a PCA denoising layer, which has proven effective at improving performance on classification tasks.

The main limitation of our method is that its accuracy depends on the accuracy of the SVD in the forward pass, which is not always perfect. Besides, the truncation operation in Alg. 1 prevents our method from computing the gradients for the full-rank eigenvectors. Moreover, the for-loop in Alg. 1 slows down the gradient computation. In future work, we will therefore explore approaches to address these issues.

References

[1] G. Desjardins, K. Simonyan, R. Pascanu, et al. Natural neural networks. In NIPS, 2015.

[2] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.

[3] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix backpropagation for deep networks with structured layers. In CVPR, 2015.

[4] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.

[5] Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening and coloring batch transform for GANs. In ICLR, 2018.

[6] S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi. Discovery of latent 3D keypoints via end-to-end geometric reasoning. In NIPS, 2018.

[7] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to find good correspondences. In CVPR, 2018.

[8] R. Ranftl and V. Koltun. Deep fundamental matrix estimation. In ECCV, 2018.

[9] Zheng Dang, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, and Mathieu Salzmann. Eigendecomposition-free training of deep networks with zero eigenvalue-based losses. In ECCV, 2018.

[10] T. Papadopoulo and M. Lourakis. Estimating the Jacobian of the singular value decomposition: Theory and applications. In ECCV, pages 554–570, 2000.

[11] Andrei Zanfir and Cristian Sminchisescu. Deep learning of graph matching. In CVPR, 2018.

[12] Yuji Nakatsukasa and Nicholas J. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM Journal on Scientific Computing, 35(3):A1325–A1349, 2013.

[13] R. Burden and J. Faires. Numerical Analysis. Cengage, 1989.

[14] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.

[15] Anthony J. Bell and Terrence J. Sejnowski. The "independent components" of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.

[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[17] Y. Babu, M. Subramanyam, and M. Prasad. PCA based image denoising. Signal & Image Processing, 3(2), 2012.

[18] Mang Ye, Andy J. Ma, Liang Zheng, Jiawei Li, and Pong C. Yuen. Dynamic label graph matching for unsupervised video re-identification. In ICCV, 2017.

[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[20] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.