Volume 1, Issue 3
Beyond the Quadratic Approximation: The Multiscale Structure of Neural Network Loss Landscapes

Chao Ma, Daniel Kunin, Lei Wu & Lexing Ying

J. Mach. Learn., 1 (2022), pp. 247-267.

Published online: 2022-09

Category: Theory

[An open-access article; the PDF is free to any online user.]

  • Abstract

A quadratic approximation of neural network loss landscapes has been extensively used to study the optimization process of these networks. However, it usually holds only in a very small neighborhood of the minimum and therefore cannot explain many phenomena observed during optimization. In this work, we study the structure of neural network loss functions and its implications for optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss clearly shows several separate scales. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [1, 2] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay by simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of the training data are among the causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth and multiple separate scales.
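To make the link between subquadratic growth and the Edge of Stability concrete, the following is a minimal one-dimensional sketch constructed for this page (not one of the paper's experiments): gradient descent on the loss L(x) = |x|^p with 1 < p < 2. Because the curvature L''(x) = p(p-1)|x|^(p-2) blows up near the minimum, a fixed learning rate cannot converge to it; the iterate instead settles into a bounded oscillation whose local sharpness is pinned at a level set by the learning rate.

# Toy illustration (assumed setup, not the paper's code): GD on L(x) = |x|^p, 1 < p < 2.
import numpy as np

p = 1.5  # subquadratic growth exponent

def run_gd(eta, steps=500, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= eta * p * np.sign(x) * np.abs(x) ** (p - 1)   # GD step on |x|^p
    sharpness = p * (p - 1) * np.abs(x) ** (p - 2)          # local curvature L''(x)
    return abs(x), sharpness

for eta in (0.1, 0.05):
    amp, sharp = run_gd(eta)
    print(f"eta={eta:0.2f}  oscillation amplitude={amp:.4f}  "
          f"sharpness={sharp:.1f}  2/eta={2/eta:.1f}")
# GD neither converges nor diverges: halving the learning rate roughly doubles
# the equilibrium sharpness, which settles at (p-1)*2/eta for this power law.

Roughly speaking, the subquadratic growth the authors observe near real minima plays the role of |x|^p here, which is how they account for GD hovering at a learning-rate-determined sharpness (the Edge of Stability) instead of diverging.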

  • General Summary

The loss landscape of neural networks cannot be studied using only a local quadratic approximation. We examine the loss landscape in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss clearly shows several separate scales. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay by simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of the training data are among the causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth and multiple separate scales.
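The role of the separate scales in learning rate decay can likewise be illustrated with a toy one-dimensional loss, again a constructed example rather than one taken from the paper: a wide quadratic basin superposed with small, sharp wiggles. A large step size is stable on the coarse scale but above the stability threshold of the fine scale, so the loss stalls at roughly the size of the fine-scale term; decaying the step size below that threshold lets gradient descent settle into a fine-scale minimum.

# Toy illustration (assumed construction): L(x) = 0.5*x^2 (coarse scale, curvature 1)
#                                               + 0.002*(1 - cos(50*x)) (fine scale, curvature up to 5).
# eta = 0.5 is stable for the coarse scale (threshold 2/1 = 2) but unstable for the
# fine scale (threshold roughly 2/6 ~ 0.33); eta = 0.02 resolves both.
import numpy as np

def loss(x):
    return 0.5 * x**2 + 0.002 * (1.0 - np.cos(50.0 * x))

def grad(x):
    return x + 0.1 * np.sin(50.0 * x)

x = 2.0
for eta, steps in [(0.5, 300), (0.02, 300)]:   # large-step phase, then decayed phase
    for _ in range(steps):
        x -= eta * grad(x)
    print(f"after phase with eta={eta:0.2f}: x={x:+.4f}, loss={loss(x):.2e}")
# The large-step phase descends the wide basin but stalls at a loss on the order of
# the fine-scale term (~1e-3 to 1e-2); after the decay, GD typically settles into a
# nearby fine-scale minimum and reduces the loss further.

Starting with the small step size from the beginning would also reach a fine-scale minimum here but would need many more iterations to traverse the coarse basin; the two separate scales are what make the decay schedule both necessary and efficient, echoing the mechanism the authors explain with their simple examples.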

  • AMS Subject Headings

  • Copyright

COPYRIGHT: © Global Science Press

  • BibTex
@Article{JML-1-247,
  author  = {Ma, Chao and Kunin, Daniel and Wu, Lei and Ying, Lexing},
  title   = {Beyond the Quadratic Approximation: The Multiscale Structure of Neural Network Loss Landscapes},
  journal = {Journal of Machine Learning},
  year    = {2022},
  volume  = {1},
  number  = {3},
  pages   = {247--267},
  issn    = {2790-2048},
  doi     = {https://doi.org/10.4208/jml.220404},
  url     = {http://global-sci.org/intro/article_detail/jml/21028.html}
}
  • RIS

TY  - JOUR
T1  - Beyond the Quadratic Approximation: The Multiscale Structure of Neural Network Loss Landscapes
AU  - Ma, Chao
AU  - Kunin, Daniel
AU  - Wu, Lei
AU  - Ying, Lexing
JO  - Journal of Machine Learning
VL  - 1
IS  - 3
SP  - 247
EP  - 267
PY  - 2022
DA  - 2022/09
SN  - 2790-2048
DO  - 10.4208/jml.220404
UR  - https://global-sci.org/intro/article_detail/jml/21028.html
KW  - Neural network loss, Subquadratic growth, Multiscale structure, Edge of stability
ER  -

  • TXT

Ma, Chao, Kunin, Daniel, Wu, Lei and Ying, Lexing. (2022). Beyond the Quadratic Approximation: The Multiscale Structure of Neural Network Loss Landscapes. Journal of Machine Learning. 1 (3). 247-267. doi:10.4208/jml.220404