Volume 3, Issue 4
Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)

Yuqing Li, Tao Luo & Nung Kwan Yip

CSIAM Trans. Appl. Math., 3 (2022), pp. 692-760.

Published online: 2022-11

  • Abstract

Gradient descent yields zero training loss in polynomial time for deep neural networks despite the non-convex nature of the objective function. The behavior of a network in the infinite-width limit trained by gradient descent can be described by the Neural Tangent Kernel (NTK) introduced in [25]. In this paper, we study the dynamics of the NTK for finite-width Deep Residual Networks (ResNets) using the neural tangent hierarchy (NTH) proposed in [24]. For a ResNet with a smooth and Lipschitz activation function, we reduce the requirement on the layer width $m$ with respect to the number of training samples $n$ from quartic to cubic. Our analysis strongly suggests that the particular skip-connection structure of ResNets is the main reason for their advantage over fully-connected networks.
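
For orientation, a minimal sketch of the two objects named above, in generic notation that may differ from the paper's exact parametrization: for a network output $f(x;\theta)$ with parameters $\theta$, the NTK of [25] is the kernel $$K_\theta(x,x') = \big\langle \nabla_\theta f(x;\theta),\, \nabla_\theta f(x';\theta) \big\rangle,$$ while a residual block of width $m$ propagates its hidden state through a skip connection of the form $$x^{(l)} = x^{(l-1)} + \frac{1}{\sqrt{m}}\, W_2^{(l)}\, \sigma\big(W_1^{(l)} x^{(l-1)}\big), \qquad l = 1,\dots,L,$$ where $\sigma$ is the activation function; the precise scaling of the residual branch analyzed in the paper may differ.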

  • AMS Subject Headings

68U99, 90C26, 34A45

  • Copyright

© Global Science Press

  • BibTeX

@Article{CSIAM-AM-3-692,
  author  = {Li, Yuqing and Luo, Tao and Yip, Nung Kwan},
  title   = {Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)},
  journal = {CSIAM Transactions on Applied Mathematics},
  year    = {2022},
  volume  = {3},
  number  = {4},
  pages   = {692--760},
  issn    = {2708-0579},
  doi     = {10.4208/csiam-am.SO-2021-0053},
  url     = {http://global-sci.org/intro/article_detail/csiam-am/21154.html}
}

  • Keywords

Residual networks, training process, neural tangent kernel, neural tangent hierarchy.
