
(AlexNet) ImageNet Classification with Deep Convolutional Neural Networks

📚 Summary

📌 Title

ImageNet Classification with Deep Convolutional Neural Networks


🌟 Abstract

  • 1000-way softmax: since this is a 1000-class classification problem, the output is a 1000-dimensional vector of class probabilities.
    • Top-1 error: the class with the highest predicted probability is not the correct label.
    • Top-5 error: the correct label is not among the five most probable classes. (See the sketch after this list.)
  • Saturating neurons: activation functions such as sigmoid and tanh saturate when the input becomes very large or very small, which causes vanishing gradients; AlexNet therefore adopts ReLU instead.
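
A toy sketch (not from the paper) of how the two error metrics can be computed from the 1000-way softmax outputs; the array shapes and random data below are purely illustrative.

```python
import numpy as np

# Top-1 / top-5 error from softmax outputs.
# probs:  (N, 1000) predicted class probabilities
# labels: (N,)      ground-truth class indices
def topk_error(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    topk = np.argsort(probs, axis=1)[:, -k:]          # indices of the k largest probabilities
    correct = (topk == labels[:, None]).any(axis=1)   # is the true label among them?
    return 1.0 - correct.mean()

probs = np.random.dirichlet(np.ones(1000), size=8)    # 8 fake softmax predictions
labels = np.random.randint(0, 1000, size=8)
print(topk_error(probs, labels, k=1))                 # top-1 error
print(topk_error(probs, labels, k=5))                 # top-5 error
```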

| Item | Content |
| --- | --- |
| Dataset | ImageNet LSVRC-2010 (1.2M images, 1000 classes); also entered the ILSVRC-2012 competition |
| Model size | 60M parameters, 650k neurons |
| Architecture | 5 convolutional layers + 3 fully-connected layers + 1000-way softmax |
| Key techniques | ReLU (non-saturating neurons), efficient GPU training, dropout regularization |
| Results | ILSVRC-2010: top-1 37.5%, top-5 17.0% / ILSVRC-2012: top-5 15.3% (1st place) |
| Contribution | Combining CNNs with large-scale data and GPUs → dramatically better than feature-engineering-based methods |

  • Architecture: AlexNet architecture diagram (figure)



💡 Conclusions & Discussion

| Item | Content |
| --- | --- |
| ILSVRC-2010 | CNN: top-1 37.5%, top-5 17.0% (previous best: 47.1%, 28.2%) |
| ILSVRC-2012 | Single CNN: top-5 18.2% → ensemble of 5 CNNs: 16.4% → pretrained + ensemble: 15.3% (runner-up: 26.2%) |
| ImageNet 2009 (10k classes) | CNN: top-1 67.4%, top-5 40.9% (previous: 78.1%, 60.9%) |
| Qualitative analysis | The CNN learns color-, orientation-, and frequency-selective kernels / specialization emerges between the two GPUs |
| Feature space | 4096-dimensional feature vectors capture semantic similarity between images, giving more meaningful retrieval than raw pixel distances |
| Further proposal | Compress the feature vectors with an auto-encoder → efficient image retrieval |

  • Making the network deeper is expected to bring even larger gains, and unsupervised pretraining is also expected to work well.
  • The depth of the CNN contributes substantially to its performance.

| Item | Content |
| --- | --- |
| Importance of depth | Removing even one convolutional layer costs about 2% top-1 accuracy → the deep structure is essential |
| Unsupervised pretraining | Not used in this paper, but expected to help with larger networks or when labels are scarce |
| Performance scaling | Making the network bigger and training it longer keeps improving results (scale-up effect) |
| Limitation | Still far from the human visual system (the infero-temporal pathway) |
| Future direction | Extend from static images to video → exploit temporal structure |

๐Ÿ—ƒ๏ธ ๋ฐ์ดํ„ฐ

  • Amazon Mechanical Turk: a system that splits work into small tasks (HITs, Human Intelligence Tasks), distributes them online to many workers, and pays a small reward per task, making it feasible to label a dataset at this scale.
  • To feed images into the CNN, every image is down-sampled to a fixed 256×256 resolution.
    • For rectangular images, the shorter side is resized to 256 and the central 256×256 patch is then cropped out.
  • No other preprocessing is applied beyond subtracting the per-pixel mean computed over the training set, so the network is trained on centered raw RGB values. (A minimal sketch follows below.)

| Item | Content |
| --- | --- |
| Original dataset | ImageNet (15M+ images, 22k classes) |
| ILSVRC versions | Subsets used for the 2010/2012 competitions (1000 classes; roughly 1.2M train, 50k val, 150k test) |
| Note | Only ILSVRC-2010 has publicly available test labels (used for most experiments) |
| Error metrics | Top-1 (highest-probability prediction is wrong), Top-5 (correct label not among the top 5 predictions) |
| Preprocessing | Resize every image to 256×256 and center-crop / subtract the training-set pixel mean and use raw RGB |
| Significance | An unprecedentedly large labeled dataset at the time; the foundation that made this CNN trainable |

📌 Introduction

1) Stationarity of statistics

  • Meaning: the assumption that the statistical patterns of an image (edges, textures, color distributions, etc.) are similar across the whole image.
  • Example: whether it is a cat's ear (an edge pattern) or a car's wheel (a circle-like edge), the same feature detector should work no matter where it appears in the image.
  • How CNNs implement this:
    • Weight sharing → the same convolution filter (kernel) is applied across the entire image.
    • As a result, the number of parameters drops sharply, and the learned features can be used regardless of position (translation invariance).

2) Locality of pixel dependencies

  • Meaning: the assumption that nearby pixels are more strongly correlated than distant ones.
  • Example: neighboring pixels are likely to belong to the same object (a cat's fur, a car's surface), whereas the top-left and bottom-right pixels are much more likely to be independent.
  • How CNNs implement this:
    • Local connectivity (local receptive fields) → each neuron looks only at a small region (e.g. 3×3 or 5×5), not the whole image.
    • These local features are combined hierarchically, building from low-level (edges) → mid-level (textures, parts) → high-level (object) representations.

3) How this leads to the advantages of CNNs

  • Thanks to these two assumptions:
    • Strong representational power with few parameters (savings from weight sharing).
    • Better training efficiency → small filters capture local features and are applied repeatedly across the whole image, so no neuron needs to see the entire image at once.
    • Better generalization → the model is not tied to one dataset and is robust to changes in position and background.
  • Until recently, labeled datasets were small. Datasets of that size are sufficient for simple tasks, on which performance already approaches human level.
  • Real-world objects, however, show enormous variability, so much larger datasets are needed; large-scale datasets such as ImageNet have made this possible.
  • Training on a large dataset requires a model with large capacity. CNNs build in the right assumptions about the nature of images, their depth and width can be adjusted, and compared with standard fully-connected layers they have far fewer parameters, which makes them much easier to train. (A rough comparison follows after this list.)
  • Despite these advantages, CNNs were still computationally expensive; advances in GPUs together with highly optimized 2D convolution implementations made training feasible, and the large dataset made it possible to train the model without severe overfitting.
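
A rough back-of-the-envelope comparison (my own illustration, with numbers chosen to match AlexNet's input size and Conv1 shape) of why convolution with weight sharing needs far fewer parameters than a fully-connected layer:

```python
# Fully-connected: every input pixel connected to every one of 4096 hidden units.
fc_weights = (224 * 224 * 3) * 4096      # = 616,562,688 weights

# Convolutional: 96 kernels of size 11x11x3, shared across all image positions.
conv_weights = 96 * (11 * 11 * 3)        # = 34,848 weights

print(f"fully-connected: {fc_weights:,} weights")
print(f"convolutional:   {conv_weights:,} weights")
```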

| Item | Content |
| --- | --- |
| Limits of prior work | Small datasets such as MNIST and CIFAR → adequate for simple tasks, but not for recognizing real-world objects |
| Large-scale data | ImageNet (15M+ images, 22k classes) → makes serious CNN training possible |
| Required model properties | Large capacity, built-in prior knowledge (translation invariance, locality) |
| CNN advantages | Few parameters, local connectivity, exploits assumptions about image statistics → more efficient than ordinary neural networks |
| Technical breakthrough | GPUs + fast 2D convolution implementations → large-scale CNN training becomes practical |
| Paper contribution | Successful training of a large-scale CNN |


๐Ÿ”ฌ ์‹คํ—˜๊ณผ์ •

๐Ÿ“š 3. The Architecture

  • ReLU: with gradient descent, saturating nonlinearities train far more slowly than non-saturating ones; a network with ReLUs learns several times faster than an equivalent network with tanh units (see the numeric sketch after the table below).


(Figure: ReLU vs tanh convergence speed)

| Item | Content |
| --- | --- |
| Network layers | 8 learned layers → 5 convolutional + 3 fully-connected |
| Output | Last fully-connected layer → 1000-way softmax (probabilities over the ImageNet classes) |
| ReLU definition | f(x) = max(0, x) (non-saturating nonlinearity) |
| Previous approach | sigmoid, tanh → gradient ≈ 0 in the saturated regions → slow training (vanishing gradients) |
| Advantage | ReLU has gradient 1 over the positive region → several times faster training |
| Evidence | About 6× faster convergence than tanh in a CIFAR-10 experiment |
| Significance | Combined with GPUs and the large dataset, a key ingredient that made large CNNs practically trainable |
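
A tiny numeric illustration (not from the paper) of why saturating units slow down gradient descent: the tanh gradient collapses toward zero for large |x|, while the ReLU gradient stays at 1 for positive inputs.

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.5, 5.0])

tanh_grad = 1 - np.tanh(x) ** 2       # d/dx tanh(x): ~0.0002, 0.42, 0.79, 0.0002
relu_grad = (x > 0).astype(float)     # d/dx max(0, x): 0, 0, 1, 1 (taking 0 at x <= 0)

print("tanh grads:", tanh_grad)       # vanishes at the tails -> slow learning
print("ReLU grads:", relu_grad)       # constant 1 on the positive side
```
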
  • Columnar CNNs
    • "Column" = one independently trained CNN.
    • Several columns run in parallel, each processing the same input image.
    • At the end, the softmax outputs (probability distributions) of the columns are averaged or voted on to produce the final prediction.
    • Advantage: better generalization than a single CNN and less overfitting.
    • Drawback: the columns are completely independent, so compute and memory grow in proportion to the number of columns.
  • GPUs can read from and write to each other's memory directly without going through host memory, which makes cross-GPU parallelization efficient.
  • Half of the kernels are placed on each of the two GPUs, and only certain layers communicate across GPUs: layer 3 takes input from all kernel maps of layer 2 on both GPUs, whereas layer 4 takes input only from the kernel maps of layer 3 that reside on the same GPU.
  • This connectivity pattern is tuned via cross-validation; it improves accuracy and also slightly reduces training time.
  • Columnar = fully independent columns; AlexNet = a cooperative structure (the feature maps are split across GPUs but partially shared).
  • Local Response Normalization (LRN) is used (a sketch follows after the table below).
    • It encourages competition between neurons (inspired by lateral inhibition in biological neurons; induces competition among the outputs of different kernels).
    • Conv → ReLU → LRN (applied after a few of the layers).


(Figure: Local Response Normalization formula)

| Item | Content |
| --- | --- |
| GPU-parallel training | Kernels split half-and-half across 2 GPUs, with cross-GPU connections only at certain layers → better accuracy, shorter training time |
| Comparison | vs. a single-GPU net: top-1 error ↓1.7%, top-5 ↓1.2% |
| Structural note | Similar to the columnar CNN of Cireşan et al., but the columns are not completely independent |
| Normalization idea | Local Response Normalization (LRN), inspired by lateral inhibition |
| Normalization formula | bⁱ = aⁱ / (k + α Σⱼ (aʲ)²)^β, where the sum runs over n adjacent kernel maps; hyperparameters k=2, n=5, α=1e-4, β=0.75 |
| Effect | ILSVRC top-1 ↓1.4%, top-5 ↓1.2% / CIFAR-10 error 13% → 11% |
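
A minimal NumPy sketch of the normalization formula in the table above; the boundary handling is my reading of the paper's description, and in practice `torch.nn.LocalResponseNorm` provides an equivalent built-in.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """AlexNet-style LRN across channels.

    a: ReLU outputs of one conv layer for a single image, shape (C, H, W).
    Each channel i is divided by a term summing the squared activations of
    the n adjacent kernel maps centred on i.
    """
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Toy usage: 96 kernel maps of size 55x55 (roughly Conv1's output shape).
activations = np.random.rand(96, 55, 55).astype(np.float32)
normalized = local_response_norm(activations)
```
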
  • Overlapping pooling is used, which improves generalization (a small sketch follows after the table below).

| Item | Content |
| --- | --- |
| Pooling definition | Summarizes the outputs of neighboring neurons within a region (typically max-pooling) |
| Traditional scheme | Non-overlapping (s = z) → pooling windows do not overlap |
| AlexNet scheme | Overlapping pooling (s < z): s = 2, z = 3 |
| Effect | top-1 error ↓0.4%, top-5 error ↓0.3% |
| Additional observation | Networks with overlapping pooling are slightly harder to overfit during training |
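
A toy comparison (my own example) of overlapping versus non-overlapping max-pooling using `torch.nn.MaxPool2d`; the 55×55 input size is just an illustrative choice.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)          # e.g. 96 feature maps of size 55x55

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # s < z: windows overlap
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)   # s = z: no overlap

print(overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
print(non_overlap(x).shape)   # torch.Size([1, 96, 27, 27]) -> same size here,
                              # but each output summarizes a larger, overlapping window
```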


(Figure: AlexNet layer connection diagram)

  • Start with large filters and a large stride → then stack smaller filters to add depth → the fully-connected layers form a high-level representation → softmax output (a PyTorch-style sketch follows after the table below).
  • Conv3 is connected to both GPUs, whereas Conv4 and Conv5 are connected only within the same GPU.

| Item | Content |
| --- | --- |
| Overall structure | 5 convolutional + 3 fully-connected layers → final 1000-way softmax |
| Input | 224×224×3 RGB image |
| Conv1 | 96 filters of 11×11×3, stride 4 |
| Conv2 | 256 filters of 5×5×48 (connected only to part of the previous layer's output; split across the GPUs) |
| Conv3 | 384 filters of 3×3×256 (connected to both GPUs) |
| Conv4 | 384 filters of 3×3×192 (same-GPU connections only) |
| Conv5 | 256 filters of 3×3×192 (same-GPU connections only, followed by pooling) |
| FC1-2 | 4096 neurons each (dropout applied) |
| FC3 | 1000 neurons → softmax |
| ReLU | Applied after every convolutional and fully-connected layer |
| Normalization/Pooling | LRN + pooling after Conv1 and Conv2, pooling after Conv5 |
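
A single-tower PyTorch sketch following the layer sizes in the table above. The original network splits its kernels across two GPUs, so the per-filter channel depths of 48 and 192 become 96 and 384 in this merged version; the padding values and the 227×227 input size are common conventions I am assuming so the spatial dimensions work out, and the softmax itself is left to the loss function.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),             # Conv1
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2),           # Conv2
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),          # Conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),          # Conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),          # Conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                           # FC1
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                  # FC2
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                           # FC3 (softmax via loss)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

logits = AlexNetSketch()(torch.randn(1, 3, 227, 227))
print(logits.shape)   # torch.Size([1, 1000])
```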

๐Ÿ“š 4. Reducing Overfitting

4.1. ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• (Data Augmentation)

  • Image crops and horizontal flips: random 224×224 patches (and their horizontal reflections) are extracted from the 256×256 images and used for training.
    • This enlarges the training set by a factor of about 2048. At test time, the four corner patches and the center patch plus their horizontal reflections (10 patches in total) are evaluated and their softmax outputs are averaged for the final prediction.
  • RGB channel intensity alteration: PCA is performed on the RGB pixel values of the whole training set, and random multiples of the principal components are added to each image's pixels.
  • This mimics a property of natural images: object identity is invariant to changes in the intensity and color of the illumination. (A small sketch follows after the figure below.)


(Figure: RGB channel intensity variation example)
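
A small sketch of the PCA color augmentation described above, assuming `pixels` holds the training set's RGB values as an (N, 3) float array; the sigma=0.1 scale of the random factors follows the paper.

```python
import numpy as np

def fit_rgb_pca(pixels: np.ndarray):
    cov = np.cov(pixels, rowvar=False)        # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)    # principal components of RGB space
    return eigvals, eigvecs

def pca_color_augment(image: np.ndarray, eigvals, eigvecs, sigma: float = 0.1):
    """Add a random multiple of the RGB principal components to every pixel.

    image: (H, W, 3) float array. One set of random factors is drawn per image
    and the same offset is added to all of its pixels.
    """
    alphas = np.random.normal(0.0, sigma, size=3)
    offset = eigvecs @ (alphas * eigvals)     # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return image + offset                     # broadcasts over height and width
```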

4.2. Dropout

  • During training, the output of each hidden neuron is set to zero with probability 0.5, so the dropped neurons take part in neither the forward nor the backward pass.
  • Each input is effectively processed by a different architecture (although all of these architectures share weights).
  • This prevents any neuron from relying too heavily on the presence of particular other neurons (co-adaptation), which improves generalization.
  • At test time all neurons are used, but their outputs are multiplied by 0.5.
  • Applied to the first two fully-connected layers. Without dropout the network overfits severely; with it, roughly twice as many iterations are needed to converge, but overfitting is greatly reduced. (A minimal sketch follows below.)
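
A minimal NumPy sketch of the train/test behavior just described (drop with probability 0.5 during training, use all neurons but scale by 0.5 at test time); the function and variable names are my own.

```python
import numpy as np

def dropout_forward(a: np.ndarray, p: float = 0.5, train: bool = True) -> np.ndarray:
    if train:
        mask = (np.random.rand(*a.shape) >= p).astype(a.dtype)
        return a * mask           # dropped neurons skip the forward and backward pass
    return a * (1.0 - p)          # test time: all neurons used, outputs scaled by 0.5

h = np.random.rand(4096)          # e.g. activations of the first fully-connected layer
h_train = dropout_forward(h, train=True)
h_test = dropout_forward(h, train=False)
```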

Summary

| Item | Content |
| --- | --- |
| Problem | 60M parameters vs. 1.2M training images → severe overfitting |
| Solution 1 | Data augmentation: random crops & horizontal flips (about 2048× more training data) / PCA-based color augmentation (reflects invariance to lighting and color changes) |
| Effect 1 | Without the crops, overfitting is severe / color augmentation lowers the top-1 error rate by about 1% |
| Solution 2 | Dropout: drop neurons with probability 0.5 during training / prevents co-adaptation between neurons / better generalization, but roughly 2× slower convergence |
| Where applied | Fully-connected layers 1 and 2 |

๐Ÿ“š 5. Details of learning

  • Trained with stochastic gradient descent (SGD): batch size 128, momentum 0.9, weight decay 0.0005.
  • The small weight decay does more than regularize → it actually helps the model learn (it reduces the training error).


(Figure: weight update rule)
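
The momentum/weight-decay update referenced above has the following form in the paper, where i is the iteration index, v the momentum variable, ε the learning rate, and the angle brackets denote the gradient averaged over the i-th batch D_i:

$$
v_{i+1} := 0.9\, v_i \;-\; 0.0005\, \varepsilon\, w_i \;-\; \varepsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i},
\qquad
w_{i+1} := w_i + v_{i+1}
$$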

  • Weight initialization: the weights in every layer are sampled from a zero-mean Gaussian with standard deviation 0.01.
  • Bias initialization: the biases of the second, fourth, and fifth convolutional layers and of the fully-connected hidden layers are initialized to the constant 1 (bias = 1 gives ReLU neurons positive inputs early on, which speeds up the initial phase of learning); all remaining biases are initialized to 0.
  • Learning-rate schedule: the same learning rate is used for all layers and adjusted manually during training; whenever the validation error stops improving, the learning rate is divided by 10. The initial learning rate was 0.01, and it was reduced three times before training ended.
  • Training ran for about 90 epochs over the roughly 1.2M-image training set and took five to six days on two NVIDIA GTX 580 GPUs with 3GB of memory each. (A minimal optimizer/initialization sketch follows below.)
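
A minimal sketch of this training configuration, reusing the AlexNetSketch class from the architecture sketch earlier; the module indices below are specific to that sketch, and ReduceLROnPlateau is my stand-in for the manual "divide the learning rate by 10 when validation error plateaus" rule.

```python
import torch
import torch.nn as nn

model = AlexNetSketch()

def init_weights(m: nn.Module) -> None:
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)   # zero-mean Gaussian, std 0.01
        nn.init.zeros_(m.bias)                          # default bias = 0

model.apply(init_weights)

# Biases of Conv2/4/5 and the two hidden FC layers are then set to 1
# (indices refer to the modules of AlexNetSketch defined earlier).
for idx in (4, 10, 12):                                 # Conv2, Conv4, Conv5
    nn.init.ones_(model.features[idx].bias)
for idx in (1, 4):                                      # FC1, FC2
    nn.init.ones_(model.classifier[idx].bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
# In the training loop (batch size 128):
#   loss.backward(); optimizer.step(); ...; scheduler.step(validation_error)
```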