ํด๋ฆฝ๋ณด๋“œ์— ๋ณต์‚ฌ๋˜์—ˆ์Šต๋‹ˆ๋‹ค
Post

(ZFNet) Visualizing and Understanding Convolutional Networks

๐Ÿ“š ์ •๋ฆฌ

๐Ÿ“Œ ์ œ๋ชฉ

Visualizing and Understanding Convolutional Networks


๐ŸŒŸ ์ดˆ๋ก

0. ์ดˆ๋ก

  • ์™œ ConvNet์ด ์ž‘๋™ํ•˜๋Š”์ง€ ๋ช…ํ™•ํ•œ ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค, ๊ทธ๋ฆฌ๊ณ  ์ถ”๊ฐ€์ ์œผ๋กœ ์–ด๋–ป๊ฒŒ ๊ฐœ์„ ๋  ์ˆ˜ ์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค.
  • ์ง„๋‹จ์  ๋„๊ตฌ ์ œ์•ˆ : ์ค‘๊ฐ„ ํŠน์ง• ์ถ”์ถœ(intermediate feature layers)์˜ ๊ธฐ๋Šฅ๊ณผ ๋ถ„๋ฅ˜๊ธฐ์˜ ๋™์ž‘์„ ์ดํ•ด โ†’ ๊ธฐ์กด ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๋ฐ ๋„์›€์„ ์ค€๋‹ค. ์ฆ‰ ๋ˆˆ์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค(๋ธ”๋ž™๋ฐ•์Šค ๋ถ€๋ถ„์„ ํ™”์ดํŠธ๋ฐ•์Šค๋กœ)
  • ablation study ์ˆ˜ํ–‰
  • ZFNet๊ฐœ๋ฐœ - ์ „์ด๋Šฅ๋ ฅ ํ™•์ธ

์ •๋ฆฌ

ํ•ญ๋ชฉ๋‚ด์šฉ
๋ฐ์ดํ„ฐ์…‹ImageNet 2012, Caltech-101, Caltech-256
๋ชจ๋ธ ๊ตฌ์กฐAlexNet ๊ธฐ๋ฐ˜, stride/filter ๊ฐœ์„ , softmax ์ถœ๋ ฅ
์—ฐ๊ตฌ ๊ธฐ์—ฌDeconvnet ์‹œ๊ฐํ™”, ๊ตฌ์กฐ ์ตœ์ ํ™”, ablation, ์ „์ดํ•™์Šต
ํ‰๊ฐ€ ๊ฒฐ๊ณผImageNet์—์„œ AlexNet๋ณด๋‹ค ๋‚ฎ์€ ์˜ค๋ฅ˜์œจ, Caltech์—์„œ SOTA ๋‹ฌ์„ฑ

๐Ÿ’ก ๊ฒฐ๋ก  & ๊ณ ์ฐฐ

6. Discussion

  • ์‹œ๊ฐํ™”ํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•จ์œผ๋กœ์„œ, feature๋“ค์ด ๋ฌด์ž‘์œ„์ ์ด๊ฑฐ๋‚˜ ํ•ด์„ ๋ถˆ๊ฐ€๋Šฅํ•œ ํŒจํ„ด์ด ์•„๋‹˜์ด ๋“ค์–ด๋‚ฌ์Œ
    • ๊ณ„์ธต์ด ๊นŠ์–ด์งˆ์ˆ˜๋ก compositionality(์กฐํ•ฉ์„ฑ), invariance(๋ถˆ๋ณ€์„ฑ), class discrimination(ํด๋ž˜์Šค ๊ตฌ๋ณ„) ๋“ฑ ์ง๊ด€์ ์ธ ํŠน์„ฑ๋“ค์ด ๋ณด์˜€๋‹ค.
    • ์‹œ๊ฐํ™” ๊ธฐ๋ฒ•์ด ๋ชจ๋ธ ๋””๋ฒ„๊น…์— ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. AlexNet ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ
  • Occlusion(๊ฐ€๋ฆผ) ์‹คํ—˜์„ ํ†ตํ•ด, clf.๊ฐ€ scene context(์žฅ๋ฉด์˜ ๊ด‘๋ฒ”์œ„ํ•œ ๋งฅ๋ฝ)์„ ์‚ฌ์šฉํ•˜๋Š”๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์ด๋ฏธ์ง€์˜ local structure(๊ตญ์†Œ์  ๊ตฌ์กฐ)์— ๋งค์šฐ ๋ฏผ๊ฐํ•˜๊ฒŒ ๋ฐ˜์‘ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์ž…์ฆํ•จ.
  • Ablation study๋ฅผ ํ†ตํ•ด ๊ฐœ๋ณ„ ๊ณ„์ธต์ด ์•„๋‹ˆ๋ผ ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ€์ง€๋Š” minimum depth(์ตœ์†Œํ•œ์˜ ๊นŠ์ด)๊ฐ€ ์„ฑ๋Šฅ์— ํ•„์ˆ˜์ ์ด๋‹ค.
  • ์ „์ด ํ•™์Šต์˜ ํšจ์šฉ์„ฑ์„ ์ฆ๋ช…ํ–ˆ๋‹ค(์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ), ๊ทธ๋Ÿฌ๋‚˜ PASCAL๊ณผ ๊ฐ™์€ ๋ฐ์ดํ„ฐ์—์„œ dataset bias๋•Œ๋ฌธ์—, ์ผ๋ฐ˜ํ™”๊ฐ€ ์•ฝํ–ˆ์œผ๋‚˜ ๊ทธ ๋งˆ์ €๋„ ๋‚ฎ์€ ๊ฐ์†Œ์˜€๋‹ค. โ†’ ๊ฐ์ฒดํƒ์ง€๊นŒ์ง€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํ™•์žฅ๋  ์ˆ˜ ์žˆ๋‹ค.

ํ•ต์‹ฌ

  • convnet์€ ํ•ด์„๊ฐ€๋Šฅํ•˜๊ณ , ๊ณ„์ธต์ด ๊นŠ์–ด์งˆ์ˆ˜๋ก ์ถ”์ƒํ™”, ๋ถˆ๋ณ€์„ฑ์ด ์ฆ๊ฐ€
  • ์‹œ๊ฐํ™” : ๋‹จ์ˆœ ์‹œ๊ฐ ์„ค๋ช…๋„๊ตฌ๊ฐ€ ์•„๋‹ˆ๋ผ, ๋ชจ๋ธ ๋””๋ฒ„๊น…(๊ฐœ์„ ์šฉ)์— ์‚ฌ์šฉ๊ฐ€๋Šฅ
  • Occlusion : ๋ชจ๋ธ์€ ๋งฅ๋ฝ๋ณด๋‹ค, ์ง„์งœ ๊ฐ์ฒด ๊ตฌ์กฐ(local structure)์— ์ง‘์ค‘
  • Ablation : ๋ชจ๋ธ์€ ์ธต์˜ ๋‰ด๋Ÿฐ์ˆ˜๋ณด๋‹ค, ์ถฉ๋ถ„ํ•œ ๊นŠ์ด๊ฐ€ ์„ฑ๋Šฅ์— ํ•ต์‹ฌ
  • ์ „์ดํ•™์Šต ํŠน์ • ๋ฐ์ดํ„ฐ์…‹์—์„œ ํšจ๊ณผ์ , ๊ทธ๋Ÿฌ๋‚˜ ๋ฐ์ดํ„ฐ์˜ bias๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์—์„œ๋„ ํšจ๊ณผ์ ์ผ๊ฑฐ๋ผ ์˜ˆ์ƒ
    • ImageNet โ†’ Caltech(์ž‘์€ ๋ฐ์ดํ„ฐ์…‹) : ํšจ๊ณผ์ 
    • ์—ญ์€ ํšจ๊ณผ์ ์ธ์ง€ ์˜๋ฌธ
    • ๋Œ€๊ทœ๋ชจ๋ฐ์ดํ„ฐ์…‹์ด๋”๋ผ๋„ ๋ชจ๋“ ๋„๋ฉ”์ธ์— ์™„์ „ํžˆ ์ „์ด๋˜์ง€ ์•Š์Œ โ†’ ์†์‹คํ•จ์ˆ˜๋ฅผ ์†๋ณธ๋‹ค๋ฉด ๋” ํ–ฅ์ƒ๋˜์ง€ ์•Š์„๊นŒ?

5์ค„ ์š”์•ฝ

  • ConvNet ํŠน์ง•์€ ๋ฌด์ž‘์œ„๊ฐ€ ์•„๋‹ˆ๋ผ ์ ์ง„์ ์œผ๋กœ ์ถ”์ƒํ™”๋˜๋Š” ์˜๋ฏธ ์žˆ๋Š” ํ‘œํ˜„
  • Deconvnet ์‹œ๊ฐํ™”๋Š” ๋ชจ๋ธ ์„ฑ๋Šฅ ๊ฐœ์„ ์— ์œ ์šฉํ•œ ์ง„๋‹จ ๋„๊ตฌ
  • ๋ชจ๋ธ์€ ๊ตญ์†Œ์  ๊ตฌ์กฐ๋ฅผ ์ž˜ ์žก์•„๋‚ด๋ฉฐ ๊นŠ์ด๊ฐ€ ํ•„์ˆ˜์ ์ž„์„ ํ™•์ธ
  • ImageNet ํ•™์Šต ๋ชจ๋ธ์€ Caltech ๊ณ„์—ด์—์„œ SOTA ๋‹ฌ์„ฑ, ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹ ๋ฒค์น˜๋งˆํฌ์˜ ํ•œ๊ณ„ ์ œ๊ธฐ
  • PASCAL์—์„œ๋Š” ์ผ๋ฐ˜ํ™”๊ฐ€ ์ œํ•œ์  โ†’ dataset bias, loss function ๊ฐœ์„  ํ•„์š”

์ •๋ฆฌ

ํ•ญ๋ชฉ๋‚ด์šฉ
์‹œ๊ฐํ™” ๋ฐœ๊ฒฌํŠน์ง•์€ ์ถ”์ƒํ™”ยท๋ถˆ๋ณ€์„ฑยทํด๋ž˜์Šค ๊ตฌ๋ณ„์„ ์ ์ฐจ ๊ฐ•ํ™”
Occlusion๊ฐ์ฒด ์œ„์น˜์— ๋ฏผ๊ฐ โ†’ ๋ฐฐ๊ฒฝ ๋งฅ๋ฝ๋งŒ ์ด์šฉํ•˜์ง€ ์•Š์Œ
AblationํŠน์ • ์ธต๋ณด๋‹ค ๊นŠ์ด(depth) ์ž์ฒด๊ฐ€ ํ•ต์‹ฌ
์ „์ด ์„ฑ๋ŠฅCaltech-101/256์—์„œ SOTA, PASCAL์€ dataset bias๋กœ ๋‹ค์†Œ ์ €ํ•˜
์‹œ์‚ฌ์ ์ž‘์€ ๋ฒค์น˜๋งˆํฌ์˜ ์œ ํšจ์„ฑ ์žฌ๊ฒ€ํ† , loss function ๊ฐœ์„  ์‹œ ๊ฐ์ฒด ํƒ์ง€๋กœ ํ™•์žฅ ๊ฐ€๋Šฅ

๐Ÿ—ƒ๏ธ ๋ฐ์ดํ„ฐ

๋ฐ์ดํ„ฐ์…‹(์ƒ๋žต)

๋ฐ์ดํ„ฐ์…‹ํฌ๊ธฐ/๊ตฌ์„ฑํŠน์ง•ํ™œ์šฉ ๋ชฉ์ 
ImageNet 2012130๋งŒ ํ•™์Šต / 5๋งŒ ๊ฒ€์ฆ / 10๋งŒ ํ…Œ์ŠคํŠธ, 1000 ํด๋ž˜์Šค๋Œ€๊ทœ๋ชจ, ๊ฐ์ฒด ์ค‘์‹ฌConvNet ํ•™์Šต ๋ฐ ์„ฑ๋Šฅ ํ‰๊ฐ€
Caltech-101101 ํด๋ž˜์Šค, ํด๋ž˜์Šค๋‹น 15~30 ํ•™์Šต, ์ตœ๋Œ€ 50 ํ…Œ์ŠคํŠธ์†Œ๊ทœ๋ชจ, ๋‹จ์ˆœ ๊ฐ์ฒด์ „์ดํ•™์Šต ํšจ๊ณผ ๊ฒ€์ฆ
Caltech-256256 ํด๋ž˜์Šค, ํด๋ž˜์Šค๋‹น 15~60 ํ•™์Šตํด๋ž˜์Šค ์ˆ˜ ๋งŽ๊ณ  ๋‹ค์–‘์„ฑ ํผ์ „์ดํ•™์Šต ๊ฐ•๊ฑด์„ฑ ํ‰๊ฐ€
PASCAL VOC 201220 ํด๋ž˜์Šค, ์žฅ๋ฉด ๋‚ด ๋‹ค์ค‘ ๊ฐ์ฒด ํฌํ•จ๋ณต์žกํ•œ ์žฅ๋ฉด, multi-objectConvNet ์ผ๋ฐ˜ํ™” ํ•œ๊ณ„ ํ™•์ธ

๐Ÿ“Œ ์„œ๋ก 

1. ์„œ๋ก 

  • 1990๋…„ ์ดˆ ์ฒ˜์Œ ์ œ์•ˆ๋œ CNN์€ AlexNet(2012)๋ถ€ํ„ฐ ํš๊ธฐ์ ์ธ ๋ชจ๋ธ๋กœ ๋ฐœ์ „ํ•ด ์™”๋‹ค. ๊ทธ ์ด์œ ๋Š”
    • ๋Œ€๊ทœ๋ชจ ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์˜ ์ด์šฉ ๊ฐ€๋Šฅ์„ฑ
    • GPU์˜ ๊ตฌํ˜„
    • ์ •๊ทœํ™” ๊ธฐ๋ฒ•(Dropout, etc.)
  • ๊ทธ๋Ÿฌ๋‚˜ Blackbox(๋‚ด๋ถ€ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋ถˆ๋ช…ํ™•)์ด๊ธฐ ๋•Œ๋ฌธ์—, ๋” ๋‚˜์€ ๋ชจ๋ธ ๊ฐœ๋ฐœ์ด ๋‹จ์ˆœํ•œ ์‹œํ–‰์ฐฉ์˜ค์— ์˜์กดํ•  ์ˆ˜ ๋ฐ–์— ์—†๋‹ค. ๊ทธ๋ž˜์„œ ์–ด๋–ค ๊ณ„์ธต์—์„œ feature map์„ ์‹œ๊ฐํ™”ํ•˜๋Š” ๊ธฐ๋ฒ•์„ ์ œ๊ณตํ•˜๊ณ , ์ด๋Š” ์–ด๋–ค ํŠน์ง•์ด ์–ด๋–ป๊ฒŒ ์ง„ํ™”ํ•˜๋Š”์ง€ ๊ด€์ฐฐํ•˜๊ณ  ๋ชจ๋ธ์„ ์ง„๋‹จํ•  ์ˆ˜ ์žˆ๋‹ค.
  • Deconvolutional Network(deconvnet, (Zeiler et al., 2011))์„ ํ™œ์šฉํ•˜์—ฌ, feature activation์„ ๋‹ค์‹œ ํ”ฝ์…€ ๊ณต๊ฐ„์œผ๋กœ ํˆฌ์˜ํ•œ๋‹ค. ๋˜ ์ด๋ฏธ์ง€์˜ ์ผ๋ถ€๋ฅผ ๊ฐ€๋ ค์„œ ๋ถ„๋ฅ˜์˜ ๋ฏผ๊ฐ๋„๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์–ด๋–ค ๋ถ€๋ถ„์ด ๋ถ„๋ฅ˜์— ์ค‘์š”ํ•œ์ง€ ํ™•์ธํ•œ๋‹ค.
  • ์ด๋Ÿฌํ•œ ๊ธฐ๋ฒ•๋“ค์„ ํ™œ์šฉํ•˜์—ฌ AlexNet์—์„œ ์ข€ ๋” ๋ฐœ์ „ํ•œ ZFNet์„ ๋งŒ๋“ค์—ˆ๋‹ค.

โœ… 5์ค„ ์š”์•ฝ

  • ConvNet์€ ์ตœ๊ทผ ImageNet ๋“ฑ์—์„œ ์„ฑ๋Šฅ์„ ํ˜์‹ ์ ์œผ๋กœ ๊ฐœ์„ 
  • ๊ทธ๋Ÿฌ๋‚˜ ๋‚ด๋ถ€ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ์—ฌ์ „ํžˆ ๋ถˆ๋ช…ํ™•
  • ๋ณธ ๋…ผ๋ฌธ์€ deconvnet ๊ธฐ๋ฐ˜ ์‹œ๊ฐํ™”๋กœ ์ด๋ฅผ ๋ถ„์„
  • ๋ชจ๋ธ ๊ตฌ์กฐ ๊ฐœ์„  ๋ฐ ์ง„๋‹จ ๊ฐ€๋Šฅ์„ฑ์„ ์ œ์‹œ
  • ImageNet ํ•™์Šต ํŠน์ง•์ด ์ „์ดํ•™์Šต์—์„œ๋„ ํƒ์›”ํ•จ์„ ๋ณด์ž„

๐Ÿ“Œ ์ •๋ฆฌ

ํ•ญ๋ชฉ๋‚ด์šฉ
๋ฐฐ๊ฒฝConvNet ์„ฑ๋Šฅ ๊ธ‰์ƒ์Šน (CIFAR-10, ImageNet ๋“ฑ)
ํ•œ๊ณ„๋‚ด๋ถ€ ๋™์ž‘ ์›๋ฆฌ์— ๋Œ€ํ•œ ์ดํ•ด ๋ถ€์กฑ
๊ธฐ์—ฌ์‹œ๊ฐํ™” ๊ธฐ๋ฒ• ์ œ์•ˆ (deconvnet, occlusion)
์—ฐ๊ตฌ ์ „๋žตAlexNet ๊ตฌ์กฐ โ†’ ๊ฐœ์„  โ†’ ์‹œ๊ฐํ™” ๊ธฐ๋ฐ˜ ์ง„๋‹จ โ†’ ์ „์ด ์„ฑ๋Šฅ ํ™•์ธ
์‚ฌ์ „ํ•™์Šต ๊ตฌ๋ถ„์ง€๋„ ์‚ฌ์ „ํ•™์Šต(supervised pre-training) vs ๋น„์ง€๋„ ์‚ฌ์ „ํ•™์Šต(unsupervised pre-training) ๋Œ€๋น„


๐Ÿ”ฌ ์‹คํ—˜๊ณผ์ •

๐Ÿ“š ๊ด€๋ จ ์—ฐ๊ตฌ

1.1. ๊ด€๋ จ ์—ฐ๊ตฌ

  • ๋Œ€๋ถ€๋ถ„์€ ์ฒซ๋ฒˆ์งธ ๋ ˆ์ด์–ด๋งŒ ์ง์ ‘ ์‹œ๊ฐํ™”ํ•œ๋‹ค. ๋” ๊นŠ์€ ์ธต์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์ ‘๊ทผ์ด ์ œํ•œ์ ์ด๋‹ค.
  • ๊ฐ ๋‰ด๋Ÿฐ ์œ ๋‹›์˜ ํ™œ์„ฑํ™”๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด์„œ, ์ด๋ฏธ์ง€ ๊ณต๊ฐ„์—์„œ ๊ฐ ์œ ๋‹›์˜ optimal stimulus(์ตœ์  ์ž๊ทน)์„ ์ฐพ์•˜์œผ๋‚˜ ์ด๋Š” ์ดˆ๊ธฐํ™”์— ๋ฏผ๊ฐํ•˜์—ฌ ์œ ๋‹›์˜ invariances(๋ถˆ๋ณ€์„ฑ)์— ๋Œ€ํ•œ ์ •๋ณด๋Š” ์ œ๊ณตํ•˜์ง€ ๋ชปํ•œ๋‹ค. โ†’ ์ด๋Ÿฌํ•œ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Hessian์„ ์ˆ˜์น˜์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜์—ฌ ์ผ๋ถ€ ํ†ต์ฐฐ์„ ์ œ๊ณตํ–ˆ์œผ๋‚˜ ๊นŠ์–ด์งˆ์ˆ˜๋ก ํ†ต์ฐฐ์„ ์ œ๊ณตํ•˜์ง€ ์•Š๋Š”๋‹ค(์ด์ฐจ ๊ทผ์‚ฌ์˜ ๋‹จ์ )
  • ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋น„๋ชจ์ˆ˜์  ๊ด€์ ์˜ ๋ถˆ๋ณ€์„ฑ ์‹œ๊ฐํ™”๋ฅผ ์ œ๊ณต, ์ด๋ฏธ์ง€๋ฅผ ์ž˜๋ผ๋‚ด๋Š”๊ฒŒ ์•„๋‹ˆ๋ผ top-down projection์„ ํ†ตํ•ด ํŠน์ • ํ”ผ์ณ๋งต์„ ์ž๊ทนํ•˜๋Š” ํŒจ์น˜ ๋‚ด๋ถ€์˜ ๋“œ๋Ÿฌ๋‚ธ๋‹ค

ํ—ค์„ธ ํ–‰๋ ฌ(Hessian Matrix)

  • ์–ด๋–ค ํ•จ์ˆ˜ f(x)์˜ **Hessian ํ–‰๋ ฌ(Hessian matrix)**์€ **์ด์ฐจ ๋„ํ•จ์ˆ˜(์ด๊ณ„ ๋ฏธ๋ถ„)**๋ฅผ ๋ชจ์•„๋†“์€ ์ •๋ฐฉํ–‰๋ ฌ์ž…๋‹ˆ๋‹ค.
  • Hessian์€ ํ•จ์ˆ˜์˜ **๊ณก๋ฅ (curvature)**์„ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ํ•จ์ˆ˜๊ฐ€ ํŠน์ • ์ง€์ ์—์„œ ๋ณผ๋ก(convex)ํ•œ์ง€, ์˜ค๋ชฉ(concave)ํ•œ์ง€, ๋˜๋Š” ์•ˆ์žฅ์ (saddle point)์ธ์ง€๋ฅผ ํŒ๋ณ„ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
  • ์ตœ์ ํ™”(Optimization): 2์ฐจ ์ตœ์ ํ™” ๊ธฐ๋ฒ•(Newtonโ€™s method ๋“ฑ)์€ Hessian์„ ์ด์šฉํ•ด ๋” ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ดํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฏผ๊ฐ๋„ ๋ถ„์„(Sensitivity Analysis): ํŠน์ • ์ž…๋ ฅ ๋ณ€ํ™”๊ฐ€ ์ถœ๋ ฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ์ฃผ๋Š”์ง€, ์ฆ‰ ๋ชจ๋ธ์˜ **๊ตญ์†Œ์  ๋ถˆ๋ณ€์„ฑ(local invariance)**์„ ๋ถ„์„ํ•  ๋•Œ Hessian์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ๋‰ด๋Ÿฐ์˜ ์ถœ๋ ฅ์ด ์ž…๋ ฅ ๋ณ€ํ™”์— ๋”ฐ๋ผ ์–ผ๋งˆ๋‚˜ ๋ฏผ๊ฐํ•˜๊ฒŒ ๋‹ฌ๋ผ์ง€๋Š”์ง€(๊ณก๋ฅ )๋ฅผ ๋ถ„์„
  • ๊ณก๋ฅ ์ด ๋‚ฎ์€ ๋ฐฉํ–ฅ โ†’ ๋‰ด๋Ÿฐ์ด ๊ทธ ๋ฐฉํ–ฅ์˜ ์ž…๋ ฅ ๋ณ€ํ™”์—๋Š” ๋ถˆ๋ณ€(invariant)
  • ๊ณก๋ฅ ์ด ๋†’์€ ๋ฐฉํ–ฅ โ†’ ๋ฏผ๊ฐํ•˜๊ฒŒ ๋ฐ˜์‘ โ†’ ์ค‘์š”ํ•œ ํŒจํ„ด ๋ฐฉํ–ฅ

ํ•ต์‹ฌ

  • ์‹œ๊ฐํ™” ์—ฐ๊ตฌ๋Š” ์ดˆ๊ธฐ์— ์‹œ์ž‘์ธต์ด๋‚˜ ์ดˆ๊ธฐ์ธต์—๋งŒ ๊ตญํ•œ๋˜์–ด ๊นŠ์€ ์ธต์€ ํ•ด์„ ๋ถˆ๊ฐ€๋Šฅ
    • ์ตœ์  ์ž๊ทน ํƒ์ƒ‰ (Erhan et al., 2009): ์ด๋ฏธ์ง€ ๊ณต๊ฐ„์—์„œ ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• โ†’ ํ™œ์„ฑํ™” ๊ทน๋Œ€ํ™” โ†’ ๋‹จ์ : ์ดˆ๊ธฐํ™” ๋ฏผ๊ฐ, ๋ถˆ๋ณ€์„ฑ ์ •๋ณด ์—†์Œ
    • Hessian ๊ธฐ๋ฐ˜ ๋ถˆ๋ณ€์„ฑ ๋ถ„์„ (Le et al., 2010): Hessian ๊ทผ์‚ฌ๋กœ ๋ถˆ๋ณ€์„ฑ ํŒŒ์•… โ†’ ๋‹จ์ : ๊ณ ์ฐจ์› ์ธต์˜ ๋ณต์žกํ•œ ๋ถˆ๋ณ€์„ฑ์„ ๋‹จ์ˆœ ์ด์ฐจ์‹์œผ๋กœ ์„ค๋ช… ๋ถˆ๊ฐ€
    • ํŒจ์น˜ ๊ธฐ๋ฐ˜ ์‹œ๊ฐํ™” (Donahue et al., 2013): ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ฐ•ํ•œ ํ™œ์„ฑํ™”๋ฅผ ์ผ์œผํ‚ค๋Š” ํŒจ์น˜ ์‹๋ณ„ โ†’ ๋‹จ์ : ๋‹จ์ˆœ crop, feature map ๋‚ด๋ถ€ ๊ตฌ์กฐ๋Š” ์„ค๋ช…ํ•˜์ง€ ๋ชปํ•จ

โœ… 5์ค„ ์š”์•ฝ

  • ๊ณผ๊ฑฐ ์‹œ๊ฐํ™” ์—ฐ๊ตฌ๋Š” ์ฃผ๋กœ ์ฒซ ๋ฒˆ์งธ ์ธต์— ์ง‘์ค‘
  • Erhan et al. (2009): ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์œผ๋กœ ์ตœ์  ์ž๊ทน ํƒ์ƒ‰, ๋ถˆ๋ณ€์„ฑ ์„ค๋ช… ๋ถ€์กฑ
  • Le et al. (2010): Hessian ๊ทผ์‚ฌ๋กœ ๋ถˆ๋ณ€์„ฑ ๋ถ„์„, ๊ณ ์ธต์—์„œ๋Š” ๋ถ€์ •ํ™•
  • Donahue et al. (2013): ๋ฐ์ดํ„ฐ์…‹ ํŒจ์น˜ ๊ธฐ๋ฐ˜ ์‹œ๊ฐํ™”, ๊ตฌ์กฐ์  ํ•ด์„ ์ œํ•œ
  • ๋ณธ ๋…ผ๋ฌธ: Deconvnet์„ ํ†ตํ•ด ๋น„๋ชจ์ˆ˜์ , ๊ตฌ์กฐ์  ์‹œ๊ฐํ™” ์ œ๊ณต โ†’ ๊ณ ์ธต feature ํ•ด์„ ๊ฐ€๋Šฅ

๐Ÿ“Œ ์ •๋ฆฌ

์—ฐ๊ตฌ๋ฐฉ๋ฒ•ํ•œ๊ณ„
Erhan et al. (2009)์ด๋ฏธ์ง€ ๊ณต๊ฐ„ ๊ฒฝ์‚ฌํ•˜๊ฐ• โ†’ ์ตœ์  ์ž๊ทน์ดˆ๊ธฐํ™” ๋ฏผ๊ฐ, ๋ถˆ๋ณ€์„ฑ ์„ค๋ช… ๋ถˆ๊ฐ€
Le et al. (2010)Hessian ๊ทผ์‚ฌ โ†’ ๋ถˆ๋ณ€์„ฑ ๋ถ„์„๊ณ ์ฐจ์› ์ธต์˜ ๋ณต์žก์„ฑ ๋ฐ˜์˜ ๋ชปํ•จ
Donahue et al. (2013)ํŒจ์น˜ ์‹๋ณ„ โ†’ ํ™œ์„ฑํ™” ํ•ด์„๋‹จ์ˆœ crop, ๊ตฌ์กฐ ์„ค๋ช… ํ•œ๊ณ„
๋ณธ ๋…ผ๋ฌธDeconvnet ๊ธฐ๋ฐ˜ top-down projection๊ณ ์ธต feature ๊ตฌ์กฐ์  ํ•ด์„ ๊ฐ€๋Šฅ

๐Ÿ“š 2. Approach

2. Approach

  • ์ง€๋„ํ•™์Šต
  • Layer๊ตฌ์กฐ : Conv โ†’ ReLU โ†’ (์˜ต์…˜) Max Pooling : Local์— ๋Œ€ํ•ด โ†’ (์˜ต์…˜) Local Contrast Normalization : feature map ์ „๋ฐ˜์„ ์ •๊ทœํ™”
  • ๋„คํŠธ์›Œํฌ๊ฐ€ ๊นŠ์–ด์ง€๋ฉด ์†Œ์ˆ˜์˜ fc layer๋กœ ๊ตฌ์„ฑ, ๋งˆ์ง€๋ง‰์€ softmax clf.
  • ์•„ํ‚คํ…์ณ

ZFNet Architecture

  • ์†์‹คํ•จ์ˆ˜ : Cross-entropy
  • Optimizer : SGD(mini-batch)

โœ… 5์ค„ ์š”์•ฝ

  • ConvNet์€ ์ž…๋ ฅ ์ด๋ฏธ์ง€๋ฅผ ๊ณ„์ธต์  ๋ณ€ํ™˜์„ ํ†ตํ•ด ํด๋ž˜์Šค ํ™•๋ฅ ๋กœ ๋งคํ•‘
  • ๊ณ„์ธต์€ ํ•ฉ์„ฑ๊ณฑ, ReLU, ํ’€๋ง, ์ •๊ทœํ™”๋กœ ๊ตฌ์„ฑ
  • ์ƒ์œ„ ๊ณ„์ธต์€ fully-connected + softmax ๋ถ„๋ฅ˜๊ธฐ
  • ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค๊ณผ backpropagation์œผ๋กœ ํ•™์Šต
  • ํ™•๋ฅ ์  ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•(SGD)์œผ๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”

๐Ÿ“Œ ์ •๋ฆฌ

์š”์†Œ์„ค๋ช…
์ž…๋ ฅ์ปฌ๋Ÿฌ 2D ์ด๋ฏธ์ง€ (xi)
์ถœ๋ ฅํด๋ž˜์Šค ํ™•๋ฅ  ๋ฒกํ„ฐ (ลทi)
๊ณ„์ธต ๊ตฌ์„ฑํ•ฉ์„ฑ๊ณฑ + ReLU + (ํ’€๋ง) + (์ •๊ทœํ™”)
์ƒ์œ„ ๊ตฌ์กฐFully-connected layers
์ตœ์ข… ๋ถ„๋ฅ˜๊ธฐSoftmax
์†์‹ค ํ•จ์ˆ˜Cross-entropy
ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜Backpropagation + SGD

2.1 Visualization with a Deconvnet

  • Deconvolutional Network: built from the same components as a convnet, but mapping features → pixels instead of pixels → features.
    • Originally proposed for unsupervised learning; here it is used only as a probe of an already-trained model.
  • Analysis procedure
    1. Attach a deconvnet to the convnet.
    2. Feed an image into the convnet, compute the features, select a single activation, and set all others to zero.
    3. Feed this feature map into the deconvnet to reconstruct the activity in the layer beneath, via:
      1. unpool
      2. rectify
      3. filter
  • Key operations
    • Unpooling: max pooling is non-invertible; to work around this, the location of each region's maximum is recorded in switch variables during pooling, and the reconstruction is placed at exactly those positions.
    • Rectification: as in the ConvNet, ReLU is applied so the reconstructed features stay positive.
    • Filtering: transposed versions of the ConvNet's learned filters are used (flipped vertically and horizontally).
  • The result obtained from a single activation resembles a piece of structure in the original image. Since the model is trained discriminatively, this reveals which parts of the input image mattered. Note, however, that these are not samples from a generative model but simple back-projections. → It is a projection, not generation. → It shows intuitively which parts of the input structure the model based its decision on.
  • Analysis structure

Deconvnet Process
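The unpooling-with-switches step above can be sketched in NumPy (a toy single-channel example, not the paper's code): pooling records each window's argmax, and unpooling places values back at exactly those positions, zeros elsewhere.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Max pooling that also records the argmax location ('switch') per window."""
    oh, ow = x.shape[0] // k, x.shape[1] // k
    out = np.zeros((oh, ow))
    switches = np.zeros((oh, ow, 2), dtype=int)
    for i in range(oh):
        for j in range(ow):
            win = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            out[i, j] = win[r, c]
            switches[i, j] = (i*k + r, j*k + c)
    return out, switches

def unpool(pooled, switches, shape):
    """Place each pooled value back at its recorded max location; zeros elsewhere."""
    out = np.zeros(shape)
    oh, ow = pooled.shape
    for i in range(oh):
        for j in range(ow):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 9., 2., 0.],
              [3., 4., 8., 1.],
              [0., 2., 5., 6.],
              [7., 1., 0., 3.]])
pooled, sw = max_pool_with_switches(x)
recon = unpool(pooled, sw, x.shape)   # maxima restored at their original positions
```

This is why the projections preserve spatial structure: the switches pin each activation back to where the stimulus actually was.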

โœ… 5์ค„ ์š”์•ฝ

  • Deconvnet์„ ์ด์šฉํ•ด ์ค‘๊ฐ„์ธต feature map์„ ์ž…๋ ฅ ํ”ฝ์…€ ๊ณต๊ฐ„์œผ๋กœ ๋ณต์›
  • ๊ณผ์ •: (Unpool โ†’ ReLU โ†’ Filter) ๋ฐ˜๋ณต
  • pooling switch๋กœ ์›๋ž˜ ์ž๊ทน ์œ„์น˜ ๋ณต์›
  • ๊ฒฐ๊ณผ: ํŠน์ • activation์ด ์ž…๋ ฅ ์ด๋ฏธ์ง€์˜ ์–ด๋–ค ๋ถ€๋ถ„์— ์˜ํ•ด ์œ ๋ฐœ๋˜์—ˆ๋Š”์ง€ ํ™•์ธ ๊ฐ€๋Šฅ
  • ์ด๋Š” ์ƒ์„ฑ์ด ์•„๋‹Œ ํŒ๋ณ„ ๊ธฐ๋ฐ˜ projection โ†’ ConvNet์˜ ํŒ๋ณ„ ๊ทผ๊ฑฐ๋ฅผ ์ง๊ด€์ ์œผ๋กœ ์‹œ๊ฐํ™”

๐Ÿ“Œ ์ •๋ฆฌ

๋‹จ๊ณ„ConvNet (์ •๋ฐฉํ–ฅ)Deconvnet (์—ญ๋ฐฉํ–ฅ)
PoolingMax poolingUnpooling (switch ์‚ฌ์šฉ)
ReLUReLUReLU
FilteringLearned filterTransposed filter (flip)
๊ฒฐ๊ณผFeature map์ž…๋ ฅ ๊ณต๊ฐ„ ๋ณต์›

๐Ÿ“š 3. Training Detail

3. ํ•™์Šต ์„ธ๋ถ€ ์‚ฌํ•ญ

  • AlexNet์˜ GPU๋ถ„์‚ฐ ํ•™์Šต์„ ํ–ˆ๊ธฐ๋•Œ๋ฌธ์— 3, 4, 5์ธต์€ **sparse connections(ํฌ์†Œ ์—ฐ๊ฒฐ - ์šฐ๋ฆฌ๊ฐ€ ๊ธฐ์กด์— ์•„๋Š” sparse๊ฐ€ ์•„๋‹ˆ๋ผ, GPU๋กœ ์ธํ•œ ๋ชจ๋ธ ์•„ํ‚คํ…์ณ ๋ถ„์‚ฐ์„ ์˜๋ฏธ)**์„ ์‚ฌ์šฉํ–ˆ์ง€๋งŒ, ๋ณธ ๋ชจ๋ธ์—์„œ๋Š” dense connections๋กœ ๋Œ€์ฒด
    • ๋˜ํ•œ 1์ธต๊ณผ 2์ธต์— ์„ธ๋ถ€ ์ˆ˜์ •์ด ์ด๋ฃจ์–ด์ง(์‹œ๊ฐํ™” ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ)
  • AlexNet๊ณผ ๋™์ผํ•˜๊ฒŒ ์ฆ๊ฐ•
  • 1์ธต ํ•„ํ„ฐ๋ฅผ ์‹œ๊ฐํ™” ํ•œ๊ฒฐ๊ณผ, ์ผ๋ถ€ ํ•„ํ„ฐ๊ฐ€ ์ง€๋‚˜์น˜๊ฒŒ ์ง€๋ฐฐ์ ์ด๊ธฐ ๋•Œ๋ฌธ์— ์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด **RMS๊ฐ’์ด 10^-1์„ ์ดˆ๊ณผํ•˜๋Š” ํ•„ํ„ฐ๋Š” ๊ฐ•์ œ๋กœ renormalization(์žฌ์ •๊ทœํ™”, 0.1๋กœ ๋งŒ๋“ฆ, ๊ท ํ˜• ์œ ์ง€)**ํ•œ๋‹ค. ์ด๋Š” [-128, 128]์ธ 1์ธต์—์„œ ์ค‘์š”ํ•˜๋‹ค.

โœ… 5์ค„ ์š”์•ฝ

  • ImageNet 2012์—์„œ ํ•™์Šต (130๋งŒ ์žฅ, 1000 ํด๋ž˜์Šค)
  • ์ „์ฒ˜๋ฆฌ: ๋ฆฌ์‚ฌ์ด์ฆˆยทํฌ๋กญยทํ‰๊ท  ์ œ๊ฑฐยท๋ฐ์ดํ„ฐ ์ฆ๊ฐ•
  • ํ•™์Šต: SGD, mini-batch=128, learning rate=0.01 ์‹œ์ž‘, momentum=0.9, Dropout=0.5
  • ํ•„ํ„ฐ ์žฌ์ •๊ทœํ™”๋กœ ํŠน์ • ํ•„ํ„ฐ ์ง€๋ฐฐ ๋ฐฉ์ง€
  • 70 epoch, GPU 1์žฅ, ์•ฝ 12์ผ ์†Œ์š”

๐Ÿ“Œ ์ •๋ฆฌ

ํ•ญ๋ชฉ์„ค์ •
๋ฐ์ดํ„ฐImageNet 2012 (1.3M, 1000 ํด๋ž˜์Šค)
์ „์ฒ˜๋ฆฌ๋ฆฌ์‚ฌ์ด์ฆˆ 256, ์ค‘์•™ ํฌ๋กญ, ํ‰๊ท  ์ œ๊ฑฐ, 224x224 ์„œ๋ธŒ ํฌ๋กญ 10๊ฐœ, flip
์ตœ์ ํ™”SGD, batch=128, lr=0.01, momentum=0.9
์ •๊ทœํ™”Dropout(0.5), filter RMS clipping(0.1)
๊ตฌ์กฐ ์ฐจ์ดAlexNet sparse โ†’ dense ์—ฐ๊ฒฐ
ํ•™์Šต ์‹œ๊ฐ„70 epoch, GTX580 GPU, 12์ผ

๐Ÿ“š 4. Convnet Visualization

Feature Visualization across Layers

4. ์‹œ๊ฐํ™”

  • ๊ฐ ์ธต๋ณ„ ํŠน์ง• : ๊ทธ ๋ ˆ์ด์–ด์—์„œ์˜ ๋ณต์›์ด ์•„๋‹ˆ๋ผ, ์—ญ์œผ๋กœ ์ „๋ถ€๋‹ค ๊ฑฐ์นœ ํ›„ ๋ณต์›, ๋”ฐ๋ผ์„œ ๊นŠ์–ด์งˆ์ˆ˜๋ก ํ•ด์ƒ๋„๊ฐ€ ๋†’์•„์ง„๋‹ค. e.g. 2์ธต์˜ ๊ฒฝ์šฐ 2 โ†’ 1, 5์ธต์˜ ๊ฒฝ์šฐ 5 โ†’ 4 โ†’ โ€ฆ โ†’ 1 ์ฆ‰ ํ•˜์œ„๊ณ„์ธต(์ดˆ๊ธฐ์ธต)๊ณผ ์ƒ์œ„๊ณ„์ธต(ํ›„๋ฐ˜์ธต)์„ ๋น„๊ต
    • 2์ธต(layer 2): ๋ชจ์„œ๋ฆฌ(corner), ์ƒ‰์ƒ/์—ฃ์ง€ ๊ฒฐํ•ฉ ๊ตฌ์กฐ์— ๋ฐ˜์‘
    • 3์ธต(layer 3): ํ…์Šค์ฒ˜(texture)์™€ ๊ฐ™์€ ๋ณต์žกํ•œ ๋ถˆ๋ณ€์„ฑ ํŒจํ„ด ํฌ์ฐฉ
    • 4์ธต(layer 4): ํด๋ž˜์Šค ํŠน์ด์ (class-specific) ํŒจํ„ด (์˜ˆ: ๊ฐœ ์–ผ๊ตด, ์ƒˆ ๋‹ค๋ฆฌ)
    • 5์ธต(layer 5): ํฌ์ฆˆ ๋ณ€ํ™”๊ฐ€ ํฐ ์ „์ฒด ๊ฐ์ฒด (์˜ˆ: ํ‚ค๋ณด๋“œ, ๊ฐœ ์ „์ฒด ๋ชจ์Šต)
  • ์ž…๋ ฅ ๋ณ€ํ˜•(input deformation)์— ๋Œ€ํ•œ ๋ถˆ๋ณ€์„ฑ(invariance)์„ ํ™•์ธ โ†’ ์ž‘์€ ๋ณ€ํ™”๋Š” ํ•˜์œ„์ธต์—์„œ ํฐ ํšจ๊ณผ๋ฅผ ์ฃผ์ง€๋งŒ ์ƒ์œ„์ธต์—๋Š” quasi-linear(์•ˆ์ •์ )์ธ ๋ฐ˜์‘์„ ๋ณด์—ฌ์คŒ

Feature Invariance Visualization

  • ๊ณ„์ธต์  ์„ฑ๊ฒฉ(hierarchical nature)
  • ํ•˜์œ„ ๊ณ„์ธต์€ ์†Œ์ˆ˜์˜ epochs๋งŒ์— ์ˆ˜๋ ด, ์ƒ์œ„๋Š” ์˜ค๋ž˜๊ฑธ๋ฆผ

Feature Evolution during Training

  • ์ฆ‰ ConvNet์€ ๊นŠ๊ฒŒ ํ•™์Šต๋ ์ˆ˜๋ก ์ถ”์ƒ์ ์ธ ํŠน์ง•์„ ํ•™์Šต

โœ… 5์ค„ ์š”์•ฝ

  • Deconvnet์œผ๋กœ ๊ฐ ์ธต์˜ feature map์„ ํ”ฝ์…€ ๊ณต๊ฐ„์œผ๋กœ ๋ณต์›
  • ํ•˜์œ„์ธต: ์ €์ˆ˜์ค€ ํŠน์ง•(์—ฃ์ง€, ์ฝ”๋„ˆ), ์ค‘๊ฐ„์ธต: ํ…์Šค์ฒ˜, ์ƒ์œ„์ธต: ๊ฐ์ฒด/๋ถ€์œ„
  • ํ›ˆ๋ จ ์ดˆ๊ธฐ์— ํ•˜์œ„์ธต์€ ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ด, ์ƒ์œ„์ธต์€ ๋Šฆ๊ฒŒ ๋ฐœ๋‹ฌ
  • ์ž…๋ ฅ ๋ณ€ํ˜•์— ๋Œ€ํ•ด ์ƒ์œ„์ธต์€ ๋” ์•ˆ์ •์  โ†’ ๋ถˆ๋ณ€์„ฑ ํ™•๋ณด
  • ConvNet์€ ๊ณ„์ธต์ ์œผ๋กœ ์˜๋ฏธ ์žˆ๋Š” ํ‘œํ˜„์„ ํ•™์Šตํ•จ์„ ์‹ค์ฆ

๐Ÿ“Œ ์ •๋ฆฌ

์ธต (Layer)์ฃผ์š” ํŠน์ง•์‹œ๊ฐํ™” ๊ฒฐ๊ณผ
Layer 2์ฝ”๋„ˆ, ์—ฃ์ง€+์ƒ‰ ๊ฒฐํ•ฉ๊ธฐ๋ณธ ๊ธฐํ•˜ํ•™์  ๊ตฌ์กฐ ๊ฐ์ง€
Layer 3ํ…์Šค์ฒ˜, ๋ฐ˜๋ณต ํŒจํ„ด๋ฉ”์‹œ(mesh), ํ…์ŠคํŠธ ์ธ์‹
Layer 4ํด๋ž˜์Šค ํŠน์ด์  ๋ถ€์œ„๊ฐœ ์–ผ๊ตด, ์ƒˆ ๋‹ค๋ฆฌ ๋“ฑ
Layer 5์ „์ฒด ๊ฐ์ฒด๊ฐœ, ํ‚ค๋ณด๋“œ ๋“ฑ ๋‹ค์–‘ํ•œ ํฌ์ฆˆ

4.1. Architecture Selection

  • Problems with AlexNet
    • Layer 1: a mix of very high-frequency and low-frequency information, with little coverage of mid frequencies
    • Layer 2: aliasing artifacts (sampling distortion) caused by the stride-4 convolution in layer 1
  • ZFNet modifications
    • Reduce the layer-1 filter size from 11x11 to 7x7
    • Reduce the convolution stride from 4 to 2

AlexNet vs ZFNet Layer 1 and 2 Visualization
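The effect of the stride change can be checked with the standard output-size formula (a quick calculation assuming a 224×224 input and no padding; the exact published map sizes differ slightly depending on padding conventions):

```python
def conv_out_size(n, filter_size, stride, pad=0):
    """Spatial output size of a convolution: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - filter_size) // stride + 1

alexnet_l1 = conv_out_size(224, 11, 4)   # AlexNet layer 1: 11x11 filters, stride 4
zfnet_l1 = conv_out_size(224, 7, 2)      # ZFNet layer 1: 7x7 filters, stride 2
```

Stride 2 yields a roughly 2× denser layer-1 feature map in each dimension, which is why more mid-frequency information survives and layer-2 aliasing disappears.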

โœ… 5์ค„ ์š”์•ฝ

  • ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•ด AlexNet์˜ 1ยท2์ธต์—์„œ ๋ฌธ์ œ์  ๋ฐœ๊ฒฌ
  • 1์ธต: ์ค‘๊ฐ„ ์ฃผํŒŒ์ˆ˜ ๋ถ€์กฑ, 2์ธต: stride=4๋กœ ์ธํ•œ aliasing
  • ๊ฐœ์„ : 11x11 ํ•„ํ„ฐ โ†’ 7x7, stride=4 โ†’ 2
  • ๊ฒฐ๊ณผ: ์ •๋ณด ๋ณด์กด โ†‘, aliasing โ†“
  • ์„ฑ๋Šฅ ๋˜ํ•œ ๊ฐœ์„ ๋จ (Section 5.1)

๐Ÿ“Œ ์ •๋ฆฌ

๋ฌธ์ œ (AlexNet)ํ•ด๊ฒฐ์ฑ… (Zeiler & Fergus)๊ฒฐ๊ณผ
1์ธต: ๊ณ ยท์ €์ฃผํŒŒ ์œ„์ฃผ, mid-frequency ๋ถ€์กฑํ•„ํ„ฐ ํฌ๊ธฐ 11x11 โ†’ 7x7๋” ๊ท ํ˜• ์žกํžŒ ํ•„ํ„ฐ
2์ธต: stride=4 โ†’ aliasing ๋ฐœ์ƒstride=4 โ†’ 2aliasing ์ œ๊ฑฐ, ์ •๋ณด ๋ณด์กด โ†‘
์„ฑ๋Šฅ๊ฐœ์„  ์ „ AlexNet๊ฐœ์„  ํ›„ ๋” ๋‚ฎ์€ ์˜ค๋ฅ˜์œจ

4.2 Occlusion Sensitivity

  • An experiment checking whether the model truly locates the object or merely uses surrounding context; the conclusion is that when the object is occluded, the probability of the correct class drops sharply.
  • In other words, the model focuses on the object itself, which also hints at its potential for object detection.

Occlusion Sensitivity Experiment
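The experiment can be sketched as a loop that slides a gray square over the image and records the classifier's score at each position (a toy sketch: `score` below is a made-up stand-in for the trained network's correct-class probability):

```python
import numpy as np

def occlusion_map(image, score, occ_size=2, fill=0.5):
    """Slide an occluder over the image; record the score with each region hidden."""
    h, w = image.shape
    heat = np.zeros((h - occ_size + 1, w - occ_size + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i:i+occ_size, j:j+occ_size] = fill  # gray square
            heat[i, j] = score(occluded)
    return heat

# Stand-in classifier: responds only to the bright "object" at the image center.
def score(img):
    return img[3:5, 3:5].sum()

image = np.full((8, 8), 0.5)
image[3:5, 3:5] = 1.0              # the "object"
heat = occlusion_map(image, score)
# The score drops most exactly where the occluder covers the object region.
```

Low values in the resulting heat map mark the regions the classifier actually depends on; in the paper these coincide with the object, not the background.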

โœ… 5์ค„ ์š”์•ฝ

  • Occlusion์œผ๋กœ ๋ชจ๋ธ์ด ๋ฐฐ๊ฒฝ์ด ์•„๋‹Œ ๊ฐ์ฒด ์œ„์น˜์— ์˜์กดํ•จ์„ ํ™•์ธ
  • ๊ฐ์ฒด ๋ถ€์œ„๊ฐ€ ๊ฐ€๋ ค์ง€๋ฉด ์˜ฌ๋ฐ”๋ฅธ ํด๋ž˜์Šค ํ™•๋ฅ  ๊ธ‰๋ฝ
  • top conv layer์˜ feature map ํ™œ์„ฑ๋„๋„ ํ•จ๊ป˜ ๊ธ‰๋ฝ
  • Deconvnet ์‹œ๊ฐํ™”๊ฐ€ ์ง„์งœ ์ž๊ทน ๊ตฌ์กฐ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค๋Š” ๊ฒ€์ฆ
  • ConvNet์€ ์žฅ๋ฉด context๋ณด๋‹ค ๊ฐ์ฒด local structure์— ๋ฏผ๊ฐ

๐Ÿ“Œ ์ •๋ฆฌ

์งˆ๋ฌธ๋ฐฉ๋ฒ•๊ฒฐ๊ณผ์˜๋ฏธ
๊ฐ์ฒด ์œ„์น˜๋ฅผ ๋ณด๋Š”๊ฐ€, ๋ฐฐ๊ฒฝ context๋ฅผ ๋ณด๋Š”๊ฐ€?์ด๋ฏธ์ง€ ์˜์—ญ์„ ์ˆœ์ฐจ์ ์œผ๋กœ ๊ฐ€๋ฆผ๊ฐ์ฒด ๋ถ€๋ถ„์ด ๊ฐ€๋ ค์ง€๋ฉด ํ™•๋ฅ  ๊ธ‰๋ฝConvNet์€ ๊ฐ์ฒด local structure์— ์ง‘์ค‘
์‹œ๊ฐํ™” ์‹ ๋ขฐ์„ฑ ๊ฒ€์ฆtop conv layer feature map ํ™œ์„ฑ๋„ ๊ด€์ฐฐoccluder๊ฐ€ ํ•ด๋‹น ๊ตฌ์กฐ๋ฅผ ๊ฐ€๋ฆฌ๋ฉด ํ™œ์„ฑ๋„ ๊ธ‰๋ฝDeconvnet ์‹œ๊ฐํ™” ๊ฒฐ๊ณผ๊ฐ€ ์‹ค์ œ feature์™€ ์ผ์น˜

4.3. Correspondence Analysis

  • Classic models explicitly define correspondences between specific object parts (face and hands, nose and eyes); in deep models, any such mechanism can only be implicit.

✅ Method

  1. Select 5 dog-face images
  2. Occlude the same location in each (e.g. the left eye) and compute the feature difference ε_i^l between original and occluded image
  3. Measure the consistency of the difference vectors across image pairs (i, j) with a Hamming distance (Δ_l)
  4. Compare a specific part against a random part

✅ Results

  • Layer 5: Δ is lower for meaningful parts such as eyes and noses → correspondence is present
  • Layer 7: the model focuses on breed discrimination, so Δ for parts is similar to random ones → part-correspondence information weakens

✅ Meaning

  • Even without an explicit definition of correspondence, ConvNets learn implicit correspondences between object parts in their intermediate layers
  • However, in deeper layers this information fades as the representation specializes toward class discrimination

Hamming distance: given two vectors, the number of positions at which their elements differ; the smaller the value, the more the same part plays a common role across different images

Correspondence Analysis Experiment

โœ… 5์ค„ ์š”์•ฝ

  • ConvNet์ด ์•”๋ฌต์ ์œผ๋กœ ๊ฐ์ฒด ๋ถ€์œ„ ๋Œ€์‘์„ ํ•™์Šตํ•˜๋Š”์ง€ ์‹คํ—˜
  • ๋™์ผ ๋ถ€์œ„๋ฅผ ๊ฐ€๋ ค feature ๋ณ€ํ™”๋Ÿ‰์„ ๋น„๊ต
  • Layer 5: ๋ˆˆยท์ฝ”์—์„œ ๋ณ€ํ™”๊ฐ€ ์ผ๊ด€์  โ†’ ๋Œ€์‘์„ฑ ์กด์žฌ
  • Layer 7: breed ํŒ๋ณ„๋กœ ์น˜์ค‘ โ†’ ๋Œ€์‘์„ฑ ๊ฐ์†Œ
  • ConvNet์€ ์ค‘๊ฐ„์ธต์—์„œ correspondence๋ฅผ ํ˜•์„ฑํ•˜์ง€๋งŒ, ๊นŠ์€ ์ธต์—์„œ๋Š” ํŒ๋ณ„์  ํŠน์ง•์— ์ง‘์ค‘

๐Ÿ“Œ ์ •๋ฆฌ

์ธต (Layer)๋ถ€์œ„ฮ” ๊ฐ’ ๊ฒฐ๊ณผ์˜๋ฏธ
Layer 5๋ˆˆยท์ฝ” vs ๋ฌด์ž‘์œ„๋ˆˆยท์ฝ” ฮ” ๋” ๋‚ฎ์Œ๋ถ€์œ„ ๊ฐ„ ๋Œ€์‘์„ฑ ํ™•๋ณด
Layer 7๋ˆˆยท์ฝ” vs ๋ฌด์ž‘์œ„์œ ์‚ฌbreed ๊ตฌ๋ถ„์— ์ง‘์ค‘, correspondence ์•ฝํ™”

๐Ÿ“š 5. Experiments

5.1. ImageNet 2012 โœ… 5์ค„ ์š”์•ฝ

  • ImageNet 2012: ํ•™์Šต 130๋งŒ / ๊ฒ€์ฆ 5๋งŒ / ํ…Œ์ŠคํŠธ 10๋งŒ, 1000 ํด๋ž˜์Šค
  • AlexNet ๊ตฌ์กฐ ์žฌํ˜„ โ†’ ๋ณด๊ณ ๋œ ์„ฑ๋Šฅ๊ณผ ๋™์ผ
  • stride 4โ†’2, filter 11ร—11โ†’7ร—7 โ†’ ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • ๋‹จ์ผ ๋ชจ๋ธ: Top-5 error 1.7% ๊ฐœ์„ 
  • ์•™์ƒ๋ธ”: 14.8% error โ†’ ๋‹น์‹œ ์ตœ๊ณ  ์„ฑ๋Šฅ, ๋น„-ConvNet์˜ ์ ˆ๋ฐ˜ ์ˆ˜์ค€

๐Ÿ“Œ ์ •๋ฆฌํ‘œ (5.1 ImageNet 2012)

๋ชจ๋ธTop-5 Error (%)๋น„๊ณ 
AlexNet (2012)16.4Krizhevsky et al.
Zeiler & Fergus (๋‹จ์ผ)์•ฝ 14.7โ€“15.0stride=2, filter=7ร—7 ์ ์šฉ
Zeiler & Fergus (์•™์ƒ๋ธ”)14.82012 ํ•™์Šต์…‹ ๊ธฐ์ค€ ์ตœ๊ณ  ์„ฑ๋Šฅ
๋น„-ConvNet (Gunji et al.)26.2๊ฐ™์€ ๋Œ€ํšŒ ์ƒ์œ„ entry

ImageNet 2012 Results Table

5.2 Feature Generalization ✅ 5-Line Summary

  • ImageNet-pretrained features are powerful even on small datasets
  • Caltech-101: 86.5%, +2.2% over the previous best
  • Caltech-256: 74.2%, +19% over the previous best
  • PASCAL: mean 79.0%, close to the best of 82.2%, and ahead on some classes
  • Confirms ConvNets as general-purpose transfer learning tools

๐Ÿ“Œ ์ •๋ฆฌํ‘œ (5.2 Feature Generalization)

๋ฐ์ดํ„ฐ์…‹์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ ์„ฑ๋Šฅ๊ธฐ์กด ์ตœ๊ณ ์น˜Scratch ํ•™์Šต์˜๋ฏธ
Caltech-10186.5%81.4%46.5%์†Œ๊ทœ๋ชจ์—์„œ๋„ SOTA
Caltech-25674.2% (60 imgs/class)55.2%38.8%๋Œ€๊ทœ๋ชจ/์†Œ๊ทœ๋ชจ ๋ชจ๋‘ ์••๋„
PASCAL VOC 201279.0% (mean)82.2%-๋‹ค์ค‘ ๊ฐ์ฒด ์žฅ๋ฉด, ์ผ๋ถ€ ํด๋ž˜์Šค๋Š” ๋” ์šฐ์œ„

5.3 Feature Analysis

  • Discriminative power increases with depth

✅ 5-Line Summary

  • ConvNet features become more discriminative with depth
  • Layer 1: weak (edge/color based)
  • Layer 5: large jump (intermediate features are very strong)
  • Layer 7: best on Caltech-256; plateaus on Caltech-101
  • ConvNets learn progressively stronger feature representations layer by layer

๐Ÿ“Œ ์ •๋ฆฌ

๋ฐ์ดํ„ฐ์…‹Layer 1Layer 3Layer 5Layer 7์˜๋ฏธ
Caltech-10144.8%72.3%86.2%85.5%์ค‘๊ฐ„~์ƒ์œ„์ธต์—์„œ ํฐ ํ–ฅ์ƒ, ์ตœ์ƒ์œ„์ธต์€ plateau
Caltech-25624.6%46.0%65.6%71.7%์ธต์ด ๊นŠ์„์ˆ˜๋ก ๊ณ„์† ํ–ฅ์ƒ, ์ตœ์ƒ์œ„์ธต์ด ์ตœ๊ฐ•
This post is licensed under CC BY 4.0 by the author.