[Paper Translation] ImageNet Classification with Deep Convolutional Neural Networks

라이언 영어 2025. 1. 30. 23:31

ImageNet Classification with Deep Convolutional Neural Networks

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.
We trained a large, deep convolutional neural network to sort the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1,000 different classes.

On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art.
On the test data, the top-1 and top-5 error rates were 37.5% and 17.0%, considerably better than the previous state of the art (SOTA).

The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
The network has 60 million parameters and 650,000 neurons. It is made up of five convolutional layers, some of them followed by max-pooling layers, and three fully-connected layers that end in a 1000-way softmax.

To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation.
To speed up training, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation.

To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective.
To reduce overfitting in the fully-connected layers, we used "dropout", a recently developed regularization method that proved very effective.

We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
We also entered a variant of this model in the ILSVRC-2012 competition and won with a top-5 test error rate of 15.3%; the second-best entry achieved 26.2%.

* SOTA is short for 'State of the Art': in artificial intelligence (AI) and machine learning (ML), the best model or algorithm currently available for a given task.

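* considerably : (adv.) substantially, by a large amount
* resolution : a formal decision; the level of detail in an image (the sense used here)
* consist of : (v.) to be made up of
* followed by : with something coming next
* saturate : (v.) to fill to the limit, to soak thoroughly
* efficient : working well with little waste; capable
* employ : to hire; to use (the sense used here)
* prove : to demonstrate
* prove to be : to turn out to be
* enter : to go into; to take part in (a competition); to begin
* variant : a modified version
* compared to : in comparison with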
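The abstract's figure of 60 million parameters can be sanity-checked with a quick count. The sketch below is only an illustration: the layer shapes in `LAYERS` are not given in the abstract and are assumed here from Section 3 of the original paper (kernel sizes, the two-GPU channel split, and the 6 × 6 × 256 input to the first fully-connected layer), and `count_parameters` is a hypothetical helper name.

```python
# Layer shapes are assumptions taken from Section 3 of the original paper,
# not from the abstract: (name, inputs per unit, number of units).
LAYERS = [
    ("conv1", 11 * 11 * 3, 96),
    ("conv2", 5 * 5 * 48, 256),   # 48 input channels: each kernel sees one GPU's half
    ("conv3", 3 * 3 * 256, 384),  # this layer reads the feature maps of both GPUs
    ("conv4", 3 * 3 * 192, 384),
    ("conv5", 3 * 3 * 192, 256),
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),          # feeds the final 1000-way softmax
]


def count_parameters(layers):
    """Return the total number of weights plus one bias per output unit."""
    return sum(fan_in * units + units for _, fan_in, units in layers)


total = count_parameters(LAYERS)
fc_only = count_parameters(LAYERS[5:])
print(f"total parameters: {total:,}")
print(f"share held by fully-connected layers: {fc_only / total:.1%}")
```

Under these assumed shapes the total comes out at roughly 61 million, in line with the rounded figure in the abstract, and almost all of it sits in the three fully-connected layers.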

Introduction

Current approaches to object recognition make essential use of machine learning methods.
Current approaches to object recognition rely heavily on machine learning methods.

To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting.
To improve performance, we can collect larger datasets, learn more powerful models, and use better techniques to prevent overfitting.

Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]).
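Until recently, labeled image datasets were relatively small, on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]).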

Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations.
Simple recognition tasks can be solved quite well with datasets of this size, especially when the data is augmented with label-preserving transformations.

For example, the current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4].
For instance, the best current error rate on the MNIST digit-recognition task is below 0.3%, which is close to human performance [4].

But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets.
Objects in realistic settings, however, vary considerably, so learning to recognize them requires much larger training sets.

And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images.
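Indeed, the shortcomings of small image datasets are widely recognized (e.g., Pinto et al. [21]), but collecting labeled datasets with millions of images has become possible only recently.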

The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.
The new, larger datasets include LabelMe [23], made up of hundreds of thousands of fully segmented images, and ImageNet [6], made up of over 15 million labeled high-resolution images in more than 22,000 categories.

To learn about thousands of objects from millions of images, we need a model with a large learning capacity.
Learning about thousands of objects from millions of images requires a model with a large learning capacity.

However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have.
However, object recognition is so complex that even a dataset as large as ImageNet cannot fully specify the problem, so the model also needs a lot of prior knowledge to make up for all the data we do not have.

Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26].
Convolutional neural networks (CNNs) are one such class of models [16, 11, 13, 18, 15, 22, 26].

Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).
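Their capacity can be controlled by varying their depth and breadth, and they make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).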

Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
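So, compared with a standard feedforward network whose layers are of similar size, a CNN has far fewer connections and parameters and is easier to train, while its theoretically best performance is likely to be only slightly worse.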
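The difference in parameter count can be made concrete with a small calculation. This is a sketch under made-up sizes: the 32 × 32 × 3 input, the 16 output channels, and the 3 × 3 kernel are illustrative assumptions and do not come from the paper, and both helper functions are hypothetical.

```python
def dense_weights(in_h, in_w, in_c, out_h, out_w, out_c):
    """Weights when every output unit connects to every input unit."""
    return (in_h * in_w * in_c) * (out_h * out_w * out_c)


def conv_weights(kernel, in_c, out_c):
    """Weights when one small kernel per output channel is shared across positions."""
    return kernel * kernel * in_c * out_c


# Illustrative sizes only: map a 32x32x3 input to a 32x32x16 output.
dense = dense_weights(32, 32, 3, 32, 32, 16)
conv = conv_weights(3, 3, 16)
print(f"fully-connected: {dense:,} weights")
print(f"3x3 convolution: {conv:,} weights")
```

Both layers produce an output of the same size, but the convolution reuses the same small kernel at every position (the stationarity assumption) and connects each unit only to nearby pixels (the locality assumption), which is where the saving comes from.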

Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images.
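Despite these attractive qualities, and despite the relative efficiency of their local architecture, CNNs have remained prohibitively expensive to apply at large scale to high-resolution images.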

Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.
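Fortunately, current GPUs paired with a highly optimized implementation of 2D convolution are powerful enough to train CNNs that are large enough to be interesting,
and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.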

The specific contributions of this paper are as follows:
we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets.
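The specific contributions of this paper are as follows.
We trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2], and achieved by far the best results reported on these datasets.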

We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly.
We wrote a highly optimized GPU implementation of 2D convolution and of all the other operations involved in training convolutional neural networks, and we make it publicly available.

Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3.
Our network includes a number of new and unusual features that improve its performance and shorten its training time; they are described in detail in Section 3.

The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4.
The network is so large that overfitting was a significant problem even with 1.2 million labeled training examples, so we used several effective techniques to prevent it, which are described in Section 4.

Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.
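The final network contains five convolutional layers and three fully-connected layers, and this depth seems to be important.
We found that removing any single convolutional layer (each of which holds no more than 1% of the model's parameters) led to worse performance.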

In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate.
In the end, the network's size is limited mainly by the memory available on current GPUs and by the training time we are willing to tolerate.

Our network takes between five and six days to train on two GTX 580 3GB GPUs.
Training the network takes five to six days on two GTX 580 3GB GPUs.

All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
All of our experiments suggest that the results can be improved simply by waiting for faster GPUs and larger datasets to become available.

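* essential : (adj.) absolutely necessary; (n.) a key point
* current approach : the approach in use at present
* method : a way of doing something
* make use of : to use, to take advantage of
* relatively : (adv.) comparatively
* augment : to enlarge, to increase
* preserve : to keep, to maintain
* realistic : (adj.) true to real life; facing reality
* exhibit : to put on display; to show a property (the sense used here)
* variability : the tendency to vary
* necessary : required
* indeed : (adv.) really, in fact
* shortcoming : a flaw, a weakness
* widely : broadly, by many people
* hundreds of thousands of : several hundred thousand; a very large number of
* capacity : volume; ability
* immense (/ɪˈmens/) : enormous, huge
* specify : to state exactly
* compensate for : to make up for
* constitute : to be regarded as; to make up
* breadth : width
* assumption : something taken to be true without proof
* namely : that is to say
* stationarity : the property that statistical characteristics stay the same (here, across locations in an image)
* Thus : in this way, therefore
* similarly : in a similar way
* theoretically : in theory
* slightly : a little, to a small degree
* despite : in spite of
* efficiency : the quality of being efficient
* prohibitively : to a degree that rules something out (as in "prohibitively expensive")
* facilitate : to make possible, to make easier
* interestingly : in an interesting way; in "interestingly-large" it modifies "large" (large enough to be interesting)
* specific : concrete, particular
* contribution : something contributed; a donation
* as follows : as listed next
* subset : a part of a larger set
* by far : by a great margin
* inherent : existing as a built-in part
* unusual : uncommon, distinctive
* a number of : several, many
* significant : important, large enough to matter
* several : more than two but not many
* even with : even when having; in spite of
* no more than : at most
* In the end : finally, ultimately
* mainly : mostly, chiefly
* tolerate : to put up with, to accept
* experiment : a test carried out to learn something
* suggest : to propose; to indicate (the sense used here)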

The Dataset

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.

The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held.

ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments.

Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable.

On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
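The definition of the two error rates can be sketched in a few lines of plain Python. This is an illustration only: `top_k_error`, the score layout (one list of per-class scores per image), and the toy numbers are assumptions made for the example, not anything taken from the paper's code.

```python
def top_k_error(scores, labels, k):
    """Fraction of images whose correct label is not among the k highest-scoring classes.

    scores: one list of per-class scores for each image
    labels: the correct class index for each image
    """
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from most to least probable according to the model.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data: three images, three classes (ImageNet would use 1000 classes and k = 5).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

With `k=1` this is the ordinary classification error; a larger `k` can only lower the error or leave it unchanged, which is why a top-5 rate is never higher than the matching top-1 rate.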

ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality.

Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image.
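The geometry of this rescale-then-crop step can be sketched as follows. The helper `resize_and_crop_box` is hypothetical, and the rounding rule and the (left, top, right, bottom) box convention are assumptions; the paper does not say which interpolation or rounding was used, so only the sizes are computed here, not the pixel resampling itself.

```python
def resize_and_crop_box(width, height, target=256):
    """Return the rescaled size and the central crop box for one image.

    The shorter side is scaled to `target`; the box is (left, top, right, bottom)
    in the coordinates of the rescaled image.
    """
    scale = target / min(width, height)
    # Rounding is an assumption; max() guards against a side rounding below target.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# A 500x375 landscape image becomes 341x256, then the middle 256 columns are kept.
print(resize_and_crop_box(500, 375))
```

Because only the centre is kept, content near the ends of the longer side is discarded: a very wide or very tall image loses more than a nearly square one.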

We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel.

So we trained our network on the (centered) raw RGB values of the pixels.
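The centering step can be sketched like this. One reading is assumed here, namely that the mean is computed separately for every pixel position over the whole training set and then subtracted from each image; the text could also be read as a single overall mean. `mean_image` and `center` are hypothetical helpers, and images are represented as flat lists of values to keep the example short.

```python
def mean_image(images):
    """Per-position mean over a list of equally sized, flattened images."""
    n = len(images)
    return [sum(values) / n for values in zip(*images)]


def center(image, mean):
    """Subtract the training-set mean from one flattened image."""
    return [pixel - m for pixel, m in zip(image, mean)]


# Toy training set: two "images" with three values each.
train = [[0, 100, 200], [100, 100, 0]]
mean = mean_image(train)
print(mean)
print(center(train[0], mean))
```

The mean is computed on the training set only and then reused unchanged for validation and test images, so that every image the network sees is shifted in the same way.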