멀티 암드 밴딧

멀티 암드 밴딧(Multi-armed bandit, K-armed bandit problem, N-armed bandit problem)은 확률론 및 기계 학습에서 고정된 제한된 리소스 세트를 경쟁(대안) 선택 간에 다음과 같은 방식으로 할당해야 하는 문제이다. 각 선택의 속성이 할당 당시 부분적으로만 알려지고 시간이 지남에 따라 또는 선택에 자원을 할당함으로써 더 잘 이해될 수 있을 때 예상 이득을 최대화한다. 이는 탐색-이용 트레이드오프 딜레마를 보여주는 고전적인 강화 학습 문제이다. 이 이름은 슬롯 머신이 줄지어 서 있는 도박꾼을 상상하는 데서 유래한다. 그는 어떤 기계를 플레이할지, 각 기계를 몇 번 플레이할지, 어떤 순서로 플레이할지, 현재 기계를 계속 사용할지 아니면 다른 기계를 사용해 볼지 여부를 결정해야 한다. 멀티 암드 밴딧 문제도 확률론적 스케줄링의 광범위한 범주에 속한다.

문제에서 각 기계는 선험적으로 알려지지 않은 해당 기계에 특정한 확률 분포에서 무작위 보상을 제공한다. 도박꾼의 목표는 일련의 레버 당김을 통해 얻은 보상의 합계를 최대화하는 것이다. 각 시도에서 도박꾼이 직면하는 중요한 트레이드오프는 가장 높은 기대 수익을 갖는 기계를 "이용"하는 것과 다른 기계의 예상 수익에 대해 더 많은 정보를 얻기 위한 "탐색" 사이이다. 탐색과 활용 사이의 균형은 기계 학습에서도 직면한다. 실제로 다중 무장 도적은 과학 재단이나 제약 회사와 같은 대규모 조직의 연구 프로젝트 관리와 같은 문제를 모델링하는 데 사용되었다. 문제의 초기 버전에서 도박꾼은 기계에 대한 초기 지식 없이 시작한다.

1952년 허버트 로빈스(Herbert Robbins)는 문제의 중요성을 깨닫고 "순차적 실험 설계의 일부 측면"에서 수렴적인 인구 선택 전략을 구축했다. 존 C. 기틴스(John C. Gittins)가 처음 발표한 정리인 기틴스 지수는 예상 할인 보상을 최대화하기 위한 최적의 정책을 제공한다.

같이 보기

외부 링크

MABWiser, open source Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
PyMaBandits, open source implementation of bandit strategies in Python and Matlab.
Contextual, open source R package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
bandit.sourceforge.net Bandit project, open source implementation of bandit strategies.
Banditlib, Open-Source implementation of bandit strategies in C++.
Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case.
Tutorial: Introduction to Bandits: Algorithms and Theory. Part1. Part2.
Feynman's restaurant problem, a classic example (with known answer) of the exploitation vs. exploration tradeoff.
Bandit algorithms vs. A-B testing.
S. Bubeck and N. Cesa-Bianchi A Survey on Bandits.
A Survey on Contextual Multi-armed Bandits, a survey/tutorial for Contextual Bandits.
Blog post on multi-armed bandit strategies, with Python code.
Animated, interactive plots illustrating Epsilon-greedy, Thompson sampling, and Upper Confidence Bound exploration/exploitation balancing strategies.