기댓값 최대화 알고리즘

기댓값 최대화 알고리즘(expectation-maximization algorithm, 약자 EM 알고리즘)은 관측되지 않는 잠재변수에 의존하는 확률 모델에서 최대가능도(maximum likelihood)나 최대사후확률(maximum a posteriori, 약자 MAP)을 갖는 모수의 추정값을 찾는 반복적인 알고리즘이다. EM 알고리즘은 모수에 관한 추정값으로 로그가능도(log likelihood)의 기댓값을 계산하는 기댓값 (E) 단계와 이 기댓값을 최대화하는 모수 추정값들을 구하는 최대화 (M) 단계를 번갈아가면서 적용한다. 최대화 단계에서 계산한 변수값은 다음 기댓값 단계의 추정값으로 쓰인다.

역사

기댓값 최대화 알고리즘은 1977년 아서 뎀스터, 낸 라드와 도널드 루빈에 의해 제시되었다. 하지만 특수한 경우에서의 기댓값 최대화 알고리즘은 이미 몇몇 저자들에 의해 연구되고 있었고, 특히 롤프 선드버그는 페르 마틴-뢰프, 앤더스 마틴-뢰트와 함께 지수족에 한정된 기댓값 최대화 알고리즘에 관한 연구를 한 바 있다. 1977년 뎀스터-라드-루빈 논문은 이 방법을 일반화하여 지수족뿐만 아니라 다양한 확률분포에 한해서도 알고리즘이 수렴함을 밝혔다. 왕립통계협회저널에 게재된 이 논문은 그 혁신성으로 인해 롤프 선드버그를 비롯한 여럿에게 극찬을 받았으며, 기댓값 최대화 알고리즘을 통계분석의 중요한 도구로 자리매김하게 하였다.

이후 뎀스터-라드-루빈 논문의 알고리즘 수렴성 분석에 오류가 발견되었고, 1983년 C.F. 제프 우에 의해 옳은 증명이 제시되었다. 제프 우의 증명은 기댓값 최대화 알고리즘이 지수족뿐만 아니라 다른 확률분포에도 성립함을 보였다.

서론

기댓값 최대화 알고리즘은 통계 모델의 수식을 정확히 풀 수 없을 때 최대가능도를 구하는 데 사용된다. 이런 통계 모델은 대개 잠재변수와 관측할 수 없는 모수, 관측된 데이터로 구성되어 있고, 몇몇 관측값들이 누락되어 있거나 관측되지 않은 데이터가 존재한다고 가정할 수도 있다. 예를 들면, 혼합 모델에서 매 관측값이 그에 상응하는 잠재변수 혹은 관측되지 않은 값을 가지고 있다고 가정할 수 있다.

최대가능도는 관측되지 않은 모든 변수 각각에 대하여 로그가능도를 미분한 식을 풀어서 구할 수 있다. 하지만 잠재변수를 포함하는 통계 모델의 수식들은 매우 복잡하여 이와 같은 방법으로는 풀 수 없다. 각각의 변수에 관해 미분한 식은 서로 얽혀있어 한 수식의 해를 구하기 위해서는 다른 식의 해를 구해야 하는데, 이 수식의 해 또한 이전 식의 변수를 포함하기 때문이다. 따라서 이 경우 정확한 해를 계산하는 것은 불가능하다.

하지만 기댓값 최대화 알고리즘을 이용하면 이 문제의 해를 수치적으로 계산할 수 있다. 우선 서로 얽혀있는 두 식과 각각의 변수들을 가정해보자. 두 식은 얽혀있기 때문에 한 식의 변수는 다른 식의 변수일 수 있다. 첫번째 식의 변수값을 임의로 잡은 후 이 변수값으로 두번째 식의 변수값을 추정한다. 그렇게 구한 변수값으로 다시 첫번째 식의 변수값을 추정하고 이를 반복한다. 특정한 상황에서 충분히 많은 반복을 거치면 각각의 변수들이 일정한 값으로 수렴한다는 것을 보일 수 있다. 또한, 이 값에서 로그가능도의 미분은 0이 됨을 보일 수 있고, 이는 이 값이 극점이거나 안장점임을 의미한다. 일반적으로 로그가능도는 여러 개의 극점을 가질 수 있으므로 기댓값 최대화 알고리즘이 반드시 최댓값을 구할 수는 없다. 어떤 가능도는 특이점, 즉 무의미한 극점을 갖기도 한다. 예를 들어, 혼합 모델에서 한 성분의 평균 매개변수를 관측된 하나의 데이터로 놓고 분산 변수를 0으로 설정할 때 발생하는 특이점이 기댓값 최대화 알고리즘으로 구해지는 해가 될 수 있다.

방식

관측할 수 있는 확률변수 $\mathbf {X}$ 와 관측할 수 없는 확률변수 $\mathbf {Z}$ , 그리고 모수 ${\boldsymbol {\theta }}$ 가 있을 때, $(\mathbf {X} ,\mathbf {Z} )$ 에 대한 확률분포는 $L({\boldsymbol {\theta }};\mathbf {X} ,\mathbf {Z} )=p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})$ 로 주어져 있다. 이 때 최대가능도를 계산하고 싶은 경우, 최대화하려는 가능도 함수는 $L({\boldsymbol {\theta }};\mathbf {X} )=p(\mathbf {X} |{\boldsymbol {\theta }})=\sum _{\mathbf {Z} }p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})$ 로 정의할 수 있다.

이 수식의 계산비용은 잠재변수 $\mathbf {Z}$ 가 취할 수 있는 값의 수에 비례하고, $\mathbf {Z}$ 의 차원이 증가할수록 취할 수 있는 값의 수는 지수적으로 증가하기 때문에 이 수식을 계산하기는 매우 어렵다.

기댓값 최대화 알고리즘은 다음 두 단계를 반복하여 가능도 함수 $L({\boldsymbol {\theta }};\mathbf {X} )$ 의 최대가능도를 구한다.

기댓값 (E) 단계에서는 ${\boldsymbol {\theta }}^{(t)}$ 가 주어지고 새로운 ${\boldsymbol {\theta }}$ 를 사용할 때 가능도의 기댓값 $Q$ 를 정의한다. 이 때 기댓값을 취하는 확률분포는 $\mathbf {X}$ , ${\boldsymbol {\theta }}^{(t)}$ 가 주어졌을 때 $\mathbf {Z}$ 의 조건부 분포이다.

Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})=\operatorname {E} _{\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }}^{(t)}}\left[\log L({\boldsymbol {\theta }};\mathbf {X} ,\mathbf {Z} )\right]=\sum _{\mathbf {Z} }p(\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }}^{(t)})\log L({\boldsymbol {\theta }};\mathbf {X} ,\mathbf {Z} )

최대화 (M) 단계에서는 $Q$ 를 최대화하는 새로운 모수 ${\boldsymbol {\theta }}^{(t+1)}$ 를 계산한다.

{\boldsymbol {\theta }}^{(t+1)}={\underset {\boldsymbol {\theta }}{\operatorname {arg\,max} }}\ Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})\,

기댓값 최대화 알고리즘은 다양한 모델에 적용 가능하지만, 적용되는 모델은 일반적으로 다음 성질을 만족한다.

관측값 $\mathbf {X}$ 는 이산적일 수도 연속적일 수도 있다.
누락값/잠재변수 $\mathbf {Z}$ 는 이산적이고, 각각의 관측값은 그에 대응하는 하나의 잠재변수를 가진다.
모수들은 연속적이고 두 종류로 나뉘는데, 이는 관측값들과 관련한 모수 혹은 잠재변수 각각의 값에 대응하는 모수이다.

모수 ${\boldsymbol {\theta }}$ 가 주어졌을 때, 잠재변수 $\mathbf {Z}$ 에 관한 가능도 함수를 최대화함으로써 $\mathbf {Z}$ 의 값을 어렵지 않게 추정할 수 있고, 반대로 잠재변수 $\mathbf {Z}$ 의 값이 주어졌을 때, 모수에 관한 가능도 함수를 최대화함으로써 모수 ${\boldsymbol {\theta }}$ 의 값을 추정할 수 있다. 모수와 잠재변수 중 하나의 값을 알면 다른 값도 쉽게 알 수 있다는 것이 기댓값 최대화 알고리즘의 핵심이다. 다음은 기댓값 최대화 알고리즘을 간략히 요약한 것이다.

모수 ${\boldsymbol {\theta }}$ 를 임의의 값으로 설정한다.
주어진 모수 값에 관한 잠재변수 $\mathbf {Z}$ 값을 추정한다.
2단계에서 얻은 잠재변수 값을 이용해 모수 ${\boldsymbol {\theta }}$ 값을 다시 추정한다.
모수 ${\boldsymbol {\theta }}$ 값과 잠재변수 $\mathbf {Z}$ 값이 수렴할 때까지 2, 3단계를 반복한다.

이 알고리즘은 비용함수의 극대점으로 단조 수렴한다.

기댓값 최대화 알고리즘은 극대점을 찾기 때문에, 구한 값이 실제 최대점은 아닐 수 있다. 이를 피하기 위해 여러 개의 임의 초기값에 대해 계산하는 기법이나 담금질 기법 등이 같이 사용될 수 있다.

성질

비록 EM을 반복하는 것이 관측된 데이터(한계) 기능도 함수를 증가시키지만 그 순열이 최대가능도로 수렴한다는 증거는 없다. 다항 분포에서는, EM 알고리즘이 초기값에 따라 관측된 데이터 가능도 함수의 극대값으로 수렴할 수도 있다. 경사 하강법(서로 다른 초기 추정량 ${\boldsymbol {\theta }}^{(t)}$ 에서 시작하는 방법)이나 담금질 기법 등 수 많은 체험적이거나 메타 발견적 학습 방법이 있다.

EM은 가능도가 지수족일 때 특히 유용하다: E 단계는 충분 통계량(Sufficient statistic)의 기대값들의 합이 되며, M 단계는 선형 함수를 최대화하는 것을 포함한다. 이런 경우에는, 각각의 단계에서 선드버그 공식을 이용하여 닫힌 해를 도출하는 것이 가능하다.^[1]^[2]^[3]^[4]^[5]^[6]^[7]^[8]^[9]

EM 방법은 뎀스터-라드-루빈 논문에서 베이즈 추론에서 최대 사후 확률을 계산하기 위해 변형되었다.

경사 하강법, 켤레기울기법, 또는 가우스-뉴턴 방법의 변형 등과 같은 최대가능도를 찾는 여러 방법들이 존재한다. EM과 달리, 이러한 방법들은 일반적으로 가능도 함수의 첫 번째와 두 번째 미분의 계산을 필요로 한다.

올바름 증명

기댓값 최대화 알고리즘은 $\log p(\mathbf {X} |{\boldsymbol {\theta }})$ 을 직접적으로 향상시키기 보다 $Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})$ 를 향상시키는 일이다. 여기에서 증명하는 것은 $Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})$ 를 향상시키는 ${\boldsymbol {\theta }}$ 를 찾으면 $\log p(\mathbf {X} |{\boldsymbol {\theta }})$ 가 향상된다는 것이다.^[10]

확률이 0이 아닌 $p(\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }})$ 를 만족하는 어떤 $\mathbf {Z}$ 에 대해서, 다음과 같이 이야기할 수 있다:

\log p(\mathbf {X} |{\boldsymbol {\theta }})=\log p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})-\log p(\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }})\,.

$p(\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }}^{(t)})$ 를 $\mathbf {Z}$ 에 대해서 합한 값(혹은 적분한 값)을 양변에 곱할 수 있다. 좌변은 상수인 기댓값이므로:

{\begin{aligned}\log p(\mathbf {X} |{\boldsymbol {\theta }})&=\sum _{\mathbf {Z} }p(\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }}^{(t)})\log p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})-\sum _{\mathbf {Z} }p(\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }}^{(t)})\log p(\mathbf {Z} |\mathbf {X} ,{\boldsymbol {\theta }})\\&=Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})+H({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})\,,\end{aligned}}

여기에서, $H({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})$ 는 음수의 합(negated sum)으로 정의된다. 마지막 식은 ${\boldsymbol {\theta }}={\boldsymbol {\theta }}^{(t)}$ 인 경우를 포함해서 모든 ${\boldsymbol {\theta }}$ 에 대해서도 성립한다.

\log p(\mathbf {X} |{\boldsymbol {\theta }}^{(t)})=Q({\boldsymbol {\theta }}^{(t)}|{\boldsymbol {\theta }}^{(t)})+H({\boldsymbol {\theta }}^{(t)}|{\boldsymbol {\theta }}^{(t)})\,,

이전의 식으로부터 이 마지막 식을 빼면,

\log p(\mathbf {X} |{\boldsymbol {\theta }})-\log p(\mathbf {X} |{\boldsymbol {\theta }}^{(t)})=Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})-Q({\boldsymbol {\theta }}^{(t)}|{\boldsymbol {\theta }}^{(t)})+H({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})-H({\boldsymbol {\theta }}^{(t)}|{\boldsymbol {\theta }}^{(t)})\,,

이다. 하지만, 깁스 부등식(Gibbs' inequality)에 의해서 $H({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})\geq H({\boldsymbol {\theta }}^{(t)}|{\boldsymbol {\theta }}^{(t)})$ 이다. 그러므로, 우리는 다음 결론에 도달할 수 있다:

\log p(\mathbf {X} |{\boldsymbol {\theta }})-\log p(\mathbf {X} |{\boldsymbol {\theta }}^{(t)})\geq Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})-Q({\boldsymbol {\theta }}^{(t)}|{\boldsymbol {\theta }}^{(t)})\,.

그러므로, $Q({\boldsymbol {\theta }}^{(t)}|{\boldsymbol {\theta }}^{(t)})$ 에 비해서 $Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})$ 가 커지도록 ${\boldsymbol {\theta }}$ 을 선택하면, 최소한 그 이상으로 $\log p(\mathbf {X} |{\boldsymbol {\theta }}^{(t)})$ 에 비해서 $\log p(\mathbf {X} |{\boldsymbol {\theta }})$ 가 향상될 것이다.

대체 설명

상황에 따라서는, 기댓값 최대화 알고리즘을 다음의 두 최대화 단계로 보는 것이 편리하다.^[11]^[12] q는 미관측 데이터 z를 나타내는 임의의 확률 분포, p_Z|X(· |x;θ)는 관측 데이터 x가 주어졌을 때의 조건부 확률 분포, H를 엔트로피, 그리고 D_KL를 쿨백-라이블러 발산이라고 할 때, 다음 식을 고려한다.

F(q,\theta )=\operatorname {E} _{q}[\log L(\theta ;x,Z)]+H(q)=-D_{\mathrm {KL} }{\big (}q{\big \|}p_{Z|X}(\cdot |x;\theta ){\big )}+\log L(\theta ;x)

그러면 기댓값 최대화에서의 단계는 다음과 같이 나타낼 수 있다:

기댓값 단계: F를 최대화하는 q를 찾는다 :

q^{(t)}={\underset {q}{\operatorname {arg\,max} }}\ F(q,\theta ^{(t)})

최대화 단계: F를 최대화하는 θ를 찾는다:

\theta ^{(t+1)}={\underset {\theta }{\operatorname {arg\,max} }}\ F(q^{(t)},\theta )

적용

기댓값 최대화 알고리즘은 기계 학습과 컴퓨터 비전의 데이터 클러스터링에 자주 사용된다. 자연 언어 처리 분야에는 확률적 문맥 자유 문법(probabilistic context-free grammar)의 자율 유도를 위한 두 가지 중요한 알고리즘의 예로 바움-웰치 알고리즘(Baum-Welch algorithm)과 안밖 알고리즘(inside-outside algorithm)이 있다.

심리측정학에서 기댓값 최대화 알고리즘은 문항 반응 이론 모델의 아이템 매개변수와 잠재 능력의 추정을 위해 거의 없어서는 안된다.

기댓값 최대화 알고리즘은 누락된 데이터를 처리하고 미확인 변수를 관찰하는 능력이 있어 포트폴리오 리스크(portfolio risk)의 가격을 정하고 관리하는 유용한 도구가 되고 있다.

기댓값 최대화 알고리즘은 또한 의학 영상 복원, 특히 양전자 방출 단층촬영과 단일광자방출단층활영(single photon emission computed tomography)에 널리 쓰인다.

기댓값 최대화 알고리즘의 필터링과 평활화

칼만 필터는 일반적으로 온라인 상태 추정에 사용되고 최소 분산 평활화는 오프라인 또는 배치(batch) 상태 추정에 사용된다. 그러나 이러한 최소 분산 해법들을 위해서는 상태공간 모델의 매개변수 추정이 필요하다. 기댓값 최대화 알고리즘은 절리 상태(joint state)와 모수 추정 문제를 푸는데 사용될 수 있다.

기댓값 최대화 알고리즘의 필터링과 평활화는 아래의 두 단계를 반복함으로써 일어난다.

기댓값 단계(E step): 현재 모수 추정으로 설계된 칼만 필터 또는 최소 분산 평활화를 적용하여 갱신된 상태 추정을 얻는다.

최대화 단계(M step): 최대가능도 계산과 필터 또는 평활화 된 상태 추정을 사용해서 갱신된 매개변수 추정을 얻는다.

칼만 필터 또는 최소 분산 평활화가 하나의 입력을 받고 하나의 출력을 하는 시스템에서 잡음 섞인 측정을 한다고 가정하자. 최대가능도 계산으로부터 갱신된 측정 잡음 분산 추정을 얻을 수 있다.

{\hat {\sigma }}_{v}^{2}={\frac {1}{N}}\sum _{k=1}^{N}{(z_{k}-{\hat {x}}_{k})}^{2}

이 때 ${\hat {x}}_{k}$ 는 N 스칼라 측정인 ${z_{k}}$ 으로부터 필터 또는 평활화에 의해 계산된 스칼라 출력 추정이다. 비슷하게 1차 자기회귀 과정을 위해, 갱신된 과정 잡음 분산 추정이 다음과 같이 계산된다.

{\hat {\sigma }}_{w}^{2}={\frac {1}{N}}\sum _{k=1}^{N}{({\hat {x}}_{k+1}-{\hat {F}}{\hat {x}}_{k})}^{2}

이 때 ${\hat {x}}_{k}$ 와 ${\hat {x}}_{k+1}$ 가 필터 또는 평활화에 의해 계산된 스칼라 상태 추정들이다. 갱신된 모델 계수 추정은 다음을 통해서 얻어진다.

{\hat {F}}={\frac {\sum _{k=1}^{N}({\hat {x}}_{k+1}-{\hat {F}}{\hat {x}}_{k})}{\sum _{k=1}^{N}{\hat {x}}_{k}^{2}}}

.

위와 같은 모수 추정의 수렴에 대해서 잘 연구되어 있다.^[13]^[14]^[15]

변종

기댓값 최대화 알고리즘의 때때로 느린 수렴을 가속화 하기 위해서 몇 가지 방법이 제시되어 왔다. 켤레기울기법을 사용하는 것과 수정된 뉴턴의 방법 기법들이 그 예다.^[16] 추가적으로 기댓값 최대화 알고리즘은 속박 추정 기법들과 함께 쓰인다.

조건부 기댓값 최대화(Expectation conditional maximization, 약자 ECM)는 각각의 M 단계를 각각의 매개변수 $\theta _{i}$ 가 개별적으로 최대화 되고 조건부로 다른 매개변수들은 고정되는 조건부 최대화 (CM) 단계의 배열로 대체하였다.^[17]

이 아이디어는 후에 대체 설명에서 E 단계와 M 단계 모두에 대해서 손실 함수 F의 증가를 찾는 일반화된 기댓값 최대화(generalized expectation maximization, 약자 GEM) 알고리즘으로 확장된다.^[11]

기댓값 최대화 알고리즘을 MM (Majorize/Minimize 또는 Minorize/Maximize, 문맥에 따라서) 알고리즘^[18]의 하위 분류로 간주하는 것도 또한 가능하고 그러므로 더욱 일반적인 경우에 대해 개발된 어떤 기구를 사용하는 것이 가능하다.

α-EM 알고리즘

기댓값 최대화 알고리즘에 사용되는 Q-함수는 로그가능도를 기반으로 하고 있다. 그러므로, log-EM 알고리즘으로 간주된다. 로그가능도의 사용은 α-로그가능도 비율로 일반화 될 수 있다. 그러면, 관찰된 데이터의 α-로그가능도 비율은 그것의 Q-함수와 α-발산을 사용한 등식으로 정확하게 표현된다. 이 Q-함수를 얻는 것은 일반화 된 E 단계이다. 그것의 최대화는 일반화된 M 단계이다. 이 두 단계를 α-EM 알고리즘 ^[19] 이라고 부르며 log-EM 알고리즘이 이에 하위 분류로 포함된다. 따라서, 야스오 마츠야마(Yasuo Matsuyama)의 α-EM 알고리즘은 정확히 log-EM 알고리즘의 일반화이다. 기울기 또는 Hessian 행렬의 계산이 필요 없다. α-EM은 알맞은 α를 선택함으로써 log-EM 알고리즘 보다 더 빠르게 수렴한다. α-EM 알고리즘은 은닉 마르코프 모델 추정 알고리즘 α-HMM^[20]의 더 빠른 버전으로 이어졌다.

변분 베이즈 근사법과의 관계

기댓값 최대화 알고리즘은 부분적으로 베이즈가 아닌 최대가능도방법이다. 최종 결과는 (베이즈 방식의) 잠재변수와 θ (최대가능도 또는 사후 최빈값(posterior mode) 중 하나)의 점추정에 대한 확률분포로 나타난다. 잠재변수처럼 θ에 대한 확률분포로 나타나는 완전한 베이즈 방식의 버전을 생각해 볼 수 있다. 추론을 위한 베이즈 방식의 접근법은 간단하게 θ을 다른 잠재변수와 같이 취급한다. 이 패러다임에선 E 단계와 M 단계의 구분이 없어진다. 만약 위에서 말한 변분 베이즈 근사법과 같이 인수분해된 Q 근사값을 사용하면, 각각의 잠재변수(θ가 포함된)에 대해서 반복하고 한 번에 하나씩 최적화할 수 있다. 그러면 k가 잠재변수의 개수라고 할 때, 한 번 반복할 때마다 k 단계들이 있다. 그래픽 모델(graphical models)에서 앞의 방법은 각각의 변수의 새로운 Q를 마르코프 블랭킷(Markov blanket)에만 의존시킴으로써 쉽게 만들 수 있어서, 국소 메시지 전달(message passing)이 효율적인 추론으로 사용될 수 있다.

기하학적 해석

정보 기하학(information geometry)에서, E 단계와 M 단계는 이중 아핀 접속의 사영(projection)으로 해석되고, e-연결(e-connection)과 m-연결(m-connection)로 불린다. 쿨백-라이블러 발산 또한 이 용어들을 사용해서 해석될 수 있다.

예제 1: 정규분포 혼합 모델

두 개의 $d$ 차원 다변량 정규분포가 혼합되어 있는 경우, 보통 이 모델에서 $d$ 차원 값을 수집하는 것은 가능하지만 그 값이 두 분포 중 어느 분포에서 나온 것인지는 알 수 없는 경우가 많다. 이 문단에서는 이러한 경우에 대한 기댓값 최대화 알고리즘을 유도한다.

전체 $n$ 개의 점을 수집하였을 때, $i$ 번째 수집되는 값은 $(x_{i},z_{i})$ 로 나타낼 수 있다. 여기에서 $x_{i}$ 는 $d$ 차원 점을 가리키는 관측 가능한 변수, 그리고 $z_{i}$ 는 두 개의 정규분포 중 어느 분포에서 수집되었는지를 가리키는 관측 불가능한 변수이다.

$i$ 번째 값이 두 분포 중 어느 분포에서 수집되는지에 대한 확률은 $i$ 와 독립적이기 때문에, $P(z_{i})$ 는 $i$ 에 관계없는 매개변수로 표현할 수 있다. 이 값을 $P(z_{i}=1)=\tau _{1},P(z_{i}=2)=\tau _{2}$ 인 벡터 ${\boldsymbol {\tau }}=(\tau _{1},\tau _{2})$ 로 표기하도록 한다( $\tau _{1}+\tau _{2}=1$ ).

마지막으로, 두 다변량 정규분포를 각각 ${\mathcal {N}}_{d}({\boldsymbol {\mu }}_{1},\Sigma _{1})$ , ${\mathcal {N}}_{d}({\boldsymbol {\mu }}_{2},\Sigma _{2})$ 로 표기하기로 한다. 즉,

X_{i}|(Z_{i}=1)\sim {\mathcal {N}}_{d}({\boldsymbol {\mu }}_{1},\Sigma _{1})

이고

X_{i}|(Z_{i}=2)\sim {\mathcal {N}}_{d}({\boldsymbol {\mu }}_{2},\Sigma _{2})

이다.

그렇다면 이 모델에서 매개변수는 ${\boldsymbol {\theta }}=({\boldsymbol {\mu }}_{1},{\boldsymbol {\mu }}_{2},\Sigma _{1},\Sigma _{2},{\boldsymbol {\tau }})$ 가 된다.

이제 가능도 함수 $L({\boldsymbol {\theta }};\mathbf {x} ,\mathbf {z} )$ 를 다음과 같이 표현할 수 있다: 불완전한 데이터 가능도 함수는

L(\theta ;\mathbf {x} ,\mathbf {z} )=P(\mathbf {x} ,\mathbf {z} |\theta )=\prod _{i=1}^{n}\sum _{j=1}^{2}\tau _{j}\ f(\mathbf {x} _{i};{\boldsymbol {\mu }}_{j},\Sigma _{j})

이고,

완전한 데이터 가능도 함수는

L({\boldsymbol {\theta }};\mathbf {x} ,\mathbf {z} )=P(\mathbf {x} ,\mathbf {z} \vert {\boldsymbol {\theta }})=\prod _{i=1}^{n}\sum _{j=1}^{2}\mathbb {I} (z_{i}=j)\ \tau _{j}\ f(\mathbf {x} _{i};{\boldsymbol {\mu }}_{j},\Sigma _{j})

이다.

여기에서 $\mathbb {I}$ 는 지시 함수이고, $f$ 는 다변량 정규분포의 확률 밀도 함수이다. 여기에서 $f$ 를 풀어서 쓰면 다음과 같이 정리된다.

L({\boldsymbol {\theta }};\mathbf {x} ,\mathbf {z} )=\exp \left\{\sum _{i=1}^{n}\sum _{j=1}^{2}\mathbb {I} (z_{i}=j){\big [}\log \tau _{j}-{\tfrac {1}{2}}\log |\Sigma _{j}|-{\tfrac {1}{2}}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{j})^{\top }\Sigma _{j}^{-1}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{j})-{\tfrac {n}{2}}\log(2\pi ){\big ]}\right\}

기댓값 단계(E step)

모수 ${\boldsymbol {\theta }}^{(t)}$ 에 대하여 $z_{i}$ 에 대한 확률분포 $\operatorname {P} (Z_{i}=j|X_{i}=\mathbf {x} _{i};{\boldsymbol {\theta }}^{(t)})$ 는 다음과 같이 계산한다.

T_{j,i}^{(t)}:=\operatorname {P} (Z_{i}=j|X_{i}=\mathbf {x} _{i};{\boldsymbol {\theta }}^{(t)})={\frac {\tau _{j}^{(t)}\ f(\mathbf {x} _{i};{\boldsymbol {\mu }}_{j}^{(t)},\Sigma _{j}^{(t)})}{\tau _{1}^{(t)}\ f(\mathbf {x} _{i};{\boldsymbol {\mu }}_{1}^{(t)},\Sigma _{1}^{(t)})+\tau _{2}^{(t)}\ f(\mathbf {x} _{i};{\boldsymbol {\mu }}_{2}^{(t)},\Sigma _{2}^{(t)})}}

그러면 기댓값 단계의 $Q$ 함수는 다음과 같이 계산된다.

{\begin{aligned}Q({\boldsymbol {\theta }}|{\boldsymbol {\theta }}^{(t)})&=\operatorname {E} [\log L({\boldsymbol {\theta }};\mathbf {x} ,\mathbf {Z} )]\\&=\operatorname {E} [\log \prod _{i=1}^{n}L({\boldsymbol {\theta }};\mathbf {x} _{i},\mathbf {z} _{i})]\\&=\operatorname {E} [\sum _{i=1}^{n}\log L({\boldsymbol {\theta }};\mathbf {x} _{i},\mathbf {z} _{i})]\\&=\sum _{i=1}^{n}\operatorname {E} [\log L({\boldsymbol {\theta }};\mathbf {x} _{i},\mathbf {z} _{i})]\\&=\sum _{i=1}^{n}\sum _{j=1}^{2}T_{j,i}^{(t)}{\big [}\log \tau _{j}-{\tfrac {1}{2}}\log \Sigma _{j}-{\tfrac {1}{2}}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{j})^{\top }\Sigma _{j}^{-1}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{j})-{\tfrac {1}{2}}\log(2\pi ){\big ]}\end{aligned}}

최대화 단계(M step)

이제 최대화 단계를 계산한다. $Q$ 함수는 $\tau _{j}$ 에 대한 식과 $({\boldsymbol {\mu }}_{j},\Sigma _{j})$ 에 대한 식의 선형 결합이기 때문에, $\tau _{j}$ 와 $({\boldsymbol {\mu }}_{j},\Sigma _{j})$ 에 대해 독립적으로 최대화를 계산하는 것이 가능하다.

먼저 모수 ${\boldsymbol {\tau }}^{(t+1)}$ 는 다음과 같이 유도된다.

{\begin{aligned}{\boldsymbol {\tau }}^{(t+1)}&={\underset {\boldsymbol {\tau }}{\operatorname {arg\,max} }}\ Q(\theta |\theta ^{(t)})\\&={\underset {\boldsymbol {\tau }}{\operatorname {arg\,max} }}\ \left\{\left[\sum _{i=1}^{n}T_{1,i}^{(t)}\right]\log \tau _{1}+\left[\sum _{i=1}^{n}T_{2,i}^{(t)}\right]\log \tau _{2}\right\}\end{aligned}}

이 식은 이항분포에서의 최대가능도를 계산하는 경우와 동일하며, 따라서 다음과 같이 계산할 수 있다.

\tau _{j}^{(t+1)}={\frac {\sum _{i=1}^{n}T_{j,i}^{(t)}}{\sum _{i=1}^{n}(T_{1,i}^{(t)}+T_{2,i}^{(t)})}}={\frac {1}{n}}\sum _{i=1}^{n}T_{j,i}^{(t)}

다음으로 $({\boldsymbol {\mu }}_{1}^{(t+1)},\Sigma _{1}^{(t+1)})$ 에 대한 식은 다음과 같다.

{\begin{aligned}({\boldsymbol {\mu }}_{1}^{(t+1)},\Sigma _{1}^{(t+1)})&={\underset {{\boldsymbol {\mu }}_{1},\Sigma _{1}}{\operatorname {arg\,max} }}\ Q(\theta |\theta ^{(t)})\\&={\underset {{\boldsymbol {\mu }}_{1},\Sigma _{1}}{\operatorname {arg\,max} }}\ \sum _{i=1}^{n}T_{1,i}^{(t)}\left\{-{\tfrac {1}{2}}\log |\Sigma _{1}|-{\tfrac {1}{2}}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{1})^{\top }\Sigma _{1}^{-1}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{1})\right\}\end{aligned}}

이 식은 다변량 정규분포에서의 최대가능도를 계산하는 식과 유사하며, 다음과 같이 계산할 수 있다.

{\boldsymbol {\mu }}_{1}^{(t+1)}={\frac {\sum _{i=1}^{n}T_{1,i}^{(t)}\mathbf {x} _{i}}{\sum _{i=1}^{n}T_{1,i}^{(t)}}}

\Sigma _{1}^{(t+1)}={\frac {\sum _{i=1}^{n}T_{1,i}^{(t)}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{1}^{(t+1)})(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{1}^{(t+1)})^{\top }}{\sum _{i=1}^{n}T_{1,i}^{(t)}}}

$({\boldsymbol {\mu }}_{2}^{(t+1)},\Sigma _{2}^{(t+1)})$ 도 같은 방식으로 구할 수 있다.

알고리즘의 종료

반복과정은 각 과정에서 가능도 사이의 증가폭이 사전에 정의된 기준점 아래로 내려오면 종료한다.

즉, 사전에 정의된 기준점 $\epsilon$ 에 대해서, $\log L(\theta ^{t};\mathbf {x} ,\mathbf {Z} )\leq \log L(\theta ^{(t-1)};\mathbf {x} ,\mathbf {Z} )+\epsilon$ 이면 종료한다.

일반화

앞서 소개한 알고리즘은 두 개 이상의 다변량 정규 분포를 혼합한 경우에도 적용이 가능하다.

예제 2: 베르누이 혼합 모델

동일한 크기의 이미지 $N$ 개가 주어졌을 때, 이를 $\mathbf {x_{n}} =(x_{n,1},...,x_{n,D})^{T},n\in \{1,...,N\}$ 라 하자. 각 이미지는 $D$ 개의 픽셀로 이루어져 있고 각 픽셀은 $0$ 혹은 $1$ 의 값을 가진다. 즉, $x_{n,i}\in \{0,1\}$ 이다. 또한 각각의 픽셀이 서로에 대해 조건부 독립적이라 가정하면, 성분 $k$ 의 모든 이미지에 관한 픽셀 $i$ 의 확률분포는 모수 $0\leq \mu _{k,i}\leq 1$ 를 갖는 베르누이 분포로 모델링 할 수 있다.

혼합 모델이 $K$ 개의 군집(cluster)을 가진다고 가정할 때, 어떤 이미지가 군집 $i$ 에 속해 있을 확률을 $\mathbf {\pi _{i}} ,i\in \{1,...,K\},\sum _{i=1}^{K}\pi _{i}=1$ 이라 하자. $\mathbf {\pi _{i}}$ 는 혼합 계수(mixing coefficient)라 불린다.

이 모델은 어떤 이미지가 특정 군집에 속할 확률변수와 그 군집에서의 픽셀 분포 변수를 모수로 갖는다. 즉, $\theta =(\mathbf {\mu } ,\mathbf {\pi } )$ 이고,

$\mathbf {\mu } :=(\mathbf {\mu _{1}} \;\mathbf {\mu _{2}} \;...\;\mathbf {\mu _{K}} )=\left({\begin{array}{cccc}\mu _{1,1}&\mu _{2,1}&...&\mu _{K,1}\\\mu _{1,2}&\mu _{2,2}&...&\mu _{K,2}\\\vdots &\vdots &\ddots &\vdots \\\mu _{1,D}&\mu _{2,D}&...&\mu _{K,D}\\\end{array}}\right),\mathbf {\pi } :=(\pi _{1},...,\pi _{K})^{T}$ 이다.

하나의 트레이닝 이미지(training image) $\mathbf {x}$ 에 관한 가능도 함수는 다음과 같다.

${\begin{aligned}\,{\text{Pr}}(\mathbf {x} |\theta )=\,{\text{Pr}}(\mathbf {x} |\mathbf {\mu } ,\mathbf {\pi } )=\sum _{k=1}^{K}\pi _{k}\,{\text{Pr}}(\mathbf {x} |\mathbf {\mu _{k}} )\end{aligned}}$

이 때 $\mathbf {x}$ 가 군집 $k$ 에 의해 생성되었을 확률은 다음과 같다.

${\begin{aligned}\,{\text{Pr}}(\mathbf {x} |\mathbf {\mu _{k}} )=\prod _{i=1}^{D}\mu _{k,i}^{x_{i}}(1-\mu _{k,i})^{1-x_{i}}.\end{aligned}}$

$K$ -차원 잠재변수 $\mathbf {z_{i}} ,i\in \{1,...,N\}$ 을 정의함으로써 주어진 이미지가 어떤 군집에 속할 확률을 모델링 할 수 있다. $\mathbf {z_{i}}$ 가 $1$ -of- $K$ 표기법으로 나타내어졌다고 하자. 즉,

$\mathbf {z_{i}} :=(z_{i,1},...,z_{i,K})^{T}$ 에 대하여, $z_{i,j}\in \{0,1\},i\in \{1,...,N\},j\in \{1,...,K\}$ 이고, $\sum _{j=1}^{K}z_{i,j}=1$ 가 성립한다.

$\mathbf {z_{i}}$ 의 분포가 $\,{\text{Pr}}(z_{i,j}=1)=\pi _{j}$ 로 주어져 있다고 가정하자. 그러면 다음이 성립한다.

${\begin{aligned}\,{\text{Pr}}(\mathbf {z_{n}} |\mathbf {\pi } )=\prod _{i=1}^{K}\pi _{i}^{z_{n,i}}.\end{aligned}}$

또한, $\,{\text{Pr}}(\mathbf {x_{n}} |z_{n,k}=1)=\,{\text{Pr}}(\mathbf {x_{n}} |\mathbf {\mu _{k}} )$ 라 하면,

${\begin{aligned}\,{\text{Pr}}(\mathbf {x_{n}} |\mathbf {z_{n}} ,\mathbf {\mu } ,\mathbf {\pi } )=\prod _{k=1}^{K}\,{\text{Pr}}(\mathbf {x_{n}} |\mathbf {\mu _{k}} )^{z_{n,k}}.\end{aligned}}$ 이다.

잠재변수 $\mathbf {z_{i}}$ 들을 포함하는 집합 $\mathbf {Z} :=\{\mathbf {z_{1}} ,...,\mathbf {z_{N}} \}$ 를 정의하고 위 수식을 사용하면,

${\begin{aligned}\,{\text{Pr}}(\mathbf {Z} |\mathbf {\pi } )=\prod _{n=1}^{N}\,{\text{Pr}}(\mathbf {z_{n}} |\mathbf {\pi } )=\prod _{n=1}^{N}\prod _{k=1}^{K}\pi _{k}^{z_{n,k}}.\end{aligned}}$ 이 성립하고,

트레이닝 이미지의 집합 $\mathbf {X} :=\{\mathbf {x_{1}} ,...,\mathbf {x_{N}} \}$ 를 정의하면 $\mathbf {Z}$ 에 관한 $\mathbf {X}$ 의 조건부 확률분포를 다음과 같이 유도할 수 있다.

${\begin{aligned}\,{\text{Pr}}(\mathbf {X} |\mathbf {Z} ,\mathbf {\mu } ,\mathbf {\pi } )=\prod _{n=1}^{N}\,{\text{Pr}}(\mathbf {x_{n}} |\mathbf {z_{n}} ,\mathbf {\mu } ,\mathbf {\pi } )=\prod _{n=1}^{N}\prod _{k=1}^{K}\,{\text{Pr}}(\mathbf {x_{n}} |\mathbf {\mu _{k}} )^{z_{n,k}}=\prod _{n=1}^{N}\prod _{k=1}^{K}\left(\prod _{i=1}^{D}\mu _{k,i}^{x_{n,i}}(1-\mu _{k,i})^{1-x_{n,i}}\right)^{z_{n,k}}.\end{aligned}}$

이제 위 수식들과 확률의 곱셈법칙을 이용하면 완전한 데이터 가능도(complete data likelihood)를 구할 수 있고,

${\begin{aligned}\,{\text{Pr}}(\mathbf {X} ,\mathbf {Z} |\mathbf {\mu } ,\mathbf {\pi } )=\,{\text{Pr}}(\mathbf {X} |\mathbf {Z} ,\mathbf {\mu } ,\mathbf {\pi } )\,{\text{Pr}}(\mathbf {Z} |\mathbf {\mu } ,\mathbf {\pi } )=\prod _{n=1}^{N}\prod _{k=1}^{K}\left(\pi _{k}\prod _{i=1}^{D}\mu _{k,i}^{x_{n,i}}(1-\mu _{k,i})^{1-x_{n,i}}\right)^{z_{n,k}}.\end{aligned}}$

이에 로그함수를 취하면 완전한 데이터 로그가능도(complete data log-likelihood) ${\mathcal {L}}(\theta )$ 가 유도된다.

${\begin{aligned}{\mathcal {L}}(\theta )=\ln \,{\text{Pr}}(\mathbf {X} ,\mathbf {Z} |\mathbf {\mu } ,\mathbf {\pi } )=\sum _{n=1}^{N}\sum _{k=1}^{K}z_{n,k}\left(\ln \pi _{k}+\sum _{i=1}^{D}x_{n,i}\ln \mu _{k,i}+(1-x_{n,i})\ln(1-\mu _{k,i})\right).\end{aligned}}$

기댓값 단계(E step)

기댓값 단계에서 잠재변수의 값을 업데이트 하기 위해 $\mathbf {Z}$ 의 확률분포를 $\,{\text{Pr}}(\mathbf {Z} |\mathbf {X} ,\mathbf {\theta } )$ 이라 하자. 이 분포는 정확히 풀 수 없으므로 값을 추정해야 하는데, 이를 위해 현재 $z_{n,k}$ 의 값을 이 변수의 기댓값으로 대체한다.

${\begin{aligned}z_{n,k}^{\text{new}}\leftarrow \mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\mathbf {\mu } ,\mathbf {\pi } }[z_{n,k}]=\sum _{z_{n,k}}\,{\text{Pr}}(z_{n,k}|\mathbf {x_{n}} ,\mathbf {\mu } ,\mathbf {\pi } )\,z_{n,k}={\frac {\pi _{k}\,{\text{Pr}}(\mathbf {x_{n}} |\mathbf {\mu _{k}} )}{\sum _{m=1}^{K}\pi _{m}\,{\text{Pr}}(\mathbf {x_{n}} |\mathbf {\mu _{m}} )}}={\frac {\pi _{k}\prod _{i=1}^{D}\mu _{k,i}^{x_{n,i}}(1-\mu _{k,i})^{1-x_{n,i}}}{\sum _{m=1}^{K}\pi _{m}\prod _{i=1}^{D}\mu _{m,i}^{x_{n,i}}(1-\mu _{m,i})^{1-x_{n,i}}}}.\end{aligned}}$

이와 같이 $z_{n,k}$ 의 값을 업데이트 하면 잠재변수 $\mathbf {z_{n}} ^{\text{new}}$ 가 더 이상 $1$ -of- $K$ 식으로 나타내어지지 않는다는 사실을 알 수 있다. 이는 이미지 $\mathbf {x_{n}}$ 이 하나의 군집이 아니라 여러 개의 군집에 부분적으로 속한다는 것을 의미한다.

최대화 단계(M step)

이 모델의 모수들(혼합 계수 및 군집별 픽셀 분포 변수)을 최대화하기 위해 다음 식을 이용한다.

$\mathbf {\theta } ^{\text{new}}\leftarrow {\underset {\mathbf {\theta } }{\text{arg max }}}\mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\theta ^{\text{old}}}\left[{\mathcal {L}}(\theta )\right]$

${\begin{aligned}\mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\theta ^{\text{old}}}\left[{\mathcal {L}}(\theta )\right]&=\sum _{n=1}^{N}\sum _{k=1}^{K}\mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\mathbf {\mu } ^{\text{old}},\mathbf {\pi } ^{\text{old}}}\left[z_{n,k}\right]\left(\ln \pi _{k}+\sum _{i=1}^{D}x_{n,i}\ln \mu _{k,i}+(1-x_{n,i})\ln(1-\mu _{k,i})\right).\end{aligned}}$

위 식을 $\mathbf {\mu _{k}}$ 에 관해 최대화하기 위해 미분하여 $0$ 이 되는 값을 찾는다.

${\begin{aligned}{\frac {\partial }{\partial \mu _{m,j}}}\mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\theta ^{\text{old}}}\left[{\mathcal {L}}(\theta )\right]&=\sum _{n=1}^{N}\mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\mathbf {\mu } ^{\text{old}},\mathbf {\pi } ^{\text{old}}}\left[z_{n,m}\right]\left({\frac {x_{n,j}}{\mu _{m,j}}}-{\frac {1-x_{n,j}}{1-\mu _{m,j}}}\right)\\&=\sum _{n=1}^{N}z_{n,m}^{\text{new}}{\frac {x_{n,j}-\mu _{m,j}}{\mu _{m,j}(1-\mu _{m,j})}}=0\Leftrightarrow \\\mu _{m,j}&={\frac {1}{N_{m}}}\sum _{n=1}^{N}x_{n,j}z_{n,m}^{\text{new}},\end{aligned}}$

이 때, $N_{m}=\sum _{n=1}^{N}z_{n,m}^{\text{new}}$ 는 군집 $m$ 에 실질적으로 속한 이미지의 개수이다.

군집 $m$ 에서의 픽셀 분포의 매개변수 $\mathbf {\mu _{m}}$ 는 다음과 같이 나타낼 수 있다.

$\mathbf {\mu _{m}} =\mathbf {{\bar {x}}_{m}} ,\mathbf {{\bar {x}}_{m}} ={\frac {1}{N_{m}}}\sum _{n=1}^{N}z_{n,m}^{\text{new}}\mathbf {x_{n}}$

$\mathbf {{\bar {x}}_{m}}$ 는 군집 $m$ 에 (실질적으로) 속한 이미지들의 가중 평균을 의미한다.

$\mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\theta ^{\text{old}}}\left[{\mathcal {L}}(\theta )\right]$ 를 혼합 계수 $\mathbf {\pi }$ 와 조건 $\sum _{k=1}^{K}\pi _{k}=1$ 에 관해 최대화하기 위해 라그랑주 승수법(Lagrange Multiplier)을 이용한다.

$\Lambda (\theta ,\lambda ):=\mathbb {E} _{\mathbf {Z} |\mathbf {X} ,\theta ^{\text{old}}}\left[{\mathcal {L}}(\theta )\right]+\lambda \left(\sum _{k=1}^{K}\pi _{k}-1\right)$

최댓값은 다음과 같이 구할 수 있다.

${\begin{aligned}{\frac {\partial }{\partial \pi _{m}}}\Lambda (\theta ,\lambda )={\frac {1}{\pi _{m}}}\sum _{n=1}^{N}z_{n,m}^{\text{new}}+\lambda =0\Leftrightarrow \\\pi _{m}=-{\frac {N_{m}}{\lambda }},\end{aligned}}$

${\begin{aligned}{\frac {\partial }{\partial \lambda }}\Lambda (\theta ,\lambda )=\sum _{k=1}^{K}\pi _{k}-1=0\Leftrightarrow \\\sum _{k=1}^{K}\pi _{k}=1.\end{aligned}}$

위 두 결과를 종합하면 $\lambda =-\sum _{k=1}^{K}N_{k}=-N$ 이고,

$\pi _{m}=-{\frac {N_{m}}{\lambda }}={\frac {N_{m}}{N}}$ 이다.

요약

베르누이 혼합 모델의 기댓값(E) 및 최대화(M) 단계의 수식들은 다음과 같다.

기댓값 단계(E step):
$z_{n,k}\leftarrow {\frac {\pi _{k}\prod _{i=1}^{D}\mu _{k,i}^{x_{n,i}}(1-\mu _{k,i})^{1-x_{n,i}}}{\sum _{m=1}^{K}\pi _{m}\prod _{i=1}^{D}\mu _{m,i}^{x_{n,i}}(1-\mu _{m,i})^{1-x_{n,i}}}}$

최대화 단계(M step):
$\mathbf {\mu _{m}} \leftarrow \mathbf {{\bar {x}}_{m}} ,$ $\pi _{m}\leftarrow {\frac {N_{m}}{N}},$

이 때, $\mathbf {{\bar {x}}_{m}} ={\frac {1}{N_{m}}}\sum _{n=1}^{N}z_{n,m}\mathbf {x_{n}}$ 이고, $N_{m}=\sum _{n=1}^{N}z_{n,m}$ 이다.

EM의 대체 알고리즘

EM은 극값으로만 수렴하고 일반적으로 수렴율에 대한 범위가 없다. 고차원에서 잘 작동하지 않고 극값의 수가 다수가 존재할 가능성이 있다. 따라서 특히 고차원 상에서 다른 학습 방법을 사용할 필요가 있다. 이 때 모멘트-기반 접근 방식이나 스펙트럼 기법이 더 좋은 효과를 보장한다. 모멘트-기반 접근은^[21] 확률 모형의 매개변수를 학습하기 위한 것으로 최근 들어 사용이 많아지기 시작했는데, 이는 EM이 극값만을 보장해주는 것에 반해 모멘트-기반 접근은 최대, 최솟값을 보장하기 때문이다. 혼합 모형, 은닉 마르코프 모델^[22], 공동체 모델 등으로부터 학습을 보장하는 알고리즘들이 유도될 수 있다.^[23] 이 스펙트럼 방법들은 잘못된 극값을 찾아내지 않고, 모수의 참값을 특정 제약 조건 하에서 일관적으로 도출시킨다.

같이 보기

주성분 분석

참조

EM 알고리즘은 MM 알고리즘^[24]의 특수한 형태로 볼 수 있다.

각주

↑ Sundberg, Rolf (1974). “Maximum likelihood theory for incomplete data from an exponential family”. 《Scandinavian Journal of Statistics》 1 (2): 49–58. JSTOR 4615553. MR 381110.
↑ Rolf Sundberg. 1971. Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.
↑ Sundberg, Rolf (1976). “An iterative method for solution of the likelihood equations for incomplete data from exponential families”. 《Communications in Statistics – Simulation and Computation》 5 (1): 55–64. doi:10.1080/03610917608812007. MR 443190.
↑ G. Kulldorff. 1961. Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.
↑ Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg formula")
↑ Per Martin-Löf. 1966. Statistics from the point of view of statistical mechanics. Lecture notes, Mathematical Institute, Aarhus University. ("Sundberg formula" credited to Anders Martin-Löf).
↑ Per Martin-Löf. 1970. Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969-1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")
↑ Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. With a discussion by F. Abildgård, Arthur P. Dempster|A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, George A. Barnard|G. A. Barnard, O. Barndorff-Nielsen, J. D. Kalbfleisch and Rasch model|G. Rasch and a reply by the author. Proceedings of Conference on Foundational Questions in Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.
↑ Martin-Löf, Per The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data. Scand. J. Statist. 1 (1974), no. 1, 3–18.
↑ Little, Roderick J.A.; Rubin, Donald B. (1987). 《Statistical Analysis with Missing Data》. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons. 134–136쪽. ISBN 0-471-80254-9.
↑ ^가 ^나 Neal, Radford; Hinton, Geoffrey (1999). Michael I. Jordan, 편집. “A view of the EM algorithm that justifies incremental, sparse, and other variants” (PDF). 《Learning in Graphical Models》 (Cambridge, MA: MIT Press): 355–368. ISBN 0-262-60032-3. 2009년 3월 22일에 확인함.
↑ Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). 〈8.5 The EM algorithm〉. 《The Elements of Statistical Learning》. New York: Springer. 236–243쪽. ISBN 0-387-95284-5.
↑ Einicke, G.A.; Malos, J.T.; Reid, D.C.; Hainsworth, D.W. (January 2009). “Riccati Equation and EM Algorithm Convergence for Inertial Navigation Alignment”. 《IEEE Trans. Signal Processing》 57 (1): 370–375. doi:10.1109/TSP.2008.2007090.
↑ Einicke, G.A.; Falco, G.; Malos, J.T. (May 2010). “EM Algorithm State Matrix Estimation for Navigation”. 《IEEE Signal Processing Letters》 17 (5): 437–440. Bibcode:2010ISPL...17..437E. doi:10.1109/LSP.2010.2043151.
↑ Einicke, G.A.; Falco, G.; Dunn, M.T.; Reid, D.C. (May 2012). “Iterative Smoother-Based Variance Estimation”. 《IEEE Signal Processing Letters》 19 (5): 275–278. Bibcode:2012ISPL...19..275E. doi:10.1109/LSP.2012.2190278.
↑ Jamshidian, Mortaza; Jennrich, Robert I. (1997). “Acceleration of the EM Algorithm by using Quasi-Newton Methods”. 《Journal of the Royal Statistical Society, Series B》 59 (2): 569–587. doi:10.1111/1467-9868.00083. MR 1452026.
↑ Meng, Xiao-Li; Rubin, Donald B. (1993). “Maximum likelihood estimation via the ECM algorithm: A general framework”. 《Biometrika》 80 (2): 267–278. doi:10.1093/biomet/80.2.267. MR 1243503.
↑ Hunter DR and Lange K (2004), A Tutorial on MM Algorithms Archived 2011년 7월 20일 - 웨이백 머신, The American Statistician, 58: 30-37
↑ Matsuyama, Yasuo (2003). “The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures”. 《IEEE Transactions on Information Theory》 49 (3): 692–706. doi:10.1109/TIT.2002.808105.
↑ Matsuyama, Yasuo (2011). “Hidden Markov model estimation based on alpha-EM algorithm: Discrete and continuous alpha-HMMs”. 《International Joint Conference on Neural Networks》: 808–816.
↑ Anandkumar, Animashree; Ge, Rong; Hsu, Daniel; Kakade, Sham M; Telgarsky, Matus (2014). “Tensor decompositions for learning latent variable models”. 《The Journal of Machine Learning Research》 15 (1): 2773–2832.
↑ Anandkumar, Animashree; Hsu, Daniel; Kakade, Sham M (2012). “A method of moments for mixture models and hidden Markov models”. 《arXiv preprint arXiv:1203.0683》.
↑ Anandkumar, Animashree; Ge, Rong; Hsu, Daniel; Kakade, Sham M (2014). “A tensor approach to learning mixed membership community models”. 《The Journal of Machine Learning Research》 15 (1): 2239–2312.
↑ Lange, Kenneth. “The MM Algorithm” (PDF).

추가 자료

Robert Hogg, Joseph McKean and Allen Craig. Introduction to Mathematical Statistics. pp. 359–364. Upper Saddle River, NJ: Pearson Prentice Hall, 2005.
The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay includes simple examples of the EM algorithm such as clustering using the soft k-means algorithm, and emphasizes the variational view of the EM algorithm, as described in Chapter 33.7 of version 7.2 (fourth edition).
Dellaert, Frank. “The Expectation Maximization Algorithm”. CiteSeerX: 10.1.1.9.9735, gives an easier explanation of EM algorithm in terms of lowerbound maximization.
Bishop, Christopher M. (2006). 《Pattern Recognition and Machine Learning》. Springer. ISBN 0-387-31073-8.
M. R. Gupta and Y. Chen (2010). 《Theory and Use of the EM Algorithm》. doi:10.1561/2000000034. A well-written short book on EM, including detailed derivation of EM for GMMs, HMMs, and Dirichlet.
Bilmes, Jeff. “A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models”. CiteSeerX: 10.1.1.28.613, includes a simplified derivation of the EM equations for Gaussian Mixtures and Gaussian Mixture Hidden Markov Models.
Variational Algorithms for Approximate Bayesian Inference, by M. J. Beal includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs (chapters).
The Expectation Maximization Algorithm: A short tutorial, A self-contained derivation of the EM Algorithm by Sean Borman.
The EM Algorithm, by Xiaojin Zhu.
EM algorithm and variants: an informal tutorial by Alexis Roche. A concise and very clear description of EM and many interesting variants.
Einicke, G.A. (2012). 《Smoothing, Filtering and Prediction: Estimating the Past, Present and Future》. Rijeka, Croatia: Intech. ISBN 978-953-307-752-9.

[Sundberg1974-1] Sundberg, Rolf (1974). “Maximum likelihood theory for incomplete data from an exponential family”. 《Scandinavian Journal of Statistics》 1 (2): 49–58. JSTOR 4615553. MR 381110.

[Sundberg1971-2] Rolf Sundberg. 1971. Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.

[Sundberg1976-3] Sundberg, Rolf (1976). “An iterative method for solution of the likelihood equations for incomplete data from exponential families”. 《Communications in Statistics – Simulation and Computation》 5 (1): 55–64. doi:10.1080/03610917608812007. MR 443190.

[4] G. Kulldorff. 1961. Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.

[Martin-Löf1963-5] Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg formula")

[Martin-Löf1966-6] Per Martin-Löf. 1966. Statistics from the point of view of statistical mechanics. Lecture notes, Mathematical Institute, Aarhus University. ("Sundberg formula" credited to Anders Martin-Löf).

[Martin-Löf1970-7] Per Martin-Löf. 1970. Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969-1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")

[Martin-Löf1974a-8] Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. With a discussion by F. Abildgård, Arthur P. Dempster|A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, George A. Barnard|G. A. Barnard, O. Barndorff-Nielsen, J. D. Kalbfleisch and Rasch model|G. Rasch and a reply by the author. Proceedings of Conference on Foundational Questions in Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.

[Martin-Löf1974b-9] Martin-Löf, Per The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data. Scand. J. Statist. 1 (1974), no. 1, 3–18.

[Little1987-10] Little, Roderick J.A.; Rubin, Donald B. (1987). 《Statistical Analysis with Missing Data》. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons. 134–136쪽. ISBN 0-471-80254-9.

[neal1999-11] 가 ^나 Neal, Radford; Hinton, Geoffrey (1999). Michael I. Jordan, 편집. “A view of the EM algorithm that justifies incremental, sparse, and other variants” (PDF). 《Learning in Graphical Models》 (Cambridge, MA: MIT Press): 355–368. ISBN 0-262-60032-3. 2009년 3월 22일에 확인함.

[hastie2001-12] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). 〈8.5 The EM algorithm〉. 《The Elements of Statistical Learning》. New York: Springer. 236–243쪽. ISBN 0-387-95284-5.

[13] Einicke, G.A.; Malos, J.T.; Reid, D.C.; Hainsworth, D.W. (January 2009). “Riccati Equation and EM Algorithm Convergence for Inertial Navigation Alignment”. 《IEEE Trans. Signal Processing》 57 (1): 370–375. doi:10.1109/TSP.2008.2007090.

[14] Einicke, G.A.; Falco, G.; Malos, J.T. (May 2010). “EM Algorithm State Matrix Estimation for Navigation”. 《IEEE Signal Processing Letters》 17 (5): 437–440. Bibcode:2010ISPL...17..437E. doi:10.1109/LSP.2010.2043151.

[15] Einicke, G.A.; Falco, G.; Dunn, M.T.; Reid, D.C. (May 2012). “Iterative Smoother-Based Variance Estimation”. 《IEEE Signal Processing Letters》 19 (5): 275–278. Bibcode:2012ISPL...19..275E. doi:10.1109/LSP.2012.2190278.

[16] Jamshidian, Mortaza; Jennrich, Robert I. (1997). “Acceleration of the EM Algorithm by using Quasi-Newton Methods”. 《Journal of the Royal Statistical Society, Series B》 59 (2): 569–587. doi:10.1111/1467-9868.00083. MR 1452026.

[17] Meng, Xiao-Li; Rubin, Donald B. (1993). “Maximum likelihood estimation via the ECM algorithm: A general framework”. 《Biometrika》 80 (2): 267–278. doi:10.1093/biomet/80.2.267. MR 1243503.

[18] Hunter DR and Lange K (2004), A Tutorial on MM Algorithms Archived 2011년 7월 20일 - 웨이백 머신, The American Statistician, 58: 30-37

[19] Matsuyama, Yasuo (2003). “The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures”. 《IEEE Transactions on Information Theory》 49 (3): 692–706. doi:10.1109/TIT.2002.808105.

[20] Matsuyama, Yasuo (2011). “Hidden Markov model estimation based on alpha-EM algorithm: Discrete and continuous alpha-HMMs”. 《International Joint Conference on Neural Networks》: 808–816.

[anandkumar2014tensor-21] Anandkumar, Animashree; Ge, Rong; Hsu, Daniel; Kakade, Sham M; Telgarsky, Matus (2014). “Tensor decompositions for learning latent variable models”. 《The Journal of Machine Learning Research》 15 (1): 2773–2832.

[anandkumar2012method-22] Anandkumar, Animashree; Hsu, Daniel; Kakade, Sham M (2012). “A method of moments for mixture models and hidden Markov models”. 《arXiv preprint arXiv:1203.0683》.

[anandkumar2014tensor_community-23] Anandkumar, Animashree; Ge, Rong; Hsu, Daniel; Kakade, Sham M (2014). “A tensor approach to learning mixed membership community models”. 《The Journal of Machine Learning Research》 15 (1): 2239–2312.

[24] Lange, Kenneth. “The MM Algorithm” (PDF).

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]