Lemmatization

Lemmatization, in linguistics, is the process of grouping together the inflected forms of a word so that they can be analysed as a single item, identified by the word's lemma, or dictionary form.^[1]

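In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighbouring sentences or even the entire document. As a result, developing efficient lemmatization algorithms is an open area of research.^[2]^[3]^[4]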
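To illustrate why the part of speech matters, the following toy sketch keys a small lookup table on (word, part-of-speech) pairs. The table `LEMMA_TABLE`, the tag names, and the function `lemmatize_with_pos` are hypothetical and exist only for illustration; a real system would have to infer the tag from context rather than receive it as an argument.

```python
# Toy illustration: the same surface form can map to different lemmas
# depending on its part of speech. All names and entries are hypothetical.
LEMMA_TABLE = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
}


def lemmatize_with_pos(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


if __name__ == "__main__":
    # "meeting" as a noun keeps its form; as a verb it reduces to "meet".
    print(lemmatize_with_pos("meeting", "NOUN"))
    print(lemmatize_with_pos("meeting", "VERB"))
```

A stemmer that only strips suffixes would treat both uses of "meeting" identically, which is the distinction the paragraph above draws.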

Algorithm

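A simple way to perform lemmatization is by dictionary lookup. This works well for straightforward inflected forms, but other cases, such as languages with long compound words, require a rule-based system. Such rules can be written by hand or learned automatically from an annotated corpus.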
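The combination of dictionary lookup and hand-written rules can be sketched as follows. This is a minimal example for English under stated assumptions, not a production lemmatizer: the dictionary `IRREGULAR`, the rule list `SUFFIX_RULES`, the minimum stem length of three characters, and the function `lemmatize` are all hypothetical choices made for illustration.

```python
# Hypothetical exception dictionary for irregular forms (dictionary lookup).
IRREGULAR = {"went": "go", "mice": "mouse", "was": "be", "children": "child"}

# Hypothetical hand-written suffix rules as (suffix, replacement) pairs.
# Rules are tried in order and the first match wins.
SUFFIX_RULES = [
    ("ies", "y"),
    ("sses", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("ss", "ss"),  # keep a final double "s" so "glass" is left unchanged
    ("s", ""),
]

MIN_STEM_LENGTH = 3  # assumed guard against stripping very short words


def lemmatize(word: str) -> str:
    """Look the word up in the dictionary first, then fall back to rules."""
    token = word.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(token) - len(suffix)
        if token.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return token[:stem_length] + replacement
    return token


if __name__ == "__main__":
    for w in ["went", "ponies", "classes", "walked", "cats", "glass"]:
        print(w, "->", lemmatize(w))
```

Hand-written rules of this kind are crude: this sketch turns "running" into "runn", for instance, which is one reason rules are often learned from an annotated corpus instead.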

Footnotes

[1] Collins English Dictionary, entry for "lemmatize".
[2] "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages".
[3] Müller, Thomas; Cotterell, Ryan; Fraser, Alexander; Schütze, Hinrich (2015). "Joint Lemmatization and Morphological Tagging with LEMMING" (PDF). 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon: Association for Computational Linguistics. pp. 2268–2274. doi:10.18653/v1/D15-1272.
[4] Bergmanis, Toms; Goldwater, Sharon. "Context Sensitive Neural Lemmatization with Lematus" (PDF).
