Stemming

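In morphology and information retrieval, stemming is the process of removing affixes from an inflected or derived word in order to isolate its stem. The stem need not be identical to the word's morphological root; the goal is that related words map consistently to the same stem, even when that stem differs from the root. Stemming algorithms have been studied in computer science since the 1960s. Many web search engines treat words with the same stem as synonyms, a form of query expansion intended to improve the quality of search results.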

A program that performs stemming is commonly called a stemming algorithm or a stemmer.
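The query-expansion idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `crude_stem`, `build_stem_index`, and `expand_query` are hypothetical names introduced here, and `crude_stem` is a deliberately naive stand-in for a real stemmer, not an algorithm from the literature.

```python
from collections import defaultdict


def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(vocabulary):
    """Group every vocabulary word under the stem it maps to."""
    index = defaultdict(set)
    for word in vocabulary:
        index[crude_stem(word)].add(word)
    return index


def expand_query(term, index):
    """Return all known words sharing the query term's stem."""
    return sorted(index.get(crude_stem(term), {term}))


vocabulary = ["fish", "fished", "fishing", "cat", "cats", "stem"]
index = build_stem_index(vocabulary)
print(expand_query("fishing", index))  # ['fish', 'fished', 'fishing']
print(expand_query("cats", index))     # ['cat', 'cats']
```

A search engine using this approach would match documents containing any of the expanded terms. A term whose stem is not in the index is returned unchanged.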

Examples

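The following examples show how a stemmer for English might behave. From the string “cats” (and likewise “catlike” and “catty”), the stem “cat” is extracted. The stem of “stemmer”, “stemming”, and “stemmed” is “stem”. “fishing”, “fished”, and “fisher” reduce to “fish”. The stem of “argue”, “argued”, “arguing”, and “argus” is “argu”, a case in which the extracted stem matches neither the root nor the base form of the word. From “argument” and “arguments”, however, “argument” is extracted.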
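A minimal suffix-stripping sketch that reproduces the examples above is shown below. It is a toy written for this illustration: the suffix list, the minimum stem length, and the name `toy_stem` are assumptions, and it is not the Lovins or Porter algorithm. Real stemmers apply many more rules and conditions and may give different results for some of these words.

```python
# Suffixes are tried in this order, so "like" is tested before the bare "e".
SUFFIXES = ("like", "ing", "ed", "er", "y", "e")
MIN_STEM = 3          # never shorten a word below this many characters
VOWELS = set("aeiou")


def toy_stem(word: str) -> str:
    """Toy English stemmer: plural removal, one suffix, then undoubling."""
    word = word.lower()

    # Step 1: drop a plural "s" (but leave words ending in "ss" alone).
    if word.endswith("s") and not word.endswith("ss") and len(word) > MIN_STEM:
        word = word[:-1]

    # Step 2: strip the first matching suffix that leaves a long enough stem.
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            word = stem
            break

    # Step 3: collapse a doubled final consonant ("stemm" -> "stem").
    if len(word) > MIN_STEM and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]

    return word


for w in ["cats", "catlike", "catty", "stemmer", "stemming", "fisher",
          "argue", "argued", "arguing", "argus", "argument", "arguments"]:
    print(w, "->", toy_stem(w))
```

Because no rule strips “ment”, “argument” and “arguments” conflate with each other but not with “argue”, which mirrors the behaviour described in the text.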

History

The first stemmer was written by Julie Beth Lovins in 1968.[1] The paper is notable both for its early date and for its strong influence on later work in the field.

A later stemmer was written by Martin Porter and published in the July 1980 issue of the journal Program. It came into very wide use and became the de facto standard algorithm for English. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.

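Many people implemented the Porter stemming algorithm and distributed their implementations freely. Many of these versions, however, contained subtle flaws, so they did not perform as well as the algorithm should. To address this, Martin Porter released his own free implementation of the algorithm around 2000. Over the following years he extended this work into Snowball, a framework for developing stemming algorithms, and also produced an improved English stemmer along with stemmers for several other languages.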

See also

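Root
Stem
Morphology
Lexeme
Inflection
Derivation - stemming is a form of reverse derivation
Natural language processing - stemming is generally regarded as a form of natural language processing
Text mining - stemming algorithms are a core part of commercial natural language processing software
Computational linguistics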

References

[1] Lovins, Julie Beth (1968). “Development of a Stemming Algorithm”. Mechanical Translation and Computational Linguistics 11: 22–31.

Further reading

Dawson, J. L. (1974); Suffix Removal for Word Conflation, Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33–46
Frakes, W. B. (1984); Term Conflation for Information Retrieval, Cambridge University Press
Frakes, W. B. & Fox, C. J. (2003); Strength and Similarity of Affix Removal Stemming Algorithms, SIGIR Forum, 37: 26–30
Frakes, W. B. (1992); Stemming algorithms, Information retrieval: data structures and algorithms, Upper Saddle River, NJ: Prentice-Hall, Inc.
Hafer, M. A. & Weiss, S. F. (1974); Word segmentation by letter successor varieties, Information Processing & Management 10 (11/12), 371–386
Harman, D. (1991); How Effective is Suffixing?, Journal of the American Society for Information Science 42 (1), 7–15
Hull, D. A. (1996); Stemming Algorithms – A Case Study for Detailed Evaluation, JASIS, 47(1): 70–84
Hull, D. A. & Grefenstette, G. (1996); A Detailed Analysis of English Stemming Algorithms, Xerox Technical Report
Kraaij, W. & Pohlmann, R. (1996); Viewing Stemming as Recall Enhancement, in Frei, H.-P.; Harman, D.; Schauble, P.; and Wilkinson, R. (eds.); Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18–22, pp. 40–48
Krovetz, R. (1993); Viewing Morphology as an Inference Process, in Proceedings of ACM-SIGIR93, pp. 191–203
Lennon, M.; Pierce, D. S.; Tarry, B. D.; & Willett, P. (1981); An Evaluation of some Conflation Algorithms for Information Retrieval, Journal of Information Science, 3: 177–183
Lovins, J. (1971); Error Evaluation for Stemming Algorithms as Clustering Algorithms, JASIS, 22: 28–40
Lovins, J. B. (1968); Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics, 11: 22–31
Jenkins, Marie-Claire; and Smith, Dan (2005); Conservative Stemming for Search and Indexing
Paice, C. D. (1990); Another Stemmer, SIGIR Forum, 24: 56–61
Paice, C. D. (1996); Method for Evaluation of Stemming Algorithms based on Error Counting, JASIS, 47(8): 632–649
Popovič, Mirko; and Willett, Peter (1992); The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data, Journal of the American Society for Information Science, Volume 43, Issue 5 (June), pp. 384–390
Porter, Martin F. (1980); An Algorithm for Suffix Stripping, Program, 14(3): 130–137
Savoy, J. (1993); Stemming of French Words Based on Grammatical Categories, Journal of the American Society for Information Science, 44(1), 1–9
Ulmschneider, John E.; & Doszkocs, Tamas (1983); A Practical Stemming Algorithm for Online Search Assistance, Online Review, 7(4), 301–318
Xu, J.; & Croft, W. B. (1998); Corpus-Based Stemming Using Cooccurrence of Word Variants, ACM Transactions on Information Systems, 16(1), 61–81

External links

Apache OpenNLP - provides Porter and Snowball stemmers
SMILE Stemmer - free online service, includes Porter and Paice/Husk (Lancaster) stemmers (Java API)
Themis - open-source IR framework, includes a Porter stemmer implementation (PostgreSQL, Java API)
Snowball - free stemming algorithms and source code for many languages, including stemmers for five Romance languages
Snowball on C# - C# port of the Snowball stemmers (14 languages)
Language wrappers - Python bindings for the Snowball API
Ruby-Stemmer - Ruby extension for the Snowball API
PECL - PHP extension for the Snowball API
Oleander Porter's algorithm - C++ stemming library released under the BSD license
Unofficial home page of the Lovins stemming algorithm - includes source code in two languages
Official home page of the Porter stemming algorithm - includes source code in several languages
Official home page of the Lancaster stemming algorithm - Lancaster University, UK
Official home page of the UEA-Lite Stemmer - University of East Anglia, UK
Overview of stemming algorithms
PTStemmer - Java/Python/.Net stemming toolkit for Portuguese
jsSnowball - open-source JavaScript implementation of the Snowball stemming algorithms for many languages
Snowball Stemmer - Java implementation
hindi_stemmer - open-source stemmer for Hindi
czech_stemmer - open-source stemmer for Czech
Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers

This article incorporates material based on the Free On-line Dictionary of Computing (FOLDOC), which is distributed under the GFDL.
