Latent semantic analysis

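Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix of word counts per document (rows represent unique words and columns represent documents) is constructed from a large body of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among the columns. Documents are then compared by the cosine similarity between any two columns. Values close to 1 indicate very similar documents, while values close to 0 indicate very dissimilar documents.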
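The pipeline described above (count matrix, truncated SVD, cosine comparison of documents) can be illustrated with a minimal sketch. It assumes NumPy is available and uses a made-up four-document corpus with whitespace tokenization; the helper names `term_document_matrix`, `lsa_document_vectors`, and `cosine` are hypothetical and chosen only for this example. Real applications typically add stop-word filtering, stemming, and term weighting, none of which are shown here.

```python
import numpy as np


def term_document_matrix(docs):
    """Return (vocab, A) where A[i, j] counts term i in document j."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenized):
        for w in doc:
            A[index[w], j] += 1
    return vocab, A


def lsa_document_vectors(A, k):
    """Represent each document by its k coordinates in the latent space.

    With A = U diag(s) Vt, the rank-k document coordinates are the
    columns of diag(s[:k]) @ Vt[:k, :]; one row per document is returned.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (s[:k, None] * Vt[:k, :]).T  # shape: (n_docs, k)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy corpus (an assumption for illustration): two documents about pets,
# two about markets, with no vocabulary shared between the two groups.
docs = [
    "cat dog pet",
    "dog pet animal",
    "stock market price",
    "market price trade stock",
]

vocab, A = term_document_matrix(docs)
doc_vecs = lsa_document_vectors(A, k=2)

sim_related = cosine(doc_vecs[0], doc_vecs[1])    # both pet documents
sim_unrelated = cosine(doc_vecs[0], doc_vecs[2])  # pet vs. market document
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

In this toy corpus the two pet documents end up with a similarity close to 1 and the pet and market documents with a similarity close to 0, matching the interpretation given above. The choice of `k=2` is arbitrary here; in practice the number of retained dimensions is a tuning parameter.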

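An information retrieval technique using latent semantic structure was patented in 1988 (US Patent 4,839,853, now expired) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum, and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI).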

External links

Articles on LSA

Latent Semantic Analysis, a Scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.

Talks and demonstrations

LSA Overview, talk by Prof. Thomas Hofmann describing LSA, its applications in Information Retrieval, and its connections to probabilistic latent semantic analysis.
Complete LSA sample code in C# for Windows. The demo code includes enumeration of text files, filtering stop words, stemming, making a document-term matrix and SVD.

Implementations

Due to its cross-domain applications in Information Retrieval, Natural Language Processing (NLP), Cognitive Science and Computational Linguistics, LSA has been implemented to support many different kinds of applications.

Sense Clusters, an Information Retrieval-oriented Perl implementation of LSA
S-Space Package, a Computational Linguistics and Cognitive Science-oriented Java implementation of LSA
Semantic Vectors applies Random Projection, LSA, and Reflective Random Indexing to Lucene term-document matrices
Infomap Project, an NLP-oriented C implementation of LSA (superseded by the Semantic Vectors project)
Text to Matrix Generator, a MATLAB toolbox for generating term-document matrices from text collections, with support for LSA
Gensim contains a Python implementation of LSA for matrices larger than RAM.