Pusan National University

Date posted: 2020.10.27

Date modified: 2020.10.27

Posted by: 최용석

정민지 (2017). Creation and clustering of proximity data for text data analysis (Korean Statistical Society paper post)

Abstract

A document-term frequency matrix is a type of data used in text mining. This matrix is often based on various
documents provided by the x-objects to be analyzed. When analyzing x-objects using this matrix, researchers
generally select only terms that are common in documents belonging to one x-object as keywords. Keywords
are used to analyze the x-object. However, this method misses information unique to individual
documents and also removes potential keywords that occur frequently in a specific
document. In this study, we define data that can overcome this problem as proximity data. We introduce
twelve methods that generate proximity data and cluster the x-objects through two clustering methods of
multidimensional scaling and k-means cluster analysis. Finally, we choose the method best suited to
clustering the x-objects.
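
The abstract does not spell out the twelve generation methods, so the following is only a minimal sketch of one generic way to turn a document-term frequency matrix into proximity data: TF-IDF weighting followed by pairwise cosine dissimilarity. The toy documents, the smoothed IDF formula, and the function names are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

# Toy corpus: hypothetical documents, not data from the paper.
docs = {
    "A": "text mining finds keywords in documents",
    "B": "keywords summarize documents in text mining",
    "C": "k-means groups points into clusters",
}

def tfidf_vectors(docs):
    """Return the vocabulary and one TF-IDF vector per document."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocab = sorted({term for c in counts.values() for term in c})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
    # Smoothed IDF (an assumed variant) so every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = {
        name: [c[t] * idf[t] for t in vocab] for name, c in counts.items()
    }
    return vocab, vectors

def cosine_dissimilarity(u, v):
    """Return 1 minus the cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

vocab, vectors = tfidf_vectors(docs)
names = sorted(vectors)
# Proximity data: a symmetric table of pairwise dissimilarities.
proximity = {
    (p, q): cosine_dissimilarity(vectors[p], vectors[q])
    for p in names
    for q in names
}
```

A dissimilarity table of this kind is the usual input to multidimensional scaling, and the resulting low-dimensional coordinates could then be passed to k-means; both steps are left out here to keep the sketch dependency-free.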

Keywords: text mining, proximity data, TF-IDF, multidimensional scaling, cluster analysis
