MCF Tree-Based Clustering Method for Very Large Mixed-Type Data Set
- Subject (keywords): Clustering methods, Memory management, Histograms, Buildings, Estimation, Licenses, Computational efficiency, Clustering, CF plus tree, MCF tree, numeric data, categorical data, mixed-type data, very large data sets
- Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
- Publication year: 2021
- Publication type: Journal
- Text language: English
Abstract
Several clustering methods have been proposed for analyzing numerous mixed-type data sets composed of numeric and categorical attributes. However, existing clustering methods are not suitable for clustering very large mixed-type data sets because they require a high computational cost or a large memory size. We propose a novel clustering method for very large data sets using a mixed-type clustering feature (MCF) vector with summary information about a cluster. The MCF vector consists of the CF vector and a histogram to summarize the mixed-type values. Based on the MCF vector, we propose an MCF tree, along with a distance measure between the MCF vectors representing two clusters. Unlike previous studies that summarize a data set based on a fixed memory size, we estimate a small initial memory size of the data set for building the tree. Then, the memory size is adaptively increased to estimate a more accurate threshold by reflecting the size reduction in the re-built tree. Our theoretical analysis demonstrates the efficiency of the proposed approach. Experimental results on very large synthetic and real data sets demonstrate that the proposed approach clusters the data sets significantly faster than existing clustering methods while maintaining similar or better clustering accuracy.
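
To make the summary structure concrete, here is a minimal sketch of a mixed-type clustering feature. It assumes the CF part is the classical BIRCH triple (point count, linear sum, sum of squares) and that the histogram holds per-attribute category counts. The abstract does not give the distance measure, so `distance` below is an illustrative stand-in (Euclidean distance between centroids plus total variation distance between normalized histograms), not the measure from the paper. The names `MCF`, `add`, `merge`, and `distance` are hypothetical.

```python
import math
from collections import Counter


class MCF:
    """Hypothetical mixed-type clustering feature: CF triple plus histograms."""

    def __init__(self, n_numeric, n_categorical):
        self.n = 0                          # number of points summarized
        self.ls = [0.0] * n_numeric         # linear sum per numeric attribute
        self.ss = 0.0                       # sum of squared numeric values
        self.hist = [Counter() for _ in range(n_categorical)]  # category counts

    def add(self, numeric, categorical):
        """Absorb one data point into the summary."""
        self.n += 1
        for i, x in enumerate(numeric):
            self.ls[i] += x
        self.ss += sum(x * x for x in numeric)
        for h, value in zip(self.hist, categorical):
            h[value] += 1

    def merge(self, other):
        """Summaries are additive, so merging two clusters needs no raw data."""
        merged = MCF(len(self.ls), len(self.hist))
        merged.n = self.n + other.n
        merged.ls = [a + b for a, b in zip(self.ls, other.ls)]
        merged.ss = self.ss + other.ss
        merged.hist = [a + b for a, b in zip(self.hist, other.hist)]
        return merged

    def centroid(self):
        return [s / self.n for s in self.ls]


def distance(a, b):
    """Assumed measure: centroid distance plus histogram total variation."""
    numeric_part = math.dist(a.centroid(), b.centroid())
    categorical_part = 0.0
    for ha, hb in zip(a.hist, b.hist):
        keys = set(ha) | set(hb)
        categorical_part += 0.5 * sum(abs(ha[k] / a.n - hb[k] / b.n) for k in keys)
    return numeric_part + categorical_part


# Usage: two small clusters with two numeric and one categorical attribute.
a = MCF(2, 1)
a.add([0.0, 0.0], ["red"])
a.add([2.0, 0.0], ["red"])
b = MCF(2, 1)
b.add([1.0, 4.0], ["blue"])
merged = a.merge(b)
```

Because every field is a sum or a count, a tree node can keep one such summary per subcluster and update it in constant time per point, which is what lets this kind of structure summarize data that does not fit in memory.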