free papers,research papers,free term paper samples

Grid-based clustering method

Abstract: The existing clustering algorithm that clusters of arbitrary shape for handling outliers and the result is not satisfactory, analysis of the existing grid-based clustering algorithm. Using the grid method of data analysis method space is divided by ( super) rectangular grid composed of grid cells, and then clustering the grid unit. Finally, concluding remarks and proposed grid-based clustering for further Research.

Keywords: data mining; grid; clustering

1 Introduction
Data mining is from a large database or data warehouse extraction of implicit, unknown and there is value in the Information or model. It is a useful database Research in the areas of application, integration of databases, machine learning, statistics other fields of theory and technology [1].

Cluster analysis is a widely studied data mining, one of the subjects, the data from the data to find similarities between, and so to classify the data to discover useful Information hidden in the data or knowledge. Has already made Many data clustering algorithms, the more famous are CLARANS [2], BIRCH [3], DBSCAN [4] and CLIQUE [5] and so on. But for high-dimensional, high-performance large-scale cluster analysis of the database is still an open study is open to question.

Spatial data processing grid method is commonly used in the method of discrete spatial data. Grid-based incremental clustering algorithm is easy to implement and carry out the high-dimensional data is widely used in clustering algorithms. Researchers have proposed a lot of grid-based clustering algorithms, including STING [6], which uses the storage unit in the grid statistics; WaveCluster [7] which use a wavelet transform method to cluster data objects; CLIQUE in high-dimensional data space in the grid and density based clustering methods.

In this paper, the existing grid-based clustering algorithm is studied, from the grid, said the method by grid cell, to the statistical Information grid, the grid cell nearest neighbor search, clustering the network than the specified threshold Unit cell analysis of the various steps, the final grid-based method for clustering of Research are put forward.


2 Definition and division of the grid
The basic concept of the grid, set A1, A2, ..., Ar is a data set O = {O1, O2, ..., On} r in the attributes of data objects bounded domain, then W = A1 �� A2 �� ... �� Ar is an r-dimensional space, the A1, A2, ..., Ar as the dimension of W (attributes, fields), then for data points that contains the r n-dimensional space in the data set O = {O1, O2 , ..., On}, where Oi = {Oi1, Oi2, ..., Oir} (i = 1, 2, ..., n), Oi j-th component of Oij �� Aj. the W M decile for each dimension, W that is divided into a grid cell.

The first step in grid-based clustering algorithm is to divide the grid structure, according to the search subspace different strategy, mainly based on cell division by the end of method to online algorithms and mesh-based method of top-down algorithm.

2.1 divided by the bottom-up approach
Mesh from the bottom-up method of division according to user input parameters (number of segments per dimension ki, 1 �� i �� d), the data space is divided into equal-sized uniform grid cell, assuming that fall within the same grid cell all data points belong to the same cluster, each grid cell within the data saved into the statistics, such as the number of data points, data points and. contains a certain number of data points in the grid cell is known as high-density network grid unit.

WaveCluster CLIQUE is used with the grid by the end of division method to the representation of algorithms. WaveCluster treatment of low-dimensional spatial data, its performance beyond the BIRCH, CLARANS, and other good clustering algorithm DBSCAN [15]. CLIQUE considered a high-dimensional subspace clustering, but it is high time complexity, require the user to specify the global density threshold. algorithm MAFIA [8] of CLIQUE is improved, in order to reduce the clustering algorithm to deal with the number of grid cells, MAFIA grid will be evenly divided Each dimension in the distribution density of the data merge similar adjacent sections, which are divided into a uniform grid. The more uniform distribution in the data grid, the regional division of a large size, uneven distribution of the data by region of grain is small, this uneven division of the grid method can improve the quality of clustering, many algorithms to be used by the follow-up.

Adopted by the bottom-up approach has the advantage of meshing, it can scan through the data again, the data compressed into a grid data structure and data structure based on the grid and found that clusters of arbitrary shape. In addition, if grid cell size is smaller (ie smaller), then get the cluster of high accuracy, but the computational complexity is large. In addition, the grid from the bottom-up approach has not suited to deal with high dimensional data problems. in high dimensional space, the distribution of data is very sparse, the role of the grid method loses its compression, and belong to the same cluster of high-density grid cells may not be connected, which makes clustering algorithm can not find a reasonable number of cluster.

2.2 by top-down approach
Top-down approach taken by the grid partition strategy (divide and conquer principle), a recursive partition of data space, decreasing the scale of the problem. First the original data space is divided into several large areas. obtained for each region, division of the implementation process is repeated until each region contains a cluster belong to the same data points, then the area is the final grid cell. grid method based on top-down clustering algorithm directly to the high density grid cell identified as a cluster, or grid will be connected to the unit identified as high-density clusters.

OptiGrid [9] and CLTree [10] are two typical mesh-based method of top-down clustering algorithm. Which, OptiGrid is the density of data distribution with the spatial Information to select the optimal partition. Through a density function to determine the cutting plane, the data space can be divided into regular or irregular units such as the traditional division of space than you can use this to solve the problem of high dimensional clustering. And CLTree after use by the best Information gain to select division.

By top-down approach does not require the user to specify the main advantage of division of parameters, but the distribution of the data to divide the space, so this division is more reasonable. Data space dimension on the top-down approach of the grid small, can quickly focus on large high dimensional data cluster separated. This class method computational complexity and data set size and dimension of the linear relationship were tested for treatment of high dimensional data. As division is based on the data distribution , while the noise is usually considered uniformly distributed in the whole space, so the division of top-down method is not sensitive to noise. However, because this method are much larger than the size of the grid unit to the grid by the end of the grid method unit volume, the method produces more accurate than the description of the cluster up by the end of the grid description of the cluster obtained lower accuracy. and in the division of top-down process, with a cluster can be divided into different regions , end up in the same area may contain different clusters, thus further reducing the accuracy of the algorithm. Another drawback of such methods by the classification process is that it requires multiple scans of the data set.

Divided by the bottom-up approach is the only data set to conduct a linear scan, and a description of the cluster with higher precision. Therefore, the two methods for different problems. The former is suitable for handling high dimensional data sets, which can effectively access to large cost of processing large data sets and dynamic data.


3 Grid-based clustering
Grid-based clustering algorithm basic process is, first of all the data space is divided into grid cell W, the data object set O is mapped to the grid cell, and calculate the density of each unit. Based on user input density threshold MinPts each grid cell to determine whether the high-density units, from the neighboring cell groups to form dense clusters [11], as shown in Table 1.


Algorithm 1 Step 1 in the above detailed description has the following steps 2-4 described the content of the concrete.

Density of 3.1 grid unit
Cluster is a region that is greater than the density of points in the adjacent area. In the mesh data structure, because each grid cell has the same size, the grid cell in which the density of data points fall unit is the number of dots. This gives the density of the dense grid cell is set at a time t the density of a grid cell density, the definition of density = number of data points within the unit / data space the total number of data points, the density threshold set for the user to enter the density threshold, when the density>, the grid cell is - a dense grid cell.

Relative to the dense grid cells, most of the grid cell that contains very little data or empty, this type of grid cell is called sparse grid cell. A large number of sparse grid cell there will be a great reduce the rate of clustering need to cluster the grid prior to the unit dealing with the sparse, sparse-density threshold is defined, when the density>, the grid cell is - a sparse unit. For the sparse grid cell approach generally method of using compressed or delete method, if necessary to retain the sparse grid cell for further processing, you can use compression methods; if the basis of existing data clustering directly, you can delete the sparse grid cell, theoretical analysis and proved to delete the sparse grid cell does not affect the quality of clustering [12].

Links to Research Papers Download http://www.hi138.com 3.2 by the formation of dense clusters of grid cells
Grid-based clustering algorithm, according to the above analysis, the cells adjacent to the formation of dense clusters is relatively straightforward, which is grid-based approach one of the advantages. But needs to first define the meaning of adjacent cells. Let n-dimensional Spaces in the presence of any two cells U1 and U2, when the two cells in the - one-dimensional intersection or on a public face with when they are called adjacent grid cell.

In the two-dimensional space, the more often used the 4-connection neighbors adjacent to the definition and the definition of 8-connection (Figure 1) ,4-connection is more suitable for use in the clustering algorithm. Because when looking for a grid cell neighbors, in the 4-connection definition, a grid cell only 2d neighbors, and in the 8-connection definition, there are 3d-1 neighbors, when the data dimension d is large, this number is very large. Use 4-connection not only involved in a significant decrease in the number of computing units, and units to increase the relationship with the dimension of the exponential growth into linear growth, so the algorithm can further reduce the time required to run with low computational complexity [13 ]. aloof, only in very exceptional circumstances, the use of 4-connection will be the definition of clustering results obtained using the 8-connection with the definition of the clustering results to differ [14], this is because, when the 4-connection of grid cell is a high-density grid cell, the four diagonal grid cell whether or not a high-density grid cell, can be the right clustering; only when the cell of a grid with the diagonal phase 2 adjacent grid cell while the cell itself is empty and high-density grid cell, you can not correct clustering, the division of the grid, usually require much less than the grid cell size of the cluster size, it can be that this situation appears likely to be small.



4 Conclusion and Outlook
Grid-based clustering method has the advantage is its processing speed, because the speed has nothing to do with the number of data objects, but only depends on the data space in number of units of each dimension, that any shape, any size of the cluster The calculated results has nothing to do with the data input in order to calculate the amount of time has nothing to do with the data, but does not require the same means as the k pre-specified number of such clusters. However, grid-based method of clustering algorithm input parameters greater impact on the clustering results and more difficult to set these parameters. When there is noise in the data, if without special treatment, the quality will be poor clustering algorithm. Moreover, algorithms for the poor scalability of data dimensions.

Grid-based clustering method there are still some urgent problems, mainly the following points: (1) When the clusters have different densities, the overall density parameter can not effectively find such clusters, need to develop with variable density parameters of the algorithm. (2) clustering for different types of data problems, such as for high-dimensional data, the data grid will rapidly increase, the need for effective technology to discover neighboring units. (3) When the scale of the data set and data has geographical distribution, the need to develop efficient parallel algorithms to improve processing speed. (4) the optimization of existing grid algorithm, to improve different aspects of the grid algorithm. such as the development of sparse mesh compression algorithm, the density merging algorithm similar to the grid.

In this paper, grid-based clustering method has been analyzed and summarized Research, including the definition and division of the grid method to determine the density of the grid cell by grid cell form a cluster adjacent to the clustering process; Finally, advantages and limitations of the grid clustering method to summarize the existing research and analysis, based on a follow-up need to focus on resolving the problem.


References
[1] CHENM S, HAN Jiawei, YUP S. Datamining: an overviewfrom a database perspective [J]. IEEE Trans on Knwledge and Data Eng.1996, 8 (6) :866-883.

[2] NG RT, HAN J. Efficient and effective clustering methods for spatial data mining [C]. Proc of the 20th VLDB Conference.Chile, Santia.1994 :144-155.

[3] ZHANG T, RAMAKRISHNAN R, LIVNY M. An efficient data clustering method for very large databases [C]. Proc of ACM SIGMOD International Conference on Management of Data. New York: ACM Press ,1996:103-114.

[4] ESTER M, KRIEGEL HP, SANDER JA density-based algorithm for discovering clusters in large spatial databases with noise [C]. Proc of the 2nd International Conference on Knowledge Discovering in Databases and Data Mining.Oregon ,1996:122-128 .

[5] AGRAWAL R, GEHRKE J, GUNOPOLOS D. Automatic subspace clustering of high dimensional data for data mining applications [C]. Proc of ACM SIGMOD International Conference on Management of Data.New York: ACM Press ,1998:94-105.

[6] Wang W, Yang J, Muntz R. STING: A Statistical Information Grid Approach to Spatial Data Mining [C]. In: Proceedings of the 23rd VLDB Conference.Athens, Greece ,1997.186-195.

[7] Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases [C]. In: Proceedings of the 24th VLDB Conference.New York, USA ,1998.428-439.

[8] Goil S, Nagesh H, Choudhary A. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets [C]. Technical Report No.CPDC-TR-9906-010, Center for Parallel and Distributed Computing, 1999.

[9] Hinneburg A, Keim D A. Optimal Grid-Clustring: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering [C]. In: Proceedings of the 25th VLDB Conference.1999.506-517.

[10] Liu B, Xia Y, Yu P S. Clustering Through Decision Tree Construction [C]. In: Proceedings of the Ninth International Conference on Information and Knowledge Management.2000.20-29.

[11] Pang-Ning Tan, Michael Steinbach.Introduction to Data Mining [J] .2005,372-373.

[12] Chen Y, Tu L.Density-Based Clustering for Real-Time Stream Data [J]. ACMKDD'07, August 12-15,2007, San Jose, California, USA.133-142.

[13] Cao its Yu Lan, Zhi-Hui. Grid-based clustering algorithm for mining outliers [J]. Computer Engineering .2006 (6).

[14] Sun Yufen. Grid-based method of clustering algorithm [J]. Huazhong University of .2006.

[15] Han J, Kamber M. Data Mining: Concepts and Techniques [J]. Morgan Kaufmann Publishers, 2001.

Links to Research Papers Download http://www.hi138.com

Newest Research Papers

  • Newest
  • Computer Theory Papers

MOST POPULAR Computer Theory Papers

  • 24Hours
  • 7Days
  • 30Days