Recursive Hierarchical Clustering

This script performs recursive hierarchical clustering on clickstream data, and generate clusters of user behaviors. The system does three things: (1) group similar users into clusters; (2) identify the natural hierarchy within user clusters; (3) select key features to help to interpret the semantic meaning of each cluster. [Code Download]


Dependency

Python library numpy, scipy is required.

Usage

The name of the main script is recursiveHierarchicalCustering.py. Two ways of executing the script: command line interface or python import.

Command Line Interface

$> python recursiveHierarchicalClustering.py inputFile outputPath/ sizeThreshold

Arguments

inputFile: input file that contains the clickstream information for users to be clustered. Each line represents one user.

user_id \t A(1)G(10)

where A and G are action patterns and 1 and 10 represent how many times that the action pattern appears in this user's clickstream. In our case, action patterns refer to k-grams or subsequences of the clickstream. user_id grows from 1 to the total number of users.

outputPath: The directory to store all the temporary files and the final result.

sizeThreshold (optional): Defines the minimum size of the clusters. 0.05 means that clusters containing less than 5% of the total users will not be further split.

Python Interface

import recursiveHierarchicalClustering as rhc
data = rhc.getSidNgramMap(inputPath)
matrixPath = '%smatrix.dat' % inputPath
treeData = rhc.runDiana(outputPath, data, matrixPath)

treeData is the resulting cluster tree. Same as output/result.json if ran through CLI.

Result

outputPath/matrix.dat: A distance matrix for the root level (a temporary file for quick computation). If the file is available, the script will read in the matrix instead of calculating it again. The file format is a N*N distance matrix scaled to integer in the range of (0-100).

outputPath/result.json: A file for the clustering result, in the form of ['t', sub-cluster list, cluster info] or ['l', user list, cluster info].

Configuration

This script is designed to be distributed on multiple machines with shared file system. However, it is also be configured to run locally on a single machine. The configuration file is server.json in the following format:

{
    "threadNum": 5,
    "minPerSlice": 1000,
    "server":
        ["server1.example.com", "server2.example.com"]
}