Clickstream Clustering for User Behavior Analysis

SAND Lab @ University of Chicago



Online services are increasingly dependent on user participation. Whether for online social networks or crowdsourcing services, understanding user behavior is important yet challenging. In this project, we build an unsupervised system that captures dominant user behaviors from clickstream data (traces of users’ click events) and visualizes the detected behaviors in an intuitive manner. Our system identifies "clusters" of similar users by partitioning a similarity graph (nodes are users; edges are weighted by clickstream similarity). The partitioning process leverages iterative feature pruning to capture the natural hierarchy within user clusters and to produce intuitive features for visualizing and understanding the captured user behaviors.
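To make the similarity graph concrete, here is a minimal sketch in Python. It assumes each user is summarized as a dictionary of action-pattern counts, and it uses cosine similarity as a stand-in edge weight; the actual similarity metric used by the system is the one described in the paper, and the function names below are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two sparse pattern-count dictionaries.

    Assumption: cosine similarity is only an illustrative stand-in for
    the clickstream similarity metric defined in the paper.
    """
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_similarity_graph(users):
    """Return {(user_u, user_v): weight} for every unordered pair of users."""
    return {
        (u, v): cosine_similarity(users[u], users[v])
        for u, v in combinations(sorted(users), 2)
    }

# Hypothetical users with made-up pattern counts.
users = {
    "u1": {"A": 1, "G": 10},
    "u2": {"A": 1, "G": 10},
    "u3": {"B": 5},
}
graph = build_similarity_graph(users)
```

In this toy graph, the edge between u1 and u2 gets a weight close to 1 (identical behavior), while edges to u3 get weight 0 because no patterns are shared. A partitioning step would then group u1 and u2 together.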

This demo presents the clustering results on large-scale clickstream traces from an anonymous social network, Whisper. Our system effectively identifies previously unknown behaviors, e.g., dormant users and hostile chatters. In addition, we have successfully applied clickstream-based behavior models to detect new attacks in real-world online social networks, including Renren and LinkedIn.



The project source code is available for download. The zip file contains a set of scripts that perform recursive hierarchical clustering on clickstream data and generate clusters of user behaviors.

For details about the input/output format and system configuration, please refer to the documentation. The algorithm itself is detailed in our paper.

A quick example is shown as follows.

$> python input.txt output/

  • input.txt: input file that contains information about user clickstreams. Each line represents one user and that user's clickstream patterns:
    user_id \t A(1)G(10)
    where A and G are action patterns, and 1 and 10 represent how many times the respective pattern appears in the user's clickstream.
  • output: the directory for temporary files and the final clustering result files.
    output/result.json will be the output file for the clustering results.
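
The input format above can be read with a few lines of Python. This is only a sketch of the line layout as described here (a user ID, a tab, then concatenated pattern(count) tokens); the helper name parse_line and the sample user ID are hypothetical, and the documentation remains the reference for the exact format.

```python
import re
from typing import Dict, Tuple

# Matches one "pattern(count)" token, e.g. "A(1)" or "G(10)".
_TOKEN = re.compile(r"([^()\s]+)\((\d+)\)")

def parse_line(line: str) -> Tuple[str, Dict[str, int]]:
    """Split one input line into (user_id, {pattern: count}).

    Assumption: the user ID and the pattern string are separated by a
    single tab, as in the example line shown above.
    """
    user_id, _, patterns = line.rstrip("\n").partition("\t")
    counts = {name: int(n) for name, n in _TOKEN.findall(patterns)}
    return user_id, counts

# Hypothetical example line following the documented layout.
user_id, counts = parse_line("user_42\tA(1)G(10)")
print(user_id, counts)  # prints: user_42 {'A': 1, 'G': 10}
```

Parsing into a dictionary of counts keeps each user's record sparse, which is convenient when most users exhibit only a few of the possible action patterns.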

Contact Us

We are a research team from the Department of Computer Science at the University of Chicago. If you have any questions, please don't hesitate to contact us.