Welcome to DeBaCl

DeBaCl is a Python library for estimation of density level set trees and nonparametric DEnsity-BAsed CLustering. Level set trees are based on the statistically-principled definition of clusters as modes of a probability density function. They are particularly useful for analyzing structure in complex datasets that exhibit multi-scale clustering behavior. DeBaCl is intended to promote the practical use of level set trees through improvements in computational efficiency, flexible algorithms, and an emphasis on modularity and user customizability.

DeBaCl is available under the 3-clause BSD license. Some useful links:

Installation

The most straightforward way to install DeBaCl is to download it from the PyPI server. In a bash terminal (linux and Mac):

$ pip install debacl

The current plan is to post a new version of DeBaCl on the PyPI server roughly every two months. In between “official” updates, the latest code can be installed by downloading the DeBaCl repo directly from GitHub. In a bash terminal, the following commands download DeBaCl to a newly created “DeBaCl” folder in your working directory, then add the new folder to the python path:

$ git clone https://github.com/CoAxLab/DeBaCl/
$ export PYTHONPATH='DeBaCl'

DeBaCl depends on the Python packages numpy, networkx, and prettytable, and recommends the packages matplotlib (for plotting level set trees), scipy, and scikit-learn (for utilities that compute nearest neighbors). These packages can all be installed with either “conda” or “pip”:

$ pip install numpy networkx prettytable
$ pip install matplotlib scipy scikit-learn

Quickstart

DeBaCl runs in any standard Python 2.7 interpreter, such as IPython:

$ pip install ipython
$ ipython

The first step is to simulate some data. Here we use the scikit-learn datasets module to draw 100 observations from the “two moons” distribution. The data is stored in a 100 x 2 numpy array.

>>> from sklearn.datasets import make_moons
>>> X = make_moons(n_samples=100, noise=0.1, random_state=19)[0]

Next we import DeBaCl and construct the level set tree for our simulated dataset.

>>> import debacl as dcl
>>> tree = dcl.construct_tree(X, k=10, prune_threshold=10)

The construct_tree method takes a tabular numpy array as input and returns a LevelSetTree object. The parameter k indicates how many points to consider neighbors for each point when constructing a similarity graph; higher values of k lead to a more connected graph and a “smoother” level set tree. The prune_threshold parameter is the minimum size of branches in the output tree; if any branches are smaller than this number after the level set tree is created, they are merged with nearby branches.

Printing the tree lets us see the key statistics for the branches. Each row of the print output shows the starting and ending density and mass levels of the corresponding branch, the number of points that belong to the branch, and the parent and children branches (if any).

>>> print(tree)
+----+-------------+-----------+------------+----------+------+--------+----------+
| id | start_level | end_level | start_mass | end_mass | size | parent | children |
+----+-------------+-----------+------------+----------+------+--------+----------+
| 0  |    0.000    |   0.196   |   0.000    |  0.220   | 100  |  None  |  [1, 2]  |
| 1  |    0.196    |   0.396   |   0.220    |  0.940   |  37  |   0    |    []    |
| 2  |    0.196    |   0.488   |   0.220    |  1.000   |  41  |   0    |    []    |
+----+-------------+-----------+------------+----------+------+--------+----------+

For complex level set trees, the console output does not convey good intuition about the shape of the tree, and plotting the tree is a better option. Each vertical line segment represents a branch of the level set tree; the bottom endpoint is at the density level where the branch is born and the top endpoint is at the density level where the branch either splits or vanishes.

>>> fig = tree.plot(form='density')[0]
>>> fig.show()
_images/readme_tree.png

Finally, use the get_clusters method to retrieve cluster labels from the level set tree.

>>> labels = tree.get_clusters()

By default, each leaf node of the tree (i.e. a branch without any children) becomes a cluster. Clusters are returned in the form of a numpy array with two columns; the first column is the row index of the data point in the original dataset, and the second is an integer cluster label.

Level Set Tree constructors

construct_tree Construct a level set tree from tabular data.
construct_tree_from_graph Construct a level set tree from a similarity graph and a density estimate.
load_tree Load a saved tree from file.

Level Set Tree methods

LevelSetTree Level Set Tree attributes and methods.
LevelSetTree.branch_partition Partition the input data.
LevelSetTree.get_clusters Retrieve cluster labels from the level set tree.
LevelSetTree.get_leaf_nodes Return the indices of leaf nodes in the level set tree.
LevelSetTree.plot Plot the level set tree as a dendrogram and return coordinates and colors of the branches.
LevelSetTree.prune Prune the tree by recursively merging small leaf nodes into larger sibling nodes.
LevelSetTree.save Save a level set tree object to file.

Utilities

define_density_level_grid Create an evenly spaced grid of density levels.
define_density_mass_grid Create a grid of density levels with a uniform number of points between each level.
epsilon_graph Construct an epsilon-neighborhood graph, represented by an adjacency list.
knn_density Compute the kNN density estimate for a set of points.
knn_graph Compute the symmetric k-nearest neighbor graph for a set of points.
reindex_cluster_labels Re-index integer cluster labels to be consecutive non-negative integers.