# chromcocluster 

**chromcocluster** is a Python package for co-clustering  chromatin accessibility data across multiple cell types in a manner that reflects a cell type lineage tree.  

**chromcocluster** takes as input a tree describing the lineage structure of a collection of cell types and an accessibility matrix describing genomic accessibility across the cell types and provides as output a clustering of the loci and a clustering of the cell types.  Together the locus and cell type clusters decompose the accessibility matrix into a grid of submatrices, a co-clustering, representing subsets of loci with similar accessibility across subsets of cell types.  

Locus clustering is achieved through the Louvain algorithm.   Cell type clustering is achieved through an algorithm that selects as clusters coherent components of the cell type lineage tree, thereby associating accessibility patterns with lineage structure. Details regarding the algorithms implemented in **chromcocluster** can be found in

* *George, Strawn, and Leviyang.  Tree Based Co-Clustering Identifies Chromatin Accessibility Patterns Associated with
the Lineage Structure of Hematopoiesis, bioRxiv.*

**chromcocluster** was written by Thomas George and Sivan Leviyang.  If you use **chromcocluster**, please cite the George et al. reference above.  See the GitHub page, `https://github.com/SLeviyang/chromcocluster`, for example input files and source files.

**Please do not hesitate to post any questions or comments to https://github.com/SLeviyang/chromcocluster/issues or email Sivan.Leviyang@georgetown.edu.** 


# Obtaining and Using chromcocluster

**chromcocluster** is available on PyPI for Python 3 (>=v3.6).  Installation is easiest through a call to pip.

```shell
python3 -m pip install chromcocluster
```

After installation, import **chromcocluster** modules into python as described below.

The following packages and external tools are required by **chromcocluster**:

1. Python packages: os, sys, pandas, numpy, random, igraph, matplotlib, seaborn, sknetwork.clustering, scipy.sparse, scipy.stats, multiprocessing, time
2. The bedtools suite of utilities must be installed.  Download bedtools from `https://bedtools.readthedocs.io/en/latest/`.  Bedtools is written by the Quinlan laboratory at the University of Utah.
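
Before building an accessibility matrix, it can help to confirm that bedtools is reachable. The following is a minimal sketch using only the Python standard library; the helper name `find_bedtools` is hypothetical and is not part of **chromcocluster**.

```python
import shutil

def find_bedtools(name="bedtools"):
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(name)

bedtools_path = find_bedtools()
if bedtools_path is None:
    print("bedtools was not found on PATH; install it or note its full path")
else:
    print("bedtools found at", bedtools_path)
```

If bedtools is installed outside of `PATH`, its full path can be supplied by hand wherever a bedtools path is requested.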



# Input Files

**chromcocluster** co-clustering requires two csv files as input:

1. **A csv file containing the accessibility matrix**.  The accessibility matrix must be binary (i.e. all entries 0 or 1) with rows corresponding to genomic loci and columns to cell types.  A 0 and 1 represent an inaccessible and accessible locus, respectively.  The csv file must contain a header line giving the cell type (i.e. column) names.
2. **A csv file containing the edges in the tree representing the lineage structure of the cell types**.   The csv file should contain exactly two columns, with each row describing an edge.  Edges are specified by providing the start and end cell type name.  The edge list csv file should contain a header line providing column names, but the column names can be chosen by the user.
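
As an illustration of the two formats, the sketch below parses a tiny accessibility matrix and edge list and checks the requirements listed above. The cell type names, column names (`parent`, `child`), and helper functions are made up for this example and are not part of **chromcocluster**.

```python
import csv
import io

# Hypothetical file contents; real files would be read from disk instead.
MATRIX_CSV = "HSC,B,T\n1,0,0\n1,1,0\n0,1,1\n"
EDGES_CSV = "parent,child\nHSC,B\nHSC,T\n"

def read_matrix(text):
    """Return (cell type names, rows) and check that the matrix is binary."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, body = rows[0], rows[1:]
    matrix = [[int(value) for value in row] for row in body]
    for row in matrix:
        if len(row) != len(header):
            raise ValueError("every row needs one entry per cell type")
        if any(value not in (0, 1) for value in row):
            raise ValueError("the accessibility matrix must be binary")
    return header, matrix

def read_edges(text):
    """Return the edges as (start, end) pairs, skipping the header line."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    edges = [tuple(row) for row in rows[1:]]
    if any(len(edge) != 2 for edge in edges):
        raise ValueError("each edge needs exactly two cell type names")
    return edges

cell_types, matrix = read_matrix(MATRIX_CSV)
edges = read_edges(EDGES_CSV)
print(cell_types, matrix, edges)
```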

#### Creating the Accessibility Matrix CSV File

The user can use any workflow to create the accessibility matrix.  However, for the sake of convenience, **chromcocluster** includes a module for constructing an accessibility matrix from a collection of bed files.   The bed files must be generated by the user, with one bed file for each cell type.  Each bed file provides the accessible genomic loci for the given cell type.

(*A typical workflow to generate the bed files starts with ATACseq generated fastq files, uses an aligner such as bowtie to create bam files, and then uses a peak caller such as MACS2 to call peaks and output a bed file for each cell type.  The idr tool (`https://github.com/nboley/idr`) is useful in merging multiple bed files for a single cell type into a single bed file. See the ENCODE consortium ATACseq workflow for a particular example.*)

The bed files must not have a header line and must contain the following information for each locus as columns:

* `chr`: chromosome (e.g. chr10)
* `chrStart` : start position of the peak
* `chrEnd` : end position of the peak
* `summit` : location of summit of peak relative to the chrStart value using 0-indexing.
* `q`: quality score, assumed to be in -log10 form, so that a larger value is better.
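
To make the column layout concrete, here is a small sketch that pulls the five fields out of one tab-separated bed line. The example line is invented and follows the narrowPeak layout written by MACS2 (an assumption about the user's files); `parse_peak` is a hypothetical helper, not a **chromcocluster** function.

```python
def parse_peak(line, columns=(0, 1, 2, 9, 8)):
    """Extract chr, chrStart, chrEnd, summit, and q from one bed line.

    `columns` gives the 0-indexed positions of the five fields, in that order.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, summit, q = (fields[i] for i in columns)
    return {
        "chr": chrom,
        "chrStart": int(start),
        "chrEnd": int(end),
        "summit": int(summit),  # offset from chrStart, 0-indexed
        "q": float(q),          # -log10 scale, larger is better
    }

# Invented narrowPeak-style line: q is in column 8 and the summit in column 9.
example = "chr10\t1000\t1400\tpeak_1\t500\t.\t8.5\t12.1\t9.7\t180\n"
peak = parse_peak(example)
summit_position = peak["chrStart"] + peak["summit"]
print(peak, summit_position)
```

Because `summit` is an offset, the genomic position of the summit is `chrStart + summit`.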

The `accessibility_matrix` module of **chromcocluster** can be used to generate the accessibility matrix through the following code.

```python
import chromcocluster.accessibility_matrix as am
a = am.accessibility_matrix("peak_files", [0,1,2,9,8])
a.create_master_peak_list("master.bed", window_radius=250)
a.create_accessibility_matrix("master.bed", "matrix.csv", bedtools_path="/opt/local/bin/bedtools")
```

* The first line after the module import constructs an `accessibility_matrix` object.  In this case, the bed files are in the `"peak_files"` directory and columns 0, 1, 2, 9, and 8 (0-indexed) of each bed file give the chr, chrStart, chrEnd, summit, and q values, respectively.  All files in the directory with a .bed suffix are identified and associated with a cell type.  The cell type names are the bed file names with the .bed suffix dropped.  For example, if a bed file is `DC.bed` then the corresponding cell type name will be `DC`.

* The second line calls the `create_master_peak_list` method, which creates a master list of non-intersecting windows containing all locus summits across all the bed files.  All windows are centered at a peak summit and are of size `2*window_radius+1`.  In this example, every locus summit is enclosed in a 501 base pair window and the method constructs a master list of non-overlapping windows.  The non-overlapping windows are then written to a bed file, in this case `master.bed`.  

* The third line constructs the accessibility matrix.  The method `create_accessibility_matrix` takes as arguments the path to the master list bed file (in this case `master.bed`), the file name to which the matrix should be written in csv format (in this case `matrix.csv`), and the path to the bedtools executable (see above for the bedtools download link).    The output accessibility matrix, in this case `matrix.csv`, has a row for each non-intersecting window in the master peak list.  A particular cell type has a 1 in the row corresponding to a window if one of the cell type's accessible loci intersects with the window.
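
The window arithmetic and the rule for filling in a matrix row can be sketched in a few lines. This is an illustration only: it assumes half-open `(start, end)` intervals on a single chromosome, it does not show how overlapping windows are resolved into the master list, and the function names are hypothetical rather than taken from **chromcocluster**.

```python
def summit_window(chr_start, summit, window_radius=250):
    """Half-open window of 2*window_radius+1 bases centered on the summit."""
    center = chr_start + summit
    return center - window_radius, center + window_radius + 1

def overlaps(a, b):
    """True when two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def matrix_row(window, peaks_by_cell_type, cell_type_names):
    """1 for each cell type with at least one peak intersecting the window."""
    return [
        int(any(overlaps(window, peak) for peak in peaks_by_cell_type[name]))
        for name in cell_type_names
    ]

window = summit_window(chr_start=1000, summit=180)
peaks = {"DC": [(1000, 1400)], "B": [(2000, 2300)]}  # invented peaks
row = matrix_row(window, peaks, ["DC", "B"])
print(window, row)
```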

For convenience, the `accessibility_matrix` object contains the following fields:

* `peaks_list` : a list containing each bed file as a pandas data.frame.  The data.frame contains only the fields chr, chrStart, chrEnd, summit, and q.
* `m` : the accessibility matrix stored as a numpy array. 
* `master_peaks` : the non-intersecting windows stored as a pandas data.frame with columns chr, chrStart, chrEnd, and q and rows ordered to match the rows of `m`.
* `cell_type_names` : a list containing the cell type names, ordered to match the columns of `m` and the elements of `peaks_list`.
     
#### Creating the Edge List CSV File

The edge list csv file must be provided by the user.  As noted above, cell type names should be used to specify edges.  The graph specified by the edge list must be a tree, i.e. all nodes have a single parent except for a root node with no parent.
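
A quick way to check an edge list before co-clustering is to verify the tree condition directly. The sketch below is one possible check, assuming each edge is given as a (parent, child) pair; `find_root` is a hypothetical helper and the cell type names are only examples.

```python
def find_root(edges):
    """Return the root of the tree, or raise ValueError if edges do not form a tree."""
    parents = {}
    nodes = set()
    for parent, child in edges:
        nodes.update((parent, child))
        if child in parents:
            raise ValueError(f"{child} has more than one parent")
        parents[child] = parent
    roots = nodes - set(parents)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    root = roots.pop()
    # Every node must reach the root by following parents; otherwise there is a cycle.
    for node in nodes:
        seen = set()
        while node != root:
            if node in seen:
                raise ValueError("the edge list contains a cycle")
            seen.add(node)
            node = parents[node]
    return root

edges = [("HSC", "MPP"), ("MPP", "CLP"), ("MPP", "CMP")]
print(find_root(edges))
```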

#### Example Input Files

Example input files based on data collected by Yoshida et al. are available for download at `https://github.com/SLeviyang/chromcocluster`.

* Yoshida et al.  The cis-Regulatory Atlas of the Mouse Immune System.  Cell. 2019.


The bed files formed from the ATACseq dataset of Yoshida et al. are provided in the `peak_files` folder.  An edge list corresponding to the tree in Figure 1A of Yoshida et al. is provided in the `tree_files` folder.  The results presented in George et al. use these files as input to **chromcocluster**.

# Co-Clustering 
The co-clustering algorithm is implemented in the **cocluster** module.

```python
import chromcocluster.cocluster as cclust
cc = cclust.cocluster("edge_list.csv", "matrix.csv")
cc.locus_cluster(FDR=0.001, min_accessible_cell_types=3, max_accessible_cell_types=None, min_cluster_size=30, outfile="locus_clusters.csv")
cc.cell_type_cluster(k=8, nCPU=1, ntrails=10, outfile="cell_type_clusters.csv")
```

* The first line after the import constructs a `cocluster` object using the edge list csv and accessibility matrix csv file paths.

* The second line performs locus (i.e. row) clustering.  The `FDR` argument modulates the degree to which two loci (i.e. rows) are included in the same cluster.  A small FDR will lead to many small clusters, while a large FDR will lead to few large clusters.  The default is `0.001`.  Rows with fewer than `min_accessible_cell_types` 1's in the accessibility matrix are thrown out, allowing the user to focus on loci that are accessible in some minimum number of cell types.  Rows with more than `max_accessible_cell_types` 1's are all grouped in one cluster, allowing the user to group together loci that are accessible across a large range of cell types.  Passing `None` to this parameter will result in the algorithm setting the value to the total number of cell types minus `min_accessible_cell_types`.  `min_cluster_size` specifies the minimum number of rows that must be in a cluster.  The default is `30`.  The locus clustering information can be saved to file through the `outfile` argument.  If `outfile=None` then the locus clustering is not saved.

* The third line performs the cell type (i.e. column) clustering.  `k` specifies the number of clusters, `nCPU` allows for parallelization, and `ntrials` specifies how many optimization trials are to be run.  The more trials run, the more likely the algorithm is to find the optimal clustering, but at the price of increased computation time.  The default is `10` trials.  The cell type clustering information can be saved to file through the `outfile` argument.  If `outfile=None` then the cell type clustering is not saved.
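
The effect of the two accessibility thresholds on the rows can be illustrated with a small sketch. It follows the rules described above (rows below the minimum are dropped, rows above the maximum are grouped together, and `None` becomes the number of cell types minus the minimum), but it is not the package's own code, and `split_rows` is a hypothetical name.

```python
def split_rows(matrix, min_accessible_cell_types=3, max_accessible_cell_types=None):
    """Sort row indices into dropped, broadly accessible, and clustering candidates."""
    n_cell_types = len(matrix[0])
    if max_accessible_cell_types is None:
        max_accessible_cell_types = n_cell_types - min_accessible_cell_types
    dropped, broad, candidates = [], [], []
    for index, row in enumerate(matrix):
        n_accessible = sum(row)
        if n_accessible < min_accessible_cell_types:
            dropped.append(index)      # too few accessible cell types
        elif n_accessible > max_accessible_cell_types:
            broad.append(index)        # grouped together in a single cluster
        else:
            candidates.append(index)   # passed on to the clustering step
    return dropped, broad, candidates

# Invented matrix with 5 cell types; with a minimum of 2 the maximum defaults to 3.
example = [
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
]
dropped, broad, candidates = split_rows(example, min_accessible_cell_types=2)
print(dropped, broad, candidates)
```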


The `cocluster` object has the following fields which are used to access the clustering results.

* `locus_clusters` : a pandas data.frame with 2 columns: row and cluster.  Each value in the row column corresponds to a row (i.e. locus) in the accessibility matrix and the cluster column gives the cluster to which the row belongs.  Clusters are numbered starting with the 0 cluster.  
* `cell_type_clusters` : a numpy array providing the cluster number of each cell type.  Cell types are ordered according to the column order in the accessibility matrix csv file, or equivalently the `cell_type_names` field in the `accessibility_matrix` object.

The `cocluster` object also has the following fields, which provide more information regarding the clustering and are useful in cluster analysis.

* `m_list` : a list of numpy matrices partitioning the accessibility matrix according to the row clustering.
* `locus_edges` : the edges of the graph passed into the Louvain algorithm.  These edges are constructed based on the `FDR` argument passed to the `locus_cluster` method.  The loci (i.e. rows) of the accessibility matrix form the nodes connected by these edges.  
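
To show how a row clustering partitions the accessibility matrix, the sketch below builds a list of submatrices from (row, cluster) pairs using plain Python lists. It mimics the shape of `m_list` under the assumption that the list is indexed by cluster number; the function name and data are invented.

```python
def partition_rows(matrix, locus_clusters):
    """Group matrix rows by cluster; rows absent from locus_clusters are left out."""
    n_clusters = max(cluster for _, cluster in locus_clusters) + 1
    blocks = [[] for _ in range(n_clusters)]
    for row, cluster in locus_clusters:
        blocks[cluster].append(matrix[row])
    return blocks

matrix = [
    [1, 1, 0],
    [0, 0, 1],  # not assigned to any cluster below
    [1, 1, 1],
    [1, 0, 0],
]
locus_clusters = [(0, 1), (2, 0), (3, 1)]  # (row, cluster) pairs
blocks = partition_rows(matrix, locus_clusters)
print(blocks)
```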

If both the locus and cell type clustering have been saved to file, then the information can be loaded into a `cocluster` object to avoid rerunning the clustering:

```python
cc2 = cclust.cocluster("edge_list.csv", "matrix.csv")
cc2.load_clustering(locus_clusters_file="locus_clusters.csv", cell_type_clusters_file="cell_type_clusters.csv")
```
The `cc2` object will then have the clustering information created through the `cc` object above.


**Several important attributes of the locus (i.e. row) clustering must be kept in mind:**

1. **The intersection of the set of cell type names found in the edge list csv file and the accessibility matrix csv file (in the header line) is used to co-cluster.  All cell type columns of the accessibility matrix not in the intersection are removed prior to co-clustering.  Similarly, all nodes in the tree that are not in the intersection are removed prior to co-clustering.  Co-clustering will not proceed if the subset of nodes does not form a tree.**
2. **The `locus_clusters` field will typically not contain every row (i.e. locus) of the accessibility matrix.  If a row is an outlier in its accessibility pattern relative to all other rows, as modulated by the `FDR`, it will not be included in any cluster.  Further, the `min_accessible_cell_types` parameter setting may throw out many rows (i.e. loci) with low accessibility.**
3. **All rows/loci that are accessible in more cell types than specified by `max_accessible_cell_types` are grouped in a single cluster, cluster 0**.
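
The name matching in the first point can be previewed before running the co-clustering. The sketch below computes the shared cell type names and restricts a matrix and an edge list to them. How the package itself prunes the tree is not shown here; dropping every edge that touches a removed node is an assumption, and all names are hypothetical.

```python
def shared_cell_types(matrix_header, edges):
    """Cell type names present in both inputs, in matrix column order."""
    tree_nodes = {name for edge in edges for name in edge}
    return [name for name in matrix_header if name in tree_nodes]

def restrict_columns(matrix_header, matrix, keep):
    """Keep only the matrix columns whose names are listed in `keep`."""
    indices = [matrix_header.index(name) for name in keep]
    return [[row[i] for i in indices] for row in matrix]

header = ["HSC", "B", "T", "Mystery"]  # "Mystery" is absent from the tree
matrix = [[1, 0, 1, 1], [0, 1, 1, 0]]
edges = [("HSC", "B"), ("HSC", "T"), ("T", "NK")]  # "NK" is absent from the matrix

shared = shared_cell_types(header, edges)
restricted = restrict_columns(header, matrix, shared)
kept_edges = [edge for edge in edges if edge[0] in shared and edge[1] in shared]
print(shared, restricted, kept_edges)
```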



# Visualization and Statistics

The `cocluster` module contains visualization and statistical methods.  After creating a `cocluster` object and performing locus and cell type clustering, as described above, the following methods can be called:

```python
cc.plot_tree()
cc.heatmap(cocluster_approximation=False, collapse_locus_clusters=False, outfile=None)
```

* The `plot_tree` method generates a plot of the cell type tree with the cell types colored according to the cell type clustering.   Nodes are labeled according to cell type index and cluster.  For example, a label 23-5 means that the node is the cell type with index 23 and is in cluster 5.
* The `heatmap` method plots the clustered rows and columns of the accessibility matrix in order of increasing cluster assignment, thereby allowing for visualization of the locus and cell type coclustering.  If `cocluster_approximation` is set to True, then all entries within a cocluster are replaced by the cocluster average.  The coclustering approximation is analogous to the centroid of each cluster in k-means.  If `collapse_locus_clusters` is True, then the rows of each locus cluster are replaced by a single row with values given by the column means.  The heatmap can be saved to a file by providing a file path in the `outfile` argument.
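
The coclustering approximation described above can be written out by hand for a small matrix. The following sketch replaces every entry by the mean of its cocluster; it uses plain lists, invented cluster labels, and a hypothetical function name, and it is meant to illustrate the idea rather than reproduce the plotting code.

```python
def cocluster_approximation(matrix, row_clusters, col_clusters):
    """Replace each entry by the mean of its (row cluster, column cluster) block."""
    totals, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_clusters[i], col_clusters[j])
            totals[key] = totals.get(key, 0) + value
            counts[key] = counts.get(key, 0) + 1
    return [
        [totals[(r, c)] / counts[(r, c)] for c in col_clusters]
        for r in row_clusters
    ]

matrix = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
row_clusters = [0, 0, 1]  # locus cluster of each row
col_clusters = [0, 0, 1]  # cell type cluster of each column
approximation = cocluster_approximation(matrix, row_clusters, col_clusters)
print(approximation)
```

In this example the top-left cocluster holds three 1's and one 0, so each of its entries becomes 0.75.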

The fraction of total and cell type associated variation captured by the coclustering can be computed through the method `R2`.

```python
cc.R2()
```
The `R2` method will return a data.frame with columns `locus_cluster, R2_total, R2_cell_type`.  The `R2_total` and `R2_cell_type` columns give the fraction of total and cell type associated variance captured by the coclustering for the given locus cluster.  The fraction of the variation is an R-squared value computed through an ANOVA approach in which the accessibility matrix restricted to clustered rows is compared against a matrix in which all entries within each co-cluster are replaced by the mean value of entries in the co-cluster.  
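
As a rough illustration of the R-squared idea, the sketch below compares a small matrix with a version in which each cocluster has been replaced by its mean, using the usual formula of one minus the residual sum of squares over the total sum of squares. The exact definitions behind `R2_total` and `R2_cell_type` are given in George et al.; this generic formula, the data, and the function name are assumptions made for the example.

```python
def r_squared(matrix, approximation):
    """1 - SS_residual / SS_total, with SS_total taken about the grand mean."""
    values = [v for row in matrix for v in row]
    fitted = [v for row in approximation for v in row]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_residual = sum((v - f) ** 2 for v, f in zip(values, fitted))
    return 1 - ss_residual / ss_total

matrix = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
# Each cocluster replaced by its mean (rows 0-1 and columns 0-1 form one block).
approximation = [[0.75, 0.75, 0.0], [0.75, 0.75, 0.0], [0.0, 0.0, 1.0]]
r2 = r_squared(matrix, approximation)
print(round(r2, 4))
```

A value near 1 means the block means reproduce the matrix closely, while a value near 0 means they explain little beyond the overall mean.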


