Metadata-Version: 2.1
Name: stanford-stk
Version: 0.0.1
Summary: Sparse Toolkit
Home-page: https://github.com/stanford-futuredata/stk
Author: Trevor Gale
Author-email: tgale@stanford.edu
License: UNKNOWN
Description: # :abacus: Sparse Toolkit
        
        A lightweight PyTorch library for block-sparse matrices and block-sparse matrix multiplication.
        
        STK is built around a core sparse matrix class ([stk.Matrix](stk/matrix.py)), which uses a hybrid [blocked-CSR-COO](https://arxiv.org/abs/2211.15841) sparse matrix encoding to enable efficient matrix products with sparse inputs and outputs in transposed or non-transposed order. The library supports the following operations with `block_size=128`:
        
        ```
        op: transpose or non-transpose
        
        [Sparse Matrix Multiplication]
        stk.ops.dsd: dense = op(sparse) x op(dense)
        stk.ops.dds: dense = op(dense) x op(sparse)
        stk.ops.sdd: sparse = op(dense) x op(dense)
        
        stk.ops.ssd: sparse = op(sparse) x op(dense) [Experimental]
        stk.ops.sds: sparse = op(dense) x op(sparse) [Experimental]
        stk.ops.dss: dense = op(sparse) x op(sparse) [Experimental]
        
        [Sparse Matrix Conversion]
        stk.ops.to_sparse: torch.Tensor => stk.Matrix
        stk.ops.to_dense: stk.Matrix => torch.Tensor
        
        [Sparse Matrix Generation]
        stk.random.dense_mask: Create a random block-sparse mask as a dense matrix.
        stk.random.mask: Create a random block-sparse matrix (stk.Matrix).
        ```
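        
        The hybrid blocked-CSR-COO idea can be illustrated with a small self-contained sketch: CSR-style row offsets support row-ordered traversal, while an explicit COO-style row index per block supports traversal in transposed order without walking the offsets. This is an illustration of the encoding concept only, not STK's actual internal layout or API:
        
        ```python
        # Illustrative sketch (not STK internals): hybrid blocked-CSR-COO encoding.
        
        def encode_blocked_csr_coo(dense, block_size):
            """Encode a dense matrix (list of lists) over block_size x block_size
            tiles as (blocks, row_offsets, column_indices, row_indices)."""
            n_rows = len(dense) // block_size
            n_cols = len(dense[0]) // block_size
            blocks, column_indices, row_indices = [], [], []
            row_offsets = [0]  # CSR-style: blocks i in [row_offsets[r], row_offsets[r+1])
            for bi in range(n_rows):
                for bj in range(n_cols):
                    tile = [row[bj * block_size:(bj + 1) * block_size]
                            for row in dense[bi * block_size:(bi + 1) * block_size]]
                    if any(any(v != 0 for v in r) for r in tile):
                        blocks.append(tile)
                        column_indices.append(bj)  # CSR-style column index per block
                        row_indices.append(bi)     # COO-style row index (the hybrid part)
                row_offsets.append(len(blocks))
            return blocks, row_offsets, column_indices, row_indices
        
        def decode_to_dense(blocks, row_offsets, column_indices, row_indices,
                            shape, block_size):
            """Reconstruct the dense matrix from the encoded blocks."""
            out = [[0] * shape[1] for _ in range(shape[0])]
            for blk, bi, bj in zip(blocks, row_indices, column_indices):
                for r in range(block_size):
                    for c in range(block_size):
                        out[bi * block_size + r][bj * block_size + c] = blk[r][c]
            return out
        ```
        
        Because each nonzero block carries both its row and column index, a kernel can iterate blocks in either row or column order, which is what makes transposed and non-transposed products cheap on the same encoding.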
        
        STK is designed for applications where the sparse matrices change rapidly. This is complementary to libraries like [torch-blocksparse](https://github.com/ptillet/torch-blocksparse), which assume that sparse matrix topologies do not change between invocations.
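        
        For example, in a training loop where the sparsity pattern is resampled every step, usage might follow this pseudocode sketch (the exact call signatures and arguments here are illustrative assumptions based on the op list above, not verified API):
        
        ```
        # Pseudocode sketch: the sparse topology changes every iteration.
        for step in range(num_steps):
            # Sample a fresh block-sparse topology for this step.
            topo = stk.random.mask(m, n, sparsity, blocking=128)
        
            # sparse = op(dense) x op(dense), written into the sampled topology.
            out = stk.ops.sdd(a, b, topo)
        
            # Convert back to a dense torch.Tensor if needed downstream.
            dense_out = stk.ops.to_dense(out)
        ```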
        
        # :rocket: Performance
        
        ![STK Performance](media/block_sparse_matmul_benchmarks.png)
        
        Block-sparse matrix multiplication with STK matches the performance of cuBLAS on a range of problems. On these benchmarks from [MegaBlocks](https://github.com/stanford-futuredata/megablocks) dMoE models, STK realizes **98.6%** of cuBLAS throughput on average with `128x128` blocks.
        
        ```
        Hardware: A100-SXM4-80GB
        Software: CUDA 11.5, CUTLASS 2.5
        ```
        
        # :building_construction: Installation
        
        We recommend using NGC's [`nvcr.io/nvidia/pytorch:21.12-py3`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) PyTorch container and using the pre-built `.whl` under [dist/](https://github.com/tgale96/stk/tree/main/dist):
        
        ```bash
        export REPO=https://github.com/tgale96/stk/blob/main
        pip install ${REPO}/dist/21.12-py3/stk-0.0.1-cp38-cp38-linux_x86_64.whl?raw=true
        ```
        
        For other environments, please follow the steps outlined in the [Dockerfile](https://github.com/tgale96/stk/blob/main/Dockerfile).
        
        # :writing_hand: Citation
        
        ```
        @article{megablocks-arxiv,
          author    = {Trevor Gale and Deepak Narayanan and Cliff Young and Matei Zaharia},
          title     = {MegaBlocks: Efficient Sparse Training with Mixture-of-Experts},
          journal   = {CoRR},
          volume    = {abs/2211.15841},
          year      = {2022},
        }
        ```
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: Unix
Description-Content-Type: text/markdown
