Metadata-Version: 2.1
Name: antiberty
Version: 0.1.0
Description-Content-Type: text/markdown
License-File: LICENSE.MD

# AntiBERTy
Official repository for AntiBERTy, an antibody-specific transformer language model pre-trained on 558M natural antibody sequences, as described in [Deciphering antibody affinity maturation with language models and weakly supervised learning](https://arxiv.org/abs/2112.07782).


## Setup
To use AntiBERTy, install via pip:
```bash
pip install antiberty
```

Alternatively, you can clone this repository and install the package locally:
```bash
$ git clone git@github.com:jeffreyruffolo/AntiBERTy.git 
$ pip install AntiBERTy
```

## Usage

### Embeddings

To use AntiBERTy to generate sequence embeddings, use the `embed` function. The output is a list of embedding tensors, where each tensor is the embedding for the corresponding sequence. Each embedding has dimension `[(Length + 2) x 512]`.

```python
from antiberty import AntiBERTyRunner

antiberty = AntiBERTyRunner()

sequences = [
    "EVQLVQSGPEVKKPGTSVKVSCKASGFTFMSSAVQWVRQARGQRLEWIGWIVIGSGNTNYAQKFQERVTITRDMSTSTAYMELSSLRSEDTAVYYCAAPYCSSISCNDGFDIWGQGTMVTVS",
    "DVVMTQTPFSLPVSLGDQASISCRSSQSLVHSNGNTYLHWYLQKPGQSPKLLIYKVSNRFSGVPDRFSGSGSGTDFTLKISRVEAEDLGVYFCSQSTHVPYTFGGGTKLEIK",
]
embeddings = antiberty.embed(sequences)
```

To access the attention matrices, pass the `return_attention` flag to the `embed` function. The output is a list of attention matrices, where each matrix is the attention matrix for the corresponding sequence. Each attention matrix has dimension `[Layer x Heads x (Length + 2) x (Length + 2)]`.

```python
from antiberty import AntiBERTyRunner

antiberty = AntiBERTyRunner()

sequences = [
    "EVQLVQSGPEVKKPGTSVKVSCKASGFTFMSSAVQWVRQARGQRLEWIGWIVIGSGNTNYAQKFQERVTITRDMSTSTAYMELSSLRSEDTAVYYCAAPYCSSISCNDGFDIWGQGTMVTVS",
    "DVVMTQTPFSLPVSLGDQASISCRSSQSLVHSNGNTYLHWYLQKPGQSPKLLIYKVSNRFSGVPDRFSGSGSGTDFTLKISRVEAEDLGVYFCSQSTHVPYTFGGGTKLEIK",
]
embeddings, attentions = antiberty.embed(sequences, return_attention=True)
```

The `embed` function can also be used with masked sequences. Masked residues should be indicated with underscores.

### Classification
To use AntiBERTy to predict the species and chain type of sequences, use the `classify` function. The output is two lists of classifications for each sequences.

```python
from antiberty import AntiBERTyRunner

antiberty = AntiBERTyRunner()

sequences = [
    "EVQLVQSGPEVKKPGTSVKVSCKASGFTFMSSAVQWVRQARGQRLEWIGWIVIGSGNTNYAQKFQERVTITRDMSTSTAYMELSSLRSEDTAVYYCAAPYCSSISCNDGFDIWGQGTMVTVS",
    "DVVMTQTPFSLPVSLGDQASISCRSSQSLVHSNGNTYLHWYLQKPGQSPKLLIYKVSNRFSGVPDRFSGSGSGTDFTLKISRVEAEDLGVYFCSQSTHVPYTFGGGTKLEIK",
]
species_preds, chain_preds = antiberty.classify(sequences)
```

The `classify` function can also be used with masked sequences. Masked residues should be indicated with underscores.

### Mask prediction
To use AntiBERTy to predict the identity of masked residues, use the `fill_masks` function. Masked residues should be indicated with underscores. The output is a list of filled sequences, corresponding to the input masked sequences.

```python
from antiberty import AntiBERTyRunner

antiberty = AntiBERTyRunner()

sequences = [
    "____VQSGPEVKKPGTSVKVSCKASGFTFMSSAVQWVRQARGQRLEWIGWIVIGSGN_NYAQKFQERVTITRDM__STAYMELSSLRSEDTAVYYCAAPYCSSISCNDGFD____GTMVTVS",
    "DVVMTQTPFSLPV__GDQASISCRSSQSLVHSNGNTY_HWYLQKPGQSPKLLIYKVSNRFSGVPDRFSG_GSGTDFTLKISRVEAEDLGVYFCSQSTHVPYTFGG__KLEIK",
]
filled_sequences = antiberty.fill_masks(sequences)
```

## Citing this work

```bibtex
@article{ruffolo2021deciphering,
    title = {Deciphering antibody affinity maturation with language models and weakly supervised learning},
    author = {Ruffolo, Jeffrey A and Gray, Jeffrey J and Sulam, Jeremias},
    journal = {arXiv preprint arXiv:2112.07782},
    year= {2021}
}
```
