Metadata-Version: 2.1
Name: clip-text-decoder
Version: 1.0.0
Summary: Generate text captions for images from their CLIP embeddings.
Home-page: https://github.com/fkodom/clip-text-decoder
Author: Frank Odom
Author-email: frank.odom.iii@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: test
License-File: LICENSE

# clip-text-decoder

Generate text captions for images from their CLIP embeddings.  Includes PyTorch model code and an example training script.


## Example Predictions

These example captions were generated with the pretrained model described in the "Pretrained Models" section.

"A man riding a wave on top of a surfboard."

![A surfer riding a wave](http://farm6.staticflickr.com/5028/5654757697_bcdd8088da_z.jpg)

"A baseball player swinging a bat on top of a field."

![Baseball player](http://farm4.staticflickr.com/3202/2697603492_fbb44f6d2d_z.jpg)

"A dog running across a field with a frisbee."

![Dog with frisbee](http://farm3.staticflickr.com/2544/3715539092_f070a36b22_z.jpg)


## Installation

Installing the package gives you direct access to the following classes:
* `clip_text_decoder.datasets.ClipCocoCaptionsDataset`
* `clip_text_decoder.model.ClipDecoder`
* `clip_text_decoder.model.ClipDecoderInferenceModel`
* `clip_text_decoder.tokenizer.Tokenizer`

The `train.py` script is not included in the installed package, because it lives in the repository's root directory.  To train new models, either clone this repository or recreate `train.py` locally.

Using `pip`:
```bash
pip install clip-text-decoder
```

From source:
```bash
git clone https://github.com/fkodom/clip-text-decoder.git
cd clip-text-decoder
pip install .
```

**NOTE:** You'll also need to install `openai/CLIP` to encode images with CLIP.  `ClipCocoCaptionsDataset` requires it as well, to build the captions dataset the first time it is used (the result is cached for subsequent calls).

```bash
pip install "clip @ git+https://github.com/openai/CLIP.git"
```

CLIP can't be declared as a dependency of the PyPI package, because CLIP itself is not published on PyPI.
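
Because CLIP has to be installed separately, it can help to check for it up front and fail with an actionable message. The following is a minimal sketch and not part of this package; the helper name `require_module` and the constant `CLIP_INSTALL_HINT` are hypothetical.

```python
import importlib.util

# Install command copied from the note above.
CLIP_INSTALL_HINT = 'pip install "clip @ git+https://github.com/openai/CLIP.git"'


def require_module(name: str, install_hint: str) -> None:
    """Raise ImportError with an install hint if top-level module `name` is missing.

    Hypothetical helper for illustration; not part of clip-text-decoder.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"Module '{name}' is not installed. Try: {install_hint}")


# Usage, before encoding images or building the dataset:
#   require_module("clip", CLIP_INSTALL_HINT)
```

Calling the check once at startup surfaces the missing dependency before any long-running work begins.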


## Training

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13MJsNlff1Ew5_rJHWtpkYamVg30oyRTO?usp=sharing)

Launch your own training session using the provided script (`train.py`):
```bash
python train.py --max-epochs 5
```

Training CLI arguments, along with their default values:
```bash
--max-epochs 5  # (int)
--num-layers 6  # (int)
--dim-feedforward 256  # (int)
--precision 16  # (16 or 32)
--seed 0  # (int)
```
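
For anyone recreating `train.py` locally, the flags above could be declared roughly as follows. This is a sketch only: the flag names and defaults come from the list above, but the use of `argparse` and the function name `build_parser` are assumptions, not a description of the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names and defaults mirror the documented list; using argparse
    # here is an assumption about how train.py might parse them.
    parser = argparse.ArgumentParser(description="Train a CLIP text decoder.")
    parser.add_argument("--max-epochs", type=int, default=5)
    parser.add_argument("--num-layers", type=int, default=6)
    parser.add_argument("--dim-feedforward", type=int, default=256)
    parser.add_argument("--precision", type=int, choices=[16, 32], default=16)
    parser.add_argument("--seed", type=int, default=0)
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(["--max-epochs", "10", "--precision", "32"])
print(args.max_epochs, args.num_layers, args.precision)  # 10 6 32
```

Restricting `--precision` with `choices` rejects unsupported values at parse time rather than partway through training.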


## Inference

The training script will produce a `model.zip` archive, containing the `Tokenizer` and trained model parameters.  To perform inference with it:
```python
import clip
from PIL import Image
import torch

from clip_text_decoder.model import ClipDecoderInferenceModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ClipDecoderInferenceModel.load("path/to/model.zip").to(device)
clip_model, clip_preprocessor = clip.load("ViT-B/32", device=device, jit=False)

# Create a blank dummy image
dummy_image = Image.new("RGB", (224, 224))
preprocessed = clip_preprocessor(dummy_image).to(device)
# Add a batch dimension using '.unsqueeze(0)'.  Gradients aren't needed for inference.
with torch.no_grad():
    encoded = clip_model.encode_image(preprocessed.unsqueeze(0))
text = model(encoded)

print(text)
# Probably some nonsense, because we used a dummy image.
```


## Pretrained Models

A pretrained CLIP decoder is hosted on my Google Drive, and can be downloaded with:

```python
from clip_text_decoder.model import ClipDecoderInferenceModel

model = ClipDecoderInferenceModel.download_pretrained()
```

To cache the pretrained model locally, so that it's not re-downloaded each time:
```python
model = ClipDecoderInferenceModel.download_pretrained("/path/to/model.zip")
```
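
The caching described above follows a common download-if-missing pattern. The sketch below shows that pattern in isolation, using only the standard library; it is not the package's actual implementation, and `load_or_fetch` and its `fetch` callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: str, fetch: Callable[[], bytes]) -> bytes:
    """Return the cached bytes if the file exists; otherwise fetch once and cache.

    Hypothetical helper for illustration; `fetch` stands in for a real download.
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip the download entirely.
        return path.read_bytes()
    data = fetch()
    # Create any missing parent directories before writing the cache file.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

The first call pays the download cost and writes the file; later calls with the same path read from disk.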


## Shortcomings

* Only works well with COCO-style images.  For images outside the distribution of COCO objects, you'll get nonsense text captions.
* Trained for a relatively short time.  Even within the COCO domain, you'll occasionally see incorrect captions, and quite a few captions will have bad grammar, repetitive descriptors, etc.


