Metadata-Version: 2.1
Name: yt2doc
Version: 0.2.2
Summary: Transcribe any YouTube video into a structural Markdown document
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: emoji>=2.13.0
Requires-Dist: faster-whisper>=1.0.3
Requires-Dist: ffmpeg-python>=0.2.0
Requires-Dist: instructor>=1.5.1
Requires-Dist: openai>=1.51.0
Requires-Dist: pathvalidate>=3.2.1
Requires-Dist: pydantic>=2.9.1
Requires-Dist: torch>=2.4.1
Requires-Dist: tqdm>=4.66.5
Requires-Dist: typer-slim>=0.12.5
Requires-Dist: wtpsplit>=2.0.8
Requires-Dist: yt-dlp>=2024.10.07

# yt2doc

![Header Image](header-image.png)

yt2doc transcribes online videos and audio into readable Markdown documents.

Supported video/audio sources:
* YouTube
* Apple Podcast
* Twitter

yt2doc is meant to work fully locally, without invoking any external API. The OpenAI SDK dependency is required solely to interact with a local LLM server such as [Ollama](https://github.com/ollama/ollama).

Check out some [examples](./examples/) generated by yt2doc.

## Why

Many existing projects transcribe YouTube videos with Whisper and its variants, but most of them aim to generate subtitles, and I had not found one that prioritises readability. Whisper does not generate line breaks in its transcription, so transcribing a 20-minute video without any post-processing gives you one huge block of text, with no line breaks or topic segmentation. This project aims to transcribe videos with that post-processing.

## Installation

### Prerequisites

[ffmpeg](https://www.ffmpeg.org/) is required to run yt2doc.

If you are running macOS:

```
brew install ffmpeg
```

If you are on Debian/Ubuntu:
```
sudo apt install ffmpeg
```

### Install yt2doc

Install with [pipx](https://github.com/pypa/pipx):

```
pipx install yt2doc
```

Or install with [uv](https://github.com/astral-sh/uv):
```
uv tool install yt2doc
```

### Upgrade

If you have already installed yt2doc but would like to upgrade to a later version:

```
pipx upgrade yt2doc
```

or with `uv`:

```
uv tool upgrade yt2doc
```

## Usage

Show the help message:

```
yt2doc --help
```

### Transcribe a video from YouTube or Twitter

To transcribe a video (on YouTube or Twitter) into a document:

```
yt2doc --video <video-url>
```

To save your transcription:

```
yt2doc --video <video-url> -o some_dir/transcription.md
```

### Transcribe a YouTube playlist

To transcribe all videos from a YouTube playlist:

```
yt2doc --playlist <playlist-url> -o some_dir
```

### Segment unchaptered videos

(An LLM server, e.g. [Ollama](https://github.com/ollama/ollama), is required.) If the video is not chaptered, you can segment it into chapters and add a heading to each chapter:

```
yt2doc --video <video-url> --segment-unchaptered --llm-model <model-name>
```

Among smaller models, `gemma2:9b`, `llama3.1:8b`, and `qwen2.5:7b` work reasonably well.
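
For example, with one of these models already pulled into Ollama (the model name below is just one of the options above):

```
yt2doc --video <video-url> --segment-unchaptered --llm-model gemma2:9b
```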

By default, yt2doc talks to Ollama at `http://localhost:11434/v1` to segment the text by topic. You can run yt2doc against Ollama at a different address or port, against a different (OpenAI-compatible) LLM server (e.g. [vLLM](https://github.com/vllm-project/vllm), [mistral.rs](https://github.com/EricLBuehler/mistral.rs)), or even against OpenAI itself, by running:

```
yt2doc --video <video-url> --segment-unchaptered --llm-server <llm-server-url> --llm-api-key <llm-server-api-key> --llm-model <model-name>
```
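
For instance, to point yt2doc at an OpenAI-compatible server on a non-default local port (the URL and model name here are illustrative placeholders; substitute your own):

```
yt2doc --video <video-url> --segment-unchaptered --llm-server http://localhost:8000/v1 --llm-model llama3.1:8b
```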

### Transcribe Apple Podcast

To transcribe a podcast episode on Apple Podcast:

```
yt2doc --audio <apple-podcast-episode-url> --segment-unchaptered --llm-model <model-name>
```

### Whisper configuration

By default, yt2doc uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) as the transcription backend. You can run yt2doc with a different faster-whisper configuration (model size, device, compute type, etc.):

```
yt2doc --video <video-url> --whisper-model <model-name> --whisper-device <cpu|cuda|auto> --whisper-compute-type <compute_type>
```

For the meaning and choices of `--whisper-model`, `--whisper-device` and `--whisper-compute-type`, please refer to this [comment](https://github.com/SYSTRAN/faster-whisper/blob/v1.0.3/faster_whisper/transcribe.py#L101-L127) of faster-whisper.
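
For example, a configuration that may suit a machine with a CUDA GPU (the values here are illustrative; pick ones appropriate for your hardware from the faster-whisper documentation linked above):

```
yt2doc --video <video-url> --whisper-model large-v3 --whisper-device cuda --whisper-compute-type float16
```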


If you are running yt2doc on Apple Silicon, [whisper.cpp](https://github.com/ggerganov/whisper.cpp) is much faster as it supports the Apple GPU. (Somewhat hacky) support for whisper.cpp has been implemented:

```
yt2doc --video <video-url> --whisper-backend whisper_cpp --whisper-cpp-executable <path-to-whisper-cpp-executable> --whisper-cpp-model <path-to-whisper-cpp-model>
```

See https://github.com/shun-liang/yt2doc/issues/15 for more info on whisper.cpp integration.


### Text segmentation configuration

yt2doc uses [Segment Any Text (SaT)](https://github.com/segment-any-text/wtpsplit) to segment the transcript into sentences and paragraphs. You can change the SaT model used:
```
yt2doc --video <video-url> --sat-model <sat-model>
```

A list of available SaT models is [here](https://github.com/segment-any-text/wtpsplit?tab=readme-ov-file#available-models).
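
For example, to use one of the smaller models from that list (model name is illustrative; any model from the list should work):

```
yt2doc --video <video-url> --sat-model sat-3l
```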

## TODOs
* Tests and evaluation
* Better support for non-English languages
