Metadata-Version: 2.1
Name: malayalam_asr_benchmarking
Version: 0.0.2
Summary: A study to benchmark whisper based ASRs in Malayalam
Home-page: https://github.com/kurianbenoy/malayalam_asr_benchmarking
Author: kurianbenoy
Author-email: kurian.bkk@gmail.com
License: MIT License
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

malayalam_asr_benchmarking
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

This work is still in progress. So far I have benchmarked Whisper-based
ASR models on the [Common Voice 11 Malayalam
dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ml/train).
The benchmarking results have been [uploaded to Hugging Face as a
dataset](https://huggingface.co/datasets/kurianbenoy/malayalam_common_voice_benchmarking).
I am currently also benchmarking the [Malayalam Speech
Corpus](https://msc.smc.org.in/) dataset. Once completed, those results
will be uploaded to Hugging Face Datasets in the same manner.

## Install

``` sh
pip install malayalam_asr_benchmarking
```

Or, from a local clone of the repository:

``` sh
pip install -e .
```

## Setting up your development environment

I am developing this project with nbdev. Before contributing, please
take some time to read up on how nbdev works, including its
[directives](https://nbdev.fast.ai/explanations/directives.html), by
checking out [the
walk-thrus](https://nbdev.fast.ai/tutorials/tutorial.html) and
[tutorials](https://nbdev.fast.ai/tutorials/) on the [nbdev
website](https://nbdev.fast.ai/).

### Step 1: Install Quarto

`nbdev_install_quarto`

Other installation options are described in the [Quarto getting-started
guide](https://quarto.org/docs/get-started/).

### Step 2: Install hooks

`nbdev_install_hooks`

### Step 3: Install our library

`pip install -e '.[dev]'`

## How to use

``` python
from malayalam_asr_benchmarking.commonvoice import evaluate_whisper_model_common_voice

evaluate_whisper_model_common_voice("parambharat/whisper-tiny-ml")
```

    Found cached dataset common_voice_11_0 (/home/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/ml/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0)
    Loading cached processed dataset at /home/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/ml/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0/cache-374585c2877047e3.arrow
    Loading cached processed dataset at /home/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/ml/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0/cache-22670505c562e0d4.arrow
    /opt/conda/lib/python3.8/site-packages/transformers/generation_utils.py:1359: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 448 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
      warnings.warn(

    Total time taken: 133.23447608947754
    The WER of model: 38.31
    The CER of model: 21.93
    The model size is: 37.76M
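
For readers unfamiliar with the two error metrics in the output above, the sketch below shows the standard definitions: word error rate (WER) is the word-level edit distance between the reference transcript and the model's transcript, divided by the number of reference words, and character error rate (CER) is the same ratio computed over characters. This is a minimal pure-Python illustration, not the code this package runs internally; the helper names `edit_distance`, `wer`, and `cer` are hypothetical and chosen only for this example. It also skips any text normalization, and it counts Unicode code points rather than grapheme clusters, which can matter for Malayalam text.

``` python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an item from ref
                curr[j - 1] + 1,     # insert an item from hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a percentage of the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100 * edit_distance(reference, hypothesis) / len(reference)


# Illustrative English strings (hypothetical, not taken from the benchmark data).
reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"  # one word dropped
print(f"WER: {wer(reference, hypothesis):.2f}")
print(f"CER: {cer(reference, hypothesis):.2f}")
```

Both values are percentages, so lower is better, and WER can exceed 100 when the hypothesis contains many inserted words.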
