# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['vectory', 'vectory.db', 'vectory.es', 'vectory.visualization']

package_data = \
{'': ['*']}

install_requires = \
['Jinja2==3.0.1',
 'Pillow>=9.2.0,<10.0.0',
 'bokeh<=2.2.3',
 'coolname>=1.1.0,<2.0.0',
 'elasticsearch-dsl>=7.0.0,<8.0.0',
 'elasticsearch==7.16.3',
 'matplotlib>=3.3',
 'numpy>=1.14.5',
 'pandas>=1.3.5,<2.0.0',
 'peewee>=3.14.10,<4.0.0',
 'plotly>=5.9.0,<6.0.0',
 'psutil>=5.9.1,<6.0.0',
 'pynndescent>=0.5.6,<0.6.0',
 'python-on-whales>=0.52.0,<0.53.0',
 'streamlit-bokeh-events>=0.1.2,<0.2.0',
 'streamlit>=1.8.1,<1.12.0',
 'tabulate>=0.8.10,<0.9.0',
 'tqdm>=4.61.1,<5.0.0',
 'typer>=0.4.0,<0.5.0',
 'umap-learn>=0.5.3,<0.6.0']

entry_points = \
{'console_scripts': ['vectory = vectory.cli:app']}

setup_kwargs = {
    'name': 'vectory',
    'version': '0.1.2',
    'description': 'Streamline the benchmark and experimentation process of your models that rely on generating embeddings',
    'long_description': '<p align="center">\n  <img src="https://pento.ai/images/vectory-banner.png" alt="Vectory">\n</p>\n\n<p align="center">\n    <b> An embedding evaluation toolkit </b>\n</p>\n\n<p align="center">\n    <a href="https://pypi.org/project/vectory" target="_blank">\n        <img src="https://img.shields.io/pypi/v/vectory?color=%2334D058&label=pypi%20package" alt="Package version">\n    </a>\n    <a href="https://pypi.org/project/vectory" target="_blank">\n        <img src="https://img.shields.io/pypi/pyversions/vectory.svg?color=%2334D058" alt="Supported Python versions">\n    </a>\n</p>\n\n<p align="center">\n  <img src="assets/overview.gif" alt="animated" />\n</p>\n\n<!-- ![overview](assets/overview.gif) -->\n\nVectory provides a collection of tools to **track and compare embedding versions**.\n\nBeing able to visualize and register each experiment is a crucial part of developing successful models. Vectory is a tool designed by and for machine learning engineers to handle embedding experiments with little overhead.\n\n### Key features:\n\n- **Embedding lineage**. Keep track of what data and models were used to generate embeddings.\n- **Compare performance**. Compare metrics between different vector spaces.\n- **Ease of use**. Easy usage through the CLI, Python and GUI interfaces.\n- **Extensibility**. It was built with extensibility in mind.\n- **Persistence**. Simple local state persistence using SQLite.\n\n# Table of Contents\n\n1. [Installation](#installation)\n2. [Demo](#demo)\n3. [Usage](#usage)\n4. [Troubleshooting](troubleshooting.md)\n5. [License](#license)\n\n# Installation\n\nAll you need for Vectory to run is to install the package and Elasticsearch. You can install the package using pip:\n\n```console\npip install vectory\n```\n\n## Set up Elasticsearch\n\nWhat is Elasticsearch? 
It\'s a free, high-performance search engine that can be used for any kind of data.\n\nVectory uses Elasticsearch to load embeddings and then search for them.\n\nTo start the engine you will need to install Docker and start its daemon.\nAfter that, just run:\n\n```console\nvectory elastic up --detach\n```\n\nand you can turn it off with:\n\n```console\nvectory elastic down\n```\n\n# Demo\n\n<p align="center">\n  <img src="assets/intro.gif" alt="animated" />\n</p>\n\nAfter installing Vectory with the GUI dependencies, you can play with the demo cases to get a feel for the toolkit.\n\n- Tiny-imagenet computer vision dataset embeddings made from pretrained models ResNet50 and ConvNext-tiny.\n- IMDb NLP dataset embeddings made from pretrained models BERT and RoBERTa.\n\nIn order to download the data and set up the demo, run the following command:\n\n```console\nvectory demo\n```\n\nYou can specify the demo dataset with the `--dataset-name` argument.\n\nRun the Streamlit visualization app:\n\n```console\nvectory run\n```\n\n<p align="center">\n  <img src="assets/zoom.gif" alt="animated" />\n</p>\n\n# Usage\n\nThe key concepts needed to use Vectory are **datasets**, **experiments** and **embedding spaces**.\n\nA **dataset** is just a collection of data. You could have evaluation or training datasets. Evaluation datasets are required for Vectory to run, whereas training datasets are optional and useful mainly for tracking purposes.\n\nDatasets are defined with a csv file. The csv file should have a header row, followed by a row for each data point. The columns may contain any information about the data point, but it is recommended that the first column is an identifier for the data point. The next columns could be labels, features, or any other information.\n\nAn **experiment** is a machine learning model which has been trained with a particular dataset. You could create different experiments by varying the model and the dataset. 
Like the training datasets, experiments are optional and useful mainly for tracking purposes.\n\nTogether, they form an **embedding space**, which is just a 2-dimensional array with all the generated vectors (or features or embeddings) for a particular dataset using a particular experiment. Embedding spaces can be stored as either `.npz` or `.npy` files; we\'ll refer to them as `.npz` for simplicity. The rows must follow the same order as the evaluation dataset csv file.\n\n<details markdown="1">\n<summary> <b> Example </b> </summary>\n\nYou could have an experiment, such as a ResNet model trained with the dataset Data1. Let’s call the generated embedding space ES1. But either you split your data or you get new data once in a while (or both), so this experiment will not be used on a single static dataset only. You might then want to use this experiment on Data2, generating a particular embedding space called ES2.\n\nVectory helps you to organize and analyze the obtained embeddings for each dataset and experiment.\n\n</details>\n\n---\n\n## Command Line Interface\n\n### Create\n\nCreate datasets, experiments and embedding spaces:\n\n```console\nvectory add --dataset [path_to_csv] --embeddings [path_to_npz]\n```\n\nThis is the simplest way to add them. In case you want to track your tests, you can specify the names of the elements, the dimension of the embedding space and the parameters of the model. You can see all the options with the `--help` flag.\n\n### Load\n\nEmbedding spaces are mapped to Elasticsearch **indices**. To load the embeddings to Elasticsearch when creating the embedding space with the previous command, add `--load` after designating the dataset, the embedding space and the parameters. This option for the `add` command only works for the default loading options. 
If you want to load the embeddings with different options, you can use the `load` command.\n\nLoad an embedding space into Elasticsearch independently:\n\n```console\nvectory embeddings load [index_name] [embedding_space_name]\n```\n\nYou can specify the model name, the similarity function, the number of threads, the chunk size and the hyperparameters for the kNN search. You can see all the options with the `--help` flag.\n\n### Search\n\nGet all your datasets, experiments, embedding spaces and indices:\n\n```console\nvectory ls\n```\n\nList all the indices:\n\n```console\nvectory embeddings list-indices\n```\n\n### Delete\n\nDelete datasets:\n\n```console\nvectory dataset delete [dataset_name]\n```\n\nExperiments:\n\n```console\nvectory experiment delete [experiment_name]\n```\n\nEmbedding Spaces:\n\n```console\nvectory embeddings delete [embedding_space_name]\n```\n\nYou can delete elements associated with these objects and their respective indices by adding `--recursive`.\n\nIndices:\n\n```console\nvectory embeddings delete-index [index_name]\n```\n\nAll indices:\n\n```console\nvectory embeddings delete-all-indices\n```\n\n### Comparing embedding spaces\n\nWith Vectory you can measure how similar two embedding spaces are. The similarity between two embedding spaces is the mean of the local neighbourhood similarity of every point, which is the intersection over union (IoU) of its 10 nearest neighbours in the two spaces.\n\nIn short, to compare two embedding spaces, Vectory computes the 10 nearest neighbours of every data point in both embedding spaces, gets the IoU of each pair of neighbour sets and shows the distribution of the IoU values. 
Also, we compute the mean of the IoU values in order to provide a single value to compare the two embedding spaces.\n\nMore info about comparing embedding spaces [here](http://vis.csail.mit.edu/pubs/embedding-comparator/).\n\nCompare two embedding spaces using:\n\n```console\nvectory compare [embedding_space_1_name] [embedding_space_2_name] --precompute\n```\n\nYou can specify the metric to use for kNN search in each of the embedding spaces, whether to calculate the similarity histogram, and whether to allow precomputation.\n\n## Python API\n\n### Create\n\nCreate datasets, experiments and an embedding space from them.\n\n```python\nfrom vectory.datasets import Dataset\nfrom vectory.experiments import Experiment\nfrom vectory.spaces import EmbeddingSpace\n\ndataset = Dataset.get_or_create(csv_path=CSV_PATH, name=DATASET_NAME)\n\ntrain_dataset = Dataset.get_or_create(csv_path=TRAIN_CSV_PATH, name=TRAIN_DATASET_NAME)\n\nexperiment = Experiment.get_or_create(\n    train_dataset=TRAIN_DATASET_NAME,\n    model=MODEL_NAME,\n    name=EXPERIMENT_NAME,\n)\n\nembedding_space = EmbeddingSpace.get_or_create(\n    npz_path=NPZ_PATH,\n    dims=EMBEDDINGS_DIMENSIONS,\n    experiment=EXPERIMENT_NAME,\n    dataset=DATASET_NAME,\n    name=EMBEDDING_SPACE_NAME,\n)\n```\n\nThe train dataset is optional, but it is recommended to track the training process.\n\nLoad an index in Elasticsearch for an embedding space:\n\n```python\nfrom vectory.indices import load_index\n\nload_index(\n    index_name=INDEX_NAME,\n    embedding_space_name=EMBEDDING_SPACE_NAME,\n)\n```\n\nThe `dataset`, `experiment` and `embedding_space` objects have the `.model.name` attribute, so both the variable and the attribute can be used for specifying the name.\n\nAdditionally, you can specify the desired mapping to load the index with. This determines whether `cosine` or `euclidean` similarity will be used for the kNN search, as well as the model for the kNN search. 
Using an `exact` model instead of the `lsh` option will make the search slower, but more accurate. The `lsh` model and the `cosine` similarity are the default options. To see all the available mappings, check the possible options from `vectory.es.api.Mapping`.\n\n### Search\n\nGet all your datasets, experiments, embedding spaces and indices:\n\n```python\nfrom vectory.db.models import (\n    DatasetModel,\n    ElasticSearchIndexModel,\n    EmbeddingSpaceModel,\n    ExperimentModel,\n    Query,\n)\n\ndatasets = Query(DatasetModel).get()\nexperiments = Query(ExperimentModel).get()\nspaces = Query(EmbeddingSpaceModel).get()\nindices = Query(ElasticSearchIndexModel).get()\n```\n\nYou can also get a specific dataset, experiment, space or index by specifying an attribute:\n\n```python\ndataset = Query(DatasetModel).get(name=DATASET_NAME)[0]\n```\n\n### Delete\n\nDelete old datasets and, optionally, their indices:\n\n```python\nfrom vectory.db.models import DatasetModel, Query\n\ndataset = Query(DatasetModel).get(name=DATASET_NAME)[0]\ndataset.delete_instance(recursive=True)\n```\n\nKeep in mind that if the `recursive` option is set to `True`, the experiments, spaces and indices associated with the dataset will be deleted as well.\n\nThe same can be done for experiments, embedding spaces and indices by using the `delete_instance` method on the correct object.\n\n### Compare\n\nWith Vectory you can measure how similar two embedding spaces are. The similarity between two embedding spaces is the mean of the local neighbourhood similarity of every point, which is the IoU of the 10 nearest neighbours. 
More info about comparing embedding spaces [here](http://vis.csail.mit.edu/pubs/embedding-comparator/).\n\nCompare two embedding spaces:\n\n```python\nfrom vectory.spaces import compare_embedding_spaces\n\nspaces_similarity, id_similarity_dict, fig, ax = compare_embedding_spaces(\n    embedding_space_a=EMBEDDING_SPACE_NAME_1,\n    embedding_space_b=EMBEDDING_SPACE_NAME_2,\n    metric_a=METRIC_A,\n    metric_b=METRIC_B,\n    allow_precompute_knn=True,\n)\n```\n\nThe `metric_a` and `metric_b` parameters are either `euclidean` or `cosine`. The `allow_precompute_knn` parameter is set to `True` to allow precomputing the bulk operations for the similarity computation.\n\nThe `spaces_similarity` variable contains the similarity between the two embedding spaces. The `id_similarity_dict` variable contains the similarity scores for every point in the embedding spaces.\n\nAn additional argument can be passed to the `compare_embedding_spaces` function, which is `histogram`. If set to `True`, the function will show a histogram of the similarity scores; otherwise, an empty figure is returned. The `fig` and `ax` variables are the figure and axis of the histogram.\n\n### Reduce dimensionality\n\nReduce the dimensionality of an embedding space to 2D:\n\n```python\nfrom vectory.visualization.utils import calculate_points, get_index\n\n# Get the embedding space data\nembeddings, rows, index = get_index(\n    EMBEDDING_SPACE_NAME, model=MODEL, similarity=SIMILARITY_METHOD\n)\n\n# Reduce the dimensionality\ndf = calculate_points(DIMENSIONAL_REDUCTION_MODEL, embeddings, rows)\n```\n\nThe `calculate_points` function reduces the dimensionality of the embeddings using the `DIMENSIONAL_REDUCTION_MODEL` model. It can be either `UMAP`, `PCA` or `PCA + UMAP`. 
It returns a DataFrame with the reduced dimensionality points and the data contained in the dataset\'s csv.\n\n### Get similar indices\n\nGet the most similar indices for a given embedding:\n\n```python\nfrom vectory.indices import match_query\n\n# Get the most similar indices for a sample embedding\nsimilarity_results, _ = match_query(indices_name=[INDEX_NAME], query_id=EMBEDDING_INDEX)\n```\n\nThe `match_query` function returns the most similar indices for a given embedding and the index of the embedding. The `indices_name` parameter is a list of index names, and the `query_id` parameter is the id of the embedding to search for. From these results, you can get the most similar indices and their scores. The `similarity_results` variable contains a dictionary with the index names as keys and a list of tuples with the most similar indices and their scores as values.\n\n## Visualization\n\nOnce you have loaded your datasets, experiments and embedding spaces, you can analyze the results either by visualizing them on our Streamlit app or by following the Python API documentation and getting the indices.\n\n### Streamlit\n\nVisualize your embedding spaces on a local Streamlit app with:\n\n```console\nvectory run\n```\n\nThe GUI dependencies are required to view the Streamlit app.\n\n# License\n\nThis project is licensed under the terms of the MIT license.\n',
    'author': 'Pento',
    'author_email': 'hello@pento.ai',
    'maintainer': None,
    'maintainer_email': None,
    'url': None,
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'entry_points': entry_points,
    'python_requires': '>=3.7.1,<=3.9.7',
}


setup(**setup_kwargs)
