Metadata-Version: 2.4
Name: crisp-t
Version: 0.8.0
Summary: Qualitative Research support tools in Python!
Project-URL: Homepage, https://dermatologist.github.io/crisp-t/
Project-URL: Repository, https://github.com/dermatologist/crisp-t
Project-URL: Documentation, https://dermatologist.github.io/crisp-t/
Author-email: Bell Eapen <github_public@gulfdoctor.net>
License-File: LICENSE
Keywords: python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <4.0,>=3.10
Requires-Dist: chromadb
Requires-Dist: click
Requires-Dist: gensim
Requires-Dist: matplotlib
Requires-Dist: mcp
Requires-Dist: mlxtend
Requires-Dist: pandas
Requires-Dist: pip
Requires-Dist: pydantic
Requires-Dist: pyldavis
Requires-Dist: pypdf
Requires-Dist: requests
Requires-Dist: seaborn
Requires-Dist: spacy
Requires-Dist: tabulate
Requires-Dist: textacy
Requires-Dist: tqdm
Requires-Dist: vadersentiment
Requires-Dist: wordcloud
Provides-Extra: ml
Requires-Dist: imbalanced-learn; extra == 'ml'
Requires-Dist: scikit-learn; extra == 'ml'
Requires-Dist: torch; extra == 'ml'
Provides-Extra: xg
Requires-Dist: xgboost; extra == 'xg'
Description-Content-Type: text/markdown

# 🔍 CRISP-T (Sense-making from Text and Numbers!)

[![Release](https://img.shields.io/github/v/release/dermatologist/crisp-t)](https://img.shields.io/github/v/release/dermatologist/crisp-t)
[![Build status](https://img.shields.io/github/actions/workflow/status/dermatologist/crisp-t/pytest.yml?branch=develop)](https://github.com/dermatologist/crisp-t/actions/workflows/pytest.yml?query=branch%3Adevelop)
[![codecov](https://codecov.io/gh/dermatologist/crisp-t/branch/develop/graph/badge.svg)](https://codecov.io/gh/dermatologist/crisp-t)
[![Commit activity](https://img.shields.io/github/commit-activity/m/dermatologist/crisp-t)](https://img.shields.io/github/commit-activity/m/dermatologist/crisp-t)
[![License](https://img.shields.io/github/license/dermatologist/crisp-t)](https://img.shields.io/github/license/dermatologist/crisp-t)
[![Downloads](https://img.shields.io/pypi/dm/crisp-t)](https://pypi.org/project/crisp-t)
[![Documentation](https://badgen.net/badge/icon/documentation?icon=libraries&label)](https://dermatologist.github.io/crisp-t/)
<!-- gh-dependents-info-used-by-start -->
[![Generated by github-dependents-info](https://img.shields.io/static/v1?label=Used%20by&message=2&color=informational&logo=slickpic)](https://github.com/dermatologist/crisp-t/blob/develop/docs/github-dependents-info.md)<!-- gh-dependents-info-used-by-end -->

**TL;DR** 🚀 *CRISP-T is a qualitative research method and a toolkit to perform textual (e.g. topic modelling) and numeric (e.g. decision trees) analysis of mixed datasets for computational triangulation and sense-making using large language models.* 👉 [See Demo](/notes/DEMO.md).

<p align="center">
  <img src="https://github.com/dermatologist/crisp-t/blob/develop/notes/crisp-logo.jpg" />
</p>


  ✅ CRISP is written in Python, but **you don’t need to know Python** to use it!

  ✅ CRISP is not a data science tool; it’s a **sense-making** tool!

  ✅ CRISP does not replace your analysis; it just **augments** it!

  ✅ CRISP employs an **interpretivist approach**, and the same lens is required to comprehend its results!

  ✅ CRISP does not need LLMs but can augment them with **tools**!

  ✅ CRISP is designed to **simplify your life as a qualitative researcher!**

  💯 CRISP is open-source, licensed under the GPL-3.0 License!


**Qualitative research** focuses on collecting and analyzing textual data—such as interview transcripts, open-ended survey responses, and field notes—to explore complex phenomena and human experiences. Researchers may also incorporate quantitative or external sources (e.g., demographics, census data, social media) to provide context and triangulate findings. Characterized by an inductive approach, qualitative research emphasizes generating theories from data rather than testing hypotheses. While qualitative and quantitative data are often used together, there is **no standard method for combining them.**

**CRISP-T is a method and toolset** to integrate **textual data** (as a list of documents) and **numeric data** (as a Pandas DataFrame) into structured classes that retain **metadata** from various analytical processes, such as **topic modeling** and **decision trees**. Researchers, with or without **GenAI assistance**, can define relationships between textual and numerical datasets based on their chosen **theoretical lens**. An optional final analytical phase verifies that the proposed relationships actually hold. Further, if the numeric and textual datasets share the same ID, or if the textual metadata contains keywords that match numeric column names, both datasets are filtered simultaneously, ensuring alignment and facilitating triangulation. 👉 [See Demo](/notes/DEMO.md).

CRISP-T implements **semantic search** using **ChromaDB** to find relevant documents or document chunks based on similarity to a query or reference documents. This is useful for literature reviews to find documents likely to fit inclusion criteria within your corpus/search results. It can also be used for coding/annotating documents by finding relevant chunks within a specific document.

An **MCP server** exposes all functionality as tools, resources, and prompts, enabling integration with AI agent platforms such as Claude Desktop, VS Code, and other MCP-compatible clients. CRISP-T cannot code the documents directly, but it provides semantic chunk search that **may be used together with other tools to achieve automated coding**. For example, VS Code provides built-in tools for editing text and Markdown files, which can be used to code documents based on semantic search results.

## Installation

```bash
pip install crisp-t
```

Include machine learning features for numeric data analysis (Recommended):
```bash
pip install "crisp-t[ml]"
```

Include XGBoost for gradient boosting features (Optional):
```bash
pip install "crisp-t[xg]"
```
* Mac users need to install libomp (`brew install libomp`) for XGBoost to work.
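If an optional extra is skipped, the corresponding feature fails at import time. A quick stdlib check can surface that early; the module names checked here (`sklearn`) and the suggested message are illustrative, not crisp-t's own behavior:

```python
from importlib.util import find_spec

def extra_available(module_name: str) -> bool:
    """Return True if an optional dependency can be imported."""
    return find_spec(module_name) is not None

# Example: warn up front instead of failing deep inside an analysis run.
if not extra_available("sklearn"):
    print('ML features unavailable; install with: pip install "crisp-t[ml]"')
```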

## Command Line Scripts

CRISP-T now provides four main command-line scripts:

- `crisp` — Main CLI for qualitative triangulation and analysis (see below)
- `crispviz` — Visualization CLI for corpus data (word frequencies, topic charts, wordclouds, etc.)
- `crispt` — Corpus manipulation CLI (create, edit, query, and manage corpus objects)
- `crisp-mcp` — Starts the MCP server for AI integration (see MCP section below)

All scripts are installed as entry points and can be run directly from the command line after installation.

### crisp (Analytical CLI)

```bash
crisp [OPTIONS]
```

#### Input/Output Options

- `--source, -s PATH|URL`: Read source data from a directory (reads .txt, .pdf and a single .csv) or from a URL
- `--sources PATH|URL`: Provide multiple sources; can be used multiple times
- `--inp, -i PATH`: Load an existing corpus from a folder containing `corpus.json` (and optional `corpus_df.csv`)
- `--out, -o PATH`: When saving the corpus, provide a folder path; the CLI writes `corpus.json` (and `corpus_df.csv` if available) into that folder. When saving analysis results (topics, sentiment, etc.), this acts as a base path: files are written with suffixes, e.g., `results_topics.json`.
- `--unstructured, -t TEXT`: Text CSV column(s) to analyze/compare (can be used multiple times). This is useful when you have free-form text data in a DataFrame. If this is provided, those columns are used as documents.
- `--ignore TEXT`: Comma-separated words to ignore during ingestion (applies to `--source/--sources`)

#### Analysis Options

- `--codedict`: Generate qualitative coding dictionary
- `--topics`: Generate topic model using LDA
- `--assign`: Assign documents to topics
- `--cat`: List categories of entire corpus or individual documents
- `--summary`: Generate extractive text summary
- `--sentiment`: Generate sentiment scores using VADER
- `--sentence`: Generate sentence-level scores when applicable
- `--nlp`: Generate all NLP reports (combines above text analyses)
- `--nnet`, `--cls`, `--knn`, `--kmeans`, `--cart`, `--pca`, `--regression`, `--lstm`, `--ml`: Machine learning and clustering options (requires `crisp-t[ml]`)
  - `--regression`: Perform linear or logistic regression (automatically detects binary outcomes for logistic regression)
  - `--lstm`: Train LSTM model on text data to predict outcome variable (requires binary outcome and 'id' column for alignment)
- `--visualize`: Generate visualizations (word clouds, topic charts, etc.)
- `--num, -n INTEGER`: Number parameter (clusters, topics, epochs, etc.) - default: 3
- `--rec, -r INTEGER`: Record parameter (top N results, recommendations) - default: 3
- `--filters, -f TEXT`: Filters to apply as `key=value` (can be used multiple times); keeps only documents where `document.metadata[key] == value`. Invalid formats raise an error.
- `--verbose, -v`: Print verbose messages for debugging
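The `--filters key=value` behavior above can be sketched in plain Python. The dict-based documents and helper names here are illustrations, not crisp-t's internal Document class:

```python
def parse_filter(spec: str) -> tuple[str, str]:
    """Split a 'key=value' filter spec; invalid formats raise, as the CLI does."""
    key, sep, value = spec.partition("=")
    if not sep or not key:
        raise ValueError(f"Invalid filter (expected key=value): {spec!r}")
    return key, value

def apply_filters(documents, filters):
    """Keep only documents whose metadata matches every key=value pair."""
    pairs = [parse_filter(f) for f in filters]
    return [d for d in documents
            if all(d["metadata"].get(k) == v for k, v in pairs)]

docs = [
    {"id": "d1", "metadata": {"site": "clinic-a"}},
    {"id": "d2", "metadata": {"site": "clinic-b"}},
]
print([d["id"] for d in apply_filters(docs, ["site=clinic-a"])])  # → ['d1']
```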

#### Display Options

- `--print, -p TEXT`: Print corpus information; options: [all|documents|dataframe|metadata|stats]
	documents: Lists the first 5 documents with IDs and text snippets
	dataframe: Displays the DataFrame head (if available)
	metadata: Shows corpus metadata
	stats: Provides descriptive statistics from the DataFrame (if available)
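The descriptive statistics behind `--print stats` are of the kind sketched below; crisp-t presumably delegates to pandas on the real DataFrame, so this stdlib version only shows the shape of the summary:

```python
import statistics

def describe(values):
    """Per-column summary of the kind `--print stats` reports."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

print(describe([3, 5, 7, 9])["mean"])  # → 6.0
```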

### crispviz (Visualization CLI)

```bash
crispviz [OPTIONS]
```

- `--inp, --source, --sources`: Input corpus or sources
- `--out`: Output directory for PNG images
- Visualization flags: `--freq`, `--by-topic`, `--wordcloud`, `--ldavis`, `--top-terms`, `--corr-heatmap`
- Optional params: `--bins`, `--top-n`, `--columns`, `--topics-num`

**Visualization Options:**
- `--freq`: Export word frequency distribution
- `--by-topic`: Export distribution by dominant topic (requires LDA)
- `--wordcloud`: Export topic wordcloud (requires LDA)
- `--ldavis`: Export interactive LDA visualization as HTML (requires LDA and pyLDAvis)
- `--top-terms`: Export top terms bar chart
- `--corr-heatmap`: Export correlation heatmap from CSV numeric columns
- `--topics-num N`: Number of topics for LDA (default: 8, based on Mettler et al., 2025)
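The `--freq` chart is built from token counts along these lines; the naive whitespace tokenizer below is only a stand-in for the spaCy pipeline crisp-t depends on:

```python
from collections import Counter

def word_frequencies(documents, top_n=10):
    """Count word occurrences across documents, most frequent first."""
    counts = Counter()
    for text in documents:
        counts.update(token.lower().strip(".,!?") for token in text.split())
    return counts.most_common(top_n)

docs = ["Care access was the main concern.", "Access to care varied by region."]
print(word_frequencies(docs, top_n=3))
```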

### crispt (Corpus Manipulation CLI)

```bash
crispt [OPTIONS]
```

- `--id`, `--name`, `--description`: Corpus metadata
- `--doc`: Add document as `id|name|text` or `id|text` (repeatable)
- `--remove-doc`: Remove document by ID (repeatable)
- `--meta`: Add/update corpus metadata as `key=value` (repeatable)
- `--add-rel`: Add relationship as `first|second|relation` (repeatable)
- `--clear-rel`: Clear all relationships
- `--out`: Save corpus to folder/file as `corpus.json`
- `--inp`: Load corpus from folder/file containing `corpus.json`
- Query options:
	- `--df-cols`: Print DataFrame column names
	- `--df-row-count`: Print DataFrame row count
	- `--df-row INDEX`: Print DataFrame row by index
	- `--doc-ids`: Print all document IDs
	- `--doc-id ID`: Print document by ID
	- `--relationships`: Print all relationships
	- `--relationships-for-keyword KEYWORD`: Print relationships involving a keyword
- Semantic search (requires `chromadb`):
	- `--semantic QUERY`: Perform semantic search with query string
	- `--similar-docs DOC_IDS`: Find documents similar to comma-separated list of document IDs (useful for literature reviews)
	- `--num N`: Number of results to return (default: 5). Used for `--semantic` and `--similar-docs`
	- `--semantic-chunks QUERY`: Perform semantic search on document chunks within a specific document (use with `--doc-id`)
	- `--rec THRESHOLD`: Similarity threshold for semantic operations; only results scoring above it are returned. For `--semantic-chunks`, use 0-10 (default: 8.5); for `--similar-docs`, use 0-1 (default: 0.7)
	- `--metadata-df`: Export collection metadata as a DataFrame+
	- `--metadata-keys KEYS`: Comma-separated metadata keys to include+

	- + *These two options export metadata produced by NLP analyses into the DataFrame. For example, you can extract sentiment scores or topic assignments as additional columns for numerical analysis. This is useful when the DataFrame and documents are aligned, as in survey responses.*
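How a similarity score is compared against `--rec` can be sketched as follows. Treating the 0-10 scale as cosine similarity times ten is an assumption made for illustration, not the documented internals:

```python
def filter_by_threshold(results, threshold, scale=10.0):
    """Keep (id, similarity) hits whose scaled score clears the threshold.
    `--semantic-chunks` accepts 0-10 while `--similar-docs` accepts 0-1;
    the scale factor bridges the two ranges."""
    return [(doc_id, score) for doc_id, score in results
            if score * scale >= threshold]

hits = [("chunk-1", 0.92), ("chunk-2", 0.80), ("chunk-3", 0.61)]
print(filter_by_threshold(hits, threshold=8.5))  # → [('chunk-1', 0.92)]
```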

### [Example Usage](/notes/DEMO.md)

When saving the corpus via `--out`, the CLI writes `corpus.json` (and `corpus_df.csv` if present) into the specified folder. If you pass a file path, only its parent directory is used for writing `corpus.json`.
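That folder-vs-file rule can be approximated like this; guessing "file" from the presence of a suffix is a simplification of whatever check the CLI actually performs:

```python
from pathlib import Path

def corpus_folder(path: str) -> Path:
    """A file path contributes only its parent directory;
    a directory path is used as-is."""
    p = Path(path)
    return p.parent if p.suffix else p

print(corpus_folder("results/corpus.json"))  # → results
print(corpus_folder("results"))              # → results
```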

## MCP Server

CRISP-T provides a Model Context Protocol (MCP) server that exposes all functionality as tools, resources, and prompts. This enables integration with AI assistants and other MCP-compatible clients.

### Using the MCP Server

<p align="center">
  <img src="https://github.com/dermatologist/crisp-t/blob/develop/notes/crisp.gif" />
</p>

### Configuring MCP Clients
#### Claude Desktop

Add to your Claude Desktop configuration file:

**MacOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
**Windows**: `%APPDATA%\Claude\claude_desktop_config.json`

```json
{
  "mcpServers": {
    "crisp-t": {
      "command": "<python-path>crisp-mcp"
    }
  }
}
```

#### Using with Other MCP Clients

The server can be used with any MCP-compatible client. Configure your client to run the `crisp-mcp` command via stdio.

### Available Tools

The MCP server provides tools for:

**Corpus Management**
- `load_corpus` - Load corpus from folder or source
- `save_corpus` - Save corpus to folder
- `add_document` - Add new document
- `remove_document` - Remove document by ID
- `get_document` - Get document details
- `list_documents` - List all document IDs
- `add_relationship` - Link text keywords with numeric columns
- `get_relationships` - Get all relationships
- `get_relationships_for_keyword` - Query relationships by keyword

**NLP/Text Analysis**
- `assign_topics` - Assign documents to topics (creates keyword labels)
- `extract_categories` - Extract common concepts
- `generate_summary` - Generate extractive summary
- `sentiment_analysis` - VADER sentiment analysis

**Semantic Search** (requires `chromadb`)
- `semantic_search` - Find documents similar to a query using semantic similarity
- `find_similar_documents` - Find documents similar to a set of reference documents (useful for literature reviews and qualitative research)
- `semantic_chunk_search` - Find relevant chunks within a specific document (useful for coding/annotating documents)
- `export_metadata_df` - Export ChromaDB metadata as DataFrame

**DataFrame/CSV Operations**
- `get_df_columns` - Get DataFrame column names
- `get_df_row_count` - Get number of rows
- `get_df_row` - Get specific row by index

**Machine Learning** (requires `crisp-t[ml]`)
- `kmeans_clustering` - K-Means clustering
- `decision_tree_classification` - Decision tree with feature importance
- `svm_classification` - SVM classification
- `neural_network_classification` - Neural network classification
- `regression_analysis` - Linear/logistic regression with coefficients
- `pca_analysis` - Principal Component Analysis
- `association_rules` - Apriori association rules
- `knn_search` - K-nearest neighbors search
- `lstm_text_classification` - LSTM model for text-based outcome prediction

### Resources

The server exposes corpus documents as resources:
- `corpus://document/{id}` - Access document text by ID
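A client that wants the document ID back out of such a resource URI can parse it with the stdlib; this illustrates the URI shape and is not crisp-t's own code:

```python
from urllib.parse import urlparse

def document_id_from_uri(uri: str) -> str:
    """Extract the document ID from a corpus://document/{id} resource URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "corpus" or parsed.netloc != "document":
        raise ValueError(f"Not a corpus document URI: {uri!r}")
    return parsed.path.lstrip("/")

print(document_id_from_uri("corpus://document/interview-07"))  # → interview-07
```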

### Prompts

- `analysis_workflow` - Complete step-by-step analysis guide based on INSTRUCTIONS.md
- `triangulation_guide` - Guide for triangulating qualitative and quantitative findings

### [Example MCP commands](/notes/DEMO.md)

## Role of CRISP-T in research and practice

The workflow enables AI assistants to help conduct comprehensive analyses by combining text analytics, machine learning, and triangulation of qualitative-quantitative findings.

For example, in market research, a company collects:
- **Textual** feedback from customer support interactions.
- **Numerical** data on customer retention and sales performance.
Using this framework, business analysts can investigate how recurring concerns in feedback correspond to measurable business outcomes.

## Framework Documentation

For detailed information about available functions, metadata handling, and theoretical frameworks, see the [comprehensive user instructions](/notes/INSTRUCTION.md). For semantic search examples and best practices, see the [Semantic Search Guide](/notes/SEMANTIC_SEARCH.md). Documentation (WIP) is also available [here](https://dermatologist.github.io/crisp-t/).

### Data model

[![crisp-t](https://github.com/dermatologist/crisp-t/blob/develop/notes/arch.drawio.svg)](https://github.com/dermatologist/crisp-t/blob/develop/notes/arch.drawio.svg)

## Citation

* Released on 10/11/2025 for presentation at the [ICIS 2025](https://icis2025.aisconferences.org/) conference.
* Paper coming soon; cite this repository in the meantime.

## Give us a star ⭐️
If you find this project useful, give us a star. It helps others discover the project.

## Contact

* [Bell Eapen](https://nuchange.ca) ([UIS](https://www.uis.edu/directory/bell-punneliparambil-eapen)) |  [Contact](https://nuchange.ca/contact) | [![Twitter Follow](https://img.shields.io/twitter/follow/beapen?style=social)](https://twitter.com/beapen)
