# Flagged CSV

Convert XLSX files to CSV while preserving visual formatting information as inline flags.

Traditional XLSX → CSV conversion will cause the loss of format such as background cell colors or cell merge:
![ShowExcelConversionOriginalCsv_ManimCE_v0 19 0](https://github.com/user-attachments/assets/9241d467-ab47-4005-8438-3bad05864a6b)

The new Flagged CSV will attach important flags to the original cell value, keeping the colors/cell-merge with {} flags
![ShowExcelConversionFlaggedCsv_ManimCE_v0 19 0](https://github.com/user-attachments/assets/2dee974a-d480-4f71-9404-10b121df388f)

## Why Flagged CSV is Critical for AI

Traditional AI models cannot process XLSX files directly. When users convert XLSX to CSV for AI processing, critical visual information is lost - colors that indicate categories, merged cells that show date ranges, and formatting that conveys meaning. This forces users to manually add context that was already present visually.

### Real-World Example: AI's Limitation with Standard CSV

Consider this financial data with colored categories and merged cells for date ranges:

**Original Excel file** has:
- Different colors for each expense category
- Merged cells showing the time span for each expense
- My color1 and My color4 share the same green color (#84E291)

**After standard CSV conversion**, an AI assistant fails to answer basic questions:

> **User**: "Which two items have the same color?"
> 
> **AI**: "The provided CSV file does not contain any information about the colors of the cells..."

> **User**: "What is the time range for each color's spend?"
> 
> **AI**: "I cannot determine time ranges from the CSV data..."

### The Flagged CSV Solution

With Flagged CSV format, the same AI can now answer correctly:

```csv
Name,Color,Value,JUL{#0E2841},AUG{#0E2841},SEP{#0E2841},OCT{#0E2841},NOV{#0E2841},DEC{#0E2841}
My color1,{#84E291},30,$500{#84E291}{MG:881261},{MG:881261},,,,
My color2,{#E49EDD},32,$600{#E49EDD}{MG:316508},{MG:316508},{MG:316508},,,
My color3,{#F6C6AC},34,$700{#F6C6AC}{MG:353471},{MG:353471},{MG:353471},{MG:353471},,
My color4,{#84E291},36,$800{#84E291}{MG:860393},{MG:860393},{MG:860393},{MG:860393},{MG:860393},{MG:860393}
```

Now the AI can answer:

> **Q: Which two items have the same color?**
> 
> **A: My color1 and My color4 have the same color (#84E291).**

> **Q: What is the time range and spend for each item?**
> 
> **A:** 
> - **My color1**: Time range JUL-AUG with a spend of $500
> - **My color2**: Time range JUL-SEP with a spend of $600  
> - **My color3**: Time range JUL-OCT with a spend of $700
> - **My color4**: Time range JUL-DEC with a spend of $800

This enables AI to understand:
- **Color patterns** - Which items belong to the same category
- **Time ranges** - Merged cells indicate duration
- **Relationships** - Visual cues that humans naturally understand

## Overview

Flagged CSV is a Python library and command-line tool that converts Excel (XLSX) files to CSV format while preserving important visual information that would normally be lost in conversion:

- **Cell background colors** - Preserved as `{#RRGGBB}` or `{bc:#RRGGBB}` flags
- **Cell foreground colors** - Preserved as `{fc:#RRGGBB}` flags
- **Merged cells** - Marked with `{MG:XXXXXX}` flags where XXXXXX is a unique identifier
- **Cell formatting** - Currency symbols, number formats, dates preserved as displayed in Excel
- **Cell locations** - Original Excel coordinates preserved as `{l:CellRef}` flags

## Installation

### Via uv (Recommended)

```bash
# Clone the repository
git clone https://github.com/yourusername/flagged-csv.git
cd flagged-csv

# Sync dependencies with uv
uv sync

# Install in development mode
uv pip install -e .
```

### Via pip

```bash
pip install flagged-csv
```

## Quick Start

### Command Line Usage

```bash
# Basic conversion
flagged-csv input.xlsx -t Sheet1 > output.csv

# Include all colors (foreground and background) with merge information
flagged-csv input.xlsx -t Sheet1 --include-colors --signal-merge -o output.csv

# Include background colors only, preserve formatting, ignore white backgrounds
flagged-csv input.xlsx -t Sheet1 --preserve-formats --include-bg-colors --ignore-colors "#FFFFFF"

# Include foreground colors only (font colors)
flagged-csv input.xlsx -t Sheet1 --include-fg-colors -o output.csv

# Include cell locations and keep empty rows for structure preservation
flagged-csv input.xlsx -t Sheet1 --add-location --keep-empty-lines -o output.csv

# Process with size limits
flagged-csv input.xlsx -t Sheet1 --max-rows 1000 --max-columns 200 -o output.csv
```

### Python Library Usage

```bash
# Run Python scripts with uv
uv run python your_script.py
```

```python
# your_script.py
from flagged_csv import XlsxConverter

# Create converter instance
converter = XlsxConverter()

# Convert with all formatting options
csv_content = converter.convert_to_csv(
    'data.xlsx',
    tab_name='Sheet1',
    include_colors=True,
    signal_merge=True,
    preserve_formats=True,
    ignore_colors='#FFFFFF'
)

# Save to file
with open('output.csv', 'w') as f:
    f.write(csv_content)
```

## Flag Format Specification

### Color Flags

#### Background Color
- Format: `{#RRGGBB}` (backward-compatible) or `{bc:#RRGGBB}` (explicit)
- Example: `Sales{#FF0000}` or `Sales{bc:#FF0000}` - "Sales" with red background

#### Foreground Color
- Format: `{fc:#RRGGBB}`
- Example: `Text{fc:#0000FF}` - "Text" with blue font color

#### Combined Colors
- Multiple flags can be combined: `100{#FFFF00}{fc:#FF0000}{MG:123456}{l:B5}`
- This represents: Yellow background, red text, part of merge group 123456, from cell B5

### Merge Flags  
- Format: `{MG:XXXXXX}` where XXXXXX is a 6-digit identifier
- All cells in a merged range share the same ID
- The first cell contains the actual value
- Subsequent cells contain only the merge flag

### Location Flags
- Format: `{l:CellRef}` where CellRef is the Excel cell coordinate
- Example: `Value{l:A5}` - "Value" from cell A5
- Useful for preserving cell position information

### Example Output

Given an Excel file with:
- Cell A1: "Total Sales" with blue background (#0000FF) and white text (#FFFFFF)
- Cells B1-D1: Merged cell containing "$1,000" with green background (#00FF00)
- Cell A2: "Profit" with red text (#FF0000)

The CSV output with `--include-colors` would be:
```csv
Total Sales{#0000FF}{fc:#FFFFFF},$1000{#00FF00}{MG:384756},{MG:384756},{MG:384756}
Profit{fc:#FF0000},,
```

## Configuration Options

### CLI Options

- `-t, --tab-name`: Sheet name to convert (required)
- `-o, --output`: Output file path (default: stdout)
- `--format`: Output format: csv, html, or markdown (default: csv)
- `--include-colors`: Include both foreground and background colors
- `--include-bg-colors`: Include background colors only as {#RRGGBB} flags
- `--include-fg-colors`: Include foreground colors only as {fc:#RRGGBB} flags
- `--signal-merge`: Include merged cell information as {MG:XXXXXX} flags
- `--preserve-formats`: Preserve number/date formatting as displayed in Excel
- `--ignore-colors`: Comma-separated hex colors to ignore (e.g., "#FFFFFF,#000000")
- `--add-location`: Add cell coordinates {l:A5} to non-empty cells
- `--keep-empty-lines`: Preserve empty rows to maintain original row positions
- `--max-rows`: Maximum number of rows to process (default: 300)
- `--max-columns`: Maximum number of columns to process (default: 100)
- `--no-header`: Exclude DataFrame column headers (A, B, C...) from output
- `--keep-na`: Keep NA values instead of converting to empty strings

### Python API Options

```python
from flagged_csv import XlsxConverter, XlsxConverterConfig

# Create converter with custom configuration
config = XlsxConverterConfig(
    keep_default_na=False,    # Convert NA to empty strings
    index=False,              # Don't include row index
    header=False,             # Don't include DataFrame column headers (default)
    keep_empty_lines=False,   # Remove empty rows (default)
    add_location=False        # Don't add cell coordinates (default)
)

converter = XlsxConverter(config)

# Convert with additional options
csv_content = converter.convert_to_csv(
    'data.xlsx',
    tab_name='Sheet1',
    include_colors=True,      # Include both fg and bg colors
    # OR use specific color options:
    # include_bg_colors=True,  # Background colors only
    # include_fg_colors=True,  # Foreground colors only
    signal_merge=True,
    preserve_formats=True,
    add_location=True,        # Add {l:A5} cell coordinates
    keep_empty_lines=True,    # Keep empty rows
    max_rows=500,            # Process up to 500 rows
    max_columns=50           # Process up to 50 columns
)
```

## Advanced Usage

### Processing Multiple Sheets

Save this as `process_sheets.py`:

```python
from flagged_csv import XlsxConverter
import pandas as pd

converter = XlsxConverter()

# Process all sheets in a workbook
xl_file = pd.ExcelFile('multi_sheet.xlsx')
for sheet_name in xl_file.sheet_names:
    csv_content = converter.convert_to_csv(
        'multi_sheet.xlsx',
        tab_name=sheet_name,
        include_colors=True,
        signal_merge=True
    )
    
    with open(f'{sheet_name}.csv', 'w') as f:
        f.write(csv_content)
    print(f'Converted {sheet_name} -> {sheet_name}.csv')
```

Run with:
```bash
uv run python process_sheets.py
```

### Parsing Flagged CSV

Save this as `parse_flagged.py`:

```python
import re
import pandas as pd

def parse_flagged_csv(file_path):
    """Parse a flagged CSV file and extract values and formatting."""
    df = pd.read_csv(file_path, header=None)
    
    # Regular expressions for parsing flags
    color_pattern = r'{#([0-9A-Fa-f]{6})}'
    merge_pattern = r'{MG:(\d{6})}'
    location_pattern = r'{l:([A-Z]+\d+)}'
    
    # Extract clean values and formatting info
    for row_idx in range(len(df)):
        for col_idx in range(len(df.columns)):
            cell = str(df.iloc[row_idx, col_idx])
            
            # Extract color
            color_match = re.search(color_pattern, cell)
            if color_match:
                color = color_match.group(1)
                print(f"Cell ({row_idx},{col_idx}) has color #{color}")
            
            # Extract merge ID
            merge_match = re.search(merge_pattern, cell)
            if merge_match:
                merge_id = merge_match.group(1)
                print(f"Cell ({row_idx},{col_idx}) is part of merge group {merge_id}")
            
            # Extract location
            location_match = re.search(location_pattern, cell)
            if location_match:
                location = location_match.group(1)
                print(f"Cell ({row_idx},{col_idx}) originally from {location}")
            
            # Get clean value (remove all flags)
            clean_value = re.sub(r'{[^}]+}', '', cell)
            df.iloc[row_idx, col_idx] = clean_value
    
    return df

# Example usage
if __name__ == "__main__":
    df = parse_flagged_csv('output.csv')
    print("\nCleaned data:")
    print(df)
```

Run with:
```bash
uv run python parse_flagged.py
```

### Working with Merged Cells

```python
def reconstruct_merged_cells(df):
    """Reconstruct merged cell ranges from flagged CSV."""
    merge_groups = {}
    
    for row_idx in range(len(df)):
        for col_idx in range(len(df.columns)):
            cell = str(df.iloc[row_idx, col_idx])
            
            # Find merge ID
            match = re.search(r'{MG:(\d{6})}', cell)
            if match:
                merge_id = match.group(1)
                if merge_id not in merge_groups:
                    merge_groups[merge_id] = []
                merge_groups[merge_id].append((row_idx, col_idx))
    
    # merge_groups now contains all cells belonging to each merge
    for merge_id, cells in merge_groups.items():
        print(f"Merge {merge_id}: {cells}")
```

## Output Formats

### CSV (Default)
Standard CSV format with flags appended to cell values.

### HTML
```python
html_output = converter.convert_to_csv(
    'data.xlsx',
    tab_name='Sheet1',
    output_format='html',
    include_colors=True
)
```

### Markdown
```python
markdown_output = converter.convert_to_csv(
    'data.xlsx', 
    tab_name='Sheet1',
    output_format='markdown',
    include_colors=True
)
```

## Error Handling

The library handles various error cases gracefully:

```python
try:
    csv_content = converter.convert_to_csv('data.xlsx', tab_name='InvalidSheet')
except ValueError as e:
    print(f"Sheet not found: {e}")
except FileNotFoundError as e:
    print(f"File not found: {e}")
```

## Performance Considerations

- The library uses multiple fallback engines (calamine, openpyxl, xlrd) for maximum compatibility
- Large files are processed efficiently with streaming where possible
- Color extraction uses caching to avoid repeated theme color lookups
- Default limits of 300 rows and 100 columns can be adjusted using `--max-rows` and `--max-columns`
- Empty rows are automatically removed by default (use `--keep-empty-lines` to preserve them)
- Trailing empty content is always trimmed to reduce file size

## Testing

Run the test suite using uv:

```bash
# Run all tests
uv run pytest tests/test_converter.py

# Run tests with verbose output
uv run pytest tests/test_converter.py -v

# Run a specific test
uv run pytest tests/test_converter.py::TestXlsxConverter::test_color_extraction -v
```

## Development

```bash
# Set up development environment
uv sync

# Run the example script
uv run python example.py

# Run the CLI tool in development
uv run flagged-csv --help
```

## Requirements

- Python 3.11+
- pandas >= 2.0.0, < 2.2.0
- numpy < 2.0 (for compatibility)
- pydantic >= 2.0.0 (for configuration models)
- openpyxl >= 3.1.0
- python-calamine >= 0.2.0 (for robust Excel reading)
- xlrd >= 2.0.0 (for older Excel format support)
- click >= 8.0.0 (for CLI)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests to ensure everything works (`uv run pytest tests/`)
4. Commit your changes (`git commit -m 'Add amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## AI Integration Guide

When using Flagged CSV with AI systems (ChatGPT, Claude, etc.), you have two options:

### Option 1: Use the provided prompt file
Simply include the contents of `flagged-csv.prompt.md` in your conversation with the AI. This file contains the complete specification for the Flagged CSV format.

### Option 2: Use this condensed prompt
```
The CSV data uses Flagged CSV format where:
- {#RRGGBB} indicates cell background color (e.g., {#FF0000} is red)
- {MG:XXXXXX} indicates merged cells (same ID = same merged group)
- {l:CellRef} indicates the original Excel cell location (e.g., {l:A5} is cell A5)
- The value in the merged cell should be applied to the entire range as a whole.
- Multiple flags can appear together: value{#color}{MG:id}{l:cell}
```

This simple context enables AI to answer questions about:
- Color-based categorization
- Time ranges from merged cells
- Visual patterns and relationships
- Conditional formatting meanings

## Acknowledgments

This library is inspired by the need to preserve Excel's visual information during data processing pipelines, particularly for financial and business reporting applications where cell colors and merged cells convey important meaning. It bridges the gap between human-readable Excel files and AI-processable data formats.

The code for this project was primarily written using [Claude Code](https://claude.ai/code), demonstrating how AI can help create tools that make data more accessible to AI systems.
