Metadata-Version: 2.1
Name: duplicate-images
Version: 0.5.0
Summary: Finds equal or similar images in a directory containing (many) image files
Home-page: https://github.com/lene/DuplicateImages
Author: Lene Preuss
Author-email: lene.preuss@gmail.com
Requires-Python: >=3.8.5,<4.0.0
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Multimedia :: Graphics
Classifier: Topic :: Utilities
Requires-Dist: Wand (>=0.6.5,<0.7.0)
Requires-Dist: coloredlogs (>=15.0,<16.0)
Requires-Dist: imagehash (>=4.2.0,<5.0.0)
Requires-Dist: pillow (>=8.1.0,<9.0.0)
Requires-Dist: tqdm (>=4.56.0,<5.0.0)
Project-URL: Repository, https://github.com/lene/DuplicateImages.git
Description-Content-Type: text/markdown

# Finding Duplicate Images

Finds equal or similar images in a directory containing (many) image files.

## Usage
```shell
$ pip install duplicate_images
$ find-dups -h
```
to print the help screen. Or just
```shell
$ find-dups $IMAGE_ROOT 
```
for a test run.

### Image comparison algorithms

Use the `--algorithm` option to select how equal images are found.

`ahash`, `colorhash`, `dhash`, `phash`, `whash`: five different image hashing algorithms. See 
https://pypi.org/project/ImageHash for an introduction on image hashing and 
https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid for some gory details which
image hashing algorithm performs best in which situation. For a start I recommend using `phash`, 
and only evaluating the other algorithms if `phash` does not perform satisfactorily in your use 
case.

### Actions for matching image pairs

Use the `--on-equal` option to select what to do to pairs of equal images.
- `delete-first`: deletes the first of the two files
- `delete-second`: deletes the second of the two files
- `delete-bigger` or `d>`: deletes the file with the bigger size
- `delete-smaller` or `d<`: deletes the file with the smaller size
- `eog`: launches the `eog` image viewer to compare the two files
- `xv`: launches the `xv` image viewer to compare the two files
- `print`: prints the two files
- `quote`: prints the two files with quotes around each 
- `none`: does nothing.
The default action is `print`.
  
### Parallel execution

Use the `--parallel` option to utilize all free cores on your system. 

### Progress and verbosity control
- `--progress` prints a progress bar each for the process of reading the images and the process of 
  finding duplicates among the scanned image
- `--debug` prints debugging output

## Development notes

Needs Python3 and Pillow imaging library to run, additionally Wand for the test suite.

Uses Poetry for dependency management.

### Installation

From source:
```shell
$ git clone https://gitlab.com/lilacashes/DuplicateImages.git
$ cd DuplicateImages
$ pip3 install poetry
$ poetry install
```

### Running

```shell
$ poetry run find-dups $PICTURE_DIR
```
or
```shell
$ poetry run find-dups -h
```
for a list of all possible options.

### Test suite

Running it all:
```shell
$ poetry run pytest
$ poetry run mypy duplicate_images tests
$ poetry run flake8
$ poetry run pylint duplicate_images tests
```
or simply 
```shell
$ .git_hooks/pre-push
```
Setting the test suite to be run before every push:
```shell
$ cd .git/hooks
$ ln -s ../../.git_hooks/pre-push .
```

### Publishing

```shell
$ poetry config repositories.testpypi https://test.pypi.org/legacy/
$ poetry build
$ poetry publish --username $PYPI_USER --password $PYPI_PASSWORD --repository testpypi && \
  poetry publish --username $PYPI_USER --password $PYPI_PASSWORD
```
(obviously assuming that username and password are the same on PyPI and TestPyPI)
### Profiling

#### CPU time
To show the top functions by time spent, including called functions:
```shell
$ poetry run python -m cProfile -s tottime ./duplicate_images/duplicate.py \ 
    --algorithm $ALGORITHM --action-equal none $IMAGE_DIR 2>&1 | head -n 15
```
or, to show the top functions by time spent in the function alone:
```shell
$ poetry run python -m cProfile -s cumtime ./duplicate_images/duplicate.py \ 
    --algorithm $ALGORITHM --action-equal none $IMAGE_DIR 2>&1 | head -n 15
```

#### Memory usage
```shell
$ poetry run fil-profile run ./duplicate_images/duplicate.py \
    --algorithm $ALGORITHM --action-equal none $IMAGE_DIR 2>&1
```
This will open a browser window showing the functions using the most memory (see 
https://pypi.org/project/filprofiler for more details).
