Metadata-Version: 2.1
Name: fastq-filter
Version: 0.3.0
Summary: A fast FASTQ filter program.
Home-page: https://github.com/lumc/fastq-filter
Author: Leiden University Medical Center
Author-email: r.h.p.vorderman@lumc.nl
License: MIT
Keywords: FASTQ filter
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.7
Description-Content-Type: text/x-rst
License-File: LICENSE

.. image:: https://img.shields.io/pypi/v/fastq-filter.svg
  :target: https://pypi.org/project/fastq-filter/
  :alt:

.. image:: https://img.shields.io/conda/v/bioconda/fastq-filter.svg
  :target: https://bioconda.github.io/recipes/fastq-filter/README.html
  :alt:

.. image:: https://img.shields.io/pypi/pyversions/fastq-filter.svg
  :target: https://pypi.org/project/fastq-filter/
  :alt:

.. image:: https://img.shields.io/pypi/l/fastq-filter.svg
  :target: https://github.com/LUMC/fastq-filter/blob/main/LICENSE
  :alt:

.. image:: https://codecov.io/gh/LUMC/fastq-filter/branch/main/graph/badge.svg?token=E85BEYDQ45
  :target: https://codecov.io/gh/LUMC/fastq-filter
  :alt:

=============
fastq-filter
=============

A fast FASTQ filter program.

Fastq-filter correctly takes into account that quality scores are log scores
when calculating the mean. It also provides an option to filter on average
error rate directly.

A FASTQ quality of Q=30 stands for an error rate of 0.001, Q=20 for 0.01 and
Q=10 for 0.1. This is not very intuitive: Q=20 has 10 times more errors than
Q=30, though the numbers (20 and 30) do little to convey this difference.
Using 0.01 and 0.001 correctly conveys that these error rates are an order of
magnitude apart. This also means that phred scores cannot be naively
averaged. Q=10 and Q=30 do not average to Q=20. The actual average error rate
is (0.001 + 0.1) / 2 = 0.0505, roughly 1 in 20, whereas Q=20 means 0.01, or
1 in 100. Naive averaging therefore overestimates the quality by a factor of
5. This means any tool that averages naively is unusable in practice.
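
The arithmetic above can be checked with a few lines of Python. This is a
minimal sketch for illustration only, not fastq-filter's implementation; the
function names are hypothetical.

.. code-block:: python

    import math


    def phred_to_error_rate(q):
        """Convert a phred score to an error probability."""
        return 10 ** (-q / 10)


    def error_rate_to_phred(p):
        """Convert an error probability back to a phred score."""
        return -10 * math.log10(p)


    def naive_mean_phred(quals):
        """Arithmetic mean of the phred scores (the incorrect approach)."""
        return sum(quals) / len(quals)


    def mean_error_rate(quals):
        """Mean of the error probabilities behind the phred scores."""
        return sum(phred_to_error_rate(q) for q in quals) / len(quals)


    def correct_mean_phred(quals):
        """Phred score that corresponds to the mean error probability."""
        return error_rate_to_phred(mean_error_rate(quals))


    quals = [10, 30]
    print(naive_mean_phred(quals))    # 20.0
    print(mean_error_rate(quals))     # approximately 0.0505
    print(correct_mean_phred(quals))  # approximately 12.97

The correctly averaged quality of about Q=13 is far below the naive Q=20.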

Unfortunately many tools do this. fastq-filter was written to provide a very
fast filtering solution so the correct filtering can be applied at a very low
cost.

Installation
============

+ With pip: ``pip install fastq-filter``
+ For the latest development version: ``pip install git+https://github.com/LUMC/fastq-filter``
+ With conda: ``conda install -c conda-forge -c bioconda fastq-filter``


Usage
=====

Single fastq files can be filtered with::

    fastq-filter -e 0.001 -o output.fastq input.fastq

Multiple fastq files can be filtered with::

    fastq-filter -e 0.001 -o r1_filtered.fastq.gz -o r2_filtered.fastq.gz r1.fastq.gz r2.fastq.gz

Fastq-filter ensures the outputs stay in sync. It is not limited to two
inputs, so ``R1.fq``, ``R2.fq`` and ``R3.fq`` can also be filtered together.

In the following section 'pair' denotes a group of two or more FASTQ records
that are evaluated together. When multiple FASTQ files are given the filters
behave as follows:

+ Average error rate: the average of the combined phred scores is used.
+ Median quality: the median of the combined phred scores is used.
+ Minimum length: at least one of the records in the pair must meet the minimum length.
+ Maximum length: none of the records in the pair may exceed the maximum length.

The rationale for the length filters is that R1 and R2 both sequence the same
molecule, so the canonical length is that of the longest record.
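
The length rules for pairs can be sketched as follows. This is an
illustration of the behaviour described above, not fastq-filter's actual
code; ``pair_passes_length`` and its parameters are hypothetical names.

.. code-block:: python

    def pair_passes_length(sequences, min_length=None, max_length=None):
        """Apply the pair length rules to the sequences of one pair.

        Both rules reduce to a check on the longest record: at least one
        record reaches min_length, and no record exceeds max_length.
        """
        longest = max(len(sequence) for sequence in sequences)
        if min_length is not None and longest < min_length:
            return False
        if max_length is not None and longest > max_length:
            return False
        return True


    # R1 is long enough, so the pair is kept even though R2 is short.
    print(pair_passes_length(["ACGTACGT", "ACG"], min_length=5))  # True
    # R1 exceeds the maximum, so the whole pair is dropped.
    print(pair_passes_length(["ACGTACGT", "ACG"], max_length=5))  # False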

.. code-block::

    usage: fastq-filter [-h] [-o OUTPUT] [-l MIN_LENGTH] [-L MAX_LENGTH]
                        [-e AVERAGE_ERROR_RATE] [-q MEAN_QUALITY]
                        [-Q MEDIAN_QUALITY] [-c COMPRESSION_LEVEL] [--verbose]
                        [--quiet]
                        input [input ...]

    Filter FASTQ files on various metrics.

    positional arguments:
      input                 Input FASTQ files. Compression format automatically
                            detected. Use - for stdin.

    optional arguments:
      -h, --help            show this help message and exit
      -o OUTPUT, --output OUTPUT
                            Output FASTQ files. Compression format automatically
                            determined by file extension. Flag can be used
                            multiple times. An output must be given for each
                            input. Default: stdout.
      -l MIN_LENGTH, --min-length MIN_LENGTH
                            The minimum length for a read.
      -L MAX_LENGTH, --max-length MAX_LENGTH
                            The maximum length for a read.
      -e AVERAGE_ERROR_RATE, --average-error-rate AVERAGE_ERROR_RATE
                            The minimum average per base error rate.
      -q MEAN_QUALITY, --mean-quality MEAN_QUALITY
                            Average quality. Same as the '--average-error-rate'
                            option but specified with a phred score. I.e '-q 30'
                            is equivalent to '-e 0.001'.
      -Q MEDIAN_QUALITY, --median-quality MEDIAN_QUALITY
                            The minimum median phred score.
      -c COMPRESSION_LEVEL, --compression-level COMPRESSION_LEVEL
                            Compression level for the output files. Relevant when
                            output files have a .gz extension. Default: 2
      --verbose             Report stats on individual filters.
      --quiet               Turn of logging output.

Optimizations
=============

fastq-filter uses the following optimizations to be fast:

- Multiple filters can be applied simultaneously to minimize IO.
- fastq-filter can be used in pipes to minimize IO.
- The Python ``filter`` function is used, which is a shorthand for Python code
  that would otherwise need to be interpreted.
- The mean and median quality algorithms are implemented in C with bindings to
  Python.
- The mean quality algorithm uses a lookup table since there are only 94
  possible phred scores (0 to 93) encoded in FASTQ. This avoids a power
  calculation for every base when converting scores to probabilities.
- The median quality algorithm implements a counting sort, which is very
  fast but not applicable to generic data. Since FASTQ qualities take only a
  small number of distinct values, they are well suited to a counting sort
  and the median can be calculated very quickly.
- `dnaio <https://github.com/marcelm/dnaio>`_ is used as FASTQ parser.  This
  parses the FASTQ files with a parser written in Cython.
- `xopen <https://github.com/pycompression/xopen>`_ is used to read and write
  files. This allows for support of gzip compressed files which are opened
  using `python-isal <https://github.com/pycompression/python-isal>`_ which
  reads gzip files 2 times faster and writes gzip files 5 times faster than
  the python ``gzip`` module implementation.
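
The lookup table and counting sort ideas can be illustrated in pure Python.
This is a sketch only: the real algorithms are implemented in C, and the
names below, the Phred+33 offset and the handling of even-length reads
(averaging the two middle values) are assumptions made for this example.

.. code-block:: python

    PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) encoding

    # One precomputed error probability per printable ASCII character
    # from '!' (33) to '~' (126), so no power is computed per base.
    ERROR_TABLE = [10 ** (-(code - PHRED_OFFSET) / 10)
                   for code in range(PHRED_OFFSET, 127)]


    def mean_error_rate(qualities: bytes) -> float:
        """Average error probability, using table lookups only."""
        if not qualities:
            raise ValueError("empty quality string")
        total = 0.0
        for code in qualities:
            total += ERROR_TABLE[code - PHRED_OFFSET]
        return total / len(qualities)


    def median_phred(qualities: bytes) -> float:
        """Median phred score, found by counting instead of sorting."""
        if not qualities:
            raise ValueError("empty quality string")
        counts = [0] * 128
        for code in qualities:
            counts[code] += 1
        n = len(qualities)
        low_rank = (n - 1) // 2   # index of the lower middle element
        high_rank = n // 2        # index of the upper middle element
        seen = 0
        low_value = None
        for code, count in enumerate(counts):
            seen += count
            if low_value is None and seen > low_rank:
                low_value = code
            if seen > high_rank:
                return (low_value + code) / 2 - PHRED_OFFSET
        raise AssertionError("unreachable for non-empty input")


    print(mean_error_rate(b"+?"))  # Q10 and Q30, approximately 0.0505
    print(median_phred(b"I+5"))    # Q40, Q10 and Q20 give 20.0

Counting works here because every quality value falls in a small, fixed
range, so a single pass over the read fills the histogram and a second pass
over at most 128 buckets finds the median.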
