Metadata-Version: 2.1
Name: twitterhistory
Version: 0.3.6
Summary: Download posts and user metadata from the microblogging service Twitter
Home-page: https://gitlab.com/christoph.fink/twitterhistory/
Author: Christoph Fink
Author-email: christoph.fink@helsinki.fi
License: GPLv3
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE

# Download a history of posts and user metadata from the microblogging service Twitter

***Twitterhistory*** **is in early BETA status. Using it in production might be an absolutely bad idea (but it also may just work). If you encounter any issues, [please report them](https://gitlab.com/christoph.fink/twitterhistory/-/issues) and/or submit a merge request with a fix.**

This is a Python module to download a complete history of posts and user metadata from the microblogging service Twitter, using version 2 of its API (the latest version as of 2021). Data are saved to an SQLAlchemy/GeoAlchemy2-compatible database (currently, only PostgreSQL/PostGIS is fully supported; see also the [documentation of GeoAlchemy2](https://geoalchemy-2.readthedocs.io/en/latest/)).

![screen shot](https://gitlab.com/christoph.fink/twitterhistory/-/blob/master/extra/images/screenshot.png)

The script downloads all Twitter status messages up until the current time and keeps track of already downloaded time periods in a cache file (default location `~/.cache/twitterhistory.yml`). When started the next time, it attempts to fill gaps in the downloaded data and to catch up to the then-current time.

To use *twitterhistory*, your API key (see further down) needs to be associated with an account that has [academic research access](https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you).

If you use *twitterhistory* for academic research, please cite it in your publication: <br />
Fink, C. (2021): *twitterhistory: a Python tool to download historical Twitter data*. [doi:10.5281/zenodo.4471195](https://doi.org/10.5281/zenodo.4471195)

### Dependencies

The script is written in Python 3 and depends on the Python modules [blessed](https://blessed.readthedocs.io/), [GeoAlchemy2](https://geoalchemy-2.readthedocs.io/), [psycopg2](https://www.psycopg.org/), [PyYAML](https://pyyaml.org/), [Requests](https://2.python-requests.org/en/master/) and [SQLAlchemy](https://sqlalchemy.org/).

### Installation

```shell
pip install twitterhistory
```

### Configuration

Copy the example configuration file [twitterhistory.yml.example](https://gitlab.com/christoph.fink/twitterhistory/-/raw/master/twitterhistory.yml.example) to a suitable location, depending on your operating system: 

- on Linux systems:
    - system-wide configuration: `/etc/twitterhistory.yml`
    - per-user configuration: 
        - `~/.config/twitterhistory.yml` OR
        - `${XDG_CONFIG_HOME}/twitterhistory.yml`
- on macOS systems:
    - per-user configuration:
        - `${XDG_CONFIG_HOME}/twitterhistory.yml`
- on Microsoft Windows systems:
    - per-user configuration:
        - `%APPDATA%\twitterhistory.yml`
- in a custom file path location specified on the command line (see further down)
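The locations listed above can be collected programmatically, for example to check where a configuration file would be picked up. The following is a minimal sketch only: `candidate_config_paths` is a hypothetical helper, not part of *twitterhistory*, and the order of the returned paths is an assumption (the list above does not state a precedence).

```python
import os
import sys
from pathlib import Path

CONFIG_FILE_NAME = "twitterhistory.yml"


def candidate_config_paths(platform=sys.platform, environ=os.environ):
    """Return the configuration file locations listed above for one platform.

    Hypothetical helper for illustration; the order of the result is an
    assumption and may differ from what twitterhistory does internally.
    """
    if platform.startswith("win"):
        # Microsoft Windows: %APPDATA%\twitterhistory.yml
        appdata = environ.get("APPDATA")
        return [Path(appdata) / CONFIG_FILE_NAME] if appdata else []

    candidates = []

    # Linux and macOS: ${XDG_CONFIG_HOME}/twitterhistory.yml (only if set)
    xdg_config_home = environ.get("XDG_CONFIG_HOME")
    if xdg_config_home:
        candidates.append(Path(xdg_config_home) / CONFIG_FILE_NAME)

    if platform.startswith("linux"):
        # Linux only: per-user default, then the system-wide file
        candidates.append(Path.home() / ".config" / CONFIG_FILE_NAME)
        candidates.append(Path("/etc") / CONFIG_FILE_NAME)

    return candidates


if __name__ == "__main__":
    for path in candidate_config_paths():
        print(path, "(exists)" if path.exists() else "(missing)")
```

Running the snippet prints each candidate path for the current platform and whether a file exists there.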

Adapt the configuration:

- Configure a database connection string (`connection_string`), pointing to an existing database (with the PostGIS extension enabled).
- Configure an API [OAuth 2.0 Bearer token](https://developer.twitter.com/en/docs/authentication/oauth-2-0) with access to the Twitter API v2 (`twitter_oauth2_bearer_token`).
- Configure one or more search terms for the query (`search_terms`).
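Before starting a long download, it can help to confirm that all three settings are present. The sketch below assumes the configuration has already been parsed into a dictionary with the three keys at the top level; `missing_config_keys` is a hypothetical helper, and the example values are placeholders rather than the format the example configuration file prescribes.

```python
REQUIRED_KEYS = (
    "connection_string",
    "twitter_oauth2_bearer_token",
    "search_terms",
)


def missing_config_keys(config):
    """Return the required keys that are absent or empty in a parsed config.

    Hypothetical helper; it only checks presence, not whether the values
    are valid for twitterhistory.
    """
    return [key for key in REQUIRED_KEYS if not config.get(key)]


if __name__ == "__main__":
    # Placeholder values for illustration only
    config = {
        "connection_string": "postgresql://user:password@localhost/twitterhistory",
        "twitter_oauth2_bearer_token": "",
        "search_terms": ["helsinki"],
    }
    missing = missing_config_keys(config)
    if missing:
        print("Missing configuration values:", ", ".join(missing))
```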

If you have a cache file from a previous installation in which already downloaded time periods are saved, copy it to `${XDG_CACHE_HOME}/twitterhistory.yml` on Linux or macOS, or to `%LOCALAPPDATA%/twitterhistory.yml` on Microsoft Windows.

The cache file is currently also the best way to limit the temporal range of the data collection (by default, *twitterhistory* downloads the entire history of Tweets that correspond to the search terms). Run *twitterhistory* at least briefly for it to create an initial cache file. In this file, it marks the time spans for which it successfully downloaded data, per `search_term`. Add one or more `!TimeSpan` objects that cover all periods between March 2006 and the current date except the temporal range you want to download; *twitterhistory* will then try to fill only this gap.
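Working out which periods to mark as already downloaded is a matter of taking the complement of the wanted range. The following sketch computes those periods as plain `(start, end)` tuples. `spans_to_mark_downloaded` is a hypothetical helper, and 1 March 2006 is an assumed start date standing in for "March 2006". The sketch does not write `!TimeSpan` objects; copy their exact notation from the cache file that *twitterhistory* creates.

```python
from datetime import datetime, timezone

# Assumed start of the history ("March 2006" in the text above)
HISTORY_START = datetime(2006, 3, 1, tzinfo=timezone.utc)


def spans_to_mark_downloaded(wanted_start, wanted_end, now=None):
    """Return the (start, end) periods that lie outside the wanted range.

    Hypothetical helper: the result lists the periods to enter into the
    cache file so that only the wanted range remains as a gap.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if not HISTORY_START <= wanted_start < wanted_end <= now:
        raise ValueError("wanted range must lie between March 2006 and now")

    spans = []
    if wanted_start > HISTORY_START:
        # Everything before the wanted range
        spans.append((HISTORY_START, wanted_start))
    if wanted_end < now:
        # Everything after the wanted range
        spans.append((wanted_end, now))
    return spans


if __name__ == "__main__":
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)
    end = datetime(2020, 2, 1, tzinfo=timezone.utc)
    for span_start, span_end in spans_to_mark_downloaded(start, end):
        print(span_start.isoformat(), "to", span_end.isoformat())
```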

### Usage

#### Command line executable

```shell
python -m twitterhistory
```

```shell
python -m twitterhistory --config /path/to/custom/config-file.yml
```

#### Python

Import the `twitterhistory` module. Instantiate a `TwitterHistoryDownloader`, and call its `download()` method.

```python
import twitterhistory

downloader = twitterhistory.TwitterHistoryDownloader()
downloader.download()
```

### Data privacy

By default, *twitterhistory* pseudonymises downloaded metadata, i.e., it replaces (direct) identifiers with randomised identifiers (generated using hashes, i.e., one-way ‘encryption’). This serves as one step of a responsible data processing workflow. However, other (meta-)data might nevertheless qualify as indirect identifiers, as they, combined or on their own, might allow re-identification of a person. If you want to use data downloaded using *twitterhistory* in a GDPR-compliant fashion, you have to follow up the data collection stage with data minimisation and further pseudonymisation or anonymisation efforts.
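To illustrate the general idea of hash-based pseudonymisation, the sketch below replaces an identifier with a SHA-256 digest. It is not the implementation *twitterhistory* uses: the function name `pseudonymise`, the choice of SHA-256 and the optional `salt` parameter are all assumptions made for this example.

```python
import hashlib


def pseudonymise(identifier, salt=""):
    """Return a SHA-256 hex digest that stands in for the identifier.

    Illustrative only; twitterhistory's own hashing scheme may differ.
    """
    return hashlib.sha256(f"{salt}{identifier}".encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The same input always maps to the same pseudonym, so records that
    # belong to one user can still be linked after pseudonymisation.
    print(pseudonymise("example_user") == pseudonymise("example_user"))
    print(pseudonymise("example_user") == pseudonymise("another_user"))
```

Because the mapping is deterministic, anyone who can guess an original identifier can recompute its hash. A secret salt makes such guessing harder, which is one reason why hashing alone counts as pseudonymisation and not as anonymisation.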

*twitterhistory* can keep original identifiers (i.e., skip pseudonymisation). To instruct it to do so, instantiate a `TwitterHistoryDownloader` with the parameter `pseudonymise_identifiers=False` or set the corresponding parameter in the configuration file. Ensure that you fulfil all legal and organisational requirements to handle personal information before you decide to collect non-pseudonymised data.

```python
import twitterhistory

downloader = twitterhistory.TwitterHistoryDownloader(
    pseudonymise_identifiers=False  # get legal advice and ethics approval before doing this
)
downloader.download()
```


