Metadata-Version: 2.1
Name: urlcounter
Version: 0.0.3
Summary: A set of functions that tally URLs within an event-based corpus. It assumes that you have data divided into a range of event-based periods with community-detected modules/hubs. It also assumes that you have unspooled and cleaned your URL data. See Deen Freelon's unspooler module for help: https://github.com/dfreelon/unspooler.
Home-page: https://github.com/lingeringcode/urlcounter/
Author: Chris A. Lindgren
Author-email: chris.a.lindgren@gmail.com
License: UNKNOWN
Download-URL: https://github.com/lingeringcode/urlcounter/
Description: # urlcounter
        
        By Chris Lindgren <chris.a.lindgren@gmail.com>
        Distributed under the BSD 3-clause license. See LICENSE.txt or http://opensource.org/licenses/BSD-3-Clause for details.
        
        ## Overview
        
        ```urlcounter``` is a set of functions that tallies full and domain URLs for periodic, event-defined social-media posting data. It assumes you seek an answer to the following questions about link-sharing:
        
        1. What are the top x full URLs and domain URLs from each group during each period?
        2. What are the top x full URLs and domain URLs from each group-module (detected community) in each period?
        
        To use the module, import and follow the below example for guidance:
        
        ```python
        import urlcounter as urlc
        
        dict_url_counts = urlc.top_urls(
            df=cdf, #DataFrame of full corpus
            periods=(1,10), #Tuple providing range of numbered periods
            hubs=(1,10), #Tuple providing range of numbered hubs
            period_dates=period_dates, #Dict of Lists with dates per period
            list_of_regex=[htg_btw,htg_fbt,htg_anti], #List of regex patterns defined for each group
            hl=hub_lists, #Dict with keyed lists of hub usernames per period
            columns=['cleaned_urls', 'retweets_count', 'hashtags', 'username', 'mentions'], #Provide a List of column names to use for search and counting
            url_sample_size=50, #Desired sample size limit, e.g., Top 50
            verbose=True #Boolean. True prints out status messages, False prints nothing
        )
        ```
        
        ### Example outputs
        
        It returns a ```Dict``` keyed by user-defined group names, period ranges, and module ranges:
        
        ```python
        # Overall period-based URL summary data with keyed group name, 'fbt'
        ## '1' = Period 1
        ## 'fbt' = Keyed group name + 'urls_per_period' and 'domains_per_period' = Summary total data
        output['1']['fbt_urls_per_period']
        output['1']['fbt_domains_per_period']
        
        # Overall community hub-based URL summary per period data with keyed group name, 'fbt'
        ## '1' = Period 1
        ## 'fbt' = Keyed group name 'fbt'
        ## '1' = Community hub/module 1
        ## 'hub_sample_size', 'hub_tweet_sample_size','hub_url_counts','hub_domain_counts' = Summary total data
        output['1']['fbt']['1']['hub_sample_size']
        output['1']['fbt']['1']['hub_tweet_sample_size']
        output['1']['fbt']['1']['hub_url_counts']
        output['1']['fbt']['1']['hub_domain_counts']
        ```
        
        ```python
        {'1': #Start period 1
          'fbt_domains_per_period': [ #start period 1 totals for group keyed as 'fbt'
            ('twitter.com', 3003), ('instagram.com', 1001), ('facebook.com', 202)
          ],
          'fbt_urls_per_period': [
            ('https://twitter.com/user/status/example', 202),
            ('https://www.instagram.com/p/example/', 202),
            ...
          ]}, #end period 1 totals for group keyed as 'fbt'
          {'fbt': { #start period 1, module/hub 1
            '1': {
              'hub_domain_counts': [
                ('example.com', 178),
                ('example2.go.lc', 14),
                ('example3.com', 10),...
              ],
              'hub_sample_size': 103,
              'hub_tweet_sample_size': 486,
              'hub_url_counts': [
                ('https://example.com/politics/story-title-1/',120),
                ('https://example.com/politics/story-title-2/',58),
                ...
              ]
              }
            }
          }, #end period 1, module/hub 1
          ...
        }, #end period
        ...
        ```
        
        ### top_urls()
        
        Tallies up URLs in corpus.
            
        Arguments:
        
        - ```df```= DataFrame. Corpus to query from.
        - ```columns```= a List of 5 column names (String) to reference in DF corpus. **!IMP**: *The order matters*:
          1. Column with URLs (String) that includes a list of URLs included in post/content: 
            - Example: ['https://time.com','https://and-time-again.com']. The List can also be a String, '[]' since the function converts literals.
          2. Column with number of times a post was shared (Integer), such as Retweets on Twitter.
          3. Column with group data (String), such as hashtags from tweets.
          4. Column with usernames (String), such as tweet usernames
          5. Column with target content data (String), such as tweets with targeted users from module, or stringified list of targeted people like tweet mentions.
        - ```url_sample_size```= Integer. Desired sample limit.
        - ```periods```= Tuple. Contains 2 Integers, which define the range of periods, e.g., (1,10)
        - ```hubs```= Tuple. Contains 2 Integers, which define the range of module/hubs, e.g., (1,10)
        - ```period_dates```= Dict of Lists with dates per period: pd['1'] => ['2018-01-01','2018-01-01',...]
        - ```list_of_regex```= List. Contains:
            1. list of regex patterns with group identifiers, such as hashtags
            2. String. Key identifier for group.
        - ```hl```= Dict. Contains lists of community-detected usernames
        - ```verbose```= Boolean. True prints out status messages (recommended), False prints nothing
        
        Returns:
        
        - Dict. See documentation for output details for data access.
        
        ### url_counter()
        
        Helper function for ```top_urls()```. It transforms an incoming list of Strings into a regex string to facilitate a search.
            
        Arguments:
        
        - ```df```: DataFrame. Array of Strings to write as a regex String.
        - ```columns```: A List of 4 column names to use from corpus, but only uses the first two in this function:
            1. Name of URL column that includes a list of URLs included in post/content.
            2. Integer. Number of times a post was shared, such as Retweets on Twitter.
        
        Returns:
        
        - A ```List``` that includes:
          - ```sorted_totals```: List of Tuples that contain 2 items:
              - String full URL
              - Integer. Total number of URL instances (including RTs).
          - ```sorted_domain_totals```:
              - String domain URL
              - Integer. Total number of URL instances (including RTs).
        
        ### regex_lister()
        
        Helper function for ```top_urls()```, but also can be used to create the group regex search parameters on its own. It transforms an incoming list of Strings into a regex string to facilitate a search.
            
        Arguments:
        
        - ```the_list```: List. Array of Strings to write as a regex String.
        - ```key```: String. Denotes the group name
        
        Returns:
        
        - ```keyed```: Tuple with;
            - ```'key'``` (String) that denotes the group name
            - ```'listicle'``` (regex String) that will be used for a search
        
        ```urlcounter``` functions only with Python 3.x and is not backwards-compatible (although one could probably branch off a 2.x port with minimal effort).
        
        **Warning**: ```urlcounter``` performs no custom error-handling, so make sure your inputs are formatted properly! If you have questions, please let me know via email.
        
        ## System requirements
        
        * pandas
        
        ## Installation
        ```pip install urlcounter```
        
        ## Distribution update terminal commands
        
        <pre>
        # Create new distribution of code for archiving
        sudo python3 setup.py sdist bdist_wheel
        
        # Distribute to Python Package Index
        python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/*
        </pre>
Keywords: data processing,summary analysis,URL,circulation studies
Platform: UNKNOWN
Description-Content-Type: text/markdown
