Metadata-Version: 1.1
Name: pipulate
Version: 0.1.98
Summary: Make Data Dashboards with Google Sheets and not much else.
Home-page: https://github.com/miklevin/pipulate
Author: Mike Levin SEO, 360i & Commodore alum, Creator of HitTail now working in NYC for J2 Ziff-Davis Mashable IGN.
Author-email: miklevin@gmail.com
License: MIT
Download-URL: https://github.com/miklevin/pipulate/archive/0.1.97.tar.gz
Description: ===============
        pipulate v0.1.9 - Automate Google Sheets for SEO
        ===============
        
        :Author: `Mike Levin, HitTail Inventor & SEO in New York City <http://mikelev.in>`_
        
        - Are you already drinking the Google Koolaid with Analytics, GSC and GDocs?
        - Are you still paying for Google Sheet automation with expensive plug-ins?
        - Do you feel there's some sort of missing "glue" to hold it all together?
        - Would you like that glue to be career-advantageous future-proofing coolness?
        
        If you've answered yes to one or more of these questions, then Pipulate is for
        you— or in a broader sense, Python and the SQL-alternative, Pandas. Go install
        the Python 3.6 version of Anaconda from https://www.anaconda.com/download/ and
        then run Jupyter Notebook. It'll pop up in a tab in your browser. Then get used
        to the following magical incantations::
        
            import pipulate as gs  # Install from console with: pip install pipulate
            import pandas as pd
            import numpy as np
        
        Then hold onto your hat, because you've just jumped onto the bandwagon of a
        lifetime. Here is my `love letter to Python <http://mikelev.in/2011/01/python-programming-language-advantages/>`_.
        
        .. image:: pipulate-logo.svg
        
        There's a reason spreadsheets remain as popular as they do in the face of more
        capable databases and reporting systems. Spreadsheets are designed for humans—
        you don't need a developer for every little thing. You can use ledgers in many
        places where a database would be overkill. However, as anyone who's tried to
        build durable long-term automations around spreadsheets using VBA, AppScript
        (or any of the other obvious choices), it ain't so easy. I fix that here.
        
        First, I plan to open the mysterious world of Python-driven data-automation to
        newbs simply by having you import a library that auto-prompts you for Google's
        OAuth web-login, getting you right over the first show-stopping hurdle. After
        that, I advocate a simple convention of selecting ranges out of a spreadsheet,
        modifying them, and pushing the results back in.  And that's mostly done with
        Pandas; the one data manipulation tool every SEO should right now be mastering.
        
        Congratulations! You're about to learn to program Python (easily) for SEO::
        
            #  _____ _     _       _       ____  _             _       _       
            # |_   _| |__ (_)___  (_)___  |  _ \(_)_ __  _   _| | __ _| |_ ___ 
            #   | | | '_ \| / __| | / __| | |_) | | '_ \| | | | |/ _` | __/ _ \
            #   | | | | | | \__ \ | \__ \ |  __/| | |_) | |_| | | (_| | ||  __/
            #   |_| |_| |_|_|___/ |_|___/ |_|   |_| .__/ \__,_|_|\__,_|\__\___|
            # There are many frameworks like it,  |_| but this one is mine. --MikeL
        
        .. contents::
            :backlinks: none
        
        .. sectnum::
        
        ########################################
        Background & Philosophy
        ########################################
        ::
        
            You gotta pay the Python toll
            If you want out of the vendor hole.
        
        Congratulations to Conductor for the acquisition by WeWork. This should bring a
        normally inaccessible tier of SEO-product into the reach of exactly the sorts
        of people who can use it. For everyone else cobbling it together, I'm here.
        
        .. highlight:: python
        
        ****************************************
        A better way
        ****************************************
        
        I've done ad hoc and scheduled data-churning work for a long time, where
        investigation becomes automated system becomes generalized tool. I've invented
        my own systems for this sort of stuff that have gone obsolete in the face of a
        rapidly changing world so often that it hurts. No really, it hurts. Having now
        experienced almost-framework packages like Flask that merely help you connect
        the dots begging to be connected (Requests & Workzeug in their case / GData &
        GSpread in mine), I'm like Yeah! Pipulate should be like that— almost not
        there. Focus on what's important: Pandas! Think light-touch framework mentality
        as pretty much the opposite of Django, which is what most of my framework
        creations of the past more-or-less resembled. This is I think a better way.
        
        ****************************************
        Of Stacked Decks & Killer Card Combinations
        ****************************************
        
        So now, instead of inventing my own systems, I leverage the strength of the
        cards I'm dealt. This is familiar to anyone who place the Magic The Gathering
        card battle-game where certain card-combinations can't be played because they
        are too powerful and throw of the balance (and fairness) of the game.
        Thankfully for us SEOs, in real-life thankfully such killer combinations are
        not so easy to ban— such as that of Google Sheets, Jupyter Notebook and
        Pandas. It's like resurrecting elves from your graveyard for unlimited
        fireballs the gift that keeps on giving.
        
        ****************************************
        It's Pandas all the way down
        ****************************************
        
        An Ode to Pandas::
        
            I'm a #Python #pandas person'
            'Cause #SEO will worsen
            With each day the #AI's
            Waking uuuup!
            Solutions?
            I've a' plenty!
            But none are
            Muscle-friendly;
            Because
            from #vim
            I must reach
            Juuuupyter!
        
        I used to be an amateur (very amateur) magician. You know, it's all the same
        feeling for me; magic, art and coding... 3 sides of an eternal golden braid...
        and a bit of performance art, the constant hunt for a good paying gig, and
        getting down an bunch of cool new moves that will serve you for a few years
        until they stop working. Well, this pandas trick has got quite a few good years
        ahead of it as it replaces SQL as the ad hoc data langua franca du jour— and
        that ain't no fad. Legit bandwagon here that's perfect for SEO's, so ya had
        better get a jumpin. If you weren't a data-tech before, you are one now.
        
        ****************************************
        From SEO to Data Tech to Datamaster
        ****************************************
        
        Once you can manipulate Pandas DataFrames, you have super-powers and are no
        longer an SEO. Your bullshit will smell more like statistically convincing
        correlations, though you will always attach the standard disclaimer, that does
        not mean causation. Therefore, our field remains Kung Fufu. Anyone who doesn't
        want to be called out on bad-science better be able to walk the walk. Show
        process. KNOW process. Invent process. Make process your own secret sauce in
        such a way where you can EVEN STILL jump-on and benefit from some pretty
        grown-up and destined to succeed bandwagons.
        
        ****************************************
        That's where Pipulate comes in
        ****************************************
        
        Here's the basic API-invocation for Pipulate::
        
            cl, df = gs.pipulate(tab, rows, cols)     # pulls range from GSheet
        
            # Do stuff to df object using pandas,
            # the new "must know" SQL-alternative.
            # Just keep the row x col "shape" intact, then...
        
            gs.populate(tab, cl, df)                  # pushes range to GSheet
        
        That's it. That's Pipulate— it's just the rectangular data-range pull & push
        stuff (nothing in-between). So, `go get Anaconda 3.6 <https://www.anaconda.com/download/>`_ 
        and help me reposition SEO somewhere more than the gut-feeling Intuiters of
        yesteryear and Data Scientists of tomorrow— the **Datamaster** with a
        Python/Pandas industry-standard skill-set has vast room for creative expression
        and career-path flexibility (hedging your bet on SEO). 
        
        ****************************************
        there's nothing to see here. please move along.
        ****************************************
        
        If you're reading this, you're among the first who knows what's going on with
        me here with my latest version of Pipulate— which amounts to yielding to the
        staggeringly cool process put into motion by Fernando Perez (invented iPython
        that's become Jupyter Notebook) and Continuum Analytics who put together the
        Anaconda FOSS equivalent to Mathematica, MATLAB, SAS or whathaveyou. It's data
        analysis stuff so scientists can solve their crisis of reproducibility a few
        years back that really shook 'em up and gave "the other side" a lot of ammo. 
        
        If you want to fight data with data, you better have some seriously good Kung
        Fu. I'm the type of SEO who survives just fine at places like 360i with clients
        like Apple and JCPenney (I helped pull them out of the fire), and these are my
        tools now in their modern form. I prefer to work in-house where I can slip in
        work like this and not have to bill it evenly against each account I work on
        (yuck!). I can get into my grove, walking the walk as I talk the talk...
        in-house SEO-style... yeah. Good stuff.
        
        ****************************************
        Be a Lightweight Pandas-speaking API-Monkey in the Middle
        ****************************************
        
        And so, here is my little gift to you, world of SEO. It's not DeepCrawl or
        Conductor or ahrefs. No, but it can connect to any of them. It could draw
        exactly the dashboard you want from out of all 3 systems (if you've got 'em)
        and give you just the dashboard you need under simple GMail security context,
        and not some per-seat license login nonsense. Innovate new reports in Jupyuter
        Notebook. Refine them. Copypasta them over to Linux scheduling where you fit it
        togehter like Tetris, squeezing every dollars worth out of your EC2 instance.
        
        This project is just the medicine that a highly stressed SEO field needs. I got
        your cure right here, and it's really just the minimum baseline technical
        capabilities you should have today so that you may interact...  with...
        anything.
        
        ****************************************
        All The Bands Whose Wagon's You're Hitching Your Star Too Are Belong To Us
        ****************************************
        
        Track your SEO rankings, Social Media views and counts, whatever!  Start in the
        shallow end with Jupyter Notebook and "graduate" your work to generic scheduled
        Linux jobs that you can run almost anywhere. If you're a successful YouTuber
        worried about the gravy-train running dry, diversify your skills with something
        you'll probably love and can definitely self-learn. This is your chance to jump
        with me onto the following surprisingly fun bandwagons:
        
        - **Python** - programming for humans
        - **Jupyter Notebook** - programming for even more humans
        - **Pandas** - data manipulation for humans
        - **GSheets** - data manipulation for even more humans
        
        ...but mostly Python. Go read my love letter to Python even though I have to
        update it for 2018: http://mikelev.in/2011/01/python-programming-language-advantages/
        
        ****************************************
        You Would Not, Could Not Learn to Code?
        ****************************************
        
        I'm talking to YOU, Animation Gang! Sure, it's fun to animate today, but you
        will always be at the mercy of people who can express automation concepts as
        naturally as you speak English. Coding is not what you think. You already do
        it. You express your creative dynamic right-brained selves in an overwhelmingly
        technical and meticulous left-brained medium. I'm not sure if you understand
        what's giving you super-powers on YouTube, but when that foolish dancing gets
        tired and your energy runs out, come back here, sit down and learn to code
        Python. You'll green eggs and ham.
        
        - Automate Gimp
        - Automate Inkscape
        - Automate Blender
        - Automate ImageMagik
        - Automate Robots
        
        ****************************************
        The Million-Follower YouTube Challenge... Accepted
        ****************************************
        
        I could go on, but I think I make my point. My daughter challenged me to a
        million followers on YouTube by VidCon. I'm like, haha, not likely. No one's
        really interested in your dad's opinion that you're either either doing it
        old'skool or you're fresh meat. Your dad can reach a million followers by
        VidCon 2019 if I really focus and work on it. First step is to just catch
        Jaiden's or James' or Becca's or Tony's attention and mention my channel in one
        of their videos— a 7 year old YouTuber's super-nerd Dad in New York trying to
        teach kids Python before their brains are poisoned with JavaScript. Oh... he
        was the first to unbox a Raspberry Pi on YouTube and got a million-view video
        out of it... then dropped the ball, because kid. Now kid loves YouTube and is
        challenges me from a million views to a million followers.
        
        Challenge accepted, Adi. https://www.youtube.com/mikelevin
        
        ****************************************
        Did I Mention You're Either Old's Kool or Fresh Meat?
        ****************************************
        
        And while it's definitely **not** required, I'll also teach you the timeless
        badass tools of tech: **Linux**, **vim** and **git**— or at least the minimum
        you need to know about each to project that "I'm Tony Stark" feeling.
        
        Pipulate is about repositioning careers as SEO continues to change. Some tools
        are like shiny new pennies (Jupyter Notebook & pandas), while others have
        ascended to be the very fabric of our modern infotech-world— the true Samurai
        Kung Fu vorpal light saber weapons of tech where developing muscle-memory makes
        you dangerous... to your competitors.
        
        ########################################
        Installing Pipulate
        ########################################
        
        Step #1: Install https://www.anaconda.com/download/ start Jupyter Notebook.
        Make a new Python 3 notebook and type::
        
            !pip install pipulate --upgrade
        
        This command executes as an operating system command (because of the
        exclamation mark) and not Python code. It may take awhile, but you will
        eventually get the output of the pip program vomit out a bunch of messages
        about installing this-and-that dependency. Most Pipulate requirements are
        already met by Anaconda, but if you're in some other Python environment, you
        can install all the rest of the requirements for Pipulate with pip install
        pandas (a VERY BIG install).
        
        ########################################
        For the impatient
        ########################################
        
        The quickest way to pipulate is as follows, but I suggest you read through this
        documentation where I cover creating Pipulate functions. The pandas concept of
        df.apply() is very important to grok. If you don't grok the word grok, go read
        Stranger in a Strange Land and return::
        
            import pipulate as gs
            import pandas as pd
            tab = gs.name('Your Sheet Name').sheet1
            cl, df = gs.pipulate(tab, rows=(1, 20), cols=('a', 'b'))
            df['B'] = 'foo'
            gs.populate(tab, cl, df)
        
        Or the slightly longer-form, but probably easy for maintenance::
        
            import pipulate as gs
            import pandas as pd
            key = 'gobbledygookdockeyhere'
            name = 'Sheet1'
            rows = (1, 20)
            cols = ('a', 'b')
            sheet = gs.key(key)
            tab = sheet.worksheet(name)
            cl, df = gs.pipulate(tab, rows, cols)
            df['B'] = 'foo'
            gs.populate(tab, cl, df)
        
        ****************************************
        Things even the impatient must know
        ****************************************
        
        - You must be on Python >= 3.x.
        - You must be using a Google Spreadsheet in online mode.
        - You must exactly match 'Your Sheet Name' with your actual sheet name (or
          switch to keys).
        - You must be able to recognize row and column indexes when you see them.
        - When you import pipulate as gs, Google will (1-time) throw up a giant blue
          link that you must click and login with the same Google account you used to
          make the Google Sheet with.
        - The meaning of the word grok.
        
        ****************************************
        Things about OAuth2 you should eventually know
        ****************************************
        
        Upon first-run, Pipulate asks you once for access and then not again until
        things goes wrong and you have to login again. You should be aware that a file
        named ouath.dat is dropped in your working directory (where you Jupyter
        Notebook .ipynb files save) which allows full access to your Google stuff.
        There is a refresh token there that grants new rapidly-expiring access tokens,
        but which itself doesn't expire. It is used to frequently re-log you in
        invisibly in the background. If you're interested in seeing these tokens, you
        can open oauth.dat in a text editor. It really helps to start to demystify
        OAuth2. Occasionally, Google WILL make even the refresh token have to be
        recreated with a new Web login, so just be aware of that especially if you
        build real automations (non-Jupyter Notebook) around Pipulate.
        
        ########################################
        A tour through Pipulate
        ########################################
        
        After you've installed Anaconda, run Jupyter Notebook by either looking for an
        icon named Jupyter Notebook in your Start Menu (Windows) after an Anaconda
        install, or select Anaconda-Navigator from your Applications folder (Mac) and
        Launch Jupyter Notebook from there. There's a few ways to get Jupyter Notebook
        running, but so long as something pops up in your web browser where you can
        choose New / Notebook: Python 3, then you found it. After you have a new
        Notebook, if you haven't done step #1 already, then execute::
        
            !pip install pipulate --upgrade
        
        Once pipulate is installed, you don't have to do that again, so after it's done
        (it can take awhile), you can delete the command and it's ugly output and get
        onto the real action. There are alternative ways to do pip installs in
        "Anaconda Prompts" (with more control outside the browser) but details vary on
        Macs vs. Windows vs. Linux, so I prefer to tell people how to do pip installs
        from within Anaconda. The --upgrade parameter ensures you always have the
        latest because I will be updating it often. Step #2, execute::
        
            import pipulate as gs
        
        This will cause an enormous Google Web login-link to appear in your Jupyter
        Notebook that you must click, which will open another tab in your browser
        presenting a Google login prompt. Choose which Google account you want to use
        to access Sheets. It must have permission to the sheet you're manipulating. It
        also asks for various other Google Service permissions while it has the chance,
        in case you plan on using Pipulate to track your YouTube view-counts and such,
        which you should totally do.
        
        ****************************************
        Pipulate naming conventions
        ****************************************
        
        In case you're wondering why I recommend the convention of importing pipulate
        as gs, it's because my other Github module GoodSheet got fully wrapped in here,
        and I like reminding everyone Pipulate is in fact GoodSheet. I also got very
        fond of how gs.pipulate() looks, and I think it helps that gs also stands for
        Google Sheet. It also avoids the verbosity of pipulate.pipulate() or
        abbreviation-confusion of pi.pipulate() or pip.pipulate() or any of the other
        choices not nearly as beautiful as gs.pipulate().
        
        For those familiar with the Flask web microframework, it might help to think of
        Pipulate as something lightly sprinkled in to connect GSpread and Pandas, and
        not really trying to do all that much itself except a few API innovations to
        help. The act of pipulating is just picking up an Excel-style rectangular
        spreadsheet range as both a GSpread cell_list and a Pandas DataFrame, altering
        the df completely with Pandas, and then using the symmetrical act of POPULATING
        to push the changes back into Google Sheet.
        
        ****************************************
        Of Pandas & dependencies
        ****************************************
        
        Pipulate is designed to let you do all your challenging data-manipulation work
        in Pandas. Pandas is not part of Python "core", but then neither is Google
        Sheets or GSpread, so don't complain. You're drinking deep of both the Google
        and Python Koolaid with Pipulate. You could do a lot worse. Any disenfranchised
        SQL-users out there, Python Pandas is where you should be going. Not to put too
        fine a point on it, but SQL has let you down. You need a more universal
        lightweight "general case" data manipulation tool, and Pandas is it whether you
        realize it yet or not. It's not like Oracle's going to buy Python too. So just
        go ahead and import Pandas::
        
            import pandas as pd
        
        ****************************************
        Configuring a job
        ****************************************
        
        In that same Jupyter Notebook that you imported pipulate and pandas into, you
        can now set the values that will allow you to connect to our spreadsheet (file)
        and worksheet (tab), along with the cell range defined as a set of row and
        column indexes, using row-numbers and column-letters that display in
        spreadsheet user interfaces::
        
            key = '[Your GSheet key]'
            name = 'Sheet1'
            rows = (1, 20)
            cols = ('a', 'b')
        
        It's good to switch from using GSheet file-names to their unique "keys" for the
        sake of avoiding future confusion about which document you're actually working
        on. It's far too easy to have 2 files with the same name. Be sure to use the
        long string of characters copied out of a Google Sheet URL for the key. That's
        the long string of alphanumeric gobbledygook not broken up by slashes. The
        name is always "Sheet1" on a freshly-made sheet. If you rename it or want
        to manipulate a different tab, be sure to make it match this. The rows and cols
        tuple defines the rectangular region you will want to manipulate.
        
        Okay, let's generate some text to manipulate with Pipulate. Enter and execute::
        
            import this
        
        ...and you will now have 20 nice new lines about the Zen of Python to
        copy/paste from Jupyter Notebook to a newly-made Google Sheet you can use for
        the below exercise. In other words, create a new Google Sheet and paste the 20
        Zen of Python lines into cells A1:A20. You are now ready to pipulate.
        
        ****************************************
        Connecting to sheet
        ****************************************
        
        Open the connection to the Google Sheet (as if it were a database) and copy a
        rectangular range in both the GSpread cell_list format and as a Pandas
        DataFrame. This is setting the stage to pipulate, by creating two identical
        shapes, but of different types (one from GSpread and the other from Pandas)::
        
            sheet = gs.key(key)
            tab = sheet.worksheet(name)
            cl, df = gs.pipulate(tab, rows, cols)
        
        Even though the cl is a cell_list from GSpread, it is also very similar to
        Python's core datatype called list. Jupyter Notebook lets you inspect the
        contexts of cl or df simply by running them on their own line. Type this and
        hit Enter::
        
            cl
        
        As you can see, GSpread cell_lists are just what one might call a
        one-dimensional array in other languages, which is the same as a normal Python
        list datatype. However, a few extra attributes have been layered onto each
        cell, such as cl[0]._row to see what row a cell belongs to and cl[0]._col for
        its column. In this way, GSpread avoids more complex shapes like a list of
        lists or a list of tuples, but it does make manipulating it directly as if a
        spreadsheet a challenge, which is pretty frustrating because that's the entire
        reason you use a library like GSpread.
        
        Have no fear; Pandas to the rescue! It's not the cl we're going to manipulate.
        It's the df, which is a Pandas DataFrame and has a lot of powerful
        database-like tricks built-in. All we have to do is NOT TOUCH the cl until such
        time as we push our changes back to the spreadsheet. You can also inspect the
        df with Jupyter Notebook::
        
            df
        
        ****************************************
        And now a word about Jupyter Notebook and REPL enviornments
        ****************************************
        
        You can inspect objects like cl and df this way because you are in a REPL
        (read, eval, print, loop) for Python code execution where the contents of a cl
        or df is just sort of "hanging around" frozen in memory MID-EXECUTION for your
        casual perusal. This is both a small miracle, and makes Jupyter Notebook the
        ideal place for for scientists and marketers to "feel their way around" data
        before building resilient automations.
        
        I'm also helping you jump on the same bandwagon that's helping scientists solve
        the crisis of reproducibility that hit their field a few years back when they
        realized that 70% of published scientific research was unreproducible. While
        much credit goes to Jupyter Notebook, it's really Anaconda that gets it all
        installed and erases that pesky multi-platform issues that usually become very
        major stumbling blocks—even for scientists.
        
        ****************************************
        Your first pipulation
        ****************************************
        
        Say you wanted to just plug the value "foo" into column B::
        
            df['B'] = 'foo'
        
        You can now "push" your changed dataframe object back into the still
        compatibly-shaped cell_list object, but peek at it first "in memory" by just
        typing df all by itself::
        
            df
        
        Make the changes that you see in memory push back out to the spreadsheet. Watch
        the browser as you populate to see the changes occur!::
        
            gs.populate(tab, cl, df)
        
        Congratulations. You've just pipulated.
        
        Plugging data dynamically into Google Sheets is nothing new. Pipulate just
        simplifies it. To do something slightly more interesting, you can simply copy
        the contents of column A to B::
        
            df['B'] = df['A']
            gs.populate(tab, cl, df)
        
        ****************************************
        Working with numbers
        ****************************************
        
        Say there were numbers in column A and you wanted column be to be that number
        times 2. Notice I have to convert column A to integers even if they look like
        numbers in the spreadsheet, because GSpread converts all numbers to strings::
        
            df['B'] = df['A'].astype(int) * 2
        
        This example will throw an error if you try it on the Zen of Python data, you
        would get ValueError: invalid literal for int() with base 10: 'The Zen of
        Python, by Tim Peters'. But you can put numbers in column A and execute this to
        see a simple *2 operation and acquaint yourself with how automate-able things
        start to become when you replace tedious manual Excel processes with
        automation.
        
        ****************************************
        Appending strings
        ****************************************
        
        If you wanted to append foo to column A and put the result in column B (like
        above, but appending strings to an already already string-type column).::
        
            df['B'] = df['A'] + 'foo'
        
        
        ****************************************
        Alternatives to Pipulate
        ****************************************
        
        Embedded application languages like Microsoft's VBA or Google's AppScript can
        achieve similar results, but if I need to explain to you why these are not as
        good as using Python on the back-end, you're in the wrong place. The same goes
        for the ever-increasing selection of paid-for Excel and GDocs plug-ins and
        other proprietary vendor products which probably don't quite do what you need.
        
        Pipulate is mostly about Python and Pandas. You could replace gs.pipulate() and
        gs.populate() with pd.read_csv() and pd.to_csv() and take Google Sheets out of
        the equation entirely, or use Excel instead of GSheets by swapping PyExcel for
        GSpread. There's also a new library pygsheets that does a few more things like
        named ranges and limited formatting I may switch to. Doesn't matter, because
        Pipulate is a lightweight wrapper to provide a lightweight spreadsheet
        manipulation API.
        
        My thinking is that if you have to learn and master one tool for this sort of
        data manipulation, it might as well be Python/Pandas. Shove all the complexity
        onto pandas... it's going to be around for awhile... and not cost you anything.
        
        ****************************************
        Cheerleading for Anaconda, Jupyter Notebook and Pandas
        ****************************************
        
        The above example with .astype() also shows that even if you know Python,
        there's some new learning to do here for things like casting datatypes, which
        is actually different from pure Python. Pandas sits on NumPy which is a popular
        C-optimized Python library that provides N-dimensional arrays for the same kind
        of work that IBM dinosaurs still do in Fortran for science and stuff. Pandas is
        a FRAMEWORK on top of NumPy for such work, but which turns out to be perfectly
        designed for what I used to use Pipulate for when it was a Flask-based Web app.
        
        ****************************************
        Applying a Python function to every row of a Google Sheet
        ****************************************
        
        Now say you wanted to apply a function to every line of the DataFrame to do
        something like retrieve a title tag from a web address, and you had a function
        that looked like::
        
            def status_code(url):
                import requests
                rv = 'failed'
                try:
                    rv = requests.get(url).status_code
                except:
                    pass
                return rv
        
        Now you can get the status code of every URL in column A with::
        
            df['B'] = df['A'].apply(status_code)
        
        This is where the "framework" known as Pandas steps in with its own
        conventions. Pandas knows to take the function named in the apply method and
        for every row of the dataframe, plug the value found in column A into the
        function called status_code and plug the resulting value into column B. Look
        carefully at what's going on here, because it's about to get a lot more
        complicated.
        
        ****************************************
        Giving Python function entire row of Google Sheet as input
        ****************************************
        
        While the above example is powerful, it's not nearly as powerful as feeding TWO
        arguments to the function using values from out of each row of the dataframe.
        To do that, we simply call the .apply() method of the ENTIRE DATAFRAME and not
        just a row::
        
            df['B'] = df.apply(func, axis=1)
        
        There's a few things to note here. First, we HAVE TO include the axis=1
        argument or else each COLUMN will be fed to the function by default as it
        iterates through the dataframe. When you use the df.apply() method, you can
        step through the entire dataframe row-by-row or column-by-column, and we simply
        have to include axis=1 to PRESERVE the row-by-row behavior of calling the apply
        method directly from a row (when it's implicit). Now, we can select a 3-column
        range::
        
            key = '[Your GSheet key]'
            name = 'Sheet1'
            rows = (1, 4)
            cols = ('a', 'c') # <--Switched "b" to "c"
            sheet = gs.key(key)
            tab = sheet.worksheet(name)
            cl, df = gs.pipulate(tab, rows, cols)
        
        Now we plan on putting a URL in column A and some text that we're going to look
        for on the page in column B. Instead of just returning a response code, we will
        return how many times the text was found in the retrieved HTML of the page. So,
        we will desire to apply this command::
        
            df['C'] = df.apply(count_times, axis=1)
        
        ****************************************
        CAVEAT! Row-aware Pipulate functions grab values from row by index
        ****************************************
        
        However now the count_times function has more responsibility than the
        status_code function. Specifically, it needs to know to get the URL from column
        A and the keyword from column B, so we rewrite status_code as follows::
        
            def count_times(row):
                import requests
                url = row[0]
                keyword = row[1]
                rv = None
                try:
                    ro = requests.get(url)
                except:
                    pass
                rv = '--'
                if ro and ro.status_code == 200:
                    rv = ro.text.count(keyword)
                return rv
        
        With the above example, you put the URL you want to examine in column A and the
        text whose occurrences you want to count on the page in column B. The results
        appear in column C. This is where it starts getting more complex, and there are
        ALWAYS costs to complexity. Mapping has to go somewhere, and I currently choose
        to put it INSIDE Pipulate functions, which is not necessarily the best
        long-term decision, but complex as it may be, you're going to be able to follow
        everything that's going on right there in front of you without maintaining
        some awful set of per-project externalized mapping tables... ugh! You'll suffer
        through that sort of thing soon enough. For here, for now; MAGIC NUMBERS!
        
        ****************************************
        FYI: Python execution-context choices
        ****************************************
        
        Remember that the Python code is running under your control so you are not
        limited as you would be using Google's own built-in Apps Script (Google's
        answer to VBA) for the same purposes. Your Python code is running on your local
        machine (via Jupyter Notebook) and can easily be moved to the cloud or on cheap
        hardware like Raspberry Pi's. Truth be told, Jupyter Notebook is optional.
        
        All your data manipulation or "creative work" is taking place in Pandas
        DataGrids which you are "painting" onto in memory. Aside from copying the
        initial range out of a spreadsheet and then pasting the identically-shaped but
        altered rectangular spreadsheet range back in, this entire system is just
        becoming adept at Pandas using GSheets instead of CSVs.
        
        ****************************************
        Giving your function "extra" argument data per row (a tale of 2 splats)
        ****************************************
        
        When stepping row-by-row through a Python Pandas DataFrame, it is often
        desirable to insert "meta" attributes that can be used in the function WITHOUT
        putting those numbers wastefully on every row of the spreadsheet you're
        manipulating. Say the data we wanted to add is a date and it was the same dates
        for every row.
        
        ===== === ========== ==========
        one   com 2018-10-01 2018-10-31
        two   net 2018-10-01 2018-10-31
        three org 2018-10-01 2018-10-31
        ===== === ========== ==========
        
        Since the date would be the same all the way down, using a whole column in a
        Google Sheet for it would be a waste. In fact, GSheets has some limit to how
        many cells you can have, so an extra column with nothing but repeated data is
        very "expensive" quota-wise and slows your sheet down. Instead, only keep the
        unique data per-row in the sheet. The Pandas API (and Python API in a broader
        sense) provides for passing in both fixed-position arguments and labeled
        arguments by sort of "side-loading" them in as follows::
        
            df['C'] = df.apply(func, axis=1, start='2018-01-01', end='2018-01-31')
        
        APIs are weird. They work different ways in different languages, and this is
        how Python works. It's weird, but wonderful. There are subtle rules you have to
        get down here that just comes with experience. It's called learning to think
        Pythonically, If you're in Jupyter Notebook, take a moment to run this::
        
            import this
        
        --------------------
        Splat type #2: \**kwargs by label (easier to explain first)
        --------------------
        
        The argument named (\*\*kwargs) accepts as a parameter EITHER a Python
        dictionary object (called a dict, which looks a lot like JSON) or it will
        accept the more common command-line convention of name=value, name2=value2...
        as if being typed-in a terminal. I had a lot of difficulty grokking this, but
        it's one of the reasons Python is used to create user-loved "API-wrappers" to
        every non-Python API out there. Look at how you're going to have to ACCESS
        those values from inside a function::
        
            df['C'] = df.apply(func, axis=1, start='2018-01-01', end='2018-01-31')
        
            def func(row, **kwargs):
                number = row[0]
                tld = row[1]
                kwarg1 = kwargs['start']
                kwarg2 = kwargs['end']
                # Do stuff here
                return stuff
        
        --------------------
        Splat type #1: \*args by label
        --------------------
        
        That was an example where you have multiple labeled arguments like start and
        end dates. But if it's being side-loaded in a similar fashion similar to the
        row, then you use the other type of splat that only uses a single asterisk in
        the function argument definition::
        
            df['c'] = df.apply(func, axis=1, args=('two', 'peas'))
        
            def func(row, *args):
                number = row[0]
                tld = row[1]
                arg1 = args[0]
                arg2 = args[1]
                # do stuff here
                return stuff
        
        --------------------
        Splat types 1 & 2 together
        --------------------
        
        And then as you would imagine, you can mix positional \*splatting with labeled
        \**splatting. You just have to use positional first and labeled second (or
        last, actually), because if you think about it, that's how it must be::
        
            df['c'] = df.apply(func, axis=1, args=('two', 'peas'),
                               start='2018-01-01', end='2018-01-31')
        
            def func(row, *args):
                number = row[0]
                tld = row[1]
                kwarg1 = kwargs['start']
                kwarg2 = kwargs['end']
                arg1 = args[0]
                arg2 = args[1]
                # do stuff here
                return stuff
        
        --------------------
        One bourbon one scotch one beer
        --------------------
        
        Just to put a fine point on it, because it's really that important, the very
        common way to define a pipulate function and its arguments is::
        
            def func(row, *args, **kwargs):
        
        ...which gets invoked stand-alone like this::
        
            func(one_row, one_tuple, one_dict)
        
        ...or via Pandas like this::
        
            df.apply(func, axis=1, one_tuple, one_dict)
        
        ...or possibly like this::
        
            df.apply(func, axis=1, ('two', 'peas'), foo='bar', spam='eggs', ping='pong')
        
        ...is the same as saying:
        
        1. Define a function named "func".
        2. Require something in position 1.
        3. Optionally expect a tuple next.
        4. Optionally expect a dictionary or sequence of labeled values as the last thing(s).
        
        --------------------
        Automatic unpacking of tuples and dicts in \*args and \*\*kwargs
        --------------------
        
        If passing all these lists and name/value pairs starts to get ugly, remember
        Python actually likes to unpack for tuples and dicts for you as you splat. So
        this ugly form of the above API-call::
        
            df['C'] = df.apply(func, axis=1, args=('two', 'peas'),
                               start='2018-01-01', end='2018-01-31')
        
        ...can be re-written in Python as::
        
            pod = ('two', 'peas')
            dates = {'start' : '2018-01-01', 'end': '2018-01-31'}
            df['C'] = df.apply(func, axis=1, pod, dates)
        
        So the common pattern for a Pipulate function which you plan to apply to every
        row of a Pandas DataFrame using the .apply() method is::
        
            my_val = func(a_list, a_tuple, a_dict)
        
        --------------------
        Putting it all together
        --------------------
        
        So say you were starting out with this data, but you needed to use start and
        end dates with it, along with 2 more pieces of standard information per row.
        
        ===== ===
        one   com
        two   net
        three org
        ===== ===
        
        The Pipulate function to could look like::
        
            def func(row, *pod, **dates):
                postion = row[0]
                tld = row[1]
                pea1 = pod[0]
                pea2 = pod[1]
                start = dates['start']
                end = dates['end']
        
        ...and calling it from Pandas, again, like this::
        
            df['C'] = df.apply(func, axis=1,
                                od=('two', 'peas'),
                               dates={'start' : '2018-01-01',
                                       'end': '2018-01-31'
                                     }
                               )
        
        Aren't you glad Python doesn't HAVE TO look like JavaScript?
        
        --------------------
        Testing functions without DataFrame.apply()
        --------------------
        
        If you don't really want to connect to Google Sheets and you just want to test
        your Pipulate function with dummy data to simulate the DataFrame.apply() call,
        you can use the function directly like this::
        
            my_val = func(['three', 'org'],
                          ('two', 'peas'),
                          start='2018-01-01',
                          end='2018-01-31')
        
        But when the time comes to use it with Panda's DataFrame.apply(), it would look
        like this. Just a reminder, the word "func" is actually the name of the
        function that you've defined (with def) and axis=1 is what makes ROWS get fed
        in on each step through the DataFrame::
        
            df['C'] = df.apply(func, axis=1,
                               pod=('two', 'peas'),
                               start='2018-01-01',
                               end='2018-01-31')
        
        Whether you label the tuple or not in the call is optional, but if you do, it
        has to match the definition. Otherwise, its position is enough.
        
        --------------------
        How parts like to snap together
        --------------------
        
        Some pretty cool concepts of bundling and unbundling of attributes between
        Python objects and more common command-line API style is going on here. You
        don't have to use the Python objects as the argument parameters. You can break
        out and unbundle them yourself. If we only have one date parameter for example,
        we could feed it in an unlabeled fixed position::
        
            pod = ('two', 'peas')
            dates = {'start' : '2018-01-01', 'end': '2018-01-31'}
        
        ...which leads to the simplest form to look at::
        
            df['C'] = df.apply(func, axis=1, pod, dates)
        
        And there you have it. That's pretty much the basic use of Pipulate for
        completely open-ended semi-automated Python Kung Fu in Google Sheets. If you're
        anything like me, you're feeling chills running down your back at the
        possibilities. If jumping onto the SCIENCE bandwagon that's occurring (to fix
        their "crisis of accountability") isn't also the future of SEO, then I don't
        know what is. All Pipulate does is let you get it in and out of GSheets easily,
        so you can focus on the hard parts. Let the crazy ad hoc SEO investigations of
        your dreams begin!
        
        ########################################
        Working with very large lists
        ########################################
        
        Google Sheet is not always the best place to process very large lists, but the
        alternative is often worse, so the trick is to just decide by what size chunks
        you should process at a time. This concept is sometimes called step-by-stride.
        To use step-by-stride with Pipulate we take a basic example and simply add a
        "stride" variable and edit out the last 2 lines that set and push the values::
        
            import pandas as pd
            import pipulate as gs
            stride = 100
            key = '[Your GSheet key]'
            name = 'Sheet1'
            rows = (1, 10000)
            cols = ('a', 'b')
            sheet = gs.key(key)
            tab = sheet.worksheet(name)
            cl, df = gs.pipulate(tab, rows, cols)
            #df['B'] = 'foo'
            #gs.populate(tab, cl, df)
        
        ****************************************
        Step by stride
        ****************************************
        
        In the above example, we only added a "stride" variable and edited out the last
        2 lines that updates the sheet. Say the sheet were 10,000 rows long. Updating A
        LOT of data with one of these AJAX-y data-calls is never a good idea. The
        bigger the attempted update of a GSheet in one-pass, the more mysterious things
        are going on while you wait, and the likelihood of an entire update failing
        because of a single row failing goes up. The solution is to travel 10,000 rows
        by 100-row strides (or smaller) and we wanted it to take 1000 steps. We replace
        the last 2 lines with the following step-by-stride code::
        
            steps = rows[1] - rows[0] + 1
            for i in range(steps):
                row = i % stride
                if not row:
                    r1 = rows[0] + i
                    r2 = r1 + stride - 1
                    rtup = (r1, r2)
                    print('Cells %s to %s:' % rtup)
                    cl, df = gs.pipulate(tab, rtup, cols)
                    df['B'] = 'foo'
                    gs.populate(tab, cl, df)
        
        And that's pretty much it. All together, the code to process 10,000 rows by
        100-row long strides directly in Google Sheets for accomplishing almost
        anything you can write in a function to replace 'foo' with one of the fancier
        pandas API calls described above::
        
            import pandas as pd
            import pipulate as gs
            stride = 100
            key = '[Your GSheet key]'
            name = 'Sheet1'
            rows = (1, 10000)
            cols = ('a', 'b')
            sheet = gs.key(key)
            tab = sheet.worksheet(name)
            cl, df = gs.pipulate(tab, rows, cols)
            steps = rows[1] - rows[0] + 1
            for i in range(steps):
                row = i % stride
                if not row:
                    r1 = rows[0] + i
                    r2 = r1 + stride - 1
                    rtup = (r1, r2)
                    print('Cells %s to %s:' % rtup)
                    cl, df = gs.pipulate(tab, rtup, cols)
                    df['B'] = 'foo'
                    gs.populate(tab, cl, df)
        
        ########################################
        SQL Data Sources
        ########################################
        
        It's easiest to pipulate when you only have to apply one quick function to
        every line of a list because it takes advantage of the Pandas framework
        conventions; how the .apply() method works in particular. HOWEVER, if your
        per-row query is a slow and expensive SQL query INSIDE a pipulate function like
        this (the WRONG way)::
        
            def hits(row, **kwargs):
                import psycopg2
                import apis
                url = row[1]
                start = kwargs['start']
                end = kwargs['end']
                a = apis.constr
                atup = tuple(a[x] for x in a.keys())
                user, password, host, port, dbname = atup
                constr = "user='%s' password='%s' host='%s' port='%s' dbname='%s'" % atup
                conn = psycopg2.connect(constr)
                sql = """SELECT
                    url,
                    sum(hits) as hits
                FROM
                    table_name
                WHERE
                    url = '%s'
                    AND date >= '%s'
                    AND date <= '%s'
                GROUP BY
                    url
                """ % (url, start, end)
                df = pd.read_sql(sql, con=conn)
                return df['hits'].iloc[0]
        
        ****************************************
        Connecting to SQL from Pandas
        ****************************************
        
        We now want to move the SQL query OUTSIDE the function intended to be called
        from .apply(). Instead, you get all the records in one go and plop them onto
        your drive as a CSV file and hit THAT later in the function from .apply().
        Getting psycopg2 installed is usually easiest through Anaconda's conda repo
        system (not covered here). First we connect to SQL::
        
            a = apis.constr
            atup = tuple(a[x] for x in a.keys())
            user, password, host, port, dbname = atup
            constr = "user='%s' password='%s' host='%s' port='%s' dbname='%s'" % atup
            conn = psycopg2.connect(constr)
        
        ****************************************
        Building very long SQL WHERE's
        ****************************************
        
        Next, we're going to need to build a string fragment for use in the SQL query
        that calls out every single URL that we want to get data back on. One of the
        worst parts about SQL is "in list" manipulations. The only way to be sure is a
        pattern like this::
        
            WHERE
                url = 'example1'
                OR url = 'example2'
                OR url = 'example3'
                OR url = 'example4'
        
        ...and so on for as many URLs as you have to check. They're probably in your
        Google sheet already, so let's grab them into a list in a way that creates
        almost the exact above pattern (yay, Python!)::
        
            urls = df['A'].tolist()
            urls = "url = '%s'" % "' OR url = '".join(urls)
        
        The 2 lines above convert a Pandas DataFrame into a standard Python list and
        then into a fragment of a SQL statement. When people talk about being
        expressive AND brief in Python, this is what they mean. Being able to read and
        write statements like those above is a pure joy. You can look at the urls value
        in Jupyter Notebook to confirm it's good (if a bit wordy) valid SQL that will
        slip right into a query. Now, we unify the SQL fragment above with the rest of
        the SQL statement using the endlessly beautiful possibilities of the Python
        API::
        
            def sql_stmt(urls, start, end):
                return """SELECT
                    url,
                    sum(hits) as hits
                FROM
                    table_name
                WHERE
                    %s
                    AND date >= '%s'
                    AND date <= '%s'
                GROUP BY
                    url
                """ % (urls, start, end)
        
        ****************************************
        Using Pandas CSVs as SQL temp table
        ****************************************
        
        You can now use the above function that really only returns the not-executed
        multi-line text string which is used to populate a Pandas DataFrame and cache
        the results locally just in case you come back during a separate Jupyter
        Notebook session, you won't have to re-execute the query (unless you want the
        freshet data)::
        
            df_sql = pd.read_sql(sql_stmt(urls, start='2018-01-01', end='2018-01-31'), con=conn)
            df_sql.to_csv('df_sql.csv') #In case you need it later
            df_sql = pd.read_csv('df_sql.csv', index_col=0) #Optional / already in memory
        
        ****************************************
        Correlating URLs in Google Sheet and Pandas Dataframe
        ****************************************
        
        We will now use this data source which now contains the "result" list of URLs
        with the accompanying the number of hits each got in that time-window to create
        your own Pipulate data source (or service). The GROUP BY in the query and
        sum(hits) is aggregating all the hit counters into one entry per URL. The
        correlation here is similar to an Excel VLookup. We make a pipualte function
        for the DataFrame.apply() method to use THIS local data::
        
            def hits(row, **kwargs):
                url = row[1]
                df_obj = kwargs['df_obj']
                retval = 'Not found'
                try:
                    retval = df_obj.loc[df_obj['url'] == url]
                    retval = retval['hits'].iloc[0]
                except:
                    pass
                return retval
        
        ****************************************
        Using local Pandas DataFrame as "external" datasource
        ****************************************
        
        Now instead of hitting the remote, slow, expensive SQL database every time, we
        execute the SQL once at the beginning and can use the local data to pipulate::
        
            key = '[Your GSheet key]'
            name = 'Sheet1'
            rows = (1, 1000)
            cols = ('a', 'b')
            sheet = gs.key(key)
            tab = sheet.worksheet(name)
        
            cl, df = gs.pipulate(tab, rows, cols)
            df['B'] = df.apply(hits, axis=1, df_obj=df_sql)
            gs.populate(tab, cl, df)
        
        Or if it's over a huge list or is error-prone and will need rows entirely
        skipped because of bad data or whatever, we can step by stride by replacing the
        above 3 lines with::
        
            stride = 10
            steps = rows[1] - rows[0] + 1
            for i in range(steps):
                row = i % stride
                if not row:
                    r1 = rows[0] + i
                    r2 = r1 + stride - 1
                    rtup = (r1, r2)
                    print('Cells %s to %s:' % rtup)
                    cl, df = gs.pipulate(tab, rtup, cols)
                    try:
                        df['B'] = df.apply(hits, axis=1, df_obj=df_sql)
                        gs.populate(tab, cl, df)
                    except:
                        pass
        
        ########################################
        Developing Pipualte functions
        ########################################
        
        Because Pipulate functions are really just Python functions (generally being
        called through the Pandas DataFrame.apply() method), you can develop Pipulate
        functions just as you would any other Python function.
        
        The only unusual concern is how when you feed an entire "row" of a dataframe to
        a Python function, it takes the form of an arbitrary variable name (usually
        row) containing a numerically indexed list of values (the values from the row,
        of course). This only means that a wee bit of "mapping" need be done inside the
        function. So say you needed to apply an arbirary function to column C using the
        data from both columns A and B in this form::
        
            df['C'] = df.apply(arbitrary_function, axis=1)
        
        ...then you would need to write the arbitrary function like this::
        
            def arbitrary_function(row):
                value_from_A = row[0]
                value_from_B = row[1]
                # Do something here to
                # populate return_value.
                return return_value
        
        ...so when you're developing functions, the idea is to simulate a Pandas
        DataFrame row in default Python list syntax to feed into the function for
        testing... which is this easy::
        
            simulated_row = ['foo', 'bar']
        
        So in Jupyter Notebook actually feeding the simulated row to the arbitrary
        function for actually running and testing OUTSIDE the Pipulate framework looks
        like this::
        
            arbitrary_function(simulated_row)
        
        ...so developing functions for Pipulate is easy-peasy. Just design your
        functions to always just take in the first argument as a list whose values have
        meaning because of their fixed positions — which naturally represent the cell
        values from rows you'll be pulling in from a spreadsheet.
        
        By the way, namedtuples are the superior way of doing this when not bound by a
        pre-existing framework, but whatever. Pandas is worth it.
        
        ########################################
        Scraping Google Search Engine Result Pages (SERPs)
        ########################################
        
        Well, you knew it was coming. Let's scrape some SERPs. It's sooo easy. But I
        suggest you get yourself an anonymous proxy server or twenty. Put them in a
        file named proxies.txt, 1-per-line. If ports are used, include them after the
        IP like this::
        
            152.190.44.178:8080
            53.117.213.95
            250.227.39.116:8000
            20.15.5.222
        
        Now load the file called get_search_results.ipynb. If you cloned the github
        repo and are working in Jupyuter Notebook, you can work directly in your cloned
        pipulate folder. I would suggest making a copy of files such as
        get_search_results.ipynb and keep the originals around as a sort of template.
        
        Anonymous web proxies go bad fast, so before you start a session, you should do
        a one-time refreshing of your proxy servers. Do that by running this block of
        code with update_proxies set to True. It will create a file in your repo folder
        called goodproxies.txt::
        
            update_proxies = True
            if update_proxies:
                import pipulate.update_proxies as up
                    up.Main()
        
        After you have a good new list of proxies
        
        ########################################
        Scheduling
        ########################################
        
        Everything so far has been in Jupyter Notebook, and that's great for ad hoc
        work, but when it comes to "promoting" a good report to daily use, you need
        scheduling. And that's never pleasant, because you need a machine running
        somewhere with as much reliability as you can get paying as little as possible.
        That's just sort of a life lesson there. No matter how powerful you feel in
        Jupyter Notebook, you're not all that if you can't automated. The answer?
        
        ****************************************
        Get yourself a Linux in the Cloud
        ****************************************
        
        Cloud... EC2 or whatever. Pick your poison. Whatever it is, being server-like
        (as it should be), you're going to need to get into it... and for that you're
        likely to receive a key from Amazon or your devops Dept. Figure out how to
        login to that machine. It should be TOTALLY YOURS. This is your EC2 instance.
        There are others like it, but this one is yours. Learn how to get in and out of
        it fast, from almost anywhere. You can do this on a Raspberry Pi too if you
        don't even want Amazon and a key in the picture.
        
        ****************************************
        Logging into your server
        ****************************************
        
        Once you figure out the ssh command to log in to your server, and do it
        manually a few times. This follows the model of putting the key file in a
        usually hidden directory on your system called .ssh which is usually in your
        home directory::
        
            ssh -i ~/.ssh/id_rsa_yourname ubuntu@55.25.123.156
        
        ****************************************
        Making it a pleasure to log into your server
        ****************************************
        
        Once this works for you, create a text file and name it something like go.sh
        and put it in your sbin. What's an sbin? It's a place you put little text-files
        that work a lot like commands, but you write them. They're really useful. This
        is your first Linux lesson from the Pipulate project. Linux (and Unix) won; get
        used to it. It'll be your next stop after Jupyter Notebook. Scheduling
        something you set-up in Jupyter Notebook is your natural "bridge" project. So
        by this point, you struggled through that ssh command; congratulations.
        Everything else is easy. Find your sbin by looking at your path::
        
            echo $PATH
        
        Find your sbin in that gobbledygookdthen, then put something that looks like
        this text (your info) in a file called go.sh (or whatever) there. Do the chmod
        +x trick to make it executable, and then whenever you need to reach your
        server, just type go. It's really nice to open a shell and to be in your
        scheduling-environment just like that. We want to do everything immediately
        reasonable to make the text-based Linux shell environment as totally cool as
        Jupyter Notebook is::
        
            #!/bin/bash
        
            ssh -i ~/.ssh/id_rsa_yourname ubuntu@55.25.123.156
        
        ****************************************
        A scheduler with a better API is a better scheduler
        ****************************************
        
        We are not using crontab as our next step to achieve scheduling as some
        googling about how to do this on a stock Linux server may indicate. We DON'T
        like APIs where you have to drive nails through your head here at Pipulate. No,
        we side with RedHat and others on the matter of default Linux system service
        management and encourage you to use systemd. It's not the principles of the
        least moving parts but rather the principle of not having to learn advanced
        BASH script that's at play here. Thankfully, crontab's replacement systemd is
        considered a highly supported mainstream alternative.
        https://en.wikipedia.org/wiki/Systemd
        
        ****************************************
        /etc/systemd/service/servicename.service
        ****************************************
        
        You need a file in /etc/systemd/system which is the name of your service dot
        service, like mysched.service. To create it, you may have to sudo vim or
        whatever command because its a protected system location. The contents of your
        file to kick-off Pipulate (or any other) Python scheduling job like this::
        
            [Unit]
        
            Description=Run Python script to handle scheduling
        
            [Service]
            Type=forking
            Restart=always
            RestartSec=5
            User=ubuntu
            Group=ubuntu
            WorkingDirectory=/home/ubuntu/mysched/
            ExecStart=/usr/bin/screen -dmS mysched /home/ubuntu/py35/bin/python /home/ubuntu/mysched/mysched.py
            StandardOutput=syslog
            StandardError=syslog
        
            [Install]
            WantedBy=multi-user.target
        
        You you've just dropped this file in location, but now it needs to be enabled.
        This is a one-time thing (unless you want it off for debugging or whatever)::
        
            sudo systemctl enable zdsched.service
        
        Once you start playing around with the invisible background system services
        (named daemons in Linux), the temptation is to keep rebooting your server to
        make sure your changes "took" (similar to Apache/IIS webserver issues).
        Whenever you're unsure and want to avoid a reboot, you can type::
        
            sudo systemctl daemon-reload
        
        If you want to just restart YOUR scheduling service and not all daemons, you
        can optionally do::
        
            sudo systemctl restart mysched.service
        
        Who wants to type a longer command when you can type a shorter command? Since
        we're in a location where we're typically cd' into, we don't need to do that
        sbin trick we did on your local machine. In fact, I included r.sh in the repo,
        so just cd into the repo directory and make sure the service names I'm using
        match the ones you're using, and type::
        
            r[Enter]
        
        ...and it should reboot the service keeping mysched.py running. For your
        curiosity, this is what it's doing::
        
            #!/bin/bash
            # This belongs in your sbin
        
            sudo systemctl daemon-reload
            sudo systemctl restart mysched.service
        
        ****************************************
        Shoving the work around to where it belongs
        ****************************************
        
        This r.sh file comes into play again later, because in order to ensure the
        health of your scheduling server, we're going to give it a "clean slate" every
        morning by rebooting it, and we're going to schedule the running of this BASH
        script FROM PYTHON to do it. This is an example of doing each thing in the
        place where it best belongs. Reboot from a bash script, respwan from systemd,
        and actually SCHEDULE from within a single master Python script.
        
        After such a reboot (and on any boot, really), we hand all scheduling
        responsibility immediately over to Python (even though systemd could do more)
        because as much better as it is over crontab, Python APIs are better still. We
        actually are only using systemd as a pedantic task respawner. Think of it as
        someone watching for your python-script to exit that can 100% reliably re-start
        it. That's systemd in our scenario. After mysched.py is running, control is
        immediately handed over to the 3rd party "Schedule" package from PyPI/Github
        because it's API is better than the default sched module built-into Python.
        Such things on my mind are:
        
        - Period vs. Exact scheduling (every-x minutes vs every-day at y-o'clock)
        - Concurrency when I need it and crystal clarity when I don't
        - Minimal new "framework" language. If it feels like Django, turn and run.
        - Optional ability to "lock" long-running jobs. General collision handling.
        - Calls for little-enough code that if I make a mistake, I can easily recode.
        - Crystal clear clarity of what's going on, no matter where I may be.
        
        For now, "pip install schedule" seems to do the job.
        
        ****************************************
        Scheduling issues are more complex than you think.
        ****************************************
        
        When restarting a scheduling-script, you need to know that when it springs back
        to life it may be in the middle or even towards the end of the daily time-cycle
        you're probably used to thinking in, so "today's" reports may never get a
        chance to run. You need to accommodate for this. You also need to be very
        realistic about how many reports you're going to be able to run on a given
        server on a given day. It can be like playing a giant game of Tetris, so it
        would be nice to have concurrency.
        
        Concurrency, you say? Are you suuuuure? There might be order-dependencies and
        race-conditions in your script runs that you haven't thought about. I find that
        it's always a good idea to avoid concurrency and to keep it simple (much good
        karma) if the situation doesn't really call for concurrency. As hardware and
        hosting gets cheaper, you can always slam out more EC2 instances and put less
        work per server. Everything scales if you just size your work to fit one unit
        of generic Linux server.
        
        Staying conservative with your estimates and modest with your promises is
        always a good idea, specially given how flaky all those APIs you're pulling
        from could be, you ought to size out the job, then half it. Maybe even quarter
        it. You won't be sorry. All that extra capacity in the server could be used for
        temp tables or other unexpected resource hogs you'll run into that you don't
        see today.
        
        ****************************************
        In the beginning...
        ****************************************
        
        The idea here with Pipulate is to make a very generic and almost organic (with
        a heart-beat) place to start plugging your scheduled extractions from Jupyter
        Notebook into, with the least muss and fuss... but also, the most power.
        Pipulate only exists to make GSheets easier; a 3rd party package from
        Github/PyPI which itself only exists to make gdata easier; a cryptic but
        GOOGLE-PROVIDED API to Google Sheets. GData is most definitely NOT made for
        humans. GSpread is made for those slightly more human. And for those entirely
        human, there's Pipulate. In fact, there's something other than GSpread that I
        just discovered which may have either tremendous impact or no impact at all
        here at Pipulate. I'll let you know, but go take a look in either case. Real
        kindred spirits over at https://github.com/nithinmurali/pygsheets
        
        Oh yeah, so in the beginning::
        
            #Do whatever virtualenv stuff you do here
            pip install schedule
            pip install logzero
            pip install pyfiglet
            pip install colorama
            cd ~/
            mkdir mysched
            cd mysched
            vim mysched.py
        
        ****************************************
        And then there was ASCII Art
        ****************************************
        
        You can create your mysched.py however you like. I use vim, and it's spiritual
        and life-changing. It also solves how to be really productive on pretty much
        any machine you sit down at when doing tasks like this. Anyway, I just added
        that file to the github repo, but for ease-of-use, I'll show the development of
        our scheduling script here. First::
        
            from pyfiglet import figlet_format
            from colorama import Fore
            from logzero import logger, setup_logger
        
            font = 'standard'
            subfont = 'cybermedium'
            green = Fore.GREEN
            white = Fore.WHITE
            blue = Fore.BLUE
        
            ascii_art1 = figlet_format('Congratulations!', font=font)
            ascii_art2 = figlet_format('Welcome to Wonderland.!', font=subfont)
            print('%s%s%s' % (green, ascii_art1, white))
            print('%s%s%s' % (blue, ascii_art2, white))
        
            logger = setup_logger(logfile='mysched.log', maxBytes=1000000, backupCount=3)
            logger.info('This is some logger info.')
        
        This should give you a good starting point for scheduling... none! By stripping
        out everything that actually does scheduling, you can see how flashy
        color-coded ASCII art can color your day, your view, and your perception of
        time, rightness, and generally keep you on-track. Don't down-play the titles.
        It should only ever be visible when you're restarting the script a lot for
        testing, or at around 1:00 AM, or whenever your daily reboot occurs.
        
        Keep in mind that later-on, we're going to "seize" the command-line output
        stream (your view of the log-file) from anywhere you have a terminal program
        and ssh program. That could very well be (and often is in my case) your mobile
        phone. It's easier than you think; you don't even have to look at the actual
        log files; it just sort of streams down the screen like the Matrix. That's the
        effect I was going for (thank you LogZero).
        
        You can ALSO see the log-file output that is also being written into
        mysched.log which you can look at if say the script stopped running and the
        real-time output went away::
        
            [I 180222 19:36:56 mysched:20] This is some logger info.
            [I 180222 19:47:07 mysched:20] This is some logger info.
        
        ****************************************
        The first scheduled event
        ****************************************
        
        Going from this "blank" scheduling file to the next step really highlights a
        lot of the default power of the scheduling module.::
        
            import schedule as sched
            from pyfiglet import figlet_format
            from colorama import Fore
            from logzero import logger, setup_logger
            from datetime import date, datetime, timedelta
            import time
        
            UTCRebootTime = '06:00' # Generally, 1-AM for me
            beat_count = 0
            font = 'standard'
            subfont = 'cybermedium'
            green = Fore.GREEN
            white = Fore.WHITE
            blue = Fore.BLUE
        
            ascii_art1 = figlet_format('Congratulations!', font=font)
            ascii_art2 = figlet_format('Welcome to Wonderland.!', font=subfont)
            print('%s%s%s' % (green, ascii_art1, white))
            print('%s%s%s' % (blue, ascii_art2, white))
        
            the_time = str(datetime.now().time())[:5]
            logger = setup_logger(logfile='mysched.log', maxBytes=1000000, backupCount=3)
            logger.info("We're not in Jupyter Notebook anymore. The time is %s." % the_time)
        
        
            def main():
                sched.every(10).minutes.do(heartbeat)
                next_min = minute_adder(1).strftime('%H:%M')
                logger.info("When the clock strikes %s, down the rabbit hole with you!" % next_min)
                sched.every().day.at(next_min).do(the_queue)
                sched.every().day.at(UTCRebootTime).do(reboot)
                while True:
                    sched.run_pending()
                    time.sleep(1)
        
        
            def heartbeat():
                global beat_count
                beat_count += 1
                logger.info("Heartbeat %s at %s" % (beat_count, datetime.now()))
        
        
            def the_queue():
                logger.info("This is a scheduled event. Jump! Down the rabbit hole...")
        
        
            def reboot():
                logger.info("Rebooting system.")
                import subprocess
                p = subprocess.Popen(['sh', 'r.sh'], cwd='/home/ubuntu/pipulate/')
        
        
            def minute_adder(minutes):
                the_time = datetime.now().time()
                today = date.today()
                beat = timedelta(minutes=minutes)
                return_value = datetime.combine(today, the_time) + beat
                return return_value
        
        
            if __name__ == '__main__':
                main()
        
        You can, and I encourage you to run this directly with the standard Python
        command-line way of running it. If you haven't been doing it already, cd into
        that directory and run::
        
            python mysched.py
        
        ****************************************
        GNU screen, your portal to your scheduled world.
        ****************************************
        
        Webmasters are dead. Long live the Datamaster! This is kind of like the LAMP
        stack, but for scheduling in the modern world with (what I consider) the least
        moving parts pushing around the "responsibilities". SQL-ish stuff goes to
        Pandas, logic stuff to Python, Task-respawning to Linux, Data-UI to Google
        Sheets... but what about WATCHING your scripts being run? What about getting
        that at-one Zen feeling with all those invisible plates you have spinning?
        
        If you put the service file in location at /etc/systemd/service/[mysched.py]
        and reboot already, then this script is running in the background right now.
        Wanna see it? It may be up to a bunch of those 10-minute heartbeats already.
        Type::
        
            screen -d -r mysched
        
        That's the gnu screen program. It's a lot like tmux, but if you don't know what
        that is either, then it's a terminal server just like remote desktop software
        like RDP or VNC, but instead of being for Graphical Desktops like Windows or
        Mac, it's just for those type-in command-line terminals. It's a lot to type and
        remember, so drop this into your /usr/local/sbin, or maybe your ~/ home folder
        if sbin gives you trouble. I call it "ing.py" so that all together to see
        what's going on (starting from a terminal on Mac, Windows, whatever), I type::
        
            go[Enter]
            do[Enter]
        
        ****************************************
        Get used to using Ctrl+A (let go) then d to detach
        ****************************************
        
        And then if I want to just immediately exit out, I type::
        
            Ctrl+A, D [Enter]
        
        If you want to activate the "do.sh" just make this file by that name, chmod +x
        it and drop it in your sbin or home::
        
            #!/bin/bash
            # Put this in your sbin or ~/ to be useful.
        
            screen -d -r mysched
        
        This means that a secret invisible command-line task starts whenever the
        machine reboots that you can "connect to" with that command. The -d parameter
        means force it to detach from whatever other device it's showing on (the
        "seizing" the display I mentioned earlier) and the -r parameter means
        reconnect. Together, you can pretty much pull up your scheduling output
        anywhere anytime.
        
        But once you do, you are inside a terminal window session created by gnu screen
        and NOT by the original login-session you had. The -d parameter means force it
        to detach from whatever other device it's showing on (the "seizing" the
        display I mentioned earlier) and the -r parameter means reconnect. This gives
        you quite a bit of power to just scroll up and down the log-output (without
        having to load a single file) using GNU screen's buffer-scroll::
        
            do[Enter] (to seize screen)
            Ctrl+A [Esc] (to switch to "Copy mode" with a scroll-back history)
        
        Once you're in Copy Mode, you can use Page Up & Page Down. You can also use
        Ctrl+B for back and Ctrl+F for forward. When you're done, hit the [Esc] key
        again. When you want to release the screen session, it's still Ctrl+A, D
        [Enter] to detach.
        
        And finally, it can feel a little "out of control" to have a script running
        insistently in the background with no way to stop.
        
        The Unix/Linux-style type-in "terminal" interface that ships with Macs and can
        be installed with Windows using CygWin or their new Windows 10 BASH shell is
        your new portal into Wonderland. Jupyter Notebook gave you a taste of the power
        of Python, but you're not really realizing it until you're running reports
        during all that other delicious time when you're NOT sitting in front of a
        browser hitting a button and waiting for something to finish on your local
        machine.
        
        ****************************************
        Making timestamps and caching pervasive
        ****************************************
        
        You can't store everything locally, so don't try. You will run out of space,
        and there's nothing worse than having to do file-maintenance on Cloud hardware
        that's supposed to be sparing you from that nonsense. But neither can you write
        anything that's going to fill your hard drive up forever with past data.
        Hardware is hardware and resources are actually finite— or rather, they're as
        finite as you're willing to pay for. So if we want to store the data long-term,
        it's got to be off-server, probably using a service such as Amazon S3. Using a
        "data bucket" NoSQL hash-table (call it what you will) is a good idea in
        situations like this because "deconstructing" everything into rows & columns
        for SQL-like RDBMS storage isn't worth it, and although field-stuffing into a
        Text or XML field in an RDBMS would work, it feels a lot like shoving a round
        peg into a square hole— why do it if a round hole is sitting right there?  This
        is more a place-holder for me to incorporate probably a decorator-based cache
        system that is back-ended by Amazon S3. That will solve a lot of ongoing server
        maintenance issues.
        
        ########################################
        First Scheduled Script
        ########################################
        
        Okay, so the above sets out the framework for scheduling. We have:
        
        - A daily reboot
        - An every-10-minutes heartbeat
        - Something beginning 1-minute after script runs
        
        ****************************************
        A way to schedule external Python scripts
        ****************************************
        
        So the idea now is to build-out from that 3rd point. We just just start putting
        references to different external Python filename.py's there, and they'll just
        run. But there's one more trick. I'm adding this function to mysched.py along
        with importing the importlib library to do the trick::
        
            def do_main(name):
                mod = importlib.import_module(name)
                mod.main()
        
        This way, if you follow the Python if __name__ == '__main__' convention, you
        can use this to invoke the method (function) named "main" with it's standard
        (non-parameterized) call by just putting it in the same folder as mysched.py
        and referring to the file by name from within mysched.py like this::
        
            do_main(filename)
        
        ...and that's all you need to do to schedule something that uses that
        convention. Conversely, if the code in the external file is of the directly
        copy-pasted from Jupyter Notebook variety which is likely to NOT use a function
        called main (or even functions at all), then you can use the alternative
        version that just does it::
        
            def do_it(name):
                mod = importlib.import_module(name)
        
        ...invoked with the very straight forward::
        
            do_it(filename)
        
        ****************************************
        Making the file
        ****************************************
        
        You will be using .py files and not the .ipynb files of Jupyter Notebook for
        scheduling. There are various ways to go about it, but I suggest just
        copy-pasting your separate text-blocks from JN over to your text-editor or
        whatever, and just re-build your script up from parts over server-side. The
        reason for this is that it makes you think through your work again. The way you
        work in Jupyter Notebook is going to be very different from the way you work on
        a Linux scheduling system. Your considerations are about 100x more complex, and
        so now is the time to start thinking about them. So make track.py in the same
        repo directory::
        
            vim hello_world.py
        
        Don't worry. I already put it in the repo for you. It is basically just a
        template for the ASCII art that I like to do. I think I've gone overboard with
        colors. I usually only use green for OK with other colors sprinkled in
        sparingly to capture my attention (warnings & stuff). Thanking you for taking
        the red pill gets an exception. But now that we've proven scheduling an
        external script, it's time to add a SECOND external script and-get serious
        about SEO::
        
            vim track.py
        
        ****************************************
        There's no place like ~/pipulate
        ****************************************
        
        The hardest part for me in trying to grab the reins and gaining control of all
        the required parts of datamastering, is always staying centered. You always
        have to know where things are relative to either root "/" or home "~/". Even
        that is a Unix geek joke about symbolic links. Anyhoo, working directories
        usually tend to end up relative to ~/ 'cause you don't edit much in root. And
        that folder-name is likely going to be whatever your git repo. And if you're
        cloning from me with::
        
            git clone git@github.com:miklevin/pipulate
        
        ...then you have a directory probably something like this::
        
            ~/pipulate
        
        ...which you're always going to want to be in, especially if the machine you're
        working on is a cloud instance set up somewhere specifically for you
        specifically for this purpose. Then, you can get rid of a lot of typing by
        creating a file like this (given you're on Ubuntu / adjust to OS) as your
        .bash_profile. These are invisible configuration files (because of the dot at
        the beginning of their filename). .bash_profile is executed whenever your
        username (in this case, the default EC2 Ubuntu's "ubuntu" user::
        
            source /home/ubuntu/py35/bin/activate /home/ubuntu/py35
            PATH="$HOME/:$PATH"
            clear
            python ~/hi.py
            cd /home/ubuntu/pipulate
        
        This answers such pressing questions as:
        
        - How do I always make sure I'm in the exact same Python virtualenv as the one
          I use for scheduling?
        - How can I add little bash helper scripts to my repo and have them in my path
          so I can easily use them?
        - How can I give myself a snazzy ASCII art login messing using Python and the
          Figlet library?
        - How can I avoid typing cd pipulate every time I log into my scheduling
          server?
        
        So now, whenever from your host machine you type::
        
            go
        
        ...you wil be logged automatically onto your cloud server and greeted with a
        warm welcome that will impress your friends. What does hi.py look like you
        ask?::
        
            from pyfiglet import figlet_format
            from colorama import Fore
        
            def out(print_me, color=Fore.GREEN, font='standard'):
                ascii_art = figlet_format(print_me, font=font)
                print('%s%s%s' % (color, ascii_art, Fore.WHITE))
        
            out('Welcome.')
            out("Get pipulating!", font='cybermedium', color=Fore.WHITE)
        
        ########################################
        Getting over the hump
        ########################################
        
        When getting started with any new investigation into data residing somewhere
        that you're supposed to fetch, transform, and extract some insights from begins
        with "where does the data come from?" and "Where do I put it?" questions. In
        Google Sheets, you don't really have the option of warehousing all the raw data
        pulls forever, unless you want your spreadsheets to get slower and slower, then
        go corrupt. Instead, you have to either use services like Amazon S3, a large
        local hard drive, or simply not warehouse as much data as you'd like.
        
        ****************************************
        The Google Spreadsheet 100,000 cell limit
        ****************************************
        
        Fortunately, you can do a lot in just the limited data-space of a single
        spreadsheet, which supports (at last check) about 100,000 cells in any
        Spreadsheet document before reaching a quota (and performance) limit. The 100K
        cells can be broken up any way over rows & columns.
        
        ****************************************
        Trim the fat
        ****************************************
        
        Because every spreadsheet gives you an automatic 1000 rows and 26 columns, it's
        easy to make the system use up 26,000 of those cells right away. This doesn't
        really happen in practice, but there is ambiguity so as a matter of style and
        preference, I like to delete all the rows and columns I'm not using. As in the
        field of SEO, we just don't want to send mixed signals to Google.
        
        To be continued...
        
        ########################################
        To git clone or pip install?
        ########################################
        
        If you "git clone", then you will get all the standard .ipynb example files in
        a nice accessible place in ~/ (home) which you'll be able to reach with Jupyter
        Notebook, but you can't make that your own home directory for a github repo
        because you can't (easily) mix repos in the same folder. Yet, you're going to
        want to immediately turn this location into your own repo. The answer is to::
        
            git clone git@github.com:miklevin/pipulate
        
        ...in one location in order to have access to the full repo and maybe
        participate in Pipulate development. In another location, you make the folder
        destined to become your own repo after you get my files, then delete out the
        invisible .git folder inside and do a new "git init". So get started making the
        myschedule folder::
        
            git clone git@github.com:miklevin/pipulate mysched
            cd mysched
            rm -rf .git
            rm -rf pipulate
            git init
        
        From here, it's up to you how much you want to purge out from the original repo
        files before you make you start adding files to your own git repo, but the
        heart of the thing, the file that the Linux systemd daemon will point to in
        your /etc/systemd/system/mysched.service file. If you pip installed Pipulate on
        your system, none of the other files are really necessary. You may also wish to
        do this clone-and-purge process on your local machine for the sake of keeping
        your Jupyter Notebook work in a git repo as well. Remember the cloud server
        you're working on is where you "promote" ad hoc Jupyter Notebook work to live
        in a more persistent scheduled form, so you want to have very similar
        environments on both. It's like forking, but none of this is really important
        enough to fork. It's just your own scheduling instance of Pipulate.
        
        ########################################
        I prefer to pip install... plus nbviewer.
        ########################################
        
        If you're a purist about it and want to build it up from scratch for a fuller
        understanding (than clone-and-purge can provide), then just make a new folder
        directly from Jupyter Notebook and copy any raw source files you may need here
        from github, or go through nbviewer for even easier saving of Jupyter
        Notebook-ready .ipynb files. Just paste any templates you're interested in
        from::
        
            https://github.com/miklevin/Pipulate
        
        into::
        
            https://nbviewer.jupyter.org/
        
        ...and save the resulting .ipynb file to where your local copy of Jupyter
        Notebook can reach it.
        
        ########################################
        Reminders & To-Do's
        ########################################
        
        Reminder to self: add logic to system to always address columns by Excel-style
        letter-index::
        
            for i, col in enumerate(cols):
                letter = gs.cc(i+1)
                eval(tab)[col] = letter
                eval('%s2' % tab)[col] = '%s.%s' % (tab,letter)
        
        Reminder to self: Pipulate Wisdom.
        
        Much of the complexity is in the data-transform. Separate as much of the "raw
        data" work as you can from the derivative output-data formats. Those can be
        re-generated different ways with better and better insights revealed and
        interactivity. But you may never be able to re-get that raw data. So focus!
        
        ########################################
        Experimential: The Zen of Pipulate
        ########################################
        
        - Everything you need has already been done many, many times.
        - However, this does not spare you from SOME of the work.
        - Processes resist automation until sufficient TLC is applied.
        - And even then, it's all going to fall apart for 10 unexpected reasons.
        - Your job then is to build with that sort of resilience in mind
        - Even if it means scrapping-and-rebuilding quickly.
        
        ****************************************
        Ideas for Tenants of the Pipulate System
        ****************************************
        
        - Got a dataframe that was tough to get? Maybe it should be a .csv file.
        - Got a .csv file on your drive blocking you? Was it generated today?
        - It's better to pip install pipulate than git clone pipulate.
        - Directories you create to pipulate out of shouldn't all be repos.
        - Little secret-sauce pipulations belong in private Github gists.
        
Keywords: seo,google sheets,data,datamaster,data master,google,database,sheets,pandas,gspread,google data,google spreadsheets,google sheet,google spreadsheet,spreadsheets,spreadsheet,databases,datascience,data science
Platform: UNKNOWN
