Thoughts on axial encodings vs bloom embeddings

# Cat pros

-smaller emb matrices

# Cat cons

-different items will share large blocks of emb vectors. Maybe this works if we have some pre-defined knowledge of what words should be indexed close together. Sorting by frequency does this somewhat, but maybe would work better to first cluster by w2vec, sort so similar words are close together. Maybe this is why this is good for positional encodings: adjacent time steps share some (but not all) of embs

# Add pros

-words won't necessarily end up sharing embs w/ other words even if most of their rows are the same - the 1 unshared row could learn to be very different

# Add cons

-larger emb matrices  
-each row serves multiple purposes - maybe "pulled in multiple directions" by diff words

---
NLP augmentation pipeline and transform thoughts:

-huggingface pipelines can accept either a single string or an iterable container
    -> therefore, my paraphrasePipeline should do the same
-should transforms work on strings, list-likes, or either?
    Use cases:
        1. generate csv - in this case, we'll need to process a large number of items. This supports accepting list-likes. However, this will all happen under the hood so we could easily make __call__ process a single string and use the pipeline in a way that handles more.
        2. Torch dataset on-the-fly transform - process single str
        3. general function - unsure.
    Either way, we probably should have a function (whether it's __call__ or something internal) that processes a single str at a time. The only reason not to do this would be if I realize something about the way we batch strings makes this undesirable.

---
Problem:

1. Sometime during tonight's FillMask edits, __call__ now only returns n items for each round of masking. Before, this number grew quickly: when inputting 1 string, the first round would produce (by default) 5, the second would product 25, the third 125,... (5**n). I'm not sure the new version is bad - I think it may be faster - but I'm confused why this has changed.

11/14/20
--------
zanderbush/Paraphrase: 
    size: 486 MB
    arch: AutoModelWithLMHead, AutoTokenizer
    notes: Tried generating text but it either output something completely unrelated or "{input} -> {input}". Must be doing something wrong but there's no documentation. There's also a v2 and v3 version but I it looks like they'd have similar issues.

mrm8488/t5-small-finetuned-quora-for-paraphrasing
    size: 419 MB
    arch: AutoModelForSeq2SeqLM or AutoModelWithLMHead, AutoTokenizer
    notes: Getting either "{input} {input}" or "?? ?? ??". Tried AutoModelWithLMHead as well. Later found I need to prepend with 'paraphrase: ' but when I did that the output was identical to the input. UPDATE: got this working (need to pass in one string at a time) but it seems a bit question-biased, probably due to its training on the Quora sentence pair corpus.

aakashD/t5_paraphrase
    size: 850 MB
    arch: AutoModelForSeq2SeqLM (seems like it should be T5ForConditionalGeneration though?), AutoTokenizer
    notes: Trouble loading tokenizer. Thought I tried this before?

ceshine/t5-paraphrase-paws-msrp-opinosis
    size: 850 MB
    arch: AutoModelForSeq2SeqLM, AutoTokenizer
    notes: Seems to work better if we preprend with 'paraphrase: '. No docs so not sure if this is required. Output seems quite repetitive (with the input and/or with other outputs) but maybe that's due to using generate_kwargs from a different example.

Vamsi/T5_Paraphrase_Paws
    size: 850 MB
    arch: AutoModelForSeq2SeqLM, AutoTokenizer
    notes: I think this may be same as prev model but this has docs. Realized this also uses "paraphrase: " prefix but there's a "</s>" suffix, which I probably needed for earlier models too. Tried using recommended kwargs and it's still very repetitive.

Sidenote: I realized identical size prob just means they used the same architecture. I.e. they took a standard huggingface model but then tuned it on their own data. So these could still be huge differences in quality between different models.

Conclusions: All of these seem noticeably worse than the pegasus model. That's bigger/slower but on GPU it's not too bad and it's the only one that seems to consistently avoid duplicates or outputting the input. I think let's just stick with this for now. If I really want to use a different model I still can by passing them in as a pipeline. Maybe better models will emerge.

It kind of seems like pipeline ignores kwargs (temperature seemingly has no effect) but I'm not sure. num_return_sequences does work so maybe the impact of others is just low.


11/24/20
--------
Summary of some interesting math-y huggingface models I played around with in colab. The samples I tried did not work that well though, except for a few basic examples with the calculus differentiation model.

    https://huggingface.co/mrm8488/t5-base-finetuned-math-linear-algebra-1d (like calculus model, append ' </s>'. Enter inputs like 'solve 44 = 2*x + 6'. Didn't seem to work very well though)
    https://huggingface.co/mrm8488/t5-base-finetuned-math-linear-algebra-2d
    https://huggingface.co/mrm8488/t5-base-finetuned-math-qa-test
    https://huggingface.co/mrm8488/t5-base-finetuned-math-seq-next-term
    https://huggingface.co/mrm8488/t5-base-finetuned-math-calculus-differentiate (append ' </s>' to raw strings like 'differentiate 3*x**2 - 4*x +1'. Works pretty well for simple examples but doesn't seem capable of things like sin/cos/log/exp. Also struggled with some more complex examples of the standard case.)

    https://huggingface.co/mrm8488/t5-base-finetuned-qasc (combine 2 facts into a question)
    https://twitter.com/mrm8488/status/1295680004039344135 (twitter vid showing how to use one of the math models)

5/6/21
------
X -ideas: embeddings from_w2vec() factory, embeddings __getitem__ multipledispatch for slice input (allow emb[:5])
