Metadata-Version: 2.1
Name: instruct_goose
Version: 0.0.1
Summary: Implementation of Reinforcement Learning from Human Feedback (RLHF)
Home-page: https://github.com/xrsrke/instructGOOSE
Author: xrsrke
Author-email: xariusdrake@hotmail.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: dev
License-File: LICENSE

InstructGoose
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

Paper: InstructGPT - [Training language models to follow instructions
with human feedback](https://arxiv.org/abs/2203.02155)

### Questions

- In the context of RLHF, how is the value-function loss $L_t^{VF}(\theta)$ calculated?
  - Is it the loss of the function the PPO agent uses to predict how much
    reward it will get if it generates the sequence?
- ~~Do the RL model and the SFT model use the same tokenizer? Yes~~
- ~~I don’t know how to return the logits of the generation model~~
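On the first question: in the PPO paper the value-function loss is the squared error $L_t^{VF}(\theta) = (V_\theta(s_t) - V_t^{\text{targ}})^2$ between the value head's prediction and a return target. A minimal sketch, assuming PyTorch tensors (the helper name `value_fn_loss` and the clipped variant are illustrative, not part of instruct_goose):

``` python
import torch

def value_fn_loss(values, returns, old_values=None, clip_range=0.2):
    """Squared-error value loss L_t^VF(theta) = (V_theta(s_t) - V_t^targ)^2.

    values:  V_theta(s_t) predicted by the value head
    returns: the targets V_t^targ (e.g. discounted returns or GAE targets)
    If old_values is given, also apply the clipped variant used by many
    PPO implementations to limit how far the value estimate can move.
    """
    loss = (values - returns) ** 2
    if old_values is not None:
        clipped = old_values + torch.clamp(values - old_values,
                                           -clip_range, clip_range)
        loss = torch.max(loss, (clipped - returns) ** 2)
    # 0.5 factor and mean over timesteps, as is conventional
    return 0.5 * loss.mean()
```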

## Install

``` sh
pip install instruct_goose
```

### Resources

I used these resources while implementing this library:

- Copied the
  [`load_yaml`](https://xrsrke.github.io/instructGOOSE/utils.html#load_yaml)
  function from https://github.com/Dahoas/reward-modeling
- Learned how to build a dataset to train a reward model:
  https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2
- Learned how to add a value head to the PPO agent:
  https://github.com/lvwerra/trl
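On the last point: in trl, the value head is essentially a linear layer projecting the language model's hidden states to one scalar per token. A minimal sketch of that idea, assuming PyTorch (the `ValueHead` class here is illustrative; trl's `AutoModelForCausalLMWithValueHead` wraps a full transformers model):

``` python
import torch
from torch import nn

class ValueHead(nn.Module):
    """Scalar value head on top of a causal LM's hidden states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # project each token's hidden state to a single value estimate
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len)
        return self.summary(hidden_states).squeeze(-1)

head = ValueHead(hidden_size=768)
h = torch.randn(2, 5, 768)       # stand-in for LM hidden states
values = head(h)                  # one value estimate per token
```

The per-token values are what the PPO update compares against the return targets in the value-function loss above.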
