It sure seems like there are a lot of text-generation chatbots out there, but it's hard to find a python package or model that is easy to tune around a simple text file of message data. This repo is a simple attempt to help solve that problem.
# ai-msgbot

`ai-msgbot` covers the practical use case of building a chatbot that sounds like you (or some dataset/persona you choose) by training a text-generation model to generate conversation in a consistent structure. This structure is then leveraged to deploy a "free-form" chatbot that consistently replies like a human.
There are three primary components to this project:

- parsing a dataset of conversation-like data into a standard format
- training a text-generation model on that data (this repo is designed around the Google Colab environment for model training)
- deploying the model to a chatbot interface for users to interact with, either locally or on a cloud service
It relies on the `aitextgen` and `python-telegram-bot` libraries. Examples of how to train larger models with DeepSpeed are in the `notebooks/colab-huggingface-API` directory.
```sh
python ai_single_response.py -p "greetings sir! what is up?"

... generating...

finished!
('hello, i am interested in the history of vegetarianism. i do not like the '
 'idea of eating meat out of respect for sentient life')
```
Some of the trained models can be interacted with through the Hugging Face Spaces and model inference APIs on the ETHZ Analytics Organization page on huggingface.co.
## Table of Contents
- Quick outline of repo
- Quickstart
- Repo Overview and Usage
- WIP: Tasks & Ideas
- Extras, Asides, and Examples
- Citations
## Quick outline of repo

- training and message EDA notebooks are in `notebooks/`
- Python scripts for parsing message data into a standard format for training GPT are in `parsing-messages/`
- example data (from the Daily Dialogues dataset) is in `conversation-data/`
- usage of default models is available via the `download_models.py` script
This response is from a bot on Telegram, finetuned on the author's messages
The model card can be found here.
## Quickstart

> NOTE: to build all the requirements, you may need Microsoft C++ Build Tools, found here

1. Clone the repo: `git clone https://github.com/pszemraj/ai-msgbot.git`
2. `cd` into the repo directory: `cd ai-msgbot/`
3. Install the requirements: `pip install -r requirements.txt`
   - if using conda: `conda env create --file environment.yml`
   - NOTE: if the conda install errors out asking for an environment name, use `msgbot`
4. Download the models: `python download_models.py` (if you already have a GPT-2 model, save it to the working directory and skip this step)
5. Run the bot: `python ai_single_response.py -p "hey, what's up?"`, or enter a "chatroom" with `python conv_w_ai.py -p "hey, what's up?"`
   - Note: for either of the above, the `-h` flag can be passed to see all the options (or look in the script file)
Put together in a shell block:

```sh
git clone https://github.com/pszemraj/ai-msgbot.git
cd ai-msgbot/
pip install -r requirements.txt
python download_models.py
python ai_single_response.py -p "hey, what's up?"
```
## Repo Overview and Usage

- The first step in understanding what is going on here is that, ultimately, this project teaches GPT-2 to recognize a "script" of messages and respond in kind.
- This is done with the `aitextgen` library; it is recommended to read through some of its docs and the Training your model section before returning here.
- Essentially, this lets you generate a novel chatbot from just text, without going through the trouble required by other libraries (can you easily abstract your friend's WhatsApp messages into a "persona"?).
An example of what a "script" is:

```
speaker a:
hi, becky, what's up?
speaker b:
not much, except that my mother-in-law is driving me up the wall.
speaker a:
what's the problem?
speaker b:
she loves to nit-pick and criticizes everything that i do. i can never do anything right when she's around.
```

..._Continued_...
More to come, but for now check out `parsing-messages/parse_whatsapp_output.py` for a script that parses messages exported with the standard WhatsApp chat-export feature. Consolidate all the WhatsApp message-export folders into a root directory, and pass that root directory to the script.
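Whatever the source of the messages, the target "script" format can be produced from any list of (speaker, message) pairs. A minimal sketch (the function name and input shape below are illustrative, not part of the repo's API):

```python
def messages_to_script(messages):
    """Convert (speaker, message) pairs into the alternating
    label/message 'script' format the model is trained on."""
    lines = []
    for speaker, text in messages:
        lines.append(f"{speaker}:")          # speaker label line
        lines.append(text.strip().lower())   # message line, normalized
    return "\n".join(lines) + "\n"

# example: two turns of conversation
chat = [("speaker a", "Hi, Becky, what's up?"),
        ("speaker b", "Not much, except that my mother-in-law is driving me up the wall.")]
print(messages_to_script(chat))
```

The resulting text can be written straight to a `.txt` file and used as training input.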
TODO: more words
The next step is to leverage the text-generation model to reply to messages. This is done by parsing/presenting the query "behind the scenes" with either a real or artificial speaker name, and having the response come from `target_name` (in the case of GPT-Peter, the author). Depending on computing resources and so forth, it is possible to keep track of the conversation in a helper script/loop and feed the prior conversation in before the prompt, so the model can use that context as part of the generation sequence, with the attention mechanism ultimately focusing on the last text passed to it (the actual prompt).
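Keeping context can be as simple as accumulating prior turns and prepending them to each prompt, ending with the target speaker's label so the model completes their reply. A rough sketch (names are illustrative; the repo's actual logic lives in `conv_w_ai.py`):

```python
def build_prompt(history, user_name, target_name, new_message):
    """Assemble a script-format prompt from prior turns plus the new
    message, ending with the target speaker's label so the model
    generates that speaker's reply."""
    turns = list(history) + [(user_name, new_message)]
    lines = []
    for speaker, text in turns:
        lines.append(f"{speaker}:")
        lines.append(text)
    lines.append(f"{target_name}:")  # model continues from here
    return "\n".join(lines)

history = [("person alpha", "hey, what's up?"),
           ("peter szemraj", "not much, just got back from class.")]
prompt = build_prompt(history, "person alpha", "peter szemraj", "how was it?")
print(prompt)
```

The returned string would then be passed as the prompt to the generation call, and the model's completion appended to `history` for the next turn.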
Then, this pipeline is deployed to an endpoint where a user can send in a message and the model replies. This repo has several options; see the `deploy-as-bot/` directory, which has an associated README.md file.
- An example dataset (Daily Dialogues) parsed into the script format can be found locally in the `conversation-data` directory.
- When learning, it is probably best to use a conversational dataset such as Daily Dialogues as the *last* dataset to finetune the GPT-2 model on. Before that, the model can "learn" various pieces of information from something like a natural-questions-focused dataset.
- Many more datasets are available online at PapersWithCode and Google Research; Google Research also has a tool for searching for datasets online.
- Note that training is done in Google Colab itself. Try opening `notebooks/colab-notebooks/GPT_general_conv_textgen_training_GPU.ipynb` in Google Colab (see the HTML button at the top of that notebook, or click this link to a shared git gist).
- Essentially, a script needs to be parsed and loaded into the notebook as a standard .txt file, formatted as outlined above. The text-generation model then loads and trains using `aitextgen`'s wrapper around the PyTorch Lightning trainer: the text is fed into the model, and it self-evaluates on a "test" of whether a text-message chain (somewhere later in the doc) was correctly predicted or not.
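Before uploading a parsed `.txt` to the training notebook, it can help to sanity-check that the file actually alternates speaker labels and messages. A small illustrative check (not part of the repo):

```python
def check_script_file(lines, speakers=("speaker a:", "speaker b:")):
    """Return True if non-blank lines strictly alternate between a
    speaker label and a message, ending on a message."""
    expect_label = True
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank separator lines
        # a label must appear exactly when one is expected
        if expect_label != (line in speakers):
            return False
        expect_label = not expect_label
    return expect_label  # True only if the file ends after a message

sample = ["speaker a:", "hi, becky, what's up?",
          "speaker b:", "not much."]
print(check_script_file(sample))  # → True
```

Two consecutive labels, or a file that ends on a dangling label, would make the check return False.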
TODO: more words
`aitextgen` is largely designed around leveraging Colab's free-GPU capabilities to train models. Training a text-generation model, like most transformer models, is resource-intensive. If new to the Google Colab environment, check out the links below to understand more of what it is and how it works:
- Google's FAQ
- Medium Article on Colab + Large Datasets
- Google's Demo Notebook on I/O
- A better Colab Experience
- Command-line scripts:

```sh
python ai_single_response.py -p "hey, what's up?"
python conv_w_ai.py -p "hey, what's up?"
```

- You can pass the argument `--model <NAME OF LOCAL MODEL DIR>` to change the model. Example:

```sh
python conv_w_ai.py -p "hey, what's up?" --model "GPT2_trivNatQAdailydia_774M_175Ksteps"
```
- Some demos are available on the ETHZ Analytics Group's huggingface.co page (no code required!):
- Gradio - locally hosted runtime with a public URL.
  - See: `deploy-as-bot/gradio_chatbot.py`
  - The UI and interface will look similar to the demos above, but run locally & are more customizable.
- Telegram bot - runs locally, and anyone can message the model from the Telegram messenger app(s).
  - See: `deploy-as-bot/telegram_bot.py`
  - An example chatbot by one of the authors is usually online and can be found here
One of this project's primary goals is to train a chatbot/QA bot that can respond to the user "unaided", without hardcoded handling of questions/edge cases. That said, sometimes the model will run strings together; applying "general" spell correction helps keep model responses as understandable as possible without interfering with the response/semantics.
- Implemented methods:
  - symspell (via the pysymspell library). NOTE: while this is fast and works, it sometimes "corrects" common text abbreviations into random other short words that are hard to understand, e.g., *tues* and *idk* and so forth.
  - gramformer (via a `transformers` `pipeline()` object): a pretrained NN that corrects grammar and (to be tested) hopefully does not have the issue described above. Links: model page, the model's GitHub.
  - Grammar Synthesis (WIP): some promising results come from training a text2text generation model that, through "pseudo-diffusion," is trained to denoise heavily corrupted text while learning not to change the semantics of the text. A checkpoint and more details can be found here, and a notebook here.
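One simple way to mitigate the abbreviation problem described above is to exempt a whitelist of chat shorthand from correction. A stdlib-only sketch with a stand-in dictionary corrector (a real deployment would call symspell or a gramformer pipeline instead; the word lists here are made up for illustration):

```python
# toy corrector: fixes a couple of known misspellings; a real one
# would be symspell's compound lookup or a gramformer pipeline
FIXES = {"recieve": "receive", "definately": "definitely"}
KEEP = {"idk", "lol", "tues", "thurs", "u", "rn"}  # chat shorthand to leave alone

def correct_reply(text):
    """Spell-correct a bot reply, skipping whitelisted abbreviations."""
    out = []
    for word in text.split():
        if word.lower() in KEEP:
            out.append(word)  # never "correct" shorthand
        else:
            out.append(FIXES.get(word.lower(), word))
    return " ".join(out)

print(correct_reply("idk if i will recieve it tues"))
# → idk if i will receive it tues
```

The whitelist check runs before correction, so *idk* and *tues* survive untouched while genuine misspellings are still fixed.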
## WIP: Tasks & Ideas

- [ ] finish out `conv_w_ai.py`, which is capable of being fed a whole conversation (or at least the last several messages) to prime responses and "remember" things
- [ ] better text generation
  - [ ] add an option to generate multiple responses to a user prompt, automatically apply sentence scoring to them, and return the one with the highest mean sentence score
- [ ] constrained textgen
  - [ ] explore constrained textgen
  - [ ] add constrained textgen to repo
- [x] assess generalization of hyperparameters for "text-message-esque" bots
- [ ] add write-up with hyperparameter optimization results/learnings
- [ ] switch repo API from `aitextgen` to the `transformers` `pipeline()` object
- [ ] explore the relation between model size and "human-ness" of responses
## Extras, Asides, and Examples

The following responses were received for general conversational questions with the `GPT2_trivNatQAdailydia_774M_175Ksteps` model. This is an example of what it is capable of (and much more!!) in terms of learning to interact with another person, even in a different language:
```sh
python ai_single_response.py --time --model "GPT2_trivNatQAdailydia_774M_175Ksteps" --prompt "where is the grocery store?"

... generating...

finished!
"it's on the highway over there."
took 38.9 seconds to generate.
```

```sh
python ai_single_response.py --time --model "GPT2_trivNatQAdailydia_774M_175Ksteps" --prompt "what should I bring to the party?"

... generating...

finished!
'you need to just go to the station to pick up a bottle.'
took 45.9 seconds to generate.
```

```sh
python ai_single_response.py --time --model "GPT2_trivNatQAdailydia_774M_175Ksteps" --prompt "do you like your job?"

... generating...

finished!
'no, not very much.'
took 50.1 seconds to generate.
```
## Citations

These are probably worth checking out if you find you like NLP/transformer-style language modeling:

TODO: add citations for datasets and main packages used.