MLWizards is a Python library designed to simplify the process of creating chatbots and fine-tuning models using OpenAI. It also includes features for extracting and processing text from PDFs.
- ChatBot: Easily create and interact with chatbots using OpenAI's GPT models.
- Model Fine-Tuning: Fine-tune OpenAI models with your own data.
- PDF Text Extraction: Extract and preprocess text from PDF files using OCR.
1. Install the required Python packages:
pip install pytesseract pdf2image pillow requests-html openai beautifulsoup4
2. Install Poppler:
- Download
Poppler
from here. - Extract the contents to a directory, e.g.,
C:\poppler
. - Add
C:\poppler\bin
to your system's PATH.
- Use
Homebrew
to installPoppler
:brew install poppler
- Use the package manager to install
Poppler
:sudo apt-get install poppler-utils
- Install Tesseract OCR:
- Follow the instructions here to install Tesseract OCR on your operating system.
Create and interact with a chatbot using OpenAI's GPT models.
from mlwizards import ChatBot
api_key = "your_openai_api_key"
bot = ChatBot(api_key=api_key)
while True:
user_input = input("You: ")
if user_input.lower() in ['exit', 'quit']:
break
response = bot.generate(user_input)
bot.add_to_history('user', user_input)
bot.add_to_history('assistant', response)
print("Bot:", response)
Fine-tune OpenAI models with your own data.
from mlwizards import fine_tune_model, upload_file
api_key = "your_openai_api_key"
training_file_path = "path/to/your/training_data.jsonl"
# Upload the training file
upload_response = upload_file(api_key, training_file_path)
training_file_id = upload_response['id']
# Fine-tune the model
fine_tune_response = fine_tune_model(api_key, training_file_id)
print(f"Fine-tuning job started. Response: {fine_tune_response}")
Extract and preprocess text from PDF files using OCR.
from mlwizards import extract_text_from_pdf
pdf_path = "path/to/your/file.pdf"
tesseract_cmd = "/usr/local/bin/tesseract" # Update this path to your tesseract executable if needed
extracted_text = extract_text_from_pdf(pdf_path, tesseract_cmd)
print(extracted_text)
__init__(self, api_key, model='gpt-3.5-turbo', organization=None, project=None)
: Initialize the ChatBot with API key and model.generate(self, prompt)
: Generate a response from the chatbot.add_to_history(self, role, content)
: Add a message to the conversation history.reset_conversation(self)
: Reset the conversation history.search_internet(self, query, num_results=5)
: Search the internet and generate a response based on search results.
fine_tune_model(api_key, training_file_id, model='gpt-3.5-turbo', organization=None, project=None, suffix=None, hyperparameters=None)
: Fine-tune a model with the given training file.upload_file(api_key, file_path, purpose='fine-tune', organization=None, project=None)
: Upload a file to OpenAI.
extract_text_from_pdf(pdf_path, tesseract_cmd=None)
: Extract text from a PDF file using OCR.preprocess_text(text)
: Preprocess extracted text by removing unnecessary whitespace and newlines.
We welcome contributions to MLWizards! Please submit a pull request or open an issue to discuss any changes you would like to make.
This project is licensed under the MIT License.
This README provides a comprehensive guide to installing, using, and contributing to the `MLWizards` library. It includes instructions for installing dependencies, examples of how to use the main features, and descriptions of the key functions.