[go: up one dir, main page]

Skip to content

Convert Text with context about your dataframe to code Pandas by py

License

Notifications You must be signed in to change notification settings

zeyadusf/Text2Pandas

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Text to Pandas

Convert Text with context about your dataframe to code Pandas by py

Kaggle | Kaggle | Huggingface | Huggingface | Huggingface

Table of Content:

  1. Problem Definition
  2. About Data
  3. About Model
  4. Inference Model
  5. Future work
  6. Related Repositories


Problem Definition

The Text-to-Pandas task involves translating natural language text instructions into code that performs operations using the Pandas library, a powerful data manipulation and analysis tool in Python. The Pandas library allows users to handle data structures like DataFrames, perform filtering, aggregation.

In this task, given a natural language command, the goal is to generate the correct Pandas code to achieve the desired result on a dataset, such as filtering rows, computing aggregates, or selecting specific columns.

About Data

I found two datasets about converting text with context to pandas code on Hugging Face, but the challenge is in the context. The context in both datasets is different which reduces the results of the model. First let's mention the data I found and then show examples, solution and some other problems.

  • Rahima411/text-to-pandas :
    • The data is divided into Train with 57.5k and Test with 19.2k.

    • The data has two columns as you can see in the example:

      • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
      • "Pandas Query": Pandas code
              Input                                        |                     Pandas Query
-----------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object))          |   result = management['head.age'].unique()
Table Name: management (head_id (object),                  |
 temporary_acting (object))                                | 
What are the distinct ages of the heads who are acting?    |
  • hiltch/pandas-create-context:
    • It contains 17k rows with three columns:
      • question : text .
      • context : Code to create a data frame with column names, unlike the first data set which contains the name of the data frame, column names and data type.
      • answer : Pandas code.
           question                     |                        context                         |              answer 
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes?	| df = pd.DataFrame(columns=['_number_of_total_votes'])  |  df['_number_of_total_votes'].min()		

As you can see, the problem with this data is that they are not similar as inputs and the structure of the context is different . My solution to this problem was:

  • Convert the first data set to become like the second in the context. I chose this because it is difficult to get the data type for the columns in the second data set. It was easy to convert the structure of the context from this shape Table Name: head (age (object), head_id (object)) to this head = pd.DataFrame(columns=['age','head_id']) through this code that I wrote.
  • Then separate the question from the context. This was easy because if you look at the data, you will find that the context always ends with "(" and then a blank and then the question. You will find all of this in this code.
  • You will also notice that more than one code or line can be returned to the context, and this has been engineered into the code.
def extract_table_creation(text:str)->(str,str):
    """
    Extracts DataFrame creation statements and questions from the given text.
    
    Args:
        text (str): The input text containing table definitions and questions.
        
    Returns:
        tuple: A tuple containing a concatenated DataFrame creation string and a question.
    """
    # Define patterns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'
    
    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)
    
    # Initialize a list to hold DataFrame creation statements
    df_creations = []
    
    for table_name, columns_str in matches:
        # Extract column names
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]
        
        # Format DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)
    
    # Concatenate all DataFrame creation statements
    df_creation_concat = '\n'.join(df_creations)
    
    # Extract and clean the question
    question = text[text.rindex(')')+1:].strip()
    
    return df_creation_concat, question

After both datasets were similar in structure, they were merged into one set and divided into 72.8K train and 18.6K test. We analyzed this dataset and you can see it all through the notebook, but we found some problems in the dataset as well, such as

  • Answer : df['Id'].count() has been repeated, but this is possible, so we do not need to dispense with these rows.
  • Context : We see that it contains 147 rows that do not contain any text. We will see Through the experiment if this will affect the results negatively or positively.
  • Question : It is clear that the question has an invalid sample such as ? etc.
  • context_length confirms that there are rows that contain no text and it is also clear that there are rows that contain short text that is probably invalid.
  • question_length is the same as context_length there are rows that contain no text and it is also clear that there are rows that contain short text that is probably invalid.

To address the problem, we need to filter out rows based on multiple criteria:
1. Remove rows with an invalid question, such as those containing ?.
2. Remove rows where context or question lengths are less than 10, which indicates too short or invalid text.
3. Retain the rows where the answer column may have repeated values, as this is expected behavior.
4. Drop duplicate rows

You can find the data after structuring, filtering and cleaning through:

Load Data

  • Dataset library
  from datasets import load_dataset
ds = load_dataset("zeyadusf/text2pandas")
  • Pandas library
import pandas as pd

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/zeyadusf/text2pandas/" + splits["train"])
  • Croissant library
from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/zeyadusf/text2pandas/croissant")
records = ds.records("default")
  • Polars library
import polars as pl

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df = pl.read_parquet('hf://datasets/zeyadusf/text2pandas/' + splits['train'])
  • number of characters image

  • WordCloud image

About Model

I fine tuned T5, T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. Using Transformers library and trained on 5 epochs and learning rate was 3e-5 and scheduler type was cosine. You can see the rest of the hyperparameters in the notebook.
As for the results on test dataset:

  1. Prediction Loss: 0.0463
  • This is the average loss during the prediction phase of your model on the test set. A lower loss indicates that the model is predicting outputs that are closer to the expected values. In this case, a loss of 0.0463 suggests that the model is making fairly accurate predictions, as a low loss generally signals better performance.
  1. Prediction ROUGE-1: 0.8396
  • ROUGE-1 measures the overlap of unigrams (single words) between the predicted text and the reference text (in this case, the generated Pandas code and the ground truth). A score of 0.8396 (or ~84%) indicates that there is a high level of overlap between the predicted and true sequences, meaning that the model is capturing the general structure well.
  1. Prediction ROUGE-2: 0.8200
  • ROUGE-2 evaluates bigram (two-word) overlap between the predicted and reference texts. A score of 0.82 (~82%) suggests that the model is also doing well at capturing the relationships between words, which is important for generating coherent and syntactically correct code.
  1. Prediction ROUGE-L: 0.8396
  • ROUGE-L measures the longest common subsequence (LCS) between the predicted and reference sequences, focusing on the sequence order. A high ROUGE-L score (~84%) means the model is generating sequences that align well with the true code in terms of overall structure and ordering of operations. This is crucial when generating code, as the order of operations affects the logic.
  1. Prediction BLEU: 0.4729
  • BLEU evaluates how many n-grams (in this case, code snippets) in the predicted output match those in the reference output. A BLEU score of 0.4729 (or ~47%) is a moderate result for a text-to-code task. BLEU can be more challenging to optimize for code generation since it requires exact matches at a token level, including symbols, syntax, and even whitespace.

In general, this is a promising result, showing that the model is performing well on the task, with room for improvement on exact token matching (reflected by the BLEU score).

Inference Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained("zeyadusf/text2pandas-T5")
model = AutoModelForSeq2SeqLM.from_pretrained("zeyadusf/text2pandas-T5")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def generate_pandas(question, context, model, tokenizer, max_length=512, num_beams=4, early_stopping=True):
    """
    Generates text based on the provided question and context using a pre-trained model and tokenizer.

    Args:
        question (str): The question part of the input.
        context (str): The context (e.g., DataFrame description) related to the question.
        model (torch.nn.Module): The pre-trained language model (e.g., T5).
        tokenizer (PreTrainedTokenizer): The tokenizer corresponding to the model.
        max_length (int): Maximum length of the generated text.
        num_beams (int): The number of beams for beam search.
        early_stopping (bool): Whether to stop the beam search when enough hypotheses have reached the end.

    Returns:
        str: The generated text decoded by the tokenizer.
    """
    # Prepare the input text by combining the question and context
    input_text = f"<question> {question} <context> {context}"

    # Tokenize the input text, convert to tensor, and truncate if needed
    inputs = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=max_length)

    # Move inputs and model to the appropriate device
    inputs = inputs.to(device)
    model = model.to(device)

    # Generate predictions without calculating gradients
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=max_length, num_beams=num_beams, early_stopping=early_stopping)

    # Decode the generated tokens into text, skipping special tokens
    predicted_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return predicted_text

# Example usage
question = "what is the total amount of players for the rockets in 1998 only?"
context = "df = pd.DataFrame(columns=['player', 'years_for_rockets'])"

# Generate and print the predicted text
predicted_text = generate_pandas(question, context, model, tokenizer)
print(predicted_text)

output

df[df['years_for_rockets'] == '1998']['player'].count()

The model link is at the top of the page.

Future work

  • Improve T5 results.
  • Finetune another model such as Llama2

Related Repositories

LLMs from Scratch FineTuning Large Language Models Topics in NLP and LLMs


πŸ“ž Contact :

About

Convert Text with context about your dataframe to code Pandas by py

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published