Refactor notebook to load dataset from Hugging Face Hub and update index.md to include new tag generator notebook · huggingface/cookbook@1545b1f · GitHub
Commit 1545b1f

committed
Refactor notebook to load dataset from Hugging Face Hub and update index.md to include new tag generator notebook
1 parent d2f3f89 commit 1545b1f

File tree

2 files changed: +24 −21 lines changed

notebooks/en/finetune_t5_for_search_tag_generation.ipynb

Lines changed: 23 additions & 19 deletions
@@ -91,6 +91,7 @@
 "outputs": [],
 "source": [
 "from google.colab import userdata\n",
+"import os\n",
 "os.environ['HUGGINGFACE_TOKEN'] = userdata.get('HUGGINGFACE_TOKEN')"
 ]
 },
@@ -225,39 +226,33 @@
 "\n",
 "We split this dataset into training and validation sets using a 90/10 ratio.\n",
 "\n",
-"🔁 _Note_: When this notebook was initially run, the dataset was loaded locally from a file. However, the same dataset is now also available on the Hugging Face Hub here: [zamal/github-meta-data](https://huggingface.co/datasets/zamal/github-meta-data). Feel free to load it directly using `load_dataset(\"zamal/github-meta-data\")` in your workflow.\n"
+"🔁 _Note_: When this notebook was initially run, the dataset was loaded locally from a file. However, the same dataset is now also available on the Hugging Face Hub here: [zamal/github-meta-data](https://huggingface.co/datasets/zamal/github-meta-data). Feel free to load it directly using `load_dataset(\"zamal/github-meta-data\")` in your workflow as shown below.\n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
-"metadata": {
-"id": "NtMlaBADani6"
-},
+"execution_count": 4,
+"metadata": {},
 "outputs": [],
 "source": [
-"from datasets import DatasetDict\n",
+"from datasets import load_dataset, DatasetDict\n",
 "\n",
-"# Load and split local JSONL\n",
-"import json\n",
-"from datasets import Dataset\n",
+"# Load existing dataset with only a \"train\" split\n",
+"dataset = load_dataset(\"zamal/github-meta-data\") # returns DatasetDict\n",
 "\n",
-"with open(\"/content/t5_formatted_dataset.jsonl\", \"r\", encoding=\"utf-8\") as f:\n",
-" data = [json.loads(line) for line in f]\n",
+"# Split the train set into train and validation\n",
+"split = dataset[\"train\"].train_test_split(test_size=0.1, seed=42)\n",
 "\n",
-"dataset = Dataset.from_list(data)\n",
-"\n",
-"# Train/validation split\n",
-"splits = dataset.train_test_split(test_size=0.1, seed=42)\n",
+"# Wrap into a new DatasetDict\n",
 "dataset_dict = DatasetDict({\n",
-" \"train\": splits[\"train\"],\n",
-" \"validation\": splits[\"test\"]\n",
+" \"train\": split[\"train\"],\n",
+" \"validation\": split[\"test\"]\n",
 "})\n"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 5,
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/"
@@ -1763,7 +1758,16 @@
 "name": "python3"
 },
 "language_info": {
-"name": "python"
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
+"name": "python",
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.11.5"
 },
 "widgets": {
 "application/vnd.jupyter.widget-state+json": {

notebooks/en/index.md

Lines changed: 1 addition & 2 deletions
@@ -7,12 +7,11 @@ applications and solving various machine learning tasks using open-source tools
 
 Check out the recently added notebooks:
 
+- [Fine-tuning T5 for Automatic GitHub Tag Generation with PEFT](finetune_t5_for_search_tag_generation)
 - [Documentation Chatbot with Meta Synthetic Data Kit](fine_tune_chatbot_docs_synthetic)
 - [HuatuoGPT-o1 Medical RAG and Reasoning](medical_rag_and_Reasoning)
 - [Fine-tuning Granite Vision 3.1 2B with TRL](fine_tuning_granite_vision_sft_trl)
 - [Post training an LLM for reasoning with GRPO in TRL](fine_tuning_llm_grpo_trl)
-- [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library)
-- [Fine-tuning T5 for Automatic GitHub Tag Generation with PEFT](finetune_t5_for_search_tag_generation)
 
 
 You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook).

0 commit comments
