Introduction to Hugging Face
WORKING WITH HUGGING FACE
Jacob H. Marquez
Lead Data Engineer
What is Hugging Face?
Collaboration platform
Open-source machine learning
Text, vision, and audio tasks
Models, datasets, frameworks
Reduce barriers to entry
1 https://huggingface.co/
In this course
Navigate and use the Hugging Face Hub
Explore models and datasets
Build pipelines for text, image, and audio data
Fine-tuning, generation, embeddings, and semantic search
Large Language Models
LLMs
Understand and generate human-like text
Massive amounts of data
Learn patterns in sequences
Transformer architecture
Popular options are GPT and Llama
1 https://en.wikipedia.org/wiki/Large_language_model
2 https://towardsdatascience.com/transformers-89034557de14
Benefits of Hugging Face
Faster experimentation
Supports every step of the process
Smoother adoption
Deciding when to use
Use Hugging Face | Use another solution
Quick way to use ML tasks | Slow computer
Don't have deep ML expertise | Highly customized architectures
Testing several models | Domain-specific needs not yet met
Dataset needed | Not leveraging advanced ML techniques
Installing Hugging Face
Hugging Face
pip install transformers datasets
ML Framework
pip install torch torchvision torchaudio
1 https://pytorch.org/
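After running the two install commands above, a quick check can confirm which packages are importable. The sketch below uses only the Python standard library; the helper name `check_installed` is hypothetical and not part of any Hugging Face package.

```python
import importlib.util

# Packages named in the pip commands above.
PACKAGES = ["transformers", "datasets", "torch", "torchvision", "torchaudio"]

def check_installed(packages):
    """Return {package_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

status = check_installed(PACKAGES)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed with the matching pip command from this slide.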
Let's practice!
Transformers and the Hub
Transformers - the Hugging Face package
1 https://github.com/huggingface/transformers
Transformers - the model architecture
Neural network models
Learn context and understanding
Core components:
Encoder
Decoder
Self-attention mechanism
Transform input to numerical representations
Helps model understand context of the input
1 https://www.turing.com/kb/brief-introduction-to-transformers-and-their-power
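To make the self-attention component listed above more concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention. It is a simplification under stated assumptions: queries, keys, and values are all taken to be the input vectors themselves (real transformers apply learned projections first), and the three token vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = input.

    Assumption: learned projection matrices are left out to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the current token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional token vectors, purely illustrative.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, which is how context from the whole input ends up in each output vector.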
Use cases of transformers
Use cases for text, image, and audio
Classification for all three
Automatic speech recognition
Text summarization
Object detection for autonomous driving
A key benefit of transformers
Transfer learning enables Hugging Face models to perform well on new tasks with little data
1 https://www.topbots.com/transfer-learning-in-nlp/#transfer-learning
The Hub
1 https://huggingface.co/
Navigating the Hub
1 https://huggingface.co/
Searching for models
1 https://huggingface.co/models
Model cards
1 https://huggingface.co/openai/whisper-large-v3
Using huggingface_hub
pip install huggingface_hub
from huggingface_hub import HfApi
api = HfApi()
list(api.list_models())
[ModelInfo: {
{'_id': '622fea36174feb5439c2e4be',
'author': 'cardiffnlp',
...}]
1 https://github.com/huggingface/huggingface_hub
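Using huggingface_hub
# ModelFilter is only available in older huggingface_hub releases;
# in newer ones, pass task="text-classification" directly to list_models()
from huggingface_hub import HfApi, ModelFilter
api = HfApi()
# Five most-downloaded text-classification models
models = api.list_models(
    filter=ModelFilter(
        task="text-classification"),
    sort="downloads",
    direction=-1,
    limit=5
)
modelList = list(models)
print(modelList[0])
Model Name: albert/albert-base-v1, Tags: [...]
task searches for specified task
sort will order the list
direction provides the direction of the sorted order
-1 for descending
all other numbers for ascending
limit will limit the number of models returned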
1 https://github.com/huggingface/huggingface_hub
Saving a model locally
# Import AutoModel
from transformers import AutoModel
modelId = "distilbert-base-uncased-finetuned-sst-2-english"
# Download model using the modelId
model = AutoModel.from_pretrained(modelId)
# Save the model to a local directory
model.save_pretrained(save_directory=f"models/{modelId}")
Be mindful of storage!
1 https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel
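Since saved models can take up a lot of disk space, it may help to measure a directory after saving to it. The sketch below uses only the standard library; `directory_size_mb` is a hypothetical helper, and the demo runs on a throwaway directory with placeholder files rather than a real model.

```python
from pathlib import Path
import tempfile

def directory_size_mb(directory):
    """Sum the size of every file under `directory`, in megabytes."""
    total_bytes = sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())
    return total_bytes / (1024 * 1024)

# Demonstrate on a temporary directory so the sketch runs without downloading a model.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "weights.bin").write_bytes(b"\0" * 2 * 1024 * 1024)  # 2 MB placeholder file
    (Path(tmp) / "config.json").write_text("{}")
    demo_size = directory_size_mb(tmp)
    print(f"{demo_size:.2f} MB")
```

After `model.save_pretrained(...)`, the same helper could be pointed at `f"models/{modelId}"` to see how much space the saved model occupies.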
Let's practice!
Working with datasets
Datasets in Hugging Face
1 https://huggingface.co/datasets
Searching for datasets
1 https://huggingface.co/datasets
Dataset cards
Description
Dataset structure
An example
Field metadata
Training and testing splits
1 https://huggingface.co/datasets/imdb
datasets package
pip install datasets
Access
Download
Mutate
Use
Share
1 https://huggingface.co/docs/datasets/index
Inspecting a dataset
from datasets import load_dataset_builder
data_builder = load_dataset_builder("imdb")
print(data_builder.info.description)
Large Movie Review Dataset. This is a dataset for sentiment classification...
print(data_builder.info.features)
{'text': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}
1 https://huggingface.co/docs/datasets/load_hub
Downloading a dataset
from datasets import load_dataset
data = load_dataset("imdb")
Split parameter
data = load_dataset("imdb", split="train")
Configuration parameter
data = load_dataset("wikipedia", "20231101.en")
1 https://huggingface.co/docs/datasets/v2.15.0/loading
Use in datasets
Apache Arrow dataset formats
1 https://arrow.apache.org/overview/
Mutating a dataset
imdb = load_dataset("imdb", split="train")
# Filter imdb
filtered = imdb.filter(lambda row: row['label']==0)
{'text': 'I rented I AM CURIOUS-YELLOW...'}
1 https://huggingface.co/docs/datasets/process#select-and-filter
Mutating a dataset
# Slicing
sliced = filtered.select(range(2))
print(sliced)
Dataset({features: ['text', 'label'], num_rows: 2})
print(sliced[0]['text'])
1 https://huggingface.co/docs/datasets/process#select-and-filter
Benefits of datasets
Accessible and shareable
Relevant to common ML tasks
Efficient processing on large data
Faster querying
Convenient complementary datasets package
Let's practice!