FACE CLUSTERING
Context
In the era of vast digital content generation and storage, the need for organizing and
managing visual data, especially human faces, has grown substantially. From social media and
photo management systems to surveillance and security applications, face data is omnipresent.
However, manually categorizing and labeling large volumes of facial images is not only time-
consuming but often impractical. This challenge has paved the way for automated techniques
like face clustering—a subfield of computer vision and machine learning focused on grouping
facial images based on identity without prior labels.
Face clustering operates in the broader ecosystem of face analysis, alongside face detection,
recognition, and verification. While face recognition answers the question of “who is this?”,
clustering attempts to find inherent groupings or identities within a dataset without labeled
guidance. This makes it particularly valuable for unsupervised scenarios or as a preprocessing
step for downstream supervised learning tasks.
Introduction
This document presents a technical overview of a face clustering system built using the
DeepFace library. Face clustering refers to the process of organizing a collection of face images
into groups (clusters), where each group ideally represents a unique individual. Unlike
supervised face recognition, clustering is performed without labeled data, making it particularly
suitable for exploratory analysis or organization of large, unlabeled datasets.
The face clustering pipeline in this project is implemented using DeepFace and involves the
following core steps:
Face Detection and Alignment – Automatically identifying and preprocessing faces in
images.
Face Embedding Generation – Extracting compact feature representations of faces
using pre-trained deep neural networks like VGG-Face, Facenet, or ArcFace.
Distance Calculation – Measuring similarity between face embeddings.
Clustering Algorithm – Applying unsupervised algorithms (e.g., DBSCAN,
Agglomerative Clustering) to group faces by identity.
By leveraging DeepFace’s robust and flexible API, this system abstracts away much of the low-
level complexity typically involved in face analysis pipelines, allowing for rapid development
and deployment of face clustering solutions.
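To make these steps concrete, the following is a minimal sketch of the pipeline using DeepFace together with scikit-learn's DBSCAN. The input folder name, the choice of the Facenet model, and the eps value are illustrative assumptions and would need tuning on a real dataset.

from pathlib import Path
import numpy as np
from deepface import DeepFace
from sklearn.cluster import DBSCAN

paths, embeddings = [], []
for img in sorted(Path("input_images").glob("*.jpg")):  # hypothetical input folder
    try:
        # represent() runs detection, alignment, and embedding in one call
        reps = DeepFace.represent(img_path=str(img), model_name="Facenet")
        embeddings.append(reps[0]["embedding"])  # reps[0]: first (or only) face found
        paths.append(img)
    except ValueError:
        pass  # no detectable face in this image; skip it

# DBSCAN with cosine distance groups embeddings by identity;
# label -1 marks faces that matched no cluster
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(np.array(embeddings))
for img, label in zip(paths, labels):
    print(label, img.name)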
Problem Statement
The objective of this project is to develop an unsupervised face clustering system
that can automatically group face images of the same individual from a given image dataset,
without any prior labeling. This system is particularly valuable in scenarios where manual
annotation is infeasible, such as organizing large photo libraries, clustering surveillance footage,
or performing identity grouping in media archives.
While commercial APIs like Amazon Rekognition offer powerful face analysis features—such as
face detection, comparison, and indexing—these services often come with constraints:
Vendor Lock-In: Proprietary systems limit flexibility in customization and model selection.
Cost Scaling: Per-image pricing becomes expensive for large-scale datasets.
Lack of Control: Limited access to underlying models and hyperparameters.
Data Privacy: Cloud-based services may raise concerns about user data handling.
This project proposes an open-source, locally deployable alternative using the DeepFace library,
which integrates state-of-the-art face embedding models (e.g., VGG-Face, Facenet, ArcFace).
The goal is to deliver a customizable and privacy-preserving clustering pipeline that offers
comparable accuracy to commercial offerings like AWS Rekognition, but with full transparency
and cost efficiency.
Key challenges addressed:
How to generate meaningful and consistent face embeddings.
How to measure similarity between embeddings effectively.
How to choose an optimal clustering strategy (e.g., DBSCAN vs. hierarchical).
How to benchmark clustering performance without ground-truth labels (a sketch of one approach follows this list).
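For the last challenge, one common proxy is the silhouette coefficient, which scores cluster cohesion and separation without any ground-truth labels. A minimal sketch, reusing the embeddings and labels arrays from the Introduction's sketch:

import numpy as np
from sklearn.metrics import silhouette_score

mask = labels != -1  # exclude DBSCAN noise points from the score
if len(set(labels[mask])) > 1:  # silhouette needs at least two clusters
    score = silhouette_score(np.array(embeddings)[mask], labels[mask], metric="cosine")
    print(f"silhouette (cosine): {score:.3f}")  # closer to 1.0 = tighter, better-separated clusters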
System Architecture
1. Start
o Launch the application.
2. Input Images
o Accepts a directory of images in .jpg, .jpeg, or .png format.
3. Face Detection (via Mediapipe)
o Detects face(s) in each image.
o If no valid face is found, the image is moved to the others folder.
4. Face Alignment (via OpenCV)
o Aligns detected faces to a canonical orientation to improve embedding quality.
5. Embedding Extraction (via FaceNet)
o Converts each aligned face into a 128-dimensional vector (embedding) representing facial features.
6. Similarity Check
o The embedding is compared against existing cluster embeddings using cosine similarity (see the sketch after this list).
7. Threshold Comparison
o If similarity with an existing cluster is greater than or equal to the threshold (e.g., 0.37), the image is grouped into that existing cluster's folder.
o If not, a new cluster folder is created for that face.
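A sketch of the similarity check and threshold comparison in steps 6 and 7. Comparing each new embedding against a running cluster centroid is one reasonable interpretation, not necessarily the exact implementation; the threshold matches the example value above.

import numpy as np

THRESHOLD = 0.37  # example threshold from step 7

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

clusters = []  # each entry is a list of embeddings for one identity

def assign(embedding):
    """Return the cluster index for this face, creating a new cluster if needed."""
    best_idx, best_sim = None, -1.0
    for idx, members in enumerate(clusters):
        sim = cosine_similarity(embedding, np.mean(members, axis=0))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is not None and best_sim >= THRESHOLD:
        clusters[best_idx].append(embedding)  # join the existing cluster folder
        return best_idx
    clusters.append([embedding])  # start a new cluster folder
    return len(clusters) - 1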
Technical Stack

Component             Technology Used
Face Detection        Mediapipe
Face Alignment        OpenCV
Embeddings            FaceNet
Similarity Metric     Cosine Distance
Programming Language  Python 3.8+
I/O                   JSON + File System
SNAPSHOTS
Fig. Input images
Fig. Clustering result
Fig. Code
Docker
Docker is a platform used to develop, ship, and run applications inside containers. A
container is a lightweight, portable, and self-sufficient environment that includes
everything an application needs to run—such as code, runtime, libraries, and system tools
—ensuring consistency across different systems. Docker uses images (which are built from
Dockerfiles) to create these containers, allowing developers to build once and run
anywhere without worrying about environment differences. It's widely used for its ability
to isolate applications, simplify deployment, and ensure smooth scaling across
development, testing, and production environments.
Rancher Desktop (For Windows)
Rancher Desktop is an open-source application that provides a local Kubernetes and
container management environment on your desktop. It’s mainly used by developers to run
and test Kubernetes clusters and Docker-style container workflows directly on their
machines without needing a cloud setup.
Rancher Desktop includes everything needed to build, run, and manage containers using
tools like containerd, dockerd (Moby), and Kubernetes. It provides a graphical interface to
manage these components and supports switching between container runtimes and
Kubernetes versions easily. Developers can build images with nerdctl (a Docker-
compatible CLI) or Docker CLI (if Moby is enabled) and test them in a Kubernetes
environment—all locally. It’s especially useful for developers working with microservices
and cloud-native applications, allowing a simple way to simulate production-like
environments on their local systems.
Installation
Fig. Installation of Rancher Desktop
WSL2 (Windows Subsystem for Linux)
Rancher Desktop on Windows requires WSL2, so the Windows Subsystem for Linux must be installed first (for example, by running wsl --install from an elevated PowerShell prompt).
After the installation, run Rancher Desktop; a first-run setup dialog appears.
Create a Dockerfile
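A minimal hello-world Dockerfile might look like the following sketch (the base image and message are illustrative):

# Build a tiny image whose only job is to print a greeting
FROM alpine:latest
# Run the echo command when the container starts
CMD ["echo", "Hello, World!"]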
The above Dockerfile is a basic example: when built and run, the container simply prints a hello-world message.
How to Write a Dockerfile
For a detailed guide to writing a Dockerfile, follow the link below:
https://docs.docker.com/get-started/docker-concepts/building-images/writing-a-dockerfile/
Basic Docker Terminology
1. Container
Containers are isolated processes for each of your app's components. Each component - the frontend React app, the Python API engine, and the database - runs in its own isolated environment, completely isolated from everything else on your machine. Here's what makes them awesome. Containers are:
Self-contained. Each container has everything it needs to function with no reliance
on any pre-installed dependencies on the host machine.
Isolated. Since containers are run in isolation, they have minimal influence on the
host and other containers, increasing the security of your applications.
Independent. Each container is independently managed. Deleting one container
won't affect any others.
Portable. Containers can run anywhere! The container that runs on your
development machine will work the same way in a data center or anywhere in the
cloud!
For more information, refer to: https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-a-container/
2. Image
Given that a container is an isolated process, where does it get its files and configuration? How do you share those environments?
That's where container images come in. A container image is a standardized package that
includes all of the files, binaries, libraries, and configurations to run a container.
For a PostgreSQL image, that image will package the database binaries, config files, and
other dependencies. For a Python web app, it'll include the Python runtime, your app code,
and all of its dependencies.
There are two important principles of images:
Images are immutable. Once an image is created, it can't be modified. You can only
make a new image or add changes on top of it.
Container images are composed of layers. Each layer represents a set of file system
changes that add, remove, or modify files.
These two principles let you extend or add to existing images. For example, if you are
building a Python app, you can start from the Python image and add additional layers to
install your app's dependencies and add your code. This lets you focus on your app, rather
than Python itself.
For more information, refer to: https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-an-image/
Image Layers:
Each layer in an image contains a set of filesystem changes - additions, deletions, or
modifications. Let’s look at a theoretical image:
The first layer adds basic commands and a package manager, such as apt.
The second layer installs a Python runtime and pip for dependency management.
The third layer copies in an application’s specific requirements.txt file.
The fourth layer installs that application’s specific dependencies.
The fifth layer copies in the actual source code of the application.
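Assuming an Ubuntu base image (the tag and file names are illustrative), a Dockerfile producing roughly these five layers might look like:

# layer 1: base filesystem with basic commands and the apt package manager
FROM ubuntu:22.04
# layer 2: install a Python runtime and pip
RUN apt-get update && apt-get install -y python3 python3-pip
# layer 3: copy in the application's requirements.txt file
COPY requirements.txt .
# layer 4: install the application's dependencies
RUN pip3 install -r requirements.txt
# layer 5: copy in the application's source code
COPY . .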
Stacking the Layers
Layering is made possible by content-addressable storage and union filesystems. While this
will get technical, here’s how it works:
After each layer is downloaded, it is extracted into its own directory on the host
filesystem.
When you run a container from an image, a union filesystem is created where layers
are stacked on top of each other, creating a new and unified view.
When the container starts, its root directory is set to the location of this unified
directory, using chroot.
Building an Image
Building images - the process of building an image based on a Dockerfile
Tagging images - the process of giving an image a name, which also determines
where the image can be distributed
Publishing images - the process to distribute or share the newly created image using
a container registry
Most often, images are built using a Dockerfile. The most basic docker build command might
look like the following:
docker build .
The final "." in the command provides the path or URL to the build context. At this location, the builder will find the Dockerfile and other referenced files.
Tagging an Image
Tagging images is the method to provide an image with a memorable name. However, there is a
structure to the name of an image. A full image name has the following structure:
[HOST[:PORT_NUMBER]/]PATH[:TAG]
HOST: The optional registry hostname where the image is located. If no host is specified,
Docker's public registry at docker.io is used by default.
PORT_NUMBER: The registry port number, if a hostname is provided.
PATH: The path of the image, consisting of slash-separated components. For Docker Hub, the format follows [NAMESPACE/]REPOSITORY, where namespace is either a user's or organization's name. If no namespace is specified, library is used, which is the namespace for Docker Official Images.
TAG: A custom, human-readable identifier that's typically used to identify different
versions or variants of an image. If no tag is specified, latest is used by default.
To tag an image during a build, add the -t or --tag flag:
docker build -t my-username/my-image .
If you've already built an image, you can add another tag to the image by using the docker
image tag command:
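For example (the v1 tag name is illustrative):

docker image tag my-username/my-image my-username/my-image:v1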
Publishing an Image
Once you have an image built and tagged, you're ready to push it to a registry. To do so, use the
docker push command:
docker push my-username/my-image
Within a few seconds, all of the layers for your image will be pushed to the registry.
Running an Image
Once you have an image built and tagged, you're ready to run it. To do so, use the docker run command:
docker run <image_name>
-it – This is a combination of two flags:
-i = interactive mode (keeps the standard input open)
-t = allocate a pseudo-terminal (makes it look like a real terminal)
docker run -it <image_name>
Docker then creates a container from the image and attaches your terminal to it.
DOCKER COMPOSE
Docker Compose is a tool that helps you define and share multi-container applications. With
Compose, you can create a YAML file to define the services and with a single command, you
can spin everything up or tear it all down.
Each container starts from the image definition each time it starts. While containers can create, update, and delete files, those changes are lost when you remove the container, and Docker isolates all changes to that container.
DOCKER MOUNTS
Mounts are how you attach storage to containers. Docker supports three main types of mounts:
Volumes
Bind Mounts
tmpfs (temporary file system)
Volumes
Volumes provide the ability to connect specific filesystem paths of the container back to
the host machine. If you mount a directory in the container, changes in that directory are also
seen on the host machine. If you mount that same directory across container restarts, you'd see
the same files.
There are two main types of volumes
Named Volume
Anonymous Volume
Named Volume
A named volume in Docker is a persistent storage mechanism that allows data to be
stored independently of a container's lifecycle. Unlike anonymous volumes, named volumes
have specific names, making them easier to reference and manage. They are created and
managed by Docker and are typically stored in Docker's default storage location. Named
volumes are ideal for sharing data between containers or preserving data across container restarts
and rebuilds.
docker volume create my-data
docker run -v my-data:/app/data my-image
Anonymous Volume
An anonymous volume in Docker is a volume created without a specific name, usually by just
specifying the mount point inside the container (e.g., -v /app/data). Docker automatically assigns
a random name to it. These volumes are useful for temporary data storage, but because they
don’t have a defined name, they can be harder to manage or reference later. They still persist
outside the container’s filesystem, but since they’re not easily identifiable, they are more suitable
for short-term storage that doesn’t need to be shared between containers.
docker run -v /app/data my-image
Bind Mount
A bind mount is another type of mount, which lets you share a directory from the host's
filesystem into the container. When working on an application, you can use a bind mount to
mount source code into the container. The container sees the changes you make to the code
immediately, as soon as you save a file. This means that you can run processes in the container
that watch for filesystem changes and respond to them.
docker run -it --mount "type=bind,src=%cd%,target=/src" ubuntu bash
Dockerization of Face Clustering
Steps to follow to dockerize the face clustering application:
1. Write the Dockerfile, which should be present in the root directory of the project:
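A sketch of what this Dockerfile might contain (the main.py script name and requirements.txt are assumptions; the /app working directory matches the container paths used in the compose files below):

# Python base image matching the project's Python 3.8+ requirement
FROM python:3.8-slim
WORKDIR /app
# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application source code
COPY . .
# Hypothetical entry point for the clustering script
CMD ["python", "main.py"]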
2. Build a Docker image:
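Assuming the Dockerfile above is in the project root, build and tag the image as faceclustering, the name the compose files below refer to:

docker build -t faceclustering .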
3. Mount the host directories into the container.
As mentioned above in this document, we have two types of mounts:
i. Bind mounts
ii. Volume mounts
I. Bind Mounts
For manual mounting, refer to the bind mount section earlier in this document.
Another way of mounting is by writing a compose.yml file in your root directory.
In face clustering, the bind mount compose.yml file looks something like the code below:
version: "3.9"
services:
  faceclustering:
    image: faceclustering   # Make sure you've built this image already
    volumes:
      - ./input_images:/app/input_images
      - ./clustered_images:/app/clustered_images
      - ./.deepface/weights:/root/.deepface/weights
    restart: "no"
Running the compose file
Command to run the docker-compose.yml:
docker compose run <service_name>
In the above .yml file, faceclustering is the service name, so the command is docker compose run faceclustering.
II. Volume Mounts
As discussed in the volume mounts section above, named volumes can be declared directly in the compose file. In face clustering, the volume mount compose.yml looks something like the code below:
version: "3.9"
services:
  faceclustering:
    image: faceclustering   # Make sure you've built this image already
    container_name: faceclustering_container
    volumes:
      - input_images_vol:/app/input_images
      - clustered_images_vol:/app/clustered_images
      - deepface_weights_vol:/root/.deepface/weights
    restart: "no"
volumes:
  input_images_vol:
  clustered_images_vol:
  deepface_weights_vol:
Running the compose file
Command to run the docker-compose.yml:
docker compose run <service_name>
In the above .yml file, faceclustering is the service name.
Note: the one extra step here is the top-level volumes: section, which declares the named volumes used by the service; Docker creates these volumes automatically the first time the compose file is run.