Beginner's Guide to LLMs


Build Large Language Models from Scratch - Analytics Vidhya 03/01/25, 2:23 PM


Build Large Language Models from Scratch

Aravind Pai
Last Updated: 11 Nov, 2024 | 15 min read

Be it X or LinkedIn, I encounter numerous posts about Large Language Models (LLMs) each day. I have often wondered why there’s such an incredible amount of research and development dedicated to these intriguing models. From ChatGPT to Gemini, Falcon, and countless others, their names swirl around, leaving me eager to uncover their true nature. These burning questions have lingered in my mind, fueling my curiosity. This insatiable curiosity has ignited a fire within me, propelling me to dive headfirst into the realm of LLMs.

Join me on an exhilarating journey as we discuss the current state of the art in LLMs for beginners. Together, we’ll unravel the secrets behind their development, comprehend their extraordinary capabilities, and shed light on how they have revolutionized the world of language processing.

In this article, you will learn how to build and train an LLM from scratch through a beginner-friendly tutorial. We’ll cover how LLMs are trained, including essential techniques for building an LLM effectively, and share tips from Analytics Vidhya. By the end, you’ll have the skills to create a large language model.

https://www.analyticsvidhya.com/blog/2023/07/beginners-guide-to-build-large-language-models-from-scratch/ Page 1 of 48

Learning Objectives

Learn about LLMs and their current state of the art.
Understand the different LLMs available and the approaches to training these LLMs from scratch.
Explore best practices to train and evaluate LLMs.

This article was published as a part of the Data Science Blogathon.

Table of contents

1. A Brief History of Large Language Models
2. What are Large Language Models?
3. Why Large Language Models?
4. Different Kinds of LLMs
5. What are the Challenges of Training LLMs for Beginners?
6. Understanding the Scaling Laws
7. How to Build an LLM from Scratch?

A Brief History of Large Language Models

The history of Large Language Models goes back to the 1960s. In 1966, Joseph Weizenbaum, a professor at MIT, built ELIZA, one of the first NLP programs designed to converse in natural language. It used pattern matching and substitution techniques to interact with humans, without any deeper understanding of what was said. Later, in 1970, a team at MIT built another NLP program called SHRDLU to understand and interact with humans.

In the late 1980s, the RNN architecture was introduced to capture the sequential information present in text data. However, RNNs worked well only with shorter sentences, not with long ones. Hence, LSTM was proposed in 1997. During this period, huge developments emerged in LSTM-based applications. Later on, research began on attention mechanisms as well.

Two Major Concerns With LSTM

LSTM solved the problem of long sentences to some extent, but it could not really excel when working with very long sentences. In addition, the training of LSTM models cannot be parallelized across time steps, so training these models took longer.

In 2017, there was a breakthrough in NLP research through the paper Attention Is All You Need. This paper revolutionized the entire NLP landscape. The researchers introduced a new architecture known as the Transformer to overcome the challenges with LSTMs. Transformers made it practical to train models containing a huge number of parameters, and they became the foundation of the first LLMs.

Transformers emerged as the state-of-the-art architecture for LLMs. Even today, the development of LLMs remains influenced by transformers.

Over the next five years, significant research focused on building better LLMs on top of the transformer architecture. The size of LLMs increased exponentially over time. Experiments showed that increasing the size of LLMs and their datasets improved the knowledge of LLMs. Hence, GPT variants like GPT-2, GPT-3, GPT-3.5, and GPT-4 were introduced with increases in the number of parameters and the size of the training datasets.

In 2022, there was another breakthrough in NLP: ChatGPT. ChatGPT is a dialogue-optimized LLM capable of answering a wide range of questions. A couple of months later, Google introduced Gemini (originally launched as Bard) as a competitor to ChatGPT.

In the last year, hundreds of Large Language Models have been developed. You can find a list of open-source LLMs along with their rankings on the Hugging Face Open LLM leaderboard. At the time of writing, the top-ranked open-source LLM was Falcon 40B Instruct.

What are Large Language Models?

Simply put, Large Language Models are deep learning models trained on huge datasets to understand human languages. Their core objective is to learn and understand human languages precisely. Large Language Models enable machines to interpret languages the way we, as humans, interpret them.

Large Language Models learn the patterns and relationships between the words in a language. For example, they learn the syntactic and semantic structure of the language, such as grammar, word order, and the meaning of words and phrases. In this way, they gain the capability to grasp the language as a whole.

But how exactly are language models different from Large Language Models? Both learn and understand human language; the primary difference lies in how these models are developed.

Language models are generally developed as statistical models using HMMs or other probabilistic methods, while Large Language Models are designed as deep learning models with billions of parameters trained on very large datasets.
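
To make the contrast concrete, a classical statistical language model can be as small as a table of word-pair counts. The sketch below is a toy bigram model; the corpus and the function names are made up for illustration and do not come from the article.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows another (a toy statistical LM)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn raw follow-counts into a probability distribution."""
    followers = counts[word.lower()]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Hypothetical three-sentence corpus, for illustration only.
corpus = ["how are you", "how are things", "how is it going"]
model = train_bigram(corpus)
probs = next_word_probs(model, "how")
print(probs)
```

An LLM replaces this lookup table with a neural network, which lets it assign sensible probabilities to word sequences it has never seen verbatim.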

Why Large Language Models?

The answer to this question is simple. LLMs are task-agnostic models: a single model can be applied to a wide range of tasks. ChatGPT is a classic example of this. Every time you ask ChatGPT something, it amazes you.

Here are the key points about Large Language Models:

1. Prompt the Model: Simply provide a well-crafted prompt; no fine-tuning needed.
2. Utilize Versatility: Use one model for various tasks and applications.
3. Receive Instant Solutions: Get quick and effective responses to your problems.
4. Understand Foundation Models: Recognize LLMs as foundational tools in Natural Language Processing (NLP).
5. Explore Features: Leverage their ability to generate human-like text and insights easily.

Different Kinds of LLMs

LLMs can be broadly classified into 2 types depending on their task:

Continuing the text
Dialogue optimized

Continuing the Text

These LLMs are trained to predict the next sequence of words in the input text. Their task at hand is to continue the text.

For example, given the text “How are you”, these LLMs might complete the sentence with “How are you doing?” or “How are you? I am fine.”

The LLMs falling under this category include GPT and its variants such as GPT-2, GPT-3, and GPT-4, as well as XLNet. BERT is a closely related transformer model, but it is trained to fill in masked words rather than to continue the text.

Now, the problem with these LLMs is that they are very good at completing text rather than answering. Sometimes, we expect an answer rather than a completion.

As discussed above, given “How are you?” as input, the LLM tries to complete the text with “doing?” or “I am fine.” The response can be either of them: a completion or an answer. This is exactly why dialogue-optimized LLMs were introduced.

Dialogue Optimized

These LLMs respond with an answer rather than completing the text. Given the input “How are you?”, these LLMs might respond with the answer “I am doing fine.” rather than completing the sentence.

Examples of dialogue-optimized LLMs include InstructGPT, ChatGPT, Gemini, and Falcon-40B-instruct.

Now, we will see the challenges involved in training LLMs from scratch.

What are the Challenges of Training LLMs for Beginners?

Training LLMs from scratch is really challenging because of 2 main factors: Infrastructure and Cost.

Infrastructure

LLMs are trained on a massive text corpus, at least on the order of 1,000 GB in size. The models trained on these datasets are very large, containing billions of parameters. In order to train such large models on a massive text corpus, we need to set up infrastructure supporting multiple GPUs. Can you guess the time taken to train GPT-3, a 175-billion-parameter model, on a single GPU?

It would take 288 years to train GPT-3 on a single NVIDIA Tesla V100 GPU.

This clearly shows that training an LLM on a single GPU is not practical. It requires distributed and parallel computing with thousands of GPUs.

Just to give you an idea, here is the hardware used for training popular LLMs:

1. Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.
2. Researchers calculated that OpenAI could have trained GPT-3 in as little as 34 days on 1,024 A100 GPUs.
3. PaLM (540B, Google): 6,144 TPU v4 chips used in total.
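
The figures above can be sanity-checked with a little arithmetic. The sketch below multiplies out Falcon-40B’s parallelism degrees and then estimates single-GPU training time for GPT-3 using the common rule of thumb that training takes roughly 6 x parameters x tokens floating-point operations. The 300 billion token count is the figure reported for GPT-3; the sustained throughput of 35 TFLOP/s per V100 is an assumption chosen for illustration, not a measured value.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def gpus_needed(tp, pp, dp):
    """3D parallelism: tensor x pipeline x data parallel degrees."""
    return tp * pp * dp

def training_years(params, tokens, flops_per_second):
    """Rule-of-thumb estimate: ~6 FLOPs per parameter per training token."""
    total_flops = 6 * params * tokens
    return total_flops / flops_per_second / SECONDS_PER_YEAR

falcon_gpus = gpus_needed(tp=8, pp=4, dp=12)

# Assumption: one V100 sustains ~35 TFLOP/s during training.
gpt3_years = training_years(params=175e9, tokens=300e9, flops_per_second=35e12)
print(falcon_gpus, round(gpt3_years))
```

Under these assumptions the estimate lands in the same range as the 288 years quoted above; a different throughput assumption shifts the result proportionally.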

Cost

It’s obvious from the above that GPU infrastructure is essential for training LLMs from scratch. Setting up infrastructure of this size is highly expensive. Companies and research institutions invest millions of dollars to set it up and train LLMs from scratch.

It is estimated that GPT-3 cost around $4.6 million to train from scratch.

On average, a 7B-parameter model would cost roughly $25,000 to train from scratch.

Now, we will see the scaling laws of LLMs.

Understanding the Scaling Laws

Recently, we have seen a trend of ever larger language models being developed. They are really large because of the scale of both the dataset and the model size.

When you are training LLMs from scratch, it’s really important to ask these questions prior to the experiment:

1. How much data do I need to train LLMs from scratch?
2. What should be the size of the model?

The answer to these questions lies in scaling laws.

Scaling laws determine how much data is optimal for training a model of a particular size.

In 2022, DeepMind proposed scaling laws for training LLMs with the optimal model size and dataset size (number of tokens) in the paper Training Compute-Optimal Large Language Models. These scaling laws are popularly known as the Chinchilla or Hoffmann scaling laws. They state that:

The number of tokens used to train an LLM should be about 20 times the number of parameters of the model.

For example, 1,400B (1.4T) tokens should be used to train a data-optimal LLM with 70B parameters. So, we need around 20 text tokens per parameter.
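
The 20-tokens-per-parameter rule is easy to apply in both directions. The following is a minimal sketch with hypothetical helper names; it treats the ratio as a fixed constant, which is a simplification of the fitted laws in the paper.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted above

def optimal_tokens(num_params):
    """Training tokens suggested for a model with num_params parameters."""
    return TOKENS_PER_PARAM * num_params

def optimal_params(num_tokens):
    """Model size suggested for a dataset of num_tokens tokens."""
    return num_tokens // TOKENS_PER_PARAM

tokens_for_70b = optimal_tokens(70_000_000_000)      # 1.4 trillion tokens
params_for_1t = optimal_params(1_000_000_000_000)    # 50 billion parameters
print(tokens_for_70b, params_for_1t)
```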

Next, we will see how to train LLMs from scratch.

How to Build an LLM from Scratch?

Step 1: Define Your Goal

Start by figuring out what you want your language model to do. Do you want it to answer questions, generate text, or chat like a human? Knowing your goal will help you make better choices later.

Step 2: Choose a Model Design

Most modern language models use something called the transformer architecture. This design helps the model understand the relationships between words in a sentence. You can build your model using programming tools like PyTorch or TensorFlow.

Key Parts of a Transformer

1. Encoder: This part reads and understands the input text.
2. Decoder: This part generates the output text.
3. Self-Attention: This mechanism helps the model focus on important words in a sentence.
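
Self-attention can be sketched in a few lines. The example below implements single-head scaled dot-product attention on plain Python lists so that it runs without any libraries. As a simplifying assumption, the same toy vectors are used as queries, keys, and values; a real transformer first passes the input through learned projection matrices, and would use PyTorch or TensorFlow tensors instead of lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain Python lists."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this position's query to every position's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
print(weights[0])
```

Each row of `weights` sums to 1 and says how much one token draws on every other token, which is what "focusing on important words" means in practice.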

Step 3: Gather Your Data

You need a lot of text data to train your model. This data should be relevant to what you want the model to do. For example, if you want it to write stories, gather a variety of stories.

Tips for Good Data

Relevance: Make sure the data is related to your goal.
Variety: Use different types of text to help the model learn better.
Quantity: The more data you have, the better your model can perform.

Step 4: Train Your Model

Training is the process of teaching your model using the data you collected. This can take a lot of time and computer power.

Training Tips

Batch Training: Train the model using small groups of data at a time to save memory.
Adjust Settings: Experiment with different settings (like learning rates) to see what works best.
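
The two tips above can be sketched as a batching helper wrapped in a small settings sweep. This is only a skeleton under stated assumptions: the learning-rate values are placeholders rather than recommendations, and the training step itself is left as a comment because it depends on the framework you choose.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# Hypothetical search space; the values are placeholders, not recommendations.
learning_rates = [1e-3, 3e-4, 1e-4]
examples = list(range(10))  # stand-in for tokenized training examples

for lr in learning_rates:
    for batch in iter_batches(examples, batch_size=4):
        pass  # a real loop would run one optimizer step on `batch` with `lr`

batches = list(iter_batches(examples, batch_size=4))
print(batches)
```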

How Do You Train LLMs from Scratch?

The training process differs depending on the kind of LLM you want to build, whether it continues the text or is dialogue optimized. The performance of LLMs mainly depends on 2 factors: Dataset and Model Architecture. These 2 are the key driving factors behind the performance of LLMs.

Let’s now discuss the different steps involved in training LLMs.

Continuing the Text

The training process for LLMs that continue the text is known as pretraining. These LLMs use self-supervised learning to predict the next word in the text. We will now see the different steps involved in training LLMs from scratch.

a. Dataset Collection

The first step in training LLMs is collecting a massive corpus of text data. The dataset plays the most significant role in the performance of LLMs. For instance, OpenChat is a recent dialogue-optimized large language model based on LLaMA-13B. It achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation. Do you know the reason behind its success? It’s high-quality data: it was fine-tuned on only ~6K examples.

The training data is created by scraping the internet: websites, social media platforms, academic sources, etc. Make sure that the training data is as diverse as possible.

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models

What does it say?

You might have come across headlines like “ChatGPT failed at JEE” or “ChatGPT fails to clear the UPSC” and so on. What can be the possible reasons? A likely reason is that the model lacked the necessary knowledge, which depends heavily on the dataset used for training. Hence, the demand for diverse datasets continues to rise, as a high-quality cross-domain dataset has a direct impact on model generalization across different tasks.

Unlock the potential of LLMs with high-quality data!

Previously, Common Crawl was the go-to dataset for training LLMs. Common Crawl contains raw web page data, extracted metadata, and text extractions collected since 2008. The size of the dataset is in petabytes (1 petabyte = 1e6 GB). Large Language Models trained on this dataset have been shown to achieve effective results but fail to generalize well across other tasks. Hence, a new dataset called the Pile was created from 22 diverse high-quality datasets. It’s a combination of existing data sources and new datasets, about 825 GB in total. More recently, a refined version of Common Crawl was released under the name RefinedWeb.

Note: The datasets used for GPT-3 and GPT-4 have not been open-sourced in order to maintain a competitive advantage over the others.

b. Dataset Preprocessing

The next step is to preprocess and clean the dataset. Since the dataset is crawled from multiple web pages and different sources, it often contains noise and inconsistencies. We must eliminate these and prepare a high-quality dataset for model training.

The specific preprocessing steps depend on the dataset you are working with. Some of the common preprocessing steps include removing HTML code, fixing spelling mistakes, eliminating toxic/biased data, converting emoji into their text equivalents, and data deduplication. Data deduplication is one of the most significant preprocessing steps when training LLMs. It refers to the process of removing duplicate content from the training corpus.
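
As an illustration of the simplest of these steps, the sketch below strips HTML tags, decodes HTML entities, and normalizes whitespace using only the Python standard library. The regex-based tag removal is a deliberate simplification for this example; a production pipeline would more likely use a proper HTML parser, and steps such as toxicity filtering or spelling correction need dedicated tools that are not shown here.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")

def clean_text(raw):
    """Strip HTML tags, decode entities, and normalize whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(no_tags)
    return WHITESPACE_RE.sub(" ", decoded).strip()

# Hypothetical scraped snippet, for illustration only.
sample = "<p>LLMs are <b>large</b>   models &amp; need clean data.</p>"
cleaned = clean_text(sample)
print(cleaned)
```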

It’s obvious that the training data may contain duplicate or nearly identical sentences because we collect it from various data sources. We need data deduplication for 2 primary reasons:

1. It helps the model avoid memorizing the same data again and again.
2. It helps us evaluate LLMs better because the training and test data then contain non-duplicated information. If they contain duplicated information, there is a high chance that information seen in the training set is reproduced as output on the test set. As a result, the reported numbers may not be reliable.

You can read more about data deduplication techniques in the paper Deduplicating Training Data Makes Language Models Better.
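
A minimal way to picture deduplication is exact matching on normalized text, with a similarity score as a first signal for near-duplicates. The sketch below is an assumption-laden toy, and it is simpler than the techniques described in the paper mentioned above; the function names and the sample documents are illustrative only.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def jaccard(a, b):
    """Word-level Jaccard similarity, a simple signal for near-duplicates."""
    set_a, set_b = set(normalize(a).split()), set(normalize(b).split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

docs = ["DHS stands for DataHack Summit.",
        "dhs stands  for datahack summit.",
        "I am a DHS Chatbot."]
unique_docs = deduplicate(docs)
print(unique_docs)
```

The same check can be run between the training and test splits to make sure evaluation examples have not leaked into the training data.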

c. Dataset Preparation

During the pretraining phase, the next step involves creating the input and output pairs for training the model. LLMs are trained to predict the next token in the text, so the input and output pairs are generated accordingly. While this demonstration considers each word as a token for simplicity, in practice, tokenization algorithms like Byte Pair Encoding (BPE) further break down each word into subwords. The model is then trained on the tokens of the input and output pairs.

For example, let’s take a simple corpus:

Example 1: I am a DHS Chatbot.
Example 2: DHS stands for DataHack Summit.
Example 3: I can provide you with information about DHS.

In the case of example 1, we can create the input-output pairs as shown below:

[Figure: input-output pairs for example 1]

Similarly, in the case of example 2, the following is a list of input and output pairs:

[Figure: input-output pairs for example 2]

Each input and output pair is passed on to the model for training.
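
The pair construction can be sketched in code. Following the simplification used in this section, the example treats each whitespace-separated word as a token; the function name is hypothetical, and a real pipeline would operate on subword token IDs instead.

```python
def next_token_pairs(sentence):
    """Build (input, output) pairs where the output is the next word."""
    tokens = sentence.split()
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("I am a DHS Chatbot.")
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

Each growing prefix of the sentence becomes an input, and the word that follows it becomes the output the model learns to predict.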

Now, what next? Let’s define the model architecture.

d. Model Architecture

The next step is to define the model architecture and train the LLM.

As of today, a huge number of LLMs are being developed. You can get an overview of different LLMs on the Hugging Face Open LLM leaderboard. Researchers follow a standard process while building LLMs. Most researchers start with an existing Large Language Model architecture like GPT-3, along with the actual hyperparameters of the model, and then tweak the model architecture, hyperparameters, or dataset to come up with a new LLM.

For example, Falcon is a state-of-the-art LLM. At the time of writing, it ranked first on the open-source LLM leaderboard. Falcon is inspired by the GPT-3 architecture with a couple of tweaks.

e. Hyperparameter Search

Hyperparameter tuning is a very expensive process in terms of both time and cost. Just imagine running this experiment for a billion-parameter model. It’s not feasible, right? Hence, the ideal approach is to start from the hyperparameters of current research work (for example, use the hyperparameters of GPT-3 when working with the corresponding architecture), then find the optimal hyperparameters at a small scale and extrapolate them to the final model.

The experiments can involve any or all of the following: weight initialization, positional embeddings, optimizer, activation, learning rate, weight decay, loss function, sequence length, number of layers, number of attention heads, number of parameters, dense vs. sparse layers, batch size, and dropout.

Let’s now discuss the best practices for popular hyperparameters:

Batch size: Ideally, choose the largest batch size that fits in GPU memory.
Learning Rate Scheduler: A better approach is to decrease the learning rate as training progresses. A larger learning rate early on helps the model move past poor local minima, and a smaller rate later improves model stability. Some of the commonly used learning rate schedulers are Step Decay and Exponential Decay.
Weight Initialization: Model convergence depends highly on the weights initialized before training. Initializing the weights properly leads to faster convergence. A commonly used weight initialization for transformers is T-Fixup. Use weight initialization techniques only when you are defining your own LLM architecture.
Regularization: It’s observed that LLMs are prone to overfitting. Hence, it’s necessary to use techniques like Batch Normalization, Dropout, and L1/L2 regularization that help the model overcome overfitting.
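
The two schedulers named above can be written as small functions. This is a sketch with illustrative default values (the drop factor, interval, and decay rate are assumptions, not tuned settings); deep learning frameworks ship their own scheduler classes, which you would normally use instead.

```python
import math

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(initial_lr, epoch, decay_rate=0.05):
    """Shrink the learning rate smoothly: lr = lr0 * exp(-decay_rate * epoch)."""
    return initial_lr * math.exp(-decay_rate * epoch)

# Compare both schedules at a few epochs, starting from lr = 0.1.
schedule = [(e, step_decay(0.1, e), exponential_decay(0.1, e))
            for e in range(0, 30, 10)]
for epoch, step_lr, exp_lr in schedule:
    print(epoch, step_lr, round(exp_lr, 5))
```

Step decay keeps the rate constant between drops, while exponential decay lowers it a little at every epoch.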

Dialogue-optimized LLMs

Dialogue-optimized Large Language Models (LLMs) begin their journey with a pretraining phase, similar to other LLMs. After pretraining, these models are capable of text completion. To generate specific answers to questions, these LLMs undergo fine-tuning on a supervised dataset comprising question-answer pairs. This process equips the model with the ability to generate answers to specific questions.
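
One way to picture the supervised dataset is as question-answer pairs rendered into a fixed text template. The template below is hypothetical; every project defines its own prompt format, and the field names are chosen only for this illustration.

```python
# Hypothetical prompt template; real projects each define their own format.
TEMPLATE = "Question: {question}\nAnswer: "

def build_sft_example(question, answer):
    """Return the prompt, the target answer, and the full training text."""
    prompt = TEMPLATE.format(question=question)
    # In fine-tuning, the loss is often computed only on the answer part.
    return {"prompt": prompt, "completion": answer, "text": prompt + answer}

qa_pairs = [("What does DHS stand for?", "DHS stands for DataHack Summit.")]
sft_dataset = [build_sft_example(q, a) for q, a in qa_pairs]
print(sft_dataset[0]["text"])
```

Because the model has already learned to continue text during pretraining, training it on many such examples teaches it that the natural continuation of a question is its answer.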

ChatGPT, a dialogue-optimized LLM, follows a similar training method. However, after pretraining and supervised fine-tuning, it incorporates an additional step known as Reinforcement Learning from Human Feedback (RLHF).

Interestingly, a recent paper titled “LIMA: Less Is More for Alignment” suggests that RLHF might not be necessary. The paper posits that pretraining on a large dataset and supervised fine-tuning on high-quality data (around 1,000 examples) can suffice.

At the time of writing, OpenChat was one of the latest dialogue-optimized LLMs, based on LLaMA-13B. After fine-tuning on just 6K high-quality examples, it achieves 105.7% of ChatGPT’s score on the Vicuna GPT-4 evaluation. This achievement underscores the potential of optimizing training methods and resources in the development of dialogue-optimized LLMs.

How Do You Evaluate LLMs?

The evaluation of LLMs cannot be subjective. It has to be a logical process to evaluate the performance of LLMs.

In the case of classification or regression problems, we have the true labels and the predicted labels, and we compare the two to understand how well the model is performing. For classification, we look at the confusion matrix, right? But what about large language models? They just generate text.

There are 2 ways to evaluate LLMs: intrinsic and extrinsic methods.

Intrinsic Methods

Researchers evaluated traditional language models using intrinsic methods like perplexity, bits per character, etc. These metrics track performance on the language front, i.e., how well the model is able to predict the next word.
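
Both metrics follow directly from the probabilities a model assigns to the tokens that actually occur. The sketch below assumes those probabilities are already available as a list; the numbers are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def bits_per_token(token_probs):
    """Average negative log2-probability; over characters, this is bits per character."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# A model guessing uniformly among 4 options vs. a more confident model.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
print(uniform_guess, confident)
```

Lower is better for both: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 candidates for each next token.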

Extrinsic Methods

With the advancements in LLMs today, researchers and practitioners prefer using extrinsic methods to evaluate their performance. The recommended way to evaluate LLMs is to look at how well they perform at different tasks like problem-solving, reasoning, mathematics, computer science, and competitive exams like MIT, JEE, etc.

EleutherAI released a framework called the Language Model Evaluation Harness to compare and evaluate the performance of LLMs. Hugging Face integrated this evaluation framework to evaluate open-source LLMs developed by the community.

As used on the leaderboard, the framework evaluates LLMs across 4 different datasets. The final score is an aggregation of the scores from each dataset.

AI2 Reasoning Challenge: A collection of science questions designed for elementary school students.
HellaSwag: A test that challenges state-of-the-art models to make common-sense inferences, which are relatively easy for humans (about 95% accuracy).
MMLU: A comprehensive test that evaluates the multitask accuracy of a text model. It includes 57 different tasks covering subjects like basic math, U.S. history, computer science, law, and more.
TruthfulQA: A test specifically created to assess a model’s tendency to generate accurate answers and avoid reproducing false information commonly found online.


Conclusion

Large Language Models (LLMs) have


revolutionized the field of machine

learning. They have a wide range of

applications, from continuing text to

creating dialogue-optimized models.


Libraries like TensorFlow and PyTorch

have made it easier to build and train

these models.

However, training LLMs is not without


its challenges. It requires substantial

infrastructure and can be costly.

Understanding the scaling laws is


crucial to optimize the training process


and manage costs effectively. Despite

these challenges, the benefits of LLMs,

such as their ability to understand and


generate human-like text, make them a

valuable tool in today’s data-driven

world.
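
To make the link between scaling laws and cost concrete, here is a back-of-the-envelope sketch. It assumes two widely quoted rules of thumb that are not derived in this section: roughly 20 training tokens per parameter (the Chinchilla heuristic) and roughly 6 floating-point operations per parameter per token. The function names are hypothetical, and real budgets depend on hardware, utilization, and architecture.

```python
def optimal_training_tokens(num_params, tokens_per_param=20):
    """Estimate a compute-optimal dataset size.

    Assumption: about 20 tokens per parameter (rule of thumb).
    """
    return num_params * tokens_per_param


def training_flops(num_params, num_tokens, flops_per_param_token=6):
    """Estimate total training compute.

    Assumption: about 6 FLOPs per parameter per token (rule of thumb).
    """
    return flops_per_param_token * num_params * num_tokens


# Example: a model with 70 billion parameters.
params = 70_000_000_000
tokens = optimal_training_tokens(params)
flops = training_flops(params, tokens)

print(f"tokens: {tokens:.2e}")  # tokens: 1.40e+12
print(f"FLOPs:  {flops:.2e}")   # FLOPs:  5.88e+23
```

Estimates like these help decide, before any training starts, whether a given budget is better spent on a larger model or on more data.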

The process of training an LLM involves


feeding the model with a large dataset

and adjusting the model’s parameters to

minimize the difference between its

predictions and the actual data.


Typically, developers achieve this by

using a decoder in the transformer

architecture of the model.

Evaluating the performance of LLMs is


as important as training them. It helps

us understand how well the model has

learned from the training data and how

well it can generalize to new data.

LLMs have opened up new possibilities


in the field of machine learning. They

are a testament to how far we’ve come

since the early days of AI and a glimpse


into what the future might hold. As we


continue to explore and push the

boundaries of what’s possible with

LLMs, who knows what incredible


discoveries we’ll make next?

I hope you liked this article on how to train a large language model (LLM) from scratch, which covered the essential steps and techniques for building effective LLMs and optimizing their performance.

Key Takeaways

Large Language Models (LLMs) like


GPT-3, Falcon, and others have

revolutionized natural language

processing by enabling machines to

understand and generate human-like text.

Training LLMs from scratch involves

collecting massive datasets,

preprocessing, defining model

architecture, hyperparameter
tuning, and evaluation.

Challenges in training LLMs include


infrastructure requirements, such as

powerful GPUs and substantial

costs, as well as understanding


scaling laws to optimize model size

and dataset size.

One can evaluate LLMs through

intrinsic methods like perplexity and

extrinsic methods like evaluating


task-specific performance on

datasets such as AI2 Reasoning

Challenge and TruthfulQA.

The media shown in this article is not


owned by Analytics Vidhya and is

used at the Author’s discretion.

Aravind Pai

Aravind Pai is passionate about building

data-driven products for the sports


domain. He strongly believes that

Sports Analytics is a Game Changer.


Responses From
Readers


Akshat

Your blog is very helpful and informative. Thanks For Sharing With Us.

Dipak Khatri

This is the most detailed explanation to the audience. I learn many things.

Vinayak

Your blog explained very systematic manner and it's very informative.


Frequently Asked
Questions

Q1. What is a large language model?

A. A large language model is a type of artificial intelligence that can understand and generate human-like text. It typically trains on vast amounts of text data and learns to predict and generate coherent sentences based on the input it receives.

Q2. What is the difference between NLP and large language models?


Q4. What is the difference between LLM and AI?
