Beginner's Guide to LLMs

Uploaded by

amioyiitd


Build Large Language Models from Scratch - Analytics Vidhya 03/01/25, 2:23 PM


Build Large Language Models from Scratch

Aravind Pai
Last Updated: 11 Nov, 2024
15 min read

Be it X or LinkedIn, I encounter numerous posts about Large Language Models (LLMs) each day. I often wondered why such an incredible amount of research and development is dedicated to these intriguing models. From ChatGPT to Gemini, Falcon, and countless others, their names swirl around, leaving me eager to uncover their true nature. These questions lingered in my mind and fueled my curiosity, propelling me to dive headfirst into the realm of LLMs.

Join me on this journey as we discuss the current state of the art in LLMs. Together, we’ll unravel the secrets behind their development, comprehend their extraordinary capabilities, and shed light on how they have revolutionized the world of language processing.

In this article, you will learn how to build an LLM from scratch through a beginner-friendly tutorial. We’ll cover how LLMs are trained and share tips from Analytics Vidhya. By the end, you’ll have the skills to create a large language model.

https://www.analyticsvidhya.com/blog/2023/07/beginners-guide-to-build-large-language-models-from-scratch/ Page 1 of 48


In this article, you will gain an understanding of how to train a large language model (LLM) from scratch, including essential techniques for building an LLM effectively.

Learning Objectives

Learn about LLMs and their current state of the art.
Understand the different LLMs available and the approaches to training these LLMs from scratch.
Explore best practices to train and evaluate LLMs.

This article was published as a part of the Data Science Blogathon.


Table of contents

1. A Brief History of Large Language Models
2. What are Large Language Models?
3. Why Large Language Models?
4. Different Kinds of LLMs
5. What are the Challenges of Training LLMs for Beginners?
6. Understanding the Scaling Laws
7. How to Build an LLM from Scratch?

A Brief History of Large Language Models

The history of Large Language Models goes back to the 1960s. In 1966, Joseph Weizenbaum, a professor at MIT, built ELIZA, one of the first NLP programs designed to handle natural language. It used pattern matching and substitution techniques to respond to and interact with humans. Later, in 1970, another NLP program called SHRDLU was built at MIT to understand and interact with humans.

In the 1980s, the recurrent neural network (RNN) architecture was introduced to capture the sequential information present in text data. However, RNNs worked well only with shorter sentences, not with long ones. Hence, LSTM was proposed in 1997. During this period, huge developments emerged in LSTM-based applications. Later on, research began on attention mechanisms as well.

Two Major Concerns With LSTM

LSTM solved the problem of long sentences to some extent, but it still could not excel on really long sentences. In addition, the training of LSTM models cannot be parallelized, because each step depends on the previous one. As a result, training these models took a long time.

In 2017, there was a breakthrough in NLP research with the paper Attention Is All You Need. This paper revolutionized the entire NLP landscape. The researchers introduced a new architecture known as the Transformer to overcome the challenges with LSTMs. Transformers became the foundation for the first LLMs, models containing a huge number of parameters, and emerged as the state-of-the-art architecture for LLMs. Even today, the development of LLMs remains influenced by transformers.

Over the next five years, significant research focused on building better LLMs on top of the original Transformer. The size of LLMs increased exponentially over time. Experiments showed that increasing the size of LLMs and their datasets improved the knowledge of LLMs. Hence, GPT variants like GPT-2, GPT-3, GPT-3.5, and GPT-4 were introduced, with increases in the number of parameters and the size of the training datasets.

In 2022, there was another breakthrough in NLP: ChatGPT. ChatGPT is a dialogue-optimized LLM that can answer a very wide range of questions. Within a couple of months, Google introduced Bard (since renamed Gemini) as a competitor to ChatGPT.

In the last year, hundreds of Large Language Models have been developed. You can find a list of open-source LLMs, along with their rankings, on the Hugging Face Open LLM leaderboard. When this article was first written, the top-ranked LLM there was Falcon 40B Instruct.

What are Large Language Models?

Simply put, Large Language Models are deep learning models trained on huge datasets to understand human languages. Their core objective is to learn and understand human languages precisely. Large Language Models enable machines to interpret languages just the way we, as humans, interpret them.

Large Language Models learn the patterns and relationships between the words in a language. For example, they learn the syntactic and semantic structure of the language, such as grammar, word order, and the meaning of words and phrases. In this way, they gain the capability to grasp the language as a whole.

But how exactly are language models different from Large Language Models? Both learn and understand human language; the primary difference lies in how these models are developed.

Language models are generally developed as statistical models using HMMs or other probabilistic methods, while Large Language Models are designed as deep learning models with billions of parameters trained on very large datasets.

Why Large Language Models?

The answer to this question is simple: LLMs are task-agnostic models. These models can be applied to almost any language task. ChatGPT is a classic example of this. Every time you ask ChatGPT something, it amazes you.

Here is what makes Large Language Models so useful:

1. Prompt the Model: Simply provide a well-crafted prompt; no fine-tuning needed.
2. Utilize Versatility: Use one model for various tasks and applications.
3. Receive Instant Solutions: Get quick and effective responses to your problems.
4. Understand Foundation Models: Recognize LLMs as foundational tools in Natural Language Processing (NLP).
5. Explore Features: Leverage their ability to generate human-like text and insights easily.

Different Kinds of LLMs

LLMs can be broadly classified into 2 types depending on their task:

Continuing the text
Dialogue optimized

Continuing the Text

These LLMs are trained to predict the next sequence of words in the input text. Their task at hand is to continue the text.

For example, given the text “How are you”, these LLMs might complete the sentence with “How are you doing?” or “How are you? I am fine.”

LLMs falling under this category include Transformers, BERT, XLNet, GPT, and its variants like GPT-2, GPT-3, GPT-4, etc. Note that BERT is trained to fill in masked words rather than to predict the next word, so it belongs here only in the sense of being a pretrained model that is not dialogue optimized.

Now, the problem with these LLMs is that they are very good at completing text rather than answering questions. Sometimes, we expect an answer rather than a completion.

As discussed above, given “How are you?” as input, such an LLM tries to complete the text with “doing?” or “I am fine.” The response can be either of them: a completion or an answer. This is exactly why dialogue-optimized LLMs were introduced.

Dialogue Optimized

These LLMs respond with an answer rather than a completion. Given the input “How are you?”, these LLMs might respond with the answer “I am doing fine.” rather than completing the sentence.

Dialogue-optimized LLMs include InstructGPT, ChatGPT, Gemini, Falcon-40B-instruct, etc.

Now, we will see the challenges involved in training LLMs from scratch.

What are the Challenges of Training LLMs for Beginners?

Training LLMs from scratch is really challenging because of 2 main factors: Infrastructure and Cost.

Infrastructure

LLMs are trained on a massive text corpus, at least 1000 GB in size. The models trained on these datasets are very large, containing billions of parameters. To train such large models on a massive text corpus, we need to set up infrastructure (hardware) supporting multiple GPUs. Can you guess the time taken to train GPT-3, a 175-billion-parameter model, on a single GPU?

“It would take 288 years to train GPT-3 on a single NVIDIA Tesla V100 GPU.”

This clearly shows that training an LLM on a single GPU is not practical. It requires distributed and parallel computing with thousands of GPUs.

Just to give you an idea, here is the hardware used for training popular LLMs:

1. Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.
2. Researchers calculated that OpenAI could have trained GPT-3 in as little as 34 days on 1,024 A100 GPUs.
3. PaLM (540B, Google): 6144 TPU v4 chips used in total.
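The figures in the list above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the helper names are made up for this example, and it assumes the three parallelism degrees simply multiply and that GPU time can be compared as plain GPU-days, ignoring differences in hardware efficiency.

```python
# Figures are taken from the list above; the helper names are illustrative.

def total_gpus(tp: int, pp: int, dp: int) -> int:
    """GPUs needed when tensor, pipeline, and data parallelism are combined."""
    return tp * pp * dp


def gpu_days(num_gpus: int, days: float) -> float:
    """Total accelerator time, ignoring any efficiency differences."""
    return num_gpus * days


# Falcon-40B: TP=8, PP=4, DP=12
falcon_gpus = total_gpus(tp=8, pp=4, dp=12)

# GPT-3 estimate: 34 days on 1,024 A100 GPUs
gpt3_a100_days = gpu_days(num_gpus=1024, days=34)
gpt3_a100_years = gpt3_a100_days / 365

print(falcon_gpus)                # 384
print(gpt3_a100_days)             # 34816
print(round(gpt3_a100_years, 1))  # 95.4
```

The product 8 x 4 x 12 matches the 384 GPUs quoted for Falcon-40B, and the GPT-3 estimate works out to roughly 95 years of single-A100 time, which makes it clear why thousands of GPUs are used in parallel.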

Cost

It’s very obvious from the above that GPU infrastructure is essential for training LLMs from scratch. Setting up infrastructure of this size is highly expensive. Companies and research institutions invest millions of dollars to set it up and train LLMs from scratch.

“It is estimated that GPT-3 cost around $4.6 million to train from scratch.”

On average, a 7B-parameter model would cost roughly $25,000 to train from scratch.

Now, we will see the scaling laws of LLMs.

Understanding the Scaling Laws

Recently, we have seen a trend of ever larger language models being developed. They are really large because of the scale of both the dataset and the model.

When you are training LLMs from scratch, it’s really important to ask these questions before the experiment:

1. How much data do I need to train LLMs from scratch?
2. What should be the size of the model?

The answer to these questions lies in scaling laws.

Scaling laws determine how much data is optimal for training a model of a particular size.

In 2022, DeepMind proposed scaling laws for training LLMs with the optimal model size and dataset size (number of tokens) in the paper Training Compute-Optimal Large Language Models. These scaling laws are popularly known as the Chinchilla or Hoffmann scaling laws. They state that:

“The number of tokens used to train an LLM should be about 20 times the number of parameters of the model.”

For example, 1,400B (1.4T) tokens should be used to train a data-optimal LLM with 70B parameters. So, we need around 20 text tokens per parameter.
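The rule of thumb above is easy to turn into a quick calculator. This is a minimal sketch that assumes the ratio is exactly 20 tokens per parameter, as quoted in the text; the function and constant names are chosen only for this example.

```python
TOKENS_PER_PARAMETER = 20  # rule of thumb quoted above, treated as exact here


def compute_optimal_tokens(num_parameters: int) -> int:
    """Training tokens suggested for a model with the given parameter count."""
    return TOKENS_PER_PARAMETER * num_parameters


def compute_optimal_parameters(num_tokens: int) -> int:
    """Model size suggested for a dataset with the given token count."""
    return num_tokens // TOKENS_PER_PARAMETER


print(compute_optimal_tokens(70_000_000_000))          # 1400000000000
print(compute_optimal_parameters(1_400_000_000_000))   # 70000000000
```

The first call reproduces the 70B-parameter, 1.4T-token example from the text. The second runs the rule in reverse, which is useful when the dataset size is fixed and the model size is the open question.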

Next, we will see how to train LLMs from scratch.

How to Build an LLM from Scratch?

Step 1: Define Your Goal

Start by figuring out what you want your language model to do. Do you want it to answer questions, generate text, or chat like a human? Knowing your goal will help you make better choices later.

Step 2: Choose a Model Design

Most modern language models use something called the transformer architecture. This design helps the model understand the relationships between words in a sentence. You can build your model using programming tools like PyTorch or TensorFlow.

Key Parts of a Transformer

1. Encoder: This part reads and understands the input text.
2. Decoder: This part generates the output text.
3. Self-Attention: This mechanism helps the model focus on important words in a sentence.
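To make the self-attention idea above more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the toy vectors are invented, the same vectors are reused as queries, keys, and values, and the learned projection matrices and multiple heads of a real transformer are left out. In practice you would use the attention layers provided by PyTorch or TensorFlow.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Scaled dot-product attention: each position gets a weighted mix of all values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "token" vectors (made up for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence. The first token gives equal weight to itself and to the third token, because both overlap with it, and less weight to the second.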

Step 3: Gather Your Data

You need a lot of text data to train your model. This data should be relevant to what you want the model to do. For example, if you want it to write stories, gather a variety of stories.

Tips for Good Data

Relevance: Make sure the data is related to your goal.
Variety: Use different types of text to help the model learn better.
Quantity: The more data you have, the better your model can perform.

Step 4: Train Your Model

Training is the process of teaching your model using the data you collected. This can take a lot of time and computer power.

Training Tips

Batch Training: Train the model using small groups of data at a time to save memory.
Adjust Settings: Experiment with different settings (like learning rates) to see what works best.
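The two training tips above can be illustrated with a small, self-contained sketch. Everything here is a toy assumption: the "dataset" is a list of placeholder strings, and the learning-rate comparison uses plain gradient descent on the simple loss (w - 3)^2 rather than a real language model.

```python
def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def final_loss(learning_rate, steps=20, start=0.0, target=3.0):
    """Run plain gradient descent on the toy loss (w - target)**2."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - target)
        w -= learning_rate * gradient
    return (w - target) ** 2


# Batch training: 10 examples in batches of 4.
examples = [f"sentence {i}" for i in range(10)]
batch_sizes = [len(batch) for batch in iterate_batches(examples, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]

# Adjusting settings: try a few learning rates and keep the best one.
candidate_rates = [0.01, 0.1, 1.1]
losses = {lr: final_loss(lr) for lr in candidate_rates}
best_rate = min(losses, key=losses.get)
print(best_rate)  # 0.1
```

In this toy setting the smallest rate learns too slowly, the largest one diverges, and the middle one wins. Real training runs show the same pattern, which is why trying several settings is worthwhile.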

How Do You Train LLMs from Scratch?

The training process depends on the kind of LLM you want to build, whether it continues text or is dialogue optimized. The performance of LLMs mainly depends on 2 factors: Dataset and Model Architecture. These 2 are the key driving factors behind the performance of LLMs.

Let’s now discuss the different steps involved in training LLMs.

Continuing the Text Tutorial

The training process for LLMs that continue the text is known as pretraining. These LLMs use self-supervised learning to predict the next word in the text. We will now walk through the different steps involved in training LLMs from scratch.

a. Dataset Collection

The first step in training LLMs is collecting a massive corpus of text data. The dataset plays the most significant role in the performance of LLMs. For instance, OpenChat is a recent dialogue-optimized large language model inspired by LLaMA-13B. It achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation. Do you know the reason behind its success? It’s high-quality data: it has been fine-tuned on only ~6K examples.

The training data is created by scraping the internet: websites, social media platforms, academic sources, etc. Make sure that the training data is as diverse as possible.

“Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models.”

What does it say?

You might have come across headlines like “ChatGPT failed at JEE” or “ChatGPT fails to clear the UPSC”. What could be the reasons? One likely reason is that the model lacked the knowledge and reasoning ability those exams require, which depends heavily on the dataset used for training. Hence, the demand for diverse datasets continues to rise, as a high-quality cross-domain dataset has a direct impact on how well the model generalizes across different tasks.

“Unlock the potential of LLMs with high-quality data!”

Previously, Common Crawl was the go-to dataset for training LLMs. Common Crawl contains raw web page data, extracted metadata, and text extractions collected since 2008. The size of the dataset is in petabytes (1 petabyte = 1e6 GB). Large Language Models trained on this dataset achieved effective results but failed to generalize well across other tasks. Hence, a new dataset called the Pile was created from 22 diverse high-quality datasets. It’s a combination of existing data sources and new datasets, about 825 GB in total. More recently, a refined version of Common Crawl was released under the name RefinedWeb.

Note: The datasets used for GPT-3 and GPT-4 have not been open-sourced in order to maintain a competitive advantage over the others.

b. Dataset Preprocessing

The next step is to preprocess and clean the dataset. Since the dataset is crawled from multiple web pages and different sources, it often contains noise and inconsistencies. We must eliminate these and prepare a high-quality dataset for model training.

The specific preprocessing steps depend on the dataset you are working with. Common preprocessing steps include removing HTML code, fixing spelling mistakes, eliminating toxic or biased data, converting emoji into their text equivalents, and data deduplication. Data deduplication is one of the most significant preprocessing steps when training LLMs. It refers to the process of removing duplicate content from the training corpus.
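As a rough illustration of the simplest of these steps, the sketch below strips HTML tags, decodes HTML entities, and collapses whitespace using only the Python standard library. The function name and sample string are invented for this example, and the regular expression is a crude stand-in: a production pipeline would normally rely on a proper HTML parser and would add the other steps listed above.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")      # crude match for HTML tags
WHITESPACE_PATTERN = re.compile(r"\s+")   # any run of whitespace


def clean_text(raw: str) -> str:
    """Remove tags, decode entities such as &amp;, and normalize spacing."""
    without_tags = TAG_PATTERN.sub(" ", raw)
    unescaped = html.unescape(without_tags)
    return WHITESPACE_PATTERN.sub(" ", unescaped).strip()


sample = "<p>LLMs are <b>large</b> &amp; data-hungry.</p>"
print(clean_text(sample))  # LLMs are large & data-hungry.
```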

The training data may contain duplicate or nearly identical sentences because we collect it from various data sources. We need data deduplication for 2 primary reasons:

1. It helps the model avoid memorizing the same data again and again.
2. It helps us evaluate LLMs better, because the training and test data then contain non-duplicated information. If the test data duplicates training data, there is a very high chance that information the model has seen in the training set is reproduced as output on the test set. As a result, the reported numbers may not reflect true performance.

You can read more about data deduplication techniques in the paper Deduplicating Training Data Makes Language Models Better.
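A minimal sketch of the idea, assuming the simplest possible case: two documents count as duplicates if they are identical after lowercasing and whitespace normalization. The function names and sample corpus are made up for this example, and this is not the method from the paper mentioned above. Catching nearly identical text requires fuzzier matching than an exact hash.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, judged by its normalized hash."""
    seen = set()
    unique = []
    for doc in documents:
        fingerprint = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(doc)
    return unique


corpus = [
    "DHS stands for DataHack Summit.",
    "dhs stands for   datahack summit.",  # same text, different case and spacing
    "I am a DHS Chatbot.",
]
unique_docs = deduplicate(corpus)
print(len(unique_docs))  # 2
```

Storing hashes instead of full documents keeps memory use low, which matters when the corpus holds billions of lines.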

c. Dataset Preparation

During the pretraining phase, the next step is to create the input and output pairs for training the model. LLMs are trained to predict the next token in the text, so the input and output pairs are generated accordingly. While this demonstration treats each word as a token for simplicity, in practice tokenization algorithms like Byte Pair Encoding (BPE) further break down each word into subwords. The model is then trained on the tokens of the input and output pairs.

For example, let’s take a simple corpus:

Example 1: I am a DHS Chatbot.
Example 2: DHS stands for DataHack Summit.
Example 3: I can provide you with information about DHS

In the case of example 1, we can create the input-output pairs as follows:

Input: I | Output: am
Input: I am | Output: a
Input: I am a | Output: DHS
Input: I am a DHS | Output: Chatbot.

Similarly, in the case of example 2, the following is a list of input and output pairs:

Input: DHS | Output: stands
Input: DHS stands | Output: for
Input: DHS stands for | Output: DataHack
Input: DHS stands for DataHack | Output: Summit.

Each input and output pair is passed on to the model for training.
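The pair construction described above can be sketched in a few lines. As in the text, this assumes each whitespace-separated word is one token, which a real tokenizer such as BPE would not do; the function name is chosen only for this example.

```python
def make_training_pairs(sentence: str):
    """Build (input, output) pairs where the output is the next word."""
    tokens = sentence.split()
    return [
        (" ".join(tokens[:i]), tokens[i])
        for i in range(1, len(tokens))
    ]


pairs = make_training_pairs("I am a DHS Chatbot.")
for context, next_token in pairs:
    print(f"{context!r} -> {next_token!r}")
# 'I' -> 'am'
# 'I am' -> 'a'
# 'I am a' -> 'DHS'
# 'I am a DHS' -> 'Chatbot.'
```

A sentence with n words yields n - 1 pairs, so even a small corpus produces many training examples without any manual labeling. This is what makes the approach self-supervised.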

Now, what next? Let’s define the model architecture.

d. Model Architecture

The next step is to define the model architecture and train the LLM.

As of today, a huge number of LLMs are being developed. You can get an overview of different LLMs on the Hugging Face Open LLM leaderboard. Researchers follow a standard process when building LLMs: most start with an existing Large Language Model architecture like GPT-3, along with the actual hyperparameters of the model, and then tweak the model architecture, hyperparameters, or dataset to come up with a new LLM.

For example, Falcon is a state-of-the-art LLM that ranked first on the open-source LLM leaderboard when this article was first written. Falcon is inspired by the GPT-3 architecture, with a couple of tweaks.

e. Hyperparameter Search

Hyperparameter tuning is a very expensive process in terms of both time and cost. Just imagine running this experiment for a billion-parameter model. It’s not feasible, right? Hence, the ideal approach is to start from the hyperparameters of current research work (for example, use the hyperparameters of GPT-3 when working with the corresponding architecture), then find the optimal hyperparameters at a small scale, and finally scale them up for the final model.

The experiments can involve any or all of the following: weight initialization, positional embeddings, optimizer, activation, learning rate, weight decay, loss function, sequence length, number of layers, number of attention heads, number of parameters, dense vs. sparse layers, batch size, and dropout.

Let’s now discuss the best practices for popular hyperparameters:

Batch size: Ideally, choose the largest batch size that fits in GPU memory.

Learning Rate Scheduler: A better approach is to decrease the learning rate as training progresses. This helps the model avoid getting stuck in local minima and improves stability. Commonly used learning rate schedulers include Step Decay and Exponential Decay.

Weight Initialization: Model convergence depends heavily on the weights initialized before training. Proper weight initialization leads to faster convergence. A commonly used weight initialization for transformers is T-Fixup. Use weight initialization techniques only when you are defining your own LLM architecture.

Regularization: LLMs are observed to be prone to overfitting. Hence, it’s necessary to use techniques like Batch Normalization, Dropout, and L1/L2 regularization to help the model overcome overfitting.
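The two learning rate schedulers named above can be written as simple functions of the training step. This is a sketch with made-up default values (drop factor, steps per drop, decay rate); deep learning frameworks ship their own scheduler classes, and the right constants depend on the model and dataset.

```python
import math


def step_decay(initial_lr, step, drop_factor=0.5, steps_per_drop=1000):
    """Multiply the learning rate by drop_factor every steps_per_drop steps."""
    return initial_lr * (drop_factor ** (step // steps_per_drop))


def exponential_decay(initial_lr, step, decay_rate=0.001):
    """Shrink the learning rate smoothly: lr = initial_lr * exp(-decay_rate * step)."""
    return initial_lr * math.exp(-decay_rate * step)


for step in (0, 1000, 2500):
    print(step, step_decay(0.1, step), round(exponential_decay(0.1, step), 4))
```

Step decay keeps the rate constant between drops, while exponential decay lowers it a little at every step. Both start at the initial rate and only ever decrease it.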

Dialogue-optimized LLMs

Dialogue-optimized Large Language Models (LLMs) begin their journey with a pretraining phase, similar to other LLMs. After pretraining, these models are capable of text completion. To generate specific answers to questions, these LLMs undergo fine-tuning on a supervised dataset comprising question-answer pairs. This process equips the model with the ability to generate answers to specific questions.

ChatGPT, a dialogue-optimized LLM, follows a similar training method. However, after pretraining and supervised fine-tuning, it incorporates an additional step known as Reinforcement Learning from Human Feedback (RLHF).

Interestingly, a recent paper titled “LIMA: Less Is More for Alignment” suggests that RLHF might not be necessary. The paper posits that pretraining on a large dataset and supervised fine-tuning on high-quality data (about 1,000 examples) can suffice.

As of this writing, OpenChat stands as one of the latest dialogue-optimized LLMs, inspired by LLaMA-13B. After fine-tuning on just 6k high-quality examples, it achieves 105.7% of ChatGPT’s score on the Vicuna GPT-4 evaluation. This achievement underscores the potential of optimizing training methods and resources in the development of dialogue-optimized LLMs.

How Do You Evaluate LLMs?

The evaluation of LLMs cannot be subjective. It has to follow a logical process.

In classification or regression problems, we have the true labels and the predicted labels, and we compare them to understand how well the model is performing. For classification, we look at the confusion matrix, right? But what about large language models? They just generate text.

There are 2 ways to evaluate LLMs: intrinsic and extrinsic methods.

Intrinsic Methods

Researchers evaluated traditional language models using intrinsic methods like perplexity, bits per character, etc. These metrics track performance on the language front, i.e., how well the model is able to predict the next word.
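Perplexity is straightforward to compute once you have the probability the model assigned to each correct next token. The sketch below assumes those probabilities are already available as a list; the function names and the sample numbers are invented for illustration. It also reports bits per token, a close relative of the bits-per-character metric mentioned above.

```python
import math


def perplexity(token_probabilities):
    """Exponential of the average negative log-probability of the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(avg_nll)


def bits_per_token(token_probabilities):
    """Average number of bits needed to encode each true token."""
    return -sum(math.log2(p) for p in token_probabilities) / len(token_probabilities)


uniform_guess = [0.25, 0.25, 0.25, 0.25]   # model is choosing among 4 equally likely words
confident_model = [0.9, 0.8, 0.95, 0.85]   # model usually predicts the right word

print(round(perplexity(uniform_guess), 6))      # 4.0
print(round(bits_per_token(uniform_guess), 6))  # 2.0
print(perplexity(confident_model) < perplexity(uniform_guess))  # True
```

A perplexity of 4 means the model is, on average, as uncertain as if it were picking among 4 equally likely words. Lower is better.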

Extrinsic Methods

With the advancements in LLMs today, researchers and practitioners prefer extrinsic methods to evaluate their performance. The recommended way to evaluate LLMs is to look at how well they perform at different tasks like problem-solving, reasoning, mathematics, computer science, and competitive exams like MIT, JEE, etc.

EleutherAI released a framework called the Language Model Evaluation Harness to compare and evaluate the performance of LLMs. Hugging Face integrated this evaluation framework to evaluate open-source LLMs developed by the community.

The framework evaluates LLMs across 4 different datasets. The final score is an aggregation of the scores from each dataset.

AI2 Reasoning Challenge: A collection of science questions designed for elementary school students.

HellaSwag: A test that challenges state-of-the-art models to make common-sense inferences, which are relatively easy for humans (about 95% accuracy).

MMLU: A comprehensive test that evaluates the multitask accuracy of a text model. It includes 57 different tasks covering subjects like basic math, U.S. history, computer science, law, and more.

TruthfulQA: A test specifically created to assess a model’s tendency to generate accurate answers and avoid reproducing false information commonly found online.
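The aggregation step mentioned above can be sketched as follows. Two things here are assumptions made for illustration: the aggregation is taken to be a plain unweighted average, and the per-benchmark scores are invented numbers, not results for any real model.

```python
from statistics import mean


def aggregate_score(benchmark_scores):
    """Combine per-benchmark scores into one number (assumed: unweighted mean)."""
    return mean(benchmark_scores.values())


# Hypothetical scores, invented for this example.
scores = {
    "ARC": 60.0,
    "HellaSwag": 80.0,
    "MMLU": 50.0,
    "TruthfulQA": 50.0,
}

print(aggregate_score(scores))  # 60.0
```

A single aggregate makes models easy to rank, but it can hide weaknesses: a model may score well overall while doing poorly on one benchmark, so the individual scores are still worth reading.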


Conclusion

Large Language Models (LLMs) have

revolutionized the field of machine

learning. They have a wide range of

applications, from continuing text to

creating dialogue-optimized models.

Libraries like TensorFlow and PyTorch

have made it easier to build and train

these models.

However, training LLMs is not without

its challenges. It requires substantial

infrastructure and can be costly.

Understanding the scaling laws is

crucial to optimize the training process

and manage costs effectively. Despite

these challenges, the benefits of LLMs,

such as their ability to understand and

generate human-like text, make them a

valuable tool in today’s data-driven

world.

The process of training an LLM involves

feeding the model with a large dataset

and adjusting the model’s parameters to

minimize the difference between its

predictions and the actual data.

Typically, developers achieve this by

training a decoder-only transformer

to predict the next token.
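
To make "adjusting the parameters to minimize the difference" concrete, here is a toy sketch in plain Python. It is not how a real LLM is trained: a real model has billions of parameters inside a transformer and uses an optimizer such as Adam. Here the only "parameters" are four logits over a hypothetical 4-token vocabulary, the loss is cross-entropy on a single next-token target, and the update is plain gradient descent. All names and values are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shifted = [x - max(logits) for x in logits]  # subtract max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])


def sgd_step(logits, target, lr):
    """One gradient step; d(loss)/d(logit_i) = p_i - 1[i == target]."""
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


# Toy setup: a 4-token vocabulary whose logits are the only trainable parameters.
logits = [0.0, 0.0, 0.0, 0.0]
target = 2  # index of the token that actually comes next in the data

loss_before = cross_entropy(logits, target)
for _ in range(50):
    logits = sgd_step(logits, target, lr=0.5)
loss_after = cross_entropy(logits, target)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

Each step raises the logit of the correct token and lowers the others, so the loss falls as the prediction moves toward the data.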

Evaluating the performance of LLMs is

as important as training them. It helps

us understand how well the model has

learned from the training data and how

well it can generalize to new data.

LLMs have opened up new possibilities

in the field of machine learning. They

are a testament to how far we’ve come

since the early days of AI and a glimpse

into what the future might hold. As we

continue to explore and push the

boundaries of what’s possible with

LLMs, who knows what incredible

discoveries we’ll make next?

I hope you liked this article on how to train

a large language model (LLM) from

scratch, which covered the essential steps and

techniques for building effective LLMs

and optimizing their

performance.

Key Takeaways

Large Language Models (LLMs) like

GPT-3, Falcon, and others have

revolutionized natural language

processing by enabling machines to

understand and generate

human-like text.

Training LLMs from scratch involves

collecting massive datasets,

preprocessing, defining model

architecture, hyperparameter
tuning, and evaluation.

Challenges in training LLMs include

infrastructure requirements, such as

powerful GPUs and substantial

costs, as well as understanding

scaling laws to optimize model size

and dataset size.

One can evaluate LLMs through

intrinsic methods like perplexity and

extrinsic methods like evaluating

task-specific performance on

datasets such as AI2 Reasoning

Challenge and TruthfulQA.
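
As a rough illustration of the intrinsic metric mentioned in the last takeaway, perplexity can be computed as the exponential of the average negative log-likelihood per token. The sketch below assumes you already have the natural-log probability the model assigned to each token of a held-out text; the function name and the sample values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical natural-log probabilities a model assigned to each token
# of a held-out sentence.
log_probs = [math.log(0.25)] * 5
ppl = perplexity(log_probs)
print(f"perplexity: {ppl:.2f}")
```

A model that assigns probability 0.25 to every correct token is as uncertain as one choosing uniformly among four options, so its perplexity is 4. Lower values mean the model is less surprised by the text.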

The media shown in this article is not

owned by Analytics Vidhya and is

used at the Author’s discretion.

Aravind Pai

Aravind Pai is passionate about building

data-driven products for the sports

domain. He strongly believes that

Sports Analytics is a Game Changer.

Responses From
Readers

Akshat

Your blog is very helpful and informative. Thanks

For Sharing With Us.

Dipak Khatri

This is the most detailed explanation to the

audience.I learn many things.

Vinayak

Your blog explained very systematic manner and it's

very informative.

Frequently Asked
Questions

Q1. What is a large language model?

A. A large language model is a type of artificial

intelligence that can understand and generate
human-like text. It is typically trained on vast amounts of text data
and learns to predict and generate coherent sentences
based on the input it receives.

Q2. What is the difference between NLP and

large language models?

Q3.

Q4. What is the difference between LLM

and AI?
