Apurv Notes - Foundations of Pytorch
The document provides an overview of foundational concepts in PyTorch, including the differences between machine learning and deep learning models, tensor operations, and the importance of optimizing neural networks using gradient descent. It explains tensor structures, operations, and how to convert between PyTorch tensors and NumPy arrays, as well as the use of CUDA for GPU acceleration. Additionally, it covers the mechanics of gradient descent optimization, including forward and backward passes, learning rates, and methods for calculating gradients.
FOUNDATIONS OF PYTORCH
(JANANI RAVI, PLURALSIGHT)
... relationship, mostly in the form of the four most prominent functions, namely ReLU, Sigmoid (Logit), Tanh and Step.
(4) Machine Learning vs Deep-Learning Model: In a typical machine learning model, the relevant parameters are defined/prescribed manually by the user. Deep-learning models, by contrast, decide from patterns in the data (fitting of curves) which variables are relevant, thereby improving predictability/decision-making.
... multiple neurons.
* The first layer (which is visible to the user) is the Input Layer, and the last layer, also visible to the user, is the Output Layer. The middle layers are the hidden layers.
* The number of neurons per layer and the number of layers must be optimized, since too much processing may lead to over-fitting of the model and hence lead the model astray.
Arrays, Vectors, Matrices and Tensors:
(1) Vectors are 1-D arrays, matrices are 2-D arrays and tensors are n-dimensional arrays.
(2) PyTorch processes tensors, not NumPy arrays, so all NumPy arrays from the input data have to be converted to tensors.
(3) By default, PyTorch creates all elements of a tensor as float32. You can manually decide whether to create float32, float64, int32 or another type.
(4) Arrays in NumPy are created and processed on the CPU, while tensors in PyTorch can also be created and processed on a GPU (which is faster due to its parallel processing capabilities).
(5) The dimensions of tensors are indexed from 0 (zero). So a tensor of shape [4, 3] has indices 0, 1, 2, 3 along the rows and 0, 1, 2 along the columns.
[Handwritten page, mostly illegible in this copy. Legible fragments: a neural network drawn as layers of neurons forming a computation graph through which data flows to the output; activation function examples, including (1) ReLU: Rectified Linear Unit and (3) Step; vectors = 1-D arrays, matrices = 2-D arrays, tensors = multi-dimensional arrays; arrays in NumPy => CPU.]
Operations on Tensors
[Handwritten page, partly illegible in this copy; only the legible entries are listed.]
Tensor Initialization
# The default type of each element of a tensor is float32.
# You can create a tensor from a 2-D list with "torch.Tensor(...)".
# Check whether an object is a tensor: "torch.is_tensor(tensor_arr)". Out: True
# Check the number of elements in a tensor: "torch.numel(tensor_arr)".
# Get an "uninitialised" tensor: only memory allocation, no values filled.
# Create a randomly initialised tensor with "torch.rand(...)".
# Create a tensor of a defined type (not float32), e.g. a 32-bit integer tensor. Out: dtype = torch.int32
# Create a tensor of type 16-bit integer. Out: dtype = torch.int16
# Create a tensor with all elements set to 10 with "torch.full(...)" and fill value 10.
# An identity matrix: "torch.eye(...)"; "torch.nonzero(...)" gives the indices of its non-zero elements.
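Because the handwritten initialization examples are hard to read, the sketch below shows one way such tensors can be created. The shapes, the values and the use of `torch.empty` and `dtype=` arguments are assumptions made for illustration, not a transcription of the notes.

```python
import torch

# Build a tensor from a nested Python list (values are illustrative).
tensor_arr = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

is_tensor = torch.is_tensor(tensor_arr)   # True for any torch.Tensor
n_elements = torch.numel(tensor_arr)      # total number of elements

# Allocate memory without initialising the values (contents are arbitrary).
tensor_uninitialized = torch.empty(2, 2)

# Uniform random values in [0, 1).
tensor_random = torch.rand(2, 3)

# Integer tensors: 32-bit and 16-bit.
tensor_int = torch.tensor([5, 3], dtype=torch.int32)
tensor_short = torch.tensor([1, 2, 3], dtype=torch.int16)

# A tensor filled with one value, and an identity matrix.
tensor_fill = torch.full((2, 6), 10.0)
identity = torch.eye(3)
non_zero = torch.nonzero(identity)  # indices of the non-zero entries
```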
Foundations of PyTorch - Notes Part 2
Demo: Simple Operations on Tensors
(1A) Functions like "fill_", which have an underscore as a suffix, work in place: they overwrite the original tensor by changing the values of its elements. The other functions (without an underscore as a suffix), like "add", create a new tensor with a fixed value added to the original tensor.
(1B) Not all in-place functions (with an underscore as suffix) have a corresponding function without the suffix. So, while there is "fill_", there is no function such as "fill". Some in-place functions like "add_" do have a corresponding out-of-place function "add".
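A minimal sketch of the in-place versus out-of-place distinction, using an illustrative 2 x 3 tensor (the names and values are assumptions):

```python
import torch

original = torch.zeros(2, 3)

# Out-of-place: add() returns a new tensor and leaves `original` untouched.
added = original.add(5)
unchanged_sum = original.sum().item()   # still 0.0 at this point

# In-place: fill_() and add_() overwrite the tensor they are called on.
original.fill_(10)
original.add_(1)
```

After the last two calls every element of `original` is 11, while `added` still holds 5s.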
(2A) PyTorch also offers several functions similar to those of the NumPy library, for example "linspace", "chunk", "cat", etc.
(2B) "tensor_chunk = torch.chunk(x, 3, 0)" splits the tensor "x" along dimension = 0 into a tuple of three tensors, tensor_chunk[0], tensor_chunk[1] and tensor_chunk[2], each with 5 elements.
(2C) "torch.cat((tensor_chunk[0], tensor_chunk[1], tensor_chunk[2]), 0)" concatenates the three tensors along dimension = 0.
(2c) Array slicing: "random_tensor[1:, 1:]" gives a tensor with all rows from row no. 1 (second row onwards) and all columns from column no. 1 (second column onwards).
(2D) "random_tensor[3, 2]" gives you a tensor holding the value of the element (which could be a single number or a tensor itself) at index (3, 2) in "random_tensor".
(2E) Add or delete a dimension with the "unsqueeze" or "squeeze" functions respectively.
(2F) Transpose a tensor from (x, y) to (y, x) dimensions.
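The operations in (2B) to (2F) can be sketched as follows. The 15-element tensor `x` and the 4 x 3 `random_tensor` are assumed shapes chosen to match the description above.

```python
import torch

x = torch.arange(15.0)                 # 15 elements: 0.0 ... 14.0

# Split along dimension 0 into a tuple of three tensors of 5 elements each.
tensor_chunk = torch.chunk(x, 3, 0)

# Concatenate the pieces back together along dimension 0.
rejoined = torch.cat((tensor_chunk[0], tensor_chunk[1], tensor_chunk[2]), 0)

random_tensor = torch.rand(4, 3)
sub_tensor = random_tensor[1:, 1:]     # rows 1.. and columns 1..
single = random_tensor[3, 2]           # zero-dimensional tensor at row 3, column 2

# unsqueeze adds a dimension of size 1; squeeze removes it again.
expanded = random_tensor.unsqueeze(0)  # shape (1, 4, 3)
restored = expanded.squeeze(0)         # shape (4, 3)

transposed = random_tensor.t()         # shape (3, 4)
```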
Demo: Elementwise and Matrix Operations on Tensors
(2G) Sorting: The sorting is done along one dimension (row by row for a 2-D tensor sorted along its last dimension), and it also produces a tensor of the sorted indices.
(2H) Elementwise addition, subtraction, division and multiplication.
(3) Clamp (restrict) the values of a tensor to a minimum and a maximum.
(4) Dot-product: torch.dot(t1, t2) and Cross-Product: torch.mn
(4B) Multiply a matrix of shape (x, y) with a vector of length y to get a tensor of length x, each element being the dot-product of one row of the matrix with the vector.
(4C) torch.argmax and torch.argmin find the index of the largest & smallest element along a particular dimension.
(4D) torch.mul multiplies two matrices element by element; a true matrix product needs torch.mm or torch.matmul.
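A small worked example of these operations, with illustrative values. The choice of `torch.mv` for the matrix-vector product in (4B) and `torch.mm` for the matrix product is an assumption, since the notes do not name those calls legibly.

```python
import torch

t = torch.tensor([[3.0, 1.0, 2.0], [9.0, 7.0, 8.0]])

# Sort each row; `indices` holds the original positions of the sorted values.
sorted_values, indices = torch.sort(t, dim=1)

# Restrict every value to the range [2, 8].
clamped = torch.clamp(t, min=2.0, max=8.0)

# Dot product of two 1-D tensors: 1*4 + 2*5 + 3*6.
t1 = torch.tensor([1.0, 2.0, 3.0])
t2 = torch.tensor([4.0, 5.0, 6.0])
dot = torch.dot(t1, t2)

# Matrix-vector product: one dot product per matrix row.
matrix_vector = torch.mv(t, t1)

# Element-wise product (torch.mul) versus matrix product (torch.mm).
elementwise = torch.mul(t, t)
matrix_product = torch.mm(t, t.t())

# Index of the largest and smallest element along dimension 1.
largest = torch.argmax(t, dim=1)
smallest = torch.argmin(t, dim=1)
```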
Demo: Converting between PyTorch Tensors and NumPy Arrays
(5) Convert a NumPy array to a torch tensor: "tensor = torch.from_numpy(np_array)". The resulting tensor has the datatype "dtype=float64" (the default type of a NumPy float array).
(5B) "torch.as_tensor(np_array)" avoids making a copy of np_array (or of the input if it is already a tensor), thereby avoiding memory wastage, and retaining np_array in the same computation-graph (whether on CPU or GPU). On the other hand, the tensor initiated/converted from ...
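The memory-sharing behaviour can be observed directly. This sketch assumes a small float array; `torch.tensor` is included only as a contrasting call that always copies its input.

```python
import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])    # NumPy floats default to float64

tensor = torch.from_numpy(np_array)      # shares memory with np_array
shared = torch.as_tensor(np_array)       # also avoids a copy here
copied = torch.tensor(np_array)          # always makes a copy

np_array[0] = 100.0                      # modify the NumPy array in place

back_to_numpy = tensor.numpy()           # CPU tensor -> NumPy array
```

The change to `np_array` shows up in `tensor` and `shared`, but not in `copied`.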
PyTorch Support for CUDA Devices
(6) CUDA is NVIDIA's platform for general-purpose computing on GPUs, and PyTorch supports CUDA devices.
(6B) Use CUDA to create different types of tensors (torch.cuda.FloatTensor) and to select the GPU on a multi-GPU device (torch.cuda.device).
(6C) Cross-GPU operations are not allowed: running an operation on tensors stored on different GPUs causes an error (unless peer-to-peer memory access among the devices is enabled). However, tensors residing on one GPU can be copied onto other GPUs.
(6D) Executions are queued FIFO on the GPUs and run asynchronously, but synchronous processing can be forced by setting the environment variable CUDA_LAUNCH_BLOCKING=1 (say, for error-handling, or when copying tensors from other GPUs which may be needed on this GPU for an in-process execution).
Demo: Creating Tensors on CUDA-enabled Devices
(6E) You have to initialize CUDA and select & index the GPU devices, and when you create a tensor you have to specify that it is to be created on a specific GPU, else the tensor will be created in CPU memory. Also, PyTorch keeps track of the current GPU, so all CUDA tensors are created on that current GPU. To create a tensor on another GPU, first change the GPU through "torch.cuda.device(2)" and then specify the CUDA device while creating the tensor.
(6F) When you copy a tensor created on cuda:2 while the current device is cuda:1, the new tensor is created on the current device cuda:1.
(6G) You can also copy a tensor from the current device cuda:1 to a different device cuda:2, by using "b2 = tensor_random.to(device=cuda2)".
(7) You can also check the memory allocated to and the memory cached for our tensors on the current device with "torch.cuda.memory_allocated()" and "torch.cuda.memory_cached()" (the latter is named "torch.cuda.memory_reserved()" in newer releases).
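A hedged sketch of device selection follows. It falls back to the CPU when no GPU is present, so the device name `cuda:0` and the tensor shapes are assumptions rather than values from the demo.

```python
import torch

# Fall back to the CPU so the sketch also runs on machines without a GPU.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Without a device argument, tensors are created in CPU memory.
cpu_tensor = torch.rand(2, 3)

# Create a tensor directly on the chosen device, or copy an existing one to it.
device_tensor = torch.rand(2, 3, device=device)
moved_tensor = cpu_tensor.to(device)

if use_cuda:
    print("GPUs visible:", torch.cuda.device_count())
    print("Current GPU index:", torch.cuda.current_device())
    print("Bytes allocated:", torch.cuda.memory_allocated())

# Operands must live on the same device before they can be combined.
total = device_tensor + moved_tensor
```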
FOUNDATIONS OF Pytorch - Notes (Part 3)
Gradient Descent Optimization
(8) Gradient Descent Optimization is a technique used to optimize the mathematical model (a neuron).
* It deals with optimization of the weights of the input parameters, and the bias, which leads to a best-fit curve (the curve which is closest to the true relationship between inputs & outputs).
* This is achieved by obtaining those model parameters (weights & bias) which lead to the least Mean-Squared-Error (MSE).
(8A) Here, for simplicity, we shall assume the network has only a single neuron with an affine function (for a linear relationship) and no activation function (in this case the activation is an identity function).
(8B) Each run involves two passes - a forward-pass and a backward-pass.
(i) Forward-Pass in the first run: In the forward-pass, we compute the predicted value of the output, based on assumed values of bias & weight.
(ii) Backward-Pass in the first run: Then, based on the resulting error (predicted value - actual value of the training set), we work backward through the output layer > through the hidden layers > through the input layer, to optimize/adjust the weights matrix [W] to achieve the least error.
* The new values of W & b are computed by an optimizer function, which tweaks the assumed values of W and b by a "step" whose size is set by the "learning-rate" of the neural network, to arrive at the least possible error.
* This backward-pass generates a gradient-matrix, which is basically the tensor whose elements are
  o the partial derivative of the error w.r.t. the weights (that is, the rate of change in error due to only the change in weights), and
  o the partial derivative of the error w.r.t. the bias (that is, the rate of change of error due to only the change in bias).
(8C) The second run
(i) Forward-Pass of Second Run: This pass uses the optimized values of the bias and the [W] matrix to arrive at another predicted value of the output, and therefore another error.
(ii) Backward-Pass in Second Run: The weights are again optimized/adjusted in the backward-pass to arrive at the least-possible error (predicted value of output minus actual value of output). Again the optimizer makes a change in the assumed [W] and b, to minimize the error still further.
(8D) Learning-Rate:
(i) The trade-off: The size of the step (set by the "learning rate" for that particular run) may be large or small - both have benefits and risks. Too large a step may swing the model output to the other extreme and actually increase the error. Too small a step may improve the error only slightly, and then require many more forward-backward passes.
(ii) The new values of weights & bias use the learning rate as follows:
New Value = Old Value - (Learning Rate * Gradient)
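The trade-off can be illustrated with a toy one-parameter problem. The error function E(w) = (w - 3)^2, the starting point and the three learning rates below are hypothetical choices made for this sketch.

```python
def gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimise the hypothetical error E(w) = (w - 3)^2 by gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)          # dE/dw
        w = w - learning_rate * gradient    # new value = old value - lr * gradient
    return w

w_small = gradient_descent(0.01)   # tiny steps: still far from the minimum at 3
w_good = gradient_descent(0.1)     # moderate steps: close to 3
w_large = gradient_descent(1.1)    # oversized steps: overshoots and diverges

print(w_small, w_good, w_large)
```

With the same 20 passes, only the moderate learning rate ends near the minimum at w = 3.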
(8E) Target-Value of Iterations: After every forward-pass, to decide on a step-value, the optimizer function computes a Gradient Vector of the error for the set of values of the weights-matrix [W] and the bias 'b' used in that pass.
  o In a three-dimensional space with three axes - x, y, Mean-Squared-Error - the various passes generate a three-dimensional surface formed from the combinations of input, output and error, i.e. the matrix [x, y, MSE].
  o This MSE has a minimum at the point on the surface where the model performs best (has the least MSE).
  o The corresponding values of [W] and b are the best / optimized values resulting from the training of the model. That is, the point where the gradient is lowest (close to zero) is where the model is optimized; beyond it the error would only increase.
  o ** The more the number of iterations, the better the approximation.
(8F) The optimizer function:
* After every forward-pass, to decide on a "step-value", the optimizer computes a Gradient Vector of the error for the set of values of the weights-matrix [W] and the bias 'b' used in that pass.
  o As we can see, there is a trade-off between the learning rate and the number of passes required to arrive at the least MSE (the lowest gradient).
  o The lowest gradient is where the model is optimized; thereafter the error would only increase.
* This Gradient Vector of the error for a particular W and b is a matrix with elements (partial derivative w.r.t. W, partial derivative w.r.t. b). That is
  o [rate of change of error w.r.t. W, rate of change of error w.r.t. b].
  o This is a matrix computation.
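To make the gradient vector concrete, the sketch below runs gradient descent by hand for a single neuron y = W*x + b with MSE as the error. The training data, starting values and learning rate are assumptions, and the two partial derivatives are written out analytically rather than taken from the notes.

```python
import numpy as np

# Hypothetical training data that follows y = 2x + 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0            # assumed starting values
learning_rate = 0.05

for _ in range(2000):
    y_pred = w * x + b                     # forward pass
    error = y_pred - y
    mse = np.mean(error ** 2)
    grad_w = np.mean(2.0 * error * x)      # partial derivative of MSE w.r.t. W
    grad_b = np.mean(2.0 * error)          # partial derivative of MSE w.r.t. b
    w -= learning_rate * grad_w            # adjust W against its gradient
    b -= learning_rate * grad_b            # adjust b against its gradient

print(w, b, mse)
```

The pair (grad_w, grad_b) is the gradient vector for that pass; both components shrink towards zero as W and b approach 2 and 1.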
Calculating Gradients
(9) Three methods to compute gradients -
(i) Symbolic differentiation,
(ii) Numeric differentiation and
(iii) Automatic differentiation (mostly used - TensorFlow, PyTorch).
PyTorch uses the "Autograd" package in the backward-pass (known as "backpropagation"), using a technique called "reverse-mode auto-differentiation".
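The three approaches can be compared on one small function. The function f(x) = x^3 + 2x, the evaluation point and the step size h are illustrative assumptions.

```python
import torch

def f(x):
    return x ** 3 + 2.0 * x

x0 = 2.0

# Symbolic differentiation, done by hand: f'(x) = 3x^2 + 2.
symbolic = 3.0 * x0 ** 2 + 2.0

# Numeric differentiation: a central finite difference (an approximation).
h = 1e-4
numeric = (f(x0 + h) - f(x0 - h)) / (2.0 * h)

# Automatic differentiation: autograd records the operations of the forward
# pass and applies the chain rule to them in the backward pass.
x = torch.tensor(x0, requires_grad=True, dtype=torch.float64)
f(x).backward()
automatic = x.grad.item()

print(symbolic, numeric, automatic)
```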
Using Gradients to Update Model Parameters
(9A) In reverse auto-differentiation,
* The gradient-matrix that is calculated corresponds to a particular time 't', and the resulting W & b are used for the next pass at time 't+1'.
* This gradient is applied to the previous run's parameters to obtain the new parameters for the next run:
"new parameter = [old parameter minus (gradient * learning-rate)]"
(9B) As we can see, there is a trade-off between the learning rate and the number of passes required to arrive at the least MSE (the lowest gradient). The lowest gradient is where the model is optimized; thereafter the error would only increase.
Two Passes in Reverse Mode Automatic Differentiation
(9C) Symbolic Differentiation:
* Requires calculation of all elements of the gradient vector (i.e. the partial derivatives of the error w.r.t. each parameter), which is time-consuming.
* Also, in many cases, the output of the activation function (and therefore the resulting error function) for a particular pass may not even be differentiable.
(9D) Automatic differentiation uses "Taylor's series" to compute errors and therefore the gradient vector corresponding to each forward-pass.
Demo: Introducing Autograd: Gradient-Matrix, Gradient-Function
(10) Reverse Auto-Differentiation in the Autograd Library: To decide the combination of [W, b, least-error], you need to maintain a history of the weight tensor [W], the bias tensor 'b', and the [Gradient Matrix] after each forward-pass.
(i) The Gradient-Matrix is also a tensor, whose shape matches the tensor for which the gradient matrix is computed.
(ii) The Gradient-Function is the transformation used to compute the Gradient-Matrix during the backward-pass.
(11) Gradient-Matrix:
(A) Checking Status: Whether a tensor has gradient-tracking enabled or not can be checked using "output_tensor.requires_grad".
(B) Default Value: The default value is "requires_grad = False".
(C) Setting/Enabling gradient-tracking: You can specify at the time of creating a tensor that you need tracking of its gradient in the backward phase, or enable it afterwards.
* To enable gradient-tracking, the Python command is "tensor1.requires_grad_()".
* This updates the gradient-tracking "in-place" - note it is a "requires_grad_" function, which has an underscore as a suffix and therefore updates the tensor "in-place" rather than creating a new tensor with gradient-values.
(D) Checking Gradient-Matrix: The gradient calculated w.r.t. any tensor can be checked using "print(tensor1.grad)", which produces "None" before the first backward-pass.
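A short sketch of items (A) to (D), using an illustrative 2 x 2 tensor (the values are assumptions):

```python
import torch

tensor1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

default_flag = tensor1.requires_grad      # False by default

# The trailing underscore marks an in-place update of the same tensor.
tensor1.requires_grad_()
enabled_flag = tensor1.requires_grad      # now True

# No backward pass has run yet, so there is no gradient to show.
grad_before_backward = tensor1.grad       # None

print(default_flag, enabled_flag, grad_before_backward)
```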
(E) For user-created tensors:
(i) The gradient matrix is populated only after the backward pass is executed by "tensor_output.backward()", even though the tensors may already have gradient-tracking enabled, AND
(ii) The shape of the gradient-matrix always matches the shape of the tensor itself.
(iii) The user-defined tensor1 & tensor2, which were gradient-enabled, will not have gradient values till we make a forward-pass followed by a backward-pass. This is because only when we have made a forward-pass do we have an error/cost/loss value, with respect to which the partial derivative vector of the tensor is calculated. Being created directly by the user rather than by an operation, they also have no gradient function ("grad_fn" is None).
(F) For the Output Tensor:
(i) A tensor which is the output of an operation on tensors of which at least one has gradient-tracking enabled (even if the others do not) has its gradient-tracking enabled by default.
(ii) The output tensor also has a gradient function automatically, even though no backward-pass has been run yet.
(12) Gradient-Function:
(i) The gradient function used to compute the gradient can be printed by "print(tensor1.grad_fn)".
(ii) The gradient function of the output tensor always references the last function in the computation statement (i.e. it references "mean" in "tensor_output = (A * B).mean()").
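The behaviour of the gradient matrix and gradient function can be sketched as below. The tensor values are illustrative, and only `tensor1` has tracking enabled so that the contrast with an untracked input is visible.

```python
import torch

tensor1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
tensor2 = torch.tensor([[5.0, 6.0], [7.0, 8.0]])   # tracking left disabled

# Forward pass: the output tracks gradients because one input does.
tensor_output = (tensor1 * tensor2).mean()

leaf_grad_fn = tensor1.grad_fn                      # None for user-created tensors
output_grad_fn_name = type(tensor_output.grad_fn).__name__

# Backward pass: fills tensor1.grad with d(output)/d(tensor1).
tensor_output.backward()

print(output_grad_fn_name)
print(tensor1.grad)
```

Since the output is the mean of four products, each element of `tensor1.grad` equals the matching element of `tensor2` divided by 4.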
(13) Directed Acyclic Graph: The combination of the original tensor and its gradient matrix is called a "Directed Acyclic Computation Graph - DAG".
* "Directed": It is called "Directed" because the direction of flow of results/output is forward/defined.
* In a Computation-Graph: The "node" is the actual tensor and the "function" represents the transformation performed along the edges of that node, where ...
Demo: Working with Gradients
(14) Gradients may be enabled or disabled. "Decorator functions" can be used in Python to control whether or not the gradient-matrix is tracked for the tensors.
* "@torch.no_grad()" is typed before the function definition, to create a "no-gradient zone" in the Python code - the resulting tensor will not have a gradient-matrix attached, even though the original tensor itself may have had gradient-tracking enabled.
* Through a nested/embedded "with torch.enable_grad():" block, you can enable gradient-tracking even within the larger block of "no_grad".
* Alternatively, you can also set requires_grad = True while instantiating a tensor (i.e. when you define & specify the tensor's elements - then & there itself): "torch.tensor([[1.0], [2.0]], requires_grad=True)"
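A minimal sketch of the three mechanisms. The function name `double_without_tracking` and the tensor values are hypothetical.

```python
import torch

# Tracking requested at creation time.
tensor1 = torch.tensor([[1.0], [2.0]], requires_grad=True)

@torch.no_grad()
def double_without_tracking(t):
    # Results computed here carry no gradient history.
    return t * 2

untracked = double_without_tracking(tensor1)

with torch.no_grad():
    inside_no_grad = tensor1 * 2
    # A nested enable_grad block switches tracking back on.
    with torch.enable_grad():
        inside_enable_grad = tensor1 * 2

print(untracked.requires_grad, inside_no_grad.requires_grad,
      inside_enable_grad.requires_grad)
```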
(15) Variables and Tensors:
(A) Earlier versions of PyTorch: The gradient vector & gradient function were stored in a separate Variable, which wrapped the base tensor together with its gradient-matrix & gradient-function.
(B) Current version of PyTorch:
(i) The Variable-wrapping API of base tensors (along with their gradient-matrix & gradient function) has been deprecated / discontinued. So the gradient vector & gradient-function, both of which were earlier stored in a separate Variable, are now attached to the original tensor itself.
(ii) "Variable" is still a proper class, which can be imported (from torch.autograd).
(iii) "var = Variable(torch.FloatTensor([9]))"
* Returns a tensor holding the element 9, and not a Variable.
* Like any other tensor, the tensor named "var" has gradient-tracking = False by default, and while instantiating 'var' itself, gradient-tracking can be set with "requires_grad = True".
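The point about the deprecated wrapper can be checked with a few lines. This sketch assumes a PyTorch release in which `torch.autograd.Variable` is still importable for backward compatibility.

```python
import torch
from torch.autograd import Variable   # kept for backward compatibility

var = Variable(torch.FloatTensor([9]))

is_plain_tensor = isinstance(var, torch.Tensor)   # the wrapper returns a tensor
tracking_default = var.requires_grad              # False unless requested

# Gradient tracking can be requested while instantiating.
var_tracked = Variable(torch.FloatTensor([9]), requires_grad=True)

print(var, is_plain_tensor, tracking_default, var_tracked.requires_grad)
```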
Demo: Training a Linear Model Using Autograd in Pytorch
(16) Now we build a simple neural network with a neuron having only a linear affine function, and no activation function:
(Step-1) We define & plot two training tensors from two NumPy arrays: "x_train = torch.from_numpy(x_train)" and "y_train = torch.from_numpy(y_train)"
(Step-2) Set their gradient-tracking properties to True (requires_grad=True)
(Step-3) Set the size (in number of neurons) of the input, hidden and output layers
input_size = 1
hidden_size = 1
output_size = 1
(Step-4) Set the weights tensor between the input & hidden layers:
"w1 = torch.rand(input_size, hidden_size, requires_grad=True)"
(Step-5) Set the weights tensor between the hidden and output layers:
"w2 = torch.rand(hidden_size, output_size, requires_grad=True)"
(Step-6) Set the learning rate (the step by which the optimizer may change the weights tensors in the backward-pass, to calibrate the output function).
(17) The Python code to implement the forward-backward pass:
(Step-1) Define the prediction function: Y_pred = X_train * w1 * w2
(Step-2) Define the loss function: Loss(t+1) = Loss(t) + (Y_pred - Y_train)^2, i.e. the sum of squared errors
(Step-3) Run a backward-pass: "loss.backward()"
(Step-4) Adjust the weights for the next pass: w1(t+1) = w1(t) - learn_rate * w1.grad
** The more the number of iterations, the better the approximation.
(18) For visualisation, convert the tensor to a NumPy array by "Y_np = Y_tensor.detach().numpy()"
(19) Scatter the points on the graph and then plot the predicted line.
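Steps (16) to (18) can be put together as one runnable sketch. The training data (y = 3x), the seed, the learning rate and the number of passes are assumptions made here. Gradient tracking is enabled only on the weights, which is all the update needs, and the plotting step (19) is left out.

```python
import numpy as np
import torch

# Hypothetical training data: y = 3x with no noise.
x_np = np.linspace(0.0, 1.0, 20, dtype=np.float32).reshape(-1, 1)
y_np = 3.0 * x_np

x_train = torch.from_numpy(x_np)
y_train = torch.from_numpy(y_np)

input_size, hidden_size, output_size = 1, 1, 1
torch.manual_seed(0)
w1 = torch.rand(input_size, hidden_size, requires_grad=True)
w2 = torch.rand(hidden_size, output_size, requires_grad=True)
learning_rate = 0.01

for _ in range(2000):
    y_pred = x_train.mm(w1).mm(w2)               # forward pass
    loss = (y_pred - y_train).pow(2).sum()       # sum of squared errors
    loss.backward()                              # backward pass
    with torch.no_grad():                        # update without tracking
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()                          # clear for the next pass
        w2.grad.zero_()

y_np_pred = y_pred.detach().numpy()              # NumPy array for plotting
print((w1 * w2).item(), loss.item())
```

The gradients are zeroed after each update because autograd accumulates them across calls to `backward()`; the product of the two weights should approach the slope 3 of the assumed data.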