
FAIML Unit 4 Introduction To ML

This document provides an introduction to machine learning (ML), explaining its definition, techniques, and applications. It covers various types of ML, including supervised, unsupervised, and reinforcement learning, and highlights its importance in modern technology through examples like self-driving cars, fraud detection, and virtual assistants. The document also outlines the historical development of ML and its current relevance in daily life.

Uploaded by

Sumit Kolgire
UNIT IV : INTRODUCTION TO ML

4.1 INTRODUCTION TO MACHINE LEARNING

• Machine learning is a growing technology which enables computers to learn automatically from past data. Machine learning uses various algorithms for building mathematical models and making predictions using historical data or information. Currently, it is being used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems and many more.
• This section gives you an introduction to machine learning along with a wide range of machine learning techniques such as Supervised, Unsupervised and Reinforcement learning. You will learn about regression and classification models, clustering methods, hidden Markov models and various sequential models.

What is Machine Learning?
• In the real world, we are surrounded by humans who can learn everything from their experiences with their learning capability, and we have computers or machines which work on our instructions. But can a machine also learn from experiences or past data like a human does? Here comes the role of Machine Learning.

[Fig. 4.1: "Yes, I can also learn from past data with the help of machine learning."]

• Machine Learning is said to be a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:
> Machine learning enables a machine to automatically learn from data, improve performance from experiences and predict things without being explicitly programmed.
> With the help of sample historical data, which is known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together for creating predictive models.
• Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the higher the performance will be.
• A machine has the ability to learn if it can improve its performance by gaining more data.

How Does Machine Learning Work?
• A Machine Learning system learns from historical data, builds prediction models and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data, as a huge amount of data helps to build a better model which predicts the output more accurately.
• Suppose we have a complex problem where we need to make predictions. Instead of writing code for it, we just need to feed the data to generic algorithms; the machine builds the logic as per the data and predicts the output.

[Fig. 4.2: The machine learning pipeline — historical data is fed to a learning algorithm, which builds a model that predicts the output for new data.]

Features of Machine Learning:
• Machine learning uses data to detect various patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is much similar to data mining, as it also deals with huge amounts of data.

Need for Machine Learning:
• The need for machine learning is increasing day by day. The reason behind this need is that machine learning is capable of doing tasks that are too complex for a person to implement directly. As humans, we cannot access a huge amount of data manually, so for this we need computer systems; here machine learning makes things easy for us.
• We can train machine learning algorithms by providing them a huge amount of data and letting them explore the data, construct the models and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be determined by the cost function. With the help of machine learning, we can save both time and money.
• The importance of machine learning can be easily understood by its use cases.
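The workflow just described — learn from historical data, then predict the output for new data — can be sketched in a few lines of Python. The house-size/price numbers below are invented for illustration, and the "learning algorithm" is a minimal least-squares line fit, not any particular method prescribed by the text:

```python
# Minimal sketch of "learn from historical data, predict for new data".
# All numbers are hypothetical (house sizes in square metres and prices).

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# "Historical data": past observations the model learns from.
sizes  = [50, 70, 90, 110]        # input feature
prices = [100, 140, 180, 220]     # known outputs (here price = 2 * size)

a, b = fit_line(sizes, prices)

# Predict for new, unseen data -- the rule was never explicitly programmed.
predicted = a * 130 + b
```

With more (and noisier) historical data, the same two-step shape — fit, then predict — stays the same; only the algorithm inside `fit_line` changes.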
• Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interest and recommend products accordingly.
• Following are some key points which show the importance of Machine Learning:
> Rapid increment in the production of data.
> Solving complex problems, which are difficult for a human.
> Decision making in various sectors, including finance.
> Finding hidden patterns and extracting useful information from data.

Classification of Machine Learning

[Fig. 4.3: Types of machine learning — Supervised, Unsupervised and Reinforcement learning.]

At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

(i) Supervised Learning
• Supervised learning is a type of machine learning method in which we provide sample labelled data to the machine learning system in order to train it, and on that basis it predicts the output. The system creates a model using labelled data to understand the datasets and learn about each data item; once the training and processing are done, we test the model by providing sample data to check whether it predicts the exact output or not.
• The goal of supervised learning is to map input data with the output data. Supervised learning is based on supervision, and it is the same as when a student learns things under the supervision of a teacher. An example of supervised learning is spam filtering.
• Supervised learning can be grouped further into two categories of algorithms:
> Classification
> Regression

(ii) Unsupervised Learning
• Unsupervised learning is a learning method in which a machine learns without any supervision.
• The training is provided to the machine with a set of data that has not been labelled, classified or categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or a group of objects with similar patterns.
• In unsupervised learning, we do not have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
> Clustering
> Association

(iii) Reinforcement Learning
• Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and gets a penalty for each wrong action. The agent learns automatically with these feedbacks and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of an agent is to get the most reward points, and hence it improves its performance.
• A robotic dog, which automatically learns the movement of its arms, is an example of reinforcement learning.

4.1.1 History of ML
• Some decades ago (about 40-50 years), machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easy, from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind machine learning is quite old and has a long history. Below, some milestones which have occurred in the history of machine learning are given:

[Fig. 4.4: Timeline of machine learning milestones from the 1940s to the 2010s.]

The First "AI" Winter:
• The duration of 1974 to 1980 was a tough time for AI and ML researchers. In this duration, failure of machine translation occurred, and people had reduced their interest in AI, which led to reduced funding for research by the government.

Machine Learning from Theory to Reality:
• 1959: The first neural network was applied to a real-world problem, to remove echoes over phone lines, using an adaptive filter.
• 1985: Terry Sejnowski and Charles Rosenberg invented a neural network, NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in one week.
• 1997: IBM's Deep Blue intelligent computer won the chess game against chess expert Garry Kasparov and became the first computer which had beaten a human chess expert.
• 2012: Google created a deep neural network which learned to recognize humans and cats in YouTube videos.
• 2014: The chatbot "Eugene Goostman" cleared the Turing test; it was the first chatbot which convinced 33% of the human judges that it was not a machine.
• 2014: DeepFace was a deep neural network created by Facebook, and they claimed that it could recognize a person with the same precision as a human can do.
• 2016: AlphaGo beat the world's number two Go player, Lee Sedol. In 2017, it beat the number one player of this game, Ke Jie.
• 2017: Alphabet's Jigsaw team built an intelligent system that was able to learn about online trolling; it used to read millions of comments from different websites to learn to stop online trolling.

Machine Learning at Present:
• Now machine learning has achieved great advancement in its research, and it is present everywhere around us, such as self-driving cars, Amazon Alexa, chatbots and many more.
• Modern machine learning models can be used for making various predictions, including weather prediction, disease prediction, stock market analysis, etc.

Prerequisites:
• Before learning machine learning, you must have basic knowledge of the following so that you can easily understand the concepts of machine learning:
> Fundamental knowledge of probability and linear algebra.
> The ability to code, especially in Python.
> Knowledge of calculus, especially derivatives of single-variable and multivariate functions.

4.1.2 Examples of Machine Learning
• Machine learning is growing rapidly day by day. We are using machine learning in our daily life even without knowing it, through applications such as Google Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of machine learning:

[Fig. 4.5: Real-world applications of machine learning.]

1. Image Recognition:
• Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition is the automatic friend tagging suggestion: Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion, and the technology behind this is machine learning's face detection and recognition algorithm. It is based on the Facebook project named "Deep Face", which is responsible for face recognition and person identification in the picture.

2. Speech Recognition:
• While using Google, we get an option of "Search by voice"; it comes under speech recognition, and it is a popular application of machine learning.
• Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to text". At present, machine learning algorithms are widely used in various applications of speech recognition. Google Assistant, Siri, Cortana and Alexa are using speech recognition technology to follow the voice instructions.

3. Traffic Prediction:
• If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
• It predicts the traffic conditions, such as whether traffic is cleared, slow-moving or heavily congested, with the help of two ways:
> Real-time location of the vehicle from the Google Maps app and sensors.
> Average time taken on past days at the same time.
• Everyone who is using Google Maps is helping this app to make it better. It takes information from the user and sends it back to its database to improve the performance.

4. Product Recommendations:
• Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc. for product recommendation to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while internet surfing on the same browser; this is because of machine learning.
• Google understands the user interest using various machine learning algorithms and suggests products as per customer interest. Similarly, when we use Netflix, we find some recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

5. Self-Driving Cars:
• One of the most exciting applications of machine learning is self-driving cars; machine learning plays a significant role in them. Tesla, a popular car manufacturing company, is working on self-driving cars. It is using an unsupervised learning method to train the car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:
• Whenever we receive a new email, it is filtered automatically as important, normal and spam.
• We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
> Content filter
> Header filter
> General blacklists filter
> Rules-based filters
> Permission filters
• Some machine learning algorithms such as Multi-Layer Perceptron, Decision Tree and Naive Bayes classifier are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:
• We have various virtual personal assistants such as Google Assistant, Alexa, Cortana and Siri. As the name suggests, they help us in finding information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
• These virtual assistants use machine learning algorithms as an important part.
• These assistants record our voice instructions, send them over the server on a cloud, decode them using ML algorithms and act accordingly.

8. Online Fraud Detection:
• Machine learning is making our online transactions safe and secure by detecting fraud transactions. Whenever we perform some online transaction, there may be various ways that a fraudulent transaction can take place, such as fake accounts, fake IDs and stealing money in the middle of a transaction. So to detect this, a Feed Forward Neural Network helps us by checking whether it is a genuine transaction or a fraud transaction.
• For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. For each genuine transaction, there is a specific pattern which gets changed for a fraud transaction; hence it detects fraud and makes our online transactions more secure.

9. Stock Market Trading:
• Machine learning is widely used in stock market trading. In the stock market,
there is always a risk of ups and downs in shares, so for this, machine learning's long short term memory neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:
• In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
• It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:
• Nowadays, if we visit a new place and we are not aware of the language, then it is not a problem at all; for this also, machine learning helps us by converting the text into our known languages. Google's GNMT (Google Neural Machine Translation) provides this feature, which is Neural Machine Learning that translates the text into our familiar language, and it is called automatic translation.
• The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition and translates the text from one language to another.

4.2 TYPES OF MACHINE LEARNING
• Machine learning is a subset of AI, which enables the machine to automatically learn from data, improve performance from past experiences and make predictions. Machine learning contains a set of algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and on the basis of training, they build the model and perform a specific task.

[Fig. 4.6: Types of machine learning.]

• These ML algorithms help to solve different business problems like Regression, Classification, Forecasting, Clustering, Association, etc.
• Based on the methods and way of learning, machine learning is divided into mainly four types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
• As its name suggests, supervised machine learning is based on supervision. It means that in the supervised learning technique, we train the machines using the "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we can say that first we train the machine with the input and corresponding output, and then we ask the machine to predict the output using the test dataset.
• Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we provide the training to the machine to understand the images, such as the shape and size of the tail of cat and dog, the shape of eyes, colour, height, etc. After completion of training, we input the picture of a cat and ask the machine to identify the object and predict the output. Now the machine is well trained, so it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that it is a cat; so it will put it in the cat category.
• The main goal of the supervised learning technique is to map the input variable with the output variable. Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.

Categories of Supervised Machine Learning
• Supervised machine learning can be classified into two types of problems, which are given below:
(i) Classification
(ii) Regression

(i) Classification
• Classification algorithms are used to solve classification problems in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.
• Some popular classification algorithms are given below:
> Random Forest Algorithm
> Decision Tree Algorithm
> Logistic Regression Algorithm
> Support Vector Machine Algorithm

(ii) Regression
• Regression algorithms are used to solve regression problems in which there is a linear relationship between input and output variables. These are used to predict continuous output variables, such as market trends, weather prediction, etc.
• Some popular regression algorithms are given below:
> Simple Linear Regression Algorithm
> Multivariate Regression Algorithm
> Decision Tree Algorithm
> Lasso Regression

Advantages and Disadvantages of Supervised Learning
Advantages:
• Since supervised learning works with the labelled dataset, we can have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
• These algorithms are not able to solve complex tasks.
• They may predict the wrong output if the test data is different from the training data.
• It requires lots of computational time to train the algorithm.

Applications of Supervised Learning
• Image Segmentation: Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
• Medical Diagnosis: Supervised algorithms are also used in the medical field for diagnosis purposes.
It is done by using medical images and past labelled data with labels for disease conditions. With such a process, the machine can identify a disease for new patients.
• Fraud Detection: Supervised learning classification algorithms are used for identifying fraud transactions, fraud customers, etc. It is done by using historic data to identify the patterns that can lead to possible fraud.
• Spam Detection: In spam detection and filtering, classification algorithms are used. These algorithms classify an email as spam or not spam. The spam emails are sent to the spam folder.
• Speech Recognition: Supervised learning algorithms are also used in speech recognition. The algorithm is trained with voice data, and various identifications can be done using the same, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning
• Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. It means that in unsupervised machine learning, the machine is trained using the unlabelled dataset, and the machine predicts the output without any supervision.
• In unsupervised learning, the models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.
• The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to the similarities, patterns and differences. Machines are instructed to find the hidden patterns from the input dataset.
• Let's take an example to understand it more precisely. Suppose there is a basket of fruit images and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
• So, now the machine will discover its patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
• Unsupervised learning can be further classified into two types, which are given below:
(i) Clustering
(ii) Association

(i) Clustering
• The clustering technique is used when we want to find the inherent groups from the data. It is a way to group the objects into a cluster such that the objects with the most similarities remain in one group and have fewer or no similarities with the objects of other groups. An example of the clustering algorithm is grouping the customers by their purchasing behaviour.
• Some of the popular clustering algorithms are given below:
> K-Means Clustering Algorithm
> Mean-Shift Algorithm
> DBSCAN Algorithm
> Principal Component Analysis
> Independent Component Analysis

(ii) Association
• Association rule learning is an unsupervised learning technique, which finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another data item and map those variables accordingly so that it can generate maximum profit. This algorithm is mainly applied in market basket analysis, web usage mining, continuous production, etc.
• Some popular algorithms of association rule learning are the Apriori Algorithm, Eclat and the FP-growth Algorithm.

Advantages and Disadvantages of Unsupervised Learning
Advantages:
• These algorithms can be used for complicated tasks compared to the supervised ones, because these algorithms work on the unlabelled dataset.
• Unsupervised algorithms are preferable for various tasks, as getting the unlabelled dataset is easier as compared to the labelled dataset.
Disadvantages:
• The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output beforehand.
• Working with unsupervised learning is more difficult, as it works with the unlabelled dataset that does not map with the output.
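The clustering idea described above can be sketched with a tiny pure-Python k-means on made-up 1-D points. This is a minimal sketch for illustration only (a real project would use a library implementation such as scikit-learn's `KMeans`), and the initialisation strategy here is deliberately simplistic:

```python
# Minimal 1-D k-means sketch: alternate between assigning points to their
# nearest centroid and moving each centroid to its cluster mean.
# All data points are hypothetical.

def kmeans_1d(points, k, iters=20):
    # Simplistic initialisation: the k smallest distinct values.
    centroids = sorted(set(points))[:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]   # two visually obvious groups
centroids, clusters = kmeans_1d(points, k=2)
```

No labels were provided anywhere: the two groups emerge purely from the similarities in the data, which is exactly the unsupervised setting described above.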
Applications of Unsupervised Learning
• Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright issues in document network analysis of text data for scholarly articles.
• Recommendation Systems: Recommendation systems widely use unsupervised learning techniques for building recommendation applications for different web applications and e-commerce websites.
• Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which can identify unusual data points within the dataset. It is used to discover fraudulent transactions.
• Singular Value Decomposition: Singular Value Decomposition, or SVD, is used to extract particular information from the database. For example, extracting the information of each user located at a particular location.

3. Semi-Supervised Learning
• Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data) and uses a combination of labelled and unlabelled datasets during the training period.
• Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabelled data. Labels are costly, so for corporate purposes there may be only a few of them. It is nonetheless different from supervised and unsupervised learning, as they are based on the presence and absence of labels.
• To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the concept of semi-supervised learning is introduced.
• The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, and this further helps to label the unlabelled data. This is because labelled data is a comparatively more expensive acquisition than unlabelled data.
• We can imagine these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and college. Further, if that student is self-analysing the same concept without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student has to revise by himself after analysing the same concept under the guidance of an instructor at college.

Advantages and Disadvantages of Semi-Supervised Learning
Advantages:
• The algorithm is simple and easy to understand.
• It is highly efficient.
• It is used to solve the drawbacks of supervised and unsupervised learning algorithms.
Disadvantages:
• Iteration results may not be stable.
• We cannot apply these algorithms to network-level data.
• Accuracy is low.

4. Reinforcement Learning
• Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial: taking action, learning from experiences and improving its performance. The agent gets rewarded for each good action and gets punished for each bad action; hence the goal of the reinforcement learning agent is to maximize the rewards.
• In reinforcement learning, there is no labelled data as in supervised learning, and agents learn from their experiences only.
• The reinforcement learning process is similar to that of a human being; for example, a child learns various things by experience in his day-to-day life. Another example of reinforcement learning is playing a game, where the game is the environment, the moves of an agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards.
• Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory and multi-agent systems.
• A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.

Categories of Reinforcement Learning
• Reinforcement learning is categorized mainly into two types of methods/algorithms:
(i) Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency that the required behaviour would occur again by adding something. It enhances the strength of the behaviour of the agent and positively impacts it.
(ii) Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to positive RL. It increases the tendency that the specific behaviour would occur again by avoiding the negative condition.

Real-world Use Cases of Reinforcement Learning
Video Games:
• RL algorithms are very popular in gaming applications. They are used to gain super-human performance. Some popular systems that use RL algorithms are AlphaGo and AlphaGo Zero.
Resource Management:
• The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL in computers to automatically learn and schedule resources to wait for different jobs in order to minimize average job slowdown.
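The reward-and-penalty loop described above can be sketched with a toy Q-learning agent. The environment below (a 5-cell corridor where the agent must walk right to reach a goal) and all constants are invented for illustration; this is just a minimal instance of the reinforcement idea, not an algorithm given in the text:

```python
# Toy Q-learning: the agent gets +1 for reaching the goal (a reward) and a
# small negative value for every other step (a penalty). Over many episodes
# it learns, from feedback alone, that moving right is the best policy.
import random

random.seed(0)

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]            # step left, step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration

for _ in range(200):                     # training episodes
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy choice: mostly exploit, occasionally explore
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)       # environment transition
        r = 1.0 if s2 == N_STATES - 1 else -0.01    # reward or penalty
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        # standard Q-learning update rule
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy: the best action in each non-goal state.
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
```

Note that nothing told the agent which direction is correct: the right-moving policy emerges purely from maximizing accumulated rewards, which is the defining property of reinforcement learning.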
Robotics:
• RL is widely being used in robotics applications. Robots are used in the industrial and manufacturing areas, and these robots are made more powerful with reinforcement learning. Different industries have a vision of building intelligent robots using AI and machine learning technology.
Text Mining:
• Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by the Salesforce company.

Advantages and Disadvantages of Reinforcement Learning
Advantages:
• It helps in solving complex real-world problems which are difficult to solve by general techniques.
• The learning model of RL is similar to the learning of human beings; hence very accurate results can be found.
• It helps in achieving long-term results.
Disadvantages:
• RL algorithms are not preferred for simple problems.
• RL algorithms require huge data and computations.
• Too much reinforcement learning can lead to an overload of states which can weaken the results.
• The curse of dimensionality limits reinforcement learning for real physical systems.

4.3 MACHINE LEARNING LIFE CYCLE
• Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. The machine learning life cycle is a cyclic process to build an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.
• The machine learning life cycle involves seven major steps, which are given below:
> Gathering Data
> Data Preparation
> Data Wrangling
> Analyse Data
> Train the Model
> Test the Model
> Deployment

[Fig. 4.7: The machine learning life cycle — gathering data, data preparation, data wrangling, data analysis, train model, test model, deployment.]

1. Gathering Data
• Data gathering is the first step of the machine learning life cycle. In this step, we need to identify the different data sources, as data can be collected from various sources. It is one of the most important steps of the life cycle: the quantity and quality of the collected data will determine the efficiency of the output.
first, we put all data together and then randomize the ordering of data This step can be further divided into two processes: (i) Data Exploration: > It is used to understand the nature of data that we have to work with. We need to understand the characteristics, format and quality of data > A better understanding of data leads to an effective outcome. In this, we find Correlations, general trends and outliers, (ii) Data Pre-Processing: = Now the next step is pre-processing of data for its analysis. Data Wrangling Data wrangling is the process of cleaning and converting raw data into a useable format. It is the process of cleaning the data, selecting the variable to use and transforming the data in a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address the quality issues It is not necessary that data we have collected is always of our use as some of the data may not be useful. In real worid applications, collected data may have various issues, including: > Missing Values > Duplicate data > Invalid data > Noise So, we use various filtering techniques to clean the data It is mandatory to detect and remove the above issues because it can negatively affect the quality of the outcome, Data Analysis Now the cleaned and prepared data is passed on to the analysis step. This step involves: > Selection of analytical techniques > Building models > Review the result The aim of this step is to build a machine learning mode! to analyze the data using various analytical techniques and review the outcome. It starts with the determination of the type of the problems, where we select the machine learning techniques such as Classification, Regression, Cluster analysis, Association, ete, then build the model using prepared data and evaluate the model. Hence, in this step, we take the data and use machine learning algorithms to build the model. 
5. Train Model
• The next step is to train the model. In this step, we train our model to improve its performance for a better outcome of the problem.
• We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules and features.

6. Test Model
• Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it.
• Testing the model determines the percentage accuracy of the model as per the requirement of the project or problem.

7. Deployment
• The last step of the machine learning life cycle is deployment, where we deploy the model in the real-world system.
• If the prepared model is producing an accurate result as per our requirement with acceptable speed, then we deploy the model in the real system. But before deploying the project, we check whether it is improving its performance using the available data or not. The deployment phase is similar to making the final report for a project.

4.4 AI AND ML
• Artificial intelligence and machine learning are parts of computer science that are correlated with each other. These two technologies are the most trending technologies used for creating intelligent systems. Although they are related and people sometimes use them as synonyms for each other, they are still different terms in various cases.
• On a broad level, we can differentiate AI and ML as follows: AI is a bigger concept to create intelligent machines that can simulate human thinking capability and behaviour, whereas machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly.
• Machine learning is about extracting knowledge from the data. It can be defined as: "Machine learning is a subfield of artificial intelligence which enables machines to learn from past data or experiences without being explicitly programmed."
• Machine learning enables a computer system to make predictions or take some decisions using historical data without being explicitly programmed.
• Machine learning uses a massive amount of structured and semi-structured data so that a machine learning model can generate accurate results or give predictions based on that data.
• Machine learning works on algorithms which learn on their own using historical data. It works only for specific domains: for example, if we are creating a machine learning model to detect pictures of dogs, it will only give results for dog images, but if we provide new data like a cat image, then it will become unresponsive.
• Machine learning is being used in various places such as online recommender systems, Google search algorithms, email spam filters, Facebook auto friend tagging suggestions, etc.
• It can be divided into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Table 4.1: Key Differences Between Artificial Intelligence (AI) and Machine Learning (ML)
| Artificial Intelligence | Machine Learning |
| Artificial intelligence is a technology which enables a machine to simulate human behavior. | Machine learning is a subset of AI which allows a machine to automatically learn from past data without programming explicitly. |
| The goal of AI is to make a smart computer system like humans to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output. |
| In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result. |
| Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning. |
| AI has a very wide range of scope. | Machine learning has a limited scope. |
| AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained. |
| AI systems are concerned about maximizing the chances of success. | Machine learning is mainly concerned about accuracy and patterns. |
| The main applications of AI are Siri, customer support using chatbots, Expert Systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc. |
| On the basis of capabilities, AI can be divided into three types, which are Weak AI, General AI and Strong AI. | Machine learning can also be divided into mainly three types, that are Supervised learning, Unsupervised learning and Reinforcement learning. |
| It includes learning, reasoning and self-correction. | It includes learning and self-correction when introduced with new data. |
| AI completely deals with structured, semi-structured and unstructured data. | Machine learning deals with structured and semi-structured data. |
[4.5 DATASET FOR ML
• The key to success in the field of machine learning, or to becoming a great data scientist, is to practice with different types of datasets. But discovering a suitable dataset for each kind of machine learning project is a difficult task. So, in this section, we will provide the details of the sources from where you can easily get the dataset according to your project.
• Before knowing the sources of machine learning datasets, let's discuss datasets. A dataset can contain any data. A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to the fields of the dataset. The most supported file type for a tabular dataset is CSV (Comma Separated File).
Types of Data in Datasets:
(i) Numerical Data: Such as house price, temperature, etc.
(ii) Categorical Data: Such as Yes/No, True/False, Blue/green, etc.
(iii) Ordinal Data: These data are similar to categorical data but can be measured on the basis of comparison.
Note: A real-world dataset is of huge size, which is difficult to manage and process at the initial level. Therefore, to practice machine learning algorithms, we can use any dummy dataset.
Need of Dataset:
• To work with machine learning projects, we need a huge amount of data, because without data, one cannot train ML/AI models.
• Collecting and preparing the dataset is one of the most crucial parts while creating an ML/AI project. The technology applied behind any ML project cannot work properly if the dataset is not well prepared and pre-processed.
• During the development of the ML project, the developers completely rely on the datasets.
• During the development of applications, datasets are divided into two parts:
> Training dataset
> Test dataset (validation set)
Fig. 4.10: Predictive model (the dataset is divided into a training set, validation set and test set; the test set gives the final performance estimate)
Note: The datasets are of large size, so to download these datasets, you must have fast internet on your computer.
Popular Sources for Machine Learning Datasets:
Below is the list of datasets which are freely available for the public to work on:
1. Kaggle Datasets
• Kaggle is one of the best sources for providing datasets for data scientists and machine learners. It allows users to find, download and publish datasets in an easy way. It also provides the opportunity to work with other machine learning engineers and solve difficult tasks.
• Kaggle provides high-quality datasets in different formats that we can easily find and download.
• The link for the Kaggle datasets is https://www.kaggle.com/datasets.
2. UCI Machine Learning Repository
• The UCI Machine Learning Repository is one of the great sources of machine learning datasets. This repository contains databases, domain theories and data generators that are widely used by the machine learning community for the analysis of ML algorithms.
• Since the year 1987, it has been widely used by students, professors and researchers as a primary source of machine learning datasets.
• It classifies the datasets as per the problems and tasks of machine learning, such as Regression, Classification, Clustering, etc. It also contains some of the popular datasets such as the Iris dataset, Car Evaluation dataset, Poker Hand dataset, etc.
• The link for the UCI machine learning repository is https://archive.ics.uci.edu/ml/index.php.
3. Datasets via AWS
• We can search, download, access and share datasets that are publicly available via AWS resources. These datasets can be accessed through AWS resources but are provided and maintained by different government organizations, researchers, businesses or individual organizations.
• Anyone can analyze the data and build various services using shared data via AWS resources. The shared datasets on the cloud help users to focus on data analysis rather than on acquisition of data.
• AWS provides various examples and ways to analyze the datasets. Anyone can add any dataset, and there is a search box using which we can search for the required dataset.
• The link for the resource Registry of Open Data on AWS is https://registry.opendata.aws.
4. Google's Dataset Search Engine
• Google dataset search engine is a search engine launched by Google on September 5, 2018. This source helps researchers to get online datasets that are freely available for use.
• The link for the Google dataset search engine is https://toolbox.google.com/datasetsearch.
5. Microsoft Datasets
• Microsoft has launched the "Microsoft Research Open Data" repository with a collection of free datasets in various areas such as natural language processing, computer vision and domain-specific sciences.
• Using this resource, we can download the datasets to use on the current device, or we can also directly use them on the cloud infrastructure.
• The link to download or use the dataset from this resource is https://msropendata.com/.
6. Awesome Public Dataset Collection
• The Awesome public dataset collection provides high-quality datasets that are arranged in a well-organized manner within a list according to topics such as Agriculture, Biology, Climate, Complex networks, etc. Most of the datasets are available free, but some may not be, so it is better to check the license before downloading a dataset.
• The link to download datasets from the Awesome public dataset collection is https://github.com/awesomedata/awesome-public-datasets.
7.
Government Datasets
• There are different sources to get government-related data. Various countries publish government data for public use, collected by them from different departments.
• The goal of providing these datasets is to increase transparency of government work among the people and to use the data in an innovative approach. Below are some units of government datasets:
> Indian Government dataset
> US Government Dataset
> Northern Ireland Public Sector Datasets
> European Union Open Data Portal
8. Computer Vision Datasets
• Visual data provides multiple numbers of great datasets that are specific to computer vision, such as Image Classification, Video classification, Image Segmentation, etc. Therefore, if you want to build a project on deep learning or image processing, then you can refer to this source.
• The link for downloading datasets from this source is https://www.visualdata.io/.
9. Scikit-learn Datasets
• Scikit-learn is a great source for machine learning enthusiasts. This source provides both toy and real-world datasets. These datasets can be loaded using the scikit-learn package with predefined functions such as load_boston(), load_iris(), etc., rather than importing a file from external sources. But these datasets are not suitable for real-world projects.
• The link to download datasets from this source is https://scikit-learn.org/stable/datasets/index.html.
[4.6 DATA PRE-PROCESSING
• Data pre-processing is a process of preparing the raw data and making it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model.
• When creating a machine learning project, it is not always the case that we come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way.
So for this, we use the data pre-processing task.
Why Do We Need Data Pre-processing?
• Real-world data generally contains noises, missing values, and may be in an unusable format which cannot be directly used for machine learning models. Data pre-processing is the required task for cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of a machine learning model. It involves the below steps:
> Getting the dataset
> Importing libraries
> Importing datasets
> Finding missing data
> Encoding Categorical Data
> Splitting dataset into training and test set
> Feature scaling
1. Get the Dataset
• To create a machine learning model, the first thing we require is a dataset, as a machine learning model completely works on data. The collected data for a particular problem in a proper format is known as the dataset.
• Datasets may be of different formats for different purposes. For example, if we want to create a machine learning model for business purposes, then the dataset will be different from the dataset required for a liver patient. So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file. However, sometimes, we may also need to use an HTML or xlsx file.
What is a CSV File?
• CSV stands for "Comma-Separated Values" files; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets, and we can use these datasets in programs.
• Here we will use a demo dataset for data pre-processing; for practice, it can be downloaded from "https://www.superdatascience.com/pages/machine-learning". For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.
• We can also create our own dataset by gathering data using various APIs with Python and put that data into a CSV file.
2. Importing Libraries
• In order to perform data pre-processing using Python, we need to import some predefined Python libraries.
These libraries are used to perform some specific jobs. There are three specific libraries that we will use for data pre-processing, which are:
(a) Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific calculation in Python. It also supports large, multi-dimensional arrays and matrices. So, in Python, we can import it as:
> import numpy as nm: Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.
(b) Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with this library, we need to import a sub-library, pyplot. This library is used to plot any type of charts in Python for the code. It will be imported as below:
> import matplotlib.pyplot as mtp: Here we have used mtp as a short name for this library.
(c) Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing the datasets. It is an open-source data manipulation and analysis library. It will be imported as below:
> import pandas as pd: Here, we have used pd as a short name for this library.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
3. Importing the Datasets
• Now we need to import the datasets which we have collected for our machine learning project. But before importing a dataset, we need to set the current directory as a working directory. To set a working directory in Spyder IDE, we need to follow the below steps:
> Save your Python file in the directory which contains the dataset.
> Go to the File explorer option in Spyder IDE and select the required directory.
• After setting a working directory, we can import the dataset using the read_csv() function of the Pandas library, which is used to read a CSV file:
data_set = pd.read_csv('Dataset.csv')
Extracting Dependent and Independent Variables:
• In machine learning, it is important to distinguish the matrix of features (independent variables) and dependent variables from the dataset. In our dataset, there are three independent variables, that are Country, Age and Salary, and one
dependent variable, which is Purchased.
Extracting Independent Variable:
• To extract an independent variable, we will use the iloc[] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
x = data_set.iloc[:, :-1].values
• In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is for all the columns. Here we have used :-1 because we don't want to take the last column, as it contains the dependent variable. So by doing this, we will get the matrix of features.
• By executing the above code, we will get output as:
[['India' 38.0 68000.0]
['France' 43.0 45000.0]
['Germany' 30.0 54000.0]
['France' 48.0 65000.0]
['Germany' 40.0 nan]
['India' 35.0 58000.0]
['Germany' nan 53000.0]
['France' 49.0 79000.0]
['India' 50.0 88000.0]
['France' 37.0 77000.0]]
• As we can see in the above output, there are only the three independent variables.
Extracting Dependent Variable:
• To extract dependent variables, again, we will use the Pandas iloc[] method.
y = data_set.iloc[:, 3].values
• Here we have taken all the rows with the last column only. It will give the array of dependent variables.
• By executing the above code, we will get output as:
Output: array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
Note: If you are using the Python language for machine learning, then extraction is mandatory, but for the R language it is not required.
4. Handling Missing Data:
• The next step of data pre-processing is to handle missing data in the datasets. If our dataset contains some missing data, then it may create a huge problem for our machine learning model. Hence it is necessary to handle missing values present in the dataset.
Ways to Handle Missing Data:
• There are mainly two ways to handle missing data, which are:
> By Deleting the Particular Row: The first way is commonly used to deal with null values. In this way, we just delete the specific row or column which consists of null values.
But this way is not so efficient, and removing data may lead to loss of information, which will not give the accurate output.
> By Calculating the Mean: In this way, we calculate the mean of the column or row which contains a missing value and put it in the place of the missing value. This strategy is useful for features which have numeric data. Here, we will use this approach.
Output:
[['India' 38.0 68000.0]
['France' 43.0 45000.0]
['Germany' 30.0 54000.0]
['France' 48.0 65000.0]
['Germany' 40.0 65222.22222222222]
['India' 35.0 58000.0]
['Germany' 41.111111111111114 53000.0]
['France' 49.0 79000.0]
['India' 50.0 88000.0]
['France' 37.0 77000.0]]
• As we can see in the above output, the missing values have been replaced with the means of the rest of the column values.
5. Encoding Categorical Data:
• Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country and Purchased.
• Since a machine learning model completely works on mathematics and numbers, if our dataset has a categorical variable, then it may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
For Country Variable:
• Firstly, we will convert the country variables into categorical data. To do this, we will use the LabelEncoder() class from the pre-processing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
Output:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
• For the second categorical variable, Purchased, we will only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
Output: Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
• It can also be seen in the variable explorer section.
6. Splitting the Dataset into the Training Set and Test Set
• In machine learning data pre-processing, we divide our dataset into a training set and test set.
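This split can be sketched with scikit-learn's train_test_split on a small dummy array (the array contents here are illustrative, not the section's Dataset.csv; the 80/20 ratio matches the test_size=0.2 used later in this section):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy feature matrix of 10 samples with 2 features each,
# and a binary label for every sample.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the rows for testing; random_state
# fixes the shuffle so the same split is reproduced on every run.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```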
This is one of the crucial steps of data pre-processing, as by doing this, we can enhance the performance of our machine learning model.
• Suppose we have given training to our machine learning model with one dataset and we test it with a completely different dataset. Then, it will create difficulties for our model to understand the correlations between the datasets.
• Remaining rows of the encoded Country output:
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
• In the above output, the country variables are encoded into the digits 0, 1 and 2. Here, the ML model may assume that there is some order or correlation between these numbers, which may produce a wrong output. So to remove this issue, we will use dummy encoding.
Dummy Variables:
• Dummy variables are those variables which have values 0 or 1. The value 1 indicates the presence of that category in a particular column, and the rest of the variables become 0. With dummy encoding, we will have a number of columns equal to the number of categories.
• In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the pre-processing library.
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder = OneHotEncoder(categorical_features=[0])
x = onehot_encoder.fit_transform(x).toarray()
Output:
array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01, 6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01, 4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01, 5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01, 6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01, 6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01, 5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01, 5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01, 7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01, 8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
3.70000000e+01, 7.70000000e+04]])
• As we can see in the above output, the country variable is encoded into 0/1 values and divided into three columns.
• It can be seen more clearly in the variable explorer section, by clicking on the x option.
• As we can see, the age and salary column values are not on the same scale. A machine learning model is based on Euclidean distance, and if we do not scale the variables, then it will cause some issue in our machine learning model.
• Euclidean distance is given as:
Fig. 4.12: Euclidean distance between A and B = √((x2 − x1)² + (y2 − y1)²)
• If we compute the distance using the age and salary values, then the salary values will dominate the age values, and it will produce an incorrect result. So to remove this issue, we need to perform feature scaling in machine learning.
• There are two ways to perform feature scaling in machine learning:
1. Standardization: new value = (original value − mean) / standard deviation
2. Normalization: new value = (original value − minimum) / (maximum − minimum)
• Here, we will use the standardization method for our dataset.
• For feature scaling, we will import the StandardScaler class of the sklearn pre-processing library as:
from sklearn.preprocessing import StandardScaler
• Now, we will create an object of the StandardScaler class to transform the training dataset:
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
• For the test dataset, we will directly apply the transform() function instead of fit_transform(), because fitting is already done on the training set:
x_test = st_x.transform(x_test)
Output: By executing the above lines of code, we will get the scaled values for x_train and x_test.
• As we can see in the above output, all the variables are scaled between values −1 to 1.
Note: Here, we have not scaled the dependent variable because there are only two values, 0 and 1.
But if these variables have a larger range of values, then we will also need to scale those variables.
Combining all the Steps:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# importing datasets
data_set = pd.read_csv('Dataset.csv')
# extracting Independent Variable
x = data_set.iloc[:, :-1].values
# extracting Dependent Variable
y = data_set.iloc[:, 3].values
# handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# fitting imputer object to the independent variables x
imputer = imputer.fit(x[:, 1:3])
# replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
# for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
# encoding for dummy variables
onehot_encoder = OneHotEncoder(categorical_features=[0])
x = onehot_encoder.fit_transform(x).toarray()
# encoding for purchased variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
+ It isa matrix of size 22 for binary classification with actual values on one axis and predicted on another. QBCo DiFFesence fowl bdbween positive Negative Positive cass ® negelive class Negative True negative | False negative yo | [es f Positive False positive True positive Fig. 4.13 ~ Confusion Matrix Let's understand the confusing terms in the confusion matric true positive, true negative, false negative and false positive with an example. Example + Amachineleaming mode is trained to predict tumor in patients, The test dataset consists of 100 people. Negative Positive 5 | Negative 60 8 ET pains |e 10 Fig. 414 Confusion Matrix for Tumor Detection ‘© True Positive(TP): Model correctly predicts the positive class (prediction and actual both are positive). In the above ‘example, 10 people who have tumors are predicted positively by the model. True Negative (TN): Model correctly predicts the negative class (prediction and actual both are negative). In the above example, 60 people who don't have tumors are predicted negatively by the model. False Positive (FP): Model gives the wrong prediction of the negative class (predicted-positive, actual-negative). In the above example, 22 people are predicted as positive of having a tumor, although they don’t have a tumor. FP is also called a TYPE I error. fine learning algorithms. meant 25 of unsupervised fearing algorithms neaoteg of association tule mining with suitable exam, tages of association nile mining learning algorithms. semi supervised learning algorithms. ent learning with suitable examp| 11. Enlist the various advantages and disadvan' 12. Explain any three popular sem! supervised machine 1. Enlist the various advantages and disadventages of 14. What is reinforcement learning? Explain the working of reinforce , Fnlict the various sdvantages and disadvantages af reinforcement learning. Draw and explain the life cycle of machine leaming. Differentiate between Al and ML 18, What are the different types of data? 
Explain the need of training and testing the datasets for ML models. 20. Enlist the popular sources of ML datasets. What is the need of data pre-processing? Descrive the various steps involved in data pre-processing . What is confusion matrix? Define the fallawing terms ® True positives b. False positives © True negatives d, Falge negatives © Precision £ Recall 23. What is cross validation? Deséribe the various techniques to perform cross validation 24. Differentiate between cross validation and test-train split 25, Describe the limitations of cross validation, Et oe ENTALS MESITAAS OF ALG Mi SEAL tatty aan | we train our model y | se the petton ea, a a8 raring accuracy sao vy High, Bu we provi anew dataset ot her it wil tet and tag aNE®. So we aways ty ta make a machine ening model which performs well with the training with the test dataset Here, we can define these datasets as +} Diane — “+ ‘Training set ‘Yost set Hig Aan * Training Set: A subset of dataset to train the machine learning model and we already know the output, * Test set: A subset of dataset to test he machine learning model and by using the test set, model predicts the output, + For splitting the dataset, we will use the below lines of cove: from skleornmodel_selection import train_test._split *_train, x_test, y_train, y_tests troin_test_split(x, y, test_size= 0.2, rondom_state=0) Explanation: + In the above code, the first line js used for splitting arrays of the dataset into random train and test subsets ‘+ In the second line, we have used four variables for our output that are > x .train: features for the training data > x test: features for testing data > y. trains: Dependent variables for training data > y.test: Independent variable for testing data “en tralntest_split() function, we have passed four parameters in which liest two are for arrays of data and test size is for Specifying the size of the test set. 
The test.size maybe S, .3 or 2, which tells the dividing ratio of training and testing sets. * The last parameter random state is used to set a seed for a random generator so that you always get the same result ‘and the most used value for this is 42. Output: ‘By executing the above code, we will get 4 different variables, which can be seen under the variable explorer section. ‘As we can see in the above image, the x and y variables are divided into 4 different variables with correspondin values 7. Feature Sealing Feature scaling is the final step of data pre-processing in machine learning. It is a technique to standardize | independent variables of the dataset in 2 specific range. n feature scaling, we put our variables in the same range < in the same scale so that no any variable dominate the other variable. + False Negative (FN): Mostel ywrons ggearamy) AE a i actual posi cs he pst a peter “So ‘xara 8 people wv have tumors are gaditd as regal ate (P0), False Negative Rate (FPR) ty. ; ‘Wi the help ofthese Your vais, wo can calelat Trve Posie Na favo (TNR) and Fae Negative Rate (NR) 7 TPR * Rajon Ponte” FN a. FAR © Zam poate * 19+ FN mW, TR * Tani negate “TW + FP ae FPR © val Negative * TH+ FP © Even it date is tbalanced, we car Squre Out that our model ic working well oF not. For that the values of TPR ang, TNR should be high, FPR and FNR should be as tow as pastibte, + With the help of TP, TN, PN and FP, other performance metrics can be calculated. Precisién, Recall * Both precision anki reeall are crucial for information retrieval, where postive class mattered the most as compared to negative, Why? * While searching something on the web, the madel does. not care about something irrelevant and not retrieved (this ts the true negative case). 
Therefore only TP, FP, FN are used in Precision and Recll, Precision Out of al the positive predicted, what percentage i try postive Pe Prection = 3 Fp The precision value lies between 0 and 1 Recall ‘Out of the total positive, what percentage is predicted positive? Its the same as TPR (true positive rate). TP. Real = a How are precision and recall useful? Let's see through examples Example 1: Credit card fraud detection Confusion Matrix for Credit Card Fraud Detection We do nat want t0 miss any fraud transactions. Therefore, we want False-Negative to be a6 low as possible, In th situations, we can compromise with the low precision, but recall should be high. Similarly, in the medical applica we don’t want to miss aly patient. Therefore we focus oni having a high recall ‘So far, we have discussed when the recall is important than precision. But, when is the precision more important t recall? Confaion Maia fr Spam Deecon * In the detection of spam mai itis okay i any spam mail remains undetected (alse negative), but what if we miss Reserve a subset of the dataset asa validation set. Provide the training to the mode! using the training dataset > Now, evaluate model performance using the validation set. If the model performs well with the validation set, perform the further step, else check for the issues. Methods used for Cross-Validation ‘© There are some common methods that are used for cross-validation. these methods are given below: 1. Valiedation Set Approach 2. Leave-P-out cross-validation 3. Leave one out cross-validation 4. K-fold cross-validation: 5, Stratified k-fold cross-validation 1 Validation Set Approach © We divide our input dataset into. a training set and test or validation set in the validation set approach. Both the subsets are given 50% of the dataset. ow core dataset to trait UF MOSEL soy ‘ve the undar-ftted modet =" FUNDAMENTALS OF AL 8 Na (SE, AI BML) aan ae hat we are just sing 8 S Sit tha a le i See nat tne taser 1x also tends 10. 
2. Leave-P-Out Cross-Validation
• In this approach, p data points are left out of the training data. It means, if there are a total of n data points in the original input dataset, then n − p data points will be used as the training dataset and the p data points as the validation set. This complete process is repeated for all the samples, and the average error is calculated to know the effectiveness of the model.
• There is a disadvantage of this technique: it can be computationally difficult for large p.
3. Leave-One-Out Cross-Validation
• This method is similar to leave-p-out cross-validation, but instead of p, we take 1 data point out of training. It means, in this approach, for each learning set, only one data point is reserved, and the remaining dataset is used to train the model. This process repeats for each data point. Hence for n samples, we get n different training sets and n test sets. It has the following features:
> In this approach, the bias is minimal, as all the data points are used.
> The process is executed n times; hence execution time is high.
> This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against one data point.
4. K-Fold Cross-Validation
• The K-fold cross-validation approach divides the input dataset into K groups of samples of equal size. These samples are called folds. For each learning set, the prediction function uses k − 1 folds, and the remaining fold is used for the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than other methods.
• The steps for k-fold cross-validation are:
> Split the input dataset into K groups.
> For each group:
- Take one group as the reserve or test data set.
- Use the remaining groups as the training dataset.
- Fit the model on the training set and evaluate the performance of the model using the test set.
• Let's take an example of 5-fold cross-validation. So, the dataset is grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model and the rest are used to train the model. On the 2nd iteration, the second fold is used to test the model and the rest are used to train the model. This process will continue
• On the 2nd iteration, the second fold is used to test the model and the rest are used to train the model. This process continues until each fold has been used once as the test fold. Consider the below diagram:

[Fig. 4.27: K-Fold Cross-Validation — in each of the five iterations, a different fold serves as the test set while the remaining four folds form the training set]

5. Stratified K-Fold Cross-Validation
• This technique is similar to k-fold cross-validation with some little changes. This approach works on the stratification concept: it is a process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. To deal with the bias and variance, it is one of the best approaches.
• It can be understood with an example of housing prices, such that the price of some houses can be much higher than that of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.

Holdout Method
• This method is the simplest cross-validation technique among all. In this method, we remove a subset of the training data and use it to get prediction results by training the model on the rest of the dataset.
• The error that occurs in this process tells how well our model will perform with an unknown dataset. Although this approach is simple to perform, it still faces the issue of high variance, and it also produces misleading results sometimes.

Comparison of Cross-Validation to Train/Test Split in Machine Learning
• Train/Test Split: The input data is divided into two parts, that is, a training set and a test set, in a ratio such as 70:30 or 80:20. It provides a high variance, which is one of its biggest disadvantages.
> Training Data: The training data is used to train the model, and the dependent variable is known.
> Test Data: The test data is used to make predictions from the model that is already trained on the training data. It has the same features as the training data, but is not a part of it.
• Cross-Validation Dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into groups of train/test splits and averaging the result. It can be used if we want to optimize a model that has been trained on the training dataset for the best performance.
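The k-fold procedure above (split into k folds, hold one fold out for testing, train on the rest, average the errors) can be sketched without any ML library. As a simplifying assumption, the "model" here is deliberately trivial — it just predicts the mean of its training data — so the focus stays on the fold mechanics rather than on any particular learner:

```python
# Minimal k-fold cross-validation sketch (plain Python, no libraries).

def k_fold_indices(n, k):
    # Partition indices 0..n-1 into k contiguous folds of near-equal size.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k):
    folds = k_fold_indices(len(data), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # k-1 folds form the training set, the remaining fold is the test set.
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        test = [data[j] for j in test_idx]
        prediction = sum(train) / len(train)              # "fit": mean model
        mse = sum((x - prediction) ** 2 for x in test) / len(test)
        errors.append(mse)
    return sum(errors) / k                                # average test error

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(cross_validate(data, k=5))                          # -> 51.0
```

With k = 5 every observation is tested exactly once, and averaging the five per-fold errors gives a less variable estimate than any single train/test split of the same data.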
• It is more efficient as compared to the train/test split, as every observation is used for both training and testing.

Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
• For ideal conditions, it provides the optimum output. But for inconsistent data, it may produce a drastic result. So, this is one of the big disadvantages of cross-validation, as there is no certainty of the type of data in machine learning.
• In predictive modeling, the data evolves over a period, due to which it may face differences between the training set and validation sets. For example, if we create a model for the prediction of stock market values and the data is trained on the previous 5 years' stock values, the realistic values for the next 5 years may be drastically different, so it is difficult to expect the correct output in such situations.

Applications of Cross-Validation
• This technique can be used to compare the performance of different predictive modelling methods.
• It has great scope in the medical research field.
• It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.

EXERCISE
1. What is machine learning? Explain the working of machine learning models with a suitable diagram.
2. What are the different features of ML?
3. Give the classification of machine learning techniques with suitable examples.
4. Describe any five applications of ML in brief.
5. What are the different types of learning in ML?
6. Explain any three popular supervised machine learning algorithms.
7. Enlist the various advantages and disadvantages of supervised learning algorithms.
