

Lecture 3

The document provides an overview of datasets in artificial intelligence, detailing types such as tabular, image, text, and time series datasets, along with their applications. It also discusses data pre-processing, the importance of training and test datasets, and popular sources for machine learning datasets including Kaggle, UCI Machine Learning Repository, and others. Additionally, it emphasizes the ethical considerations in data collection and usage.

Uploaded by

Ebad Ullah
Copyright
© © All Rights Reserved

CS-323 : Artificial Intelligence

Module 3: Data Sets


Instructor: Dr. Tabassum Waheed
What is a Data set?
• A dataset is a collection of data arranged in a defined order or
structure.
• A dataset can hold anything from a simple array or series of values
to a full database table.
Data
Tabular Data
• A tabular dataset can be understood as a database table or matrix,
where each column corresponds to a particular variable, and each
row corresponds to one record of the dataset. The most widely
supported file type for a tabular dataset is the comma-separated
values (CSV) file.
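To make the row/column picture concrete, here is a minimal sketch that parses a small CSV text with Python's standard `csv` module. The column names and values are made up for illustration and do not come from the lecture.

```python
import csv
import io

# Hypothetical CSV content; the column names and values are illustrative only.
CSV_TEXT = """area_sqft,bedrooms,price
1200,3,250000
850,2,180000
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = load_rows(CSV_TEXT)
columns = list(rows[0].keys())  # the header line supplies the variable names

print(columns)
for row in rows:
    print(row)
```

Each dict is one row, and each key is one column. Note that `csv` returns every value as a string, so numeric columns still need converting before use.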
Types of Data
• Numerical data: such as house price, temperature, etc.
• Categorical data: such as Yes/No, True/False, Blue/Green, etc.
• Ordinal data: similar to categorical data, but the categories have a
meaningful order and can be ranked by comparison (for example,
low/medium/high).
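The difference between categorical and ordinal data shows up when values are converted to numbers for a model. The sketch below is one common approach, not the only one: ordinal values are mapped to their rank, while unordered categories are one-hot encoded so that no artificial order is implied. The category lists are hypothetical.

```python
# Hypothetical category values, chosen only for illustration.
SIZE_ORDER = ["small", "medium", "large"]  # ordinal: the order is meaningful
COLORS = ["blue", "green"]                 # categorical: no inherent order

def encode_ordinal(value, order):
    """Map an ordinal value to its rank (0-based) in the given order."""
    return order.index(value)

def encode_one_hot(value, categories):
    """Map a categorical value to a list with a single 1 at its category."""
    return [1 if value == category else 0 for category in categories]

print(encode_ordinal("medium", SIZE_ORDER))
print(encode_one_hot("green", COLORS))
```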
Types of Data Set
Image Datasets:
• Image datasets contain collections of images and are commonly
used in computer vision tasks such as image classification, object
detection, and image segmentation.
Examples :
• ImageNet
• CIFAR-10
• MNIST
Types of Data Set
Text Datasets:
• Text datasets comprise textual information, such as articles, books,
or social media posts. These datasets are used in NLP tasks such as
sentiment analysis, text classification, and machine translation.
Examples :
• Project Gutenberg dataset
• IMDb movie reviews dataset
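Before text can be used for a task such as sentiment analysis, it is usually split into tokens and counted. The following is a minimal sketch of a bag-of-words count using only the standard library. The tokenization rule (lowercase runs of letters, digits, and apostrophes) is a simplifying assumption, and the sample sentence is invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))

# Invented sample review, for illustration only.
review = "A great, great film!"
print(tokenize(review))
print(bag_of_words(review))
```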
Types of Data Set
Time Series Datasets:
Time series datasets consist of data points collected over time.
They are commonly used in forecasting, anomaly detection, and trend
analysis.
Examples :
• Stock market data
• Weather data
• Sensor readings
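Trend analysis on a time series often starts by smoothing out short-term noise. As a minimal sketch, assuming evenly spaced readings, a simple moving average can be computed as follows. The sensor values are made up for illustration.

```python
def moving_average(values, window):
    """Return the mean of each consecutive run of `window` values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical sensor readings taken at equal time intervals.
readings = [1, 2, 3, 4, 5]
print(moving_average(readings, 3))
```

A larger window gives a smoother curve but reacts more slowly to real changes in the series.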
Tabular Datasets:
Tabular datasets are structured data organized in tables or
spreadsheets.
Data Pre-processing:

• Data pre-processing is a fundamental stage in preparing datasets for
machine learning. It involves transforming raw data into a format
suitable for model training.
• Common pre-processing techniques include data cleaning to
remove inconsistencies or errors, normalization to scale data
within a specific range, feature scaling to ensure features
have comparable ranges, and handling missing values through
imputation or removal.
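Two of these steps can be sketched in a few lines. The example below assumes missing values are represented as `None`, fills them with the column mean (one simple imputation strategy among several), and then applies min-max normalization to scale the values into the range [0, 1]. The function names are illustrative.

```python
def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no range information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

column = [10.0, None, 30.0]      # hypothetical raw column with a gap
filled = impute_mean(column)     # the gap becomes the mean of 10 and 30
scaled = min_max_scale(filled)
print(filled, scaled)
```

In practice the minimum, maximum, and mean should be computed on the training data only and then reused for the test data, so that no information leaks from the test set.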
Training Dataset and Test Dataset:

• In machine learning, datasets are typically divided into two
parts: the training dataset and the test dataset.
• The training dataset is used to train the machine learning
model, while the test dataset is used to evaluate the model's
performance on data it has not seen.
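A split like this can be sketched as a shuffle followed by a slice. The 80/20 ratio and the fixed seed below are assumptions made for illustration; the lecture does not prescribe a ratio.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for ten records
train, test = train_test_split(data, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Shuffling before the split avoids a test set made up only of the last records in the file. For time series data a chronological split is usually more appropriate than a random one.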
Popular sources for Machine Learning datasets
• Kaggle is one of the most popular sources of datasets for data
scientists and machine learning practitioners.
• It allows users to find, download, and publish datasets easily.
It also provides the opportunity to work with other machine learning
engineers and solve difficult data science tasks.
• Kaggle hosts many high-quality datasets in different formats that
are easy to find and download.
• The link for Kaggle datasets is https://www.kaggle.com/datasets.
UCI Machine Learning Repository

• The UCI Machine Learning Repository is an important resource that
has been widely used by researchers and practitioners since 1987.
• It contains a large collection of datasets organized by machine
learning task, such as regression, classification, and clustering.
• Notable datasets in the repository include the Iris dataset,
Car Evaluation dataset, and Poker Hand dataset.
• The link for the UCI machine learning repository is
https://archive.ics.uci.edu/ml/index.php.
Datasets via AWS
• We can search, download, access, and share the datasets that are
publicly available via AWS resources.
• These datasets are accessed through AWS resources but are provided
and maintained by different government organizations, researchers,
businesses, or individuals.
• The link for the resource is https://registry.opendata.aws/.
Google's Dataset Search Engine

• Google's Dataset Search helps researchers find and access
relevant datasets from different sources across the web.
• It indexes datasets from fields such as social sciences, biology, and
environmental science.
• The link for the Google dataset search engine is
https://toolbox.google.com/datasetsearch.
Microsoft Datasets

• Microsoft has launched the "Microsoft Research Open Data"
repository, a collection of free datasets in various areas such as
natural language processing, computer vision, and domain-specific
sciences.
• It provides access to diverse, curated datasets that can be
valuable for machine learning projects.
• The link to download or use the dataset from this resource is
https://msropendata.com/.
Awesome Public Datasets
• The Awesome Public Datasets collection provides high-quality
datasets in a well-organized list arranged by topic, such as
Agriculture, Biology, Climate, Complex networks, etc. Most of the
datasets are free, but some are not, so it is better to check the
license before downloading a dataset.
• The link to download datasets from the Awesome Public Datasets
collection is
https://github.com/awesomedata/awesome-public-datasets
Computer Vision Datasets
• VisualData provides a number of datasets that are specific to
computer vision tasks such as image classification, video
classification, image segmentation, etc.
• Therefore, if you want to build a project on deep learning or image
processing, then you can refer to this source.
• The link for downloading the dataset from this source is
https://www.visualdata.io/.
Ethics
• It is essential to ensure that data is collected and used
ethically, respecting privacy rights and complying with relevant
laws and regulations.
• Data practitioners should take measures to protect data privacy,
obtain proper consent, and handle sensitive data responsibly.
• Resources such as ethical guidelines and privacy frameworks can
provide guidance on maintaining ethical practices in data collection
and use.
Resources
• https://www.v7labs.com/blog/best-free-datasets-for-machine-learning
• https://figshare.com/articles/dataset/MAST-ML_Education_Datasets/7017254
