Data Science Challenges & Solutions

According to acuvate.com, organizations across the globe are looking to organize, process, and unlock the value of the torrential amounts of data they generate, and to transform them into actionable and high-value business insights. Hence, hiring data scientists – highly skilled professional data science experts – has become super critical. Today, there is virtually no business function that cannot benefit from them. In fact, the Harvard Business Review has labeled data science as the "sexiest" career of the 21st century.

However, no career is without its own challenges, and being a data scientist, despite its "sexiness," is no exception. According to the Financial Times, many organizations are failing to make the best use of their data scientists by being unable to provide them with the necessary raw materials to drive results. In fact, according to a Stack Overflow survey, 13.2% of data scientists are looking to jump ship in search of greener pastures – second only to machine learning specialists. Having helped several data scientists solve their data problems, we share some of their common challenges and how they can overcome them.

The most vital unit of data science and statistics is the data itself – both the actual gathered data and the reference datasets provided for research or analytics. To prepare accurate and useful data, its relevance, legitimacy, and publication date must first be checked to ensure that the data being prepared, and soon to be used, will positively impact the research itself. The most common problems when preparing data are relevance and legitimacy: with so many sources, some may be fake or biased, leading to a weaker and more debatable dataset.

Data scientists spend nearly 80% of their time cleaning and preparing data to improve its quality – i.e., make it accurate and consistent – before utilizing it for analysis. However, 57% of them consider it the worst part of their jobs, labeling it as time-consuming and highly mundane. They are required to go through terabytes of data, across multiple formats, sources, and functions, on a day-to-day basis, whilst keeping a log of their activities to prevent duplication.

One way to solve this challenge is by adopting emerging AI-enabled data science technologies like Augmented Analytics and Auto feature engineering. Augmented Analytics automates manual data cleansing and preparation tasks and enables data scientists to be more productive.
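Much of this cleaning work happens in code. Below is a minimal sketch in Python with pandas of the kind of routine cleanup described above; the file name sales.csv and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical raw export with duplicates, missing values and messy types
df = pd.read_csv("sales.csv")

df = df.drop_duplicates()                                     # remove exact duplicate rows
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # force numeric; bad cells become NaN
df["amount"] = df["amount"].fillna(df["amount"].median())     # impute missing amounts
df["region"] = df["region"].str.strip().str.title()           # normalize inconsistent labels
df["date"] = pd.to_datetime(df["date"], errors="coerce")      # parse dates

df.to_csv("sales_clean.csv", index=False)                     # accurate, consistent copy for analysis
```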
As the community gets easier access to gadgets and devices, the availability of sources has begun to spread wider, leading to multiple sources and platforms that might be an area for fake information and data, scams, and even online attacks – and thus to less reliable and riskier sets of data. Usually, the presence of a wider platform makes it easier for risky sources to blend in, promoting piracy, scams, and cyber attacks, where the rules of copyright and cyber security are easily bypassed due to limited oversight.

As organizations continue to utilize different types of apps and tools and generate different formats of data, there will be more data sources that data scientists need to access to produce meaningful decisions. This process requires manual entry of data and time-consuming data searching, which leads to errors and repetitions, and eventually, poor decisions.

Organizations need a centralized platform integrated with multiple data sources to instantly access information. Data in this centralized platform can be aggregated and controlled effectively and in real time, improving its utilization and saving huge amounts of time and effort for data scientists.

One of the most common cyber security models is the CIA triad, which refers to Confidentiality, Integrity, and Availability; all three are affected as more platforms and data sources are established on the internet. Confidentiality of information or data is important to assure the safety of an individual or a group. Integrity values the ownership of data, giving way to copyright infringement claims, which are filed against individuals or groups that steal or use information or data without accreditation from the original owner. And lastly, availability is not always a good thing, especially today in the information age: because of the availability of data, some data are manipulated to be used for cyber crimes such as hacking and even identity theft.

As organizations transition into cloud data management, cyber-attacks have become increasingly common. This has caused two major problems:

a. Confidential data becoming vulnerable
b. As a response to repeated cyber-attacks, regulatory standards have evolved, which have extended the data consent and utilization processes, adding to the frustration of the data scientists.

Organizations should utilize advanced machine learning enabled security platforms and instill additional security checks to safeguard their data. At the same time, they must maintain strict adherence to data protection norms to avoid time-consuming audits and expensive fines.

To solve a certain issue, it must first be well defined, to avoid misconceptions and wrong actions that might cause a bigger problem for a business. To see the importance of understanding the business problem, consider an example: imagine having a store where you priced the products a bit higher than standard. You thought it was the brand that caused customers not to purchase, so you bought a more classy brand, and still nobody purchased, which would lead you to bankruptcy. But if you had known that it was the product's price, then you would probably have sold both the regular brands and the more classy one. Things are always interconnected, hence a wrong action might cause more problems, so do define the problem.

Before performing data analysis and building solutions, data scientists must first thoroughly understand the business problem. Most data scientists follow a mechanical approach to do this and get started with analyzing data sets without clearly defining the business problem and objective.

Therefore, data scientists must follow a proper workflow before starting any analysis. The workflow must be built after collaborating with the stakeholders and consist of well-defined checklists to improve understanding and problem identification.

Not everyone has enough knowledge about technology, especially in data science; hence, to have more understandable and effective communication with each staff member of an organization or a team, clarity must be prioritized over technicality of terms. People's proficiency differs from one to another, which can also depend on their field of expertise; hence it is better to communicate using common words, to avoid confusion during a talk or a meeting and to help when debating certain points.

It is imperative for data scientists to communicate effectively with business executives who may not understand the complexities and the technical jargon of their work. If the executive, stakeholder, or client cannot understand their models, then their solutions will, most likely, not be executed.

This is something that data scientists can practice. They can adopt concepts like "data storytelling" to give a structured approach to their communication and a powerful narrative to their analysis and insights.

There are different professions that deal with data science, and they differ in their proficiency and function in the field. For example, data scientists are those who model and design a system for a project, while data engineers are responsible for handling the datasets being gathered and processed, making them more organized and efficient. In an organization, despite the difference in roles and functions, there must be unison between staff; a collaborative act will help expand knowledge and can also give birth to more ideas, which will help the development of an organization or a project. The difference in proficiencies and skills enables wider functionality within a group.

Organizations usually have data scientists and data engineers working on the same projects. This means there must be effective communication between them to ensure the best output. However, the two usually have different priorities and workflows, which causes misunderstanding and stifles knowledge sharing.

Management should take active steps to enhance collaboration between data scientists and data engineers. It can foster open communication by setting up a common coding language and a real-time collaboration tool. Moreover, appointing a Chief Data Officer to oversee both departments has also proven to improve collaboration between the two teams.

One of the most common errors in the professional world is poorly understood job roles, owing to the stereotype that a higher salary means a higher profession. In reality, each profession deals with specific sets of skills and knowledge. For example, being a data scientist doesn't make you an analyst: data scientists are those who create or develop systems, while data analysts are the individuals or groups skilled in data analytics and the statistics that deal with the data.

In big organizations, a data scientist is expected to be a jack of all trades – they are required to clean data, retrieve data, build models, and conduct analysis. However, this is a big ask for any data scientist. For a data science team to function effectively, tasks need to be distributed among individuals pertaining to data visualization, data preparation, model building, and so on.

It is critical for data scientists to have a clear understanding of their roles and responsibilities before they start working with an organization.

When dealing with statistics, Key Performance Indicators (KPIs) and metrics are fundamental: they show the researchers the statistical status of a dataset – whether it is valid or invalid, accurate or inaccurate, relevant or irrelevant. KPIs and metrics play a major role in data science, as they help point out the needed and the unnecessary things in a dataset. Hence, to provide better results, research materials such as KPIs and metrics MUST always be checked and tested to avoid errors as researchers proceed and execute their tests and research.
The lack of understanding of data science among management teams leads to unrealistic expectations of the data scientists, which affects their performance. Data scientists are expected to produce a silver bullet and solve all the business problems. This is very counterproductive. Therefore, every business should have:

a. Well-defined metrics to measure the accuracy of analysis generated by the data scientists
b. Proper business KPIs to analyze the business impact generated by the analysis

Data science is no easy thing, due to the large and wide variety of data from multiple sources and time frames; this lessens the efficiency of both the technology and the people that deal with data science. Storage is the most common problem when we deal with data, as it must always be stored for future uses, such as a reference or even as a tool itself; that is why computer components are constantly changing. Also, more methods and models are being designed by data scientists to help efficiently test the datasets being collected, and this also comes with the development of stronger technology.

Clearrisk.com explained some challenges too: with today's data-driven organizations and the introduction of big data, risk managers and other employees are often overwhelmed with the amount of data that is collected. An organization may receive information on every incident and interaction that takes place on a daily basis, leaving analysts with thousands of interlocking datasets.

There is a need for a data system that automatically collects and organizes information. Manually performing this process is far too time-consuming and unnecessary in today's environment. An automated system will allow employees to use the time spent processing data to act on it.

The relevance of data is a priority in data science, as it provides better insight for researchers and helps them in the development of systems, machines, and more. The legitimacy and publication date of data tell its usefulness in a system; due to consistent innovations and developments in our current world, some data become irrelevant as they are replaced by more efficient, useful, and advanced ones. Because of those developments and innovations, cached and irrelevant data rise disturbingly in number, making it more difficult for researchers to find useful and valuable data.

With so much data available, it's difficult to dig down and access the insights that are needed most. When employees are overwhelmed, they may not fully analyze data, or they may focus only on the measures that are easiest to collect instead of those that truly add value. In addition, if an employee has to manually sift through data, it can be impossible to gain real-time insights on what is currently happening. Outdated data can have significant negative impacts on decision-making.

A data system that collects, organizes, and automatically alerts users of trends will help solve this issue. Employees can input their goals and easily create a report that provides the answers to their most important questions. With real-time reports and alerts, decision-makers can be confident they are basing any choices on complete and accurate information.
Clear and efficient data storytelling is a must in the world of data science; it helps convey an idea to individuals even without statistical skills, so a good visual representation of data must be provided. The most common types of data visual aids are charts, bar graphs, and line graphs; these help present the actual datasets in a more understandable and minimal way, to be comprehended and analyzed.

To be understood and impactful, data often needs to be visually presented in graphs or charts. While these tools are incredibly useful, it's difficult to build them manually. Taking the time to pull information from multiple areas and put it into a reporting tool is frustrating and time-consuming.

Strong data systems enable report building at the click of a button. Employees and decision-makers will have access to the real-time information they need in an appealing and educational format.
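To make the idea concrete, here is a minimal matplotlib sketch of the bar and line graphs mentioned above; the month labels and sales figures are invented for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(months, sales)               # bar graph: compare categories
ax1.set_title("Sales by month")
ax2.plot(months, sales, marker="o")  # line graph: show the trend over time
ax2.set_title("Sales trend")
plt.tight_layout()
plt.show()
```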

Due to the rising amounts of available data, some individuals or groups have taken the chance to sell data, making it difficult for researchers to access it. Besides the online businesses using data, data censorship has also been a problem, as governments and even internet handlers have decided to block or even delete some data that could be useful for research, leading to more limited sources. Despite the vast amount of data floating around the internet, it will still be difficult to look for useful data because of concerns over relevance, legitimacy, and legality. Hence, the censorship and the business of selling data choke the researchers even more.
Moving data into one centralized system has little impact if it is not easily accessible to the people that need it. Decision-makers and risk managers need access to all of an organization's data for insights on what is happening at any given moment, even if they are working off-site. Accessing information should be the easiest part of data management.

An effective database will eliminate any accessibility issues. Authorized employees will be able to securely view or edit data from anywhere, illustrating organizational changes and enabling high-speed decision making.

Despite the vast number of sources and platforms, some data are deemed useless and pointless due to their poor and weak quality, leaving researchers disappointed. The quality of data is dictated by its relevance, date of publication, completeness, legitimacy, and legality; and because of the high number of sources, some data are altered, whether improperly translated or, in the worst case, used for scams and other cyber crimes.

Nothing is more harmful to data analytics than inaccurate data. Without good input, output will be unreliable. A key cause of inaccurate data is manual errors made during data entry. This can lead to significant consequences if the analysis is used to influence decisions. Another issue is asymmetrical data: when information in one system does not reflect changes made in another system, leaving it outdated.

A centralized system eliminates these issues. Data can be input automatically with mandatory or drop-down fields, leaving little room for human error. System integrations ensure that a change in one area is instantly reflected across the board.
In the professional world, time is so important that it leads to procrastination and pressure, added to the fact that some superiors are not that delightful to be with. Focus is important in data science, as it deals mostly with quantities and qualities, and some technical skills are even applied; hence the pressure being exerted makes it difficult for data professionals to understand and solve certain issues.
As risk management becomes more popular in
organiz ations , CFOs and other ex ecutives demand more results
from risk managers. They expect higher returns and a large
number of reports on all kinds of data.

With a comprehensive analysis system, risk managers can go above and beyond expectations and easily deliver any desired analysis. They'll also have more time to act on insights and further the value of the department to the organization.

Funds and moral support are very important to everyone, even professionals. Sadly, most employees nowadays are not properly funded and supported, leaving them to stand on their own, trying to lift their difficulties without any help. In the world of data science, where research and technology are the priority, there should be proper funds, cooperation, and collaboration to help boost research and innovation.

Risk managers will be powerless in many pursuits if executives don't give them the ability to act. Other employees play a key role as well: if they do not submit data for analysis, or their systems are inaccessible to the risk manager, it will be hard to create any actionable insights.
Emphasize the value of risk management and analysis to all aspects of the organization to get past this challenge. Once other members of the team understand the benefits, they're more likely to cooperate. Implementing change can be difficult, but using a centralized data analysis system allows risk managers to easily communicate results and effectively achieve buy-in from multiple stakeholders.

Everything that is a business needs to be properly funded to be well established and stable; hence, having enough budget must be planned for, to avoid bankruptcy.

Another challenge risk managers regularly face is budget. Risk is often a small department, so it can be difficult to get approval for significant purchases such as analytics software.

Risk managers can secure budget for data analytics by measuring the return on investment of a system and making a strong business case for the benefits it will achieve. For more information on gaining support for a risk software system, check out our blog.

Personnel differ in skills; they are uniquely proficient at different things.

Some organizations struggle with analysis due to a lack of talent. This is especially true in those without formal risk departments. Employees may not have the knowledge or capability to run in-depth data analysis.
At the end of this lesson, you will be able to:
1. Understand the job and functions of a Data Scientist

Based on northeastern.edu, data scientists work closely with stakeholders to understand their goals and determine how data can be used to achieve those goals. They design data modeling processes, create algorithms and predictive models to extract the data the business needs, and help analyze the data and share insights with peers. While each project is different, the process for analyzing data generally follows the path below (a short code sketch after the list illustrates several of the steps):

1. Ask the right questions to begin the discovery process
2. Acquire data
3. Process and clean the data
4. Integrate and store data
5. Initial data investigation and exploratory data analysis
6. Choose one or more potential models and algorithms
7. Apply data science techniques, such as machine learning, statistical modeling, and artificial intelligence
8. Measure and improve results
9. Present final results to stakeholders
10. Make adjustments based on feedback
11. Repeat the process to solve a new problem
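The sketch below loosely mirrors steps 2 through 8 using scikit-learn; the built-in iris dataset and the decision tree model are illustrative assumptions, not part of the process above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # 2. acquire data
X_train, X_test, y_train, y_test = train_test_split(  # 3-5. organize and explore
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3)           # 6. choose a model
model.fit(X_train, y_train)                           # 7. apply ML techniques

preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))     # 8. measure results
```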

The most common careers in data science include the following roles.

1. Data scientist: Design data modeling processes to create algorithms and predictive models and perform custom analysis.
2. Data analyst: Manipulate large data sets and use them to identify trends and reach meaningful conclusions to inform strategic business decisions.
3. Data engineer: Clean, aggregate, and organize data from disparate sources and transfer it to data warehouses.
4. Business intelligence specialist: Identify trends in data sets.
5. Data architect: Design, create, and manage an organization's data architecture.
Most data scientists use the following core skills in their daily work:

1. Statistical analysis: Identify patterns in data. This includes having a keen sense of pattern detection and anomaly detection.
2. Machine learning: Implement algorithms and statistical models to enable a computer to automatically learn from data.
3. Computer science: Apply the principles of artificial intelligence, database systems, human/computer interaction, numerical analysis, and software engineering.
4. Programming: Write computer programs and analyze large datasets to uncover answers to complex problems. Data scientists need to be comfortable writing code in a variety of languages such as Java, R, Python, and SQL.
5. Data storytelling: Communicate actionable insights using data, often for a non-technical audience.

Data scientists play a key role in helping organizations make sound decisions. As such, they need "soft skills" in the following areas:

1. Business intuition: Connect with stakeholders to gain a full understanding of the problems they're looking to solve.
2. Analytical thinking: Find analytical solutions to abstract business problems.
3. Critical thinking: Apply objective analysis of facts before coming to a conclusion.
4. Inquisitiveness: Look beyond what's on the surface to discover patterns and solutions within the data.
5. Communication: Communicate across a diverse audience and all levels of an organization.
The components that make up an Orange workflow are what we refer to as "widgets," and they are placed on a canvas. Widgets communicate by sending data along a channel of communication. A widget's output can be used as an input for another widget.

Workflows in Orange consist of "widgets" that perform tasks like reading, analyzing, and visualizing data. Widgets are connected on a canvas to create workflows. A screenshot of a simple workflow shows two connected widgets and one widget without connections. The outputs of a widget appear on the right, while the inputs appear on the left.
We construct workflows by dragging widgets onto the canvas and connecting them by drawing a line from the transmitting widget to the receiving widget. The widget's outputs are on the right and the inputs on the left. In the workflow above, the File widget sends data to the Data Table widget.

A machine learning dataset is a collection of data that is used to train the model. Datasets are crucial for training and evaluating machine learning models. A dataset acts as an example to teach the machine learning algorithm how to make predictions. The common types of data include:

Text data: Text data consists of words, phrases, and sentences, usually gathered from sources like documents, social media posts, or emails. Processing text data involves techniques such as tokenization (breaking text into words or phrases), stemming (reducing words to their base form), and stop word removal (eliminating common words like "the" or "is"). Natural Language Processing (NLP) is commonly used to analyze text data, for tasks like language translation and document classification.
Applications: It helps businesses analyze customer feedback, automate customer service through chatbots, or categorize large sets of documents.
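A tiny sketch of tokenization and stop word removal in plain Python; the sentence and the stop-word list are made up for the example.

```python
text = "The service was great and the staff was friendly"

tokens = text.lower().split()                  # tokenization: break text into words
stop_words = {"the", "and", "was", "is", "a"}  # tiny illustrative stop-word list
filtered = [t for t in tokens if t not in stop_words]

print(filtered)  # ['service', 'great', 'staff', 'friendly']
```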
Image data: Image data is composed of pixels, and each pixel has an intensity value for red, green, and blue (RGB) channels. Processing image data often involves resizing, normalization, and filtering. Deep learning models, particularly Convolutional Neural Networks (CNNs), are widely used to analyze image data.
Applications: Image data is used in various applications like facial recognition, object detection, medical imaging, and self-driving cars. It is also vital for applications in augmented reality (AR) and digital art.
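A minimal sketch of the resize and normalization steps, assuming Pillow and NumPy are installed; photo.jpg is a hypothetical input file.

```python
import numpy as np
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")  # ensure three RGB channels
img = img.resize((64, 64))                    # resize to a fixed input shape

pixels = np.asarray(img, dtype=np.float32)    # array of shape (64, 64, 3)
pixels /= 255.0                               # normalize intensities to [0, 1]
print(pixels.shape, pixels.min(), pixels.max())
```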

Audio data: Audio data represents sound waves, typically captured as digital signals and processed as time series data. Techniques like the Fast Fourier Transform (FFT) are used to convert these signals into frequency domain data, which is then analyzed. Machine learning models analyze features like pitch, frequency, and amplitude to process audio data.
Applications: Audio data is used in speech recognition systems (e.g., voice assistants like Siri or Alexa), music recommendation algorithms, and emotion detection from voice. It's also important for real-time noise detection and monitoring.
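A minimal NumPy sketch of moving a signal into the frequency domain with the FFT; the generated 440 Hz sine wave stands in for real recorded audio.

```python
import numpy as np

rate = 8000                           # samples per second
t = np.arange(rate) / rate            # one second of time stamps
signal = np.sin(2 * np.pi * 440 * t)  # a pure 440 Hz tone

spectrum = np.abs(np.fft.rfft(signal))            # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1 / rate)  # frequency axis in Hz
print("dominant frequency:", freqs[spectrum.argmax()], "Hz")  # ~440.0
```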

Video data: Video data consists of a sequence of images (frames) captured over time. Each frame is an image, and video data is essentially a combination of image data and time-series data. Video processing requires techniques such as frame extraction, motion detection, and object tracking. Recurrent Neural Networks (RNNs) and CNNs are often applied in video data analysis.
Applications: Video data is essential in surveillance systems, video analytics, action recognition, and video editing software. It is also widely used in entertainment industries for content recommendation, such as in YouTube or Netflix recommendation systems.
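A minimal frame-extraction sketch, assuming OpenCV (cv2) is installed; clip.mp4 is a hypothetical input video.

```python
import cv2

cap = cv2.VideoCapture("clip.mp4")
frames = []
while True:
    ok, frame = cap.read()  # each frame is an image (a NumPy array)
    if not ok:              # stop when the video runs out of frames
        break
    frames.append(frame)
cap.release()

print(len(frames), "frames extracted")  # image data plus time order = video data
```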

Numeric data: Numeric data consists of numbers and is the most straightforward type of data used in data science. It can be continuous or discrete, and methods like normalization or scaling are used to process it. Numeric data is often represented in tabular formats like spreadsheets, and traditional machine learning models (e.g., regression, decision trees) are often applied.
Applications: Numeric data is used in financial analysis, stock market predictions, sales forecasting, statistical modeling, and other quantitative tasks. It's crucial for applications such as risk management, scientific research, and data-driven business decision-making.
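A minimal sketch of min-max scaling, one common normalization method for numeric data; the values are invented for the example.

```python
import numpy as np

values = np.array([120.0, 135.0, 128.0, 150.0])  # e.g., monthly sales figures
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled)  # all values now lie in [0, 1]
```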
Preparing and choosing the right dataset is one of the most crucial steps in training an AI/ML model. It can be the determinant between the success and the failure of the model.

• Training: The dataset is used to teach the machine learning model by showing it examples with known outcomes. The model learns patterns and relationships from this data to make predictions or classifications. A well-prepared training dataset ensures the model captures important features and performs well.
• Evaluation: After training, a separate portion of the dataset (the test set) is used to evaluate the model's performance. This helps determine how well the model generalizes to unseen data, ensuring it can make accurate predictions in real-world applications.

The whole dataset that is collected is separated into 3 subsets, which are as follows:

Training set: This is one of the most important subsets of the whole dataset. This set comprises the data that will initially be used to train the model. In other words, it helps teach the algorithm what to look for in the data. For instance, a vehicle license plate recognition system will be trained with image data with labels indicating the location (e.g., front or rear of the car) and the data format of the license plates of vehicles.

Validation set: The validation data is known data that helps in identifying any shortcomings in the model. This data is also used to identify if the model is over/underfitting.

Test set: This subset is input at the final stage. The data in this subset is unknown to the model and is used to test the accuracy of the model. In simpler words, this dataset will show how much your model has learned from the previous 2 subsets.
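A minimal sketch of producing the three subsets with scikit-learn's train_test_split applied twice; the roughly 70/15/15 split is an assumption, not a fixed rule.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off 30%, then halve it into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 105 22 23 for the 150-row iris data
```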

1. First, users might add a "File" widget to read in a dataset from a file.
2. Then, a "Select Columns" widget could be added to choose which features from the dataset to include in the analysis.
3. A "Scatter Plot" widget could be used to visualize the relationship between two variables in the dataset.
4. Finally, a model widget such as "Logistic Regression" or "Random Forest" could be added to apply a machine learning model to the data.
5. A "Test & Score" widget would then show the accuracy of the model and help refine it.
Reads attribute-value data from an input file. The File widget reads the input data file and sends the dataset to its output channel. The history of the most recently opened files is maintained in the widget. The widget also includes a directory with sample datasets that come pre-installed with Orange. The widget reads data from Excel (.xlsx), simple tab-delimited (.txt), comma-separated files (.csv), or URLs.

Most Orange workflows would probably start with the File widget. In the schema below, the widget is used to read the data that is sent to both the Data Table and the Box Plot widget.

Imports a data table from a CSV formatted file.

Outputs:
– Data: dataset from the .csv file
– Data Frame: pandas DataFrame object

The CSV File Import widget reads comma-separated files and sends the dataset to its output channel. File separators can be commas, semicolons, spaces, tabs, or manually defined delimiters. The history of most recently opened files is maintained in the widget.

Widgets are components in Orange that perform various tasks, such as reading data from
files, visualizing it, or applying machine learning algorithms. Users can connect these
widgets to create a full workflow that handles data mining from start to finish.
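The same building blocks are also exposed through Orange's Python scripting API, so a workflow can be driven from code. A minimal sketch, assuming the orange3 package is installed:

```python
import Orange

# Roughly what the File and Data Table widgets do on the canvas
data = Orange.data.Table("iris")  # load a bundled sample dataset
print(len(data), "instances")     # 150 instances
print(data.domain)                # attributes and the class variable
```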

1. The folder icon opens the dialogue for importing a local .csv file. It can be used either to load the first file or to change the existing file. The File dropdown stores paths to previously loaded data sets.
2. Information on the imported data set. Reports the number of instances, variables, and meta variables.
3. Import Options re-opens the import dialogue, where the user can set delimiters, encodings, text fields, and so on. Cancel aborts data import. Reload imports the file once again, adding to the data any changes made in the original file.

Displays attribute-value data in a spreadsheet.

Inputs:
– Data: input dataset

Outputs:
– Selected Data: instances selected from the table

The Data Table widget receives one or more datasets in its input and presents them as a spreadsheet. Data instances may be sorted by attribute values. The widget also supports manual selection of data instances.

1. The name of the dataset. Data instances are in rows and their attribute values in columns. In this example, the dataset is sorted by one of the attributes.
2. Info on current dataset size and the number and types of attributes.
3. Values of continuous attributes can be visualized with bars; colors can be attributed to different classes.
4. Data instances (rows) can be selected and sent to the widget's output channel.

Selects a subset of data instances from an input dataset.

Outputs:
– Data Sample: sampled data instances
– Remaining Data: the complementary dataset

The Data Sampler widget implements several data sampling methods. It outputs a sampled dataset and a complementary dataset. The output is processed after the input dataset is provided and Sample Data is pressed.

1. Information on the input and output data set.
2. The desired sampling method:
– Fixed proportion of data returns a selected percentage of the entire data.
– Fixed sample size returns a selected number of data instances.
– Cross validation partitions data instances into the specified number of complementary subsets. Following a typical validation schema, all subsets except the one selected by the user are output as Data Sample, and the selected subset goes to Remaining Data.
– Bootstrap infers the sample from the population statistic.
3. Replicable sampling maintains sampling patterns that can be carried across users, while stratify sample mimics the composition of the input dataset.
4. Press Sample Data to output the data sample.

If all data instances are selected, output instances are still shuffled.

Manual selection of data attributes and composition of the data domain.

Inputs:
– Data: The dataset that contains various attributes (columns) and instances (rows) from which users can select or deselect specific columns.

Outputs:
– Data: dataset with columns as set in the widget

The Select Columns widget is used to manually compose your data domain. The user can decide which attributes will be used and how. Orange distinguishes between ordinary attributes, (optional) class attributes, and meta-attributes. For instance, for building a classification model, the domain would be composed of a set of attributes and a discrete class attribute. Meta attributes are not used in modeling, but several widgets can use them as instance labels.

Orange attributes have a type and are either discrete, continuous, or a character string.

1. Left-out data attributes that will not be in the output data file.
2. Data attributes in the new data file.
3. Target variable. If none, the new dataset will be without a target variable.
4. Meta attributes of the new data file. These attributes are included in the dataset but are, for most methods, not considered in the analysis.
5. Produce a report.
6. Reset the domain composition to that of the input data file.
7. Tick if you wish to auto-apply changes of the data domain.
8. Apply changes of the data domain and send the new data file to the output channel of the widget.

Selects data instances based on conditions over data features.

Inputs:
– Data: The entire dataset, consisting of rows (instances) and columns (features), that will be filtered.

Outputs:
– Matching Data: instances that match the conditions
– Non-Matching Data: instances that do not match the conditions
– Data: data with an additional column showing whether an instance is selected

This widget selects a subset from an input dataset, based on user-defined conditions. Instances that match the selection rule are placed in the output Matching Data channel. Criteria for data selection are presented as a collection of conjunct terms (i.e., selected items are those matching all the terms in 'Conditions').
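The conjunction of condition terms behaves like chained boolean filters in code. A rough pandas analogy, with hypothetical columns, is shown below.

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 7.0, 6.3],
                   "species": ["setosa", "versicolor", "virginica"]})

# Two conjunct terms: a row must match BOTH conditions
mask = (df["sepal_length"] > 6.0) & (df["species"] != "setosa")
matching, non_matching = df[mask], df[~mask]  # the two output channels
print(matching)
```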

Condition terms are defined by selecting an attribute, selecting an operator from a list of operators, and, if needed, defining the value to be used in the condition term. Operators are different for discrete, continuous, and string attributes.

1. Conditions you want to apply, their operators, and related values.
2. Add a new condition to the list of conditions.
3. Add all the possible variables at once.
4. Remove all the listed variables at once.
5. Information on the input dataset and information on instances that match the condition(s).
6. Purge the output data.
7. When the Send automatically box is ticked, all changes will be automatically communicated to other widgets.
8. Produce a report.

Any change in the composition of the condition will update the information pane. If Send automatically is selected, then the output is updated on any change in the composition of the condition or any of its terms.

Scatter plot visualization with exploratory analysis and intelligent data visualization enhancements.

Outputs:
– Selected Data: instances selected from the plot
– Data: data with an additional column showing whether a point is selected

The Scatter Plot widget provides a 2-dimensional scatter plot visualization. The data is displayed as a collection of points, each having the value of the x-axis attribute determining the position on the horizontal axis and the value of the y-axis attribute determining the position on the vertical axis. Various properties of the graph, like color, size, and shape of the points, axis titles, maximum point size, and jittering can be adjusted on the left side of the widget. A snapshot below shows the scatter plot of the Iris dataset, with the coloring matching the class attribute.
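A minimal sketch of the same idea in code: a 2-dimensional scatter plot of the Iris data with point color matching the class attribute, assuming matplotlib and scikit-learn are installed.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
x, y = iris.data[:, 0], iris.data[:, 1]  # sepal length vs. sepal width

plt.scatter(x, y, c=iris.target)         # color points by class
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("Iris, colored by class")
plt.show()
```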

1. Select the x and y attributes. Optimize your projection with Find Informative Projections. This feature scores attribute pairs by average classification accuracy and returns the top scoring pair with a simultaneous visualization update.
2. Attributes: Set the color of the displayed points (you will get colors for categorical values and blue-green-yellow points for numeric). Set label, shape, and size to differentiate between points. Label only selected points allows you to select individual data instances and label only those.

Selection can be used to manually define subgroups in the data. The Data signal outputs a data table with an additional column that contains group indices.
Summary:
– Workflows: Made up of "widgets" that perform data tasks like reading, analyzing, and visualizing. These are connected on a canvas to build workflows.
– Widgets: Orange includes basic widgets for various data tasks and supports add-ons for specialized areas like text mining and bioinformatics.

Datasets:
– Used to train and evaluate machine learning models. They include text, image, audio, video, and numeric data.

Dataset subsets:
1. Training set: Used to teach the model.
2. Validation set: Used to fine-tune model parameters.
3. Test set: Used to evaluate the model's final accuracy.

Key widgets:
1. File: Reads and imports data files.
2. CSV File Import: Specifically handles data import from CSV files.
3. Data Table: Displays data in a spreadsheet for easy viewing and selection.
4. Data Sampler: Selects random or specific subsets of data for analysis.
5. Select Columns: Allows manual selection of data attributes.
6. Select Rows: Filters data based on specific conditions.
7. Scatter Plot: Visualizes data in a 2D scatter plot with customizable features.
WIDGETS IN ORANGE DATA MINING

MOST COMMONLY USED WIDGETS:

File: This widget is used to load data from various file formats like CSV, Excel, tab-delimited text files, and others. It’s the
first step in bringing raw data into the Orange workspace.
SQL Table: Connects to an SQL database and allows you to query specific tables or views directly from the database. It
can also run custom SQL queries to filter or join tables before loading the data into Orange.
Data Table: Displays the loaded dataset in a table format. You can view the data row-by-row and column-by-column,
making it easy to inspect the contents of the dataset before any analysis.
Select Columns: This widget helps you filter and select specific columns (features) of interest. You can deselect
unnecessary features, reorder them, or filter out irrelevant ones.
Edit Domain: Allows you to modify the dataset's feature set by renaming columns, changing feature types (e.g., turning
continuous variables into categorical), grouping categories, or applying transformations.

MOST COMMONLY USED WIDGETS:


Impute: Handles missing data by filling in the gaps. It can impute missing values using statistical measures (like mean,
median) or advanced techniques (such as model-based imputation).
Discretize: Converts continuous numerical variables into discrete categories or bins. This is useful in algorithms that
require categorical inputs or when you want to simplify the representation of continuous data.
Continuize: This widget transforms categorical variables into numerical representations. For example, one-hot encoding is
a common technique used to turn categorical variables into binary columns for machine learning algorithms.
Select Rows: Filters the dataset by rows based on conditions you specify. For example, you can choose rows where a
specific feature falls within a certain range, or only include rows with specific class labels.
Transpose: Swaps rows and columns in the dataset. It is often used in datasets where features are stored as rows, and
this widget can rearrange them to be columns.
Merge Data: Combines datasets that share a common attribute (e.g., ID), allowing for a more comprehensive dataset.
Randomize: Shuffles the dataset to ensure that the order of data does not impact downstream analysis, especially
important in training/testing scenarios.
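For readers who want to see what these transformations amount to, here are rough pandas equivalents of Impute, Discretize, and Continuize; the small data frame is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31],
                   "city": ["Manila", "Cebu", "Manila", "Davao"]})

df["age"] = df["age"].fillna(df["age"].mean())  # Impute: fill the missing value
df["age_bin"] = pd.cut(df["age"], bins=3)       # Discretize: continuous -> 3 bins
df = pd.get_dummies(df, columns=["city"])       # Continuize: one-hot encode a category
print(df)
```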
MOST COMMONLY USED WIDGETS:

Scatter Plot: Displays two or three continuous variables in a scatter plot, helping you observe correlations, trends, and
clusters in the data. You can color the points by class labels or other attributes.
Box Plot: Shows the distribution of a continuous variable, highlighting the median, quartiles, and any outliers. Box plots are
useful for comparing distributions across multiple variables or groups.
Histogram: Visualizes the frequency distribution of a continuous variable. It’s helpful for understanding the shape (e.g.,
normal, skewed) and spread of the data.
Heatmap: Displays correlations or relationships between variables or data points using color intensities. Heatmaps are
commonly used to visualize correlation matrices or similarity/distance metrics.
Mosaic Display: A graphical representation for visualizing relationships between categorical variables. The size of the
rectangles is proportional to the frequency of the combination of categories.
Line Plot: Plots time-series or ordered data, making it suitable for visualizing trends over time or a continuous sequence.
Venn Diagram: Visualizes the overlap between different sets of items, commonly used to compare groups or clusters in
the data.

MOST COMMONLY USED WIDGETS:

Logistic Regression: A classification algorithm that models the relationship between the features and a binary outcome. It
outputs probabilities and is widely used for binary classification tasks.
k-Nearest Neighbors (kNN): A simple classification and regression algorithm that assigns class labels based on the
majority class of the k nearest neighbors in the feature space.
Naive Bayes: A probabilistic classifier that applies Bayes’ theorem with strong (naive) independence assumptions
between the features. It’s fast and works well for high-dimensional datasets, especially in text classification.
Random Forest: An ensemble method that builds multiple decision trees and aggregates their predictions for improved
accuracy and robustness. It’s effective for both classification and regression tasks.
Neural Network: Implements feed-forward neural networks, where data is passed through layers of interconnected nodes
(neurons) to predict the output class. Suitable for complex datasets and non-linear relationships.
SVM (Support Vector Machine): A powerful classifier that constructs a hyperplane in the feature space to separate data
points from different classes. It works well for high-dimensional and complex datasets.
Decision Tree: A tree-like model where each internal node represents a decision based on a feature, and each leaf node
represents an output class. Decision trees are intuitive and easy to interpret.
AdaBoost: A boosting algorithm that improves weak classifiers by focusing on misclassified instances in each
subsequent iteration.
Linear Regression: A regression algorithm that models the relationship between continuous target variables and one or
more predictors using a linear equation.
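As a rough illustration outside Orange, this scikit-learn sketch trains two of the learners listed above, logistic regression and random forest, on a built-in dataset.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100)):
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(type(model).__name__, round(scores.mean(), 3))
```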

MOST COMMONLY USED WIDGETS:


Test & Score: Evaluates the performance of predictive models using metrics like accuracy, precision, recall, F1 score, and
AUC-ROC. It allows testing with cross-validation or holdout sets.
Confusion Matrix: Displays the performance of classification models by showing the number of true positives, false
positives, true negatives, and false negatives. It’s essential for understanding the accuracy of predictions.
ROC Analysis: Plots the ROC (Receiver Operating Characteristic) curve, which shows the trade-off between the true
positive rate and the false positive rate at different classification thresholds.
Performance Curve: Measures how much better a predictive model is at identifying positive instances compared to
random guessing. Lift curves are useful in marketing and fraud detection tasks.
Calibration Plot: Visualizes how well the predicted probabilities from a model align with actual probabilities, helping
assess the model's reliability.
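A minimal scikit-learn analogue of what Test & Score and Confusion Matrix report, using accuracy and a confusion matrix on a held-out test set; this sketches the metrics themselves, not Orange's own implementation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))  # rows: true class; columns: predicted class
```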
