
Data Science Applications

UNIT-1
• Data Science Applications in various domains
• Challenges and opportunities
• Tools for data scientists
• Recommender Systems: Introduction
• Methods
• Application
• Challenges
Unit-2 Time Series Data
• Stock market index movement forecasting
Supply Chain Management
• Real-world case study in logistics

Data Science
• Data science is a new area of research concerned with huge volumes of data;
it involves collecting, preparing, visualizing, managing, and preserving data.
• It reveals hidden features of complex social, human, and natural phenomena
from a point of view different from that of traditional methods.
• Data science includes three stages: designing the data, collecting the data,
and analyzing the data.
• DS has made remarkable advancements in the fields of ensemble machine
learning, hybrid machine learning, and deep learning.
• Machine learning (ML) methods can learn from the data with minimal human
interference.
• Deep learning (DL) is a subset of ML that is applicable in different areas,
like self-driving cars, earthquake prediction, and so on.
• The literature shows the superiority of DL over ML methods such as
artificial neural networks, k-nearest neighbors, and support vector machines
(SVM) in different disciplines, such as medicine, social media, and so on.

• Advancements in different areas of communications and information
technology, like email privacy, market and stock data, and real-time
monitoring, have also been a positive influence on data science.
• Data science builds algorithms and systems for discovering knowledge,
detecting patterns, and generating useful information from massive data.
• To do so, it starts with the extraction and cleaning of data, and extends to
data analysis, description, and summarization.
• Figure 1.1 shows the complete process.
• It starts with data collection. Next, the data is cleaned to select the
segment that has the most valuable information.
• The user filters the data or formulates queries to remove unnecessary
information.
• After the data is prepared, an exploratory analysis using visualization
tools helps decide which algorithms are suitable to gain the required
knowledge.
• This complete process guides the user toward results that will help them
make suitable decisions.
• Depending on the primary outcomes, the complete process should be fine-tuned
to obtain improved results.
• This may involve changing parameter values or making changes to the
datasets.
• These kinds of decisions are not made automatically, so the involvement of
an expert in result analysis is a crucial factor.
• From a technical point of view, data science consists of a set of tools and
techniques that deal with various goals corresponding to multiple situations.
• Some of the frequently used methods are clustering, classification, deep
learning, regression, association rule mining, and time-series analysis.
• Even though these methods are often used in text mining and other areas,
anomaly detection and sequence analysis are also helpful and provide excellent
results for text mining problems.
Classification
• Wu et al. describe classification as predicting the class of each object in
a set based on its attributes.
• Decision trees (DT) are used to perform and visualize that classification [3].
• DTs may be generated using various algorithms, such as ID3, CLS, CART, C4.5,
and C5.0.
• Random forest (RF) is another classifier; it constructs a set of DTs and
then predicts by aggregating the values generated by each DT.
• A classification model was developed using a technique known as the Least
Squares Support Vector Machine (LS-SVM).
• LS-SVM performs the classification task by using a hyperplane in a
multidimensional space to separate the dataset into the target classes.
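
As a minimal, hedged illustration of these classifiers (not from the source:
the iris dataset, the parameter values, and the use of scikit-learn's standard
RBF-kernel SVC as a stand-in for LS-SVM are all assumptions), the following
sketch trains a decision tree, a random forest, and an SVM:

```python
# Minimal classification sketch with scikit-learn; dataset and parameters are
# illustrative, and SVC stands in for the LS-SVM variant mentioned above.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "decision tree (CART)": DecisionTreeClassifier(max_depth=3),
    "random forest (aggregated DTs)": RandomForestClassifier(n_estimators=100),
    "SVM (hyperplane separator)": SVC(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train, y_train)                # learn class boundaries
    print(name, "accuracy:", model.score(X_test, y_test))
```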
Regression
• Regression analysis aims at the numerical estimation of the relationship
between variables.
• This involves estimating whether or not the variables are independent.
• If a variable is not independent, then the first step is to determine the
type of dependence.
• Chatterjee et al. proposed regression analysis that is often used for
prediction and forecasting, and also to understand how the dependent variables
change corresponding to fixed values of the independent variables.
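
A minimal sketch of this idea, assuming ordinary least squares on synthetic
data (the slope, intercept, and noise level below are illustrative, not from
the text):

```python
# Regression sketch: numerically estimate how a dependent variable y changes
# with an independent variable x (ordinary least squares via scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))               # independent variable
y = 3.0 * x.ravel() + 2.0 + rng.normal(0, 1, 100)   # dependent variable + noise

model = LinearRegression().fit(x, y)
print("estimated slope:", model.coef_[0])           # should be close to 3.0
print("estimated intercept:", model.intercept_)     # should be close to 2.0
print("prediction at x=5:", model.predict([[5.0]])[0])
```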
Deep Learning
• In deep learning, neural networks with many hidden layers are used to deeply
understand the information in inputs such as images and to predict accurately.
• Early layers learn to detect low-level features, such as edges.
• New layers merge the features of the previous layers to represent the input
better.
• Fischer and Krauss [6] applied long short-term memory (LSTM) networks to
forecasting out-of-sample directional movements in the stock market.
• Here, a comparative study was performed with DNN, RF, and LOG (logistic
regression), and it demonstrates that the LSTM model outperforms the others.
• Tamura et al. [7] proposed a two-dimensional approach for predicting stock
values.
• In this model, technical and financial indexes related to the Japanese stock
market are used as input data for an LSTM to make predictions.
• Using this data, the financial statements of other companies are retrieved
and also added to the database.
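
A sketch in the spirit of the LSTM approach described above, under loudly
labeled assumptions: the synthetic return windows, the layer sizes, and the
training settings are illustrative and are not the setup of Fischer and
Krauss [6] or Tamura et al. [7].

```python
# LSTM sketch for directional movement forecasting (illustrative only).
import numpy as np
from tensorflow import keras

window = 30                                       # assumed look-back length
X = np.random.randn(1000, window, 1).astype("float32")  # placeholder returns
y = (X.sum(axis=1) > 0).astype("float32")               # up/down labels

model = keras.Sequential([
    keras.layers.Input(shape=(window, 1)),
    keras.layers.LSTM(25),                        # memory cells over the window
    keras.layers.Dense(1, activation="sigmoid"),  # P(next move is upward)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=64, verbose=0)
print("directional accuracy:", model.evaluate(X, y, verbose=0)[1])
```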
Clustering
• Jain et al. proposed a clustering-based method using the degree of
similarity [8].
• In clustering, the objects are separated into groups called clusters.
• This type of learning is called unsupervised learning, as there is no prior
idea of the classes to which the objects belong.
• Based on the similarity measure criterion, cluster analysis has various
models: (i) connectivity models, generated from the connectivity distance,
i.e., hierarchical clustering; (ii) centroid models, where objects are
assigned to the nearest cluster center, i.e., k-means; (iii) distribution
models, generated by means of statistical distributions, i.e., the
expectation-maximization algorithm; (iv) density models, where clusters are
defined by high-density areas in the data; (v) graph-based models, where
graphs are used to express the dataset.
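
A minimal centroid-model sketch, assuming scikit-learn and two synthetic 2-D
blobs (both are illustrative): no class labels are given, and objects are
grouped purely by distance to the nearest cluster center.

```python
# k-means sketch: unsupervised grouping by nearest cluster center.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # first synthetic blob
               rng.normal(5, 1, (50, 2))])   # second synthetic blob

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first five assignments:", kmeans.labels_[:5])
```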
Association rules
• Association rules are suitable tools to represent new information that has
been extracted from a raw dataset.
• These rules express decisions in terms of implication rules.
• The rules indicate attribute combinations that occur frequently and with
high reliability in databases.
• For example, in a supermarket database, a rule such as {bread} => {butter}
indicates that customers who buy bread tend to also buy butter.
• Even though algorithms like ECLAT and FP-Growth are available for large
datasets, the Apriori algorithm, the generalized rule induction algorithm, and
their adaptations are often used.
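
A hedged Apriori sketch on a toy supermarket basket dataset; it assumes the
third-party mlxtend library, and the baskets and thresholds are illustrative:

```python
# Apriori sketch: mine frequent itemsets and implication rules from baskets.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

baskets = [["bread", "butter", "milk"],
           ["bread", "butter"],
           ["milk", "eggs"],
           ["bread", "butter", "eggs"]]

te = TransactionEncoder()                      # one-hot encode the baskets
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```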

Figure 1.2 shows the different types of techniques used in data science and
their applications.
Data Science Techniques
Time Series Analysis
• Time-series data, collected over time, is used to build a model of the data.
• This model is then used for predicting future values of the time series.
• The most frequently used methods are the following:
(i) techniques for exploratory analysis, for example wavelets, trend analysis,
and autocorrelation;
(ii) forecasting and prediction methods, for example signal estimation and
regression methods (see the sketch after this list);
(iii) classification techniques that assign a category to patterns in the
series;
(iv) segmentation, which aims to identify a sequence of points that share
particular properties;
(v) fuzzy extensions that allow the processing of uncertain and imprecise data
from different domains;
(vi) the fuzzy k-means method.
This method is similar to a clustering technique and has given efficient
results in different scenarios, as it permits the assignment of data elements
to one or more clusters.
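
A sketch combining items (i) and (ii), assuming statsmodels and a synthetic
seasonal series (the series, lag order, and forecast horizon are
illustrative): autocorrelation for exploration, then an autoregressive model
for forecasting.

```python
# Time-series sketch: exploratory autocorrelation, then an AR forecast.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(200)

print("autocorrelation, lags 0-4:", acf(series, nlags=4))   # exploration

model = AutoReg(series, lags=12).fit()                       # fit on history
forecast = model.predict(start=len(series), end=len(series) + 5)
print("next six forecast values:", forecast)
```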
1.3 Applications of Data Science in various domains
• Data science applications began in the narrow field of analytics and
statistics and have grown to be applied to different areas of industry and
science.
• Data science applications include the following:
(i) economic analysis of electric consumption
(ii) stock market prediction
(iii) bioinformatics
(iv) social media analytics
(v) email mining
(vi) big data analysis
(vii) SMS mining
1.3.1 Economic Analysis of Electric Consumption
• Electric companies and utilities have turned to data science to find out and
understand when and how consumers use energy.
• There has been increasing competition among companies that use data science
to develop such information.
• Traditionally, this information has been determined via classification,
clustering, and pattern analysis methods using association rules.
• For example, consumers can be grouped into classes, such as budget spenders
and big spenders, based on their behavior and usage of electricity.
• A comparative evaluation was made with self-organizing maps and an improved
version of the follow-the-leader method.
• This was an initial step toward setting tariffs for the electrical utilities.
• A framework was developed for exploiting the historical data (sketched
below), which consists of two modules: (i) a load-profile module, which
creates a set of customer classes by using unsupervised and supervised
learning, and (ii) a classification module, which builds models for assigning
customers to their respective classes.
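
A sketch of the two-module framework under stated assumptions: the random
24-hour load profiles, the number of customer classes, and the choice of
k-means plus a decision tree are illustrative, not the framework's actual
components.

```python
# Two-module sketch: (i) cluster load profiles into customer classes,
# (ii) train a classifier that assigns new customers to those classes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
profiles = rng.random((300, 24))          # 300 customers x 24 hourly readings

# Module (i): load-profile module -> unsupervised customer classes.
classes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles)

# Module (ii): classification module -> model assigning customers to classes.
clf = DecisionTreeClassifier().fit(profiles, classes)
new_customer = rng.random((1, 24))
print("assigned customer class:", clf.predict(new_customer)[0])
```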
1.3.2 Stock Market Prediction
• The application of ML and DL techniques in the stock market is increasing
compared to other areas of economics.
• Even though investing in the stock market can give profits, high risk is
often involved along with the high benefits.
• So, investors try to estimate and determine the value of a stock before they
make an investment.
• The price of a stock varies depending upon factors like local politics and
the economy, which makes it difficult to identify future trends in the stock
market.
• The LSTM technique can be used to forecast future trends in the stock market.
• The results have been compared with LOG, DNN, and RF, and show improvements
over the others.
• A new method for predicting stock values has been proposed, in which
financial data related to the Japanese stock market is used as prediction
input for LSTMs (long short-term memory networks).
• Further, the financial statements of the companies are retrieved and then
added to the database.
• Sharaff and Srinivasarao [16] proposed a Linear Support Vector Machine
(LSVM) to identify the correlation between the words in the content and the
subject of emails.
1.3.3 Bioinformatics
• Bioinformatics is a new area that uses computers to understand biological
data, such as genomics and genetics.
• This helps scientists understand the causes of disease, physiological
properties, and genetic properties.
• Various techniques can be utilized to estimate the applicability and
efficiency of different predictive methods in the classification task.
• Previous error estimation techniques have primarily focused on supervised
learning using microarray data.
• Michiels et al. [18] used various random datasets to predict cancer using
microarray data.
• Ambroise et al. [19] solved a gene selection problem based on microarray
data.
• Here, 10-fold cross-validation was used.
• Here, 0.632 bootstrap error estimates are used to deal with prediction rules
that are overfitted.
• The accuracy of 0.632 bootstrap estimators for microarray classification
using small datasets is examined in Braga et al.
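
A sketch of the 0.632 bootstrap error estimate mentioned above; the tiny
synthetic dataset, the k-NN classifier, and the number of bootstrap rounds are
illustrative assumptions, not the cited authors' setup.

```python
# 0.632 bootstrap sketch: blend resubstitution error with out-of-bag error
# to reduce the optimism of overfitted prediction rules.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 5))                          # small microarray-like data
y = (X[:, 0] + X[:, 1] > 1).astype(int)

clf = KNeighborsClassifier(n_neighbors=3)
resub_err = 1 - clf.fit(X, y).score(X, y)        # optimistic training error

B, oob_errs = 50, []
for _ in range(B):
    idx = rng.integers(0, len(X), len(X))        # bootstrap sample
    oob = np.setdiff1d(np.arange(len(X)), idx)   # out-of-bag points
    if len(oob) == 0:
        continue
    clf.fit(X[idx], y[idx])
    oob_errs.append(1 - clf.score(X[oob], y[oob]))

err_632 = 0.368 * resub_err + 0.632 * np.mean(oob_errs)
print("0.632 bootstrap error estimate:", err_632)
```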
1.3.4 Social Media Analytics
• Twitter data can be used to classify the sentiments expressed in tweets.
• Researchers have applied various machine learning methods to this task.
• A comparative study has been carried out using maximum entropy, naïve Bayes,
and positive-negative word counting.
• Wolny [22] proposed a model to recognize the emotion in Twitter data and
performed an emotion analysis study.
• Here, feelings and sentiments were discussed in detail by explaining the
existing methods.
• Emotion and sentiment are classified based on symbols via an unsupervised
classifier, and the lexicon approach was explained along with suggestions for
future research.
• Coviello et al. [23] analyzed emotional contagion in Facebook data.
• The instrumental variable regression technique was used to analyze the
Facebook data. Here, the negative and positive emotions of people during rainy
days were detected.
• Detecting the people who influence social networks is a difficult area of
research, but one of great interest, so that referral marketing and
information about products can reach the maximum possible network.
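
A minimal tweet-sentiment sketch with a naïve Bayes classifier; the tiny
labeled corpus below is an invented illustration, not data from the cited
studies.

```python
# Naive Bayes sentiment sketch: bag-of-words counts -> pos/neg prediction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["I love this phone", "what a great day", "this is awful",
          "I hate waiting", "best movie ever", "worst service ever"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)
print(model.predict(["what a great movie"]))     # expected: ['pos']
```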
1.3.5 Email Mining
• Spam emails are a threat to internet security.
• Spam emails are unwanted or unsolicited emails.
• Mailboxes become overloaded with these unwanted emails, causing losses in
storage and bandwidth and enabling the quick spread of wrong information and
malicious data.
• Gudkova et al. [25] conducted a study and found that 56% of all emails are
spam.
• Machine learning methods are successful at detecting spam.
• These include learning classifier models that map emails into spam or ham
classes using features like n-grams and others.
• Email features may be extracted either manually or automatically.
• Manually extracted rules are known as knowledge engineering, which requires
experts and regular updates to maintain good accuracy.
• Text mining methods are used for automated extraction of useful features,
such as words enabling spam discrimination, HTML markup, and so on.
• Using these features, an email is represented as a Bag-of-Words (BoW).
• Here, unstructured word tokens are used to discriminate spam messages from
the others. BoW assumes that word tokens are independent, which prevents it
from capturing the full semantic content of the email.
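
A Bag-of-Words spam-filter sketch under stated assumptions: the toy corpus and
the choice of logistic regression as the learning classifier are illustrative.

```python
# BoW spam sketch: word-token counts feed a classifier mapping emails
# into spam or ham classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "claim your free money",
          "meeting agenda for monday", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),  # word n-grams
                      LogisticRegression())
model.fit(emails, labels)
print(model.predict(["claim your free prize"]))             # likely ['spam']
```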
• Sharaff and Nagwani [30] identified email threads using Latent Dirichlet
Allocation (LDA)- and non-negative matrix factorization (NMF)-based
methodologies.
1.3.6 Big Data Analysis and Mining Methods
• Big data is one of the fastest-growing technologies and is critical to
handle in the present era.
• The information is used for analytical studies to help drive decisions that
give quick and improved services.
• Big data has three characteristics: velocity, volume, and variety, also
called the 3Vs.
• Here, data mining is a procedure in which potentially useful, unknown, and
hidden meaningful information is extracted from noisy, random, incomplete, and
fuzzy data.
• The extracted knowledge and information are used to derive new insights and
scientific findings and to influence business and scientific discovery.
• Two articles have aimed at improving the accuracy of data mining.
• 1. The skyline algorithm: here, a sorted positional index list (SSPL), which
has low space overhead, is used to reduce input/output cost.
• Table 1.1 shows an overview of data science methods used in different
applications.
1.4 Challenges and Opportunities
• This section presents the key issues, challenges, and opportunities related
to data science in different fields.
1.4.1 Challenges in Mathematical and Statistical Foundations
The main challenge is to find out why theoretical foundations are not enough to solve
complex problems, and then identify and obtain a helpful action plan.
1.4.2 Challenges in Social Issues
Here the challenges are to identify, specify, and respect social issues.
Any domain-specific data must be selected, and its related concerns, like
business, security, and privacy protection, should be accurately handled.
1.4.3 Data-to-Decision and Actions
• It is important to develop accurate decision-making systems that are
data-driven. These systems should also be able to manage and govern the
decision-making process.
1.4.4 Data Storage and Management Systems
• One of the challenges is designing a good storage and management system that
has the capability to handle large amounts of data at stream speed in real
time, and that can manage such data in an Internet-based environment,
including the cloud.

1.4.5 Data Quality Enhancement
• These are issues of data quality, like uncertainty, noise, imbalance, and so
on. The extent of these issues varies depending upon the complexity of the
data.

1.4.6 Deep Analytics and Discovery
• Cao [35] proposed new algorithms to deal with deep and implicit analytics
that cannot be tackled using existing descriptive, latent, and predictive
learning.
• Another challenge is how to combine model-based and data-driven
problem-solving solutions to balance domain-specific data complexity,
intelligence-driven evidence learning, and common learning frameworks.
1.4.7 High-Performance Processing and Analytics
• Systems must handle online, real-time, Internet-based, large-scale,
high-frequency data analytics and processing with balanced resource
involvement, which may be local and global.
• This requires new disk-array storage, batch processing, and
high-performance parallel processing.
• It is also necessary to support complex matrix calculations,
data-to-knowledge management, mixed data structures, and management systems.
1.4.8 Networking, Communication, and Interoperation
The challenge is how to support interoperation, communication, and networking
between various data science roles across a distributed and complete cycle of
problem-solving in data science.
Here, it is necessary to coordinate the management of tasks, data, workflows,
control, task scheduling, and governance.
1.5 Tools for Data Scientists
• These tools can be classified as: data and application integration; cloud
infrastructure; programming; visualization; high-performance processing;
analytics; master data management; business intelligence reporting; data
preparation and processing; and project management.
• The researcher can use any number of tools depending upon the
complexity of the problem being solved.
1.5.1 Cloud infrastructure
• Platforms like MapR, Google Cloud Platform, Amazon Web Services, Cloudera,
Spark, Apache Hadoop, and other systems may be used.
• Most traditional IT vendors at present use cloud platforms.
1.5.2 Data and application integration: This includes CloverETL, Information
Builders, Syncsort DMExpress, Oracle Data Integrator, Informatica, Ab Initio,
and so on.
1.5.3 Master data management: This includes the SAP NetWeaver Master Data
Management tool, Black Watch Data, Microsoft Master Data Services, Informatica
MDM, TIBCO MDM, Teradata Warehousing, and so on.
1.5.4 Data preparation and processing: Platforms and data preparation tools
include Wrangler Enterprise and Wrangler, Alpine Chorus, IBM SPSS, Teradata
Loom, and Platfora.
1.5.5 Analytics: This includes commercial tools like RapidMiner, MATLAB, IBM
SPSS Modeler and SPSS Statistics, and SAS Enterprise Miner, in addition to
newer tools like the Google Cloud Prediction API, MLbase, BigML, DataRobot,
and others.
1.5.6 Visualization
• Some commercial and free visualization software listed on KDnuggets includes
Miner3D, IRIS Explorer, Interactive Data Language, Quadrigram, ScienceGL, and
so on.
1.5.7 Programming: Java, Python, SQL, SAS, and R have been widely used for
data analytics.
Some data scientists have also adopted Go, Ruby, .NET, and JavaScript.
1.5.8 High-Performance Processing: Around 40 computer cluster software
programs, like Platform Cluster Manager, Moab Cluster Suite, Stacki, and
others, are listed on Wikipedia.

1.5.9 Business Intelligence Reporting: Some of the commonly used reporting
tools are SAP Crystal Reports, SAS Business Intelligence, MicroStrategy, and
IBM Cognos, among others.

1.5.10 Social Network Analysis: Around 30 tools have been listed for social
network analysis and data visualization, for example EgoNet, Cuttlefish,
Commetrix, Keynetiq, NodeXL, and so on. Figure 1.3 shows the different types
of programming languages that are used in data science.
