DSA Unit1
UNIT-1
• Data Science Applications in various domains
• Challenges and opportunities
• Tools for data scientists
• Recommender Systems: Introduction
• Methods
• Application
• Challenges
Unit-2: Time Series Data
• Stock market index movement forecasting
• Supply Chain Management
Figure 1.2 shows the different types of techniques used in data science and their applications.
Data Science Techniques
Time Series Analysis
• Time-series data, which is collected over time, is used to build a model of the underlying process.
• This model is then used to predict future values of the time series.
• The most often used methods are the following:
(i) Techniques for exploratory analysis, for example wavelets, trend analysis, and autocorrelation;
(ii) Forecasting and prediction methods, for example signal estimation and regression methods;
(iii) Classification techniques, which assign a category to patterns related to the series;
(iv) Segmentation, which aims to identify a sequence of points that share particular properties;
(v) A fuzzy extension that allows for processing uncertain and imprecise data related to different domains;
(vi) The fuzzy k-means method. This is a clustering technique that has given efficient results in different scenarios, as it permits the assignment of data elements to one or more clusters (see the sketch after this list).
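As an illustration of item (vi), below is a minimal sketch of fuzzy c-means (the usual name for the fuzzy k-means idea) written directly in NumPy; the synthetic 24-value profiles, the cluster count, and the fuzzifier m = 2 are illustrative assumptions, not values taken from the text.

import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=100, seed=0):
    # Fuzzy c-means: every data element receives a degree of membership
    # in each cluster instead of a single hard assignment.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # memberships of each point sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted means
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)            # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))       # standard membership update
        U = inv / inv.sum(axis=1, keepdims=True)
    return centres, U

# toy usage: two groups of synthetic 24-value profiles (illustrative data only)
X = np.vstack([np.random.randn(50, 24) + 5, np.random.randn(50, 24) - 5])
centres, memberships = fuzzy_c_means(X, n_clusters=2)
print(memberships[:3].round(2))                # soft memberships of the first 3 rows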
1.3 Applications of Data Science in various domains
• Data science applications began in the narrow field of analytics and statistics and have since expanded to many areas of industry and science.
• The data science applications can do the following:
(i) Economic analysis of electric consumption
(ii) Stock market prediction
(iii) Bioinformatics
(iv) Social media analytics
(v) Email mining
(vi) Big data analysis
(vii) SMS mining
1.3.1 Economic Analysis of Electric Consumption
• Electric companies and utilities have turned to data science to find out and understand when and how consumers use energy.
• There has been an increase in competition among companies that use data science to
develop such information.
• Traditionally, this information has been obtained via classification, clustering, and association-rule-based pattern analysis methods.
• Example: grouping consumers into classes based on their behavior and electricity usage, such as budget spenders and big spenders.
• A comparative evaluation was carried out with self-organizing maps and an improved version of the follow-the-leader method.
• This was a first step toward tariff design for electrical utilities.
• A framework was developed for exploiting the historical data, which consists of two
modules: (i) a load-profile module, which creates a set of customer classes by using
unsupervised and supervised learning, and (ii) a classification module, which builds
models for assigning customers to their respective classes.
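A minimal sketch of the two-module idea described above, assuming scikit-learn, synthetic hourly load profiles, and k-means plus a decision tree as illustrative stand-ins for the unsupervised and supervised learners used in the original framework:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# synthetic hourly load profiles: one row per customer, 24 hourly readings (kWh)
rng = np.random.default_rng(42)
profiles = np.vstack([
    rng.normal(2.0, 0.3, (100, 24)),   # low, flat usage ("budget spenders")
    rng.normal(6.0, 0.8, (100, 24)),   # high usage ("big spenders")
])

# (i) load-profile module: derive customer classes without labels
classes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)

# (ii) classification module: learn to assign customers to those classes
clf = DecisionTreeClassifier(max_depth=3).fit(profiles, classes)

# assign a previously unseen customer to a class
new_customer = rng.normal(5.5, 0.8, (1, 24))
print("assigned class:", clf.predict(new_customer)[0])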
1.3.2 Stock Market Prediction
• The application of ML and DL techniques in the stock market is increasing compared to other areas of economics.
• Even though investing in the stock market can be profitable, high benefits often come with high risk.
• So, investors try to estimate and determine the value of a stock before they make an investment.
• The price of a stock varies depending on factors like local politics and the economy, which makes it difficult to identify future trends of the stock market.
• The LSTM technique can be used to forecast future trends in the stock market (a minimal sketch appears after this list).
• The results have been compared with LOG, DNN, and RF models, and LSTM has shown improved results over the others.
• A new method for predicting stock values has been proposed, in which financial data from the Japanese stock market is used as prediction input to LSTMs (long short-term memories).
• Further, the financial statements of the companies are retrieved and then added to the database.
• Sharaff and Srinivasarao [16] proposed a Linear Support Vector Machine (LSVM) to identify the correlation among the words in the content and subject of emails.
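A minimal Keras sketch of LSTM-based price forecasting as referenced above; the synthetic price series, the 30-step window, the layer size, and the training settings are illustrative assumptions rather than the configuration of the cited study (real data would also normally be scaled before training):

import numpy as np
from tensorflow import keras

# synthetic closing-price series standing in for real market data
prices = np.cumsum(np.random.randn(500)) + 100.0

# turn the series into (30-step window -> next value) supervised pairs
window = 30
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X[..., None]                       # LSTM expects (samples, timesteps, features)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, 1)),
    keras.layers.Dense(1),             # predicted next closing price
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# one-step-ahead forecast from the most recent window
next_price = model.predict(prices[-window:].reshape(1, window, 1), verbose=0)
print("forecast:", float(next_price[0, 0]))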
1.3.3 Bioinformatics
• Bioinformatics is a new area that uses computers to understand biological data like genomics
and genetics.
• This helps scientists understand the cause of disease, physiological properties, and genetic
properties.
• We can utilize various techniques to estimate the applicability and efficiency of different predictive methods in classification tasks.
• Earlier error estimation techniques primarily focused on supervised learning using microarray data.
• Michiels et al. [18] have used various random datasets to predict cancer using microarray data.
• Ambroise et al. [19] solved a gene selection problem based on microarray data.
• Here, 10-fold cross-validation has been used.
• The 0.632 bootstrap error estimate is used to deal with prediction rules that are overfitted (a minimal sketch appears after this list).
• The accuracy of 0.632 bootstrap estimators for microarray classification with small datasets was studied by Braga et al.
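A minimal sketch of the 0.632 bootstrap error estimate mentioned above, assuming scikit-learn and a synthetic "microarray-like" dataset with few samples and many features; the classifier and the number of bootstrap rounds are illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

def bootstrap_632_error(clf, X, y, n_boot=100, seed=0):
    # 0.632 bootstrap: blend the optimistic resubstitution error with the
    # pessimistic error measured on samples left out of each bootstrap resample.
    rng = np.random.default_rng(seed)
    n = len(y)
    err_resub = np.mean(clf.fit(X, y).predict(X) != y)
    oob_errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                # draw n samples with replacement
        oob = np.setdiff1d(np.arange(n), idx)      # samples not drawn this round
        if oob.size == 0:
            continue
        clf.fit(X[idx], y[idx])
        oob_errors.append(np.mean(clf.predict(X[oob]) != y[oob]))
    return 0.368 * err_resub + 0.632 * np.mean(oob_errors)

# small synthetic "microarray-like" data set: few samples, many features
X, y = make_classification(n_samples=60, n_features=200, n_informative=10, random_state=1)
print("0.632 bootstrap error:", bootstrap_632_error(KNeighborsClassifier(3), X, y))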
1.3.4 Social Media Analytics
• Twitter data can be used to classify the sentiments expressed in tweets, and various machine learning methods have been applied for this task.
• A comparative study has been carried out using maximum entropy, naïve Bayes, and positive-negative word counting (a small naïve Bayes sketch appears after this list).
• Wolny [22] proposed a model to recognize the emotion in Twitter data and performed an
emotion analysis study.
• Here, feelings and sentiments were discussed in detail, with an explanation of existing methods.
• Emotion and sentiment are classified based on symbols via an unsupervised classifier, the lexicon-based approach was explained, and future research was suggested.
• Coviello et al. [23] have analyzed emotion contagion in Facebook data.
• The instrumental variable regression technique has been used to analyze the Facebook data.
Here, people's emotions, such as negative and positive emotions during rainy days, were detected.
• The detection of people who influence social networks is a difficult research task, but one of great interest, since it allows referral marketing and product information to reach the largest possible part of the network.
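A minimal sketch of the naïve Bayes variant from the comparative study mentioned above, using scikit-learn with a tiny hand-made tweet set; the example tweets and labels are purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny hand-made tweet set; a real study would use thousands of labelled tweets
tweets = [
    "I love this phone, amazing battery life",
    "What a great match, so happy today",
    "Worst service ever, totally disappointed",
    "I hate waiting in traffic, terrible morning",
]
labels = ["positive", "positive", "negative", "negative"]

# bag-of-words features feeding a multinomial naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)

print(model.predict(["happy with the new update", "this is terrible"]))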
1.3.5 Email Mining
• Spam emails pose a threat to internet security.
• Spam emails are nothing but unwanted or unsolicited emails.
• Mailboxes become overloaded with these unwanted emails, storage and bandwidth are wasted, and spam facilitates the rapid spread of false information and malicious content.
• Gudkova et al. [25] conducted a study and explained that 56% of all emails are spam emails.
• Machine learning methods have been successful at detecting spam.
• These include learned classifier models, which use features such as n-grams to map emails into spam or ham classes.
• Email features may be extracted either manually or automatically.
• Manually extracted rules are known as knowledge engineering, which requires experts and regular updates to maintain good accuracy.
• Text mining methods are used for automated extraction of useful features, such as words that enable spam discrimination, HTML markup, and so on.
• Using these features, an email is represented as Bag-of-Words (BoW).
• Here, the unstructured word tokens are used to discriminate spam messages from the others (a minimal classifier sketch appears after this list). BoW assumes the word tokens are independent, which prevents the representation from capturing much of the email's semantic content.
• Sharaff and Nagwani [30] have identified email threads using a Latent Dirichlet Allocation (LDA)- and non-negative matrix factorization (NMF)-based methodology.
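A minimal sketch of a BoW-based spam/ham classifier, using scikit-learn's LinearSVC as a stand-in for the linear SVM approach mentioned earlier; the toy emails and labels are purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# tiny illustrative corpus; real filters are trained on large labelled archives
emails = [
    "win a free prize now, click here",
    "cheap loans approved instantly, limited offer",
    "meeting moved to 3pm, see agenda attached",
    "please review the quarterly report draft",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-Words features feeding a linear support vector classifier
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(emails, labels)

print(model.predict(["free offer, claim your prize", "review the meeting agenda"]))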
1.3.6 Big Data Analysis and Mining Methods
• Big data is one of the fastest-growing technologies and is critical to handle in the present era.
• The information is used in analytical studies to help drive decisions for delivering quick and improved services.
• Big data has three characteristics: velocity, volume, and variety, also called the 3Vs.
• Data mining is a procedure by which potentially useful, unknown, and hidden meaningful information is extracted from noisy, random, incomplete, and fuzzy data.
• The extracted knowledge and information is used to derive new insights, support scientific discovery, and influence business decisions.
• Two articles have aimed at improving the accuracy of data mining.
• 1. The skyline algorithm: here, a sorted positional index list (SSPL), which has low space overhead, has been used to reduce the input/output cost (a basic skyline sketch appears after this list).
• Table 1.1 shows an overview of data science methods used in different applications
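For reference, a basic skyline computation (the non-dominated set) is sketched below; it is a plain O(n²) illustration of what a skyline query returns, not the SSPL-optimized algorithm described in the article, and the hotel data is made up:

def skyline(points):
    # A point is in the skyline if no other point is at least as good in every
    # dimension and strictly better in at least one ("lower is better" here).
    result = []
    for p in points:
        dominated = any(
            q != p and all(q[i] <= p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            result.append(p)
    return result

# hotels described by (price, distance to beach); lower is better in both
hotels = [(50, 8), (45, 6), (60, 2), (30, 9), (70, 1), (55, 3)]
print(skyline(hotels))   # only the offers not beaten on both criteria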
1.4 Challenges and Opportunities
• This section discusses key issues, challenges, and opportunities related to data science in different fields.
1.4.1 Challenges in Mathematical and Statistical Foundations
The main challenge is to find out why theoretical foundations are not enough to solve
complex problems, and then identify and obtain a helpful action plan.
1.4.2 Challenges in Social Issues
Here the challenges are to specify, respect, and identify social issues.
Any domain-specific data is to be selected, and then its related concepts—like business,
security , protection privacy—should be accurately handled.
1.4.3 Data-to-Decision and Actions
• It is important to develop accurate, data-driven decision-making systems. It is equally important to be able to manage and govern these decision-making systems.
1.4.4 Data Storage and Management Systems
• One of the challenges is designing a good storage and management system that can handle large amounts of data at stream speed in real time and can manage such data in an Internet-based environment, including the cloud.
1.5.10 Social Network Analysis: Around 30 tools have been listed for social network analysis and data visualization, for example Ego Net, Cuttlefish, Commetrix, Keynetiq, Node XL, and so on. Figure 1.3 shows the different types of programming languages that are used in data science.