Big Data
I. INTRODUCTION
In everyday life, people and devices constantly generate data. User activity generates data
about needs and preferences, as well as about the quality of user experiences. In 2014, estimates
put worldwide data generation at a staggering 7 ZB, and by 2018 each smartphone was expected
to generate 2 GB of data every month. At the same time, the total amount of data was expected
to grow at a rate of about 40 percent per year. In short, the generation and processing of data
grow exponentially [1], [2].
The question then arises: does big data mean only the storage and management of large
amounts of data? The obvious answer is no. Data becomes big data when its volume, velocity,
or variety exceeds the ability of an information system to ingest, store, analyze, and process
it [3]. Today, big data is understood largely from a technology perspective: the possibility
of better storage (volume), the ability to process information and make it available in real
time (velocity), and the ability to deal with various kinds of data sources, including structured,
semi-structured, and unstructured ones (variety). Much equipment and expertise have been
developed over the last few decades to handle large quantities of structured data, but with
increasing volumes and faster flows, most of these tools lack the ability to mine the data and
derive actionable intelligence in a timely way. Not only is the volume of this data growing too
fast for traditional analytics, but the speed with which it arrives and the variety of data types
necessitate new types of data processing and analytics solutions [2].
Big data analytics is the process of examining big data to uncover hidden patterns, unknown
correlations, and other useful information that can be used to make better decisions. With big
data analytics, data scientists and others can analyze huge volumes of data that conventional
analytics and business intelligence solutions cannot touch. Consider the case of handling
billions of rows of data with hundreds of millions of data combinations spread across multiple
data stores and abundant formats. High-performance analytics is then necessary to process or
mine them, and this is where big data analytics comes in [4].
The concept of big data was first introduced to the computing world by Roger Magoulas of
O'Reilly Media in 2005, to denote a great amount of data that traditional data management
techniques cannot manage and process due to its complexity and size. A study on the evolution
of big data as a research and scientific topic shows that the term has been present in research
since the 1970s. Nowadays the big data concept is treated from different points of view covering
its implications in many fields [5].
One can view the concept of big data analytics in terms of four dimensions [6]:
Volume refers to the quantity of data gathered by a company; this data must be used further
to obtain important knowledge.
Velocity refers to the time in which big data can be processed; some activities are very
important and need immediate responses, which is why fast processing maximizes efficiency.
Variety refers to the types of data that big data can comprise; this data can be structured as
well as unstructured.
Veracity refers to the degree to which a decision-maker can trust the information used; getting
the right correlations in big data is thus very important for the future of the business.
The main importance of big data lies in its potential to improve efficiency through the use
of large volumes of data of different types. If big data is defined properly and used accordingly,
organizations can get a better view of their business, leading to efficiency in different areas
such as sales and improvement of the manufactured product. When big data is effectively and
efficiently captured, processed, and analyzed, companies gain a more complete understanding
of their business, customers, products, and competitors, which can lead to efficiency improvements,
increased sales, lower costs, better customer service, and improved products and services [7].
Big data analysis plays an important role in different fields of research, including computer
vision, social media analysis, text analysis, biological network analysis, and audio analysis.
Currently, video processing is an emerging field of research [8]. The easy availability of digital
devices and cheap sensors has also increased the demand for surveillance systems, but the case
is not limited to recording surveillance video: with surveillance cameras, GoPros, Dropcams,
cell phones, and even old-fashioned camcorders, we are able to record video at unprecedented
scale. YouTube currently sees 100 hours of new content added every minute, and much
information may be embedded in the different frames of each video. Video analytics, also
referred to as video content analysis (VCA), involves a variety of techniques to monitor,
analyze, and extract meaningful information from video streams; surveillance is one part of
VCA [9]. The increasing prevalence of closed-circuit television (CCTV) cameras and the
popularity of video-sharing websites are the two leading contributors to the growth of
computerized video analysis. A key challenge, however, is the sheer size of video data; big
data technologies turn this challenge into an opportunity. The primary application of video
analytics in recent years has been in automated security and surveillance systems. Video analytics
can efficiently and effectively perform surveillance functions such as detecting suspicious objects,
identifying objects removed or left unattended, detecting loitering in a specific area, recognizing
suspicious activities, and detecting camera tampering [10].
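As a minimal illustration of one such surveillance primitive, the following Python sketch detects moving objects with OpenCV background subtraction; this is a common baseline technique rather than the specific method of any cited work, and the input file name is a hypothetical placeholder.

```python
# Moving-object detection via background subtraction (illustrative sketch).
import cv2

cap = cv2.VideoCapture("surveillance.mp4")  # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                               # foreground mask
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)   # drop shadow pixels
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:                             # ignore tiny blobs
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cap.release()
```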
The common weakness of most video processing systems is their inability to handle densely
crowded scenes. As the density of moving objects in the scene increases, a significant
degradation in surveillance performance is observed, caused by view variations, the varying
density of people, and the ambiguous appearance of body parts, e.g. parts of one object in
the scene may look similar to those of a nearby object. This inability to deal with crowded
scenes represents a significant problem. Such systems have different applications in real-life
scenarios, including crowd management, behavior analysis, public space design, virtual
environments, and intelligent environments [11].
Another potential application of video analytics in retail lies in the study of the buying behavior
of groups. Among family members who shop together, only one interacts with the store at the
cash register, causing traditional systems to miss data on the buying patterns of the other members.
Video analytics can help retailers address this missed opportunity by providing information about
the size of the group, the group's demographics, and the individual members' buying behavior.
Automatic video indexing and retrieval constitutes another domain of video analytics
applications. The widespread emergence of online and offline videos has highlighted the need
to index multimedia content for easy search and retrieval. The indexing of a video can be
performed based on different levels of information available in it, including the metadata, the
soundtrack, the transcripts, and the visual content of the video [12].
As described by Gandomi and Haider [3], there exist two approaches to video analytics: server-
based and edge-based. In a server-based architecture, the video captured by each camera
is routed back to a centralized, dedicated server that performs the video analytics. Due to
bandwidth limits, the video generated by the source is usually compressed by reducing the frame
rate and/or the image resolution, and the resulting loss of information can affect the accuracy
of the analysis. However, the server-based approach provides economies of scale and facilitates
easier maintenance. In an edge-based architecture, by contrast, the analytics are applied at the
edge of the system: the video analytics is performed locally, on the raw data captured by the
camera. As a result, the entire content of the video stream is available for the analysis, enabling
a more effective content analysis. Edge-based systems, however, are more costly to maintain and
have lower processing power than server-based systems.
Social media is a broad term encompassing a variety of online platforms that allow users
to create and exchange content. Social media can be categorized into the following types:
social networks (e.g., Facebook and LinkedIn), blogs (e.g., Blogger and WordPress), microblogs
(e.g., Twitter and Tumblr), social news (e.g., Digg and Reddit), social bookmarking (e.g.,
Delicious and StumbleUpon), media sharing (e.g., Instagram and YouTube), wikis (e.g.,
Wikipedia and WikiHow), question-and-answer sites (e.g., Yahoo! Answers and Ask.com),
and review sites (e.g., Yelp and TripAdvisor) [13]. The key characteristic of modern social
media analytics is its data-centric nature. Research on social media analytics spans several
disciplines, including psychology, sociology, anthropology, computer science, mathematics,
physics, and economics [14].
One popular research topic in social media analysis is community detection [15], where the
task is to extract implicit communities within a network. For online social networks, a community
refers to a sub-network of users who interact more extensively with each other than with the rest
of the network. Often containing millions of nodes and edges, online social networks tend to be
colossal in size. Community detection helps to summarize huge networks, which in turn facilitates
uncovering existing behavioral patterns and predicting emergent properties of the network. In
this regard, community detection is similar to clustering, which partitions a data set into disjoint
subsets based on the similarity of data points.
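As a minimal sketch of this idea, the snippet below runs greedy modularity maximization, one of many community detection algorithms, on a classic benchmark social network using NetworkX.

```python
# Community detection via greedy modularity maximization (illustrative).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # small benchmark social network
communities = greedy_modularity_communities(G)
for i, members in enumerate(communities):
    print(f"community {i}: {sorted(members)}")
```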
Link prediction addresses the problem of predicting future linkages between the existing
nodes of the underlying network. Typically, the structure of a social network is not static; it
continuously grows through the creation of new nodes and edges. A natural goal, therefore, is
to understand and predict the dynamics of the network. Link prediction techniques predict the
occurrence of interaction, collaboration, or influence among entities of a network in a specific
time interval [16].
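A minimal sketch of one neighborhood-based predictor follows: the Jaccard coefficient scores each non-adjacent node pair by the overlap of their neighbor sets, and high-scoring pairs are predicted to form future links. This is one simple instance of the family of techniques surveyed in [16].

```python
# Link prediction via the Jaccard coefficient (illustrative).
import networkx as nx

G = nx.karate_club_graph()
scores = nx.jaccard_coefficient(G)  # yields (u, v, score) for each non-edge
top = sorted(scores, key=lambda t: t[2], reverse=True)[:5]
for u, v, p in top:
    print(f"predicted link {u}-{v}: score {p:.3f}")
```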
Text analytics (text mining) refers to techniques that extract information from textual data.
Social network feeds, emails, blogs, online forums, survey responses, corporate documents,
news, and call center logs are examples of textual data held by organizations. Text analytics
involves statistical analysis, computational linguistics, and machine learning, and enables
businesses to convert large volumes of human-generated text into meaningful summaries that
support evidence-based decision-making. For instance, text analytics can be used to predict
stock market movements based on information extracted from financial news [3].
Text summarization techniques automatically produce a summary of a single document or of
multiple documents, such that the resulting summary conveys the key information in the original
text. Applications include scientific and news articles, advertisements, emails, and blogs. Text
summarization follows two approaches: the extractive approach and the abstractive approach. In
extractive summarization, a summary is created from the original text units, and the result is
a subset of the original document: formulating a summary involves determining the salient
units of a text and stringing them together, where the importance of the text units is evaluated
by analyzing their location and frequency in the text. Extractive summarization techniques do
not require an understanding of the text. In contrast, abstractive summarization techniques
involve extracting semantic information from the text, and the resulting summaries may contain
text units that are not present in the original text.
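The location-and-frequency idea behind extractive summarization can be sketched in a few lines; the scoring below (average corpus frequency of a sentence's words) and the naive tokenization are illustrative simplifications.

```python
# Frequency-based extractive summarization (illustrative sketch).
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Salience = average frequency of the sentence's words
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Emit the selected sentences in their original order
    return " ".join(s for s in sentences if s in chosen)
```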
Sentiment analysis techniques [17] analyze opinionated text, which contains people's opinions
toward entities such as products, organizations, individuals, and events. Businesses are
capturing increasingly more data about their customers' sentiments, which has led to the
proliferation of sentiment analysis. Marketing, finance, and the political and social sciences are
the major application areas of sentiment analysis. Sentiment analysis techniques are divided
into three sub-groups, namely document-level, sentence-level, and aspect-based. Document-level
techniques determine whether the whole document expresses a negative or a positive sentiment,
under the assumption that the document contains sentiments about a single entity. While certain
techniques categorize a document into two classes, negative and positive, others incorporate
more sentiment classes. Sentence-level techniques attempt to determine the polarity of a single
sentiment about a known entity expressed in a single sentence. Sentence-level techniques must
first distinguish subjective sentences from objective ones, and hence tend to be more complex
than document-level techniques. Aspect-based techniques recognize all sentiments within a
document and identify the aspects of the entity to which each sentiment refers. For instance,
customer product reviews usually contain opinions about different aspects (or features) of a
product. Using aspect-based techniques, the vendor can obtain valuable information about
different features of the product that would be missed if the sentiment were classified only
in terms of overall polarity.
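As a minimal sketch of the document-level case, the snippet below classifies a text by counting hits against a tiny hand-made polarity lexicon; the word lists are illustrative, and real systems use large lexicons or trained models.

```python
# Lexicon-based document-level sentiment classification (illustrative).
POSITIVE = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}

def document_sentiment(text):
    tokens = text.lower().split()
    # Polarity score = positive hits minus negative hits
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(document_sentiment("the battery life is great and the screen is excellent"))
```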
In biology, link prediction techniques are used to discover links or associations in biological
networks (e.g., protein-protein interaction networks), eliminating the need for expensive
experiments [18].
An important area of research in bioinformatics [19] is gene expression data classification
[20]. Bioinformatics is the application of computer technology to the management of biological
information. Microarray technology is an important tool used to monitor the expression levels
of the genes of a given organism. A microarray contains thousands of spots, and each spot
contains millions of copies of DNA molecules corresponding to a particular gene. Given
microarray data and some information on disease outcomes, a study can be conducted to
predict the disease in new patients.
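A minimal sketch of such a study is shown below: each row of the matrix holds one patient's per-gene expression levels, the labels encode the known outcome, and a linear classifier is evaluated on held-out patients. The data is randomly generated and purely illustrative.

```python
# Gene expression classification on a synthetic expression matrix (illustrative).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))    # 60 patients x 500 gene expression levels
y = rng.integers(0, 2, size=60)   # known disease outcome per patient (0/1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```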
Unlike bioinformatics, which focuses on individual molecules such as sequences of nucleic
acids and amino acids, systems biology focuses on systems composed of molecular components
and their interactions. Analysis of gene regulatory networks involves finding an optimal pathway
and the effect of gene expression/regulation on other pathways. In a particular tissue, not every
gene is expressed; only a subset of the genes is expressed, and this subset exhibits some pattern
over time. The objective is to find this temporal pattern.
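One simple way to search for such temporal patterns, sketched below under the assumption that each gene is represented by its expression profile over a series of time points, is to cluster the profiles, e.g. with k-means; the synthetic profiles here merely separate rising from falling trends.

```python
# Clustering temporal gene expression profiles with k-means (illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 10)                  # 10 time points
rising = t + rng.normal(0, 0.1, (50, 10))      # 50 genes trending upward
falling = -t + rng.normal(0, 0.1, (50, 10))    # 50 genes trending downward
profiles = np.vstack([rising, falling])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print("cluster sizes:", np.bincount(labels))
```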
Audio analytics uses a technique commonly referred to as audio mining, in which large volumes
of audio data are searched for specific audio characteristics. When applied in the area of speech
recognition, audio analytics identifies spoken words in the audio and puts them into a search file.
The two most common approaches to audio mining are text-based indexing and phoneme-based
indexing [3].
Call centers use audio analytics to analyze millions of hours of recorded calls. These
techniques help to improve the customer experience, evaluate the performance of call center
agents, monitor compliance with different policies (e.g., privacy and security policies), gain
insight into customer behavior, and identify product or service issues, among many other tasks.
Large-vocabulary continuous speech recognition (LVCSR) converts speech to text and then
uses a dictionary to understand what is being said. The dictionary typically contains up to
several hundred thousand entries, including generic words as well as industry- and company-
specific terms. Using the dictionary, the analytics engine processes the speech content of the
audio to generate a searchable index file. The index file contains information about the words
recognized in the audio data and can be quickly searched for keywords and phrases to retrieve
the relevant conversations that contain the search terms.
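The searchable index file can be thought of as an inverted index from recognized words to their positions in the audio; a minimal sketch follows, with illustrative transcript data.

```python
# Word-level inverted index over LVCSR output (illustrative sketch).
from collections import defaultdict

# (recording id, start time in seconds, recognized word), illustrative data
transcript = [
    ("call_001", 3.2, "refund"),
    ("call_001", 3.6, "policy"),
    ("call_002", 10.4, "refund"),
]

index = defaultdict(list)
for recording, start, word in transcript:
    index[word].append((recording, start))

# Keyword search: one dictionary lookup returns every occurrence
print(index["refund"])  # -> [('call_001', 3.2), ('call_002', 10.4)]
```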
Phonetic recognition does not require any conversion from speech to text; instead, it works
directly with sounds. The analytics engine first analyzes and identifies the sounds in the audio
content to create a phoneme-based index. It then uses a dictionary of several dozen phonemes
to convert a search term into the corresponding phoneme string, and the system looks for the
search term in the index.
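A minimal sketch of that search step is given below; the ARPAbet-style pronunciation entry and the phoneme index contents are illustrative.

```python
# Phoneme-based search: term -> phoneme string -> substring match (illustrative).
PRONUNCIATIONS = {"refund": "R IY F AH N D"}  # illustrative dictionary entry

def phonetic_search(term, phoneme_index):
    """phoneme_index maps recording id -> space-separated phoneme string."""
    query = PRONUNCIATIONS.get(term.lower())
    if query is None:
        return []
    return [rec for rec, phones in phoneme_index.items() if query in phones]

index = {"call_001": "HH AY R IY F AH N D P AA L AH S IY"}  # illustrative
print(phonetic_search("refund", index))  # -> ['call_001']
```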
In recent years, new technologies with lower costs have enabled improvements in data capture,
data storage, and data analysis. Organizations can now capture more data from many more
sources and types (blogs, social media feeds, audio and video files). The options to optimally
store and process the data have expanded dramatically, and technologies such as MapReduce
and in-memory computing provide highly optimized capabilities for different business purposes
[21].
Hadoop is a framework that provides open-source libraries for distributed computing using the
MapReduce programming model and its own distributed file system, known as the Hadoop
Distributed File System (HDFS). It is designed to scale out from a few computing nodes to
thousands of machines, each offering local computation and storage. One of Hadoop's main
value propositions is that it is designed to run on commodity hardware such as commodity
servers or personal computers, and has high tolerance for hardware failure. HDFS is a
fault-tolerant storage system that can store huge amounts of information, scale up incrementally,
and survive storage failures without losing data. Hadoop clusters are built with inexpensive
computers: if one computer (or node) fails, the cluster can continue to operate without losing
data or interrupting work by simply redistributing the work to the remaining machines in the
cluster. HDFS manages storage on the cluster by breaking files into small blocks and storing
duplicated copies of them across the pool of nodes. Currently, a number of emerging Hadoop
vendors offer their own customized versions of Hadoop [5].
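The MapReduce model underlying Hadoop can be illustrated with the canonical word count. The plain-Python sketch below only mimics the map, shuffle, and reduce phases in a single process; on Hadoop, the same logic is distributed across HDFS blocks and cluster nodes.

```python
# Word count expressed as map, shuffle, and reduce phases (illustrative).
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1                       # map: emit (word, 1) pairs

def reduce_phase(pairs):
    groups = defaultdict(list)                  # shuffle: group values by key
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(vals) for key, vals in groups.items()}  # reduce: sum

lines = ["big data needs big tools", "hadoop stores big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, ...}
```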
REFERENCES
[1] O. Kwon, N. Lee, and B. Shin, "Data quality management, data usage experience and acquisition intention of big data
analytics," International Journal of Information Management, vol. 34, no. 3, pp. 387–394, 2014.
[2] C. Yadav, S. Wang, and M. Kumar, "Algorithm and approaches to handle large data: A survey," IJCSN International Journal
of Computer Science and Network, vol. 2, no. 3, pp. 1–5, 2013.
[3] A. Gandomi and M. Haider, “Beyond the hype: Big data concepts, methods, and analytics,” International Journal of
Information Management, vol. 35, pp. 137–144, 2015.
[4] A. Labrinidis and H. V. Jagadish, “Challenges and opportunities with big data,” Proceedings of the VLDB Endowment,
vol. 5, no. 12, pp. 2032–2033, 2012.
[5] J. Fan, F. Han, and H. Liu, “Challenges of big data analysis,” National Science Review, vol. 1, no. 2, pp. 293–314, 2014.
[6] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, "MAD skills: New analysis practices for big data," in
Proceedings of VLDB'09, pp. 1–6, 2009.
[7] J. Lin and D. Ryaboy, “Scaling big data mining infrastructure: The twitter experience,” SIGKDD Explorations, vol. 14,
no. 2, pp. 6–19, 2014.
[8] A. Ghosh, B. N. Subudhi, and S. Ghosh, “Object detection from videos captured by moving camera by fuzzy edge
incorporated Markov Random Field and local histogram matching,” IEEE Transactions on Circuits and Systems for Video
Technology, vol. 22, no. 8, pp. 1127–1135, 2012.
[9] B. N. Subudhi, P. K. Nanda, and A. Ghosh, “A change information based fast algorithm for video object detection and
tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 7, pp. 993–1004, 2011.
[10] B. N. Subudhi, P. K. Nanda, and A. Ghosh, “Entropy based region selection for moving object detection,” Pattern
Recognition Letters, vol. 32, no. 15, pp. 2097–2108, 2011.
[11] B. Zhan, D. Monekosso, P. Remagnino, S. Velastin, and L.-Q. Xu, “Crowd analysis: A survey,” Machine Vision and
Applications, vol. 19, no. 5-6, pp. 345–357, 2008.
[12] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank, “A survey on visual content-based video indexing and retrieval,” IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 41, no. 6, pp. 797–819, 2011.
[13] J. Heidemann, M. Klier, and F. Probst, "Online social networks: A survey of a global phenomenon," Computer Networks,
vol. 56, no. 18, pp. 3866–3878, 2012.
[14] P. Gundecha and H. Liu, “Mining social media: A brief introduction,” Tutorials in Operations Research, vol. 1, no. 4,
pp. xxx–xxx, 2012.
[15] L. Tang and H. Liu, "Community detection and mining in social media," Synthesis Lectures on Data Mining and Knowledge
Discovery, vol. 2, no. 1, pp. 1–137, 2010.
[16] D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks," in Proceedings of the Twelfth
International Conference on Information and Knowledge Management, pp. 556–559, ACM, 2003.
[17] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and Trends in Information Retrieval, vol. 2,
no. 1–2, pp. 1–135, 2008.
[18] A. Ghosh, U. Seiffert, and L. Jain, “Evolutionary computation in bioinformatics,” Journal of Intelligent and Fuzzy Systems,
vol. 18, no. 7, pp. 25–26, 2007.
[19] P. Mahanta, D. K. Bhattacharyya, and A. Ghosh, "FUMET: A fuzzy network module extraction technique for gene expression
data," Journal of Biosciences, (In press), 2015.
[20] M. Roy, A. Law, and S. Ghosh, “Semi-supervised self-organizing feature map for gene expression data classification,” in
Proceedings of 5th International Conference on Pattern Recognition and Machine Intelligence-PReMI 2013, pp. 688–694,
2013.
[21] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A self-tuning system for big data
analytics," in Proceedings of the 5th Biennial Conference on Innovative Data Systems Research, pp. 261–272, 2011.