Big Data Seminar Report
BELGAVI, KARNATAKA
Submitted By:
SNEHA K S
(4MG18CS038)
CERTIFICATE
Certified that the seminar work entitled “BIG DATA ANALYTICS” has been
presented by SNEHA K S (4MG18CS038) for the partial fulfilment of the Eighth
Semester, B.E. degree in Computer Science & Engineering of Visvesvaraya
Technological University, Belagavi, during the year 2021-22. It is certified that all
corrections/suggestions indicated have been incorporated in the report. The seminar
report has been approved and certified as per the requirements.
GMIT, Bharathinagara
DECLARATION
The project entitled “BIG DATA ANALYTICS” was duly executed by me, SNEHA
K S (4MG18CS038), Eighth Semester, B.E. in Computer Science and Engineering, G
Madegowda Institute of Technology, Bharathinagara, under the guidance of Mr.
Pradeep B M, Associate Prof., Dept. of CS&E, G Madegowda Institute of
Technology, Bharathinagara, 2021-2022. I hereby declare that the above-entitled
project work was executed only by me, for the partial fulfilment of the requirement for
the award of the Bachelor degree in Computer Science and Engineering prescribed
by Visvesvaraya Technological University, “Jnana Sangama”, Belagavi 590014.
SNEHA K S
(4MG18CS038)
ACKNOWLEDGEMENT
I feel great pleasure in acknowledging the guidance and assistance of all
those people who have made my work on this report a pleasant endeavour.
I also thank the members of the faculty of the Department of Computer Science and
Engineering, GMIT, Bharathinagara, whose suggestions enabled me to surpass
many of the seemingly impossible hurdles. I also thank my guide, and lastly, I
thank everybody who has directly or indirectly helped me in the course of this work.
SNEHA K S(4MG18CS038)
ABSTRACT
Big data is a term for massive data sets having a large, varied and complex
structure, with attendant difficulties in storing, analysing and visualizing them for
further processing or results. The process of researching massive amounts of data to
reveal hidden patterns and secret correlations is named big data analytics. This
information is useful for companies and organizations, helping them gain richer and
deeper insights and an advantage over the competition. For this reason, big
data implementations need to be analysed and executed as accurately as possible.
This paper presents an overview of big data's content, scope, samples, methods,
advantages and challenges, and discusses privacy concerns around it.
CONTENTS
1. Introduction
2. Literature Survey
3. Concept of Topic
4. Advantages and Disadvantages of Big Data
5. Applications of Big Data
6. Conclusion
7. References
INTRODUCTION
Big data is a broad term for data sets so large or complex that traditional data
processing applications are inadequate. Challenges include analysis, capture, data
curation, search, sharing, storage, transfer, visualization, and information privacy.
The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of
data set. Accuracy in big data may lead to more confident decision-making, and
better decisions can mean greater operational efficiency, cost reductions and
reduced risk.
Analysis of data sets can find new correlations to "spot business trends,
prevent diseases, combat crime and so on." Scientists, practitioners of media
and advertising and governments alike regularly meet difficulties with large
data sets in areas including Internet search, finance and business informatics.
Scientists encounter limitations in e-Science work, including meteorology,
genomics, connectomes, complex physics simulations, and biological and
environmental research.
Data sets grow in size in part because they are increasingly being gathered
by cheap and numerous information-sensing mobile devices, aerial (remote
sensing), software logs, cameras, microphones, radio-frequency identification
(RFID) readers, and wireless sensor networks. The world's technological per-capita
capacity to store information has roughly doubled every 40 months since
the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were
created. The challenge for large enterprises is determining who should own big
data initiatives that straddle the entire organization.
Recent studies show that the use of a multiple-layer architecture is one option
for dealing with big data. A distributed parallel architecture distributes data
across multiple processing units, and parallel processing units deliver data much
faster, improving processing speeds. This type of architecture inserts data into
a parallel DBMS, which implements the use of MapReduce and Hadoop
frameworks. This type of framework looks to make the processing power
transparent to the end user by using a front-end application server.
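As a concrete illustration of the map/reduce pattern such frameworks implement, here is a minimal word-count sketch (the classic MapReduce example) in plain Python. The input splits and words are invented for illustration; in Hadoop the map calls would run in parallel across many nodes, whereas here they run sequentially for clarity.

```python
from collections import Counter
from functools import reduce

def map_phase(split):
    # Map: emit a partial count of each word in one input split.
    return Counter(split.split())

def reduce_phase(left, right):
    # Reduce: merge the partial counts produced by the mappers.
    return left + right

def word_count(splits):
    # Hadoop would schedule these map calls on many nodes in parallel.
    partials = [map_phase(s) for s in splits]
    return reduce(reduce_phase, partials, Counter())

splits = ["big data big insights", "data grows fast", "big clusters scale"]
print(word_count(splits)["big"])  # 3
```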
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable
elapsed time. Big data "size" is a constantly moving target; as of 2012 it ranged
from a few dozen terabytes to many petabytes of data. Big data is a set of
techniques and technologies that require new forms of integration to uncover large
hidden values from large datasets that are diverse, complex, and of a massive
scale.
In a 2001 research report and related lectures, META Group (now Gartner)
analyst Doug Laney defined data growth challenges and opportunities as being
three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data
in and out), and variety (range of data types and sources). Gartner, and now much
of the industry, continue to use this "3Vs" model for describing big data. In 2012,
Gartner updated its definition as follows: "Big data is high volume, high velocity,
and/or high variety information assets that require new forms of processing to
enable enhanced decision making, insight discovery and process optimization."
Additionally, some organizations have added a fourth V, "Veracity", to describe it.
While Gartner's definition (the 3Vs) is still widely used, the growing maturity of
the concept fosters a sounder distinction between big data and business
intelligence, regarding data and their use.
A more recent, consensus definition states that "Big Data represents the
Information assets characterized by such a High Volume, Velocity and Variety to
require specific Technology and Analytical Methods for its transformation into
Value".
ADVANTAGES AND DISADVANTAGES OF BIG
DATA
ADVANTAGES:
• Our newest research finds that organizations are using big data to target
customer-centric outcomes, tap into internal data and build a better
information ecosystem.
• Big Data is already an important part of the $64 billion database and data
analytics market. It offers commercial opportunities of a comparable scale to
enterprise software in the late 1980s, the Internet boom of the 1990s, and the
social media explosion of today.
DISADVANTAGES:
• Organizations risk being overwhelmed by the sheer volume of data
• Self-regulation
• Legal regulation
APPLICATIONS OF BIG DATA
Big data has increased the demand for information management specialists, so
much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have
spent more than $15 billion on software firms specializing in data management
and analytics. In 2010, this industry was worth more than $100 billion and was
growing at almost 10 percent a year: about twice as fast as the software business as
a whole.
Government
The use and adoption of Big Data within governmental processes is beneficial and
allows efficiencies in terms of cost, productivity, and innovation. That said, this
process does not come without its flaws. Data analysis often requires multiple
parts of government (central and local) to work in collaboration and create new
and innovative processes to deliver the desired outcome.
Below are thought-leading examples within the governmental Big Data space.
• In 2012, the Obama administration announced the Big Data Research and
Development Initiative, to explore how big data could be used to address
important problems faced by the government. The initiative is composed of
84 different big data programs spread across six departments.
• Big data analysis played a large role in Barack Obama's successful
2012 re-election campaign.
• The United States Federal Government owns six of the ten
most powerful supercomputers in the world.
• The Utah Data Center is a data center being constructed by the
United States National Security Agency.
India
• Big data analysis was, in part, responsible for the BJP and its allies
winning the 2014 Indian general election.
• The Indian Government utilises numerous techniques to ascertain how the
Indian electorate is responding to government action, as well as to gather
ideas for policy augmentation.
United Kingdom
Manufacturing
Based on a 2013 TCS Global Trend Study, improvements in supply planning and
product quality provide the greatest benefit of big data for manufacturing. Big data
provides an infrastructure for transparency in the manufacturing industry, that is,
the ability to unravel uncertainties such as inconsistent component performance and
availability. Predictive manufacturing, as an applicable approach toward near-zero
downtime and transparency, requires vast amounts of data and advanced prediction
tools for the systematic processing of data into useful information.
Cyber-Physical Models
Current PHM implementations mostly utilize data during the actual usage
while analytical algorithms can perform more accurately when more
information throughout the machine’s lifecycle, such as system
configuration, physical knowledge and working principles, are
included. There is a need to systematically integrate, manage and analyze
machinery or process data during different stages of machine life cycle to handle
data/information more efficiently and further achieve better transparency of
machine health condition for manufacturing industry.
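To make the idea concrete, the following is a minimal, hypothetical sketch (not any specific PHM system): it flags readings from a single machine sensor that drift well outside a recent baseline, the kind of simple health indicator a PHM pipeline might derive from usage data. The sensor values, window size and threshold are invented for illustration.

```python
import statistics

def flag_anomalies(readings, window=5, k=3.0):
    """Flag readings more than k standard deviations away from the
    mean of the preceding `window` values (a toy health indicator)."""
    flags = []
    for i, value in enumerate(readings):
        baseline = readings[max(0, i - window):i]
        if len(baseline) < 2:
            flags.append(False)  # not enough history to judge yet
            continue
        mean = statistics.mean(baseline)
        sd = statistics.stdev(baseline)
        flags.append(sd > 0 and abs(value - mean) > k * sd)
    return flags

# Invented vibration readings with one obvious spike:
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.1, 5.0, 1.0]
flags = flag_anomalies(vibration)
print(flags.index(True))  # the spike at index 6 is the only flagged reading
```

A real PHM pipeline would fold in lifecycle information (configuration, physical models) rather than a raw threshold, which is exactly the gap the paragraph above describes.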
Media
To understand how the media utilises Big Data, it is first necessary to provide
some context into the mechanism used for media process. It has been suggested by
Nick Couldry and Joseph Turow that practitioners in Media and Advertising
approach big data as many actionable points of information about millions of
individuals. The industry appears to be moving away from the traditional
approach of using specific media environments such as newspapers, magazines, or
television shows and instead tap into consumers with technologies that reach
targeted people at optimal times in optimal locations. The ultimate aim is to serve,
or convey, a message or content that is (statistically speaking) in line with the
consumers mindset. For example, publishing environments are increasingly
tailoring messages (advertisements) and content (articles) to appeal to consumers
that have been exclusively gleaned through various data-mining activities.
Big Data and the IoT work in conjunction. From a media perspective, data is the
key derivative of device interconnectivity and allows accurate targeting. The
Internet of Things, with the help of big data, therefore transforms the media
industry, companies and even governments, opening up a new era of economic
growth and competitiveness. The intersection of people, data and intelligent
algorithms has far-reaching impacts on media efficiency. The wealth of data
generated adds an elaborate layer to the present targeting mechanisms of the
industry.
Technology
• eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well
as a 40 PB Hadoop cluster for search, consumer recommendations, and
merchandising.
• Amazon.com handles millions of back-end operations every day, as
well as queries from more than half a million third-party sellers. The
core technology that keeps Amazon running is Linux-based, and as of
2005 they had the world's three largest Linux databases, with capacities
of 7.8 TB, 18.5 TB, and 24.7 TB.
• Facebook handles 50 billion photos from its user base.
• As of August 2012, Google was handling roughly 100 billion searches
per month.
Private sector
Retail
Retail Banking
Real Estate
• Windermere Real Estate uses anonymous GPS signals from nearly 100
million drivers to help new home buyers determine their typical drive
times to and from work throughout various times of the day.
Science
The Large Hadron Collider experiments represent about 150 million sensors
delivering data 40 million times per second. There are nearly 600 million
collisions per second. After filtering and refraining from recording more than
99.99995% of these streams, there are 100 collisions of interest per second.
• As a result, working with less than 0.001% of the sensor stream
data, the data flow from all four LHC experiments represents a 25-petabyte
annual rate before replication (as of 2012). This becomes
nearly 200 petabytes after replication.
• If all sensor data were to be recorded at the LHC, the data flow would be
extremely hard to work with. It would exceed a 150 million
petabyte annual rate, or nearly 500 exabytes per day, before
replication. To put the number in perspective, this is equivalent to 500
quintillion (5×10^20) bytes per day, almost 200 times more than all the
other sources combined in the world.
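As a back-of-the-envelope check, the filtering fraction and the daily rate can be reproduced in a few lines of Python using only the figures quoted above:

```python
# Sanity check of the LHC figures quoted above (2012-era numbers).
collisions_per_second = 600e6   # quoted collision rate
kept_per_second = 100           # collisions of interest after filtering

fraction_rejected = 1 - kept_per_second / collisions_per_second
print(f"rejected: {fraction_rejected:.7%}")  # more than 99.99995%

# Hypothetical unfiltered flow, from the quoted annual figure:
petabytes_per_year = 150e6
exabytes_per_day = petabytes_per_year / 365 / 1000
print(f"{exabytes_per_day:.0f} EB/day")  # ~411 EB/day, the order of magnitude quoted
```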
CONCLUSION
The Age of Big Data is here, and these are truly revolutionary
times, if both business and technology professionals continue to work
together and deliver on the promise.
REFERENCES
1. Adams, M.N.: Perspectives on Data Mining. International Journal of
Market Research 52(1), 11–19 (2010).
2. Asur, S., Huberman, B.A.: Predicting the Future with Social Media. In:
ACM International Conference on Web Intelligence and Intelligent Agent
Technology, vol. 1, pp. 492–499 (2010).
3. Bakshi, K.: Considerations for Big Data: Architecture and Approaches. In:
Proceedings of the IEEE Aerospace Conference, pp. 1–7 (2012).
4. Cebr: Data Equity: Unlocking the Value of Big Data. In: SAS Reports, pp.
1–44 (2012).
5. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M., Welton, C.: MAD
Skills: New Analysis Practices for Big Data. Proceedings of the VLDB
Endowment 2(2), 1481–1492 (2009).