Big Data Summary
What is Data?
Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner.
Characteristics of Data
Veracity is about the trustworthiness of the data: data collected from trusted, reliable
sources scores well on this dimension, while data from unverified sources does not. Veracity
refers to the inconsistencies and uncertainty in data; the data that is available can be messy,
and its quality and accuracy are difficult to control. Big Data is also variable because of the
multitude of data dimensions resulting from multiple disparate data types and sources.
Example: data in bulk can create confusion, whereas too little data may convey only half or
incomplete information.
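Veracity problems can be made concrete with a simple data-quality check. The records and
rules below are invented for illustration; a real pipeline would apply many more checks:

```python
# Sketch of a veracity check: flag records with missing, implausible,
# or inconsistently formatted fields before analysis. Sample records
# and rules are invented for illustration.

records = [
    {"customer": "A", "age": 34, "country": "US"},
    {"customer": "B", "age": None, "country": "US"},  # missing value
    {"customer": "C", "age": -5, "country": "us"},    # implausible value, inconsistent case
]

def issues(rec):
    """Return a list of quality problems found in one record."""
    found = []
    if rec["age"] is None:
        found.append("missing age")
    elif rec["age"] < 0:
        found.append("implausible age")
    if rec["country"] != rec["country"].upper():
        found.append("inconsistent country code")
    return found

bad = {r["customer"]: issues(r) for r in records if issues(r)}
print(bad)  # only records B and C are flagged
```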
Challenges of Conventional Systems:
Fundamental challenges
– How to store voluminous data,
– how to work with data at that scale,
– and, more important, how to understand the data and turn it into a competitive advantage.
How has conventional system technology kept pace?
• CPU Speeds:
– 1990 - 44 MIPS at 40 MHz
– 2000 - 3,561 MIPS at 1.2 GHz
– 2010 - 147,600 MIPS at 3.3 GHz
• RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2000 – 64MB memory
– 2010 - 8-32GB (and more)
• Disk Capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
• Disk Latency (speed of reads and writes) – not much improvement in last 7-10 years,
currently around 70 – 80MB / sec
How long will it take to read 1TB of data?
• 1TB (at 80MB / sec):
• 1 disk - 3.4 hours
• 10 disks - 20 min
• 100 disks - 2 min
• 1000 disks - 12 sec
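The figures above follow from simple division: total data divided by aggregate throughput. A
minimal sketch, assuming the 80 MB/s per-disk throughput quoted above and 1 TB = 1,000,000 MB:

```python
# Time to read 1 TB sequentially when the data is striped evenly
# across N disks, each reading at 80 MB/s.

TERABYTE_MB = 1_000_000
THROUGHPUT_MB_S = 80

def read_time_seconds(disks: int) -> float:
    """Seconds to read 1 TB across `disks` disks in parallel."""
    return TERABYTE_MB / (THROUGHPUT_MB_S * disks)

for disks in (1, 10, 100, 1000):
    secs = read_time_seconds(disks)
    print(f"{disks:>4} disk(s): {secs:,.0f} s ({secs / 3600:.1f} h)")
```

This is the core argument for systems like Hadoop: disk latency barely improves, so the only
way to read big data quickly is to spread it over many disks and read in parallel.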
Structured Data
1. Created:
Created data is just that; data businesses purposely create, generally for market research. This
may consist of customer surveys or focus groups. It also includes more modern methods of
research, such as creating a loyalty program that collects consumer information or asking
users to create an account and login while they are shopping online.
2. Provoked:
A Forbes article defined provoked data as “giving people the opportunity to express their
views.” Every time a customer rates a restaurant, an employee, a purchasing experience, or a
product, they are creating provoked data. Rating sites, such as Yelp, also generate this type of
data.
3. Transacted:
Transactional data is also fairly self-explanatory. Businesses collect data on every transaction
completed, whether the purchase is completed through an online shopping cart or in-store at
the cash register. Businesses also collect data on the steps that lead to a purchase online. For
example, a customer may click on a banner ad that leads them to the product pages which
then spurs a purchase.
As explained by the Forbes article, “Transacted data is a powerful way to understand exactly
what was bought, where it was bought, and when. Matching this type of data with other
information, such as weather, can yield even more insights. (We know that people buy more
Pop-Tarts at Walmart when a storm is predicted.)”
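The kind of matching the article describes amounts to joining transaction records with another
dataset on a shared key, here the date. A minimal sketch; the products, quantities, and
forecasts below are invented:

```python
# Hypothetical illustration: enrich transaction records with weather
# data by joining on date, to see what sells when a storm is forecast.
# All values are invented.

transactions = [
    {"date": "2024-03-01", "product": "pop-tarts", "units": 120},
    {"date": "2024-03-02", "product": "pop-tarts", "units": 480},
    {"date": "2024-03-02", "product": "umbrellas", "units": 95},
]
weather = {"2024-03-01": "clear", "2024-03-02": "storm warning"}

# Attach the day's forecast to each transaction.
enriched = [dict(t, forecast=weather[t["date"]]) for t in transactions]

storm_sales = {t["product"]: t["units"] for t in enriched
               if t["forecast"] == "storm warning"}
print(storm_sales)  # {'pop-tarts': 480, 'umbrellas': 95}
```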
4. Compiled:
Compiled data consists of the giant databases of information collected on every U.S. household. Companies like
Acxiom collect information on things like credit scores, location, demographics, purchases
and registered cars that marketing companies can then access for supplemental consumer
data.
5. Experimental:
Experimental data is created when businesses experiment with different marketing pieces and
messages to see which are most effective with consumers. You can also look at experimental
data as a combination of created and transactional data.
Unstructured Data
People in the business world are generally very familiar with the types of structured data
mentioned above. Unstructured data is a little less familiar, not because there is less of it,
but because, before technologies like NoSQL and Hadoop came along, harnessing unstructured data
wasn’t practical. In fact, most data being created today is unstructured. Unstructured data, as
the name suggests, lacks structure. It can’t be gathered based on clicks, purchases, or a
barcode, so what is it exactly?
6. Captured:
Captured data is created passively as a result of a person’s behavior. Every time someone
enters a search term on Google, that is data that can be captured for future benefit. The GPS
information on our smartphones is another example of passive data that can be captured with big
data technologies.
7. User-generated:
User-generated data consists of all of the data individuals are putting on the Internet every
day. From tweets, to Facebook posts, to comments on news stories, to videos put up on
YouTube, individuals are creating a huge amount of data that businesses can use to better
target consumers and get feedback on products.
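What tools like Hadoop do with such data is impose structure at read time, for example by
reducing free text to counts. A minimal word-count sketch; the sample posts are invented:

```python
# Minimal sketch: deriving structure (word frequencies) from
# unstructured, user-generated text. The sample posts are invented.
import re
from collections import Counter

posts = [
    "Love the new phone, battery life is great",
    "Battery drains fast, not happy with the phone",
    "Great camera on this phone",
]

words = Counter()
for post in posts:
    # Lowercase and strip punctuation so "Battery" and "battery," match.
    words.update(re.findall(r"[a-z]+", post.lower()))

# The most frequent tokens hint at what customers talk about.
print(words.most_common(3))
```

This per-record tokenize-and-count step is exactly what the classic MapReduce word-count
example parallelizes across a cluster.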
Big data is made up of many different types of data. The seven listed above comprise types of
external data included in the big data spectrum. There are, of course, many types of internal
data that contribute to big data as well, but hopefully breaking down the types of data helps
you to better see why combining all of this data into big data is so powerful for business.
Sources of Big Data
Classification of Types of Big Data
There is a vast opportunity offered by Big Data technologies to discover new insights that
drive significant business value. Industries are seeing data as a market differentiator and have
started reinventing themselves as “data companies”, as they realise that information has
become their biggest asset. This trend is prevalent in industries such as telecommunications,
internet search firms, marketing firms, etc. who see their data as a key driver for monetisation
and growth. Insights such as footfall traffic patterns from mobile devices have been used to
assist city planners in designing more efficient traffic flows. Customer sentiment analysis
through social media and call logs have given new insights into customer satisfaction.
Network performance patterns have been analysed to discover new ways to drive efficiencies.
Customer usage patterns based on web click-stream data have driven innovation for new
products and services to increase revenue. The list goes on.
Key to success in any Big Data analytics initiative is to first identify the business needs and
opportunities, and then select the proper fit-for-purpose platform. With the array of new Big
Data technologies emerging at a rapid pace, many technologists are eager to be the first to
test the latest Dr. Seuss-termed platform. But each technology has a unique specialisation,
and might not be aligned to the business priorities. In fact, some identified use cases from the
business might be best suited by existing technologies such as a data warehouse while others
require a combination of existing technologies and new Big Data systems.
With this integration of disparate data systems comes the 5th V – Veracity, i.e. the
correctness and accuracy of information. Behind any information management practice lies
the core doctrines of Data Quality, Data Governance, and Metadata Management, along with
considerations for Privacy and Legal concerns. Big Data needs to be integrated into the entire
information landscape, not seen as a stand-alone effort or a stealth project done by a handful
of Big Data experts.
BIG DATA ANALYTICS
In the excitement and hype around Big Data analytics, it’s easy to see this emerging
technology as a “silver bullet” that can magically generate new insights solely through
powerful technology and smart data scientists. As in any age of change, however, core
principles still apply, and in order to gain insights from Big Data, you need to make sure your
“little data” is correct. Many of the “golden nuggets” of discovery are obtained through an
intersection of Big Data analytics with traditional sources such as a data warehouses or
master data management hubs.
Customer sentiment analysis is a common use-case for Big Data analytics—i.e. what are our
customers saying about our products in social media and/or call log records? And how can
we leverage this information to improve our business? Unless you have a robust ‘single
source of record’ for customer information, new discoveries from Big Data analytics will be
of little use. Was it Jane R. Doe or Jane P. Doe complaining about the new luxury sedan
model? With data properly managed within an information management framework, the full
value of Big Data becomes apparent and “golden nuggets” of information can appear. For
example, not only did Jane R. Doe complain about the new luxury sedan, but she had five
service calls about her transmission. She has purchased five high-priced sedans from us in the
past ten years and has an income of over $750,000. Jane R. Doe recently followed our
competitor on Twitter and has asked several questions about new features. It might be worth
having a representative call her personally.
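The "single source of record" point above is an entity-resolution problem: a mention in social
media must map to exactly one master record. A deliberately simplistic sketch using the names
from the example; real matching would use fuzzier rules and more attributes:

```python
# Sketch: resolving a social-media mention against customer master
# records, so "Jane R. Doe" in a tweet maps to one known customer.
# Records follow the example in the text; the exact-name matching
# rule is deliberately simplistic.

master = [
    {"id": "C001", "name": "Jane R. Doe", "service_calls": 5},
    {"id": "C002", "name": "Jane P. Doe", "service_calls": 0},
]

mention = {"source": "twitter", "author": "Jane R. Doe",
           "text": "Disappointed with the new luxury sedan"}

matches = [c for c in master if c["name"] == mention["author"]]
if len(matches) == 1:
    print(f"Complaint attributed to customer {matches[0]['id']}")
else:
    print("Ambiguous match: route to manual review")
```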
Big Data analytics is an exciting development in the field of information management and, if
used properly, can generate a wealth of opportunity. In order to discover the “golden
nuggets” in your organisation, remember these guiding principles:
•Start with your business goals and drivers and align them to fit-for-purpose technologies (not
the other way around)
•Integrate your Big Data initiatives with core information management practices
•Build your information management practice on a core framework that includes data
governance, data quality management, metadata management, and the other principles that create
a trusted source of information
Lastly, have fun—this is an exciting time to be in information management. New
technologies are emerging almost daily that can add significant value to your organisation,
particularly in the Big Data space.
A big data architecture is designed to handle the ingestion, processing, and analysis of data
that is too large or complex for traditional database systems. The threshold at which
organizations enter into the big data realm differs, depending on the capabilities of the users
and their tools. For some, it can mean hundreds of gigabytes of data, while for others it means
hundreds of terabytes. Over the years, the data landscape has changed. What you can do, or
are expected to do, with data has changed. The cost of storage has fallen dramatically, while
the means by which data is collected keeps growing. Some data arrives at a rapid pace,
constantly demanding to be collected and observed. Other data arrives more slowly, but in
very large chunks, often in the form of decades of historical data. You might be facing an
advanced analytics problem, or one that requires machine learning. These are challenges that
big data architectures seek to solve.
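The distinction drawn above, between data that arrives continuously and data that arrives
slowly in large chunks, is what drives the choice between streaming and batch ingestion paths.
A hypothetical routing sketch; the threshold and path names are invented:

```python
# Hypothetical sketch: route a data source to a streaming or batch
# ingestion path based on how it arrives, as described above.
# The size threshold and path names are invented for illustration.

BATCH_THRESHOLD_MB = 500  # invented cutoff

def route(source_name: str, size_mb: float, continuous: bool) -> str:
    """Pick an ingestion path for a data source."""
    if continuous:
        # Rapidly arriving data is consumed as a stream.
        return f"{source_name} -> stream ingestion"
    if size_mb >= BATCH_THRESHOLD_MB:
        # Large historical chunks go through batch processing.
        return f"{source_name} -> batch ingestion"
    return f"{source_name} -> direct load"

print(route("clickstream", 1.0, continuous=True))
print(route("decade_of_history", 900_000, continuous=False))
```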