Bigdata With Python

This document provides an overview and introduction to big data analytics with Python. It discusses key topics like the history and definition of big data, advantages of big data solutions, common software stacks and platforms used for big data analytics like Hadoop, Spark, and Python. It also provides examples of using Python for big data applications like word count and configuring the Cloudera big data platform. Finally, it discusses trends like artificial intelligence and machine learning for cybersecurity and information security analytics using big data.

BigData With Python

Prepared and Presented By: Amrit Chhetri (Certified BigData Analyst),
Principal Techno-Functional Consultant and Principal IT Security Consultant
Presentation Topics:

BigData Introduction

History of BigData

Advantages of BigData Solution

Trends of BigData Analytics

BigData Adoption Trends

BigData Software Stacks

BigData Analytics-Platforms

BigData Programming Platforms

Trends of BigData Analytics

BigData MapReduce Demo- Word Count

BigData in Telecommunication

Configuring BigData Platform from Cloudera

AI-Enabled Cybersecurity Platforms

Information Security Analytics
BigData Introduction:

BigData is a large set of data characterized by the 3Vs (Volume, Variety and Velocity), which traditional data
processing applications cannot handle.


"BigData is a collection of very large set of data which includes structured, semi-structured and
non-structured data and they are processed by non-traditional and parallel data processing engine to
produce meaningful insights." - Amrit Chhetri


The challenges of BigData are capture, curation and storage, search, query, visualization and
analysis, which platforms such as Apache Hadoop or Spark are designed to handle.


The Apache Hadoop stack and Apache Hadoop-based platforms are distributed data processing
platforms that address the challenges of BigData.

BigData and BigData Analytics are key software components of today’s Information Security
Analytics.

Contd…
BigData Introduction:

BigData is positioned at the top of Gartner’s 2015 Hype Cycle.

BigData in Gartner's 2015 Hype Cycle:

Source: Gartner
History of BigData:

The term 'BigData' was first introduced in 2007.

Apache Hadoop was the first BigData engine; its design traces back to Google's
2004 MapReduce paper, and it became an Apache project in 2006.

Hadoop-As-A-Service (HAAS) is the latest trend in BigData Analytics, and
Qubole is an example.

Advantages of BigData Solution:

BigData solutions run on distributed, parallel processing platforms or engines.

BigData Analytics supports heterogeneous data sources.

The availability of Open Source platforms for analyzing large volumes of data is
another great advantage.

BigData solutions are meant to address all 3 or 4 Vs - Volume, Variety, Velocity and
Veracity.
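
The "distributed and parallel processing" idea above can be sketched on a single machine. This is only an illustration under stated assumptions: threads stand in for worker nodes, and the helper names (`partition`, `process_partition`, `parallel_sum`) are hypothetical, not part of Hadoop or Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split a list into n_parts chunks, like data blocks spread over worker nodes."""
    return [records[i::n_parts] for i in range(n_parts)]


def process_partition(chunk):
    """Work done independently on each partition: here, a partial sum."""
    return sum(chunk)


def parallel_sum(records, n_parts=4):
    """Process every partition concurrently, then combine the partial results."""
    chunks = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(process_partition, chunks))
    # Combine step: merge the partial results into the final answer.
    return sum(partials)


if __name__ == "__main__":
    data = list(range(1, 101))
    print(parallel_sum(data))
```

A real BigData engine applies the same split, process and combine pattern, but across many machines and with fault tolerance.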

Trends of BigData Analytics:

Self-Service Analytics built on BigData Analytics platforms

Mobile Analytics for accessing analytics on mobile devices

Interactive Visuals or Reports to drill into details

Machine Learning and AI for Business Forecasting and Monitoring

Deep Learning-based Systems are growing across all Business domains
BigData Adoption Trends:

Customer Retention - Telecom, Banking, Finance, Healthcare,
Infotainment and others

Service Quality Improvement - Telecom, Banking, Healthcare and
Infotainment

Predictive Analytics, Deep Learning and AI - everywhere!

Source: Gartner
BigData Software Stacks:

Apache Hadoop is the main platform of the BigData Hadoop stack.

Apache Hadoop is also distributed or shipped by other BigData brands,
including MapR, Cloudera, etc.

The common distributions of Hadoop are:

Cloudera Hadoop

MapR Hadoop

Hortonworks Hadoop

MS Azure HDInsight

Oracle Hadoop

Syncfusion

Qubole and Informatica
BigData Analytics-Platforms:

Apache Hadoop and HDFS are the two core components of BigData Analytics

BigData Hadoop Analytics comprises:

Distributed Processing Engine: Apache Hadoop and Apache Tez

Distributed File System / Dataset: HDFS and RDD (Spark's Resilient Distributed Dataset)

Data Warehouse System: HBase

Scripting/Querying: Pig and Hive

Database System: NoSQL, Cassandra

Data Analysis Platforms: Hive, Spark, R/Octave/MATLAB and BI Tools (BIRT)

Monitoring: Apache Ambari

Machine Learning API: Mahout, Spark, MATLAB, Google TensorFlow
BigData Programming Platforms:

BigData-compatible Programming Languages: C++, C#, Java, Python,
Scala, PHP and Ruby

BigData Scripting Languages: Pig Latin and HiveQL

IDEs for Python: Eclipse (PyDev), IDLE, Anaconda and Geany

API: MapReduce, Pig, Hive, HBase, Spark, MLlib, Mahout

IDEs for MapReduce: Eclipse, IDLE, Anaconda and Geany

Adoption of Machine Learning APIs - Mahout, Spark, Octave/R/MATLAB
Trends of BigData Analytics:

BigData In-Memory Analytics - Tez and Spark

BigData Analytics on Mobile - for example, Fitbit graphs

Adoption of AI (Artificial Intelligence), Machine Learning & Deep Learning -
Mahout, Spark, Octave/R/MATLAB, TensorFlow, etc.

BigData for the IOT Ecosystem - Sensors, IOT Protocols, BigData Analytics
and Telecommunication Platforms

Self-Service Analytics

BigData and BigData Security Auditing

BigData Analytics for Information Security and Computer Forensics
MapReduce Demo-Word Count:

Get Eclipse and install it on a Windows or Linux system

Install PyDev using the ‘Install New Software’ option, then install the MRJob Python
module: pip install mrjob

Write the Word Count code:
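
The original slide's code is not reproduced in this text, so the following is a minimal plain-Python sketch of the MapReduce word count logic, not the author's code. It does not import mrjob; the function names `mapper`, `shuffle`, `reducer` and `word_count` are illustrative. With mrjob, the same map and reduce logic would typically live in the `mapper` and `reducer` methods of an `MRJob` subclass, and the framework would perform the shuffle step.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z']+")


def mapper(line):
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["Big data with Python", "Python for big data analytics"]
    for word, total in sorted(word_count(sample).items()):
        print(word, total)
```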
BigData in Telecommunication:

Market Share Analysis and Competitive Analysis

Customer Retention - Mobile, Phone, Data Card and Other Services

Customer Behavior Analysis - Demographics, Usage Patterns, Payment/Recharge Patterns

Location-Based Marketing - Service Analysis by Location, Regions and Seasons

Real-Time Promotion and Offerings

MIS Reporting - Call Drops

Service Quality Improvement - Network Traffic, Customer, Location, Trend and Demands

Customer Service Improvement - Appropriate Plans, Billing

Recommendation System Improvement

Real-Time Performance Monitoring

Smart Recommendation System (using Machine Learning and AI)

Customer Plan, Services

Customized Services

Special Offer Improvement

IOT (4G/5G) Communication Analysis - Analyzing IoT Networks over Telecommunication Networks
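
As one illustration of the MIS reporting use case above (call drops by location), the sketch below computes a call-drop rate per region. The record layout `(region, dropped)` and the sample values are made up for illustration; a real telecom feed would have its own schema and would be processed on a BigData platform rather than in a single Python process.

```python
from collections import defaultdict


def drop_rate_by_region(call_records):
    """Return {region: dropped_calls / total_calls} from (region, dropped) tuples."""
    totals = defaultdict(int)
    drops = defaultdict(int)
    for region, dropped in call_records:
        totals[region] += 1
        if dropped:
            drops[region] += 1
    return {region: drops[region] / totals[region] for region in totals}


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    records = [
        ("North", True), ("North", False), ("North", False), ("North", False),
        ("South", False), ("South", False),
    ]
    for region, rate in sorted(drop_rate_by_region(records).items()):
        print(f"{region}: {rate:.0%} of calls dropped")
```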


Cloudera BigData Platform:

Install Windows Server 2012 (x64), Windows 10 (x64) or Windows Server 2016 (x64)

Install VMware Player 12 or higher

Create the folder ‘BigDataTraining/Vminstances/Cloudera’ on the D or E drive

Extract the zipped Cloudera file into BigDataTraining/Vminstances/Cloudera

Open the .vmx file using VMware Player and import the necessary configuration

Start the Cloudera VM and open the Cloudera home page in the browser

Open Hue (username/password: cloudera/cloudera)

Create table: CREATE TABLE PRD (prd_id int, prd_cat int);

Insert data: INSERT INTO PRD VALUES (23, 23);

Select data: SELECT prd_id, prd_cat FROM PRD;

WOW! Hive is working on Cloudera!!
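
The three statements can be sanity-checked locally before pasting them into Hue. The sketch below runs them against Python's built-in `sqlite3` module purely as a stand-in: SQLite is not Hive, and HiveQL differs in many respects, but these simple statements behave the same way. The SELECT has to name the columns declared in CREATE TABLE (`prd_id`, `prd_cat`). The function name `run_prd_demo` is hypothetical.

```python
import sqlite3


def run_prd_demo():
    """Create the PRD table in an in-memory database, insert one row, and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE PRD (prd_id int, prd_cat int)")
        cur.execute("INSERT INTO PRD VALUES (23, 23)")
        cur.execute("SELECT prd_id, prd_cat FROM PRD")
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_prd_demo())
```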
Practical :HAAS-Qubole Registration:

Hadoop on Cloud (public or hybrid) is called HAAS, which stands for
Hadoop-As-A-Service.

HAAS is a ready-to-use platform model based on SAAS (Software-As-A-Service)

Qubole is one of the HAAS providers

Follow the steps below to run Hive on Qubole:

Register on Qubole.com or log in to it using Gmail credentials

Create table: CREATE TABLE PRD (prd_id int, prd_cat int);

Insert data: INSERT INTO PRD VALUES (23, 23);

Select data: SELECT prd_id, prd_cat FROM PRD;

WOW! Hive is working on Qubole!!
AI-Enabled Cybersecurity Platforms:

Artificial Intelligence, Deep Learning & Machine Learning in Cyber
Security have been deployed in two ways:

Deployment of AI-Powered Cybersecurity Platforms

Use of AI on In-House Cyber Security Platforms/Tools/Systems

AI on In-House Cyber Security Platforms/Tools/Systems:

AI Engines (Data-Driven): TensorFlow, MS Azure Machine Learning, Apache Spark, WIPRO Holmes

AI Engines (Light-weight): Caffe2, Caffe, MXNet, Theano, the Microsoft Cognitive Toolkit

Programming Languages: Python, Scala, Java, Prolog, C++, Haskell AI

Data-Storage Platforms: Apache Hadoop/Spark, Hive, Apache Cassandra
Information Security Analytics:

BigData Platforms can be deployed in three distinct areas of Information Security:

Intelligent IT Security System Designing - powered by AI, Deep Learning & Machine Learning

Intelligent IT Monitoring System Designing - powered by AI, Deep Learning & Machine Learning

Intelligent CFIR Platforms Designing - powered by AI, Deep Learning & Machine Learning

Analysis of real-time data generated by Threats including Mirai, Dharma.Wallet
and WannaCry Ransomware enhances Cybersecurity and Malware Forensics.

Artificial Intelligence enhances functionalities of IOT-enabled IT Security
Architecture to automate Monitoring of Modern Cyber Threats including Zero-Day
Attacks.
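
As a small illustration of automated monitoring, the sketch below counts failed logins per source address and flags the noisy ones. It is a plain threshold rule, not an AI model, and the log layout (`<timestamp> <event> <source_ip>`), the `FAILED_LOGIN` event name and the function name `suspicious_sources` are all assumptions made for this example.

```python
from collections import Counter


def suspicious_sources(log_lines, threshold=3):
    """Return the source IPs with at least `threshold` FAILED_LOGIN events."""
    failures = Counter()
    for line in log_lines:
        parts = line.split()
        # Assumed (hypothetical) layout: <timestamp> <event> <source_ip>
        if len(parts) == 3 and parts[1] == "FAILED_LOGIN":
            failures[parts[2]] += 1
    return sorted(ip for ip, count in failures.items() if count >= threshold)


if __name__ == "__main__":
    # Made-up log lines using documentation-range IP addresses.
    sample_log = [
        "2017-05-12T10:00:01 FAILED_LOGIN 192.0.2.10",
        "2017-05-12T10:00:02 FAILED_LOGIN 192.0.2.10",
        "2017-05-12T10:00:03 LOGIN_OK 192.0.2.20",
        "2017-05-12T10:00:04 FAILED_LOGIN 192.0.2.10",
        "2017-05-12T10:00:05 FAILED_LOGIN 192.0.2.30",
    ]
    print(suspicious_sources(sample_log))
```

A machine learning approach would replace the fixed threshold with a model trained on historical events, but the aggregation step stays much the same.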

The effective sets of IT Security Tools by category are:

SIEM Platforms - LogRhythm

Intelligent CFIR Platforms - Splunk Enterprise 6.5 or higher, IBM QRadar
THANK YOU ALL
