Big Data Unit 2

Uploaded by

Rahul N
Copyright
© All Rights Reserved
Big Data Analytics

BRIEF CONTENTS
• What's in Store?
• Where do we Begin?
• What is Big Data Analytics?
• What Big Data Analytics isn't?
• Why this Sudden Hype Around Big Data Analytics?
• Classification of Analytics
• Greatest Challenges that Prevent Businesses from Capitalizing on Big Data
• Top Challenges Facing Big Data
• Why is Big Data Analytics Important?
• What Kind of Technologies are we Looking Toward to Help Meet the Challenges Posed by Big Data?
• Data Science
• Data Scientist ... Your New Best Friend!!!
• Terminologies Used in Big Data Environment
   In-Memory Analytics
   In-Database Processing
   Symmetric Multiprocessor System
   Massively Parallel Processing
   Difference between Parallel and Distributed Systems
   Shared Nothing Architecture
   Consistency, Availability, Partition Tolerance (CAP) Theorem Explained
• Basically Available Soft State Eventual Consistency (BASE)
• Few Top Analytics Tools

"If you do not know how to ask the right question, you discover nothing."
- W. Edwards Deming

WHAT'S IN STORE?

This chapter is about understanding "Big Data Analytics." We have taken you through the comprehension of the term Big Data: datasets which are voluminous, rich in variety, and call for processing at great speed. Big data analytics is the process of examining these large datasets of big data to unearth hidden [...]

[...] on the Amazon site, one that does not escape your attention. Amazon has made a few suggestions (of books on similar topics or books by the same author) to you to help with your next or future purchases. You wonder how Amazon's recommendation engine was able to do this for you. Is it something that they do for all their customers? Well, Amazon's recommendation engine churns out this sort of good suggestion for customers like you, day in and day out. The company gathers all information about your past purchases together with what it knows about you, studies your buying patterns, and the buying patterns of customers like you [...]

[...] company. You [...] routes and car[...]. It is one of those busy days [...] trucks are engaged in carrying [...] help with a cargo delivery. [...] the charge. You do not want [...] opportunity. But which truck should you engage? [...] that is the nearest but is facing the heaviest [...], the second nearest one but that is occupied and will not be able to take more load. [...] need to analyze the truck load, the fuel consumption, the traffic on various routes, etc. before deciding which truck to select to pick up the new delivery.

3.1 WHERE DO WE BEGIN?

Raw data is collected, classified, and organized. Associating it with adequate metadata and laying bare the context converts this data into meaningful information. It is then aggregated and summarized so that it becomes easy to consume for analysis. Gradual accumulation of such meaningful information builds a knowledge repository. This, in turn, helps with actionable insights which prove useful for decision making. Refer Figure 3.1.

Figure 3.1 Transformation of data to yield actionable insights.

Organizations have realized that they will not be able to ignore big data if they want to be competitive enough and make those timely decisions to make the most of fleeting opportunities. They will have to analyze big time and also take into consideration big data that makes it to the organization at unprecedented levels in terms of volume, velocity, and variety.

Figure 3.2 Types of unstructured data available for analysis. (Labels: Websites, Billing (POS), ERP, CRM, RFID, Social media.)

Big data analytics is the process of examining big data to uncover patterns, unearth trends and find [...] make faster and better decisions. Analytics begin with [...]

3.2 WHAT IS BIG DATA ANALYTICS?

Big Data Analytics is:

1. [...] analytics: Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.
2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction, understanding the customer's demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc.

   Author's experience: The other day I was pleasantly surprised to get a few recommendations via email from one of my frequently visited online retailers. They had recommended a clothing line from my favorite brand and also the color suggested was one to my liking. How did they arrive at this? In the recent past, I had been buying the clothing line of a particular brand and the color preference was pastel shades. They had it stored in their database and pulled it out while making recommendations to me.

3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists. Refer Figure 3.3.
5. Working with datasets whose volume and variety exceed the current storage and processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and likely to be Exabytes or Zettabytes in the near future).

Figure 3.3 What is big data analytics?

3.3 WHAT BIG DATA ANALYTICS ISN'T?

We have often asked participants of our learning programs what comes to mind when they hear the term "big data." And we are [...] by the answer: it is "Volume." But now we have a clear understanding that big data is not only about volume; variety and velocity too are very important factors.

Refer Figure 3.4. Big data isn't just about technology. It is about understanding what the data is saying to us. It is about understanding relationships that we thought never existed between datasets. It is about patterns and trends waiting to be unveiled.

And of course, big data analytics is not here to replace our now very robust and powerful Relational Database Management System (RDBMS) or our traditional Data Warehouse. It is here to coexist with both RDBMS and Data Warehouse, leveraging the power of each to yield business value. Big data analytics is not a "one-size-fits-all" traditional RDBMS built on shared disk and memory.

Figure 3.4 What big data analytics isn't?

And before we think it is only used by huge online companies like Google or Amazon, let us clear the myth. It is for any business and any industry that needs actionable insights out of their data (both internal and external).

3.4 WHY THIS SUDDEN HYPE AROUND BIG DATA ANALYTICS?

If we go by the industry buzz, every place there seems to be talk about big data and big data analytics. Why this sudden hype? Refer Figure 3.5. Let us put it down to three foremost reasons:

1. [...] reaching nearly 45 ZB by 2020. In 2010, about 1.2 trillion gigabytes of data was generated. This amount doubled to 2.4 trillion gigabytes in 2012 and grew to about 5 trillion gigabytes in 2014. The volume of business data worldwide is expected to double every 1.2 years. Wal-Mart, the world's largest retailer, processes one million customer transactions per hour. 500 million "tweets" are posted by Twitter users every day. 2.7 billion "Likes" and comments are posted by Facebook users in a day. Every day 2.5 quintillion bytes of data is created, with 90% of the world's data created in the past 2 years alone.
   Source: (a) http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html (b) http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
2. Cost per gigabyte of storage has hugely dropped.
3. There are an overwhelming number of user-friendly analytics tools available in the market today.
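The growth figures quoted under the first reason can be checked with a little arithmetic. The sketch below is only an illustration: the helper names `doubling_time` and `project` are hypothetical, steady exponential growth is assumed, and applying the 1.2-year doubling period (quoted for business data) to the 2014 figure is a made-up scenario rather than a forecast from the text.

```python
import math


def doubling_time(v0, v1, years):
    """Years needed to double, assuming steady exponential growth from v0 to v1."""
    return years * math.log(2) / math.log(v1 / v0)


def project(v0, years, doubling_years):
    """Volume after `years` if the volume doubles every `doubling_years`."""
    return v0 * 2 ** (years / doubling_years)


# Figures quoted in the text, in trillions of gigabytes.
print(doubling_time(1.2, 2.4, 2))  # implied doubling period, 2010 to 2012
print(doubling_time(2.4, 5.0, 2))  # implied doubling period, 2012 to 2014

# Hypothetical scenario: carry the 2014 figure forward six years
# at the 1.2-year doubling period quoted for business data.
print(project(5.0, 6, 1.2))
```

Six years at a 1.2-year doubling period is five doublings, so the last line multiplies the starting volume by 32. This is why even a modest-sounding doubling period leads to very large volumes within a few years.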
3.5 CLASSIFICATION OF ANALYTICS

There are basically two schools of thought:

1. Those that classify analytics into basic, operationalized, advanced, and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.

Figure 3.5 What big data entails? (Labels: more data produced, more data stored, more data analyzed, better predictions.)

3.5.1 First School of Thought

1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. This is about reporting on historical data, basic visualizations, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.

3.5.2 Second School of Thought

Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table 3.1.

Table 3.1 Analytics 1.0, 2.0, and 3.0

Era
   Analytics 1.0: mid 1950s to 2009
   Analytics 2.0: 2005 to 2012
   Analytics 3.0: 2012 to present

Type of statistics
   Analytics 1.0: Descriptive statistics (report on events, occurrences, etc. of the past)
   Analytics 2.0: Descriptive statistics + predictive statistics (use data from the past to make predictions for the future)
   Analytics 3.0: Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage)

Key questions asked
   Analytics 1.0: What happened? Why did it happen?
   Analytics 2.0: What will happen? Why will it happen?
   Analytics 3.0: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?

Data
   Analytics 1.0: Data from legacy systems, ERP, CRM, and 3rd party applications.
   Analytics 2.0: Big data.
   Analytics 3.0: A blend of big data and data from legacy systems, ERP, CRM, and 3rd party applications.

Nature of data
   Analytics 1.0: Small and structured data sources. Data stored in enterprise data warehouses or data marts.
   Analytics 2.0: Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big volume data had to be stored and processed rapidly, often on massive parallel servers running Hadoop.
   Analytics 3.0: A blend of big data and traditional analytics to yield insights and offerings with speed and impact.

Sourcing
   Analytics 1.0: Data was internally sourced.
   Analytics 2.0: Data was often externally sourced.
   Analytics 3.0: Data is being both internally and externally sourced.

Technologies
   Analytics 1.0: Relational databases.
   Analytics 2.0: Database appliances, Hadoop clusters, SQL to Hadoop environments, etc.
   Analytics 3.0: In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.

Figure 3.6 Analytics 1.0, 2.0, and 3.0. (Labels: Hindsight, Insight, Foresight; Why did it happen? What will happen? How can we make it happen?)

Figure 3.6 shows the subtle growth of analytics from Descriptive > Diagnostic > Predictive > Prescriptive analytics.

3.6 GREATEST CHALLENGES THAT PREVENT BUSINESSES FROM CAPITALIZING ON BIG DATA

1. Obtaining executive sponsorship for investments in big data and its related activities (such as training, etc.).
2. Getting the business units to share information across organizational silos.
3. Finding the right skills (business analysts and data scientists) that can manage large amounts of structured, semi-structured, and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically. In other words, the need to address the storage and processing of large volume, velocity, and variety of big data.
5. Deciding whether to use structured or unstructured, internal or external data to make business decisions.
6. Choosing the optimal way to report findings and analysis of big data (visual presentation and analytics) for the presentations to make the most sense.
7. Determining what to do with the insights created from big data.

3.7 TOP CHALLENGES FACING BIG DATA

1. Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not only SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly and elastically. The need of the hour is a storage that can best withstand the onslaught of large volume, velocity, and variety of big data. Should you scale vertically or should you scale horizontally?
2. Security: Most of the NoSQL big data platforms have poor security mechanisms (lack of proper authentication and authorization mechanisms) when it comes to safeguarding big data. This is a spot that cannot be ignored given that big data carries credit card information, personal information, and other sensitive data.
3. Schema: Rigid schemas have no place. We want the technology to be able to fit our big data and not the other way around. The need of the hour is dynamic schema. Static (pre-defined) schemas are passé.
4. Continuous availability: The big question here is how to provide 24/7 support because almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built in.
5. Consistency: Should one opt for consistency or eventual consistency?
6. Partition tolerant: How to build partition tolerant systems that can take care of both hardware and software failures?
7. Data quality: How to maintain data quality: data accuracy, completeness, [...], etc.? Do we have appropriate metadata in place?

3.8 WHY IS BIG DATA ANALYTICS IMPORTANT?

Let us study the various approaches to analysis of data and what each leads to.

1. Reactive - Business Intelligence: What does Business Intelligence (BI) help us with? It allows the business to make faster and better decisions by providing the right information to the right person at the right time in the right format. It is about analysis of past or historical data and then displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts, notifications, etc. It has support for both pre-specified reports and ad hoc querying.
2. Reactive - Big Data Analytics: Here the analysis is done on huge datasets but the approach is still reactive as it is still based on static data.
3. Proactive - Analytics: This is to support futuristic decision making by the use of data mining, predictive modeling, text mining, and statistical analysis. This analysis is not on big data as it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability.
4. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes, exabytes of information to filter out the relevant data to analyze. This also includes high performance analytics to gain rapid insights from big data and the ability to solve complex problems using more data.

3.9 WHAT KIND OF TECHNOLOGIES ARE WE LOOKING TOWARD TO HELP MEET THE CHALLENGES POSED BY BIG DATA?

1. The first requirement is cheap and abundant storage.
2. We need faster processors to help with quicker processing of big data.
3. Affordable open-source, distributed big data platforms, such as Hadoop.
4. Parallel processing, clustering, virtualization, large grid environments (to distribute processing to a number of machines), high connectivity, and high throughputs rather than low latency.
5. Cloud computing and other flexible resource allocation arrangements.

3.10 DATA SCIENCE

Data science is the science of extracting knowledge from data. In other words, it is a science of drawing out hidden patterns amongst data using statistical and mathematical techniques.
It employs techniques and theories drawn from many fields from the broad areas of mathematics, statistics, and information technology, including machine learning, data engineering, probability models, statistical learning, pattern recognition and learning, etc.

Today we have a plethora of use-cases for "Data Science" that are already exploring massive datasets (Peta to Zetta bytes of information) for weather predictions, oil drillings, seismic activities, financial frauds, [...] network and activities, global economic impacts, sensor logs, social media analytics, and so many others beyond standard retail and manufacturing use-cases such as customer churn, market basket analytics (associative mining), collaborative filtering, regression analysis, etc. Data science is multi-disciplinary. Refer to Figure 3.7.

3.10.1 Business Acumen Skills

A data scientist should have the prowess to counter the pressures of business. A firm understanding of the business domain further helps. The following is a list of traits that need to be honed to play the role of data scientist:

1. Understanding of domain.
2. Business strategy.
3. Problem solving.
4. Communication.
5. Presentation.
6. Inquisitiveness.

3.10.2 Technology Expertise

It goes without saying that technology expertise will come in handy if one is to play the role of a data scientist. Cited below are a few skills required as far as technical expertise is concerned.

Figure 3.7 Data scientist. (Labels: Business acumen, Technology expertise.)

1. Good database knowledge such as RDBMS.
2. Good NoSQL database knowledge such as MongoDB, Cassandra, HBase, etc.
3. Programming languages such as Java, Python, C++, etc.
4. Open-source tools such as Hadoop.
5. Data warehousing.
6. Data mining.
7. Visualization such as Tableau, Flare, Google visualization APIs, etc.

3.10.3 Mathematics Expertise

Since the core job of the data scientist will require him/her to comprehend data, interpret it, make sense of it, and analyze it, he/she will have to dabble in learning algorithms. The following are the key skills that a data scientist will have to have in his/her arsenal:

1. Mathematics.
2. Statistics.
3. Artificial Intelligence (AI).
4. Algorithms.
5. Machine learning.
6. Pattern recognition.
7. Natural Language Processing.

To sum it up, the data science process is:

1. Collecting raw data from multiple disparate data sources.
2. [...]
3. Integrating the data and preparing clean datasets.
4. Engaging in explorative data analysis using models and algorithms.
5. Preparing presentations using data visualizations (commonly called Infographics, or BizAnalytics, or VizAnalytics, etc.).
6. Communicating the findings to all stakeholders.
7. Making faster and better decisions.

3.11 DATA SCIENTIST...YOUR NEW BEST FRIEND!!!

[...] age, a data scientist is the best friend that you can gift yourself. Refer Figure 3.8 to learn about what the data scientist can help you with.

3.11.1 Responsibilities of a Data Scientist

Refer Figure 3.8.

1. Data Management: A data scientist employs several approaches to develop the relevant datasets for analysis. Raw data is just "RAW," unsuitable for analysis. The data scientist works on it to prepare it to reflect the relationships and contexts. This data then becomes useful for processing and further analysis.
2. Analytical Techniques: Depending on the business questions which we are trying to find answers to and the type of data available at hand, the data scientist employs a blend of analytical techniques to develop models and algorithms to understand the data, interpret relationships, spot trends, and unveil patterns.
3. Business Analysis: A data scientist is a business analyst who distinguishes cool facts from insights and is able to apply business acumen and domain knowledge to see the results in the business context. He/she is a good presenter and communicator who is able to communicate the results of the findings in a language that is understood by the different business stakeholders.

Figure 3.8 Data scientist: your new best friend!!!

3.12 TERMINOLOGIES USED IN BIG DATA ENVIRONMENTS

In order to get a good handle on the big data environment, let us get familiar with a few key terminologies in this arena.

3.12.1 In-Memory Analytics

Data access from non-volatile storage such as hard disk is a slow process. The more data that has to be fetched from hard disk or secondary storage, the slower the process gets. One way to combat this challenge is to pre-process and store data (cubes, aggregate tables, query sets, etc.) so that the CPU has to fetch only a small subset of it. But this requires thinking in advance as to what data will be required for analysis. If there is a need for different or more data, it is back to the initial process of pre-computing and storing data or fetching it from secondary storage.

This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM) or primary storage, thus eliminating the need to access the data from hard disk. The advantage is faster access, rapid deployment, better insights, and minimal IT involvement.

3.12.2 In-Database Processing

In-database processing is also called in-database analytics.
It works by fusing data warehouses with analytical systems. Typically the data from various enterprise On Line Transaction Processing (OLTP) systems, after cleaning up (de-duplication, scrubbing, etc.) through the process of ETL, is stored in the Enterprise Data Warehouse (EDW) or data marts. The huge datasets are then exported to analytical programs for complex and extensive computations. With in-database processing, the database program itself can run the computations, eliminating the need for export and thereby saving on time. Leading database vendors are offering this feature to large businesses.

3.12.3 Symmetric Multiprocessor System (SMP)

In SMP, there is a single common main memory that is shared by two or more identical processors. The processors have full access to all I/O devices and are controlled by a single operating system instance. SMP are tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called cache memory, and the processors are connected using a system bus. Refer Figure 3.9.

3.12.4 Massively Parallel Processing

Massively Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel. The processors each have their own operating system and dedicated memory. They work on different parts of the same program. The MPP processors communicate using some sort of messaging interface. MPP systems are more difficult to program as the application must be divided in such a way that all the executing segments can communicate with each other. MPP is different from Symmetric Multiprocessing (SMP) in that SMP works with the processors sharing the same operating system and same memory. SMP is also referred to as tightly-coupled multiprocessing.

3.12.5 Difference Between Parallel and Distributed Systems

The next two terms that we discuss are parallel and distributed systems. As is evident from Figure 3.10, a parallel database system is a tightly coupled system.
The processors co-operate for query processing. The user is unaware of the parallelism since he/she has no access to a specific processor of the system. Either the processors have access to a common memory (Refer Figure 3.11) or make use of message passing for communication.

Figure 3.11 Parallel system.

Distributed database systems are known to be loosely coupled and are composed of individual machines. Refer Figure 3.12. Each of the machines can run its individual application and serve its own respective users. The data is usually distributed across several machines, thereby necessitating quite a number of machines to be accessed to answer a user query. Refer Figure 3.13.

Figure 3.12 Distributed system.

Figure 3.13 Distributed system.

3.12.6 Shared Nothing Architecture

Let us look at the three most common types of architecture for multiprocessor high transaction rate systems. They are:

1. Shared Memory (SM).
2. Shared Disk (SD).
3. Shared Nothing (SN).

In shared memory architecture, a common central memory is shared by multiple processors. In shared disk architecture, multiple processors share a common collection of disks while having their own private memory. In shared nothing architecture, neither memory nor disk is shared among multiple processors.

3.12.6.1 Advantages of a "Shared Nothing Architecture"

1. Fault Isolation: A "Shared Nothing Architecture" provides the benefit of isolating faults. A fault in a single node is contained and confined to that node exclusively and exposed only through messages (or the lack of them).
2. Scalability: Assume that the disk is a shared resource. It implies that the controller and the disk bandwidth are also shared. Synchronization will have to be implemented to maintain a consistent shared state. This would mean that different nodes will have to take turns to access the critical data.
This imposes a limit on how many nodes can be added to the distributed shared disk system, thus compromising on scalability.

3.12.7 CAP Theorem Explained

The CAP theorem is also called Brewer's Theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide all three of the following guarantees at the same time. Refer Figure 3.14. At best you can have two of the following three; one must be sacrificed.

1. Consistency
2. Availability
3. Partition tolerance

3.12.7.1 CAP Theorem

Let us spend some time understanding the earlier mentioned terms.

1. Consistency implies that every read fetches the last write.
2. Availability implies that reads and writes always succeed. In other words, each non-failing node will return a response in a reasonable amount of time.
3. Partition tolerance implies that the system will continue to function when a network partition occurs.

Let us try to understand this using a real-life situation.

You work for a training institute, "XYZ." The institute has 50 instructors including you. All of you report to a training coordinator. At the end of the month, all the instructors together with the training coordinator peruse the training requests received from the various corporate houses and prepare a training schedule for each instructor. These training schedules (one for each instructor) are shared with "Amey," the office administrator. Each morning, you either call the office helpdesk (essentially Amey's desk) or check in person with Amey for your schedule for the day. In case a training request has been cancelled or updated (updates can be in the form of change in course, change in duration, change of the training timings, etc.), Amey is informed of the updates and the schedules are subsequently updated by him.

Things were good until now. Few corporate houses were your clients and the schedules of each instructor could be smoothly managed without any major hiccups.
But your training institute has been implementing promotion campaigns to expand the business. As a result of advertising in the media and word of mouth publicity by your existing clients, you suddenly see an upsurge in training requests from existing and new clients. In consequence of that, more instructors have been recruited. A few trainers/consultants have also been roped in from other training institutes to help tackle the load.

Figure 3.14 Brewer's CAP.

Now when you go to Amey to check your schedule or call in at the helpdesk, you are prepared for a wait in the queue. Looking at the current state of affairs, the training coordinator decides to recruit another office administrator, "Joey." The helpdesk number will remain the same and will be shared by both office administrators.

You: Hey Amey! I think I am scheduled to anchor a training at 3:00 pm today. Can I please have the details?
Amey: Sure! Just a minute.
Amey browses through the file where he maintains the schedules. He does not see a training scheduled against your name at 3:00 pm and responds back, "You do not have any training to conduct at 3:00 pm."
You: How is that possible? The training coordinator called up yesterday evening to inform me of the same and said he has updated the office administrators of the same.
Amey: Oh! Did he say which office administrator? It could have been Joey. Please check with Joey.
Amey: Hey Joey! Please check the schedule for Paul here... Do you see something scheduled at 3:00 pm today?
Joey: Sure enough! He is anchoring the training for client "Z" today at 3:00 pm.

A clear case of an inconsistent system!!! The updates in the schedule were shared by the training coordinator with Joey and you were checking for your schedule with Amey.

You share this incident with the training coordinator and that gets him thinking. The issue has to be addressed immediately, otherwise it will be difficult to avoid a chaotic situation. He comes up with a plan and shares it with both the office administrators the following day.

Training Coordinator: Folks, each time that either an instructor or I call any one of you to update a schedule, make sure that both of you update it in your respective files. This way the instructor will always get the most recent and consistent information irrespective of whom amongst the two of you he/she speaks to.
Joey: But that could mean a delay in answering either a phone call or sharing the schedule with the instructor waiting in queue.
Training Coordinator: Yes, I understand. But there is no way that we can give incorrect information.
Amey: There is this other problem as well. Suppose one of us is on leave on a particular day. That would mean that we cannot take any update related calls as we will not be able to simultaneously update both the files (my file and Joey's). That's the availability problem!!!
Training Coordinator: Well, good point. But I have thought about that as well. Here is the plan:
1. If one of you receives the update call (any updates to any schedule), ensure that you inform the other person if he is available.
2. In case the other person is not available, ensure that you inform him of all the updates to all schedules via email. It is a must!!!
3. When the other person resumes duty, the first thing he will do is update his file with all the updates to all schedules that he has received via email.

Wow!!! That is sure a Consistent and Available system!!!

Looks like everything is in control. Wait a minute! There is a tiff that has taken place between the office administrators. The two are pretty much available but are not talking to each other which, in other words, means that the updates are not flowing from one to the other.
We have to be partition tolerant!!! As a training coordinator, you instruct them saying that none of you are taking any calls requesting schedules or updates to schedules till you patch up. This implies that the system is partition tolerant but not available at that time.

In summary, one can at most decide to go with two of the three.

1. Consistent: The instructors or the training coordinator, once they have updated information with you, will always get the most updated information when they call subsequently.
2. Availability: The instructors or the training coordinator will always get the schedule if any or both of the office administrators have reported to work.
3. Partition Tolerance: Work will go on as usual even if there is communication loss between the office administrators owing to a spat or a tiff.

When to choose consistency over availability and vice-versa...

1. Choose availability over consistency when your business requirements allow some flexibility around when the data in the system synchronizes.
2. Choose consistency over availability when your business requirements demand atomic reads and writes.

Examples of databases that follow one of the possible three combinations:

1. Availability and Partition Tolerance (AP)
2. Consistency and Availability (CA)
3. Consistency and Partition Tolerance (CP)

Refer Figure 3.15 to get a glimpse of databases that adhere to two of the three characteristics of the CAP theorem.

Figure 3.15 Databases and CAP. (Labels: A: is available/accessible/operational at all times. C: commits are atomic across the entire distributed system. P: system responds incorrectly only when there is a total network failure. AP: Riak, Cassandra, CouchDB, Dynamo-like systems. CA: traditional RDBMS such as PostgreSQL, MySQL, etc. CP: HBase, MongoDB, Redis, MemcacheDB, BigTable-like systems.)

3.13 BASICALLY AVAILABLE SOFT STATE EVENTUAL CONSISTENCY (BASE)

A few basic questions to start with:

1. Where is it used?
In distributed computing.
2. Why is it used?
To achieve high availability.
3. How is it achieved?
Assume a given data item. If no new updates are made to this data item for a stipulated period of time, eventually all accesses to it will return the updated value. In other words, if no new updates are made to a given data item for a stipulated period of time, all updates that were made in the past and not yet applied to this data item and its several replicas will percolate to them, so that the item stays as current/recent as is possible.
4. What is replica convergence?
A system that has achieved eventual consistency is said to have converged or achieved replica convergence.
5. Conflict resolution: How is the conflict resolved?
(a) Read repair: If the read leads to discrepancy or inconsistency, a correction is initiated. It slows down the read operation.
(b) Write repair: If the write leads to discrepancy or inconsistency, a correction is initiated. This will cause the write operation to slow down.
(c) Asynchronous repair: Here, the correction is not part of a read or write operation.
3.14 FEW TOP ANALYTICS TOOLS
There is no dearth of analytical tools in the market. Please find below our list of a few top analytics tools. We have also provided the links after each tool for you to explore more...
1. MS Excel
https://support.office.microsoft.com/en-in/article/Whats-new-in-Excel-2013-1ebc42cd-bfaf-43d 9031-5688ef1392fd?CorrelationId=142171cc-191f-47de-8a55-08a5f2c9c739&ui=en-US&rs=en-IN&ad=IN
2. SAS
http://www.sas.com/en_us/home.html
3. IBM SPSS Modeler
http://www-01.ibm.com/software/analytics/spss/products/modeler/
4. Statistica
http://www.statsoft.com/
5. Salford Systems
6. World Programming Systems (WPS)
http://www.teamwpc.co.uk/products/wps
3.14.1 Open Source Analytics Tools
Let us look at a couple of open source analytics tools. We have also provided the links after each tool for you to explore more...
1. R analytics
http://www.revolutionanalytics.com/
2.
Weka
http://www.cs.waikato.ac.nz/ml/weka/
REMIND ME
• Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.
• Big data analytics is about a tight handshake between three communities: IT, business users, and data scientists.
• Data science is the science of extracting knowledge from data.
• The CAP theorem is also called Brewer's Theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide all three of the following guarantees at once. At best you can have two of the three; one must be sacrificed.
  • Consistency
  • Availability
  • Partition tolerance
CONNECT ME (INTERNET RESOURCES)
• http://en.wikipedia.org/wiki/Data_science
• http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-i
• http://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html
• http://spotfire.tibco.com/blog/?p=4240
• http://reports.informationweek.com/abstract/106/1255/Financial/tech-center-taking-advantage-of-in-memory-analytics.html
• http://www.informationweek.com/software/information-management/oracle-analytics-package-expands-in-database-processing-options/d/d-id/1102712?
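To make the ideas of Section 3.13 concrete, here is a minimal sketch of eventual consistency with read repair, loosely modelled on the two office administrators' files. The names `Replica`, `write`, `read_with_repair`, and `has_converged` are hypothetical and not taken from any real database. The sketch also assumes that a writer can see the highest version number held by any replica, which a real distributed system could not take for granted.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data (e.g. one administrator's file): key -> (version, value)."""
    store: dict = field(default_factory=dict)

    def get(self, key):
        # A missing key is reported as version 0 with no value.
        return self.store.get(key, (0, None))

    def put(self, key, version, value):
        # Keep only the newest version seen so far.
        if version > self.get(key)[0]:
            self.store[key] = (version, value)


def write(replicas, key, value, reachable):
    """Apply a write only to the reachable replicas; the others lag behind."""
    # Assumption: the writer knows the highest version across all replicas.
    version = max(r.get(key)[0] for r in replicas) + 1
    for i in reachable:
        replicas[i].put(key, version, value)
    return version


def read_with_repair(replicas, key):
    """Read repair: return the newest value and correct any stale replica."""
    newest = max((r.get(key) for r in replicas), key=lambda pair: pair[0])
    for r in replicas:
        if r.get(key)[0] < newest[0]:
            r.put(key, *newest)  # the correction that slows the read down
    return newest[1]


def has_converged(replicas, key):
    """Replica convergence: every replica holds the same (version, value)."""
    return len({r.get(key) for r in replicas}) == 1


# Usage: the second replica misses an update, then a read repairs it.
files = [Replica(), Replica()]
write(files, "schedule", "Mon 10:00", reachable=[0, 1])
write(files, "schedule", "Tue 14:00", reachable=[0])  # replica 1 is away
stale = not has_converged(files, "schedule")
latest = read_with_repair(files, "schedule")
```

After the second write the replicas disagree, so the system is available but temporarily inconsistent. The read then pays the cost of the correction, after which the replicas have converged. Running the same correction from a background task instead of inside the read would correspond to asynchronous repair.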
