Cloud Databases: A Paradigm Shift in Databases
Department of Computer Science and Application,
Panjab University, Chandigarh
Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012
ISSN (Online): 1694-0814
www.IJCSI.org 78
Dropbox, iCloud etc. are popular cloud storage services [2]. DaaS allows users to store data on a remote disk available through the Internet. It is used mainly for backup purposes and basic data management. Cloud storage cannot work without basic data management services, so these two terms are used interchangeably. DBaaS is one step ahead. It offers complete database functionality and allows users to access and store their databases on remote disks at any time from any place through the Internet. Amazon’s SimpleDB, Amazon RDS, Google’s BigTable, Yahoo’s Sherpa and Microsoft’s SQL Azure Database are the commonly used databases in the Cloud [3].

A Cloud database is a database delivered to users on demand through the Internet from a cloud database provider's servers. Cloud databases provide scalability, high availability, optimized resource allocation and multi-tenancy. A cloud database can be a traditional database such as MySQL or SQL Server. These databases can be installed, configured and maintained on a Cloud server by the users themselves. This option is popularly called the “Do-it-Yourself” (DIY) approach. A few providers offer ready-made database services such as Xeround’s MySQL [4]. In the “Do-it-Yourself” approach, developers must ensure reliability and elasticity manually. Selecting a DBaaS solution reduces the complexity and cost of running one’s own database. It spares the developer the tedious management tasks of the database. Cloud databases provide improved availability, scalability, performance and flexibility at a lower price. A conventional DBMS (Data Base Management System) deals with structured data, which is held in databases along with its metadata. Cloud databases, by contrast, can be used for unstructured, semi-structured or structured data. Data stored in files of various types where the metadata is either unavailable or incomplete is called unstructured data. Cloud databases are able to support the changing storage requirements of Internet-savvy users who deal more with unstructured data, user created content such as documents

data partitioning. It needs a piece of middleware to route database requests to the appropriate server. As more servers are added, data has to be repartitioned. Data partitioning should be done very carefully, otherwise data shipping (passing information from one machine to another for processing) and joining will become difficult. More data shipping means more latency and network bandwidth bottlenecks. These issues badly reduce database performance. Shared-nothing storage architecture is used mainly for data-intensive workloads. IBM released its shared-nothing implementation of DB2 in 1990 and Oracle released its shared-nothing implementation in September 2008, both for scalable analytical applications of data warehouses. Amazon’s SimpleDB, Hadoop Distributed File System and Yahoo’s PNUTS also implement shared-nothing architecture [5-7].

2.2 Shared-disk Database Architecture

Shared-disk Database Architecture treats the whole database as a single large database stored on a Storage Area Network (SAN) or Network Attached Storage (NAS) that is shared and accessible through the network by all nodes. It requires fewer low-cost servers. It is easy to virtualize them as each compute server is identical. It separates compute from storage, as any number of compute instances may work on the entire data. Middleware is not required to route data requests to specific servers, as each node/client has access to all of the data. Hence, it is more suitable for On-Line Transaction Processing applications. Oracle RAC, IBM DB2 pureScale, Sybase etc. support this architecture [11].

Table 1: Comparison of shared-nothing and shared disk storage architectures
[Table body not recovered. Column headings: Architecture, Maintenance, Partitioning, Distributed, Scalability, Analytical, Useful for, OLTP, Cloud, ACID, Cost]
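As a rough illustration of the routing middleware and the repartitioning cost described for the shared-nothing architecture above, the following Python sketch routes each record key to one server and counts how many keys change owner when a server is added. The hash-modulo scheme, the class name PartitionRouter, the helper moved_keys and the server names db0 to db3 are all assumptions made for illustration; the paper does not prescribe a particular partitioning scheme.

```python
import hashlib


class PartitionRouter:
    """Hypothetical middleware that maps a record key to one of N servers."""

    def __init__(self, servers):
        self.servers = list(servers)

    def route(self, key):
        # A stable hash ensures the same key always lands on the same server.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]


def moved_keys(keys, old_router, new_router):
    """Return the keys whose owning server changes, i.e. data to be shipped."""
    return [k for k in keys if old_router.route(k) != new_router.route(k)]


keys = [f"customer-{i}" for i in range(1000)]
three = PartitionRouter(["db0", "db1", "db2"])
four = PartitionRouter(["db0", "db1", "db2", "db3"])

# Adding a fourth server changes the owner of many keys under this scheme.
moved = moved_keys(keys, three, four)
print(f"{len(moved)} of {len(keys)} keys move when a fourth server is added")
```

Under this simple scheme a large share of keys is reassigned whenever the server count changes, which is one way to see why careless partitioning leads to heavy data shipping.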
data not only for transaction processing, but to analyze consumer trends and business needs. Enterprises want to use analytical knowledge to enhance their business value. So, enterprise applications are broadly categorized into transactional and analytical applications. Relational databases played a dominant role in handling transactional data. Later on, industry leaders like IBM and Oracle added analytical capabilities to their relational databases for data mining applications. In the meantime, a number of databases such as column databases, object-oriented databases etc. came onto the market [12-13]. But they could not overpower the relational databases. Then the Internet revolution and web 2.0 applications started producing massive sparse and unstructured data. RDBMS are not suitable for handling massive sparse data sets with loosely defined schemas. The need to store and process such big data defined the role of NoSQL databases in database technology as Cloud databases. RDBMS and NoSQL databases are briefly discussed as follows:

3.1 Relational Databases

The concept of relational databases is forty years old. It worked best in the era of hardware limits such as small disk space, little memory, slow processor speed and limited networking. It has a rigid database architecture based on tables, columns, indexes, relationships and schema. Data is stored in tables with predefined complex relationships. Column indexes are used for faster search. Highly skilled developers and DBAs are required for database design and maintenance. Conventionally, relational databases are used for transactional data. They include details at the lowest granularity. They contain sensitive and operational data such as employee data and credit card numbers to handle critical business operations. These databases are not well suited to the Cloud environment as they do not support full content data search and are difficult to scale beyond a limit [14-15].

They have emerged to address the requirements of data management in the cloud as they follow BASE (Basically Available, Soft state, Eventually consistent) in contrast to the ACID guarantees. So, they are not suitable for update-intensive transaction applications. They provide high availability at the cost of consistency [16-17].

Table 2: Comparison of RDBMS and NoSQL databases
RDBMS | NoSQL Databases
Data within a database is treated as a “whole”. | Each entity is considered an independent unit of data and can be freely moved from one machine to another.
RDBMS support centrally managed architecture. | They follow distributed architecture.
They are statically provisioned. | They are dynamically provisioned.
It is difficult to scale them. | They are easily scalable.
They provide SQL to query data. | They use an API to query data (not as feature rich as SQL).
ACID (Atomicity, Consistency, Isolation and Durability) compliant; the DBMS maintains consistency. | Follow BASE (Basically Available, Soft state, Eventually consistent); user accesses are guaranteed only at a single-key level.
They support On-Line Transaction Processing applications. | They support web 2.0 applications.
ORACLE, MySQL, SQL Server etc. are popular RDBMS. | Amazon SimpleDB, Yahoo’s PNUTS, CouchDB etc. are popular NoSQL databases.

4. Challenges to Develop Cloud Databases

Cloud DBMSs should support features of Cloud computing as well as of traditional databases for wider acceptability, which is a Herculean task. The potential challenges associated with cloud databases are as follows:
Elastic Compute Cloud (EC2) cloud. Even third party management providers like Elastra and Rightscale offer MySQL images. Scaling is not easy with MySQL but it can be done. EnterpriseDB’s Postgres Plus Advanced Server, a transactional database, also runs in Amazon’s cloud. Earlier, storage was tied to the EC2 instance, so termination of an instance meant loss of the data associated with that instance. With Amazon’s Elastic Block Store (EBS), users can choose to allocate storage volumes that persist reliably and independently from EC2 instances. Amazon Relational Database Service (RDS) is also a web service that makes it easy to set up and scale a relational database in the Cloud. It is designed for developers or businesses that require the full features and capabilities of a relational database. It gives access to the capabilities of a MySQL, Oracle or SQL Server database engine running on an Amazon RDS database instance [20-21].

5.2 Amazon SimpleDB

It is a highly available, scalable and flexible non-relational data store. It works closely with Amazon S3 and Amazon EC2 to provide the ability to store, process and query data sets in the cloud. It is a NoSQL name/value pair data store. It offers a simple interface of Get, Put, Delete and Query to run queries on structured data. It is comprised of domains, items, attributes and values. A domain is comparable to a table or a worksheet in a spreadsheet, e.g. an employee table. Domains are further comprised of items (rows), and items are described by attribute-value pairs. Unlike a spreadsheet, it allows cells to contain multiple values per entry. Each item can have its own unique set of associated attributes (e.g. item “1” might have attributes “Basic” and “tax” whereas item “2” may have attributes “Basic”, “tax” and “Saving”). It provides scalability by allowing users to partition the workload across multiple domains. Initially, a user is allocated a maximum of 250 domains. Users can choose between consistent and eventually consistent reads. But with complex applications, it is difficult to maintain data integrity. It allows users to encrypt data before saving it. It does not decode the data but queries the stored strings directly. It automatically manages replication, indexing of data and performance tuning [22].

5.3 Google App's Bigtable

It is a distributed storage system based on GFS (Google File System) for structured data. It implements a replicated shared-nothing database. It has been successfully deployed in many Google products like Google App Engine. It allows a more complex data store than SimpleDB. It allows entities and properties comparable to tables and columns. One can create an entity by creating a Python object. The Google Datastore API also allows a get, put, delete format for accessing data. It also offers a non-SQL language called GQL, which is not as feature rich as SQL. Select statements in GQL can be performed on one table only. GQL does not support the “Join” statement [23, 24].

5.4 MapReduce

It is an easy-to-use programming model that supports parallel architecture. It is very scalable and works in a distributed manner. It is useful for massive data processing, large scale search and data analysis in the cloud. It provides an abstraction by defining a “mapper” and a “reducer”. The “mapper” is applied to every input key/value pair to generate an arbitrary number of intermediate key/value pairs. The “reducer” is applied to all values associated with the same intermediate key to generate output key/value pairs. It has sufficient expressive capability to support many real world algorithms and tasks. It can partition the input data, schedule the execution of a program across a set of machines, handle machine failures and manage the inter-machine communication. However, it is not comparable to a database system [25].

5.5 Hadoop

It is a programming framework for implementing MapReduce across a large grid of servers. It is distributed in nature and has better scalability than relational and column store databases. It is more suitable for unstructured data. It is not meant for mixed workloads, complex data structures and multitasking. Hadoop is a Java based open source project. With support from Yahoo, Hadoop has made great progress. It has been deployed in a large system with 4,000 nodes and is used in many large scale data processing tasks. It enables the addition of Java software components, provides HDFS (Hadoop Distributed File System) and has been extended to include HBase, a column store database [26].

5.6 Windows Azure Cloud Storage

The aim of Windows Azure Storage is to let users and applications access their data efficiently from anywhere at any time using a simple and familiar programming API. They can use scalable storage to store any amount of data for any length of time on a pay per use basis. It supports structured as well as unstructured data, NoSQL databases and queues. It provides three data abstractions: Blobs, Tables and Queues. Blobs provide a simple interface for storing named files along with metadata for the file. Tables provide structured storage. A Table is a set of entities, which contain a set of properties. Queues provide reliable storage and delivery of messages for an application. All information held in Windows Azure storage is replicated three times, which allows fault tolerance [27].
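The mapper and reducer abstraction described in Section 5.4 can be illustrated with a small word count. The following is a minimal single-process Python sketch, not the API of Hadoop or of any real MapReduce framework: the driver run_mapreduce and the sample documents are hypothetical, and the partitioning, scheduling and failure handling that a real framework performs across machines are deliberately left out.

```python
from collections import defaultdict


def mapper(key, value):
    """Emit one (word, 1) pair per word; key is a document name (unused)."""
    for word in value.lower().split():
        yield word, 1


def reducer(key, values):
    """Sum all counts that share the same intermediate key."""
    yield key, sum(values)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Map phase: apply the mapper to every input key/value pair and
    # group the intermediate values by their intermediate key.
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for mid_key, mid_value in mapper(in_key, in_value):
            groups[mid_key].append(mid_value)

    # Reduce phase: apply the reducer to each intermediate key's values.
    output = {}
    for mid_key in sorted(groups):
        for out_key, out_value in reducer(mid_key, groups[mid_key]):
            output[out_key] = out_value
    return output


documents = [("doc1", "cloud database cloud"), ("doc2", "database service")]
counts = run_mapreduce(documents, mapper, reducer)
print(counts)
```

Because the mapper and the reducer only see one key/value pair or one key group at a time, a framework is free to run them on different machines, which is where the scalability described above comes from.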
5.7 Microsoft SQL Server Data Services (SSDS)

It is a key/value data store, which is also called the cloud extension of Microsoft’s SQL Server. It integrates with Microsoft’s Sync Framework, which is a .NET library for synchronizing dissimilar data sources. It provides schema-free data storage, SOAP or REST APIs and a pay-as-you-go payment system. It has three core concepts: Entity, Container and Authority. An Entity is a property bag of name and value pairs. A Container is a collection of entities. An Authority is a collection of containers and acts as a billing unit [28].

5.8 Sherpa

It was popularly known as PNUTS in earlier publications. Data is organized into tables of records with attributes. Tables can be hashed or ordered. It supports a blob data type along with typical data types. It has a simplified relational data model. It supports selection and projection from a single table and avoids the join operation. Data is replicated asynchronously. It can operate in high availability or high consistency mode. Hadoop can use Sherpa as a data store instead of the native HDFS [29].

5.9 Dynamo

It is a highly available, scalable and distributed key-value data-store used by Amazon’s core services. It uses eventual consistency to achieve a high level of availability, i.e. it can write anywhere and the update will eventually propagate to all replicas asynchronously. There is no record structure or indexes in Dynamo. It permits only single key updates. It makes extensive use of object versioning and application-assisted conflict resolution [30].

5.10 MegaStore

It blends the scalability of a NoSQL data-store and the convenience of a traditional RDBMS to meet the storage requirements of interactive Internet services such as e-mail, documents and social networking. It uses synchronous replication to achieve high availability and a consistent view of the data. It provides transactional (ACID) guarantees within an entity group. It has a flexible data model with user-defined schema, full-text indexes and queues [31].

5.11 CouchDB

CouchDB is a free, open-source database that has been an Apache project since early 2008. It is a document-oriented database written in Erlang. It belongs to the NoSQL generation of databases. Documents (i.e. records) are stored in JSON (JavaScript Object Notation) format and are accessed through an HTTP interface. It allows "views" to be dynamically created using JavaScript. These views map the document data onto a table-like structure that can be indexed and queried. It does not support a non-procedural query language. It achieves scalability through asynchronous replication. It has the unique capability to serve as a self-contained application server and database [32].

5.12 MongoDB

MongoDB is a GPL (General Public License) open source document-oriented JSON database system being developed at 10gen by Geir Magnusson and Dwight Merriman. It is designed to be a true object database, rather than a pure key/value store. It stores data in JSON-like documents with dynamic schemas. It provides the speed and scalability of key-value stores and rich functionality like the indexes and dynamic queries of relational databases. It provides horizontal scalability [33].

Though NoSQL databases are widely accepted as cloud databases in the database landscape, they are not a solution for all problems. They can work easily with large sparse data, but do not provide transactional integrity, flexible indexing, querying and SQL. They are not able to connect with commonly used Business Intelligence tools. It is difficult to find experienced NoSQL programmers, developers and administrators to install and maintain them. So, Cloud databases should be used with full awareness of their limitations.

6. Conclusions

Massive data generated by web-based applications has changed the whole database scenario. Cloud databases appear to be a good solution for handling such data. Moreover, not all organizations can afford to set up expensive data center infrastructure for managing their own databases. The growing popularity of Cloud databases is marking the beginning of a new era of databases. Though cloud databases are not ACID compliant, they are able to handle massive workloads of web-based applications, which do not require such guarantees. Different Cloud databases are available in the market. They share similar concepts and features such as schema free database, simple API, eventual/timeline consistency, scalability, synchronous/asynchronous replication etc. But each has its unique API, query interface, data model and database functions. These concepts need to be standardized for their better growth. Cloud computing and Cloud databases are set to rule the next decade by overcoming the limitations they have.

References
[1] Rajkumar Buyya et al., “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering
computing as the 5th utility”, Future Generation Computer Systems, Vol. 25, Issue 6, June 2009, pp. 599-616.
[2] Jiyi Wu et al., “Recent Advances in Cloud Storage”, in Third International Symposium on Computer Science and Computational Technology (ISCSCT ’10), Jiaozuo, P. R. China, 14-15 August 2010, pp. 151-154.
[3] Database as a Service: Reference Architecture – An Overview, An Oracle White Paper on Enterprise Architecture, September 2011, http://www.oracle.com/technetwork/topics/entarch/oes-refarch-dbaas-508111.pdf last accessed on May 28, 2012.
[4] http://xeround.com last accessed on May 25, 2012.
[5] Daniel J. Abadi, “Data Management in the Cloud: Limitations and Opportunities”, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2009, 32(1):3-12.
[6] http://aws.amazon.com/simpledb/ last accessed on May 23, 2012.
[7] B.F. Cooper et al., “PNUTS: Yahoo!’s Hosted Data Serving Platform”, in International Conference on Very Large Data Bases (VLDB), Vol. 1, no. 2, 2008, pp. 1277–1288.
[8] Mike Hogan, “Cloud Computing & Databases”, November 14, 2008.
[9] Emmanuel Cecchet et al., “Dolly: Virtualization-driven Database Provisioning for the Cloud”, UMass Technical Report UM-CS-2010-006.
[10] Daniel J. Abadi, “ColumnStores vs. RowStores: How Different Are They Really?”, in International Conference on Management of Data, SIGMOD’08.
[11] Donald Kossmann, Tim Kraska, Simon Loesing, “An Evaluation of Alternative Architectures for Transaction Processing in the Cloud”, SIGMOD’10, June 2010.
[12] Daniel J. Abadi et al., “Column-oriented Database Systems”, VLDB ’09.
[13] Stonebraker et al., “C-Store: A Column-oriented DBMS”.
[14] Thakur Ramjiram Singh, “Cloud Computing: An Analysis”, International Journal of Enterprise Computing and Business Systems, Vol. 1, Issue 2, July 2011, pp. 2230-8849.
[15] Rick Cattell, “Scalable SQL and NoSQL Data Stores”, ACM SIGMOD, Vol. 39, Issue 4, 2011, pp. 12-27.
[16] Arpita Mathur et al., “Cloud Based Distributed Databases: The Future Ahead”, International Journal on Computer Science and Engineering (IJCSE), Vol. 3, No. 6, 2011.
[17] Bo Peng, “Implementation Issues of A Cloud Computing Platform”, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.
[18] Mihaela Ion, “Enforcing multi-user access policies to encrypted cloud databases”, IEEE International Symposium on Policies for Distributed Systems and Networks, 2011, pp. 175-177.
[19] Maggiani, R., “Cloud computing is changing how we communicate”, IPCC 2009, 2009, pp. 1-4.
[20] http://aws.amazon.com/rds/S3 last accessed on May 24, 2012.
[21] http://aws.amazon.com/rds/ last accessed on May 24, 2012.
[22] http://aws.amazon.com/simpledb last accessed on May 25, 2012.
[23] S. Ghemawat et al., “The Google File System”, in Proceedings of 19th ACM Symp. Operating System Principles (SOSP 03), ACM Press, 2003, pp. 29–43.
[24] F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data”, in 7th Usenix Symp. Operating Systems Design and Implementation (OSDI 06), Usenix Assoc., 2006, pp. 205–218.
[25] Dawei Jiang et al., “MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters”, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 9, 2011.
[26] D. Borthakur, “The Hadoop Distributed File System: Architecture and Design”, Apache Software Foundation, http://hadoop.apache.org/core/docs/r0.16.4/hdfs_design.html last accessed on May 27, 2012.
[27] Troy Davis, “Cloud Computing Use Cases and Considerations”, http://digissance.com/ Cloud Computing Talk.pdf last accessed on June 10, 2012.
[28] www.windowsazure.com/en-us/develop/net/.../cloud-storage/ last accessed on June 10, 2012.
[29] Brian Cooper et al., “Building a Cloud for Yahoo”, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2009.
[30] Giuseppe DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”, in 21st ACM Symposium on Operating System Principles, SOSP 2007, pp. 205-220.
[31] Jason Baker et al., “Megastore: Providing Scalable, Highly Available Storage for Interactive Services”, in 5th Biennial Conference on Innovative Data Systems Research (CIDR ’11), 2011, pp. 223-234.
[32] http://www.couchbase.com/couchdb last accessed on May 31, 2012.
[33] http://www.mongodb.org last accessed on May 31, 2012.

Indu Arora obtained her MCA degree from Guru Nanak Dev University in 1992. She has been working as Assistant Professor in Computer Science & Applications at MCMDAV College, Chandigarh since 1998. She also served at BBKDAV College (Aug. 1993 – Oct. 1997) and AB College, Pathankot (Aug. 1992 – Feb. 1993). She is also pursuing a Doctor of Philosophy in the Department of Computer Science & Applications at Panjab University, Chandigarh. Her research interests include Internet technologies, databases and Cloud Computing. She has many research papers to her credit.

Dr. Anu Gupta has been working as Assistant Professor in Computer Science and Applications at Panjab University, Chandigarh (India) since July 1998. She held the position of Chairperson, Department of Computer Science & Applications, Panjab University, Chandigarh from Feb. 2008 to Jan. 2011. She was awarded the University medal for securing first position in M.C.A. at Punjabi University, Patiala, Punjab in the year 1997. She has experience of working on several platforms using a variety of development tools and application packages. She obtained her Doctor of Philosophy degree from Panjab University in the area of Free/Open Source Software. Her research interests include Cloud Computing, Networking, Multimedia Technologies, E-Commerce and Software Engineering. She is a life-member of ‘Computer Society of India’ and ‘Indian Academy of Science’. She has published several research papers in various journals and conferences.