Chapter Five - Partitioning in Data Warehouse: SATA Technology and Business Collage
Chapter Five - Partitioning in Data Warehouse: SATA Technology and Business Collage
Chapter Five - Partitioning in Data Warehouse: SATA Technology and Business Collage
Data warehouses often contain large tables and require techniques both for managing these large
tables and for providing good query performance across these large tables. The objective of
partitioning is manageability and performance related requirements. These topics are discussed:
1. Hardware partitioning
2. Software partitioning
5.2 Hardware Partitioning
RAID is a storage technology often used for databases larger than a few gigabytes. RAID can
provide both performance and fault tolerance benefits. A variety of RAID controllers and disk
configurations offer tradeoffs among cost, performance, and fault tolerance.
Performance
Hardware RAID controllers divide read/writes of all data from Windows NT 4.0 and Windows
2000 and applications (like SQL Server) into slices (usually 16–128 KB) that are then spread
across all disks participating in the RAID array. Splitting data across physical drives like this has
the effect of distributing the read/write I/O workload evenly across all physical hard drives
participating in the RAID array. This increases disk I/O performance because the hard disks
participating in the RAID array, as a whole are kept equally busy, instead of some disks
becoming a bottleneck due to uneven distribution of the I/O requests.
Fault Tolerance
RAID also provides protection from hard disk failure and accompanying data loss by using two
methods: mirroring and parity.
Mirroring
drive replacement and re-mirroring while Windows NT 4.0 and Windows 2000 and RDBMS are
online. Such RAID systems are commonly referred to as “Hot Plug” capable drives.
Advantage
It offers the best performance among RAID options if fault tolerance is required. Bear in mind
that each RDBMS write to the mirror set results in two disk I/O operations, once to each side of
the mirror set. Another advantage is that mirroring provides more fault tolerance than parity
RAID implementations. Mirroring can enable the system to survive at least one failed drive and
may be able to support the system through failure of up to half of the drives in the mirrorset
without forcing the system administrator to shut down the server and recover from the file
backup.
Disadvantage
The disk cost of mirroring is one extra drive for each drive worth of data. This essentially
doubles your storage cost, which, for a data warehouse, is often one of the most expensive
components needed. Both RAID 1 and its hybrid, RAID 0+1 (sometimes referred to as RAID 10
or 0/1) are implemented through mirroring.
Parity
It is implemented by calculating recovery information about data written to disk and writing this
parity information on the other drives that form the RAID array. If a drive should fail, a new
drive is inserted into the RAID array and the data on that failed drive is recovered by taking the
recovery information (parity) written on the other drives and using this information to regenerate
the data from the failed drive. RAID 5 and its hybrids are implemented through parity.
Advantage
The main plus of parity is cost. To protect any number of drives with RAID 5, only one
additional drive is required. Parity information is evenly distributed among all drives
participating in the RAID 5 array.
Disadvantages
The minus of parity are performance and fault tolerance. Due to the additional costs associated
with calculating and writing parity, RAID 5 requires four disk I/O operations for each write,
compared to two disk I/O operations for mirroring. Read I/O operation costs are the same for
mirroring and parity. Read operations, however, are usually one failed drive before the array
must be taken offline and recovery from backup media must be performed to restore data.
General Rule of Thumb: Be sure to stripe across as many disks as necessary to achieve solid disk
I/O performance. System monitor will indicate if there is a disk I/O bottleneck on a particular
RAID array. Be ready to add disks and redistribute data across RAID arrays and/or small
computer system interface (SCSI) channels as necessary to balance disk I/O and maximize
performance.
This level is also known as disk striping because of its use of a disk file system called a stripe
set. Data is divided into blocks and spread in a fixed order among all disks in an array. RAID 0
improves read/write performance by spreading operations across multiple disks, so that
operations can be performed independently and simultaneously. RAID 0 is similar to RAID 5,
except RAID 5 also provides fault tolerance. The following illustration shows RAID 0.
2. Level 1
This level is also known as disk mirroring because it uses a disk file system called a mirror set.
Disk mirroring provides a redundant, identical copy of a selected disk. All data written to the
primary disk is written to the mirror disk. RAID 1 provides fault tolerance and generally
improves read performance (but may degrade write performance). The following illustration
shows RAID 1.
3. Level 2
This level adds redundancy by using an error correction method that spreads parity across all
disks. It also employs a disk-striping strategy that breaks a file into bytes and spreads it across
multiple disks. This strategy offers only a marginal improvement in disk utilization and
read/write performance over mirroring (RAID 1). RAID 2 is not as efficient as other RAID
levels and is not generally used.
4. Level 3
This level uses the same striping method as RAID 2, but the error correction method requires
only one disk for parity data. Use of disk space varies with the number of data disks. RAID 3
provides some read/write performance improvement. RAID 3 also is rarely used.
5. Level 4
This level employs striped data in much larger blocks or segments than RAID 2 or RAID 3. Like
RAID 3, the error correction method requires only one disk for parity data. It keeps user data
separate from error correction data. RAID 4 is not as efficient as other RAID levels and is not
generally used.
6. Level 5
Also known as striping with parity, this level is the most popular strategy for new designs. It is
similar to RAID 4 because it stripes the data in large blocks across the disks in an array. It differs
in how it writes the parity across all the disks. Data redundancy is provided by the parity
information. The data and parity information are arranged on the disk array so the two are always
on different disks. Striping with parity offers better performance than disk mirroring (RAID 1).
However, when a stripe member is missing, read performance degrades (for example, when a
disk fails). RAID 5 is one of the most commonly used RAID configurations. The following
illustration shows RAID 5.
7. Level 10 (1 + 0)
This level is also known as mirroring with striping. This level uses a striped array of disks, which
are then mirrored to another identical set of striped disks. For example, a striped array can be
created using four disks. The striped array of disks is then mirrored using another set of four
striped disks. RAID 10 provides the performance benefits of disk striping with the disk
redundancy of mirroring. RAID 10 provides the highest read/write performance of any of the
RAID levels at the expense of using twice as many disks. The following illustration shows RAID
10.
As mentioned above, RAID 1 and RAID 0+1 offer the best data protection and best performance
among RAID levels, but cost more in terms of disks required. When cost of hard disks is not a
limiting factor, RAID 1 or RAID 0+1 are the best choices in terms of both performance and fault
tolerance.
RAID 5 costs less than RAID 1 or RAID 0+1 but provides less fault tolerance and less write
performance. The write performance of RAID 5 is only about half that of RAID 1 or RAID 0+1
because of the additional I/O needed to read and write parity information.
The best disk I/O performance is achieved with RAID 0 (disk striping with no fault tolerance
protection). Because RAID 0 provides no fault tolerance protection, it should never be used in a
production environment, and it is not recommended for development environments. RAID 0 is
typically used only for benchmarking or testing.
Many RAID array controllers provide the option of RAID 0+1 (also referred to as RAID 1/ 0 and
RAID 10) over physical hard drives. RAID 0+1 is a hybrid RAID solution. On the lower level, it
mirrors all data just like normal RAID 1. On the upper level, the controller stripes data across all
of the drives (like RAID 0). Thus, RAID 0+1 provides maximum protection (mirroring) with
high performance (striping). These striping and mirroring operations are transparent to Windows
and RDBMS because they are managed by the RAID controller. The difference between RAID 1
and RAID 0+1 is on the hardware controller level. RAID 1 and RAID 0+1 require the same
number of drives for a given amount of storage. For more information on RAID 0+1
implementation of specific RAID controllers, contact the hardware vendor that produced the
controller.
Range partitioning maps data to partitions based on ranges of partition key values that you
establish for each partition. It is the most common type of partitioning and is often used with
dates. For example, you might want to partition sales data into weekly, monthly or yearly
partitions. Range partitioning maps rows to partitions based on ranges of column values. Range
partitioning is defined by the partitioning specification for a table or index.
2. Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to a
partitioning key that you identify. The hashing algorithm evenly distributes rows among
partitions, giving partitions approximately the same size. Hash partitioning is the ideal method
for distributing data evenly across devices. Hash partitioning is a good and easy-to-use
alternative to range partitioning when data is not historical and there is no obvious column or
column list where logical range partition pruning can be advantageous. Oracle uses a linear
hashing algorithm and to prevent data from clustering within specific partitions, you should
define the number of partitions by a power of two (for example, 2, 4, 8). The statement below
creates a table sales_hash, which is hash partitioned on the salesman_id field. data1, data2, data3,
and data4 are tablespace names.
3. List Partitioning
List partitioning enables you to explicitly control how rows map to partitions. You do this by
specifying a list of discrete values for the partitioning column in the description for each
partition. This is different from range partitioning, where a range of values is associated with a
partition and with hash partitioning, where you have no control of the row-to-partition mapping.
5 Data Warehousing and Data Mining Instructor – Ephrem
A.
SATA Technology and Business Collage
The advantage of list partitioning is that you can group and organize unordered and unrelated
sets of data in a natural way.
4. Composite Partitioning
Composite partitioning combines range and hash partitioning. Oracle first distributes data into
partitions according to boundaries established by the partition ranges. Then Oracle uses a
hashing algorithm to further divide the data into subpartitions within each range partition.
Index Partitioning
You can choose whether or not to inherit the partitioning strategy of the underlying tables. You
can create both local and global indexes on a table partitioned by range, hash, or composite
methods. Local indexes inherit the partitioning attributes of their related tables. For example, if
you create a local index on a composite table, Oracle automatically partitions the local index
using the composite method.
DATA MART
A data warehouse is a cohesive data model that defines the central data repository for an
organization. A data mart is a data repository for a specific user group. It contains
summarized data that the user group can easily understand, process, and apply. A data
mart cannot stand alone; it requires a data warehouse. Because each data warehousing
effort is unique, your company’s data warehousing environment may differ slightly from
what we are about to introduce.
Each data mart is a collection of tables organized according to the particular requirements
of a user or group of users. Retrieving a collection of different kinds of data from a
“normalized” warehouse can be complex and time-consuming. Hence the need to
rearrange the data so they can be retrieved more easily. The notion of a “mart” suggests
that it is organized for the ultimate consumers — with the potato chips, and video tapes
all next to each other.
META DATA
Meta data is literally “data about data”. It describes the kind of information in the
warehouse, where it stored, how it relates to other information, where it comes from, and
how it is related to the business. The topic of standardizing meta data across various
products and applying a systems engineering approach to this process in order to
facilitate data warehouse design is what this project intend to address