Rules of Thumb in Data Engineering
Latency due to the speed of light (60 ms round trip within North America, within Europe, and within Asia) will be with us forever, but terabit-per-second bandwidth will allow us to design systems that cache data locally and quickly access remote data when needed.

Traditionally, high-speed networking has been limited by software overheads. The cost of sending a message is [10]:

    Time = senderCPU + receiverCPU + bytes/bandwidth

The sender and receiver CPU costs have typically been 10,000 instructions each, plus 10 instructions per byte. So sending a 10 KB message costs 120,000 instructions, or something like a millisecond of CPU time, while the transmit time of 10,000 bytes at gigabit speeds is under 100 microseconds: software overhead, not bandwidth, has been the bottleneck.
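To make the model concrete, here is a minimal Python sketch of this back-of-envelope formula. The per-message and per-byte instruction counts are the rule-of-thumb values above; the ~100 MIPS processor speed is an assumption chosen so that 120,000 instructions cost about a millisecond.

```python
# Rule-of-thumb message cost: Time = senderCPU + receiverCPU + bytes/bandwidth.
PER_MESSAGE_INSTRUCTIONS = 10_000    # fixed software overhead at each end
PER_BYTE_INSTRUCTIONS = 10           # per-byte protocol and copy cost
CPU_IPS = 100e6                      # assumed ~100 MIPS processor

def send_time_seconds(message_bytes: int, bandwidth_bytes_per_sec: float) -> float:
    instructions = 2 * PER_MESSAGE_INSTRUCTIONS + PER_BYTE_INSTRUCTIONS * message_bytes
    cpu_time = instructions / CPU_IPS                    # 120,000 instr ~ 1.2 ms
    wire_time = message_bytes / bandwidth_bytes_per_sec  # ~80 us on gigabit Ethernet
    return cpu_time + wire_time

print(send_time_seconds(10_000, 125e6))  # 10 KB at ~125 MB/s: CPU cost dominates
```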
5. Caching: Location, Location, and Location

Processor clock speeds have been improving, as has the parallelism within the processor. Modern processors are capable of issuing four or more instructions in parallel and of pipelining instruction execution.

In theory, current quad-issue Intel processors are able to execute three billion instructions per second (4 instructions per clock at 750 M clocks per second). In practice, real benchmarks see a CPI (clocks per instruction) of 1 to 3, and the CPI is rising as processor speeds outpace memory latency improvements [6,7,8].
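The gap between peak and delivered instruction rates is easy to quantify; a small sketch using the figures quoted above (illustrative arithmetic, not a benchmark):

```python
# Theoretical vs. delivered instruction rate for a quad-issue 750 MHz processor.
CLOCK_HZ = 750e6
ISSUE_WIDTH = 4

peak_ips = CLOCK_HZ * ISSUE_WIDTH               # 3 BIPS at 4 instructions/clock
for cpi in (1.0, 2.0, 3.0):                     # CPI range reported in [6,7,8]
    delivered = CLOCK_HZ / cpi                  # instructions actually retired/sec
    print(f"CPI {cpi:.0f}: {delivered / 1e9:.2f} BIPS, "
          f"{delivered / peak_ips:.0%} of the {peak_ips / 1e9:.0f} BIPS peak")
```

At a CPI of 1 to 3, the processor delivers only 8% to 25% of its theoretical peak, which is exactly the memory-stall problem the next paragraphs describe.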
The memory subsystem cannot feed data to the processor fast enough to keep the pipelines full. Architects have added 2-level and 3-level caches to processors to improve this situation, but if programs do not have good data locality, there is not much the architects can do to mask "compulsory" cache misses.

Software designers are learning that careful program and data placement and cache-sensitive algorithms with good locality give 3x speedups on current processors. As processor speeds continue to outpace memory speeds, there will be increasing incentives for software designers to find algorithms with small instruction-cache footprints, with predictable branching behavior, and with good or predictable data locality (read: clustered or sequential).
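The effect of data locality is easy to demonstrate. The sketch below touches the same array once sequentially and once with a large stride; the sizes and names are mine, and Python's interpreter overhead mutes the gap, so compiled code shows a much larger difference than this sketch will.

```python
# Same work, different locality: sum 16 M doubles sequentially vs. with a
# 32 KB stride (4096 doubles), which defeats the caches' spatial locality.
import array
import time

N = 1 << 24                              # 16 M doubles, ~128 MB
data = array.array("d", [1.0]) * N

def scan(stride: int) -> float:
    start_time = time.perf_counter()
    total = 0.0
    for start in range(stride):          # every element is touched exactly once
        for i in range(start, N, stride):
            total += data[i]
    return time.perf_counter() - start_time

print("sequential:", scan(1))
print("strided:   ", scan(4096))
```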
There is a hardware trend to design huge (256-way) multiprocessors that operate on a shared memory. These systems are especially prone to instruction stretch, in which bus and cache interference from other processors causes each processor to slow down. Getting good performance from these massive SMPs will require careful attention to data partitioning, data locality, and processor affinity.

An alternative design opts for many nodes, each with its own IO and bus bandwidth, all using a dataflow programming model and communicating via a high-speed network [14]. These designs have given rise to very impressive performance: for example, the sort speed of computer systems has been doubling each year for the last 15 years through a combination of increased node speed (about 60%/year) and increased parallelism (about 40%/year). The 1999 terabyte sort used nearly 2,000 processors and disks (http://research.microsoft.com/~gray/sort_benchmark).
The argument for the many-little scalable design tries to leverage the fact that mainframe:mini:commodity price ratios are approximately 100:10:1. That is, mainframes cost about 100 times more than commodity components, and semi-custom minicomputers have a 10:1 markup over commodity components (see the prices for comparable systems among the www.tpc.org benchmark results). The cluster advocates admit the many-little design is less efficient, but they argue that it is more cost-effective.

There seems to be no good general rule of thumb for CPU caches beyond smaller-is-better and locality-is-better. But two good rules have evolved for disk data locality and caching. It is possible to quantitatively estimate when you should cache a disk page in memory, trading off memory consumption against disk arm utilization.

As mentioned before, disk arms are precious. If a disk costs $1200 and does 120 accesses per second, then a disk access per second costs $10. It would be advantageous to spend up to $10 to save one access per second. Well, $10 buys about 10 MB of DRAM, so if a cache of that size would indeed save one access per second, it would be a good investment.

One can ask the question: how frequently should a disk-resident object be accessed to justify caching it in main memory? When does the rent on the RAM space balance the cost of an access? The analysis in [15] shows that:

    BreakEvenReferenceInterval (seconds) =
        (PagesPerMBofRAM / AccessesPerSecondPerDisk) x (PricePerDiskDrive / PricePerMBofDRAM)

For randomly accessed data, the first term (call it the access-pattern term) is approximately 1; the second term (called the technology ratio) varies from 100 to 400 today. So the break-even interval is about 2 minutes to 5 minutes.

For sequentially accessed data, the access-pattern term is approximately 0.1 (1 MB "pages" and 10 pages per second), so the break-even interval is 10 to 40 seconds.

This analysis gives the rules:

19. The 5-minute random rule: cache randomly accessed disk pages that are re-used every 5 minutes.

20. The 1-minute sequential rule: cache sequentially accessed disk pages that are re-used within a minute.
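A minimal calculator for these break-even intervals, assuming the two-term form above; the access-pattern and technology-ratio values are the ones quoted in the text:

```python
# Break-even reference interval from [15]: how long a page may sit unreferenced
# before its RAM rent exceeds the value of the disk accesses it saves.
def break_even_seconds(access_pattern_term: float, technology_ratio: float) -> float:
    # access_pattern_term = PagesPerMBofRAM / AccessesPerSecondPerDisk
    # technology_ratio    = PricePerDiskDrive / PricePerMBofDRAM
    return access_pattern_term * technology_ratio

for ratio in (100, 400):                 # today's technology-ratio range
    random_interval = break_even_seconds(1.0, ratio)   # ~1 for random 8 KB pages
    seq_interval = break_even_seconds(0.1, ratio)      # 1 MB pages, 10 pages/sec
    print(f"ratio {ratio}: random {random_interval / 60:.1f} min, "
          f"sequential {seq_interval:.0f} s")
```

The output is roughly 2 to 7 minutes for random pages and 10 to 40 seconds for sequential ones, matching rules 19 and 20.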
A related rule that has not seen much use is that one can spend 1 byte of RAM to save 1 MIPS. The argument goes that RAM costs about $1/MB, and today one can get 100 extra MIPS from Intel for about 100 extra dollars. So, the marginal cost of an instruction per second is approximately the marginal cost of a byte. Fifteen years ago the ratio was 10:1, but since then Intel and the VLSI industry have made processors much less expensive.

21. Spend 1 byte of RAM to save 1 MIPS.

Now let's consider web page caching. We can use logic similar to the five-minute rule to decide when it pays to cache web pages. The basic setup is shown in Figure 2, where the link speed varies from 100 KBps for intranets, to modem speeds of 5 KBps, to wireless speeds of 1 KBps. In the case of modem and wireless links, we assume a local browser cache. For high-speed links, the cache could be either a browser cache or a proxy cache. In the case of a proxy, we assume a fast connection between the user and the cache (e.g., a 100 Mbps LAN), so that the cost of accessing data from a remote proxy disk is not significantly larger than that from a local disk.

    [client -- cache -- link -- server]
    Figure 2. The client-side or proxy web cache improves response time by
    eliminating link transmission times and server times.
Given these assumptions, consider three questions:
(1) How much does web caching improve response times?
(2) When should a web page be cached?
(3) How large should a web cache be?

Assume that the average web object is 10 KB, and define:

    R_remote: response time to access an object at the server
    R_local:  response time to access the object from the cache
    H:        cache hit ratio (the fraction of requests that the cache satisfies)

Then:

    Response_Time_Improvement = R_remote - (H * R_local + (1-H) * R_remote)
                              = H * (R_remote - R_local)
We now estimate R_remote and R_local. R_remote consists of the server response time plus the network download time. The server response time (the queuing delay plus the service time) can range from several hundred milliseconds to several seconds. Assume a server response time of 3 seconds.

The download time over the network depends on network conditions and on link speeds. WAN links are typically shared, so the per-user bandwidth is smaller than the raw link bandwidth (a bottlenecked link at the server may further reduce the bandwidth per request). Assume that the effective LAN/WAN bandwidth is 100 KB/s; hence the time to transmit a 10 KB object is a tenth of a second, and the R_remote of 3 seconds is dominated by the server time.

The raw bandwidth of a 56 Kbps dial-up modem link is about 7 KB/s. With compression, the effective bandwidth is often twice that, but there are also start/stop overheads. We assume an effective modem bandwidth of 5 KB/s. Hence, the modem transmit time for a 10 KB object is 2 seconds, and R_remote is 5 seconds.

A mobile user on a wireless link gets 1 KB/s, so it takes 10 seconds to download a 10 KB object, and R_remote is 13 seconds. We ignore the fact that mobile systems often compress the data to make the objects much smaller. Summarizing:

    R_remote = 3 + 0.1 =  3 s (high-speed connection)
             = 3 + 2   =  5 s (modem connection)
             = 3 + 10  = 13 s (wireless connection)

R_local depends on many details, but fundamentally local access avoids the server-time wait (3 seconds in the examples above), and if the object is in the browser cache, local access also avoids the transmission time. If the local access saves both, then R_local is a few hundred milliseconds. Hence:

    R_local = 100 ms (browser cache)
            = 300 ms (proxy cache, intranet)
            =   2 s  (proxy cache, modem)
            =  10 s  (proxy cache, wireless)

Proxy cache studies indicate that H_proxy_cache = 0.4 is an upper bound [16,17]. Anecdotal evidence suggests browser hit ratios are smaller: assume H_browser_cache = 0.2. Assuming a $20/hr human cost, each second costs 0.55 cents. Using that, Table 3 computes the response-time savings from the Response_Time_Improvement equation above.

Table 3: The benefits of browser and proxy caching (pennies saved per page), assuming people's time is worth $20/hr.

    connection  cache    R_remote (s)  R_local (s)  H (hit)  Savings (cents/page)
    LAN         proxy         3            0.3        0.4          0.6
    LAN         browser       3            0.1        0.2          0.3
    Modem       proxy         5            2          0.4          0.7
    Modem       browser       5            0.1        0.2          0.5
    Mobile      proxy        13           10          0.4          0.7
    Mobile      browser      13            0.1        0.2          1.4
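The arithmetic behind Table 3 is a one-line formula, so it is easy to check. A sketch using the assumptions above (the row values come from the table; only the code is mine):

```python
# Table 3 check: people-time saved per page = H * (R_remote - R_local),
# priced at $20/hr = 2000 cents / 3600 s ~= 0.555 cents per second.
CENTS_PER_SECOND = 20 * 100 / 3600

rows = [  # (connection, cache, R_remote s, R_local s, hit ratio H)
    ("LAN",    "proxy",    3, 0.3,  0.4),
    ("LAN",    "browser",  3, 0.1,  0.2),
    ("Modem",  "proxy",    5, 2.0,  0.4),
    ("Modem",  "browser",  5, 0.1,  0.2),
    ("Mobile", "proxy",   13, 10.0, 0.4),
    ("Mobile", "browser", 13, 0.1,  0.2),
]
for connection, cache, r_remote, r_local, h in rows:
    saved_seconds = h * (r_remote - r_local)      # Response_Time_Improvement
    print(f"{connection:7}{cache:8} saves {saved_seconds * CENTS_PER_SECOND:.1f} cents/page")
```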
If a user makes ten requests per hour and uses the web 400 hours per year, then the benefit of caching is about 7 cents per hour and $20 per year. This should be balanced against the cost of the disk to store the pages -- but as mentioned earlier, $20 buys a LOT of disk space.

Having computed the savings for a cached page (Table 3), we can now compute the point at which caching a page begins to pay off. Table 4 has the calculation. The first column of Table 4 estimates download costs from Odlyzko [13, Table 7] and assumes a wireless (1 KBps) link costs $0.10/minute ($6/hr). The second column assumes desktop disks cost $30/GB and last 3 years, while mobile storage devices are 30x more expensive. The break-even point for storing a page comes when the storage rent matches the download cost; the download cost has two components, the network time (column A in Table 4) and the people time (column C).

Table 4: Caching is a very good deal: cache web pages if they will be re-used within a few years.

                  A: $/10 KB    B: $/10 KB   Time = A/B     C: people cost   Time = (A+C)/B
                  download      storage      break-even     of download, $   break-even
                  network cost  per month    cache storage  (Table 3)
    Internet/LAN  1E-4          1E-4         100 months     3E-3             400 months
    Modem         2E-4          1E-4         250 months     5E-3             750 months
    Wireless      1E-2          3E-3          56 months     1E-2             102 months
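A sketch of the break-even logic, computing the storage rent from first principles under the stated assumptions ($30/GB disks with a 3-year life, mobile storage 30x pricier). Table 4's download costs come from Odlyzko's data [13], so these round illustrative inputs will not reproduce its entries exactly:

```python
# Store a page as long as its monthly storage rent stays below the cost of
# re-downloading it (network cost A plus, optionally, people time C).
def rent_per_10kb_per_month(dollars_per_gb: float, life_months: int = 36) -> float:
    return dollars_per_gb * (10e3 / 1e9) / life_months

def break_even_months(network_cost: float, people_cost: float, rent: float) -> float:
    return (network_cost + people_cost) / rent

desktop_rent = rent_per_10kb_per_month(30.0)       # desktop disk: $30/GB
mobile_rent = rent_per_10kb_per_month(30.0 * 30)   # mobile storage: ~30x pricier

print(break_even_months(1e-4, 3e-3, desktop_rent))  # LAN page, people time included
print(break_even_months(1e-2, 1e-2, mobile_rent))   # wireless page, mobile storage
```

Either way the answer comes out in years, which is the point of the table.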
The fourth column of the table (Time = A/B) shows the calculation ignoring people's time; in that case the break-even interval is about 8 years, rather than many decades. In either case, the table indicates that caching is very attractive: cache a page if it will be referenced within the next 5 years -- longer than the lifetime of the system(!).

Certainly, our assumptions are questionable, but the astonishing thing is that a very wide spectrum of assumptions concludes that a "cache everything" strategy is desirable.
How will Table 4 change with time? Network speeds are predicted to increase, and network costs are predicted to drop. Column 4, Time = A/B, may drop from 100 months to one day. But column 6, Time = (A+C)/B, will grow as people's time grows in value while the technology costs (A and B) decline. In summary, technology trends suggest that web page caching will continue to be popular, especially for bandwidth-limited mobile devices.

How much would it cost to cache all web accesses for a year? If users make 10 requests per hour with a hit ratio of H = 0.4, the cache gets 4 hits and 6 new objects per user-hour. For an 8-hour workday, this is 480 KB per user per day. If H = 0.2, it is 640 KB per user per day. In both cases this is about a penny a day of storage, so again we conclude that a simple "cache everything" strategy is a good default; the sketch below checks the arithmetic.
These calculations suggest the simple rule:

22. Cache web pages if there is any chance they will be re-referenced within their lifetime.

Web object lifetimes are bi-modal, or even tri-modal in some cases. Studies show median lifetimes to be a few days or a few tens of days [18]. The average page has a 75-day lifetime (ignoring the modalities and non-uniform access). A heuristic that recognized high-velocity pages would both improve usability (by not showing stale cached pages) and save cache storage.
A major assumption in these calculations is that server performance will continue to be poor: 3 seconds on average. Popular servers tend to be slow because web site owners are not investing enough in servers and bandwidth. With declining costs, web site owners could invest more and reduce the 3-second response time to less than a second. If that happens, the web cache's people-cost savings will evaporate, and the only reason for caching would be to save network bandwidth and download time -- which we believe will not be a scarce resource except for mobile devices.

6. Summary

Data stores will become huge. Our biggest challenge is to make them easy to access and manage, automating all the tasks of data organization, access, and protection.

Disk technology is overtaking tape, but at the same time disks are morphing into tape-like devices with primarily sequential access, to optimize the use of disk arms. Meanwhile, RAM improvements encourage us to build machines with massive main memory. Indeed, the main change to Amdahl's balanced system law is that alpha (= MIPS/DRAM size) is rising from 1 to 10.

Network bandwidth is improving at a rate that challenges many of our design assumptions. LAN/SAN software is being streamlined so that it is no longer the bottleneck. This may well allow a re-centralization of computing.

Still, data caching is an important optimization. Disk caching still follows the 5-minute random rule and the 1-minute sequential rule. Web caching encourages designs that simply cache all pages.

7. References

[1] IBM UltraStar72, http://www.storage.ibm.com/hardsoft/diskdrdl/ultra/72zxdata.htm
[2] Brewster Kahle, private communication, http://archive.org
[3] Data heat is the number of times the data is accessed per second.
[4] Dell TPC-C: http://www.tpc.org/results/individual_results/Dell/dell_8450_99112201_es.pdf
[5] Compaq TPC-H: http://www.tpc.org/results/individual_results/Compaq/compaq.8000.h.99110901.es.pdf
[6] J. L. Hennessy, D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Francisco, 1990, ISBN 1-55860-069-8.
[7] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael, W. E. Baker, "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads," ACM ISCA, pp. 15-26, June 1998.
[8] A. Ailamaki, D. J. DeWitt, M. D. Hill, D. A. Wood, "DBMSs on a Modern Processor: Where Does Time Go?" VLDB 99, pp. 266-277, Sept. 1999.
[9] H. Garcia-Molina, A. Park, L. R. Rogers, "Performance Through Memory," ACM SIGMETRICS, Performance Evaluation Review 15(1), pp. 122-131, May 1987.
[10] J. Gray, "The Cost of Messages," ACM PODC, pp. 1-7, 1988.
[11] Virtual Interface Architecture: http://www.viarch.org
[12] G. Gilder, "Fiber Keeps Its Promise: Get ready. Bandwidth will triple each year for the next 25," Forbes, 7 April 1997, http://www.forbes.com/asap/97/0407/090.htm
[13] A. M. Odlyzko, "The Economics of the Internet: Utility, Utilization, Pricing, and Quality of Service," http://www.research.att.com/~amo/doc/networks.html
[14] R. H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D. E. Culler, J. M. Hellerstein, D. A. Patterson, "Cluster I/O with River: Making the Fast Case Common," IOPADS '99, 1999.
[15] J. Gray, G. Graefe, "The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb," SIGMOD Record 26(4), pp. 63-68, 1997.
[16] R. Tewari, M. Dahlin, H. M. Vin, J. Kay, "Beyond Hierarchies: Design Considerations for Distributed Caching on the Internet," IEEE ICDCS '99, June 1999.
[17] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, A. Karlin, H. Levy, "On the Scale and Performance of Cooperative Web Proxy Caching," ACM SOSP '99, pp. 16-31, Dec. 1999.
[18] J. Gwertzman, M. Seltzer, "World-Wide Web Cache Consistency," 1996 USENIX Annual Technical Conference, Jan. 1996.