- <!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.44 2007/06/28 00:02:37 tgl Exp $ -->
+ <!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.45 2007/08/01 22:45:07 tgl Exp $ -->

<chapter id="wal">
<title>Reliability and the Write-Ahead Log</title>

<para>
- This chapter explain how the Write-Ahead Log is used to obtain
+ This chapter explains how the Write-Ahead Log is used to obtain
efficient, reliable operation.
</para>

@@ -71,7 +71,7 @@
write caches. At the drive level, disable write-back caching if the
drive cannot guarantee the data will be written before shutdown.
</para>
-
+
<para>
Another risk of data loss is posed by the disk platter write
operations themselves. Disk platters are divided into sectors,
disk. By doing this, during crash recovery <productname>PostgreSQL</> can
restore partially-written pages. If you have a battery-backed disk
controller or file-system software that prevents partial page writes
- (e.g., ReiserFS 4), you can turn off this page imaging by using the
+ (e.g., ReiserFS 4), you can turn off this page imaging by using the
<xref linkend="guc-full-page-writes"> parameter.
</para>
</sect1>
-
+
<sect1 id="wal-intro">
<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

<para>
<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
- is a standard approach to transaction logging. Its detailed
+ is a standard method for ensuring data integrity. A detailed
description can be found in most (if not all) books about
transaction processing. Briefly, <acronym>WAL</acronym>'s central
concept is that changes to data files (where tables and indexes
reside) must be written only after those changes have been logged,
- that is, when log records describing the changes have been flushed
+ that is, after log records describing the changes have been flushed
to permanent storage. If we follow this procedure, we do not need
to flush data pages to disk on every transaction commit, because we
know that in the event of a crash we will be able to recover the
</para>

<para>
- A major benefit of using <acronym>WAL</acronym> is a
+ Using <acronym>WAL</acronym> results in a
significantly reduced number of disk writes, because only the log
- file needs to be flushed to disk at the time of transaction
- commit, rather than every data file changed by the transaction.
- In multiuser environments, commits of many transactions
- can be accomplished with a single <function>fsync</function> of
- the log file. Furthermore, the log file is written sequentially,
+ file needs to be flushed to disk to guarantee that a transaction is
+ committed, rather than every data file changed by the transaction.
+ The log file is written sequentially,
and so the cost of syncing the log is much less than the cost of
- flushing the data pages. This is especially true for servers
+ flushing the data pages. This is especially true for servers
handling many small transactions touching different parts of the data
- store.
+ store. Furthermore, when the server is processing many small concurrent
+ transactions, one <function>fsync</function> of the log file may
+ suffice to commit many transactions.
</para>

<para>
</para>
</sect1>

+ <sect1 id="wal-async-commit">
+ <title>Asynchronous Commit</title>
+
+ <indexterm>
+ <primary>synchronous commit</primary>
+ </indexterm>
+
+ <indexterm>
+ <primary>asynchronous commit</primary>
+ </indexterm>
+
+ <para>
+ <firstterm>Asynchronous commit</> is an option that allows transactions
+ to complete more quickly, at the cost that the most recent transactions may
+ be lost if the database should crash. In many applications this is an
+ acceptable tradeoff.
+ </para>
+
+ <para>
+ As described in the previous section, transaction commit is normally
+ <firstterm>synchronous</>: the server waits for the transaction's
+ <acronym>WAL</acronym> records to be flushed to permanent storage
+ before returning a success indication to the client. The client is
+ therefore guaranteed that a transaction reported to be committed will
+ be preserved, even in the event of a server crash immediately after.
+ However, for short transactions this delay is a major component of the
+ total transaction time. Selecting asynchronous commit mode means that
+ the server returns success as soon as the transaction is logically
+ completed, before the <acronym>WAL</acronym> records it generated have
+ actually made their way to disk. This can provide a significant boost
+ in throughput for small transactions.
+ </para>
+
+ <para>
+ Asynchronous commit introduces the risk of data loss. There is a short
+ time window between the report of transaction completion to the client
+ and the time that the transaction is truly committed (that is, it is
+ guaranteed not to be lost if the server crashes). Thus asynchronous
+ commit should not be used if the client will take external actions
+ relying on the assumption that the transaction will be remembered.
+ As an example, a bank would certainly not use asynchronous commit for
+ a transaction recording an ATM's dispensing of cash. But in many
+ scenarios, such as event logging, there is no need for a strong
+ guarantee of this kind.
+ </para>
+
+ <para>
+ The risk that is taken by using asynchronous commit is of data loss,
+ not data corruption. If the database should crash, it will recover
+ by replaying <acronym>WAL</acronym> up to the last record that was
+ flushed. The database will therefore be restored to a self-consistent
+ state, but any transactions that were not yet flushed to disk will
+ not be reflected in that state. The net effect is therefore loss of
+ the last few transactions. Because the transactions are replayed in
+ commit order, no inconsistency can be introduced — for example,
+ if transaction B made changes relying on the effects of a previous
+ transaction A, it is not possible for A's effects to be lost while B's
+ effects are preserved.
+ </para>
+
+ <para>
+ The user can select the commit mode of each transaction, so that
+ it is possible to have both synchronous and asynchronous commit
+ transactions running concurrently. This allows flexible tradeoffs
+ between performance and certainty of transaction durability.
+ The commit mode is controlled by the user-settable parameter
+ <xref linkend="guc-synchronous-commit">, which can be changed in any of
+ the ways that a configuration parameter can be set. The mode used for
+ any one transaction depends on the value of
+ <varname>synchronous_commit</varname> when transaction commit begins.
+ </para>
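  <para>
   As a minimal illustration of per-transaction control (the
   <structname>event_log</> table is hypothetical), a session doing
   noncritical logging might commit just that work asynchronously:
  </para>

<programlisting>
BEGIN;
SET LOCAL synchronous_commit TO OFF;       -- affects only this transaction
INSERT INTO event_log (logged_at, message)
    VALUES (now(), 'page viewed');         -- hypothetical logging table
COMMIT;    -- may return before the WAL record reaches disk
</programlisting>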
+
+ <para>
+ Certain utility commands, for instance <command>DROP TABLE</>, are
+ forced to commit synchronously regardless of the setting of
+ <varname>synchronous_commit</varname>. This is to ensure consistency
+ between the server's filesystem and the logical state of the database.
+ The commands supporting two-phase commit, such as <command>PREPARE
+ TRANSACTION</>, are also always synchronous.
+ </para>
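  <para>
   As a brief sketch of this behavior (the table name is hypothetical):
   even in a session that has turned synchronous commit off, dropping a
   table still waits for its <acronym>WAL</acronym> records to reach disk:
  </para>

<programlisting>
SET synchronous_commit TO OFF;
DROP TABLE obsolete_events;    -- this commit is still forced to be synchronous
</programlisting>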
+
+ <para>
+ If the database crashes during the risk window between an
+ asynchronous commit and the writing of the transaction's
+ <acronym>WAL</acronym> records,
+ then changes made during that transaction <emphasis>will</> be lost.
+ The duration of the
+ risk window is limited because a background process (the <quote>wal
+ writer</>) flushes unwritten <acronym>WAL</acronym> records to disk
+ every <xref linkend="guc-wal-writer-delay"> milliseconds.
+ The actual maximum duration of the risk window is three times
+ <varname>wal_writer_delay</varname> because the wal writer is
+ designed to favor writing whole pages at a time during busy periods.
+ </para>
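  <para>
   For example, with <varname>wal_writer_delay</varname> set to 200
   milliseconds, the risk window for an unflushed asynchronous commit is
   at most about three times that, or roughly 600 milliseconds.
  </para>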
+
+ <caution>
+ <para>
+ An immediate-mode shutdown is equivalent to a server crash, and will
+ therefore cause loss of any unflushed asynchronous commits.
+ </para>
+ </caution>
+
+ <para>
+ Asynchronous commit provides behavior different from setting
+ <xref linkend="guc-fsync"> = off.
+ <varname>fsync</varname> is a server-wide
+ setting that will alter the behavior of all transactions. It disables
+ all logic within <productname>PostgreSQL</> that attempts to synchronize
+ writes to different portions of the database, and therefore a system
+ crash (that is, a hardware or operating system crash, not a failure of
+ <productname>PostgreSQL</> itself) could result in arbitrarily bad
+ corruption of the database state. In many scenarios, asynchronous
+ commit provides most of the performance improvement that could be
+ obtained by turning off <varname>fsync</varname>, but without the risk
+ of data corruption.
+ </para>
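  <para>
   A rough sketch of the difference in scope (the values shown are simply
   the defaults): <varname>fsync</varname> is normally left on for the
   whole server, while <varname>synchronous_commit</varname> can be relaxed
   only in sessions or transactions whose most recent commits are expendable:
  </para>

<programlisting>
-- postgresql.conf: keep full crash safety enabled server-wide
--   fsync = on
--   synchronous_commit = on

-- in a session that can tolerate losing its latest commits:
SET synchronous_commit TO OFF;
</programlisting>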
+
+ <para>
+ <xref linkend="guc-commit-delay"> also sounds very similar to
+ asynchronous commit, but it is actually a synchronous commit method
+ (in fact, <varname>commit_delay</varname> is ignored during an
+ asynchronous commit). <varname>commit_delay</varname> causes a delay
+ just before a synchronous commit attempts to flush
+ <acronym>WAL</acronym> to disk, in the hope that a single flush
+ executed by one such transaction can also serve other transactions
+ committing at about the same time. Setting <varname>commit_delay</varname>
+ can only help when there are many concurrently committing transactions,
+ and it is difficult to tune it to a value that actually helps rather
+ than hurting throughput.
+ </para>
+
+ </sect1>
+

<sect1 id="wal-configuration">
<title><acronym>WAL</acronym> Configuration</title>
<varname>checkpoint_timeout</varname> causes checkpoints to be done
more often. This allows faster after-crash recovery (since less work
will need to be redone). However, one must balance this against the
- increased cost of flushing dirty data pages more often. If
- <xref linkend="guc-full-page-writes"> is set (as is the default), there is
- another factor to consider. To ensure data page consistency,
- the first modification of a data page after each checkpoint results in
+ increased cost of flushing dirty data pages more often. If
+ <xref linkend="guc-full-page-writes"> is set (as is the default), there is
+ another factor to consider. To ensure data page consistency,
+ the first modification of a data page after each checkpoint results in
logging the entire page content. In that case,
a smaller checkpoint interval increases the volume of output to the WAL log,
- partially negating the goal of using a smaller interval,
+ partially negating the goal of using a smaller interval,
and in any case causing more disk I/O.
</para>

don't happen too often. As a simple sanity check on your checkpointing
parameters, you can set the <xref linkend="guc-checkpoint-warning">
parameter. If checkpoints happen closer together than
- <varname>checkpoint_warning</> seconds,
- a message will be output to the server log recommending increasing
+ <varname>checkpoint_warning</> seconds,
+ a message will be output to the server log recommending increasing
<varname>checkpoint_segments</varname>. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such
modifying the configuration parameter <xref
linkend="guc-wal-buffers">. The default number of <acronym>WAL</acronym>
buffers is 8. Increasing this value will
- correspondingly increase shared memory usage. When
- <xref linkend="guc-full-page-writes"> is set and the system is very busy,
- setting this value higher will help smooth response times during the
+ correspondingly increase shared memory usage. When
+ <xref linkend="guc-full-page-writes"> is set and the system is very busy,
+ setting this value higher will help smooth response times during the
period immediately following each checkpoint.
</para>
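  <para>
   As a rough sizing illustration (assuming the default 8 kB
   <acronym>WAL</acronym> block size), the default of 8 buffers amounts to
   64 kB of shared memory, and doubling it to 16 buffers would use 128 kB.
  </para>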

<para>
The <xref linkend="guc-wal-sync-method"> parameter determines how
<productname>PostgreSQL</productname> will ask the kernel to force
- <acronym>WAL</acronym> updates out to disk.
+ <acronym>WAL</acronym> updates out to disk.
All the options should be the same as far as reliability goes,
but it's quite platform-specific which one will be the fastest.
Note that this parameter is irrelevant if <varname>fsync</varname>
The aim of <acronym>WAL</acronym>, to ensure that the log is
written before database records are altered, can be subverted by
disk drives<indexterm><primary>disk drive</></> that falsely report a
- successful write to the kernel,
+ successful write to the kernel,
when in fact they have only cached the data and not yet stored it
on the disk. A power failure in such a situation might still lead to
irrecoverable data corruption. Administrators should try to ensure