Write Cache Performance Improvements
Contents
1 Summary
2 Terminology
4 Architecture
4.1 SpacePort.sys
4.2 Determining Cache Type
4.3 Backwards compatibility
4.4 SpaceLog library
4.4.1 SL_TABLE
4.5 SIO_LOG
4.5.1 Log Initialization
4.5.2 Cache Line Management
4.5.3 Determining Free Space in Cache
4.5.4 Metadata Changes
4.5.5 Checkpointing
4.5.6 Advance
4.5.7 Log Read
4.5.8 Log Write
4.6 SIO_RAID
4.6.1 New Control Codes
4.6.2 Initialization
4.6.3 Converting RAID Offsets to and from Cache Line Offsets
4.6.4 Reads
4.6.5 Writes
4.6.6 Destage
4.6.7 Replay
5 Testing
5.1 Provisioning a VM for Spaces Testing
5.2 Performance Testing
1 Summary
To scale the performance of the write-back cache implementation in Storage Spaces for cache sizes
16GB and greater, cache data and metadata must be managed separately. The primary motivation
for these improvements is to provide a software-based hybrid drive for consumer devices using a
tiered storage configuration (SSD for a cache + performance tier, HDD for capacity). The goal is to
approach SSD performance for random reads and writes to a tiered virtual disk by caching random
I/O on the SSD, as the vast majority of user-perceptible latency due to storage accesses comes from
slow random reads.
In the future, in conjunction with NTFS and the user mode tiering engine (which uses the defrag
engine to move data between tiers), we will also be able to selectively “pin” data ranges to the cache
in order to increase the likelihood of cache hits for frequently used files.
2 Terminology
Storage Spaces: Windows feature that organizes physical disks into storage pools from which logical/virtual disks (Spaces) can be created.

NVMe: NVM Express, a PCIe-attached solid state storage standard. Uses the high bandwidth of the PCIe bus to provide high performance and vastly increase the number of concurrent I/O requests.

SSD: Solid State Drive, a storage device that marries a storage controller implementing a specific protocol (NVMe, SATA, etc.) to NAND flash. Modifications to NAND flash result in a read-modify-write of the flash blocks containing the data to be written. Flash block size is typically larger than sector size.

HDD: Hard Disk Drive, a common device that uses magnetic rotational disks for data storage. Typically good at sequential access, bad at random access (seek penalty: latency waiting for the data reader to land on the appropriate area of a magnetic platter).

Sector: Smallest unit of access for disk devices. Typically 512 bytes in length, though modern drives often have 4KB sectors, which is optimal for most I/O typically sent in an OS environment, as 4KB is the standard page size (unit of addressability for RAM). Drives with 4KB sectors often expose a virtual sector size of 512B for backwards compatibility. Typically, only partition table and OS metadata is accessed in units < 4KB.

LBA: Logical Block Address, the index of a single unit of access for disk devices. The value corresponds to the sector offset of the given disk.

Stripe: Unit representing a single pass of data written to a Storage Space. The default stripe size for Storage Spaces is 256KB.

Column: A set of physical disks to which a single stripe is written. A column may be mirrored for data resiliency.

Row: Array of stripe units spanning every column. A row is the main entity that can be cached, as cache destage should be spread out as evenly as possible over every column to maximize physical disk utilization.

RAID: Redundant Array of Independent Disks, a storage virtualization algorithm that combines multiple physical disks into a single logical/virtual unit for performance, resiliency, or some combination of both.

RAID0 (Striping): Logical disk where data is striped/split evenly between multiple physical disks for performance (the total bandwidth of the underlying physical disks can be used simultaneously). Not resilient to disk failure. Can be nested with RAID1. Minimum 1 column, 1:1 column/disk, minimum 1 disk.

RAID1 (Mirror): Logical disk where data is mirrored/copied between multiple physical disks, where each disk contains the same data as every other disk. As writes cannot complete until each underlying physical disk has been written to, write performance is poor, though reads have no such restriction (they can be served from any disk). Resilient as long as one physical disk remains operational. Can be nested with RAID0 or RAID5. Minimum 1 column, 1:N column/disk, minimum 2 disks.

RAID5/6 (Parity): Logical disk where data is striped with distributed parity. If one drive fails, reads can be recalculated from the parity. Requires at least 3 disks to be resilient to single disk failure, 7 disks for two-disk failure. Provides good read performance since data can be served from each column. Can be nested with RAID1. Minimum 3 columns, 1:1 column/disk, minimum 3 disks, maximum 8 disks (single parity, RAID5). Minimum 7 columns, 1:1 column/disk, minimum 7 disks, maximum 17 disks (dual parity, RAID6).
3.2 Dependencies
4 Architecture
4.1 SpacePort.sys
I/O directed to the log has traditionally been serialized via SpLogWriteCallback. Requests are queued to the callback in SpSpaceReadWrite (previously SpScsiReadWrite, before the I/O path optimizations that reduced the impact of classpnp on the I/O path).
With a larger cache size, serializing writes to prevent reordering during destage is no longer necessary, so this work item can be used simply to requeue writes when the metadata log becomes full or when a cache line is being destaged.
I/O will be redirected to the cache by SIO_RAID objects/slabs, which will forward writes, free space
permitting, to the cache provided that request length is under a programmable threshold (default
256KB, which is the stripe size). Additionally, if a range is already mapped to the cache, I/O requests
will naturally be serviced from the cache.
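As a rough sketch of that redirect decision (the helper name and parameters below are hypothetical; SC_ENV::LogRedirectThreshold is the programmable threshold referenced later in this document):
//
// Hypothetical sketch of the write-redirect decision described above.
//
BOOLEAN
ShouldRedirectWriteToCache (
    __in ULONG Length,
    __in ULONG LogRedirectThreshold,
    __in BOOLEAN RangeAlreadyCached,
    __in BOOLEAN CacheHasFreeCacheLine
    )
{
    //
    // Ranges already mapped to the cache are always serviced from it.
    //
    if (RangeAlreadyCached) {
        return TRUE;
    }
    //
    // Otherwise, only redirect writes under the threshold, free space
    // permitting.
    //
    return (BOOLEAN)((Length <= LogRedirectThreshold) && CacheHasFreeCacheLine);
}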
4.2 Determining Cache Type
Depending on the provisioned write-back cache size, it may be better for performance to use the data log
for write-back caching, since the amount of usable free space in the cache is a function of mapped
cache lines, not the absolute amount of dirty data in the cache.
By default, if Space->GetVersion() >= ScVersionWindowsRedstone3Internal1, then the metadata log implementation will be used, regardless of provisioned cache size. However, the default for
parity journal will be the data log.
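A minimal sketch of that default selection (the variable names, the IsParityJournal input, and the exact decision point are assumptions; SIO_LOG_METADATA and SIO_LOG_DATA are the log types described in this document):
//
// Sketch of the default log-type selection described above.
//
if (!IsParityJournal &&
    (Space->GetVersion() >= ScVersionWindowsRedstone3Internal1)) {

    LogType = SIO_LOG_METADATA;     // true write-back cache (cache lines)

} else {

    LogType = SIO_LOG_DATA;         // legacy log with data payload in records
}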
4.3 Backwards compatibility
Using the write log as a write-back cache (i.e. the data payload is contained in log records, SIO_LOG_DATA)
is still supported for older Spaces versions. For the time being, parity journal still uses SIO_LOG_DATA,
but using the newer write-back cache for parity journal is supported if we choose to do so in the
future for performance reasons.
4.4 SpaceLog library
Another design goal of these write-cache improvements was to minimize changes to the existing core SpaceLog library. SpaceLog is divided into two libraries:
• SlLog: interface between SpaceIo (SIO_LOG) and LogCore
• LogCore: Spaces agnostic write log functionality, including log header manipulation, reading
and writing records, and flushing the log
As there are other potential non-Spaces clients of the write log implemented by SpaceLog.lib, any
Spaces-specific business logic should remain in SpaceIo.lib, which is consumable by all Spaces binaries
in any environment (boot, kernel mode, user mode, dump).
4.4.1 SL_TABLE
SL_TABLE maintains an in-memory representation of the data in the cache/write log using an AVL
table. Each SIO_RAID object of a space that contains a write cache or parity journal allocates an
SL_TABLE to provide fast lookup of ranges contained in the cache. Each node within the AVL table
is of type SL_TABLE_NODE, which represents a row of data in the SIO_RAID object. A row is an array of columns, each of which represents a drive and its mirrors. SL_TABLE is the only class defined in
SpaceLog.lib that was modified to implement true write-back caching.
The states of a given SL_TABLE_NODE conform to the finite state machine shown below:
Two new flags to track node state were added. SL_TABLE_NODE_FLAGS_WRITING is used in conjunc-
tion with SL_TABLE_NODE_FLAGS_DESTAGING and SL_TABLE_NODE_FLAGS_DESTAGED to synchronize
destage with writes to a given row. SL_TABLE_NODE_FLAGS_KEEP_IN_CACHE is not yet used, but is
intended to mark a given row/cache line as “pinned” to the cache, and as such, not eligible for destage.
A new linked list (_TablePinnedListHead) can be added to SIO_RAID in order to partition/track
rows that should always remain cached with best effort (e.g. if a cache is full with pinned nodes and
a write to an unmapped row is marked as pinned, the oldest pinned row has to be evicted).
#define SL_TABLE_NODE_FLAGS_DESTAGING 0x00000001
#define SL_TABLE_NODE_FLAGS_DESTAGED 0x00000002
#define SL_TABLE_NODE_FLAGS_REPLAY 0x00000004
#define SL_TABLE_NODE_FLAGS_WRITING 0x00000008
#define SL_TABLE_NODE_FLAGS_KEEP_IN_CACHE 0x00000010
SL_TABLE::Initialize now takes in additional parameters for the manipulation of cache lines:
NTSTATUS
SL_TABLE::Initialize(
__in SL_TABLE_TYPE Type,
__in ULONG NumberOfColumns,
__in ULONG StripeSize,
__in ULONG BytesPerLogicalSector
);
Type is used to indicate whether the table is associated with a data log or metadata log (i.e. cache).
StripeSize and BytesPerLogicalSector are used to determine the size of the cache lines used for
data, and are only useful when using SIO_LOG to manage a true write-back cache.
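For illustration, a cache-type table for a two-column space using the default 256KB stripe and 4KB logical sectors might be initialized as follows (the Table variable and parameter values are assumptions):
Status = Table->Initialize(SlTableTypeCache,    // Type: metadata log / cache
                           2,                   // NumberOfColumns
                           256 * 1024,          // StripeSize (default)
                           4096);               // BytesPerLogicalSector
if (!NT_SUCCESS(Status)) {
    goto Cleanup;
}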
To preserve backwards compatibility, the SL_TABLE_NODE struct now contains a union of mutually
exclusive structures corresponding to either the data log or metadata log with cache lines.
typedef struct _SL_TABLE_NODE {
//
// Usage type of the node
//
SL_NODE_TYPE Type;
ULONGLONG Row;
...
union {
struct {
//
// Lowest location within the
// log that hosts the data for a
// range within this row
//
SL_LSN MinLsn;
...
LIST_ENTRY ElementListHeads[ANYSIZE_ARRAY];
} Log;
struct {
...
//
// Bitmap tracking active blocks
// within this cache line
//
RTL_BITMAP ActiveBitMap;
//
// Bitmap tracking valid/present
// data within this cache line
//
RTL_BITMAP ValidBitMap;
} Cache;
};
} SL_TABLE_NODE, *PSL_TABLE_NODE;
Also, enumerating which ranges are in the cache requires a new function, SL_TABLE::EnumerateOverlapsCache, as the metadata describing which ranges are present in the cache consists of bitmaps rather than tuples.
private:
NTSTATUS
SL_TABLE::EnumerateOverlapsCache (
__in ULONG Length,
__in ULONGLONG Offset,
__in ULONGLONG Row,
__in ULONG Column,
__in PSL_OVERLAP_ROUTINE OverlapRoutine,
__in PVOID OverlapContext
);
4.5 SIO_LOG
The SIO_LOG layer provides an interface to a persistent write log associated with a given Space/Virtual
Disk. The SIO_LOG class manipulates log records through the SpaceLog library. A new log type is
being added, called SIO_LOG_METADATA, that logs only the metadata for a given data write, while
storing the data in direct mapped cache lines associated with a single SIO_RAID object’s row.
Previously, write-back caching using the write log used (offset, length, column) tuples for metadata,
with the data and metadata coupled as part of an individual log record. Each log record repre-
sented a contiguous range of data, mapped in-memory to an SL_TABLE_ELEMENT. The AVL table’s
nodes (SL_TABLE_NODE), which each map to a single row within the RAID slab, contain lists of
SL_TABLE_ELEMENTs, with one list mapping to each column represented by the node. These lists of
SL_TABLE_ELEMENTs are ordered by offset in order to make the destaging of data from the log as
sequential as possible, with the assumption that the disk(s) to which the data was destaged were
rotational.
In order to save metadata space—with disjoint runs, the metadata size on the log device could
balloon rather quickly—data in the cache is mapped using RTL_BITMAPs. Using bitmaps to track
blocks within each cache line scales better with cache size, as metadata space scales linearly with the
size of the cache (assuming the log is checkpointed regularly), and is more predictable.
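As a rough illustration of the scaling, assume a single-column space (so a cache line is one 256KB stripe) and 4KB sectors: each bitmap then covers 256KB / 4KB = 64 blocks, or 8 bytes, so the active and valid bitmaps together cost 16 bytes per cache line. A 16GB cache contains 16GB / 256KB = 65,536 cache lines, i.e. roughly 1MB of bitmap metadata in total, regardless of how fragmented the cached ranges are.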
Using per-cache line bitmaps for metadata therefore implies that each log record now tracks a cache
line :: row mapping as opposed to a run. This fact also means that multiple records tracking the
state of a given cache line may be present in the log, where only the most recent record (the one
furthest from the start of the log) contains correct metadata.
4.5.1 Log Initialization
The main job of SIO_LOG::Initialize when using cache lines is to partition the available log space
between metadata (the actual log area) and data (where the cache lines are). The expectation is that
the vast majority of available log space should be reserved for cache lines, which are each aligned to
the size of the cache line. Once the number of cache lines is determined from the available space,
two bitmaps are allocated to track the state of “used” and “active” cache lines. A “used” cache line
is defined as a cache line with a valid mapping to the disk being cached. An “active” cache line is
defined as a cache line with dirty data, i.e. data yet to be destaged to the disk being cached.
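A minimal sketch of that partitioning, assuming member names used elsewhere in this document (_CacheOffset, _CacheLineSize, _UsedCacheLinesBitMap, _ActiveCacheLinesBitMap) plus hypothetical MetadataBytes/AvailableBytes inputs; the exact rounding and allocation strategy in SIO_LOG::Initialize may differ:
//
// Sketch: reserve the metadata (log) area, align the data area to the
// cache line size, and size the used/active tracking bitmaps.
//
_CacheOffset = ((MetadataBytes + _CacheLineSize - 1) / _CacheLineSize) * _CacheLineSize;
NumberOfCacheLines = (ULONG)((AvailableBytes - _CacheOffset) / _CacheLineSize);

BitMapBufferSize = ((NumberOfCacheLines + 31) / 32) * sizeof(ULONG);
UsedBuffer = (PULONG)ExAllocatePoolWithTag(NonPagedPoolNx, BitMapBufferSize, 'tbLS');
ActiveBuffer = (PULONG)ExAllocatePoolWithTag(NonPagedPoolNx, BitMapBufferSize, 'tbLS');
if ((UsedBuffer == NULL) || (ActiveBuffer == NULL)) {
    Status = STATUS_INSUFFICIENT_RESOURCES;
    goto Cleanup;
}

RtlInitializeBitMap(&_UsedCacheLinesBitMap, UsedBuffer, NumberOfCacheLines);
RtlInitializeBitMap(&_ActiveCacheLinesBitMap, ActiveBuffer, NumberOfCacheLines);
RtlClearAllBits(&_UsedCacheLinesBitMap);
RtlClearAllBits(&_ActiveCacheLinesBitMap);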
4.5.2 Cache Line Management
To allocate/map a cache line to a given SIO_RAID object’s row, we search for the first clear bit within
SIO_LOG::_UsedCacheLinesBitMap, all while holding the SIO_LOG::_SpinLock:
Index = RtlFindClearBits(&_UsedCacheLinesBitMap, 1, 0);
if (Index == MAXULONG) {
Status = STATUS_LOG_FULL;
goto Cleanup;
}
If we find a free cache line, we mark it as used and active:
RtlSetBit(&_UsedCacheLinesBitMap, Index);
RtlSetBit(&_ActiveCacheLinesBitMap, Index);
To free a cache line, we simply clear the bit associated with a given cache line, derived from the
cache line’s byte offset:
Index = (ULONG)((CacheOffset - _CacheOffset) / _CacheLineSize);
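Presumably the used bit is then cleared under the same lock (a sketch; the retire path described below handles the active bit separately):
RtlClearBit(&_UsedCacheLinesBitMap, Index);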
Additionally, similarly to allocating/freeing cache lines, there is the concept of “activating” and
“retiring” cache lines. “Activating” a cache line occurs when a cache line has new data written to it,
and thus is considered, as a whole, dirty. “Retiring” a cache line occurs when a cache line’s data is
fully destaged, and is thus considered “clean”.
4.5.3 Determining Free Space in Cache
The free space remaining in the cache is a function of which cache lines are mapped/used. Just like
any basic filesystem that uses bitmaps to represent free clusters, all that’s needed to determine the
used percentage is to call RtlNumberOfSetBits:
Lock = _CacheLock.AcquireShared();
SetBits = RtlNumberOfSetBits(&_UsedCacheLinesBitMap);
_CacheLock.ReleaseShared(Lock);
//
// N.B. Used bytes is the total
// *reserved* cache space, part
// of which may not have valid
// data
//
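The reserved and remaining space can then be derived from the set-bit count (a sketch; UsedBytes and FreeBytes are assumed names):
UsedBytes = (ULONGLONG)SetBits * _CacheLineSize;
FreeBytes = ((ULONGLONG)_UsedCacheLinesBitMap.SizeOfBitMap - SetBits) * _CacheLineSize;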
4.5.4 Metadata Changes
In order to support cache lines for data caching, as opposed to storing data and its associated
metadata within a log record, the current log metadata format needed to be updated.
Instead of tracking each contiguous run (offset, length pair) as a separate log record, we can snapshot
the state of a given cache line. In this snapshot, we need to track both valid and active (i.e. not yet
destaged, as this is a write-back cache) data using bitmaps to represent written blocks.
class SIO_LOG_METADATA_INFO : public SIO_LOG_INFO {
public:
//
// Relative offset of the cache
// line to the log device
//
ULONGLONG _Offset;
//
// Row associated with the cache
// line
//
ULONGLONG _Row;
//
// Id associated with the object
// that hosts this range
//
ULONGLONG _Id;
//
// Bitmap buffer size in bytes
//
ULONG _BitMapBufferLength;
//
// Buffer holding the active and
// valid bitmaps (in that order)
//
UCHAR _BitMapBuffer[ANYSIZE_ARRAY];
};
The bitmap buffer at the end of the structure is variable length, so FIELD_OFFSET is used to calculate each record's total metadata size. For example:
RecordLength = FIELD_OFFSET(SIO_LOG_METADATA_INFO,
_BitMapBuffer[Node->Cache.BitMapLength]);
4.5.5 Checkpointing
Status = SlLogAdvanceLogStart(_Handles[SIO_LOG_METADATA],
&Packet,
_LastCheckpointToLsn,
NULL,
NULL);
Once the checkpoint is written, the LSN of all nodes represented by the checkpoint is updated to
the checkpoint’s own LSN so that all nodes are guaranteed to have an LSN >= the StartLsn of the
metadata log. This is done using the new SioControlCodeUpdateLsns control code to broadcast
the checkpoint LSN to all logged SIO_RAID objects.
For the write-back cache, there should only be at most one checkpoint record at any one time, since
the checkpoint record maintains the mapping state of all cached rows in all SIO_RAID objects.
Note that data writes to a given SIO_RAID object will be blocked while a checkpoint write is in
progress. Data writes that occur during a checkpoint write will be queued and redirected back to the
SIO_RAID object later via SpLogWriteCallback.
4.5.5.2 Reading Checkpoints
Checkpoints are used to consolidate log records and reduce scan time for on-disk metadata during
log distribution (constructing in-memory AVL trees to improve lookup times for the cache mappings)
in SIO_LOG::Reconcile. SIO_LOG::Reconcile starts at the tail of the log and makes a first pass
looking for any checkpoint records. Data logs (SIO_LOG_DATA and SIO_LOG_METADATA) are reconciled
before the parity log (SIO_LOG_PARITY).
After processing all the checkpoint records (SIO_LOG::ReconcileCheckpoint), we then scan the
rest of the log records not represented by the checkpoint to rebuild the AVL tree. Contents of
the on-disk metadata are distributed to the SIO_RAID objects via two new SIO_CONTROL_CODEs,
SioControlCodeAddNodes and SioControlCodeAddNode. These new control codes are roughly analogous to SioControlCodeAddElements and SioControlCodeAddElement used for SIO_LOG_DATA,
except we construct the cache mapping of an entire row at a time as opposed to a single contiguous
range within a column.
The biggest difference between using runs and cache lines is that the mapping of a log record changes from record -> element to record -> node. Additionally, there is no payload (data) in the log record when using cache lines, so scanning the log should be much quicker. Since log records must be at least one sector in size, it's possible that the actual metadata contained in the log record will be smaller than the total on-disk size reserved.
4.5.6 Advance
Similar to the parity log, SP_DESTAGE::AttemptAdvance will wait for outstanding destage packets to
complete. Then, SIO_LOG::Advance (called by SpLogAdvanceCallback, which is synchronized with
SpLogWriteCallback via SP_SPACE::_LogMutex) will write a checkpoint with all active nodes, and
then advance the metadata log to the LSN of the last record represented by the checkpoint. This is
because advance effectively removes all previously written metadata records from the log.
As mentioned earlier, checkpointing moves the log tail (effectively doing the advance). However, the
actual SIO_LOG::Advance logic removes destaged nodes from each SIO_RAID object before advancing
the log so that only rows with dirty data in the cache are checkpointed. Logic for parity and data
logs remains unchanged.
Nodes are removed from an SIO_RAID object’s SL_TABLE via SIO_RAID::RemoveNodes, which calls
SIO_RAID::RetireNode with Remove == TRUE in order to free the associated cache line and free the
memory for the SL_TABLE_NODE itself. SIO_RAID::RetireNode is also called with Remove == FALSE
when a row is destaged.
4.5.7 Log Read
Offset = ParentPacket->_CacheOffset + ParentPacket->_RelativeOffset;
Status = SC_ENV::SendAsynchronousFsdRequest(ParentPacket,
IRP_MJ_READ,
DataVa,
ParentPacket->_Length,
Offset,
NULL,
NULL);
The primary difference between the data log and the metadata log when reading data from the cache is that a cache node contains an explicit byte offset into the log at which the data resides, whereas reading data from the data log simply requires the LSN(s) of the associated element(s).
4.5.8 Log Write
Log writes work in much the same way as log reads, except that some log writes may involve
an additional metadata write if the range being written has not yet been populated in the cache.
Whether metadata needs to be written is indicated by the _OptimizeCacheWrite flag, which is set
if the range being written has already been marked as dirty.
if (!ParentPacket->_OptimizeCacheWrite) {
Status = SlLogWriteLogRecord(_Handles[ParentPacket->_Log],
ParentPacket,
StartVa,
ParentPacket->_HeaderLength,
InfoVa,
(ULONG)ParentPacket->_InfoLength,
DataVa,
PayloadLength,
&ParentPacket->_Lsn);
if (!NT_SUCCESS(Status)) {
goto Cleanup;
}
}
Because a metadata log record may need to be written, the buffer describing the data payload will
also contain the log header and metadata. The data payload itself sits in-between the log header and
metadata, just like in the existing write log implementation.
DataVa = (PVOID)((ULONG_PTR)ParentPacket->_Buffer.VirtualAddress() +
ParentPacket->_HeaderLength);
4.6 SIO_RAID
The RAID layer is responsible for the redirection of I/O to a given column, which may be a single
physical disk or a series of mirrored disks. The SIO_RAID object contains an AVL table, _Table, that
caches in-memory the contents of the on-disk log for ranges mapped to the RAID object.
Figure 1: Converting relative offset from SIO_RAID to SIO_LOG
4.6.1 New Control Codes
A few new SIO_CONTROL_CODEs are added in order to accommodate the new per-node bitmap metadata format.
SioControlCodeGetOldestActiveNode,
SioControlCodeUpdateLsns,
SioControlCodeGetNodes,
SioControlCodeAddNode,
SioControlCodeAddNodes,
SioControlCodeRemoveNodes,
4.6.2 Initialization
As part of SIO_RAID object initialization, we allocate and initialize an SL_TABLE if the SIO_RAID is
logged/cached. Since we need to “acquire the region” when writing (explained in further detail
below), we need to set a flag to enable region tracking (typically done for RAID1 spaces):
if (_Space->_Log->_Handles[SIO_LOG_METADATA]) {
SET_FLAG(_Flags, SIO_RAID_FLAGS_REGION_TRACKING);
TableType = SlTableTypeCache;
}
4.6.3 Converting RAID Offsets to and from Cache Line Offsets
To easily convert an SIO_RAID-relative offset to and from a cache-line-relative offset, two helper
functions are added. These functions are declared as virtual, but are only implemented in the
SIO_RAID base class as the implementation remains the same regardless of RAID type:
ULONGLONG
SIO_RAID::CacheToColumn (
__in ULONGLONG Row,
__in ULONGLONG Offset
)
{
return (Offset % _StripeSize) + (Row << _StripeShift);
}
ULONGLONG
SIO_RAID::ColumnToCache (
__in ULONG Column,
__in ULONGLONG Offset
)
{
return (Offset % _StripeSize) + ((ULONGLONG)Column << _StripeShift);
}
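As a worked example with the default 256KB stripe (so _StripeSize == 0x40000 and, by assumption, _StripeShift == 18): a write at byte offset 0x1000 within column 2 maps to cache line offset ColumnToCache(2, 0x1000) = 0x1000 + (2 << 18) = 0x81000, and for row 5 that same cache line offset maps back to a column offset of CacheToColumn(5, 0x81000) = (0x81000 % 0x40000) + (5 << 18) = 0x141000.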
Cache lines are conceptually an array of all columns in a row. The start of the cache line represents
column 0, and the end byte of the cache line is mapped to the last column in the row. A cache line
is considered full if the data columns in the row are completely dirty (all bits for those columns are set in the active bitmap).
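A simplified check for that condition might look like the following sketch, which treats every column in the row as a data column (an assumption; mirror or parity columns may need to be excluded):
//
// Sketch: the cache line is full when every sector tracked by the
// active bitmap is dirty.
//
Full = (BOOLEAN)(RtlNumberOfSetBits(&Node->Cache.ActiveBitMap) ==
                 Node->Cache.ActiveBitMap.SizeOfBitMap);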
4.6.4 Reads
The main entry point for reads to an SIO_RAID object is SIO_RAID::BuildChildPacketsRead, which
simply calls SioRaidOverlapRoutineRead to determine whether a read should be serviced from the
cache or from the SIO_RAID object itself.
The read I/O path is simpler than for writes, as reads don't modify on-disk state; the goal is simply to find the appropriate location of the data requested. The routine used to find whether a range exists in the cache (SL_TABLE::EnumerateOverlapsCache) is used for both reads and writes.
Child read packets are built in SioRaidOverlapRoutineReadCache. For cached data ranges, packets
are directed at the SIO_LOG object; otherwise, packets are directed at the SIO_RAID object itself.
On read completion, we reinsert the associated node at the tail of the active node list and update
the node’s LastAccessTime as long as the node wasn’t either:
• Destaged
• Destaging
• Being written
This means that more recently read data will stay in the cache longer. Note that sequential scans
are a weakness of LRU, though the likelihood of a sequential scan is slightly mitigated by the
SC_ENV::LogRedirectThreshold, which prevents unmapped writes over a given block size from
even hitting the cache.
If we completed a read for a node that was just destaged, we also retry the read from the original
SIO_RAID object to make sure we read the right data.
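A sketch of that read-completion path, assuming the node links into _TableActiveListHead through a ListEntry field and that LastAccessTime is a LARGE_INTEGER system timestamp:
if ((Node->Flags & (SL_TABLE_NODE_FLAGS_DESTAGED |
                    SL_TABLE_NODE_FLAGS_DESTAGING |
                    SL_TABLE_NODE_FLAGS_WRITING)) == 0) {

    //
    // Move the node to the most-recently-used end of the LRU list and
    // refresh its access time so it survives longer in the cache.
    //
    RemoveEntryList(&Node->ListEntry);
    InsertTailList(&_TableActiveListHead, &Node->ListEntry);
    KeQuerySystemTime(&Node->LastAccessTime);
}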
4.6.5 Writes
if (_Table && _Table->_Type == SlTableTypeCache) {
//
// Writing to the same row/node
// is considered overlapping as
// otherwise metadata records
// may hold incomplete bitmaps
// of mapped ranges
//
Overlap = TRUE;
}
}
Writing to a row must also be synchronized with destage, as we must prevent writing to a row while
it’s being destaged.
New fields are added to the packet to describe cache line I/O:
//
// Id associated with the record
// being worked upon
//
SL_LSN _Lsn;
//
// Offset, in bytes, relative to
// the start of this record
//
ULONGLONG _RelativeOffset;
//
// Offset, in bytes, of the data
// cache area of the log device
//
ULONGLONG _CacheOffset;
...
//
// Size in bytes of the bitmaps
// associated with this record
//
ULONG _BitMapLength;
...
//
// Avoid writing metadata record
// for writes where the metadata
// wouldn't be modified
//
BOOLEAN _OptimizeCacheWrite;
if (!NT_SUCCESS(Status)) {
//
// If there isn't a free cache
// line available, then just let
// the packet go to disk
//
ChildPacket->_State = ScStateNormal;
DataVa = ChildPacket->_Buffer.VirtualAddress();
RtlMoveMemory(DataVa,
(PVOID)((ULONG_PTR)DataVa + HeaderLength),
Length);
Status = STATUS_SUCCESS;
goto Cleanup;
}
For nodes already mapped to a row, the associated cache line offset is found in Node->Cache.Offset.
If a node does exist, however, we then need to check whether the row is being destaged; if it is, then
we need to requeue the write to be serviced after the destage operation is complete.
If the node has already been destaged, then we simply clear the destaged flag, as the node will be
dirty due to this write.
At this point, we set a flag indicating that this row is being written, and therefore cannot be
destaged. As an optimization to avoid unnecessary metadata writes, we also check whether the
range being written is already marked as dirty. If the range is already dirty, then we don’t have
to write a metadata record for this write. Avoiding this basic write amplification improves random
write performance significantly as long as we are writing to data already mapped in the cache. As
a potential further optimization, we could avoid writing metadata at all for writes and then flush
the metadata all at once in response to a filesystem flush, as long as we could halt all writes while
writing out the checkpoint.
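A sketch of the dirty-range check that drives _OptimizeCacheWrite (the bit-index arithmetic and the CacheLineOffset/BytesPerLogicalSector names are assumptions; RtlAreBitsSet returns TRUE only if the whole run is already set):
//
// Sketch: skip the metadata record only if every sector in the range
// being written is already marked dirty in the active bitmap.
//
BitIndex = (ULONG)(CacheLineOffset / BytesPerLogicalSector);
BitCount = Length / BytesPerLogicalSector;

ParentPacket->_OptimizeCacheWrite =
    RtlAreBitsSet(&Node->Cache.ActiveBitMap, BitIndex, BitCount);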
Once the cache line offset is found, an SIO_LOG_PACKET referencing the cache line is mapped to the
write, and issued to the SIO_LOG layer. Then, the byte offset relative to the cache line for the new
data can be found using SIO_RAID::ColumnToCache.
Note that cache line bitmaps have a sector size granularity; that is, each bit in the bitmap represents
one logical sector on the cache device. It is assumed that all packets will have a byte length evenly
divisible by the sector size of the cache’s child space.
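Marking a newly written range as present and dirty then reduces to converting the byte range into a bit range and calling RtlSetBits on both bitmaps (a sketch; variable names are assumptions):
BitIndex = (ULONG)(CacheLineOffset / BytesPerLogicalSector);
BitCount = Length / BytesPerLogicalSector;

RtlSetBits(&Node->Cache.ValidBitMap, BitIndex, BitCount);
RtlSetBits(&Node->Cache.ActiveBitMap, BitIndex, BitCount);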
Once a write completes back to the SIO_RAID object from which it originated, the associated node is
reinserted at the tail of the _TableActiveListHead to preserve LRU semantics and its access time
updated with the current time. Finally, the region is released so the next packet waiting on that
region may proceed.
4.6.6 Destage
Destage from the log was previously done by LSN; nodes with active elements were ordered by lowest
LSN and then candidates for destage were sorted in sequential order by offset. In order to optimize
for data writes to ranges previously mapped to a cache line (i.e. ranges that have been written to at
least once without being destaged), destage policy now uses a timestamp as a temporal threshold to
indicate which rows are eligible for destage.
When selecting candidates for destage (SIO_RAID::DestageGetCandidates), each RAID object will
iterate through its active list, which is ordered from least to most recently accessed, and select a
row/node to destage if eligible. Eligibility for a destage candidate is determined by the following
criteria:
• Row is not already destaged
• Row is not currently destaging
• Row is not being written
• Row access time is below the destage timestamp
Once a candidate is successfully selected, it's inserted into a drive's destage queue (SIO_DRIVE_CONTEXT::_Candidates)
by calling SIO_DRIVE_CONTEXT::InsertCandidateCache. The insertion maintains a queue ordering
by row to optimize for sequential write-back.
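A sketch of the eligibility test (the flag names come from the SL_TABLE section above; the exact form of the timestamp comparison is an assumption):
//
// Sketch: a row is a destage candidate only if it is idle and was last
// accessed before the destage timestamp threshold.
//
Eligible = (BOOLEAN)(((Node->Flags & (SL_TABLE_NODE_FLAGS_DESTAGED |
                                      SL_TABLE_NODE_FLAGS_DESTAGING |
                                      SL_TABLE_NODE_FLAGS_WRITING)) == 0) &&
                     (Node->LastAccessTime.QuadPart < DestageTimestamp.QuadPart));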
Policy for destaging in the new write-back cache is now least recently used (LRU). This policy is enforced by reordering the _TableActiveListHead on every access to a given cache line/node. Candidate nodes for destage are then picked from the least recently used end of _TableActiveListHead, and ranges within the candidate nodes are ordered by column and offset so as to issue the destage writes sequentially to the backing columns.
Default destage thresholds are still untouched, but it’s likely that both the start and stop thresholds
(below) will need to be increased for optimal performance (TBD based on empirical observation).
#define SIO_DESTAGE_START_THRESHOLD 25 // in %
#define SIO_DESTAGE_STOP_THRESHOLD 10 // in %
//
// Read each dirty range in the
// cache to destage
//
RunLength = RtlFindNextForwardRunClear(&Node->Cache.ActiveBitMap,
RunIndex,
&RunNextIndex);
if (RunLength == 0) {
RunNextIndex = Node->Cache.ActiveBitMap.SizeOfBitMap;
}
if (RunNextIndex == RunIndex) {
continue;
}
//
// If a run goes past the column
// then truncate at the stripe
// boundary
//
//
// Calculate the offset relative
// to the column
//
...
}
4.6.7 Replay
Replay required only a slight modification: the node associated with the replay packet is now provided so that data can be read from a cache line. This meant adding an SL_TABLE_NODE pointer to SIO_LOG_PARITY_INFO.
5 Testing
As with any changes to Spaces libraries or driver code, functionality must be verified via the Spaces
test pass using WTT. Additionally, performance must be benchmarked against both the old write
log implementation and third party competitors.
5.1 Provisioning a VM for Spaces Testing
The functional tests for Storage Spaces are in the 1Windows Datastore in WTT. Instructions to provision a WTT-enabled VM for testing can be found at https://osgwiki.com/wiki/Creating_a_WTT-enabled_VM. To prevent unnecessary test failures, add ntlab@ntdev.corp.microsoft.com as a
domain joined administrator (Workflow 1Windows: 407 seems to work well in my experience for
installation) and run 1Windows jobs 171857 and 220265 to prep the VM for running Spaces tests.
1Windows jobs 220134 through 220153 inclusive test Storage Spaces write cache functionality
specifically.
5.2 Performance Testing
The easiest way to generate workloads and gather performance data for disks on Windows
is diskspd.exe. Documentation and source code for diskspd can be found on GitHub at
https://github.com/Microsoft/diskspd.
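For example, a sustained 4KB random-write workload against a tiered virtual disk mounted as T: could be generated with something along these lines (file size, queue depth, and duration are illustrative):
diskspd.exe -c64G -b4K -r -w100 -t4 -o32 -d120 -Sh -L T:\spaces_test.dat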