US20160048427A1 - Virtual subdirectory management - Google Patents
Virtual subdirectory management
- Publication number: US20160048427A1 (Application US 14/828,942)
- Authority: United States (US)
- Prior art keywords: data, file system, primary, node, restore
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1448—Management of the data involved in backup or backup restore
- G06F11/1451—Management of the data involved in backup or backup restore by selection of backup contents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1471—Saving, restoring, recovering or retrying involving logging of persistent data for recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1458—Management of the backup or restore process
- G06F11/1464—Management of the backup or restore process for networked environments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2053—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
- G06F11/2056—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
- G06F11/2058—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring using more than 2 mirrored copies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2053—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
- G06F11/2056—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
- G06F11/2082—Data synchronisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/184—Distributed file systems implemented as replicated file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/188—Virtual file systems
-
- G06F17/30212—
-
- G06F17/30233—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45591—Monitoring or debugging support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/85—Active fault masking without idle spares
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0619—Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/065—Replication mechanisms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Definitions
- HA High Availability
- the analytics may enable other functions such as data intelligence.
- Primary data is read from and written to a primary storage pool. As the data is written to the primary pool, it is automatically mirrored to a recovery pool and tracked for data protection.
- the mirror can also be used for intelligence including analytics stored as discovery points.
- the techniques disclosed herein relate to a system that merges primary data storage, data protection, and intelligence into a single unified system.
- the unified system provides primary and restore data, analytics, and analytics-based data protection without requiring separate solutions for each aspect.
- Intelligence is provided through inline data analytics, with additional data intelligence and analytics gathered on protected data and prior analytics, and stored in discovery points, all without impacting performance of primary storage.
- Multi-threaded log writes are implemented at a protection and analytics (PART) node.
- the PART node receives access requests from multiple concurrently executing threads, and assigns a transaction identifier (ID) to the access requests.
- The PART then collects the access requests in a random access, multithreaded log before forwarding them from the PART node to both the primary node and the restore node.
- The PART may further optionally determine when the number of access requests in the random access, multithreaded log reaches a predetermined number. At that time, the PART issues a synchronization command to the primary and restore nodes, which causes data to be flushed from respective temporary caches to a persistent file system in each of the primary and restore nodes. Once data is confirmed as having been flushed in both nodes, the PART may then release entries in the random access, multithreaded log.
- the PART maintains a set of file system level objects, one for each subdirectory in a directory tree created by an application, such as a hypervisor.
- The PART intercepts a make directory request from the application and stores a new file system level object for each subdirectory in the tree.
- the file system level object contains access information for the corresponding subdirectory, such that multiple make directory requests result in storing a corresponding multiple number of file system level objects as a virtual file system.
- Access requests for a file system object located within a subdirectory are then serviced by the primary and restore nodes using only the virtual file system level object information, and not the subdirectory directly. This ensures that the virtual file system objects remain transparent to the application.
- a property may be associated with two or more virtual file system objects to indicate that an access request applies to two or more subdirectories as a consistency group.
- the data-intelligent storage system intercepts a request to clone a data object.
- a clone object is first thin provisioned and opened for access. Data is copied to the clone object only upon the first to occur of either (a) a subsequent access request for the clone object, or (b) as part of a background restore process.
- Thin provisioning may involve creating a bitmap data object containing a bit for each one of a plurality of data chunks in the data object.
- Bits may be set in the bitmap corresponding to data chunks referred to in the subsequent access request for the clone. In such an instance, the bitmap is updated as data chunks are copied to the clone object.
- a separate process for handling temporary clone objects uses the bitmaps to determine when to access the original object, the clone, or a snapshot.
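A minimal sketch, in Python, of the thin-provisioned clone idea described above. The class and method names (ThinClone, read, background_restore_step) and the chunk size are illustrative assumptions, not names or values taken from the patent.

```python
# Hedged sketch: thin-provisioned clone with a per-chunk bitmap (all names invented).

CHUNK_SIZE = 4096

class ThinClone:
    def __init__(self, source: bytes):
        self.source = source                      # original data object
        nchunks = (len(source) + CHUNK_SIZE - 1) // CHUNK_SIZE
        self.copied = [False] * nchunks           # bitmap: True once a chunk is materialized
        self.data = bytearray(len(source))        # clone storage (allocated lazily in a real system)

    def _materialize(self, chunk: int):
        """Copy one chunk from the source on first touch and set its bit."""
        if not self.copied[chunk]:
            start = chunk * CHUNK_SIZE
            end = min(start + CHUNK_SIZE, len(self.source))
            self.data[start:end] = self.source[start:end]
            self.copied[chunk] = True

    def read(self, offset: int, length: int) -> bytes:
        """Path (a): a subsequent access request copies the touched chunks on demand."""
        first, last = offset // CHUNK_SIZE, (offset + length - 1) // CHUNK_SIZE
        for c in range(first, last + 1):
            self._materialize(c)
        return bytes(self.data[offset:offset + length])

    def background_restore_step(self) -> bool:
        """Path (b): a background restore process copies the next un-copied chunk, if any."""
        for c, done in enumerate(self.copied):
            if not done:
                self._materialize(c)
                return True
        return False                              # clone fully populated

# The clone is open for access immediately; data moves only as it is touched
# or as the background loop runs.
clone = ThinClone(b"engineering-data" * 1024)
print(clone.read(0, 16))
while clone.background_restore_step():
    pass
```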
- FIG. 1 is a diagram showing interaction between a Primary Node, Intelligence Node, and Remote Intelligence Node, and connected storage pools.
- FIG. 2 is a view of an appliance device containing Primary and Intelligence Nodes.
- FIG. 3 is a diagram showing the components of a Primary Node.
- FIG. 4 is a diagram showing the components of an Intelligence Node.
- FIG. 5 is a diagram showing the analytics flow process.
- FIG. 6A is a diagram showing the structure of a change catalog.
- FIG. 6B shows a discovery point
- FIG. 7 is a diagram illustrating a multi-threaded log.
- FIG. 8 shows a process flow for handling an access request at the PART.
- FIG. 9 shows a process flow for synchronizing requests to release entries in the multi-threaded log.
- FIG. 10 shows write access gathering at the PART.
- FIG. 11 shows how virtual file system objects representing virtual machine subdirectories created by a hypervisor application can be submitted to a snapshot process.
- FIG. 12 shows a more general case where an action is applied in a restore node across subdirectory trees and file objects stored within those directories.
- FIG. 13 is a high level diagram of a system that provides clone on demand with background migration of data and metadata.
- FIG. 14 illustrates a file object and corresponding clone bitmap.
- FIG. 15 is a process flow for creating a clone of a file object.
- FIG. 16 is a process flow for accessing a cloned file object.
- FIG. 17 shows a directory tree object and its corresponding clone.
- FIG. 18 is a process flow for creating a directory object.
- FIG. 19 is a process flow for accessing a temporary clone directory object.
- Primary Storage: networked storage accessible to multiple computers/workstations.
- the storage can be accessed via any networked device, either as files or blocks.
- “primary storage” refers to both blocks and files.
- Intelligence Storage: secondary storage containing gathered intelligence, discovery points, and a redundant real-time copy of files and block data contained in Primary Storage.
- Primary Node includes access protocols to communicate with an Intelligence Node, Remote Sites, and Expansion Nodes; access protocols layer (for example, NFS, SMB, iSCSI); protection and analytics in real-time (“PART”) layer; file and block storage layer (file system, block volume); and connection to storage devices (RAID, DISK, etc.).
- a Primary Node appears to system users as Primary Storage, and provides an interface and controls to act as the access to Intelligence Storage.
- Intelligence Node includes access protocols to communicate with a Primary Node, Remote Sites, and Expansion Nodes; data intelligence storage layer (intelligent data services & rules processing); file and block storage layer (file system, block volume); and connection to storage devices (RAID, long-term storage).
- intelligence node data is accessed by users through a Primary Node, but in alternate embodiments Intelligence Nodes may be directly accessed by users.
- Discovery Point: created from a mirrored (high availability) copy of primary data, a discovery point contains data analytics for accessed and changed primary data since a prior discovery point.
- a discovery point may contain the changed data, providing for a virtually full but physically sparse copy of the primary data captured at a user-specified point in time or dynamically based on change rate or other analytics.
- analytics metadata stored in a discovery point can be expanded as deeper levels of user data analysis are performed and more analytics are gathered. Tracked primary data changes can be retained for the life of the discovery point or can be removed at scheduled or dynamic intervals, such as after deep data analysis is complete and desired analytics metadata is obtained. Removing primary data allows for more efficient space utilization, while retaining primary data enables point-in-time recovery of that version of data.
- Change Catalog: an ordered set of real-time access and change information related to a data object, tracked at a discovery point granularity.
- a change catalog tracks who, how, when, and where aspects of a data object being accessed and/or modified. There is one change catalog for every discovery point.
- Remote Site: one or more off-site nodes in communication with local site primary or intelligence nodes.
- Pool: the collection of data storage connected to a node.
- Object: a file, directory, share, volume, region within a volume, or an embedded object.
- Objects can be complex, containing other embedded objects.
- a file can be a container containing other files, or a volume can have a file system on top of it which in turn contains files.
- the system is capable of recognizing complex objects and tracking changes at finer embedded object granularity.
- Selective Restore: an automatic (policy based) or manual (customer initiated) restore at an object level.
- Site Restore: a manually initiated process to recreate primary or intelligence pool content using a previously protected version of the data being restored.
- Container: an object which may have other embedded objects, such as a file, directory, file system, or volume.
- Expansion Node: an appliance having a processor, memory (RAM), network connectivity, and storage devices, connected to one or more primary or intelligence nodes to scale the processing power and/or storage of the connected nodes.
- the disclosed high availability (HA) storage system provides primary storage, analytics, and live restore functions.
- Live restore is a technique used to optimize data restoration. It can be used to recover user data in case of a failure or to recover previous versions of the user data.
- the system provides primary storage access as block and/or file level storage while avoiding single points of failure.
- the system collects analytics in real-time while also protecting data in real-time on separate physical media, and includes options for off-site data protection.
- the system implements deep analytics enabling restore, storage, and data intelligence, and protects both customer data and associated analytics.
- the system provides traditional file based and custom API methods for extracting analytics metadata.
- the system employs Live Restore techniques at a file and at a block level to recover in case of a failure or to recover a previous version of user data.
- a file or block level Live Restore uses previously gathered analytics to prioritize data to be restored, while allowing user I/O access to the data during restoration.
- Primary Node 100 of the system connects within a network to provide block and/or file level storage access to connected computing devices (not shown), real-time data protection, and real-time analytics of primary data.
- Primary data is read from and written to primary storage pool 110 .
- the data can be written or read as files or blocks depending on the access protocol being used.
- As the data is written it is automatically mirrored and tracked for data protection as part of a HA process for the primary node.
- the mirrored cache of the data is created for Intelligence Node 120 .
- the Intelligence Node enables data protection, analytics, and recovery.
- The Intelligence Node stores a real-time copy of primary data, analytics, and discovery points within intelligence pool 130. Discovery points are automatically or manually created at any point by the Intelligence Node, and are based on fine grained change data, enabling action to be taken immediately with no need to copy the underlying primary data or do any post processing to determine what has changed since any prior discovery point.
- Each Node is capable of acting as either a Primary Node, an Intelligence Node, or both. For reliability and performance reasons, separate Primary and Intelligence Nodes are desirable. In case of failure of either node, the other may take over operation of both. Implementation without dual capability (that is, operating solely as a Primary Node or solely as an Intelligence Node) is possible, but loss of service (to either primary or intelligence storage) would occur on failure of such a node.
- each one of the Nodes has a processor and local memory for storing and executing Node software, a connection to physical storage media, and one or more network connections including at least a dedicated high bandwidth and low latency communication path to other Nodes.
- the Primary Node and Intelligence Node are physically housed within a single device, creating a user impression of a single appliance.
- FIG. 2 shows one such example, with Primary Node 100 and Intelligence Node 120 housed together to appear as a single physical appliance.
- Implementation may be with any number of disks, for example a four rack unit (4U) housing containing up to twenty-four hard drives, with separate physical storage devices connected to the system.
- each node is completely separated from the other with the exception of a backplane, with each node having a dedicated (not shared) power supply, processor, memory, network connection, operating media and optionally non-volatile memory. Separation enables continued operation, for example the Intelligence Node may continue operating should the Primary Node fail, and vice versa, but shared resource implementation is also possible.
- a node actively operating as Primary Node 100 operates storage protocol server software 300 , for example Common Internet File System (CIFS), Network File System (NFS), Server Message Block (SMB), or Internet Small Computer System Interface (iSCSI), so the Primary Node will appear as primary storage to network-connected computer devices.
- the storage protocol server software also communicates with a protection and analytics in real-time process (PART) 310 which intercepts and takes action on every data access.
- the PART 310 performs three main roles after intercepting any data access request: mirroring primary data for HA, gathering in-line data analytics on primary data, and storing primary data.
- the examples explained herein are directed to a file access perspective, but the PART can similarly process block level accesses.
- the PART can identify embedded objects and perform the same analysis that is applied to file-level accesses.
- Intercepted access requests include read, modify (write data or alter attributes, such as renaming, moving, or changing permissions), create, and delete.
- the PART tracks and mirrors the request (and data) to the Intelligence Node. Communication with the Intelligence Node is through synchronous or asynchronous inter-process communication (IPC) 340 depending on configuration.
- IPC may include any suitable protocols or connections, such as Remote Procedure Call (RPC) or a Board-to-Board (B2B) high performance, low latency communication path that may be hardware specific.
- Any data included with a data access request, such as included in write operations, is also mirrored to the Intelligence Node as part of HA system operation. This mirroring establishes data protection through real-time redundancy of primary storage.
- the PART executes in-line analysis of primary data, gathering real-time analytics.
- the PART sends gathered real-time analytics to the Intelligence Node, where the analytics are added to a change catalog maintained by the Intelligence Node.
- the PART directs the request to an actual file system, for example Fourth Extended File System (EXT4) or Z File System (ZFS), or block volume for file or block storage access 330 to physical storage devices.
- the storage access function 330 (be it file system level or block level) performs the access request on storage media, and returns the result to the PART for return to the requesting system.
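As a rough illustration of the intercept path just described (mirror the request to the Intelligence Node, gather in-line analytics, then issue the request to the local file system and return the result), the following Python sketch uses invented class and function names (AccessRequest, IntelligenceLink, PrimaryStorage, part_handle); the actual PART is not exposed as such an API.

```python
from dataclasses import dataclass
from datetime import datetime

# Hedged sketch of the PART intercept path; identifiers are placeholders.

@dataclass
class AccessRequest:
    user: str
    op: str          # "read", "write", "create", "delete", ...
    path: str
    data: bytes = b""

class IntelligenceLink:
    """Stands in for the IPC channel to the Intelligence Node."""
    def __init__(self):
        self.mirrored, self.catalog = [], []
    def mirror(self, req): self.mirrored.append(req)
    def add_catalog_entry(self, entry): self.catalog.append(entry)

class PrimaryStorage:
    """Stands in for the local file system / block volume access layer."""
    def __init__(self): self.files = {}
    def execute(self, req):
        if req.op == "write":
            self.files[req.path] = req.data
        return self.files.get(req.path, b"")

def part_handle(req, link, storage):
    link.mirror(req)                                                     # real-time redundancy (HA)
    link.add_catalog_entry((req.user, req.op, req.path, datetime.now())) # in-line analytics
    return storage.execute(req)                                          # then service the request locally

link, storage = IntelligenceLink(), PrimaryStorage()
part_handle(AccessRequest("alice", "write", "/vm0/a1.vmdk", b"disk image"), link, storage)
print(part_handle(AccessRequest("bob", "read", "/vm0/a1.vmdk"), link, storage))
```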
- the storage media includes disks attached to the system, but other storage media solutions are possible.
- the Primary Node also includes the software necessary to operate as an Intelligence Node in case of Intelligence Node failure.
- the Primary Node also operates management software.
- the management software provides system administrators access to configure and manage system users and access discovery points for the restore process.
- A node actively operating as Intelligence Node 120 operates inter-process communication (IPC) software 400 capable of communicating with the Primary Node.
- the communication software includes an API to receive real time analytics (change catalog entries) from the Primary Node, data change and access requests (read, modify, create, delete) from the Primary Node, data protection and intelligence control commands, and data restore commands.
- Data protection and intelligence control commands include commands for creating discovery points, setting up management rules for managing discovery points (including deletion), and searching and restoring content that has been backed up.
- Data restore commands include commands for accessing previously backed up data.
- the Intelligence Node maintains a change catalog 600 containing real-time analytics gathered from accessed and changed data since the last discovery point 650 .
- a discovery point is also created by associating and storing a change catalog together with reference to the mirrored copy of changed primary data since the last discovery point as maintained in the intelligence pool.
- the Intelligence Node implements file or block-level access 430 to its own pool 130 of physical storage. This intelligence storage pool retains the real-time copy of primary data and discovery points.
- the stored intelligence data within discovery points includes in-line analytics (change catalog) as received from the Primary Node and additional analytics 410 executed by the Intelligence Node.
- the real-time copy of primary data also enables distributed response processing between the Primary and Intelligence Nodes. For example, load balancing between the Primary and Intelligence Nodes may enable greater scalability. As both have real-time copies of primary data, read requests may be balanced between the nodes, or alternatively directed to both nodes with the fastest-to-respond used for the response.
- the Primary Node may act as a controller for such distributed processing, or a separate controller may be used.
- While Primary 110 and Intelligence 130 data reside on the same appliance, they can be distributed to multiple discrete appliances employing all of the same techniques, with the exception that communication is performed over a network transport instead of using the HA mechanisms within an array.
- Intelligence is at the core of the system. There are four types of intelligence functions in the system: Data, Operational, Storage, and Recovery. All four use the same processing engine and common analytics metadata to provide analysis both at fixed points and as gathered over time.
- Data Intelligence 452 allows for intelligent user content management.
- Operational Intelligence 456 analyzes the behavior of the system and application logs stored on the system to provide insight into applications and security of the system.
- Storage Intelligence 454 allows for intelligent storage system resource management, including automatic storage allocation and reallocation including dynamically growing and shrinking storage pools.
- Recovery Intelligence 450 allows for intelligent data protection and data restore. All types of intelligence may be used for, or enable operation in conjunction with, different types of analytics, such as, but not limited to, collaboration, trending, e-discovery, audits, scoring, and similarity.
- Analytics begin at the Primary Node, which tracks data access and data modifications, system behavior, change rates, and other real-time analytics. It provides this real-time analytics information to the Intelligence Node. Intelligence gathering determines time and owner relationships with the data for collaboration and contextual information about the data. The gathered intelligence is used for later search and reporting, and is tracked in change catalogs associated with the data.
- Change catalogs 600 are created as part of in-line real-time analytics 500 performed by the Primary Node 100, are then further expanded by the Intelligence Node 120 through additional data processing, and form the foundation for later search.
- the change catalog data is initially created in real-time at the Primary Node (such as via PART 310 ) and includes extended information about the specific data access, for example, allowing complete tracking of who/how/when/where accessed, created, modified, or deleted a file or other data object.
- Traditional file metadata includes only an owner, group, path, access rights, file size, and last modified timestamp. This provides some, but not complete, information about a file.
- the PART operated by the Primary Node, intercepts every file access event.
- the Primary Node has the ability to track extended metadata about a file—including identification of every modification and every access, even those which do not modify the file, by timestamp, user, and type of access.
- this extended metadata is stored as a change catalog entry 610 that identifies the object being accessed, the actor (user performing an operation), and the operation being performed. Additional information which may be in a change catalog entry includes, but is not limited to, object name, owner, access control lists, and time of operation.
- the change catalog 600 contains this extended metadata information, and serves as the foundation of further analytics, such as performed later by the Intelligence Node.
- the change catalog entry may also include security information, such as permission rights for access, associated with the object.
- An administrator may configure the degree of tracking, or even enable/disable tracking on a file location, user, group-specific, or other basis, and the Primary Node is capable of incorporating all details of every file access into the change catalog entries.
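A change catalog entry, as described, can be pictured as a small record. The field names in the sketch below are assumptions drawn from the list above (object, actor, operation, time, owner, access control lists), not a published schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Hedged sketch of a change catalog entry 610; field names follow the prose above.

@dataclass
class ChangeCatalogEntry:
    obj: str                      # object being accessed (file, directory, embedded object)
    actor: str                    # user performing the operation
    operation: str                # read / modify / create / delete / rename ...
    when: datetime
    owner: str = ""
    acl: List[str] = field(default_factory=list)   # access control lists / permission rights

# The change catalog 600 is then just the ordered set of such entries gathered
# since the last discovery point.
catalog: List[ChangeCatalogEntry] = []
catalog.append(ChangeCatalogEntry("/vm0/a1.vmdk", "alice", "modify", datetime.now(), "alice"))
```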
- the change catalog metadata tracks incremental changes which are also linked to a discovery point 650 . Every time a new discovery point is created the current change catalog is closed off and stored within the discovery point. When data is retained in the discovery point, the system may be configured to retain a copy of the discovery point analytics metadata at the Intelligence Node even if that discovery point is migrated off the Intelligence Node, enabling more efficient query processing.
- a discovery point 650 is created by associating and storing a change catalog together with the mirrored copy of changed primary data since the last discovery point in the intelligence pool. After a discovery point creation, a new change catalog 600 is created allowing gathering of new real-time analytics on primary data. Change catalogs and discovery points are preferably maintained per volume or file system in primary storage, but may also span multiple volumes or file systems. Discovery points allow deeper analytics on a point in time version of primary data, and can also be used to recover a prior version of primary data.
- a discovery point contains data analytics for accessed and changed data since a prior discovery point. When created, a discovery point also contains a virtually full but physically sparse copy of primary data at the time of creation of that discovery point.
- the system uses data visible within discovery points to perform deeper data processing, creating more analytics metadata.
- the analysis is done on accessed and changed data since a previous discovery point, using the real-time analytics reflected in the change catalog. These newly gathered deeper analytics are also stored within the discovery point.
- Primary data may be retained for the life of the discovery point, or may be removed earlier, such as after the deep data analysis is complete and desired analytics metadata obtained. Removing the primary data allows for more efficient space utilization, while retaining the primary data enables recovery of primary data at the point in time of the creation of the discovery point. From one discovery point until the creation of a next discovery point, file changes, deletions, renames, creations and such are tracked as cumulative modifications from the prior discovery point, so that only incremental changes are maintained.
- Discovery points can be deleted manually through a delete discovery point command, or automatically based on time or analysis in order to save storage space or for off-site migration. Deletion of discovery points is complicated by management of analytics metadata.
- the analytics metadata stored within a discovery point contains information about data changed within a period of time. If the stored analytics are deleted they can be lost. To prevent this, the time period for analytics associated with one or more other discovery points can be adjusted, and relevant portions of analytics metadata from a discovery point being deleted extracted and merged with other analytics already stored within the other discovery points.
- an adaptive parallel processing engine operates on the change catalog 600 to derive these more complex analytics, including tracking changes and use over time.
- the Rule Engine applies rules 510 to analyze content on the underlying primary data, enabling deeper analytics on stored data.
- a second level dictionary can provide sentiment attributes to an already indexed document. Regular expression processing may be applied to see if a document contains information such as social security or credit card numbers.
- Each rule may have a filter 530 to match content, and an action 540 to take based on results. Rules can be nested, and used to answer user-specific questions.
- Rules are configurable by administrators or system users, allowing dynamic rule creation and combination based on different applicable policies 520 . Rules can be combined in multiple ways to discover more complex information. Rules may also be configured for actions based on results. For example, notifications may be set to trigger based on detected access or content, and different retention policies may be applied based on content or access patterns or other tracked metadata. Other actions may include, but are not limited to, data retention, quarantine, data extraction, deletion, and data distribution. Results of applied rules may be indexed or tracked for future analysis.
- When rules 510 identify results, such results may be indexed or tracked for other analytical use. This additional metadata may be added to the change catalogs for the relevant files or objects. The metadata may also be tracked as custom tags added to objects. Tags may be stored as extended attributes of files, or as metadata tracked in a separate analytics index such as a directory or volume hidden from normal end user view, or in other data stores for analytics. Rules, and therefore analytics, may be applied both to data tracked and to the metadata generated by analytics. This enables analytics of both content and gathered intelligence, allowing point-in-time and over-time analysis. The rules results and actions may serve as feedback from one or more rules to one or more other rules (or even self-feedback to the same rule), enabling multi-stage analysis and workflow processing.
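The filter/action structure of the Rule Engine lends itself to a small composable sketch. The Rule class, run_rules function, and the regex rule for card-like numbers below are only illustrations of the kind of content rule and action mentioned above, not the engine's actual API.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hedged sketch of a rule with a content filter 530 and an action 540;
# results are tracked so they can feed further rules, as described in the text.

@dataclass
class Rule:
    name: str
    matches: Callable[[str], bool]          # filter applied to object content
    action: Callable[[str, str], None]      # action applied when the filter matches

def run_rules(path: str, content: str, rules: List[Rule], results: List[str]):
    for rule in rules:
        if rule.matches(content):
            rule.action(path, rule.name)
            results.append(f"{rule.name}:{path}")   # indexed/tracked for later analysis or feedback

results: List[str] = []
rules = [
    Rule("possible-card-number",
         matches=lambda c: re.search(r"\b\d{4}(?:[ -]?\d{4}){3}\b", c) is not None,
         action=lambda p, n: print(f"quarantine {p} ({n})")),
    Rule("contains-contract",
         matches=lambda c: "contract" in c.lower(),
         action=lambda p, n: print(f"extend retention for {p} ({n})")),
]
run_rules("/shares/finance/q3.txt", "Contract ref 4111 1111 1111 1111", rules, results)
```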
- Recovery Intelligence is the set of analytics implemented by Intelligence Node 120 around data protection. The purpose is to protect data and associated analytics. When data reaches the Intelligence Node a mirrored copy is stored in the intelligence pool, creating redundancy with primary storage, and these changes are tracked for use in discovery point creation. Primary data, discovery points, and intelligence data are preferably separated on actual physical media at the spindle or disk pool level, such that a failure of a single individual physical device is always recoverable. As discovery points are created based on change catalogs tracked at the Intelligence Node, they can be created at any time without any impact on the performance of primary storage. This eliminates a need to schedule time-windows for discovery point creation.
- Each discovery point includes incremental changes from the prior discovery point, including data object changes and the analytics gathered and associated with the data during such changes.
- Intelligent rules can be applied to automate discovery point creation, such that, in addition to manual or time-based creation, discovery point creation may be triggered by content changes.
- Such changes may be percentage based, specific to percentage change of certain identifiable subsets of the entire data pool, based on detected deviations from usage patterns such as increase in frequency of specific accesses, or based on real-time analysis of data content.
- When a discovery point is created, the change catalog accumulating real-time changes is closed.
- the change catalog is then stored within the created discovery point, and a new change catalog created for changes to be associated with a next created discovery point.
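Discovery point creation, as described in the last two bullets, amounts to closing the accumulating change catalog, storing it with a reference to the mirrored changed data, and opening a fresh catalog. A sketch under those assumptions follows; the class and field names are invented.

```python
from datetime import datetime

# Hedged sketch of discovery point creation at the Intelligence Node; names are illustrative.

class IntelligenceNode:
    def __init__(self):
        self.change_catalog = []        # real-time entries since the last discovery point
        self.discovery_points = []

    def record_change(self, entry):
        self.change_catalog.append(entry)

    def create_discovery_point(self, mirrored_data_ref):
        dp = {
            "created": datetime.now(),
            "catalog": self.change_catalog,      # closed-off change catalog
            "data_ref": mirrored_data_ref,       # sparse reference to changed primary data
            "deep_analytics": {},                # filled in later by deeper background analysis
        }
        self.discovery_points.append(dp)
        self.change_catalog = []                 # new catalog for the next interval
        return dp

node = IntelligenceNode()
node.record_change(("alice", "modify", "/vm0/a1.vmdk"))
dp = node.create_discovery_point(mirrored_data_ref="pool130/changes/0001")
```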
- the analytics and data stored within discovery points enable efficient restores, allowing search over multiple discovery points for specific object changes without requiring restoration of the data objects from each discovery point.
- Such search can be based on any analytics performed, such as data tracked in the extended metadata and content-based analysis performed by application of the Rule Engine.
- the tracking further enables indexing and partial restores—for example specific objects, or embedded objects within complex objects, can be restored from a discovery point without a complete restore of all data from that discovery point.
- Data Intelligence is a set of analytics at the Intelligence Node analyzing content.
- Data Intelligence operates through the Rule Engine, and can be applied to unstructured data, for example file metadata such as document properties of Microsoft Office documents or the actual content of such documents, semi-structured data such as log files or specific applications such as Mail programs, structured data such as databases or other formats for which schema may be known or discovered by the system, and recursive containers such as virtual machines, file systems on file systems, file systems on volumes, or archives.
- File systems use internal data structures, called metadata, to manage files, directories and data in files.
- a typical file system uses logging to guarantee crash consistency.
- One of the popular techniques to guarantee crash consistency is a write-ahead log.
- the file system logs the intent of modifications to the log, and then performs the metadata modifications on disk. In case of a panic, power failure, or crash, the log is then replayed to bring the file system back to a consistent state.
- the PART 310 intercepts data access requests, forwards them to a primary node, mirrors them to a high availability restore node, and performs analytics to create intelligence data.
- Each of the primary node 100 and restore node 140 operates with its own independent file system 102, 142 (FS).
- the file systems 102 , 142 may be a ZFS-compatible file system or some other file system.
- A logged transaction includes all the metadata modifications that will be done as part of an I/O. For example, if an I/O operation allocates an indirect block, the log entry in 101 or 141 consists of the newly allocated indirect block, the parent indirect block where the new block will be inserted, an offset in the parent indirect block, the inode associated with the indirect block, and so on.
- the PART 310 maintains its own log 311 independent of the logs 101 , 141 , if any, as maintained by the file systems 102 , 142 in primary 100 and restore 140 nodes.
- This PART-level, “virtual file system” log 311 is implemented in a durable storage medium that can be written to in random order, such as nonvolatile memory. To achieve crash consistency, access requests can be replayed at the primary 100 and restore 140 nodes consistent with the original order in which they were received at the PART 310 .
- The PART log 311 may obviate the need for logs 101, 141, which may then be disabled or bypassed if the file systems 102, 142 allow this.
- any metadata in the PART log 311 is stored with a corresponding transaction ID.
- The transaction IDs are unique numbers maintained by the PART 310 and incremented upon each access request received. Writes to the PART log 311 may therefore be multithreaded, such that they can be written at any time and in any order, with the ordering information retained in the transaction ID associated with each request.
- FIG. 7 shows a typical PART log entry including a transaction ID, an operation type, a file handle, offset, length and data.
- the entries in the PART log 311 are arranged in a number of chunks 301 typically with each chunk being of equal size to other chunks.
- access requests received by the PART 310 may be multithreaded.
- the various chunks 301 in the PART log 311 enable log entries to be written in any order and also concurrently.
- Writes to the random access, high speed PART log 311 do not have to observe any ordering dependencies, yet the ordering can be regenerated when the PART log 311 is replayed to the primary and restore nodes.
- In this example, the PART 310 is executing concurrent threads labeled A1, A2, A3, and B.
- some of the threads are issuing access requests for a data tree structure that is to be populated with engineering data concerning the configuration of a manufactured component.
- Other threads executing in the PART 310 are concerned with processing customer orders for the component.
- A first thread A1 may be responsible for creating the tree while threads A2 and A3 are responsible for writing data to the tree.
- thread B is handling an entirely different operation such as supporting database accesses concerning the customer orders for the component.
- The accesses can be written in any order to the PART log 311. This is because, as previously described, transaction ID numbers are assigned to each access request in the order in which they are received. This enables the transactions to be executed in the correct order in the local file systems 102, 142, even though they may have been originally stored in random order by the multiple threads executing at the PART level 310.
- As each access request is written to the PART log 311, it is forwarded in parallel to each of the primary 100 and restore 140 nodes.
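The ordering argument above (writes can land in the log in any order because the transaction ID, assigned at receipt, reconstructs the order at replay) can be illustrated with a small sketch. The locking scheme, the PartLog class, and its methods are assumptions for illustration, not the patent's implementation.

```python
import itertools
import threading

# Hedged sketch: transaction IDs are assigned in arrival order under a lock,
# log entries are appended by many threads in whatever order they run, and
# replay sorts by transaction ID to regenerate the original order.

class PartLog:
    def __init__(self):
        self._next_id = itertools.count(1)
        self._id_lock = threading.Lock()
        self.entries = []                     # stands in for the random-access chunks 301
        self._log_lock = threading.Lock()

    def submit(self, op, handle, offset, length, data):
        with self._id_lock:
            txid = next(self._next_id)        # arrival order captured here
        with self._log_lock:                  # the append may happen in any interleaving
            self.entries.append((txid, op, handle, offset, length, data))
        return txid

    def replay_in_order(self):
        return sorted(self.entries)           # transaction ID restores the original order

log = PartLog()
threads = [threading.Thread(target=log.submit,
                            args=("write", f"/vm0/a{i}.vmdk", 0, 4, b"data"))
           for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
for entry in log.replay_in_order():
    print(entry)
```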
- the primary 100 and restore 140 nodes then copy the request data to a respective local cache 104 , 144 , but do not yet actually issue the request to their underlying file systems 102 , 142 to access permanent storage such as respective disks 103 , 143 .
- By itself, such behavior by the primary 100 and restore 140 nodes would not provide crash consistency locally within file systems 102, 142, and also would not provide consistency between file systems 102, 142.
- the PART 310 issues a synchronization (“sync”) request to the primary 100 and restore 140 .
- In response, the primary 100 and restore 140 flush their respective cached data to disks 103, 143.
- the primary 100 and restore 140 then acknowledge the sync back to the PART 310 .
- the PART 310 can now free the corresponding chunks 301 in PART log 311 . In other words, it is not until the sync command is complete that data related to the requests is known to be correctly persisted to respective disks in the primary and restore nodes.
- FIG. 8 shows a typical process flow among the PART 310 and primary node 100 . It should be understood that the corresponding operation between the PART 310 and the restore node 140 is similar.
- the PART 310 receives an access request from a host.
- the PART assigns a next available transaction ID to the request.
- the access request is then written to any available chunk in the PART log 311 .
- the request is then sent to both the primary 100 and restore 140 nodes.
- In step 841, the primary 100 receives the request from the PART 310.
- In step 842, if the primary 100 and restore nodes maintain local logs 101, 141, they determine a place for the transaction in their respective log order from the transaction ID (which is known to have been assigned in the order in which the multithreaded PART received the requests).
- Data associated with the request is then stored in the local cache memory 104, 144.
- In state 844, the primary can send an access complete acknowledgment back to the PART 310.
- In state 850, the PART 310 can then report that the access is logically complete even though the data has not yet been flushed to disk at the primary 100.
- FIG. 9 illustrates the process flow between the PART 310 and the primary 100 and restore 140 when the multithreaded log 311 is full or nearly full.
- In state 910, the PART log 311 is recognized as no longer being able (or soon to become unable) to store additional requests.
- In step 911, a sync command is sent from the PART 310 to both the primary 100 and restore 140 nodes.
- In state 920, the primary 100 (or restore node 140) receives the sync command, and in state 922 flushes its local cache to permanent file system (FS) storage such as one or more disk(s). Once the flush operation is complete, in state 923 an acknowledgment is returned to the PART 310.
- In state 930, the PART receives the acknowledgment from the primary 100, and at some point (either prior to, at the same time as, or subsequent to state 930) the PART 310 also receives an acknowledgment from the restore node 140.
- The PART 310 can then finally release the associated chunks 301 in the PART log 311.
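The flush-and-release handshake of FIG. 9 can be summarized as: issue sync to both nodes, wait for both acknowledgments, then free the log chunks. The sketch below is written under those assumptions; the Node class, flush_if_full function, and the limit value are invented for illustration.

```python
# Hedged sketch of the FIG. 9 handshake; node objects and method names are invented.

class Node:
    """Stands in for the primary or restore node with a local cache and disk."""
    def __init__(self, name):
        self.name, self.cache, self.disk = name, [], []
    def apply(self, entry):
        self.cache.append(entry)              # request data held in the local cache until a sync
    def sync(self):
        self.disk.extend(self.cache)          # flush cache to permanent storage
        self.cache.clear()
        return True                           # acknowledge back to the PART

def flush_if_full(part_log, primary, restore, limit=4):
    if len(part_log) < limit:                 # log not yet full: nothing to do
        return False
    acked = primary.sync() and restore.sync() # sync issued to, and acknowledged by, both nodes
    if acked:
        part_log.clear()                      # only now release the log chunks
    return acked

primary, restore = Node("primary"), Node("restore")
part_log = []
for i in range(4):
    entry = ("write", i)
    part_log.append(entry)
    primary.apply(entry)
    restore.apply(entry)
flush_if_full(part_log, primary, restore)
print(len(primary.disk), len(part_log))       # 4 entries persisted, log emptied
```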
- the durable storage used for the PART log 311 is a fast access storage device, such as a solid state device, so that the log file can be sorted in transaction ID order as quickly as possible when it needs to be read back, such as when a fault occurs before data is flushed to disk by both the primary 100 and restore nodes 140 .
- It is typical for a storage system to aggregate write operations in a cache before they are flushed to main storage.
- a data intelligent storage system is implemented with a primary node 100 and high availability/intelligence data stored at restore node 140 .
- A write access request may come into the PART 310 and be recorded in the PART log 311 before being forwarded to the primary 100 file system and the restore 140 file system.
- the primary and restore file systems may maintain their own logs 101 , 141 as previously described.
- Those file system level logs 101, 141 are also copied to a remote disk, such that another remote copy 151 is made of the primary log 101 and another remote copy 105 is made of the restore log 141.
- each single I/O transaction may result in many different write operations to different primary data stores and logs.
- A PART level cache, which we refer to as a write gathering cache 333, is implemented to store data associated with write requests.
- When a write request is received, the associated data is immediately copied to the write gathering cache 333, and the I/O request is acknowledged.
- Certain other operations that involve metadata, such as a make directory (mkdir) operation, are first logged in the PART log 311 and then issued to the primary 100 and restore 140.
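The write-gathering behavior just described (acknowledge the write once the data is in the gathering cache, then flush the gathered writes to both nodes) might look roughly like the following. The WriteGatheringCache and NodeFS names, and the exact flush trigger, are assumptions made for the sketch.

```python
# Hedged sketch of a write gathering cache at the PART level; names are invented.

class WriteGatheringCache:
    def __init__(self):
        self.pending = {}                     # (handle, offset) -> data awaiting flush

    def gather(self, handle, offset, data):
        self.pending[(handle, offset)] = data # copy data in, acknowledge immediately
        return "ack"

    def flush(self, primary, restore):
        for (handle, offset), data in self.pending.items():
            primary.write(handle, offset, data)
            restore.write(handle, offset, data)
        self.pending.clear()

class NodeFS:
    def __init__(self): self.blocks = {}
    def write(self, handle, offset, data): self.blocks[(handle, offset)] = data

cache, primary, restore = WriteGatheringCache(), NodeFS(), NodeFS()
cache.gather("/vm0/a1.vmdk", 0, b"header")    # data write: cached and acknowledged
cache.gather("/vm0/a1.vmdk", 4096, b"body")
cache.flush(primary, restore)                 # later, a single gathered flush to both nodes
```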
- FIG. 11 shows the data intelligence storage system being accessed by one or more applications 1010 .
- the application 1010 is a hypervisor environment such as an ESX or ESXi server (ESX and ESXi are trademarks of VMware, Inc. of Palo Alto, Calif.).
- the application 1010 creates and maintains various virtual machine (VM) files in such an environment on a subdirectory basis.
- The application 1010 expects a first virtual machine (VM0) to be disposed within a first subdirectory (/vm0), which includes associated virtual machine files a0.vmdk, a1.vmdk, etc.
- The files associated with a second virtual machine (b0.vmdk) are to be disposed within a second subdirectory (/vm1), and the files associated with an (n-1)th virtual machine in subdirectory (/vmn).
- The files (k0.vmdk) associated with yet another virtual machine are to be stored in a directory (/vm11) that is subordinate to directory /vm0.
- The ESX server application 1010 may therefore be hosting a number of virtual machines; the data associated with each virtual machine, including its operating system image files, application files, and associated data, is stored in one or more files arranged in a directory tree 1011 within a single file system 1015.
- application 1010 issues access requests to the PART 310 .
- PART 310 not only sends the access request to one or more file systems on primary node 100 , but also sends the access request to the file system(s) on restore node 140 .
- discovery points 1020 may include snapshots of the state of the virtual machine files and their associated data, metadata, other intelligence data, and change catalog.
- A discovery point thus includes one or more snapshots of each VM.
- However, such existing snapshot technologies are instead directed to storing a snapshot of an entire file system. It may be desirable in certain circumstances to enable the use of such snapshot technologies on a single VM.
- the basic idea is for PART 310 to identify particular applications such as ESX server 1010 that create subdirectories, such as those containing virtual machine files, and manage them in a distinct way.
- the PART 310 therefore can more efficiently enable certain actions by intelligence 145 .
- The PART 310 maintains a set of file systems 1050, one for each sub-directory, on the primary 100 and a corresponding set of file systems 1070 on the restore 140.
- What appears to the user application (ESX server 1010) to be an ordinary file system containing ordinary subdirectories is actually a virtual file system 1040 wherein any given subdirectory may be a link to a separate, associated file system that actually contains the .vmdk files for a given VM.
- the PART 310 When these subdirectories are accessed in the virtual file system 1015 , the PART 310 thus transparently redirects those accesses to the associated file system(s) 1050 , 1070 on the primary and restore.
- A make directory (mkdir) command to create VM subdirectory /vm1 is intercepted by the PART 310, which then creates file system v.vm1 (1050-1) on the primary 100 and its mirror v.vm1 (1070-1) on the restore node 140.
- The PART 310 then creates the new file system directory /vm1 in the primary file system 1040, which is a virtual "mount point" linking the subdirectory /vm1 in virtual file system 1015 with its associated actual file system v.vm1 (1050-1, 1070-1). This link is denoted by pointer 1042.
- Similarly, a write access directed to file /vm0/a1.vmdk is intercepted by the PART 310, which, following link 1041, redirects that write access to the file system v.vm0 (1050-0) on primary 100 that actually contains the file a1.vmdk.
- The PART 310 also mirrors write accesses to the restore node 140; in this case, the mirrored write access is directed to the file system v.vm0 (1070-0) on the restore node 140, which actually contains the mirror of a1.vmdk.
- The PART 310 thus maintains the illusion of a subdirectory tree 1011 but actually creates a number of file systems 1050-0, 1050-1, 1050-2, ..., 1050-11, ..., 1050-n on primary 100 and a number of file systems 1070-0, 1070-1, 1070-2, ..., 1070-11, ..., 1070-n on restore 140.
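A condensed sketch of the mapping maintained by the PART: a mkdir in the virtual tree creates a dedicated container file system on each node, and later accesses under that subdirectory are redirected through the link. All identifiers below (NodePool, PartVFS, the "v." naming scheme) are assumptions for illustration only.

```python
import posixpath

# Hedged sketch of PART virtual subdirectory management; identifiers are invented.

class NodePool:
    """Stands in for primary 100 or restore 140 holding per-subdirectory file systems."""
    def __init__(self):
        self.filesystems = {}                          # e.g. "v.vm0" -> {relative path: data}
    def create_fs(self, name):
        self.filesystems[name] = {}
    def write(self, fs_name, rel_path, data):
        self.filesystems[fs_name][rel_path] = data

class PartVFS:
    def __init__(self, primary, restore):
        self.primary, self.restore = primary, restore
        self.mounts = {}                               # virtual subdirectory -> container fs name

    def mkdir(self, virtual_dir):
        fs_name = "v." + virtual_dir.strip("/").replace("/", ".")
        self.primary.create_fs(fs_name)                # container file system on primary
        self.restore.create_fs(fs_name)                # mirrored container file system on restore
        self.mounts[virtual_dir] = fs_name             # the link (virtual mount point)

    def write(self, virtual_path, data):
        subdir, rel = posixpath.split(virtual_path)
        fs_name = self.mounts[subdir]                  # follow the link for this subdirectory
        self.primary.write(fs_name, rel, data)
        self.restore.write(fs_name, rel, data)         # mirror the access

vfs = PartVFS(NodePool(), NodePool())
vfs.mkdir("/vm0")                                      # intercepted mkdir from the hypervisor
vfs.write("/vm0/a1.vmdk", b"disk image")               # redirected to container fs v.vm0
```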
- the snapshot processes running as part of intelligence 145 can be executed using the standard file system oriented snapshot process but using the virtual mount point information to locate the underlying filesystems 1070 associated with a given subdirectory.
- the virtual filesystem (VFS) 1090 hides the existence of multiple independent, “container file systems” from user application 1010 .
- Subdirectories in the virtual file system (VFS) 1090 are accessible as subdirectories, but at the same time the underlying container file systems 1070 are accessible to the snapshot processes.
- Associated file system snapshot technology in the restore node 140 can now be relied upon to obtain snapshots of a given VM independently of snapshots of other VMs, and the restore node 140 can thus treat them as it treats any file system.
- the details of the virtual file system 1090 framework are hidden from the ESX server 1010 .
- directory-associated access requests made to the system, such as a make directory (mkdir), remove directory (rmdir), or change directory (chdir) command, are intercepted by the PART 310 .
- Upon receiving a mkdir command, the PART 310 layer issues requests to primary 100 and restore 140 to create the file system constructs v.vm 0 ( 1050 - 0 , 1070 - 0 ), v.vm 1 ( 1050 - 1 , 1070 - 1 ), . . . , v.vm 11 ( 1050 - k , 1070 - k ), . . . , v.vmn ( 1050 - n , 1070 - n ) in the VFS 1090 and associates them with links ( 1041 , 1042 , 1043 ) to the actual virtual machine files and subdirectories as expected by the ESX server 1010 .
- Similarly, upon receiving an rmdir or chdir command, the corresponding v.vm structure(s) can be removed from or edited within the VFS 1090 data structure.
- This shadow virtual file system 1090 is, in effect, created and maintained via processes internal to the PART 310 .
- the virtual mount points in VFS 1090 are thus accessed by the snapshot processes 1020 - 1 , 1020 - 2 , but the subdirectory structures remain in place for other I/O requests as received from the ESX 1010 .
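- The following is a minimal, hypothetical sketch (in Python) of the virtual-mount-point bookkeeping described above; the class and method names (ContainerFS, VirtualMountTable, and so on) are illustrative assumptions, not part of the disclosed system.

```python
# Hypothetical sketch of per-subdirectory container file systems behind a
# virtual file system, as described above. All names are illustrative.

class ContainerFS:
    """Stands in for one per-VM container file system (e.g. v.vm0)."""
    def __init__(self, name):
        self.name = name
        self.files = {}                 # relative path -> data

class VirtualMountTable:
    """Maps application-visible subdirectories to container file system pairs."""
    def __init__(self):
        self.mounts = {}                # "/vm0" -> (primary FS, restore FS)

    def mkdir(self, subdir):
        # Intercepted mkdir: create paired container file systems on the
        # primary and the restore, then record the virtual mount point (link).
        name = "v." + subdir.strip("/")
        self.mounts[subdir] = (ContainerFS(name), ContainerFS(name))

    def rmdir(self, subdir):
        # Intercepted rmdir: drop the virtual mount point.
        self.mounts.pop(subdir, None)

    def write(self, path, data):
        # Redirect "/vm0/a1.vmdk" to the container file systems for "/vm0",
        # mirroring the write to both the primary and the restore copies.
        subdir, _, rest = path.lstrip("/").partition("/")
        primary_fs, restore_fs = self.mounts["/" + subdir]
        primary_fs.files[rest] = data
        restore_fs.files[rest] = data

# Example: the hypervisor's directory tree looks ordinary to the application,
# but each VM subdirectory is backed by its own snapshot-capable file system.
table = VirtualMountTable()
table.mkdir("/vm0")
table.write("/vm0/a1.vmdk", b"...disk image bytes...")
```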
- FIG. 12 illustrates a more general case where restore node 140 applies a storage related action 1120 (such as snapshot, de-duplicate, compression, or some other storage related action) to portions of a file system (such as a subdirectory 1016 ) rather than an entire file system 1015 created by application 1010 .
- a virtual file system (VFS) layer 1090 in the PART 310 is again used to provide links between file systems and the subdirectories 1011 , 1012 associated with these file systems.
- the application 1010 may wish to apply a de-duplicate action only to a particular sub-sub-directory 1016 .
- a “no de-duplicate” property can be maintained in the VFS 1090 by PART 310 for that subdirectory, because it is a virtual mount point linking to an actual underlying file system 1070 - k upon which the property can be applied.
- the PART 310 may apply properties to the virtual machine subdirectories in consistency groups. So, for example, the VFS 1090 maintained by PART 310 may further indicate that the virtual subdirectories ( 1070 - 0 , 1070 - 1 ) for two of the VMs (such as /vm 0 and /vm 1 ) are to always be treated together and subjected to the same snapshot policies.
- the VFS 1090 may be exposed to analytics running in the restore node 140 . The results of those analytics can then be used to determine how to further refine the directory structure(s) 1050 and/or 1070 .
- the intelligence process 145 in the restore node may detect that the application 1010 accesses files in a particular way, indicating a need to apply a certain scope to a given action.
- the intelligence 145 may determine that a certain virtual machine contains a sub-sub-directory /vm 5 holding a database (a.005.db) whose contents are known to change often and thus will not benefit from compression. The intelligence can therefore maintain a compression property when accessing the virtual directory structures in VFS 1090 to exclude sub-sub-directory /vm 5 from any compression action.
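- As a purely illustrative sketch (not taken from the disclosure), the per-subdirectory properties and consistency groups discussed above could be kept as a simple table alongside the virtual mount points; all names below are assumptions.

```python
# Hypothetical per-subdirectory property table maintained alongside the VFS.
vfs_properties = {
    "/vm0": {"snapshot_group": "g1"},        # /vm0 and /vm1 always snapshotted together
    "/vm1": {"snapshot_group": "g1"},
    "/vm5": {"compress": False},             # frequently changing database; skip compression
}

def members_of_group(group):
    """Return the subdirectories belonging to a consistency group."""
    return [d for d, props in vfs_properties.items()
            if props.get("snapshot_group") == group]

def compression_allowed(subdir):
    """Check whether a storage action such as compression may be applied."""
    return vfs_properties.get(subdir, {}).get("compress", True)

print(members_of_group("g1"))        # ['/vm0', '/vm1']
print(compression_allowed("/vm5"))   # False
```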
- a user may wish to create a clone of a prior snapshot that contains prior obtained intelligence data, to perform still further analytics.
- the clone may be needed to recover a failed file system.
- the user may simply wish to copy an object from one file system to another without destroying the original object.
- a primary node 100 stores primary data and restore node 140 stores associated intelligence data and other information.
- the object 1215 to be cloned may reside within snapshot (“snap”) 1210 ; this object 1215 may be a file, a directory or even the entire file system within the snapshot 1210 .
- when a clone request 1205 is made to the PART 310 , it does not simply forward the request to the file systems 102 , 142 on the primary 100 and restore 140 nodes. Instead, a new file system clone object 1220 is immediately thin provisioned on the primary node 100 and exposed by the PART 310 to the user (host), and a new file system clone object 1230 is also thin provisioned on the restore node 140 (which is consistent with the system's usual process of mirroring requests to the restore node 140 , although that step is not critical to handling the clone request as described herein).
- the PART 310 therefore does not have to first actually populate all of the metadata and data to the new file system clone objects 1220 , 1230 . Even without all of the metadata and data actually being populated, users can start to issue requests to access the cloned file system objects 1220 , 1230 .
- the PART level 310 thus coordinates execution of a clone process rather than pushing the clone process down to an FS 102 , 142 or other file system layer in the primary 100 and restore 140 nodes.
- the file object 1300 to be cloned consists of a metadata portion 1310 and a data portion 1320 .
- the particular structure of metadata 1310 depends on the type of underlying file system, and will be different for a Linux file system than, say, a Windows or Mac OS X file system, as is known in the art.
- the data portion 1320 can be considered to be a collection of data bytes of a certain size, such as chunks 1308 each of 512 kilobytes (kB).
- a clone file process executed in PART 310 maintains a clone bitmap 1350 for each such cloned file object 1300 .
- the clone bitmap 1350 includes at least a single bit 1355 for each chunk 1308 in the file object 1300 .
- the clone bitmap 1350 is used as a tool to coordinate the status of actually copying portions of the data 1320 from the original file to the cloned object.
- FIG. 15 is an example flow for a clone process 1400 where the cloned object is a single file.
- a request is received at the PART 310 to create the cloned object.
- a new clone object 1220 is created on the primary node 100 and a new clone object 1230 is created on the restore node 140 , but only thin provisioned, without actually copying any data yet.
- the thin provisioned file objects at this point may contain some metadata in state 1406 , depending upon whether or not the underlying file system maintains metadata within the file object itself (certain operating systems such as Windows and Mac OS X do this; other operating systems such as Linux maintain file metadata as part of a separate inode).
- the bitmap 1350 is created for the file with all bits 1355 therein set to a logical “false” value, indicating that the corresponding chunk data has not yet been populated to the clones.
- in state 1410 the new clone file objects 1220 , 1230 are made accessible for subsequent input/output (I/O) requests by the user even though no data portion 1320 has yet been copied from the source snap 1210 .
- a background live restore thread 1420 is started.
- the live restore thread 1420 may typically be a background thread executing only when the PART 310 is otherwise not executing other tasks. In other instances, the live restore thread 1420 may be a thread with low priority (keeping in mind that the PART 310 is a multi-threaded processor as described above).
- the purpose of the live restore thread 1420 is to perform the task of copying data from the source snap 1210 to the clones 1220 , 1230 .
- a next chunk is located.
- the next chunk is copied from the source snap 1210 to the clones 1220 , 1230 .
- the bit in the bitmap associated with that chunk is then set to a logical “true” value.
- the live restore process then continues as a background/low priority process as long as and until all chunks of the file have been copied to the clone.
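- A minimal sketch of a chunk bitmap and background copy loop of the kind described above is shown below (Python); the chunk size, class name, and thread mechanics are illustrative assumptions rather than the actual implementation.

```python
import threading

CHUNK_SIZE = 512 * 1024          # illustrative 512 kB chunks

class ThinClone:
    """Thin-provisioned clone of a source file, populated lazily."""
    def __init__(self, source_chunks):
        self.source = source_chunks                  # chunks of the source snap
        self.chunks = [None] * len(source_chunks)    # clone data, initially empty
        self.bitmap = [False] * len(source_chunks)   # False = not yet populated
        self.lock = threading.Lock()

    def populate(self, index):
        # Copy one chunk from the source snap to the clone and mark it done.
        with self.lock:
            if not self.bitmap[index]:
                self.chunks[index] = self.source[index]
                self.bitmap[index] = True

    def live_restore(self):
        # Background/low-priority thread: copy remaining chunks until all
        # bits in the bitmap are set.
        for index in range(len(self.source)):
            self.populate(index)

# The clone is usable immediately; the live restore runs in the background.
clone = ThinClone(source_chunks=[b"A" * CHUNK_SIZE, b"B" * CHUNK_SIZE])
threading.Thread(target=clone.live_restore, daemon=True).start()
```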
- FIG. 16 shows a typical process 1500 performed when the PART 310 receives an access request for a previously thin-provisioned clone object.
- in state 1502 the clone access request is received.
- the clone bitmap 1350 associated with the clone object is consulted. If the bit or bits associated with the chunks accessed in the request are all set to “true”, then that is an indication that the data has already been copied to clones 1220 , 1230 . Processing may proceed to step 1505 where the access request can be handled as per state 1510 .
- if the access request refers to one or more chunks 1308 for which the bitmap 1350 indicates a “false” value, and which thus have not been previously processed,
- the bitmap is updated to set those bits to “true”.
- data and possible metadata affecting chunks within the scope of the request are then populated to clones 1220 , 1230 .
- the access request is then further handled.
- the access request to the clone may be issued to both the primary 100 and restore 140 nodes.
- the access request may also typically be issued to both the primary 100 and restore 140 nodes by the PART 310 using the multithreaded log process described above. This then results in duplicate copies of the clone 1220 , 1230 eventually being instantiated on the primary 100 and restore nodes 140 once data is flushed from the caches.
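- The on-demand population path just described might be sketched as follows (illustrative only; the function and parameter names are assumptions): consult the bitmap, populate any unpopulated chunks touched by the request from the source snap, then service the request from the clone.

```python
def handle_clone_access(bitmap, clone_chunks, source_chunks, first, last):
    """Service an access that touches chunks [first, last] of a thin clone.

    Illustrative only: bitmap marks which chunks were already copied; any
    chunk still marked False is populated from the source snap on demand.
    """
    for index in range(first, last + 1):
        if not bitmap[index]:
            clone_chunks[index] = source_chunks[index]   # copy on demand
            bitmap[index] = True                         # record it in the bitmap
    # The request can now be completed against the clone's own chunks.
    return clone_chunks[first:last + 1]

source = [b"A", b"B", b"C"]
clone = [None, None, None]
bits = [False, False, False]
print(handle_clone_access(bits, clone, source, 1, 2))    # populates chunks 1-2 only
```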
- a different process is executed when the object 1210 to be cloned is an object that includes more than one file, such as a directory or even an entire file system.
- Such an object as shown in FIG. 17 , may be represented as a tree or graph structure consisting of nodes and various levels with edges connecting the nodes.
- the nodes consist of values, such as metadata defining the content of an associated directory, together with a list of references to child nodes and parent nodes that contain metadata for sub-directories and parent directories.
- the snap to be cloned is a root directory with several subdirectories and files stored within those subdirectories.
- the data structure 1600 thus consists of a root node 1610 , and a first level 1620 consisting of four nodes representing four subdirectories 1621 , 1622 , 1623 and 1624 .
- Each subdirectory contains pointers or other metadata concerning the files contained within, as is known in the art.
- the structure also includes subdirectories at a second level 1630 , with further subdirectories 1631 and 1632 and files. Still further subdirectories are located at a third level with respective files.
- the initial task when asked to clone such an object is to create a copy of the directory tree structure in the clone 1650 in a particular way. That process 1700 is shown in more detail in FIG. 18 and begins when a “clone directory” request is initially received in state 1702 . In a step 1704 the PART 310 thin provisions the clone directory such as by only creating a copy 1660 of the root node 1610 . In the next step 1706 metadata associated with the root node 1610 would also be copied as may be required by the particular type of file system (in the case of a Linux-compatible file system, that may include copying the inode for the directory). In state 1708 the clone object is then made available for user I/O.
- a background and/or low priority live restore 1720 thread is kicked off for the directory object.
- processing may continue with the clone appearing to be available for access by the user but without any data and without even the entire tree structure having actually been propagated to the clone yet.
- the live restore process 1720 for a directory object begins in state 1722 .
- the directory tree for the original snap 1600 is walked in a depth first search.
- the Depth First Search (DFS) from node 1610 would first locate node 1621 for processing, then node 1631 and then node 1641 , before returning to level two and node 1632 , and so forth.
- in step 1723 the node that has been located in the depth first search is then added to the clone tree.
- in state 1724 another background thread is also started concurrently with thread 1720 .
- the background live restore process 1720 continues to determine if the depth first search locates any additional nodes, and if so, processing loops back to step 1723 to process the new node. If no new nodes are found in step 1725 , then in state 1726 a background data restore thread (such as that described in connection with thread 1420 in FIG. 15 ) can then be triggered to restore data for the files referenced in the now cloned directory tree.
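- A compact, hypothetical illustration of the directory-clone live restore described above: thin provision only the root, then walk the source tree depth first, adding each located node to the clone before any per-file data restore begins. The node and function names are assumptions, not the disclosed implementation.

```python
class DirNode:
    """A directory-tree node holding metadata and child references."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def clone_directory(source_root):
    # Thin provision: copy only the root node so the clone can be exposed
    # for user I/O immediately.
    clone_root = DirNode(source_root.name)

    # Background live restore: depth-first walk of the source tree, adding
    # each located node to the clone tree.
    def dfs(src, dst):
        for child in src.children:
            copy = DirNode(child.name)
            dst.children.append(copy)
            dfs(child, copy)

    dfs(source_root, clone_root)
    # A per-file data restore pass (chunk copies) would be triggered here.
    return clone_root

tree = DirNode("/", [DirNode("vm0", [DirNode("a1.vmdk")]), DirNode("vm1")])
print(len(clone_directory(tree).children))   # 2
```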
- However, the background live restore thread need not be executed for a temporary clone; similarly, the temporary clone need not necessarily recreate all of the data, metadata, and directory/subdirectory trees for which the user only requests read access. Thus it is only when a user wishes to perform a read-modify-write that the corresponding chunk(s) need to be fetched from the original snap, modified, and then only those chunk(s) written to the clone.
- FIG. 19 illustrates a typical temporary clone access process 1800 in more detail. It is understood that before this process 1800 is executed, a temporary clone structure has been created such as per the process 1500 in FIG. 16 .
- an access request to the temporary clone is received.
- a determination is made as to whether or not the access request is a read or a write.
- if the access request is a read, and if the corresponding bits in the bitmap are set to a logic false (indicating that there has been no prior write access to those chunks of the temporary clone), then the access request can be serviced in state 1806 from the original snap data structure 1210 .
- if the access request is a read, and if the corresponding bits in the bitmap are set to a logic true (indicating that there has been a prior write access to those chunks), then the access request can be serviced in state 1807 from the clone structure 1230 .
- if the access request is a write, process 1800 proceeds to state 1808 with bits now being set in the bitmap.
- in state 1810 , data (and metadata if needed) within the scope of the request are populated to the clone 1230 .
- in state 1812 , the PART 310 finishes the write request. As before, this may be performed via the multithreaded log process in the PART 310 .
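- The temporary-clone dispatch just described can be summarized by the following hypothetical helper (illustrative names only): reads of untouched chunks come from the original snap, reads of previously written chunks come from the clone, and writes populate the clone and set the corresponding bits.

```python
def access_temporary_clone(op, index, data, bitmap, snap_chunks, clone_chunks):
    """Service one chunk-sized access to a temporary clone (illustrative)."""
    if op == "read":
        # Unwritten chunks are served from the original snap; written ones
        # are served from the clone structure.
        return clone_chunks[index] if bitmap[index] else snap_chunks[index]
    elif op == "write":
        clone_chunks[index] = data     # populate the clone within the request's scope
        bitmap[index] = True           # record the write in the bitmap
        return data
    raise ValueError("unknown operation")

snap, clone, bits = [b"old0", b"old1"], [None, None], [False, False]
print(access_temporary_clone("read", 0, None, bits, snap, clone))   # b'old0' (from snap)
access_temporary_clone("write", 1, b"new1", bits, snap, clone)
print(access_temporary_clone("read", 1, None, bits, snap, clone))   # b'new1' (from clone)
```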
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Human Computer Interaction (AREA)
- Computer Security & Cryptography (AREA)
Abstract
Description
- This patent application claims priority to U.S. Provisional Patent Application Ser. No. 62/038,498 filed Aug. 18, 2014. This patent application also generally relates to co-pending U.S. utility patent application Ser. No. 14/017,754 filed Sep. 4, 2013 titled “SYSTEM AND METHOD OF DATA INTELLIGENT STORAGE”, U.S. utility patent application Ser. No. 14/157,974 filed Jan. 14, 2014 entitled “LIVE RESTORE FOR DATA INTELLIGENT STORAGE SYSTEM” and U.S. utility patent application Ser. No. 14/203,871 filed Mar. 11, 2014 entitled “CONSOLIDATING ANALYTICS METADATA”. The entire contents of each of the above-referenced co-pending patent applications are hereby incorporated by reference.
- Discussed herein are techniques applicable for a High Availability (HA) storage system that collects analytics while also protecting data on separate physical media. The analytics may enable other functions such as data intelligence. In such a system as described in the referenced patent applications, primary data is read from and written to a primary storage pool. As the data is written to the primary pool it is automatically mirrored and also tracked for data protection to a recovery pool. The mirror can also be used for intelligence including analytics stored as discovery points.
- More particularly, the techniques disclosed herein relate to a system that merges primary data storage, data protection, and intelligence into a single unified system. The unified system provides primary and restore data, analytics, and analytics-based data protection without requiring separate solutions for each aspect. Intelligence is provided through inline data analytics, with additional data intelligence and analytics gathered on protected data and prior analytics, and stored in discovery points, all without impacting performance of primary storage.
- More particularly, the disclosed system implements:
- multi-threaded log writes across primary and restore nodes;
- nested virtual machine directories, where subdirectories are associated with a virtual structure that corresponds to a file system for snapshot purposes;
- file system clone available on demand with background metadata and data migration; and/or
- write gathering across file systems/nodes.
- In one embodiment, multi-threaded log writes are implemented at a protection and analytics (PART) node. The PART node receives access requests from multiple concurrently executing threads, and assigns a transaction identifier (ID) to each access request. The PART then collects the access requests in a random access, multithreaded log before sending them to both a primary and a restore storage system. Subsequently, the PART forwards the access requests from the PART node to the primary node and restore node.
- The PART may further optionally determine when the number of access requests in the random access, multithreaded log reaches a predetermined number. At that time, the PART issues a synchronization command to the primary and restore nodes, which causes data to be flushed from respective temporary caches to a persistent file system in each of the primary and restore nodes. Once data is confirmed as having been flushed in both the primary and restore nodes, the PART may then release entries in the random access, multithreaded log.
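- The multi-threaded log behavior summarized above might be sketched roughly as follows (Python, purely illustrative): each access request receives a monotonically increasing transaction ID, entries may land in the log in any order, and a synchronization issued at a threshold releases entries only once both nodes acknowledge a flush. The class and function names are assumptions.

```python
import itertools
import threading

class PartLog:
    """Illustrative random-access, multi-threaded log keyed by transaction ID."""
    def __init__(self, sync_threshold=4):
        self.next_id = itertools.count(1)      # monotonically increasing transaction IDs
        self.entries = {}                      # txn_id -> request (any insertion order)
        self.lock = threading.Lock()
        self.sync_threshold = sync_threshold

    def record(self, request):
        txn = next(self.next_id)
        with self.lock:
            self.entries[txn] = request        # writes may arrive in any order
            if len(self.entries) >= self.sync_threshold:
                self.sync()
        return txn

    def sync(self):
        # Ask both nodes to flush their caches; release log entries only after
        # both acknowledge that data has been persisted.
        if flush("primary") and flush("restore"):
            self.entries.clear()

    def replay_order(self):
        # On recovery, entries are replayed in transaction-ID order.
        return [self.entries[t] for t in sorted(self.entries)]

def flush(node):
    print(f"flush acknowledged by {node}")
    return True

log = PartLog()
for req in ("mkdir /vm0", "write /vm0/a1.vmdk", "write /vm1/b.vmdk", "rmdir /tmp"):
    log.record(req)
```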
- Another aspect is particularly useful where the system is to support snapshot and other actions applied to virtual machine definition files arranged in subdirectories. Here the PART maintains a set of file system level objects, one for each subdirectory in a directory tree created by an application, such as a hypervisor. The PART intercepts a make directory request from the application to store a new file system level object for each subdirectory in the tree. The file system level object contains access information for the corresponding subdirectory, such that multiple make directory requests result in storing a corresponding multiple number of file system level objects as a virtual file system.
- Subsequently received access requests for applying a file system action to a file system object located within a subdirectory are then serviced by the primary and restore nodes using only the virtual file system level object information and not the subdirectory directly. This ensures that the virtual file system objects remain transparent to the application.
- In this arrangement, a property may be associated with two or more virtual file system objects to indicate that an access request applies to two or more subdirectories as a consistency group.
- In another embodiment, the data-intelligent storage system intercepts a request to clone a data object. A clone object is first thin provisioned and opened for access. Data is copied to the clone object only upon the first to occur of either (a) a subsequent access request for the clone object, or (b) as part of a background restore process. Thin provisioning may involve creating a bitmap data object containing a bit for each one of a plurality of data chunks in the data object.
- Bits may be set in the bitmap corresponding to data chunks referred to in the subsequent access request for the clone. In such an instance, the bitmap is updated as data chunks are copied to the clone object.
- A separate process for handling temporary clone objects uses the bitmaps to determine when to access the original object, the clone, or a snapshot.
- In the drawings, closely related figures and items have the same number but different alphabetic suffixes. Processes, states, statuses, and databases are named for their respective functions.
- FIG. 1 is a diagram showing interaction between a Primary Node, Intelligence Node, and Remote Intelligence Node, and connected storage pools.
- FIG. 2 is a view of an appliance device containing Primary and Intelligence Nodes.
- FIG. 3 is a diagram showing the components of a Primary Node.
- FIG. 4 is a diagram showing the components of an Intelligence Node.
- FIG. 5 is a diagram showing the analytics flow process.
- FIG. 6A is a diagram showing the structure of a change catalog.
- FIG. 6B shows a discovery point.
- FIG. 7 is a diagram illustrating a multi-threaded log.
- FIG. 8 shows a process flow for handling an access request at the PART.
- FIG. 9 shows a process flow for synchronizing requests to release entries in the multi-threaded log.
- FIG. 10 shows write access gathering at the PART.
- FIG. 11 shows how virtual file system objects representing virtual machine subdirectories created by a hypervisor application can be submitted to a snapshot process.
- FIG. 12 shows a more general case where an action is applied in a restore node across subdirectory trees and file objects stored within those directories.
- FIG. 13 is a high level diagram of a system that provides clone on demand with background migration of data and metadata.
- FIG. 14 illustrates a file object and corresponding clone bitmap.
- FIG. 15 is a process flow for creating a clone of a file object.
- FIG. 16 is a process flow for accessing a cloned file object.
- FIG. 17 shows a directory tree object and its corresponding clone.
- FIG. 18 is a process flow for creating a clone of a directory object.
- FIG. 19 is a process flow for accessing a temporary clone directory object.
- The terminology and definitions of the prior art are not necessarily consistent with the terminology and definitions used herein. Where there is a conflict, the following definitions apply.
- Primary Storage: networked storage accessible to multiple computers/workstations. The storage can be accessed via any networked device, either as files or blocks. Unless explicitly stated, “primary storage” refers to both blocks and files.
- Intelligence Storage: secondary storage containing gathered intelligence, discovery points, and a redundant real-time copy of files and block data contained in Primary Storage.
- Primary Node: includes access protocols to communicate with an Intelligence Node, Remote Sites, and Expansion Nodes; access protocols layer (for example, NFS, SMB, iSCSI); protection and analytics in real-time (“PART”) layer; file and block storage layer (file system, block volume); and connection to storage devices (RAID, DISK, etc.). A Primary Node appears to system users as Primary Storage, and provides an interface and controls to act as the access to Intelligence Storage.
- Intelligence Node: includes access protocols to communicate with a Primary Node, Remote Sites, and Expansion Nodes; data intelligence storage layer (intelligent data services & rules processing); file and block storage layer (file system, block volume); and connection to storage devices (RAID, long-term storage). In the preferred embodiment, intelligence node data is accessed by users through a Primary Node, but in alternate embodiments Intelligence Nodes may be directly accessed by users.
- Discovery Point: A discovery point, created from a mirrored (high availability) copy of primary data, contains data analytics for accessed and changed primary data since a prior discovery point. A discovery point may contain the changed data, providing for a virtually full but physically sparse copy of the primary data captured at a user-specified point in time or dynamically based on change rate or other analytics. While primary data does not change within a discovery point after the discovery point was created, analytics metadata stored in a discovery point can be expanded as deeper levels of user data analysis are performed and more analytics are gathered. Tracked primary data changes can be retained for the life of the discovery point or can be removed at scheduled or dynamic intervals, such as after deep data analysis is complete and desired analytics metadata is obtained. Removing primary data allows for more efficient space utilization, while retaining primary data enables point-in-time recovery of that version of data.
- Change Catalog: an ordered set of real-time access and change information related to a data object, tracked at a discovery point granularity. A change catalog tracks who, how, when, and where aspects of a data object being accessed and/or modified. There is one change catalog for every discovery point.
- Remote Site: one or more off-site nodes in communication with local site primary or intelligence nodes.
- Pool: the collection of data storage connected to a node.
- Object: a file, directory, share, volume, region within a volume, or an embedded object. Objects can be complex, containing other embedded objects. For example, a file can be a container containing other files, or a volume can have a file system on top of it which in turn contains files. The system is capable of recognizing complex objects and tracking changes at finer embedded object granularity.
- Selective Restore: an automatic (policy based) or manual (customer initiated) restore at an object level.
- Site Restore: a manually initiated process to recreate primary or intelligence pool content using a previously protected version of the data being restored.
- Container: objects which may have other embedded objects, such as a file, directory, file system, or volume.
- Expansion Nodes: appliance having a processor, memory (RAM), network connectivity, and storage devices, and connected to one or more primary or intelligence nodes scaling the processing power and/or storage for connected nodes.
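- As a hedged illustration of the Change Catalog concept defined above (the who/how/when/where tracking of each access or modification), an entry could be modeled roughly as follows; the field names are assumptions for illustration only, not the disclosed format.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ChangeCatalogEntry:
    """One real-time access/change record tracked at discovery-point granularity."""
    object_path: str          # where: the object being accessed
    actor: str                # who: user performing the operation
    operation: str            # how: read, modify, create, delete, rename, ...
    timestamp: datetime       # when the operation occurred
    permissions: str = ""     # optional security information for the object

@dataclass
class ChangeCatalog:
    """Ordered set of entries accumulated since the last discovery point."""
    entries: list = field(default_factory=list)

    def add(self, entry: ChangeCatalogEntry):
        self.entries.append(entry)

catalog = ChangeCatalog()
catalog.add(ChangeCatalogEntry("/vm0/a1.vmdk", "alice", "modify", datetime.now(), "rw-"))
```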
- In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be used, and structural changes may be made without departing from the scope of the present invention as defined by the claims.
- The disclosed high availability (HA) storage system provides primary storage, analytics, and live restore functions. Live restore is a technique used to optimize data restoration. It can be used to recover user data in case of a failure or to recover previous versions of the user data. The system provides primary storage access as block and/or file level storage while avoiding single points of failure. The system collects analytics in real-time while also protecting data in real-time on separate physical media, and includes options for off-site data protection. The system implements deep analytics enabling restore, storage, and data intelligence, and protects both customer data and associated analytics. The system provides traditional file based and custom API methods for extracting analytics metadata. The system employs Live Restore techniques at a file and at a block level to recover in case of a failure or to recover a previous version of user data. This provides for near-instantaneous restore at the object level, and significantly reduces wait-before-access time in case of primary or intelligence node complete failure (e.g., a full site restore). A file or block level Live Restore uses previously gathered analytics to prioritize data to be restored, while allowing user I/O access to the data during restoration.
- Referring to
FIG. 1 ,Primary Node 100 of the system connects within a network to provide block and/or file level storage access to connected computing devices (not shown), real-time data protection, and real-time analytics of primary data. Primary data is read from and written toprimary storage pool 110. The data can be written or read as files or blocks depending on the access protocol being used. As the data is written it is automatically mirrored and tracked for data protection as part of a HA process for the primary node. The mirrored cache of the data is created forIntelligence Node 120. The Intelligence Node enables data protection, analytics, and recovery. The Intelligence Node stores a real-time copy of primary data, analytics and discovery points withinintelligence pool 130. Discovery points are automatically or manually created at any point by the Intelligence Node, and based on fine grained change data enabling action to be taken immediately with no need to copy the underlying primary data or do any post processing to determine what has changed since any prior discovery point. - In a preferred embodiment, each Node is capable as acting as either a Primary Node, an Intelligence Node, or both. For reliability and performance reasons, separate Primary and Intelligence Nodes are desirable. In case of failure of either node, the other may take over operation of both. Implementation without dual-capability (that is, operating solely a Primary Node and solely an Intelligence Node) is possible but loss of service (to either primary or intelligence storage) would occur on failure of such a node. In a preferred embodiment, each one of the Nodes has a processor and local memory for storing and executing Node software, a connection to physical storage media, and one or more network connections including at least a dedicated high bandwidth and low latency communication path to other Nodes.
- In a preferred embodiment, the Primary Node and Intelligence Node are physically housed within a single device, creating a user impression of a single appliance.
FIG. 2 shows one such example, withPrimary Node 100 andIntelligence Node 120 housed together to appear as a single physical appliance. Implementation may be with any number of disks, for example such as a four rack units (4U) housing containing up to twenty-four hard drives, with separate physical storage devices connected to the system. Internally each node is completely separated from the other with the exception of a backplane, with each node having a dedicated (not shared) power supply, processor, memory, network connection, operating media and optionally non-volatile memory. Separation enables continued operation, for example the Intelligence Node may continue operating should the Primary Node fail, and vice versa, but shared resource implementation is also possible. - Also referring to
FIG. 3 , a node actively operating asPrimary Node 100 operates storageprotocol server software 300, for example Common Internet File System (CIFS), Network File System (NFS), Server Message Block (SMB), or Internet Small Computer System Interface (iSCSI), so the Primary Node will appear as primary storage to network-connected computer devices. The storage protocol server software also communicates with a protection and analytics in real-time process (PART) 310 which intercepts and takes action on every data access. - The
PART 310 performs three main roles after intercepting any data access request: mirroring primary data for HA, gathering in-line data analytics on primary data, and storing primary data. The examples explained herein are directed to a file access perspective, but the PART can similarly process block level accesses. When performing block access to a volume, the PART can identify embedded objects and perform the same analysis that is applied to file-level accesses. Intercepted access requests include read, modify (write data or alter attributes, such as renaming, moving, or changing permissions), create, and delete. The PART tracks and mirrors the request (and data) to the Intelligence Node. Communication with the Intelligence Node is through synchronous or asynchronous inter-process communication (IPC) 340 depending on configuration. IPC may including any suitable protocols or connections, such as Remote Procedure Call (RPC) or a Board-to-Board (B2B) high performance, low latency communication path that may be hardware specific. Any data included with a data access request, such as included in write operations, is also mirrored to the Intelligence Node as part of HA system operation. This mirroring establishes data protection through real-time redundancy of primary storage. Additionally, the PART executes in-line analysis of primary data, gathering real-time analytics. The PART sends gathered real-time analytics to the Intelligence Node, where the analytics are added to a change catalog maintained by the Intelligence Node. In addition to analytics, the PART directs the request to an actual file system, for example Fourth Extended File System (EXT4) or Z File System (ZFS), or block volume for file orblock storage access 330 to physical storage devices. - The storage access function 330 (be it file system level or block level) performs the access request on storage media, and returns the result to the PART for return to the requesting system. In a preferred embodiment, the storage media includes disks attached to the system, but other storage media solutions are possible.
- In a preferred embodiment, the Primary Node also includes the software necessary to operate as an Intelligence Node in case of Intelligence Node failure.
- In a preferred embodiment, the Primary Node also operates management software. Preferably accessed through a browser interface (although any user interface provision method may be used), the management software provides system administrators access to configure and manage system users and access discovery points for the restore process.
- Referring also to
FIG. 4 , a node actively operating asIntelligence Node 120 operates Inter Process Communication (IPC)communication software 400 capable of communicating with the Primary Node. The communication software includes an API to receive real time analytics (change catalog entries) from the Primary Node, data change and access requests (read, modify, create, delete) from the Primary Node, data protection and intelligence control commands, and data restore commands. Data protection and intelligence control commands include commands for creating discovery points, setting up management rules for managing discovery points (including deletion), and searching and restoring content that has been backed up. Data restore commands include commands for accessing previously backed up data. - Data change requests that are received at the Intelligence Node are applied to that node's copy of current data, thereby maintaining a real-time mirror of primary storage. This implements real-time data protection for the current data.
- For data analytics and data recovery purposes, the Intelligence Node maintains a
change catalog 600 containing real-time analytics gathered from accessed and changed data since thelast discovery point 650. A discovery point is also created by associating and storing a change catalog together with reference to the mirrored copy of changed primary data since the last discovery point as maintained in the intelligence pool. A more detailed discussion of the change catalogs and discovery points is provided below. - The Intelligence Node implements file or block-
level access 430 to itsown pool 130 of physical storage. This intelligence storage pool retains the real-time copy of primary data and discovery points. The stored intelligence data within discovery points includes in-line analytics (change catalog) as received from the Primary Node andadditional analytics 410 executed by the Intelligence Node. - The real-time copy of primary data also enables distributed response processing between the Primary and Intelligence Nodes. For example, load balancing between the Primary and Intelligence Nodes may enable greater scalability. As both have real-time copies of primary data, read requests may be balanced between the nodes, or alternatively directed to both nodes with the fastest-to-respond used for the response. The Primary Node may act as a controller for such distributed processing, or a separate controller may be used.
- There is no requirement that the
Primary 110 andIntelligence Data 130 reside on the same appliance, they can be distributed to multiple discrete appliances deploying all the same techniques with the exception that the communication method is performed over a network transport instead of using the HA mechanisms within an array. - Intelligence is at the core of the system. There are four types of intelligence functions in the system: Data, Operational, Storage, and Recovery. All four use the same processing engine and common analytics metadata to provide analysis both at fixed points and as gathered over time.
Data Intelligence 452 allows for intelligent user content management.Operational Intelligence 456 analyzes the behavior of the system and application logs stored on the system to provide insight into applications and security of the system.Storage Intelligence 454 allows for intelligent storage system resource management, including automatic storage allocation and reallocation including dynamically growing and shrinking storage pools.Recovery Intelligence 450 allows for intelligent data protection and data restore. All types of intelligence may be used for, or enable operation in conjunction with, different types of analytics, such as, but not limited to, collaboration, trending, e-discovery, audits, scoring, and similarity. - Analytics begin at the Primary Node, which tracks data access and data modifications, system behavior, change rates, and other real-time analytics. It provides this real-time analytics information to the Intelligence Node. Intelligence gathering determines time and owner relationships with the data for collaboration and contextual information about the data. The gathered intelligence is used for later search and reporting, and is tracked in change catalogs associated with the data.
- Referring now to
FIG. 5 and toFIG. 6A , change catalogs 600 are created as part of in-line real-time analytics 500 performed by thePrimary Node 100, but change catalogs 600 are then also further expanded by theIntelligence Node 120 performing further data processing, and create the foundation for later search. The change catalog data is initially created in real-time at the Primary Node (such as via PART 310) and includes extended information about the specific data access, for example, allowing complete tracking of who/how/when/where accessed, created, modified, or deleted a file or other data object. Traditional file metadata includes only an owner, group, path, access rights, file size, and last modified timestamp. This provides some, but not complete, information about a file. For example, it does not identify who modified the file, how many modifications have occurred, or any information about file accesses (such as viewing or reading a file) which do not modify the file. The PART, operated by the Primary Node, intercepts every file access event. Thus the Primary Node has the ability to track extended metadata about a file—including identification of every modification and every access, even those which do not modify the file, by timestamp, user, and type of access. - Referring also to
FIG. 6A , this extended metadata is stored as achange catalog entry 610 that identifies the object being accessed, the actor (user performing an operation), and the operation being performed. Additional information which may be in a change catalog entry includes, but is not limited to, object name, owner, access control lists, and time of operation. Thechange catalog 600 contains this extended metadata information, and serves as the foundation of further analytics, such as performed later by the Intelligence Node. The change catalog entry may also include security information, such as permission rights for access, associated with the object. An administrator may configure the degree of tracking, or even enable/disable tracking on a file location, user, group-specific, or other basis, and the Primary Node is capable of incorporating all details of every file access into the change catalog entries. These change catalog entries of enhanced metadata are gathered by the Primary Node and transmitted to the Intelligence Node for storing and expanding with further analytics. - With reference now also to
FIG. 6B , the change catalog metadata tracks incremental changes which are also linked to adiscovery point 650. Every time a new discovery point is created the current change catalog is closed off and stored within the discovery point. When data is retained in the discovery point, the system may be configured to retain a copy of the discovery point analytics metadata at the Intelligence Node even if that discovery point is migrated off the Intelligence Node, enabling more efficient query processing. - A
discovery point 650 is created by associating and storing a change catalog together with the mirrored copy of changed primary data since the last discovery point in the intelligence pool. After a discovery point creation, anew change catalog 600 is created allowing gathering of new real-time analytics on primary data. Change catalogs and discovery points are preferably maintained per volume or file system in primary storage, but may also span multiple volumes or file systems. Discovery points allow deeper analytics on a point in time version of primary data, and can also be used to recover a prior version of primary data. A discovery point contains data analytics for accessed and changed data since a prior discovery point. When created, a discovery point also contains a virtually full but physically sparse copy of primary data at the time of creation of that discovery point. The system uses data visible within discovery points to perform deeper data processing, creating more analytics metadata. The analysis is done on accessed and changed data since a previous discovery point, using the real-time analytics reflected in the change catalog. These newly gathered deeper analytics are also stored within the discovery point. Primary data may be retained for the life of the discovery point, or may be removed earlier, such as after the deep data analysis is complete and desired analytics metadata obtained. Removing the primary data allows for more efficient space utilization, while retaining the primary data enables recovery of primary data at the point in time of the creation of the discovery point. From one discovery point until the creation of a next discovery point, file changes, deletions, renames, creations and such are tracked as cumulative modifications to from the prior discovery point, so that only incremental changes are maintained. This creates a version of the data at each discovery point. While the data is retained in a discovery point, the system is able to restore data at the discovery point granularity. As change catalogs are stored with each discovery point, information about change history between discovery points may be available through analysis of the change catalog. To restore a data object at a particular point in time, a discovery point is used. For long-term storage, discovery points may be moved to long-term media such as tape or off-site storage as configured through the management software. - Discovery points can be deleted manually through a delete discovery point command, or automatically based on time or analysis in order to save storage space or for off-site migration. Deletion of discovery points is complicated by management of analytics metadata. The analytics metadata stored within a discovery point contains information about data changed within a period of time. If the stored analytics are deleted they can be lost. To prevent this, the time period for analytics associated with one or more other discovery points can be adjusted, and relevant portions of analytics metadata from a discovery point being deleted extracted and merged with other analytics already stored within the other discovery points.
- Returning attention now to
FIG. 5 , at the Intelligence Node, an adaptive parallel processing engine, orRule Engine 420, operates on thechange catalog 600 to derive these more complex analytics, including tracking changes and use over time. The Rule Engine appliesrules 510 to analyze content on the underlying primary data, enabling deeper analytics on stored data. As an example, a second level dictionary can provide sentiment attributes to an already indexed document. Regular expression processing may be applied to see if a document contains information such as social security or credit card numbers. Each rule may have afilter 530 to match content, and anaction 540 to take based on results. Rules can be nested, and used to answer user-specific questions. Another example may be to apply locations where keywords appear, for example to search objects for a set of keywords such as “mold” or “water damage,” and in all matches to search the objects for address or zip code information. Rules are configurable by administrators or system users, allowing dynamic rule creation and combination based on differentapplicable policies 520. Rules can be combined in multiple ways to discover more complex information. Rules may also be configured for actions based on results. For example, notifications may be set to trigger based on detected access or content, and different retention policies may be applied based on content or access patterns or other tracked metadata. Other actions may include, but are not limited to, data retention, quarantine, data extraction, deletion, and data distribution. Results of applied rules may be indexed or tracked for future analysis. - As applied
rules 510 identify results, such results may be indexed or tracked for other analytical use. This additional metadata may be added to the change catalogs for the relevant files or objects. The metadata may also be tracked as custom tags added to objects. Tags may be stored as extended attributes of files, or metadata tracked in a separate analytics index such as data in a directory or volume hidden from normal end user view, or in other data stores for analytics. Rules, and therefore analytics, may be applied both to data tracked and to the metadata generated by analytics. This enables analytics of both content and gathered intelligence, allowing point-in-time and over-time analysis. The rules results and actions may serve as feedback from one or more rules to one or more other rules (or even self-feedback to the same rule), enabling multi-stage analysis and workflow processing. - Recovery Intelligence is the set of analytics implemented by
Intelligence Node 120 around data protection. The purpose is to protect data and associated analytics. When data reaches the Intelligence Node a mirrored copy is stored in the intelligence pool, creating redundancy with primary storage, and these changes are tracked for use in discovery point creation. Primary data, discovery points, and intelligence data are preferably separated on actual physical media at the spindle or disk pool level, such that a failure of a single individual physical device is always recoverable. As discovery points are created based on change catalogs tracked at the Intelligence Node, they can be created at any time without any impact on the performance of primary storage. This eliminates a need to schedule time-windows for discovery point creation. Each discovery point includes incremental changes from the prior discovery point, including data object changes and the analytics gathered and associated with the data during such changes. Intelligent rules can be applied to automate discovery point creation, such that, in addition to manual or time-based creation, discovery point creation may be triggered by content changes. Such changes may be percentage based, specific to percentage change of certain identifiable subsets of the entire data pool, based on detected deviations from usage patterns such as increase in frequency of specific accesses, or based on real-time analysis of data content. - At the creation of a discovery point, the change catalog accumulating real-time changes is closed. The change catalog is then stored within the created discovery point, and a new change catalog created for changes to be associated with a next created discovery point. The analytics and data stored within discovery points enable efficient restores, allowing search over multiple discovery points for specific object changes without requiring restoration of the data objects from each discovery point. Such search can be based on any analytics performed, such as data tracked in the extended metadata and content-based analysis performed by application of the Rule Engine. The tracking further enables indexing and partial restores—for example specific objects, or embedded objects within complex objects, can be restored from a discovery point without a complete restore of all data from that discovery point.
- Data Intelligence is a set of analytics at the Intelligence Node analyzing content. Data Intelligence operates through the Rule Engine, and can be applied to unstructured data, for example file metadata such as document properties of Microsoft Office documents or the actual content of such documents, semi-structured data such as log files or specific applications such as Mail programs, structured data such as databases or other formats for which schema may be known or discovered by the system, and recursive containers such as virtual machines, file systems on file systems, file systems on volumes, or archives.
- File systems use internal data structures, called metadata, to manage files, directories and data in files. A typical file system uses logging to guarantee crash consistency. One of the popular techniques to guarantee crash consistency is a write-ahead log. Before modifying metadata, the file system logs the intent of modifications to the log, and then performs the metadata modifications on disk. In case of a panic, power failure, or crash, the log is then replayed to bring the file system back to a consistent state.
- Consider a high availability, data intelligence environment as shown in
FIG. 7 . As explained above, thePART 310 intercepts data access requests, forwards them to a primary node, mirrors them to a high availability restore node, and performs analytics to create intelligence data. During these operations, each of theprimary node 100 and restorenode 140 operate with their ownindependent file system 102, 142 (FS). It should be understood that thefile systems - Certain types of file systems (FS) on each of the primary 100 and restore 140 may maintain their own local log (101, 141) of transactions; however other file systems may not maintain such
local logs - Since some transactions take longer than others to process, the single threaded, sequential log process delays any subsequent log transactions from being entered when the log is busy with a prior task.
- In a preferred implementation, the
PART 310 maintains itsown log 311 independent of thelogs file systems log 311 is implemented in a durable storage medium that can be written to in random order, such as nonvolatile memory. To achieve crash consistency, access requests can be replayed at the primary 100 and restore 140 nodes consistent with the original order in which they were received at thePART 310. In some implementations, thePART log 310 may obviate the need forlogs file systems - More specifically, any metadata in the
PART log 311 is stored with a corresponding transaction ID. The transaction IDs are a unique number maintained by thePART 310 and incremented upon each access request received. Writes to thePART log 311 may therefore be multithreaded such that they can be written any time and in any order, with the order information retained in the transaction ID associated with each request. -
FIG. 7 shows a typical PART log entry including a transaction ID, an operation type, a file handle, offset, length and data. The entries in the PART log 311 are arranged in a number ofchunks 301 typically with each chunk being of equal size to other chunks. - As mentioned previously, access requests received by the
PART 310 may be multithreaded. Thevarious chunks 301 in the PART log 311 enable log entries to be written in any order and also concurrently. As a result, writes to the random access, highspeed PART log 310 do not have to observe any ordering dependencies, yet the ordering can be regenerated when thePART log 310 is replayed to the primary and restore nodes. - In one example shown in
FIG. 7 , thePART 310 is executing five (5) concurrent threads labeled A1, A2, A3 and B. In this example, some of the threads are issuing access requests for a data tree structure that is to be populated with engineering data concerning the configuration of a manufactured component. Other threads executing in thePART 310 are concerned with processing customer orders for the component. For example, a first thread A1 may be responsible for creating the tree while threads A2 and A3 are responsible for writing data to the tree. In this example, thread B is handling an entirely different operation such as supporting database accesses concerning the customer orders for the component. Thus it should be understood that some of the accesses (those initiated by threads A1, A2, A3) will ultimately have to be executed in a certain order at the primary 100 and restore 140 nodes, but other accesses (thread B) can be handled in any order at the primary 100 and restore 140 level. - However, regardless of the order in which the accesses must ultimately be executed at the primary 100 and secondary 140, the accesses can be written to in any order in the
PART log 311. This is because, as previously described, the transaction ID numbers are assigned to each access request in the order which they are received. This then enables the transactions to be executed in the correct order in thelocal file systems PART level 310. - After each access request is written to the
PART log 311, it is forwarded in parallel to each of the primary 100 and restore 140 nodes. The primary 100 and restore 140 nodes then copy the request data to a respectivelocal cache 104, 144, but do not yet actually issue the request to theirunderlying file systems respective disks PART log 311, such behavior by primary 100 and restore 140 nodes would not provide crash consistency locally withinfile systems file systems - At some time when the number of entries in the PART log reaches a certain number (such as when the
PART log 311 is nearing a full condition), thePART 310 issues a synchronization (“sync”) request to the primary 100 and restore 140. Upon receipt of the sync request, the primary 100 and restore 140 flushes their respective cached data todisks PART 310. With the data now confirmed as having been being persisted on disk by both the primary and restore nodes, thePART 310 can now free thecorresponding chunks 301 inPART log 311. In other words, it is not until the sync command is complete that data related to the requests is known to be correctly persisted to respective disks in the primary and restore nodes. -
FIG. 8 shows a typical process flow among thePART 310 andprimary node 100. It should be understood that the corresponding operation between thePART 310 and the restorenode 140 is similar. In afirst step 801 thePART 310 receives an access request from a host. In anext step 802 the PART assigns a next available transaction ID to the request. Atstep 803, the access request is then written to any available chunk in thePART log 311. Innext step 804, the request is then sent to both the primary 100 and restore 140 nodes. - In
step 841 the primary 100 receives the request from thePART 310. In anext step 842, if the primary 100 and restore nodes maintain alocal log state 843 data associated with the request is stored in the primary'slocal cache memory 104,144. Although data is not yet stored on disk, instate 844 the primary can send an access complete acknowledgment back to thePART 310. Instate 850 thePART 310 can then report that fact that the access is logically complete even though the data has not yet been flushed to disk at the primary 100. This permits the client application which is accessing thePART 310 to continue its logical flow even though the data has not yet been physically flushed to disk. It should be understood from the foregoing that multiple instances of this process can occur in parallel, owing to the multi-threaded nature of the PART log 311 which supplants the single-threadedlogs -
FIG. 9 illustrates the process flow between the PART 310 and the primary 100 and restore 140 when the multithreaded log 311 is full or nearly full. In state 910 the PART log 311 is recognized as no longer being able (or soon to become unable) to store additional requests. In step 911, a sync command is sent from the PART 310 to both the primary 100 and restore 140 nodes. In state 920 the primary 100 (or restore node 140) receives the sync command, and in state 922 flushes its local cache to permanent file system (FS) storage such as one or more disk(s). Once the flush operation is complete in state 923, an acknowledgment can then be returned to the PART 310. - In
state 930 the PART receives the acknowledgment from the primary 100, and at some point (either prior to, at the same time as, or subsequent to state 930) the PART 310 also receives an acknowledgment from the restore node 140. In state 933, having received flush acknowledgments from both the primary 100 and the restore 140, the PART 310 can finally release the associated chunks 301 in the PART log 311. - As a result, even when log entries are not recorded in
PART log 311 in the exact order in which they are issued to the primary 100 and restore 140 nodes, the transaction IDs can be used to replay the log in the same order as the original writes occurred. The durable storage used for the PART log 311 is a fast access storage device, such as a solid state device, so that the log file can be sorted in transaction ID order as quickly as possible when it needs to be read back, such as when a fault occurs before data is flushed to disk by both the primary 100 and restore 140 nodes. - The result is that file system consistency is guaranteed at the higher system level, without relying on the standard log operations within the file systems implemented in both the
primary node 100 and restore node 140. This also guarantees data synchronization and metadata consistency between the primary node 100 and restore node 140, even in the event of an error occurring prior to cache flushing. Furthermore, in a case where the primary and restore maintain their own logs, the PART log 311 in effect becomes a virtual file system (VFS) log that supplants the operation of those local logs. - Write Gathering at Virtual File System Layer
- It is typical for a storage system to aggregate write operations in a cache before being flushed to main storage. Consider the environment shown in
FIG. 10. As with the systems described above, a data intelligent storage system is implemented with a primary node 100 and high availability/intelligence data stored at a restore node 140. In a typical I/O operation, a write access request may come into the PART 310 and is recorded in a PART log 311 before being forwarded to the primary 100 file system and the restore 140 file system. In an optional arrangement the primary and restore file systems may maintain their own logs; in that case a remote copy 151 is made of the primary log 101 and another remote copy 105 is made of the restore log 141. As a result, each single I/O transaction may result in many different write operations to different primary data stores and logs. - We have realized that efficiency can be obtained by also gathering write accesses at the
PART 310 layer, above the FS layers (102, 142) distributed to multiple nodes. A PART level cache, which we refer to as a write gathering cache 333, is implemented to store data associated with write requests. Thus when a write transaction comes into the PART 310, the associated data is immediately copied to the write gathering cache 333, and the I/O request is also acknowledged. Certain other operations that involve metadata, such as a make directory (mkdir) operation, are first logged in the PART log 311 and then issued to the primary 100 and restore 140. - Writes are then aggregated in
cache 333 until such time as the cache 333 needs to be flushed to the restore node 140. If at this point, for example, a sequence of transactions has resulted in multiple writes to the same block, the cache location associated with that block will have been overwritten multiple times. Flushing of the gathering cache 333 will then require only a single write of that block, thereby reducing the total number of write operations to the restore file system 140. As part of the cache flushing, additional copies can be sent to still other nodes, such as to provide remote replication.
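- The following is a small illustrative sketch of such a write gathering cache, not the actual implementation: repeated writes to the same block simply overwrite one cache slot, so the eventual flush issues one write per dirty block rather than one write per logged transaction. The target object's write_block() method is an assumption made for the example.

    class WriteGatheringCache:
        """Illustrative PART-level write gathering: aggregate writes by block
        so that a flush sends each dirty block downstream only once."""

        def __init__(self, block_size=4096):
            self.block_size = block_size
            self.dirty = {}                       # block number -> latest data

        def write(self, offset, data):
            block = offset // self.block_size
            self.dirty[block] = data              # later writes overwrite earlier ones
            return "acknowledged"                 # the request can be acknowledged at once

        def flush(self, targets):
            # One write per dirty block, sent to the restore node (and optionally
            # to additional nodes to provide remote copies).
            for block, data in sorted(self.dirty.items()):
                for target in targets:
                    target.write_block(block, data)
            self.dirty.clear()

    cache = WriteGatheringCache()
    cache.write(0, b"v1")
    cache.write(0, b"v2")   # same block written twice...
    # ...a later cache.flush([restore_node]) would send only b"v2" for block 0.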
- FIG. 11 shows the data intelligence storage system being accessed by one or more applications 1010. In the particular example illustrated, the application 1010 is a hypervisor environment such as an ESX or ESXi server (ESX and ESXi are trademarks of VMware, Inc. of Palo Alto, Calif.). The application 1010 creates and maintains various virtual machine (VM) files in such an environment on a subdirectory basis. For example, the application 1010 expects a first virtual machine (VM0) to be disposed within a first subdirectory (/vm0), which includes associated virtual machine files a0.vmdk, a1.vmdk, etc. The files associated with a second virtual machine (b0.vmdk) are to be disposed within a second subdirectory (/vm1), and the files associated with an (n−1)th virtual machine in subdirectory (/vmn). The files (k0.vmdk) associated with yet another virtual machine are to be stored in a directory (/vm11) that is subordinate to directory /vm0. As can be seen, the ESX server application 1010 may therefore be hosting a number of virtual machines; the data associated with each virtual machine, including its operating system image files, application files and associated data, is stored in one or more files arranged in a directory tree 1011 within a single file system 1015. - As with the other data intelligence environments discussed herein,
application 1010 issues access requests to the PART 310. In turn, the PART 310 not only sends the access request to one or more file systems on primary node 100, but also sends the access request to the file system(s) on restore node 140. - As explained above, it also becomes desirable to use
intelligence 145 in the restore node 140 to perform certain tasks. One such task creates intelligence data in the form of a change catalog entry with associated discovery points (1020-1, 1020-2). In the scenario shown in FIG. 11, these discovery points 1020 may include snapshots of the state of the virtual machine files and their associated data, metadata, other intelligence data, and change catalog. As also explained above, snapshots become discovery points; here each discovery point includes one or more snapshots of each VM. - While certain applications such as the
ESX server 1010 store their associated files in atree structure 1011 containing different subdirectories, the file systems implemented with primary 100 and/or restorenode 140 may not easily support taking a snapshot of just a single subdirectory and therefore of just a single VM. Such existing snapshot technologies are directed to instead storing a snapshot of an entire file system. However it may be desirable in certain circumstances to enable the use of such snapshot technologies on a single VM. - The basic idea is for
PART 310 to identify particular applications such as ESX server 1010 that create subdirectories, such as those containing virtual machine files, and manage them in a distinct way. The PART 310 therefore can more efficiently enable certain actions by intelligence 145. As shown in FIG. 11, as it handles access requests, the PART 310 maintains an entire set of filesystems 1050 for each sub-directory on the primary 100 and an entire set of filesystems 1070 on the restore 140. What appears to the user application (ESX server 1010) to be an ordinary filesystem containing ordinary subdirectories is actually a virtual filesystem 1040 wherein any given subdirectory may actually be a link to a separate, associated file system that contains the .vmdk files for a given VM. - When these subdirectories are accessed in the
virtual file system 1015, the PART 310 thus transparently redirects those accesses to the associated file system(s) 1050, 1070 on the primary and restore. In one example, a make directory (mkdir) command to create VM subdirectory /vm1 is intercepted by the PART 310, which then creates file system v.vm1 (1050-1) on the primary 100 and its mirror v.vm1 (1070-1) on the restore node 140. The PART 310 then creates the new file system directory /vm1 in the primary filesystem 1040, which is a virtual “mount point” linking the subdirectory /vm1 in virtual file system 1015 with its associated actual file system v.vm1 (1050-1, 1070-1). This link is denoted by pointer 1042. In another example, a write access directed to file /vm0/a1.vmdk is intercepted by the PART 310, which, following link 1041, redirects that write access to the filesystem v.vm0 (1050-0) on primary 100 which actually contains the file a1.vmdk. As described in the other patents incorporated by reference above, the PART 310 also mirrors write accesses to the restore node 140; in this case, the mirrored write access is directed to the filesystem v.vm0 (1070-0) on the restore node 140 which actually contains the mirror of a1.vmdk. - In effect, the
PART 310 maintains the illusion of a subdirectory tree 1011 but actually creates a number of file systems 1050-0, 1050-1, 1050-2, . . . , 1050-11, . . . , 1050-n on primary 100 and a number of file systems 1070-0, 1070-1, 1070-2, . . . , 1070-11, . . . , 1070-n on restore 140.
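- As a rough sketch of this mapping (illustrative only; the container file system names v.vm0, v.vm1 follow the example above, while the class and method names are invented here), the PART can keep a table from virtual subdirectory to container file system and rewrite paths before forwarding a request:

    import posixpath

    class VirtualSubdirectoryMap:
        """Illustrative PART-level mapping of virtual subdirectories to
        per-VM container file systems on the primary and restore nodes."""

        def __init__(self):
            self.mounts = {}                      # "/vm1" -> "v.vm1"

        def mkdir(self, vdir):
            # Intercept mkdir: create a container file system instead of a plain
            # subdirectory, and remember the link (virtual mount point).
            fs_name = "v." + vdir.strip("/")
            self.mounts[vdir] = fs_name
            return fs_name                        # e.g. created on both primary and restore

        def rmdir(self, vdir):
            return self.mounts.pop(vdir, None)    # drop the link on directory removal

        def resolve(self, path):
            # Redirect "/vm0/a1.vmdk" to ("v.vm0", "a1.vmdk").
            for vdir, fs_name in self.mounts.items():
                if path == vdir or path.startswith(vdir + "/"):
                    return fs_name, posixpath.relpath(path, vdir)
            return None, path                     # not a managed subdirectory

    vfs = VirtualSubdirectoryMap()
    vfs.mkdir("/vm0")
    print(vfs.resolve("/vm0/a1.vmdk"))            # ('v.vm0', 'a1.vmdk')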
- It is possible that not every subdirectory will be given this treatment, depending on the desirability of having separate access for the PART 310 to implement snapshots of certain subdirectories. - When the need arises to take a snapshot, the snapshot processes running as part of
intelligence 145 can be executed using the standard file system oriented snapshot process, but using the virtual mount point information to locate the underlying filesystems 1070 associated with a given subdirectory. In effect, the virtual filesystem (VFS) 1090 hides the existence of multiple independent “container file systems” from user application 1010. Subdirectories in the virtual file system (VFS) 1090 are accessible as subdirectories, but at the same time the underlying container file systems 1070 are accessible to the snapshot processes. Associated file system snapshot technology in the restore node 140 can now be relied upon to obtain snapshots of a given VM independently of snapshots of other VMs, and the restore node 140 can thus treat them as it treats any file system. - In addition, the details of the
virtual file system 1090 framework are hidden from the ESX server 1010. In particular, directory-associated access requests made to the system, such as a make directory (mkdir), remove directory (rmdir) or change directory (chdir) command, are intercepted by the PART 310. Upon receiving a mkdir command, the PART 310 layer then issues requests to primary 100 and restore 140 to create the file system constructs v.vm0 (1050-0, 1070-0), v.vm1 (1050-1, 1070-1), . . . , v.vm11 (1050-k, 1070-k), . . . , v.vmn (1050-n, 1070-n) and associates them with links (1041, 1042, 1043) to the actual virtual machine files and subdirectories as expected by the ESX server 1010. Upon receipt of a rmdir command, the corresponding v.vm structure(s) can be removed from or edited within the VFS 1090 data structure. This shadow virtual file system 1090 is, in effect, created and maintained via processes internal to the PART 310. The virtual mount points in VFS 1090 are thus accessed by the snapshot processes 1020-1, 1020-2, but the subdirectory structures remain in place for other I/O requests as received from the ESX 1010. -
FIG. 12 illustrates a more general case where restore node 140 applies a storage related action 1120 (such as snapshot, de-duplicate, compression, or some other storage related action) to portions of a file system (such as a subdirectory 1016) rather than an entire file system 1015 created by application 1010. Here, a virtual file system (VFS) layer 1090 in the PART 310 is again used to provide links between file systems and the subdirectories 1011, 1012 associated with these file systems. - In one such example, the
application 1010 may wish to apply a de-duplicate action only to a particular sub-sub-directory 1016. Thus, even when the underlying file systems provided by primary 100 and restore 140 does not permit such access granularity, a “no de-duplicate” property can be maintained in theVFS 1090 byPART 310 for that subdirectory, because it is a virtual mount point linking to an actual underlying file system 1070-k upon which the property can be applied. - Thus the techniques described herein can be applied wherever it is desirable to apply a property only to portions (or at some granularity such as a subdirectory) of an underlying file system, even when the file system itself limits access to such portions or at such granularity.
- In another example, the
PART 310 may apply properties to the virtual machine subdirectories in consistency groups. So, for example, the VFS 1090 maintained by the PART 310 may further indicate that the virtual subdirectories (1070-0, 1070-1) for two of the VMs (such as /vm0 and /vm1) are always to be treated together and subjected to the same snapshot policies.
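- A toy sketch of how such per-subdirectory properties and consistency groups could be tracked alongside the virtual mount points follows; the property names and the structure are assumptions made for illustration, not the literal layout of VFS 1090.

    class SubdirectoryPolicies:
        """Illustrative per-subdirectory properties kept by the VFS layer,
        since each subdirectory is really its own container file system."""

        def __init__(self):
            self.properties = {}                  # "/vm5" -> {"compression": False}
            self.groups = {}                      # "snap-together" -> {"/vm0", "/vm1"}

        def set_property(self, vdir, name, value):
            self.properties.setdefault(vdir, {})[name] = value

        def allows(self, vdir, name):
            return self.properties.get(vdir, {}).get(name, True)

        def add_to_group(self, group, vdir):
            self.groups.setdefault(group, set()).add(vdir)

    policies = SubdirectoryPolicies()
    policies.set_property("/vm5", "compression", False)   # exclude from compression
    policies.add_to_group("snap-together", "/vm0")
    policies.add_to_group("snap-together", "/vm1")        # same snapshot policy
    print(policies.allows("/vm5", "compression"))          # False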
- In other scenarios, the VFS 1090 may be exposed to analytics running in the restore node 140. The results of those analytics can then be used to determine how to further refine the directory structure(s) 1050 and/or 1070. As one example, the intelligence process 145 in the restore node may detect that the application 1010 accesses files in a particular way which indicates some need to apply a certain scope to a given action. In another example, the intelligence 145 may determine that a certain virtual machine contains a sub-sub-directory /vm5 that is a type of database (a.005.db) having contents that are known to change often and thus will not benefit from compression. The intelligence can thus maintain a compression property when accessing the virtual directory structures in VFS 1090 to exclude sub-sub-directory /vm5 from any compression action. - Cloning with Thin Provisioning and Background Live Restore
- It can be desirable from time to time for a user to request that complete duplicate or clone of an existing file system object be created. In one scenario, a user may wish to create a clone of a prior snapshot that contains prior obtained intelligence data, to perform still further analytics. In another instance, the clone may be needed to recover a failed file system. In still another instance, the user may simply wish to copy an object from one file system to another without destroying the original object.
- As with the systems described above, in a typical scenario such as shown in
FIG. 13 , aprimary node 100 stores primary data and restorenode 140 stores associated intelligence data and other information. In one example, theobject 1215 to be cloned may reside within snapshot (“snap”) 1210; thisobject 1215 may be a file, a directory or even the entire file system within thesnapshot 1210. - The basic idea is that when a
clone request 1205 is made to the PART 310, it does not simply forward the request to the file systems. Instead, a new file system clone object 1220 is immediately thin provisioned on the primary node 100 and exposed by the PART 310 to the user (host), and a new filesystem clone object 1230 is also thin provisioned on the restore node 140 (which is consistent with the system's usual process of mirroring requests to the restore node 140, although that step is not critical to handling the clone request as described herein). The PART 310 therefore does not have to first actually populate all of the metadata and data to the new file system clone objects 1220, 1230. Even without all of the metadata and data actually being populated, users can start to issue requests to access the cloned file system objects 1220, 1230.
- The
PART level 310 thus coordinates execution of a clone process rather than pushing the clone process down to anFS - A situation where the object to be cloned is a single file will be first discussed in connection with
FIG. 14 . In this example thefile object 1300 to be cloned consists of ametadata portion 1310 and adata portion 1320. The particular structure ofmetadata 1310 depends on the type of underlying file system, and will be different for a Linux file system, than say, a Windows or MAC OSX file system as is known in the art. Thedata portion 1320 can be considered to be a collection of data bytes of a certain size, such aschunks 1308 each of 512 kilobytes (kB). A clone file process executed inPART 310 maintains aclone bitmap 1350 for each such clonedfile object 1300. Theclone bitmap 1350 includes at least asingle bit 1355 for eachchunk 1308 in thefile object 1300. Theclone bitmap 1350 is used as a tool to coordinate the status of actually copying portions of thedata 1320 from the original file to the cloned object. -
FIG. 15 is an example flow for aclone process 1400 where the cloned object is a single file. In a step 1402 a request is received at thePART 310 to create the cloned object. In the next step 1404 anew clone object 1220 is created on theprimary node 100 and anew clone object 1230 is created on the restorenode 140, but only thin provisioned, without actually copying any data yet. The thin provisioned file objects at this point may contain some metadata instate 1406 depending upon whether or not the underlying file system maintains metadata within the file object itself (certain operating systems such as Windows and MAC OSX do this; other operating systems such as Linux maintain file metadata as part of a separate inode). In either event, in thenext state 1408 thebitmap 1355 is created for the file with all bits therein set to logical “false” value indicating that the corresponding data for chunk has not yet been populated to the clones. - In
state 1410 the new clone file objects 1220, 1230 are made available for user I/O, even though the data portion 1320 has yet to be copied from the source snap 1210. - In a next state 1412 a background live restore
thread 1420 is started. The live restorethread 1420 may typically be a background thread executing only when thePART 310 is otherwise not executing other tasks. In other instances, the live restorethread 1420 may be a thread with low priority (keeping in mind that thePART 310 is a multi-threaded processor as described above). - The purpose of the live restore
thread 1420 is to perform the task of copying data from the source snap 1210 to the clones 1220, 1230. In state 1424 the next chunk is copied from the source snap 1210 to the clones 1220, 1230.
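- A compact sketch of that background copy loop, assuming a per-file chunk bitmap like the clone bitmap 1350 described above, is shown below; the function and helper names (read, write) are invented for illustration, and in the scheme described here the loop would run only as a low-priority or idle-time thread.

    CHUNK_SIZE = 512 * 1024          # 512 kB chunks, as in the example above

    def background_live_restore(source, clone, bitmap):
        """Illustrative background restore: copy every chunk whose bit is still
        False from the source snapshot to the clone, marking bits as it goes."""
        for chunk_no, copied in enumerate(bitmap):
            if copied:
                continue                          # already restored (possibly on demand)
            data = source.read(chunk_no * CHUNK_SIZE, CHUNK_SIZE)
            clone.write(chunk_no * CHUNK_SIZE, data)
            bitmap[chunk_no] = True               # record that this chunk is populated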
- FIG. 16 shows a typical process 1500 performed when the PART 310 receives an access request for a previously thin-provisioned clone object. In state 1502 the clone access request is received. In a next step 1504, the clone bitmap 1350 associated with the clone object is consulted. If the bit or bits associated with the chunks accessed in the request are all set to “true”, then that is an indication that the data has already been copied to the clones 1220, 1230, and processing proceeds (state 1505) to state 1510. - However if the access request refers to one or
more chunks 1308 for which the bitmap 1350 indicates a “false” value, and which thus have not been previously processed, then in state 1506 the bitmap is updated to set those bits to “true”. In state 1508, data and possibly metadata affecting chunks within the scope of the request are then populated to the clones 1220, 1230. - Regardless of whether
state 1510 is reached from state 1508 or state 1505, the access request is then further handled. As explained above, the access request to the clone may be issued to both the primary 100 and restore 140 nodes. The access request may also typically be issued to both the primary 100 and restore 140 nodes by the PART 310 using the multithreaded log process described above. This then results in duplicate copies of the clone at both the primary 100 and restore 140 nodes once data is flushed from the caches.
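- The on-demand side of the same bitmap can be sketched as follows, again with invented names and a request object assumed to carry an offset and length; the point is only that an access covering a not-yet-populated chunk first copies that chunk, flips its bit, and then proceeds normally.

    def access_clone(request, source, clone, bitmap, chunk_size=512 * 1024):
        """Illustrative process-1500-style access path for a thin-provisioned clone."""
        first = request.offset // chunk_size
        last = (request.offset + request.length - 1) // chunk_size
        for chunk_no in range(first, last + 1):
            if not bitmap[chunk_no]:
                # Populate the missing chunk from the source snapshot before use.
                data = source.read(chunk_no * chunk_size, chunk_size)
                clone.write(chunk_no * chunk_size, data)
                bitmap[chunk_no] = True
        # With all covered chunks populated, the request itself can now be issued
        # to the primary and restore nodes (e.g. through the multithreaded log).
        return clone.handle(request)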
- A different process is executed when the object 1210 to be cloned is an object that includes more than one file, such as a directory or even an entire file system. Such an object, as shown in FIG. 17, may be represented as a tree or graph structure consisting of nodes at various levels with edges connecting the nodes. The nodes consist of values, such as metadata defining the content of an associated directory, together with a list of references to child nodes and parent nodes that contain metadata for sub-directories and parent directories. - In the example shown in
FIG. 17 the snap to be cloned is a root directory with several subdirectories and files stored within those subdirectories. The data structure 1600 thus consists of a root node 1610, a first level 1620 consisting of four nodes representing four subdirectories, a second level 1630 with further subdirectories, and a third level 1640 with respective files. - The initial task when asked to clone such an object is to create a copy of the directory tree structure in the
clone 1650 in a particular way. Thatprocess 1700 is shown in more detail inFIG. 18 and begins when a “clone directory” request is initially received instate 1702. In astep 1704 thePART 310 thin provisions the clone directory such as by only creating acopy 1660 of theroot node 1610. In thenext step 1706 metadata associated with theroot node 1610 would also be copied as may be required by the particular type of file system (in the case of a Linux-compatible file system, that may include copying the inode for the directory). Instate 1708 the clone object is then made available for user I/O. - In state 1710 a background and/or low priority live restore 1720 thread is kicked off for the directory object. In
state 1711 processing may continue with the clone appearing to be available for access by the user but without any data and without even the entire tree structure having actually been propagated to the clone yet. - The live restore
process 1720 for a directory object begins in state 1722. Here the directory tree for theoriginal snap 1650 is walked in a depth first search. In the example ofFIG. 16 , the Depth First Search (DFS) fromnode 1610 would first locatednode 1621 for processing, thennode 1631 and thennode 1641 before returning to level two andnode 1632 and so forth. Instep 1723 the node that has been located in a depth first search is then added to the clone tree. - In
state 1724 another background thread is also started concurrently with thread 1720. (It is understood that, as explained above, the PART 310 is a multithreaded processor and is capable of executing multiple concurrent threads at the same time.) From state 1725, the background live restore process 1720 continues to determine if the depth first search locates any additional nodes, and if so, processing loops back to step 1723 to process the new node. If no new nodes are found in step 1725, then in state 1726 a background data restore thread (such as that described in connection with thread 1420 in FIG. 14) can then be triggered to restore data for the files referenced in the now cloned directory tree. - Returning to the background process and Breadth First Search (BFS)
thread 1730, here thetree 1600 is then walked in a breadth first fashion at the current level. In the example ofFIG. 17 , whennode 1621 is encountered atlevel 1 1620, the breadth first search will next locatenode 1621. In state 1733 a correspondingnew node 1671 is added to the clone tree. Processing continues with the test instate 1734 and looping back tostate 1732 until the search of the current level in the tree is complete. Once this is done then theBFS background thread 1730 can terminate instate 1735. - Thus as the tree is populated in the clone using both a DFS-oriented live restore
thread 1720 which initiates concurrent BFS-oriented live restore thread(s) 1730. It is also important to note that neither of the live restore threads 1720, 1730 copies file data itself; file data is restored either (a) on demand when a user accesses the clone (per process 1500 in FIG. 15) or (b) when the background restore process 1420 is kicked off after the tree structures are created.
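- The tree-walk portion can be sketched roughly as below. This is a simplified, single-process rendering of the idea (a depth-first walk that mirrors each node into the clone and hands each newly reached level to a breadth-first helper); the real threads 1720, 1730 would run concurrently, and the node interfaces assumed here (a children list on source nodes, a name-keyed children dict and add_child() on clone nodes) are invented for the example.

    def clone_directory_tree(src_root, clone_root):
        """Illustrative DFS walk that mirrors directory nodes into the clone and
        fills in each level breadth-first as it is first reached."""

        def bfs_fill_level(src_parent, clone_parent):
            # Breadth-first: create clone nodes for all children of this parent.
            for child in src_parent.children:
                if child.name not in clone_parent.children:
                    clone_parent.add_child(child.name, child.metadata)

        def dfs(src_node, clone_node):
            bfs_fill_level(src_node, clone_node)              # would be a separate thread
            for child in src_node.children:
                dfs(child, clone_node.children[child.name])   # depth-first descent

        dfs(src_root, clone_root)
        # Only after the structure exists is a background data restore kicked off.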
- It should be understood that there is typically some limit on the number of concurrent BFS threads 1730 at any one particular time, depending on the available processing power of the PART 310. - The above-described processes, with some adaptation, can also efficiently support “temporary”
clones 1230. Creation and population of atemporary clone 1230 may use the same general mechanisms but with an observation that data in the temporary clone is not meant to be persisted for long. Thus when thetemporary clone 1230 is opened for user I/O, the actual access might only store modified data in the temporary clone and continue to access theoriginal snap 1210 for read accesses. This eliminates the need to restore all of the data in thesnap 1210 to theclone 1230 but a synchronization mechanism can be observed through use of thebitmap 1350 for each file. - For example the background live restore thread need not be executed for the temporary clone; similarly, the temporary clone need not necessarily recreate all of the data, metadata and directory/subdirectory trees for which the user only requests read access. Thus it is only when a user wishes to perform a read-modify-write, the corresponding chunk(s) need to be fetched from the original snap, modified, and then only those chunk(s) written to the clone.
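- That synchronization can be sketched as follows (illustrative only, with invented read/write helpers): reads of unmodified chunks are redirected to the original snapshot, while a write first brings over and marks only the affected chunks in the bitmap and stores the modified data in the temporary clone.

    def temp_clone_read(offset, length, snap, clone, bitmap, chunk_size=512 * 1024):
        """Illustrative temporary-clone read: serve from the clone only where a
        prior write has already placed data there, otherwise from the snapshot.
        (Simplified: assumes the read falls within one chunk.)"""
        chunk_no = offset // chunk_size
        return clone.read(offset, length) if bitmap[chunk_no] else snap.read(offset, length)

    def temp_clone_write(offset, data, snap, clone, bitmap, chunk_size=512 * 1024):
        """Illustrative temporary-clone write: fetch, modify and store only the
        chunks actually touched; untouched data stays in the snapshot."""
        first = offset // chunk_size
        last = (offset + len(data) - 1) // chunk_size
        for chunk_no in range(first, last + 1):
            if not bitmap[chunk_no]:
                # Read-modify-write: bring the original chunk over once.
                original = snap.read(chunk_no * chunk_size, chunk_size)
                clone.write(chunk_no * chunk_size, original)
                bitmap[chunk_no] = True
        clone.write(offset, data)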
-
FIG. 19 illustrates a typical temporaryclone access process 1800 in more detail. It is understood that before thisprocess 1800 is executed, a temporary clone structure has been created such as per theprocess 1500 inFIG. 16 . Instate 1802 an access request to the temporary clone is received. In state 1804 a determination is made as to whether or not the access request is a read or a write. - If the access request is a read, and if the corresponding bits in the bitmap are set to a logic false (indicating that there has been no prior write access to those chunks of the temporary clone), then the access request can be serviced in
state 1806 from the original snap data structure 1210. - If the access request is a read, and if the corresponding bits in the bitmap are set to a logic true (bits set), indicating that there has been a prior write access to those chunks, then the access request can be serviced in
state 1807 from the clone structure 1230. - If however the access request is a write, then process 1800 proceeds to
state 1808 with bits now being set in the bitmap. Instate 1810 data (and metadata if needed) are populated within the scope of the request to theclone 1230. Instate 1812 thePART 310 finishes the write request. As before, this may be performed via the multithreaded log process in thePART 310. - It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. As but one example, the algorithms specify general steps, or one specific way of implementing a function or feature. Those of skill in the art will recognize that other approaches are possible. It should also be understood that the algorithms described are directed to the primary logic needed to carry out the stated functions. They do not describe all possible variations in implementation; nor do they specify all possible ancillary functions needed for a practical system such as invalid user-supplied inputs or invalid operational states. For example, error states can be handled in any convenient way.
- The scope of the invention should, therefore, be determined only with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/828,942 US20160048427A1 (en) | 2013-09-04 | 2015-08-18 | Virtual subdirectory management |
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/017,754 US8849764B1 (en) | 2013-06-13 | 2013-09-04 | System and method of data intelligent storage |
US14/157,974 US9213706B2 (en) | 2013-06-13 | 2014-01-17 | Live restore for a data intelligent storage system |
US14/203,871 US9262281B2 (en) | 2013-06-13 | 2014-03-11 | Consolidating analytics metadata |
US201462038498P | 2014-08-18 | 2014-08-18 | |
US14/828,942 US20160048427A1 (en) | 2013-09-04 | 2015-08-18 | Virtual subdirectory management |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160048427A1 true US20160048427A1 (en) | 2016-02-18 |
Family
ID=55304789
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/828,905 Active 2035-10-13 US9785518B2 (en) | 2013-09-04 | 2015-08-18 | Multi-threaded transaction log for primary and restore/intelligence |
US14/828,942 Abandoned US20160048427A1 (en) | 2013-09-04 | 2015-08-18 | Virtual subdirectory management |
US14/829,046 Abandoned US20160048428A1 (en) | 2013-09-04 | 2015-08-18 | Thin provisioned clone |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/828,905 Active 2035-10-13 US9785518B2 (en) | 2013-09-04 | 2015-08-18 | Multi-threaded transaction log for primary and restore/intelligence |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/829,046 Abandoned US20160048428A1 (en) | 2013-09-04 | 2015-08-18 | Thin provisioned clone |
Country Status (1)
Country | Link |
---|---|
US (3) | US9785518B2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341114A (en) * | 2016-04-29 | 2017-11-10 | 华为技术有限公司 | A kind of method of directory management, Node Controller and system |
US10671412B1 (en) * | 2019-03-21 | 2020-06-02 | Adobe Inc. | Fast cloning for background processes in scripting environments |
US10929424B1 (en) * | 2016-08-31 | 2021-02-23 | Veritas Technologies Llc | Cloud replication based on adaptive quality of service |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9785518B2 (en) * | 2013-09-04 | 2017-10-10 | Hytrust, Inc. | Multi-threaded transaction log for primary and restore/intelligence |
US9563376B2 (en) | 2015-05-01 | 2017-02-07 | International Business Machines Corporation | Low power storage array with metadata access |
US10057350B2 (en) * | 2015-12-28 | 2018-08-21 | Netapp, Inc. | Methods for transferring data based on actual size of a data operation and devices thereof |
US9983817B2 (en) | 2016-02-24 | 2018-05-29 | Netapp, Inc. | Adaptive, self learning consistency point triggers |
US10366061B2 (en) | 2016-09-23 | 2019-07-30 | International Business Machines Corporation | Interactive visualization |
US10423593B2 (en) | 2016-09-23 | 2019-09-24 | International Business Machines Corporation | Interactive visualization |
US10430436B2 (en) | 2016-09-23 | 2019-10-01 | International Business Machines Corporation | Interactive visualization |
US10331636B2 (en) | 2016-09-23 | 2019-06-25 | International Business Machines Corporation | Interactive visualization |
US10545926B1 (en) * | 2016-12-31 | 2020-01-28 | EMC IP Holding Company LLC | Computer data file system with consistency groups as basic file system objects |
US10545913B1 (en) | 2017-04-30 | 2020-01-28 | EMC IP Holding Company LLC | Data storage system with on-demand recovery of file import metadata during file system migration |
US10809938B2 (en) | 2018-03-06 | 2020-10-20 | International Business Machines Corporation | Synchronized safe data commit scans in multiple data storage systems |
CN108509327A (en) * | 2018-04-20 | 2018-09-07 | 深圳市文鼎创数据科技有限公司 | A kind of log-output method, device, terminal device and storage medium |
US10884796B2 (en) * | 2018-05-03 | 2021-01-05 | Sap Se | Job execution using system critical threads |
US20200019476A1 (en) * | 2018-07-11 | 2020-01-16 | EMC IP Holding Company LLC | Accelerating Write Performance for Microservices Utilizing a Write-Ahead Log |
US10795817B2 (en) | 2018-11-16 | 2020-10-06 | Western Digital Technologies, Inc. | Cache coherence for file system interfaces |
US11347647B2 (en) | 2018-12-18 | 2022-05-31 | Western Digital Technologies, Inc. | Adaptive cache commit delay for write aggregation |
US10860483B2 (en) * | 2019-04-30 | 2020-12-08 | EMC IP Holding Company LLC | Handling metadata corruption to avoid data unavailability |
US10885450B1 (en) | 2019-08-14 | 2021-01-05 | Capital One Services, Llc | Automatically detecting invalid events in a distributed computing environment |
CN110990191B (en) * | 2019-11-01 | 2022-06-17 | 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) | Data recovery method and system based on mirror image storage |
US20220114267A1 (en) * | 2020-10-13 | 2022-04-14 | ASG Technologies Group, Inc. dba ASG Technologies | Secure Sharing of Documents Created via Content Management Repository |
US11593309B2 (en) * | 2020-11-05 | 2023-02-28 | International Business Machines Corporation | Reliable delivery of event notifications from a distributed file system |
US11789829B2 (en) * | 2021-04-27 | 2023-10-17 | Capital One Services, Llc | Interprocess communication for asynchronous tasks |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090182778A1 (en) * | 2006-03-20 | 2009-07-16 | Swsoft Holdings, Ltd. | Managing computer file system using file system trees |
US20120144391A1 (en) * | 2010-12-02 | 2012-06-07 | International Business Machines Corporation | Provisioning a virtual machine |
US20150134826A1 (en) * | 2013-11-11 | 2015-05-14 | Amazon Technologeis, Inc. | Managed Directory Service Connection |
Family Cites Families (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6606694B2 (en) | 2000-12-22 | 2003-08-12 | Bull Hn Information Systems Inc. | Write logging in mirrored disk subsystems |
US7024427B2 (en) | 2001-12-19 | 2006-04-04 | Emc Corporation | Virtual file system |
US7010553B2 (en) | 2002-03-19 | 2006-03-07 | Network Appliance, Inc. | System and method for redirecting access to a remote mirrored snapshot |
US7337351B2 (en) * | 2002-09-18 | 2008-02-26 | Netezza Corporation | Disk mirror architecture for database appliance with locally balanced regeneration |
US8037264B2 (en) | 2003-01-21 | 2011-10-11 | Dell Products, L.P. | Distributed snapshot process |
US7567991B2 (en) | 2003-06-25 | 2009-07-28 | Emc Corporation | Replication of snapshot using a file system copy differential |
US7251708B1 (en) * | 2003-08-07 | 2007-07-31 | Crossroads Systems, Inc. | System and method for maintaining and reporting a log of multi-threaded backups |
US20050273486A1 (en) | 2004-06-03 | 2005-12-08 | Keith Robert O Jr | Virtual distributed file system |
US7389298B2 (en) | 2004-11-18 | 2008-06-17 | International Business Machines Corporation | Seamless remote traversal of multiple NFSv4 exported file systems |
US7353311B2 (en) * | 2005-06-01 | 2008-04-01 | Freescale Semiconductor, Inc. | Method of accessing information and system therefor |
WO2007002397A2 (en) * | 2005-06-24 | 2007-01-04 | Syncsort Incorporated | System and method for high performance enterprise data protection |
US7685385B1 (en) | 2005-06-30 | 2010-03-23 | Symantec Operating Corporation | System and method for satisfying I/O requests before a replica has been fully synchronized |
US7676514B2 (en) * | 2006-05-08 | 2010-03-09 | Emc Corporation | Distributed maintenance of snapshot copies by a primary processor managing metadata and a secondary processor providing read-write access to a production dataset |
US8046548B1 (en) | 2007-01-30 | 2011-10-25 | American Megatrends, Inc. | Maintaining data consistency in mirrored cluster storage systems using bitmap write-intent logging |
US7913046B2 (en) | 2007-08-06 | 2011-03-22 | Dell Global B.V. - Singapore Branch | Method for performing a snapshot in a distributed shared file system |
US20110040812A1 (en) | 2007-12-20 | 2011-02-17 | Virtual Computer, Inc. | Layered Virtual File System |
US8286030B1 (en) | 2009-02-09 | 2012-10-09 | American Megatrends, Inc. | Information lifecycle management assisted asynchronous replication |
US8572338B1 (en) * | 2010-02-22 | 2013-10-29 | Symantec Corporation | Systems and methods for creating space-saving snapshots |
JP5666710B2 (en) * | 2011-04-05 | 2015-02-12 | 株式会社日立製作所 | Storage apparatus and volume management method |
WO2013001332A1 (en) | 2011-06-27 | 2013-01-03 | Nokia Corporation | System, method and apparatus for facilitating resource security |
US8713170B2 (en) | 2011-11-18 | 2014-04-29 | The Boeing Company | Server-side web analytics system and method |
US8996827B1 (en) * | 2011-12-27 | 2015-03-31 | Emc Corporation | Creating and maintaining clones in continuous data protection |
US8935502B2 (en) * | 2012-12-21 | 2015-01-13 | Red Hat, Inc. | Synchronous management of disk flush requests |
US9274798B2 (en) | 2013-01-18 | 2016-03-01 | Morgan Stanley | Multi-threaded logging |
US9213706B2 (en) | 2013-06-13 | 2015-12-15 | DataGravity, Inc. | Live restore for a data intelligent storage system |
US8849764B1 (en) * | 2013-06-13 | 2014-09-30 | DataGravity, Inc. | System and method of data intelligent storage |
US9785518B2 (en) * | 2013-09-04 | 2017-10-10 | Hytrust, Inc. | Multi-threaded transaction log for primary and restore/intelligence |
US9361187B2 (en) * | 2013-11-04 | 2016-06-07 | Quantum Corporation | File system metadata capture and restore |
-
2015
- 2015-08-18 US US14/828,905 patent/US9785518B2/en active Active
- 2015-08-18 US US14/828,942 patent/US20160048427A1/en not_active Abandoned
- 2015-08-18 US US14/829,046 patent/US20160048428A1/en not_active Abandoned
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090182778A1 (en) * | 2006-03-20 | 2009-07-16 | Swsoft Holdings, Ltd. | Managing computer file system using file system trees |
US20120144391A1 (en) * | 2010-12-02 | 2012-06-07 | International Business Machines Corporation | Provisioning a virtual machine |
US20150134826A1 (en) * | 2013-11-11 | 2015-05-14 | Amazon Technologeis, Inc. | Managed Directory Service Connection |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341114A (en) * | 2016-04-29 | 2017-11-10 | 华为技术有限公司 | A kind of method of directory management, Node Controller and system |
US10929424B1 (en) * | 2016-08-31 | 2021-02-23 | Veritas Technologies Llc | Cloud replication based on adaptive quality of service |
US12210544B1 (en) * | 2016-08-31 | 2025-01-28 | Veritas Technologies Llc | Cloud replication based on adaptive quality of service |
US10671412B1 (en) * | 2019-03-21 | 2020-06-02 | Adobe Inc. | Fast cloning for background processes in scripting environments |
Also Published As
Publication number | Publication date |
---|---|
US20160048428A1 (en) | 2016-02-18 |
US9785518B2 (en) | 2017-10-10 |
US20160048351A1 (en) | 2016-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9785518B2 (en) | Multi-threaded transaction log for primary and restore/intelligence | |
US10089192B2 (en) | Live restore for a data intelligent storage system | |
US9213706B2 (en) | Live restore for a data intelligent storage system | |
US10061658B2 (en) | System and method of data intelligent storage | |
US10102079B2 (en) | Triggering discovery points based on change | |
US9501546B2 (en) | System and method for quick-linking user interface jobs across services based on system implementation information | |
US7596713B2 (en) | Fast backup storage and fast recovery of data (FBSRD) | |
CN103415842B (en) | For the virtualized system and method for data management | |
JP2013545162A (en) | System and method for integrating query results in a fault tolerant database management system | |
JP2013545162A5 (en) | ||
US10289495B1 (en) | Method and system for performing an item level restore from a backup | |
US20230259529A1 (en) | Timestamp consistency for synchronous replication | |
WO2016028757A2 (en) | Multi-threaded transaction log for primary and restore/intelligence | |
WO2017112737A1 (en) | Triggering discovery points based on change | |
Wang et al. | Fast off-site backup and recovery system for HDFS | |
CN116601611A (en) | Method and system for continuous data protection | |
Han et al. | A Novel File-Level Continuous Data Protection System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: DATAGRAVITY, INC., NEW HAMPSHIRE Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SASI, KANNAN;KANTETI, KUMAR;LONG, PAULA;SIGNING DATES FROM 20151013 TO 20151019;REEL/FRAME:037024/0917 |
|
AS | Assignment |
Owner name: HYTRUST, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DATAGRAVITY, INC.;REEL/FRAME:042930/0758 Effective date: 20170701 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
|
AS | Assignment |
Owner name: ENTRUST CORPORATION, MINNESOTA Free format text: MERGER;ASSIGNOR:HYTRUST, INC.;REEL/FRAME:066806/0262 Effective date: 20230330 |
|
AS | Assignment |
Owner name: BMO BANK N.A., AS COLLATERAL AGENT, ILLINOIS Free format text: SECURITY INTEREST;ASSIGNOR:ENTRUST CORPORATION;REEL/FRAME:066917/0024 Effective date: 20240326 |