A Distributed Provenance Aware Storage System

Mihai Bădoiu (mihai@mit.edu), Kiran-Kumar Muniswamy-Reddy (kiran@eecs.harvard.edu), Anastasios Sidiropoulos (tasos@mit.edu), Mythili Vutukuru (mythili@mit.edu)

Abstract

The provenance of a file represents the origin and history of the file data. A Distributed Provenance Aware Storage System (DPASS) tracks the provenance of files in a distributed file system. The provenance information can be used to identify potential dependencies between files in a file system. Applications of provenance tracking include (i) tracking the transformations applied to raw data in scientific communities and (ii) intrusion detection and forensic analysis of computer systems. In this report we present the design and implementation of a provenance aware storage system that efficiently stores and retrieves provenance information for files in a distributed file system, while incurring minimal space and time overheads.

1 Introduction

Provenance, from the French word for "source or origin", refers to the complete history or lineage of a document. In computing terms, it consists of information about the objects that a particular object is based on, the process by which the object was created or modified, and so on. For example, consider a process P that reads from files A and B, performs some computation, and writes to a file C. The provenance of C then consists of the input files A and B, the application P that modified the file, the command line arguments and environment of process P, the processor type on which P ran, and so on.

Provenance is particularly useful for scientific communities such as Physics, Chemistry and Astronomy. Raw data generated by scientific experiments is processed and transformed multiple times before it is published. Before using the published data in their experiments, scientists need to know whether they can trust its source. To this end, they need to know where the data came from and the transformations it went through. Also, if it turns out that there was a flaw during the data generation and transformation process, the originators of a flawed data set need to inform all users of the data of the flaw. Moreover, it is often desirable to keep track of enough meta-data that the exact same experiment can be recreated. The provenance of a file can be useful in all of these scenarios.

Provenance can also be used for security purposes, to conduct forensic analysis after a break-in. Intruders gain access to systems by installing malicious worm backdoors that can then corrupt files in the system. Upon detecting a suspicious file, we can examine its provenance, backtrack through the file system and locate the worm backdoor. We could also locate all the files in the system which depend on the worm backdoor and thus identify other possibly corrupted files. BackTracker [5] is based on a similar mechanism of intrusion detection.

A provenance aware storage system (PASS) maintains the provenance information of a file in the file system, along with the other meta-data of the file. Complete provenance includes information about the applications that modified the data, the input data, and the environment under which the application was executed. For this project, we limit ourselves to capturing the application that modified a file and the host on which the file was modified, together with the set of files that the process read before the modification.
The files A1, . . . , An that a process read before modifying a file B are the ancestors of B, and B is a descendant of each Ai. One of our main goals is to capture the ancestor-descendant dependencies between files.

It is common for users to keep their data on a centralized file system, so that they can access it by logging into any machine. In order for a user to access the provenance from any machine, the provenance has to be stored along with the data on a centralized file server. A Distributed PASS (DPASS) is a distributed storage system that stores the provenance of a file along with its data, enabling the user to access the provenance remotely. Note that since the provenance depends on the processes that are running on the user's machine, recording the provenance must involve both the machine the user is logged in to and the machine that stores the file system.

1.1 Challenges in Building a DPASS

• Automatic Provenance Generation: In a primitive provenance tracking system, users who generate or modify a file could be made responsible for tracking its provenance. This solution is unacceptable, since users might neglect entering provenance, might enter it incorrectly, or might find it cumbersome to enter the provenance manually. A PASS should automatically record the provenance of files without human intervention, and without changing existing applications and programming interfaces.

• Transporting Provenance: DPASS requires that provenance be transmitted to the file server. It is desirable to transport provenance without inventing a new protocol or changing an existing protocol like NFS.

• Storing minimal required information: A naive approach to provenance recording is to record every read and write by a process. This approach results in redundant dependencies, and incurs unacceptable storage and processing time overhead. Thus, it is critical that a PASS store only the minimum information that is sufficient to reconstruct all relevant dependencies between files.

• Querying Provenance Efficiently: The provenance should be efficiently retrievable by applications. While a simple log containing all reads and writes is sufficient to capture any possible file dependencies, it cannot be queried efficiently.

The rest of the report is organized as follows. Section 2 describes the provenance tracking algorithm and the design of the database. Section 3 discusses the implementation details. Section 4 evaluates our system. Section 5 describes related work. We conclude in Section 6 and discuss future directions.

2 Provenance Tracking Algorithms and Database Design

Our system captures all dependencies of the form A → B that exist between any two files A and B, denoting that the contents of B might have been derived from the contents of A. More precisely, A → B means that B was modified by a process that read A before modifying B. In this section, we first describe a naive algorithm for tracking dependencies, followed by the improved algorithm that we use in our system. We also describe the format of the database used to store the provenance, and explain how we construct provenance trees.

2.1 A Naive Algorithm

The naive algorithm to capture provenance is as follows (a sketch follows the list):

• Each time a process P reads a file Ai, record this event by appending a record to a per-process buffer.

• Each time P writes to a file B, the data written to B could potentially depend on each file Ai that P has read. On every write to B, for every file Ai in the buffer of P, record the dependency Ai → B.
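As an illustration, the naive scheme amounts to the following few lines of Python; the per-process buffer and the global dependency list are hypothetical stand-ins for whatever bookkeeping a real implementation would keep.

    # Naive provenance capture: remember every file a process has read,
    # and on every write emit one dependency per remembered file.
    from collections import defaultdict

    files_read = defaultdict(set)   # pid -> set of files the process has read
    dependencies = []               # recorded (ancestor, descendant) pairs, i.e. A -> B

    def on_read(pid, path):
        files_read[pid].add(path)

    def on_write(pid, path):
        for ancestor in files_read[pid]:
            dependencies.append((ancestor, path))   # recorded on *every* write

Because a dependency is emitted on every single write, a process that repeatedly writes the same file re-records the same pairs over and over; this redundancy is the first problem discussed below.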
Although the above algorithm seems like a reasonable way to record provenance, it has several problems that render it inappropriate for our system:

• The naive approach results in a lot of redundant storage. For example, if the same file is read a second time by a process, this information should be captured only if the file has changed since the last read. Thus, optimizations are needed to avoid recording every single dependency on every write.

• Since there are many redundant dependency entries, the time required to build dependency trees increases.

• The naive algorithm can produce cycles while building the provenance tree of a file. For example, if process P reads file A and writes to file B (resulting in A → B) and another process Q reads B and writes to A (resulting in B → A), a cycle is formed. This can send a provenance tree building algorithm into a loop. Dependencies of the form A → B → A should therefore be avoided. Such a dependency can be eliminated by noting that there are in fact two different versions of A involved here: the file A that Q has written to is no longer the same as the file A that P read from. Hence, our system needs to store additional timestamps to recognize the different versions of a file.

We will show that by carefully recording the dependencies, the above problems can be avoided. In the next section, we present an improved algorithm that keeps track of a few timestamps with every read and write, in order to avoid capturing redundant dependencies and to avoid cycles in the provenance tree of a file.

2.2 An Efficient Tracking Algorithm

2.2.1 Active file dependencies

For each file Ai that a process P has read or written, we store a tuple (i-nodei, first-readi, mtimei, lp-writei), such that:

• i-nodei is the i-node number of file Ai.

• first-readi is the time the first read system call was issued by P on file Ai.

• mtimei is the modification time of file Ai. This is updated every time Ai is read by P. The mtimei of Ai changes between two reads if and only if some other process has modified Ai between the two reads. Different mtimei values of a file denote different versions of the file, so mtimei is used to identify whether a process is reading a new version of the same file. We denote a file A with mtime = t by A(t).

• lp-writei is the time when the provenance of Ai was last recorded in the database. This corresponds to the last write system call to Ai before which new files had been read by the process P.

We refer to the set of these tuples as the active file dependencies of the process. The active file dependencies of each process are stored in a separate buffer in memory. Note that when a process reads a file Ai, only the first three elements of the tuple are populated, and when a process writes to file Ai, only the lp-writei field is updated. As we next explain, these timestamps are used to eliminate redundant dependencies and to avoid cycles during the construction of provenance trees. A small sketch of this per-process table follows.
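The per-process table can be pictured as follows; this is only a sketch in Python, and names such as ActiveDep and active_deps are ours, not the kernel data structures used in the implementation described in Section 3.

    # Per-process active-file-dependency table.  On a read, only the first
    # three fields are populated; lp_write stays None until the file's
    # provenance is first recorded on a write (Section 2.2.2).
    import os
    import time
    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class ActiveDep:
        inode: int
        first_read: Optional[float] = None   # time of P's first read of the file
        mtime: Optional[float] = None        # file's mtime as last seen by P (its version)
        lp_write: Optional[float] = None     # time the file's provenance was last recorded

    active_deps: Dict[int, Dict[int, ActiveDep]] = {}   # pid -> {inode -> ActiveDep}

    def on_read(pid: int, path: str) -> None:
        st = os.stat(path)
        deps = active_deps.setdefault(pid, {})
        entry = deps.get(st.st_ino)
        if entry is None:
            deps[st.st_ino] = ActiveDep(st.st_ino, first_read=time.time(), mtime=st.st_mtime)
        else:
            entry.mtime = st.st_mtime   # a changed mtime means another process produced a new version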
2.2.2 Provenance recording rules

Recall that a dependency of the form Aj(tj) → Ai(ti) means that the version of file Ai at time ti depends on the version of file Aj at time tj. When a process P performs a write system call on a file Ai, the system scans the active file dependencies of P, extracts any new dependencies of Ai, and records them in the database. The exact rules for recording provenance when a process P writes to Ai at time ti are as follows:

• Rule 1: If lp-writei = null, then this is the first write of P to Ai. No file dependencies for Ai have been recorded so far, and Ai depends on all the active file dependencies of P. All active dependencies of P are recorded in the database: for every file Aj, with j ≠ i, that P has read from, we record the dependency Aj(tj) → Ai(ti), where tj is the version number of Aj.

• Rule 2: If lp-writei ≠ null, the provenance of Ai has been recorded before, and thus only some of the active file dependencies need to be recorded. For every file Aj, with j ≠ i, in the active file dependencies of P, the dependency Aj(tj) → Ai(ti) is recorded only if one of the following conditions is satisfied:

  – Rule 2.1: first-readj > lp-writei, which means that Aj is a new file that P read for the first time after the last write to Ai. Since Aj had not been read before the previous writes to Ai, Aj(tj) → Ai(ti) should be recorded.

  – Rule 2.2: mtimej > lp-writei, which means that file Aj has been modified by some other process and process P has read the modified version. Since Aj was read after it was modified, the write to Ai implies that Ai now depends on the new version of Aj. This dependency on the newer version of Aj should be captured.

If any of the rules result in Ai's provenance being updated in the database, the lp-writei field of Ai is updated to ti, indicating that the provenance of Ai was last updated at ti.

2.3 The Database

The recorded dependencies are converted into key-value pairs and stored persistently in a centralized provenance database. Storing provenance in a database allows us to build provenance trees efficiently, since we do not have to scan the whole database. Because i-nodes are recycled, each file is assigned a unique p-node number when it is created; a p-node number is never recycled. The details of how p-node numbers are assigned and maintained are explained in Section 3.

Let pnr and pnw denote the p-node numbers of Ar and Aw respectively. The dependency Ar(tr) → Aw(tw) generated by process P on host H is stored in the provenance database as a tuple (pnw, tw, pnr, tr, H, P), where pnw is used as the primary key. Using timestamps also allows us to avoid cycles, since at any point only dependencies recorded before a particular time are retrieved; as the depth of the recursion increases, that time bound decreases, ensuring that the recursion ends.

Additionally, two secondary indices are maintained to speed up certain kinds of queries. The process database is a secondary index on the process name, enabling efficient retrieval of the files that have been modified by a particular application. When a tuple of the form (pnw, tw, pnr, tr, H, P) is stored in the provenance database, a tuple of the form (P, pnw) with P as the key is stored in the process database. The descendant database is another secondary index, maintained to efficiently retrieve the descendants of a particular file. For a tuple (pnw, tw, pnr, tr, H, P) in the primary database, the tuple (pnr, tr, pnw, tw) with pnr as the key is stored in the descendant database.

Descendant tree building algorithm. To build the descendant tree of file A with version t, the query application retrieves all tuples of the form A(t1) → B(t2) where t1 > t, using the p-node number of A as the key. The application then recursively queries the database for descendants of B recorded after t2. The recursion ends when no more descendants can be found. A sketch combining the recording rules with this database layout follows.
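Putting the rules and the database layout together, the following sketch applies Rules 1, 2.1 and 2.2 on a write and appends the resulting (pnw, tw, pnr, tr, H, P) tuples to in-memory dictionaries that stand in for the provenance, process and descendant databases. The ActiveDep table has the same shape as in the previous sketch, and pnode_of is a placeholder for the i-node to p-node mapping described in Section 3.

    # Write-path sketch: apply the recording rules, then store the dependency
    # the way Section 2.3 lays it out (primary database plus two secondary indices).
    import os
    import time
    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class ActiveDep:                          # same shape as in the previous sketch
        inode: int
        first_read: Optional[float] = None
        mtime: Optional[float] = None
        lp_write: Optional[float] = None

    active_deps: Dict[int, Dict[int, ActiveDep]] = {}   # pid -> {inode -> ActiveDep}

    provenance_db = {}   # pn_w -> [(pn_w, t_w, pn_r, t_r, host, process), ...]
    process_db    = {}   # process name -> [pn_w, ...]
    descendant_db = {}   # pn_r -> [(pn_r, t_r, pn_w, t_w), ...]

    def pnode_of(inode: int) -> int:
        return inode     # placeholder for the shared i-node -> p-node map (Section 3.2)

    def record(pn_w, t_w, pn_r, t_r, host, process):
        provenance_db.setdefault(pn_w, []).append((pn_w, t_w, pn_r, t_r, host, process))
        process_db.setdefault(process, []).append(pn_w)
        descendant_db.setdefault(pn_r, []).append((pn_r, t_r, pn_w, t_w))

    def on_write(pid, path, host, process):
        t_i = time.time()
        st = os.stat(path)
        deps = active_deps.setdefault(pid, {})
        target = deps.setdefault(st.st_ino, ActiveDep(st.st_ino))
        updated = False
        for dep in deps.values():
            if dep.inode == target.inode or dep.first_read is None:
                continue                                          # skip the written file itself
            if target.lp_write is None:
                should_record = True                              # Rule 1: provenance not yet recorded
            else:
                should_record = (dep.first_read > target.lp_write   # Rule 2.1: newly read file
                                 or dep.mtime > target.lp_write)    # Rule 2.2: newer version read
            if should_record:
                record(pnode_of(target.inode), t_i, pnode_of(dep.inode), dep.mtime, host, process)
                updated = True
        if updated:
            target.lp_write = t_i                                 # provenance last recorded at t_i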
2.4 Retrieving the Provenance Information

DPASS supports two primary queries on the provenance stored in the database.

• Retrieving the provenance tree of a file: This query returns all the files in the system that a particular file X depends on, by tracing the ancestors of the file back to its foremost ancestors. Intuitively, this amounts to backtracking the origins of a file.

• Retrieving the descendant tree of a file: This query returns all the files in the file system that have a file X as their ancestor. For instance, if a file X is corrupted at time t, the descendant tree is useful to determine all the files that have been derived from the corrupted data.

Provenance tree building algorithm. To build the provenance tree of file A with version time t, the query application starts by retrieving all the immediate ancestors of A before time t, i.e. tuples of the form B(t1) → A(t2), where t2 ≤ t. For each chosen ancestor B, the application recursively retrieves all immediate ancestors of B recorded before t2. The recursion ends when no more provenance records can be found. Clearly, this recursive algorithm retrieves all the ancestors of a file. Observe that since the mtime of B at time t2 was t1, it might seem sufficient to query for the provenance of B only up to time t1, rather than up to time t2. We found, however, that this is not the case, because the mtime of a file does not always indicate the time of the last write to it. For example, while untar-ing a file, tar uses utime to set the file's mtime to something much earlier than even the file's creation time. Hence we use the time when the dependency was recorded, rather than the mtime of the ancestor, for building provenance trees. A sketch of both tree building queries follows.
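Both queries reduce to a recursive lookup over the stored tuples. The sketch below models the database as a flat list of (pnw, tw, pnr, tr, H, P) records; the real query application instead fetches records by p-node key from the provenance and descendant databases, but the recursion is the same.

    # records: iterable of (pn_w, t_w, pn_r, t_r, host, process) tuples, one per
    # stored dependency A_r(t_r) -> A_w(t_w).

    def provenance_tree(records, pnode, t):
        """All ancestors of file `pnode` as of version time t (provenance tree)."""
        edges = []
        for pn_w, t_w, pn_r, t_r, host, process in records:
            if pn_w == pnode and t_w <= t:          # immediate ancestors recorded before t
                edges.append((pn_r, t_r, pn_w, t_w))
                # recurse using the time the dependency was recorded, not the ancestor's mtime
                edges += provenance_tree(records, pn_r, t_w)
        return edges

    def descendant_tree(records, pnode, t):
        """All files derived, directly or indirectly, from version t of `pnode`."""
        edges = []
        for pn_w, t_w, pn_r, t_r, host, process in records:
            if pn_r == pnode and t_r >= t:          # dependencies that read this (or a later) version
                edges.append((pn_r, t_r, pn_w, t_w))
                edges += descendant_tree(records, pn_w, t_w)
        return edges

Because each recursive step only considers dependencies recorded no later (for ancestors) or no earlier (for descendants) than the current one, the timestamps bound the recursion, which is how the cycles of Section 2.1 are avoided.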
2.5 Example

Consider a process P on host H that reads from a file A and writes to a file B within a loop. Formally, P reads from A at times t1, t3, and t7, and writes to B at times t2, t4, and t8. Moreover, assume that process P reads a file C at time t5, and that another process P' on host H' reads a file D and writes to A at time t6. Finally, P' reads from B at time t9 and writes to A at time t10, where t1 < t2 < . . . < t10. The following timeline summarizes the above scenario:

    time   P (H)       P' (H')
    t1     read(A)
    t2     write(B)
    t3     read(A)
    t4     write(B)
    t5     read(C)
    t6                 read(D), write(A)
    t7     read(A)
    t8     write(B)
    t9                 read(B)
    t10                write(A)

The provenance capturing algorithm proceeds as follows. After time t1, the active file dependencies of P on host H contain the tuple (i-node(A), t1, tA, null), for some tA < t1, tA being the version (mtime) of A at time t1. When the write to B happens at t2, the dependency A(tA) → B(t2) is recorded following Rule 1. Also, the tuple (i-node(B), null, null, t2) is added to the active file dependencies of P to indicate that the provenance of B was recorded at t2. Observe that after P reads from A for the second time, the tuple in the active file dependencies of P that corresponds to A remains unchanged, since A has not been modified since the previous read. This implies that on the second write to B, the provenance of B is not updated. When P reads file C at time t5, a new active file dependency (i-node(C), t5, tC, null) corresponding to file C is added to P's active file dependencies, where tC is the version number (mtime) of C. When P' writes to A at time t6, the dependency D(tD) → A(t6) is recorded following Rule 1, where tD is the mtime of D. Next, when P reads from A at time t7, the mtime of A has changed, and the active file dependencies of P are updated to contain the tuple (i-node(A), t1, t6, null). When P writes to B at t8, the first read of C is later than the last recorded write to B (t5 > t4), so the dependency C(tC) → B(t8) is recorded following Rule 2.1; likewise, the mtime of A is later than the last recorded write to B (t6 > t4), so the provenance of B is updated with the dependency A(t6) → B(t8) following Rule 2.2. Finally, when P' reads B and writes to A, the client on H' records the dependency B(t8) → A(t10).

Assuming that the p-node numbers of files A, B, C and D are pA, pB, pC and pD, the dependencies generated above are stored in the database server as shown below:

    Dependency           Tuple
    A(tA)  → B(t2)       (pB, t2, pA, tA, H, P)
    C(tC)  → B(t8)       (pB, t8, pC, tC, H, P)
    D(tD)  → A(t6)       (pA, t6, pD, tD, H', P')
    A(t6)  → B(t8)       (pB, t8, pA, t6, H, P)
    B(t8)  → A(t10)      (pA, t10, pB, t8, H', P')

The provenance tree of A at time t10 is shown in Figure 1.

[Figure 1: An example provenance tree.]

2.6 Extensions to the provenance tracking mechanism

The above high-level description of the provenance tracking mechanism covers only the actions performed during read and write system calls. We now briefly outline the corresponding actions for other system calls and inter-process communication mechanisms; a sketch of the fork and pipe handling appears at the end of this section.

Forks. When a process P calls fork, the active file dependencies of P are copied to the child process.

Pipes. The pipe data structure is extended to store a pointer to provenance information. On a write to a pipe, a pointer to a copy of the active file dependencies of the writing process is stored in the pipe data structure. On a read from a pipe, the active file dependencies recorded during the write to the pipe are removed and appended to the active file dependencies of the reading process.

Mmapped files. Tracking provenance for mmapped files is hard because writes translate into pages being marked dirty, and by the time the pages are synchronized to disk, the process could be long dead. We therefore treat an mmap system call as a read/write of the file and record provenance as we do for normal reads and writes.

Note that, unlike BackTracker [5], our system does not track dependencies between processes explicitly, but uses processes only to implicitly capture dependencies between files. We claim that our system can still reconstruct all file-to-file dependencies that BackTracker can capture, in spite of storing a smaller subset of the information that BackTracker stores. For example, suppose process P reads file A and later writes to process Q through a pipe. When process Q writes to file B, we record a dependency A → B, since Q has indirectly read data from A (through process P). This dependency is captured as follows:

• When P reads A, the system adds the file A to P's active file dependencies.

• When P writes to Q through a pipe, the active file dependencies of P are copied into the active file dependencies of Q.

• When Q writes B, the dependency A → B is recorded in the provenance of B.

The A → B dependency is thus recovered without explicitly tracking P → Q or any other dependency between processes that BackTracker captures.
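As an illustration of the fork and pipe handling, the sketch below simply copies and merges per-process dependency tables; active_deps is the hypothetical table from the earlier sketches, and pipe_deps stands in for the pointer stored in the kernel pipe structure.

    # Dependency propagation across fork and pipes (Section 2.6 sketch).
    import copy

    active_deps = {}   # pid -> {inode -> ActiveDep}, as in the earlier sketches
    pipe_deps = {}     # pipe id -> snapshot of the writer's active file dependencies

    def on_fork(parent_pid, child_pid):
        # the child starts with a copy of the parent's active file dependencies
        active_deps[child_pid] = copy.deepcopy(active_deps.get(parent_pid, {}))

    def on_pipe_write(pid, pipe_id):
        # remember what the writer had read when its data entered the pipe
        pipe_deps[pipe_id] = copy.deepcopy(active_deps.get(pid, {}))

    def on_pipe_read(pid, pipe_id):
        # the reader inherits those dependencies and drops them from the pipe,
        # so its next write records them against the written file (the A -> B case)
        active_deps.setdefault(pid, {}).update(pipe_deps.pop(pipe_id, {}))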
3 Implementation

In this section, we discuss the architecture of the DPASS system, the databases, the query application, and how we overcame the challenges in building a DPASS.

3.1 Architecture

The overall architecture of the system is shown in Figure 2. The system is composed of the following components:

• DPASS client
• BDB RPC server

[Figure 2: Distributed PASS Architecture.]

3.1.1 DPASS Client

The DPASS client is the component that is present on every host in the distributed file system. It consists of two components:

• the DPASS stacking file system
• a user level daemon called provd

DPASS stacking file system. A stackable file system is a file system layer placed between the VFS and a lower level native file system. It intercepts VFS operations, enabling us to track data and meta-data before passing them to the lower level file system; in our case, the lower level file system is NFS. Wrapfs, a wrapper stacking file system generated from FiST [10], was used as the starting point for building a stacking file system for our needs. The DPASS stacking file system intercepts file system operations, runs the provenance tracking algorithm, and updates the active file dependencies of the process. If the provenance tracking algorithm decides that a dependency should be recorded in the database, the DPASS stacking file system sends this information to provd via a netlink socket.

provd. provd is a user level daemon that collects the provenance sent by the DPASS stacking file system and stores it in the database server. On receiving a record from the DPASS stacking file system, provd looks up the p-node number corresponding to the i-node number in the record, and stores the record in the database.

3.1.2 BDB RPC server

Berkeley DB (BDB) is an embedded database [1]; the BDB RPC server provides an RPC interface to the Berkeley DB API. The provd daemon running on each client persistently stores provenance by executing the appropriate BDB API calls. The BDB API calls made by provd are converted to the appropriate RPC calls by the BDB library, thus transmitting the data to the BDB server.

3.2 The Databases

The three primary databases are listed below:

• i-node → p-node map
• provenance database
• p-node → name map

All the clients operate on the same databases and share them: updates made by provd on one client are accessible to provd on another client, and provenance generated by one provd can be used by another while recording new provenance.

Every file is assigned a unique p-node number when it is created. The p-node number of a file, as described in Section 2.3, is used as the key to store and retrieve the provenance of the file from the provenance database. The mapping from i-node numbers to p-node numbers is required because the provenance records that provd receives from the DPASS stacking file system contain only i-node numbers; provd looks up the p-node number in the database and uses it as the key for storing the record. The mapping from p-node numbers to filenames is useful for displaying provenance information in a more readable format.

The p-node number of a file is always unique. When a file is deleted, the associated provenance data is not deleted and the p-node number is not recycled, unlike the i-node number. To see why this property is necessary for tracking the provenance of a file, consider the following example: a process P1 reads from a file A1 and writes to a file A2. Then, a process P2 reads from A2 and writes to a file A3. Clearly, even if A2 is deleted, the provenance records of A2 need to be kept in order to recover the dependence of A3 on A1. Since the p-node number of A2 is used as the key for all these records, the same p-node number cannot be assigned to a new file.

When a file is created, the provd on the client that created the file allocates a new p-node number for the file and updates the (i-node number → p-node number) and (p-node number → filename) mappings. When a file is unlinked, the provd on the client that unlinked the file removes the (i-node number → p-node number) record. When a file is renamed, the (p-node number → filename) and (filename → p-node number) mappings are updated. Apart from the primary databases, provd also maintains the secondary indices. A sketch of this bookkeeping follows.
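A rough model of this bookkeeping in provd is shown below; the dictionaries stand in for the shared Berkeley DB databases accessed over RPC, and the counter stands in for whatever allocation scheme the real daemon uses for p-node numbers.

    # Sketch of provd's p-node bookkeeping.
    import itertools

    _next_pnode = itertools.count(1)
    inode_to_pnode = {}   # i-node -> p-node; entries are removed on unlink
    pnode_to_name  = {}   # p-node -> filename; entries are kept, p-nodes are never recycled

    def on_create(inode, name):
        pnode = next(_next_pnode)          # p-node numbers are never reused
        inode_to_pnode[inode] = pnode
        pnode_to_name[pnode] = name
        return pnode

    def on_unlink(inode):
        # only the i-node mapping goes away; provenance keyed by the p-node is kept
        inode_to_pnode.pop(inode, None)

    def on_rename(inode, new_name):
        # a reverse (filename -> p-node) index would be updated the same way
        pnode_to_name[inode_to_pnode[inode]] = new_name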
3.3 Querying the Provenance

We built a query application (depicted as part of the client in Figure 2) that implements the tree building algorithms described in Section 2.4. It interacts directly with the BDB RPC server using the BDB API, and uses the (i-node → p-node) mapping and the provenance database to construct the provenance tree of a given file.

3.4 Overcoming the Challenges

We summarize how we overcame the challenges in building our DPASS below:

• Automatic Provenance Generation: The DPASS stacking file system intercepts file system operations and runs the provenance tracking algorithm to generate provenance records, which are eventually stored in the database. Moreover, it does not require designing a new file system or modifying an existing one.

• Transporting Provenance: The DPASS stacking file system sends the provenance to provd via a netlink socket, and provd sends it to the BDB RPC server using BDB API calls. Provenance is thus transported over the network without designing a new protocol or modifying an existing one.

• Storing minimal required information: The provenance tracking algorithm described in Section 2.2 ensures that DPASS stores only a minimal required set of dependencies.

• Querying Provenance Efficiently: The use of BDB databases, together with the timestamps generated by the provenance tracking algorithm (Section 2.2), enables us to easily determine the relevant subset of provenance records needed to build a provenance tree. A simple log, on the other hand, requires a sequential scan of the entire log starting from the last record.

4 Evaluation

We evaluated the performance of our system on two machines: one was configured as an NFS server with file system operations being synced, and the other was configured to run the NFS client with the DPASS stacking file system. The Berkeley DB RPC server was configured to run on the same machine as the NFS server. The server is a 3GHz Pentium 4 machine with 512MB of RAM and a Maxtor 6Y080M0 80GB Serial ATA 7200RPM hard disk, running Fedora Core 3 with a Linux 2.6.11-1.14 FC3 kernel. The client is a 500MHz Pentium 3 machine with 756MB of RAM running RedHat 7.3, with a Linux 2.4.29 kernel. The Linux kernel on the client has a single line patch to store a pointer to the active file dependencies of the process. We set the receive buffer size for sockets to 16MB at the client. In all our evaluations, to ensure a cold cache, we unmounted the file systems on which the experiments took place between each run of a test.
We recorded elapsed, system, and user times, and the amount of disk space utilized for recording provenance. We also recorded the wait time for all tests; wait time is mostly I/O time, but other factors like scheduling time can also affect it. Wait time is computed as the difference between the elapsed time and the sum of system and user times. We ran each experiment at least 4 times; for each of our results, the standard deviation was less than 5%. We do not discuss user time in the results, as the DPASS stacking file system runs in the kernel and hence the user time remains unaffected.

[Figure 3: Overhead for CGR workload and Am-Utils compile. The first half of the graph is the CGR workload result and uses the left scale. The second half of the graph is the Am-Utils compile result and uses the right scale.]

4.1 Workloads

We ran two benchmarks on our system: a real workload from the Bauer Center for Genomics Research (CGR), Harvard University, and a CPU-intensive benchmark.

The first workload, from CGR, takes 2 input files and produces 1 result file at the end. Each of the 2 input files contains protein sequences from a different species of bacteria. The output file contains a list of proteins in the two species that may be related to each other evolutionarily. The workload consists of a series of commands, each producing output files that are used as input to the next command. Starting from the 2 input files and 1 configuration file, 15 more files are produced, including the 1 result file. The scientists at CGR would find DPASS useful to easily "recollect" the input files from which an output was derived, two months after the fact.

The second workload was a build of Am-Utils [6]. We used Am-Utils 6.0.9: it contains over 60,000 lines of C code in 430 files. The build process begins by running several hundred small configuration tests to detect system features. It then builds a shared library, ten binaries, four scripts, and documentation: a total of 152 new files and 19 new directories. Though the Am-Utils compile is CPU intensive, it contains a fair mix of file system operations. This workload demonstrates the performance impact a user sees when using DPASS under a normal workload.

For each workload, we evaluate the performance overhead due to DPASS, the space overhead required to store provenance, and the reduction in the number of recorded dependencies due to the improved provenance tracking algorithm.

4.1.1 Configurations

We used the following configurations on the client machine for evaluation:

• NFS: the client machine running the NFS client without the DPASS stacking file system or provd.

• DPASS: the client machine with provenance tracking enabled, i.e., the NFS client with the DPASS stacking file system and provd.

4.2 Performance Overhead

4.2.1 CGR Workload

The left half of Figure 3 compares the overhead of DPASS with NFS for the CGR workload. The overhead is negligible (less than 1%). The system and wait times for DPASS increase, but the increases are within the standard deviation and can hence be attributed to noise. At any point, there are at most 3 files open, so the provenance tracking algorithm has little effect on the system time. The amount of provenance generated is also very small (see Section 4.3), so the wait time is likewise unchanged. Figure 4 shows the provenance tree for the CGR workload: Mpne.faa and Hinf.faa are the two files containing the protein sequences, .ncbirc is a configuration file, and RHRB.out is the output file.
4.2.2 Am-Utils Compile

The right half of Figure 3 compares the overhead of DPASS with NFS for the Am-Utils compile benchmark. Overall, there is a 6.2% decrease in the elapsed time for DPASS compared to NFS. The decrease can be attributed to the 19.2% decrease in the wait time for DPASS; we believe this decrease in wait time is mainly due to caching at the DPASS stacking file system layer. The system time for DPASS increases by 7.9%, due to the provenance tracking algorithm running in the DPASS stacking file system. The increase in system time is offset by the decrease in the wait time.

[Figure 4: Provenance tree for CGR workload.]

4.3 Space Overhead

Table 2 shows the space overhead due to provenance. The space overhead for the CGR workload is 0.4% and for the Am-Utils compile is 3.3%. Clearly, the amount of space occupied is within admissible limits.

    Benchmark          Data Size   Number of files   Size of Provenance   % Overhead
    CGR Workload       5.7MB       18                24KB                 0.4%
    Am-Utils compile   34.4MB      564               1.1MB                3.3%

Table 2: Space overhead due to provenance.

4.4 Reduction in Dependencies due to the Provenance Tracking Algorithm

Table 1 shows the number of read, write, and mmap system calls for each workload, along with the number of dependencies captured by DPASS. The number of dependencies captured by DPASS is drastically smaller than in systems like BackTracker [5] and the Lineage File System [9], which log every read and write and later build the dependencies from the log. The last column in Table 1 shows the reduction in the number of dependencies due to the provenance tracking algorithm.

    Benchmark          Number of reads   Number of writes   Number of mmap reads   Dependencies generated by DPASS   % Savings
    CGR Workload       251               8,522              18,688                 245                               99.1%
    Am-Utils compile   27,230            70,607             1,040                  6,062                             93.9%

Table 1: Reduction in dependencies due to the provenance tracking algorithm.

The previous section has already shown that the cost of running the provenance tracking algorithm, which reduces redundant dependencies, is very low. In scientific experiments, we expect there to be a small number of large files, implying that a large number of read/write calls are needed to process them. While logging each call would be inefficient, using the provenance tracking algorithm should reduce the storage space required to store provenance and, as a result, the time required for building provenance trees.

In summary, our performance evaluation demonstrates that DPASS has low performance and space overheads, and that our provenance tracking algorithm is effective in reducing dependencies.

5 Related Work

The Lineage File System [9] logs each read/write system call into an SQL database; the user then runs SQL queries directly to retrieve provenance. The disadvantage of this system is that it does not eliminate redundant data.

The Semantic File System (SFS) [4] is another system which uses provenance. The system allows users to access files based on their content. File-type-specific transducers automatically extract attributes (field-value pairs) from files and insert them into an index on file modification. These attributes are used for query-based file retrieval. Queries take the form of virtual directories. For example, to list all files that export the procedure lookup_fault, the user can run ls /sfs/exports:/lookup_fault, which lists the files that export lookup_fault.
Although SFS is similar to DPASS in that it creates indices and provides a queryable interface, it differs from DPASS in that its indices and queries cover the content of files rather than their provenance.

Many Grid and workflow management systems, such as the Metadata Catalog Service (MCS) [8], the Replica Location Service (RLS) [2], Chimera [3], and the Provenance Aware Service Oriented Architecture (PASOA) [7], provide provenance tracking mechanisms for various applications. However, these systems are domain specific and cannot be used elsewhere.

There has been earlier work on tracking the flow of information in a file system to detect intrusions. For example, BackTracker [5] is a system that logs every read and write; beginning with a suspect log record for a file, BackTracker is able to track back and identify the files and processes that affected that file, and to display the chain of events in a dependency graph. Note that BackTracker is limited to a non-distributed system, whereas our system works in a distributed environment. We also take care to avoid redundant dependencies.

6 Conclusions

In this project, we have designed and implemented a Distributed Provenance Aware Storage System that automatically captures and efficiently retrieves the provenance of files in a distributed file system. We have proposed a provenance tracking algorithm that significantly reduces redundant dependencies. Our system also incurs minimal space and processor overheads.

6.1 Future Work

The Berkeley DB RPC server does not support concurrent operations, as it is currently single-threaded. Although this project does not focus on performance, making the Berkeley DB RPC server support concurrent operations would be useful. Our current implementation also does not capture the provenance of input files that exist outside the mount point; designing a system that is capable of capturing provenance from multiple mount points would be useful.

References

[1] Sleepycat Software. http://www.sleepycat.com.

[2] A. L. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, and R. Schwartzkopf. Performance and Scalability of a Replica Location Service. In Proceedings of the International Symposium on High Performance Distributed Computing Conference (HPDC-13), Honolulu, HI, June 2004.

[3] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration. In CIDR, Asilomar, CA, January 2003.

[4] D. Gifford, P. Jouvelot, M. Sheldon, and J. J. O'Toole. Semantic File Systems. In Thirteenth ACM Symposium on Operating Systems Principles, Pacific Grove, CA, October 1991.

[5] S. T. King and P. M. Chen. Backtracking Intrusions. In SOSP, Bolton Landing, New York, October 2003.

[6] J. S. Pendry, N. Williams, and E. Zadok. Am-utils User Manual, 6.1b3 edition, July 2003. www.am-utils.org.

[7] Provenance Aware Service Oriented Architecture. http://twiki.pasoa.ecs.soton.ac.uk/bin/view/PASOA/WebHome.

[8] G. Singh, S. Bharathi, A. Chervenak, E. Deelman, C. Kesselman, M. Manohar, S. Patil, and L. Pearlman. A Metadata Catalog Service for Data Intensive Applications. In Proceedings of the SC2003 Conference, November 2003.

[9] The Lineage File System. http://crypto.stanford.edu/~cao/lineage.html.

[10] E. Zadok and J. Nieh. FiST: A Language for Stackable File Systems. In Proceedings of the Annual USENIX Technical Conference, pages 55–70, San Diego, CA, June 2000.