In this paper we describe the support for data ingestion in AsterixDB, an open-source Big Data Management System (BDMS) that provides a platform for storage and analysis of large volumes of semi-structured data. Data feeds are a new mechanism for having continuous data arrive into a BDMS from external sources and incrementally populate a persisted dataset and its associated indexes. We add a new BDMS architectural component, called a data feed, that makes a Big Data system the caretaker for functionality that used to live outside it, and we show how it improves users' lives and system performance.
We show how to build the data feed component, architecturally, and how an enhanced user model can enable sharing of ingested data. We describe how to make this component fault-tolerant, so that the system manages input in the presence of failures, and how to make it elastic, so that variances in incoming data rates are handled gracefully and without data loss if/when desired. We report results from initial experiments that evaluate the scalability and fault tolerance of the AsterixDB data feeds facility, including an evaluation of the built-in ingestion policies and their effect on throughput and latency. We also include an evaluation of, and comparison with, a "glued together" system built from popular engines: Storm (for streaming) and MongoDB (for persistence).
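The abstract mentions built-in ingestion policies without showing how they behave; the sketch below makes the idea concrete. It is purely illustrative and not the AsterixDB API: the policy names, the FeedCongestionHandler class, and the reaction to congestion (spill, discard, or throttle) are assumptions chosen to match the behaviours the abstract alludes to.

```java
// Hypothetical sketch (not the actual AsterixDB API): how an ingestion policy
// might react when a feed's intake outpaces the storage pipeline.
import java.util.ArrayDeque;
import java.util.Queue;

enum IngestionPolicy { SPILL_TO_DISK, DISCARD, THROTTLE }

class FeedCongestionHandler {
    private final IngestionPolicy policy;
    private final Queue<String> spillBuffer = new ArrayDeque<>(); // stands in for a disk-backed spill area
    private long discarded = 0;

    FeedCongestionHandler(IngestionPolicy policy) {
        this.policy = policy;
    }

    /** Called for each incoming record while the downstream dataset cannot keep up. */
    void onCongestedRecord(String record) throws InterruptedException {
        switch (policy) {
            case SPILL_TO_DISK:
                spillBuffer.add(record);   // keep the record for later replay: no data loss
                break;
            case DISCARD:
                discarded++;               // sacrifice some data to preserve latency
                break;
            case THROTTLE:
                Thread.sleep(10);          // slow the intake so the pipeline can drain
                break;
        }
    }
}
```

Under this reading, the throughput-versus-latency trade-off evaluated in the experiments corresponds to which branch a policy takes under load.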
A growing wealth of digital information is being generated on a daily basis in social networks, blogs, online communities, etc. Organizations and researchers in a wide variety of domains recognize that there is tremendous value and insight to be gained by warehousing this emerging data and making it available for querying, analysis, and other purposes. This new breed of "Big ...
Data analysts today want to grab every bit of data and extract useful information from it. The collected data may scale to tera- or even petabytes. Sampling has been established as an effective tool for avoiding the cost of processing all of it. A fixed-size random sample may not suffice, however, as the sampled data is often required to satisfy additional predicates in order for the collected sample to be useful. We refer to this kind of sampling as "predicate-based" sampling; it is a widely occurring pattern at Facebook. We want to produce such samples from large-scale data in a manner such that the response time is independent of the size of the input dataset, which allows the desired samples to be produced from increasingly large inputs. Predicate-based sampling can be expressed as a Map-Reduce job, but Hadoop, as a Map-Reduce implementation, executes it inefficiently because it assumes that all input must be processed for a job to produce the required result. Predicate-based sampling belongs to a class of jobs that can potentially produce the required result by processing only part of the input. We present an extension of the Map-Reduce execution model (as implemented in Hadoop) that allows incremental processing, wherein input is added dynamically to a running job in accordance with the need and the load on the cluster. The extended model allows us to produce predicate-based samples from increasingly large quantities of data with a response time that is independent of the size of the input.
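As a simplified illustration of expressing predicate-based sampling as a Map-Reduce job, the Hadoop mapper below emits only records that satisfy a predicate and stops contributing once a per-mapper quota is met. The predicate (lines containing "error"), the quota, and the counter name are assumptions made for this sketch; the add-input-on-demand execution described above is the paper's extension and is not shown here.

```java
// Minimal sketch of a predicate-based sampling mapper for Hadoop.
// Note: stock Hadoop will still schedule every input split, which is exactly
// the inefficiency the incremental execution model is meant to remove.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PredicateSampleMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private static final int PER_MAPPER_QUOTA = 1000; // assumed sample quota per task
    private int emitted = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Predicate: keep only records that satisfy the analyst's condition.
        if (emitted < PER_MAPPER_QUOTA && line.toString().contains("error")) {
            context.write(NullWritable.get(), line);
            emitted++;
            // Expose progress so a scheduler could, in principle, stop adding
            // input once enough qualifying records have been collected.
            context.getCounter("sampling", "qualifying-records").increment(1);
        }
    }
}
```

With stock Hadoop, all splits are read even after every mapper has filled its quota; the extension sketched in the abstract instead feeds splits to the running job only as needed, which is what makes the response time independent of the total input size.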