
    Meichun Hsu

ABSTRACT The current generation of stream processing systems is generally built separately from the query engine; it therefore lacks the expressive power of SQL and incurs significant overhead in data access and movement. This situation has motivated us to leverage the query engine for stream processing. Stream join is a window operation where the key issue is how to punctuate and pair two or more correlated streams. In this work we tackle this issue in the specific context of query-engine-supported stream processing. We focus on the following problems: a SQL query is definable on bounded relation data, but stream data are unbounded; joining multiple streams is a stateful (thus history-sensitive) operation, but a SQL query only cares about the current state; further, a relation join typically requires re-scanning a relation in a nested loop, but by nature a stream cannot be re-captured, as reading a stream always yields newly incoming data. To leverage query processing for analyzing unbounded streams, we defined the Epoch-based Continuous Query (ECQ) model, which allows a SQL query to be executed epoch by epoch, processing the stream data chunk by chunk. However, unlike multiple one-time queries, an ECQ is a single, continuous query instance across execution epochs, which preserves the continuity of the application state as required by history-sensitive operations such as sliding-window join. To join multiple streams, we further developed techniques to cache one or more consecutive data chunks falling in a sliding window across query execution epochs in the ECQ instance, allowing them to be re-delivered from the cache. In this way, joining multiple streams and self-joining a single stream over a chunk-based window or sliding window, with various pairing schemes, are made possible. We extended the PostgreSQL engine to support the proposed approach. Our experience has demonstrated its value.
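The chunk-cache mechanism this abstract describes can be illustrated with a small sketch. The Python fragment below (hypothetical names; the actual work extends PostgreSQL) mimics a single long-lived ECQ instance that consumes two streams chunk by chunk and caches the last few chunks, so the sliding-window join re-delivers them from the cache instead of re-scanning the streams:

```python
# A minimal sketch (not the paper's PostgreSQL implementation) of the
# Epoch-based Continuous Query idea: one long-lived query instance that
# processes streams chunk by chunk, caching the last `window` chunks of
# each input so a sliding-window join can re-deliver them across epochs.
from collections import deque

def ecq_window_join(stream_a, stream_b, key, window=3):
    """Join two chunked streams epoch by epoch over a sliding window.

    stream_a / stream_b: iterables of chunks (lists of dict tuples);
    one chunk of each arrives per epoch. `key` is the join attribute.
    """
    cache_a, cache_b = deque(maxlen=window), deque(maxlen=window)
    for chunk_a, chunk_b in zip(stream_a, stream_b):   # one epoch per pair
        cache_a.append(chunk_a)
        cache_b.append(chunk_b)
        # Re-deliver cached chunks instead of re-scanning the stream:
        index_b = {}
        for chunk in cache_b:
            for t in chunk:
                index_b.setdefault(t[key], []).append(t)
        for t in chunk_a:                              # join current chunk of A
            for u in index_b.get(t[key], []):
                yield {**t, **{f"b_{k}": v for k, v in u.items() if k != key}}
```

Because the deques persist across loop iterations, the state survives epoch boundaries exactly as a single continuous query instance would, unlike repeatedly issued one-time queries.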
... Machine, SVM, model for classifying video frames to concepts) for multilevel, multidimensional feature ... and analysis, our OpBI video platform supports in-DB, multi-level, multi-dimensional ... contexts, models and features (Fig. 6), yielding model-based classification expressed by ...
ABSTRACT Scaling out data-intensive analytics is generally achieved by means of parallel computation, for gaining CPU bandwidth, and incremental computation, for balancing workload. Combining these two mechanisms is the key to supporting large-scale stream analytics. Map-Reduce (M-R) is a programming model for supporting parallel computation over vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of data-intensive applications. In-DB M-R allows these functions to be embedded within standard queries to exploit SQL's expressive power, and allows them to be executed by the query engine with fast data access and reduced data movement. However, when the data form infinite streams, the semantics and scale-out capability of M-R are challenged. To solve this problem, we propose to integrate M-R with the continuous query model characterized by Cut-Rewind (C-R): cut a query execution based on some granule of the stream data and then rewind the state of the query without shutting it down, in order to process the next chunk of stream data. This approach allows an M-R query with full SQL expressive power to be applied to dynamic stream data chunk by chunk for continuous, window-based stream analytics. Our experience shows that integrating M-R and C-R provides a powerful combination for parallelized and granulized stream processing, enabling us to scale out stream analytics "horizontally" based on the M-R model and "vertically" based on the C-R model. The proposed approach has been prototyped on a commercial, proprietary parallel database engine. Our preliminary experiments reveal the merit of using a query engine for near-real-time parallel and incremental stream analytics.
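As a rough illustration of the Cut-Rewind idea, the sketch below (plain Python with hypothetical names; the actual prototype is a parallel SQL engine) cuts execution at each stream granule, emits the per-chunk map/reduce result, and discards only the per-chunk state while the query instance itself keeps running:

```python
# A hedged sketch of the Cut-Rewind (C-R) loop: the "query" (here a
# map/reduce pair) runs continuously; execution is cut at each granule
# boundary of the stream, results are emitted, and the per-chunk state is
# rewound while long-lived state (the open stream, the loop itself)
# survives, mimicking a single query instance that never shuts down.
from itertools import groupby

def cut_rewind(stream, granule_of, map_fn, reduce_fn):
    """stream: iterable of tuples ordered by granule (e.g., by minute).
    granule_of(t) defines the cut boundary; map_fn(t) -> (key, value);
    reduce_fn(key, values) -> per-key result."""
    for granule, chunk in groupby(stream, key=granule_of):   # CUT per granule
        groups = {}
        for t in chunk:
            k, v = map_fn(t)
            groups.setdefault(k, []).append(v)
        yield granule, {k: reduce_fn(k, vs) for k, vs in groups.items()}
        # REWIND: the per-chunk hash state (`groups`) is discarded here,
        # but the loop (the query instance) keeps running.
```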
[Figure residue from "Data Stream Analytics as Cloud Service for Mobile Applications": an hourly rotation scheme over tables T1–T9 in which, at hour h, one table is loaded while the tables for hours h through h-8 serve queries and the oldest is archived; a query generator issues retrieve requests against the table indices at hour h+1.]
ABSTRACT As cloud services become popular, how an enterprise application can efficiently consume a cloud service, as the client of that service either on a device or on the application tier of the enterprise software stack, is an important issue. Focusing on the consumption of real-time event services, in this work we extend the Data Access Object (DAO) pattern of enterprise applications for on-demand access and analysis of real-time events. We introduce the notion of an Operational Event Pipe for caching the most recent events delivered by an event service, and an on-demand data analysis pattern based on this notion. We implement the operational event pipe as a special kind of continuous query referred to as an Event Pipe Query (EPQ). An EPQ is a long-standing SQL query with User Defined Functions (UDFs) that provides a pipe for the stream data to be buffered and to flow continuously within the boundary of a sliding window. When not requested, the EPQ just maintains and updates the buffer but returns nothing; once requested, it returns the query processing results on the selected part of the sliding-window buffer, under the request-and-rewind mechanism. Integrating event buffering and analysis in a single continuous query leverages SQL's expressive power and the query engine's data processing capability, and reduces data movement overhead. By extending the PostgreSQL query engine, we implement this operation pattern as the Continuous Data Access Object (CDAO), an extension to the J2EE DAO. While DAO provides static data access interfaces, CDAO adds dynamic event processing interfaces with one or more EPQs.
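A minimal sketch of the EPQ behavior described above, using a simple in-memory stand-in (the real EPQ is a long-standing SQL query with UDFs inside PostgreSQL; the class and method names here are hypothetical):

```python
# Sketch of an Operational Event Pipe: the pipe continuously maintains a
# sliding-window buffer of recent events and returns nothing until a
# client request arrives, at which point the query logic runs over the
# selected part of the buffer ("request-and-rewind" analogue).
import time
from collections import deque

class EventPipe:
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.buffer = deque()          # (timestamp, event) pairs

    def feed(self, event, ts=None):
        """Called continuously as events stream in; only maintains the buffer."""
        ts = time.time() if ts is None else ts
        self.buffer.append((ts, event))
        while self.buffer and self.buffer[0][0] < ts - self.window:
            self.buffer.popleft()      # evict events outside the sliding window

    def on_demand(self, query_fn, last_seconds=None):
        """Only when requested: evaluate query_fn over the selected window part."""
        if not self.buffer:
            return None
        horizon = self.buffer[-1][0] - (last_seconds or self.window)
        return query_fn([e for ts, e in self.buffer if ts >= horizon])
```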
Most conventional video processing platforms treat the database merely as a storage engine rather than a computation engine, which causes inefficient data access and a massive amount of data movement. Motivated by providing a convergent platform, we push video processing down to the database engine using User Defined Functions (UDFs). However, the existing UDF technology suffers from two major limitations. First, a
ABSTRACT Many enterprise applications are based on continuous analytics of data streams. Integrating data-intensive stream processing with query processing allows us to take advantage of SQL's expressive power and DBMS's data management capability. However, it also raises serious challenges in dealing with complex dataflow, applying queries to unbounded stream data, and providing a highly scalable, dynamically configurable, elastic infrastructure. In this project we tackle these problems in three dimensions. First, we model general graph-structured, continuous dataflow analytics as a SQL Streaming Process with multiple connected and stationed continuous queries. Next, we extend the query engine to support cycle-based query execution for processing unbounded stream data in bounded chunks with sound semantics. Finally, we develop the Query Engine Grid (QE-Grid) over Distributed Caching Platforms (DCP) as a dynamically configurable, elastic infrastructure for parallel and distributed execution of SQL Streaming Processes. The proposed infrastructure is preliminarily implemented using PostgreSQL engines. Our experience shows its merit in leveraging SQL and query engines to analyze real-time, graph-structured, unbounded streams. Integration with a commercial, proprietary MPP-based database cluster is being investigated.
    SFL (pronounced as Sea-Flow) is an analytics system that supports a declarative language that extends SQL for specifying the dataflow of data-intensive analytics. The extended SQL language is motivated by providing a top-level representation of the converged platform for analytics and data management. Due to fast data access and reduced data transfer, such convergence has become the key to speed
The massively growing data volume and the pressing need for low latency are pushing the traditional store-first-query-later data warehousing technologies beyond their limits. Many enterprise applications are now based on continuous analytics of data streams. While integrating stream processing with query processing takes advantage of SQL's expressive power and DBMS's data management capability, it raises serious challenges in dealing with complex dataflow, applying queries to unbounded stream data, and providing a highly scalable, dynamically configurable, elastic infrastructure. To solve these problems, we model general graph-structured, continuous dataflow analytics as a SQL Streaming Process with multiple connected and stationed continuous queries; then we extend the query engine to support cycle-based query execution for processing unbounded stream data chunk-wise with sound semantics; and finally, we develop the Query Engine Net (QE-Net) over the Distributed Caching Platform…
With the booming of microblogs on the Web, people have begun to express their opinions on a wide variety of topics on Twitter and similar services. Sentiment analysis on entities (e.g., products, organizations, people) in tweets (posts on Twitter) thus becomes a rapid and effective way of gauging public opinion for business marketing or social studies. However, Twitter's unique characteristics give rise to new problems for current sentiment analysis methods, which originally focused on large opinionated corpora such as product reviews. In this paper, we propose a new entity-level sentiment analysis method for Twitter. The method first adopts a lexicon-based approach to perform entity-level sentiment analysis. This approach gives high precision but low recall. To improve recall, additional tweets that are likely to be opinionated are identified automatically by exploiting the information in the result of the lexicon-based method. A classifier is then trained to assign…
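The two-stage scheme can be sketched as follows, assuming toy lexicons and scikit-learn as an illustrative classifier (the paper's actual lexicons, features, and model are not reproduced here):

```python
# A hedged sketch of the two-stage idea: a lexicon pass gives high-precision
# labels; those labels then bootstrap a classifier that recovers opinionated
# tweets the lexicon misses (improving recall). Lexicons are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

POS, NEG = {"good", "great", "love"}, {"bad", "terrible", "hate"}

def lexicon_label(tweet):
    words = set(tweet.lower().split())
    p, n = len(words & POS), len(words & NEG)
    if p > n: return 1
    if n > p: return 0
    return None                      # lexicon abstains: source of low recall

def bootstrap(tweets):
    # Assumes the lexicon pass yields examples of both polarities.
    seed = [(t, y) for t in tweets if (y := lexicon_label(t)) is not None]
    rest = [t for t in tweets if lexicon_label(t) is None]
    vec = TfidfVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform([t for t, _ in seed])
    clf = LogisticRegression().fit(X, [y for _, y in seed])
    # The trained classifier labels tweets the lexicon could not:
    preds = clf.predict(vec.transform(rest)) if rest else []
    return dict(zip(rest, map(int, preds)))
```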
    A technical trend in supporting large scale scientific applications is converging data intensive computation and data management for fast data access and reduced data flow. In a combined cluster platform, co-locating computation and data is the key to efficiency and scalability; and to make it happen, data must be partitioned in a way consistent with the computation model. However, with
The performance of intra-node parallel dataflow programs in the context of streaming systems depends mainly on two parameters: the degree of parallelism for each node of the dataflow program and the batching size for each node. In state-of-the-art systems the user has to specify those values manually, and manual tuning of both parameters is necessary to get good performance. However, this process is difficult and time-consuming, even for experts. In this paper we introduce an optimization algorithm that optimizes both parameters automatically. We define a novel cost model for intra-node parallel dataflow programs with user-defined functions. Furthermore, we introduce different batching schemes to reduce the number of output buffers, i.e., main-memory consumption. We implemented our approach on top of the open-source system Storm and ran experiments with different workloads. Our results show a throughput improvement of more than one order of magnitude while the optim...
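The automatic tuning of the two parameters can be sketched as a greedy coordinate-descent search over per-node (parallelism, batch size) assignments against a pluggable cost model; the function below is illustrative, not the paper's algorithm:

```python
# An illustrative sketch of tuning the two knobs the paper identifies,
# degree of parallelism and batch size per dataflow node, by greedy
# search against a pluggable cost model instead of manual tuning.
def optimize(nodes, cost_model, dops=range(1, 17), batches=(1, 10, 100, 1000)):
    """nodes: list of node ids; cost_model(config) -> estimated cost
    (e.g., inverse throughput) for a {node: (dop, batch)} assignment."""
    config = {n: (1, 1) for n in nodes}                  # start minimal
    improved = True
    while improved:
        improved = False
        for n in nodes:                                  # coordinate descent
            best = min(((d, b) for d in dops for b in batches),
                       key=lambda c: cost_model({**config, n: c}))
            if cost_model({**config, n: best}) < cost_model(config):
                config[n], improved = best, True
    return config
```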
ABSTRACT As stream analytics comes to be offered as a kind of cloud service, there is a pressing need for reliability and fault tolerance. In a streaming process, parallel and distributed tasks are chained in a graph structure, with each task transforming a stream into a new stream. The transaction property guarantees that the streaming data, called tuples, are processed in the order of their generation along every dataflow path, with each tuple processed once and only once. The failure recovery of a task allows previously produced results to be corrected for eventual consistency, which differs from the instant consistency of global state enforced by failure recovery in general distributed systems, and therefore presents new technical challenges. Transactional stream processing typically requires every task to checkpoint its execution state and, when restored from a failure, to recover the last state from the checkpoint and to re-acquire and process the missing tuples. Currently there exist two kinds of approaches: one treats the whole process as a single transaction and therefore suffers from the loss of intermediate results during failures; the other relies on the receipt of acknowledgements (ACKs) to decide, on a per-tuple basis, whether to move forward and emit the next resulting tuple or to resend the current one after a timeout, thus incurring an extremely high latency penalty. In contradistinction to the above, we propose a backtrack mechanism for failure recovery, which allows a task to process tuples continuously, without waiting for ACKs and without resending tuples in the failure-free case, and to request (ASK) the source tasks to resend the missing tuples only when it is restored from a failure; since failures are rare, this has limited impact on overall performance. We have implemented the proposed mechanisms on Fontainebleau, the distributed stream analytics infrastructure we developed on top of Storm. As a principle, we ensure that all the transactional properties are system-supported and transparent to users. Our experience shows that the ASK-based recovery mechanism significantly outperforms the ACK-based one.
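The contrast between ACK-based and ASK-based recovery can be sketched as follows: each task keeps a bounded log of its recent outputs by sequence number and emits without waiting, and only a recovering downstream task asks its sources to replay from its checkpoint (a Python stand-in with hypothetical names; the real implementation lives in Fontainebleau on Storm):

```python
# A hedged sketch of ASK (backtrack) recovery: in the failure-free path a
# task emits continuously with no per-tuple ACKs; each task logs outputs
# by sequence number so a restarted downstream task can ask its sources
# to replay everything after its checkpointed sequence number.
from collections import OrderedDict

class Task:
    def __init__(self, fn, log_size=10_000):
        self.fn, self.seq, self.checkpoint = fn, 0, 0
        self.log_size = log_size
        self.out_log = OrderedDict()            # seq -> output tuple

    def process(self, tup):
        """Failure-free path: transform and emit, never wait for an ACK."""
        self.seq += 1
        out = self.fn(tup)
        self.out_log[self.seq] = out
        if len(self.out_log) > self.log_size:
            self.out_log.popitem(last=False)    # trim the oldest logged output
        return self.seq, out

    def save_checkpoint(self):
        self.checkpoint = self.seq              # persisted in a real system

    def ask(self, from_seq):
        """Called by a recovering downstream task: replay the missing tuples."""
        return [(s, o) for s, o in self.out_log.items() if s > from_seq]

def recover(downstream_checkpoint, source_task):
    # The restored task backtracks: it requests only the tuples it missed,
    # rather than the sources resending speculatively on every timeout.
    return source_task.ask(downstream_checkpoint)
```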
ABSTRACT We study a novel information retrieval problem where the query is a time series for a given time period, and the retrieval task is to find relevant documents in a text collection of the same time period that contain topics correlated with the query time series. This retrieval problem arises in many text mining applications where there is a need to analyze text data in order to discover potentially causal topics. To solve this problem, we propose and study multiple retrieval algorithms that share the general idea of ranking text documents based on how well their terms are correlated with the query time series. Experimental results show that the proposed retrieval algorithms can effectively help users find documents that are relevant to time series queries, which in turn helps users analyze the variation patterns of the time series.
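The core ranking idea, scoring a document by how well its terms' time series correlate with the query series, can be sketched as below; Pearson correlation and the best-term aggregation are illustrative choices, not necessarily the paper's exact measures:

```python
# A minimal sketch of the retrieval idea: score each document by how well
# the time series of its terms track the query time series over the same
# period; rank documents by their best-correlated term.
import numpy as np

def rank_documents(query_series, docs, term_series):
    """query_series: np.array over T time points.
    docs: {doc_id: set of terms}; term_series: {term: np.array over T}."""
    def pearson(x, y):
        x, y = x - x.mean(), y - y.mean()
        denom = np.sqrt((x * x).sum() * (y * y).sum())
        return (x * y).sum() / denom if denom else 0.0
    scores = {}
    for doc_id, terms in docs.items():
        corrs = [pearson(query_series, term_series[t])
                 for t in terms if t in term_series]
        scores[doc_id] = max(corrs, default=0.0)   # best-correlated term wins
    return sorted(scores.items(), key=lambda kv: -kv[1])
```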
    ABSTRACT Many applications require analyzing textual topics in conjunction with external time series variables such as stock prices. We develop a novel general text mining framework for discovering such causal topics from text. Our framework naturally combines any given probabilistic topic model with time-series causal analysis to discover topics that are both coherent semantically and correlated with time series data. We iteratively refine topics, increasing the correlation of discovered topics with the time series. Time series data provides feedback at each iteration by imposing prior distributions on parameters. Experimental results show that the proposed framework is effective.
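The iterative feedback loop can be sketched schematically; all callables below are placeholders, reflecting that the framework is designed to combine with any given probabilistic topic model rather than one fixed implementation:

```python
# A schematic sketch of the iterative framework: a topic model is
# alternated with time-series correlation analysis; at each iteration,
# terms from well-correlated topics are promoted into a prior that
# steers the next round of topic modeling (the paper's feedback idea).
def causal_topic_mining(texts, series, fit_topics, topic_series,
                        correlate, make_prior, iters=5):
    """fit_topics(texts, prior) -> topics (term distributions);
    topic_series(topic, texts) -> that topic's coverage time series;
    correlate(ts, series) -> correlation score in [-1, 1];
    make_prior(topics, scores) -> word prior for the next iteration."""
    prior = None
    for _ in range(iters):
        topics = fit_topics(texts, prior)
        scores = [correlate(topic_series(t, texts), series) for t in topics]
        prior = make_prior(topics, scores)   # feedback via prior distributions
    return sorted(zip(topics, scores), key=lambda ts: -abs(ts[1]))
```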
To effectively handle the scale of processing required by information extraction and analytical tasks in an era of information explosion, partitioning the data streams and applying computation to each partition in parallel is the key. Even though the concept of MapReduce has been around for some time and is well known in the functional programming literature, it was Google that demonstrated that this very high-level abstraction is especially suitable for data-intensive computation and admits very high-performance implementations as well. Observing the behavior of a query plan on a modern shared-nothing parallel database system such as Teradata or HP NeoView, one notices that it also offers large-scale parallel processing while maintaining the high-level abstraction of a declarative query language. The correspondence between the MapReduce parallel processing paradigm and the paradigm of parallel query processing has been observed. In addition to integrated schema management and a declarative query language, the strengths of parallel SQL engines also include workload management, richer expressive power, and richer parallel processing patterns. Compared to the MapReduce paradigm, however, the parallel query processing paradigm has focused on native, built-in, algebraic query operators supported in the SQL language, and parallel query processing engines lack the ability to efficiently handle dynamically defined procedures. While the "user-defined function" in SQL can be used to inject dynamically defined procedures, standard SQL's flexibility in invoking them, and the efficiency of their implementation, especially in a highly scaled-out architecture, are not adequate. This paper discusses issues and approaches in integrating large-scale information extraction and analytical tasks with parallel data management.
ABSTRACT In this work we focus on using a Distributed Caching Platform (DCP) to scale out database applications and to support relational data communication among multiple individual query engines in a general graph-structured SQL dataflow process. While the use of DCP has gained popularity lately, transferring query results from one query engine to another tuple by tuple through DCP is often inefficient: the granularity of cache access is too small, and the overhead of data conversion and interpretation is too large. To deal with these issues, we leverage DCP's binary protocol and the query engine's buffer management to deliver query results directly at the storage level. We extend the database buffer pool over multiple memory nodes to enable low-latency access to large volumes of data, and introduce a novel page-feed mechanism that allows the query results of collaborating query engines to be communicated as data pages (blocks): the producer query puts its result relation as pages in the DCP, to be fetched by the consumer query. In this way, data are transferred as pages directly under DCP's binary protocol, the contained tuples are exactly in the format required by the relational operators, and the use of pages, as mini-batches of tuples, balances the efficiency of query processing and DCP access. Pushing relational data communication down from the application level to the storage (buffer pool) level offers significant performance gains and is naturally consistent with SQL semantics. We have implemented these mechanisms on a cluster of PostgreSQL engines. Our experimental results are documented in this paper.
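A toy rendering of the page-feed mechanism, using a Python dict and pickle as stand-ins for the DCP's binary key-value store and the engine's page format (the real mechanism operates on actual buffer-pool pages under the DCP binary protocol):

```python
# A hedged sketch of page-feed: instead of shipping tuples one by one
# through the cache, the producer engine writes whole result pages
# (mini-batches of tuples) under keys the consumer engine reads back.
import pickle

dcp = {}                                       # stand-in for the DCP store

def put_pages(relation, tuples, page_size=100):
    """Producer side: pack result tuples into pages and publish them."""
    pages = [tuples[i:i + page_size] for i in range(0, len(tuples), page_size)]
    for no, page in enumerate(pages):
        dcp[f"{relation}:page:{no}"] = pickle.dumps(page)  # binary page image
    dcp[f"{relation}:npages"] = len(pages)

def get_pages(relation):
    """Consumer side: stream the relation back page by page, not tuple by tuple."""
    for no in range(dcp[f"{relation}:npages"]):
        for tup in pickle.loads(dcp[f"{relation}:page:{no}"]):
            yield tup
```

The page, rather than the tuple, as the unit of cache access is what amortizes the conversion and round-trip overhead the abstract identifies.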
... and one-to-one security policy agreement in a many-to-many collaboration environment. ... In a multiple-hop environment, where to perform security actions, e.g., building credentials for ... The ICD approach allows instant establishment of security policies for each business transaction ...
ABSTRACT In the era of information explosion, huge amounts of data are continuously generated by various sensing devices; these data are often too low-level for analytics purposes and too massive to load into data warehouses for filtering and summarizing with reasonable latency. Distributed stream analytics for multilevel abstraction is the key to solving this problem. We advocate a distributed infrastructure for CDR (Call Detail Record) stream analytics in the telecommunication network, where stream processing is integrated into the database engine and carried out as continuous querying, and the computation model is based on a network-distributed (rather than clustered) Map-Reduce scheme. We propose a window-based cooperation mechanism for keeping multiple engines synchronized and cooperating on the data falling within a common window boundary, based on time, cardinality, etc. This mechanism allows the engines to cooperate window by window without centralized coordination. We further propose a quantization mechanism that integrates the discretization and abstraction of continuous-valued data, for efficient and incremental data reduction and, in turn, reduction of network data movement. These mechanisms play key roles in scaling out CDR stream analytics. The proposed approach has been integrated into the PostgreSQL engine. Our preliminary experiments reveal its merit for large-scale distributed stream processing.
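Both mechanisms admit a compact sketch: window ids derived from the data itself give coordination-free synchronization, and quantization reduces each window before it crosses the network (illustrative Python with hypothetical names; the real mechanisms live inside the PostgreSQL engine):

```python
# Sketch of the two mechanisms: (1) every engine derives the same window
# id from the data itself, so engines cooperate window by window with no
# coordinator; (2) quantization discretizes continuous CDR values so each
# window can be reduced before it is moved across the network.
def window_id(ts, width_seconds=60):
    """Same input data -> same window id on every engine: implicit sync."""
    return int(ts // width_seconds)

def quantize(value, step=10.0):
    """Map a continuous value (e.g., call duration) to a discrete level."""
    return int(value // step) * step

def reduce_window(cdrs, width_seconds=60, step=10.0):
    """Per-window, per-level counts: the reduced data shipped upstream."""
    out = {}
    for ts, duration in cdrs:
        key = (window_id(ts, width_seconds), quantize(duration, step))
        out[key] = out.get(key, 0) + 1
    return out
```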
    Motivated by automating enterprise information derivation processes, we propose a new kind of business process - Data-Continuous SQL Process (DCSP), which is data-stream driven and continuously running. The basic operators of a DCSP are database User Defined Functions (UDFs). However, we introduce a special kind of UDFs - Relation Valued Functions (RVFs) with both input and return values specified as
... To copy otherwise, or to republish, requires a fee and/or specific permission © 1990 ... A transaction is a unit of work, performed on shared data, which preserves atomicity (i.e., all ... defined and extensively studied in the database and transaction processing (TP) literature [Bernstein et al. ...
Running analytics computation inside a database engine through the use of UDFs (User Defined Functions) has been investigated, but has not yet become a scalable approach due to several technical limitations. One limitation lies in the lack of generality for UDFs to express complex applications and to compose them with relational operators in SQL queries. Another limitation lies in the lack
To achieve scalable data-intensive analytics, we investigate methods to integrate general-purpose analytic computation into a query pipeline using User Defined Functions (UDFs). However, an existing UDF cannot act as a block operator with chunk-wise input along the tuple-wise query processing pipeline; it is therefore unable to handle application semantics definable on a set of incoming tuples that represent a single object or fall within a time window, and unable to leverage external computation engines for efficient batch processing. To enable the data-intensive computation pipeline, we introduce a new kind of UDF called the Set-In Set-Out (SISO) UDF. A SISO UDF is a block operator that processes input tuples and returns resulting tuples chunk by chunk. Operated in the query processing pipeline, a SISO UDF pools a chunk of input tuples, dispatches them to GPUs or an analytic engine in batch, then materializes and streams out the results. This behavior differentiates the SISO UDF from all existing ones and makes efficient integration of analytic computation and data management feasible. We have implemented the SISO UDF framework by extending the PostgreSQL query engine, and further demonstrated the use of SISO UDFs with GPU-enabled analytical query evaluation. Our experiments show that the proposed approach is scalable and efficient.
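The SISO behavior can be sketched as a generator that blocks until a chunk is pooled, dispatches the chunk in one batched call, and streams the results back into the pipeline (a hypothetical Python stand-in for the engine-level UDF framework):

```python
# A minimal sketch of Set-In Set-Out (SISO) behavior: a block operator in
# a tuple-wise pipeline that pools input tuples into chunks, hands each
# chunk to a batch engine (e.g., a GPU kernel or model-scoring call) in
# one dispatch, then streams the materialized results back out.
def siso(tuples, batch_fn, chunk_size=1024):
    """tuples: upstream tuple iterator; batch_fn: set-in/set-out computation
    applied to a whole chunk at once."""
    chunk = []
    for t in tuples:
        chunk.append(t)
        if len(chunk) == chunk_size:        # block until a full set is pooled
            for out in batch_fn(chunk):     # one batched dispatch per chunk
                yield out                   # stream results back into pipeline
            chunk = []
    if chunk:                               # flush the final partial chunk
        yield from batch_fn(chunk)
```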
Running analytics computation inside database engines through the use of UDFs (User Defined Functions) has been extensively investigated, but has not yet become a scalable approach due to two major limitations. One limitation is that existing UDFs are not relation-in, relation-out and schema-aware; they are unable to model complex applications and cannot be composed with relational operators in a SQL query.
Efficiently Support MapReduce-like Computation Models Inside Parallel DBMS. Qiming Chen, Andy Therber, Meichun Hsu, Hans Zeller, Bin Zhang, Ren Wu. HP TSG SW NED, Cupertino, California, USA; HP Labs, Palo Alto, California, USA.
ABSTRACT Opinionated social media such as product reviews are now widely used by individuals and organizations for decision making. However, driven by profit or fame, people try to game the system by opinion spamming (e.g., writing fake reviews) to promote or demote target products. In recent years, fake review detection has attracted significant attention from both the business and research communities. However, due to the difficulty of the human labeling needed for supervised learning and evaluation, the problem remains highly challenging. This work proposes a novel angle on the problem by modeling spamicity as latent. An unsupervised model, called the Author Spamicity Model (ASM), is proposed. It works in the Bayesian setting, which facilitates modeling the spamicity of authors as latent and allows us to exploit various observed behavioral footprints of reviewers. The intuition is that opinion spammers have different behavioral distributions than non-spammers, which creates a divergence between the latent population distributions of two clusters: spammers and non-spammers. Model inference results in learning the population distributions of the two clusters. Several extensions of ASM are also considered, leveraging different priors. Experiments on a real-life Amazon review dataset demonstrate the effectiveness of the proposed models, which significantly outperform state-of-the-art competitors.
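As a rough unsupervised analogue of the ASM intuition, two latent clusters with divergent behavioral distributions, the sketch below runs EM on a two-component Bernoulli mixture over binary behavioral footprints; this is a stand-in for illustration, not the paper's Bayesian model or its inference procedure:

```python
# A hedged analogue of the ASM intuition: treat spamicity as latent and
# let a two-cluster mixture over binary behavioral footprints (e.g.,
# "only extreme ratings", "burst of reviews") separate reviewers whose
# behavior distributions diverge. EM on a Bernoulli mixture stands in
# for the paper's Bayesian inference.
import numpy as np

def two_cluster_em(X, iters=50, seed=0):
    """X: (n_reviewers, n_binary_features) 0/1 matrix. Returns each
    reviewer's probability of cluster 1 and the clusters' feature rates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.array([0.5, 0.5])                       # cluster weights
    theta = rng.uniform(0.25, 0.75, size=(2, d))    # Bernoulli feature rates
    for _ in range(iters):
        # E-step: responsibilities from each cluster's log-likelihood
        log_p = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and rates (clipped away from 0 and 1)
        pi = r.mean(axis=0)
        theta = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-3, 1 - 1e-3)
    return r[:, 1], theta
```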

    And 61 more