Xiufeng Liu
  • Building 426, room 125, 2800 Kgs. Lyngby, Denmark
  • I am a researcher in the Department of Management Engineering at Technical University of Denmark (DTU). I am a member...
Recently, organizations and governments have shifted their focus towards the digitization of academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety, and velocity of the generated data satisfy the big data definition, and this scholarly reserve is therefore popularly referred to as big scholarly data. To facilitate data analytics for big scholarly data, suitable architectures and services need to be developed. The evolving nature of research problems has made them essentially interdisciplinary, creating a growing demand for scholarly applications such as collaborator discovery, expert finding, and research recommendation systems, among others. This paper investigates current trends, identifies the existing challenges in developing a big scholarly data platform with a specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.
Smart city data come from heterogeneous sources, including various types of Internet of Things devices such as traffic, weather, pollution, noise, and portable devices. They are characterized by diverse quality issues and different types of sensitive information, which makes data processing and publishing challenging. In this paper, we propose a framework to streamline smart city data management, covering data collection, cleansing, anonymization, and publishing. The paper classifies smart city data into sensitive, quasi-sensitive, and open/public levels and then suggests different strategies to process and publish the data within these categories. The paper evaluates the framework using a real-world smart city data set, and the results verify its effectiveness and efficiency. The framework can serve as a generic solution for managing smart city data.
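The classification-driven processing described above could, in its simplest form, look like the sketch below. The three sensitivity levels come from the paper, but the field names, the anonymization rules, and the publishing targets are illustrative assumptions, not the framework's actual implementation.

```python
# Sketch: route smart city records by sensitivity level before publishing.
# The three levels (sensitive, quasi-sensitive, open/public) follow the paper;
# field names, masking rules, and targets are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Record:
    source: str          # e.g. "traffic", "noise", "weather"
    device_id: str
    location: str        # "lat,lon" string in this toy example
    value: float
    sensitivity: str     # "sensitive" | "quasi-sensitive" | "open"

def anonymize(rec: Record) -> Record:
    """Drop or coarsen identifying attributes for non-public data."""
    if rec.sensitivity == "sensitive":
        rec.device_id = "REDACTED"
        rec.location = rec.location.split(",")[0]      # keep only a coarse area
    elif rec.sensitivity == "quasi-sensitive":
        rec.device_id = f"pseudo-{hash(rec.device_id) & 0xFFFF:04x}"
    return rec

def publish(records):
    """Cleanse, anonymize, and route records to a per-level output."""
    targets = {"sensitive": "internal_store",
               "quasi-sensitive": "restricted_portal",
               "open": "open_data_portal"}
    for rec in records:
        if rec.value is None:                          # trivial cleansing step
            continue
        rec = anonymize(rec)
        print(f"-> {targets[rec.sensitivity]}: {rec}")

publish([Record("traffic", "cam-17", "55.78,12.52", 42.0, "quasi-sensitive"),
         Record("noise", "mic-03", "55.79,12.51", 61.5, "open")])
```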
With the prevalence of cloud computing and the Internet of Things (IoT), smart meters have become one of the main components of smart city strategies. Smart meters generate large amounts of fine-grained data that is used to provide useful information to consumers and utility companies for decision-making. Nowadays, smart meter analytics systems consist of analytical algorithms that process massive amounts of data. These analytics algorithms require ample amounts of realistic data for testing and verification purposes. However, it is usually difficult to obtain adequate amounts of realistic data, mainly due to privacy issues. This paper proposes a smart meter data generator that can generate realistic energy consumption data by using a small real-world data set as a seed. The generator uses a prediction-based method that relies on historical energy consumption patterns combined with Gaussian white noise. In this paper, we comprehensively evaluate the efficiency and effectiveness of the proposed method on a real-world energy data set.
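As a rough illustration of the prediction-plus-noise idea, the sketch below extends a small seed of hourly readings by predicting from the seed's per-hour consumption pattern and adding Gaussian white noise. The hour-of-day averaging used as the "prediction" and the noise level are assumptions for illustration; the paper's actual model may differ.

```python
# Sketch: generate synthetic hourly consumption from a small real seed by
# predicting from the seed's hour-of-day pattern and adding Gaussian noise.
import numpy as np

def generate(seed_readings, n_hours, noise_std=0.05, rng=None):
    """seed_readings: 1-D array of real hourly consumption used as the seed."""
    rng = rng or np.random.default_rng(42)
    seed = np.asarray(seed_readings, dtype=float)
    hours = np.arange(len(seed)) % 24
    # Historical pattern: mean consumption for each hour of the day in the seed.
    pattern = np.array([seed[hours == h].mean() for h in range(24)])
    out = []
    for t in range(n_hours):
        predicted = pattern[t % 24]                                  # pattern-based prediction
        noisy = predicted + rng.normal(0.0, noise_std * predicted)   # Gaussian white noise
        out.append(max(noisy, 0.0))                                  # consumption is non-negative
    return np.array(out)

seed = np.abs(np.sin(np.linspace(0, 8 * np.pi, 96))) + 0.2   # 4 days of toy seed data
synthetic = generate(seed, n_hours=24 * 30)                  # one month of synthetic data
print(synthetic[:5])
```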
The eGovMon Data Warehouse (eGovMon DW) is built as a data repository for eGovernment services benchmarking results. We propose a DW architecture based on open source business intelligence technologies for eGovernment. The architecture uses PostgreSQL as the DBMS, the eGovernment operational system as the data source, and a right-time ETL tool to populate the data. Based on this proposal, we outline potential research interests and issues for our future work.
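The abstract does not show the actual eGovMon schema or ETL tool, but a right-time load into a PostgreSQL-based DW could, in its simplest form, look like the polling sketch below; table and column names are invented for illustration.

```python
# Minimal right-time ETL sketch: periodically poll the operational source and
# append new benchmarking results to a PostgreSQL fact table.
# Table and column names are illustrative assumptions, not the eGovMon schema.
import time
import psycopg2

SRC_DSN = "dbname=egovmon_ops"   # operational eGovernment system (assumed DSN)
DW_DSN = "dbname=egovmon_dw"     # the data warehouse (assumed DSN)

def load_new_results(last_seen_id):
    with psycopg2.connect(SRC_DSN) as src, psycopg2.connect(DW_DSN) as dw:
        with src.cursor() as s, dw.cursor() as d:
            s.execute("SELECT id, service_id, tested_at, score "
                      "FROM benchmark_results WHERE id > %s ORDER BY id",
                      (last_seen_id,))
            rows = s.fetchall()
            d.executemany("INSERT INTO fact_benchmark "
                          "(source_id, service_key, date_key, score) "
                          "VALUES (%s, %s, %s, %s)", rows)
    return rows[-1][0] if rows else last_seen_id

last_id = 0
while True:                      # "right-time": load shortly after data arrives
    last_id = load_new_results(last_id)
    time.sleep(60)
```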
An increasing number of (semantic) web applications store a very large number of (subject, predicate, object) triples in specialized storage engines called triple-stores. Often, triple-stores are used mainly as plain data stores, i.e., for inserting and retrieving large amounts of triples, without using more advanced features such as logical inference. However, current triple-stores are not optimized for such bulk operations and/or do not support OWL Lite. Further, triple-stores can be inflexible when the data has to be integrated with other kinds of data in non-triple form, e.g., standard relational data. This paper presents 3XL, a triple-store that efficiently supports operations on very large amounts of OWL Lite triples. 3XL also provides the user with high flexibility, as it stores data in an object-relational database in a schema that is easy to use and understand. It is thus easy to integrate 3XL data with data from other sources. The distinguishing features of 3XL include (a) flexibility, as the data is stored in a database, allowing easy integration with other data, and can be queried by means of both triple queries and SQL; (b) a specialized data-dependent schema (with intelligent partitioning) that is intuitive and efficient to use; (c) use of object-relational DBMS features such as inheritance; (d) efficient loading through extensive use of bulk loading and caching; and (e) efficient triple query operations, especially in the important case when the subject and/or predicate is known. Extensive experiments with a PostgreSQL-based implementation show that 3XL performs very well for such operations and that its performance is comparable to state-of-the-art triple-stores.
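To make the schema idea concrete, the sketch below shows how a tiny class hierarchy could be mapped to PostgreSQL tables with inheritance, in the spirit of 3XL's data-dependent schema. The example ontology (Person/Student with a few data properties) and the TEXT-only column types are invented for illustration and are not taken from the paper.

```python
# Sketch: map each OWL class to a table, let subclass tables INHERIT from the
# superclass table, and turn data properties into columns (types simplified).
# The ontology below is an invented example, not 3XL's actual mapping rules.
ontology = {
    "Person":  {"parent": None,     "properties": ["name", "email"]},
    "Student": {"parent": "Person", "properties": ["studies_at"]},
}

def generate_schema(ontology):
    """Emit PostgreSQL DDL that uses object-relational inheritance."""
    ddl = []
    for cls, spec in ontology.items():           # parents listed before children
        cols = ["uri TEXT PRIMARY KEY"] if spec["parent"] is None else []
        cols += [f"{p} TEXT" for p in spec["properties"]]
        inherits = f" INHERITS ({spec['parent'].lower()})" if spec["parent"] else ""
        ddl.append(f"CREATE TABLE {cls.lower()} ({', '.join(cols)}){inherits};")
    return "\n".join(ddl)

print(generate_schema(ontology))
# CREATE TABLE person (uri TEXT PRIMARY KEY, name TEXT, email TEXT);
# CREATE TABLE student (studies_at TEXT) INHERITS (person);
```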
This paper demonstrates the use of 3XL, a DBMS-based triple-store for OWL Lite data. 3XL is characterized by its use of a database schema specialized for the data to be represented. The specialized database schema uses object-relational features, particularly inheritance, and partitions the data such that it is fast to locate the needed data when it is queried. Further, the generated database schema is very intuitive, making it easy to integrate the OWL data with other kinds of data. 3XL offers performance comparable to the leading file-based triple-stores. We will demonstrate 1) how a specialized database schema is generated by 3XL based on an OWL ontology; 2) how triples are loaded, including how they pass through the 3XL system and how 3XL can be configured to fine-tune performance; and 3) how (simple and complex) queries can be expressed and how they are executed by 3XL.
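For the query part of the demonstration, the sketch below illustrates why the known-subject/known-predicate case is cheap in such a schema: the triple pattern maps directly to a lookup in the table holding that property. The property-to-table mapping and names are illustrative assumptions.

```python
# Sketch: translate a (subject, predicate, ?object) pattern into a direct SQL
# lookup on the generated schema. The mapping below is an invented example.
property_to_table = {"email": "person", "studies_at": "student"}

def triple_query(subject, predicate):
    """Return parameterized SQL answering (subject, predicate, ?o)."""
    table = property_to_table[predicate]
    return f"SELECT {predicate} FROM {table} WHERE uri = %s", (subject,)

sql, params = triple_query("http://example.org/alice", "email")
print(sql, params)   # SELECT email FROM person WHERE uri = %s ('http://example.org/alice',)
```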
Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and because UPDATEs are not supported, SCDs are complex to handle manually). The powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing either. To remedy this, we present the ETL framework CloudETL, which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and how it uses different performance optimizations, including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive with respect to both performance and programmer productivity. For example, in one experiment Hive takes 3.9 times as long as CloudETL to load an SCD and needs 112 statements, while CloudETL needs only 4.
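The kind of dimensional logic CloudETL automates can be illustrated with a small type-2 SCD merge. The sketch below is a plain in-memory illustration of that logic, not CloudETL's API; the field names and the single tracked attribute are assumptions.

```python
# Sketch of type-2 slowly changing dimension (SCD) handling: expire the current
# dimension row and insert a new version when a tracked attribute changes.
# Field names and the in-memory representation are illustrative assumptions.
from datetime import date

def scd2_merge(dimension, incoming, key="customer_id", today=None):
    today = today or date.today()
    current = {r[key]: r for r in dimension if r["valid_to"] is None}
    for row in incoming:
        existing = current.get(row[key])
        if existing and existing["city"] != row["city"]:   # tracked attribute changed
            existing["valid_to"] = today                   # expire the old version
        if existing is None or existing["valid_to"] is not None:
            dimension.append({**row, "valid_from": today, "valid_to": None})
    return dimension

dim = [{"customer_id": 1, "city": "Aalborg",
        "valid_from": date(2010, 1, 1), "valid_to": None}]
dim = scd2_merge(dim, [{"customer_id": 1, "city": "Lyngby"}], today=date(2012, 6, 1))
for row in dim:
    print(row)
```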
Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (e.g., every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature. However, the focus has been on what can be done with the data rather than how to do it efficiently. In this article, we examine smart meter analytics from a software performance perspective. First, we design a performance benchmark that includes common smart meter analytics tasks. These include offline feature extraction and model building as well as a framework for online anomaly detection that we propose. Second, since obtaining real smart meter data is difficult due to privacy issues, we present an algorithm for generating large realistic datasets from a small seed of real data. Third, we implement the proposed benchmark using five representative platforms: a traditional numeric computing platform (Matlab), a relational DBMS with a built-in machine learning toolkit (PostgreSQL/MADlib), a main-memory column store ("System C"), and two distributed data processing platforms (Hive and Spark/Spark Streaming). We compare the five platforms in terms of application development effort and performance on a multicore machine as well as a cluster of 16 commodity servers.
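As a flavor of the online part of the benchmark, the sketch below flags a reading that deviates strongly from the running statistics for its hour of the day, using Welford's online mean/variance update. It illustrates the kind of streaming anomaly-detection task the benchmark contains; it is not the paper's proposed detector, and the threshold and warm-up period are assumptions.

```python
# Sketch of a streaming anomaly check over smart meter readings: keep running
# per-hour statistics (Welford's algorithm) and flag large deviations.
# This is an illustrative stand-in, not the paper's proposed detector.
import math
import random
from collections import defaultdict

class HourlyAnomalyDetector:
    def __init__(self, threshold=4.0):
        self.threshold = threshold
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])   # hour -> [n, mean, M2]

    def observe(self, hour_of_day, kwh):
        n, mean, m2 = self.stats[hour_of_day]
        n += 1                                             # Welford's online update
        delta = kwh - mean
        mean += delta / n
        m2 += delta * (kwh - mean)
        self.stats[hour_of_day] = [n, mean, m2]
        std = math.sqrt(m2 / n) if n > 1 else 0.0
        return n > 24 and std > 0 and abs(kwh - mean) > self.threshold * std

random.seed(0)
detector = HourlyAnomalyDetector()
for day in range(30):
    for hour in range(24):
        kwh = 1.0 + 0.5 * math.sin(hour / 24 * 2 * math.pi) + random.gauss(0, 0.05)
        if (day, hour) == (29, 12):
            kwh += 5.0                                     # inject an obvious spike
        if detector.observe(hour, kwh):
            print(f"anomaly on day {day}, hour {hour}: {kwh:.2f} kWh")
```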