  • Porto, Porto, Portugal

Joao Gama

Universidade do Porto, Liaad, Faculty Member
Machine learning techniques have been successfully applied to several real-world problems in areas as diverse as image analysis, Semantic Web, bioinformatics, text processing, natural language processing, telecommunications, finance, medical diagnosis, and so forth. A particular application where machine learning plays a key role is data mining, where machine learning techniques have been extensively used for the extraction of association, clustering, prediction, diagnosis, and regression models. This text presents our personal view of the main aspects, major tasks, frequently used algorithms, current research, and future directions of machine learning research. To that end, it is organized as follows: background information concerning machine learning is presented in the second section. The third section discusses different definitions of Machine Learning. Common tasks faced by Machine Learning systems are described in the fourth section. Popular Machine Learning algorithms and the importance of the loss function are discussed in the fifth section. The sixth and seventh sections present the current trends and future research directions, respectively.
A machine learning approach that is capable of treating data streams presents new challenges and enables the analysis of a variety of real problems in which concepts change over time. In this scenario, the ability to identify novel concepts as well as to deal with concept drift are two important attributes. This paper presents a technique based on the k-means
In this paper, a cluster-based novelty detection technique capable of dealing with a large amount of data is presented and evaluated in the context of intrusion detection. Starting with examples of a single class that describe the normal profile, the proposed technique detects novel concepts initially as cohesive clusters of examples and later as sets of clusters in an unsupervised
Machine Learning and Data Mining research strongly depends on the quality and quantity of the real-world datasets used in the evaluation stages of the methods under development. In the context of the emerging Online Multi-Target Regression and Multi-Label Classification methodologies, datasets present new characteristics that require specific testing and represent new challenges. The first difficulty found in evaluation is the reduced number of examples caused by data damage, privacy preservation, or high cost of acquisition. Secondly, data events of interest, such as data changes, are difficult to find in the datasets of specific domains, since these events are naturally scarce. For those reasons, this work suggests a method of producing synthetic datasets with desired properties (number of examples, data change events, ...) for the evaluation of Multi-Target Regression and Multi-Label Classification methods. These datasets are produced using First Principle Models which give more realistic and ...
Plasmids are common in the prokaryotic world, both in bacteria and archaea. Most of these extrachromosomal DNA molecules do not code for essential genes. One may expect that the replication of plasmids and the expression of plasmidic genes impose a fitness cost on their host. Given this cost, and given that plasmid-free cells often arise, it is striking that so many non-transferable plasmids are able to maintain themselves inside prokaryotic cells without being counter-selected in favor of plasmid-free cells. A solution to this paradox would be the evolution of controlling mechanisms that regulate rivalry between plasmids for the stability of these symbiotic relationships. In this chapter, we discuss the evolutionary selective conditions for such mechanisms to evolve.
Data mining and machine learning algorithms can be employed to perform a variety of tasks. However, since most of these problems may depend on environments that change over time, performing classification tasks in dynamic environments has been a challenge in the data mining research domain in recent decades. Currently, the most common strategies used in the literature to detect changes are based on accuracy monitoring, which relies on previous knowledge of the data in order to identify whether or not correct classifications are provided. However, such feedback can be infeasible in practical problems. In this work, we present a comprehensive overview of current machine learning/data mining approaches proposed to deal with problems in dynamic environments. The objective is to highlight the main drawbacks and open issues, as well as future directions and problems worthy of investigation. In addition, we provide the definitions of the main terms used to represent this problem in the liter...
Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over the data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given the current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems. In this work, we focus on elucidating the connections among the current state of the art in related fields, and on clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers. This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationship of different subareas and suggesting courses of action when possible.
Decision rules are one of the most expressive and interpretable models for machine learning. In this article, we present Adaptive Model Rules (AMRules), the first stream rule learning algorithm for regression problems. In AMRules, the antecedent of a rule is a conjunction of conditions on the attribute values, and the consequent is a linear combination of the attributes. In order to maintain a regression model compatible with the most recent state of the process generating data, each rule uses a Page-Hinkley test to detect changes in this process and reacts to changes by pruning the rule set. Online learning might be strongly affected by outliers, so AMRules is also equipped with outlier detection mechanisms to avoid model adaptation using anomalous examples. In the experimental section, we report the results of AMRules on benchmark regression problems, and compare the performance of our system with other streaming regression algorithms.
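The Page-Hinkley test mentioned in this abstract can be illustrated with a minimal sketch. The version below is the textbook form of the test for an upward shift in the mean of a monitored signal (for instance, a rule's absolute prediction error). The class name, the `first_alarm` helper, and the values of `delta` and `threshold` are illustrative assumptions, not settings taken from AMRules.

```python
class PageHinkley:
    """Textbook Page-Hinkley test for an increase in the mean of a stream.

    delta and threshold are assumed, illustrative values.
    """

    def __init__(self, delta=0.005, threshold=10.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm threshold (often written lambda)
        self.n = 0                  # number of observations seen
        self.mean = 0.0             # running mean of the observations
        self.cum = 0.0              # cumulative deviation m_T
        self.cum_min = 0.0          # minimum of m_t seen so far, M_T

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold


def first_alarm(stream, detector):
    """Hypothetical helper: index of the first alarm, or None if none fires."""
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


# Error signal that jumps from 0.0 to 1.0 at position 200.
errors = [0.0] * 200 + [1.0] * 200
alarm_at = first_alarm(errors, PageHinkley())
```

In a rule learner of the kind described above, one such detector would be kept per rule, and an alarm would trigger removal of that rule; how the alarm is acted upon here is left out of the sketch.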
A review on the combination of binary classifiers in multiclass problems. Ana Carolina Lorena · André CPLF de Carvalho · João MP Gama. Artif Intell Rev (2008) 30:19–37, DOI 10.1007/s10462-009-9114-9. Published ...
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering imposes several challenges to be addressed, such as dealing with nonstationary, unbounded data that arrive in an online fashion. The intrinsic nature of stream data requires the development of algorithms capable of performing fast and incremental processing of data objects, suitably addressing time and memory limitations. In this article, we present a survey of data stream clustering algorithms, providing a thorough discussion of the main design components of state-of-the-art algorithms. In addition, this work addresses the temporal aspects involved in data stream clustering, and presents an overview of the usually employed experimental methodologies. A number of references are provided that de...
Many data stream clustering algorithms operate in two well-defined steps: (i) an online statistical data collection stage; and (ii) an offline macro-clustering stage. The well-known k-means algorithm is often employed for performing the offline macro-clustering step. The conventional k-means algorithm assumes that the number of clusters (k) is defined a priori by the user. Given the difficulty of defining the value of k a priori in real-world problems, we describe a new approach that allows estimating k dynamically from streams with variable ...
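The generic two-step scheme described in this abstract can be sketched as follows. This is only an illustration of the online summary stage followed by offline k-means with a user-supplied k; it does not implement the paper's approach for estimating k. The `MicroCluster` class, the `radius` parameter, and the naive initialisation are assumptions made for the example.

```python
import math


class MicroCluster:
    """Summary of nearby points: a count and a per-dimension linear sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        self.linear_sum = [s + p for s, p in zip(self.linear_sum, point)]


def online_step(micro_clusters, point, radius):
    """Online stage: absorb the point into the nearest summary within
    `radius` (an assumed parameter), otherwise open a new summary."""
    nearest = min(micro_clusters,
                  key=lambda mc: math.dist(mc.centroid(), point),
                  default=None)
    if nearest is not None and math.dist(nearest.centroid(), point) <= radius:
        nearest.absorb(point)
    else:
        micro_clusters.append(MicroCluster(point))


def offline_kmeans(micro_clusters, k, iterations=20):
    """Offline stage: weighted k-means over the micro-cluster centroids."""
    points = [mc.centroid() for mc in micro_clusters]
    weights = [mc.n for mc in micro_clusters]
    k = min(k, len(points))                  # cannot have more centers than summaries
    centers = [list(p) for p in points[:k]]  # naive deterministic initialisation
    for _ in range(iterations):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        totals = [0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: math.dist(centers[c], p))
            totals[j] += w
            sums[j] = [s + w * x for s, x in zip(sums[j], p)]
        # Keep the old center if no summary was assigned to it.
        centers = [[s / totals[j] for s in sums[j]] if totals[j] else centers[j]
                   for j in range(k)]
    return centers


# Usage: summarise a small stream online, then cluster the summaries offline.
stream = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.0), (10.0, 10.2), (0.0, 0.2), (9.9, 10.0)]
summaries = []
for point in stream:
    online_step(summaries, point, radius=1.0)
centers = offline_kmeans(summaries, k=2)
```

Only the summaries are kept in memory, so the offline step never revisits the raw stream; this is the property that makes the two-step design suitable for unbounded data.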
Online Prediction of Clustered Streams. Pedro Pereira Rodrigues and Joao Gama, {prodrigues, jgama}@liacc.up.pt, LIACC-NIAAD-University of Porto, Rua de Ceuta, 118-6 andar, 4050-190 Porto, Portugal. Abstract. This paper presents a real-time system for online ...
This paper presents and evaluates an approach to novelty detection that addresses it as the problem of identifying novel concepts in a continuous learning scenario, as an extension to a single-class classification problem. OLINDDA, an OnLIne Novelty and Drift Detection ...
Addressing the issues challenging the sensor community, this book presents innovative solutions in offline data mining and real-time analysis of sensor or geographically distributed data. Illustrated with case studies, it discusses the challenges and requirements for sensor data-based knowledge discovery solutions in high-priority applications. The book then explores the fusion between heterogeneous data streams from multiple sensor types and applications in science, engineering, and security. Bringing together researchers from ...
