    Searching and Evaluating; Options; Performing Selection; ... The scatter plot matrix; Selecting an individual 2D scatter plot ...
    Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how
    Association rule mining is a data mining technique that reveals interesting relationships in a database. Existing approaches employ different parameters to search for interesting rules. This fact and the large number of rules make it difficult to compare the output of confidence ...
    A system of nested dichotomies is a hierarchical decomposition of a multi-class problem with c classes into c–1 two-class problems and can be represented as a tree structure. Ensembles of randomly-generated nested dichotomies have proven to be an effective ...
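    The decomposition described in this abstract can be illustrated with a minimal sketch. It assumes a simple recursive random split, which is not necessarily the sampling scheme used in the paper, and the names `random_nested_dichotomy` and `count_internal_nodes` are hypothetical. Each internal node of the resulting binary tree is one two-class problem, so a tree over c classes has c-1 of them.

```python
import random

def random_nested_dichotomy(classes, rng):
    """Recursively split a set of class labels into two non-empty subsets.

    Assumption: a random shuffle plus a random cut point, chosen for
    simplicity; this does not sample trees uniformly.
    """
    classes = list(classes)
    if len(classes) == 1:
        return classes[0]  # leaf: a single class label
    rng.shuffle(classes)
    cut = rng.randint(1, len(classes) - 1)  # both sides stay non-empty
    return (random_nested_dichotomy(classes[:cut], rng),
            random_nested_dichotomy(classes[cut:], rng))

def count_internal_nodes(tree):
    """Count internal nodes, i.e. the two-class problems in the tree."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return 1 + count_internal_nodes(left) + count_internal_nodes(right)

labels = ["a", "b", "c", "d", "e"]  # c = 5 classes
tree = random_nested_dichotomy(labels, random.Random(0))
n_problems = count_internal_nodes(tree)  # always c - 1
```

    Whatever split is drawn, a binary tree with c leaves has exactly c-1 internal nodes, which is why the count does not depend on the random seed.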
    Empirical research in learning algorithms for classification tasks generally requires the use of significance tests. The quality of a test is typically judged on Type I error (how often the test indicates a difference when it should not) and Type II error (how often it indicates no ...
    ABSTRACT Inducing classifiers that make accurate predictions on future data is a driving force for research in inductive learning. However, also of importance to the users is how to gain information from the models produced. Unfortunately, some of the most powerful inductive learning algorithms generate “black boxes”—that is, the representation of the model makes it virtually impossible to gain any insight into what has been learned. This paper presents a technique that can help the user understand why a classifier makes the predictions that it does by providing a two-dimensional visualization of its class probability estimates. It requires the classifier to generate class probabilities but most practical algorithms are able to do so (or can be modified to this end).
    Multinomial naive Bayes (MNB) is a popular method for document classification due to its computational efficiency and relatively good predictive performance. It has recently been established that predictive performance can be improved further by appropriate data ...
    Revisiting Multiple-Instance Learning Via Embedded Instance Selection. James Foulds and Eibe Frank, Department of Computer Science, University of Waikato, New Zealand, {jf47,eibe}@cs.waikato.ac.nz. Abstract. ... The MI framework was introduced by Dietterich et al. ...
    ABSTRACT The much-publicized Netflix competition has put the spotlight on the application domain of collaborative filtering and has sparked interest in machine learning algorithms that can be applied to this sort of problem. The demanding nature of the Netflix data has led to some interesting and ingenious modifications to standard learning methods in the name of efficiency and speed. There are three basic methods that have been applied in most approaches to the Netflix problem so far: stand-alone neighborhood-based methods, latent factor models based on singular-value decomposition, and ensembles consisting of variations of these techniques. In this paper we investigate the application of forward stage-wise additive modeling to the Netflix problem, using two regression schemes as base learners: ensembles of weighted simple linear regressors and k-means clustering, the latter being interpreted as a tool for multivariate regression in this context. Experimental results show that our methods produce competitive results.
    Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multi-class classification task into a system of dichotomies and provide a ...
    Abstract. This paper presents empirical results for several versions of the multinomial naive Bayes classifier on four text categorization problems, and a way of improving it using locally weighted learning. More specifically, it compares standard multinomial naive Bayes to the recently ...
    The hypothesis was that sensors currently available on farm that monitor behavioral and physiological characteristics have potential for the detection of lameness in dairy cows. This was tested by applying additive logistic regression to variables derived from sensor data. Data were collected between November 2010 and June 2012 on 5 commercial pasture-based dairy farms. Sensor data from weigh scales (liveweight), pedometers (activity), and milk meters (milking order, unadjusted and adjusted milk yield in the first 2 min of milking, total milk yield, and milking duration) were collected at every milking from 4,904 cows. Lameness events were recorded by farmers who were trained in detecting lameness before the study commenced. A total of 318 lameness events affecting 292 cows were available for statistical analyses. For each lameness event, the lame cow's sensor data for a time period of 14 d before observation date were randomly matched by farm and date to 10 healthy cows (i.e., ...
    ABSTRACT We address the problem of estimating a discrete joint density online, that is, the algorithm is only provided the current example and its current estimate. The proposed online estimator of discrete densities, EDDO (Estimation of Discrete Densities Online), uses classifier chains to model dependencies among features. Each classifier in the chain estimates the probability of one particular feature. Because a single chain may not provide a reliable estimate, we also consider ensembles of classifier chains and ensembles of weighted classifier chains. For all density estimators, we provide consistency proofs and propose algorithms to perform certain inference tasks. The empirical evaluation of the estimators is conducted in several experiments and on data sets of up to several million instances: We compare them to density estimates computed from Bayesian structure learners, evaluate them under the influence of noise, measure their ability to deal with concept drift, and measure the run-time performance. Our experiments demonstrate that, even though designed to work online, EDDO delivers estimators of competitive accuracy compared to batch Bayesian structure learners and batch variants of EDDO.
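    The chain idea in this abstract, one estimator per feature conditioned on the features before it, can be sketched as follows. This is not the EDDO algorithm: it assumes plain frequency counts with Laplace smoothing in place of the online classifiers used in the paper, and the class name `ChainDensity` is hypothetical. The joint estimate is the product of the per-feature conditional estimates, updated one example at a time.

```python
from collections import defaultdict
from itertools import product

class ChainDensity:
    """Online joint density over discrete features via the chain rule.

    Assumption: each conditional p(x_i | x_1..x_{i-1}) is a smoothed
    frequency table rather than a learned classifier.
    """

    def __init__(self, n_values):
        self.n_values = n_values  # number of distinct values per feature
        self.counts = [defaultdict(int) for _ in n_values]  # (prefix, value) counts
        self.totals = [defaultdict(int) for _ in n_values]  # prefix counts

    def update(self, x):
        """Consume a single example; no earlier examples are stored."""
        for i, v in enumerate(x):
            prefix = tuple(x[:i])
            self.counts[i][(prefix, v)] += 1
            self.totals[i][prefix] += 1

    def prob(self, x):
        """Product of Laplace-smoothed conditionals along the chain."""
        p = 1.0
        for i, v in enumerate(x):
            prefix = tuple(x[:i])
            p *= (self.counts[i][(prefix, v)] + 1) / (
                self.totals[i][prefix] + self.n_values[i])
        return p

est = ChainDensity([2, 2, 2])  # three binary features
for x in [(0, 1, 1), (0, 1, 0), (1, 0, 0)]:
    est.update(x)
total = sum(est.prob(x) for x in product([0, 1], repeat=3))
```

    Because every smoothed conditional sums to one over its feature's values, the product sums to one over all value combinations, so `total` is a valid check that the estimate is a proper distribution.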
    Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to
    Inflammatory bowel diseases (IBD) are emerging globally, indicating that environmental factors may be important in their pathogenesis. Colonic mucosal epigenetic changes, such as DNA methylation, can occur in response to the environment and have been implicated in IBD pathology. However, mucosal DNA methylation has not been examined in treatment-naïve patients. We studied DNA methylation in untreated, left sided colonic biopsy specimens using the Infinium HumanMethylation450 BeadChip array. We analyzed 22 control (C) patients, 15 untreated Crohn's disease (CD) patients, and 9 untreated ulcerative colitis (UC) patients from two cohorts. Samples obtained at the time of clinical remission from two of the treatment-naïve UC patients were also included into the analysis. UC-specific gene expression was interrogated in a subset of adjacent samples (5 C and 5 UC) using the Affymetrix GeneChip PrimeView Human Gene Expression Arrays. Only treatment-naïve UC separated from control. One-hundred-and-twenty genes with significant expression change in UC (> 2-fold, P<0.05) were associated with differentially methylated regions (DMRs). Epigenetically associated gene expression changes (including gene expression changes in the IFITM1, ITGB2, S100A9, SLPI, SAA1, and STAT3 genes) were linked to colonic mucosal immune and defense responses. These findings underscore the relationship between epigenetic changes and inflammation in pediatric treatment-naïve UC and may have potential etiologic, diagnostic, and therapeutic relevance for IBD.
    More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining (35). These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million
    Logistic Model Trees have been shown to be very accurate and compact classifiers [8]. Their greatest disadvantage is the computational complexity of inducing the logistic regression models in the tree. We address this issue by using the AIC criterion [1] instead of cross-...
    This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We also developed a similarity measure that evaluates the semantic relatedness between concept sets for two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that although further optimizations could be performed, our approach already improves upon related techniques. This is an author’s accepted version of an article published in Proceedings of 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, April 27-29. ©2009 Springer-Verlag Berlin Heidelberg.
    We present a new approach to the induction of SARs based on the generation of structural fragments and support vector machines (SVMs). It is tailored for bio-chemical databases, where the examples are two-dimensional descriptions of chemical compounds. The fragment generator finds all fragments (i.e. linearly connected atoms) that satisfy user-specified constraints regarding their frequency and generality. In this paper, we are querying for fragments within a minimum and a maximum frequency in the dataset. After fragment generation, we propose to apply SVMs to the problem of inducing SARs from these fragments. We conjecture that the SVMs are particularly useful in this context, as they can deal with a large number of features. Experiments in the domains of carcinogenicity and mutagenicity prediction show that the minimum and the maximum frequency queries for fragments can be answered within a reasonable time, and that the predictive accuracy obtained using these fragments is satisfactory. However, further experiments will have to confirm that this is a viable approach to inducing SARs.
    The Weka workbench is an organized collection of state-of-the-art machine learning algorithms and data preprocessing tools. The basic way of interacting with these methods is by invoking them from the command line. However, convenient interactive graphical user ...
