David Nettleton
Pompeu Fabra University, Communication and Information Technologies, Department Member
David F. Nettleton is Senior Data Mining Analyst at IRIS Technology Solutions. He also collaborates with the Web Science and Social Computing Research Group of the DTIC at the Pompeu Fabra University in Catalunya, Spain. From 1985 until 2004 he worked for a diversity of companies in different sectors, such as Systems Designers, Plc. (UK), IBM Global Services, Carburos Metalicos, Laboratorios Menarini and Coritel. He has also been involved in business startups, such as TAD Sistemas (acquired by Bertelsmann AG in 2000). Since 2004 he has taught and conducted research at the Pompeu Fabra University (Web Research Group, http://grupoweb.upf.es), the IIIA-CSIC (Ares Team for Advanced Research on Information Security and Privacy, http://www.iiia.csic.es/en/project/ares), the Ramon Llull University with the GRSI (Intelligent Systems Research Group, http://www.salleurl.edu/GRSI/) and IRIS (http://www.iristechnologygroup.com/).
His research interests include industrial data analysis and modeling, machine learning, artificial intelligence and online social network analysis.
Environmental impacts and consumer concerns have necessitated the study of bio-based materials as alternatives to petrochemicals for packaging applications. The purpose of this review is to summarize synthetic and non-synthetic materials feasible for packaging and textile applications, routes of upscaling, (industrial) applications, evaluation of sustainability, and end-of-life options. The outlined bio-based materials include polylactic acid, polyethylene furanoate, polybutylene succinate, and non-synthetically produced polymers such as polyhydroxyalkanoate, cellulose, starch, proteins, lipids, and waxes. Further emphasis is placed on modification techniques (coating and surface modification), biocomposites, multilayers, and additives used to adjust properties, especially barriers to gas and moisture, and to tune biodegradability. Overall, this review provides a holistic view of bio-based packaging material including processing, and an evaluation of the sustainability of an...
In the increasingly pressing context of improving recycling, optical technologies present broad potential to support the adequate sorting of plastics. Nevertheless, the commercially available solutions (for example, employing near-infrared spectroscopy) generally focus on identifying mono-materials of a few selected types which currently have market interest as secondary materials. Current progress in photonic sciences, together with advanced data analysis such as artificial intelligence, makes it feasible to address practical challenges that were previously out of reach, for example the classification of more complex materials. In the present paper, the different techniques are initially reviewed based on their main characteristics. Then, based on academic literature, their suitability for monitoring the composition of multi-materials, such as different types of multi-layered packaging and fibre-reinforced polymer composites as well as black plastics used in the motor vehicle industry, is discussed...
Research Interests: Computer Science, Artificial Intelligence, Natural Language Processing, Recommender Systems, Statistical Analysis, Social Media, Tagging Technologies, Data and Knowledge Modeling, Tagging, Social Tagging, Library Automation and Networking, Metadata, Image Content, Tagging Behavior, Eye Tracking Study, Pre-iconographic, Iconographic, Iconologic, Areas of Interest, and Tag Order
Background In this study, we compared four models for predicting rice blast disease, two operational process-based models (Yoshino and Water Accounting Rice Model (WARM)) and two approaches based on machine learning algorithms (M5Rules and Recurrent Neural Networks (RNN)), the former inducing a rule-based model and the latter building a neural network. In situ telemetry is important to obtain quality in-field data for predictive models and this was a key aspect of the RICE-GUARD project on which this study is based. According to the authors, this is the first time process-based and machine learning modelling approaches for supporting plant disease management are compared. Results Results clearly showed that the models succeeded in providing a warning of rice blast onset and presence, thus representing suitable solutions for preventive remedial actions targeting the mitigation of yield losses and the reduction of fungicide use. All methods gave significant “signals” during the “early...
Research Interests: Computer Science, Artificial Intelligence, Machine Learning, Forecasting, Neural Networks, Forecasting and Prediction Tools, Biological Sciences, Artificial Neural Networks, Mathematical Sciences, BMC Bioinformatics, Predictive Models, Crop Diseases, Disease Management of Crop Plants, Plant Disease, Rule Induction, Breeding for Rice Blast Resistance, and Rice Blast Disease
One of the difficulties for data analysts of online social networks is obtaining publicly available data while respecting the privacy of the users. One alternative is to use synthetically generated data [1]. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, and so on. In the following we present an approach for generating a graph topology and populating it with synthetic data for an online social network.
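The attribute-population step can be illustrated with a minimal sketch. The function and parameter names below (`assign_attributes`, `p_copy`) are invented for illustration and are not the paper's implementation; the idea is simply that values seeded at a few nodes spread to neighbours with a homophily bias:

```python
import random

def assign_attributes(adjacency, seed_values, p_copy=0.7, rng=None):
    """Assign a categorical attribute to every node: seed nodes keep their
    value; each remaining node copies a random labelled neighbour's value
    with probability p_copy, otherwise draws uniformly from the seed values."""
    rng = rng or random.Random(42)
    values = dict(seed_values)            # node -> attribute value
    pool = list(set(seed_values.values()))
    for node in adjacency:
        if node in values:
            continue
        labelled = [values[n] for n in adjacency[node] if n in values]
        if labelled and rng.random() < p_copy:
            values[node] = rng.choice(labelled)   # homophily: copy a neighbour
        else:
            values[node] = rng.choice(pool)       # fall back to a seed value
    return values

# Toy 5-node network with two seeded users
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
seeds = {0: "music", 3: "sport"}
print(assign_attributes(adj, seeds))
```

Real generators additionally fit attribute distributions and correlations to a target specification; this sketch shows only the propagation idea.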
This is the first of four chapters that deal with the analysis of data on the Internet and in an online environment. This chapter gives an introduction to website analysis and Internet search using two contrasting case studies: first, the chapter discusses how to analyze the transactional data from customer visits to a business’s website, and second, it explores how Internet search can be used as a market research tool. The examples serve to illustrate how the Internet can be used as a tool for individual marketing, mass marketing, and marketing sentiment surveys. The examples also illustrate the following two business objectives: (i) analyzing activity on a website to adapt the website’s commercial offering at both the general and individual levels, and (ii) gathering commercial information on the Internet from a diversity of sources in order to analyze and understand the marketplace. From a data mining perspective (and recalling the data sources in Chapter 3), throughout this chapter the Internet could be considered as a meta data source arising from a company's Internet presence. Following each case study, details are given of which techniques are relevant and which software applications could be used for the examples.
The area of CRM (Customer Relationship Management) has attracted a lot of attention, and many businesses that are end users of IT solutions have spent considerable amounts of money on implementing CRM systems integrated to a greater or lesser extent with their operational and business processes. However, it should be kept in mind that CRM is a basic, common-sense idea that can be put into practice with nothing more than a spreadsheet and a modest database. This chapter introduces the reader to CRM in terms of recency, frequency, and latency of customer activity, and in terms of the client life cycle: capturing new clients, potentiating and retaining existing clients, and winning back ex-clients. The chapter then discusses the relation of data analysis to each of the CRM phases and considers customer satisfaction and integrated CRM systems. Next, it briefly describes the characteristics of commercial CRM software products, and finally, the chapter examines example screens and functionality from a simple CRM application.
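The recency/frequency/latency measures really do need nothing more than a purchase log and a few lines of code. A minimal sketch (the sample dates are invented and this is not the book's own code):

```python
from datetime import date

def rfm_latency(purchases, today):
    """Recency: days since last purchase; frequency: purchase count;
    latency: mean gap in days between consecutive purchases."""
    dates = sorted(purchases)
    recency = (today - dates[-1]).days
    frequency = len(dates)
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    latency = sum(gaps) / len(gaps) if gaps else None
    return recency, frequency, latency

history = [date(2024, 1, 5), date(2024, 2, 4), date(2024, 3, 5)]
print(rfm_latency(history, date(2024, 4, 1)))  # (27, 3, 30.0)
```

A customer whose recency grows well beyond their historical latency is a candidate for a win-back action, which is exactly the kind of rule a spreadsheet can implement.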
This chapter discusses data quality, which is a preliminary consideration for any commercial data analysis project; the definition of quality includes the availability or accessibility of data. The chapter examines typical problems that can occur with data, including errors in the data content (textual and numerical data) and the relevance and reliability of the data, as well as how to quantitatively evaluate data quality. Finally, some typical errors due to data extraction and how to avoid them are discussed by examining a practical case study.
Research Interests: Computer Science, Social Networks, Data Mining, Graph Theory, Privacy, Online Social Networks, Information Security and Privacy, Social Networking Security and Privacy, Privacy and Data Protection, Data Privacy, Information Hiding, Information Loss, Graphs and Networks, Computer Applications, Information and Knowledge, Privacy-Preserving Data Publishing, and Anonymization
When evaluating variable data for a given business intelligence objective, we may observe that the relevant variables are not reliable or that the reliable ones are not relevant. Here's how to address this situation. Available at: http://tdwi.org/articles/2014/05/13/data-quality-relevance-vs-reliability.aspx
This poster gives an overview of an approach for anonymizing online social networks represented as graphs: (i) the end user of the data is able to specify the utility requirements; (ii) we are able to define potential adversary queries on the data. These two aspects condition the way in which we anonymize the graph, and from them we derive measures for information loss, risk, and privacy levels.
This brief talk considers some of the issues which graph data miners may encounter when analyzing online social networks represented as graphs. Such issues include representing an OSN as a graph, the elicitation of a community structure, finding similar subgraphs, and computational cost.
A key aspect of data mining, and of its success in extracting useful knowledge, is the way in which the data is represented. In this paper we propose representing the relations inherent in an e-commerce bookstore search log as a graph, which allows us to apply and customize graph metrics and algorithms in order to identify structures and key elements. This approach complements traditional transactional mining by facilitating the identification of underlying structural interrelations.
Research Interests: Computer Science, Information Retrieval, Human Computer Interaction, Eye Tracking, Data Mining, Usability and User Experience, Information Visualisation, Digital Identity, Search Engines, Ambient Intelligence, Information Fusion, Interactive Systems, User Web Search Behaviour, Internet User Behaviour, Attentive Displays, Eye and Gaze Tracking, Google Search Engine, Search Engine Optimization, and Internet
In this paper we propose a classification of different observable trends over time in user web queries. The focus is on the identification of general collective trends, based on search query keywords, of the Internet user community, and how these trends behave over a given time period. We give some representative examples of real search queries and their tendencies. From these examples we define a set of descriptive features which can be used as inputs for data modelling. Then we use a selection of unsupervised (clustering) and supervised modelling techniques to classify the trends. The results show that it is relatively easy to classify the basic hypothetical trends we have defined, and we identify which of the chosen learning techniques are best able to model the data. However, the presence of more complex, noisy or mixed trends makes the classification more difficult.
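As an illustration of the kind of descriptive features such a classifier might use (the paper's actual feature set is not reproduced here), a least-squares slope and a jitter measure already separate simple rising, falling and stable trends:

```python
def trend_features(series):
    """Descriptive features for a query-frequency time series:
    least-squares slope and a jitter (step-to-step variability) measure."""
    n = len(series)
    mx, my = (n - 1) / 2, sum(series) / n
    slope = (sum((x - mx) * (y - my) for x, y in enumerate(series))
             / sum((x - mx) ** 2 for x in range(n)))
    jitter = sum(abs(b - a) for a, b in zip(series, series[1:])) / (n - 1)
    return slope, jitter

def classify_trend(series, slope_eps=0.5):
    """Toy rule-based labelling of the basic trend types."""
    slope, _ = trend_features(series)
    if slope > slope_eps:
        return "rising"
    if slope < -slope_eps:
        return "falling"
    return "stable"

print(classify_trend([10, 12, 15, 19, 24, 30]))  # rising
print(classify_trend([30, 24, 19, 15, 12, 10]))  # falling
```

Such feature vectors can then be fed to clustering or supervised learners, as the paper does; the threshold `slope_eps` here is an arbitrary illustrative choice.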
Research Interests: Bioinformatics, Computer Science, Information Retrieval, Artificial Intelligence, Natural Language Processing, Machine Learning, Data Mining, Signal Processing, Network Security, Web Mining, Text Mining, Web Search, Computer Security, Mobile Computing, The Internet, World Wide Web, and Syntactic and Semantic Knowledge
This article presents the results of applying artificial intelligence (AI), in the form of machine learning algorithms, to identifying and predicting anomalies for corrective maintenance in a water for injection (WFI) processing plant. The aim is to avoid the yearly stoppage of the WFI plant for preventive maintenance activities, common in the industry, and to use a more scientific approach for determining the time between stoppages, which is expected to be longer after the study, thus saving money and increasing productivity.
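The article does not disclose its algorithms, but a common baseline for this kind of sensor-stream anomaly detection is a rolling z-score; a sketch with invented readings:

```python
from statistics import mean, stdev

def rolling_anomalies(readings, window=5, threshold=3.0):
    """Flag readings that deviate more than `threshold` standard deviations
    from the mean of the preceding `window` readings."""
    flags = []
    for i in range(window, len(readings)):
        ref = readings[i - window:i]
        m, s = mean(ref), stdev(ref)
        if s > 0 and abs(readings[i] - m) / s > threshold:
            flags.append(i)
    return flags

# Hypothetical temperature sensor trace with one spike
temps = [80.1, 80.3, 79.9, 80.2, 80.0, 80.1, 95.0, 80.2]
print(rolling_anomalies(temps))  # [6] -- the 95.0 spike
```

In a production setting the window, threshold and reference statistics would be tuned per sensor, and predictive models would be layered on top, as the article describes.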
The work in this paper is motivated by the need in research and applied fields for synthetic social network data, due to (i) the difficulty of obtaining real data and (ii) the data privacy issues of real data. The issues to address are, first, to obtain a graph with a social network type structure and label it with communities; the main focus is then the generation of realistic data and its assignment to and propagation within the graph. The main aim of this work is to implement an easy-to-use standalone end-user application which addresses the aforementioned issues. The methods used are the R-MAT and Louvain algorithms, with some modifications, for graph generation and community labeling respectively, and the development of a Java-based system for the data generation using an original seed assignment algorithm followed by a second algorithm for weighted and probabilistic data propagation to neighbors and other nodes. The results show that a close fit can be achieved between the initial user specification and the generated data, and that the algorithms have potential for scale-up. The system is made publicly available as a Java project on GitHub.
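R-MAT itself (Chakrabarti, Zhan and Faloutsos) is compact enough to sketch: each edge is dropped recursively into one quadrant of the adjacency matrix according to probabilities (a, b, c, d). The version below is a minimal illustration, not the modified generator used in the paper:

```python
import random

def rmat_edges(scale, n_edges, probs=(0.57, 0.19, 0.19, 0.05), rng=None):
    """Minimal R-MAT: generate n_edges directed edges on 2**scale nodes by
    recursively choosing a quadrant of the adjacency matrix per edge."""
    rng = rng or random.Random(0)
    n = 2 ** scale
    edges = set()
    while len(edges) < n_edges:
        r0, c0, size = 0, 0, n
        while size > 1:
            size //= 2
            x = rng.random()
            a, b, c, _ = probs
            if x < a:
                pass                        # top-left quadrant
            elif x < a + b:
                c0 += size                  # top-right
            elif x < a + b + c:
                r0 += size                  # bottom-left
            else:
                r0 += size; c0 += size      # bottom-right
        if r0 != c0:                        # skip self-loops
            edges.add((r0, c0))
    return edges

g = rmat_edges(scale=6, n_edges=100)
print(len(g), "edges on", 2 ** 6, "nodes")
```

The skew in (a, b, c, d) is what produces the power-law-like degree distributions typical of social networks; the values above are the canonical example settings from the R-MAT literature, not the paper's calibration.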
***BEST PAPER AWARD, SIMULTECH 2021 ***
The exceptionally high virulence of COVID-19 and the patients' precondition seem to constitute primary factors in how pro-inflammatory cytokine production evolves during the course of an infection. We present a System Dynamics Model approach for simulating the patient reaction using two key control parameters: (i) virulence, which can be "moderate" or "high", and (ii) patient precondition, which can be "healthy", "not so healthy" or "serious preconditions". In particular, we study the behaviour of inflammatory (M1) alveolar macrophages, IL-6 and the active adaptive immune system as indicators of the immune system response, together with the COVID viral load over time. The results show that it is possible to build an initial model of the system to explore the behaviour of the key attributes involved in the patient condition, virulence and response. The model suggests aspects that need further study so that it can then assist in choosing the correct immunomodulatory treatment, for instance the regime of application of an Interleukin 6 (IL-6) inhibitor (tocilizumab) that corresponds to the projected immune status of the patient. We introduce machine learning techniques to corroborate aspects of the model and propose that a dynamic model and machine learning techniques together could provide a decision support tool for ICU physicians.
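As a purely illustrative sketch of the system-dynamics idea (the rate constants below are made up and bear no relation to the paper's calibrated model), a two-compartment discrete-time simulation already reproduces the qualitative effect of a weaker immune precondition:

```python
def simulate(virulence, immune_strength, days=30, dt=1.0):
    """Toy two-compartment dynamics: viral load grows with virulence and is
    cleared by the adaptive response, which activates in proportion to load.
    All coefficients are illustrative, not clinically calibrated."""
    v, a = 1.0, 0.1          # viral load, adaptive-immune activity in [0, 1]
    track = []
    for _ in range(int(days / dt)):
        dv = virulence * v - 0.8 * a * v
        da = immune_strength * v * (1 - a) - 0.05 * a
        v = max(v + dv * dt, 0.0)
        a = min(max(a + da * dt, 0.0), 1.0)
        track.append(v)
    return track

healthy = simulate(virulence=0.3, immune_strength=0.2)
frail = simulate(virulence=0.3, immune_strength=0.02)
print(max(healthy) < max(frail))  # weaker response -> higher peak load
```

The paper's model tracks several interacting stocks (macrophages, IL-6, adaptive response); this sketch only shows how control parameters such as precondition translate into divergent trajectories.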
This paper describes an application, called Medici, designed to produce synthetic data for social network graphs, which can be used for analysis, hypothesis testing and application development by researchers and practitioners in the field. It builds on previous work by providing an integrated system and a user-friendly screen interface. It can be run with default values to produce graph data and statistics, which can then be used for further processing. The system is made publicly available as a Java project on GitHub. The annex provides a user manual with a screen-by-screen guide.
There is exciting news in recent developments suggesting the potential to treat some human cancers by stimulating the patient's own immune system. However, there is still much to understand; therefore, modelling the battle between the constituent cells of the human immune system and tumorous cells can provide significant insights, as mathematical modelling has done for immune system behaviour against virus infections. In this paper we innovate in two directions. First, we move the modelling of immune struggles from the sphere of ordinary-differential-equation models to modelling by multi-agent simulation. We highlight the advantages of multi-agent simulation, for example the consideration of elaborate spatial proximity interactions. Secondly, we move away from the realm of infectious diseases to the complex modelling of the stimulation of T-cells and their participation in fighting cancerous cell tumours.
We consider the re-identification of users of on-line social networks when they participate in several different on-line social networks, potentially using several different accounts. The re-identification of users serves several purposes: (i) commercial use so as to avoid redundant mailing to the same user; (ii) enhancement of the information available about these users by unifying information from different sources; (iii) consolidation of accounts by on-line social network providers; (iv) identification of potentially malicious users and/or bots. We highlight that all this should occur within the bounds of the data protection and privacy laws, as well as the users' expectations on such matters, to avoid backlash. In this paper, we explore this situation first by a formalization using the SAN model to conceptually structure information as a graph, which includes user and attribute type nodes. This formalization enables us to reason on two issues. First, how to identify that two or more user-accounts belong to the same user. Second, what gains in predictability are obtained after re-identification. For the first issue, we show that a set-difference approach is remarkably effective. For the second issue we explore the impact of re-identification on predictability by two different machine learning algorithms: C4.5 (decision tree induction) and SVM-SMO (a Support Vector Machine trained with the Sequential Minimal Optimization algorithm). Our results show that as predictability improves, in some cases different SAN metrics emerge as predictors.
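The set-difference idea for the first issue can be sketched as follows; the attribute encoding and the `best_match` helper are invented for illustration and are not the paper's implementation:

```python
def set_difference_score(profile_a, profile_b):
    """Normalized symmetric difference between two attribute sets: a smaller
    score means the two accounts are more likely the same user."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0

def best_match(target, candidates):
    """Return the candidate account with the smallest set-difference score."""
    return min(candidates, key=lambda name: set_difference_score(target, candidates[name]))

# Hypothetical attribute sets from two different networks
osn1_user = {"likes:jazz", "city:Barcelona", "lang:ca", "job:analyst"}
osn2 = {
    "u17": {"likes:jazz", "city:Barcelona", "lang:ca"},
    "u42": {"likes:rock", "city:Madrid", "lang:es", "job:chef"},
}
print(best_match(osn1_user, osn2))  # u17
```

In practice a decision threshold on the score would be needed to avoid forcing a match when no candidate account actually belongs to the same user.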
Two of the difficulties for data analysts of online social networks are (1) the public availability of data and (2) respecting the privacy of the users. One possible solution to both of these problems is to use synthetically generated data. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, correlations and so on. In the following work, we present and validate an approach for populating a graph topology with synthetic data which approximates an online social network. The empirical tests confirm that our approach generates a dataset which is both diverse and with a good fit to the target requirements, with a realistic modeling of noise and fitting to communities. A good match is obtained between the generated data and the target profiles and distributions, which is competitive with other state of the art methods. The data generator is also highly configurable, with a sophisticated control parameter set for different “similarity/diversity” levels.
Given that exact pair-wise graph matching has a high computational cost, different representational schemes and matching methods have been devised in order to make matching more efficient. Such methods include representing the graphs as tree structures, transforming the structures into strings and then calculating the edit distance between those strings. However, many coding schemes are complex and computationally expensive. In this paper, we present a novel coding scheme for unlabeled graphs and perform some empirical experiments to evaluate its precision and cost for the matching of neighborhood subgraphs in online social networks. We call our method OSG-L (Ordered String Graph-Levenshtein). Some key advantages of the pre-processing phase are its simplicity, compactness and lower execution time. Furthermore, our method is able to match both non-isomorphisms (near matches) and isomorphisms (exact matches), also taking into account the degrees of the neighbors, which is adequate for social network graphs.
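A simplified flavour of the idea (this is not the paper's exact OSG-L coding scheme, which handles further ordering details): encode each neighbourhood as an ordered string of neighbour degrees and compare the strings with Levenshtein distance:

```python
def degree_string(adjacency, node):
    """Encode a node's neighbourhood as its neighbours' degrees, sorted
    descending and written as characters ('a' = degree 0, 'b' = 1, ...)."""
    degs = sorted((len(adjacency[n]) for n in adjacency[node]), reverse=True)
    return "".join(chr(ord("a") + d) for d in degs)

def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

# Two toy neighbourhoods: a star-plus-triangle and a triangle
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1], 5: [6, 7], 6: [5, 7], 7: [5, 6]}
print(levenshtein(degree_string(adj, 1), degree_string(adj, 5)))
```

A distance of 0 indicates the encoded neighbourhoods are identical (an exact-match candidate); small positive distances indicate near matches, which is what makes the string encoding useful for approximate matching.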
In recent years, online social networks have become a part of everyday life for millions of individuals. Also, data analysts have found a fertile field for analyzing user behavior at individual and collective levels, for academic and commercial reasons. On the other hand, there are many risks for user privacy, as information a user may wish to keep private can become evident upon analysis. However, when data is anonymized to make it safe for publication in the public domain, information is inevitably lost with respect to the original version, a significant aspect of social networks being the local neighborhood of a user and its associated data. Current anonymization techniques are good at identifying risks and minimizing them, but not so good at maintaining the local contextual data which relates users in a social network. Thus, improving this aspect will have a high impact on the data utility of anonymized social networks. Also, there is a lack of systems which facilitate the work of a data analyst in anonymizing this type of data structure and performing empirical experiments in a controlled manner on different datasets. Hence, in the present work we address these issues by designing and implementing a sophisticated synthetic data generator together with an anonymization processor with strict privacy guarantees which takes the local neighborhood into account when anonymizing. All this is done for a complex dataset which can be fitted to a real dataset in terms of data profiles and distributions. In the empirical section we perform experiments to demonstrate the scalability of the method and the improvement in terms of reduced information loss with respect to approaches which do not consider the local neighborhood context when anonymizing.
Approximate sub-graph matching is important in many graph data mining fields. At present, current solutions can be difficult to implement, have an expensive pre-processing phase, or only work for given types of graph. In this paper a novel generic approach is presented which addresses these issues. An approximate sub-graph matcher (A-SGM) calculates the distance between the topological characteristics (footprint) of the sub-graphs to be matched, applying a weighting to the different sub-graph characteristics and those of neighbor nodes. The weights are calibrated for each dataset with a simulated annealing process, using sample sets of graph nodes to reduce computational cost and an exact isomorphism matcher as a fitness function which takes into account how well the match maintains the neighboring node degree distributions. Benchmarking is performed against several state-of-the-art methods on real and synthetic graph datasets to evaluate precision, recall and computational cost. The results show that A-SGM is competitive with state-of-the-art methods in terms of precision, recall and execution time.
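The footprint-distance idea can be sketched as follows. The feature set and the weights are invented for illustration; in the paper the weights are calibrated per dataset by simulated annealing:

```python
def footprint(adjacency, node):
    """Toy topological footprint of a node's neighbourhood subgraph."""
    neigh = adjacency[node]
    degs = [len(adjacency[n]) for n in neigh]
    return {
        "degree": len(neigh),
        "mean_neigh_degree": sum(degs) / len(degs) if degs else 0.0,
        "max_neigh_degree": max(degs, default=0),
    }

def footprint_distance(fa, fb, weights):
    """Weighted L1 distance between two footprints."""
    return sum(w * abs(fa[k] - fb[k]) for k, w in weights.items())

adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3], 5: [6, 7], 6: [5, 7], 7: [5, 6]}
w = {"degree": 1.0, "mean_neigh_degree": 0.5, "max_neigh_degree": 0.25}
d = footprint_distance(footprint(adj, 1), footprint(adj, 5), w)
print(d)
```

Comparing cheap footprints first, and reserving exact isomorphism checking for calibration, is what keeps this style of matcher fast on large graphs.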
Internet users in general, and on-line social network users in particular, are becoming more savvy about masking data they consider private. However, some of this masked data may be inferable from other data the user has not masked. Furthermore, even if a user masks all of their data, it may still be inferable from the unmasked data of their friends, due to affinities in likes and personal attributes. In contrast to the conventional data mining approach, in which a model is built for all users, we build a rule set which is individualized for each user. In this paper we propose a novel rule induction approach (incorporating predictive metrics) which enables a user to evaluate the potential risk incurred by unmasked attributes, friends' attributes and also the risk of befriending new users. We find that all of these risks are quantifiable, and a risk ranking of attributes and friends/potential friends can be individualized for each user. We give examples and use cases and confirm the effectiveness of the approach, using a sophisticated synthetic OSN dataset to define risky attribute and user combinations, which coincide with the risk ranking produced by our algorithm.
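A much-simplified sketch of the friend-based inference risk (not the paper's actual rule-induction metric): under a homophily assumption, the relative frequency of the majority value among a user's friends indicates how well an adversary could guess a masked attribute.

```python
from collections import Counter

def inference_risk(friend_values):
    """Probability that guessing the majority value among friends is
    correct; a proxy for the disclosure risk of a masked attribute
    under a homophily assumption."""
    if not friend_values:
        return 0.0, None
    value, freq = Counter(friend_values).most_common(1)[0]
    return freq / len(friend_values), value

risk, guess = inference_risk(["rock", "rock", "rock", "jazz"])
# risk = 0.75, guess = "rock"
```

Ranking such risk scores per attribute and per friend gives an individualized picture of where a user's masking is effectively undone by their social context.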
Key Features
- Illustrates cost-benefit evaluation of potential projects
- Includes vendor-agnostic advice on what to look for in off-the-shelf solutions as well as tips on building your own data mining tools
- Approachable reference can be read from cover to cover by readers of all experience levels
- Includes practical examples and case studies as well as actionable business insights from author's own experience
Description
Whether you are brand new to data mining or working on your tenth predictive analytics project, Commercial Data Mining will be there for you as an accessible reference outlining the entire process and related themes. In this book, you'll learn that your organization does not need a huge volume of data or a Fortune 500 budget to generate business using existing information assets. Expert author David Nettleton guides you through the process from beginning to end and covers everything from business objectives to data sources, and selection to analysis and predictive modeling.
Commercial Data Mining includes case studies and practical examples from Nettleton's more than 20 years of commercial experience. Real-world cases covering customer loyalty, cross-selling, and audience prediction in industries including insurance, banking, and media illustrate the concepts and techniques explained throughout the book.
Readership
Data mining professionals in business & IT.
Research Interests: Business, Information Systems (Business Informatics), Data Mining, Web Mining, Business Intelligence, Business Information Systems, Predictive Analytics, Web Usage Mining, Predictive Modeling, Decision Support, and Statistical Modeling and Machine Learning Algorithms for Data Mining, Inference, Prediction and Classification Problems
The book is aimed at people who, for professional or academic reasons, need to analyze patient data in order to make a diagnosis or prognosis. The various statistical and machine learning techniques for application to clinical data analysis are explained in detail. In addition, the book gives a structured description of a series of adapted techniques and original approaches, based on the author's experience and collaborations in this field.
SUMMARY CONTENTS: Introduction. Concepts and techniques. The fuzzy perspective. Clinical diagnosis and prognosis. Diagnosis of sleep apnea syndrome. Representation, comparison and processing of data of different types. Techniques. Summary of the key aspects in adapting and implementing the techniques. Application of the techniques to real cases. Prognosis of patients at the ICU of the Hospital Parc Taulí in Sabadell, etc.
This book is aimed both at people with no background in the analysis of commercial data and at those already engaged in it to a greater or lesser degree who are looking for a simple reference to the whole process and related topics. The author draws on both his more than 20 years of business experience and his various research projects to enrich the content, which offers an original approach to the subject. In the appendices, practical cases derived from real projects illustrate the concepts and techniques explained throughout the book.
Practically all of the methods, techniques and ideas presented, for example 'data quality', 'data mart', 'CRM - customer relationship management', 'different data sources' and 'Internet search', can be exploited both by the owner of a micro-business or a self-employed professional and by a medium-sized or large company. A large volume of data is not essential, and analysis tools are available at a price accessible to all.
In this presentation two themes are considered:
(i) A personalized privacy tool for online social network users
and (ii) a generator for synthetic online social network graph data.
Internet users in general, and on-line social network users in particular, are becoming more savvy about masking data they consider private. However, some of this masked data may be inferable from other data the user has not masked. Furthermore, even if a user masks all of their data, it may still be inferable from the unmasked data of their friends, due to affinities in likes and personal attributes. In contrast to the conventional data mining approach, in which a model is built for all users, we build a rule set which is individualized for each user. In this paper we propose a novel rule induction approach (incorporating predictive metrics) which enables a user to evaluate the potential risk incurred by unmasked attributes, friends' attributes and also the risk of befriending new users. We find that all of these risks are quantifiable, and a risk ranking of attributes and friends/potential friends can be individualized for each user. We give examples and use cases and confirm the effectiveness of the approach, using a sophisticated synthetic OSN dataset to define risky attribute and user combinations, which coincide with the risk ranking produced by our algorithm.
One of the difficulties for data analysts of online social networks is making data publicly available while respecting the privacy of the users. One alternative is to use synthetically generated data [1]. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, and so on.
In the following we present an approach for generating a graph topology and populating it with synthetic data for an online social network.
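A minimal sketch of the seed-and-propagate population step; the `populate` function and its parameters are illustrative assumptions, as the real generator uses richer profiles and edge-based propagation:

```python
import random

def populate(communities, profiles, p_copy=0.8, seed=0):
    """Fill each community with attribute data: plant the community's
    seed profile on one member, then let the remaining members copy it
    with probability p_copy (homophily), or take a random profile."""
    rng = random.Random(seed)
    data = {}
    for cid, members in communities.items():
        data[members[0]] = profiles[cid]              # the community seed
        for node in members[1:]:
            if rng.random() < p_copy:
                data[node] = profiles[cid]
            else:
                data[node] = rng.choice(list(profiles.values()))
    return data

communities = {0: [1, 2, 3], 1: [4, 5]}               # two small communities
profiles = {0: {"likes": "rock"}, 1: {"likes": "jazz"}}
data = populate(communities, profiles)                # one record per node
```

The point of the propagation is that attribute values end up correlated within communities, matching the homophily observed in real online social networks.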
One of the difficulties for data analysts of online social networks is making data publicly available while respecting the privacy of the users. One alternative is to use synthetically generated data [1]. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, and so on. In the following we present an approach for generating a graph topology and populating it with synthetic data for an online social network.
Research Interests: Social Networks, Social Networking, Social Network Analysis (SNA), Online Social Networks, Social Network Analysis (Social Sciences), Graph Data Mining, Graph Mining, Web Mining, Data Mining, Hypergraphs, Graphs, Mining Social Graphs, and Social Influence Metrics
In this brief presentation on free text document sanitization, we perform a multi-step semi-automatic sanitization process and evaluate the information loss using information retrieval metrics. The Wikileaks document corpus is used for testing.
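One simple way such information loss can be quantified (an illustrative toy, not the exact information retrieval metrics used in the presentation) is as the drop in query-term recall between the original and sanitized documents:

```python
def term_recall(query_terms, document):
    """Fraction of query terms still retrievable from the document."""
    words = set(document.lower().split())
    return sum(t.lower() in words for t in query_terms) / len(query_terms)

# Hypothetical example texts, not taken from the actual corpus
original  = "the embassy cable names the informant and the meeting place"
sanitized = "the embassy cable names the REDACTED and the meeting place"
query = ["embassy", "informant", "meeting"]

# Information loss as the drop in recall caused by sanitization
loss = term_recall(query, original) - term_recall(query, sanitized)
```

Good sanitization keeps this loss low for benign queries while making sensitive terms unretrievable.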
In this brief presentation on graph anonymization, we look at some graph modifier operators and different types of adversary information queries.
In this brief presentation we give an overview of some of the issues and work related to data privacy of on-line social network data represented as graphs. Among the issues considered are adversaries, protection methods (link addition and clustering) and data processing.
This poster gives an overview of an approach for anonymizing online social networks represented as graphs: (i) The end user of the data is able to specify the utility requirements; (ii) We are able to define potential adversary queries on the data. These two aspects condition the way in which we anonymize the graph, from which we derive measures for information loss, risk and privacy levels.
In this brief talk we describe an approach for anonymizing online social networks represented as graphs: (i) The end user of the data is able to specify the utility requirements; (ii) We are able to define potential adversary queries on the data. These two aspects condition the way in which we anonymize the graph, from which we derive measures for information loss, risk and privacy levels.
This poster gives an overview of some of the issues which graph data miners may encounter when analyzing Online Social Networks represented as graphs. Such issues include representing an OSN as a graph, the elicitation of a community structure, finding similar subgraphs and computational cost issues.
This brief talk will consider some of the issues which graph data miners may encounter when analyzing Online Social Networks represented as graphs. Such issues include representing an OSN as a graph, the elicitation of a community structure, finding similar subgraphs and computational cost issues.
The present invention proposes a new approximate sub-graph matching method which is relatively simple to implement and has a worst-case runtime computational cost of O(N²). The present invention refers to a similarity metric which approximates a modified isomorphism matcher for local neighbourhood sub-graphs, the matcher consisting of a distance metric with weighted characteristics in terms of sub-graph statistics and statistics of neighbour node degrees. The weights of the metric are calibrated using a simulated annealing process whose fitness function is a modified isomorphism matcher which takes into account how well the match maintains the neighbouring node degree distributions. The learned weights provide additional information useful for interpreting the relative importance of each characteristic.
This unclassified report consists of three testing and performance studies of the IBM 3081 mainframe which provided computer services to the AERE (Atomic Energy Research Establishment) Harwell site. (i) Job test stream for the batch system. (ii) Performance comparison of an indexed VTOC vs OS VTOC. (iii) System response time analysis using two different performance monitoring systems.
In this document we review the state of the art on graph privacy, with special emphasis on applications to online social networks, and we review how six different operators modify local topologies when activity data is included. We consider an aspect which has not been greatly covered in the specialized literature on graph privacy: the addition, deletion and disaggregation of nodes. We also cover the following key considerations: (i) the choice of six different operators to modify the graph; (ii) simulated annealing to find the optimum graph, using a fitness function based on information loss and disclosure risk; (iii) the use of heuristics to choose the graph elements (nodes, edges) to be modified, as a probability weighted by the distribution of an element's statistical characteristics (degree, clustering coefficient and path length) in the original graph; (iv) re-linking of nodes: a heuristic which finds the topology whose statistical characteristics are closest to those of the original neighborhood; (v) in the case of the aggregation of two nodes, choosing adjacent nodes rather than isomorphic topologies, in order to maintain the overall structure of the graph; (vi) incorporation of network activity as a weight on the topology characteristics; (vii) a statistically knowledgeable attacker who is able to search for regions of the graph based on statistical characteristics and map those onto a given node and its immediate neighborhood.
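Consideration (ii), the simulated-annealing search, can be sketched as follows; the single operator and the fitness function here are toy placeholders for the six graph modifiers and the information-loss/disclosure-risk fitness described above:

```python
import math
import random

def anneal(graph, operators, fitness, t0=1.0, cooling=0.95, steps=200, seed=0):
    """Minimize fitness(graph) by randomly applying modifier operators,
    accepting worse solutions with probability exp(-delta/T)."""
    rng = random.Random(seed)
    current = best = graph
    f_cur = f_best = fitness(graph)
    t = t0
    for _ in range(steps):
        candidate = rng.choice(operators)(current, rng)
        f_new = fitness(candidate)
        if f_new < f_cur or rng.random() < math.exp(-(f_new - f_cur) / t):
            current, f_cur = candidate, f_new
            if f_cur < f_best:
                best, f_best = current, f_cur
        t *= cooling                       # cool the temperature each step
    return best, f_best

# Toy setting: the "graph" is a set of edges over 4 nodes, the single
# operator toggles a random edge, and the fitness is a placeholder loss.
def toggle_edge(g, rng):
    u, v = sorted(rng.sample(range(1, 5), 2))
    return g ^ {(u, v)}       # add the edge if absent, delete it if present

edges = frozenset({(1, 2), (2, 3), (3, 4)})
fitness = lambda g: abs(len(g) - 2)    # pretend the optimum has 2 edges
best, f_best = anneal(edges, [toggle_edge], fitness)
```

In the actual system, the operator pool would contain the six graph modifiers, chosen with probabilities weighted by each element's degree, clustering coefficient and path length distributions.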
This document describes the first version (V1.0) of the graph privacy software suite. It consists of some initial assumptions, together with a textual description of the main routine (simulated annealing) and the six graph modifier operators. This is followed by a structure diagram of the whole system and the pseudo code of each of the main functions, organized in a modular design. A companion document [TR-IIIA-2010-04] details the theoretical background to the work.
Brief description: Two datasets are included which represent a graph containing 11,580 user records (nodes) and 87,322 link records (edges), respectively. We have used as an (empty) topology the Amazon product co-purchasing network and ground-truth communities dataset, which was collected by crawling the Amazon website (Yang and Leskovec 2012) and is available from the SNAP online repository (https://snap.stanford.edu/data/). We used the version with the top 5,000 communities. The graph structure was then populated with data by choosing seeds in each community and propagating from them, following a method outlined in [1]. The method has also been used to create a synthetic dataset for use in a data privacy study [2].
50K link records (edges) - corresponds to the 1K user records (nodes) file in this same section.
1K user records (nodes) - corresponds to the edges file in this same section.
Two datasets are included which represent a graph containing approximately 1K user records (nodes) and 50K link records (edges), respectively. We have followed a two-step process: (1) generate a topology using R-MAT; apply Louvain to identify some communities; then apply Louvain recursively to selected communities to obtain some smaller ones, giving a total of 10 communities; (2) populate the graph structure with data by choosing seeds in each community and propagating from them. This follows a method outlined in [1]. A new, more sophisticated version of this method will be made available soon (datasets and code).
Please reference the paper [1] when using this data and publishing results in your work. Please give me your feedback on your analysis/use of this data and suggestions for improvement.
[1] Nettleton, DF (2015) Generating synthetic online social network graph data and topologies, 3rd Workshop on Graph-based Technologies and Applications (Graph-TA), UPC, Barcelona, Spain, March 18th 2015.
Github Java source code of MEDICI: A simple to use synthetic social network data generator
https://github.com/dnettlet/MEDICI
The main project folder includes the corresponding paper (please reference it if you include MEDICI in your research) and the user manual.
The paper preprint reference is: https://arxiv.org/abs/2101.01956
Overview:
The Java and JavaFX source code corresponds to the MEDICI application, designed to produce synthetic data for social network graphs, which can be used for analysis, hypothesis testing and application development by researchers and practitioners in the field. It builds on previous work by providing an integrated system and a user-friendly screen interface. It can be run with default values to produce graph data and statistics, which can then be used for further processing. The system is made publicly available as a GitHub Java project. The annex provides a user manual with a screen-by-screen guide.
Repast (ReLogo) source code of paper "Multi-Agent Modeling Simulation of In-Vitro T-Cells for Immunologic Alternatives to Cancer Treatment"
Language: Repast (ReLogo)
Repository: https://github.com/dnettlet/AgentSim1
License: GNU GENERAL PUBLIC LICENSE Version 3
Python source code of a project to extract memes (compact semantic network structures) representing key knowledge circulating in online discussion forums.
Languages: Python
Repository: https://github.com/dnettlet/memes
License: GNU GENERAL PUBLIC LICENSE Version 3
This program takes an empty graph (just nodes and links) and a community labelling (e.g. generated by Gephi Louvain) and fills it with data, one record per node. The generated data reflects realistic properties: neighbors tend to be similar, users tend to form communities, node degree has a long-tail distribution, clustering coefficients follow realistic distributions, and so on. Please reference the associated paper
"A synthetic data generator for online social network graphs",
Social Network Analysis and Mining, Dec. 2016, 6:44
and the github code ref when you use/adapt/improve it !
https://github.com/dnettlet/SynthOSNdataGenerator
This version with no overlapping communities :)
"A synthetic data generator for online social network graphs",
Social Network Analysis and Mining, Dec. 2016, 6:44
and the github code ref when you use/adapt/improve it !
https://github.com/dnettlet/SynthOSNdataGenerator
This version with no overlapping communities :)
Research Interests:
This Master's Thesis dissertation describes my final project work for the M.Sc. in Computer Software and System Design, a 1 year intensive course at The Computing Laboratory of the University of Newcastle Upon Tyne, during 1984-1985. The... more
This Master's Thesis dissertation describes my final project work for the M.Sc. in Computer Software and System Design, a one-year intensive course at The Computing Laboratory of the University of Newcastle upon Tyne, during 1984-1985. The work was motivated by the need at the time for higher-level programming languages that allow the programmer to define and control computer operating system functions, rather than writing directly in (sequential) low-level machine and assembly code. It also provided an abstraction for addressing key issues such as concurrency, parallelism, reliability, security, the I/O disk interface, streams and queuing procedures, among others, and for implementing at different levels (from the user interface level down to the disk interface level, for example). Unix was used as the underlying system, running on a PDP-11/34 minicomputer. The main areas of work were the setting up of the standalone Concurrent Euclid (CE) software on the PDP-11/34 hardware, the development of a disk interface written in CE, the development of different operating system functions, some rewritten from an existing SOLO operating system (Brinch Hansen) written in Sequential Pascal, and a comparative study of the CE language with Concurrent Pascal, Modula-2 and Edison-11.
In this paper a brief description is given of the implementation of a 'Pepper's Ghost' apparatus for creating an optical illusion. The result is a purely non-digital effect, using only light reflection, an appropriate lighting arrangement and background. A second chamber is added which makes it possible to project a secondary, independent image superimposed on the primary one. As part of the testing of the apparatus, different objects (a cup, a bag) are made to appear and disappear, and by varying the incident light intensity, spurious visual artefacts are minimized.