David F. Nettleton is Senior Data Mining Analyst at IRIS Technology Solutions. He also collaborates with the Web Scie...
One of the difficulties for data analysts of online social networks is obtaining publicly available data while respecting the privacy of the users. One alternative is to use synthetically generated data [1]. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, and so on. In the following we present an approach for generating a graph topology and populating it with synthetic data for an online social network.
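As a rough illustration of the seed-and-propagate idea described above (not the actual generator code), the following hypothetical Java sketch assigns an attribute value to one seed node per community and then spreads it to neighbors with a fixed probability; the graph, community labels, values and probability are all invented for illustration.

```java
import java.util.*;

// Minimal sketch: populate a community-labelled graph with one attribute
// by seeding a value in each community and propagating it to neighbors.
public class SeedPropagation {
    public static void main(String[] args) {
        int n = 12;
        // adjacency list of a toy graph
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        int[][] edges = {{0,1},{1,2},{2,3},{3,0},{4,5},{5,6},{6,7},{7,4},
                         {8,9},{9,10},{10,11},{11,8},{3,4},{7,8}};
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }
        int[] community = {0,0,0,0, 1,1,1,1, 2,2,2,2};   // precomputed labels (e.g. from Louvain)
        String[] values = {"music", "sport", "travel"};  // one seed value per community
        String[] attr = new String[n];

        Random rnd = new Random(42);
        double pSpread = 0.7; // probability a neighbor copies the seed's value
        for (int c = 0; c < values.length; c++) {
            int seed = pickSeed(community, c, rnd);
            attr[seed] = values[c];
            // one-hop probabilistic propagation (a real generator would iterate further)
            for (int nb : adj.get(seed))
                if (attr[nb] == null && rnd.nextDouble() < pSpread) attr[nb] = values[c];
        }
        for (int i = 0; i < n; i++)
            System.out.printf("node %d community %d attr %s%n", i, community[i], attr[i]);
    }

    // choose a random member of community c as the seed
    static int pickSeed(int[] community, int c, Random rnd) {
        List<Integer> members = new ArrayList<>();
        for (int i = 0; i < community.length; i++) if (community[i] == c) members.add(i);
        return members.get(rnd.nextInt(members.size()));
    }
}
```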
This is the first of four chapters that deal with the analysis of data on the Internet and in an online environment. This chapter gives an introduction to website analysis and Internet search using two contrasting case studies: first, the chapter discusses how to analyze the transactional data from customer visits to a business's website, and second, it explores how Internet search can be used as a market research tool. The examples serve to illustrate how the Internet can be used as a tool for individual marketing, mass marketing, and marketing sentiment surveys. The examples also illustrate the following two business objectives: (i) analyzing activity on a website to adapt the website's commercial offering at both the general and individual levels, and (ii) gathering commercial information on the Internet from a diversity of sources in order to analyze and understand the marketplace. From a data mining perspective (and recalling the data sources in Chapter 3), throughout this chapter the Internet can be considered a meta data source born from a company's Internet presence. Following each case study, details are given of which techniques are relevant and which software applications could be used for the examples.
The area of CRM (Customer Relationship Management) has attracted a lot of attention, and many businesses that are end users of IT solutions have spent considerable amounts of money on implementing CRM systems integrated to a greater or lesser extent with their operational and business processes. However, what should be kept in mind is that CRM is a basic, common-sense idea that can be put into practice with nothing more than a spreadsheet and a modest database. This chapter introduces the reader to CRM in terms of recency, frequency, and latency of customer activity, and in terms of the client life cycle: capturing new clients, developing and retaining existing clients, and winning back ex-clients. The chapter then discusses the relation of data analysis to each of the CRM phases and considers customer satisfaction and integrated CRM systems. Next, it briefly describes the characteristics of commercial CRM software products, and finally, the chapter examines example screens and functionality from a simple CRM application.
This chapter discusses data quality, which is a preliminary consideration for any commercial data analysis project; the definition of quality includes the availability or accessibility of data. The chapter examines typical problems that can occur with data, including errors in the data content (textual and numerical data) and the relevance and reliability of the data, as well as how to quantitatively evaluate data quality. Finally, some typical errors due to data extraction and how to avoid them are discussed by examining a practical case study.
When evaluating variable data for a given business intelligence objective, we may observe that the relevant variables are not reliable or that the reliable ones are not relevant. Here's how to address this situation. Available at: http://tdwi.org/articles/2014/05/13/data-quality-relevance-vs-reliability.aspx
A key aspect of data mining and its success in extracting useful knowledge is the way in which the data is represented. In this paper we propose representing the relations inherent in an e-commerce bookstore search log as a graph, which allows us to apply and customize graph metrics and algorithms in order to identify structures and key elements. This approach complements traditional transactional mining by facilitating the identification of underlying structural interrelations.
In this paper we propose a classification of different observable trends over time in user web queries. The focus is on the identification of general collective trends, based on search query keywords, of the Internet user community and how they behave over a given time period. We give some representative examples of real search queries and their tendencies. From these examples we define a set of descriptive features which can be used as inputs for data modelling. Then we use a selection of unsupervised (clustering) and supervised modelling techniques to classify the trends. The results show that it is relatively easy to classify the basic hypothetical trends we have defined, and we identify which of the chosen learning techniques are best able to model the data. However, the presence of more complex, noisy or mixed trends makes the classification more difficult.
In this paper we describe the functionality of a decision support modelling approach to select appropriate biomaterial blends depending on their mechanical/chemical properties on the one hand and their biodegradation behaviour on the other. Firstly, a Case Based Reasoning (CBR) approach is applied to predict the expected biodegradation behaviour over time, based on historical examples and using a weighted distance metric on the material properties in order to calculate the trend curve of the new case. Secondly, a Multi-Agent System (MAS) is applied to dynamically simulate the biodegradation curve, in which the two main agents, bacteria and plastic, interact to reproduce the biodegradation kinetics over time. The results of the interpolation are very promising, with a good approximation to the real curve time series and % biodegradation, and the Multi-Agent System successfully simulates the different trend curves over time. The system has been confirmed as useful by the materials-expert end users who participated in the project, as a way to evaluate new blends a priori "in silico" and to identify and select the most promising ones before conducting the long-duration biodegradation experiments in the real environment.
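To make the CBR retrieval step concrete, here is a minimal, hypothetical Java sketch of nearest-case retrieval with a weighted distance over material properties; the property vectors, weights and curves are invented for illustration and are not the project's data.

```java
import java.util.*;

// Minimal CBR retrieval sketch: find the historical blend whose material
// properties are closest (weighted distance) to a new case, then reuse its
// stored biodegradation curve as the prediction. All data are illustrative.
public class CbrRetrieval {
    record Case(String name, double[] props, double[] curve) {}
    public static void main(String[] args) {
        List<Case> base = List.of(
                new Case("blendA", new double[]{1.2, 0.8, 30}, new double[]{5, 20, 55, 80}),
                new Case("blendB", new double[]{0.9, 1.1, 45}, new double[]{2, 10, 30, 60}));
        double[] weights = {0.5, 0.3, 0.2};          // property importance (assumed)
        double[] newBlend = {1.1, 0.9, 33};          // the new case to evaluate
        Case best = base.stream()
                .min(Comparator.comparingDouble(c -> distance(c.props(), newBlend, weights)))
                .orElseThrow();
        System.out.println("nearest case: " + best.name()
                + " predicted curve " + Arrays.toString(best.curve()));
    }
    // weighted L1 distance over the property vectors
    static double distance(double[] a, double[] b, double[] w) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += w[i] * Math.abs(a[i] - b[i]);
        return d;
    }
}
```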
This article presents the results of applying artificial intelligence (AI), in the form of machine learning algorithms, to identifying and predicting anomalies for corrective maintenance in a water for injection (WFI) processing plant. The aim is to avoid the yearly stoppage of the WFI plant for preventive maintenance activities, common in the industry, and to use a more scientific approach to setting the time between stoppages, which is expected to be longer after the study, thus saving money and increasing productivity.
Among digital technologies, Artificial Intelligence (AI) and Big Data (BD) have a proven capability to support different processes, mainly in discrete manufacturing. Despite the fact that a number of AI and BD literature reviews exist, no comprehensive review is available for the Process Industry (i.e. cement, chemicals, steel, and mining). This paper aims to provide a comprehensive review of the AI and BD literature to gain insights into their evolution in supporting the operational phases of the Process Industry. The results define the areas where AI/BD are proven to have greater impact as well as areas with gaps, for example process control (predictive models), machine learning and cyber-physical systems technologies. The sectors lagging behind are ceramics, cement and non-ferrous metals. Areas to be studied in the future include the interaction between intelligent systems, humans and the external environment; the implementation of AI for the monitoring and optimization of parameters of different operations; and ethical and social impact.
The motivation for the work in this paper is the need, in research and applied fields, for synthetic social network data, owing to (i) the difficulty of obtaining real data and (ii) the data privacy issues of real data. The issues to address are, first, to obtain a graph with a social-network-type structure and to label it with communities. The main focus is the generation of realistic data and its assignment to and propagation within the graph. The main aim of this work is to implement an easy-to-use standalone end-user application which addresses the aforementioned issues. The methods used are the R-MAT and Louvain algorithms, with some modifications, for graph generation and community labeling, respectively, together with a Java-based system for data generation using an original seed assignment algorithm followed by a second algorithm for weighted and probabilistic data propagation to neighbors and other nodes. The results show that a close fit can be achieved between the initial user specification and the generated data, and that the algorithms have potential for scale-up. The system is made publicly available as a GitHub Java project.
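For orientation, the sketch below shows the core of the generic R-MAT edge sampler (recursively choosing a quadrant of the adjacency matrix with probabilities a, b, c, d); it is a plain-vanilla illustration using the commonly quoted default probabilities, not the modified version used in this system.

```java
import java.util.Random;

// Generic R-MAT edge sampler: each edge is placed by recursively choosing one
// of the four quadrants of the adjacency matrix with probabilities a, b, c, d.
public class RmatSketch {
    static final double A = 0.57, B = 0.19, C = 0.19; // D = 1 - A - B - C
    public static void main(String[] args) {
        int scale = 8;               // 2^8 = 256 nodes
        int numEdges = 1024;
        Random rnd = new Random(7);
        for (int e = 0; e < numEdges; e++) {
            int src = 0, dst = 0;
            for (int level = 0; level < scale; level++) {
                double r = rnd.nextDouble();
                int quadrant = r < A ? 0 : r < A + B ? 1 : r < A + B + C ? 2 : 3;
                src = (src << 1) | (quadrant >> 1); // top/bottom half
                dst = (dst << 1) | (quadrant & 1);  // left/right half
            }
            System.out.println(src + "\t" + dst);
        }
    }
}
```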
***BEST PAPER AWARD, SIMULTECH 2021***
The exceptionally high virulence of COVID-19 and the patient's precondition seem to constitute primary factors in how pro-inflammatory cytokine production evolves during the course of an infection. We present a System Dynamics Model approach for simulating the patient's reaction using two key control parameters: (i) virulence, which can be "moderate" or "high", and (ii) patient precondition, which can be "healthy", "not so healthy" or "serious preconditions". In particular, we study the behaviour of inflammatory (M1) alveolar macrophages, IL-6 and the active adaptive immune system as indicators of the immune system response, together with the COVID viral load over time. The results show that it is possible to build an initial model of the system to explore the behaviour of the key attributes involved in the patient condition, virulence and response. The model suggests aspects that need further study so that it can then assist in choosing the correct immunomodulatory treatment, for instance the regime of application of an Interleukin-6 (IL-6) inhibitor (tocilizumab) that corresponds to the projected immune status of the patients. We introduce machine learning techniques to corroborate aspects of the model and propose that a dynamic model and machine learning techniques could provide a decision support tool for ICU physicians.
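To show what a system-dynamics simulation of this kind looks like mechanically, here is a deliberately toy Java sketch with three stocks (viral load, M1 macrophages, IL-6) integrated by an Euler step; every equation and constant is an invented placeholder, and none of it reflects the calibrated model in the paper.

```java
// Toy system-dynamics sketch: three stocks (viral load V, inflammatory
// macrophages M1, IL-6) integrated with a simple Euler step. All equations
// and constants are illustrative placeholders, not the paper's model.
public class CytokineToyModel {
    public static void main(String[] args) {
        double virulence = 1.5;      // "moderate" vs "high" control parameter (assumed scale)
        double precondition = 0.8;   // 1.0 = healthy, lower = weaker clearance (assumed)
        double V = 1.0, M1 = 0.1, IL6 = 0.0;
        double dt = 0.1;
        for (double t = 0; t <= 20.0; t += dt) {
            double dV   = virulence * V * (1 - V / 100.0) - 0.4 * precondition * M1 * V;
            double dM1  = 0.05 * V - 0.1 * M1;           // recruitment minus decay
            double dIL6 = 0.3 * M1 - 0.2 * IL6;          // secretion minus clearance
            V += dt * dV; M1 += dt * dM1; IL6 += dt * dIL6;
            if (Math.round(t / dt) % 20 == 0)            // print every ~2 time units
                System.out.printf("t=%5.1f  V=%7.2f  M1=%6.2f  IL6=%6.2f%n", t, V, M1, IL6);
        }
    }
}
```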
In the increasingly pressing context of improving recycling, optical technologies present a broad potential to support the adequate sorting of plastics. Nevertheless, the commercially available solutions (for example, employing near-infrared spectroscopy) generally focus on identifying mono-materials of a few selected types which currently have market interest as secondary materials. Current progress in the photonic sciences, together with advanced data analysis such as artificial intelligence, makes it possible to address practical challenges that were previously not feasible, for example classifying more complex materials. In the present paper, the different techniques are initially reviewed based on their main characteristics. Then, based on the academic literature, their suitability for monitoring the composition of multi-materials, such as different types of multi-layered packaging and fibre-reinforced polymer composites as well as black plastics used in the motor vehicle industry, is discussed. Finally, some commercial systems with applications in those sectors are also presented. This review mainly focuses on the materials identification step (taking place after waste collection and before sorting and reprocessing) but, in outlook, further insights on sorting are given as well as future prospects which can contribute to increasing the circularity of the plastic composites' value chains.
This paper describes an application, called Medici, designed to produce synthetic data for social network graphs, which can be used for analysis, hypothesis testing and application development by researchers and practitioners in the field. It builds on previous work by providing an integrated system and a user-friendly screen interface. It can be run with default values to produce graph data and statistics, which can then be used for further processing. The system is made publicly available as a GitHub Java project. The annex provides a user manual with a screen-by-screen guide.
Environmental impacts and consumer concerns have necessitated the study of bio-based materials as alternatives to petrochemicals for packaging applications. The purpose of this review is to summarize synthetic and non-synthetic materials feasible for packaging and textile applications, routes of upscaling, (industrial) applications, evaluation of sustainability, and end-of-life options. The outlined bio-based materials include polylactic acid, polyethylene furanoate, polybutylene succinate, and non-synthetically produced polymers such as polyhydroxyalkanoates, cellulose, starch, proteins, lipids, and waxes. Further emphasis is placed on modification techniques (coating and surface modification), biocomposites, multilayers, and additives used to adjust properties, especially barriers to gas and moisture, and to tune biodegradability. Overall, this review provides a holistic view of bio-based packaging materials, including processing and an evaluation of sustainability and recycling options. Thus, this review contributes to increasing the knowledge of available sustainable bio-based packaging materials and enhancing the transfer of scientific results into applications.
There is exciting news in recent developments suggesting the potential to treat some human cancers by stimulating the patient's own immune system. However, there is still much to understand; therefore, modelling the battle between the constituent cells of the human immune system and tumorous cells can provide significant insights, as mathematical modelling has done for the immune system's behaviour against virus infections. In this paper we innovate in two directions. First, we move the modelling of immune struggles from the sphere of ordinary differential equation models to modelling by multi-agent simulation. We highlight the advantages of multi-agent simulation, for example the consideration of elaborate spatial proximity interactions. Secondly, we move away from the realm of infectious diseases to the complex modelling of the stimulation of T-cells and their participation in fighting cancerous cell tumours.
Background: In this study, we compared four models for predicting rice blast disease: two operational process-based models (Yoshino and the Water Accounting Rice Model (WARM)) and two approaches based on machine learning algorithms (M5Rules and Recurrent Neural Networks (RNN)), the former inducing a rule-based model and the latter building a neural network. In situ telemetry is important to obtain quality in-field data for predictive models, and this was a key aspect of the RICE-GUARD project on which this study is based. To the authors' knowledge, this is the first time process-based and machine learning modelling approaches for supporting plant disease management have been compared. Results: The results clearly showed that the models succeeded in providing a warning of rice blast onset and presence, thus representing suitable solutions for preventive remedial actions targeting the mitigation of yield losses and the reduction of fungicide use. All methods gave significant "signals" during the "early warning" period, with a similar level of performance. M5Rules and WARM gave the maximum average normalized scores of 0.80 and 0.77, respectively, whereas Yoshino gave the best score for one site (Kalochori 2015). The best average values of r, r² and %MAE (Mean Absolute Error) for the machine learning models were 0.70, 0.50 and 0.75, respectively, and for the process-based models the corresponding values were 0.59, 0.40 and 0.82. Thus, the machine learning models were found to be competitive with the process-based models. This result has relevant implications for the operational use of the models, since most of the available studies are limited to the analysis of the relationship between the model outputs and the incidence of rice blast. The results also showed that machine learning methods approximated the performance of two process-based models used for years in operational contexts. Conclusions: Process-based and data-driven models can be used to provide early warnings to anticipate rice blast and detect its presence, thus supporting fungicide applications. Data-driven models derived from machine learning methods are a viable alternative to process-based approaches and - in cases where training datasets are available - offer a potentially greater adaptability to new contexts.
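As a quick reference for the agreement statistics quoted above, this small Java sketch computes Pearson's r, r² and a range-normalized mean absolute error for paired observed/predicted series; the formulas are the generic textbook ones (the paper's exact %MAE normalization may differ), and the data are made up.

```java
// Generic agreement statistics used when comparing model output against
// observations: Pearson's r, r^2, and MAE as a fraction of the observed range.
public class ModelScores {
    public static void main(String[] args) {
        double[] obs  = {0.10, 0.40, 0.35, 0.80, 0.90, 0.60};
        double[] pred = {0.15, 0.35, 0.40, 0.70, 0.85, 0.65};
        double r = pearson(obs, pred);
        System.out.printf("r=%.3f  r2=%.3f  %%MAE=%.3f%n", r, r * r, maePct(obs, pred));
    }
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }
    static double maePct(double[] obs, double[] pred) {
        double mae = 0, min = obs[0], max = obs[0];
        for (int i = 0; i < obs.length; i++) {
            mae += Math.abs(obs[i] - pred[i]);
            min = Math.min(min, obs[i]); max = Math.max(max, obs[i]);
        }
        return (mae / obs.length) / (max - min); // normalized by the observed range
    }
}
```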
We consider the re-identification of users of on-line social networks when they participate in several different on-line social networks, potentially using several different accounts. The re-identification of users serves several purposes: (i) commercial use, so as to avoid redundant mailing to the same user; (ii) enhancement of the information available about these users by unifying information from different sources; (iii) consolidation of accounts by on-line social network providers; (iv) identification of potentially malicious users and/or bots. We highlight that all this should occur within the bounds of data protection and privacy laws, as well as the users' expectations on such matters, to avoid backlash. In this paper, we explore this situation first by a formalization using the SAN model to conceptually structure information as a graph, which includes user and attribute type nodes. This formalization enables us to reason about two issues: first, how to identify that two or more user accounts belong to the same user; second, what gains in predictability are obtained after re-identification. For the first issue, we show that a set-difference approach is remarkably effective. For the second issue, we explore the impact of re-identification on the predictability of two different machine learning algorithms: C4.5 (decision tree induction) and SVM-SMO (a Support Vector Machine trained with Sequential Minimal Optimization). Our results show that as predictability improves, in some cases different SAN metrics emerge as predictors.
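The paper's set-difference matcher is not reproduced here, but its gist (scoring how little the attribute sets attached to two accounts differ) can be sketched as follows; the attribute encoding and the normalization are simplified stand-ins.

```java
import java.util.*;

// Simplified set-difference comparison of two user accounts: the smaller the
// symmetric difference of their attribute sets, the more likely they share
// one owner. Attributes and the scoring rule are illustrative.
public class AccountMatch {
    public static void main(String[] args) {
        Set<String> acctA = new HashSet<>(Arrays.asList(
                "age:30-39", "city:Barcelona", "likes:jazz", "lang:en"));
        Set<String> acctB = new HashSet<>(Arrays.asList(
                "age:30-39", "city:Barcelona", "likes:jazz", "likes:chess"));
        System.out.printf("difference score = %.2f%n", diffScore(acctA, acctB));
    }
    static double diffScore(Set<String> a, Set<String> b) {
        Set<String> symDiff = new HashSet<>(a);
        symDiff.addAll(b);                       // union
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                      // intersection
        symDiff.removeAll(inter);                // union minus intersection
        return (double) symDiff.size() / (a.size() + b.size()); // 0 = identical sets
    }
}
```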
From their origins in the sociological field, memes have recently become of interest in the context of 'viral' transmission of basic information units (memes) in online social networks. However, much work still needs to be done in terms of metrics and practical data processing issues. In this paper we define a theoretical basis and a processing system for extracting and matching memes from free-format text. The system facilitates the work of a text analyst in extracting this type of data structure from online text corpora and in performing empirical experiments in a controlled manner. The general aspects of the solution are the automatic processing of unstructured text without the need for preprocessing (such as labelling and tagging), the identification of co-occurrences of concepts and their corresponding relations, the construction of semantic networks, and the selection of the top memes. The system integrates these processes, which are generally separate in other state-of-the-art systems. The proposed system is important because unstructured online text content is growing at a greater rate than other content (e.g. semi-structured, structured), and integrated, automated systems for knowledge extraction from this content will be increasingly important in the future. To illustrate the method and metrics we process several real online discussion forums, extracting the principal concepts and relations, building the memes and then identifying the key memes for each document corpus using a sophisticated matching process. The results show that our method can automatically extract coherent key knowledge from free text, which is corroborated by benchmarking against a set of other text analysis approaches, as well as by a user study evaluation.
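As a toy illustration of the co-occurrence step (concepts appearing within the same sliding window become linked, with counts as edge weights in a rudimentary semantic network), consider the following Java sketch; real concept extraction is reduced here to a stop-word filter, which is a simplification the full system does not rely on.

```java
import java.util.*;

// Toy co-occurrence extraction: tokens appearing within a sliding window of
// each other are linked, and the counts become the edge weights of a
// rudimentary semantic network.
public class CooccurrenceSketch {
    static final Set<String> STOP = new HashSet<>(Arrays.asList(
            "the", "a", "of", "and", "in", "is", "to"));
    public static void main(String[] args) {
        String text = "the network of users shares memes and memes spread in the network";
        String[] tokens = text.toLowerCase().split("\\W+");
        int window = 3;
        Map<String, Integer> edges = new TreeMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (STOP.contains(tokens[i])) continue;
            for (int j = i + 1; j < Math.min(i + 1 + window, tokens.length); j++) {
                if (STOP.contains(tokens[j]) || tokens[i].equals(tokens[j])) continue;
                // order the pair so (a,b) and (b,a) share one edge key
                String key = tokens[i].compareTo(tokens[j]) < 0
                        ? tokens[i] + "--" + tokens[j] : tokens[j] + "--" + tokens[i];
                edges.merge(key, 1, Integer::sum);
            }
        }
        edges.forEach((k, v) -> System.out.println(k + " : " + v));
    }
}
```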
Two of the difficulties for data analysts of online social networks are (1) obtaining publicly available data and (2) respecting the privacy of the users. One possible solution to both of these problems is to use synthetically generated data. However, this presents a series of challenges related to generating a realistic dataset in terms of topologies, attribute values, communities, data distributions, correlations and so on. In the following work, we present and validate an approach for populating a graph topology with synthetic data which approximates an online social network. The empirical tests confirm that our approach generates a dataset which is both diverse and a good fit to the target requirements, with realistic modeling of noise and fitting to communities. A good match is obtained between the generated data and the target profiles and distributions, which is competitive with other state-of-the-art methods. The data generator is also highly configurable, with a sophisticated control parameter set for different "similarity/diversity" levels.
Given that exact pair-wise graph matching has a high computational cost, different representational schemes and matching methods have been devised in order to make matching more efficient. Such methods include representing the graphs as tree structures, transforming the structures into strings and then calculating the edit distance between those strings. However, many coding schemes are complex and computationally expensive. In this paper, we present a novel coding scheme for unlabeled graphs and perform some empirical experiments to evaluate its precision and cost for the matching of neighborhood subgraphs in online social networks. We call our method OSG-L (Ordered String Graph-Levenshtein). Some key advantages of the pre-processing phase are its simplicity, compactness and lower execution time. Furthermore, our method is able to match both non-isomorphisms (near matches) and isomorphisms (exact matches), also taking into account the degrees of the neighbors, which is adequate for social network graphs.
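The exact OSG-L encoding is defined in the paper; the sketch below only captures the flavor under simplified assumptions: a neighborhood is summarized as a sorted neighbor-degree string, and two neighborhoods are compared by the standard Levenshtein distance between those strings.

```java
import java.util.Arrays;

// Simplified flavor of ordered-string graph matching: encode a node's
// neighborhood as its sorted neighbor-degree sequence, then compare two
// encodings with the standard Levenshtein distance.
public class OsgSketch {
    public static void main(String[] args) {
        int[] degreesA = {1, 4, 2, 3};   // neighbor degrees of node A (made up)
        int[] degreesB = {2, 4, 3, 3};   // neighbor degrees of node B (made up)
        System.out.println("distance = " + levenshtein(encode(degreesA), encode(degreesB)));
    }
    static String encode(int[] neighborDegrees) {
        int[] d = neighborDegrees.clone();
        Arrays.sort(d);
        StringBuilder sb = new StringBuilder();
        for (int x : d) sb.append((char) ('a' + Math.min(x, 25))); // bucket degree to a symbol
        return sb.toString();
    }
    static int levenshtein(String s, String t) {
        int[][] dp = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= t.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                dp[i][j] = Math.min(
                        dp[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1),
                        Math.min(dp[i - 1][j], dp[i][j - 1]) + 1);
        return dp[s.length()][t.length()];
    }
}
```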
In recent years, online social networks have become a part of everyday life for millions of individuals, and data analysts have found a fertile field for analyzing user behavior at individual and collective levels, for academic and commercial reasons. On the other hand, there are many risks for user privacy, as information a user may wish to keep private can become evident upon analysis. However, when data is anonymized to make it safe for publication in the public domain, information is inevitably lost with respect to the original version, a significant aspect of social networks being the local neighborhood of a user and its associated data. Current anonymization techniques are good at identifying risks and minimizing them, but not so good at maintaining the local contextual data which relate users in a social network. Thus, improving this aspect will have a high impact on the data utility of anonymized social networks. Also, there is a lack of systems which facilitate the work of a data analyst in anonymizing these types of data structures and in performing empirical experiments in a controlled manner on different datasets. Hence, in the present work we address these issues by designing and implementing a sophisticated synthetic data generator together with an anonymization processor with strict privacy guarantees which takes the local neighborhood into account when anonymizing. All this is done for a complex dataset which can be fitted to a real dataset in terms of data profiles and distributions. In the empirical section we perform experiments to demonstrate the scalability of the method and the improvement in terms of reduced information loss with respect to approaches which do not consider the local neighborhood context when anonymizing.
Approximate sub-graph matching is important in many graph data mining fields. At present, current solutions can be difficult to implement, have an expensive pre-processing phase, or only work for given types of graph. In this paper a novel generic approach is presented which addresses these issues. An approximate sub-graph matcher (A-SGM) calculates the distance between the topological characteristics (footprint) of the sub-graphs to be matched, applying a weighting to the different sub-graph characteristics and those of neighbor nodes. The weights are calibrated for each dataset with a simulated annealing process using sample sets of graph nodes to reduce computational cost, and an exact isomorphism matcher as a fitness function which takes into account how well the match maintains the neighboring node degree distributions. Benchmarking is performed on several state of the art methods and real and synthetic graph datasets to evaluate the precision, recall and computational cost. The results show that the A-SGM is competitive with state of the art methods in terms of precision, recall and execution time.
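A hedged illustration of the footprint idea: each sub-graph is reduced to a small vector of topological statistics, and two footprints are compared with a weighted distance. The features shown (degree, clustering coefficient, mean neighbor degree) and the weights are assumptions for illustration; in the paper the weights are calibrated by simulated annealing rather than hand-set.

```java
import java.util.*;

// Illustrative "footprint" comparison for approximate sub-graph matching:
// summarize each node's neighborhood by a few topological statistics and
// compare footprints with a weighted L1 distance.
public class FootprintMatch {
    public static void main(String[] args) {
        // toy undirected graph as an adjacency map
        Map<Integer, Set<Integer>> g = new HashMap<>();
        int[][] edges = {{0,1},{0,2},{1,2},{2,3},{3,4},{4,5},{5,3},{1,4}};
        for (int[] e : edges) {
            g.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            g.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        double[] w = {0.5, 0.3, 0.2}; // assumed weights: degree, clustering, mean nbr degree
        double[] f1 = footprint(g, 2), f2 = footprint(g, 4);
        double d = 0;
        for (int i = 0; i < w.length; i++) d += w[i] * Math.abs(f1[i] - f2[i]);
        System.out.printf("footprint distance(2,4) = %.3f%n", d);
    }
    static double[] footprint(Map<Integer, Set<Integer>> g, int v) {
        Set<Integer> nbrs = g.get(v);
        int deg = nbrs.size(), links = 0, nbrDegSum = 0;
        for (int a : nbrs) {
            nbrDegSum += g.get(a).size();
            for (int b : nbrs) if (a < b && g.get(a).contains(b)) links++;
        }
        double cc = deg < 2 ? 0 : 2.0 * links / (deg * (deg - 1)); // clustering coefficient
        return new double[]{deg, cc, (double) nbrDegSum / deg};
    }
}
```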
Internet users in general, and on-line social network users in particular, are becoming more savvy about masking data they consider private. However, some of this masked data may be inferable from other data the user has not masked. Furthermore, even if a user masks all of their data, it may still be inferable from the unmasked data of their friends, due to affinities in likes and personal attributes. In contrast to the conventional data mining approach, in which a model is built for all users, we build a rule set which is individualized for each user. In this paper we propose a novel rule induction approach (incorporating predictive metrics) which enables a user to evaluate the potential risk incurred by unmasked attributes, friends' attributes and also the risk of befriending new users. We find that all of these risks are quantifiable and that a risk ranking of attributes and friends/potential friends can be individualized for each user. We give examples and use cases and confirm the effectiveness of the approach, using sophisticated synthetic OSN data to define risk attribute and user combinations which coincide with the risk ranking produced by our algorithm.
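As a schematic of per-user risk ranking (not the paper's induction algorithm), the following Java sketch stores inference rules of the form "these visible attributes imply that hidden attribute with confidence c" and ranks a user's unmasked attributes by the strongest inference they enable; the rules, attributes and confidences are all invented.

```java
import java.util.*;

// Schematic per-user risk ranking: each rule says that a set of visible
// attributes lets an adversary infer a hidden one with some confidence; an
// unmasked attribute's risk score is the strongest inference it enables.
public class PrivacyRiskRank {
    record Rule(Set<String> ifVisible, String infers, double confidence) {}
    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(Set.of("city", "employer"), "income-band", 0.80),
                new Rule(Set.of("likes:band-x", "age"), "politics", 0.60),
                new Rule(Set.of("city"), "timezone", 0.95));
        Set<String> unmasked = Set.of("city", "employer", "age"); // this user's profile
        Map<String, Double> risk = new HashMap<>();
        for (Rule r : rules)
            if (unmasked.containsAll(r.ifVisible())) {            // rule fires for this user
                System.out.println("inferable: " + r.infers() + " (conf " + r.confidence() + ")");
                for (String a : r.ifVisible())
                    risk.merge(a, r.confidence(), Math::max);     // worst case per attribute
            }
        risk.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getKey() + " risk " + e.getValue()));
    }
}
```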


Key Features

- Illustrates cost-benefit evaluation of potential projects

- Includes vendor-agnostic advice on what to look for in off-the-shelf solutions as well as tips on building your own data mining tools

- Approachable reference that can be read from cover to cover by readers of all experience levels

- Includes practical examples and case studies as well as actionable business insights from the author's own experience

Description

Whether you are brand new to data mining or working on your tenth predictive analytics project, Commercial Data Mining will be there for you as an accessible reference outlining the entire process and related themes. In this book, you'll learn that your organization does not need a huge volume of data or a Fortune 500 budget to generate business using existing information assets. Expert author David Nettleton guides you through the process from beginning to end and covers everything from business objectives and data sources to data selection, analysis, and predictive modeling.

Commercial Data Mining includes case studies and practical examples from Nettleton's more than 20 years of commercial experience. Real-world cases covering customer loyalty, cross-selling, and audience prediction in industries including insurance, banking, and media illustrate the concepts and techniques explained throughout the book.

Readership

Data mining professionals in business & IT.
The book is aimed at people who, for professional or academic reasons, need to analyze patient data in order to produce a diagnosis or a prognosis. The various statistical and machine learning techniques are explained in detail for application to the analysis of clinical data. In addition, the book describes, in a structured way, a series of adapted techniques and original approaches, based on the author's experience and collaborations in this field.

SUMMARY CONTENTS: Introduction. Concepts and techniques. The fuzzy perspective. Clinical diagnosis and prognosis. Diagnosis of sleep apnea syndrome. Representation, comparison and processing of data of different types. Techniques. Summary of the key aspects in adapting and implementing the techniques. Application of the techniques to real cases. Prognosis of patients at the ICU of the Parc Taulí Hospital, Sabadell, etc.
This book is aimed both at people with no background in commercial data analysis and at those already engaged in it to a greater or lesser degree who are looking for a straightforward reference covering the whole process and related topics. The author draws on more than 20 years of business experience, as well as his various research projects, to enrich the content, which offers an original approach to the subject. In the appendices, practical case studies derived from real projects serve to illustrate the concepts and techniques explained throughout the book.

Practically all of the methods, techniques and ideas presented, for example 'data quality', 'data mart', 'CRM (customer relationship management)', 'different data sources' and 'Internet search', can be exploited by the owner of a micro-business or a self-employed professional as much as by a medium-sized or large company. A large volume of data is not essential, and analysis tools are available at a price accessible to all.
A malevolent data miner can use data mining techniques to learn confidential information about social networking site users that the users did not disclose, and thereby breach the individual privacy of a social networking site user. However, the information items in a social network are not only the attributes of users but also the relationships. The attributes of the neighbours and the characteristics of the connections can also determine a user profile, even when very little or no information has been shared. Thus, it is a challenge to empower users by alerting them to unmasked attributes disclosed by a particular user or their neighbour connections. This approach gathers information for SNS users and applies the proposed Cum_Sensitivity and Total_Count algorithms to find sensitive rules and their corresponding unmasked attributes (i.e. those used in the conjunctive rule). Then, it suggests the user suppress those high-risk attributes or some of their values. In addition, the potential risk incurred by friends' attributes is also quantifiable, and a risk ranking of attributes and friends can be individualized for each user.
In this presentation two themes are considered:

(i) A personalized privacy tool for online social network users
and (ii) a generator for synthetic online social network graph data.
It is widely accepted that the field of Data Analytics has entered the era of Big Data. In particular, it has to deal with so-called Big Graph Data, which is the focus of this paper. Graph data is present in many fields, such as social networks, biological networks, computer networks, and so on. It is recognized that data analysts benefit from interactive real-time data exploration techniques, such as clustering and zoom capabilities on the clusters. However, although clustering is one of the key aspects of graph data analysis, there is a lack of scalable graph clustering algorithms which would support interactive techniques. This paper presents an approach based on combining graph clustering and graph coordinate system embedding, which shows promising results in initial experiments. Our approach also incorporates both structural and attribute information, which can lead to a more meaningful clustering.
In this brief presentation on free text document sanitization, we perform a multi-step semi-automatic sanitization process and evaluate the information loss using information retrieval metrics. The Wikileaks document corpus is used for testing.
In this brief presentation on graph anonymization, we look at some graph modifier operators and different types of adversary information queries.
In this brief presentation we give an overview of some of the issues and work related to data privacy of on-line social network data represented as graphs. Among the issues considered are adversaries, protection methods (link addition and clustering) and data processing.
This poster gives an overview of an approach for anonymizing online social networks represented as graphs: (i) the end user of the data is able to specify the utility requirements; (ii) we are able to define potential adversary queries on the data. These two aspects condition the way in which we anonymize the graph, and from them we derive measures for information loss, risk and privacy levels.
In this brief talk we describe an approach for anonymizing online social networks represented as graphs: (i) the end user of the data is able to specify the utility requirements; (ii) we are able to define potential adversary queries on the data. These two aspects condition the way in which we anonymize the graph, and from them we derive measures for information loss, risk and privacy levels.
This poster gives an overview of some of the issues which graph data miners may encounter when analyzing Online Social Networks represented as graphs. Such issues include representing an OSN as a graph, the elicitation of a community structure, finding similar subgraphs and computational cost issues.
This brief talk will consider some of the issues which graph data miners may encounter when analyzing Online Social Networks represented as graphs. Such issues include representing an OSN as a graph, the elicitation of a community structure, finding similar subgraphs and computational cost issues.
The present invention proposes a new approximate sub-graph matching method with the advantage of being relatively simple to implement, requiring a worst-case runtime computational cost of O(N²). The present invention refers to a similarity metric which approximates a modified isomorphism matcher for local neighbourhood sub-graphs, the matcher consisting of a distance metric with weighted characteristics in terms of sub-graph statistics and statistics of neighbour node degrees. The weights of the metric are calibrated using a simulated annealing process which uses as a fitness function a modified isomorphism matcher that takes into account how well the match maintains the neighbouring node degree distributions. The learned weights provide additional information useful for interpreting the relative importance of each characteristic.
This unclassified report consists of three testing and performance studies of the IBM 3081 mainframe which provided computer services to the AERE (Atomic Energy Research Establishment) Harwell site: (i) a job test stream for the batch system; (ii) a performance comparison of an indexed VTOC vs the OS VTOC; (iii) a system response time analysis using two different performance monitoring systems.
In this document we review the state of the art in graph privacy, with special emphasis on applications to online social networks, and we review how six different operators modify local topologies when activity data is included. We consider an aspect which has not been greatly covered in the specialized literature on graph privacy: the adding, deleting and disaggregation of nodes. We also cover the following key considerations: (i) the choice of six different operators to modify the graph; (ii) simulated annealing to find the optimum graph, using a fitness function based on information loss and disclosure risk; (iii) the use of heuristics to choose the graph elements (nodes, edges) to be modified, as a probability weighted by the distribution of an element's statistical characteristics (degree, clustering coefficient and path length) in the original graph; (iv) the re-linking of nodes: a heuristic which finds the topology whose statistical characteristics are closest to those of the original neighborhood; (v) in the case of the aggregation of two nodes, choosing adjacent nodes rather than isomorphic topologies, in order to maintain the overall structure of the graph; (vi) the incorporation of network activity as a weight on the topology characteristics; (vii) a statistically knowledgeable attacker who is able to search for regions of the graph based on statistical characteristics and map those onto a given node and its immediate neighborhood.
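Item (ii) follows the standard simulated-annealing pattern; a generic Java skeleton looks like the following, with a stand-in fitness in place of the information-loss/disclosure-risk measure and a placeholder mutation in place of the six graph operators.

```java
import java.util.Random;

// Generic simulated-annealing skeleton: repeatedly apply a random modifier,
// accept worse states with a temperature-dependent probability, and cool
// down. Fitness and mutation are placeholders for the information-loss/risk
// measure and the six graph operators described above.
public class AnnealSkeleton {
    static Random rnd = new Random(1);
    public static void main(String[] args) {
        double state = rnd.nextDouble() * 10;       // stand-in for a graph
        double temp = 1.0, cooling = 0.995;
        double best = state, bestFit = fitness(state);
        for (int step = 0; step < 5000; step++) {
            double candidate = mutate(state);
            double delta = fitness(candidate) - fitness(state);
            // accept improvements always, worse moves with probability e^(-delta/T)
            if (delta < 0 || rnd.nextDouble() < Math.exp(-delta / temp)) state = candidate;
            if (fitness(state) < bestFit) { best = state; bestFit = fitness(state); }
            temp *= cooling;
        }
        System.out.printf("best=%.4f fitness=%.4f%n", best, bestFit);
    }
    // placeholder fitness: imagine infoLoss(graph) + disclosureRisk(graph)
    static double fitness(double s) { return (s - 3.7) * (s - 3.7); }
    // placeholder for choosing and applying one of the six graph operators
    static double mutate(double s) { return s + (rnd.nextDouble() - 0.5); }
}
```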
This document describes the first version (V1.0) of the graph privacy software suite. It consists of some initial assumptions, together with a textual description of the main routine (simulated annealing) and the six graph modifier operators. This is followed by a structure diagram of the whole system and the pseudo code of each of the main functions, organized in a modular design. A companion document [TR-IIIA-2010-04] details the theoretical background to the work.
Brief description: Two datasets are included which represent a graph, containing 11,580 user records (nodes) and 87,322 link records (edges), respectively. We have used as an (empty) topology the Amazon product co-purchasing network and ground-truth communities dataset, which was collected by crawling the Amazon website by Yang and Leskovec (2012) and is available from the SNAP online repository (https://snap.stanford.edu/data/). We used the version which has the top 5,000 communities. The graph structure was then populated with data by choosing seeds in each community and propagating from them. This follows a method outlined in [1]. The method has also been used to create a synthetic dataset for use in a data privacy study [2].
50K link records (edges) - corresponds to the 1K user records (nodes) file in this same section.
1K user records (nodes) - corresponds to the edges file in this same section.
Two datasets are included, representing a graph that contains approx. 1K user records (nodes) and 50K link records (edges), respectively. We followed a two-step process: (1) generate a topology using R-MAT; apply Louvain to identify some communities; then apply Louvain recursively to selected communities to obtain some smaller ones, giving a total of 10 communities; (2) populate the graph structure with data by choosing seeds in each community and propagating from them. This follows a method outlined in [1]. A new, more sophisticated version of this method (datasets and code) will be made available soon.
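A minimal sketch of step (1), assuming networkx ≥ 2.8 for louvain_communities. networkx has no built-in R-MAT generator, so a small recursive-quadrant version is hand-rolled here; the quadrant probabilities and the community-size threshold for re-splitting are illustrative assumptions.

```python
# Hedged sketch of the two-step topology pipeline: R-MAT generation,
# then Louvain applied recursively to oversized communities.
import random

import networkx as nx

def rmat_edge(n_bits: int, p=(0.57, 0.19, 0.19, 0.05)) -> tuple[int, int]:
    """Draw one edge by recursively choosing adjacency-matrix quadrants."""
    u = v = 0
    a, b, c, _d = p
    for _ in range(n_bits):
        r = random.random()
        u, v = u << 1, v << 1
        if r < a:
            pass                 # top-left quadrant
        elif r < a + b:
            v |= 1               # top-right
        elif r < a + b + c:
            u |= 1               # bottom-left
        else:
            u, v = u | 1, v | 1  # bottom-right
    return u, v

def rmat_graph(n_bits: int = 10, n_edges: int = 50_000) -> nx.Graph:
    """Approx. 1K nodes (2**10) and 50K edges; duplicates are retried."""
    g = nx.Graph()
    g.add_nodes_from(range(1 << n_bits))
    while g.number_of_edges() < n_edges:
        u, v = rmat_edge(n_bits)
        if u != v:
            g.add_edge(u, v)
    return g

def communities_recursive(g: nx.Graph, max_size: int = 200) -> list[set]:
    """Louvain, then re-apply Louvain to communities larger than max_size."""
    result = []
    for comm in nx.community.louvain_communities(g):
        if len(comm) > max_size:
            result.extend(nx.community.louvain_communities(g.subgraph(comm)))
        else:
            result.append(comm)
    return result
```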
Please reference the paper [1] when using this data and publishing results in your work. Please also give me your feedback on your analysis/use of this data, and any suggestions for improvement.
[1] Nettleton, D.F. (2015). Generating synthetic online social network graph data and topologies. 3rd Workshop on Graph-based Technologies and Applications (Graph-TA), UPC, Barcelona, Spain, March 18, 2015.
In this presentation, preliminary results are given for the modeling and calibration of two different industrial winding MIMO (Multiple Input Multiple Output) processes using machine learning techniques. In contrast to previous approaches, which have typically used "black-box" linear statistical methods together with a definition of the mechanical behavior of the process, the present work builds a model using non-linear machine learning algorithms together with a "white-box" rule induction technique to create a supervised model of the fitting error between the expected and real force measures. The final objective is to build a precise model of the winding process in order to control the tension of the material being wound in the first case, and the friction of the material passing through the die in the second case.
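A hedged sketch of the two-stage idea on synthetic data: a gradient-boosted regressor stands in for the non-linear "black-box" model, and a shallow decision tree printed as rules stands in for the "white-box" rule-induction step. The feature names, the data and the scikit-learn choices are assumptions, not the methods actually used in the presentation.

```python
# Hedged sketch: non-linear regressor for the force, then a shallow
# decision tree as a white-box model of the fitting error.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # hypothetical: roll speed, torque, diameter
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)  # "measured" force

black_box = GradientBoostingRegressor().fit(X, y)
error = y - black_box.predict(X)     # fitting error: expected vs. real force

white_box = DecisionTreeRegressor(max_depth=3).fit(X, error)
print(export_text(white_box, feature_names=["speed", "torque", "diameter"]))
```

The printed tree gives human-readable rules describing under which operating conditions the black-box model over- or under-estimates the force, which is the kind of insight a white-box error model provides.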
Github Java source code of MEDICI: A simple to use synthetic social network data generator

https://github.com/dnettlet/MEDICI
The main project folder includes the corresponding paper (please reference it if you include MEDICI in your research) and the user manual.

The paper preprint reference is: https://arxiv.org/abs/2101.01956

Overview:
The Java and JavaFX source code corresponds to the MEDICI application, designed to produce synthetic data for social network graphs, which can be used for analysis, hypothesis testing and application development by researchers and practitioners in the field. It builds on previous work by providing an integrated system and a user-friendly screen interface. It can be run with default values to produce graph data and statistics, which can then be used for further processing. The system is made publicly available as a GitHub Java project. The annex provides a user manual with a screen-by-screen guide.
Repast (ReLogo) source code of paper "Multi-Agent Modeling Simulation of In-Vitro T-Cells for Immunologic Alternatives to Cancer Treatment"
Language: Repast (ReLogo)
Repository: https://github.com/dnettlet/AgentSim1
License: GNU GENERAL PUBLIC LICENSE Version 3
Python source code of a project to extract memes (compact semantic network structures) representing key knowledge circulating in online discussion forums.
Languages: Python
Repository: https://github.com/dnettlet/memes
License: GNU GENERAL PUBLIC LICENSE Version 3
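For orientation only, a minimal sketch of the general idea (not the repository's actual pipeline): build a term co-occurrence network from forum posts and keep small, sufficiently frequent connected substructures as candidate memes. The toy posts and thresholds are assumptions.

```python
# Hedged sketch: term co-occurrence network -> dense substructures as memes.
from collections import Counter
from itertools import combinations

import networkx as nx

posts = [
    "battery life of the new phone is poor",
    "new phone battery drains fast",
    "screen quality is great but battery life poor",
]

# Count pairwise co-occurrences of terms within each post.
cooc = Counter()
for post in posts:
    terms = sorted(set(post.lower().split()))
    cooc.update(combinations(terms, 2))

# Build the semantic network, keeping only sufficiently frequent pairs.
g = nx.Graph()
g.add_edges_from((a, b, {"weight": w}) for (a, b), w in cooc.items() if w >= 2)

# Candidate memes: small connected substructures of recurring terms.
memes = [c for c in nx.connected_components(g) if len(c) >= 3]
print(memes)
```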
This program takes an empty graph (just nodes and links) and a community labelling (e.g. generated by Gephi's Louvain) and fills it with data, one record per node. The generated data reflects typical social network properties: neighbors tend to be similar, users tend to form communities, node degree has a long-tailed distribution, clustering coefficients follow realistic distributions, and so on (a brief verification sketch follows this entry). Please reference the associated paper
"A synthetic data generator for online social network graphs",
Social Network Analysis and Mining, Dec. 2016, 6:44

and the GitHub code reference when you use/adapt/improve it!

https://github.com/dnettlet/SynthOSNdataGenerator

This version has no overlapping communities :)
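As mentioned above, a brief verification sketch for a populated graph, assuming networkx; the attribute name "age" used in the usage comment is hypothetical. It checks two of the listed properties: neighbor similarity (homophily) and a long-tailed degree distribution.

```python
# Hedged sketch: quick checks of two properties of a populated graph.
import networkx as nx

def neighbor_similarity(g: nx.Graph, attr: str) -> float:
    """Fraction of edges whose endpoints share the same attribute value."""
    edges = list(g.edges())
    same = sum(1 for u, v in edges if g.nodes[u].get(attr) == g.nodes[v].get(attr))
    return same / max(len(edges), 1)

def degree_tail(g: nx.Graph, top: int = 10) -> list[int]:
    """The highest degrees; a long tail shows as a few very large values."""
    return sorted((d for _, d in g.degree()), reverse=True)[:top]

# Usage (hypothetical attribute): neighbor_similarity(g, "age"), degree_tail(g)
```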
This Master's Thesis dissertation describes my final project work for the M.Sc. in Computer Software and System Design, a one-year intensive course at the Computing Laboratory of the University of Newcastle Upon Tyne, during 1984-1985. The work was motivated by the need at the time for higher-level programming languages that allowed the programmer to define and control computer operating system functions, rather than writing directly in (sequential) low-level machine and assembly code. It also provided an abstraction for addressing key issues such as concurrency, parallelism, reliability, security, the I/O disk interface, streams and queuing procedures, among others, and for implementing at different levels (from the user interface level down to the disk interface level, for example). Unix was used as the underlying system, running on a PDP-11/34 minicomputer. The main areas of work were: setting up the standalone Concurrent Euclid (CE) software on the PDP-11/34 hardware; developing a disk interface written in CE; developing different operating system functions, some rewritten from an existing SOLO operating system (Brinch Hansen) written in Sequential Pascal; and a comparative study of the CE language with Concurrent Pascal, Modula-2 and Edison-11.
In this paper a brief description is given of the implementation of a 'Pepper's Ghost' apparatus for creating an optical illusion. The result is a purely non-digital effect, using only light reflection, an appropriate lighting arrangement and a suitable background. A second chamber is added, which makes it possible to project a secondary independent image superimposed on the primary one. As part of the testing of the apparatus, different objects (a cup, a bag) are made to appear and disappear, and by varying the incident light intensity, spurious visual artefacts are minimized.