This thesis was scanned from the print manuscript for digital preservation and is copyright the author. Researchers can access this thesis by asking their local university, institution or public library to make a request on their behalf. Monash staff and postgraduate students can use the link in the References field.
Text mining in the clinical domain is usually more difficult than in general domains (e.g. newswire reports and scientific literature) because of the high level of noise in both the corpus and the training data for machine learning (ML). A large number of unknown words, unknown non-words and poorly formed sentences make up the noise in the clinical corpus. Unknown words are usually complex medical vocabulary, misspellings, acronyms and abbreviations, whereas unknown non-words are generally clinical patterns such as scores and measures. This noise creates obstacles in the initial lexical processing step as well as in subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because it requires intensive clinical knowledge from the annotators. Even when created by experts, the training examples usually contain errors and inconsistencies due to variations in human annotators' attentiveness. The clinical domain also suffers from an imbalanced data distribution. These kinds of noise are very common and can degrade overall information extraction performance, yet they have not been carefully investigated in most published health informatics systems. This paper introduces a general clinical data mining architecture that has the potential to address all of these challenges using an automatic proof-reading process, a trainable finite-state pattern recogniser, iterative model development and active learning. A reportability classifier based on this architecture achieved 98.25% sensitivity and 96.14% specificity on an Australian cancer registry's held-out test set, and active learning saved up to 92% of the training data required for supervised ML.
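As a minimal sketch of how the reported evaluation figures are defined, the snippet below computes sensitivity and specificity from binary confusion-matrix counts. The counts used here are hypothetical, chosen only to reproduce the abstract's reported rates; they are not the registry's actual test-set numbers.

```python
# Illustrative only: sensitivity and specificity for a binary
# (reportable vs non-reportable) classifier, computed from
# confusion-matrix counts. Counts below are hypothetical.

def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical held-out counts for a reportability classifier.
tp, fn = 393, 7     # reportable cases: correctly flagged vs missed
tn, fp = 573, 23    # non-reportable cases: correctly cleared vs false alarms

print(f"sensitivity = {sensitivity(tp, fn):.2%}")  # 98.25%
print(f"specificity = {specificity(tn, fp):.2%}")  # 96.14%
```

Reporting both rates matters here because of the imbalanced class distribution the abstract mentions: plain accuracy would be dominated by the majority (non-reportable) class.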
Over the past 30 years, health information technology (HIT) has been positioned as a battle between two classes of technology solutions: Clinical Enterprise Resource Planning (CERP, a.k.a. EMR) systems versus best-of-breed systems. CERP systems are provided by the largest vendors as whole-of-hospital or whole-of-organization solutions intended to satisfy all users in the organization; experience shows that they fail to fulfill that promise. Best-of-breed solutions are tailored to a particular community of users performing specialized tasks such as surgical scheduling, tracking, and clinical details. These systems earn higher rankings from users for usability and efficiency, but they create problems for IT departments by requiring individual maintenance for each installed system, and they silo data that is needed for back-office administration and analytics. In the last 10 years, best-of-breed solutions have been in retreat under the onslaught of CERP vendors, who hold sway over decision makers with a promise of increased revenue from more detailed billing and common access to all data. At the same time, clinicians at the coalface of care complain bitterly about CERP systems, which have unsuitable interfaces, add more work, and fail to respond to change requests.
Surgical Patient Care: Improving Safety, Quality and Value, Springer, 2017, http://www.springer.com/us/book/9783319440088