WO2017151759A1

WO2017151759A1 - Category discovery and image auto-annotation via looped pseudo-task optimization

Info

Publication number: WO2017151759A1
Application number: PCT/US2017/020185
Authority: WO
Inventors: Le LU; Xiaosong Wang; Ronald M. SUMMERS
Original assignee: The United States Of America, As Represented By The Secretary, Department Of Health And Human Services
Priority date: 2016-03-01
Filing date: 2017-03-01
Publication date: 2017-09-08

Abstract

Methods and apparatus are disclosed for providing a looped deep pseudo-task automation approach for automatic category discovery as can be applied to a collection of images. In one example of the disclosed technology, a method of analyzing a collection of images includes extracting at least a portion of the activation values and weights associated with internal nodes of the neural network responsive to a respective input image of the collection of images being applied to the neural network, encoding the extracted activation values and weights, producing encoded vectors, clustering at least a portion of the collection of images based on similarities of the encoded vectors, to produce a plurality of clusters, and evaluating the clusters with a convergence criteria.

Description

CATEGORY DISCOVERY AND IMAGE AUTO-ANNOTATION VIA

LOOPED DEEP PSEUDO-TASK OPTIMIZATION

CROSS-REFERENCE TO RELATED APPLICATIONS

[001] This application claims the benefit of and priority to U.S. Provisional Application No. 62/302,096, filed March 1, 2016, which application is incorporated by reference in its entirety.

ACKNOWLEDGMENT OF GOVERNMENT SUPPORT

[002] This invention was made with government support under contract no.

HHSN263200900026I awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

[003] With the advent of digital medical imaging, large databases of images, such as x-rays, computed tomography (CT), magnetic resonance imaging (MRI), photographs, and other medical images are available, in large medical imaging databases. However, labels for such data in these databases are typically not readily available. Providing labeling for these images can be problematic, as conventional ways for collecting image labels do not apply in the medical imaging field. For example, Google searching, and crowd searching approaches cannot be used with specialized databases. Further, annotation of medical images requires professionals that have clinical training to properly identify features found in the images. Thus, there is ample opportunity for improvements in analyzing and labeling digital images contained in such image databases.

SUMMARY

[004] Methods and apparatus are disclosed for providing a looped deep pseudo-task automation approach for automatic category discovery as can be applied to a collection of images. For example, the disclosed technologies can be used to determine similar images, for example, those images that are visually coherent, and/or have similar clinically semantic features. Using the disclosed techniques, better labels can be generated that in turn lead to better trained neural network models used to feed more effective deep image features in the neural network to facilitate generation of improved clustering and labels. By basing analysis on internal or deep nodes of a neural network and fine-tuning the neural network to achieve convergence, improved training of the neural network and the resulting clustering can be achieved. [005] Obtaining ImageNet-level semantic labels on a large-scale radiology image database (in one experimental example, comprising 215,786 key images from 61,845 unique patients) is a bottleneck to train highly effective deep convolutional neural network (CNN) models for image recognition. Nevertheless, using a search engine and crowd-source approach to collecting image labels are not applicable in many specialized domains, for example, medical imaging, due to the formidable difficulties of medical annotation tasks for annotators that are not clinically trained.

[006] In some examples of the disclosed technology, methods, and apparatus are provided for unsupervised joint mining of deep image features and labels via LDPO based on the hypothesized "convergence" of better labels leading to better-trained CNN models, which in turn offer more effective deep image features to facilitate more meaningful clustering/labels. This looped property can be used with deep CNN classification-clustering models. Typically, other types of classifiers do not learn better image features simultaneously. In some examples, disclosed methods are applied to perform large-scale medical image auto- annotation. In some examples of the disclosed technology, an LDPO framework is also validated through a scene recognition task where ground- truth labels are available (for the validation purpose).

[007] In some examples of the disclosed technology, a looped deep pseudo-task optimization procedure for automatic category discovery of visually coherent and clinically semantic (concept) clusters is provided. In such examples, a system can be initialized by domain-specific (e.g. , a CNN trained on radiology images and text report derived labels) or generic (e.g. , ImageNet) CNN models. A sequence of pseudo-tasks are exploited by using looped deep image feature clustering (e.g. , to refine image labels) and deep CNN training/classification (e.g. , to obtain more task representative deep features using new labels). In certain examples, a method provides convergence of better labels leading to better-trained CNN models which consequently feed more effective deep image features to facilitate more meaningful clustering/labels. The convergence has been empirically validated and demonstrated promising quantitative and qualitative results. Thus, in certain examples of applying the disclosed methods to large image radiology databases, significantly higher quality of category labels can be discovered. This further allows investigation of hierarchical semantics for a given large-scale radiology image database.

[008] In some examples, an LDPO framework is used for joint mining of image features and labels, without a priori knowledge of the image categories. The true image category labels are assumed to be latent and not directly observable. Thus, disclosed methods can learn and train CNN models using pseudo-task labels (since human- annotated labels are unavailable) and iterate this process with the expectation that pseudo-task labels will gradually resemble the real image categories. In some examples, a looped optimization algorithm flow starts with deep CNN feature extraction and image encoding using domain-specific (e.g. , a CNN trained on radiology images and text report-derived labels) or generically-initialized CNN models. Afterwards, the CNN-encoded image feature vectors are clustered to compute and refine image labels, and the newly clustered labels are fed to fine-tune the current CNN models. Next, the obtained more task-specific and representative deep CNN will serve as the deep image encoder in the successive iteration. This looped process continues until a stopping criterion is met. For medical image annotation, LDPO- generated image clusters can be further interpreted by a natural language processing (NLP) based text mining system and/or a clinician.

[009] In one example of the disclosed technology, a computer-implemented method of analyzing a collection of images with a neural network includes producing a neural network that has a plurality of input nodes, output nodes, and internal or deep nodes. The nodes are interconnected with a plurality of links that carry messages, or values, from input nodes to the internal nodes and from the internal nodes, in turn, to the output nodes. The nodes are typically arranged in layers corresponding to a stage of neural processing. The nodes can have an associated plurality of activation values which is used to translate input signals to a particular node to its output signal. For example, the activation value can scale a transfer or activation function. Examples of suitable activation functions are step functions, sigmoid functions, piecewise linear functions, and Gaussian functions. Each of the connections to a node can also be associated with a weight which provides the connection strength, or multiplier for calculating the effect that an input received on a particular link will have on a given node. In some examples, a convolutional neural network (CNN) can be used. The neural network can be trained by applying the collection of images as inputs to the neural network. The method further includes extracting values from internal nodes of the internal network, for example, activation values for the internal nodes and/or weights associates with links connecting the internal nodes responsive to an input image of the collection of images being applied as input to the neural network. The extracted activation values and weights can be encoded to produce encoded vectors. The encoded vectors provide a more compact description of the internal nodes of the neural network and are used to cluster at least some of the images of the collection of images based on similarities between their respective encoded vectors. Thus, images having similar encoded vectors can be collected into clusters. The method further includes evaluation the clusters with convergence criteria to determine whether the clustering will satisfy a threshold. If the evaluating indicates that the convergence criteria do not satisfy the threshold, then the neural network is fine-tuned by adjusting activation, weights, and/or transfer functions of the internal nodes of the neural network, and then repeating the extracting encoding, clustering, and evaluating of the fine-tuned neural network. If the evaluating indicates that the convergence criteria satisfies the threshold, then the images can be labeled, for example, using computer generated or human generated labels for each of the clusters. Further, the clusters can be arranged in a hierarchy based on similarities between groups of clusters.

[010] In some examples, a system includes a computer readable database storing a collection of images, a neural network coupled to receive images from the database as input to the neural network, and encoding a clustering unit that is configured to generate clusters of images by extracting activation values, weights, or other parameters of internal nodes of the neural network, and/or their respective edges, and classify the images into clusters based at least in part on similarities between the activation values and/or weights. In some examples, the neural network is implemented with a general-purpose processor, while in other examples, the neural network can be implemented, at least in part using a graphics processing unit.

[Oil] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Any trademarks used herein remain the property of their respective owners. The foregoing and other objects, features, and advantages of the disclosed subject matter will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

[012] FIG. 1 is a block diagram outlining an example system in which certain apparatus and methods can be implemented.

[013] FIG. 2 is a diagram illustrating an example of a common neural network having deep internal nodes.

[014] FIG. 3 is a flow chart outlining an example method of analyzing images, as can be performed in certain examples of the disclosed technology.

[015] FIGS. 4 A and 4B are a number of charts illustrating convergence criteria over a number of fine-tuning iterations. [016] FIGS. 5A and 5B is a number of charts illustrating convergence criteria over a number of fine-tuning iterations performed according to certain examples of the disclosed technology.

[017] FIG. 6 is a chart illustrating statistics for a group of clusters produced using certain examples of the disclosed technology.

[018] FIG. 7 is a chart illustrating an example of statistics for image and text clusters that can be produced according to certain examples of the disclosed technology.

[019] FIG. 8 is a number of charts showing a number of medical images and their associated label frequencies that can be produced using certain examples of the disclosed technology.

[020] FIG. 9 is a chart illustrating an example label hierarchy for a number of cluster produced according to certain examples of the disclosed technology.

[021] FIG. 10 is a chart illustrating a portion of a label hierarchy that has been generated for a series of image clusters, as can be performed in certain examples of the disclosed technology.

[022] FIG. 11 is a diagram illustrating a suitable computing environment in which certain methods can be performed according to certain examples of the disclosed technology.

[023] FIGS. 12A through 12D are a number of charts illustrating convergence criteria over a number of fine-tuning iterations performed according to certain examples of the disclosed technology.

[024] FIG. 13 is a number of charts showing a number of medical images and their associated label frequencies that can be produced using certain examples of the disclosed technology.

[025] FIG. 14 is a chart showing clustering accuracy plots that can be produced according to certain examples of the disclosed technology.

DETAILED DESCRIPTION

I. General Considerations

[026] This disclosure is set forth in the context of representative embodiments that are not intended to be limiting in any way.

[027] As used in this application the singular forms "a," "an," and "the" include the plural forms unless the context clearly dictates otherwise. Additionally, the term "includes" means "comprises." Further, the term "coupled" encompasses mechanical, electrical, magnetic, optical, as well as other practical ways of coupling or linking items together, and does not exclude the presence of intermediate elements between the coupled items. Furthermore, as used herein, the term "and/or" means any one item or combination of items in the phrase.

[028] The systems, methods, and apparatus described herein should not be construed as being limiting in any way. Instead, this disclosure is directed toward all novel and non-obvious features and aspects of the various disclosed embodiments, alone and in various combinations and subcombinations with one another. The disclosed systems, methods, and apparatus are not limited to any specific aspect or feature or combinations thereof, nor do the disclosed things and methods require that any one or more specific advantages be present or problems be solved. Furthermore, any features or aspects of the disclosed embodiments can be used in various combinations and subcombinations with one another.

[029] Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed things and methods can be used in conjunction with other things and methods. Additionally, the description sometimes uses terms like "produce,"

"generate," "display," "receive," "train," "sample," "initialize," "embed," "execute," and "initiate" to describe the disclosed methods. These terms are high-level descriptions of the actual operations that are performed. The actual operations that correspond to these terms will vary depending on the particular implementation and are readily discernible by one of ordinary skill in the art having the benefit of the present disclosure.

[030] Theories of operation, scientific principles, or other theoretical descriptions presented herein in reference to the apparatus or methods of this disclosure have been provided for the purposes of better understanding and are not intended to be limiting in scope. The apparatus and methods in the appended claims are not limited to those apparatus and methods that function in the manner described by such theories of operation.

[031] Any of the disclosed methods can be implemented as computer-executable instructions stored on one or more computer-readable media (e.g. , computer-readable media, such as one or more optical media discs, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as hard drives)) and executed on a computer (e.g. , any commercially available computer, including smart phones or other mobile devices that include computing hardware). Any of the computer-executable instructions for implementing the disclosed techniques, as well as any data created and used during implementation of the disclosed embodiments, can be stored on one or more computer-readable media (e.g. , computer-readable storage media). The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g. , with a general -purpose and/or graphics processors executing on any suitable commercially available computer) or in a network environment (e.g. , via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.

[032] For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language or program. For instance, the disclosed technology can be implemented by software written in C, C++, Java, or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well-known and need not be set forth in detail in this disclosure.

[033] Furthermore, any of the software-based embodiments (comprising, for example, computer- executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.

II. Introduction to the Disclosed Technology

[034] Other approaches to employing and adapting deep convolutional neural network models (e.g. , ConvNets or CNNs) for computer vision tasks rely on availability of well-labeled or annotated ImageNet ILSVRC, MS COCO, and PASCAL VOC datasets. For image recognition tasks, deep CNNs tend to perform significantly better than shallow learning architectures, but typically more training data is generated and applied to the neural network models. In some examples of the disclosed technology, mining into data instances without explicit labels and across different dataset biases can be successfully performed. ImageNet pre-trained deep CNN models can be bootstrapped for externally- sourced data exploitation tasks. In certain specialized image processing domains (e.g. , medical domains), however, similar large-scale labeled image datasets are not typically available. Gigantic collections of radiology images and reports are stored in hospital Picture Archiving and Communication Systems (PACS); but obtaining ImageNet-level semantic labels on a large-scale radiology image database (e.g. , in one example, 215,786 key images extracted from 61,845 unique patients) is desirable. Unsupervised image categorization, lacking "ground-truth" labeling, involves images where annotations are difficult to obtain via mass image search or crowdsourcing techniques. Further, the good efficacy of CNNs often comes at the cost of large amounts of annotated training data. For example, ImageNet pre-trained deep CNN models when used in other approaches serve an indispensable role to be bootstrapped or fine-tuned for all externally sourced data exploitation tasks.

[035] In some examples of the disclosed technology, a Looped Deep Pseudo-task Optimization (LDPO) approach for automatic category discovery of visually coherent and clinically semantic (concept) clusters is provided. In some examples, true semantic category information is assumed to be latent and not directly observable. The, it is often desirable to learn and train CNN models using pseudo-task labels (e.g. , for cases where human-annotated labels are not available) and iterate this process with the expectation that pseudo-task labels will gradually converge, reaching agreement with latent true image categories. In some examples, an LDPO framework is used for joint mining of deep CNN features and image labels. The hypothesized "convergence" of better labels can lead to better-trained CNN models which in turn feed more discriminative image representations to facilitate more meaningful clusters/labels.

[036] In some examples of the disclosed technologies, an unlabeled image collection is initialized with labels from a pseudo-task (e.g. , text topic modeling generated labels) and the labels are updated through an iterative looped optimization of deep CNN feature clustering and CNN model training (towards better deep image features).

[037] Natural language processing (NLP) in radiology reports presents a number of challenges. Human radiologists often rule out or indicate pathology/disease terms, not existing in the according key images, but based on patient priors and other long-range contexts or abstractions. In some cases, only ~8% key images (18K out of 216K) can be tagged from NLP with moderate confidence levels. In some examples of the disclosed technologies, interactions are exploited from the text- derived image labels to the disclosed LDPO (mainly operating in the image modality) and final term extraction from image groups.

[038] In some examples of the disclosed technology, a CNN architecture is used for deep domain transfer to handle unlabeled and sparsely labeled target domain data. An image label auto- annotation approach is addressed via multiple instance learning, but the target domain is very restricted as a small subset of 25 ImageNet and 25 SUN classes. A method to identify a hierarchical set of unlabeled data clusters (spanning a spectrum of visual concept granularities) that can be efficiently labeled to produce high performing classifiers (thus less label noise at instance level) is disclosed. By learning visually coherent and balanced labels through LDPO, we expect that the studied large-scale radiology image database can markedly improve its feasibility in domain transfer to specific CAD problems where very limited training data are available per task.

[039] In some examples of the disclosed technology, improved large-scale medical image annotation is provided. Such annotation is typically a prohibitively expensive and easily-biased task even for well-trained radiologists. Significantly better image categorization results can be achieved via certain disclosed methods. In some examples of the disclosed technology, unsupervised scene recognition on representative and publicly available datasets is performed. The LDPO achieves excellent quantitative scene classification results.

[040] In the medical imaging domain, no suitable largescale labeled image dataset comparable to ImageNet exists. Modern hospitals store vast amounts of radiological images/reports in their Picture Archiving and Communication Systems (PACS). Other means of collecting image labels are often inadequate due to unavailability of a high quality or large capacity medical image search engine, and formidable difficulties of medical annotation tasks for annotators with no clinical training. Further, even for well-trained radiologists, this task of assigning labels to images is not aligned with their diagnostic routine work, and drastic inter-observer variations or inconsistency are expected. The protocols of defining image labels based on visible anatomic structures (often multiple), or pathological findings (possibly multiple) or both cues can exhibit intrinsically high ambiguities.

[041] In certain examples of the disclosed technology, a loop optimization method includes a number of aspects. For example, an LDPO framework-based method can be used to initialize an unlabeled image collection with randomly-assigned labels or labels obtained by a pseudo-task (e.g. , text topic modeling generated labels). In some examples, the LDPO framework has the flexibility of working with any clustering function. Particularly, it employs Regularized Information

Maximization (RIM) clustering to perform clustering the image (similar in some respects to k- means clustering) with model selection on finding the optimal number of clusters. The empirical convergence process of disclosed LDPO methods is observable and quantifiable.

[042] In some examples of the disclosed technology, mid-level image representation can be used to enhance performance of the disclosed technology. For example, use of a mid-level visual elements-based image representation can be used to improve scene recognition amongst other visual computing tasks. As will be readily discemable to one of ordinary skill in the art having the benefit of the present disclosure, a number of different types of mid- level visual elements can be extracted from an image, including without limitation: image patches, parts/segments, prototypes, and attributes through different learning and mining techniques (including, e.g. , iterative optimization, classification and co-segmentation, Multiple Instance Learning (MIL), random forests, ensemble projection, and association rule mining).

[043] In some examples of the disclosed technology, iterative optimization is used to jointly mine improved deep image representations and the labels for all images, providing iterative auto- annotation. Association rule mining techniques can be used to extract the frequent image parts (that are further used to encode image representation) into an LDPO pipeline.

III. Example System for Implementing Image Analysis

[044] FIG. 1 is a block diagram 100 outlining an example apparatus that can be used to implement certain examples of image analysis according to the disclosed technology. For example, the illustrated system can be implemented using a computer system that includes one or more general purpose, graphics processing units (GPU), or other specialized computing hardware for

implementing image analysis using neural networks. In some examples, the computing hardware includes integrated circuits, application specific integrated circuits, and/or field programmable gate arrays (FPGAs).

[045] As shown in FIG. 1, a series of images 110 is stored in an image database 115. The images can include x-ray images, CT scan images, MRI images, photographs, or other suitable images. In some examples, the images are medical images, for example, images of a human or other type of mammal. In some examples, text annotation can be associated with at least some of the images and can be used to help categorize the images. For example, the text can include descriptions of pathologies, diseases, or other disorders associated with the image, organs, severity of disorders, and/or location information, such as verbal descriptions of location, or numerical coordinates. The text 120 can be stored in an image annotation database 125. The image annotation database 125 can include information that associates the text with one or more of the images stored in the image database 115. In other examples of the disclosed technology, annotations are not used for initializing a neural network. The neural network can be used to classify images without the use of associated annotation text, and then the clusters themselves can be labeled at a later point in time, for example by a trained technician or radiologist.

[046] Images stored in the image database are input to an input layer of a neural network 130. In some examples, annotations are also input from the image annotation database 125. The neural network processes the input values according to activation values, activation/transfer functions, and edge weights of links that connect nodes of the neural network. The neural network 130 can have a large number of layers, for example, 5, 8, 12, 35, or some other number of layers. As the neural network is trained, or fine-tuned, activation values, activation functions, link weights, or other parameters associated with internal nodes of the neural network are updated. These internal node values (e.g. , internal node values 140 and 141) can be extracted from the neural network and provided to an encoding and clustering unit 150 as input. In some examples, output values 145 from the neural network 130 are also provided to the encoding and clustering unit 150. In other examples, only values from internal nodes of the neural network are provided to the encoding and clustering unit 150.

[047] The encoding and clustering unit 150 in turn, generates clusters based on similarities detected between images that have been applied to the neural network 130. For example, the internal values can be encoded as an encoding vector, and respective encoding vectors for each image can be grouped based on their similarity into clusters. The encoding and clustering unit 150 can also generate feedback that is sent to a fine-tuning unit 160. The fine-tuning unit 160 adjusts values of the activation values, link weights, activation functions, or other values associated with nodes of the neural network (e.g. , internal nodes of the neural network), in order to improve image classification results generated by the neural network. The encoding, clustering, and fine-tuning can be repeated a number of times, for example, until a convergence criteria indicating that the network has converged is satisfied. Once the network has been converged, clusters and/or labels for images 170 can be output by the encoding and clustering unit. These clusters and labels can be used, for example, by a trained technician or radiologist, in order to improve diagnosis of disorders exhibited in input images that have been applied to the neural network.

[048] In some examples of the disclosed technology, the encoding and clustering unit 150 further includes a scene recognition unit 155. The scene recognition unit 155 can be used to initialize patch mining and generate new image labels for subsequent iterations of unsupervised scene recognition. In some examples, the scene recognition unit 155 can be used to validate an image analysis system implemented according to the disclosed technology. Further details of scene recognition and patch mining are discussed below.

IV. Example Convolutional Neural Network for Image Analysis

[049] FIG. 2 is a block diagram 200 illustrating an example of a convolutional neural network as can be used in certain examples of the disclosed technology. The neural network can be implemented, for example, with a general-purpose processor CPU, or with a GPU and configured with use of any of the methods disclosed herein.

[050] As shown in FIG. 2, the neural network includes an input layer 210 to which an input image is applied as input. The neural network further includes a number of convolutional layers 220. Each of the convolutional layers applies the convolutional operation on a subset of the preceding layer. For example, an 11x11 pixel portion of the input image 210 is sampled, convolved using a selected convolution function, and provided as input to a node of a first internal layer of the neural network. A number of additional convolutions are performed by successive internal layers of the network. The size of the convolution window can vary, for example, 5x5 and 3x3 convolution samples are illustrated in the block diagram 200. Also shown are a number of fully connected layers 230 that include connections between all nodes of an input convolutional layer and the output of the next respective layer. Two fully connected layers are shown, and the output of the second fully connected layer is provided to a layer of output nodes 240.

[051] Activation values, link/edge weights, and/or activation functions can be adjusted by applying a training input to the neural network. Thus, by applying appropriate training input, the neural network performance can be improved by successive iterations.

V. Example Method of Performing Image Analysis

[052] FIG. 3 is a flow chart 300 outlining an example method of performing image analysis with deep nodes of a neural network, as can be performed in certain example of a disclosed technology. For example, the system described above regarding FIG. 1, including the neural network depicted in FIG. 2 can be used. It should be readily understood to one of ordinary skill in the relevant art having the benefit of the present disclosure, however, that the systems and neural networks can be adapted, and different components selected based on a particular implementation and its associated requirements.

[053] At process block 310, a neural network model is produced for use in image analysis. The initial neural network can be pre-trained in some examples, while in other examples, the initial neural network is not trained. In some examples, a CNN, such as AlexNet or GoogleNet are used.

[054] At process block 320, deep CNN features are extracted from the neural network. The deep features are based on activation values, weights, and other parameters that affect how individual nodes of the neural network respond. Internal nodes of the neural network, which are normally not visible outside of the neural network. The extracted deep features of the internal nodes can be encoded in a form of dense pooling using, for example, a Fisher Vector (FV), or a Vector Locally Aggregated Descriptor (VLAD). In some examples, principle component analysis is performed to reduce the dimensionality of the encoded vector in order to facilitate the comparison between images in order to allow for improved clustering.

[055] At process block 330, image clustering is performed. Images having similarities as determined using, for example, the encoding vector, can be grouped into the same vector. In some examples, the £-means technique is used to form the clusters. In other examples, over-segmented k-means is used followed by regularized information maximization. In some examples, k-means clustering followed by RIM is used.

[056] Additional details regarding the clustering algorithms employed are discussed further below.

[057] At process block 340, clusters are evaluated to determine whether convergence criteria for the clustering operation is satisfied. In some examples, the Purity convergence measure is used, while in other examples, the normalized mutual information (NMI) convergence measurement is used. If the convergence criteria are not satisfied, then the method proceeds to process block 350. On the other hand, if the convergence criteria are satisfied, then the method proceeds to process block 360. By evaluating the purity and mutual information between formed clusters in consecutive rounds, method either terminates the current iteration (and yields converged clustering outputs) or uses the newly refined image cluster labels to train or fine-tune the CNN model in the next iteration. [058] At process block 350, the neural network is fine-tuned using renewed cluster labels with randomly shuffled images. For example, image clustering is first applied on the entire collection so that each image receives a clustered label. In some examples, the cluster label is completely arbitrary (e.g. , a unique number assigned to an arbitrary cluster). After the images have been associated with a cluster label, the collection of images is randomly reshuffled into sub-groups for neural network fine-tuning via stochastic gradient descent. Thus, convergence can be achieved for the entire collection of images. In other examples, a Caffe implementation of neural networks is used for the fine-tuning. Once the neural network has been fine-tuned, the method proceeds to process block 320 to re-extract values of neural network nodes and encode a vector to be used for clustering.

[059] Once the clustering has converged, the method proceeds to process block 360, where text reports for each cluster can be analyzed (e.g. , using natural language processing, and/or human analysis). Based on the analysis at process block 360, image clusters are produced with semantic text labels at process block 370. Thus, a large number of similar images can be labeled using a highly automated process, even if no initial annotation labels are available. For example, a technician or radiologist can analyze a few images in one of the final clusters of images and apply a label indicating pathology, organs, or other appropriate terms, to all of the images of the cluster. For example, for medical image categorization, LDPO-generated image clusters can be further fed into text processing. The system can extract semantically meaningful text words for each formed cluster. A hierarchical category relationship can be built using the class confusion measures of the final converged CNN classification models.

[060] In some examples of the disclosed technology, joint mining of deep features and labels can be performed. In supervised and semi-supervised approaches, at least partial image labels are typically required as a prerequisite. For large data sets, such techniques require massive data annotation efforts. Further, for certain fields such as medical imaging, providing images to crowdsource or "mechanical turks" may be unsatisfactory due to privacy concerns. In joint mining approaches according to the present disclosure, unsupervised category discovery using empirical image cues can be used for grouping or clustering through an iterative process of deep image feature extraction and clustering and deep CNN model fine-tuning (using new labels from clustering) to update deep feature extraction in the next iterative round.

[061] Other examples of the method outlined in FIG. 3 can also be used in the scenario of scene recognition. Association rule mining can be performed inside the sets of either randomly grouped images (for the first iteration of the method) and/or on image clusters computed at process block 330. The top 50 mined patterns (which cover a maximum numbers of patches) per image cluster are further merged across the dataset to form a consolidated vocabulary of visual elements. Then, the clustering on patch-mining based features with £-means is exploited. By evaluating the purity and mutual information between formed clusters in consecutive rounds, the system either terminates the current iteration (which leads to converged clustering outputs at process block 370); or takes the newly refined image cluster labels to train or fine-tune the CNN model in the next iteration (at process block 350).

VI. Looped Deep Pseudo-Task Optimization (LDPO)

[062] Traditional detection and classification problems in medical imaging, e.g. Computer Aided Detection (CAD), demand precise labels of lesions or diseases as the training/testing ground-truth. This can requires a large amount of annotation from well-trained medical professionals (especially at the era of "deep learning"). Employing and converting the medical records stored in the PACS into labels or tags presents a number of challenges. Disclosed examples perform category discovery in an empirical manner and return accurate keyword category labels for most or all images in the database, through an iterative framework of deep feature extraction, clustering, and deep CNN model fine-tuning.

[063] Similar to the example method discussed above regarding FIG. 3, an iterative method (e.g. , implemented with the example system of FIG. 1) starts from deep CNN feature extraction based on either fine-tuned (with high-uncertainty radiological topic labels) or generic (from ImageNet labels) CNN models. Deep feature clustering, performed using £-means, or £-means followed by RIM, is employed. By evaluating purity and mutual information between discovered clusters, the example system can decide to either terminate the current iteration (which leads to an optimized clustering output) or take the refined cluster labels as the input to fine-tune the CNN model for another round. Once visually coherent image clusters are achieved, the example system will further extract semantic meaningful text words for each cluster. All corresponding patient reports per category cluster are finally adopted for the NLP stage. Furthermore, a hierarchical category relationship is built using the class confusion measures of the latest converged CNN classification models.

[064] The example LDPO frameworks disclosed herein are applicable to a variety of neural network models, including CNNs having layers of different depths, such as in AlexNet, VGGnet, and GoogLeNet. Pre-trained models with ImageNet ILSVRC data are obtained from the Caffe Model Zoo. In some examples, a Caffe CNN implementation can be used to perform fine-tuning on pre-trained CNNs using the key image database. Both CNN models with and without fine- tuning can be used to initialize the looped optimization. AlexNet is a CNN architecture with seven layers. In the disclosed experiments, feature activations of both the 5th convolutional layer (Conv5) and the 7th fully-connected (FC) layer FC7 are adopted. GoogLeNet is a much deeper CNN architecture compared to AlexNet, which comprises 9 inception modules and an average pooling layer. Each of the inception modules is truly a set of convolutional layers with multiple window sizes, e.g. , lxl, 3x3, 5x5. In another experiment, the deep image features from the last inception layer (InceptionSb) and final pooling layer (Pool 5) are adopted. In some examples, such as those performing scene recognition tasks, very deep VGGNet (VGG-VD) can be employed in addition to AlexNet. The extracted features from VGG-VD 's last fully-connected layers can be used for the patch-mining based image encoding. Table 1 illustrates the detailed model layers and their activation dimensions.

Table 1

[065] Table 1 details configurations of CNN output layers and encoding method. (The number output dimension for the cases was 4,096, except the last two row, which were 1,024).

[066] While features extracted from a fully-connected layer are able to capture the overall layout of objects inside the image, features computed at the last convolution layer preserve the local activations of images. In certain examples, the same setting is used to encode the convolutional layer outputs in a form of dense pooling via a Fisher Vector (FV) and/or a Vector Locally

Aggregated Descriptor (VLAD). Nevertheless, the dimensions of encoded features are much higher compared to FC features. Since there is redundant information by encoded features, Principal Component Analysis (PCA) is performed to reduce the dimensionality to 4096, equivalent to the FC features' dimension. [067] In some examples, mined mid-level visual elements based image encoding (PM) has been shown to provide a more discriminative representation in natural scene recognition. Visual elements are expected to be common amongst images with the same label but seldom occur in other categories. In some examples, the association rule mining technique is integrated into a looped optimization process to automatically discover mid-level image patches for encoding. Thus, discriminative patches can be discovered and gradually improved through LDPO iterations even if the initialization image labels are not accurate.

[068] Given a pre-trained (generic or domain-specific) CNN model (e.g. , Alexnet or GoogLeNet), an input image / is resized to fit the model definition and feed into the CNN model to extract features {f^Lij} (1≤i, j≤ s^L) from the L-th convolutional layer with dimensions s^Lxs^Lxd^L , e.g. , 13x13x256 of Conv5 in AlexNet and 7x7x1,024 of Pool5 in GoogLeNet. For a Fisher Vector implementation, the following settings are used: 64 Gaussian components are adopted to train the Gaussian mixture Model(GMM). The dimension of resulting FV features is significantly higher than FC7's, e.g. , 32,768 (2x64x256) vs. 4,096. After PCA, the Fisher Vector representation per image is reduced to a 4,096-component vector. A list of deep image features, the encoding methods, and output dimensions is described in Table 1. To be consistent with the settings of FV representation, the VLAD encoding of convolutional image features is initialized by £-means clustering with k = 64. Thus the dimensions of VLAD descriptors are 16,384 (64x256) of Conv5 in AlexNet and 65,536 (64x1024) of Inception5b in GoogLeNet. PCA additionally reduces both dimensions to 4,096.

[069] In some examples, a patch mining-based encoding is used to extract mid-level elements for image representation. In some examples, this method does not use a priori knowledge of the image categories. For example, for each image / in the dataset, a set of patches is extracted from multiple spatial scales and the CNN activation is computed for each patch. In some examples, patch mining is performed by selecting portions of an image and recognizing images within individual selected portions. For example, a 512x512 image may be divided into four 256x256 portions, and each portion provided for scene recognition to seed the CNN. In some examples, the image is divided into even portions, in other examples, changes in contrast, color, edge detection, or other suitable techniques are used to identify portions for applying scene recognition. In some examples, the number and origin of portions may be varied depending on the content of a particular image.

[070] Among all activations (e.g., 4096-D vectors on FC7), only indexes of top k maximal activations are recorded and used to form a transaction (e.g. , { 1024; 3; 24; 4096} , k = 4). Each image contains a set of transactions, which appears on the image. Instead of retrieving patches in a class-specific fashion association rule mining is used inside the sets of either randomly grouped images (for the first iteration) or image clusters computed by "clustering on CNN features." The top N (e.g. , 50) mined patterns (which cover the maximum numbers of patches) per image cluster are further merged across the entire dataset to form a consolidated vocabulary of visual elements. Further detail of exemplary global merging procedures for patch clusters that can be performed according to certain examples of the disclosed technology are elaborated in Pseudocode Listing 1. The global merging strategy effectively reduces redundancy and offers more discriminative image features for both clustering and classification tasks. Finally, "bag-of-elements" image

representations are computed.

Input: A set of mined patterns from each of K image

clusters, i.e. V = {u_j}, \V\ = 50 * K, set of patches

Pi E for each pattern and LDS detectors

d-i E T> trained on associated patch set p_t.

Output: A set of merged patterns (uj, p dj] = {v_n} and

associated patch LDA detectors V = {d_n}

for each set of (uj, p dj]

Compute S_l} = l/\p_j dj x (X_Pj is a set of

CNN activations of patches in p_j).

end

if both Si_j and S,-j > a predefined threshold then

Merge (Uj, p d_t} and v_j, p_j, d_j and train new LDA

detector d_n based on pi U p_j .

end

return V, >';

Pseudocode Listing 1

[071] The newly generated image clusters driven by looped pseudo-task optimization are improved in terms of: (1) images in each cluster are visually more coherent and discriminative from instances in other clusters; (2) the numbers of images in each cluster are approximately equivalent to achieve class balance; and (3) the number of clusters is self-adaptive according to the statistical properties of a large collection of image data. Two clustering methods are employed in this example: £-means alone, and an over-segmented £-means (where k is much larger than the first setting, e.g. , 1,000) followed by Regularized Information Maximization (RIM) for model selection and optimization.

[072] Using £-means is an efficient clustering technique, provided that the number of clusters is known. The use of k- means clustering can be beneficial for at least two reasons: (1) to set up the baseline performance of clustering on deep CNN image features by fixing the number of clusters k at each iteration; and (2) to initialize the RIM clustering. Clustering via £-means can also be used in scene recognition applications, for example, to initialize patch mining and generate new image labels for a subsequent iteration. However, the underlying cluster number is unknown in some applications, such as medical image categorization. Thus, £-means clustering can be used to initialize RIM clustering with a relative large k value, and then RIM is used to perform model selection to optimize k. Unlike £-means, RIM works with fewer assumptions on the data and categories, e.g. , the number of clusters. RIM can be used for discriminative clustering by maximizing the mutual information between data and the resulting categories via a complexity regularization term. The objective function is defined as

/ (W; F; λ) = Iw{c; f} - IR(W; λ),

where c £ { 1, . . . , K) is a category label. F is the set of image features fi = (fn, . . . )^τ E MP. Iw {c; f } is an estimation of the mutual information between the feature vector /and the label c under the conditional model p(clf , W). R(W; λ) is the complexity penalty and specified according to p(clf , W). The unsupervised multinomial logit regression cost can be used in this example. The conditional model and the regularization term are consequently defined as p(c = k \ f, W) oc exp(wj

where W = {wi, . . . , WK ,

. . . , b ] is the parameters and Wk E ° , Z¾ ε . Maximizing the objective function is now equivalent to a logistic regression problem. W. is the L2 regulator of weight {¼¾} and its power is controlled by λ. Large λ values will reduce the total number of categories considering that no penalty is given for unpopulated categories. This characteristic enables RIM to attain the optimal number of categories coherent to the data. In the experimental results discussed herein, λ is fixed to 1.

[073] Before exporting the newly generated cluster labels to fine-tune the CNN model of the next iteration, the LDPO framework will evaluate the quality of clustering to decide if convergence has been achieved. In this example, two convergence measurements have been adopted: Purity and Normalized Mutual Information (NMI). In other examples, other suitable convergence measurements can be used. These two criteria are used as forms of empirical similarity examination between two clustering results from adjacent iterations. If the similarity is above the threshold, then the optimal clustering-based categorization of the data is deemed to have been reached. In practice, the final numbers of category after RIM process in later LDPO iterations being stabilized is typically around a constant number. The convergence on classification is directly observable through the increasing top-1 and top-5 classification accuracy levels in the initial few LDPO rounds, which fluctuate slightly at a higher accuracy.

[074] Converged clustering can be accomplished by adopting the underlying classification capability stored in those deep CNN features through the looped optimization, which accents the visual coherence amongst images inside each cluster. Nevertheless, category discovery of medical images will still typically use clinically semantic labeling of the images. From the optimized clusters, the associated text reports are collected for each image and each cluster's text reports are assembled together as a unit. Then NLP is performed on each report unit to find highly-recurring words as keyword labels for each cluster by, for example, counting and ranking the frequency of each word. Common words to all clusters are removed from the list. In some examples, all the keywords and randomly sampled exemplary images are ultimately compiled to be reviewed by board-certified radiologists. This process shares some analogy to human- machine collaborated image database construction.

[075] In some examples of the disclosed technology, NLP parsing (especially term

negation/assertion) and clustering can be integrated into an LDPO framework. From the optimized clusters the associated text reports are collected and each image cluster's text reports are assembled together into a group. Next, NLP is performed on each unit of reports (e.g. , radiology reports) to find highly recurring words that may serve as informative key words per cluster by counting and ranking the frequency of each word. Common words to all clusters are first removed from the list. The resulting key words and randomly sampled exemplary images for each cluster or category are compiled for reviewing by board-certified radiologists.

[076] ImageNet neural networks are constructed according to WordNet ontology hierarchy. Recently, a formalism called Hierarchy and Exclusion (HEX) graphs can be used to perform object classification by exploiting the rich structure of real world labels. In this work, our converged CNN classification model can be further extended to explore the hierarchical class relationship in a tree representation. First, the pairwise class similarity or affinity score A, between a class is modeled via an adapted measurement from CNN classification confusion: 1

A _j = - (Prob (i \j) + Probl (j \ i))

where C C_j are the image sets for class i, j respectively, I · I is the cardinality function, CNN(/_m I j) is the CNN classification score of image 7_m from class C, at class j obtained directly by the N-way CNN flat-softmax. Here A_(J = is symmetric by averaging Prob(i \ j) and Probij I i).

[077] An Affinity Propagation algorithm (AP) is invoked to perform "tuning parameter-free" clustering on this pairwise affinity matrix { Ai,j } £ WL^KxK. This process can be executed recursively to generate a hierarchically merged category tree. Without loss of generality, we assume that at level L, classes i^L, j^L are formed by merging classes at level L-l through AP clustering. The new affinity score can be computed as follows.

where the L-th level class label j^L include all merged original classes (0-th level before AP is called) k E j^L so far. From the above, N-way CNN classification scores only need to be evaluated once. AiL, JL, at any level can be computed by summing over these original scores. The discovered category hierarchy can help alleviate the highly uneven visual separability between different object categories in image classification from which the category-embedded hierarchical deep CNN can be beneficial.

VII. Example Computing Environment

[078] FIG. 11 illustrates a generalized example of a suitable computing environment 1100 in which described embodiments, techniques, and technologies, including image analysis using an LDPO framework, can be implemented. For example, the computing environment 1100 can implement disclosed techniques for analyzing images by converging clusters of deep nodes in a neural network, as described herein.

[079] The computing environment 1100 is not intended to suggest any limitation as to scope of use or functionality of the technology, as the technology may be implemented in diverse general- purpose or special-purpose computing environments. For example, the disclosed technology may be implemented with other computer system configurations, including hand held devices, multiprocessor systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The disclosed technology may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

[080] With reference to FIG. 11, the computing environment 1100 includes at least one processing unit 1110 and memory 1120. In FIG. 11, this most basic configuration 1130 is included within a dashed line. The processing unit 1110 executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power and as such, multiple processors can be running simultaneously. The memory 1120 may be volatile memory (e.g. , registers, cache, RAM), non-volatile memory (e.g. , ROM, EEPROM, flash memory, etc.), or some combination of the two. The memory 1120 stores software 1180, images, and video that can, for example, implement the technologies described herein. A computing environment may have additional features. For example, one or more co-processing units 1115 or accelerators, including graphics processing units (GPUs), can be used to accelerate certain functions, including implementation of CNNs and RNNs. The computing environment 1100 may also include storage 1140, one or more input device(s) 1150, one or more output device(s) 1160, and one or more communication connection(s) 1170. An interconnection mechanism (not shown) such as a bus, a controller, or a network, interconnects the components of the computing environment 1100. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 1100, and coordinates activities of the components of the computing environment 1100.

[081] The storage 1140 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and that can be accessed within the computing environment 1100. The storage 1140 stores instructions for the software 1180, image data, and annotation data, which can be used to implement technologies described herein.

[082] The input device(s) 1150 may be a touch input device, such as a keyboard, keypad, mouse, touch screen display, pen, or trackball, a voice input device, a scanning device, or another device, that provides input to the computing environment 1100. For audio, the input device(s) 1150 may be a sound card or similar device that accepts audio input in analog or digital form, or a CD-ROM reader that provides audio samples to the computing environment 1100. The output device(s) 1160 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment 1100.

[083] The communication connection(s) 1170 enable communication over a communication medium (e.g. , a connecting network) to another computing entity. The communication medium conveys information such as computer-executable instructions, compressed graphics information, video, or other data in a modulated data signal. The communication connection(s) 1170 are not limited to wired connections (e.g. , megabit or gigabit Ethernet, Infiniband, Fibre Channel over electrical or fiber optic connections) but also include wireless technologies (e.g. , RF connections via Bluetooth, WiFi (IEEE 802.11a/b/n), WiMax, cellular, satellite, laser, infrared) and other suitable communication connections for providing a network connection for the disclosed methods. In a virtual host environment, the communication(s) connections can be a virtualized network connection provided by the virtual host.

[084] Some embodiments of the disclosed methods can be performed using computer-executable instructions implementing all or a portion of the disclosed technology in a computing cloud 1190. For example, disclosed compilers and/or processor servers are located in the computing environment, or the disclosed compilers can be executed on servers located in the computing cloud 1190. In some examples, the disclosed compilers execute on traditional central processing units (e.g. , RISC or CISC processors).

[085] Computer-readable media are any available media that can be accessed within a computing environment 1100. By way of example, and not limitation, with the computing environment 1100, computer-readable media include memory 1120 and/or storage 1140. As should be readily understood, the term computer-readable storage media includes the media for data storage such as memory 1120 and storage 1140, and not transmission media such as modulated data signals.

VIII. Example Image Analysis Experimental Results

[086] Example image analysis results are disclosed in the section, as can be performed in certain examples of the disclosed technology. As will be readily apparent to one of ordinary skill in the relevant art having the benefit of the present disclosure, the example systems, methods, and neural networks described above regarding FIGS. 1-3 and 11 can be adapted to provide the disclosed analysis results. Further, the technologies described above can be modified to suit particular datasets, computing environments, and performance requirements.

[087] In some examples of the disclosed technology, an image recognition system is initialized by domain-specific (e.g., trained on radiology images and text report derived labels) or generic (e.g. , ImageNet) CNN models. A looped sequence of deep pseudo-task optimization (LDPO) is exploited by iterative, unsupervised deep image feature clustering (to refine image labels) and supervised deep CNN training/classification (to obtain more task representative deep features, if new labels make more visually semantic senses). Using certain disclosed methods, hypothesized "convergence" better labels leading to better trained CNN models which consequently feed more effective deep image features to facilitate more meaningful clustering/labels. In the experimental results discussed below, a database was obtained to conduct experiments of certain disclosed methods. In particular, different pseudo-task initialization strategies, two CNN architectures of varying depths (in this example, AlexNet and GoogLeNet), different deep feature encoding schemes, and clustering via £-means only, or over-fragmented £-means followed by Regularized Information Maximization (RIM) as an effective model selection method), are extensively discussed and empirically evaluated. The example LDPO methodology shows congruent convergence under all aforementioned experimental settings and promising quantitative and qualitative results. Significantly higher quality of the resulting category labels is determined, especially in terms of visual coherence and class balance issues. Furthermore, LDPO image clustering and classification can feed back to text processing. The formed image groups can instruct the NLP based key "object" text words or informative tags extraction upon each cluster of patient reports (where their key images belong to one class), with higher statistical stability in word/term frequencies beyond individual reports.

[088] Examples tasks that can be performed using disclosed analysis techniques include (1) the investigation of the hierarchical semantic nature of (object/organ, pathology, scene, modality, etc.) categories, and (2) a finer level image mining of tag-constrained object instance discovery and detection from the given large scale radiology image database. Thus, un-supervised deep feature clustering is integrated with supervised deep label classification for self-annotating a large-scale radiology database where other ways of image annotation are not feasible or affordable.

[089] The experimental results discussed in this section use a dataset supplied by the National Institutes of Health Clinical Center. The image database contains 216,000 two-dimensional (2-D) key-images that are associated with ~62,000 unique patients' radiology reports. Key-images are directly extracted from a Dicom image file and resized as 256x256 bitmap images. Their intensity ranges are rescaled using the default window settings stored in the Dicom header files (this intensity rescaling factor was observed to improve CNN classification accuracies by ~2%). Linked radiology reports are also collected as separate text files, with patient- sensitive information removed for privacy reasons.

[090] Furthermore, a disclosed LDPO framework was evaluated on three widely-reported scene recognition benchmark datasets: (1) the 1-67 dataset including 67 indoor scene classes with 15,620 images; (2) the B-25 dataset including 25 architectural styles from 4,794 images; and (3) the S-15 dataset of 15 outdoor and indoor mixed scene classes with 4485 images. For scene recognition, the ground truth (GT) labels are used to validate the final quantitative LDPO clustering results (where cluster-purity becomes classification accuracy). The cluster number is assumed to be known to LDPO during clustering for a fair comparison. Thus the model selection RIM module is dropped.

[091] At each LDPO iteration, the image clustering is first applied on the entire image dataset so that each image will receive a cluster label. Then the whole dataset is randomly reshuffled into three subgroups for CNN fine-tuning via Stochastic Gradient Descent (SGD): for example, training (70%), validation (10%), and testing (20%). In this way, the convergence is not only achieved on a particular data-split configuration, but generalized to the entire dataset.

[092] The Caffe implementation of CNN models are used in the experiment. During the looped optimization process, the CNN is fine-tuned for each iteration once a new set of image labels is generated from the clustering stage. Only the last softmax classification layer of the models (e.g. , "FC8" in AlexNet and "loss3/classifier"in GoogLeNet) is significantly modulated by (1) setting a higher learning rate than all other layers and (2) updating the (varying but converging) number of category classes from the newly computed results of clustering.

[093] An analysis of how different settings of the disclosed LDPO frameworks will affect the convergence follows.

[094] The clustering methods used in the experiment was £-means based image clustering with k E 80, 100, 200, 300, 500, and 800. FIGS. 4A and 4B are a number of charts 410, 420, 430, and 440 that illustrate convergence of the clusters over ten iterations of neural network fine-tuning, as can be performed in certain examples of the disclosed technology. A first chart 410 shows the change in top- 1 classification accuracy for each iteration for a number of different £- values used for £-means clustering. A second chart 420 illustrates convergence of the clusters for the same convergence process as measured by a Purity convergence criteria. A third chart 430 illustrates the MMI value of the clusters as £-means clustering is performed for 10 iterations. As shown in FIGS. 4A and 4B, the classification accuracies quickly reach the plateau after 2 or 3 iterations. Naturally smaller k values triggers higher accuracies (> 86.0% for k = 80), considering that fewer categories make the classification task easier. Purity and NMI between clusters from two consecutive iterations both rise quickly and fluctuate close to 0.7, indicating the convergence of clustering labels (and CNN models). The minor fluctuations shown are caused by the randomly remix of the dataset in each iteration. RIM clustering takes an over-segmented £-means results as initialization, (e.g. , k = 1000) in our experiments.

[095] FIGS. 5 A and 5B are a series of charts 510, 520, and 530 illustrating convergence criteria for a number of iterations for a number of different neural networks. For example, the data includes the use of AlexNet and GoogleNet using the method depicted in FIG. 3 and based on the use of different internal nodes of the neural network to converge. As shown, in FIG. 5A, a first chart 510 illustrates the number of clusters formed for ten iterations of cluster convergence using different internal node values and neural networks. A second chart 520 illustrates Purity criteria for a group of clusters for the same ten iterations. A third chart 530 in FIG. 5B illustrates convergence of the MMI criteria for the group of clusters across ten iterations of cluster convergence. As shown in the first chart 510, RIM can estimate the category capacities or numbers consistently under different image representations (deep CNN feature and encoding approaches).

[096] Thus, in the illustrated examples, £-means clustering enables LDPO to approach the convergence quickly with high classification accuracies, whereas the added RIM based model selection delivers more balanced and semantically meaningful clustering results. This is due to two of RIM's characteristics: (1) RIM has less restricted geometric assumptions in the clustering feature space, and (2) RIM can reach the optimal number of clusters by maximizing the mutual information of input data and the induced clusters via a regularized term.

[097] Both ImageNet and domain-specific CNN models have been employed to initialize the LDPO framework. In FIGS. 5A and 5B, two CNNs of AlexNet-FC7-ImageNet and AlexNet-FC7- Topic demonstrate their LDPO performances. LDPO initialized by ImageNet CNN reach the steady state noticeably slower than its counterpart, considering AlexNet-FC7 -Topic already contains the domain information in this radiology image database. However similar clustering outputs are produced after the convergence. Letting LDPO reach ~10 iterations, two different initializations end up with very close clustering results (based on Cluster number, purity, and NMI) and similar classification accuracies (as shown below in Table 2, which displays experimental results for classification accuracy of converged CNN models).

Table 2

[098] Different image representations can vary the performance of the disclosed LDPO framework as shown in FIGS. 5 A and 5B. Deep CNN images features extracted from different layers of CNN models (AlexNet and GoogLeNet) contain level-specific visual information:

convolutional layer features retain the spatial activation layouts of images, while FC layer features do not. Different encoding approaches further lead to various outcomes of our LOP framework. The numbers of clusters are ranged from 270 (AlexNet FC7 features with no encoding) to 931 (convolutional feature of more sophisticated GoogLeNet under VLAD encoding). The numbers of clusters discovered by RIM imply the amount of information complexity stored in the database collection of image features.

[099] To generate the experimental results, each looped process is running on a node of Linux computer cluster with 16 CPU cores (x2650), 128 GB memory, and Nvidia K20 GPUs. The computational costs of different settings of an example LDPO framework is shown below in Table 3. As shown, the settings with more sophisticated and richer information will take longer to converge.

CNN setting Time per iter. (HH:MM)

AlexNet-FC7-Topic 14:35

AlexNet-FC7-ImageNet 14:40

AlexNet-Conv5-FV 17:40

AlexNet-Conv5-VLAD 15:44

GoogLeNet-Pool5 21: 12

GoogLeNet-5b-VLAD 23:35

Table 3

[0100] The example LDPO framework performs category discovery for more visually coherent and cluster-wise balanced results instead of only looking from a text report or NLP perspective.

[0101] In additional examples of unsupervised medical image categorization, category discovery clusters generates using disclosed LDPO methods are found to be more visually coherent and clusterwise balanced FIG. 6 shows the image numbers for each cluster from the AlexNet-FC7- Topic setting. The numbers are uniformly distributed with a mean of 778 and standard deviation of 52.

[0102] FIG. 7 is a chart that illustrates the relation of clustering results from purely images or text reports. The clusters from text only are highly uneven, where 3 clusters inhabit the majority of images. It is observed that there is no instance-balance-per-cluster constraints in the LDPO clustering. FIGS. 8 and 13 show sample images and top-10 associated key words from five randomly selected clusters. The LDPO clusters are found to be semantically or clinically related to the corresponding key words, containing the information of (likely appeared) anatomies, pathologies (e.g. , adenopathy, mass), their attributes (e.g. , bulky, frontal), and imaging protocols or properties.

[0103] The final trained CNN classification models allow to compute the pairwise category similarities or affinity scores using the CNN classification confusion values between any pair of classes. An affinity propagation algorithm is called recursively to form a hierarchical category tree. The resulted category tree has (270, 64, 15, 4, 1) different class labels from bottom (leaf) to top (root). The random color coded category tree is shown in FIG. 10. The high majority of images in the clusters of this branch are verified as CT Chest scans by radiologists. Enabling to construct a semantic and meaningful hierarchy of classes offers another indicator to validate the disclosed LDPO category discovery method and results. [0104] FIG. 8 is a graphic 800 that shows sample images and the top ten associated key words from three randomly- selected clusters. As shown in FIG. 8, the resulting cluster is both visually coherent and semantically related the corresponding text reports. It illustrates the feasibility of automatic labeling of a large-scale radiology image database. The final trained CNN classification models permit computing pairwise category similarities or affinity scores using the CNN classification confusion values between any pairs of classes.

[0105] In some examples, an affinity propagation algorithm is called recursively to form a hierarchical category tree. The resulting category tree has (270, 64, 15, 4, 1) different class labels from bottom (leaf) to top (root). An example of a random shade coded category tree 900 is shown in FIG. 9. A portion 1000 of the category tree is depicted in FIG. 10, which is a category tree that displays the number of images associated with each cluster or set of clusters in the hierarchy.

IX. Example Image Analysis Experimental Results

[0106] Unsupervised scene recognition performed on the Massachusetts Institute of Technology (MIT) indoor scene dataset, according to an example method was observed to attain a clustering accuracy of 75.3%, compared to the state-of-the-art supervised classification accuracy of 81.0% (when both are based on the VGG-VD model).

[0107] Examples methods disclosed herein integrate unsupervised deep feature clustering and supervised deep label classification for self- annotating a large-scale radiology image database where the conventional means of image annotation may not be quite feasible. In one examples, the converged model obtains the Top-1 classification accuracy of 0.8109 and Top-5 accuracy 0.9412 with 270 formed image categories. An example 67-class clustering achieves an accuracy of 75.3% on the MIT-67 indoor scene dataset that doubles the performance from other baseline methods (of using £-means or agglomerative clustering on the ImageNet-pretrained deep image features via AlexNet) and is strongly close to the fully-supervised deep classification result of 81.0%.

[0108] For the example results discussed in this section, in each LDPO round, image clustering is applied on the entire image dataset in order to assign a cluster label to each image. For CNN model fine-tuning, images are randomly reshuffled into three subsets of training (70%), validation (10%) and testing (20%) at each iteration. This is done to ensure that LDPO convergence will generalize to the entire image database. The CNN model is fine-tuned at each LDPO iteration once a new set of image labels is generated from the clustering stage. The Caffe implementation of CNN models was used. The softmax loss layer (e.g. , "FC8" in AlexNet and "loss3/classifier" in GoogLeNet) can be more significantly modulated by (1) setting a higher learning rate than all other CNN layers; and (2) updating the (varying but converging) number of category classes from the clustering results.

X. Unsupervised Medical Image Categorization

[0109] FIGS. 12-15 illustrate examples results obtained by performing unsupervised medical image categorization according to certain examples of the disclosed technology. An analysis of the convergence issue of LDPO methods under different system configurations is proved and then the CNN classification performance is reported for the discovered categories.

[0110] As shown in FIGS. 12A-12D, RIM can estimate unsupervised category numbers consistently well under different image representations (deep CNN feature configurations + encoding schemes). Standalone £-means clustering enables LDPO to converge quickly with high classification accuracies whereas RIM based model selection module produces more balanced and semantically meaningful clustering results. This can be attributed to two properties of RIM-based models: (1) less restricted geometric assumptions in the clustering feature space and (2) the capacity to attain the optimal number of clusters by maximizing the mutual information between input data and the induced clusters via a regularized term.

[0111] Both generic and domain- specific CNN models can be employed for LDPO initialization. FIGS. 12A-12D illustrate LDPO performance according to various combinations of encoding methods (FV and VLAD) and CNN's (AlexNet and GoogLeNet). FIG. 12A is a chart 1200 illustrating the number of clusters discovered according to the various combinations. FIG. 12B is a chart 1210 illustrating Top-1 accuracy of the trained CNNs according to the various combinations. FIG. 12C is a chart 1220 illustrating purity of clusters obtained according to the various combinations. FIG. 12D is a chart 1230 illustrating NMI measurements of clusters according to the various combinations.

[0112] In particular, the charts of FIGS. 12A-12D illustrate the performance of LDPO using two CNN variants— AlexNet-FC7-ImageNet and AlexNet-FC7-Topic. As shown, AlexNet-FC7- ImageNet yields noticeably slower LDPO convergence than its counterpart of AlexNet- FC7-Topic, as the latter has already been fine-tuned by the report-derived category information on the same radiology image database. Nevertheless, the final clustering outcomes are similar after convergence from AlexNet-FC7-ImageNet or AlexNet-FC7 -Topic. At about 10 iterations, two different initializations result in similar cluster numbers, purity/NMI scores, and even classification accuracies. See Table 2, above.

[0113] Different configurations of image representation can affect the performance of medical image categorization, as shown in FIGS. 12A-12D. Deep images features are extracted at different layers of depth from two CNN models (e.g. , AlexNet or GoogLeNet) and may present the depth- specific visual information. Different image feature encoding schemes (FV or VLAD) add further options or variations into this process. The numbers of clusters range from 270 (AlexNet-FC7- Topic with no explicit feature encoding scheme) to 931 (the more sophisticated GoogLeNet- Inc.5b- VLAD with VLAD encoding). The numbers of clusters discovered by RIM are expected to reflect the amount of knowledge or information complexity stored in the PACS database.

[0114] Disclosed category discovery clusters are generally visually coherent within the cluster and size-balanced across clusters. However, image clusters formed only based on text information (of -780K radiology reports) are highly unbalanced, with three clusters inhabiting the majority of images. Note that certain example methods impose no explicit constraint on the number of instances per cluster. FIGS. 8 and 13 are charts 800 and 1300 illustrating a number of sample images and their top- 10 associated key words from five randomly selected clusters (more results are provided below). The LDPO clusters are found to be clinically or semantically related to the corresponding key words, which describe presented anatomies, pathologies (e.g. , adenopathy, mass), their associated attributes (e.g. , bulky, frontal) and imaging protocols or properties.

[0115] The results validate the following hypothesis: a high quality unsupervised image categorization scheme will generate labels that can be more easily recognized by any supervised CNN model. From Table 2, AlexNet-FC7-Topic has the Top-1 classification accuracy of 0.8109 and Top-5 accuracy 0.9412 with 270 formed image categories while AlexNet-FC7-ImageNet achieves the accuracies of 0.8099 and 0.9547, from 275 discovered classes. The classification accuracies shown in Table 2 are computed using the final LDPO-converged CNN models and the testing dataset. Markedly better accuracies (especially on Top-1) on classifying higher numbers of classes (that are generally more challenging) also demonstrate certain advantages of the LDPO discovered image clusters or labels. Upon evaluation by two board-certified radiologists, AlexNet- FC7-Topic of 270 categories and AlexNet-FC7-ImageNet of 275 classes are considered the best of total six model-feature-encoding setups. Interestingly, both models have no external feature encoding schemes built-in and preserve gloss image layouts (without spatially unordered FV or VLAD encoding modules). XI. Example Results Using Scene Recognition

[0116] Three scene recognition datasets are used in another example to quantitatively evaluate the proposed LDPO-PM method (with patch mining) based on two metrics: (1) clustering based scene recognition accuracy and (2) supervised classification (e.g. , Li-bilinear) on image representations learned in an unsupervised fashion. The purity and NMI measurements are computed between the final LDPO clusters and GT scene classes where purity becomes the classification accuracy against GT. The LDPO cluster numbers are set to match the GT class numbers of (67, 25, 15), respectively. The LDPO scene recognition performance according to a disclosed method is compared to those of several popular clustering methods, such as KM: £-means; LSC; AC:

Agglomerative clustering; EP: Ensemble Projection + £-means; and MDPM: Midlevel

Discriminative Patch Mining + £-means. Both EP and MDPM use mid-level visual elements based image representations. Three variants of a disclosed method (LDPO-AFC7: FC7 feature on AlexNet, LDPO-A-PM: FC7 feature on AlexNet with patch mining, and LDPO-V-PM: FC7 feature on VGG-VD with patch mining) are examined.

[0117] Table 4 includes results data for Clustering performance of LOM and other methods on three scene recognition datasets. The last column presents fully- supervised scene Classification Accuracy (CA) for each dataset, produced by alternative methods, respectively. On all three datasets, the LDPO-A-PM and LDPO-V-PM achieve significantly higher purity and NMI values than the previous clustering methods (compare Table 5). Especially for the MIT-67 indoor scene dataset the disclosed model LDPOV-PM achieves an unsupervised scene recognition accuracy of 75.3%, which nearly doubles the performances of KM and AC on FC7 features of an ImageNet pretrained AlexNet. Note that the supervised classification accuracy on MIT-67 is 81.0% and that disclosed unsupervised methods are comparatively close to that. VGG-VD, a deeper CNN model, is empirically observed to boost the recognition performance from LDPO-A-PM of 63.2% to LDPO-V-PM at 75.3% on MIT-67. However this performance gain was not observed on two other smaller datasets.

Table 4 [0118] In this section, the supervised discriminative power of LDPO-PM learned image representation is evaluated. Its classification accuracy is measured using the MIT-67 dataset and its standard partition with 80 training and 20 testing images per class. The Liblinear classification toolbox is used on the LDPO-V-PM image representation (noted as LDPO-V-PM-LL), under 5- fold cross validation. The supervised and unsupervised scene recognition accuracy results from previous state-of-the-art work and variants of our method are listed in Table 4, above, which includes scene recognition accuracy data on the MIT-67 dataset. The one-versus-all Liblinear classification in LDPOV-PM-LL does not noticeably improve upon purely unsupervised LDPO-V- PM. This may indicate that the LDPOPM image representation is sufficient to adequately separate images from different scene classes. Another experiment is performed for the clustering convergence issue with two different initializations: random initialization or image labels obtained from k-means clustering on FC7 features of an ImageNet pretrained AlexNet. While the clustering accuracy of the LDPO-PM with random initialization increases rapidly during its first iterations, both schemes ultimately converge to similar performance levels. This suggests that the LDPO convergence is insensitive to the chosen initialization.

Table 5

XII. Example Results Using Scene Recognition

[0119] In this section, example results for a quantitative validation of the disclosed LDPO-PM methods (with patch mining) are analyzed for the following aspects: (1) supervised classification (e.g. , Liblinear ) on image representations learned in an unsupervised fashion and (2) the convergence analysis with different initialization strategies. The techniques disclosed in the section can be used to implement a scene recognition unit, such as the scene recognition unit 155 discussed above with respect to FIG. 1.

[0120] The supervised discriminative power of LDPO-PM learned image representation is evaluated. The MIT indoor scene dataset and its standard partition, including 80 training and 20 testing images per class, are adopted to examine the classification accuracy. A Liblinear classification toolbox is used on LDPO-A-PM and LDPO-V-PM image representation (noted as LDPO-A-PM-LL and LDPO-V-PMLL) under 5-fold cross validation. The supervised and unsupervised scene recognition accuracy results from previous state-of-the-art work and variants of our method are listed in Table 6. The one-versus-all Liblinear classification in LDPO-A-PM-LL, LDPO-VPM-LL is not observed to noticeably improve upon purely unsupervised LDPO-A-PM and LDPO-V-PM. This may indicate LDPO-PM image representation are already sufficient on separating images from different scene classes.

Table 6

[0121] The clustering convergence issue with two different initializations is performed: random initialization or image labels obtained from £-means clustering on FC7 features of an ImageNet pretrained AlexNet. The clustering accuracies for both settings are plotted across iterations. As illustrated in the chart 1400 of FIG. 14, the clustering accuracies of the random initialization setting boost significantly during the first several LDPO-PM iterations and finally the performances of both strategies converge to a similar level. Therefore it appears that the LDPO convergence is insensitive to different initialization settings.

[0122] Computational Cost of LDPO runs on a node of Linux computer cluster with 16 CPU cores (x2650), 128G memory and two Nvidia K20 GPUs is analyzed. The computational costs of different method configurations (ranging from 14:35 to 28:38 in hours: minutes) are shown in Table 7 per looped iteration. As shown, the more sophisticated and feature rich settings, e.g. , AlexNet- Conv5-FV, AlexNet-Conv5-VLAD, and VGG-VD-FC7-PM, require more time to converge. Computational Cost of LDPO

CNN setting Time per iter. (HH:MM)

Medical Image Categorization

AlexNet-FC7-Topic 14:35

AlexNet-FC7-Imagenet 14.40

AlexNet-Conv5-FV 17:40

AlexNet-Conv5-VLAD 15.44

GoogLeNet-Pool5 21:12

GoogLeNet-Inc.5b-VLAD 23.35

Scene Recognition

AlexNet-FC7-PM 18.11

VGG-VD-FC7-PM 28.38

Table 7

[0123] As will be readily understood to one of ordinary skill in the relevant art having the benefit of the present disclosure, the experimental results disclosed herein are provided to demonstrate efficacy of certain disclosed methods, but are not intended to be limited in any way.

[0124] In view of the many possible embodiments to which the principles of the disclosed technology may be applied, it should be recognized that the illustrated embodiments are only preferred examples of the disclosed technology and should not be taken as limiting the scope of the claims. Rather, the scope of the claimed subject matter is defined by the following claims. We therefore claim as our invention all that comes within the scope of these claims and their equivalents.

Claims

We claim:

1. A computer- implemented method of analyzing a collection of images with a neural network, the method comprising:

producing the neural network, the neural network including a plurality of input, output, and internal nodes and a plurality of links interconnecting the nodes, each of the nodes having an associated plurality of activation values, and each of the links having an associated plurality of weights;

extracting at least a portion of the activation values and weights associated with internal nodes of the neural network responsive to a respective input image of the collection of images being applied to the neural network;

encoding the extracted activation values and weights, producing encoded vectors;

clustering at least two of the collection of images based on similarities of the encoded vectors, to produce a plurality of clusters; and

evaluating the clusters with a convergence criteria.

2. The method of claim 1, further comprising:

if the evaluating indicates the convergence criteria does not satisfy a threshold, then fine- tuning the neural network; and

repeating at least one of the extracting, the encoding, the clustering, and the evaluating with the fine-tuned neural network.

3. The method of claim 2, wherein the fine-tuning comprises reshuffling the collection of images via stochastic gradient descent.

4. The method of claim 2, wherein the fine-tuning comprises adjusting a learning rate of one or more classification layers of the neural network.

5. The method of claim 2, wherein the fine-tuning comprises adjusting classes used to perform the clustering.

6. The method of claim 1 or claim 2, further comprising:

if the evaluating indicates the convergence criteria satisfies a threshold, then labeling images within a given cluster with one or more related labels.

7. The method of claim 6, wherein the labeling is based at least in part on analyzing text reports associated with each cluster using natural language processing.

8. The method of claim 1, wherein the neural network is AlexNet.

9. The method of claim 1, wherein the neural network is GoogLeNet.

10. The method of claim 1, wherein the neural network comprises:

one or more convolutional layers that filters a subset of input from nodes of a preceding layer; and

one or more fully-connected layers coupled to receive output from at least one of the convolutional layers.

11. The method of claim 1, wherein the neural network is initially a fine-tuned neural network.

12. The method of claim 1, wherein the neural network is initially a generic convolutional neural network.

13. The method of claim 1, wherein the encoding comprises generating a Fisher vector encoding.

14. The method of claim 1, wherein the encoding comprises generating a vector of locally aggregated descriptors (VLAD) encoding.

15. The method of claim 1, wherein the clustering is performed using an iterative refinement technique.

16. The method of claim 15, wherein the iterative refinement technique is a £-means method.

17. The method of claim 16, wherein the iterative refinement technique further includes performing regularized information maximization.

18. The method of claim 1, wherein the convergence criteria include a Purity function.

19. The method of claim 1, wherein the convergence criteria include a normalized mutual information function.

20. The method of claim 1, further comprising applying natural language processing to keyword labels associated with each of the images in a respective cluster.

21. The method of claim 1, further comprising applying natural language processing to keyword labels associated with each of the images in a respective cluster and applying a label to the images based on keyword frequencies.

22. The method of claim 1 , further comprising generating hierarchical category relationships for at least some of the clusters based on similarities of the encoded vectors for the clusters.

23. The method of claim 1, wherein the clustering comprises performing scene recognition for two or more of the collection of images, wherein the produced plurality of clusters are based at least in part on recognizing a scene in at least one of the collection of images.

24. The method of claim 1 , wherein the clustering comprises selecting one or more portions within an individual one of the collection of images and performing scene selection for the selected portion.

25. The method of claim 1, wherein the collection of images comprises medical images of mammals.

26. The method of claim 1, wherein the collection of images include at least one or more of the following: x-ray images, computed tomography (CT) images, magnetic resonance images, or positron emission tomography-computed tomography (PET/CT) images.

27. The method of claim 1, wherein at least one of the collection of images are associated with annotation data indicating at least one or more of the following: a disease, a severity level of a disease, an organ, or a location.

28. The method of claim 1, further comprising diagnosing a pathology for at least one image of the collection of images.

29. The method of claim 1 , wherein the neural network is implemented with a general- purpose microprocessor or a graphics processor (GPU).

30. One or more computer readable storage media storing computer readable instructions that when executed by a processor, cause the processor to perform the method of any one of claims 1-29.

31. A system, comprising:

memory;

one or more general purpose processors and/or graphics processing unit processors; and one or more computer-readable storage media storing computer-readable instructions that when executed by the processors, cause the processors to perform any one of the methods of claims 1-29.

32. A system, comprising:

a computer-readable database storing the collection of images;

a neural network coupled to receive the images as input; and

an encoding and clustering unit configured to generate clusters of the images by extracting activation values and/or weights for internal nodes of the neural network and their respective edges and classify the images into clusters based at least in part on similarities between the respective activation values and/or weights.

33. The system of claim 32, wherein the encoding and clustering unit is further configured to produce the extracted activation values and or weights as encoded vectors, wherein the clustering is based at least in part on the encoded vectors.

34. The system of claim 32, further comprising a fine-tuning unit configured to adjust the activation values and/or the edge weights using an iterative refinement technique.

35. The system of claim 34, wherein the fine-tuning is responsive to evaluating a convergence criteria for the clusters.