Medical Image Analysis

Volume 35, January 2017, Pages 303-312

Large scale deep learning for computer aided detection of mammographic lesions

https://doi.org/10.1016/j.media.2016.07.007

Highlights

  • A system based on deep learning is shown to outperform a state-of-the-art CAD system.

  • Adding complementary handcrafted features to the CNN is shown to increase performance.

  • The system based on deep learning is shown to perform at the level of a radiologist.

Abstract

Recent advances in machine learning have yielded new techniques to train deep neural networks, which have resulted in highly successful applications in many pattern recognition tasks such as object detection and speech recognition. In this paper we provide a head-to-head comparison between a state-of-the-art mammography CAD system, relying on a manually designed feature set, and a Convolutional Neural Network (CNN), aiming for a system that can ultimately read mammograms independently. Both systems are trained on a large data set of around 45,000 images, and results show that the CNN outperforms the traditional CAD system at low sensitivity and performs comparably at high sensitivity. We subsequently investigate to what extent features such as location, patient information and commonly used manual features can still complement the network, and see improvements at high specificity over the CNN, especially with location and context features, which contain information not available to the CNN. Additionally, a reader study was performed in which the network was compared to certified screening radiologists on a patch level, and we found no significant difference between the network and the readers.

Introduction

Nearly 40 million mammographic exams are performed in the US alone on a yearly basis, arising predominantly from screening programs implemented to detect breast cancer at an early stage, which has been shown to increase the chances of survival (Tabar et al., 2003; Broeders et al., 2012). Similar programs have been implemented in many western countries. All this data has to be inspected for signs of cancer by one or more experienced readers, which is a time-consuming, costly and, most importantly, error-prone endeavor. Striving for optimal health care, Computer Aided Detection and Diagnosis (CAD) systems (Giger et al., 2001; Doi, 2005; Doi, 2007; van Ginneken et al., 2011) are being developed and are currently widely employed as a second reader (Rao et al., 2010; Malich et al., 2006), with numbers from the US going up to 70% of all screening studies in hospital facilities and 85% in private institutions (Rao et al., 2010). Computers do not suffer from drops in concentration, are consistent when presented with the same input data, and can potentially be trained with a vast number of samples, far more than any radiologist will experience in a lifetime.

Until recently, the effectiveness of CAD systems and many other pattern recognition applications depended on meticulously handcrafted features, topped off with a learning algorithm to map them to a decision variable. Radiologists are often consulted in the process of feature design and, in the case of mammography, features such as the contrast of the lesion, spiculation patterns and the sharpness of the border are used. These feature transformations provide a platform to instill task-specific, a-priori knowledge, but introduce a large bias towards how we humans think the task is performed. Since the inception of Artificial Intelligence (AI) as a scientific discipline, research has seen a shift from rule-based, problem-specific solutions to increasingly generic, problem-agnostic methods based on learning, of which deep learning (Bengio, 2009; Bengio et al., 2013; Schmidhuber, 2015; LeCun et al., 2015) is the most recent manifestation. By distilling information directly from training samples rather than from the domain expert, deep learning allows us to optimally exploit the ever increasing amounts of data and to reduce human bias. For many pattern recognition tasks, this has proven so successful that systems are now reaching human or even superhuman performance (Cireşan et al., 2012; Mnih et al., 2015; He et al., 2015).

The term deep typically refers to the layered non-linearities in the learning systems, which enable the model to represent a function with far fewer parameters and facilitate more efficient learning (Bengio et al., 2007; Bengio, 2009). These models are not new, and work on them has been done since the late seventies (Fukushima, 1980; Lecun et al., 1998). In 2006, however, two papers (Hinton et al., 2006; Bengio et al., 2007) showing that deep networks can be trained in a greedy, layer-wise fashion sparked new interest in the topic. Restricted Boltzmann Machines (RBMs), probabilistic generative models, and autoencoders (AEs), one-layer neural networks, were shown to be expedient pattern recognizers when stacked to form Deep Belief Networks (DBNs) (Hinton et al., 2006; Bengio et al., 2007) and Stacked Autoencoders, respectively. Currently, fully supervised Convolutional Neural Networks (CNNs) dominate the leader boards (Krizhevsky et al., 2012; Zeiler and Fergus, 2014; Simonyan and Zisserman, 2015; Ioffe and Szegedy, 2015; He et al., 2015). Their performance increase with respect to the previous decades can largely be attributed to more efficient training methods, advances in hardware such as the employment of many-core computing (Cireşan et al., 2011) and, most importantly, the sheer amount of annotated training data (Russakovsky et al., 2014).

To the best of our knowledge, Sahiner et al. (1996) were the first to attempt a CNN setup for mammography. Instead of raw images, texture maps were fed to a simple network with two hidden layers, producing two and three feature images, respectively. The method gave acceptable, but not spectacular, results. Much has changed since this publication, however, not only with regard to statistical learning, but also in the context of acquisition techniques. Screen Film Mammography (SFM) has made way for Digital Mammography (DM), enabling higher-quality raw images, in which pixel values have a well-defined physical meaning, and easier distribution of large amounts of training data. Given the advances in learning and data, we feel that revisiting CNNs for mammography is more than worthwhile.

Work on CAD for mammography (Elter and Horsch, 2009; Nishikawa, 2007; Astley and Gilbert, 2004) has been done since the early nineties but, unfortunately, progress has largely stagnated in the past decade. Methods are developed on small data sets (Mudigonda et al., 2000; Zheng et al., 2010) that are not always shared, and algorithms are difficult to compare (Elter and Horsch, 2009). Breast cancer has two main manifestations in mammography: firstly the presence of malignant soft tissue, or masses, and secondly the presence of microcalcifications (Cheng and Huang, 2003); separate systems are developed for each. Microcalcifications are often small and can easily be missed by oversight. Some studies suggest that CAD for microcalcifications is highly effective in reducing oversight (Malich et al., 2006) with acceptable numbers of false positives. The merit of CAD for masses is less clear, however, with research suggesting that human errors stem not from oversight but rather from misinterpretation (Malich et al., 2006). Some studies show no increase in sensitivity or specificity with CAD for masses (Taylor et al., 2005), or even a decreased specificity without an improvement in detection rate or characterization of invasive cancers (Fenton et al., 2011; Lehman et al., 2015). We therefore feel motivated to improve upon the state of the art.

In previous work from our group (Hupse et al., 2013), we showed that a sophisticated CAD system taking into account not only local information, but also context, symmetry and the relation between the two views of the same breast, can operate at the performance level of a resident radiologist, and at that of a certified radiologist at high specificity. In a different study (Karssemeijer et al., 2004), it was shown that combining the judgments of up to twelve radiologists improved reading performance, providing a lower bound on the maximum amount of information in the medium and suggesting ample room for improvement over the current system.

In this paper, we provide a head-to-head comparison between a CNN and a CAD system relying on an exhaustive set of manually designed features, and show that the CNN outperforms a state-of-the-art mammography CAD system, with both trained on a large data set of around 45,000 images. We focus on the detection of solid, malignant lesions including architectural distortions, treating benign abnormalities such as cysts or fibroadenomas as false positives. The goal of this paper is not to give an optimally concise set of features, but to use a complete set in which all descriptors commonly applied in mammography are represented, and thereby provide a fair comparison with the deep learning method. As mentioned by Szegedy et al. (2014), success in the past two years in the context of object recognition can in part be attributed to judiciously combining CNNs with classical computer vision techniques. In this spirit, we employ a candidate detector to obtain a set of suspicious locations, which are subjected to further scrutiny, either by the classical system or by the CNN. We subsequently investigate to what extent the CNN is still complementary to traditional descriptors by combining the learned representation with features such as location, contrast and patient information, part of which is not explicitly represented in the patch fed to the network. Lastly, a reader study is performed in which we compare the scores of the CNN to those of experienced radiologists on a patch level.
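
To make the fusion step concrete, the sketch below shows one way of combining a learned CNN representation with handcrafted descriptors such as location and contrast; it is illustrative only, not the authors' exact pipeline. The file names and array shapes are hypothetical, and a random forest (Breiman, 2001) stands in as the second-stage classifier.

    # Illustrative feature fusion (assumed inputs, not the paper's code):
    # one row per candidate lesion for each representation.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    cnn_features = np.load("cnn_hidden_layer.npy")     # e.g. (n_candidates, 256)
    handcrafted = np.load("handcrafted_features.npy")  # e.g. (n_candidates, 12)
    labels = np.load("labels.npy")                     # 1 = malignant, 0 = normal

    # Concatenate both representations into a single feature vector.
    fused = np.hstack([cnn_features, handcrafted])

    # Train an example second-stage learner on the fused vectors.
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(fused, labels)
    scores = clf.predict_proba(fused)[:, 1]  # suspiciousness per candidate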

The rest of this paper is organized as follows. In the next section, we will give details regarding the candidate detection system, shared by both methods. In Section 3, the CNN will be introduced followed by a description of the reference system in Section 4. In Section 5, we will describe the experiments performed and present results, followed by a discussion in Section 6 and conclusion in Section 7.

Section snippets

Candidate detection

Before gathering evidence, every pixel is a possible center of a lesion. Treating all pixels as candidates yields few positives and an overwhelming number of predominantly obvious negatives; the genuinely difficult examples could then be regarded as outliers and generalized away, hindering training. Sliding-window methods, previously popular in image analysis, have recently been losing ground to candidate detection (Hosang et al., 2015) methods such as selective search (Uijlings et al., 2013), which reduce the search space (
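
As an illustration of the idea behind candidate detection, the sketch below replaces exhaustive sliding-window scoring with the local maxima of a cheap suspiciousness map. It is a minimal stand-in, not the candidate detector used in this work: the smoothed intensity map, neighbourhood size and candidate count are all assumptions.

    # Toy candidate detector: keep only local maxima of a cheap
    # suspiciousness map instead of scoring every pixel position.
    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter

    def candidate_locations(image, sigma=8.0, n_candidates=20):
        # Stand-in for a trained pixel-level detector: smoothed intensity.
        smooth = gaussian_filter(image.astype(float), sigma)
        # A pixel survives if it is the maximum within its neighbourhood.
        peaks = smooth == maximum_filter(smooth, size=51)
        ys, xs = np.nonzero(peaks)
        # Rank the surviving peaks and keep the strongest ones.
        order = np.argsort(smooth[ys, xs])[::-1][:n_candidates]
        return list(zip(ys[order], xs[order]))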

Deep convolutional neural network

In part inspired by human visual processing, CNNs learn hierarchies of filter kernels, each layer creating a more abstract representation of the data. The term deep generally refers to the nesting of non-linear functions (Bengio, 2009). Multilayer Perceptrons (MLPs) have been shown to be universal function approximators under some very mild assumptions, and therefore there is no theoretical limit preventing them from learning the same mapping a deep architecture would.
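
A schematic patch classifier in this spirit is sketched below. The paper's network was implemented in Theano (Bergstra et al., 2010); the PyTorch fragment here is purely illustrative, and its layer sizes are assumptions rather than the architecture used in this work.

    # Schematic CNN for grey-level mammography patches (illustrative only).
    import torch.nn as nn

    class PatchCNN(nn.Module):
        def __init__(self):
            super().__init__()
            # Stacked convolutions: each stage yields a more abstract,
            # lower-resolution representation of the patch.
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.LazyLinear(256), nn.ReLU(),
                nn.Linear(256, 2),  # malignant vs. normal
            )

        def forward(self, x):  # x: (batch, 1, height, width) patches
            return self.classifier(self.features(x))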

Reference system

The large majority of CAD systems rely on some form of segmentation of the candidates, on which region-based features are computed. To this end, we employ the mass segmentation method proposed by Timp and Karssemeijer (2004), which was shown to be superior to other methods (region growing (te Brake and Karssemeijer, 2001) and active contour segmentation (Kupinski and Giger, 1998)) on their particular feature set. The image is transformed to a polar domain around the center of the candidate and
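
The polar resampling at the heart of this segmentation step can be sketched as follows. The fragment is a simplified illustration of mapping the region around a candidate centre to (radius, angle) coordinates; the sampling parameters are assumptions, not the values used by Timp and Karssemeijer (2004).

    # Resample the neighbourhood of a candidate centre into a polar grid,
    # in which a roughly circular lesion contour becomes a horizontal path.
    import numpy as np
    from scipy.ndimage import map_coordinates

    def to_polar(image, center, n_radii=64, n_angles=128, max_radius=100):
        cy, cx = center
        radii = np.linspace(0, max_radius, n_radii)
        angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
        r, a = np.meshgrid(radii, angles, indexing="ij")
        ys, xs = cy + r * np.sin(a), cx + r * np.cos(a)
        # Bilinear interpolation at the polar sample positions.
        return map_coordinates(image.astype(float), [ys, xs], order=1)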

Data

The mammograms used were collected from a large-scale screening program in The Netherlands (bevolkingsonderzoek midden-west) and recorded using a Hologic Selenia digital mammography system. All tumours are biopsy-proven malignancies, annotated by an experienced reader. Before presentation to a radiologist, the manufacturer applies processing to the image to optimize it for viewing by a human. To prevent information loss and bias, we used the raw images instead and only applied a log transform which
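
A minimal sketch of the preprocessing just described is given below, assuming the raw image is stored as an integer array; the small offset is an assumption added to keep the logarithm defined, not a value from the paper.

    # Work on the raw image and apply only a log transform, so that
    # pixel values retain a well-defined physical interpretation.
    import numpy as np

    def log_transform(raw, eps=1.0):
        return np.log(raw.astype(float) + eps)  # eps avoids log(0)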

Discussion

To get more insight into the performance of the network, examples of the top misclassified positives and negatives are shown in Figs. 11 and 10, respectively. A large proportion of the patches deemed suspicious by the network are benign abnormalities such as cysts and fibroadenomas, or normal structures such as lymph nodes or fat necrosis. Cysts and lymph nodes can look relatively similar to masses. These strong false positives occur due to the absence of benign lesions in our training set. In

Conclusion

In this paper we have shown that a deep learning model in the form of a Convolutional Neural Network (CNN), trained on a large data set of mammographic lesions, outperforms a state-of-the-art Computer Aided Detection (CAD) system and therefore has great potential to advance the field. A major advantage is that the CNN learns from data and does not rely on domain experts, making development easier and faster. We have shown that the addition of location information and context can

Acknowledgements

This research was funded by grant KUN 2012-557 of the Dutch Cancer Society and supported by the Foundation of Population Screening Mid West.

References (65)

  • Y. Bengio et al.

    Greedy layer-wise training of deep networks

    Advances in Neural Information Processing Systems

    (2007)
  • J. Bergstra et al.

    Theano: a CPU and GPU math expression compiler

    Proceedings of the Python for Scientific Computing Conference (SciPy)

    (2010)
  • H. Bornefalk et al.

    On the comparison of FROC curves in mammography CAD systems

    Med. Phys.

    (2005)
  • G.M. te Brake et al.

    Segmentation of suspicious densities in digital mammograms

    Med. Phys.

    (2001)
  • G.M. te Brake et al.

    An automatic method to discriminate malignant masses from normal tissue in digital mammograms

    Phys. Med. Biol.

    (2000)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • M. Broeders et al.

    The impact of mammographic screening on breast cancer mortality in Europe: a review of observational studies

    J. Med. Screening

    (2012)
  • S.C. Cheng et al.

    A novel approach to diagnose diabetes based on the fractal characteristics of retinal images

    IEEE Trans. Inf. Technol. Biomed.

    (2003)
  • D.C. Cireşan et al.

    Mitosis detection in breast cancer histology images with deep neural networks

    Medical Image Computing and Computer-Assisted Intervention

    (2013)
  • D.C. Cireşan et al.

    Flexible, high performance convolutional neural networks for image classification

    International Joint Conference on Artificial Intelligence

    (2011)
  • Dauphin, Y. N., de Vries, H., Chung, J., Bengio, Y., 2015. RMSProp and equilibrated adaptive learning rates for...
  • K. Doi

    Current status and future potential of computer-aided diagnosis in medical imaging

    British J. Radiol.

    (2005)
  • B. Efron

    Bootstrap methods: another look at the jackknife

    Annals Stat.

    (1979)
  • M. Elter et al.

    CADx of mammographic masses and clustered microcalcifications: a review

    Med. Phys.

    (2009)
  • J.J. Fenton et al.

    Effectiveness of computer-aided detection in community mammography practice

    J. Natl. Cancer Inst.

    (2011)
  • K. Fukushima

    Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position

    Biol. Cybern.

    (1980)
  • Gens, R., Domingos, P. M., 2014. Deep symmetry networks....
  • M.L. Giger et al.

    Computer-aided diagnosis in medical imaging

    IEEE Trans. Med. Imag.

    (2001)
  • B. van Ginneken et al.

    Computer-aided diagnosis: how to move from the laboratory to the clinic

    Radiology

    (2011)
  • R. Girshick et al.

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Computer Vision and Pattern Recognition

    (2014)
  • R.M. Haralick et al.

    Textural features for image classification

    IEEE Trans. Syst. Man Cybern.

    (1973)
  • K. He et al.

    Delving deep into rectifiers: surpassing human-level performance on ImageNet classification

    Comput. Vis. Pattern Recognit.

    (2015)