Breast cancer risk estimation with artificial neural networks revisited
Discrimination and calibration
Turgay Ayer MS,
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
- Department of Radiology, University of Wisconsin, Madison, Wisconsin
Oguzhan Alagoz PhD,
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
Jagpreet Chhatwal PhD,
- Health Economic Statistics, Merck Research Laboratories, North Wales, Pennsylvania
Jude W. Shavlik PhD,
- Department of Computer Science, University of Wisconsin, Madison, Wisconsin
Charles E. Kahn Jr MD, MS,
- Department of Radiology, Medical College of Wisconsin, Milwaukee, Wisconsin
Elizabeth S. Burnside MD, MPH, MS (corresponding author)
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
- Department of Radiology, University of Wisconsin, Madison, Wisconsin
- Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Department of Radiology, University of Wisconsin Medical School, E3/311, 600 Highland Avenue, Madison, WI 53792-3252
Discriminating malignant breast lesions from benign ones and accurately predicting the risk of breast cancer for individual patients are crucial to successful clinical decisions. In the past, several artificial neural network (ANN) models have been developed for breast cancer-risk prediction. All studies have reported discrimination performance, but not one has assessed calibration, which is an equally important measure for accurate risk prediction. In this study, the authors evaluated whether an ANN trained on a large prospectively collected dataset of consecutive mammography findings can discriminate between benign and malignant disease and accurately predict the probability of breast cancer for individual patients.
The dataset consisted of 62,219 consecutively collected mammography findings matched with the Wisconsin State Cancer Reporting System. The authors built a 3-layer feedforward ANN with 1000 hidden-layer nodes. The authors trained and tested their ANN by using 10-fold cross-validation to predict the risk of breast cancer. The authors used area under the receiver operating characteristic curve (AUC), sensitivity, and specificity to evaluate discriminative performance of the radiologists and their ANN. The authors assessed the accuracy of risk prediction (ie, calibration) of their ANN by using the Hosmer-Lemeshow (H-L) goodness-of-fit test.
Their ANN demonstrated superior discrimination (AUC, 0.965) compared with the radiologists (AUC, 0.939; P < .001). The authors' ANN was also well calibrated as shown by an H-L goodness of fit P-value of .13.
The authors' ANN can effectively discriminate malignant abnormalities from benign ones and accurately predict the risk of breast cancer for individual abnormalities. Cancer 2010. © 2010 American Cancer Society.
Successful breast cancer diagnosis requires systematic image analysis, characterization, and integration of many clinical and mammographic variables.1 An ideal diagnostic system would discriminate between benign and malignant findings perfectly. Unfortunately, perfect discrimination has not been achieved, so radiologists must make decisions based on their best judgment of breast cancer risk amid substantial uncertainty. When there are numerous interacting predictive variables, ad hoc decision strategies based on experience and memory may lead to errors2 and variability in practice.3, 4 That is why there is intense interest in developing tools that can calculate an accurate probability of breast cancer to aid in making decisions.5-7
Discrimination and calibration are the 2 main components of accuracy in a risk-assessment model.8, 9 Discrimination is the ability to distinguish benign abnormalities from malignant ones. Although assessing discrimination with area under the receiver-operating characteristic (ROC) curve (AUC) is a popular method in the medical community, it may not be optimal in assessing risk prediction models that stratify individuals into risk categories.10 In this setting, calibration is also an important tool for accurate risk assessment of individual patients. Calibration measures how well the probabilities generated by the risk prediction model agree with the observed probabilities in the actual population of interest.11 There is a trade off between discrimination and calibration, and a model typically cannot be perfect in both.10 In general, risk-prediction models need good discrimination, when their aim is to separate malignant findings from benign ones, and good calibration, when their aim is to stratify individuals into higher or lower risk categories, to aid in decisions and communication.11
Computer models have the potential to help radiologists increase the accuracy of mammography examinations in both detection12-15 and diagnosis.16-20 Existing computer models in the domain of breast-cancer diagnosis can be classified under 3 broad categories: prognostic, computer-aided detection (CAD), and computer-aided diagnostic (CADx) models. Prognostic models, such as the Gail model,21-24 use retrospective risk factors such as a woman's age, her personal and family histories of breast cancer, and clinical information to predict breast cancer risk during a time interval in the future for treatment or risk-reduction decisions.24 These models provide guidance for clinical trial eligibility, tailored disease surveillance, and chemoprevention strategies.25 Because risk stratification is of primary interest in prognostic models, the performance of these models is assessed principally by calibration measures.11 Detection or CAD models12-15, 26-28 are developed to assist radiologists in identifying possible abnormalities in radiologic images, leaving the interpretation of the abnormality to the radiologist.29 Because discrimination is most important, and calibration is less critical in detection, the performance of CAD models is typically evaluated in terms of ROC curves.11 Diagnostic or CADx models30-39 characterize findings from mammograms (eg, size, contrast, shape) identified either by a radiologist or a CAD model29 to help radiologists classify lesions as benign or malignant by providing objective information, such as the risk of breast cancer.40 CADx models are similar to prognostic models in 1 way; they estimate the risk of breast malignancy to help physicians and patients improve decisions.29 On the other hand, CADx models differ from prognostic models in the sense that their risk estimation is based on mammography findings and at a single time point (ie, at the time of mammography) to aid in further imaging or intervention decisions.
Both discrimination and calibration are important features of a CADx model. High discrimination is needed because helping radiologists to distinguish malignant findings from benign ones is the primary purpose of CADx models.11 In addition, good calibration is needed to stratify risk and communicate the risk with patients as in the example of prognostic models.11
However, existing CADx studies that use ANNs to assess the risk of breast cancer have ignored calibration and focused only on discrimination ability.31, 36, 38, 39 Most of these studies have good discrimination but may be very poorly calibrated.41 For example, 4 such models report that no cancers would be missed if the threshold to defer biopsy was set to 10%-20%.31, 35, 37, 42 By suggesting a threshold in this range to defer biopsy, these models not only substantially exceed the accepted biopsy threshold in clinical practice of 2%,43 but they also indicate a systematic overestimation of malignancy risk. This discrepancy is likely attributable to suboptimal calibration.
In addition, existing studies have several potential limitations that make them impractical for clinical implementation. First, the training datasets used for building ANNs in these previous studies have been relatively small (104-1288 lesions),31, 35, 36, 38, 39 possibly too small to yield reliable models. Second, the majority of these studies developed models by using only findings that underwent biopsy,30, 31, 35-37, 39 or were referred to a surgeon,38 and excluded other findings in their analysis, which may lead to biased models.
Our research team has developed 2 CADx models that use the same dataset to discriminate malignant mammography findings from benign ones.33, 34 This study differs from our previous research in 2 ways. First, this study uses a different modeling technique (an artificial neural network [ANN]) than our previous research, which used logistic regression and a Bayesian network. Second, this study considers calibration, whereas our previous research, like many other CADx models, did not evaluate calibration but only evaluated discrimination.
The purpose of our study is to evaluate whether an ANN trained on a large prospectively collected dataset of consecutive mammography findings can discriminate between benign and malignant disease and accurately predict the probability of breast cancer for individual patients.
MATERIALS AND METHODS
The institutional review board exempted this Health Insurance Portability and Accountability Act (HIPAA)-compliant, retrospective study from requiring informed consent. The data used in this study have been presented in our previous studies33, 34 and are described again here for the convenience of the reader.
All of the screening and diagnostic mammograms performed at the Froedtert and Medical College of Wisconsin Breast Care Center between April 5, 1999 and February 9, 2004 were included in our dataset for retrospective evaluation. We consolidated our database in the National Mammography Database (NMD) format, a data format based on the standardized Breast Imaging Reporting and Data System (BI-RADS) lexicon developed by the American College of Radiology (ACR) for standardized monitoring and tracking of patients.44, 45 The study comprised 48,744 mammograms belonging to 18,269 patients (Table 1).
|Characteristic||Malignant, No. (%)||Benign, No. (%)||Total, No. (%)|
|No. of mammograms||477 (1)||48,267 (99)||48,744 (100)|
|Age groups, y|
|<45||66 (13.84)||9529 (19.74)||9595 (19.68)|
|45-49||49 (10.27)||7524 (15.59)||7573 (15.54)|
|50-54||56 (11.74)||7335 (15.2)||7391 (15.16)|
|55-59||71 (14.88)||6016 (12.46)||6087 (12.49)|
|60-64||59 (12.37)||4779 (9.9)||4838 (9.93)|
|≥65||176 (36.9)||13,084 (27.11)||13,260 (27.20)|
|Breast density|
|Predominantly fatty||61 (12.79)||7226 (14.97)||7287 (14.95)|
|Scattered fibroglandular||201 (42.14)||19,624 (40.66)||19,825 (40.67)|
|Heterogeneously dense||174 (36.48)||17,032 (35.29)||17,206 (35.30)|
|Extremely dense tissue||41 (8.6)||4385 (9.08)||4426 (9.08)|
|BI-RADS category|
|1||0 (0)||21,094 (43.7)||21,094 (43.28)|
|2||13 (2.73)||10,048 (20.82)||10,061 (20.64)|
|3||32 (6.71)||8520 (17.65)||8552 (17.54)|
|0||130 (27.25)||8148 (16.88)||8278 (16.98)|
|4||137 (28.72)||364 (0.75)||501 (1.03)|
|5||165 (34.59)||93 (0.19)||258 (0.53)|
Each mammogram was prospectively interpreted by 1 of 8 radiologists. Four of these radiologists were general radiologists, 2 were fellowship trained in breast imaging, and the other 2 had extensive experience in breast imaging. These radiologists had between 1 and 35 years of experience interpreting mammography. Each radiologist reviewed 6994 mammograms on average (median, 2924; range, 49-22,219) in our dataset.
Each mammographic finding, if any, was recorded as a unique entry in our database. In case of a negative mammogram, a single entry showing only demographic data (age, personal history, prior surgery, and hormone replacement therapy) and BI-RADS assessment category was entered. If an image had more than 1 reported finding with only 1 of them being cancer, we considered the other findings as false positives. Throughout the current article, the term “finding” will be used to denote the single record for normal mammograms or each record denoting an abnormality on a mammogram. Both radiologists (for mammography findings) and technologists (for demographic data) used the PenRad (Minnetonka, Minn) mammography reporting/tracking data system, which records clinical data in a structured format (ie, point-and-click entry of information populates the clinical report and the database simultaneously). We included in our ANN model all of the demographic risk factors and BI-RADS descriptors that were routinely collected in the practice and predictive of breast cancer (Table 2). We obtained the reading radiologist's information by merging the PenRad data with the radiology information system at the Medical College of Wisconsin. We could not assign 504 findings to a radiologist during our matching protocol. We elected to keep these unassigned findings in our dataset to maintain its consecutive nature.
|Variable||Categories|
|Age groups, y||<45, 45-50, 51-54, 55-60, 61-64, ≥65|
|Hormone therapy||None, <5 y, >5 y|
|Personal history of BCA||No, yes|
|Family history of BCA||None, minor (nonfirst-degree family members), major (1 or more first-degree family members)|
|Breast density||Predominantly fatty, scattered fibroglandular, heterogeneously dense, extremely dense|
|Mass shape||Oval, round, lobular, irregular, not present|
|Mass stability||Decreasing, stable, increasing, not present|
|Mass margins||Circumscribed, ill-defined, microlobulated, spiculated, not present|
|Mass density||Fat, low, equal, high, not present|
|Mass size||None, small (<3 cm), large (≥3 cm)|
|Lymph node||Present, not present|
|Asymmetric density||Present, not present|
|Skin thickening||Present, not present|
|Tubular density||Present, not present|
|Skin retraction||Present, not present|
|Nipple retraction||Present, not present|
|Skin thickening||Present, not present|
|Trabecular thickening||Present, not present|
|Skin lesion||Present, not present|
|Axillary adenopathy||Present, not present|
|Architectural distortion||Present, not present|
|Prior history of surgery||No, yes|
|Postoperative change||No, yes|
|Popcorn||Present, not present|
|Milk||Present, not present|
|Rodlike||Present, not present|
|Eggshell||Present, not present|
|Dystrophic||Present, not present|
|Lucent||Present, not present|
|Dermal||Present, not present|
|Round||Scattered, regional, clustered, segmental, linear ductal|
|Punctate||Scattered, regional, clustered, segmental, linear ductal|
|Amorphous||Scattered, regional, clustered, segmental, linear ductal|
|Pleomorphic||Scattered, regional, clustered, segmental, linear ductal|
|Fine Linear||Scattered, regional, clustered, segmental, linear ductal|
|BI-RADS category||0, 1, 2, 3, 4, 5|
We analyzed discrimination and calibration accuracy at the finding level because this is the level at which recall and biopsy decisions are made in clinical practice. We believe this is the level at which computer-assisted models will help radiologists improve performance. However, because conventional analysis of mammographic data is at the mammogram level (where findings from a single study are combined), we also calculated the cancer detection rate, the early stage cancer detection rate, and the abnormal interpretation rate at the mammogram level for comparison. We specify whether analyses in this study are based on mammograms or findings.
Data obtained from the Wisconsin Cancer Reporting System (WCRS), a statewide cancer registry, was used as our reference standard. The WCRS has been collecting information from hospitals, clinics, and physicians since 1978. The WCRS records demographic information, tumor characteristics (eg, date of diagnosis, primary site, stage of disease), and treatment information for all newly diagnosed breast cancers in the state. Under data exchange agreements, out-of-state cancer registries also provide reports on Wisconsin residents diagnosed in their states. Findings that had matching registry reports of ductal carcinoma in situ or any invasive carcinoma within 12 months of a mammogram date were considered positive. Findings shown to be benign by biopsy or without a registry match within the same time period were considered negative.
We built a 3-layer, feed-forward neural network by using Matlab 7.4 (The Mathworks, Natick, Mass) with a backpropagation learning algorithm46 to estimate the likelihood of malignancy. The layers included an input layer of 36 discrete variables (mammographic descriptors, demographic factors, and BI-RADS final assessment categories as entered by the radiologists; Table 2), a hidden layer with 1000 hidden nodes, and an output layer with a single node generating the probability of malignancy for each finding. We designed our ANN to have a large number of hidden nodes because ANNs with a large number of hidden nodes generalize better than networks with a small number of hidden nodes when trained with backpropagation and “early stopping”47-49 (see Discussion).
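The layer sizes described above can be illustrated with a minimal forward-pass sketch. It is not the Matlab implementation used in the study: the logistic activation in both layers, the bias terms, the weight initialization, and the use of one input unit per variable (no one-hot encoding) are all assumptions, and the names `init_network`, `predict`, and `n_parameters` are hypothetical.

```python
import math
import random

N_INPUTS, N_HIDDEN = 36, 1000  # layer sizes stated in the text


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_network(n_in=N_INPUTS, n_hidden=N_HIDDEN, seed=0):
    """Untrained network with small random weights (assumed initialization)."""
    rng = random.Random(seed)
    scale = 0.1
    return {
        "W1": [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-scale, scale) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def predict(net, x):
    """Input layer -> hidden layer -> single output node in (0, 1)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    return sigmoid(sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"])


def n_parameters(net):
    """Count weights and biases under the assumptions above."""
    return (sum(len(row) for row in net["W1"]) + len(net["b1"])
            + len(net["W2"]) + 1)


net = init_network()
risk = predict(net, [0.0] * N_INPUTS)  # output of an untrained network
```

Under these assumptions the network has 36 × 1000 + 1000 + 1000 + 1 = 38,001 adjustable parameters, which gives a sense of why a safeguard against overfitting is needed.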
To train and test our ANN, we used a standard machine-learning method called 10-fold cross-validation, which ensures that a test sample is never used for training. In our 10-fold cross-validation, the data were divided into 10 subsets that were approximately equal in size. In the first iteration, 9 of these subsets were combined and used for training. The remaining 10th set was used for testing the performance of our ANN on unseen cases. We repeated this process for 10 iterations until all subsets were used once for testing. In addition to 10-fold cross-validation, to assess the robustness of our ANN, we performed the following supplementary analyses: 1) we trained our ANN on the first half of the dataset and tested on the second half, 2) we trained our ANN on the second half of the dataset and tested on the first half.
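The cross-validation procedure can be sketched as follows. This is an illustration only: the random shuffling, the seed, and the `fit`/`predict` callables are assumptions, and the text does not state whether folds were stratified or split at the patient level.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Split range(n_items) into k disjoint folds of nearly equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    return [indices[i::k] for i in range(k)]


def cross_validate(n_items, fit, predict, k=10):
    """Return one out-of-fold prediction per item.

    `fit(train_indices)` returns a model; `predict(model, i)` scores item i.
    Both are hypothetical placeholders for the actual training code.
    """
    predictions = [None] * n_items
    for test_idx in k_fold_indices(n_items, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in held_out]
        model = fit(train_idx)          # the model never sees the test fold
        for i in test_idx:
            predictions[i] = predict(model, i)
    return predictions


# Fold sizes for a dataset of 62,219 findings
fold_sizes = sorted(len(f) for f in k_fold_indices(62219, k=10))
```

With 62,219 findings, this split yields nine folds of 6222 findings and one of 6221, and every finding receives exactly one prediction from a model that was not trained on it.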
We used an “early stopping” (ES) procedure to prevent our ANN from overfitting and to keep it generalizable to future cases.50, 51 Generalizability is the ability of a model to demonstrate similar predictive performance on data not used for training but consisting of unseen cases from the same population. A model lacks generalizability when overfitting occurs, a phenomenon whereby the model “memorizes” the cases in the training data but fails to generalize to new data. When overfitting occurs, ANNs obtain spuriously good performance by learning anomalous patterns unique to the training set but generate high error resulting in low accuracy when presented with unseen data.52 We performed ES by using a validation (tuning) set, in addition to a training and a testing set, to calculate the network error during training and to stop training early if necessary to prevent overfitting.50-52
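One common way to implement early stopping is to track the validation error after each training epoch and halt once it has failed to improve for a fixed number of epochs. The sketch below assumes such a "patience" rule; the stopping criterion actually used in the study is not stated, and the function name and the example error values are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=5):
    """Scan per-epoch validation errors and return (best_epoch, best_error).

    Scanning stops once `patience` epochs have passed since the best epoch,
    so later epochs are never examined (as if training had been halted).
    """
    best_epoch, best_error = -1, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error


# Illustrative validation errors: improvement stalls after epoch 5
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.38, 0.39, 0.40, 0.20]
best = early_stopping_epoch(errors, patience=3)
```

In this example training halts at epoch 8 and the weights from epoch 5 would be kept. The low error at epoch 9 is never reached, which shows the trade-off the patience setting controls.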
We evaluated the discriminative ability of our ANN against radiologists at an aggregate level and at an individual-radiologist level. We plotted the receiver operating characteristic (ROC) curve for our ANN by using the probabilities generated for all findings by means of our 10-fold cross-validation technique. We constructed the ROC curves for all radiologists individually and in aggregate by using BI-RADS assessment categories assigned by the radiologists to each finding. We ordered BI-RADS assessment categories by the increasing likelihood of malignancy (1<2<3<0<4<5) for this purpose. We measured area under the curve (AUC), sensitivity, and specificity to assess the discriminative ability of our ANN and the radiologists (in aggregate and individually). We used a 2-tailed DeLong method53 to measure and compare AUCs because it accounts for correlation between the ROC curves obtained from the same data.
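To show how an AUC can be obtained from ordinal BI-RADS categories, the sketch below maps each category to its rank under the ordering given above and computes the AUC as the probability that a malignant finding receives a higher score than a benign one, with ties counted as one half (the Mann-Whitney formulation). The toy data are invented for illustration, the pairwise loop is practical only for small samples, and the DeLong comparison is not implemented here.

```python
# Rank of each BI-RADS category under the ordering 1 < 2 < 3 < 0 < 4 < 5
BIRADS_RANK = {1: 0, 2: 1, 3: 2, 0: 3, 4: 4, 5: 5}


def auc(scores, labels):
    """Mann-Whitney AUC: P(score of malignant > score of benign), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: BI-RADS category and outcome (1 = malignant)
birads = [1, 2, 0, 3, 4, 5, 0, 1]
labels = [0, 0, 0, 0, 1, 1, 1, 0]

auc_ordered = auc([BIRADS_RANK[b] for b in birads], labels)  # uses the ordering
auc_raw = auc(birads, labels)                                # treats 0 as lowest
```

On these toy data the ordered scores give a clearly higher AUC than the raw category numbers, because category 0 (needs additional imaging) is ranked above category 3 rather than below category 1.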
We calculated sensitivity and specificity of our ANN and the radiologists at recommended levels of performance: sensitivity at a specificity of 90% and specificity at a sensitivity of 85%, as they represent the minimal performance thresholds for screening mammography.54 When calculating the sensitivity and specificity of the radiologists, we considered BI-RADS 0, 4, and 5 positive, whereas BI-RADS 1, 2, and 3 were designated negative.45 We used a 1-tailed McNemar test to compare sensitivity and specificity between the radiologists and our ANN.55 A McNemar test accounts for correlation between the sensitivity and specificity ratios and is not defined when the ratios are equal, nor when 1 of the ratios is 0 or 1. We used the Wilson method to generate confidence intervals for sensitivity and specificity.56 We considered P < .05 to be the level of statistical significance.
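The Wilson score interval for a proportion can be sketched as below. The version shown has no continuity correction, which is an assumption; the text does not say which variant was used, so results may differ from the published intervals in the last digit. The example uses the radiologists' aggregate counts from Table 5 (419 of 510 cancers detected).

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval (no continuity correction) for successes/n."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Radiologists' aggregate sensitivity: 419 true positives among 510 cancers
low, high = wilson_interval(419, 510)
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 and 1 and behaves reasonably for proportions near the boundaries, which matters for the small per-radiologist cancer counts.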
We assessed the calibration of our ANN by calculating the Hosmer-Lemeshow (H-L) goodness-of-fit statistic57 and plotting a calibration curve. The H-L statistic compares the observed and predicted risk within risk categories. A lower H-L statistic and a higher P value (P > .05) indicate better calibration. For the H-L statistic, the predicted risks of findings were rank-ordered and divided into 10 groups, based on their predicted probability. Within each predicted risk group, the number of predicted malignancies was accumulated against the number of observed malignancies. The H-L statistic was calculated from this 2 × 10 contingency table. The H-L statistic was then compared with the chi-square distribution, with degrees of freedom equal to 8. We also plotted a calibration curve to visually compare calibration of our ANN to the perfect calibration in predicting breast malignancy risk. In a calibration curve, a line at a 45° angle (line of identity) indicates perfect calibration. Data points to the right of the perfect calibration line represent overestimation of the risk, and those to the left of the line represent underestimation.58 Although a calibration curve does not provide a quantitative measure of reliability for probability predictions, it provides a graphical representation of the degree to which predicted probability of malignancy by our ANN corresponds to actual prevalence.58, 59 The calibration curve shows the ability of the model to enable prediction of probabilities across all ranges of risk.
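The H-L computation described above can be sketched in a few lines. The equal-sized grouping of rank-ordered predictions follows the text; the tie handling and the function names are assumptions, and the closed-form chi-square tail probability used here is valid only for an even number of degrees of freedom (such as 8).

```python
import math


def hosmer_lemeshow(probs, outcomes, groups=10):
    """H-L statistic over `groups` equal-sized bins of rank-ordered risk.

    Requires at least `groups` observations and mean predicted risks
    strictly between 0 and 1 in every bin.
    """
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # predicted malignancies
        observed = sum(y for _, y in chunk)   # observed malignancies
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return stat


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# P value for an H-L statistic of 12.46 with 8 degrees of freedom
p_value = chi2_sf_even_df(12.46, 8)
```

For an H-L statistic of 12.46 with 8 degrees of freedom, this yields a P value of approximately .13.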
RESULTS
After matching to the cancer registry, our final matched dataset contained a total of 62,219 findings (510 [0.8%] malignant and 61,709 [99.2%] benign) in 18,269 patients (17,924 women and 345 men). The mean age of the female patients was 56.5 years (range, 17.7-99.1; SD, 12.7). Women were, on average, 2 years younger than men, whose mean age was 58.5 years (range, 18.6-88.5; SD, 15.7).
Our analysis at the mammogram level showed that 14% of the mammographic abnormalities occurred in predominantly fatty tissue, 41% in scattered fibroglandular tissue, 36% in heterogeneously dense tissue, and 9% in extremely dense tissue (Table 1). At the findings level, the cancers included 246 masses, 121 microcalcifications, 27 asymmetries, 18 architectural distortions, 86 combinations of findings, and 12 other.
Cancer registry match revealed a detection rate of 8.9 cancers per 1000 mammograms for the radiologists at the mammogram level (432 cancers for 48,744 mammograms—33 patients had more than 1 cancer resulting in 510 total cancers). The abnormal interpretation rate (considering BI-RADS 0, 4, and 5 abnormal) was 18.5% (9037 of 48,744 mammograms). Of all the 432 cancers, 390 had staging information from the cancer registry, and 42 did not. Of the detected cancers with staging information, only 26.7% (104 of 390) had lymph node metastasis, and 71% (277 of 390) were early stage (ie, stage 0 or 1).
Following training and testing using 10-fold cross-validation, the AUC of our ANN, 0.965, was significantly higher than that of the radiologists in aggregate, 0.939 (P < .001), at the finding level, which implied that our ANN performed better than the radiologists alone in discriminating between benign and malignant findings. The ROC curve of our ANN (aggregate level) dominated the combined ROC curve of all radiologists at all cutoff thresholds (Fig. 1). This trend was preserved when the ANN was trained on the first half of the dataset and tested on the second half (ANN AUC, 0.949; radiologists AUC, 0.926; P < .001) or when trained on the second half of the dataset and tested on the first half (ANN AUC, 0.966; radiologists AUC, 0.951; P < .001). At the individual radiologists level, 4 of 8 comparisons were not statistically significant (Table 3). Of the 4 significant differences, our ANN outperformed the radiologists in all except a single, low-volume reader (Radiologist 8, Table 3).
At a specificity of 90%, the sensitivity of our ANN was significantly better (90.7% vs 82.2%; P < .001) than that of the radiologists (in aggregate; Table 4). Our ANN identified 44 more cancers when compared with the radiologists at this level of specificity (Table 5, part A). At a fixed sensitivity of 85%, the specificity of our ANN was also significantly better (94.5% vs 88.2%; P < .001) than that of the radiologists (in aggregate; Table 4). Our ANN decreased the number of false positives by 3941 when compared with the radiologists' performance at this level of sensitivity (Table 5, part B). In terms of specificity, all statistically significant comparisons revealed the ANN to be superior with the exception of 1 low-volume reader (Radiologist 8 in Table 4). In terms of sensitivity, all statistically significant comparisons revealed the ANN to be superior; however, 1 low-volume reading radiologist demonstrated the opposite trend (Radiologist 1 in Table 4).
|Radiologist||No. of benign findings||No. of malignant findings||Radiologist sensitivity, % (95% CI)||ANN sensitivity, % (95% CI)||P||Radiologist specificity, % (95% CI)||ANN specificity, % (95% CI)||P|
|1||3312||77||93.5 (84.8, 97.6)||88.4 (78.4, 94.1)||.0625||94.4 (93.6, 95.2)||96.9 (96.4, 97.5)||<.001|
|3||18953||180||78.3 (71.4, 83.9)||90.0 (84.5, 93.8)||<.001||85.0 (84.4, 85.5)||95.0 (94.7, 95.3)||<.001|
|4||26690||171||82.4 (75.7, 87.6)||93.0 (87.8, 96.1)||<.001||85.6 (85.1, 86.0)||96.4 (96.1, 96.5)||<.001|
|6||6796||36||83.3 (66.5, 93.0)||86.1 (69.7, 94.7)||.999||88.4 (87.6, 89.1)||94.5 (93.9, 95.0)||<.001|
|7||3637||29||75.8 (56.0, 88.9)||72.5 (52.5, 86.5)||.999||79.9 (78.6, 81.2)||86.2 (85.0, 87.2)||<.001|
|8||1695||9||77.7 (40.1, 96.0)||66.7 (30.9, 90.9)||.999||86.7 (85.0, 88.3)||80.7 (78.7, 82.5)||<.001|
|Unassigned||497||7||100.0 (56.1, 100.0)||100.0 (56.1, 100.0)||ND||98.3 (96.7, 99.2)||99.6 (98.4, 99.9)||.015|
|Total||61709||510||82.2 (78.5, 85.3)||90.7 (87.8, 93.0)||<.001||88.2 (87.9, 88.5)||94.5 (94.3, 94.6)||<.001|
|A.||Performance at 90% Specificity|
|True Positive||False Negative|
|Radiologists||419 (400-435)||91 (75-110)|
|ANN||463 (449-475)||47 (36-62)|
|B.||Performance at 85% Sensitivity|
|False Positive||True Negative|
|Radiologists||7282 (7126-7441)||54,427 (54,268-54,583)|
|ANN||3341 (3232-3454)||58,368 (58,256-58,477)|
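As a consistency check, the sensitivity and specificity point estimates can be recomputed from the counts in Table 5. The sketch reads the part A counts as true positives and false negatives (they sum to the 510 malignant findings) and the part B counts as false positives and true negatives (they sum to the 61,709 benign findings). The helper names are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of malignant findings called positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of benign findings called negative."""
    return tn / (tn + fp)


# Part A (at 90% specificity): true positives, false negatives
rad_sens = sensitivity(tp=419, fn=91)
ann_sens = sensitivity(tp=463, fn=47)

# Part B (at 85% sensitivity): true negatives, false positives
rad_spec = specificity(tn=54427, fp=7282)
ann_spec = specificity(tn=58368, fp=3341)

extra_cancers = 463 - 419          # additional cancers identified by the ANN
fewer_false_positives = 7282 - 3341  # false positives avoided by the ANN
```

The recomputed values agree with the percentages reported in the text to within about 0.1 percentage point, and the two differences reproduce the 44 additional cancers and the 3941 fewer false positives.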
The H-L statistic for our ANN was 12.46 (P = .13, df = 8). The agreement between predicted and observed probabilities is shown graphically in Figure 2. Although the calibration curve of our ANN does not perfectly match the line of identity (the line at a 45° angle), the deviation is visually small.
DISCUSSION
We have demonstrated that our ANN can accurately estimate the risk of breast cancer by using a dataset that contains demographic data and prospectively collected mammographic findings. To our knowledge, this study uses 1 of the largest datasets of mammography findings to develop a CADx model. Our results demonstrate that ANNs may have the potential to aid radiologists in discriminating between benign and malignant breast diseases. When we compare discriminative accuracy by using AUC, sensitivity, and specificity, our ANN performs significantly better than all radiologists in aggregate. Although the difference between the AUCs of the radiologists and our ANN may appear to be small (0.026), this difference is both statistically (P < .001) and clinically significant because our ANN identified 44 more cancers and decreased the number of false positives by 3941 when compared with the radiologists at the specified sensitivity and specificity values. Note that these results would be similar for any other specified sensitivity and specificity values because the ROC curve of our ANN outperforms that of the radiologists at all threshold levels. On the other hand, the reason for obtaining a numerically small difference between the AUCs relates to the disproportionate number of benign findings (61,709) compared to malignant findings (510) in our dataset resulting in very high specificity at baseline and little room for improvement in this parameter.
Among statistically significant comparisons, our ANN demonstrates superior AUC, sensitivity, and specificity versus all but 1 radiologist, including the 2 highest-volume readers. Therefore, similar to other ANN models presented in the literature, our ANN has the potential to aid radiologists in classifying (discriminating) findings on mammograms by predicting the risk of malignancy. When compared with the previous CADx models developed by our research team (a logistic regression and a Bayesian network), the discrimination performance of our ANN was slightly higher (ANN AUC, 0.965; logistic regression AUC, 0.963; Bayesian network AUC, 0.960). On the other hand, no statistically significant difference was found between the ANN and the logistic regression (P = .57), or the ANN and the Bayesian network (P = .13).
However, our model is unique in several ways. In contrast to prior ANN models, which used a relatively small selected population of suspicious findings undergoing tissue sampling with biopsy as the reference standard,30, 31, 35-37, 39 we use a large consecutive dataset of mammography findings with tumor registry outcomes as the reference standard to train our ANN. Furthermore, contrary to previously developed CADx models in breast cancer-risk prediction, we expand the evaluation of CADx models beyond discrimination by measuring the accuracy of the estimated probabilities themselves by using calibration metrics.
Although discrimination or accurate classification is of primary interest for CADx models,11, 60 calibration is also crucial, especially when clinical decisions are being made for individual patients.11, 61 Individual decisions are made under uncertainty and, therefore, aided more effectively by accurate risk estimates. Because there is a trade off between discrimination and calibration,10 the selection of the primary performance measure should be based on the intended purpose of the model.11 In this study, similar to previous CADx models, we designed our ANN primarily for optimizing the discrimination ability. However, contrary to previous CADx studies, we also measured the calibration as the secondary objective. We showed that our ANN is well calibrated, as demonstrated by the low value of the H-L statistic, the corresponding high P value, and the favorable calibration curve; and, thus, our ANN can accurately estimate the risk of malignancy for individual patients. The ability of our ANN to assign accurate numeric probabilities is an important complement to its ability to discriminate between ultimate outcomes.61
We posit that the good calibration of our ANN is attributable to both the characteristics of our training set and attributes of our model. For example, the consecutive nature of our dataset of mammography findings and the use of a tumor registry match as a reference standard, which reflects a real-world population, may lead to accurate calibration. In addition, the use of a large number of hidden nodes in concert with training with a validation set to prevent overfitting may have enhanced calibration. In future work, we plan to analyze which parameters most profoundly influence calibration.
CADx models for breast cancer risk estimation have ignored calibration and have typically been developed and evaluated on the basis of their discrimination ability.31-39 Although calibration has not been formally assessed in previous CADx models, there is some evidence that these models are not well calibrated.31, 35, 42 Poor calibration may indicate that these models are not optimized for individual cases, ie, the predicted breast cancer risk for a single patient may be incorrect.
From a clinical standpoint, our ANN may be valuable because it provides an accurate post-test probability for malignancy. This post-test probability may be useful to communication among the radiologist, patient, and referring physician, which, in turn, may encourage making shared decisions.5-7 Each individual patient has a unique risk tolerance and comorbidities, and these factors should be considered when making decisions involving mammographic abnormalities. Risk assessments based on individual characteristics may also help promote the concept of personalized care in the diagnosis of breast cancer. Furthermore, our ANN is designed to increase the effectiveness of mammography by aiding radiologists and not by acting as a substitute. Our ANN quantifies the risk of breast cancer by using mammographic features assessed by the radiologist, so the ANN's performance depends largely on the radiologist's accurate observations and overall assessment (BI-RADS category).
Our ANN has the potential to be used as a decision-support tool, although it may face challenges similar to those that have, in the past, prevented the implementation of effective decision-support algorithms in clinical practice. To be used in the clinic, a decision-support tool must be seamlessly integrated into the clinical workflow, which can be challenging. We believe that, in the case of mammography, a decision-support tool would be most useful if directly linked to the structured reporting software that radiologists use in daily practice, which would enable immediate feedback. On the other hand, the good performance of our ANN may not be preserved after integration into clinical practice. Before clinical integration, it is important to consider the ways our ANN could fail, due to both inherent theoretical limitations and errors that may occur during the process of integration.62 In fact, numerous computer-aided diagnostic models that have performed well in evaluation studies have not made an impact on clinical practice.63-68 Furthermore, our ANN would need to perform optimally to gain the trust of clinicians and thereby influence clinical practice. Unfortunately, the parameters of ANNs do not carry any real-life interpretation, and clinicians have trouble trusting decision-support algorithms that represent a “black box” without explanation capabilities. Although there is rule extraction software that converts a trained ANN to a more humanly understandable representation,69-71 integration of these various software programs with the ANN requires extra effort. Therefore, we recognize that substantial challenges remain in the implementation of ANNs for decision support at the point of care, and we emphasize the importance of these issues for future research and implementation.
There are 3 important implementation considerations. First, determining the number of effective hidden nodes in an ANN is crucial and may significantly affect its output performance. Unfortunately, there is no general rule to determine the effective number of hidden nodes that maximizes the network performance when presented with an unseen dataset (generalizability).47 Although conventional wisdom suggests that neural networks with excess hidden nodes generalize poorly,48 several recent studies in the machine-learning literature have shown that ANNs with excess capacity (ie, with a large number of hidden nodes) generalize better than small networks (ie, networks with a small number of hidden nodes) when trained with backpropagation and early stopping.47-49 Therefore, we built an ANN with excess capacity and did not optimize the number of hidden nodes. Also, note that if we had optimized the number of hidden nodes to maximize the AUC, as other researchers have, we would have achieved an even higher AUC than described here.
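The early-stopping rule mentioned above can be sketched as follows. This is a hedged illustration rather than the procedure used in this study: the function name `early_stopping_epoch`, the `patience` parameter, and the rule of stopping after a fixed number of epochs without improvement on the validation set are all assumptions, because the exact stopping criterion is not described here.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, best_loss) under a patience-based stopping rule.

    val_losses: iterable of validation-set losses, one per training epoch.
    Training is treated as stopped once `patience` consecutive epochs pass
    without a new best validation loss (an assumed criterion).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch's weights.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            break
    return best_epoch, best_loss
```

In a real training loop, the network weights saved at `best_epoch` would be restored after stopping. The point of the sketch is that a network with many hidden nodes is prevented from overfitting by the stopping rule rather than by restricting the number of nodes.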
Second, selection of the primary performance measure is also crucial when building an ANN model. In our study, we built our ANN principally to maximize discrimination accuracy because discrimination is of primary interest for accurate diagnosis.11, 60 On the other hand, ANNs could also be trained to maximize calibration when the primary purpose is to stratify individuals into higher or lower risk categories of clinical importance. However, it should be noted that for a direct maximization of calibration, the probabilities estimated by the ANN should be compared with the true underlying probabilities,72