To build a machine learning model that helps with mammography-based diagnosis, we need data to train it. Thankfully, a number of datasets are publicly available.
University of South Florida Digital Database for Screening Mammography
This dataset appears to be the most complete and detailed I have seen; it contains high-resolution imagery and metadata. The scans appear to be from the mid-to-late 1990s.
Each study includes two images of each breast, along with some associated patient information (age at time of study, ACR breast density rating, subtlety rating for abnormalities, ACR keyword description of abnormalities) and image information (scanner, spatial resolution, ...). Images containing suspicious areas have associated pixel-level "ground truth" information about the locations and types of suspicious regions.
There are over 2500 mammogram cases, split across 43 volumes, including non-cancerous, cancerous, and benign results. The data is available on an anonymous FTP server and is approximately 230GB in total size. Here's an example cancerous mammogram case.
UPDATE: The Cancer Imaging Archive has an edited version of this dataset that purports to be more usable:
The images have been decompressed and converted to DICOM format. Updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data are also included.
The mini-MIAS database of mammograms
This dataset contains imagery which has been reduced in resolution for easier handling. The imagery was released in 1994 and is in the ancient PGM file format.
the original MIAS Database (digitised at 50 micron pixel edge) has been reduced to 200 micron pixel edge and clipped/padded so that every image is 1024 × 1024 pixels
This dataset may provide an easy path for writing some initial code, but its reduced resolution and small sample size make it a poor basis for a usable machine learning model.
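As a starting point for that initial code, here is a minimal sketch of reading a PGM image with only the Python standard library. It assumes the files are in the binary ("P5") variant of PGM; the function name `parse_pgm` is my own and not part of any library, so treat this as an illustration rather than a definitive reader.

```python
import re


def parse_pgm(data: bytes):
    """Parse a binary (P5) PGM image held in memory.

    Returns (width, height, maxval, rows), where rows is a list of
    lists of integer gray values, one inner list per image row.
    """
    # Header: magic number, width, height and maxval, separated by
    # whitespace or '#' comment lines, then one whitespace byte.
    sep = rb"(?:\s+|#[^\n]*\n)+"
    header = re.match(
        rb"P5" + sep + rb"(\d+)" + sep + rb"(\d+)" + sep + rb"(\d+)\s", data
    )
    if header is None:
        raise ValueError("not a binary (P5) PGM file")

    width, height, maxval = (int(g) for g in header.groups())
    # Samples use one byte when maxval < 256, otherwise two bytes.
    bytes_per_pixel = 1 if maxval < 256 else 2
    row_size = width * bytes_per_pixel

    raster = data[header.end():]
    if len(raster) < row_size * height:
        raise ValueError("truncated pixel data")

    rows = []
    for y in range(height):
        row_bytes = raster[y * row_size:(y + 1) * row_size]
        if bytes_per_pixel == 1:
            rows.append(list(row_bytes))
        else:
            # Two-byte samples are stored most significant byte first.
            rows.append(
                [int.from_bytes(row_bytes[i:i + 2], "big")
                 for i in range(0, row_size, 2)]
            )
    return width, height, maxval, rows
```

To use it, read a file in binary mode and pass the bytes in, for example `parse_pgm(open("image.pgm", "rb").read())`, where `image.pgm` is a placeholder path. For 8-bit images and NumPy users, `numpy.frombuffer` over the raster bytes followed by a reshape to `(height, width)` should give the same values as a 2-D array.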
Breast Cancer Surveillance Consortium Digital Mammography Example Dataset
This dataset includes the mammography assessment and subsequent breast cancer diagnosis within one year as well as participant characteristics that have been previously shown to be associated with mammography performance including age, family history of breast cancer, breast density, use of hormone replacement therapy, BMI, history of biopsy, receipt of prior mammography, and presence of comparison films.
This dataset does not have the imagery itself. The sample dataset is available by requesting a login to their download page. To gain access to the full dataset, you must submit a valid research proposal.
Are you aware of any additional publicly available mammography datasets that could be helpful in building a machine learning model? Do you have a dataset that could be made publicly available? If so, please reach out and I will add it to this list: hello [at] jacob [dot] vi
- As highlighted in Machine Learning in Cancer Care
- The Cancer Imaging Archive has a large number of datasets containing cancer imagery beyond breast cancer
- Stack Overflow post with Python code to read a PGM file into a NumPy array