Google's TensorFlow project has posted many tutorials showing how machine learning can be used to classify images. We are going to go through these tutorials and adapt each to detecting cancer in mammography. We will start with the MNIST For ML Beginners tutorial and use The mini-MIAS database of mammograms.
Note that this dataset only contains 322 images, which is not nearly enough data to successfully train a model[1]. However, this exercise will still serve to walk through how to adapt an existing image recognition model to our specific use-case. Later attempts will be made with more complex code and larger data sets.
The full code for this tutorial is on GitHub in the file MNIST-For-Beginners-with-Mini-MIAS.py. I will focus this post on the portions of this code that differ from the MNIST tutorial. If you have not gone through the MNIST tutorial, you should do so first.
Processing the images to use as tensors
The mini-MIAS project contains 322 images in the PGM file format. First, we must convert each image from PGM into a numpy array containing the value of every pixel in the image. The MNIST tutorial uses images that are 28 x 28 pixels and imports each image into a vector of length 784. Our images are 1024 x 1024 pixels, so we will use a numpy array of length 1,048,576.
import re
import numpy

def read_pgm(filename, byteorder='>'):
    """Return the pixel values of a raw PGM image as a flat numpy vector."""
    with open(filename, 'rb') as f:
        buffer = f.read()
    try:
        # Parse the PGM header: magic number, width, height, and max pixel value
        header, width, height, maxval = re.search(
            rb"(^P5\s(?:\s*#.*[\r\n])*"
            rb"(\d+)\s(?:\s*#.*[\r\n])*"
            rb"(\d+)\s(?:\s*#.*[\r\n])*"
            rb"(\d+)\s(?:\s*#.*[\r\n]\s)*)", buffer).groups()
    except AttributeError:
        raise ValueError("Not a raw PGM file: '%s'" % filename)
    # Read the pixel data that follows the header into a 1-D vector
    return numpy.frombuffer(buffer,
                            dtype='u1' if int(maxval) < 256 else byteorder + 'u2',
                            count=int(width) * int(height),
                            offset=len(header)
                            ).reshape(int(height) * int(width))
In the linked Stack Overflow code the numpy array is reshaped to be two-dimensional; we use only a single dimension, a vector instead of a matrix. This is accomplished with the .reshape call above, and it follows the MNIST tutorial's approach:
Flattening the data throws away information about the 2D structure of the image. Isn't that bad? Well, the best computer vision methods do exploit this structure, and we will in later tutorials. But the simple method we will be using here, a softmax regression (defined below), won't.
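To see how this conversion gets used, each flattened image can be stacked into a single tensor of images. Here is a minimal sketch, assuming the mini-MIAS naming convention of mdb001.pgm through mdb322.pgm (the images_tensor name is illustrative, not from the tutorial file):

# One row per image, one column per pixel
images_tensor = numpy.zeros((322, 1024*1024))
for i in range(322):
    images_tensor[i] = read_pgm("mdb%03d.pgm" % (i + 1))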
Defining the Labels
The mini-MIAS database also provides the results of each mammogram: the image reference number, the character of the background tissue, the class of abnormality present, the severity of that abnormality, and the location of the abnormality within the image. We have extracted this data from the website and placed it into its own file, to be parsed by our program. The data for the first few images looks like this:
mdb001 G CIRC B 535 425 197
mdb002 G CIRC B 522 280 69
mdb003 D NORM
mdb004 D NORM
mdb005 F CIRC B 477 133 30
mdb005 F CIRC B 500 168 26
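Each line splits cleanly on whitespace. A minimal parsing sketch follows; the file name and variable names here are assumptions, not taken from the tutorial file:

with open("mias_info.txt") as f:
    for line in f:
        fields = line.split()
        reference, tissue, abnormality = fields[0], fields[1], fields[2]
        # NORM rows have only these three columns; abnormal rows add
        # severity and the x, y, and radius of the abnormality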
Since each image has a corresponding class of abnormality, we can use that column as our label:
Class of abnormality present:
- CALC: Calcification
- CIRC: Well-defined/circumscribed masses
- SPIC: Spiculated masses
- MISC: Other, ill-defined masses
- ARCH: Architectural distortion
- ASYM: Asymmetry
- NORM: Normal
We create a one-hot vector for each image, encoding the abnormality found in the mammogram. A one-hot vector is a vector which is 0 in most dimensions, and 1 in a single dimension. Parsing each line of the data file, we read the abnormality column and encode it as such:
if abnormality == "CALC":
labels_tensor[image_num] = numpy.array([(1,0,0,0,0,0,0)])
elif abnormality == "CIRC":
labels_tensor[image_num] = numpy.array([(0,1,0,0,0,0,0)])
elif abnormality == "SPIC":
labels_tensor[image_num] = numpy.array([(0,0,1,0,0,0,0)])
elif abnormality == "MISC":
labels_tensor[image_num] = numpy.array([(0,0,0,1,0,0,0)])
elif abnormality == "ARCH":
labels_tensor[image_num] = numpy.array([(0,0,0,0,1,0,0)])
elif abnormality == "ASYM":
labels_tensor[image_num] = numpy.array([(0,0,0,0,0,1,0)])
elif abnormality == "NORM":
labels_tensor[image_num] = numpy.array([(0,0,0,0,0,0,1)])
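The same chain can also be collapsed into a lookup. A minimal alternative sketch (the classes tuple below is mine, not from the tutorial file):

# Position of each class in the one-hot vector
classes = ("CALC", "CIRC", "SPIC", "MISC", "ARCH", "ASYM", "NORM")
# numpy.eye(7) is the 7 x 7 identity matrix; row k is the one-hot vector for class k
labels_tensor[image_num] = numpy.eye(7)[classes.index(abnormality)]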
Machine Learning Model
With our images and labels stored in numpy arrays we are now able to create the machine learning model, a softmax regression. At this point we can use the TensorFlow code almost verbatim; all we have to do is alter the size of the tensors to fit our images and labels. Our input variable reflects the difference in image size:
x = tf.placeholder(tf.float32, [None, 1024*1024])
Our weights and biases adjust for both the image size and the size of our label tensor:
W = tf.Variable(tf.zeros([1024*1024, 7]))
b = tf.Variable(tf.zeros([7]))
The placeholder for the correct answers is also adjusted to fit our label size:
y_ = tf.placeholder(tf.float32, [None, 7])
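The remaining model, loss, and training-step definitions need no size changes at all; they follow the MNIST tutorial as-is (TensorFlow 1.x API):

# Softmax regression: predicted probabilities for each of the 7 classes
y = tf.nn.softmax(tf.matmul(x, W) + b)
# Cross-entropy loss and gradient-descent training step, as in the tutorial
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
# Create a session and initialize the variables
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()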
Training our model
Due to the small number of images, training the model is a bit trickier (and much less successful) than in the MNIST tutorial. With 55,000 training images in MNIST, the tutorial has plenty of data to train on. The MNIST tutorial completes 1000 training steps, using 100 random images for each step, as follows:
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
We use the first 272 images in the mini-MIAS dataset as our training data and the last 50 images as our test data. For each training step, we choose 100 images at random. Since mnist.train.next_batch is unavailable to us, we build our own batch-generation code to pick those random images.
# Numpy arrays to hold training batches
batch_xs = numpy.zeros((100, 1024*1024))
batch_ys = numpy.zeros((100, 7))

# 1000 training steps
for step in range(1000):
    # Build a batch of 100 images drawn at random from the training set
    for i in range(100):
        j = randint(0, 271)   # requires: from random import randint
        batch_xs[i] = images_train[j]
        batch_ys[i] = labels_train[j]
    # Run the training step on this batch
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Evaluating the model
To test the model's accuracy, we build a batch containing the last 50 images in the database:
test_xs = numpy.zeros((50, 1024*1024))
test_ys = numpy.zeros((50, 7))
for i in range(50):
    test_xs[i] = images_train[272+i]
    test_ys[i] = labels_train[272+i]
print(sess.run(accuracy, feed_dict={x: test_xs, y_: test_ys}))
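The accuracy op itself is unchanged from the MNIST tutorial:

# A prediction is correct when the most likely class matches the true class
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
# Accuracy is the fraction of correct predictions across the batch
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))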
How do we do? Awful! My last run comes back with only 6% accuracy. With so little data, this model was never going to perform well. However, this example is still an easy way to get comfortable with some of the steps in building real-world models, including:
- Importing images into numpy arrays
- Sizing tensors to fit the data
- Adapting training and evaluation steps to the data
- Understanding the need for large data sets!
With the more complex tutorials from TensorFlow and the additional data available in other Publicly Available Mammography Datasets, we can continue building on this example to get closer to a functioning mammogram diagnosis model.
[1] There are many ongoing research projects attempting to devise machine learning models that work with more limited datasets, but those projects are outside the scope of this post.