Impression Generation From X-Ray Images — Case Study
Generating Medical Reports of Chest X-Rays Using Encoder-Decoder Model and Attention Mechanism.
Author : Kundan Jha
Impression generation is the process of generating a textual description from X-ray images. In the world of deep learning it is known as image captioning. Image captioning uses both Natural Language Processing (NLP) and Computer Vision (CV) to generate the text output. As humans, we can look at an ordinary picture and describe whatever is in it in appropriate language. But could you describe the image below?
This task requires a trained radiologist with years of experience. X-rays are a form of electromagnetic radiation used for medical imaging; they can be used to spot fractures, bone injuries or tumors. Analyzing X-rays to recommend the correct diagnosis to patients is an important task for radiologists. But with deep learning techniques we can predict medical reports from just the medical images.
How? The answer is in this blog. Let’s find out.
Prerequisites :
To understand the blog better, the reader should have some familiarity with concepts like:
Objective : Medical report generation from X-ray images, an end-to-end deep learning case study
Overview :
I will walk you through my approach to solving the problem, following these steps:
- Business Problem
- Mapping to ML Problem
- Existing Research Work
- EDA, Preprocessing and Structure Data
- Baseline Model [Encoder-Decoder]
- Main Model [with Attention]
- Deployment
- Conclusion
- Future Works
1. Business Problem :
Clinical imaging captures enormous amounts of information, but most radiologic data are reported in qualitative and subjective terms. X-rays are a form of electromagnetic radiation used for medical imaging. Analyzing X-rays to recommend the correct diagnosis to patients is a very important task for radiologists and pathologists. In this project, we tackle the image captioning problem for a data set of chest X-ray images with the help of state-of-the-art deep learning architectures, optimizing the parameters of the architecture.
The problem statement here is to find the impression from the given chest X-ray images. These images are of two types: frontal and lateral views of the chest. With these two types of images as input, we need to find the impression for the given X-ray. To solve this, we will build a predictive deep learning model that involves both image and text processing.
2. Mapping to ML Problem :
We will divide this problem into two parts. The first is the encoder, which extracts features from the image data; using transfer learning, we extract information from the X-ray images. The second is the decoder, which takes that image information and produces a medical report.
In simple words, we have to extract bottleneck features from the images using a CNN, either trained from scratch or via transfer learning. The latter approach is preferable, as we have little data. We then use these extracted features to predict the captions using an LSTM or GRU. The output will be a sequence of words.
- Performance Metric :
To evaluate the model performance, I will use the bilingual evaluation understudy (BLEU) score. BLEU is a well-acknowledged metric that measures the similarity of one hypothesis sentence to multiple reference sentences. Given a single hypothesis sentence and multiple reference sentences, it returns a value between 0 and 1; a value close to 1 means the two are very similar. Naturally, we want a higher BLEU score.
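As a quick illustration, here is how a BLEU score can be computed with NLTK. This is a minimal sketch with made-up tokens, not the project’s exact evaluation code:

```python
# Minimal BLEU illustration with NLTK; the tokens are invented for the example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "heart", "size", "is", "normal"]]  # one or more reference token lists
hypothesis = ["heart", "size", "is", "normal"]           # predicted tokens

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # value in [0, 1]; closer to 1 means more similar
```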
3. Existing Research Work :
My work is inspired by the research papers and blogs mentioned below. These are some terrific, state-of-the-art works in deep learning; if you have a few extra minutes, do visit them:
- Beam Search, How It Works?
- Attention Mechanism by Jay Alammar
- Image Captioning with Keras by Harshall Lamba
Let’s take a glimpse of Harshall’s work. Harshall used the Flickr 8k image data set for his task. Each image has 5 captions, which he stored in a dictionary. To clean and preprocess the data, he lowercased all words, then removed special tokens and words containing numbers. After extracting the unique words, he removed low-frequency words so that the model would be robust to outliers. He then added <start> and <end> tokens to each caption.

For feature engineering of the images he used the Inception v3 model with ImageNet weights: images were passed through this model and the output of the second-to-last layer (a 2048-dimensional bottleneck feature vector) was saved. For preprocessing of the captions, he tokenized them and found the maximum sentence length (34) to use for padding, and he used pretrained GloVe vectors for word embeddings. The model outputs a probability for each word, and the word with the highest probability comes next in the sentence; he used the greedy search method. The model was trained for 30 epochs with an initial learning rate of 0.001 and batch size 3, using the Adam optimizer and categorical cross-entropy loss.
4. EDA, Preprocessing and Structure Data :
Data Source : Indiana University(X-ray images and Radiology reports)
- Chest X-rays: there are 7,471 images in .png format, containing frontal and lateral views of each patient’s chest.
- Radiology reports: there are 3,955 patients’ text reports available in .xml format.
The data set contains chest X-ray images and radiology text reports. Each image is paired with four captions (Comparison, Indication, Findings and Impression) that provide clear descriptions of the salient entities and events. The findings caption carries the most information about the images. The goal of this case study is to predict the findings of the medical report attached to the images.
Sample Data Point :
XML Parsing and Creating Data Points :
In this section we will see how the raw XML data is parsed and structured into data points, which are then stored in CSV files for later model requirements.
Raw XML Tree View:
Each XML file has a lot of patient-related information: the image_id and text captions like comparison, indication, findings and impression. We will extract the findings field from these files and treat it as the report, since it is the most useful for the medical report. We also need to extract the image_id to get the X-rays corresponding to each report.
We extracted data from the XML files using code along the lines of the sketch below:
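A minimal sketch of that parsing step, assuming the Open-i report layout where captions live in <AbstractText Label="..."> elements and image ids in <parentImage id="..."> elements; the directory path is an assumption:

```python
# Parse the radiology report XML files into one structured data frame.
import glob
import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for path in glob.glob("ecgen-radiology/*.xml"):   # path is an assumption
    root = ET.parse(path).getroot()
    # Captions are keyed by the Label attribute (COMPARISON, INDICATION, ...).
    captions = {elem.get("Label"): (elem.text or "")
                for elem in root.iter("AbstractText")}
    image_ids = [img.get("id") for img in root.iter("parentImage")]
    rows.append({
        "image_ids": image_ids,
        "comparison": captions.get("COMPARISON", ""),
        "indication": captions.get("INDICATION", ""),
        "findings":   captions.get("FINDINGS", ""),
        "impression": captions.get("IMPRESSION", ""),
    })

df = pd.DataFrame(rows)
df.to_csv("structured_reports.csv", index=False)  # saved for later steps
```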
A structured sample data point after extracting data from XML :
4.1 Data Preprocessing :
In this phase the text data is preprocessed to remove unwanted tags, text artifacts, punctuation and numbers. We also perform basic decontractions, i.e. words like won’t and can’t are expanded to will not and can not. We will also check for empty cells and NaN values (a code sketch follows the list below):
- If there are any empty cells in the image name column, we drop those rows.
- The word count of each text column is calculated and added as a new data frame column.
- If there is any empty or NaN value in the text data, we replace it with “No <Column Name>” (e.g. No Impression).
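A minimal sketch of this cleaning, assuming the data frame from the parsing sketch above; the decontraction map is illustrative, not exhaustive:

```python
import re

# Illustrative decontraction map (assumption: only a few common cases shown).
CONTRACTIONS = {"won't": "will not", "can't": "can not",
                "n't": " not", "'re": " are", "'ve": " have"}

def clean_text(text):
    text = str(text).lower()
    for pattern, replacement in CONTRACTIONS.items():
        text = text.replace(pattern, replacement)
    text = re.sub(r"<[^>]+>", " ", text)   # drop tag-like fragments
    text = re.sub(r"[^a-z ]", " ", text)   # drop punctuation and numbers
    return re.sub(r"\s+", " ", text).strip()

df = df.dropna(subset=["image_ids"])       # drop rows with no image
# Fill empty text cells, then clean; word counts go into a new column.
df["findings"] = df["findings"].fillna("No Findings").apply(clean_text)
df["findings_word_count"] = df["findings"].str.split().str.len()
```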
After the data preprocessing step, we have a total of 3851 rows present in the final data points.
4.2 Exploratory Data Analysis :
- Exploring the Image data:
Sample 9 X-Rays:
We can see the images include both frontal and lateral views. Let’s analyze the number of images present per data point (report).
Minimum image count: 1; maximum image count: 5; median image count: 2.
- Exploring the text features :
In the text analysis we will take the findings feature as the target. First, let me show you the diagrams from my analysis; then we will jump to the observations.
Unique-Words in Findings :
PDF and CDF for word count distribution of Findings feature :
Word cloud on Findings feature :
Observations on the findings feature:
- There are a total of 3,851 entries and 2,545 of them are unique, i.e. they never repeat. So this cannot be treated as a categorical feature.
- In the findings feature, 50% of the data points have fewer than 20 words, and 99% have fewer than 50 words.
- From the word cloud: pleural, effusion, silhouette, within, normal, lungs and cardiomediastinal are the highlighted, i.e. important, words.
4.3 Structure Data :
There are only two image types, frontal and lateral, but each patient can have multiple X-rays associated with them. The maximum number of images associated with a report is 5, the minimum is 0, and most commonly a report has 2 associated images.
Some data points have more than 2 images and some have fewer. Let’s handle the data points that have 1, 3, 4 or 5 images; we need an idea for structuring these data points.
- Approach :
We limit each data point to 2 images. If we have 5 images, we pair each of the first 4 images with the last image, producing 4 data points; the last image should be the lateral view when the remaining images are frontal. In other words, I converted multiple images into pairs of two, using the following steps:
1. If I have 5 images, then 4 data points are created:
- 1st image + 5th image
- 2nd image + 5th image
- 3rd image + 5th image
- 4th image + 5th image
2. If I have 4 images, then 3 data points are created:
- 1st image + 4th image
- 2nd image + 4th image
- 3rd image + 4th image
3. If I have 3 images, then 2 data points are created:
- 1st image + 3rd image
- 2nd image + 3rd image
4. If I have 2 images, the data point is already in the required form.
5. Finally, if I have 1 image, I just replicate it to make 2.
Code for the data structuring explained above:
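The following is a minimal sketch of that pairing logic; the make_image_pairs helper and column names are assumptions carried over from the earlier sketches, not the original gist:

```python
import pandas as pd

def make_image_pairs(image_ids):
    # Pair every non-final image with the last (usually lateral) image.
    if len(image_ids) == 1:
        return [(image_ids[0], image_ids[0])]     # replicate a single image
    return [(img, image_ids[-1]) for img in image_ids[:-1]]

pairs = []
for _, row in df.iterrows():                      # df from the parsing step above
    for img1, img2 in make_image_pairs(row["image_ids"]):
        pairs.append({"image_1": img1, "image_2": img2,
                      "findings": row["findings"]})

paired_df = pd.DataFrame(pairs)   # e.g. a 5-image report yields 4 rows
```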
With this data-structuring method we also increase the number of data points and obtain well-formed inputs. We then split the data into train, test and validation sets. We should include the single-image data points in the train, test and validation sets in equal proportion to avoid any additional bias.
5. Baseline Model [Encoder-Decoder] :
A sequence-to-sequence model is a deep learning model that takes a sequence of items (in our case, features of an image) and outputs another sequence of items (reports).
The encoder processes each item in the input sequence and compiles the information it captures into a vector called the context. After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item. The steps I followed are:
5.1 Add Token in text data :
After creating the new data points, we add <start> and <end> tokens to the text data and prepare the decoder input and output. These special tokens are added at the start and end of each sentence to let the model learn where sentences begin and end.
5.2 Tokenization :
Machines only understand numerical values; we cannot feed raw text into deep learning and machine learning models. We will convert the text data into numerical data using a tokenizer. The TensorFlow deep learning library provides tools for this.
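A sketch of both steps (adding the special tokens, then tokenizing and padding) with tf.keras; paired_df comes from the structuring sketch, and the custom filter string simply keeps < and > so the special tokens survive:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Add the special tokens to every report.
paired_df["findings"] = "<start> " + paired_df["findings"] + " <end>"

# Keep < and > out of the filter list so <start>/<end> are preserved.
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(paired_df["findings"])
vocab_size = len(tokenizer.word_index) + 1

sequences = tokenizer.texts_to_sequences(paired_df["findings"])
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

# Teacher forcing: the input drops <end>, the output is shifted left (drops <start>).
decoder_input, decoder_output = padded[:, :-1], padded[:, 1:]
```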
5.3 Image feature :
We will use transfer learning to convert images to feature vectors, using pre-trained CheXNet model weights.
The pre-trained model can be found here : CheXNet-Keras
CheXNet is a DenseNet121-based model trained on 112,120 chest X-ray images for the classification of 14 diseases. We can load the weights of that model and pass the image through it, ignoring the top classification layer.
As there are two X-rays corresponding to each patient, each image is preprocessed according to the input requirements of the DenseNet121 model, and the model’s outputs for both images are concatenated at the end.
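A sketch of this feature extraction; the weights filename follows the CheXNet-Keras repository and the 224×224 input size and pooling choice are assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.densenet import DenseNet121, preprocess_input

# DenseNet121 backbone with global average pooling; the top layer is ignored.
chexnet = DenseNet121(include_top=False, input_shape=(224, 224, 3), pooling="avg")
# Partially load the CheXNet weights into matching layers (filename assumed).
chexnet.load_weights("brucechou1983_CheXNet_Keras_0.3.0_weights.h5",
                     by_name=True, skip_mismatch=True)

def image_features(path):
    img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
    arr = preprocess_input(tf.keras.preprocessing.image.img_to_array(img))
    return chexnet.predict(arr[None, ...], verbose=0)[0]   # 1024-d vector

def patient_features(frontal_path, lateral_path):
    # Concatenate the two per-image vectors into one 2048-d patient vector.
    return np.concatenate([image_features(frontal_path),
                           image_features(lateral_path)])
```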
5.4 Encoder-Decoder Architecture :
I pass the concatenated image tensors to the encoder: the image features are fed into a dense layer with 512 neurons, followed by a dropout layer for tuning. The decoder consists of an embedding layer, a dropout layer and an LSTM layer. The input report sequence is passed to the embedding layer, the embedding output goes through the dropout layer, and the result is finally fed to the LSTM.
The LSTM (Long Short-Term Memory) layer is a special kind of RNN, capable of learning long-term dependencies.
To know more about LSTM refer this link: Understanding LSTM Networks
We then add the outputs of the encoder and decoder using the Add layer of Keras. This output is passed to a time-distributed dense layer, which is applied at the end because the output is sequential and the dense layer should be applied to every temporal slice of the output.
Note : The dropout layers are added only for fine tuning the model.
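A minimal sketch of this baseline in Keras; the 512-unit dense and LSTM layers follow the text, while the embedding size and dropout rates are assumptions (vocab_size and max_len come from the tokenization sketch):

```python
from tensorflow.keras import layers, Model

image_input = layers.Input(shape=(2048,), name="image_features")
enc = layers.Dense(512, activation="relu")(image_input)
enc = layers.Dropout(0.3)(enc)                          # rate is an assumption

report_input = layers.Input(shape=(max_len - 1,), name="decoder_input")
dec = layers.Embedding(vocab_size, 300)(report_input)   # 300-d embedding assumed
dec = layers.Dropout(0.3)(dec)
dec = layers.LSTM(512, return_sequences=True)(dec)

# Broadcast the image context to every decoder time step, then add.
added = layers.Add()([layers.RepeatVector(max_len - 1)(enc), dec])
output = layers.TimeDistributed(
    layers.Dense(vocab_size, activation="softmax"))(added)

model = Model([image_input, report_input], output)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```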
5.5 Model Performance : visualized in TensorBoard
5.6 Model Inference :
In the inference stage, I used argmax-based greedy search to generate the output sentence. Greedy search is the vanilla approach to generation: at each step it selects the single word with maximum probability from the entire vocabulary. A sketch of greedy search is given below.
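This sketch decodes against the baseline model above; model, tokenizer and max_len are carried over from the earlier sketches:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_search(img_features):
    seq = [tokenizer.word_index["<start>"]]
    for _ in range(max_len - 1):
        padded_seq = pad_sequences([seq], maxlen=max_len - 1, padding="post")
        preds = model.predict([img_features[None, :], padded_seq], verbose=0)
        next_id = int(np.argmax(preds[0, len(seq) - 1]))  # highest-probability word
        if tokenizer.index_word.get(next_id) == "<end>":
            break
        seq.append(next_id)
    return " ".join(tokenizer.index_word[i] for i in seq[1:])
```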
5.7 Sample Predictions:
From the above predictions we can see that the model produces some grammatically correct words, but the predictions contain repetitions: most words are repeated across almost all predictions, and some medical reports are even meaningless and not useful. Since this is a baseline model there is always room for improvement; we can check the other model’s predictions and compare their performance.
6. Main Model [with Attention] :
Attention in deep learning can be broadly interpreted as a vector of importance weights: to predict or infer one element, such as a pixel in an image or a word in a sentence, we use the attention vector to estimate how strongly it is correlated with the other elements, and take the sum of those elements’ values, weighted by the attention vector, as the approximation of the target.
For this model, I will be using the Bahdanau additive attention mechanism. Let’s quickly look at the architecture of the encoder-decoder model with additive attention:
Input: the model is fed both the image vector and the report text in the embedding dimension; the two inputs are added and sent as the context vector to the decoder.
Decoder stage: a bidirectional GRU is used to extract high-level features from the input, giving the model a deeper understanding of the input features.
Additive attention: it assigns a weight (alpha) to every word in the sequence, and the word-level features from each time step are summed into a sentence-level feature vector. This is a simplified form of Bahdanau’s attention.
To know more about Attention refer this link: Attention? Attention!
Code Implementation :
The encoder is the same as in the baseline model. For the decoder, I created a one_step_decoder layer which takes the decoder_input, the encoder_output and the state value. The decoder_input is a single token id. It is passed through the embedding layer; the embedding output and the encoder_output then go through the attention layer, which produces the context vector. The context vector is passed through the RNN (a GRU here), with its initial state being that of the previous decoder step. I used dropout layers for tuning and regularization of the model.
The decoder calls the one-step decoder layer for each decoder time step and computes the scores and attention weights. The outputs of all time steps are stored in the ‘all_outputs’ variable: the output from each decoder step is the next word in the sequence, and ‘all_outputs’ is our final output.
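A condensed sketch of the additive attention layer and the one-step decoder described above; all layer sizes are assumptions and the dropout layers are omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)
        self.W2 = layers.Dense(units)
        self.V = layers.Dense(1)

    def call(self, encoder_output, decoder_state):
        # score shape: (batch, seq_len, 1); alpha weights via softmax over seq_len.
        score = self.V(tf.nn.tanh(self.W1(encoder_output) +
                                  self.W2(tf.expand_dims(decoder_state, 1))))
        weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(weights * encoder_output, axis=1)
        return context, weights

class OneStepDecoder(layers.Layer):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = layers.Embedding(vocab_size, embedding_dim)
        self.attention = BahdanauAttention(units)
        self.gru = layers.GRU(units, return_state=True)
        self.fc = layers.Dense(vocab_size)

    def call(self, token, encoder_output, state):
        # token shape: (batch, 1); state: previous decoder GRU state.
        context, weights = self.attention(encoder_output, state)
        x = self.embedding(token)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=state)
        return self.fc(output), state, weights
```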
6.1 Model Performance : visualized in TensorBoard
6.2 Model Inference :
For the final model, in the inference stage I used beam search to find the output sentence. In beam search, at each decoder time step we select the top K words (K = beam width) by likelihood and expand each of them, effectively running K parallel versions of the decoding. As the beam width increases, inference time and memory consumption increase, making predictions slower. A sketch of beam search is given below.
The reason for choosing beam search over greedy search is that beam search maximizes the conditional probability of the whole candidate sequence and picks the most probable overall outcome, rather than committing to the single most probable word at each step.
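This simplified sketch runs against the baseline-style interface from the earlier sketches; a full attention-model version would also carry the decoder state per candidate:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search(img_features, beam_width=3):
    start_id = tokenizer.word_index["<start>"]
    end_id = tokenizer.word_index["<end>"]
    beams = [([start_id], 0.0)]                    # (sequence, cumulative -log prob)
    for _ in range(max_len - 1):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                  # finished candidates carry over
                candidates.append((seq, score))
                continue
            padded_seq = pad_sequences([seq], maxlen=max_len - 1, padding="post")
            preds = model.predict([img_features[None, :], padded_seq], verbose=0)
            probs = preds[0, len(seq) - 1]
            for idx in np.argsort(probs)[-beam_width:]:   # top-K next words
                candidates.append((seq + [int(idx)],
                                   score - np.log(probs[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1])[:beam_width]
    return beams   # lowest cumulative -log prob (most probable) first
```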
6.3 Sample Predictions :
I have displayed all the candidates from beam search for the prediction; the number of candidates is decided by the beam width. In my implementation, the printed beam scores are the negative log of the actual probability, so lower scores correspond to more probable (better) predictions.
Let’s see one more example :
The predictions are almost the same as the original report; the model is doing a really good job. However, even our attention model can’t predict every image accurately. Keeping in mind that we trained this model on just 3,200 data points, it works pretty well. To learn more complex features, the model needs more data.
Comparison of model performance :
I have used cumulative n-gram BLEU scores. Cumulative scores compute the individual n-gram scores at all orders from 1 to n and combine them with a weighted geometric mean.
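With NLTK, the cumulative scores are just different weight vectors passed to sentence_bleu; the reference and prediction variables here are placeholders:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [true_report.split()]       # placeholder: ground-truth report tokens
hypothesis = predicted_report.split()   # placeholder: model prediction tokens

# Cumulative BLEU-1 through BLEU-4 via the weights argument.
bleu1 = sentence_bleu(reference, hypothesis, weights=(1.0, 0, 0, 0))
bleu2 = sentence_bleu(reference, hypothesis, weights=(0.5, 0.5, 0, 0))
bleu3 = sentence_bleu(reference, hypothesis, weights=(1/3, 1/3, 1/3, 0))
bleu4 = sentence_bleu(reference, hypothesis, weights=(0.25, 0.25, 0.25, 0.25))
```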
7. Deployment :
I have also deployed the best model using Flask and ngrok, with a basic user interface where users can upload their X-rays and submit them. The app does the necessary feature engineering, i.e. converts the images to tensors and feeds them to the model, which predicts text from the tensors and outputs it as a medical report.
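A bare-bones sketch of the serving route, reusing the helper functions from the sketches above; the form field names and the UI are assumptions, and the ngrok tunnelling is omitted:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Save the two uploaded X-rays, featurize them, and run inference.
    frontal, lateral = request.files["frontal"], request.files["lateral"]
    frontal.save("frontal.png")
    lateral.save("lateral.png")
    features = patient_features("frontal.png", "lateral.png")
    return {"report": greedy_search(features)}

if __name__ == "__main__":
    app.run()
```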
8. Conclusion :
Now that we have come to the end of this project, let’s summarize what we’ve done:
- We just saw an end-to-end deep learning case study of image captioning in the medical field. We understood the problem and the need for such applications.
- For the baseline, we created an encoder-decoder model, which did not give us decent results.
- We improved the baseline results by building an attention model. The results are promising, and the impression statements are meaningful with respect to the X-ray image.
- The attention model outperforms the simple encoder-decoder model on every cumulative n-gram score.
- Beam Search generated better results than Greedy Search.
9. Future Works :
- Making use of slightly more advanced techniques like transformers or BERT might yield better results.
- We could use a deeper CNN encoder for further improvement.
- Obviously, the data set in this case study is very limited in size, so more data could improve the model’s predictions.
- Get more X-ray images with diseases, since most of the data available in this data set is of the “no disease” category.
- Inference output quality can be improved by increasing beam width but at the expense of computational resources.
Thank you for reading the blog. Tell me your thoughts below!
GitHub Repository : Impression-Generation-From-X-Ray-Images
LinkedIn Handle : Kundan Jha