Atrial Fibrillation (AFib) is a condition in which the heart's atria beat erratically, out of sync with the rest of the heart. This reduces the effectiveness of the heart and is associated with more severe conditions such as stroke and heart failure (heart.org). As a result, much effort has been directed toward developing better ways of detecting AFib early. In the new age of wearable technology like the Apple Watch, passive machine-learning-based detection methods have become commonplace. In this project, I wanted to explore how electrocardiograms (ECGs) inform us of episodes of AFib and determine whether 1-D convolutional neural networks are adequate for detecting it.
I recently bought an Apple Watch. It was on sale for a pretty good deal and I'd been wanting one for quite a while. This led to me discovering its ECG AFib detection capabilities, so I set out to try to recreate a similar system with this project. While it's not going to be as thorough as Apple's research, I went with what was realistically achievable given my resources, time, and prior knowledge. The biomedical domain isn't really my specialty and I have no prior experience working with ECGs. However, this project has served as some decent practice and has been a fruitful exploration into the biomedical domain.
The first thing I had to do was source a good ECG dataset. Luckily, I quickly found the MIT-BIH Atrial Fibrillation Database. It contains annotated ECG data from 23 unique patients, each with two simultaneous 10-hour ECG signals. The annotations mark when an episode of a rhythm begins in a record, with four labels available: Atrial Fibrillation (AFIB), Atrial Flutter (AFL), AV Junctional rhythm (J), and "all other rhythms" (N). The dataset mostly captures AFIB and N rhythms, while AFL and J together make up less than 1% of the data.
| Label | Total Duration (Minutes) | Total Duration (%) | Avg Duration (Samples) | Min Duration (Samples) | Unique Episodes | Episodes (>30s) |
|---|---|---|---|---|---|---|
| AFIB | 5,603.85 | 39.87% | 288,858 | 420 | 291 | 226 |
| AFL | 97.95 | 0.70% | 104,947 | 882 | 14 | 7 |
| J | 5.52 | 0.04% | 6,894 | 380 | 12 | 3 |
| N | 8,349.30 | 59.40% | 434,859 | 1,062 | 288 | 263 |
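The per-label statistics above come from walking the rhythm annotations of each record. A minimal sketch of that computation is below; the annotation positions and labels would normally come from the `wfdb` package (whose `rdann` returns `.sample` indices and `.aux_note` strings like `'(AFIB'` for this database), but the helper and the example values here are purely illustrative:

```python
from collections import defaultdict

FS = 250  # sampling rate of the MIT-BIH AFib Database, in Hz

def episode_durations(samples, aux_notes, total_len):
    """Given rhythm-change annotation positions and their labels
    (e.g. from wfdb.rdann(...).sample / .aux_note), return a dict
    mapping each rhythm label to a list of episode lengths in samples."""
    episodes = defaultdict(list)
    for i, (start, note) in enumerate(zip(samples, aux_notes)):
        label = note.lstrip('(')  # "(AFIB" -> "AFIB"
        # An episode runs until the next annotation, or the end of the record.
        end = samples[i + 1] if i + 1 < len(samples) else total_len
        episodes[label].append(end - start)
    return episodes

# Hypothetical annotation stream for one record: N -> AFIB -> N
eps = episode_durations([0, 5000, 12000], ['(N', '(AFIB', '(N'], 20000)
```

Summing these lists per label (and converting samples to minutes via the sampling rate) yields the duration columns of the table.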
Again, I went into this project with little to no background knowledge of how ECGs work and what characteristics each rhythm tends to display. So, using the data, I'm going to hypothesize about the characteristics of each rhythm and then compare my hypotheses with existing information.
First, I curated a small random sample with decent variation, plotted below for a quick visual comparison. While I don't notice any obvious differences between the rhythms, they do seem to follow the same pattern: some sort of lead-up activity, a giant spike in activity, and then some follow-up activity. This is most prominent in record 08455.
I later learned that these were the P, R, and T waves of the heart rhythm. However, I missed a couple other waves. According to this diagram, it should appear like this for normal sinus rhythm. (I will dive deeper into what these represent later.)
Since nothing really stood out between the different rhythms, perhaps another view of the data might prove more informative. Using the same sample, I plotted their Discrete Fourier Transforms and noted some observations below:
| AFIB | AFL | J | N |
|---|---|---|---|
| • Hard to notice distinct harmonic banding<br>• Noisy † | • Clear harmonic banding<br>• High-frequency fundamental | • Sometimes strong harmonic banding<br>• Low-frequency fundamental<br>• Slightly noisy † | • Occasionally strong harmonic banding<br>• Records 05091 and 04936 are a little noisy †<br>• Record 08455 has 60 Hz noise (probably a product of the data capture process) |
† I should be careful when I say "noisy". If you look at the signals they were derived from, they aren't so noisy per se. However, in the frequency domain, it is harder to distinguish harmonic spikes like those we see in the other signals.
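For reference, the spectra behind these observations can be computed with a plain DFT of each segment. A minimal sketch (the 1.2 Hz synthetic "heartbeat" below is illustrative, not real ECG data):

```python
import numpy as np

FS = 250  # sampling rate of the database, in Hz

def magnitude_spectrum(signal, fs=FS):
    """Return (frequencies, magnitudes) of a real-valued ECG segment."""
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs, mags

# A synthetic 1.2 Hz sinusoid (72 BPM): its spectrum should peak there.
t = np.arange(0, 10, 1 / FS)
sig = np.sin(2 * np.pi * 1.2 * t)
freqs, mags = magnitude_spectrum(sig)
peak = freqs[np.argmax(mags[1:]) + 1]  # skip the DC bin
```

A real heartbeat is far from sinusoidal, which is exactly why its spectrum shows a fundamental at the beat rate plus the harmonic "banding" noted above.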
To do a more general comparison, I took 200 samples for each class, applied a Discrete Fourier Transform, and averaged their values to generate the plot below.
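The averaging step is just a mean over per-segment DFT magnitudes. A sketch under the assumption that the sampled segments are equal-length rows of one array (the tiny 4-sample "signals" here are illustrative only):

```python
import numpy as np

def mean_spectrum(segments):
    """Average the DFT magnitudes of equal-length segments (one per row)."""
    return np.abs(np.fft.rfft(segments, axis=1)).mean(axis=0)

# Two hypothetical 4-sample "signals" with the same shape, different scale
segs = np.array([[0.0, 1.0, 0.0, -1.0],
                 [0.0, 2.0, 0.0, -2.0]])
avg = mean_spectrum(segs)
```

Averaging magnitudes (rather than complex spectra) means segments don't need to be phase-aligned, which matters since the 200 samples per class start at arbitrary points in the rhythm.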
The only real distinction I can make is the sharpness of the banding for each rhythm. However, I believe this doesn't tell us much about the characteristics of each class. We can expect banding as a result of the harmonics produced by the beat rhythm. How sharp the banding appears could potentially be explained by the variance of the data in each label: AFL and J will have low variance due to their limited data availability, while AFIB and N have higher variance due to their much larger sample pools. Thus, AFL and J appear "sharper" while AFIB and N are more "fuzzy".
On the other hand, AFIB and N do not follow this pattern. N has slightly more availability than AFIB, with about 20% more data by duration, so we would expect its plot to appear more varied, with less clear banding. Despite this, its plot shows more distinct banding than the AFIB plot does.
One explanation could be that AFIB appears in 23 records while N appears in only 21. If this were the case, I would expect N's appearance in this plot to become fuzzier as I increase the sample size for each class. I tested this at two other sample sizes. At 1,000, I observed slightly more variance in both N and AFIB, but harmonic banding remained distinguishable in N. At 10,000, there was no noticeable difference from the 1,000 level. From this experiment, I can't conclusively rule this explanation out.
Another explanation could be that AFIB rhythms simply have more variation in BPM, resulting in this plot appearing less sharp.
One final explanation might be clearer if we reference the previous frequency-domain plot. Recall that I noted the AFIB plots had indistinguishable spikes. This may suggest that the AFIB signals are more erratic, resulting in their aggregates appearing as they do.
An electrocardiogram (ECG or EKG) captures electrical activity along an electrical axis determined by the placement of the electrodes. When the heart beats, the Sinoatrial Node (SA Node) initiates the beat by generating an action potential. This action potential propagates through the myocardium of the atria, forcing them to depolarize and pump blood into the ventricles. This event is recorded on the ECG as the P wave. As the action potential continues, it eventually reaches the Atrioventricular Node (AV Node), which delays the action potential from continuing into the ventricles until they have filled. After this delay, the signal propagates through the ventricles, producing the large QRS complex. Finally, the T wave is the result of the repolarization of the ventricles. It should be noted that the atria also repolarize, but this is obscured by the QRS complex. (Source)
While normal beats should look mostly like the example that I have provided (with the P, Q, R, S, and T waves in order and occurring in regular intervals), they can vary significantly and appear differently depending on the placement of the electrical leads and numerous other factors. This especially goes for the QRS complex which can take on many different forms according to some of my sources.
During atrial fibrillation, the atria chaotically and rapidly depolarize. However, after activation, the AV Node becomes temporarily unresponsive to further stimulus, so these action potentials do not always reach the ventricles. This results in an ECG that typically lacks a P wave and has an irregular heartbeat. For example, we can observe this in records 08215 and 07910 for the AFib column on the time domain plot.
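The "irregular heartbeat" signature is quantifiable: given R-peak locations, the spread of the RR intervals (time between consecutive beats) is much larger during AFib than during normal sinus rhythm. A sketch of one simple irregularity measure, the coefficient of variation (the R-peak indices below are made up for illustration; this is not part of the models trained later):

```python
import numpy as np

def rr_irregularity(r_peaks, fs=250):
    """Coefficient of variation of RR intervals, given R-peak sample
    indices. AFib typically yields a much higher value than sinus rhythm."""
    rr = np.diff(r_peaks) / fs          # RR intervals in seconds
    return rr.std() / rr.mean()

regular = rr_irregularity(np.arange(0, 2500, 250))   # steady 1 s beats
irregular = rr_irregularity(np.array([0, 180, 500, 620, 1000, 1130, 1600]))
```

Classical AFib detectors were often built on exactly this kind of RR-interval statistic; a CNN operating on the raw waveform can, in principle, learn both this and the missing-P-wave cue.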
Now, my main goal: creating machine learning models to detect AFib. I actually had a lot of trouble with this initially. I thought that with such a large dataset, I could just generate random samples during training and validation. However, this system yielded unreliable results no matter the sample sizes I used. After struggling for a while, I referred to Detection of Atrial Fibrillation Using 1D Convolutional Neural Network (Hsieh, 2020) for some guidance, which led to my final data loading system. After making these changes, I immediately saw better results. Here are the details:
To split the ECGs into 2-lead, 10-second labeled samples for 3-fold cross-validation, I extracted each unique episode of a rhythm and noted the record it came from, when the episode began, and when it ended. I then discarded any episodes shorter than 30 s (3x the expected sample length), split each remaining episode into 3 smaller, equally sized signals, and randomly mapped each one-to-one to a fold. From there, these subsamples were further split into 10-second slices with a 50% overlap between each, discarding any excess. This resulted in 54,989 samples (22,020 AFIB, 32,969 N) per fold.
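The final windowing step can be sketched as follows. At the database's 250 Hz sampling rate, a 10-second window is 2,500 samples and a 50% overlap means a hop of 1,250 samples (the 35-second all-zero "episode" is a stand-in for real data):

```python
import numpy as np

FS = 250                 # Hz
WIN = 10 * FS            # 10-second window: 2,500 samples
HOP = WIN // 2           # 50% overlap -> 1,250-sample hop

def slice_episode(signal, win=WIN, hop=HOP):
    """Split a (length, leads) episode into overlapping windows,
    discarding any trailing excess that doesn't fill a full window."""
    n = (len(signal) - win) // hop + 1
    return np.stack([signal[i * hop : i * hop + win] for i in range(n)])

# A hypothetical 35-second, 2-lead episode -> 10 s windows every 5 s
episode = np.zeros((35 * FS, 2))
windows = slice_episode(episode)
```

Because episodes are assigned to folds before slicing, overlapping windows from the same stretch of signal can never land in both a training and a validation fold, which is what made random sampling unreliable.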
When considering what type of model to apply to this problem, I immediately jumped to Convolutional Neural Networks. CNNs have proven themselves very capable signal classifiers in various other tasks, so I thought they should be my go-to answer here. However, the exact architecture of a CNN can vary widely, so I compared various designs in this project. The only common element among the models is that each takes a 10-second, 2-lead ECG as input and outputs a prediction of Normal Sinus Rhythm (0) or AFib (1).
To generate a baseline I used two models: first, a 1-D variation of the PyTorch MobileNetV2 implementation, and second, the model described in Hsieh et al., 2020. I also created a self-made CNN, though admittedly I have very little experience designing them.
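None of the three architectures is reproduced here, but the shared interface they all implement can be sketched as a minimal 1-D CNN in PyTorch. Every layer size below is illustrative and not taken from any of the compared models:

```python
import torch
import torch.nn as nn

class TinyECGNet(nn.Module):
    """Minimal sketch of the shared interface: a 2-lead, 2,500-sample
    ECG in, a single AFib logit out. Not one of the compared models."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, x):              # x: (batch, 2, 2500)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)      # logit; sigmoid > 0.5 -> AFib

model = TinyECGNet()
out = model(torch.randn(4, 2, 2500))
```

The 1-D convolutions slide along the time axis only, treating the two ECG leads as input channels, which is the same idea the MobileNetV2 variant applies by swapping its 2-D convolutions for 1-D ones.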
To train a model, I held out one fold for validation and trained on the remaining data, repeating this for each model and fold.
Lastly, for each architecture, I grouped the three fold models into an ensemble by averaging their outputs (without performing any further training). I then evaluated the ensembles on the entire dataset to determine whether averaging outputs was an effective approach for merging the various models together.
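The ensembling step is nothing more than a mean over the per-fold predicted probabilities. A sketch on hypothetical scores (three fold models, four samples):

```python
import numpy as np

def ensemble_average(fold_probs):
    """Average predicted AFib probabilities across fold models,
    with no retraining: rows are models, columns are samples."""
    return np.mean(fold_probs, axis=0)

# Hypothetical probabilities from the three fold models
probs = np.array([[0.9, 0.2, 0.6, 0.1],
                  [0.8, 0.1, 0.7, 0.3],
                  [1.0, 0.3, 0.5, 0.2]])
avg = ensemble_average(probs)
preds = (avg > 0.5).astype(int)   # 1 = AFib, 0 = Normal Sinus Rhythm
```

Averaging probabilities (rather than majority-voting hard labels) keeps the ensemble output continuous, so threshold-free metrics like AUC remain well defined.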
Cross-validated Performance
| Fold | Custom AUC | Custom Accuracy | Custom F1 | Hsieh AUC | Hsieh Accuracy | Hsieh F1 | MobileNetV2 AUC | MobileNetV2 Accuracy | MobileNetV2 F1 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.9997 | 0.9943 | 0.9929 | 0.9997 | 0.9949 | 0.9937 | 0.9997 | 0.9959 | 0.9949 |
| 2 | 0.9993 | 0.9930 | 0.9913 | 0.9997 | 0.9952 | 0.9940 | 0.9997 | 0.9951 | 0.9939 |
| 3 | 0.9989 | 0.9904 | 0.9879 | 0.9998 | 0.9943 | 0.9930 | 0.9998 | 0.9927 | 0.9909 |
Mean performance across folds:

| Model | Accuracy | F1 Score | AUC |
|---|---|---|---|
| Custom | 0.9925 | 0.9907 | 0.9993 |
| Hsieh | 0.9948 | 0.9936 | 0.9997 |
| MobileNetV2 | 0.9946 | 0.9932 | 0.9998 |
Ensemble performance on the full dataset:

| Model | Accuracy | F1 Score | AUC |
|---|---|---|---|
| Custom | 0.9938 | 0.9922 | 0.9994 |
| Hsieh | 0.9956 | 0.9945 | 0.9998 |
| MobileNetV2 | 0.9963 | 0.9954 | 0.9999 |
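The three metrics reported above can be computed with scikit-learn. A small sketch on hypothetical labels and scores (note that AUC takes the raw probabilities, while accuracy and F1 take thresholded predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical ground truth (1 = AFib) and model probabilities
y_true = np.array([1, 0, 1, 0, 1, 0])
y_prob = np.array([0.92, 0.08, 0.65, 0.40, 0.85, 0.55])
y_pred = (y_prob > 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)   # fraction of correct labels
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision/recall
auc = roc_auc_score(y_true, y_prob)    # threshold-free ranking quality
```

F1 and AUC matter here because the classes are imbalanced (roughly 40/60 AFIB/N), so accuracy alone would overstate a model biased toward N.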
In summary, all of the CNNs seemed extremely capable at classifying AFib ECGs against N ECGs. After all, they were designed with signal classification in mind. We also saw that averaging model outputs was a viable method of aggregating the models into ensembles. I wouldn't say it necessarily improves the models' performance, but it doesn't break them either.
Now, this implementation is likely far from Apple's approach to the problem. Namely, Apple specifically targeted deployment on a proprietary wearable device, which makes some aspects of this project unrealistic. Apple had various systems in place to deal with the noisy data frequently encountered when recording ECGs from a mobile device, whereas my data was captured in a clinical setting, likely with a more reliable capture method as well as an extra electrical lead. Furthermore, I would need to consider not just the accuracy of the model but also its size and compute time. I did have this in mind, which was one of the reasons I chose MobileNetV2 as one of the models I tested.
While I am satisfied with the performance of the models, I think there may be some methods to achieve better results.
We could also expand the capabilities of the model by adding other arrhythmias as labels. The biggest limit here would be collecting enough data. AFib is one of the most common arrhythmias, so it is not too difficult to collect a lot of data for it. However, episodes of other arrhythmias occur much more rarely, resulting in severely imbalanced datasets. You would therefore likely need to collect an enormous amount of data and, given its immensity, implement an unsupervised or semi-supervised method for labeling it.
Overall, I'm very satisfied with my results. If you would like to replicate them, I have detailed steps for doing so in the GitHub repo for this project, linked in the Appendix.
Goldberger, A., et al. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals." Circulation [Online] 101.23 (2000): e215–e220.
Hsieh, Chaur-Heh, et al. "Detection of Atrial Fibrillation Using 1D Convolutional Neural Network." Sensors (Basel, Switzerland) 20.7 (2020): 2136. doi:10.3390/s20072136.