Can We Detect Atrial Fibrillation using Apple Watch Sensor Data


Detecting heart arrhythmias using machine learning and Apple Watch data

Yancheng Liu

March 15, 2016

Yancheng Liu
Data Scientist, AthenaHealth
Insight Fellow, 2015
Cornell University

Yancheng is an Insight alumnus from the first Health Data Science session and is now a data scientist at AthenaHealth. While at Insight, he partnered with the UCSF Health eHeart study to detect atrial fibrillation patients using Apple Watch heart rate data. This content originally appeared on his personal website.

Why atrial fibrillation?

Atrial fibrillation (Afib) is the most common form of cardiac arrhythmia, affecting 2.7 million patients in the US alone, and an estimated 70 to 140 million people around the world are undiagnosed. It is one of the leading causes of stroke. Fifty percent of Afib patients who experience a stroke survive less than one year. However, if Afib is detected, the risk of stroke can be reduced by more than 75% with proper medication.

Why wearable sensor data?

With Android Wear and Apple Watch, millions of people are now carrying heart rate sensors on their wrists at all times. These wearable devices generate orders of magnitude more data than ECGs (electrocardiograms). How can we make use of these valuable datasets to make an impact on medical practice? Of the many possible directions we can pursue, predicting heart disease from heart-rate measurements appears to be the most promising and feasible. The specific question I ask here is: can we identify individuals with undiagnosed Afib using heart-rate data measured by Apple Watch?

Can we use machine learning to detect people with serious, life-threatening arrhythmias using the heart rate data measured by an Apple or Android Watch?

Raw Data

Here is the data I worked with (courtesy of UCSF's Health eHeart Study):

13.5 million heart rate measurements from ~500 users in normal cardiac rhythm
Roughly 100,000 measurements from a dozen atrial fibrillation patients
Activity/steps data from the HealthKit and Google Fit apps

Data clean-up and preparation

As you can imagine, the raw data is extremely messy. My first step was to extract clean 10-minute frames of continuous data. In workout mode, the watch reports a heart rate measurement every 5 seconds, but as with any real-world measurement, there is missing data when the sensor can't get good contact with the skin, the user is exercising and sweating, etc.
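The frame extraction step described above can be sketched roughly as follows. This is only an illustration, not the code used in the original analysis: the function name `extract_clean_frames`, the `(timestamp, heart rate)` tuple layout, and the 1-second timing tolerance are all assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, heart rate in bpm)


def extract_clean_frames(
    samples: List[Sample],
    frame_seconds: float = 600.0,  # 10-minute frames, as in the text
    interval: float = 5.0,         # workout-mode sampling interval
    tolerance: float = 1.0,        # assumed allowed timing jitter, in seconds
) -> List[List[Sample]]:
    """Return non-overlapping frames that contain no gaps in sampling."""
    frame_len = int(frame_seconds / interval)  # 120 samples per frame
    frames: List[List[Sample]] = []
    run: List[Sample] = []
    for sample in sorted(samples):
        # A gap longer than interval + tolerance breaks the current run,
        # so the partial frame collected so far is discarded.
        if run and sample[0] - run[-1][0] > interval + tolerance:
            run = []
        run.append(sample)
        if len(run) == frame_len:
            frames.append(run)
            run = []
    return frames
```

Discarding incomplete runs, rather than interpolating across the gaps, keeps every frame at exactly 120 evenly spaced samples, which is what a later Fourier transform expects.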

Heart rate before cleaning up:

Heart rate after cleaning up:

Fourier transformation

From the medical literature, we know that a Fourier transform of normal sinus rhythm shows three characteristic peaks: 0.0033-0.04 Hz and 0.04-0.15 Hz (parasympathetic nervous system), and 0.15-0.4 Hz (respiratory rate). In contrast, a Fourier transform of a patient in atrial fibrillation shows only noise, with no distinct peaks.

I was able to recapitulate the characteristic low-frequency peak, and the noisier nature of atrial fibrillation, with the sensor data (see below). Limited by the sampling rate (1 measurement every 5 seconds, or 0.2 Hz, which puts the Nyquist limit at 0.1 Hz), I could not detect the higher-frequency peaks. This suggests that increasing the sampling rate, for example to 1 measurement per second, would likely improve the ability to differentiate Afib from normal sinus rhythm.
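To make the sampling-rate limit concrete, here is a minimal sketch of a power spectrum for one 10-minute frame. It uses a plain discrete Fourier transform from the standard library so that it runs without dependencies; in practice an FFT routine such as the one in NumPy would be the natural choice. The function names `power_spectrum` and `band_power` are hypothetical and are not taken from the original analysis.

```python
import cmath
import math
from typing import List, Tuple


def power_spectrum(values: List[float], interval: float = 5.0) -> List[Tuple[float, float]]:
    """Return (frequency in Hz, power) pairs up to the Nyquist frequency."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the constant (DC) offset
    spectrum = []
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient of the mean-centered signal.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append((k / (n * interval), abs(coeff) ** 2 / n))
    return spectrum


def band_power(spectrum: List[Tuple[float, float]], low: float, high: float) -> float:
    """Sum the power of all bins with low <= frequency < high."""
    return sum(p for f, p in spectrum if low <= f < high)
```

For a 120-sample frame at one measurement every 5 seconds, the bins are spaced 1/600 Hz apart and stop at 0.1 Hz, so the 0.15-0.4 Hz band lies entirely outside the spectrum and only part of the 0.04-0.15 Hz band is visible.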

Build a supervised learning model

I trained a model using features from both the time and frequency domains to distinguish atrial fibrillation from normal rhythm. Since the sample population was very unbalanced, I undersampled the normal class to avoid getting a trivial classifier that calls everyone "normal". The samples I used to train and test the classifier (75%/25% split) are only a fraction of the total sample pool: I ended up using 41 'Afib' (confirmed positive Afib cases), 185 'workout' (healthy and exercising), and 274 'normal' (healthy and still) samples.
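The undersampling and 75%/25% split might look something like the sketch below. The helper names are hypothetical, and the per-class (stratified) split is an assumption: the original text gives the split ratio but does not say how the split was drawn.

```python
import random
from typing import Dict, List, Sequence, Tuple


def undersample(
    by_class: Dict[str, Sequence],
    targets: Dict[str, int],
    seed: int = 0,
) -> List[Tuple[object, str]]:
    """Randomly keep at most targets[label] samples of each class."""
    rng = random.Random(seed)
    kept = []
    for label, items in by_class.items():
        # Classes without a target are kept in full.
        k = min(targets.get(label, len(items)), len(items))
        kept.extend((item, label) for item in rng.sample(list(items), k))
    return kept


def stratified_split(labelled, test_fraction: float = 0.25, seed: int = 0):
    """Split per class so each class keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({lab for _, lab in labelled}):
        group = [pair for pair in labelled if pair[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting within each class matters here because with only 41 Afib samples, a purely random split could leave the test set with very few positive cases.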

With these 500 samples, I tested a range of supervised learning algorithms to find the best one. I used accuracy, precision, and recall scores as the evaluation metrics. The detailed comparison is shown in the table (below). I also plotted the Receiver Operating Characteristic (ROC) curve. K-nearest neighbors was the fastest classifier, but Extremely Randomized Trees (ERT) was the clear winner in terms of precision and recall.

This classifier can detect ~50% of the Afib patients using only the Apple Watch data. If someone is identified as a potential 'Afib' patient, there is an ~86% chance that they have the condition, and they should probably see a cardiologist for an ECG test.
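For readers less familiar with these two numbers, recall is the fraction of true Afib samples the classifier catches, and precision is the fraction of 'Afib' calls that are correct. The sketch below shows the arithmetic. The counts are hypothetical, chosen only because they reproduce figures close to the ones quoted above; the actual confusion matrix is not given in this post.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts, NOT the real results: 6 Afib samples caught,
# 1 healthy sample wrongly flagged, 6 Afib samples missed.
precision, recall = precision_recall(tp=6, fp=1, fn=6)
print(f"precision={precision:.2f} recall={recall:.2f}")
```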

How can we improve the model?

Rebalancing the samples by SMOTE

Besides undersampling the normal class, another common approach is to oversample the minority Afib class. Oversampling has the benefit of preserving the variance among the normal samples. However, standard random oversampling with replacement is prone to overfitting, because it only duplicates existing samples. Another strategy is to use the SMOTE (Synthetic Minority Over-sampling Technique) method proposed by Chawla et al. (Journal of Artificial Intelligence Research 16:321-357). This method creates synthetic samples along the line segments joining each minority-class sample to any or all of its k nearest minority-class neighbors. I tried both oversampling techniques, and the SMOTE method had much better performance, especially for the recall score. The performance of both ERT and Support Vector Machine (SVM) is improved with the SMOTE method, as shown in the ROC graph below.
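The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified variant written for illustration, not the implementation used for the results above: it picks the base sample at random instead of looping over every minority sample as the published algorithm does, and the function names are made up for the example.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[Vector], n_new: int, k: int = 5, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies between two real minority samples, the new points add variety without simply repeating existing rows. Synthetic samples should be generated from the training split only, so that no test sample influences them.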

Feature learning and dimensionality reduction

Since I had a very limited number of Afib patients, one possible way to improve the classifier is to reduce the number of features. Given a fixed sample size, there is an optimal number of features above which the performance of a classifier will degrade. In addition, reducing dimensionality will improve computation time and allow us to directly visualize the data in a 2D or 3D space. PCA (Principal Component Analysis) does this by projecting the data onto a lower-dimensional space in such a way that the variance among the data is maximally preserved.
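As a rough illustration of what "maximally preserving variance" means, the sketch below estimates the first principal component by power iteration on the covariance matrix and projects the data onto it. It is a dependency-free toy, with hypothetical function names; a real pipeline would more likely use an existing PCA implementation and keep several components.

```python
import math
import random
from typing import List


def center(rows: List[List[float]]) -> List[List[float]]:
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def first_principal_component(
    rows: List[List[float]], iterations: int = 200, seed: int = 0
) -> List[float]:
    """Estimate the unit direction of maximum variance by power iteration."""
    centered = center(rows)
    n, d = len(centered), len(centered[0])
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        # Repeated multiplication by cov pulls vec toward the top eigenvector.
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0:
            raise ValueError("data has no variance")
        vec = [v / norm for v in vec]
    return vec


def project(rows: List[List[float]], component: List[float]) -> List[float]:
    """Return each centered row's coordinate along the given component."""
    return [sum(x * c for x, c in zip(r, component)) for r in center(rows)]
```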

As an alternative, I could use a nonlinear method, such as a neural-network-based autoencoder, for dimensionality reduction. It would likely perform better than standard PCA. Convolutional neural networks (ConvNets) have seen amazing success in image processing and voice recognition. In some sense, heart-rate data is similar to a voice signal. Therefore, a ConvNet-based autoencoder may work well for our data.

Summary

With increasing amounts of wearable sensor data, we can make an impact on medicine by applying machine learning techniques.

One area of application is detecting heart disease with heart-rate sensor data.

As a proof-of-principle, I built an ERT-based classifier which showed 50% recall and 86% precision for atrial fibrillation. This could help doctors identify people with undiagnosed heart conditions.

