Ehukai, Oahu, Hawaii. Photo: Ozan Aygun
Dynamic Time Warping: Unsupervised learning for temporal sequences
In this tutorial, I would like to describe a neat way of identifying groups of similar data streams within your time-series data. Imagine you have collected a large number of time series, for example monthly measurements of the level of a certain biomarker for each clinical trial participant over a one-year period. Perhaps you would like to find the groups of subjects that display similar increases and decreases over time. While you can use classical clustering algorithms to tackle this problem, the inherent covariance structure in time-series data makes classical similarity/distance-based unsupervised modeling a little more complicated. Furthermore, time series often differ in length (e.g., following the above example, some participants may be lost to follow-up in the clinical trial, or may have started participating late, so that they do not each have 12 successive data points).
Fortunately, there are elegant methods such as Dynamic Time Warping (DTW) to measure the similarity between two time series, even when they differ in length. We can use DTW to understand pairwise relationships between a large number of series, which can help us extract patterns and insights or engineer distinct features.
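To make the idea concrete, here is a minimal sketch of the classic dynamic-programming formulation of DTW in plain Python. The function name `dtw_distance`, the absolute-difference local cost, and the toy series are my own assumptions for illustration; the analysis in this tutorial may rely on a library implementation with different options (windowing, normalization, etc.).

```python
import math


def dtw_distance(a, b):
    """DTW distance between two numeric sequences (illustrative sketch).

    Uses absolute difference as the local cost and no warping window,
    so the run time is O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Hypothetical toy series: same shape, but the second one starts one step late
# and is therefore one point longer.
on_time = [0, 1, 2, 1, 0]
delayed = [0, 0, 1, 2, 1, 0]
flat = [1, 1, 1, 1, 1]

shifted_distance = dtw_distance(on_time, delayed)
flat_distance = dtw_distance(on_time, flat)
```

In this toy case, `shifted_distance` is zero because the warping absorbs the one-step delay, while `flat_distance` is positive because no alignment can make a flat series look like a peak. Note that the two inputs do not need to have the same length.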
Here, I used unsupervised learning for an interesting problem: can we identify different patterns in the yearly number of academic publications that correspond to active drug ingredients? Note that academic research is one of the driving factors of drug discovery, and any meaningful signal we may obtain from academic research trends can potentially be used to identify new, upcoming drug candidates or to point out new research developments for an existing drug.
To tackle this problem, I web-scraped time-series data from NIH's PubMed, comprising 1000 time series of yearly PubMed publication counts for drug ingredients between 1950 and 2020. I downloaded the list of drug ingredients from FDA's OrangeBook, hence we also have the linked drug approval timeline data available for each of these drug ingredients.
In this tutorial, I demonstrate two main approaches we can use to extract patterns from this data using Dynamic Time Warping. First, I demonstrate preparing a DTW similarity matrix and using it in hierarchical clustering. I also discuss how the signal that forms different clusters can be amplified by applying variance filters. Next, I illustrate the application of an independent approach, K-shape clustering, to the same data, including a practical illustration of how the hyperparameter k can be tuned to obtain a desirable model that describes the data. Finally, I provide different ways of visualizing the resulting time-series clusters and discuss the implications of different clusters as lagging indicators of drug approval for given drug ingredients.
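As a rough preview of the first approach, the sketch below applies a variance filter and then builds a pairwise DTW distance matrix in plain Python. Everything here is an assumption made for illustration: the helper names (`dtw_distance`, `variance_filter`, `pairwise_dtw`), the threshold value, the reading of "variance filter" as dropping near-constant series, and the made-up toy counts, which are not taken from the PubMed data.

```python
import math
from statistics import pvariance


def dtw_distance(a, b):
    """Plain DTW with absolute-difference local cost (illustrative sketch)."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def variance_filter(series_by_name, min_variance):
    """Keep only the series whose population variance reaches the threshold."""
    return {
        name: values
        for name, values in series_by_name.items()
        if pvariance(values) >= min_variance
    }


def pairwise_dtw(series_by_name):
    """Return (names, matrix) where matrix[i][j] is the DTW distance
    between the series called names[i] and names[j]."""
    names = sorted(series_by_name)
    size = len(names)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            d = dtw_distance(series_by_name[names[i]], series_by_name[names[j]])
            matrix[i][j] = matrix[j][i] = d  # DTW with this cost is symmetric
    return names, matrix


# Made-up yearly counts for three hypothetical ingredients.
toy_counts = {
    "ingredient_a": [0, 1, 5, 9, 4],
    "ingredient_b": [0, 0, 1, 5, 9, 4],  # same shape, starts a year later
    "ingredient_c": [2, 2, 2, 2, 2],     # flat: carries no trend signal
}

kept = variance_filter(toy_counts, min_variance=1.0)
names, dist = pairwise_dtw(kept)
```

Here the flat series is filtered out, and the remaining matrix is symmetric with a zero diagonal. A matrix like this can then be handed to a hierarchical clustering routine, for example by condensing it with SciPy's `scipy.spatial.distance.squareform` and passing the result to `scipy.cluster.hierarchy.linkage`.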
I hope you enjoy reading this DTW tutorial and learn something new today!