This article provides a granular breakdown of data and concept drift, along with methods for detecting them and best practices for dealing with them.
Data and concept drift are frequently mentioned in ML monitoring, but what exactly are they, and how are they detected? Furthermore, given the common misconceptions, are data and concept drift things to be avoided at all costs, or natural and acceptable consequences of running models in production? Read on to find out.

Perhaps the more common of the two is data drift, which refers to any change in the data distribution after the model was trained. In other words, data drift commonly occurs when the inputs a model is presented with in production fail to correspond to the distribution it saw during training. This typically shows up as a change in the feature distribution: specific values of a given feature may become more common in production, while other values become rarer.

For example, consider an e-commerce company serving a lifetime value (LTV) prediction model to optimize its marketing efforts. A reasonable feature for such a model would be a customer’s age. However, suppose this same company changed its marketing strategy, perhaps by launching a new campaign targeted at a specific age group. In this scenario, the distribution of ages being fed to the model would likely change, causing a distribution shift in the age feature and perhaps a degradation in the model’s predictive performance. This would be considered data drift.

Contrary to popular opinion, not all data drift is bad or implies that your model needs retraining. For example, your model in production may encounter more customers in the 50–60 age bracket than it saw during training. This does not necessarily mean that the model saw an insufficient number of 50–60-year-olds during training, only that the distribution of ages it is exposed to has shifted. In this case, retraining the model would likely be unnecessary.

Other cases, however, do demand retraining. For example, your training dataset may have been small enough that the model never encountered certain outliers during training, such as customers over the age of 100. When deployed in production, though, the model might see such customers. In this case, the data drift is problematic, and addressing it is essential. Having a way to assess and detect the different kinds of data drift a model may encounter is therefore critical to getting the best performance out of it.

Concept drift refers to a change in the relationship between a model’s inputs and its target variable. This can happen when changes in market dynamics, customer behavior, or demographics create new relationships between inputs and targets that degrade your model’s predictions. The key to differentiating concept drift from data drift is the targets: data drift applies only to shifts in the input data itself, whereas concept drift means the way those inputs map to the target has changed.
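To make data drift detection concrete, here is a minimal sketch of one common approach: comparing the production distribution of a single feature against a reference sample kept from training time using a two-sample Kolmogorov–Smirnov test. The function name, the choice of the age feature, the 0.05 significance threshold, and the simulated data are illustrative assumptions, not part of any particular monitoring tool.

```python
# A minimal sketch of per-feature data drift detection, assuming a reference
# sample of the feature has been retained from training time. The names,
# threshold, and simulated data below are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp


def detect_feature_drift(train_values, prod_values, alpha=0.05):
    """Flag drift when the production distribution of a feature differs
    significantly from the training distribution (two-sample KS test)."""
    statistic, p_value = ks_2samp(train_values, prod_values)
    return {"statistic": statistic, "p_value": p_value, "drifted": p_value < alpha}


# Example: ages seen during training vs. ages arriving in production after a
# new campaign shifted the customer base toward the 50-60 bracket.
rng = np.random.default_rng(42)
train_ages = rng.normal(loc=35, scale=10, size=5_000).clip(18, 90)
prod_ages = rng.normal(loc=55, scale=5, size=1_000).clip(18, 90)

print(detect_feature_drift(train_ages, prod_ages))
# -> {'statistic': ..., 'p_value': ..., 'drifted': True}
```

Concept drift, by contrast, can only be confirmed once ground-truth targets become available: a common pattern is to track a rolling error metric on production predictions (for example, the error of LTV predictions against realized values) and alert when it degrades well beyond the training-time validation error, even if the input distributions still look stable.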