Real-world data analysis

I have worked on data analytics for my entire career, spanning seven years at NASA and more than twenty years in the private sector. While computing power has increased dramatically over that time, the central challenge remains the same: identifying and quantifying meaningful relationships in data. With massive computing power, it is especially easy to find apparent patterns and relationships. The challenge is to differentiate real relationships from the artifacts of over-fitting. In many ways, this problem becomes harder as computational resources make it so easy to try many different approaches.
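
As a rough illustration of how easy it is to find apparent patterns, here is a minimal Python sketch (numpy only, with made-up simulation sizes) that searches a couple of hundred purely random candidate predictors for the one that best "explains" an equally random target; with enough candidates, something always looks convincing in-sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_candidates = 100, 200           # illustrative sizes, not from the text

y = rng.standard_normal(n_obs)           # target with no real structure at all
X = rng.standard_normal((n_candidates, n_obs))  # unrelated candidate predictors

# In-sample correlation of each candidate with the target.
corrs = np.array([np.corrcoef(x, y)[0, 1] for x in X])
best = np.argmax(np.abs(corrs))

print(f"best of {n_candidates} useless predictors: |r| = {abs(corrs[best]):.2f}")
# With 200 noise predictors and 100 observations, the best |r| is typically
# around 0.3 -- it looks like a real relationship, but it is pure selection.
```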

Aside from the basic problem of over-fitting, there are related challenges that arise from assuming that the system being analyzed is statistically stationary when it is not. Many systems evolve over time. Is it likely that financial and economic relationships observed in the period from the 1970s to the 1990s will still hold today? That is an incredibly strong assumption, yet many widely used models are built on it.
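
One simple, if crude, way to probe this assumption is to estimate the same relationship over rolling windows and see whether it drifts. The sketch below uses simulated data and an arbitrary window length, purely to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulate two series whose relationship changes halfway through:
# y depends on x early on, then the link disappears entirely.
x = rng.standard_normal(n)
beta = np.where(np.arange(n) < n // 2, 0.8, 0.0)
y = beta * x + rng.standard_normal(n)

# Rolling correlation as a crude stability check (window length is arbitrary).
window = 100
rolling_corr = [np.corrcoef(x[i:i + window], y[i:i + window])[0, 1]
                for i in range(0, n - window, window)]
print(np.round(rolling_corr, 2))
# Early windows show correlations near 0.6, later ones near 0.0 -- a
# full-sample estimate would average the two regimes and describe neither.
```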

A third major problem in data modeling is the set of assumptions made about the underlying statistical processes. Many statistical tests are predicated on the assumption that the randomness or noise in the system is normally distributed (Gaussian) or follows some other specific distribution. For real-world problems, this is a strong assumption. Techniques that do not assume a specific distribution exist and are generally referred to as non-parametric statistics. Parametric methods remain widely used, whether or not they are entirely appropriate.
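
As a small illustration of the difference, the sketch below (with made-up, heavy-tailed simulated data) contrasts a Pearson correlation, which is moment-based and sensitive to extreme observations, with its non-parametric, rank-based Spearman counterpart.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200

# A weak monotonic relationship contaminated by heavy-tailed (Student-t) noise.
x = rng.standard_normal(n)
y = 0.3 * x + stats.t.rvs(df=2, size=n, random_state=rng)

pearson_r, pearson_p = stats.pearsonr(x, y)     # parametric, moment-based
spearman_r, spearman_p = stats.spearmanr(x, y)  # non-parametric, rank-based

print(f"Pearson  r = {pearson_r:+.2f} (p = {pearson_p:.3f})")
print(f"Spearman r = {spearman_r:+.2f} (p = {spearman_p:.3f})")
# With df=2 noise, a handful of extreme observations can dominate the Pearson
# estimate, while the rank-based Spearman statistic is far less affected.
```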

A fourth challenge in data analysis is accounting for serial correlation in data. If, for example, you are building a model to explain X in terms of Y and Z over time, and there is correlation through time in X, Y, or Z, you must be careful to properly account for this temporal structure. It is very easy to find spurious relationships between variables that have structure through time. Simply put, any two variables with trends over time will appear to be correlated with one another. If, for example, sales of electric cars are increasing over time and the incidence of poisonous spider bites is decreasing, the two will appear to have a significant negative correlation unless the trends are accounted for as part of the analysis.
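
This effect is easy to reproduce. The sketch below uses entirely simulated numbers, not real sales or spider-bite data: two independent series that each drift over time show a strongly negative raw correlation anyway, and removing the linear trends makes most of it disappear.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(200)

# Two independent series: one trends up, one trends down. The noise terms
# are unrelated, so there is no genuine link between them.
ev_sales     =  0.5 * t + 5.0 * rng.standard_normal(t.size)
spider_bites = -0.3 * t + 5.0 * rng.standard_normal(t.size)

raw_corr = np.corrcoef(ev_sales, spider_bites)[0, 1]
print(f"correlation of the raw series: {raw_corr:+.2f}")        # strongly negative

# Removing the linear trends (one simple remedy) leaves only the noise,
# and the apparent relationship largely disappears.
detrend = lambda s: s - np.polyval(np.polyfit(t, s, 1), t)
detrended_corr = np.corrcoef(detrend(ev_sales), detrend(spider_bites))[0, 1]
print(f"correlation after detrending:  {detrended_corr:+.2f}")  # near zero
```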

A fifth common issue in data analysis is that one may accidentally create structure in data through the process of analyzing it. A common error, for example, would be to detrend the data on spider bites and electric car sales before correlating these two variables and then to treat the detrended series as artifact-free: the detrended data may, in fact, contain new structure that is an artifact of the trend-removal process itself. Preparing data for analysis is often referred to as pre-processing, and one must be careful that this step does not inject spurious structure.
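
A concrete, if simplified, example: first-differencing is a common de-trending step, but applied to pure white noise it mechanically creates a lag-one autocorrelation of about -0.5. The sketch below, with arbitrary simulation settings, demonstrates the artifact.

```python
import numpy as np

rng = np.random.default_rng(4)
noise = rng.standard_normal(10_000)   # white noise: no autocorrelation at all

diffed = np.diff(noise)               # a common, seemingly harmless de-trending step

def lag1_autocorr(s):
    """Sample autocorrelation at lag 1."""
    return np.corrcoef(s[:-1], s[1:])[0, 1]

print(f"lag-1 autocorrelation, raw series:  {lag1_autocorr(noise):+.2f}")   # ~ 0.00
print(f"lag-1 autocorrelation, differenced: {lag1_autocorr(diffed):+.2f}")  # ~ -0.50
# The -0.5 is structure we created ourselves: an artifact of the
# pre-processing step, not a feature of the underlying data.
```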

There are numerous cases in which mistakes in data analysis and modeling have led to substantial real-world impacts. In finance, this often occurs because analysts make strong assumptions about the world: that financial market returns are normally distributed, that estimates of key variables are more accurate than they really are, or that their data are stationary. In reality, market returns are not normal, and extreme events are substantially more frequent than normal models would suggest. The failure of normality might itself be due to non-stationarity, however: a non-stationary normal process can generate outcomes that are far from normal if you do not account for the non-stationarity. Similarly, a variable driven by a normal noise process can have a distinctly non-normal distribution if the scale of that noise is correlated through time, as with volatility clustering.
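
To make that point concrete, here is a small simulation sketch with made-up volatility regimes that are not calibrated to any real market: every individual draw is Gaussian, but because the volatility shifts between regimes, the pooled sample has fat tails and decisively fails a normality test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# A "normal" process whose volatility switches between calm and turbulent
# regimes -- each draw is Gaussian, but the process is not stationary.
calm      = rng.normal(0.0, 1.0, size=5_000)
turbulent = rng.normal(0.0, 3.0, size=5_000)
returns = np.concatenate([calm, turbulent])

print(f"excess kurtosis: {stats.kurtosis(returns):.2f}")      # > 0, i.e. fat tails
print(f"normality test p-value: {stats.normaltest(returns).pvalue:.2e}")
# Pooled over both regimes, the distribution has far more extreme observations
# than a single normal with the same overall variance would produce, even
# though nothing non-Gaussian was ever simulated.
```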

There are enormous opportunities in data science, as the amounts of data and available computational power are both growing incredibly fast. Automation has its limits, however. Successful models still rely on analysts to make educated decisions about how to model data and how to use model results to inform decisions. In finance, my field of emphasis, better models have enormous economic value. Poorly specified or incorrectly applied models can also produce awful outcomes. It is tempting to think that the risk of catastrophic outcomes driven by over-confidence in models (such as the LTCM implosion) is declining, but my hunch is that we will see plenty more. Bigger computers and bigger data sets make it easier to find more sophisticated patterns, but they also make it much easier to over-fit.