## On correlation and causation

- October 18th, 2009

It strikes me how some topics come back in cycles. I remember an interesting conversation on correlation/causation and predictive models back in 2005. Yes, I’m an old man and quite silent, but I’m working on it.

Jonathan Lewis’ post on correlation set off a flood of reminiscences. This is why I often find Jonathan’s posts so stimulating: they are not only very informative but also food for thought, and they suggest new things to explore.

**Correlation implies cause**

For a lot of people, this is intuitively right. When two measurements are both tracked against a third variable, say time, and they change over time in a very similar fashion, it looks as if they share a common cause.

Constant rates of change in both variables indicate linearity. Okay, let me nit-pick and make some definitions:

The linear correlation coefficient *r* (aka Pearson’s coefficient) measures the strength and direction of a linear relationship (or association) between two variables.

When *r* is close to 1, there is a strong positive linear relationship between the two variables. That means that when values of x go up, values of y also go up.

On the other hand, when *r* is close to -1, there is also a strong linear relationship, but now when x goes up, y goes down.

Finally, when *r* is close to 0, there is little or no linear relationship, although a non-linear relationship between the variables may still exist.
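To make the three cases concrete, here is a minimal sketch in plain Python. The `pearson_r` helper and the sample data are made up for illustration. The last series is a parabola: y is completely determined by x, yet *r* comes out at zero because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Illustrative data only: x is symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
rising = [2 * x + 1 for x in xs]       # y goes up when x goes up
falling = [-0.5 * x + 4 for x in xs]   # y goes down when x goes up
parabola = [x * x for x in xs]         # y depends on x, but not linearly

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_parabola = pearson_r(xs, parabola)

print(f"rising:   r = {r_rising:+.3f}")
print(f"falling:  r = {r_falling:+.3f}")
print(f"parabola: r = {r_parabola:+.3f}")
```

Recent Python versions also ship `statistics.correlation`, which should give the same numbers; the hand-written version is only there to show what is being computed.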

How “strong” a correlation is depends on the kind of data. This is why data from the physical sciences generally need a higher correlation coefficient (typically above 0.8) to be called “strong” than medical, social or psychological data. It is well accepted today that the interpretation of a correlation coefficient depends on the context of the data.

The coefficient of determination R^2 gives the proportion of the variance of one variable that is predictable from the other. For a simple linear fit it equals *r* squared, and it gives an idea of how much trust we can put in a prediction made from a given model.
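As a rough illustration, the sketch below fits a least-squares line to a handful of invented points and computes R^2 as one minus the ratio of residual variance to total variance. The helper names and the data are assumptions made for the example, not anything taken from a real system. For a straight-line fit like this one, the result should match the square of Pearson's *r*.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def pearson_r(xs, ys):
    """Pearson's r, used here only to cross-check R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented measurements that lie close to, but not exactly on, a line.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r2 = r_squared(xs, ys)
r = pearson_r(xs, ys)
print(f"R^2 = {r2:.4f}, r^2 = {r * r:.4f}")
```

The equality between R^2 and *r* squared only holds for a simple linear fit with an intercept; with other models the two numbers can differ.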

So, back to the original discussion: the sentence “correlation does not mean causation” doesn’t mean that correlation can’t point to potential causal relations. It just says that a strong correlation is not sufficient to establish a causal relationship. Period.
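A small sketch of why a strong correlation alone is not enough: the two series below are invented, and neither is computed from the other. Both simply grow with time, plus an unrelated wiggle each. Their raw correlation is very high, and it shrinks once the shared time trend is removed. The `detrend` helper is a hypothetical name for the example, and removing a straight line is only one simple way to take out a common trend.

```python
import math

def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def detrend(ts, ys):
    """Residuals of ys after subtracting a least-squares straight line in ts."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
             / sum((t - mean_t) ** 2 for t in ts))
    return [(y - mean_y) - slope * (t - mean_t) for t, y in zip(ts, ys)]

# Two made-up series: each depends on time only, not on the other.
ts = list(range(30))
series_a = [2.0 * t + 3.0 * math.sin(t) for t in ts]
series_b = [5.0 * t + 4.0 * math.cos(1.7 * t) for t in ts]

r_raw = pearson_r(series_a, series_b)
r_detrended = pearson_r(detrend(ts, series_a), detrend(ts, series_b))

print(f"raw r       = {r_raw:+.3f}")
print(f"detrended r = {r_detrended:+.3f}")
```

Here time plays the role of the hidden third variable: the high raw *r* reflects the shared trend, not any effect of one series on the other.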

Haven’t you seen Dr House? The whiteboard? Yes, it’s fiction, but illustrative.

So far, I have found it useful to correlate some wait events from statspack measurements with other non-db measurements. In one particular case, many years ago, it helped me find the root cause of a really odd performance issue. The problem turned out to be related to the NFS client configuration on a Solaris server using NAS storage and Oracle 8.0.6.

Personally, I would be very careful about building predictive models for anything in Oracle. One thing I’ve learned over all these Oracle versions is that one size doesn’t fit all, and of course, the more I know, probably the more I miss. The only steady ground I have is the scientific method: hypothesis, test, prediction.

That means working with test cases, with representative test data, and with a well-known baseline state.

As Ian Anderson sings, “life is a song”…