Data Mining with the Maximal Information Coefficient

by Ben Lorica

UPDATE (2/21/2013): In light of a recent paper by Simon and Tibshirani, I'm recommending the distance correlation instead of the MIC. Simon and Tibshirani examined the Maximal Information Coefficient and concluded that:

... the recently proposed distance correlation measure of Szekely & Rizzo (2009) is a more powerful technique that is simple, easy to compute and should be considered for general use.

In response, the creators of MIC published the following rejoinder:

While a method with better power is always preferable if all other things are equal, the desiderata of a statistic depend on the problem it is being used to solve, and distance correlation's lack of equitability makes it ill-suited for the data exploration setting posed in Reshef et al. [2011]. MIC is a more appropriate measure of dependence in a situation in which there are likely to be an overwhelming number of significant relationships in a data set, and there is a need to automatically find the strongest ones.

Finally, Kinney and Atwal propose using Mutual Information instead:

... it was claimed that MIC possesses a desirable mathematical property called equitability that mutual information lacks. This was not proven; instead it was argued solely through the analysis of simulated data. Here we show that this claim, in fact, is incorrect.

... We are aware of two other critiques of the work by Reshef et al., one by Simon and Tibshirani [28] and one by Gorfine et al. [29]. These do not address the issues of equitability discussed above, but rather focus on the statistical power of MIC. Through the analysis of simulated data, both critiques found MIC to be less powerful than a recently developed statistic called distance correlation (dCor) [30]. Gorfine et al. [29] also recommend a different statistic, HHG [31]. Like mutual information, both dCor and HHG can detect arbitrary relationships between vector-valued variables. Moreover, both dCor and HHG are plug-in statistics that can be easily computed directly from data, i.e. they do not require an approximate estimation procedure like mutual information and MIC do. However, neither dCor[x; y] nor HHG[x; y] is invariant under invertible transformations of x and y. As a result, both statistics violate self-equitability as well as DPI. We therefore suggest that dCor or HHG may be useful in situations when estimating mutual information is either impractical, due to computational cost or under-sampling, or does not provide sufficient statistical power.
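For readers who want to experiment with the distance correlation recommended above, here is a minimal NumPy sketch of the Szekely & Rizzo statistic. The function name and the toy data are my own, and packaged implementations also exist (e.g. the dcor package on PyPI); this is only meant to show the mechanics.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation (Szekely & Rizzo) for two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Pairwise distance matrices.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double-center each distance matrix.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()

    # Distance covariance / variances, then the correlation.
    dcov2 = (A * B).mean()
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(distance_correlation(x, x**2))                   # strong nonlinear dependence -> well above 0
print(distance_correlation(x, rng.normal(size=500)))   # independent samples -> near 0
```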

Original post: (last updated Dec/2011)

The recently published paper by a team from Harvard and M.I.T. contains a metric that may soon become a standard technique in every data miner's toolkit. Loosely described as a "measure of dependence for two-variable relationships", the Maximal Information Coefficient (MIC) can help an analyst faced with hundreds of variables during exploratory data analysis. Like the familiar correlation coefficient, the MIC has the advantage of being easy to interpret. A common task in data modeling and analysis is to find good explanatory variables for a given target variable. The MIC uncovers variables that have (possibly nonlinear) functional relationships with a target variable, as well as variables that are statistically independent of it. Other techniques provide partial results along these lines, but based on my understanding of the MIC, it may become an efficient starting point for data mining and predictive modeling.
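As a rough sketch of what that kind of screening might look like in practice, the snippet below uses the open-source minepy package to score a handful of candidate predictors against a target. The variable names and simulated data are placeholders of my own, not the data discussed later in this post.

```python
import numpy as np
from minepy import MINE  # pip install minepy

rng = np.random.default_rng(42)
n = 1000
target = rng.normal(size=n)

# Toy candidate predictors: one linear, one nonlinear, one pure noise.
candidates = {
    "linear":    2.0 * target + rng.normal(scale=0.5, size=n),
    "nonlinear": np.cos(3 * target) + rng.normal(scale=0.1, size=n),
    "noise":     rng.normal(size=n),
}

mine = MINE(alpha=0.6, c=15)  # minepy's default parameters
for name, x in candidates.items():
    mine.compute_score(x, target)
    print(f"{name:>9}: MIC = {mine.mic():.2f}")
```

The linear and nonlinear predictors should both score close to 1, while the noise column should score close to 0, which is exactly the sorting behavior an analyst wants when triaging hundreds of candidate variables.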

Rather than defining (1) the MIC, let me enumerate some of its key properties. Actually, these properties (which the researchers prove and numerically demonstrate in their paper) are what have me excited about the MIC:

1. 0 ≤ MIC ≤ 1

  With probability approaching 1 (as the sample size grows):

2. "MIC assigns scores that tend to 1, to all never constant noiseless functional relationships"

3. "MIC assigns scores that tend to 1, for a larger class of noiseless relationships (including super-positions of noiseless functional relationships)"

4. "MIC assigns scores that tend to 0 to statistically independent variables"

  Furthermore, for a pair of random variables X and Y:

5. "if Y is a function of X that is not constant on any open interval, then data drawn from (X,Y) will receive an MIC tending to 1 with probability 1 as sample size grows"

6. "IF the support of (X,Y) is described by a finite union of differentiable curves of the form c(t) = [x(t),y(t)] for t in [0,1]

  THEN data drawn from (X,Y) will receive an MIC tending to 1 with probability one as sample size grows, provided that (dx/dt) and (dy/dt) are each zero on finitely many points"

7. "the MIC of data drawn from (X,Y ) converges to zero in probability as sample size grows if and only if X and Y are statistically independent."

8. the MIC of a noisy functional relationship is bounded from below by a function of its R²

9. NON-LINEARITY: "a natural measure of nonlinearity is MIC - r², where r denotes the Pearson product-moment correlation coefficient, a measure of linear dependence. The statistic MIC - r² is near 0 for linear relationships and large for nonlinear relationships with high values of MIC."
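To make property 9 concrete, here is a small sketch (again assuming the minepy package; the simulated relationships are my own) that computes MIC - r² for a linear and a nonlinear relationship:

```python
import numpy as np
from minepy import MINE

def mic_minus_r2(x, y):
    """Property 9: MIC - r^2, near 0 for linear and large for nonlinear relationships."""
    mine = MINE(alpha=0.6, c=15)
    mine.compute_score(x, y)
    r = np.corrcoef(x, y)[0, 1]   # Pearson product-moment correlation
    return mine.mic() - r**2

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=1000)
print(mic_minus_r2(x, 3 * x + rng.normal(scale=0.1, size=1000)))  # near 0: linear
print(mic_minus_r2(x, np.sin(6 * x)))                             # large: nonlinear, high MIC
```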

The authors conduct a host of numerical experiments to demonstrate the properties above, and along the way they show that the MIC compares favorably with several other measures of association (including Pearson/Spearman correlations and Mutual Information).
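For a rough flavor of that comparison on a toy simulated relationship of my own (not the paper's benchmark suite), the sketch below assumes the scipy, scikit-learn, and minepy packages:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression
from minepy import MINE

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, size=2000)
y = np.cos(4 * np.pi * x) + rng.normal(scale=0.2, size=2000)  # noisy periodic relationship

mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)

print("Pearson r :", pearsonr(x, y)[0])    # near 0: misses the non-monotonic relationship
print("Spearman  :", spearmanr(x, y)[0])   # near 0: the relationship is not monotonic either
print("Mutual inf:", mutual_info_regression(x.reshape(-1, 1), y)[0])  # positive, but in nats and harder to compare across pairs
print("MIC       :", mine.mic())           # close to 1: functional relationship detected
```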

An Example from Finance

After reading the paper, I wanted to see the MIC applied to an actual data set. Luckily, the authors have a web site and sample code in Java/R. It was easy to run the Java code against my data set, consisting of 15 Hedge Fund Indices. I wanted to see whether MIC would detect associations between the different Hedge Fund trading styles. The data set consists of 15 time series -- monthly rates of return -- over three different time periods: 1994-2011, 2000-2005, and 2006-2011.

Why examine 3 different time periods? It has been noted that financial assets have become more correlated since the financial crisis of 2007/2008 (2). I wanted to use the MIC to quickly see if associations (3) are also increasing between the different Hedge Fund trading styles. In the charts below, I constructed a heat map of the MIC matrix for each of the three time periods (a darker shade of green means a higher MIC).
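The charts in this post were produced with the authors' Java code. For readers who want to build something similar in Python, here is a rough sketch assuming the minepy and matplotlib packages; the CSV file name and column layout are placeholders, not my actual data file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from minepy import MINE

# Placeholder: monthly returns, one column per hedge fund index.
returns = pd.read_csv("hedge_fund_monthly_returns.csv", index_col=0)

cols = returns.columns
mic = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
mine = MINE(alpha=0.6, c=15)

# Pairwise MIC matrix (symmetric, so compute the upper triangle only).
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        mine.compute_score(returns[a].values, returns[b].values)
        mic.loc[a, b] = mic.loc[b, a] = mine.mic()

# Heat map: darker green means a higher MIC.
plt.imshow(mic.values, cmap="Greens", vmin=0, vmax=1)
plt.xticks(range(len(cols)), cols, rotation=90)
plt.yticks(range(len(cols)), cols)
plt.colorbar(label="MIC")
plt.tight_layout()
plt.show()
```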

Comparing the chart for 2006-2011 with that for 2000-2005, one can quickly see a slight increase in MIC for a family of trading styles. Broadly speaking, the rise in MIC can be seen among funds that follow Event-Driven & Relative Value trading strategies. In particular, for the 2006-2011 period the MIC rose among Distressed, Multi-strategy, Event-driven, Long/Short Equity, and Convertible Arbitrage funds.

[Heat maps: DJ/Credit Suisse Hedge Fund Indices -- Maximal Information Coefficient, one chart per time period]

Summary

The Maximal Information Coefficient is a tool that I plan to use more often in the future. It provides a quick way to evaluate (non-linear) associations among many variables. In particular, in the course of building predictive models, I can see using it to evaluate potential predictors. When it comes to capturing non-linear relationships, the MIC is superior to the correlation coefficient and Mutual Information, while providing comparable results when faced with linear associations.
 
Related resources:
  • Hedge Fund profits flow mostly to Industry Insiders

  • Hedge Fund Performance: My semi-regular blog posts on the Dow Jones / Credit Suisse Hedge Fund indices.



  • (1) The formal definition and proofs can be found in the original paper: Detecting Novel Associations in Large Data Sets by Reshef et al. Also see this feature article and accompanying video from the M.I.T. press office.

    (2) Here I'm most interested in co-movements between different asset classes. It has also been the case that within equity markets, "correlations between stocks are currently at the highest level in recent history".

    (3) Per properties 1-9 listed above, an MIC close to 1 suggests that a functional relationship exists between two time series.
     
     



    NOTE: Reproduction & reuse allowed under Creative Commons Attribution.