Big Data and Spurious Correlations

Big data analytics is a remarkable new field of investigation. Its effectiveness, however, seems to encourage an aggressive “philosophy” or “methodology” based on the dictum that “with enough data, the numbers speak for themselves”. We show, using Ramsey theory and algorithmic information theory, that this view is radically wrong. Specifically, we prove that very large databases must contain arbitrary correlations, most of them spurious, precisely because of their size. These correlations arise only from the size of the data, not from its nature. The scientific method can be enriched by computer mining of immense databases, but cannot be replaced by it.
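
As a toy illustration of how a Ramsey-type argument forces structure from size alone, the sketch below checks the classical fact that the Ramsey number R(3,3) equals 6. This is not the paper's proof, only a minimal instance of the same kind of reasoning. Reading the six points as records and the two colours as "related" or "unrelated" is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import combinations, product


def has_monochromatic_triangle(n, coloring):
    """Return True if some triple of the n points has all three edges the same colour.

    `coloring` maps each edge (i, j) with i < j to a colour, 0 or 1.
    """
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )


def every_coloring_has_triangle(n):
    """Exhaustively test all 2-colourings of the edges of the complete graph on n points."""
    edges = list(combinations(range(n), 2))
    return all(
        has_monochromatic_triangle(n, dict(zip(edges, colors)))
        for colors in product((0, 1), repeat=len(edges))
    )


if __name__ == "__main__":
    # With 5 points a pattern-free colouring still exists; with 6 points none does.
    print(every_coloring_has_triangle(5))
    print(every_coloring_has_triangle(6))
```

With five points, some colouring avoids a single-coloured triangle, so the check fails; with six points every colouring, however it is chosen, contains one. The pattern is guaranteed by the number of points alone and carries no information about how the colours were assigned.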

