Big Data requires special attention to the computational aspects of modeling. With lots of data, the options for a researcher to explore are many; however, naively exploring each model can prove computationally intractable. Considerations for model selection is the topic of today’s post.
Alekh Agarwal, UC Berkeley, presented his work on “Computation meets Statistics: Trade-offs and fundamental limits for large data sets” at Stanford’s Statistics seminar this afternoon.
There were several interesting ideas that I came away with from the talk:
- Considering computational costs for M-estimators can be thought of in terms of the order of the number of search iterations required to achieve a particular level of precision.
- It may be possible to construct a computational algorithm which may have a larger minimization error than the theoretical best, but its error may be of the same computational order as that best. In other words, a slight compromise between computational cost and bias/variance in estimation can be fruitful (i.e., O(B-B^) = O(B-B*) for some computationally simpler estimator B* of B).
- Model selection can be very computationally difficult in high-dimensional models–one of the strengths of Big Data. Tradeoffs should be made regarding the number of samples, computational complexity, and communication costs (esp. for distributed computing).
- A regularized objective function or otherwise constrained estimation framework can be applied to each of these tradeoffs to obtain a solution to the “budget constrained” model selection problem. This constrained problem can potentially have more favorable computational complexity than a brute force selection method.
In short, Big Data applications need to take these tradeoffs seriously. High-dimensional model selection is powerful, but constructing an algorithm which gives a good enough result today and can keep working on better results for tomorrow (with little-to-no intervention) is my ideal.
While I couldn’t find a copy of a paper (I believe it is still a work in progress), the abstract to the talk can be found below:
The past decade has seen the emergence of datasets of an unprecedented scale, with both large sample sizes and dimensionality. Massive data sets arise in various domains, among them computer vision, natural language processing, computational biology, social networks analysis and recommendation systems, to name a few. In many such problems, the bottleneck is not just the number of data samples, but also the computational resources available to process the data. Thus, a fundamental goal in these problems is to characterize how estimation error behaves as a function of the sample size, number of parameters, and the computational budget available.
In this talk, I present three research threads that provide complementary lines of attack on this broader research agenda: (i) lower bounds for statistical estimation with computational constraints; (ii) interplay between statistical and computational complexities in structured high-dimensional estimation; and (iii) a computationally budgeted framework for model selection. The rest characterizes fundamental limits in a uniform sense over all
methods, whereas the latter two provide explicit algorithms that exploit the interaction of computational and statistical considerations.