Big Data requires special attention to the computational aspects of modeling. With lots of data, the options for a researcher to explore are many; however, naively exploring each model can prove computationally intractable. Considerations for model selection is the topic of today’s post.
Alekh Agarwal, UC Berkeley, presented his work on “Computation meets Statistics: Trade-offs and fundamental limits for large data sets” at Stanford’s Statistics seminar this afternoon.
There were several interesting ideas that I came away with from the talk:
- Considering computational costs for M-estimators can be thought of in terms of the order of the number of search iterations required to achieve a particular level of precision.
- It may be possible to construct a computational algorithm which may have a larger minimization error than the theoretical best, but its error may be of the same computational order as that best. In other words, a slight compromise between computational cost and bias/variance in estimation can be fruitful (i.e., O(B-B^) = O(B-B*) for some computationally simpler estimator B* of B).
- Model selection can be very computationally difficult in high-dimensional models–one of the strengths of Big Data. Tradeoffs should be made regarding the number of samples, computational complexity, and communication costs (esp. for distributed computing).
- A regularized objective function or otherwise constrained estimation framework can be applied to each of these tradeoffs to obtain a solution to the “budget constrained” model selection problem. This constrained problem can potentially have more favorable computational complexity than a brute force selection method.
In short, Big Data applications need to take these tradeoffs seriously. High-dimensional model selection is powerful, but constructing an algorithm which gives a good enough result today and can keep working on better results for tomorrow (with little-to-no intervention) is my ideal.
While I couldn’t find a copy of a paper (I believe it is still a work in progress), the abstract to the talk can be found below: Continue reading