Why I Still Publish Academic Papers

Many people have asked me, “If you’re working in industry, why do you still write academic papers?” In hopes of providing a more complete answer to all who might be interested (professors, corporate managers, job seekers, recruiters, colleagues, economists, data scientists), I have prepared this list of reasons why I, as an industry research scientist, prioritize writing and publishing academic papers:

  1. Professional credibility & status: By publishing, I can demonstrate to myself and to my managers (who may or may not be experts at evaluating my talents) that I am on the cutting edge of my niche of the field (theory, empirics, practice, etc.). Publishing also provides an up-to-date public portfolio that demonstrates the advancement of my talents to prospective employers or clients.
  2. Technical communication: By writing up my technical work formally, I force myself to solidify & distill my understanding of the material so that I can mentally move on to the next step of research or the next project. I also become much better at communicating the work later to others in business and academia, who have in turn evangelized it more broadly; this has saved me hours of repeating myself, expanded my impact, and opened up new opportunities. Precise & concise communication is key.
  3. Intellectual community: I really like associating with top-notch talent to talk shop and about life in general. These are the people I want to work with for the rest of my career because they are motivated by intellectual truth & science more than by salesmanship or money. That said, I work in industry precisely because I want my research to have a direct impact by helping companies be more efficient and successful. To me, this balance is the best of both worlds. A side benefit of coauthoring academic papers with external researchers is “free R&D” for the company, whether in the form of improved communication, informed brainstorming, or actual data work (if permitted).
  4. Social preferences: I have benefitted greatly from the contributions of others, and I like giving back to the community. I learn a lot from referees and by reviewing others’ papers and attending conferences and discussing ideas.
  5. Portable research: Work that I publish is publicly available and, hence, portable across companies. I want to continue to push forward my research and professional development regardless of which company I work for. And I want to be able to build positively and publicly upon prior work.
  6. Option value: If I decide to return to academia, I would probably teach at either a business school or a second tier liberal arts school. Having a solid publication record would help my prospects. However, a former academic + industry manager of mine once told me, “Randall, after 10 years, academia won’t want you, and you won’t want academia.” And, I suspect, he is right. If I were to return, I would likely take a significant pay cut, and at least for now, I am loving the type of industrial research I get to do at tech companies.
  7. NOT for corporate promotion or reward: At least in my current position, nobody will directly reward or encourage me to publish. However, I do have a level of status and expertise that comes from my professional and research record, and it is not clear how much of that status I would have without the publication record, even if I knew all the same technical material. Honestly, having the publications directly on my resume probably does not matter much, but I would not be as good with the technical material without pushing myself to achieve precise academic communication; having a community that holds me to a high standard helps me grow.

Publishing academic papers is a great input into my work, but I am still selective about what I publish. I recognize that I cannot publish everything (e.g., business-sensitive work), nor would I want to, given the opportunity cost of formalizing communication. I currently focus on publication outlets that keep the returns to my paper-writing efforts highest: reaching the right audiences for my work, reducing the overhead of publication, and focusing on papers that I genuinely think are important for all of the reasons above.

Finally, writing and publishing a paper represents an investment of roughly three months of total work beyond the basic R&D: perhaps one month to write up the paper and a couple more to work through the referee process. That is a lot of valuable time, so I now aim to publish only about one paper a year, focusing on the most important ideas. I have considered writing blog posts as well, but that is hard to justify given that blog posts would require the same internal corporate release process as formal papers, which the academic community recognizes far more readily. So I have focused mostly on papers.

Whose job is it: the hardware’s, software’s, or programmer’s?

Perhaps the biggest question in parallel computing for Big Data is, “Who’s responsible for the logical work to harness parallel architectures: the hardware, the compiler, or the programmer?”

Today I found an interesting lecture on MIT’s OpenCourseWare, “L3: Introduction to Parallel Architectures,” given by Saman Amarasinghe as part of the “Multicore Programming Primer” IAP course. The lecture surveys high-level parallel architectures from the past 50 years and lends insight into this question.

The question comes down to the distinction between implicit and explicit parallelism: whether the parallel structure of a computation is discovered automatically by the hardware or compiler, or expressed directly by the programmer.
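
To make the contrast concrete, here is a minimal sketch in Python. It is my own illustration, not taken from the lecture, and the array size, worker count, and choice of NumPy plus multiprocessing are assumptions made purely for illustration; the point is who owns the parallel decomposition.

```python
import numpy as np
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    # "Map" step: each worker handles only its own slice of the data.
    return float(np.sum(chunk ** 2))

if __name__ == "__main__":
    data = np.random.rand(1_000_000)

    # Implicit parallelism: write one vectorized expression and leave any
    # parallel execution to the library, compiler, or hardware (SIMD units,
    # a multithreaded math library, etc.); the code does not say how.
    implicit_result = float(np.sum(data ** 2))

    # Explicit parallelism: the programmer chooses the decomposition and the
    # recombination; the parallel structure is written into the code itself.
    chunks = np.array_split(data, 4)
    with Pool(processes=4) as pool:
        explicit_result = sum(pool.map(partial_sum_of_squares, chunks))

    print(implicit_result, explicit_result)  # same answer, different owners
```

In the implicit version the programmer states only what to compute; in the explicit version the burden of harnessing the parallel hardware shifts squarely onto the programmer.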


Budget-Constrained Model Selection: Trading off Statistical and Computational Complexities

Big Data requires special attention to the computational aspects of modeling. With lots of data, the options for a researcher to explore are many; however, naively exploring every candidate model can prove computationally intractable. Considerations for model selection are the topic of today’s post.

Alekh Agarwal, UC Berkeley, presented his work on “Computation meets Statistics: Trade-offs and fundamental limits for large data sets” at Stanford’s Statistics seminar this afternoon.

I came away from the talk with several interesting ideas:

  • The computational cost of an M-estimator can be framed in terms of the number of search iterations required to reach a given level of optimization precision (a toy sketch after this list illustrates the idea).
  • It may be possible to construct an algorithm whose minimization error is larger than that of the theoretically best procedure yet of the same order. In other words, a slight compromise between computational cost and bias/variance in estimation can be fruitful: for a computationally simpler estimator B* of B, the estimation error of B* can be of the same order as that of the theoretically best estimator B^, i.e., O(B − B*) = O(B − B^).
  • Model selection can be very computationally difficult in high-dimensional models, one of the strengths of Big Data. Trade-offs should be made regarding the number of samples, computational complexity, and communication costs (especially for distributed computing).
  • A regularized objective function or otherwise constrained estimation framework can be applied to each of these trade-offs to obtain a solution to the “budget-constrained” model selection problem. This constrained problem can potentially have more favorable computational complexity than a brute-force selection method.
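
The talk itself did not come with code, so here is a toy sketch of the first bullet under my own assumptions (synthetic data, a ridge penalty, and a fixed step size, none of which come from the talk): an M-estimator fit by gradient descent in which the computational budget is simply the number of iterations allowed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 1000, 20, 0.1          # sample size, dimension, ridge penalty (all made up)
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def ridge_gradient(beta):
    # Gradient of the M-estimation objective (1/n)*||y - X@beta||^2 + lam*||beta||^2
    return (-2.0 / n) * X.T @ (y - X @ beta) + 2.0 * lam * beta

def fit_with_budget(num_iters, step=0.05):
    beta = np.zeros(p)
    for _ in range(num_iters):      # the iteration count is the computational budget
        beta -= step * ridge_gradient(beta)
    return beta

for budget in (5, 50, 500):
    beta_hat = fit_with_budget(budget)
    err = np.linalg.norm(beta_hat - beta_true)
    print(f"budget={budget:4d} iterations,  ||beta_hat - beta_true|| = {err:.3f}")
```

Past some point, extra iterations keep shrinking the optimization error but barely move the error against beta_true, which is the spirit I took away from the talk: match the computational budget to the statistical precision the data can actually deliver.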

In short, Big Data applications need to take these tradeoffs seriously. High-dimensional model selection is powerful, but constructing an algorithm which gives a good enough result today and can keep working on better results for tomorrow (with little-to-no intervention) is my ideal.

While I couldn’t find a copy of a paper (I believe it is still a work in progress), the abstract for the talk can be found below:

Welcome to Econinformatics: Home of Economics & Big Data

You must be saying, “Econinformatics? What’s that?” Econinformatics is “the application of computer science and information technology to the field” of economics (Wikipedia: Bioinformatics), particularly as it applies to the economic analysis of Big Data:

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. (Wikipedia: Big Data)

Note that there are orders of magnitude of difference between “capture, manage, and process” and “analyze.”

This blog will provide a discussion platform for “Big Data” topics relevant to economists.

A recent article in Forbes titled “Big Data — Big Money Says It Is A Paradigm Buster” highlights the economic trajectory for Big Data in industry, and invariably, for applied economic research.

The fundamental issues associated with Big Data are succinctly described by Anand Rajaraman, senior vice president at Walmart Global e-commerce, co-founder of @WalmartLabs, and a professor at Stanford.

“The tools [for Big Data] are very different. Many of the fundamental algorithms for predictive analytics depend crucially on keeping the data in main memory with a single CPU to access it. Big Data breaks that condition. The data can’t all be in memory at the same time, so it needs to be processed in a distributed fashion. That requires a new programming model.”

This can be hard for traditional data users to understand. He watches students attack Big Data problems by creating a sample, but that defeats the value of Big Data, with all of its potentially informative outliers.
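
To make the point concrete, here is a minimal sketch of the chunked, map-reduce-flavored style Rajaraman is describing. The file name, column name, and chunk size are hypothetical placeholders, and pandas is just one convenient way to stream a file that will not fit in memory.

```python
import pandas as pd

path = "transactions.csv"            # hypothetical file too large to hold in memory
total, count = 0.0, 0
running_min, running_max = float("inf"), float("-inf")

# Stream the file in pieces instead of loading it all at once.
for chunk in pd.read_csv(path, chunksize=1_000_000):
    col = chunk["purchase_amount"]               # hypothetical column of interest
    total += col.sum()                           # per-chunk "map" step
    count += col.count()
    running_min = min(running_min, col.min())    # outliers are kept,
    running_max = max(running_max, col.max())    # not sampled away

mean = total / count                             # final "reduce" step
print(mean, running_min, running_max)
```

Nothing is sampled away: every row, including the outliers, passes through the per-chunk step before the running totals are reduced to the final answer.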

The challenges are not faced only by “students”; the research and analytic problems posed by Big Data confront industry and academic researchers in many fields.

Researchers must undergo a paradigm shift in how they attack research with Big Data. I don’t pretend to have all of the answers, but by sharing our knowledge and experiences, we can shorten the learning curve and all do better work.