
Written quickly as part of the Inkhaven Residency.

Related: Bureaucracy as active ingredient, Pain as active ingredient

A widely known secret in academia is that many of the formalities serve, in large part, as proof of work. That is, expensive procedures exist because some way of filtering must exist, and the amount of effort invested is often a good proxy for the quality of the work. The pool of research is vast, and good research can be hard to identify; even engaging with research enough to understand its quality can be expensive. As a result, people look for signs of visible, expensive effort to decide whether to engage with the research at all.

Why do people insist on reading only research that's published in well-formatted, well-written papers, as opposed to random blog posts? Part of the answer is that good writing and formatting make the research easier to digest, and another part is that investing the time to properly write up your results often improves the results themselves. But part of the answer is proof of work: surely, if your research is good, you'd be willing to put in the 30-40 hours to run the required experiments and format them nicely as a paper?

Similarly, why do fields often insist on experiments beyond their scientific value? For example, why does machine learning often insist that people run expensive empirical experiments even for theory papers? Of course, part of the answer is that it's easy to generate theoretical results that have no connection to reality. But another part is that the empirical experiments serve as the required proof of work: implementing anything on even a medium-sized open-source LLM is hard, but surely you'd invest the effort if you believed enough in your idea? (This helps explain the apparently baffling observation that many of the empirical results in theoretical papers have little relevance to the correctness, or even the applicability, of the theoretical results.)

Other aspects of ML academia – the beautifully polished figures[1], the insistence on citing the relevant papers to show knowledge of the field, and so forth – also exist in part to serve as a proof-of-work filter for quality.

In a sense, this is one of the reasons academia is great. In the absence of a proof-of-work system, the default would be something closer to proof of stake: that is, some form of reputational system based on known, previously verified accomplishments. While proof-of-work filters can be wasteful, they nonetheless allow new, unknown researchers to enter the field and contribute (assuming they invest the requisite effort).

An obvious problem with this entire setup is that LLMs exist, and what was once expensive is now cheap. Where good writing was previously expensive, LLMs now let anyone produce seemingly coherent, well-argued English text. Where ML code was once quite expensive to produce, current LLMs quickly generate seemingly correct code for experiments. And the same is true for most of the proof-of-work signifiers academia used to depend on: any frontier LLM can produce beautifully formatted figures in matplotlib, cite relevant work (or at least convincingly hallucinate citations), and produce long mathematical arguments.

I've observed this myself in actual ML conference contexts. In the past, crackpot papers were relatively easy to identify. But in the last year, I've seen at least one crackpot paper get past other peer reviewers through a combination of dense mathematical jargon and an expansive codebase that was hardcoded to produce the desired results. Specifically, while the reviewers knew that they didn't fully understand the mathematical results, they assumed this was due to their own lack of knowledge rather than the results themselves being wrong. And since the codebase passed the cursory review the other reviewers gave it, they did not investigate it deeply enough to notice the hardcoding.[2]
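To make concrete what "hardcoded to produce the desired results" can look like, here is a minimal hypothetical sketch (the function and variable names are invented for illustration; this is not the actual paper's code). An experiment script appears to run a genuine evaluation, but the honest result is computed and then silently discarded in favor of a flattering number:

```python
# Hypothetical illustration of a "hardcoded" experiment script.
# A reviewer skimming run_eval() sees a plausible evaluation loop;
# only a closer read reveals that the model's outputs never affect
# the reported numbers.

import random

def evaluate_model(model, dataset):
    """Looks like a real evaluation, but the return value is fixed."""
    correct = 0
    for example in dataset:
        prediction = model(example["input"])  # computed, then ignored
        if prediction == example["label"]:
            correct += 1
    # The honest accuracy is discarded; a flattering number with
    # plausible-looking noise is reported instead.
    _ = correct / max(len(dataset), 1)
    return 0.93 + random.uniform(-0.01, 0.01)

def run_eval():
    dataset = [{"input": x, "label": x % 2} for x in range(100)]
    trivial_model = lambda x: 0  # always predicts 0, yet "scores" ~93%
    print(f"accuracy: {evaluate_model(trivial_model, dataset):.3f}")

if __name__ == "__main__":
    run_eval()
```

Spotting this requires actually reading evaluate_model rather than trusting the printed output: exactly the kind of scrutiny that a cursory review skips.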
In a sense, this is no different from the problems introduced by AI in other contexts, and I'm not sure there's a better solution than falling back to previous proof-of-stake-like reputation systems.[3] At the very least, I find it hard to engage with new, seemingly exciting results from unknown researchers without a high degree of skepticism. This makes me sad, but I'm not sure there's a real solution here.

[1] Especially the proliferation of beautiful "figure one"s that encapsulate the paper's core ideas and results in a single figure.

[2] In fact, it took me about an hour to decide that the paper's results were simply wrong rather than merely confusing. Thankfully, in this case, the paper's problems were obvious enough that I could point the other reviewers at, e.g., specific hardcoded results (and the paper was not accepted for publication), but there's no guarantee that this will always be the case.

[3] Of course, there are other possibilities that less pessimistic people would no doubt point to: for example, there could be a shift toward proof-of-work setups that are LLM-resistant, or we could rely on LLMs to do the filtering instead. But insofar as LLMs are good at replicating all cognitively shallow human effort, I don't imagine any proof-of-work setup will continue to work as LLMs get better. And I personally feel pretty sad about delegating all of my input to Claude.