Thirty-three years after he launched a rudimentary computer forum for a niche subfield of physics, Cornell’s Paul Ginsparg is now helping determine the future of arXiv, the home of 2.5 million (and counting) pre-print research papers.
Paul Ginsparg tends to describe his best-known contribution to science – the scientific pre-print platform arXiv – in metaphors.
“I look at it as the artist who sees the various ways others have retouched and vandalized his work and wishes they had left it unchanged for 30 years,” says a smiling Ginsparg from his office at Cornell University, where arXiv is headquartered.
Ginsparg briefly wonders aloud what Picasso would think of the museum that uncovered a little dog that the artist had deliberately painted over in his oil-on-canvas work Le Moulin de la Galette. Picasso is no longer around to witness what people do to his works. Ginsparg, by contrast, sees his creation daily: as a physicist, he uses arXiv every day – just like practically every physicist on Earth.
The power of arXiv is that submitted papers meeting basic scientific and stylistic requisites are published within days, whereas publishing a paper in a peer-reviewed journal can take months. While many authors have typically viewed arXiv as just one of many stepping stones en route to a peer-reviewed journal publication, papers are increasingly released on arXiv alone.
Ginsparg doesn’t want the Picasso metaphor to come across as overly negative (or to equate his oeuvre with Picasso’s), so he tries out a “glass-half-full, glass-half-empty” metaphor for his complicated paternal relationship with arXiv (pronounced “archive” because the “X” represents the Greek letter chi).
“There’s the glass-half-full picture, which is ‘Oh my God, what incredible foresight to create a website that runs the same software 30 years later.’ The glass-half-empty side is, ‘Boy, does this website need updating.’”
The website needs updating mostly because it has become a victim of its own wholly unexpected success. In 1991, when Ginsparg launched a pre-arXiv prototype at Los Alamos National Laboratory, xxx.lanl.gov, it was intended to support about 100 submissions a year.
“I had moved to the Los Alamos National Laboratory and, for the first time, had my own computer on my desk, a 25 MHz NeXTstation with a 105 MB hard drive and 16 MB of RAM,” Ginsparg wrote in a 2011 Physics World piece marking the site’s 20th anniversary.
His digital repository was never intended to become what it is today – the world’s largest ongoing experiment in open science. Ginsparg’s original code not only predates search engines and chatbots – it predates the World Wide Web entirely.
The internet and arXiv grew up together in many ways; in 1994, Tim Berners-Lee, inventor of the World Wide Web, hosted Ginsparg at his home in France near CERN, where the two discussed the “dawning era of ubiquitous Web servers” and “marveled” at how quickly public perceptions of the Web were changing.
When Ginsparg launched xxx.lanl.gov, he anticipated his creation would host only dozens of papers by members of the niche high-energy particle physics community. But its reputation grew quickly among scientists who saw promise in email and other emerging digital communications. Following the early days at Los Alamos, Ginsparg relocated arXiv to Cornell because it “needed a suitable institutional home to continue its transition from an afternoon software experiment to a longer-term sustainable service.” It worked.
By the middle of 2024, arXiv surpassed 2.5 million submissions from a growing nebula of physics subfields, including chaotic dynamics, disordered systems and neural networks, and combinatorics.
Like the expanding universe that many of its papers seek to explain, arXiv itself is not only expanding, but its expansion is accelerating. This year, more than 21,000 papers were published on arXiv in a single month, setting a new record.
The recent surge in arXiv submissions is a perfect storm: the research community is growing, that community is increasingly adopting the open-access model of arXiv, a “publish or perish” mentality pervades science, and now there’s the new challenge of hungry robots. In the past year, large language models have started scouring arXiv for its surplus of words and paragraphs to learn from, and what they learn is being used to generate (often low-quality) papers faster than was ever humanly possible.
“Fifty percent of the accesses now are robotic,” says Ginsparg. “That is up staggeringly from even 2023.”
This tsunami of science constantly flooding into arXiv is aggregated and vetted (by intelligences both human and artificial), and then published in time to become part of the bottomless scroll of titles and abstracts for physicists everywhere to browse with their morning coffees.
The arXiv team employs a first defensive line of AI to sniff out signs of other artificial intelligence, such as patterns of wording and syntax that indicate the paper was likely generated in part or in whole by a tool like ChatGPT. Papers that pass the AI scrutiny get reviewed by a member of a small team of reviewers that is not growing at pace with the volume of submissions.
And despite the functional changes over the decades, and the recent onslaught of AI crawlers Ginsparg could never have anticipated, arXiv still runs on essentially the same software he wrote three decades ago.
“And let me also make a claim,” Ginsparg says with a smile, “which I assert on the basis of having made no detailed investigation whatsoever, but since I’m open to a challenge, I claim that arXiv is the single largest website that has been up over a 30-year timeframe and has never had a broken link. The interface became scalable – we never needed to change the URL scheme.”
That remarkable durability of Ginsparg’s original code also has a glass-half-empty side: Ginsparg’s intimate familiarity with the code makes it tricky for him to achieve his dream scenario as “just a normal user” of arXiv. For now, he remains part of an understaffed and overwhelmed team at Cornell working valiantly to manage the unprecedented inflow of papers that is partly driven by artificial intelligence and nefarious players trying to game the system. The graphs below demonstrate the drastic growth in both submissions and subfields:
Steinn Sigurdsson, arXiv’s scientific director since 2017, told FirstPrinciples that the cascading demands on arXiv are outpacing the supply of solutions.
“We struggle to keep up,” he said. Asked what his team struggles with most – whether funding, staffing, or sheer volume of work – Sigurdsson replied with potent brevity: “Yes.”
Some statistics from arXiv’s 2023 annual report convey the mounting demands:
3.1 billion total downloads
5 million monthly active users
208,493 new submissions in 2023
17,000 submissions per month
153 categories
21 staff members
Sigurdsson is also prone to metaphors when describing arXiv: “We’re an old classic car, and the rust has finally come through, and the pistons are wearing out,” he told Scientific American.
That classic car is in the midst of a thorough tune-up, however. Matching gifts of $5 million each from the Simons Foundation and the National Science Foundation are fuelling a much-needed migration of arXiv to the cloud, and will allow the team to “modernize our code reliability, fault tolerance and accessibility.” Charles Frankston, arXiv’s technical director, has said arXiv has “reached the limits of the legacy system” and that the move to the cloud will “accommodate growth sustainably.”
Ginsparg is part of this effort to modernize the code he wrote 33 years ago and has nurtured since. Once the migration of arXiv to the cloud is complete, he says, he may finally be able to achieve his dream of being “just a regular user” of arXiv.
Until that day, another metaphor comes to mind to describe his relationship with arXiv. “It’s like this teenager who went to college and came back, and is now crashing on my living room couch.” Ginsparg says he loves arXiv, but it has to “survive without me.”
Photo Credit © John D. and Catherine T. MacArthur Foundation