Preserving data wealth in the midst of the great LLM arms race

By Ziv Reichert

10 May 2023


Stung by privacy concerns, OpenAI has introduced a ChatGPT feature it likens to ‘incognito’ mode, giving users the option to disable their chat history. Doing so means conversational data is no longer fed back to OpenAI for model-training purposes.

While a step in the right direction, the tweak does not address the larger potential data land grab that has likely been taking place behind closed doors over the past few months: one involving government bodies and large-scale enterprises.

Microsoft (through OpenAI), Google (with Bard), Baidu, Yandex, and others are racing to build the largest, most ‘all-encompassing’ foundational models, which they hope will power many of the world’s applications. To win this race, they are now competing to consume every dataset imaginable, both open and closed, on which to train their models. It’s fair to say that policymakers are struggling to keep up.

Data held and maintained by governments and large enterprises would be like gold to them. To name just a handful of examples from the public side in the UK, we’d be talking about private information held in NHS records, census details, longitudinal education studies, our transport habits, and data from the ONS and the Met Office.

If they have not been doing so already, the leading LLM players will soon be talking to governments around the world in an effort to gain access to these types of datasets via partnerships. Deals may initially appear enticing to public officials who are constantly under pressure to cut costs. However, they may end up looking like obvious mistakes in the mid-to-long term, given that they will very likely undervalue the true worth of the data at hand. To prevent this from happening, it is crucial that our public officials in the UK, as well as those around the world, understand the strategic and financial value of the datasets they oversee.

These are, after all, key to our nation’s future prosperity and should be considered critical national resources. Just as Norway built its immense sovereign wealth fund from the value realised from its North Sea oil, we need to understand that this data could be of significant value to the public purse and should not be given away lightly. Governments must ensure they benefit financially if and when public data is used to train LLMs.

What’s so important about domain-specific datasets?

LLMs are only as good as the data they are trained on. The most prominent models — those behind ChatGPT, Bard et al. — rely almost exclusively on data sourced from the open web, with all its misinformation and bias. This is why LLMs often suffer from hallucinations: confident ‘best guesses’ or assumptions made to look like facts.

If you are using ChatGPT to write a shopping list in the style of a Shakespearean sonnet, no problem. However, the same technology could also be used to determine a traffic fine or a treatment pathway for a medical condition — matters of far greater consequence. To dominate the LLM arms race, companies will need to consistently offer factual answers to their users’ queries, no matter how specific or broad.

The kinds of domain-specific datasets that governments worldwide have amassed over decades are therefore uniquely valuable for training LLMs because of their accuracy. They can help LLMs grasp the nuances and terminology unique to each domain, which in turn can transform them into powerful tools for building domain-specific applications in both the public and private sectors around the world.

But it’s not just publicly owned data that risks being ingested by models for less than its worth; proprietary data gathered by global corporations is also innately valuable. It comes as no surprise that organisations such as JP Morgan, Verizon, and Accenture have banned internal use of ChatGPT. Prior to the introduction of ChatGPT’s ‘incognito’ mode a week ago, users’ queries were being fed to OpenAI as training data en masse. More and more organisations are waking up to the fact that their employees have been handing over sensitive information for months.

Nations and large corporations need to pause and assess what they are at risk of handing over. And in a world where data truly is the new oil, we need to acknowledge that we, the UK, are sitting on massive reserves — some of the largest in the world — across both our public and private sectors.

In 1626, the story goes, indigenous peoples sold off the entire island of Manhattan to the Dutch for just $24 worth of beads.

Earlier this century, many of us blithely gave up our personal data in return for basic internet products and services such as social networking platforms. The coming transfer of value is many times that magnitude, since it involves not only individual users’ data but also the data of governments and corporations.

Failure to promptly evaluate and appraise the value of data assets will leave our governments and corporations exposed when LLM companies come knocking (assuming they haven’t already). The value we stand to lose is far greater than that implied by any data breach.

Assessing value is difficult yet urgent. Unlike oil, there is no transparent market setting prices hour by hour. Given how hard it is to understand the present and future value of public and private datasets, governments and corporations will need to find a mechanism — royalties, licences, equity, and so on — that enables them to capture the value created when their data is integrated with LLMs. That is, if they decide to partner at all.

The alternative, which may well prove the logical path forward, is to maintain full control by training bespoke models on proprietary datasets hosted on-premises or in private clouds.

Either way, the appropriate action is to pause and assess: to acknowledge that the space is moving at a thousand miles an hour and that, without safeguards and a concrete game plan, there is a real risk of losing the opportunity to generate significant wealth down the line.