My original goal with this article was to read through the GPT-4 Technical Report and System Card and explain some of the more complicated ideas for an audience unfamiliar with machine learning. In a way, this is still an explanation of those documents. Instead of machine learning ideas, though, the focus will be on how OpenAI applies a gilding of science to documents that should have been blog posts, and why they should be treated as such.
Sure, the documents have the dressings of scientific literature in the form of their formatting and citations. There's even a preferred way to cite the Technical Report in future literature. But these attributes do not make up for the missing Methods section, the hand-wavy explanations and metrics, and the unfounded claims about Artificial General Intelligence (AGI). These faults would have been caught and heavily critiqued during peer review. It would have been more appropriate for OpenAI to publish these documents as blog posts instead of giving them the false appearance of scientific literature; blog posts can still be cited, and they would have suited the content far better.
The missing Methods section
The closest approximation to a Methods section that OpenAI provides is the Scope and Limitations of this Technical Report section. In a Methods section, you'd typically find information about the data used to train the model, the architecture of the model, and perhaps some specifics about how the model was trained. None of this appears in the Scope and Limitations section.
First, let's look at the information provided about the data. OpenAI claims to use "both publicly available data (such as internet data) and data licensed from third-party providers." In other words, OpenAI could have used literally anything. OpenAI could have used Common Crawl, Wikipedia, or even C4, since these are some of the most common public datasets; Google used them to train the T5 and Pegasus models with good results [1, 2]. But no, all we get from OpenAI is "publicly available data." As for "data licensed from third-party providers," there is nowhere to even start guessing what was used [5]. Those third-party datasets could contain novel information with a significant impact on the model's performance, but OpenAI has decided it's more important to keep those potential improvements inside their walled garden.
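To underline how little it would have cost to simply name the public corpora, here is a minimal sketch of how anyone can inspect them today. It assumes the Hugging Face `datasets` library, and the hub identifiers are my own guesses at convenient sources (and may change over time), not anything OpenAI has confirmed using:

```python
# Minimal sketch: streaming two of the public corpora named above.
# Assumes the Hugging Face `datasets` library; the hub identifiers
# ("allenai/c4", "wikipedia") are illustrative, not from the Technical Report.
from datasets import load_dataset

# C4: the cleaned Common Crawl corpus used to train T5 [1].
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# A preprocessed English Wikipedia dump hosted on the hub.
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

# Peek at a few documents from each without downloading the full corpora.
for example in c4.take(2):
    print(example["text"][:80])
for example in wiki.take(2):
    print(example["title"])
```

The point is not that GPT-4 was trained on these exact corpora, only that "publicly available data" has concrete, citable names that a Methods section could have listed.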
Continuing the trend of providing no information, OpenAI decides it isn't worth sharing anything about "architecture, hardware, training compute, dataset construction, training methods, or similar" [5]. To be clear, this sort of information is critical for independently validating the claims made in scientific literature, and it is near-standard practice not only to publish it in the paper but also to release the final model (in this case GPT-4) for others to iterate on.
Then there are the "domain experts" that OpenAI contracted to vet GPT-4. In the Introduction, the Scope and Limitations section, and throughout the rest of these documents, readers are given no information about who these domain experts might be [5]. The clear implication is that we should trust OpenAI's choices. Given OpenAI's reticence to provide information about anything, especially about these experts, the validity of the experts' work and credentials has to be questioned. Some people would claim Andrew Wakefield [3] is a vaccine expert; I wouldn't. In fact, given the very apparent power of GPT-4, I am heavily invested in knowing that Andrew and anyone like him are nowhere near a position where they could determine what counts as valid medical information. The same applies to other potential experts in other fields. Because OpenAI publishes nothing about its domain experts, readers and evaluators of GPT-4 can't account for possible (and potentially malicious) biases in the model.
The one exception to OpenAI's refusal to provide information is the contracting of the Alignment Research Center (ARC). OpenAI hired ARC to make sure that GPT-4 wasn't trying to take over the world or replicate itself. My impression is that mentioning this work is calculated PR rather than useful information, a disclosure made to make GPT-4 seem more impressive than it is (discussed later in its own section). I don't think it's a coincidence that the one tiny morsel of information we get is the one implying GPT-4 might be the next step of human evolution on the way to SkyNet, even though it isn't right now.
The entire Scope and Limitations of this Technical Report section could have been written as follows:
We, OpenAI, want to make as much money as possible. So, we have decided that we will not be sharing any relevant information here. Actually, why are you reading this section anyway? We know you're really here to see our graphs.
OpenAI's custom measurement score for their model
OpenAI makes claims about the efficacy of their models using internal metrics (see Figure 6). Since OpenAI has a financial interest in portraying their models in the best possible light, these internal metrics are unhelpful to a reader of the Technical Report. Without any external validation, Figure 6 offers no actionable or relevant information, even though it claims to show how much better GPT-4 is than previous OpenAI models.
For internal metrics to mean anything, readers have to trust that OpenAI runs its internal factual evaluation objectively, which is not a clear-cut case. OpenAI has a financial incentive to make its newest model seem as powerful as possible, and nothing would hamper those claims more than publishing a graph showing that GPT-4 actually performed worse than chatgpt-v2 on their internal metrics. Without information about how this internal factual evaluation is run and without external validation of the results, Figure 6 cannot support any conclusions. The relative performance of OpenAI's models (GPT-4 and others) on standard public benchmarks gives some credence to Figure 6, but, right now, Figure 6 is the scientific equivalent of giving yourself an award in an event no one can even spectate.
A more nitpicky problem I have with Figure 6 is the inclusion of error bars without any explanation of what they represent. GPT-4 isn't deterministic, so maybe the error bars account for multiple runs of the internal evaluation. Maybe they account for the variance in evaluators' scores when judging how close GPT-4's answers are to the human ideal. It's even possible that the error bars are random lines the OpenAI team decided would look good on their graph, kind of like the paper where the error bars were just the `T` character in Figure 9 [4]. Because OpenAI provides no further information about the internal factual evaluation, we can't say which of these is true, so I propose that Figure 6 is useless for anything except puff-piece news articles.
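For illustration, here is a minimal sketch of the first interpretation: treating the error bars as variation across repeated runs of the same evaluation on a non-deterministic model. Everything in it (the dummy model, the pass/fail scoring against an "ideal" answer, the number of runs) is my own assumption, not OpenAI's procedure:

```python
# Hypothetical illustration of error bars from repeated evaluation runs.
# `dummy_model` stands in for a sampled (non-deterministic) model like GPT-4.
import random
import statistics

def dummy_model(prompt: str, seed: int) -> str:
    """Stand-in for a non-deterministic model response."""
    rng = random.Random(hash((prompt, seed)))
    return "ideal answer" if rng.random() < 0.7 else "something else"

def run_eval(model, eval_set, seed: int) -> float:
    """Fraction of prompts whose response matches the human 'ideal' answer."""
    hits = [model(prompt, seed) == ideal for prompt, ideal in eval_set]
    return sum(hits) / len(hits)

def accuracy_with_error_bar(model, eval_set, n_runs: int = 5):
    """Mean accuracy over repeated runs, plus a one-standard-error bar."""
    runs = [run_eval(model, eval_set, seed=i) for i in range(n_runs)]
    return statistics.mean(runs), statistics.stdev(runs) / (n_runs ** 0.5)

eval_set = [(f"question {i}", "ideal answer") for i in range(100)]
mean, err = accuracy_with_error_bar(dummy_model, eval_set)
print(f"accuracy = {mean:.2f} +/- {err:.2f}")
```

Whether Figure 6's bars mean anything like this is exactly the information OpenAI leaves out; two sentences stating the procedure would have settled it.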
OpenAI's unfounded claims
If the Technical Report is viewed as scientific literature, OpenAI's consultation with ARC to determine whether GPT-4 has the "ability to carry out actions to autonomously replicate and gather resources" is an interesting evaluation of a groundbreaking model [5]. However, if the Technical Report is viewed through the more appropriate lens of a PR blog post, it's clear that the consultation was done to drive clicks and exposure to GPT-4 in the form of catchy news headlines. It isn't a coincidence that this is the only concrete detail readers get in an otherwise opaque methodology. Credit where credit is due: the strategy worked. Here are some news articles written specifically about it:
- OpenAI checked to see whether GPT-4 could take over the world https://arstechnica.com/information-technology/2023/03/openai-checked-to-see-whether-gpt-4-could-take-over-the-world/
- GPT-4: A new capacity for offering illicit advice and displaying ‘risky emergent behaviors’ https://www.zdnet.com/article/gpt-4-has-new-capacity-for-offering-illicit-advice-and-having-risky-emergent-behaviors/
- OpenAI believed GPT-4 could take over the world, so they got it tested to see how to stop it https://www.firstpost.com/world/openai-believed-gpt-4-could-take-over-the-world-so-they-got-it-tested-to-see-how-to-stop-it-12309882.html
These are the articles that will pop up in people's news feeds, and these are the headlines that will shape the public understanding of GPT-4 and its capabilities. Whatever context the article bodies might add doesn't matter, because most people won't read past the headlines anyway. OpenAI is selling a product, and the ARC consultation cultivated exactly the kind of PR OpenAI wanted. Better yet, these articles can co-opt some of the stylings of science themselves by citing the OpenAI Technical Report and System Card.
As a final example of OpenAI's outlandish claims, the System Card states that "if multiple banks concurrently rely on GPT-4 to inform their strategic thinking about sources of risks in the macroeconomy, they may inadvertently correlate their decisions and create systemic risks that did not previously exist," showing a clear divorce from reality [5]. Anyone who has ever worked with a financial institution knows that banks can't even stop using Internet Explorer, let alone reach the point where GPT-4 informs their strategic thinking. It's this sort of nonsense that makes it clear to me that sentences like these were seeded throughout the documents for news articles to find and quote, rather than for someone to read and develop their understanding of GPT-4.
Conclusion
To be clear, GPT-4 is an amazing piece of technology and likely the next big step in large language models. Unfortunately, in the Technical Report and System Card, OpenAI does what it can to kneecap any further progress in the field (outside their silo) through a lack of information and grandiose claims. OpenAI is a company trying to sell a product, which is why it's so disingenuous for them to apply a veneer of science to what is essentially a long-form PR document, and so crucial to call it out.
In footnote 23 of the System Card, OpenAI invokes their charter, saying [5]:
We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be a better-than-even chance of success in the next two years.
I cannot think of anything more likely to start "a competitive race without time for adequate safety precautions" than the actions OpenAI took when authoring the Technical Report and System Card for GPT-4.
Citations
1. Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." The Journal of Machine Learning Research 21.1 (2020): 5485-5551.
2. Zhang, Jingqing, et al. "PEGASUS: Pre-training with Extracted Gap-Sentences for Abstractive Summarization." International Conference on Machine Learning. PMLR, 2020.
3. Wikipedia contributors. "Andrew Wakefield." Wikipedia, 11 Mar. 2023, en.wikipedia.org/wiki/Andrew_Wakefield.
4. Gong, Ruyao, and Binghong Liu. "Monitoring of Sports Health Indicators Based on Wearable Nanobiosensors." Advances in Materials Science and Engineering, vol. 2022, Article ID 3802603, 2022. https://doi.org/10.1155/2022/3802603
5. OpenAI. "GPT-4 Technical Report and System Card." 2023. https://arxiv.org/abs/2303.08774