annual letter 2023: golden age of stochastic parrots
This was a seminal year in the history of mankind: it fired the starting gun on the intelligence revolution, the successor to the agricultural, industrial, and information revolutions1. To many people, ChatGPT came out of nowhere. But let's take a look at the past in order to make sense of the present, and, if we're mindful, perhaps the future.
Estimated Reading Time: 10 minutes
Contents
- Discretely Deterministic, Continuously Stochastic
- Mechanized Algorithms, Mechanized Inference
- Learning Systems, A Golden Age
Discretely Deterministic, Continuously Stochastic
While we are not philosophers who concern themselves with the nuances of epistemology (what can be known) or ontology (what is real), as programmers we share a common operational understanding in which reality provides phenomena for us to observe and describe using natural or formal languages. Both mathematics and programming2 are considered to be of the latter kind, meaning that in practice these formal languages sacrifice expressivity for increased precision. While we all have our favorite introductory books that founded the canon for our discipline of programming, the second part of the seminal Structured Programming by Dijkstra, Dahl, and Hoare (1972) backs out of the proverbial Plato's cave and asks, from afar (epistemologically), what are we doing during the activity of programming?
The primary use for representations is to convey information about important aspects of the real world to others, and to record this information in written form, partly as an aid to memory and partly to pass it on to future generations. However, in primitive societies the representations were sometimes believed to be useful in their own right, because it was supposed that manipulation of representations might in itself cause corresponding changes in the real world; and thus we hear of such practices as sticking pins into wax models of enemies in order to cause pain to the corresponding part of the real person. This type of activity is characteristic of magic and witchcraft. The modern scientist on the other hand, believes that the manipulation of representations could be used to predict events and the results of changes in the real world, although not to cause them. For example, by manipulation of symbolic representations of certain functions and equations, he can predict the speed at which a falling object will hit the ground, although he knows that this will not either cause it to fall, or soften the final impact when it does.
The last stage in the process of abstraction is very much more sophisticated; it is the attempt to summarise the most general facts about situations and objects covered under an abstraction by means of brief but powerful axioms, and to prove rigorously (on condition that these axioms correctly describe the real world) that the results obtained by manipulation of representations can also successfully be applied to the real world. Thus the axioms of Euclidean geometry correspond sufficiently closely to the real and measurable world to justify the application of geometrical constructions and theorems to the practical business of land measurement and surveying the surface of the earth.
The process of abstraction may thus be summarised in four stages:
- Abstraction: the decision to concentrate on properties which are shared by many objects or situations in the real world, and to ignore the differences between them.
- Representation: the choice of a set of symbols to stand for the abstraction; this may be used as a means of communication.
- Manipulation: the rules for transformation of the symbolic representations as a means of predicting the effect of similar manipulation of the real world.
- Axiomatisation: the rigorous statement of those properties which have been abstracted from the real world, and which are shared by manipulations of the real world and of the symbols which represent it.
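Hoare's falling-object example runs through these stages end to end; here is a minimal sketch in Python (the function name and the 20 m drop are illustrative assumptions, not from the text):

```python
import math

def impact_speed(height_m, g=9.81):
    """Manipulate the representation v = sqrt(2*g*h) to predict the
    speed of a falling object -- predicting the event, not causing it."""
    return math.sqrt(2 * g * height_m)

# A drop from 20 m is predicted to land at roughly 19.8 m/s.
print(round(impact_speed(20.0), 1))
```

The abstraction (free fall), representation (the formula), and manipulation (evaluating it) each appear; the axiomatisation is Newton's laws, assumed to describe the real world closely enough for the prediction to transfer.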
The only meaningful difference is that the toolsmiths of mathematics (logicians) limit themselves to supplying a language that specifies the what; i.e., mathematicians only concern themselves with their programs typechecking. The toolsmiths of programming (language implementors), on the other hand, supply both a language that specifies the how and a runtime, whether that runtime is implemented as a virtual interpreter with instructions or a physical processor with transistors, for the program to be executed on. To quote Sussman and Abelson, programming provides procedural epistemology.
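A hedged sketch of the what/how split, with names of my own choosing: the specification `is_sqrt` only states what a square root is, while `newton_sqrt` states how to reach one and needs a runtime to execute:

```python
def is_sqrt(y, x, eps=1e-9):
    # The "what": a specification that merely checks a candidate answer.
    return abs(y * y - x) < eps * max(x, 1.0)

def newton_sqrt(x):
    # The "how": a procedure (Newton's method) a runtime can execute.
    y = x if x > 1 else 1.0
    while not is_sqrt(y, x):
        y = (y + x / y) / 2
    return y
```

The logician is satisfied once `is_sqrt` pins the concept down; the programmer must also supply the iteration that arrives at it.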
Computation: Mechanized Mathematics
While the seeds of this kind of procedural epistemology can be found in the algorithms produced by the mathematics practiced in the era of the Greeks (i.e., Euclid's algorithm) and of the Germans3 (i.e., Newton's method), it was not until Hilbert's 20th-century pursuit of "solving math" that the notion of a computer was formalized4, subsequently leading to the industrial, automated evaluation of these sequences of instructions on processors, which provided the escape velocity needed to initiate what is colloquially known as the information revolution.
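Euclid's algorithm is the canonical example of such a pre-modern sequence of instructions; a standard Python rendering:

```python
def gcd(a, b):
    # Euclid's algorithm (c. 300 BC): replace the pair (a, b) with
    # (b, a mod b) until the remainder vanishes.
    while b:
        a, b = b, a % b
    return a

print(gcd(1071, 462))  # -> 21
```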
Since any piece of constructive mathematics can be simulated with a program, we would expect the discrete/deterministic and continuous/stochastic distinction in mathematics to carry over to computational mathematics.
Most industrial programmers are intimately familiar with the former, mathema such as sets, associations, and iterators, which cannot be said of the latter, mathema such as tensors, distributions, and derivatives. While many industrial programmers busied themselves modelling reality, creating cathedrals and bazaars of digital infrastructure such as financial networks, commercial networks, and social networks, a few people were busy exploring the other fork of computation with a more probabilistic, stochastic, continuous bent: namely, machine learning.
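The two forks can be juxtaposed in a few lines, using only the standard library (the toy data is mine):

```python
import random
import statistics

# Discrete and deterministic mathema: sets, associations, iterators.
inventory = {"apples": 3, "pears": 5}
total = sum(inventory.values())  # exactly 8, every run

# Continuous and stochastic mathema: distributions and derivatives.
random.seed(0)
samples = [random.gauss(mu=0.0, sigma=1.0) for _ in range(10_000)]
mean_estimate = statistics.fmean(samples)  # near 0, never exactly 0

def derivative(f, x, h=1e-6):
    # A central difference: a continuous-mathematics primitive.
    return (f(x + h) - f(x - h)) / (2 * h)
```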
Learning: Mechanized Inference
The essence of machine learning as a discipline is functionally equivalent to that of statistical learning, with the primary distinctions being their
- tools: machine learning heavily emphasizes the use of computational5 techniques
- culture6: machine learning prioritizes prediction over interpretability, meaning black-box models are acceptable

And this fork has finally borne its most consequential fruit this year with none other than ChatGPT.
Machine learning, like its statistical-learning cousin, performs inference on the distribution of parameters (aka parameter estimation), but with more expressive function classes than the ones learned in Probability 101. That is, rather than updating beliefs over priors that look like traditional Gaussians, binomials, and Bernoullis, machine learning uses models such as kernel machines and, of course, deep neural networks.
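For the Probability 101 case, parameter estimation has closed forms; a sketch with invented sample data:

```python
import statistics

def bernoulli_mle(flips):
    # The maximum-likelihood estimate of a Bernoulli p is the sample mean.
    return sum(flips) / len(flips)

def gaussian_mle(xs):
    # The MLE of a Gaussian is the sample mean and (biased) sample variance.
    mu = statistics.fmean(xs)
    var = statistics.fmean([(x - mu) ** 2 for x in xs])
    return mu, var

print(bernoulli_mle([1, 0, 1, 1]))  # -> 0.75
```

Kernel machines and deep networks trade these closed forms for expressiveness: estimation becomes iterative optimization (gradient descent) rather than a formula.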
Mechanized Algorithms, Mechanized Inference
Although neural networks date back well over three decades, it was not until the infamous "ImageNet moment" of 2012 that deep learning took off in earnest. Since then, each order-of-magnitude increase in scale has unlocked a new capability:
- 2018 (GPT-1, 117M params): grammar
- 2019 (GPT-2, 1.5B params): prose, poetry, metaphor
- 2020 (GPT-3, 175B params): long stories
- 2023 (GPT-4, a rumored 1.76T params): college-level exams
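A well-known rule of thumb (not from this letter) puts a transformer's non-embedding parameter count at roughly 12 · n_layers · d_model²; plugging in the published GPT-2 and GPT-3 configurations roughly recovers the figures above:

```python
def approx_transformer_params(n_layers, d_model):
    # Per block: ~4*d^2 for attention + ~8*d^2 for the MLP = 12*d^2,
    # ignoring embeddings, biases, and layer norms.
    return 12 * n_layers * d_model ** 2

# Published configs: GPT-2 (48 layers, d=1600), GPT-3 (96 layers, d=12288).
print(f"{approx_transformer_params(48, 1600) / 1e9:.2f}B")   # -> 1.47B (vs 1.5B)
print(f"{approx_transformer_params(96, 12288) / 1e9:.0f}B")  # -> 174B (vs 175B)
```

At small scales the embedding matrices (ignored here) account for the remaining gap, which is why GPT-1's 117M sits above the formula's estimate.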
karpathy: vision -> language
Toronto streetcars are not yet handled well by FSD. Btw, @karpathy is on a ~4 month sabbatical.
— Elon Musk (@elonmusk) March 27, 2022
It's been a great pleasure to help Tesla towards its goals over the last 5 years and a difficult decision to part ways. In that time, Autopilot graduated from lane keeping to city streets and I look forward to seeing the exceptionally strong Autopilot team continue that momentum.
— Andrej Karpathy (@karpathy) July 13, 2022
https://github.com/karpathy/minGPT
UPDATE: gwern in 2024
Learning Systems, A Golden Age
One could say that the intelligence revolution is in fact a part of the information revolution↩
The Curry-Howard correspondence between types and propositions, as well as programs and proofs, gestures towards the notion that they are in fact the same.↩
Clearly, Sir Isaac Newton was an Englishman. However, "German" mathematics is used here reductively, in the mathematical-history sense in which "Greek" mathematics refers to the mathematical zeitgeist of Egypt, Alexandria, and Greece, and "Silicon Valley" refers to the technological zeitgeist of the Bay Area, New York City, and Austin.↩
The Church-Turing Thesis with formal models of Lambda Calculus, Recursive Functions, Turing Machines, or in practice, von Neumann Machines.↩
Statistical Modeling: The Two Cultures (2001), and On Chomsky and the Two Cultures of Statistical Learning↩