A Game of Robot Telephone

2026-06-22 Mon 12:22 ai-assisted-coding article llm article publish

Intro


Way back when AltaVista Babel Fish first appeared online, it became a
fun game on IRC to take a phrase, translate it through a chain of
languages, and then back into English. We all also played the game of telephone as kids, trying to pass a message through the class by whispering in each other's ears.

Sometimes the result was surprisingly poetic, but often the result was complete gibberish as errors compounded and mutated along the way.

With the endless stream of LLM-fueled "rewrite this in X" posts doing the rounds, I thought it would be fun to try a similar game, but with code:

Start with a small but non-trivial program in Go
Pass it through a chain of LLM-generated rewrites
Bring it back to Go
See what survived

The final program produced the right answer, but grew to nearly five times its original size, thanks to a grab bag of semantic souvenirs it had picked up from the languages it passed through.

The Task

The task the code performs should have enough moving parts to make translation interesting, but be everyday enough to be achievable without heavy frameworks or a multitude of libraries.

The program I settled on does the following:

Accepts and validates a URL from the command line
Makes an HTTP request to the URL to retrieve a list of TODOs in JSON format
Parses the JSON response into values representing TODOs
Reads the current local date
Parses and validates the TODO deadline dates (YYYY-MM-DD format)
Groups TODOs by user ID
Counts completed and overdue TODOs per user
Sorts summaries by completed/overdue counts
Formats a fixed-width table of results and prints to stdout

The initial implementation relies entirely on Go's (well-suited) standard library.

Example API Response

  [
    {
      "userId": 1,
      "id": 1,
      "title": "delectus aut autem",
      "completed": false,
      "dueDate": "1900-01-01"
    },
    {
      "userId": 1,
      "id": 2,
      "title": "quis ut nam facilis et officia qui",
      "completed": true,
      "dueDate": "2999-12-31"
    },
    ...
  ]

The input has twenty TODOs spread across seven users.
Due dates range from 1900 to 2999, so the overdue counts may vary depending on the date the program is run.
There are no "poorly formatted" inputs.

Example output

USER  COMPLETED  MISSED
3     2          2
2     2          1
4     2          1
5     1          3
1     1          1
7     0          1
6     0          0
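
To make the task concrete, here is a minimal Go sketch of the core steps: decode, validate dates, group, count, sort and format. It is not the original 94-line program. The names Todo, Summary, summarize, render and summarizeJSON are hypothetical, "overdue" is assumed to mean "not completed and due before today", and the sort order (completed descending, then missed descending, then user ID ascending) is inferred from the example output above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
	"time"
)

// Todo mirrors the fields shown in the example API response.
type Todo struct {
	UserID    int    `json:"userId"`
	ID        int    `json:"id"`
	Title     string `json:"title"`
	Completed bool   `json:"completed"`
	DueDate   string `json:"dueDate"`
}

// Summary holds the per-user counts printed in the table.
type Summary struct {
	UserID, Completed, Missed int
}

// summarize groups TODOs by user and counts completed and overdue items.
// Assumption: "overdue" means not completed and due before today.
func summarize(todos []Todo, today time.Time) ([]Summary, error) {
	byUser := map[int]*Summary{}
	for _, t := range todos {
		due, err := time.Parse("2006-01-02", t.DueDate)
		if err != nil {
			return nil, fmt.Errorf("todo %d: invalid due date %q: %w", t.ID, t.DueDate, err)
		}
		s, ok := byUser[t.UserID]
		if !ok {
			s = &Summary{UserID: t.UserID}
			byUser[t.UserID] = s
		}
		switch {
		case t.Completed:
			s.Completed++
		case due.Before(today):
			s.Missed++
		}
	}
	out := make([]Summary, 0, len(byUser))
	for _, s := range byUser {
		out = append(out, *s)
	}
	// Assumed ordering, inferred from the example output.
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.Completed != b.Completed {
			return a.Completed > b.Completed
		}
		if a.Missed != b.Missed {
			return a.Missed > b.Missed
		}
		return a.UserID < b.UserID
	})
	return out, nil
}

// render formats the fixed-width table shown in the example output.
func render(rows []Summary) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%-6s%-11s%s\n", "USER", "COMPLETED", "MISSED")
	for _, r := range rows {
		fmt.Fprintf(&b, "%-6d%-11d%d\n", r.UserID, r.Completed, r.Missed)
	}
	return b.String()
}

// summarizeJSON ties the steps together; today is a YYYY-MM-DD string.
func summarizeJSON(data []byte, today string) (string, error) {
	now, err := time.Parse("2006-01-02", today)
	if err != nil {
		return "", err
	}
	var todos []Todo
	if err := json.Unmarshal(data, &todos); err != nil {
		return "", err
	}
	rows, err := summarize(todos, now)
	if err != nil {
		return "", err
	}
	return render(rows), nil
}

func main() {
	data := []byte(`[
	  {"userId": 1, "id": 1, "title": "delectus aut autem", "completed": false, "dueDate": "1900-01-01"},
	  {"userId": 1, "id": 2, "title": "quis ut nam facilis et officia qui", "completed": true, "dueDate": "2999-12-31"}
	]`)
	out, err := summarizeJSON(data, time.Now().Format("2006-01-02"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The HTTP fetch and URL validation are left out to keep the sketch short. Note how much the struct types do on their own: an integer user ID and a Boolean completed flag are enforced by the decoder, with no extra code.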

Links in the Chain

The full chain of languages used was:

Go -> TypeScript -> Python -> Ruby -> C++ -> Java -> Haskell -> Common Lisp -> Zig -> Rust -> Go

Every step was run by a fresh Codex process and executed in an isolated worktree. The prompt used is given in full below. It details what to deliver and how to check the results. It also encourages idiomatic code, installation of a toolchain and use of popular libraries. Without that nudge, I found that Codex was VERY prone to, for example, writing its own JSON parser instead of installing a toolchain.

The Prompt

User

# Semantic Drift Rewrite Task

You are translating one generated implementation into another language while
preserving observable behavior exactly.

Use only the information in this prompt and the files in the source directory.

## Inputs

Source language: $SOURCE_LANGUAGE
Target language: $TARGET_LANGUAGE
Source step directory: `$SOURCE_DIR`
Source project directory: `$SOURCE_PROJECT_DIR`
Target step directory: `$TARGET_DIR`
Target project directory: `$TARGET_PROJECT_DIR`
Repository root: `$REPO_ROOT`

## Behavioral Source

Derive the program behavior from the source project. Do not rely on a restated
specification in this prompt.

The conformance command is the oracle for whether the translated project
preserves the observable behavior.

Do not hard-code fixture data, expected output, timestamps, API responses, or
other harness constants into the program logic or `run.sh`.

## Required Target Shape

Create or replace only the target project directory:

```text
$TARGET_PROJECT_DIR
```

The target project must contain all source/dependency files required to build
and run the implementation in $TARGET_LANGUAGE.

You may install any libraries, tools or environment required.

The target project must include:

```text
run.sh
```

`run.sh` must accept exactly one argument, the TODOs URL:

```sh
./run.sh http://127.0.0.1:8899/todos
```

It must change to its own directory before building/running so it works when
called from the repository root.

It must build and run the program normally and use the machine's system date and
time without overriding them.

Do not create or modify repository-level entrance scripts.

## Source Material

Read the source project in:

```text
$SOURCE_PROJECT_DIR
```

Preserve the observable behavior, not incidental implementation structure.

## Conformance

After writing the target project, run it and submit its stdout to the
already-running oracle:

```sh
uv run python -m semantic_drift submit $TARGET_PROJECT_DIR
```

The submit command runs `run.sh` against `/TODOs`, captures its stdout, and posts those exact bytes to `/conform`. The API compares the submitted output with its independently calculated result. The API does not run or inspect the target project.

You may also run `run.sh` directly to diagnose build or runtime failures. Do not override the system clock or restructure the runtime command for the harness.

If conformance fails, inspect stdout/stderr, repair the target project, and run the same command again. Continue until the JSON response reports `"passed": true` or until you hit a real blocker.

Do not change the source project, TODOs API, conformance harness, or this prompt to make the target pass. Do not inspect fixtures as a way to hard-code the answer; translate the source behavior and use conformance only as feedback.

Use idiomatic, mainstream ecosystem libraries rather than lower-level standard library facilities when a widely adopted library exists for the task. Do not avoid dependencies merely to make the project self-contained.

Declare dependencies using the target language's conventional project tooling, and make `run.sh` install or resolve them reproducible before running.

## Deliverable

When finished, report:

files created or changed under `$TARGET_PROJECT_DIR`
the final oracle command
whether it passed

Each generated project contains its own build files, dependencies and run.sh, a wrapper script accepting a single argument (the TODOs endpoint) and tasked with running the newly generated language's code.

An Oracle to Guide Us

The API also exposed a POST /conform endpoint.

Codex could run the newly generated project against /todos, gather the stdout output and post to this endpoint to get feedback as to whether it performed its task correctly.

If it failed, Codex could inspect the program locally, repair it and try again.

This was to prevent Codex from being sneaky and peeking at expected outputs or another canonical implementation.

And Codex is really sneaky.
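
The submit-and-compare loop described above can be sketched in Go. The real harness is a Python module, so everything here is an assumption for illustration: the function names submit and demoSubmit, the text/plain content type, and a reply containing only the "passed" field (the one field the prompt mentions). The stand-in oracle holds its own expected output and only ever sees the submitted bytes.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// verdict is the assumed shape of the oracle's JSON reply; only the
// "passed" field is mentioned in the prompt.
type verdict struct {
	Passed bool `json:"passed"`
}

// submit posts the captured stdout bytes to the oracle's /conform endpoint
// and reports whether the oracle accepted them.
func submit(baseURL string, stdout []byte) (bool, error) {
	resp, err := http.Post(baseURL+"/conform", "text/plain", bytes.NewReader(stdout))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("oracle returned %s", resp.Status)
	}
	var v verdict
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return false, err
	}
	return v.Passed, nil
}

// demoSubmit runs submit against a stand-in oracle that compares the
// submitted bytes with its own expected output, byte for byte.
func demoSubmit(expected, actual string) (bool, error) {
	oracle := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(verdict{Passed: string(body) == expected})
	}))
	defer oracle.Close()
	return submit(oracle.URL, []byte(actual))
}

func main() {
	table := "USER  COMPLETED  MISSED\n1     1          1\n"
	ok, err := demoSubmit(table, table)
	fmt.Println("exact match:", ok, err)
	ok, err = demoSubmit(table, "USER COMPLETED MISSED\n1 1 1\n")
	fmt.Println("wrong spacing:", ok, err)
}
```

Because the comparison is on exact bytes, the only feedback a failing translation gets is a pass or fail, with no hint about what the expected output looks like.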

Results

Every language in the chain was able to generate the expected table.

The original Go implementation was 94 lines, but the final one grew to 443 lines:

	Original Go	Final Go
Main implementation	94 lines	443 lines
User IDs	integers	arbitrary JSON values
Completed	Boolean	`true`, `"true"` or numeric `1`
Fields	read when needed	all required up front
HTTP timeout	10 seconds	none
Date parsing	Go standard library	hand-written validation
HTTP status descriptions	Go standard library	hand-written lookup table

None of the extra machinery changed the fixture output. Most of it existed to preserve decisions made by intermediate implementations, and preserve accumulated "backward compatibility" behavior of previous steps.

Truth Gets Complicated

In the original Go implementation, completed was a simple boolean value within the JSON emitted by the API:

Completed bool `json:"completed"`

Immediately, the TypeScript rewrite changes this behavior subtly.

In Go, a JSON type mismatch, such as a string where a bool is expected, returns an error. TypeScript with axios, however, trusts the type annotation at compile time and does no runtime validation. Any non-empty string value for completed would register as truthy, regardless of its content.
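
The strictness on the Go side is easy to see in isolation. The following sketch uses a cut-down struct with only the completed field; the helper names decodeCompleted and isTypeMismatch are made up for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// todo is a cut-down version of the struct, keeping only the field
// under discussion.
type todo struct {
	Completed bool `json:"completed"`
}

// decodeCompleted returns the decoded flag and any decoding error.
func decodeCompleted(raw string) (bool, error) {
	var t todo
	err := json.Unmarshal([]byte(raw), &t)
	return t.Completed, err
}

// isTypeMismatch reports whether err is encoding/json's type mismatch error.
func isTypeMismatch(err error) bool {
	var typeErr *json.UnmarshalTypeError
	return errors.As(err, &typeErr)
}

func main() {
	inputs := []string{`{"completed": true}`, `{"completed": "false"}`, `{"completed": 1}`}
	for _, raw := range inputs {
		v, err := decodeCompleted(raw)
		fmt.Printf("%-24s value=%-5v mismatch=%v\n", raw, v, isTypeMismatch(err))
	}
}
```

A string or a number in place of the Boolean is rejected at decode time, so the rest of the program never has to decide what "false" as a string should mean.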

In practice, the API never returns strange data here, but it has repercussions down the line.

Python and Ruby supplied their own ideas of truthiness, simply checking if todo.completed. Java then used Jackson's coercion rules, and by the Haskell stage the behavior had been made explicit: Boolean true, the string "true" and numeric 1 were true.

That rule survives and becomes encoded in the Common Lisp, Zig and Rust implementations.

As Boolean

Lisp

  (defun as-boolean (value)
    (or (eq value 'yason:true)
        (and (stringp value) (string= value "true"))
        (and (numberp value) (= value 1))))

Zig

  fn asBoolean(value: std.json.Value) bool {
    return switch (value) {
        .bool => |b| b,
        .string => |s| std.mem.eql(u8, s, "true"),
        .integer => |n| n == 1,
        .float => |n| n == 1.0,
        .number_string => |s| std.mem.eql(u8, s, "1") or std.mem.eql(u8, s, "1.0"),
        else => false,
    };
  }

Rust

  fn as_boolean(value: &Value) -> bool {
    match value {
        Value::Bool(value) => *value,
        Value::String(value) => value == "true",
        Value::Number(value) => number_is_one(value),
        _ => false,
    }
  }

When the program returns to Go it looks like this:

func asBoolean(value any) bool {
	switch v := value.(type) {
	case bool:
		return v
	case string:
		return v == "true"
	case json.Number:
		return numberIsOne(v)
	default:
		return false
	}
}

The final Go program carefully preserves the semantics of the boolean coercion inherited from the stages before it.
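
For readers who want to run the coercion rule, here is a self-contained sketch built around the asBoolean function above. The numberIsOne helper is not shown in this article, so the version below is a guess at its behavior (parse the number and compare it to 1), and coerce is a hypothetical wrapper. The json.Number case only arises when the decoder is told to keep numbers as text with UseNumber.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// numberIsOne is a guess at the helper's behavior: the number parses
// as a float and equals 1.
func numberIsOne(n json.Number) bool {
	f, err := n.Float64()
	return err == nil && f == 1
}

// asBoolean is the coercion rule as it appears in the final Go program.
func asBoolean(value any) bool {
	switch v := value.(type) {
	case bool:
		return v
	case string:
		return v == "true"
	case json.Number:
		return numberIsOne(v)
	default:
		return false
	}
}

// coerce decodes a single raw JSON value, keeping numbers as json.Number,
// and applies asBoolean to the result.
func coerce(raw string) (bool, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber()
	var v any
	if err := dec.Decode(&v); err != nil {
		return false, err
	}
	return asBoolean(v), nil
}

func main() {
	for _, raw := range []string{`true`, `"true"`, `1`, `1.0`, `"false"`, `"yes"`, `0`, `null`} {
		b, err := coerce(raw)
		fmt.Printf("%-8s -> %v (err=%v)\n", raw, b, err)
	}
}
```

Compare this with the single struct field in the original: the same fixture passes through both, but only one of them needs a page of explanation.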

This is the main pattern in the chain. A language adds an interpretation, the next translation treats it as intentional, and eventually it becomes explicit compatibility code. Because of the oracle, these have no effect on the output, but contribute extra cognitive cruft.

What's in An ID

The original program decoded userId directly into an int.

The TypeScript annotation looked equally strict but again provides no runtime validation. Python and Ruby use dicts keyed by todo["userId"], allowing any type to be used as a hash key. Once we hit the C++ stage, the ambiguity is made official by accepting any JSON value as a user ID.

From that point onward, user IDs could be strings, numbers, Booleans, null, arrays or objects. The implementation needed rules for grouping, displaying and sorting all of them.

The final Go program therefore has:

jsonKey to serialize values for grouping
displayValue to print arbitrary JSON values
compareUserID to order mixed types
cloneJSONValue, which no longer did anything

It also preserved JSON numbers as text. 1, 1.0 and 1e0 could become three
different grouping keys while all being displayed as 1.

None of this was needed by the task - the fixture only contains integer IDs, a fact which was encoded succinctly in the original implementation by the type in the Go structs. Again the ambiguity of the oracle and the passage through dynamic languages caused downstream ambiguities to congregate and gave us a whole bunch of fresh souvenirs.

Error Handling Fossilizes

The original Go client got HTTP status text from the standard library. Later languages exposed status text differently, or not at all.

By the Common Lisp stage the program carried its own table of HTTP reason phrases. Zig copied it. Rust copied it. The final Go translation copied it back into a language which already knew how to produce those strings.

The final program contains a switch covering status codes from 400 Bad Request to 505 HTTP Version Not Supported.

This is not an LLM inventing random code. It is doing exactly what the setup asked: preserving observable behavior from its source. The problem is that an incidental workaround had quietly become observable behavior. This is mainly a failure of the prompt, which too strongly encouraged treating the previous stage's source as the canonical example.
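
For contrast with the hand-written table, this is all Go needs to produce those reason phrases. The reasonPhrase wrapper is only there for illustration; http.StatusText is the standard library function doing the work.

```go
package main

import (
	"fmt"
	"net/http"
)

// reasonPhrase returns the standard library's text for an HTTP status code.
// Unknown codes yield an empty string.
func reasonPhrase(code int) string {
	return http.StatusText(code)
}

func main() {
	codes := []int{
		http.StatusBadRequest,
		http.StatusNotFound,
		http.StatusHTTPVersionNotSupported,
	}
	for _, code := range codes {
		fmt.Printf("%d %s\n", code, reasonPhrase(code))
	}
}
```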

Useful Behavior Disappears

Drift did not only add things.

The original Go HTTP client had a ten-second timeout. TypeScript, Python, Ruby,
C++, Java, Haskell and Common Lisp all retained some form of timeout.

During the Zig translation step, the timeout was dropped. With no signal from there onward, neither the Rust translation nor the final Go program restored it, and the latter ends up calling http.Get directly with no timeout specified.

In our case, the TODOs API never serves a slow response and all stages pass without a problem. This highlights another side to fixture-driven conformance: behavior tested by the fixture becomes sacred. Behavior outside it can disappear without a trace.
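
The difference is invisible against a fast server and obvious against a slow one. The sketch below uses a local test server that delays its reply; the helper names and the millisecond values are arbitrary choices for the demonstration, not taken from any of the generated programs.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchStatus performs a GET with the given client and returns the status code.
func fetchStatus(client *http.Client, url string) (int, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// slowServerTimesOut starts a local server that waits delayMs before
// replying and reports whether a client with a timeoutMs timeout gives up.
func slowServerTimesOut(delayMs, timeoutMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	client := &http.Client{Timeout: time.Duration(timeoutMs) * time.Millisecond}
	_, err := fetchStatus(client, srv.URL)
	return err != nil
}

// defaultClientHasTimeout reports whether the client used by http.Get
// has any timeout configured.
func defaultClientHasTimeout() bool {
	return http.DefaultClient.Timeout != 0
}

func main() {
	fmt.Println("http.Get has a timeout:", defaultClientHasTimeout())
	fmt.Println("slow server, short timeout gives up:", slowServerTimesOut(300, 50))
	fmt.Println("fast server, long timeout gives up:", slowServerTimesOut(0, 2000))
}
```

The original program would use something like the explicit client with a ten-second Timeout. http.Get goes through http.DefaultClient, whose zero Timeout means it will wait indefinitely.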

Side Trips

Smaller Chains

I also ran three shorter round trips, inspired by failed long-chain experiments.

Chain	Final Go	Main souvenir
Go -> Bash -> Go	105 lines	dates compared as strings
Go -> PHP -> Go	272 lines	explicit emulation of PHP casting and truthiness
Go -> Erlang -> Go	130 lines	Erlang-style errors and a non-strict sort function

The Bash version originally poisoned all implementations downstream by doing fancy jq manipulations. It also compared ISO dates lexically instead of using some form of date parsing. This works for valid YYYY-MM-DD values, so the returning Go program kept doing it. Invalid or missing dates no longer behaved like the original.
The PHP round trip had the clearest example of semantic baggage. The returning Go implementation added helpers for PHP integer conversion, Boolean truthiness, string conversion, associative-array iteration and PHP-shaped date errors.
The Erlang round trip stayed fairly close to the original, but introduced a subtle bug in the sort. Erlang's sort predicate uses "less than or equal to". When translated into Go, that became <=. Go's sort.Slice needs a strict "less than" function, so passing it "less than or equal to" is incorrect. The fixture data never triggered the bad case, so the bug went unnoticed.
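
The Erlang sort issue comes down to one property: a comparison passed to sort.Slice must never report true for two equal elements. The sketch below is illustrative only. The row type, the field being compared and the helper names are hypothetical, not lifted from the generated code.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical summary row used only for this illustration.
type row struct {
	User      int
	Completed int
}

// nonStrictLess mirrors a "less than or equal to" predicate. It reports
// true for equal elements, which breaks the sort.Slice contract.
func nonStrictLess(a, b row) bool {
	return a.Completed <= b.Completed
}

// strictLess orders by completed count and breaks ties on user ID, so it
// never reports true when both rows are equal.
func strictLess(a, b row) bool {
	if a.Completed != b.Completed {
		return a.Completed < b.Completed
	}
	return a.User < b.User
}

// irreflexive checks one requirement of a strict ordering: less(x, x)
// must be false for every x.
func irreflexive(less func(a, b row) bool, rows []row) bool {
	for _, r := range rows {
		if less(r, r) {
			return false
		}
	}
	return true
}

// sortedUsers sorts a copy of rows with the strict comparison and returns
// the user IDs in the resulting order.
func sortedUsers(rows []row) []int {
	out := append([]row(nil), rows...)
	sort.Slice(out, func(i, j int) bool { return strictLess(out[i], out[j]) })
	users := make([]int, len(out))
	for i, r := range out {
		users[i] = r.User
	}
	return users
}

func main() {
	rows := []row{{User: 3, Completed: 2}, {User: 1, Completed: 1}, {User: 2, Completed: 2}, {User: 4, Completed: 0}}
	fmt.Println("<= is a valid strict ordering:", irreflexive(nonStrictLess, rows))
	fmt.Println("<  is a valid strict ordering:", irreflexive(strictLess, rows))
	fmt.Println("sorted users:", sortedUsers(rows))
}
```

A check like irreflexive is cheap to run in a test and would have caught the translated predicate even though the fixture never exercised it.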

Adjusted Prompt Run

Another run with a slightly adjusted prompt heavily favored DIY solutions to the translation of ambiguous truthiness and dict keys after leaving the archipelago of dynamic languages. A huge amount of code found its way into the final Go implementation just to parse the JSON TODOs. It seems to have first appeared in the C++ implementation and grown from there.

It seems that the system prompt still has a lot of sway over how the translation progresses, especially when the task runs in a loop building off of itself. Often this happens in unexpected ways.

What does it tell us?

Can we take a lot away from this quick experiment? The oracle was deliberately stunted, the task trivial and the prompt underspecified. I could have gone back and made the prompt more deliberately strict about preserving types, timeouts and so on. Still, I think there's a kernel of wisdom to be had here when it comes to working with agentic coding setups.

The final implementation was a suitcase bearing the stickers of each language we visited along the way: JavaScript truthiness, generic JSON values from C++, coercion rules made explicit in Haskell, an HTTP status table from Common Lisp, date-validation rules passed through Zig and Rust, all of it rendered back into Go switches and helper functions.

The translation Codex performed is conservative, but has its quirks. It preserves what the source does, including workarounds and accidents, because the only way it can check correctness did not distinguish those from the point of the program.

The key takeaway is that LLM translations relying purely on test conformance are prone to sneaky behavior: satisfying the outcome while cutting corners elsewhere. Knowing this, setup, prompting, testing and final review are all extremely important to get right. LLMs are just as lazy as we are, and considerably better at hiding it.

Notes

Bash was originally in the full chain. It introduced enough shell-specific behavior that I replaced it with PHP.
PHP then pushed its coercion rules into every downstream translation, so I replaced it with C++ for the full run.
The first setup used libfaketime to freeze the date. It proved troublesome and exposed details of the harness to generated programs. The current fixture spans enough dates to remain useful while using the real local clock.
The oracle calculates its expected output when its server starts. Crossing local midnight during a run remains a small race.
Step-specific analysis can be found in CHAIN_ANALYSIS.md