
The Mystery of Detective Barbie’s Audio

Update: The mystery has been solved. More on that below.

Second Update: Someone has turned this work into a website to match the game’s user interface. Find your name and have it spoken here: https://miomoto.de/barbenamer/

This blog post is part culture, part coding mystery. I think anyone could find the culture interesting. Nerds are going to want to read past the tech warning and continue on to the end. And heck, maybe help solve the mystery of the audio files.

A thread in a private Slack instance I participate in has been low-key obsessed with a particular video game from 1999, “Detective Barbie 2: The Vacation Mystery.” I’d never heard of it, and suspect you haven’t either. A year ago a TikTok came out exploring the game’s name customization. The game has thousands of names to choose from. Barbie can speak each one. Presumably, the voice actress recorded all 9,342 names (of which there are almost 50,000 alternate spellings, if I’m interpreting things correctly).

You can watch the TikTok of a woman exploring a bunch of different names: https://www.tiktok.com/t/ZTNQ1jY9S/

These names include: Na’sheemon (pronounced “Nah-she-ma-NUH”), Xalviertah (“Alberta”), Jahzingah, Egbeartina, Moonunitt [sic, with two Ts], Qecpfcnh (“Canda-lynn”???), Obeadience (!!!), Purificacionh (!!!), Honeycombb [sic], Seeagigey, Caeetlande (“Caitlyn”), and Silence.

We’ve gotten some anecdotes that this game company has done similar things in previous games. They started with a “baby names” list on an early game. Of course in that era, the list was likely all common white-people names, of the variety you’d find on keychains, mugs, magnets, and license plates in the gift shops of the early 90s. They crowd-sourced additional names for later games through a phone-in hotline. Eventually they ended up with what shipped with this Barbie game. I’ll let you decide how many of these are trolls, but my suspicion is that there are a lot of them.

Minor Update: In 2007 someone stumbled across thousands of sound files of Barbie saying various names. This wasn’t the same dataset as this CD-ROM, and the sound quality is much worse. The whole thing got posted to YouTube as “Barbiephonic: Say My Name.” Between the droning list of names and the sound quality, it’s quite frightening to listen to for more than a couple of minutes.

You can grab a copy of the game’s CD-ROM from Archive.org. Apparently it plays well on Windows 95 and 98, but has gameplay-impacting quirks on later Windows operating systems.

This is the point in this blog post where we depart from the culture and dive into the tech. But if you’re interested in jumping directly to an extraction tool you can find barbie.py and instructions on how to use it on the BarbieExtract GitHub.

Someone had the idea of building a website to browse through and listen to all the names, which pulled us down a technical rabbit hole.

Examining the CD, three interesting game files catch the eye:

  • pinames.hug, 75MB : This appears to be the audio of all the spoken names. There is a short header followed by what appear to be 9,342 concatenated WAV files. This has been my main focus. More on this in a bit.

I haven’t analyzed the other two files beyond glancing at them in a hex editor, but these are my best initial educated guesses:

  • pinames.hix, 1MB : This might be a map from names to offsets in the hug file, to locate the audio for each name?
  • pinames.lst, 1MB : This might be a map from homonym names to their matching authoritative name/sound-file?

I’ve been told that putting all the audio clips into a single file (that giant hug file) was a common practice of the era. You get them in one long contiguous block. When authoring the CD-ROM, that file can be placed near the center of the disc, allowing for faster access than if it were at the edges or were separate files.

To my eyes, the overall file itself doesn’t appear to be compressed. If it were, I would expect the WAV headers to have been impacted, as the common strings and fields would be easy to de-dupe. That said, if we extract a few of these WAVs, by slicing the hug file at each header, there is certainly some weirdness going on that suggests compression of some sort. For example, looking at the first file’s header:

  • It appears to follow the standard WAV header specification.
  • It claims to be PCM.
  • It claims to be single channel (mono).
  • It claims to have a sample rate of 22,050.
  • It claims to use 8 bits per sample.

All of these seem reasonable, BUT:

  • It claims to have a file length of 19,682 bytes.
  • The actual file length is 9,135 bytes.
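The mismatch is easy to check mechanically. Here is a minimal Python sketch that unpacks the header fields listed above and compares the declared length with the real one. It assumes a canonical 44-byte PCM WAV header and treats the RIFF size field plus 8 as the "file length"; both are assumptions on my part, and the demo input is a synthetic stand-in built from the numbers above, not real game data.

```python
import struct

# Canonical 44-byte PCM WAV header. This layout is an assumption: real files
# can carry extra chunks, and the entries inside the hug file may differ.
WAV_HEADER = struct.Struct("<4sI4s4sIHHIIHH4sI")


def parse_wav_header(blob: bytes) -> dict:
    """Unpack the fixed header fields from the start of a WAV blob."""
    (riff, riff_size, wave, fmt_id, _fmt_size, audio_format, channels,
     sample_rate, _byte_rate, _block_align, bits, data_id, data_size) = \
        WAV_HEADER.unpack_from(blob)
    if (riff, wave, fmt_id, data_id) != (b"RIFF", b"WAVE", b"fmt ", b"data"):
        raise ValueError("not a canonical 44-byte WAV header")
    return {
        "is_pcm": audio_format == 1,        # format tag 1 means PCM
        "channels": channels,
        "sample_rate": sample_rate,
        "bits_per_sample": bits,
        "declared_total": riff_size + 8,    # RIFF size excludes the first 8 bytes
        "declared_data": data_size,
        "actual_total": len(blob),
    }


# Synthetic stand-in: a header claiming 19,682 bytes on a 9,135-byte blob.
header = WAV_HEADER.pack(b"RIFF", 19682 - 8, b"WAVE", b"fmt ", 16, 1, 1,
                         22050, 22050, 1, 8, b"data", 19682 - 44)
fake_wav = header + bytes(9135 - 44)
info = parse_wav_header(fake_wav)
print(info)
```

Run against a real slice of the hug file, the same function would show whether every entry has this declared-versus-actual gap or only some of them.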

Several theories have arisen in the Slack thread discussing the game:

  • The developer has used run-length encoding (RLE) in the past to compress audio.
  • This might actually be 4-bit audio instead of 8.
  • This might be using a non-standard audio codec.
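The 4-bit theory is the easiest one to sanity-check with arithmetic. If every stored byte held two 4-bit samples, the data would simply double in size when unpacked. Below is a rough Python sketch of that unpacking. The function name, the nibble order, and the scale-by-17 upconversion are all my own assumptions for illustration; this is a naive expansion, not the game's actual codec.

```python
def expand_nibbles(packed: bytes, high_first: bool = True) -> bytes:
    """Split each byte into two 4-bit samples and scale them to 8 bits.

    Illustrative only: the nibble order and the scaling are guesses.
    """
    out = bytearray()
    for byte in packed:
        hi, lo = byte >> 4, byte & 0x0F
        first, second = (hi, lo) if high_first else (lo, hi)
        out.append(first * 17)    # maps 0..15 onto 0..255
        out.append(second * 17)
    return bytes(out)


# Size check with a zero-filled stand-in for a 9,091-byte sample payload.
stored_samples = bytes(9091)
expanded = expand_nibbles(stored_samples)
print(len(stored_samples), "->", len(expanded))
```

Doubling a 9,091-byte payload gives 18,182 bytes. That lands near the size the header claims but does not match it, which hints that plain nibble packing is not the whole story.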

UPDATE: So. Spoiler alert. Alistair Buxton was able to determine that it uses a custom compression based around 4-bit chunks and also contributed a Python script to do extraction and decoding! I’m not sure how this works on Windows, but if you have a Linux machine or Mac and are comfortable with the command line, you should be able to do something like:

git clone https://github.com/BrianEnigma/BarbieExtract.git
cd BarbieExtract
make download
python3 -m pip install --user -r requirements.txt
./barbie.py dump

This grabs the project files, downloads the relevant CD-ROM files, installs some required Python packages, and performs the extraction. You’ll end up with the 9,342 WAV files in a folder called out. There are other ways to use barbie.py, such as printing a list of the canonical names and their homophones (as text or json).

Earlier Research/Exploration

The rest of this blog post is not as relevant now, given the Python script. I’m preserving it here in case folks want to see the (failed) ideas and methodologies I had around decoding the WAV files. Unless you’re REALLY interested in my attempts to analyze and decode the audio, you can stop reading here.

Before we go further, I should point out that I have a GitHub repo, BarbieExtract, where I’ve explored some of these theories. You’ll need to download the Barbie binary files yourself, but links and details are in the readme.

barbie_extract.cpp extracts the individual WAV files into an ./output folder. Currently, to keep things from getting overwhelming, it has a hard-coded throttle to only extract the first 50. You can alter or remove that limit in extractFiles().

If you don’t want to compile C++ code, the Makefile also has targets to extract the first WAV using the Unix dd command. It can also extract only the samples, without the WAV header. For example:

  • dd if=pinames.hug bs=1 skip=20 count=9135 of=test.wav
  • dd if=pinames.hug bs=1 skip=60 count=9091 of=test.bin
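If dd isn't handy (on Windows, say), the same byte-range slicing is a few lines of Python. This is a sketch of an equivalent, not something from the repo: extract_range is a hypothetical helper, and the demo runs against a throwaway filler file rather than the real pinames.hug.

```python
import tempfile
from pathlib import Path


def extract_range(src: Path, dst: Path, skip: int, count: int) -> int:
    """Copy `count` bytes starting at offset `skip`, like dd with bs=1."""
    with src.open("rb") as f:
        f.seek(skip)
        chunk = f.read(count)
    dst.write_bytes(chunk)
    return len(chunk)


# Demo on a synthetic stand-in, since the real file is not bundled here.
with tempfile.TemporaryDirectory() as tmp:
    fake_hug = Path(tmp) / "fake.hug"
    fake_hug.write_bytes(bytes(range(256)) * 40)   # 10,240 bytes of filler
    wav_len = extract_range(fake_hug, Path(tmp) / "test.wav", skip=20, count=9135)
    bin_len = extract_range(fake_hug, Path(tmp) / "test.bin", skip=60, count=9091)
    first_wav_byte = (Path(tmp) / "test.wav").read_bytes()[0]

print(wav_len, bin_len, first_wav_byte)
```

To use it on the real data, point the first argument at Path("pinames.hug") and keep the same skip and count values as the dd commands.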

As for the theories: to me, the RLE theory shows promise. PCM audio will often have long runs of the same value (such as zeros). Looking at a hex dump of the first WAV, we don’t see long runs of anything, but there does seem to be something funny going on with zeros. If you squint hard enough, it looks like maybe a 0x00 could be a sentinel value indicating RLE, and perhaps a pair, 0x00 0x00, could be an escaped single zero?

00002190: 2202 2220 2202 2000 2122 0201 2000 0020
000021a0: 0020 0200 2000 2120 0021 2fff 0721 0001
000021b0: 0111 1110 1101 1111 0112 1110 1010 0101
000021c0: 2000 0020 0222 0222 2220 2200 0222 1200
000021d0: 0210 2120 0020 0020 02ff f082 12ff f062
000021e0: 1200 1211 0111 1111 0011 1110 1010 0101
000021f0: 0101 0012 1021 2002 0220 2222 2202 2000
00002200: 2202 fff0 7212 0002 1202 0002 1212 12ff
00002210: f072 12ff f061 0011 1011 1100 1111 1010
00002220: 0100 1010 1000 1000 0021 2200 2220 2220
00002230: 2220 0000 2200 2000 0212 0020 0020 0021
00002240: 2121 2120 0002 1212 0012 1001 0101 1111
00002250: 0011 1101 1001 0100 1001 001f ff06 2002
00002260: 0220 2202 2202 2000 0220 0200 0200 0021
00002270: 202f ff0a 2120 12ff f0a1 0010 0110 1110
00002280: 0011 1110 0112 1100 1001 0001 2120 0212

The file rle_test.cpp makes several different attempts at exploring RLE methods. At the time of writing, these include:

  • A naïve decoder that literally assumes <length><character> through all of the audio samples. This produces something way too big for the first WAV (508,811 bytes).
  • RLE only at 0x00<length><character>, with the assumption that 0x00 0x00 is an escaped zero (equivalent to 0x00 0x01 0x00). This produces a file that’s in the ballpark, but still too short (12K instead of 19K).
  • Swapping the previous to 0x00<character><length>. This also gets us in the same ballpark (14K), but doesn’t make a ton of sense when looking at the actual file data. For instance 0x00 0x10 0x00 would indicate repeat 0x10 zero times?
  • Assuming 0x00<character><length> is always the case, with no escaping or special behavior around 0x00 0x00.
  • Assume 0x00 0x00 is just a plain 0x00 (but this seems weird because it’d be the same as 0x00 0x01), and assume 0x00 <length> is an RLE directive to insert that many zeros.
  • Assume 0x00 is just a regular pair of bytes and assume 0x00 0x00 is followed by an RLE length for inserting zeros.
  • The 0x00<length><characters> RLE scheme, but expand each 8-bit sample into two 4-bit samples (upconverted to 8 bits).
  • Simple expansion, without RLE compression, of each 8-bit sample into two 4-bit samples (upconverted to 8 bits). This actually gets us closest to the expected file size (18,228 bytes vs. the expected 19,682).
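To make one of these guesses concrete, here is a Python sketch of the second variant: 0x00 acts as a sentinel followed by a length and a value, and 0x00 0x00 stands for a single literal zero. It is a reimplementation for illustration, not the code from rle_test.cpp. The function name and the handling of a truncated directive at the end of the input are my own choices. Keep in mind this scheme turned out to be wrong for the Barbie audio.

```python
def rle_decode_zero_sentinel(data: bytes) -> bytes:
    """Decode the guessed scheme: 0x00 <length> <value>, with 0x00 0x00 = one zero."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte != 0x00:
            out.append(byte)                      # ordinary literal sample
            i += 1
        elif i + 1 < len(data) and data[i + 1] == 0x00:
            out.append(0x00)                      # escaped single zero
            i += 2
        elif i + 2 < len(data):
            length, value = data[i + 1], data[i + 2]
            out.extend(bytes([value]) * length)   # expand the run
            i += 3
        else:
            out.extend(data[i:])                  # truncated directive: copy as-is
            break
    return bytes(out)


# A literal, a run of three 0x7F, an escaped zero, then another literal.
encoded = bytes([0x21, 0x00, 0x03, 0x7F, 0x00, 0x00, 0x22])
decoded = rle_decode_zero_sentinel(encoded)
print(decoded.hex(" "))
```

Swapping the length and value bytes inside the run branch gives the third variant in the list, which makes it cheap to compare output sizes across guesses.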

So far, no luck on any of these. They all sound like garbage.

On the “exotic codec” front, I’ve done some basic exploration there. ffmpeg allows you to override the decode codec. I’ve taken advantage of this to make my copy of ffmpeg (from homebrew) dump a list of every audio decoder codec, and then feed that into subsequent decode commands. You can see that as a combination of the codecs.txt target in the Makefile and the force_decode_all.sh shell script. None of them produce anything listenable, including several 4-bit codecs. If it’s a rare codec, it doesn’t appear to be one that ffmpeg knows about.

I’m running short on ideas. There are plenty of examples of WAVs where the audio is almost listenable. You can hear the cadence and diction of words being pronounced, but it’s always obscured by digital static, sometimes lightly, sometimes heavily. My educated guess is that in the audible parts, we’re hearing the waveforms that can’t be (RLE?) compressed and that the static comes from the brief pauses that can be compressed.

I have a little bit of analysis happening in analyze.cpp. A frequency analysis shows that zeros are the most often seen byte. The analyzer also gives a printout of every place a zero is found in the samples and the three bytes directly before/after it, to help in discerning patterns. It hasn’t helped me, but maybe you’ll see something.
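For anyone who would rather poke at this in Python than C++, the same two ideas fit in a few lines. This is a sketch of the approach described above, not a port of analyze.cpp, and the function names are hypothetical. The demo input is the first row of the hex dump shown earlier in this post.

```python
from collections import Counter


def byte_frequencies(samples: bytes, top: int = 5) -> list:
    """Return the `top` most common byte values with their counts."""
    return Counter(samples).most_common(top)


def zero_contexts(samples: bytes, radius: int = 3) -> list:
    """List (offset, bytes_before, bytes_after) for every 0x00 in the samples."""
    contexts = []
    for i, value in enumerate(samples):
        if value == 0:
            before = samples[max(0, i - radius):i]
            after = samples[i + 1:i + 1 + radius]
            contexts.append((i, before, after))
    return contexts


# First row of the hex dump quoted earlier in the post.
row = bytes.fromhex("2202 2220 2202 2000 2122 0201 2000 0020")
print(byte_frequencies(row))
for offset, before, after in zero_contexts(row):
    print(offset, before.hex(" "), "[00]", after.hex(" "))
```

A single 16-byte row is far too small to say anything about the overall distribution. The point is only to show the shape of the output you would scan for patterns.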

At this point, we’re well beyond my experience and knowledge of audio files and audio codecs. Although I work in video, my knowledge of audio starts becoming thin once we go beyond the container format and header structures. Working with the actual audio samples is something I’ve never had to worry too much about because I either pass them through as opaque data or blindly throw them into an audio decoder library.

Do you have experience in audio files? In 90s-era CD-ROM games? Or even reverse-engineering the code to discover how the audio codec works? Please feel free to contribute in the comments or the GitHub repo.

Posted in: Games Projects Software

Published by

Brian Enigma

Brian Enigma is a Portlander, manipulator of atoms & bits, minor-league blogger, and all-around great guy. He typically writes about the interesting “maker” projects he's working on, but sometimes veers off into puzzles, software, games, local news, and current events.

4 thoughts on “The Mystery of Detective Barbie’s Audio”

  1. Many years ago I found a file for download that purported to be a woman saying thousands of names from a Barbie game. I’m pretty sure it must have been this dataset. So someone did it at one point… wish I could find the file for you.

  2. I’m in awe of this achievement. That ridiculous list of names has lived in my head all week, to the point where I was picking at the source files. There’s no way in hell I could’ve cracked it so imagine my delight when I found that it’s been done!! this rules
