What is a JPG?

Strictly speaking, JPEG is a compression method that describes how to turn an image into a byte stream. What we commonly call a JPG file is actually a JFIF file, whose binary structure is a sequence of segments, each introduced by a “marker”. As in everyday usage, though, the rest of this text uses these terms interchangeably.

Simply put, a JPEG file is a collection of marker segments and compressed image data. Each marker segment begins with a specific marker that defines the purpose of that segment’s data.

For example, a file begins with a Start of Image (SOI) marker, the two bytes FF D8 (hexadecimal), indicating that what follows is a JPG file. The corresponding End of Image (EOI) marker is FF D9, but because of various extensions to the JPEG format, the EOI marker doesn’t necessarily appear at the very end of the file.

After the SOI marker come some APP marker segments that provide additional information. Their identifier is the byte pair FF Ex, where x is a hexadecimal digit from 0 to F (APP0 through APP15). The two bytes following the identifier record the length of the segment as a big-endian integer; this length includes the two length bytes themselves but not the marker. The content follows. Note that APP segments are not necessarily in numerical order or unique; for instance, a file can have two APP1 segments.
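The segment layout described above can be walked with a few lines of Python. This is a minimal sketch, not the author’s jpeg_parser.py: the function name iter_header_segments is hypothetical, and it assumes a well-formed file with no fill bytes between segments. It stops at SOS, because the data after that segment is no longer length-prefixed.

```python
import struct


def iter_header_segments(data: bytes):
    """Yield (marker, payload) for each segment from SOI up to and including SOS.

    Hypothetical helper for illustration; assumes no FF fill bytes between segments.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        # The length is big-endian and counts its own two bytes, but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        yield marker, data[pos + 4:pos + 2 + length]
        if marker == 0xDA:  # SOS: entropy-coded data follows, stop here
            return
        pos += 2 + length
```

Calling list(iter_header_segments(data)) on the bytes of a file gives the marker byte (for example 0xE1 for APP1) together with each segment’s content.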

APP0 is the JFIF application segment. Although it’s supposed to be mandatory, many files don’t have it. APP1 usually stores EXIF or XMP information, used for camera parameters or image-related metadata. APP2 typically holds an ICC profile. In the UltraHDR specification, HDR-related information is stored as XMP metadata in an APP1 segment.
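Because several kinds of data share the same APP number, a parser usually tells them apart by the identifier string at the start of the segment content. The sketch below checks the commonly used identifiers for the four cases mentioned above; the function name classify_app_payload is hypothetical, and anything that does not match is simply reported as unknown.

```python
# Commonly used identifier prefixes, keyed by APP marker byte.
APP_SIGNATURES = [
    (0xE0, b"JFIF\x00", "JFIF"),
    (0xE1, b"Exif\x00\x00", "EXIF"),
    (0xE1, b"http://ns.adobe.com/xap/1.0/\x00", "XMP"),
    (0xE2, b"ICC_PROFILE\x00", "ICC profile"),
]


def classify_app_payload(marker: int, payload: bytes) -> str:
    """Return a label for an APP segment based on its identifier prefix.

    Hypothetical helper: `marker` is the second marker byte (0xE0 to 0xEF),
    `payload` is the segment content after the two length bytes.
    """
    for app_marker, prefix, label in APP_SIGNATURES:
        if marker == app_marker and payload.startswith(prefix):
            return label
    return "unknown"
```

This is also why two APP1 segments in one file are not a contradiction: one may carry EXIF and the other XMP.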

Similarly, the quantization tables and Huffman tables generated during JPEG compression are also placed in marker segments: DQT (FF DB) defines a quantization table, and DHT (FF C4) defines a Huffman table.

Then comes the Start of Frame 0 (SOF0) marker for baseline DCT, FF C0, which in practice usually appears after DQT. It stores the image’s width, height, bit depth, and other information. After it comes the Start of Scan (SOS) marker, FF DA; the tables a scan relies on must already have been defined by the time its SOS appears. The encoded image data immediately follows the SOS segment, so the two length bytes in SOS are very important: they define the boundary between the metadata and the actual image data.
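As an illustration of what SOF0 stores, the first six bytes of its content are the sample precision (bit depth), the height, the width, and the number of colour components, all big-endian. A minimal sketch, with the hypothetical function name parse_sof0 and the assumption that the payload passed in is the segment content after the two length bytes:

```python
import struct


def parse_sof0(payload: bytes) -> dict:
    """Decode the fixed-size header fields of an SOF0 segment.

    Hypothetical helper; `payload` is the segment content after the length bytes.
    """
    if len(payload) < 6:
        raise ValueError("SOF0 payload is too short")
    # 1 byte precision, 2 bytes height, 2 bytes width, 1 byte component count.
    precision, height, width, components = struct.unpack(">BHHB", payload[:6])
    return {
        "bit_depth": precision,
        "height": height,
        "width": width,
        "components": components,
    }
```

Note that the height comes before the width. The per-component sampling factors that follow these six bytes are not decoded here.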

Within the compressed image data, bytes with the value FF can occur naturally. To prevent them from being interpreted as markers, the encoder inserts a 00 byte after each one (byte stuffing), and the decoder drops it again. At the end of the compressed data, the EOI marker FF D9 signals the end of the image. Other data can still follow it, however.
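Since the compressed data has no length field, the only way to find where it stops is to scan for the next FF that is not part of the data. The sketch below does that under a few stated assumptions: FF 00 is a stuffed byte, FF D0 to FF D7 are restart markers that belong to the scan, and FF FF is treated as a fill byte. The function name find_scan_end is hypothetical.

```python
def find_scan_end(data: bytes, start: int) -> int:
    """Return the offset of the first real marker at or after `start`, or -1.

    Hypothetical helper; `start` should point just past the SOS segment.
    """
    pos = start
    while True:
        pos = data.find(b"\xff", pos)
        if pos == -1 or pos + 1 >= len(data):
            return -1
        following = data[pos + 1]
        if following == 0x00 or 0xD0 <= following <= 0xD7:
            pos += 2  # stuffed byte or restart marker: still scan data
        elif following == 0xFF:
            pos += 1  # fill byte: re-examine the next FF
        else:
            return pos  # a real marker such as EOI, DHT, or another SOS
```

In a baseline file with a single scan the marker found this way is normally EOI; in a file with several scans it can be another table or SOS segment, so the caller still has to look at which marker was found.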

Multi-frame JPEGs

In some files, you can find another SOI after an EOI, followed by another complete JFIF structure. This simple concatenation allows several complete JPEGs to be stored in one file. Some of the extra images serve as a gain map for HDR, while others are separate low-resolution images for quick previews on the camera. The biggest advantage of concatenation is backward compatibility: image viewers that don’t recognise multiple frames simply read the first image without throwing an error.

Note that “multi-frame” here does not refer to a thumbnail. Although a thumbnail also has a complete JPG structure, it’s usually placed inside the APP0 (JFIF) or APP1 (EXIF) segment as part of the main image. Therefore, you can’t split multi-frame JPEGs simply by searching for SOI and EOI markers, because you’ll be misled by the markers within the thumbnail. In a multi-frame structure, the next SOI doesn’t necessarily follow immediately after an EOI, so you can’t just search for FF D9 FF D8 to detect a multi-frame structure either.

If you just want a quick way to identify multi-frame structures in a JPG file, one strategy is to use a stack. Scan from the beginning; when you find an SOI, push it onto the stack. When you find an EOI, pop the SOI at the top of the stack and pair the two. This handles the nesting caused by thumbnails. However, other marker segments may well contain byte sequences identical to the SOI and EOI markers, so this method is not very reliable and isn’t recommended.
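The stack strategy can be sketched as follows. This is an illustration of the idea rather than the author’s script, the function name pair_soi_eoi is hypothetical, and it shares the weakness described above: any FF D8 or FF D9 that happens to occur inside a segment is treated as a marker.

```python
def pair_soi_eoi(data: bytes) -> list:
    """Pair SOI and EOI markers with a stack.

    Returns (start, end, depth) tuples, where `end` is the offset just past
    the EOI and depth 0 means a top-level image. Hypothetical helper.
    """
    stack = []
    pairs = []
    pos = 0
    while True:
        pos = data.find(b"\xff", pos)
        if pos == -1 or pos + 1 >= len(data):
            break
        following = data[pos + 1]
        if following == 0xD8:  # SOI: remember where this image starts
            stack.append(pos)
            pos += 2
        elif following == 0xD9:  # EOI: close the most recently opened image
            if stack:
                start = stack.pop()
                pairs.append((start, pos + 2, len(stack)))
            pos += 2
        else:
            pos += 1
    return pairs
```

Filtering the result for depth 0 gives the candidate top-level frames; entries with a greater depth are nested images such as thumbnails.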

There is a standard for multi-frame JPG files called MPF (Multi-Picture Format), developed by CIPA, which specifies the size and offset of subsequent frames in the APP2 segment of the first JFIF. However, this standard is rarely followed, and most cases are just simple, unmarked concatenations.

Other Data

In JPG images taken with an OPPO phone, you can find two complete JFIF structures. After the second EOI, you can still find some other data, including the phone model name, a JSON snippet, and so on.

[
  {
    "length": 4,
    "name": "private.emptyspace",
    "offset": 51,
    "version": 1
  },
  {
    "length": 47,
    "name": "watermark.device",
    "offset": 47,
    "version": 1
  }
]

These are likely fields used for handling watermarks in the phone’s built-in gallery app. This proprietary data can also interfere with logic that relies solely on the EOI marker to determine the end of the file.

Code

02 JPEG Structure

jpeg_parser.py: To extract data needed for image processing and colour science from JPEGs, and considering that existing Python libraries don’t handle concatenated multi-frame JPEGs well, I wrote a simple JPEG parsing tool. It can currently parse multi-frame JPEGs correctly and extract XMP from the APP1 segment of each frame.

check_soi_eoi.py: A script that uses a stack-based approach to pair SOI and EOI markers. It can correctly find all JFIF structures within a file without parsing other markers, but it cannot handle cases where identical byte codes might appear in other marker segments.

Next, we will use this information to explore the encoding and decoding of HDR images stored in the JPEG format.