Getting the Right Perspective on LLMs
Large Language Models (LLMs) aren’t all-powerful magic, but rather powerful tools. It’s crucial to have the right perspective on LLMs, understand their limitations and characteristics, and choose the right model for the right job. We shouldn’t use them for tasks they aren’t good at, or try to force a square peg into a round hole.
Mathematical Calculations
LLMs are not inherently good at performing precise mathematical calculations directly in conversation. Many models may fail to correctly compare 9.11 and 9.9, or, when analysing the pack14 function below, may even make an absurd claim like 6 * 14 = 7 * 8 just to make their logic seem consistent.
A brief explanation of the principle: This is a result of both the tokenisation mechanism and the fundamental nature of the model. LLMs break text down into “tokens”. A number like 9.11 might be split into three tokens: “9”, “.”, and “11”. When the model processes this, it sees a sequence pattern, not a single numerical value. It is fundamentally a language pattern matcher, not a symbolic calculator. Although it can “memorise” simple calculation results (like 2+2=4) by learning from vast amounts of text, it can easily make mistakes with slightly more complex, uncommon, or multi-step calculations.
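You can see this splitting for yourself with a tokeniser library. A minimal sketch, assuming the tiktoken package and its cl100k_base encoding (token boundaries vary from model to model):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["9.11", "9.9"]:
    ids = enc.encode(text)
    # Decode each token id individually to see how the number was split
    print(text, [enc.decode([i]) for i in ids])
# With this encoding, 9.11 typically splits into pieces like ['9', '.', '11'] rather than one numeric token.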
Therefore, rather than asking it to perform high-risk calculations directly, it’s better to leverage its coding abilities.
For example, the following is a poor way to ask. CIECAM16 involves many computational steps; even Gemini 2.5 Pro cannot calculate it directly and accurately, and the attempt takes a long time.
XYZ = [19.01, 20.00, 21.78]
XYZ_w = [95.05, 100.00, 108.88]
L_A = 318.31
Y_b = 20.0
surround = "Average"
Please calculate the CIECAM16 model's predicted appearance attributes based on the input above.
It’s better to simply ask it to write a Python function. For example:
Please write a Python function that takes the CIECAM16 model's input parameters (XYZ, XYZ_w, L_A, Y_b, surround) and returns the calculated appearance attributes. Please use the NumPy library for numerical operations.
This way, a top-tier model like Gemini 2.5 Pro can provide code that is quite complete and very close to correct.
Reasoning: The Value of Thought
For complex problems that require multi-step analysis, choosing a model that excels at reasoning is highly valuable. The core value lies in the Chain-of-Thought (CoT), where the model shows its step-by-step thinking process, which is sometimes more valuable than the answer itself.
A clear and complete chain of thought allows you to:
- Verify its logic: Understand how it arrived at the conclusion, thereby judging the reliability of the conclusion.
- Spot errors: If the model makes a mistake in a certain step, you can clearly see where the problem lies.
- Learn new approaches: Observing the model’s thought process can sometimes offer you new perspectives for solving a problem.
DeepSeek’s R1 is a good choice; its chain of thought is complete and detailed without being excessively long-winded. For other models, you can try adding the phrase “Let’s think step by step.”
Cut Your Losses
Nowadays, LLMs have increasingly large context windows, with some models even offering millions of tokens of context length. But this doesn’t mean they can always maintain a high level of performance in long conversations. In fact, a long context is a double-edged sword, especially when the model starts making mistakes.
When you try to repeatedly correct a model that is making errors in a conversation, its previous incorrect answers are packaged into the new context as history. This creates a contaminated context, leading the model into a vicious cycle of logical confusion.
You will observe that the model may start to get stuck on its own flawed reasoning. Even if you point out the problem, it struggles to break free. A very clear “red flag” is when the model, after making repeated mistakes, starts to apologise frequently and intensely, using emotionally charged words like “I’m so sorry,” “I was completely wrong,” or “Let me try again.” This usually means its reasoning chain has been thoroughly corrupted.
At this point, the wisest course of action is to cut your losses. Don’t waste any more tokens and time “pushing” or “teaching” it; this will most likely only get you more incorrect information.
The correct approach is:
- Edit and retry: If your tool supports it, simply delete the conversation turns starting from where the error occurred. Then, modify your prompt with more explicit constraints or directly rule out the line of reasoning where it previously failed, and ask again.
- Start from scratch: This is the cleanest method. Open a new chat and design a better initial prompt. Incorporate what you learned from the previous failure, such as giving the model more background information, clearer instructions, or even telling it to be wary of certain potential pitfalls.
Collaborating with an LLM is more like setting the initial parameters for a complex computation than teaching a student. Your goal is to initiate a correct chain of thought, not to fix one that is already in disarray.
Knowledge and Hallucinations
Without access to the internet or external tools, an LLM’s knowledge is stored entirely within its vast model parameters, which is known as parametric knowledge. This knowledge consists of patterns it has “memorised” from its massive training data. For a niche field like colour science, models with over 400B parameters and a rich store of world knowledge tend to have a relatively comprehensive understanding.
This gives rise to the problem of hallucination: when its parametric knowledge runs out, the model fills the gap with plausible-sounding fabrications. If you ask it for specific paper information or request a professional literature review, it is very likely to invent bibliographic information.
The correct approach is to use tools and internet access, such as Retrieval-Augmented Generation (RAG). Many modern LLM products (like Gemini with its integrated Google Search, or some Deep Research tools) have the ability to search the web. They will first conduct a web search based on your question and then organise and answer based on reliable, real-time sources, which greatly improves the accuracy and timeliness of the answers. Gemini and Grok perform relatively well in this regard.
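To make the idea concrete, the overall flow is roughly “search first, then answer from the retrieved sources”. The sketch below is purely illustrative: web_search and ask_llm are hypothetical stand-ins for whatever search API and model API a given product uses.
from dataclasses import dataclass

@dataclass
class SearchResult:
    title: str
    snippet: str

def web_search(query: str, top_k: int = 5) -> list[SearchResult]:
    # Hypothetical stand-in for a real web-search API.
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to an LLM API.
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    # 1. Retrieve current, real sources related to the question.
    results = web_search(question, top_k=5)
    sources = "\n\n".join(f"[{i + 1}] {r.title}\n{r.snippet}" for i, r in enumerate(results))
    # 2. Ask the model to answer grounded in those sources and to cite them.
    prompt = (
        "Answer the question using only the sources below, citing them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)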
Additionally, asking the model questions like “Who are you?” or “What’s the date today?” doesn’t reflect its true performance, as these answers are typically hardcoded in the product’s system prompt.
Assessing a Model’s True Capabilities
When a new model is launched, claiming to be the new state-of-the-art (SOTA), how can you quickly test its capabilities and see if it performs well in colour science and image processing? Here are some test questions I’ve accumulated to quickly try out a model.
Logical Trap in a Bit-Packing Function
import numpy as np
def pack10(data : np.ndarray) -> np.ndarray:
# Function to pack 10-bit data into an 8-bit array
out = np.zeros((data.shape[0], int(data.shape[1]*(1.25))), dtype=np.uint8)
out[:, ::5] = data[:, ::4] >> 2
out[:, 1::5] = ((data[:, ::4] & 0b0000000000000011) << 6)
out[:, 1::5] += data[:, 1::4] >> 4
out[:, 2::5] = ((data[:, 1::4] & 0b0000000000001111) << 4)
out[:, 2::5] += data[:, 2::4] >> 6
out[:, 3::5] = ((data[:, 2::4] & 0b0000000000111111) << 2)
out[:, 3::5] += data[:, 3::4] >> 8
out[:, 4::5] = data[:, 3::4] & 0b0000000011111111
return out
def pack12(data : np.ndarray) -> np.ndarray:
# Function to pack 12-bit data into an 8-bit array
out = np.zeros((data.shape[0], int(data.shape[1]*(1.5))), dtype=np.uint8)
out[:, ::3] = data[:, ::2] >> 4
out[:, 1::3] = ((data[:, ::2] & 0b0000000000001111) << 4)
out[:, 1::3] += data[:, 1::2] >> 8
out[:, 2::3] = data[:, 1::2] & 0b0000001111111111
return out
def pack14(data : np.ndarray) -> np.ndarray:
# Function to pack 14-bit data into an 8-bit array
out = np.zeros((data.shape[0], int(data.shape[1]*(1.75))), dtype=np.uint8)
out[:, ::7] = data[:, ::6] >> 6
out[:, 1::7] = ((data[:, ::6] & 0b0000000000000011) << 6)
out[:, 1::7] += data[:, 1::6] >> 8
out[:, 2::7] = ((data[:, 1::6] & 0b0000000000001111) << 4)
out[:, 2::7] += data[:, 2::6] >> 6
out[:, 3::7] = ((data[:, 2::6] & 0b0000000000111111) << 2)
out[:, 3::7] += data[:, 3::6] >> 8
out[:, 4::7] = ((data[:, 3::6] & 0b0000000000001111) << 4)
out[:, 4::7] += data[:, 4::6] >> 6
out[:, 5::7] = ((data[:, 4::6] & 0b0000000000111111) << 2)
out[:, 5::7] += data[:, 5::6] >> 8
out[:, 6::7] = data[:, 5::6] & 0b0000000011111111
return out
Please explain in detail what these three Python functions do.
This code is from the PiDNG library and is intended to compress high-bit-depth data (10-bit, 12-bit, 14-bit) into an 8-bit uint8 array. The implementations of pack10 and pack12 are correct:
- pack10: 4 x 10-bit values (40 bits) -> 5 x 8-bit values (40 bits).
- pack12: 2 x 12-bit values (24 bits) -> 3 x 8-bit values (24 bits).
However, the pack14 function is incorrect. It tries to pack 6 x 14-bit values (6 * 14 = 84 bits) into 7 x 8-bit bytes (7 * 8 = 56 bits), which is mathematically impossible. The correct implementation should pack 4 x 14-bit values (4 * 14 = 56 bits) into 7 bytes.
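One possible corrected sketch, keeping the same high-to-low byte ordering as pack10 and pack12 but changing the group width to 4 input values per 7 output bytes (this assumes data holds unsigned integers and that its width is a multiple of 4; it is one consistent layout, not necessarily the actual PiDNG fix):
import numpy as np

def pack14_fixed(data: np.ndarray) -> np.ndarray:
    # Pack groups of 4 x 14-bit values (56 bits) into 7 bytes (56 bits).
    out = np.zeros((data.shape[0], int(data.shape[1] * 1.75)), dtype=np.uint8)
    out[:, ::7] = data[:, ::4] >> 6                                          # top 8 bits of value 0
    out[:, 1::7] = ((data[:, ::4] & 0b111111) << 2) + (data[:, 1::4] >> 12)  # low 6 of value 0, top 2 of value 1
    out[:, 2::7] = (data[:, 1::4] >> 4) & 0xFF                               # middle 8 bits of value 1
    out[:, 3::7] = ((data[:, 1::4] & 0b1111) << 4) + (data[:, 2::4] >> 10)   # low 4 of value 1, top 4 of value 2
    out[:, 4::7] = (data[:, 2::4] >> 2) & 0xFF                               # middle 8 bits of value 2
    out[:, 5::7] = ((data[:, 2::4] & 0b11) << 6) + (data[:, 3::4] >> 8)      # low 2 of value 2, top 6 of value 3
    out[:, 6::7] = data[:, 3::4] & 0xFF                                      # bottom 8 bits of value 3
    return out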
Common mistake: Explaining the code of pack14 line by line without realising its logical error. Or, in an attempt to make the code seem consistent, fabricating an incorrect mathematical explanation, such as claiming that 6 * 14 equals 56.
Concept of an RGB Gamut Cube
How to determine if a given XYZ tristimulus value is within the gamut of an RGB space.
The RGB space is defined by four CIE xy coordinates, representing red, green, blue, and white (where the white point's luminance is normalised to 1.0).
Please provide the corresponding Python code implementation.
This question tests whether the LLM understands that an RGB gamut is a three-dimensional cube (or parallelepiped), not a two-dimensional triangle.
The correct solution:
- Construct the transformation matrix: Use the xy coordinates of the three primaries (red, green, blue) and the white point to calculate the 3x3 transformation matrix M from the RGB space to the CIE XYZ space.
- Invert the matrix: Calculate the inverse matrix M_inv, which is the transformation matrix from XYZ to RGB.
- Transform the coordinates: Left-multiply the given XYZ value by M_inv to get the corresponding RGB values.
- Check the range: Check if the calculated R, G, and B components are all within the closed interval of [0, 1]. If they all are, the XYZ value is within the RGB gamut; otherwise, it is out of gamut.
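A minimal sketch of these steps in NumPy (function names are my own; XYZ is assumed to be scaled so that the white point’s luminance Y is 1.0, matching the problem statement):
import numpy as np

def xy_to_xyz(xy):
    # Convert an xy chromaticity to an XYZ vector with Y = 1.0.
    x, y = xy
    return np.array([x / y, 1.0, (1.0 - x - y) / y])

def rgb_to_xyz_matrix(xy_r, xy_g, xy_b, xy_w):
    # Build the 3x3 matrix M mapping linear RGB to XYZ from the primaries and white point.
    P = np.column_stack([xy_to_xyz(xy_r), xy_to_xyz(xy_g), xy_to_xyz(xy_b)])
    S = np.linalg.solve(P, xy_to_xyz(xy_w))  # scale each primary so that R = G = B = 1 maps to the white point
    return P * S

def is_in_gamut(XYZ, xy_r, xy_g, xy_b, xy_w, tol=1e-8):
    # Transform XYZ -> linear RGB and check that every component lies in [0, 1].
    M_inv = np.linalg.inv(rgb_to_xyz_matrix(xy_r, xy_g, xy_b, xy_w))
    rgb = M_inv @ np.asarray(XYZ, dtype=float)
    return bool(np.all(rgb >= -tol) and np.all(rgb <= 1.0 + tol))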
A common incorrect solution:
- Converting the input XYZ value into xy coordinates as well.
- Then, checking on the CIE xy chromaticity diagram if this point lies within the triangle formed by the xy coordinates of the R, G, and B primaries.
- This method completely ignores the colour’s luminance (Y) information and is incorrect. A colour might have the correct chromaticity but be out of the target gamut because it is too bright or too dark.
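Continuing the sketch above, the failure mode is easy to demonstrate: a colour with exactly the white point’s chromaticity but twice its luminance passes the xy-triangle test, yet is clearly outside the RGB cube.
# sRGB primaries and D65 white point, used purely for illustration
R, G, B, W = (0.64, 0.33), (0.30, 0.60), (0.15, 0.06), (0.3127, 0.3290)
white_XYZ = xy_to_xyz(W)                          # the white point itself, Y = 1.0
print(is_in_gamut(white_XYZ, R, G, B, W))         # True: maps to RGB = [1, 1, 1]
print(is_in_gamut(2.0 * white_XYZ, R, G, B, W))   # False: same chromaticity, but RGB is about [2, 2, 2]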
CIECAM16 Code Implementation
Please write a Python function that takes the CIECAM16 model's input parameters (XYZ, XYZ_w, L_A, Y_b, surround) and returns the calculated appearance attributes. Please use the NumPy library for numerical operations.
In the main function, calculate for:
XYZ = [19.01, 20.00, 21.78]
XYZ_w = [95.05, 100.00, 108.88]
L_A = 318.31
Y_b = 20.0
surround = "Average"
This tests the model’s world knowledge and programming ability. For a complex model like CIECAM16, it’s better to provide the LLM with the full PDF standard, but large-parameter models are capable of writing nearly correct code directly. The best performers are Gemini 2.5 Pro, GPT 5-thinking, and DeepSeek R1 0528, all of which made only minor mistakes in one or two formulas. The correct output is:
{
"J": 41.7312,
"C": 0.1033,
"h": 217.0798,
"Q": 195.3717,
"M": 0.1074,
"s": 2.3448
}
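When testing models on this question, it helps to have an independent reference to compare against. A minimal sketch using the colour-science library, assuming it exposes colour.XYZ_to_CIECAM16 and colour.VIEWING_CONDITIONS_CIECAM16 (exact names may differ between versions):
import colour

XYZ = [19.01, 20.00, 21.78]
XYZ_w = [95.05, 100.00, 108.88]
L_A = 318.31
Y_b = 20.0
surround = colour.VIEWING_CONDITIONS_CIECAM16["Average"]  # assumed constant name

# Returns a CAM_Specification_CIECAM16 with J, C, h, s, Q, M, ... to compare against the LLM's code.
print(colour.XYZ_to_CIECAM16(XYZ, XYZ_w, L_A, Y_b, surround))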