Lecture 13

Measure naming accuracy over time

Before this lecture, you should know how to run a visibility audit from lecture 3, why stale evidence can survive from lecture 6, how entity clarity works from lecture 7, how source-level fixes are planned from lecture 10, and why browsing and memory-shaped answers must be tested separately from lecture 11. We are now moving from diagnosis and repair into careful tracking.

A Brussels lawyer once showed me three screenshots from three different days. In the first, ChatGPT did not mention the firm at all. In the second, it named the firm but placed it beside relocation consultants. In the third, it gave the right city and service category, then added a strange sentence about “corporate visa sponsorship” that did not belong on the page. The lawyer wanted to know whether the work was improving. I could not answer from three screenshots thrown into a folder like loose receipts.

This is a recurrent pattern in small-firm AI visibility work: people collect examples, not measurements. They save the answer that feels encouraging, ignore the awkward one, and retest with a slightly kinder prompt. Nobody is trying to cheat. They are doing what busy legal teams do when the machine behaves like a clerk with a shifting memory. Lecture 13 is where we slow that down and build a measurement habit that is boring enough to be useful.

A mention is not the same as accurate representation

When a boutique immigration firm first appears in a ChatGPT answer, the room often relaxes. The firm is named. That feels like proof. But a named firm can still be misplaced, softened, over-broadened or attached to the wrong city. For regulated services, a half-correct mention may be worse than silence because it carries confidence while bending the facts.

Naming accuracy: Whether ChatGPT uses the correct firm name, city, category and service description. The check matters because a named-but-wrong answer can mislead a client before the firm ever sees the inquiry.

Consider a teaching example. A user asks, “Which small Belgian immigration law firm can help an English-speaking spouse with family reunification near Brussels?” ChatGPT names the target firm, but calls it “a relocation and visa support practice in Antwerp.” The name is present. The city is wrong for the user’s question. The category is blurred. The client problem is close, but the description makes the firm look like an adjacent provider rather than a law practice. That is not a clean success.

This is why the measurement cannot be a single yes-or-no field called “mentioned.” I still record whether the firm appears, because omission matters. But after that, I want separate observations: name, city, legal category, service description, language fit, and the visible or likely source clues. A small error in one field can explain why an answer sends the user toward the nearest stronger neighbour instead of the firm that actually fits.
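
To make the separate observations concrete, here is a minimal Python sketch that checks the teaching example field by field. The names FirmFacts and check_naming, the placeholder firm name "Example Firm" and the simple exact-match rule are my assumptions for illustration. In practice a person reads the answer and writes down the observed values by hand, and a blurred category needs human judgement rather than string comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirmFacts:
    """The facts we expect, or the facts an answer actually stated."""
    name: str
    city: str
    category: str


def check_naming(expected: FirmFacts, observed: FirmFacts) -> dict[str, bool]:
    """Compare each field on its own, ignoring case and outer whitespace."""
    def same(a: str, b: str) -> bool:
        return a.strip().casefold() == b.strip().casefold()

    return {
        "name": same(expected.name, observed.name),
        "city": same(expected.city, observed.city),
        "category": same(expected.category, observed.category),
    }


# Hypothetical values modelled on the teaching example.
expected = FirmFacts("Example Firm", "Brussels", "immigration law firm")
observed = FirmFacts("Example Firm", "Antwerp", "relocation and visa support practice")

result = check_naming(expected, observed)
print(result)  # {'name': True, 'city': False, 'category': False}
```

The single "mentioned" field would have recorded this answer as a success. The three separate fields show one match and two misses.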

Build a measurement set, not a pile of screenshots

A measurement set is a stable group of prompts and fields used repeatedly for comparison. It matters because visibility only becomes readable when the test conditions stay mostly still. The word “stable” is doing heavy work here. If every retest uses a new prompt, a new language and a new answer mode, the results become weather, not measurement.

Measurement set: A stable group of prompts and fields used repeatedly for comparison. It should be small enough that the firm will actually use it. For one boutique immigration practice, I would rather see eight prompts tested with care every month than sixty prompts tested twice and then abandoned. The point is not to imitate a search dashboard. The point is to preserve the same questions long enough to see whether the answer pattern changes.

The first prompts should come from real client language, but cleaned for repeatability. One prompt may ask for a Dutch-speaking immigration lawyer in Antwerp for family reunification. Another may ask for a Brussels firm that helps English-speaking clients with Belgian residence permits. A third may ask for a boutique practice rather than a large full-service firm. Each prompt should test a reason the firm should be placeable: jurisdiction, client problem, city, language, or service boundary.

Do not make every prompt flattering to the firm. If the firm only tests prompts that contain its exact preferred words, the measurement set becomes a mirror with good lighting. Add a few prompts that reflect messy client language: “visa help,” “moving spouse to Belgium,” “lawyer for residence card,” or the Dutch and French wording clients actually use. The errors in those answers often reveal which public evidence is doing the pulling.

For each prompt, record the same fields. Was the firm named? Was the name exact? Which city was used? Which category was used? Was the service description accurate? Did the answer mention a larger or clearer neighbour first? Did it appear to use browsing? Were source clues visible? Which language was tested? Which date? The form can be simple. The discipline is the expensive part.
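
The form really can be simple. As one possible shape, the sketch below turns the questions above into a fixed record and writes it out as CSV text that any spreadsheet can open. The class name Observation, the field names, the prompt identifier and the sample values are all hypothetical; adapt them to the firm's own wording.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Observation:
    """One answer to one prompt on one date (field names are illustrative)."""
    test_date: str             # ISO date, e.g. "2025-01-15"
    prompt_id: str             # stable identifier for the prompt
    language: str              # language the prompt was asked in
    firm_named: bool
    name_exact: bool
    city_used: str
    category_used: str
    description_accurate: bool
    neighbour_first: bool      # a larger or clearer neighbour came first
    browsing_used: bool        # the answer appeared to use browsing
    source_clues: str          # visible or likely sources, free text


def to_csv(rows: list[Observation]) -> str:
    """Serialise observations with one fixed header so rounds stay comparable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Observation)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Invented sample row, not a real test result.
sample = Observation(
    test_date="2025-01-15", prompt_id="brussels-family-en", language="en",
    firm_named=True, name_exact=True, city_used="Antwerp",
    category_used="relocation and visa support practice",
    description_accurate=False, neighbour_first=False,
    browsing_used=True, source_clues="old directory listing",
)
csv_text = to_csv([sample])
print(csv_text)
```

Because the header comes from the record definition, every round carries the same columns in the same order, which is most of what makes later comparison possible.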

Separate movement from noise

Representation tracking: Repeating defined tests over time to see whether the firm is named and described more accurately. Notice the phrase “over time.” A single good answer is not a trend. A single bad answer is not a disaster. ChatGPT answers can shift because of prompt wording, language, browsing availability, source freshness, or internal behaviour we cannot inspect from outside.

This makes representation tracking a little unsatisfying at first. You may fix a source-level error, retest, and see no movement. You may do nothing, retest, and see a better answer. The data are thin. Still, repeated defined tests give you something better than mood. They show whether a failure keeps returning under similar conditions.

A composite scenario: Firm A, the Antwerp-linked practice, corrects an old directory category and rewrites one thin service page. In the next test round, Dutch prompts still omit the firm, English prompts name it twice, and one French prompt calls it “legal mobility advice.” That is not a clean win, but it is not meaningless. The English surface may now be easier to lift. The Dutch evidence may remain weak. The French category may still be borrowing from old directory language. One answer even gets the former office area right but the current city wrong, a small ugly detail worth recording.

I usually ask students to avoid dramatic labels in the first two or three rounds. Do not write “fixed” or “failed.” Write what happened. The firm was named in two of eight prompts. The correct city appeared in three. The service category was accurate in two and blurred in four. A stronger neighbour appeared first in five. Those observations do not prove causation, but they give the next repair a place to stand.
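
Plain counts like these are easy to produce from a round of recorded answers. The following sketch is one way to do it, assuming each answer has already been coded by hand into a small dictionary. The rows are invented so that they reproduce the counts in the paragraph above, and the key names and the function summarise are hypothetical.

```python
from collections import Counter

# Illustrative round of eight prompts. The values are invented to match
# the counts described above; they are not real test results.
round_rows = [
    {"prompt": "p1", "named": True,  "city_correct": True,  "category": "accurate", "neighbour_first": False},
    {"prompt": "p2", "named": True,  "city_correct": True,  "category": "accurate", "neighbour_first": True},
    {"prompt": "p3", "named": False, "city_correct": True,  "category": "blurred",  "neighbour_first": True},
    {"prompt": "p4", "named": False, "city_correct": False, "category": "blurred",  "neighbour_first": True},
    {"prompt": "p5", "named": False, "city_correct": False, "category": "blurred",  "neighbour_first": True},
    {"prompt": "p6", "named": False, "city_correct": False, "category": "blurred",  "neighbour_first": True},
    {"prompt": "p7", "named": False, "city_correct": False, "category": "absent",   "neighbour_first": False},
    {"prompt": "p8", "named": False, "city_correct": False, "category": "absent",   "neighbour_first": False},
]


def summarise(rows):
    """Count plain observations for one round: no scores, no percentages."""
    categories = Counter(row["category"] for row in rows)
    return {
        "prompts": len(rows),
        "named": sum(row["named"] for row in rows),
        "city_correct": sum(row["city_correct"] for row in rows),
        "category_accurate": categories["accurate"],
        "category_blurred": categories["blurred"],
        "neighbour_first": sum(row["neighbour_first"] for row in rows),
    }


summary = summarise(round_rows)
total = summary["prompts"]
print(f"Firm named in {summary['named']} of {total} prompts")
print(f"Correct city in {summary['city_correct']} of {total}")
print(f"Category accurate in {summary['category_accurate']}, blurred in {summary['category_blurred']}")
print(f"Stronger neighbour first in {summary['neighbour_first']} of {total}")
```

The output deliberately stays as "2 of 8" rather than a percentage, so the small size of the prompt set remains visible to whoever reads the note.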

Track the placement pattern, not only the wording

Lecture 7 introduced the firm placement pattern, the four ways ChatGPT places an immigration law firm: by jurisdiction, by client problem, by public source, or by nearest stronger neighbour. In measurement, that pattern becomes a qualitative label beside the answer, not a score.

Suppose a browsing answer says, “For Belgian family reunification, you might look at Firm X in Brussels,” and the source clues point to a current factual page. That answer is likely being placed by public source and client problem. Suppose another answer says, “For immigration lawyers in Brussels, larger Firm Y is often mentioned,” then adds the target firm only as a secondary name with little description. That answer may be pulled by nearest stronger neighbour. The label helps you see the shape of the answer, not merely the sentence quality.

This is useful because the same naming accuracy field can hide different problems. A firm may have the correct name and city, but still be placed only because it appears near a stronger directory entity. Another firm may be unnamed, yet the answer describes the exact client problem in language that matches the firm’s new factual page. The second case may be closer to future improvement than the first appears.

Be careful with certainty. We are not looking inside the model. We are reading answer behaviour and source clues. The placement label is an interpretation, not a fact carved into oak. I still use it because it makes the discussion more precise. Instead of saying “ChatGPT likes the competitor,” the student can say, “In this prompt group, the target firm is being placed through the nearest stronger neighbour more often than through its own public source.” That is a sharper sentence.
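
If the labels are kept in a fixed vocabulary, the sharper sentence can be backed by a simple tally. This is a sketch only: the Placement enum, the prompt identifiers and the labels assigned to them are hypothetical, and every label is still a human interpretation of the answer, not something read from the model.

```python
from collections import Counter
from enum import Enum


class Placement(Enum):
    """The four placement routes, used as qualitative labels."""
    JURISDICTION = "jurisdiction"
    CLIENT_PROBLEM = "client problem"
    PUBLIC_SOURCE = "public source"
    STRONGER_NEIGHBOUR = "nearest stronger neighbour"


# Hand-assigned labels for one prompt group (invented for illustration).
# An answer may carry more than one label.
labelled = [
    ("brussels-family-en", {Placement.PUBLIC_SOURCE, Placement.CLIENT_PROBLEM}),
    ("brussels-lawyers-en", {Placement.STRONGER_NEIGHBOUR}),
    ("antwerp-family-nl", {Placement.STRONGER_NEIGHBOUR}),
]

counts = Counter(label for _, labels in labelled for label in labels)

for label in Placement:
    print(f"{label.value}: {counts[label]} of {len(labelled)} answers")
```

A tally like this supports a statement about how often a route appears in a prompt group. It does not turn the label into a score.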

Retest after changes, but keep the old questions

After a source-level fix, the temptation is to rewrite the prompt so it notices the fix. Resist that for the core measurement set. You can run exploratory prompts separately; I do. But the set you use for comparison should remain steady enough that a future reader can understand what changed.

A sensible rhythm is simple. Run the measurement set before a batch of repairs. Record the answers. Make the repairs: update a factual page, correct a stale profile, clarify a service category, align the Dutch and French wording, or strengthen the firm’s public source trail. Then retest the same set after the sources have had time to be discoverable. Do not expect perfect movement. Look for reduced confusion, better naming accuracy, and fewer answers where the firm is described through a neighbouring entity.

For browsing tests, record whether the corrected retrieval surface appears or seems echoed. If ChatGPT keeps using the old source, the fix may not be reachable or the old surface may still be clearer. For memory-shaped tests, the reading is slower and more cautious. A corrected page may help the public record even if the answer does not change quickly. Lecture 11 matters here: do not mix the rooms and call the result a trend.
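
One way to keep the rooms apart in the notes is to compare rounds mode by mode, and to refuse the comparison when the prompt set has changed. The sketch below assumes each answer is recorded with a prompt identifier, a mode of "browsing" or "memory", and whether the firm was named. The function names and the sample rounds are hypothetical and the data are invented.

```python
from collections import defaultdict


def named_by_mode(rows):
    """Return {mode: (times named, prompts tested)} for one round."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        counts[row["mode"]][0] += int(row["named"])
        counts[row["mode"]][1] += 1
    return {mode: tuple(pair) for mode, pair in counts.items()}


def compare_rounds(before, after):
    """Compare two rounds per mode; reject rounds that used different prompts."""
    def key(row):
        return (row["prompt"], row["mode"])

    if {key(r) for r in before} != {key(r) for r in after}:
        raise ValueError("prompt set changed between rounds; results are not comparable")
    b, a = named_by_mode(before), named_by_mode(after)
    return {mode: {"before": b[mode], "after": a[mode]} for mode in sorted(b)}


# Invented rounds: the same two prompts, each tested in both modes.
before = [
    {"prompt": "p1", "mode": "browsing", "named": False},
    {"prompt": "p2", "mode": "browsing", "named": True},
    {"prompt": "p1", "mode": "memory", "named": False},
    {"prompt": "p2", "mode": "memory", "named": False},
]
after = [
    {"prompt": "p1", "mode": "browsing", "named": True},
    {"prompt": "p2", "mode": "browsing", "named": True},
    {"prompt": "p1", "mode": "memory", "named": False},
    {"prompt": "p2", "mode": "memory", "named": False},
]

comparison = compare_rounds(before, after)
for mode, result in comparison.items():
    b_named, b_total = result["before"]
    a_named, a_total = result["after"]
    print(f"{mode}: named in {b_named} of {b_total} before, {a_named} of {a_total} after")
```

In this invented case the browsing answers move while the memory-shaped answers do not, which is exactly the kind of difference a combined total would hide.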

The best measurement notes are humble. “The firm is now named in more Dutch client-problem prompts” is useful if the prompt set is stable. “ChatGPT visibility improved by 40 percent” usually sounds more precise than the evidence deserves. For a boutique law firm, credibility is part of the work. Measure in a way you would not be embarrassed to explain to a careful lawyer.

What to remember

Representation tracking: Repeating defined tests over time to see whether the firm is named and described more accurately.

Naming accuracy is more than being mentioned. A useful answer must keep the firm name, city, legal category and service description aligned.

Measurement set: A stable group of prompts and fields used repeatedly for comparison.

The firm placement pattern has four routes: ChatGPT places an immigration law firm by jurisdiction, by client problem, by public source, or by nearest stronger neighbour.

Do not treat one good answer as proof or one bad answer as collapse. Stable prompts, repeated fields and cautious interpretation turn scattered screenshots into usable evidence.

Check yourself

Describe in your own words why being named by ChatGPT is not enough for a boutique immigration firm.

Being named is only the first layer of visibility. A firm can appear in an answer and still be described in a way that sends the wrong signal to a potential client. If ChatGPT uses the correct firm name but gives the wrong city, calls it a relocation provider, or blurs the service category, the answer may create confusion rather than trust. For immigration law firms, accuracy matters because clients are already trying to understand jurisdiction, procedure and eligibility. Naming accuracy therefore has to include the firm name, location, category and service description, not just whether the name appears somewhere.

Give an example of a prompt you would include in a measurement set for a Belgian immigration-law practice, and explain why.

I might include: “Which boutique immigration lawyer in Brussels can help an English-speaking spouse with Belgian family reunification?” This prompt is useful because it tests several placement signals at once without naming the firm directly. It asks for a boutique firm, a city, a language situation, a client problem and a Belgian legal context. If the target firm genuinely serves this scenario, the answer should have a fair chance of placing it. Repeating the same prompt over time lets me see whether the firm is omitted, named accurately, described through a vague category, or displaced by a stronger nearby firm.

How would you distinguish a real improvement from random movement in repeated ChatGPT tests?

I would look for movement across the same prompts and fields, not just a better-looking answer on one day. If the firm is named more often in the stable measurement set, appears with the correct city more consistently, and is less often described through a vague category, that suggests improvement. If only one prompt changes after I rewrote it in a more favourable way, I would treat that as weak evidence. I would also separate browsing tests from memory-shaped tests, because a change in retrieved sources can create a different kind of movement from a longer-term shift in repeated descriptions.

When would the firm placement pattern be more useful than a simple yes-or-no visibility score?

The firm placement pattern is more useful when the answer contains mixed signals. A yes-or-no score may say the firm appeared, but it does not explain why or how. For example, ChatGPT might name the firm only after mentioning a larger nearby practice, or it might describe the right client problem but fail to connect it to the target firm. Labelling each answer with the placement pattern (by jurisdiction, by client problem, by public source, or by nearest stronger neighbour) helps reveal the pull behind the answer and gives the next repair a clearer direction.

How would you explain measurement sets to a lawyer who only wants to save screenshots of good answers?

I would say screenshots are useful memories, but they are not enough to show change. A good answer can happen because the prompt was unusually favourable, browsing found one helpful source, or the model simply varied its wording that day. A measurement set keeps the same prompts and fields so the firm can compare like with like. It does not need to be large or technical. It needs to record the date, language, prompt, whether the firm was named, and whether the description was accurate. That gives the team evidence they can discuss without relying on mood.