
Be Multi-Modal Ready: Make Your PDPs, Products & Packaging Machine-Readable

  • Writer: Myriam Jessier
  • Jun 23
  • 8 min read

AI search is highly context-aware (location, time of day, personal history). The end game is to provide users with the best possible answer. Think of the old web as a librarian, pointing you towards books that already exist so you can dig up the information yourself. Today's intuitive web is more like a detective, thanks to generative search: give it a clue and it asks tons of questions for you, fanning out queries to deliver a one-of-a-kind, tailored answer. It builds a new answer; it doesn't just find you an old one.


When it comes to ecommerce, this means that the fundamental goal is to close the gap between a visual or conversational query and your product page. You need to provide search engines like Google with such a rich, well-structured, and comprehensive understanding of your products that they can confidently serve them as the answer, whether the user's starting point is a typed keyword, a photo, a screenshot, or a spoken question.


[Image: Purple graphic asking "Is it vegan? Gluten-free?" with vegan and gluten-free logos and a camera icon: "Snap a pic of the ingredients, and boom!" No more relying on brands to spell it out.]

Both Google Lens and Large Language Models (LLMs) leverage Optical Character Recognition (OCR) technology to process and interpret information from images. Google Lens uses OCR to extract text from images, enabling functionalities like translation, text copying, and searching based on visual input. LLMs, when tasked with multimodal inputs (like images with text), also employ OCR as a preliminary step to convert the text within images into a machine-readable format. This text is then processed alongside other visual elements by the LLM, allowing it to understand the full context of the image and respond to queries that involve both visual and textual information. In essence, OCR acts as a foundational layer that allows both tools to bridge the gap between visual and textual data, facilitating more comprehensive understanding and interaction.
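If you want to see what an OCR layer actually extracts from your own images, here's a minimal sketch using the open-source Tesseract engine via pytesseract. Lens and the big LLMs use their own proprietary OCR, so treat this as an approximation of their behavior; "packaging_photo.jpg" is a hypothetical file name.

```python
# Approximate the OCR step with the open-source Tesseract engine.
# Requires: pip install pillow pytesseract, plus a local Tesseract install.
from PIL import Image
import pytesseract

image = Image.open("packaging_photo.jpg")  # hypothetical packaging photo

# Extract all machine-readable text from the photo.
print(pytesseract.image_to_string(image))

# Per-word confidence scores reveal where the OCR layer struggles;
# the same words will likely trip up Lens or a multimodal LLM too.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip() and float(conf) < 60:
        print(f"Low confidence ({conf}): {word}")
```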



Advocate for OCR-Friendly Product Pictures & Packaging


Advise clients to use clear, sans-serif fonts and high contrast between text and background on their packaging. This allows users in a physical store to use Lens to easily scan ingredients, instructions, or materials.


Treat Your Packaging as a Landing Page


For multimodal search, you must treat your physical packaging with the same optimization mindset as a digital landing page. It needs to be easily crawlable and understandable by AI. If a machine can't read it, you're invisible at the crucial moment of consumer consideration.


Design for a context, not a self-contained experience


Gather the physical packaging for your top 5-10 products. Go to a location with realistic lighting (i.e., not a perfect photo studio). Using a standard smartphone, take clear photos of the areas containing important text (ingredients, instructions, key features).


The test: use the image input feature on Google Lens, Gemini, or ChatGPT and give it simple, direct prompts:


  • "List all the ingredients on this package."

  • "What are the cooking instructions?"

  • "Summarize the key features written on the front of this box."

  • "Is this product gluten-free based on the text?"


Document the results. Does the AI read the text perfectly every time? Where does it fail? Note every error, misinterpretation, or missed word.
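If you'd rather run this audit at scale than phone-in-hand, here's a minimal sketch using the google-generativeai Python library. The model name and "packaging_photo.jpg" are assumptions; swap in whatever multimodal model and photos you're testing.

```python
# Run the four audit prompts against one packaging photo via the Gemini API.
# Requires: pip install google-generativeai pillow
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumption: you have an API key
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model name

photo = Image.open("packaging_photo.jpg")  # hypothetical packaging photo
prompts = [
    "List all the ingredients on this package.",
    "What are the cooking instructions?",
    "Summarize the key features written on the front of this box.",
    "Is this product gluten-free based on the text?",
]

# Log every answer so you can compare it against the actual packaging.
for prompt in prompts:
    response = model.generate_content([prompt, photo])
    print(f"PROMPT: {prompt}\nANSWER: {response.text}\n")
```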


Identify the "OCR Failure Points"


Analyze the packages where the AI struggled. The failures almost always fall into a few common categories. You need to identify why the text was unreadable.


Common Culprits:


  • Low Contrast: The most common issue. Yellow text on a white background, light gray on a slightly darker gray, etc.

  • Stylized Fonts: Script, handwritten, or highly stylized serif fonts are difficult for OCR to parse accurately.

  • Text Size & Spacing: Tiny ingredient lists with tightly packed letters become a blurry mess for a camera.

  • Busy Backgrounds: Text printed over a complex pattern or image is easily confused.

  • Curved & Creased Surfaces: Text that wraps around the curve of a bottle or gets distorted by packaging creases is very hard to read.

  • Glossy & Reflective Finishes: Glare from store or home lighting can completely obscure words.


Create "Design-for-OCR" Guidelines


Based on your audit, create a simple, one-page best practices document for the brand's packaging design team. This translates your SEO findings into their language.


The Guidelines:


  • Prioritize Contrast: For all critical text (ingredients, instructions, warnings), use a high-contrast color scheme. Black on white is best. Recommend using a WCAG contrast checker (a quick contrast-ratio sketch follows this list).

  • Specify "Utility" Fonts: Mandate the use of clean, sans-serif fonts like Helvetica, Arial, Lato, or Open Sans for all informational text blocks.

  • Implement "Clear Zones": Insist that important text be placed on a solid, flat background, free from underlying patterns or images.

  • Recommend a Matte Finish: For the areas containing critical text, suggest using a matte varnish or finish instead of a high-gloss one to minimize glare.

  • Advocate for QR Codes: As a fallback and enhancement, include a QR code that links directly to a page with all the product's information in clear HTML text, making it perfectly machine-readable every time.
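The contrast math is simple enough to script yourself. Here's a minimal sketch of a WCAG 2.x contrast-ratio checker; the formula is the standard one, and the example colors are illustrative.

```python
# Compute the WCAG contrast ratio between a text color and a background color.
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color (0-255 per channel)."""
    def channel(c):
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio between two colors, from 1:1 up to 21:1."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# Yellow text on white -- the "most common issue" called out above.
print(contrast_ratio((255, 255, 0), (255, 255, 255)))  # ~1.07, far below 4.5
print(contrast_ratio((0, 0, 0), (255, 255, 255)))      # 21.0: black on white
```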


By implementing this strategy, you ensure that when a customer is standing in a store aisle wondering if your product meets their dietary needs, or at home wondering how to cook it, their phone can give them the right answer—an answer you provided by designing your packaging for machines as well as humans. That’s contextual engineering for multimodal search right there. 


Tip #1: Test it in Grayscale


A great practical test is to see how your color combination looks in grayscale. If the text is still clearly distinct from the background, the contrast is likely strong. If they blend into similar shades of gray, the contrast is too low.
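Here's a minimal sketch of that grayscale test with Pillow, assuming "label_design.png" is a hypothetical export of your packaging artwork:

```python
# Convert packaging artwork to grayscale to eyeball its contrast.
from PIL import Image

img = Image.open("label_design.png")  # hypothetical artwork export

# "L" mode drops all color information. If text and background merge
# into similar grays here, the contrast is too low for reliable OCR.
gray = img.convert("L")
gray.save("label_design_grayscale.png")
gray.show()  # can you still read every word?
```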



Optimize for "Product Adjacency" by Curating Your Visual Knowledge Graph


Every lifestyle photo is a small database of objects. A multimodal AI doesn't just see your product; it sees your product and everything else you placed next to it.


[Image: A man and a non-binary femme person with dessert plates smile, sitting outdoors in a city setting; beige and goth clothes, sunny day, relaxed mood.]

This collection of "adjacent" objects tells a powerful story about your brand's persona, price point, and target customer. Most brands let this happen by accident; the ones that curate their image, like Chanel, don't leave it to chance. A watch photographed next to a leather-bound journal and the key to a luxury car means something different than the exact same watch photographed next to a rock-climbing carabiner and a worn hiking map. This matters most for products that sell a specific lifestyle, especially luxury goods. You must deliberately curate the objects that appear alongside your product to build a "visual knowledge graph" that AI can understand.


The "Co-Occurrence Audit"


Take your top 5 lifestyle photos and upload them to a multimodal LLM like Gemini, or use the object detection feature in the Google Vision API. Use a simple, powerful prompt: "List every single object you can identify in this image. Based on these objects, describe the person who owns them." You will get an objective, machine-generated inventory of the items you have associated with your product, and a persona description based on that data.

Identify disconnects. You might discover your "affordable, minimalist" coffee maker is consistently photographed in kitchens with Sub-Zero refrigerators, accidentally signaling a luxury price point. Or your "rugged, outdoor" backpack is shown next to items that suggest a more casual, urban use case.

Based on your analysis, create a new set of guidelines for your creative and photography teams. This "Object Bible" explicitly lists on-brand and off-brand props.
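Here's a minimal sketch of that inventory step using the google-cloud-vision Python client. It assumes your credentials are already configured and "lifestyle_photo.jpg" is a hypothetical lifestyle shot.

```python
# Inventory every object the Vision API can localize in a lifestyle photo.
# Requires: pip install google-cloud-vision, with credentials configured.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("lifestyle_photo.jpg", "rb") as f:  # hypothetical lifestyle shot
    image = vision.Image(content=f.read())

response = client.object_localization(image=image)
for obj in response.localized_object_annotations:
    # Every detected object is a data point in your visual knowledge graph.
    print(f"{obj.name}: {obj.score:.2f}")
```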


Here's a hilarious example found in the wild in a Luxembourg supermarket this summer: look at those BBQ marshmallows. The packaging is obviously designed for a BBQ context: imagery, vibe, you name it! You see what the product looks like in use, you see the flames, you see the words clearly on it. There's no mistaking these fluffy bits for anything but a summer treat to be enjoyed over an open flame.


[Image: Store aisle with barbecue supplies: marshmallows, instant grill kits, brushes, and charcoal bags in brightly colored packaging. The marshmallow packaging screams: BBQ SEASON.]
Extreme fluffy marshmallows! They are manly! They are hardcore (but still fluffy).

The Guidelines:

  • On-Brand for "Rugged Outdoor Co.": Stanley thermos, Darn Tough socks, a compass, a well-worn map, a GoPro.

  • Off-Brand for "Rugged Outdoor Co.": Fine china, silk pillows, a Peloton bike, designer stilettos.


You are deliberately curating the visual data set. You ensure that every time an AI analyzes your lifestyle photos, it understands your product's context, value, and ideal customer, making it far more likely to recommend your product for relevant, conversational queries like, "What's a good watch for someone who loves hiking?"


Emphasize a Clear, Recognizable Logo


A visually distinct and consistently used logo is easier for visual search to identify "in the wild" on social media or in real-life photos, leading searchers back to the brand's site.



Tip #2: Test your logo with Google Cloud Vision API to see if it is detected.

[Image: Blue "Foyer S.A." logo with white text, detected in the Google Cloud Vision interface.]
You can see Google Cloud Vision detects logos.
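Here's a minimal sketch of that logo check with the same google-cloud-vision client; "logo_sample.jpg" is a hypothetical photo of your logo in the wild (on packaging, a storefront, a social post).

```python
# Check whether the Vision API recognizes your logo "in the wild".
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("logo_sample.jpg", "rb") as f:  # hypothetical logo photo
    image = vision.Image(content=f.read())

response = client.logo_detection(image=image)
if not response.logo_annotations:
    print("No logo detected -- your mark may be too stylized or low-contrast.")
for logo in response.logo_annotations:
    print(f"Detected: {logo.description} (confidence {logo.score:.2f})")
```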


Why Facial Sentiment in Photos Matters


Traditionally, advertisers have known that happy, relatable models sell products. What the Google Cloud Vision API allows you to do is quantify this emotional resonance at scale. Instead of just saying "the model looks happy," you can get a data-backed confidence score for "Joy," "Sorrow," "Anger," and "Surprise." Unless, of course, you're Zara, whose models are famous for their uncomfortable poses and contortions.

This impacts multimodal search in two key ways:


  1. Deeper Context for LLMs: As users make more conversational queries, the emotional context of an image becomes a crucial signal. A dress aficionado like Stéphanie Walter might ask an LLM, "Show me fun outfits for a summer party." The AI can analyze the imagery, and if your product photos score high on "Joy," they are a much more relevant result than a competitor's photo with a neutral or moody model. You are literally providing emotional data for the AI to interpret.

  2. Improved User Engagement and Conversion: Beyond AI, the emotional tone of your imagery directly affects the user. The right sentiment builds a stronger brand connection and can demonstrably improve metrics like time-on-page, add-to-cart rates, and overall conversion.


[Image: A woman in a navy blue off-shoulder dress smiles, with face detection markers overlaid and an emotion analysis chart showing high joy likelihood.]

Tip #3: Emotional Audit


Take the primary lifestyle photos for your top 10-20 products and run them through the Google Cloud Vision API demo. For each image with a human face, log the confidence scores for the primary emotions (Joy, Surprise, etc.).
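Here's a minimal sketch of that logging step with the google-cloud-vision client. Note the API returns likelihood buckets (VERY_UNLIKELY through VERY_LIKELY) rather than raw percentages.

```python
# Log the emotion likelihoods for every face in a product lifestyle photo.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("lifestyle_photo.jpg", "rb") as f:  # hypothetical product photo
    image = vision.Image(content=f.read())

response = client.face_detection(image=image)
for i, face in enumerate(response.face_annotations, 1):
    # Each likelihood is an enum bucket: VERY_UNLIKELY ... VERY_LIKELY.
    print(f"Face {i}: joy={face.joy_likelihood.name}, "
          f"sorrow={face.sorrow_likelihood.name}, "
          f"anger={face.anger_likelihood.name}, "
          f"surprise={face.surprise_likelihood.name}")
```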


Establish a baseline. What is the current emotional tone of your brand's imagery? Is it consistent? More importantly, does it match the intended feeling of the product? You'd want high "Joy" scores for a party dress; for a yoga mat, you'd look for a calm, neutral face, which shows up as low likelihoods across the board, since the API only scores Joy, Sorrow, Anger, and Surprise.


Create a clear A/B test for the product page, where 50% of visitors see the "Joyful" images and 50% see the "Focused" images. Track the conversion rate, add-to-cart rate, and bounce rate for each. This provides hard data to prove that a specific sentiment drives better business outcomes.
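Once the test has run, a quick chi-squared test tells you whether the difference between the two variants is statistically meaningful. Here's a minimal sketch with scipy; the counts are made up, so plug in your own.

```python
# Test whether the "Joyful" vs "Focused" conversion gap is significant.
# Requires: pip install scipy
from scipy.stats import chi2_contingency

# Rows: [converted, did not convert] per image variant (hypothetical counts).
joyful = [130, 4870]   # ~2.6% conversion
focused = [100, 4900]  # ~2.0% conversion

chi2, p_value, dof, expected = chi2_contingency([joyful, focused])
print(f"p-value: {p_value:.4f}")  # below 0.05 suggests the sentiment matters
```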


Summer of SEO tip provided to Freddie Chatt:


Freddie has this summer newsletter I participate in. You get a preview of what's in there here:


Make sure machine vision and customer vision align. LLMs should be able to see your products the right way. This means testing packaging in grayscale to make triple sure your ingredient lists and CTAs can be parsed. It means checking that Google recognizes your logo in the Google Cloud Vision API. It also means your product images focus on the details customers seek (pockets, special stitching, patterns, etc.).



Voxxed Days Luxembourg Talk Slides


You can watch me talk about this on YouTube: https://www.youtube.com/@VoxxedDaysLuxembourg/videos


Or you can download the slides of the talk, which contain a bit more content:



In Conclusion


This is very much a work-in-progress article: I will keep adding best practices for multimodal images.


