Houzz: Semantic Query Understanding at 50% Higher Accuracy

Houzz connects homeowners with home-improvement professionals — architects, designers, contractors — and runs a product marketplace alongside the directory. Millions of product pages. Millions of long-tail queries every day. The search engine is the platform.

`01` The Challenge

Long-tail queries missed the right product, and the misses compound

Houzz’s existing search ran on an NLP model that handled direct-match queries well and long-tail queries badly. Less than 40% of long-tail queries were processed accurately. When a user with a specific intent typed it in full, the engine frequently routed them to the wrong page — or to a page not optimized for conversion.

In a marketplace, missed queries don’t just lose a sale. They train users that search doesn’t work.

`02` The Approach

No large labeled dataset. Synthetic data, with latency and accuracy gates.

Two constraints shaped the approach:

Houzz wanted to avoid the cost and time of collecting and labeling a ground-truth dataset. Provectus proposed synthetic query generation instead — train on generated data that emulates real user queries across categories and attributes.

Inference had to be sub-second. Out-of-the-box LLMs couldn’t hit that bar. The team routed around the constraint: convert queries into embeddings first, then run classification on top.

`03` The Build

Amazon Titan embeddings plus classifiers plus a Flair NER model, trained on Claude 3 Sonnet synthetic data

Amazon Titan Text Embeddings (on Amazon Bedrock) generate query embeddings. Simple classifiers trained on those embeddings identify product categories and attributes. A NER model built on Flair classifies residual parts of the query — the pieces that aren’t categories or attributes — which is where most long-tail signal lives.
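The embed-then-classify pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production pipeline: `embed` is a hypothetical stand-in that returns a normalized bag-of-words vector so the example runs offline (a real system would call the embedding model at that point), and `CentroidClassifier` is an illustrative choice of "simple classifier", since the source does not say which classifier type was used.

```python
import math
from collections import Counter, defaultdict


def embed(query: str) -> dict[str, float]:
    """Stand-in for the embedding call (hypothetical, not the Titan API).

    Returns an L2-normalized bag-of-words vector keyed by token.
    """
    counts = Counter(query.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


class CentroidClassifier:
    """Nearest-centroid classifier: one mean vector per label."""

    def fit(self, queries, labels):
        sums = defaultdict(lambda: defaultdict(float))
        counts = Counter(labels)
        for query, label in zip(queries, labels):
            for token, weight in embed(query).items():
                sums[label][token] += weight
        # Average the summed vectors to get one centroid per label.
        self.centroids = {
            label: {t: w / counts[label] for t, w in vec.items()}
            for label, vec in sums.items()
        }
        return self

    def predict(self, query):
        vec = embed(query)

        def score(label):
            # Dot product between the query vector and the label centroid.
            centroid = self.centroids[label]
            return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

        return max(self.centroids, key=score)


# Illustrative training pairs; categories and queries are made up.
train = [
    ("mid century walnut dining table", "dining tables"),
    ("round dining table seats six", "dining tables"),
    ("extendable oak dining table", "dining tables"),
    ("brass pendant light for kitchen island", "pendant lights"),
    ("black industrial pendant light", "pendant lights"),
    ("glass globe pendant light", "pendant lights"),
]
clf = CentroidClassifier().fit(*zip(*train))
predicted = clf.predict("small white dining table for apartment")
```

The design point is that the expensive step (embedding) happens once per query, and everything downstream is cheap vector arithmetic, which is what makes a sub-second budget plausible.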

Training data came from Anthropic’s Claude 3 Sonnet. Synthetic queries span the full combinatorial space of categories and attributes across Houzz’s product taxonomy. The NER model, the category classifiers, and the attribute classifiers all train on this synthetic corpus. Model selection and tuning ran through Weights & Biases.
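One way to picture "the full combinatorial space" is as a cross product over a taxonomy, with one generation request per label combination. The sketch below is an assumption-laden illustration: `TAXONOMY`, the prompt wording, and `fake_generate` are all hypothetical, and `fake_generate` is a placeholder for the model call so the example runs without network access.

```python
import itertools

# Hypothetical slice of a product taxonomy; names are illustrative only.
TAXONOMY = {
    "sofa": {"color": ["gray", "navy"], "material": ["leather", "linen"]},
    "pendant light": {"color": ["black", "brass"], "style": ["industrial"]},
}


def label_combinations(taxonomy):
    """Yield (category, {attribute: value}) for every attribute combination."""
    for category, attributes in taxonomy.items():
        names = sorted(attributes)
        for values in itertools.product(*(attributes[n] for n in names)):
            yield category, dict(zip(names, values))


def build_prompt(category, attrs):
    """Prompt text a generator model could be given (wording is an assumption)."""
    attr_text = ", ".join(f"{k}={v}" for k, v in attrs.items())
    return (
        "Write a search query a shopper might type for a "
        f"{category} with {attr_text}. Return only the query."
    )


def fake_generate(category, attrs):
    """Placeholder for the LLM call so the sketch runs offline."""
    return " ".join(list(attrs.values()) + [category])


# Each record keeps its labels, so the corpus is labeled by construction.
corpus = [
    {
        "query": fake_generate(c, a),
        "category": c,
        "attributes": a,
        "prompt": build_prompt(c, a),
    }
    for c, a in label_combinations(TAXONOMY)
]
```

Because the labels are chosen before the query text is generated, no manual annotation step is needed, and adding a category means adding one taxonomy entry and regenerating.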

Houzz received two things: a bundle of trained ML models for semantic search understanding, and an infrastructure for generating synthetic data for new categories and attributes. Adding a new product line is a retraining, not a rebuild.

`04` The Results

52.94% → 78% category accuracy. 66.98% → 85% recall. Latency held.

+50%

Improvement in the engine’s ability to correctly understand customer queries

Category and attribute identification accuracy rose from 52.94% to 78%, a relative lift of about 47% that the headline rounds to 50%. Recall rose from 66.98% to 85%, so users are less likely to miss a relevant product. Precision held at 79%, so the recall gain did not come at the cost of result relevance. Latency stayed inside the sub-second budget.
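For readers who want to check the arithmetic, the relative lifts follow directly from the reported before and after figures. The helper name `relative_lift` is illustrative; the inputs are the percentages quoted above.

```python
def relative_lift(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


# Reported figures, in percent.
accuracy_before, accuracy_after = 52.94, 78.0
recall_before, recall_after = 66.98, 85.0

accuracy_points = accuracy_after - accuracy_before          # percentage-point gain
accuracy_lift = relative_lift(accuracy_before, accuracy_after)
recall_points = recall_after - recall_before
recall_lift = relative_lift(recall_before, recall_after)

print(f"accuracy: +{accuracy_points:.2f} points, {accuracy_lift:.1f}% relative")
print(f"recall:   +{recall_points:.2f} points, {recall_lift:.1f}% relative")
```

The accuracy gain works out to roughly 25 percentage points, or about 47% relative to the baseline; the recall gain is about 18 points, or about 27% relative.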

For the marketplace this means customers find what they searched for and stay on-platform longer. For merchants it means more qualified traffic. For Houzz it means search works for the queries users actually type, not just the clean ones.

`05` What’s Next

A platform that expands with taxonomy, not with labeling budgets

The synthetic-data infrastructure is the extension lever. New categories, new attributes, and new product types each get generated data and a retrained model. Houzz is continuing the engagement along this extension path.
