Our Methodology
How Gaplens Matches Questions to Content
Gaplens uses a two-layer matching system to determine whether your site answers each bridge question. Here's how it works, how we validated it, and where it has honest limitations.
How does Gaplens match questions to pages?
Gaplens uses a two-layer matching system. The first layer uses pattern matching with generated variants — for example, “Can I cancel?” also checks for “cancellation policy,” “how to cancel,” and other phrasings. This catches obvious matches quickly and cheaply.
When pattern matching doesn’t find a match, the second layer uses AI semantic analysis (Claude Haiku) to evaluate whether the page content genuinely answers the question, even if none of the expected phrasings appear. This catches cases where a page answers a question using completely different language.
Each layer has a specific role:
- Pattern matching: Fast, deterministic, zero cost. Checks page titles, headings, FAQ sections, and body content for question variants. Covers the majority of matches.
- AI semantic matching: Slower, uses an LLM call, but catches nuanced matches that patterns miss. Only runs on pages that pattern matching flagged as potentially relevant but couldn’t confirm.
This two-layer approach balances speed with accuracy. A typical audit costs about $0.02 in AI API calls, and most matches are found by the pattern layer alone.
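The flow described above can be sketched in a few lines of Python. The function names and the stubbed-out semantic layer are illustrative assumptions, not Gaplens's actual implementation; in production the second layer is an LLM call (e.g. Claude Haiku):

```python
# Sketch of the two-layer matching flow. The semantic layer is stubbed
# out here so the example is self-contained.

def generate_variants(question: str) -> list[str]:
    # Stand-in for the real variant generator: a tiny hand-built table.
    table = {"can i cancel?": ["cancellation policy", "how to cancel"]}
    base = question.lower()
    return [base.rstrip("?")] + table.get(base, [])

def pattern_match(question: str, page_text: str) -> bool:
    # Layer 1: deterministic, zero-cost substring check.
    text = page_text.lower()
    return any(variant in text for variant in generate_variants(question))

def semantic_match(question: str, page_text: str) -> tuple[bool, float]:
    # Layer 2 stub: pretend the model judged this a confident match.
    return True, 0.9

def match(question: str, page_text: str) -> dict:
    if pattern_match(question, page_text):
        # Deterministic match: no LLM call, no confidence score.
        return {"matched": True, "layer": "pattern", "confidence": None}
    matched, confidence = semantic_match(question, page_text)
    if matched and confidence >= 0.6:  # conservative reporting threshold
        return {"matched": True, "layer": "semantic", "confidence": confidence}
    return {"matched": False, "layer": None, "confidence": None}

result = match("Can I cancel?", "See our cancellation policy for details.")
print(result["layer"])  # the cheap pattern layer fires first
```

Because the pattern layer runs first and short-circuits, the LLM is only consulted for the minority of questions the deterministic check cannot confirm.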
How was the matching system validated?
We built a validation dataset of 186 labeled question-page pairs across three real websites spanning SaaS, healthcare, and government/finance. Two independent raters labeled each pair as “match,” “partial match,” or “no match.”
The inter-rater agreement was strong: a Cohen's kappa of 0.819, which falls in the "almost perfect" band (0.81-1.00) of the widely used Landis and Koch scale. This means the labels themselves are reliable, not just subjective opinions.
Key validation results:
- Precision: 100% — Every match the system reported was a genuine match. Zero false positives across all 186 pairs.
- Semantic false positive rate: 0% — The AI layer made zero incorrect match claims (7 out of 7 correct).
- Confidence threshold: 0.6 — All AI matches scored between 0.75 and 0.98 confidence, well above the threshold, confirming the system is conservative in its claims.
Why precision matters most:
When Gaplens tells you a question is covered, you can trust that answer. We optimized for zero false positives because reporting a question as covered when it isn't is worse than missing a match you could find manually.
What do the confidence scores mean?
Confidence scores appear on AI-matched questions and represent how certain the semantic model is that a page genuinely answers the question. Scores range from 0 to 100%.
In practice, the system is conservative:
- 75-100%: Strong match. The page clearly addresses the question, even if the exact wording differs.
- 60-74%: Moderate match. The page is relevant and contains a partial or indirect answer.
- Below 60%: Not reported as a match. The system only surfaces matches it’s confident about.
Pattern-matched questions (found via text variants rather than AI) don’t have confidence scores because the match is deterministic — the expected phrasing was found in the content.
In our validation, all AI matches scored between 75% and 98%, meaning the system naturally tends toward high-confidence matches rather than borderline ones.
What are the limitations of the matching system?
The main limitation is recall — the system may miss matches that exist on your site, particularly when the relevant content wasn’t included in the crawled pages. This is an honest trade-off: we prioritized never giving you a false “covered” result over catching every possible match.
Specific limitations:
- Crawl depth: Gaplens analyzes the pages it can crawl, which depends on your sitemap and link structure. If the answer to a question lives on a page that isn’t linked from your main navigation or sitemap, the system won’t find it. In our validation, one site with only 27 crawled pages had 0/13 matches — not because matching failed, but because the relevant content simply wasn’t in those 27 pages.
- JavaScript-rendered content: Content that requires JavaScript to render may not be visible to the crawler.
- Content behind auth: Gated content, login walls, and paywalled pages cannot be analyzed.
On well-crawled sites, the system performs strongly: our SaaS test site matched 10 of 13 labeled pairs, and our government/finance test site matched 12 of 13. The quality of results scales directly with how much of your site the crawler can reach.
How is this different from simple keyword matching?
Simple keyword matching looks for exact words — if you search for “cancel,” it finds pages containing “cancel.” Gaplens does something fundamentally different at both layers.
Pattern matching with variant generation: Instead of searching for one keyword, Gaplens generates dozens of phrasings for each question based on industry-specific patterns. “Can I cancel my subscription?” also checks for “cancellation policy,” “how to cancel,” “ending your plan,” and other natural phrasings. This catches matches that keyword search would miss.
AI semantic understanding: The second layer doesn’t search for words at all. It reads the page content and evaluates whether it answers the question, regardless of terminology. A page explaining “you can downgrade to our free tier at any time with no penalties” answers “Can I cancel?” even though the word “cancel” never appears.
This matters because your customers don’t always use the same words you do. Bridge questions are phrased in customer language, and the answers on your site are often in business language. The two-layer system bridges that gap.
Want to know exactly where your gaps are?
Gaplens audits your site against the bridge questions your audience asks, and scores every page for AI extraction readiness.
Request Early Access