Methodology

The BS Meter
// How we score AI tools.

Every tool reviewed on DebunkTheAI receives a BS Meter score: a composite 0–100 rating of how closely the product's demonstrated behaviour matches what its vendor claims. A higher score means a smaller gap between claim and reality. This page explains how that number is calculated, what it means, and how you should read it.

What the score means

The score reflects reality alignment — not how good a tool is in absolute terms, but how closely it matches the promises its vendor makes. A simple tool that does exactly what it says can score higher than a feature-rich platform that oversells and underperforms.

75 – 100

✓ Worth evaluating

Marketing claims are broadly consistent with demonstrated product behaviour. Weaknesses exist but are honestly disclosed.

50 – 74

⚠ Proceed with caution

Core functionality works, but specific claims are exaggerated, pricing is opaque, or real-world performance falls short of benchmarks.

25 – 49

✗ High hype content

Marketing significantly outpaces functionality. Community reports recurring failures. Value for most buyers is questionable.

0 – 24

☠ Pure marketing fiction

Claims are demonstrably false or technically misleading. Product either does not function as advertised or actively harms the buyer.
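
The four bands above amount to a simple threshold lookup. The following is a minimal sketch of that lookup, not our production code: the names BANDS and band_for are illustrative, and the handling of fractional scores (anything below 75 falls into the lower band) is an assumption.

```python
# Band boundaries copied from the ranges above. BANDS and band_for are
# illustrative names, not part of the published methodology.
BANDS = [
    (75, "Worth evaluating"),
    (50, "Proceed with caution"),
    (25, "High hype content"),
    (0, "Pure marketing fiction"),
]


def band_for(score: float) -> str:
    """Return the verdict band for a composite score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # BANDS is ordered from highest to lowest lower bound, so the first
    # match is the correct band.
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: the lowest band starts at 0")


print(band_for(62))  # Proceed with caution
```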

The five dimensions

The composite score is a weighted average of five independently assessed dimensions. Each dimension is scored 0–100 before weighting is applied. The weights reflect our view of what matters most to a typical buyer evaluating an AI tool for real-world use.

D1 Output Quality vs. Claim (×0.30)

Does the tool’s actual output match what the vendor’s landing page, demos, and marketing material show? We run standardised prompts or tasks and compare raw output to advertised examples. Cherry-picked demo outputs are penalised heavily.

D2 Pricing Transparency (×0.25)

Is the true cost of using this tool in production clearly disclosed? We examine hidden usage caps, credit consumption models, seat multipliers, features gated behind enterprise plans, and real annual cost at scale versus advertised starter prices.

D3 Community Reliability (×0.20)

What do real paying users say? We analyse Reddit threads, G2 reviews, Trustpilot, and Product Hunt comment sections, weighting recent reports more heavily. Patterns of repeated uptime failures, broken features, or unresponsive support lower this dimension.

D4 Data Privacy & Architecture (×0.15)

Who trains on your data? We read sub-processor lists, DPA clauses, opt-out mechanisms, and model training policies. Vendors who train on user inputs by default without clear disclosure receive low scores here.

D5 Claim Falsifiability (×0.10)

Are the vendor’s claims specific enough to test, or are they deliberately vague to avoid accountability? Claims like “10x faster” or “enterprise-grade AI” with no benchmark references score close to zero here. Claims backed by reproducible methodology, published benchmarks, or third-party audits score high.

Sample output

Below is a representative example of what a BS Meter breakdown looks like in a published Teardown. Scores are fictitious and shown for illustration only.

D1 Output Quality vs. Claim (×0.30) 48

D2 Pricing Transparency (×0.25) 41

D3 Community Reliability (×0.20) 72

D4 Data Privacy & Architecture (×0.15) 55

D5 Claim Falsifiability (×0.10) 80
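
To show how a breakdown like this rolls up into one number, here is a minimal sketch of the weighted average using the fictitious scores above. The names WEIGHTS, SAMPLE_SCORES, and composite are illustrative, and how the published score is rounded is not specified here, so the sketch reports one decimal place.

```python
# Weights and scores are the fictitious values from the breakdown above.
# WEIGHTS, SAMPLE_SCORES and composite are illustrative names.
WEIGHTS = {"D1": 0.30, "D2": 0.25, "D3": 0.20, "D4": 0.15, "D5": 0.10}
SAMPLE_SCORES = {"D1": 48, "D2": 41, "D3": 72, "D4": 55, "D5": 80}


def composite(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of dimension scores, each on a 0-100 scale."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[dim] * weights[dim] for dim in weights)
    # Dividing by the total weight keeps the result on the 0-100 scale
    # even if the weights do not sum to exactly 1.
    return weighted_sum / total_weight


print(round(composite(SAMPLE_SCORES), 1))  # 55.3
```

With these sample values the composite lands in the 50–74 band, "Proceed with caution": the strong D5 and D3 scores do not offset weak results on the two most heavily weighted dimensions.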

Editorial rules

Four non-negotiable rules govern every evaluation. Departing from these would turn the BS Meter into the very thing it exists to combat.

Scores precede affiliate arrangements

Dimension scores are locked before any affiliate or commercial relationship with a vendor is considered. A higher score is never offered or accepted as a condition of partnership.

Reviews are time-stamped and refreshed

Every Teardown shows its review date. Tools that substantially change their product, pricing, or policies are re-evaluated. Stale scores are labelled.

Evidence is quoted verbatim

Community complaints, API documentation excerpts, pricing screenshots, and raw test outputs are cited or shown directly. Nothing is paraphrased in a way that misrepresents the original source.

Negative findings are never buried

If a tool fails on a key dimension, that failure appears in the article summary, the score card, and the final verdict. It is not softened by a competing positive to protect an affiliate relationship.

What we do not do

Sponsored scores

No vendor pays for a score. Ever.

Press-kit demos

We test accounts we control, not curated vendor showcases.

Evergreen listicles

No “Top 10 AI Tools 2026” content that never ages or updates.

Vague verdicts

Every Teardown ends with a clear recommendation, not “it depends.”

Anonymous sources only

Community quotes are linked to source threads where possible.

AI-only evaluation

Every Teardown is human-verified before publication.

How to read a Teardown

Start with the composite score

The headline number gives you a quick calibration before you invest reading time. Below 50 means read the red flags first.

Check your relevant dimension

A solo creator cares more about D1 and D2. An enterprise buyer cares more about D4. Weight the sub-scores by your actual priorities.
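
One way to apply your own priorities is to recompute the average with personal weights. The sketch below is a reading aid only, not part of the published score: the function name reweighted and both priority profiles are hypothetical, and the dimension scores are the fictitious values from the sample breakdown.

```python
def reweighted(scores: dict, priorities: dict) -> float:
    """Average the dimension scores using the reader's own priorities.

    Priorities can be any non-negative numbers; they are normalised
    by their total, so only their relative size matters.
    """
    total = sum(priorities.values())
    if total <= 0:
        raise ValueError("at least one priority must be greater than zero")
    return sum(scores[dim] * weight for dim, weight in priorities.items()) / total


# Fictitious dimension scores, as in the sample breakdown.
scores = {"D1": 48, "D2": 41, "D3": 72, "D4": 55, "D5": 80}

# Hypothetical priority profiles, invented for illustration.
solo_creator = {"D1": 3, "D2": 3, "D3": 2, "D4": 1, "D5": 1}
enterprise_buyer = {"D1": 2, "D2": 1, "D3": 2, "D4": 4, "D5": 1}

print(round(reweighted(scores, solo_creator), 1))      # 54.6
print(round(reweighted(scores, enterprise_buyer), 1))  # 58.1
```

The same tool reads slightly differently for each buyer, which is why the sub-scores are published alongside the composite.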

Read the review date

AI tooling changes fast. A score from six months ago may reflect a product that has since improved — or degraded. Check the “Last Reviewed” label.

Follow the evidence trail

Every claim in a Teardown links to a source: a screenshot, a Reddit thread, an API doc. If you disagree, click through and read the primary source.