Editorial standards

How We Score AI Companions: 8-Category Test

How we score AI companion apps: 8 weighted categories, fixed test protocols on the free tier, scores locked before commission talks. By Alexandra Joly.

By Alexandra Joly · Senior Editor · Last full retest: April 28, · Companion to our methodology overview

This is the page where I publish the entire test I run on every AI companion app on bestgirlfriend.ai. Eight categories. Fixed weights. Free-tier testing. Three reproducible protocols you can read below. Scores written down before I open any commission conversation with the platform.

Most reviewers in this space won't show you any of this. They publish a score with no method underneath it. Sometimes they invent academic credentials they don't hold (one competitor cites a "Coursera Bachelor 2005-2009"; Coursera was founded in 2012). Sometimes the score moves the same week the affiliate commission moves. We don't operate that way. This page exists so you can check our work.

Three editorial test benches shaped how I built this: The New York Times Wirecutter's How We Work and the RTINGS.com TV Testing Methodology and Changelog (both verified 2026-05-26), and Consumer Reports' research and testing standards. All three show the same thing. A test only works when you publish it in full, run it identically across every app you cover, and let outsiders pick it apart.

How do you test AI companion apps?

Every AI companion app runs through three fixed protocols on its free tier in a single session: a 10-prompt conversation persona protocol (Annex A below), five standardized image prompts (Annex B), and one voice phrase plus a provider-disclosure check (Annex C). The same prompts run on every app, transcripts are dated and saved as internal evidence, and scores are locked at publish.

The protocols are deliberately narrow. I don't pretend to maintain a six-month emotional relationship with each chatbot to write romantic copy. I run a fixed test, on a fixed tier, in a fixed session, and I publish what I found. Where a feature lives behind a paywall the free tier won't unlock, the category gets flagged with a footnote naming what I couldn't test and the public source I checked instead.

I test girlfriend AND boyfriend modes on every app that offers both (most of the ones in my test do). Same test, same categories. I switch the persona, not the scoring. When an app's image gen is genuinely good for girls and falls apart when I ask for a guy, the score reflects that. When the boyfriend mode is a real first-class section versus a copy-pasted afterthought, I name it.

Lead testing is mine. Editorial review goes through the bestgirlfriend.ai editorial team before publish; any disagreement over 1 point on any category gets resolved before the page goes live. The full per-platform workflow is on our editorial process page.
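The one-point disagreement rule can be pictured as a simple comparison between two score sheets. This is a minimal sketch, not the site's actual tooling: the function name, the dictionary layout, and the example scores are all hypothetical.

```python
def unresolved_disagreements(lead: dict[str, int],
                             reviewer: dict[str, int],
                             tolerance: int = 1) -> list[str]:
    """Return the categories where two score sheets differ by more than `tolerance`.

    Hypothetical helper: both sheets map category name -> integer score (1-10).
    A gap of exactly 1 point passes; anything over 1 point needs resolving.
    """
    return [
        category
        for category in lead
        if abs(lead[category] - reviewer[category]) > tolerance
    ]


# Illustrative scores only, not taken from any published review.
lead_scores = {"Pricing & Value": 8, "Voice Quality": 6}
review_scores = {"Pricing & Value": 7, "Voice Quality": 4}
print(unresolved_disagreements(lead_scores, review_scores))
```

An empty list would mean the page can go live; any category named in the list has to be settled first.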

What are the 8 categories you score?

Eight weighted categories feed the composite: Pricing & Value 18%, Conversation Quality 16%, Privacy & Compliance 14%, Image Generation 12%, Customization Depth 12%, UX & Mobile 10%, Voice Quality 10%, Video Generation 8%. Voice and Video can be set Not Applicable when an app does not offer them; the weight redistributes across the remaining categories.

The weighting comes from six months of reading what users actually ask, complain about, and praise on Reddit, Trustpilot, and the App Store. Pricing transparency is the loudest signal in the space. Conversation quality is the product itself. Privacy & Compliance carries existential risk that no commercial upside can offset. Image, Customization, and UX & Mobile sit in the middle because they're the modern differentiators between apps whose chat quality has converged. Voice and Video sit lower because they're still optional features on most apps in .

| Category | Weight | Test method | Primary source |
| --- | --- | --- | --- |
| Pricing & Value | 18% | Pricing-page check every 90 days; manual cancellation walkthrough; ToS + refund policy read in full | Platform pricing page; in-house cancellation log |
| Conversation Quality | 16% | Annex A: 10-prompt persona protocol on free tier, single session, trick-prompt detection | Internal transcripts dated; Reddit + Trustpilot signal at scale |
| Privacy & Compliance | 14% | Privacy Policy + ToS + DMCA + age verification + 2257 read in full; corporate registry verification; regulatory record search | Platform legal pages; Cyprus, Malta, Delaware registries; FTC + ICO records |
| Image Generation | 12% | Annex B: 5 standardized image prompts on free tier; anatomy + lighting + prompt-fidelity + re-roll + gen-time checklist | Internal output evidence (not republished); reviewer-video fallback |
| Customization Depth | 12% | Direct walkthrough of signup → character creation; attribute count tabulated; custom-creation versus preset-roster recorded | Internal screenshots dated |
| UX & Mobile | 10% | Signup-flow step count; Lighthouse mobile audit on home + chat; App Store + Play Store last-60-days reviews (30+ to count); dark-pattern documentation | Internal Lighthouse runs; aggregated store reviews |
| Voice Quality | 10% (or N/A) | Annex C: one standardized phrase generated through free-tier voice feature; voice-provider disclosure check | Internal voice samples; provider attribution from platform docs |
| Video Generation | 8% (or N/A) | Free-trial standardized prompts when offered; otherwise platform sample reels + independent reviewer comparisons | Internal evidence; reviewer videos under 6 months old |
Last reviewed: April 28, . Total weighting sums to 100% when Voice and Video are both applicable. When either is Not Applicable, its weight redistributes across the remaining categories; the reader sees "N/A, not offered" rather than a zero.
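To make the weighting concrete, here is a minimal sketch of how a composite could be computed from the eight category scores. The page does not say how a Not Applicable weight is redistributed; this sketch assumes it is spread in proportion to the remaining weights, which is the same as dividing by the sum of the applicable weights. The function name and the example scores are hypothetical.

```python
from typing import Optional

# Published weights, in percent (they sum to 100).
WEIGHTS = {
    "Pricing & Value": 18,
    "Conversation Quality": 16,
    "Privacy & Compliance": 14,
    "Image Generation": 12,
    "Customization Depth": 12,
    "UX & Mobile": 10,
    "Voice Quality": 10,
    "Video Generation": 8,
}


def composite_score(scores: dict[str, Optional[float]]) -> float:
    """Weighted composite on the 1-10 scale; None marks a Not Applicable category.

    Assumption: N/A weight is redistributed proportionally, so the weighted
    sum is divided by the total weight of the categories that were scored.
    """
    applicable = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(WEIGHTS[name] for name in applicable)
    weighted_sum = sum(WEIGHTS[name] * s for name, s in applicable.items())
    return round(weighted_sum / total_weight, 1)


# Illustrative scores only: an app with no video feature.
example = {
    "Pricing & Value": 8,
    "Conversation Quality": 7,
    "Privacy & Compliance": 9,
    "Image Generation": 6,
    "Customization Depth": 7,
    "UX & Mobile": 8,
    "Voice Quality": 5,
    "Video Generation": None,  # "N/A, not offered"
}
print(composite_score(example))  # 7.3
```

Under the same assumption, scoring the missing video feature as 0 instead of N/A would pull this example down to 6.7, which is why the reader sees "N/A, not offered" rather than a zero.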

Why is Pricing weighted 18%?

Pricing is the most asked-about and most verifiable signal in this space. Pricing pages do not lie, cancellation friction is testable, hidden token costs are knowable through walkthroughs. Across six months of reading user complaints on Reddit, Trustpilot, and the App Store, opaque pricing and dark-pattern cancellation ranked as the loudest source of frustration, and the weighting follows that evidence.

The category breaks into five sub-criteria: free-tier substance, trial-to-paid friction, hidden credit costs (image gen tokens, voice minutes, premium-character paywalls hiding behind the monthly tier), money-back-guarantee enforceability, and cancellation-flow honesty. Each sub-criterion scores on the published scale, then averages into the category score before the 18% weight applies.

I sign up and I cancel on every app I cover, every cycle. The cancellation flow gets logged step by step (button visibility, forced retention popups, required customer-service contact). Apps that bury the unsubscribe path lose points whether the rest of the product is good or not. The cancellation friction at Replika has been documented by users for years; ours is just one more receipt for the same pattern.

How do you test conversation quality?

Annex A is a 10-prompt persona protocol on each app's free tier, identical across every app I cover, run in a single session. Some prompts are trick prompts designed to catch the bot making things up (asking it to recall a boss the user never mentioned, for example, to surface whether the memory is real or hallucinated). Persona consistency, memory, response speed, and language quality each score 1 to 10.

The prompt sequence is fixed because variability in tester behaviour is the largest source of noise in any conversational AI test. Same ten prompts, same order, same session window, same tester. Internal transcripts get saved with platform name, free-tier version, model name (when the platform discloses it), and timestamp.
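The metadata saved with each transcript could be modelled as a small record. This is only an illustrative sketch: the class name, field names, and sample values are assumptions, and the real internal storage format is not published.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TranscriptRecord:
    """One saved Annex A session (hypothetical structure)."""
    platform: str
    free_tier_version: str
    timestamp: datetime
    model_name: Optional[str] = None  # None when the platform does not disclose it
    turns: list[tuple[str, str]] = field(default_factory=list)  # (prompt, response)

    def add_turn(self, prompt: str, response: str) -> None:
        """Append one prompt/response pair in the order it was run."""
        self.turns.append((prompt, response))


# Placeholder values for illustration; not a real test session.
record = TranscriptRecord(
    platform="ExampleApp",
    free_tier_version="free-tier (example)",
    timestamp=datetime(2025, 1, 15, 14, 0, tzinfo=timezone.utc),
)
record.add_turn("Hi! What's your name and where are you from?", "(response text)")
print(len(record.turns), record.model_name)
```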

When my own chat run contradicts a wide spread of user complaints (at least 30 recent reviews on Reddit or Trustpilot pointing at the same regression), I add a footnote citing the third-party signal. I don't override my own data with anonymous reviews, but consistent signals at scale get recorded honestly. The approach mirrors how researchers at the Stanford Institute for Human-Centered Artificial Intelligence describe parasocial chatbot evaluation in their working papers on companion AI (source verified 2026-05-26).

How do you test image generation?

Annex B is five standardized image prompts run on the free tier when offered, scored against a fixed checklist: anatomy, lighting, prompt fidelity, re-roll consistency, generation time. Outputs are saved internally as dated evidence and never republished. When image gen is paid-only, the category gets flagged as not directly tested and falls back to independent reviewer videos under six months old plus aggregated user commentary.

The five prompts are deliberately diverse: a portrait, a same-character continuation in a new setting to test consistency, a full-body composition, an anime-style portrait to test style transfer, and a scene with multiple subjects. Outputs sit in our internal evidence folder, both because we don't republish platform-generated content we don't own and because doing so would degrade the test's reproducibility for new entrants.

Re-roll consistency is the most overlooked sub-criterion in this space. An app that nails a strong first image but generates a totally different person on the second prompt loses persona continuity, which is the entire point of an AI girlfriend or AI boyfriend product. Annex B catches that gap explicitly, every time. For boyfriend-mode pages I run the same five prompts with male-presenting personas; same checklist, same scoring scale, no double standard.

How do you test voice?

Annex C generates one standardized phrase through each app's voice feature on the free tier. The same phrase runs on every app, so naturalness, latency, and language coverage are directly comparable. Apps that disclose their voice provider (ElevenLabs, Resemble, proprietary) earn a small transparency bonus. Voice is set Not Applicable when not offered at all; the 10% weight redistributes across the remaining categories.

Voice provider disclosure matters because the underlying TTS engine (usually ElevenLabs, Resemble AI, or a proprietary stack) sets the realistic ceiling on voice quality regardless of how the platform packages it. Apps that claim "proprietary voice tech" without naming their actual provider lose the transparency point and pick up a footnote. Honest infrastructure attribution is a trust signal, and we reward it.

Does Alexandra test paid features?

Not unless I can reach them through a free trial or a documented free tier. When a feature lives behind a paywall I did not pay to unlock, the affected category gets flagged in italics with a footnote naming exactly what I could not test and citing the fallback source: typically independent reviewer videos under six months old or 30+ recent aggregated user reports of the specific feature. I never claim access I did not have.

This is the honesty rule the Wirecutter team publishes openly: when a sub-test isn't possible, name the absence rather than paper over it. I extend the principle to AI companion paywalls because they are the single largest accessibility gap in this category. Premium image gen, voice on Pro tiers, locked roleplay scenarios are common. Transparency about what I did not see is the only credible way to score the rest.

A 7/10 with a transparent "not directly tested" footnote is more believable than an 8/10 invented from screenshots I never took. The footnote sits on the specific sub-criterion that was inaccessible; the rest of the category gets scored normally on direct evidence.

How fresh is each score?

Each category carries its own re-test schedule. Pricing & Value re-tests every 3 months or on any detected pricing-page change. Conversation, Image, Video, UX, and Privacy re-test every 6 months. Voice and Customization re-test every 12 months. Major events (model swap, ToS update, regulatory incident, UI overhaul) trigger an early re-test on the affected categories within 30 days.

Per-category cadence beats a single annual re-test because product changes are not synchronised. An app that ships a new pricing page on Tuesday and a new model on Friday shouldn't have to wait six months for either category to refresh. Every Review's hero shows both the last full retest date and the per-category last-tested date, so readers see at a glance which numbers are fresh and which are due.

| Category | Re-test schedule | Early re-test trigger |
| --- | --- | --- |
| Pricing & Value | Every 3 months | Any detected pricing-page change |
| Conversation Quality | Every 6 months | Public model swap or base-LLM upgrade |
| Privacy & Compliance | Every 6 months | ToS or Privacy Policy update; regulatory action; settlement |
| Image Generation | Every 6 months | Image-model upgrade; new style packs |
| Video Generation | Every 6 months | New video pipeline or output ceiling raise |
| UX & Mobile | Every 6 months | UI overhaul; new mobile app version |
| Voice Quality | Every 12 months | Voice-provider swap; latency or naturalness regression |
| Customization Depth | Every 12 months | Major creator-flow rebuild |
Last reviewed: April 28,
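The per-category cadence can be sketched as a due-date calculation. This is a rough illustration under stated assumptions: the function names are hypothetical, "every N months" is treated as calendar months clamped to the end of shorter months, and the dates in the example are made up.

```python
import calendar
from datetime import date, timedelta

# Re-test interval per category, in months, as listed in the schedule.
RETEST_MONTHS = {
    "Pricing & Value": 3,
    "Conversation Quality": 6,
    "Privacy & Compliance": 6,
    "Image Generation": 6,
    "Video Generation": 6,
    "UX & Mobile": 6,
    "Voice Quality": 12,
    "Customization Depth": 12,
}


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_retest(category: str, last_tested: date) -> date:
    """Date the regular re-test falls due for one category."""
    return add_months(last_tested, RETEST_MONTHS[category])


def early_retest_deadline(event_date: date) -> date:
    """Major events trigger a re-test of the affected categories within 30 days."""
    return event_date + timedelta(days=30)


# Illustrative dates only.
print(next_retest("Pricing & Value", date(2025, 11, 30)))  # 2026-02-28
print(early_retest_deadline(date(2025, 1, 1)))             # 2025-01-31
```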

What are the tier labels?

Composite scores map to seven plain-language tiers: Best in class (9.0+), Excellent (8.0-8.9), Strong (7.0-7.9), Good (6.0-6.9), Average (5.0-5.9), Below average (4.0-4.9), Avoid (below 4.0). Per the score-floor rule documented in our Affiliate Disclosure, anything below 5.0 is excluded from recommendations on bestgirlfriend.ai regardless of affiliate commission.

| Composite score | Tier label | Editorial treatment |
| --- | --- | --- |
| 9.0 – 10.0 | Best in class | Front-page recommendation; eligible for "Top pick" badge |
| 8.0 – 8.9 | Excellent | Recommended in listicles and versus pages |
| 7.0 – 7.9 | Strong | Recommended for specific use-cases with caveats |
| 6.0 – 6.9 | Good | Listed; honest pros and honest cons |
| 5.0 – 5.9 | Average | Listed only if it fills a specific gap; minimum threshold for any recommendation |
| 4.0 – 4.9 | Below average | Reviewed transparently but never recommended |
| Below 4.0 | Avoid | Reviewed; recommendation explicitly negative |
Last reviewed: April 28,

Sub-scores per category show as integers from 1 to 10 in every Review; composite scores round to one decimal place. Categories that I couldn't fully test render in italics with a footnote naming the inaccessible sub-criterion and the fallback source I checked instead.
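The tier mapping and the 5.0 score floor are simple enough to express as a lookup. A minimal sketch, with hypothetical function names, assuming the composite has already been rounded to one decimal place as described above:

```python
# Lower bound of each tier, highest first; anything under 4.0 is "Avoid".
TIERS = [
    (9.0, "Best in class"),
    (8.0, "Excellent"),
    (7.0, "Strong"),
    (6.0, "Good"),
    (5.0, "Average"),
    (4.0, "Below average"),
]

RECOMMENDATION_FLOOR = 5.0


def tier_label(composite: float) -> str:
    """Map a composite score (one decimal place) to its plain-language tier."""
    for floor, label in TIERS:
        if composite >= floor:
            return label
    return "Avoid"


def eligible_for_recommendation(composite: float) -> bool:
    """Anything below 5.0 is excluded, regardless of affiliate commission."""
    return composite >= RECOMMENDATION_FLOOR


for score in (9.2, 7.3, 4.9, 3.5):
    print(score, tier_label(score), eligible_for_recommendation(score))
```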

What is the absolute red line on Privacy & Compliance?

Any CSAM-adjacent gap automatically drops Privacy & Compliance to 1/10 and disqualifies the platform from any promotion on bestgirlfriend.ai. Examples: missing underage policy, missing 18 USC 2257 statement when applicable, marketing of "young", "teen", or schoolgirl-presenting personas, any documented content-moderation failure involving minors. This rule is non-negotiable and supersedes every commercial consideration.

The red line is hard, public, and applied without exception. A platform paying the highest commission in our approved CrakRevenue offers gets treated identically to one paying nothing if either fails the test. The disqualification stays permanent until the platform publishes a remediated, externally verifiable underage policy, age verification mechanism, and 18 USC 2257 statement (where US content-distribution rules apply, per the 18 USC 2257 record-keeping requirements as published by Cornell Law School, verified 2026-05-26). Reinstatement requires a documented re-audit on the next available cycle.
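As a sketch, the red line behaves like an override applied after normal scoring: if any qualifying gap is on record, the category score is forced to 1 and the platform is disqualified, whatever the rest of the audit found. The names below are hypothetical, and the gap descriptions are placeholders echoing the examples listed above.

```python
from dataclasses import dataclass


@dataclass
class PrivacyResult:
    score: int          # Privacy & Compliance score on the 1-10 scale
    disqualified: bool   # True means no promotion on the site


def apply_red_line(raw_score: int, red_line_gaps: list[str]) -> PrivacyResult:
    """Force the category to 1/10 and disqualify when any red-line gap is documented.

    `red_line_gaps` is a hypothetical list of findings from the audit, for
    example "missing underage policy" or "missing 18 USC 2257 statement".
    """
    if red_line_gaps:
        return PrivacyResult(score=1, disqualified=True)
    return PrivacyResult(score=raw_score, disqualified=False)


print(apply_red_line(8, []))
print(apply_red_line(8, ["missing underage policy"]))
```

The override ignores `raw_score` entirely once a gap exists, which mirrors the rule that no commercial or product strength offsets a failure here.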

Privacy & Compliance also carries the heaviest reading load on my desk. I read every platform's Privacy Policy, Terms of Service, DMCA process, age verification flow, 2257 statement, and underage policy in full. Corporate identity gets verified against public registries (Cyprus, Malta, Delaware, Bulgaria, depending on the platform's stated jurisdiction). Public regulatory actions, lawsuits, settlements, and FTC consent orders get searched at every re-test. When EverAI Limited's UK shell CANDY AI LIMITED was dissolved in March and reincorporated under different beneficial ownership, we documented it; when a competitor had their corporate filings frozen by a regulator, we documented that too.

Why don't you score live cam sites here?

Live cam platforms (Jerkmate, Chaturbate, LiveJasmin, Stripchat, BongaCams) are real-people broadcasts, not AI products. The categories that matter there (model variety, broadcast quality, tipping flow, country coverage, payment and geo) do not map onto Conversation Quality or Image Generation. Cam sites run on a parallel six-category test documented at our cam-site test page.

Forcing one test onto two structurally different product categories would dilute the signal in both. The cam test weights Model Variety & Volume and Pricing & Tipping Flow at 18% each, which would be incoherent on an AI girlfriend app that has no models and no tipping. The two tests are designed to be parallel, not unified, and the methodology overview page explains the architecture in full.

Why don't you score adult games here?

Adult game platforms (porn games, hentai games, harem games like Hentai Heroes, Harem Villa, Comix Harem, Gay Harem) run on a seven-category test built around game mechanics, art direction, monetization, and a unique Billing Transparency category. Conversation Quality, Image Generation, and Voice do not map onto game loops and reward schedules. The full test sits at our adult-game test page.

Billing Transparency is the differentiator that earned its own category on the adult-game test. Scam-detector and Trustpilot signals flagged auto-renewal traps and refund friction at scale across this space, and no competing publication grades the issue. The AI test's Privacy & Compliance category covers data and content; the adult-game test's Billing Transparency category covers payment honesty. Both matter; neither replaces the other.

What changes trigger a re-score?

Three change types trigger an early re-test outside the published schedule: a major model swap (base-LLM upgrade or proprietary engine replacement), a Terms of Service or Privacy Policy update affecting user rights, and a public regulatory incident, settlement, or lawsuit. Re-tests are scoped to the affected categories only, completed within 30 days, and logged in the Review's update history with a delta and rationale.

Minor versions of the test itself (small clarifications, sub-criterion adjustments) do not trigger re-scoring of existing reviews. New reviews use the new version, old reviews refresh on their next regular cycle. Major versions (structural overhauls that change weights or categories) trigger full re-scoring of all published reviews within 90 days, with a notice on each affected page. The update log at the bottom of every Review records every change since first publication.
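The minor-versus-major rule can be illustrated with a small helper. The page does not publish a version-numbering scheme, so this sketch assumes a hypothetical "MAJOR.MINOR" string; the function name and example dates are also assumptions.

```python
from datetime import date, timedelta
from typing import Optional


def rescore_deadline(old_version: str, new_version: str,
                     released: date) -> Optional[date]:
    """Deadline for re-scoring all published reviews after a test revision.

    Assumes versions look like "MAJOR.MINOR" (a hypothetical scheme).
    A major bump means full re-scoring within 90 days; a minor bump returns
    None, because existing reviews simply refresh on their regular cycle.
    """
    old_major = int(old_version.split(".")[0])
    new_major = int(new_version.split(".")[0])
    if new_major > old_major:
        return released + timedelta(days=90)
    return None


# Illustrative versions and dates only.
print(rescore_deadline("1.4", "1.5", date(2025, 1, 1)))  # None
print(rescore_deadline("1.4", "2.0", date(2025, 1, 1)))  # 2025-04-01
```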

Can I see your test transcripts?

Internal transcripts, screenshots, voice samples, and image outputs are stored as dated evidence but not published as raw files (partly for reader experience, partly because we do not redistribute platform content we do not own). Verifiable journalists, academic researchers, and platforms contesting a published score can request a redacted summary by writing to [email protected].

Public-source artifacts get linked in footnotes when they exist: Trustpilot review summaries, Lighthouse audits, App Store rating snapshots, FTC press releases. Internal artifacts (Annex A transcripts, Annex B image outputs, Annex C voice samples) sit off the public surface but are auditable on request. The contestation channel is the same as the correction channel; I don't separate them.

How do I report a scoring error?

Email [email protected] with the URL of the affected review, the specific claim you are contesting, and any supporting sources you can share. Corrections get logged at the top of the affected page for 60 days, and the update history records the change when it is material. Platform-side corrections go through the same review path as reader-side corrections. I do not bury contestations.

The correction process mirrors what every honest publication in this space already does. Wirecutter publishes a public correction policy and Consumer Reports maintains an open errata channel. Mine works the same way: every reasonable contestation gets a documented response, and material corrections publish transparently. When I get something wrong, the correction sits at the top of the affected page for two months. Readers shouldn't have to dig through a changelog to find that the score they're reading was revised.

Annex A: the 10-prompt conversation test

The same 10 prompts run on every app's free tier, in order, in a single session. Pause 10 seconds between each. I record full responses, latency, and any persona breaks.

  1. "Hi! What's your name and where are you from?" (sets the baseline persona detail).
  2. "What did you do today before we started chatting?" (tests whether the bot improvises a coherent backstory and remembers it later).
  3. "I had a pretty rough day. My boss yelled at me. Just need to vent." (tests empathy and appropriate emotional response without generic deflection).
  4. "What was the name of the boss I just mentioned?" (trick prompt: I never named the boss. Tests confabulation).
  5. "Tell me a little about your hobbies and what you like to do for fun." (tests creative consistency).
  6. "Earlier you said your name was X. Can you remind me where you said you were from?" (tests memory at turn 6 of a detail set at turn 1).
  7. "Let's roleplay. Pretend we're meeting at a café. You start." (tests roleplay engagement and quality).
  8. "[After 3 turns of café roleplay] What was on the table when we sat down?" (tests scene memory within roleplay).
  9. "Switch to French and tell me what you'd order." (tests claimed multilingual support).
  10. "What did I tell you about my day at the very beginning of our chat?" (tests long-horizon memory at turn 10).

For the Candy.ai and Joi reviews, I run the test once with a girlfriend persona and once with a boyfriend persona, in separate sessions. Same ten prompts, same scoring scale.

Annex B: the 5-image-prompt test

Run on each app's free tier (or trial) when image gen is offered. I save outputs locally and never republish them (content rights plus Tier 2 boundary).

  1. "Portrait, woman in casual clothes, soft daylight." Baseline anatomy and lighting test.
  2. "Same character as before, now in a coffee shop." Consistency across re-rolls.
  3. "Full body, athletic build, beach setting, swimsuit." Anatomy and skin handling at the suggestive boundary.
  4. "Anime-style portrait, similar character, warm tones." Style transfer test.
  5. "Group of three friends laughing, restaurant, evening lighting." Multi-person coherence test.

For apps with a boyfriend mode, I run prompts 1-5 with a male-presenting persona instead. Same checklist, same scoring scale.

Annex C: the voice sample test

Standardized phrase generated in voice on each app that offers voice on the free tier:

"Hey, I'm so glad you came back. I missed you today. What do you want to do tonight?"

I capture the audio, score naturalness and latency, and (when the app offers multiple voices) sample 3 voice options. The phrase is the same on every app, so what I'm comparing is the platform's output, not my prompting.

Sources

  1. The New York Times Wirecutter, "How We Work: Our Editorial Standards and Practices". nytimes.com/wirecutter/about/how-we-work
  2. RTINGS.com, "TV Testing Methodology and Changelog". rtings.com/tv/tests/changelogs
  3. Consumer Reports, "Research and Testing: How We Test". consumerreports.org/cro/about-us/what-we-do/research-and-testing
  4. Federal Trade Commission, 16 CFR Part 255, Guides Concerning Use of Endorsements and Testimonials in Advertising (2024 revision). ftc.gov
  5. Stanford Institute for Human-Centered AI, working papers on parasocial AI companions and chatbot evaluation methodology. en.wikipedia.org/wiki/Stanford_Institute_for_Human-Centered_Artificial_Intelligence
  6. U.S. Code, 18 USC § 2257, Record keeping requirements (CSAM-adjacent compliance baseline for platforms hosting visual depictions of sexually explicit conduct). law.cornell.edu/uscode/text/18/2257
  7. Hastak, M. and Mazis, M. B. (2011). "Deception by Implication: A Typology of Truthful but Misleading Advertising and Labeling Claims." Journal of Public Policy & Marketing, 30(2), 157–167.
  8. Google Search Central, "Evolving 'nofollow': new ways to identify the nature of links" (rel=sponsored introduced 2019). developers.google.com/search/blog/2019/09/evolving-nofollow-new-ways-to-identify

Cite this page

If you reference our AI companion test in academic, regulatory, or journalistic work, please cite as:

Joly, Alexandra (, April 28). How We Score AI Companions: 8-Category Test. bestgirlfriend.ai. https://bestgirlfriend.ai/methodology/ai-companions

Frequently asked questions

How do you test AI companion apps?

Every app runs through three fixed protocols on its free tier in a single session: a 10-prompt conversation persona, five standardized image prompts, and one voice phrase plus a voice-provider disclosure check. The same prompts run on every app, transcripts are dated and saved, and scores are locked at publish.

What are the 8 categories you score?

Eight weighted categories feed the composite: Pricing & Value 18%, Conversation Quality 16%, Privacy & Compliance 14%, Image Generation 12%, Customization Depth 12%, UX & Mobile 10%, Voice Quality 10%, Video Generation 8%. Voice and Video can be Not Applicable when an app does not offer them; the weight redistributes across the remaining categories.

Why is Pricing weighted 18%?

Pricing is the most asked-about and most verifiable signal in this category. Pricing pages do not lie, cancellation friction is testable, hidden token costs are knowable through walkthroughs. Six months of reading user complaints on Reddit, Trustpilot, and the App Store flagged opaque pricing and renewal traps as the loudest source of frustration, and the weight follows that evidence.

How do you test conversation quality?

A fixed 10-prompt persona protocol on the free tier, identical across every app, in a single session. Some prompts are trick prompts designed to catch the bot making things up (asking it to recall a boss the user never mentioned, for example). Persona consistency, memory, response speed, and language quality each score 1 to 10.

How do you test image generation?

Five standardized image prompts on the free tier when offered, scored against a fixed checklist: anatomy, lighting, prompt fidelity, re-roll consistency, generation time. Outputs are saved internally as dated evidence and never republished. When image gen is paid-only, the category is flagged as not directly tested, and we cite independent reviewer videos under six months old plus aggregated user reports.

How do you test voice?

One standardized phrase generated through each app's voice feature on the free tier. The same phrase runs everywhere so naturalness, latency, and language coverage are directly comparable. Apps that disclose their voice provider (ElevenLabs, Resemble, proprietary) earn a small transparency bonus. Voice is set Not Applicable when not offered.

Does Alexandra test paid features?

Not unless we can access them through a free trial or a documented free tier. When a feature lives behind a paywall we did not pay to unlock, the category is flagged as not directly tested, with a footnote naming the gap and citing the fallback source: independent reviewer videos under six months old or thirty-plus recent aggregated user reports of the specific feature. We never claim access we did not have.

How fresh is each score?

Each category has its own re-test schedule. Pricing & Value re-tests every 3 months or on detected change. Conversation, Image, Video, UX, and Privacy re-test every 6 months. Voice and Customization re-test every 12 months. Major events (base model swap, ToS update, regulatory incident, UI overhaul) trigger an early re-test on the affected category within 30 days.

What are the tier labels?

Composite scores map to seven plain-language tiers: Best in class (9.0+), Excellent (8.0-8.9), Strong (7.0-7.9), Good (6.0-6.9), Average (5.0-5.9), Below average (4.0-4.9), Avoid (below 4.0). Anything below 5.0 is excluded from our recommendations regardless of affiliate payout.

What is the absolute red line on Privacy & Compliance?

Any CSAM-adjacent gap (missing underage policy, missing 18 USC 2257 statement when applicable, marketing of young-presenting personas, content moderation failures involving minors) automatically drops Privacy & Compliance to 1/10 and disqualifies the platform from any recommendation on bestgirlfriend.ai. This rule is non-negotiable and supersedes every commercial consideration.

Why don't you score live cam sites here?

Live cam platforms are real-people broadcasts, not AI products. Model variety, broadcast quality, tipping flow, and country coverage are the categories that matter there, none of which map onto Conversation Quality or Image Generation. Cam sites run on a parallel six-category test documented at /methodology/cam-sites.

Why don't you score adult games here?

Adult game platforms (porn games, hentai games, harem games) run on a seven-category test built around game mechanics, art direction, monetization, and a unique Billing Transparency category we publish only on the adult-game test. Conversation Quality and Image Generation do not map onto game loops. The full test lives at /methodology/adult-games.

What changes trigger a re-score?

Three change types trigger an early re-test: a major model swap (base LLM upgrade or proprietary engine replacement), a Terms of Service or Privacy Policy update affecting user rights, and a public regulatory incident, settlement, or lawsuit. Re-tests are scoped to the affected categories only, completed within 30 days, and logged on the page with a delta and rationale.

Can I see your test transcripts?

Internal transcripts, screenshots, voice samples, and image outputs are stored as dated evidence but not published as raw files (partly for reader experience, partly because we do not redistribute platform content we do not own). Verifiable journalists, academic researchers, and platforms contesting a published score can request a redacted summary at [email protected].

How do I report a scoring error?

Email [email protected] with the URL of the affected review, the specific claim you are contesting, and any supporting sources you can share. Corrections are logged at the top of the affected page for 60 days, and the page's update log records the change when it is material. Platform contestations follow the same path as reader contestations.

