Tags: KBO fan cam, stadium goddess, Kling prompt, viral AI, Seedance 2.0

KBO Fan Cam AI: The Stadium Goddess Prompt Explained

The viral KBO 'Stadium Goddess' AI fan cam pulled 14.9M views. Here's the exact Kling prompt, why it works, and how to make your own.

Starrd Team | May 10, 2026 | 15 min read

The 14.9M-View AI Fan Nobody Realized Was Fake

In early May 2026, a clip from a Korean Baseball Organization (KBO) broadcast started ripping through Korean Twitter, TikTok, and YouTube. The shot was unremarkable on its face — a young woman in the upper stands, team jersey, iced drink in hand, swaying her cheering stick (응원봉) during a lull between innings. The telephoto lens compressed the packed crowd behind her into a wall of blue and red. She blinked. She shifted her weight. She glanced toward the field.

Then someone pointed out that she wasn't real.

The "fan" was an AI-generated video, almost certainly built in Kling, dropped into the timeline of a real broadcast moment. By the time the Korea Times ran a piece on it, the clip had crossed 14.9 million views and spawned a wave of imitators. BigGo's reporting on a parallel "Baseball Goddess" version put that one at 8 million views inside a week.

This is now a format. People are calling it the KBO Fan Cam, the Stadium Goddess, or just "caught on cam" content. And like every viral AI format before it, the entire thing rests on one specific prompt structure that anyone can copy.

The Starrd Fan Cam template — one photo in, broadcast-grade catch out

Prompt used

Single continuous live sports broadcast shot, 12s, 16:9. Telephoto broadcast lens, 135mm equivalent, locked off from upper press box. Subject sits in packed stadium stands during a night game — team jersey, iced drink in hand, cheering stick on knee. Crowd densely compressed behind by long lens. Micro-actions only: blinks, weight shifts, sips drink, repositions cheering stick. No eye contact with camera. No cinematic drama. No music. Ambient stadium audio only. Pure live TV capture, broadcast video grain.

Pro Tip

Don't want to hand-write this prompt? The Fan Cam template is already live in the Starrd library. Upload one photo and get a 12-second broadcast-grade catch with the same prompt structure, personalized for your face. The rest of this post is for people who want to understand why it works.

Why This Trend Works (and Why Most AI Videos Don't)

Most AI video looks like AI video because creators reach for the wrong ambition. They want their generation to be cinematic — sweeping cameras, dramatic lighting, slow motion, the works. The viewer's brain immediately reads "this is a video someone made" and the magic dies on contact.

The KBO fan cam does the exact opposite. Every choice is engineered to look like nobody made anything:

  • Telephoto compression — a 120-150mm focal length is what real broadcast cameras use to pick faces out of the stands. Wide angles look like a phone. Telephoto looks like ESPN.
  • No camera movement — locked-off, maybe a tiny handheld breath. Real broadcast operators hold steady on a face for 3-5 seconds before cutting.
  • Micro-actions only — a blink, a weight shift, a hand reposition. Not "performs a cheer," not "celebrates a home run." The whole point is that nothing performative happens.
  • No eye contact with camera — she's watching the game, not the lens. The moment a subject looks into camera, the illusion of a candid catch collapses.
  • Imperfect framing — slightly off-center, head not perfectly composed. Real broadcast camera operators are zooming and reframing in real time; perfection screams "rendered."

The trend isn't "make a beautiful video of a beautiful person at a baseball game." The trend is "make a video so boring and so technically specific that the viewer's brain classifies it as broadcast footage before it classifies it as content."

The Verbatim Kling Prompt That Started It

Here's the prompt skeleton being passed around — pulled from the Carat.im fan-cam guide, which documents the public formula:

Kling 'Stadium Goddess' Prompt — Original Formula

@image1 = character identity reference only (face, hairstyle, proportions).

Output: single continuous live sports broadcast shot, 4-5s, 16:9, 1080p, no cuts.

Telephoto broadcast lens (120-150mm). Long-distance zoom from upper stands camera.

[0-2s] Sits still, blinks once. [2-4s] Subtle weight shift. [4-5s] Small hand reposition.

Unstaged, candid, real broadcast moment. No cinematic drama. Pure live TV capture.

Read it carefully. Notice what isn't there: no lighting direction, no mood, no music, no narrative arc, no "epic," no "stunning," no "dramatic." The prompt is doing one job — instructing the model to not perform.

The @image1 tag is Kling-specific — it tells the model to treat the attached photo as a face reference only, not a style or composition reference. Most modern video models have an equivalent: in Seedance 2.0, an attached reference image is treated as character identity by default when the prompt describes a new scene.
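To make the identity-only role concrete, here's a minimal sketch of how a request like this might be assembled programmatically. Every field name below (`reference_images`, `role`, `allow_cuts`, and so on) is an illustrative assumption, not Kling's or Seedance's actual API schema — check your provider's docs for the real parameter names.

```python
# Hypothetical payload for a reference-image video generation request.
# All field names are illustrative assumptions, NOT a real API schema.
def build_fancam_request(image_path: str, prompt: str,
                         duration_s: int = 5, aspect: str = "16:9") -> dict:
    return {
        # The photo is tagged as an identity reference only, mirroring
        # what Kling's @image1 convention expresses in prompt text.
        "reference_images": [{"path": image_path, "role": "character_identity"}],
        "prompt": prompt,
        "duration_seconds": duration_s,
        "aspect_ratio": aspect,
        "allow_cuts": False,  # guardrail: single continuous shot
    }
```

The point of the sketch is the separation of concerns the trend depends on: identity comes from the image, while scene, clothing, and camera all come from the prompt.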

The Anatomy of the Prompt, Line by Line

If you only memorize one part of this post, memorize this section. Every element in the prompt is load-bearing.

"Single continuous live sports broadcast shot, 4-5s, 16:9, 1080p, no cuts."

The model is being told this is one shot, not a montage. AI video models will sometimes invent cuts to fill duration if you don't forbid them. "No cuts" is a guardrail. 4-5 seconds is the duration that matches a real broadcast hold on a fan. 16:9 1080p matches broadcast spec — anything else (vertical, 4K, square) breaks the illusion immediately.

"Telephoto broadcast lens (120-150mm). Long-distance zoom from upper stands camera."

This is the single most important sentence in the prompt. Telephoto compression is what your eye recognizes as "TV." Two things happen at 120-150mm:

  1. Background compression — the crowd behind the subject squishes flat, packed tight, no visible gaps between rows. Wide angles spread the crowd out and reveal seat geometry; telephoto stacks bodies into a wall.
  2. Subject isolation — shallow depth of field naturally separates the fan from the crowd without you having to ask for it.

"From upper stands camera" places the virtual camera in a real broadcast position — typically the press box, level with or slightly above the subject. Ground-level angles read as fan-shot phone footage, not broadcast.
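The compression effect falls straight out of lens geometry. For a full-frame sensor (36mm wide), the horizontal field of view is 2·atan(sensor_width / 2f). A quick calculation (my own illustration, assuming a full-frame sensor) shows why 135mm reads as "TV" while 24mm reads as "phone":

```python
import math

def horizontal_fov_deg(focal_mm: float, sensor_width_mm: float = 36.0) -> float:
    """Horizontal field of view for a rectilinear lens on a full-frame sensor."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_mm)))

# 135mm telephoto: ~15 degrees. Only a few rows of crowd fit the frame,
# so bodies stack into the flat "wall" the trend depends on.
print(round(horizontal_fov_deg(135), 1))  # 15.2

# 24mm phone-style wide: ~74 degrees. Seat geometry and row gaps are
# visible, which instantly reads as fan-shot footage.
print(round(horizontal_fov_deg(24), 1))   # 73.7
```

A roughly 5× narrower view at the same distance is why the telephoto request matters more than any style keyword in the prompt.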

Time-Segmented Micro-Actions

[0-2s] Sits still, blinks once.
[2-4s] Subtle weight shift.
[4-5s] Small hand reposition.

This is the discipline that separates the viral originals from the imitators that get clocked instantly. The temptation is to write "She watches the game and cheers." Don't. Specify one tiny action per beat, in plain language. A real broadcast cut to a fan captures exactly this — a few seconds of someone existing, not performing.

The actions chosen here are deliberately ambient: blink, shift, reposition. They prove the subject is alive without proving they're a character. Compare to "smiles warmly at the camera" — that's a character beat, and it kills the illusion.
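If you're generating many variants, the beat structure is easy to templatize. A minimal sketch (my own helper, not part of any model's API) that spaces a list of micro-actions evenly across the clip duration:

```python
def beat_lines(actions: list[str], duration_s: int) -> str:
    """Format micro-actions as evenly spaced '[start-end s]' prompt beats."""
    n = len(actions)
    beats = []
    for i, action in enumerate(actions):
        start = round(i * duration_s / n)
        end = round((i + 1) * duration_s / n)
        beats.append(f"[{start}-{end}s] {action}")
    return " ".join(beats)

print(beat_lines([
    "Watches the field intently, blinks twice.",
    "Brief glance down at iced drink, takes a slow sip.",
    "Subtle weight shift, repositions cheering stick.",
    "Eyes follow something moving across the field. Holds.",
], 12))
# [0-3s] ... [3-6s] ... [6-9s] ... [9-12s] ...
```

The discipline lives in the action list, not the helper: every entry should stay at the "blink, shift, reposition" scale, or the even spacing just distributes character beats that break the illusion.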

"Unstaged, candid, real broadcast moment. No cinematic drama. Pure live TV capture."

Closing instructions. These are negative prompts disguised as positive ones. "No cinematic drama" tells the model to stop reaching for the standard AI video aesthetic. "Pure live TV capture" is a strong style anchor that models trained on broadcast footage will lock onto.

The 12-Second Version (Seedance 2.0)

The original is 4-5 seconds, which fits Kling's defaults. Most production AI video workflows — including the Starrd template library — run on Seedance 2.0 at 12 seconds. Stretching the prompt without breaking the candid feel is the tricky part. The wrong way is to add bigger actions. The right way is to add more small actions, spaced out.

This is the prompt structure the Starrd Fan Cam template uses under the hood:

Seedance 2.0 — KBO Fan Cam (12s)

Single continuous live sports broadcast shot, 12s, 16:9, no cuts.

Subject sits in the upper stands of a packed KBO baseball stadium during a night game. Team jersey, iced Americano in left hand, cheering stick (응원봉) resting on knee. Hair slightly mussed from breeze. Crowd densely packed behind, compressed by long lens — sea of team colors, occasional movement, blurred faces.

Telephoto broadcast lens, 135mm equivalent, locked off from upper press box position. Slight handheld breath only. Shallow depth of field — subject in focus, crowd softly blurred. Stadium floodlight casts cool top-down light, slight warm bounce from team color sea behind.

[0-3s] Watches the field intently, blinks twice. Small head tilt as she tracks a play. [3-6s] Brief glance down at iced drink, takes a slow sip, returns gaze to field. [6-9s] Subtle weight shift, repositions cheering stick from knee to hand without raising it. [9-12s] Eyes follow something moving across the field. One small unconscious smile. Holds.

Unstaged, candid, real broadcast moment. No eye contact with camera. No cinematic drama. No music. Ambient stadium audio only — distant crowd murmur, faint announcer through PA, occasional cheering stick clack. Pure live TV capture aesthetic, slight broadcast video grain, 1080p 30fps interlaced feel.

A few changes worth calling out:

  • The drink sip at 3-6s is the highest-risk beat. Sips are easy to render unnaturally. If you're rolling your own and the generation comes out weird here, swap it for "adjusts grip on cheering stick" — same purpose, less mouth animation risk.
  • "1080p 30fps interlaced feel" is doing real work. Broadcast HD in both the US and Korea is typically 1080i/60 (both countries use the ATSC standard). Asking for that interlaced look pushes the model away from the slick 24fps cinematic default that screams "AI video."
  • Ambient audio only, no music. This is non-negotiable. The moment a music bed plays under your "broadcast catch," the format dies.
Warning

Do not write "fast" anywhere in this prompt, and do not ask for slow motion. Both ruin the broadcast feel. The whole trend depends on real-time motion at real-time speed. If you find yourself wanting more energy, you're already drifting away from what makes this format work.

Reference Image: What to Feed the Model

Kling, Seedance, and the Starrd Fan Cam template all accept a reference image for character identity. For this trend, the reference image matters more than usual because the entire payoff is "wait, is that a real person?" Imperfections help.

What to use:

  • A clear, well-lit photo of the subject's face. Not a selfie at a weird angle. Front or three-quarter view, eyes open, neutral expression.
  • Avoid heavy makeup or stylized photos. Influencer-shot images push the model toward "glamour" rendering, which fights the candid broadcast vibe.
  • One person only. Multi-subject references confuse identity locking.

What to skip:

  • Full-body shots — the model doesn't need the body; you're describing the clothing in the prompt.
  • Heavily filtered or AI-processed reference images. Garbage in, garbage out.
  • Group photos. Cropping doesn't save them; the model still sees the extra faces.

Variants of the Trend Already Emerging

Within a week of the original going viral, the format had already branched. These variants are worth knowing about if you're planning to ride the wave:

  • MLB version — same prompt structure, swap KBO ballpark for Yankee Stadium / Dodger Stadium / Wrigley Field. Lose the cheering stick, add a beer or a foam finger. Works in the US market where the Korean specifics don't land.
  • Kiss Cam catch — the next obvious evolution. Two subjects, jumbotron POV, subject realizes mid-shot. Higher production risk because it requires a reaction beat, which is exactly the thing the original trend forbids. Hard to nail.
  • Courtside NBA version — telephoto from across the court, celebrity-adjacent seats. The aesthetic translates cleanly.
  • Soccer / football match cam — works in the EU. Premier League and La Liga broadcasts have the same telephoto-fan-catch grammar.
  • Concert crowd catch — different lighting (stage spill instead of stadium floodlights), but the same "caught existing" formula. Coachella, Fuji Rock, festival main stage crowds all work.

The thing all the variants share: a real broadcast context the viewer already trusts, and a fan who is conspicuously not performing for the camera.

Why Most Imitators Get Clocked

Watch the failures. Almost every imitation video that gets called out as AI fails in one of these specific ways:

  1. The fan looks at the camera. Game over. Real fans on broadcast almost never make eye contact with the lens; they're watching the field.
  2. Too-perfect lighting on the face. The original benefits from harsh stadium floodlights creating slight unflattering shadows. AI defaults push toward soft flattering key light, which reads "set" instead of "stands."
  3. The crowd behind moves like a screensaver. When the background loops or moves with unnatural synchronization, the eye catches it. Generate at higher quality, or accept the limitation and frame tighter.
  4. No micro-imperfections. A stray hair, a slightly crooked cheering stick, an iced drink with condensation drip — these tiny details are what real broadcast catches. Sterile compositions read as rendered.
  5. The duration is wrong. Real broadcast holds a fan for 3-6 seconds before cutting away. 8+ seconds of any fan in a single shot is already unusual; 12+ seconds without action is suspicious unless you've nailed every other detail.

Posting Strategy if You Want Views

The prompt is half the battle. The other half is where and how you drop it.

  • Lead with the catch, not the reveal. The viral originals posted the clip with no caption, no "AI" tag, no jokey commentary. The view count comes from people wondering. If you label it AI in the caption, you're posting an AI demo, not a viral broadcast clip. (Platforms increasingly require AI disclosure — comply with their rules, but the framing of the caption still matters within those rules.)
  • Frame it as a broadcast. "KBO last night 😭" or "Lotte fan was unbothered" lands better than "Made with AI."
  • Reply guy energy in comments drives reach. Drop one comment 30 minutes after posting acknowledging "wait, is this AI?" — engagement bait, but it works because it gives the algorithm the controversy signal it loves.
  • Vertical reframe for TikTok/Reels. The original is 16:9, but a 9:16 centered crop holds up because the subject is mid-frame. Don't generate vertical natively — the telephoto broadcast aesthetic only works in 16:9 first.
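If you're doing the vertical reframe yourself, ffmpeg's `crop=w:h:x:y` filter handles a centered 9:16 window cleanly. A small helper (my own sketch; only the filter syntax is ffmpeg's) that computes the filter string for any 16:9 source:

```python
def vertical_crop_filter(src_w: int, src_h: int) -> str:
    """ffmpeg crop filter string for a centered 9:16 window at full height."""
    w = src_h * 9 // 16
    w -= w % 2            # keep dimensions even for common encoders
    x = (src_w - w) // 2  # center horizontally; the subject sits mid-frame
    return f"crop={w}:{src_h}:{x}:0"

print(vertical_crop_filter(1920, 1080))  # crop=606:1080:657:0
```

Then, for example: `ffmpeg -i fancam.mp4 -vf "crop=606:1080:657:0" fancam_vertical.mp4`. Because the subject is mid-frame, the centered crop usually needs no manual reframing.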

Skip the Prompt Engineering — Use the Template

Everything above is the deep version. The shallow version: the Starrd Fan Cam template already has all of this baked in. The time-segmented micro-actions, the telephoto framing, the broadcast grain, the no-music ambient audio, the personalization layer that ties it to your face — it's one upload and one tap.

We built it on Seedance 2.0 with the prompt structure above. The video at the top of this post is a generation straight from the template, no manual prompt tweaking.

The Stadium Goddess trend won't last forever — viral AI formats burn out in 4-6 weeks once everyone has seen them. If you want to catch the wave, the template is in the library right now.

Fan Cam

The KBO Stadium Goddess format — telephoto broadcast catch with one photo in


Festival Main Stage

Crowd POV with volumetric stage lighting — works for the concert variant of this trend


Frequently Asked Questions

What is the KBO Fan Cam / Stadium Goddess AI trend? It's a viral AI video format where an AI-generated "fan" is dropped into footage that looks like a real Korean baseball broadcast. The original clip racked up 14.9M views before viewers realized the fan wasn't real. The trend now includes MLB and concert variants.

What AI model was used to make the original viral fan cam? The original viral clips were generated in Kling using a specific telephoto broadcast prompt structure. The format works on any AI video model that accepts a reference image — Kling, Seedance 2.0, Runway Gen-4, and Veo all support it.

What is the exact Kling prompt for the Stadium Goddess video? The public prompt structure: single continuous live sports broadcast shot, 4-5s, 16:9, 1080p, no cuts. Telephoto broadcast lens 120-150mm from upper stands camera. Time-segmented micro-actions only (blink, weight shift, hand reposition). Unstaged, candid, no cinematic drama — pure live TV capture.

Why does the AI fan cam look so realistic? Three reasons: telephoto compression (120-150mm) matches real broadcast cameras, locked-off framing with no camera movement, and micro-actions only (no smiling at camera, no performing). The format works because it engineers boring authenticity, not cinematic drama.

How long can the video be? The original viral clips are 4-5 seconds. Seedance 2.0 supports 12 seconds — the Starrd Fan Cam template runs at 12s by adding more ambient micro-actions, not bigger ones. Longer than 12s starts to feel unnatural for a broadcast catch.

Can I make this without writing prompts? Yes. The Starrd Fan Cam template handles the entire prompt structure for you. Upload one photo, the AI personalizes the prompt to your face and generates the broadcast catch on Seedance 2.0 in minutes.


Ready to create your own video?

Pick a template, upload your photos, and generate a cinematic Seedance 2.0 video in minutes.

Browse Templates