"I uploaded one photo, wrote a script, and… that person is actually talking?"
Up until now, ElevenLabs was the company that made voices.
But this time, it started making faces too.
Drop in a script and out comes a talking AI-person video in one shot. This is the story of Avatars.
Hey there, this is Sonetho. ⚡
We’ve been using ElevenLabs every single day for nearly three years,
and today we're bringing you a brand-new feature officially announced in mid-June 2026: Avatars.
Here's the one-line version.
You can now build an entire talking-person video right inside ElevenLabs.
Upload a photo to create an AI person, write a script, pick a voice,
→ and out comes a video where that person speaks with lip movements perfectly in sync.
If you've ever heard of HeyGen or Synthesia (talking AI-avatar video services), that's the lane.
And now the "voice champion," ElevenLabs, has stepped into it.
Today we'll dig all the way into what this is, how to use it, and how it differs from the existing services, all at a beginner-friendly level!
👉 Get started with ElevenLabs Avatars (free sign-up, no card) →
🤔 Why is a voice company suddenly making faces?
Let's start with the jargon, kept simple.
💡 Get-it-in-one-go glossary
Avatars = your own AI person built from a photo or text. Make it once and reuse it across many videos.
Talking-head = the "talking face" video where a person looks at the camera and speaks, common on YouTube and in ads.
Lip-sync = the tech that naturally matches the mouth movements to the voice.
ElevenCreative = ElevenLabs' content-creation workspace. Avatars now live inside its Image & Video menu.
ElevenLabs' real weapon, no matter who you ask, is the voice.
It's world-class at TTS (turning text into a human-sounding voice) and voice cloning.
But video creators kept running into this hassle:
generate the voice in ElevenLabs,
re-upload that audio file into another service (HeyGen, etc.),
and match the lips over there… that handoff (shuffling files back and forth) was a pain.
Avatars solves this whole sequence in one place.
Voice, face, and lip-matching → all done at once inside ElevenLabs.
It's not that a voice company built a face. The bigger picture is connecting "voice to video" seamlessly.
⚙️ How it works: the "export the audio" step is gone entirely
One key phrase from the announcement stands out:
"Text to Speech is built right into the prompt island."
Sounds technical, but it means something simple.
💡 In plain terms
Right where you type the script (the prompt island = the input panel where you write your prompt), the voice-generation feature is built in.
So the voice and the lip-synced video are generated together, in one go.
There's no need at all to export the audio file separately and move it elsewhere.
One more thing here.
It works in ElevenLabs' favor that it owns the voice-generation piece itself.
Because the voice model and the lip-sync model run together under the same roof,
the official announcement says the sync (the timing between mouth and sound) lines up more tightly than pulling in audio from the outside and matching lips to it.
That subtle mismatch, where the mouth says "hel-" but the sound says "-lo," gets reduced.
📌 Editor’s note: YOU pick the lip-sync model ⚡
ElevenLabs gathers several strong lip-sync technologies in one place and
lets you pick the lip-sync model you want right on the generation screen (a default is provided too).
The key point: each model differs in quality, max resolution, and credits-per-second. We've laid it all out in the real-world table just below.
🎬 Walkthrough: from a photo to a talking video, step by step
The actual flow is simpler than you'd expect.
Here it is, based on the official guidance.
Step 1: Create your avatar (your own AI person)
In ElevenCreative's Image & Video menu, hit "New" in the Avatar area.
Then create your person one of two ways.
Upload photos: uploading 3 to 5 photos of the same person from different angles gives stable results.
(With just one photo, results can be hit-or-miss.)
Describe in text: you can also create one with no photos by describing "this kind of person" in a text prompt.
And note: you can make avatars not just of people but of characters and animals too. (Non-human is OK.)
Step 2: Name it and set a default voice
Give your avatar a name, set a default voice if you want, then lock the person in with "Create Avatar."
Each avatar gets a default voice attached up front, but you can change it anytime.
Step 3: Make the talking video
Select the avatar you made and hit "Create Lip Sync."
Then ① pick a style → ② pick a voice (a library voice or one you've cloned) → ③ enter the script → ④ hit "Generate speech" to create the voice and preview it.
Step 4: Generate
Add a quick visual prompt to set the mood if you like, then hit "Generate" and you're done.
The lip-synced video is finished, voice and all.
💡 Check the credits before you click
Avatar videos follow the existing "Image & Video" credit structure.
Cost varies by the lip-sync model you choose, the output resolution, and the video length.
The good news: the estimated credits show up on screen before you hit the generate button. Look first, then click!
(Supported resolutions are 480p, 720p, and 1080p, though video length reportedly moves the credit total more than resolution or aspect ratio.)
So we pulled the credits-per-second for each lip-sync model straight from the actual model-selection screen in June 2026. (Lower numbers are cheaper.)
| Lip-sync model | Credits/sec | What it's for (official description) |
|---|---|---|
| Veed Lipsync | 41 | Fast, affordable video lip-sync |
| Sync Lipsync 2 Pro | 661 | Studio-grade for live-action, anime, and AI content |
| Creatify Aurora | 848 | Top quality from images, guided lip-sync |
| Sync 3 | 1,053 | Visual intelligence, professional quality |
| HeyGen Avatar 4 (new) | 1,212 | Expressive movement, up to 1080p |
| Veed Fabric | 1,212 | Realistic from any image, up to 720p |
| OmniHuman 1.5 | 1,267 | Realistic lip-sync, supports non-human faces |
⚠️ The "per-second" trap: it scales directly with length
Because it's credits per second, the longer the video, the higher the cost climbs.
Example: Sync 3 (1,053/sec) for a 30-second video → about 31,600 credits. A 1-minute video runs about 63,200 credits.
On the Creator plan (roughly 120,000 credits/month), that covers only 3 full thirty-second clips, not quite 4. Honestly, not a lot of headroom.
By contrast, a cheap model like Veed Lipsync (41/sec) runs about 1,230 credits for 30 seconds, so you get dozens of times more for the same credits.
It's a quality vs. cost trade-off.
On top of that, avatar (image) generation credits are separate. The per-second figures above cover only the "talking video (lip-sync)" portion.
※ Credits-per-second are real values from the June 2026 model-selection screen. Models and pricing policies change often, so always check the estimated credits on the screen right before generating.
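If you'd rather let a few lines of code do this math, here's a minimal Python sketch. It assumes cost scales linearly with seconds (as per-second pricing implies), uses the table's June 2026 values, and takes the rough 120,000 credits/month Creator figure as the default budget. The function names are our own for illustration and are not part of any ElevenLabs tool.

```python
# Credits per second, copied from the table above (June 2026 screen values).
CREDITS_PER_SEC = {
    "Veed Lipsync": 41,
    "Sync Lipsync 2 Pro": 661,
    "Creatify Aurora": 848,
    "Sync 3": 1053,
    "HeyGen Avatar 4": 1212,
    "Veed Fabric": 1212,
    "OmniHuman 1.5": 1267,
}


def lipsync_credits(model: str, seconds: int) -> int:
    """Estimated lip-sync credits, assuming cost is simply rate x seconds."""
    return CREDITS_PER_SEC[model] * seconds


def clips_per_month(model: str, seconds: int, monthly_credits: int = 120_000) -> int:
    """Whole clips that fit in a monthly budget.

    Avatar (image) generation credits are NOT included here.
    """
    return monthly_credits // lipsync_credits(model, seconds)


if __name__ == "__main__":
    # Print a quick comparison for 30-second clips.
    for name in CREDITS_PER_SEC:
        cost = lipsync_credits(name, 30)
        clips = clips_per_month(name, 30)
        print(f"{name}: {cost:,} credits per 30s clip, {clips} clips/month")
```

Treat the output as a rough planning number only. The estimated credits shown on the generation screen are the figure that actually counts.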
👉 Try building an avatar yourself →
🪪 Make it once, use it forever: persistent identity & "Style" variations
The real strength of avatars is reuse.
An avatar you make once carries a persistent identity.
In plain terms, you can have that one person you built show up across many videos with the same face every time.
No more accidents where the face shifts subtly from video to video.
On top of that comes the Styles feature.
You can create variations that keep the person's core identity intact while changing the following:
camera angle (front / side, etc.)
outfit (suit / casual, etc.)
background and lighting
For example, build one "our brand presenter" and then pull an office-suit version, an outdoor-casual version, and a close-up version, all as the same person.
These avatars and styles stay consistent no matter how many times you generate, so you can reuse them across multiple projects.
📌 Why this matters ⚡
Whether it's a YouTube channel or an ad, viewers remember a brand when the same face shows up consistently.
Filming fresh every time, or using a different AI person each time, breaks that consistency.
Avatars give you a cast member you can "build once and milk for life."
🔁 Mass-produce with Flows: cranking out UGC ads in one batch
From here it's a bit more advanced, but for marketers and UGC creators it's pure gold.
💡 Just two terms
Flows = an automation feature that runs tasks one after another, like an automatic conveyor belt.
UGC ads = testimonial-style ads that look "shot by a real user." These days it's the format that performs best on Instagram, TikTok, and Shorts.
This release adds a new "Avatar node" (avatar block) to Flows.
Plug it in and you can wire avatar-video generation into an automated pipeline.
Here's the official example flow, copied straight over:
① enter a product brief (a short product description)
② AI generates the script
③ generate the voiceover (narration audio)
④ generate the video of the avatar speaking that script
And you can run this in a batch across products, languages, and hooks all at once.
Here a "hook" means the opening line that grabs you in the first 3 seconds of a video.
So you could swap in 5 different hooks ("You'll regret not knowing this," "Watch for just 3 seconds," etc.) and crank out 5 ad variations in one go.
It's perfect for work where you test which opener lands best by running multiple versions, as with Shorts and Reels ads, because you don't have to reshoot every time.
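To get a feel for how fast a batch grows, here's a small Python sketch that lists every product × language × hook combination. This is not the Flows feature or any ElevenLabs API (Flows is a visual tool). `build_jobs` and the job fields are hypothetical names we made up purely to show the counting.

```python
from itertools import product


def build_jobs(products, languages, hooks):
    """Return one job per (product, language, hook) combination.

    Hypothetical helper for planning a batch; it only builds a list.
    """
    return [
        {"product": p, "language": lang, "hook": h}
        for p, lang, h in product(products, languages, hooks)
    ]


hooks = [
    "You'll regret not knowing this",
    "Watch for just 3 seconds",
]
jobs = build_jobs(["Product A"], ["en", "ko"], hooks)

# 1 product x 2 languages x 2 hooks = 4 ad variations
print(len(jobs))
for job in jobs:
    print(job)
```

Since every variation is its own video, the credit cost multiplies along with the job count. It's worth counting the combinations before you launch a big batch.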
⚖️ How is it different from HeyGen and Synthesia? (an honest comparison)
"I already have HeyGen and Synthesia, so why ElevenLabs?"
A fair question. Here are just the essentials. (Prices are based on official and comparison sources and may vary with promotions and billing cycles.)
| Service | Strength / billing | Best when |
|---|---|---|
| ElevenLabs Avatars | Voice is the core business → voice + face in one place. Credit-based | Voice quality comes first, multilingual voices |
| Synthesia | Billed by the minute, so budgeting is easy. Avatars get good reviews for realism | Corporate training and internal videos |
| HeyGen | Credit-based. Strong at multilingual translation of existing video | Marketing and translating content for global audiences |
If we boil the key differentiator down to one line, it's this.
ElevenLabs is a "voice-first integration."
A company whose voice was already world-class attached a face (lip-sync) to that voice, letting you produce it all in one screen, in one go.
You don't have to shuffle audio around, and according to the announcement, the voice-to-lip sync lines up more tightly.
Here's a quick feel for the pricing. (As of June 2026.)
HeyGen: credit-based. Roughly $1 per minute for its flagship avatar feature (Avatar IV) on the Creator plan.
Synthesia: subscription billed by the minute. Annual billing works out to about $1.8 to $2.1 per minute.
ElevenLabs Avatars: depending on the lip-sync model you pick, the range is wide, about $0.45/min (cheap) to $13.8/min (premium) (see the credits-per-second table above).
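Here's a rough Python sketch of how a credits-per-second rate turns into a per-minute dollar figure. The dollars-per-credit rate is an assumption: we back it out of the "about $0.45/min" figure for the cheapest model (41 credits/sec), so your real rate will depend on your plan. `usd_per_minute` is our own illustrative name.

```python
# ASSUMPTION: implied dollars per credit, backed out of the article's
# "about $0.45/min" figure for the 41 credits/sec model. Not an official rate.
USD_PER_CREDIT = 0.45 / (41 * 60)


def usd_per_minute(credits_per_sec: int, usd_per_credit: float = USD_PER_CREDIT) -> float:
    """Convert a credits-per-second rate into an approximate dollars-per-minute cost."""
    return credits_per_sec * 60 * usd_per_credit


if __name__ == "__main__":
    # Cheapest and priciest rows from the credits-per-second table.
    for name, rate in [("Veed Lipsync", 41), ("OmniHuman 1.5", 1267)]:
        print(f"{name}: about ${usd_per_minute(rate):.2f} per minute")
```

Under that assumption the priciest model comes out a little under $14 per minute, close to the roughly $13.8/min quoted above. Pass your own `usd_per_credit` if your plan's credit price differs.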
💰 So which is actually cheaper? We crunched it fully in Part 2
Honestly, if you make a lot of high-res video, a dedicated platform (HeyGen, Synthesia) can be cheaper per minute, while for occasional, small-batch, integrated-workflow use, ElevenLabs wins.
We worked out the break-even point, which comes down to "how many minutes you make per month," using a real per-minute cost table.
→ [Avatar Cost Showdown] Direct subscription vs. ElevenLabs: see who's really cheaper →
🚨 To be honest, some things are still uncertain
The maximum video length you can make in one go per model, and the credits for generating the avatar (image) itself, vary by model and setting and aren't published as exact figures.
(Max resolution also differs by model. As in the table above, some go up to 720p and some up to 1080p.)
Still, the exact cost shows up as estimated credits on the screen right before generating, so just look and click.
Also, there's no API (external integration) at launch; it's planned for later.
🙋 So, who is this great for?
In our view, it's especially powerful for these folks.
Shorts and Reels creators: run a channel with a consistent "AI cast member," no need to show your own face.
UGC ad and performance marketers: mass-produce ad variations by just swapping the hook, making A/B testing easy.
Course and education creators: build series lessons with "the same instructor," scaling across subjects and languages.
Brand and social media managers: keep cranking out social content without filming each time.
Anyone who needs multilingual explainer videos: combine with ElevenLabs' multilingual voices to produce localized video.
On the flip side, if you want to make videos completely free, it's still a letdown.
Avatars (video generation) is available only on paid plans (the free plan can't generate video).
The good news is it's currently available on all paid ElevenCreative plans.
❓ Frequently asked questions
Q. Can I make an avatar right away with just one photo?
Technically yes, you can make one from a single photo, and you can even make one with no photos by describing it in text (a text prompt).
That said, the official guidance recommends 3 to 5 photos of the same person from different angles.
With just one, the face may not stay consistent across videos. If you want stable results, upload several.
Q. Can I make a talking avatar video on the free plan?
No. Avatar video generation is only available on paid plans (the free plan restricts video generation).
But it's available on all paid ElevenCreative plans, and the cost is deducted from your existing "Image & Video" credits.
The cost varies by the model, resolution, and video length you choose, and the estimated credits are shown on screen before you generate, so you can look and decide.
Q. Is there a reason to use ElevenLabs Avatars instead of HeyGen or Synthesia?
The biggest difference is the voice.
ElevenLabs' core business has always been TTS and voice cloning, so its voice quality and multilingual voices are strong.
Attach a face (lip-sync) to that, and the key strength is making video in one place, in one go, without moving audio to another service.
It's compelling if voice quality is your top priority or you frequently make multilingual videos.
(Conversely, Synthesia is a good pick if minute-based budgeting matters, and HeyGen if your main goal is multilingual translation of existing video.)
Q. Can I keep using the same person so the face doesn't change from video to video?
Yes, that's the heart of avatars.
An avatar you make once keeps a persistent identity, so it appears with the same face across many videos no matter how many times you generate.
With the Styles feature, you can also create variations that change only the angle, outfit, or background, keeping the identity while varying the staging.
🎁 Wrapping up
Let's recap just the essentials.
Avatars = a new feature where an AI person built from a photo or text comes out as a video, speaking your script in sync.
Voice and lip-matching happen in one screen, in one go → no hassle moving audio, and tighter sync.
An avatar you make once is reused continuously, with Styles for angle, outfit, and background variations.
The Avatar node in Flows lets you mass-produce UGC ads and Shorts by hook and by language.
Some figures, like the max video length per model and the avatar-generation credits, aren't public → check the estimated credits shown before generating.
The "voice champion" now holds the face in hand too.
An era where voice flows all the way through to video has begun.
If you're on a paid plan, upload a few photos today
and build yourself an AI cast member.
Try it once, and in about a minute you'll feel what it's like to turn a single line of script into a "talking video"!
👉 Get started with ElevenLabs Avatars (free sign-up, no card) →
We'll be back next time with more genuinely useful tips.
This was Sonetho. ⚡