Talk with an AI-powered, detailed 3D avatar. It uses a large language model (LLM), text-to-speech (TTS), the Unity game engine, and lip sync to bring the character to life.
In the video, we ask the character "Who is Michael Jordan?". The avatar 'answers' the question after a short delay. Using the previous messages as context, we can have entire conversations. Notice the hair physics and blinking!
A showcase of remote events triggered from a web browser. After selecting a VFX, the respective particle system is played. A popular use is a firework particle effect when someone donates $5 on Twitch, etc. During the rain VFX you might even notice a splash and bounce when a droplet interacts with the character (top of the hair).
The core functionality is a custom 3D model that 'speaks'. It emits voice and uses Oculus' lip sync library to give a (hopefully convincing) impression. Here is a feature set:
The flow does not rely on any particular implementation. Feel free to mix and match LLMs, TTSs, or any suitable 3D models (requires specific shape keys). As you might notice, this architecture gives us ultimate flexibility. As you might imagine, the previous sentence is an understatement.
There is no speech recognition; the prompt is text-only. It would be trivial to add this feature using Whisper Fast. See below for instructions. TL;DR: send a GET or POST request to the /prompt endpoint.
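As a rough illustration of the TL;DR above, here is a minimal Python sketch that builds a GET or POST request for the /prompt endpoint. The server address (`http://localhost:8080`) and the field name (`value`) are assumptions made for this example, not something confirmed here; check INSTALL_AND_USAGE.md for the real ones.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the address and the "value" field name are illustrative only.
BASE_URL = "http://localhost:8080"


def build_prompt_request(text, method="GET", base_url=BASE_URL):
    """Build (but do not send) a request to the /prompt endpoint."""
    if method == "GET":
        # Pass the prompt as a URL-encoded query parameter.
        query = urllib.parse.urlencode({"value": text})
        return urllib.request.Request(f"{base_url}/prompt?{query}", method="GET")
    # Otherwise send the prompt as a JSON body.
    body = json.dumps({"value": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send_prompt(text, method="GET", base_url=BASE_URL, timeout=30):
    """Send the prompt and return the HTTP status code."""
    request = build_prompt_request(text, method, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (requires the server to be running):
# send_prompt("Who is Michael Jordan?")
```

Keeping request construction separate from sending makes it easy to inspect the URL or body before touching the network.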
Using TTS with streaming and DeepSpeed, I usually get a <4s response (from sending the prompt to the first sound). That's short enough that the conversation feels real-time. At this point, the bottleneck is the LLM. On a single GPU you can't run the LLM and TTS at the same time (I have tried, check the FAQ about the tts.chunk_size config option). We have to first generate all text tokens and only then generate sound. I've tried offloading TTS to the CPU, but this also struggles.
Streaming means that we split the generated text into smaller chunks. There is a small crossfade to mask chunk transitions. A small first chunk means a fast time-to-first-sound. DeepSpeed is a Microsoft library that speeds up GPT inference. Both streaming and DeepSpeed are optional but recommended.
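To make the streaming idea more concrete, here is a small Python sketch of the two pieces described above: splitting text so the first chunk is short, and linearly crossfading two audio buffers. It is not the app's actual implementation. The function names, the word-based splitting, and the parameters (`first_chunk_words`, `chunk_words`, `overlap`) are assumptions for illustration only.

```python
def split_into_chunks(text, first_chunk_words=4, chunk_words=12):
    """Split text into word chunks; the first chunk is deliberately small
    so that the first sound can be produced quickly."""
    words = text.split()
    chunks = []
    size = first_chunk_words
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + size]))
        i += size
        size = chunk_words  # every chunk after the first uses the larger size
    return chunks


def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending the last `overlap` samples
    of `a` with the first `overlap` samples of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)
    head = list(a[:len(a) - overlap])
    blended = []
    for k in range(overlap):
        t = (k + 1) / (overlap + 1)  # weight of `b` rises from 0 toward 1
        blended.append(a[len(a) - overlap + k] * (1 - t) + b[k] * t)
    return head + blended + list(b[overlap:])


# Usage: each text chunk would be sent to the TTS separately, and the
# resulting audio buffers joined with crossfade() before playback.
chunks = split_into_chunks("Michael Jordan is a former basketball player.")
```

A real system would likely prefer to break chunks at sentence or clause boundaries, since TTS prosody suffers when a phrase is cut mid-way.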
The first question after the server starts always takes the longest (~10s) as the server has to load the AI models. When used in the Unity editor, you will rarely have a garbage collection pause (kinda noticeable with audio). But I would be surprised if you actually got a GC issue in the production build.
I've got to say, I'm amused. I expected some problems when using the same GPU for both Unity rendering and the AI. I knew that an Android/iOS app was an easy fallback to offload the Unity cost to a separate device. But it's not necessary on my hardware. It's kind of unexpected that it works smoothly. Ain't gonna complain. I also limited Unity to 30 FPS (just in case).
If you go to the control panel you will see the timings for each response stage. For Unity, use the built-in profiler.
See INSTALL_AND_USAGE.md. It also includes instructions on how to use/expand current features.
The questions below are about the general philosophy of this app. For a more usage-oriented FAQ, see INSTALL_AND_USAGE.md.
This app shows we already have the technology to render a detailed 3D avatar and run a few neural nets on a single consumer-grade GPU in real-time. It is customizable and does not need an internet connection. It can also work in a client-server architecture, to facilitate e.g. rendering on mobile devices.
I could have used the standard Sintel model. I've created my own character because, well, I can. From dragging the vertices, painting the textures, animating the mouth, and adjusting hair physics to a 'talking' 3D avatar. Quite an enjoyable pastime if I do say so myself.
I've also wanted to test texture reprojection from a stable diffusion-generated image. E.g. you can add 'bald' to the positive prompt and 'hair' to the negative. It does speed up workflow a lot. Alas, reprojection will have specular highlights, etc. to remove manually.
I've used Sintel as a base mesh as it already has basic shape keys. Especially to control each part of the mouth - just add Blender 4.0-compatible drivers. This made it trivial to create viseme shape keys. I've already used Sintel's model many times in the past, so it was a no-brainer for this project.
PS. I hate rigging.
You've probably seen 'talking' real-time stable diffusion-generated virtual characters. It is a static image with the mouth area regenerated on every frame based on sound. You will notice that it's unstable. If you diffuse teeth every frame, they will shift around constantly. I've used stable diffusion a lot. I've seen my share of mangled body parts (hands!). It's... noticeable with teeth. A popular implementation is SadTalker. It even has a Stable Diffusion web UI extension.
Instead, my app uses boring old technology that has been in video games for years. If you have hundreds of hours of dialogue (Baldur's Gate 3, Cyberpunk 2077, etc.), you can't animate everything by hand. Systems like JALI are used in every major title.
If you want real-time animated characters why use solely AI? Why not look for solutions used by the largest entertainment sector in the world? At the very least you could use it as a base for img2img. In recent years we also had VTubers, which push the envelope each day. A lot of this stuff is based on tech developed by Hatsune Miku fans.
Neuro-sama is a popular virtual streamer. It's an AI-driven character that plays video games and talks with its creator, Vedal. Here is how my app stacks up against it:
This app includes source code/assets created by other people. Each such instance has a dedicated README.md in its subfolder that explains the licensing. E.g. I've committed to this repo source code for the "Oculus Lipsync" library, which has its own license (accept it before use!). XTTS v2.0 is also only for non-commercial use. The paragraphs below only affect things created by me.
It's GPLv3. It's one of the copyleft licenses. GPL/copyleft licenses should be familiar to most programmers from Blender or the Linux kernel. It's quite extreme, but it's dictated by the nature of the app. And, particularly, one of the possible uses.
Recently I've watched "Apple's $3500 Nightmare" by Eddy Burback. It's a review of the $3500 (!) Apple Vision Pro. One of the presented apps allows the user to date an AI "girlfriend". The interface has a stable diffusion-generated image on the left (I smell PastelDiffusedMix with Seraphine LoRA?). Text chat on the right. Is that the state of the art for this kind of software? It's lazy.
Ofc. the mobile dating apps were filled with controversies from the get-go. Tinder and Co. do not want to lose repeat customers. Scams galore before we even get to machine learning. There are millions of AI profiles on Tinder. And with straight-up AI dating it's a whole other issue.
You can use any model you like. Lip sync uses shape keys that correspond to ovrlipsync's visemes. With the "Enemies" tech demo, Unity has proven that it can render realistic humans.
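If you bring your own model, a quick sanity check is to confirm it has a shape key for each of the 15 visemes that Oculus Lipsync outputs. The sketch below is a minimal, hedged example: the viseme names follow the Oculus Lipsync documentation, but the `viseme_` prefix and the `missing_visemes` helper are hypothetical and should be adapted to whatever naming your model and the Unity scripts expect.

```python
# The 15 visemes produced by Oculus Lipsync.
OVR_VISEMES = [
    "sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
    "nn", "RR", "aa", "E", "ih", "oh", "ou",
]


def missing_visemes(shape_key_names, prefix="viseme_"):
    """Return the visemes that have no matching shape key.

    Assumption: shape keys are named `<prefix><viseme>`, compared
    case-insensitively. Adjust to your model's naming convention.
    """
    present = {name.lower() for name in shape_key_names}
    return [v for v in OVR_VISEMES if (prefix + v).lower() not in present]


# Usage: pass the shape key names exported from your 3D model.
model_keys = ["viseme_sil", "viseme_PP", "viseme_aa"]
print(missing_visemes(model_keys))
```

An empty result means every viseme has a shape key to drive. Anything listed would leave the mouth static for those sounds.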
Personally, I would use Unreal Engine's MetaHuman. You would have to rewrite my Unity code. For this effort, you get a state-of-the-art rig and a free high-fidelity asset. You could also try to import a MetaHuman into Unity.
For some reason, Unity does not have a built-in pipeline for human characters. Even when creating the "Enemies" cinematic linked above, they did not bother to make it community-viable. It's a custom set of tools tailored to Autodesk Maya. And I've never heard about the 4D clip file format. Congratulations to the project lead! It's a baffling decision. E.g. they have their HairFX for hair rendering and simulation. It's based on TressFX. I've ported TressFX to OpenGL, WebGL, and Vulkan. I understand it quite well. And yet this app uses hair cards! The original Sintel has spline-based hair, so this should have been a simple export operation. These systems need proper documentation.
At the end of the day, the tool is just a tool. I wish Unity got their priorities in order. I'd say rendering people is quite important in today's market.
Yes, but make sure you understand why you want to use a 3D engine for a 2D rendering technique. For Guilty Gear Xrd, the authors had to tweak normals on a per-frame basis. Even today, 3D is frowned upon by anime fans. The only exception (as far as I know) is Land of the Lustrous. And this is helped by its amazing shot composition.
Looking at Western real-time animation we have e.g. Borderlands. It replicates the comic book style using flat lighting, muted colors, and thick ink lines. There are tons of tutorials on YouTube for flat shading, but you won't get a close result without being good at painting textures.
While this might sound discouraging, I want you to consider your goal. There is a reason why everyone else is using VTubeStudio and Live2D. There is no comparison in complexity between creating models for 2D and for 3D. It's not even the same art form.
Disregard everything I said above if you work for Riot Games, Fortiche, Disney/Pixar, DreamWorks, or Sony Pictures Animation.
Unity installation size is smaller. It is aimed at hobbyists. You can just write a C# script and drop it onto an object to add new behavior. While the UX can be all over the place, it's frictionless in core aspects.
Unity beats UE5 on ease of use and iteration time. The main reason to switch to UE5 would be a MetaHuman (!), virtual production, or industry-standard mocap.
Depends on the LLM model. The default gemma:2b-instruct is tiny (3 billion parameters). It can create coherent sentences, but that's about as far as it goes. If you can use a state-of-the-art 7B model (even with quantization), or something bigger, go for it. You can always swap it for ChatGPT too. Or use a multi-GPU setup. Or run Unity on a mobile phone, TTS on a Raspberry Pi, and have the full VRAM for the LLM.
I've not added this. It would require special cases added to the 3D model. E.g. it might be hard to animate the mouth during the lip sync. Blushing with 3D avatars is usually done by blending a special texture in a shader graph.
Yet the basic tech is already there. If you want to detect emotions in text, you can use LLM for sentiment analysis. I've also added the tech to trigger the events using WebSocket. ATM it's starting a particle effect. Half of the C# code deals with triggering shape keys. Blinking is a function called every few seconds. Once you create an interaction on the 3D model, you can start it at any time. It's just time-consuming to create.
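The event mechanism described above can be pictured as a small dispatcher: a message arrives over the WebSocket, and a registered handler starts the matching interaction. The Python sketch below only illustrates that idea. The real code is C# inside Unity, and the message shape (`{"type": ..., "vfx": ...}`), the `AvatarEvents` class, and the blink interval range are all hypothetical.

```python
import json
import random


class AvatarEvents:
    """Maps incoming event messages to handlers (illustrative only)."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type, handler):
        """Register a handler for a given event type."""
        self._handlers[event_type] = handler

    def dispatch(self, raw_message):
        """Parse a JSON message and call its handler.

        Returns False when no handler is registered for the type."""
        message = json.loads(raw_message)
        handler = self._handlers.get(message.get("type"))
        if handler is None:
            return False
        handler(message)
        return True


def next_blink_delay(rng, min_s=2.0, max_s=6.0):
    """Pick a random delay so that blinking happens 'every few seconds'."""
    return rng.uniform(min_s, max_s)


# Usage: register a particle effect trigger and feed it a message,
# e.g. one received from the WebSocket connection.
played = []
events = AvatarEvents()
events.on("play-vfx", lambda msg: played.append(msg["vfx"]))
events.dispatch('{"type": "play-vfx", "vfx": "fireworks"}')
```

An emotion detected by LLM sentiment analysis could be sent through the same path as one more event type, once the matching animation exists on the model.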
Yes, I tried (not added to this repo). The original plan was to style transfer the rendered frame to a stable diffusion-generated image. From my quick experiments, besides performance problems, the simplest solutions do not have the necessary quality or temporal stability.
We do not have the performance budget to run VGG16/19. This excludes the 'original' techniques like "A Neural Algorithm of Artistic Style" [Gatys2015] or "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" [Johnson2016]. Neither of them looked at flickering either. They were designed only for static images and not videos. Later works looked into that problem: [Jamriska2019], [Texler2020].
I know Unity also tried real-time style transfer in 2020: "Real-time style transfer in Unity using deep neural networks".
Afterward, we are in transformers territory (surprise!). Last year, "Data AugmenTation with diffUsion Models (DATUM)" [CVPR-W 2023] used diffusion (again, surprise!). There is a paperswithcode category called Synthetic-to-Real Translation if you want to track state of the art.
At this point, I decided that trying to fit this into the app would be feature creep.
There was a Two Minute Papers episode that looked into similar techniques: "Intel's Video Game Looks Like Reality!". Based on Intel's "Enhancing Photorealism Enhancement" [Richter2021].
Yes, check the .fbx file inside unity-project/Assets/Sintel.
All my projects have utilitarian names. This time, I wanted something more distinct. Iris is a purple-blue flower. Iris is a part of the eye. Seemed fitting? Especially since eyes and hair are the problems in CG characters.