Google's Gemini AI: Bringing 'Her' & 'JARVIS' To Life?
Multimodal AI models that can hear, see and create anything.
🪄 Creative Technology Digest
Distilled insights on AI, AR/VR, sensing, and robotics, and their societal impacts.
📺Prefer watching on YouTube instead of reading? Click here.
🔗New YouTube video on Google Gemini & AI Assistants
🍇 Today’s Juicy Topic: Multimodal AI models that can hear, see and create anything. A step closer to Jarvis?
Google’s Gemini demo video attracted millions of views, showcasing real-time interactions across diverse tasks. It was a glimpse into the future of AI, and it was widely lauded as a historic advance in computing.
Then, a day later, the narrative flipped completely. It turns out the demo wasn’t as live as it initially appeared; instead, it used still frames and careful text prompting. While the responses were authentic Gemini outputs, the presentation clearly took some creative license. Google defended the video, stating it was meant to inspire developers and illustrate potential user experiences with Gemini. This raises pointed questions about Google’s position in the AI race relative to OpenAI and Microsoft.
CNBC’s coverage beautifully showcases the flip-flop narrative.
📺Google's Gemini AI: Bringing 'Her' & 'JARVIS' To Life?
So is Google back in the game or is this all hype? Google recently unveiled Gemini, a multimodal AI model — meaning it can process text, images, video, and audio seamlessly. Models like this could enable a real-world Jarvis-style assistant that helps with tutorials, content creation, redecorating, and more.
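If you want to poke at the text-and-image side of this today, the google-generativeai Python SDK exposes it directly. Here’s a minimal sketch, assuming you have an API key; the photo path and prompt are placeholders, and note that audio/video inputs were demo-only at launch:

```python
# Minimal sketch: querying Gemini with mixed image + text input via the
# google-generativeai SDK. The API key and image path are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# "gemini-pro-vision" is the multimodal (image + text -> text) variant
# available through the API at launch.
model = genai.GenerativeModel("gemini-pro-vision")

image = Image.open("desk_photo.jpg")  # hypothetical local photo
response = model.generate_content(
    [image, "What is on this desk, and how could I redecorate it?"]
)
print(response.text)
```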
In this 6-minute video, I discuss Gemini's capabilities compared to models like GPT-4, the vast data Google is leveraging from Search, Maps, and YouTube, and why multimodality represents the next frontier in the evolving quest for AGI.
Topics covered (~6 mins at 1x speed):
00:00 Introduction
00:31 Anything-to-Anything
01:42 Google's Insane Data Moat
02:42 Why Multimodality Matters
04:46 Gemini's Three Variants
06:03 Is Google back in the game?
🌶️ My Hot Take:
GPT-4 parity is table stakes: Google needs to show it’s at least on par with, if not better than, GPT-4. And right now, it seems like it is. On the no-cost end of the spectrum, Gemini Pro (which is live in Bard today) is arguably better than the free tier of ChatGPT. However, it’s the beefier Ultra model we all really want to get our hands on.
A controlled demo like this is cool, but imagine pointing Google Gemini at the world! 🤯
Gemini 1.0 hints at the missing pieces required to realize the sci-fi future of computing depicted in movies like Her.
TL;DR on multimodal AI and Google's insane data moat 🧵
Multimodality is the next frontier: Regardless of how things go, what is clear is that multimodal AI is where we’ll see the next wave of innovations, and Gemini is all-in on this paradigm “from the ground up.” The alpha that can be extracted from text is reaching its limit. Just think about it: most open and closed source models are built on this paradigm. 2024 will be an exciting time for approaches beyond text, e.g. large vision models and next-frame prediction instead of next-token prediction (a toy sketch contrasting the two objectives follows the tweet below). It is then unsurprising that companies like Meta are building one heck of a dataset for training such models.
Meta has been building an absolutely wild dataset consisting entirely of first-person or “egocentric” views 🤯
Not just imagery but also time-synced audio, IMU, eye gaze, head poses & 3D point clouds of the environment.
Perfect for training an AI Jarvis?
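To make “next frame instead of next token” concrete, here is a toy PyTorch sketch. It is purely illustrative, not any real model’s architecture: both objectives predict element t+1 from element t, and the only difference is whether that element is a discrete word id or a continuous image.

```python
# Toy contrast between next-token and next-frame prediction. The modules are
# stand-ins (a real model would be a transformer); only the shapes matter here.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Next-token prediction: sequence of discrete ids -> logits over a vocabulary.
vocab_size, d_model = 32000, 512
token_model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
tokens = torch.randint(0, vocab_size, (1, 16))            # (batch, seq_len)
logits = token_model(tokens)                              # (1, 16, vocab_size)
token_loss = F.cross_entropy(                             # predict token t+1 from token t
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)

# Next-frame prediction: same recipe, but the "token" is an entire image,
# so the target is continuous pixels instead of a discrete vocabulary id.
frames = torch.randn(1, 16, 3, 64, 64)                    # (batch, time, C, H, W)
frame_model = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in video model
pred = frame_model(frames[:, :-1].flatten(0, 1))          # predict frame t+1 from frame t
frame_loss = F.mse_loss(pred, frames[:, 1:].flatten(0, 1))
print(token_loss.item(), frame_loss.item())
```

Real video models typically swap these stand-ins for large transformers and predict in a learned latent space rather than raw pixels, but the training signal has the same shape.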
🎥 Creation Corner:
1. Real-time 3D Gaussian splatting: With just a single camera, you can now achieve live, high-fidelity scene reconstruction at real-time frame rates.
Holy crap! Real-time 3D Gaussian splatting 🤯
People always ask me how I get such clean 3D scans. Well, this type of incremental approach gives the user real-time feedback on how “good” their capture is.
It’s a SLAM dunk if you ask me!
2. MagicAnimate: ByteDance, the powerhouse behind TikTok, just open-sourced MagicAnimate, a framework that animates human images with a diffusion model driven by a motion sequence. Keep an eye out: we might just be grooving with a dance filter on TikTok in the near future! 🕺
How soon until this is a dance filter on TikTok?
3. Apple AI Avatar: Human Gaussian splats for creating animatable avatars from short 2D videos. It will be interesting to see how this research finds its way into FaceTime avatars and 3D videos for the Apple Vision Pro headset.
Apple AI avatar research that takes your short 2D video clip and pulls out an editable 3D avatar and environment.
The modeling of hair, clothing, etc. is immaculate and looks far more realistic than rigging a 3D photo scan.
4. Real-time AI @ 30 fps: Real-time AI is no longer a dream. I wrote in March that real-time AI would transform 3D and AR forever. It seems like it’s happening much sooner than we all expected! Here’s anyone-to-anyone in real time (a rough sketch of how frame-by-frame restyling like this works follows the clips below):
Ok yeah, real-time AI @ 30 fps is nuts! I can go from being me to @iamjamiefoxx to @chamath in a flash 😎
Who needs bougie sweaters when you've got AI? Soon we'll be able to turn anything into anything.
Made with @fal_ai_data and a sprinkle of MacOS AR effects. Can't wait till… twitter.com/i/web/status/1…
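The clip above was built with fal.ai’s pipeline; as a rough illustration of how this kind of per-frame restyling gets fast enough for ~30 fps, here is a sketch using a distilled few-step model (stabilityai/sd-turbo) through Hugging Face diffusers. This is a stand-in, not the exact setup used in the demo, and the frame path and prompt are placeholders:

```python
# Sketch: restyling a video stream frame-by-frame with a distilled few-step
# diffusion model. Not the fal.ai pipeline used in the demo above; sd-turbo
# via Hugging Face diffusers is a stand-in that shows the general recipe.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

frame = load_image("webcam_frame.png").resize((512, 512))  # placeholder frame

# Few-step sampling is what makes per-frame latency low enough for real time.
# For img2img with sd-turbo, num_inference_steps * strength must be >= 1.
styled = pipe(
    prompt="portrait of a person in a cozy holiday sweater",
    image=frame,
    num_inference_steps=2,
    strength=0.5,
    guidance_scale=0.0,
).images[0]
styled.save("styled_frame.png")
```

In a live demo, this call runs in a loop over camera frames; the distillation to one or two denoising steps is the key design choice that trades a little fidelity for interactive latency.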
💌 Stay in Touch:
That’s all for this edition of Creative Tech Digest. One more left to bring in the new year, and after that I’ll see y’all in 2024.
my mom made this in chatgpt and it just warms my heart 🥹
she's the type of person who clicks edit > copy / paste every time, instead of using shortcuts.
but she sure as heck can describe stuff in text and iterate. i love this so much more than getting a generic greeting!
Got feedback, questions, or just want to chat? Reply to this email or catch me on social media.
Bilawal Sidhu