Thomas Huang - avp_ai

Gen AI & ML

Multimodal AI Companion (2024)

Building on my last experiment with conversational AI on Apple Vision Pro (https://lnkd.in/gF6BBENE), I've now enhanced the prototype to include multimodal inputs. Watch the examples in this video.

The latest iteration integrates speech-to-text (STT), voice activity detection (VAD), real-time camera image capture ("see what I see"), and text-to-speech (TTS). Combined, these techniques bridge human interaction and AI functionality, making the experience more intuitive and natural. Consulting a virtual interior designer, asking for dinner recipe suggestions, translating foreign languages in real time, or getting instant tips all become more accessible through interactions that understand both your voice and your visual context.
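
For readers curious how these pieces might fit together, here is a minimal Python sketch of one possible turn loop: VAD segments the audio into utterances, STT transcribes each one, a camera frame is captured, a multimodal model answers, and TTS speaks the reply. This is not the prototype's actual code. The energy-threshold VAD, the `Companion` class, and the four callables (`transcribe`, `capture_image`, `answer`, `speak`) are all hypothetical stand-ins for whatever services a real build would use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]  # one short chunk of audio samples in [-1.0, 1.0]


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of one audio frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def split_utterances(frames: Iterable[Frame],
                     threshold: float = 0.01,
                     hangover: int = 2) -> List[List[Frame]]:
    """Very simple energy-based VAD (an assumption, not the post's method).

    A frame counts as speech when its energy exceeds `threshold`; an
    utterance ends after `hangover` consecutive non-speech frames.
    """
    utterances: List[List[Frame]] = []
    current: List[Frame] = []
    silence = 0
    for frame in frames:
        if frame_energy(frame) > threshold:
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence >= hangover:
                utterances.append(current)
                current, silence = [], 0
    if current:  # flush speech still open when the audio ends
        utterances.append(current)
    return utterances


@dataclass
class Companion:
    """Wires the stages together; every callable is a hypothetical plug-in."""
    transcribe: Callable[[List[Frame]], str]   # STT
    capture_image: Callable[[], bytes]         # "see what I see" camera frame
    answer: Callable[[str, bytes], str]        # multimodal model
    speak: Callable[[str], None]               # TTS

    def run(self, frames: Iterable[Frame]) -> List[str]:
        replies = []
        for utterance in split_utterances(frames):
            text = self.transcribe(utterance)
            image = self.capture_image()       # grab visual context per turn
            reply = self.answer(text, image)
            self.speak(reply)
            replies.append(reply)
        return replies
```

Passing the stages in as callables keeps the loop independent of any particular speech, vision, or model provider, so each one can be swapped or stubbed out for testing.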

This advancement underscores AI's power to create new possibilities for simplifying our daily routines and making life more delightful.

Where else do you see this technology having a transformative impact on our lives?
