Multimodal AI Essentials: Merging Text, Image, and Audio for Next-Generation AI Applications
.MP4, AVC, 1280x720, 30 fps | English, AAC, 2 Ch | 5h 33m | 1.08 GB
Instructor: Sinan Ozdemir
This course shows you how combining modalities such as text, audio, video, and images enables AI systems to achieve remarkable capabilities. Gain hands-on experience building visual question-answering models, generating personalized images with diffusion models, designing end-to-end multimodal applications, and even fine-tuning multimodal models for specific tasks. By the end, you will have the tools, knowledge, and confidence to design and deploy your own state-of-the-art multimodal AI systems.
Learning objectives
- Apply multimodal AI concepts.
- Build a voice-to-voice app.
- Apply visual question answering (VQA) concepts and architectures (a minimal sketch follows this list).
- Construct, fine-tune, and evaluate diffusion models with DreamBooth.
- Fine-tune a text-to-speech model with SpeechT5 (a second sketch follows this list).
- Build visual agents from the ground up.
- Evaluate the performance of multimodal models.
- Extend multimodal systems with advanced techniques like computer use.
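
To give a flavor of the VQA objective above, here is a minimal inference sketch using the publicly available ViLT checkpoint on the Hugging Face Hub (dandelin/vilt-b32-finetuned-vqa). The model choice and example image URL are illustrative assumptions, not the specific stack used in the course:

```python
# Minimal VQA inference sketch (illustrative; not the exact course code).
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Any RGB image works; this COCO image URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are in the picture?"

# ViLT fine-tuned on VQAv2, available on the Hugging Face Hub.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode the image-question pair and pick the highest-scoring answer label.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print("Answer:", model.config.id2label[logits.argmax(-1).item()])
```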
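
Similarly, the SpeechT5 objective can be previewed with the stock microsoft/speecht5_tts checkpoint and its HiFi-GAN vocoder. The speaker-embedding dataset, sample index, and output filename below are assumptions for illustration; the course covers fine-tuning on your own data:

```python
# Minimal SpeechT5 text-to-speech sketch (illustrative; fine-tuning is covered in the course).
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# A pre-computed x-vector stands in for a learned speaker embedding.
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

# Convert text to a waveform and save it as 16 kHz audio.
inputs = processor(text="Multimodal AI combines text, images, and audio.", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("tts_example.wav", speech.numpy(), samplerate=16000)
```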