Eyes and Ears
Book 3
When Machines Learn to See and Hear
What does it mean for a machine to “understand” an image? How can it generate a voice, or even a video?
This volume is about the leap to multimodal AI: models that connect text, images, audio (and beyond), and then use those representations to create. The guiding theme remains the same: turning apparent magic into a mechanism you can reason about.
What you’ll learn
- Vision: from pixels to representations
- Audio: recognition, synthesis, and meaning
- Generative models: creating images, audio, and video (and why it works)
- Multimodal alignment: connecting vision, audio, and language
- Accessibility: AI as a sensory bridge
In Development
This book is being written right now. If you want updates (chapters, demos, dates), email me.