Eyes and Ears

Book 3

When Machines Learn to See and Hear

What does it mean for a machine to “understand” an image? How can it generate a voice — or a video?

This volume is about the leap to multimodal AI: models that connect text, images, audio (and beyond), and then use those representations to create. The guiding theme remains the same: turning apparent magic into a mechanism you can reason about.

What you’ll learn

Vision: from pixels to representations
Audio: recognition, synthesis, and meaning
Generative models: creating images/audio/video (and why it works)
Multimodal alignment: connecting vision, audio, and language
Accessibility: AI as a sensory bridge

In Development

This book is being written right now. If you want updates (chapters, demos, dates), email me.
