Three new Microsoft MAI models for transcription, voice and image generation

02/04/2026

Microsoft introduces MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2, three models specialized in speech transcription, voice synthesis and image generation, now available in Microsoft Foundry.

Microsoft has made three new proprietary artificial intelligence models available to developers, grouped under the MAI brand. These are MAI-Transcribe-1, for speech-to-text transcription; MAI-Voice-1, for audio generation; and MAI-Image-2, for text-to-image creation. All three are available in Microsoft Foundry, the company's AI application development platform, and already power its own products such as Copilot, Bing and PowerPoint.

MAI-Transcribe-1 is Microsoft's speech transcription model. It supports the 25 languages most widely used in the company's products and is designed for real-world audio conditions: background noise, low-quality recordings and overlapping speech. On the FLEURS benchmark, an industry standard, Microsoft reports that it outperforms Whisper-large-v3 and GPT-Transcribe from OpenAI, Scribe v2 from ElevenLabs and Gemini 3.1 Flash-Lite from Google across all 25 languages, at approximately 50% lower GPU cost than comparable alternatives. Microsoft already uses it in Copilot's voice mode and in Microsoft Teams transcriptions. Pricing in Foundry starts at $0.36 per hour.

MAI-Voice-1 handles synthetic voice generation from text. The model can produce a full minute of audio in under one second on a single graphics processing unit. It includes the ability to clone a voice from an audio sample of just ten seconds, though this feature is subject to an approval process in line with Microsoft's responsible AI policies. The company already uses this model to power expressive voice features in Copilot. Access starts at $22 per million characters.

MAI-Image-2 is the text-to-image generation model. At launch, it ranked among the top three models in its category on the Arena.ai leaderboard. Microsoft states it delivers generation times at least twice as fast as its previous version at similar quality, based on real production traffic data. Advertising and communications group WPP is already using it for creative production workflows at scale. Pricing starts at $5 per million text input tokens and $33 per million image output tokens.

All three models have undergone internal evaluation and security testing prior to release, and are deployed with the governance and compliance controls built into Microsoft Foundry.
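
To put the starting prices quoted above in perspective, the following is a minimal Python sketch of a cost estimator. It is an illustration only, not an official pricing tool: the function and constant names are hypothetical, the transcription price is assumed to apply per hour of audio processed, and actual Foundry billing may differ by tier, region or usage.

```python
# Rough cost estimator based on the starting prices quoted in the article (USD).
# All names here are hypothetical; real Foundry billing may differ.

TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36        # MAI-Transcribe-1 (assumed: per hour of audio)
VOICE_USD_PER_MILLION_CHARS = 22.0          # MAI-Voice-1
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0    # MAI-Image-2, text input
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0  # MAI-Image-2, image output


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesizing speech from the given number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of image generation from input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Example workloads chosen arbitrarily for illustration.
    print(f"100 hours of audio transcribed: ${transcription_cost(100):.2f}")
    print(f"500,000 characters synthesized: ${voice_cost(500_000):.2f}")
    print(f"1M input + 1M output image tokens: ${image_cost(1_000_000, 1_000_000):.2f}")
```

The article does not say how many output tokens a single image consumes, so the per-image cost cannot be derived from these figures alone.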

Key points

  • Microsoft releases three proprietary AI models (MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2), available in Microsoft Foundry.
  • MAI-Transcribe-1 outperforms transcription models from OpenAI, ElevenLabs and Google across 25 languages, according to the FLEURS benchmark.
  • It is designed for difficult audio conditions: background noise, low quality and overlapping voices.
  • MAI-Voice-1 generates one minute of audio in under a second and can clone voices from ten-second samples.
  • MAI-Image-2 is at least twice as fast as its previous version, according to Microsoft, and ranked among the top three on the Arena.ai leaderboard at launch.
  • The models already power Microsoft products including Copilot, Bing, PowerPoint and Microsoft Teams.
