Microsoft introduces MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2, three models specialized in speech transcription, voice synthesis and image generation, now available in Microsoft Foundry.
Microsoft has made three new proprietary artificial intelligence models available to developers, grouped under the MAI brand. These are MAI-Transcribe-1, for speech-to-text transcription; MAI-Voice-1, for audio generation; and MAI-Image-2, for text-to-image creation. All three are available in Microsoft Foundry, the company's AI application development platform, and already power its own products such as Copilot, Bing and PowerPoint.
MAI-Transcribe-1 is Microsoft's speech transcription model. It supports the 25 languages most used in the company's products and is designed for real-world audio conditions: background noise, low-quality recordings and overlapping speech. On FLEURS, an industry-standard benchmark, it outperforms Whisper-large-v3 and GPT-Transcribe from OpenAI, Scribe v2 from ElevenLabs and Gemini 3.1 Flash-Lite from Google across all 25 languages, at approximately 50% lower GPU cost than comparable alternatives. Microsoft already uses it in Copilot's voice mode and for transcriptions in Microsoft Teams. Pricing in Foundry starts at $0.36 per hour of audio.
MAI-Voice-1 handles synthetic voice generation from text. The model can produce a full minute of audio in under one second on a single graphics processing unit. It includes the ability to clone a voice from an audio sample of just ten seconds, though this feature is subject to an approval process in line with Microsoft's responsible AI policies. The company already uses this model to power expressive voice features in Copilot. Access starts at $22 per million characters.
MAI-Image-2 is the text-to-image generation model. At launch, it ranked among the top three models in its category on the Arena.ai leaderboard. Microsoft states that, based on real production traffic data, it generates images at least twice as fast as its predecessor at similar quality. Advertising and communications group WPP is already using it for creative production workflows at scale. Pricing starts at $5 per million text input tokens and $33 per million image output tokens.
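To put the three starting prices side by side, the arithmetic can be sketched in a few lines of Python. This is only an illustration based on the figures quoted above: the function and constant names are hypothetical, it does not call any Microsoft Foundry API, and actual billing may involve tiers, rounding or minimums not described here.

```python
# Starting prices quoted in the article, in US dollars.
# Assumption: cost scales linearly with usage at these rates.
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of input characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text input and image output tokens."""
    input_part = input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
    output_part = output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    return input_part + output_part


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to illustrate the calculation.
    print(f"100 hours of audio: ${transcription_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"1M input + 1M output image tokens: ${image_cost(1_000_000, 1_000_000):.2f}")
```

At these rates, 100 hours of transcription would come to about $36, and half a million characters of synthesized speech to about $11.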
All three models underwent internal evaluation and security testing before release, and are deployed with the governance and compliance controls built into Microsoft Foundry.