Modern text-to-speech voices can convey social cues that are ideal for narrating multimedia learning materials. Amazon Alexa is unique among modern text-to-speech vocalizers in that she can infuse enthusiasm cues into her synthetic voice. In this first study examining the effects of modern text-to-speech voice enthusiasm in a multimedia learning environment, a between-subjects online experiment was conducted in which learners from a large Asian university (n = 244) listened to one of four Alexa voices narrating a multimedia lesson on distributed denial-of-service attacks: (1) neutral, (2) low-enthusiastic, (3) medium-enthusiastic, or (4) high-enthusiastic. While Alexa's enthusiastic voices did not enhance persona ratings compared to Alexa's neutral voice, learners inferred more enthusiasm from Alexa's medium- and high-enthusiastic voices than from her neutral voice. Regarding cognitive load, Alexa's low- and high-enthusiastic voices decreased intrinsic and extraneous cognitive load ratings compared to Alexa's neutral voice. While Alexa's enthusiastic voices did not affect affective-motivational ratings differently from Alexa's neutral voice, learners reported a significant increase in positive emotions over their baseline after listening to Alexa's medium-enthusiastic voice. Finally, Alexa's enthusiastic voices did not enhance learning performance on immediate retention and transfer tests compared to Alexa's neutral voice. This study demonstrates that enthusiasm in a modern text-to-speech voice can positively affect learners' emotions and cognitive load during multimedia learning. Theoretical and practical implications are discussed through the lens of the Cognitive Affective Model of E-learning, Integrated-Cognitive Affective Model of Learning with Multimedia, and Cognitive Load Theory. We further outline this study's limitations and offer recommendations for extending and widening research on text-to-speech voice emotions.