If you listen to it, you will notice that the sound being played is not pre-recorded audio, but speech synthesized from the text of the page.
Put on your headphones first, then copy the following code into the Chrome console to try it for yourself ~
let msg = new SpeechSynthesisUtterance('Welcome to read my blog');
window.speechSynthesis.speak(msg);
Look, it's not difficult to implement speech synthesis on the front end.
Today's protagonist: the Speech Synthesis API
From the example above, we can guess what the two interfaces we just called do:
• SpeechSynthesisUtterance
• window.speechSynthesis.speak
Of course, speech synthesis involves more than just these two APIs, but let's start with these two.
SpeechSynthesisUtterance
Reference: developer.mozilla.org/en-US/docs/…
A SpeechSynthesisUtterance object contains the content to be read by the speech service, along with parameters such as the language, pitch, and volume:
• SpeechSynthesisUtterance() (constructor)
• SpeechSynthesisUtterance.text
• SpeechSynthesisUtterance.lang
• SpeechSynthesisUtterance.pitch
• SpeechSynthesisUtterance.rate
• SpeechSynthesisUtterance.voice
• SpeechSynthesisUtterance.volume
Note: all of the properties above are readable and writable! You can copy the code below and give it a try; the comments explain what each one does.
let msg = new SpeechSynthesisUtterance();
msg.text = 'how are you'; // text to be synthesized
msg.lang = 'en-US';       // American English pronunciation (selected automatically by default)
msg.rate = 2;             // double speed (default 1, range 0.1 to 10)
msg.pitch = 2;            // higher pitch; the larger the value, the sharper it sounds (default 1, range 0 to 2)
msg.volume = 0.5;         // half volume (default 1, range 0 to 1)
window.speechSynthesis.speak(msg);
This object also emits a series of events that can come in handy:
• start: fired when the utterance starts being spoken
• end: fired when the utterance has finished being spoken
• error: fired when an error prevents the utterance from being spoken
• pause: fired when the utterance is paused partway through
• resume: fired when a paused utterance resumes being spoken
• mark: fired when the spoken utterance reaches a named SSML mark
• boundary: fired when the spoken utterance reaches a word or sentence boundary
With the help of these events, we can build some simple features, for example counting the number of words in an English sentence:
let count = 0; // number of words
let msg = new SpeechSynthesisUtterance();
let synth = window.speechSynthesis;

msg.addEventListener('start', () => { // reading has started
  console.log(`Text content: ${msg.text}`);
  console.log('start');
});
msg.addEventListener('end', () => { // reading has finished
  console.log('end');
  console.log(`Number of words: ${count}`);
  count = 0;
});
msg.addEventListener('boundary', () => { // a word boundary was reached
  count++;
});

msg.text = 'Welcome to read my blog'; // any English sentence works
synth.speak(msg);
After trying it out: since Chinese does not use spaces to separate words, word segmentation is done automatically. For example, a phrase like "welcome readers" is recognized as two words, "welcome" and "readers".
SpeechSynthesis
Reference: developer.mozilla.org/en-US/docs/…
Having covered SpeechSynthesisUtterance, let's take a look at SpeechSynthesis.
The main role of SpeechSynthesis is to control the speech, for example starting or pausing it.
It has three read-only properties that indicate the state of the speech:
• SpeechSynthesis.paused
• SpeechSynthesis.pending
• SpeechSynthesis.speaking
There are also a series of methods for manipulating speech:
• SpeechSynthesis.speak() starts reading the speech and fires the start event
• SpeechSynthesis.pause() pauses it and fires the pause event
• SpeechSynthesis.resume() resumes it and fires the resume event
• SpeechSynthesis.cancel() cancels the reading and fires the end event
Based on these control methods, we can take our text reader a step further:
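Below is a minimal sketch of such a reader, assuming the text to read is already available as a string; the function names are made up for illustration, and the pause/resume/cancel calls map directly onto the methods listed above.

const synth = window.speechSynthesis;

function startReading(text) {
  synth.cancel();                                 // drop anything still queued
  const msg = new SpeechSynthesisUtterance(text);
  msg.lang = 'en-US';
  synth.speak(msg);                               // fires the start event
}

function togglePause() {
  if (synth.speaking && !synth.paused) {
    synth.pause();                                // fires the pause event
  } else if (synth.paused) {
    synth.resume();                               // fires the resume event
  }
}

function stopReading() {
  synth.cancel();                                 // fires the end event
}

// Example usage (for instance from buttons wired up elsewhere):
// startReading('Welcome to read my blog');
// togglePause();   // pause
// togglePause();   // resume
// stopReading();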
Back to the starting point
Let's go back to where we started. Based on the content above, we can guess how the automatic article reading on some websites is implemented.
If the site's front end uses an MVVM framework (take Vue as an example), the article content may be stored in the component's data, which can be used directly to construct the utterance we need; see the sketch below.
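A rough sketch under that assumption; the component, the articleText field, and the readArticle method are all hypothetical names used for illustration.

// Hypothetical Vue 2 component: articleText holds the article body
new Vue({
  el: '#app',
  data: {
    articleText: 'The full article content lives here...'
  },
  methods: {
    readArticle() {
      const msg = new SpeechSynthesisUtterance(this.articleText);
      window.speechSynthesis.speak(msg);
    }
  }
});
// Template, for reference: <button @click="readArticle">Read aloud</button>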
Of course, the article may also be fetched with an ajax request; the response is parsed, and the speech synthesis object is constructed from it.
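A sketch of that flow, assuming a hypothetical /api/article endpoint that returns JSON with a content field:

// Both the endpoint and the response shape are assumptions for illustration
fetch('/api/article?id=1')
  .then(res => res.json())
  .then(article => {
    const msg = new SpeechSynthesisUtterance(article.content);
    window.speechSynthesis.speak(msg);
  });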
If the article is written directly in the HTML, the DOM needs to be parsed. After testing, even a chaotic structure like the following
<div id="test">
  <p>1</p>
  <p>2</p>
  <ul>
    <li>3</li>
    <li>4</li>
  </ul>
  <table>
    <tr>
      <td>5</td>
      <td>6</td>
    </tr>
    <tr>
      <td>7</td>
      <td>8</td>
    </tr>
  </table>
  <img src="https://www.baidu.com/img/bd_logo1.png">
  9
</div>
can simply have its text read out via innerText: construct a speech synthesis object from it, and the content is read in the expected order (images are ignored). A minimal version of this approach is sketched below.
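For instance, assuming the markup above is on the page:

// Read everything inside #test in document order; the <img> contributes no text
const text = document.getElementById('test').innerText;
const msg = new SpeechSynthesisUtterance(text);
window.speechSynthesis.speak(msg);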
Of course, if we want to skip certain structures, such as tables, we can put a bit of effort into the parsing and filter out the data or DOM elements we don't want, as in the sketch below.
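One possible way to do that: clone the container so the page itself is untouched, remove the elements we want to skip, then read what is left. (textContent is used here because innerText depends on layout and the clone is not rendered.)

const clone = document.getElementById('test').cloneNode(true);
clone.querySelectorAll('table').forEach(el => el.remove()); // drop the tables
const msg = new SpeechSynthesisUtterance(clone.textContent);
window.speechSynthesis.speak(msg);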
No matter what, we can find a suitable solution~
Side notes
This feature is still a draft and is not yet widely supported.
Again, this API cannot be applied to production environments yet.
The more common approach at present is to have the back end (or a third-party API) synthesize the text into an audio file, which the front end then plays as ordinary media.
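For completeness, a sketch of that pattern; the /api/tts endpoint and the audioUrl field in its response are hypothetical.

// Hypothetical back-end TTS endpoint that returns the URL of a synthesized audio file
fetch('/api/tts?text=' + encodeURIComponent('Welcome to read my blog'))
  .then(res => res.json())
  .then(data => {
    const audio = new Audio(data.audioUrl);
    audio.play();
  });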
Once, when I was feeling lost, I read some articles by experienced developers and their thoughts on front-end development. One point left a deep impression on me:
The front end is closest to the user, so everything should be considered from the user's perspective, and accessibility is an important part of that. Although a feature like this brings far less direct benefit than other business features, putting in the extra work so that the product serves users better is worthwhile. That, too, is part of the spirit of front-end development.
Summary
That is the little-known HTML5 speech synthesis feature this article set out to introduce. I hope it is helpful to you. If you have any questions, please leave me a message and I will reply as soon as I can!