Implementing Hotword Detection with AssemblyAI's Streaming Speech-to-Text in Go
Hotword detection is a crucial feature for voice-activated systems like Siri or Alexa. In a recent tutorial by AssemblyAI, developers are guided on how to implement this functionality using AssemblyAI's Streaming Speech-to-Text API with the Go programming language.
Introduction to Hotword Detection
Hotword detection enables an AI system to respond to specific trigger words or phrases. Popular AI systems like Alexa and Siri use predefined hotwords to activate their functionalities. This tutorial from AssemblyAI demonstrates how to create a similar system, named 'Jarvis' in homage to Iron Man, using Go and AssemblyAI's API.
Setting Up the Environment
Before diving into the coding, developers need to set up their environment. This includes installing the Go bindings of PortAudio to capture raw audio data from the microphone and the AssemblyAI Go SDK for interfacing with the API. The following commands are used for setting up the project:
mkdir jarvis
cd jarvis
go mod init jarvis
go get github.com/gordonklaus/portaudio
go get github.com/AssemblyAI/assemblyai-go-sdk
Next, an AssemblyAI account is required to obtain an API key. Developers can sign up on the AssemblyAI website and configure their billing details to access the Streaming Speech-to-Text API.
Implementing the Recorder
The core functionality begins with recording raw audio data. The tutorial guides on creating a recorder.go
file to define a recorder
struct that captures audio data using PortAudio. This struct includes methods for starting, stopping, and reading from the audio stream.
package main
import (
"bytes"
"encoding/binary"
"github.com/gordonklaus/portaudio"
)
type recorder struct {
stream *portaudio.Stream
in []int16
}
func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
in := make([]int16, framesPerBuffer)
stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
if err != nil {
return nil, err
}
return &recorder{
stream: stream,
in: in,
}, nil
}
func (r *recorder) Read() ([]byte, error) {
if err := r.stream.Read(); err != nil {
return nil, err
}
buf := new(bytes.Buffer)
if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
return nil, err
}
return buf.Bytes(), nil
}
func (r *recorder) Start() error {
return r.stream.Start()
}
func (r *recorder) Stop() error {
return r.stream.Stop()
}
func (r *recorder) Close() error {
return r.stream.Close()
}
Creating the Real-Time Transcriber
AssemblyAI's Real-Time Transcriber requires event handlers for different stages of the transcription process. These handlers are defined in a transcriber
struct and include events such as OnSessionBegins
, OnSessionTerminated
, and OnPartialTranscript
.
package main
import (
"fmt"
"github.com/AssemblyAI/assemblyai-go-sdk"
)
var transcriber = &assemblyai.RealTimeTranscriber{
OnSessionBegins: func(event assemblyai.SessionBegins) {
fmt.Println("session begins")
},
OnSessionTerminated: func(event assemblyai.SessionTerminated) {
fmt.Println("session terminated")
},
OnPartialTranscript: func(event assemblyai.PartialTranscript) {
fmt.Printf("%s\r", event.Text)
},
OnFinalTranscript: func(event assemblyai.FinalTranscript) {
fmt.Println(event.Text)
},
OnError: func(err error) {
fmt.Println(err)
},
}
Stitching Everything Together
The final step involves integrating all components in the main.go
file. This includes setting up the API client, initializing the recorder, and handling the transcription events. The code also includes logic for detecting the hotword and responding appropriately.
package main
import (
"context"
"fmt"
"log"
"os"
"os/signal"
"strings"
"syscall"
"github.com/AssemblyAI/assemblyai-go-sdk"
"github.com/gordonklaus/portaudio"
)
var hotword string
var transcriber = &assemblyai.RealTimeTranscriber{
OnSessionBegins: func(event assemblyai.SessionBegins) {
fmt.Println("session begins")
},
OnSessionTerminated: func(event assemblyai.SessionTerminated) {
fmt.Println("session terminated")
},
OnPartialTranscript: func(event assemblyai.PartialTranscript) {
fmt.Printf("%s\r", event.Text)
},
OnFinalTranscript: func(event assemblyai.FinalTranscript) {
fmt.Println(event.Text)
hotwordDetected := strings.Contains(
strings.ToLower(event.Text),
strings.ToLower(hotword),
)
if hotwordDetected {
fmt.Println("I am here!")
}
},
OnError: func(err error) {
fmt.Println(err)
},
}
func main() {
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
logger := log.New(os.Stderr, "", log.Lshortfile)
portaudio.Initialize()
defer portaudio.Terminate()
hotword = os.Args[1]
device, err := portaudio.DefaultInputDevice()
if err != nil {
logger.Fatal(err)
}
var (
apiKey = os.Getenv("ASSEMBLYAI_API_KEY")
sampleRate = device.DefaultSampleRate
framesPerBuffer = int(0.2 * sampleRate)
)
client := assemblyai.NewRealTimeClientWithOptions(
assemblyai.WithRealTimeAPIKey(apiKey),
assemblyai.WithRealTimeSampleRate(int(sampleRate)),
assemblyai.WithRealTimeTranscriber(transcriber),
)
ctx := context.Background()
if err := client.Connect(ctx); err != nil {
logger.Fatal(err)
}
rec, err := newRecorder(int(sampleRate), framesPerBuffer)
if err != nil {
logger.Fatal(err)
}
if err := rec.Start(); err != nil {
logger.Fatal(err)
}
for {
select {
case <-sigs:
fmt.Println("stopping recording...")
if err := rec.Stop(); err != nil {
log.Fatal(err)
}
if err := client.Disconnect(ctx, true); err != nil {
log.Fatal(err)
}
os.Exit(0)
default:
b, err := rec.Read()
if err != nil {
logger.Fatal(err)
}
if err := client.Send(ctx, b); err != nil {
logger.Fatal(err)
}
}
}
}
Running the Application
To run the application, developers need to set their AssemblyAI API key as an environment variable and execute the Go program with the desired hotword:
export ASSEMBLYAI_API_KEY='***'
go run . Jarvis
This command sets 'Jarvis' as the hotword, and the program will respond with 'I am here!' whenever the hotword is detected in the audio stream.
Conclusion
This tutorial by AssemblyAI provides a comprehensive guide for developers to implement hotword detection using their Streaming Speech-to-Text API and Go. The combination of PortAudio for capturing audio and AssemblyAI for transcription offers a powerful solution for creating voice-activated applications. For more details, visit the original tutorial.