Building a Real-Time Transcription Application with FastAPI and Next.js

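Introduction

In this post I build a web application that transcribes speech in the browser in real time using the Google Cloud Speech-to-Text API.
The backend uses FastAPI and the frontend uses Next.js, with audio streamed in real time over WebSocket.

The code is available here: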

GitHub - nakajima97/real-time-transcription-using-google-speech-to-text-built-with-fastapi-and-next at v1.0.0

Tech stack

Backend: FastAPI + python-socketio
Frontend: Next.js + Socket.IO-client
Speech recognition: Google Cloud Speech-to-Text v1
Communication protocol: WebSocket (Socket.IO)

Implementation details

Communication flow

Backend implementation

1. Integrating FastAPI and Socket.IO

The Socket.IO server is mounted on the FastAPI application:

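from fastapi import FastAPI
# app_socketio is the ASGI app for the Socket.IO server, defined in the repository
from src.websocket.transcription import app_socketio

app = FastAPI()
# Route Socket.IO traffic under /socket.io to that app
app.mount("/socket.io", app_socketio)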

2. Audio stream processing

Transcription uses the streaming recognition function of the Google Cloud Speech-to-Text API.
The code is in transcription.py.

real-time-transcription-using-google-speech-to-text-built-with-fastapi-and-next/server/api/src/websocket/transcription.py at v1.0.0 · nakajima97/real-time-transcription-using-google-speech-to-text-built-with-fastapi-and-next

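Because I wanted the processing to be asynchronous, I use SpeechAsyncClient.
This part uses Speech v1.
The only reason is that the first sample code I found was written for v1, so I followed it.

I also kept notes from when I tried porting the code to v2.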
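
One common way to connect audio chunks that arrive through event handlers to a streaming API is to buffer them in an asyncio.Queue and expose them as an async generator. The sketch below shows only that bridging pattern, using the standard library. The AudioStream class and its method names are hypothetical and are not taken from the repository; in a real handler, each yielded chunk would presumably be wrapped in a StreamingRecognizeRequest and the generator handed to SpeechAsyncClient.streaming_recognize().

```python
import asyncio
from typing import AsyncIterator, List, Optional


class AudioStream:
    """Hypothetical buffer: event handlers push chunks, a consumer iterates them."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    def push(self, chunk: bytes) -> None:
        # Called whenever a chunk of audio arrives from the browser.
        self._queue.put_nowait(chunk)

    def close(self) -> None:
        # None is a sentinel telling the consumer that the stream has ended.
        self._queue.put_nowait(None)

    async def chunks(self) -> AsyncIterator[bytes]:
        # Yield chunks in arrival order until the sentinel is seen.
        while True:
            chunk: Optional[bytes] = await self._queue.get()
            if chunk is None:
                return
            yield chunk


async def demo() -> List[bytes]:
    stream = AudioStream()
    stream.push(b"\x00\x01")
    stream.push(b"\x02\x03")
    stream.close()
    # Stand-in for the consumer that would forward chunks to Speech-to-Text.
    return [chunk async for chunk in stream.chunks()]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The queue decouples the two sides: the handler that receives audio never waits on the recognizer, and the generator ends cleanly once close() is called.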

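3. WebSocket event handling

The following events are handled through Socket.IO:

connect: runs when a client connects
startGoogleCloudStream: starts speech recognition
send_audio_data: receives audio data
stopGoogleCloudStream: stops speech recognition
disconnect: runs when a client disconnects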
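
The lifecycle implied by these five events can be sketched as per-client state keyed by session id. This is a minimal illustration in plain asyncio, so it runs without a Socket.IO server; the handler names, the streams dictionary, and the dispatch helper are hypothetical, and the repository's actual handlers may be organized differently. With python-socketio, functions like these would typically be registered on the server with sio.on(...) or the sio.event decorator.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# One buffer of received audio chunks per client, keyed by session id (sid).
# A real handler would feed a Speech-to-Text stream instead of a list.
streams: Dict[str, List[bytes]] = {}


async def on_connect(sid: str) -> None:
    # Nothing is allocated yet; recognition starts only on request.
    print(f"connected: {sid}")


async def on_start(sid: str) -> None:
    streams[sid] = []


async def on_audio(sid: str, data: bytes) -> None:
    # Drop audio that arrives before start or after stop.
    if sid in streams:
        streams[sid].append(data)


async def on_stop(sid: str) -> None:
    streams.pop(sid, None)


async def on_disconnect(sid: str) -> None:
    # A client may vanish without sending stop, so clean up here as well.
    streams.pop(sid, None)


# Event names as listed above, mapped to the hypothetical handlers.
HANDLERS: Dict[str, Callable[..., Awaitable[None]]] = {
    "connect": on_connect,
    "startGoogleCloudStream": on_start,
    "send_audio_data": on_audio,
    "stopGoogleCloudStream": on_stop,
    "disconnect": on_disconnect,
}


async def dispatch(event: str, sid: str, *args: object) -> None:
    await HANDLERS[event](sid, *args)


async def demo() -> int:
    await dispatch("connect", "abc")
    await dispatch("send_audio_data", "abc", b"early")  # ignored: not started
    await dispatch("startGoogleCloudStream", "abc")
    await dispatch("send_audio_data", "abc", b"\x00\x01")
    buffered = len(streams["abc"])
    await dispatch("disconnect", "abc")  # no explicit stop was sent
    return buffered


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Cleaning up in both the stop and disconnect handlers is a deliberate choice here: a browser tab can close mid-recording, and the stream for that session should not be left open.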

Frontend implementation

1. Custom hook (useTranscription)

The speech recognition feature is implemented as a React custom hook:

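import { useState } from "react";
import type { Socket } from "socket.io-client";

export const useTranscription = () => {
  const [connection, setConnection] = useState<Socket>();
  const [currentRecognition, setCurrentRecognition] = useState<string>();
  const [recognitionHistory, setRecognitionHistory] = useState<string[]>([]);
  // ... (rest of the hook is omitted in this excerpt; see the repository)
};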

Main responsibilities:

Managing the WebSocket connection
Capturing and sending the audio stream
Managing the state of the recognized text

The code is here:

real-time-transcription-using-google-speech-to-text-built-with-fastapi-and-next/frontend/src/features/transcription/hooks/useTranscription/index.ts at v1.0.0 · nakajima97/real-time-transcription-using-google-speech-to-text-built-with-fastapi-and-next

References

Official documentation

Transcribe audio from streaming input | Cloud Speech-to-Text Documentation | Google Cloud

Class SpeechAsyncClient (2.30.0) | Python client library | Google Cloud

GitHub repository

This project is written in Node.js, but I was able to run it locally, so it was a useful reference for how to approach the implementation.

GitHub - untilhamza/Real-time-transcription-with-Google-speech-to-text-API: A simple app that demostrates how to use the google-speech-to-text API for real time transcription with react and node js

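Summary

Combining FastAPI and Next.js let me build a working real-time speech recognition application. Bidirectional communication over WebSocket (Socket.IO) keeps the transcription updating smoothly as the user speaks.

Through this implementation I learned about the following:

How to integrate FastAPI and Socket.IO effectively
Using the streaming mode of the Google Cloud Speech-to-Text API
Real-time audio processing on the frontend
Best practices for managing configuration with environment variables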
