feat: embed STT transcription in audio event content (#5731)

* feat: embed STT transcription in audio event content

Before sending the audio event, the client now fetches the STT
transcript first, then embeds it under 'user_stt' in the event
content. This mirrors the 'original_sent' pattern for text messages
and lets the bot read the transcript immediately without downloading
audio or calling choreo.

- Add ModelKey.userStt constant
- Rewrite onVoiceMessageSend to get transcript before sending audio
- Update getSpeechToTextLocal() to check userStt before botTranscription
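
A hypothetical sketch of the resulting audio event content. The `user_stt`, `speaker_l1`, and `speaker_l2` keys appear in this diff; the exact shape of `SpeechToTextResponseModel.toJson()` does not, so the nested payload here is an illustrative guess:

```json
{
  "type": "m.room.message",
  "content": {
    "msgtype": "m.audio",
    "body": "voice_message.ogg",
    "speaker_l1": "en",
    "speaker_l2": "es",
    "user_stt": {
      "lang_code": "es",
      "transcript": { "text": "hola, ¿cómo estás?" }
    }
  }
}
```

Because the transcript rides along in the event itself, a receiving client can render it without downloading the audio or making a choreo request.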

* chore: replace inaccurate comment with TODO referencing #5730

* formatting

* fix pangea comments

* feat: make stt translations relate to pangea message events instead of stt representation events

* clean up pangea event types

---------

Co-authored-by: ggurdin <ggurdin@gmail.com>
wcjord committed 2026-02-18 12:39:38 -05:00 (via GitHub; no known key found for this signature in database, GPG key ID: B5690EEEBB952194)
parent edcc1e9b43
commit f6a048ca3e
10 changed files with 120 additions and 234 deletions


@ -8,43 +8,24 @@ Messages in Pangea carry rich metadata stored as Matrix events related to the ma
## Event Hierarchy for a Message
When a user sends a message, the client creates a tree of related Matrix events.
Custom events are either embedded within the content of the original `m.room.message` or stored as child events linked to the message. These include:
```
m.room.message (the chat message)
├── pangea.representation ← PangeaRepresentation (sent text + lang)
│ ├── pangea.tokens ← PangeaMessageTokens (tokenized text)
│ └── pangea.record ← ChoreoRecordModel (editing history)
├── pangea.representation ← (optional: L1 original if IT was used)
│ └── pangea.tokens
├── pangea.translation ← Full-text translation
├── pangea.activity_req ← Request to generate practice activities
├── pangea.activity_res ← Generated practice activity
├── pangea.activity_completion ← User's activity completion record
└── pangea.stt_translation ← Speech-to-text translation
```
1. **PangeaRepresentation** (`pangea.representation`): Representation (either text or speech-to-text transcription) of the message. The original sent text and original written text can be embedded within the original message content. Subsequent representations are sent as child events.
   1. **Tokens** (`pangea.tokens`): Tokens in the message. Should be embedded in the original message content, unless an error occurred.
   2. **ChoreoRecord** (`pangea.record`): Choreographer editing history for the message. Should be embedded in the original message content, if it exists.
2. **TextToSpeech** (`pangea.text_to_speech`): Text-to-speech audio for the message, stored as a child event.
3. **SpeechToText** (`pangea.translation`): Speech-to-text transcription of the message, usually embedded in the original message content, but can be sent as a child event of the message.
4. **SttTranslation** (`pangea.stt_translation`): Translation of the speech-to-text transcription, stored as a child event.
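
Child events point back at the parent `m.room.message` through a standard Matrix relation. A minimal sketch of an `pangea.stt_translation` child event; the `rel_type` value of `m.reference` and the translation field names are assumptions for illustration, not confirmed by this diff:

```json
{
  "type": "pangea.stt_translation",
  "content": {
    "m.relates_to": {
      "rel_type": "m.reference",
      "event_id": "$parent_message_event_id"
    },
    "lang_code": "en",
    "translation": "hello, how are you?"
  }
}
```
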
## Custom Event Types (`PangeaEventTypes`)
## Other Custom Event Types (`PangeaEventTypes`)
Defined in `lib/pangea/events/constants/pangea_event_types.dart`:
### Message-related
| Type | Constant | Purpose |
|---|---|---|
| `pangea.representation` | `representation` | A text representation with language code |
| `pangea.tokens` | `tokens` | Tokenized text (lemmas, POS, morphology) |
| `pangea.record` | `choreoRecord` | Choreographer editing history |
| `pangea.translation` | `translation` | Full-text translation |
| `pangea.stt_translation` | `sttTranslation` | Speech-to-text translation |
### Activities
| Type | Constant | Purpose |
|---|---|---|
| `pangea.activity_req` | `activityRequest` | Request server to generate activities |
| `pangea.activity_res` | `pangeaActivity` | A practice activity for a message |
| `pangea.activity_completion` | `activityRecord` | Per-user activity completion record |
| `pangea.activity_plan` | `activityPlan` | Activity plan definition |
| `pangea.activity_roles` | `activityRole` | Roles in a structured activity |
| `pangea.activity_summary` | `activitySummary` | Post-activity summary |
@ -55,45 +36,20 @@ Defined in `lib/pangea/events/constants/pangea_event_types.dart`:
|---|---|---|
| `pangea.construct` | `construct` | A tracked learning construct |
| `pangea.construct_summary` | `constructSummary` | Aggregate construct data |
| `pangea.summaryAnalytics` | `summaryAnalytics` | Summary analytics data |
| `pangea.analytics_profile` | `profileAnalytics` | User analytics profile |
| `pangea.activities_profile` | `profileActivities` | User activities profile |
| `pangea.analytics_settings` | `analyticsSettings` | Analytics display settings |
| `p.user_lemma_info` | `userSetLemmaInfo` | User-customized lemma info |
| `p.emoji` | `userChosenEmoji` | User-chosen emoji for a word |
| `p.analytics_settings` | `analyticsSettings` | Analytics display settings |
| `pangea.activity_room_ids` | `activityRoomIds` | List of saved activity room IDs |
### Room/Course Settings
| Type | Constant | Purpose |
|---|---|---|
| `pangea.class` | `languageSettings` | Room language configuration |
| `p.rules` | `rules` | Room rules |
| `pangea.roomtopic` | `roomInfo` | Room topic info |
| `pangea.bot_options` | `botOptions` | Bot behavior configuration |
| `pangea.capacity` | `capacity` | Room capacity limit |
| `pangea.course_plan` | `coursePlan` | Course plan reference |
| `p.course_user` | `courseUser` | User's course enrollment |
| `pangea.teacher_mode` | `teacherMode` | Teacher mode toggle |
| `pangea.course_chat_list` | `courseChatList` | Course chat list |
### Audio & Media
| Type | Constant | Purpose |
|---|---|---|
| `p.audio` | `audio` | Audio attachment |
| `pangea.transcript` | `transcript` | Audio transcript |
| `p.rule.text_to_speech` | `textToSpeechRule` | TTS settings |
### User & Misc
| Type | Constant | Purpose |
|---|---|---|
| `pangea.user_age` | `userAge` | User age bracket |
| `m.report` | `report` | Content report |
| `p.rule.analytics_invite` | `analyticsInviteRule` | Analytics sharing rules |
| `p.analytics_request` | `analyticsInviteContent` | Analytics sharing request |
| `pangea.regeneration_request` | `regenerationRequest` | Content regeneration request |
| `pangea.activity_room_ids` | `activityRoomIds` | Activity room references |
| `pangea.course_chat_list` | `courseChatList` | Course chat list default chat settings |
## Core Data Models


@ -45,6 +45,7 @@ import 'package:fluffychat/pangea/choreographer/choreographer.dart';
import 'package:fluffychat/pangea/choreographer/choreographer_state_extension.dart';
import 'package:fluffychat/pangea/choreographer/text_editing/edit_type_enum.dart';
import 'package:fluffychat/pangea/choreographer/text_editing/pangea_text_controller.dart';
import 'package:fluffychat/pangea/common/constants/model_keys.dart';
import 'package:fluffychat/pangea/common/controllers/pangea_controller.dart';
import 'package:fluffychat/pangea/common/utils/error_handler.dart';
import 'package:fluffychat/pangea/common/utils/firebase_analytics.dart';
@ -1205,17 +1206,19 @@ class ChatController extends State<ChatPageWithRoom>
);
// #Pangea
// setState(() {
// replyEvent = null;
// });
final reply = replyEvent.value;
replyEvent.value = null;
// Get transcript first so we can embed it in the audio event,
// allowing the bot (and other clients) to read it immediately
// without waiting for a separate representation event.
final transcriptResult = await _getVoiceMessageTranscript(file);
final stt = transcriptResult.result;
// Pangea#
// #Pangea
final transcriptFuture = _getVoiceMessageTranscript(file);
// room
final eventFuture = room
final eventId = await room
// Pangea#
.sendFileEvent(
file,
@ -1234,60 +1237,40 @@ class ChatController extends State<ChatPageWithRoom>
// #Pangea
'speaker_l1': pangeaController.userController.userL1Code,
'speaker_l2': pangeaController.userController.userL2Code,
if (stt != null) ModelKey.userStt: stt.toJson(),
// Pangea#
},
// #Pangea
)
// #Pangea
// .catchError((e) {
.catchError((e, s) {
ErrorHandler.logError(
e: e,
s: s,
data: {'roomId': roomId, 'file': file.name},
);
// Pangea#
scaffoldMessenger.showSnackBar(
SnackBar(content: Text((e as Object).toLocalizedString(context))),
);
return null;
});
// .catchError((e) {
// scaffoldMessenger.showSnackBar(
// SnackBar(content: Text((e as Object).toLocalizedString(context))),
// );
// return null;
// #Pangea
// setState(() {
// replyEvent = null;
// });
if (eventId == null) {
ErrorHandler.logError(
e: Exception('eventID null in voiceMessageAction'),
s: StackTrace.current,
data: {'roomId': roomId},
);
return;
}
Future.wait([eventFuture, transcriptFuture]).then((results) async {
final eventId = results[0] as String?;
final transcript = results[1] as async.Result<SpeechToTextResponseModel>;
if (eventId == null) {
ErrorHandler.logError(
e: Exception('eventID null in voiceMessageAction'),
s: StackTrace.current,
data: {'roomId': roomId},
);
return;
}
if (transcript.result == null) return;
final stt = transcript.result!;
final event = await room.getEventById(eventId);
if (event == null) {
ErrorHandler.logError(
e: Exception('Event not found after sending voice message'),
s: StackTrace.current,
data: {'roomId': roomId},
);
} else {
final messageEvent = PangeaMessageEvent(
event: event,
timeline: timeline!,
ownMessage: true,
);
messageEvent.sendSttRepresentationEvent(stt);
}
if (stt != null) {
_sendVoiceMessageAnalytics(eventId, stt);
});
}
// Pangea#
return;
}


@ -54,7 +54,6 @@ class ChatPermissionsSettingsView extends StatelessWidget {
PangeaEventTypes.activityRole,
PangeaEventTypes.activitySummary,
PangeaEventTypes.coursePlan,
PangeaEventTypes.courseUser,
];
Map<String, dynamic> missingPowerLevels = Map<String, dynamic>.from(


@ -61,7 +61,6 @@ class RoomDefaults {
"invite": 50,
"redact": 50,
"events": {
PangeaEventTypes.courseUser: 0,
"m.room.power_levels": 100,
"m.room.join_rules": 100,
"m.space.child": spaceChild,


@ -113,6 +113,7 @@ class ModelKey {
static const String transcription = "transcription";
static const String botTranscription = 'bot_transcription';
static const String userStt = 'user_stt';
static const String voice = "voice";
// bot options


@ -1,28 +1,15 @@
class PangeaEventTypes {
static const languageSettings = "pangea.class";
static const transcript = "pangea.transcript";
static const rules = "p.rules";
// static const studentAnalyticsSummary = "pangea.usranalytics";
static const summaryAnalytics = "pangea.summaryAnalytics";
static const construct = "pangea.construct";
static const userSetLemmaInfo = "p.user_lemma_info";
static const constructSummary = "pangea.construct_summary";
static const userChosenEmoji = "p.emoji";
static const translation = "pangea.translation";
static const tokens = "pangea.tokens";
static const choreoRecord = "pangea.record";
static const representation = "pangea.representation";
static const sttTranslation = "pangea.stt_translation";
static const textToSpeech = "pangea.text_to_speech";
// static const vocab = "p.vocab";
static const roomInfo = "pangea.roomtopic";
static const audio = "p.audio";
static const botOptions = "pangea.bot_options";
static const capacity = "pangea.capacity";
@ -30,31 +17,20 @@ class PangeaEventTypes {
static const activityRole = "pangea.activity_roles";
static const activitySummary = "pangea.activity_summary";
static const userAge = "pangea.user_age";
static const String report = 'm.report';
static const report = 'm.report';
static const textToSpeechRule = "p.rule.text_to_speech";
static const analyticsInviteRule = "p.rule.analytics_invite";
static const analyticsInviteContent = "p.analytics_request";
/// A request to the server to generate activities
static const activityRequest = "pangea.activity_req";
/// A practice activity that is related to a message
static const pangeaActivity = "pangea.activity_res";
/// A record of completion of an activity. There
/// can be one per user per activity.
static const activityRecord = "pangea.activity_completion";
/// Profile information related to a user's analytics
static const profileAnalytics = "pangea.analytics_profile";
static const profileActivities = "pangea.activities_profile";
static const activityRoomIds = "pangea.activity_room_ids";
/// Relates to course plans
static const coursePlan = "pangea.course_plan";
static const courseUser = "p.course_user";
static const teacherMode = "pangea.teacher_mode";
static const courseChatList = "pangea.course_chat_list";


@ -7,6 +7,7 @@ import 'package:flutter/foundation.dart';
import 'package:async/async.dart';
import 'package:collection/collection.dart';
import 'package:matrix/matrix.dart' hide Result;
import 'package:sentry_flutter/sentry_flutter.dart';
import 'package:fluffychat/pangea/choreographer/choreo_record_model.dart';
import 'package:fluffychat/pangea/common/constants/model_keys.dart';
@ -88,7 +89,7 @@ class PangeaMessageEvent {
_event;
// get audio events that are related to this event
Set<Event> get allAudio => _latestEdit
Set<Event> get ttsEvents => _latestEdit
.aggregatedEvents(timeline, PangeaEventTypes.textToSpeech)
.where((element) {
return element.content.tryGet<Map<String, dynamic>>(
@ -98,6 +99,9 @@ class PangeaMessageEvent {
})
.toSet();
Set<Event> get _sttTranslationEvents =>
_latestEdit.aggregatedEvents(timeline, PangeaEventTypes.sttTranslation);
List<RepresentationEvent> get _repEvents => _latestEdit
.aggregatedEvents(timeline, PangeaEventTypes.representation)
.map(
@ -259,7 +263,7 @@ class PangeaMessageEvent {
}
RepresentationEvent? get messageDisplayRepresentation =>
representationByLanguage(messageDisplayLangCode);
_representationByLanguage(messageDisplayLangCode);
/// Gets the message display text for the current language code.
/// If the message display text is not available for the current language code,
@ -277,7 +281,7 @@ class PangeaMessageEvent {
_representations = null;
}
RepresentationEvent? representationByLanguage(
RepresentationEvent? _representationByLanguage(
String langCode, {
bool Function(RepresentationEvent)? filter,
}) => representations.firstWhereOrNull(
@ -286,8 +290,11 @@ class PangeaMessageEvent {
(filter?.call(element) ?? true),
);
Event? getTextToSpeechLocal(String langCode, String text, String? voice) {
for (final audio in allAudio) {
RepresentationEvent? get _speechToTextRepresentation => representations
.firstWhereOrNull((element) => element.content.speechToText != null);
Event? _getTextToSpeechLocal(String langCode, String text, String? voice) {
for (final audio in ttsEvents) {
final dataMap = audio.content.tryGetMap(ModelKey.transcription);
if (dataMap == null || !dataMap.containsKey(ModelKey.tokens)) continue;
@ -314,28 +321,27 @@ class PangeaMessageEvent {
return null;
}
RepresentationEvent? _getSpeechToTextRepresentation() => representations
.firstWhereOrNull((element) => element.content.speechToText != null);
SpeechToTextResponseModel? getSpeechToTextLocal() {
final rep = _getSpeechToTextRepresentation()?.content.speechToText;
final rep = _speechToTextRepresentation?.content.speechToText;
if (rep != null) return rep;
final rawBotTranscription = event.content.tryGetMap(
ModelKey.botTranscription,
);
// Check for STT embedded directly in the audio event content
// (user-sent audio embeds under userStt, bot-sent audio under botTranscription)
final rawEmbeddedStt =
event.content.tryGetMap(ModelKey.userStt) ??
event.content.tryGetMap(ModelKey.botTranscription);
if (rawBotTranscription != null) {
if (rawEmbeddedStt != null) {
try {
return SpeechToTextResponseModel.fromJson(
Map<String, dynamic>.from(rawBotTranscription),
Map<String, dynamic>.from(rawEmbeddedStt),
);
} catch (err, s) {
ErrorHandler.logError(
e: err,
s: s,
data: {"event": _event.toJson()},
m: "error parsing botTranscription",
m: "error parsing embedded stt",
);
return null;
}
@ -344,17 +350,41 @@ class PangeaMessageEvent {
return null;
}
SttTranslationModel? _getSttTranslationLocal(String langCode) {
final events = _sttTranslationEvents;
final List<SttTranslationModel> translations = [];
for (final event in events) {
try {
final translation = SttTranslationModel.fromJson(event.content);
translations.add(translation);
} catch (e) {
Sentry.addBreadcrumb(
Breadcrumb(
message: "Failed to parse STT translation",
data: {
"eventID": event.eventId,
"content": event.content,
"error": e.toString(),
},
),
);
}
}
return translations.firstWhereOrNull((t) => t.langCode == langCode);
}
Future<PangeaAudioFile> requestTextToSpeech(
String langCode,
String? voice,
) async {
final local = getTextToSpeechLocal(langCode, messageDisplayText, voice);
final local = _getTextToSpeechLocal(langCode, messageDisplayText, voice);
if (local != null) {
final file = await local.getPangeaAudioFile();
if (file != null) return file;
}
final rep = representationByLanguage(langCode);
final rep = _representationByLanguage(langCode);
final tokensResp = await rep?.requestTokens();
final request = TextToSpeechRequestModel(
text: rep?.content.text ?? body,
@ -450,63 +480,26 @@ class PangeaMessageEvent {
return result.result!;
}
Future<Event?> sendSttRepresentationEvent(
SpeechToTextResponseModel stt,
) async {
final representation = PangeaRepresentation(
langCode: stt.langCode,
text: stt.transcript.text,
originalSent: false,
originalWritten: false,
speechToText: stt,
);
_representations = null;
return room.sendPangeaEvent(
content: representation.toJson(),
parentEventId: _latestEdit.eventId,
type: PangeaEventTypes.representation,
);
}
Future<String> requestSttTranslation({
required String langCode,
required String l1Code,
required String l2Code,
}) async {
// First try to access the local translation event via a representation event
RepresentationEvent? rep = _getSpeechToTextRepresentation();
final local = rep?.getSpeechToTextTranslationLocal(langCode);
final local = _getSttTranslationLocal(langCode);
if (local != null) return local.translation;
// The translation event needs a parent representation to relate to,
// so if the rep is null, we send a new representation event first.
// This happens mostly for bot audio messages, which store their transcripts
// in the original message event content.
SpeechToTextResponseModel? stt = rep?.content.speechToText;
if (rep == null) {
stt ??= await requestSpeechToText(l1Code, l2Code, sendEvent: false);
final repEvent = await sendSttRepresentationEvent(stt);
if (repEvent == null) {
throw Exception("Failed to send representation event for STT");
}
rep = _getSpeechToTextRepresentation();
if (rep == null) {
throw Exception("Failed to get representation event for STT");
}
}
// Make the translation request
final stt = await requestSpeechToText(l1Code, l2Code);
final res = await FullTextTranslationRepo.get(
MatrixState.pangeaController.userController.accessToken,
FullTextTranslationRequestModel(
text: stt!.transcript.text,
text: stt.transcript.text,
tgtLang: l1Code,
userL2: l2Code,
userL1: l1Code,
),
);
if (res.isError) {
throw res.error!;
}
@ -516,13 +509,7 @@ class PangeaMessageEvent {
langCode: l1Code,
);
// Send the translation event if the representation event exists
rep.event?.room.sendPangeaEvent(
content: translation.toJson(),
parentEventId: rep.event!.eventId,
type: PangeaEventTypes.sttTranslation,
);
_sendSttTranslationEvent(sttTranslation: translation);
return translation.translation;
}
@ -569,12 +556,12 @@ class PangeaMessageEvent {
RepresentationEvent? rep;
if (!includedIT) {
// if the message didn't go through translation, get any l1 rep
rep = representationByLanguage(_l1Code!);
rep = _representationByLanguage(_l1Code!);
} else {
// if the message went through translation, get the non-original
// l1 rep since originalWritten could contain some l2 words
// (https://github.com/pangeachat/client/issues/3591)
rep = representationByLanguage(
rep = _representationByLanguage(
_l1Code!,
filter: (rep) => !rep.content.originalWritten,
);
@ -638,4 +625,31 @@ class PangeaMessageEvent {
);
return repEvent?.eventId;
}
Future<Event?> sendSttRepresentationEvent(
SpeechToTextResponseModel stt,
) async {
final representation = PangeaRepresentation(
langCode: stt.langCode,
text: stt.transcript.text,
originalSent: false,
originalWritten: false,
speechToText: stt,
);
_representations = null;
return room.sendPangeaEvent(
content: representation.toJson(),
parentEventId: _latestEdit.eventId,
type: PangeaEventTypes.representation,
);
}
Future<Event?> _sendSttTranslationEvent({
required SttTranslationModel sttTranslation,
}) => room.sendPangeaEvent(
content: sttTranslation.toJson(),
parentEventId: _latestEdit.eventId,
type: PangeaEventTypes.sttTranslation,
);
}


@ -5,7 +5,6 @@ import 'dart:developer';
import 'package:flutter/foundation.dart';
import 'package:async/async.dart';
import 'package:collection/collection.dart';
import 'package:matrix/matrix.dart' hide Result;
import 'package:sentry_flutter/sentry_flutter.dart';
@ -18,7 +17,6 @@ import 'package:fluffychat/pangea/events/extensions/pangea_event_extension.dart'
import 'package:fluffychat/pangea/events/models/language_detection_model.dart';
import 'package:fluffychat/pangea/events/models/pangea_token_model.dart';
import 'package:fluffychat/pangea/events/models/representation_content_model.dart';
import 'package:fluffychat/pangea/events/models/stt_translation_model.dart';
import 'package:fluffychat/pangea/events/models/tokens_event_content_model.dart';
import 'package:fluffychat/pangea/events/repo/token_api_models.dart';
import 'package:fluffychat/pangea/events/repo/tokens_repo.dart';
@ -65,9 +63,6 @@ class RepresentationEvent {
Set<Event> get tokenEvents =>
_event?.aggregatedEvents(timeline, PangeaEventTypes.tokens) ?? {};
Set<Event> get sttEvents =>
_event?.aggregatedEvents(timeline, PangeaEventTypes.sttTranslation) ?? {};
Set<Event> get choreoEvents =>
_event?.aggregatedEvents(timeline, PangeaEventTypes.choreoRecord) ?? {};
@ -110,36 +105,6 @@ class RepresentationEvent {
return ChoreoEvent(event: choreoEvents.first).content;
}
List<SttTranslationModel> get sttTranslations {
if (content.speechToText == null) return [];
if (_event == null) {
Sentry.addBreadcrumb(
Breadcrumb(message: "_event and _sttTranslations both null"),
);
return [];
}
if (sttEvents.isEmpty) return [];
final List<SttTranslationModel> sttTranslations = [];
for (final event in sttEvents) {
try {
sttTranslations.add(SttTranslationModel.fromJson(event.content));
} catch (e) {
Sentry.addBreadcrumb(
Breadcrumb(
message: "Failed to parse STT translation",
data: {
"eventID": event.eventId,
"content": event.content,
"error": e.toString(),
},
),
);
}
}
return sttTranslations;
}
List<OneConstructUse> get vocabAndMorphUses {
if (tokens == null || tokens!.isEmpty) {
return [];
@ -229,8 +194,4 @@ class RepresentationEvent {
? Result.error(res.error!)
: Result.value(res.result!.tokens);
}
SttTranslationModel? getSpeechToTextTranslationLocal(String langCode) {
return sttTranslations.firstWhereOrNull((t) => t.langCode == langCode);
}
}


@ -29,7 +29,6 @@ class SttTranscriptTokens extends StatelessWidget {
@override
Widget build(BuildContext context) {
debugPrint("Tokens: ${tokens.map((t) => t.toJson())}");
if (model.transcript.sttTokens.isEmpty) {
return Text(
model.transcript.text,


@ -134,7 +134,6 @@ abstract class ClientManager {
// to postLoad to confirm that these state events are completely loaded
EventTypes.RoomPowerLevels,
EventTypes.RoomJoinRules,
PangeaEventTypes.rules,
PangeaEventTypes.botOptions,
PangeaEventTypes.capacity,
PangeaEventTypes.userSetLemmaInfo,
@ -144,7 +143,6 @@ abstract class ClientManager {
PangeaEventTypes.constructSummary,
PangeaEventTypes.activityRoomIds,
PangeaEventTypes.coursePlan,
PangeaEventTypes.courseUser,
PangeaEventTypes.teacherMode,
PangeaEventTypes.courseChatList,
PangeaEventTypes.analyticsSettings,