Timeline Entry
Each element under `timeline` contains multimodal signals for a single utterance.
Unique identifier for the utterance
Speaker identifier from diarization
Start time in seconds
End time in seconds
Transcribed text from speech recognition
ASR confidence score (0-1)
Acoustic metrics including:
- Pitch/energy statistics
- Pauses
- Spectral features
- Commitment/passion markers
- Lexical markers
- Hesitation flags
Natural-language descriptor (e.g., “COMMITTED, PASSIONATE, emphatic”)
Full emotion analysis including:
- Emotion scores (joy, sadness, etc.)
- PAD scores (valence, arousal, dominance)
- Dominant emotion
- Confidence
Structured sentiment analysis
Sentiment polarity (-1 to 1)
Subjectivity score (0-1)
Cognitive analysis including:
- Engagement score
- Cognitive load level
- Cognitive signals
Derived commitment score from prosody
Commitment level classification
Derived passion score from prosody
Passion level classification
Conviction score derived from prosody and acoustic reasoning
Lexical analysis including:
- Hedges
- Fillers
- Questions
- Disfluency rate
- Self-repair flags
Generated narrative explaining how acoustic cues map to cognition/emotion
Average confidence across all agents (0-1)
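The field descriptions above could correspond to a shape like the following TypeScript interface. All field names here are illustrative guesses, since the reference lists descriptions only; the nested acoustic, emotion, and cognitive objects are elided.

```typescript
// Hypothetical shape for one timeline entry. Field names are assumptions;
// only the descriptions and value ranges come from the reference above.
interface TimelineEntry {
  utterance_id: string;       // unique identifier for the utterance
  speaker: string;            // speaker identifier from diarization
  start: number;              // start time in seconds
  end: number;                // end time in seconds
  text: string;               // transcribed text from speech recognition
  asr_confidence: number;     // ASR confidence score, 0-1
  sentiment_polarity: number; // -1 to 1
  subjectivity: number;       // 0-1
  commitment_score: number;   // derived from prosody
  passion_score: number;      // derived from prosody
  conviction_score: number;   // prosody + acoustic reasoning
  avg_confidence: number;     // average across all agents, 0-1
}

// Minimal conforming example
const entry: TimelineEntry = {
  utterance_id: "utt_001",
  speaker: "S1",
  start: 0.0,
  end: 2.4,
  text: "We are absolutely committed to this launch.",
  asr_confidence: 0.94,
  sentiment_polarity: 0.6,
  subjectivity: 0.7,
  commitment_score: 0.88,
  passion_score: 0.81,
  conviction_score: 0.85,
  avg_confidence: 0.9,
};
```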
ConversationSummary Schema (Reference)
| Field | Type | Notes |
|---|---|---|
| overview | string | Executive-level synopsis. |
| key_insights | string[] | Bullet-ready highlights (timestamp references encouraged). |
| notable_patterns | string[] | Trends across the conversation. |
| recommendations | string[] | Actionable next steps. |
| what_went_well | string[] \| null | Populated when the user explicitly asks “what went well.” |
| what_went_wrong | string[] \| null | Populated on “what went wrong” requests. |
| speaker_dynamics | [{speaker, description}] \| null | Characterization per speaker. |
| conversation_quality | `"excellent" \| "good" \| "fair" \| "poor"` | Defaults to `"good"`. |
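The table above translates directly into a TypeScript interface; the field names and types come from the table, while the interface names themselves are illustrative.

```typescript
// Sketch of the ConversationSummary schema from the table above.
interface SpeakerDynamic {
  speaker: string;
  description: string;
}

interface ConversationSummary {
  overview: string;
  key_insights: string[];
  notable_patterns: string[];
  recommendations: string[];
  what_went_well: string[] | null;   // populated on "what went well" requests
  what_went_wrong: string[] | null;  // populated on "what went wrong" requests
  speaker_dynamics: SpeakerDynamic[] | null;
  conversation_quality: "excellent" | "good" | "fair" | "poor"; // defaults to "good"
}

// Minimal conforming example
const summary: ConversationSummary = {
  overview: "Short demo call covering pricing objections.",
  key_insights: ["Objection at 02:15 was resolved by 03:40."],
  notable_patterns: ["Sentiment trended upward after the pricing discussion."],
  recommendations: ["Send a follow-up with the discussed discount."],
  what_went_well: null,
  what_went_wrong: null,
  speaker_dynamics: [{ speaker: "S1", description: "Led the conversation." }],
  conversation_quality: "good",
};
```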
Insights
Aggregated insights from the fusion service.
Sentiment statistics:
- mean: Average sentiment score
- std: Standard deviation
- trend: Sentiment trend (increasing/decreasing/stable)
Top three dominant emotions (strings)
Engagement pattern: increasing, decreasing, or stable
Curated subset of detected moments (MomentEvent objects)
Profiles for each speaker
Analysis of conversation dynamics
Overall cognitive profile
Transition Event
Affective state transition detection.
Unique transition identifier
Timestamp in milliseconds
Previous affective state
New affective state
Array of utterance IDs that contributed to this transition
Valence/arousal/engagement deltas that triggered the transition
Confidence score (0-1)
Stability indicator
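A TransitionEvent might be typed as follows; the field names are assumptions, only the descriptions come from the list above.

```typescript
// Hypothetical TransitionEvent shape; field names are assumed.
interface TransitionEvent {
  id: string;              // unique transition identifier
  timestamp_ms: number;    // timestamp in milliseconds
  from_state: string;      // previous affective state
  to_state: string;        // new affective state
  utterance_ids: string[]; // utterances that contributed to this transition
  deltas: { valence: number; arousal: number; engagement: number }; // triggering deltas
  confidence: number;      // 0-1
  stable: boolean;         // stability indicator
}

// Minimal conforming example
const transition: TransitionEvent = {
  id: "tr_001",
  timestamp_ms: 45200,
  from_state: "neutral",
  to_state: "engaged",
  utterance_ids: ["utt_007", "utt_008"],
  deltas: { valence: 0.2, arousal: 0.3, engagement: 0.4 },
  confidence: 0.82,
  stable: true,
};
```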
Moment Event
Key moment detection.
Unique moment identifier
Moment category: objection, cta_offered, cta_accepted, cta_rejected, topic_shift
Speaker identifier
Start timestamp in milliseconds
End timestamp in milliseconds
Array of associated utterance IDs
Summary of the moment
Additional labels
Confidence score (0-1)
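The moment categories are enumerated above, so they can be captured as a union type; the interface field names themselves are assumptions.

```typescript
// The category union comes from the documented list; field names are assumed.
type MomentCategory =
  | "objection"
  | "cta_offered"
  | "cta_accepted"
  | "cta_rejected"
  | "topic_shift";

interface MomentEvent {
  id: string;              // unique moment identifier
  category: MomentCategory;
  speaker: string;         // speaker identifier
  start_ms: number;        // start timestamp in milliseconds
  end_ms: number;          // end timestamp in milliseconds
  utterance_ids: string[]; // associated utterance IDs
  summary: string;         // summary of the moment
  labels: string[];        // additional labels
  confidence: number;      // 0-1
}

// Minimal conforming example
const moment: MomentEvent = {
  id: "m_001",
  category: "objection",
  speaker: "S2",
  start_ms: 61000,
  end_ms: 74500,
  utterance_ids: ["utt_012"],
  summary: "Customer pushed back on pricing.",
  labels: ["pricing"],
  confidence: 0.77,
};
```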
Conversation Segmentation
Segmentation of conversation into phases.
Array of ConversationPhase objects
Total number of phases
Explanation of segmentation approach
Narrative arc description
Array of key transitions (TransitionEvent objects)
Conversation Phase
Individual phase within a segmented conversation.
Unique phase identifier
Name of the phase
Start timestamp in seconds
End timestamp in seconds
First utterance ID in this phase
Last utterance ID in this phase
Number of utterances in this phase
Theme of the phase
Summary of the phase
Emotional tone of the phase
Key moments within this phase (MomentEvent objects)
Transition signal that led to this phase (TransitionEvent)
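A ConversationPhase could be sketched as the interface below; the field names are assumptions, and the nested MomentEvent/TransitionEvent types are kept opaque so the sketch stays self-contained.

```typescript
// Hypothetical ConversationPhase shape; field names are assumed.
interface ConversationPhase {
  id: string;                // unique phase identifier
  name: string;              // name of the phase
  start_s: number;           // start timestamp in seconds
  end_s: number;             // end timestamp in seconds
  first_utterance_id: string;
  last_utterance_id: string;
  utterance_count: number;
  theme: string;
  summary: string;
  emotional_tone: string;
  key_moments: object[];     // MomentEvent objects within this phase
  transition: object | null; // TransitionEvent that led to this phase
}

// Minimal conforming example
const phase: ConversationPhase = {
  id: "ph_01",
  name: "Discovery",
  start_s: 0,
  end_s: 180,
  first_utterance_id: "utt_001",
  last_utterance_id: "utt_024",
  utterance_count: 24,
  theme: "Needs assessment",
  summary: "Rep explores the customer's current workflow.",
  emotional_tone: "curious",
  key_moments: [],
  transition: null,
};
```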
Error Response
Standard error response format.
Error message describing what went wrong
Live Session Response
Response from starting a live session.
Unique session identifier
ISO 8601 timestamp when session expires
Current session status
Next expected chunk sequence number (zero-based)
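As a sketch, the live session response might look like this in TypeScript; the field names are assumptions based on the descriptions above.

```typescript
// Hypothetical LiveSessionResponse shape; field names are assumed.
interface LiveSessionResponse {
  session_id: string;     // unique session identifier
  expires_at: string;     // ISO 8601 timestamp when the session expires
  status: string;         // current session status
  next_chunk_seq: number; // next expected chunk sequence number (zero-based)
}

// Minimal conforming example
const session: LiveSessionResponse = {
  session_id: "sess_abc123",
  expires_at: "2024-01-01T12:00:00Z",
  status: "active",
  next_chunk_seq: 0,
};

// ISO 8601 strings parse directly with the Date constructor.
const expiry = new Date(session.expires_at);
```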
Metrics Response
Observability metrics for dashboards.
Event counts by type:
- status: Number of status events
- final_transcript: Number of final transcript events
- emotion: Number of emotion events
- Additional event types as key-value pairs
Average agent latency per utterance type (milliseconds)
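Because the event types are open-ended, the metrics response maps naturally onto string-keyed records; the field names below are assumptions.

```typescript
// Hypothetical MetricsResponse shape; field names are assumed.
// Record<string, number> accommodates additional event types as key-value pairs.
interface MetricsResponse {
  event_counts: Record<string, number>;   // status, final_transcript, emotion, ...
  avg_latency_ms: Record<string, number>; // average agent latency per utterance type
}

// Minimal conforming example
const metrics: MetricsResponse = {
  event_counts: { status: 12, final_transcript: 48, emotion: 48 },
  avg_latency_ms: { final_transcript: 220, emotion: 340 },
};
```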