When to Use
Standard “upload and wait” workflow for platforms that prefer a single JSON payload. Processing time is roughly linear in media length.
Request
file: Audio or video file (mp4, mov, webm, mp3, wav, m4a, aac). Sent as multipart/form-data.
Optional flag: when true, behaves like /analyze/stream (SSE). Default false.
Response
{
"conversation_id": "analysis_f38de2c0c1",
"utterances": 27,
"summary": { ...ConversationSummary... },
"insights": { ... },
"timeline": [ ... full utterance array ... ],
"transitions": [ ... ],
"moments": [ ... ],
"segmentation": { ...ConversationSegmentation... },
"agent_results": [ ... per-utterance agent payloads ... ],
"artifact_dir": "analysis_f38de2c0c1",
"upload_path": "analysis_f38de2c0c1.mp4",
"agent_outputs_path": "analysis_f38de2c0c1/agent_outputs.json"
}
summary matches the ConversationSummary schema (see Response Schemas).
timeline elements mirror the SSE final_transcript + emotion + cognitive payloads (see SSE Events).
agent_results is useful when you want a raw dump for analytics pipelines.
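As a rough illustration of consuming the timeline array, the sketch below flattens each utterance into a compact row. The key names (t_start, t_end, speaker, dominant_emotion, text) are taken from the sample response on this page; the helper name timeline_rows and the trimmed example payload are hypothetical and exist only for this illustration.

```python
from typing import Any


def timeline_rows(response: dict[str, Any]) -> list[tuple[float, float, str, str, str]]:
    """Flatten timeline into (t_start, t_end, speaker, dominant_emotion, text) rows."""
    rows = []
    for utt in response.get("timeline", []):
        rows.append(
            (
                utt["t_start"],
                utt["t_end"],
                utt["speaker"],
                # Fall back to a placeholder in case the key is absent.
                utt.get("dominant_emotion", "unknown"),
                utt["text"],
            )
        )
    return rows


# Trimmed stand-in for a parsed response body (values are illustrative).
example = {
    "conversation_id": "analysis_demo",
    "utterances": 1,
    "timeline": [
        {
            "speaker": "Speaker_0",
            "t_start": 0.0,
            "t_end": 6.56,
            "text": "Hi.",
            "dominant_emotion": "joy",
        }
    ],
}

for t_start, t_end, speaker, emotion, text in timeline_rows(example):
    print(f"[{t_start:6.2f}-{t_end:6.2f}] {speaker} ({emotion}): {text}")
```

Each row keeps only the fields a transcript view typically needs; the full prosody, emotion, and cognitive objects remain available on the original utterance dictionaries.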
Sample Response
Below is a complete example response from a 60-second monologue:
{
"conversation_id": "analysis_93c84ed103",
"utterances": 13,
"summary": {
"overview": "The conversation involves a single speaker, Speaker_0, contributing 13 utterances over a duration of 59.6 seconds, indicating a continuous monologue or solo discussion without interaction from other participants.",
"key_insights": [
"The speaker maintains consistent verbal engagement throughout the entire duration, suggesting a prepared or thought-out message delivery.",
"Lack of multiple speakers indicates this may be a self-directed commentary, internal reflection, or one-sided informational delivery."
],
"notable_patterns": [
"Uninterrupted flow of speech with no overlapping or turn-taking, typical of monologic communication.",
"Average utterance duration suggests moderate pacing, allowing for clarity and emphasis on each statement."
],
"recommendations": [
"If interaction is the goal, consider introducing additional participants or prompts to encourage dialogue.",
"For engagement improvement in solo formats, integrate deliberate pauses or rhetorical questions to simulate interaction."
],
"what_went_well": null,
"what_went_wrong": null,
"speaker_dynamics": null,
"conversation_quality": "good"
},
"insights": {
"overall_sentiment": {
"mean": 0.7238461538461538,
"std": 0.32910328828344204,
"trend": "positive"
},
"dominant_emotions": ["joy", "anger"],
"engagement_pattern": "stable",
"key_moments": [
{
"type": "transition",
"t": 47.761,
"speaker": "Speaker_0",
"description": "State shift from enthusiastic to concern"
},
{
"type": "transition",
"t": 51.681,
"speaker": "Speaker_0",
"description": "State shift from concern to enthusiastic"
}
],
"speaker_profiles": {
"Speaker_0": {
"utterance_count": 13,
"total_duration": 56.2449953,
"emotions": {
"joy": 12,
"anger": 1
},
"avg_engagement": 0.9061538461538461,
"avg_polarity": 0.7238461538461538,
"dominant_emotion": "joy"
}
},
"conversation_dynamics": {
"total_duration": 59.600998,
"utterance_count": 13,
"transition_count": 2,
"avg_utterance_duration": 4.3265381000000005,
"transition_rate": 0.15384615384615385
},
"cognitive_profile": {
"dominant_states": ["engaged", "focused"],
"load_level": "low",
"trend": "stable",
"hotspots": [
{
"t": 0.0,
"type": "focus_peak",
"description": "High focus and engagement",
"score": 0.855
},
{
"t": 6.7999997,
"type": "focus_peak",
"description": "High focus and engagement",
"score": 0.855
},
{
"t": 18.005001,
"type": "focus_peak",
"description": "High focus and engagement",
"score": 0.855
}
],
"tips": [],
"averages": {
"focused": 0.8626783147185669,
"confused": 0.10341147849165101,
"engaged": 0.923565188543625,
"disengaged": 0.09892734711931882,
"thoughtful": 0.3228574588079178,
"uncertain": 0.09946367355965942
}
}
},
"timeline": [
{
"utterance_id": "analysis_93c84ed103_utt_0",
"speaker": "Speaker_0",
"t_start": 0.0,
"t_end": 6.56,
"text": "Hi. My name is Mohn Jurzaku, and I'm currently an applied scientist working on distributed systems at TechCorp.",
"prosody": {
"pitch_hz": 131.95232615690293,
"pitch_std": 49.99163368630343,
"pitch_range": 236.12918648705164,
"energy_rms": 0.033301159739494324,
"energy_std": 0.025332292541861534,
"energy_max": 0.1242227554321289,
"duration": 6.56,
"pause_before_ms": 0,
"spectral_centroid": 1633.089735165607,
"spectral_std": 1197.4031334570159,
"zcr": 0.08323211130742049,
"pause_ratio": 0.0,
"commitment_markers": {
"pitch_stability": 0.6211386707434094,
"pitch_contour": -0.3182847334261473,
"energy_strength": 1.0,
"energy_consistency": 0.2614741678050624,
"duration_confidence": 0.7,
"commitment_score": 0.741000078262439,
"commitment_level": "moderate_commitment"
},
"passion_markers": {
"pitch_dynamism": 1.0,
"pitch_variation": 1.0,
"energy_level": 1.0,
"energy_dynamics": 1.0,
"spectral_richness": 0.4082724337914017,
"spectral_variation": 1.0,
"passion_score": 1.0,
"passion_level": "highly_passionate",
"conviction_score": 0.8446000469574634,
"conviction_level": "strong_conviction"
},
"pitch_z": -0.6153967267807747,
"energy_z": -1.2766557938959506
},
"prosody_desc": "moderately committed, PASSIONATE, strong conviction, assertive tone, emphatic, highly dynamic",
"emotion": {
"utterance_id": "analysis_93c84ed103_utt_0",
"speaker_id": "Speaker_0",
"emotion": {
"joy": 0.65,
"sadness": 0.05,
"anger": 0.1,
"fear": 0.05,
"surprise": 0.2,
"disgust": 0.05,
"neutral": 0.3
},
"scores": {
"valence": 0.7,
"arousal": 0.85,
"dominance": 0.8
},
"confidence": 0.85,
"stability": "final"
},
"emotions": {
"joy": 0.65,
"sadness": 0.05,
"anger": 0.1,
"fear": 0.05,
"surprise": 0.2,
"disgust": 0.05,
"neutral": 0.3
},
"dominant_emotion": "joy",
"sentiment": {
"positive": 0.95,
"negative": 0.02,
"neutral": 0.03
},
"polarity": 0.93,
"cognitive": {
"utterance_id": "analysis_93c84ed103_utt_0",
"signals": {
"focused": 0.9,
"confused": 0.1,
"engaged": 0.95,
"disengaged": 0.05,
"thoughtful": 0.3,
"uncertain": 0.1
},
"engagement": 0.95,
"load_level": "low",
"confidence": 0.9,
"stability": "final"
},
"cognitive_signals": {
"focused": 0.9,
"confused": 0.1,
"engaged": 0.95,
"disengaged": 0.05,
"thoughtful": 0.3,
"uncertain": 0.1
},
"engagement": 0.95,
"load_level": "low",
"pad_scores": {
"valence": 0.7,
"arousal": 0.85,
"dominance": 0.8
},
"commitment_score": 0.74,
"commitment_level": "moderate",
"passion_score": 1.0,
"passion_level": "highly_passionate",
"conviction_score": 0.8446000469574634,
"conviction_level": "strong",
"acoustic": {
"commitment_level": "moderate",
"commitment_score": 0.74,
"passion_level": "highly_passionate",
"passion_score": 1.0,
"conviction_level": "strong",
"acoustic_insights": [
"High passion and conviction are supported by an assertive, emphatic tone and zero disfluencies or pauses, indicating confident self-presentation.",
"Moderate commitment is reflected in a slightly below-average pitch (z=-0.62) but compensated by high dynamic variation and strong vocal presence.",
"Absence of hedging and extremely low pause ratio (0.00) reinforce assertiveness and clear articulation of identity and role."
],
"delivery_style": "assertive and engaging",
"factuality_markers": null,
"hesitation_patterns": [],
"emphasis_points": []
},
"acoustic_insights": [
"High passion and conviction are supported by an assertive, emphatic tone and zero disfluencies or pauses, indicating confident self-presentation.",
"Moderate commitment is reflected in a slightly below-average pitch (z=-0.62) but compensated by high dynamic variation and strong vocal presence.",
"Absence of hedging and extremely low pause ratio (0.00) reinforce assertiveness and clear articulation of identity and role."
],
"delivery_style": "assertive and engaging",
"confidence": 0.9,
"lexical_markers": {
"hedge_density": 0.0,
"disfluency_rate": 0.0,
"is_question": false,
"self_repair": false,
"reasoning": "The provided text contains no hedging expressions or disfluency markers, indicating confident, fluent delivery."
},
"acoustic_reasoning": "The speaker is highly engaged and experiencing joy, conveyed through a passionate and assertive delivery that reflects strong conviction and focus, with no signs of uncertainty or disengagement."
}
],
"transitions": [
{
"transition_id": "tr_5af959b3",
"at_ms": 47761,
"from_state": "enthusiastic",
"to_state": "concern",
"evidence_utterance_ids": [
"analysis_93c84ed103_utt_10",
"analysis_93c84ed103_utt_11"
],
"drivers": {
"valence_drop": 1.2,
"engagement_delta": -0.08
},
"confidence": 0.9,
"stability": "provisional"
},
{
"transition_id": "tr_9e6072b4",
"at_ms": 51681,
"from_state": "concern",
"to_state": "enthusiastic",
"evidence_utterance_ids": [
"analysis_93c84ed103_utt_11",
"analysis_93c84ed103_utt_13"
],
"drivers": {
"valence_rise": 1.1,
"arousal_drop": 0.1,
"engagement_delta": 0.05
},
"confidence": 0.9,
"stability": "provisional"
}
],
"moments": [],
"segmentation": {
"phases": [
{
"phase_id": "phase_1",
"phase_name": "Introduction & Background",
"start_timestamp": 0.0,
"end_timestamp": 17.605,
"start_utterance_id": "analysis_93c84ed103_utt_0",
"end_utterance_id": "analysis_93c84ed103_utt_2",
"utterance_count": 3,
"theme": "Personal and professional introduction",
"summary": "Speaker_0 introduces themselves as an applied scientist at TechCorp working on distributed systems. They share their academic background, including completing a PhD in computer science at State University, where their research focused on modeling belief and cognitive states in dialogue through speech and text.",
"emotional_tone": "joyful, enthusiastic, confident",
"key_moments": [
"Self-introduction with name and current role",
"Mention of PhD completion and institution",
"Explanation of doctoral research focus"
],
"transition_signal": "Shift from personal background to current work"
},
{
"phase_id": "phase_2",
"phase_name": "Presentation of Elocution AI Vision",
"start_timestamp": 18.005001,
"end_timestamp": 27.845001,
"start_utterance_id": "analysis_93c84ed103_utt_3",
"end_utterance_id": "analysis_93c84ed103_utt_5",
"utterance_count": 3,
"theme": "Introducing the core project: Elocution AI",
"summary": "The speaker introduces 'Elocution AI', explaining its purpose as an API for cognitive-aware speech understanding. This marks a shift from past experience to current innovation, emphasizing deeper comprehension of human speech beyond transcription.",
"emotional_tone": "joyful, passionate, visionary",
"key_moments": [
"Announcement of building Elocution AI",
"Definition of Elocution AI's main purpose",
"Emphasis on 'cognitive aware' understanding"
],
"transition_signal": "Introduction of a new technological initiative after academic and professional background"
},
{
"phase_id": "phase_3",
"phase_name": "Critique of Current Speech Technologies",
"start_timestamp": 28.61,
"end_timestamp": 47.165,
"start_utterance_id": "analysis_93c84ed103_utt_7",
"end_utterance_id": "analysis_93c84ed103_utt_10",
"utterance_count": 3,
"theme": "Limitations of existing speech services",
"summary": "Speaker_0 critiques current speech technologies that provide only transcripts, keywords, and summaries, lacking human-aware vocal cues like tone, volume, and prosody. They emphasize that these cues are essential for revealing cognitive states and improving conversational understanding.",
"emotional_tone": "joyful but increasingly emphatic and critical",
"key_moments": [
"Pointing out the shallow nature of existing speech services",
"Listing missing vocal cues (tone, volume, prosody)",
"Connecting vocal cues to inference of cognitive states"
],
"transition_signal": "Contrast between current limitations and the need for better systems"
},
{
"phase_id": "phase_4",
"phase_name": "Application & Impact of Elocution AI",
"start_timestamp": 47.761,
"end_timestamp": 59.600998,
"start_utterance_id": "analysis_93c84ed103_utt_11",
"end_utterance_id": "analysis_93c84ed103_utt_15",
"utterance_count": 4,
"theme": "Real-world use cases and value proposition",
"summary": "The speaker identifies pain points in behavioral interviews for hiring managers, and extends the relevance to consultants and sales teams. They position 'illocution' (likely a variation or feature of Elocution AI) as a solution that enables broader, human-aware cognitive understanding. The segment concludes with a brief thank you, signaling closure.",
"emotional_tone": "assertive, purposeful, concluding",
"key_moments": [
"Identification of hiring managers' pain points",
"Expansion to consultants and sales teams",
"Positioning of illocution as an enabling technology",
"Final expression of gratitude"
],
"transition_signal": "Shift to practical applications and stakeholder benefits after establishing technical and cognitive foundations"
}
],
"total_phases": 4,
"segmentation_rationale": "The conversation naturally flows through four coherent stages: (1) establishing credibility through personal and academic background, (2) introducing the core innovation (Elocution AI), (3) critiquing existing solutions to highlight the problem space, and (4) articulating real-world applications and concluding. Each phase contains at least three utterances and represents a thematic progression from identity to vision, problem, and solution.",
"narrative_arc": "The narrative follows a classic innovation storytelling arc: 'Who I am' → 'What I'm building' → 'Why it's needed' → 'Who it helps'. This structure builds credibility, defines a technological gap, and positions the solution as impactful and necessary.",
"key_transitions": [
{
"timestamp": 18.0,
"from_phase": "phase_1",
"to_phase": "phase_2",
"trigger": "Shift from past academic and professional background to current project development"
},
{
"timestamp": 28.6,
"from_phase": "phase_2",
"to_phase": "phase_3",
"trigger": "Contrast introduced between Elocution AI's capabilities and the limitations of current speech services"
},
{
"timestamp": 47.8,
"from_phase": "phase_3",
"to_phase": "phase_4",
"trigger": "Move from technical explanation to real-world application and stakeholder impact"
}
]
},
"agent_results": [
{
"utterance_id": "analysis_93c84ed103_utt_0",
"emotion": {
"emotions": {
"joy": 0.65,
"sadness": 0.05,
"anger": 0.1,
"fear": 0.05,
"surprise": 0.2,
"disgust": 0.05,
"neutral": 0.3
},
"dominant_emotion": "joy",
"confidence": 0.85,
"reasoning": "The speaker uses a passionate, assertive, and highly dynamic tone with strong conviction, which is commonly associated with positive engagement and enthusiasm.",
"pad_scores": {
"valence": 0.7,
"arousal": 0.85,
"dominance": 0.8
}
},
"sentiment": {
"sentiment": {
"positive": 0.95,
"negative": 0.02,
"neutral": 0.03
},
"polarity": 0.93,
"subjectivity": 0.85,
"confidence": 0.98,
"key_phrases": [
"Hi",
"My name is Mohn Jurzaku",
"I'm currently an applied scientist",
"working on distributed systems at TechCorp"
],
"tone": "confident, enthusiastic, professional, assertive"
},
"cognitive": {
"cognitive": {
"focused": 0.9,
"confused": 0.1,
"engaged": 0.95,
"disengaged": 0.05,
"thoughtful": 0.3,
"uncertain": 0.1
},
"cognitive_load": "low",
"engagement_level": 0.95,
"confidence": 0.9,
"indicators": [
"speech_rate_wpm=210 (↑)",
"prosody_desc=PASSIONATE, assertive tone, emphatic, highly dynamic",
"latency_before_speech=0s (immediate initiation)",
"no disfluencies or self-repairs",
"declarative structure with strong conviction",
"no hedges or epistemic modals",
"energized prosody"
]
},
"acoustic": {
"commitment_level": "moderate",
"commitment_score": 0.74,
"passion_level": "highly_passionate",
"passion_score": 1.0,
"conviction_level": "strong",
"acoustic_insights": [
"High passion and conviction are supported by an assertive, emphatic tone and zero disfluencies or pauses, indicating confident self-presentation.",
"Moderate commitment is reflected in a slightly below-average pitch (z=-0.62) but compensated by high dynamic variation and strong vocal presence.",
"Absence of hedging and extremely low pause ratio (0.00) reinforce assertiveness and clear articulation of identity and role."
],
"delivery_style": "assertive and engaging"
}
}
],
"artifact_dir": "data/analysis/analysis_93c84ed103",
"upload_path": "uploads/analysis_93c84ed103.mp4",
"agent_outputs_path": "data/analysis/analysis_93c84ed103/agent_outputs.json"
}
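One detail visible in the sample response is that transitions report time as at_ms (milliseconds), while insights.key_moments and segmentation.phases use seconds. The following sketch, which assumes those units hold for every response, converts transitions to seconds and looks up the phase covering a given timestamp. The helper names transition_seconds and phase_at are hypothetical, and the example data is a trimmed copy of values from the sample.

```python
from typing import Any, Optional


def transition_seconds(response: dict[str, Any]) -> list[tuple[float, str, str]]:
    """Return (seconds, from_state, to_state) for each entry in transitions."""
    return [
        (tr["at_ms"] / 1000.0, tr["from_state"], tr["to_state"])
        for tr in response.get("transitions", [])
    ]


def phase_at(response: dict[str, Any], t: float) -> Optional[str]:
    """Return the phase_name whose time span contains t (seconds), or None.

    Phases in the sample leave small gaps between them, so a timestamp
    that falls in a gap yields None.
    """
    for phase in response.get("segmentation", {}).get("phases", []):
        if phase["start_timestamp"] <= t <= phase["end_timestamp"]:
            return phase["phase_name"]
    return None


# Trimmed copy of values from the sample response.
example = {
    "transitions": [
        {"at_ms": 47761, "from_state": "enthusiastic", "to_state": "concern"},
        {"at_ms": 51681, "from_state": "concern", "to_state": "enthusiastic"},
    ],
    "segmentation": {
        "phases": [
            {
                "phase_name": "Introduction & Background",
                "start_timestamp": 0.0,
                "end_timestamp": 17.605,
            },
            {
                "phase_name": "Application & Impact of Elocution AI",
                "start_timestamp": 47.761,
                "end_timestamp": 59.600998,
            },
        ]
    },
}

for seconds, src, dst in transition_seconds(example):
    print(f"{seconds:.3f}s: {src} -> {dst} during {phase_at(example, seconds)}")
```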
The sample response above shows the complete response structure. In practice, the timeline array contains all utterances with full prosody, emotion, cognitive, and acoustic analysis. The agent_results array provides detailed per-utterance agent outputs for advanced analytics. Only the first timeline entry and first agent result are shown here for brevity.
Sample cURL
curl -X POST https://api.illocution.ai/analyze \
-H "X-API-Key: $ILLOCUTION_KEY" \
-F "file=@/path/to/call.mp4"
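For clients that cannot shell out to cURL, the same request can be approximated with Python's standard library. This is a minimal sketch, not an official client: the URL, the X-API-Key header, and the file field name come from the cURL example above, while the helper names build_analyze_request and analyze, and the application/octet-stream part type, are assumptions made for illustration.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.illocution.ai/analyze"  # taken from the cURL example


def build_analyze_request(filename: str, data: bytes, api_key: str) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying a single `file` part."""
    boundary = uuid.uuid4().hex
    body = b"".join(
        [
            f"--{boundary}\r\n".encode(),
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
            # Assumption: a generic part type is accepted by the server.
            b"Content-Type: application/octet-stream\r\n\r\n",
            data,
            f"\r\n--{boundary}--\r\n".encode(),
        ]
    )
    req = urllib.request.Request(API_URL, data=body, method="POST")
    req.add_header("X-API-Key", api_key)
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req


def analyze(path: str, api_key: str) -> dict:
    """Upload a media file and return the parsed JSON response."""
    with open(path, "rb") as fh:
        req = build_analyze_request(os.path.basename(path), fh.read(), api_key)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call would look like analyze("/path/to/call.mp4", os.environ["ILLOCUTION_KEY"]). The helper reads the whole file into memory, which keeps the sketch short but may be unsuitable for large uploads.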
Common Errors
| Status | Code | Description |
|---|---|---|
| 400 | NO_UTTERANCES | Media contained no speech above the threshold. |
| 400 | FILE_TOO_LARGE | Upload exceeded the MAX_FILE_SIZE_MB limit. |
| 401 | INVALID_KEY | Missing or incorrect API key. |
| 500 | ANALYSIS_FAILURE | Downstream agent or ASR failure (details in the detail field). |
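A client might map these errors to follow-up actions along the lines sketched below. The status and code pairs come from the table; everything else is an assumption for illustration: that the error body is JSON with code and detail keys, the helper name suggested_action, and the action labels themselves, which are suggestions rather than documented API behavior.

```python
from typing import Any

# (status, code) pairs from the Common Errors table, mapped to a suggested
# client action. The actions are this sketch's suggestions only.
ACTIONS = {
    (400, "NO_UTTERANCES"): "skip",     # no speech detected; a retry will not help
    (400, "FILE_TOO_LARGE"): "shrink",  # trim or compress the media, then resubmit
    (401, "INVALID_KEY"): "fix_key",    # check the X-API-Key header value
    (500, "ANALYSIS_FAILURE"): "retry", # may be transient; retry with backoff
}


def suggested_action(status: int, body: dict[str, Any]) -> str:
    """Map an error response to a suggested action.

    Assumes the error body is JSON with `code` and `detail` keys.
    Unknown combinations fall through to "raise".
    """
    return ACTIONS.get((status, body.get("code")), "raise")


print(suggested_action(500, {"code": "ANALYSIS_FAILURE", "detail": "ASR timeout"}))
print(suggested_action(400, {"code": "NO_UTTERANCES", "detail": "no speech"}))
```

Keying on both status and code avoids treating the two 400 errors alike, since one calls for skipping the file and the other for resubmitting a smaller one.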