When to Use

Use the standard “upload and wait” workflow when your platform prefers a single JSON payload over streaming. Processing time is roughly linear in media length.

Request

file (binary, required): Audio or video file (mp4, mov, webm, mp3, wav, m4a, aac). Sent as multipart/form-data.
stream (boolean, optional): When true, behaves like /analyze/stream (SSE). Default false.

Response

{
  "conversation_id": "analysis_f38de2c0c1",
  "utterances": 27,
  "summary": { ...ConversationSummary... },
  "insights": { ... },
  "timeline": [ ... full utterance array ... ],
  "transitions": [ ... ],
  "moments": [ ... ],
  "segmentation": { ...ConversationSegmentation... },
  "agent_results": [ ... per-utterance agent payloads ... ],
  "artifact_dir": "analysis_f38de2c0c1",
  "upload_path": "analysis_f38de2c0c1.mp4",
  "agent_outputs_path": "analysis_f38de2c0c1/agent_outputs.json"
}
Key objects:
  • summary matches the ConversationSummary schema (see Response Schemas).
  • timeline elements mirror the SSE final_transcript + emotion + cognitive payloads (see SSE Events).
  • agent_results is useful when you want a raw dump for analytics pipelines.
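For quick triage you rarely need the full payload. A small illustrative helper (the function name is ours; the field names come from the response shape above) that pulls the headline numbers out of a parsed response dict:

```python
def key_metrics(response: dict) -> dict:
    """Extract headline metrics from a parsed /analyze response."""
    insights = response.get("insights", {})
    sentiment = insights.get("overall_sentiment", {})
    return {
        "conversation_id": response["conversation_id"],
        "utterances": response["utterances"],
        "mean_sentiment": sentiment.get("mean"),
        "sentiment_trend": sentiment.get("trend"),
        "dominant_emotions": insights.get("dominant_emotions", []),
        "quality": response.get("summary", {}).get("conversation_quality"),
    }
```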

Sample Response

Below is a complete example response from a 60-second monologue:
{
  "conversation_id": "analysis_93c84ed103",
  "utterances": 13,
  "summary": {
    "overview": "The conversation involves a single speaker, Speaker_0, contributing 13 utterances over a duration of 59.6 seconds, indicating a continuous monologue or solo discussion without interaction from other participants.",
    "key_insights": [
      "The speaker maintains consistent verbal engagement throughout the entire duration, suggesting a prepared or thought-out message delivery.",
      "Lack of multiple speakers indicates this may be a self-directed commentary, internal reflection, or one-sided informational delivery."
    ],
    "notable_patterns": [
      "Uninterrupted flow of speech with no overlapping or turn-taking, typical of monologic communication.",
      "Average utterance duration suggests moderate pacing, allowing for clarity and emphasis on each statement."
    ],
    "recommendations": [
      "If interaction is the goal, consider introducing additional participants or prompts to encourage dialogue.",
      "For engagement improvement in solo formats, integrate deliberate pauses or rhetorical questions to simulate interaction."
    ],
    "what_went_well": null,
    "what_went_wrong": null,
    "speaker_dynamics": null,
    "conversation_quality": "good"
  },
  "insights": {
    "overall_sentiment": {
      "mean": 0.7238461538461538,
      "std": 0.32910328828344204,
      "trend": "positive"
    },
    "dominant_emotions": ["joy", "anger"],
    "engagement_pattern": "stable",
    "key_moments": [
      {
        "type": "transition",
        "t": 47.761,
        "speaker": "Speaker_0",
        "description": "State shift from enthusiastic to concern"
      },
      {
        "type": "transition",
        "t": 51.681,
        "speaker": "Speaker_0",
        "description": "State shift from concern to enthusiastic"
      }
    ],
    "speaker_profiles": {
      "Speaker_0": {
        "utterance_count": 13,
        "total_duration": 56.2449953,
        "emotions": {
          "joy": 12,
          "anger": 1
        },
        "avg_engagement": 0.9061538461538461,
        "avg_polarity": 0.7238461538461538,
        "dominant_emotion": "joy"
      }
    },
    "conversation_dynamics": {
      "total_duration": 59.600998,
      "utterance_count": 13,
      "transition_count": 2,
      "avg_utterance_duration": 4.3265381000000005,
      "transition_rate": 0.15384615384615385
    },
    "cognitive_profile": {
      "dominant_states": ["engaged", "focused"],
      "load_level": "low",
      "trend": "stable",
      "hotspots": [
        {
          "t": 0.0,
          "type": "focus_peak",
          "description": "High focus and engagement",
          "score": 0.855
        },
        {
          "t": 6.7999997,
          "type": "focus_peak",
          "description": "High focus and engagement",
          "score": 0.855
        },
        {
          "t": 18.005001,
          "type": "focus_peak",
          "description": "High focus and engagement",
          "score": 0.855
        }
      ],
      "tips": [],
      "averages": {
        "focused": 0.8626783147185669,
        "confused": 0.10341147849165101,
        "engaged": 0.923565188543625,
        "disengaged": 0.09892734711931882,
        "thoughtful": 0.3228574588079178,
        "uncertain": 0.09946367355965942
      }
    }
  },
  "timeline": [
    {
      "utterance_id": "analysis_93c84ed103_utt_0",
      "speaker": "Speaker_0",
      "t_start": 0.0,
      "t_end": 6.56,
      "text": "Hi. My name is Mohn Jurzaku, and I'm currently an applied scientist working on distributed systems at TechCorp.",
      "prosody": {
        "pitch_hz": 131.95232615690293,
        "pitch_std": 49.99163368630343,
        "pitch_range": 236.12918648705164,
        "energy_rms": 0.033301159739494324,
        "energy_std": 0.025332292541861534,
        "energy_max": 0.1242227554321289,
        "duration": 6.56,
        "pause_before_ms": 0,
        "spectral_centroid": 1633.089735165607,
        "spectral_std": 1197.4031334570159,
        "zcr": 0.08323211130742049,
        "pause_ratio": 0.0,
        "commitment_markers": {
          "pitch_stability": 0.6211386707434094,
          "pitch_contour": -0.3182847334261473,
          "energy_strength": 1.0,
          "energy_consistency": 0.2614741678050624,
          "duration_confidence": 0.7,
          "commitment_score": 0.741000078262439,
          "commitment_level": "moderate_commitment"
        },
        "passion_markers": {
          "pitch_dynamism": 1.0,
          "pitch_variation": 1.0,
          "energy_level": 1.0,
          "energy_dynamics": 1.0,
          "spectral_richness": 0.4082724337914017,
          "spectral_variation": 1.0,
          "passion_score": 1.0,
          "passion_level": "highly_passionate",
          "conviction_score": 0.8446000469574634,
          "conviction_level": "strong_conviction"
        },
        "pitch_z": -0.6153967267807747,
        "energy_z": -1.2766557938959506
      },
      "prosody_desc": "moderately committed, PASSIONATE, strong conviction, assertive tone, emphatic, highly dynamic",
      "emotion": {
        "utterance_id": "analysis_93c84ed103_utt_0",
        "speaker_id": "Speaker_0",
        "emotion": {
          "joy": 0.65,
          "sadness": 0.05,
          "anger": 0.1,
          "fear": 0.05,
          "surprise": 0.2,
          "disgust": 0.05,
          "neutral": 0.3
        },
        "scores": {
          "valence": 0.7,
          "arousal": 0.85,
          "dominance": 0.8
        },
        "confidence": 0.85,
        "stability": "final"
      },
      "emotions": {
        "joy": 0.65,
        "sadness": 0.05,
        "anger": 0.1,
        "fear": 0.05,
        "surprise": 0.2,
        "disgust": 0.05,
        "neutral": 0.3
      },
      "dominant_emotion": "joy",
      "sentiment": {
        "positive": 0.95,
        "negative": 0.02,
        "neutral": 0.03
      },
      "polarity": 0.93,
      "cognitive": {
        "utterance_id": "analysis_93c84ed103_utt_0",
        "signals": {
          "focused": 0.9,
          "confused": 0.1,
          "engaged": 0.95,
          "disengaged": 0.05,
          "thoughtful": 0.3,
          "uncertain": 0.1
        },
        "engagement": 0.95,
        "load_level": "low",
        "confidence": 0.9,
        "stability": "final"
      },
      "cognitive_signals": {
        "focused": 0.9,
        "confused": 0.1,
        "engaged": 0.95,
        "disengaged": 0.05,
        "thoughtful": 0.3,
        "uncertain": 0.1
      },
      "engagement": 0.95,
      "load_level": "low",
      "pad_scores": {
        "valence": 0.7,
        "arousal": 0.85,
        "dominance": 0.8
      },
      "commitment_score": 0.74,
      "commitment_level": "moderate",
      "passion_score": 1.0,
      "passion_level": "highly_passionate",
      "conviction_score": 0.8446000469574634,
      "conviction_level": "strong",
      "acoustic": {
        "commitment_level": "moderate",
        "commitment_score": 0.74,
        "passion_level": "highly_passionate",
        "passion_score": 1.0,
        "conviction_level": "strong",
        "acoustic_insights": [
          "High passion and conviction are supported by an assertive, emphatic tone and zero disfluencies or pauses, indicating confident self-presentation.",
          "Moderate commitment is reflected in a slightly below-average pitch (z=-0.62) but compensated by high dynamic variation and strong vocal presence.",
          "Absence of hedging and extremely low pause ratio (0.00) reinforce assertiveness and clear articulation of identity and role."
        ],
        "delivery_style": "assertive and engaging",
        "factuality_markers": null,
        "hesitation_patterns": [],
        "emphasis_points": []
      },
      "acoustic_insights": [
        "High passion and conviction are supported by an assertive, emphatic tone and zero disfluencies or pauses, indicating confident self-presentation.",
        "Moderate commitment is reflected in a slightly below-average pitch (z=-0.62) but compensated by high dynamic variation and strong vocal presence.",
        "Absence of hedging and extremely low pause ratio (0.00) reinforce assertiveness and clear articulation of identity and role."
      ],
      "delivery_style": "assertive and engaging",
      "confidence": 0.9,
      "lexical_markers": {
        "hedge_density": 0.0,
        "disfluency_rate": 0.0,
        "is_question": false,
        "self_repair": false,
        "reasoning": "The provided text contains no hedging expressions or disfluency markers, indicating confident, fluent delivery."
      },
      "acoustic_reasoning": "The speaker is highly engaged and experiencing joy, conveyed through a passionate and assertive delivery that reflects strong conviction and focus, with no signs of uncertainty or disengagement."
    }
  ],
  "transitions": [
    {
      "transition_id": "tr_5af959b3",
      "at_ms": 47761,
      "from_state": "enthusiastic",
      "to_state": "concern",
      "evidence_utterance_ids": [
        "analysis_93c84ed103_utt_10",
        "analysis_93c84ed103_utt_11"
      ],
      "drivers": {
        "valence_drop": 1.2,
        "engagement_delta": -0.08
      },
      "confidence": 0.9,
      "stability": "provisional"
    },
    {
      "transition_id": "tr_9e6072b4",
      "at_ms": 51681,
      "from_state": "concern",
      "to_state": "enthusiastic",
      "evidence_utterance_ids": [
        "analysis_93c84ed103_utt_11",
        "analysis_93c84ed103_utt_13"
      ],
      "drivers": {
        "valence_rise": 1.1,
        "arousal_drop": 0.1,
        "engagement_delta": 0.05
      },
      "confidence": 0.9,
      "stability": "provisional"
    }
  ],
  "moments": [],
  "segmentation": {
    "phases": [
      {
        "phase_id": "phase_1",
        "phase_name": "Introduction & Background",
        "start_timestamp": 0.0,
        "end_timestamp": 17.605,
        "start_utterance_id": "analysis_93c84ed103_utt_0",
        "end_utterance_id": "analysis_93c84ed103_utt_2",
        "utterance_count": 3,
        "theme": "Personal and professional introduction",
        "summary": "Speaker_0 introduces themselves as an applied scientist at TechCorp working on distributed systems. They share their academic background, including completing a PhD in computer science at State University, where their research focused on modeling belief and cognitive states in dialogue through speech and text.",
        "emotional_tone": "joyful, enthusiastic, confident",
        "key_moments": [
          "Self-introduction with name and current role",
          "Mention of PhD completion and institution",
          "Explanation of doctoral research focus"
        ],
        "transition_signal": "Shift from personal background to current work"
      },
      {
        "phase_id": "phase_2",
        "phase_name": "Presentation of Elocution AI Vision",
        "start_timestamp": 18.005001,
        "end_timestamp": 27.845001,
        "start_utterance_id": "analysis_93c84ed103_utt_3",
        "end_utterance_id": "analysis_93c84ed103_utt_5",
        "utterance_count": 3,
        "theme": "Introducing the core project: Elocution AI",
        "summary": "The speaker introduces 'Elocution AI', explaining its purpose as an API for cognitive-aware speech understanding. This marks a shift from past experience to current innovation, emphasizing deeper comprehension of human speech beyond transcription.",
        "emotional_tone": "joyful, passionate, visionary",
        "key_moments": [
          "Announcement of building Elocution AI",
          "Definition of Elocution AI's main purpose",
          "Emphasis on 'cognitive aware' understanding"
        ],
        "transition_signal": "Introduction of a new technological initiative after academic and professional background"
      },
      {
        "phase_id": "phase_3",
        "phase_name": "Critique of Current Speech Technologies",
        "start_timestamp": 28.61,
        "end_timestamp": 47.165,
        "start_utterance_id": "analysis_93c84ed103_utt_7",
        "end_utterance_id": "analysis_93c84ed103_utt_10",
        "utterance_count": 3,
        "theme": "Limitations of existing speech services",
        "summary": "Speaker_0 critiques current speech technologies that provide only transcripts, keywords, and summaries, lacking human-aware vocal cues like tone, volume, and prosody. They emphasize that these cues are essential for revealing cognitive states and improving conversational understanding.",
        "emotional_tone": "joyful but increasingly emphatic and critical",
        "key_moments": [
          "Pointing out the shallow nature of existing speech services",
          "Listing missing vocal cues (tone, volume, prosody)",
          "Connecting vocal cues to inference of cognitive states"
        ],
        "transition_signal": "Contrast between current limitations and the need for better systems"
      },
      {
        "phase_id": "phase_4",
        "phase_name": "Application & Impact of Elocution AI",
        "start_timestamp": 47.761,
        "end_timestamp": 59.600998,
        "start_utterance_id": "analysis_93c84ed103_utt_11",
        "end_utterance_id": "analysis_93c84ed103_utt_15",
        "utterance_count": 4,
        "theme": "Real-world use cases and value proposition",
        "summary": "The speaker identifies pain points in behavioral interviews for hiring managers, and extends the relevance to consultants and sales teams. They position 'illocution' (likely a variation or feature of Elocution AI) as a solution that enables broader, human-aware cognitive understanding. The segment concludes with a brief thank you, signaling closure.",
        "emotional_tone": "assertive, purposeful, concluding",
        "key_moments": [
          "Identification of hiring managers' pain points",
          "Expansion to consultants and sales teams",
          "Positioning of illocution as an enabling technology",
          "Final expression of gratitude"
        ],
        "transition_signal": "Shift to practical applications and stakeholder benefits after establishing technical and cognitive foundations"
      }
    ],
    "total_phases": 4,
    "segmentation_rationale": "The conversation naturally flows through four coherent stages: (1) establishing credibility through personal and academic background, (2) introducing the core innovation (Elocution AI), (3) critiquing existing solutions to highlight the problem space, and (4) articulating real-world applications and concluding. Each phase contains at least three utterances and represents a thematic progression from identity to vision, problem, and solution.",
    "narrative_arc": "The narrative follows a classic innovation storytelling arc: 'Who I am' → 'What I'm building' → 'Why it's needed' → 'Who it helps'. This structure builds credibility, defines a technological gap, and positions the solution as impactful and necessary.",
    "key_transitions": [
      {
        "timestamp": 18.0,
        "from_phase": "phase_1",
        "to_phase": "phase_2",
        "trigger": "Shift from past academic and professional background to current project development"
      },
      {
        "timestamp": 28.6,
        "from_phase": "phase_2",
        "to_phase": "phase_3",
        "trigger": "Contrast introduced between Elocution AI's capabilities and the limitations of current speech services"
      },
      {
        "timestamp": 47.8,
        "from_phase": "phase_3",
        "to_phase": "phase_4",
        "trigger": "Move from technical explanation to real-world application and stakeholder impact"
      }
    ]
  },
  "agent_results": [
    {
      "utterance_id": "analysis_93c84ed103_utt_0",
      "emotion": {
        "emotions": {
          "joy": 0.65,
          "sadness": 0.05,
          "anger": 0.1,
          "fear": 0.05,
          "surprise": 0.2,
          "disgust": 0.05,
          "neutral": 0.3
        },
        "dominant_emotion": "joy",
        "confidence": 0.85,
        "reasoning": "The speaker uses a passionate, assertive, and highly dynamic tone with strong conviction, which is commonly associated with positive engagement and enthusiasm.",
        "pad_scores": {
          "valence": 0.7,
          "arousal": 0.85,
          "dominance": 0.8
        }
      },
      "sentiment": {
        "sentiment": {
          "positive": 0.95,
          "negative": 0.02,
          "neutral": 0.03
        },
        "polarity": 0.93,
        "subjectivity": 0.85,
        "confidence": 0.98,
        "key_phrases": [
          "Hi",
          "My name is Mohn Jurzaku",
          "I'm currently an applied scientist",
          "working on distributed systems at TechCorp"
        ],
        "tone": "confident, enthusiastic, professional, assertive"
      },
      "cognitive": {
        "cognitive": {
          "focused": 0.9,
          "confused": 0.1,
          "engaged": 0.95,
          "disengaged": 0.05,
          "thoughtful": 0.3,
          "uncertain": 0.1
        },
        "cognitive_load": "low",
        "engagement_level": 0.95,
        "confidence": 0.9,
        "indicators": [
          "speech_rate_wpm=210 (↑)",
          "prosody_desc=PASSIONATE, assertive tone, emphatic, highly dynamic",
          "latency_before_speech=0s (immediate initiation)",
          "no disfluencies or self-repairs",
          "declarative structure with strong conviction",
          "no hedges or epistemic modals",
          "energized prosody"
        ]
      },
      "acoustic": {
        "commitment_level": "moderate",
        "commitment_score": 0.74,
        "passion_level": "highly_passionate",
        "passion_score": 1.0,
        "conviction_level": "strong",
        "acoustic_insights": [
          "High passion and conviction are supported by an assertive, emphatic tone and zero disfluencies or pauses, indicating confident self-presentation.",
          "Moderate commitment is reflected in a slightly below-average pitch (z=-0.62) but compensated by high dynamic variation and strong vocal presence.",
          "Absence of hedging and extremely low pause ratio (0.00) reinforce assertiveness and clear articulation of identity and role."
        ],
        "delivery_style": "assertive and engaging"
      }
    }
  ],
  "artifact_dir": "data/analysis/analysis_93c84ed103",
  "upload_path": "uploads/analysis_93c84ed103.mp4",
  "agent_outputs_path": "data/analysis/analysis_93c84ed103/agent_outputs.json"
}
The sample above shows a complete response structure. In practice, the timeline array contains all utterances with full prosody, emotion, cognitive, and acoustic analysis. The agent_results array provides detailed per-utterance agent outputs for advanced analytics. Only the first timeline entry and first agent result are shown here for brevity.

Sample cURL

curl -X POST https://api.illocution.ai/analyze \
  -H "X-API-Key: $ILLOCUTION_KEY" \
  -F "file=@/path/to/call.mp4"

Common Errors

Status | Code             | Description
400    | NO_UTTERANCES    | Media contained no speech above the threshold.
400    | FILE_TOO_LARGE   | Exceeded MAX_FILE_SIZE_MB.
401    | INVALID_KEY      | Missing or incorrect API key.
500    | ANALYSIS_FAILURE | Downstream agent or ASR failure (details in the detail field).
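A sketch of how a client might react to these codes. The dispatch function and action names are illustrative assumptions; only the status/code pairs come from the table above:

```python
def next_action(status: int, code: str) -> str:
    """Map a /analyze error to a coarse client-side action."""
    if status == 401:                   # INVALID_KEY: no point retrying
        return "fix_credentials"
    if code == "FILE_TOO_LARGE":        # 400: re-encode or trim the media first
        return "shrink_file"
    if code == "NO_UTTERANCES":         # 400: nothing to analyze in this file
        return "skip"
    if code == "ANALYSIS_FAILURE":      # 500: downstream failure, may be transient
        return "retry"
    return "raise"                      # anything unrecognized: surface to caller
```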