Feature Overview

The emotional classification model used in our APIs is optimized for North American English conversational data. The API ships with a baseline model of four basic emotions: angry, happy, neutral, and sad. Our other model offerings include different subsets of the following emotions: happy, sad, angry, neutral, surprised, disgusted, nervous, irritated, excited, and sleepy.
Coming soon – The API will include a model choice parameter, allowing users to choose between models of 4, 5, and 7 emotions.
The number of emotions, emotional buckets, and language support can be customized. If you are interested in a custom model, please contact us.

Choosing an API

While our APIs share the same model offerings on the backend, each is best suited to different purposes.
|  | DiscreteAPI | AsynchAPI |
| --- | --- | --- |
| Inputs | A short audio file, 4-10 s in length. | A long audio file, at least 5 s in length. Inputs can be up to 1 GB. |
| Outputs | A JSON that includes the primary emotion detected in the file, along with its confidence. The confidence scores of all other emotions in the model are also returned. | A time-stamped JSON that includes the classified emotion and its confidence at a rate of 1 classification per 5 seconds of audio. |
| Response time | 100-500 ms | Dependent on file size |
| Accessible outside SDK | ✅ Yes | ❌ No |
The DiscreteAPI is built for real-time analysis of emotions in audio data. Small snippets of audio are sent to the API, which returns real-time feedback on the emotions detected from tone of voice. The API operates on an approximate per-sentence basis, so audio must be cut to the appropriate size before it is sent; a naive example of this is sketched below.
The AsynchAPI is built for emotion analysis of pre-recorded audio files. Files up to 1 GB in size can be sent to the API to receive a summary of the emotions detected throughout the file. Like the DiscreteAPI, the AsynchAPI operates on an approximate per-sentence basis, but it also provides timestamps that show how emotions change over time.
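For illustration only, here is one naive way to pre-cut a longer recording into DiscreteAPI-sized snippets. This sketch uses the third-party pydub library and fixed-width slices; real clients should cut on approximate sentence boundaries instead.

# Naive pre-cutting for the DiscreteAPI: fixed 8-second slices.
# pydub is a third-party library used here purely to illustrate
# the 4-10 s size constraint, not true sentence segmentation.
from pydub import AudioSegment

audio = AudioSegment.from_wav("recording.wav").set_channels(1)  # mono
SNIPPET_MS = 8_000  # 8 s, within the 4-10 s window the API expects

# Slice by milliseconds; note the final slice may fall under 4 s
# and should be merged or dropped in a real pipeline.
snippets = [audio[i:i + SNIPPET_MS] for i in range(0, len(audio), SNIPPET_MS)]
for n, snippet in enumerate(snippets):
    snippet.export(f"snippet_{n:03d}.wav", format="wav")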
Coming soon – StreamingAPI via WebSockets for real-time analysis of an audio stream.

Ideal Inputs

The APIs expect mono audio in the .wav format. An ideal audio file is recorded at 44.1 kHz (44,100 Hz), though sampling rates as low as 8 kHz can still be used with high accuracy. For custom use cases, microphone specifications can be tailored to the audio environment, including optimizations for mono/stereo audio, single-microphone applications, noisy environments, and more. For the DiscreteAPI, there are two input data formatting options:
  • raw audio file [multipart/form-data]
  • processed audio file [application/json]
Requests sent using the raw audio file option have lower latency. JSON input can only be sent from Python environments, as it has specific data processing requirements; these are outlined in Quickstart and the API Reference. For the AsynchAPI, the only input option is an audio file in the .wav format.
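For example, a minimal DiscreteAPI request using the raw audio option might look like the following. The endpoint URL, field name, and authorization header here are placeholders, not the real interface; see the API Reference for the actual request format.

import requests

API_KEY = "YOUR_API_KEY"                      # placeholder credential
URL = "https://api.example.com/v1/discrete"   # hypothetical endpoint URL

# Raw audio option: POST a short (4-10 s) mono .wav file as
# multipart/form-data for the lowest-latency response.
with open("snippet.wav", "rb") as f:
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("snippet.wav", f, "audio/wav")},
    )
resp.raise_for_status()
print(resp.json())  # e.g. {"main_emotion": "happy", ...}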

Outputs

Outputs are returned as JSONs in the following formats:
DiscreteAPI:
{
  "main_emotion": "happy",
  "confidence": 0.777777777,
  "all_predictions": {
    "angry": 0.123456789,
    "happy": 0.777777777,
    "neutral": 0.23456789,
    "sad": 0.098765432
  }
}
The emotion returned in main_emotion is the highest-confidence emotion returned by the model. Within all_predictions, each emotion is paired with its confidence score. Some users combine the top two highest-confidence emotions to derive more nuanced states. We recommend dropping a main_emotion with confidence under 0.38, but this is at the user's discretion.
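As a sketch of how a client might apply these recommendations, the helper below (an illustrative function, not part of our SDK) drops a low-confidence main_emotion and extracts the top two predictions:

CONFIDENCE_FLOOR = 0.38  # suggested cutoff; adjust at your discretion

def interpret(result):
    """Return (primary_emotion_or_None, top_two_predictions)."""
    primary = result["main_emotion"]
    if result["confidence"] < CONFIDENCE_FLOOR:
        primary = None  # too uncertain to act on
    # Rank every emotion by confidence, highest first.
    ranked = sorted(result["all_predictions"].items(),
                    key=lambda kv: kv[1], reverse=True)
    return primary, ranked[:2]

# With the example response above:
# interpret(resp) -> ("happy", [("happy", 0.777...), ("neutral", 0.234...)])

AsynchAPI: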
{
  "request_id": "27a33189-bdd7-47ca-9817-abacfb7bdaf3",
  "status": "completed",
  "emotions": [
    {
      "t": "00:00",
      "emotion": "neutral",
      "confidence": 0.82791723
    },
    {
      "t": "00:05",
      "emotion": "neutral",
      "confidence": 0.719817432
    },
    {
      "t": "00:10",
      "emotion": "happy",
      "confidence": 0.917309381
    },
    {
      "t": "00:15",
      "emotion": "neutral",
      "confidence": 0.414097846
    },
    "..."
  ]
}
Each entry in emotions contains the highest-confidence emotion returned by the model for that window, alongside its timestamp and confidence score. The number of entries in emotions is directly proportional to the length of the input file, at one entry per 5 seconds of audio. We recommend dropping entries with confidence under 0.38, but this is at the user's discretion.
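As one way to consume this timeline, the sketch below (again illustrative, not part of our SDK) filters out low-confidence entries and collapses consecutive repeats into spans:

CONFIDENCE_FLOOR = 0.38  # suggested cutoff; adjust at your discretion

def summarize(result):
    """Collapse the 5-second timeline into spans of a single emotion,
    skipping entries below the confidence floor."""
    spans = []
    for entry in result["emotions"]:
        if entry["confidence"] < CONFIDENCE_FLOOR:
            continue  # too uncertain to include
        if spans and spans[-1]["emotion"] == entry["emotion"]:
            spans[-1]["until"] = entry["t"]  # extend the current span
        else:
            spans.append({"emotion": entry["emotion"],
                          "from": entry["t"], "until": entry["t"]})
    return spans

# With the example response above:
# [{"emotion": "neutral", "from": "00:00", "until": "00:05"},
#  {"emotion": "happy", "from": "00:10", "until": "00:10"},
#  {"emotion": "neutral", "from": "00:15", "until": "00:15"}]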
Looking for a different timestamp interval? Customizable classification window lengths are in beta and will be released soon.