Artificial Intelligence

Automatic Speech Recognition (ASR)


Speech Recognition transcribes the spoken audio in a video or video segment into text and returns a block of text for each transcribed portion of the audio.
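As a rough illustration of what "blocks of text for each portion" could look like in code, the sketch below models one block as a time range plus its text. The class name, field names, and helper function are assumptions made for this example, not the engine's actual output format.

```python
from dataclasses import dataclass


@dataclass
class TranscriptBlock:
    """One transcribed portion of the audio (hypothetical structure)."""
    start_sec: float  # where the portion begins in the audio
    end_sec: float    # where the portion ends
    text: str         # text recognized for this portion


def blocks_to_text(blocks):
    """Join blocks into a single transcript, ordered by start time."""
    ordered = sorted(blocks, key=lambda b: b.start_sec)
    return " ".join(b.text.strip() for b in ordered if b.text.strip())


blocks = [
    TranscriptBlock(5.0, 9.0, "world"),
    TranscriptBlock(0.0, 5.0, "hello "),
]
print(blocks_to_text(blocks))
```

Keeping the time range with each block is what would let a caller map text back to a position in the video.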

Tech Stack (OS, Programming Language, Libraries, Frameworks):

Linux, Bash, Python, CUDA, Kaldi


The dataset for this module consists of recorded speech samples in a specified audio format, each paired with a transcript (the text of what is spoken in that sample).
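Since Kaldi is part of the stack, one plausible reading of "specified format" is Kaldi's conventional data directory, which pairs audio and transcripts through the files wav.scp, text, and utt2spk. The sketch below writes those three files; the function name and the shape of the input tuples are assumptions for illustration, and the actual preparation scripts used for this module are not shown in the source.

```python
import os


def write_kaldi_data_dir(samples, out_dir):
    """Write wav.scp, text and utt2spk for a Kaldi-style data directory.

    samples: iterable of (utt_id, speaker_id, wav_path, transcript) tuples
    (a hypothetical input shape chosen for this example).
    """
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi expects these files sorted by utterance ID; prefixing each
    # utterance ID with its speaker ID keeps the sort orders consistent.
    rows = sorted(samples, key=lambda s: s[0])
    paths = {name: os.path.join(out_dir, name)
             for name in ("wav.scp", "text", "utt2spk")}
    with open(paths["wav.scp"], "w", encoding="utf-8") as wav_scp, \
            open(paths["text"], "w", encoding="utf-8") as text, \
            open(paths["utt2spk"], "w", encoding="utf-8") as utt2spk:
        for utt_id, speaker_id, wav_path, transcript in rows:
            wav_scp.write(f"{utt_id} {wav_path}\n")
            # Collapse repeated whitespace so each transcript is one clean line.
            text.write(f"{utt_id} {' '.join(transcript.split())}\n")
            utt2spk.write(f"{utt_id} {speaker_id}\n")
```

Each line in the three files starts with the same utterance ID, which is how a sample is matched to its transcript.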

Resources (Hardware, Storage, Compute Power, Time):

Training a Speech Recognition engine takes a lot of time and requires high computing power and storage capacity. To build an engine with good accuracy, we need a large amount of training data (at least around 300 hours), a powerful high-end CPU- and GPU-based system on the cloud, and a couple of weeks to train the model.
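To give the storage requirement a rough scale, the sketch below estimates the size of the raw audio alone. The 16 kHz, 16-bit, mono format is an assumption (the source does not state the audio format), and extracted features, augmented copies, and intermediate training artifacts would come on top of this figure.

```python
def raw_audio_gigabytes(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio in decimal gigabytes.

    The default format (16 kHz, 16-bit, mono) is an assumption for this example.
    """
    seconds = hours * 3600
    total_bytes = seconds * sample_rate_hz * bytes_per_sample * channels
    return total_bytes / 1e9


# 300 hours * 3600 s/hour * 16000 samples/s * 2 bytes/sample = 34.56e9 bytes
print(raw_audio_gigabytes(300))  # 34.56
```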

Deployment (Server / API):

The following steps are included in the deployment of a Transcription Engine:

API Development

Environment Setup

Engine Installation and Configuration

Model Deployment

Server and Route Setup

Testing
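The API and route steps above can be sketched with Python's standard library alone. Everything named here is an assumption for illustration: the /transcribe route, the JSON response shape, and the transcribe placeholder, which stands in for whatever call the installed engine actually exposes. The deployed service may well use a different framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def transcribe(audio_bytes):
    """Hypothetical placeholder for the call into the installed engine."""
    raise NotImplementedError("wire this to the transcription engine")


def build_response(audio_bytes, engine=transcribe):
    """Return (HTTP status, JSON-serializable payload) for one request."""
    if not audio_bytes:
        return 400, {"error": "empty request body"}
    try:
        blocks = engine(audio_bytes)
    except Exception as exc:  # report engine failures as a server error
        return 500, {"error": str(exc)}
    return 200, {"blocks": blocks}


class TranscriptionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Single route; any other path is rejected.
        if self.path != "/transcribe":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = build_response(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=8000):
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), TranscriptionHandler)
```

Keeping the request logic in build_response, separate from the HTTP handler, means the testing step can exercise it with a fake engine and no running server, for example build_response(b"audio", engine=lambda a: [{"text": "hi"}]).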

Output (Screenshots):

Applications (General Real World Use):

Speech To Text or Speech Transcription has the following applications:

Voice assistants

Voice user interfaces

Call analytics and agent assist

Media content search

Media subtitling

Use Case (Our Specific):

We built the Speech Recognition Engine for a Media Monitoring Platform focused on Broadcast Media (TV, Radio and Videos). The Engine generates transcriptions of television broadcasts, radio and videos in multiple languages. The languages we covered include English, German, Dari, Pashto, Hindi and Urdu.


