Audio to Text from Blob trigger - azure

So I have a use case where uploading an audio file (.wav) to Blob Storage triggers an Azure Function that extracts the text from the audio. At the moment, the only way I can make it work is with the audio file stored locally: the audio config can't take the URI of the audio file. The code I'm using is this:
import azure.cognitiveservices.speech as speechsdk

speech_key, service_region = "sub-key", "westeurope"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_input = speechsdk.AudioConfig(filename="**BLOB URI**")
speech_recognizer = speechsdk.SpeechRecognizer(speech_config, audio_input)
result = speech_recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
From my research, a URI can't be passed as the filename (the bold part of the code), and solutions like downloading the file locally first won't work for my scenario.
I tried reading the audio as a stream, but I couldn't find a way to convert it to an AudioInputStream.
Any help would be great. Thanks.

You can use the Batch Transcription REST API, which enables you to transcribe large amounts of audio in storage. You can point it to audio files with a regular URI or a shared access signature (SAS) URI and receive transcription results asynchronously. With the v3.0 API, you can transcribe one or more audio files, or process a whole storage container.
Please see the following:
https://medium.com/@abhishekcskumar/logic-apps-large-audio-speech-to-text-batch-transcription-d71e93bbaeec
https://github.com/PanosPeriorellis/Speech_Service-BatchTranscriptionAPI/blob/master/CrisClient/Program.cs
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/batch-transcription#sample-code
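As a rough sketch of what a v3.0 batch transcription request looks like (the region, key, and SAS URI below are placeholders, not real values — check the docs linked above for the authoritative schema):

```python
import json

# Placeholder values -- substitute your own region, key, and blob SAS URI.
region = "westeurope"
subscription_key = "sub-key"
audio_sas_uri = "https://<account>.blob.core.windows.net/audio/file.wav?<sas-token>"

# Body for POST https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions
payload = {
    "contentUrls": [audio_sas_uri],
    "locale": "en-US",
    "displayName": "blob transcription",
    "properties": {"wordLevelTimestamps": False},
}
headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    "Content-Type": "application/json",
}

body = json.dumps(payload)
# e.g. requests.post(endpoint, data=body, headers=headers), then poll the
# returned transcription URL until its status is "Succeeded" and fetch the files.
```

Because the call is asynchronous, the Function that posts the job is not the one that receives the text; you poll (or get a callback) for the result files.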

Related

How to convert an audio file in colab to text?

I am trying to convert an audio file in my Colab workspace into text using the speech recognition module. But it doesn't work, because the audio argument here needs to be audio data, not a filename. How do I load an audio file "audio.wav" into a variable that I can pass there, or simply pass the file itself?
import speech_recognition as sr
r = sr.Recognizer()
text = r.recognize_google(audio, language = 'en-IN')
print(text)
The speech_recognition library has a procedure to read in audio files. You can do:
inp = sr.AudioFile('path/to/audio/file')
with inp as file:
    audio = r.record(file)
After that, pass audio as the first argument to r.recognize_google().
Here is a good article to understand this library.
pip3 install SpeechRecognition pydub
Make sure you have an audio file in the current directory that contains English speech:
import speech_recognition as sr
filename = "16-122828-0002.wav"
The code below loads the audio file and converts the speech into text using Google Speech Recognition:
# initialize the recognizer
r = sr.Recognizer()
# open the file
with sr.AudioFile(filename) as source:
    # listen for the data (load audio to memory)
    audio_data = r.record(source)
    # recognize (convert from speech to text)
    text = r.recognize_google(audio_data)
print(text)
This will take a few seconds to finish, as it uploads the file to Google and grabs the output.
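recognize_google needs a readable PCM WAV, so when a file refuses to load it can help to inspect its format first. A minimal stdlib sketch (it writes a short silent WAV so the snippet is self-contained; the filename is made up):

```python
import wave

# Write a one-second silent mono WAV so the example runs anywhere.
filename = "example.wav"
with wave.open(filename, "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz
    w.writeframes(b"\x00\x00" * 16000)

# Inspect the parameters the recognizer will see.
with wave.open(filename, "rb") as w:
    print(w.getnchannels(), w.getsampwidth(), w.getframerate(), w.getnframes())
    # -> 1 2 16000 16000
```

If wave.open raises an error here, the file is probably compressed or not a WAV at all, and converting it (e.g. with pydub, installed above) is the first thing to try.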

How to calculate md5 for file in s3 bucket

I have a requirement to calculate MD5 values for files held in S3 buckets. I know I could download them to an on-prem server and do it there, but I want to keep my on-prem server as small as possible, and some of my S3 files are large (500 MB+). So I have started developing a Lambda Python function to handle this, but I can't figure out how to chunk through the file so I can generate the MD5 value. Here is the code; I look forward to any assistance provided.
def s3_md5sum(bucket_name, object_key):
    try:
        md5Object = s3object.Object(bucket_name, object_key)
        body = md5Object.get()['Body'].read()
    except ClientError:
        raise
    else:
        md5_obj = hashlib.md5()
        while True:
            buffer = body.read(8096)
            if not buffer:
                break
            md5_obj.update(buffer)
        hash_code = md5_obj.hexdigest()
        md5 = str(hash_code).lower()
        return md5
You can read the file as a stream instead of trying to read the entire file into memory. You can then use the hashlib library to build the MD5 from the chunks of the stream. An example of this can be found in this SO question.
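A minimal sketch of the streaming approach (demonstrated with an in-memory stream; with S3 you would pass the StreamingBody from object.get()['Body'] directly, without calling .read() on it first — that first .read() in the code above is what pulls the whole file into memory):

```python
import hashlib
from io import BytesIO

def md5_of_stream(stream, chunk_size=8192):
    """Hash a file-like object chunk by chunk, never holding it all in memory."""
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
    return md5.hexdigest()

# Demo with an in-memory stream; boto3's StreamingBody supports .read(n) too.
print(md5_of_stream(BytesIO(b"hello")))  # -> 5d41402abc4b2a76b9719d911017c592
```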

Open bytes stream as image file to access exif. 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte'

I need to download an image when it arrives in Dropbox and upload it to Azure Storage. I am doing this using a Kubeless serverless function which is triggered by Dropbox's push notification. This download and upload is working fine. However, I'd like to access the image's exif data so I can send metadata about the image to a RabbitMQ queue. When I try to open the image with Python's Pillow or exif modules, I get this error:
'UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte'
dbx = dropbox.Dropbox(os.environ['DROPBOX_TOKEN'])
for entry in dbx.files_list_folder('').entries:
    print('File found: ' + entry.name)
    md, response = dbx.files_download('/' + entry.name)
    file_stream = response.content
    # Code to upload to Azure Storage and delete file from Dropbox here
I've tried:
with open(file_stream, 'rb') as data:
    # do things with file
And using Pillow:
image = Image.open(file_stream)
image.show()
And I get the same error with each. I'm new to using Python and to working with files this way, so any help would be appreciated.
The problem was that I was trying to open() the file, whereas file_stream already holds the opened file's content as bytes.
The Python Requests docs provide an example of using Pillow's Image.open() in a way that works for me:
>>> from PIL import Image
>>> from io import BytesIO
>>> i = Image.open(BytesIO(r.content))
https://2.python-requests.org/en/master/user/quickstart/#binary-response-content
For future reference, this way of opening the image allowed me to access the exif data successfully (img_exif = image.getexif()), whereas the exif data was empty when using Image.frombytes(). I'm not sure why.
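Incidentally, the 0xff in the error message is no accident: JPEG files begin with the bytes FF D8 FF (the SOI marker), which are not valid UTF-8, so anything that tries to decode the raw image bytes as text fails on the very first byte. A quick stdlib check (the sample bytes are a made-up JPEG header for the demo, not a real image):

```python
def looks_like_jpeg(data: bytes) -> bool:
    """JPEG files start with the SOI marker FF D8 followed by FF."""
    return data[:3] == b"\xff\xd8\xff"

sample = b"\xff\xd8\xff\xe0" + b"\x00" * 16   # fake JPEG header for the demo
print(looks_like_jpeg(sample))                # -> True
print(looks_like_jpeg(b"not an image"))       # -> False

# Decoding such bytes as text reproduces the original error:
try:
    sample.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason, "at position", e.start)   # -> invalid start byte at position 0
```

This is why wrapping the bytes in BytesIO and handing them to Image.open() works: Pillow reads them as binary, never as text.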

Download video in Python using requests module results in black video

This is the URL I tried to download: https://www.instagram.com/p/B-jEqo9Bgk9/?utm_source=ig_web_copy_link
This is a minimal reproducible example:
import os
import requests

def main():
    filename = 'test.mp4'
    r = requests.get('https://www.instagram.com/p/B-jEqo9Bgk9/?utm_source=ig_web_copy_link', stream=True)
    with open(os.path.join('.', filename), 'wb') as f:
        print('Dumping "{0}"...'.format(filename))
        for chunk in r.iter_content(chunk_size=1024):
            print(chunk)
            if chunk:
                f.write(chunk)
                f.flush()

if __name__ == '__main__':
    main()
The code runs fine but the video does not play. What am I doing wrong?
Your code is running perfectly fine, but you did not provide the correct link for the video. The link you used is for the Instagram web page, not the video. So you should not save the content as 'test.mp4', but rather as 'test.html'. If you open the file in a text editor (for example Notepad++), you will see that it contains the HTML code of the web page.
You'll need to parse the HTML to acquire the actual video URL, and then you can use the same code to download the video using that URL.
Currently, the line that starts with meta property="og:video" content= contains the actual video URL, but that may change in the future.
(Note that copyright may apply for videos on Instagram. I assume you have the rights to download and save this video.)
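A minimal sketch of extracting that og:video URL with the stdlib HTML parser (the HTML string below is a made-up stand-in for the real page, which you would get from requests.get(...).text):

```python
from html.parser import HTMLParser

class OgVideoParser(HTMLParser):
    """Collect the content attribute of <meta property="og:video"> tags."""
    def __init__(self):
        super().__init__()
        self.video_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:video":
            self.video_url = attrs.get("content")

# Stand-in for the downloaded page; the real one comes from requests.get(...).text
html = '<html><head><meta property="og:video" content="https://example.com/video.mp4"/></head></html>'
parser = OgVideoParser()
parser.feed(html)
print(parser.video_url)  # -> https://example.com/video.mp4
```

Once you have the URL, the streaming-download code above works unchanged against it.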

Can google speech API convert text to speech?

I used the Google speech API to successfully convert speech to text using the following code.
import speech_recognition as sr
import os

# obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

# recognize speech using Google Cloud Speech
GOOGLE_CLOUD_SPEECH_CREDENTIALS = r"""{KEY}
"""
# INSERT THE CONTENTS OF THE GOOGLE CLOUD SPEECH JSON CREDENTIALS FILE HERE
try:
    speechOutput = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="si-LK")
except sr.UnknownValueError:
    speechOutput = "Google Cloud Speech could not understand audio"
except sr.RequestError as e:
    speechOutput = "Could not request results from Google Cloud Speech service; {0}".format(e)
print(speechOutput)
I want to know if I can convert text to speech using the same API. If not, what API should I use, and is there sample Python code for it?
Thank you!
For this you'll need to use the new Text-to-Speech API, which is in beta as of now. You can find a Python quickstart in the Client Library section of the docs. The sample is part of the python-docs-samples repo. Adding the relevant part of the example here for better visibility:
def synthesize_text(text):
    """Synthesizes speech from the input string of text."""
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.types.SynthesisInput(text=text)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.types.VoiceSelectionParams(
        language_code='en-US',
        ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)

    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.MP3)

    response = client.synthesize_speech(input_text, voice, audio_config)

    # The response's audio_content is binary.
    with open('output.mp3', 'wb') as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')
Update: rate and pitch configuration
You can enclose the text elements in a <prosody> tag to modify the rate and pitch. For example:
<prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>
The possible values for those follow the W3 specifications which can be found here. The SSML docs for Text-to-Speech API detail this and they also provide some samples.
Also, you can control the general audio playback rate with the speed option in <audio>, which currently accepts values from 50 to 200% (in 1% increments).
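A quick sketch of assembling such an SSML payload as a plain string (the helper function and its default values here are illustrative, not part of the API):

```python
def prosody_ssml(text, rate="slow", pitch="-2st"):
    """Wrap text in an SSML <prosody> element with the given rate and pitch."""
    return '<speak><prosody rate="{}" pitch="{}">{}</prosody></speak>'.format(rate, pitch, text)

ssml = prosody_ssml("Can you hear me now?")
print(ssml)
# -> <speak><prosody rate="slow" pitch="-2st">Can you hear me now?</prosody></speak>
```

With the client library shown above, SSML input goes through the ssml field instead of text, i.e. something like texttospeech.types.SynthesisInput(ssml=ssml); check the SSML docs linked above for the accepted attribute values.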
