I'm using the AWS SDK for Python (boto3) and want to set the subtitle output format (e.g. SRT). When I use the code below, I get an error saying the Subtitles parameter is not valid, but according to the AWS documentation I should be able to pass values in this parameter.
import time

import boto3

s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
transcribe = boto3.client('transcribe', aws_access_key_id=ACCESS_KEY,
                          aws_secret_access_key=SECRET_KEY, region_name=region_name)

job_name = "kateri1234"
job_uri = "s3://transcribe-upload12/english.mp4"

transcribe.start_transcription_job(TranscriptionJobName=job_name,
                                   Media={'MediaFileUri': job_uri},
                                   MediaFormat='mp4',
                                   LanguageCode='en-US',
                                   Subtitles={'Formats': ['vtt']},
                                   OutputBucketName="transcribe-output12")

while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)

print(status)
The error I get is:
Unknown parameter in input: "Subtitles", must be one of: TranscriptionJobName, LanguageCode, MediaSampleRateHertz, MediaFormat, Media, OutputBucketName, OutputEncryptionKMSKeyId, Settings
I referred to the AWS documentation.
I have faced a similar issue, and after some research I found out it was because of my boto3 and botocore versions.
I upgraded these two packages and it worked. My requirements.txt for these two packages:
boto3==1.20.0
botocore==1.23.54
P.S.: Remember to check that these two new versions are compatible with your other Python packages, especially if you are using other AWS libraries like awsebcli. To make sure everything works together, run this command after upgrading these two libraries to check for errors:
pip check
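As an extra sanity check after upgrading, you can print the installed versions at runtime before calling start_transcription_job; a minimal sketch (the Subtitles parameter is only accepted once botocore ships a Transcribe service model that includes it):

import boto3
import botocore

# Subtitles was added to StartTranscriptionJob in later service models,
# so the call only validates once these packages are new enough.
print("boto3:", boto3.__version__)
print("botocore:", botocore.__version__)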
I'm trying to fetch a file from Google Drive using Apache Beam. I tried:
import apache_beam as beam

filenames = ['https://drive.google.com/file/d/<file_id>']

with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(filenames)
    print(lines)
This returns a string like PCollection[[19]: Create/Map(decode).None]
I need to read a file from Google Drive and write it into a GCS bucket. How can I read a file from Google Drive with Apache Beam?
If you don't have complex transformations to apply, I think it's better not to use Beam in this case.
Solution 1
You can instead use Google Colab (a Jupyter notebook on Google servers), mount your Google Drive and use the gcloud CLI to copy files.
You can check the following links:
google-drive-to-gcs
stackoverflow-copy-file-from-google-drive-to-gcs
Solution 2
You can also use the APIs to retrieve files from Google Drive and copy them to Cloud Storage.
You can, for example, develop a Python script using the Python Google clients and the following packages:
google-api-python-client
google-auth-httplib2
google-auth-oauthlib
google-cloud-storage
This article shows an example.
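If it helps, here is a rough sketch of that approach using the packages above, assuming a service account JSON key that has read access to the Drive file and write access to the bucket; the file ID, bucket name, blob name and key path are placeholders:

import io

from google.cloud import storage
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

FILE_ID = "<file_id>"          # placeholder Drive file ID
BUCKET_NAME = "<your-bucket>"  # placeholder GCS bucket

credentials = service_account.Credentials.from_service_account_file(
    "service_account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)

# Download the Drive file into an in-memory buffer
drive = build("drive", "v3", credentials=credentials)
request = drive.files().get_media(fileId=FILE_ID)
buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, request)
done = False
while not done:
    _, done = downloader.next_chunk()

# Upload the downloaded bytes to Cloud Storage
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
bucket.blob("copied_from_drive").upload_from_string(buffer.getvalue())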
If you want to use Beam for this, you could write a function
def read_from_gdrive_and_yield_records(path):
    ...
and then use it like
filenames = ['https://drive.google.com/file/d/<file_id>']

with beam.Pipeline() as pipeline:
    paths = pipeline | beam.Create(filenames)
    records = paths | beam.FlatMap(read_from_gdrive_and_yield_records)
    records | beam.io.WriteToText('gs://...')
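For illustration, the helper could be as simple as fetching the shared file over HTTP and yielding its lines. A rough sketch, assuming the file is publicly shared and using the requests library (read_from_gdrive_and_yield_records is not a Beam built-in, and large files may need the Drive API instead of the plain download URL):

import requests

def read_from_gdrive_and_yield_records(path):
    # path is expected to look like 'https://drive.google.com/file/d/<file_id>'
    file_id = path.rstrip('/').split('/')[-1]
    # Publicly shared Drive files can be fetched via the download endpoint
    url = 'https://drive.google.com/uc?export=download&id=' + file_id
    response = requests.get(url)
    response.raise_for_status()
    for line in response.text.splitlines():
        yield line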
Though as mentioned, unless you have a lot of files, this may be overkill.
Is there anything in the Python API that lets you alter the artifact subdirectories? For example, I have a .json file stored here:
s3://mlflow/3/1353808bf7324824b7343658882b1e45/artifacts/feature_importance_split.json
MLflow creates a 3/ key in S3. Is there a way to modify this key to something else (a date or the name of the experiment)?
As I commented above, yes, mlflow.create_experiment() does allow you to set the artifact location using the artifact_location parameter.
However, somewhat related, the problem with setting artifact_location via create_experiment() is that once you create an experiment, MLflow will throw an error if you run create_experiment() again.
I didn't see this in the docs, but it's confirmed that if an experiment already exists in the backend store, MLflow will not allow you to run the same create_experiment() call again. And as of this post, MLflow does not have a check_if_exists flag or a create_experiments_if_not_exists() function.
To make things more frustrating, you cannot set artifact_location in the set_experiment() function either.
So here is a pretty easy workaround; it also avoids the "ERROR mlflow.utils.rest_utils..." stdout logging:
import os
from random import random, randint

import mlflow
from mlflow import log_metric, log_param, log_artifacts
from mlflow.exceptions import MlflowException

try:
    # If the experiment already exists, just reuse its ID
    experiment = mlflow.get_experiment_by_name('oof')
    experiment_id = experiment.experiment_id
except AttributeError:
    # First run: the experiment does not exist yet, so create it with a custom artifact location
    experiment_id = mlflow.create_experiment('oof', artifact_location='s3://mlflow-minio/sample/')

with mlflow.start_run(experiment_id=experiment_id) as run:
    mlflow.set_tracking_uri('http://localhost:5000')
    print("Running mlflow_tracking.py")

    log_param("param1", randint(0, 100))

    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")

    log_artifacts("outputs")
If it is the user's first time creating the experiment, the code runs into an AttributeError (mlflow.get_experiment_by_name() returns None, so experiment.experiment_id does not exist) and the except block gets executed, creating the experiment.
If it is the second, third, etc. time the code is run, only the code under the try statement executes, since the experiment now exists. MLflow will now create a 'sample' key in your S3 bucket. Not fully tested, but it works for me at least.
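The same try/except logic can also be wrapped in a small get-or-create helper so the rest of the script only deals with an experiment ID — a minimal sketch, with the experiment name and artifact location as placeholders:

import mlflow

def get_or_create_experiment_id(name, artifact_location=None):
    # Reuse the experiment if it already exists in the backend store,
    # otherwise create it with the requested artifact location.
    experiment = mlflow.get_experiment_by_name(name)
    if experiment is not None:
        return experiment.experiment_id
    return mlflow.create_experiment(name, artifact_location=artifact_location)

experiment_id = get_or_create_experiment_id('oof', artifact_location='s3://mlflow-minio/sample/')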
Hope you all are doing well. I am new to Python. I am trying to use the scikits.audiolab library with Python 3. I have working code in 2.7 (with scikits.audiolab). When I run it under Python 3 I get an ImportError: No Module Named 'Version'. As far as I know, scikits.audiolab no longer supports Python 3 (if I am not wrong). Can anyone suggest a replacement library that offers the same functionality as scikits.audiolab, or any other solution that might help me? Thanks in advance.
2.7 version code:
import numpy as np
from tempfile import mkstemp

from scikits.audiolab import Format, Sndfile
from scipy.signal import firwin, lfilter

# `all`, `RawRate` and `SendSpeech` come from the full script linked below
array = np.array(all)
fmt = Format('flac', 'pcm16')
nchannels = 1
cd, FileNameTmp = mkstemp('TmpSpeechFile.wav')

# making the .flac file
afile = Sndfile(FileNameTmp, 'w', fmt, nchannels, RawRate)

# writing into the file
afile.write_frames(array)

SendSpeech(FileNameTmp)
To check the entire code please visit: Google Asterisk Reference Code (I am modifying based on this code).
I want to modify this code to use Python 3 supported libraries. I am doing this for the Asterisk-Microsoft-Speech To Text SDK.
Firstly, the code you link to is Asterisk-Google-Speech-Recognition, not Microsoft-Speech-To-Text. If you want a sample for Microsoft-Speech-To-Text, you could refer to the official doc: Recognize speech from an audio file.
And about the problem you mentioned: yes, it's not completely compatible with Python 3. There is a solution for it in the GitHub issue; you could refer to this comment.
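If the goal is mainly to write FLAC/WAV files on Python 3, one commonly used replacement for scikits.audiolab is the soundfile package (pip install soundfile). A rough sketch of the equivalent of the Sndfile/Format calls above, with placeholder sample data and sample rate standing in for array and RawRate:

import numpy as np
import soundfile as sf  # Python 3 friendly wrapper around libsndfile

# Placeholder mono signal and sample rate
array = np.zeros(16000, dtype=np.int16)
raw_rate = 16000

# Write a 16-bit PCM FLAC file with one channel,
# roughly matching Format('flac', 'pcm16') and nchannels = 1
sf.write('TmpSpeechFile.flac', array, raw_rate, format='FLAC', subtype='PCM_16')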
I am having trouble listing blobs from a specific container.
I am using the official Python code to list them:
from azure.storage.blob import BlockBlobService

account_name = 'xxxx'
account_key = 'xxxx'
container_name = 'yyyyyy'

block_blob_service = BlockBlobService(account_name=account_name,
                                      account_key=account_key)

print("\nList blobs in the container")
generator = block_blob_service.list_blobs(container_name)
for blob in generator:
    print("\t Blob name: " + blob.name)
I have received the error:
raise AzureException(ex.args[0])
AzureException: can only concatenate str (not "tuple") to str
The version of azure storage related packages installed are:
azure-mgmt-storage 2.0.0
azure-storage-blob 1.4.0
azure-storage-common 1.4.0
I tried to run the same code as yours with my account, and it works fine without any issue. Then, based on the error information, I also tried to reproduce it, as below.
Test 1. When I ran the code '123' + ('A', 'B') in Python 3.7, I got a similar error to yours.
Test 2. When I ran the same code in Python 3.6, the error message was different.
Test 3. In Python 2 (just on WSL), the error is the same as in Python 3.7.
So I guess you were using Python 3.7 or 2 to run your code, and the issue was caused by using the + symbol to concatenate a string with a tuple somewhere else in your code. Please check carefully, or update your post with more debugging details (the line number and the surrounding code) to help with the analysis.
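For reference, a minimal snippet that reproduces Test 1; on Python 3.7 the message matches the one in your traceback:

# On Python 3.7 this raises: TypeError: can only concatenate str (not "tuple") to str
try:
    "Blob name: " + ("container", "blob")
except TypeError as ex:
    print(ex.args[0])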
I am getting a list of all of my usage plans in AWS through Boto3 and noticed that I am missing several usage plans compared to what should be there. Specifically, Boto3 thinks there are 25 plans while the AWS CLI counts 39 (which is the number displayed in the AWS console). Below is the code that I'm using to get the usage plans for my specific setup:
Python file:
import boto3
session = boto3.session.Session(profile_name='myprofile')
plans = session.client('apigateway').get_usage_plans()
print(len(plans.get('items')))
Running the file returns the following:
$ python3 getplans.py
25
While running the same call through the AWS CLI returns the following:
$ aws apigateway get-usage-plans --profile myprofile | jq '.items | length'
39
I looked through the output of both and there are just some complete plans missing without any real rhyme or reason behind them. Does anyone know why this might be happening?
I figured it out, for anyone who finds this question later. It looks like Boto3 was paginating the response. I ended up fixing the problem by using the following code:
import boto3

session = boto3.session.Session(profile_name='myprofile')
client = session.client('apigateway')

paginator = client.get_paginator('get_usage_plans')
page_iterator = paginator.paginate()

plans = []
for page in page_iterator:
    for plan in page['items']:
        plans.append(plan)
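If you prefer not to use the paginator, the same result can be collected by following the position token manually — a sketch based on the documented get_usage_plans request/response shape:

import boto3

session = boto3.session.Session(profile_name='myprofile')
client = session.client('apigateway')

plans = []
kwargs = {'limit': 100}
while True:
    response = client.get_usage_plans(**kwargs)
    plans.extend(response.get('items', []))
    position = response.get('position')
    if not position:
        break
    kwargs['position'] = position  # continue from where the previous page ended

print(len(plans))  # should now match the CLI count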