AWS MediaConvert creates a new job for each file? - node.js

I am working with AWS MediaConvert and trying to create a Node.js API that converts .mp4 files to .wav format.
The API is working correctly; however, it creates a new job for each individual .mp4 file.
Is it possible to have one MediaConvert job and use it for every file in the input bucket, instead of creating a new job for every file?
I have gone through the AWS MediaConvert documentation and various online articles, but I cannot find an answer to my question.
I have implemented my API in the following steps (a sketch follows the list):
Create an object of class AWS.MediaConvert()
Create a job template using MediaConvert.createJobTemplate
Create a job using MediaConvert.createJob
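For reference, a minimal sketch of the first two steps (not the asker's actual code), assuming the AWS SDK for JavaScript v2; the endpoint, template name, output bucket, and WAV settings are illustrative placeholders. The per-file createJob call is sketched further below in the answer.

const AWS = require('aws-sdk');

// MediaConvert requires the account-specific endpoint (placeholder below).
const mediaconvert = new AWS.MediaConvert({
  apiVersion: '2017-08-29',
  endpoint: 'https://abcd1234.mediaconvert.us-east-1.amazonaws.com',
});

// Create a reusable job template whose output group produces an audio-only WAV file.
async function createWavTemplate() {
  await mediaconvert.createJobTemplate({
    Name: 'mp4-to-wav',
    Settings: {
      OutputGroups: [{
        Name: 'File Group',
        OutputGroupSettings: {
          Type: 'FILE_GROUP_SETTINGS',
          FileGroupSettings: { Destination: 's3://output_bucket/' }, // placeholder bucket
        },
        Outputs: [{
          ContainerSettings: { Container: 'RAW' }, // no container, raw .wav output
          AudioDescriptions: [{
            AudioSourceName: 'Audio Selector 1',
            CodecSettings: {
              Codec: 'WAV',
              WavSettings: { BitDepth: 16, Channels: 2, SampleRate: 44100 },
            },
          }],
        }],
      }],
    },
  }).promise();
}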

There is generally a 1:1 relationship between inputs and jobs in MediaConvert.
A MediaConvert job reads an input video from S3 (or an HTTP server) and converts it into output groups, which in turn can have multiple outputs. A single MediaConvert job can create multiple versions of the input video in different codecs and packages.
The exception to this is when you want to join more than one input file into a single asset (input stitching).
In this case you can have up to 150 inputs in your job. AWS Elemental MediaConvert subsequently creates outputs by concatenating the inputs in the order that you specify them in the job.
Your question suggests, however, that input stitching is not what you are looking to achieve; rather, you are looking to transcode multiple inputs from the source bucket.
If so, you would need to create a job for each input (a sketch follows below).
Job templates (as well as output presets) speed up your job setup by providing groups of recommended transcoding settings. A job template applies to an entire transcoding job, whereas an output preset applies to a single output of a transcoding job.
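For illustration, a minimal sketch of that pattern (one createJob call per input object), assuming the AWS SDK for JavaScript v2, an account-specific endpoint, and a job template named mp4-to-wav; the role ARN and bucket name are placeholders:

const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const mediaconvert = new AWS.MediaConvert({
  apiVersion: '2017-08-29',
  endpoint: 'https://abcd1234.mediaconvert.us-east-1.amazonaws.com', // account-specific endpoint
});

// Submit one MediaConvert job per .mp4 object found in the input bucket.
async function transcodeBucket(bucket) {
  // listObjectsV2 returns up to 1000 keys per call; paginate for larger buckets.
  const { Contents = [] } = await s3.listObjectsV2({ Bucket: bucket }).promise();

  for (const obj of Contents.filter((o) => o.Key.endsWith('.mp4'))) {
    await mediaconvert.createJob({
      Role: 'arn:aws:iam::123456789012:role/MediaConvertRole', // placeholder role ARN
      JobTemplate: 'mp4-to-wav',                               // output settings come from the template
      Settings: {
        Inputs: [{
          FileInput: `s3://${bucket}/${obj.Key}`,
          AudioSelectors: { 'Audio Selector 1': { DefaultSelection: 'DEFAULT' } },
        }],
      },
    }).promise();
  }
}

transcodeBucket('input_bucket').catch(console.error);

A common variant is to trigger the same createJob call from an S3 event notification (for example via Lambda) as each file arrives, rather than listing the whole bucket.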
References:
Step 1: Specify your input files : https://docs.aws.amazon.com/mediaconvert/latest/ug/specify-input-settings.html
Assembling multiple inputs and input clips with AWS Elemental MediaConvert : https://docs.aws.amazon.com/mediaconvert/latest/ug/assembling-multiple-inputs-and-input-clips.html
Working with AWS Elemental MediaConvert job templates : https://docs.aws.amazon.com/mediaconvert/latest/ug/working-with-job-templates.html
Working with AWS Elemental MediaConvert output presets : https://docs.aws.amazon.com/mediaconvert/latest/ug/working-with-presets.html

Related

Submitting multiple runs to the same node on AzureML

I want to perform a hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use an MpiConfiguration there, but the run then fails with an MPI error that reads as if the cluster is configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my training script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes become increasingly messy, and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which starts child runs that are “local” to the parent run and don't need authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes used for a hyperparameter tuning run, so there would be one execution per node.
Alternatively, you can deploy a single service but load multiple model versions in init(); the score function then uses a particular model version, depending on the request's parameters.
Or use the new ML Endpoints (preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs

Optimize the use of BigQuery resources to load 2 million JSON files from GCS using Google Dataflow

I have a vast database comprised of ~2.4 million JSON files, each of which contains several records. I've created a simple Apache Beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, others.
Write to BigQuery.
# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset, and it works as expected. But I'm doubtful about the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and due to the massive number of files to parse and load, I want to know if I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I haven't fully understood some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
Update
I'm not only concerned about BigQuery quotas, but GCP quotas of other resources used by this pipeline are also a matter of concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't completely understand that message. The process successfully started 8 workers and is using 8 of the 8 available in-use IP addresses. Is this a problem? How can I fix it?
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
To achieve what you want, you can try the Google-provided templates or just refer to their code:
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the Google BigQuery I/O connector documentation.

AWS - resize BMP on upload

TASK
I am trying to write an AWS Lambda function that, upon upload of any bitmap file to my bucket, reads the bitmap, resizes it to a preset size, and writes it back to the same bucket it was read from.
SCENARIO
My Ruby web app PUTs a bitmap file of about 8 MB and approximately 1920x1080 pixels to my AWS bucket.
Upon being uploaded, the image should be read by my Lambda function, resized to 350x350, and rewritten with the same file name and key location back to the bucket.
PROBLEM
I have no experience with Node.js, so I cannot properly write this function myself. Can anyone advise me on the steps to complete this task or point me to a similar function that outputs a resized BMP file?
Image resizing is one of the reference use cases for Lambda. You can use the Serverless Image Resizer, which is a really robust solution, or an older version of it here.
There are dozens of open-source image manipulation projects you can find on GitHub. A very simple standalone Lambda that supports BMPs out of the box can be found here.
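As an illustration only (not the code from the linked projects): a minimal sketch of such a Lambda in Node.js, assuming the AWS SDK v2 and the Jimp 0.x library, which can decode and encode BMP. The bucket and key come from the S3 event; everything else is a placeholder.

const AWS = require('aws-sdk');
const Jimp = require('jimp');

const s3 = new AWS.S3();

// Triggered by an S3 "ObjectCreated" event: read the BMP, resize it, write it back.
exports.handler = async (event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));

  const original = await s3.getObject({ Bucket: bucket, Key: key }).promise();

  const image = await Jimp.read(original.Body);            // decode the BMP buffer
  image.resize(350, 350);                                  // preset target size
  const resized = await image.getBufferAsync(Jimp.MIME_BMP);

  // Caution: writing back to the same key re-triggers the event; in practice,
  // guard against loops (e.g. skip images that are already 350x350).
  await s3.putObject({
    Bucket: bucket,
    Key: key,
    Body: resized,
    ContentType: Jimp.MIME_BMP,
  }).promise();
};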

Merge multiple audio files into one file

I want to merge two audio files and produce one final file. For example, if file1 is 5 minutes long and file2 is 4 minutes long, I want the result to be a single 5-minute file, because both files will start from 0:00 and run together (i.e. overlapping).
You can use the APIs in the Windows.Media.Audio namespace to create audio graphs for audio routing, mixing, and processing scenarios. For how to create audio graphs, please refer to this article.
An audio graph is a set of interconnected audio nodes. The two audio files you want to merge supply the audio input nodes, and the audio output node is the destination for the audio processed by the graph, in this case a single file.
Scenario 4 of the official AudioCreation sample, Submix, provides exactly the feature you want. Given two files, it outputs the mixed audio; change the output node to an AudioFileOutputNode to save to a new file, since the sample creates an AudioDeviceOutputNode for playback.

How to write avro to multiple output directory using spark

There is a topic about writing text data into multiple output directories in one Spark job using MultipleTextOutputFormat:
Write to multiple outputs by key Spark - one Spark job
I would like to ask if there is a similar way to write Avro data to multiple directories.
What I want is to write the data in Avro files to different directories based on a timestamp field (records with the same day in the timestamp go to the same directory).
The AvroMultipleOutputs class simplifies writing Avro output data to multiple outputs.
Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own Schema and OutputFormat.
Case two: writing data to different files provided by the user.
AvroMultipleOutputs supports counters, by default they are disabled. The counters group is the AvroMultipleOutputs class name. The names of the counters are the same as the output name. These count the number of records written to each output name.
Also have a look at
MultipleOutputer
MultipleOutputsFormatTest (see the code example with unit test case here... For some reason MultipleOutputs does not work with Avro, but the near-identical AvroMultipleOutputs does. These obviously related classes have no common ancestor so they are combined under the MultipleOutputer type class which at least allows for future extension.)
Here is what we implemented for our use case in Java: write to different files with a prefix depending on the content of the Avro record, using AvroMultipleOutputs.
Here is a wrapper on top of OutputFormat to produce multiple outputs using AvroMultipleOutputs, similar to what @Ram mentioned: https://github.com/architch/MultipleAvroOutputsFormat/blob/master/MultipleAvroOutputsFormat.java
It can be used to write Avro records to multiple paths in Spark in the following way:
Job job = Job.getInstance(hadoopConf);
AvroJob.setOutputKeySchema(job, schema);
AvroMultipleOutputs.addNamedOutput(job, "type1", AvroKeyOutputFormat.class, schema);
AvroMultipleOutputs.addNamedOutput(job, "type2", AvroKeyOutputFormat.class, schema);

rdd.mapToPair(event -> {
        if (event.isType1())
            return new Tuple2<>(new Tuple2<>("type1", new AvroKey<>(event.getRecord())), NullWritable.get());
        else
            return new Tuple2<>(new Tuple2<>("type2", new AvroKey<>(event.getRecord())), NullWritable.get());
    })
    .saveAsNewAPIHadoopFile(
        outputBasePath,
        GenericData.Record.class,
        NullWritable.class,
        MultipleAvroOutputsFormat.class,
        job.getConfiguration()
    );
Here getRecord() returns a GenericRecord.
The output would be like this at outputBasePath:
17359 May 28 15:23 type1-r-00000.avro
28029 May 28 15:24 type1-r-00001.avro
16473 May 28 15:24 type1-r-00003.avro
17124 May 28 15:23 type2-r-00000.avro
30962 May 28 15:24 type2-r-00001.avro
16229 May 28 15:24 type2-r-00003.avro
This can also be used to write to different directories altogether by providing the baseOutputPath directly, as mentioned here: write to multiple directories
