Best approach to load a huge CSV file into DynamoDB - node.js

I have a huge .csv file on my local machine. I want to load that data into DynamoDB (eu-west-1, Ireland). How would you do that?
My first approach was:
Iterate over the CSV file locally
Send each row to AWS via curl -X POST -d '<row>' .../connector/mydata
Process each call in a Lambda and write to DynamoDB
I do not like that solution because:
There are too many requests
If I send the data without the CSV header information, I have to hardcode the field mapping in the Lambda
If I send the data with the CSV header, there is too much traffic
I was also considering putting the file in an S3 bucket and processing it with a Lambda, but the file is huge and Lambda's memory and time limits scare me.
I am also considering doing the job on an EC2 machine, but I lose reactivity (if I turn off the machine while not used) or I lose money (if I do not turn off the machine).
I was told that Kinesis may be a solution, but I am not convinced.
Please tell me what would be the best approach to get the huge CSV file into DynamoDB if you were me. I want to minimise the workload for a "second" upload.
I prefer using Node.js or R. Python may be acceptable as a last solution.

If you want to do it the AWS way, then AWS Data Pipeline may be the best approach:
Here is a tutorial that does a bit more than you need, but should get you started:
The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate a DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part1.html

If all your data is in S3, you can use AWS Data Pipeline's predefined template to 'import DynamoDB data from S3'. It should be straightforward to configure.
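If you end up doing it from Node.js directly instead of going through Data Pipeline, a rough sketch of streaming the CSV and writing it in batches of 25 via BatchWriteItem might look like this (assuming the aws-sdk v2 and csv-parse v5 packages; the table name and file name are placeholders):

// Sketch only: stream a local CSV into DynamoDB in batches of 25.
// Assumes `npm install aws-sdk csv-parse`; 'mydata' and 'huge.csv' are placeholders.
const fs = require('fs');
const AWS = require('aws-sdk');
const { parse } = require('csv-parse');

const dynamo = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' });
const TABLE = 'mydata'; // placeholder table name

async function flush(batch) {
  // BatchWriteItem accepts at most 25 items per call.
  const res = await dynamo.batchWrite({ RequestItems: { [TABLE]: batch } }).promise();
  // A real loader should retry res.UnprocessedItems with backoff.
}

async function load(file) {
  const parser = fs.createReadStream(file).pipe(parse({ columns: true })); // first row = header
  let batch = [];
  for await (const row of parser) {
    batch.push({ PutRequest: { Item: row } });
    if (batch.length === 25) {
      await flush(batch);
      batch = [];
    }
  }
  if (batch.length) await flush(batch);
}

load('huge.csv').catch(console.error);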

Related

Is there any way to load a CSV file into AWS OpenSearch?

Hi, does anyone know how to upload a CSV file to AWS OpenSearch directly using an API call (like the Bulk API)? I want to do this using Node.js; I don't want to use Kinesis or Logstash, and the upload must happen in chunks. I tried a lot but couldn't make it happen.
OpenSearch provides a JavaScript client. You can use the Bulk API to upload documents in chunks.
Update 1:
Since you mentioned you want to index the CSV file directly, use the elasticsearch-csv NPM package.
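If you go the Bulk API route, a minimal sketch with the official JavaScript client might look like this (assuming the @opensearch-project/opensearch and csv-parse packages; the endpoint, index name and chunk size are placeholders, and a protected domain would additionally need signed requests, e.g. with the client's SigV4 helper):

// Sketch only: push CSV rows to OpenSearch in chunks via the Bulk API.
// Assumes `npm install @opensearch-project/opensearch csv-parse`; names below are placeholders.
const fs = require('fs');
const { Client } = require('@opensearch-project/opensearch');
const { parse } = require('csv-parse');

const client = new Client({ node: 'https://my-domain.eu-west-1.es.amazonaws.com' }); // placeholder
const INDEX = 'mydata';  // placeholder index name
const CHUNK_SIZE = 500;  // documents per bulk request

async function sendChunk(docs) {
  // The bulk body alternates an action line and the document itself.
  const body = docs.flatMap((doc) => [{ index: { _index: INDEX } }, doc]);
  const { body: res } = await client.bulk({ body });
  if (res.errors) console.error('Some documents failed to index');
}

async function loadCsv(file) {
  const parser = fs.createReadStream(file).pipe(parse({ columns: true }));
  let chunk = [];
  for await (const row of parser) {
    chunk.push(row);
    if (chunk.length === CHUNK_SIZE) {
      await sendChunk(chunk);
      chunk = [];
    }
  }
  if (chunk.length) await sendChunk(chunk);
}

loadCsv('data.csv').catch(console.error);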

Is it possible to convert multiple files in MediaConvert AWS service?

I have got a few files in an S3 bucket, and all of them need to be converted (3 output files per input file).
Conversion rules are the same for all files.
Is it possible to do this? How can it be implemented on Node AWS sdk?
Do I need any extra service for it?
You can create a MediaConvert JobTemplate.
After this you can start one MediaConvert job for each file in S3.
If you want to start this every time a file is added, for instance, your safest bet is to create a lambda that gets triggered when a new file is added to the S3 bucket and then start a new MediaConvert job using the saved JobTemplate.
Make sure you don't start a job for the outputs of the MediaConvert Job though.
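A rough sketch of starting a job from a saved template with the Node AWS SDK (v2) might look like the following (the endpoint, role ARN, template name and S3 key are placeholders):

// Sketch only: start a MediaConvert job from a saved JobTemplate for one S3 input.
// Assumes the aws-sdk v2 package; all names below are placeholders.
const AWS = require('aws-sdk');

// MediaConvert needs your account-specific endpoint (see DescribeEndpoints).
const mediaconvert = new AWS.MediaConvert({
  region: 'eu-west-1',
  endpoint: 'https://abcd1234.mediaconvert.eu-west-1.amazonaws.com', // placeholder
});

async function startJob(inputS3Url) {
  const params = {
    JobTemplate: 'my-template',                              // placeholder template name
    Role: 'arn:aws:iam::123456789012:role/MediaConvertRole', // placeholder role ARN
    Settings: {
      Inputs: [{ FileInput: inputS3Url }], // outputs come from the template
    },
  };
  const job = await mediaconvert.createJob(params).promise();
  console.log('Started job', job.Job.Id);
}

startJob('s3://my-bucket/input/video.mp4').catch(console.error); // placeholder key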

Need to upload directory content to S3 bucket

My scenario is that I am currently using the AWS CLI to upload my directory content to an S3 bucket with the following command:
aws s3 sync results/foo s3://bucket/
Now I need to replace this with Python code. I am exploring the boto3 documentation to find the right way to do it. I see some options such as:
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.Client.upload_file
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.ServiceResource.Object
Could someone suggest which is the right approach?
I am aware that I would have to get the credentials by calling boto3.client('sts').assume_role(role, session) and use them subsequently.
The AWS CLI is actually written in Python and uses the same API calls you can use.
The important thing to realize is that Amazon S3 only has an API call to upload/download one object at a time.
Therefore, your Python code would need to:
Obtain a list of files to copy
Loop through each file and upload it to Amazon S3
Of course, if you want sync functionality (which only copies new/modified files), then your program will need more intelligence to figure out which files to copy.
Boto3 has two general types of methods:
client methods that map 1:1 with API calls, and
resource methods that are more Pythonic but might make multiple API calls in the background
Which type you use is your own choice. Personally, I find the client methods easier for uploading/downloading objects, and the resource methods are good when having to loop through resources (e.g. "for each EC2 instance, for each EBS volume, check each tag").

Python Boto3 - upload images to S3 in one put request

I have a script built out in Python that uploads images to S3, one image (one PUT request) at a time. Is it possible to upload all images to S3 at the same time, using one PUT request, to save $$ on requests?
for image_id in list_of_images:
    # upload each image
    filename = id_prefix + "/" + '{0}.jpg'.format(image_id)
    s3.upload_fileobj(buffer, bucket_name, filename, ExtraArgs={"ContentType": "image/jpeg"})
No.
The Amazon S3 API only allows creation of one object per API call.
Your options are to loop through each file (as you have done) or, if you wish to make it faster, you could use multi-threading to upload multiple files simultaneously to take advantage of more networking bandwidth.
If your desire is simply to reduce request costs, do not panic. It is only $0.005 per 1000 requests.
Good news: it looks like several companies have come out with boto3-like APIs with much better storage + download (per GB) pricing.
As of a few days ago, Backblaze came out with S3-compatible storage and an S3-compatible API.
We did a few tests on our application, and everything seems to be working as advertised!

How to execute HTTP DELETE request in AWS Lambda Nodejs function

I am trying to create an AWS Lambda transform function for a Firehose stream that sends records directly to an Elasticsearch cluster.
Currently, there is no way to specify an ES document id in a Firehose stream record, so all records, even duplicates, are inserted. However, Firehose does support transformation functions hosted in Lambda, which gave me an idea:
My solution is to create a Lambda transform function that executes a DELETE request to Elasticsearch for every record during transformation and then returns all records unmodified, thereby achieving "delete-insert" behaviour (I am OK with the record disappearing for a short period).
However, I know very little about Node.js, and even though this is such a simple thing, I can't figure out how to do it.
Is there a Node package available to Lambda that I can use to do this? (Preferably an AWS Elasticsearch API, but a simple HTTP package would do).
Do I have to package up some other module to get this done?
Can something like Apex help me out here? My preferred language is Go, but so far I have been unable to get Apex functions to execute or log anything to CloudWatch...
Thanks in advance.
This seems like a simple task, so I guess no framework is needed.
A few lines of code in Node.js would get the things done.
There are two packages that can help you:
elasticsearch-js
http-aws-es (If your ES domain is protected and you need to sign the requests)
The API doc: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html
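A minimal sketch of such a transform function using elasticsearch-js might look like this (the endpoint, index, type and id field are placeholders; this is a sketch under those assumptions, not a drop-in implementation):

// Sketch only: Firehose transform Lambda that deletes the matching ES document
// for each incoming record, then returns the records unmodified.
// Assumes `npm install elasticsearch`; endpoint, index, type and doc.id are placeholders.
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({
  host: 'https://my-domain.eu-west-1.es.amazonaws.com', // placeholder endpoint
  // For a protected domain you would also sign requests, e.g. via http-aws-es:
  // connectionClass: require('http-aws-es')
});

exports.handler = async (event) => {
  const records = await Promise.all(event.records.map(async (record) => {
    const doc = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
    try {
      // 'my-index', 'my-type' and doc.id stand in for your own document mapping.
      await client.delete({ index: 'my-index', type: 'my-type', id: doc.id });
    } catch (err) {
      if (err.status !== 404) throw err; // a missing document is fine
    }
    // Return the record unchanged so Firehose still delivers it to Elasticsearch.
    return { recordId: record.recordId, result: 'Ok', data: record.data };
  }));
  return { records };
};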
