Is there a sample implementation of a Kiba ETL job that uses an S3 bucket with CSV files as the source and another S3 bucket as the destination? - kiba-etl

I have a CSV file in S3 and I want to transform some columns and put the result in another S3 bucket, and sometimes in the same bucket but under a different folder. Can I achieve this using Kiba? If that's possible, do I need to store the CSV data in a database first, before the transformation and other steps?

Thanks for using Kiba! There is no such sample implementation available today. I'll provide vendor-supported S3 components as part of Kiba Pro in the future.
That said, what you have in mind is definitely possible (I've done this for some clients) - and there is definitely no need to store the CSV data in a database first.
What you need to do is implement a Kiba S3 source and destination that will do that for you (a rough sketch is included after the links below).
I recommend that you check out the AWS Ruby SDK, and in particular the S3 Examples.
The following links will be particularly helpful:
https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-items.html to list the bucket items
https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-item.html to download the file locally before processing it
https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-upload-bucket-item.html to upload a file back to S3
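To make this more concrete, here is a minimal, untested sketch of what such a source and destination could look like, assuming the aws-sdk-s3 gem and CSV files small enough to fit in memory. The bucket names, keys and the S3CsvSource/S3CsvDestination class names are illustrative placeholders, not part of Kiba itself:

require 'aws-sdk-s3'
require 'csv'
require 'kiba'

# Illustrative source: downloads one CSV object from S3 and yields a hash per row.
class S3CsvSource
  def initialize(bucket:, key:, region: 'us-east-1')
    @bucket, @key, @region = bucket, key, region
  end

  def each
    s3 = Aws::S3::Client.new(region: @region)
    body = s3.get_object(bucket: @bucket, key: @key).body.read
    CSV.parse(body, headers: true).each { |row| yield row.to_h }
  end
end

# Illustrative destination: buffers rows, then uploads them as a single CSV object.
class S3CsvDestination
  def initialize(bucket:, key:, region: 'us-east-1')
    @bucket, @key, @region = bucket, key, region
    @rows = []
  end

  def write(row)
    @rows << row
  end

  def close
    return if @rows.empty?
    csv = CSV.generate do |out|
      out << @rows.first.keys
      @rows.each { |row| out << row.values }
    end
    Aws::S3::Client.new(region: @region).put_object(bucket: @bucket, key: @key, body: csv)
  end
end

job = Kiba.parse do
  source S3CsvSource, bucket: 'source-bucket', key: 'input.csv'
  # Example transform: upcase a hypothetical "name" column.
  transform { |row| row.merge('name' => row['name'].to_s.upcase) }
  # A "folder" is just a key prefix, so the same bucket or another bucket both work here.
  destination S3CsvDestination, bucket: 'destination-bucket', key: 'processed/output.csv'
end
Kiba.run(job)

For larger files you would probably stream to a local temp file rather than reading the whole object into memory, as the "download the file locally" link above suggests.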
Hope this helps!

Related

Rename/move files in AWS S3 to a different folder after X days automatically

Is there any S3 job/functionality to automatically move all files that have been there for more than 10 days to another folder (using files and folders for simplicity instead of objects)?
Or files that have not been modified for more than 10 days?
My goal is to make a request using an SDK and retrieve only the files that have been created in the last 10 days, without deleting the others; I just want to move them to a different folder.
You can use Amazon S3 Lifecycle rules.
Please note that in this scenario you usually move your objects to a different bucket rather than to a different folder in the same bucket.
Managing object lifecycle
Define S3 Lifecycle configuration rules for objects that have a well-defined lifecycle. For example:
If you upload periodic logs to a bucket, your application might need them for a week or a month. After that, you might want to delete them.
Some documents are frequently accessed for a limited period of time. After that, they are infrequently accessed. At some point, you might not need real-time access to them, but your organization or regulations might require you to archive them for a specific period. After that, you can delete them.
You might upload some types of data to Amazon S3 primarily for archival purposes. For example, you might archive digital media, financial and healthcare records, raw genomics sequence data, long-term database backups, and data that must be retained for regulatory compliance.
With S3 Lifecycle configuration rules, you can tell Amazon S3 to transition objects to less-expensive storage classes, or archive or delete them.
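If you prefer to configure this from code rather than from the console, here is a rough sketch with the Ruby SDK. The bucket name, prefix and rule id are placeholders, and keep in mind that lifecycle rules act on storage class transitions and expiration, not folders:

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')
# Illustrative rule: after 10 days, transition objects under the given prefix
# to Glacier, and expire them after a year.
s3.put_bucket_lifecycle_configuration(
  bucket: 'my-bucket',
  lifecycle_configuration: {
    rules: [
      {
        id: 'age-out-old-objects',
        status: 'Enabled',
        filter: { prefix: 'incoming/' },
        transitions: [{ days: 10, storage_class: 'GLACIER' }],
        expiration: { days: 365 }
      }
    ]
  }
)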
I would suggest you consider using AWS Step Functions for this. You can implement the following workflow:
Use the S3 event to trigger the Step Functions workflow. Information on that is available here.
Use the Wait state within Step Functions to pause for 10 days (the maximum is one year). Information is available here.
After this, you can trigger a Lambda function that will move the object to a new folder in S3 (a sketch is included below).
I would suggest that you move the object to a different S3 bucket rather than to a different folder. This is because you want to avoid a loop where the movement of your object triggers another Step Functions workflow. You can limit the object prefix on the event rule, but it is safer not to have to worry about this.
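For the Lambda "move" step, assuming the Ruby runtime and that the Step Functions input still carries the original S3 event (both are assumptions; adjust to your actual payload), a minimal sketch could be:

require 'aws-sdk-s3'

ARCHIVE_BUCKET = 'my-archive-bucket' # placeholder

def handler(event:, context:)
  record = event['Records'].first
  source_bucket = record['s3']['bucket']['name']
  key = record['s3']['object']['key']

  s3 = Aws::S3::Client.new
  # S3 has no real "move": copy to the target bucket, then delete the original.
  s3.copy_object(bucket: ARCHIVE_BUCKET, copy_source: "#{source_bucket}/#{key}", key: key)
  s3.delete_object(bucket: source_bucket, key: key)
end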

Is there a way to handle multiple versions of an s3 object as if it were one single object?

I know there is no way to update an existing S3 object or a file on an HDFS filesystem in place.
However, my data sources are updated with new data on a regular basis.
Currently, I am thinking mainly of JDBC data sources, but later there will also be other types of data sources (e.g. Kafka streams).
I am wondering what would be the best solution for storing these large amounts of data in the cloud, in a way that allows me to quickly perform operations on them in Hadoop.
I would like to execute complex SQL queries on them (for example with Spark SQL), and some kind of ML algorithm will also be run on the datasets. These processes will be initiated by users in a web interface.
As far as I know, actions can be executed relatively quickly on S3 objects in Hadoop.
My plan is to upload only the new data (i.e. whatever is not yet in the S3 storage) as a new object version in S3. But I'm not sure I can treat different versions of an object as one single object, and execute SQL statements and run ML workloads on the whole dataset rather than on the chunks separately.
I'm a beginner in cloud technology. Currently, only the data storage part is of interest to me; if I understand this part a little better, I can plan the rest more easily.
So what do you think? Can I achieve it with S3 storage type? If not, what method would you suggest?
Thanks.

Is there a way to log stats/artifacts from an AWS Glue job using MLflow?

Could you please let me know if any such feature available in the current version of mlflow?
I think the general answer here is that you can log arbitrary data and artifacts from your experiment to your MLflow tracking server using mlflow_log_artifact() or mlflow_set_tag(), depending on how you want to do it. If there's an API to get data from Glue and you can fetch it during your MLflow run, then you can log it. Write a csv, save a .png to disk and log that, or declare a variable and access it when you are setting the tag.
This applies for Glue or any other API that you are getting a response from. One of the key benefits of MLflow is that it is such a general framework, so you can track what matters to that particular experiment.
Hope this helps!

What is a good architecture for storing user files when using a Mongo schema?

Simply, I need to build an app to store images for users. So each user can upload images and view them on the app.
I am using NodeJS and Mongo/Mongoose.
Is this a good approach to handle this case:
When the user uploads an image file, I will store it locally.
I will use Multer to store the file.
Each user will have a separate folder, named after their username.
In the user schema, I will define a string array that records the file paths.
When the user needs to retrieve a file, I will look up the file path and retrieve the file from the local disk.
Now my questions are:
Is this a good approach (storing files on the local file system and storing the path in the schema)?
Is there any reason to use GridFS, if the file sizes are small (<1MB)?
If I am planning to use S3 to store files later, is this a good strategy?
This is my first time with a DB application like this so very much appreciate some guidance.
Thank you.
1) Yes, storing the location within your database for use within your application, and the physical file elsewhere, is an appropriate solution. Depending on the data store and the number of files, storing files within a database can be detrimental, as it can impede processes like backup and replication when there are many large files.
2) I admit that I don't know GridFS, but the documentation says it is for files larger than 16MB, so it sounds like you don't need it yet.
3) S3 is a fantastic product and enables edge caching and backup through related services, among many others. I think your choice needs to look at what AWS provides and whether you need it, e.g. global caching or replication to different countries and data centres. Different features come at different price points, but personally I find the S3 platform excellent, and I have around 500 GB stored there for different purposes.

Automatically sync two Amazon S3 buckets, besides s3cmd?

Is there another automated way of syncing two Amazon S3 buckets besides using s3cmd? Maybe Amazon has this as an option? The environment is Linux, and every day I would like to sync new & deleted files to another bucket. I hate the thought of keeping all my eggs in one basket.
You could use the standard AWS CLI to perform the sync.
You just have to do something like:
aws s3 sync s3://bucket1/folder1 s3://bucket2/folder2
http://aws.amazon.com/cli/
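One note (assuming current AWS CLI behavior): sync only copies new and changed objects by default; the --delete flag also removes objects from the destination that no longer exist in the source, which covers the "deleted files" part of the question, and the command can be run daily from cron, for example:
aws s3 sync s3://bucket1/folder1 s3://bucket2/folder2 --delete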
S3 buckets != baskets
From their site:
Data Durability and Reliability
Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage. Objects are redundantly stored on multiple devices across multiple facilities in an Amazon S3 Region. To help ensure durability, Amazon S3 PUT and COPY operations synchronously store your data across multiple facilities before returning SUCCESS. Once stored, Amazon S3 maintains the durability of your objects by quickly detecting and repairing any lost redundancy. Amazon S3 also regularly verifies the integrity of data stored using checksums. If corruption is detected, it is repaired using redundant data. In addition, Amazon S3 calculates checksums on all network traffic to detect corruption of data packets when storing or retrieving data.
Amazon S3’s standard storage is:
Backed with the Amazon S3 Service Level Agreement.
Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year.
Designed to sustain the concurrent loss of data in two facilities.
Amazon S3 provides further protection via Versioning. You can use Versioning to preserve, retrieve, and restore every version of every object stored in your Amazon S3 bucket. This allows you to easily recover from both unintended user actions and application failures. By default, requests will retrieve the most recently written version. Older versions of an object can be retrieved by specifying a version in the request. Storage rates apply for every version stored.
That's very reliable.
I'm looking for something similar and there are a few options:
Commercial applications like s3RSync. Additionally, CloudBerry for S3 provides PowerShell extensions for Windows that you can use for scripting, but I know you're using *nix.
AWS API + (your favourite language) + cron (hear me out). It would take a decently savvy person with no prior experience of AWS's libraries only a short time to build something that copies and compares files (using the ETag feature of the S3 keys): just provide a source/target bucket and credentials, iterate through the keys, and issue the native "Copy" command in AWS. I used Java. If you use Python and cron you could make short work of a useful tool.
I'm still looking for something already built that's open source or free. But #2 is really not a terribly difficult task.
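As a rough illustration of approach #2 in Ruby (bucket names are placeholders; for very large buckets you would want batching and error handling on top of this):

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')
source = 'source-bucket'
target = 'backup-bucket'

# Walk the source bucket and copy any object that is missing from the target
# or whose ETag differs (ETags of multipart uploads differ, so treat this as a heuristic).
s3.list_objects_v2(bucket: source).each do |page|
  page.contents.each do |object|
    begin
      target_etag = s3.head_object(bucket: target, key: object.key).etag
    rescue Aws::S3::Errors::NotFound
      target_etag = nil
    end
    next if target_etag == object.etag
    s3.copy_object(bucket: target, copy_source: "#{source}/#{object.key}", key: object.key)
  end
end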
EDIT: I came back to this post and realized that nowadays Attunity CloudBeam is also a commercial solution that many folks use.
It is now possible to replicate objects between buckets in two different regions via the AWS Console.
The official announcement on the AWS blog explains the feature.
