EC2 instance for Python 3 script - python-3.x

I have a script written in Python and it's taking a fairly long time to process (around 30 minutes) on my laptop. I was thinking I could create an EC2 instance in AWS and see whether it's possible to speed up the process. I have an AWS account, so my question:
Which EC2 instance type should I create in order to run the process faster? The process reads a CSV file, does some calculations, and then writes a CSV with the results. The script's bottleneck is in the mathematical calculations, as the CSV files are fairly small.
I can go with either a free tier or paid tier instance.

I would say go with a p2.xlarge if you have to use EC2.
Try to understand what is causing the delay. Which library are you using to read the CSV? There are various ways in Python to read and manipulate CSV files, and their performance differs considerably.
Benchmarks of common approaches suggest that NumPy, SciPy, joblib and HDF5 are among the faster options for quickly saving and loading this kind of numerical data.
Try to change your algorithm. In my experience, pandas is not speedy when it comes to CSV operations. Try to tweak your code; if that doesn't work, switch to a p2.xlarge.
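If you want to confirm where the time actually goes before paying for an instance, a minimal profiling sketch along these lines might help. The file name, column name and formula below are placeholders, not your actual script:

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

# Hypothetical sketch: check whether the math or the CSV I/O dominates, and
# compare a plain Python loop against a vectorized NumPy version of the same
# calculation. File name, column name and formula are placeholders.
df = pd.read_csv("input.csv")
values = df["x"].to_numpy()

def slow_calc(v):
    # Element-by-element Python loop: typically the slow path.
    return [x ** 2 + 3 * x for x in v]

def fast_calc(v):
    # Same arithmetic pushed down into NumPy: usually far faster.
    return v ** 2 + 3 * v

cProfile.run("slow_calc(values)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
np.testing.assert_allclose(fast_calc(values), np.array(slow_calc(values)))
```

If the profile shows the time is in the calculations rather than the CSV handling, vectorizing like this usually gains more than a bigger instance would.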

Related

Live Connection to Database for Excel PowerQuery?

I currently have approximately 10M rows and ~50 columns in a table that I wrap up and share as a pivot. However, this also means that it takes approximately 30 minutes to 1 hour to download the CSV, or much longer to do a PowerQuery ODBC connection directly to Redshift.
So far the best solution I've found is to use Python with redshift_connector to run update queries and UNLOAD a zipped result set to an S3 bucket, then use boto3/gzip to download and unzip the file, and finally perform a refresh from the CSV. This resulted in a 600 MB Excel file compiled in ~15-20 minutes.
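For context, a rough, hedged sketch of that pipeline might look like the following; the cluster endpoint, credentials, bucket, query, IAM role and output object key are all placeholders, and the exact UNLOAD options and file naming depend on your setup:

```python
import gzip

import boto3
import redshift_connector

# Hypothetical sketch of the described pipeline: UNLOAD a result set from
# Redshift to S3 as gzipped CSV, then download and decompress it locally.
# All names and credentials below are placeholders.
conn = redshift_connector.connect(host="my-cluster.example.com",
                                  database="analytics",
                                  user="me", password="secret")
cur = conn.cursor()
cur.execute("""
    UNLOAD ('SELECT * FROM pivot_source')
    TO 's3://my-bucket/exports/pivot_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
    CSV GZIP PARALLEL OFF;
""")
conn.commit()

s3 = boto3.client("s3")
# The object key written by UNLOAD depends on the prefix and slices; adjust as needed.
s3.download_file("my-bucket", "exports/pivot_000.gz", "pivot.csv.gz")
with gzip.open("pivot.csv.gz", "rt") as src, open("pivot.csv", "w") as dst:
    dst.write(src.read())
```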
However, this process still feels clunky, and sharing a 600 MB Excel file among teams isn't great either. I've searched for several days but I'm no closer to finding an alternative: what would you use if you had to share a drillable table/pivot among a team with a 10 GB datastore?
As a last note: I thought about programming a couple of PHP scripts, but my office doesn't have the infrastructure to support that.
Any help or ideas would be most appreciated!
Call a meeting with the team and let them know about the constraints; you will get some suggestions and can offer some of your own.
Suggestions from my side:
For the file part
reduce the data; for example, if it is time dependent, increase the interval: hourly data can be aggregated to daily data (see the sketch after this list)
if the data is related to some groups, you can split the file into parts, one file per group
or send them only the final reports and numbers they require, don't send them the full data
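For the interval-reduction idea in the first point, a small pandas sketch could look like this; the file and column names and the daily sum are assumptions:

```python
import pandas as pd

# Hypothetical sketch: aggregate hourly rows to daily rows before sharing.
# "timestamp" and "group_id" are placeholder column names.
df = pd.read_csv("hourly_extract.csv", parse_dates=["timestamp"])
daily = (df.set_index("timestamp")
           .groupby("group_id")
           .resample("D")
           .sum(numeric_only=True)
           .reset_index())
daily.to_csv("daily_extract.csv", index=False)
```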
For a fully functional app:
you can buy a desktop PC (if budget is a constraint, buy a used one or repurpose any desktop or laptop from old inventory) and create a PHP/Python web application that does all the steps automatically
create a local database and link it with the application
create the charting, pivoting, etc. modules in that application, and remove Excel from your process altogether
you can even use some pre-built applications for the charting and pivoting part; Oracle APEX is one example that can be used.

Best way: how to export dynamodb table to a csv and store it in s3

We have one Lambda that updates a DynamoDB table after some operation.
Now we want to export the whole DynamoDB table to an S3 bucket in CSV format.
Is there any efficient way to do this?
I have also found the following way of streaming directly from DynamoDB to S3:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
But that approach stores the data in JSON format, and I cannot find a way to do this efficiently for ~10 GB of data.
As far as I can tell you have two "simple" options.
Option #1: Program that does a Scan
It is fairly simple to write a program that does a (parallel) scan of your table and then outputs the result as CSV. A no-bells-and-whistles version of this is about 100-150 lines of code in Python or Go.
Advantages:
Easy to develop
Can be run easily multiple times from local machines or CI/CD pipelines or whatever.
Disadvantages:
It will cost you a bit of money: scanning the whole table will use up read units, and depending on how much data you are reading, this might get costly fast.
Depending on the amount of data this can take a while.
Note: If you want to run this in a Lambda, remember that Lambdas can run for a maximum of 15 minutes. So once you have more data than can be processed within those 15 minutes, you probably need to switch to Step Functions.
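A hedged sketch of option #1 could look like this; the table name and attribute names are placeholders, and a real export would likely want a parallel scan and some type handling:

```python
import csv

import boto3

# Minimal sketch of option #1: scan a DynamoDB table and dump it to CSV.
# Table name and attribute names are placeholders; a real table may need
# type conversion for numbers/sets and a parallel scan for speed.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")            # hypothetical table name
fieldnames = ["pk", "sk", "payload"]          # hypothetical attribute names

with open("export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()

    kwargs = {}
    while True:
        response = table.scan(**kwargs)
        for item in response["Items"]:
            writer.writerow(item)
        # Keep paginating until DynamoDB stops returning a LastEvaluatedKey.
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
```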
Option #2: Process a S3 backup
DynamoDB allows you to create backups of your table to S3 (as the article you linked describes). Those backups will be either in JSON or in a JSON-like AWS format. You can then write a program that converts those JSON files to CSV.
Advantages:
(A lot) cheaper than a scan
Disadvantages:
Requires more "plumbing", because you need to first create the backup, then download it from S3 to wherever you want to process it, etc.
Probably will take longer than option #1
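If you go the export route, a minimal conversion sketch for the data files could look like this. It assumes the export produces gzipped, one-item-per-line DynamoDB JSON; the file name and output layout are placeholders:

```python
import csv
import gzip
import json

from boto3.dynamodb.types import TypeDeserializer

# Hypothetical sketch of option #2: convert a downloaded DynamoDB S3 export
# chunk (DynamoDB JSON, one item per line, gzipped) into a CSV file.
deserializer = TypeDeserializer()

with gzip.open("export-chunk.json.gz", "rt") as src, \
        open("export.csv", "w", newline="") as dst:
    writer = None
    for line in src:
        record = json.loads(line)["Item"]
        # Turn {"pk": {"S": "abc"}, ...} into {"pk": "abc", ...}
        item = {k: deserializer.deserialize(v) for k, v in record.items()}
        if writer is None:
            writer = csv.DictWriter(dst, fieldnames=sorted(item),
                                    extrasaction="ignore")
            writer.writeheader()
        writer.writerow(item)
```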

Confusion about the data location when applying Scikit-learn on cluster (Dask)

I'm currently working on moving a machine learning workflow (scikit-learn) from a single machine to a Slurm cluster via Dask. According to some tutorials (e.g. https://examples.dask.org/machine-learning/scale-scikit-learn.html), it's quite simple to use joblib.parallel_backend('dask'). However, the location of the read-in data confuses me and none of the tutorials mention it. Should I use dask.dataframe to read in the data to make sure it is passed to the cluster, or does it not matter if I just read it in as a pandas DataFrame (in which case the data is stored in the RAM of whichever machine runs the Jupyter notebook)?
Thank you very much.
If your data is small enough (which it is in the tutorial), and the preprocessing steps are rather trivial, then it is okay to read it in with pandas. This reads the data into your local session, not yet onto any of the Dask workers. Once you enter the joblib.parallel_backend('dask') context, the data will be copied to each worker process and the scikit-learn work will be done there.
If your data is large or you have intensive preprocessing steps, it's best to "load" the data with Dask and then use Dask's built-in preprocessing and grid search where possible. In this case the data will actually be loaded directly by the workers, because of Dask's lazy execution paradigm. Dask's grid search will also cache repeated steps of the cross-validation and can speed up computation immensely. More can be found here: https://ml.dask.org/hyper-parameter-search.html
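To illustrate the small-data case, here is a hedged sketch; the Slurm settings, file name and column names are placeholders, and it assumes dask_jobqueue is available on the cluster:

```python
import joblib
import pandas as pd
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical Slurm settings; adjust cores/memory/jobs for your cluster.
cluster = SLURMCluster(cores=4, memory="8GB")
cluster.scale(jobs=2)          # request two Slurm jobs as Dask workers
client = Client(cluster)

# Small data: reading with pandas on the local machine is fine.
df = pd.read_csv("train.csv")                  # placeholder file name
X, y = df.drop(columns="target"), df["target"] # placeholder column name

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=3)

# The fit itself is farmed out to the Dask workers; the data is shipped to
# each worker process while the backend is active.
with joblib.parallel_backend("dask"):
    search.fit(X, y)
```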

Speeding up dataframe.to_excel operations by a GPU

I was working on extracting some data where I constantly need to manipulate part of the fetched data and then append it to another dataframe that contains the combined dataset. I constantly save the dataframe using dataframe.to_excel. Since there is a lot of data, reading the previous file, appending and saving it again has become a time-consuming operation, in spite of ample CPU and RAM. I am using GCP, an N1 type with 8 vCPUs and 30 GB of memory. Moreover, since I am running several instances of the same script for different projects at once, would using a GPU speed these things up?
I have never done it myself, but I think this is possible using some pandas alternative.
I found this thread in which users seem to provide some solutions to a similar question.
I too have not tried this, but I can offer a couple of suggestions:
rather than to_excel, try to_csv; there might be small gains
you can try Modin (https://github.com/modin-project/modin); it seems to make reads and other operations faster, but I am not sure about write operations
or you could move the to_excel call into a separate function and run it in its own thread, as sketched below
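For the last suggestion, a hedged sketch of handing the write off to a worker thread; file and data are placeholders, and gains are limited because to_excel holds the GIL for much of its work:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical sketch: hand the slow to_excel call to a worker thread so the
# main loop can keep fetching/manipulating data while the file is written.
executor = ThreadPoolExecutor(max_workers=1)

def save_async(df: pd.DataFrame, path: str):
    # The write runs in the background; the returned future lets the caller
    # wait before touching the same file again.
    return executor.submit(df.to_excel, path, index=False)

combined = pd.DataFrame({"a": range(5)})        # placeholder data
future = save_async(combined, "combined.xlsx")
# ... continue building the next batch here ...
future.result()                                  # wait before the next save
```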

Create recordio file for linear regression

I'm using AWS Sagemaker to run linear regression on a CSV dataset. I have made some tests, and with my sample dataset that is 10% of the full dataset, the csv file ends up at 1.5 GB in size.
Now I want to run the full dataset, but I'm facing issues with the 15 GB file. When I compress the file with gzip, it ends up at only 20 MB. However, SageMaker only supports gzip on "Protobuf-Recordio" files. I know I can make RecordIO files with im2rec, but that seems to be intended for image files for image classification. I'm also not sure how to generate the protobuf file.
To make things even worse(?) :) I'm generating the dataset in Node.
I would be very grateful for some pointers in the right direction on how to do this.
This link https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html has useful information if you are willing to use a Python script to transform your data.
The actual code from the SDK is https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/amazon/common.py
Basically, you could load your CSV data into an NDArray (in batches, so that you can write to multiple files) and then use that module to convert it to RecordIO-protobuf. You should be able to write the buffer containing the RecordIO-protobuf data to a file.
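As a hedged sketch of that conversion (the file name and the label-in-first-column layout are assumptions):

```python
import io

import numpy as np
import sagemaker.amazon.common as smac

# Hypothetical sketch: convert a CSV (label in the first column, features
# after it) into a RecordIO-protobuf file for SageMaker's linear learner.
data = np.loadtxt("sample.csv", delimiter=",", dtype="float32")
labels = data[:, 0]
features = data[:, 1:]

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

with open("train.recordio", "wb") as f:
    f.write(buf.getvalue())
# The resulting file can be gzipped, uploaded to S3 and used as training input.
```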
Thanks
