Installing Python and Jupyter on AWS EC2 - python-3.x

I am working on a school project along with my team where we have to analyze large data sets using Python. The data is in the form of images (jpeg files). Since the analysis involves images we will be using TensorFlow, OpenCV etc. As the data set is large we are exploring running Python on EC2 and storing the data set on S3. Is there any wiki or guide that can help us with:
1) Setting up Python (3.5) on EC2 and connecting to the S3 bucket where the files are stored.
2) Creating a multi-user environment where all five team members can access the server remotely and run tests against the data set/files.
My skill level on AWS is basic at best. Greatly appreciate any help with this.

At a high level, you would want to use the AWS CLI, but there are several things you would need to set up first.
Create an account and open the IAM console to create your users. I assume you would want to assign them all to the same group and define one permission policy for all of them; they should only need access to EC2 and S3. You will need a working knowledge of the IAM service (it is relatively small).
Create a role so that your EC2 instance can get access to S3. Follow this tutorial.
Use the AWS CLI to access your EC2 instance. The installation/development workflow should mimic a Linux workflow very closely.
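To make the last two steps concrete, here is a minimal sketch of what could run on the EC2 instance once the S3 role is attached: boto3 picks up the role's credentials automatically, so nothing is hard-coded. The bucket name and prefix are hypothetical.
import boto3  # reads credentials from the EC2 instance role automatically

s3 = boto3.client("s3")

# List the image files under a prefix of the (hypothetical) dataset bucket
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-dataset-bucket", Prefix="images/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
The same code works for every team member who logs in to the instance, since the permissions come from the instance role rather than from per-user keys.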

Related

Is there a way to run a Deep Learning model locally with the data on AWS S3?

I am trying to implement a Neural Network using TensorFlow, with the dataset organised into different folders (each folder represents a class). I would like to know if there's a way to use the data from S3 and run the deep learning model on the local machine.
I have all the files on S3 but am unable to bring them to the local machine.
P.S. I'm using Python version 3.5.
As of now, no deep learning framework supports fetching data directly from S3 for training, maybe because of S3 pricing.
However, you can mount S3 on your local system:
S3-Fuse - https://github.com/s3fs-fuse/s3fs-fuse
S3Fs - https://fs-s3fs.readthedocs.io/en/latest/
Please note, for every read/write you will be billed according to AWS S3 pricing: https://aws.amazon.com/s3/pricing/
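For the Python route, here is a minimal sketch with the S3Fs (fs-s3fs) package linked above, assuming credentials are already configured (e.g. in ~/.aws/credentials); the bucket and file names are hypothetical.
from fs_s3fs import S3FS  # pip install fs-s3fs

# Open the bucket as a PyFilesystem filesystem; credentials come from the
# usual AWS sources (~/.aws/credentials or environment variables)
s3 = S3FS("my-dataset-bucket")

# The class folders show up as ordinary directories
print(s3.listdir("/"))

# Read one image as bytes
with s3.openbin("/class_a/image_001.jpg") as f:
    data = f.read()
Every listing and read is still an S3 request, so the pricing note above applies.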
TensorFlow supports this (though I think not in the nightly builds); see the documentation.
Assuming you have configured the credentials as described (e.g. $HOME/.aws/credentials or with environment variables), you have to use URLs with s3 as the protocol, like
s3://mybucket/some/path/words.tsv
If you read or write files in your own code, be sure not to use plain Python I/O but TensorFlow's tf.io.gfile.GFile. Similarly, to list directories use e.g. tf.io.gfile.walk or tf.io.gfile.listdir.
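A minimal sketch of that pattern, reusing the hypothetical bucket path above (depending on your TensorFlow version, S3 support may be built in or may require the separate tensorflow-io package to register the s3:// filesystem):
import tensorflow as tf
# import tensorflow_io  # may be needed on newer TensorFlow versions for s3://

path = "s3://mybucket/some/path/words.tsv"

# Read the file through TensorFlow's file API instead of plain open()
with tf.io.gfile.GFile(path, "r") as f:
    for line in f:
        print(line.rstrip())

# List the "directory" (prefix) the same way
print(tf.io.gfile.listdir("s3://mybucket/some/path/"))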
Of the environment variables in the documentation, we only set AWS_REGION, but in addition the following are useful to control logging and avoid timeouts:
export AWS_LOG_LEVEL=3
export S3_REQUEST_TIMEOUT_MSEC=600000
Still, reading training data from s3 is usually only a good idea if you run your training on AWS. For running locally, it is usually better to copy the data to your local drive, e.g. with AWS CLI's sync command.
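For example, aws s3 sync s3://mybucket/some/path ./data copies the whole prefix down. If you would rather do it from Python, a rough boto3 equivalent (bucket and prefix are hypothetical) is:
import os
import boto3

s3 = boto3.client("s3")
bucket, prefix, local_dir = "mybucket", "some/path/", "./data"

# Download every object under the prefix to the local directory
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):  # skip "folder" placeholder objects
            continue
        target = os.path.join(local_dir, obj["Key"])
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(bucket, obj["Key"], target)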

Build an extensible system for scraping websites

Currently, I have a server running. Whenever I receive a request, I want some mechanism to start the scraping process on some other resource (preferably dynamically created), as I don't want to perform scraping on my main instance. Further, I don't want the other instance to keep running and charging me when I am not scraping data.
So, preferably a system that I can request to start scraping the site and close when it finishes.
Currently, I have looked into Google Cloud Functions, but they cap every function at 9 minutes, so they won't fit my requirement, as scraping would take much more time than that. I have also looked into the AWS SDK; it allows us to create VMs at runtime and also close them, but I can't figure out how to push my API script onto the newly created AWS instance.
Further, the system should be extensible. Like I have many different scripts that scrape different websites. So, a robust solution would be ideal.
I am open to using any technology. Any help would be greatly appreciated. Thanks
I can't figure out how to push my API script onto the newly created AWS instance.
This is achieved by using UserData:
When you launch an instance in Amazon EC2, you have the option of passing user data to the instance that can be used to perform common automated configuration tasks and even run scripts after the instance starts.
So basically, you would construct your UserData to install your scripts, all dependencies and run them. This would be executed when new instances are launched.
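As a rough sketch, launching such an instance from Python with boto3 could look like the following; the AMI ID, instance type and script contents are placeholders. Setting the shutdown behaviour to terminate means the instance goes away (and stops billing) as soon as the script shuts the machine down, which matches the "close when it finishes" requirement.
import boto3

# First-boot shell script: install dependencies, pull the scraper and run it
# (the repository URL and commands are placeholders)
user_data = """#!/bin/bash
yum install -y python3 git
git clone https://example.com/my-scraper.git /opt/scraper
python3 /opt/scraper/run.py
shutdown -h now   # power off when the job finishes
"""

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",           # placeholder size
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    InstanceInitiatedShutdownBehavior="terminate",  # shutdown -> terminate -> no further billing
)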
If you want the system to be scalable, you can launch your instances in an Auto Scaling Group and scale it up or down as you require.
The other option is running your scripts as Docker containers. For example using AWS Fargate.
By the way, AWS Lambda has a limit of 15 minutes, so not much more than Google Cloud Functions.

AWS: How to launch multiple copies of the same instance from Python?

I have an AWS Windows Server 2016 VM. This VM has a bunch of libraries/software installed (dependencies).
I'd like to, using python3, launch and deploy multiple clones of this instance. I want to do this so that I can use them almost like batch compute nodes in Azure.
I am not very familiar with AWS, but I did find this tutorial.
Unfortunately, it shows how to launch an instance from the store, not an existing configured one.
How would I do what I want to achieve? Should I create an AMI from my configured VM and then just launch that?
Any up-to-date links and/or advice would be appreciated.
Yes, you can create an AMI from the running instance, then launch N instances from that AMI. You can do both using the AWS console or you could call boto3 create_image() and run_instances(). Alternatively, look at Packer for creating AMIs.
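A minimal boto3 sketch of that flow, with placeholder IDs (note that create_image reboots the source instance unless NoReboot=True is passed):
import boto3

ec2 = boto3.client("ec2")

# 1. Create an AMI from the already-configured instance (placeholder ID)
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="configured-worker",
    NoReboot=True,
)
image_id = image["ImageId"]

# Wait until the AMI is available before launching from it
ec2.get_waiter("image_available").wait(ImageIds=[image_id])

# 2. Launch N clones of it (here: 5)
ec2.run_instances(
    ImageId=image_id,
    InstanceType="m5.large",  # placeholder type
    MinCount=5,
    MaxCount=5,
)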
You don't strictly need to create an AMI. You could simply bootstrap each instance as it launches via a user data script or some form of configuration management like Ansible.

Storage and cost for a website on EC2 AWS using Node.js + express

I'm trying to understand how to use AWS EC2 services, so I've developed a dynamic website using Node.js and Express.
I'm reading the documentation, but people's advice is always useful when learning new stuff.
In this website users can upload photos so I need storage space (SSD would be better).
I have three questions:
1) Is storage provided with the EC2 instance, or do I have to use another AWS service such as an S3 bucket? What's the best/fastest and least expensive solution to store and access images?
2) I'm using a t2.nano, which costs $0.0063 per hour. So if I run the instance for 10 days, is my cost 24 hours * 10 days * $0.0063?
3) I'm using MongoDB; is it a good solution to run it on my EC2 instance, or should I use RDS provided by AWS?
So:
1) Personally I'd use an S3 bucket to store images. Note that if a multipart upload to the bucket fails, the parts will not show up in the object listing but will still use space; there's a lifecycle option to remove them after a certain period.
When you add an object to S3, store its key in your database; then you can simply retrieve it as required.
2) Your arithmetic is right (24 hours * 10 days * $0.0063 ≈ $1.51). Also note the free tier covers a t2.micro, so you can run a comparable instance for nothing for the first year.
3) Personally I'd set Mongo up to run on an appropriate EC2 instance. Note: you must properly define the security group; you only want your own AWS applications and services to access the instance. You'll need SSH access to configure it, but I'd remove that rule from the security group afterwards.
Once your Mongo instance is set up, take an AMI so that, should anything go wrong, you can re-deploy it configured (note this won't restore the data).
The AWS pricing calc (here) covers EC2: the easy way is to calculate it at 100% usage. The other stuff can get a bit complicated, but that wizard lets you basically price up your monthly running costs.
Edit: check out this comparison of the different storage options (S3 vs X) for storing those images, although your "bible" should be that pricing calculator - I'd highly recommend learning how to use it, as for your own business it'll be invaluable, and if you're working for someone else it'll help you make business cases.

Is an Amazon Machine Image (AMI) static, or can its code be modified and rebuilt?

I have a customer who wants me to do some customisations of the ERP system opentaps, which they use via the opentaps Amazon Elastic Compute Cloud (EC2) images. I've only worked with it on a normal server and don't know anything about images in the cloud. When I SSH in with the details the client gave me, there is no sign of the ERP installation directory I'd expect to see. I did originally expect that the image wouldn't be accessible, but the client assured me it was. I suppose they could be confused.
Would one have to create a new image and swap it out, or is there a way to alter the source and rebuild, like on a normal server?
Something is not quite clear to me here. First of all, EC2 instances running in the cloud are just like normal virtual servers, so if you have access to the running instance, there is no difference between an instance in the cloud and an instance on another PC in your home, for example.
You have to find out how opentaps is installed on the provided AMIs, then make your modifications, create an image from the modified instance, and save it to S3 for backup if necessary.
If you want to start with a fresh instance, you can start up any Linux/Windows distribution on EC2, install opentaps yourself, and you are done.
