I'm doing pre-processing tasks using EC2.
I execute shell commands through the userdata variable. The last line of my userdata is sudo shutdown now -h, so the instance terminates automatically once the pre-processing task is completed.
This is what my code looks like:
import boto3
userdata = '''#!/bin/bash
pip3 install boto3 pandas scikit-learn
aws s3 cp s3://.../main.py .
python3 main.py
sudo shutdown now -h
'''
def launch_ec2():
    ec2 = boto3.resource('ec2',
                         aws_access_key_id="",
                         aws_secret_access_key="",
                         region_name='us-east-1')
    instances = ec2.create_instances(
        ImageId='ami-0c02fb55956c7d316',
        MinCount=1,
        MaxCount=1,
        KeyName='',
        InstanceInitiatedShutdownBehavior='terminate',
        IamInstanceProfile={'Name': 'S3fullaccess'},
        InstanceType='m6i.4xlarge',
        UserData=userdata,
        InstanceMarketOptions={
            'MarketType': 'spot',
            'SpotOptions': {
                'SpotInstanceType': 'one-time',
            }
        }
    )
    print(instances)

launch_ec2()
The problem is, sometimes when there is an error in my Python script, the script dies and the instance gets terminated.
Is there a way I can collect error/info logs and send them to CloudWatch before the instance gets terminated? That way, I would know what went wrong.
You can achieve the desired behavior by leveraging bash functionality.
You could in fact create a log file for the entire execution of the UserData, and you could use trap to make sure that the log file is copied over to S3 before terminating if an error occurs.
Here's how it could look:
#!/bin/bash -xe
exec &>> /tmp/userdata_execution.log
upload_log() {
aws s3 cp /tmp/userdata_execution.log s3://... # use a bucket of your choosing here
}
trap 'upload_log' ERR
pip3 install boto3 pandas scikit-learn
aws s3 cp s3://.../main.py .
python3 main.py
sudo shutdown now -h
A log file (/tmp/userdata_execution.log) that contains stdout and stderr will be generated for the UserData; if there is an error during the execution of the UserData, the log file will be uploaded to an S3 bucket.
If you wanted to, you could of course also stream the log file to CloudWatch; however, to do so you would have to install the CloudWatch agent on the instance and configure it accordingly. I believe that for your use case, uploading the log file to S3 is the best solution.
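If you did still want the log in CloudWatch without installing the agent, another option (not what the answer above suggests, just a rough sketch) is to push the finished log file with boto3 from within the UserData. This assumes the instance profile also has CloudWatch Logs permissions; the log group and stream names below are placeholders.

import time
import boto3

logs = boto3.client('logs', region_name='us-east-1')
log_group = '/ec2/preprocessing'      # assumed log group name
log_stream = 'userdata-execution'     # assumed log stream name

# Create the group/stream if they do not exist yet; ignore "already exists" errors
try:
    logs.create_log_group(logGroupName=log_group)
except logs.exceptions.ResourceAlreadyExistsException:
    pass
try:
    logs.create_log_stream(logGroupName=log_group, logStreamName=log_stream)
except logs.exceptions.ResourceAlreadyExistsException:
    pass

# Send each line of the UserData log as one log event
with open('/tmp/userdata_execution.log') as f:
    events = [{'timestamp': int(time.time() * 1000), 'message': line.rstrip()}
              for line in f if line.strip()]

if events:
    logs.put_log_events(logGroupName=log_group,
                        logStreamName=log_stream,
                        logEvents=events)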
Related
I'm having some problems with retrieving job output from an AWS glacier vault.
I initiated a job (aws glacier initiate-job), the job is indicated as complete via aws glacier, and then I tried to retrieve the job output
aws glacier get-job-output --account-id - --vault-name <myvaultname> --job-id <jobid> output.json
However, I receive an error: [Errno 2] No such file or directory: 'output.json'
Thinking that perhaps the file needed to be created first, I tried creating it beforehand (which really doesn't make sense), and then received the [Errno 9] Bad file descriptor error instead.
I'm currently using the following version of the AWS CLI:
aws-cli/2.4.10 Python/3.8.8 Windows/10 exe/AMD64 prompt/off
I tried using the aws CLI from both an Administrative and non-Administrative command prompt with the same result. Any ideas on making this work?
From a related reported issue, you can try running this command in a DOS window:
copy "c:\Program Files\Amazon\AWSCLI\botocore\vendored\requests\cacert.pem" "c:\Program Files\Amazon\AWSCLI\certifi"
It seems to be a certificate error.
I have a python script that create a boto3 session with:
session = boto3.Session(profile_name='myprofile')
Then I try to do:
parquet_meta = subprocess.check_output(f'parquet-tools inspect {file}', shell=True)
But this returns an error saying it cannot access the S3 file or it does not exist.
I also tried to define an S3 resource using the session:
service_resource = session.resource('s3')
But neither works.
Is there a way to run that parquet-tools command against a file in S3 for the case where I need to test it locally and therefore need a profile?
I know the code is okay because if I test the parquet-tools statement using a parquet file on my local machine it returns the expected output.
Finally solved.
You can use the param --awsprofile:
parquet_meta = subprocess.check_output(f'parquet-tools inspect {file} --awsprofile myprofile', shell=True)
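As an alternative sketch (not what was used above): if parquet-tools resolves AWS credentials through boto3, which the --awsprofile flag suggests it does, the profile can also be handed to the subprocess via the AWS_PROFILE environment variable, which boto3 and the AWS CLI both honor.

import os
import subprocess

file = 's3://bucket/key.parquet'  # placeholder path, as in the question

# Pass the profile through the child process environment instead of a CLI flag
env = {**os.environ, 'AWS_PROFILE': 'myprofile'}
parquet_meta = subprocess.check_output(f'parquet-tools inspect {file}',
                                       shell=True, env=env)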
I'm facing logging issues with DockerOperator.
I'm running a Python script inside a Docker container using DockerOperator and I need Airflow to surface the logs from the Python script running inside the container. Airflow is marking the job as success, but the script inside the container is failing and I have no clue what is going on as I cannot see the logs properly. Is there a way to set up logging for DockerOperator apart from setting the tty option to True as suggested in the docs?
It looks like you can have logs pushed to XComs, but it's off by default. First, you need to pass xcom_push=True for it to at least start sending the last line of output to XCom. Then additionally, you can pass xcom_all=True to send all output to XCom, not just the last line.
Perhaps not the most convenient place to put debug information, but it's pretty accessible in the UI: either in the XCom tab when you click into a task, or on the page where you can list and filter XComs (under Browse).
Source: https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L112-L117 and https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L248-L250
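For reference, a minimal sketch of how those two flags might be set on the operator (Airflow 1.10.x, as in the linked source; the task_id, image and command below are placeholders):

from airflow.operators.docker_operator import DockerOperator

push_container_logs = DockerOperator(
    task_id='run_script',             # placeholder task id
    image='my-image:latest',          # placeholder image
    command='python test.py',         # placeholder command
    xcom_push=True,                   # push container output to XCom
    xcom_all=True,                    # push every output line, not just the last
    tty=True,
    dag=dag,                          # assumes an existing DAG object named dag
)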
Instead of DockerOperator you can use client.containers.run and then do the following:
import docker
from airflow import DAG
from airflow.decorators import task

# default_args is assumed to be defined elsewhere in the DAG file
with DAG(dag_id='dag_1',
         default_args=default_args,
         schedule_interval=None,
         tags=['my_dags']) as dag:

    @task(task_id='task_1')
    def start_task(**kwargs):
        # get the docker params from the environment
        client = docker.from_env()
        # run the container
        response = client.containers.run(
            # The container you wish to call
            image='__container__:latest',
            # The command to run inside the container
            command="python test.py",
            auto_remove=True,
            stdout=True,
            stderr=True,
            tty=True,
            detach=True,
            remove=True,
            ipc_mode='host',
            network_mode='bridge',
            # Passing the GPU access
            device_requests=[
                docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
            ],
            # Give the proper system volume mount point
            volumes=[
                'src:/src',
            ],
            working_dir='/src'
        )
        # stream the container's output back into the Airflow task log
        output = response.attach(stdout=True, stream=True, logs=True)
        for line in output:
            print(line.decode())
        return str(response)

    test = start_task()
Then in your test.py script (in the docker container) you have to do the logging using the standard Python logging module:
import logging
logger = logging.getLogger("airflow.task")
logger.info("Log something.")
Reference: here
I have had a task assigned to me to think of a way to set up a cloud function in GCP that does the following:
Monitors a Google Cloud Storage bucket for new files
Triggers when it detects a new file in the bucket
Copies that file to a directory inside a Compute Instance (Ubuntu)
I've been doing some research and am coming up empty. I know I can easily set up a cron job that syncs the bucket/directory every minute or something like that, but one of the design philosophies of the system we are building is to operate off triggers rather than timers.
Is what I am asking possible?
You can trigger a Cloud Function from a Google Cloud Storage bucket, and by selecting the Event Type to be Finalize/Create, each time a file is uploaded in the bucket, the Cloud Function will be called.
Each time a new object is created in the bucket, the cloud function will receive a notification with a Cloud Storage object format.
Now, onto the second step, I could not find any API that can upload files from Cloud Storage to an instance VM. However, I did the following as a workaround, assuming that your instance VM has a server configured that can receive HTTP requests (for example Apache or Nginx):
main.py
import requests
from google.cloud import storage


def hello_gcs(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    Args:
        data (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the file contents are sent as a request to the instance VM.
    """
    print('Bucket: {}'.format(data['bucket']))
    print('File: {}'.format(data['name']))
    client = storage.Client()
    bucket = client.get_bucket(data['bucket'])
    blob = bucket.get_blob(data['name'])
    contents = blob.download_as_string()
    headers = {
        'Content-type': 'text/plain',
    }
    # escape the literal braces so str.format only fills in the file contents
    data = '{{"text": "{}"}}'.format(contents.decode('utf-8'))
    response = requests.post('https://your-instance-server/endpoint-to-download-files', headers=headers, data=data)
    return "Request sent to your instance with the data of the object"
requirements.txt
google-cloud-storage
requests
Most likely, it would be better to just send the object name and the bucket name to your server endpoint, and from there download the files using the Cloud Client Library.
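As a rough sketch of that variant (the function name, endpoint URL and paths below are assumptions, not part of the original workaround): the Cloud Function would only post the bucket and object names, and the VM would download the object itself with the Cloud Client Library.

# Cloud Function side: send only the object coordinates
import requests

def notify_instance(data, context):
    payload = {'bucket': data['bucket'], 'name': data['name']}
    requests.post('http://<INTERNAL_INSTANCE_IP>/new-object', json=payload)

# VM side: download the object with google-cloud-storage
from google.cloud import storage

def download_object(bucket_name, object_name, destination_dir='/tmp'):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    local_path = '{}/{}'.format(destination_dir, object_name.replace('/', '_'))
    blob.download_to_filename(local_path)
    return local_path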
Now you may ask...
How do you make a Compute Engine instance handle the request?
Create a Compute Engine instance VM. Make sure it's in the same region as the cloud Function, and when creating it, allow HTTP connections to it. Documentation. I used a debian-9 image for this test.
SSH into the instance, and run the following commands:
Install apache server
sudo apt-get update
sudo apt-get install apache2
sudo apt-get install libapache2-mod-wsgi
Install these Python libraries as well:
sudo apt-get install python-pip
sudo pip install flask
Set up environment for your application:
cd ~/
mkdir app
sudo ln -sT ~/app /var/www/html/app
The last line should point to the folder path where Apache serves the index.html file from.
Create your app in /home/<user_name>/app:
main.py
from flask import Flask, request

app = Flask(__name__)


@app.route('/', methods=['POST'])
def receive_file():
    file_content = request.form['data']
    # TODO
    # Implement process to save this data onto a file
    return 'Hello from Flask!'


if __name__ == '__main__':
    app.run()
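As a hedged illustration of the TODO above (the target directory and filename are assumptions), the handler could simply write the posted data to a file on the instance:

import os

SAVE_DIR = '/home/<user_name>/app/uploads'  # assumed target directory

def save_payload(file_content, filename='received.txt'):
    # Write the posted data to a file inside SAVE_DIR
    os.makedirs(SAVE_DIR, exist_ok=True)
    with open(os.path.join(SAVE_DIR, filename), 'w') as f:
        f.write(file_content)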
Create wsgi server entrypoint, in the same directory:
main.wsgi
import sys
sys.path.insert(0, '/var/www/html/app')
from main import app as application
Add the following lines to /etc/apache2/sites-enabled/000-default.conf, after the DocumentRoot tag:
WSGIDaemonProcess flaskapp threads=5
WSGIScriptAlias / /var/www/html/app/main.wsgi
<Directory app>
WSGIProcessGroup flaskapp
WSGIApplicationGroup %{GLOBAL}
Order deny,allow
Allow from all
</Directory>
Run sudo apachectl restart. You should now be able to send POST requests to your application at the internal IP of the VM instance (you can see it in the Console, in the Compute Engine section). Once you have it, in your Cloud Function you should change the request line to:
response = requests.post('<INTERNAL_INSTANCE_IP>/', headers=headers, data=data)
return "Request sent to your instance with the data of the object"
I needed to move files from an SFTP server to my AWS account with an AWS Lambda,
and then I found this article:
https://aws.amazon.com/blogs/compute/scheduling-ssh-jobs-using-aws-lambda/
It talks about paramiko as an SSH client candidate to move files over SSH.
I then wrote this class wrapper in Python to be used from my serverless handler file:
import paramiko
import sys


class FTPClient(object):
    def __init__(self, hostname, username, password):
        """
        creates ftp connection
        Args:
            hostname (string): endpoint of the ftp server
            username (string): username for logging in on the ftp server
            password (string): password for logging in on the ftp server
        """
        try:
            self._host = hostname
            self._port = 22
            # lets you save results of the download into a log file.
            # paramiko.util.log_to_file("path/to/log/file.txt")
            self._sftpTransport = paramiko.Transport((self._host, self._port))
            self._sftpTransport.connect(username=username, password=password)
            self._sftp = paramiko.SFTPClient.from_transport(self._sftpTransport)
        except:
            print("Unexpected error", sys.exc_info())
            raise

    def get(self, sftpPath):
        """
        downloads a file from the sftp server and returns its contents
        Args:
            sftpPath = "path/to/file/on/sftp/to/be/downloaded"
        """
        localPath = "/tmp/temp-download.txt"
        self._sftp.get(sftpPath, localPath)
        self._sftp.close()
        tmpfile = open(localPath, 'r')
        return tmpfile.read()

    def close(self):
        self._sftpTransport.close()
On my local machine it works as expected (test.py):
import ftp_client

sftp = ftp_client.FTPClient(
    "host",
    "myuser",
    "password")

file = sftp.get('/testFile.txt')
print(file)
But when I deploy it with serverless and run the handler.py function (same as the test.py above) I get back the error:
Unable to import module 'handler': No module named 'paramiko'
It looks like the deployed function is unable to import paramiko (from the article above it seems like it should be available for Lambda Python 3 on AWS), shouldn't it?
If not, what's the best practice for this case? Should I include the library in my local project and package/deploy it to AWS?
A comprehensive guide/tutorial exists at:
https://serverless.com/blog/serverless-python-packaging/
It uses the serverless-python-requirements package as a Serverless node plugin.
A virtualenv and the Docker daemon will be required to package up your serverless project before deploying it on AWS Lambda.
If you use
custom:
pythonRequirements:
zip: true
in your serverless.yml, you have to use this code snippet at the start of your handler
try:
import unzip_requirements
except ImportError:
pass
All the details can be found in the Serverless Python Requirements documentation.
You have to create a virtualenv, install your dependencies and then zip all files under site-packages/:
sudo pip install virtualenv
virtualenv -p python3 myvirtualenv
source myvirtualenv/bin/activate
pip install paramiko
cp handler.py myvirtualenv/lib/python3.6/site-packages/
cd myvirtualenv/lib/python3.6/site-packages/
zip -r package.zip .
then upload package.zip to lambda
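For example (a sketch, assuming the Lambda function already exists and using a placeholder function name), the zip can be uploaded with boto3:

import boto3

lambda_client = boto3.client('lambda')

# Replace the function's code with the freshly built package
with open('package.zip', 'rb') as f:
    lambda_client.update_function_code(
        FunctionName='my-sftp-function',   # placeholder function name
        ZipFile=f.read())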
You have to provide all dependencies that are not installed in AWS' Python runtime.
Take a look at Step 7 in the tutorial. It looks like he is adding the dependencies from the virtual environment to the zip file. So I'd expect your ZIP file to contain the following:
your worker_function.py at the top level
a folder paramiko with the files installed in the virtual env
Please let me know if this helps.
I tried various blogs and guides like:
web scraping with lambda
AWS Layers for Pandas
spending hours trying things out, facing size issues like that or being unable to import modules, etc.
I nearly reached the end (that is, invoking my handler function locally), but even though my function was deployed correctly and could be invoked locally with no problems, it was then impossible to invoke it on AWS.
The most comprehensive and by far the best guide or example that actually works is the one mentioned above by @koalaok. Thanks buddy!
actual link