parquet-tools from s3 cannot access file using boto3 - python-3.x

I have a Python script that creates a boto3 session with:
session = boto3.Session(profile_name='myprofile')
Then I try to do:
parquet_meta = subprocess.check_output(f'parquet-tools inspect {file}', shell=True)
But this returns an error saying it cannot access the S3 file or that it does not exist.
I also tried defining an S3 resource using the session:
service_resource = session.resource('s3')
But neither works.
Is there a way to run that parquet-tools command against S3 when I need to test it locally and therefore need a profile?
I know the code is okay, because if I test the parquet-tools statement on a Parquet file on my local machine it returns the expected output.

Finally solved it.
You can use the --awsprofile parameter:
parquet_meta = subprocess.check_output(f'parquet-tools inspect {file} --awsprofile {myprofile}', shell=True)
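For what it's worth, the boto3 session created in the script does not affect the subprocess at all. A minimal sketch of an alternative, assuming parquet-tools resolves S3 credentials through boto3 (which honors the AWS_PROFILE environment variable), is to pass the profile through the subprocess environment:
import os
import subprocess

# Sketch only: pass the profile to the child process via AWS_PROFILE instead of
# relying on the boto3 session created in the parent script.
env = dict(os.environ, AWS_PROFILE='myprofile')
parquet_meta = subprocess.check_output(
    f'parquet-tools inspect {file}',  # `file` is an s3:// URI as in the question
    shell=True,
    env=env,
)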

Related

AWS EC2 log userdata output to cloudwatch logs

I'm doing pre-processing tasks using EC2.
I execute shell commands using the userdata variable. The last line of my userdata has sudo shutdown now -h, so the instance gets terminated automatically once the pre-processing task is completed.
This is what my code looks like:
import boto3

userdata = '''#!/bin/bash
pip3 install boto3 pandas scikit-learn
aws s3 cp s3://.../main.py .
python3 main.py
sudo shutdown now -h
'''

def launch_ec2():
    ec2 = boto3.resource('ec2',
                         aws_access_key_id="",
                         aws_secret_access_key="",
                         region_name='us-east-1')
    instances = ec2.create_instances(
        ImageId='ami-0c02fb55956c7d316',
        MinCount=1,
        MaxCount=1,
        KeyName='',
        InstanceInitiatedShutdownBehavior='terminate',
        IamInstanceProfile={'Name': 'S3fullaccess'},
        InstanceType='m6i.4xlarge',
        UserData=userdata,
        InstanceMarketOptions={
            'MarketType': 'spot',
            'SpotOptions': {
                'SpotInstanceType': 'one-time',
            }
        }
    )
    print(instances)

launch_ec2()
The problem is, sometimes when there is an error in my Python script, the script dies and the instance gets terminated.
Is there a way I can collect error/info logs and send them to CloudWatch before the instance gets terminated? That way, I would know what went wrong.
You can achieve the desired behavior by leveraging bash functionality.
You could in fact create a log file for the entire execution of the UserData, and you could use trap to make sure that the log file is copied over to S3 before terminating if an error occurs.
Here's how it could look:
#!/bin/bash -xe
exec &>> /tmp/userdata_execution.log
upload_log() {
aws s3 cp /tmp/userdata_execution.log s3://... # use a bucket of your choosing here
}
trap 'upload_log' ERR
pip3 install boto3 pandas scikit-learn
aws s3 cp s3://.../main.py .
python3 main.py
sudo shutdown now -h
A log file (/tmp/userdata_execution.log) containing stdout and stderr will be generated for the UserData; if there is an error during the execution of the UserData, the log file will be uploaded to an S3 bucket.
If you wanted to, you could of course also stream the log file to CloudWatch; however, to do so you would have to install the CloudWatch agent on the instance and configure it accordingly. I believe that for your use case uploading the log file to S3 is the best solution.
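If installing the agent feels too heavy for a short-lived spot instance, a rough sketch of a lighter option (not part of the original answer; the log group and stream names are made up, and the instance profile must also allow logs:CreateLogGroup, logs:CreateLogStream and logs:PutLogEvents) is to push the log file to CloudWatch Logs with boto3 from a small helper script at the end of the UserData:
import time
import boto3

# Sketch only: ship /tmp/userdata_execution.log to CloudWatch Logs without the agent.
# Log group and stream names below are assumptions.
logs = boto3.client('logs', region_name='us-east-1')
group, stream = '/ec2/preprocessing', 'userdata-execution'

try:
    logs.create_log_group(logGroupName=group)
except logs.exceptions.ResourceAlreadyExistsException:
    pass
try:
    logs.create_log_stream(logGroupName=group, logStreamName=stream)
except logs.exceptions.ResourceAlreadyExistsException:
    pass

with open('/tmp/userdata_execution.log') as f:
    events = [{'timestamp': int(time.time() * 1000), 'message': line.rstrip() or ' '}
              for line in f]

# put_log_events accepts at most 10,000 events per call
logs.put_log_events(logGroupName=group, logStreamName=stream, logEvents=events[:10000])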

Load/access/mount directory to aws sagemaker from S3

I am a newbie to AWS S3/SageMaker. I am struggling to access my data [data meaning folders/directories, not any specific file/files] from an S3 bucket in a SageMaker Jupyter notebook.
Say, my URI is:
s3://data/sub/dir/, where dir may contain multiple directories with files. I need to access the directory (e.g., dir) in such a way that I can access any subdirectories/files from it. I tried
!aws s3 cp s3://data/sub/dir tempdata --recursive but it did not work; I get an error like:
fatal error: An error occurred (404) when calling the HeadObject operation: Key "sub/dir" does not exist.
Please advise: how can I access the directories from S3 buckets in my AWS SageMaker JupyterLab?
Or how can I mount S3 buckets to SageMaker? I also tried this link and it installed with no errors, but s3fs won't show up when I run df -h, so that did not work either. Thanks in advance.
Your cp syntax is correct.
S3 Sync could be an alternative way to get the same result, and the error response, if you got something wrong, could be more informative: !aws s3 sync s3://data/sub/dir tempdata
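If the CLI keeps returning the 404, another way to sanity-check the prefix from inside the notebook is to list and download the objects with boto3. A short sketch, where the bucket name and prefix simply mirror the example URI s3://data/sub/dir/ and will need adjusting:
import os
import boto3

# Sketch only: download everything under a prefix into a local folder.
s3 = boto3.client('s3')
bucket, prefix, local_root = 'data', 'sub/dir/', 'tempdata'

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):
            continue  # skip "folder" placeholder objects
        local_path = os.path.join(local_root, os.path.relpath(key, prefix))
        os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
        s3.download_file(bucket, key, local_path)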

How to get logs from inside the container executed using DockerOperator? (Airflow)

I'm facing logging issues with DockerOperator.
I'm running a Python script inside a Docker container using DockerOperator, and I need Airflow to spit out the logs from the Python script running inside the container. Airflow is marking the job as a success, but the script inside the container is failing, and I have no clue what is going on because I cannot see the logs properly. Is there a way to set up logging for DockerOperator apart from setting the tty option to True as suggested in the docs?
It looks like you can have logs pushed to XComs, but it's off by default. First, you need to pass xcom_push=True for it to at least start sending the last line of output to XCom. Then additionally, you can pass xcom_all=True to send all output to XCom, not just the last line.
Perhaps not the most convenient place to put debug information, but it's at least pretty accessible in the UI, either in the XCom tab when you click into a task or on the page where you can list and filter XComs (under Browse).
Source: https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L112-L117 and https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L248-L250
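For illustration, a minimal sketch of what that could look like in a DAG (the task id and image name are made up; the two parameters are the ones documented in the linked 1.10.x source):
from airflow.operators.docker_operator import DockerOperator

# Hypothetical task: capture the container's full output in XCom.
run_script = DockerOperator(
    task_id='run_script',        # made-up task id
    image='my-image:latest',     # made-up image name
    command='python test.py',
    xcom_push=True,   # push output to XCom (only the last line by default)
    xcom_all=True,    # push every line of output, not just the last one
    dag=dag,
)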
Instead of DockerOperator you can use client.containers.run and then do the following:
import docker
from airflow import DAG
from airflow.decorators import task

with DAG(dag_id='dag_1',
         default_args=default_args,
         schedule_interval=None,
         tags=['my_dags']) as dag:

    @task(task_id='task_1')
    def start_task(**kwargs):
        # get the docker params from the environment
        client = docker.from_env()
        # run the container
        response = client.containers.run(
            # The container you wish to call
            image='__container__:latest',
            # The command to run inside the container
            command="python test.py",
            version='auto',
            auto_remove=True,
            stdout=True,
            stderr=True,
            tty=True,
            detach=True,
            remove=True,
            ipc_mode='host',
            network_mode='bridge',
            # Passing the GPU access
            device_requests=[
                docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
            ],
            # Give the proper system volume mount point
            volumes=[
                'src:/src',
            ],
            working_dir='/src'
        )
        output = response.attach(stdout=True, stream=True, logs=True)
        for line in output:
            print(line.decode())
        return str(response)

    test = start_task()
Then in your test.py script (in the docker container) you have to do the logging using the standard Python logging module:
import logging
logger = logging.getLogger("airflow.task")
logger.info("Log something.")
Reference: here

Why am I getting : Unable to import module 'handler': No module named 'paramiko'?

I needed to move files from an SFTP server to my AWS account with an AWS Lambda,
and then I found this article:
https://aws.amazon.com/blogs/compute/scheduling-ssh-jobs-using-aws-lambda/
It talks about paramiko as an SSH client candidate to move files over SSH.
I then wrote this class wrapper in Python to be used from my serverless handler file:
import paramiko
import sys

class FTPClient(object):
    def __init__(self, hostname, username, password):
        """
        creates ftp connection

        Args:
            hostname (string): endpoint of the ftp server
            username (string): username for logging in on the ftp server
            password (string): password for logging in on the ftp server
        """
        try:
            self._host = hostname
            self._port = 22
            # lets you save results of the download into a log file.
            # paramiko.util.log_to_file("path/to/log/file.txt")
            self._sftpTransport = paramiko.Transport((self._host, self._port))
            self._sftpTransport.connect(username=username, password=password)
            self._sftp = paramiko.SFTPClient.from_transport(self._sftpTransport)
        except:
            print("Unexpected error", sys.exc_info())
            raise

    def get(self, sftpPath):
        """
        downloads a file from the sftp server and returns its contents

        Args:
            sftpPath = "path/to/file/on/sftp/to/be/downloaded"
        """
        localPath = "/tmp/temp-download.txt"
        self._sftp.get(sftpPath, localPath)
        self._sftp.close()
        tmpfile = open(localPath, 'r')
        return tmpfile.read()

    def close(self):
        self._sftpTransport.close()
On my local machine it works as expected (test.py):
import ftp_client
sftp = ftp_client.FTPClient(
"host",
"myuser",
"password")
file = sftp.get('/testFile.txt')
print(file)
But when I deploy it with serverless and run the handler.py function (same as the test.py above) I get back the error:
Unable to import module 'handler': No module named 'paramiko'
It looks like the deployment is unable to import paramiko (from the article above it seems like it should be available for Lambda Python 3 on AWS), shouldn't it?
If not, what's the best practice for this case? Should I include the library in my local project and package/deploy it to AWS?
A comprehensive guide/tutorial exists at:
https://serverless.com/blog/serverless-python-packaging/
It uses the serverless-python-requirements package
as a Serverless node plugin.
Creating a virtual env and a running Docker daemon will be required to pack up your serverless project before deploying to AWS Lambda.
In case you use
custom:
  pythonRequirements:
    zip: true
in your serverless.yml, you have to use this code snippet at the start of your handler:
try:
    import unzip_requirements
except ImportError:
    pass
All details can be found in the Serverless Python Requirements documentation.
You have to create a virtualenv, install your dependencies, and then zip all files under site-packages/:
sudo pip install virtualenv
virtualenv -p python3 myvirtualenv
source myvirtualenv/bin/activate
pip install paramiko
cp handler.py myvirtualenv/lib/python3.6/site-packages/
cd myvirtualenv/lib/python3.6/site-packages/
zip -r package.zip .
Then upload package.zip to Lambda.
You have to provide all dependencies that are not installed in AWS' Python runtime.
Take a look at Step 7 in the tutorial. It looks like he is adding the dependencies from the virtual environment to the zip file. So I'd expect your ZIP file to contain the following:
your worker_function.py at the top level
a folder paramiko with the files installed in the virtual env
Please let me know if this helps.
I tried various blogs and guides like:
web scraping with lambda
AWS Layers for Pandas
spending hours trying things out, facing SIZE issues like that or being unable to import modules, etc.
... and I nearly reached the end (that is, invoking my handler function LOCALLY), but even though my function was fully deployed correctly and could even be invoked LOCALLY with no problems, it was impossible to invoke it on AWS.
The most comprehensive and by far the best guide or example that is ACTUALLY working is the one mentioned above by @koalaok! Thanks buddy!
actual link

exec sh from PySpark

I'm trying to run a .sh file from a .py file in a PySpark job, but I always receive a message saying that the .sh file is not found.
This is my code:
test.py:
import os,sys
os.system("sh ./check.sh")
and my gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster mserver file:///home/myuser/test.py
The test.py file loads fine, but the system can't find the check.sh file.
I figure it is something related to the file's path, but I'm not sure.
I also tried os.system("sh home/myuser/check.sh") with the same result.
I think this should be easy to do, so ... ideas?
The "current working directory" used by Dataproc jobs submitted through the API is a temporary directory with a unique name for each job; if the file wasn't uploaded with the job itself, you'll have to access it using your absolute path.
If you indeed added the check.sh file manually to /home/myuser/check.sh, then you should be able to call it using the fully qualified path, os.system("sh /home/myuser/check.sh"); make sure to start your absolute path with a /.
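For illustration, a small sketch of that fix, assuming check.sh really does live at /home/myuser/check.sh on the node running the driver:
import os
import subprocess

# Sketch only: call the script by its absolute path instead of relying on the
# job's temporary working directory; print a hint if it isn't there.
script = "/home/myuser/check.sh"
if os.path.exists(script):
    result = subprocess.run(["sh", script], capture_output=True, text=True)
    print(result.stdout)
else:
    print(f"{script} not found on this node")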
