Python code to transfer file from GCS to SFTP server using GCP's Cloud Function - python-3.x

Hi, I am new to Python and I was wondering whether someone can help me with the following:
I need to write code in a Cloud Function to copy a .csv file from a bucket in GCS to an SFTP server.
My bucket is called 001b and the file is called test.csv. I have the hostname, username, port number and password of the SFTP server: username=uid, password=mypassword, port=22, host https://....
I am trying to create a Cloud Function with a trigger so that every time the file is created in the above bucket, it is transferred to the SFTP server. There will always be exactly one file in the bucket, as the csv is overwritten daily.
I am using the 2nd gen environment and have my trigger set to Cloud Storage with the event type google.cloud.storage.object.v1.finalized.
I really need help with the code for main.py and requirements.txt for Python 3.8.
Any help is appreciated.
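A minimal sketch of what main.py could look like, assuming the paramiko library for SFTP and the functions-framework CloudEvent signature used by 2nd gen functions; the SFTP hostname, credentials and remote directory below are placeholders, not a verified setup (note that the SFTP host should be a plain hostname, not an https:// URL):

import functions_framework
import paramiko
from google.cloud import storage

# Placeholders - in practice keep these in Secret Manager or environment variables.
SFTP_HOST = "sftp.example.com"   # plain hostname, not an https:// URL
SFTP_PORT = 22
SFTP_USER = "uid"
SFTP_PASS = "mypassword"
REMOTE_DIR = "/upload"           # assumed remote directory

@functions_framework.cloud_event
def gcs_to_sftp(cloud_event):
    # The google.cloud.storage.object.v1.finalized event carries the bucket and object name.
    data = cloud_event.data
    bucket_name = data["bucket"]
    blob_name = data["name"]      # e.g. "test.csv"

    # Download the new/overwritten object to the function's writable /tmp directory.
    local_path = f"/tmp/{blob_name}"
    storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename(local_path)

    # Open an SFTP session and push the file.
    transport = paramiko.Transport((SFTP_HOST, SFTP_PORT))
    try:
        transport.connect(username=SFTP_USER, password=SFTP_PASS)
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.put(local_path, f"{REMOTE_DIR}/{blob_name}")
        sftp.close()
    finally:
        transport.close()

and a requirements.txt along the lines of:

functions-framework==3.*
google-cloud-storage
paramiko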

Related

I am trying to connect to abfss directly(without mounting to DBFS) and trying to open json file using open() in databricks

I am trying to connect to abfss directly (without mounting to DBFS) and trying to open a json file using the open() method in Databricks.
json_file = open("abfss://#.dfs.core.windows.net/test.json") Databricks is unable to open the file present in the Azure blob container, and I get the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://#.dfs.core.windows.net/test.json'
I have done all the configuration settings using a service principal. Please suggest another way of opening the file using the abfss direct path.
The open method works only with local files - it doesn't know anything about abfss or other cloud storage. You have the following choices:
Use dbutils.fs.cp to copy the file from ADLS to the local disk of the driver node, and then work with it, like: dbutils.fs.cp("abfss:/....", "file:/tmp/my-copy")
Copy the file from ADLS to the driver node using the Azure SDK
The first method is easier to use than the second.
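A rough sketch of the first option, assuming the abfss path below is a placeholder for the real container/account and that the cluster is already configured for the storage account via the service principal:

# Copy from ADLS to the driver's local disk (dbutils is available in Databricks notebooks).
dbutils.fs.cp(
    "abfss://<container>@<account>.dfs.core.windows.net/test.json",  # placeholder path
    "file:/tmp/test.json",
)

# Now it is an ordinary local file, so open() works (note: no file: scheme here).
with open("/tmp/test.json") as f:
    json_text = f.read()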

How to start an ec2 instance using sqs and trigger a python script inside the instance

I have a python script which takes a video and converts it to a series of small panoramas. Now, there's an S3 bucket where a video will be uploaded (mp4). I need this file to be sent to the EC2 instance whenever it is uploaded.
This is the flow:
Upload video file to S3.
This should trigger EC2 instance to start.
Once it is running, I want the file to be copied to a particular directory inside the instance.
After this, I want the py file (panorama.py) to start running and read the video file from the directory and process it and then generate output images.
These output images need to be uploaded to a new bucket or the same bucket which was initially used.
Instance should terminate after this.
What I have done so far is, I have created a lambda function that is triggered whenever an object is added to that bucket. It stores the name of the file and the path. I had read that I now need to use an SQS queue and pass this name and path metadata to the queue and use the SQS to trigger the instance. And then, I need to run a script in the instance which pulls the metadata from the SQS queue and then use that to copy the file(mp4) from bucket to the instance.
How do I do this?
I am new to AWS and hence do not know much about SQS or how to transfer metadata and automatically trigger an instance, etc.
Your wording is a bit confusing. It says that you want to "start" an instance (which suggests that the instance already exists), but then it says that you want to "terminate" an instance (which would permanently remove it). I am going to assume that you actually intend to "stop" the instance so that it can be used again.
You can put a shell script in the /var/lib/cloud/scripts/per-boot/ directory. This script will then be executed every time the instance starts.
When the instance has finished processing, it can call sudo shutdown -h now to turn off the instance. (Alternatively, it can tell EC2 to stop the instance, but using shutdown is easier.)
For details, see: Auto-Stop EC2 instances when they finish a task - DEV Community
I tried to answer in the most minimalist way; there are many points below that can be further improved. I think the below is still quite a lot to take in, since you mentioned you are new to AWS.
Using AWS Lambda with Amazon S3
Amazon S3 can send an event to a Lambda function when an object is created or deleted. You configure notification settings on a bucket, and grant Amazon S3 permission to invoke a function on the function's resource-based permissions policy.
When the object is uploaded, it will trigger the Lambda function, which creates the instance with EC2 user data (Run commands on your Linux instance at launch); a rough handler sketch follows the launch code below.
For the EC2 instance, make sure you provide the necessary permissions via Using instance profiles for downloading and uploading the objects.
The user data has a script that does the rest of the work which you need for your workflow:
Download the S3 object; you can pass the object name and S3 bucket name in the same script.
Once #1 is finished, start panorama.py, which processes the videos.
In the next step you can start uploading the objects to the S3 bucket.
Eventually, terminating the instance will be a bit tricky, which you can achieve via Change the instance initiated shutdown behavior
OR
you can use the method below for terminating the instance, but in that case your EC2 instance profile must have access to terminate the instance.
ec2-terminate-instances $(curl -s http://169.254.169.254/latest/meta-data/instance-id)
You can wrap the above steps into a shell script inside the user data.
Lambda EC2 start instance:
def launch_instance(EC2, config, user_data):
    # EC2 is a boto3 EC2 client; tag_specs is assumed to be defined elsewhere in the Lambda.
    ec2_response = EC2.run_instances(
        ImageId=config['ami'],  # ami-0123b531fc646552f
        InstanceType=config['instance_type'],
        KeyName=config['ssh_key_name'],
        MinCount=1,
        MaxCount=1,
        SecurityGroupIds=config['security_group_ids'],
        TagSpecifications=tag_specs,
        # UserData=base64.b64encode(user_data).decode("ascii")
        UserData=user_data
    )

    new_instance_resp = ec2_response['Instances'][0]
    instance_id = new_instance_resp['InstanceId']
    print(f"[DEBUG] Full ec2 instance response data for '{instance_id}': {new_instance_resp}")

    return (instance_id, new_instance_resp)
Upload file to S3 -> Launch EC2 instance
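Putting the pieces together, here is a hedged sketch of a Lambda handler that could call the launch_instance helper above. The AMI, key pair, security group, paths and the contents of the user-data script are all assumptions for illustration, and tag_specs is assumed to be defined as noted in the snippet above:

import boto3

EC2 = boto3.client("ec2")

def handler(event, context):
    # S3 put event: pull out the uploaded object's bucket and key.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # User data runs at first boot: download the video, process it,
    # upload the results and shut the instance down (all paths assumed).
    user_data = f"""#!/bin/bash
aws s3 cp s3://{bucket}/{key} /home/ec2-user/input.mp4
python3 /home/ec2-user/panorama.py /home/ec2-user/input.mp4
aws s3 cp /home/ec2-user/output/ s3://{bucket}/output/ --recursive
shutdown -h now
"""

    config = {
        "ami": "ami-0123b531fc646552f",           # placeholder AMI
        "instance_type": "t3.medium",             # placeholder type
        "ssh_key_name": "my-key-pair",            # placeholder key pair
        "security_group_ids": ["sg-0123456789"],  # placeholder security group
    }
    return launch_instance(EC2, config, user_data)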

How do I use a cloud function to unzip a large file in cloud storage?

I have a cloud function which is triggered when a zip is uploaded to cloud storage and is supposed to unpack it. However, the function runs out of memory, presumably since the unzipped file is too large (~2.2 GB).
I was wondering what my options are for dealing with this problem? I read that it's possible to stream large files into cloud storage but I don't know how to do this from a cloud function or while unzipping. Any help would be appreciated.
Here is the code of the cloud function so far:
import io
from zipfile import ZipFile, is_zipfile
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucket-name")
destination_blob_filename = "large_file.zip"
blob = bucket.blob(destination_blob_filename)

# The whole zip is downloaded into memory here.
zipbytes = io.BytesIO(blob.download_as_string())

if is_zipfile(zipbytes):
    with ZipFile(zipbytes, 'r') as myzip:
        for contentfilename in myzip.namelist():
            contentfile = myzip.read(contentfilename)
            blob = bucket.blob(contentfilename)
            blob.upload_from_string(contentfile)
Your target process is risky:
If you stream the file without unzipping it completely, you can't validate the checksum of the zip
If you stream data into GCS, file integrity is not guaranteed
Thus, you have two operations without checksum validation!
Before resorting to a Cloud Function or Cloud Run with more memory, you can use a Dataflow template to unzip your files.
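If you do want to stay on a Cloud Function despite the caveats above, one possible approach (a sketch that assumes a recent google-cloud-storage release providing Blob.open(), with placeholder bucket/object names) is to stream the archive and its members instead of holding them in memory:

import shutil
import zipfile
from google.cloud import storage

def unzip_streaming(event, context):
    client = storage.Client()
    bucket = client.bucket("bucket-name")        # placeholder bucket
    zip_blob = bucket.blob("large_file.zip")     # placeholder object name

    # Blob.open("rb") returns a seekable, file-like reader, so ZipFile can
    # read members on demand instead of needing the whole archive in memory.
    with zip_blob.open("rb") as zip_stream:
        with zipfile.ZipFile(zip_stream) as archive:
            for member in archive.namelist():
                # Stream each member straight into a new object.
                with archive.open(member) as src, bucket.blob(member).open("wb") as dst:
                    shutil.copyfileobj(src, dst)

This keeps memory usage roughly at the chunk-buffer size rather than the archive size, but as noted above there is still no end-to-end checksum validation.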

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a csv file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
it throws a long Python error message and then something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
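An alternative sketch is to set the credentials on the Hadoop configuration instead of embedding them in the URL; this assumes the hadoop-aws/s3a connector is on the cluster's classpath, and the key values are placeholders. On an EC2 instance with an instance profile you can usually omit the keys entirely:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Basics').getOrCreate()

# Configure the s3a connector with the credentials (placeholders).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hconf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")

df = spark.read.csv("s3a://bucketname/filename.csv", header=True)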

Boto/Boto3: bucket.get_key(): 403 Forbidden

I am trying to connect to AWS S3 without using credentials. I attached a role with S3 full access to my instance. I want to check whether a file exists in the S3 bucket; if it doesn't, upload it. If it does, I want to check the md5sum and, if it is different from the local file, upload a new version.
I tried to get the key of the file in S3 via boto by using bucket.get_key('mykey') and got this error:
File "/usr/local/lib/python3.5/dist-packages/boto/s3/bucket.py", line 193, in get_key key, resp = self._get_key_internal(key_name, headers, query_args_l)
File "/usr/local/lib/python3.5/dist-packages/boto/s3/bucket.py", line 232, in _get_key_internal response.status, response.reason, '') boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden"
I searched and added "validate=False" when getting the bucket, but this didn't resolve my issue. I'm using Python 3.5, boto and boto3.
Here is my code:
import boto3
import boto
from boto import ec2
import os
import boto.s3.connection
from boto.s3.key import Key
bucket_name = "abc"
conn = boto.s3.connect_to_region('us-west-1', is_secure = True, calling_format = boto.s3.connection.OrdinaryCallingFormat())
bucket = conn.get_bucket(bucket_name, validate=False)
key = bucket.get_key('xxxx')
print (key)
I don't know why I get that error. Please help me clarify this problem. Thanks!
Updated
I've just found the root cause of this problem. It was caused by "The difference between the request time and the current time is too large".
That was why it couldn't get the key of the file from the S3 bucket. I updated the ntp service to synchronize the local time with UTC, and it now runs successfully.
Synchronize the time with:
sudo service ntp stop
sudo ntpdate -s 0.ubuntu.pool.ntp.org
sudo service ntp start
Thanks!
The IAM role is the last in the credential search order. I bet you have credentials stored earlier in the search order that don't have full S3 access. Check Configuration Settings and Precedence and make sure no other credentials are present, so that the IAM role is used to fetch the credentials (a quick way to verify which credentials are actually picked up is sketched after the list below). Though it is written for the CLI, it applies to scripts too.
The AWS CLI looks for credentials and configuration settings in the following order:
Command line options – region, output format and profile can be specified as command options to override default settings.
Environment variables – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
The AWS credentials file – located at ~/.aws/credentials on Linux, macOS, or Unix, or at C:\Users\USERNAME\.aws\credentials on Windows. This file can contain multiple named profiles in addition to a default profile.
The CLI configuration file – typically located at ~/.aws/config on Linux, macOS, or Unix, or at C:\Users\USERNAME\.aws\config on Windows. This file can contain a default profile, named profiles, and CLI specific configuration parameters for each.
Container credentials – provided by Amazon Elastic Container Service on container instances when you assign a role to your task.
Instance profile credentials – these credentials can be used on EC2 instances with an assigned instance role, and are delivered through the Amazon EC2 metadata service.
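To confirm which of these sources boto3 is actually resolving on the instance, a quick check (a sketch; it assumes boto3 is installed and the role allows sts:GetCallerIdentity):

import boto3

session = boto3.Session()
creds = session.get_credentials()
# e.g. "iam-role" when the instance profile is used, or "shared-credentials-file"/"env"
# when something earlier in the search order is shadowing it.
print("credential source:", creds.method)

# The identity behind those credentials:
print(boto3.client("sts").get_caller_identity()["Arn"])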
