Why am I getting: Unable to import module 'handler': No module named 'paramiko'? - python-3.x

I need to move files from an SFTP server to my AWS account with an AWS Lambda function, and I found this article:
https://aws.amazon.com/blogs/compute/scheduling-ssh-jobs-using-aws-lambda/
It suggests paramiko as an SSH client candidate for moving files over SSH.
I then wrote this class wrapper in Python to be used from my serverless handler file:
import paramiko
import sys


class FTPClient(object):
    def __init__(self, hostname, username, password):
        """
        Creates the SFTP connection.
        Args:
            hostname (string): endpoint of the ftp server
            username (string): username for logging in on the ftp server
            password (string): password for logging in on the ftp server
        """
        try:
            self._host = hostname
            self._port = 22
            # lets you save results of the download into a log file.
            # paramiko.util.log_to_file("path/to/log/file.txt")
            self._sftpTransport = paramiko.Transport((self._host, self._port))
            self._sftpTransport.connect(username=username, password=password)
            self._sftp = paramiko.SFTPClient.from_transport(self._sftpTransport)
        except:
            print("Unexpected error", sys.exc_info())
            raise

    def get(self, sftpPath):
        """
        Downloads a file from the SFTP server and returns its contents.
        Args:
            sftpPath = "path/to/file/on/sftp/to/be/downloaded"
        """
        localPath = "/tmp/temp-download.txt"
        self._sftp.get(sftpPath, localPath)
        self._sftp.close()
        with open(localPath, 'r') as tmpfile:
            return tmpfile.read()

    def close(self):
        self._sftpTransport.close()
On my local machine it works as expected (test.py):
import ftp_client
sftp = ftp_client.FTPClient(
    "host",
    "myuser",
    "password")
file = sftp.get('/testFile.txt')
print(file)
But when I deploy it with Serverless and run the handler.py function (same as the test.py above), I get back the error:
Unable to import module 'handler': No module named 'paramiko'
It looks like the deployment is unable to import paramiko (from the article above it seems it should be available for the Python 3 Lambda runtime on AWS), shouldn't it?
If not, what's the best practice for this case? Should I include the library in my local project and package/deploy it to AWS?

A comprehensive tutorial exists at:
https://serverless.com/blog/serverless-python-packaging/
It uses the serverless-python-requirements package as a Serverless (Node.js) plugin.
Creating a virtual environment and running a Docker daemon will be required to package up your serverless project before deploying it to AWS Lambda.
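As an illustration, a minimal serverless.yml sketch for that setup (the service name, function name, and handler path below are placeholders, not from the original post):
service: sftp-mover

provider:
  name: aws
  runtime: python3.6

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: true   # build native dependencies (e.g. paramiko's cryptography) inside Docker

functions:
  mover:
    handler: handler.handler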

In the case you use
custom:
  pythonRequirements:
    zip: true
in your serverless.yml, you have to use this code snippet at the start of your handler:
try:
    import unzip_requirements
except ImportError:
    pass
All the details can be found in the Serverless Python Requirements documentation.

You have to create a virtualenv, install your dependencies, and then zip everything under site-packages/ together with your handler:
sudo pip install virtualenv
virtualenv -p python3 myvirtualenv
source myvirtualenv/bin/activate
pip install paramiko
cd myvirtualenv/lib/python3.6/site-packages/
zip -r9 ../../../../package.zip .
cd ../../../../
zip -g package.zip handler.py
Then upload package.zip to Lambda.
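For example, the archive can be uploaded from the command line with the AWS CLI (the function name below is a placeholder):
aws lambda update-function-code --function-name my-sftp-function --zip-file fileb://package.zip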

You have to provide all dependencies that are not installed in AWS' Python runtime.
Take a look at Step 7 in the tutorial. It looks like the author is adding the dependencies from the virtual environment to the zip file. So I'd expect your ZIP file to contain the following:
your worker_function.py at the top level
a folder paramiko with the files installed in the virtual env
Please let me know if this helps.

I tried various blogs and guides like:
web scraping with lambda
AWS Layers for Pandas
spending hours trying things out, facing SIZE issues and being unable to import modules, etc.
... and I nearly reached the end (that is, invoking my handler function LOCALLY), but even though my function was fully deployed correctly and invoked LOCALLY with no problems, it was then impossible to invoke it on AWS.
The most comprehensive and by far the best guide or example that is ACTUALLY working is the one mentioned above by @koalaok! Thanks buddy!
actual link

Related

How can we run Google App Engine with Python 3 and ndb locally?

I am using Python Google App Engine.
Could you tell me how I can run a Python 3 Google App Engine app with ndb on a local system?
https://cloud.google.com/appengine/docs/standard/python3
Please try this:
Go to the service account page: https://cloud.google.com/docs/authentication/getting-started
Create a JSON key file and install this pip package:
$ pip install google-cloud-ndb
Now open a Linux terminal and set:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
If on Windows, open a command prompt and set:
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\credentials.json
Then run this code in Python 3 in your terminal/command prompt:
from google.cloud import ndb

# a minimal model matching the example data (assumed; adjust to your own schema)
class Contact(ndb.Model):
    name = ndb.StringProperty()
    phone = ndb.StringProperty()
    email = ndb.StringProperty()

client = ndb.Client()
with client.context():
    contact1 = Contact(name="John Smith",
                       phone="555 617 8993",
                       email="john.smith@gmail.com")
    contact1.put()
You will then see the result in Datastore in the Google Cloud console.
App Engine is a serverless service provided by Google Cloud Platform where you can deploy your applications and configure Cloud resources like the instances' CPU, memory, scaling method, etc. It provides the architecture to run your app.
This service is not meant to be used on local environments. Instead, it is a great option for hosting an application that (ideally) has been tested on local environments.
In other words: you don't run a Django application with Datastore dependencies using App Engine locally; you run a Django application with Datastore (and other) dependencies locally and then deploy it to App Engine once it is ready.
Most GCP services have their own client libraries so we can interact with them via code, even in local environments. The ndb library you asked about belongs to Google Cloud Datastore and can be installed in Python environments with:
pip install google-cloud-ndb
After installing it, you will be ready to interact with Datastore locally. Please find details about setting up credentials and code snippets in the Datastore Python Client Library reference.
Hope this is helpful! :)
You can simply create an emulator instance of Datastore on your local machine:
gcloud beta emulators datastore start --project test --host-port "0.0.0.0:8002" --no-store-on-disk --consistency=1
and then use it in the code in your main app file:
import google.auth.credentials
from google.cloud import ndb

import config                    # your own settings module
from config import ENVIRONMENTS  # your own environment constants


def get_ndb_client(namespace):
    if config.ENVIRONMENT != ENVIRONMENTS.LOCAL:
        # production
        db = ndb.Client(namespace=namespace)
    else:
        # localhost
        import mock
        credentials = mock.Mock(spec=google.auth.credentials.Credentials)
        db = ndb.Client(project="test", credentials=credentials, namespace=namespace)
    return db
ndb_client = get_ndb_client("ns1")
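One practical note (an assumption about how the google-cloud-ndb client discovers the emulator, not something stated in the answer above): the client reads the DATASTORE_EMULATOR_HOST environment variable, so export it with the host/port and project used when starting the emulator before running the app locally, for example:
export DATASTORE_EMULATOR_HOST=0.0.0.0:8002
export DATASTORE_PROJECT_ID=test
python3 main.py   # main.py is a placeholder for your app's entry point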

waitress + flask + gcloud: how to set up the server

I have been trying to deploy a basic app to Google App Engine (because Azure is an extortion) for the past few days. I have learned that Gunicorn does not work on Windows systems and that the alternative is waitress. I read all the answers related to the subject here before I posted this question!!!
So I have been trying different setups, reading about it, and I still can't get it running. My field is data science, but deployment seems to be obligatory nowadays. If someone can help me out, it would be very appreciated.
app.py file
from flask import Flask, render_template, request
from waitress import serve

app = Flask(__name__)

@app.route('/')
def index():
    name = request.args.get("name")
    if name is None:
        name = "Reinhold"
    return render_template("index.html", name=name)

if __name__ == '__main__':
    # app.run(debug=True)
    serve(app, host='0.0.0.0', port=8080)
gcloud app deploy will look for gunicorn to start the deployment, which is configured in the app.yaml file. I tried different setups there and ended up setting the entrypoint to None, since in my humble view Flask would look for an alternative. Though I still think it would be better to set up the waitress server there.
app.yaml file
runtime: python37
#entrypoint: None
entrypoint: waitress-serve --listen=*:8080 serve:app
gcloud will also look for an appengine_config.py file, where it will find the dependencies (I think):
from google.appengine.ext import vendor
vendor.add('venv\Lib')
The requirements.txt file will be the following:
astroid==2.3.3
autopep8==1.4.4
Click==7.0
colorama==0.4.3
dominate==2.4.0
Flask==1.1.1
Flask-Bootstrap==3.3.7.1
Flask-WTF==0.14.2
isort==4.3.21
itsdangerous==1.1.0
Jinja2==2.10.3
lazy-object-proxy==1.4.3
MarkupSafe==1.1.1
mccabe==0.6.1
pycodestyle==2.5.0
pylint==2.4.4
six==1.13.0
typed-ast==1.4.1
visitor==0.1.3
waitress==1.4.2
Werkzeug==0.16.0
wrapt==1.11.2
WTForms==2.2.1
In the Google console I could access the log viewer to see what was going wrong during the deployment, and this is what I got from the code I shared here:
{
  insertId: "5e1e9b4500029d71f92c1db9"
  labels: {…}
  logName: "projects/bokehflaskgcloud/logs/stderr"
  receiveTimestamp: "2020-01-15T04:55:33.288839846Z"
  resource: {…}
  textPayload: "/bin/sh: 1: exec: None: not found"
  timestamp: "2020-01-15T04:55:33.171377Z"
}
If someone could help solve this, that would be great, because Google seems to be a good alternative for deploying some work. Azure and VS Code have a good interaction, so it isn't as hard to deploy there, but the cost after the trial is insane.
This is what I get once I try to deploy the application:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Easily run your Flask app using Gunicorn:
runtime: python37
entrypoint: gunicorn -b :$PORT main:app
You need to add gunicorn to your requirements.txt.
Check this documentation on how to define application startup in Python 3.
Make sure that you run your app using the Flask run method in case you want to test your app locally:
if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080, debug=True)
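If you would rather keep waitress instead of switching to Gunicorn, a hedged app.yaml sketch (assuming the Flask object is named app and lives in app.py as in the question, and that waitress stays in requirements.txt):
runtime: python37
entrypoint: waitress-serve --port=$PORT app:app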
appengine_config.py is not used in Python 3. The Python 2 runtime uses this file to install client libraries and provide values for constants and "hook functions". The Python 3 runtime doesn't use this file.
In the app.py file there is no mention of the flask library.
Please add the following import at line 2:
from flask import Flask, request, render_template

Unable to run tasks on Azure Batch: nodes go into an unusable state after starting up

I am trying to parallelize a Python app using Azure Batch. The workflow that I have followed in the Python client-side script is:
1) Upload local files to an Azure Blob container using the blobxfer utility (input-container).
2) Start the Batch service to process the files in input-container after logging in with the service principal account via azure-cli.
3) Upload the files to output-container through the Python app distributed across the nodes by Azure Batch.
I am experiencing a problem very similar to the one I read about here, but unfortunately no solution was given in that post:
Nodes go into Unusable State
I will now give the relevant information so that one can reproduce this error:
The image that was used for Azure Batch is custom.
1) Ubuntu Server 18.04 LTS was chosen as the OS for the VM and the following ports were opened: ssh, http, https. The rest of the settings were kept at their defaults in the Azure portal.
2) The following script was run once the server was available:
sudo apt-get install build-essential checkinstall -y
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev \
    libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev -y
cd /usr/src
sudo wget https://www.python.org/ftp/python/3.6.6/Python-3.6.6.tgz
sudo tar xzf Python-3.6.6.tgz
cd Python-3.6.6
sudo ./configure --enable-optimizations
sudo make altinstall
sudo pip3.6 install --upgrade pip
sudo pip3.6 install pymupdf==1.13.20
sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47
3) An image of this server was created using the process outlined in this link:
Creating VM Image in Azure Linux
Also, during deprovisioning the user was not deleted: sudo waagent -deprovision
4) The Resource Id of the image was noted from the Azure portal. This will be supplied as one of the parameters in the Python client-side script.
The packages installed on the client-side server where the Python script for Batch would run:
sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47
sudo pip3.6 install pandas==0.22.0
The Resources used during Azure Batch were created in the following way:
1) A service principal account with contributor privileges was created using the command:
$az ad sp create-for-rbac --name <SERVICE-PRINCIPAL-ACCOUNT>
2) The resource group, Batch account, and storage account associated with the Batch account were created in the following way:
$ az group create --name <RESOURCE-GROUP-NAME> --location eastus2
$ az storage account create --resource-group <RESOURCE-GROUP-NAME> --name <STORAGE-ACCOUNT-NAME> --location eastus2 --sku Standard_LRS
$ az batch account create --name <BATCH-ACCOUNT-NAME> --storage-account <STORAGE-ACCOUNT-NAME> --resource-group <RESOURCE-GROUP-NAME> --location eastus2
The client-side Python script which initiates the upload and processing:
(Update 3)
import subprocess
import os
import time
import datetime
import tqdm
import pandas
import sys
import fitz
import parmap
import numpy as np
import sentry_sdk
import multiprocessing as mp
def batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path):
try:
subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Login Credentials")
sys.exit("Invalid Azure Login Credentials")
dir_flag=False
while dir_flag==False:
try:
no_of_dir=input("Enter the number of directories to upload:")
no_of_dir=int(no_of_dir)
if no_of_dir<0:
print("\nRetry:Enter an integer value")
else:
dir_flag=True
except ValueError:
print("\nRetry:Enter an integer value")
dir_path_list=[]
for dir in range(no_of_dir):
path_exists=False
while path_exists==False:
dir_path=input("\nEnter the local absolute path of the directory no.{}:".format(dir+1))
print("\n")
dir_path=dir_path.replace('"',"")
path_exists=os.path.isdir(dir_path)
if path_exists==True:
dir_path_list.append(dir_path)
else:
print("\nRetry:Enter a valid directory path")
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
input_azure_container="pdf-processing-input"+"-"+timestamp_humanreadable
try:
subprocess.check_output(["az","storage","container","create","--name",input_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Storage Credentials.")
sys.exit("Invalid Azure Storage Credentials.")
log_file_path=os.path.join(log_dir_path,"upload-logs"+"-"+timestamp_humanreadable+".txt")
dir_upload_success=[]
dir_upload_failure=[]
for dir in tqdm.tqdm(dir_path_list,desc="Uploading Directories"):
try:
subprocess.check_output(["blobxfer","upload","--remote-path",input_azure_container,"--storage-account",azure_storage_account,\
"--enable-azure-storage-logger","--log-file",\
log_file_path,"--storage-account-key",azure_storage_account_key,"--local-path",dir])
dir_upload_success.append(dir)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Failed to upload directory: {}".format(dir))
dir_upload_failure.append(dir)
return(input_azure_container)
def query_azure_storage(azure_storage_container,azure_storage_account,azure_storage_account_key,blob_file_path):
try:
blob_list=subprocess.check_output(["az","storage","blob","list","--container-name",azure_storage_container,\
"--account-key",azure_storage_account_key,"--account-name",azure_storage_account,"--auth-mode","login","--output","tsv"])
blob_list=blob_list.decode("utf-8")
with open(blob_file_path,"w") as f:
f.write(blob_list)
blob_df=pandas.read_csv(blob_file_path,sep="\t",header=None)
blob_df=blob_df.iloc[:,3]
blob_df=blob_df.to_frame(name="container_files")
blob_df=blob_df.assign(container=azure_storage_container)
return(blob_df)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Storage Credentials")
sys.exit("Invalid Azure Storage Credentials.")
def analyze_files_for_tasks(data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,download_folder):
try:
blob_df=data_split
some_calculation_factor=2
analyzed_azure_blob_df=pandas.DataFrame()
analyzed_azure_blob_df=analyzed_azure_blob_df.assign(container="empty",container_files="empty",pages="empty",max_time="empty")
for index,row in blob_df.iterrows():
file_to_analyze=os.path.join(download_folder,row["container_files"])
subprocess.check_output(["az","storage","blob","download","--container-name",azure_storage_container,"--file",file_to_analyze,"--name",row["container_files"],\
"--account-name",azure_storage_account,"--auth-mode","key"]) #Why does login auth not work for this while we are multiprocessing
doc=fitz.open(file_to_analyze)
page_count=doc.pageCount
analyzed_azure_blob_df=analyzed_azure_blob_df.append([{"container":azure_storage_container,"container_files":row["container_files"],"pages":page_count,"max_time":some_calculation_factor*page_count}])
doc.close()
os.remove(file_to_analyze)
return(analyzed_azure_blob_df)
except Exception as e:
sentry_sdk.capture_exception(e)
def estimate_task_completion_time(azure_storage_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path):
try:
cores=mp.cpu_count() #Number of CPU cores on your system
partitions = cores-2
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
file_download_location=os.path.join(azure_blob_downloads_file_path,"Blob_Download"+"-"+timestamp_humanreadable)
os.mkdir(file_download_location)
data_split = np.array_split(azure_blob_df,indices_or_sections=partitions,axis=0)
analyzed_azure_blob_df=pandas.concat(parmap.map(analyze_files_for_tasks,data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,file_download_location,\
pm_pbar=True,pm_processes=partitions))
analyzed_azure_blob_df=analyzed_azure_blob_df.reset_index(drop=True)
return(analyzed_azure_blob_df)
except Exception as e:
sentry_sdk.capture_exception(e)
sys.exit("Unable to Estimate Job Completion Status")
def azure_batch_create_pool(azure_storage_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df):
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
pool_id="pdf-processing"+"-"+timestamp_humanreadable
try:
subprocess.check_output(["az","batch","account","login","--name", azure_batch_account,"--resource-group",azure_resource_group])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to log into the Batch account")
sys.exit("Unable to log into the Batch account")
#Pool autoscaling formula would go in here
try:
subprocess.check_output(["az","batch","pool","create","--account-endpoint",azure_batch_account_endpoint, \
"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--id",pool_id,\
"--node-agent-sku-id","batch.node.ubuntu 18.04",\
"--image",vm_image_name,"--target-low-priority-nodes",str(no_nodes),"--vm-size",vm_compute_size])
return(pool_id)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))
sys.exit("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))
def azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
job_id="pdf-processing-job"+"-"+timestamp_humanreadable
try:
subprocess.check_output(["az","batch","job","create","--account-endpoint",azure_batch_account_endpoint,"--account-key",\
azure_batch_account_key,"--account-name",azure_batch_account,"--id",job_id,"--pool-id",pool_info])
return(job_id)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to create a Job on the Pool :{}".format(pool_info))
sys.exit("Unable to create a Job on the Pool :{}".format(pool_info))
def azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,azure_storage_container,analyzed_azure_blob_df):
print("\n")
for i in tqdm.tqdm(range(180),desc="Waiting for the Pool to Warm-up"):
time.sleep(1)
successful_task_list=[]
unsuccessful_task_list=[]
input_azure_container=azure_storage_container
output_azure_container= "pdf-processing-output"+"-"+input_azure_container.split("-input-")[-1]
try:
subprocess.check_output(["az","storage","container","create","--name",output_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to create an output container")
sys.exit("Unable to create an output container")
print("\n")
pbar = tqdm.tqdm(total=analyzed_azure_blob_df.shape[0],desc="Creating and distributing Tasks")
for index,row in analyzed_azure_blob_df.iterrows():
try:
task_info="mytask-"+str(index)
subprocess.check_output(["az","batch","task","create","--task-id",task_info,"--job-id",job_info,"--command-line",\
"python3 /home/avadhut/pdf_processing.py {} {} {}".format(input_azure_container,output_azure_container,row["container_files"])])
pbar.update(1)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("unable to create the Task: mytask-{}".format(index))
pbar.update(1)
pbar.close()
def wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df):
try:
print(analyzed_azure_blob_df)
nrows_tasks_df=analyzed_azure_blob_df.shape[0]
print("\n")
pbar=tqdm.tqdm(total=nrows_tasks_df,desc="Waiting for task to complete")
for index,row in analyzed_azure_blob_df.iterrows():
task_list=subprocess.check_output(["az","batch","task","list","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,\
"--output","tsv"])
task_list=task_list.decode("utf-8")
with open(task_file_path,"w") as f:
f.write(task_list)
task_df=pandas.read_csv(task_file_path,sep="\t",header=None)
task_df=task_df.iloc[:,21]
active_task_list=[]
for x in task_df:
if x =="active":
active_task_list.append(x)
if len(active_task_list)>0:
time.sleep(row["max_time"]) #This time can be changed in accordance with the time taken to complete each task
pbar.update(1)
continue
else:
pbar.close()
return("success")
pbar.close()
return("failure")
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Error in retrieving task status")
def azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info):
try:
subprocess.check_output(["az","batch","job","delete","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to delete Job-{}".format(job_info))
def azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
try:
subprocess.check_output(["az","batch","pool","delete","--pool-id",pool_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to delete Pool--{}".format(pool_info))
if __name__=="__main__":
print("\n")
print("-"*40+"Azure Batch processing POC"+"-"*40)
print("\n")
#Credentials and initializations
sentry_sdk.init(<SENTRY-CREDENTIALS>) #Sign-up for a Sentry trail account
azure_username=<AZURE-USERNAME>
azure_password=<AZURE-PASSWORD>
azure_tenant=<AZURE-TENANT>
azure_resource_group=<RESOURCE-GROUP-NAME>
azure_storage_account=<STORAGE-ACCOUNT-NAME>
azure_storage_account_key=<STORAGE-KEY>
azure_batch_account_endpoint=<BATCH-ENDPOINT>
azure_batch_account_key=<BATCH-ACCOUNT-KEY>
azure_batch_account=<BATCH-ACCOUNT-NAME>
vm_image_name=<VM-IMAGE>
vm_compute_size="Standard_A4_v2"
no_nodes=2
log_dir_path="/home/user/azure_batch_upload_logs/"
azure_blob_downloads_file_path="/home/user/blob_downloads/"
blob_file_path="/home/user/azure_batch_upload.tsv"
task_file_path="/home/user/azure_task_list.tsv"
input_azure_container=batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path)
azure_blob_df=query_azure_storage(input_azure_container,azure_storage_account,azure_storage_account_key,blob_file_path)
analyzed_azure_blob_df=estimate_task_completion_time(input_azure_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path)
pool_info=azure_batch_create_pool(input_azure_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df)
job_info=azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,input_azure_container,analyzed_azure_blob_df)
task_status=wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df)
if task_status=="success":
azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
print("\n\n")
sys.exit("Job Complete")
else:
azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
print("\n\n")
sys.exit("Job Unsuccessful")
Command used to create the zip file:
zip pdf_process_1.zip pdf_processing.py
The Python app that was packaged in the zip file and uploaded to Batch through the client-side script:
(Update 3)
import os
import fitz
import subprocess
import argparse
import time
from tqdm import tqdm
import sentry_sdk
import sys
import datetime
def azure_active_directory_login(azure_username,azure_password,azure_tenant):
try:
azure_login_output=subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Login Credentials")
sys.exit("Invalid Azure Login Credentials")
def download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path):
file_to_download=os.path.join(input_azure_container,file_to_process)
try:
subprocess.check_output(["az","storage","blob","download","--container-name",input_azure_container,"--file",os.path.join(pdf_docs_path,file_to_process),"--name",file_to_process,"--account-key",azure_storage_account_key,\
"--account-name",azure_storage_account,"--auth-mode","login"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("unable to download the pdf file")
sys.exit("unable to download the pdf file")
def pdf_to_png(input_folder_path,output_folder_path):
pdf_files=[x for x in os.listdir(input_folder_path) if x.endswith((".pdf",".PDF"))]
pdf_files.sort()
for pdf in tqdm(pdf_files,desc="pdf--->png"):
doc=fitz.open(os.path.join(input_folder_path,pdf))
page_count=doc.pageCount
for f in range(page_count):
page=doc.loadPage(f)
pix = page.getPixmap()
if pdf.endswith(".pdf"):
png_filename=pdf.split(".pdf")[0]+"___"+"page---"+str(f)+".png"
pix.writePNG(os.path.join(output_folder_path,png_filename))
elif pdf.endswith(".PDF"):
png_filename=pdf.split(".PDF")[0]+"___"+"page---"+str(f)+".png"
pix.writePNG(os.path.join(output_folder_path,png_filename))
def upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path):
try:
subprocess.check_output(["az","storage","blob","upload-batch","--destination",output_azure_container,"--source",png_docs_path,"--account-key",azure_storage_account_key,\
"--account-name",azure_storage_account,"--auth-mode","login"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to upload file to the container")
if __name__=="__main__":
#Credentials
sentry_sdk.init(<SENTRY-CREDENTIALS>)
azure_username=<AZURE-USERNAME>
azure_password=<AZURE-PASSWORD>
azure_tenant=<AZURE-TENANT>
azure_storage_account=<AZURE-STORAGE-NAME>
azure_storage_account_key=<AZURE-STORAGE-KEY>
try:
parser = argparse.ArgumentParser()
parser.add_argument("input_azure_container",type=str,help="Location to download files from")
parser.add_argument("output_azure_container",type=str,help="Location to upload files to")
parser.add_argument("file_to_process",type=str,help="file link in azure blob storage")
args = parser.parse_args()
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
task_working_dir=os.getcwd()
file_to_process=args.file_to_process
input_azure_container=args.input_azure_container
output_azure_container=args.output_azure_container
pdf_docs_path=os.path.join(task_working_dir,"pdf_files"+"-"+timestamp_humanreadable)
png_docs_path=os.path.join(task_working_dir,"png_files"+"-"+timestamp_humanreadable)
os.mkdir(pdf_docs_path)
os.mkdir(png_docs_path)
except Exception as e:
sentry_sdk.capture_exception(e)
azure_active_directory_login(azure_username,azure_password,azure_tenant)
download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path)
pdf_to_png(pdf_docs_path,png_docs_path)
upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path)
Update 1:
I have solved the error of the server nodes going into an unusable state. The way I solved this issue is:
1) I did not use the commands I mentioned above to set up a Python 3.6 environment on Ubuntu, as Ubuntu 18.04 LTS comes with its own Python 3 environment. Initially I had googled "Install Python 3 on Ubuntu" and had gotten this Python 3.6 installation on Ubuntu link. I avoided this step completely during the server set-up.
All I did was install these packages this time.
sudo apt-get install -y python3-pip
sudo -H pip3 install tqdm==4.19.9
sudo -H pip3 install sentry-sdk==0.4.1
sudo -H pip3 install blobxfer==1.5.0
sudo -H pip3 install pandas==0.22.0
The Azure CLI was installed on the machine using the commands in this link:
Install Azure CLI with apt
2) I created a snapshot of the OS disk, then created an image from this snapshot, and finally referenced this image in the client-side script.
I am now faced with another issue, where the stderr.txt files on the node tell me:
python3: can't open file '$AZ_BATCH_APP_PACKAGE_pdfprocessingapp/pdf_processing.py': [Errno 2] No such file or directory
Logging in to the server with the random user, I see that the directory _azbatch is created but there are no contents inside it.
I know for certain that it is in the command line of the azure_batch_create_task() function that things are going haywire, but I am not able to put my finger on it. I have done everything that the docs recommend: Install app packages to Azure Batch Compute Nodes. Please review my client-side Python script and let me know what I am doing wrong!
Edit 3:
The problem looks very similar to the one described in this post:
Unable to pass app path to Tasks
Update 2:
I was able to overcome the file/directory not found error using a dirty hack which I am not particularly fond of: I placed the Python app in the home directory of the user that was used to create the VM, and all the directories required for processing were created in the working directory of the task.
I would still want to know how to run the workflow by deploying the app to the node the application package way.
Update 3
I have updated the client-side code and the Python app to reflect the latest changes made. The significant parts are the same...
I will comment on the points that @fparks has raised.
The original Python app that I intend to use in Azure Batch contains many modules, some config files, and a quite lengthy requirements.txt file for Python packages. Azure also recommends using a custom image in such cases.
Also, downloading the Python modules per task is a bit irrational in my case, as 1 task equals one multipage PDF and my expected workload is 25k multipage PDFs.
I used the CLI because the docs for the Python SDK were sparse and hard to follow. The nodes going into an unusable state has been solved. I do agree with you on the blobxfer error.
Answers and a few observations:
It is unclear to me why you need a custom image. You can use a platform image, i.e., Canonical, UbuntuServer, 18.04-LTS, and then just install what you need as part of the start task. Python3.6 can simply be installed via apt in 18.04. You may be prematurely optimizing your workflow by opting for a custom image when in fact using a platform image + start task may be faster and stable.
Your script is in Python, yet you are calling out to the Azure CLI. You may want to consider directly using the Azure Batch Python SDK instead (samples).
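For illustration, a rough sketch of what job and task creation could look like with the Batch Python SDK (untested; the account values are placeholders, and the batch_url keyword may be named base_url in older azure-batch releases):
from azure.batch import BatchServiceClient
from azure.batch import batch_auth
import azure.batch.models as batchmodels

# placeholder credentials -- substitute your own Batch account values
credentials = batch_auth.SharedKeyCredentials("<BATCH-ACCOUNT-NAME>", "<BATCH-ACCOUNT-KEY>")
client = BatchServiceClient(credentials, batch_url="https://<BATCH-ACCOUNT-ENDPOINT>")

# create a job bound to an existing pool
client.job.add(batchmodels.JobAddParameter(
    id="pdf-processing-job",
    pool_info=batchmodels.PoolInformation(pool_id="pdf-processing")))

# add one task per input file
client.task.add(
    job_id="pdf-processing-job",
    task=batchmodels.TaskAddParameter(
        id="mytask-0",
        command_line="python3 pdf_processing.py <in-container> <out-container> <file>"))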
When nodes go unusable, you should first examine the node for errors. You should see if the ComputeNodeError field is populated. Additionally, you can try to fetch stdout.txt and stderr.txt files from the startup directory to diagnose what's going on. You can do both of these actions in the Azure Portal or via Batch Explorer. If that doesn't work, you can fetch the compute node service logs and file a support request. However, typically unusable means that your custom image was provisioned incorrectly, you have a virtual network with an NSG misconfigured, or you have an application package that is incorrect.
Your application package consists of a single python file; instead use a resource file. Simply upload the script to Azure Storage blob and reference it in your task as a Resource File using a SAS URL. See the --resource-files argument in az batch task create if using the CLI. Your command to invoke would then simply be python3 pdf_processing.py (assuming you keep the resource file downloading to the task working directory).
If you insist on using an application package, consider using a task application package instead. This will decouple your node startup issues potentially originating from bad application packages to debugging task executions instead.
The blobxfer error is pretty clear. Your locale is not set properly. The easy way to fix this is to set the environment variables for the task. See the --environment-settings argument if using the CLI and set two environment variables LC_ALL=C.UTF-8 and LANG=C.UTF-8 as part of your task.
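As a concrete illustration of that last point (hedged; double-check the exact flag format against az batch task create --help for your CLI version), the two variables can be set when the task is created:
az batch task create \
    --job-id <JOB-ID> \
    --task-id mytask-0 \
    --command-line "python3 pdf_processing.py <in-container> <out-container> <file>" \
    --environment-settings LC_ALL=C.UTF-8 LANG=C.UTF-8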

Deploying python using CherryPy in Elastic Beanstalk

I am new to Python. I have to run a Python application on Amazon's cloud. I am using CherryPy and deploying through Elastic Beanstalk. Here is my simple HelloWorld code:
import cherrypy

class Hello(object):
    @cherrypy.expose
    def index(self):
        return "Hello world!"

if __name__ == '__main__':
    cherrypy.config.update({'server.socket_host': '0.0.0.0',
                            'server.socket_port': 80})
    cherrypy.quickstart(Hello())
In the requirements.txt file I have CherryPy==10.2.2. Still, I am not able to see any output at the Beanstalk URL. While deploying I get the following error:
Your WSGIPath refers to a file that does not exist.
Can anyone give any insight?
The problem was that the WSGIPath variable in the Software Configuration specifies application.py as the init file, while the Hello class in the code above was in a file named differently.
Make sure the initial code is in a file named application.py, or change the configuration.
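If you prefer to keep your own filename rather than renaming it to application.py, a hedged sketch of an .ebextensions config file (e.g. .ebextensions/python.config; myapp.py is a placeholder for your actual file) would be:
option_settings:
  aws:elasticbeanstalk:container:python:
    WSGIPath: myapp.py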

Openshift app with flask, sqlalchemy and sqlite - problems with database reverting

I have a problem pretty much exactly like this one:
How to preserve a SQLite database from being reverted after deploying to OpenShift?
I don't understand his answer fully, and clearly not well enough to apply it to my own app, and since I can't comment on his answer (not enough rep) I figured I had to ask my own question.
The problem is that when I push my local files (not including the database file), my database on OpenShift becomes the one I have locally (all changes made through the server are reverted).
I've googled a lot and pretty much understand that the problem is the database should be located somewhere else, but I can't fully grasp where to place it and how to deploy it if it's outside the repo.
EDIT: Quick solution: If you have this problem, try connecting to your OpenShift app with rhc ssh appname
and then cp app-root/repo/database.db app-root/data/database.db
if you have the OpenShift data dir as the reference for SQLALCHEMY_DATABASE_URI. I recommend the accepted answer below though!
I've attached my filestructure and here's some related code:
config.py
import os
basedir = os.path.abspath(os.path.dirname(__file__))
SQLALCHEMY_DATABASE_URI = 'sqlite:///' + os.path.join(basedir, 'database.db')
SQLALCHEMY_MIGRATE_REPO = os.path.join(basedir, 'db_repository')
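As a sketch of the quick solution mentioned above (assuming the standard $OPENSHIFT_DATA_DIR environment variable is set on the gear), config.py can be made to prefer the persistent data directory when it exists:
import os

basedir = os.path.abspath(os.path.dirname(__file__))
# fall back to the repo directory when not running on OpenShift
data_dir = os.environ.get('OPENSHIFT_DATA_DIR', basedir)

SQLALCHEMY_DATABASE_URI = 'sqlite:///' + os.path.join(data_dir, 'database.db')
SQLALCHEMY_MIGRATE_REPO = os.path.join(basedir, 'db_repository')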
app/__init__.py
from flask import Flask
from flask.ext.sqlalchemy import SQLAlchemy
app = Flask(__name__)
#so that flask doesn't swallow error messages
app.config['PROPAGATE_EXCEPTIONS'] = True
app.config.from_object('config')
db = SQLAlchemy(app)
from app import rest_api, models
wsgi.py:
#!/usr/bin/env python
import os
virtenv = os.path.join(os.environ.get('OPENSHIFT_PYTHON_DIR', '.'), 'virtenv')
#
# IMPORTANT: Put any additional includes below this line. If placed above this
# line, it's possible required libraries won't be in your searchable path
#
from app import app as application
## runs server locally
if __name__ == '__main__':
    from wsgiref.simple_server import make_server
    httpd = make_server('localhost', 4599, application)
    httpd.serve_forever()
filestructure: http://sv.tinypic.com/r/121xseh/8 (can't attach image..)
Via the note at the top of the OpenShift Cartridge Guide:
"Cartridges and Persistent Storage: Every time you push, everything in your remote repo directory is recreated. Store long term items (like an sqlite database) in the OpenShift data directory, which will persist between pushes of your repo. The OpenShift data directory can be found via the environment variable $OPENSHIFT_DATA_DIR."
You can keep your existing project structure as-is and just use a deploy hook to move your database to persistent storage.
Create a deploy action hook (executable file) .openshift/action_hooks/deploy:
#!/bin/bash
# This deploy hook gets executed after dependencies are resolved and the
# build hook has been run but before the application has been started back
# up again.

# if this is the initial install, copy DB from repo to persistent storage directory
if [ ! -f ${OPENSHIFT_DATA_DIR}database.db ]; then
    cp -f ${OPENSHIFT_REPO_DIR}database.db ${OPENSHIFT_DATA_DIR}database.db 2>/dev/null
fi

# remove the database from the repo during all deploys
if [ -f ${OPENSHIFT_REPO_DIR}database.db ]; then
    rm -f ${OPENSHIFT_REPO_DIR}database.db
fi

# create symlink from repo directory to new database location in persistent storage
ln -sf ${OPENSHIFT_DATA_DIR}database.db ${OPENSHIFT_REPO_DIR}database.db
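One practical detail (not from the original answer, but required for action hooks to execute): the hook file must be committed with the executable bit set, e.g.:
chmod +x .openshift/action_hooks/deploy
git add .openshift/action_hooks/deploy
git update-index --chmod=+x .openshift/action_hooks/deploy
git commit -m "Add deploy action hook"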
As another person pointed out, also make sure you are actually committing/pushing your database (make sure your database isn't included in your .gitignore).
