Dataflow Error with Apache Beam SDK 2.20.0 - python-3.x

I am trying to build an Apache Beam pipeline in Python 3.7 with Beam SDK version 2.20.0. The pipeline deploys to Dataflow successfully but does not seem to be doing anything. In the worker logs, I can see the following error message reported repeatedly:
Error syncing pod xxxxxxxxxxx (), skipping: Failed to start container
worker log
I have tried everything I could, but this error is quite stubborn. My pipeline looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import WorkerOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import DebugOptions
options = PipelineOptions()
options.view_as(GoogleCloudOptions).project = PROJECT
options.view_as(GoogleCloudOptions).job_name = job_name
options.view_as(GoogleCloudOptions).region = region
options.view_as(GoogleCloudOptions).staging_location = staging_location
options.view_as(GoogleCloudOptions).temp_location = temp_location
options.view_as(WorkerOptions).zone = zone
options.view_as(WorkerOptions).network = network
options.view_as(WorkerOptions).subnetwork = sub_network
options.view_as(WorkerOptions).use_public_ips = False
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(StandardOptions).streaming = True
options.view_as(SetupOptions).sdk_location = ''
options.view_as(SetupOptions).save_main_session = True
options.view_as(DebugOptions).experiments = []
print('running pipeline...')
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=topic_name).with_output_types(bytes)
        | 'ProcessMessage' >> beam.ParDo(Split())
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(table=bq_table_name,
                                                       schema=bq_schema,
                                                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    result = pipeline.run()
I have tried supplying a Beam SDK 2.20.0 tar.gz from the compute instance using the sdk_location parameter, but that doesn't work either. I can't use sdk_location = default as that triggers a download from pypi.org. I am working in an offline environment and connectivity to the internet is not an option. Any help would be highly appreciated.
The pipeline itself is deployed in a container, and all libraries that go with Apache Beam 2.20.0 are specified in a requirements.txt file; the Docker image installs all the libraries.

TL;DR: Copy the Apache Beam SDK archive into an accessible path and pass that path via the sdk_location option.
I was also struggling with this setup. I finally found a solution - even though your question was asked quite a while ago, this answer might still help someone else.
There are probably multiple ways to do that, but the following two are quite simple.
As a precondition you'll need to create the Apache Beam SDK source archive as follows:
Clone the Apache Beam GitHub repository
Switch to the required tag, e.g. v2.28.0
cd to beam/sdks/python
Create a tar.gz source archive of your required Beam SDK version:
python setup.py sdist
Now you should have the source archive apache-beam-2.28.0.tar.gz in the path beam/sdks/python/dist/
Option 1 - Use Flex Templates and copy the Apache Beam SDK in the Dockerfile
Documentation: Google Dataflow documentation
Create a Dockerfile --> you have to include COPY utils/apache-beam-2.28.0.tar.gz /tmp, because this is going to be the path you can set in your SetupOptions.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
# update used packages
RUN apt-get update && apt-get install -y \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
COPY setup.py .
COPY main.py .
COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
RUN python -m pip install --user --upgrade pip setuptools wheel
Set sdk_location to path you've copied the apache_beam_sdk.tar.gz to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Build the Docker image with Cloud Build
gcloud builds submit --tag $TEMPLATE_IMAGE .
Create a Flex template
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
--image=gcr.io/your-project-id/image-name:tag \
--sdk-language=PYTHON \
--metadata-file=metadata.json
Run generated flex-template in your subnetwork (if required)
gcloud dataflow flex-template run "your-dataflow-job-name" \
--template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
--parameters staging_location="gs://your-bucket-path/staging/" \
--parameters temp_location="gs://your-bucket-path/temp/" \
--service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
--region="yourRegion" \
--max-workers=6 \
--subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
--disable-public-ips
Option 2 - Copy sdk_location from GCS
According to the Beam documentation you should even be able to provide a GCS / gs:// path directly for the sdk_location option, but that didn't work for me. The following, however, does work:
Upload the previously generated archive to a bucket that the Dataflow job you'd like to execute can access, e.g. gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz
In your source code, copy the Apache Beam SDK archive to e.g. /tmp/apache-beam-2.28.0.tar.gz:
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name" (without the gs:// prefix)
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client-side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

download_blob("your-bucket-name", "beam_sdks/apache-beam-2.28.0.tar.gz",
              "/tmp/apache-beam-2.28.0.tar.gz")
Now you can set sdk_location to the path to which you've downloaded the SDK archive:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Now your Pipeline should be able to run without internet breakout.
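For reference, here is a minimal launcher sketch tying Option 2 together; it assumes the archive was already fetched with download_blob above, and the project, region, bucket paths and the trivial transform are placeholders rather than anything from the original post:
import apache_beam as beam
from apache_beam.options.pipeline_options import (GoogleCloudOptions, PipelineOptions,
                                                   SetupOptions, StandardOptions)

SDK_ARCHIVE = "/tmp/apache-beam-2.28.0.tar.gz"  # downloaded by download_blob above

options = PipelineOptions()
options.view_as(GoogleCloudOptions).project = "your-project-id"
options.view_as(GoogleCloudOptions).region = "your-region"
options.view_as(GoogleCloudOptions).temp_location = "gs://your-bucket-path/temp/"
options.view_as(StandardOptions).runner = "DataflowRunner"
# The crucial bit: point the launcher at the local archive instead of PyPI.
options.view_as(SetupOptions).sdk_location = SDK_ARCHIVE

with beam.Pipeline(options=options) as pipeline:
    _ = (pipeline
         | "Create" >> beam.Create(["no", "internet", "needed"])
         | "Print" >> beam.Map(print))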

Related

Google Cloud Dataproc Serverless (batch) PySpark reads a Parquet file from Google Cloud Storage (GCS) very slowly

I have an inverse-frequency Parquet file of the wiki corpus on Google Cloud Storage (GCS). I want to load it from GCS into Dataproc Serverless (batch). However, loading the Parquet with pyspark.read on a Dataproc batch is much slower than on my local MacBook (16 GB RAM, 8-core Intel CPU). On my local machine, it takes less than 10 s to finish the loading and persisting; on the Dataproc batch, it takes 20-30 s to finish the reading. I am curious where I went wrong in the setup of the Dataproc batch.
The inverse_freq.parquet file is 148.8 MB and the bucket uses the standard storage class. I am using version 2.0 of the Dataproc batch runtime. I also tried a smaller Parquet of ~50 MB; pyspark.read on the Dataproc batch still takes 20-30 s to read it. I think my configuration or setup of the Dataproc batch has some problem.
I hope someone can tell me how to shorten the time it takes to load a file from GCS on a Google Cloud Dataproc batch.
Custom docker image
# Debian 11 is recommended.
FROM debian:11-slim
# Suppress interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
# (Required) Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini libjemalloc2
# RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys B8F25A8A73EACF41
# Enable jemalloc2 as default memory allocator
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
# (Optional) Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}"
#COPY spark-bigquery-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}"
# (Optional) Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3
ENV PYSPARK_PYTHON=${CONDA_HOME}/bin/python
ENV PATH=${CONDA_HOME}/bin:${PATH}
COPY Miniconda3-py39_4.10.3-Linux-x86_64.sh .
RUN bash Miniconda3-py39_4.10.3-Linux-x86_64.sh -b -p /opt/miniconda3 \
&& ${CONDA_HOME}/bin/conda config --system --set always_yes True \
&& ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
&& ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
&& ${CONDA_HOME}/bin/conda config --system --set channel_priority strict
# (Optional) Install Conda packages.
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
&& ${CONDA_HOME}/bin/mamba install \
conda \
google-cloud-logging \
python
ENV REQUIREMENTSPATH=/opt/requirements/requirements.txt
COPY requirements.txt "${REQUIREMENTSPATH}"
RUN pip install -r "${REQUIREMENTSPATH}"
ENV NLTKDATA_PATH=${CONDA_HOME}/nltk_data/corpora
RUN bash -c 'mkdir -p $NLTKDATA_PATH/{stopwords,wordnet}'
COPY nltk_data/stopwords ${NLTKDATA_PATH}/stopwords
COPY nltk_data/wordnet ${NLTKDATA_PATH}/wordnet
# (Optional) Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"
RUN bash -c 'mkdir -p $PYTHONPATH/{utils,GCP}'
COPY utils "$PYTHONPATH/utils"
COPY GCP "$PYTHONPATH/GCP"
# (Required) Create the 'spark' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
gcloud CLI to submit a job to Dataproc batch
APP_NAME="context-graph"
BUCKET="context-graph"
IDF_PATH='idf_data/idf_data/inverse_freq.parquet'
DOC_PATH="articles/text.txt"
gcloud dataproc batches submit pyspark main.py \
--version 2.0 \
--batch test \
--container-image "custom_image:tag1" \
--project project_id \
--region us-central1 \
--deps-bucket context_graph_deps \
--service-account account@example.com \
--subnet default \
--properties spark.dynamicAllocation.initialExecutors=2,spark.dynamicAllocation.minExecutors=2,spark.executor.cores=4,spark.driver.cores=8,spark.driver.memory='16g',\
spark.executor.heartbeatInterval=200s,spark.network.timeout=250s \
-- --app-name=${APP_NAME} --idf-uri=gs://${BUCKET}/${IDF_PATH} \
--bucket-name=${BUCKET} --doc-path=${DOC_PATH}
main.py, a very simple script to read the inverse-frequency Parquet
import time
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
start = time.time()
df = (
    spark.read.option("inferSchema", "true")
    .option("header", "true")
    .parquet("gs://bucket/inverse_freq.parquet")
)
df.persist()
end = time.time()
print("loading time:", end - start)
Warning and error in the log of Cloud Dataproc Batch
Solution:
I found that adding master("local[*]") when creating the SparkSession fixes the problem:
spark = SparkSession.builder.master("local[*]").config(conf=conf).getOrCreate()
If I follow the official examples or some online resources, which don't use master("local[*]"), then Spark's load()/read() from GCS is slow. It is not just reading Parquet that is slow; loading a pyspark.ml model pipeline from GCS is also slow. So if you want to do any read/write from GCS, you should add master("local[*]").
You need to benchmark your app at a bigger scale, because reading a small file may be slower on a distributed system than locally on a laptop.
Regarding calling .master("local[*]") in your code: it makes your Spark app run in local execution mode, i.e. it executes only on the driver node and does not scale execution out to the executor nodes. You should not modify the spark.master property on Dataproc Serverless for Spark - it is already set correctly by the system.
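One more thing worth noting when benchmarking: persist() is lazy, so the snippet in the question mostly times driver-side planning rather than the actual GCS read. A hedged sketch that forces the read with an action, while keeping the system-provided spark.master:
import time

from pyspark.sql import SparkSession

# Do not override spark.master on Dataproc Serverless; the default is already correct.
spark = SparkSession.builder.getOrCreate()

start = time.time()
df = spark.read.parquet("gs://bucket/inverse_freq.parquet")  # same placeholder path as above
row_count = df.persist().count()  # count() materializes both the read and the cache
print("rows:", row_count, "load+cache time:", round(time.time() - start, 1), "s")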

How to read a file from a cloud storage bucket via a python app in a local Docker container

Let me preface this with the fact that I am fairly new to Docker, Jenkins, GCP/Cloud Storage and Python.
Basically, I would like to write a Python app, that runs locally in a Docker container (alpine3.7 image) and reads chunks, line by line, from a very large text file that is dropped into a GCP cloud storage bucket. Each line should just be output to the console for now.
I learn best by looking at working code, but I am spinning my wheels trying to put all the pieces together using these technologies (all new to me).
I already have the key file for that cloud storage bucket on my local machine.
I am also aware of these posts:
How to Read .json file in python code from google cloud storage bucket.
Lazy Method for Reading Big File in Python?
I just need some help putting all these pieces together into a working app.
I understand that I need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the key file in the container. However, I don't know how to do that in a way that works well for multiple developers and multiple environments (Local, Dev, Stage and Prod).
This is just a simple quickstart (I am sure it can be done better) to read a file from a Google Cloud Storage bucket via a python app (Docker container deployed to Google Cloud Run):
You can find more information here link
Create a directory with the following files:
a. app.py
import os
from flask import Flask
from google.cloud import storage

app = Flask(__name__)

@app.route('/')
def hello_world():
    storage_client = storage.Client()
    file_data = 'file_data'
    bucket_name = 'bucket'
    temp_file_name = 'temp_file_name'
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.get_blob(file_data)
    blob.download_to_filename(temp_file_name)
    temp_str = ''
    with open(temp_file_name, "r") as myfile:
        temp_str = myfile.read().replace('\n', '')
    return temp_str

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
b. Dockerfile
# Use an official Python runtime as a parent image
FROM python:2.7-slim
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
RUN pip install google-cloud-storage
# Make port 80 available to the world outside the container
EXPOSE 80
# Define environment variable
ENV NAME World
# Run app.py when the container launches
CMD ["python", "app.py"]
c. requirements.txt
Flask==1.1.1
gunicorn==19.9.0
google-cloud-storage==1.19.1
Create a service account to access the storage from Cloud Run:
gcloud iam service-accounts create cloudrun --description 'cloudrun'
Set the permission of the service account:
gcloud projects add-iam-policy-binding wave25-vladoi --member serviceAccount:cloud-run@project.iam.gserviceaccount.com --role roles/storage.admin
Build the container image:
gcloud builds submit --tag gcr.io/project/hello
Deploy the application to Cloud Run:
gcloud run deploy --image gcr.io/project/hello --platform managed --service-account cloud-run@project.iam.gserviceaccount.com
EDIT :
One way to develop locally is :
Your DevOps team will get the service account key.json:
gcloud iam service-accounts keys create ~/key.json --iam-account cloudrun@project.iam.gserviceaccount.com
Store the key.json file in the same working directory
The Dockerfile command `COPY . /app` will copy the file into the Docker container
Change the app.py to :
storage_client = storage.Client.from_service_account_json('key.json')
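To cover the other half of the original question (streaming a very large text file line by line without loading it all into memory), here is a hedged sketch; the bucket and object names are placeholders, and it assumes a recent google-cloud-storage version that provides Blob.open():
import os

from google.cloud import storage

# Use the shipped key.json if present (as in the EDIT above), otherwise fall back to
# GOOGLE_APPLICATION_CREDENTIALS / the environment's default credentials.
if os.path.exists("key.json"):
    storage_client = storage.Client.from_service_account_json("key.json")
else:
    storage_client = storage.Client()

bucket = storage_client.bucket("your-bucket-name")      # placeholder
blob = bucket.blob("path/to/very-large-file.txt")       # placeholder

# blob.open() streams the object in chunks, so the whole file never sits in memory.
with blob.open("rt") as handle:
    for line in handle:
        print(line.rstrip("\n"))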

Using the Environment Class with Pipeline Runs

I am using an estimator step for a pipeline, together with the Environment class, in order to have a custom Docker image, as I need some apt-get packages to be able to install a specific pip package. From the logs it appears that, unlike the non-pipeline version of the estimator, it completely ignores the Docker portion of the Environment I pass in. Very simply, this seems broken:
I'm running on SDK v1.0.65, and my Dockerfile is completely ignored; I'm using
FROM mcr.microsoft.com/azureml/base:latest\nRUN apt-get update && apt-get -y install freetds-dev freetds-bin vim gcc
in the base_dockerfile property of my code.
Here's a snippet of my code :
from azureml.core import Environment
from azureml.core.environment import CondaDependencies
conda_dep = CondaDependencies()
conda_dep.add_pip_package('pymssql==2.1.1')
myenv = Environment(name="mssqlenv")
myenv.python.conda_dependencies=conda_dep
myenv.docker.enabled = True
myenv.docker.base_dockerfile = 'FROM mcr.microsoft.com/azureml/base:latest\nRUN apt-get update && apt-get -y install freetds-dev freetds-bin vim gcc'
myenv.docker.base_image = None
This works well when I use an Estimator by itself, but if I insert this estimator in a Pipeline, it fails. Here's my code to launch it from a Pipeline run:
from azureml.pipeline.steps import EstimatorStep

sql_est_step = EstimatorStep(name="sql_step",
                             estimator=est,
                             estimator_entry_script_arguments=[],
                             runconfig_pipeline_params=None,
                             compute_target=cpu_cluster)
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
pipeline = Pipeline(workspace=ws, steps=[sql_est_step])
pipeline_run = exp.submit(pipeline)
When launching this, the logs for the container building service reveal:
FROM continuumio/miniconda3:4.4.10... etc.
Which indicates it's ignoring my FROM mcr.... statement in the Environment class I've associated with this Estimator, and my pip install fails.
Am I missing something? Is there a workaround?
I can confirm that this is a bug on the AML Pipeline side. Specifically, the runconfig property environment.docker.base_dockerfile is not being passed through correctly in pipeline jobs. We are working on a fix. In the meantime, you can use the workaround from this thread of building the docker image first and specifying it with environment.docker.base_image (which is passed through correctly).
I found a workaround for now, which is to build your own Docker image. You can do this by using these options of the DockerSection of the Environment:
myenv.docker.base_image_registry.address = '<your_acr>.azurecr.io'
myenv.docker.base_image_registry.username = '<your_acr>'
myenv.docker.base_image_registry.password = '<your_acr_password>'
myenv.docker.base_image = '<your_acr>.azurecr.io/testimg:latest'
and, obviously, use whichever Docker image you built and pushed to the container registry linked to the Azure Machine Learning workspace.
To create the image, you would run something like this at the command line of a machine that can build a linux based container (like a Notebook VM):
docker build . -t <your_image_name>
# Tag it for upload
docker tag <your_image_name>:latest <your_acr>.azurecr.io/<your_image_name>:latest
# Login to Azure
az login
# login to the container registry so that the push will work
az acr login --name <your_acr>
# push the image
docker push <your_acr>.azurecr.io/<your_image_name>:latest
Once the image is pushed, you should be able to get that working.
I also initially used EstimatorStep for custom images, but recently figured out how to successfully pass Environments first to RunConfigurations and then to PythonScriptSteps (example below).
Another workaround, similar to yours, would be to publish your custom Docker image to Docker Hub; the docker_base_image parameter then becomes the URI, in our case mmlspark:0.16.
def get_environment(env_name, yml_path, user_managed_dependencies, enable_docker, docker_base_image):
    env = Environment(env_name)
    cd = CondaDependencies(yml_path)
    env.python.conda_dependencies = cd
    env.python.user_managed_dependencies = user_managed_dependencies
    env.docker.enabled = enable_docker
    env.docker.base_image = docker_base_image
    return env

spark_env = f.get_environment(env_name='spark_env',
                              yml_path=os.path.join(os.getcwd(), 'compute/aml_config/spark_compute_dependencies.yml'),
                              user_managed_dependencies=False, enable_docker=True,
                              docker_base_image='microsoft/mmlspark:0.16')
# use pyspark framework
spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.environment = spark_env

roll_step = PythonScriptStep(
    name='rolling window',
    script_name='roll.py',
    arguments=['--input_dir', joined_data,
               '--output_dir', rolled_data,
               '--script_dir', ".",
               '--min_date', '2015-06-30',
               '--pct_rank', 'True'],
    compute_target=compute_target_spark,
    inputs=[joined_data],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)
A couple of other points (that may be wrong):
PythonScriptStep is effectively a wrapper for ScriptRunConfig, which takes run_config as an argument
Estimator is a wrapper for ScriptRunConfig where RunConfig settings are made available as parameters
IMHO EstimatorStep shouldn't exist, because it is better to define Environments and Steps separately rather than at the same time in one call.
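To make that last point concrete, here is a hedged sketch of going straight to ScriptRunConfig with the Environment defined earlier; the workspace config, script name and compute target name are assumptions:
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()
# Reuse the myenv object built earlier, or fetch it here if it was registered.
myenv = Environment.get(workspace=ws, name="mssqlenv")

src = ScriptRunConfig(source_directory=".", script="train.py")  # script name is an assumption
src.run_config.environment = myenv
src.run_config.target = "cpu-cluster"  # your compute target name (assumption)

run = Experiment(ws, "env-smoke-test").submit(src)
run.wait_for_completion(show_output=True)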

Unable to run tasks on Azure Batch: nodes go into an unusable state after starting up

I am trying to parallelize a Python app using Azure Batch. The workflow that I have followed in the Python client-side script is:
1) Upload local files to an Azure Blob container using the blobxfer utility (input-container)
2) Start the Batch service to process the files in input-container, after logging in with the service principal account via azure-cli
3) Upload the files to output-container through the Python app distributed across the nodes by Azure Batch
I am experiencing a problem very similar to the one I read about here, but unfortunately no solution was given in that post.
Nodes go into Unusable State
I will now give the relevant information so that one can reproduce this error:
The image that was used for Azure Batch is custom.
1) Ubuntu Server 18.04 LTS was chosen as the OS for the VM and the following ports were opened: ssh, http, https. The rest of the settings were kept at their defaults in the Azure portal.
2) The following script was run once the server was available:
sudo apt-get install build-essential checkinstall -y
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev \
    libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev -y
cd /usr/src
sudo wget https://www.python.org/ftp/python/3.6.6/Python-3.6.6.tgz
sudo tar xzf Python-3.6.6.tgz
cd Python-3.6.6
sudo ./configure --enable-optimizations
sudo make altinstall
sudo pip3.6 install --upgrade pip
sudo pip3.6 install pymupdf==1.13.20
sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47
3) An Image of this server was created using the process outlined in this link.
Creating VM Image in Azure Linux
Also, during deprovisioning the user was not deleted: sudo waagent -deprovision
4) The resource ID of the image was noted from the Azure portal. This will be supplied as one of the parameters in the Python client-side script.
The packages installed on the client-side server where the Python script for Batch would run:
sudo pip3.6 install tqdm==4.19.9
sudo pip3.6 install sentry-sdk==0.4.1
sudo pip3.6 install blobxfer==1.5.0
sudo pip3.6 install azure-cli==2.0.47
sudo pip3.6 install pandas==0.22.0
The resources used by Azure Batch were created in the following way:
1) A service principal account with contributor privileges was created using the cmd:
$az ad sp create-for-rbac --name <SERVICE-PRINCIPAL-ACCOUNT>
2) The resource group, Batch account and the storage account associated with the Batch account were created in the following way:
$ az group create --name <RESOURCE-GROUP-NAME> --location eastus2
$ az storage account create --resource-group <RESOURCE-GROUP-NAME> --name <STORAGE-ACCOUNT-NAME> --location eastus2 --sku Standard_LRS
$ az batch account create --name <BATCH-ACCOUNT-NAME> --storage-account <STORAGE-ACCOUNT-NAME> --resource-group <RESOURCE-GROUP-NAME> --location eastus2
The client-side Python script which initiates the upload and processing:
(Update 3)
import subprocess
import os
import time
import datetime
import tqdm
import pandas
import sys
import fitz
import parmap
import numpy as np
import sentry_sdk
import multiprocessing as mp
def batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path):
try:
subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Login Credentials")
sys.exit("Invalid Azure Login Credentials")
dir_flag=False
while dir_flag==False:
try:
no_of_dir=input("Enter the number of directories to upload:")
no_of_dir=int(no_of_dir)
if no_of_dir<0:
print("\nRetry:Enter an integer value")
else:
dir_flag=True
except ValueError:
print("\nRetry:Enter an integer value")
dir_path_list=[]
for dir in range(no_of_dir):
path_exists=False
while path_exists==False:
dir_path=input("\nEnter the local absolute path of the directory no.{}:".format(dir+1))
print("\n")
dir_path=dir_path.replace('"',"")
path_exists=os.path.isdir(dir_path)
if path_exists==True:
dir_path_list.append(dir_path)
else:
print("\nRetry:Enter a valid directory path")
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
input_azure_container="pdf-processing-input"+"-"+timestamp_humanreadable
try:
subprocess.check_output(["az","storage","container","create","--name",input_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Storage Credentials.")
sys.exit("Invalid Azure Storage Credentials.")
log_file_path=os.path.join(log_dir_path,"upload-logs"+"-"+timestamp_humanreadable+".txt")
dir_upload_success=[]
dir_upload_failure=[]
for dir in tqdm.tqdm(dir_path_list,desc="Uploading Directories"):
try:
subprocess.check_output(["blobxfer","upload","--remote-path",input_azure_container,"--storage-account",azure_storage_account,\
"--enable-azure-storage-logger","--log-file",\
log_file_path,"--storage-account-key",azure_storage_account_key,"--local-path",dir])
dir_upload_success.append(dir)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Failed to upload directory: {}".format(dir))
dir_upload_failure.append(dir)
return(input_azure_container)
def query_azure_storage(azure_storage_container,azure_storage_account,azure_storage_account_key,blob_file_path):
try:
blob_list=subprocess.check_output(["az","storage","blob","list","--container-name",azure_storage_container,\
"--account-key",azure_storage_account_key,"--account-name",azure_storage_account,"--auth-mode","login","--output","tsv"])
blob_list=blob_list.decode("utf-8")
with open(blob_file_path,"w") as f:
f.write(blob_list)
blob_df=pandas.read_csv(blob_file_path,sep="\t",header=None)
blob_df=blob_df.iloc[:,3]
blob_df=blob_df.to_frame(name="container_files")
blob_df=blob_df.assign(container=azure_storage_container)
return(blob_df)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Storage Credentials")
sys.exit("Invalid Azure Storage Credentials.")
def analyze_files_for_tasks(data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,download_folder):
try:
blob_df=data_split
some_calculation_factor=2
analyzed_azure_blob_df=pandas.DataFrame()
analyzed_azure_blob_df=analyzed_azure_blob_df.assign(container="empty",container_files="empty",pages="empty",max_time="empty")
for index,row in blob_df.iterrows():
file_to_analyze=os.path.join(download_folder,row["container_files"])
subprocess.check_output(["az","storage","blob","download","--container-name",azure_storage_container,"--file",file_to_analyze,"--name",row["container_files"],\
"--account-name",azure_storage_account,"--auth-mode","key"]) #Why does login auth not work for this while we are multiprocessing
doc=fitz.open(file_to_analyze)
page_count=doc.pageCount
analyzed_azure_blob_df=analyzed_azure_blob_df.append([{"container":azure_storage_container,"container_files":row["container_files"],"pages":page_count,"max_time":some_calculation_factor*page_count}])
doc.close()
os.remove(file_to_analyze)
return(analyzed_azure_blob_df)
except Exception as e:
sentry_sdk.capture_exception(e)
def estimate_task_completion_time(azure_storage_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path):
try:
cores=mp.cpu_count() #Number of CPU cores on your system
partitions = cores-2
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
file_download_location=os.path.join(azure_blob_downloads_file_path,"Blob_Download"+"-"+timestamp_humanreadable)
os.mkdir(file_download_location)
data_split = np.array_split(azure_blob_df,indices_or_sections=partitions,axis=0)
analyzed_azure_blob_df=pandas.concat(parmap.map(analyze_files_for_tasks,data_split,azure_storage_container,azure_storage_account,azure_storage_account_key,file_download_location,\
pm_pbar=True,pm_processes=partitions))
analyzed_azure_blob_df=analyzed_azure_blob_df.reset_index(drop=True)
return(analyzed_azure_blob_df)
except Exception as e:
sentry_sdk.capture_exception(e)
sys.exit("Unable to Estimate Job Completion Status")
def azure_batch_create_pool(azure_storage_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df):
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
pool_id="pdf-processing"+"-"+timestamp_humanreadable
try:
subprocess.check_output(["az","batch","account","login","--name", azure_batch_account,"--resource-group",azure_resource_group])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to log into the Batch account")
sys.exit("Unable to log into the Batch account")
#Pool autoscaling formula would go in here
try:
subprocess.check_output(["az","batch","pool","create","--account-endpoint",azure_batch_account_endpoint, \
"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--id",pool_id,\
"--node-agent-sku-id","batch.node.ubuntu 18.04",\
"--image",vm_image_name,"--target-low-priority-nodes",str(no_nodes),"--vm-size",vm_compute_size])
return(pool_id)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))
sys.exit("Unable to create a Pool corresponding to Container:{}".format(azure_storage_container))
def azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
job_id="pdf-processing-job"+"-"+timestamp_humanreadable
try:
subprocess.check_output(["az","batch","job","create","--account-endpoint",azure_batch_account_endpoint,"--account-key",\
azure_batch_account_key,"--account-name",azure_batch_account,"--id",job_id,"--pool-id",pool_info])
return(job_id)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to create a Job on the Pool :{}".format(pool_info))
sys.exit("Unable to create a Job on the Pool :{}".format(pool_info))
def azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,azure_storage_container,analyzed_azure_blob_df):
print("\n")
for i in tqdm.tqdm(range(180),desc="Waiting for the Pool to Warm-up"):
time.sleep(1)
successful_task_list=[]
unsuccessful_task_list=[]
input_azure_container=azure_storage_container
output_azure_container= "pdf-processing-output"+"-"+input_azure_container.split("-input-")[-1]
try:
subprocess.check_output(["az","storage","container","create","--name",output_azure_container,"--account-name",azure_storage_account,"--auth-mode","login","--fail-on-exist"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to create an output container")
sys.exit("Unable to create an output container")
print("\n")
pbar = tqdm.tqdm(total=analyzed_azure_blob_df.shape[0],desc="Creating and distributing Tasks")
for index,row in analyzed_azure_blob_df.iterrows():
try:
task_info="mytask-"+str(index)
subprocess.check_output(["az","batch","task","create","--task-id",task_info,"--job-id",job_info,"--command-line",\
"python3 /home/avadhut/pdf_processing.py {} {} {}".format(input_azure_container,output_azure_container,row["container_files"])])
pbar.update(1)
except subprocess.CalledProcessError:
sentry_sdk.capture_message("unable to create the Task: mytask-{}".format(i))
pbar.update(1)
pbar.close()
def wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df):
try:
print(analyzed_azure_blob_df)
nrows_tasks_df=analyzed_azure_blob_df.shape[0]
print("\n")
pbar=tqdm.tqdm(total=nrows_tasks_df,desc="Waiting for task to complete")
for index,row in analyzed_azure_blob_df.iterrows():
task_list=subprocess.check_output(["az","batch","task","list","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,\
"--output","tsv"])
task_list=task_list.decode("utf-8")
with open(task_file_path,"w") as f:
f.write(task_list)
task_df=pandas.read_csv(task_file_path,sep="\t",header=None)
task_df=task_df.iloc[:,21]
active_task_list=[]
for x in task_df:
if x =="active":
active_task_list.append(x)
if len(active_task_list)>0:
time.sleep(row["max_time"]) #This time can be changed in accordance with the time taken to complete each task
pbar.update(1)
continue
else:
pbar.close()
return("success")
pbar.close()
return("failure")
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Error in retrieving task status")
def azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info):
try:
subprocess.check_output(["az","batch","job","delete","--job-id",job_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to delete Job-{}".format(job_info))
def azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info):
try:
subprocess.check_output(["az","batch","pool","delete","--pool-id",pool_info,"--account-endpoint",azure_batch_account_endpoint,"--account-key",azure_batch_account_key,"--account-name",azure_batch_account,"--yes"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to delete Pool--{}".format(pool_info))
if __name__=="__main__":
print("\n")
print("-"*40+"Azure Batch processing POC"+"-"*40)
print("\n")
#Credentials and initializations
sentry_sdk.init(<SENTRY-CREDENTIALS>) #Sign-up for a Sentry trail account
azure_username=<AZURE-USERNAME>
azure_password=<AZURE-PASSWORD>
azure_tenant=<AZURE-TENANT>
azure_resource_group=<RESOURCE-GROUP-NAME>
azure_storage_account=<STORAGE-ACCOUNT-NAME>
azure_storage_account_key=<STORAGE-KEY>
azure_batch_account_endpoint=<BATCH-ENDPOINT>
azure_batch_account_key=<BATCH-ACCOUNT-KEY>
azure_batch_account=<BATCH-ACCOUNT-NAME>
vm_image_name=<VM-IMAGE>
vm_compute_size="Standard_A4_v2"
no_nodes=2
log_dir_path="/home/user/azure_batch_upload_logs/"
azure_blob_downloads_file_path="/home/user/blob_downloads/"
blob_file_path="/home/user/azure_batch_upload.tsv"
task_file_path="/home/user/azure_task_list.tsv"
input_azure_container=batch_upload_local_to_azure_blob(azure_username,azure_password,azure_tenant,azure_storage_account,azure_storage_account_key,log_dir_path)
azure_blob_df=query_azure_storage(input_azure_container,azure_storage_account,azure_storage_account_key,blob_file_path)
analyzed_azure_blob_df=estimate_task_completion_time(input_azure_container,azure_storage_account,azure_storage_account_key,azure_blob_df,azure_blob_downloads_file_path)
pool_info=azure_batch_create_pool(input_azure_container,azure_resource_group,azure_batch_account,azure_batch_account_endpoint,azure_batch_account_key,vm_image_name,no_nodes,vm_compute_size,analyzed_azure_blob_df)
job_info=azure_batch_create_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
azure_batch_create_task(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info,job_info,azure_storage_account,azure_storage_account_key,input_azure_container,analyzed_azure_blob_df)
task_status=wait_for_tasks_to_complete(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info,task_file_path,analyzed_azure_blob_df)
if task_status=="success":
azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
print("\n\n")
sys.exit("Job Complete")
else:
azure_delete_job(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,job_info)
azure_delete_pool(azure_batch_account,azure_batch_account_key,azure_batch_account_endpoint,pool_info)
print("\n\n")
sys.exit("Job Unsuccessful")
cmd used to create the zip file:
zip pdf_process_1.zip pdf_processing.py
The Python app that was packaged in the zip file and uploaded to Batch through the client-side script
(Update 3)
import os
import fitz
import subprocess
import argparse
import time
from tqdm import tqdm
import sentry_sdk
import sys
import datetime
def azure_active_directory_login(azure_username,azure_password,azure_tenant):
try:
azure_login_output=subprocess.check_output(["az","login","--service-principal","--username",azure_username,"--password",azure_password,"--tenant",azure_tenant])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Invalid Azure Login Credentials")
sys.exit("Invalid Azure Login Credentials")
def download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path):
file_to_download=os.path.join(input_azure_container,file_to_process)
try:
subprocess.check_output(["az","storage","blob","download","--container-name",input_azure_container,"--file",os.path.join(pdf_docs_path,file_to_process),"--name",file_to_process,"--account-key",azure_storage_account_key,\
"--account-name",azure_storage_account,"--auth-mode","login"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("unable to download the pdf file")
sys.exit("unable to download the pdf file")
def pdf_to_png(input_folder_path,output_folder_path):
pdf_files=[x for x in os.listdir(input_folder_path) if x.endswith((".pdf",".PDF"))]
pdf_files.sort()
for pdf in tqdm(pdf_files,desc="pdf--->png"):
doc=fitz.open(os.path.join(input_folder_path,pdf))
page_count=doc.pageCount
for f in range(page_count):
page=doc.loadPage(f)
pix = page.getPixmap()
if pdf.endswith(".pdf"):
png_filename=pdf.split(".pdf")[0]+"___"+"page---"+str(f)+".png"
pix.writePNG(os.path.join(output_folder_path,png_filename))
elif pdf.endswith(".PDF"):
png_filename=pdf.split(".PDF")[0]+"___"+"page---"+str(f)+".png"
pix.writePNG(os.path.join(output_folder_path,png_filename))
def upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path):
try:
subprocess.check_output(["az","storage","blob","upload-batch","--destination",output_azure_container,"--source",png_docs_path,"--account-key",azure_storage_account_key,\
"--account-name",azure_storage_account,"--auth-mode","login"])
except subprocess.CalledProcessError:
sentry_sdk.capture_message("Unable to upload file to the container")
if __name__=="__main__":
#Credentials
sentry_sdk.init(<SENTRY-CREDENTIALS>)
azure_username=<AZURE-USERNAME>
azure_password=<AZURE-PASSWORD>
azure_tenant=<AZURE-TENANT>
azure_storage_account=<AZURE-STORAGE-NAME>
azure_storage_account_key=<AZURE-STORAGE-KEY>
try:
parser = argparse.ArgumentParser()
parser.add_argument("input_azure_container",type=str,help="Location to download files from")
parser.add_argument("output_azure_container",type=str,help="Location to upload files to")
parser.add_argument("file_to_process",type=str,help="file link in azure blob storage")
args = parser.parse_args()
timestamp = time.time()
timestamp_humanreadable= datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d-%H-%M-%S')
task_working_dir=os.getcwd()
file_to_process=args.file_to_process
input_azure_container=args.input_azure_container
output_azure_container=args.output_azure_container
pdf_docs_path=os.path.join(task_working_dir,"pdf_files"+"-"+timestamp_humanreadable)
png_docs_path=os.path.join(task_working_dir,"png_files"+"-"+timestamp_humanreadable)
os.mkdir(pdf_docs_path)
os.mkdir(png_docs_path)
except Exception as e:
sentry_sdk.capture_exception(e)
azure_active_directory_login(azure_username,azure_password,azure_tenant)
download_from_azure_blob(azure_storage_account,azure_storage_account_key,input_azure_container,file_to_process,pdf_docs_path)
pdf_to_png(pdf_docs_path,png_docs_path)
upload_to_azure_blob(azure_storage_account,azure_storage_account_key,output_azure_container,png_docs_path)
Update 1:
I have solved the error of the server nodes going into an unusable state. The way I solved this issue is:
1) I did not use the cmds I mentioned above to set up a Python 3.6 env on Ubuntu, since Ubuntu 18.04 LTS comes with its own Python 3 environment. Initially I had googled "Install Python 3 on Ubuntu" and had gotten this Python 3.6 installation on Ubuntu link. I avoided this step completely during the server set-up.
All I did this time was install these packages:
sudo apt-get install -y python3-pip
sudo -H pip3 install tqdm==4.19.9
sudo -H pip3 install sentry-sdk==0.4.1
sudo -H pip3 install blobxfer==1.5.0
sudo -H pip3 install pandas==0.22.0
The Azure CLI was installed on the machine using the cmds in this link:
Install Azure CLI with apt
2) Created a snapshot of the OS disk, then created the image from this snapshot, and finally referenced this image in the client-side script.
I am now faced with another issue where the stderr.txt files on the node tell me that:
python3: can't open file '$AZ_BATCH_APP_PACKAGE_pdfprocessingapp/pdf_processing.py': [Errno 2] No such file or directory
Logging in to the server with the random user, I see that the directory _azbatch is created, but there are no contents inside it.
I know for certain that it is in the command line of the azure_batch_create_task() function that things are going haywire, but I am not able to put my finger on it. I have done everything that these docs recommend: Install app packages to Azure Batch compute nodes. Please review my client-side Python script and let me know what I am doing wrong!
Edit 3:
The problem looks very similar to the one described in this post:
Unable to pass app path to Tasks
Update 2:
I was able to overcome the file/directory-not-found error using a dirty hack which I am not particularly fond of: I placed the Python app in the home directory of the user which was used to create the VM, and all the directories required for processing were created in the working directory of the task.
I would still like to know how to run the workflow by deploying it to the nodes the application package way.
Update 3
I have updated the client-side code and the Python app to reflect the latest changes. The significant parts are the same.....
I will comment on the points that @fparks has raised.
The original Python app that I intend to use in Azure Batch contains many modules, some config files and a quite lengthy requirements.txt file for Python packages. Azure also recommends using a custom image in such cases.
Also, downloading the Python modules per task is a bit irrational in my case, as one task corresponds to one multipage PDF and my expected workload is 25k multipage PDFs.
I used the CLI because the docs for the Python SDK were sparse and hard to follow. The nodes going into an unusable state has been solved. I do agree with you on the blobxfer error.
Answers and a few observations:
It is unclear to me why you need a custom image. You can use a platform image, i.e., Canonical, UbuntuServer, 18.04-LTS, and then just install what you need as part of the start task. Python3.6 can simply be installed via apt in 18.04. You may be prematurely optimizing your workflow by opting for a custom image when in fact using a platform image + start task may be faster and stable.
Your script is in Python, yet you are calling out to the Azure CLI. You may want to consider directly using the Azure Batch Python SDK instead (samples).
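To illustrate that second point, here is a hedged sketch of the SDK route with the azure-batch package; the account name, key, endpoint and image resource ID are placeholders matching the variables in the script above, and parameter names may differ slightly between SDK versions:
from azure.batch import BatchServiceClient, models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

creds = SharedKeyCredentials("<BATCH-ACCOUNT-NAME>", "<BATCH-ACCOUNT-KEY>")
# Older azure-batch releases call this parameter base_url instead of batch_url.
batch_client = BatchServiceClient(creds, batch_url="https://<BATCH-ENDPOINT>")

# Pool backed by the custom image, mirroring the `az batch pool create` call above.
batch_client.pool.add(batchmodels.PoolAddParameter(
    id="pdf-processing-pool",
    vm_size="Standard_A4_v2",
    target_low_priority_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            virtual_machine_image_id="<IMAGE-RESOURCE-ID>"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
))

batch_client.job.add(batchmodels.JobAddParameter(
    id="pdf-processing-job",
    pool_info=batchmodels.PoolInformation(pool_id="pdf-processing-pool"),
))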
When nodes go unusable, you should first examine the node for errors. You should see if the ComputeNodeError field is populated. Additionally, you can try to fetch stdout.txt and stderr.txt files from the startup directory to diagnose what's going on. You can do both of these actions in the Azure Portal or via Batch Explorer. If that doesn't work, you can fetch the compute node service logs and file a support request. However, typically unusable means that your custom image was provisioned incorrectly, you have a virtual network with an NSG misconfigured, or you have an application package that is incorrect.
Your application package consists of a single python file; instead use a resource file. Simply upload the script to Azure Storage blob and reference it in your task as a Resource File using a SAS URL. See the --resource-files argument in az batch task create if using the CLI. Your command to invoke would then simply be python3 pdf_processing.py (assuming you keep the resource file downloading to the task working directory).
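A hedged sketch of that resource-file approach, reusing the batch_client from the previous snippet; the SAS URL and container names are placeholders, and older SDK versions use blob_source instead of http_url:
from azure.batch import models as batchmodels

task = batchmodels.TaskAddParameter(
    id="mytask-0",
    command_line="python3 pdf_processing.py <input-container> <output-container> <blob-name>",
    resource_files=[batchmodels.ResourceFile(
        http_url="https://<storage>.blob.core.windows.net/scripts/pdf_processing.py?<SAS>",
        file_path="pdf_processing.py")],  # downloaded into the task working directory
)
batch_client.task.add(job_id="pdf-processing-job", task=task)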
If you insist on using an application package, consider using a task application package instead. This will decouple your node startup issues potentially originating from bad application packages to debugging task executions instead.
The blobxfer error is pretty clear. Your locale is not set properly. The easy way to fix this is to set the environment variables for the task. See the --environment-settings argument if using the CLI and set two environment variables LC_ALL=C.UTF-8 and LANG=C.UTF-8 as part of your task.
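As code, a sketch of the same locale variables the --environment-settings flag would set, to be attached to the task from the snippet above:
from azure.batch import models as batchmodels

locale_settings = [
    batchmodels.EnvironmentSetting(name="LC_ALL", value="C.UTF-8"),
    batchmodels.EnvironmentSetting(name="LANG", value="C.UTF-8"),
]
# e.g. batchmodels.TaskAddParameter(..., environment_settings=locale_settings)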

Why am I getting : Unable to import module 'handler': No module named 'paramiko'?

I needed to move files with an AWS Lambda from an SFTP server to my AWS account,
and then I found this article:
https://aws.amazon.com/blogs/compute/scheduling-ssh-jobs-using-aws-lambda/
It talks about paramiko as an SSH client candidate to move files over SSH.
So I wrote this class wrapper in Python, to be used from my Serverless handler file:
import paramiko
import sys


class FTPClient(object):
    def __init__(self, hostname, username, password):
        """
        creates ftp connection

        Args:
            hostname (string): endpoint of the ftp server
            username (string): username for logging in on the ftp server
            password (string): password for logging in on the ftp server
        """
        try:
            self._host = hostname
            self._port = 22
            # lets you save results of the download into a log file.
            # paramiko.util.log_to_file("path/to/log/file.txt")
            self._sftpTransport = paramiko.Transport((self._host, self._port))
            self._sftpTransport.connect(username=username, password=password)
            self._sftp = paramiko.SFTPClient.from_transport(self._sftpTransport)
        except:
            print("Unexpected error", sys.exc_info())
            raise

    def get(self, sftpPath):
        """
        downloads a file from the sftp server

        Args:
            sftpPath = "path/to/file/on/sftp/to/be/downloaded"
        """
        localPath = "/tmp/temp-download.txt"
        self._sftp.get(sftpPath, localPath)
        self._sftp.close()
        tmpfile = open(localPath, 'r')
        return tmpfile.read()

    def close(self):
        self._sftpTransport.close()
On my local machine it works as expected (test.py):
import ftp_client
sftp = ftp_client.FTPClient(
"host",
"myuser",
"password")
file = sftp.get('/testFile.txt')
print(file)
But when I deploy it with Serverless and run the handler.py function (same as the test.py above), I get back the error:
Unable to import module 'handler': No module named 'paramiko'
It looks like the deployment is unable to import paramiko (from the article above it seems it should be available for Lambda Python 3 on AWS), shouldn't it?
If not, what's the best practice for this case? Should I include the library in my local project and package/deploy it to AWS?
A comprehensive guide/tutorial exists at:
https://serverless.com/blog/serverless-python-packaging/
Use the serverless-python-requirements package as a Serverless node plugin.
A virtual env and a running Docker daemon will be required to pack up your Serverless project before deploying it to AWS Lambda.
In the case you use
custom:
  pythonRequirements:
    zip: true
in your serverless.yml, you have to use this code snippet at the start of your handler
try:
    import unzip_requirements
except ImportError:
    pass
All details can be found in the Serverless Python Requirements documentation.
You have to create a virtualenv, install your dependencies and then zip all files under site-packages/:
sudo pip install virtualenv
virtualenv -p python3 myvirtualenv
source myvirtualenv/bin/activate
pip install paramiko
cp handler.py myvirtualenv/lib/python3.6/site-packages/
cd myvirtualenv/lib/python3.6/site-packages
zip -r9 ../../../../package.zip .
then upload package.zip to lambda
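If you prefer to script that last upload step instead of clicking through the console, here is a hedged sketch with boto3; the function name is a placeholder and the Lambda function is assumed to already exist:
import boto3

with open("package.zip", "rb") as archive:
    boto3.client("lambda").update_function_code(
        FunctionName="my-sftp-mover",   # placeholder
        ZipFile=archive.read(),
    )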
You have to provide all dependencies that are not installed in AWS' Python runtime.
Take a look at Step 7 in the tutorial. Looks like he is adding the dependencies from the virtual environment to the zip file. So I'd assume your ZIP file to contain the following:
your worker_function.py on top level
a folder paramiko with the files installed in the virtual env
Please let me know if this helps.
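A quick way to sanity-check that layout before uploading (a sketch; adjust the file names to your project):
import zipfile

names = zipfile.ZipFile("package.zip").namelist()
print("handler at top level:", "worker_function.py" in names)
print("paramiko bundled:", any(n.startswith("paramiko/") for n in names))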
I tried various blogs and guides like:
web scraping with lambda
AWS Layers for Pandas
spending hours trying things out, facing size issues like that or being unable to import modules, etc.
... and I nearly reached the end (that is, invoking my handler function LOCALLY), but then, even though my function was fully deployed correctly and could even be invoked LOCALLY with no problems, it was impossible to invoke it on AWS.
The most comprehensive and by far the best guide or example that ACTUALLY works is the one mentioned above by @koalaok! Thanks buddy!
actual link
