I used Elastic Beanstalk to verify that code which saves the time to a .txt file every hour via a cron job works.
After that, I wrote code that also saves the crawl result along with the time, but it did not work.
After a bit of debugging, it looks like the error occurs because a module, such as bs4 in "from bs4 import BeautifulSoup", is not installed.
What should I do?
I am using the AL2 platform and below are test.py and cron-linux.config.
test.py
import time
from bs4 import BeautifulSoup


def function_test():
    now = time.strftime('%H%M%S')
    now_int = int(now) + 90000  # UTC to KST: the offset is +9 hours
    now_kst = now_int % 240000  # the hour must not exceed 24, so wrap around at 24 hours
    now_str = str(now_kst).zfill(6)
    if 90000 < now_kst and now_kst < 160000:
        print("intime")
    else:
        print("outtime")
    f = open('Enterprise.txt', 'w', encoding='UTF-8')
    f.write(now_str)
    f.close()


function_test()
cron-linux.config
files:
  "/etc/cron.d/mycron":
    mode: "000644"
    owner: root
    group: root
    content: |
      */1 * * * * root /usr/local/bin/myscript.sh

  "/usr/local/bin/myscript.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/bin/bash
      python3 /var/app/current/test.py
      exit 0

commands:
  remove_old_cron:
    command: "rm -f /etc/cron.d/mycron.bak"
The current .zip layout is:
- .ebextensions
  - cron-linux.config
- static
- templates
- application.py
- Enterprise.txt
- requirements.txt
- test.py
You can install bs4 using container commands:
You can use the container_commands key to execute commands that affect your application source code. Container commands run after the application and web server have been set up and the application version archive has been extracted, but before the application version is deployed.
Thus you can create a new config file in your .ebextensions.
For example:
.ebextensions/10_commands.config
container_commands:
  10_install_bs4:
    command: pip install bs4
  20_install_something_else:
    command: pip install <other package>
requirements.txt - easier alternative
The easiest way would be to simply put all your pip requirements in a requirements.txt file in the root folder of your app:
Specifying dependencies using a requirements file
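For example, a minimal requirements.txt covering the import in test.py might look like this (the version pin is only an illustration; add whatever else your crawler needs):

beautifulsoup4==4.9.3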
I am trying to build an Apache Beam pipeline in Python 3.7 with Beam SDK version 2.20.0. The pipeline gets deployed on Dataflow successfully but does not seem to be doing anything. In the worker logs, I can see the following error message reported repeatedly:
Error syncing pod xxxxxxxxxxx (), skipping: Failed to start container
worker log
I have tried everything I could, but this error is quite stubborn. My pipeline looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import WorkerOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import DebugOptions
options = PipelineOptions()
options.view_as(GoogleCloudOptions).project = PROJECT
options.view_as(GoogleCloudOptions).job_name = job_name
options.view_as(GoogleCloudOptions).region = region
options.view_as(GoogleCloudOptions).staging_location = staging_location
options.view_as(GoogleCloudOptions).temp_location = temp_location
options.view_as(WorkerOptions).zone = zone
options.view_as(WorkerOptions).network = network
options.view_as(WorkerOptions).subnetwork = sub_network
options.view_as(WorkerOptions).use_public_ips = False
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(StandardOptions).streaming = True
options.view_as(SetupOptions).sdk_location = ''
options.view_as(SetupOptions).save_main_session = True
options.view_as(DebugOptions).experiments = []
print('running pipeline...')
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic=topic_name).with_output_types(bytes)
        | 'ProcessMessage' >> beam.ParDo(Split())
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(table=bq_table_name,
                                                       schema=bq_schema,
                                                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )

result = pipeline.run()
I have tried supplying a Beam SDK 2.20.0 tar.gz from the compute instance using the sdk_location parameter, but that doesn't work either. I can't use sdk_location = default as that triggers a download from pypi.org. I am working in an offline environment and connectivity to the internet is not an option. Any help would be highly appreciated.
The pipeline itself is deployed on a container and all libraries that go with apache beam 2.20.0 are specified in a requirements.txt file, docker image installs all the libraries.
TL;DR : Copy the Apache Beam SDK Archive into an accessible path and provide the path as a variable.
I was also struggling with this setup. Finally I found a solution; even though your question was asked quite some time ago, this answer might help someone else.
There are probably multiple ways to do that, but the following two are quite simple.
As a precondition you'll need to create the Apache Beam SDK source archive as follows:
Clone the Apache Beam GitHub repository
Switch to the required tag, e.g. v2.28.0
cd to beam/sdks/python
Create a tar.gz source archive of your required Beam SDK version:
python setup.py sdist
Now you should have the source archive apache-beam-2.28.0.tar.gz in the path beam/sdks/python/dist/
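Put together, the steps above might look roughly like this in a shell (v2.28.0 is just an example tag; use the version you need):

git clone https://github.com/apache/beam.git
cd beam
git checkout v2.28.0
cd sdks/python
python setup.py sdist
# the archive ends up in dist/apache-beam-2.28.0.tar.gz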
Option 1 - Use Flex Templates and copy the Apache Beam SDK in the Dockerfile
Documentation: Google Dataflow Documentation
Create a Dockerfile. You have to include a line like COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp, because /tmp is going to be the path you set later in your SetupOptions.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
# update used packages
RUN apt-get update && apt-get install -y \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
COPY setup.py .
COPY main.py .
COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
RUN python -m pip install --user --upgrade pip setuptools wheel
Set sdk_location to the path you've copied the Apache Beam SDK tar.gz to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Build the Docker image with Cloud Build
gcloud builds submit --tag $TEMPLATE_IMAGE .
Create a Flex template
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
--image=gcr.io/your-project-id/image-name:tag \
--sdk-language=PYTHON \
--metadata-file=metadata.json
Run the generated Flex Template in your subnetwork (if required):
gcloud dataflow flex-template run "your-dataflow-job-name" \
--template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
--parameters staging_location="gs://your-bucket-path/staging/" \
--parameters temp_location="gs://your-bucket-path/temp/" \
--service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
--region="yourRegion" \
--max-workers=6 \
--subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
--disable-public-ips
Option 2 - Copy sdk_location from GCS
According to the Beam documentation you should even be able to provide a GCS gs:// path directly for the sdk_location option, but that didn't work for me. The following should work, though:
Upload the previously generated archive to a bucket that the Dataflow job you'd like to execute can access, e.g. gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz
In your source code, copy the Apache Beam SDK archive to e.g. /tmp/apache-beam-2.28.0.tar.gz:
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name" (without the gs:// prefix)
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
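A call matching the placeholder paths above would then look something like:

download_blob(
    bucket_name="yourbucketname",
    source_blob_name="beam_sdks/apache-beam-2.28.0.tar.gz",
    destination_file_name="/tmp/apache-beam-2.28.0.tar.gz",
)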
Now you can set sdk_location to the path you've downloaded the SDK archive to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Now your Pipeline should be able to run without internet breakout.
I have created a simple Flask app which I am trying to deploy to Docker.
The basic user interface will load on localhost, but when I execute a command which calls a specific function, it keeps showing:
"Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."
Looking at the Docker logs, I can see the problem is that the file cannot be found by the subprocess.Popen call:
"FileNotFoundError: [Errno 2] No such file or directory: 'test_2.bat': 'test_2.bat'
172.17.0.1 - - [31/Oct/2019 17:01:55] "POST /login HTTP/1.1" 500"
The file certainly exists in the Docker environment; within the container I can see it listed in the root directory.
I have also tried changing:
item = subprocess.Popen(["test_2.bat", i], shell=False,stdout=subprocess.PIPE)
to:
item = subprocess.Popen(["./test_2.bat", i], shell=False,stdout=subprocess.PIPE)
which generated the alternative error:
"OSError: [Errno 8] Exec format error: './test_2.bat'
172.17.0.1 - - [31/Oct/2019 16:58:54] "POST /login HTTP/1.1" 500"
I have added a shebang to the top of both .py files involved in the Flask app (although I may have done this wrong):
#!/usr/bin/env python3
and this is the Dockerfile:
FROM python:3.6
RUN adduser lighthouse
WORKDIR /home/lighthouse
COPY requirements.txt requirements.txt
# RUN python -m venv venv
RUN pip install -r requirements.txt
RUN pip install gunicorn
COPY templates templates
COPY json_logs_nl json_logs_nl
COPY app.py full_script_manual_with_list.py schema_all.json ./
COPY bq_load_indv_jsons_v3.bat test_2.bat ./
RUN chmod 644 app.py
RUN pip install flask
ENV FLASK_APP app.py
RUN chown -R lighthouse:lighthouse ./
USER lighthouse
# EXPOSE 5000
CMD ["flask", "run", "--host=0.0.0.0"]
I am using Ubuntu and WSL2 to run Docker on a Windows machine without a virtual box. I have no trouble navigating my Windows file system or building Docker images, so I think this configuration is not the problem, but I mention it just in case.
If anyone has any ideas to help subprocess locate test_2.bat I would be very grateful!
Edit: the app works exactly as expected when executed locally via the command line with "flask run"
If anyone is facing a similar problem, the solution was to put the command directly into the Python script rather than calling it in a separate file. It is split into separate strings to allow the "url" variable to be iteratively updated, as this all occurs within a for loop:
url = str(i)
var_command = "lighthouse " + url + " --quiet --chrome-flags=\" --headless\" --output=json --output-path=/home/lighthouse/result.json"
item = subprocess.Popen([var_command], stdout=subprocess.PIPE, shell=True)
item.communicate()
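In context, a minimal sketch of the loop described above might look like this (the URL list is a placeholder):

import subprocess

urls = ["https://example.com"]  # placeholder list of URLs to audit
for i in urls:
    url = str(i)
    var_command = ("lighthouse " + url +
                   " --quiet --chrome-flags=\" --headless\""
                   " --output=json --output-path=/home/lighthouse/result.json")
    # shell=True means the single command string is interpreted by the shell
    item = subprocess.Popen([var_command], stdout=subprocess.PIPE, shell=True)
    item.communicate()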
As a side note, if you would like to run Lighthouse within a container you need to install it just as you would to run it on the command line, in a Node container. This container can then communicate with my Python container if both are deployed in the same pod via Kubernetes and share a namespace. Here is a Lighthouse container Dockerfile I've used: https://github.com/GoogleChromeLabs/lighthousebot/blob/master/builder/Dockerfile
I have been stuck trying to figure out how to edit Python Flask code after pulling it from a Docker Hub repository on a different computer. I want to create a folder on my Linux desktop that contains all of the files the image has when running as a container (Dockerfile, requirements.txt, app.py). That way I can edit app.py regardless of what computer I have, and my classmates can simply pull my image, run the container, and have a copy of the code saved on their local machine to open in Visual Studio Code (or any IDE) and edit. This is what I tried.
I first pulled from the Docker hub:
sudo docker pull woonx/dockertester1
Then used this command to run the image as a container and create a directory:
sudo docker run --name=test1 -v ~/testfile:/var/lib/docker -p 4000:80 woonx/dockertester1
I was able to create a local directory called testfile but it was an empty folder when I opened it. No app.py, dockerfile, nothing.
The example code I am using to test is from following the example guide on the Docker website: https://docs.docker.com/get-started/part2/
Dockerfile:
# Use an official Python runtime as a parent image
FROM python:2.7-slim
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Define environment variable
ENV NAME World
# Run app.py when the container launches
CMD ["python", "app.py"]
requirements.txt:
Flask
Redis
app.py:
from flask import Flask
from redis import Redis, RedisError
import os
import socket

# Connect to Redis
redis = Redis(host="redis", db=0, socket_connect_timeout=2, socket_timeout=2)

app = Flask(__name__)

@app.route("/")
def hello():
    try:
        visits = redis.incr("counter")
    except RedisError:
        visits = "<i>cannot connect to Redis, counter disabled</i>"

    html = "<h3>Hello {name}!</h3>" \
           "<b>Hostname:</b> {hostname}<br/>" \
           "<b>Visits:</b> {visits}"
    return html.format(name=os.getenv("NAME", "world"), hostname=socket.gethostname(), visits=visits)

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=80)
What I do is the following.
First, I issue the docker run command:
sudo docker run --name=test1 -v ~/testfile:/var/lib/docker -p 4000:80 woonx/dockertester1
At this stage, the files are created in the container. Then I stop the container (let's say the container ID is 0101010101):
docker container stop 0101010101
Then I simply copy those files from the container to the appropriate directory on my machine by using:
docker cp <container_name>:/path/in/container /path/of/host
or
cd ~/testfile
docker cp <container_name>:/path/in/container .
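For this particular image, that could look something like the following (the /app path is an assumption based on the WORKDIR in the Dockerfile above):

docker cp test1:/app ~/testfile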
So, you have the files created by docker run on your local host. Now you can use them with the -v option:
sudo docker run --name=test1 -v ~/testfile:/var/lib/docker -p 4000:80 woonx/dockertester1
Normally, when you change a setting in your configuration, stopping and starting the container should be enough for it to take effect.
I hope this approach solves your problem.
Regards
I needed to move files from an SFTP server to my AWS account with an AWS Lambda function,
and then I found this article:
https://aws.amazon.com/blogs/compute/scheduling-ssh-jobs-using-aws-lambda/
It talks about paramiko as an SSH client candidate to move files over SSH.
I then wrote this class wrapper in Python to be used from my Serverless handler file:
import paramiko
import sys


class FTPClient(object):
    def __init__(self, hostname, username, password):
        """
        Creates an SFTP connection.

        Args:
            hostname (string): endpoint of the ftp server
            username (string): username for logging in on the ftp server
            password (string): password for logging in on the ftp server
        """
        try:
            self._host = hostname
            self._port = 22
            # lets you save results of the download into a log file.
            # paramiko.util.log_to_file("path/to/log/file.txt")
            self._sftpTransport = paramiko.Transport((self._host, self._port))
            self._sftpTransport.connect(username=username, password=password)
            self._sftp = paramiko.SFTPClient.from_transport(self._sftpTransport)
        except:
            print("Unexpected error", sys.exc_info())
            raise

    def get(self, sftpPath):
        """
        Downloads a file from the SFTP server and returns its contents.

        Args:
            sftpPath = "path/to/file/on/sftp/to/be/downloaded"
        """
        localPath = "/tmp/temp-download.txt"
        self._sftp.get(sftpPath, localPath)
        self._sftp.close()
        tmpfile = open(localPath, 'r')
        return tmpfile.read()

    def close(self):
        self._sftpTransport.close()
On my local machine it works as expected (test.py):
import ftp_client

sftp = ftp_client.FTPClient(
    "host",
    "myuser",
    "password")

file = sftp.get('/testFile.txt')
print(file)
But when I deploy it with serverless and run the handler.py function (same as the test.py above) I get back the error:
Unable to import module 'handler': No module named 'paramiko'
It looks like the deployed function is unable to import paramiko (from the article above it seems like it should be available for Lambda Python 3 on AWS), shouldn't it be?
If not, what's the best practice for this case? Should I include the library in my local project and package/deploy it to AWS?
A comprehensive guide/tutorial exists at:
https://serverless.com/blog/serverless-python-packaging/
It uses the serverless-python-requirements package as a Serverless Node plugin.
Creating a virtualenv and having the Docker daemon running will be required to pack up your Serverless project before deploying it to AWS Lambda.
In case you use

custom:
  pythonRequirements:
    zip: true

in your serverless.yml, you have to use this code snippet at the start of your handler:
try:
    import unzip_requirements
except ImportError:
    pass
All details can be found in the Serverless Python Requirements documentation.
You have to create a virtualenv, install your dependencies, and then zip all files under site-packages/:
sudo pip install virtualenv
virtualenv -p python3 myvirtualenv
source myvirtualenv/bin/activate
pip install paramiko
cp handler.py myvirtualenv/lib/python3.6/site-packages/
cd myvirtualenv/lib/python3.6/site-packages/
zip -r ../../../../package.zip .
Then upload package.zip to Lambda.
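One way to do the upload, for example with the AWS CLI (the function name here is a placeholder):

aws lambda update-function-code --function-name my-function --zip-file fileb://package.zip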
You have to provide all dependencies that are not installed in AWS' Python runtime.
Take a look at Step 7 in the tutorial. It looks like he is adding the dependencies from the virtual environment to the zip file. So I'd expect your ZIP file to contain the following:
your worker_function.py at the top level
a folder paramiko with the files installed in the virtual env
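Roughly, the archive layout would then look like this (worker_function.py is the handler name used in that tutorial):

package.zip
- worker_function.py
- paramiko/
- (any other dependencies copied from site-packages)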
Please let me know if this helps.
I tried various blogs and guides like:
web scraping with lambda
AWS Layers for Pandas
spending hours trying things out, facing size issues or being unable to import modules, etc.
I nearly reached the end (that is, invoking my handler function locally), but even though my function was fully deployed correctly and could be invoked locally with no problems, it was impossible to invoke it on AWS.
The most comprehensive and by far the best guide or example that actually works is the one mentioned above by @koalaok! Thanks, buddy!
actual link
When running my app with a gunicorn upstart job, I get:
TypeError: 'newline' is an invalid keyword argument for this function
When I run it from the command line, however, I have no problem.
I've seen solutions that indicate newline should be in the file opening, not with the csv.writer. As you can see though, I do indeed have it in the file opening.
To recreate:
save my_app.py to /home/--your home--/
chmod u+x /home/--your home--/my_app.py
save my_upstart.conf to /etc/init/
edit my_upstart.conf to replace with your home dir
sudo service my_upstart start
curl localhost:5001/vis -H "Content-Type: text/csv"
sudo cat /var/log/upstart/my_upstart.log
In my_upstart.log, you will see the TypeError mentioned above.
my_app.py
#!/usr/bin/python3
import csv

from flask import Flask, request

app = Flask(__name__)

@app.route('/vis/', strict_slashes=False)
def vis():
    with (open('~/test.csv', mode='w', newline='')) as f:
        writer = csv.writer(f)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)
my_upstart.conf
description "Gunicorn config file for serving the Wellness app"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
setuid ubuntu
setgid ubuntu
script
cd /home/<your home>/
exec gunicorn --bind 0.0.0.0:5001 my_app:app
end script
Compare the documentation of open for Python versions 2 and 3 and you will note that there is quite a bit of difference in which parameters can be passed. In particular, the newline parameter is not available in Python 2.
So my guess is that when gunicorn runs, it picks up a Python 2 executable.
See Cannot get gunicorn to use Python 3 for more details.
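One quick way to confirm which interpreter gunicorn is using is to log it from the app at import time and check the upstart log (a temporary debugging sketch):

import sys

# Printed once when gunicorn imports the app; the output shows up in /var/log/upstart/my_upstart.log
print(sys.version)
print(sys.executable)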
gunicorn was using Python 2 and its corresponding distribution package, whereas I'm using Python 3. I followed these steps to fix it:
sudo pip3 install gunicorn
In /usr/bin/gunicorn, edited the first line to read #!/usr/bin/python3 (instead of python), and
changed the gunicorn version anywhere it appeared to match what gunicorn --version reports.
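To sanity-check the result afterwards, you can verify the shebang and the installed version:

head -1 /usr/bin/gunicorn
gunicorn --version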