How to run a PySpark job with Airflow in a dockerized environment

I followed the official Airflow Docker guide.
It works fine for most of the simple jobs I have.
I then tried to use this guide, for which I needed to add the following line to the .env file:
_PIP_ADDITIONAL_REQUIREMENTS=pyspark xlrd apache-airflow-providers-apache-spark
Unfortunately, the DAG is not being loaded.
The problem seems to be related to JAVA_HOME, because the Docker output shows this message:
airflow-scheduler_1 | is not set
In the Airflow web GUI it shows the following error:
Broken DAG: [/opt/airflow/dags/SparkETL.py] Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/pyspark/context.py", line 339, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/home/airflow/.local/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
I tried adding an install -y openjdk-11-jdk command in the docker-compose file, and also setting JAVA_HOME: '/usr/lib/jvm/java-11-openjdk-amd64' there. In that case, airflow-scheduler reports that the path does not exist.
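For context, SparkETL.py uses PySpark directly (that is where the pyspark import in the traceback comes from). A stripped-down sketch of that kind of DAG, with placeholder names and paths rather than my actual code, looks roughly like this:
# Illustrative sketch only; the real SparkETL.py is not shown here.
# A DAG like this needs pyspark importable in the Airflow images, and creating
# the SparkSession needs a working Java installation inside the containers.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from pyspark.sql import SparkSession

def run_etl():
    # Creating a SparkSession launches a Java gateway process; without Java in
    # the container this fails with the RuntimeError shown above.
    spark = SparkSession.builder.master("local[*]").appName("SparkETL").getOrCreate()
    df = spark.read.csv("/opt/airflow/data/input.csv", header=True)  # placeholder path
    df.show()
    spark.stop()

with DAG(
    dag_id="spark_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="run_etl", python_callable=run_etl)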


MLflow saves models to a relative location instead of the tracking_uri

Sorry if my question is too basic, but I cannot solve it.
I am currently experimenting with MLflow and facing the following issue:
Even though I have set the tracking_uri, the MLflow artifacts are saved to the ./mlruns/... folder relative to the path from which I run mlflow run path/to/train.py (on the command line). The MLflow server searches for the artifacts following the tracking_uri (mlflow server --default-artifact-root here/comes/the/same/tracking_uri).
Through the following example it will be clear what I mean:
I set the following in the training script before the with mlflow.start_run() as run:
mlflow.set_tracking_uri("file:///home/#myUser/#SomeFolders/mlflow_artifact_store/mlruns/")
My expectation would be that MLflow saves all the artifacts to the place I gave in the registry URI. Instead, it saves the artifacts relative to the place from which I run mlflow run path/to/train.py, i.e. running the following
/home/#myUser/ mlflow run path/to/train.py
creates the structure:
/home/#myUser/mlruns/#experimentID/#runID/artifacts
/home/#myUser/mlruns/#experimentID/#runID/metrics
/home/#myUser/mlruns/#experimentID/#runID/params
/home/#myUser/mlruns/#experimentID/#runID/tags
and therefore it doesn't find the run artifacts in the tracking_uri, giving the error message:
Traceback (most recent call last):
File "train.py", line 59, in <module>
with mlflow.start_run() as run:
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 204, in start_run
active_run_obj = client.get_run(existing_run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/client.py", line 151, in get_run
return self._tracking_client.get_run(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/_tracking_service/client.py", line 57, in get_run
return self.store.get_run(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 524, in get_run
run_info = self._get_run_info(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 544, in _get_run_info
"Run '%s' not found" % run_uuid, databricks_pb2.RESOURCE_DOES_NOT_EXIST
mlflow.exceptions.MlflowException: Run '788563758ece40f283bfbf8ba80ceca8' not found
2021/07/23 16:54:16 ERROR mlflow.cli: === Run (ID '788563758ece40f283bfbf8ba80ceca8') failed ===
Why is that so? How can I change the place where the artifacts are stored and where this directory structure is created? I have tried mlflow run --storage-dir here/comes/the/path, as well as setting the tracking_uri and the registry_uri. If I run mlflow run path/to/train.py from /home/path/to/tracking/uri it works, but I need to run the scripts remotely.
My end goal is to change the artifact URI to an NFS drive, but I cannot even make this work on my local computer.
Thanks for reading it, even more thanks if you suggest a solution! :)
Have a great day!
This issue was solved by the following:
I had mixed up the tracking_uri with the backend_store_uri.
The tracking_uri is where the MLflow-related data (e.g. tags, parameters, metrics, etc.) is saved, which can be a database. On the other hand, the artifact_location is where the artifacts are stored (the other, non-MLflow data produced by the preprocessing/training/evaluation/etc. scripts).
What led me to mistakes is that when running mlflow server from the command line, --backend-store-uri should be set to the tracking_uri (the same one set in the script via mlflow.set_tracking_uri()), and --default-artifact-root to the location of the artifacts. Somehow I didn't get that the tracking_uri is the backend_store_uri.
Here's my solution
Launch the server
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME
Set the tracking URI to an HTTP URI, like
mlflow.set_tracking_uri("http://my-tracking-server:5000/")

I have a Python script running inside a container of a Kubernetes pod. How do I stop the script which runs along with the starting of the pod?

I am running two threads inside the container of the Kubernetes pod: one thread pushes some data to the DB, and the other thread (a Flask app) shows the data from the database. So as soon as the pod starts up, main.py (which starts both of the threads mentioned above) is called.
Dockerfile:
FROM python:3
WORKDIR /usr/src/app
COPY app/requirements.txt .
RUN pip install -r requirements.txt
COPY app .
CMD ["python3","./main.py"]
I have two questions:
Are logs the only way to see the output of the running script? Can't we see its output continuously as it runs, like on a terminal?
Also, I'm not able to run the same main.py file by going into the container. It throws the error below:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 920, in run
run_simple(t.cast(str, host), port, self, **options)
File "/usr/local/lib/python3.9/site-packages/werkzeug/serving.py", line 1008, in run_simple
inner()
File "/usr/local/lib/python3.9/site-packages/werkzeug/serving.py", line 948, in inner
srv = make_server(
File "/usr/local/lib/python3.9/site-packages/werkzeug/serving.py", line 780, in make_server
return ThreadedWSGIServer(
File "/usr/local/lib/python3.9/site-packages/werkzeug/serving.py", line 686, in __init__
super().__init__(server_address, handler) # type: ignore
File "/usr/local/lib/python3.9/socketserver.py", line 452, in __init__
self.server_bind()
File "/usr/local/lib/python3.9/http/server.py", line 138, in server_bind
socketserver.TCPServer.server_bind(self)
File "/usr/local/lib/python3.9/socketserver.py", line 466, in server_bind
self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
How do I stop the main.py script which starts along with the pod, so that I can run main.py directly from inside the container?
Thank you.
The error message says it all:
OSError: [Errno 98] Address already in use
It looks like your Python script tries to open the same port twice. You cannot do that: the copy of main.py that Kubernetes started with the pod is already bound to the Flask port, so running main.py again inside the same container fails to bind it. Check your code and fix it.
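To illustrate, here is a hypothetical reconstruction of main.py's structure (your real file is not shown, so every name here is a placeholder). Reading the port from an environment variable is one way to let a second, manually started copy run for debugging without colliding with the one the pod launched:
# Hypothetical sketch of main.py's structure; not the actual file from the question.
import os
import threading
import time

from flask import Flask

app = Flask(__name__)

@app.route("/")
def show_data():
    return "data from the database"  # placeholder for the real DB read

def push_to_db():
    while True:
        # placeholder for the real "push some data to db" work
        time.sleep(5)

if __name__ == "__main__":
    threading.Thread(target=push_to_db, daemon=True).start()
    # A second copy can now be started as e.g. PORT=5001 python3 main.py
    # without colliding with the instance the pod launched on the default port.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 5000)))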
Now answering your other question:
Is logs the only way to see the output of the running script? Can't we see its output continuously as it runs on the terminal?
Running kubectl logs -f <pod-name> will follow the logs, which lets you see the output continuously as it runs, right in your terminal.

Docker firewall issue with cBioportal

We are sitting behind a firewall and are trying to run a Docker image (cBioPortal). Docker itself could be installed with a proxy, but now we encounter the following issue:
Starting validation...
INFO: -: Unable to read xml containing cBioPortal version.
DEBUG: -: Requesting cancertypes from portal at 'http://cbioportal-container:8081'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error occurred during validation step:
Traceback (most recent call last):
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4491, in request_from_portal_api
response.raise_for_status()
File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: http://cbioportal-container:8081/api-legacy/cancertypes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/metaImport.py", line 127, in <module>
exitcode = validateData.main_validate(args)
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4969, in main_validate
portal_instance = load_portal_info(server_url, logger)
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4622, in load_portal_info
parsed_json = request_from_portal_api(path, api_name, logger)
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4495, in request_from_portal_api
) from e
ConnectionError: Failed to fetch metadata from the portal at [http://cbioportal-container:8081/api-legacy/cancertypes]
Now we know that it is a firewall issue, because it works when we install it outside the firewall. But we do not know how to change the firewall yet. Our idea was to look up the files and lines which throw the errors. But we do not know how to look into those files, since they are inside the Docker container.
So we cannot just do something like
vim /cbioportal/core/src/main/scripts/importer/validateData.py
...because, on the host, there is nothing there. Of course we know this file is within the Docker image, but as I said, we don't know how to look into it. At the moment we do not know how to solve this riddle - any help appreciated.
Maybe you still need this.
You can access this Python file within the container by using docker-compose exec cbioportal sh or docker-compose exec cbioportal bash.
Then you can use cd, cat, vi, vim, or similar tools to access the path given in your post.
I'm not sure which command you're actually running, but when I did the import call like
docker-compose run --rm cbioportal metaImport.py -u http://cbioportal:8080 -s study/lgg_ucsf_2014/lgg_ucsf_2014/ -o
I had to replace http://cbioportal:8080 with the server's IP address.
Also notice that the studies path is one level deeper than in the official documentation.
With cBioPortal behind a proxy, the study import is only available in offline mode:
First you need to get inside the container:
docker exec -it cbioportal-container bash
Then generate the portal info folder:
cd $PORTAL_HOME/core/src/main/scripts
./dumpPortalInfo.pl $PORTAL_HOME/my_portal_info_folder
Then import the study offline. The -o flag is important to overwrite despite warnings:
cd $PORTAL_HOME/core/src/main/scripts
./importer/metaImport.py -p $PORTAL_HOME/my_portal_info_folder -s /study/lgg_ucsf_2014 -v -o
Hope this helps.

Launch Jupyter Notebook on an EC2 Ubuntu instance

I’m trying to install Anaconda, Python 3 and Jupyter notebooks on an AWS EC2 instance. I’m running Ubuntu on the instance. I’ve installed Python using Anaconda. I’ve set the default Python to the Anaconda version. I created a Jupyter notebook config file. In the Jupyter notebook config file I added:
c = get_config()
# Notebook config this is where you saved your pem cert
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
# Run on all IP addresses of your instance
c.NotebookApp.ip = '*'
# Don't open browser by default
c.NotebookApp.open_browser = False
# Fix port to 8888
c.NotebookApp.port = 8888
I also created a directory for the certs using the code below:
mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
But when I try to run Jupyter Notebook with the command below:
jupyter notebook
I get the error message below. My end goal is to be able to launch Jupyter Notebook on the AWS EC2 instance and then connect to it remotely in a browser on my laptop. Does anyone know what my issue might be?
Error:
Writing notebook server cookie secret to /run/user/1000/jupyter/notebook_cookie_secret
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/traitlets/traitlets.py", line 528, in get
value = obj._trait_values[self.name]
KeyError: 'allow_remote_access'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/notebook/notebookapp.py", line 864, in _default_allow_remote
addr = ipaddress.ip_address(self.ip)
File "/home/ubuntu/anaconda3/lib/python3.7/ipaddress.py", line 54, in ip_address
address)
ValueError: '' does not appear to be an IPv4 or IPv6 address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/bin/jupyter-notebook", line 11, in <module>
sys.exit(main())
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jupyter_core/application.py", line 266, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "</home/ubuntu/anaconda3/lib/python3.7/site-packages/decorator.py:decorator-gen-7>", line 2, in initialize
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/notebook/notebookapp.py", line 1630, in initialize
self.init_webapp()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/notebook/notebookapp.py", line 1378, in init_webapp
self.jinja_environment_options,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/notebook/notebookapp.py", line 159, in __init__
default_url, settings_overrides, jinja_env_options)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/notebook/notebookapp.py", line 252, in init_settings
allow_remote_access=jupyter_app.allow_remote_access,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/traitlets/traitlets.py", line 556, in __get__
return self.get(obj, cls)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/traitlets/traitlets.py", line 535, in get
value = self._validate(obj, dynamic_default())
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/notebook/notebookapp.py", line 867, in _default_allow_remote
for info in socket.getaddrinfo(self.ip, self.port, 0, socket.SOCK_STREAM):
File "/home/ubuntu/anaconda3/lib/python3.7/socket.py", line 748, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
Go to your AWS instance's security group and configure an inbound rule for the notebook port, 8888. (The original answer illustrated this with a screenshot of the inbound rule settings.)
If you are sure that AWS is configured correctly for permissions, check whether your own network is blocking the outbound traffic. You could try port tunneling when SSHing into your instance by doing:
ssh -i <path/to/key.pem> -L 8888:127.0.0.1:8888 <user>@<ec2-public-dns>
Then you can access Jupyter locally by going to localhost:8888 in your browser.
In the Jupyter notebook config file that you shared in the question above, a few lines seem to be missing.
To configure the Jupyter config file thoroughly, follow these steps:
cd ~/.jupyter/
vi jupyter_notebook_config.py
Insert this at the beginning of the document:
c = get_config()
# Kernel config
c.IPKernelApp.pylab = 'inline' # if you want plotting support always in your notebook
# Notebook config
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem' #location of your certificate file
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False #so that the ipython notebook does not opens up a browser by default
c.NotebookApp.password = u'sha1:98ff0e580111:12798c72623a6eecd54b51c006b1050f0ac1a62d' #the encrypted password we generated above
# Set the port to 8888, the port we set up in the AWS EC2 set-up
c.NotebookApp.port = 8888
Once you have entered the lines above, make sure you save the config file before you exit the vi editor!
Also, most importantly, remember to replace sha1:98ff0e580111:12798c72623a6eecd54b51c006b1050f0ac1a62d with the hash of your own password!
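If you still need to generate that hash, one way to do it (assuming the classic notebook package; newer releases use a different hashing scheme) is to run the following once in a Python session and paste the printed value into c.NotebookApp.password:
# Prints something like 'sha1:...'; copy it into c.NotebookApp.password.
from notebook.auth import passwd
print(passwd("choose-a-strong-password"))  # replace with your own password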
Note that since the config file above sets the port to 8888, the same port has to be added to the security group (Custom TCP rule, TCP protocol, port range 8888, and a custom source).
Now you are good to go!
Type the following command:
screen
This command will allow you to create a separate screen session for just your Jupyter process logs while you continue to do other work on the EC2 instance.
And now start the jupyter notebook by typing the command:
jupyter notebook
To visit the Jupyter notebook from the browser on your local machine:
Your EC2 instance will have a long URL, like this:
ec2-52-39-239-66.us-west-2.compute.amazonaws.com
Visit that URL in your browser locally. Make sure to have https at the beginning and port 8888 at the end, as shown below.
https://ec2-52-39-239-66.us-west-2.compute.amazonaws.com:8888/
You can start the Jupyter server using the following command:
jupyter notebook --ip=*
If you want to keep it running even after the terminal is closed, then use:
nohup jupyter notebook --ip=* > nohup_jupyter.out &
Remember to open port 8888 in the AWS EC2 security group's inbound rules to Anywhere (0.0.0.0/0, ::/0).
Then you can access Jupyter at http://<ec2-public-ip>:8888
Hope this helps. It's just a one-liner solution!

Logging chokes on BlockingIOError: write could not complete without blocking

I recently ported my scripts from 2.x to 3.x. During production runs through automation (Rundeck) we are seeing errors caused by the logger's output stream being in non-blocking mode. Any ideas on how to resolve this would be great.
Ubuntu 18.04.1 LTS
Python 3.6.7
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
self.flush()
File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
self.stream.flush()
BlockingIOError: [Errno 11] write could not complete without blocking
I was getting the same error on CI builds. It looks like it was a capacity issue with the output stream. After reducing the log output, the errors went away.
I recently faced this error while building my Docker image using docker-compose in CI, and I found a workaround that may help someone.
The error:
BlockingIOError: [Errno 11] write could not complete without blocking
If you do not want to lose any logs, you can send all the logs to a file and save it as an artifact (tested on Bamboo and Jenkins):
docker-compose build --no-cache my_image > myfile.txt
If you do not want the logs:
docker-compose build --no-cache my_image > /dev/null
