Python3 command taking forever to run in Airflow - python-3.x

I am calling a task that runs a python3 command. I put a print statement ("main is called") on the first line inside the if __name__ == '__main__': block, but it never gets executed. How can I confirm that the file is actually being called and that everything before it has run? The logs show:
[2018-12-20 07:15:24,456] {bash_operator.py:87} INFO - Temporary script location: /tmp/airflowtmpah5gx32p/pscript_pclean_phlc9h6grzqdhm6sc0zrxjne_UdOgg0xdoblvr
[2018-12-20 07:15:24,456] {bash_operator.py:97} INFO - Running command: python3 /usr/local/airflow/rootfs/mopng_beneficiary_v2/scripts/pclean_phlc9h6grzqdhm6sc0zrxjne_UdOgg.py /usr/local/airflow/rootfs/mopng_beneficiary_v2/manual__2018-12-18T12:06:14+00:00/appended/euoEQHIwIQTe1wXtg46fFYok.csv /usr/local/airflow/rootfs/mopng_beneficiary_v2/external/5Feb18_master_ujjwala_latlong_dist_dno_so_v7.csv /usr/local/airflow/rootfs/mopng_beneficiary_v2/external/ppac_master_v3_mmi_enriched_with_sanity_check.csv /usr/local/airflow/rootfs/mopng_beneficiary_v2/manual__2018-12-18T12:06:14+00:00/pcleaned/Qc01sB1s1WBLhljjIzt2S0Ex.csv
[2018-12-20 07:15:24,467] {bash_operator.py:106} INFO - Output:
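A likely reason the print from __main__ never shows up in the task log is stdout buffering: when Python writes to a pipe instead of a terminal, stdout is block-buffered, so output can sit in the buffer while the process runs. A hedged way to check this is to force unbuffered output, either with python3 -u in the bash_command (the DAG definition isn't shown here, so the operator below is only an assumed sketch with a shortened script path) or with flush=True on the print itself:
# Assumed sketch only -- the real DAG and full script path are not shown in the question.
from airflow.operators.bash_operator import BashOperator

pclean = BashOperator(
    task_id='pclean',
    # -u disables Python's stdout/stderr buffering so prints reach the log immediately
    bash_command='python3 -u /usr/local/airflow/rootfs/mopng_beneficiary_v2/scripts/pclean.py ...',
    dag=dag,
)

# Or, inside the script itself:
if __name__ == '__main__':
    print('main is called', flush=True)  # flush so the line is written out right away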

Related

How to get `run_id` when using MLflow Project

When using MLflow Projects (via an MLproject file) I get this message when the run starts:
INFO mlflow.projects.backend.local:
=== Running command 'source /anaconda3/bin/../etc/profile.d/conda.sh &&
conda activate mlflow-4736797b8261ec1b3ab764c5060cae268b4c8ffa 1>&2 &&
python3 main.py' in run with ID 'e2f0e8c670114c5887963cd6a1ac30f9' ===
I want to access the run_id shown above (e2f0e8c670114c5887963cd6a1ac30f9) from inside the main script.
I expected a run to be active but:
mlflow.active_run()
> None
Initiating a run inside the main script does give me access to the correct run_id, although any subsequent runs will have a different run_id.
# first run inside the script - correct run_id
with mlflow.start_run():
    print(mlflow.active_run().info.run_id)
> e2f0e8c670114c5887963cd6a1ac30f9
# second run inside the script - wrong run_id
with mlflow.start_run():
    print(mlflow.active_run().info.run_id)
> 417065241f1946b98a4abfdd920239b1
Seems like a strange behavior, and I was wondering if there's another way to access the run_id assigned at the beginning of the MLproject run?
with mlflow.start_run() as run:
    print(run.info.run_id)
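For context, mlflow run exports the run id to the entry-point process through the MLFLOW_RUN_ID environment variable, which is why the first mlflow.start_run() in the script resumes the project's run instead of creating a new one. A small sketch, assuming the script was launched via mlflow run (otherwise the variable is unset):
import os
import mlflow

# Set by `mlflow run` for the entry point; None when the script is started directly.
project_run_id = os.environ.get("MLFLOW_RUN_ID")
print(project_run_id)

# Equivalent: the first start_run() picks up that id rather than creating a new run.
with mlflow.start_run() as run:
    print(run.info.run_id)  # same id as above when launched via `mlflow run`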

Python Script Executes Manually in CMD but Errors in Scheduler

So, I have a Python script that I can run in cmd using python [path to script].
I have it set up in Task Scheduler, but it errors out when run through the scheduler. The cmd window closes before I can read the error, so I created a batch file to launch the script, and it shows an error that a package [lxml] doesn't exist. But the package does exist, since the script runs fine when executed manually.
Any thoughts?
The script scrapes data from a website, creates a dataframe, posts the dataframe to a Google Sheet, then pulls the full Google Sheet it posts to, turns that into a dataframe with all of the data, creates a Plotly graph, turns that into an HTML file, and finally sends the HTML file to an SFTP server.
Figured it out...
import sys
import platform

# Print which interpreter and architecture actually run the script
print("Python EXE : " + sys.executable)
print("Architecture : " + platform.architecture()[0])
# print("Path to arcpy : " + imp.find_module("arcpy")[1])  # needs "import imp"
# raw_input("\n\nPress ENTER to quit")  # use input() on Python 3
Run this to get the proper path to your python.exe, and put that path in the task's Program/script field.
Then verify that EVERY path in your script is absolute, starting at C:/ and working through the full path.
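In practice this usually means the scheduled task was launching a different interpreter than the one used in cmd, so the environment with lxml installed never ran. A hypothetical example of the Task Scheduler action fields (all paths are placeholders; use the ones printed by the snippet above):
Program/script: C:\Users\me\AppData\Local\Programs\Python\Python38\python.exe
Add arguments:  C:\scripts\scrape_and_post.py
Start in:       C:\scripts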

Jenkins console prints only logger commands from main.py

I have a pipeline in Jenkins that runs a Python file:
status = sh script: '''
    python3 main.py --command check_artfiacts
''', returnStatus: true
As long as execution stays in main.py, I get the expected logger output in the console:
2019-11-28 22:14:32,027 - __main__ - INFO - starting application from: C:\Tools\BB\jfrog_distribution_shared_lib\resources\com\amdocs\jfrog\methods, with args: C:\Tools\BB\jfrog_distribution_shared_lib\resources\com\amdocs\jfrog\methods
2019-11-28 22:14:32,036 - amd_distribution_check_artifacts_exists - INFO - Coming from func: build_aql_queries
However, when calling a function defined in another Python file, the logger formatting is lost (the output behaves like a plain print):
added helm to aql
amd_distribution_check_artifacts_exists: build_docker_aql_query_by_item
I'm fairly sure it's a pipeline issue, because when I run the code from PyCharm it prints everything as expected.
Has anyone faced such an issue?
I found the solution in this thread:
jenkins-console-output-not-in-realtime
So the -u flag worked for me:
python3 -u main.py --command check_artfiacts
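An equivalent approach, if you would rather not add -u to every invocation, is to export PYTHONUNBUFFERED before running the script; it has the same effect as -u. A sketch using the same sh step as above:
status = sh script: '''
    export PYTHONUNBUFFERED=1   # same effect as python3 -u: stdout/stderr are unbuffered
    python3 main.py --command check_artfiacts
''', returnStatus: true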

Execute Long running jobs from bottle web server

What I am trying to do
I have a front-end system that generates output. I receive this data (JSON) with a POST request using Bottle, and the POST handler gets the JSON without issue. I need to execute a backend Python program (Blender automation) and pass this JSON data to it.
What I have tried to do
Subprocess - I used subprocess to call the program and pass the input. It appears to execute, but when I check System Monitor the program never starts, while my server continues to run as it should. The same subprocess command runs perfectly fine when executed independently of the server.
blender, script, and json are all string objects with absolute file paths
sub = subprocess.Popen([blender + " -b -P " + script + " -- " + json], stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
C-style os.fork() - Same result as above; from reading the pydoc I found that subprocess operates using these methods under the hood.
Double fork - From a post on here, I tried forking from the server, calling subprocess from that fork, and terminating the parent of the subprocess to create an orphan. My subprocess command still does not execute and never shows up in System Monitor.
What I need
I need a solution that runs from the Bottle server in its own process. The server will handle multiple requests, so the subprocess cannot block inside the server. The process being called is fully automated and just requires sending the JSON data in the execution command. The result of the subprocess program will be a string path to a file created on the server.
The above subprocess works perfectly fine when called from my test driver program. I just need to connect the execution to the webservice so my front end can trigger its execution.
My bottle post method - prints json when called without issue.
from bottle import post, request

@post('/getData')
def getData():
    json_text = request.json
    print(json_text)
I am not sure where to go from here. From what I have read thus far, subprocess should work. Any help or suggestions would be very much appreciated. If additional information is needed, please let me know and I will edit with more details. Thank you.
Relevant Information:
OS: Ubuntu 16.04 LTS,
Python 3.x
EDIT:
This isn't an elegant solution, but my subprocess call works now.
cmd = blender
cmd += " -b -P "
cmd += script
cmd += " -- "
cmd += str(json)
sub = subprocess.Popen([cmd], shell=True)
It seems that setting shell=True and removing stdout=PIPE and stderr=PIPE let me see the output, which showed I was throwing an unhandled exception because my JSON data was a list and not a string.
When using Python to execute your scripts, a process created by subprocess.Popen will unintentionally inherit and keep open the parent's file descriptors.
You need to close them so that the process can run independently (close_fds=True).
subprocess.Popen(['python', "-u", Constant.WEBAPPS_FOLDER + 'convert_file.py', src, username], shell=False, bufsize=-1, close_fds=True)
Also, you don't have to use the shell to create another process; shell=True can have unintended consequences.
I had the exact same problem, where Bottle was not returning / was hanging. It works now.
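Putting both points together, a hedged sketch of the same call without the shell: pass the arguments as a list (so no quoting or shell is needed) and close inherited descriptors. Here blender and script are the same path strings as above, and json_path is an assumed rename of the original json variable to avoid shadowing the json module:
import subprocess

sub = subprocess.Popen(
    [blender, "-b", "-P", script, "--", json_path],  # argument list: no shell required
    close_fds=True,              # child does not keep the server's file descriptors open
    stdout=subprocess.DEVNULL,   # discard output instead of piping it back to the server
    stderr=subprocess.DEVNULL,
)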

Disconnect subprocess from main process

I have a Python 3 script which, among other things, launches Chrome with certain command-line parameters. The relevant portion of the code looks like this:
import multiprocessing as mp
from subprocess import call
import time
import logging

def launch_tab(datadir, url):
    # Constructs the command line and launches Chrome via subprocess.call()

def open_browser(urls):
    '''Arranges for Chrome to be launched in a subprocess with the specified
    URLs, using the already-configured profile directory.
    urls: a list of URLs to be loaded one at a time.'''
    first_run = True
    for tab in urls:
        logging.debug('open_browser: {}'.format(tab))
        proc = mp.Process(name=tab, target=launch_tab, args=(config.chromedir, tab))
        proc.start()
        if first_run:
            first_run = False
            time.sleep(10)
        else:
            time.sleep(0.5)
What Happens
When I run the script while Chrome, as previously launched by the script, is already running:
Chrome launches as expected, sees that it is already running, follows the instructions provided on the command line, then terminates the process started on the command line.
Since the process is now terminated, my script also terminates. This is the behavior I want.
When I run the script while Chrome is not running:
Chrome sees that it is not already running, and so doesn't terminate the process started by the command line.
Because the process hasn't terminated, my script doesn't exit until Chrome does, despite its having nothing to do. I have to remember to place the script in the background.
What I Want
I want the subprocess which launches Chrome to be completely independent of the main process, such that I get my command prompt back immediately after Chrome is launched. After launching Chrome, my script has completed its job. I thought of using the daemonize module, but it apparently doesn't work with Python 3.
I'm not wedded to the multiprocessing module. Any reasonable approach which produces the desired end result is acceptable. Switching to Python 2 so I can try daemonize would be too difficult.
The script will only ever run on Linux, so portability isn't an issue.
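One way to get this on Linux without multiprocessing or daemonize is to start Chrome with subprocess.Popen and start_new_session=True, so the child runs in its own session (setsid) and the script can exit right away. This is only a sketch under those assumptions; the Chrome command line below is a placeholder, not the original code:
import subprocess

def launch_detached(datadir, url):
    # start_new_session=True puts the child in its own session (setsid on Linux),
    # so it is not tied to the launching script or its terminal.
    subprocess.Popen(
        ['google-chrome', '--user-data-dir={}'.format(datadir), url],  # placeholder command
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    # The function returns immediately; Chrome keeps running on its own.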
