I want to be able to force Spark to execute my code in the order I want.
In the example below, the foo and bar functions do data manipulation, but send_request is just a web trigger unaffected by those functions. When Spark executes the code below, it runs send_request first and foo and bar later.
This does not work for me because, by the time foo and bar have completed, my request has timed out. If the request ran after foo, its result would be ready at about the same time bar ends.
How could I achieve this in Spark?
I could have a separate script for each step, but the cluster start-up times add up, hence I would like to be able to modify the execution order.
I am using Databricks on Azure, if that helps.
import os
import base64
import requests
import pyspark
sc.addFile("dbfs:/bar.py")
from bar import bar
sc.addFile("dbfs:/foo.py")
from foo import foo
if __name__ == '__main__':
    foo()
    response = send_request(request=request_json)
    bar()
The contents of foo, bar, and send_request are as follows:
def foo():
    df = spark.read.parquet(file_1_path)
    df = df.filter(F.col('IDType') == 'E') \
        .select(F.col('col1'), F.col('col2')).distinct()
    df.repartition(10).write.parquet(file_1_new_path)
    logger.info('1 foo is done')
and
def bar():
    df = spark.read.parquet(file_2_path)
    df = df.filter(F.col('IDType') == 'M') \
        .select(F.col('col1'), F.col('col2')).distinct()
    df.repartition(10).write.parquet(file_2_new_path)
    logger.info('3 bar is done')
and
def send_request():
    response_json = http_response.json()
    logger.info('2 request is sent')
To be clearer: when I run the above code in Spark, the output I get is as follows:
2 request is sent
1 foo is done
3 bar is done
But I want it to be in this order
1 foo is done
2 request is sent
3 bar is done
I have a basic rich progress bar implemented like this:
import time
from rich.progress import *
with Progress(TextColumn("[progress.description]{task.description}"),
              BarColumn(), TaskProgressColumn(),
              TimeElapsedColumn()) as progress:
    total = 20
    for x in range(total):
        task1 = progress.add_task(f"[green]Processing Algorithm-{x}.",
                                  total=total)
        progress.update(task1, advance=1)
        time.sleep(0.1)
It works as expected.
But now I want to move the initialization of the progress bar into a separate file, so I
created a file task_progress.py and put the code in there.
from rich.progress import *
import contextlib
@contextlib.contextmanager
def init_progress():
    yield Progress(BarColumn(), TaskProgressColumn(), TimeElapsedColumn())
And I updated the original progress bar as below:
import time
from task_progress import init_progress
with init_progress() as progress:
    total = 20
    for x in range(total):
        task1 = progress.add_task(f"[green]Processing Algorithm-{x}.",
                                  total=total)
        progress.update(task1, advance=1)
        time.sleep(0.1)
But, now when I run the code the progress bar does not appear on the terminal!
You don't need to wrap the creation of the Progress class in a context manager. The Progress class can already act as a context manager. A function that returns a Progress object will work fine:
from rich.progress import *

def init_progress():
    return Progress(BarColumn(), TaskProgressColumn(), TimeElapsedColumn())
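With the return in place, the calling code from the question works as-is, because the Progress object's own __enter__/__exit__ drive the display. A minimal sketch, using the task_progress.py module from the question but a single task created before the loop rather than the question's per-iteration add_task:
import time
from task_progress import init_progress

with init_progress() as progress:
    # Progress.__enter__ starts the live display; __exit__ stops it
    task1 = progress.add_task("[green]Processing", total=20)
    for x in range(20):
        progress.update(task1, advance=1)
        time.sleep(0.1)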
So in Python you can do from foo import bar; this gets bar from foo, but how can I get something from inside the bar object?
I.e.
from foo import bar.childDef # gets childDef() from inside of bar
Normally you would have to type bar.childDef() with a regular from-import statement;
however, I just want to use childDef() instead of bar.childDef().
Sorry if this is a confusing or just bad question in general; I'm just curious.
You could use:
import datetime
now = datetime.datetime.now
or
def now():
    import datetime
    return datetime.datetime.now()

now()
## foo.py
class bar:
    def childDef():
        print('Child Def')
And import it from another script:
from foo import bar

# import the method
imported_fun = bar.childDef

# call the method
bar.childDef()  # or imported_fun()
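For completeness, binding the attribute to a short name at import time gives exactly the call style the question asks for. A sketch against the same foo.py:
from foo import bar

childDef = bar.childDef  # bind the inner function to a short local name
childDef()               # prints: Child Def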
I have a Python 3.7 script which takes a YAML file as input and processes it depending on the instructions within. The YAML file I am using for unit testing looks like this:
...
tasks:
- echo '1'
- echo '2'
- echo '3'
- echo '4'
- echo '5'
The script loops over tasks and runs each one using an os.system() call.
Manual testing indicates that the output is as expected:
1
2
3
4
5
But I can't make it work in my unit test. Here's how I'm trying to capture the output:
from application import application
from io import StringIO
import unittest
from unittest.mock import patch
class TestApplication(unittest.TestCase):
    def test_application_tasks(self):
        expected = ['1', '2', '3', '4', '5']
        with patch('sys.stdout', new=StringIO()) as fakeOutput:
            application.parse_event('some event')  # print() is called here within parse_event()
            self.assertEqual(fakeOutput.getvalue().strip().split(), expected)
When running python3 -m unittest discover -s tests, all I get is AssertionError: Lists differ: [] != ['1', '2', '3', '4', '5'].
I also tried using with patch('sys.stdout', new_callable=StringIO) as fakeOutput: instead, but to no avail.
Another thing I tried was self.assertEqual(fakeOutput.getvalue(), '1\n2\n3\n4\n5'), and here is what unittest outputs:
AssertionError: '' != '1\n2\n3\n4\n5'
+ 1
+ 2
+ 3
+ 4
+ 5
Obviously, the script works and outputs the right result, but fakeOutput does not capture it.
Using patch as a decorator does not work either:
from application import application
from io import StringIO
import unittest
from unittest.mock import patch
class TestApplication(unittest.TestCase):
    @patch('sys.stdout', new_callable=StringIO)
    def test_application_tasks(self, fakeOutput):
        expected = ['1', '2', '3', '4', '5']
        application.parse_event('some event')  # print() is called here within parse_event()
        self.assertEqual(fakeOutput.getvalue().strip().split(), expected)
This outputs exactly the same error: AssertionError: Lists differ: [] != ['1', '2', '3', '4', '5']
os.system runs a new process. If you monkey-patch sys.stdout this affects the current process but has no consequences for any new processes.
Consider:
import sys
from os import system
from io import BytesIO

# redirect this process's sys.stdout to an in-memory buffer
capture = sys.stdout = BytesIO()
# echo writes to the child process's stdout (file descriptor 1), not to sys.stdout
system("echo Hello")
sys.stdout = sys.__stdout__
print(capture.getvalue())  # the buffer is empty
Nothing is captured because only the child process has written to its stdout. Nothing has written to the stdout of your Python process.
Generally, avoid os.system. Instead, use the subprocess module, which will let you capture output from the process that is run.
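For instance, a minimal sketch of running one task in a shell with subprocess and capturing its output in the parent process (the echo command here is just a stand-in for the tasks in the YAML):
import subprocess

# run the command in a shell and capture the child's stdout in this process
result = subprocess.run("echo '1'", shell=True, stdout=subprocess.PIPE)
print(result.stdout.decode("utf-8").strip())  # -> 1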
Thank you, Jean-Paul Calderone. I realized that os.system() creates a completely different process, and that I therefore need to tackle the problem differently, only after I had posted the question :)
To actually be able to test my code, I had to rewrite it using subprocess instead of os.system(). In the end, I went with subprocess_run_result = subprocess.run(task, shell=True, stdout=subprocess.PIPE) and then got the result using subprocess_run_result.stdout.strip().decode("utf-8").
In the tests I just create an instance of the class and call a method, which runs the tasks in a subprocess.
My whole refactored code and tests are here in this commit if anyone would like to take a look.
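For illustration, a rough sketch of that kind of test; the TaskRunner class and run_task method are hypothetical names, not taken from the linked commit:
import subprocess
import unittest


class TaskRunner:
    def run_task(self, task):
        # run a single task in a shell and return its trimmed stdout
        result = subprocess.run(task, shell=True, stdout=subprocess.PIPE)
        return result.stdout.strip().decode("utf-8")


class TestTaskRunner(unittest.TestCase):
    def test_run_task(self):
        runner = TaskRunner()
        self.assertEqual(runner.run_task("echo '1'"), "1")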
Your solution is fine; just use getvalue(), like so:
with patch("sys.stdout", new_callable=StringIO) as f:
print("Foo")
r = f.getvalue()
print("r: {r!r} ;".format(r=r))
r: "Foo" ;
I'm writing a script where a user has to provide input for each element of a large list. I'm trying to use tqdm to provide a progress bar for the user, but I can't find a good way to get input within the tqdm loop without breaking the output.
I'm aware of tqdm.write() for writing to the terminal during a tqdm loop, but is there a way of getting input?
For an example of what I'm trying to do, consider the code below:
from tqdm import tqdm
import sys
from time import sleep
def do_stuff(x): sleep(0.5)
stuff_list = ['Alpha', 'Beta', 'Gamma', 'Omega']
for thing in tqdm(stuff_list):
    input_string = input(thing + ": ")
    do_stuff(input_string)
If I run this code, I get the following output:
0%| | 0/4 [00:00<?, ?it/s]Alpha: A
25%|█████████████████████ | 1/4 [00:02<00:07, 2.54s/it]Beta: B
50%|██████████████████████████████████████████ | 2/4 [00:03<00:04, 2.09s/it]Gamma: C
75%|███████████████████████████████████████████████████████████████ | 3/4 [00:04<00:01, 1.72s/it]Omega: D
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.56s/it]
I've tried using tqdm.external_write_mode, but this simply didn't display the progress bar whenever an input was waiting, which is not the behaviour I'm looking for.
Is there an easy way of doing this, or am I going to have to swap libraries?
It isn't possible to display the progress bar while inside the input() function, because once a line is finished it cannot be removed any more. This is a technical limitation of how terminals work: you can only rewrite the current line until you write a newline.
Therefore, I think the only solution is to remove the status bar, let the user input happen, and then display it again.
from tqdm import tqdm
import sys
from time import sleep
def do_stuff(x): sleep(0.5)
stuff_list = ['Alpha', 'Beta', 'Gamma', 'Omega']
# To have more fine-control, you need to create a tqdm object
progress_iterator = tqdm(stuff_list)
for thing in progress_iterator:
    # Remove progress bar
    progress_iterator.clear()
    # User input
    input_string = input(thing + ": ")
    # Write the progress bar again
    progress_iterator.refresh()
    # Do stuff
    do_stuff(input_string)
If you don't like the fact that the progress_iterator object exists after the loop, use the with syntax:
with tqdm(stuff_list) as progress_iterator:
    for thing in progress_iterator:
        ...
EDIT:
If you are willing to sacrifice platform independence, you can freely move the cursor and delete lines with this:
from tqdm import tqdm
import sys
from time import sleep
def do_stuff(x): sleep(0.5)
stuff_list = ['Alpha', 'Beta', 'Gamma', 'Omega']
# Special console commands
CURSOR_UP_ONE = '\x1b[1A'
# To have more fine-control, you need to create a tqdm object
progress_iterator = tqdm(stuff_list)
for thing in progress_iterator:
    # Move the status bar one down
    progress_iterator.clear()
    print(file=sys.stderr)
    progress_iterator.refresh()
    # Move the cursor back up
    sys.stderr.write('\r')
    sys.stderr.write(CURSOR_UP_ONE)
    # User input
    input_string = input(thing + ": ")
    # Refresh the progress bar, to move the cursor back to where it should be.
    # This step can be omitted.
    progress_iterator.refresh()
    # Do stuff
    do_stuff(input_string)
I think this is the closest you will get to tqdm.write(). Note that the behaviour of input() can never be identical to tqdm.write(), because tqdm.write() first deletes the bar, then writes the message, and then writes the bar again. If you want to display the bar while being in input(), you have to do some platform-dependent stuff like this.
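For comparison, here is the tqdm.write() pattern the answer refers to; it only covers output, not input (a minimal sketch):
from time import sleep
from tqdm import tqdm

for thing in tqdm(['Alpha', 'Beta', 'Gamma', 'Omega']):
    tqdm.write("processing " + thing)  # printed above the bar without corrupting it
    sleep(0.5)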
I want to reset a tqdm progress bar.
This is my code:
from tqdm import tqdm

s = tqdm(range(100))
for x in s:
    pass

# Reset it here
s.reset(0)

for x in s:
    pass
The tqdm progress bar works only for the first loop. I tried to reset it using the .reset(0) function, but it doesn't work.
The output of the above code is:
100%|██████████| 100/100 [00:00<?, ?it/s]
I noticed that the answer to Resetting progress bar counter uses this code:
pbar.n = 0
pbar.refresh()
but that doesn't work either.
When wrapping an iterable, tqdm will close() the bar once the iterable has been exhausted. This means reusing it (refresh() etc.) won't work. You can solve your problem manually:
from tqdm import tqdm
s = range(100)
t = tqdm(total=len(s))
for x in s:
    t.update()
t.refresh()  # force print final state
t.reset()    # reuse bar
for x in s:
    t.update()
t.close()    # close the bar permanently
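If the second pass has a different length, reset() also accepts a new total (a small sketch; assumes a tqdm version recent enough to support reset(total=...)):
from tqdm import tqdm

t = tqdm(total=100)
for x in range(100):
    t.update()
t.reset(total=50)  # reuse the bar with a different length
for x in range(50):
    t.update()
t.close()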
Try just creating a new progress bar over the old one. The garbage collector will take care of the old one afterwards, getting it out of memory once nothing in the code references it any more.
s = tqdm(range(100))
for x in s:
    pass

# reset it here
s = tqdm(range(100))
for x in s:
    pass