In a Spark Structured Streaming DataFrame I have a column Diff.
Based on the Diff column I want to create another column that maintains a cumulative sum of Diff.
Approach 1
So I created a UDF with the code below:
import pyspark.sql.functions as f
from pyspark.sql.functions import udf

elapsed_time = 0

def cumulative(val):
    global elapsed_time
    elapsed_time = elapsed_time + int(val)
    return elapsed_time

CumSum = udf(lambda t: cumulative(t))
Then I am calling the UDF as below:
# Adding cumulative sum column using UDF
dataframe2 = dataframe1.withColumn("CumulativeTime", CumSum(f.col('Diff')))
This is working perfectly fine.
Will it cause any issues later on, given that the variable elapsed_time is updated by all the worker nodes?
Approach 2
I tried an approach that uses an Accumulator, as below:
elapsed_time = sc.accumulator(0)

def cumulative(val):
    global elapsed_time
    elapsed_time.add(int(val))
    return elapsed_time.value

CumSum = udf(lambda t: cumulative(t))
but I received an exception: An exception was thrown from a UDF: 'Exception: Accumulator.value cannot be accessed inside tasks'.
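For reference, here is a minimal sketch (separate from my streaming code) of how the accumulator apparently has to be used: .add() is fine inside tasks, but .value can only be read back on the driver after an action has run, which seems to be what the exception is about:

acc = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3])
rdd.foreach(lambda v: acc.add(v))  # calling .add() inside tasks is allowed
print(acc.value)                   # reading .value is only valid here, on the driver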
Any alternative, workaround or suggestion would be helpful.
Related
I am trying to optimize this code, which creates a dummy variable when a PySpark DataFrame column's value is in [categories].
On 100K rows it takes about 30 seconds to run. In my case I have around 20M rows, which will take a lot of time.
from pyspark.sql.functions import col, UserDefinedFunction
from pyspark.sql.types import IntegerType

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    udf = UserDefinedFunction(lambda x: 1 if x in categories else 0, IntegerType())
    dframe = dframe.withColumn(str(top_name), udf(col(col_name))).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe
In other words, how do I optimize this function and cut the total time down given the volume of my data? And how do I make sure that this function does not iterate over my data?
I appreciate your suggestions. Thanks.
You don't need a UDF for encoding the categories. You can use isin:
import pyspark.sql.functions as F

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    dframe = dframe.withColumn(str(top_name), F.col(col_name).isin(categories).cast("int")).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe
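For illustration, a hypothetical call (the DataFrame, column name and categories below are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
items = spark.createDataFrame([("apple",), ("carrot",), ("banana",)], ["item"])

# "is_fruit" is 1 when "item" is in the category list, 0 otherwise.
dummies = create_dummy(items, col_name="item", top_name="is_fruit",
                       categories=["apple", "banana"], lst_tmp_col=["item"])
dummies.show()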
The dataset has billions of data points for each pair. I tried a multiprocessing loop to make it faster.
Why is the multiprocessing/for loop skipping some elements from pairs?
When I run it again, it randomly skips some other names and the code ends.
import pandas as pd
import pickle
import time
import concurrent.futures

start = time.perf_counter()

pairs = ['GBPUSD', 'AUDUSD', 'EURUSD', 'EURJPY', 'GBPJPY', 'USDJPY', 'USDCAD', 'EURGBP']

def pickling_joined(p):
    df = pd.read_csv(f'C:\\Users\\Ghosh\\Downloads\\dataset\\data_joined\\{p}.csv')
    df['LTP'] = (df['Bid'] + df['Ask']) / 2
    print(f'\n=====>> Converting Date format for {p} ....')
    df['Date'] = df['Date'].apply(pd.to_datetime)
    print(f'\n=====>> Date format converted for {p} ....')
    df.set_index('Date', inplace=True)
    df = pd.DataFrame(df)
    with open(f'C:\\Users\\Ghosh\\Downloads\\dataset\\data_pickled\\{p}.pkl', 'wb') as pickle_file:
        pickle.dump(df, pickle_file)
    print(f'\n=====>> Pickling done for {p} !!!')

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(pickling_joined, pairs)
    finish = time.perf_counter()
    print(f'Finished in {finish - start} seconds')
Python doesn't handle threading/multiprocessing well with heavy files, so I would recommend Dask here. Dask uses a cluster scheduler, which works like multiprocessing but takes less time, and you can additionally combine it with multiprocessing to run it even faster.
import dask.dataframe as dd  # plus the pickle/time/concurrent.futures imports and the `pairs`/`start` definitions from above

def pickling_joined(p):
    df = dd.read_csv(f'C:\\Users\\Ghosh\\Downloads\\dataset\\data_joined\\{p}.csv')
    df['LTP'] = (df['Bid'] + df['Ask']) / 2
    print(f'\n=====>> Converting Date format for {p} ....')
    df['Date'] = dd.to_datetime(df.Date)
    print(f'\n=====>> Date format converted for {p} ....')
    df = df.set_index('Date', sorted=True)
    df = df.compute()
    with open(f'C:\\Users\\Ghosh\\Downloads\\dataset\\data_pickled\\{p}.pkl', 'wb') as pickle_file:
        pickle.dump(df, pickle_file)
    print(f'\n=O=O=O=O=O>> Pickling done for {p} !!!')

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(pickling_joined, pairs)
    finish = time.perf_counter()
    print(f'\nFinished in {finish - start} seconds')
Java would be a better choice for this kind of job; Python will always skip steps with a large DataFrame.
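As a side note, a sketch (using the names from the question) of why pairs can appear to be skipped: executor.map returns its results lazily, and a worker's exception is only re-raised when you iterate over those results, so a pair that fails on a bad file can vanish silently. Consuming the results makes such failures visible:

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Iterating over the results re-raises any exception from a worker,
        # so a failing pair no longer disappears silently.
        for p, _ in zip(pairs, executor.map(pickling_joined, pairs)):
            print(f'=====>> Completed {p}')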
I'm using a multiprocessing pool with 80 processes on a 16GB machine. The flow is as follows:
Read objects in batches from an input file
Send the entire batch to a multiprocessing pool, and record the time taken by the pool to process the batch
Write the time recorded in step 2 above to an output file
To achieve the above, I wrote code in 2 ways:
Way 1:
with open('input_file', 'r') as input_file, open('output_file', 'a') as of:
    batch = read_next_batch_of_lines()
    start_time = time.time()
    call_api_for_each_item_in_batch(batch)
    end_time = time.time()
    of.write('{}\n'.format(end_time - start_time))
Way 2:
with open('input_file', 'r') as input_file:
    batch = read_next_batch_of_lines()
    start_time = time.time()
    call_api_for_each_item_in_batch(batch)
    end_time = time.time()
    with open('output_file', 'a') as of:
        of.write('{}\n'.format(end_time - start_time))
In the first case, nothing is being appended to the output file despite batches being processed. I'm unable to figure out the reason for this.
Details of call_api_for_each_item_in_batch():
def call_api_for_each_item_in_batch(batch):
    intervals = get_intervals(batch, pool_size)  # this gives intervals, e.g. if batch size is 10 and pool size is 3, then intervals would be (0, 4, 7, 10)
    pool = mp.Pool(pool_size)
    arguments = list(zip(intervals, intervals[1:]))
    pool.starmap(call_api, arguments)
    pool.close()

def call_api(start, end):
    for i in range(start, end):
        item = batch[i]
        call_external_api(item)
How is Way 1 different from Way 2 when a pool.close() is called in the call_api_for_each_item_in_batch itself?
Also, I used pool.close() followed by pool.join(), but faced the same issue.
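One thing I am considering (I am not sure this is the actual cause) is that in Way 1 the writes may simply sit in the file object's buffer until the file is closed, so flushing after each write would make the timings appear immediately:

with open('input_file', 'r') as input_file, open('output_file', 'a') as of:
    batch = read_next_batch_of_lines()
    start_time = time.time()
    call_api_for_each_item_in_batch(batch)
    end_time = time.time()
    of.write('{}\n'.format(end_time - start_time))
    of.flush()  # push the buffered line to disk without closing the file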
I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this?
My only experience with parallelization is on another project, where I submit a custom function to a dask distributed client to generate thousands of tasks. Right now, this same approach fails whenever the custom function receives or generates a grakn transaction object/handle. I get errors like:
Traceback (most recent call last):
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\distributed\protocol\pickle.py", line 41, in dumps
    return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
  ...
  File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
I've never used Python's multiprocessing module directly. What are other people doing to parallelize their queries to grakn?
The easiest approach that I've found to execute a batch of queries is to pass a Grakn session to each thread in a ThreadPool. Within each thread you can manage transactions and of course do some more complex logic:
from grakn.client import GraknClient
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial

def write_query_batch(session, batch):
    tx = session.transaction().write()
    for query in batch:
        tx.query(query)
    tx.commit()

def multi_thread_write_query_batches(session, query_batches, num_threads=8):
    pool = ThreadPool(num_threads)
    pool.map(partial(write_query_batch, session), query_batches)
    pool.close()
    pool.join()

def generate_query_batches(my_data_entries_list, batch_size):
    batch = []
    for index, data_entry in enumerate(my_data_entries_list):
        batch.append(data_entry)
        if index % batch_size == 0 and index != 0:
            yield batch
            batch = []
    if batch:
        yield batch

# (Part 2) Somewhere in your application open a client and a session
client = GraknClient(uri="localhost:48555")
session = client.session(keyspace="grakn")

query_batches_iterator = generate_query_batches(my_data_entries_list, batch_size)
multi_thread_write_query_batches(session, query_batches_iterator, num_threads=8)

session.close()
client.close()
The above is a generic method. As a concrete example, you can use the above (omitting part 2) to parallelise batches of insert statements from two files. Appending this to the above should work:
import time

files = [
    {
        "file_path": f"/path/to/your/file.gql",
    },
    {
        "file_path": f"/path/to/your/file2.gql",
    }
]

KEYSPACE = "grakn"
URI = "localhost:48555"
BATCH_SIZE = 10
NUM_BATCHES = 1000

# Entry point where migration starts
def migrate_graql_files():
    start_time = time.time()

    for file in files:
        print('==================================================')
        print(f'Loading from {file["file_path"]}')
        print('==================================================')

        open_file = open(file["file_path"], "r")  # Here we are assuming you have 1 Graql query per line!
        batches = generate_query_batches(open_file.readlines(), BATCH_SIZE)

        with GraknClient(uri=URI) as client:  # Using `with` auto-closes the client
            with client.session(KEYSPACE) as session:  # Using `with` auto-closes the session
                multi_thread_write_query_batches(session, batches, num_threads=16)  # Pick `num_threads` according to your machine

        elapsed = time.time() - start_time
        print(f'Time elapsed {elapsed:.1f} seconds')

    elapsed = time.time() - start_time
    print(f'Time elapsed {elapsed:.1f} seconds')

if __name__ == "__main__":
    migrate_graql_files()
You should also be able to see how you can load from a CSV or any other file type in this way, taking the values you find in that file and substituting them into Graql query string templates. Take a look at the migration example in the docs for more on that.
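As a rough sketch of what that substitution could look like (the file name, column names and query template below are made-up examples, not from the docs):

import csv

def csv_to_graql_queries(csv_path):
    # Hypothetical template; adapt the attributes to your own schema.
    template = 'insert $p isa person, has name "{name}", has age {age};'
    with open(csv_path, newline="") as csv_file:
        for row in csv.DictReader(csv_file):
            yield template.format(name=row["name"], age=row["age"])

queries = list(csv_to_graql_queries("/path/to/people.csv"))
batches = generate_query_batches(queries, BATCH_SIZE)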
An alternative approach using multi-processing instead of multi-threading follows below.
We empirically found that multi-threading doesn't yield particularly large performance gains, compared to multi-processing. This is probably due to Python's GIL.
This piece of code assumes a file enumerating TypeQL queries that are independent of each other, so they can be parallelised freely.
from typedb.client import TypeDB, TypeDBClient, SessionType, TransactionType
import multiprocessing as mp
import queue

def batch_writer(database, kill_event, batch_queue):
    client = TypeDB.core_client("localhost:1729")
    session = client.session(database, SessionType.DATA)
    while not kill_event.is_set():
        try:
            batch = batch_queue.get(block=True, timeout=1)
            with session.transaction(TransactionType.WRITE) as tx:
                for query in batch:
                    tx.query().insert(query)
                tx.commit()
        except queue.Empty:
            continue
    print("Received kill event, exiting worker.")

def start_writers(database, kill_event, batch_queue, parallelism=4):
    processes = []
    for _ in range(parallelism):
        proc = mp.Process(target=batch_writer, args=(database, kill_event, batch_queue))
        processes.append(proc)
        proc.start()
    return processes

def batch(iterable, n=1000):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

if __name__ == '__main__':
    batch_size = 100
    parallelism = 1
    database = "<database name>"
    file_path = "<PATH TO QUERIES FILE - ONE QUERY PER NEW LINE>"

    with open(file_path, "r") as file:
        statements = file.read().splitlines()[:]

    batch_statements = batch(statements, n=batch_size)
    total_batches = int(len(statements) / batch_size)
    if len(statements) % batch_size > 0:
        total_batches += 1

    batch_queue = mp.Queue(parallelism * 4)
    kill_event = mp.Event()
    writers = start_writers(database, kill_event, batch_queue, parallelism=parallelism)

    for i, batch in enumerate(batch_statements):
        batch_queue.put(batch, block=True)
        if i * batch_size % 10000 == 0:
            print("Loaded: {0}/{1}".format(i * batch_size, total_batches * batch_size))

    kill_event.set()
    batch_queue.close()
    batch_queue.join_thread()

    for proc in writers:
        proc.join()

    print("Done loading")
In Python 3.6, I am running multiple processes in parallel, where each process pings a URL and returns a Pandas DataFrame. I want to keep running the (2+) processes continually, so I have created a minimal representative example below.
My questions are:
1) My understanding is that since I have different functions, I cannot use Pool.map_async() and its variants. Is that right? The only examples of these I have seen were repeating the same function, like on this answer.
2) What is the best practice to make this setup to run perpetually? In my code below, I use a while loop, which I suspect is not suited for this purpose.
3) Is the way I am using the Process and Manager optimal? I use multiprocessing.Manager.dict() as the shared dictionary to return the results from the processes. I saw in a comment on this answer that using a Queue here would make sense; however, the Queue object has no .dict() method, so I am not sure how that would work.
I would be grateful for any improvements and suggestions with example code.
import numpy as np
import pandas as pd
import multiprocessing
import time

def worker1(name, t, seed, return_dict):
    '''worker function'''
    print(str(name) + 'is here.')
    time.sleep(t)
    np.random.seed(seed)
    df = pd.DataFrame(np.random.randint(0, 1000, 8).reshape(2, 4), columns=list('ABCD'))
    return_dict[name] = [df.columns.tolist()] + df.values.tolist()

def worker2(name, t, seed, return_dict):
    '''worker function'''
    print(str(name) + 'is here.')
    np.random.seed(seed)
    time.sleep(t)
    df = pd.DataFrame(np.random.randint(0, 1000, 12).reshape(3, 4), columns=list('ABCD'))
    return_dict[name] = [df.columns.tolist()] + df.values.tolist()

if __name__ == '__main__':
    t = 1
    while True:
        start_time = time.time()
        manager = multiprocessing.Manager()
        parallel_dict = manager.dict()
        seed = np.random.randint(0, 1000, 1)  # send seed to worker to return a diff df
        jobs = []
        p1 = multiprocessing.Process(target=worker1, args=('name1', t, seed, parallel_dict))
        p2 = multiprocessing.Process(target=worker2, args=('name2', t, seed + 1, parallel_dict))
        jobs.append(p1)
        jobs.append(p2)
        p1.start()
        p2.start()
        for proc in jobs:
            proc.join()
        parallel_end_time = time.time() - start_time
        # print(parallel_dict)
        df1 = pd.DataFrame(parallel_dict['name1'][1:], columns=parallel_dict['name1'][0])
        df2 = pd.DataFrame(parallel_dict['name2'][1:], columns=parallel_dict['name2'][0])
        merged_df = pd.concat([df1, df2], axis=0)
        print(merged_df)
Answer 1 (map on multiple functions)
You're technically right.
With map, map_async and other variations, you should use a single function.
But this constraint can be bypassed by implementing an executor, and passing the function to execute as part of the parameters:
def dispatcher(args):
    return args[0](*args[1:])
So a minimum working example:
import multiprocessing as mp

def function_1(v):
    print("hi %s" % v)
    return 1

def function_2(v):
    print("by %s" % v)
    return 2

def dispatcher(args):
    return args[0](*args[1:])

with mp.Pool(2) as p:
    tasks = [
        (function_1, "A"),
        (function_2, "B")
    ]
    r = p.map_async(dispatcher, tasks)
    r.wait()
    results = r.get()
Answer 2 (Scheduling)
I would remove the while loop from the script and schedule it as a cron job (on GNU/Linux) or a scheduled task (on Windows), so that the OS is responsible for its execution.
On Linux you can run crontab -e and add the following line to make the script run every 5 minutes.
*/5 * * * * python /path/to/script.py
Answer 3 (Shared Dictionary)
Yes and no.
To my knowledge, using a Manager for data such as collections is the best way.
For arrays or primitive types (int, float, etc.) there are Value and Array, which are faster.
As stated in the documentation:
A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.
A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Barrier, Queue, Value and Array.
Server process managers are more flexible than using shared memory objects because they can be made to support arbitrary object types. Also, a single manager can be shared by processes on different computers over a network. They are, however, slower than using shared memory.
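For the primitive-type case mentioned above, a minimal sketch with Value (purely illustrative; it is not needed for your DataFrame use case):

import multiprocessing as mp

def increment(counter):
    with counter.get_lock():    # a synchronized Value carries its own lock
        counter.value += 1

if __name__ == '__main__':
    counter = mp.Value('i', 0)  # shared integer, no Manager server process involved
    procs = [mp.Process(target=increment, args=(counter,)) for _ in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    print(counter.value)        # 4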
But since you only have to return a DataFrame, the shared dictionary is not needed.
Cleaned Code
Using all the previous ideas the code can be rewritten as:
map version
import numpy as np
import pandas as pd
from time import sleep
import multiprocessing as mp

def worker1(t, seed):
    print('worker1 is here.')
    sleep(t)
    np.random.seed(seed)
    return pd.DataFrame(np.random.randint(0, 1000, 8).reshape(2, 4), columns=list('ABCD'))

def worker2(t, seed):
    print('worker2 is here.')
    sleep(t)
    np.random.seed(seed)
    return pd.DataFrame(np.random.randint(0, 1000, 12).reshape(3, 4), columns=list('ABCD'))

def dispatcher(args):
    return args[0](*args[1:])

def task_generator(sleep_time=1):
    seed = np.random.randint(0, 1000, 1)
    yield worker1, sleep_time, seed
    yield worker2, sleep_time, seed + 1

with mp.Pool(2) as p:
    results = p.map(dispatcher, task_generator())
    merged = pd.concat(results, axis=0)
    print(merged)
If concatenating the DataFrames is the bottleneck, an approach with imap might be preferable.
imap version
with mp.Pool(2) as p:
    merged = pd.DataFrame()
    for result in p.imap_unordered(dispatcher, task_generator()):
        merged = pd.concat([merged, result], axis=0)
    print(merged)
The main difference is that in the map case, the program first waits for all the tasks to finish and then concatenates all the DataFrames.
In the imap_unordered case, as soon as a task has finished, its DataFrame is concatenated to the current results.