Numba RuntimeError only when running directly in a Databricks notebook

I am trying to understand the source of a runtime error that occurs when I run the following Python function in a Databricks notebook, but not when I import and invoke it from a module.
Running directly in a Databricks notebook
def test_numba_func():
    from numba import jit

    @jit(cache=True)
    def test():
        return .5 ** 2 / 4.0

    print(test())
Invoking this function does not work:
test_numba_func()
RuntimeError: cannot cache function 'test_numba_func.<locals>.test': no locator available for file ''
However, if I create a module, say databricks_test.py, with the same function, then the following import works without any issues.
Module import
import databricks_test
databricks_test.test_numba_func()
Databricks notebook
I am able to run it directly in Colab, though.
I think this is due to some permissions issue. How can I fix my code to make it work on Databricks?

The difference is that notebooks on Databricks aren't real files: they are kept in memory while you're using them and persisted to something like a database on the fly. This is different from using a module, which is a real file, or Google Colab notebooks, which are also files on disk.
Theoretically you can try to set the environment variable NUMBA_CACHE_DIR to something like /tmp/cache (docs), but I'm not sure that it will work. Also, it will only work until the cluster is terminated.
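For reference, a minimal sketch of that idea (the cache directory is the arbitrary path from above, and the variable has to be set before Numba compiles the cached function):
import os

# Point Numba's on-disk cache at a writable directory; /tmp/cache is an arbitrary choice.
os.environ["NUMBA_CACHE_DIR"] = "/tmp/cache"

from numba import jit

@jit(cache=True)
def test():
    return .5 ** 2 / 4.0

print(test())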

Related

Using PySpark instead of the multiprocessing module - How can I translate my code?

I am currently using the multiprocessing module to parallelize iterations, as shown in this example. The thing is that this way I would be using only one worker and its cores, not all the workers available. I'm also not able to parallelize experiments (I'm running several experiments, with several iterations per experiment).
This code is taking too long to run, and my understanding is that the runtime could be greatly reduced using PySpark. My Spark knowledge is very limited, and I don't know how to translate this code in order to use it with Spark.
All the functions and classes used here are written in pure Python (numpy and pandas).
import concurrent.futures
import multiprocessing as mp

def process_simulation(experiment):
    number_of_workers = mp.cpu_count()
    with concurrent.futures.ProcessPoolExecutor(max_workers=number_of_workers) as executor:
        # map the simulation function over the generated iterations on a pool of processes
        results = list(executor.map(Simulation.simulation_steps, iterations_generator()))
    experiment.simulations = []
    for result in results:
        experiment.simulations.append(result)
For context, Experiment and Simulation are classes (there's no inheritance). One experiment needs multiple simulations to be completed.
Thank you!
You can use Fugue to bring this type of logic to PySpark with a minimal wrapper. The only thing is you need to start with a DataFrame of inputs, and then you can do something like:
from fugue import transform
transform(input_df, Simulation.simulation_steps, schema=<your output schema here>, partition={"how": "per_row"}, engine="spark")
I can always help more with getting it into this shape if I have more details on what the logic is. It may just need one wrapper function (contact info in bio).
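For illustration, a rough sketch of what that single wrapper function could look like; Simulation.simulation_steps, input_df, the per-row input columns, and the output schema are all assumptions about the asker's code:
import pandas as pd
from fugue import transform

# Hypothetical wrapper: with the per_row partitioning above, Fugue calls this
# once per input row, passing a one-row pandas DataFrame and expecting one back.
def run_simulations(df: pd.DataFrame) -> pd.DataFrame:
    out = []
    for row in df.to_dict("records"):
        result = Simulation.simulation_steps(row)  # assumed to return a dict of outputs
        out.append(result)
    return pd.DataFrame(out)

result_df = transform(
    input_df,                                  # one row per simulation to run
    run_simulations,
    schema="experiment_id:int,result:double",  # replace with your output schema
    partition={"how": "per_row"},
    engine="spark",
)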

Calling multiple notebooks from another notebook in Databricks

I have a total of 5 notebooks.
The first is the main class notebook. The remaining four are sub/child notebooks.
Let the names of the notebooks be (all are in Scala):
mainclass,
child1,
child2,
child3,
child4
I want to call the child notebooks from the main class notebook based on IF conditions and execute them concurrently/in parallel.
for example:
In the main class:
var child1="Y"
var child2="Y"
var child3="N"
var child4="N"
I want to call the notebooks whose flag is "Y" and run them concurrently.
if (child1 == "Y")
and the same for all the other notebooks.
Kindly suggest a way to do this.
Thanks!
Calling a notebook from within a notebook will not result in the concurrent run you desire.
Since you are on Azure, you should look at Azure Data Factory.
You can build a pipeline based on the parameters you supply to control the flow of execution for each of the notebooks, along with scheduling and other utilities provided within ADF.

Getting "error while emitting; method too large" while creating dataframe in databricks

I am creating a DataFrame in a function and returning that DataFrame:
def getDataFrame(rdd: RDD[MyCaseClass]) = {
  spark.createDataFrame(rdd)
}
The file in which this function is defined compiles without any errors, but running it from another file throws an error:
%run "./Load_Dataframe"
The execution of this command did not finish successfully
import java.util.Properties
import org.apache.spark.rdd.RDD
defined class MyCaseClass
error: Error while emitting $$$cbf4485eb7852af86a790a85973a466$$$$w$STHierarchy$$typecreator1$1
Method too large: $$$cbf4485eb7852af86a790a85973a466$$$$w$STHierarchy$$typecreator1$1.apply (Lscala/reflect/api/Mirror;)Lscala/reflect/api/Types$TypeApi;
Most of the solutions online tell you to divide the function into multiple smaller functions, but my function has only a single line of code, so I am not sure how to divide it further.
Try restarting your cluster; this resolved itself for me after a cluster restart.
I asked this question. Below are the steps I took to resolve the issue. I have no idea about the root cause; I just know that these workarounds work:
Detach and reattach the cluster. This resolves the issue 80% of the time.
Restart the cluster and then try again.
If there is a very big class in your file, it is better to break it into multiple classes.
If there is a huge function in your file, break that function into multiple smaller functions.
In my case I had a huge class, which I broke into two classes, and that resolved my issue.

Dask ML won't connect to remote cluster

I've connected to my remote cluster via Client, and now I'm trying to use Dask-ML:
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib
#import dask_ml.joblib

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
with joblib.parallel_backend('dask', scatter=[X, y]):
    clf.fit(X, y)
Error 1) there is no dask_ml.joblib; I get a "module does not exist" error.
Error 2) if I remove this import, I get a "streaming connection closed" error.
I'm not seeing any good documentation on this. Any ideas on how to get Dask-ML to work with a remote cluster?
Error 1
dask_ml.joblib has been removed. You just need to create a Client and use joblib.parallel_backend now.
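For Error 1, a minimal sketch of that pattern, assuming a remote scheduler is reachable (the scheduler address and the toy dataset are placeholders; the standalone joblib package replaces the removed sklearn.externals.joblib):
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder address: point this at your remote Dask scheduler.
client = Client("tcp://scheduler-address:8786")

# Stand-in data so the sketch is self-contained.
X, y = make_classification(n_samples=1000, n_features=20)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
with joblib.parallel_backend("dask", scatter=[X, y]):
    # Training tasks are dispatched to the Dask workers instead of local processes.
    clf.fit(X, y)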
Error 2
Might be a spill-to-disk issue. Try reducing your dataframe size and check if you still get this issue.
I know you might have already solved your problem, but this answer might help other people.

Fetching data from a large Azure Table and how to avoid a timeout error?

I am trying to fetch data from a large Azure Table, and after a few hours it runs into the following error:
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
The following is my code:
from azure.storage import TableService, Entity
from azure import *
import json
from datetime import datetime as dt
from datetime import timezone, timedelta

ts = TableService(account_name='dev', account_key='key')

i = 0
next_pk = None
next_rk = None
N = 10
date_N_days_ago = dt.now(timezone.utc) - timedelta(days=N)

while True:
    # fetch one page of up to 1000 entities, starting at the last continuation token
    entities = ts.query_entities('Events', next_partition_key=next_pk, next_row_key=next_rk, top=1000)
    i += 1
    with open('blobdata', 'a') as fil:
        for entity in entities:
            if entity.Timestamp > date_N_days_ago:
                fil.write(str(entity.DetailsJSON) + '\n')
    with open('1k_data', 'a') as fil2:
        if i % 5000 == 0:
            fil2.write('{}|{}|{}|{}'.format(i, entity.PartitionKey, entity.Timestamp, entity.DetailsJSON + '\n'))
    # follow the continuation token to the next page, or stop when there is none
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break
Also, if someone has an idea of how to achieve this in a better fashion, please do tell, as the table is very large and the code is taking too long to process.
This exception can happen in all sorts of network calls on occasion. It should be entirely transient. I would recommend simply catching the error, waiting a little bit, and trying again.
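For example, a simple catch-and-retry wrapper around the page query might look like this (the retry count and wait times are arbitrary choices, and the exact exception type may differ depending on how the library surfaces the failure):
import time

def query_with_retry(ts, table, next_pk, next_rk, retries=5, wait_seconds=5):
    # Retry the page query a few times, waiting a little longer after each failure.
    for attempt in range(retries):
        try:
            return ts.query_entities(table, next_partition_key=next_pk,
                                     next_row_key=next_rk, top=1000)
        except ConnectionRefusedError:
            if attempt == retries - 1:
                raise
            time.sleep(wait_seconds * (attempt + 1))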
The Azure Storage Python library recently moved, and we will be doing a ton of improvements on it in the coming months, including built-in retry policies. So, in the future, the library itself will retry these sorts of errors for you.
In general, if you want to make this faster, you could try adding some multithreading to the processing of your entities. Even parallelizing the writes to the two different files could really help.
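As a rough sketch of that idea, a single background writer thread lets the next page be fetched while the previous page is written out (helper and variable names are made up):
from concurrent.futures import ThreadPoolExecutor

def write_page(rows):
    # Append one page of entities to the output file.
    with open('blobdata', 'a') as fil:
        for entity in rows:
            fil.write(str(entity.DetailsJSON) + '\n')

writer = ThreadPoolExecutor(max_workers=1)  # a single worker keeps the appends in order
pending = []
# Inside the paging loop, instead of writing inline:
#     pending.append(writer.submit(write_page, list(entities)))
# After the loop:
#     for f in pending:
#         f.result()        # surface any write errors
#     writer.shutdown()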
