Dask ML won't connect to remote cluster - dask-ml

I've connected to my remote cluster via Client, and now I'm trying to use Dask-ML:
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib
import dask_ml.joblib
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
with joblib.parallel_backend('dask', scatter=[X, y]):
    clf.fit(X, y)
Error 1) There is no dask_ml.joblib -- I get a module-not-found error.
Error 2) If I remove this import, I get a streaming connection closed error.
I'm not seeing any good documentation on this. Any ideas on how to get Dask-ML to work with a remote cluster?

Error 1
dask_ml.joblib has been removed. You just need to create a Client and use joblib.parallel_backend now.
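A minimal sketch of that setup (the scheduler address is a placeholder, and X and y are assumed to exist already):
from dask.distributed import Client
from sklearn.ensemble import RandomForestClassifier
import joblib  # sklearn.externals.joblib is deprecated; plain joblib works with the dask backend

client = Client("tcp://scheduler-address:8786")  # placeholder address of the remote scheduler

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
with joblib.parallel_backend('dask', scatter=[X, y]):
    clf.fit(X, y)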
Error 2
Might be a spill-to-disk issue. Try reducing your dataframe size and check if you still get this issue.
I know you might have already solved your problem, but this answer might help other people.

Related

Using PySpark instead of the multiprocessing module - How can I translate my code?

I am currently using the multiprocessing module to parallelize iterations, as shown in the example below. The problem is that this way I am only using one worker and its cores, not all the workers available. I'm also not able to parallelize experiments (I'm running several experiments, with several iterations for each experiment).
This code is taking too long to run, and my understanding is that the runtime could be greatly reduced using PySpark. My Spark knowledge is very limited and I don't know how to translate this code to use it with Spark.
All the functions and classes used here are written in pure Python (numpy and pandas).
import concurrent.futures
import multiprocessing as mp

def process_simulation(experiment):
    number_of_workers = mp.cpu_count()
    with concurrent.futures.ProcessPoolExecutor(max_workers=number_of_workers) as executor:
        # pass the function itself to map, rather than the result of calling it
        results = list(executor.map(Simulation.simulation_steps, iterations_generator()))
    experiment.simulations = []
    for result in results:
        experiment.simulations.append(result)
For context, Experiment and Simulation are classes (there's no inheritance). One experiment needs multiple simulations to be completed.
Thank you!
You can use Fugue to bring this type of logic to PySpark with a minimal wrapper. The only thing is you need to start with a DataFrame of inputs, and then you can do something like:
from fugue import transform
transform(input_df, Simulation.simulation_steps, schema=<your output schema here>, partition={"how": "per_row"}, engine="spark")
I can always help more with getting it into this shape if I have more details about what the logic is. It may just need one wrapper function (contact info in bio).
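For illustration, a hypothetical wrapper could look like the sketch below, assuming Simulation.simulation_steps can take the parameters of a single input row and return something serializable (run_one, the row-to-dict handling, and the "result:double" schema are made up here):
import pandas as pd
from fugue import transform

def run_one(df: pd.DataFrame) -> pd.DataFrame:
    # With per-row partitioning, df holds a single row of simulation inputs
    params = df.iloc[0].to_dict()
    result = Simulation.simulation_steps(params)  # your existing logic
    return pd.DataFrame([{"result": result}])

# Reuse the transform call from above, pointing it at the wrapper
out = transform(input_df, run_one, schema="result:double",
                partition={"how": "per_row"}, engine="spark")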

Getting "error while emitting; method too large" while creating dataframe in databricks

I am creating a DataFrame in a function and returning it:
def getDataFrame(rdd: RDD[MyCaseClass]) = {
  spark.createDataFrame(rdd)
}
The file in which this function is defined compiles without any error, but running it from another file throws this error:
%run "./Load_Dataframe"
The execution of this command did not finish successfully
import java.util.Properties
import org.apache.spark.rdd.RDD
defined class MyCaseClass
error: Error while emitting $$$cbf4485eb7852af86a790a85973a466$$$$w$STHierarchy$$typecreator1$1
Method too large: $$$cbf4485eb7852af86a790a85973a466$$$$w$STHierarchy$$typecreator1$1.apply (Lscala/reflect/api/Mirror;)Lscala/reflect/api/Types$TypeApi;
Most of the solutions online say to divide the function into multiple smaller functions, but my function has only a single line of code, so I'm not sure how I can divide it further.
Try to restart your cluster.
This was resolved after restarting the cluster.
I asked this question. Below are the steps I took to resolve the issue. I have no idea about the root cause; I just know that these workarounds work:
Detach and reattach the cluster. This resolves the issue 80% of the time.
Restart the cluster and then try again.
If there is a very big class in your file, it would be better to break it into multiple classes.
If there is a huge function in your file, break it into multiple smaller functions.
In my case I had a huge class which I broke into two classes, and that resolved my issue.

Numba RuntimeError only when directly running on databricks notebook

I am trying to understand the source of a runtime error when I run the following Python function in a Databricks notebook vs. importing and invoking it from a module.
Running directly in a Databricks notebook
def test_numba_func():
    from numba import jit

    @jit(cache=True)
    def test():
        return .5 ** 2 / 4.0
    print(test())
Invoking this function does not work:
test_numba_func()
RuntimeError: cannot cache function 'test_numba_func.<locals>.test': no locator available for file ''
However, if I create a module, say databricks_test.py, with the same function, then the following import works without any issues.
Module import
import databricks_test
databricks_test.test_numba_func()
Databricks notebook
I am able to run it directly in Colab though.
I think this is due to some permissions issue. How can I fix my code to make it work on Databricks?
The difference is that notebooks on Databricks aren't real files - they are kept in memory while you're using them and persisted to something like a database on the fly. This is different from using a module, which is a real file, or Google Colab notebooks, which are also files on disk.
Theoretically you can try to set the environment variable NUMBA_CACHE_DIR to something like /tmp/cache (docs), but I'm not sure that it will work. Also, it will only work until the cluster is terminated.
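A minimal sketch of that workaround (the /tmp/numba_cache path is arbitrary, and the variable needs to be set before Numba compiles anything):
import os
os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"  # arbitrary writable path

from numba import jit

@jit(cache=True)
def test():
    return .5 ** 2 / 4.0

print(test())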

Understanding Dask's Task Stream

I'm running dask locally using the distributed scheduler on my machine with 8 cores. On initialization I see:
That looks correct, but I'm confused by the task stream in the diagnostics (shown below):
I was expecting 8 rows corresponding to the 8 workers/cores; is that incorrect?
Thanks
AJ
I've added the code I'm running:
import dask.dataframe as dd
from dask.distributed import Client, progress

client = Client()
progress(client)

# load datasets
trd = (dd.read_csv('trade_201811*.csv', compression='gzip',
                   blocksize=None, dtype={'Notional': 'float64'})
       .assign(timestamp=lambda x: dd.to_datetime(x.timestamp.str.replace('D', 'T')))
       .set_index('timestamp', sorted=True))
Each line corresponds to a single thread. Some more sophisticated Dask operations will start up additional threads; this happens particularly when tasks launch other tasks, which is especially common in machine learning workloads.
My guess is that you're using one of the following approaches:
dask.distributed.get_client or dask.distributed.worker_client
Scikit-Learn's Joblib
Dask-ML
If so, the behavior that you're seeing is normal. The task stream plot will look a little odd, yes, but hopefully it is still interpretable.
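For illustration, here is a minimal sketch (not taken from your code) of tasks launching other tasks with worker_client, which secedes from the worker's thread pool and can therefore add extra rows to the task stream:
from dask.distributed import Client, worker_client

def inner(x):
    return x + 1

def outer(n):
    # Launch more tasks from inside a task
    with worker_client() as client:
        futures = client.map(inner, range(n))
        return sum(client.gather(futures))

client = Client()
print(client.submit(outer, 5).result())  # 15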

Fetch data from a large azure table and how to avoid timeout error?

I am trying to fetch data from a large Azure Table, and after a few hours I run into the following error:
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
The following is my code:
from azure.storage import TableService, Entity
from azure import *
import json
from datetime import datetime as dt
from datetime import timezone, timedelta

ts = TableService(account_name='dev', account_key='key')

i = 0
next_pk = None
next_rk = None
N = 10
date_N_days_ago = dt.now(timezone.utc) - timedelta(days=N)

while True:
    entities = ts.query_entities('Events', next_partition_key=next_pk, next_row_key=next_rk, top=1000)
    i += 1
    with open('blobdata', 'a') as fil:
        for entity in entities:
            if entity.Timestamp > date_N_days_ago:
                fil.write(str(entity.DetailsJSON) + '\n')
    with open('1k_data', 'a') as fil2:
        if i % 5000 == 0:
            fil2.write('{}|{}|{}|{}'.format(i, entity.PartitionKey, entity.Timestamp, entity.DetailsJSON + '\n'))
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break
Also, if someone has an idea of how to achieve this in a better fashion, please do tell, as the table is very large and the code is taking too long to process.
This exception can happen in all sorts of network calls on occasion. It should be entirely transient. I would recommend simply catching the error, waiting a little bit, and trying again.
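A minimal retry sketch around the query call from the question (the attempt count and backoff are arbitrary):
import time

def query_with_retry(ts, next_pk, next_rk, attempts=5):
    for attempt in range(attempts):
        try:
            return ts.query_entities('Events', next_partition_key=next_pk,
                                     next_row_key=next_rk, top=1000)
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff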
The Azure Storage Python library recently moved, and we will be doing a ton of improvements on it in the coming months, including built-in retry policies. So in the future the library itself will retry these sorts of errors for you.
In general if you want to make this faster you could try adding some multithreading to the processing of your entities. Even parallelizing writing to the two different files could really help.
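As a hypothetical sketch, a single worker thread could write out the previous page while the next page is being fetched (this reuses ts and date_N_days_ago from the question and keeps only the 'blobdata' writes):
from concurrent.futures import ThreadPoolExecutor

def process_page(entities, cutoff):
    # Same filtering/writing logic as in the question, for one page of results
    with open('blobdata', 'a') as fil:
        for entity in entities:
            if entity.Timestamp > cutoff:
                fil.write(str(entity.DetailsJSON) + '\n')

with ThreadPoolExecutor(max_workers=1) as executor:
    pending = None
    next_pk = next_rk = None
    while True:
        entities = ts.query_entities('Events', next_partition_key=next_pk,
                                     next_row_key=next_rk, top=1000)
        if pending is not None:
            pending.result()  # finish writing the previous page first
        pending = executor.submit(process_page, list(entities), date_N_days_ago)
        if hasattr(entities, 'x_ms_continuation'):
            cont = entities.x_ms_continuation
            next_pk, next_rk = cont['nextpartitionkey'], cont['nextrowkey']
        else:
            break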
