Computing dask delayed objects stored in dataframe - python-3.x

I am looking for the best way to compute many dask delayed objects stored in a dataframe. I am unsure whether the pandas dataframe should be converted to a dask dataframe with the delayed objects inside it, or whether compute should be called on all values of the pandas dataframe.
I would appreciate any general suggestions, as I am having trouble with the logic of passing delayed objects across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute
steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()
for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells like so: enr_df.applymap(compute) (which I believe calls compute on each value individually).
However if I convert to a dask dataframe the delayed objects I want to compute are layered in the dask dataframe structure:
import dask.dataframe as dd
enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
The computation I expect then does not take place.

You can pass a list of delayed objects into dask.compute:
results = dask.compute(*list_of_delayed_objects)
So you need to get a list from your Pandas dataframe. This is something you can do with normal Python code.
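As a minimal sketch of that advice, one option is to keep the delayed objects in plain Python containers, compute them in a single call, and build the DataFrame only from the finished values. The function `scaled` and the column values below are hypothetical stand-ins for the `hypergeom.sf` calls in the question.

```python
import pandas as pd
from dask import delayed, compute


def scaled(k, n):
    # Hypothetical stand-in for an expensive call such as hypergeom.sf
    return k * n


# Collect the delayed objects per column in a plain dict of lists
delayed_columns = {
    n: [delayed(scaled)(k, n) for k in range(1, 5)]
    for n in (5, 50, 100)
}

# compute() traverses built-in containers and returns one result per argument
(computed_columns,) = compute(delayed_columns)

# Build the DataFrame only after the values are concrete numbers
enr_df = pd.DataFrame(computed_columns)
print(enr_df)
```

Because everything goes through one compute call, the scheduler sees all tasks at once instead of one cell at a time.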

Related

Dask Dataframe groupby and aggregate for column

I had a pd.DataFrame that I converted to Dask.DataFrame for faster computations.
My requirement is that I have to find out the 'Total Views' of a channel.
In pandas it would be df.groupby(['ChannelTitle'])['VideoViewCount'].sum(), but in dask the column's dtype is object, so groupby treats the values as strings rather than ints (see image 2).
To handle this, I added two columns that separate the figure (e.g. 115) and the multiplier exponent (6 for M, 3 for K) of the views, hoping to do an operation like ddf['new_views_f'] * (10**ddf['new_views_m']), but I cannot find a mul for two columns in dask.
Either I am missing something or I am complicating the requirement.
It does sound like you are complicating the requirement. For column multiplication, the regular pandas syntax will work (df['c'] = df['a'] * df['b']). In your case, it's possible to use pd.eval to get the actual numeric value for views:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import random
df = pd.DataFrame(15*np.random.rand(15), columns=['views'])
df['views'] = df['views'].round(2).astype('str') + [random.choice(['K views', 'M views']) for _ in range(len(df))]
df['group'] = [random.choice([1,2,3]) for _ in range(len(df))]
ddf = dd.from_pandas(df, npartitions=2)
ddf['views_digits'] = ddf['views'].replace({'K views': '*1e3', 'M views': '*1e6'}, regex=True).map(pd.eval, meta=ddf['group'])
aggregate_df = ddf.groupby(['group']).agg({'views_digits': 'sum'}).compute()
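As an alternative to evaluating strings with pd.eval, the suffix can be parsed explicitly. The sketch below uses plain pandas and made-up sample data, and assumes the strings always look like a number followed by an optional K or M. Per the answer above, the column multiplication itself uses the same syntax in dask.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's view counts
df = pd.DataFrame({
    "views": ["1.5K views", "2M views", "300 views", "0.5M views"],
    "group": [1, 2, 1, 2],
})

# Split each string into its numeric part and an optional K/M suffix
parts = df["views"].str.extract(r"^([\d.]+)([KM]?)")
multiplier = parts[1].fillna("").map({"": 1.0, "K": 1e3, "M": 1e6})

# Ordinary column arithmetic gives the numeric view count
df["views_digits"] = parts[0].astype(float) * multiplier

totals = df.groupby("group")["views_digits"].sum()
print(totals)
```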

Pandas importing DataFrame Vs whole library - Speed and memory concern

I have a helper function that uses the pandas DataFrame multiple times, but I don't need any other pandas functions.
Is it better to import the whole library and call pd.DataFrame to keep the code consistent, or to import just DataFrame?
My function will be called over 100,000 times and return a dictionary.
import pandas as pd
temp_df = pd.DataFrame()
VS.
from pandas import DataFrame
temp_df = DataFrame()

numpy reading a csv file to an numpy array

I am new to Python and am using numpy to read a csv into an array. I tried two methods:
Approach 1
train = np.asarray(np.genfromtxt(open("/Users/mac/train.csv","rb"),delimiter=","))
Approach 2
with open('/Users/mac/train.csv') as csvfile:
    rows = csv.reader(csvfile)
    for row in rows:
        # np.int was removed in NumPy 1.24; the builtin int is equivalent here
        newrow = np.array(row).astype(int)
        train.append(newrow)
I am not sure what the difference between these two approaches is. Which one is recommended?
I am not concerned which is faster since my data size is small but instead concerned more about differences in the resulting data type.
You can also use pandas, which is simpler to use.
import pandas as pd
import numpy as np
dataset = pd.read_csv('file.csv')
# get all headers in csv
values = list(dataset.columns.values)
# get the labels, assuming the last column of the csv holds the labels
y = dataset[values[-1:]]
y = np.array(y, dtype='float32')
X = dataset[values[0:-1]]
X = np.array(X, dtype='float32')
So what is the difference in the result?
genfromtxt is the numpy csv reader. It returns an array. No need for an extra asarray.
The second expression is incomplete, but it looks like it would produce a list of arrays, one for each line of the file. It uses the generic Python csv reader, which doesn't do much other than read a line and split it into strings.
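To make the difference in the results concrete, here is a small sketch that runs both approaches on an in-memory CSV. The sample text is hypothetical and stands in for train.csv.

```python
import csv
import io

import numpy as np

# Hypothetical in-memory CSV standing in for train.csv
text = "1,2,3\n4,5,6\n"

# Approach 1: genfromtxt parses the numbers and returns one 2-D float array
arr = np.genfromtxt(io.StringIO(text), delimiter=",")

# Approach 2: csv.reader yields lists of strings; converting each row
# by hand gives a Python list of 1-D integer arrays
rows = [np.array(row).astype(int) for row in csv.reader(io.StringIO(text))]

print(type(arr), arr.dtype, arr.shape)
print(type(rows), rows[0].dtype, len(rows))
```

The list from the second approach can be stacked with np.vstack(rows) if a single 2-D array is needed.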

Spark matrix multiplication code takes a lot of time to execute

I have a simple PySpark environment set up using findspark.init() on Spyder, and I'm running the code on localhost. I am confused as to how a simple matrix multiplication can take hours using BlockMatrix in Spark, whereas the same code takes a few minutes to run with numpy.
Here's the code I'm using:
import numpy as np
import pandas as pd
from sklearn import cross_validation as cv
import itertools
import random
import findspark
import time
start=time.time()
findspark.init()
from pyspark.mllib.linalg.distributed import *
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('myapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
from pyspark.mllib.linalg.distributed import *
def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)
def prediction(P,Q):
    # np.r_[ pp,np.zeros(len(pp)) ].reshape(2,20)
    Pn=np.r_[ P,np.zeros(len(P)),np.zeros(len(P)),np.zeros(len(P)),np.zeros(len(P)) ].reshape(5,len(P))
    Qn=np.r_[ Q,np.zeros(len(Q)),np.zeros(len(Q)),np.zeros(len(Q)),np.zeros(len(Q)) ].reshape(5,len(Q))
    A = Pn[:1]
    B = Qn[:1].T
    distP = sc.parallelize(A)
    distQ = sc.parallelize(B)
    mat=as_block_matrix(distP).multiply(as_block_matrix(distQ))
    blocksRDD = mat.blocks
    m=(list(blocksRDD.collect())[0][1])
    #print(m)
    return m.toArray()[0,0]
for epoch in range(1):
    for u, i in zip(users,items):
        e = R[u, i] - prediction(P[:,u],Q[:,i])
Not knowing the size of your matrices makes it more difficult to answer this question, but if you are working with high dimensional sparse matrices, one possible issue is inherent to the way pyspark does matrix multiplication. In order to multiply sparse matrices, pyspark converts the sparse matrices to dense matrices. This is noted in the documentation:
http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix.multiply
which states that:
multiply(other) Left multiplies this BlockMatrix by other, another BlockMatrix. The colsPerBlock of this matrix must equal the rowsPerBlock of other. If other contains any SparseMatrix blocks, they will have to be converted to DenseMatrix blocks. The output BlockMatrix will only consist of DenseMatrix blocks. This may cause some performance issues until support for multiplying two sparse matrices is added.
As far as I know, there isn't a good workaround for this if you intend to use the built-in matrix data types. One fix is to abandon the matrix data types and hand-roll your own matrix multiplication using rdd or dataframe join operations. For example, if you can use dataframes, with each matrix stored as one (row, column, value) record per entry, the following has been tested and works fairly well at scale:
from pyspark.sql.functions import sum
def multiply_df_matrices(A,B):
    return A.join(B,A['column']==B['row'])\
        .groupBy(A['row'],B['column'])\
        .agg(sum(A['value']*B['value']).alias('value'))
You can do something similar by joining two rdds.
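The join-and-aggregate idea can be checked on a tiny example without a Spark cluster. The sketch below uses pandas instead of Spark DataFrames, which is an assumption made here purely for illustration. The helper names `to_coordinates` and `multiply_coordinate_frames` are hypothetical.

```python
import numpy as np
import pandas as pd


def to_coordinates(matrix):
    # One (row, column, value) record per nonzero entry
    rows, cols = np.nonzero(matrix)
    return pd.DataFrame({"row": rows, "column": cols, "value": matrix[rows, cols]})


def multiply_coordinate_frames(a, b):
    # Pair every A[i, k] with every B[k, j], then sum the products per (i, j)
    joined = a.merge(b, left_on="column", right_on="row", suffixes=("_a", "_b"))
    joined["value"] = joined["value_a"] * joined["value_b"]
    summed = joined.groupby(["row_a", "column_b"], as_index=False)["value"].sum()
    return summed.rename(columns={"row_a": "row", "column_b": "column"})


A = np.array([[1.0, 0.0], [2.0, 3.0]])
B = np.array([[0.0, 4.0], [5.0, 0.0]])
product = multiply_coordinate_frames(to_coordinates(A), to_coordinates(B))

# Rebuild a dense array so the result can be compared with NumPy's
dense = np.zeros((2, 2))
dense[product["row"].to_numpy(), product["column"].to_numpy()] = product["value"].to_numpy()
print(dense)
```

Only nonzero entries take part in the join, which is why this layout suits sparse matrices.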

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing and I have been looking into Dask but I am having trouble actually coding it.
I have had a look at their examples and documentation, and I think dask.delayed will work best. I attempted to wrap my functions with delayed(function_name), or to add an @delayed decorator, but I can't seem to get it working properly. I preferred Dask over other methods since it is written in Python and for its (supposed) simplicity. I know dask doesn't work on the for loop itself, but they say it can work inside a loop.
My code passes files through a function that contains inputs to other functions and looks like this:
from dask import delayed
filenames = ['1.csv', '2.csv', '3.csv', etc. etc. ]
for count, name in enumerate(filenames):
    name = name.split('.')[0]
    ....
then do some pre-processing ex:
preprocess1, preprocess2 = delayed(read_files_and_do_some_stuff)(name)
then I call a constructor and pass the pre_results in to the function calls:
fc = FunctionCalls()
Daily = delayed(fc.function_runs)(filename=name, stringinput='Daily',
input_data=pre_result1, model1=pre_result2)
What I do here is pass each file into the for loop, do some pre-processing, and then pass the file into two models.
Any thoughts or tips on how to parallelize this? I began getting odd errors and had no idea how to fix the code. The code does work as is. I use a bunch of pandas dataframes, series, and numpy arrays, and I would prefer not to go back and change everything to work with dask.dataframes etc.
The code in my comment may be difficult to read. Here it is in a more formatted way.
In the code below, when I type print(mean_squared_error) I just get: Delayed('mean_squared_error-3009ec00-7ff5-4865-8338-1fec3f9ed138')
from dask import delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = ['file1.csv']
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = delayed(mse)(observed, prediction)
You need to call dask.compute to eventually compute the result. See dask.delayed documentation.
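A minimal sketch of the difference between a delayed object and its computed value, using a hypothetical `add` function:

```python
from dask import delayed, compute


def add(a, b):
    # Hypothetical stand-in for real work such as mean_squared_error
    return a + b


# Calling a delayed function only records the task; nothing runs yet
lazy_sum = delayed(add)(1, 2)
print(lazy_sum)  # prints a Delayed placeholder, not a number

# compute() runs the task graph and returns a tuple with one entry per argument
(result,) = compute(lazy_sum)

# The .compute() method on a single delayed object returns the value directly
also_result = lazy_sum.compute()
print(result, also_result)
```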
Sequential code
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
results = []
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1) # isn't this already a dataframe?
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = mse(observed, prediction)
    results.append(mean_squared_error)
Parallel code
import dask
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
delayed_results = []
for count, name in enumerate(filenames):
    df = dask.delayed(pd.read_csv)(name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = dask.delayed(mse)(observed, prediction)
    delayed_results.append(mean_squared_error)
results = dask.compute(*delayed_results)
A much clearer solution, IMO, than the accepted answer is this snippet.
from dask import compute, delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
def compute_mse(file_name):
    df = pd.read_csv(file_name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    return mse(observed, prediction)
delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]
mean_squared_errors = compute(*delayed_results, scheduler="processes")
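One detail from the question is worth a separate sketch: it unpacks the result of a delayed call into two names. As far as I know, a delayed object has no known length by default, so tuple unpacking needs the `nout` argument of `delayed`. The function `split_name` below is a hypothetical stand-in for the question's pre-processing step.

```python
from dask import delayed, compute


def split_name(filename):
    # Hypothetical pre-processing step that returns two values
    stem, extension = filename.rsplit(".", 1)
    return stem, extension


# nout=2 tells dask the call returns two values, so it can be unpacked lazily
stem, extension = delayed(split_name, nout=2)("1.csv")

# Both pieces are still delayed objects until compute() is called
results = compute(stem, extension)
print(results)
```

Each unpacked piece can then be passed on to further delayed calls, the way the question passes its pre-processing results into the model functions.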
