Spark matrix multiplication code takes a lot of time to execute - apache-spark

I have a simple PySpark environment set up using findspark.init() in Spyder, and I'm running the code on localhost. I am confused as to how a simple matrix multiplication can take hours using BlockMatrix in Spark, whereas the same code takes a few minutes to run with NumPy.
Here's the code I'm using:
import numpy as np
import pandas as pd
from sklearn import cross_validation as cv
import itertools
import random
import findspark
import time
start=time.time()
findspark.init()
from pyspark.mllib.linalg.distributed import *
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('myapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)
def prediction(P, Q):
    # np.r_[ pp,np.zeros(len(pp)) ].reshape(2,20)
    Pn = np.r_[P, np.zeros(len(P)), np.zeros(len(P)), np.zeros(len(P)), np.zeros(len(P))].reshape(5, len(P))
    Qn = np.r_[Q, np.zeros(len(Q)), np.zeros(len(Q)), np.zeros(len(Q)), np.zeros(len(Q))].reshape(5, len(Q))
    A = Pn[:1]
    B = Qn[:1].T
    distP = sc.parallelize(A)
    distQ = sc.parallelize(B)
    mat = as_block_matrix(distP).multiply(as_block_matrix(distQ))
    blocksRDD = mat.blocks
    m = (list(blocksRDD.collect())[0][1])
    #print(m)
    return m.toArray()[0, 0]
for epoch in range(1):
    for u, i in zip(users, items):
        e = R[u, i] - prediction(P[:, u], Q[:, i])

Not knowing the size of your matrices makes this harder to answer, but if you are working with high-dimensional sparse matrices, one possible issue is inherent to the way PySpark does matrix multiplication: in order to multiply sparse matrices, it converts them to dense matrices. This is noted in the documentation:
http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix.multiply
which states that:
multiply(other) Left multiplies this BlockMatrix by other, another BlockMatrix. The colsPerBlock of this matrix must equal the rowsPerBlock of other. If other contains any SparseMatrix blocks, they will have to be converted to DenseMatrix blocks. The output BlockMatrix will only consist of DenseMatrix blocks. This may cause some performance issues until support for multiplying two sparse matrices is added.
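To get a feel for why densifying hurts, the rough arithmetic below compares the storage of one block held densely with the same block held in compressed sparse column (CSC) form. It is a back-of-the-envelope sketch, not Spark code: the function names are made up for illustration, and it assumes 8-byte values and 4-byte indices and ignores any object overhead.

```python
def dense_block_bytes(rows, cols, bytes_per_value=8):
    """Bytes needed to store every entry of a rows x cols block."""
    return rows * cols * bytes_per_value


def csc_block_bytes(cols, nnz, bytes_per_value=8, bytes_per_index=4):
    """Bytes for a CSC block: nnz values, nnz row indices, cols + 1 column pointers."""
    return nnz * (bytes_per_value + bytes_per_index) + (cols + 1) * bytes_per_index


# One 1024 x 1024 block (the block size used in the question) with 1,000 non-zeros.
dense = dense_block_bytes(1024, 1024)
sparse = csc_block_bytes(1024, 1000)
print(dense, sparse, round(dense / sparse))
```

Under these assumptions the dense block is several hundred times larger than the sparse one, and every such block has to be materialized and shuffled during the multiplication.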
As far as I know, there isn't a good workaround for this if you intend to use the built-in matrix data types. One fix is to abandon the matrix data types and hand-roll your own matrix multiplication using RDD or DataFrame join operations. For example, if you can use DataFrames, the following has been tested and works fairly well at scale:
from pyspark.sql.functions import sum
def multiply_df_matrices(A, B):
    # A and B hold one matrix entry per row, in columns 'row', 'column' and 'value'
    return A.join(B, A['column'] == B['row'])\
        .groupBy(A['row'], B['column'])\
        .agg(sum(A['value'] * B['value']).alias('value'))
You can do something similar by joining two RDDs.
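The join-and-aggregate idea does not depend on Spark. As a minimal sketch, the plain-Python function below multiplies two matrices stored as (row, column, value) triples in the same way: match A's column index with B's row index, multiply the paired values, and sum per output cell. The function name and the triple layout are assumptions for illustration; with RDDs the same steps would map onto a keyed join followed by a reduce by key.

```python
from collections import defaultdict


def multiply_coo(a_entries, b_entries):
    """Multiply two sparse matrices given as iterables of (row, col, value)."""
    # Index B by its row so each A entry can find its join partners.
    b_by_row = defaultdict(list)
    for row, col, value in b_entries:
        b_by_row[row].append((col, value))

    # "Join" on A.col == B.row, then "group by" (A.row, B.col) and sum.
    product = defaultdict(float)
    for a_row, a_col, a_value in a_entries:
        for b_col, b_value in b_by_row.get(a_col, ()):
            product[(a_row, b_col)] += a_value * b_value
    return dict(product)


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], with zeros omitted.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
result = multiply_coo(A, B)
```

Only non-zero entries are stored or touched, which is the point of avoiding the dense conversion.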

Related

Parallelizing fastText.get_sentence_vector with dask gives pickling error

I was trying to get fastText sentence embeddings for 80 Million English tweets using the parallelizing mechanism using dask as described in this answer: How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?
Here is my full code:
import dask.dataframe as dd
from dask.multiprocessing import get
import fasttext
import fasttext.util
import pandas as pd
print('starting langage: ' + 'en')
lang_output = pd.DataFrame()
lang_input = full_input.loc[full_input.name == 'en'] # 80 Million English tweets
ddata = dd.from_pandas(lang_input, npartitions = 96)
print('number of lines to compute: ' + str(len(lang_input)))
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.'+'en'+'.300.bin')
fasttext.util.reduce_model(ft, 20)
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
print('finished en')
This is the get_fasttext_sentence_embedding function:
def get_fasttext_sentence_embedding(row, ft):
    if pd.isna(row):
        return np.zeros(20)
    return ft.get_sentence_vector(row)
But, I get a pickling error on this line:
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
This is the error I get:
TypeError: can't pickle fasttext_pybind.fasttext objects
Is there a way to parallelize fastText's get_sentence_vector with Dask (or anything else)? I need to parallelize because getting sentence embeddings for 80 million tweets takes too much time, and each row of my data frame is completely independent of the others.
The problem here is that fasttext objects apparently can't be pickled, and Dask doesn't know how to serialize and deserialize this data structure without pickling.
The simplest way to use Dask here (though likely not the most efficient) would be to have each process load the ft model itself, which avoids the need to transfer it (and thus avoids the attempted pickling). Something like the following would work. Notice that ft is defined inside the function being mapped across partitions.
First, some example data.
import dask.dataframe as dd
import fasttext
import pandas as pd
import dask
import numpy as np
df = pd.DataFrame({"text":['this is a test sentence', None, 'this is another one.', 'one more']})
ddf = dd.from_pandas(df, npartitions=2)
ddf
Dask DataFrame Structure:
                 text
npartitions=2
0              object
2                 ...
3                 ...
Dask Name: from_pandas, 2 tasks
Next, we can tweak your functions to define ft within each process. This duplicates effort, but avoids the need to transfer the model. With that, we can smoothly run it via map_partitions.
def get_embeddings(sent, model):
    return model.get_sentence_vector(sent) if not pd.isna(sent) else np.zeros(10)
def func(df):
    ft = fasttext.load_model("amazon_review_polarity.bin") # arbitrary model
    res = df['text'].apply(lambda x: get_embeddings(x, model=ft))
    return res
ddf['sentence_vector'] = ddf.map_partitions(func)
ddf.compute(scheduler='processes')
                      text                                    sentence_vector
0  this is a test sentence  [-0.01934033, 0.03729743, -0.04679677, -0.0603...
1                     None  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2     this is another one.  [-0.0025579212, 0.0353713, -0.027139299, -0.05...
3                 one more  [-0.014522496, 0.10396308, -0.13107553, -0.198...
Note that this nested data structure (a list in a column) is probably not the optimal way to handle these vectors, but it will depend on your use case. Also, there is probably a way to do this computation in batches using fastText rather than one row at a time (in Python), but I'm not well versed in the nuances of fastText.
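Loading the model inside the mapped function means it is reloaded for every partition. One way to limit that, sketched below, is to cache the loaded model per worker process so that each process pays the load cost once, however many partitions it handles. The loader here is a stand-in so the sketch runs without fastText: `load_model_stub`, `get_model`, and `embed_partition` are hypothetical names, and in real use the stub would be replaced by a call to `fasttext.load_model`.

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts loads in this process, for demonstration only


def load_model_stub(path):
    """Stand-in for an expensive, unpicklable model load (hypothetical)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # The fake "model" maps a sentence to a tiny two-element vector.
    return lambda sentence: [float(len(sentence)), float(len(sentence.split()))]


@lru_cache(maxsize=None)
def get_model(path):
    """Load the model at most once per process and per path."""
    return load_model_stub(path)


def embed_partition(sentences, path="model.bin"):
    """Per-partition work: fetch the cached model, then embed each sentence."""
    model = get_model(path)
    return [model(s) if s is not None else [0.0, 0.0] for s in sentences]


# Two "partitions" handled by the same process trigger a single load.
first = embed_partition(["this is a test sentence", None])
second = embed_partition(["one more"])
```

The same caching idea should carry over to a function passed to `map_partitions`, provided the cached loader lives in a module the worker processes can import.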
I had the same problem, but I found a solution using multiprocessing from Python's standard library.
First step: wrap the call in a module-level function.
import fasttext
model = fasttext.load_model(file_name_model)
def get_vec(txt):
    '''
    First tried to put model.get_sentence_vector into map (look below), but it resulted in pickle error.
    This works, lol.
    '''
    return model.get_sentence_vector(txt)
Then, I'm doing this:
from multiprocessing import Pool
text = ["How to sell drugs (fast)", "House of Cards", "The Crown"]
with Pool(40) as p: # I have 40 cores
    result = p.map(get_vec, text)
This presumably works because p.map only has to pickle a reference to the module-level get_vec and the input strings, not the model itself.
With 40 cores, processing 10M short texts took me ~80s.

how to create a new column with dense vectors in pyspark table using Pandas UDF?

My table is stored in PySpark on Databricks. The table has two columns, id and text. I am trying to get a dense vector for the text column. I have an ML model that generates the dense representation of a text, which should go into a new column called dense_embedding. The model generates a NumPy array to represent the input text; it works like this: model.encode(text_input). I want to use this model to generate the dense representation for every value in the text column.
Here is what I did:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import *
import pandas as pd
# Use pandas_udf to define a Pandas UDF
@pandas_udf('???', PandasUDFType.SCALAR)
# Input/output are text and dense vector
def embedding(v):
    return Vectors.dense(model.encode([v]))
small.withColumn('dense_embedding', embedding(small.text))
I am not sure what data type I should pass to pandas_udf. Also, is it correct to convert to a dense vector the way I did?

Computing dask delayed objects stored in dataframe

I am looking for the best way to compute many Dask delayed objects stored in a DataFrame. I am unsure whether the pandas DataFrame should be converted to a Dask DataFrame with the delayed objects inside it, or whether compute should be called on all values of the pandas DataFrame.
I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed object across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute
steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()
for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells like so: enr_df.applymap(compute) (which I believe calls compute on each value individually).
However if I convert to a dask dataframe the delayed objects I want to compute are layered in the dask dataframe structure:
enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
and the computation I expect does not happen.
You can pass a list of delayed objects into dask.compute
results = dask.compute(*list_of_delayed_objects)
So you need to get a list from your Pandas dataframe. This is something you can do with normal Python code.
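As a minimal sketch of that approach (assuming dask and pandas are installed), the example below keeps the delayed objects in a plain dict of lists rather than inside a DataFrame, computes them all in one `dask.compute` call, and only then builds the DataFrame from concrete values. The `tail_score` function is a hypothetical stand-in for `hypergeom.sf` so the example has no SciPy dependency.

```python
import pandas as pd
from dask import compute, delayed


def tail_score(k, N):
    """Hypothetical stand-in for an expensive call such as hypergeom.sf."""
    return k / (k + N)


sample = [5, 50, 100]

# Column name -> list of Delayed objects; nothing is computed yet.
lazy_columns = {
    N: [delayed(tail_score)(k, N) for k in range(1, 5)]
    for N in sample
}

# dask.compute traverses built-in containers, so one call evaluates every
# delayed object and returns a 1-tuple holding the same dict of lists.
(columns,) = compute(lazy_columns)

# Build the DataFrame from plain numbers.
enr_df = pd.DataFrame(columns)
```

If the delayed objects already sit in a DataFrame, flattening its values into a list and passing them as `compute(*values)` follows the same pattern.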

Pyspark: applying kmeans on different groups of a dataframe

Using PySpark, I would like to apply k-means separately to each group of a DataFrame rather than to the whole DataFrame at once. At the moment I use a for loop that iterates over the groups, applies k-means, and appends the result to another table, but with a lot of groups this is time consuming. Could anyone help, please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id") == customer)
    df = assembler.transform(temp_df)
    k = 1
    while k < 5 and mtric < width:
        k += 1
        kmeans = KMeans(k=k, seed=5, maxIter=20, initSteps=5)
        model = kmeans.fit(df)
        mtric = 1 - model.computeCost(df) / ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure Spark or Scala solution would be preferable but has yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
def skmean(kmeans, x):
    # cluster on the numeric 'val' column only; 'cat' holds strings
    X = np.array(x[['val']])
    kmeans.fit(X)
    return kmeans.predict(X)
You can apply skmean() to a pandas DataFrame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to a PySpark DataFrame, we use pandas_udf. But first, define a schema for the output DataFrame:
from pyspark.sql.types import *
schema = StructType([
    StructField('cat', StringType(), True),
    StructField('clusters', ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans, x)))
    result.reset_index(inplace=True, drop=False)
    return result
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution, which I think is slightly better than the last one. The idea is to use groupby() together with collect_list() and write a UDF that takes a list as input and generates the clusters. Continuing with df_spark from the other solution, we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1, 1)
    kmeans.fit(X)
    clusters = kmeans.predict(X).tolist()
    return clusters
clustering_udf = F.udf(lambda arr: skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to convert the list to a column.

Transform RDD to valid input for kmeans

I am calculating TF and IDF using the Spark MLlib algorithms on a directory that contains CSV files, with the following code:
import argparse
from os import system
### args parsing
parser = argparse.ArgumentParser(description='runs TF/IDF on a directory of text docs')
parser.add_argument("-i", "--input", help="the input in HDFS", required=True)
parser.add_argument("-o", '--output', help="the output in HDFS", required=True)
parser.add_argument("-mdf", '--min_document_frequency', default=1 )
args = parser.parse_args()
docs_dir = args.input
d_out = "hdfs://master:54310/" + args.output
min_df = int(args.min_document_frequency)
# import spark-realated stuff
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
sc = SparkContext(appName="TF-IDF")
# Load documents (one per line).
documents = sc.textFile(docs_dir).map(lambda title_text: title_text[1].split(" "))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
# IDF
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
#print(tfidf.collect())
#save
tfidf.saveAsTextFile(d_out)
Using
print(tfidf.collect())
I get this output:
[SparseVector(1048576, {812399: 4.3307}), SparseVector(1048576, {411697:
0.0066}), SparseVector(1048576, {411697: 0.0066}), SparseVector(1048576,
{411697: 0.0066}), SparseVector(1048576, {411697: 0.0066}), ....
I have also tested the KMeans mllib algorithm :
from __future__ import print_function
import sys
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
runs=4
def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kmeans <file> <k>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="KMeans")
    lines = sc.textFile(sys.argv[1])
    data = lines.map(parseVector)
    k = int(sys.argv[2])
    model = KMeans.train(data, k, runs)
    print("Final centers: " + str(model.clusterCenters))
    print("Total Cost: " + str(model.computeCost(data)))
    sc.stop()
with this sample test case
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
and it works fine.
Now I want to feed the RDD output from tfidf above into the KMeans algorithm, but I don't know how to transform the RDD into something like the sample text above, or how to split the RDD properly so that KMeans works.
I really need some help with this one.
UPDATE
My real question is: how can I read input for MLlib's KMeans from a text file like this?
(1048576,[155412,857472,756332],[1.75642010278,2.41857747478,1.97365255252])
(1048576,[159196,323305,501636],[2.98856378408,1.63863706713,2.44956728334])
(1048576,[135312,847543,743411],[1.42412015238,1.58759872958,2.01237484818])
UPDATE2
I am not sure at all, but I think I need to go from the vectors above to the array below so that I can apply it directly to the MLlib KMeans algorithm:
1.75642010278 2.41857747478 1.97365255252
2.98856378408 1.63863706713 2.44956728334
1.42412015238 1.58759872958 2.01237484818
The output of IDF is an RDD of SparseVector. KMeans accepts vectors (sparse or dense) as input, so no transformation should be needed: you should be able to pass the output of IDF directly to KMeans.
If you need to save the data to disk between running TF-IDF and KMeans, I would recommend saving it as a Parquet file through the DataFrame API.
First convert to a dataframe using Row:
from pyspark.sql import Row
row = Row("features") # column name
df = tfidf.map(row).toDF()
An alternative way to convert without import:
df = tfidf.map(lambda x: (x, )).toDF(["features"])
After the conversion save the dataframe as a parquet file:
df.write.parquet('/path/to/save/file')
To read the data, simply use:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet('/path/to/file')
# convert from a DataFrame back into an RDD[Vector] by pulling out the vector column
data = df.rdd.map(lambda r: r["features"])
If you do need to convert from a vector saved as a string, that is also possible. Here is some example code:
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
df = sc.parallelize([("(7,[1,2,4],[1,1,1])",)]).toDF(["features"])  # one-element tuples so toDF can infer the schema
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))
First, an example DataFrame is created with the same formatting. Then a UDF is used to parse the string into a vector. If you want an RDD instead of the DataFrame, use the code from the "reading from Parquet" part above to convert.
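To make the string format concrete, here is a small plain-Python sketch of what parsing such a line involves. It is not how `Vectors.parse` is implemented; the `parse_sparse` and `to_dense` helpers are hypothetical and only handle the `(size,[indices],[values])` form shown in the question.

```python
import ast


def parse_sparse(text):
    """Parse '(size,[indices],[values])' into (size, indices, values)."""
    size, indices, values = ast.literal_eval(text.strip())
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return int(size), [int(i) for i in indices], [float(v) for v in values]


def to_dense(size, indices, values):
    """Expand a sparse triple into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


size, indices, values = parse_sparse("(7,[1,2,4],[1,1,1])")
dense = to_dense(size, indices, values)
```

For the 1,048,576-dimensional vectors in the question, expanding every vector to dense form would be wasteful, so keeping the sparse representation is preferable.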
However, the output from IDF is very sparse. The vectors have a length of 1048576, and only one of them has a value over 1. KMeans would not give you any interesting results on this data.
I would recommend looking into word2vec instead. It will give you a more compact vector for each word, and clustering these vectors would make more sense. Using this method you get a map from words to their vector representations, which can be used for clustering.

Resources