Using a module's method in a PySpark map - apache-spark

I have heard that it is possible to call a method from another Python module to perform calculations that are not implemented in Spark, although of course it is inefficient to do so.
I need a method to compute the eigenvector centrality of a graph (as it is not available in the graphframes module).
I am aware that there is a way to do that in Scala using sparkling-graph, but I need to keep everything in Python.
I am a newbie to Spark RDDs and I am wondering what is wrong with the code below, or even whether this is a proper way of doing this:
import networkx as nx

def func1(dt):
    G = nx.Graph()
    src = dt.Provider
    dest = dt.AttendingPhysician
    gr = src.zip(dest)
    G = nx.from_edgelist(gr)
    deg = nx.eigenvector_centrality(G)
    return deg

rdd2 = inpatient.rdd.map(lambda x: func1(x))
rdd2.collect()
inpatient is a DataFrame read from a CSV file, from which I want to build a graph directed from the nodes in column Provider to the nodes in column AttendingPhysician.
The error I encounter is:
AttributeError: 'str' object has no attribute 'zip'

What you need is to understand the so-called user-defined function (UDF) functionality. However, plain Python UDFs are not very efficient.
You can use UDFs efficiently via the pandas_udf GROUPED_MAP functionality.
From the documentation:
Grouped map Pandas UDFs first splits a Spark DataFrame into groups based on the conditions specified in the groupby operator, applies a user-defined function (pandas.DataFrame -> pandas.DataFrame) to each group, combines and returns the results as a new Spark DataFrame.
An example of networkx eigenvector centrality running on PySpark is below. I group per cluster_id, which is the result of the graphframes connected components function:
import networkx as nx
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, LongType,
)


def eigencentrality(
    sparkdf, src="src", dst="dst", cluster_id_colname="cluster_id",
):
    """
    Args:
        sparkdf: input edgelist Spark DataFrame
        src: src column name
        dst: dst column name
        cluster_id_colname: Graphframes-created connected components cluster_id column
    Returns:
        node_id: node id
        eigen_centrality: eigenvector centrality of the node within its cluster
        cluster_id: cluster_id corresponding to the node_id
    """
    ecschema = StructType(
        [
            StructField("node_id", StringType()),
            StructField("eigen_centrality", DoubleType()),
            StructField(cluster_id_colname, LongType()),
        ]
    )

    psrc = src
    pdst = dst

    @pandas_udf(ecschema, PandasUDFType.GROUPED_MAP)
    def eigenc(pdf: pd.DataFrame) -> pd.DataFrame:
        # build a networkx graph from the edges of this cluster only
        nxGraph = nx.from_pandas_edgelist(pdf, psrc, pdst)
        ec = nx.eigenvector_centrality(nxGraph, tol=1e-03)
        out_df = (
            pd.DataFrame.from_dict(ec, orient="index", columns=["eigen_centrality"])
            .reset_index()
            .rename(columns={"index": "node_id"})
        )
        # every row in this group shares the same cluster_id
        out_df[cluster_id_colname] = pdf[cluster_id_colname].iloc[0]
        return out_df

    out = sparkdf.groupby(cluster_id_colname).apply(eigenc)
    return out
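A minimal usage sketch (not from the original answer; the DataFrame names vertices_df and edges_df and the use of graphframes connected components are assumptions) showing how the function above could be wired up:

from graphframes import GraphFrame

# graphframes expects a vertex DataFrame with an "id" column and an edge
# DataFrame with "src"/"dst" columns; connectedComponents also needs a
# checkpoint directory to be set.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
g = GraphFrame(vertices_df, edges_df)
cc = g.connectedComponents().select("id", "component")

# attach the component id of the source vertex to every edge and use it as cluster_id
edges_with_cluster = (
    edges_df.join(cc, edges_df["src"] == cc["id"])
            .withColumnRenamed("component", "cluster_id")
            .drop("id")
)

eigencentrality(edges_with_cluster).show()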
NB: I have created the splink_graph package in order to run parallelised networkx graph operations like the above efficiently on PySpark; this is where this code comes from. If you are interested, have a look there to see how other graph metrics have been implemented.

Related

Pyspark - Index from monotonically_increasing_id changes after list aggregation

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results, the index changes in certain cases, specifically at the upper end of the long range when the input data is not too small.
Full example code:
import random
import string

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder\
    .appName("test")\
    .master("local")\
    .config('spark.sql.shuffle.partitions', '8')\
    .getOrCreate()
sc = spark.sparkContext

# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
    L = random.randint(1, 3)
    B = ''.join(random.choices(string.ascii_uppercase, k=5))
    for i in range(L):
        C = random.randint(1, 100)
        input_data.append((B,))
    ii += 1

# Create Spark DataFrame:
input_rdd = sc.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)

# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the length is large enough (the problem appears slightly at 10000, but not at 1000), it should appear every time. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected, a workaround is to coalesce(1) the DataFrame before creating the id.
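A sketch of that workaround (the exact placement of coalesce is my assumption):

# force everything into one partition before generating the id, so the ids become a
# deterministic 0..n-1 style sequence that survives re-evaluation of the DAG
dg = dg.sort("B").coalesce(1).withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))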
dg and dg2 are two different DataFrames, each with its own DAG. These DAGs are executed independently from each other when an action on one of the DataFrames is called. So each time show() is called, the DAG of the respective DataFrame is evaluated, and during that evaluation f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() being called twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.
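A small check (my sketch, not from the original answer) that the ids are consistent between the two DataFrames once dg is cached:

from pyspark.sql import functions as f

ids_in_dg = {r["ID0"] for r in dg.select("ID0").collect()}
ids_in_dg2 = {
    r["id"]
    for r in dg2.select(f.explode(dg2["collect_list(ID0)"]).alias("id")).collect()
}
print(ids_in_dg == ids_in_dg2)  # expected to print True once dg is cached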

Fitting new data points into existing clusters

This is my first attempt at clustering!
I have a situation where I need to fit my test data set into the existing clusters that I have already built using my train dataset, and I got six neat clusters using the HAC method. Now I want to fit the new test dataframe using the same HAC approach. How can I do that?
My code is as follows:
import matplotlib.pyplot as plt
import pandas as pd
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

plt.figure(figsize=(15, 15))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram(shc.linkage(df_pca_reduced, method='ward'))

# create clusters with hierarchical agglomerative clustering (HAC)
hc = AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='ward')

# save clusters for chart
y_hc = hc.fit_predict(df_pca_reduced)
hiersclus_frame = pd.DataFrame(df1)
hiersclus_frame['cluster'] = y_hc
df_pca_reduced is the dataset that I obtained after performing PCA.
Now my clusters are stored in the cluster column within df1.
The test dataset is "df", on which I want to run the same fit_predict function, so that the df DataFrame also gets a similar cluster column.
How should I achieve this?
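One common workaround (my sketch, not from this thread): AgglomerativeClustering has no predict method for unseen data, so train a simple classifier on the PCA-reduced training data and the cluster labels, then use it to label the new points. The name df_test_pca is an assumption for the test data after applying the same PCA transform.

from sklearn.neighbors import KNeighborsClassifier

# learn the shape of the six existing clusters from the training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(df_pca_reduced, y_hc)

# assign each new (PCA-transformed) test point to the nearest existing cluster
df['cluster'] = knn.predict(df_test_pca)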

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 lines), where each line has four elements A, B, C, D in a tuple. A head scan of the RDD looks like:
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way for me to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column.
If you are using the DataFrame API you can simply aggregate multiple columns:
import os

from pyspark.sql import functions as f
from pyspark.sql import SparkSession

# start local spark session
spark = SparkSession.builder.getOrCreate()

# load as rdd
def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)

rdd = spark.sparkContext.textFile(localpath('myPosts/'))

# parse each line into a tuple of floats (assuming comma-separated values)
# so that createDataFrame can infer a schema
rdd = rdd.map(lambda line: tuple(float(x) for x in line.split(',')))

# create data frame from rdd and aggregate all columns at once
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default Spark column names ('_1', '_2', ...). If you want more descriptive column names, you can pass them as an argument to the createDataFrame command.
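For completeness, a sketch (my addition, assuming rdd is the original RDD of four-element tuples from the question) of computing the means directly on the RDD in a single pass:

import numpy as np

# sum all positions and count the rows in one reduce, then divide
sums, count = rdd.map(lambda t: (np.array(t, dtype=float), 1)) \
                 .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(tuple(sums / count))  # e.g. (3745.75, 2729.25, 4397.0, 3981.0)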

How will Spark react if an RDD gets bigger?

We have code running in Apache Spark. After a detailed examination of the code, I've determined that one of our mappers is modifying an object that is in an RDD, rather than making a copy of the object for the output. That is, we have an RDD of dicts, and the map function is adding things to the dictionary, rather than returning new dictionaries.
RDDs are supposed to be immutable. Ours are being mutated.
We are also having memory errors.
Question: Will Spark be confused if the size of an RDD suddenly increases?
While it probably does not crash, it can cause unspecified behaviour. For example, in the snippet below
import scala.collection.mutable

val rdd = sc.parallelize({
  val m = new mutable.HashMap[Int, Int]
  m.put(1, 2)
  m
} :: Nil)

rdd.cache() // comment out to change behaviour!

rdd.map(m => {
  m.put(2, 3)
  m
}).collect().foreach(println) // "Map(2 -> 3, 1 -> 2)"

rdd.collect().foreach(println) // Either "Map(1 -> 2)" or "Map(2 -> 3, 1 -> 2)", depending on whether caching is used
the behaviour changes depending on whether the RDD is cached or not. In the Spark API there is a bunch of functions that are allowed to mutate the data, and that is clearly pointed out in the documentation; see for example aggregateByKey: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/PairRDDFunctions.html#aggregateByKey-U-scala.Function2-scala.Function2-scala.reflect.ClassTag-
Consider having an RDD[(K, V)] of map entries instead of an RDD[Map[K, V]] of maps, as shown in the sketch below. This would enable adding new entries in a standard way using flatMap or mapPartitions. If needed, the map representation can eventually be regenerated by grouping, etc.
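A PySpark sketch of that layout (my illustration, not from the answer): keep (key, value) entries in the RDD instead of dicts, add entries by producing new records, and only rebuild a dict view at the end by grouping.

entries = sc.parallelize([(1, 2)])                 # RDD of (key, value) pairs
entries = entries.union(sc.parallelize([(2, 3)]))  # "adding" an entry yields a new RDD

# regenerate a map-like representation only when it is actually needed
print(entries.groupByKey().mapValues(list).collectAsMap())  # {1: [2], 2: [3]}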
Okay, I developed some code to test out what happens if an object referred to in an RDD is mutated by the mapper, and I am happy to report that it is not possible if you are programming from Python.
Here is my test program:
from pyspark.sql import SparkSession

COUNT = 5

def funnydir(i):
    """Return a dictionary for i"""
    return {"i": i,
            "gen": 0}

def funnymap(d):
    """Take a dictionary and perform a funnymap"""
    d['gen'] = d.get('gen', 0) + 1
    d['id'] = id(d)
    return d

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    dfroot = sc.parallelize(range(COUNT)).map(funnydir)
    dfroot.persist()

    df1 = dfroot.map(funnymap)
    df2 = df1.map(funnymap)
    df3 = df2.map(funnymap)
    df4 = df3.map(funnymap)
    print("===========================================")
    print("*** df1:", df1.collect())
    print("*** df2:", df2.collect())
    print("*** df3:", df3.collect())
    print("*** df4:", df4.collect())
    print("===========================================")
    ef1 = dfroot.map(funnymap)
    ef2 = ef1.map(funnymap)
    ef3 = ef2.map(funnymap)
    ef4 = ef3.map(funnymap)
    print("*** ef1:", ef1.collect())
    print("*** ef2:", ef2.collect())
    print("*** ef3:", ef3.collect())
    print("*** ef4:", ef4.collect())
If you run this, you'll see that the id of the dictionary d is different in each of the RDDs. Apparently Spark is serializing and deserializing the objects as they are passed from mapper to mapper, so each one gets its own copy.
If this were not true, then the first call to funnymap to make df1 would also change the generation in the dfroot RDD, and as a result ef4 would have different generation numbers than df4.

Apache Spark: Applying a function from sklearn parallel on partitions

I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single- node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
Actually, we have a library which does exactly that. We have several sklearn transformers and predictors up and running. Its name is sparkit-learn.
From our examples:
import numpy as np

from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
y = [...]  # list of labels

X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)

Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.
I'm not 100% sure that I am following, but there are a number of partition methods, such as mapPartitions. These operators hand you the iterator on each node, and you can do whatever you want to the data and pass it back through a new iterator:
rdd.mapPartitions(iter => {
  // Spin up something expensive that you only want to do once per node
  for (item <- iter) yield {
    // do stuff to the items using your expensive item
  }
})
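The same idea in PySpark (my sketch, not from the answer; the numeric column layout and the choice of LinearRegression are assumptions): fit one scikit-learn model per partition with mapPartitions.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_partition(rows):
    # materialise only this partition's rows on the worker
    data = np.array(list(rows), dtype=float)
    if data.size == 0:
        return []                       # empty partition: nothing to fit
    X, y = data[:, :-1], data[:, -1]    # assume the last column is the target
    model = LinearRegression().fit(X, y)
    return [model.coef_]                # one result per partition

# rdd is assumed to hold numeric tuples, partitioned into the groups you want to fit
coefficients = rdd.mapPartitions(fit_partition).collect()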
If your data set is small (i.e. it is possible to load it and train on one worker), you can do something like this:
def trainModel[T](modelId: Int, trainingSet: List[T]) = {
  // trains a model with modelId and returns it
}

// fake data
val data = List()
val numberOfModels = 100

val broadcastedData = sc.broadcast(data)
val trainedModels = sc.parallelize(Range(0, numberOfModels))
  .map(modelId => (modelId, trainModel(modelId, broadcastedData.value)))
I assume you have some list of models (or somehow parametrized models) and that you can give them ids. Then in the function trainModel you pick one depending on the id. As a result you will get an RDD of pairs of trained models and their ids.
