Fitting new data points into existing clusters - python-3.x

This is my first attempt at clustering!
I have a situation where I need to fit my test data set into the existing clusters that I have already built from my train dataset, where I obtained a neat 6 clusters using the HAC method. Now I want to fit the new test dataframe using the same HAC approach. How can I do that?
My code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# visualise the dendrogram to decide on the number of clusters
plt.figure(figsize=(15, 15))
plt.title('Visualising the data')
dendrogram = shc.dendrogram(shc.linkage(df_pca_reduced, method='ward'))

# create clusters with hierarchical agglomerative clustering (HAC)
hc = AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='ward')
# save cluster labels for the chart
y_hc = hc.fit_predict(df_pca_reduced)
hiersclus_frame = pd.DataFrame(df1)
hiersclus_frame['cluster'] = y_hc
df_pca_reduced is the dataset I obtained after performing the PCA.
Now my clusters are stored in the cluster column within df1.
The test dataset is "df", on which I want to run the same fit_predict step so that this dataframe also gets a similar cluster column.
How should I achieve this?
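One common workaround (a sketch, not from the original post): AgglomerativeClustering has no predict method, so you can train a simple classifier such as k-nearest neighbours on the PCA-reduced training data together with the HAC labels, and then use it to assign the existing cluster labels to new points. df_test_pca_reduced below is an assumed placeholder for the test set transformed with the same PCA.
from sklearn.neighbors import KNeighborsClassifier

# learn the HAC cluster labels from the training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(df_pca_reduced, y_hc)
# assign each new (PCA-transformed) test point to one of the 6 existing clusters
df['cluster'] = knn.predict(df_test_pca_reduced)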

Related

Running PySpark on AWS' EMR need help removing stop words from dataframe

As mentioned in the title, I'm processing a 64 GB CSV file on an AWS EMR cluster using a Jupyter notebook. I concatenated my two columns into one, docum = concat(title, abstract); this is a sample of the data:
+--------------------+
|               docum|
+--------------------+
|Clinical features...|
|Nitric oxide: a p...|
|Surfactant protei...|
|Role of endotheli...|
|Gene expression i...|
+--------------------+
only showing top 5 rows
The data set is too large to even post a full document here, but I need help removing the stop words so I can run KMeans on this data.
I tried using gensim, but the module is not available on PySpark; I also tried putting the text into a Python list, but the file was too large and I ran out of memory. This is the last step I did:
df2=df.select(concat(df.title,df.abstract))
df2 = df2.withColumnRenamed("concat(title, abstract)","docum")
Now I just need to figure out the stop words so I can continue.
Thank you for your time.
You can use a Spark ML transformer for that:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
text = """
The data set is too large to even post a full document on here. But I need
help removing the stopwords so I can run Kmeans on this data.
I tried using the gensim but the module is not available on pyspark, I tried throwing it into a
python list but it was too large of a file I ran out or memory. This is the last Step I did
"""
df = spark.createDataFrame([(1, text)], ["id", "text"])
# separate the text into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(df)
# remove defined stop words
remover = StopWordsRemover(inputCol="words", outputCol="result", stopWords=["the", "a", "is", "it", "to"])
final_df = remover.transform(words_df).select("result")
display(final_df)
Links:
StopWordsRemover
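If you want the full built-in English stop-word list rather than the handful hard-coded above, StopWordsRemover ships with defaults. A minimal sketch applied to the asker's df2 and its docum column (those names come from the question; the rest is an assumption):
from pyspark.ml.feature import Tokenizer, StopWordsRemover

# split the concatenated text into tokens
tokenizer = Tokenizer(inputCol="docum", outputCol="words")
words_df = tokenizer.transform(df2)

# remove the built-in English stop words
remover = StopWordsRemover(
    inputCol="words",
    outputCol="filtered",
    stopWords=StopWordsRemover.loadDefaultStopWords("english"),
)
clean_df = remover.transform(words_df).select("filtered")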

using a modules method in Pyspark map

I have heard that it is possible to call a method from another Python module to perform a calculation that is not implemented in Spark, although of course doing so is inefficient.
I need a method to compute the eigenvector centrality of a graph (it is not available in the graphframes module).
I am aware that there is a way to do this in Scala using sparkling-graph, but I need Python to keep everything in one place.
I am a newbie to Spark RDDs and I am wondering what is wrong with the code below, or whether this is even a proper way of doing this:
import networkx as nx

def func1(dt):
    G = nx.Graph()
    src = dt.Provider
    dest = dt.AttendingPhysician
    gr = src.zip(dest)
    G = nx.from_edgelist(gr)
    deg = nx.eigenvector_centrality(G)
    return deg

rdd2 = inpatient.rdd.map(lambda x: func1(x))
rdd2.collect()
inpatient is a dataframe read from a CSV file, from which I want to make a graph directed from the nodes in column Provider to the nodes in column AttendingPhysician.
The error I encounter is:
AttributeError: 'str' object has no attribute 'zip'
What you need is the so-called user-defined function (UDF) functionality. However, plain Python UDFs are not very efficient.
You can use UDFs efficiently via the provided pandas_udf GROUPED_MAP functionality.
from the documentation:
Grouped map Pandas UDFs first splits a Spark DataFrame into groups based on the conditions specified in the groupby operator, applies a user-defined function (pandas.DataFrame -> pandas.DataFrame) to each group, combines and returns the results as a new Spark DataFrame.
An example of networkx eigenvector centrality running on PySpark is below. I group per cluster_id, which is the result of the graphframes connected components function:
import networkx as nx
import pandas as pd
from networkx import eigenvector_centrality
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

def eigencentrality(
    sparkdf, src="src", dst="dst", cluster_id_colname="cluster_id",
):
    """
    Args:
        sparkdf: input edgelist Spark DataFrame
        src: src column name
        dst: dst column name
        cluster_id_colname: cluster_id column created by Graphframes' connected components
    Returns:
        node_id: node identifier
        eigen_centrality: eigenvector centrality of the node within its cluster
        cluster_id: cluster_id corresponding to the node_id
    """
    ecschema = StructType(
        [
            StructField("node_id", StringType()),
            StructField("eigen_centrality", DoubleType()),
            StructField(cluster_id_colname, LongType()),
        ]
    )

    psrc = src
    pdst = dst

    @pandas_udf(ecschema, PandasUDFType.GROUPED_MAP)
    def eigenc(pdf: pd.DataFrame) -> pd.DataFrame:
        # build an undirected networkx graph from this cluster's edges
        nxGraph = nx.from_pandas_edgelist(pdf, psrc, pdst)
        ec = eigenvector_centrality(nxGraph, tol=1e-03)
        out_df = (
            pd.DataFrame.from_dict(ec, orient="index", columns=["eigen_centrality"])
            .reset_index()
            .rename(columns={"index": "node_id"})
        )
        out_df[cluster_id_colname] = pdf[cluster_id_colname].iloc[0]
        return out_df

    out = sparkdf.groupby(cluster_id_colname).apply(eigenc)
    return out
NB: I created the splink_graph package in order to run parallelised networkx graph operations like the above efficiently on PySpark. That is where this code comes from; if you are interested, have a look there to see how other graph metrics have been implemented.
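A hypothetical usage sketch (edges_df and its column names are assumptions, e.g. an edge list joined with the cluster ids produced by graphframes' connectedComponents()):
# edges_df: Spark DataFrame with columns src, dst and cluster_id (assumed)
centrality_df = eigencentrality(
    edges_df, src="src", dst="dst", cluster_id_colname="cluster_id"
)
centrality_df.show()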

How to get a dataframe from word2vec model using spark

I am currently working on a Sparkling Water application and I am a total beginner in Spark and H2O.
What I want to do:
load an input text file
create a word2vec model
create a dataframe with a column word and a column Vector
use the dataframe as input for H2O
By creating the model I get a map, but I don't know how to create a dataframe from it. The output should look like this:
word   | Vector
assert | [0.3, 0.4, ...]
sense  | [0.6, 0.2, ...]
and so on.
This is my code so far:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row
# Starting h2o application on spark cluster
hc = H2OContext(sc).start()
# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))
# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)
# Sanity check
model.findSynonyms("property",5)
# assign the vector representation (a map) to a variable
wordVectorsDF = model.getVectors()
# Transform wordVectorsDF into a dataframe
Is there any approach to this, or functions provided by Spark?
Thanks in advance
I found out that there are two libraries for a Word2Vec transformation; I don't know why.
from pyspark.mllib.feature import Word2Vec
from pyspark.ml.feature import Word2Vec
The second import returns a data frame from the getVectors() function and has different parameters for building a model than the first one.
Maybe somebody can comment on the two different libraries.
Thanks in advance.
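For completeness, a minimal sketch of the pyspark.ml variant, whose model.getVectors() already returns the word/vector DataFrame asked for above (the file path and column names are placeholders, not from the original post):
from pyspark.ml.feature import Word2Vec

# build a DataFrame with an array<string> column of tokens
text_df = spark.read.text("examples/custom/text8.txt") \
    .selectExpr("split(value, ' ') AS words")

# train word2vec with a vector size of 10
word2vec = Word2Vec(vectorSize=10, inputCol="words", outputCol="features")
model = word2vec.fit(text_df)

# getVectors() returns a DataFrame with columns "word" and "vector"
vectors_df = model.getVectors()
vectors_df.show(5)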
First of all, H2O does not support a Vector column type; you'd have to make a frame like this:
word   | V1  | V2  | ...
assert | 0.3 | 0.4 | ...
sense  | 0.6 | 0.2 | ...
Now for the actual question: no, since it's a Scala Map. We provide ways to create frames from data sources (files on HDFS/S3, databases, etc.) or conversions from RDDs/DataFrames, but not from Java/Scala collections. Writing one would be possible, but quite cumbersome.
Not the most performant solution, but the easiest code-wise would be to make a DataFrame (or RDD) first (by running sc.parallelize on map.toSeq) and then convert it to an H2OFrame:
import hc._
// parallelise the (word -> vector) map and convert it to a Spark DataFrame
val wordsDF = sc.parallelize(wordVectorsDF.toSeq).toDF
// convert the Spark DataFrame into an H2OFrame
val h2oFrame = asH2OFrame(wordsDF)

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

Suppose I have a Pipeline like this:
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
val nb = new org.apache.spark.ml.classification.NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator()).setEstimatorParamMaps(paramGrid).setNumFolds(10)
val cvModel = cv.fit(df)
As you can see, I defined a CrossValidator using a BinaryClassificationEvaluator. I have seen a lot of examples of getting metrics like Precision/Recall during the testing process, but those metrics are obtained when you use a separate data set for testing (see, for example, this documentation).
From my understanding, CrossValidator is going to create folds, one fold will be used for testing, and then CrossValidator will choose the best model. My question is: is it possible to get Precision/Recall metrics during the training process?
Well, the only metric that is actually stored is the one you define when you create an instance of an Evaluator. For the BinaryClassificationEvaluator this can take one of two values:
areaUnderROC
areaUnderPR
with the former being the default; it can be set using the setMetricName method.
These values are collected during the training process and can be accessed using CrossValidatorModel.avgMetrics. The order of the values corresponds to the order of the EstimatorParamMaps (CrossValidatorModel.getEstimatorParamMaps).
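As a hedged sketch of reading those stored metrics back out (the question's code is Scala, but the same accessors exist in PySpark; cvModel below is assumed to be a fitted CrossValidatorModel):
# pair each parameter combination with its cross-validated metric
param_maps = cvModel.getEstimatorParamMaps()
for params, metric in zip(param_maps, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)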

Apache Spark: Applying a function from sklearn parallel on partitions

I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (e.g. a spline) to individual partitions of an RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single- node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
Actually, we have a library which does exactly that. We have several sklearn transformers and predictors up and running. Its name is sparkit-learn.
From our examples:
import numpy as np

from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
y = [...]  # list of labels

X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)

Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.
I'm not 100% sure that I'm following, but there are a number of partition methods, such as mapPartitions. These operators hand you the iterator on each node, and you can do whatever you want with the data and pass it back through a new iterator:
rdd.mapPartitions(iter => {
  // Spin up something expensive that you only want to do once per node
  for (item <- iter) yield {
    // do stuff to the items using your expensive item
  }
})
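Since the question is about calling a single-node Python library per partition, here is a hedged PySpark sketch of the same idea, using scipy's UnivariateSpline as the per-partition model (the data layout and the rdd variable are assumptions, not from the book or the paper):
import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_spline(partition):
    # each partition is assumed to hold the (x, y) points of one small dataset
    points = sorted(partition)
    if len(points) < 4:  # UnivariateSpline with the default cubic degree needs at least 4 points
        return iter([])
    x, y = zip(*points)
    spline = UnivariateSpline(np.array(x), np.array(y))
    # return something small per partition, e.g. the fitted knot positions
    return iter([spline.get_knots().tolist()])

# rdd is assumed to be an RDD of (x, y) tuples, partitioned one dataset per partition
results = rdd.mapPartitions(fit_spline).collect()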
If your data set is small (it is possible to load it and train on one worker), you can do something like this:
def trainModel[T](modelId: Int, trainingSet: List[T]) = {
  // trains the model with modelId and returns it
}

// fake data
val data = List()
val numberOfModels = 100

val broadcastedData = sc.broadcast(data)
val trainedModels = sc.parallelize(Range(0, numberOfModels))
  .map(modelId => (modelId, trainModel(modelId, broadcastedData.value)))
I assume you have some list of models (or somehow parametrised models) and that you can give them ids. Then, in the trainModel function, you pick one depending on its id. As a result you get an RDD of pairs of model ids and their trained models.
