How to get a dataframe from word2vec model using spark - apache-spark

I am currently working on a Sparkling Water application, and I am a total beginner in Spark and H2O.
What I want to do:
load an input text file
create a word2vec model
create a dataframe with a column word and a column Vector
use the dataframe as input for H2O
When I create the model I get a map, but I don't know how to turn it into a dataframe. The output should look like this:
word | Vector
assert | [0.3, 0.4.....]
sense | [0.6, 0.2.....]
and so on.
This is my code so far:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row
# Start the H2O context on the Spark cluster
hc = H2OContext(sc).start()
# Load the input file and split each line into words
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))
# Build the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)
# Sanity check
model.findSynonyms("property", 5)
# Assign the vector representations (a map of word -> vector) to a variable
wordVectorsDF = model.getVectors()
# TODO: transform wordVectorsDF into a dataframe with a word column and a Vector column
Is there any approach to this, or a function provided by Spark?
Thanks in advance.

I found out that there are two libraries that provide a Word2Vec implementation, and I don't know why.
from pyspark.mllib.feature import Word2Vec
from pyspark.ml.feature import Word2Vec
The second one returns a data frame from its getVectors() function and takes different parameters for building a model than the first one.
Maybe somebody can comment on the difference between the two libraries.
Thanks in advance.
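For illustration, here is a minimal sketch of the pyspark.ml.feature variant: it is trained on a DataFrame column of word arrays, and its fitted model exposes the vectors directly as a DataFrame via getVectors(). The SparkSession name spark is an assumption here:
from pyspark.ml.feature import Word2Vec
from pyspark.sql.functions import split
# the ml variant expects a DataFrame with an array<string> column
text_df = spark.read.text("examples/custom/text8.txt").select(split("value", " ").alias("words"))
word2vec = Word2Vec(vectorSize=10, minCount=0, inputCol="words", outputCol="features")
model = word2vec.fit(text_df)
# getVectors() already returns a DataFrame with a word column and a vector column
vectors_df = model.getVectors()
vectors_df.show(5)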

First of all, in H2O we don't support a Vector column type; you'd have to make a frame like this:
word | V1 | V2 | ...
assert | 0.3 | 0.4 | ...
sense | 0.6 | 0.2 | ...
Now for the actual question: no, since it's a Scala Map. We provide ways to create frames from data sources (files on HDFS/S3, databases, etc.) or conversions from RDDs/DataFrames, but not from Java/Scala collections. Writing such a conversion would be possible but quite cumbersome.
Not the most performant solution, but the easiest code-wise, would be to make a DF (or RDD) first (by running sc.parallelize on map.toSeq) and then convert it to an H2OFrame:
// implicit conversions from the H2OContext
import hc._
// turn the word -> vector map into a DataFrame via an RDD
val wordsDF = sc.parallelize(wordVectorsDF.toSeq).toDF
// convert the Spark DataFrame into an H2OFrame
val h2oFrame = asH2OFrame(wordsDF)
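On the Python side, a rough sketch of the same idea, with each vector expanded into one column per dimension as described above; sqlContext is the usual shell SQLContext, and the as_h2o_frame call is an assumption about the pysparkling version in use:
# model.getVectors() from pyspark.mllib behaves like a map of word -> vector
word_vectors = dict(model.getVectors())
# build one row per word, with one column per vector dimension (V1..V10)
rows = [(w,) + tuple(float(x) for x in vec) for w, vec in word_vectors.items()]
columns = ["word"] + ["V%d" % (i + 1) for i in range(10)]
words_df = sqlContext.createDataFrame(rows, columns)
# convert the Spark DataFrame to an H2OFrame
h2o_frame = hc.as_h2o_frame(words_df)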

Related

Running PySpark on AWS EMR: need help removing stop words from a dataframe

As mentioned above, I'm processing a 64 GB CSV file on an AWS EMR cluster using a Jupyter notebook. I concatenated my two columns into one, docum = concat(title, abstract). This is a sample of the data:
+--------------------+
|               docum|
+--------------------+
|Clinical features...|
|Nitric oxide: a p...|
|Surfactant protei...|
|Role of endotheli...|
|Gene expression i...|
+--------------------+
only showing top 5 rows
The data set is too large to even post a full document here, but I need help removing the stop words so I can run KMeans on this data.
I tried using gensim, but the module is not available in PySpark; I also tried pulling the data into a Python list, but the file was too large and I ran out of memory. This is the last step I did:
df2=df.select(concat(df.title,df.abstract))
df2 = df2.withColumnRenamed("concat(title, abstract)","docum")
Now I just need to figure out the stop words so I can continue.
Thank you for your time.
You can use the Spark ML transformers for that:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
text = """
The data set is too large to even post a full document on here. But I need
help removing the stopwords so I can run Kmeans on this data.
I tried using the gensim but the module is not available on pyspark, I tried throwing it into a
python list but it was too large of a file I ran out or memory. This is the last Step I did
"""
df = spark.createDataFrame([(1, text)], ["id", "text"])
# split the text into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(df)
# remove defined stop words
remover = StopWordsRemover(inputCol="words", outputCol="result", stopWords=["the", "a", "is", "it", "to"])
final_df = remover.transform(words_df).select("result")
final_df.show(truncate=False)
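To apply this to the actual data in the question, the same two transformers can be chained on the docum column. This is a sketch that assumes df2 from the question and relies on StopWordsRemover's default English stop word list:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
# tokenize the concatenated title + abstract column
tokenizer = Tokenizer(inputCol="docum", outputCol="words")
# with no stopWords argument, the default English stop word list is used
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
df3 = remover.transform(tokenizer.transform(df2))
df3.select("filtered").show(5)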
Links:
StopWordsRemover

Fitting new data points into existing clusters

This is my first attempt at clustering!
I have a situation where I need to fit my test data set into the existing clusters that I have already built using my training dataset; I got six neat clusters using the HAC (hierarchical agglomerative clustering) method. Now I want to fit the new test dataframe using the same HAC approach. How can I do that?
My code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
# import hierarchical clustering libraries
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering
# visualise the dendrogram of the training data
plt.figure(figsize=(15, 15))
plt.title('Visualising the data')
dendrogram = shc.dendrogram(shc.linkage(df_pca_reduced, method='ward'))
# create clusters
hc = AgglomerativeClustering(n_clusters=6, affinity='euclidean', linkage='ward')
# save clusters for chart
y_hc = hc.fit_predict(df_pca_reduced)
hiersclus_frame = pd.DataFrame(df1)
hiersclus_frame['cluster'] = y_hc
df_pca_reduced is the dataset that I obtained after performing PCA.
Now my clusters are stored in column cluster within df1.
The test dataset is "df", on which I want to run the same fit_predict function so that the df dataframe also gets a similar cluster column.
How should I achieve this?
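Since sklearn's AgglomerativeClustering has no predict method for unseen data, one common workaround (only a sketch, not from the original thread) is to train a simple classifier, for example k-nearest neighbours, on the PCA-reduced training data and its cluster labels, and use it to assign the test rows to the existing clusters. The fitted PCA object name pca and the test dataframe df are assumptions:
from sklearn.neighbors import KNeighborsClassifier
# learn a mapping from the PCA-reduced training data to the HAC cluster labels
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(df_pca_reduced, y_hc)
# apply the same PCA transform to the test data, then assign clusters
df_test_reduced = pca.transform(df)
df['cluster'] = knn.predict(df_test_reduced)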

How do I convert a Spark dataframe to an RDD and get a bag of words

I have a dataframe called article
+--------------------+
| processed_title|
+--------------------+
|[new, relictual, ...|
|[once, upon,a,time..|
+--------------------+
I want to flatten it to get it as a bag of words.
How can I achieve this in the current situation? I have tried the code below, which gives me a type mismatch issue.
val bow_corpus = article.select("processed_title").rdd.flatMap(y => y)
I eventually want to use this bow_corpus to train a word2vec model.
Thanks
Assuming that processed_title is represented in SQL as array<string>:
article.select("processed_title").rdd.flatMap(_.getSeq[String](0))
There is also the Word2Vec transformer, which can be trained directly on a DataFrame:
import org.apache.spark.ml.feature.Word2Vec
val word2Vec = new Word2Vec()
.setInputCol("processed_title")
.setOutputCol("vectors")
.setMinCount(0)
.fit(article)
word2Vec.findSynonyms("foo", 1)
See also Spark extracting values from a Row
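For comparison, a rough PySpark sketch of the same two steps, using the article DataFrame and processed_title column from the question:
from pyspark.ml.feature import Word2Vec
# flatten the array<string> column into an RDD of words (the bag of words)
bow_corpus = article.select("processed_title").rdd.flatMap(lambda row: row[0])
# or train the ml Word2Vec transformer directly on the DataFrame column
word2vec = Word2Vec(inputCol="processed_title", outputCol="vectors", minCount=0)
model = word2vec.fit(article)
model.findSynonyms("foo", 1).show()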

Spark and categorical string variables

I'm trying to understand how spark.ml handles string categorical independent variables. I know that in Spark I have to convert strings to doubles using StringIndexer.
E.g., "a"/"b"/"c" => 0.0/1.0/2.0.
But what I really would like to avoid is then having to use OneHotEncoder on that column of doubles. This seems to make the pipeline unnecessarily messy, especially since Spark knows that the data is categorical. Hopefully the sample code below makes my question clearer.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
val df = sqlContext.createDataFrame(Seq(
  Tuple2(0.0, "a"), Tuple2(1.0, "b"), Tuple2(1.0, "c"), Tuple2(0.0, "c")
)).toDF("y", "x")
// index the string column "x"
val indexer = new StringIndexer().setInputCol("x").setOutputCol("xIdx").fit(df)
val indexed = indexer.transform(df)
// build a data frame of label, vectors
val assembler = (new VectorAssembler()).setInputCols(List("xIdx").toArray).setOutputCol("features")
val assembled = assembler.transform(indexed)
// build a logistic regression model and fit it
val logreg = (new LogisticRegression()).setFeaturesCol("features").setLabelCol("y")
val model = logreg.fit(assembled)
The logistic regression sees this as a model with only one independent variable.
model.coefficients
res1: org.apache.spark.mllib.linalg.Vector = [0.7667490491775728]
But the independent variable is categorical with three categories = ["a", "b", "c"]. I know I never did a one-of-k encoding, but the metadata of the data frame knows that the feature vector is nominal.
import org.apache.spark.ml.attribute.AttributeGroup
AttributeGroup.fromStructField(assembled.schema("features"))
res2: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"attrs":
{"nominal":[{"vals":["c","a","b"],"idx":0,"name":"xIdx"}]},
"num_attrs":1}}
How do I pass this information to LogisticRegression? Is this not the whole point of keeping dataframe metadata? There does not seem to be a CategoricalFeaturesInfo in Spark ML. Do I really need to do a one-of-k encoding for each categorical feature?
Maybe I am missing something, but this really looks like a job for RFormula (https://spark.apache.org/docs/latest/ml-features.html#rformula).
As the name suggests, it takes an "R-style" formula that describes how the feature vector is composed from the input data columns.
For each categorical input column (that is, each column of StringType) it adds a StringIndexer + OneHotEncoder to the final pipeline implementing the formula under the hood.
The output is a feature vector of doubles that can be used with any algorithm in the org.apache.spark.ml package, such as the one you are targeting.
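A minimal sketch of the idea in PySpark (the RFormula API is the same in Scala; the toy data mirrors the question, and a SparkSession named spark is assumed):
from pyspark.ml.feature import RFormula
df = spark.createDataFrame([(0.0, "a"), (1.0, "b"), (1.0, "c"), (0.0, "c")], ["y", "x"])
# "y ~ x" indexes and one-hot encodes the string column x under the hood
formula = RFormula(formula="y ~ x", featuresCol="features", labelCol="label")
encoded = formula.fit(df).transform(df)
encoded.show(truncate=False)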

Apache Spark: Applying a function from sklearn parallel on partitions

I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (e.g. a spline) to individual partitions of an RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single- node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark map()."
Actually, we have a library which does exactly that. We have several sklearn transformers and predictors up and running. Its name is sparkit-learn.
From our examples:
import numpy as np
from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
X = [...]  # list of texts
y = [...]  # list of labels
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])
local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))
local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))
y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
You can find it here.
I'm not 100% sure that I am following, but there are a number of partition-based methods, such as mapPartitions. These operators hand you the iterator over each partition, and you can do whatever you want with the data and pass it back through a new iterator:
rdd.mapPartitions(iter => {
  // spin up something expensive that you only want to do once per partition
  for (item <- iter) yield {
    // do stuff to each item using your expensive resource
  }
})
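Since the question is about calling sklearn from PySpark, here is a minimal sketch of the same pattern in Python; the RDD contents (numeric feature rows with the label in the last position) and the choice of LinearRegression are made up for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_partition(rows):
    # collect this partition's rows locally and fit a single-node sklearn model on them
    data = np.array(list(rows))
    if data.size == 0:
        return iter([])
    X, y = data[:, :-1], data[:, -1]
    model = LinearRegression().fit(X, y)
    # yield one result (here, the fitted coefficients) per partition
    return iter([model.coef_])

# rdd is assumed to hold rows of numeric features followed by a label
coefficients_per_partition = rdd.mapPartitions(fit_partition).collect()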
If your data set is small (that is, it can be loaded and trained on one worker), you can do something like this:
def trainModel[T](modelId: Int, trainingSet: List[T]) = {
  // trains the model identified by modelId and returns it
}
// fake data
val data = List()
val numberOfModels = 100
val broadcastedData = sc.broadcast(data)
val trainedModels = sc.parallelize(Range(0, numberOfModels))
  .map(modelId => (modelId, trainModel(modelId, broadcastedData.value)))
I assume you have some list of models (or somehow parametrized models) and you can give them IDs. Then, in the function trainModel, you pick one depending on the ID. As a result you will get an RDD of pairs of model IDs and trained models.
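A rough PySpark equivalent of the same broadcast-and-train pattern; train_model and the empty data list are placeholders:
def train_model(model_id, training_set):
    # pick hyperparameters based on model_id, fit a single-node model and return it
    ...

data = []           # small training set that fits in memory on one worker
number_of_models = 100
broadcast_data = sc.broadcast(data)
trained_models = (sc.parallelize(range(number_of_models))
                  .map(lambda i: (i, train_model(i, broadcast_data.value)))
                  .collect())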
