Spark Decision tree with categorical variables - apache-spark

My data has categorical variables (the response and some of the feature variables).
How can I convert it to libsvm format after converting the categorical variables to binary features?

If your data is an RDD, you may call the method saveAsLibSVMFile(rdd, path). It's part of the org.apache.spark.mllib.util.MLUtils object.
For official documentation see: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.util.MLUtils$
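For the categorical-to-binary conversion itself, one possible approach is spark.ml's StringIndexer followed by OneHotEncoder. A minimal sketch, assuming a DataFrame df with a string column "category" (both names are assumptions):
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// index the string categories, then expand the indices into binary vectors
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)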
Here's a Scala example, assuming you have already converted your categorical data to binary features (you can do the same in Python or Java too):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val responseData = sc.textFile("response.txt")   // one label per line
val responseValue = responseData.map(line => line.trim().toDouble)
val featuresData = sc.textFile("features.txt")   // one space-separated feature row per line
val featuresValue = featuresData.map(line => line.trim().split(" ").map(_.toDouble))
// pair each label with its feature row
val data = responseValue.zip(featuresValue).map {
  case (label, features) => LabeledPoint(label, Vectors.dense(features))
}
MLUtils.saveAsLibSVMFile(data, "data.libsvm")
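As a quick sanity check, MLUtils can load the saved file back as an RDD[LabeledPoint]:
// load the saved file back and inspect a few records
val loaded = MLUtils.loadLibSVMFile(sc, "data.libsvm")
loaded.take(5).foreach(println)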
If you want the PySpark version (untested, but something like):
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

responseData = sc.textFile("response.txt")
responseValue = responseData.map(lambda line: float(line.strip()))
# for clarity you can also extract the lambdas into named functions
featuresData = sc.textFile("features.txt")
featuresValue = featuresData.map(lambda line: [float(x) for x in line.strip().split(" ")])
data = responseValue.zip(featuresValue).map(
    lambda pair: LabeledPoint(pair[0], Vectors.dense(pair[1])))
MLUtils.saveAsLibSVMFile(data, "data.libsvm")

Related

Load Data for Machine Learning in Spark [duplicate]

I want to produce libsvm format, so I reshaped my dataframe into the desired format, but I do not know how to convert it to libsvm format. The format is as shown in the figure. The desired libsvm format is user item:rating. Please advise if you know what to do in the current situation:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map { case (user, product, rate) => (user, (product.toInt, rate.toDouble)) }
val usergroup = user.groupByKey
val data = usergroup.map { case (x, iter) => (x, iter.map(_._1).toArray, iter.map(_._2).toArray) }
val data_DF = data.toDF("user", "item", "rating")
I am using Spark 2.0.
The issue you are facing can be divided into the following:
Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings:
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list matrix (COO).
Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).
Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows        // Extract indexed rows
  .toDF("label", "features")        // Convert the RDD[IndexedRow] to a DataFrame
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
Unfortunately we still can't use the DataFrameWriter directly: while most pipeline components support backward compatibility for loading, existing DataFrames and pipelines from Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame:
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the file contents:
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
EDIT:
In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")
In order to convert an existing DataFrame to a typed Dataset, I suggest the following. Use this case class:
case class LibSvmEntry(
  value: Double,
  features: org.apache.spark.ml.linalg.Vector)
Then you can use the map function to convert it to a LibSVM entry like so:
df.map(r => /* build a LibSvmEntry from the Row here */)
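For instance, a minimal sketch of that map using the case class above, assuming df has "label" and "features" columns already in the new ml.linalg types (the column names are an assumption):
import org.apache.spark.ml.linalg.Vector
import spark.implicits._ // provides the Encoder for the case class

// yields a typed Dataset[LibSvmEntry]
val ds = df.map { r =>
  LibSvmEntry(r.getAs[Double]("label"), r.getAs[Vector]("features"))
}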
The libsvm datatype's features field is a sparse vector, so you can use pyspark.ml.linalg.SparseVector to solve the problem:
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import udf

a = SparseVector(4, [1, 3], [3.0, 4.0])

def sparse_vec(length, index, score):
    """args: length int, index array, score array"""
    return SparseVector(length, index, score)

trans_sparse = udf(sparse_vec, VectorUDT())

Preparing data for LDA training with PySpark 1.6

I have a corpus of documents that I'm reading into a Spark data frame.
I have tokenized and vectorized the text and now I want to feed the vectorized data into an mllib LDA model. The LDA API docs seem to require the data to be:
rdd – RDD of documents, which are tuples of document IDs and term (word) count vectors. The term count vectors are “bags of words” with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
How can I get from my data frame to a suitable rdd?
from pyspark.mllib.clustering import LDA
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import CountVectorizer
#read the data
tf = sc.wholeTextFiles("20_newsgroups/*")
#transform into a data frame
df = tf.toDF(schema=['file','text'])
#tokenize
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenized = tokenizer.transform(df)
#vectorize
cv = CountVectorizer(inputCol="words", outputCol="vectors")
model = cv.fit(tokenized)
result = model.transform(tokenized)
#transform into a suitable rdd
myrdd = ?
#LDA
model = LDA.train(myrdd, k=2, seed=1)
PS: I'm using Apache Spark 1.6.3.
Let's first organize the imports, read the data, do some simple special-character removal and transform it all into a DataFrame:
import re  # needed to remove special characters
from pyspark.sql import Row
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.mllib.clustering import LDA
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

pattern = re.compile(r'[\W_]+')
rdd = sc.wholeTextFiles("./data/20news-bydate/*/*/*") \
    .mapValues(lambda x: pattern.sub(' ', x)).cache()  # ref. https://stackoverflow.com/a/1277047/3415409
df = rdd.toDF(schema=['file', 'text'])
We will need to add an index to each Row. The following code snippet is inspired by this question about adding primary keys with Apache Spark:
row_with_index = Row(*["id"] + df.columns)

def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)
indexed = (df.rdd
           .zipWithUniqueId()
           .map(lambda x: f(*x))
           .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
Once we have added the index, we can proceed to feature cleansing, extraction and transformation:
# tokenize
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokenized = tokenizer.transform(indexed)
# remove stop words
remover = StopWordsRemover(inputCol="tokens", outputCol="words")
cleaned = remover.transform(tokenized)
# vectorize
cv = CountVectorizer(inputCol="words", outputCol="vectors")
count_vectorizer_model = cv.fit(cleaned)
result = count_vectorizer_model.transform(cleaned)
Now, let's transform the result DataFrame back to an RDD:
corpus = result.select(F.col('id').cast("long"), 'vectors').rdd \
.map(lambda x: [x[0], x[1]])
Our data is now ready for training:
# training data
lda_model = LDA.train(rdd=corpus, k=10, seed=12, maxIterations=50)
# extracting topics
topics = lda_model.describeTopics(maxTermsPerTopic=10)
# extracting the vocabulary
vocabulary = count_vectorizer_model.vocabulary
We can now print the topic descriptions as follows:
for topic in range(len(topics)):
    print("topic {} :".format(topic))
    words = topics[topic][0]
    scores = topics[topic][1]
    for word in range(len(words)):
        print(vocabulary[words[word]], "->", scores[word])
PS: The above code was tested with Spark 1.6.3.

How to convert type Row into Vector to feed to the KMeans

When I try to feed df2 to KMeans I get the following error:
clusters = KMeans.train(df2, 10, maxIterations=30,
                        runs=10, initializationMode="random")
The error I get:
Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
df2 is a DataFrame created as follows:
df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')
df2.show()
+----------+----------+
|  latitude| longitude|
+----------+----------+
|60.1643075|24.9460844|
|60.4686748|22.2774728|
+----------+----------+
How can I convert these two columns to a Vector and feed it to KMeans?
ML
The problem is that you missed the documentation's example: it's pretty clear that the train method requires a DataFrame with a Vector as features.
To modify your current data's structure you can use a VectorAssembler. In your case it could be something like:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
                                  outputCol="features")
# For your special case that has strings instead of doubles you should cast them first.
expr = [col(c).cast("Double").alias(c)
        for c in vectorAssembler.getInputCols()]
df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)
Besides, you should also normalize your features using the class MinMaxScaler to obtain better results.
MLlib
In order to achieve this using MLlib you need to use a map function first, to convert all your string values into Doubles, and merge them together into a DenseVector.
from pyspark.mllib.linalg import Vectors

rdd = df2.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))
After this point you can train MLlib's KMeans model using the rdd variable.
I got PySpark 2.3.1 to perform KMeans on a DataFrame as follows:
Write a list of the columns you want to include in the clustering analysis:
feat_cols = ['latitude', 'longitude']
You need all of the columns to be numeric values:
from pyspark.sql.functions import col

expr = [col(c).cast("Double").alias(c) for c in feat_cols]
df2 = df2.select(*expr)
Create your features vector with VectorAssembler:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=feat_cols, outputCol="features")
df3 = assembler.transform(df2).select('features')
Normalize your features; normalization is not always required, but it rarely hurts (more about this here):
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withStd=True,
    withMean=False)
scalerModel = scaler.fit(df3)
df4 = scalerModel.transform(df3).drop('features')\
    .withColumnRenamed('scaledFeatures', 'features')
Turn your DataFrame object df4 into a dense vector RDD:
from pyspark.mllib.linalg import Vectors
data5 = df4.rdd.map(lambda row: Vectors.dense([x for x in row['features']]))
Use the obtained RDD object as input for KMeans training:
from pyspark.mllib.clustering import KMeans
model = KMeans.train(data5, k=3, maxIterations=10)
Example: classify a point p in your vector space:
prediction = model.predict(p)

Spark MLlib RowMatrix from SparseVector

I am trying to create a RowMatrix from an RDD of SparseVectors but am getting the following error:
<console>:37: error: type mismatch;
found : dataRows.type (with underlying type org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.SparseVector])
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.SparseVector <: org.apache.spark.mllib.linalg.Vector (and dataRows.type <: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.SparseVector]), but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)
My code is:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}
val DATA_FILE_DIR = "/user/cloudera/data/"
val DATA_FILE_NAME = "dataOct.txt"
val dataRows = sc.textFile(DATA_FILE_DIR.concat(DATA_FILE_NAME)).map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse)
val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)
My input data file is approximately 150 rows by 50,000 columns of space separated integers.
I am running:
Spark: Version 1.5.0-cdh5.5.1
Java: 1.7.0_67
Just provide an explicit type annotation, either for the RDD:
val dataRows: org.apache.spark.rdd.RDD[Vector] = ???
or for the result of the anonymous function:
...
.map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse: Vector)
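Applied to the code above, the second option looks like this (the type ascription up-casts each SparseVector, so the RDD is inferred as RDD[Vector], which RowMatrix accepts):
val dataRows = sc.textFile(DATA_FILE_DIR.concat(DATA_FILE_NAME))
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse: Vector)
val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)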

Text classification using Naive Bayes (Hashing Term Frequency)

I am trying to build a text classification model using the Naive Bayes algorithm.
Here's my sample data (label and feature):
1|combusting [chemical]
1|industrial purposes
1|
2|salt for preserving,
2|other for foodstuffs
2|auxiliary
2|fluids for use with abrasives
3|vulcanisation
3|accelerators
3|anti-frothing solutions for batteries
4|anti-frothing solutions for accumulators
4|acetates
4|[chemicals]*
4|acetate of cellulose, unprocessed
Following is my sample code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.feature.HashingTF
val rawData = sc.textFile("~/data.csv")
val rawData1 = rawData.map(x => x.replaceAll(",",""))
val htf = new HashingTF(1000)
val parsedData = rawData1.map { line =>
  val values = line.split("|").toSeq
  val featureVector = htf.transform(values(1).split(" "))
  val label = values(0).toDouble
  LabeledPoint(label, featureVector)
}
val splits = parsedData.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 2.0, modelType = "multinomial")
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
val testData1 = htf.transform("salt")
val predictionAndLabels1 = model.predict(testData1)
I am getting approximately 33% accuracy (very low), and the test data is predicted with the wrong labels. I have printed parsedData, which contains labels and features as below:
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
I am not able to figure out what's missing; the hashing term frequency function seems to be generating repeated term frequencies. Kindly suggest how I can improve the model performance. Thanks in advance.
You have to ask yourself many questions before you start implementing your algorithm:
Your texts look very short; what is the size of your vocabulary? Answering this will help you tune the HashingTF dimensionality. In your case, you might need to use a lower value.
You might need to consider doing some pre-processing on your texts, e.g. using StopWordsRemover, stemming, or a Tokenizer.
A tokenizer will construct better text than the ad-hoc text processing you are doing.
Change your parameters, namely the NumFeatures of the HashingTF and the lambda of the Naive Bayes.
Basically, in machine learning you will need to do cross-validation on a set of parameters in order to optimise your results. Check this example and try to do something similar by tuning your HashingTF and the Naive Bayes smoothing parameter (spark.ml's equivalent of lambda), as follows:
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(naiveBayes.smoothing, Array(0.1, 0.01))
  .build()
In general, using Pipelines and CrossValidator works with Naive Bayes for multi-class classification, so have a look here rather than hardcoding all steps by hand; see the sketch below.
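A minimal sketch of such a pipeline, assuming a DataFrame training with "text" and "label" columns (both names are assumptions; adapt them to your data):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// assemble the stages: raw text -> tokens -> hashed term frequencies -> classifier
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val naiveBayes = new NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, naiveBayes))

// grid of candidate parameter values to cross-validate over
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(naiveBayes.smoothing, Array(0.1, 1.0, 2.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator()) // F1 by default
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training) // trains and keeps the best parameter combination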
