How to read a csv to use in pyspark MLlib? - apache-spark

I have a csv file that I'm trying to use as input of a KMeans algorithm in pyspark. I'm using the code from MLlib documentation.
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
dataset = spark.read.format("libsvm").load("P.txt")
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
I'm getting the error:
java.lang.NumberFormatException: For input string: "-6.71,-1.14"
I tried to read the file as
dataset = spark.read.format("csv").load("P.txt")
But I get another error:
java.lang.IllegalArgumentException: Field "features" does not exist. Available fields: _c0, _c1
I'm a beginner in PySpark. I looked for tutorials on this but didn't find any.

I found the problem. A DataFrame input of kmeans.fit needs to have a field "features", as the error java.lang.IllegalArgumentException: Field "features" does not exist. Available fields: _c0, _c1 was indicating.
To do this we need a VectorAssembler, but before we need to convert the columns to a numeric type, otherwise we get the error java.lang.IllegalArgumentException: Data type string of column _c0 is not supported.
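from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
df = spark.read.csv('P.txt')
# Convert columns to float
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
# Combine the numeric columns into a single vector column named "features"
assembler = VectorAssembler(
    inputCols=["_c0", "_c1"],
    outputCol="features")
df = assembler.transform(df)
# Keep only the assembled vector column
df = df.drop("_c0")
df = df.drop("_c1")
df.show()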
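As background for the first error in the question: the libsvm reader expects every line to look like label index:value index:value ..., so a comma-separated line such as -6.71,-1.14 cannot be parsed as a number. Below is a minimal pure-Python sketch of that idea. It is not Spark's parser, and parse_libsvm_line is a hypothetical helper written only for illustration.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])  # the first token must be a plain number
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# A well-formed libsvm line parses fine
print(parse_libsvm_line("1.0 1:-6.71 2:-1.14"))

# A CSV line fails at the very first token, much like the Java error above
try:
    parse_libsvm_line("-6.71,-1.14")
except ValueError as err:
    print("not libsvm:", err)
```

This is why switching the reader to csv (and then building the features column yourself) is the right direction for this file.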

Check this method for reading CSV files:
df = spark.read.options(header=True).csv('csvFile.csv')
df.show()

Available fields: _c0, _c1
These are the default column names Spark assigns when a CSV is read without a header. Check the first row of your data file: most likely it was written to HDFS without a header row, or the header=True option was not set when reading it.
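To make the effect of the header option concrete, here is a small stdlib-only sketch that mimics the naming behaviour described above. read_csv is a hypothetical helper, not a Spark API; it only imitates how names such as _c0 and _c1 appear when no header is used.

```python
import csv
import io


def read_csv(text, header=False):
    """Return (column_names, rows) for CSV text, imitating the header option."""
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        # First row supplies the column names
        names, rows = rows[0], rows[1:]
    else:
        # No header: generate positional names like _c0, _c1, ...
        names = ["_c%d" % i for i in range(len(rows[0]))]
    return names, rows


print(read_csv("-6.71,-1.14\n0.50,2.00\n"))
print(read_csv("x,y\n-6.71,-1.14\n", header=True))
```

Note that the values are still strings in both cases, which matches the "Data type string of column _c0 is not supported" error: a cast to a numeric type is needed either way.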

Related

How to test/train split by column value rather than by row in pyspark

I would like to generate a train and test set for machine learning. Let's say I have a dataframe with the following columns:
account_id | session_id | feature_1 | feature_2 | label
In this dataset, each row will have a unique session_id, but an account_id can show up multiple times. However, I want my train and test sets to have mutually exclusive account_ids. (Almost seems like the opposite of stratified sampling).
In pandas, this is simple enough. I have something like the following:
def train_test_split(df, split_col, feature_cols, label_col, test_fraction=0.2):
    """
    While sklearn train_test_split splits by each row in the dataset,
    this function will split by a specific column. In that way, we can
    separate account_id such that train and test sets will have mutually
    exclusive accounts, to minimize cross-talk between train and test sets.
    """
    split_values = df[split_col].drop_duplicates()
    test_values = split_values.sample(frac=test_fraction, random_state=42)
    df_test = df[df[split_col].isin(test_values)]
    df_train = df[~df[split_col].isin(test_values)]
    return df_test, df_train
Now, my dataset is large enough that it cannot fit into memory, and I have to switch over from pandas to doing all of this in pyspark. How can I split a train and test set to have mutually exclusive account_ids in pyspark, without fitting everything into memory?
You can use the rand() function from pyspark.sql.functions for generating a random number for each of the distinct account_id and create train and test dataframes based on this random number.
from pyspark.sql import functions as F
TEST_FRACTION = 0.2
train_test_split = (df.select("account_id")
                    .distinct()  # removing duplicate account_ids
                    .withColumn("rand_val", F.rand(seed=42))  # fixed seed for repeatability
                    .withColumn("data_type", F.when(F.col("rand_val") < TEST_FRACTION, "test")
                                              .otherwise("train"))
                    .cache())  # cache so both filters below reuse one assignment
train_df = (train_test_split.filter(F.col("data_type") == "train")
            .join(df, on="account_id"))  # inner join removes all rows other than train
test_df = (train_test_split.filter(F.col("data_type") == "test")
           .join(df, on="account_id"))
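Since each account_id is labeled either train or test, never both, train_df and test_df will have mutually exclusive account_ids, provided the random assignment is computed only once (an unseeded, uncached rand() may be re-evaluated separately for each join).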
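If you would rather not depend on a random column at all, a related technique is to derive the split from a hash of the account_id. This is a different approach from the rand() answer above, shown here as a stdlib-only sketch; assign_split and the sample rows are hypothetical, and the same idea would need to be expressed with Spark column functions to run on a cluster.

```python
import hashlib


def assign_split(account_id, test_fraction=0.2):
    """Deterministically map an account_id to 'test' or 'train'."""
    digest = hashlib.md5(str(account_id).encode("utf-8")).hexdigest()
    # Turn the hash into a roughly uniform number in [0, 1)
    bucket = (int(digest, 16) % 10_000) / 10_000
    return "test" if bucket < test_fraction else "train"


# Hypothetical sessions: several rows share the same account_id
rows = [{"account_id": "acct-%d" % (i % 25), "session_id": i} for i in range(100)]

train = [r for r in rows if assign_split(r["account_id"]) == "train"]
test = [r for r in rows if assign_split(r["account_id"]) == "test"]

train_accounts = {r["account_id"] for r in train}
test_accounts = {r["account_id"] for r in test}
print(train_accounts & test_accounts)  # set(): no account is in both
```

Because the label depends only on the account_id, every session of an account lands on the same side, and rerunning the split gives the same result. The actual test share only approximates test_fraction, especially with few accounts.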

Transform RDD to valid input for kmeans

I am calculating TF and IDF using spark mllib algorithm of a directory that contains csv files with the following code:
import argparse
from os import system
### args parsing
parser = argparse.ArgumentParser(description='runs TF/IDF on a directory of text docs')
parser.add_argument("-i", "--input", help="the input in HDFS", required=True)
parser.add_argument("-o", "--output", help="the output in HDFS", required=True)
parser.add_argument("-mdf", "--min_document_frequency", default=1)
args = parser.parse_args()
docs_dir = args.input
d_out = "hdfs://master:54310/" + args.output
min_df = int(args.min_document_frequency)
# import spark-related stuff
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
sc = SparkContext(appName="TF-IDF")
# Load documents (one per line).
documents = sc.textFile(docs_dir).map(lambda title_text:
title_text[1].split(" "))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
# IDF
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
#print(tfidf.collect())
#save
tfidf.saveAsTextFile(d_out)
Using
print(tfidf.collect())
I get this output:
[SparseVector(1048576, {812399: 4.3307}), SparseVector(1048576, {411697:
0.0066}), SparseVector(1048576, {411697: 0.0066}), SparseVector(1048576,
{411697: 0.0066}), SparseVector(1048576, {411697: 0.0066}), ....
I have also tested the KMeans mllib algorithm :
from __future__ import print_function
import sys
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
runs=4
def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kmeans <file> <k>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="KMeans")
    lines = sc.textFile(sys.argv[1])
    data = lines.map(parseVector)
    k = int(sys.argv[2])
    model = KMeans.train(data, k, runs)
    print("Final centers: " + str(model.clusterCenters))
    print("Total Cost: " + str(model.computeCost(data)))
    sc.stop()
with this sample test case
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
and it works fine.
Now I want to feed the RDD output from the TF-IDF step above into the KMeans algorithm, but I don't know how to transform the RDD into something like the sample text above, or how to split the RDD so that KMeans can work with it.
I really need some help with this one.
UPDATE
My real question is: how can I read input for MLlib KMeans from a text file like this?
(1048576,[155412,857472,756332],[1.75642010278,2.41857747478,1.97365255252])
(1048576,[159196,323305,501636],[2.98856378408,1.63863706713,2.44956728334])
(1048576,[135312,847543,743411],[1.42412015238,1.58759872958,2.01237484818])
UPDATE2
I am not sure at all, but I think I need to go from the vectors above to the array below so that I can pass it directly to the MLlib KMeans algorithm:
1.75642010278 2.41857747478 1.97365255252
2.98856378408 1.63863706713 2.44956728334
1.42412015238 1.58759872958 2.01237484818
The output of IDF is an RDD of SparseVector. KMeans takes vectors as input (sparse or dense), so there should be no need for any transformation. You should be able to pass the output of IDF directly to KMeans.
If you need to save the data to disk between running TF-IDF and KMeans, I would recommend saving it as a Parquet file through the DataFrame API.
First convert to a dataframe using Row:
from pyspark.sql import Row
row = Row("features") # column name
df = tfidf.map(row).toDF()
An alternative way to convert without import:
df = tfidf.map(lambda x: (x, )).toDF(["features"])
After the conversion save the dataframe as a parquet file:
df.write.parquet('/path/to/save/file')
To read the data, simply use:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet('/path/to/file')
# converting from dataframe into an RDD[Vector]: take the vector out of each Row
data = df.rdd.map(lambda row: row.features)
If you ever need to convert from a vector saved as a string, that is also possible. Here is some example code:
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
# wrap each string in a one-element tuple so toDF can infer the schema
df = sc.parallelize([("(7,[1,2,4],[1,1,1])",)]).toDF(["features"])
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))
First, an example dataframe is created with the same formatting. Then a UDF is used to parse the string into a vector. If you want an RDD instead of the dataframe, use the code from the "reading from parquet" part above to convert.
However, the output from IDF is very sparse. The vectors have a length of 1048576, and in the output shown only one of them has a value over 1. KMeans would not give you any interesting results on this data.
I would recommend looking into word2vec instead. It gives you a more compact vector for each word, and clustering these vectors would make more sense. With this method you get a map from words to their vector representations, which can then be used for clustering.
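To show what the string format from the UPDATE contains, here is a stdlib-only sketch that splits one such line into its size and its index-to-value entries. parse_sparse is a hypothetical helper for illustration; inside Spark, Vectors.parse as used above is the tool for this.

```python
import ast


def parse_sparse(text):
    """Parse '(size,[indices],[values])' into (size, {index: value})."""
    size, indices, values = ast.literal_eval(text)
    if len(indices) != len(values):
        raise ValueError("indices and values differ in length")
    return size, dict(zip(indices, values))


line = "(1048576,[155412,857472,756332],[1.75642010278,2.41857747478,1.97365255252])"
size, entries = parse_sparse(line)
print(size)     # length of the full vector
print(entries)  # only the non-zero positions are stored
```

This also shows why the array in UPDATE2 is not equivalent: keeping only the values drops the indices, so the first number of one row and the first number of another row would be compared even though they belong to different terms.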

Load Data for Machine Learning in Spark [duplicate]

I want to produce data in libsvm format. I built a DataFrame in the shape I want, but I do not know how to convert it to libsvm format. The format is as shown in the figure. The libsvm layout I am hoping for is user item:rating. This is what I have so far:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt,fields(1).toInt,fields(2).toDouble)
}
val user = ratings.map{ case (user,product,rate) => (user,(product.toInt,rate.toDouble))}
val usergroup = user.groupByKey
val data =usergroup.map{ case(x,iter) => (x,iter.map(_._1).toArray,iter.map(_._2).toArray)}
val data_DF = data.toDF("user","item","rating")
I am using Spark 2.0.
The issue you are facing can be divided into the following :
Converting your ratings (I believe) into LabeledPoint data X.
Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings :
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list matrix (COO).
Spark implements a distributed matrix backed by an RDD of its entries : CoordinateMatrix where each entry is a tuple of (i: Long, j: Long, value: Double).
Note : A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse. (which is usually the case of user/item ratings.)
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD
val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map {
    line => {
      val fields = line.split(",")
      val i = fields(0).toLong
      val j = fields(1).toLong
      val value = fields(2).toDouble
      MatrixEntry(i, j, value)
    }
  }
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows :
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
.toIndexedRowMatrix().rows // Extract indexed rows
.toDF("label", "features") // Convert rows
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
Unfortunately we still can't use the DataFrameWriter directly because while most pipeline components support backward compatibility for loading, some existing DataFrames and pipelines in Spark versions prior to 2.0, that contain vector or matrix columns, may need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1.)
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame :
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the files contents :
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
EDIT:
In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
val df = Seq(neg,pos).toDF("label","features")
df.write.format("libsvm").save("data/foo")
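For readers who just want to see the target text layout, the following stdlib-only sketch groups the raw ratings from step 1 by user and prints one "user item:rating ..." line per user. It is an illustration of the format, not what Spark's writer does internally; to_libsvm_lines is a hypothetical helper, and it assumes the item ids can be used directly as 1-based feature indices.

```python
from collections import defaultdict

raw_ratings = ["0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0",
               "1,3,3.0", "3,3,4.0", "10,3,4.5"]


def to_libsvm_lines(lines):
    """Group 'user,item,rating' strings into 'user item:rating ...' lines."""
    by_user = defaultdict(dict)
    for line in lines:
        user, item, rating = line.split(",")
        by_user[int(user)][int(item)] = float(rating)
    result = []
    for user in sorted(by_user):
        # libsvm expects feature indices in ascending order
        pairs = sorted(by_user[user].items())
        features = " ".join("%d:%s" % (item, rating) for item, rating in pairs)
        result.append("%d %s" % (user, features))
    return result


for out_line in to_libsvm_lines(raw_ratings):
    print(out_line)
```

Here the user id plays the role of the label column, which matches the "user item:rating" layout asked for in the question.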
To convert an existing DataFrame to a typed Dataset, I suggest the following. Use this case class:
case class LibSvmEntry (
  value: Double,
  features: L.Vector)
Then you can use the map function to convert each row to a LibSvmEntry, like so:
df.map[LibSvmEntry]((r: Row) => /* Do your stuff here */)
The features in libsvm data are sparse vectors; you can use pyspark.ml.linalg.SparseVector to solve the problem:
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import udf
a = SparseVector(4, [1, 3], [3.0, 4.0])
def sparsevecfuc(len, index, score):
    """
    args: len int, index array, score array
    """
    return SparseVector(len, index, score)
trans_sparse = udf(sparsevecfuc, VectorUDT())

Preparing data for LDA training with PySpark 1.6

I have a corpus of documents that I'm reading into a spark data frame.
I have tokenized and vectorized the text and now I want to feed the vectorized data into an mllib LDA model. The LDA API docs seem to require the data to be:
rdd – RDD of documents, which are tuples of document IDs and term (word) count vectors. The term count vectors are “bags of words” with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
How can I get from my data frame to a suitable rdd?
from pyspark.mllib.clustering import LDA
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import CountVectorizer
#read the data
tf = sc.wholeTextFiles("20_newsgroups/*")
#transform into a data frame
df = tf.toDF(schema=['file','text'])
#tokenize
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenized = tokenizer.transform(df)
#vectorize
cv = CountVectorizer(inputCol="words", outputCol="vectors")
model = cv.fit(tokenized)
result = model.transform(tokenized)
#transform into a suitable rdd
myrdd = ?
#LDA
model = LDA.train(myrdd, k=2, seed=1)
PS : I'm using Apache Spark 1.6.3
Let's first organize imports, read the data, do some simple special characters removal and transform it into a DataFrame:
import re # needed to remove special character
from pyspark import Row
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.mllib.clustering import LDA
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType
pattern = re.compile(r'[\W_]+')  # raw string so \W is not treated as a string escape
rdd = sc.wholeTextFiles("./data/20news-bydate/*/*/*") \
    .mapValues(lambda x: pattern.sub(' ', x)).cache() # ref. https://stackoverflow.com/a/1277047/3415409
df = rdd.toDF(schema=['file', 'text'])
We will need to add an index to each Row. The following code snippet is inspired by this question about adding primary keys with Apache Spark:
row_with_index = Row(*["id"] + df.columns)
def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row
f = make_row(df.columns)
indexed = (df.rdd
           .zipWithUniqueId()
           .map(lambda x: f(*x))
           .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
Once we have added the index, we can proceed to the features cleansing, extraction and transformation :
# tokenize
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokenized = tokenizer.transform(indexed)
# remove stop words
remover = StopWordsRemover(inputCol="tokens", outputCol="words")
cleaned = remover.transform(tokenized)
# vectorize
cv = CountVectorizer(inputCol="words", outputCol="vectors")
count_vectorizer_model = cv.fit(cleaned)
result = count_vectorizer_model.transform(cleaned)
Now, let's transform the results dataframe back to rdd
corpus = result.select(F.col('id').cast("long"), 'vectors').rdd \
.map(lambda x: [x[0], x[1]])
Our data is now ready to be trained :
# training data
lda_model = LDA.train(rdd=corpus, k=10, seed=12, maxIterations=50)
# extracting topics
topics = lda_model.describeTopics(maxTermsPerTopic=10)
# extracting vocabulary
vocabulary = count_vectorizer_model.vocabulary
We can now print the topic descriptions as follows:
for topic in range(len(topics)):
    print("topic {} : ".format(topic))
    words = topics[topic][0]
    scores = topics[topic][1]
    for word in range(len(words)):
        print(vocabulary[words[word]], "->", scores[word])
PS: The above code was tested with Spark 1.6.3.
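To make the shape that LDA.train expects more tangible, here is a stdlib-only sketch that builds [document id, term-count vector] pairs over a fixed vocabulary, which is the structure the corpus RDD above holds. The three documents are made up, and the vocabulary is sorted alphabetically only to keep the example reproducible; it is not meant to reproduce CountVectorizer's own ordering.

```python
from collections import Counter

# Hypothetical mini corpus
docs = [
    "spark makes clustering simple",
    "topic models group words",
    "spark models topics",
]

tokenized = [doc.lower().split() for doc in docs]

# Fixed-size vocabulary shared by every document
vocabulary = sorted({word for words in tokenized for word in words})
position = {word: i for i, word in enumerate(vocabulary)}


def count_vector(words):
    """Return a bag-of-words vector with one slot per vocabulary term."""
    counts = [0.0] * len(vocabulary)
    for word, n in Counter(words).items():
        counts[position[word]] = float(n)
    return counts


# Unique, non-negative document ids paired with their count vectors
corpus = [[doc_id, count_vector(words)] for doc_id, words in enumerate(tokenized)]
for doc_id, vector in corpus:
    print(doc_id, vector)
```

Every vector has the same length (the vocabulary size) and every id is unique and at least 0, which are the two requirements quoted from the LDA docs in the question.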

How to convert type Row into Vector to feed to the KMeans

When I try to feed df2 to KMeans, I get the following error:
clusters = KMeans.train(df2, 10, maxIterations=30,
runs=10, initializationMode="random")
The error I get:
Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
df2 is a dataframe created as follows:
df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')
df2.show()
+----------+----------+
|  latitude| longitude|
+----------+----------+
|60.1643075|24.9460844|
|60.4686748|22.2774728|
+----------+----------+
How can I convert these two columns to a Vector and feed it to KMeans?
ML
As the documentation's example shows, KMeans in the ML API needs a DataFrame with a Vector column as features; plain numeric or string columns are not accepted.
To modify your current data's structure you can use a VectorAssembler. In your case it could be something like:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
                                  outputCol="features")
# For your special case that has string instead of doubles you should cast them first.
expr = [col(c).cast("Double").alias(c)
        for c in vectorAssembler.getInputCols()]
df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)
Besides, you should also normalize your features using the class MinMaxScaler to obtain better results.
MLLib
In order to achieve this using MLLib you need to use a map function first, to convert all your string values into Double, and merge them together in a DenseVector.
from pyspark.mllib.linalg import Vectors
rdd = df2.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))
After this point you can train your MLlib's KMeans model using the rdd variable.
I got PySpark 2.3.1 to perform KMeans on a DataFrame as follows:
Write a list of the columns you want to include in the clustering analysis:
feat_cols = ['latitude','longitude']
You need all of the columns to be numeric values:
from pyspark.sql.functions import col
expr = [col(c).cast("Double").alias(c) for c in feat_cols]
df2 = df2.select(*expr)
Create your features vector with VectorAssembler:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=feat_cols, outputCol="features")
df3 = assembler.transform(df2).select('features')
Consider normalizing your features. Normalization is not always required, but it rarely hurts:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withStd=True,
    withMean=False)
scalerModel = scaler.fit(df3)
df4 = scalerModel.transform(df3).drop('features')\
    .withColumnRenamed('scaledFeatures', 'features')
Turn your DataFrame object df4 into a dense vector RDD:
from pyspark.mllib.linalg import Vectors
data5 = df4.rdd.map(lambda row: Vectors.dense([x for x in row['features']]))
Use the obtained RDD object as input for KMeans training:
from pyspark.mllib.clustering import KMeans
model = KMeans.train(data5, k=3, maxIterations=10)
Example: classify a point p in your vector space:
prediction = model.predict(p)
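For intuition about the scaling and prediction steps above, here is a stdlib-only sketch, not Spark code. It divides each column by its sample standard deviation without centering (the withStd=True, withMean=False setting, as far as I understand it) and then assigns a point to the nearest center, which is what a k-means prediction amounts to. The third point and the choice of centers are made up for illustration, and math.dist needs Python 3.8 or later.

```python
import math
import statistics

# First two points come from the question; the third is hypothetical
points = [
    (60.1643075, 24.9460844),
    (60.4686748, 22.2774728),
    (61.4978000, 23.7610000),
]

# Scale each column to unit sample standard deviation, no mean centering
stds = [statistics.stdev(column) for column in zip(*points)]
scaled = [tuple(x / s for x, s in zip(point, stds)) for point in points]


def predict(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))


# Hypothetical centers, as a trained model would provide
centers = [scaled[0], scaled[2]]
print(predict(scaled[1], centers))
```

A point p passed to the prediction step has to be scaled with the same standard deviations as the training data, otherwise the distances to the centers are not comparable.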

Resources