How to Convert DataFrame to RDD using sql Context - apache-spark

I have created Data frame to read csv file using sqlContext from which I need to convert a column of table to RDD and then dense Vector to perform matrix multiplication .
I am finding it difficult to do so.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("inferSchema","true")
.load("/home/project/SparkRead/train.csv")
val result1 = sqlContext.sql("SELECT Sales from train").rdd
how to convert it to dense vector?

You can convert Dataframe to Vector using VectorAssembler. Check out the code below:
val df = spark.read.
format("com.databricks.spark.csv").
option("header","true").
option("inferSchema","true").
load("/tmp/train.csv")
// assuming input
// a,b,c,d
// 1,2,3,4
// 1,1,2,3
// 1,3,4,5
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().
setInputCols(Array("a", "b", "c", "d")).
setOutputCol("vect")
val output = assembler.transform(df)
// show the result
output.show()
// +---+---+---+---+-----------------+
// | a| b| c| d| vect|
// +---+---+---+---+-----------------+
// | 1| 2| 3| 4|[1.0,2.0,3.0,4.0]|
// | 1| 1| 2| 3|[1.0,1.0,2.0,3.0]|
// | 1| 3| 4| 5|[1.0,3.0,4.0,5.0]|
// +---+---+---+---+-----------------+

Related

Issue in Pyspark Cross Validation

I'm trying to cross validate RF model on Pyspark in the code below and is throwing error :
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Your code
trainData = raw_data_
numFolds = 5
rf = RandomForestClassifier(labelCol="Target", featuresCol="Scaled_features")
evaluator = MulticlassClassificationEvaluator() #
pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder()\
.addGrid(rf.numTrees, [3, 10])\
.build())
crossval = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=numFolds)
tr_model = crossval.fit(trainData)
But this is resulting in an error
My raw_data_ variable is :
| features|Position_Group| Scaled_features|Target|
+--------------------+--------------+--------------------+------+
|[173.735992431640...| FWD|[12.9261366722264...| 0|
|[188.975997924804...| FWD|[14.0600087682323...| 0|
|[179.832000732421...| FWD|[13.3796859647366...| 0|
|[155.752807617187...| MID|[11.5881692110224...| 2|
|[176.783996582031...| FWD|[13.1529113184815...| 0|
|[176.783996582031...| MID|[13.1529113184815...| 2|
|[182.880004882812...| FWD|[13.6064606109917...| 0|
|[182.880004882812...| DEF|[13.6064606109917...| 1|
|[182.880004882812...| FWD|[13.6064606109917...| 0|
|[182.880004882812...| MID|[13.6064606109917...| 2|
|[188.975997924804...| DEF|[14.0600087682323...| 1|
|[176.783996582031...| MID|[13.1529113184815...| 2|
|[170.688003540039...| MID|[12.6993631612409...| 2|
|[155.447998046875...| FWD|[11.5654910652351...| 0|
|[188.975997924804...| FWD|[14.0600087682323...| 0|
|[179.832000732421...| MID|[13.3796859647366...| 2|
|[188.975997924804...| MID|[14.0600087682323...| 2|
|[185.927993774414...| FWD|[13.8332341219772...| 0|
|[176.783996582031...| FWD|[13.1529113184815...| 0|
|[188.975997924804...| DEF|[14.0600087682323...| 1|
+--------------------+--------------+--------------------+------+
Any suggestions on why and where the issue is happening? How can the issue be resolved?
Thanks
The error says
Error while calling evaluate. Field "label" does not exist.
which suggests that something's wrong with the evaluator. In your definition of the evaluator, you did not specify the label column, so the evaluator attempts to use the default "label" column, but that does not exist.
To resolve this, you need to specify the label column when instantiating the evaluator, just as what you did for the classifier. e.g.
evaluator = MulticlassClassificationEvaluator(labelCol="Target")

Convert libsvm format string ("field1:value field2:value") to DenseVector of the values

I have a column in libsvm format (ml library of spark) field1:value field2:value ...
+--------------+-----+
| features|label|
+--------------+-----+
| a:1 b:2 c:3| 0|
| a:4 b:5 c:6| 0|
| a:7 b:8 c:9| 1|
|a:10 b:11 c:12| 0|
+--------------+-----+
I want to extract the values and save them in arrays for each row in pyspark
features.printSchema()
root
|-- features: string (nullable = false)
|-- label: integer (nullable = true)
I am using the following udf because the column affected is part of a dataframe
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors
features_expl = udf(lambda features: Vectors.dense(features.split(" ")).map(lambda feat: float(str(feat.split(":")[1]))))
features=features.withColumn("feats", features_expl(features.features))
The result I get is:
ValueError: could not convert string to float: mobile:0.0
It seems that it doesn't perform the second split and calls float() on a string.
What i would like to get is:
+--------------+-----+
| features|label|
+--------------+-----+
| [1, 2, 3]| 0|
| [4, 5, 6]| 0|
| [7, 8, 9]| 1|
| [10, 11, 12]| 0|
+--------------+-----+
You have two major problems with your udf. Firstly, it doesn't work as you intended. Consider the heart of your code as the following function:
from pyspark.ml.linalg import Vectors
def features_expl_non_udf(features):
return Vectors.dense(
features.split(" ")).map(lambda feat: float(str(feat.split(":")[1]))
)
If you call it on one of your strings:
features_expl_non_udf("a:1 b:2 c:3")
#ValueError: could not convert string to float: a:1
Because features.split(" ") returns ['a:1', 'b:2', 'c:3'], which you are passing to the Vectors.dense constructor. This does not make any sense.
What you intended to do was first split on space, then split each value of the resultant list on :. Then you can convert these values to float and pass the list to Vectors.dense.
Here is the proper implementation of your logic:
def features_expl_non_udf(features):
return Vectors.dense(map(lambda feat: float(feat.split(":")[1]), features.split()))
features_expl_non_udf("a:1 b:2 c:3")
#DenseVector([1.0, 2.0, 3.0])
Now the second problem with your udf is that you didn't specify a returnType. For a DenseVector you need to use VectorUDT as the returnType.
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT
features_expl = udf(
lambda features: Vectors.dense(
map(lambda feat: float(feat.split(":")[1]), features.split())
),
VectorUDT()
)
features.withColumn("feats", features_expl(features.features)).show()
#+--------------+-----+----------------+
#| features|label| feats|
#+--------------+-----+----------------+
#| a:1 b:2 c:3| 0| [1.0,2.0,3.0]|
#| a:4 b:5 c:6| 0| [4.0,5.0,6.0]|
#| a:7 b:8 c:9| 1| [7.0,8.0,9.0]|
#|a:10 b:11 c:12| 0|[10.0,11.0,12.0]|
#+--------------+-----+----------------+
As an alternative, you can do the string processing part on the spark side using regexp_replace and split but you'll still have to use a udf to convert the final output to a DenseVector.
from pyspark.sql.functions import regexp_replace, split, udf
from pyspark.ml.linalg import Vectors, VectorUDT
toDenseVector = udf(Vectors.dense, VectorUDT())
features.withColumn(
"features",
toDenseVector(
split(regexp_replace("features", r"\w+:", ""), "\s+").cast("array<float>")
)
).show()
#+----------------+-----+
#| features|label|
#+----------------+-----+
#| [1.0,2.0,3.0]| 0|
#| [4.0,5.0,6.0]| 0|
#| [7.0,8.0,9.0]| 1|
#|[10.0,11.0,12.0]| 0|
#+----------------+-----+

Calculate percentile with groupBy on PySpark dataframe

I am trying to groupBy and then calculate percentile on PySpark dataframe. I've tested the following piece of code according to this Stack Overflow post:
from pyspark.sql.types import FloatType
import pyspark.sql.functions as func
import numpy as np
qt_udf = func.udf(lambda x,qt: float(np.percentile(x,qt)), FloatType())
df_out = df_in.groupBy('Id').agg(func.collect_list('value').alias('data'))\
.withColumn('median', qt_udf(func.col('data'),func.lit(0.5)).cast("string"))
df_out.show()
But get the following error:
Traceback (most recent call last): > df_out.show() ....> return lambda *a: f(*a) AttributeError: 'module' object has no attribute 'percentile'
This is because of numpy version (1.4.1), the percentile function was added from version 1.5. It is not possible to update numpy version in the short term.
Define a window and use the inbuilt percent_rank function to compute percentile values.
from pyspark.sql import Window
from pyspark.sql import functions as func
w = Window.partitionBy(df_in.Id).orderBy(df_in.value) #assuming default ascending order
df_out = df_in.withColumn('percentile_col',func.percent_rank().over(w))
Question's title suggests that OP wanted to calculate percentiles. But the body shows that he needed to calculate median in groups.
Test dataset:
from pyspark.sql import SparkSession, functions as F, Window as W, Window
spark = SparkSession.builder.getOrCreate()
df_in = spark.createDataFrame(
[('1', 10),
('1', 11),
('1', 12),
('1', 13),
('2', 20)],
['Id', 'value']
)
Percentiles of given data points in groups:
w = W.partitionBy('Id').orderBy('value')
df_in = df_in.withColumn('percentile_of_value_by_Id', F.percent_rank().over(w))
df_in.show()
#+---+-----+-------------------------+
#| Id|value|percentile_of_value_by_Id|
#+---+-----+-------------------------+
#| 1| 10| 0.0|
#| 1| 11| 0.3333333333333333|
#| 1| 12| 0.6666666666666666|
#| 1| 13| 1.0|
#| 2| 20| 0.0|
#+---+-----+-------------------------+
Median (accurate and approximate):
df_out = (df_in.groupBy('Id').agg(
F.expr('percentile(value, .5)').alias('median_accurate'), # for small-mid dfs
F.percentile_approx('value', .5).alias('median_approximate') # for mid-large dfs
))
df_out.show()
#+---+---------------+------------------+
#| Id|median_accurate|median_approximate|
#+---+---------------+------------------+
#| 1| 11.5| 11|
#| 2| 20.0| 20|
#+---+---------------+------------------+

Error when running a query involving ROUND function in spark sql

I am trying, in pyspark, to obtain a new column by rounding one column of a table to the precision specified, in each row, by another column of the same table, e.g., from the following table:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
I should be able to obtain the following result:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
In particular, I have tried the following code:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
StructType, StructField, FloatType, LongType,
IntegerType
)
pdDF = pd.DataFrame(columns=["Data", "Rounding"], data=[[3.141592, 3],
[0.577215, 1]])
mySchema = StructType([ StructField("Data", FloatType(), True),
StructField("Rounding", IntegerType(), True)])
spark = (SparkSession.builder
.master("local")
.appName("column rounding")
.getOrCreate())
df = spark.createDataFrame(pdDF,schema=mySchema)
df.show()
df.createOrReplaceTempView("df_table")
df_rounded = spark.sql("SELECT Data, Rounding, ROUND(Data, Rounding) AS Rounded_Column FROM df_table")
df_rounded .show()
but I get the following error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'round(df_table.`Data`, df_table.`Rounding`)' due to data type mismatch: Only foldable Expression is allowed for scale arguments; line 1 pos 23;\n'Project [Data#0, Rounding#1, round(Data#0, Rounding#1) AS Rounded_Column#12]\n+- SubqueryAlias df_table\n +- LogicalRDD [Data#0, Rounding#1], false\n"
Any help would be deeply appreciated :)
With spark sql , the catalyst throws out the following error in your run - Only foldable Expression is allowed for scale arguments
i.e #param scale new scale to be round to, this should be a constant int at runtime
ROUND only expect a Literal for the scale. you can try out writing custom code instead of spark-sql way.
EDIT:
With UDF,
val df = Seq(
(3.141592,3),
(0.577215,1)).toDF("Data","Rounding")
df.show()
df.createOrReplaceTempView("df_table")
import org.apache.spark.sql.functions._
def RoundUDF(customvalue:Double, customscale:Int):Double = BigDecimal(customvalue).setScale(customscale, BigDecimal.RoundingMode.HALF_UP).toDouble
spark.udf.register("RoundUDF", RoundUDF(_:Double,_:Int):Double)
val df_rounded = spark.sql("select Data, Rounding, RoundUDF(Data, Rounding) as Rounded_Column from df_table")
df_rounded.show()
Input:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
Output:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+

How can I get the indices of categorical variables in a Spark DataFrame? [duplicate]

How do I handle categorical data with spark-ml and not spark-mllib ?
Thought the documentation is not very clear, it seems that classifiers e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.
However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.
How should I proceed?
I just wanted to complete Holden's answer.
Since Spark 2.3.0,OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead.
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array(indexer.getOutputCol, "category2"))
.setOutputCols(Array("category1Vec", "category2Vec"))
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
// | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])
indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
# | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
Since Spark 1.4.0, MLLib also supplies OneHotEncoder feature, which maps a column of label indices to a column of binary vectors, with at most a single one-value.
This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
Let's consider the following DataFrame:
val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c"))
.toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| a| 0.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder :
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show
// +---+-------------+
// | id| categoryVec|
// +---+-------------+
// | 0|(2,[0],[1.0])|
// | 1| (2,[],[])|
// | 2|(2,[1],[1.0])|
// | 3|(2,[0],[1.0])|
// | 4|(2,[0],[1.0])|
// | 5|(2,[1],[1.0])|
// +---+-------------+
I am going to provide an answer from another perspective, since I was also wondering about categorical features with regards to tree-based models in Spark ML (not MLlib), and the documentation is not that clear how everything works.
When you transform a column in your dataframe using pyspark.ml.feature.StringIndexer extra meta-data gets stored in the dataframe that specifically marks the transformed feature as a categorical feature.
When you print the dataframe you will see a numeric value (which is an index that corresponds with one of your categorical values) and if you look at the schema you will see that your new transformed column is of type double. However, this new column you created with pyspark.ml.feature.StringIndexer.transform is not just a normal double column, it has extra meta-data associated with it that is very important. You can inspect this meta-data by looking at the metadata property of the appropriate field in your dataframe's schema (you can access the schema objects of your dataframe by looking at yourdataframe.schema)
This extra metadata has two important implications:
When you call .fit() when using a tree based model, it will scan the meta-data of your dataframe and recognize fields that you encoded as categorical with transformers such as pyspark.ml.feature.StringIndexer (as noted above there are other transformers that will also have this effect such as pyspark.ml.feature.VectorIndexer). Because of this, you DO NOT have to one-hot encode your features after you have transformed them with StringIndxer when using tree-based models in spark ML (however, you still have to perform one-hot encoding when using other models that do not naturally handle categoricals like linear regression, etc.).
Because this metadata is stored in the data frame, you can use pyspark.ml.feature.IndexToString to reverse the numeric indices back to the original categorical values (which are often strings) at any time.
There is a component of the ML pipeline called StringIndexer you can use to convert your strings to Double's in a reasonable way. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer has more documentation, and http://spark.apache.org/docs/latest/ml-guide.html shows how to construct pipelines.
I use the following method for oneHotEncoding a single column in a Spark dataFrame:
def ohcOneColumn(df, colName, debug=False):
colsToFillNa = []
if debug: print("Entering method ohcOneColumn")
countUnique = df.groupBy(colName).count().count()
if debug: print(countUnique)
collectOnce = df.select(colName).distinct().collect()
for uniqueValIndex in range(countUnique):
uniqueVal = collectOnce[uniqueValIndex][0]
if debug: print(uniqueVal)
newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
df = df.withColumn(newColName, df[colName]==uniqueVal)
colsToFillNa.append(newColName)
df = df.drop(colName)
df = df.na.fill(False, subset=colsToFillNa)
return df
I use the following method for oneHotEncoding Spark dataFrames:
from pyspark.sql.functions import col, countDistinct, approxCountDistinct
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoderEstimator
def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
if debug: print("Entering method detectAndLabelCat")
newDf = sparkDf
colList = sparkDf.columns
for colName in sparkDf.columns:
uniqueVals = sparkDf.groupBy(colName).count()
if debug: print(uniqueVals)
countUnique = uniqueVals.count()
dtype = str(sparkDf.schema[colName].dataType)
#dtype = str(df.schema[nc].dataType)
if (colName in excludeCols):
if debug: print(str(colName) + ' is in the excluded columns list.')
elif countUnique == 1:
newDf = newDf.drop(colName)
if debug:
print('dropping column ' + str(colName) + ' because it only contains one unique value.')
#end if debug
#elif (1==2):
elif ((countUnique < minValCount) | (dtype=="String") | (dtype=="StringType")):
if debug:
print(len(newDf.columns))
oldColumns = newDf.columns
newDf = ohcOneColumn(newDf, colName, debug=debug)
if debug:
print(len(newDf.columns))
newColumns = set(newDf.columns) - set(oldColumns)
print('Adding:')
print(newColumns)
for newColumn in newColumns:
if newColumn in newDf.columns:
try:
newUniqueValCount = newDf.groupBy(newColumn).count().count()
print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
except:
print('Uncaught error discussing ' + str(newColumn))
#else:
# newColumns.remove(newColumn)
print('Dropping:')
print(set(oldColumns) - set(newDf.columns))
else:
if debug: print('Nothing done for column ' + str(colName))
#end if countUnique == 1, elif countUnique other condition
#end outer for
return newDf
You can cast a string column type in a spark data frame to a numerical data type using the cast function.
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType, IntegerType
sqlContext = SQLContext(sc)
dataset = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('./data/titanic.csv')
dataset = dataset.withColumn("Age", dataset["Age"].cast(DoubleType()))
dataset = dataset.withColumn("Survived", dataset["Survived"].cast(IntegerType()))
In the above example, we read in a csv file as a data frame, cast the default string datatypes into integer and double, and overwrite the original data frame. We can then use the VectorAssembler to merge the features in a single vector and apply your favorite Spark ML algorithm.

Resources