I am trying to groupBy and then calculate a percentile on a PySpark dataframe. I tested the following piece of code, based on this Stack Overflow post:
from pyspark.sql.types import FloatType
import pyspark.sql.functions as func
import numpy as np
qt_udf = func.udf(lambda x,qt: float(np.percentile(x,qt)), FloatType())
df_out = df_in.groupBy('Id').agg(func.collect_list('value').alias('data'))\
.withColumn('median', qt_udf(func.col('data'),func.lit(0.5)).cast("string"))
df_out.show()
But get the following error:
Traceback (most recent call last):
  > df_out.show()
  ....
  > return lambda *a: f(*a)
AttributeError: 'module' object has no attribute 'percentile'
This is because of the numpy version (1.4.1); the percentile function was added in version 1.5. It is not possible to update the numpy version in the short term.
Define a window and use the inbuilt percent_rank function to compute percentile values.
from pyspark.sql import Window
from pyspark.sql import functions as func
w = Window.partitionBy(df_in.Id).orderBy(df_in.value) #assuming default ascending order
df_out = df_in.withColumn('percentile_col',func.percent_rank().over(w))
The question's title suggests that the OP wanted to calculate percentiles, but the body shows that they needed to calculate the median within groups.
Test dataset:
from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
df_in = spark.createDataFrame(
[('1', 10),
('1', 11),
('1', 12),
('1', 13),
('2', 20)],
['Id', 'value']
)
Percentiles of given data points in groups:
w = W.partitionBy('Id').orderBy('value')
df_in = df_in.withColumn('percentile_of_value_by_Id', F.percent_rank().over(w))
df_in.show()
#+---+-----+-------------------------+
#| Id|value|percentile_of_value_by_Id|
#+---+-----+-------------------------+
#| 1| 10| 0.0|
#| 1| 11| 0.3333333333333333|
#| 1| 12| 0.6666666666666666|
#| 1| 13| 1.0|
#| 2| 20| 0.0|
#+---+-----+-------------------------+
Median (accurate and approximate):
df_out = (df_in.groupBy('Id').agg(
F.expr('percentile(value, .5)').alias('median_accurate'), # for small-mid dfs
F.percentile_approx('value', .5).alias('median_approximate') # for mid-large dfs
))
df_out.show()
#+---+---------------+------------------+
#| Id|median_accurate|median_approximate|
#+---+---------------+------------------+
#| 1| 11.5| 11|
#| 2| 20.0| 20|
#+---+---------------+------------------+
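The same built-in functions generalize beyond the median: passing an array of percentages returns all requested quantiles at once. A sketch against the df_in defined above (percentile_approx with a list assumes Spark 3.1+):
df_quartiles = df_in.groupBy('Id').agg(
    F.expr('percentile(value, array(0.25, 0.5, 0.75))').alias('quartiles_accurate'),
    F.percentile_approx('value', [0.25, 0.5, 0.75]).alias('quartiles_approximate')
)
df_quartiles.show(truncate=False)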
I would like to create a data frame from all possible combinations of the values of each of the categories listed in the dictionary.
I tried the code below. It works fine for a small dictionary with fewer keys and values, but it fails for a larger dictionary like the one given below.
import itertools as it
import pandas as pd
my_dict= {
"A":[0,1,.....25],
"B":[4,5,.....35],
"C":[0,1,......30],
"D":[0,1,........35],
.........
"Y":[0,1,........35],
"Z":[0,1,........35],
}
df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
This is the error I get. How can I handle this problem with a large dictionary?
Traceback (most recent call last):
File "<ipython-input-11-723405257e95>", line 1, in <module>
df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
MemoryError
How can I create the data frame from such a large dictionary?
If you have a sufficiently large [1] Spark cluster, each list in the dictionary can be used as a Spark dataframe, and then all these dataframes can be cross-joined:
from functools import reduce

def to_spark_dfs(d):
    # Yield one single-column dataframe per dictionary key.
    for key in d:
        rows = [[e] for e in d[key]]
        yield spark.createDataFrame(rows, schema=[key])

dfs = to_spark_dfs(my_dict)
res = reduce(lambda df1, df2: df1.crossJoin(df2), dfs)
If the original my_dict is not too large
my_dict= {
"A":[0,1,2],
"B":[4,5,6],
"C":[0,1,2],
"D":[0,1],
"Y":[0,1,2],
"Z":[0,1],
}
the code produces the expected result:
res.show()
#+---+---+---+---+---+---+
#| A| B| C| D| Y| Z|
#+---+---+---+---+---+---+
#| 0| 4| 0| 0| 0| 0|
#| 0| 4| 0| 0| 0| 1|
#| 0| 4| 0| 0| 1| 0|
#| 0| 4| 0| 0| 1| 1|
#...
res.count()
#324
[1]
Using the numbers given in the comment (80 keys and approximately 30 values per key), you would need a really large Spark cluster: 30^80 gives about 1.5*10^118 different combinations. This is more than the estimated number of atoms (10^80) in the known, observable universe.
In this case, we have a huge number of possible combinations. For example, if the columns (A, B, C, ..., Z) can each take values [1...10], then the total row count equals 10^26, or 100,000,000,000,000,000,000,000,000.
To my mind, there are two main directions for solving this issue:
Horizontal scaling: calculate and store results using frameworks for distributed computing (such as Apache Spark or Hadoop)
Vertical scaling: optimize CPU/RAM utilization using:
vectorization (e.g. avoiding loops)
data types with minimal RAM footprint (use only as much precision as you need; use factorize() for strings)
mini-batching: offload intermediate results (data frames) from RAM to disk in a compressed format (e.g. Parquet)
benchmarking of execution time and object size in RAM.
Let me introduce the code that implements some of the concepts of the vertical scaling approach.
Define the following functions:
create_data_frame_baseline(): data frame generator with loop, not optimal data types (baseline)
create_data_frame_no_loop(): no loop, not optimal data types
create_data_frame_optimize_data_type(): no loop, optimal data types.
import itertools as it
import pandas as pd
import numpy as np
from string import ascii_uppercase
def create_letter_dict(cols_n: int = 10, levels_n: int = 6) -> dict:
letter_dict = {letter: list(range(levels_n)) for letter in ascii_uppercase[0:cols_n]}
return letter_dict
def create_data_frame_baseline(letter_dict: dict) -> pd.DataFrame:
    # Baseline: grow the dataframe row by row (very slow).
    df = pd.DataFrame(columns=letter_dict.keys())
    for row in it.product(*letter_dict.values()):
        df.loc[len(df.index)] = row
    return df

def create_data_frame_no_loop(letter_dict: dict) -> pd.DataFrame:
    # Build the whole dataframe in one shot (default int64 dtype).
    return pd.DataFrame(
        list(it.product(*letter_dict.values())),
        columns=letter_dict.keys()
    )

def create_data_frame_optimize_data_type(letter_dict: dict) -> pd.DataFrame:
    # Same as above, but downcast to int8 to shrink the RAM footprint.
    return pd.DataFrame(
        np.int8(list(it.product(*letter_dict.values()))),
        columns=letter_dict.keys()
    )
Benchmarks:
import sys
import timeit
cols_n = 7
levels_n = 5
iteration_n = 2
# Baseline
def create_data_frame_baseline_test():
my_dict = create_letter_dict(cols_n, levels_n)
df = create_data_frame_baseline(my_dict)
assert(df.shape == (levels_n**cols_n, cols_n))
print(sys.getsizeof(df))
return df
print(timeit.Timer(create_data_frame_baseline_test).timeit(number=iteration_n))
# No loop, not optimal data types
def create_data_frame_no_loop_test():
my_dict = create_letter_dict(cols_n, levels_n)
df = create_data_frame_no_loop(my_dict)
assert(df.shape == (levels_n**cols_n, cols_n))
print(sys.getsizeof(df))
return df
print(timeit.Timer(create_data_frame_no_loop_test).timeit(number=iteration_n))
# No loop, optimal data types.
def create_data_frame_optimize_data_type_test():
my_dict = create_letter_dict(cols_n, levels_n)
df = create_data_frame_optimize_data_type(my_dict)
assert(df.shape == (levels_n**cols_n, cols_n))
print(sys.getsizeof(df))
return df
print(timeit.Timer(create_data_frame_optimize_data_type_test).timeit(number=iteration_n))
Outputs*:
| Function                                  | Dataframe shape | RAM size, Mb | Execution time, sec |
|-------------------------------------------|-----------------|--------------|---------------------|
| create_data_frame_baseline_test           | 78125x7         | 19           | 485                 |
| create_data_frame_no_loop_test            | 78125x7         | 4.4          | 0.20                |
| create_data_frame_optimize_data_type_test | 78125x7         | 0.55         | 0.16                |
Using create_data_frame_optimize_data_type_test() I generated* 100M rows in less than 100 seconds.
* Ubuntu Server 20.04, Intel(R) Xeon(R) 8xCPU @ 2.60GHz, 32GB RAM
In your case you cannot generate all possible combinations at once using list(); instead, generate them in a loop, for example:
import itertools as it
import pandas as pd
from string import ascii_uppercase
N = 36
my_dict = {x: list(range(N)) for x in ascii_uppercase}
df = pd.DataFrame(columns=my_dict.keys())
for row in it.product(*my_dict.values()):
df.loc[len(df.index)] = row
but of course it takes a long time.
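To keep memory bounded, you can combine this with the mini-batching idea from above: consume the product iterator in fixed-size chunks and flush each chunk to disk as Parquet. A minimal sketch (the function name, output path and batch size are illustrative, and to_parquet assumes pyarrow or fastparquet is installed):
import itertools as it
import numpy as np
import pandas as pd

def write_combinations_in_batches(my_dict, out_dir, batch_size=1_000_000):
    # Stream the Cartesian product in chunks instead of list()-ing it all at once.
    product_iter = it.product(*my_dict.values())
    columns = list(my_dict.keys())
    part = 0
    while True:
        chunk = list(it.islice(product_iter, batch_size))
        if not chunk:
            break
        # int8 assumes all values fit in [-128, 127]; widen the dtype otherwise.
        pd.DataFrame(np.int8(chunk), columns=columns).to_parquet(
            f"{out_dir}/part-{part:05d}.parquet"
        )
        part += 1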
I'm trying to cross-validate an RF model in PySpark with the code below, and it throws an error:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Your code
trainData = raw_data_
numFolds = 5
rf = RandomForestClassifier(labelCol="Target", featuresCol="Scaled_features")
evaluator = MulticlassClassificationEvaluator()
pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder()\
.addGrid(rf.numTrees, [3, 10])\
.build())
crossval = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=numFolds)
tr_model = crossval.fit(trainData)
But this results in an error.
My raw_data_ dataframe looks like this:
| features|Position_Group| Scaled_features|Target|
+--------------------+--------------+--------------------+------+
|[173.735992431640...| FWD|[12.9261366722264...| 0|
|[188.975997924804...| FWD|[14.0600087682323...| 0|
|[179.832000732421...| FWD|[13.3796859647366...| 0|
|[155.752807617187...| MID|[11.5881692110224...| 2|
|[176.783996582031...| FWD|[13.1529113184815...| 0|
|[176.783996582031...| MID|[13.1529113184815...| 2|
|[182.880004882812...| FWD|[13.6064606109917...| 0|
|[182.880004882812...| DEF|[13.6064606109917...| 1|
|[182.880004882812...| FWD|[13.6064606109917...| 0|
|[182.880004882812...| MID|[13.6064606109917...| 2|
|[188.975997924804...| DEF|[14.0600087682323...| 1|
|[176.783996582031...| MID|[13.1529113184815...| 2|
|[170.688003540039...| MID|[12.6993631612409...| 2|
|[155.447998046875...| FWD|[11.5654910652351...| 0|
|[188.975997924804...| FWD|[14.0600087682323...| 0|
|[179.832000732421...| MID|[13.3796859647366...| 2|
|[188.975997924804...| MID|[14.0600087682323...| 2|
|[185.927993774414...| FWD|[13.8332341219772...| 0|
|[176.783996582031...| FWD|[13.1529113184815...| 0|
|[188.975997924804...| DEF|[14.0600087682323...| 1|
+--------------------+--------------+--------------------+------+
Any suggestions on why and where the issue is happening? How can the issue be resolved?
Thanks
The error says
Error while calling evaluate. Field "label" does not exist.
which suggests that something's wrong with the evaluator. In your definition of the evaluator, you did not specify the label column, so the evaluator attempts to use the default "label" column, but that does not exist.
To resolve this, specify the label column when instantiating the evaluator, just as you did for the classifier, e.g.
evaluator = MulticlassClassificationEvaluator(labelCol="Target")
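For completeness, a sketch of the full evaluator definition; predictionCol and metricName below are the defaults, spelled out only for illustration:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Point the evaluator at the same label column the classifier uses.
evaluator = MulticlassClassificationEvaluator(
    labelCol="Target",
    predictionCol="prediction",  # default output column of the classifier
    metricName="f1"              # default metric; "accuracy" etc. also work
)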
I have a column in libsvm format (Spark's ml library): field1:value field2:value ...
+--------------+-----+
| features|label|
+--------------+-----+
| a:1 b:2 c:3| 0|
| a:4 b:5 c:6| 0|
| a:7 b:8 c:9| 1|
|a:10 b:11 c:12| 0|
+--------------+-----+
I want to extract the values and save them in an array for each row in pyspark.
features.printSchema()
root
|-- features: string (nullable = false)
|-- label: integer (nullable = true)
I am using the following udf because the affected column is part of a dataframe:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors
features_expl = udf(lambda features: Vectors.dense(features.split(" ")).map(lambda feat: float(str(feat.split(":")[1]))))
features=features.withColumn("feats", features_expl(features.features))
The result I get is:
ValueError: could not convert string to float: mobile:0.0
It seems that it doesn't perform the second split and calls float() on a string.
What I would like to get is:
+--------------+-----+
| features|label|
+--------------+-----+
| [1, 2, 3]| 0|
| [4, 5, 6]| 0|
| [7, 8, 9]| 1|
| [10, 11, 12]| 0|
+--------------+-----+
You have two major problems with your udf. Firstly, it doesn't work as you intended. Consider the heart of your code as the following function:
from pyspark.ml.linalg import Vectors
def features_expl_non_udf(features):
    return Vectors.dense(features.split(" ")).map(
        lambda feat: float(str(feat.split(":")[1]))
    )
If you call it on one of your strings:
features_expl_non_udf("a:1 b:2 c:3")
#ValueError: could not convert string to float: a:1
This is because features.split(" ") returns ['a:1', 'b:2', 'c:3'], which you pass straight to the Vectors.dense constructor; those strings cannot be interpreted as numbers.
What you intended to do was first split on space, then split each value of the resultant list on :. Then you can convert these values to float and pass the list to Vectors.dense.
Here is the proper implementation of your logic:
def features_expl_non_udf(features):
return Vectors.dense(map(lambda feat: float(feat.split(":")[1]), features.split()))
features_expl_non_udf("a:1 b:2 c:3")
#DenseVector([1.0, 2.0, 3.0])
Now the second problem with your udf is that you didn't specify a returnType. For a DenseVector you need to use VectorUDT as the returnType.
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT
features_expl = udf(
lambda features: Vectors.dense(
map(lambda feat: float(feat.split(":")[1]), features.split())
),
VectorUDT()
)
features.withColumn("feats", features_expl(features.features)).show()
#+--------------+-----+----------------+
#| features|label| feats|
#+--------------+-----+----------------+
#| a:1 b:2 c:3| 0| [1.0,2.0,3.0]|
#| a:4 b:5 c:6| 0| [4.0,5.0,6.0]|
#| a:7 b:8 c:9| 1| [7.0,8.0,9.0]|
#|a:10 b:11 c:12| 0|[10.0,11.0,12.0]|
#+--------------+-----+----------------+
As an alternative, you can do the string processing part on the spark side using regexp_replace and split but you'll still have to use a udf to convert the final output to a DenseVector.
from pyspark.sql.functions import regexp_replace, split, udf
from pyspark.ml.linalg import Vectors, VectorUDT
toDenseVector = udf(Vectors.dense, VectorUDT())
features.withColumn(
"features",
toDenseVector(
split(regexp_replace("features", r"\w+:", ""), r"\s+").cast("array<float>")
)
).show()
#+----------------+-----+
#| features|label|
#+----------------+-----+
#| [1.0,2.0,3.0]| 0|
#| [4.0,5.0,6.0]| 0|
#| [7.0,8.0,9.0]| 1|
#|[10.0,11.0,12.0]| 0|
#+----------------+-----+
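As a further alternative, on Spark 3.1+ pyspark.ml.functions.array_to_vector removes the need for a udf altogether; a sketch under that version assumption:
from pyspark.sql.functions import regexp_replace, split
from pyspark.ml.functions import array_to_vector  # available since Spark 3.1

features.withColumn(
    "features",
    array_to_vector(
        split(regexp_replace("features", r"\w+:", ""), r"\s+").cast("array<double>")
    )
).show()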
How do I handle categorical data with spark-ml and not spark-mllib?
Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.
However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.
How should I proceed?
I just wanted to complete Holden's answer.
Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0. Please use OneHotEncoderEstimator instead.
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array(indexer.getOutputCol, "category2"))
.setOutputCols(Array("category1Vec", "category2Vec"))
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
// | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])
indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
# | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
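Note that in Spark 3.0 OneHotEncoderEstimator was renamed back to OneHotEncoder while keeping the multi-column API, so on 3.x the pipeline above becomes (a sketch):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder  # Spark 3.x name

indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
encoder = OneHotEncoder(inputCols=[indexer.getOutputCol(), "category2"],
                        outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()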
Since Spark 1.4.0, MLlib also supplies the OneHotEncoder feature, which maps a column of label indices to a column of binary vectors with at most a single one-value.
This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
Let's consider the following DataFrame:
val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c"))
.toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| a| 0.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder:
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show
// +---+-------------+
// | id| categoryVec|
// +---+-------------+
// | 0|(2,[0],[1.0])|
// | 1| (2,[],[])|
// | 2|(2,[1],[1.0])|
// | 3|(2,[0],[1.0])|
// | 4|(2,[0],[1.0])|
// | 5|(2,[1],[1.0])|
// +---+-------------+
I am going to provide an answer from another perspective, since I was also wondering about categorical features with regards to tree-based models in Spark ML (not MLlib), and the documentation is not that clear how everything works.
When you transform a column in your dataframe using pyspark.ml.feature.StringIndexer extra meta-data gets stored in the dataframe that specifically marks the transformed feature as a categorical feature.
When you print the dataframe you will see a numeric value (an index that corresponds to one of your categorical values), and if you look at the schema you will see that your new transformed column is of type double. However, this new column created with pyspark.ml.feature.StringIndexer.transform is not just a normal double column; it has extra metadata associated with it that is very important. You can inspect this metadata by looking at the metadata property of the appropriate field in your dataframe's schema (accessible via yourDataframe.schema).
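For example, a quick way to peek at that metadata (a sketch assuming the indexed column is named categoryIndex):
# StringIndexer stores nominal-attribute metadata on its output column.
print(df.schema["categoryIndex"].metadata)
# e.g. {'ml_attr': {'type': 'nominal', 'vals': ['a', 'c', 'b'], 'name': 'categoryIndex'}}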
This extra metadata has two important implications:
When you call .fit() with a tree-based model, it will scan the metadata of your dataframe and recognize fields that you encoded as categorical with transformers such as pyspark.ml.feature.StringIndexer (as noted above, other transformers such as pyspark.ml.feature.VectorIndexer also have this effect). Because of this, you DO NOT have to one-hot encode your features after you have transformed them with StringIndexer when using tree-based models in Spark ML (however, you still have to perform one-hot encoding when using other models that do not naturally handle categoricals, like linear regression, etc.).
Because this metadata is stored in the data frame, you can use pyspark.ml.feature.IndexToString to reverse the numeric indices back to the original categorical values (which are often strings) at any time.
There is a component of the ML pipeline called StringIndexer you can use to convert your strings to Doubles in a reasonable way. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer has more documentation, and http://spark.apache.org/docs/latest/ml-guide.html shows how to construct pipelines.
I use the following method for oneHotEncoding a single column in a Spark dataFrame:
def ohcOneColumn(df, colName, debug=False):
colsToFillNa = []
if debug: print("Entering method ohcOneColumn")
countUnique = df.groupBy(colName).count().count()
if debug: print(countUnique)
collectOnce = df.select(colName).distinct().collect()
for uniqueValIndex in range(countUnique):
uniqueVal = collectOnce[uniqueValIndex][0]
if debug: print(uniqueVal)
newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
df = df.withColumn(newColName, df[colName]==uniqueVal)
colsToFillNa.append(newColName)
df = df.drop(colName)
df = df.na.fill(False, subset=colsToFillNa)
return df
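A usage sketch (the dataframe and column names are illustrative, and a SparkSession named spark is assumed):
df = spark.createDataFrame([("red",), ("blue",), ("red",)], ["color"])
# Produces boolean columns named <col>_<value>_TF, e.g. color_red_TF and color_blue_TF.
ohcOneColumn(df, "color").show()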
I use the following method for oneHotEncoding Spark dataFrames:
from pyspark.sql.functions import col, countDistinct, approxCountDistinct
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoderEstimator
def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
if debug: print("Entering method detectAndLabelCat")
newDf = sparkDf
colList = sparkDf.columns
for colName in sparkDf.columns:
uniqueVals = sparkDf.groupBy(colName).count()
if debug: print(uniqueVals)
countUnique = uniqueVals.count()
dtype = str(sparkDf.schema[colName].dataType)
#dtype = str(df.schema[nc].dataType)
if (colName in excludeCols):
if debug: print(str(colName) + ' is in the excluded columns list.')
elif countUnique == 1:
newDf = newDf.drop(colName)
if debug:
print('dropping column ' + str(colName) + ' because it only contains one unique value.')
#end if debug
#elif (1==2):
elif ((countUnique < minValCount) | (dtype=="String") | (dtype=="StringType")):
if debug:
print(len(newDf.columns))
oldColumns = newDf.columns
newDf = ohcOneColumn(newDf, colName, debug=debug)
if debug:
print(len(newDf.columns))
newColumns = set(newDf.columns) - set(oldColumns)
print('Adding:')
print(newColumns)
for newColumn in newColumns:
if newColumn in newDf.columns:
try:
newUniqueValCount = newDf.groupBy(newColumn).count().count()
print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
except:
print('Uncaught error discussing ' + str(newColumn))
#else:
# newColumns.remove(newColumn)
print('Dropping:')
print(set(oldColumns) - set(newDf.columns))
else:
if debug: print('Nothing done for column ' + str(colName))
#end if countUnique == 1, elif countUnique other condition
#end outer for
return newDf
You can cast a string column type in a spark data frame to a numerical data type using the cast function.
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType, IntegerType
sqlContext = SQLContext(sc)
dataset = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('./data/titanic.csv')
dataset = dataset.withColumn("Age", dataset["Age"].cast(DoubleType()))
dataset = dataset.withColumn("Survived", dataset["Survived"].cast(IntegerType()))
In the above example, we read in a csv file as a data frame, cast the default string datatypes into integer and double, and overwrite the original data frame. We can then use the VectorAssembler to merge the features in a single vector and apply your favorite Spark ML algorithm.
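As a sketch of that last step (keeping only Age as a feature here, since Survived would be the label; the na.drop is needed because VectorAssembler rejects nulls by default):
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["Age"], outputCol="features")
assembled = assembler.transform(dataset.na.drop(subset=["Age"]))
assembled.select("Age", "features", "Survived").show(5)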
I have created a data frame by reading a csv file with sqlContext, and I need to convert a column of the table to an RDD and then to dense vectors to perform matrix multiplication.
I am finding it difficult to do so.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("inferSchema","true")
.load("/home/project/SparkRead/train.csv")
val result1 = sqlContext.sql("SELECT Sales from train").rdd
How do I convert it to a dense vector?
You can convert the dataframe's columns to a vector using VectorAssembler. Check out the code below:
val df = spark.read.
format("com.databricks.spark.csv").
option("header","true").
option("inferSchema","true").
load("/tmp/train.csv")
// assuming input
// a,b,c,d
// 1,2,3,4
// 1,1,2,3
// 1,3,4,5
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().
setInputCols(Array("a", "b", "c", "d")).
setOutputCol("vect")
val output = assembler.transform(df)
// show the result
output.show()
// +---+---+---+---+-----------------+
// | a| b| c| d| vect|
// +---+---+---+---+-----------------+
// | 1| 2| 3| 4|[1.0,2.0,3.0,4.0]|
// | 1| 1| 2| 3|[1.0,1.0,2.0,3.0]|
// | 1| 3| 4| 5|[1.0,3.0,4.0,5.0]|
// +---+---+---+---+-----------------+