How to randomly shuffle the values of only one column in pyspark? - apache-spark

I want to break the correlation between a column and the rest of the dataframe. I want to do this while maintaining the distribution of the values in the said column
In pandas, I used to achieve this by simply shuffling the values of a column and then assigning the values to the column. It is not so straightforward in the case of pyspark because of the data being partitioned. I do not think there is even a way in pyspark to set a new column in a dataframe with a column from another dataframe
So, In pyspark, how do I achieve the following?:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
Also, I'm hoping you give me a solution that does not include unnecessary shuffles.
one way to do it is:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import row_number,lit,rand
from pyspark.sql.window import Window
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
dfs = spark.createDataFrame(df)
w = Window().orderBy(lit('A'))
dfs = dfs.withColumn("row_num", row_number().over(w))
dfs_ts = dfs.select('b')
dfs_ts = dfs_ts.withColumn('o',rand()).orderBy('o')
dfs = dfs.drop('b')
dfs_ts = dfs_ts.drop('o')
w = Window().orderBy(lit('A'))
dfs_ts = dfs_ts.withColumn("row_num", row_number().over(w))
dfs = dfs.join(dfs_ts,on='row_num').drop('row_num')
But, I do not need the shuffles that come with join and they are not necessary. If a blind hstack is possible per partition basis in pyspark that should be enough. Also, the window function tells me that I have not defined any partitions so all my data would be collected to one partition. Might as well use pandas in that case

I think you could achieve something like that on the RDD level. Not sure if it would be worth all the extra steps, and depending on the size of the partitions it might not perform really well because you'd have to hold all the values of the partition in memory for the shuffle.
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
df = pd.DataFrame({'a':[x for x in range(10)],'b':[x for x in range(10)]})
dfs = spark.createDataFrame(df)
dfs.show()
rdd_a = dfs.select("a").rdd
rdd_b = dfs.select("b").rdd
from random import shuffle
def f(iterator):
items = list(iterator)
shuffle(items)
for x in items:
yield x
from pyspark import Row
def reconstruct(x):
return Row(a=x[0].a, b=x[1].b)
rdd_b_reordered = rdd_b.mapPartitions(f)
df_reordered = spark.createDataFrame(rdd_a.zip(rdd_b_reordered).map(reconstruct))
df_reordered.show()
"""
My output:
+---+---+
| a| b|
+---+---+
| 0| 2|
| 1| 4|
| 2| 0|
| 3| 1|
| 4| 3|
| 5| 7|
| 6| 9|
| 7| 5|
| 8| 6|
| 9| 8|
+---+---+
"""
Maybe you can tweak that to your needs. This would only shuffle the things over each partition to avoid the partition shuffle. Also probably better to do it in Scala.

I am not fully confident in this answer, but I think it's right.
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you can keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
So if we get your Pandas example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
We can just wrap the .sample expression in a function.
def shuffle(df: pd.DataFrame) -> pd.DataFrame:
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
return df
And then we can bring it to Spark using the transform function. The ``transform` function can take in both Pandas and Spark DataFrames and then will convert it to Spark if you are using the Spark engine.
from fugue import transform
import fugue_spark
transform(df, shuffle, schema="*", engine="spark")
But of course, this transform is applied per partition. If you don't specify the partitions, it uses the default partitions. If you want to shuffle to really randomize it, you can do:
transform(df, shuffle, schema="*", engine="spark", partition={"algo":"rand"}).show()
And Fugue will partition your data randomly before this operation.
You may not see the shuffle for your test case because the data is really small. If you end up with 4 partitions with 1 row each, they will end up returning the same value.

Related

How to create a data frame from the all possible combination of values of each of the categories listed in the large dictionary

I would like to create a data frame from the all possible combination of values of each of the categories listed in the dictionary.
I tried the below code, it is working fine for small dictionary with lesser key and values. But it is not getting executed for larger dictionary as i have given below.
import itertools as it
import pandas as pd
my_dict= {
"A":[0,1,.....25],
"B":[4,5,.....35],
"C":[0,1,......30],
"D":[0,1,........35],
.........
"Y":[0,1,........35],
"Z":[0,1,........35],
}
df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
This is the error i get, how to handle this problem with large dictionary.
Traceback (most recent call last):
  File "<ipython-input-11-723405257e95>", line 1, in <module>
    df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
MemoryError
How to handle with the large dictionary to create data frame
If you have a sufficiently large [1] Spark cluster, each list in dictionary can be used as a Spark dataframe and then all these dataframes can be cross-joined:
def to_spark_dfs(dict):
for key in dict:
l=[[e] for e in dict[key]]
yield spark.createDataFrame(l, schema=[key])
dfs=to_spark_dfs(my_dict)
from functools import reduce
res=reduce(lambda df1,df2: df1.crossJoin(df2),dfs)
If the original my_dict is not too large
my_dict= {
"A":[0,1,2],
"B":[4,5,6],
"C":[0,1,2],
"D":[0,1],
"Y":[0,1,2],
"Z":[0,1],
}
the code produces the expected result:
res.show()
#+---+---+---+---+---+---+
#| A| B| C| D| Y| Z|
#+---+---+---+---+---+---+
#| 0| 4| 0| 0| 0| 0|
#| 0| 4| 0| 0| 0| 1|
#| 0| 4| 0| 0| 1| 0|
#| 0| 4| 0| 0| 1| 1|
#...
res.count()
#324
[1]
Using the numbers given in the comment (80 keys and approx 30 values per key) you would need a really large Spark cluster: 30 ^ 80 gives 1.5*10^118 different combination. This is more than the estimated number of atoms (10^80) in the known, observable universe.
In this case, we have a huge number of possible combinations. For example, if columns (A, B, C... Z) can take values [1...10] then total count of rows are equal 10^26, or 100000000000000000000000000.
In my mind there are 2 main directions to solve this issue:
Horizontal scaling: calculate and store results using frameworks for distributed computing (such as Apache Spark or Hadoop)
Vertical scaling: optimize CPU/RAM utilization using:
Vectorization (e.g. avoid loops)
Data types with minimal impact to RAM allocation (use as minimal precision as you need, use factorize() for strings)
mini-batching and download intermediate results (data frames) from RAM to disc in zipped format (e.g. parquet)
benchmark the execution time and object size in RAM.
Let me introduce the code that implements some of the concepts of the vertical scaling approach.
Define the following functions:
create_data_frame_baseline(): data frame generator with loop, not optimal data types (baseline)
create_data_frame_no_loop(): no loop, not optimal data types
create_data_frame_optimize_data_type(): no loop, optimal data types.
import itertools as it
import pandas as pd
import numpy as np
from string import ascii_uppercase
def create_letter_dict(cols_n: int = 10, levels_n: int = 6) -> dict:
letter_dict = {letter: list(range(levels_n)) for letter in ascii_uppercase[0:cols_n]}
return letter_dict
def create_data_frame_baseline(dict: dict) -> pd.DataFrame:
df = pd.DataFrame(columns=dict.keys())
for row in it.product(*dict.values()):
df.loc[len(df.index)] = row
return df
def create_data_frame_no_loop(dict: dict) -> pd.DataFrame:
return pd.DataFrame(
list(it.product(*dict.values())),
columns=dict.keys()
)
def create_data_frame_optimize_data_type(dict: dict) -> pd.DataFrame:
return pd.DataFrame(
np.int8(list(it.product(*dict.values()))),
columns=dict.keys()
)
Benchmarks:
import sys
import timeit
cols_n = 7
levels_n = 5
iteration_n = 2
# Baseline
def create_data_frame_baseline_test():
my_dict = create_letter_dict(cols_n, levels_n)
df = create_data_frame_baseline(my_dict)
assert(df.shape == (levels_n**cols_n, cols_n))
print(sys.getsizeof(df))
return df
print(timeit.Timer(create_data_frame_baseline_test).timeit(number=iteration_n))
# No loop, not optimal data types
def create_data_frame_no_loop_test():
my_dict = create_letter_dict(cols_n, levels_n)
df = create_data_frame_no_loop(my_dict)
assert(df.shape == (levels_n**cols_n, cols_n))
print(sys.getsizeof(df))
return df
print(timeit.Timer(create_data_frame_no_loop_test).timeit(number=iteration_n))
# No loop, optimal data types.
def create_data_frame_optimize_data_type_test():
my_dict = create_letter_dict(cols_n, levels_n)
df = create_data_frame_optimize_data_type(my_dict)
assert(df.shape == (levels_n**cols_n, cols_n))
print(sys.getsizeof(df))
return df
print(timeit.Timer(create_data_frame_optimize_data_type_test).timeit(number=iteration_n))
Outputs*:
Function
Dataframe shape
RAM size, Mb
Execution time, sec
create_data_frame_baseline_test
78125x7
19
485
create_data_frame_no_loop_test
78125x7
4.4
0.20
create_data_frame_optimize_data_type_test
78125x7
0.55
0.16
Using create_data_frame_optimize_data_type_test() I generated* 100M rows in less than 100 seconds.
* Ubuntu Server 20.04, Intel(R) Xeon(R) 8xCPU # 2.60GHz, 32GB RAM
In your case you can not generate all possible combination at once, by using the list() but do it in loop, for example:
import itertools as it
import pandas as pd
from string import ascii_uppercase
N = 36
my_dict = {x: list(range(N)) for x in ascii_uppercase}
df = pd.DataFrame(columns=my_dict.keys())
for row in it.product(*my_dict.values()):
df.loc[len(df.index)] = row
but of cause it take a long time

Spark gives Error when creating DataFrame

I have downloaded spark version 2.3.1 and hadoop version 2.7 and java jdk 8.
Every thing works fine for simple exercises, but when i tried to create dataframe. it start to though error.
the following code runs with out error.
import numpy as np
TOTAL = 1000000
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())
but when i tried the following code requires the input to change into dataframe
df = sc.parallelize([Row(name='ab',age=20), Row(name='ab',age=20)]).toDF()
it throws the following error
you were missing import for Row.
from pyspark.sql import Row
df = sc.parallelize([Row(name='ab',age=20), Row(name='ab',age=20)]).toDF()
df.show()
Result:
+---+----+
|age|name|
+---+----+
| 20| ab|
| 20| ab|
+---+----+

User defined function to be applied to Window in PySpark?

I am trying to apply a user defined function to Window in PySpark. I have read that UDAF might be the way to to go, but I was not able to find anything concrete.
To give an example (taken from here: Xinh's Tech Blog and modified for PySpark):
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg
spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()
a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ['ind', "state"])
customers = spark.createDataFrame([["Alice", "2016-05-01", 50.00],
["Alice", "2016-05-03", 45.00],
["Alice", "2016-05-04", 55.00],
["Bob", "2016-05-01", 25.00],
["Bob", "2016-05-04", 29.00],
["Bob", "2016-05-06", 27.00]],
["name", "date", "amountSpent"])
customers.show()
window_spec = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
result = customers.withColumn( "movingAvg", avg(customers["amountSpent"]).over(window_spec))
result.show()
I am applying avg which is already built into pyspark.sql.functions, but if instead of avg I wanted to use something of more complicated and write my own function, how would I do that?
Spark >= 3.0:
SPARK-24561 - User-defined window functions with pandas udf (bounded window) is a a work in progress. Please follow the related JIRA for details.
Spark >= 2.4:
SPARK-22239 - User-defined window functions with pandas udf (unbounded window) introduced support for Pandas based window functions with unbounded windows. General structure is
return_type: DataType
#pandas_udf(return_type, PandasUDFType.GROUPED_AGG)
def f(v):
return ...
w = (Window
.partitionBy(grouping_column)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn('foo', f('bar').over(w))
Please see the doctests and the unit tests for detailed examples.
Spark < 2.4
You cannot. Window functions require UserDefinedAggregateFunction or equivalent object, not UserDefinedFunction, and it is not possible to define one in PySpark.
However, in PySpark 2.3 or later, you can define vectorized pandas_udf, which can be applied on grouped data. You can find a working example Applying UDFs on GroupedData in PySpark (with functioning python example). While Pandas don't provide direct equivalent of window functions, there are expressive enough to implement any window-like logic, especially with pandas.DataFrame.rolling. Furthermore function used with GroupedData.apply can return arbitrary number of rows.
You can also call Scala UDAF from PySpark Spark: How to map Python with Scala or Java User Defined Functions?.
UDFs can be applied to Window now as of Spark 3.0.0.
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html
Extract from the documentation:
from pyspark.sql import Window
#pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
return v.mean()
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
w = Window.partitionBy('id').orderBy('v').rowsBetween(-1, 0)
df.withColumn('mean_v', mean_udf("v").over(w)).show()
+---+----+------+
| id| v|mean_v|
+---+----+------+
| 1| 1.0| 1.0|
| 1| 2.0| 1.5|
| 2| 3.0| 3.0|
| 2| 5.0| 4.0|
| 2|10.0| 7.5|
+---+----+------+

How can I get the indices of categorical variables in a Spark DataFrame? [duplicate]

How do I handle categorical data with spark-ml and not spark-mllib ?
Thought the documentation is not very clear, it seems that classifiers e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.
However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.
How should I proceed?
I just wanted to complete Holden's answer.
Since Spark 2.3.0,OneHotEncoder has been deprecated and it will be removed in 3.0.0. Please use OneHotEncoderEstimator instead.
In Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)).toDF("id", "category1", "category2")
val indexer = new StringIndexer().setInputCol("category1").setOutputCol("category1Index")
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array(indexer.getOutputCol, "category2"))
.setOutputCols(Array("category1Vec", "category2Vec"))
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
pipeline.fit(df).transform(df).show
// +---+---------+---------+--------------+-------------+-------------+
// | id|category1|category2|category1Index| category1Vec| category2Vec|
// +---+---------+---------+--------------+-------------+-------------+
// | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
// | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
// | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
// | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
// +---+---------+---------+--------------+-------------+-------------+
In Python:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
df = spark.createDataFrame([(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)], ["id", "category1", "category2"])
indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
inputs = [indexer.getOutputCol(), "category2"]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])
pipeline = Pipeline(stages=[indexer, encoder])
pipeline.fit(df).transform(df).show()
# +---+---------+---------+--------------+-------------+-------------+
# | id|category1|category2|category1Index| categoryVec1| categoryVec2|
# +---+---------+---------+--------------+-------------+-------------+
# | 0| a| 1| 0.0|(2,[0],[1.0])|(4,[1],[1.0])|
# | 1| b| 2| 2.0| (2,[],[])|(4,[2],[1.0])|
# | 2| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# | 3| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 4| a| 4| 0.0|(2,[0],[1.0])| (4,[],[])|
# | 5| c| 3| 1.0|(2,[1],[1.0])|(4,[3],[1.0])|
# +---+---------+---------+--------------+-------------+-------------+
Since Spark 1.4.0, MLLib also supplies OneHotEncoder feature, which maps a column of label indices to a column of binary vectors, with at most a single one-value.
This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
Let's consider the following DataFrame:
val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c"))
.toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| a| 0.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
You can then encode the categoryIndex with OneHotEncoder :
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show
// +---+-------------+
// | id| categoryVec|
// +---+-------------+
// | 0|(2,[0],[1.0])|
// | 1| (2,[],[])|
// | 2|(2,[1],[1.0])|
// | 3|(2,[0],[1.0])|
// | 4|(2,[0],[1.0])|
// | 5|(2,[1],[1.0])|
// +---+-------------+
I am going to provide an answer from another perspective, since I was also wondering about categorical features with regards to tree-based models in Spark ML (not MLlib), and the documentation is not that clear how everything works.
When you transform a column in your dataframe using pyspark.ml.feature.StringIndexer extra meta-data gets stored in the dataframe that specifically marks the transformed feature as a categorical feature.
When you print the dataframe you will see a numeric value (which is an index that corresponds with one of your categorical values) and if you look at the schema you will see that your new transformed column is of type double. However, this new column you created with pyspark.ml.feature.StringIndexer.transform is not just a normal double column, it has extra meta-data associated with it that is very important. You can inspect this meta-data by looking at the metadata property of the appropriate field in your dataframe's schema (you can access the schema objects of your dataframe by looking at yourdataframe.schema)
This extra metadata has two important implications:
When you call .fit() when using a tree based model, it will scan the meta-data of your dataframe and recognize fields that you encoded as categorical with transformers such as pyspark.ml.feature.StringIndexer (as noted above there are other transformers that will also have this effect such as pyspark.ml.feature.VectorIndexer). Because of this, you DO NOT have to one-hot encode your features after you have transformed them with StringIndxer when using tree-based models in spark ML (however, you still have to perform one-hot encoding when using other models that do not naturally handle categoricals like linear regression, etc.).
Because this metadata is stored in the data frame, you can use pyspark.ml.feature.IndexToString to reverse the numeric indices back to the original categorical values (which are often strings) at any time.
There is a component of the ML pipeline called StringIndexer you can use to convert your strings to Double's in a reasonable way. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer has more documentation, and http://spark.apache.org/docs/latest/ml-guide.html shows how to construct pipelines.
I use the following method for oneHotEncoding a single column in a Spark dataFrame:
def ohcOneColumn(df, colName, debug=False):
colsToFillNa = []
if debug: print("Entering method ohcOneColumn")
countUnique = df.groupBy(colName).count().count()
if debug: print(countUnique)
collectOnce = df.select(colName).distinct().collect()
for uniqueValIndex in range(countUnique):
uniqueVal = collectOnce[uniqueValIndex][0]
if debug: print(uniqueVal)
newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
df = df.withColumn(newColName, df[colName]==uniqueVal)
colsToFillNa.append(newColName)
df = df.drop(colName)
df = df.na.fill(False, subset=colsToFillNa)
return df
I use the following method for oneHotEncoding Spark dataFrames:
from pyspark.sql.functions import col, countDistinct, approxCountDistinct
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoderEstimator
def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
if debug: print("Entering method detectAndLabelCat")
newDf = sparkDf
colList = sparkDf.columns
for colName in sparkDf.columns:
uniqueVals = sparkDf.groupBy(colName).count()
if debug: print(uniqueVals)
countUnique = uniqueVals.count()
dtype = str(sparkDf.schema[colName].dataType)
#dtype = str(df.schema[nc].dataType)
if (colName in excludeCols):
if debug: print(str(colName) + ' is in the excluded columns list.')
elif countUnique == 1:
newDf = newDf.drop(colName)
if debug:
print('dropping column ' + str(colName) + ' because it only contains one unique value.')
#end if debug
#elif (1==2):
elif ((countUnique < minValCount) | (dtype=="String") | (dtype=="StringType")):
if debug:
print(len(newDf.columns))
oldColumns = newDf.columns
newDf = ohcOneColumn(newDf, colName, debug=debug)
if debug:
print(len(newDf.columns))
newColumns = set(newDf.columns) - set(oldColumns)
print('Adding:')
print(newColumns)
for newColumn in newColumns:
if newColumn in newDf.columns:
try:
newUniqueValCount = newDf.groupBy(newColumn).count().count()
print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
except:
print('Uncaught error discussing ' + str(newColumn))
#else:
# newColumns.remove(newColumn)
print('Dropping:')
print(set(oldColumns) - set(newDf.columns))
else:
if debug: print('Nothing done for column ' + str(colName))
#end if countUnique == 1, elif countUnique other condition
#end outer for
return newDf
You can cast a string column type in a spark data frame to a numerical data type using the cast function.
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType, IntegerType
sqlContext = SQLContext(sc)
dataset = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('./data/titanic.csv')
dataset = dataset.withColumn("Age", dataset["Age"].cast(DoubleType()))
dataset = dataset.withColumn("Survived", dataset["Survived"].cast(IntegerType()))
In the above example, we read in a csv file as a data frame, cast the default string datatypes into integer and double, and overwrite the original data frame. We can then use the VectorAssembler to merge the features in a single vector and apply your favorite Spark ML algorithm.

convert dataframe to libsvm format

I have a dataframe resulting from a sql query
df1 = sqlContext.sql("select * from table_test")
I need to convert this dataframe to libsvm format so that it can be provided as an input for
pyspark.ml.classification.LogisticRegression
I tried to do the following. However, this resulted in the following error as I'm using spark 1.5.2
df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm
I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't directly pip install it. So I downloaded the file, scp-ed it and then manually installed it. Everything seemed to work fine but I still get the following error
import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils
Question 1: Is my above approach to convert dataframe to libsvm format in the right direction.
Question 2: If "yes" to question 1, how to get MLUtils working. If "no", what is the best way to convert dataframe to libsvm format
I would act like that (it's just an example with an arbitrary dataframe, I don't know how your df1 is done, focus is on data transformations):
This is my way to convert dataframe to libsvm format:
# ... your previous imports
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 3| 6|
| 4| 5| 20|
| 7| 8| 8|
+---+---+---+
# FROM DATAFRAME TO RDD
>>> c = df.rdd # this command will convert your dataframe in a RDD
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]
# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0],[line[1:]])) # arbitrary mapping, it's just an example
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]
# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")
What you will see on the "/your/Path/nameFolder/part-0000*" files is:
1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0
See here for LabeledPoint docs
I had to do this for it to work
D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))
If you want to convert sparse vectors to a 'sparse' libsvm which is more efficient, try this:
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
df = spark.createDataFrame([
(0, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)]))
], ["label", "features"])
df.show()
# +-----+-------------------+
# |label| features|
# +-----+-------------------+
# | 0|(5,[1,3],[1.0,7.0])|
# | 1|(5,[1,3],[1.0,7.0])|
# | 1|(5,[1,3],[1.0,7.0])|
# +-----+-------------------+
MLUtils.saveAsLibSVMFile(df.rdd.map(lambda x: LabeledPoint(x.label, MLLibVectors.fromML(x.features))), './libsvm')

Resources