Spark gives Error when creating DataFrame - apache-spark

I have downloaded Spark 2.3.1, Hadoop 2.7, and Java JDK 8.
Everything works fine for simple exercises, but when I tried to create a DataFrame it started to throw errors.
The following code runs without error:
import numpy as np
TOTAL = 1000000
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())
but when I tried the following code, which requires the input to be turned into a DataFrame,
df = sc.parallelize([Row(name='ab',age=20), Row(name='ab',age=20)]).toDF()
it throws an error.

You were missing the import for Row:
from pyspark.sql import Row
df = sc.parallelize([Row(name='ab',age=20), Row(name='ab',age=20)]).toDF()
df.show()
Result:
+---+----+
|age|name|
+---+----+
| 20| ab|
| 20| ab|
+---+----+
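As a side note, the same frame can also be built directly with the SparkSession API instead of parallelizing an RDD first; a minimal sketch, assuming a standard local session:
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(name='ab', age=20), Row(name='ab', age=20)])
df.show()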

Related

How to randomly shuffle the values of only one column in pyspark?

I want to break the correlation between a column and the rest of the dataframe, while maintaining the distribution of the values in that column.
In pandas, I used to achieve this by simply shuffling the values of the column and then assigning them back to it. It is not so straightforward in pyspark because the data is partitioned. I do not think there is even a way in pyspark to set a new column in a dataframe with a column from another dataframe.
So, in pyspark, how do I achieve the following?
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
Also, I'm hoping you give me a solution that does not include unnecessary shuffles.
One way to do it is:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import row_number,lit,rand
from pyspark.sql.window import Window
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
dfs = spark.createDataFrame(df)
w = Window().orderBy(lit('A'))
dfs = dfs.withColumn("row_num", row_number().over(w))
dfs_ts = dfs.select('b')
dfs_ts = dfs_ts.withColumn('o',rand()).orderBy('o')
dfs = dfs.drop('b')
dfs_ts = dfs_ts.drop('o')
w = Window().orderBy(lit('A'))
dfs_ts = dfs_ts.withColumn("row_num", row_number().over(w))
dfs = dfs.join(dfs_ts,on='row_num').drop('row_num')
But I do not need the shuffles that come with the join, and they should not be necessary. If a blind hstack on a per-partition basis is possible in pyspark, that should be enough. Also, the window function warns me that I have not defined any partitions, so all my data would be collected to a single partition. Might as well use pandas in that case.
I think you could achieve something like that on the RDD level. Not sure if it would be worth all the extra steps, and depending on the size of the partitions it might not perform really well because you'd have to hold all the values of the partition in memory for the shuffle.
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({'a': [x for x in range(10)], 'b': [x for x in range(10)]})
dfs = spark.createDataFrame(df)
dfs.show()
rdd_a = dfs.select("a").rdd
rdd_b = dfs.select("b").rdd

from random import shuffle

def f(iterator):
    # shuffle the rows held by this partition in memory
    items = list(iterator)
    shuffle(items)
    for x in items:
        yield x

from pyspark.sql import Row

def reconstruct(x):
    # x is a (Row from rdd_a, Row from the reordered rdd_b) pair produced by zip
    return Row(a=x[0].a, b=x[1].b)

rdd_b_reordered = rdd_b.mapPartitions(f)
df_reordered = spark.createDataFrame(rdd_a.zip(rdd_b_reordered).map(reconstruct))
df_reordered.show()
"""
My output:
+---+---+
| a| b|
+---+---+
| 0| 2|
| 1| 4|
| 2| 0|
| 3| 1|
| 4| 3|
| 5| 7|
| 6| 9|
| 7| 5|
| 8| 6|
| 9| 8|
+---+---+
"""
Maybe you can tweak that to your needs. This would only shuffle the things over each partition to avoid the partition shuffle. Also probably better to do it in Scala.
I am not fully confident in this answer, but I think it's right.
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
So if we get your Pandas example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
We can just wrap the .sample expression in a function.
def shuffle(df: pd.DataFrame) -> pd.DataFrame:
    df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
    return df
And then we can bring it to Spark using the transform function. The transform function can take in both Pandas and Spark DataFrames and will convert the result to Spark if you are using the Spark engine.
from fugue import transform
import fugue_spark
transform(df, shuffle, schema="*", engine="spark")
But of course, this transform is applied per partition. If you don't specify the partitions, it uses the default partitions. If you want to shuffle to really randomize it, you can do:
transform(df, shuffle, schema="*", engine="spark", partition={"algo":"rand"}).show()
And Fugue will partition your data randomly before this operation.
You may not see the shuffle for your test case because the data is really small. If you end up with 4 partitions with 1 row each, they will end up returning the same value.

How to create a data frame from all possible combinations of the values of each category listed in a large dictionary

I would like to create a data frame from all possible combinations of the values of each category listed in the dictionary.
I tried the code below; it works fine for a small dictionary with fewer keys and values, but it fails for a larger dictionary like the one given below.
import itertools as it
import pandas as pd
my_dict= {
"A":[0,1,.....25],
"B":[4,5,.....35],
"C":[0,1,......30],
"D":[0,1,........35],
.........
"Y":[0,1,........35],
"Z":[0,1,........35],
}
df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
This is the error I get. How can I handle this problem with a large dictionary?
Traceback (most recent call last):
  File "<ipython-input-11-723405257e95>", line 1, in <module>
    df=pd.DataFrame(list(it.product(*my_dict.values())), columns=my_dict.keys())
MemoryError
How can I handle the large dictionary to create the data frame?
If you have a sufficiently large [1] Spark cluster, each list in the dictionary can be turned into a Spark dataframe, and all these dataframes can then be cross-joined:
def to_spark_dfs(dict):
    for key in dict:
        l = [[e] for e in dict[key]]
        yield spark.createDataFrame(l, schema=[key])
dfs=to_spark_dfs(my_dict)
from functools import reduce
res=reduce(lambda df1,df2: df1.crossJoin(df2),dfs)
If the original my_dict is not too large
my_dict= {
"A":[0,1,2],
"B":[4,5,6],
"C":[0,1,2],
"D":[0,1],
"Y":[0,1,2],
"Z":[0,1],
}
the code produces the expected result:
res.show()
#+---+---+---+---+---+---+
#| A| B| C| D| Y| Z|
#+---+---+---+---+---+---+
#| 0| 4| 0| 0| 0| 0|
#| 0| 4| 0| 0| 0| 1|
#| 0| 4| 0| 0| 1| 0|
#| 0| 4| 0| 0| 1| 1|
#...
res.count()
#324
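This matches the expected count: 3 * 3 * 3 * 2 * 3 * 2 = 324 combinations of the six lists above.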
[1]
Using the numbers given in the comment (80 keys and approx 30 values per key) you would need a really large Spark cluster: 30 ^ 80 gives 1.5*10^118 different combination. This is more than the estimated number of atoms (10^80) in the known, observable universe.
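For reference, the quoted figure is easy to check directly in Python:
print(f"{30 ** 80:.2e}")  # prints 1.48e+118, i.e. roughly 1.5 * 10^118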
In this case, we have a huge number of possible combinations. For example, if the columns (A, B, C, ..., Z) can each take values [1...10], the total row count equals 10^26, or 100000000000000000000000000.
In my mind there are 2 main directions to solve this issue:
Horizontal scaling: calculate and store results using frameworks for distributed computing (such as Apache Spark or Hadoop)
Vertical scaling: optimize CPU/RAM utilization using:
Vectorization (e.g. avoid loops)
Data types with minimal impact on RAM allocation (use as little precision as you need; use factorize() for strings)
Mini-batching: offload intermediate results (data frames) from RAM to disk in a compressed format (e.g. parquet)
Benchmarking of execution time and object size in RAM.
Let me introduce the code that implements some of the concepts of the vertical scaling approach.
Define the following functions:
create_data_frame_baseline(): data frame generator with loop, not optimal data types (baseline)
create_data_frame_no_loop(): no loop, not optimal data types
create_data_frame_optimize_data_type(): no loop, optimal data types.
import itertools as it
import pandas as pd
import numpy as np
from string import ascii_uppercase
def create_letter_dict(cols_n: int = 10, levels_n: int = 6) -> dict:
    letter_dict = {letter: list(range(levels_n)) for letter in ascii_uppercase[0:cols_n]}
    return letter_dict

def create_data_frame_baseline(dict: dict) -> pd.DataFrame:
    df = pd.DataFrame(columns=dict.keys())
    for row in it.product(*dict.values()):
        df.loc[len(df.index)] = row
    return df

def create_data_frame_no_loop(dict: dict) -> pd.DataFrame:
    return pd.DataFrame(
        list(it.product(*dict.values())),
        columns=dict.keys()
    )

def create_data_frame_optimize_data_type(dict: dict) -> pd.DataFrame:
    return pd.DataFrame(
        np.int8(list(it.product(*dict.values()))),
        columns=dict.keys()
    )
Benchmarks:
import sys
import timeit
cols_n = 7
levels_n = 5
iteration_n = 2
# Baseline
def create_data_frame_baseline_test():
    my_dict = create_letter_dict(cols_n, levels_n)
    df = create_data_frame_baseline(my_dict)
    assert(df.shape == (levels_n**cols_n, cols_n))
    print(sys.getsizeof(df))
    return df
print(timeit.Timer(create_data_frame_baseline_test).timeit(number=iteration_n))

# No loop, not optimal data types
def create_data_frame_no_loop_test():
    my_dict = create_letter_dict(cols_n, levels_n)
    df = create_data_frame_no_loop(my_dict)
    assert(df.shape == (levels_n**cols_n, cols_n))
    print(sys.getsizeof(df))
    return df
print(timeit.Timer(create_data_frame_no_loop_test).timeit(number=iteration_n))

# No loop, optimal data types.
def create_data_frame_optimize_data_type_test():
    my_dict = create_letter_dict(cols_n, levels_n)
    df = create_data_frame_optimize_data_type(my_dict)
    assert(df.shape == (levels_n**cols_n, cols_n))
    print(sys.getsizeof(df))
    return df
print(timeit.Timer(create_data_frame_optimize_data_type_test).timeit(number=iteration_n))
Outputs*:
Function                                     Dataframe shape   RAM size, Mb   Execution time, sec
create_data_frame_baseline_test              78125x7           19             485
create_data_frame_no_loop_test               78125x7           4.4            0.20
create_data_frame_optimize_data_type_test    78125x7           0.55           0.16
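These RAM figures are consistent with the dtypes: 78125 rows x 7 columns at 8 bytes per value (default int64) is roughly 4.4 MB, while 1 byte per value (int8) gives roughly 0.55 MB.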
Using create_data_frame_optimize_data_type_test() I generated* 100M rows in less than 100 seconds.
* Ubuntu Server 20.04, Intel(R) Xeon(R) 8xCPU @ 2.60GHz, 32GB RAM
In your case you cannot generate all possible combinations at once using list(); instead, generate them in a loop, for example:
import itertools as it
import pandas as pd
from string import ascii_uppercase
N = 36
my_dict = {x: list(range(N)) for x in ascii_uppercase}
df = pd.DataFrame(columns=my_dict.keys())
for row in it.product(*my_dict.values()):
    df.loc[len(df.index)] = row
but of course this takes a very long time.
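If disk space allows, a hedged way to keep memory bounded is to combine that loop with mini-batching and write each chunk to parquet instead of holding one giant frame in RAM (a sketch; chunk_size and the file name pattern are illustrative, and to_parquet assumes pyarrow or fastparquet is installed):
import itertools as it
import numpy as np
import pandas as pd
from string import ascii_uppercase
N = 36
my_dict = {x: list(range(N)) for x in ascii_uppercase}
combos = it.product(*my_dict.values())
chunk_size = 1_000_000
# pull combinations lazily, chunk_size rows at a time, until the generator is exhausted
for i, chunk in enumerate(iter(lambda: list(it.islice(combos, chunk_size)), [])):
    pd.DataFrame(np.int8(chunk), columns=my_dict.keys()).to_parquet(f"combos_{i:05d}.parquet")
Note that with 26 keys of 36 values each this still never finishes; it only keeps the memory footprint constant.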

Error when running a query involving ROUND function in spark sql

I am trying, in pyspark, to obtain a new column by rounding one column of a table to the precision specified, in each row, by another column of the same table, e.g., from the following table:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
I should be able to obtain the following result:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
In particular, I have tried the following code:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
StructType, StructField, FloatType, LongType,
IntegerType
)
pdDF = pd.DataFrame(columns=["Data", "Rounding"], data=[[3.141592, 3],
[0.577215, 1]])
mySchema = StructType([ StructField("Data", FloatType(), True),
StructField("Rounding", IntegerType(), True)])
spark = (SparkSession.builder
.master("local")
.appName("column rounding")
.getOrCreate())
df = spark.createDataFrame(pdDF,schema=mySchema)
df.show()
df.createOrReplaceTempView("df_table")
df_rounded = spark.sql("SELECT Data, Rounding, ROUND(Data, Rounding) AS Rounded_Column FROM df_table")
df_rounded .show()
but I get the following error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'round(df_table.`Data`, df_table.`Rounding`)' due to data type mismatch: Only foldable Expression is allowed for scale arguments; line 1 pos 23;\n'Project [Data#0, Rounding#1, round(Data#0, Rounding#1) AS Rounded_Column#12]\n+- SubqueryAlias df_table\n +- LogicalRDD [Data#0, Rounding#1], false\n"
Any help would be deeply appreciated :)
With Spark SQL, the Catalyst analyzer throws the following error in your run: Only foldable Expression is allowed for scale arguments.
That is, per the function's documentation (@param scale: new scale to be rounded to), the scale should be a constant int at runtime.
ROUND only accepts a literal for the scale, so you can write custom code instead of going through the Spark SQL route.
EDIT:
With UDF,
val df = Seq(
(3.141592,3),
(0.577215,1)).toDF("Data","Rounding")
df.show()
df.createOrReplaceTempView("df_table")
import org.apache.spark.sql.functions._
def RoundUDF(customvalue:Double, customscale:Int):Double = BigDecimal(customvalue).setScale(customscale, BigDecimal.RoundingMode.HALF_UP).toDouble
spark.udf.register("RoundUDF", RoundUDF(_:Double,_:Int):Double)
val df_rounded = spark.sql("select Data, Rounding, RoundUDF(Data, Rounding) as Rounded_Column from df_table")
df_rounded.show()
Input:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
Output:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
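For completeness, a similar UDF in PySpark (a sketch reusing the df_table temp view from the question; the HALF_UP behaviour of the Scala version is mirrored with the decimal module):
from decimal import Decimal, ROUND_HALF_UP
from pyspark.sql.types import DoubleType
def round_half_up(value, scale):
    # round value to `scale` decimal places, rounding halves up like the Scala BigDecimal version
    return float(Decimal(str(value)).quantize(Decimal(10) ** -int(scale), rounding=ROUND_HALF_UP))
spark.udf.register("round_udf", round_half_up, DoubleType())
df_rounded = spark.sql("SELECT Data, Rounding, round_udf(Data, Rounding) AS Rounded_Column FROM df_table")
df_rounded.show()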

User defined function to be applied to Window in PySpark?

I am trying to apply a user defined function to a Window in PySpark. I have read that a UDAF might be the way to go, but I was not able to find anything concrete.
To give an example (taken from here: Xinh's Tech Blog and modified for PySpark):
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg
spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()
a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ['ind', "state"])
customers = spark.createDataFrame([["Alice", "2016-05-01", 50.00],
["Alice", "2016-05-03", 45.00],
["Alice", "2016-05-04", 55.00],
["Bob", "2016-05-01", 25.00],
["Bob", "2016-05-04", 29.00],
["Bob", "2016-05-06", 27.00]],
["name", "date", "amountSpent"])
customers.show()
window_spec = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
result = customers.withColumn( "movingAvg", avg(customers["amountSpent"]).over(window_spec))
result.show()
I am applying avg, which is already built into pyspark.sql.functions, but if instead of avg I wanted to use something more complicated and write my own function, how would I do that?
Spark >= 3.0:
SPARK-24561 - User-defined window functions with pandas udf (bounded window) is a work in progress. Please follow the related JIRA for details.
Spark >= 2.4:
SPARK-22239 - User-defined window functions with pandas udf (unbounded window) introduced support for Pandas based window functions with unbounded windows. General structure is
return_type: DataType
@pandas_udf(return_type, PandasUDFType.GROUPED_AGG)
def f(v):
    return ...
w = (Window
.partitionBy(grouping_column)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn('foo', f('bar').over(w))
Please see the doctests and the unit tests for detailed examples.
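For example, a concrete version of that structure (a sketch for Spark 2.4+, reusing the customers frame from the question with an unbounded window):
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.window import Window
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_spent(v: pd.Series) -> float:
    # any custom aggregation over the window frame goes here
    return v.mean()
w = (Window
    .partitionBy("name")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
customers.withColumn("avgSpent", mean_spent("amountSpent").over(w)).show()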
Spark < 2.4
You cannot. Window functions require UserDefinedAggregateFunction or equivalent object, not UserDefinedFunction, and it is not possible to define one in PySpark.
However, in PySpark 2.3 or later you can define a vectorized pandas_udf, which can be applied on grouped data. You can find a working example in Applying UDFs on GroupedData in PySpark (with functioning Python example). While Pandas doesn't provide a direct equivalent of window functions, it is expressive enough to implement any window-like logic, especially with pandas.DataFrame.rolling. Furthermore, the function used with GroupedData.apply can return an arbitrary number of rows (see the sketch below).
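As an illustration of that Spark 2.3 route, a hedged sketch that uses a GROUPED_MAP pandas_udf with pandas.DataFrame.rolling to emulate the moving average from the question (the schema string and column names follow the customers example; this is not code from the linked posts):
from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf("name string, date string, amountSpent double, movingAvg double",
            PandasUDFType.GROUPED_MAP)
def moving_avg(pdf):
    # pdf is a pandas DataFrame holding all rows of one 'name' group
    pdf = pdf.sort_values("date")
    pdf["movingAvg"] = pdf["amountSpent"].rolling(window=3, min_periods=1, center=True).mean()
    return pdf
customers.groupBy("name").apply(moving_avg).show()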
You can also call Scala UDAF from PySpark Spark: How to map Python with Scala or Java User Defined Functions?.
UDFs can be applied to Window now as of Spark 3.0.0.
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html
Example adapted from the documentation:
import pandas as pd
from pyspark.sql import Window
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
w = Window.partitionBy('id').orderBy('v').rowsBetween(-1, 0)
df.withColumn('mean_v', mean_udf("v").over(w)).show()
+---+----+------+
| id| v|mean_v|
+---+----+------+
| 1| 1.0| 1.0|
| 1| 2.0| 1.5|
| 2| 3.0| 3.0|
| 2| 5.0| 4.0|
| 2|10.0| 7.5|
+---+----+------+

convert dataframe to libsvm format

I have a dataframe resulting from a sql query
df1 = sqlContext.sql("select * from table_test")
I need to convert this dataframe to libsvm format so that it can be provided as an input for
pyspark.ml.classification.LogisticRegression
I tried to do the following. However, this resulted in the following error as I'm using spark 1.5.2
df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm
I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't directly pip install it. So I downloaded the file, scp-ed it and then manually installed it. Everything seemed to work fine but I still get the following error
import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils
Question 1: Is my above approach to convert the dataframe to libsvm format headed in the right direction?
Question 2: If "yes" to question 1, how do I get MLUtils working? If "no", what is the best way to convert a dataframe to libsvm format?
I would do it like this (it's just an example with an arbitrary dataframe; I don't know how your df1 is built, the focus is on the data transformations):
This is my way to convert a dataframe to libsvm format:
# ... your previous imports
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 3| 6|
| 4| 5| 20|
| 7| 8| 8|
+---+---+---+
# FROM DATAFRAME TO RDD
>>> c = df.rdd # this command will convert your dataframe into an RDD
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]
# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0],[line[1:]])) # arbitrary mapping, it's just an example
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]
# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")
What you will see in the "/your/Path/nameFolder/part-0000*" files is:
1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0
See here for LabeledPoint docs
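As a side note, on Spark 2.x and later the libsvm data source from the original attempt is built in, so the DataFrame writer works as long as the frame has a double label column and a Vector features column (a sketch built on the small df above; the column choices are arbitrary):
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
assembled = (VectorAssembler(inputCols=["_2", "_3"], outputCol="features")
    .transform(df)
    .select(col("_1").cast("double").alias("label"), "features"))
assembled.write.format("libsvm").save("data/foo_libsvm")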
I had to do this for it to work
D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))
If you want to convert sparse vectors to a 'sparse' libsvm which is more efficient, try this:
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
df = spark.createDataFrame([
(0, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)]))
], ["label", "features"])
df.show()
# +-----+-------------------+
# |label| features|
# +-----+-------------------+
# | 0|(5,[1,3],[1.0,7.0])|
# | 1|(5,[1,3],[1.0,7.0])|
# | 1|(5,[1,3],[1.0,7.0])|
# +-----+-------------------+
MLUtils.saveAsLibSVMFile(df.rdd.map(lambda x: LabeledPoint(x.label, MLLibVectors.fromML(x.features))), './libsvm')
