PySpark UDF issues when referencing outside of function - apache-spark

I facing the issue that I get the error
TypeError: cannot pickle '_thread.RLock' object
when I try to apply the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),
('Mike','Williams','M',77),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
def find_n_people_with_higher_age(x):
return df_2.filter(df_2['age']>=x).count()
find_n_people_with_higher_age_udf = udf(find_n_people_with_higher_age, IntegerType())
df_1.select(find_n_people_with_higher_age_udf(col('category_id')))

Here's a good article on python UDF's.
I use it as a reference as I suspected that you were running into a serialization issue. I'm showing the entire paragraph to add context of the sentence but really it's the serialization that's the issue.
Performance Considerations
It’s important to understand the performance implications of Apache
Spark’s UDF features. Python UDFs for example (such as our CTOF
function) result in data being serialized between the executor JVM and
the Python interpreter running the UDF logic – this significantly
reduces performance as compared to UDF implementations in Java or
Scala. Potential solutions to alleviate this serialization bottleneck
include:
If you consider what you are asking maybe you'll see why this isn't working. You are asking all data from your dataframe(data_2) to be shipped(serialized) to an executor that then serializes it and ships it to python to be interpreted. Dataframes don't serialize. So that's your issue, but if they did, you are sending an entire data frame to each executor. Your sample data here isn't an issue, but for trillions of records it would blow up the JVM.
What your asking is doable I just need to figure out how do it. Likely a window or group by would be the trick.

add additional data:
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
# add more data to make it more interesting.
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),('Gia','Rose','F',34),
('Mike','Williams','M',77), ('John','Williams','M',77), ('Bill','Williams','F',79),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
# dataframe to help fill in missing ages
ref = spark.range( 1, 110, 1).toDF("numbers").withColumn("count", lit(0)).withColumn("rolling_Count", lit(0))
countAges = df_2.groupby("age").count()
#this actually give you the short list of ages
rollingCounts = countAges.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#fill in missing ages and remove duplicates
filled = rollingCounts.union(ref).groupBy("age").agg(sum("count").alias("count"))
#add a rolling count across all ages
allAgeCounts = filled.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#do inner join because we've filled in all ages.
df_1.join(allAgeCounts, df_1.age == allAgeCounts.age, "inner").show()
+---------+--------+------+---+---+-----+-------------+
|firstname|lastname|gender|age|age|count|rolling_Count|
+---------+--------+------+---+---+-----+-------------+
| Anna| Rose| F| 41| 41| 0| 3|
| Robert|Williams| M| 62| 62| 0| 3|
| James| Smith| M| 30| 30| 0| 5|
+---------+--------+------+---+---+-----+-------------+
I wouldn't normally want to use a window over an entire table, but here the data it's iterating over <= 110 so this is reasonable.

Related

Is there a way to use a map/dict in Pyspark to avoid CASE WHEN condition equals pairs?

I have a problem in Pyspark creating a column based on values in another column for a new dataframe.
It's boring and seems to me not a good practice to use a lot of
CASE
WHEN column_a = 'value_1' THEN 'value_x'
WHEN column_a = 'value_2' THEN 'value_y'
...
WHEN column_a = 'value_289' THEN 'value_xwerwz'
END
In cases like this, in python, I get used to using a dict or, even better, a configparser file and avoid the if else condition. I just pass the key and python returns the desired value. Also, we have a 'fallback' option for ELSE clause.
The problem seems to me that we are not treating a single row but all of them in one command, so using dict/map/configparser is an unavailable option. I thought about using a loop with dict, but it seems too slow and a waste of computation as we repeat all the conditions.
I'm still looking for this practice, if I find it, I'll post it here. But, you know, probably a lot of people already use it and I don't know yet. But if there is no other way, ok. Use many WHEN THEN conditions won't be a choice.
Thank you
I tried to use a dict and searched for solutions like this
You could create a function which converts a dict into a Spark F.when, e.g.:
import pyspark.sql.functions as F
def create_spark_when(column, conditions, default):
when = None
for key, value in conditions.items():
current_when = F.when(F.col(column) == key, value)
if when is None:
when = current_when.otherwise(default)
else:
when = current_when.otherwise(when)
return when
df = spark.createDataFrame([(0,), (1,), (2,)])
df.show()
my_conditions = {1: "a", 2: "b"}
my_default = "c"
df.withColumn(
"my_column",
create_spark_when("_1", my_conditions, my_default),
).show()
Output:
+---+
| _1|
+---+
| 0|
| 1|
| 2|
+---+
+---+---------+
| _1|my_column|
+---+---------+
| 0| c|
| 1| a|
| 2| b|
+---+---------+
One choice is to use create a dataframe out of dictionary and perform join
This would work:
Creating a Dataframe:
dict={"value_1": "value_x", "value_2": "value_y"}
dict_df=spark.createDataFrame([(k,v) for k,v in dict.items()], ["key","value"])
Performing the join:
df.alias("df1")\
.join(F.broadcast(dict_df.alias("df2")), F.col("column_a")==F.col("key"))\
.selectExpr("df1.*","df2.value as newColumn")\
.show()
We can broadcast the dict_df as it is small.
Input:
Dict_df:
Output:
Alternatively, you can use a UDF - but that is not recommended.

Spark: Transforming multiple dataframes in parallel

Understanding how to achieve best parallelism while transforming multiple dataframes in parallel
I have an array of paths
val paths = Array("path1", "path2", .....
I am loading dataframe from each path then transforming and writing to destination path
paths.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
The transformation processData is independent of dataframe I am loading.
This limits to processing one dataframe at a time and most of my cluster resources are idle. As processing each dataframe is independent, I converted Array to ParArray of scala.
paths.par.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
Now it is using more resources in cluster. I am still trying to understand how it works and how to fine tune the parallel processing here
If I increase the default scala parallelism using ForkJoinPool to higher number, can it lead to more threads spawning at driver side and will be in lock state waiting for foreach function to finish and eventually kill the driver?
How does it effect the centralized spark things like EventLoggingListnener which needs to handle more inflow of events as multiple dataframes are processed in parallel.
What parameters do I consider for optimal resource utilization.
Any other approach
Any resources I can go through to understand this scaling would be very helpful
The reason why this is slow is that spark is very good at parallelizing computations on lots of data, stored in one big dataframe. However, it is very bad at dealing with lots of dataframes. It will start the computation on one using all its executors (even though they are not all needed) and wait for it to finish before starting the next one. This results in a lot of inactive processors. This is bad but that's not what spark was designed for.
I have a hack for you. There might need to refine it a little, but you would have the idea. Here is what I would do. From a list of paths, I would extract all the schemas of the parquet files and create a new big schema that gathers all the columns. Then, I would ask spark to read all the parquet files using this schema (the columns that are not present will be set to null automatically). I would then union all the dataframes and perform the transformation on this big dataframe and finally use partitionBy to store the dataframes in separate files, while still doing all of it in parallel. It would look like this.
// let create two sample datasets with one column in common (id)
// and two different columns x != y
val d1 = spark.range(3).withColumn("x", 'id * 10)
d1.show
+---+----+
| id| x |
+---+----+
| 0| 0|
| 1| 10|
| 2| 20|
+---+----+
val d2 = spark.range(2).withColumn("y", 'id cast "string")
d2.show
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 1|
+---+---+
// And I store them
d1.write.parquet("hdfs:///tmp/d1.parquet")
d2.write.parquet("hdfs:///tmp/d2.parquet")
// Now let's create the big schema
val paths = Seq("hdfs:///tmp/d1.parquet", "hdfs:///tmp/d2.parquet")
val fields = paths
.flatMap(path => spark.read.parquet(path).schema.fields)
.toSet //removing duplicates
.toArray
val big_schema = StructType(fields)
// and let's use it
val dfs = paths.map{ path =>
spark.read
.schema(big_schema)
.parquet(path)
.withColumn("path", lit(path.split("/").last))
}
// The we are ready to create one big dataframe
dfs.reduce( _ unionAll _).show
+---+----+----+----------+
| id| x| y| file|
+---+----+----+----------+
| 1| 1|null|d1.parquet|
| 2| 2|null|d1.parquet|
| 0| 0|null|d1.parquet|
| 0|null| 0|d2.parquet|
| 1|null| 1|d2.parquet|
+---+----+----+----------+
Yet, I do not recommend using unionAll on lots of dataframes. Because of spark's analysis of the execution plan, it can be very slow with many dataframes. I would use the RDD version although it is more verbose.
val rdds = sc.union(dfs.map(_.rdd))
// let's not forget to add the path to the schema
val big_df = spark.createDataFrame(rdds,
big_schema.add(StructField("path", StringType, true)))
transform(big_df)
.write
.partitionBy("path")
.parquet("hdfs:///tmp/processed.parquet")
And having a look at my processed directory, I get this:
hdfs:///tmp/processed.parquet/_SUCCESS
hdfs:///tmp/processed.parquet/path=d1.parquet
hdfs:///tmp/processed.parquet/path=d2.parquet
You should play with some variables here. Most important are: CPU cores, the size of each DF and a little use of futures. The propose is decide the priority of each DF to be processed. You can use FAIR configuration but that don't be enough and process all in parallel could consume a big part of your cluster. You have to assign priorities to DFs and use Future pooll to control the number of parallel Jobs running in your app.

Running a high volume of Hive queries from PySpark

I want execute a very large amount of hive queries and store the result in a dataframe.
I have a very large dataset structured like this:
+-------------------+-------------------+---------+--------+--------+
| visid_high| visid_low|visit_num|genderid|count(1)|
+-------------------+-------------------+---------+--------+--------+
|3666627339384069624| 693073552020244687| 24| 2| 14|
|1104606287317036885|3578924774645377283| 2| 2| 8|
|3102893676414472155|4502736478394082631| 1| 2| 11|
| 811298620687176957|4311066360872821354| 17| 2| 6|
|5221837665223655432| 474971729978862555| 38| 2| 4|
+-------------------+-------------------+---------+--------+--------+
I want to create a derived dataframe which uses each row as input for a secondary query:
result_set = []
for session in sessions.collect()[:100]:
query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} group by prop8".format(date,session['visid_high'],session['visid_low'],session['visit_num'])
result = hc.sql(query).collect()
result_set.append(result)
This works as expected for a hundred rows, but causes livy to time out with higher loads.
I tried using map or foreach:
def f(session):
query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} group by prop8".format(date,session.visid_high,session.visid_low,session.visit_num)
return hc.sql(query)
test = sampleRdd.map(f)
causing PicklingError: Could not serialize object: TypeError: 'JavaPackage' object is not callable. I understand from this answer and this answer that the spark context object is not serializable.
I didn't try generating all queries first, then running the batch, because I understand from this question batch querying is not supported.
How do I proceed?
What I was looking for is:
Querying all required data in one go by writing the appropriate joins
Adding custom columns, based on the values of the large dataframe using pyspark.sql.functions.when() and df.withColumn(), then
Flattening the resulting dataframe with df.groupBy() and pyspark.sql.functions.sum()
I think I didn't fully realize that Spark handles dataframes lazily. The supported way of working is to define large dataframes and then the appropriate transforms. Spark will try to execute the data retrieval and the transforms in one go, at the last second and distributed. I was trying to limit the scope up front, which led to unsupported functionality.

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show()
+-------+------------------+
|summary| count|
+-------+------------------+
| count| 53173|
| mean|104.76128862392568|
| stddev|3577.8184333911513|
| min| 1|
| max| 558407|
+-------+------------------+
I would like now to get the standard deviation and the mean from statsDF, but when I am trying to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I am getting Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by describe() function?
I have considered a toy dataset I had containing some health disease data
val stddev_tobacco = rawData.describe().rdd.map{
case r : Row => (r.getAs[String]("summary"),r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option would be to use selectExpr() which also runs optimized, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
myDataFrame.describe().filter($"summary"==="stddev").show()
This worked quite nicely on Spark 2.3.0

Averaging over window function leads to StackOverflowError

I am trying to determine the average timespan between dates in a Dataframe column by using a window-function. Materializing the Dataframe however throws a Java exception.
Consider the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext, Window, functions
from datetime import datetime
sc = SparkContext()
sq = HiveContext(sc)
data = [
[datetime(2014,1,1)],
[datetime(2014,2,1)],
[datetime(2014,3,1)],
[datetime(2014,3,6)],
[datetime(2014,8,23)],
[datetime(2014,10,1)],
]
df = sq.createDataFrame(data, schema=['ts'])
ts = functions.col('ts')
w = Window.orderBy(ts)
diff = functions.datediff(
ts,
functions.lag(ts, count=1).over(w)
)
avg_diff = functions.avg(diff)
While df.select(diff.alias('diff')).show() correctly renders as
+----+
|diff|
+----+
|null|
| 31|
| 28|
| 5|
| 170|
| 39|
+----+
doing df.select(avg_diff).show() gives a java.lang.StackOverflowError.
Am I wrong to assume that this should work? And if so, what am I doing wrong and what could I do instead?
I am using the Python API on Spark 1.6
When I do df2 = df.select(diff.alias('diff')) and then do
df2.select(functions.avg('diff'))
there's no error. Unfortunately that is not an option in my current setup.
It looks like a bug in Catalyst but. Chaining methods should work just fine:
df.select(diff.alias('diff')).agg(functions.avg('diff'))
Nevertheless I would be careful here. Window functions shouldn't be used to perform global (without PARTITION BY clause) operations. These move all data to a single partition and perform a sequential scan. Using RDDs could be a better choice here.

Resources