How to get multiple queries in a single run - apache-spark

For example I have a dataframe like below,
df
DataFrame[columnA: int, columnB: int]
If I have to do two checks, I am going over the data twice, like below:
df.where(df.columnA == 412).count()
df.where(df.columnB == 25).count()
In normal code I would have two count variables and increment them on True. How would I do this with a Spark dataframe? I'd appreciate it if someone could also link to the right documentation. Happy to see Python or Scala.

For example like this:
import org.apache.spark.sql.functions.sum
val df = sc.parallelize(Seq(
  (412, 0),
  (0, 25),
  (412, 25),
  (0, 25)
)).toDF("columnA", "columnB")

df.agg(
  sum(($"columnA" === 412).cast("long")).alias("columnA"),
  sum(($"columnB" === 25).cast("long")).alias("columnB")
).show
// +-------+-------+
// |columnA|columnB|
// +-------+-------+
// | 2| 3|
// +-------+-------+
or like this:
import org.apache.spark.sql.functions.{count, when}
df.agg(
  count(when($"columnA" === 412, $"columnA")).alias("columnA"),
  count(when($"columnB" === 25, $"columnB")).alias("columnB")
).show
// +-------+-------+
// |columnA|columnB|
// +-------+-------+
// | 2| 3|
// +-------+-------+
I am not aware of any specific documentation but I am pretty sure you'll find this in any good SQL reference.
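For completeness (my addition, not part of the original answer), the same conditional aggregation can be expressed as plain SQL via selectExpr, using the df defined above:
df.selectExpr(
  "sum(cast(columnA = 412 as long)) as columnA",
  "sum(cast(columnB = 25 as long)) as columnB"
).show
// +-------+-------+
// |columnA|columnB|
// +-------+-------+
// |      2|      3|
// +-------+-------+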

zero323's answer is spot on, but just to show how flexible Spark's programming model is, you could do your checks as if statements inside a map with a lambda function, e.g. (using the same dataframe as above):
import org.apache.spark.sql.functions._
val r1 = df.map(x => {
  var x0 = 0
  var x1 = 0
  if (x(0) == 412) x0 = 1
  if (x(1) == 25) x1 = 1
  (x0, x1)
}).toDF("x0", "x1").select(sum("x0"), sum("x1")).show()
This model lets you do almost anything you can think of, though you're much better off sticking with the specific APIs where available.

Related

PySpark UDF issues when referencing outside of function

I am facing the issue that I get the error
TypeError: cannot pickle '_thread.RLock' object
when I try to apply the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),
('Mike','Williams','M',77),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
def find_n_people_with_higher_age(x):
    return df_2.filter(df_2['age'] >= x).count()
find_n_people_with_higher_age_udf = udf(find_n_people_with_higher_age, IntegerType())
df_1.select(find_n_people_with_higher_age_udf(col('category_id')))
Here's a good article on Python UDFs.
I used it as a reference, as I suspected that you were running into a serialization issue. I'm quoting the entire paragraph to add context, but really it's the serialization that's the issue.
Performance Considerations
It’s important to understand the performance implications of Apache
Spark’s UDF features. Python UDFs for example (such as our CTOF
function) result in data being serialized between the executor JVM and
the Python interpreter running the UDF logic – this significantly
reduces performance as compared to UDF implementations in Java or
Scala. Potential solutions to alleviate this serialization bottleneck
include:
If you consider what you are asking, maybe you'll see why this isn't working. You are asking for all the data from your dataframe (data_2) to be shipped (serialized) to an executor, which then serializes it and ships it to Python to be interpreted. Dataframes don't serialize, so that's your issue; but even if they did, you'd be sending an entire dataframe to each executor. Your sample data here isn't an issue, but for trillions of records it would blow up the JVM.
What you're asking is doable, I just needed to figure out how to do it. Likely a window or a group by will be the trick.
add additional data:
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
# add more data to make it more interesting.
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),('Gia','Rose','F',34),
('Mike','Williams','M',77), ('John','Williams','M',77), ('Bill','Williams','F',79),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
# dataframe to help fill in missing ages
ref = spark.range( 1, 110, 1).toDF("numbers").withColumn("count", lit(0)).withColumn("rolling_Count", lit(0))
countAges = df_2.groupby("age").count()
# this actually gives you the short list of ages
rollingCounts = countAges.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#fill in missing ages and remove duplicates
filled = rollingCounts.union(ref).groupBy("age").agg(sum("count").alias("count"))
#add a rolling count across all ages
allAgeCounts = filled.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#do inner join because we've filled in all ages.
df_1.join(allAgeCounts, df_1.age == allAgeCounts.age, "inner").show()
+---------+--------+------+---+---+-----+-------------+
|firstname|lastname|gender|age|age|count|rolling_Count|
+---------+--------+------+---+---+-----+-------------+
| Anna| Rose| F| 41| 41| 0| 3|
| Robert|Williams| M| 62| 62| 0| 3|
| James| Smith| M| 30| 30| 0| 5|
+---------+--------+------+---+---+-----+-------------+
I wouldn't normally want to use a window over an entire table, but here the data it iterates over is at most 110 rows, so this is reasonable.

Assigning columns to another columns in a Spark Dataframe using Scala

I was looking at this excellent question and its answer so as to improve my Scala skills: Extract a column value and assign it to another column as an array in spark dataframe
I created my modified code as follows, which works, but I am left with a few questions:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")

val uniqueVal = df.select("b").distinct().map(x => x.getAs[Int](0)).collect.toList
def myfun: Int => List[Int] = _ => uniqueVal
def myfun_udf = udf(myfun)
df.withColumn("X", myfun_udf( col("b") )).show
+---+---+---+---------+
| ID| a| b| X|
+---+---+---+---------+
| r1| 1| 1|[1, 4, 2]|
| r2| 6| 4|[1, 4, 2]|
| r3| 4| 1|[1, 4, 2]|
| r4| 1| 2|[1, 4, 2]|
+---+---+---+---------+
It works, but:
I note the b column is passed in twice.
I can also pass in column a in the second statement and I get the same result, e.g.:
df.withColumn("X", myfun_udf( col("a") )).show
So what is the point of that?
If I pass in col ID then it gets null.
So, I am wondering why the second column is an input at all, and how this could be made to work generically for all columns?
This was code that I looked at elsewhere, but I am missing something.
The code you've shown doesn't make much sense:
It is not scalable - in the worst case scenario the size of each row is proportional to the size of the complete dataset.
As you've already figured out, it doesn't need an argument at all.
It doesn't need (and, what's important, it didn't need) a udf at the time it was written (on 2016-12-23 Spark 1.6 and 2.0 were already released).
If you still wanted to use a udf, the nullary variant would suffice.
Overall it is just another convoluted and misleading answer that served the OP at the time. I'd ignore it (or vote accordingly) and move on.
So how could this be done?
If you have a local list and you really want to use a udf, then for a single sequence use a udf with a nullary function:
val uniqueBVal: Seq[Int] = ???
val addUniqueBValCol = udf(() => uniqueBVal)
df.withColumn("X", addUniqueBValCol())
Generalize to:
import scala.reflect.runtime.universe.TypeTag
def addLiteral[T : TypeTag](xs: Seq[T]) = udf(() => xs)
val x = addLiteral[Int](uniqueBVal)
df.withColumn("X", x())
Better don't use udf:
import org.apache.spark.sql.functions._
df.withColumn("x", array(uniquBVal map lit: _*))
As for:
And how this could be made to work generically for all columns?
as mentioned at the beginning, the whole concept is hard to defend. Either window functions (completely not scalable):
import org.apache.spark.sql.expressions.Window
val w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.select($"*" +: df.columns.map(c => collect_set(c).over(w).alias(s"${c}_unique")): _*)
or cross join with aggregate (most of the time not scalable)
val uniqueValues = df.select(
df.columns map (c => collect_set(col(c)).alias(s"${c}_unique")):_*
)
df.crossJoin(uniqueValues)
In general though, you'll have to rethink your approach if this comes anywhere near actual applications, unless you know for sure that the cardinalities of the columns are small and have strict upper bounds.
The take-away message is: don't trust random code that random people post on the Internet. This one included.

Aggregating several fields simultaneously from Dataset

I have a data with the following scheme:
sourceip
destinationip
packets sent
And I want to calculate several aggregative fields out of this data and have the following schema:
ip
packets sent as sourceip
packets sent as destination
In the happy days of RDDs I could use aggregate, define a map of {ip -> []}, and count the appearances in a corresponding array location.
In the Dataset/DataFrame API, aggregate is no longer available; instead, a UDAF could be used. Unfortunately, from the experience I have had with UDAFs, they are immutable, which means they cannot be used here (a new instance has to be created on every map update) - example + explanation here.
On one hand, technically, I could convert the Dataset to an RDD, aggregate, etc. and go back to a Dataset, which I expect would result in performance degradation, as Datasets are more optimized. UDAFs are out of the question due to the copying.
Is there any other way to perform aggregations?
It sounds like you need a standard melt (How to melt Spark DataFrame?) and pivot combination:
val df = Seq(
  ("192.168.1.102", "192.168.1.122", 10),
  ("192.168.1.122", "192.168.1.65", 10),
  ("192.168.1.102", "192.168.1.97", 10)
).toDF("sourceip", "destinationip", "packets sent")

df.melt(Seq("packets sent"), Seq("sourceip", "destinationip"), "type", "ip")
  .groupBy("ip")
  .pivot("type", Seq("sourceip", "destinationip"))
  .sum("packets sent").na.fill(0).show
// +-------------+--------+-------------+
// | ip|sourceip|destinationip|
// +-------------+--------+-------------+
// | 192.168.1.65| 0| 10|
// |192.168.1.102| 20| 0|
// |192.168.1.122| 10| 10|
// | 192.168.1.97| 0| 10|
// +-------------+--------+-------------+
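Note that melt is not a built-in DataFrame method in these Spark versions; the linked question supplies an implementation, called here as if it were a method (presumably via an implicit class). A rough sketch of my own of such a helper (assuming string-typed value columns, as in this example) could look like:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

def melt(df: DataFrame, idVars: Seq[String], valueVars: Seq[String],
         varName: String, valueName: String): DataFrame = {
  // one (name, value) struct per value column, exploded into rows
  val kvs = explode(array(
    valueVars.map(c => struct(lit(c).alias(varName), col(c).alias(valueName))): _*
  )).alias("_kv")
  df.select((idVars.map(col) :+ kvs): _*)
    .select((idVars.map(col) ++ Seq(col(s"_kv.$varName"), col(s"_kv.$valueName"))): _*)
}

// melt(df, Seq("packets sent"), Seq("sourceip", "destinationip"), "type", "ip")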
One way to go about it without any custom aggregation would be to use flatMap (or explode for dataframes) like this:
case class Info(ip : String, sent : Int, received : Int)
case class Message(from : String, to : String, p : Int)
val ds = Seq(Message("ip1", "ip2", 5),
Message("ip2", "ip3", 7),
Message("ip2", "ip1", 1),
Message("ip3", "ip2", 3)).toDS()
ds
.flatMap(x => Seq(Info(x.from, x.p, 0), Info(x.to, 0, x.p)))
.groupBy("ip")
.agg(sum('sent) as "sent", sum('received) as "received")
.show
// +---+----+--------+
// | ip|sent|received|
// +---+----+--------+
// |ip2| 8| 8|
// |ip3| 3| 7|
// |ip1| 5| 1|
// +---+----+--------+
As far as the performance is concerned, I am not sure a flatMap is an improvement versus a custom aggregation though.
Here is a pyspark version using explode. It is more verbose but the logic is exactly the same as the flatMap version, only with pure dataframe code.
from pyspark.sql import functions as F

sc\
    .parallelize([("ip1", "ip2", 5), ("ip2", "ip3", 7), ("ip2", "ip1", 1), ("ip3", "ip2", 3)])\
    .toDF(("from", "to", "p"))\
    .select(F.explode(F.array(
        F.struct(F.col("from").alias("ip"),
                 F.col("p").alias("sent"),
                 F.lit(0).cast("long").alias("received")),
        F.struct(F.col("to").alias("ip"),
                 F.lit(0).cast("long").alias("sent"),
                 F.col("p").alias("received")))))\
    .groupBy("col.ip")\
    .agg(F.sum(F.col("col.sent")).alias("sent"), F.sum(F.col("col.received")).alias("received"))\
    .show()
# +---+----+--------+
# | ip|sent|received|
# +---+----+--------+
# |ip2|   8|       8|
# |ip3|   3|       7|
# |ip1|   5|       1|
# +---+----+--------+
Since you didn't mention the context and the aggregations you need, you may do something like the below:
val df = ??? // your dataframe/ dataset
From Spark source:
(Scala-specific) Compute aggregates by specifying a map from column
name to aggregate methods. The resulting DataFrame will also contain
the grouping columns. The available aggregate methods are avg, max,
min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
df
  .groupBy("department")
  .agg(Map(
    "age" -> "max",
    "expense" -> "sum"
  ))
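Adapted to this question's data (a sketch of mine, reusing the flatMap-style reshaping from the answer above), that could look like:
ds.flatMap(x => Seq(Info(x.from, x.p, 0), Info(x.to, 0, x.p)))
  .groupBy("ip")
  .agg(Map("sent" -> "sum", "received" -> "sum"))
  .show
// columns come back as ip, sum(sent), sum(received)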

What is the equivalent to Hive's find_in_set function (without registering a temp view)?

Below dataframe has 2 columns,
user_id
user_id_list (array)
The requirement is to find the position of user_id in the user_id_list.
Sample record:
user_id = x1
user_id_list = ('X2','X1','X3','X6')
Result:
position = 2
I need the dataframe with a 3rd column which has the position of user_id in the list.
Result dataframe columns:
user_id
user_id_list
position
I can achieve this using the find_in_set() Hive function after registering the dataframe as a view using createOrReplaceTempView.
Is there a sql function available in spark to get this done without registering the view?
My advice would be to implement a UDF, just as Yura mentioned. Here is a short example of what it can look like:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = List((1, Array(2, 3, 1)), (2, Array(1, 2,3))).toDF("user_id","user_id_list")
df.show
+-------+------------+
|user_id|user_id_list|
+-------+------------+
| 1| [2, 3, 1]|
| 2| [1, 2, 3]|
+-------+------------+
val findPosition = udf((user_id: Int, user_id_list: Seq[Int]) => {
user_id_list.indexOf(user_id)
})
val df2 = df.withColumn("position", findPosition($"user_id", $"user_id_list"))
df2.show
+-------+------------+--------+
|user_id|user_id_list|position|
+-------+------------+--------+
| 1| [2, 3, 1]| 2|
| 2| [1, 2, 3]| 1|
+-------+------------+--------+
Is there a sql function available in spark to get this done without registering the view?
No, but you don't have to register a DataFrame to use find_in_set either.
expr function (with find_in_set)
You can (temporarily) switch to SQL mode using expr function instead (see functions object):
Parses the expression string into the column that it represents
val users = Seq(("x1", Array("X2","X1","X3","X6"))).toDF("user_id", "user_id_list")
val positions = users.
as[(String, Array[String])].
map { case (uid, ids) => (uid, ids, ids.mkString(",")) }.
toDF("user_id", "user_id_list", "ids"). // only for nicer column names
withColumn("position", expr("find_in_set(upper(user_id), ids)")).
select("user_id", "user_id_list", "position")
scala> positions.show
+-------+----------------+--------+
|user_id| user_id_list|position|
+-------+----------------+--------+
| x1|[X2, X1, X3, X6]| 2|
+-------+----------------+--------+
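A possible simplification (my observation, not from the original answer): concat_ws also accepts array columns, so the detour through a typed Dataset and mkString can likely be dropped:
users.withColumn(
  "position",
  expr("find_in_set(upper(user_id), concat_ws(',', user_id_list))")
).show
// +-------+----------------+--------+
// |user_id|    user_id_list|position|
// +-------+----------------+--------+
// |     x1|[X2, X1, X3, X6]|       2|
// +-------+----------------+--------+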
posexplode function
You could also use the posexplode function (from the functions object), which saves you some custom Scala coding and is better optimized than UDFs (which force deserialization of internal binary rows into JVM objects).
scala> users.
select('*, posexplode($"user_id_list")).
filter(lower($"user_id") === lower($"col")).
select($"user_id", $"user_id_list", $"pos" as "position").
show
+-------+----------------+--------+
|user_id| user_id_list|position|
+-------+----------------+--------+
| x1|[X2, X1, X3, X6]| 1|
+-------+----------------+--------+
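One caveat worth adding: pos from posexplode is 0-based, while find_in_set is 1-based, so you may want to shift it if the two results need to line up:
users.
  select('*, posexplode($"user_id_list")).
  filter(lower($"user_id") === lower($"col")).
  select($"user_id", $"user_id_list", ($"pos" + 1) as "position").
  show
// position is now 2, matching the find_in_set convention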
I'm not aware of such a function in the Spark SQL API. There's a function to check whether an array contains a value (called array_contains), but that's not what you need.
You could use posexplode to explode the array to rows with a position and then filter by it, like this: dataframe.select($"id", posexplode($"ids")).filter($"id" === $"col").select($"id", $"pos"). Still, it may not be the optimal solution depending on the length of the user id list. Currently (as of version 2.1.1) Spark doesn't do any optimization to replace the above code with a direct array lookup - it will generate the rows and filter over them.
Also take into account that this approach will filter out any rows where user_id is not in user_ids_list, so you may want to take extra effort to overcome this.
I would advise implementing a UDF which does exactly what you need. On the downside: Spark can't look into the UDF, so it'll have to deserialize data to Java objects and back.
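As a further note beyond these answers: if you happen to be on Spark 2.4 or later, the built-in array_position function should do this directly (1-based, returning 0 when the value is not found), without a UDF or find_in_set:
import org.apache.spark.sql.functions.{array_position, upper}

users.withColumn("position", array_position($"user_id_list", upper($"user_id"))).show
// +-------+----------------+--------+
// |user_id|    user_id_list|position|
// +-------+----------------+--------+
// |     x1|[X2, X1, X3, X6]|       2|
// +-------+----------------+--------+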

Spark SQL dataframe: best way to compute across rowpairs

I have a Spark DataFrame "deviceDF" like so:
ID date_time state
a 2015-12-11 4:30:00 up
a 2015-12-11 5:00:00 down
a 2015-12-11 5:15:00 up
b 2015-12-12 4:00:00 down
b 2015-12-12 4:20:00 up
a 2015-12-12 10:15:00 down
a 2015-12-12 10:20:00 up
b 2015-12-14 15:30:00 down
I am trying to calculate the downtime for each of the IDs. I started simply by grouping based on ID and separately computing the sum of all uptimes and downtimes, then taking the difference of the summed uptime and downtime.
val downtimeDF = deviceDF.filter($"state" === "down")
.groupBy("ID")
.agg(sum(unix_timestamp($"date_time")) as "down_time")
val uptimeDF = deviceDF.filter($"state" === "up")
.groupBy("ID")
.agg(sum(unix_timestamp($"date_time")) as "up_time")
val updownjoinDF = uptimeDF.join(downtimeDF, "ID")
val difftimeDF = updownjoinDF
.withColumn("diff_time", $"up_time" - $"down_time")
However, there are a few conditions that cause errors, such as a device that went down but never came back up; in this case, the down_time is the difference between the current_time and the last time it went down.
Also, if the first entry for a particular device starts with 'up', then the down_time is the difference between the first_entry and the time at the beginning of this analysis, say 2015-12-11 00:00:00. What's the best way to handle these border conditions using dataframes? Do I need to write a custom UDAF?
The first thing you can try is to use window functions. While this is usually not the fastest possible solution it is concise and extremely expressive. Taking your data as an example:
import org.apache.spark.sql.functions.unix_timestamp
val df = sc.parallelize(Array(
("a", "2015-12-11 04:30:00", "up"), ("a", "2015-12-11 05:00:00", "down"),
("a", "2015-12-11 05:15:00", "up"), ("b", "2015-12-12 04:00:00", "down"),
("b", "2015-12-12 04:20:00", "up"), ("a", "2015-12-12 10:15:00", "down"),
("a", "2015-12-12 10:20:00", "up"), ("b", "2015-12-14 15:30:00", "down")))
.toDF("ID", "date_time", "state")
.withColumn("timestamp", unix_timestamp($"date_time"))
Let's define an example window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, when, sum}
val w = Window.partitionBy($"ID").orderBy($"timestamp")
some helper columns
val previousTimestamp = coalesce(lag($"timestamp", 1).over(w), $"timestamp")
val previousState = coalesce(lag($"state", 1).over(w), $"state")
val downtime = when(
previousState === "down",
$"timestamp" - previousTimestamp
).otherwise(0).alias("downtime")
val uptime = when(
previousState === "up",
$"timestamp" - previousTimestamp
).otherwise(0).alias("uptime")
and finally a basic query:
val upsAndDowns = df.select($"*", uptime, downtime)
upsAndDowns.show
// +---+-------------------+-----+----------+------+--------+
// | ID| date_time|state| timestamp|uptime|downtime|
// +---+-------------------+-----+----------+------+--------+
// | a|2015-12-11 04:30:00| up|1449804600| 0| 0|
// | a|2015-12-11 05:00:00| down|1449806400| 1800| 0|
// | a|2015-12-11 05:15:00| up|1449807300| 0| 900|
// | a|2015-12-12 10:15:00| down|1449911700|104400| 0|
// | a|2015-12-12 10:20:00| up|1449912000| 0| 300|
// | b|2015-12-12 04:00:00| down|1449889200| 0| 0|
// | b|2015-12-12 04:20:00| up|1449890400| 0| 1200|
// | b|2015-12-14 15:30:00| down|1450103400|213000| 0|
// +---+-------------------+-----+----------+------+--------+
In a similar manner you can look forward, and if there are no more records in a group you can adjust the total uptime / downtime using the current timestamp.
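For illustration, a sketch of that forward-looking adjustment (my addition, using lead and the current timestamp as the "now" marker for a device that went down and never came back):
import org.apache.spark.sql.functions.{coalesce, current_timestamp, lead, unix_timestamp, when, sum}

// time of the next event for the same ID, or "now" if this is the last record
val nextTimestamp = coalesce(
  lead($"timestamp", 1).over(w),
  unix_timestamp(current_timestamp())
)

// downtime measured forward from each "down" record
val openDowntime = when($"state" === "down", nextTimestamp - $"timestamp")
  .otherwise(0).alias("open_downtime")

df.select($"ID", openDowntime).groupBy($"ID").agg(sum($"open_downtime")).show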
Window functions provide some other useful features like window definitions with ROWS BETWEEN and RANGE BETWEEN clauses.
Another possible solution is to move your data to an RDD and use low-level operations with RangePartitioner, mapPartitions and sliding windows. For basic things you can even use groupBy. This requires significantly more effort but is also much more flexible.
Finally there is a spark-timeseries package from Cloudera. Documentation is close to non-existent but tests are comprehensive enough to give you some idea how to use it.
Regarding custom UDAFs, I wouldn't be too optimistic. The UDAF API is rather specific and not exactly flexible.
