Column name when using concat - apache-spark

I am concatenating some columns in Spark SQL using the concat function. Here is some dummy code:
import org.apache.spark.sql.functions.{concat, lit}
val data1 = sc.parallelize(Array(2, 0, 3, 4, 5))
val data2 = sc.parallelize(Array(4, 0, 0, 6, 7))
val data3 = sc.parallelize(Array(1, 2, 3, 4, 10))
val dataJoin = data1.zip(data2).zip(data3).map((x) => (x._1._1, x._1._2, x._2 )).toDF("a1","a2","a3")
val dataConcat = dataJoin.select($"a1",concat(lit("["),$"a1", lit(","),$"a2", lit(","),$"a3", lit("]")))
Is there a way to specify or change the label of this column, to avoid the default name, which is not very practical?
+---+------------------------+
| a1|concat([,a1,,,a2,,,a3,])|
+---+------------------------+
|  2|                 [2,4,1]|
|  0|                 [0,0,2]|
|  3|                 [3,0,3]|
|  4|                 [4,6,4]|
|  5|                [5,7,10]|
+---+------------------------+

Use as or alias methods to give a name to your column.
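For example, a minimal sketch of the same select with the concatenated column renamed via as (alias works the same way); the name "concatenated" is only an illustration:
import org.apache.spark.sql.functions.{concat, lit}
val dataConcat = dataJoin.select(
  $"a1",
  concat(lit("["), $"a1", lit(","), $"a2", lit(","), $"a3", lit("]")).as("concatenated")  // or .alias("concatenated")
)
dataConcat.show()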

Related

Spark Aggregating multiple columns (possible to array) from join output

I have the datasets below:
Table1
Table2
Now I would like to get the dataset below. I've tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into XML. I will be doing this conversion using map.
Any help would be appreciated.
Joining alone is not enough to get the desired output. You are probably missing something: the last element of each nested array might be departmentid. Assuming the last element of the nested array is departmentid, I've generated the output in the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list

case class department(id: Integer, deptname: String)
case class employee(employeid: Integer, empname: String, departmentid: Integer)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val department_df = Seq(
  department(1, "physics"),
  department(2, "computer")).toDF()

val emplyoee_df = Seq(
  employee(1, "A", 1),
  employee(2, "B", 1),
  employee(3, "C", 2),
  employee(4, "D", 2)).toDF()

val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
  selectExpr("id", "deptname", "employeid", "empname").
  rdd.map {
    case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
      (id, deptname, Array(employeid.toString, empname, id.toString))
  }.toDF("id", "deptname", "arrayemp").
  groupBy("id", "deptname").
  agg(collect_list("arrayemp").as("emplist")).
  orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist               |
+---+--------+----------------------+
|1  |physics |[[2, B, 1], [1, A, 1]]|
|2  |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: If I break down the last dataframe transformation into multiple steps, it will probably make it clear how the output is generated.
Left outer join between department_df and emplyoee_df:
val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
  selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
|  1| physics|        2|      B|
|  1| physics|        1|      A|
|  2|computer|        4|      D|
|  2|computer|        3|      C|
+---+--------+---------+-------+
Creating an array using some columns' values from the df1 dataframe:
val df2 = df1.rdd.map {
  case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
    (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
|  1| physics|[2, B, 1]|
|  1| physics|[1, A, 1]|
|  2|computer|[4, D, 2]|
|  2|computer|[3, C, 2]|
+---+--------+---------+
Creating a new list by aggregating multiple arrays using the df2 dataframe:
val result = df2.groupBy("id", "deptname").
  agg(collect_list("arrayemp").as("emplist")).
  orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist               |
+---+--------+----------------------+
|1  |physics |[[2, B, 1], [1, A, 1]]|
|2  |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
  (1, "Physics"),
  (2, "Computer"),
  (3, "Maths")
)).toDF("ID", "Dept")

val schema = List(
  StructField("EMPID", IntegerType, true),
  StructField("EMPNAME", StringType, true),
  StructField("DeptID", IntegerType, true)
)

val data = Seq(
  Row(1, "A", 1),
  Row(2, "B", 1),
  Row(3, "C", 2),
  Row(4, "D", 2),
  Row(5, "E", null)
)

val df_emp = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// build an array per employee, then collect the arrays per department
val newdf = df_emp.withColumn("CONC", array($"EMPID", $"EMPNAME", $"DeptID"))
  .groupBy($"DeptID")
  .agg(expr("collect_list(CONC) as emplist"))

// join the aggregated lists back to the department dataframe
df.join(newdf, df.col("ID") === df_emp.col("DeptID")).select($"ID", $"Dept", $"emplist").show()
+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+

spark dynamically create struct/json per group

I have a spark dataframe like
+-----+---+---+---+------+
|group|  a|  b|  c|config|
+-----+---+---+---+------+
|    a|  1|  2|  3|   [a]|
|    b|  2|  3|  4|[a, b]|
+-----+---+---+---+------+
val df = Seq(("a", 1, 2, 3, Seq("a")),("b", 2, 3,4, Seq("a", "b"))).toDF("group", "a", "b","c", "config")
How can I add an additional column i.e.
df.withColumn("select_by_config", <<>>).show
as a struct or JSON which combines a number of columns (specified by config), similar to a Hive named struct / Spark struct / JSON column? Note that this struct is specific per group and not constant for the whole dataframe; it is specified in the config column.
I can imagine that a df.map could do the trick, but the serialization overhead does not seem efficient. How can this be achieved with SQL-only expressions? Maybe as a Map-type column?
Edit
A possible but really clumsy solution for 2.2 is:
val df = Seq((1,"a", 1, 2, 3, Seq("a")),(2, "b", 2, 3,4, Seq("a", "b"))).toDF("id", "group", "a", "b","c", "config")
df.show
import spark.implicits._
final case class Foo(id:Int, c1:Int, specific:Map[String, Int])
df.map(r => {
  val config = r.getAs[Seq[String]]("config")
  print(config)
  val others = config.map(elem => (elem, r.getAs[Int](elem))).toMap
  Foo(r.getAs[Int]("id"), r.getAs[Int]("c"), others)
}).show
Are there any better ways to solve the problem for 2.2?
If you use a recent build (Spark 2.4.0 RC 1 or later), a combination of higher-order functions should do the trick. Create a map of columns:
import org.apache.spark.sql.functions.{
  array, col, expr, lit, map_from_arrays, map_from_entries
}

val cols = Seq("a", "b", "c")

val dfm = df.withColumn(
  "cmap",
  map_from_arrays(array(cols map lit: _*), array(cols map col: _*))
)
and transform the config:
dfm.withColumn(
  "config_mapped",
  map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))"))
).show
// +-----+---+---+---+------+--------------------+----------------+
// |group|  a|  b|  c|config|                cmap|   config_mapped|
// +-----+---+---+---+------+--------------------+----------------+
// |    a|  1|  2|  3|   [a]|[a -> 1, b -> 2, ...|        [a -> 1]|
// |    b|  2|  3|  4|[a, b]|[a -> 2, b -> 3, ...|[a -> 2, b -> 3]|
// +-----+---+---+---+------+--------------------+----------------+
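If a JSON string per row is preferred over a map column, the same expression can be wrapped in to_json; this is a sketch under the assumption that your build's to_json accepts map input (support for map types is a 2.4-era feature, worth verifying on your version):
import org.apache.spark.sql.functions.to_json

dfm.withColumn(
  "config_json",
  to_json(map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))")))
).show(false)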

pyspark: zip arrays in a dataframe

I have the following PySpark DataFrame:
+------+----------------+
| id   | data           |
+------+----------------+
| 1    | [10, 11, 12]   |
| 2    | [20, 21, 22]   |
| 3    | [30, 31, 32]   |
+------+----------------+
In the end, I want to have the following DataFrame:
+--------+----------------------------------+
| id     | data                             |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
In order to do this, first I extract the data arrays as follows:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
In this way, I have in samples1 an RDD with the content
[[10,20,30],[11,21,31],[12,22,32]]
Question 1: Is that a good way to do it?
Question 2: How to include that RDD back into the dataframe?
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
The number of rows in your DataFrame (df.count())
The length of data (given)
Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want using pyspark.sql.Column.getItem():
import pyspark.sql.functions as f

dataLength = 3
numRows = df.count()

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            f.array(
                [f.collect_list("data").getItem(j).getItem(i)
                 for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
).show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id       |data                                                                          |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
You can simply use a udf function for the zip, but before that you will have to use the collect_list function:
from pyspark.sql import functions as f
from pyspark.sql import types as t

def zipUdf(array):
    return zip(*array)

zipping = f.udf(zipUdf, t.ArrayType(t.ArrayType(t.IntegerType())))

df.select(
    f.collect_list(df.id).alias('id'),
    zipping(f.collect_list(df.data)).alias('data')
).show(truncate=False)
which would give you
+---------+------------------------------------------------------------------------------+
|id       |data                                                                          |
+---------+------------------------------------------------------------------------------+
|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
+---------+------------------------------------------------------------------------------+

Spark: Find the value with the highest occurrence per group over rolling time window

Starting from the following spark data frame:
from io import StringIO
import pandas as pd
from pyspark.sql.functions import col
pd_df = pd.read_csv(StringIO("""device_id,read_date,id,count
device_A,2017-08-05,4041,3
device_A,2017-08-06,4041,3
device_A,2017-08-07,4041,4
device_A,2017-08-08,4041,3
device_A,2017-08-09,4041,3
device_A,2017-08-10,4041,1
device_A,2017-08-10,4045,2
device_A,2017-08-11,4045,3
device_A,2017-08-12,4045,3
device_A,2017-08-13,4045,3"""),infer_datetime_format=True, parse_dates=['read_date'])
df = spark.createDataFrame(pd_df).withColumn('read_date', col('read_date').cast('date'))
df.show()
Output:
+--------------+----------+----+-----+
|     device_id| read_date|  id|count|
+--------------+----------+----+-----+
|      device_A|2017-08-05|4041|    3|
|      device_A|2017-08-06|4041|    3|
|      device_A|2017-08-07|4041|    4|
|      device_A|2017-08-08|4041|    3|
|      device_A|2017-08-09|4041|    3|
|      device_A|2017-08-10|4041|    1|
|      device_A|2017-08-10|4045|    2|
|      device_A|2017-08-11|4045|    3|
|      device_A|2017-08-12|4045|    3|
|      device_A|2017-08-13|4045|    3|
+--------------+----------+----+-----+
I would like to find the most frequent id for each (device_id, read_date) combination, over a 3 day rolling window. For each group of rows selected by the time window, I need to find the most frequent id by summing up the counts per id, then return the top id.
Expected Output:
+--------------+----------+----+
|     device_id| read_date|  id|
+--------------+----------+----+
|      device_A|2017-08-05|4041|
|      device_A|2017-08-06|4041|
|      device_A|2017-08-07|4041|
|      device_A|2017-08-08|4041|
|      device_A|2017-08-09|4041|
|      device_A|2017-08-10|4041|
|      device_A|2017-08-11|4045|
|      device_A|2017-08-12|4045|
|      device_A|2017-08-13|4045|
+--------------+----------+----+
I am starting to think this is only possible using a custom aggregation function. Since Spark 2.3 is not out yet, I will have to write this in Scala or use collect_list. Am I missing something?
Add window:
from pyspark.sql.functions import window, sum as sum_, date_add

df_w = df.withColumn(
    "read_date", window("read_date", "3 days", "1 day")["start"].cast("date")
)
# Then handle the counts
df_w = df_w.groupBy('device_id', 'read_date', 'id').agg(sum_('count').alias('count'))
Use one of the solutions from Find maximum row per group in Spark DataFrame, for example:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

rolling_window = 3

top_df = (
    df_w
    .withColumn(
        "rn",
        row_number().over(
            Window.partitionBy("device_id", "read_date")
            .orderBy(col("count").desc())
        )
    )
    .where(col("rn") == 1)
    .orderBy("read_date")
    .drop("rn")
)
# results are calculated on the start of the time window - adjust read_date as needed
final_df = top_df.withColumn('read_date', date_add('read_date', rolling_window - 1))
final_df.show()
# +---------+----------+----+-----+
# |device_id| read_date|  id|count|
# +---------+----------+----+-----+
# | device_A|2017-08-05|4041|    3|
# | device_A|2017-08-06|4041|    6|
# | device_A|2017-08-07|4041|   10|
# | device_A|2017-08-08|4041|   10|
# | device_A|2017-08-09|4041|   10|
# | device_A|2017-08-10|4041|    7|
# | device_A|2017-08-11|4045|    5|
# | device_A|2017-08-12|4045|    8|
# | device_A|2017-08-13|4045|    9|
# | device_A|2017-08-14|4045|    6|
# | device_A|2017-08-15|4045|    3|
# +---------+----------+----+-----+
I managed to find a very inefficient solution. Hopefully someone can spot improvements to avoid the Python udf and the call to collect_list.
from collections import Counter

from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, first, udf
from pyspark.sql.types import IntegerType

def top_id(ids, counts):
    c = Counter()
    for cnid, count in zip(ids, counts):
        c[cnid] += count
    return c.most_common(1)[0][0]

rolling_window = 3
days = lambda i: i * 86400

# Define a rolling calculation window based on time
window = (
    Window()
    .partitionBy("device_id")
    .orderBy(col("read_date").cast("timestamp").cast("long"))
    .rangeBetween(-days(rolling_window - 1), 0)
)

# Use window and collect_list to store data matching the window definition on each row
df_collected = df.select(
    'device_id', 'read_date',
    collect_list(col('id')).over(window).alias('ids'),
    collect_list(col('count')).over(window).alias('counts')
)

# Get rid of duplicate rows where necessary
df_grouped = df_collected.groupBy('device_id', 'read_date').agg(
    first('ids').alias('ids'),
    first('counts').alias('counts'),
)

# Register and apply udf to return the most frequently seen id
top_id_udf = udf(top_id, IntegerType())
df_mapped = df_grouped.withColumn('top_id', top_id_udf(col('ids'), col('counts')))
df_mapped.show(truncate=False)
returns:
+---------+----------+------------------------+------------+------+
|device_id|read_date |ids                     |counts      |top_id|
+---------+----------+------------------------+------------+------+
|device_A |2017-08-05|[4041]                  |[3]         |4041  |
|device_A |2017-08-06|[4041, 4041]            |[3, 3]      |4041  |
|device_A |2017-08-07|[4041, 4041, 4041]      |[3, 3, 4]   |4041  |
|device_A |2017-08-08|[4041, 4041, 4041]      |[3, 4, 3]   |4041  |
|device_A |2017-08-09|[4041, 4041, 4041]      |[4, 3, 3]   |4041  |
|device_A |2017-08-10|[4041, 4041, 4041, 4045]|[3, 3, 1, 2]|4041  |
|device_A |2017-08-11|[4041, 4041, 4045, 4045]|[3, 1, 2, 3]|4045  |
|device_A |2017-08-12|[4041, 4045, 4045, 4045]|[1, 2, 3, 3]|4045  |
|device_A |2017-08-13|[4045, 4045, 4045]      |[3, 3, 3]   |4045  |
+---------+----------+------------------------+------------+------+

Issue with adding Column to a DataFrame

The code below fails with an AnalysisException (sc.version: String = 1.6.0):
case class Person(name: String, age: Long)
val caseClassDF = Seq(Person("Andy", 32)).toDF()
caseClassDF.count()
val seq = Seq(1)
val rdd = sqlContext.sparkContext.parallelize(seq)
val df2 = rdd.toDF("Counts")
df2.count()
val withCounts = caseClassDF.withColumn("duration", df2("Counts"))
For some reason, it works with UDF:
import org.apache.spark.sql.functions.udf
case class Person(name: String, age: Long, day: Int)
val caseClassDF = Seq(Person("Andy", 32, 1), Person("Raman", 22, 1), Person("Rajan", 40, 1), Person("Andy", 42, 2), Person("Raman", 42, 2), Person("Rajan", 50, 2)).toDF()
val calculateCounts = udf((x: Long, y: Int) => x + y)
val df1 = caseClassDF.withColumn("Counts", calculateCounts($"age", $"day"))
df1.show
+-----+---+---+------+
| name|age|day|Counts|
+-----+---+---+------+
| Andy| 32|  1|    33|
|Raman| 22|  1|    23|
|Rajan| 40|  1|    41|
| Andy| 42|  2|    44|
|Raman| 42|  2|    44|
|Rajan| 50|  2|    52|
+-----+---+---+------+
In caseClassDF.withColumn("duration", df2("Counts")), the column should belong to the same dataframe (in your case caseClassDF). AFAIK, Spark does not allow a column from a different DataFrame in withColumn.
PS: I am a user of Spark 1.6.x; not sure whether this has changed in Spark 2.x.
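A common workaround (a sketch, not taken from the answer above) is to avoid referencing a column of another DataFrame at all, for example by collecting the single value from df2 and adding it as a literal:
import org.apache.spark.sql.functions.lit

// df2 holds a single row with the value 1, so pull it out and add it as a literal column
val counts = df2.first().getInt(0)
val withCounts = caseClassDF.withColumn("duration", lit(counts))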
