How to use groupBy to collect rows into a map? - apache-spark

Context
sqlContext.sql(s"""
SELECT
school_name,
name,
age
FROM my_table
""")
Ask
Given the above table, I would like to group by school name and collect name, age into a Map[String, Int]
For example - Pseudo-code
val df = sqlContext.sql(s"""
SELECT
school_name,
age
FROM my_table
GROUP BY school_name
""")
------------------------
school_name | name | age
------------------------
school A | "michael"| 7
school A | "emily" | 5
school B | "cathy" | 10
school B | "shaun" | 5
df.groupBy("school_name").agg(make_map)
------------------------------------
school_name | map
------------------------------------
school A | {"michael": 7, "emily": 5}
school B | {"cathy": 10, "shaun": 5}

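The following works with Spark 2.0 and later. The map function, available since the 2.0 release, turns a pair of columns into a Map column.
import org.apache.spark.sql.functions.{col, collect_list, map}
import spark.implicits._ // enables the $"..." column syntax
val df1 = df.groupBy(col("school_name")).agg(collect_list(map($"name",$"age")) as "map")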
df1.show(false)
This gives the following output.
+-----------+------------------------------------+
|school_name|map |
+-----------+------------------------------------+
|school B |[Map(cathy -> 10), Map(shaun -> 5)] |
|school A |[Map(michael -> 7), Map(emily -> 5)]|
+-----------+------------------------------------+
Now you can use a UDF to merge the individual Maps into a single Map, as shown below.
import org.apache.spark.sql.functions.udf
val joinMap = udf { values: Seq[Map[String,Int]] => values.flatten.toMap }
val df2 = df1.withColumn("map", joinMap(col("map")))
df2.show(false)
This gives the required output with Map[String,Int].
+-----------+-----------------------------+
|school_name|map |
+-----------+-----------------------------+
|school B |Map(cathy -> 10, shaun -> 5) |
|school A |Map(michael -> 7, emily -> 5)|
+-----------+-----------------------------+
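To make the merge step concrete, here is a plain-Python sketch (no Spark needed) of what the joinMap UDF does to each row's list of single-entry maps. The name merge_maps is hypothetical and only for illustration. As with flatten.toMap, a repeated key keeps the value seen last.

```python
def merge_maps(maps):
    """Merge a list of dicts into one dict; later entries win on duplicate keys."""
    merged = {}
    for m in maps:
        merged.update(m)
    return merged


# One list of single-entry maps per school, as produced by collect_list(map(...)).
rows = {
    "school A": [{"michael": 7}, {"emily": 5}],
    "school B": [{"cathy": 10}, {"shaun": 5}],
}
result = {school: merge_maps(maps) for school, maps in rows.items()}
print(result)
```

If two pupils in the same school share a name, only one of them survives the merge, so it may be worth checking for duplicates first.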
If you want to convert a column value into a JSON string, Spark 2.1.0 introduced the to_json function.
val df3 = df2.withColumn("map",to_json(struct($"map")))
df3.show(false)
The to_json function returns the following output.
+-----------+-------------------------------+
|school_name|map |
+-----------+-------------------------------+
|school B |{"map":{"cathy":10,"shaun":5}} |
|school A |{"map":{"michael":7,"emily":5}}|
+-----------+-------------------------------+

As of Spark 2.4, you can use the map_from_arrays function to achieve this.
val df = spark.sql(s"""
SELECT *
FROM VALUES ('s1','a',1),('s1','b',2),('s2','a',1)
AS (school, name, age)
""")
val df2 = df.groupBy("school").agg(map_from_arrays(collect_list($"name"), collect_list($"age")).as("map"))
+------+----+---+
|school|name|age|
+------+----+---+
| s1| a| 1|
| s1| b| 2|
| s2| a| 1|
+------+----+---+
+------+----------------+
|school| map|
+------+----------------+
| s2| [a -> 1]|
| s1|[a -> 1, b -> 2]|
+------+----------------+
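As a rough model of what this aggregation computes, the plain-Python sketch below collects the names and ages per school into two parallel lists and then zips them into a map. The helper names map_from_lists and group_to_map are hypothetical. The approach assumes both collected lists come back in the same row order.

```python
from collections import defaultdict


def map_from_lists(keys, values):
    """Pair keys with values positionally, like zipping two arrays into a map."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return dict(zip(keys, values))


def group_to_map(rows):
    """rows: iterable of (school, name, age) tuples."""
    names, ages = defaultdict(list), defaultdict(list)
    for school, name, age in rows:
        names[school].append(name)
        ages[school].append(age)
    return {s: map_from_lists(names[s], ages[s]) for s in names}


rows = [("s1", "a", 1), ("s1", "b", 2), ("s2", "a", 1)]
print(group_to_map(rows))
```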

df.select($"school_name",concat_ws(":",$"age",$"name").as("new_col")).groupBy($"school_name").agg(collect_set($"new_col")).show
+-----------+--------------------+
|school_name|collect_set(new_col)|
+-----------+--------------------+
| school B| [5:shaun, 10:cathy]|
| school A|[7:michael, 5:emily]|
+-----------+--------------------+

Related

pyspark counting number of nulls per group

I have a dataframe that has time series data in it and some categorical data
| cat | TS1 | TS2 | ... |
| A | 1 | null | ... |
| A | 1 | 20 | ... |
| B | null | null | ... |
| A | null | null | ... |
| B | 1 | 100 | ... |
I would like to find out how many null values there are per column per group, so an expected output would look something like:
| cat | TS1 | TS2 |
| A | 1 | 2 |
| B | 1 | 1 |
Currently I can do this for one of the groups with something like this:
df_null_cats = df.where(df.cat == "A").where(reduce(lambda x, y: x | y, (col(c).isNull() for c in df.columns))).select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_nulls.columns])
but I am struggling to get one that would work for the whole dataframe.
You can use groupBy with an aggregation function to get the required output.
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local").getOrCreate()
# Sample dataframe
in_values = [("A", 1, None),
("A", 1, 20),
("B", None, None),
("A", None, None),
("B", 1, 100)]
in_df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")
columns = in_df.columns
# Ignoring groupBy column and considering cols which are required in aggregation
columns.remove("cat")
agg_expression = [sum(when(in_df[x].isNull(), 1).otherwise(0)).alias(x) for x in columns]
in_df.groupby("cat").agg(*agg_expression).show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| B| 1| 1|
| A| 1| 2|
+---+---+---+
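The aggregation expression is just a conditional count: each null contributes 1 and everything else contributes 0, summed per group. A plain-Python sketch of that logic (no Spark required; null_counts_by_group is a hypothetical name) looks like this:

```python
from collections import defaultdict


def null_counts_by_group(rows, group_key, value_cols):
    """Count None values per column within each group."""
    counts = defaultdict(lambda: {c: 0 for c in value_cols})
    for row in rows:
        group = counts[row[group_key]]  # creates a zeroed entry on first sight
        for c in value_cols:
            if row[c] is None:
                group[c] += 1
    return dict(counts)


rows = [
    {"cat": "A", "TS1": 1, "TS2": None},
    {"cat": "A", "TS1": 1, "TS2": 20},
    {"cat": "B", "TS1": None, "TS2": None},
    {"cat": "A", "TS1": None, "TS2": None},
    {"cat": "B", "TS1": 1, "TS2": 100},
]
print(null_counts_by_group(rows, "cat", ["TS1", "TS2"]))
```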
"Sum" function can be used with condition for null value. On Scala:
val df = Seq(
(Some("A"), Some(1), None),
(Some("A"), Some(1), Some(20)),
(Some("B"), None, None),
(Some("A"), None, None),
(Some("B"), Some(1), Some(100)),
).toDF("cat", "TS1", "TS2")
val aggregatorColumns = df
.columns
.tail
.map(columnName => sum(when(col(columnName).isNull, 1).otherwise(0)).alias(columnName))
df
.groupBy("cat")
.agg(
aggregatorColumns.head, aggregatorColumns.tail: _*
)
@Mohana's answer is good but it's still not dynamic: you need to code the operation for every single column.
In my answer below, we can use Pandas UDFs and applyInPandas to write a simple function in Pandas which will then be applied to our PySpark dataframe.
import pandas as pd
from pyspark.sql.types import *
in_values = [("A", 1, None),
("A", 1, 20),
("B", None, None),
("A", None, None),
("B", 1, 100)]
df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")
# define output schema: same column names, but we must ensure that the output type is integer
output_schema = StructType(
[StructField('cat', StringType())] + \
[StructField(col, IntegerType(), True) for col in [c for c in df.columns if c.startswith('TS')]]
)
# custom Python function to define aggregations in Pandas
def null_count(pdf):
    columns = [c for c in pdf.columns if c.startswith('TS')]
    result = pdf\
        .groupby('cat')[columns]\
        .agg(lambda x: x.isnull().sum())\
        .reset_index()
    return result
# use applyInPandas
df\
.groupby('cat')\
.applyInPandas(null_count, output_schema)\
.show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| A| 1| 2|
| B| 1| 1|
+---+---+---+

In Spark, is there any way to convert an array of strings into the corresponding array of integers according to a mapping table?

In Spark, according to the mapping table (String -> Integer), is there any way to convert an array of strings into the corresponding array of integers?
Example: in Spark, there are 500 million arrays:
Array String 1 : ['TOM','White','Black']
Array String 2 : ['BCD','TTTT','Black']
.....
Mapping Table: [BCD -> 1, White -> 2, Black -> 3, TTTT -> 4, TOM -> 5, ...] (one million entries).
Result:
Array Integer 1 : [5,2,3]
Array Integer 2 : [1,4,3]
....
One possible solution is to explode the array column, join with the mapping table, then group back again:
scala> val df = Seq(1 -> Array("TOM","White","Black"),
| 2 -> Array("BCD","TTTT","Black")).toDF("id", "vals")
df: org.apache.spark.sql.DataFrame = [id: int, vals: array<string>]
scala> df.show()
+---+-------------------+
| id| vals|
+---+-------------------+
| 1|[TOM, White, Black]|
| 2| [BCD, TTTT, Black]|
+---+-------------------+
scala> val mapping = Seq("BCD" -> 1, "White" -> 2, "Black" -> 3, "TTTT" -> 4, "TOM" -> 5).toDF("k", "v")
mapping: org.apache.spark.sql.DataFrame = [k: string, v: int]
scala> mapping.show()
+-----+--+
| k| v|
+-----+--+
| BCD| 1|
|White| 2|
|Black| 3|
| TTTT| 4|
| TOM| 5|
+-----+--+
scala> df
| .select($"id", posexplode($"vals") as Seq("pos", "val")).alias("df_exp")
| .join(mapping as "mapping", $"df_exp.val" === $"mapping.k")
| .repartition($"id")
| .sortWithinPartitions($"pos")
| .groupBy($"id")
| .agg(collect_list($"v") as "mapped")
| .show()
+---+---------+
| id| mapped|
+---+---------+
| 1|[5, 2, 3]|
| 2|[1, 4, 3]|
+---+---------+
Joining datasets shuffles the data, and the grouping that follows does not guarantee any particular order within each group. That is why posexplode() is used: it adds a column holding the original position of each value in its array. The data is then repartitioned on id and sorted by that position within each partition before the values are collected.
If using Spark < 2.1.0, monotonically_increasing_id() can be used instead.
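The explode, join, and regroup steps can be sketched in plain Python (no Spark required) to show why the position column matters. The function name map_arrays is hypothetical. As with the inner join above, values that have no entry in the mapping are dropped.

```python
def map_arrays(arrays, mapping):
    """arrays: {id: [str, ...]}, mapping: {str: int}. Returns {id: [int, ...]}."""
    # "posexplode": one (id, pos, val) record per array element.
    exploded = [
        (id_, pos, val)
        for id_, vals in arrays.items()
        for pos, val in enumerate(vals)
    ]
    # Inner join against the mapping table.
    joined = [(id_, pos, mapping[val]) for id_, pos, val in exploded if val in mapping]
    # Sort by (id, pos) so each group is rebuilt in its original order.
    result = {}
    for id_, _pos, v in sorted(joined):
        result.setdefault(id_, []).append(v)
    return result


arrays = {1: ["TOM", "White", "Black"], 2: ["BCD", "TTTT", "Black"]}
mapping = {"BCD": 1, "White": 2, "Black": 3, "TTTT": 4, "TOM": 5}
print(map_arrays(arrays, mapping))
```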
In SQL it should be something like:
WITH df_exp AS (
SELECT
id,
posexplode(vals) AS (pos, val)
FROM df
)
SELECT
id,
collect_list(v) AS mapped
FROM (
SELECT
id,
v
FROM df_exp
JOIN mapping
ON val = k
DISTRIBUTE BY id
SORT BY pos ASC
)
GROUP BY id

Spark filter multiple group of rows to a single row

I am trying to achieve the following.
Let's say I have a dataframe with the following columns:
id | name | alias
-------------------
1 | abc | short
1 | abc | ailas-long-1
1 | abc | another-long-alias
2 | xyz | short_alias
2 | xyz | same_length
3 | def | alias_1
I want to group by id and name and select the shortest alias.
The output I am expecting is
id | name | alias
-------------------
1 | abc | short
2 | xyz | short_alias
3 | def | alias_1
I can achieve this using a window and row_number; is there any other efficient method to get the same result? In general, the filter condition on the third column can be anything; in this case it is the length of the field.
Any help would be much appreciated.
Thank you.
All you need to do is compute the alias length with the built-in length function and order the window by it:
from pyspark.sql import functions as f
from pyspark.sql import Window
windowSpec = Window.partitionBy('id', 'name').orderBy('length')
df.withColumn('length', f.length('alias'))\
.withColumn('length', f.row_number().over(windowSpec))\
.filter(f.col('length') == 1)\
.drop('length')\
.show(truncate=False)
which should give you
+---+----+-----------+
|id |name|alias |
+---+----+-----------+
|3 |def |alias_1 |
|1 |abc |short |
|2 |xyz |short_alias|
+---+----+-----------+
A solution without a window (not very pretty), followed by what is, in my opinion, the easiest solution, using an RDD:
from pyspark.sql import functions as F
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
rdd = sc.parallelize([(1 , "abc" , "short-alias"),
(1 , "abc" , "short"),
(1 , "abc" , "ailas-long-1"),
(1 , "abc" , "another-long-alias"),
(2 , "xyz" , "same_length"),
(2 , "xyz" , "same_length1"),
(3 , "def" , "short_alias") ])
df = hiveCtx.createDataFrame(\
rdd, ["id", "name", "alias"])
len_df = df.groupBy(["id", "name"]).agg(F.min(F.length("alias")).alias("alias_len"))
df = df.withColumn("alias_len", F.length("alias"))
cond = ["alias_len", "id", "name"]
df.join(len_df, cond).show()
print(rdd.map(lambda x: ((x[0], x[1]), x[2]))\
.reduceByKey(lambda x,y: x if len(x) < len(y) else y ).collect())
Output:
+---------+---+----+-----------+
|alias_len| id|name| alias|
+---------+---+----+-----------+
| 11| 3| def|short_alias|
| 11| 2| xyz|same_length|
| 5| 1| abc| short|
+---------+---+----+-----------+
[((2, 'xyz'), 'same_length'), ((3, 'def'), 'short_alias'), ((1, 'abc'), 'short')]
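For readers who want to see the reduceByKey idea without a cluster, here is a plain-Python sketch of the same reduction: keep one alias per (id, name) key and replace it whenever a strictly shorter one shows up. The name shortest_alias is hypothetical. Unlike the lambda above, ties here keep the first alias seen.

```python
def shortest_alias(rows):
    """rows: iterable of (id, name, alias). Returns {(id, name): shortest alias}."""
    best = {}
    for id_, name, alias in rows:
        key = (id_, name)
        if key not in best or len(alias) < len(best[key]):
            best[key] = alias
    return best


rows = [
    (1, "abc", "short-alias"),
    (1, "abc", "short"),
    (1, "abc", "ailas-long-1"),
    (1, "abc", "another-long-alias"),
    (2, "xyz", "same_length"),
    (2, "xyz", "same_length1"),
    (3, "def", "short_alias"),
]
print(shortest_alias(rows))
```

Because any comparison can replace the len check, the same shape works for other "pick one row per group" conditions.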

How to lower the case of column names of a data frame but not its values?

How can I lower the case of the column names of a data frame, but not its values, using raw Spark SQL and DataFrame methods?
Input data frame (imagine I have hundreds of these columns in uppercase):
NAME | COUNTRY | SRC | CITY | DEBIT
---------------------------------------------
"foo"| "NZ" | salary | "Auckland" | 15.0
"bar"| "Aus" | investment | "Melbourne"| 12.5
Target data frame:
name | country | src | city | debit
------------------------------------------------
"foo"| "NZ" | salary | "Auckland" | 15.0
"bar"| "Aus" | investment | "Melbourne"| 12.5
If you are using Scala, you can simply do the following:
import org.apache.spark.sql.functions._
df.select(df.columns.map(x => col(x).as(x.toLowerCase)): _*).show(false)
And if you are using pyspark, you can simply do the following
from pyspark.sql import functions as F
df.select([F.col(x).alias(x.lower()) for x in df.columns]).show()
Java 8 solution to convert the column names to lower case.
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Column;
df.select(Arrays.asList(df.columns()).stream().map(x -> col(x).as(x.toLowerCase())).toArray(size -> new Column[size])).show(false);
How about this:
Some fake data:
scala> val df = spark.sql("select 'A' as AA, 'B' as BB")
df: org.apache.spark.sql.DataFrame = [AA: string, BB: string]
scala> df.show()
+---+---+
| AA| BB|
+---+---+
| A| B|
+---+---+
Now re-select all columns with a new name, which is just their lower-case version:
scala> val cols = df.columns.map(c => s"$c as ${c.toLowerCase}")
cols: Array[String] = Array(AA as aa, BB as bb)
scala> val lowerDf = df.selectExpr(cols:_*)
lowerDf: org.apache.spark.sql.DataFrame = [aa: string, bb: string]
scala> lowerDf.show()
+---+---+
| aa| bb|
+---+---+
| A| B|
+---+---+
Note: I use Scala. If you use PySpark and are not familiar with the Scala syntax, then df.columns.map(c => s"$c as ${c.toLowerCase}") is map(lambda c: c + " as " + c.lower(), df.columns) in Python and cols:_* becomes *cols. Please note I didn't run this translation.
For Java 8:
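As a small, Spark-free illustration of the expression strings involved, the sketch below builds the "X as x" list in Python. The function name lower_case_exprs is hypothetical, and it assumes the column names are simple identifiers that need no quoting.

```python
def lower_case_exprs(columns):
    """Build one 'NAME as name' select expression per column name."""
    return [f"{c} as {c.lower()}" for c in columns]


cols = lower_case_exprs(["AA", "BB"])
print(cols)
```

With a real DataFrame, the resulting list would presumably be passed along as df.selectExpr(*lower_case_exprs(df.columns)).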
Dataset<Row> input;
for (StructField field : input.schema().fields()) {
String newName = field.name().toLowerCase(Locale.ROOT);
input = input.withColumnRenamed(field.name(), newName);
if (field.dataType() instanceof StructType) {
StructType newStructType = (StructType) StructType.fromJson(field.dataType().json().toLowerCase(Locale.ROOT));
input = input.withColumn(newName, col(newName).cast(newStructType));
}
}
In Python, you can call df.withColumnRenamed(col_name, col_name.lower()) on a Spark DataFrame for each column name in df.columns.

Pyspark : forward fill with last observation for a DataFrame

Using Spark 1.5.1,
I've been trying to forward fill null values with the last known observation for one column of my DataFrame.
The series may start with a null value, and in that case I would like to backward fill it with the first known observation. However, if that complicates the code too much, this point can be skipped.
In this post, a solution in Scala was provided for a very similar problem by zero323.
But I don't know Scala and I haven't managed to translate it into PySpark API code. Is it possible to do this with PySpark?
Thanks for your help.
Below is a simple sample input:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | null
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | null
| 1 | 2015-12-05 | null
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | null
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | null
| 2 | 2015-12-03 | null
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | null
| 2 | 2015-12-06 | U4
And the expected output:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | U1
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | U1
| 1 | 2015-12-05 | U1
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | U2
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | U1
| 2 | 2015-12-03 | U3
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | U3
| 2 | 2015-12-06 | U4
Another workaround to get this working, is to try something like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
window = (
Window
.partitionBy('cookie_id')
.orderBy('Time')
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
final = (
joined
.withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window))
)
This constructs a window from the partition key and the ordering column, and tells the window to look back over all rows from the start of the partition up to the current row. At each row, last with ignorenulls=True then returns the most recent non-null value (which, given this window, may be the current row itself).
Below is the partitioned example code from "Spark / Scala: forward fill with last observation", translated to PySpark. It only works for data that can be partitioned.
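To see what that window computes row by row, here is a plain-Python sketch (no Spark required) of "last non-null value so far, per partition". The function and argument names are hypothetical. A leading null stays null because nothing precedes it.

```python
def forward_fill(rows, partition_key, order_key, value_key, out_key):
    """Add out_key holding the last non-None value_key seen in each partition."""
    last_seen = {}
    filled = []
    for row in sorted(rows, key=lambda r: (r[partition_key], r[order_key])):
        if row[value_key] is not None:
            last_seen[row[partition_key]] = row[value_key]
        filled.append({**row, out_key: last_seen.get(row[partition_key])})
    return filled


rows = [
    {"cookie_id": 1, "Time": "2015-12-01", "User_ID": None},
    {"cookie_id": 1, "Time": "2015-12-02", "User_ID": "U1"},
    {"cookie_id": 1, "Time": "2015-12-04", "User_ID": None},
    {"cookie_id": 2, "Time": "2015-12-03", "User_ID": None},
    {"cookie_id": 2, "Time": "2015-12-04", "User_ID": "U3"},
]
for r in forward_fill(rows, "cookie_id", "Time", "User_ID", "UserIDFilled"):
    print(r)
```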
Load the data
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
rdd = sc.parallelize(values)
df = rdd.toDF(["cookie_id", "c_date", "user_id"])
df = df.withColumn("c_date", df.c_date.cast("date"))
df.show()
The DataFrame is
+---------+----------+-------+
|cookie_id| c_date|user_id|
+---------+----------+-------+
| 1|2015-12-01| null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| null|
| 1|2015-12-05| null|
| 2|2015-12-04| null|
| 2|2015-12-03| null|
| 2|2015-12-02| U3|
| 2|2015-12-05| null|
+---------+----------+-------+
Column used to sort the partitions
# get the sort key
def getKey(item):
    return item.c_date
The fill function. Can be used to fill in multiple columns if necessary.
# fill function
def fill(x):
    out = []
    last_val = None
    for v in x:
        if v["user_id"] is None:
            data = [v["cookie_id"], v["c_date"], last_val]
        else:
            data = [v["cookie_id"], v["c_date"], v["user_id"]]
            last_val = v["user_id"]
        out.append(data)
    return out
Convert to rdd, partition, sort and fill the missing values
# Partition the data
rdd = df.rdd.groupBy(lambda x: x.cookie_id).mapValues(list)
# Sort the data by date
rdd = rdd.mapValues(lambda x: sorted(x, key=getKey))
# fill missing value and flatten
rdd = rdd.mapValues(fill).flatMapValues(lambda x: x)
# discard the key
rdd = rdd.map(lambda v: v[1])
Convert back to DataFrame
df_out = sqlContext.createDataFrame(rdd)
df_out.show()
The output is
+---+----------+----+
| _1| _2| _3|
+---+----------+----+
| 1|2015-12-01|null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| U2|
| 1|2015-12-05| U2|
| 2|2015-12-02| U3|
| 2|2015-12-03| U3|
| 2|2015-12-04| U3|
| 2|2015-12-05| U3|
+---+----------+----+
Hope you find this forward fill function useful. It is written using native PySpark functions only; neither a UDF nor an RDD is used (both can be very slow, especially UDFs).
Let's use the example provided by @Sid.
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
df = spark.createDataFrame(values, ['cookie_ID', 'Time', 'User_ID'])
Functions:
from pyspark.sql import Window
# Note: this sum is the Spark aggregate and shadows Python's built-in sum.
from pyspark.sql.functions import col, collect_list, lit, sum, when
from pyspark.sql.types import StringType

def cum_sum(df, sum_col, order_col, cum_sum_col_nm='cum_sum'):
    '''Find cumulative sum of a column.
    Parameters
    -----------
    sum_col : String
        Column to perform cumulative sum.
    order_col : List
        Column/columns to sort for cumulative sum.
    cum_sum_col_nm : String
        The name of the resulting cum_sum column.
    Return
    -------
    df : DataFrame
        Dataframe with additional "cum_sum_col_nm".
    '''
    # A constant column puts every row into a single window partition.
    df = df.withColumn('tmp', lit('tmp'))
    windowval = (Window.partitionBy('tmp')
                 .orderBy(order_col)
                 .rangeBetween(Window.unboundedPreceding, 0))
    df = df.withColumn(cum_sum_col_nm, sum(sum_col).over(windowval).cast(StringType()))
    df = df.drop('tmp')
    return df
def forward_fill(df, order_col, fill_col, fill_col_name=None):
    '''Forward fill a column by a column/set of columns (order_col).
    Parameters:
    ------------
    df: Dataframe
    order_col: String or List of string
    fill_col: String (Only work for a column for this version.)
    Return:
    ---------
    df: Dataframe
        Return df with the filled_cols.
    '''
    # "value" and "constant" are tmp columns created to enable forward fill.
    df = df.withColumn('value', when(col(fill_col).isNull(), 0).otherwise(1))
    df = cum_sum(df, 'value', order_col).drop('value')
    df = df.withColumn(fill_col,
                       when(col(fill_col).isNull(), 'constant').otherwise(col(fill_col)))
    win = (Window.partitionBy('cum_sum')
           .orderBy(order_col))
    if not fill_col_name:
        fill_col_name = 'ffill_{}'.format(fill_col)
    df = df.withColumn(fill_col_name, collect_list(fill_col).over(win)[0])
    df = df.drop('cum_sum')
    df = df.withColumn(fill_col_name, when(col(fill_col_name)=='constant', None).otherwise(col(fill_col_name)))
    df = df.withColumn(fill_col, when(col(fill_col)=='constant', None).otherwise(col(fill_col)))
    return df
Let's see the results.
ffilled_df = forward_fill(df,
order_col=['cookie_ID', 'Time'],
fill_col='User_ID',
fill_col_name = 'User_ID_ffil')
ffilled_df.sort(['cookie_ID', 'Time']).show()
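The trick behind these two functions is that a running count of non-null values gives every non-null row, together with the nulls that follow it, the same group id; the first value in each group is then the fill value. A plain-Python sketch of that idea, assuming the values are already in the desired order (ffill_by_running_count is a hypothetical name):

```python
from itertools import accumulate


def ffill_by_running_count(values):
    """Forward fill using a running count of non-None values as the group id."""
    flags = [0 if v is None else 1 for v in values]
    group_ids = list(accumulate(flags))  # cumulative sum of the flags
    first_in_group = {}
    for gid, v in zip(group_ids, values):
        first_in_group.setdefault(gid, v)  # keep only the first value per group
    return [first_in_group[gid] for gid in group_ids]


print(ffill_by_running_count([None, "U1", "U1", "U2", None, None]))
```

Group 0 only contains leading nulls, so those stay unfilled.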
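from pyspark.sql import functions as F, Window
# Forward filling: last non-null value from the partition start up to the current row
w1 = Window.partitionBy('cookie_id').orderBy('c_date').rowsBetween(Window.unboundedPreceding,0)
# Backward filling: first non-null value over the whole partition, used only when w1 finds nothing
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
final_df = df.withColumn('UserIDFilled', F.coalesce(F.last('user_id', True).over(w1),
F.first('user_id',True).over(w2)))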
final_df.orderBy('cookie_id', 'c_date').show(truncate=False)
+---------+----------+-------+------------+
|cookie_id|c_date |user_id|UserIDFilled|
+---------+----------+-------+------------+
|1 |2015-12-01|null |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-03|U2 |U2 |
|1 |2015-12-04|null |U2 |
|1 |2015-12-05|null |U2 |
|2 |2015-12-02|U3 |U3 |
|2 |2015-12-03|null |U3 |
|2 |2015-12-04|null |U3 |
|2 |2015-12-05|null |U3 |
+---------+----------+-------+------------+
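The coalesce of the two windows amounts to "use the last known value, and if there is none yet, fall back to the first known value in the partition". A plain-Python sketch for a single, already sorted partition (fill_both_ways is a hypothetical name):

```python
def fill_both_ways(values):
    """Forward fill; leading Nones fall back to the first non-None value."""
    first_known = next((v for v in values if v is not None), None)
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last if last is not None else first_known)
    return filled


print(fill_both_ways([None, "U1", "U1", "U2", None, None]))
```

A partition with no known value at all stays entirely null, which matches what coalesce would return.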
Cloudera has released a library called spark-ts that offers a suite of useful methods for processing time series and sequential data in Spark. This library supports a number of time-windowed methods for imputing data points based on other data in the sequence.
http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/

Resources