How to create Spark dataframe in Scala from Lists - apache-spark

I have below lists,
val col1 = List(1,2,3,4,5)
val col2 = List("a", "b", "c", "d", "e")
val col3 = List(6,7,8)
Requirement is to create a dataframe as below in Scala,
--------------------
| col1 | col2| col3|
--------------------
| 1 | a | 6 |
| 2 | b | 7 |
| 3 | c | 8 |
| 4 | d | null|
| 5 | e | null|
--------------------
Thank you.

If you're using Scala 2.12 or older, zipAll function may be helpful. Also using Option for nullability.
val data = col1.map(Option.apply)
.zipAll(col2.map(Option.apply), None, None)
.zipAll(col3.map(Option.apply), (None, None), None)
.map { case ((c1, c2), c3) => (c1, c2, c3) }
val df = data.toDF("col1", "col2", "col3")

Build rows iterating over indexes to get
import sparkSession.implicits._
val df = List((1,’a’),(2,’b’)).toDF()

Related

pyspark counting number of nulls per group

I have a dataframe that has time series data in it and some categorical data
| cat | TS1 | TS2 | ... |
| A | 1 | null | ... |
| A | 1 | 20 | ... |
| B | null | null | ... |
| A | null | null | ... |
| B | 1 | 100 | ... |
I would like to find out how many null values there are per column per group, so an expected output would look something like:
| cat | TS1 | TS2 |
| A | 1 | 2 |
| B | 1 | 1 |
Currently I can this for one of the groups with something like this
df_null_cats = df.where(df.cat == "A").where(reduce(lambda x, y: x | y, (col(c).isNull() for c in df.columns))).select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_nulls.columns])
but I am struggling to get one that would work for the whole dataframe.
You can use groupBy and aggregation function to get required output.
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local").getOrCreate()
# Sample dataframe
in_values = [("A", 1, None),
("A", 1, 20),
("B", None, None),
("A", None, None),
("B", 1, 100)]
in_df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")
columns = in_df.columns
# Ignoring groupBy column and considering cols which are required in aggregation
columns.remove("cat")
agg_expression = [sum(when(in_df[x].isNull(), 1).otherwise(0)).alias(x) for x in columns]
in_df.groupby("cat").agg(*agg_expression).show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| B| 1| 1|
| A| 1| 2|
+---+---+---+
"Sum" function can be used with condition for null value. On Scala:
val df = Seq(
(Some("A"), Some(1), None),
(Some("A"), Some(1), Some(20)),
(Some("B"), None, None),
(Some("A"), None, None),
(Some("B"), Some(1), Some(100)),
).toDF("cat", "TS1", "TS2")
val aggregatorColumns = df
.columns
.tail
.map(columnName => sum(when(col(columnName).isNull, 1).otherwise(0)).alias(columnName))
df
.groupBy("cat")
.agg(
aggregatorColumns.head, aggregatorColumns.tail: _*
)
#Mohana's answer is good but it's still not dynamic: you need to code the operation for every single column.
In my answer below, we can use Pandas UDFs and applyInPandas to write a simple function in Pandas which will then be applied to our PySpark dataframe.
import pandas as pd
from pyspark.sql.types import *
in_values = [("A", 1, None),
("A", 1, 20),
("B", None, None),
("A", None, None),
("B", 1, 100)]
df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")
# define output schema: same column names, but we must ensure that the output type is integer
output_schema = StructType(
[StructField('cat', StringType())] + \
[StructField(col, IntegerType(), True) for col in [c for c in df.columns if c.startswith('TS')]]
)
# custom Python function to define aggregations in Pandas
def null_count(pdf):
columns = [c for c in pdf.columns if c.startswith('TS')]
result = pdf\
.groupby('cat')[columns]\
.agg(lambda x: x.isnull().sum())\
.reset_index()
return result
# use applyInPandas
df\
.groupby('cat')\
.applyInPandas(null_count, output_schema)\
.show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| A| 1| 2|
| B| 1| 1|
+---+---+---+

Spark filter multiple group of rows to a single row

I am trying to acheive the following,
Lets say I have a dataframe with the following columns
id | name | alias
-------------------
1 | abc | short
1 | abc | ailas-long-1
1 | abc | another-long-alias
2 | xyz | short_alias
2 | xyz | same_length
3 | def | alias_1
I want to groupby id and name and select the shorter alias,
The output I am expecting is
id | name | alias
-------------------
1 | abc | short
2 | xyz | short_alias
3 | def | alias_1
I can achevie this using window and row_number, is there anyother efficient method to get the same result. In general, the thrid column filter condition can be anything in this case its the length of the field.
Any help would be much appreciated.
Thank you.
All you need to do is use length inbuilt function and use that in window function as
from pyspark.sql import functions as f
from pyspark.sql import Window
windowSpec = Window.partitionBy('id', 'name').orderBy('length')
df.withColumn('length', f.length('alias'))\
.withColumn('length', f.row_number().over(windowSpec))\
.filter(f.col('length') == 1)\
.drop('length')\
.show(truncate=False)
which should give you
+---+----+-----------+
|id |name|alias |
+---+----+-----------+
|3 |def |alias_1 |
|1 |abc |short |
|2 |xyz |short_alias|
+---+----+-----------+
A solution without window (Not very pretty..) and the easiest, in my opinion, rdd solution:
from pyspark.sql import functions as F
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
rdd = sc.parallelize([(1 , "abc" , "short-alias"),
(1 , "abc" , "short"),
(1 , "abc" , "ailas-long-1"),
(1 , "abc" , "another-long-alias"),
(2 , "xyz" , "same_length"),
(2 , "xyz" , "same_length1"),
(3 , "def" , "short_alias") ])
df = hiveCtx.createDataFrame(\
rdd, ["id", "name", "alias"])
len_df = df.groupBy(["id", "name"]).agg(F.min(F.length("alias")).alias("alias_len"))
df = df.withColumn("alias_len", F.length("alias"))
cond = ["alias_len", "id", "name"]
df.join(len_df, cond).show()
print rdd.map(lambda x: ((x[0], x[1]), x[2]))\
.reduceByKey(lambda x,y: x if len(x) < len(y) else y ).collect()
Output:
+---------+---+----+-----------+
|alias_len| id|name| alias|
+---------+---+----+-----------+
| 11| 3| def|short_alias|
| 11| 2| xyz|same_length|
| 5| 1| abc| short|
+---------+---+----+-----------+
[((2, 'xyz'), 'same_length'), ((3, 'def'), 'short_alias'), ((1, 'abc'), 'short')]

How to use groupBy to collect rows into a map?

Context
sqlContext.sql(s"""
SELECT
school_name,
name,
age
FROM my_table
""")
Ask
Given the above table, I would like to group by school name and collect name, age into a Map[String, Int]
For example - Pseudo-code
val df = sqlContext.sql(s"""
SELECT
school_name,
age
FROM my_table
GROUP BY school_name
""")
------------------------
school_name | name | age
------------------------
school A | "michael"| 7
school A | "emily" | 5
school B | "cathy" | 10
school B | "shaun" | 5
df.groupBy("school_name").agg(make_map)
------------------------------------
school_name | map
------------------------------------
school A | {"michael": 7, "emily": 5}
school B | {"cathy": 10, "shaun": 5}
Following will work with Spark 2.0. You can use map function available since 2.0 release to get columns as Map.
val df1 = df.groupBy(col("school_name")).agg(collect_list(map($"name",$"age")) as "map")
df1.show(false)
This will give you below output.
+-----------+------------------------------------+
|school_name|map |
+-----------+------------------------------------+
|school B |[Map(cathy -> 10), Map(shaun -> 5)] |
|school A |[Map(michael -> 7), Map(emily -> 5)]|
+-----------+------------------------------------+
Now you can use UDF to join individual Maps into single Map like below.
import org.apache.spark.sql.functions.udf
val joinMap = udf { values: Seq[Map[String,Int]] => values.flatten.toMap }
val df2 = df1.withColumn("map", joinMap(col("map")))
df2.show(false)
This will give required output with Map[String,Int].
+-----------+-----------------------------+
|school_name|map |
+-----------+-----------------------------+
|school B |Map(cathy -> 10, shaun -> 5) |
|school A |Map(michael -> 7, emily -> 5)|
+-----------+-----------------------------+
If you want to convert a column value into JSON String then Spark 2.1.0 has introduced to_json function.
val df3 = df2.withColumn("map",to_json(struct($"map")))
df3.show(false)
The to_json function will return following output.
+-----------+-------------------------------+
|school_name|map |
+-----------+-------------------------------+
|school B |{"map":{"cathy":10,"shaun":5}} |
|school A |{"map":{"michael":7,"emily":5}}|
+-----------+-------------------------------+
As of spark 2.4 you can use map_from_arrays function to achieve this.
val df = spark.sql(s"""
SELECT *
FROM VALUES ('s1','a',1),('s1','b',2),('s2','a',1)
AS (school, name, age)
""")
val df2 = df.groupBy("school").agg(map_from_arrays(collect_list($"name"), collect_list($"age")).as("map"))
+------+----+---+
|school|name|age|
+------+----+---+
| s1| a| 1|
| s1| b| 2|
| s2| a| 1|
+------+----+---+
+------+----------------+
|school| map|
+------+----------------+
| s2| [a -> 1]|
| s1|[a -> 1, b -> 2]|
+------+----------------+
df.select($"school_name",concat_ws(":",$"age",$"name").as("new_col")).groupBy($"school_name").agg(collect_set($"new_col")).show
+-----------+--------------------+
|school_name|collect_set(new_col)|
+-----------+--------------------+
| school B| [5:shaun, 10:cathy]|
| school A|[7:michael, 5:emily]|
+-----------+--------------------+

How to explode columns?

After:
val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")
I have this DataFrame in Apache Spark:
+------+---------+
| Col1 | Col2 |
+------+---------+
| 1 |[2, 3, 4]|
| 1 |[2, 3, 4]|
+------+---------+
How do I convert this into:
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| 1 | 2 | 3 | 4 |
| 1 | 2 | 3 | 4 |
+------+------+------+------+
A solution that doesn't convert to and from RDD:
df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col3")
Or arguable nicer:
val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2)):_*))
The size of a Spark array column is not fixed, you could for instance have:
+----+------------+
|Col1| Col2|
+----+------------+
| 1| [2, 3, 4]|
| 1|[2, 3, 4, 5]|
+----+------------+
So there is no way to get the amount of columns and create those. If you know the size is always the same, you can set nElements like this:
val nElements = df.select("Col2").first.getList(0).size
Just to give the Pyspark version of sgvd's answer. If the array column is in Col2, then this select statement will move the first nElements of each array in Col2 to their own columns:
from pyspark.sql import functions as F
df.select([F.col('Col2').getItem(i) for i in range(nElements)])
Just add on to sgvd's solution:
If the size is not always the same, you can set nElements like this:
val nElements = df.select(size('Col2).as("Col2_count"))
.select(max("Col2_count"))
.first.getInt(0)
You can use a map:
df.map {
case Row(col1: Int, col2: mutable.WrappedArray[Int]) => (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()
If you are working with SparkR, you can find my answer here where you don't need to use explode but you need SparkR::dapply and stringr::str_split_fixed.

Pyspark : forward fill with last observation for a DataFrame

Using Spark 1.5.1,
I've been trying to forward fill null values with the last known observation for one column of my DataFrame.
It is possible to start with a null value and for this case I would to backward fill this null value with the first knwn observation. However, If that too complicates the code, this point can be skipped.
In this post, a solution in Scala was provided for a very similar problem by zero323.
But, I don't know Scala and I don't succeed to ''translate'' it in Pyspark API code. It's possible to do it with Pyspark ?
Thanks for your help.
Below, a simple example sample input:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | null
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | null
| 1 | 2015-12-05 | null
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | null
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | null
| 2 | 2015-12-03 | null
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | null
| 2 | 2015-12-06 | U4
And the expected output:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | U1
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | U1
| 1 | 2015-12-05 | U1
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | U2
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | U1
| 2 | 2015-12-03 | U3
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | U3
| 2 | 2015-12-06 | U4
Another workaround to get this working, is to try something like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
window = (
Window
.partitionBy('cookie_id')
.orderBy('Time')
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
final = (
joined
.withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window))
)
So what this is doing is that it constructs your window based on the partition key and the order column. It also tells the window to look back all rows within the window up to the current row. Finally, at each row, you return the last value that is not null (which remember, according to your window, it includes your current row)
The partitioned example code from Spark / Scala: forward fill with last observation in pyspark is shown. This only works for data that can be partitioned.
Load the data
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
rdd = sc.parallelize(values)
df = rdd.toDF(["cookie_id", "c_date", "user_id"])
df = df.withColumn("c_date", df.c_date.cast("date"))
df.show()
The DataFrame is
+---------+----------+-------+
|cookie_id| c_date|user_id|
+---------+----------+-------+
| 1|2015-12-01| null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| null|
| 1|2015-12-05| null|
| 2|2015-12-04| null|
| 2|2015-12-03| null|
| 2|2015-12-02| U3|
| 2|2015-12-05| null|
+---------+----------+-------+
Column used to sort the partitions
# get the sort key
def getKey(item):
return item.c_date
The fill function. Can be used to fill in multiple columns if necessary.
# fill function
def fill(x):
out = []
last_val = None
for v in x:
if v["user_id"] is None:
data = [v["cookie_id"], v["c_date"], last_val]
else:
data = [v["cookie_id"], v["c_date"], v["user_id"]]
last_val = v["user_id"]
out.append(data)
return out
Convert to rdd, partition, sort and fill the missing values
# Partition the data
rdd = df.rdd.groupBy(lambda x: x.cookie_id).mapValues(list)
# Sort the data by date
rdd = rdd.mapValues(lambda x: sorted(x, key=getKey))
# fill missing value and flatten
rdd = rdd.mapValues(fill).flatMapValues(lambda x: x)
# discard the key
rdd = rdd.map(lambda v: v[1])
Convert back to DataFrame
df_out = sqlContext.createDataFrame(rdd)
df_out.show()
The output is
+---+----------+----+
| _1| _2| _3|
+---+----------+----+
| 1|2015-12-01|null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| U2|
| 1|2015-12-05| U2|
| 2|2015-12-02| U3|
| 2|2015-12-03| U3|
| 2|2015-12-04| U3|
| 2|2015-12-05| U3|
+---+----------+----+
Hope you find this forward fill function useful. It is written using native pyspark function. Neither udf nor rdd being used (both of them are very slow, especially UDF!).
Let's use example provided by #Sid.
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
df = spark.createDataFrame(values, ['cookie_ID', 'Time', 'User_ID'])
Functions:
def cum_sum(df, sum_col , order_col, cum_sum_col_nm='cum_sum'):
'''Find cumulative sum of a column.
Parameters
-----------
sum_col : String
Column to perform cumulative sum.
order_col : List
Column/columns to sort for cumulative sum.
cum_sum_col_nm : String
The name of the resulting cum_sum column.
Return
-------
df : DataFrame
Dataframe with additional "cum_sum_col_nm".
'''
df = df.withColumn('tmp', lit('tmp'))
windowval = (Window.partitionBy('tmp')
.orderBy(order_col)
.rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn('cum_sum', sum(sum_col).over(windowval).alias('cumsum').cast(StringType()))
df = df.drop('tmp')
return df
def forward_fill(df, order_col, fill_col, fill_col_name=None):
'''Forward fill a column by a column/set of columns (order_col).
Parameters:
------------
df: Dataframe
order_col: String or List of string
fill_col: String (Only work for a column for this version.)
Return:
---------
df: Dataframe
Return df with the filled_cols.
'''
# "value" and "constant" are tmp columns created ton enable forward fill.
df = df.withColumn('value', when(col(fill_col).isNull(), 0).otherwise(1))
df = cum_sum(df, 'value', order_col).drop('value')
df = df.withColumn(fill_col,
when(col(fill_col).isNull(), 'constant').otherwise(col(fill_col)))
win = (Window.partitionBy('cum_sum')
.orderBy(order_col))
if not fill_col_name:
fill_col_name = 'ffill_{}'.format(fill_col)
df = df.withColumn(fill_col_name, collect_list(fill_col).over(win)[0])
df = df.drop('cum_sum')
df = df.withColumn(fill_col_name, when(col(fill_col_name)=='constant', None).otherwise(col(fill_col_name)))
df = df.withColumn(fill_col, when(col(fill_col)=='constant', None).otherwise(col(fill_col)))
return df
Let's see the results.
ffilled_df = forward_fill(df,
order_col=['cookie_ID', 'Time'],
fill_col='User_ID',
fill_col_name = 'User_ID_ffil')
ffilled_df.sort(['cookie_ID', 'Time']).show()
// Forward filling
w1 = Window.partitionBy('cookie_id').orderBy('c_date').rowsBetween(Window.unboundedPreceding,0)
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
//Backward filling
final_df = df.withColumn('UserIDFilled', F.coalesce(F.last('user_id', True).over(w1),
F.first('user_id',True).over(w2)))
final_df.orderBy('cookie_id', 'c_date').show(truncate=False)
+---------+----------+-------+------------+
|cookie_id|c_date |user_id|UserIDFilled|
+---------+----------+-------+------------+
|1 |2015-12-01|null |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-03|U2 |U2 |
|1 |2015-12-04|null |U2 |
|1 |2015-12-05|null |U2 |
|2 |2015-12-02|U3 |U3 |
|2 |2015-12-03|null |U3 |
|2 |2015-12-04|null |U3 |
|2 |2015-12-05|null |U3 |
+---------+----------+-------+------------+
Cloudera has released a library called spark-ts that offers a suite of useful methods for processing time series and sequential data in Spark. This library supports a number of time-windowed methods for imputing data points based on other data in the sequence.
http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/

Resources