Spark groupby, sort values, then take first and last - apache-spark

I'm using Apache Spark and have a dataframe that looks like this:
scala> df.printSchema
root
|-- id: string (nullable = true)
|-- epoch: long (nullable = true)
scala> df.show(10)
+--------------------+-------------+
| id | epoch|
+--------------------+-------------+
|6825a28d-abe5-4b9...|1533926790847|
|6825a28d-abe5-4b9...|1533926790847|
|6825a28d-abe5-4b9...|1533180241049|
|6825a28d-abe5-4b9...|1533926790847|
|6825a28d-abe5-4b9...|1532977853736|
|6825a28d-abe5-4b9...|1532531733106|
|1eb5f3a4-a68c-4af...|1535383198000|
|1eb5f3a4-a68c-4af...|1535129922000|
|1eb5f3a4-a68c-4af...|1534876240000|
|1eb5f3a4-a68c-4af...|1533840537000|
+--------------------+-------------+
only showing top 10 rows
I want to group by the id field to get all the epoch timestamps together for an id. I then want to sort the epochs by ascending timestamp and then take the first and last epochs.
I used the following query, but the first and last epoch values appear to be taken in the order in which they appear in the original dataframe. I want the first and last to be taken from the epochs sorted in ascending order.
scala> val df2 = df.groupBy("id").
agg(first("epoch").as("first"), last("epoch").as("last"))
scala> df2.show()
+--------------------+-------------+-------------+
| id| first| last|
+--------------------+-------------+-------------+
|4f433f46-37e8-412...|1535342400000|1531281600000|
|d0cba2f9-cc04-42c...|1535537741000|1530448494000|
|6825a28d-abe5-4b9...|1533926790847|1532531733106|
|e963f265-809c-425...|1534996800000|1534996800000|
|1eb5f3a4-a68c-4af...|1535383198000|1530985221000|
|2e65a033-85ed-4e4...|1535660873000|1530494913413|
|90b94bb0-740c-42c...|1533960000000|1531108800000|
+--------------------+-------------+-------------+
How do I retrieve the first and last from the epoch list sorted by ascending epoch?

The first and last functions are meaningless when applied outside a Window context; the value that is taken is purely arbitrary.
Instead you should:
Use min / max if the logic conforms to basic ordering rules (alphanumeric for strings, arrays, and structs; numeric for numbers).
Use a strongly typed Dataset with map -> groupByKey -> reduceGroups or groupByKey -> mapGroups otherwise; a sketch of this approach is shown below.
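For the question's schema (id: string, epoch: long), a minimal sketch of the groupByKey -> mapGroups variant might look like this (the case class name is illustrative, and spark is assumed to be the active SparkSession):
import spark.implicits._

// Illustrative case class mirroring the question's schema (id: string, epoch: long)
case class Record(id: String, epoch: Long)

val firstLast = df.as[Record]
  .groupByKey(_.id)
  .mapGroups { (id, rows) =>
    // Sort each id's epochs ascending, then keep the smallest and largest
    val epochs = rows.map(_.epoch).toSeq.sorted
    (id, epochs.head, epochs.last)
  }
  .toDF("id", "first", "last")

firstLast.show()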

You can just use min and max and cast the resulting columns to string. Here is one way to do it
import org.apache.spark.sql.functions._
val df = Seq(("6825a28d-abe5-4b9",1533926790847.0),
("6825a28d-abe5-4b9",1533926790847.0),
("6825a28d-abe5-4b9",1533180241049.0),
("6825a28d-abe5-4b9",1533926790847.0),
("6825a28d-abe5-4b9",1532977853736.0),
("6825a28d-abe5-4b9",1532531733106.0),
("1eb5f3a4-a68c-4af",1535383198000.0),
("1eb5f3a4-a68c-4af",1535129922000.0),
("1eb5f3a4-a68c-4af",1534876240000.0),
("1eb5f3a4-a68c-4af",1533840537000.0)).toDF("id","epoch").withColumn("epoch",($"epoch"/1000.0).cast("timestamp"))
+-----------------+--------------------+
| id| epoch|
+-----------------+--------------------+
|6825a28d-abe5-4b9|2018-08-10 18:46:...|
|6825a28d-abe5-4b9|2018-08-10 18:46:...|
|6825a28d-abe5-4b9|2018-08-02 03:24:...|
|6825a28d-abe5-4b9|2018-08-10 18:46:...|
|6825a28d-abe5-4b9|2018-07-30 19:10:...|
|6825a28d-abe5-4b9|2018-07-25 15:15:...|
|1eb5f3a4-a68c-4af| 2018-08-27 15:19:58|
|1eb5f3a4-a68c-4af| 2018-08-24 16:58:42|
|1eb5f3a4-a68c-4af| 2018-08-21 18:30:40|
|1eb5f3a4-a68c-4af| 2018-08-09 18:48:57|
+-----------------+--------------------+
val df1 = df.groupBy("id").agg(min($"epoch").cast("string").as("first"), max($"epoch").cast("string").as("last"))
df1.show
+-----------------+--------------------+--------------------+
| id| first| last|
+-----------------+--------------------+--------------------+
|6825a28d-abe5-4b9|2018-07-25 15:15:...|2018-08-10 18:46:...|
|1eb5f3a4-a68c-4af| 2018-08-09 18:48:57| 2018-08-27 15:19:58|
+-----------------+--------------------+--------------------+
df1: org.apache.spark.sql.DataFrame = [id: string, first: string ... 1 more field]

Related

How can I count occurrences of element in dataframe array?

I have a dataframe that looks like this:
df = spark.sql("""
SELECT list
FROM categories
""")
df.show()
+----------------+
|            list|
+----------------+
| 1,1,1,2,2,apple|
|apple,orange,1,2|
+----------------+
And I want a result something like this:
+------+---------------+
|  list|frequency_count|
+------+---------------+
|     1|              4|
|     2|              3|
| apple|              2|
|orange|              1|
+------+---------------+
This is what I tried.
count_df = df.withColumn('count', F.size(F.split('list', ',')))
count_df.show(truncate=False)
df.createOrReplaceTempView('tmp')
freq_sql = """
select list,count(*) count from
(select explode(flatten(collect_list(split(list, ',')))) list
from tmp)
group by list
"""
freq_df = spark.sql(freq_sql)
freq_df.show(truncate=False)
And I'm getting this error
AnalysisException: cannot resolve 'split(df.`list`, ',', -1)' due to
data type mismatch: argument 1 requires string type, however,
'df.`list`' is of array<string> type.;
You are currently trying to flatten a single list-type value; however, the flatten function expects an array of arrays:
flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
Hence the error you are facing.
You need to explode the list first, then split it, then explode again to turn the elements into rows, and finally groupBy to get the required count.
Data Preparation
sparkDF = sql.createDataFrame(
    [
        (["1,1,1,2,2,apple"],),
        (["apple,orange,1,1"],),
    ],
    ("list",)
)
Explode
sparkDF.createOrReplaceTempView("dataset")
sql.sql("""
SELECT
explode(list) as exploded
,list
FROM dataset
""").printSchema()
root
|-- exploded: string (nullable = true)
|-- list: array (nullable = true)
| |-- element: string (containsNull = true)
+----------------+------------------+
| exploded| list|
+----------------+------------------+
| 1,1,1,2,2,apple| [1,1,1,2,2,apple]|
|apple,orange,1,1|[apple,orange,1,1]|
+----------------+------------------+
Group By
sql.sql("""
SELECT
exploded
,count(*) as count
FROM (
SELECT
EXPLODE(SPLIT(list,",")) as exploded
FROM (
SELECT
EXPLODE(list) as list
FROM dataset
)
)
GROUP BY 1
""").show()
+--------+-----+
|exploded|count|
+--------+-----+
| orange| 1|
| apple| 2|
| 1| 5|
| 2| 2|
+--------+-----+

How to convert single String column to multiple columns based on delimiter in Apache Spark

I have a data frame with a string column and I want to create multiple columns out of it.
Here is my input data; pagename is my string column.
I want to create multiple columns from it. The format of the string is always the same: col1:value1 col2:value2 col3:value3 ... colN:valueN. In the output, I need one column per key (col1 to colN), with the values as rows for each column.
How can I do this in Spark? Scala or Python are both fine for me. The code below creates the input dataframe:
scala> val df = spark.sql(s"""select 1 as id, "a:100 b:500 c:200" as pagename union select 2 as id, "a:101 b:501 c:201" as pagename """)
df: org.apache.spark.sql.DataFrame = [id: int, pagename: string]
scala> df.show(false)
+---+-----------------+
|id |pagename |
+---+-----------------+
|2 |a:101 b:501 c:201|
|1 |a:100 b:500 c:200|
+---+-----------------+
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- pagename: string (nullable = false)
Note - The example shows only 3 columns here but in general I have more than 100 columns that I expect to deal with.
You can use str_to_map, explode the resulting map and pivot:
import org.apache.spark.sql.functions._

val df2 = df.select(
  col("id"),
  expr("explode(str_to_map(pagename, ' ', ':'))")
).groupBy("id").pivot("key").agg(first("value"))
df2.show
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1|100|500|200|
| 2|101|501|201|
+---+---+---+---+
So two options immediately come to mind
Delimiters
You've got some obvious delimiters that you can split on. For this use the split function
from pyspark.sql import functions as F

delimiter = ":"

df = df.withColumn(
    "split_column",
    F.split(F.col("pagename"), delimiter)
)

# "split_column" is now an array, so we need to pull items out of the array
df = df.withColumn(
    "a",
    F.col("split_column").getItem(0)
)
Not ideal, as you'll still need to do some string manipulation to remove the whitespace and then convert to int, but this is easily applied to multiple columns.
Regex
As the format is pretty fixed, you can do the same thing with a regex.
import re

# Assumes the values are numeric, as in the example; adjust the capture groups otherwise
regex_pattern = r"a:(\d+) b:(\d+) c:(\d+)"
match_groups = ["a", "b", "c"]

for i in range(re.compile(regex_pattern).groups):
    df = df.withColumn(
        match_groups[i],
        F.regexp_extract(F.col("pagename"), regex_pattern, i + 1),
    )
CAVEAT: Check that Regex before you try and run anything (as I don't have an editor handy)

How to write a streaming DataFrame out to Kafka with all rows as JSON array?

I am looking for a solution for writing Spark streaming data to Kafka.
I am using the following method to write data to Kafka:
df.selectExpr("to_json(struct(*)) AS value").writeStream.format("kafka")
But my issue is that, while writing to Kafka, the data shows up as follows:
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
my expected output is
[
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
]
I want to enclose the rows inside an array. How can I achieve this in Spark streaming? Can someone advise?
I assume the schema of the streaming DataFrame (df) is as follows:
root
|-- country: string (nullable = true)
|-- plan: string (nullable = true)
|-- value: string (nullable = true)
I also assume that you want to write (produce) all rows in the streaming DataFrame (df) out to a Kafka topic as a single record in which the rows are in the form of an array of JSONs.
If so, you should groupBy the rows and collect_list to group all rows into a single record that you can write out to Kafka.
// df is a batch DataFrame so I could show for demo purposes
scala> df.show
+-------+--------+-----+
|country| plan|value|
+-------+--------+-----+
| US|postpaid| 300|
| CAN| 0.0| 30|
+-------+--------+-----+
val jsons = df.selectExpr("to_json(struct(*)) AS value")
scala> jsons.show(truncate = false)
+------------------------------------------------+
|value |
+------------------------------------------------+
|{"country":"US","plan":"postpaid","value":"300"}|
|{"country":"CAN","plan":"0.0","value":"30"} |
+------------------------------------------------+
val grouped = jsons.groupBy().agg(collect_list("value") as "value")
scala> grouped.show(truncate = false)
+-----------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------+
|[{"country":"US","plan":"postpaid","value":"300"}, {"country":"CAN","plan":"0.0","value":"30"}]|
+-----------------------------------------------------------------------------------------------+
I'd do all the above in DataStreamWriter.foreachBatch to get ahold of a DataFrame to work on.
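A minimal sketch of that foreachBatch approach (assuming Spark 3.x; the broker address and topic name are placeholders, and each micro-batch is assumed small enough to collapse into a single record):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val query = df.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch
      .selectExpr("to_json(struct(*)) AS value")
      // Collapse the whole micro-batch into one JSON-array string
      .agg(concat(lit("["), concat_ws(",", collect_list("value")), lit("]")) as "value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("topic", "output-topic")                      // placeholder
      .save()
  }
  .start()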
I'm not sure whether that is achievable for a streaming query, but I'll post my suggestion anyway; what you can do is transform your DataFrame afterwards:
//Input
inputDF.show(false)
+---+-------+
|int|string |
+---+-------+
|1 |string1|
|2 |string2|
+---+-------+
//convert that to json
inputDF.toJSON.show(false)
+----------------------------+
|value |
+----------------------------+
|{"int":1,"string":"string1"}|
|{"int":2,"string":"string2"}|
+----------------------------+
//then use collect and mkString
println(inputDF.toJSON.collect().mkString("[", "," , "]"))
[{"int":1,"string":"string1"},{"int":2,"string":"string2"}]

Easy way to center a column in a Spark DataFrame

I want to center a column in a Spark DataFrame, i.e., subtract each element in the column by the mean of the column. Currently, I do it manually, i.e., first calculate the mean of a column, get the value out of the reduced DataFrame, and then subtract the column by the average. I wonder whether there is an easy way to do this in Spark? Any built-in function to do it?
There is no built-in function for this, but you can use a user-defined function (UDF) as below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.sparkContext.parallelize(List(
  (2.06,0.56),
  (1.96,0.72),
  (1.70,0.87),
  (1.90,0.64))).toDF("c1","c2")

def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)

def getCenterDF(df: DataFrame, col: String): DataFrame = {
  val avg = df.select(mean(col)).first().getAs[Double](0)
  df.withColumn(col, subMean(avg)(df(col)))
}
scala> df.show(false)
+----+----+
|c1 |c2 |
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.7 |0.87|
|1.9 |0.64|
+----+----+
scala> getCenterDF(df, "c2").show(false)
+----+--------------------+
|c1 |c2 |
+----+--------------------+
|2.06|-0.13750000000000007|
|1.96|0.022499999999999853|
|1.7 |0.17249999999999988 |
|1.9 |-0.05750000000000011|
+----+--------------------+

Trim in a Pyspark Dataframe

I have a PySpark dataframe (the original dataframe) with the data below (all columns have string datatype). In my use case I am not sure which columns are present in this input dataframe. The user just passes me the dataframe and asks me to trim all of its columns. Data in a typical dataframe looks like this:
id  Value    Value1
1   "Text "  "Avb"
2   1504     " Test"
3   1        2
Is there any way I can do this without depending on which columns are present in the dataframe, and get all of its columns trimmed? Data after trimming all the columns should look like this:
id  Value   Value1
1   "Text"  "Avb"
2   1504    "Test"
3   1       2
Can someone help me out? How can I achieve this with a PySpark dataframe? Any help will be appreciated.
input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from @osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid calling withColumn in a loop, because each call creates a new DataFrame with an extra projection in the query plan, which is time-consuming for very large dataframes. I created the following function based on this solution, and it works with any dataframe, even one with both string and non-string columns.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    data_trimmed = of_data.select([
        (F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name)
        for c in of_data.schema
    ])
    return data_trimmed
This is the cleanest (and most computationally efficient) way I've seen to strip all spaces from all column names. If you want underscores in place of the spaces, simply replace "" with "_".
# Standardize column names: replace spaces with "" (or "_")
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes function in the DataFrame API to get the list of column names along with their datatypes, and then for all string columns use the trim function to trim the values.
Regards,
Neeraj
