Add an aggregated column to an existing Spark streaming DataFrame - apache-spark

I need to add an aggregated column to a Spark streaming DataFrame.
My DataFrame has this form:
+-----------+---------+
| Timestamp | User_id |
+-----------+---------+
| 123343222 | 01 |
| 121212122 | 02 |
| 121212121 | 03 |
+-----------+---------+
I need a streaming DataFrame of this form:
+-----------+---------+--------------+
| Timestamp | User_id | Array_UID |
+-----------+---------+--------------+
| 123343222 | 01 | [01] |
| 121212122 | 02 | [01, 02] |
| 121212121 | 03 | [01, 02, 03] |
+-----------+---------+--------------+
After creating this streaming DataFrame, I need to process it with a UDF that takes into account all the user IDs that have already arrived.
I tried to collect the IDs using this code:
from pyspark.sql import functions as F
presenceDF = (
    dfStreaming
    .groupBy(F.window("timestamp", "30 minutes", "30 minutes"))
    .agg(F.collect_set(F.col("User_id")).alias("Array"))
)
The result is the following:
+---------+--------------+
| Window | Array |
+---------+--------------+
| W1 | [01] |
| W2 | [01, 02] |
| W3 | [01, 02, 03] |
+---------+--------------+
I also need the most recently arrived User_id, which is why this form doesn't work for me.
Is there a way to add the Array column to the streaming DataFrame while preserving the original columns?
With a static DataFrame you can achieve this with a self join, but that is not possible here.
Any help?
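To pin down what the requested Array_UID column means, here is a minimal sketch in plain Python (not Structured Streaming). The function name `running_user_ids` is hypothetical, and the events are the three rows from the question. It only shows the per-event state update; inside Spark that state would have to be held by a stateful operator or an external store, which is not shown here.

```python
def running_user_ids(events):
    """Yield (timestamp, user_id, ids_seen_so_far) for each event in arrival order."""
    seen = []          # distinct user ids in arrival order (the running state)
    seen_set = set()   # same ids, kept for fast membership checks
    for timestamp, user_id in events:
        if user_id not in seen_set:
            seen_set.add(user_id)
            seen.append(user_id)
        # Copy the list so later events do not mutate rows already emitted.
        yield timestamp, user_id, list(seen)


# The three rows from the question, in arrival order.
events = [(123343222, "01"), (121212122, "02"), (121212121, "03")]
rows = list(running_user_ids(events))
for row in rows:
    print(row)
```

Each emitted row keeps the original Timestamp and User_id and adds the accumulated list, which is the shape asked for above.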

Related

Can you select the MIN row to display across all "Results Grid" values? (Generic Inquiry)

I have the following tables:
(Example data)
PMProjects
| ContractID | ContractCD | Customer |
| :-------- | :----------| :------- |
| 01 | PR00001 | ABC-Customer |
| 02 | PR00002 | XYZ-Customer |
PMTasks
| Task ID | Project ID| Task CD | Description |
| :--------| :---------| :---------| :-------- |
| 39 | 01 | 01 First | Zulu |
| 40 | 01 | 02 Second | Foxtrot |
| 41 | 01 | 03 Third | Delta |
| 42 | 01 | 04 Fourth | Alpha |
| 55 | 02 | 01 First | Zulu |
| 56 | 02 | 02 Second | Foxtrot |
| 57 | 02 | 03 Third | Delta |
| 58 | 02 | 04 Fourth | Alpha |
I have successfully joined PMTasks to PMProjects on PMProjects.ContractID = PMTasks.ProjectID, and I am grouping by PMProjects.ContractID.
On the GI, I need to display the row with the MIN TaskID for each project.
My GI - Aggregate Functions by MIN value
Result: when I use the MIN value on the PMTasks.Description field, it pulls the value "Alpha".
GI Results
| ContractID | ContractCD | Customer | Task ID | Task CD | Description |
| :--------- | :--------- | :----------- | :------- | :-------- | :---------- |
| 01 | PR00001 | ABC-Customer | 39 | 01 First | Alpha |
| 02 | PR00002 | XYZ-Customer | 55 | 01 First | Alpha |
Documentation - Aggregate Function Descriptions
I see from the documentation that the MIN aggregate function, "Returns the minimum value of all values of the group."
Has anyone found a way to join many-to-one in an Acumatica Generic Inquiry using the MIN (or MAX) value of a row?
Or to put it a different way - has anyone found a way to join one of many rows to a table and have the results grid display only the values from the same row?
I hope this makes sense. Please feel free to ask any clarification questions.
Thanks for any and all feedback.
Example GI related to the question. Please excuse the error under Results Grid - PMProjects.CustomerCD_Description does not need to be aggregated.
You need to add a group by on PMContract.ContractID and then apply the Min() function in the Data Field.
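The "Alpha" result in the question follows from MIN being evaluated per column rather than per row. The plain-Python sketch below (not Acumatica code) uses the example PMTasks rows to contrast a column-wise minimum with picking the whole row that has the smallest Task ID. The variable names are illustrative only.

```python
from collections import defaultdict

# Example rows copied from the PMTasks table in the question.
tasks = [
    {"TaskID": 39, "ProjectID": "01", "TaskCD": "01 First", "Description": "Zulu"},
    {"TaskID": 40, "ProjectID": "01", "TaskCD": "02 Second", "Description": "Foxtrot"},
    {"TaskID": 41, "ProjectID": "01", "TaskCD": "03 Third", "Description": "Delta"},
    {"TaskID": 42, "ProjectID": "01", "TaskCD": "04 Fourth", "Description": "Alpha"},
    {"TaskID": 55, "ProjectID": "02", "TaskCD": "01 First", "Description": "Zulu"},
    {"TaskID": 56, "ProjectID": "02", "TaskCD": "02 Second", "Description": "Foxtrot"},
    {"TaskID": 57, "ProjectID": "02", "TaskCD": "03 Third", "Description": "Delta"},
    {"TaskID": 58, "ProjectID": "02", "TaskCD": "04 Fourth", "Description": "Alpha"},
]

# Group the task rows by project, as the GI does with its group-by.
by_project = defaultdict(list)
for task in tasks:
    by_project[task["ProjectID"]].append(task)

columns = ("TaskID", "TaskCD", "Description")

# Column-wise MIN: each column is minimised independently of the others,
# so the values in one result row can come from different task rows.
column_wise_min = {
    project: {c: min(row[c] for row in rows) for c in columns}
    for project, rows in by_project.items()
}

# Row-wise selection: find the row with the smallest TaskID and keep it whole.
row_with_min_task = {
    project: min(rows, key=lambda row: row["TaskID"])
    for project, rows in by_project.items()
}

print(column_wise_min["01"])
print(row_with_min_task["01"])
```

The column-wise result pairs Task ID 39 with "Alpha", matching the GI output shown in the question, while the row-wise result keeps "Zulu" with Task ID 39.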

Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING

I need to generate DDL statements for Hive tables and views programmatically. I tried both Spark and Beeline for this task. Beeline takes around 5-10 seconds per statement, whereas Spark completes the same thing in a few milliseconds, so I am planning to use Spark. One downside of using Spark to get DDL statements from Hive is that it treats CHAR and VARCHAR columns as STRING and does not preserve the length information that goes with the CHAR and VARCHAR data types. Beeline, by contrast, preserves both the data type and the length information. I am using Spark 2.4.1 and Beeline 2.1.1.
Given below the sample create table command and its show create table output.
Beeline Output:
Spark-Shell:
I would like to know whether there is any configuration on the Spark side that preserves the data type and length information for CHAR and VARCHAR columns. If there are other ways to get DDL from Hive quickly, I would be fine with those as well.
This was tested with Hive 3.1.1 and Spark 3.1.1.
Your Stack Overflow question states, and I quote:
"I need to generate DDL statements for Hive tables and views programmatically. I tried both Spark and Beeline for this task. Beeline takes around 5-10 seconds per statement, whereas Spark completes the same thing in a few milliseconds, so I am planning to use Spark. One downside of using Spark to get DDL statements from Hive is that it treats CHAR and VARCHAR columns as STRING and does not preserve the length information that goes with the CHAR and VARCHAR data types. Beeline, by contrast, preserves both the data type and the length information. I am using Spark 2.4.1 and Beeline 2.1.1. Given below the sample create table command and its show create table output."
Create a simple table in Hive in the test database:
hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name data_type comment
id bigint
col1 varchar(30)
col2 string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Now let's go to spark-shell
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647023374')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can see that Spark shows the columns correctly.
Now let us create the same table in Hive through Beeline:
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | |
| col1 | varchar(30) | |
| col2 | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| OwnerType: | USER | NULL |
| Owner: | hduser | NULL |
| CreateTime: | Fri Mar 11 18:51:00 GMT 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1647024660 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
33 rows selected (0.159 seconds)
Now check that in spark-shell again
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647024660')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
It shows OK. So, in summary, Spark reports the column definitions exactly as you defined them in Hive.
Your statement quoted above, "I am using Spark 2.4.1 and Beeline 2.1.1", refers to older versions of Spark and Hive, which may have had this issue.

Updated dataframe column value failed to overwrite in Hive

Consider a Hive table tbl with columns aid and bid:
| aid | bid |
---------------
| | 12 |
| 24 | 13 |
| 18 | 3 |
| | 7 |
---------------
The requirement is that when aid is null or an empty string, it should be overwritten with the value of bid:
| aid | bid |
---------------
| 12 | 12 |
| 24 | 13 |
| 18 | 3 |
| 7 | 7 |
---------------
The code is simple:
val df01 = spark.sql("select * from db.tbl")
val df02 = df01.withColumn("aid", when(col("aid").isNull || col("aid") <=> "", col("bid")) otherwise(col("aid")))
When running in spark-shell, df02.show displays the correct data, just like the table above.
The problem appears when I write the data back to Hive:
df02.write
.format("orc")
.mode("Overwrite")
.option("header", "false")
.option("orc.compress", "snappy")
.insertInto(tbl)
There is no error, but when I validate the data with
select * from db.tbl where aid is null or aid= '' limit 10;
I can still see multiple rows returned by the query with aid being null.
How can I overwrite the data in Hive after updating a column value as in the example above?
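I would try this:
import org.apache.spark.sql.SaveMode
// DataFrameWriter.orc(path) writes to a path and returns Unit, so it cannot be chained;
// select the format with format("orc") instead.
df02.write
.format("orc")
.mode(SaveMode.Overwrite)
.option("compression", "snappy")
.insertInto(tbl)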

How to concatenate spark dataframe columns using Spark sql in databricks

I have two columns called "FirstName" and "LastName" in my DataFrame. How can I concatenate these two columns into one?
|Id |FirstName|LastName|
| 1 | A | B |
| | | |
| | | |
I want to make it like this
|Id |FullName |
| 1 | AB |
| | |
| | |
My query looks like this, but it raises an error:
val kgt=spark.sql("""
Select Id,FirstName+' '+ContactLastName AS FullName from tblAA """)
kgt.createOrReplaceTempView("NameTable")
Here we go with the Spark SQL solution:
spark.sql("select Id, CONCAT(FirstName,' ',LastName) as FullName from NameTable").show(false)
OR
spark.sql( " select Id, FirstName || ' ' ||LastName as FullName from NameTable ").show(false)
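Or, with the PySpark DataFrame API:
from pyspark.sql import functions as F
# Column names match the question's schema (FirstName, LastName).
df = df.withColumn('FullName', F.concat(F.col('FirstName'), F.col('LastName')))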

How to simultaneously group/apply two Spark DataFrames?

/* My question is language-agnostic I think, but I'm using PySpark if it matters. */
Situation
I currently have two Spark DataFrames:
One with per-minute data (1440 rows per person and day) of a person's heart rate per minute:
| Person | date | time | heartrate |
|--------+------------+-------+-----------|
| 1 | 2018-01-01 | 00:00 | 70 |
| 1 | 2018-01-01 | 00:01 | 72 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 11:32 | 123 |
| ... | ... | ... | ... |
And another DataFrame with daily data (1 row per person and day), of daily metadata, including the results of a clustering of days, i.e. which cluster day X of person Y fell into:
| Person | date | cluster | max_heartrate |
|--------+------------+---------+----------------|
| 1 | 2018-01-01 | 1 | 180 |
| 1 | 2018-01-02 | 4 | 166 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 1 | 147 |
| ... | ... | ... | ... |
(Note that clustering is done separately per person, so cluster 1 for person 1 has nothing to do with person 2's cluster 1.)
Goal
I now want to compute, say, the mean heart rate per cluster and per person, that is, each person gets different means. If I have three clusters, I am looking for this DF:
| Person | cluster | mean_heartrate |
|--------+---------+----------------|
| 1 | 1 | 123 |
| 1 | 2 | 89 |
| 1 | 3 | 81 |
| 2 | 1 | 80 |
| ... | ... | ... |
How do I best do this? Conceptually, I want to group these two DataFrames per person and send two DF chunks into an apply function. In there (i.e. per person), I'd group and aggregate the daily DF per day, then join the daily DF's cluster IDs, then compute the per-cluster mean values.
But grouping/applying multiple DFs doesn't work, right?
Ideas
I have two ideas and am not sure which, if any, make sense:
Join the daily DF to the per-minute DF before grouping, which would result in highly redundant data (i.e. the cluster ID replicated for each minute). In my "real" application, I will probably have per-person data too (e.g. height/weight), which would be a completely constant column then, i.e. even more memory wasted. Maybe that's the only/best/accepted way to do it?
Before applying, transform the DF into a DF that can hold complex structures, e.g. like
| Person | dataframe | key | column | value |
|--------+------------+------------------+-----------+-------|
| 1 | heartrates | 2018-01-01 00:00 | heartrate | 70 |
| 1 | heartrates | 2018-01-01 00:01 | heartrate | 72 |
| ... | ... | ... | ... | ... |
| 1 | clusters | 2018-01-01 | cluster | 1 |
| ... | ... | ... | ... | ... |
or maybe even
| Person | JSON |
|--------+--------|
| 1 | { ...} |
| 2 | { ...} |
| ... | ... |
What's the best practice here?
"But grouping/applying multiple DFs doesn't work, right?"
No, AFAIK this works neither in PySpark nor in pandas.
"Join the daily DF to the per-minute DF before grouping..."
This is the way to go in my opinion. You don't need to merge all the redundant columns, only those required for your groupby operation. There is no way to avoid redundancy in the groupby columns, since they are needed for the groupby operation itself.
In pandas, it is possible to pass an extra groupby column as a separate pandas Series, but it must have the exact same shape as the DataFrame being grouped. To create that groupby column, however, you will need a merge anyway.
"Before applying, transform the DF into a DF that can hold complex structures"
In terms of performance and memory, I would not go with this solution unless you have several groupby operations that would benefit from the more complex data structure. You would also need to put in some effort to build that data structure in the first place.
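The join-then-group approach can be sketched in plain Python to make the logic concrete. This is an illustration under assumptions, not Spark code: the first two per-minute rows and the first two daily rows come from the question, the remaining rows are made up, and the variable names are hypothetical. In PySpark the same steps would be a join on Person and date followed by a groupBy on Person and cluster with a mean aggregation.

```python
from collections import defaultdict

# Per-minute rows: (person, date, time, heartrate). Rows after the first two are made up.
minute_rows = [
    (1, "2018-01-01", "00:00", 70),
    (1, "2018-01-01", "00:01", 72),
    (1, "2018-01-02", "00:00", 90),
    (2, "2018-01-01", "00:00", 60),
]

# Daily rows: (person, date, cluster, max_heartrate). The last row is made up.
daily_rows = [
    (1, "2018-01-01", 1, 180),
    (1, "2018-01-02", 4, 166),
    (2, "2018-01-01", 1, 150),
]

# Keep only the join keys and the one column needed for grouping (cluster),
# so no other daily column is replicated per minute.
cluster_of = {(person, date): cluster for person, date, cluster, _max in daily_rows}

# Join each minute row to its day's cluster, then accumulate sum and count
# per (person, cluster) group.
totals = defaultdict(lambda: [0, 0])
for person, date, _time, heartrate in minute_rows:
    cluster = cluster_of.get((person, date))
    if cluster is None:
        continue  # inner-join semantics: skip minutes without a daily row
    acc = totals[(person, cluster)]
    acc[0] += heartrate
    acc[1] += 1

mean_heartrate = {key: total / count for key, (total, count) in totals.items()}
print(mean_heartrate)
```

Only the cluster ID is carried onto the per-minute rows, which is the limited redundancy the answer describes.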
