How to aggregate data per day on spark streaming

How to aggregate data per day on spark streaming - apache-spark

My problem is the following:
I have to write a scala program for spark streaming, this program has to read data from a kafka topic and save it in a cassandra table.
My data is of the form:
transaction_id,customer_id,merchant_id,status,timestamp,invoice_no,invoice_amount
202207280000001,1966,319,SUCCESS,07/28/22,835-975-389,86562
The need is to aggregate the data by day and to count by amount (between 0 and 500, between 500 and 1000, between 1000 and 1500, and over 2000).
Can someone help me?
Thanks
OK let me try to be more clear.
A spark streaming application written in scala, which has to read data from a Kafka topic, the data is structured in the following way:
transaction_id,customer_id,merchant_id,status,timestamp,invoice_no,invoice_amount
Example (fictitious data):
202207280000001,1966,319,SUCCESS,07/28/22,835-975-389,86562
202207280000002,1970,320,SUCCESS,07/28/22,835-980-395,50000
202207280000003,1966,319,SUCCESS,07/28/22,835-975-399,200
202207280000004,658,400,SUCCESS,07/25/22,835-975-200,800
202207280000005,1966,319,SUCCESS,07/25/22,835-975-387,300
From these data I have to calculate in real time the Count of transaction happened falling in price bucket 0-500, 0-1000, 0-2000, and over 2000.
The displayed result would be for example :
07/28/22 | Below500 | 1
07/28/22 | Below1000 | 0
07/28/22 | Below1500 | 0
07/28/22 | Above2000 |2
07/25/22 | Below500 | 1
07/25/22 | Below1000 | 1
07/25/22 | Below1500 | 0
07/25/22 | Above2000 | 0
My program reads the topic well and displays the data coming from Kafka but I can't process the data to display the right result.
Thanks

Related

Pandas: Sliding window, summing app 14 day data

I do wonder how it is possible to make sliding windows in Pandas.
I have a dataframe with three columns.
Country | Number | DayOfTheYear
===================================
No | 50 | 0
No | 20 | 1
No | 37 | 2
I would love to see 14 day chunks for every country and day combination.
The country think can be ignored for the moment, since I can filter those manually in some way. But imagine there is only one country, is there a smart way to get some sort of summed up sliding window, resulting in something like the following?
Country | Sum | DatesOftheYear
===================================
No | 504 | 0-13
No | 207 | 1-14
No | 337 | 2-15
I would also accept if if they where disjunct, being only 0-13, 14-27, etc.
But I just cannot come along with Pandas. I know an old SQL solution, but is there anybody having a nice idea for Pandas?

If you want a rolling windows of your dataframe, you can simply use the .rolling function of pandas : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
In your case : df["Number"].rolling(14).sum()

Spark Structured Streaming ignore old records

I am new to spark and help me to arrive in solutions for this problem. I am receiving the input file it has information about an event occurred and the file itself has the timestamp value. Event Id is the primary column for this input. Refer below the sample input (the actual file has many other columns).
Event_Id | Event_Timestamp
1 | 2018-10-11 12:23:01
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
When we get the above input we need to get the latest record based on event id, timestamp and the expected output would be
Event_Id | Event_Timestamp
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
Hereafter whenever I receive the event information which has timestamp value less than the above value I need to ignore, for example, consider the second input
Event_Id | Event_Timestamp
2 | 2018-10-11 10:25:01
1 | 2018-10-11 08:23:01
3 | 2018-10-11 21:12:01
Now I need to ignore event_id 1 and 2 since it has the old timestamp that the state what we have right now. Only the event 3 would be passed and the expected output here is
3 | 2018-10-11 21:12:01
Assume we have n number of unique(10 billion) event id how it would be stored in spark memory, is there something needs to be taken care.
Thanks in advance

We can take max timestamp and use persist() method with disk_only or disk_only2 storage levels... In that case, we can achieve this I think...
Since it's an streaming data, we can try with memory_only or memory_only2 storage levels too...
Please try and update..

How to cycle a Pandas dataframe grouping by hierarchical multiindex from top to bottom and store results

I'm trying to create a forecasting process using hierarchical time series. My problem is that I can't find a way to create a for loop that hierarchically extracts daily time series from a pandas dataframe grouping the sum of quantities by date. The resulting daily time series should be passed to a function inside the loop, and the results stored in some other object.
Dataset
The initial dataset is a table that represents the daily sales data of 3 hierarchical levels: city, shop, product. The initial table has this structure:
+============+============+============+============+==========+
| Id_Level_1 | Id_Level_2 | Id_Level_3 | Date | Quantity |
+============+============+============+============+==========+
| Rome | Shop1 | Prod1 | 01/01/2015 | 50 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 02/01/2015 | 25 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 03/01/2015 | 73 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 04/01/2015 | 62 |
+------------+------------+------------+------------+----------+
| ... | ... | ... | ... | ... |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 185 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 147 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 206 |
+------------+------------+------------+------------+----------+
Each City (Id_Level_1) has many Shops (Id_Level_2), and each one has some Products (Id_Level_3). Each shop has a different mix of products (maybe shop1 and shop3 have product7, which is not available in other shops). All data are daily and the measure of interest is the quantity.
Hierarchical Index (MultiIndex)
I need to create a tree structure (hierarchical structure) to extract a time series for each "node" of the structure. I call a "node" a cobination of the hierarchical keys, i.e. "Rome" and "Milan" are nodes of Level 1, while "Rome|Shop1" and "Milan|Shop9" are nodes of level 2. In particulare, I need this on level 3, because each product (Id_Level_3) has different sales in each shop of each city. Here is the strict hierarchy.
Nodes of level 3 are "Rome, Shop1, Prod1", "Rome, Shop1, Prod2", "Rome, Shop2, Prod1", and so on. The key of the nodes is logically the concatenation of the ids.
For each node, the time series is composed by two columns: Date and Quantity.
# MultiIndex dataframe
Liv_Labels = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3', 'Date']
df.set_index(Liv_Labels, drop=False, inplace=True)
The I need to extract the aggregated time series in order but keeping the hierarchical nodes.
Level 0:
Level_0 = df.groupby(level=['Data'])['Qta'].sum()
Level 1:
# Node Level 1 "Rome"
Level_1['Rome'] = df.loc[idx[['Rome'],:,:]].groupby(level=['Data']).sum()
# Node Level 1 "Milan"
Level_1['Milan'] = df.loc[idx[['Milan'],:,:]].groupby(level=['Data']).sum()
Level 2:
# Node Level 2 "Rome, Shop1"
Level_2['Rome',] = df.loc[idx[['Rome'],['Shop1'],:]].groupby(level=['Data']).sum()
... repeat for each level 2 node ...
# Node Level 2 "Milan, Shop9"
Level_2['Milan'] = df.loc[idx[['Milan'],['Shop9'],:]].groupby(level=['Data']).sum()
Attempts
I already tried creating dictionaries and multiindex, but my problem is that I can't get a proper "node" use inside the loop. I can't even extract the unique level nodes keys, so I can't collect a specific node time series.
# Get level labels
Level_Labels = ['Id_Liv'+str(n) for n in range(1, Liv_Num+1)]+['Data']
# Initialize dictionary
TimeSeries = {}
# Get Level 0 time series
TimeSeries["Level_0"] = df.groupby(level=['Data'])['Qta'].sum()
# Get othe levels time series from 1 to Level_Num
for i in range(1, Liv_Num+1):
TimeSeries["Level_"+str(i)] = df.groupby(level=Level_Labels[0:i]+['Data'])['Qta'].sum()
Desired result
I would like a loop the cycles my dataset with these actions:
Creates a structure of all the unique node keys
Extracts the node time series grouped by Date and Quantity
Store the time series in a structure for later use
Thanks in advance for any suggestion! Best regards.
FR

I'm currently working on a switch dataset that I polled from an sql database where each port on the respective switch has a data frame which has a time series. So to access this time series information for each specific port I represented the switches by their IP addresses and the various number of ports on the switch, and to make sure I don't re-query what I already queried before I used the .unique() method to get unique queries of each.
I set my index to be the IP and Port indices and accessed the port information like so:
def yield_df(df):
for ip in df.index.get_level_values('ip').unique():
for port in df.loc[ip].index.get_level_values('port').unique():
yield df.loc[ip].loc[port]
Then I cycled the port data frames with a for loop like so:
for port_df in yield_df(adb_df):
I'm sure there are faster ways to carry out these procedures in pandas but I hope this helps you start solving your problem

Spark: count events based on two columns

I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that keeps increasing by 1 for each event and continues to increase when then next visit has started.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more thanone row has the samevisit_numandevent_num`. I've tried using the windowing function as below, but due to this assigning the same rank to two identical columns, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))

The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required but generally, this should work.

Distribution of time periods over rows with certain status (column value)

I have a Pyspark dataframe containing logs, with each row corresponding to the state of the system at the time it is logged, and a group number. I would like to find the lengths of the time periods for which each group is in an unhealthy state.
For example, if this were my table:
TIMESTAMP | STATUS_CODE | GROUP_NUMBER
--------------------------------------
02:03:11 | healthy | 000001
02:03:04 | healthy | 000001
02:03:03 | unhealthy | 000001
02:03:00 | unhealthy | 000001
02:02:58 | healthy | 000008
02:02:57 | healthy | 000008
02:02:55 | unhealthy | 000001
02:02:54 | healthy | 000001
02:02:50 | healthy | 000007
02:02:48 | healthy | 000004
I would want to return Group 000001 having an unhealthy time period of 9 seconds (from 02:02:55 to 02:03:04).
Other groups could also have unhealthy time periods, and I would want to return those as well.
Due to the possibility of consecutive rows with the same status, and since rows of different groups are interspersed, I am struggling to find a way to do this efficiently.
I cannot convert the Pyspark dataframe to a Pandas dataframe, as it is much too large.
How can I efficiently determine the lengths of these time periods?
Thanks so much!

the pyspark with spark-sql solution would look like this.
First we create the sample data-set. In addition to the dataset we generate row_number field partition on group and order by the timestamp. then we register the generated dataframe as a table say table1
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([
('2017-01-01 02:03:11','healthy','000001'),
('2017-01-01 02:03:04','healthy','000001'),
('2017-01-01 02:03:03','unhealthy','000001'),
('2017-01-01 02:03:00','unhealthy','000001'),
('2017-01-01 02:02:58','healthy','000008'),
('2017-01-01 02:02:57','healthy','000008'),
('2017-01-01 02:02:55','unhealthy','000001'),
('2017-01-01 02:02:54','healthy','000001'),
('2017-01-01 02:02:50','healthy','000007'),
('2017-01-01 02:02:48','healthy','000004')
],['timestamp','state','group_id'])
df = df.withColumn('rownum', row_number().over(Window.partitionBy(df.group_id).orderBy(unix_timestamp(df.timestamp))))
df.registerTempTable("table1")
once the dataframe is registered as a table (table1). the required data can be computed as below using spark-sql
>>> spark.sql("""
... SELECT t1.group_id,sum((t2.timestamp_value - t1.timestamp_value)) as duration
... FROM
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1 WHERE state = 'unhealthy') t1
... LEFT JOIN
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1) t2
... ON t1.group_id = t2.group_id
... AND t1.rownum = t2.rownum - 1
... group by t1.group_id
... """).show()
+--------+--------+
|group_id|duration|
+--------+--------+
| 000001| 9|
+--------+--------+
the sample dateset had unhealthy data for group_id 00001 only. but this solution works for cases other group_ids with unhealthy state.

One straightforward way (may be not optimal) is:
Map to [K,V] with GROUP_NUMBER as the Key K
Use repartitionAndSortWithinPartitions, so you will have all data for every single group in the same partition and have them sorted by TIMESTAMP. Detailed explanation how it works is in this answer: Pyspark: Using repartitionAndSortWithinPartitions with multiple sort Critiria
And finally use mapPartitions to get an iterator over sorted data in single partition, so you could easily find the answer you needed. (explanation for mapPartitions: How does the pyspark mapPartitions function work?)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to aggregate data per day on spark streaming - apache-spark

Related

Pandas: Sliding window, summing app 14 day data

Spark Structured Streaming ignore old records

How to cycle a Pandas dataframe grouping by hierarchical multiindex from top to bottom and store results

Spark: count events based on two columns

Distribution of time periods over rows with certain status (column value)

Categories

Resources