I have the tree structure below, where 1 to 6 are node ids and the value of each node is given in brackets.
6(30) <--- 5(15) <--- 3(50) <--- 1(20)
             |          |
             |          +------- 2(10)
             |
             +------- 4(10)
I want to get the cumulative sum of all child nodes plus the value of root node at the root level.
In this example, my output should be 6(root) => 135 (sum of all nodes).
Any idea how to implement this in Spark GraphX or GraphFrames?
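The aggregation itself can be sketched in plain Python before porting it to a graph library. This is only an illustration of the logic, not GraphX or GraphFrames code; the edge list is read off the diagram in the question, and the names `parent_of` and `subtree_sums` are made up for this sketch.

```python
from collections import defaultdict

# Node values and child -> parent edges, read off the diagram in the question.
values = {1: 20, 2: 10, 3: 50, 4: 10, 5: 15, 6: 30}
parent_of = {1: 3, 2: 3, 3: 5, 4: 5, 5: 6}


def subtree_sums(values, parent_of):
    """Return {node: own value + values of all its descendants}."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    totals = {}

    def visit(node):
        # A node's total is its own value plus the totals of its children.
        totals[node] = values[node] + sum(visit(c) for c in children[node])
        return totals[node]

    # Roots are the nodes that never appear as a child.
    for root in [n for n in values if n not in parent_of]:
        visit(root)
    return totals


totals = subtree_sums(values, parent_of)
print(totals[6])  # 135
```

The recursion visits every node once, so each intermediate node (3 and 5 here) also ends up with its own cumulative sum.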
My data looks like: People <-- Events <-- Activities. The parent is People, whose only variable is person_id. Events and Activities both have a time index, along with event_id and activity_id respectively, and both have a few features.
Members of the 'People' entity visit places at all different times. I am trying to generate deep features for people. If people is something like [1,2,3], how do I pass cut off times that create deep features for something like (Person,cutofftime): [1,January2], [1, January3]
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include time index in the People Entity? This would leave my parent entity with multiple people in the index, although they would have different time index. My instinct is that the people entity should not include any datetime column. I would like to give cut off times to the DFS function.
My cutoff_times df.head looks like this, and has multiple instances of some people_id:
+----+-----------+------------+-------+
|    | person_id | time       | label |
+----+-----------+------------+-------+
| 0  | f_GZSVLYU | 2019-12-06 | 0.0   |
| 1  | f_ATBJEQS | 2019-12-06 | 1.0   |
| 2  | f_GLFYVAY | 2019-12-06 | 0.5   |
| 3  | f_DIHPTPA | 2019-12-06 | 0.5   |
| 4  | f_GZSVLYU | 2019-12-02 | 1.0   |
+----+-----------+------------+-------+
The Parent People Entity is like this:
+----+-----------+
|    | person_id |
+----+-----------+
| 0  | f_GZSVLYU |
| 1  | f_ATBJEQS |
| 2  | f_GLFYVAY |
| 3  | f_DIHPTPA |
| 4  | f_DVOYHRQ |
+----+-----------+
How can I make featuretools understand what I'm trying to do?
'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. Person_id, times, and labels all have multiple occurrences each but no 2 rows are the same. Could these duplicates the error is referring to be somewhere else in the EntitySet?
The answer: one row of cutoff_df had the same ID and time as another row, but a different label. The two rows were not fully identical, yet they still counted as duplicates, and that was the problem.
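A check for this situation has to group on the (person_id, time) pair only and ignore the label. The sketch below shows that logic in plain Python; the rows are hypothetical (the last one was added to reproduce the conflict) and `conflicting_cutoffs` is a made-up helper name.

```python
from collections import defaultdict

# Hypothetical cutoff rows: (person_id, time, label). The last row repeats
# the (person_id, time) pair of the first row with a different label.
cutoff_rows = [
    ("f_GZSVLYU", "2019-12-06", 0.0),
    ("f_ATBJEQS", "2019-12-06", 1.0),
    ("f_GZSVLYU", "2019-12-02", 1.0),
    ("f_GZSVLYU", "2019-12-06", 1.0),
]


def conflicting_cutoffs(rows):
    """Group labels by (person_id, time) and keep pairs seen more than once."""
    labels_by_key = defaultdict(list)
    for person_id, time, label in rows:
        labels_by_key[(person_id, time)].append(label)
    return {key: labels for key, labels in labels_by_key.items() if len(labels) > 1}


conflicts = conflicting_cutoffs(cutoff_rows)
print(conflicts)  # {('f_GZSVLYU', '2019-12-06'): [0.0, 1.0]}
```

On a pandas dataframe the same rows can be listed with `cutoff_times[cutoff_times.duplicated(subset=["person_id", "time"], keep=False)]`, whereas a plain `drop_duplicates()` compares all columns and so leaves them in place.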
I'm trying to create a forecasting process using hierarchical time series. My problem is that I can't find a way to create a for loop that hierarchically extracts daily time series from a pandas dataframe grouping the sum of quantities by date. The resulting daily time series should be passed to a function inside the loop, and the results stored in some other object.
Dataset
The initial dataset is a table that represents the daily sales data of 3 hierarchical levels: city, shop, product. The initial table has this structure:
+============+============+============+============+==========+
| Id_Level_1 | Id_Level_2 | Id_Level_3 | Date | Quantity |
+============+============+============+============+==========+
| Rome | Shop1 | Prod1 | 01/01/2015 | 50 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 02/01/2015 | 25 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 03/01/2015 | 73 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 04/01/2015 | 62 |
+------------+------------+------------+------------+----------+
| ... | ... | ... | ... | ... |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 185 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 147 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 206 |
+------------+------------+------------+------------+----------+
Each City (Id_Level_1) has many Shops (Id_Level_2), and each one has some Products (Id_Level_3). Each shop has a different mix of products (maybe shop1 and shop3 have product7, which is not available in other shops). All data are daily and the measure of interest is the quantity.
Hierarchical Index (MultiIndex)
I need to create a tree structure (hierarchical structure) to extract a time series for each "node" of the structure. I call a "node" a combination of the hierarchical keys, i.e. "Rome" and "Milan" are nodes of level 1, while "Rome|Shop1" and "Milan|Shop9" are nodes of level 2. In particular, I need this on level 3, because each product (Id_Level_3) has different sales in each shop of each city. The hierarchy is strict.
Nodes of level 3 are "Rome, Shop1, Prod1", "Rome, Shop1, Prod2", "Rome, Shop2, Prod1", and so on. The key of the nodes is logically the concatenation of the ids.
For each node, the time series is composed of two columns: Date and Quantity.
# MultiIndex dataframe
import pandas as pd
idx = pd.IndexSlice
Liv_Labels = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3', 'Date']
df.set_index(Liv_Labels, drop=False, inplace=True)
df.sort_index(inplace=True)  # slicing with idx needs a sorted MultiIndex
Then I need to extract the aggregated time series level by level, while keeping the hierarchical nodes.
Level 0:
Level_0 = df.groupby(level=['Date'])['Quantity'].sum()
Level 1:
Level_1 = {}
# Node Level 1 "Rome"
Level_1['Rome'] = df.loc[idx['Rome', :, :, :], 'Quantity'].groupby(level=['Date']).sum()
# Node Level 1 "Milan"
Level_1['Milan'] = df.loc[idx['Milan', :, :, :], 'Quantity'].groupby(level=['Date']).sum()
Level 2:
Level_2 = {}
# Node Level 2 "Rome, Shop1"
Level_2['Rome', 'Shop1'] = df.loc[idx['Rome', 'Shop1', :, :], 'Quantity'].groupby(level=['Date']).sum()
... repeat for each level 2 node ...
# Node Level 2 "Milan, Shop9"
Level_2['Milan', 'Shop9'] = df.loc[idx['Milan', 'Shop9', :, :], 'Quantity'].groupby(level=['Date']).sum()
Attempts
I already tried creating dictionaries and a MultiIndex, but my problem is that I can't address a single "node" inside the loop. I can't even extract the unique node keys of each level, so I can't collect the time series of a specific node.
# Get level labels
Liv_Num = 3  # number of hierarchy levels
Level_Labels = ['Id_Level_' + str(n) for n in range(1, Liv_Num + 1)] + ['Date']
# Initialize dictionary
TimeSeries = {}
# Get Level 0 time series
TimeSeries["Level_0"] = df.groupby(level=['Date'])['Quantity'].sum()
# Get the other levels' time series, from 1 to Liv_Num
for i in range(1, Liv_Num + 1):
    TimeSeries["Level_" + str(i)] = df.groupby(level=Level_Labels[0:i] + ['Date'])['Quantity'].sum()
Desired result
I would like a loop that cycles through my dataset and performs these actions:
Creates a structure of all the unique node keys
Extracts each node's time series (Quantity summed by Date)
Stores the time series in a structure for later use
Thanks in advance for any suggestion! Best regards.
FR
I'm currently working on a switch dataset that I pulled from an SQL database, where each port on a switch has its own data frame containing a time series. To access the time series of a specific port, I identify each switch by its IP address and each port by its number on that switch. To make sure I don't query the same thing twice, I use the .unique() method to get each IP and port only once.
I set my index to be the IP and Port indices and accessed the port information like so:
def yield_df(df):
    # Walk the (ip, port) MultiIndex and yield one sub-frame per port
    for ip in df.index.get_level_values('ip').unique():
        for port in df.loc[ip].index.get_level_values('port').unique():
            yield df.loc[ip].loc[port]
Then I cycled the port data frames with a for loop like so:
for port_df in yield_df(adb_df):
    ...  # work with each port's time series here
I'm sure there are faster ways to do this in pandas, but I hope this helps you get started on your problem.
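The three steps requested in the question (collect the node keys, sum Quantity by Date per node, store the result) can also be sketched without pandas. This is a minimal illustration on hypothetical sample rows; `node_time_series` is a made-up helper name, and a node key is represented as a tuple of ids.

```python
from collections import defaultdict

# Hypothetical sample rows: (city, shop, product, date, quantity).
rows = [
    ("Rome", "Shop1", "Prod1", "2015-01-01", 50),
    ("Rome", "Shop1", "Prod2", "2015-01-01", 10),
    ("Rome", "Shop2", "Prod1", "2015-01-01", 5),
    ("Rome", "Shop1", "Prod1", "2015-01-02", 25),
    ("Milan", "Shop3", "Prod9", "2015-01-01", 185),
]


def node_time_series(rows, n_levels=3):
    """Map every node key (a tuple of ids) to {date: summed quantity}."""
    series = defaultdict(lambda: defaultdict(int))
    for *ids, date, quantity in rows:
        # Prefix () is level 0 (grand total); longer prefixes are deeper nodes.
        for depth in range(n_levels + 1):
            series[tuple(ids[:depth])][date] += quantity
    return series


series = node_time_series(rows)
for node, ts in sorted(series.items()):
    print(node, dict(ts))
```

The keys of `series` are the unique nodes of all levels, so a later loop can pass `series[node]` to a forecasting function and store the result under the same key.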
I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that increases by 1 for each event and keeps increasing when the next visit starts.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
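The bit layout described in the quote can be illustrated with a few lines of plain Python. This only mirrors the documented layout (partition ID in the upper bits, record number in the lower 33 bits); `compose_id` and `split_id` are hypothetical helpers for this sketch, not Spark functions.

```python
RECORD_BITS = 33  # the lower 33 bits hold the record number within a partition


def compose_id(partition_id, record_number):
    """Build an ID using the layout described in the quoted documentation."""
    return (partition_id << RECORD_BITS) | record_number


def split_id(generated_id):
    """Recover (partition_id, record_number) from such an ID."""
    return generated_id >> RECORD_BITS, generated_id & ((1 << RECORD_BITS) - 1)


first_in_p0 = compose_id(0, 0)
third_in_p1 = compose_id(1, 2)
print(first_in_p0, third_in_p1)  # 0 8589934594
print(split_id(third_in_p1))     # (1, 2)
```

Under this layout the IDs are consecutive only within one partition, and every new partition jumps ahead by 2**33.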
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more than one row has the same visit_num and event_num. I've tried using the windowing function as below, but because it assigns the same rank to two identical rows, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required but generally, this should work.
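How tied rows are numbered is the part that matters for the duplicated (visit_num, event_num) pairs mentioned in the question. The plain-Python sketch below contrasts the two behaviours on hypothetical events; `number_events` is a made-up helper. In Spark, `fn.rank()` corresponds to the first behaviour and `fn.row_number()` over the same window to the second.

```python
from itertools import groupby

# Hypothetical events: (uid, visit_num, event_num); uid 1 has a duplicated row.
events = [
    (1, 1, 1), (1, 1, 2), (1, 1, 2), (1, 2, 1),
    (2, 1, 1), (2, 2, 1),
]


def number_events(events, ties_share_number):
    """Number events per uid in (visit_num, event_num) order, starting at 1."""
    numbered = []
    for _uid, group in groupby(sorted(events), key=lambda e: e[0]):
        previous, current = None, 0
        for position, event in enumerate(group, start=1):
            # Rank-like numbering repeats the number of an identical predecessor.
            if not (ties_share_number and event == previous):
                current = position
            numbered.append(event + (current,))
            previous = event
    return numbered


rank_like = number_events(events, ties_share_number=True)
row_number_like = number_events(events, ties_share_number=False)
print([e[3] for e in rank_like])        # [1, 2, 2, 4, 1, 2]
print([e[3] for e in row_number_like])  # [1, 2, 3, 4, 1, 2]
```

With the second variant every uid starts at 1 and increases by exactly 1 per row, but the relative order of fully identical rows is arbitrary.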
A simple table join usually completes in 0.0XX seconds, but sometimes takes 2.0XX seconds (according to the SQL execution time in PL/SQL Developer). It still happens when running from SQL*Plus.
If I run the SQL 10 times, 8 times it runs fine and 2 times it takes 2+ seconds.
It's a clean install of Oracle 11.2.0.4 for Linux x86_64 on CentOS 7.
I've installed Oracle recommended patches:
Patch 19769489 - Database Patch Set Update 11.2.0.4.5 (Includes CPUJan2015)
Patch 19877440 - Oracle JavaVM Component 11.2.0.4.2 Database PSU (Jan2015)
No change after patching.
The 2 tables have:
LNK_PACK_REP: 13 rows
PACKAGES: 6 rows
In SQL*Plus I've enabled all statistics and run the SQL multiple times. Only the elapsed time changes, from 0.1 to 2.1 seconds, from time to time. No other statistic differs if I compare a 0.1-second run with a 2.1-second run. The server has 16 GB RAM and 8 CPU cores. Server load is under 0.1 (no user is using the server at the moment).
Output:
SQL> select PACKAGE_ID, id, package_name from LNK_PACK_REP LNKPR INNER JOIN PACKAGES P ON LNKPR.PACKAGE_ID = P.ID;
PACKAGE_ID ID PACKAGE_NAME
3 3 RAPOARTE
3 3 RAPOARTE
121 121 VANZARI
121 121 VANZARI
121 121 VANZARI
2 2 PACHETE
2 2 PACHETE
1 1 DEPARTAMENTE
1 1 DEPARTAMENTE
81 81 ROLURI
81 81 ROLURI
PACKAGE_ID ID PACKAGE_NAME
101 101 UTILIZATORI
101 101 UTILIZATORI
13 rows selected.
Elapsed: 00:00:02.01
Execution Plan
Plan hash value: 2671988802
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 13 | 351 | 3 (0)| 00:00:01 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 13 | 351 | 3 (0)| 00:00:01 | Q1,02 | P->S | QC (RAND) |
|* 3 | HASH JOIN | | 13 | 351 | 3 (0)| 00:00:01 | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 6 | 84 | 2 (0)| 00:00:01 | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | P->P | HASH |
| 6 | PX BLOCK ITERATOR | | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | PCWC | |
| 7 | TABLE ACCESS FULL| PACKAGES | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | PCWP | |
| 8 | BUFFER SORT | | | | | | Q1,02 | PCWC | |
| 9 | PX RECEIVE | | 13 | 169 | 1 (0)| 00:00:01 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10000 | 13 | 169 | 1 (0)| 00:00:01 | | S->P | HASH |
| 11 | INDEX FULL SCAN | UNQ_PACK_REP | 13 | 169 | 1 (0)| 00:00:01 | | | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
3 - access("LNKPR"."PACKAGE_ID"="P"."ID")
Note
dynamic sampling used for this statement (level=2)
Statistics
24 recursive calls
0 db block gets
10 consistent gets
0 physical reads
0 redo size
923 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
4 sorts (memory)
0 sorts (disk)
13 rows processed
Table 1 structure:
-- Create table
create table PACKAGES
(
id NUMBER(3) not null,
package_name VARCHAR2(150),
position NUMBER(3),
activ NUMBER(1)
)
tablespace UM
pctfree 10
initrans 1
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate primary, unique and foreign key constraints
alter table PACKAGES
add constraint PACKAGES_ID primary key (ID)
using index
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate indexes
create index PACKAGES_ACTIV on PACKAGES (ID, ACTIV)
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
Table 2 structure:
-- Create table
create table LNK_PACK_REP
(
package_id NUMBER(3) not null,
report_id NUMBER(3) not null
)
tablespace UM
pctfree 10
initrans 1
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate primary, unique and foreign key constraints
alter table LNK_PACK_REP
add constraint UNQ_PACK_REP primary key (PACKAGE_ID, REPORT_ID)
using index
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate indexes
create index LNK_PACK_REP_REPORT_ID on LNK_PACK_REP (REPORT_ID)
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
In Oracle Enterprise Manager, under SQL Monitor, I can see the SQL that was run multiple times. All runs have "Database Time" 0.0s (under 10 microseconds if I hover over the list) and "Duration" 0.0s for normal runs and 2.0s for those with the delay.
If I go to Monitored SQL Executions for that run of 2.0s I have:
Duration: 2.0s
Database Time: 0.0s
PL/SQL & Java: 0.0
Wait activity: % (no number here)
Buffer gets: 10
IO Requests: 0
IO Bytes: 0
Fetch calls: 2
Parallel: 4
These numbers are consistent with a fast run, except for Duration. In a fast run, Duration is even smaller than Database Time (10,163 microseconds Database Time and 3,748 microseconds Duration); both are displayed as 0.0s without a mouse hover.
I don't know what else to check.
Parallel queries cannot be meaningfully tuned to within a few seconds. They are designed for queries that process large amounts of data for a long time.
The best way to optimize parallel statements with small data sets is to temporarily disable parallelism:
alter system set parallel_max_servers=0;
(This is a good example of the advantages of developing on workstations instead of servers. On a server, this change affects everyone and you probably don't even have the privilege to run the command.)
The query may be simple but parallelism adds a lot of complexity in the background.
It's hard to say exactly why it's slower. If you have the SQL Monitoring report the wait events may help. But even those numbers may just be generic waits like "CPU". Parallel queries have a lot of overhead, in expectation of a resource-intensive, long-running query. Here are some types of overhead that may explain where those 2 seconds come from:
Dynamic sampling - Parallelism may automatically cause dynamic sampling, which reads data from the tables. Although "dynamic sampling used for this statement (level=2)" may just imply missing optimizer statistics.
OS Thread startup - The SQL statement probably needs to start up 8 additional OS threads, and prepare a large amount of memory to hold all the intermediate data. Perhaps the parameter PARALLEL_MIN_SERVERS could help prevent some time used to create those threads.
Additional monitoring - Parallel statements are automatically monitored, which requires recursive SELECTs and INSERTs.
Caching - Parallel queries often read directly from disk and skip reading and writing into the buffer cache. The rules for when it caches data are complicated and undocumented.
Downgrading - Finding the correct degree of parallelism is complicated. For example, I've compiled a list of 39 factors that influence the DOP. It's possible that one of those is causing downgrading, making some queries fast and others slow.
And there are probably dozens of other types of overhead I can't think of. Parallelism is great for massively improving the run-time of huge operations. But it doesn't work well for tiny queries.
The delay is due to parallelism, as suggested by David Aldridge and Jon Heller, but I don't agree with the solution proposed by Jon Heller of disabling parallelism for all queries (at system level). You can use "alter session" to disable it and then re-enable it before running big queries. The exact reason for the delay is still unknown, as the query finishes fast in 8 out of 10 runs and I would expect 10 out of 10 fast runs.
I have a delicate Spark problem that I just can't wrap my head around.
We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contains Historic data. Both have an id on which they can be matched/joined. But the problem is that the two tables have an N:N relationship. Actions contains multiple rows with the same id and so does Historic. Here is some example data from both tables.
Actions (time is actually a timestamp)
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic (set_at is actually a timestamp)
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
How can we join these two tables in a way, that we get a result like this
1 | 100 # 500 - 400 for Actions#1 with time 12:05 because Historic was in that time at 400
1 | 50 # 500 - 450 for Actions#2 with time 12:30 because H. was in that time at 450
2 | 50 # 125 - 75 for Actions#3 with time 12:30 because H. was in that time at 75
I can't come up with a good solution that feels right without making a lot of iterations over huge datasets. I keep thinking about building ranges from the Historic set and then somehow checking whether an Action fits into a range, e.g. (11:00 - 12:15), to make the calculation. But that seems pretty slow to me. Is there a more efficient way to do that? It seems to me that this kind of problem should be common, but I couldn't find any hints on it yet. How would you solve this problem in Spark?
My current attempt so far (half-finished code):
case class Historic(id: String, set_at: Long, valueY: Int)
val historicRDD = sc.cassandraTable[Historic](...)
historicRDD
.map( row => ( row.id, row ) )
.reduceByKey(...)
// transforming to another case which results in something like this; code not finished yet
// (List((Range(0, 12:25), 400), (Range(12:25, NOW), 450)))
// From here we could join with Actions
// And then some .filter maybe to select the right Lists tuple
It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:
Given case classes for Action(id, time, x) and Historic(id, time, y)
Join the actions with the history (this might be heavy)
filter out all historic data that is not relevant for a given action (anything set after the action)
key the results by (id,time) - differentiate same key at different times
reduce the history by action to the max value, leaving us with relevant historical record for the given action
In Spark:
val actionById = actions.keyBy(_.id)
val historyById = historic.keyBy(_.id)
val actionByHistory = actionById.join(historyById)
// keep only the history entries that were set before the action
val filteredActionByidTime = actionByHistory.collect{ case (k,(action,historic)) if (action.time>historic.time) => ((action.id, action.time),(action,historic))}
// per action, keep the most recent of the remaining history entries
val topHistoricByAction = filteredActionByidTime.reduceByKey{ case ((a1:Action,h1:Historic),(a2:Action, h2:Historic)) => (a1, if (h1.time>h2.time) h1 else h2)}
// we are done, let's produce a report now
val report = topHistoricByAction.map{case ((id,time),(action,historic)) => (id,time,action.x - historic.y)}
Using the data provided above, the report looks like:
report.collect
Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
(I transformed the time to seconds to have a simplistic timestamp)
After a few hours of thinking, trying and failing I came up with this solution. I am not sure if it is any good, but for lack of other options, this is what I have.
First we expand our case class Historic
case class Historic(id: String, set_at: Long, valueY: Int) {
  // java.util.TreeMap offers floorEntry, which we will need a few lines later
  val set_at_map = new java.util.TreeMap[Long, Int]()
  set_at_map.put(0L, valueY) // Means from the beginning of Epoch ...
  set_at_map.put(set_at, valueY) // .. to the set_at date
  // This is the fun part. With .getHistoricValue we can pass any timestamp and we get back the value whose key range contains the passed date. For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
  def getHistoricValue(date: Long): Option[Int] = {
    // floorEntry returns the entry with the greatest key <= date, or null if there is none
    Option(set_at_map.floorEntry(date)).map(_.getValue)
  }
}
The case class is ready and now we bring it into action
val historicRDD = sc.cassandraTable[Historic](...)
  .map( row => ( row.id, row ) )
  .reduceByKey( (row1, row2) => {
    row1.set_at_map.put(row2.set_at, row2.valueY) // collect all historic events of one id in a single map
    row1
  })
// Now we load the Actions and map them by id as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
  .map( row => ( row.id, row ) )
// Now both RDDs have the same key and we can join them
val fin = actionsRDD.join(historicRDD)
  .map( row => {
    ( row._1, // the id both RDDs were keyed by
      (
        row._2._1.id,
        row._2._1.valueX - row._2._2.getHistoricValue(row._2._1.time).get // valueX minus the valueY valid at that timestamp
      )
    )
  })
I am totally new to Scala, so please let me know if this code could be improved anywhere.
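The floor lookup that this answer builds on (find the latest historic value set at or before a timestamp) can be sketched in plain Python with a sorted list and `bisect`. This is only an illustration of the lookup outside Spark; times are written as minutes since midnight, and `build_history` and `value_at` are made-up helper names.

```python
from bisect import bisect_right

# Sample data from the question, with times as minutes since midnight.
historic = [(1, 660, 400), (1, 735, 450), (2, 740, 50), (2, 745, 75)]
actions = [(1, 725, 500), (1, 750, 500), (2, 750, 125)]


def build_history(historic):
    """Map id -> (sorted set_at times, matching valueY list)."""
    by_id = {}
    for id_, set_at, value_y in sorted(historic):
        times, values = by_id.setdefault(id_, ([], []))
        times.append(set_at)
        values.append(value_y)
    return by_id


def value_at(history, id_, time):
    """Return the latest valueY set at or before `time`, or None."""
    times, values = history.get(id_, ([], []))
    pos = bisect_right(times, time)  # number of entries with set_at <= time
    return values[pos - 1] if pos else None


history = build_history(historic)
report = [(id_, value_x - value_at(history, id_, time))
          for id_, time, value_x in actions]
print(report)  # [(1, 100), (1, 50), (2, 50)]
```

Unlike the Scala version above, this sketch has no entry at time 0, so `value_at` returns None for an action that precedes all history of its id; the sample data contains no such action.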
I know that this question has been answered but I want to add another solution that worked for me -
your data -
Actions
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
Union Actions and Historic
Combined
id | time | valueX | record-type
1 | 12:05 | 500 | Action
1 | 12:30 | 500 | Action
2 | 12:30 | 125 | Action
1 | 11:00 | 400 | Historic
1 | 12:15 | 450 | Historic
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.
Partition-1
1 | 11:00 | 400 | Historic
1 | 12:05 | 500 | Action
1 | 12:15 | 450 | Historic
1 | 12:30 | 500 | Action
Partition-2
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
2 | 12:30 | 125 | Action
Traverse through the records per partition.
If it is a Historic record, add it to a map, or update the map if it already has that id. In other words, keep track of the latest valueY per id using one map per partition.
If it is an Action record, get the valueY value from the map and subtract it from valueX.
A map M
Partition-1 traversal in order
M={ 1 -> 400} // A new entry in map M
1 | 100 // M(1) = 400; 500-400
M={1 -> 450} // update M, because key already exists
1 | 50 // M(1) = 450; 500-450
Partition-2 traversal in order
M={ 2 -> 50} // A new entry in M
M={ 2 -> 75} // update M, because key already exists
2 | 50 // M(2) = 75; 125-75
You could instead partition and sort by time, but then you would need to merge the partitions later, which adds some complexity.
I found this preferable to the many-to-many join that we usually get when using time ranges to join.
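The union, sort and single-pass traversal described above can be sketched in plain Python. This runs in one process, so the global sort stands in for the partition-by-id, sort-by-time step that repartitionAndSortWithinPartitions would perform in Spark; times are minutes since midnight and `merge_and_diff` is a made-up helper name.

```python
# Sample data from the answer, with times as minutes since midnight.
actions = [(1, 725, 500), (1, 750, 500), (2, 750, 125)]
historic = [(1, 660, 400), (1, 735, 450), (2, 740, 50), (2, 745, 75)]


def merge_and_diff(actions, historic):
    """Union both record types, sort by (id, time), then scan once."""
    # Tag 0 sorts a Historic record before an Action with the same id and time.
    combined = [(id_, t, 0, v) for id_, t, v in historic]
    combined += [(id_, t, 1, v) for id_, t, v in actions]
    combined.sort()

    latest = {}  # the map M: id -> most recent valueY seen so far
    report = []
    for id_, _time, tag, value in combined:
        if tag == 0:
            latest[id_] = value  # new entry or update, as in the walkthrough
        elif id_ in latest:
            report.append((id_, value - latest[id_]))
    return report


result = merge_and_diff(actions, historic)
print(result)  # [(1, 100), (1, 50), (2, 50)]
```

Actions that arrive before any Historic record of their id are skipped in this sketch; whether to drop them or emit a default is a design choice the answer leaves open.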