I am new to Spark and have a specific question about how to use Spark to solve my problem, which may be simple.
Problem:
I have a model, which predicts the sales of products. Each product also belongs to a category like shoes, clothes etc. And we also have actual sales data. So the data look like this:
+------------+----------+-----------------+--------------+
| product_id | category | predicted_sales | actual_sales |
+------------+----------+-----------------+--------------+
| pid1 | shoes | 100.0 | 123 |
| pid2 | hat | 232 | 332 |
| pid3 | hat | 202 | 432 |
+------------+----------+-----------------+--------------+
What I'd like to do: for each category, calculate the number (or percentage) of products in the intersection between the top 5% of products ranked by actual_sales and the top 5% ranked by predicted_sales.
Doing this for all products at once, rather than per category, would be easy, something like this:
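import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

def getIntersectionRatio(df: DataFrame, per: Int): Double = {
  val limitNum = (df.count() * per / 100.0).toInt
  // "Top" means largest values, so sort in descending order before limiting
  val topActual = df.orderBy(desc("actual_sales")).limit(limitNum)
  val topPredicted = df.orderBy(desc("predicted_sales")).limit(limitNum)
  val intersection = topActual.join(topPredicted, Seq("product_id"), "inner")
  intersection.count() * 100.0 / limitNum
}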
However, I need to calculate the intersection for each category. The result will be something like this
+-----------+------------------------+
| Category | intersection_percentage|
+-----------+------------------------+
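To make the target concrete, here is a minimal sketch of the per-category computation in plain Python rather than Spark. The function names and the sample rows are hypothetical and only illustrate the logic: group by category, take the top fraction by each measure, and intersect. In Spark, per-group ranking like this is commonly expressed with window functions partitioned by category, but that mapping is an assumption and not something verified here.

```python
from collections import defaultdict

def top_ids(rows, key, n):
    """Return the product_ids of the n rows with the largest value of `key`."""
    ranked = sorted(rows, key=lambda r: r[key], reverse=True)
    return {r["product_id"] for r in ranked[:n]}

def intersection_percentage_by_category(rows, fraction=0.05):
    """Map each category to the percentage overlap of its two top-`fraction` sets."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row)

    result = {}
    for category, group in by_category.items():
        n = int(len(group) * fraction)
        if n == 0:
            # Too few products in this category to form a top-`fraction` set
            result[category] = None
            continue
        overlap = top_ids(group, "actual_sales", n) & top_ids(group, "predicted_sales", n)
        result[category] = 100.0 * len(overlap) / n
    return result

# Hypothetical sample data; a fraction of 0.5 is used so the tiny groups are non-empty
sample_rows = [
    {"product_id": "h1", "category": "hat", "predicted_sales": 10, "actual_sales": 40},
    {"product_id": "h2", "category": "hat", "predicted_sales": 35, "actual_sales": 30},
    {"product_id": "h3", "category": "hat", "predicted_sales": 30, "actual_sales": 20},
    {"product_id": "h4", "category": "hat", "predicted_sales": 40, "actual_sales": 10},
    {"product_id": "s1", "category": "shoes", "predicted_sales": 5, "actual_sales": 5},
    {"product_id": "s2", "category": "shoes", "predicted_sales": 9, "actual_sales": 8},
]
percentages = intersection_percentage_by_category(sample_rows, fraction=0.5)
```

With this sample, the hat category shares one of its two top products between the rankings, and the shoes category shares its single top product.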
My ideas
User Defined Aggregation Function or Aggregators
I think I can achieve my goal if I use groupBy or groupByKey with a UDAF or an Aggregator, but that seems too inefficient: these process one row at a time, so I would have to store every row in the buffer inside the UDAF or Aggregator.
df.groupBy("category").agg(myUdaf)
class myUdaf extends UserDefinedAggregateFunction {
  // Save all the rows to an ArrayBuffer
  // and then transform the buffer back to a DataFrame.
  // Then do the same thing as for all products in getIntersectionRatio defined previously
}
Self implemented partitioning
I can select the distinct categories and then use map to process each category, joining each category with df to get its partition
df.select("category").distinct.map(myfun(df))
def myfun(df: DataFrame)(row: Row): Double = {
  val dfRow = row.toDF // not supported, but feasible with other APIs
  val group = df.join(broadcast(dfRow), Seq("category"), "inner")
  getIntersectionRatio(group, 5)
}
Do we have a better solution for this?
Thanks in advance!
I am using Spark SQL 2.4.0. I have a couple of tables as below:
CUST table:
id | name | age | join_dt
-------------------------
12 | John | 25 | 2019-01-05
34 | Pete | 29 | 2019-06-25
56 | Mike | 35 | 2020-01-31
78 | Alan | 30 | 2020-02-25
REF table:
eff_dt
------
2020-01-31
The requirement is to select all the records from CUST whose join_dt is <= eff_dt in the REF table. So, for this simple requirement, I put together the following query:
version#1:
select
c.id,
c.name,
c.age,
c.join_dt
from cust c
inner join ref r
on c.join_dt <= r.eff_dt;
Now, this creates a BroadcastNestedLoopJoin (BNLJ) in the physical plan, and hence the query takes a long time to run.
Question 1:
Is there a better way to implement the same logic without inducing a BNLJ, so that the query executes faster? Is it possible to avoid the BNLJ?
Part 2:
Now, I broke the query into two parts:
version#2:
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt = r.eff_dt --equi join
union all
select c.id, c.name, c.age, c.join_dt
from cust c
inner join ref r
on c.join_dt < r.eff_dt; --theta join
Now, for the query in version#1, the physical plan shows that the CUST table is scanned only once, whereas the physical plan for the query in version#2 shows that the same input table CUST is scanned twice (once for each of the two queries combined with the union). However, I am surprised to find that version#2 executes faster than version#1.
Question 2:
Why does version#2 execute faster than version#1, even though version#2 scans the table twice rather than once and both versions induce a BNLJ?
Can anyone please clarify? Please let me know if additional information is required.
Thanks.
I have two models (SaleInvoice and Product) with a many-to-many relation.
In the SaleInvoice model:
public function products()
{
return $this->belongsToMany(Product::class, 'sale_invoice_product', 'saleInvoice_id', 'product_id')->withPivot('count');
}
In the Product model:
public function saleInvoices()
{
return $this->belongsToMany(SaleInvoice::class, 'sale_invoice_product', 'product_id', 'saleInvoice_id');
}
This is an example of the data recorded in the sale_invoice_product table (the intermediate table):
id | saleInvoiceId | product_id | count
1 | 1500 | 1 | 3
2 | 1500 | 3 | 2
3 | 1500 | 4 | 4
4 | 1501 | 1 | 1
5 | 1501 | 4 | 1
How can I access the product and sale invoice data from this table in the form below (as JSON for an API request)?
product_id | product_name | count | saleInvoice | date
1 LG 3 1500 2020-05-12
3 SONY 2 1500 2020-05-13
4 OT 4 1500 2020-05-17
1 LG 1 1501 2020-05-19
4 OT 1 1501 2020-05-22
I want to return JSON in the format above from SaleInvoiceController.
What you have is fine. You only need to make an API resource for this model and return the attributes you want. To access the pivot table you can use $product->pivot->count.
You can try one of these methods:
Building a model for sale_invoice_product table with relations to SaleInvoice and Product. Then manually construct the JSON in your controller
Build an SQL View and a Model for it
Solution 1: Building a model to the intermediate table and manually constructing the JSON
Let's say you built a model called SaleInvoiceProduct that has product() relation to the Products table and saleInvoice() relation to the SaleInvoices table. In your controller you can do this
$resultInvoiceProducts = [];
$allSaleInvoiceProducts = SaleInvoiceProduct::all();
foreach ($allSaleInvoiceProducts as $oneSaleInvoiceProduct) {
    // Flatten one pivot row together with its related product and invoice
    $tempSaleInvoiceProduct = new stdClass();
    $tempSaleInvoiceProduct->product_id = $oneSaleInvoiceProduct->product_id;
    $tempSaleInvoiceProduct->product_name = $oneSaleInvoiceProduct->product->name;
    $tempSaleInvoiceProduct->count = $oneSaleInvoiceProduct->count;
    $tempSaleInvoiceProduct->saleInvoiceId = $oneSaleInvoiceProduct->saleInvoiceId;
    $tempSaleInvoiceProduct->date = $oneSaleInvoiceProduct->saleInvoice->date;
    array_push($resultInvoiceProducts, $tempSaleInvoiceProduct);
}
Solution 2: Using SQL Views
You can create an SQL View that uses Joins to construct the data you need
DROP VIEW IF EXISTS vSaleInvoiceProduct;
CREATE VIEW vSaleInvoiceProduct AS
SELECT sp.product_id,
sp.saleInvoiceId,
sp.`count`,
p.product_name,
s.`date`
FROM SaleInvoiceProduct sp
LEFT JOIN SaleInvoices s on sp.saleInvoiceId = s.saleInvoiceId
LEFT JOIN Products p on sp.product_id = p.product_id
Then you can create a Laravel model for this View just like you would do for any table, call the ::all() method on it and directly return the results with json()
My data looks like: People <-- Events <--Activities. The parent is People, of which the only variable is the person_id. Events and Activities both have a time index, along with event_id and activity_id, both which have a few features.
Members of the 'People' entity visit places at all different times. I am trying to generate deep features for people. If people is something like [1,2,3], how do I pass cut off times that create deep features for something like (Person,cutofftime): [1,January2], [1, January3]
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include time index in the People Entity? This would leave my parent entity with multiple people in the index, although they would have different time index. My instinct is that the people entity should not include any datetime column. I would like to give cut off times to the DFS function.
My cutoff_times df.head() looks like this, and has multiple instances of some person_id values:
+-------------------------------------------+
| person_id time label |
+-------------------------------------------+
| 0 f_GZSVLYU 2019-12-06 0.0 |
| 1 f_ATBJEQS 2019-12-06 1.0 |
| 2 f_GLFYVAY 2019-12-06 0.5 |
| 3 f_DIHPTPA 2019-12-06 0.5 |
| 4 f_GZSVLYU 2019-12-02 1.0 |
+-------------------------------------------+
The Parent People Entity is like this:
+-------------------+
| person_id |
+-------------------+
| 0 f_GZSVLYU |
| 1 f_ATBJEQS |
| 2 f_GLFYVAY |
| 3 f_DIHPTPA |
| 4 f_DVOYHRQ |
+-------------------+
How can I make featuretools understand what I'm trying to do?
'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. Person_id, times, and labels all have multiple occurrences each but no 2 rows are the same. Could these duplicates the error is referring to be somewhere else in the EntitySet?
The answer: two rows of cutoff_df had the same ID and time but different labels. Dropping fully identical rows does not remove such a pair, yet it still counts as a duplicate, and that was the problem.
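A check like the one that would have caught this can be sketched as follows: count rows by (person_id, time) only, ignoring the label. This is plain Python with a hypothetical helper name, and the third sample row is invented to reproduce the situation described. In pandas, the equivalent check would be DataFrame.duplicated with subset set to the ID and time columns.

```python
from collections import Counter

def duplicated_cutoff_keys(rows, id_col="person_id", time_col="time"):
    """Return the (id, time) pairs that occur more than once, ignoring other columns."""
    counts = Counter((row[id_col], row[time_col]) for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)

cutoff_rows = [
    {"person_id": "f_GZSVLYU", "time": "2019-12-06", "label": 0.0},
    {"person_id": "f_ATBJEQS", "time": "2019-12-06", "label": 1.0},
    # Hypothetical row: same id and time as the first row, but a different label
    {"person_id": "f_GZSVLYU", "time": "2019-12-06", "label": 1.0},
    {"person_id": "f_GZSVLYU", "time": "2019-12-02", "label": 1.0},
]
duplicates = duplicated_cutoff_keys(cutoff_rows)
```

No two rows here are fully identical, so a plain drop of duplicate rows leaves all four in place, while the key-based check flags the repeated pair.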
I'm trying to create a forecasting process using hierarchical time series. My problem is that I can't find a way to create a for loop that hierarchically extracts daily time series from a pandas dataframe grouping the sum of quantities by date. The resulting daily time series should be passed to a function inside the loop, and the results stored in some other object.
Dataset
The initial dataset is a table that represents the daily sales data of 3 hierarchical levels: city, shop, product. The initial table has this structure:
+============+============+============+============+==========+
| Id_Level_1 | Id_Level_2 | Id_Level_3 | Date | Quantity |
+============+============+============+============+==========+
| Rome | Shop1 | Prod1 | 01/01/2015 | 50 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 02/01/2015 | 25 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 03/01/2015 | 73 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 04/01/2015 | 62 |
+------------+------------+------------+------------+----------+
| ... | ... | ... | ... | ... |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 185 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 147 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 206 |
+------------+------------+------------+------------+----------+
Each City (Id_Level_1) has many Shops (Id_Level_2), and each one has some Products (Id_Level_3). Each shop has a different mix of products (maybe shop1 and shop3 have product7, which is not available in other shops). All data are daily and the measure of interest is the quantity.
Hierarchical Index (MultiIndex)
I need to create a tree structure (hierarchical structure) to extract a time series for each "node" of the structure. I call a "node" a combination of the hierarchical keys, i.e. "Rome" and "Milan" are nodes of level 1, while "Rome|Shop1" and "Milan|Shop9" are nodes of level 2. In particular, I need this on level 3, because each product (Id_Level_3) has different sales in each shop of each city. The hierarchy is strict.
Nodes of level 3 are "Rome, Shop1, Prod1", "Rome, Shop1, Prod2", "Rome, Shop2, Prod1", and so on. The key of a node is logically the concatenation of the ids.
For each node, the time series is composed of two columns: Date and Quantity.
# MultiIndex dataframe
Liv_Labels = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3', 'Date']
df.set_index(Liv_Labels, drop=False, inplace=True)
Then I need to extract the aggregated time series in order, while keeping the hierarchical nodes.
Level 0:
Level_0 = df.groupby(level=['Date'])['Quantity'].sum()
Level 1:
# Assumes idx = pd.IndexSlice and that Level_1 and Level_2 are dicts
# Node Level 1 "Rome"
Level_1['Rome'] = df.loc[idx[['Rome'],:,:]].groupby(level=['Date'])['Quantity'].sum()
# Node Level 1 "Milan"
Level_1['Milan'] = df.loc[idx[['Milan'],:,:]].groupby(level=['Date'])['Quantity'].sum()
Level 2:
# Node Level 2 "Rome, Shop1"
Level_2['Rome', 'Shop1'] = df.loc[idx[['Rome'],['Shop1'],:]].groupby(level=['Date'])['Quantity'].sum()
... repeat for each level 2 node ...
# Node Level 2 "Milan, Shop9"
Level_2['Milan', 'Shop9'] = df.loc[idx[['Milan'],['Shop9'],:]].groupby(level=['Date'])['Quantity'].sum()
Attempts
I already tried creating dictionaries and multiindex, but my problem is that I can't get a proper "node" use inside the loop. I can't even extract the unique level nodes keys, so I can't collect a specific node time series.
# Get level labels
Level_Labels = ['Id_Level_'+str(n) for n in range(1, Liv_Num+1)]+['Date']
# Initialize dictionary
TimeSeries = {}
# Get Level 0 time series
TimeSeries["Level_0"] = df.groupby(level=['Date'])['Quantity'].sum()
# Get the other levels' time series, from 1 to Liv_Num
for i in range(1, Liv_Num+1):
    TimeSeries["Level_"+str(i)] = df.groupby(level=Level_Labels[0:i]+['Date'])['Quantity'].sum()
Desired result
I would like a loop that cycles through my dataset and performs these actions:
Creates a structure of all the unique node keys
Extracts each node's time series (Quantity summed by Date)
Stores the time series in a structure for later use
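The three actions above can be sketched in plain Python as a single pass that accumulates a date-to-quantity mapping for every node, keyed by a tuple of level ids. This is only an illustration of the intended structure, not a pandas solution: build_node_series and summarize are hypothetical names, summarize stands in for the real forecasting function, and some sample rows are invented.

```python
from collections import defaultdict

LEVELS = ["Id_Level_1", "Id_Level_2", "Id_Level_3"]

def build_node_series(rows, levels=LEVELS):
    """Map each node key (tuple of level ids) to a {date: total quantity} dict.

    The empty tuple () is the level-0 node, i.e. the grand total per date.
    """
    series = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Each row contributes to one node per depth: (), (city,), (city, shop), ...
        for depth in range(len(levels) + 1):
            node = tuple(row[name] for name in levels[:depth])
            series[node][row["Date"]] += row["Quantity"]
    return {node: dict(by_date) for node, by_date in series.items()}

def summarize(time_series):
    """Hypothetical placeholder for the forecasting function applied to each node."""
    return sum(time_series.values())

sample_rows = [
    {"Id_Level_1": "Rome", "Id_Level_2": "Shop1", "Id_Level_3": "Prod1", "Date": "01/01/2015", "Quantity": 50},
    {"Id_Level_1": "Rome", "Id_Level_2": "Shop1", "Id_Level_3": "Prod1", "Date": "02/01/2015", "Quantity": 25},
    # The two rows below are invented for illustration
    {"Id_Level_1": "Rome", "Id_Level_2": "Shop1", "Id_Level_3": "Prod2", "Date": "01/01/2015", "Quantity": 25},
    {"Id_Level_1": "Milan", "Id_Level_2": "Shop3", "Id_Level_3": "Prod9", "Date": "01/01/2015", "Quantity": 185},
]
node_series = build_node_series(sample_rows)
results = {node: summarize(ts) for node, ts in node_series.items()}
```

Keying by tuple keeps the hierarchy explicit: the length of the key is the level, and node_series[("Rome", "Shop1")] is the series for that shop.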
Thanks in advance for any suggestion! Best regards.
FR
I'm currently working on a switch dataset that I pulled from an SQL database, where each port on each switch has a data frame containing a time series. To access the time series for a specific port, I identified the switches by their IP addresses and each switch's ports by port number, and to avoid re-querying what I had already queried I used the .unique() method to get each unique value once.
I set my index to be the IP and Port indices and accessed the port information like so:
def yield_df(df):
    for ip in df.index.get_level_values('ip').unique():
        for port in df.loc[ip].index.get_level_values('port').unique():
            yield df.loc[ip].loc[port]
Then I cycled the port data frames with a for loop like so:
for port_df in yield_df(adb_df):
I'm sure there are faster ways to do this in pandas, but I hope this helps you start solving your problem.
I have a tricky Spark problem that I just can't wrap my head around.
We have two RDDs (coming from Cassandra). RDD1 contains Actions and RDD2 contains Historic data. Both have an id on which they can be matched/joined. But the problem is that the two tables have an N:N relationship: Actions contains multiple rows with the same id, and so does Historic. Here is some example data from both tables.
Actions (time is actually a timestamp)
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic (set_at is actually a timestamp)
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
How can we join these two tables so that we get a result like this?
1 | 100 # 500 - 400 for Actions#1 with time 12:05 because Historic was in that time at 400
1 | 50 # 500 - 450 for Actions#2 with time 12:30 because H. was in that time at 450
2 | 50 # 125 - 75 for Actions#3 with time 12:30 because H. was in that time at 75
I can't come up with a good solution that feels right without making a lot of iterations over huge datasets. I keep coming back to building time ranges from the Historic set and then somehow checking whether each Action fits in a range, e.g. (11:00 - 12:15), to make the calculation. But that seems pretty slow to me. Is there a more efficient way to do that? It seems to me that this kind of problem should be common, but I couldn't find any hints on it yet. How would you solve this problem in Spark?
My current attempt so far (half-finished code):
case class Historic(id: String, set_at: Long, valueY: Int)
val historicRDD = sc.cassandraTable[Historic](...)
historicRDD
.map( row => ( row.id, row ) )
.reduceByKey(...)
// transforming to another case which results in something like this; code not finished yet
// (List((Range(0, 12:25), 400), (Range(12:25, NOW), 450)))
// From here we could join with Actions
// And then some .filter maybe to select the right Lists tuple
It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:
Given case classes Action(id, time, x) and Historic(id, time, y):
Join the actions with the history (this might be heavy)
Filter out all historic data not relevant for a given action
Key the results by (id, time) to differentiate the same id at different times
Reduce the history per action to the record with the latest time, leaving us with the relevant historical record for the given action
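The steps above can be sketched in plain Python to check the logic on the sample data before moving to Spark. This is a minimal illustration under the same assumptions as the list: the namedtuple definitions mirror the case classes, build_report is a hypothetical name, and times are expressed as seconds since midnight.

```python
from collections import defaultdict, namedtuple

Action = namedtuple("Action", "id time x")
Historic = namedtuple("Historic", "id time y")

def build_report(actions, historic):
    """For each action, subtract the y of the latest historic record set before it."""
    history_by_id = defaultdict(list)
    for h in historic:
        history_by_id[h.id].append(h)

    rows = []
    for a in actions:
        # Join on id, then keep only history that is older than the action
        candidates = [h for h in history_by_id[a.id] if h.time < a.time]
        if not candidates:
            continue  # no relevant history: the action is dropped, as in an inner join
        latest = max(candidates, key=lambda h: h.time)
        rows.append((a.id, a.time, a.x - latest.y))
    return rows

# Sample data from the question, with times converted to seconds since midnight
actions = [Action(1, 43500, 500), Action(1, 45000, 500), Action(2, 45000, 125)]
historic = [Historic(1, 39600, 400), Historic(1, 44100, 450),
            Historic(2, 44400, 50), Historic(2, 44700, 75)]
report = build_report(actions, historic)
```

Unlike a distributed join, this keeps all history for an id in memory at once, so it only serves to validate the expected output.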
In Spark:
val actionById = actions.keyBy(_.id)
val historyById = historic.keyBy(_.id)
val actionByHistory = actionById.join(historyById)
val filteredActionByidTime = actionByHistory.collect{ case (k,(action,historic)) if (action.time>historic.time) => ((action.id, action.time),(action,historic))}
val topHistoricByAction = filteredActionByidTime.reduceByKey{ case ((a1:Action,h1:Historic),(a2:Action, h2:Historic)) => (a1, if (h1.time>h2.time) h1 else h2)}
// we are done, let's produce a report now
val report = topHistoricByAction.map{case ((id,time),(action,historic)) => (id,time,action.x - historic.y)}
Using the data provided above, the report looks like:
report.collect
Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
(I transformed the time to seconds to have a simplistic timestamp)
After a few hours of thinking, trying and failing, I came up with this solution. I am not sure if it is any good, but due to the lack of other options, this is my solution.
First we expand our case class Historic
case class Historic(id: String, set_at: Long, valueY: Int) {
  // Scala does not seem to provide a map with similar operations, hence java.util.TreeMap; we'll need them a few lines later
  val set_at_map = new java.util.TreeMap[Long, Int]()
  set_at_map.put(0L, valueY) // Means from the beginning of the epoch ...
  set_at_map.put(set_at, valueY) // ... to the set_at date
  // This is the fun part. We can pass any timestamp to getHistoricValue and get back the value of the key whose range contains that date. For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
  def getHistoricValue(date: Long): Option[Int] = {
    var e = set_at_map.floorEntry(date)
    if (e != null && e.getValue == null) {
      e = set_at_map.lowerEntry(date)
    }
    // Wrap the result so the return type really is Option[Int]
    if (e == null) None else Some(e.getValue)
  }
}
The case class is ready and now we bring it into action
val historicRDD = sc.cassandraTable[Historic](...)
.map( row => ( row.id, row ) )
.reduceByKey( (row1, row2) => {
row1.set_at_map.put(row2.set_at, row2.valueY) // we add the historic Events up to each id
row1
})
// Now we load the Actions and map it by id as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
.map( row => ( row.id, row ) )
// Now both RDDs have the same key and we can join them
val fin = actionsRDD.join(historicRDD)
  .map( row => {
    ( row._1, // the id that both RDDs were keyed by
      (
        row._2._1.id,
        row._2._1.valueX - row._2._2.getHistoricValue(row._2._1.time).get // valueX minus the valueY for that timestamp
      )
    )
  })
I am totally new to Scala, so please let me know if this code could be improved anywhere.
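The floor lookup that getHistoricValue relies on (find the greatest key less than or equal to a given timestamp) can be sketched in Python with the standard bisect module. This is an illustration of the lookup only, not a port of the Spark job: HistoricValues is a hypothetical class, and unlike the Scala version it does not insert an entry at time 0, so lookups before the first set_at return None.

```python
from bisect import bisect_right

class HistoricValues:
    """Sorted timestamp -> value store with a floor lookup, similar in spirit to TreeMap.floorEntry."""

    def __init__(self):
        self._times = []   # kept sorted in ascending order
        self._values = []  # values aligned with self._times

    def put(self, set_at, value):
        i = bisect_right(self._times, set_at)
        if i > 0 and self._times[i - 1] == set_at:
            self._values[i - 1] = value  # overwrite an existing timestamp
        else:
            self._times.insert(i, set_at)
            self._values.insert(i, value)

    def floor_value(self, time):
        """Return the value with the greatest timestamp <= time, or None if there is none."""
        i = bisect_right(self._times, time)
        return self._values[i - 1] if i > 0 else None

# History for id 1 from the question, in seconds since midnight (11:00 and 12:15)
history = HistoricValues()
history.put(44100, 450)
history.put(39600, 400)
difference = 500 - history.floor_value(43500)  # the 12:05 action
```

Insertion order does not matter because put keeps the timestamps sorted, which is what makes the binary search valid.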
I know that this question has been answered, but I want to add another solution that worked for me.
Your data:
Actions
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
Union Actions and Historic
Combined
id | time | valueX | record-type
1 | 12:05 | 500 | Action
1 | 12:30 | 500 | Action
2 | 12:30 | 125 | Action
1 | 11:00 | 400 | Historic
1 | 12:15 | 450 | Historic
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.
Partition-1
1 | 11:00 | 400 | Historic
1 | 12:05 | 500 | Action
1 | 12:15 | 450 | Historic
1 | 12:30 | 500 | Action
Partition-2
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
2 | 12:30 | 125 | Action
Traverse through the records per partition.
If it is a Historical record, add it to a map, or update the map if it already has that id - keep track of the latest valueY per id using a map per partition.
If it is an Action record, get the valueY value from the map and subtract it from valueX.
A map M
Partition-1 traversal in order
M={ 1 -> 400} // A new entry in map M
1 | 100 // M(1) = 400; 500-400
M={1 -> 450} // update M, because key already exists
1 | 50 // M(1) = 450; 500-450
Partition-2 traversal in order
M={ 2 -> 50} // A new entry in M
M={ 2 -> 75} // update M, because key already exists
2 | 50 // M(2) = 75; 125-75
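The traversal above can be sketched in plain Python, with a single sort by (id, time) standing in for the custom partitioner and repartitionAndSortWithinPartitions. The function name is hypothetical, and one detail is an assumption: when a Historic record and an Action share the same id and time, the Historic record is processed first.

```python
HISTORIC, ACTION = 0, 1  # HISTORIC sorts first on equal (id, time); this tie rule is an assumption

def sorted_merge_report(actions, historic):
    """actions: (id, time, valueX) tuples; historic: (id, set_at, valueY) tuples."""
    combined = [(id_, t, HISTORIC, y) for id_, t, y in historic]
    combined += [(id_, t, ACTION, x) for id_, t, x in actions]
    combined.sort(key=lambda r: (r[0], r[1], r[2]))

    latest = {}  # the map M: id -> most recent valueY seen so far
    out = []
    for id_, _time, kind, value in combined:
        if kind == HISTORIC:
            latest[id_] = value
        elif id_ in latest:
            out.append((id_, value - latest[id_]))
        # An action with no earlier history is skipped
    return out

# Zero-padded HH:MM strings sort correctly as text
actions = [(1, "12:05", 500), (1, "12:30", 500), (2, "12:30", 125)]
historic = [(1, "11:00", 400), (1, "12:15", 450), (2, "12:20", 50), (2, "12:25", 75)]
merged = sorted_merge_report(actions, historic)
```

Each record is visited once after the sort, which is the property that makes this approach attractive compared with a many-to-many join.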
You could instead try to partition and sort by time, but then you need to merge the partitions later, which could add some complexity.
I found this preferable to the many-to-many join that we usually get when using time ranges to join.