Spark: join multiple times on the same dataset - apache-spark

In order to enrich my stream data, I join it with a static dataset.
Actually, I join my input dataset twice with the same dataset to add information about the seller and the buyer.
input:
+-----------+------+-----+------+
|transaction|seller|buyer|amount|
+-----------+------+-----+------+
| 1 | A | D | 100 |
| 2 | B | A | 10 |
| 3 | C | A | 20 |
+-----------+------+-----+------+
static dataset:
+------+-------+
|person|address|
+------+-------+
| A | #A |
| B | #B |
| C | #C |
| D | #D |
+------+-------+
Code:
iputDF.join(staticDS, iputDF("seller") <=> staticDS("person"))
.join(staticDS, iputDF("buyer") <=> staticDS("person"))
output:
+-----------+------+-------+-----+------+------+
|transaction|seller|#seller|buyer|#buyer|amount|
+-----------+------+-------+-----+------+------+
| 1 | A | #A | D | #D | 100 |
| 2 | B | #B | A | #A | 10 |
| 3 | C | #C | A | #A | 20 |
+-----------+------+-------+-----+------+------+
Is there an optimal solution to do this?
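The question has no answer here, so the following is only a sketch of one way to think about it, not a Spark implementation. Because the static dataset is small, the double join behaves like applying the same person-to-address lookup twice, once per column. The plain-Python code below illustrates that idea; the names `addresses`, `transactions` and `enrich` are hypothetical. In Spark itself, joining the same DataFrame twice generally needs a distinct alias for each use (for example `staticDS.as("s")` and `staticDS.as("b")`) so that the two `person` columns can be told apart, and a small static side is a typical candidate for a broadcast join.

```python
# Static dataset as a lookup table: person -> address (data from the question).
addresses = {"A": "#A", "B": "#B", "C": "#C", "D": "#D"}

transactions = [
    {"transaction": 1, "seller": "A", "buyer": "D", "amount": 100},
    {"transaction": 2, "seller": "B", "buyer": "A", "amount": 10},
    {"transaction": 3, "seller": "C", "buyer": "A", "amount": 20},
]


def enrich(row, lookup):
    """Return a copy of `row` with the seller and buyer addresses attached."""
    enriched = dict(row)
    # Assumption: an unknown person yields None, which mimics a left join
    # rather than the inner join used in the question.
    enriched["#seller"] = lookup.get(row["seller"])
    enriched["#buyer"] = lookup.get(row["buyer"])
    return enriched


output = [enrich(row, addresses) for row in transactions]
for row in output:
    print(row)
```

The same lookup is reused for both columns, which is the property the two aliased joins would express in Spark.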

Related

Excel CUBESET function to get a 2-column set

I am trying to use the CUBESET function to get a set of 2 columns. The data table looks something like the one below:
TABLE
+--------+-------+-------+
| CLIENT | PRODA | PRODB |
+--------+-------+-------+
| 1 | A | X |
| 1 | A | Y |
| 1 | B | X |
| 2 | A | Y |
| 2 | B | X |
| 2 | C | Y |
+--------+-------+-------+
The code I am running returns only a one-column set:
=CUBESET("ThisWorkbookDataModel";"[TABLE].[CLIENT].&[1]*[TABLE].[PRODA].children";"result set")
This is the code I am trying to get working; I need it to return both related columns, PRODA and PRODB:
=CUBESET("ThisWorkbookDataModel";"[TABLE].[CLIENT].&[1]*[TABLE].[PRODA].[PRODB].children";"result set")
result set
+-------+-------+
| PRODA | PRODB |
+-------+-------+
| A | X |
| A | Y |
| B | X |
+-------+-------+
So what is the correct way to write the code to retrieve both related columns?
Appreciate any help

Determine range for one value in a column, use to run function over same range in another

Summary
I want to have a column in my spreadsheet that does 2 things.
1) In an ordered column, it will return the range where the column contains a specified value.
2) It will run a function (i.e., =SUM(), =AVERAGE(), etc.) over that same range in a different column.
Examples
Original
| NAME | VAL | FOO |
|-------|-----|-----|
| A | 3 | |
| A | 2 | |
| A | 4 | |
| A | 3 | |
| B | 2 | |
| B | 2 | |
| B | 1 | |
| C | 6 | |
| C | 5 | |
Average
I would want to get the average of VAL for each NAME. I would want the result to be:
| NAME | VAL | FOO |
|-------|-----|-----|
| A | 3 | 3 |
| A | 2 | 3 |
| A | 4 | 3 |
| A | 3 | 3 |
| B | 2 | 1.7 |
| B | 2 | 1.7 |
| B | 1 | 1.7 |
| C | 6 | 5.5 |
| C | 5 | 5.5 |
Sum
Another example would be to get the sum of VAL for each NAME.
| NAME | VAL | FOO |
|-------|-----|-----|
| A | 3 | 12 |
| A | 2 | 12 |
| A | 4 | 12 |
| A | 3 | 12 |
| B | 2 | 5 |
| B | 2 | 5 |
| B | 1 | 5 |
| C | 6 | 11 |
| C | 5 | 11 |
Having "NAME" ordered makes it easy. If "NAME" is in A1, enter this into C2 for the sum, then fill down:
=IF(A2=A3,C3,SUMIF($A$2:A2,A2,$B$2:B2))
Enter this into C2 for the average, then fill down:
=IF(A2=A3,C3,AVERAGEIF($A$2:A2,A2,$B$2:B2))
Note that the result in C2 won't be what you want until you fill down.
Update for MAXIF
If you don't have Excel 2016, you'll have to use an array formula (commit with ctrl+shift+enter):
=IF(A2=A3,C3,MAX(IF($A$2:A2=A2,$B$2:B2)))
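For comparison, the same per-group logic can be sketched outside Excel. The Python below is only an illustration of what the fill-down formulas compute: it groups VAL by NAME and writes the group result back onto every row. The helper name `per_group` is hypothetical, and the data is the table from the question.

```python
from collections import defaultdict

# (NAME, VAL) rows from the question, already ordered by NAME.
rows = [("A", 3), ("A", 2), ("A", 4), ("A", 3),
        ("B", 2), ("B", 2), ("B", 1),
        ("C", 6), ("C", 5)]


def per_group(rows, func):
    """Apply `func` to the VALs of each NAME and repeat the result on every row."""
    groups = defaultdict(list)
    for name, val in rows:
        groups[name].append(val)
    results = {name: func(vals) for name, vals in groups.items()}
    return [(name, val, results[name]) for name, val in rows]


sums = per_group(rows, sum)
averages = per_group(rows, lambda vals: sum(vals) / len(vals))
maxima = per_group(rows, max)

for name, val, foo in averages:
    print(name, val, round(foo, 1))
```

Unlike the Excel formulas, this version does not depend on NAME being sorted, because the grouping is done by key rather than by adjacent rows.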

Adding strings to numbers with different length

I want to build an object like the last table below by matching the two objects and putting the name next to each ID. The two objects have different lengths, so I tried setting names, but it didn't work.
Any suggestions?
First Object
+----+-------+--+
| ID | Test | |
+----+-------+--+
| 1 | C | |
| 1 | M | |
| 1 | C | |
| 1 | M | |
| 2 | C | |
| 2 | M | |
| 2 | C | |
| 2 | M | |
| 4 | C | |
| 4 | M | |
| 4 | C | |
| 4 | M | |
+----+-------+--+
Second Object
+-----------+-----+--+
| Names | ID | |
+-----------+-----+--+
| Pepsi | 1 | |
| Coke | 2 | |
| Acuarious | 3 | |
| Fanta | 4 | |
| Beer | 5 | |
| Fries | 6 | |
+-----------+-----+--+
+----+-------+--------+--+
| ID | Names | Test | |
+----+-------+--------+--+
| 1 | Pepsi | C | |
| 1 | Pepsi | M | |
| 1 | Pepsi | C | |
| 1 | Pepsi | M | |
| 2 | Coke | C | |
| 2 | Coke | M | |
| 2 | Coke | C | |
| 2 | Coke | M | |
| 4 | Fanta | C | |
| 4 | Fanta | M | |
| 4 | Fanta | C | |
| 4 | Fanta | M | |
+----+-------+--------+--+
I think I sorted it out.
a <- merge(firstobject,secondobject,by.x="ID",by.y="ID",all.x=T,all.y=T)
This creates a data frame matched by ID and, at the same time, puts NA in the rows that don't match.
To get rid of the NAs:
# ID is the merge key and is never NA; unmatched rows have NA in Test instead
a <- a[!is.na(a$Test),]
I hope this helps!
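As a cross-check of the intended result, here is a small Python sketch of the same matching step, written as an inner join so that IDs without a match never appear. It is not the R code from the answer; `first_object`, `names_by_id` and `merged` are hypothetical names, and the data is taken from the tables above.

```python
# First object: (ID, Test) rows.
first_object = [(1, "C"), (1, "M"), (1, "C"), (1, "M"),
                (2, "C"), (2, "M"), (2, "C"), (2, "M"),
                (4, "C"), (4, "M"), (4, "C"), (4, "M")]

# Second object as a lookup: ID -> Names.
names_by_id = {1: "Pepsi", 2: "Coke", 3: "Acuarious",
               4: "Fanta", 5: "Beer", 6: "Fries"}

# Inner join on ID: keep only rows whose ID exists in both objects.
merged = [(id_, names_by_id[id_], test)
          for id_, test in first_object
          if id_ in names_by_id]

for row in merged:
    print(row)
```

Names whose ID never occurs in the first object (3, 5 and 6 here) are simply left out, so no NA clean-up step is needed in this variant.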

Expand a data set using two columns

In Excel, I have two columns of data that I wish to combine.
Current set of data:
+---------+---------+
| column1 | column2 |
+---------+---------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| | 5 |
| | 6 |
| | 7 |
+---------+---------+
For each value in column1, I need to assign all of the values in column2 so it looks like this:
+---------+---------+
| column1 | column2 |
+---------+---------+
| a | 1 |
| a | 2 |
| a | 3 |
| a | 4 |
| a | 5 |
| a | 6 |
| a | 7 |
+---------+---------+
| b | 1 |
| b | 2 |
| b | 3 |
| b | 4 |
| b | 5 |
| b | 6 |
| b | 7 |
+---------+---------+
| c | 1 |
| c | 2 |
| c | 3 |
| c | 4 |
| c | 5 |
| c | 6 |
| c | 7 |
+---------+---------+
| d | 1 |
| d | 2 |
| d | 3 |
| d | 4 |
| d | 5 |
| d | 6 |
| d | 7 |
+---------+---------+
How can I do this?
Do I need to find a macro/VB solution?
Since this seems unlikely to receive any other answer:
in A1: a
in B1: =MOD(ROW()-1,7)+1
in A2: =IF(MOD(ROW()-1,7)>0,CHAR(CODE(A1)),CHAR(CODE(A1)+1))
Copy both formulae down to suit.
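What the two formulas build is the Cartesian product of column1 and column2. As a hedged illustration outside Excel, the same expansion can be written in Python with `itertools.product`; the variable names below are hypothetical and the values come from the question.

```python
from itertools import product

column1 = ["a", "b", "c", "d"]
column2 = [1, 2, 3, 4, 5, 6, 7]

# Every value of column1 paired with every value of column2,
# in the same order as the desired output (a,1 ... a,7, b,1 ...).
expanded = list(product(column1, column2))

for letter, number in expanded:
    print(letter, number)
```

Unlike the CHAR/CODE formula, this does not assume that column1 holds consecutive letters or that column2 has exactly 7 entries.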

Reducing Rows by Grouping Data

I have a set of spreadsheets which define a set of business rules. These business rules are then processed by our system.
The users who create the spreadsheets do so naively, and I have found that factoring the data across rows, and thus reducing the number of rules, greatly improves the performance of the system.
One of the "naively" structured spreadsheets might look like this:
+-----------+-------------+-------------+-------------+-------------+--------+
| Rule Name | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4 | Accept |
+-----------+-------------+-------------+-------------+-------------+--------+
| Rule 1    | A           | B           | C           |             | Yes    |
| Rule 2    | A           | C           | C           |             | Yes    |
| Rule 3    | A           | D           | C           |             | Yes    |
| Rule 4    | A           | E           | C           |             | Yes    |
| Rule 5    | A           | F           | C           |             | Yes    |
| Rule 6    | A           | B           | D           |             | Yes    |
| Rule 7    | A           | C           | D           |             | Yes    |
| Rule 8    | A           | D           | D           |             | Yes    |
| Rule 9    | A           | E           | D           |             | Yes    |
| Rule 10   | A           | F           | D           |             | Yes    |
| Rule 11   | A           | B           | E           |             | Yes    |
| Rule 12   | A           | C           | E           |             | Yes    |
| Rule 13   | A           | D           | E           |             | Yes    |
| Rule 14   | A           | E           | E           |             | Yes    |
| Rule 15   | A           | F           | E           |             | Yes    |
| Rule 16   |             |             |             | G           | Yes    |
| Rule 17   |             |             |             | H           | Yes    |
| Rule 18   |             |             |             | I           | Yes    |
| Rule 19   |             |             |             | J           | Yes    |
| Rule 20   |             |             |             | K           | Yes    |
| Rule 21   |             |             |             | L           | Yes    |
| Rule 22   |             |             |             | M           | Yes    |
| Rule 23   |             |             |             | N           | No     |
| Rule 24   |             |             |             | O           | No     |
| Rule 25   |             |             |             | P           | No     |
| Rule 26   |             |             |             | Q           | No     |
| Rule 27   |             |             |             | R           | No     |
| Rule 28   |             |             |             | S           | No     |
| Rule 29   | A           | J           | F           |             | No     |
| Rule 30   | A           | K           | F           |             | No     |
+-----------+-------------+-------------+-------------+-------------+--------+
As an example, Rule 1 would be evaluated as:
IF (Criterion 1 == A) AND (Criterion 2 == B) AND (Criterion 3 == C) THEN Accept
Using a bit of thought and assuming we can use OR conditionals in our columns, the above can be reduced to:
+-----------+-------------+-------------+-------------+---------------+--------+
| Rule Name | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4   | Accept |
+-----------+-------------+-------------+-------------+---------------+--------+
| Rule 1    | A           | B,C,D,E,F   | C,D,E       |               | Yes    |
| Rule 2    |             |             |             | G,H,I,J,K,L,M | Yes    |
| Rule 3    |             |             |             | N,O,P,Q,R,S   | No     |
| Rule 4    | A           | J,K         | F           |               | No     |
+-----------+-------------+-------------+-------------+---------------+--------+
Rule 1 is now evaluated as follows:
IF (Criterion 1 == A) AND
(Criterion 2 == B OR Criterion 2 == C OR...) AND
(Criterion 3 == C OR Criterion 3 == D OR...) THEN Accept
Now, I've done this manually. What I want to know is: does Excel have built-in functionality to do this kind of grouping for me? If not, can anyone point me in the direction of an algorithm which will help me implement this efficiently?
This looks like a situation where you could query the table using ADO and OLE DB into an ADO Recordset, using GROUP BY ... HAVING in the SQL query, then dump the grouped results into your new sheet using CopyFromRecordset.
Alternatively, perhaps a Pivot Table?
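On the algorithm part of the question, one simple approach is a greedy column-by-column merge: for each criterion in turn, group the rules that agree on everything else (including Accept) and union the values of that criterion. The Python below is a minimal sketch of that idea, not an Excel feature and not guaranteed to give the smallest possible rule set. The helpers `rule`, `merge_column` and `reduce_rules` are hypothetical names, and a blank cell is assumed to mean "any value".

```python
from collections import defaultdict


def rule(c1, c2, c3, c4, accept):
    """Build a rule; each criterion is a set of allowed values (empty = blank)."""
    cells = tuple(frozenset([v]) if v else frozenset() for v in (c1, c2, c3, c4))
    return (cells, accept)


def merge_column(rules, col):
    """Merge rules that agree on every field except criterion `col`."""
    groups = defaultdict(list)
    for cells, accept in rules:
        key = (cells[:col] + cells[col + 1:], accept)
        groups[key].append(cells[col])
    merged = []
    for (rest, accept), values in groups.items():
        # Assumption: a blank cell matches anything, so it absorbs the others.
        if any(not v for v in values):
            cell = frozenset()
        else:
            cell = frozenset().union(*values)
        merged.append((rest[:col] + (cell,) + rest[col:], accept))
    return merged


def reduce_rules(rules):
    """Greedy pass over the criteria, left to right."""
    for col in range(len(rules[0][0])):
        rules = merge_column(rules, col)
    return rules


# The 30 naive rules from the question.
naive = (
    [rule("A", c2, c3, "", "Yes") for c3 in "CDE" for c2 in "BCDEF"]
    + [rule("", "", "", c4, "Yes") for c4 in "GHIJKLM"]
    + [rule("", "", "", c4, "No") for c4 in "NOPQRS"]
    + [rule("A", c2, "F", "", "No") for c2 in "JK"]
)

reduced = reduce_rules(naive)
for cells, accept in reduced:
    print([",".join(sorted(c)) for c in cells], accept)
```

On the data above this collapses the 30 rows into the same 4 rules as the hand-made table. The result can depend on the order in which columns are merged, so trying several orders and keeping the shortest output may be worthwhile.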
