Spark replicating rows with values of a column from different dataset - apache-spark

I am trying to replicate the rows of a dataset multiple times, each with a different value for one column, in Apache Spark. Let's say I have a dataset as follows:
Dataset A
| num | group |
| --- | ----- |
| 1 | 2 |
| 3 | 5 |
Another dataset has different columns:
Dataset B
| id |
| -- |
| 1 |
| 4 |
I would like to replicate the rows of Dataset A once for each column value of Dataset B; you could call it a join without any join condition. The resulting dataset should look like this:
| id | num | group |
| -- | --- | ----- |
| 1 | 1 | 2 |
| 1 | 3 | 5 |
| 4 | 1 | 2 |
| 4 | 3 | 5 |
Can anyone suggest how this can be achieved? As I understand it, a join requires a condition and matching columns between the two datasets.

What you want is a Cartesian product, and df1.crossJoin(df2) will achieve it. But be careful with it, because it is a very heavy operation: the result has one row for every pair of rows from the two inputs.
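For reference, a minimal self-contained sketch in Spark/Scala that reproduces the example above (the SparkSession setup and the dsA/dsB names are illustrative, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cross-join-example").getOrCreate()
import spark.implicits._

// Dataset A and Dataset B from the question
val dsA = Seq((1, 2), (3, 5)).toDF("num", "group")
val dsB = Seq(1, 4).toDF("id")

// Cartesian product: every row of dsB is paired with every row of dsA,
// so the result has dsB.count * dsA.count rows
val result = dsB.crossJoin(dsA)
result.show()
// +---+---+-----+
// | id|num|group|
// +---+---+-----+
// |  1|  1|    2|
// |  1|  3|    5|
// |  4|  1|    2|
// |  4|  3|    5|
// +---+---+-----+

crossJoin is available from Spark 2.1 onward; the explicit call does not require the spark.sql.crossJoin.enabled flag that some versions demand for implicit Cartesian products.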

Related

Excel - COUNTIF(s) & SUMPRODUCT

I'm trying to count clusters of values in one column but only if a value in another column is above a certain value.
I started with the below code to count how many unique clusters were in a column.
=SUMPRODUCT(1/COUNTIF(B1:B10,B1:B10))
| A | B |
| -------- | -------------- |
| 50 | 1 |
| 200 | 1 |
| 190 | 2 |
| 10 | 5 |
| 100 | 1 |
| 70 | 5 |
| 130 | 2 |
| 10 | 5 |
This returns a value of 3, as there are 3 unique clusters (1, 2, 5).
However, I want to add a dependency based on column A: only count clusters in B where A>100. As there are no values of 5 in column B where A>100, the cluster count in B would be 2.
Any help to achieve the above would be very much appreciated!!
For any version:
=COUNT(1/(FREQUENCY(IF(A1:A8>100,B1:B8),B1:B8)))
array-entered (with Ctrl+Shift+Enter) on non-365 versions. If you have 365, use JvdV's answer. :)
With Microsoft 365:
=COUNT(UNIQUE(FILTER(B1:B8,A1:A8>100,"")))
With older versions:
=SUMPRODUCT((A1:A8>100)*IFERROR(1/COUNTIFS(A1:A8,">100",B1:B8,B1:B8),0))

How to divide two cells based on match?

In table 1, I have:
+---+---+----+
| | A | B |
+---+---+----+
| 1 | A | 30 |
| 2 | B | 20 |
| 3 | C | 15 |
+---+---+----+
In table 2, I have:
+---+---+---+----+
| | A | B | C |
+---+---+---+----+
| 1 | A | 2 | 15 |
| 2 | A | 5 | 6 |
| 3 | B | 4 | 5 |
+---+---+---+----+
I want the number in the second column of table 2 to divide the number from table 1, matched on the letter in the first column, with the result going in the third column. The numbers shown in the third column above are the expected results. What formula must I apply in the third column of table 2?
Please help me on this. Thanks in advance.
You can use a VLOOKUP() formula to fetch the dividend (assuming table 1 is on Sheet1 and table 2 is on Sheet2, where we are entering this formula):
=VLOOKUP(A1,Sheet1!A:B,2,FALSE)/Sheet2!B1
Since you mention tables, here it is with structured references (though it seems you are not actually applying those here):
=VLOOKUP([@Column1],Table1[#All],2,0)/[@Column2]

Divide two "Calculated Values" within Spotfire Graphical Table

I have a Spotfire question: is it possible to divide two "calculated value" columns in a "graphical table"?
I have a Count([Type]) calculated value. I then limit the data within a second calculated value to arrive at a different Count([Type]).
I would like to divide the two in a third calculated value column.
i.e.:
Calculated value column 1:
Count([Type]) = 100 (NOT LIMITED)
Calculated value column 2:
Count([Type]) = 50 (Limited to [Type]="Good")
Now I would like to say 50/100 = 0.5 in the third calculated value column.
If it is possible to do this all within one calculated value column, that is even better. Graphical tables do not let you have if statements in the custom expression; the only way is to limit data. So I am struggling; any help is appreciated.
Graphical tables do allow IF() in custom expressions. To accomplish this, you are going to have to move your logic away from Limit Data Using Expressions and into your expressions directly. Here are the three axis expressions you need:
Count([Type])
Count(If([Type]="Good",[Type]))
Count(If([Type]="Good",[Type])) / Count([Type])
Data Set
+----+------+
| ID | Type |
+----+------+
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
+----+------+
Results (shown as a screenshot in the original answer)

Getting Summation from a Table, with matching values from another Table in Excel

I have 2 tables created in Excel, identical in structure and in their column and row names.
The only difference is that the first table contains data (effort in work days), while the second is a reference table stating which milestone each cell belongs to. A sample of these tables:
TBL1:
| | App1 | App2 | App3 |
| -- | ---- | ---- | ---- |
| T1 | 32 | 12 | 48 |
| T2 | 40 | 16 | 30 |
| T3 | 56 | 18 | 36 |
TBL2:
| | App1 | App2 | App3 |
| -- | ---- | ---- | ---- |
| T1 | 1 | 2 | 3 |
| T2 | 2 | 1 | 2 |
| T3 | 1 | 1 | 1 |
I want to collate these values so that I get the sums for milestones 1, 2, and 3:
| | Days Summation |
| -- | --------------- |
| 1 | =32+56+16+18+36 |
| 2 | =40+12+30 |
| 3 | =48 |
So basically, I want to find:
IF(CELL_VALUE_IN_TBL2 = 1) THEN SUM ALL VALUES IN TBL1 AT THE CORRESPONDING ROW-COLUMN POSITIONS
Is it possible to get a formula which I can use to do this without using something like a Pivot Table?
You can use SUMIF() to do this. It simply looks at the TBL2 values, compares them to your 1, 2, or 3, and then sums the corresponding cells from TBL1.
SUMIF will do the trick, if I understand correctly:
Put 1 in A1, 2 in A2, etc. Then enter =SUMIF(TBL2Range,A1,TBL1Range) in B1 and copy down, where TBL2Range and TBL1Range are the addresses of your two tables.

Performance: Group by a subset of previous grouping columns

I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps: The first one needs the data to be grouped by both categorical columns. In the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then the next step consists of grouping only by Cat A to calculate a new aggregation, for example:
+-----+-----+
| Cat | Cnt |
+-----+-----+
| A | 3 |
| B | 2 |
+-----+-----+
Now come the questions:
In my solution, I create the intermediate DataFrame by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume this is more efficient than aggregating the first DataFrame again. Is that a good assumption, or does it make no difference?
Are there any suggestions for a more efficient way to achieve the same results?
Generally speaking, it looks like a good approach and should be more efficient than aggregating the data twice. Since shuffle files are implicitly cached, at least part of the work should be performed only once: when you call an action on df2 and subsequently on df3, you should see that the stages corresponding to df2 have been skipped. Also, the partial structure enforced by the first shuffle may reduce the memory requirements for the aggregation buffer during the second agg.
Unfortunately, DataFrame aggregations, unlike RDD aggregations, cannot use a custom partitioner. This means you cannot compute both DataFrames using a single shuffle partitioned on the value of catA, so the second aggregation will require a separate exchange (hash partitioning). I doubt that alone justifies switching to RDDs.
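To make the two steps concrete, here is a minimal runnable sketch in Scala. Since the question elides the actual agg(...) calls, count and sum are used as stand-in aggregations that reproduce the example tables:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum}

val spark = SparkSession.builder().appName("two-step-agg").getOrCreate()
import spark.implicits._

// Sample data from the question
val df = Seq(
  (1, "A", "B"), (2, "B", "C"), (5, "A", "B"), (7, "B", "C"), (8, "A", "C")
).toDF("id", "catA", "catB")

// Step 1: one shuffle, keyed on (catA, catB)
val df2 = df.groupBy("catA", "catB").agg(count("*").as("cnt"))

// Step 2: reuses df2's shuffle output, but needs its own exchange keyed on catA
val df3 = df2.groupBy("catA").agg(sum("cnt").as("cnt"))

df3.show() // catA = A -> 3, catA = B -> 2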
