Spark Join Returning Null Values in Columns - apache-spark

I'm pulling my hair out trying to solve what I feel is an extremely simple problem, but I'm not sure if there's some Spark voodoo occurring as well.
I have two tables, which are both very small. Table A has about 90K rows and Table B has about 2K rows.
Table A
A B C D
===========
a1 b1 c1 d1
a1 b1 c2 d2
a1 b1 c3 d3
a2 b2 c1 d1
a2 b2 c2 d2
.
.
.
Table B
A B E F
===========
a1 b1 e1 f1
a2 b2 e2 f2
I want a table that looks like
Result Table
A B C D E F
=================
a1 b1 c1 d1 e1 f1
a1 b1 c2 d2 e1 f1
a2 b2 c1 d1 e2 f2
.
.
.
I was a little loose, but the idea is I want to join the table with fewer rows on the table with more rows and it's okay to have multiple associated values in the final table.
This should be really simple:
table_a.join(table_b, (table_a.a == table_b.a) & (table_a.b == table_b.b)).select(..stuff..)
HOWEVER, for almost all of the resulting values in the Result Table (which should have about 90K rows since Table A has about 90K rows), I get null values in columns E and F.
When I save the result of just Table B, I see all the columns and values.
When I save the result of just Table A, I see all the columns and values.
(i.e., I could do a pencil-and-paper join.)
The weird thing is that even though ~89K rows have null values in columns E and F in the Result Table, there are a few values that do randomly join.
Does anyone know what's going on or how I can diagnose this?

Have you tried <=> instead of == in your join?
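
To show what that suggestion changes, here is a minimal sketch in plain Python (not Spark) of the two comparison semantics, with None standing in for NULL. The rows are hypothetical, and whether NULL keys are actually the cause in the question is an assumption the thread does not confirm. In the PySpark DataFrame API, the counterpart of the SQL `<=>` operator is `Column.eqNullSafe`.

```python
# Plain-Python model of SQL comparison semantics; None stands in for NULL.
# This is an illustration only and does not call Spark.

def sql_eq(x, y):
    """Model of `=` / `==`: any comparison involving NULL yields NULL (None)."""
    if x is None or y is None:
        return None
    return x == y


def null_safe_eq(x, y):
    """Model of `<=>`: two NULLs compare as equal, and the result is never NULL."""
    if x is None or y is None:
        return x is None and y is None
    return x == y


def left_join(left, right, eq):
    """Left join on the first two fields; unmatched left rows get None for E and F."""
    out = []
    for a, b, c, d in left:
        # A pair of rows matches only when both key comparisons are exactly True.
        matches = [(e, f) for ra, rb, e, f in right
                   if eq(a, ra) is True and eq(b, rb) is True]
        if not matches:
            matches = [(None, None)]
        out.extend((a, b, c, d, e, f) for e, f in matches)
    return out


# Hypothetical rows: the second key has NULL in column B on both sides.
table_a = [("a1", "b1", "c1", "d1"), ("a2", None, "c1", "d1")]
table_b = [("a1", "b1", "e1", "f1"), ("a2", None, "e2", "f2")]

plain = left_join(table_a, table_b, sql_eq)
safe = left_join(table_a, table_b, null_safe_eq)
print(plain[1])  # E and F stay None because NULL == NULL is not true
print(safe[1])   # E and F are filled in from table_b
```

If the null-safe comparison does not change the result on the real data, the keys probably differ in some other way (for example type or whitespace), which would need a separate check.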

Related

return a list of all elements present in one column that is not present in the other

I have hundreds of items in columns a and b. I want to remove from column b the items that also appear in column a and list the remaining ones in a separate column, in Excel or Google Sheets.
a  b  items present in b column only
====================================
a1 a1 a5
a2 a2 a6
a3 a5
a4 a6
Excel:
Formula in C2 (COUNTIF looks up each item of column b in column a):
=FILTER(B2:B5,COUNTIF(A2:A5,B2:B5)=0)
Google-Sheets:
Almost the same, but less explicit: =FILTER(B2:B,COUNTIF(A2:A,B2:B)=0)
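
The idea behind the formulas (keep each item of column b that does not occur in column a) can be sketched outside the spreadsheet as well. This is a minimal Python illustration using the sample values from the question; the function name `only_in_b` is made up for the example.

```python
def only_in_b(col_a, col_b):
    """Return the items of col_b that do not appear anywhere in col_a, in order."""
    seen_in_a = set(col_a)  # set membership plays the role of COUNTIF(...) = 0
    return [item for item in col_b if item not in seen_in_a]


col_a = ["a1", "a2", "a3", "a4"]
col_b = ["a1", "a2", "a5", "a6"]

result = only_in_b(col_a, col_b)
print(result)
```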

Conditional transpose of rows to columns

I have the following table with around 500 rows that I need to transpose into columns:
A B
A1 B1
A2 B2
A3 B3
The result I'm trying to get is
A B C D E F
A1 B1 A2 B2 A3 B3
Because the result is the interleaving of two columns, I don't think this is a duplicate of the question indicated (at one time). Assuming A1 is in cell A2 (i.e. A and B are column labels), I suggest the following in C2, copied across to suit:
=IF(ISODD(COLUMN()),OFFSET($A1,COLUMN()/2,0),OFFSET($A1,(COLUMN()/2)-1,1))
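
For comparison, the same interleaving can be sketched in Python. This is only an illustration of the target layout, not a translation of the formula; `interleave_columns` is a hypothetical helper name.

```python
from itertools import chain


def interleave_columns(col_a, col_b):
    """Flatten two equal-length columns into one row: a1, b1, a2, b2, ..."""
    return list(chain.from_iterable(zip(col_a, col_b)))


row = interleave_columns(["A1", "A2", "A3"], ["B1", "B2", "B3"])
print(row)
```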

Excel VBA to transpose multiple columns to single column multiple rows

I'm a beginner in VBA. Even after trying to find a solution and searching through all the forums, I was not able to find the correct way of dealing with this. Here's my problem:
Data (in sheet1)
a a1 a2 a3 a4
b b1 null b3 b4 b5
c c1 c2 c3
....
Required output (in sheet2)
a a1
a a2
a a3
a a4
b b1
b null
b b3
b b4
b b5
c c1
c c2
c c3
....
Thanks in advance.
Add column labels (they may be deleted later), apply the process described here, and filter ColumnC to delete the rows that are blank in that column.
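
Independently of the spreadsheet process above, the required reshaping can be sketched in a few lines of Python. This is a different technique, shown only to make the target layout concrete; the function name `unpivot` and the in-memory lists are assumptions. The literal value "null" from the sample data is kept as an ordinary value, and only empty cells are dropped.

```python
def unpivot(rows):
    """Turn ragged rows [key, v1, v2, ...] into a flat list of (key, value) pairs."""
    pairs = []
    for row in rows:
        if not row:
            continue  # skip completely empty rows
        key, *values = row
        # Keep every non-empty cell, including the literal text "null".
        pairs.extend((key, value) for value in values if value != "")
    return pairs


sheet1 = [
    ["a", "a1", "a2", "a3", "a4"],
    ["b", "b1", "null", "b3", "b4", "b5"],
    ["c", "c1", "c2", "c3"],
]

sheet2 = unpivot(sheet1)
for key, value in sheet2:
    print(key, value)
```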

Multiple Column Duplicates : Sum

I have data like:
A1 B1 C1 v1
A1 B1 C1 v2
A1 B1 C2 v3
A1 B2 C1 v4
A2 B3 C2 v5 ....
I would like to sum the values for all duplicate tuples (A, B, C), but only if all three values are the same, that is, Ai = Aj, Bi = Bj and Ci = Cj.
I would like the result to be in format:
A1 B1 C1 [sum of relevant vs]
...
I know about SUMIF and Pivot function, but so far couldn't get them to work as required.
Any help will be appreciated.
PS: A previous search on Stack Overflow revealed solutions for duplicates in a single column only. If I missed anything in my search, I am sorry and would appreciate a link to the relevant thread.
A pivot table is the most appropriate solution to me. Put all three columns A, B and C under row labels and put the 4th column under Values. It should automatically sum the values in the 4th column.
After that, pick Tabular and Repeat items and then Do Not Show Subtotals under PivotTable Design.
You will then get one row per (A, B, C) combination with the sum of its values.
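
The grouping the pivot table performs can be sketched in Python as a sum keyed on the full (A, B, C) tuple. The numeric values below are hypothetical stand-ins for v1 to v5, and `sum_by_key` is a made-up name.

```python
from collections import defaultdict


def sum_by_key(rows):
    """Sum the 4th field over rows whose first three fields are all identical."""
    totals = defaultdict(float)
    for a, b, c, value in rows:
        totals[(a, b, c)] += value  # rows merge only when A, B and C all match
    return dict(totals)


rows = [
    ("A1", "B1", "C1", 1.0),  # hypothetical v1
    ("A1", "B1", "C1", 2.0),  # hypothetical v2
    ("A1", "B1", "C2", 3.0),  # hypothetical v3
    ("A1", "B2", "C1", 4.0),  # hypothetical v4
    ("A2", "B3", "C2", 5.0),  # hypothetical v5
]

totals = sum_by_key(rows)
for (a, b, c), total in totals.items():
    print(a, b, c, total)
```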

Get all combinations of the first column's value and other column's value

I have an irregular table in Excel:
A A1 A2 A3
B B1
C C1 C2 C3 C4
...
How can I get the following representation of it?
A A1
A A2
A A3
B B1
C C1
C C2
C C3
C C4
...
This answer on Super User to "Transform horizontal table layout to vertical table" using VBA appears to give exactly what you are looking for.
The code is self explanatory by virtue of working step by step.
Hope it helps.
