How to check if a value in a column is found in a list in a column, with Spark SQL? - apache-spark

I have a delta table A as shown below.
point  cluster  points_in_cluster
37     1        [37,32]
45     2        [45,67,84]
67     2        [45,67,84]
84     2        [45,67,84]
32     1        [37,32]
Also I have a table B as shown below.
id   point
101  37
102  67
103  84
I want a query like the following. The IN operator obviously doesn't work against a list column, so what would be the right syntax?
select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster
As a result I should have a table like the following
id   point
101  37
101  32
102  45
102  67
102  84
103  45
103  67
103  84

Based on your data sample, I'd do an equi-join on the point column and then an explode on points_in_cluster:
from pyspark.sql import functions as F

# assuming A is df_A and B is df_B
df_A.join(
    df_B,
    on="point"
).select(
    "id",
    F.explode("points_in_cluster").alias("point")
)
Otherwise, you can use array_contains:
select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)
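To see why the array_contains condition yields the expected rows, the same membership join can be mimicked in plain Python on the sample data. This is only a sketch of the join semantics, not Spark code; the list-of-dicts layout and the names table_a, table_b and result are assumptions made for illustration.

```python
# Sample rows from tables A and B in the question (the dict layout is illustrative).
table_a = [
    {"point": 37, "cluster": 1, "points_in_cluster": [37, 32]},
    {"point": 45, "cluster": 2, "points_in_cluster": [45, 67, 84]},
    {"point": 67, "cluster": 2, "points_in_cluster": [45, 67, 84]},
    {"point": 84, "cluster": 2, "points_in_cluster": [45, 67, 84]},
    {"point": 32, "cluster": 1, "points_in_cluster": [37, 32]},
]
table_b = [
    {"id": 101, "point": 37},
    {"id": 102, "point": 67},
    {"id": 103, "point": 84},
]

# Cross join filtered by list membership: the plain-Python counterpart of
# "where array_contains(a.points_in_cluster, b.point)".
result = sorted(
    (b["id"], a["point"])
    for a in table_a
    for b in table_b
    if b["point"] in a["points_in_cluster"]
)

for row in result:
    print(row)
```

Each id in B ends up paired with every point of the cluster that contains its own point, which is the table the question asks for (row order aside).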

Related

Find duplicated Rows based on selected columns condition in a pandas Dataframe

I have a large database converted into a dataframe, in which the following is difficult to identify manually.
The dataframe has columns named from_bus and to_bus, which together identify an element regardless of their order. For example, for element 0 (L_ABAN_MACA_0_1), the associated pair (109,140) is the same as (140,109).
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1  109       140     0.444450
1  L_AGOY_BAÑO_1_1  69        66      0.476683
2  L_AGOY_BAÑO_1_2  69        66      0.476683
3  L_ALAN_INGA_1_1  189       188     0.452790
4  L_ALAN_INGA_1_2  188       189     0.500450
So I want to identify the duplicated pairs (ignoring order) and replace each group with a single row whose x_ohm_per_km value is the sum of the duplicated values, as follows:
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1  109       140     0.444450
1  L_AGOY_BAÑO_1_1  69        66      0.953366
3  L_ALAN_INGA_1_1  189       188     0.953240
Let us try a groupby on from_bus and to_bus after sorting the values of these two columns along axis=1, then agg to aggregate the result and, optionally, reindex to restore the original column order:
import numpy as np

c = ['from_bus', 'to_bus']
# sort each pair so that (140, 109) becomes (109, 140); this overwrites both columns
df[c] = np.sort(df[c], axis=1)
df.groupby(c, sort=False, as_index=False)\
    .agg({'name': 'first', 'x_ohm_per_km': 'sum'})\
    .reindex(df.columns, axis=1)
Alternative approach:
d = {**dict.fromkeys(df, 'first'), 'x_ohm_per_km': 'sum'}
df.groupby([*np.sort(df[c], axis=1).T], sort=False, as_index=False).agg(d)
name from_bus to_bus x_ohm_per_km
0 L_ABAN_MACA_0_1 109 140 0.444450
1 L_AGOY_BAÑO_1_1 66 69 0.953366
2 L_ALAN_INGA_1_1 188 189 0.953240
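The idea behind both variants (normalise each pair so its order no longer matters, then aggregate per pair) can also be sketched without pandas. The following is a plain-Python illustration on the sample rows, not the answerer's code; the names rows, merged and result are assumptions. Unlike the first pandas variant, it keeps the first row's original from_bus/to_bus orientation, as in the asker's desired output.

```python
# Sample rows from the question: (name, from_bus, to_bus, x_ohm_per_km).
rows = [
    ("L_ABAN_MACA_0_1", 109, 140, 0.444450),
    ("L_AGOY_BAÑO_1_1", 69, 66, 0.476683),
    ("L_AGOY_BAÑO_1_2", 69, 66, 0.476683),
    ("L_ALAN_INGA_1_1", 189, 188, 0.452790),
    ("L_ALAN_INGA_1_2", 188, 189, 0.500450),
]

# Dicts keep insertion order, so groups appear in first-seen order
# (comparable to groupby(..., sort=False)).
merged = {}
for name, from_bus, to_bus, x in rows:
    # Sorting the pair makes (189, 188) and (188, 189) share one key.
    key = tuple(sorted((from_bus, to_bus)))
    if key in merged:
        merged[key]["x_ohm_per_km"] += x  # sum over duplicates
    else:
        # Keep the first row's name and orientation.
        merged[key] = {
            "name": name,
            "from_bus": from_bus,
            "to_bus": to_bus,
            "x_ohm_per_km": x,
        }

result = list(merged.values())
for row in result:
    print(row)
```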

Can I apply Groupby on Pandas dataframe and calculate mean across all the columns?

Suppose I have a dataframe that looks like this:
Category Col1 Col2 Col3 Col4 Col5
Footwear 35 55 67 87 94
Apparels 56 65 54 84 77
Footwear 87 85 56 95 35
Handbags 83 62 724 51 62
Handbags 61 512 21 58 78
Apparels 50 62 172 77 5
Now, I want to find the mean and standard deviation for the unique categories, but not for the different columns separately, rather one mean and one std for each category. So I want an output like this:
Category mean stdev
Footwear xxx aaa
Apparels yyy bbb
Handbags zzz ccc
I cannot just calculate the mean and std first across the columns using mean function with axis=1 and then Groupby for the categories. It would yield incorrect results.
So my dilemma is that I want to perform a groupby, while aggregating across rows and columns at the same time.
I have a feeling that a user-defined function could do it, applying it through lambda aggregation along with Groupby. But I couldn't do it. Am I even on the right track? Thanks!
If I understand you correctly, let's try using melt and groupby with agg:
import pandas as pd

df1 = (pd.melt(df, id_vars='Category')
         .groupby('Category')
         .agg(mean=('value', 'mean'), std=('value', 'std')))
print(df1)
mean std
Category
Apparels 70.2 41.983595
Footwear 69.6 23.291391
Handbags 171.2 241.295946
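What melt followed by groupby computes here is a single pooled mean and standard deviation over all of a category's cells, across rows and columns at once. As a cross-check, the same numbers can be reproduced with the standard library alone; this is a sketch on the sample data, and the names rows, pooled and stats are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

# Sample data from the question: category followed by Col1..Col5.
rows = [
    ("Footwear", 35, 55, 67, 87, 94),
    ("Apparels", 56, 65, 54, 84, 77),
    ("Footwear", 87, 85, 56, 95, 35),
    ("Handbags", 83, 62, 724, 51, 62),
    ("Handbags", 61, 512, 21, 58, 78),
    ("Apparels", 50, 62, 172, 77, 5),
]

# The "melt" step: pool every cell value under its category.
pooled = defaultdict(list)
for category, *values in rows:
    pooled[category].extend(values)

# One mean and one sample standard deviation per category.
# statistics.stdev divides by n - 1, like pandas' default std (ddof=1).
stats = {cat: (mean(vals), stdev(vals)) for cat, vals in pooled.items()}

for cat, (m, s) in sorted(stats.items()):
    print(f"{cat:<10}{m:>8.1f}{s:>12.6f}")
```

Averaging per-row means first would only agree with the pooled mean because every row here has the same number of columns; the standard deviation would differ, which is why the values are pooled before aggregating.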

Transpose data based on two columns and criteria: Excel

I have data that looks like this:
Group: Class: Value:
A 1 51
A 2 60
B 1 55
B 2 67
B 3 70
C 1 53
C 3 65
Need the data to look like this:
Group: 1: 2: 3:
A 51 60 0
B 55 67 70
C 53 0 65
The formula I am trying does two things wrong: 1. it skips rows, and 2. it does not match the value to the Class column, which causes an issue for Group C, since it puts the 65 value into class 2 and not into class 3 for the final row (row 3 in this example).
=IFERROR(IF(AND($B2=1,COLUMN()<3+MATCH($B2,$B3:$B11000,0)),OFFSET(B2,COLUMN()-3,2-COLUMN()),""),"")
This has worked for me. Hope it helps!
=SUMPRODUCT(($A$2:$A$8=$E2)*($B$2:$B$8=COLUMN(A:A))*$C$2:$C$8)
If the C column format is text, let's try:
=IFERROR(OFFSET($C$2,MATCH(1,INDEX(($A$2:$A$8=$E2)*($B$2:$B$8=COLUMN(A$1)),),0)-1,),"")
I think it's more efficient to use a PivotTable for such tasks.
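For readers who want to see the logic outside Excel: the SUMPRODUCT formula above fills each output cell with the sum of the values whose group and class both match, so a missing combination becomes 0. A minimal Python sketch of that conditional sum on the sample data follows; the names data, groups, classes and pivot are assumptions, and this is not a replacement for the formulas or the PivotTable.

```python
# Sample data from the question: (group, class, value).
data = [
    ("A", 1, 51),
    ("A", 2, 60),
    ("B", 1, 55),
    ("B", 2, 67),
    ("B", 3, 70),
    ("C", 1, 53),
    ("C", 3, 65),
]

groups = sorted({g for g, _, _ in data})   # output rows
classes = sorted({c for _, c, _ in data})  # output columns

# Each cell is the sum of values where group AND class both match,
# the same condition as ($A=group)*($B=class)*$C inside SUMPRODUCT.
pivot = {
    g: [sum(v for gg, cc, v in data if gg == g and cc == c) for c in classes]
    for g in groups
}

print("Group", *classes)
for g in groups:
    print(g, *pivot[g])
```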

Using Strings in dot notation Matlab

I wish to access data within a table with dot notation that includes strings. I have a list of strings representing columns of interest within the table. How can I access data using those strings? I wish to create a loop going through the list of strings.
For example for table T I have columns {a b c d e}. I have a 1x3 cell cols={b d e}.
Can I retrieve data using cols in the format (or equivalent) T.cols(1) to give me the same result as T.b?
You can fetch directly the data using {curly braces} and strings as column indices as you would for a cell array.
For example, let's create a dummy table (modified from the docs):
clc;
close all;
LastName = {'Smith';'Johnson';'Williams';'Jones';'Brown'};
a = [38;43;38;40;49];
b = [71;69;64;67;64];
c = [176;163;131;133;119];
d = [124 93; 109 77; 125 83; 117 75; 122 80];
T = table(a,b,c,d,...
    'RowNames',LastName)
The table looks like this:
T =
                a     b      c         d
                __    __    ___    __________
    Smith       38    71    176    124     93
    Johnson     43    69    163    109     77
    Williams    38    64    131    125     83
    Jones       40    67    133    117     75
    Brown       49    64    119    122     80
Now select columns of interest and get the data:
%// Select columns of interest
cols = {'a' 'c'};
%// Fetch data
T{:,cols}
ans =
38 176
43 163
38 131
40 133
49 119
Yay!

Combining data tables

I have two data tables, similar to the ones below:
table1
index value
a 6352
a 67
a 43
b 7765
b 53
c 243
c 7
c 543
table2
index value
a 425
a 6
b 532
b 125
b 89
b 664
c 314
I would like to combine the data into one table, as in the table below, using the index values. The order is important: the first batch of values under each index in the combined table must come from table1.
index value
a 6352
a 67
a 43
a 425
a 6
b 7765
b 53
b 532
b 125
b 89
b 664
c 243
c 7
c 543
c 314
I tried to do it using VBA, but I'm sadly a complete novice. Does anyone have any pointers on how to approach writing the code?
Copy the values of the second table (without the headers) under the values of the first table, select the two resultant columns and sort them by index.
Hope it works!
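The append-then-sort approach keeps table1's values ahead of table2's within each index only if the sort is stable, that is, if rows with equal keys keep their original relative order. The sketch below illustrates the idea in Python, whose built-in sort is guaranteed stable; it is not VBA, and the names table1, table2 and combined are assumptions. Whether a particular spreadsheet sort preserves the order of ties is worth verifying on a small sample.

```python
# Sample data from the question: (index, value).
table1 = [
    ("a", 6352), ("a", 67), ("a", 43),
    ("b", 7765), ("b", 53),
    ("c", 243), ("c", 7), ("c", 543),
]
table2 = [
    ("a", 425), ("a", 6),
    ("b", 532), ("b", 125), ("b", 89), ("b", 664),
    ("c", 314),
]

# Append table2 under table1, then sort by index only.
# sorted() is stable, so within each index the rows stay in their
# original order: table1's rows first, then table2's.
combined = sorted(table1 + table2, key=lambda row: row[0])

for index, value in combined:
    print(index, value)
```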
