How to create an ID that links rows based on multiple fields - apache-spark

I need to create a GROUP_ID based on the information in two other fields. All rows that share the same ID_1 value must get the same GROUP_ID, and likewise all rows that share the same ID_2 value must get the same GROUP_ID. The GROUP_ID values need not be contiguous.
ID_1 ID_2 GROUP_ID
X1 10 1
X1 20 1
Y1 30 2
Y2 30 2
A1 100 3
A1 200 3
B1 200 3
B1 200 3
B1 300 3
B1 300 3
C1 300 3
C1 400 3
I am using PySpark and tried to solve this in Spark SQL with window functions (see below), but I was unable to produce the desired output. Please let me know if there is an efficient way to solve this. My dataset has more than 100M rows.
RowNum ID_1 ID_2 ID_1_1 ID_2_1 GROUP_ID
1 X1 10 1 1 1
2 X1 20 1 1 1
3 Y1 30 3 3 3
4 Y2 30 4 3 3
5 A1 100 5 5 5
6 A1 200 5 5 5
7 B1 200 7 5 5
8 B1 200 7 5 5
9 B1 300 7 7 5
10 B1 300 7 7 5
11 C1 300 11 7 7
12 C1 400 11 11 7
Where
ID_1_1 = First(ROWNUM) over (Partition by ID_1 order by RowNum)
ID_2_1 = First(ID_1_1) over (Partition by ID_2 order by ID_1_1)
Group_ID = First(ID_2_1) over (Partition by ID_1_1 order by ID_2_1)
With the above approach, rows 11 and 12 get a GROUP_ID of 7 instead of 5.
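The window-function attempt propagates an ID only a fixed number of hops, while the requirement describes connected components: ID_1 and ID_2 values are nodes, each row is an edge, and every connected set of nodes is one group. The sketch below illustrates that idea with a small in-memory union-find in plain Python. It is not a Spark solution, and the function name `assign_group_ids` is hypothetical. At 100M rows the same idea would need a distributed connected-components implementation (for example the one provided by the GraphFrames package).

```python
def assign_group_ids(rows):
    """rows: list of (id_1, id_2) pairs. Returns one group id per row."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each row links its ID_1 value to its ID_2 value.
    # Tagging the values keeps the two ID spaces from colliding.
    for id_1, id_2 in rows:
        union(("ID_1", id_1), ("ID_2", id_2))

    # Number the components in order of first appearance.
    labels = {}
    group_ids = []
    for id_1, _ in rows:
        root = find(("ID_1", id_1))
        group_ids.append(labels.setdefault(root, len(labels) + 1))
    return group_ids


rows = [
    ("X1", 10), ("X1", 20), ("Y1", 30), ("Y2", 30),
    ("A1", 100), ("A1", 200), ("B1", 200), ("B1", 200),
    ("B1", 300), ("B1", 300), ("C1", 300), ("C1", 400),
]
group_ids = assign_group_ids(rows)
print(group_ids)  # [1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
```

On the sample data this reproduces the desired GROUP_ID column, including rows 11 and 12, because the C1 rows are reached through the shared ID_2 value 300.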

Related

Dynamically updating row values based on a condition in pandas

I am running a simulation test in which I want to dynamically change some of the row values in each column, based on a set of conditions.
The Problem Statement
My dataset has 400 rows, and my first test case is to update 5% of the rows in each column, so 5% of 400 = 20 rows need to be updated.
These 20 rows should be updated only for the top 5 categories present in my dataset, so 4 rows per category need to be updated.
My dataframe looks like this:
A B C D Category
1 10 3 4 X
4 9 6 9 Y
9 3 7 10 XX
10 1 9 7 YY
10 1 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
The conditions are:
While updating the rows, I want to make sure that the 20 rows (5% of the overall dataset) are updated only where one of the top 5 categories is encountered. In my case the top 5 categories are X, Y, XX, YY and ZZ. In these rows, values should be updated to 7 where the previous value was 1, 2, 3, 4, 5 or 6.
The resulting dataframe should look like this:
A B C D Category
7 10 7 7 X
7 9 7 9 Y
9 7 7 10 XX
10 7 9 7 YY
10 7 9 7 ZZ
10 1 9 7 YZZ
10 1 9 7 YZZ
10 1 9 7 YYYY
......400 rows
In the resulting dataframe, categories outside the top 5 (here YZZ and YYYY) are not affected. I can't show all the updated rows, but as an example, in the dataframe above 2 rows have been updated in column A, where the previous value was <= 6, to the new value 7. The other two rows will likewise be updated to 7 wherever the condition is met.
How can I achieve this?
You can try the following logic:
# get only desired Categories
m = df['Category'].isin(['X', 'Y', 'XX', 'YY', 'ZZ'])
# select 20 random rows from the above
idx = df[m].sample(n=20).index
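# replace the 1 ≤ values ≤ 6 by 7 (numeric columns only, so the string column 'Category' is never compared)
cols = df.columns.drop('Category')
df.loc[idx, cols] = df.loc[idx, cols].mask(df.loc[idx, cols].ge(1) & df.loc[idx, cols].le(6), 7)
If you would rather pick 4 rows per Category, use this variant for the random sampling: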
idx = df[m].groupby('Category').sample(n=4).index
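For experimenting with this logic, here is a self-contained sketch that wraps it in a function and runs it on the 8 sample rows from the question. The function name `bump_sampled_rows`, its parameters, and the fixed `seed` are assumptions added for illustration and reproducibility; only the numeric columns are compared, so the `Category` strings are left alone.

```python
import pandas as pd


def bump_sampled_rows(df, categories, n, new_value=7, low=1, high=6, seed=0):
    """Return a copy of df where, in n randomly sampled rows of the given
    categories, numeric values in [low, high] are replaced by new_value."""
    out = df.copy()
    num_cols = out.select_dtypes("number").columns
    eligible = out[out["Category"].isin(categories)]
    # Never ask for more rows than are eligible.
    idx = eligible.sample(n=min(n, len(eligible)), random_state=seed).index
    block = out.loc[idx, num_cols]
    out.loc[idx, num_cols] = block.mask(block.ge(low) & block.le(high), new_value)
    return out, idx


df = pd.DataFrame({
    "A": [1, 4, 9, 10, 10, 10, 10, 10],
    "B": [10, 9, 3, 1, 1, 1, 1, 1],
    "C": [3, 6, 7, 9, 9, 9, 9, 9],
    "D": [4, 9, 10, 7, 7, 7, 7, 7],
    "Category": ["X", "Y", "XX", "YY", "ZZ", "YZZ", "YZZ", "YYYY"],
})
top5 = ["X", "Y", "XX", "YY", "ZZ"]
updated, picked = bump_sampled_rows(df, top5, n=3)
print(updated)
```

Which rows are picked depends on the seed, so the printed frame will differ from the one in the question. Rows outside the sampled index, including all YZZ and YYYY rows, are returned unchanged.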

LISTAGG Partition for Webi

I would like to do something similar to Oracle's LISTAGG in Webi. Below are my queries.
Query 1
Id M1 ; columns
1 10
2 20
3 30
4 40
5 50
Query 2
Id D1 ; column
1 A11
1 A12
1 A13
2 A21
2 A22
2 A23
2 A24
3 A31
Desired outcome after merging Query 1 and Query 2 by Id:
Id M1 New Column
1 10 A11;A12;A13
2 20 A21;A22;A23;A24
3 30 A31
4 40
5 50
I can get to the point shown below. I then use NoFilter to keep the values intact when applying a filter. However, column F2 shows "#MULTIVALUE". I can get NoFilter to work with a single query, but with two queries like this it doesn't work. Any suggestions on how to address the issue?
Id M1 F1 (Measure) F2
1 10 A11;A12;A13 =NoFilter([F1])
1 10 A11;A12;A13
1 10 A11;A12;A13
2 20 A21;A22;A23;A24
2 20 A21;A22;A23;A24
2 20 A21;A22;A23;A24
2 20 A21;A22;A23;A24
3 30 A31
4 40
5 50
Could anyone show me how to achieve this?
Many thanks for your help,
Andre
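To make the target shape concrete outside of Webi, the desired outcome can be sketched in pandas: concatenate the D1 values per Id with ";" (the LISTAGG step), then left-join the result onto Query 1 so that Ids without detail rows are kept. This only illustrates the intended result and is not a Webi formula; the variable names are hypothetical.

```python
import pandas as pd

# Query 1: one row per Id with a measure.
q1 = pd.DataFrame({"Id": [1, 2, 3, 4, 5], "M1": [10, 20, 30, 40, 50]})

# Query 2: several detail rows per Id.
q2 = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2, 2, 3],
    "D1": ["A11", "A12", "A13", "A21", "A22", "A23", "A24", "A31"],
})

# LISTAGG equivalent: join the D1 values of each Id with ';'.
joined = q2.groupby("Id")["D1"].agg(";".join).rename("New Column").reset_index()

# A left join keeps Ids 4 and 5, which have no detail rows.
result = q1.merge(joined, on="Id", how="left").fillna({"New Column": ""})
print(result)
```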

How can I add previous column values to get a new value in Excel?

I am working on a graph and need the data in the format below. I have data in COL A and need to calculate the COL B values as in the picture below.
What is the formula for obtaining this in Excel?
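In pandas, you can do this with shift and cumsum:
import numpy as np
import pandas as pd
# sample data
df = pd.DataFrame({'COL A': np.arange(11)})
# shift down by one row, then accumulate, so each row holds the sum of all previous rows
df['COL B'] = df['COL A'].shift(fill_value=0).cumsum()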
Output:
COL A COL B
0 0 0
1 1 0
2 2 1
3 3 3
4 4 6
5 5 10
6 6 15
7 7 21
8 8 28
9 9 36
10 10 45
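Alternatively, use a simple Excel formula.
You can use the formula (A3*A2)/2 for COL B. Note that this works only because COL A holds consecutive integers starting at 0, so the sum of the previous values has a closed form.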
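The two answers agree on this particular data but are not interchangeable. The sketch below compares the shift-and-cumsum running total with the product formula, first on the consecutive integers 0 to 10 and then on an arbitrary column where the product formula no longer gives the sum of the previous values. The helper name `previous_running_total` is hypothetical.

```python
import numpy as np
import pandas as pd


def previous_running_total(values):
    """Sum of all values in the rows above the current row."""
    s = pd.Series(values)
    return s.shift(fill_value=0).cumsum()


# Consecutive integers: both approaches give the same column.
consecutive = pd.Series(np.arange(11))
running = previous_running_total(consecutive)
closed_form = consecutive * consecutive.shift(fill_value=0) // 2  # (A3*A2)/2
print(running.equals(closed_form))  # True

# Arbitrary data: only the running total is the sum of the previous rows.
arbitrary = pd.Series([3, 3, 3])
running_arbitrary = previous_running_total(arbitrary)
closed_arbitrary = arbitrary * arbitrary.shift(fill_value=0) // 2
print(running_arbitrary.tolist())  # [0, 3, 6]
print(closed_arbitrary.tolist())   # [0, 4, 4]
```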

Select rows with the same value in one column but different values in the other column

I have some duplicates in my data that I need to correct.
This is a sample of a dataframe:
test = pd.DataFrame({'event_id':['1','1','2','3','5','6','9','3','9','10'],
'user_id':[0,0,0,1,1,3,3,4,4,4],
'index':[10,20,30,40,50,60,70,80,90,100]})
I need to select all the rows that have equal values in event_id but differing values in user_id. I tried this (based on a similar question but with no accepted answer):
test.groupby('event_id').filter(lambda g: len(g) > 1).drop_duplicates(subset=['event_id', 'user_id'], keep="first")
out:
event_id user_id index
0 1 0 10
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
But I do not need the first row, where the user_id is the same (0) for both duplicates.
The second part of the question: what is the best way to correct the duplicate records? How could I add a suffix (_new) to event_id, but only in these rows:
event_id user_id index
3 3_new 1 40
6 9_new 3 70
7 3 4 80
8 9 4 90
Here is a fix for your code: keep only the groups that have more than one row and in which every user_id is different.
test.groupby('event_id').filter(lambda x: (len(x['event_id']) == x['user_id'].nunique()) & (len(x['event_id']) > 1))
Out[85]:
event_id user_id index
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
To correct the duplicate rows, you can create a new sub-key; personally, I would not recommend modifying your original columns.
test['subkey'] = test.groupby('event_id').cumcount()
Try:
test[test.duplicated(['event_id'], keep=False) &
~test.duplicated(['event_id','user_id'], keep=False)]
Output:
event_id user_id index
3 3 1 40
6 9 3 70
7 3 4 80
8 9 4 90
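For the second part of the question (adding the `_new` suffix), one possible sketch builds on the `duplicated` mask above: among the conflicting rows, take the first occurrence of each event_id and append the suffix there. Treating the first occurrence as the row to rename is an assumption based on the example output in the question, and the variable names are hypothetical.

```python
import pandas as pd

test = pd.DataFrame({'event_id': ['1', '1', '2', '3', '5', '6', '9', '3', '9', '10'],
                     'user_id': [0, 0, 0, 1, 1, 3, 3, 4, 4, 4],
                     'index': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Rows whose event_id repeats while the (event_id, user_id) pair does not.
conflict = (test.duplicated(['event_id'], keep=False)
            & ~test.duplicated(['event_id', 'user_id'], keep=False))

# Among the conflicting rows, pick the first occurrence of each event_id.
first_idx = test[conflict].drop_duplicates('event_id').index

# Work on a copy so the original frame stays untouched.
fixed = test.copy()
fixed.loc[first_idx, 'event_id'] = fixed.loc[first_idx, 'event_id'] + '_new'
print(fixed[conflict])
```

On the sample data this renames the rows with index labels 3 and 6 to `3_new` and `9_new` and leaves rows 7 and 8 as they are.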

Excel - How do I create a cumulative sum column within a group?

In Excel, I have an hours log that looks like this:
PersonID Hours JobCode
1 7 1
1 6 2
1 8 3
1 10 1
2 5 3
2 3 5
2 12 2
2 4 1
I would like to create a column with a running total, but only within each PersonID, like this:
PersonID Hours JobCode Total
1 7 1 7
1 6 2 13
1 8 3 21
1 10 1 31
2 5 3 5
2 3 5 8
2 12 2 20
2 4 1 24
Any ideas on how to do that?
Enter this in D2 and fill down:
=SUMIF(A$2:A2,A2,B$2:B2)
Assuming that your data starts in cell A1 (with the headers in row 1) and that each person's rows are contiguous, this formula will accumulate the hours until it finds a change in PersonID.
=IF(A2=A1,D1+B2,B2)
Put the formula in cell D2, and copy down for each row of your data.
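For readers who keep the same log in pandas rather than in Excel, the per-person running total can be sketched with a grouped cumulative sum. This is an equivalent outside Excel, not a replacement for the formulas above; the frame name `log` is hypothetical. Unlike the IF formula, it does not require each person's rows to be contiguous.

```python
import pandas as pd

log = pd.DataFrame({
    "PersonID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Hours": [7, 6, 8, 10, 5, 3, 12, 4],
    "JobCode": [1, 2, 3, 1, 3, 5, 2, 1],
})

# The running total restarts for every PersonID.
log["Total"] = log.groupby("PersonID")["Hours"].cumsum()
print(log)
```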

Resources