How to get the column names along with the rows using the Pivot stage in DataStage - pivot

I have a table in format
Data:
ID Sev1 Sev2 Sev3
ABC 0.45 1 1
PQR 0.45 1 2
XYZ 0.45 1 1
I want to change this to the new format below using a horizontal pivot. How can I also get the column names (severity) along with their data?
Expected Output:
ID Severity Values
ABC Sev1 0.45
ABC Sev2 1
ABC Sev3 1
PQR Sev1 0.45
PQR Sev2 1
PQR Sev3 2
XYZ Sev1 0.45
XYZ Sev2 1
XYZ Sev3 1
I want to bring the column names in as rows, alongside their corresponding values.
After converting with the horizontal pivot, I am then using a CDC stage to determine whether each record is an insert or an update (CDC change code = 1 or 3). Please find sample data below:
ID Severity Values CDCCode
ABC Sev1 0.45 1
ABC Sev2 1 1
ABC Sev3 1 1
PQR Sev1 0.45 3
PQR Sev2 1 3
PQR Sev3 2 3
XYZ Sev1 0.45 3
XYZ Sev2 1 3
XYZ Sev3 1 3
Once I get this output from the CDC stage (1 = insert record, 3 = update record), I want to convert those rows back into columns using a vertical pivot.
Expected output after vertical pivot
CDC Code = 1 (insert record - I am trying to get the data below as output)
ID Sev1 Sev2 Sev3
ABC 0.45 1
PQR 1
CDC Code = 3 (update record - I am trying to get the data below as output)
ID Sev1 Sev2 Sev3
ABC 1
PQR 0.45 2
XYZ 0.45 1 1

Check the Pivot Index checkbox on the Pivot Properties tab.
Then add the additional column to your output on the Mapping tab.
This will generate a column containing a number - the index of the pivoted column.
In an additional step you can then transform those numbers back into column names.
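For illustration only, here is a rough pandas sketch of the same idea - this is not DataStage code, and the column list and index-to-name mapping are assumptions based on the sample data above: pivot the Sev columns to rows, keep a numeric pivot index, and translate that index back into a column name in a later step.
import pandas as pd

# Sample input from the question
df = pd.DataFrame({'ID': ['ABC', 'PQR', 'XYZ'],
                   'Sev1': [0.45, 0.45, 0.45],
                   'Sev2': [1, 1, 1],
                   'Sev3': [1, 2, 1]})

sev_cols = ['Sev1', 'Sev2', 'Sev3']

# Horizontal pivot: columns become rows; 'Severity' holds the original column name
long_df = df.melt(id_vars='ID', value_vars=sev_cols,
                  var_name='Severity', value_name='Values')

# Equivalent of the Pivot Index column: one number per pivoted column...
index_to_name = dict(enumerate(sev_cols))   # {0: 'Sev1', 1: 'Sev2', 2: 'Sev3'}
long_df['PivotIndex'] = long_df['Severity'].map({v: k for k, v in index_to_name.items()})

# ...which a later step translates back into the column name
long_df['Severity_from_index'] = long_df['PivotIndex'].map(index_to_name)
print(long_df.sort_values('ID'))
In DataStage itself, that last mapping step would typically be done in a Transformer stage or with a lookup against a small reference table of index/name pairs.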

Related

How can I replace the column names in a pandas dataframe

I have an Excel file like this:
Old_name new_name
xyz abc
opq klm
And my dataframe looks like this:
Id timestamp xyz opq
1 04-10-2021 3 4
2 05-10-2021 4 9
As you can see, the old names are my column names, and I would like to map and replace them with the new names from my mapping file. How can I do that?
Try with rename:
df.rename(columns=col_names.set_index('Old_name')['new_name'], inplace=True)
# verify
print(df)
Output:
Id timestamp abc klm
0 1 04-10-2021 3 4
1 2 05-10-2021 4 9
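A minimal self-contained sketch of the same approach, assuming the mapping sheet is loaded into col_names (the file name mapping.xlsx is hypothetical):
import pandas as pd

# Hypothetical file; it holds the two columns Old_name and new_name
col_names = pd.read_excel('mapping.xlsx')

df = pd.DataFrame({'Id': [1, 2],
                   'timestamp': ['04-10-2021', '05-10-2021'],
                   'xyz': [3, 4],
                   'opq': [4, 9]})

# Build an {old: new} dict and rename; columns not in the mapping are left untouched
mapping = dict(zip(col_names['Old_name'], col_names['new_name']))
df = df.rename(columns=mapping)
print(df)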

Create duplicate rows in two data frames based on column values (pandas & numpy)

Suppose I have two data frames, DF1 and DF2,
no1 quantity no2
abc 3 123
pqr 5 NaN
and
no1 serial
abc 10
pqr 20
I want to create the following output DF3 and DF4
no1 quantity
abc 3
123 3
pqr 5
and
no1 serial
abc 10
123 10
pqr 20
Kindly help me create DF3. I have thought about repeating the rows of DF1 (where the value is not NaN) for DF3 and then dropping the no2 column. It is possible to create DF4 using pd.merge, but the serial number of 123 should be 10, which is what is required.
For df3 you can use the append(), to_frame() and assign() methods:
df3 = (df1['no1'].append(df1['no2'])
       .to_frame(name='no1')
       .assign(quantity=df1['quantity'])
       .reset_index(drop=True)
       .dropna())
Output of df3:
no1 quantity
0 abc 3
1 pqr 5
2 123.0 3
For df4 you can use the merge(), groupby() and ffill() methods:
df4=df3.merge(df2,on='no1',how='left').groupby('quantity').ffill()
Output of df4:
no1 serial
0 abc 10.0
1 pqr 20.0
2 123.0 10.0
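Note that Series.append was removed in pandas 2.0; on newer versions the df3 step can be sketched with pd.concat instead (same logic, assuming the df1 shown in the question):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'no1': ['abc', 'pqr'],
                    'quantity': [3, 5],
                    'no2': [123, np.nan]})

# Stack no1 and no2 into one column; quantity is carried over by index alignment
df3 = (pd.concat([df1['no1'], df1['no2']])
         .to_frame(name='no1')
         .assign(quantity=df1['quantity'])
         .reset_index(drop=True)
         .dropna())
print(df3)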

How to group rows in pandas with a sum in a certain column

Given a DataFrame like this:
   A    B               C    D
0  ABC  unique_ident_1  10   ONE
1  KLM  unique_ident_2   2   TEN
2  KLM  unique_ident_2   7   TEN
3  XYZ  unique_ident_3   2   TWO
3  ABC  unique_ident_1   8   ONE
3  XYZ  unique_ident_3  -5   TWO
where column "B" contains a unique text identifier, columns "A" and "D" contain constant text that depends on the unique id, and column "C" has a quantity. I want to group rows by the unique identifier (column "B"), with the quantity column summed up per identifier:
   A    B               C    D
0  ABC  unique_ident_1  18   ONE
1  KLM  unique_ident_2   9   TEN
2  XYZ  unique_ident_3  -3   TWO
How can I get this result with pandas?
Use named aggregation with a groupby:
df1 = df.groupby('B', as_index=False).agg(
    A=('A', 'first'),
    C=('C', 'sum'),
    D=('D', 'first')
)[df.columns]
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
You can also create a dictionary and then group, in case you have many columns:
agg_d = {col: 'sum' if col == 'C' else 'first' for col in df.columns}
out = df.groupby('B').agg(agg_d).reset_index(drop=True)
print(out)
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
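A runnable version of the named-aggregation approach, built only from the sample data in the question:
import pandas as pd

df = pd.DataFrame({'A': ['ABC', 'KLM', 'KLM', 'XYZ', 'ABC', 'XYZ'],
                   'B': ['unique_ident_1', 'unique_ident_2', 'unique_ident_2',
                         'unique_ident_3', 'unique_ident_1', 'unique_ident_3'],
                   'C': [10, 2, 7, 2, 8, -5],
                   'D': ['ONE', 'TEN', 'TEN', 'TWO', 'ONE', 'TWO']})

# Sum C per identifier, keep the first A and D, and restore the original column order
out = df.groupby('B', as_index=False).agg(
    A=('A', 'first'),
    C=('C', 'sum'),
    D=('D', 'first'),
)[df.columns]
print(out)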

Dropping columns with high missing values

I have a situation where I need to drop a lot of columns from my dataframe because they have a high share of missing values. I have created a new dataframe that gives me the missing-value counts and the ratio of missing values from my original data set.
My original data set, data_merge2, looks like this:
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
The count data set, which gives me the missing count and ratio, looks like this:
missing_count missing_ratio
C 4 0.10
D 4 0.66
The code that I used to create the count dataset looks like this:
# Only check those columns where there are missing values, as we have a lot of columns
new_df = (data_merge2.isna()
          .sum()
          .to_frame('missing_count')
          .assign(missing_ratio=lambda x: x['missing_count'] / len(data_merge2) * 100)
          .loc[data_merge2.isna().any()])
print(new_df)
Now I want to drop the columns from the original dataframe whose missing ratio is >50%
How should I achieve this?
Use:
data_merge2.loc[:,data_merge2.count().div(len(data_merge2)).ge(0.5)]
#Alternative
#df[df.columns[df.count().mul(2).gt(len(df))]]
or DataFrame.drop, using the new_df DataFrame:
data_merge2.drop(columns = new_df.index[new_df['missing_ratio'].gt(50)])
Output
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC
Adding another way with query and XOR:
data_merge2[data_merge2.columns ^ new_df.query('missing_ratio>50').index]
Or the pandas way, using Index.difference:
data_merge2[data_merge2.columns.difference(new_df.query('missing_ratio>50').index)]
A B
0 123.0 ABC
1 123.0 ABC
2 NaN ABC
3 123.0 ABC
4 245.0 ABC
5 345.0 ABC
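Another option worth mentioning (my addition, not from the original answers) is DataFrame.dropna with axis=1 and a thresh, which keeps only the columns that have at least that many non-missing values; applied to the same data_merge2 it gives the same result:
# Keep columns with at least 50% non-missing values, drop the rest
min_non_missing = int(len(data_merge2) * 0.5)
data_merge2 = data_merge2.dropna(axis=1, thresh=min_non_missing)
print(data_merge2)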

New column in a Pandas DataFrame with respect to duplicates in a given column

Hi, I have a dataframe with a column "id" like below:
id
abc
def
ghi
abc
abc
xyz
def
I need a new column "id1" with the number 1 appended to the id, and the number should be incremented for every duplicate. The output should be like below.
id id1
abc abc1
def def1
ghi ghi1
abc abc2
abc abc3
xyz xyz1
def def2
Can anyone suggest a solution for this?
Use groupby.cumcount to count the ids, add 1 and convert to string:
df['id1'] = df['id'] + df.groupby('id').cumcount().add(1).astype(str)
print (df)
id id1
0 abc abc1
1 def def1
2 ghi ghi1
3 abc abc2
4 abc abc3
5 xyz xyz1
6 def def2
Detail:
print (df.groupby('id').cumcount())
0 0
1 0
2 0
3 1
4 2
5 0
6 1
dtype: int64

Resources