I have a data frame which contains two rows. The value in the column "ID" for both these rows is the same. How can I create a new data frame and bring all the values in both the rows into one row, but in separate columns?
For example, if the input data frame has a column called "Amount" in both rows, the new data frame should contain one row with two different columns, Amount_1 and Amount_2.
groupby does not work as I do not want all the information in the same columns.
I cannot merge, as the rows do not come from two different data frames.
Turn:
+------+--------+------+-------+
| ID   | Amount | Name | State |
+------+--------+------+-------+
| 1    | 16     | A    | CA    |
| 2    | 32     | B    | GA    |
| 2    | 64     | C    | NY    |
+------+--------+------+-------+
into:
+------+----------+----------+--------+--------+---------+---------+
| ID   | Amount_1 | Amount_2 | Name_1 | Name_2 | State_1 | State_2 |
+------+----------+----------+--------+--------+---------+---------+
| 1    | 16       |          | A      |        | CA      |         |
| 2    | 32       | 64       | B      | C      | GA      | NY      |
+------+----------+----------+--------+--------+---------+---------+
Add a column that will contain the column names of the new DataFrame by using cumcount. After that, use pivot:
df['amountnr'] = 'Amount_' + df.groupby('ID').cumcount().add(1).astype(str)
df.pivot(index='ID', columns='amountnr', values='Amount')
#amountnr Amount_1 Amount_2
#ID
#1 16.0 NaN
#2 32.0 64.0
Edit
With your new specifications, I feel you should really use a MultiIndex, like so:
df['cumcount'] = df.groupby('ID').cumcount().add(1)
df2 = df.set_index(['ID', 'cumcount']).unstack()
# Amount Name State
#cumcount 1 2 1 2 1 2
#ID
#1 16.0 NaN A NaN CA NaN
#2 32.0 64.0 B C GA NY
If you insist on flat column names, you can always join the levels of the MultiIndex afterwards:
df2.columns = ['_'.join([coltype, str(count)]) for coltype, count in df2.columns.values]
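Putting it all together on the sample data, a minimal runnable sketch that reproduces the requested one-row-per-ID layout:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2],
                   'Amount': [16, 32, 64],
                   'Name': ['A', 'B', 'C'],
                   'State': ['CA', 'GA', 'NY']})

# number the occurrences of each ID (1, 2, ...)
df['cumcount'] = df.groupby('ID').cumcount().add(1)

# unstack every remaining column into a MultiIndex, then flatten the names
df2 = df.set_index(['ID', 'cumcount']).unstack()
df2.columns = ['_'.join([coltype, str(count)]) for coltype, count in df2.columns.values]
df2 = df2.reset_index()
#   ID  Amount_1  Amount_2 Name_1 Name_2 State_1 State_2
#0   1      16.0       NaN      A    NaN      CA     NaN
#1   2      32.0      64.0      B      C      GA      NY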
I have two dataframes (df1 and df2) below. Based on the values of df1.col1 and df2.col1, I want the values in df1.col2 and df2.col2 to be added. This sum should then go against each row in df1 as a third column. For example: wherever we have A01 in col1, the values in col2 should be summed, so 1 + 7 + 5 = 13. The same applies to all other col1 values.
To give further clarity, I have included a snapshot of how df1 should look at the end (End Result).
df1
| col1 | col2 |
|------|------|
| A01  | 1    |
| A02  | 0    |
| A03  | 0    |
| A01  | 7    |
| A02  | 1    |
df2
| col1 | col2 | col3 |
|------|------|------|
| A01  | 5    | x    |
| A02  | 0    | y    |
| A06  | 0    | asa  |
| A07  | 1    | asa  |
| A02  | 4    | st   |
End Result: df1
| col1 | col2 | col3 |
|------|------|------|
| A01  | 1    | 13   |
| A02  | 0    | 5    |
| A03  | 0    | 0    |
| A01  | 7    | 13   |
| A02  | 1    | 5    |
I found a solution based on combining merge and groupby operations. First I grouped df1 and df2 by 'col1' and calculated the sums, then merged the resulting 'sum' back onto df1.
import pandas as pd

dic_df1 = {'col1': ['A01', 'A02', 'A03', 'A01', 'A02'], 'col2': [1, 0, 0, 7, 1]}
df1 = pd.DataFrame(dic_df1)
dic_df2 = {'col1': ['A01', 'A02', 'A06', 'A07', 'A02'], 'col2': [5, 0, 0, 1, 4], 'col3': ['x', 'y', 'asa', 'asa', 'st']}
df2 = pd.DataFrame(dic_df2)
print(df1), print(df2)
Then group each frame by 'col1' and sum, so that after merging the two aggregates we have the totals from both datasets available to build the 'sum' column:
df2=df2.groupby(['col1'], as_index=False)['col2'].sum()
merged=(df1.groupby(['col1'], as_index=False)['col2'].sum()).merge(df2, left_on='col1', right_on='col1', how='left')
merged['sum']=merged['col2_x']+merged['col2_y']
Finally, merge df1 with merged:
df1=df1.merge(merged[['col1', 'sum']], left_on='col1', right_on='col1', how='left')
The final output of df1:
#  col1  col2   sum
#0  A01     1  13.0
#1  A02     0   5.0
#2  A03     0   NaN
#3  A01     7  13.0
#4  A02     1   5.0
There are many ways to fill the NaN values in this final output.
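For instance, a minimal sketch that fills the missing sums with zero to match the expected End Result (renaming 'sum' to 'col3' is my assumption, taken from the target table):
df1['sum'] = df1['sum'].fillna(0)
df1 = df1.rename(columns={'sum': 'col3'})
#  col1  col2  col3
#0  A01     1  13.0
#1  A02     0   5.0
#2  A03     0   0.0
#3  A01     7  13.0
#4  A02     1   5.0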
I have two dataframes: one is a data source dataframe and the other is a reference dataframe.
I want to create an additional column in df1 based on a comparison of the two dataframes.
df1 - data source
No | Name
213344 | Apple
242342 | Orange
234234 | Pineapple
df2 - reference table
RGE_FROM | RGE_TO | Value
2100 | 2190 | Sweet
2200 | 2322 | Bitter
2400 | 5000 | Neutral
Final result: if the first 4 characters of df1.No fall between df2.RGE_FROM and df2.RGE_TO, put df2.Value into the derived column df1.DESC; otherwise, leave it blank.
No | Name | DESC
213344 | Apple | Sweet
242342 | Orange    | Neutral
234234 | Pineapple |
Any help is appreciated!
Thank you!
We can create an IntervalIndex from the columns RGE_FROM and RGE_TO and set it as the index of the Value column to create a mapping series; then slice the first four characters of column No and, using Series.map, substitute the values from the mapping series:
i = pd.IntervalIndex.from_arrays(df2['RGE_FROM'], df2['RGE_TO'], closed='both')
df1['Value'] = df1['No'].astype(str).str[:4].astype(int).map(df2.set_index(i)['Value'])
No Name Value
0 213344 Apple Sweet
1 242342 Orange Neutral
2 234234 Pineapple NaN
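If a literal blank is preferred over NaN for the unmatched rows, as the question's expected output suggests, the mapped result can be filled afterwards, for example:
df1['DESC'] = df1['No'].astype(str).str[:4].astype(int).map(df2.set_index(i)['Value']).fillna('')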
I am trying to replicate rows inside a dataset multiple times with different values for a column in Apache Spark. Let's say I have a dataset as follows:
Dataset A
| num | group |
| 1 | 2 |
| 3 | 5 |
Another dataset has different columns:
Dataset B
| id |
| 1 |
| 4 |
I would like to replicate the rows from Dataset A with the column values of Dataset B. In effect, a join without any join condition needs to be done. The resulting dataset should look like:
| id | num | group |
| 1 | 1 | 2 |
| 1 | 3 | 5 |
| 4 | 1 | 2 |
| 4 | 3 | 5 |
Can anyone suggest how the above can be achieved? As per my understanding, join requires a condition and columns to be matched between 2 datasets.
What you want to do is called a Cartesian product, and df1.crossJoin(df2) will achieve it. But be careful with it, because it is a very heavy operation.
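A minimal PySpark sketch on the sample data (column names taken from the example above; row order in the output may vary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dsA = spark.createDataFrame([(1, 2), (3, 5)], ['num', 'group'])
dsB = spark.createDataFrame([(1,), (4,)], ['id'])

# Cartesian product: every row of dsB paired with every row of dsA
dsB.crossJoin(dsA).show()
# +---+---+-----+
# | id|num|group|
# +---+---+-----+
# |  1|  1|    2|
# |  1|  3|    5|
# |  4|  1|    2|
# |  4|  3|    5|
# +---+---+-----+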
In table 1, I have:
+---+---+----+
| | A | B |
+---+---+----+
| 1 | A | 30 |
| 2 | B | 20 |
| 3 | C | 15 |
+---+---+----+
In table 2, I have:
+---+---+---+----+
| | A | B | C |
+---+---+---+----+
| 1 | A | 2 | 15 |
| 2 | A | 5 | 6 |
| 3 | B | 4 | 5 |
+---+---+---+----+
I want to divide the number from table 1 (matched on the letter in the first column) by the number in the second column of table 2, with the result in the third column. The numbers shown in the third column of table 2 are the results needed (e.g. 30 / 2 = 15). What formula must I apply in the third column of table 2?
Please help me on this.
Thanks in advance.
You can use a VLOOKUP() formula to go get the dividend (assuming table 1 is on Sheet1 and table 2, where we put this formula, is on Sheet2):
=VLOOKUP(A1,Sheet1!A:B, 2, FALSE)/Sheet2!B1
Since you mention tables, the same works with structured references, though it seems you are not applying those here:
=VLOOKUP([#Column1],Table1[#All],2,0)/[#Column2]
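One caveat, assuming your data can contain letters in table 2 with no match in table 1 (not something shown in the question): VLOOKUP then returns #N/A, and wrapping the lookup handles that case:
=IFERROR(VLOOKUP(A1,Sheet1!A:B,2,FALSE)/Sheet2!B1,"")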
I have a dataframe df with ~300,000 rows and plenty of columns:
      | COL_A | ... | COL_B | COL_C |
------+-------+-----+-------+-------+
IDX   |       |     |       |       |
'AAA' | 'A1'  | ... | 'B1'  | 0     |
'AAB' | 'A1'  | ... | 'B2'  | 2     |
'AAC' | 'A1'  | ... | 'B3'  | 1     |
'AAD' | 'A2'  | ... | 'B3'  | 0     |
I need to group by COL_A, and from each row of each group I need the value of IDX (e.g. 'AAA') and COL_B (e.g. 'B1'), in the order given by COL_C.
For A1 I thus need: [['AAA','B1'], ['AAC','B3'], ['AAB','B2']]
This is what I do.
grouped_by_A = self.df.groupby(COL_A)
for col_A, group in grouped_by_A:
    group = group.sort_values(by=[COL_C], ascending=True)
    ...
It works fine, but it's horribly slow (Core i7, 16 GB RAM). It already takes ~5 minutes even when I'm not doing anything with the values. Do you know a faster way?
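One direction that is typically faster, sketched below (not benchmarked on data of this size; it assumes the index is named IDX as in the example): sort the whole frame by COL_C once up front instead of sorting inside the loop, then iterate over the groups of the pre-sorted frame, since groupby preserves the row order within each group.
# sort once globally, then group the pre-sorted frame
df_sorted = df.reset_index().sort_values(COL_C)
result = {col_a: group[['IDX', COL_B]].values.tolist()
          for col_a, group in df_sorted.groupby(COL_A)}
# result['A1'] -> [['AAA', 'B1'], ['AAC', 'B3'], ['AAB', 'B2']]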