Assuming I have two rows where for most of the columns the values are same, but not for all. I would like to group these two rows into one where ever the values are same and if the values are different then create an extra column and assign the column name as 'column1'
Step 1: Here assuming I have columns which has same value in both the rows 'a','b','c' and columns which has different values are 'd','e','f' so I am grouping using 'a','b','c' and then unstacking 'd','e','f'
Step 2: Then I am dropping the levels then renaming it to 'a','b','c','d','d1','e','e1','f','f1'
But in my actual case I have 500+ columns and million rows, I dont know how to expand this to 500+ columns where I have constrains like
1) I dont know which all columns will have same values
2) And which all columns will have different values that needs to be converted into new column after grouping with the columns that has same value
df.groupby(['a','b','c']) ['d','e','f'].apply(lambda x:pd.DataFrame(x.values)).unstack().reset_index()
df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']
To be more clear, the below code creates the sample dataframe & expected output
df = pd.DataFrame({'Cust_id':[100,100, 101,101,102,103,104,104], 'gender':['M', 'M', 'F','F','M','F','F','F'], 'Date':['01/01/2019', '02/01/2019','01/01/2019',
'01/01/2019','03/01/2019','04/01/2019','03/01/2019','03/01/2019'],
'Product': ['a','a','b','c','d','d', 'e','e']})
expected_output = pd.DataFrame({'Cust_id':[100, 101,102,103,104], 'gender':['M', 'F','M','F','F'], 'Date':['01/01/2019','01/01/2019','03/01/2019','04/01/2019', '03/01/2019'], 'Date1': ['02/01/2019', 'NA','NA','NA','NA']
, 'Product': ['a', 'b', 'd', 'd','e'], 'Product1':['NA', 'c','NA','NA','NA' ]})
you may do following to get expected_output from df
s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
df1.columns = df1.columns.map(''.join)
Out[57]:
Date Date1 Product Product1
Cust_id gender
100 M 01/01/2019 02/01/2019 a a
101 F 01/01/2019 01/01/2019 b c
102 M 03/01/2019 NaN d NaN
103 F 04/01/2019 NaN d NaN
104 F 03/01/2019 03/01/2019 e e
Next, replace columns having duplicated values with NA
df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()
Out[72]:
Cust_id gender Date Date1 Product Product1
0 100 M 01/01/2019 02/01/2019 a NA
1 101 F 01/01/2019 NA b c
2 102 M 03/01/2019 NA d NA
3 103 F 04/01/2019 NA d NA
4 104 F 03/01/2019 NA e NA
You can try this code - it could be a little cleaner but I think it does the job
df = pd.DataFrame({'a':[100, 100], 'b':['tue', 'tue'], 'c':['yes', 'yes'],
'd':['ok', 'not ok'], 'e':['ok', 'maybe'], 'f':[55, 66]})
df_transformed = pd.DataFrame()
for column in df.columns:
col_vals = df.groupby(column)['b'].count().index.values
for ix, col_val in enumerate(col_vals):
temp_df = pd.DataFrame({column + str(ix) : [col_val]})
df_transformed = pd.concat([df_transformed, temp_df], axis = 1)
Output for df_transformed
Related
Hi everyone, I have two question need helping
Question 2
I have df with data as belows:
ABC_x
Quantity silent
ABC_y
Quantity noirse
A
05
NaN
NaN
B
03
NaN
NaN
NaN
NaN
D
08
NaN
NaN
E
09
G
01
NaN
NaN
How to merge two column ABC_x and ABC_y (same prefix ABC) to one column ABC, and merge data of two column special quantity to one column Quantity?
DF expected:
ABC
Quantity
A
05
B
03
D
08
E
09
G
01
Thank you for reading and help me troubleshoot problem, Have a nice day <3
I have try but unsuccessful
Question 1
pandas has a function duplicated that gives you true for duplicates and false otherwise
In [40]: df.duplicated(["Column A"])
Out[40]:
0 False
1 True
dtype: bool
You can use this for boolean indexing
In [43]: df.loc[df.duplicated(["Column A"]), "Column A"] = np.nan
In [44]: df
Out[44]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN ValueB ValueC Value_D002 Value_E06 Value_F4
and the same for the other columns.
Note
You can also pass multiple columns with
In [52]: df.loc[
...: df.duplicated(["Column A", "Column B", "Column C"]),
...: ["Column A", "Column B", "Column C"],
...: ] = np.nan
In [53]: df
Out[53]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
However, this would replace only where all three columns are duplicated at the same time.
Question 2
pandas has a function fill to replace nan values. From your example I assume there is either a value in _x or _y. In this case you can use backfill to use _x if it is there and take _y otherwise
In [76]: df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1)
Out[76]:
ABC_x ABC_y
0 A NaN
1 B NaN
2 D D
3 E E
4 G NaN
Then do this for ABC as well as Quantity and use the first column only:
In [82]: pd.DataFrame({
"ABC": df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1).iloc[:, 0],
"Quantity": df[["Quantity silent", "Quantity noirse"]].fillna(method="backfill", axis=1).iloc[:, 0].astype(int),
})
Out[82]:
ABC Quantity
0 A 5
1 B 3
2 D 8
3 E 9
4 G 1
The astype(int) in the end is just because nan is not a valid integer, so pandas interprets the numbers as floats in the presence of nan
Question1
when column name have 'Column', chk duplicated to NaN
cond1 = df.columns.str.contains('Column')
df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated()))
result:
Column A Column B Column C Column D Column E Column F
0 ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NaN NaN NaN Value_D002 Value_E06 Value_F4
make result to join to name
full code
cond1 = df.columns.str.contains('Column')
df.loc[:, ~cond1].join(df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated())))
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
Question2
df.set_axis(df.columns.str.split('[ _]').str[0], axis=1).groupby(level=0, axis=1).first()
result
ABC Quantity
0 A 05
1 B 03
2 D 08
3 E 09
4 G 01
Problem: merging varying number of rows by multiple conditions
Here is a stylistic example of how the dataset looks like
"index" "connector" "type" "q_text" "a_text" "varx" ...
1 1111 1 aa NA xx
2 9999 2 NA tt NA
3 1111 2 NA uu NA
4 9999 1 bb NA yy
5 9999 1 cc NA zz
Goal: how the dataset should look like
"index" "connector" "type" "type.1" "q_text" "q_text.1" "a_text" "a_text.1 " "varx" "varx.1" ...
1 1111 1 2 aa NA NA uu xx NA
2 9999 1 2 bb NA NA tt yy NA
3 9999 1 2 cc NA NA tt zz NA
Logic: Column "type" has either value 1 or 2 while multiple rows have value 1 but only one row (with the same value in "connector") has value 2
If
same values in "connector"
then
merge
rows of "type"=2 with rows of "type"=1
but
(because multiple rows of "type"=1 have the same value in "connector")
duplicate
the corresponding rows of type=2
and
merge
all of the other rows that also have the same value in "connector" and are of "type"=1
My results: Not all are merged because multiple rows of "type"=1 are associated with UNIQUE rows of "type"=2
Most similar questions are answered using SQL, which i cannot use here.
df2 = df.copy()
df.index.astype(str)
df2.index.astype(str)
pd.merge(df,df2, how='left', on='connector',right_index=True, left_index=True)
df3 = pd.merge(df.set_index('connector'),df2.set_index('connector'), right_index=True, left_index=True).reset_index()
dfNew = df.merge(df2, how='left', left_on=['connector'], right_on = ['connector'])
Can i achieve my goal by goupby() ?
Solution by #victor__von__doom
if __name__ == '__main__':
df = df.groupby('connector', sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
df[['here', 'are', 'all', 'columns', 'except', 'for', 'the', 'connector', 'column']] = pd.DataFrame(df.merged.tolist())
df = df.drop(['merged'], axis=1)
First off, it is really messy to just keep concatenating new columns onto your original DataFrame when rows are merged, especially when the number of columns is very large. Furthermore, if you end up merging 3 rows for 1 connector value and 4 rows for another (for example), the only way to include all values is to make empty columns for some rows, which is never a good idea. Instead, I've made it so that the merged rows get combined into tuples, which can then be parsed efficiently while keeping the size of your DataFrame manageable:
import numpy as np
import pandas as pd
if __name__ == '__main__':
data = np.array([[1,2,3,4,5], [1111,9999,1111,9999,9999],
[1,2,2,1,1], ['aa', 'NA', 'NA', 'bb', 'cc'],
['NA', 'tt', 'uu', 'NA', 'NA'],
['xx', 'NA', 'NA', 'yy', 'zz']])
df = pd.DataFrame(data.T, columns = ["index", "connector",
"type", "q_text", "a_text", "varx"])
df = df.groupby("connector", sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
df[["type", "q_text", "a_text", "varx"]] = pd.DataFrame(df.merged.tolist())
df = df.drop(['merged'], axis=1)
The final DataFrame looks like:
connector type q_text a_text varx ...
0 1111 (1, 2) (aa, NA) (NA, uu) (xx, NA) ...
1 9999 (2, 1, 1) (NA, bb, cc) (tt, NA, NA) (NA, yy, zz) ...
Which is much more compact and readable.
How can I drop the whole group by city and district if date's value of 2018/11/1 not exits in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create helper column by DataFrame.assign, compare by datetime and test if at least one true per groups with GroupBy.any and GroupBy.transform for possible filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
.groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If error with misisng values in mask one possivle idea is replace misisng values in columns used for groups:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
city= df['city'].fillna(-1),
district= df['district'].fillna(-1))
.groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is add possible misisng index values by reindex and also replace missing values to False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
.groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
I have a data-frame with 55 columns and 2 million rows having mix of categorical and numeric fileds. There are null/na values in the data-set. I want to fill Null values with Column names.
The data-set I have is:
A B C D .....
1 na na 3 .....
na 3 4 na .....
........................
The output the I am trying to get is:
A B C D .....
1 B C 3 .....
A 3 4 D .....
........................
I am trying to use :
df.fillna(method='ffill')
Is there another way?
Python:3.6.5
Use DataFrame.fillna with columns converted to Series by Index.to_series:
df = df.fillna(df.columns.to_series())
print (df)
A B C D
0 1 B C 3
1 A 3 4 D
EDIT: If categorical columns in DataFrame select these columns and append non exist values by cat.add_categories:
for c in df.select_dtypes('category'):
df[c] = df[c].cat.add_categories(c)
df = df.fillna(df.columns.to_series())
I am trying to update one dataframe with another dataframe with respect to the first column. If there is an extra row in the second dataframe, it should be inserted in the first dataframe. It there is a row with the same data in the first column but different data in the other coulmns, that row should be updated. Also, the row which has no value in the first column should be dropped.
Code used -
df = df_1.combine_first(df_2)\
.reset_index()\
.reindex(columns=df_1.columns)
df = df.drop_duplicates(subset='A', keep= 'last', inplace=False)
df.dropna(subset=['A'])
print ("Final Data")
print (df)
First Dataframe -
A B C
0 45 a b
1 98 c d
2 67 bn k
Second Dataframe -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4
Final should look like -
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
The final dataframe that I get -
A B C
0 45.0 a b
1 98.0 c d
2 67.0 bn k
3 90.0 x z
4
So, neither the data is getting updated, nor is it removing the row with null values. What am I missing?
Based on my understanding of your question, your second dataframe basically supercedes the first, if there is a matching index. If there isn't, then the difference is added to the first dataframe. I am also assuming that there are no duplicate keys in the first column, A.
Framing this requirement a little differently, the final output should contain all the rows in the second dataframe, as well as the values (since they are meant to overwrite the first dataframe if there's a match). Therefore, we will start off using the second dataframe as it is, and then add back the rows that exist in the first dataframe but not in the second. See the example below. (I'm also using a slightly different first dataframe to highlight the effects)
import pandas as pd
df1 = pd.DataFrame({'A':[45,98,67,91],'B':['a','c','bn','y'],'C':['b','d','k','oo']})
df2 = pd.DataFrame({'A':[45,98,67,90,''],'B':['a','c','bn','x',''],'C':['d','d','k','z','']})
# Remove rows with empty values in first column. This should be whatever conditions applicable to you i.e. checking for np.nan instead of str('')
df2 = df2.loc[df2['A'] != '']
df1.set_index('A', inplace=True)
df2.set_index('A', inplace=True)
# Find keys in dataframe 1 that are not in dataframe 2
idx_diff = df1.index.difference(df2.index)
# Append these rows to dataframe 2
df_ins = df1.loc[idx_diff]
df3 = df2.append(df_ins)
df3.reset_index(inplace=True)
>>>df3
A B C
0 45 a d
1 98 c d
2 67 bn k
3 90 x z
4 91 y oo