How to select columns based on criteria? - python-3.x

I have the following dataframe:
import pandas as pd

d2 = {('CAR', 'ALPHA'): pd.Series(['A22', 'A23', 'A24', 'A25'], index=[2, 3, 4, 5]),
      ('CAR', 'BETA'): pd.Series(['B22', 'B23', 'B24', 'B25'], index=[2, 3, 4, 5]),
      ('MOTOR', 'SOLO'): pd.Series(['S22', 'S23', 'S24', 'S25'], index=[2, 3, 4, 5])}
db = pd.DataFrame(data=d2)
In the columns that have 'CAR' in level 0 of the MultiIndex, I would like to delete all the values and set them to NA from a given row index on, e.g. 4.
I am trying to use .loc, and I would like the result to be saved in the same dataframe.
The second thing I would like to do is set the values of the columns whose level 0 is different from 'CAR' to NA after another row index, e.g. 3.

Use slicers for the first task; for the second, compare the level values from MultiIndex.get_level_values:
import numpy as np

idx = pd.IndexSlice
db.loc[4:, idx['CAR', :]] = np.nan
# 'AAA' is used instead of NaN here only so the change is easy to see in the output
db.loc[3:, db.columns.get_level_values(0) != 'CAR'] = 'AAA'
Or:
mask = db.columns.get_level_values(0) == 'CAR'
db.loc[4:, mask] = np.nan
db.loc[3:, ~mask] = 'AAA'
print(db)
     CAR      MOTOR
   ALPHA BETA  SOLO
2    A22  B22   S22
3    A23  B23   AAA
4    NaN  NaN   AAA
5    NaN  NaN   AAA
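If you want NA in the non-'CAR' columns as well, exactly as the question asks, the same mask works; a minimal sketch:
mask = db.columns.get_level_values(0) == 'CAR'
db.loc[4:, mask] = np.nan   # 'CAR' columns from row 4 on
db.loc[3:, ~mask] = np.nan  # all other columns from row 3 on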

Related

Identifying Duplicates and arranging corresponding data in their perspective columns using python

I have a CSV file that contains multiple duplicate entries. What I'm trying to do is gather those duplicate rows and arrange their corresponding fields.
My table:
Column A  Column B  Column C  Column D
1004      1
1004      3
1004      2
What I need:
Column A  Column B  Column C  Column D
1004      1         3         2
How do I solve this in Python?
I tried identifying the duplicates but don't know what to do next.
import pandas
df = pandas.read_csv(csv_file, names=fields, index_col=False)
df = df[df.duplicated([column1], keep=False)]
df.to_csv(csv_file2, index=False)
You can use a lambda function. First, we group by Column A:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col_a': [1004, 1004, 1004, 1005],
                        'col_b': [1, 3, 2, 2],
                        'col_c': ['', '', '', ''],
                        'col_d': ['', '', '', '']})
df = df.replace(r'^\s*$', np.nan, regex=True)  # replace empty cells with NaN
dfx = df.groupby('col_a').agg(list)
print(dfx)
           col_b            col_c            col_d
col_a
1004   [1, 3, 2]  [nan, nan, nan]  [nan, nan, nan]
1005         [2]            [nan]            [nan]
If you have several columns, you can replace the NaN values according to the index number of the values in the list:
dfx['col_c']=dfx['col_b'].apply(lambda x: x[1] if len(x) > 1 else np.nan) #get the second value from col_b
dfx['col_d']=dfx['col_b'].apply(lambda x: x[2] if len(x) > 2 else np.nan) #get the third value from col_b
dfx['col_b']=dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan) #finally replace col_b with its first value (col_b must be replaced last)
Note: if we replace col_b first, we will lose the list. This is why we are replacing col_b last.
If there are many columns, we can use a for loop:
loop_cols = [*dfx.columns[1:]]  # the list of columns except col_b
list_index = 1
list_length = 1
for i in loop_cols:
    dfx[i] = dfx['col_b'].apply(lambda x: x[list_index] if len(x) > list_length else np.nan)
    list_index += 1
    list_length += 1

# finally, replace col_b with its first value
dfx['col_b'] = dfx['col_b'].apply(lambda x: x[0] if len(x) > 0 else np.nan)
output:
       col_b  col_c  col_d
col_a
1004       1    3.0    2.0
1005       2    NaN    NaN
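If the goal is only to spread each group's col_b values across numbered columns, a groupby + cumcount + pivot sketch avoids the per-column lambdas (the target column names are assumed from the example):
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col_a': [1004, 1004, 1004, 1005],
                        'col_b': [1, 3, 2, 2]})
# number the rows inside each col_a group, then pivot those positions into columns
wide = (df.assign(pos=df.groupby('col_a').cumcount())
          .pivot(index='col_a', columns='pos', values='col_b'))
wide.columns = ['col_b', 'col_c', 'col_d'][:len(wide.columns)]
print(wide.reset_index())
#    col_a  col_b  col_c  col_d
# 0   1004    1.0    3.0    2.0
# 1   1005    2.0    NaN    NaN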

Python pandas move cell value to another cell in same row

I have a DataFrame like this:
id  Description   Price      Unit
1   Test Only     1254       12
2   Data test     Fresher    4
3   Sample        3569       1
4   Sample Onces  Code test
5   Sample        245        2
I want to move values from the Price column left into the Description column: if Price is not an integer, its string value should move to Description and Price should become NaN. There is no specific word to match on; the only rule is that a non-integer value in the Price column moves to the Description column.
I already tried pandas replace and concat, but they don't work.
Desired output is like this:
id  Description  Price  Unit
1   Test Only    1254   12
2   Fresher             4
3   Sample       3569   1
4   Code test
5   Sample       245    2
This should work:
import numpy as np
import pandas as pd

# data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Description': ['Test Only', 'Data test', 'Sample', 'Sample Onces', 'Sample'],
                   'Price': ['1254', 'Fresher', '3569', 'Code test', '245'],
                   'Unit': [12, 4, 1, np.nan, 2]})

# convert the Price column to numeric, coercing non-numeric values to NaN
price = pd.to_numeric(df.Price, errors='coerce')
# where Price is not numeric, replace Description with the Price string
df.Description = df.Description.mask(price.isna(), df.Price)
# assign the numeric prices back to the Price column
df.Price = price
df
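For reference, df then looks like this (Price becomes float because of the NaNs):
   id Description   Price  Unit
0   1   Test Only  1254.0  12.0
1   2     Fresher     NaN   4.0
2   3      Sample  3569.0   1.0
3   4   Code test     NaN   NaN
4   5      Sample   245.0   2.0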
Use:
#convert values to numeric
price = pd.to_numeric(df['Price'], errors='coerce')
#test for missing values
m = price.isna()
#shift only the matched rows one column to the left
df.loc[m, ['Description','Price']] = df.loc[m, ['Description','Price']].shift(-1, axis=1)
print (df)
  id Description Price
0  1   Test Only  1254
1  2     Fresher   NaN
2  3      Sample  3569
3  4   Code test   NaN
4  5      Sample   245
If you need numeric values in the output Price column:
df = df.assign(Price=price)
print (df)
  id Description   Price
0  1   Test Only  1254.0
1  2     Fresher     NaN
2  3      Sample  3569.0
3  4   Code test     NaN
4  5      Sample   245.0

merging varying number of rows by multiple conditions in python

Problem: merging varying number of rows by multiple conditions
Here is a stylized example of what the dataset looks like:
"index" "connector" "type" "q_text" "a_text" "varx" ...
1 1111 1 aa NA xx
2 9999 2 NA tt NA
3 1111 2 NA uu NA
4 9999 1 bb NA yy
5 9999 1 cc NA zz
Goal: how the dataset should look
"index" "connector" "type" "type.1" "q_text" "q_text.1" "a_text" "a_text.1 " "varx" "varx.1" ...
1 1111 1 2 aa NA NA uu xx NA
2 9999 1 2 bb NA NA tt yy NA
3 9999 1 2 cc NA NA tt zz NA
Logic: Column "type" has either value 1 or 2; multiple rows have value 1, but only one row (with the same value in "connector") has value 2.
If rows share the same value in "connector", the "type"=2 row is merged into the "type"=1 rows. Because multiple "type"=1 rows can have the same "connector" value, the corresponding "type"=2 row must be duplicated and merged into every other "type"=1 row with that "connector" value.
My results: not all rows are merged, because multiple rows of "type"=1 are associated with UNIQUE rows of "type"=2.
Most similar questions are answered using SQL, which I cannot use here.
df2 = df.copy()
df.index.astype(str)
df2.index.astype(str)
pd.merge(df,df2, how='left', on='connector',right_index=True, left_index=True)
df3 = pd.merge(df.set_index('connector'),df2.set_index('connector'), right_index=True, left_index=True).reset_index()
dfNew = df.merge(df2, how='left', left_on=['connector'], right_on = ['connector'])
Can I achieve my goal with groupby()?
Solution by #victor__von__doom
if __name__ == '__main__':
    df = df.groupby('connector', sort=True).apply(lambda c: list(zip(*c.values[:, 2:].tolist()))).reset_index(name='merged')
    df[['here', 'are', 'all', 'columns', 'except', 'for', 'the', 'connector', 'column']] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
First off, it is really messy to just keep concatenating new columns onto your original DataFrame when rows are merged, especially when the number of columns is very large. Furthermore, if you end up merging 3 rows for 1 connector value and 4 rows for another (for example), the only way to include all values is to make empty columns for some rows, which is never a good idea. Instead, I've made it so that the merged rows get combined into tuples, which can then be parsed efficiently while keeping the size of your DataFrame manageable:
import numpy as np
import pandas as pd

if __name__ == '__main__':
    data = np.array([[1, 2, 3, 4, 5],
                     [1111, 9999, 1111, 9999, 9999],
                     [1, 2, 2, 1, 1],
                     ['aa', 'NA', 'NA', 'bb', 'cc'],
                     ['NA', 'tt', 'uu', 'NA', 'NA'],
                     ['xx', 'NA', 'NA', 'yy', 'zz']])
    df = pd.DataFrame(data.T, columns=["index", "connector",
                                       "type", "q_text", "a_text", "varx"])
    df = df.groupby("connector", sort=True).apply(lambda c: list(zip(*c.values[:, 2:].tolist()))).reset_index(name='merged')
    df[["type", "q_text", "a_text", "varx"]] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
The final DataFrame looks like:
  connector       type        q_text        a_text          varx ...
0      1111     (1, 2)      (aa, NA)      (NA, uu)      (xx, NA) ...
1      9999  (2, 1, 1)  (NA, bb, cc)  (tt, NA, NA)  (NA, yy, zz) ...
Which is much more compact and readable.
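If you later need the tuples spread into the numbered columns from the goal ("type", "type.1", ...), a hedged sketch (the column list is assumed from the example; shorter tuples are padded with NaN, and the result stays one row per connector rather than one row per "type"=1 row):
wide = pd.concat(
    [pd.DataFrame(df[col].tolist(), index=df.index)
       .rename(columns=lambda i, c=col: c if i == 0 else f'{c}.{i}')
     for col in ['type', 'q_text', 'a_text', 'varx']],
    axis=1)
wide.insert(0, 'connector', df['connector'])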

Pandas - Conditional drop duplicates based on number of NaN

I have a Pandas 0.24.2 dataframe for Python 3.7x as below. I want to drop_duplicates() records with the same Name based on conditional logic. A similar question can be found here: Pandas - Conditional drop duplicates, but it gets more complicated in my case.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Value1': [1, np.NaN, 0, np.NaN, 1, np.NaN],
    'Value2': [np.NaN, 0, np.NaN, 1, np.NaN, 0],
    'Value3': [np.NaN, 0, np.NaN, 1, np.NaN, np.NaN]
})
How is it possible to:
Drop duplicates for records with the same 'Name', keeping the one that has fewer NaNs?
If they have the same number of NaNs, keep the one that does NOT have a NaN in 'Value1'?
The desired output would be:
Id Name Value1 Value2 Value3
2 2 B NaN 0 0
3 3 C 0 NaN NaN
4 4 A NaN 1 1
The idea is to create helper columns for both conditions, sort, and remove duplicates:
df1 = df.assign(count=df.isna().sum(axis=1),
                count_val1=df['Value1'].isna().view('i1'))
df2 = (df1.sort_values(['count', 'count_val1'])[df.columns]
          .drop_duplicates('Name')
          .sort_index())
print (df2)
Id Name Value1 Value2 Value3
1 2 B NaN 0.0 0.0
2 3 C 0.0 NaN NaN
3 4 A NaN 1.0 1.0
Here is a different solution. The goal is to create two columns that help sort the duplicate rows that will be deleted.
First, we create the columns.
df['count_nan'] = df.isnull().sum(axis=1)

Value1_nan = []
for row in df['Value1']:
    if row >= 0:   # a real number compares True
        Value1_nan.append(0)
    else:          # NaN comparisons are always False, so NaN lands here
        Value1_nan.append(1)
df['Value1_nan'] = Value1_nan
We then sort so that, within each Name, the row with the fewest NaNs comes first.
df.sort_values(by=['Name', 'count_nan', 'Value1_nan'], inplace=True, ascending=[True, True, True])
Finally, we drop every duplicate line except the "first": we keep the line with the smallest number of NaNs and, on a tie, the line that does not have a NaN in Value1.
df = df.drop_duplicates(subset = ['Name'],keep='first')
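After dropping the helper columns, the result matches the desired output; a quick check:
df = df.drop(columns=['count_nan', 'Value1_nan']).sort_index()
print(df)
#    Id Name  Value1  Value2  Value3
# 1   2    B     NaN     0.0     0.0
# 2   3    C     0.0     NaN     NaN
# 3   4    A     NaN     1.0     1.0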

Splitting dictionary/list into Separate Columns

I have a movie dataset saved for revenue prediction. The genres column of this dataset holds either a dictionary or a list of two or more dictionaries per row. The DataFrame looks similar to this (it is not the actual dataframe):
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, [{'c':4},{'d':3}], [{'c':5, 'd':6},{'c':7, 'd':8}]]})
This is the output:
a b
0 1 {'c': 1}
1 2 [{'c': 4}, {'d': 3}]
2 3 [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]
I need to split this column into separate columns.
How can I do that? I used the apply(pd.Series) method, and this is the output I'm getting:
                  0                 1    c
0               NaN               NaN  1.0
1          {'c': 4}          {'d': 3}  NaN
2  {'c': 5, 'd': 6}  {'c': 7, 'd': 8}  NaN
but I want it like this, if possible:
   a    c    d
0  1    1  NaN
1  2    4    3
2  3  5,7  6,8
I do not know if it is possible to achieve what you want by using apply(pd.Series), because you have mixed types in your 'b' column: dictionaries and lists of dictionaries. Maybe it is, not sure.
However, this is how I would do it.
First, loop over your column to build a set with all the new column names: that is, the keys of the dictionaries.
Then you can use apply with a custom function to extract the value for each column.
Notice that the values in these new columns are strings; this is needed because you want to concatenate cases like your row #2 with a comma.
newcols = set()
for el in df['b']:
    if isinstance(el, dict):
        newcols.update(el.keys())
    elif isinstance(el, list):
        for i in el:
            newcols.update(i.keys())

def extractvalues(x, col):
    if isinstance(x['b'], dict):
        return x['b'].get(col, np.nan)
    elif isinstance(x['b'], list):
        return ','.join(str(i.get(col, '')) for i in x['b']).strip(',')

for nc in newcols:
    df[nc] = df.apply(lambda r: extractvalues(r, nc), axis=1)

df.drop('b', axis=1, inplace=True)
Your dataframe is now:
   a    c    d
0  1    1  NaN
1  2    4    3
2  3  5,7  6,8
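An alternative sketch using explode (available in pandas >= 0.25), assuming every cell in 'b' is either a dict or a list of dicts:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [{'c': 1}, [{'c': 4}, {'d': 3}],
                         [{'c': 5, 'd': 6}, {'c': 7, 'd': 8}]]})
# normalize every cell to a list of dicts, explode to one dict per row,
# expand the dicts into columns, then comma-join duplicate keys per original row
s = df['b'].apply(lambda x: x if isinstance(x, list) else [x])
flat = s.explode().apply(pd.Series)
joined = flat.groupby(level=0).agg(
    lambda c: ','.join(c.dropna().astype(int).astype(str)) or np.nan)
out = pd.concat([df[['a']], joined], axis=1)
print(out)
#    a    c    d
# 0  1    1  NaN
# 1  2    4    3
# 2  3  5,7  6,8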
