In Python Pandas, how do I combine two columns containing strings using if/else statement or similar? - python-3.x

I have created a pandas dataframe from an Excel file whose first two columns are:
df = pd.DataFrame({'0':['','','Location Code','pH','Ag','Alkalinity'], '1':['Lab Id','Collection Date','','','µg/L','mg/L']})
which looks like this:
   df[0]          df[1]
                  Lab Id
                  Collection Date
   Location Code
   pH
   Ag             µg/L
   Alkalinity     mg/L
I want to merge these columns into one that looks like this:
df[0]
Lab Id
Collection Date
Location Code
pH
Ag (µg/L)
Alkalinity (mg/L)
I believe I need a control statement before combining df[0] and df[1] which would appear like this:
if there is a blank value in either column, then it performs:
df[0] = df[0].astype(str)+df[1].astype(str)
else:
df[0] = df[0].astype(str)+' ('+df[1].astype(str)+')'
but I am not sure how to write the if statement. Could anyone please guide me here?
Thank you very much.

We can try np.select:
import numpy as np

cond = [(df['0']=='') & (df['1']!=''),
        (df['0']!='') & (df['1']==''),
        (df['0']!='') & (df['1']!='')]
val = [df['1'], df['0'], df['0']+'('+df['1']+')']
df['new'] = np.select(cond, val)
df
   0              1                new
0                 Lab Id           Lab Id
1                 Collection Date  Collection Date
2  Location Code                   Location Code
3  pH                              pH
4  Ag             µg/L             Ag(µg/L)
5  Alkalinity     mg/L             Alkalinity(mg/L)
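One hedged refinement: if a row could ever be blank in both columns, none of the three conditions would match and np.select would fall back to its default of 0. Passing a string default keeps the column uniform:
df['new'] = np.select(cond, val, default='')  # default='' covers the both-blank case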

If the blank values are NaN, maybe:
df['result'] = df[0].fillna(df[1])
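In this question's frame, though, the blanks are empty strings rather than NaN, so a minimal sketch would convert them first (the replace('', np.nan) step is the extra assumption here); note this picks whichever column is filled but does not add the (unit) parentheses:
import numpy as np
import pandas as pd

df = pd.DataFrame({0: ['', '', 'Location Code', 'pH', 'Ag', 'Alkalinity'],
                   1: ['Lab Id', 'Collection Date', '', '', 'µg/L', 'mg/L']})

# turn empty strings into NaN so fillna can see the gaps
df['result'] = df[0].replace('', np.nan).fillna(df[1])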

This works using numpy's where; the string concatenation assumption is based on the data shared:
import numpy as np

df.assign(
    merger=np.where(
        df["1"].str.endswith("/L"),
        df["0"].str.cat(df["1"], "(").add(")"),
        df["0"].str.cat(df["1"], ""),
    )
)
   0              1                merger
0                 Lab Id           Lab Id
1                 Collection Date  Collection Date
2  Location Code                   Location Code
3  pH                              pH
4  Ag             µg/L             Ag(µg/L)
5  Alkalinity     mg/L             Alkalinity(mg/L)
Or, you could just assign it to "0", if that is what you are after:
df["0"] = np.where(
    df["1"].str.endswith("/L"),
    df["0"].str.cat(df["1"], "(").add(")"),
    df["0"].str.cat(df["1"], ""),
)

Here is another way:
First, wrap the values you are going to concatenate in parentheses, but only in rows where both columns are filled:
import numpy as np

mask = df.replace('', np.nan).notnull().all(axis=1)
df.loc[mask, '1'] = '(' + df['1'] + ')'
Now we fill in the missing values with bfill and ffill:
df = df.replace('', np.nan).bfill(axis=1).ffill(axis=1)
The only thing remaining is to merge the values wherever we have brackets:
df.loc[:, 'merge'] = np.where(df['1'].str.endswith(')'), df['0'] + df['1'], df['1'])
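Put together as a runnable sketch (assuming the string-keyed example frame from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'0': ['', '', 'Location Code', 'pH', 'Ag', 'Alkalinity'],
                   '1': ['Lab Id', 'Collection Date', '', '', 'µg/L', 'mg/L']})

# wrap units in parentheses only where both columns are filled
mask = df.replace('', np.nan).notnull().all(axis=1)
df.loc[mask, '1'] = '(' + df['1'] + ')'

# let the non-empty value of each row flow across both columns
df = df.replace('', np.nan).bfill(axis=1).ffill(axis=1)

# merge wherever a bracketed unit is present
df.loc[:, 'merge'] = np.where(df['1'].str.endswith(')'), df['0'] + df['1'], df['1'])
print(df['merge'])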

Test if there is an empty value in at least one of columns 0, 1 with DataFrame.eq and DataFrame.any, and then join both columns like in your answer inside numpy.where:
import numpy as np
import pandas as pd

df = pd.DataFrame({0:['','','Location Code','pH','Ag','Alkalinity'],
                   1:['Lab Id','Collection Date','','','µg/L','mg/L']})
print (df[[0,1]].eq(''))
       0      1
0   True  False
1   True  False
2  False   True
3  False   True
4  False  False
5  False  False
print (df[[0,1]].eq('').any(axis=1))
0 True
1 True
2 True
3 True
4 False
5 False
dtype: bool
df[0] = np.where(df[[0,1]].eq('').any(axis=1),
                 df[0].astype(str)+df[1].astype(str),
                 df[0].astype(str)+' ('+df[1].astype(str)+')')
print (df)
   0                  1
0  Lab Id             Lab Id
1  Collection Date    Collection Date
2  Location Code
3  pH
4  Ag (µg/L)          µg/L
5  Alkalinity (mg/L)  mg/L

Related

Python and Pandas, find rows that contain value, target column has many sets of ranges

I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent an inclusive range: for example, "row 4" contains the following numbers:
2409,2410,2411,2412,2413,2414,2377,2378,1478,1479,1480,1481,1482,1483,1484. And in "row 0" and "row 1" the range for one of the sets is backwards (1931,1930,1929).
If I want to know which rows have sets that contain "2340" and "1930", for example, how would I do this? I think a loop is needed, since sometimes I will need to query more than just two numbers. Using Python 3.8.
Example Dataframe
x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
'1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
'2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
'2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
'2409:2414,2377:2378,1478:1484',
'2474:2476',
]
y = [6.48,7.02,7.02,6.55,5.99,6.39,]
df = pd.DataFrame(list(zip(x, y)), columns =['ids', 'val'])
display(df)
Desired Output Dataframe
I would write a function that performs 2 steps:
Given the ids_string that contains the ranges of ids, expand it into the full set of ids, ids_num_list
Check if the query_id is in the ids_num_list
def check_num_in_ids_string(ids_string, query_id):
    # Convert ids_string to ids_num_list
    ids_range_list = ids_string.split(',')
    ids_num_list = set()
    for ids_range in ids_range_list:
        if ':' in ids_range:
            # sort numerically so backwards ranges like '1931:1928' work
            lower, upper = sorted(ids_range.split(':'), key=int)
            num_list = list(range(int(lower), int(upper) + 1))
            ids_num_list.update(num_list)
        else:
            ids_num_list.add(int(ids_range))
    # Check if query number is in the set
    if int(query_id) in ids_num_list:
        return 1
    else:
        return 0
# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
    df[f'n{query_id}'] = (
        df['ids']
        .apply(lambda x: check_num_in_ids_string(x, query_id))
    )
which returns you what you require:
ids val n2340 n1930
0 1331:1332,1552:1551,1931:1928,1965:1973,1831:1... 6.48 0 1
1 1331:1332,1552:1551,1931:1929,180:178,1966:197... 7.02 0 1
2 2340:2341,1142:1143,1594:1593,1597:1596,1310,1311 7.02 1 0
3 2339:2341,1142:1143,1594:1593,1597:1596,1310:1... 6.55 1 0
4 2409:2414,2377:2378,1478:1484 5.99 0 0
5 2474:2476 6.39 0 0

conditionally multiply values in DataFrame row

Here is an example DataFrame:
df = pd.DataFrame([[1,0.5,-0.3],[0,-4,7],[1,0.12,-.06]], columns=['condition','value1','value2'])
I would like to apply a function which multiplies the values ('value1' and 'value2') in each row by 100, if the value in the 'condition' column of that row is equal to 1; otherwise, the row is left as is.
Presumably some usage of .apply with a lambda function would work here, but I am not able to get the syntax right, e.g.
df.apply(lambda x: 100*x if x['condition'] == 1, axis=1)
will not work
The desired output after applying this operation would be:
As simple as
df.loc[df.condition==1, 'value1':] *= 100
import numpy as np
df['value1'] = np.where(df['condition']==1, df['value1']*100, df['value1'])
df['value2'] = np.where(df['condition']==1, df['value2']*100, df['value2'])
In case of multiple columns:
# create a list of columns you want to apply the condition to
columns_list = ['value1','value2']
for i in columns_list:
    df[i] = np.where(df['condition']==1, df[i]*100, df[i])
Use df.loc[] with the condition, filter the list of cols to operate on, then multiply:
l=['value1','value2'] #list of cols to operate on
df.loc[df.condition.eq(1),l]=df.mul(100)
#if condition is just 0 and 1 -> df.loc[df.condition.astype(bool),l]=df.mul(100)
print(df)
Another solution using df.mask() using same list of cols as above:
df[l]=df[l].mask(df.condition.eq(1),df[l]*100)
print(df)
   condition  value1  value2
0          1    50.0   -30.0
1          0    -4.0     7.0
2          1    12.0    -6.0
Use a mask to filter: where it is True, np.where chooses the second argument; where it is False, it chooses the third. That is how np.where works:
import numpy as np

value_cols = ['value1','value2']
mask = (df.condition == 1)
df[value_cols] = np.where(mask.to_numpy()[:, None], df[value_cols].mul(100), df[value_cols])
If you have multiple value columns such as value1, value2 ... and so on, use
value_cols = df.filter(regex=r'value\d').columns
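As an end-to-end sketch of that mask-plus-np.where approach (assuming the example frame from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0.5, -0.3], [0, -4, 7], [1, 0.12, -0.06]],
                  columns=['condition', 'value1', 'value2'])

value_cols = df.filter(regex=r'value\d').columns   # picks value1, value2
mask = (df.condition == 1)
# broadcast the (3,) mask to (3,1) so it pairs with the two value columns
df[value_cols] = np.where(mask.to_numpy()[:, None],
                          df[value_cols].mul(100),
                          df[value_cols])
print(df)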

Removing repetitive/duplicate occurance in excel using python

I am trying to remove the repetitive/duplicate names that come under the NAME column. I just want to keep the 1st occurrence of each repeated name, using a Python script.
This is my input excel:
And I need output like this:
This isn't removing duplicates per se; you're just blanking out duplicate keys in one column. I would handle this as follows:
by creating a mask that returns a True/False boolean for whether each row is == the row above.
assuming your dataframe is called df
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask,'NAME'] = ''
Explanation:
What we are doing above is the following:
first we select a single column, or in pandas terminology a Series; we then apply .ne (not equal to), which in effect is !=.
Let's see this in action.
import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha', 'Jaya', 'Jaya', 'Sushma', 'Nita', 'Nita', 'Nita']
defaults = ['', '', 'c-default', '', '', 'c-default', '', '']
classes = ['forth', 'third', 'forth', 'fifth', 'fourth', 'third', 'fifth', 'fourth']
Now, let's create a dataframe similar to yours.
df = pd.DataFrame({'NAME': names,
                   'DEFAULT': defaults,
                   'CLASS': classes,
                   'AGE': [np.random.randint(1, 5) for _ in names],
                   'GROUP': [np.random.randint(1, 5) for _ in names]})  # being lazy with your age and group variables
So, if we did df['NAME'].ne('Omar'), which is the same as df['NAME'] != 'Omar', we would get:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
So, with that out of the way, we want to see if the name in row 1 (remember Python is a 0-indexed language, so row 1 is actually the 2nd physical row) is equal to the row above.
We do this by calling .shift.
What this basically does is shift the rows along the index by a defined number; let's call it n.
If we called df['NAME'].shift(1):
0 NaN
1 Rekha
2 Rekha
3 Jaya
4 Jaya
5 Sushma
6 Nita
7 Nita
We can see here that Rekha has moved down one row.
So, putting that all together:
df['NAME'].ne(df['NAME'].shift())
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
We assign this to a variable called mask (you could call it whatever you want).
We then use .loc, which lets you access your dataframe by labels or a boolean array, in this instance an array.
However, we only want to access the rows where the mask is False, so we use ~, which inverts the logic of our array, as shown below.
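Selecting those rows (continuing with the df and mask defined above):
df.loc[~mask]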
    NAME DEFAULT   CLASS  AGE  GROUP
1  Rekha           third    1      4
3   Jaya           fifth    1      1
6   Nita           fifth    1      2
7   Nita          fourth    1      4
All we need to do now is change these rows to blanks, as per your initial requirement, and we are left with:
     NAME    DEFAULT   CLASS  AGE  GROUP
0   Rekha              forth    2      2
1                      third    1      4
2    Jaya  c-default   forth    3      3
3                      fifth    1      1
4  Sushma             fourth    3      1
5    Nita  c-default   third    4      2
6                      fifth    1      2
7                     fourth    1      4
hope that helps!

Looking for NaN values in a specific column in df [duplicate]

Now I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now I am checking that dataframe across multiple columns. The first value is the column name (column1), the next value is the specific value I am looking for in that column (banana). I am then checking another column (column2) for a specific value (green). If both of these are true I want to carry out a specific task; if it is false I want to do something else.
so something like:
if (df['column1']=='banana') & (df['colour']=='green'):
    do something
else:
    do something
If you want to check if any row of the DataFrame meets your conditions, you can use .any() along with your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
   A  B
0  1  2
1  3  4
2  5  6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
That happens because, in pandas, when you compare a series against a scalar value, the result is a series of True/False values indicating the outcome of the comparison for each row. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And the & does a row-wise AND of the two series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check if any of the values from this series is True, you can use .any(); to check if all the values in the series are True, you can use .all().
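For completeness, a quick sketch of .all() on the same example (only the first row matches, so the result is False):
In [21]: ((df['A']==1) & (df['B'] == 2)).all()
Out[21]: False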

Pandas "countif" based on column value and multiindex

I have a DataFrame that looks like:
Where I have YEAR and RACEETHN as a multiindex. I want to count the number of "1" values (note, the data are not only 0 and 1, so I cannot just sum) for each YEAR and RACEETHN combination, for each column variable.
I am able to count where value = 1 for each column by doing this:
(df_3.ACSUPPSV == 1).sum()
(df_3.PSEDSUPPSV == 1).sum()
I want to do this with groupby, but am unable to get it to work. I've tried the following code to test if I could do it on a single column 'ACSUPPSV', and it did not work:
df.groupby(['YEAR', 'RACEETHN']).loc[df.ACSUPPSV == 1, 'ACSUPPSV'].count()
I exported the data to excel and was able to calculate this with a quick "COUNTIF" formula, but I know there must be a way to do this in pandas - the results from excel look like:
Would appreciate if someone had a better way to do this than export to Excel! :)
I think you need agg with a custom function to count only the 1 values:
df_3 = pd.DataFrame({'ACSUPPSV':[1,1,1,1,0,1],
                     'PSEDSUPPSV':[1,1,0,1,0,0],
                     'BUDGETSV':[1,0,1,1,1,0],
                     'YEAR':[2000,2000,2001,2000,2000,2000],
                     'RACEETHN':list('aaabbb')}).set_index(['YEAR','RACEETHN'])
print (df_3)
               ACSUPPSV  BUDGETSV  PSEDSUPPSV
YEAR RACEETHN
2000 a                1         1           1
     a                1         0           1
2001 a                1         1           0
2000 b                1         1           1
     b                0         1           0
     b                1         0           0
df2 = df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())
print (df2)
               ACSUPPSV  BUDGETSV  PSEDSUPPSV
YEAR RACEETHN
2000 a                2         1           2
     b                2         2           1
2001 a                1         1           0
Old answer:
df_3[((df_3.ACSUPPSV == 1) & (df_3.PSEDSUPPSV == 1))].groupby(['YEAR', 'RACEETHN']).size()
df_3.query('ACSUPPSV == 1 & PSEDSUPPSV == 1').groupby(['YEAR', 'RACEETHN']).size()
More general:
cols = ['ACSUPPSV','PSEDSUPPSV']
df_3[(df_3[cols] == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
For all columns:
df_3[(df_3 == 1).all(axis=1)].groupby(['YEAR', 'RACEETHN']).size()
EDIT:
Or maybe need:
df_3.groupby(['YEAR', 'RACEETHN']).agg(lambda x: (x == 1).sum())
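An equivalent spelling (a sketch, assuming the same df_3) that avoids the Python-level lambda by summing a boolean frame:
df_3.eq(1).groupby(level=['YEAR', 'RACEETHN']).sum()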
