pandas df: change values in column A only for rows that are unique in column B - python-3.x

It sounds like a trivial question and I'd expected to find a quick answer, but didn't have much success.
I have a dataframe population with columns A and B. I want to change the value in B to 1, but only for rows whose value in column A is unique (currently all rows in B hold the value 0).
I tried:
small_towns = population['A'].value_counts() == 1
population[population['A'] in small_towns]['B']=1
and got: 'Series' objects are mutable, thus they cannot be hashed
I also tried:
population.loc[population['A'].value_counts() == 1, population['B']] = 1
and got the same error, along with an additional pandas.core.indexing.IndexingError.
Any ideas?
Thanks in advance,
Ben

We can use Series.duplicated with keep=False: this returns a Series with True for all duplicated values and False for the rest. We can then negate it and use DataFrame.loc[] to assign 1 to the rows whose value in A is not duplicated:
population.loc[~population['A'].duplicated(keep=False), 'B'] = 1
#population.loc[~population.duplicated(subset = 'A', keep=False), 'B'] = 1
We can also use Series.where or Series.mask
population['B'] = population['B'].where(population['A'].duplicated(keep=False), 1)
#population['B'] = population['B'].mask(~population['A'].duplicated(keep=False), 1)
But if you just want to create column B with 1s and 0s, you can simply do:
population['B'] = (~population['A'].duplicated(keep=False)).astype(int)
or
population['B'] = np.where(population['A'].duplicated(keep=False), 0, 1)
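For example, on a small made-up frame (the values are illustrative, not from the question):
import pandas as pd

population = pd.DataFrame({'A': ['a', 'a', 'b', 'c'], 'B': 0})
population.loc[~population['A'].duplicated(keep=False), 'B'] = 1
print(population)
#    A  B
# 0  a  0
# 1  a  0
# 2  b  1
# 3  c  1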

Related

Find and Add Missing Column Values Based on Index Increment Python Pandas Dataframe

Good Afternoon!
I have a pandas dataframe with an index and a count.
dictionary = {1: 5, 2: 10, 4: 3, 5: 2}
df = pd.DataFrame.from_dict(dictionary, orient='index', columns=['count'])
What I want to do is check, from df.index.min() to df.index.max(), that the index increments by 1. If a value is missing (in my case the index 3 is missing), then I want to add it to the index with a 0 in the count.
The output will look like the below df2 but done in a programmatic fashion so I can use it on a much bigger dataframe.
RESULTS EXAMPLE DF:
dictionary2 = {1: 5, 2: 10, 3: 0, 4: 3, 5: 2}
df2 = pd.DataFrame.from_dict(dictionary2, orient='index', columns=['count'])
Thank you much!!!
Ensure the index is sorted:
df = df.sort_index()
Create an array that runs from the minimum index to the maximum index:
complete_array = np.arange(df.index.min(), df.index.max() + 1)
Reindex, fill the null value with 0, and optionally change the dtype to Pandas Int:
df.reindex(complete_array, fill_value=0).astype("Int16")
   count
1      5
2     10
3      0
4      3
5      2
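Putting the three steps together into one runnable snippet (using the dictionary from the question):
import numpy as np
import pandas as pd

dictionary = {1: 5, 2: 10, 4: 3, 5: 2}
df = pd.DataFrame.from_dict(dictionary, orient='index', columns=['count'])

df = df.sort_index()
complete_array = np.arange(df.index.min(), df.index.max() + 1)
df2 = df.reindex(complete_array, fill_value=0).astype("Int16")
print(df2)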

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, stated another way: row index 2 would be removed because row index 0 has the same information in columns A and B, plus an X in column C.
As this data is slightly large, I hope to avoid iterating over rows if possible. The ignore_index option is the closest thing I've found in the built-in drop_duplicates().
If there is no X in column C then the row should require that C is identical to be deduplicated.
In the case where rows have matching A and B but there are multiple versions containing an X in C, the following would be expected.
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "0X0"],
                        [1, 2, "X00"],
                        [1, 2, "0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 for the condition where the values in columns A and B are not duplicated. Then use Series.str.contains + Series.duplicated on column C to create a boolean mask m2 for the condition where C contains the string X and is not duplicated. Finally, use these masks to filter the rows of df.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
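If this needs to be applied to several frames, the same two-mask logic can be wrapped in a small helper (just a sketch; the function name and defaults are illustrative, not part of the original answer):
import pandas as pd

def dedup_keep_x(frame, keys=('A', 'B'), col='C', token='X'):
    # keep the first occurrence of each key combination ...
    m1 = ~frame[list(keys)].duplicated()
    # ... plus any non-duplicated value in `col` that contains the token
    m2 = frame[col].str.contains(token) & ~frame[col].duplicated()
    return frame[m1 | m2]

df_dedup = dedup_keep_x(df)   # df as defined in the question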
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X, or 0 if it does not. Then just sort the values using sort_values and finally use drop_duplicates with keep='last'.
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This is assuming you also want to drop duplicates in case there is no X in the 'C' column among the duplicates of columns A and B.
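For the first sample frame, this should leave the (1, 3) row and the X-bearing (1, 2) row; the helper column D can then be dropped:
print(df.drop(columns='D'))
#    A  B    C
# 1  1  3  010
# 0  1  2  00X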
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1
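If the helper column should not appear in the final result, it can be dropped after filtering (a minor variation on the line above):
result = df.loc[m1 | m2].drop(columns='count')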

Conditionally concatenate variable names into a new variable in Python

I have a data set with 3 columns and occasional NAs. I am trying to create a new string column called 'check' that concatenates, between underscores ('_'), the names of the variables that are not NA in each row. I pasted my code below as well as the data that I have, the data that I need, and what I actually get (see the hyperlinks after the code). For some reason, the conditional I have in place seems to be completely ignored, and example_set['check'] = example_set['check'] + column is executed on every loop iteration, with or without the conditional code block. I assume there is a Python/Pandas quirk that I haven't fully comprehended... Can you please help?
example_set = pd.DataFrame({
    'A': [3, 4, np.nan],
    'B': [1, np.nan, np.nan],
    'C': [3, 4, 5],
})
example_set
columns = list(example_set.columns)
example_set['check'] = '_'
for column in columns:
    for row in range(example_set.shape[0]):
        if example_set[column][row] != np.nan:
            example_set['check'] = example_set['check'] + column
        else:
            continue
example_set
Data that I have
Data that I was hoping to get
What I actually get
Find the rows that have null values, iterate the values with numpy compress, get the difference of the iteration from the columns, format the strings to your taste and create a new column:
columns = example_set.columns
example_set['check'] = [f'_{"".join(columns.difference(np.compress(boolean,columns)))}_'
for boolean in example_set.isna().to_numpy()]
     A    B  C  check
0  3.0  1.0  3  _ABC_
1  4.0  NaN  4   _AC_
2  NaN  NaN  5    _C_
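For reference, np.compress simply selects the column names where the boolean mask is True, and Index.difference then yields the remaining (non-null) columns. A tiny illustration using row 1 of the example, where only B is NaN:
import numpy as np
import pandas as pd

columns = pd.Index(['A', 'B', 'C'])
row_is_na = np.array([False, True, False])            # row 1: only B is NaN
np.compress(row_is_na, columns)                       # array(['B'], dtype=object)
columns.difference(np.compress(row_is_na, columns))   # Index(['A', 'C'], dtype='object')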
The simplest strategy is to copy df and create a new column in it. Iterate over the rows of the old df, filter out the NaN cells, and take the remaining index labels.
Join them into a string and put those values into the new df.
It is probably not the most efficient method, but it should be easy to understand.
Here is some code to get you going:
nset = example_set.copy()
nset["checked"] = "__"
for s in range(example_set.shape[0]):
    serie = example_set.iloc[s]
    nserie = serie[serie.notnull()]
    names = "".join(nserie.index.tolist())
    nset.at[s, "checked"] = "__" + names + "__"
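For the example data, this loop should produce the following (note that this answer wraps the names in double underscores, unlike the single underscores in the desired output):
print(nset)
#      A    B  C  checked
# 0  3.0  1.0  3  __ABC__
# 1  4.0  NaN  4   __AC__
# 2  NaN  NaN  5    __C__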
Please try:
import math
import numpy as np

example_set = pd.DataFrame({
    'A': [3, 4, np.nan],
    'B': [1, np.nan, np.nan],
    'C': [3, 4, 5],
})
example_set['check'] = '_ABC_'
for i in range(len(example_set)):
    list_ = example_set.iloc[i].values.tolist()
    if math.isnan(list_[0]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('A', '')
    if math.isnan(list_[1]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('B', '')
    if math.isnan(list_[2]):
        example_set.at[i, 'check'] = example_set.at[i, 'check'].replace('C', '')
Output:
     A    B  C  check
0  3.0  1.0  3  _ABC_
1  4.0  NaN  4   _AC_
2  NaN  NaN  5    _C_

conditionally multiply values in DataFrame row

here is an example DataFrame:
df = pd.DataFrame([[1,0.5,-0.3],[0,-4,7],[1,0.12,-.06]], columns=['condition','value1','value2'])
I would like to apply a function which multiplies the values ('value1' and 'value2') in each row by 100 if the value in the 'condition' column of that row is equal to 1; otherwise, they are left as is.
presumably some usage of .apply with a lambda function would work here but I am not able to get the syntax right. e.g.
df.apply(lambda x: 100*x if x['condition'] == 1, axis=1)
will not work
the desired output after applying this operation would be:
   condition  value1  value2
0          1    50.0   -30.0
1          0    -4.0     7.0
2          1    12.0    -6.0
As simple as
df.loc[df.condition==1,'value1':]*=100
import numpy as np
df['value1'] = np.where(df['condition']==1, df['value1']*100, df['value1'])
df['value2'] = np.where(df['condition']==1, df['value2']*100, df['value2'])
In case of multiple columns:
# create a list of columns you want to apply the condition to
columns_list = ['value1', 'value2']
for i in columns_list:
    df[i] = np.where(df['condition']==1, df[i]*100, df[i])
Use df.loc[] with the condition and filter the list of cols to operate then multiply:
l=['value1','value2'] #list of cols to operate on
df.loc[df.condition.eq(1),l]=df.mul(100)
#if condition is just 0 and 1 -> df.loc[df.condition.astype(bool),l]=df.mul(100)
print(df)
Another solution using df.mask() using same list of cols as above:
df[l]=df[l].mask(df.condition.eq(1),df[l]*100)
print(df)
   condition  value1  value2
0          1    50.0   -30.0
1          0    -4.0     7.0
2          1    12.0    -6.0
np.where takes a mask: where the mask is True it chooses the second argument, and where it is False it chooses the third argument.
value_cols = ['value1', 'value2']
mask = (df.condition == 1)
# reshape the mask to a column so it broadcasts across both value columns
df[value_cols] = np.where(mask.to_numpy()[:, None], df[value_cols].mul(100), df[value_cols])
If you have multiple value columns such as value1, value2 and so on, use:
value_cols = df.filter(regex=r'value\d').columns
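The regex-selected columns can then be fed straight into the same vectorized call; a self-contained sketch using the example frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0.5, -0.3], [0, -4, 7], [1, 0.12, -.06]],
                  columns=['condition', 'value1', 'value2'])
value_cols = df.filter(regex=r'value\d').columns
mask = df['condition'].eq(1)
df[value_cols] = np.where(mask.to_numpy()[:, None], df[value_cols].mul(100), df[value_cols])
print(df)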

Looking for NaN values in a specific column in df [duplicate]

Now I know how to check the dataframe for specific values across multiple columns. However, I can't seem to work out how to carry out an if statement based on a boolean response.
For example:
Walk directories using os.walk and read in a specific file into a dataframe.
for root, dirs, files in os.walk(main):
    filters = '*specificfile.csv'
    for filename in fnmatch.filter(files, filters):
        df = pd.read_csv(os.path.join(root, filename), error_bad_lines=False)
Now checking that dataframe across multiple columns. The first value being the column name (column1), the next value is the specific value I am looking for in that column(banana). I am then checking another column (column2) for a specific value (green). If both of these are true I want to carry out a specific task. However if it is false I want to do something else.
so something like:
if (df['column1']=='banana') & (df['colour']=='green'):
    do something
else:
    do something
If you want to check whether any row of the DataFrame meets your conditions, you can use .any() along with your condition. Example -
if ((df['column1']=='banana') & (df['colour']=='green')).any():
Example -
In [16]: df
Out[16]:
A B
0 1 2
1 3 4
2 5 6
In [17]: ((df['A']==1) & (df['B'] == 2)).any()
Out[17]: True
This is because your condition - ((df['column1']=='banana') & (df['colour']=='green')) - returns a Series of True/False values.
In pandas, when you compare a Series against a scalar value, each row of the Series is compared against that scalar, and the result is a Series of True/False values indicating the outcome for each row. Example -
In [19]: (df['A']==1)
Out[19]:
0 True
1 False
2 False
Name: A, dtype: bool
In [20]: (df['B'] == 2)
Out[20]:
0 True
1 False
2 False
Name: B, dtype: bool
And the & does a row-wise "and" of the two series. Example -
In [18]: ((df['A']==1) & (df['B'] == 2))
Out[18]:
0 True
1 False
2 False
dtype: bool
Now, to check whether any of the values in this series is True, you can use .any(); to check whether all of the values are True, you can use .all().
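Applied to the question's own columns, the pattern would look something like this (the branch bodies are placeholders):
mask = (df['column1'] == 'banana') & (df['colour'] == 'green')
if mask.any():
    # at least one row has column1 == 'banana' and colour == 'green'
    ...  # do something
else:
    ...  # do something else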
