Apply a lambda function to iterate over two columns - python-3.x

I have a pandas DataFrame:
df = pd.DataFrame({'61 - 90': [np.nan, 14, np.nan, 9, 34, np.nan],
                   '91 and over': [np.nan, 10, np.nan, 1, np.nan, 9]})
I am trying to apply a lambda function that returns False when BOTH columns of a row are NaN. My attempt at solving this:
df['not_na'] = df[['61 - 90', '91 and over']].apply(lambda x: False if pd.isna(x) else True)
The error message I receive:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 61 - 90')

Why don't you do:
df['not_na'] = df[['61 - 90', '91 and over']].notnull().any(axis=1)

To apply a lambda function element-wise over a DataFrame, use applymap (deprecated since pandas 2.1, where it was renamed to DataFrame.map):
df[['61 - 90', '91 and over']].applymap(lambda x: False if pd.isna(x) else True)
Note that this returns one boolean per cell rather than one per row. The documentation for applymap is available at the link below:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
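If you do want a lambda that looks at both columns of a row at once, here is a minimal sketch, assuming the same two column names as in the question. Passing axis=1 hands each row to the lambda as a Series, and .all() reduces the per-cell NaN check to a single boolean, which avoids the ambiguous truth value error. The name `vectorized` is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'61 - 90': [np.nan, 14, np.nan, 9, 34, np.nan],
                   '91 and over': [np.nan, 10, np.nan, 1, np.nan, 9]})
cols = ['61 - 90', '91 and over']

# axis=1 passes each row (a Series of two values) to the lambda;
# .all() collapses the two NaN checks into one boolean per row.
df['not_na'] = df[cols].apply(lambda row: not pd.isna(row).all(), axis=1)

# Vectorized equivalent without a lambda.
vectorized = df[cols].notna().any(axis=1)

print(df['not_na'].tolist())
# [False, True, False, True, True, True]
```

The vectorized form is usually the better choice on large frames, since apply with axis=1 calls the lambda once per row.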

Related

Efficient way of selecting rows of a 2D NumPy array that fall into some interval

I would like to select the rows of a 2D array whose values in certain columns fall inside an interval object.
Say I have an interval object intv and a 2d numpy array X:
intv
# Interval(1.5, 2, closed='right')
X
# array([[1, 1, 2],
# [2, 2, 4],
# [3, 3, 9]])
At first I tried:
X[X[:,0] in intv]
However, an exception was thrown:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Then I tried list comprehension, and it works:
sel_idx = [X[j,0] in intv for j in range(len(X[:,0])) ]
X[sel_idx]
I wonder if there is a computationally inexpensive and elegant way of solving this problem?
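One possible vectorized approach, sketched under the assumption that intv is a pandas Interval (which the repr above suggests): compare the whole column against the interval's endpoints and combine the two comparisons into a boolean mask. The helper name in_interval is made up for this example and is not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

intv = pd.Interval(1.5, 2, closed='right')
X = np.array([[1, 1, 2],
              [2, 2, 4],
              [3, 3, 9]])

def in_interval(values, interval):
    """Hypothetical helper: vectorized membership test against one pd.Interval."""
    # Use a strict or non-strict comparison depending on which ends are closed.
    lower = values >= interval.left if interval.closed_left else values > interval.left
    upper = values <= interval.right if interval.closed_right else values < interval.right
    return lower & upper

mask = in_interval(X[:, 0], intv)  # one boolean per row
selected = X[mask]
print(selected)
# [[2 2 4]]
```

This replaces the Python-level loop of the list comprehension with two array comparisons, which should scale better for large arrays.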

Append pandas dataframe from other pandas dataframe that are values of a dictionary

I have a dictionary with N pairs (key,value), where N is unknown; each value is a pandas dataframe that contains a different set of columns. For example:
d = {'DF1': pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c']),
     'DF2': pd.DataFrame(np.array([[10, 11, 12], [13, 14, 15]]), columns=['d', 'e', 'f'])}
I would like to append all the dataframes in the dictionary to a third, initially empty dataframe, because I have to save them all to a single parquet file. But with the following lines of code, df3 stays empty:
df3 = pd.DataFrame()
for key in d:
    df3.append(d[key], ignore_index=True)
How can I append all the dataframes to df3?
UPDATE 1: the dataframes in the dictionary may have columns in common.
Try:
v = list(d.values())
df3 = v[0]
for el in v[1:]:
    df3 = pd.concat([df3, el])
df3 = df3.reset_index(drop=True)
Or simpler, per your comment:
df3 = pd.concat(d.values(), axis=0).reset_index(drop=True)
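I think a better approach is to keep your for loop but concatenate the dataframes with pd.concat. DataFrame.append returned a new dataframe instead of modifying df3 in place, so its result has to be assigned back, and it was removed altogether in pandas 2.0. Here's a link to the docs for pd.concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html. The catch is to make sure you concatenate along the right axis (0 for rows, 1 for columns)!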
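To illustrate what happens when the dataframes only partly share columns, here is a small sketch with made-up data (the frames below are not the ones from the question). Concatenating along axis=0 stacks the rows, takes the union of the columns, and fills the missing cells with NaN.

```python
import numpy as np
import pandas as pd

# Made-up frames: 'c' is common to both, the other columns are not.
d = {'DF1': pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c']),
     'DF2': pd.DataFrame(np.array([[10, 11], [13, 14]]), columns=['c', 'd'])}

# Stack all frames row-wise; ignore_index=True renumbers the rows 0..n-1.
df3 = pd.concat(list(d.values()), axis=0, ignore_index=True)
print(df3)
```

From there, df3.to_parquet(...) would be the next step; it is left out here because it needs a parquet engine such as pyarrow to be installed.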

Python print/display only if the sum is not zero

I have a dataframe below:
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': ['-3', '3', '1', '5']})
I am using groupby and sum.
final = df.groupby(['Product'])['Volume'].sum().reset_index()
print(final)
This is ok.
But I only want to print the rows where the sum != 0, such as Products C and D.
Any idea how I can do that?
I tried:
if final != 0:
    print(final)
But this throws an error, and usually when I get this error, the syntax is definitely wrong...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Your data frame has Volume as strings; is that intended? Summing strings concatenates them ('-3' + '3' gives '-33'), so if you want to sum the values as numbers, convert them to numbers first and then apply the filter.
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': ['-3', '3', '1', '5']})
# convert from strings to integers
df['Volume'] = df['Volume'].astype(int)
final = df.groupby(['Product'])['Volume'].sum().reset_index()
# keep only the rows whose sum is non-zero
print(final[final.Volume != 0])
This prints only products C and D.
Given,
import pandas as pd
df = pd.DataFrame({'Product': ['A', 'A', 'C', 'D'], 'Volume': [-3, 3, 1, 5]})
final = df.groupby(['Product'])['Volume'].sum().reset_index()
Use boolean indexing to select only the rows that match your criteria: df[some_series_of_booleans_based_on_condition]
print(final[final['Volume'] != 0])
#output:
  Product  Volume
1       C       1
2       D       5
The idea is that `if [some series of booleans]:` is ambiguous: a Series holds one boolean per row, so it has no single truth value, and pandas raises the ValueError that you saw. It is a runtime error, not a syntax error.
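A short sketch of that ambiguity, using a hand-written copy of the grouped result (the values below are typed in, not computed by groupby). The variable names mask and raised are only for illustration.

```python
import pandas as pd

# Hand-written stand-in for the grouped sums from the question.
final = pd.DataFrame({'Product': ['A', 'C', 'D'], 'Volume': [0, 1, 5]})
mask = final['Volume'] != 0  # one boolean per row: False, True, True

try:
    if mask:  # ambiguous: three booleans, no single truth value
        pass
    raised = False
except ValueError:
    raised = True

print(raised)                  # True
print(mask.any(), mask.all())  # True False
print(final[mask])             # rows C and D only
```

Use .any() or .all() when you really need a single yes/no answer, and boolean indexing (final[mask]) when you want to keep the matching rows.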

Performing function in new column based on condition in other column

Here I am attempting to query a column in dataframe df, which holds the string values 'Yes' or 'No', in order to assign a random letter, drawn according to a probability distribution, in the rows where the condition is met.
if (df['some_bool'] == 'Yes'):
    df['score'] = np.random.choice(['A', 'B', 'C'], len(df), p=[0.3, 0.2, 0.5])
What is the correct way of writing this? I receive the following error for the above:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thanks!
Try this instead:
df['score'] = np.where(df['some_bool'] == 'Yes',
                       np.random.choice(['A', 'B', 'C'], len(df), p=[0.3, 0.2, 0.5]), '')
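An alternative sketch, in case you would rather draw letters only for the matching rows: build a boolean mask and assign through df.loc. The sample data and the seeded generator below are assumptions made so the example is repeatable; they are not from the question.

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the real df.
df = pd.DataFrame({'some_bool': ['Yes', 'No', 'Yes', 'No', 'Yes']})
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

mask = df['some_bool'] == 'Yes'
df['score'] = ''  # default for rows where the condition is not met

# Draw exactly as many letters as there are 'Yes' rows and write them in place.
df.loc[mask, 'score'] = rng.choice(['A', 'B', 'C'], size=int(mask.sum()),
                                   p=[0.3, 0.2, 0.5])
print(df)
```

Unlike the np.where version, this does not generate random values for rows that end up discarded.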

Why does max() on a dict pick one key over another when several keys have the same value?

>>> votes ={}
>>> votes["maddy"]=6
>>> votes["katty"]=6
>>> votes
{'maddy': 6, 'katty': 6}
>>> print(max(votes.items(), key = lambda k:k[1]))
('maddy', 6)
>>> votes["jackie"]=1
>>> votes
{'maddy': 6, 'katty': 6, 'jackie': 1}
>>> votes["kavi"]=1
>>> votes
{'maddy': 6, 'katty': 6, 'jackie': 1, 'kavi': 1}
>>> print(min(votes.items(), key = lambda k:k[1]))
('jackie', 1)
>>>
I understand that when several items share the maximum value, the one inserted first is returned, as shown below. But what if I need the key inserted later, i.e. katty, which has the same value 6 as maddy but was inserted after it?
print(max(votes.items(), key = lambda k:k[1]))
('maddy', 6)
Expected output:
katty, 6
According to the max docs:
If multiple items are maximal, the function returns the first one encountered.
Moreover, your code also depends on the order of the elements returned by .items(). A dict is guaranteed to preserve insertion order since Python 3.7. If your version is lower than 3.7 and you want to keep the insertion order, you can use collections.OrderedDict.
If you want to return the last-inserted element with the max / min value, you can make the order explicit in the key:
In [1]: votes = {'maddy': 6, 'katty': 6}
In [2]: max(votes.items(), key=lambda x: x[1])
Out[2]: ('maddy', 6)
In [3]: max(enumerate(votes.items()), key=lambda x: (x[1][1], x[0]))
Out[3]: (1, ('katty', 6))
In [4]: max(enumerate(votes.items()), key=lambda x: (x[1][1], x[0]))[1]
Out[4]: ('katty', 6)
Note that for min you need to negate the index in the key function to retrieve the last-inserted element. Among equal values, min picks the smallest second component of the key tuple, so the element with index 3 has to compare lower than the element with index 2. After negation you compare -3 with -2, -3 is lower, and min therefore returns the element with the higher index:
In [5]: votes = {'maddy': 6, 'katty': 6, 'a': 1, 'b': 1}
In [6]: min(enumerate(votes.items()), key=lambda x: (x[1][1], -x[0]))[1]
Out[6]: ('b', 1)
Or you can just reverse the order of items:
In [7]: max(reversed(list(votes.items())), key=lambda x: x[1])
Out[7]: ('katty', 6)
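If what you actually need is every key tied for the top score, a different sketch is to compute the max value once and collect all keys that reach it. This relies on dict insertion order (Python 3.7+); the names top and winners are made up for the example.

```python
votes = {'maddy': 6, 'katty': 6, 'jackie': 1, 'kavi': 1}

top = max(votes.values())  # highest vote count
# All keys that reach the top count, in insertion order.
winners = [name for name, count in votes.items() if count == top]

print(winners)             # ['maddy', 'katty']
print((winners[-1], top))  # ('katty', 6), the last-inserted winner
```

This makes the tie visible instead of silently resolving it, which may be closer to what a vote count needs.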
