How can I subset a dataframe using if without getting the "truth value of a Series is ambiguous" error? - python-3.x

I have a list containing around 45 dataframes with 8 columns. Now I want to subset the dataframes based on specific values present in a particular column.
Code:
for z in list_dataframes:
    if(z['Segmentation']=="FAST"):
        list_fast.append(z)
This gives me an error stating that the truth value of a Series is ambiguous.
Can anyone tell me how to solve this?
P.S. An entirely different question: how do you remove empty dataframes from a list containing both empty and non-empty dataframes?

You can just use in on the values:
for z in list_dataframes:
    if("FAST" in z['Segmentation'].values):
        list_fast.append(z)
Or wrap it all in a list comprehension:
list_fast = [z for z in list_dataframes if 'FAST' in z['Segmentation'].values]
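For context, here is a minimal sketch of why the original if fails and of one way to handle the P.S. about empty dataframes. The sample data and the names rows_fast and non_empty are made up for illustration. The comparison z['Segmentation'] == "FAST" produces one boolean per row, so it has to be reduced to a single boolean (here with .any()) before if can use it.

```python
import pandas as pd

# Hypothetical sample data; the real list holds about 45 dataframes.
list_dataframes = [
    pd.DataFrame({"Segmentation": ["FAST", "SLOW"], "value": [1, 2]}),
    pd.DataFrame({"Segmentation": ["SLOW", "SLOW"], "value": [3, 4]}),
    pd.DataFrame({"Segmentation": [], "value": []}),  # an empty dataframe
]

# The comparison yields a boolean Series; .any() collapses it to one boolean,
# which is what an if statement (or a comprehension filter) needs.
list_fast = [z for z in list_dataframes if (z["Segmentation"] == "FAST").any()]

# To keep only the matching rows rather than whole dataframes:
rows_fast = [z[z["Segmentation"] == "FAST"] for z in list_dataframes]

# For the P.S.: DataFrame.empty is True when a dataframe has no rows or no columns.
non_empty = [z for z in list_dataframes if not z.empty]
```

Whether you want whole dataframes (list_fast) or only the matching rows (rows_fast) depends on what "subset" means for your data.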

Related

How to create a list and filter out row from another dataframe?

I know this question has been asked before, but every solution doesn't appear to work and gives me the same result. I am looking for insight into what I am doing wrong.
T_18_x2 and Tryp18_50 are large dataframes with different data (except for 2 columns). Specifically, each dataframe contains a column named 'Gene' that holds strings of the same style (e.g. HSP90A_HUMAN). I would like to make a list from the Gene column in T_18_x2 and use it to keep the rows in Tryp18_50 that have the same string in the "Gene" column.
My issue is that the output is simply an empty dataframe. I think the problem is the list (y2), because it contains duplicates of the strings in the column. I am not sure why this is happening either.
[image: List]
Any help would be greatly appreciated.
input:
y2 = T18_x2['Gene'].astype(str).values.tolist()
T18 = Tryp18_50[Tryp18_50['Gene'].isin(y2)]
T18
output:
[image: Output]
** I have also tried:
T18=Tryp18_50[pd.notna(Tryp18_50['Gene']) & Tryp18_50['Gene'].astype(str).str.contains('|'.join(y2))]
with the output:
[image: 2nd Output]
My mistake, I had two "Gene" columns in the first dataframe.
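A small sketch of how the duplicated column could be spotted and worked around. The two frames below are hypothetical stand-ins for T18_x2 and Tryp18_50, and the column names score and intensity are invented. When a label appears twice, selecting it returns a DataFrame instead of a Series, which is why the list built from it looked duplicated.

```python
import pandas as pd

# Hypothetical frames standing in for T18_x2 and Tryp18_50.
T18_x2 = pd.DataFrame(
    [["HSP90A_HUMAN", 1.0, "HSP90A_HUMAN"], ["ACTB_HUMAN", 2.0, "ACTB_HUMAN"]],
    columns=["Gene", "score", "Gene"],  # note the duplicated 'Gene' label
)
Tryp18_50 = pd.DataFrame(
    {"Gene": ["HSP90A_HUMAN", "TBB5_HUMAN"], "intensity": [10, 20]}
)

# Detect repeated column labels.
has_duplicates = T18_x2.columns.duplicated().any()

# With a duplicated label, T18_x2['Gene'] is a two-column DataFrame.
selection_is_frame = isinstance(T18_x2["Gene"], pd.DataFrame)

# Keep the first occurrence of each label, then filter as originally intended.
deduped = T18_x2.loc[:, ~T18_x2.columns.duplicated()]
y2 = deduped["Gene"].astype(str).tolist()
T18 = Tryp18_50[Tryp18_50["Gene"].isin(y2)]
```

Dropping the later duplicates is only one option; renaming one of the columns works just as well if both hold data you need.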

Can a list with repeating values be converted into a multiset in python?

I have two lists with repeating values, and I want to take the intersection of the repeating values along with the values that occur only once in either of the lists.
I am just a beginner and would love to hear simple suggestions!
Method 1
l1=[1,2,3,4]
l2=[1,2,5]
intersection=[value for value in l1 if value in l2]
for x in l1+l2:
    if x not in intersection:
        intersection.append(x)
print(intersection)
Method 2
print(list(set(l1+l2)))
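On the multiset part of the question: the standard library's collections.Counter is commonly used as a multiset, since it stores a count per value. The sketch below uses made-up lists with repeats (not the lists from the question) and shows the multiset intersection and union; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical input lists with repeated values.
l1 = [1, 1, 2, 3]
l2 = [1, 2, 2, 5]

c1, c2 = Counter(l1), Counter(l2)

# & keeps the minimum count of each value (multiset intersection).
common = c1 & c2

# | keeps the maximum count of each value (multiset union).
union = c1 | c2

# elements() expands the counts back into repeated values.
common_list = sorted(common.elements())
union_list = sorted(union.elements())

print(common_list)  # [1, 2]
print(union_list)   # [1, 1, 2, 2, 3, 5]
```

By contrast, both methods in the question discard the counts and return each distinct value once.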

Adding a zero columns to a dataframe

I have a strange problem that I am not able to figure out. I have a dataframe subset that looks like this
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)))
subset['IPNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)))
and I get a result similar to this
Now when I do similar things to another dataframe I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)))
subset['IPNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)))
I don't understand why sometimes I get zeros and other times I get either NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional','IPNotional'], 0))
#you can define each column separately
#subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or, more simply:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now when i do similar things to another dataframe i get zeros columns with a mix NaN and zeros rows as shown below. This is really strange.
I think the problem is differing index values: the new DataFrame gets a default 0..n-1 index, and rows whose index labels do not match get NaN. So it is necessary to create it with the same index:
subset['IPNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)), index=subset.index)
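To make the index explanation concrete, here is a small sketch with an invented subset whose index is not 0..n-1 (as often happens after filtering rows). It assigns a Series rather than a one-column DataFrame to keep the example simple; the assumption is that the same label alignment is what produces the NaNs in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical subset whose index labels are 0, 5, 7 rather than 0, 1, 2.
subset = pd.DataFrame({"Notional": [10.0, 20.0, 30.0]}, index=[0, 5, 7])

# The new Series has the default index 0, 1, 2, so only label 0 matches;
# labels 5 and 7 find no partner and are filled with NaN.
subset["misaligned"] = pd.Series(np.zeros(len(subset)))

# Reusing the existing index makes every label match.
subset["aligned"] = pd.Series(np.zeros(len(subset)), index=subset.index)

print(subset)
```

Assigning the scalar directly (subset['IRNotional'] = 0) avoids alignment altogether, which is why the simpler form above is usually preferable.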

check if country code contains given string

below is my code
years_list = set()
for i in range(0,indicators_csv.shape[0]) :
    if (indicators_csv['CountryCode'].str.contains('USA')) :
        years_list.append(indicator_csv.iloc[i].Year)
Here indicators_csv is a CSV file with a column named 'CountryCode'.
When I run this I get the following error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
on the if line. I also tried
if (indicators_csv['CountryCode'] == ('USA'))
but got the same error.
I googled it and found some answers related to numbers or to and/or, but nothing like this.
If I understood you correctly and you want to iterate over the df instead of using a vectorised approach, you can use:
years_list = []
for index, row in indicators_csv.iterrows():
    if ('USA' in row['CountryCode']):
        years_list.append(row['Year'])
Input:
   CountryCode  Year
0  USA          1980
1  UK           1990
2  FR           1984
3  USA          2000
Output:
[1980L, 2000L]
You should try to avoid iterating over pandas objects as much as possible; it's much slower than the native vectorised operations. Your issue is that indicators_csv['CountryCode'].str.contains('USA') checks whether 'USA' is in 'CountryCode' for every row, so you end up with a column of True and False entries rather than a single boolean.
What you want to do instead is filter the dataframe to just those rows that contain 'USA' and then convert the 'Year' column from that frame to a list. You can do all of this directly in one operation (split across two lines for readability):
years_list = indicators_csv[indicators_csv['CountryCode'].str.contains('USA')]\
    ['Year'].tolist()
The error is raised because you are trying to use a Series of boolean values in an if clause, which expects a single boolean.
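A runnable sketch of the vectorised approach, using the four sample rows from the answer above. The intermediate name mask is only for illustration; it holds the per-row True/False column that if cannot interpret on its own.

```python
import pandas as pd

# Sample rows taken from the input shown above.
indicators_csv = pd.DataFrame(
    {
        "CountryCode": ["USA", "UK", "FR", "USA"],
        "Year": [1980, 1990, 1984, 2000],
    }
)

# One boolean per row: True where the country code contains 'USA'.
mask = indicators_csv["CountryCode"].str.contains("USA")

# Select the matching rows and the 'Year' column in a single .loc call.
years_list = indicators_csv.loc[mask, "Year"].tolist()

print(years_list)  # [1980, 2000]
```

If the codes must match exactly rather than contain the text, comparing with == 'USA' would give the mask instead of str.contains.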

How can I calculate values in a Pandas dataframe based on another column in the same dataframe

I am attempting to create a new column of values in a Pandas dataframe that are calculated from another column in the same dataframe:
df['ema_ideal'] = df['Adj Close'].ewm(span=df['ideal_moving_average'], min_periods=0, ignore_na=True).mean
However, I am receiving the error:
ValueError: The truth of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any(), or a.all().
If I have the span set to 30, or some integer, I do not receive this error. Also, ideal_moving_average is a column of float.
My two questions are:
1. Why exactly am I receiving the error?
2. How can I incorporate the column values from ideal_moving_average into the df['ema_ideal'] column? (Sub-question, as I am new to Pandas: is this column a Series within the dataframe?)
Thanks for the help!
EDIT: Example showing Adj Close data, in bad formatting
Date Open High Low Close Adj Close
2017-01-03 225.039993 225.830002 223.880005 225.240005 222.073914
2017-01-04 225.619995 226.750000 225.610001 226.580002 223.395081
2017-01-05 226.270004 226.580002 225.479996 226.399994 223.217606
2017-01-06 226.529999 227.750000 225.899994 227.210007 224.016220
2017-01-09 226.910004 227.070007 226.419998 226.460007 223.276779
2017-01-10 226.479996 227.449997 226.009995 226.460007 223.276779
I think something like this will work for you:
df['ema_ideal'] = df.apply(lambda x: df['Adj Close'].ewm(span=x['ideal_moving_average'], min_periods=0, ignore_na=True).mean(), axis=1)
Providing axis=1 to DataFrame.apply allows you to access the data row wise like you need.
There's absolutely no issue creating a dataframe column from another dataframe.
The error you're receiving is completely different: it is raised when you use a Series with logical operators such as and, or, not, etc.
In general, to avoid this error you must combine Series element-wise, for example using & instead of and, | instead of or, or ~ instead of not, or using numpy for element-wise comparison.
Here, the issue is that you're trying to use a Series as the span of your EMA, and the pandas ewm function only accepts a single number as span.
You could, for example, calculate the EMA for each possible period, and then regroup them into a Series that you set as the ema_ideal column of your dataframe.
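The element-wise operators mentioned above can be sketched as follows. The Series s and the names raised, mask and inverted are invented for the illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Using a Series where a single boolean is expected raises ValueError.
try:
    if s > 1:
        pass
    raised = False
except ValueError:
    raised = True

# Element-wise alternatives: & for and, | for or, ~ for not.
# The parentheses matter because & binds more tightly than the comparisons.
mask = (s > 1) & (s < 3)
inverted = ~mask

print(raised)             # True
print(mask.tolist())      # [False, True, False]
print(inverted.tolist())  # [True, False, True]
```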
For anyone wondering, the problem was that span could not take multiple values, which was happening when I tried to pass df['ideal_moving_average'] into it. Instead, I used the below code, which seemed to go line by line passing the value for that row into span.
df['30ema'] = df['Adj Close'].ewm(span=df.iloc[-1]['ideal_ma'], min_periods=0, ignore_na=True).mean()
EDIT: I will accept this as correct for now, until someone shows that it doesn't work or can create something better, thanks for the help.
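One thing worth checking in the code above: df.iloc[-1]['ideal_ma'] selects the value from the last row only, so a single span is applied to the whole column. If a different span per row is what you need, one possible sketch is to compute one EWM per distinct span and take each row's value from the matching result. The function name ewm_with_row_spans and the sample numbers are hypothetical, and the sketch assumes every span is at least 1 and that there are relatively few distinct spans.

```python
import pandas as pd


def ewm_with_row_spans(values: pd.Series, spans: pd.Series) -> pd.Series:
    """For each row, return the EWM of `values` computed with that row's span."""
    result = pd.Series(index=values.index, dtype="float64")
    for span in spans.dropna().unique():
        # One full EWM pass for this span value.
        ema = values.ewm(span=span, min_periods=0, ignore_na=True).mean()
        # Copy the EWM values only into the rows that asked for this span.
        mask = spans == span
        result.loc[mask] = ema.loc[mask]
    return result


# Hypothetical data in the shape described in the question.
df = pd.DataFrame(
    {
        "Adj Close": [222.07, 223.40, 223.22, 224.02],
        "ideal_moving_average": [2.0, 2.0, 3.0, 3.0],
    }
)
df["ema_ideal"] = ewm_with_row_spans(df["Adj Close"], df["ideal_moving_average"])
```

Rows with a missing span are left as NaN in this sketch.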
