How to get number of columns in a DataFrame row that are above threshold - python-3.x

I have a simple pandas DataFrame (Python 3.8) with 8 columns (simply labeled 0, 1, 2, etc.) and approx. 3500 rows. I want a subset of this DataFrame where at least 2 columns in each row are above 1. I would prefer not to check each column individually, but to check all columns at once. I know I can use .any(1) to check all the columns, but I need at least 2 columns to meet the threshold, not just one. Any help would be appreciated. Sample code below:
import pandas as pd

df = pd.DataFrame({0: [1, 1, 1, 1, 100],
                   1: [1, 3, 1, 1, 1],
                   2: [1, 3, 1, 1, 4],
                   3: [1, 1, 1, 1, 1],
                   4: [3, 4, 1, 1, 5],
                   5: [1, 1, 1, 1, 1]})
The easiest way I can think of to sort/filter later would be to create another column at the end, df[9], that holds the count:
df[9] = df.apply(lambda x: x.count() if x > 2, axis=1)
This code doesn't work, but I feel like it's close?

df[(df>1).sum(axis=1)>=2]
Explanation:
(df>1).sum(axis=1) gives the number of columns in each row that are greater than 1.
Then >=2 keeps only the rows where at least 2 columns meet the condition, using the counts from the previous step.
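To make the two steps concrete on the sample df from the question (the counts below follow directly from that data):

counts = (df > 1).sum(axis=1)   # per-row count of values greater than 1
print(counts.tolist())          # [1, 3, 0, 0, 3]
print(df[counts >= 2])          # keeps only rows 1 and 4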

The value of x in the lambda is a Series, which can be indexed like this.
df[9] = df.apply(lambda x: x[x > 2].count(), axis=1)
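For the sample data this fills df[9] with [1, 3, 0, 0, 3]. As a side note (my own equivalent formulation, not from the original answer), a vectorized count avoids the row-wise apply and is typically much faster on ~3500 rows:

df[9] = (df > 2).sum(axis=1)   # same counts without apply
subset = df[df[9] >= 2]        # rows with at least 2 columns above the threshold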

Related

Filter Dataframe by comparing one column to list of other columns

I have a dataframe with numerous float columns. I want to filter the dataframe, leaving only the values that are between the High and Low columns of the same dataframe.
I know how to do this when the conditions are one column compared to another column. But there are 102 columns, so I cannot write a condition for each column. And all my research just illustrates how to compare two columns and not one column against all others (or I am not typing the right search terms).
I tried df = df[(df['High'] <= df[DFColRBs]) & (df['Low'] >= df[DFColRBs])].copy() but it erases everything.
and I tried booleanselction = df[(df[DFColRBs].between(df['High'], df['Low']))]
and I tried: df = df[(df[DFColRBs].ge(df['Low'])) & (df[DFColRBs].le(df['Low']))].copy()
and I tried:
BoolMatrix = (df[DFColRBs].ge(DF_copy['Low'], axis=0)) & (df[DFColRBs].le(DF_copy['Low'], axis=0))
df = df[BoolMatrix].copy()
But it erases everything in the dataframe, even the 3 columns that are not included in the list.
I appreciate the guidance.
Example Dataframe:
High Low Close _1m_21 _1m_34 _1m_55 _1m_89 _1m_144 _1m_233 _5m_21 _5m_34 _5m_55
0 1.23491 1.23456 1.23456 1.23401 1.23397 1.23391 1.2339 1.2337 1.2335 1.23392 1.23363 1.23343
1 1.23492 1.23472 1.23472 1.23422 1.23409 1.234 1.23392 1.23375 1.23353 1.23396 1.23366 1.23347
2 1.23495 1.23479 1.23488 1.23454 1.23422 1.23428 1.23416 1.23404 1.23372 1.23415 1.234 1.23367
3 1.23494 1.23472 1.23473 1.23457 1.23425 1.23428 1.23417 1.23405 1.23373 1.23415 1.234 1.23367
Based on what you've said in the comments, it's best to split the df into the pieces you want to operate on and the ones you don't, then use matrix operations.
tmp_df = DF_copy.iloc[:, 3:].copy()
# or tmp_df = DF_copy[DFColRBs].copy()
# mask by comparing the test columns with the High and Low columns
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
# combine the masked df with the original columns
DF_copy2 = pd.concat([DF_copy.iloc[:, :3], tmp_df.where(m)], axis=1)
# or replace DF_copy.iloc[:, :3] with DF_copy.drop(columns=DFColRBs)
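For illustration, here is a minimal sketch on made-up data (the values and column names are hypothetical, chosen only to mirror the question's layout):

import pandas as pd

DF_copy = pd.DataFrame({'High': [1.2, 1.3], 'Low': [1.0, 1.1],
                        'Close': [1.1, 1.2],
                        '_1m_21': [1.1, 0.9],   # 0.9 is below Low -> masked
                        '_1m_34': [1.5, 1.2]})  # 1.5 is above High -> masked
tmp_df = DF_copy.iloc[:, 3:]
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
print(pd.concat([DF_copy.iloc[:, :3], tmp_df.where(m)], axis=1))
# values outside [Low, High] become NaN; the first three columns are untouched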

Python 3: Groupby 3 DataFrame columns to check availability in a 4th column and add label 0 or 1 to 5th column

My first time posting on Stack Overflow. Please be kind.
I tried to find the exact solution for this problem but have failed to do so.
What I am attempting to do is group by the ProductID, Class, and Material columns to see which rows have null and non-null values in the Material column, and assign 0 and 1 respectively in the Level column.
My Dataframe: https://i.stack.imgur.com/dRZcY.jpg
My Target Dataframe: https://i.stack.imgur.com/HWi5y.jpg
I am unable to get a label of 0's and 1's for the missing values in the Material column. Please help!
Thanks in advance!
Try this:
df['level'] = df[['ProductID', 'Class', 'Material']].apply(
    lambda x: 0 if x.isna().sum() > 0 else 1, axis=1)
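A minimal sketch on made-up data (the question's actual frames are only in the linked images, so these values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ProductID': [1, 2], 'Class': ['A', 'B'],
                   'Material': ['steel', np.nan]})
df['level'] = df[['ProductID', 'Class', 'Material']].apply(
    lambda x: 0 if x.isna().sum() > 0 else 1, axis=1)
print(df)  # the row with the missing Material gets level 0, the other gets 1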

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops are performing correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT the values are for the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
    #listname = i[4:] + '_norm'
    df2 = pd.read_excel(i, header=0, index_col=None, skip_blank_lines=True)
    df3 = df2.dropna(axis=0, how='any')
    cols = []
    for column in df3:
        cols.append(column)
    for x in cols:
        listname = x + ' norm'
        maxval = df3[x].max()
        print(maxval)
        mylist = []
        for j in df3[x]:
            findNL = (j/maxval)
            mylist.append(findNL)
        df3[listname] = mylist
    saveloc = 'E:/test/'
    filename = i[:-18] + '_Normalised.xlsx'
    df3.to_excel(saveloc + filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You don't need a for loop for this. You can simplify your code:
# Make an example dataframe, since one was not provided
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [5, 6, 7, 8]})
print(df)
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8
Now we can use DataFrame.apply, use add_suffix to give the new columns a _norm suffix, and then concat the result to the original dataframe:
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
col1 col2 col1_norm col2_norm
0 1 5 0.25 0.625
1 2 6 0.50 0.750
2 3 7 0.75 0.875
3 4 8 1.00 1.000
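As a side note (an equivalent formulation, not from the original answer): since df.max() returns the per-column maxima, dividing the frame by it broadcasts across rows and performs the same normalization:

df_conc = pd.concat([df, (df / df.max()).add_suffix('_norm')], axis=1)
# identical output to the apply version above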
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't noticeable.
Thanks for your help @Erfan

pandas merged data length

I have two data frames, each with one column containing the same values (and of equal length) but in a different order, as in this simplified example:
df1=pd.DataFrame(['a','b','c','d','e'],columns=['names'])
df2=pd.DataFrame(['b','e','a','c','d'],columns=['names'])
I want to know the corresponding index of each row of df1 in df2, so I do:
df = pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
This works as expected for this example: the lengths of the data frames are equal, len(df1)=len(df2)=len(df)
However in my real data, len(df1)=len(df2)=1714 and len(df)=1676
I am puzzled, how is this possible?
I just did an experiment and added duplicates.
df1=pd.DataFrame(['e','a','b','c','d','e'],columns=['names'])
df2=pd.DataFrame(['b','e','a','e','c','d'],columns=['names'])
df= pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
This gives len(df)=8, which is larger than len(df1)=len(df2)=6.
But in my real data df is smaller than individual df lengths.
Since pandas merge defaults to an inner join (when you don't specify the how parameter), it will only output rows whose keys appear in both dfs.
For example:
df1=pd.DataFrame(['a'],columns=['names'])
df2=pd.DataFrame(['b','e','a','c','d'],columns=['names'])
pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
index_x names index_y
0 0 a 2
Update
df1=pd.DataFrame(['a','a'],columns=['names'])
df2=pd.DataFrame(['b','e','a','a','c','d'],columns=['names'])
df1.merge(df2)
names
0 a
1 a
2 a
3 a
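To see which rows are being dropped (or duplicated) in your real data, an outer merge with indicator=True is a handy diagnostic; this is a suggestion beyond the original answer, using the standard pandas merge API:

chk = pd.merge(df1.reset_index(), df2.reset_index(), on=['names'],
               how='outer', indicator=True)
# rows marked 'left_only' or 'right_only' exist in only one frame
print(chk[chk['_merge'] != 'both'])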

Python - How to dynamically exclude a column name from a list of columns of a Pandas Dataframe

So far I am able to get the list of all column names present in the dataframe, or to get specific column names based on their datatype, starting letters, etc.
Now my requirement is to get the whole list of column names, or a sublist, and to exclude one column from it (i.e. the target variable / label column; this is part of machine learning, so I am using the terms used in that field).
Please note I am not speaking about the data present in those columns. I am just taking the column names and want to exclude a particular column by its name.
Please see the example below for a better understanding:
# Get all the column names from a Dataframe
df.columns
Index(['transactionID', 'accountID', 'transactionAmountUSD',
'transactionAmount', 'transactionCurrencyCode',
'accountAge', 'validationid', 'LABEL'],
dtype='object')
# Get only the Numeric Variables (Columns with numeric values in it)
df._get_numeric_data().columns
Index(['transactionAmountUSD', 'transactionAmount', 'accountAge', 'LABEL'],
dtype='object')
Now, in order to get the remaining column names, I subtract the two results above:
string_cols = list(set(list(df.columns))-set(df._get_numeric_data().columns))
Ok, everything goes well until I hit this.
I have found out that the LABEL column, though it has numeric values, should not be present in the list of numeric variables. It should be excluded.
(i.e.) I want to exclude a particular column by its name (not by its index in the list, but by its name explicitly).
I tried similar statements like the following ones, but in vain. Any inputs on this will be helpful:
set(df._get_numeric_data().columns-set(df.LABEL)
set(df._get_numeric_data().columns-set(df.LABEL.column)
set(df._get_numeric_data().columns-set(df['LABEL'])
I am sure I am missing a very basic thing but not able to figure it out.
First of all, you can exclude all numeric columns much more simply with
df.select_dtypes(exclude=[np.number])
transactionID accountID transactionCurrencyCode validationid
0 a a a a
1 a a a a
2 a a a a
3 a a a a
4 a a a a
Second of all, there are many ways to drop a column. See this post
df._get_numeric_data().drop('LABEL', axis=1)
transactionAmountUSD transactionAmount accountAge
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
If you really want just the column names, use pd.Index.difference:
df._get_numeric_data().columns.difference(['LABEL'])
Index(['accountAge', 'transactionAmount', 'transactionAmountUSD'], dtype='object')
Setup
df = pd.DataFrame(
    [['a', 'a', 1, 1, 'a', 1, 'a', 1]] * 5,
    columns=[
        'transactionID', 'accountID', 'transactionAmountUSD',
        'transactionAmount', 'transactionCurrencyCode',
        'accountAge', 'validationid', 'LABEL']
)
Pandas' Index supports set operations, so to exclude one column from the column index you can just write something like:
import pandas as pd
df = pd.DataFrame(columns=list('abcdef'))
print(df.columns.difference({'b'}))
which will return to you
Index(['a', 'c', 'd', 'e', 'f'], dtype='object')
I hope this is what you want :)
Considering the LABEL column as your output and the other features as your input, you can try this:
feature_names = [x for x in df._get_numeric_data().columns if x not in ['LABEL']]
input = df[feature_names]
output = df['LABEL']
Hope this helps.
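One more equivalent pattern (my own suggestion, using the public API instead of the private _get_numeric_data): select the numeric columns with select_dtypes and drop the label from the resulting Index:

import numpy as np

feature_names = df.select_dtypes(include=[np.number]).columns.drop('LABEL')
X = df[feature_names]   # input features
y = df['LABEL']         # target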
