pandas: multiple conditions and perform arithmetic operation - python-3.x

I have a data frame with a few columns. I want to check multiple conditions (>=, <=) and, when the conditions are met, subtract the corresponding row's column value from the score.
import pandas as pd
import numpy as np

data = {'diag_name': ['diag1', 'diag2', 'diag3', 'diag4', 'diag5', 'diag6', 'diag7'],
        'final_score': [100, 90, 89, 100, 100, 99, 100],
        'count_corrected': [2, 0, 2, 2, 0, 1, 1]}
# Create DataFrame
df = pd.DataFrame(data)
To explain with an example: if final_score is 100 and count_corrected > 0, then the corresponding count_corrected value needs to be subtracted from final_score. Otherwise, final_score_new stays the same as final_score.
df['final_score_new'] = np.where((df.count_corrected > 0) & (df.final_score == 100),
                                 100 - df.count_corrected, df['final_score'])
df[['final_score_new', 'final_score', 'count_corrected']]  # check
I hope I am doing the operation and checks correctly. I wanted to confirm so that I am not messing up any indices.
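For what it's worth, an equivalent way to express the same logic with an index-aligned .loc assignment (a minimal sketch of the same approach, not a different method) would be:
# Start from the original score, then overwrite only the rows
# that satisfy both conditions; .loc keeps the indices aligned.
df['final_score_new'] = df['final_score']
mask = (df['count_corrected'] > 0) & (df['final_score'] == 100)
df.loc[mask, 'final_score_new'] = 100 - df.loc[mask, 'count_corrected']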
Thank you.

Related

How to drop entire record if more than 90% of features have missing value in pandas

I have a pandas dataframe called df with 500 columns and 2 million records.
I am able to drop columns that contain more than 90% of missing values.
But how can I drop the entire record in pandas if 90% or more of the columns have missing values for that record?
I have seen a similar post for "R" but I am coding in python at the moment.
You can use df.dropna() and set the thresh parameter to the value that corresponds to 10% of your columns; thresh is the minimum number of non-NA values a row must have to be kept.
df.dropna(axis=0, thresh=50, inplace=True)
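If you'd rather not hardcode the 50, the threshold can be derived from the column count; a minimal sketch, assuming you want to drop rows that are 90% or more missing:
n_cols = df.shape[1]                 # 500 in this example
min_non_na = int(n_cols * 0.1) + 1   # smallest number of non-NA values a kept row may have (51 here)
df.dropna(axis=0, thresh=min_non_na, inplace=True)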
You could use isna + mean on axis=1 to find the percentage of NaN values for each row. Then select the rows where it's less than 0.9 (i.e. 90%) using loc:
out = df.loc[df.isna().mean(axis=1)<0.9]
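The same idea extends to the column-dropping step mentioned in the question, if you want both in one place (a sketch, using the same 90% cut-off):
# Fraction of missing values per row and per column.
row_na_frac = df.isna().mean(axis=1)
col_na_frac = df.isna().mean(axis=0)
# Keep rows with less than 90% missing and columns with at most 90% missing.
out = df.loc[row_na_frac < 0.9, col_na_frac <= 0.9]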

Split dataframe into several ones (not the same size) based on consecutive zero values

I have a dataframe with numeric values (I show here only the column used for the "condition").
I would like to split it into several others (the sizes of the resulting dataframes may differ). The splitting should be based on consecutive non-zero values.
In the following case, starting from the initial dataframe (shown as an image in the original post), I would like to get three dataframes into new variables (also shown as images in the original post).
Is there any function to achieve that without iterating over the whole initial dataframe?
Thank you
I think I found the solution...
df['index'] = df.index.values  # create a column with the index values
s = df.iloc[:, 3].eq(0)  # mask of zero values
new_df = df.groupby([s, s.cumsum()]).apply(lambda x: list(x.index))  # group consecutive values based on the mask (found on Stack Overflow)
out = new_df.loc[False]  # select only the False groups, i.e. only values > 0
Finally I have a Series containing, for each group, the indices of consecutive non-zero values.
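If the goal is the actual sub-dataframes rather than lists of indices, a minimal sketch of the same idea (assuming the condition column is called 'value') would be:
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 0, 0, 3, 4, 5, 0, 6]})

# True where the value is zero; the cumulative sum increases at every zero,
# so consecutive non-zero rows share the same group id.
is_zero = df['value'].eq(0)
group_id = is_zero.cumsum()

# Keep only the non-zero rows and split them into one dataframe per run.
sub_dfs = [g for _, g in df[~is_zero].groupby(group_id[~is_zero])]
for sub in sub_dfs:
    print(sub, end='\n\n')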

Assigning np.nans to rows of a Pandas column using a query

I want to assign NaNs to the rows of a column in a Pandas dataframe when some conditions are met.
For a reproducible example here are some data:
'{"Price":{"1581292800000":21.6800003052,"1581379200000":21.6000003815,"1581465600000":21.6000003815,"1581552000000":21.6000003815,"1581638400000":22.1599998474,"1581984000000":21.9300003052,"1582070400000":22.0,"1582156800000":21.9300003052,"1582243200000":22.0200004578,"1582502400000":21.8899993896,"1582588800000":21.9699993134,"1582675200000":21.9599990845,"1582761600000":21.8500003815,"1582848000000":22.0300006866,"1583107200000":21.8600006104,"1583193600000":21.8199996948,"1583280000000":21.9699993134,"1583366400000":22.0100002289,"1583452800000":21.7399997711,"1583712000000":21.5100002289},"Target10":{"1581292800000":22.9500007629,"1581379200000":23.1000003815,"1581465600000":23.0300006866,"1581552000000":22.7999992371,"1581638400000":22.9599990845,"1581984000000":22.5799999237,"1582070400000":22.3799991608,"1582156800000":22.25,"1582243200000":22.4699993134,"1582502400000":22.2900009155,"1582588800000":22.3248996735,"1582675200000":null,"1582761600000":null,"1582848000000":null,"1583107200000":null,"1583193600000":null,"1583280000000":null,"1583366400000":null,"1583452800000":null,"1583712000000":null}}'
In this particular toy example, I want to assign NaNs to the column 'Price' when the column 'Target10' has NaNs. (in the general case the condition may be more complex)
This line of code achieves that specific objective:
toy_data.Price.where(toy_data.Target10.notnull(), toy_data.Target10)
However when I attempt to use a query and assign NaNs to the targeted column I fail:
toy_data.query('Target10.isnull()', engine = 'python').Price = np.nan
The above line leaves toy_data intact.
Why is that, and how should I use query to replace values in particular rows?
One way to do it is:
import numpy as np
toy_data['Price'] = np.where(toy_data['Target10'].isna(), np.nan, toy_data['Price'])
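As a side note on why the query approach leaves the frame untouched: df.query() returns a new DataFrame, so assigning to the Price attribute of that copy never reaches toy_data. A boolean .loc assignment on the original frame (a sketch of an equivalent approach) avoids this:
import numpy as np
# Set 'Price' to NaN directly on toy_data for the rows where 'Target10' is missing.
toy_data.loc[toy_data['Target10'].isna(), 'Price'] = np.nan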

How to fillna() all columns of a dataframe from a single row of another dataframe with identical structure

I have a train_df and a test_df, which come from the same original dataframe, but were split up in some proportion to form the training and test datasets, respectively.
Both train and test dataframes have identical structure:
A PeriodIndex with daily buckets
n number of columns that represent observed values in those time buckets e.g. Sales, Price, etc.
I now want to construct a yhat_df, which stores predicted values for each of the columns. In the "naive" case, the yhat_df column values are simply the last observed value from the training dataset.
So I go about constructing yhat_df as below:
import pandas as pd

yhat_df = pd.DataFrame().reindex_like(test_df)
yhat_df[train_df.columns[0]].fillna(train_df.tail(1).values[0][0], inplace=True)
yhat_df[train_df.columns[1]].fillna(train_df.tail(1).values[0][1], inplace=True)
This appears to work, and since I have only two columns, the extra typing is bearable.
I was wondering if there is simpler way, especially one that does not need me to go column by column.
I tried the following, but it only populates the column values correctly where the PeriodIndex values match. It seems fillna() attempts to do a join() of sorts internally on the index:
yhat_df.fillna(train_df.tail(1), inplace=True)
If I could figure out a way for fillna() to ignore index, maybe this would work?
You can use fillna with a dictionary to fill each column with a different value, so I think:
yhat_df = yhat_df.fillna(train_df.tail(1).to_dict('records')[0])
should work. But if I understand correctly what you are doing, you could even create the dataframe directly with:
yhat_df = pd.DataFrame(train_df.tail(1).to_dict('records')[0],
                       index=test_df.index, columns=test_df.columns)
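Another option along the same lines (a sketch, assuming train_df and yhat_df share column names): fillna with a Series is aligned on column labels rather than on the row index, so the last training row can be used directly:
# train_df.iloc[-1] is a Series indexed by column names, so fillna broadcasts
# each value down its matching column, ignoring the PeriodIndex entirely.
yhat_df = yhat_df.fillna(train_df.iloc[-1])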

extract second row values in spark data frame

I have a Spark dataframe for a table (1000000x4) sorted by the second column.
I need to get 2 values: second row, column 0 and second row, column 3.
How can I do it?
If you just need the values, it's pretty simple: just use the DataFrame's underlying RDD. You didn't specify the language, so I'll take the liberty of showing you how to achieve this using Python 2.
df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])

requiredRows = [0, 2]
data = (df.rdd.zipWithIndex()
        .filter(lambda ((name, age, balance), index): index in requiredRows)
        .collect())
And now you can manipulate the variables inside the data list. By the way, I didn't remove the index inside every tuple just to provide you another idea about how this works.
print data
#[(Row(name=u'Bonsanto', age=20, balance=2000.0), 0),
# (Row(name=u'Mises', age=60, balance=1000.0), 2)]
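Since tuple unpacking in lambda arguments was removed in Python 3, a roughly equivalent sketch that works there (assuming a SparkSession named spark) would be:
df = spark.createDataFrame([("Bonsanto", 20, 2000.00),
                            ("Hayek", 60, 3000.00),
                            ("Mises", 60, 1000.0)],
                           ["name", "age", "balance"])

required_rows = {0, 2}
data = (df.rdd.zipWithIndex()                           # (Row, index) pairs
        .filter(lambda pair: pair[1] in required_rows)  # keep only the wanted rows
        .map(lambda pair: pair[0])                      # drop the index, keep the Row
        .collect())
print(data)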
