Pandas time series - need to extract row value based on multiple conditions on other columns

I have a time series dataframe with the below columns. I am trying to figure out:
If df['PH'] == 1, then I need to find the previous date where df['pivot_low_1'] == 1 and extract the value of df['low'] for that date. So, for 2010-01-12 where df['PH'] == 1, I would need to identify the previous non-zero df['pivot_low_1'] == 1 on 2010-01-07 and get df['low'] == 1127.00000.
low pivot_low_1 PH
date
2010-01-04 1114.00000 1 0
2010-01-05 1125.00000 0 0
2010-01-06 1127.25000 0 0
2010-01-07 1127.00000 1 0
2010-01-08 1131.00000 0 0
2010-01-11 1137.75000 0 0
2010-01-12 1127.75000 1 1
2010-01-13 1129.25000 0 0
2010-01-14 1138.25000 0 0
2010-01-15 1127.50000 1 0
2010-01-18 1129.50000 0 0
2010-01-19 1126.25000 0 0
2010-01-20 1125.25000 0 0
2010-01-21 1108.50000 0 0
2010-01-22 1086.25000 1 0
2010-01-25 1089.75000 0 0
2010-01-26 1081.00000 0 0
2010-01-27 1078.50000 0 0
2010-01-28 1074.25000 0 0
2010-01-29 1066.50000 1 1
2010-02-01 1068.00000 0 0

Since you want a new column in the same dataframe, but the output corresponds only to certain rows, I will fill every other row with NaN values:
import numpy as np
import pandas as pd

data = pd.read_csv('file.csv')
data.columns = ['low', 'pivot_low_1', 'PH']

count = 0
l = list()    # positions of rows where pivot_low_1 == 1 so far
new = list()  # values for the new column
for index, row in data.iterrows():
    if row['pivot_low_1'] == 1:
        l.append(count)
    if (row['PH'] == 1) and (row['pivot_low_1'] == 1):
        # the current row is itself a pivot low, so take the one before it
        new.append(data.iloc[l[-2]].low)
    elif row['PH'] == 1:
        # take the most recent pivot low
        new.append(data.iloc[l[-1]].low)
    else:
        new.append(np.nan)
    count += 1
data['new'] = new
data
The output is shown in this image: https://imgur.com/a/IqowZHZ. Hope this helps.
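For reference, the loop can also be avoided with a vectorized sketch: mask low down to the pivot rows, shift by one so each row only sees pivots strictly before it, forward-fill, and keep the result only where PH == 1. Shown here on a small slice of the example data:

```python
import pandas as pd

# Rebuild a small slice of the example data.
df = pd.DataFrame(
    {'low': [1114.0, 1125.0, 1127.25, 1127.0, 1131.0, 1137.75, 1127.75],
     'pivot_low_1': [1, 0, 0, 1, 0, 0, 1],
     'PH': [0, 0, 0, 0, 0, 0, 1]},
    index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06',
                          '2010-01-07', '2010-01-08', '2010-01-11',
                          '2010-01-12']))

# low at pivot rows only, shifted one row down so each row sees the
# most recent pivot strictly before it, then forward-filled.
prev_pivot_low = df['low'].where(df['pivot_low_1'].eq(1)).shift().ffill()

# Keep the value only on PH rows; NaN elsewhere.
# The 2010-01-12 row gets 1127.0 (the pivot low from 2010-01-07).
df['new'] = prev_pivot_low.where(df['PH'].eq(1))
print(df)
```

This reproduces the loop's logic: on a row that is itself a pivot low, the shift ensures the row's own low is not picked up.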


How can I change the values of columns based on the values from other columns?

Here is the table before cleaning:

name  date       time_lag1  time_lag2  time_lag3  lags
a     2000/5/3   1          0          1          time_lag1
a     2000/5/10  1          1          0          time_lag2
a     2000/5/17  1          1          1          time_lag3
b     2000/5/3   0          1          0          time_lag1
c     2000/5/3   0          0          0          time_lag1
The logic is simple: each name has several dates, and each date corresponds to a "lags" value. What I tried to do is match the column names "time_lag1", "time_lag2", ..., "time_lagn" to the values in the "lags" column. For example, the first value of "time_lag1" is 1 because the column name "time_lag1" equals the corresponding value of "lags", which is also "time_lag1". However, I don't know why the values of the other columns and rows are becoming incorrect.
My thought is:
# the time_lag columns do not follow a pattern, so a column such as lag_time4 is possible too
time_list = ['time_lag1', 'time_lag2', 'lag_time4', ...]
for col in time_list:
    if col == df['lags'].values:
        df.col == 1
    else:
        df.col == 0
I don't know why the code I tried is not working very well.
Here is the table I am trying to get:

name  date       time_lag1  time_lag2  time_lag3  lags
a     2000/5/3   1          0          0          time_lag1
a     2000/5/10  0          1          0          time_lag2
a     2000/5/17  0          0          1          time_lag3
b     2000/5/3   1          0          0          time_lag1
c     2000/5/3   1          0          0          time_lag1
The simplest is to recalculate them from scratch with pandas.get_dummies and to update the dataframe:
df.update(pd.get_dummies(df['lags']))
Output:
name date time_lag1 time_lag2 time_lag3 lags
0 a 2000/5/3 1 0 0 time_lag1
1 a 2000/5/10 0 1 0 time_lag2
2 a 2000/5/17 0 0 1 time_lag3
3 b 2000/5/3 1 0 0 time_lag1
4 c 2000/5/3 1 0 0 time_lag1
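The attempted loop in the question can also be fixed directly: `df.col == 1` is a comparison, not an assignment, and `col == df['lags'].values` compares a single name against the whole array. Assigning the elementwise comparison instead gives the wanted 0/1 columns. A minimal sketch, rebuilding the example data:

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    'name': ['a', 'a', 'a', 'b', 'c'],
    'date': ['2000/5/3', '2000/5/10', '2000/5/17', '2000/5/3', '2000/5/3'],
    'time_lag1': [1, 1, 1, 0, 0],
    'time_lag2': [0, 1, 1, 1, 0],
    'time_lag3': [1, 0, 1, 0, 0],
    'lags': ['time_lag1', 'time_lag2', 'time_lag3', 'time_lag1', 'time_lag1'],
})

time_list = ['time_lag1', 'time_lag2', 'time_lag3']
for col in time_list:
    # elementwise comparison gives a boolean Series; cast it to 0/1
    df[col] = (df['lags'] == col).astype(int)
print(df)
```

Note that `df[col] = ...` must be used rather than `df.col`, since attribute access cannot create or assign columns.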

I have a DataFrame's columns and data in lists; I want to put the relevant data into the relevant column

Suppose you are given a list of all the items you can have, and separately a list of data whose shape is not fixed: each entry may contain any number of items. You want to create a dataframe from it and put each item into the right column.
For example:
columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
# and from this I want to create dummy variables like this
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
If you want indicator columns filled with 0 and 1 only, use MultiLabelBinarizer, with DataFrame.reindex if you want to change the ordering of the columns by a list; if some value possibly does not exist, an all-0 column is added:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
        .reindex(columns, axis=1, fill_value=0))
print(df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
This is one approach using collections.Counter.
Ex:
from collections import Counter
import pandas as pd

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

data = map(Counter, data)  # one Counter per row
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
You can try converting data to a dataframe:
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
df = pd.DataFrame(data)
df
0 1 2
0 hat tie None
1 shoe tie shirt
2 tie shirt None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
hat shirt shoe tie
0 1 0 0 1
1 0 1 1 1
2 0 1 0 1
Explanation:
df.stack() returns a MultiIndex Series:
0 0 hat
1 tie
1 0 shoe
1 tie
2 shirt
2 0 tie
1 shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to groupby the index and merge them using sum (because we know that there will only be one or zero after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
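For reference, the same grouped-sum result can be obtained in one step with pd.crosstab, tabulating the first index level against the stacked values. A sketch (the explicit .dropna() guards against pandas versions where stack keeps missing values):

```python
import pandas as pd

data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
df = pd.DataFrame(data)

# Stack into a long Series and drop the None padding explicitly.
s = df.stack().dropna()

# Count each item per original row index.
out = pd.crosstab(s.index.get_level_values(0), s)
print(out)
```

Since each item appears at most once per row, the counts are already 0/1 indicators.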

Pandas DataFrame: create a matrix-like with 0 and 1

I have to create a matrix-like dataframe of 0s and 1s. How can I create something like that?
This is my DataFrame:
I want to check the intersection where df['luogo'] is 'sala' and the column is 'sala', and replace it with 1.
This is my try:
for head in dataframe.columns:
    for i in dataframe['luogo']:
        if i == head:
            dataframe[head] = 1
        else:
            dataframe[head] = 0
Sorry for the italian dataframe.
You are probably looking for pandas.get_dummies(..) [pandas-doc]. For a given dataframe df:
>>> df
luogo
0 sala
1 scuola
2 teatro
3 sala
We get:
>>> pd.get_dummies(df['luogo'])
sala scuola teatro
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
You thus can join this with your original dataframe with:
>>> df.join(pd.get_dummies(df['luogo']))
luogo sala scuola teatro
0 sala 1 0 0
1 scuola 0 1 0
2 teatro 0 0 1
3 sala 1 0 0
This thus constructs a "one hot encoding" [wiki] of the values in your original dataframe.
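One caveat worth hedging on: in pandas 2.0 and later, get_dummies returns boolean columns by default rather than the 0/1 integers shown above. If you need integers, the dtype parameter forces them. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'luogo': ['sala', 'scuola', 'teatro', 'sala']})

# dtype=int forces 0/1 integer columns regardless of pandas version
dummies = pd.get_dummies(df['luogo'], dtype=int)
print(df.join(dummies))
```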

Writing Function on Data Frame in Pandas

I have data in Excel with two columns, 'Peak Value' and 'Label'. I want to add values in the 'Label' column based on the 'Peak Value' column.
So the input looks like below:
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label      0 0 0  0 0 0  0 0 0  0 0
Whenever the value in 'Peak Value' is greater than zero, it should add 1 to 'Label' and replace all the zeros below it. For the next value greater than zero it should get incremented to 2 and replace all the zeros with 2.
So the output will look like this:
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label      0 0 0  1 1 1  2 2 2  3 3
and so on...
I tried writing a function, but I am only able to add 1 when the value is greater than 0 in 'Peak Value'.
def funct(row):
    if row['Peak Value'] > 0:
        val = 1
    else:
        val = 0
    return val

df['Label'] = df.apply(funct, axis=1)
Maybe you could try using cumsum and ffill:
import numpy as np
df['Labels'] = (df['Peak Value'] > 0).groupby(df['Peak Value']).cumsum()
df['Labels'] = df['Labels'].replace(0, np.nan).ffill().replace(np.nan, 0).astype(int)
Output:
Peak Value Labels
0 0 0
1 0 0
2 0 0
3 88 1
4 0 1
5 0 1
6 88 2
7 0 2
8 0 2
9 88 3
10 0 3
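For what it's worth, since the label should increment exactly at the rows where 'Peak Value' is positive, a plain cumulative sum of the boolean mask gives the same result without the groupby or the NaN round-trip. A sketch using the example data:

```python
import pandas as pd

df = pd.DataFrame({'Peak Value': [0, 0, 0, 88, 0, 0, 88, 0, 0, 88, 0]})

# Each positive peak flips the mask to True once; cumsum turns that
# into an incrementing label that carries forward automatically.
df['Label'] = (df['Peak Value'] > 0).cumsum()
print(df['Label'].tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
```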

Splitting a each column value into different columns [duplicate]

This question already has answers here:
Convert pandas DataFrame column of comma separated strings to one-hot encoded
(3 answers)
Closed 4 years ago.
I have a survey response sheet which has questions which can have multiple answers, selected using a set of checkboxes.
When I get the data from the response sheet and import it into pandas I get this:
Timestamp Sports you like Age
0 23/11/2013 13:22:30 Football, Chess, Cycling 15
1 23/11/2013 13:22:34 Football 25
2 23/11/2013 13:22:39 Swimming,Football 22
3 23/11/2013 13:22:45 Chess, Soccer 27
4 23/11/2013 13:22:48 Soccer 30
There can be any number of sport values in the sports column (further rows have basketball, volleyball, etc.) and there are still some other columns. I'd like to compute statistics on the results of the question (how many people like Football, etc.). The problem is that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.
Is there a simple way within Pandas to convert this sort of data frame into one where there are multiple columns called Sports-Football, Sports-Volleyball, Sports-Basketball, and each of those is boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this
What I need is a new dataframe that looks like this (along with Age column) -
Timestamp Sports-Football Sports-Chess Sports-Cycling ....
0 23/11/2013 13:22:30 1 1 1
1 23/11/2013 13:22:34 1 0 0
2 23/11/2013 13:22:39 1 0 0
3 23/11/2013 13:22:45 0 1 0
I tried up to this point and can't proceed further.
df['Sports you like'].str.split(r',\s*')
which splits into different columns but the first column may have any sport, I need only 1 in first column if the user likes Football or 0.
The problem is the separator ,\s* (a comma plus optional whitespace), so the solution is to add str.split with str.join before str.get_dummies:
df1 = (df.pop('Sports you like').str.split(r',\s*')
         .str.join('|')
         .str.get_dummies()
         .add_prefix('Sports-'))
df = df.join(df1)
print(df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
Or use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
s = df.pop('Sports you like').str.split(r',\s*')
df1 = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_).add_prefix('Sports-')
print (df1)
Sports-Chess Sports-Cycling Sports-Football Sports-Soccer \
0 1 1 1 0
1 0 0 1 0
2 0 0 1 0
3 1 0 0 1
4 0 0 0 1
Sports-Swimming
0 0
1 0
2 1
3 0
4 0
df = df.join(df1)
print (df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
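Once the indicator columns exist, the original goal (how many people like each sport) is just a column sum. A minimal sketch, rebuilding the dummy frame as above:

```python
import pandas as pd

df = pd.DataFrame({'Sports you like': ['Football, Chess, Cycling',
                                       'Football',
                                       'Swimming,Football',
                                       'Chess, Soccer',
                                       'Soccer']})

# Split on comma plus optional whitespace, then one-hot encode.
df1 = (df['Sports you like'].str.split(r',\s*')
         .str.join('|')
         .str.get_dummies()
         .add_prefix('Sports-'))

# Summing the 0/1 columns counts how many people like each sport.
counts = df1.sum()
print(counts['Sports-Football'])  # 3
```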
