New column in pandas, based on another column's last value - python-3.x

In the dataframe I have this data:
                  Open        High         Low       Close   Volume
Date
2015-05-01  538.429993  539.539978  532.099976  537.900024  1768200
2015-05-04  538.530029  544.070007  535.059998  540.780029  1308000
2015-05-05  538.210022  539.739990  530.390991  530.799988  1383100
2015-05-06  531.239990  532.380005  521.085022  524.219971  1567000
My question is: how do I add a new column and give it the value 0 if the last close was lower than the present close, and 1 if it was higher?
How do I make this work throughout the dataframe?

df['increasing'] = (df['Open'].diff() > 0).astype(int)
or
df['increasing'] = (df['Open'] - df['Open'].shift() > 0).astype(int)
both work, but the former is quicker. (Substitute 'Close' for 'Open' if, as in your question, you want to compare closes; the walkthrough below sticks with 'Open'.)
Take, for example,
In [41]: import pandas_datareader.data as pdata
In [42]: df = pdata.get_data_yahoo('AAPL', start='2009-01-02', end='2009-12-31')
In [43]: df.head()
Out[43]:
Open High Low Close Volume Adj Close
Date
2009-01-02 85.880003 91.040001 85.160000 90.750001 186503800 11.933430
2009-01-05 93.170003 96.179998 92.709999 94.580002 295402100 12.437067
2009-01-06 95.950000 97.170001 92.389998 93.020000 322327600 12.231930
2009-01-07 91.809999 92.500001 90.260003 91.010000 188262200 11.967619
2009-01-08 90.430000 93.150002 90.039998 92.699999 168375200 12.189851
diff() returns the difference between adjacent rows:
In [45]: df['Open'].diff().head()
Out[45]:
Date
2009-01-02 NaN
2009-01-05 7.290000
2009-01-06 2.779997
2009-01-07 -4.140001
2009-01-08 -1.379999
Name: Open, dtype: float64
(df['Open'].diff() > 0) returns a boolean-valued Series which is True when the difference is positive:
In [46]: (df['Open'].diff() > 0).head()
Out[46]:
Date
2009-01-02 False
2009-01-05 True
2009-01-06 True
2009-01-07 False
2009-01-08 False
Name: Open, dtype: bool
Calling .astype(int) converts False to 0 and True to 1:
In [47]: (df['Open'].diff() > 0).astype(int).head()
Out[47]:
Date
2009-01-02 0
2009-01-05 1
2009-01-06 1
2009-01-07 0
2009-01-08 0
Name: Open, dtype: int64
The code becomes a bit more complicated if you need to assign
a third possible value, 2, when the difference is 0:
import numpy as np
diff = df['Open'].diff()
conditions = [diff > 0, diff < 0]
choices = [1, 0]
df['increasing'] = np.select(conditions, choices, default=2)
np.select is a generalization of np.where: np.where handles a single condition, while np.select handles multiple conditions. Above, the conditions are diff > 0 and diff < 0, and we wish to assign the values 1 and 0, respectively:
conditions = [diff > 0, diff < 0]
choices = [1, 0]
When neither condition is True, np.select assigns the default value 2:
df['increasing'] = np.select(conditions, choices, default=2)
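For comparison, the original two-value rule can also be written with np.where; a minimal sketch, equivalent to the .astype(int) version above (ties and the leading NaN both fall into the 0 branch):
import numpy as np
# 1 where the day-over-day difference is positive, 0 otherwise
df['increasing'] = np.where(df['Open'].diff() > 0, 1, 0)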

Related

Replace element with specific value to pandas dataframe

I have a pandas dataframe with the following form:
cluster number
Robin_lodging_Dorthy 0
Robin_lodging_Phillip 1
Robin_lodging_Elmer 2
... ...
I want to replace every 0 in the cluster number column with the string "low", every 1 with "mid", and every 2 with "high". Any idea of how that can be done?
You can use the replace function with a mapping to change your column values:
import pandas as pd

values = {
    0: 'low',
    1: 'mid',
    2: 'high'
}
data = {
    'name': ['Robin_lodging_Dorthy', 'Robin_lodging_Phillip', 'Robin_lodging_Elmer'],
    'cluster_number': [0, 1, 2]
}
df = pd.DataFrame(data)
df.replace({'cluster_number': values}, inplace=True)
print(df)
Output:
name cluster_number
0 Robin_lodging_Dorthy low
1 Robin_lodging_Phillip mid
2 Robin_lodging_Elmer high
More info in the pandas documentation for replace.
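If you only need to transform that single column, Series.map with the same dictionary is an equivalent one-liner; a small sketch using the df and values defined above:
df['cluster_number'] = df['cluster_number'].map(values)
Note one design difference: map returns NaN for values missing from the dictionary, while replace leaves them unchanged.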

The output of my code comes out too slowly. How can I speed up my process?

Thanks to the help of some users of this site, my code seems to work fine, but it's taking too long.
I'm trying to compare two dataframes (df1 has 1,291,250 rows, df2 has 1,286,692 rows).
If df1.iloc[0,0] == df2.iloc[0,0] and df1.iloc[0,1] == df2.iloc[0,1], then compare df1.iloc[0,2] and df2.iloc[0,2].
If the first (df1.iloc[0,2]) is larger, I want to put the first index into one list, and if the second (df2.iloc[0,2]) is larger, I want to put the second index into another list.
Example DataFrame
In [1]: df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns = ['A1', 'B1', 'C1'])
In [2]: df1
Out[2]:
A1 B1 C1
0 0 1 98
1 1 1 198
2 2 2 228
In [4]: df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns = ['A2', 'B2', 'C2'])
In [5]: df2
Out[5]:
A2 B2 C2
0 0 1 228
1 1 2 110
2 2 2 130
In [7]: find_high(df1, df2)   # the function is defined below
Out[7]: ([2], [0])            # the result I want
This is just a simple example; my actual data is much bigger.
My code is:
import glob

import numpy as np
import pandas as pd
import parmap

# split df1 into 60 chunks and pickle each one
# (mod refers to the current module object; its definition is not shown in the question)
for i in range(60):
    setattr(mod, f'df_1_{i}', np.array_split(df1, 60)[i])
    getattr(mod, f'df_1_{i}').to_pickle(f'df_1_{i}')

files = glob.glob('df_1_*')

def find_high_pre(df1, df2):
    subtract_df2 = []
    subtract_df1 = []
    same_data = []
    for df1_idx, line in enumerate(df1.to_numpy()):
        for df2_idx, row in enumerate(df2.to_numpy()):
            if (line[0:2] == row[0:2]).all():
                if line[2] < row[2]:
                    subtract_df2.append(df2_idx)
                    break
                elif line[2] > row[2]:
                    subtract_df1.append(df1_idx)
                    break
        else:
            continue
        break
    return (df1.iloc[subtract_df1].index.tolist(),
            df2.iloc[subtract_df2].index.tolist(),
            df1.iloc[same_data].index.tolist())

data_1 = []
for i in files:
    e_data = pd.read_pickle(i)
    num_cores = 30
    df_split = np.array_split(e_data, num_cores)
    data_1 += parmap.map(find_high_pre, df_split, pm_pbar=True, pm_processes=num_cores)
Chances are that replacing your nested for loops with a DataFrame.merge operation will take less time:
keys = ['A', 'B']
df1.columns = [*keys, 'C1']
df2.columns = [*keys, 'C2']
df = df1.reset_index().set_index(keys).merge(
    df2.reset_index().set_index(keys), on=keys)
# now we have a merged dataframe like this:
#      index_x   C1  index_y   C2
# A B
# 0 1        0   98        0  228
# 2 2        2  228        2  130
# from it we can easily extract the wanted indexes
data = [df.loc[df['C1'] > df['C2'], 'index_x'].values,
        df.loc[df['C1'] < df['C2'], 'index_y'].values]
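Put together with the example frames from the question, a self-contained sketch of this approach reproduces the wanted ([2], [0]):
import pandas as pd

df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns=['A1', 'B1', 'C1'])
df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns=['A2', 'B2', 'C2'])

keys = ['A', 'B']
df1.columns = [*keys, 'C1']
df2.columns = [*keys, 'C2']
df = df1.reset_index().set_index(keys).merge(
    df2.reset_index().set_index(keys), on=keys)

print(df.loc[df['C1'] > df['C2'], 'index_x'].tolist())  # [2]
print(df.loc[df['C1'] < df['C2'], 'index_y'].tolist())  # [0]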

Randomly select elements from string in a dataframe

I have a dataframe with 7 string columns:
bul; age; gender; hh; pn; freq_pn; rcrds_to_select
1; 2; 5; 1; ['35784905', '40666303', '47603805', '68229102'];4;3
2; 3; 3; 3; ['06299501', '07694901', '35070201'];3;2
In the last column I have the number of ids from the "pn" column that I need to select randomly. Example: in the first row I have 4 ids ['35784905', '40666303', '47603805', '68229102'] and I need to select 3 random ids and drop the one not selected. There can be rows with only one id. I came to the conclusion that I need to turn the values into tuples and store them in another column ('pnTuple'). I don't know if this is the right way.
mass_grouped3['pnTuple'] = [tuple(x) for x in mass_grouped3['pn'].values]
I think random.shuffle will do the job, but I have no idea how to implement it in my script. I was thinking something like this, but it is not working:
for row in mass_grouped3['pnTuple']:
    list = list(mass_grouped3['pnTuple'])
    whitelist = random.shuffle(list)
Any ideas how to do this selection are appreciated.
You want to randomly keep a single 1 in every row and make the rest 0. Here's one approach: sample one (row, column) index per row and assign 1 at the sampled positions, i.e.
idx = pd.DataFrame(np.stack(np.where(df == 1))).T.groupby(0).apply(lambda x: x.sample(1)).values
# array([[0, 2],
#        [1, 1],
#        [2, 0],
#        [3, 3]])
ndf = pd.DataFrame(np.zeros(df.shape, dtype=int), columns=df.columns)
ndf.values[idx[:, 0], idx[:, 1]] = 1
   W1  W2  W3  W4
0   0   0   1   0
1   0   1   0   0
2   1   0   0   0
3   0   0   0   1
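The snippet assumes df is a 0/1 indicator frame with at least one 1 in every row; a minimal, hypothetical setup to run it end to end:
import numpy as np
import pandas as pd

# hypothetical indicator frame, one or more 1s per row
df = pd.DataFrame([[1, 0, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 1],
                   [0, 1, 0, 1]], columns=['W1', 'W2', 'W3', 'W4'])

# sample one (row, column) position per row, then scatter 1s into a zero frame
idx = pd.DataFrame(np.stack(np.where(df == 1))).T.groupby(0).apply(lambda x: x.sample(1)).values
ndf = pd.DataFrame(np.zeros(df.shape, dtype=int), columns=df.columns)
ndf.values[idx[:, 0], idx[:, 1]] = 1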
Welcome to StackOverflow! Let's go step by step.
First, let's construct a call that randomly selects 3 items. Note that random.choices samples with replacement and can return duplicates, so random.sample, which picks k distinct items, is the better fit here:
>>> import random
>>> random.sample(['35784905', '40666303', '47603805', '68229102'], k=3)
['68229102', '40666303', '35784905']
I have a sample dataframe df with columns holding the same kind of data as yours:
>>> df
    a                                          b
0  12  [35784905, 40666303, 47603805, 68229102]
1  12             [06299501, 07694901, 35070201]
>>> df['b']
0    [35784905, 40666303, 47603805, 68229102]
1               [06299501, 07694901, 35070201]
Name: b, dtype: object
>>> df['b'].map(lambda alist: random.sample(alist, k=3) if len(alist) > 3 else alist)
0    [35784905, 68229102, 47603805]
1    [06299501, 07694901, 35070201]
Name: b, dtype: object
>>> df['b'] = df['b'].map(lambda alist: random.sample(alist, k=3) if len(alist) > 3 else alist)
We use pandas' map operation to apply this transformation to the whole column.
Note: the lambda function lambda alist: random.sample(alist, k=3) if len(alist) > 3 else alist only samples when a list has more than 3 items, and leaves shorter lists untouched.
Lambdas may be new to you, but they are a standard way of writing Python; spend some time with the Python, lambda, and pandas documentation. A sketch wiring in the question's per-row count follows.
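To tie this back to the question's per-row count, here is a hedged sketch that draws row['rcrds_to_select'] ids from row['pn'] without replacement (it assumes 'pn' holds actual lists rather than their string representation, and caps k at the list length):
import random

mass_grouped3['pn'] = mass_grouped3.apply(
    lambda row: random.sample(row['pn'], k=min(int(row['rcrds_to_select']), len(row['pn']))),
    axis=1)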

Using non-zero values from columns in function - pandas

I have the dataframe below and, within a function, would like to calculate the difference between columns 'animal1' and 'animal2' divided by their sum, taking into consideration only the rows where the values in both 'animal1' and 'animal2' are bigger than 0.
How could I do this?
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in a row are greater than 0 with DataFrame.gt, test it with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for "not equal to 0"
# df = data[data.ne(0).all(axis=1)].copy()

print(df)
         animal1  animal2
Cat            4        2
Chicken        3        2

print(animals(df))
Cat
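The same flow can also be written as one compact expression over the filtered frame, without the helper column; a small sketch using the data frame defined in the question:
masked = data[data.gt(0).all(axis=1)]
print(((masked['animal1'] - masked['animal2']) / (masked['animal1'] + masked['animal2'])).abs().idxmax())
# Cat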

New column based on a row with conditions in Pandas

I'm trying to do an operation with dataframes, but I'm not sure how to solve the problem using built-in pandas operations (my current code is based on a for loop, so I'm trying to build a more elegant solution).
Given the following dataframes, defined with the columns described below:
original_df = [o1, o2, o3, o4]
weights_df = [w1, w2, w3, w4]
conditions_df = [c1, c2, c3, c4]
I need to build a new column on original_df based on the division o1/w1; but c1 takes the values "+" or "-", and when it is "-" I need to compute -o1/w1 instead.
So far what I did was:
original_df['newcolumn'] = original_df / weights_df
which of course divides the two terms but without applying the condition. I'm trying to do it with the map and apply functions, but I'm not sure how to bring the third column into the function.
import pandas as pd

original_df = [100, 200, 300, 400]
weights_df = [10, 20, 30, 40]
conditions_df = [1, 2, 3, 4]
df = pd.DataFrame({'x': original_df, 'y': weights_df, 'z': conditions_df})

def div(x, y, z):
    if z > 2:
        return float(x / y)
    else:
        return float(-1 * x / y)

df['new_feature'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
This is one way of solving it. If your conditions_df contains '+'/'-', you can change the condition in div(x, y, z) accordingly.
You can use numpy.where to build a 1/-1 sign mask from the condition:
import numpy as np

# data from lisa's answer
# df = pd.DataFrame({'x': original_df, 'y': weights_df, 'z': conditions_df})
df['new_feature'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
print(df)
     x   y  z  new_feature
0  100  10  1        -10.0
1  200  20  2        -10.0
2  300  30  3         10.0
3  400  40  4         10.0
Timings:
# 4k rows
df = pd.concat([df] * 1000).reset_index(drop=True)

# lisa's answer
In [95]: %timeit df['new_feature1'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
10 loops, best of 3: 123 ms per loop

In [96]: %timeit df['new_feature2'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
1000 loops, best of 3: 595 µs per loop
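If, as in the original question, the condition column holds the strings "+" and "-", the same trick applies; a sketch assuming "-" means negate:
df['new_feature'] = df['x'] / df['y'] * np.where(df['z'] == '-', -1, 1)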
