Operations on selective rows on pandas dataframe - python-3.x

I have a dataframe with phone calls; some of them have zero duration. I want to replace those zeros with random integer values ranging from 0 to 7, but every attempt I make leads to errors or data loss.
I wrote this function:
import random

def calls_new(dur):
    dur = random.randint(0, 7)
    return dur
and I tried to use it like this (each of the following lines is a separate attempt):
df_calls['duration'] = df_calls['duration'].apply(lambda row: x = random.randint(0,7) if x == 0 )
df_calls['duration'] = df_calls['duration'].where(df_calls['duration'] == 0, df_calls.apply(calls_new))
df_calls['duration'] = df_calls[df_calls['duration']==0].apply(calls_new)

Use .loc to set values only where duration is 0. You can generate all of the random numbers and assign them at once. If you want 7 to be possible, the high end of randint needs to be 8, since the docs state that high is one above the largest integer to be drawn.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'duration': [0,10,20,0,15,0,0,211]})
m = df['duration'].eq(0)                                    # boolean mask of zero-duration rows
df.loc[m, 'duration'] = np.random.randint(0, 8, m.sum())   # need exactly m.sum() random numbers
print(df)
duration
0 4
1 10
2 20
3 7
4 15
5 6
6 2
7 211
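An equivalent approach, as a minimal sketch (using the same df_calls / 'duration' names from the question), is Series.mask, which keeps the non-zero durations and only swaps in random draws where the value is zero:
# sketch: mask() replaces values where the condition is True
rand = np.random.randint(0, 8, len(df_calls))
df_calls['duration'] = df_calls['duration'].mask(df_calls['duration'].eq(0), rand)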

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
I converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now we see the second-level of columns are the values. Thus, within each top-level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
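As a side note (assuming the fitted encoder from the question is still around), sklearn's OneHotEncoder can undo its own encoding via inverse_transform, which avoids parsing column names altogether:
# sketch: recover the original categorical columns straight from the fitted encoder
original = pd.DataFrame(ohe.inverse_transform(codes), columns=X_object.columns)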
An even simpler way is to just stick with pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then, to get back, you can just check the name of the column and select the rows in the original dataframe with the same tuple.
print(df[df["combined"]==one_hot.columns[0]])
Out:
Index Class Family combined
2 3 High 5 (High, 5)
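For completeness: on a recent pandas (1.5+), pd.from_dummies can reverse such a frame directly, provided it is given only the 0/1 dummy columns and the separator used in their names. A hedged sketch, assuming the question's resultant frame of dummy columns is called dummies:
# sketch: dummies holds only the one-hot columns ('V1_Mid', ..., 'V2_2') as 0/1 ints
recovered = pd.from_dummies(dummies, sep='_').rename(columns={'V1': 'Class', 'V2': 'Family'})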

Alternative to looping? Vectorisation, cython?

I have a pandas dataframe something like the below:
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 0 0 0 0
ref2 20 2028 2 5 0 0 0 0 0
ref3 30 2021 7 16 0 0 0 0 0
ref4 40 2025 9 18 0 0 0 0 0
ref5 10 2022 4 30 0 0 0 0 0
The 'Total' column shows how many of a product needs to be delivered.
'First_Year_Del' tells you how many will be delivered in the first year. After that, the delivery rate reverts to 'Del_rate', a flat rate applied each year until all products are delivered.
The 'Yr_to_Use' column tells you the first year column to begin delivery from.
EXAMPLE: Ref1 has 100 to deliver. It will start delivering in 2020 and will deliver 5 in the first year, and 10 each year after that until all 100 are accounted for.
Any ideas how to go about this?
I thought I might use something like the below to reference which columns to use in turn, but I'm not even sure whether that's helpful, as it will depend on the solution (in the proper version, base_date.year is defined as the first column in the table, 2019):
start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice+no_yrs_to_project
df.columns[start_index_for_slice:end_index_for_slice]
I'm pretty new to Python and am not sure if I'm getting ahead of myself a bit...
The way I would think to go about it would be a for loop, or something using iterrows, but other posts seem to say this is a bad idea and that I should be using vectorisation, Cython or lambdas. Of those three I've only managed a very simple lambda so far. The others are a bit of a mystery to me, since the solution seems to require doing one action after another until complete.
Any and all help appreciated!
Thanks
EDIT: Example expected output below (I edited some of the dates so you can better see the logic):
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 5 10 10 10
ref2 20 2021 2 5 0 0 2 5 5
ref3 30 2021 7 16 0 0 7 16 7
ref4 40 2019 9 18 9 18 13 0 0
ref5 10 2020 4 30 0 4 6 0 0
Here's another option, which separates the calculation of the rates/years matrix and appends it to the input df later on. Still does looping in the script itself (not "externalized" to some numpy / pandas function). Should be fine for 5k rows I'd guesstimate.
import pandas as pd
import numpy as np
# create the initial df without years/rates
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
                   'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
                   'First_Year_Del': [5, 2, 7, 9, 10],
                   'Del_rate': [10, 5, 16, 18, 30]})

# get number of full-rate years (n) + remainder (r) for each row
n, r = np.divmod((df['Total'] - df['First_Year_Del']), df['Del_rate'])

# get the year of the last rate considering all rows
max_year = np.max(n + r.astype(bool) + df['Yr_to_Use'])

# get the offsets for the start of delivery; year zero is 2019,
# so subtracting it lets you use the offset as an index
offset = df['Yr_to_Use'] - 2019

# get a year index; this determines the columns that will be created
yrs = np.arange(2019, max_year + 1)

# prepare an array to hold the rates for all years, initialized with zeros
# (rows: number of rows of the df, columns: number of years where rates will have to be paid)
out = np.zeros((df['Total'].shape[0], yrs.shape[0]))

# calculate the rates for each year and insert them into the output array
for i in range(df['Total'].shape[0]):
    # concatenate: first-year delivery, all full-rate years, a final rate if there is a remainder
    if r[i]:  # remainder is not zero, append it as well
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i] * [df['Del_rate'][i]], [r[i]]])
    else:     # remainder is zero, skip it
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i] * [df['Del_rate'][i]]])
    # insert the rates at the appropriate location of the output array
    out[i, offset[i]:offset[i] + rates.shape[0]] = rates

# add the years/rates matrix to the original df
df = pd.concat([df, pd.DataFrame(out, columns=yrs.astype(str))], axis=1, sort=False)
You can accomplish this using a few user-defined functions and the apply method:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'id': ['ref1', 'ref2', 'ref3', 'ref4', 'ref5'],
                        'Total': [100, 20, 30, 40, 10],
                        'Yr_to_Use': [2020, 2028, 2021, 2025, 2022],
                        'First_Year_Del': [5, 2, 7, 9, 4],
                        'Del_rate': [10, 5, 16, 18, 30]})

def f(r):
    '''
    Computes the delivery values per year and the respective years.
    '''
    n = (r['Total'] - r['First_Year_Del']) // r['Del_rate']
    leftover = (r['Total'] - r['First_Year_Del']) % r['Del_rate']
    # note: a trailing 0 is appended when there is no remainder; it just writes a harmless 0
    r['values'] = [r['First_Year_Del']] + [r['Del_rate'] for _ in range(n)] + [leftover]
    r['years'] = np.arange(r['Yr_to_Use'], r['Yr_to_Use'] + len(r['values']))
    return r

df = df.apply(f, axis=1)

def get_year_range(r):
    '''
    Computes min and max year for each row.
    '''
    r['y_min'] = min(r['years'])
    r['y_max'] = max(r['years'])
    return r

df = df.apply(get_year_range, axis=1)

y_min = df['y_min'].min()
y_max = df['y_max'].max()

# Initialize each year column to zero
for year in range(y_min, y_max + 1):
    df[year] = 0

def expand(r):
    '''
    Update the value for each year of the row.
    '''
    for v, y in zip(r['values'], r['years']):
        r[y] = v
    return r

# Apply and drop the temporary columns
df = df.apply(expand, axis=1).drop(['values', 'years', 'y_min', 'y_max'], axis=1)
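Since the question explicitly asks about vectorisation: the same schedule can also be built without any Python-level loop, using NumPy broadcasting on the cumulative amount delivered. This is only a sketch, assuming the column names used above and that the output year columns should run from 2019 to 2030:
# df here is the original frame with just Total / Yr_to_Use / First_Year_Del / Del_rate
years = np.arange(2019, 2031)                                 # assumed output year range
k = years[None, :] - df['Yr_to_Use'].values[:, None]          # years elapsed since each row's start year
cum = np.clip(df['First_Year_Del'].values[:, None] + k * df['Del_rate'].values[:, None],
              0, df['Total'].values[:, None])                 # cumulative delivery, capped at Total
cum[k < 0] = 0                                                # nothing delivered before the start year
per_year = np.diff(np.concatenate([np.zeros((len(df), 1)), cum], axis=1), axis=1)
result = pd.concat([df, pd.DataFrame(per_year.astype(int), columns=years.astype(str))], axis=1)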

How to split a dataframe and plot some columns

I have a dataframe with 990 rows and 7 columns. I want to make an X vs Y line graph, breaking the line every 22 rows.
I think that dividing the dataframe and then plotting it would be a good way, but I don't get good results.
import matplotlib.pyplot as plt

max_rows = 22
dataframes = []
while len(Co1new) > max_rows:
    top = Co1new[:max_rows]
    dataframes.append(top)
    Co1new = Co1new[max_rows:]
else:
    dataframes.append(Co1new)

for grafico in dataframes:
    AC = plt.plot(grafico)
    AC = plt.xlabel('Frequency (Hz)')
    AC = plt.ylabel("Temperature (K)")
plt.show()
The code functions but it is not plotting the right columns.
Here is some reduced data; in this case it should be divided every four rows:
df = pd.DataFrame({
'col1':[2.17073,2.14109,2.16052,2.81882,2.29713,2.26273,2.26479,2.7643,2.5444,2.5027,2.52532,2.6778],
'col2':[10,100,1000,10000,10,100,1000,10000,10,100,1000,10000],
'col3':[2.17169E-4,2.15889E-4,2.10526E-4,1.53785E-4,2.09867E-4,2.07583E-4,2.01699E-4,1.56658E-4,1.94864E-4,1.92924E-4,1.87634E-4,1.58252E-4]})
One way I can think of is to add a new column with labels for every 22 records. See below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
seaborn.set(style='ticks')
"""
Assuming the index is numeric and is from [0-990)
this will return an integer for every 22 records
"""
Co1new['subset'] = 'S' + np.floor_divide(Co1new.index, 22).astype(str)
Out (shown for the reduced data, grouping every 4 rows instead of 22):
col1 col2 col3 subset
0 2.17073 10 0.000217 S0
1 2.14109 100 0.000216 S0
2 2.16052 1000 0.000211 S0
3 2.81882 10000 0.000154 S0
4 2.29713 10 0.000210 S1
5 2.26273 100 0.000208 S1
6 2.26479 1000 0.000202 S1
7 2.76434 10000 0.000157 S1
8 2.54445 10 0.000195 S2
9 2.50270 100 0.000193 S2
10 2.52532 1000 0.000188 S2
11 2.67780 10000 0.000158 S2
You can then use seaborn.pairplot to plot your data pairwise and use Co1new['subset'] as legend.
seaborn.pairplot(Co1new, hue='subset')
Or, if you specifically need line charts, you can plot your data one pair of columns at a time; here is col1 vs. col3:
seaborn.lineplot(x='col1', y='col3', hue='subset', data=Co1new)
Using @SIA's answer:
df['groups'] = np.floor_divide(df.index, 4).astype(str)   # 4 rows per group for the reduced data
import plotly.express as px
fig = px.line(df, x="col1", y="col2", color='groups')
fig.show()
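If you prefer plain matplotlib, here is a minimal sketch of the same idea (an assumption, based on the labels in the original post, that col2 holds the frequency and col1 the temperature, grouping every 4 rows of the reduced data):
import numpy as np
import matplotlib.pyplot as plt

# one line per group of 4 consecutive rows
for label, chunk in df.groupby(np.floor_divide(df.index, 4)):
    plt.plot(chunk['col2'], chunk['col1'], marker='o', label=f'S{label}')
plt.xscale('log')                 # col2 spans 10 to 10000
plt.xlabel('Frequency (Hz)')
plt.ylabel('Temperature (K)')
plt.legend()
plt.show()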

I'm not able to add column for all rows in pandas dataframe

I'm pretty new to Python/pandas, so it's probably a pretty simple question... but I can't work it out:
I have two dataframes loaded from Oracle SQL: one with 300 rows / 2 columns and a second with one row / one column. I would like to add the column from the second dataset to the first as a new column, repeated for every row. But I can only get it for the first row; the others are NaN.
import cx_Oracle
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.externals import joblib
dsn_tns = cx_Oracle.makedsn('127.0.1.1', '1521', 'orcl')
conn = cx_Oracle.connect(user='MyName', password='MyPass', dsn=dsn_tns)
d_score = pd.read_sql_query(
'''
SELECT
ID
,RESULT
,RATIO_A
,RATIO_B
from ORCL_DATA
''', conn) #return 380 rows
d_score['ID'] = d_score['ID'].astype(int)
d_score['RESULT'] = d_score['RESULT'].astype(int)
d_score['RATIO_A'] = d_score['RATIO_A'].astype(float)
d_score['RATIO_B'] = d_score['RATIO_B'].astype(float)
d_score_features = d_score.iloc[:, 2:4]
#d_train_target = d_score.iloc[:, 1:2]  # target is RESULT
DM_train = xgb.DMatrix(data= d_score_features)
loaded_model = joblib.load("bst.dat")
pred = loaded_model.predict(DM_train)
i = pd.DataFrame({'ID':d_score['ID'],'Probability':pred})
print(i)
s = pd.read_sql_query('''select max(id_process) as MAX_ID_PROCESS from PROCESS''',conn) #return only 1 row
m =pd.DataFrame(data=s, dtype=np.int64,columns = ['MAX_ID_PROCESS'] )
print(m)
i['new'] = m ##Trying to add MAX_ID_PROCESS to all rows
print(i)
i =
ID Probability
0 20101 0.663083
1 20105 0.486774
2 20106 0.441300
3 20278 0.703176
4 20221 0.539185
....
379 20480 0.671976
m =
MAX_ID_PROCESS
0 274
i =
ID_MATCH Probability new
0 20101 0.663083 274.0
1 20105 0.486774 NaN
2 20106 0.441300 NaN
3 20278 0.703176 NaN
4 20221 0.539185 NaN
I need the 'new' value filled in for all rows...
Since your second dataframe has only one value, you can assign it like this:
df1['new'] = df2.MAX_ID_PROCESS[0]
# Or using .loc
df1['new'] = df2.MAX_ID_PROCESS.loc[0]
In your case, it should be:
i['new'] = m.MAX_ID_PROCESS[0]
You should now see:
ID Probability new
0 20101 0.663083 274.0
1 20105 0.486774 274.0
2 20106 0.441300 274.0
3 20278 0.703176 274.0
4 20221 0.539185 274.0
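Equivalently (a minor variant, not from the original answer), you can pull the scalar out positionally or squeeze the one-cell frame down to a scalar:
# both lines broadcast the same scalar to every row
i['new'] = m['MAX_ID_PROCESS'].iat[0]
i['new'] = m.squeeze()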
We can assign one column of dataframe1 to dataframe2 as a new column using: dataframe2["new_column_name"] = dataframe1["column_to_copy"].
We can extend this approach to solve your problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1["ColA"] = [1, 12, 32, 24,12]
df1["ColB"] = [23, 11, 6, 45,25]
df1["ColC"] = [10, 25, 3, 23,15]
print(df1)
Output:
ColA ColB ColC
0 1 23 10
1 12 11 25
2 32 6 3
3 24 45 23
4 12 25 15
Now we create a new dataframe and add a row to it.
df3 = pd.DataFrame()
df3["ColTest"] = [1]
Now we store the value of the first row of the second dataframe as we want to add it to all the rows in dataframe1 as a new column:
val = df3.iloc[0]
print(val)
Output:
ColTest 1
Name: 0, dtype: int64
Now, we will store this value for as many rows as we have in dataframe1.
rows = len(df1)
for row in range(rows):
    df3.loc[row] = val
print(df3)
Output:
ColTest
0 1
1 1
2 1
3 1
4 1
Now we will append this column to the first dataframe and solve your problem.
df["ColTest"] = df3["ColTest"]
print(df)
Output:
ColA ColB ColC ColTest
0 1 23 10 1
1 12 11 25 1
2 32 6 3 1
3 24 45 23 1
4 12 25 15 1

Sum and collapse two rows in pandas if two values are equal (order does not matter)

I am analyzing a dataset that has an Origin ID (Column A), a Destination ID (Column B), and how many trips have happened between them (Column Count). Now I want to sum the A-B trips with the B-A trips. This sum is the total number of trips between A and B.
Here is what my data looks like (it is not necessarily ordered in this way):
In [1]: group_station = pd.DataFrame([[1, 2, 100], [2, 1, 200], [4, 6, 5] , [6, 4, 10], [1, 4, 70]], columns=['A', 'B', 'Count'])
Out[2]:
A B Count
0 1 2 100
1 2 1 200
2 4 6 5
3 6 4 10
4 1 4 70
And I want the following output:
A B C
0 1 2 300
1 4 6 15
4 1 4 70
I have tried groupby and setting the index to both variables with no success. Right now I am doing a very inefficient double loop, that is too slow for the size of my dataset.
If it helps this is the code for the double loop (I removed some efficiency modifications to make it more clear):
# group_station is the dataframe (here assumed indexed by ['A', 'B'])
collapsed_group_station = np.zeros((len(group_station), 3))
for i, row in enumerate(group_station.iterrows()):
    start_id = row[0][0]
    end_id = row[0][1]
    count = row[1][0]
    for check_row in group_station.iterrows():
        check_start_id = check_row[0][0]
        check_end_id = check_row[0][1]
        check_count = check_row[1][0]
        if start_id == check_end_id and end_id == check_start_id:
            collapsed_group_station[i][0] = start_id
            collapsed_group_station[i][1] = end_id
            collapsed_group_station[i][2] = count + check_count
            break
I have ideas of how to make this code more efficient, but I wanted to know if there is a way of doing it without looping.
You can use np.sort with groupby.sum(). Sorting A and B within each row makes the (A, B) and (B, A) pairs identical, so they fall into the same group:
import numpy as np; import pandas as pd
group_station[['A','B']]=np.sort(group_station[['A','B']],axis=1)
group_station.groupby(['A','B'],as_index=False).Count.sum()
Out[175]:
A B Count
0 1 2 300
1 1 4 70
2 4 6 15
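If you would rather not overwrite the A and B columns in place, a small variant (a sketch, not part of the original answer) groups on sorted temporary keys instead:
# build sorted key columns without modifying group_station itself
keys = pd.DataFrame(np.sort(group_station[['A', 'B']], axis=1),
                    columns=['A', 'B'], index=group_station.index)
result = group_station.groupby([keys['A'], keys['B']])['Count'].sum().reset_index()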
