Convert string column to int pandas DataFrame - python-3.x

I have a DataFrame with a column of unique strings, like below:
id customerId ...
1 vqUkxUDuEmB7gHWQvcYrBn
2 KaLEhwzZxCQ7GjPmVwBVav
3 pybDYgTiCUv3Pv3WLgxKCM
4 zqPiDV33KwrMBZoyeQXMJW
5 CR8z3ThPyzBKXFqqzemQAS
.
I want to replace the customerId values with ints using a method like
# replace dataFrame.customerId[from start to end]
dataFrame.customerId.replace(sum(map(ord, ???)))
How can I do that?

Given something like
import pandas as pd
df = pd.DataFrame(columns=['UID'], index=range(7))
df.iloc[0,0] = 'vqUkxUDuEmB7gHWQvcYrBn'
df.iloc[1,0] = 'KaLEhwzZxCQ7GjPmVwBVav'
df.iloc[2,0] = 'pybDYgTiCUv3Pv3WLgxKCM'
df.iloc[3,0] = 'zqPiDV33KwrMBZoyeQXMJW'
df.iloc[4,0] = 'CR8z3ThPyzBKXFqqzemQAS'
df.iloc[5,0] = 'zqPiDV33KwrMBZoyeQXMJW' # duplicate of row 3
df.iloc[6,0] = 'vqUkxUDuEmB7gHWQvcYrBn' # duplicate of row 0
PS: I added 2 UIDs equal to earlier ones to show that they get categorized consistently.
You can use a categorical type:
df['UID_categorical'] = df.UID.astype('category').cat.codes
output
UID UID_categorical
0 vqUkxUDuEmB7gHWQvcYrBn 3
1 KaLEhwzZxCQ7GjPmVwBVav 1
2 pybDYgTiCUv3Pv3WLgxKCM 2
3 zqPiDV33KwrMBZoyeQXMJW 4
4 CR8z3ThPyzBKXFqqzemQAS 0
5 zqPiDV33KwrMBZoyeQXMJW 4
6 vqUkxUDuEmB7gHWQvcYrBn 3
where UID_categorical is an int:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UID 7 non-null object
1 UID_categorical 7 non-null int8
dtypes: int8(1), object(1)
memory usage: 191.0+ bytes
If you want to replace the original column, just do:
df['UID'] = df.UID.astype('category').cat.codes
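If you don't need the categorical dtype at all, pd.factorize gives the same kind of integer codes, numbered in order of first appearance (a small sketch, not part of the original answer; duplicates still share a code):
df['UID_int'] = pd.factorize(df['UID'])[0]  # 0, 1, 2, 3, 4, 3, 0 for the sample above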

Related

How to add text element to series data in Python

I have a Series in Python defined as:
scores_data = (pd.Series([F1[0], auc, ACC[0], FPR[0], FNR[0], TPR[0], TNR[0]])).round(4)
I want to insert the text 'Features' at location 0 of the Series.
I tried scores_data.loc[0] but that replaced the data at location 0.
Thanks for your help.
You can't directly insert a value in a Series (like you could in a DataFrame with insert).
You can use concat:
s = pd.Series([1,2,3,4])
s2 = pd.concat([pd.Series([0], index=[-1]), s])
output:
-1 0
0 1
1 2
2 3
3 4
dtype: int64
Or create a new Series from the values:
pd.Series([0]+s.to_list())
output:
0 0
1 1
2 2
3 3
4 4
dtype: int64
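Applied to the scores_data Series from the question (variable name and the 'Features' label are taken from the question; a sketch, and the result becomes an object-dtype Series because it mixes text and numbers):
scores_data = pd.concat([pd.Series(['Features']), scores_data], ignore_index=True)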

Highest frequency in a dataframe

I am looking for a way to get the highest frequency in the entire pandas DataFrame, not in a particular column. I have looked at value_counts, but it seems to work column by column. Any way to do that?
Use DataFrame.stack with Series.mode to get the most frequent values; for the first one, select it by position:
df = pd.DataFrame({
'B':[4,5,4,5,4,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
})
a = df.stack().mode().iat[0]
print (a)
4
Or, if you also need the frequency, use Series.value_counts:
s = df.stack().value_counts()
print (s)
4 6
5 4
3 3
9 2
7 2
2 2
1 2
8 1
6 1
0 1
dtype: int64
print (s.index[0])
4
print (s.iat[0])
6
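If the frame is entirely numeric, an equivalent check without stacking (a small sketch, not part of the original answer) is to flatten the values with NumPy:
import numpy as np
vals, counts = np.unique(df.to_numpy(), return_counts=True)
print(vals[counts.argmax()], counts.max())  # 4 6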

Complex group by using Pandas

I am facing a situation where I need to group a dataframe by a column 'ID' and also calculate the total time frame it took each ID to complete. I only want the difference between Date_Open and Date_Closed for each ID, together with the count of that ID.
We only need to focus on the Date_Open and Date_Closed fields, so it needs to take the max closing date and the min open date per ID and subtract the two.
The dataframe looks as follows:
ID Date_Open Date_Closed
1 01/01/2019 02/01/2019
1 07/01/2019 09/01/2019
2 10/01/2019 11/01/2019
2 13/01/2019 19/01/2019
3 10/01/2019 11/01/2019
The output should look like this:
ID Count_of_ID Total_Time_In_Days
1 2 8
2 2 9
3 1 1
How should I achieve this?
Use GroupBy with named aggregation and the min and max of the dates:
df[['Date_Open', 'Date_Closed']] = (
df[['Date_Open', 'Date_Closed']].apply(lambda x: pd.to_datetime(x, format='%d/%m/%Y'))
)
dfg = df.groupby('ID').agg(
Count_of_ID=('ID','size'),
Date_Open=('Date_Open','min'),
Date_Closed=('Date_Closed','max')
)
dfg['Total_Time_In_Days'] = dfg['Date_Closed'].sub(dfg['Date_Open']).dt.days
dfg = dfg.drop(columns=['Date_Closed', 'Date_Open']).reset_index()
ID Count_of_ID Total_Time_In_Days
0 1 2 8
1 2 2 9
2 3 1 1
Now we have Total_Time_In_Days as int:
print(dfg.dtypes)
ID int64
Count_of_ID int64
Total_Time_In_Days int64
dtype: object
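If you prefer to compute everything in one pass instead of dropping the helper columns afterwards, a groupby-apply sketch along the same lines (assuming the dates were already converted with pd.to_datetime as above):
out = (
    df.groupby('ID')
      .apply(lambda g: pd.Series({
          'Count_of_ID': len(g),
          'Total_Time_In_Days': (g['Date_Closed'].max() - g['Date_Open'].min()).days,
      }))
      .reset_index()
)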
This can also be used:
df['Date_Open'] = pd.to_datetime(df['Date_Open'], dayfirst=True)
df['Date_Closed'] = pd.to_datetime(df['Date_Closed'], dayfirst=True)
df_grouped = df.groupby(by='ID').count()
df_grouped['Total_Time_In_Days'] = df.groupby(by='ID')['Date_Closed'].max() - df.groupby(by='ID')['Date_Open'].min()
df_grouped = df_grouped.drop(columns=['Date_Open'])
df_grouped.columns=['Count', 'Total_Time_In_Days']
print(df_grouped)
Count Total_Time_In_Days
ID
1 2 8 days
2 2 9 days
3 1 1 days
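Note that Total_Time_In_Days here is a timedelta rather than an integer; to match the desired output you can convert it with .dt.days:
df_grouped['Total_Time_In_Days'] = df_grouped['Total_Time_In_Days'].dt.days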
I'd first create a column showing how much time passed from Date_Open to Date_Closed for each row of the dataframe, like this:
df['Total_Time_In_Days'] = df.Date_Closed - df.Date_Open
Then you can use groupby:
df.groupby('ID').agg({'ID': 'count', 'Total_Time_In_Days': 'sum'})
Note that this sums the per-row durations, which is not the same as the max(Date_Closed) minus min(Date_Open) figure in the desired output. If you need any help with the .agg function you can refer to its official documentation.

adding 1 to the previous row based on conditions

I have a pandas dataframe like below:
data=[['A',1,30],
['A',1,2],
['A',0,4],
['A',1,4],
['B',0,5],
['B',1,1],
['B',0,5],
['B',1,8]]
df = pd.DataFrame(data,columns=['group','var_1','var_2'])
I want to create a series of values, with index, based on the conditions below:
Step 1) The increment should always start from the first row of 'var_2' in each group. For example: for group A the increment should start from 30 and for group B it should start from 5.
Step 2) Increment only where 'var_1' == 1.
My desired output:
0 30
1 31
3 32
5 6
7 7
IIUC:
#Get the first index in each group, union with the indexes where var_1 == 1
indx = df.drop_duplicates('group').index.union(df[(df['var_1']==1)].index)
#Reindex dataframe group by group, add cumsum value to the other values present in the group.
#Use .loc to filter where var_1 != 0 and get column var_2
df.reindex(indx).groupby('group')\
.transform(lambda x: x.iloc[0] + x.shift().notna().cumsum())\
.loc[lambda x: x.var_1 !=0, 'var_2']
Output:
0 30
1 31
3 32
5 6
7 7
Name: var_2, dtype: int64
Try groupby with cumcount and first:
df1 = df.loc[df.var_1.eq(1)]
g = df1.groupby('group')['var_2']
g.transform('first') + g.cumcount()
Out[66]:
0 30
1 31
3 32
5 1
7 2
dtype: int64
Or use duplicated with where and cumsum:
df1 = df.loc[df.var_1.eq(1)]
df1.var_2.where(~df1.duplicated('group'), 1).groupby(df1.group).cumsum()
Out[77]:
0 30
1 31
3 32
5 1
7 2
Name: var_2, dtype: int64
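Another way to reproduce the desired output exactly, seeding the counter from each group's first var_2 value (a sketch of my own, not taken from the answers above):
base = df.groupby('group')['var_2'].transform('first')        # 30 for group A, 5 for group B
keep = df['var_1'].eq(1) | ~df.duplicated('group')            # first row of a group or var_1 == 1
rank = keep.astype(int).groupby(df['group']).cumsum() - 1     # position among the kept rows
result = (base + rank)[keep & df['var_1'].eq(1)]              # 30, 31, 32, 6, 7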

I'm not able to add column for all rows in pandas dataframe

I'm pretty new to Python/pandas, so it's probably a pretty simple question... but I can't handle it:
I have two dataframes loaded from Oracle SQL: one with 300 rows / 2 columns and a second with one row / one column. I would like to add the column from the second dataset to the first, with its value repeated for every row. But I only get it for the first row and the others are NaN.
import cx_Oracle
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.externals import joblib
dsn_tns = cx_Oracle.makedsn('127.0.1.1', '1521', 'orcl')
conn = cx_Oracle.connect(user='MyName', password='MyPass', dsn=dsn_tns)
d_score = pd.read_sql_query(
'''
SELECT
ID
,RESULT
,RATIO_A
,RATIO_B
from ORCL_DATA
''', conn) #return 380 rows
d_score['ID'] = d_score['ID'].astype(int)
d_score['RESULT'] = d_score['RESULT'].astype(int)
d_score['RATIO_A'] = d_score['RATIO_A'].astype(float)
d_score['RATIO_B'] = d_score['RATIO_B'].astype(float)
d_score_features = d_score.iloc [:,2:4]
#d_train_target = d_score.iloc[:,1:2] #target is RESULT
DM_train = xgb.DMatrix(data= d_score_features)
loaded_model = joblib.load("bst.dat")
pred = loaded_model.predict(DM_train)
i = pd.DataFrame({'ID':d_score['ID'],'Probability':pred})
print(i)
s = pd.read_sql_query('''select max(id_process) as MAX_ID_PROCESS from PROCESS''',conn) #return only 1 row
m =pd.DataFrame(data=s, dtype=np.int64,columns = ['MAX_ID_PROCESS'] )
print(m)
i['new'] = m ##Trying to add MAX_ID_PROCESS to all rows
print(i)
i =
ID Probability
0 20101 0.663083
1 20105 0.486774
2 20106 0.441300
3 20278 0.703176
4 20221 0.539185
....
379 20480 0.671976
m =
MAX_ID_PROCESS
0 274
i =
ID_MATCH Probability new
0 20101 0.663083 274.0
1 20105 0.486774 NaN
2 20106 0.441300 NaN
3 20278 0.703176 NaN
4 20221 0.539185 NaN
I need value 'new' for all rows...
Since your second dataframe only has one value, you can assign it like this:
df1['new'] = df2.MAX_ID_PROCESS[0]
# Or using .loc
df1['new'] = df2.MAX_ID_PROCESS.loc[0]
In your case, it should be:
i['new'] = m.MAX_ID_PROCESS[0]
You should now see:
ID Probability new
0 20101 0.663083 274.0
1 20105 0.486774 274.0
2 20106 0.441300 274.0
3 20278 0.703176 274.0
4 20221 0.539185 274.0
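Equivalent ways to pull the single value out of the one-row frame, in case you prefer explicit scalar access (a small sketch):
i['new'] = m['MAX_ID_PROCESS'].iat[0]  # positional scalar access
i['new'] = m.squeeze()                 # a 1x1 DataFrame squeezes down to a scalar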
We can append a column of dataframe1 to dataframe2 as a new column using: dataframe2["new_column_name"] = dataframe1["column_to_copy"].
We can extend this approach to solve your problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1["ColA"] = [1, 12, 32, 24,12]
df1["ColB"] = [23, 11, 6, 45,25]
df1["ColC"] = [10, 25, 3, 23,15]
print(df1)
Output:
ColA ColB ColC
0 1 23 10
1 12 11 25
2 32 6 3
3 24 45 23
4 12 25 15
Now we create a new dataframe and add a row to it.
df3 = pd.DataFrame()
df3["ColTest"] = [1]
Now we store the value of the first row of the second dataframe as we want to add it to all the rows in dataframe1 as a new column:
val = df3.iloc[0]
print(val)
Output:
ColTest 1
Name: 0, dtype: int64
Now, we will store this value for as many rows as we have in dataframe1.
rows = len(df1)
for row in range(rows):
    df3.loc[row] = val
print(df3)
Output:
ColTest
0 1
1 1
2 1
3 1
4 1
Now we will append this column to the first dataframe and solve your problem.
df["ColTest"] = df3["ColTest"]
print(df)
Output:
ColA ColB ColC ColTest
0 1 23 10 1
1 12 11 25 1
2 32 6 3 1
3 24 45 23 1
4 12 25 15 1
