Rename column index from 0 to last column pandas - python-3.x

I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see above, the column labels are 0,1,0,1,0,1, etc. I want to rename them back to a normal index starting from 0,1,2,3,4, ..., so I did the following:
dat = dat.reset_index(drop=True)
The column labels were not changed (reset_index only resets the row index). How do I get the columns renamed in this case? Thanks in advance.

dat.columns = range(dat.shape[1])

There are quite a few ways. If the column labels are unique you can use rename:
dat = dat.rename(columns=lambda x: dat.columns.get_loc(x))
Or
dat = dat.rename(columns=dict(zip(dat.columns, range(dat.shape[1]))))
With duplicated labels like these, though, both rename approaches break down (Index.get_loc returns a mask for a duplicated label, and the dict collapses duplicate keys), so prefer the direct assignment above or set_axis:
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)  # pass inplace=False on older pandas
Out[677]:
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2
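
For reference, a minimal reproducible sketch of the accepted one-liner (the frame below is a hypothetical reconstruction of part of the data shown above):
import pandas as pd

dat = pd.DataFrame([['A', 23, 0.10, 122, 56, 9],
                    ['B', 24, 0.45, 564, 36, 3]],
                   columns=[0, 1, 0, 1, 0, 1])   # duplicated column labels
dat.columns = range(dat.shape[1])                # relabel columns as 0..5
print(dat.columns.tolist())                      # [0, 1, 2, 3, 4, 5]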

Related

How to add flag column in each group of pandas groupby object

I have a df with three columns X, Y, Z. I want to group the data by X and then insert a flag column into each group. The condition for the flag column: if more than 30% of a group's Z values are greater than 1.5, the flag value for that group is 1; otherwise it is 0.
here is my example df:
df = pd.DataFrame({'X': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '3', '3', '3'],
                   'Y': ['34', '45', '33', '45', '44', '66', '67', '23', '34', '10', '11', '13', '12', '14'],
                   'Z': ['1.2', '1.3', '1.6', '1.7', '1.8', '0', '0', '0', '1.8', '1.2', '1.3', '1.6', '1.7', '1.8']})
X Y Z
0 1 34 1.2
1 1 45 1.3
2 1 33 1.6
3 1 45 1.7
4 1 44 1.8
5 2 66 0
6 2 67 0
7 2 23 0
8 2 34 1.8
9 2 10 1.2
10 2 11 1.3
11 3 13 1.6
12 3 12 1.7
13 3 14 1.8
desired results:
df_result = pd.DataFrame({'X': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '3', '3', '3'],
                          'Y': ['34', '45', '33', '45', '44', '66', '67', '23', '34', '10', '11', '13', '12', '14'],
                          'Z': ['1.2', '1.3', '1.6', '1.7', '1.8', '0', '0', '0', '1.8', '1.2', '1.3', '1.6', '1.7', '1.8'],
                          'flag': ['1', '1', '1', '1', '1', '0', '0', '0', '0', '0', '0', '1', '1', '1']})
print(df_result)
X Y Z flag
0 1 34 1.2 1
1 1 45 1.3 1
2 1 33 1.6 1
3 1 45 1.7 1
4 1 44 1.8 1
5 2 66 0 0
6 2 67 0 0
7 2 23 0 0
8 2 34 1.8 0
9 2 10 1.2 0
10 2 11 1.3 0
11 3 13 1.6 1
12 3 12 1.7 1
13 3 14 1.8 1
Use GroupBy.transform with a lambda function, converting the boolean result to integers with Series.astype:
df["Z"]= df["Z"].astype(float)
f = lambda x: (x > 1.5).sum() > len(x) *.3
# if necessary, convert 30% to an integer threshold with ceil (requires numpy imported as np)
# f = lambda x: (x > 1.5).sum() > np.ceil(len(x) * .3)
df['flag'] = df.groupby("X")["Z"].transform(f).astype(int)
print (df)
X Y Z flag
0 1 34 1.2 1
1 1 45 1.3 1
2 1 33 1.6 1
3 1 45 1.7 1
4 1 44 1.8 1
5 2 66 0.0 0
6 2 67 0.0 0
7 2 23 0.0 0
8 2 34 1.8 0
9 2 10 1.2 0
10 2 11 1.3 0
11 3 13 1.6 1
12 3 12 1.7 1
13 3 14 1.8 1
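As a quick sanity check (a hedged sketch mirroring the lambda above), the per-group share of Z values above 1.5 explains the flags:
share = df.groupby("X")["Z"].apply(lambda s: (s > 1.5).mean())
print(share)   # X '1': 0.60, '2': ~0.17, '3': 1.00 -> flags 1, 0, 1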
Try this. Please let me know if there is any issue.
import pandas as pd
import math
df = pd.DataFrame({'X': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '3', '3', '3'],
                   'Y': ['34', '45', '33', '45', '44', '66', '67', '23', '34', '10', '11', '13', '12', '14'],
                   'Z': ['1.2', '1.3', '1.6', '1.7', '1.8', '0', '0', '0', '1.8', '1.2', '1.3', '1.6', '1.7', '1.8']})
df["Z"] = pd.to_numeric(df["Z"])

def func(x):
    p = math.ceil(x.shape[0] * 3 / 10)
    if sum(x > 1.5) > p:
        return 1
    else:
        return 0

t = df.groupby("X")["Z"].apply(func).reset_index(name="flag")
df["flag"] = df["X"].apply(lambda x: t[t["X"]==x]["flag"].values[0])
Output:
    X   Y    Z  flag
0   1  34  1.2     1
1   1  45  1.3     1
2   1  33  1.6     1
3   1  45  1.7     1
4   1  44  1.8     1
5   2  66  0.0     0
6   2  67  0.0     0
7   2  23  0.0     0
8   2  34  1.8     0
9   2  10  1.2     0
10  2  11  1.3     0
11  3  13  1.6     1
12  3  12  1.7     1
13  3  14  1.8     1
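A hedged note on the last line of the code above: the per-row apply that scans t for every row can be replaced by a vectorized lookup, which should be noticeably faster on large frames:
# build a lookup Series indexed by X, then map each row's X to its group flag
flag_by_group = t.set_index("X")["flag"]
df["flag"] = df["X"].map(flag_by_group)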

How to save rows when a value changes in a column - python

I have a DataFrame with two columns, ID and Value1. I want to select the rows where the value of column Value1 changes, saving the 3 rows before the change, the 3 rows after it, and the change-point row itself.
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
ID Value1
0 1 0
1 3 0
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
16 56 1
output:
ID Value1
0 4 0
1 6 0
2 7 0
3 8 2
4 90 2
5 23 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
IIUC,
import numpy as np
df=pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
df = df.reset_index(drop=True) # index needs to start from zero for solution
ind = list(set([val for i in df[df['Value1'].diff()!=0].index for val in range(i-3, i+4) if i>0 and val>=0]))
# diff gives column wise differencing. combined it with nested list and
# finally, list(set()) to drop any duplicates in index values
df[df.index.isin(ind)]
ID Value1
2 4 0
3 6 0
4 7 0
5 8 2
6 90 2
7 23 2
8 56 2
9 78 0
10 90 0
11 34 0
12 56 1
13 78 1
14 89 1
15 34 1
If you want to retain occurrences of duplicates, drop the list(set()) wrapper around the list comprehension.
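
For completeness, a minimal vectorized sketch of the same idea, using a centered rolling window of 7 rows (3 before + the change point + 3 after) over the change mask; this assumes the default RangeIndex:
import pandas as pd

df = pd.DataFrame({'ID':[1,3,4,6,7,8,90,23,56,78,90,34,56,78,89,34,56],
                   'Value1':[0,0,0,0,0,2,2,2,2,0,0,0,1,1,1,1,1]})
change = df['Value1'].diff().ne(0)
change.iloc[0] = False            # the first diff is NaN, not a real change
near_change = (change.astype(int)
                     .rolling(7, center=True, min_periods=1)
                     .max()
                     .astype(bool))
print(df[near_change])            # same rows as the index-set solution above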

Pandas JOIN/MERGE/CONCAT Data Frame On Specific Indices

I want to join two data frames on specific indices, as per the map (dictionary) I have created. What is an efficient way to do this?
Data:
df = pd.DataFrame({"a": [10, 34, 24, 40, 56, 44],
                   "b": [95, 63, 74, 85, 56, 43]})
print(df)
a b
0 10 95
1 34 63
2 24 74
3 40 85
4 56 56
5 44 43
df1 = pd.DataFrame({"c": [1, 2, 3, 4],
                    "d": [5, 6, 7, 8]})
print(df1)
c d
0 1 5
1 2 6
2 3 7
3 4 8
d = {
    (1, 0): 0.67,
    (1, 2): 0.9,
    (2, 1): 0.2,
    (2, 3): 0.34,
    (4, 0): 0.7,
    (4, 2): 0.5
}
Desired Output:
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.9
...
5 56 56 3 7 0.5
I'm able to achieve this, but it takes a lot of time since the map for my original data frames has about 4.7M pairs. I'd love to know if there is a way to MERGE, JOIN or CONCAT these data frames on different indices.
My Approach:
matched_rows = []
for key in d.keys():
    s = df.iloc[key[0]].tolist() + df1.iloc[key[1]].tolist() + [d[key]]
    matched_rows.append(s)
df_matched = pd.DataFrame(matched_rows,
                          columns=df.columns.tolist() + df1.columns.tolist() + ['ratio'])
I would highly appreciate your help. Thanks a lot in advance.
Create a Series from the dictionary, convert it to a DataFrame with reset_index, DataFrame.join both frames on the resulting level columns, and finally remove the first 2 columns by position:
df = (pd.Series(d).reset_index(name='ratio')
.join(df, on='level_0')
.join(df1, on='level_1')
.iloc[:, 2:])
print (df)
ratio a b c d
0 0.67 34 63 1 5
1 0.90 34 63 3 7
2 0.20 24 74 2 6
3 0.34 24 74 4 8
4 0.70 56 56 1 5
5 0.50 56 56 3 7
And then if necessary reorder columns:
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
print (df)
a b c d ratio
0 34 63 1 5 0.67
1 34 63 3 7 0.90
2 24 74 2 6 0.20
3 24 74 4 8 0.34
4 56 56 1 5 0.70
5 56 56 3 7 0.50
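
If the join is still slow at ~4.7M pairs, a hedged alternative sketch with numpy fancy indexing (it relies on dicts preserving insertion order, Python 3.7+):
import numpy as np

keys = np.array(list(d.keys()))
out = pd.DataFrame(np.hstack([df.to_numpy()[keys[:, 0]],
                              df1.to_numpy()[keys[:, 1]]]),
                   columns=df.columns.tolist() + df1.columns.tolist())
out['ratio'] = list(d.values())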

Data Cleaning Python: Replacing the values of a column not within a range with NaN and then dropping the rows which contain NaN

I am doing some research and need to delete the rows containing values which are not in a specific range, using Python.
My Dataset in Excel:
I want to replace the values of column A not within the range 1-20 with NaN, replace the values of column B not within the range 21-40, and so on. Then I want to drop/delete the rows that contain NaN values.
Expected output should be like:
You can try this to solve your problem. Here, I simulated your problem and solved it with the code given below:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in range(1,10,1) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in range(10,20,1) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in range(20,30,1) else x)
print(data)
data = data.dropna()
print(data)
Original data:
A B C
0 1 10 20
1 2 11 22
2 4 15 25
3 8 20 30
4 12 25 35
5 18 40 55
6 20 45 60
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.0 30.0
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Final Output:
A B C
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Try this for non-integer numbers:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(1.00,10.00,0.01)) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(10.00,20.00,0.01)) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(20.00,30.00,0.01)) else x)
print(data)
data = data.dropna()
print(data)
Original data:
A B C
0 1.25 10.56 20.11
1 2.39 11.19 22.92
2 4.00 15.65 25.27
3 8.89 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
Final Output:
A B C
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
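Note that both snippets above turn values inside each simulated range into NaN, and membership tests against range()/np.arange are brittle for floats. A hedged sketch closer to the question's wording (NaN the values outside a per-column range; the bounds dict is hypothetical), using Series.between and Series.where:
bounds = {'A': (1, 20), 'B': (21, 40), 'C': (41, 60)}   # hypothetical per-column ranges
for col, (lo, hi) in bounds.items():
    data[col] = data[col].where(data[col].between(lo, hi))   # outside range -> NaN
data = data.dropna()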
Try this:
df= df.drop(df.index[df.idxmax()])
O/P:
A B C D
0 1 21 41 61
1 2 22 42 62
2 3 23 43 63
3 4 24 44 64
4 5 25 45 65
5 6 26 46 66
6 7 27 47 67
7 8 28 48 68
8 9 29 49 69
13 14 34 54 74
14 15 35 55 75
15 16 36 56 76
16 17 37 57 77
17 18 38 58 78
18 19 39 59 79
19 20 40 60 80
Use idxmax and drop the returned index.

Create dataframe column based on the progression values of another column?

I've the following dataframe:
car_id time(seconds) is_charging
1 1 65 1
2 1 70 1
3 1 67 1
4 1 71 1
5 1 120 0
6 1 124 0
7 1 117 0
8 1 80 1
9 1 74 1
10 1 62 1
11 1 130 0
12 1 124 0
I want to create a new column that enumerates the charging and discharging periods of the 'is_charging' column, so that later on I can group by that new column and compute the mean, max, min values, etc. of each period.
The resulting dataframe should be like this:
car_id time(seconds) is_charging periods_id
1 1 65 1 1
2 1 70 1 1
3 1 67 1 1
4 1 71 1 1
5 1 120 0 2
6 1 124 0 2
7 1 117 0 2
8 1 80 1 3
9 1 74 1 3
10 1 62 1 3
11 1 130 0 4
12 1 124 0 4
I've done this using a for statement, like this:
df['periods_id'] = 0

def computePeriodIDs():
    period_id = 1
    previous_charging_state = df.at[df.index[0], 'is_charging']
    for ind in df.index:
        if df.at[ind, 'is_charging'] != previous_charging_state:
            previous_charging_state = df.at[ind, 'is_charging']
            period_id = period_id + 1
        df.at[ind, 'periods_id'] = period_id
This is way too slow for the number of rows I have. I'm trying to use a vectorized function, especially apply(), but due to my lack of understanding I haven't had much success, and I cannot find a similar problem online.
Can someone help me optimize this?
Try this:
df.is_charging.diff().ne(0).cumsum()
Out[115]:
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 4
12 4
Name: is_charging, dtype: int32
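A short usage sketch tying this back to the stated goal (assign the period ids, then aggregate each period):
df['periods_id'] = df['is_charging'].diff().ne(0).cumsum()
print(df.groupby('periods_id')['time(seconds)'].agg(['mean', 'min', 'max']))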
