I have data like the data_df2 sample below. The code below creates the label column by comparing each row's Cleaned value to the value in the previous row and then assigning either the same letter, if the values match, or a new letter. The problem is that I would like the letters chosen for the label column to start over with every new label_set_id, so the label value for the first row with label_set_id=2 would be A. The label_set_id goes up by 1 every 20 records. Can anyone suggest how I can modify the code below to accomplish this? Or is there a slicker way in pandas, say using apply? The code below also runs rather slowly.
code:
data_df2['label'] = ''
c = 65  # ASCII code for 'A'
data_df2.label[0] = chr(c)
c = c + 1
for i in range(1, len(data_df2)):
    if data_df2.loc[i, 'Cleaned'] == data_df2.loc[i - 1, 'Cleaned']:
        data_df2.label[i] = data_df2.label[i - 1]
    else:
        data_df2.label[i] = chr(c)
        c = c + 1
input data (the label column shows what the current code produces):
print(data_df2[:30])
id Source \
0 1 ,O-PEN 2.0
1 2 .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC
2 3 #BEEZLEEXTRACTS
3 4 #CALISIFTCO_
4 5 #CALISIFTCO_ X #_ZKITTLEZ_
5 6 #CALISIFTCO_ X #WONDERBRETT
6 7 #CALISIFTCO_ X #WONDERBRETT_
7 8 #DNA_GENETICS
8 9 #EDENEXTRACTS_CA
9 10 #EDENEXTRACTS_CA X #CALISIFTCO_
10 11 #FULLFLAVAEXTRACT
11 12 #GGSTRAINS
12 13 #SHERBINSKI415
13 14 #STR8MECHANIC X #ICEDOUTEXTRACTS
14 15 #STR8MECHANIC X #REZHEADS215
15 16 [SS] 710 LABS
16 17 [SS] ABSOLUTE EXTRACTS
17 18 [SS] BIG PETE'S
18 19 [SS] BLOOM FARMS
19 20 [SS] BLUE RIVER
20 21 [SS] BRITE LABS
21 22 [SS] BROTHERLY LOVE
22 23 [SS] BROTHERLY LOVE [3 PACK]
23 24 [SS] CALIFORNIA DREAMIN
24 25 [SS] DIME BAG
25 26 [SS] EDEN INFUSIONS
26 27 [SS] EEL RIVER
27 28 [SS] GANJA GOLD
28 29 [SS] GLOWING BUDDHA
29 30 [SS] JETTY
Cleaned label_set_id label
0 O.PEN VAPE 1 A
1 BROTHERLY LOVE 1 B
2 BEEZLE EXTRACTS 1 C
3 CALI SIFT CO 1 D
4 CALI SIFT CO 1 D
5 #CALISIFTCO_ X #WONDERBRETT_ 1 E
6 #CALISIFTCO_ X #WONDERBRETT_ 1 E
7 DNA GENETICS 1 F
8 EDEN 1 G
9 CALI SIFT CO 1 H
10 FLAV RX 1 I
11 GG STRAINS 1 J
12 SHERBINSKI 1 K
13 STR8 MECHANIC 1 L
14 STR8 MECHANIC 1 L
15 710 LABS 1 M
16 ABSOLUTE XTRACTS 1 N
17 BIG PETE'S TREATS 1 O
18 BLOOM FARMS 1 P
19 BLUE RIVER 1 Q
20 BRITE LABS 2 R
21 BROTHERLY LOVE 2 S
22 BROTHERLY LOVE 2 S
23 CALIFORNIA DREAMIN 2 T
24 DIME BAG 2 U
25 EDEN 2 V
26 EEL RIVER 2 W
27 GANJA GOLD 2 X
28 GLOWING BUDDHA 2 Y
29 JETTY EXTRACTS 2 Z
desired output data:
id Source \
0 1 ,O-PEN 2.0
1 2 .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC
2 3 #BEEZLEEXTRACTS
3 4 #CALISIFTCO_
4 5 #CALISIFTCO_ X #_ZKITTLEZ_
5 6 #CALISIFTCO_ X #WONDERBRETT
6 7 #CALISIFTCO_ X #WONDERBRETT_
7 8 #DNA_GENETICS
8 9 #EDENEXTRACTS_CA
9 10 #EDENEXTRACTS_CA X #CALISIFTCO_
10 11 #FULLFLAVAEXTRACT
11 12 #GGSTRAINS
12 13 #SHERBINSKI415
13 14 #STR8MECHANIC X #ICEDOUTEXTRACTS
14 15 #STR8MECHANIC X #REZHEADS215
15 16 [SS] 710 LABS
16 17 [SS] ABSOLUTE EXTRACTS
17 18 [SS] BIG PETE'S
18 19 [SS] BLOOM FARMS
19 20 [SS] BLUE RIVER
20 21 [SS] BRITE LABS
21 22 [SS] BROTHERLY LOVE
22 23 [SS] BROTHERLY LOVE [3 PACK]
23 24 [SS] CALIFORNIA DREAMIN
24 25 [SS] DIME BAG
25 26 [SS] EDEN INFUSIONS
26 27 [SS] EEL RIVER
27 28 [SS] GANJA GOLD
28 29 [SS] GLOWING BUDDHA
29 30 [SS] JETTY
Cleaned label_set_id label
0 O.PEN VAPE 1 A
1 BROTHERLY LOVE 1 B
2 BEEZLE EXTRACTS 1 C
3 CALI SIFT CO 1 D
4 CALI SIFT CO 1 D
5 #CALISIFTCO_ X #WONDERBRETT_ 1 E
6 #CALISIFTCO_ X #WONDERBRETT_ 1 E
7 DNA GENETICS 1 F
8 EDEN 1 G
9 CALI SIFT CO 1 H
10 FLAV RX 1 I
11 GG STRAINS 1 J
12 SHERBINSKI 1 K
13 STR8 MECHANIC 1 L
14 STR8 MECHANIC 1 L
15 710 LABS 1 M
16 ABSOLUTE XTRACTS 1 N
17 BIG PETE'S TREATS 1 O
18 BLOOM FARMS 1 P
19 BLUE RIVER 1 Q
20 BRITE LABS 2 A
21 BROTHERLY LOVE 2 B
22 BROTHERLY LOVE 2 B
23 CALIFORNIA DREAMIN 2 C
24 DIME BAG 2 D
25 EDEN 2 E
26 EEL RIVER 2 F
27 GANJA GOLD 2 G
28 GLOWING BUDDHA 2 H
29 JETTY EXTRACTS 2 I
IIUC, you can use groupby on label_set_id, check where two consecutive rows differ with shift, and use cumsum to get an incremental value per group. Add 64 and map chr over the result to get letters (65 is 'A').
import pandas as pd

#dummy example
df = pd.DataFrame({'Cleaned': list('abbcddeffijkllmn'),
                   'label_set_id': [1]*8 + [2]*8})

#create the column label
df['label'] = list(map(chr, df.groupby('label_set_id')['Cleaned']
                            .apply(lambda x: x.ne(x.shift()).cumsum()) + 64))
print(df)
Cleaned label_set_id label
0 a 1 A
1 b 1 B
2 b 1 B #same Cleaned as the previous row
3 c 1 C
4 d 1 D
5 d 1 D
6 e 1 E
7 f 1 F
8 f 2 A #restart at A for new label_set_id
9 i 2 B
10 j 2 C
11 k 2 D
12 l 2 E
13 l 2 E
14 m 2 F
15 n 2 G
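Applied to the question's data_df2 (a sketch assuming the Cleaned and label_set_id columns shown above), this becomes:
#restart the letters at 'A' for every label_set_id
data_df2['label'] = list(map(chr, data_df2.groupby('label_set_id')['Cleaned']
                                          .apply(lambda x: x.ne(x.shift()).cumsum()) + 64))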
EDIT: if the data is ordered by label_set_id, you can do it without groupby:
#running counter over the whole frame
df['label'] = df['Cleaned'].ne(df['Cleaned'].shift()).cumsum()
#subtract the counter value at the start of each label_set_id so every group restarts at 'A' (65)
df['label'] = list(map(chr, df['label']
                            - df['label'].where(df['label_set_id'].ne(df['label_set_id'].shift()))
                                         .ffill().astype(int) + 65))
I have a dataframe with multiple columns and 700+ rows, and a series of 27 rows. I want to create a new column in the dataframe from the series by matching the series index against the values of an existing column in the dataframe.
This is the dataframe I have; I need to add the series, whose index contains the same values as the "Reason for absence" column:
ID Reason for absence Month of absence Day of the week Seasons
0 11 26 7 3 1
1 36 0 7 3 1
2 3 23 7 4 1
3 7 7 7 5 1
4 11 23 7 5 1
5 3 23 7 6 1
6 10 22 7 6 1
7 20 23 7 6 1
8 14 19 7 2 1
9 1 22 7 2 1
10 20 1 7 2 1
11 20 1 7 3 1
12 20 11 7 4 1
13 3 11 7 4 1
14 3 23 7 4 1
15 24 14 7 6 1
16 3 23 7 6 1
17 3 21 7 2 1
18 6 11 7 5 1
19 33 23 8 4 1
20 18 10 8 4 1
21 3 11 8 2 1
22 10 13 8 2 1
23 20 28 8 6 1
24 11 18 8 2 1
25 10 25 8 2 1
26 11 23 8 3 1
27 30 28 8 4 1
28 11 18 8 4 1
29 3 23 8 6 1
30 3 18 8 2 1
31 2 18 8 5 1
32 1 23 8 5 1
33 2 18 8 2 1
34 3 23 8 2 1
35 10 23 8 2 1
36 11 24 8 3 1
37 19 11 8 5 1
38 2 28 8 6 1
39 20 23 8 6 1
40 27 23 9 3 1
41 34 23 9 2 1
42 3 23 9 3 1
43 5 19 9 3 1
44 14 23 9 4 1
This is the series s_conditions:
0 Not absent
1 Infectious and parasitic diseases
2 Neoplasms
3 Diseases of the blood
4 Endocrine, nutritional and metabolic diseases
5 Mental and behavioural disorders
6 Diseases of the nervous system
7 Diseases of the eye
8 Diseases of the ear
9 Diseases of the circulatory system
10 Diseases of the respiratory system
11 Diseases of the digestive system
12 Diseases of the skin
13 Diseases of the musculoskeletal system
14 Diseases of the genitourinary system
15 Pregnancy and childbirth
16 Conditions from perinatal period
17 Congenital malformations
18 Symptoms not elsewhere classified
19 Injury
20 External causes
21 Factors influencing health status
22 Patient follow-up
23 Medical consultation
24 Blood donation
25 Laboratory examination
26 Unjustified absence
27 Physiotherapy
28 Dental consultation
dtype: object
I tried this:
df1.insert(loc=0, column="Reason_for_absence", value=s_conditions)
Output - this is wrong, because I need the Reason_for_absence column to be filled according to the matching between the "Reason for absence" values and the s_conditions index:
Reason_for_absence ID Reason for absence \
0 Not absent 11 26
1 Infectious and parasitic diseases 36 0
2 Neoplasms 3 23
3 Diseases of the blood 7 7
4 Endocrine, nutritional and metabolic diseases 11 23
5 Mental and behavioural disorders 3 23
6 Diseases of the nervous system 10 22
7 Diseases of the eye 20 23
8 Diseases of the ear 14 19
9 Diseases of the circulatory system 1 22
10 Diseases of the respiratory system 20 1
11 Diseases of the digestive system 20 1
12 Diseases of the skin 20 11
13 Diseases of the musculoskeletal system 3 11
14 Diseases of the genitourinary system 3 23
15 Pregnancy and childbirth 24 14
16 Conditions from perinatal period 3 23
17 Congenital malformations 3 21
18 Symptoms not elsewhere classified 6 11
19 Injury 33 23
20 External causes 18 10
21 Factors influencing health status 3 11
22 Patient follow-up 10 13
23 Medical consultation 20 28
24 Blood donation 11 18
25 Laboratory examination 10 25
26 Unjustified absence 11 23
27 Physiotherapy 30 28
28 Dental consultation 11 18
29 NaN 3 23
30 NaN 3 18
31 NaN 2 18
32 NaN 1 23
I am getting values only up to row 28 and NaN values after that. Instead, I need the correct series value, matched on "Reason for absence", for every row.
While this question is a bit confusing, it seems the desire is to match the series index with the dataframe "Reason for Absence" column. If this is correct, below is a small example of how to accomplish this. Keep in mind that the resulting dataframe will be sorted based on the 'Reason for Absence Numerical' column. If my understanding is incorrect, please clarify the question so we can better assist you.
import pandas as pd

d = {'ID': [11, 36, 3], 'Reason for Absence Numerical': [3, 2, 1], 'Day of the Week': [4, 2, 6]}
dataframe = pd.DataFrame(data=d)
s = {0: 'Not absent', 1: 'Neoplasms', 2: 'Injury', 3: 'Diseases of the eye'}
disease_series = pd.Series(data=s)

def add_series_to_df(df, series, index_val):
    df_filtered = df[df['Reason for Absence Numerical'] == index_val].copy()
    series_filtered = series[series.index == index_val]
    if not df_filtered.empty:
        df_filtered['Reason for Absence Text'] = series_filtered.item()
    return df_filtered

x = [add_series_to_df(dataframe, disease_series, index_val) for index_val in range(len(disease_series.index))]
new_df = pd.concat(x)
print(new_df)
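For what it's worth, the same lookup can usually be written in one step with Series.map, which matches each value in the numeric column against the series index. A minimal sketch on the dummy data above (with the question's own names this would be something like df1['Reason for absence'].map(s_conditions)):
dataframe['Reason for Absence Text'] = dataframe['Reason for Absence Numerical'].map(disease_series)
print(dataframe)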
I counted the number of occurrences of an event per hour of the day by grouping a column in a pandas dataframe like so:
df_sep.hour.groupby(df_sep.time.dt.hour).size()
Which gives the following result:
time
2 31
3 6
4 7
5 4
6 38
7 9
8 5
9 31
10 8
11 2
12 5
13 30
14 1
15 1
16 28
18 1
20 4
21 29
Name: hour, dtype: int64
For plotting, I would like to complete the series for every hour of the day. For example, there are no occurrences at midnight (hour 0). So for every missing hour, I would like to create that index and set the corresponding value to zero.
To solve this I created two lists (x and y) using the following loop, but it feels a bit hacky... is there a better way to solve this?
x = []
y = []
for i in range(24):
    if i not in df_sep.hour.groupby(df_sep.time.dt.hour).size().index:
        x.append(i)
        y.append(0)
    else:
        x.append(i)
        y.append(df_sep.hour.groupby(df_sep.time.dt.hour).size().loc[i])
result:
for i, j in zip(x, y):
    print(i, j)
0 0
1 0
2 31
3 6
4 7
5 4
6 38
7 9
8 5
9 31
10 8
11 2
12 5
13 30
14 1
15 1
16 28
17 0
18 1
19 0
20 4
21 29
22 0
23 0
Use Series.reindex with range(24):
df_sep.hour.groupby(df_sep.time.dt.hour).size().reindex(range(24), fill_value=0)
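As a quick sketch with dummy data (several hours deliberately missing from the index), reindex fills every absent hour with 0:
import pandas as pd

#dummy counts with several hours missing
counts = pd.Series({2: 31, 3: 6, 6: 38})
full = counts.reindex(range(24), fill_value=0)
print(full)  #hours 0-23, with 0 for every hour that had no events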
Suppose I have this dataframe :
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
I want to swap the positions of rows 1 and 2.
Is there a native Pandas function that can do this?
Thanks!
Use rename with a custom dict to relabel the two index values, then sort_index to reorder the rows:
d = {1: 2, 2: 1}
df_final = df.rename(d).sort_index()
Out[27]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24
As far as I am aware, there is no native Pandas function for this, but here is a custom function:
import numpy as np
import pandas as pd

# Input
df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Output
def swap_rows(df, i1, i2):
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df

print(swap_rows(df, 1, 2))
Output:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24
Cheers!
Try numpy flip:
df.iloc[1:3] = np.flip(df.to_numpy()[1:3], axis=0)
df
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 5 6 7 8 9
3 15 16 17 18 19
4 20 21 22 23 24
Another option is to copy the frame and assign both rows in one statement; the right-hand side reads from the untouched original df, so nothing is overwritten mid-swap:
df1 = df.copy()
df1.iloc[1, :], df1.iloc[2, :] = df.iloc[2, :], df.iloc[1, :]
df1
I have two dataframes with the same entries in column A, but different entries in columns B and C.
One dataframe has multiple rows for each entry in A.
df1
A B C
0 this 3 4
1 is 4 6
2 an 7 9
3 example 12 20
df2
A B C
0 this 11 11
1 this 5 9
2 this 18 7
3 is 12 14
4 an 1 4
5 an 8 12
6 example 3 17
7 example 9 5
8 example 19 6
9 example 7 1
I want to sum the two dataframes for matching entries in column A. The result should look like this:
df3
A B C
0 this 14 15
1 this 8 13
2 this 21 11
3 is 16 20
4 an 8 13
5 an 15 21
6 example 15 37
7 example 21 25
8 example 31 26
9 example 19 21
How can I calculate this in a fast way in pandas?
Use DataFrame.merge to left-merge df2 with df1 on column A, then add columns B and C of df2 to columns B and C of df3:
# df3 keeps the same row order and index as df2, so the += below aligns row by row
df3 = df2[['A']].merge(df1, on='A', how='left')
df3[['B', 'C']] += df2[['B', 'C']]
Result:
print(df3)
A B C
0 this 14 15
1 this 8 13
2 this 21 11
3 is 16 20
4 an 8 13
5 an 15 21
6 example 15 37
7 example 21 25
8 example 31 26
9 example 19 21
Or another possible idea, if row order is not important:
df3 = df2.set_index('A').add(df1.set_index('A')).reset_index()
print(df3)
A B C
0 an 8 13
1 an 15 21
2 example 15 37
3 example 21 25
4 example 31 26
5 example 19 21
6 is 16 20
7 this 14 15
8 this 8 13
9 this 21 11
I have a 16×20 input matrix (A). I want a 16×20 output matrix (B) such that:
A(1,1) = B(16,20), A(1,2) = B(16,19), A(1,3) = B(16,18), …,
A(2,1) = B(15,20), A(2,2) = B(15,19), A(2,3) = B(15,18), and so on.
What should I do in MS Excel or MATLAB?
It looks like you want to rotate the matrix 180 degrees. Calling rot90 with 2 as the second argument (two 90-degree rotations) should do what you want:
B = rot90(A, 2);
Example
>> A = reshape(1:25,5,5)
A =
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
>> B = rot90(A,2)
B =
25 20 15 10 5
24 19 14 9 4
23 18 13 8 3
22 17 12 7 2
21 16 11 6 1
An alternative, if you don't like rot90, is to call fliplr then flipud. The effect is that every row is flipped from left to right and every column is flipped from top to bottom. It doesn't matter which order you apply them in.
Example
>> A = reshape(1:25,5,5)
A =
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
>> B = fliplr(flipud(A))
B =
25 20 15 10 5
24 19 14 9 4
23 18 13 8 3
22 17 12 7 2
21 16 11 6 1
With just a little code, you can visualize the transformation that you want:
R = 16;
C = 20;
A = cell(R, C);
for r = 1:R
    for c = 1:C
        A{r,c} = strcat(num2str(r), ',', num2str(c));
    end
end
B = flipud(fliplr(A))