group-by values obtained from splitting indexes - python-3.x

I need to find the max of two columns (p_1_logreg, p_2_logreg), where the comparison should be limited to blocks of 14 rows at a time.
My csv file is myProb.csv.
I tried to slice my index into the pattern:
int1_str1_str2_int2_str3_int4
The max should be found between rows where int1, str1, str2, int2 and str3 are fixed and only int4 changes (from index 0 to index 13, and so on).
I tried to fix each element one at a time and use groupby, but I couldn't iterate over the int4 value only.
Here is my code to find the max for column p_1_label, but the result is not what I am looking for:
max_1_row=raw_prob.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[1])['p_1_'+label].idxmax()]
max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[3])['p_1_'+label].idxmax()]
max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[5])['p_1_'+label].idxmax()]
Any ideas?

I think you need DataFrameGroupBy.idxmax: replace the trailing _int4 in id with an empty string, group by the result, and then select the rows with loc:
import pandas as pd

df = pd.read_csv('myProb.csv', index_col=[0])
idx = df.drop('id', axis=1).groupby(df['id'].str.replace(r'_\d+$', '', regex=True)).idxmax()
print (idx.head(15))
p_0_logreg p_1_logreg p_2_logreg
id
6_PanaCleanerJune_sub_12_ICA 2 9 6
6_PanaCleanerJune_sub_13_ICA 17 19 23
6_PanaCleanerJune_sub_14_ICA 34 37 33
6_PanaCleanerJune_sub_15_ICA 52 51 43
6_PanaCleanerJune_sub_17_ICA 66 67 69
6_PanaCleanerJune_sub_18_ICA 82 79 76
6_PanaCleanerJune_sub_19_ICA 89 87 90
6_PanaCleanerJune_sub_20_ICA 98 103 104
6_PanaCleanerJune_sub_21_ICA 114 117 112
6_PanaCleanerJune_sub_22_ICA 129 133 127
6_PanaCleanerJune_sub_23_ICA 145 146 143
6_PanaCleanerJune_sub_24_ICA 155 166 161
6_PanaCleanerJune_sub_25_ICA 176 173 174
6_PanaCleanerJune_sub_26_ICA 186 191 189
6_PanaCleanerJune_sub_27_ICA 202 203 209
df1 = df.loc[idx['p_1_logreg']]
print (df1.head(15))
id p_0_logreg p_1_logreg p_2_logreg
9 6_PanaCleanerJune_sub_12_ICA_10 0.013452 0.985195 0.001353
19 6_PanaCleanerJune_sub_13_ICA_6 0.051184 0.948816 0.000000
37 6_PanaCleanerJune_sub_14_ICA_10 0.013758 0.979351 0.006890
51 6_PanaCleanerJune_sub_15_ICA_10 0.076056 0.923944 0.000000
67 6_PanaCleanerJune_sub_17_ICA_12 0.051060 0.947660 0.001280
79 6_PanaCleanerJune_sub_18_ICA_10 0.051184 0.948816 0.000000
87 6_PanaCleanerJune_sub_19_ICA_4 0.078162 0.917751 0.004087
103 6_PanaCleanerJune_sub_20_ICA_6 0.076400 0.921263 0.002337
117 6_PanaCleanerJune_sub_21_ICA_6 0.155002 0.791753 0.053245
133 6_PanaCleanerJune_sub_22_ICA_8 0.000000 0.998623 0.001377
146 6_PanaCleanerJune_sub_23_ICA_7 0.017549 0.973995 0.008457
166 6_PanaCleanerJune_sub_24_ICA_13 0.025215 0.974785 0.000000
173 6_PanaCleanerJune_sub_25_ICA_6 0.025656 0.960220 0.014124
191 6_PanaCleanerJune_sub_26_ICA_10 0.098872 0.895526 0.005602
203 6_PanaCleanerJune_sub_27_ICA_8 0.066493 0.932470 0.001037
df2 = df.loc[idx['p_2_logreg']]
print (df2.head(15))
id p_0_logreg p_1_logreg p_2_logreg
6 6_PanaCleanerJune_sub_12_ICA_7 0.000000 0.000351 0.999649
23 6_PanaCleanerJune_sub_13_ICA_10 0.000000 0.000351 0.999649
33 6_PanaCleanerJune_sub_14_ICA_6 0.080748 0.000352 0.918900
43 6_PanaCleanerJune_sub_15_ICA_2 0.017643 0.000360 0.981996
69 6_PanaCleanerJune_sub_17_ICA_14 0.882449 0.000290 0.117261
76 6_PanaCleanerJune_sub_18_ICA_7 0.010929 0.000360 0.988711
90 6_PanaCleanerJune_sub_19_ICA_7 0.010929 0.000351 0.988720
104 6_PanaCleanerJune_sub_20_ICA_7 0.006714 0.000360 0.992925
112 6_PanaCleanerJune_sub_21_ICA_1 0.869393 0.000339 0.130269
127 6_PanaCleanerJune_sub_22_ICA_2 0.000000 0.000351 0.999649
143 6_PanaCleanerJune_sub_23_ICA_4 0.017218 0.000360 0.982421
161 6_PanaCleanerJune_sub_24_ICA_8 0.369685 0.000712 0.629603
174 6_PanaCleanerJune_sub_25_ICA_7 0.307056 0.000496 0.692448
189 6_PanaCleanerJune_sub_26_ICA_8 0.850195 0.000368 0.149437
209 6_PanaCleanerJune_sub_27_ICA_14 0.000000 0.000351 0.999649
Detail:
print (df['id'].str.replace(r'_\d+$', '', regex=True).head(15))
0 6_PanaCleanerJune_sub_12_ICA
1 6_PanaCleanerJune_sub_12_ICA
2 6_PanaCleanerJune_sub_12_ICA
3 6_PanaCleanerJune_sub_12_ICA
4 6_PanaCleanerJune_sub_12_ICA
5 6_PanaCleanerJune_sub_12_ICA
6 6_PanaCleanerJune_sub_12_ICA
7 6_PanaCleanerJune_sub_12_ICA
8 6_PanaCleanerJune_sub_12_ICA
9 6_PanaCleanerJune_sub_12_ICA
10 6_PanaCleanerJune_sub_12_ICA
11 6_PanaCleanerJune_sub_12_ICA
12 6_PanaCleanerJune_sub_12_ICA
13 6_PanaCleanerJune_sub_12_ICA
14 6_PanaCleanerJune_sub_13_ICA
Name: id, dtype: object
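On newer pandas the same group key can also be built without a regex, by splitting the id once from the right; a minimal sketch, assuming the same myProb.csv layout as above:
import pandas as pd

df = pd.read_csv('myProb.csv', index_col=[0])
# rsplit with n=1 splits once from the right:
# '6_PanaCleanerJune_sub_12_ICA_10' -> '6_PanaCleanerJune_sub_12_ICA'
key = df['id'].str.rsplit('_', n=1).str[0]
idx = df.drop(columns='id').groupby(key).idxmax()
# rows holding the per-group maximum of p_1_logreg
max_p1 = df.loc[idx['p_1_logreg']]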

Related

pandas read_csv not reading entire file

I have a really strange problem and don't know how to solve it.
I am using Ubuntu 18.04.2 with Python 3.7.3 (64-bit) and VS Code as my editor.
I am reading data from a database and writing it to a csv file with csv.writer:
import pandas as pd
import csv

with open(raw_path + station + ".csv", "w+") as f:
    file = csv.writer(f)
    # Write header into csv
    colnames = [par for par in param]
    file.writerow(colnames)
    # Write data into csv
    for row in data:
        file.writerow(row)
This works perfectly fine; it produces a .csv file with all the data I read from the database up to the current timestep. However, in a later step I have to read this data into a pandas dataframe and merge it with another pandas dataframe. I read the files like this:
data1 = pd.read_csv(raw_path + file1, sep=',')
data2 = pd.read_csv(raw_path + file2, sep=',')
And then merge the data like this:
comb_data = pd.merge(data1, data2, on="datumsec", how="left").fillna(value=-999)
For 5 out of 6 locations where I do this, everything works perfectly fine: the combined dataset has the same length as the two separate ones. However, for one location pd.read_csv does not seem to read the csv files properly. I checked whether the problem is already in the database readout, but everything is OK there: I can open both files with Sublime and they have the same length. However, when I read them with pandas.read_csv, one shows fewer lines. The best part is that this problem appears totally at random. Sometimes it works and reads the entire file, sometimes not. AND it occurs at different locations in the file: sometimes it stops after approx. 20000 entries, sometimes at 45000, sometimes somewhere else, just totally random.
Here is an overview of my test output when I print the lengths of the files:
print(len(data1)): 57105
print(len(data2)): 57105
Both values taken directly after reading from the database, before writing anything to disk.
After saving the data as csv as described above and opening it in excel or sublime or anything I can confirm that the data contains 57105 rows. Everything is where it is supposed to be.
However, when I try to read the data back with pd.read_csv:
print(len(data1)): 48612
print(len(data2)): 57105
Both values after reading the data back from the csv files:
data1 48612
datumsec tl rf ff dd ffx
0 1538352000 46 81 75 288 89
1 1538352600 47 79 78 284 93
2 1538353200 45 82 79 282 93
3 1538353800 44 84 71 284 91
4 1538354400 43 86 77 288 96
5 1538355000 43 85 78 289 91
6 1538355600 46 80 79 286 84
7 1538356200 51 72 68 285 83
8 1538356800 52 71 68 281 73
9 1538357400 48 75 68 276 80
10 1538358000 45 78 62 271 76
11 1538358600 42 82 66 273 76
12 1538359200 43 81 70 274 78
13 1538359800 44 80 68 275 78
14 1538360400 45 78 66 279 72
15 1538361000 45 78 67 282 73
16 1538361600 43 79 63 275 71
17 1538362200 43 81 69 280 74
18 1538362800 42 80 70 281 76
19 1538363400 43 78 69 285 77
20 1538364000 43 78 71 285 77
21 1538364600 44 75 61 288 71
22 1538365200 45 73 56 290 62
23 1538365800 45 72 44 297 57
24 1538366400 44 73 51 286 57
25 1538367000 43 76 61 281 70
26 1538367600 40 79 66 284 73
27 1538368200 39 78 70 291 76
28 1538368800 38 80 71 287 81
29 1538369400 36 81 74 285 81
... ... .. ... .. ... ...
48582 1567738800 7 100 0 210 0
48583 1567739400 6 100 0 210 0
48584 1567740000 5 100 0 210 0
48585 1567740600 6 100 0 210 0
48586 1567741200 4 100 0 210 0
48587 1567741800 4 100 0 210 0
48588 1567742400 5 100 0 210 0
48589 1567743000 4 100 0 210 0
48590 1567743600 4 100 0 210 0
48591 1567744200 4 100 0 209 0
48592 1567744800 4 100 0 209 0
48593 1567745400 5 100 0 210 0
48594 1567746000 6 100 0 210 0
48595 1567746600 5 100 0 210 0
48596 1567747200 5 100 0 210 0
48597 1567747800 5 100 0 210 0
48598 1567748400 5 100 0 210 0
48599 1567749000 6 100 0 210 0
48600 1567749600 6 100 0 210 0
48601 1567750200 5 100 0 210 0
48602 1567750800 4 100 0 210 0
48603 1567751400 5 100 0 210 0
48604 1567752000 6 100 0 210 0
48605 1567752600 7 100 0 210 0
48606 1567753200 6 100 0 210 0
48607 1567753800 5 100 0 210 0
48608 1567754400 6 100 0 210 0
48609 1567755000 7 100 0 210 0
48610 1567755600 7 100 0 210 0
48611 1567756200 7 100 0 210 0
[48612 rows x 6 columns]
datumsec tl rf schnee ival6
0 1538352000 115 61 25 107
1 1538352600 115 61 25 107
2 1538353200 115 61 25 107
3 1538353800 115 61 25 107
4 1538354400 115 61 25 107
5 1538355000 115 61 25 107
6 1538355600 115 61 25 107
7 1538356200 115 61 25 107
8 1538356800 115 61 25 107
9 1538357400 115 61 25 107
10 1538358000 115 61 25 107
11 1538358600 115 61 25 107
12 1538359200 115 61 25 107
13 1538359800 115 61 25 107
14 1538360400 115 61 25 107
15 1538361000 115 61 25 107
16 1538361600 115 61 25 107
17 1538362200 115 61 25 107
18 1538362800 115 61 25 107
19 1538363400 115 61 25 107
20 1538364000 115 61 25 107
21 1538364600 115 61 25 107
22 1538365200 115 61 25 107
23 1538365800 115 61 25 107
24 1538366400 115 61 25 107
25 1538367000 115 61 25 107
26 1538367600 115 61 25 107
27 1538368200 115 61 25 107
28 1538368800 115 61 25 107
29 1538369400 115 61 25 107
... ... ... ... ... ...
57075 1572947400 -23 100 -2 -999
57076 1572948000 -23 100 -2 -999
57077 1572948600 -22 100 -2 -999
57078 1572949200 -23 100 -2 -999
57079 1572949800 -24 100 -2 -999
57080 1572950400 -23 100 -2 -999
57081 1572951000 -21 100 -1 -999
57082 1572951600 -21 100 -1 -999
57083 1572952200 -23 100 -1 -999
57084 1572952800 -23 100 -1 -999
57085 1572953400 -22 100 -1 -999
57086 1572954000 -23 100 -1 -999
57087 1572954600 -22 100 -1 -999
57088 1572955200 -24 100 0 -999
57089 1572955800 -24 100 0 -999
57090 1572956400 -25 100 0 -999
57091 1572957000 -26 100 -1 -999
57092 1572957600 -26 100 -1 -999
57093 1572958200 -27 100 -1 -999
57094 1572958800 -25 100 -1 -999
57095 1572959400 -27 100 -1 -999
57096 1572960000 -29 100 -1 -999
57097 1572960600 -28 100 -1 -999
57098 1572961200 -28 100 -1 -999
57099 1572961800 -27 100 -1 -999
57100 1572962400 -29 100 -2 -999
57101 1572963000 -29 100 -2 -999
57102 1572963600 -29 100 -2 -999
57103 1572964200 -30 100 -2 -999
57104 1572964800 -28 100 -2 -999
[57105 rows x 5 columns]
To me there is no obvious reason in the data why reading the entire file should fail, and evidently there is none, considering that sometimes the entire file is read and sometimes not.
I am really clueless about this. Do you have any idea how to cope with that and what could be the problem?
I finally solved my problem, and as expected it was not within the file itself. I am using multiprocessing to run the named functions and some other things in parallel. The reading from the database plus writing to the csv file, and the reading from the csv file, are performed in two different processes. Therefore the second process (reading the csv) did not know that the csv file was still being written, and read only what was already available in it. Because the file was opened by a different process, no exception was thrown when opening it.
I thought I had already taken care of this, but obviously not thoroughly enough to exclude every possible case.
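A minimal sketch of the fix, assuming the csv writer runs in its own multiprocessing.Process (write_station_csv is a hypothetical stand-in for the database-to-csv step):
import multiprocessing as mp
import pandas as pd

def write_station_csv(path):
    # hypothetical stand-in: in the real program this queries the database
    with open(path, 'w') as f:
        f.write('datumsec,tl\n1538352000,46\n')

if __name__ == '__main__':
    writer = mp.Process(target=write_station_csv, args=('station.csv',))
    writer.start()
    writer.join()  # block until the writer has finished and closed the file
    data1 = pd.read_csv('station.csv')  # now guaranteed to see the complete file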
I had exactly the same problem with a different application and also did not understand what was wrong, because sometimes it worked and sometimes it didn't.
In a for loop, I was extracting the last two rows of a dataframe that I was creating in the same file. Sometimes the extracted rows were not the last two at all, but most of the time it worked fine. I guess the program started extracting the last two rows before the writing process was done.
I paused the script for half a second to make sure the writing process is done:
import time
time.sleep(0.5)
However, I don't think this is a very elegant solution, since the delay might not be sufficient on a slower computer, for instance.
Vroni, how did you solve this in the end? Is there a way to specify that a certain task must not run in parallel with other tasks? I did not define anything about parallel processing in my program, so if this is the cause, it must be happening automatically.
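If the writer cannot be joined directly, one way to avoid a fixed delay is to poll the file until its size stops changing; a hedged sketch (the interval and retry count are illustrative, not tuned values):
import os
import time

def wait_until_stable(path, interval=0.2, retries=50):
    """Return True once the file size is unchanged across one polling interval."""
    last = -1
    for _ in range(retries):
        size = os.path.getsize(path)
        if size == last:
            return True  # size stopped growing: assume writing has finished
        last = size
        time.sleep(interval)
    return False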

How to extract lines from a file when the second column matches the values in another file

I have two files.
file 1:
4
14
18
45
53
60
64
102
106
158
162
file2:
28 1 2
54 1 2
90 1 1
103 1 1
155 1 17
191 1 1
235 1 1
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
529 57 3
542 53 1
560 58 6
562 164 25
568 164 5
I want to extract the lines from file2 whose second column matches a value in file1.
So the expected output will be:
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
Many of the solutions I found online use Python or Perl; however, I want to do this with a Linux command. Any idea?
This should do it?
awk 'FNR==NR{a[$0]++};FNR!=NR{if($2 in a){print}}' file1 file2
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
Explanation:
We hand awk both files (order is important in this case!).
As long as we are reading the first file (FNR==NR), we store each line in an array: a[$0]++ (file1 has a single column, so $0 is the lookup value itself).
When we reach the second file, we check whether each line's second column ($2) is in the array; if yes, we print the line.
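For comparison only (the question explicitly asks for a Linux command), the same filter as a short Python sketch:
# build the set of wanted keys from file1, then filter file2 on its second column
with open('file1') as f1:
    wanted = {line.strip() for line in f1 if line.strip()}
with open('file2') as f2:
    for line in f2:
        if line.split()[1] in wanted:
            print(line, end='')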

pandas how to convert a dataframe to a matrix using transpose

I have the following df,
code y_m count
101 2017-11 86
101 2017-12 32
102 2017-11 11
102 2017-12 34
102 2018-01 46
103 2017-11 56
103 2017-12 89
Now I want to convert this df into a matrix that transposes column y_m to rows and uses count as the cell values, like:
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 -1 89 -1
Specifically, -1 is a dummy value that indicates either that no value exists for a given (code, y_m) pair or padding to maintain the matrix shape; 0 represents 'all' values, aggregating over code, y_m, or both. E.g. cell (1, 1) sums the count values over all y_m and code; (1, 2) sums the count for 2017-11.
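For reference, the sample frame above can be rebuilt like this (a sketch of the setup, not part of the original question):
import pandas as pd

df = pd.DataFrame({'code': [101, 101, 102, 102, 102, 103, 103],
                   'y_m': ['2017-11', '2017-12', '2017-11', '2017-12',
                           '2018-01', '2017-11', '2017-12'],
                   'count': [86, 32, 11, 34, 46, 56, 89]})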
You can first use pivot_table:
df1 = (df.pivot_table(index='code',
                      columns='y_m',
                      values='count',
                      margins=True,
                      aggfunc='sum',
                      fill_value=-1,
                      margins_name='0'))
print (df1)
y_m 2017-11 2017-12 2018-01 0
code
101 86 32 -1 118
102 11 34 46 91
103 56 89 -1 145
0 153 155 46 354
And then reshape into the final format (note this produces mixed values, numbers together with strings):
#change order of index and columns values for reindex
idx = df1.index[-1:].tolist() + df1.index[:-1].tolist()
cols = df1.columns[-1:].tolist() + df1.columns[:-1].tolist()
df2 = (df1.reindex(index=idx, columns=cols)
          .reset_index()
          .rename(columns={'code': -1})
          .rename_axis(None, axis=1))
#prepend the column names as the first data row
#(pd.concat is used here because DataFrame.append was removed in pandas 2.0)
df3 = pd.concat([df2.columns.to_frame().T, df2]).reset_index(drop=True)
#reset columns names to range
df3.columns = range(len(df3.columns))
print (df3)
0 1 2 3 4
0 -1 0 2017-11 2017-12 2018-01
1 0 354 153 155 46
2 101 118 86 32 -1
3 102 91 11 34 46
4 103 145 56 89 -1

Conditional date join in python Pandas

I have two pandas dataframes matches with columns (match_id, team_id,date, ...) and teams_att with columns (id, team_id, date, overall_rating, ...).
I want to join the two dataframes on matches.team_id = teams_att.team_id and teams_att.date closest to matches.date
Example
matches
match_id team_id date
1 101 2012-05-17
2 101 2014-07-11
3 102 2010-05-21
4 102 2017-10-24
teams_att
id team_id date overall_rating
1 101 2010-02-22 67
2 101 2011-02-22 69
3 101 2012-02-20 73
4 101 2013-09-17 79
5 101 2014-09-10 74
6 101 2015-08-30 82
7 102 2015-03-21 42
8 102 2016-03-22 44
Desired results
match_id team_id matches.date teams_att.date overall_rating
1 101 2012-05-17 2012-02-20 73
2 101 2014-07-11 2014-09-10 74
3 102 2010-05-21 2015-03-21 42
4 102 2017-10-24 2016-03-22 44
You can use merge_asof with by and direction parameters:
pd.merge_asof(matches.sort_values('date'),
              teams_att.sort_values('date'),
              on='date', by='team_id',
              direction='nearest')
Output:
match_id team_id date id overall_rating
0 3 102 2010-05-21 7 42
1 1 101 2012-05-17 3 73
2 2 101 2014-07-11 5 74
3 4 102 2017-10-24 8 44
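To also keep the teams_att date next to the match date, as in the desired results, the right-hand date can be duplicated into a helper column before the merge. A sketch, assuming the matches and teams_att frames from the question with datetime date columns (teams_att_date is an illustrative name):
import pandas as pd

# copy the rating date into a column that survives the asof merge
right = teams_att.sort_values('date').assign(teams_att_date=lambda d: d['date'])
out = pd.merge_asof(matches.sort_values('date'), right,
                    on='date', by='team_id', direction='nearest')
out = out[['match_id', 'team_id', 'date', 'teams_att_date', 'overall_rating']]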
We are using merge_asof per team group (please check Scott's answer, that is the right way to solve this type of problem :-) cheers):
g1 = df1.groupby('team_id')
g = df.groupby('team_id')
l = []
for x in [101, 102]:  # the team_id values present in the sample data
    l.append(pd.merge_asof(g.get_group(x), g1.get_group(x),
                           on='date', direction='nearest'))
pd.concat(l)
Out[405]:
match_id team_id_x date id team_id_y overall_rating
0 1 101 2012-05-17 3 101 73
1 2 101 2014-07-11 5 101 74
0 3 102 2010-05-21 7 102 42
1 4 102 2017-10-24 8 102 44

Adding rows that match a criteria in another column in Excel

This is sample data:
Polling_Booth INC SAD BSP PS_NO
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1
4 356 355 25 2
5 290 333 9 2
6 144 143 4 3
7 327 196 1 4
8 370 235 1 5
And this is what I'm trying to achieve:
Polling_Booth INC SAD BSP PS_NO OP_INC OP_SAD OP_BSP
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1 437 871 16
4 356 355 25 2
5 290 333 9 2 646 688 34
6 144 143 4 3 144 143 4
7 327 196 1 4 327 196 1
8 370 235 1 5 370 235 1
This is achieved by adding up rows which have the same PS_NO. This is what I have tried (the same formula pattern for all the rows):
=IF(E2=E3,SUM(B2,B3),0)
Any help would be much appreciated. Thanks.
You could get it to look like your table by adding another condition that checks whether this is the last occurrence of the PS_No in column E, and setting the result to an empty string if not:
=IF(COUNTIF($E$2:$E2,$E2)=COUNTIF($E$2:$E$10,$E2),SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
If the data is sorted by PS_No, you can do it more easily with:
=IF($E3<>$E2,SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
which I think is what you were trying in your question
