how to pass pandas series element to another dataframe - python-3.x

I want to check if an error occurred.
I have these two dataframes, from Excel files:
Log_frame is a dataframe of log files, reporting recorded data and errors:
Time Voltage[V] Freq[Hz] Speed Motor_Stt: ErrNo
0 10:00 220 50 30 1 0
1 10:10 220 50 30 1 0
2 10:20 220 50 0 2 3601
3 10:30 220 47 0 1 1500
4 10:40 250 50 0 1 7707
5 10:50 220 50 0 2 3601
6 11:00 220 50 0 2 3601
7 11:10 220 47 0 1 1500
8 11:20 220 50 30 1 0
9 11:30 220 50 30 1 0
Dev_frame is the dataframe of error descriptions:
Fehler-Nr. Descr Cause
0 1500 Chk_Voltage Voltage out of range
1 7707 Chk_Freq. Freq. out of range
2 3601 Chk_Motor_Stt Motor_defec
3 7704 switch_trip chk_over_curr
From Log_frame I can check whether, which, and how many errors occurred during a day with:
Err_log = Log_frame['ErrNo']
p = Err_log[Err_log != 0].drop_duplicates(keep='first').reset_index(drop=True)
and the result is a pandas Series:
<class 'pandas.core.series.Series'>
0 3601
1 1500
2 7707
I can "pass" first error (or second and all the other) by this:
Dev_Err = Dev_frame['Fehler-Nr.']
n = Dev_Err[Dev_Err == p.iloc[0]] #or 1, 2 and so on
I was wondering how to loop through p.iloc[i].
Should I use a for loop, or can this be done with a pandas function?
EDIT: e.g. if I put 1 in p.iloc[] I get:
0 1500
and with 2:
1 7707

There is no need to create a loop to check each value; you can use the isin method that pandas provides, as follows:
n = Dev_frame[Dev_frame['Fehler-Nr.'].isin(p)]['Fehler-Nr.']
which is going to return:
0 1500
1 7707
2 3601
Name: Fehler-Nr., dtype: int64
Ref: pandas.Series.isin

If you're using pandas and reaching for for loops, you are usually doing it wrong. Use pandas' vectorised operations instead; a simple example is
df.apply(some_function, axis)
I'm not 100% convinced I understood what you're trying to achieve, but I believe you just want to merge/join the error descriptions for a given error. If so, pandas.DataFrame.join() and pandas.merge() are there to help. Check the docs.
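For completeness, here is a minimal, hedged sketch of that merge approach. The frames below are small stand-ins built from the tables in the question (the real ones come from Excel), and the column names are taken from there; it attaches the description and cause to every distinct non-zero error code:
import pandas as pd

# hypothetical stand-ins for the Excel-backed frames in the question
Log_frame = pd.DataFrame({'ErrNo': [0, 0, 3601, 1500, 7707, 3601, 1500, 0]})
Dev_frame = pd.DataFrame({
    'Fehler-Nr.': [1500, 7707, 3601, 7704],
    'Descr': ['Chk_Voltage', 'Chk_Freq.', 'Chk_Motor_Stt', 'switch_trip'],
    'Cause': ['Voltage out of range', 'Freq. out of range', 'Motor_defec', 'chk_over_curr'],
})

# distinct non-zero error codes, in order of first occurrence
errors = Log_frame.loc[Log_frame['ErrNo'] != 0, ['ErrNo']].drop_duplicates()

# a left merge pulls in the matching description and cause for each code
report = errors.merge(Dev_frame, left_on='ErrNo', right_on='Fehler-Nr.', how='left')
print(report[['ErrNo', 'Descr', 'Cause']])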

Related

Re-formatting a dataframe to show sequence number and time difference after a groupby

I have a pandas dataframe that has an identifier, a sequence number, and a timestamp.
For example:
MyIndex seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10
I want to reformat it by showing a sequence number within each index, together with the time difference, something like:
MyIndex seq_no timediff
1 1 0
1 2 1
1 3 2
2 1 0
2 2 3
3 1 0
3 2 3
3 3 2
I know I can get the seq_no by doing
df.groupby("MyIndex")["seq_no"].rank(method="first", ascending=True)
but how do I get the time difference? Bonus points if you show me how to do the time difference between steps, or total timediff from the start.
I think the simplest way to get the difference is to convert the timestamp to a single unit. You can then calculate the difference with groupby and shift.
import pandas as pd
from io import StringIO
data = """Index seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
# use cumcount to get new seq_no
df['seq_no_new'] = df.groupby('Index').cumcount() + 1
# can convert timestamp by splitting string
# and then casting to int
time = df['timestamp'].str.split(':', expand=True).astype(int)
df['time'] = time.iloc[:, 0] * 60 + time.iloc[:, 1]
# you then calculate the difference with groupby/shift
# fillna values with 0 and cast to int
df['timediff'] = (df['time'] - df.groupby('Index')['time'].shift(1)).fillna(0).astype(int)
# pick columns you want at the end
df = df.loc[:, ['Index', 'seq_no_new', 'timediff']]
Output
>>>df
Index seq_no_new timediff
0 1 1 0
1 1 2 1
2 1 3 2
3 2 1 0
4 2 2 3
5 3 1 0
6 3 2 3
7 3 3 2
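For the "bonus" part of the question (the total time difference from the start of each group), a small sketch along the same lines, reusing the data string parsed above and subtracting each group's first timestamp with groupby/transform, could look like this:
df2 = pd.read_csv(StringIO(data), sep=r'\s+')
t = df2['timestamp'].str.split(':', expand=True).astype(int)
df2['time'] = t.iloc[:, 0] * 60 + t.iloc[:, 1]
# minutes elapsed since the first row of each Index group
df2['time_from_start'] = df2['time'] - df2.groupby('Index')['time'].transform('first')
print(df2[['Index', 'seq_no', 'time_from_start']])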

How to update one dataframe's column by matching columns in two different dataframes in pandas

I have two dataframes. I need to generate a report by matching columns in the two dataframes and updating a column in the first dataframe:
Sample Data
input_file = pd.DataFrame({'Branch': ['GGN', 'MDU', 'PDR', 'VLR', 'AMB'],
                           'Inflow': [0, 0, 0, 0, 0]})
month_inflow = pd.DataFrame({'Branch': ['AMB', 'GGN', 'MDU', 'PDR', 'VLR'],
                             'Visits': [124, 130, 150, 100, 112]})
input_file
Branch Inflow
0 GGN 0
1 MDU 0
2 PDR 0
3 VLR 0
4 AMB 0
month_inflow
Branch Visits
0 AMB 124
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
Expected Output:
input_file
Branch Inflow
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
5 AMB 124
I tried using the merge option, but I get the 'Inflow' column, which is not required. I know I can drop it, but could someone let me know if there's a better way to get the desired output?
pd.merge(input_file, month_inflow, on = 'Branch')
Branch Inflow Visits
0 GGN 0 130
1 MDU 0 150
2 PDR 0 100
3 VLR 0 112
4 AMB 0 124
You can try mapping Branch to the Visits column (indexed by Branch):
input_file.Inflow=input_file.Branch.map(month_inflow.set_index('Branch').Visits)
input_file
Out[145]:
Branch Inflow
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
Merge on "Branch" column and then drop "Inflow" from input file.
input_file = input_file.merge(month_inflow, on="Branch").drop('Inflow',1)
input_file
Branch Visits
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
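If the output should keep the column name Inflow, as in the expected output above, a small hedged variant of the merge answer (assuming the sample frames from the question) drops the old column and renames Visits afterwards:
result = (input_file.merge(month_inflow, on='Branch')
                    .drop(columns='Inflow')          # remove the placeholder zeros
                    .rename(columns={'Visits': 'Inflow'}))
print(result)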

pandas value_counts shows duplicates

Here is the code that I am using
all_data.groupby('BsmtFullBath').BsmtFullBath.count()
and the output is coming up as
BsmtFullBath
0 856
1 588
2 15
3 1
0 849
1 584
2 23
3 1
NA 2
Name: BsmtFullBath, dtype: int64
I expect each value to appear only once, but "0" shows up twice.
I believe that if you want to get rid of the duplicated values, you can use the map function, as in the example below (just change the mapping to match your own column values):
df_final['DC'] = df_final['DC'].map({'NO':0, 'WT':1, 'BU':2,'CT':3,'BT':4, 'CD':5})
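Duplicated groups like this usually mean the BsmtFullBath column holds mixed types, for instance the integer 0 alongside the string '0' (plus the string 'NA'), which can happen when data with different dtypes is concatenated. A hedged sketch of normalising the column to one dtype before counting (all_data below is a hypothetical stand-in):
import pandas as pd

# mixed int/str values, standing in for the real column
all_data = pd.DataFrame({'BsmtFullBath': [0, 1, '0', '1', 'NA', 2, '2', 3]})

# coerce everything to numbers ('NA' becomes NaN), then count once per value
counts = (pd.to_numeric(all_data['BsmtFullBath'], errors='coerce')
            .value_counts(dropna=False)
            .sort_index())
print(counts)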

Splitting a each column value into different columns [duplicate]

This question already has answers here:
Convert pandas DataFrame column of comma separated strings to one-hot encoded
I have a survey response sheet which has questions which can have multiple answers, selected using a set of checkboxes.
When I get the data from the response sheet and import it into pandas I get this:
Timestamp Sports you like Age
0 23/11/2013 13:22:30 Football, Chess, Cycling 15
1 23/11/2013 13:22:34 Football 25
2 23/11/2013 13:22:39 Swimming,Football 22
3 23/11/2013 13:22:45 Chess, Soccer 27
4 23/11/2013 13:22:48 Soccer 30
There can be any number of sport values in the Sports column (further rows have basketball, volleyball, etc.) and there are still some other columns. I'd like to do statistics on the results of the question (how many people liked Football, etc.). The problem is that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.
Is there a simple way within pandas to convert this sort of dataframe into one where there are multiple columns called Sports-Football, Sports-Volleyball, Sports-Basketball, and each of those is boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this.
What I need is a new dataframe that looks like this (along with Age column) -
Timestamp Sports-Football Sports-Chess Sports-Cycling ....
0 23/11/2013 13:22:30 1 1 1
1 23/11/2013 13:22:34 1 0 0
2 23/11/2013 13:22:39 1 0 0
3 23/11/2013 13:22:45 0 1 0
I got to this point and can't proceed further:
df['Sports you like'].str.split(r',\s*')
which splits the values apart, but the first element may be any sport; I need a 1 in the Football column if the user likes Football and 0 otherwise.
The problem is the separator ,\s* (a comma with optional whitespace), so the solution is to split on it, re-join with '|' (the separator str.get_dummies expects), and then call str.get_dummies:
df1 = (df.pop('Sports you like').str.split(r',\s*')
         .str.join('|')
         .str.get_dummies()
         .add_prefix('Sports-'))
df = df.join(df1)
print (df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
Or use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = df.pop('Sports you like').str.split(r',\s*')
df1 = pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_).add_prefix('Sports-')
print (df1)
Sports-Chess Sports-Cycling Sports-Football Sports-Soccer \
0 1 1 1 0
1 0 0 1 0
2 0 0 1 0
3 1 0 0 1
4 0 0 0 1
Sports-Swimming
0 0
1 0
2 1
3 0
4 0
df = df.join(df1)
print (df)
Timestamp Age Sports-Chess Sports-Cycling Sports-Football \
0 23/11/2013 13:22:30 15 1 1 1
1 23/11/2013 13:22:34 25 0 0 1
2 23/11/2013 13:22:39 22 0 0 1
3 23/11/2013 13:22:45 27 1 0 0
4 23/11/2013 13:22:48 30 0 0 0
Sports-Soccer Sports-Swimming
0 0 0
1 0 0
2 0 1
3 1 0
4 1 0
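Since the original goal was counting how many respondents like each sport, a short follow-up on either result (a sketch assuming the Sports- dummy columns built above) is just a column sum:
# number of respondents who ticked each sport
print(df.filter(like='Sports-').sum())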

Parsing table with streaks of binaries to select larger group element

I have a table like the following (only much longer):
# time binary frequency
0 2.1 0 0.65
1 3.2 1 0.72
2 5.8 0 0.64
3 7.1 0 0.63
4 9.5 1 0.72
5 14.1 1 0.74
6 21.5 0 0.62
7 27.3 0 0.61
8 29.5 1 1.00
9 32.1 1 1.12
10 35.5 1 0.99
I want to collect the times corresponding to binary == 1 only and, within each of the small groups, keep the one whose frequency value is highest. In the table above, this would result in:
times = 3.2, 14.1, 32.1
I am not sure how to approach the sequentiality of the table in the first place, and then how to compare the values within a group, returning only the corresponding time (and not, for example, the largest frequency). Time hides a periodicity, so I would rather avoid building another table with only the binary == 1 elements.
Having my time, binary, and frequency arrays, I can isolate the relevant elements with:
condition = (binary == 1)
time1 = time[condition]
frequency1 = frequency[condition]
but I do not know how to proceed to isolate the individual streaks. What are useful functions I can use?
I don't know that there are any clever functions to use for this. Here's some code that will do the job. Please note that I removed the headers from your file.
binary is either zero or one, depending on whether the row's other values are to be included in a group. Initially in_group is set to False to indicate that no group has started.
As rows are read, when binary is zero and the code has been reading rows for a group (so in_group is True), in_group is set back to False: the zero means that group has come to an end. Since processing of the group has ended, it is time to print the result for it.
When binary is one and in_group is True, the code has already started processing rows of a group, so it checks whether the newest frequency is greater than what has been seen before; if so, it updates both rep_time and rep_frequency. If in_group is False, this is the first row of a new group, so in_group is set to True and initial values of rep_time and rep_frequency are recorded.
with open('pyser.txt') as pyser:
    in_group = False
    for line in pyser:
        _, time, binary, frequency = [float(_) for _ in line.rstrip().split()]
        if binary == 0:
            if in_group:
                in_group = False
                print(rep_time)
        else:
            if in_group:
                if frequency > rep_frequency:
                    rep_time, rep_frequency = time, frequency
            else:
                in_group = True
                rep_time, rep_frequency = time, frequency
    if in_group:
        print(rep_time)
Output:
3.2
14.1
32.1
Edit: We seem to be using different definitions of the problem.
In the first group, we agree. But, in the second group, the maximum amplitude is about 4.07E-01, which corresponds to a time of about 5.4740E+04.
I've also written code in Pandas:
>>> import pandas as pd
>>> df = pd.read_csv('Gyd9P1rb.txt', sep=r'\s+', skiprows=2, header=None, names='Row TSTOP PSRTIME DETECTED FDOTMAX AMPLITUDE AMPLITUDE_ERR'.split())
>>> del df['Row']
>>> del df['TSTOP']
>>> del df['FDOTMAX']
>>> del df['AMPLITUDE_ERR']
>>> groups = []
>>> in_group = False
>>> group_number = 1
>>> for b in df['DETECTED']:
...     if b:
...         if not in_group:
...             group_number += 1
...             in_group = True
...         groups.append(group_number)
...     else:
...         groups.append(0)
...         in_group = False
...
>>> df['groups'] = pd.Series(groups, index=df.index)
>>> df.head()
PSRTIME DETECTED AMPLITUDE groups
0 54695.471283 1 0.466410 2
1 54698.532412 1 0.389607 2
2 54701.520814 1 0.252858 2
3 54704.557583 0 0.103460 0
4 54707.557563 0 0.088215 0
>>> gb = df.groupby(by=df['groups'])
>>> def f(x):
...     the_max = x['AMPLITUDE'].idxmax()
...     print(x['groups'][the_max], x['PSRTIME'][the_max])
...
>>> gb.apply(f)
0 58064.3656376
0 58064.3656376
2 54695.4712834
3 54740.4917137
4 54788.477571
5 54836.472922
6 54881.4605511
7 54926.4664883
8 54971.4932866
9 55019.5021472
10 55064.5029133
11 55109.4948108
12 55154.414381
13 55202.488766
14 55247.4721132
15 55292.5301332
16 55340.4728542
17 55385.5229596
18 55430.5332147
19 55478.4812671
20 55523.4894451
21 55568.4626766
22 55616.4630348
23 55661.4969604
24 55709.4504634
25 55754.4711994
26 55799.4736923
27 55844.5050404
28 55892.4699313
29 55937.4721754
30 55985.4677572
31 56030.5119765
32 56075.5517149
33 56168.4447074
34 56213.507484
35 56306.5133063
36 56351.4943058
37 56396.579122
38 56441.5683651
39 56489.5321173
40 56534.4838082
41 56582.469025
42 56627.4135202
43 56672.4926625
44 56720.582296
45 56768.5232469
46 56813.4997925
47 56858.3890558
48 56903.5182596
49 56951.4892721
50 56996.5787435
51 57086.3948136
52 57179.5421833
53 57272.5059448
54 57362.452523
55 57635.5013047
56 57728.4925251
57 57773.5235416
58 57821.5390364
59 57866.5205882
60 57911.5590132
61 57956.5699637
62 58001.4331976
Empty DataFrame
Columns: []
Index: []
The results of the two methods are the same, up to differences in presentation precision.
I also created a small set of data that would give easily calculable results. This is it. The original program performed correctly.
0 -1 0 -1
1 0 1 2
2 -1 0 -1
3 -1 0 -1
4 0 1 0
5 1 1 1
6 -1 0 -1
7 -1 0 -1
8 -1 0 -1
9 0 1 4
10 1 1 3
11 2 1 2
12 -1 0 -1
13 -1 0 -1
14 -1 0 -1
15 -1 0 -1
16 0 1 0
17 1 1 1
18 2 1 2
19 3 1 3
20 -1 0 -1
21 -1 0 -1
22 -1 0 -1
23 -1 0 -1
24 -1 0 -1
25 0 1 6
26 1 1 5
27 2 1 4
28 3 1 3
29 4 1 2
30 -1 0 -1
31 -1 0 -1
32 -1 0 -1
33 -1 0 -1
34 -1 0 -1
35 -1 0 -1
36 0 1 0
37 1 1 1
38 2 1 2
39 3 1 3
40 4 1 4
41 5 1 5
41 -1 0 -1
41 -1 0 -1
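For reference, a more compact pandas-only sketch of the toy table from the question (not the method used above; the column names time, binary and frequency are assumed from the question) labels consecutive streaks with shift/cumsum and picks each streak's best time with idxmax:
import pandas as pd

df = pd.DataFrame({
    'time':      [2.1, 3.2, 5.8, 7.1, 9.5, 14.1, 21.5, 27.3, 29.5, 32.1, 35.5],
    'binary':    [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    'frequency': [0.65, 0.72, 0.64, 0.63, 0.72, 0.74, 0.62, 0.61, 1.00, 1.12, 0.99],
})

# a new streak starts whenever 'binary' changes value
streak = (df['binary'] != df['binary'].shift()).cumsum()
ones = df[df['binary'] == 1]
# within each streak of ones, keep the row whose frequency is highest
best = ones.groupby(streak[ones.index])['frequency'].idxmax()
print(df.loc[best, 'time'].tolist())   # [3.2, 14.1, 32.1]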
