How to update one dataframe's column by matching columns in two different dataframes in Pandas - python-3.x

I have two dataframes. I need to generate a report by matching columns in the two dataframes and updating a column in the first dataframe:
Sample Data
input_file = pd.DataFrame({'Branch': ['GGN', 'MDU', 'PDR', 'VLR', 'AMB'],
                           'Inflow': [0, 0, 0, 0, 0]})
month_inflow = pd.DataFrame({'Branch': ['AMB', 'GGN', 'MDU', 'PDR', 'VLR'],
                             'Visits': [124, 130, 150, 100, 112]})
input_file
Branch Inflow
0 GGN 0
1 MDU 0
2 PDR 0
3 VLR 0
4 AMB 0
month_inflow
Branch Visits
0 AMB 124
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
Expected Output:
input_file
Branch Inflow
1 GGN 130
2 MDU 150
3 PDR 100
4 VLR 112
5 AMB 124
I tried using the merge option, but it keeps the 'Inflow' column, which is not required. I know I can drop it, but could someone let me know if there's a better way to get the desired output?
pd.merge(input_file, month_inflow, on = 'Branch')
Branch Inflow Visits
0 GGN 0 130
1 MDU 0 150
2 PDR 0 100
3 VLR 0 112
4 AMB 0 124

You can try
input_file.Inflow = input_file.Branch.map(month_inflow.set_index('Branch').Visits)
input_file
Out[145]:
Branch Inflow
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
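If a branch in input_file has no match in month_inflow, map leaves NaN in Inflow. A minimal follow-up sketch, assuming you want to keep the 0 default for unmatched branches:
# unmatched branches map to NaN; fall back to the original 0 default
input_file.Inflow = (input_file.Branch
                     .map(month_inflow.set_index('Branch').Visits)
                     .fillna(0)
                     .astype(int))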

Merge on "Branch" column and then drop "Inflow" from input file.
input_file = input_file.merge(month_inflow, on="Branch").drop(columns="Inflow")
input_file
Branch Visits
0 GGN 130
1 MDU 150
2 PDR 100
3 VLR 112
4 AMB 124
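Note that the merge keeps the counts under the name Visits rather than Inflow. If the column name in the expected output matters, a small rename (my addition, not part of the original answer) finishes the job:
input_file = input_file.rename(columns={'Visits': 'Inflow'})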

Related

Pivot a Two Column DataFrame With No Numeric Column To Aggregate On

I have a dataframe with input like this:
df1 = pd.DataFrame(
    {'StoreId': [244, 391, 246, 246, 130, 130],
     'PackageStatus': ['IN TRANSIT', 'IN TRANSIT', 'IN TRANSIT',
                       'IN TRANSIT', 'IN TRANSIT', 'COLLECTED']}
)
StoreId PackageStatus
0 244 IN TRANSIT
1 391 IN TRANSIT
2 246 IN TRANSIT
3 246 IN TRANSIT
4 130 IN TRANSIT
5 130 COLLECTED
The output I'm expecting looks like this, with the package statuses pivoted into columns and their counts becoming the values:
StoreId IN TRANSIT COLLECTED
244 1 0
391 1 0
246 2 0
130 1 1
All the examples I come across involve a third numeric column on which some aggregation (sum, mean, etc.) is performed.
When I tried
df1.pivot_table(index='StoreId',values='PackageStatus', aggfunc='count')
I get the following instead:
PackageStatus
StoreId
130 2
244 1
246 2
391 1
In my case I need a simple transpose/pivot with the count. How can I accomplish this? Thank you.
Use the columns="PackageStatus" parameter:
print(
    df1.pivot_table(
        index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
    )
)
Prints:
PackageStatus COLLECTED IN TRANSIT
StoreId
130 1 1
244 0 1
246 0 2
391 0 1
With .reset_index():
print(
    df1.pivot_table(
        index="StoreId", columns="PackageStatus", aggfunc="size", fill_value=0
    )
    .reset_index()
    .rename_axis("", axis=1)
)
Prints:
StoreId COLLECTED IN TRANSIT
0 130 1 1
1 244 0 1
2 246 0 2
3 391 0 1
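As an aside (an alternative not used in the answer above), pd.crosstab computes the same counts directly, with missing combinations filled with 0:
# cross-tabulate StoreId against PackageStatus; counts default to 0
print(pd.crosstab(df1['StoreId'], df1['PackageStatus']))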

Re-formatting a dataframe to show sequence number and time difference after a groupby

I have a pandas dataframe that has an identifier, a sequence number, and a timestamp.
For example:
MyIndex seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10
I want to reformat it to show a per-index sequence number and the time difference, something like:
MyIndex seq_no timediff
1 1 0
1 2 1
1 3 2
2 1 0
2 2 3
3 1 0
3 2 3
3 3 2
I know I can get the seq_no by doing
df.groupby("MyIndex")["seq_no"].rank(method="first", ascending=True)
but how do I get the time difference? Bonus points if you show me how to do the time difference between steps, or total timediff from the start.
I think the simplest way to get the difference is to convert the timestamp to a single unit. You can then calculate the difference with groupby and shift.
import pandas as pd
from io import StringIO
data = """Index seq_no timestamp
1 181 7:56
1 182 7:57
1 183 7:59
2 184 8:01
2 185 8:04
3 186 8:05
3 187 8:08
3 188 8:10"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
# use cumcount to get new seq_no
df['seq_no_new'] = df.groupby('Index').cumcount() + 1
# can convert timestamp by splitting string
# and then casting to int
time = df['timestamp'].str.split(':', expand=True).astype(int)
df['time'] = time.iloc[:, 0] * 60 + time.iloc[:, 1]
# you then calculate the difference with groupby/shift
# fillna values with 0 and cast to int
df['timediff'] = (df['time'] - df.groupby('Index')['time'].shift(1)).fillna(0).astype(int)
# pick columns you want at the end
df = df.loc[:, ['Index', 'seq_no_new', 'timediff']]
Output
>>> df
Index seq_no_new timediff
0 1 1 0
1 1 2 1
2 1 3 2
3 2 1 0
4 2 2 3
5 3 1 0
6 3 2 3
7 3 3 2
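For the bonus part, the total time difference from the start of each group can be computed before the final column selection; a minimal sketch reusing the 'time' column built above:
# elapsed minutes since the first row of each Index group
df['timediff_total'] = df['time'] - df.groupby('Index')['time'].transform('first')
Equivalently, df.groupby('Index')['timediff'].cumsum() gives the same running total.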

how to pass pandas series element to another dataframe

I want to check if an error occurred.
I have these two dataframes, from Excel files:
Log_frame is a dataframe built from log files, reporting recorded data and errors:
Time Voltage[V] Freq[Hz] Speed Motor_Stt: ErrNo
0 10:00 220 50 30 1 0
1 10:10 220 50 30 1 0
2 10:20 220 50 0 2 3601
3 10:30 220 47 0 1 1500
4 10:40 250 50 0 1 7707
5 10:50 220 50 0 2 3601
6 11:00 220 50 0 2 3601
7 11:10 220 47 0 1 1500
8 11:20 220 50 30 1 0
9 11:30 220 50 30 1 0
Dev_frame is the dataframe of error descriptions:
Fehler-Nr. Descr Cause
0 1500 Chk_Voltage Voltage out of range
1 7707 Chk_Freq. Freq. out of range
2 3601 Chk_Motor_Stt Motor_defec
3 7704 switch_trip chk_over_curr
From Log_frame I can check if, which, and how many errors occurred during a day by:
Err_log = Log_frame['ErrNo']
p = Err_log[Err_log != 0].drop_duplicates(keep='first').reset_index(drop=True)
and the result is a pandas Series:
<class 'pandas.core.series.Series'>
0 3601
1 1500
2 7707
I can "pass" first error (or second and all the other) by this:
Dev_Err = Dev_frame['Fehler-Nr.']
n = Dev_Err[Dev_Err == p.iloc[0]] #or 1, 2 and so on
I was wondering how to loop through p.iloc[i].
Should I use a for loop, or can this be done with a pandas function?
EDIT: e.g. if I put 1 in p.iloc[], I get:
0 1500
if 2:
1 7707
No need to create a loop to check each value; you can use the isin method of pandas.DataFrame as follows:
n = Dev_frame[Dev_frame['Fehler-Nr.'].isin(p)]['Fehler-Nr.']
which is going to return:
0 1500
1 7707
2 3601
Name: Fehler-Nr., dtype: int64
Ref: pandas.DataFrame.isin
If you're using pandas and reaching for for loops, you are usually doing it wrong. Use pandas' vectorised operations. These are done using (simple example)
df.apply(some_function, axis=1)
I'm not 100% convinced I understood what you're trying to achieve, but I believe you just want to merge/join the number of errors for a given error. If so, DataFrame.join() and pandas.merge() can help. Check the docs.
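A minimal sketch of that merge idea, using the variable names from the question (whether you want Descr and Cause per occurred error is my assumption):
# join the deduplicated error numbers in p with the description table,
# pulling Descr and Cause for each error that occurred
err_report = pd.merge(p.to_frame(), Dev_frame,
                      left_on='ErrNo', right_on='Fehler-Nr.')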

delete specific rows from csv using pandas

I have a csv file in the format shown below:
I have written the following code, which reads the file and randomly deletes the rows that have a steering value of 0. I want to keep just 10% of the rows with a steering value of 0.
df = pd.read_csv(filename, header=None, names = ["center", "left", "right", "steering", "throttle", 'break', 'speed'])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
However, I get the following error:
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1104, in mtrand.RandomState.choice
(numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0
Can you guys help me?
sample DataFrame built with #andrew_reece's code
In [9]: df
Out[9]:
center left right steering throttle brake
0 center_54.jpg left_75.jpg right_39.jpg 1 0 0
1 center_20.jpg left_81.jpg right_49.jpg 3 1 1
2 center_34.jpg left_96.jpg right_11.jpg 0 4 2
3 center_98.jpg left_87.jpg right_34.jpg 0 0 0
4 center_67.jpg left_12.jpg right_28.jpg 1 1 0
5 center_11.jpg left_25.jpg right_94.jpg 2 1 0
6 center_66.jpg left_27.jpg right_52.jpg 1 3 3
7 center_18.jpg left_50.jpg right_17.jpg 0 0 4
8 center_60.jpg left_25.jpg right_28.jpg 2 4 1
9 center_98.jpg left_97.jpg right_55.jpg 3 3 0
.. ... ... ... ... ... ...
90 center_31.jpg left_90.jpg right_43.jpg 0 1 0
91 center_29.jpg left_7.jpg right_30.jpg 3 0 0
92 center_37.jpg left_10.jpg right_15.jpg 1 0 0
93 center_18.jpg left_1.jpg right_83.jpg 3 1 1
94 center_96.jpg left_20.jpg right_56.jpg 3 0 0
95 center_37.jpg left_40.jpg right_38.jpg 0 3 1
96 center_73.jpg left_86.jpg right_71.jpg 0 1 0
97 center_85.jpg left_31.jpg right_0.jpg 3 0 4
98 center_34.jpg left_52.jpg right_40.jpg 0 0 2
99 center_91.jpg left_46.jpg right_17.jpg 0 0 0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0 43 # NOTE: 43 zeros
1 18
2 15
4 12
3 12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1 18
2 15
4 12
3 12
0 4 # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that steering column has numeric dtype! If it's a string (object) then you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE: ^ ^
after that you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
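Side note (my addition): if the kept 10% must be reproducible between runs, sample() accepts a random_state seed:
# fixing the seed makes the same rows get dropped every run
df = df.drop(df.query('steering==0').sample(frac=0.90, random_state=42).index)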
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100, 6)
The first few rows of sample output:
df.head(6)
center left right steering throttle brake
0 center_72.jpg left_26.jpg right_59.jpg 3 3 0
1 center_75.jpg left_68.jpg right_26.jpg 0 0 2
2 center_29.jpg left_8.jpg right_88.jpg 0 1 0
3 center_22.jpg left_26.jpg right_23.jpg 1 0 0
4 center_88.jpg left_0.jpg right_56.jpg 4 1 0
5 center_93.jpg left_18.jpg right_15.jpg 0 0 0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
Edit: That said, I tried your example code and it worked as well. My guess is that the error comes from your df.query() statement returning an empty dataframe, which probably means the "steering" column does not contain any zeros, or alternatively that the column is read as strings rather than numeric. Try converting the column to numeric before running the above snippet.
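A minimal sketch of that conversion (assuming the column was indeed read as strings):
# coerce steering to numeric; unparseable values become NaN
df['steering'] = pd.to_numeric(df['steering'], errors='coerce')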

Sum of next n rows in python

I have a dataframe which is grouped at the product/store/day_id level. Say it looks like the below, and I need to create a column with a rolling sum:
prod store day_id visits
111 123 1 2
111 123 2 3
111 123 3 1
111 123 4 0
111 123 5 1
111 123 6 0
111 123 7 1
111 123 8 1
111 123 9 2
need to create a dataframe as below
prod store day_id visits rolling_4_sum cond
111 123 1 2 6 1
111 123 2 3 5 1
111 123 3 1 2 1
111 123 4 0 2 1
111 123 5 1 4 0
111 123 6 0 4 0
111 123 7 1 NA 0
111 123 8 1 NA 0
111 123 9 2 NA 0
I am looking to create a
cond column: one that recursively checks a condition, say, if rolling_4_sum is greater than 5, then set the next 4 rows to 1; else do nothing, i.e. even if the condition is not met, retain what was already filled before. Do this check for each row up to the 7th row.
How can I achieve this using Python? I am trying
d1['rolling_4_sum'] = d1.groupby(['prod', 'store']).visits.rolling(4).sum()
but getting an error.
The formation of rolling sums can be done with the rolling method, using a boxcar window:
df['rolling_4_sum'] = df.visits.rolling(4, win_type='boxcar', center=True).sum().shift(-2)
The shift by -2 is because you apparently want the sums to be placed at the left edge of the window. (A boxcar window is just uniform weights, so plain rolling(4, center=True) without win_type gives the same sums and avoids the scipy dependency that win_type requires.)
Next, the condition about rolling sums being less than 7:
df['cond'] = 0
for k in range(1, 4):
df.loc[df.rolling_4_sum.shift(k) < 7, 'cond'] = 1
A new column is inserted and filled with 0; then for each k = 1, 2, 3, look k steps back; if the sum there is less than 7, set the condition to 1.
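As for the error in the question's own attempt: the grouped rolling sum comes back indexed by the group keys as well, so it cannot be assigned straight back to d1. A sketch that sidesteps this with transform (which keeps the original index) and shifts the sum to cover the current and next 3 rows, matching the expected output:
d1['rolling_4_sum'] = (d1.groupby(['prod', 'store'])['visits']
                         .transform(lambda s: s.rolling(4).sum().shift(-3)))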
