How to merge two dataframes with MultiIndex? - python-3.x

I have a DataFrame that looks like this:
           2015-12-30  2015-12-31
300100 am           1           3
       pm           3           2
300200 am           5           1
       pm           4           5
300300 am           2           6
       pm           3           7
and the other frame looks like this:
           2016-1-1  2016-1-2  2016-1-3  2016-1-4
300100 am         1         3         5         1
       pm         3         2         4         5
300200 am         2         5         2         6
       pm         5         1         3         7
300300 am         1         6         3         2
       pm         3         7         2         3
300400 am         3         1         1         3
       pm         2         5         5         2
300500 am         1         6         6         1
       pm         5         7         7         5
Now I want to merge the two frames so that the result looks like this:
           2015-12-30  2015-12-31  2016-1-1  2016-1-2  2016-1-3  2016-1-4
300100 am           1           3         1         3         5         1
       pm           3           2         3         2         4         5
300200 am           5           1         2         5         2         6
       pm           4           5         5         1         3         7
300300 am           2           6         1         6         3         2
       pm           3           7         3         7         2         3
300400 am                                 3         1         1         3
       pm                                 2         5         5         2
300500 am                                 1         6         6         1
       pm                                 5         7         7         5
I tried pd.merge(frame1, frame2, right_index=True, left_index=True), but it did not return the desired format. Can anyone help? Thanks!

You can use concat:
print(pd.concat([frame1, frame2], axis=1))
           2015-12-30  2015-12-31  1.1.2016  2.1.2016  3.1.2016  4.1.2016
300100 am         1.0         3.0         1         3         5         1
       pm         3.0         2.0         3         2         4         5
300200 am         5.0         1.0         2         5         2         6
       pm         4.0         5.0         5         1         3         7
300300 am         2.0         6.0         1         6         3         2
       pm         3.0         7.0         3         7         2         3
300400 am         NaN         NaN         3         1         1         3
       pm         NaN         NaN         2         5         5         2
300500 am         NaN         NaN         1         6         6         1
       pm         NaN         NaN         5         7         7         5
Values in the first and second columns are converted to float, because NaN cannot be stored in an integer column, so pandas upcasts int to float; see the docs on missing-data casting rules.
One possible solution is to replace NaN with some int, e.g. 0, and then convert back to int:
print(pd.concat([frame1, frame2], axis=1)
        .fillna(0)
        .astype(int))
           2015-12-30  2015-12-31  1.1.2016  2.1.2016  3.1.2016  4.1.2016
300100 am           1           3         1         3         5         1
       pm           3           2         3         2         4         5
300200 am           5           1         2         5         2         6
       pm           4           5         5         1         3         7
300300 am           2           6         1         6         3         2
       pm           3           7         3         7         2         3
300400 am           0           0         3         1         1         3
       pm           0           0         2         5         5         2
300500 am           0           0         1         6         6         1
       pm           0           0         5         7         7         5
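If you would rather keep the gaps as missing values instead of 0, a hedged alternative (assuming a recent pandas, 1.0 or later) is the nullable integer dtype Int64, which can hold both ints and missing values:
# Int64 (capital I) is pandas' nullable integer dtype, so the
# missing 2015 values stay as <NA> instead of forcing float or 0.
merged = pd.concat([frame1, frame2], axis=1).astype('Int64')
print(merged)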

You can use join:
frame1.join(frame2, how='outer')
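For reference, a minimal runnable sketch of the whole setup; the values come from the question, but the MultiIndex construction is my assumption about how the frames are stored:
import pandas as pd

idx1 = pd.MultiIndex.from_product([[300100, 300200, 300300], ['am', 'pm']])
frame1 = pd.DataFrame([[1, 3], [3, 2], [5, 1], [4, 5], [2, 6], [3, 7]],
                      index=idx1, columns=['2015-12-30', '2015-12-31'])

idx2 = pd.MultiIndex.from_product([[300100, 300200, 300300, 300400, 300500],
                                   ['am', 'pm']])
frame2 = pd.DataFrame([[1, 3, 5, 1], [3, 2, 4, 5], [2, 5, 2, 6], [5, 1, 3, 7],
                       [1, 6, 3, 2], [3, 7, 2, 3], [3, 1, 1, 3], [2, 5, 5, 2],
                       [1, 6, 6, 1], [5, 7, 7, 5]],
                      index=idx2,
                      columns=['2016-1-1', '2016-1-2', '2016-1-3', '2016-1-4'])

# Both calls align rows on the shared MultiIndex; rows missing from
# frame1 (300400, 300500) get NaN in its columns.
print(pd.concat([frame1, frame2], axis=1))
print(frame1.join(frame2, how='outer'))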

Related

How to convert column to rows

I have a CSV file that looks like this:
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
I need to convert this columns to rows to be like this:
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
How can I do that, please?
Try:
import csv

with open("input.csv", "r") as f_in, open("output.csv", "w") as f_out:
    reader = csv.reader(f_in, delimiter=" ")
    writer = csv.writer(f_out, delimiter=" ")
    writer.writerows(zip(*reader))
Contents of input.csv:
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
Contents of output.csv after running the script:
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
You are looking for a table transpose/pivot method.
If you are using pandas, this will do the trick: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
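For this particular reshaping, a plain transpose is enough in pandas as well; a minimal sketch, assuming the file is space-delimited with no header row:
import pandas as pd

# Read the space-delimited file without a header, flip rows and
# columns with .T, and write the result back out.
df = pd.read_csv("input.csv", sep=" ", header=None)
df.T.to_csv("output.csv", sep=" ", header=False, index=False)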

Expanding sum with group by date

I have a dataframe where I'm trying to do an expanding sum of values and group them by date.
Specifically, my data looks like:
creationDateTime     OK  Fail
2017-01-06 21:30:00   4     0
2017-01-06 21:35:00   4     0
2017-01-06 21:36:00   4     0
2017-01-07 21:48:00   3     1
2017-01-07 21:53:00   4     0
2017-01-08 21:22:00   3     1
2017-01-08 21:27:00   3     1
2017-01-09 21:49:00   3     1
and I'm trying to get something similar to:
creationDateTime     OK  Fail  RollingOK  RollingFail
2017-01-06 21:30:00   4     0          4            0
2017-01-06 21:35:00   4     0          8            0
2017-01-06 21:36:00   4     0         12            0
2017-01-07 21:48:00   3     1          3            1
2017-01-07 21:53:00   4     0          7            1
2017-01-08 21:22:00   3     1          3            1
2017-01-08 21:27:00   3     1          6            2
2017-01-09 21:49:00   3     1          3            1
I've figured out how to do a rolling sum of the values by using:
data_aggregated['RollingOK'] = data_aggregated['OK'].expanding(0).sum()
data_aggregated['RollingFail'] = data_aggregated['Fail'].expanding(0).sum()
But I'm not sure how I can alter this to get the rolling sums grouped by day, since the code above does a rolling sum over all the rows, without grouping by day.
Any help is very much appreciated.
Use DataFrameGroupBy.cumsum with the selected columns after groupby:
# if creationDateTime is the DatetimeIndex
idx = data_aggregated.index.date
# if it is a column
# idx = data_aggregated['creationDateTime'].dt.date

data_aggregated[['RollingOK', 'RollingFail']] = (
    data_aggregated.groupby(idx)[['OK', 'Fail']].cumsum()
)
print(data_aggregated)
                     OK  Fail  RollingOK  RollingFail
creationDateTime
2017-01-06 21:30:00   4     0          4            0
2017-01-06 21:35:00   4     0          8            0
2017-01-06 21:36:00   4     0         12            0
2017-01-07 21:48:00   3     1          3            1
2017-01-07 21:53:00   4     0          7            1
2017-01-08 21:22:00   3     1          3            1
2017-01-08 21:27:00   3     1          6            2
2017-01-09 21:49:00   3     1          3            1
You can also work with all columns at once:
data_aggregated = data_aggregated.join(
    data_aggregated.groupby(idx).cumsum().add_prefix('Rolling')
)
print(data_aggregated)
print (data_aggregated)
                     OK  Fail  RollingOK  RollingFail
creationDateTime
2017-01-06 21:30:00   4     0          4            0
2017-01-06 21:35:00   4     0          8            0
2017-01-06 21:36:00   4     0         12            0
2017-01-07 21:48:00   3     1          3            1
2017-01-07 21:53:00   4     0          7            1
2017-01-08 21:22:00   3     1          3            1
2017-01-08 21:27:00   3     1          6            2
2017-01-09 21:49:00   3     1          3            1
Your expanding solution would need to be changed to:
data_aggregated[['RollingOK', 'RollingFail']] = (
    data_aggregated.groupby(idx)[['OK', 'Fail']]
                   .expanding(0)
                   .sum()
                   .reset_index(level=0, drop=True)
)
You can use the following (if creationDateTime is a column rather than the index):
df['RollingOK'] = df.groupby(df.creationDateTime.dt.date)['OK'].cumsum()
df['RollingFail'] = df.groupby(df.creationDateTime.dt.date)['Fail'].cumsum()
print(df)
     creationDateTime  OK  Fail  RollingOK  RollingFail
0 2017-01-06 21:30:00   4     0          4            0
1 2017-01-06 21:35:00   4     0          8            0
2 2017-01-06 21:36:00   4     0         12            0
3 2017-01-07 21:48:00   3     1          3            1
4 2017-01-07 21:53:00   4     0          7            1
5 2017-01-08 21:22:00   3     1          3            1
6 2017-01-08 21:27:00   3     1          6            2
7 2017-01-09 21:49:00   3     1          3            1
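For reference, a minimal runnable sketch that reproduces the per-day cumulative sums from scratch; the data comes from the question, and storing creationDateTime as a DatetimeIndex is my assumption:
import pandas as pd

times = ['2017-01-06 21:30:00', '2017-01-06 21:35:00', '2017-01-06 21:36:00',
         '2017-01-07 21:48:00', '2017-01-07 21:53:00', '2017-01-08 21:22:00',
         '2017-01-08 21:27:00', '2017-01-09 21:49:00']
data_aggregated = pd.DataFrame(
    {'OK': [4, 4, 4, 3, 4, 3, 3, 3], 'Fail': [0, 0, 0, 1, 0, 1, 1, 1]},
    index=pd.DatetimeIndex(times, name='creationDateTime'))

# Group by the calendar date of the index; cumsum then restarts
# the running total on each new day.
data_aggregated[['RollingOK', 'RollingFail']] = (
    data_aggregated.groupby(data_aggregated.index.date)[['OK', 'Fail']].cumsum())
print(data_aggregated)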

Python: Summing every five rows of column b data and create a new column

I have a dataframe like the one below. I would like to sum rows 0 to 4 (every 5 rows) and put the summed value in a new column ("new column"). My real dataframe has 263 rows, so the final group contains only three rows and its value is the sum of just those three. How can I do this using pandas/Python? I have started learning Python recently, so thanks for any advice in advance!
My data pattern is more complex than plain 5-row blocks, because I am using the index as one of my column values and it repeats, like:
Row  Data  "new column"
  0     5
  1     1
  2     3
  3     3
  4     2            14
  5     4
  6     8
  7     1
  8     2
  9     1            16
 10     0
 11     2
 12     3             5
  0     3
  1     1
  2     2
  3     3
  4     2            11
  5     2
  6     6
  7     2
  8     2
  9     1            13
 10     1
 11     0
 12     1             2
...
259    50            89
260     1
261     4
262     5            10
I tried iterrows and groupby but couldn't make it work so far.
Use this (transform('sum') broadcasts each 5-row group's sum to every row of the group; the boolean mask keeps only the last row of each group, and index alignment on assignment leaves every other row as NaN):
df['new col'] = (df.groupby(df.index // 5)['Data']
                   .transform('sum')[lambda x: ~x.duplicated(keep='last')])
Note that duplicated compares values across the whole column, so this relies on different groups not sharing the same sum.
Output:
   Data  new col
0     5      NaN
1     1      NaN
2     3      NaN
3     3      NaN
4     2     14.0
5     4      NaN
6     8      NaN
7     1      NaN
8     2      NaN
9     1     16.0
Edit to handle the updated question:
g = df.groupby(df.Row).cumcount()
df['new col'] = (df.groupby([g, df.Row // 5])['Data']
                   .transform('sum')[lambda x: ~x.duplicated(keep='last')])
Output:
    Row  Data  new col
0     0     5      NaN
1     1     1      NaN
2     2     3      NaN
3     3     3      NaN
4     4     2     14.0
5     5     4      NaN
6     6     8      NaN
7     7     1      NaN
8     8     2      NaN
9     9     1     16.0
10   10     0      NaN
11   11     2      NaN
12   12     3      5.0
13    0     3      NaN
14    1     1      NaN
15    2     2      NaN
16    3     3      NaN
17    4     2     11.0
18    5     2      NaN
19    6     6      NaN
20    7     2      NaN
21    8     2      NaN
22    9     1     13.0
23   10     1      NaN
24   11     0      NaN
25   12     1      2.0
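If you only need the five-row sums themselves rather than writing them back into the original frame, a minimal sketch (the column name comes from the question; the sample values are abridged):
import pandas as pd

df = pd.DataFrame({'Data': [5, 1, 3, 3, 2, 4, 8, 1, 2, 1]})

# Integer-dividing a RangeIndex by 5 labels rows 0-4 as group 0,
# rows 5-9 as group 1, and so on; sum then collapses each block.
block_sums = df.groupby(df.index // 5)['Data'].sum()
print(block_sums)   # 0    14
                    # 1    16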

Dataframe concatenate columns

I have a dataframe with a multiindex (ID, Date, LID) and columns from 0 to N that looks something like this:
                      0  1  2  3  4
ID    Date       LID
00112 11-02-2014 I    0  1  5  6  7
00112 11-02-2014 II   2  4  5  3  4
00112 30-07-2015 I    5  7  1  1  2
00112 30-07-2015 II   3  2  8  7  1
I would like to group the dataframe by ID and Date and concatenate the columns to the same row such that it looks like this:
                  0  1  2  3  4  5  6  7  8  9
ID    Date
00112 11-02-2014  0  1  5  6  7  2  4  5  3  4
00112 30-07-2015  5  7  1  1  2  3  2  8  7  1
Using pd.concat and pd.DataFrame.xs (note this assumes every (ID, Date) pair carries the same LID levels, here I and II):
pd.concat(
    [df.xs(x, level=2) for x in df.index.levels[2]],
    axis=1, ignore_index=True
)
                0  1  2  3  4  5  6  7  8  9
ID  Date
112 11-02-2014  0  1  5  6  7  2  4  5  3  4
    30-07-2015  5  7  1  1  2  3  2  8  7  1
Use unstack + sort_index:
df = df.unstack().sort_index(axis=1, level=1)
# for new column names
df.columns = np.arange(len(df.columns))
print(df)
                0  1  2  3  4  5  6  7  8  9
ID  Date
112 11-02-2014  0  1  5  6  7  2  4  5  3  4
    30-07-2015  5  7  1  1  2  3  2  8  7  1
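For reference, a minimal runnable sketch of the unstack approach; the MultiIndex construction is my assumption, and ID is stored as a string so the leading zeros of 00112 survive (as integers they print as 112, as in the outputs above):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['00112'], ['11-02-2014', '30-07-2015'], ['I', 'II']],
    names=['ID', 'Date', 'LID'])
df = pd.DataFrame([[0, 1, 5, 6, 7],
                   [2, 4, 5, 3, 4],
                   [5, 7, 1, 1, 2],
                   [3, 2, 8, 7, 1]], index=idx)

# Move LID into the columns, sort so all level-I columns come before
# level-II, then flatten the column MultiIndex to 0..9.
out = df.unstack().sort_index(axis=1, level=1)
out.columns = np.arange(len(out.columns))
print(out)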

groupby, count and average in numpy, pandas in python

I have a dataframe that looks like this:
    userId  movieId  rating
0        1       31     2.5
1        1     1029     3.0
2        1     3671     3.0
3        2       10     4.0
4        2       17     5.0
5        3       60     3.0
6        3      110     4.0
7        3      247     3.5
8        4       10     4.0
9        4      112     5.0
10       5        3     4.0
11       5       39     4.0
12       5      104     4.0
I need to get a dataframe which has unique userId, number of ratings by the user and the average rating by the user as shown below:
   userId  count  mean
0       1      3  2.83
1       2      2  4.5
2       3      3  3.5
3       4      2  4.5
4       5      3  4.0
Can someone help?
df1 = df.groupby('userId')['rating'].agg(['count', 'mean']).reset_index()
print(df1)
   userId  count      mean
0       1      3  2.833333
1       2      2  4.500000
2       3      3  3.500000
3       4      2  4.500000
4       5      3  4.000000
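If you want the two-decimal display from the question, round the mean afterwards (a small, optional touch-up):
df1['mean'] = df1['mean'].round(2)   # 2.833333 -> 2.83, etc.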
Drop movieId since we're not using it, group by userId, and then apply the aggregation methods:
import pandas as pd

df = pd.DataFrame({'userId': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
                   'movieId': [31, 1029, 3671, 10, 17, 60, 110, 247, 10, 112, 3, 39, 104],
                   'rating': [2.5, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.5, 4.0, 5.0, 4.0, 4.0, 4.0]})
df = df.drop('movieId', axis=1).groupby('userId').agg(['count', 'mean'])
print(df)
Which produces:
        rating
         count      mean
userId
1            3  2.833333
2            2  4.500000
3            3  3.500000
4            2  4.500000
5            3  4.000000
Here's a NumPy-based approach, using the fact that the userId column appears to be sorted:
unq, tags, count = np.unique(df.userId.values, return_inverse=True, return_counts=True)
mean_vals = np.bincount(tags, df.rating.values) / count
df_out = pd.DataFrame(np.c_[unq, count], columns=['userID', 'count'])
df_out['mean'] = mean_vals
Sample run:
In [103]: df
Out[103]:
    userId  movieId  rating
0        1       31     2.5
1        1     1029     3.0
2        1     3671     3.0
3        2       10     4.0
4        2       17     5.0
5        3       60     3.0
6        3      110     4.0
7        3      247     3.5
8        4       10     4.0
9        4      112     5.0
10       5        3     4.0
11       5       39     4.0
12       5      104     4.0

In [104]: df_out
Out[104]:
   userID  count      mean
0       1      3  2.833333
1       2      2  4.500000
2       3      3  3.500000
3       4      2  4.500000
4       5      3  4.000000
