How to compare two dataframes based on certain column values and remove them in pandas - python-3.x

I have two data frames.
df1:
userID ID Sex Date Month Year Security
John 45 Male 31 03 1975 Low
Tom 22 Male 01 01 1990 High
Mary 33 Female 23 05 1990 Medium
Hary 56 Male 15 09 1970 High
df2:
userID ID Sex Date Month Year
Hari 45 Male 31 03 1975
Luka 22 Male 01 01 1990
Johan 33 Female 23 05 1990
Irfan 56 Male 29 09 1971
John 45 Male 31 03 1975
Tom 22 Male 01 01 1990
Mary 34 Female 34 05 1980
Hary 56 Male 15 09 1970
I want to compare df2 with df1 and keep only those rows of df2 whose values in the columns (userID, ID, Date, Month, Year) also appear in df1.
So my new df2 should look like this:
John 45 Male 31 03 1975
Tom 22 Male 01 01 1990
Hary 56 Male 15 09 1970
What would be the best approach to get this in pandas?
Can someone help me with this?

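Just do a simple merge followed by dropna. The left merge joins on every column the two frames share, so rows of df2 with no match in df1 get NaN in Security and are then dropped:
df2.merge(df1, how='left').dropna().drop(columns='Security')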
Out[318]:
userID ID Sex Date Month Year
4 John 45 Male 31 3 1975
5 Tom 22 Male 1 1 1990
7 Hary 56 Male 15 9 1970

Define the key columns which you want to merge on, and then perform an inner merge between df2 and only the key columns of df1. The default for merge is inner, so you don't need to specify it explicitly. Subsetting df1 to only these key columns ensures that you don't bring any of its columns over to df2 with the merge.
key_cols = ['userID', 'ID', 'Date', 'Month', 'Year']
df2.merge(df1.loc[:, df1.columns.isin(key_cols)])
Outputs:
userID ID Sex Date Month Year
0 John 45 Male 31 3 1975
1 Tom 22 Male 1 1 1990
2 Hary 56 Male 15 9 1970
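The inner merge described above can be sketched end to end as follows. The frames are rebuilt from the question's sample data; passing `on=key_cols` explicitly and calling `drop_duplicates()` on the key subset are additions of this sketch (not part of the original answer), meant to guard against a key that appears more than once in df1 multiplying rows in the result.

```python
import pandas as pd

# Sample frames copied from the question; Date and Month are stored as integers.
df1 = pd.DataFrame({
    "userID": ["John", "Tom", "Mary", "Hary"],
    "ID": [45, 22, 33, 56],
    "Sex": ["Male", "Male", "Female", "Male"],
    "Date": [31, 1, 23, 15],
    "Month": [3, 1, 5, 9],
    "Year": [1975, 1990, 1990, 1970],
    "Security": ["Low", "High", "Medium", "High"],
})
df2 = pd.DataFrame({
    "userID": ["Hari", "Luka", "Johan", "Irfan", "John", "Tom", "Mary", "Hary"],
    "ID": [45, 22, 33, 56, 45, 22, 34, 56],
    "Sex": ["Male", "Male", "Female", "Male", "Male", "Male", "Female", "Male"],
    "Date": [31, 1, 23, 29, 31, 1, 34, 15],
    "Month": [3, 1, 5, 9, 3, 1, 5, 9],
    "Year": [1975, 1990, 1990, 1971, 1975, 1990, 1980, 1970],
})

key_cols = ["userID", "ID", "Date", "Month", "Year"]

# Inner merge against the de-duplicated key columns of df1 only, so no
# df1 column (such as Security) is carried over into the result.
result = df2.merge(df1[key_cols].drop_duplicates(), on=key_cols, how="inner")
print(result)
```

The result keeps the column layout of df2 and the order in which the matching rows appear in df2.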

Related

using python want to calculate last 6 months average for each month

I have a dataframe with 3 columns [user_id, year_month, value]. I want to calculate the average of the last 6 months automatically for each unique user_id and assign it to a new column.
user_id value year_month
1 50 2021-01
1 54 2021-02
.. .. ..
1 50 2021-11
1 47 2021-12
2 36 2021-01
2 48.5 2021-05
.. .. ..
2 54 2021-11
2 30.2 2021-12
3 41.4 2021-01
3 48.5 2021-02
3 41.4 2021-05
.. .. ..
3 30.2 2021-12
The data covers 12 to 24 months in total.
to get jan 2022 value[dec 2021 to july 2021]=[55+32+33+63+54+51]/6
to get feb 2022 value[jan 2022 to aug 2021] =[32+33+37+53+54+51]/6
to get mar 2022 value[feb 2022 to sep 2021] =[45+32+33+63+54+51]/6
to get apr 2022 value[mar 2022 to oct 2021] =[63+54+51+45+32+33]/6
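First, convert your year_month column to datetime and set it as the index:
df['year_month'] = pd.to_datetime(df['year_month'])
df = df.sort_values('year_month').set_index('year_month')
Then take a time-based rolling mean per user. The window has to be a fixed-length offset (rolling rejects '6M' because month lengths vary), so use a span in days that covers six monthly rows:
df.groupby('user_id')['value'].rolling('180D').mean()
This is the most correct way, since it copes with missing months, but here is a more intuitive one that simply counts the last 6 rows per user:
df.sort_values('year_month').groupby('user_id')['value'].rolling(6).mean() # Returns the wanted series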
As paul h said
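A minimal sketch of the row-based rolling mean, run on made-up values (the data and the column names avg_6m and prev_6m_avg are hypothetical). It assumes every user has one row per month with no gaps. The question's examples average the six months before the target month, so the second column shifts the values by one row before rolling.

```python
import pandas as pd

# Hypothetical data: two users with eight consecutive months each.
months = [f"2021-{m:02d}" for m in range(1, 9)]
df = pd.DataFrame({
    "user_id": [1] * 8 + [2] * 8,
    "year_month": months * 2,
    "value": [10, 20, 30, 40, 50, 60, 70, 80,
              5, 5, 5, 5, 5, 5, 11, 5],
})

df["year_month"] = pd.to_datetime(df["year_month"])
df = df.sort_values(["user_id", "year_month"]).reset_index(drop=True)

# Mean of the current row and the five rows before it, per user.
df["avg_6m"] = df.groupby("user_id")["value"].transform(
    lambda s: s.rolling(6).mean()
)

# Mean of the six rows strictly before each row, per user.
df["prev_6m_avg"] = df.groupby("user_id")["value"].transform(
    lambda s: s.shift(1).rolling(6).mean()
)
print(df)
```

Rows with fewer than six earlier observations come out as NaN; passing min_periods to rolling would relax that if partial windows are acceptable.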

Horizontal SUMIFS with two vertical criteria

I am given the following sales table, which provides the sales that each employee made, but instead of their name I have their ID, and each ID may have more than 1 row.
To map the ID back to the name, I have a lookup table with each employee's name and ID.
Sales Table:
Year ID North South West East
2020 A 58 30 74 72
2020 A 85 40 90 79
2020 B 9 82 20 5
2020 B 77 13 49 21
2020 C 85 55 37 11
2020 C 29 70 21 22
2021 A 61 37 21 42
2021 A 22 39 2 34
2021 B 62 55 9 72
2021 B 59 11 2 37
2021 C 41 22 64 47
2021 C 83 18 56 83
ID table:
ID Name
A Allison
B Brandon
C Chris
I am trying to sum up each employee's sales by a given year, and aggregate all their transactions by their name (rather than ID), so that my result looks like the following:
Result:
Report 2021
Allison 258
Brandon 307
Chris 414
I want the user to be able to select the year, and the report would automatically sum up each person's sales by the year and their name.
Any ideas on how I can accomplish this?
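With FILTER (here the sales table is in A1:F13, the ID table has IDs in I2:I4 and names in J2:J4, the selected year is in N2 and the employee name is in N3):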
=SUM(FILTER($C$2:$F$13,($B$2:$B$13=INDEX($I$2:$I$4,MATCH(N3,$J$2:$J$4,0)))*($A$2:$A$13=$N$2)))
With SUMPRODUCT:
=SUMPRODUCT($C$2:$F$13*($B$2:$B$13=INDEX($I$2:$I$4,MATCH(N3,$J$2:$J$4,0)))*($A$2:$A$13=$N$2))
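To cross-check the expected totals outside Excel, the same logic (filter rows by year, map each ID to a name, sum all four region columns) can be sketched in plain Python. The function name report is hypothetical; the data is copied from the tables in the question.

```python
# (Year, ID, North, South, West, East) rows from the sales table.
sales = [
    (2020, "A", 58, 30, 74, 72), (2020, "A", 85, 40, 90, 79),
    (2020, "B", 9, 82, 20, 5), (2020, "B", 77, 13, 49, 21),
    (2020, "C", 85, 55, 37, 11), (2020, "C", 29, 70, 21, 22),
    (2021, "A", 61, 37, 21, 42), (2021, "A", 22, 39, 2, 34),
    (2021, "B", 62, 55, 9, 72), (2021, "B", 59, 11, 2, 37),
    (2021, "C", 41, 22, 64, 47), (2021, "C", 83, 18, 56, 83),
]
names = {"A": "Allison", "B": "Brandon", "C": "Chris"}


def report(year):
    """Return {name: total sales across all regions} for one year."""
    totals = {}
    for row_year, emp_id, *regions in sales:
        if row_year == year:
            name = names[emp_id]
            totals[name] = totals.get(name, 0) + sum(regions)
    return totals


print(report(2021))  # {'Allison': 258, 'Brandon': 307, 'Chris': 414}
```

The 2021 totals match the Result table in the question.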

How can I add dates to column but repeat each 24 times, in Excel?

Here is a sample from the data that I am looking at.
Hour Index Visits
0 67
1 22
2 111
3 22
4 0
5 0
6 22
7 44
8 0
9 89
10 22
11 111
12 44
13 89
14 44
15 111
16 177
17 89
18 44
19 44
20 89
21 22
22 89
23 44
24 133
25 44
26 22
27 22
28 44
29 22
30 44
31 44
32 22
What I want to do is add another column containing the day of the week, starting with Monday repeated 24 times, then Tuesday (repeated 24 times), and so on. The result should look like:
Hour Index Visits Day
0 67 MONDAY
1 22 MONDAY
2 111 MONDAY
3 22 MONDAY
4 0 MONDAY
5 0 MONDAY
6 22 MONDAY
7 44 MONDAY
8 0 MONDAY
9 89 MONDAY
10 22 MONDAY
11 111 MONDAY
12 44 MONDAY
13 89 MONDAY
14 44 MONDAY
15 111 MONDAY
16 177 MONDAY
17 89 MONDAY
18 44 MONDAY
19 44 MONDAY
20 89 MONDAY
21 22 MONDAY
22 89 MONDAY
23 44 MONDAY
24 133 TUESDAY
25 44 TUESDAY
26 22 TUESDAY
27 22 TUESDAY
28 44 TUESDAY
29 22 TUESDAY
30 44 TUESDAY
31 44 TUESDAY
32 22 TUESDAY
I know how to get the dates to increment, but not repeat 24 times then increment. Can someone show me how to do this with Excel?
Try this formula (I assume that your Hour column starts at cell A2):
=TEXT(1+MOD(1+INT(A2/24),7),"dddd")
Note that this formula works if your Excel uses the 1900 date system (usually the default for Excel on PC).
If you are using the 1904 date system, use this formula instead:
=TEXT(2+MOD(1+INT(A2/24),7),"dddd")
Please try: =UPPER(TEXT(DAY(2+A2/24),"dddd")). The first 2 is to control when the sequence starts.
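The arithmetic behind these formulas is integer division of the hour index by 24, taken modulo 7 to pick a weekday. A small Python sketch of that idea, for illustration only (the names DAYS and day_for_hour are hypothetical):

```python
DAYS = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
        "FRIDAY", "SATURDAY", "SUNDAY"]


def day_for_hour(hour_index):
    """Map an hour index (0, 1, 2, ...) to a weekday, 24 hours per day."""
    return DAYS[(hour_index // 24) % 7]


labels = [day_for_hour(h) for h in range(33)]
print(labels[0], labels[23], labels[24], labels[32])
# MONDAY MONDAY TUESDAY TUESDAY
```

Hours 0 to 23 map to Monday, 24 to 47 to Tuesday, and the cycle wraps back to Monday at hour 168.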

Excel Add date column with dates repeated 24 times [duplicate]

Here is a sample of my data
Hour Index Visits
0 67
1 22
2 111
3 22
4 0
5 0
6 22
7 44
8 0
9 89
10 22
11 111
12 44
13 89
14 44
15 111
16 177
17 89
18 44
19 44
20 89
21 22
22 89
23 44
24 133
25 44
26 22
27 22
28 44
29 22
30 44
31 44
32 22
What I want to do is add two columns. In one column there is the date starting at Jan 1, 2013 and repeats this date for 24 rows until it increments to the next day. Then I want another column that just displays the month of the previous column. Here is what it should look like
Hour Index Visits date month
0 67 1/1/2013 1
1 22 1/1/2013 1
2 111 1/1/2013 1
3 22 1/1/2013 1
4 0 1/1/2013 1
5 0 1/1/2013 1
6 22 1/1/2013 1
7 44 1/1/2013 1
8 0 1/1/2013 1
9 89 1/1/2013 1
10 22 1/1/2013 1
11 111 1/1/2013 1
12 44 1/1/2013 1
13 89 1/1/2013 1
14 44 1/1/2013 1
15 111 1/1/2013 1
16 177 1/1/2013 1
17 89 1/1/2013 1
18 44 1/1/2013 1
19 44 1/1/2013 1
20 89 1/1/2013 1
21 22 1/1/2013 1
22 89 1/1/2013 1
23 44 1/1/2013 1
24 133 2/1/2013 1
25 44 2/1/2013 1
26 22 2/1/2013 1
27 22 2/1/2013 1
28 44 2/1/2013 1
29 22 2/1/2013 1
30 44 2/1/2013 1
31 44 2/1/2013 1
32 22 2/1/2013 1
Suppose your Hour Index values start at A2. Then in the date column (column C) you can write:
=DATE(2013,1,1)+INT(A2/24)
and fill it down.
Next, in the month column (column D), write:
=MONTH(C2)
and fill it down.
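The same calculation, a start date plus the whole number of days contained in the hour index, can be sketched in Python with the standard library. The start date comes from the question; the names START and date_for_hour are hypothetical.

```python
from datetime import date, timedelta

START = date(2013, 1, 1)  # start date taken from the question


def date_for_hour(hour_index):
    """Return the calendar date for an hour index, 24 hours per day."""
    return START + timedelta(days=hour_index // 24)


for h in (0, 23, 24, 32):
    d = date_for_hour(h)
    print(h, d.isoformat(), d.month)
```

The month column is then just the month attribute of each computed date, which mirrors =MONTH(C2).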

Merging two files by a single column in unix

I would like to merge two files by one column in unix.
I have file_a:
subjectid name age
12 Jane 16
24 Kristen 90
15 Clarke 78
23 Joann 31
I have another file_b:
subjectid prob_disease
12 0.009
24 0.738
15 0.392
23 1.2E-5
I would like to merge these files on the command line, joining files a and b by subjectid. Each file is about 2 million lines long; I tried this in R, but it froze because of the amount of data. Could someone please help me do this in Linux?
Desired output:
subjectid prob_disease name age
12 0.009 Jane 16
24 0.738 Kristen 90
15 0.392 Clarke 78
23 1.2E-5 Joann 31
Please help and thank you!
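Check out join(1). It joins on the first field by default, so in your case you don't even need any flags (note that join expects both files to be sorted on that field):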
$ join file_b file_a
subjectid prob_disease name age
12 0.009 Jane 16
24 0.738 Kristen 90
15 0.392 Clarke 78
23 1.2E-5 Joann 31
You're looking for the join command:
$ cat test.1
12 Jane 16
24 Kristen 90
15 Clarke 78
23 Joann 31
$ cat test.2
12 0.009
24 0.738
15 0.392
23 1.2E-5
$ join -j1 -o 2.1,2.2,1.2,1.3 <(sort test.1) <(sort test.2)
12 0.009 Jane 16
15 0.392 Clarke 78
23 1.2E-5 Joann 31
24 0.738 Kristen 90
$
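If sorting the inputs is inconvenient, the same join can be sketched in Python by indexing one file in a dictionary and streaming the other. This is an illustrative alternative, not part of the answers above. It assumes whitespace-separated fields, unique subject IDs, and enough memory to hold file_b's rows; the function name join_on_first_column is hypothetical, and io.StringIO stands in for real file handles.

```python
import io


def join_on_first_column(file_a, file_b):
    """Yield 'key b_fields a_fields' lines for keys present in both inputs."""
    # Index file_b by its first whitespace-separated field.
    lookup = {}
    for line in file_b:
        parts = line.split()
        if parts:
            lookup[parts[0]] = parts[1:]
    # Stream file_a so the output keeps file_a's line order.
    for line in file_a:
        parts = line.split()
        if parts and parts[0] in lookup:
            yield " ".join([parts[0], *lookup[parts[0]], *parts[1:]])


file_a = io.StringIO(
    "subjectid name age\n12 Jane 16\n24 Kristen 90\n15 Clarke 78\n23 Joann 31\n"
)
file_b = io.StringIO(
    "subjectid prob_disease\n12 0.009\n24 0.738\n15 0.392\n23 1.2E-5\n"
)
merged = list(join_on_first_column(file_a, file_b))
print("\n".join(merged))
```

Unlike join, this keeps the original row order of file_a and does not require either input to be sorted.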
