This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two data frames.
The first frame holds my IDs: several 'old code' values can map to one 'Master ID', and some old codes are not matched to any Master ID.
ID Dataframe
MASTER ID  OLD CODE
MASTER1    1A
MASTER1    1B
MASTER2    2
MASTER3    3
           4
Sales
OLD CODE Salesvalues
1A 10
1B 15
2 6
3 8
4 5
If I do a right join or an outer join, it returns more rows than my original sales table. How can I join on the first matching 'MASTER ID' while keeping the same number of rows (no duplicate rows)? If there is no 'MASTER ID' match for an 'old code', I would like it to return NA.
Expected Merge dataframe
OLD CODE Salesvalues MASTER ID (Join column)
1A 10 MASTER1
1B 15 MASTER1
2 6 MASTER2
3 8 MASTER3
4 5 NA
See if this works for you.
Sales.merge(ID_Dataframe, on='OLD CODE', how='left')
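A left merge from the sales table keeps one row per sale only if each 'OLD CODE' appears once in the ID table. Below is a minimal sketch of that idea using the sample data from the question. The variable names `ids`, `sales`, and `lookup` are made up for illustration, and the row for old code 4 is assumed to have an empty MASTER ID.

```python
import pandas as pd

# Sample frames rebuilt from the question (names are illustrative).
ids = pd.DataFrame({
    "MASTER ID": ["MASTER1", "MASTER1", "MASTER2", "MASTER3", None],
    "OLD CODE": ["1A", "1B", "2", "3", "4"],
})
sales = pd.DataFrame({
    "OLD CODE": ["1A", "1B", "2", "3", "4"],
    "Salesvalues": [10, 15, 6, 8, 5],
})

# Keep only the first MASTER ID per OLD CODE so the merge cannot add rows.
lookup = ids.drop_duplicates(subset="OLD CODE", keep="first")

# Left merge: every sales row is kept, unmatched codes get a missing value.
# validate raises an error if the lookup side still has duplicate keys.
merged = sales.merge(lookup, on="OLD CODE", how="left", validate="many_to_one")
print(merged)
```

If the ID table ever contains the same old code under two master IDs, `drop_duplicates` decides which one wins; `keep="first"` matches the "first matching" wording in the question.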
This question already has answers here:
Excel - order of occurrence / running total formula
(2 answers)
Closed 18 days ago.
I have a dataset containing a list of check numbers.
Check
111
111
111
222
222
I am trying to add a new column to my dataset that gives the 1st, 2nd, ..., nth instance of every check. The output would look something like this:
Check Instance
111 1
111 2
111 3
222 1
222 2
To create a rolling count for a specific instance that appears in a range of cells, you can use the COUNTIF function with an expanding range.
=COUNTIF(A$2:A2,A2)
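For readers working in pandas rather than Excel, the same running count per value can be sketched with `groupby(...).cumcount()`. This is only an illustrative equivalent of the expanding-range COUNTIF, using the check numbers from the question; the frame name `df` is an assumption.

```python
import pandas as pd

# Check numbers from the question.
df = pd.DataFrame({"Check": [111, 111, 111, 222, 222]})

# cumcount() numbers the rows within each group starting at 0,
# so add 1 to get the 1st, 2nd, ... nth occurrence.
df["Instance"] = df.groupby("Check").cumcount() + 1
print(df)
```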
This question already has answers here:
Drop all duplicate rows across multiple columns in Python Pandas
(8 answers)
Closed 2 years ago.
I have a data-frame which looks like:
A B C D E
a aa 1 2 3
b aa 4 5 6
c cc 7 8 9
d cc 11 10 3
e dd 71 81 91
Rows (1, 2) and rows (3, 4) have duplicate values in column B, and I want to keep only one row from each pair.
The Final output should be:
A B C D E
a aa 1 2 3
c cc 7 8 9
e dd 71 81 91
How can I use pandas to accomplish this?
DataFrame.drop_duplicates(subset="B", keep='first')
keep: controls which of the duplicated rows are kept.
It accepts three values and the default is ‘first’.
If ‘first’, it treats the first occurrence as unique and the rest of the same values as duplicates.
If ‘last’, it treats the last occurrence as unique and the rest of the same values as duplicates.
If False, it treats all of the same values as duplicates and drops every one of them.
Try drop_duplicates
df = df.drop_duplicates('B')
A B C D E
0 a aa 1 2 3
2 c cc 7 8 9
4 e dd 71 81 91
In the general case, you may need to drop duplicates across multiple columns. In that case, use the following:
df.drop_duplicates(subset=['A', 'C'], keep='first')
We specify the column names in the subset argument and use the keep argument to say which occurrence to keep:
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
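To make the three `keep` options concrete, here is a small sketch that rebuilds the data frame from the question and applies each option to column B. The variable names are illustrative only.

```python
import pandas as pd

# Data frame from the question.
df = pd.DataFrame({
    "A": ["a", "b", "c", "d", "e"],
    "B": ["aa", "aa", "cc", "cc", "dd"],
    "C": [1, 4, 7, 11, 71],
    "D": [2, 5, 8, 10, 81],
    "E": [3, 6, 9, 3, 91],
})

first = df.drop_duplicates(subset="B", keep="first")  # rows a, c, e
last = df.drop_duplicates(subset="B", keep="last")    # rows b, d, e
none = df.drop_duplicates(subset="B", keep=False)     # row e only

# With several columns, a row is a duplicate only if all of them match.
multi = df.drop_duplicates(subset=["A", "C"], keep="first")  # nothing dropped here
print(first)
```

Note that `drop_duplicates` returns a new frame and keeps the original index labels; chain `.reset_index(drop=True)` if a fresh 0..n-1 index is wanted.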
I have two sheets in Excel workbook.
The first sheet has
1) Customer ID – unique values for each customer.
2) Question ID – unique id for each question
3) Questions
Customer ID Question ID question
1 34 name
1 45 company
2 34 name
2 45 company
3 34 name
3 45 company
4 34 name
4 45 company
5 34 name
5 45 company
The second sheet has three columns
1) Customer ID – unique values for each customer.
2) Question ID – unique id for each question
3) Answer
Customer ID Question ID Answer
1 34 Amy
1 45 GEICO
2 34 Steph
3 34 Anna
3 45 GEICO
4 34 Adam
5 34 Mark
5 45 AAA
Not every Customer ID and Question ID combination in Sheet1 has an answer in Sheet2.
Sheet 3 Expected Output
I wanted to do a vba macro to combine both sheet1 and sheet2 and have all the columns. For any customer id, if there is no answer for a question, that field should be left blank.
Expected Output in Sheet3
Customer ID Question ID question Answer
1 34 name Amy
1 45 company GEICO
2 34 name Steph
2 45 company
3 34 name Anna
3 45 company GEICO
4 34 name Adam
4 45 company
5 34 name Mark
5 45 company AAA
There are several ways this can be done without writing code.
Below is one method off the top of my head. Others include the built-in query editor (Get & Transform), PivotTables, and other ways to consolidate data from multiple worksheets.
On Sheet2, first set up a "helper column" since there are multiple columns you want to match. In this example the formula is: =C2&D2 starting in Cell B2.
...then, in Sheet1 (cell E2 in the example), use a formula like:
=IFERROR(VLOOKUP(B2&C2,Sheet2!$B$1:$E$9,4,FALSE),"")
Both formulas get copied or "dragged" down as far as necessary, with the cell references adjusted to match your own layout.
No third worksheet is necessary but if you want you can start by copying Sheet1 to Sheet3.
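If the two sheets are loaded into pandas instead, the helper-column lookup corresponds roughly to a left merge on both key columns. The following is a sketch under that assumption, with the sheet contents typed in by hand; the names `questions`, `answers`, and `combined` are hypothetical.

```python
import pandas as pd

# Sheet1: every customer/question pair.
questions = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Question ID": [34, 45] * 5,
    "question": ["name", "company"] * 5,
})

# Sheet2: only the pairs that were answered.
answers = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 4, 5, 5],
    "Question ID": [34, 45, 34, 34, 45, 34, 34, 45],
    "Answer": ["Amy", "GEICO", "Steph", "Anna", "GEICO", "Adam", "Mark", "AAA"],
})

# Left merge on both keys keeps all Sheet1 rows; the two-column key plays
# the role of the concatenated helper column in the VLOOKUP approach.
combined = questions.merge(answers, on=["Customer ID", "Question ID"], how="left")

# Leave unanswered questions blank, like the "" fallback in IFERROR.
combined["Answer"] = combined["Answer"].fillna("")
print(combined)
```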
More Information:
Microsoft Support : VLOOKUP Function
Microsoft Support : Lookup & Reference Functions
Microsoft Support : IFERROR Function
I'm sorry if this has been answered. I've been searching around for a while now.
I have a time series dataset that I need to perform calculations on based on the previous x amount of time (last hour, day, etc.).
My issue is that I don't know how to run these calculations since the time deltas are not standardized.
Example:
Column A - Time (in seconds, let's say)
Column B - Value
Time Value Result(5)
01 3 0
02 5 3
04 4 8
07 8 9
09 6 12
13 4 6
14 4 10
15 1 8
22 9 0
33 7 0
How could I return the Result(5) column by summing the last 5 seconds from that one instance (row) (not including it)?
Thank you.
EDIT:
To clear up what I'm trying to do:
1) Find the previous 5 secs of data using column A and return that range of rows
2) Using that range of rows for the 5 previous secs, sum column B
3) Output in Column C (formula)
The following formula should do what you need (paste into C2 and drag down):
=SUMIFS($B$2:$B$11,$A$2:$A$11,">="&A2-5,$A$2:$A$11,"<"&A2)
Here A2 is the time in the current row, that is, the row whose preceding 5 seconds you want to sum over.
I've tested and it works for the data you provided - expand the ranges as appropriate.
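The same window logic (times greater than or equal to t - 5 and strictly less than t) can be sketched outside Excel. The function below is an illustrative Python version using only the standard library; it assumes the times are sorted in ascending order, and the name `trailing_sums` is made up for this example.

```python
from bisect import bisect_left
from itertools import accumulate

def trailing_sums(times, values, window):
    """For each time t, sum the values whose time lies in [t - window, t).

    Assumes `times` is sorted ascending (as in the question's data).
    """
    # prefix[i] is the sum of values[0:i], so any slice sum is a difference.
    prefix = [0, *accumulate(values)]
    result = []
    for t in times:
        lo = bisect_left(times, t - window)  # first row with time >= t - window
        hi = bisect_left(times, t)           # first row with time >= t (excluded)
        result.append(prefix[hi] - prefix[lo])
    return result

times = [1, 2, 4, 7, 9, 13, 14, 15, 22, 33]
values = [3, 5, 4, 8, 6, 4, 4, 1, 9, 7]
print(trailing_sums(times, values, 5))
# [0, 3, 8, 9, 12, 6, 10, 8, 0, 0]
```

The output matches the Result(5) column in the question. Using binary search on sorted times avoids rescanning the whole column for every row, which matters more as the dataset grows.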
This question already has an answer here:
Comparing two columns on sheet1 to two other columns on sheet2 and returning another column in sheet 2.
(1 answer)
Closed 2 years ago.
I have a data set of around 25,000 rows. For simplicity's sake, my table has the columns (submissionid, address, locationid), and the table I want to compare it with has the columns (submissionid, address1, address2, locationid). I want to retrieve the locationid from table 2 and put it in table 1.
The address in the first table can come from either the address1 column or the address2 column.
Most of the time VLOOKUP would solve the problem; however, some of the addresses are duplicated under different submissionids.
Ex: submissionid = 4, address = 25 main street, locationid = 7
submissionid=7, address = 25 main street, locationid= 8
Any way to solve this problem? I tried to use pivot table matrix, but my data set is too big!
Thanks
table 1
submissionid address locationid
5 123 MainStreet
4 123 MainStreet
4 45 MLK BLVD
6 11 Thames Rd
7 4 RR
Table 2
submissionid address locationid
4 123 MainStreet 7
5 123 MainStreet 10
4 45 MLK BLVD 4
6 11 Thames Rd 11
7 4 RR 10
As you can see, a submissionid can have more than one address, and a locationid is not unique across submissions. However, locationids are unique within a submissionid (i.e., there can't be two of the same locationid for one submissionid).
If you are prepared to add a named array such as shown in D8:D13 in the image
then:
=INDEX(locationid,MATCH(A2&"\"&B2,submissionid\address,0))
may suit (copied down), where the left-hand bordered area is also a named range.
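For comparison, the same composite-key lookup can be sketched in Python with a dictionary keyed on the (submissionid, address) pair, which plays the role of the concatenated named array. The data is copied from the sample tables in the question; the names `table1`, `table2`, and `lookup` are illustrative.

```python
# Table 2 rows: (submissionid, address, locationid).
table2 = [
    (4, "123 MainStreet", 7),
    (5, "123 MainStreet", 10),
    (4, "45 MLK BLVD", 4),
    (6, "11 Thames Rd", 11),
    (7, "4 RR", 10),
]

# Table 1 rows: (submissionid, address), locationid still to be filled in.
table1 = [
    (5, "123 MainStreet"),
    (4, "123 MainStreet"),
    (4, "45 MLK BLVD"),
    (6, "11 Thames Rd"),
    (7, "4 RR"),
]

# Build the lookup once; the tuple key keeps the same address under
# different submissionids apart, like A2&"\"&B2 in the formula.
lookup = {(sid, address): location for sid, address, location in table2}

# .get returns None when a pair has no match, instead of raising an error.
filled = [(sid, address, lookup.get((sid, address))) for sid, address in table1]
print(filled)
```

If table 2 really has two address columns, one option is to add each row to the dictionary twice, once per address, so that either one matches.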