Row comparison on different tables - excel

Hi friends.
I'm trying to figure out a formula that checks whether each row of table 2 has a matching row in table 1. If not, the formula must show that the row was not listed, as in column E (CHECK). Is that possible? Or maybe with a VBA macro, I don't know.
TABLE 1
A B C D
29 1 1 1
29 2 1 2
30 3 1 2
15 1 1 1
15 2 1 2
15 3 1 2
20 1 1 1
20 2 1 2
20 3 2 1
20 4 2 2
20 5 1 3
TABLE 2
A B C D CHECK
29 1 1 1 EXISTS
15 1 1 2 NOT
15 2 1 2 EXISTS
15 3 1 2 EXISTS
20 6 1 1 NOT
100 1 2 3 NOT LISTED
Thanks, guys, would appreciate some help.
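This is not an Excel formula, but the check described above can be sketched in Python to make the logic concrete. It is only a minimal sketch: the names table1, table2, rows_in_table1 and check are made up for illustration, and it uses the single label "NOT LISTED" where the question mixes "NOT" and "NOT LISTED".

```python
# Rows of TABLE 1 and TABLE 2 from the question, as (A, B, C, D) tuples.
table1 = [
    (29, 1, 1, 1), (29, 2, 1, 2), (30, 3, 1, 2),
    (15, 1, 1, 1), (15, 2, 1, 2), (15, 3, 1, 2),
    (20, 1, 1, 1), (20, 2, 1, 2), (20, 3, 2, 1),
    (20, 4, 2, 2), (20, 5, 1, 3),
]
table2 = [
    (29, 1, 1, 1), (15, 1, 1, 2), (15, 2, 1, 2),
    (15, 3, 1, 2), (20, 6, 1, 1), (100, 1, 2, 3),
]

# A set of whole rows gives a fast "is this exact row present?" lookup.
rows_in_table1 = set(table1)

# Assumed labels: one label for every row that has no match in table 1.
check = ["EXISTS" if row in rows_in_table1 else "NOT LISTED" for row in table2]

for row, label in zip(table2, check):
    print(*row, label)
```

In Excel itself the same test is commonly written with COUNTIFS over the four columns, for example =IF(COUNTIFS(Sheet1!A:A,A2,Sheet1!B:B,B2,Sheet1!C:C,C2,Sheet1!D:D,D2)>0,"EXISTS","NOT LISTED"), where the sheet name Sheet1 and the row 2 starting point are assumptions about your layout.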

Related

Remove rows from Dataframe where row above or below has same value in a specific column

Starting Dataframe:
A B
0 1 1
1 1 2
2 2 3
3 3 4
4 3 5
5 1 6
6 1 7
7 1 8
8 2 9
Desired result, e.g. remove rows where column A has the same value as the row above or below:
A B
0 1 1
2 2 3
3 3 4
5 1 6
8 2 9
You can use boolean indexing. The following condition returns True where the value of A is NOT equal to the value of A in the previous row, so only the first row of each run of equal values is kept:
new_df = df[df['A'].ne(df['A'].shift())]
A B
0 1 1
2 2 3
3 3 4
5 1 6
8 2 9
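To see why this works, it may help to look at the intermediate steps. The following is a small self-contained sketch using the data from the question; the variable names previous_a and keep are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3, 3, 1, 1, 1, 2],
                   "B": [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# shift() moves column A down by one row, so each row sees the previous row's value.
previous_a = df["A"].shift()

# True where A differs from the row above; the first row is compared with NaN and is kept.
keep = df["A"].ne(previous_a)

new_df = df[keep]
print(new_df)
```

Using shift(-1) instead would compare each row with the row below, which keeps the last row of each run rather than the first.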

How to randomly generate unobserved data in Python 3

I have a dataframe which contains the observed data:
import pandas as pd
d = {'humanID': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'dogID': [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
     'month': [1, 1, 2, 3, 1, 2, 3, 1, 2, 2]}
df = pd.DataFrame(data=d)
The df is as follows:
humanID dogID month
0 1 1 1
1 1 2 1
2 2 1 2
3 2 5 3
4 2 4 1
5 2 6 2
6 2 7 3
7 2 20 1
8 2 9 2
9 2 7 2
In total we have two humans and twenty dogs, and the df above contains the observed data. For example:
The first row means: human1 adopted dog1 in January
The second row means: human1 adopted dog2 in January
The third row means: human2 adopted dog1 in February
========================================================================
My goal is to randomly generate two unobserved samples for each (human, month) pair, i.e. samples that do not appear in the original observed data.
For example, in January human1 didn't adopt the dogs [3,4,5,6,7,..20], and I want to randomly create two unobserved samples for that (human, month) in triple form:
humanID dogID month
1 20 1
1 10 1
However, the following sample is not allowed since it appears in the original df:
humanID dogID month
1 2 1
For human1, there is no activity in February, so we don't need to sample unobserved data for that month.
For human2, there is activity in January, February and March. Therefore, for each of those months, we want to randomly create the unobserved data. For example, in January human2 adopted dog1, dog4 and dog20. The two random unobserved samples can be
humanID dogID month
2 2 1
2 6 1
The same process can be used for February and March.
I want to put all of the unobserved samples into one dataframe, e.g. the following dataframe unobserved:
humanID dogID month
0 1 20 1
1 1 10 1
2 2 2 1
3 2 6 1
4 2 13 2
5 2 16 2
6 2 1 3
7 2 20 3
Any fast way to do this?
PS: this is a code interview for a start-up company.
Using groupby and random.choices:
import random
dogs = list(range(1,21))
dfs = []
n_sample = 2
for i,d in df.groupby(['humanID', 'month']):
    h_id, month = i
    # candidates are the dogs not observed for this (human, month)
    # note: random.choices draws with replacement, so the same dog can be picked twice
    sample = pd.DataFrame([(h_id, dogID, month) for dogID in random.choices(list(set(dogs)-set(d['dogID'])), k=n_sample)])
    dfs.append(sample)
new_df = pd.concat(dfs).reset_index(drop=True)
new_df.columns = ['humanID', 'dogID', 'month']
print(new_df)
humanID dogID month
0 1 11 1
1 1 5 1
2 2 19 1
3 2 18 1
4 2 15 2
5 2 14 2
6 2 16 3
7 2 18 3
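If the two samples for a (human, month) pair must be distinct dogs, one possible variation is to draw without replacement using random.sample instead of random.choices. The sketch below is self-contained and assumes the dog IDs run from 1 to 20; the names all_dogs, rng and unobserved_df are only illustrative, and the seed is there only to make the run repeatable.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "humanID": [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "dogID":   [1, 2, 1, 5, 4, 6, 7, 20, 9, 7],
    "month":   [1, 1, 2, 3, 1, 2, 3, 1, 2, 2],
})

all_dogs = set(range(1, 21))   # assumption: dog IDs run from 1 to 20
n_sample = 2
rng = random.Random(0)         # seeded only so the sketch is repeatable

rows = []
for (h_id, month), group in df.groupby(["humanID", "month"]):
    # Dogs this human did not adopt in this month, sorted for a stable order.
    unobserved = sorted(all_dogs - set(group["dogID"]))
    # random.sample draws without replacement, so the sampled dogs are distinct.
    for dog_id in rng.sample(unobserved, k=n_sample):
        rows.append((h_id, dog_id, month))

unobserved_df = pd.DataFrame(rows, columns=["humanID", "dogID", "month"])
print(unobserved_df)
```

Only (human, month) pairs that occur in df are visited, so human1 gets no samples for February, as the question requires.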
If I understand you correctly, you can use np.random.permutation() on the dogID column to generate a random permutation of that column:
import numpy as np
df_new=df.copy()
df_new['dogID']=np.random.permutation(df.dogID)
print(df_new.sort_values('month'))
humanID dogID month
0 1 1 1
1 1 20 1
4 2 9 1
7 2 1 1
2 2 4 2
5 2 5 2
8 2 2 2
9 2 7 2
3 2 7 3
6 2 6 3
Or to draw random dogID values from within the range of dogID:
df_new=df.copy()
# +1 so that the largest dogID is included in the range
a=np.random.permutation(range(df_new.dogID.min(),df_new.dogID.max()+1))
df_new['dogID']=np.random.choice(a,df_new.shape[0])
print(df_new.sort_values('month'))
humanID dogID month
0 1 18 1
1 1 16 1
4 2 1 1
7 2 8 1
2 2 4 2
5 2 2 2
8 2 16 2
9 2 14 2
3 2 4 3
6 2 12 3

excel 2016 - delete rows based on multiple

I am trying to delete rows when the date in column B is not present exactly 4 times for a given filekey in column C. Sample data below:
A B C
Row Date Filekey
2 1/6/2014 1
3 1/6/2014 1
4 1/6/2014 1
5 1/6/2014 1
6 1/7/2014 1
7 1/7/2014 1
8 1/8/2014 1
9 1/9/2014 1
10 1/9/2014 1
11 1/9/2014 1
12 1/9/2014 1
13 1/9/2014 1
14 1/6/2014 2
15 1/6/2014 2
16 1/6/2014 2
17 1/6/2014 2
The result I am looking for:
Row Date Filekey
2 1/6/2014 1
3 1/6/2014 1
4 1/6/2014 1
5 1/6/2014 1
14 1/6/2014 2
15 1/6/2014 2
16 1/6/2014 2
17 1/6/2014 2
Please note that rows 6-7 were removed because only 2 rows share that date (too few), row 8 because only 1 row has that date (too few), and rows 9-13 because 5 rows share that date (too many).
Rows 14-17 were kept because:
there are exactly 4 rows with that date, and they have a different filekey (column C) than rows 2-5 even though they share the same date.
Thanks for your help.
In cell D2 use this formula and copy down:
=COUNTIFS(B:B,B2,C:C,C2)
Then filter column D for everything other than 4 and delete those rows. After that, remove the filter and delete the formulas in column D.
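For readers who want to check the logic outside Excel, the same count-then-filter idea can be sketched in Python. This is only an illustration of what the COUNTIFS helper column computes, using the sample rows from the question; the names rows, pair_counts and kept are made up here.

```python
from collections import Counter

# (Row, Date, Filekey) tuples copied from the sample data in the question.
rows = (
    [(r, "1/6/2014", 1) for r in range(2, 6)]
    + [(r, "1/7/2014", 1) for r in range(6, 8)]
    + [(8, "1/8/2014", 1)]
    + [(r, "1/9/2014", 1) for r in range(9, 14)]
    + [(r, "1/6/2014", 2) for r in range(14, 18)]
)

# Same role as =COUNTIFS(B:B,B2,C:C,C2): how many rows share each (Date, Filekey) pair.
pair_counts = Counter((date, filekey) for _row, date, filekey in rows)

# Keep only rows whose (Date, Filekey) pair occurs exactly 4 times.
kept = [r for r in rows if pair_counts[(r[1], r[2])] == 4]

for r in kept:
    print(*r)
```

Counting on the pair rather than on the date alone is what lets rows 14-17 survive even though rows 2-5 share their date.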

Excel - How do I create a cumulative sum column within a group?

In Excel, I have an hours log that looks like this:
PersonID Hours JobCode
1 7 1
1 6 2
1 8 3
1 10 1
2 5 3
2 3 5
2 12 2
2 4 1
What I would like to do is create a column with a running total, but only within each PersonID so I want to create this:
PersonID Hours JobCode Total
1 7 1 7
1 6 2 13
1 8 3 21
1 10 1 31
2 5 3 5
2 3 5 8
2 12 2 20
2 4 1 24
Any ideas on how to do that?
In D2 and fill down:
=SUMIF(A$2:A2,A2,B$2:B2)
Assuming that your data (including its header row) starts in cell A1, the following formula accumulates the hours until it finds a change in PersonID:
=IF(A2=A1,D1+B2,B2)
Put the formula in cell D2, and copy down for each row of your data.
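The two formulas differ in one respect: the SUMIF version adds up every earlier row with the same PersonID, while the IF version only looks at the row directly above and so relies on each person's rows being contiguous. A minimal Python sketch of the SUMIF-style running total, using the log from the question (the names log, running and totals are illustrative only):

```python
from collections import defaultdict

# (PersonID, Hours, JobCode) rows from the question.
log = [
    (1, 7, 1), (1, 6, 2), (1, 8, 3), (1, 10, 1),
    (2, 5, 3), (2, 3, 5), (2, 12, 2), (2, 4, 1),
]

running = defaultdict(int)   # running total of hours per PersonID
totals = []
for person_id, hours, _job_code in log:
    running[person_id] += hours
    totals.append(running[person_id])

for row, total in zip(log, totals):
    print(*row, total)
```

Because the total is tracked per PersonID, this gives the same result even if the rows for different people are interleaved.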

Create a list of duplicate records that are in several columns

I have a data set that is spread across five columns. Sample of data:
Raw Data End Results
A B C D E A B C D E
1 2 2 1 6 1 2 2 1 6
0 3 3 0 6 0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
The length of record varies from 10 to 40.
The data is to help me keep record of inventory and I wish to know which orders are popular.
Unfortunately I am still using Excel 2003.
Because I am not really sure what you have, this is deliberately simple:
In column G, row 1 put:
=A1&B1&C1&D1&E1
and copy down to suit. Select column G and Paste Special, Values. Select column G and sort. Insert in H1 and copy down to suit:
=COUNTIF(G$1:G1,G1)
A 1 should indicate the first ("unique") instance of each of the rows of Raw Data, and the other numbers the repetitions (up to 7 in your example, so one 'original' and six 'copies').
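The same counting idea can be sketched in Python, which may help as a cross-check of the spreadsheet result. This is only an illustration on the sample data; the names raw and counts are made up. Keeping each record as a tuple also avoids a pitfall of joining the cells into one string, where multi-digit values such as 1 and 12 versus 11 and 2 would produce the same key.

```python
from collections import Counter

# The 14 sample records from the question: two distinct rows, alternating.
raw = [(1, 2, 2, 1, 6), (0, 3, 3, 0, 6)] * 7

# Count how often each complete record occurs.
counts = Counter(raw)

# Most frequent records first, i.e. the most popular orders.
for record, n in counts.most_common():
    print(*record, "occurs", n, "times")
```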
