Counting duplicate data in a total for Excel - excel

A B
1 8 Tiffney, Jennifer
2 8 Tiffney, Jennifer
3 8 Tiffney, Jennifer
4 8 Tiffney, Jennifer
5 8 Tiffney, Jennifer
6 8 Tiffney, Jennifer
7 9 Allen, Larry
8 9 Allen, Larry
9 9 Allen, Larry
10 9 Allen, Larry
11 9 Allen, Larry
12 10 Reid, Brian
13 10 Reid, Brian
14 10 Reid, Brian
15 10 Reid, Brian
16 10 Reid, Brian
17 10 Reid, Brian
18 10 Reid, Brian
19 10 Reid, Brian
20 10 Reid, Brian
21 10 Reid, Brian
22 10 Reid, Brian
23 11 Edington, Bruce
24 11 Edington, Bruce
25 11 Edington, Bruce
26 12 Almond, David
27 12 Almond, David
28 12 Almond, David
29 12 Almond, David
30 12 Almond, David
31 12 Almond, David
32 13 Mittal, Charu
33 13 Mittal, Charu
34 13 Mittal, Charu
35 13 Mittal, Charu
36 13 Mittal, Charu
37 13 Mittal, Charu
There are tons of duplicate rows in Excel. Is there any way to count how many people there are in total? I tried the "COUNT" and "COUNTIF" formulas, but they count the duplicates as well.
There should be 6 people in total, as shown above. Any solution for this?

Use the following formula:
=COUNT(IF(FREQUENCY(MATCH(A1:A37,A1:A37,0),MATCH(A1:A37,A1:A37,0))>0,1))
or
In Excel, click the Data tab and you will find Remove Duplicates. Select your column and click Remove Duplicates, and all the duplicates will be removed. You are left with distinct data and will get only 6 records.

Try this:
=SUMPRODUCT(1/COUNTIF($A$1:$A$100,$A$1:$A$100))
Where $A$1:$A$100 is your data range.
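The SUMPRODUCT trick works because each of the N occurrences of a value contributes 1/N, so every distinct value adds exactly 1 to the total (just make sure the range contains no blank cells, or COUNTIF returns 0 and you get a #DIV/0! error). A quick sanity check of that idea in plain Python, with made-up counts that mirror the data above:

from collections import Counter

# Illustration only: three of the people above, with their duplicate counts.
names = ["Tiffney, Jennifer"] * 6 + ["Allen, Larry"] * 5 + ["Reid, Brian"] * 11
counts = Counter(names)

# Each of the n copies of a name contributes 1/n, so every distinct
# name sums to exactly 1 -- the same reasoning as 1/COUNTIF in Excel.
distinct = sum(1 / counts[name] for name in names)
print(round(distinct))  # 3 -> three distinct people in this sample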

Related

How to insert a pandas Series as a new column in a DataFrame, matching on indexes, when the Series has a different length

I have a DataFrame with multiple columns and 700+ rows, and a Series of 27 rows. I want to create a new column in the DataFrame from the Series, matching the Series index against a predefined column in the df.
This is the DataFrame I have; I need to add the Series, whose index contains the same values as "Reason for absence":
ID Reason for absence Month of absence Day of the week Seasons
0 11 26 7 3 1
1 36 0 7 3 1
2 3 23 7 4 1
3 7 7 7 5 1
4 11 23 7 5 1
5 3 23 7 6 1
6 10 22 7 6 1
7 20 23 7 6 1
8 14 19 7 2 1
9 1 22 7 2 1
10 20 1 7 2 1
11 20 1 7 3 1
12 20 11 7 4 1
13 3 11 7 4 1
14 3 23 7 4 1
15 24 14 7 6 1
16 3 23 7 6 1
17 3 21 7 2 1
18 6 11 7 5 1
19 33 23 8 4 1
20 18 10 8 4 1
21 3 11 8 2 1
22 10 13 8 2 1
23 20 28 8 6 1
24 11 18 8 2 1
25 10 25 8 2 1
26 11 23 8 3 1
27 30 28 8 4 1
28 11 18 8 4 1
29 3 23 8 6 1
30 3 18 8 2 1
31 2 18 8 5 1
32 1 23 8 5 1
33 2 18 8 2 1
34 3 23 8 2 1
35 10 23 8 2 1
36 11 24 8 3 1
37 19 11 8 5 1
38 2 28 8 6 1
39 20 23 8 6 1
40 27 23 9 3 1
41 34 23 9 2 1
42 3 23 9 3 1
43 5 19 9 3 1
44 14 23 9 4 1
This is the Series s_conditions:
0 Not absent
1 Infectious and parasitic diseases
2 Neoplasms
3 Diseases of the blood
4 Endocrine, nutritional and metabolic diseases
5 Mental and behavioural disorders
6 Diseases of the nervous system
7 Diseases of the eye
8 Diseases of the ear
9 Diseases of the circulatory system
10 Diseases of the respiratory system
11 Diseases of the digestive system
12 Diseases of the skin
13 Diseases of the musculoskeletal system
14 Diseases of the genitourinary system
15 Pregnancy and childbirth
16 Conditions from perinatal period
17 Congenital malformations
18 Symptoms not elsewhere classified
19 Injury
20 External causes
21 Factors influencing health status
22 Patient follow-up
23 Medical consultation
24 Blood donation
25 Laboratory examination
26 Unjustified absence
27 Physiotherapy
28 Dental consultation
dtype: object
I tried this:
df1.insert(loc=0, column="Reason_for_absence", value=s_conditions)
Output - this is wrong, because I need the Reason_for_absence column filled by matching the "Reason for absence" values against the index of s_conditions:
Reason_for_absence ID Reason for absence \
0 Not absent 11 26
1 Infectious and parasitic diseases 36 0
2 Neoplasms 3 23
3 Diseases of the blood 7 7
4 Endocrine, nutritional and metabolic diseases 11 23
5 Mental and behavioural disorders 3 23
6 Diseases of the nervous system 10 22
7 Diseases of the eye 20 23
8 Diseases of the ear 14 19
9 Diseases of the circulatory system 1 22
10 Diseases of the respiratory system 20 1
11 Diseases of the digestive system 20 1
12 Diseases of the skin 20 11
13 Diseases of the musculoskeletal system 3 11
14 Diseases of the genitourinary system 3 23
15 Pregnancy and childbirth 24 14
16 Conditions from perinatal period 3 23
17 Congenital malformations 3 21
18 Symptoms not elsewhere classified 6 11
19 Injury 33 23
20 External causes 18 10
21 Factors influencing health status 3 11
22 Patient follow-up 10 13
23 Medical consultation 20 28
24 Blood donation 11 18
25 Laboratory examination 10 25
26 Unjustified absence 11 23
27 Physiotherapy 30 28
28 Dental consultation 11 18
29 NaN 3 23
30 NaN 3 18
31 NaN 2 18
32 NaN 1 23
I am getting output up to 28 rows and NaN values after that. Instead, I need the Series values matched by index for all the rows.
While this question is a bit confusing, it seems the desire is to match the series index with the dataframe "Reason for Absence" column. If this is correct, below is a small example of how to accomplish this. Keep in mind that the resulting dataframe will be sorted based on the 'Reason for Absence Numerical' column. If my understanding is incorrect, please clarify the question so we can better assist you.
import pandas as pd

d = {'ID': [11, 36, 3], 'Reason for Absence Numerical': [3, 2, 1], 'Day of the Week': [4, 2, 6]}
dataframe = pd.DataFrame(data=d)
s = {0: 'Not absent', 1: 'Neoplasms', 2: 'Injury', 3: 'Diseases of the eye'}
disease_series = pd.Series(data=s)

def add_series_to_df(df, series, index_val):
    # Keep only the rows whose numerical reason code equals this index value
    df_filtered = df[df['Reason for Absence Numerical'] == index_val].copy()
    series_filtered = series[series.index == index_val]
    if not df_filtered.empty:
        df_filtered['Reason for Absence Text'] = series_filtered.item()
    return df_filtered

x = [add_series_to_df(dataframe, disease_series, index_val) for index_val in range(len(disease_series.index))]
new_df = pd.concat(x)
print(new_df)
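A shorter route for the original question is usually Series.map, which looks each code up directly in the Series index. A minimal, self-contained sketch with a few made-up rows shaped like the question's DataFrame and Series (the real code would use df1 and s_conditions directly):

import pandas as pd

df1 = pd.DataFrame({"ID": [11, 36, 3], "Reason for absence": [26, 0, 23]})
s_conditions = pd.Series({0: "Not absent", 23: "Medical consultation", 26: "Unjustified absence"})

# map() looks each code up in the Series index; codes missing from the
# Series simply become NaN instead of shifting rows out of alignment.
df1["Reason_for_absence"] = df1["Reason for absence"].map(s_conditions)
print(df1)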

Recursive search in Spark DataFrame

I have an employee table where the employee ID and supervisor ID are present. I want to find the hierarchy for each employee up to five levels.
Example: employee 1 reports to 2, 2 reports to 4, 4 reports to 17, and 17 reports to 20. We are not able to find a supervisor for 20, so we keep 20's supervisor as 20 itself.
EmployeeID  SupervisiorID
1           2
2           4
8           6
9           5
6           3
5           10
4           17
3           15
10          20
15          20
17          20
16          21
15          13
14          12
13          11
Expected output
EmployeeID  SupervisiorID_1  SupervisiorID_2  SupervisiorID_3  SupervisiorID_4  SupervisiorID_5
1           2                4                17               20               20
2           4                17               20               20               20
8           6                3                15               20               20
9           5                10               20               20               20
6           3                15               20               20               20
5           10               20               20               20               20
4           17               20               20               20               20
3           15               20               20               20               20
10          20               20               20               20               20
15          20               20               20               20               20
17          20               20               20               20               20
16          21               21               21               21               21
15          13               11               11               11               11
14          12               12               12               12               12
13          11               11               11               11               11
How can we achieve this recursively in Spark using DataFrames?
Although this has been asked many times, someone here https://dwgeek.com/spark-sql-recursive-dataframe-pyspark-and-scala.html/ has solved this.
If you only have 5 levels, then it is better to use 4 joins to do the job.
In my view, Spark doesn't natively support recursive solutions for such a scenario. If you really want to do it recursively, you may need to collect the data you need and process it locally on the driver.
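A rough PySpark sketch of the four-joins suggestion above, assuming a DataFrame with the EmployeeID and SupervisiorID columns from the question (only a subset of the rows is shown) and carrying the last known supervisor forward when no match exists, as in the expected output:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A small subset of the question's data: (EmployeeID, SupervisiorID)
emp = spark.createDataFrame(
    [(1, 2), (2, 4), (4, 17), (17, 20), (10, 20), (16, 21)],
    ["EmployeeID", "SupervisiorID"],
)

result = emp.withColumnRenamed("SupervisiorID", "SupervisiorID_1")
for level in range(2, 6):  # four left joins for levels 2..5
    prev, curr = f"SupervisiorID_{level - 1}", f"SupervisiorID_{level}"
    lookup = emp.select(
        F.col("EmployeeID").alias(prev),
        F.col("SupervisiorID").alias(curr),
    )
    # If the previous-level supervisor has no row of their own,
    # keep that supervisor (e.g. 20 -> 20), as in the expected output.
    result = (
        result.join(lookup, on=prev, how="left")
              .withColumn(curr, F.coalesce(F.col(curr), F.col(prev)))
    )

cols = ["EmployeeID"] + [f"SupervisiorID_{i}" for i in range(1, 6)]
result.select(cols).orderBy("EmployeeID").show()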

How to work with "age bins" in a Pandas DataFrame when they are saved as strings?

I downloaded a dataset about Lego in .csv format from Kaggle. There's an "Ages" column like this:
df['Ages'].unique()
array(['6-12', '12+', '7-12', '10+', '5-12', '8-12', '4-7', '4-99', '4+',
'9-12', '16+', '14+', '9-14', '7-14', '8-14', '6+', '2-5', '1½-3',
'1½-5', '9+', '5-8', '10-21', '8+', '6-14', '5+', '10-16', '10-14',
'11-16', '12-16', '9-16', '7+'], dtype=object)
These categories are the suggested ages for playing with the Lego sets.
I intend to do some statistical analysis with these age bins. For example, I want to check the mean of the suggested ages.
However, since each entry is a string:
type(lego_dataset.loc[0]['Ages'])
str
I don't know how to work with the data.
I've already checked How to categorize a range of values in Pandas DataFrame.
But imagine there are 100 unique bins; it's not reasonable to prepare a list of 100 labels by hand. There should be a better way.
Not entirely sure what output you are looking for. See if the code and output below help.
df['Lage'] = df['Ages'].str.split('[-+]').str[0]
df['Uage'] = df['Ages'].str.split('[-+]').str[-1]
or
df['Lage'] = df['Ages'].str.extract(r'(\d+)', expand=True) # you don't get the fractions for rows 17 & 18
df['Uage'] = df['Ages'].str.split('[-+]').str[-1]
Input
Ages
0 6-12
1 12+
2 7-12
3 10+
4 5-12
5 8-12
6 4-7
7 4-99
8 4+
9 9-12
10 16+
11 14+
12 9-14
13 7-14
14 8-14
15 6+
16 2-5
17 1½-3
18 1½-5
19 9+
20 5-8
21 10-21
22 8+
23 6-14
24 5+
25 10-16
26 10-14
27 11-16
28 12-16
29 9-16
30 7+
Output1
Ages Lage Uage
0 6-12 6 12
1 12+ 12
2 7-12 7 12
3 10+ 10
4 5-12 5 12
5 8-12 8 12
6 4-7 4 7
7 4-99 4 99
8 4+ 4
9 9-12 9 12
10 16+ 16
11 14+ 14
12 9-14 9 14
13 7-14 7 14
14 8-14 8 14
15 6+ 6
16 2-5 2 5
17 1½-3 1½ 3
18 1½-5 1½ 5
19 9+ 9
20 5-8 5 8
21 10-21 10 21
22 8+ 8
23 6-14 6 14
24 5+ 5
25 10-16 10 16
26 10-14 10 14
27 11-16 11 16
28 12-16 12 16
29 9-16 9 16
30 7+ 7
Output2
Ages Lage Uage
0 6-12 6 12
1 12+ 12
2 7-12 7 12
3 10+ 10
4 5-12 5 12
5 8-12 8 12
6 4-7 4 7
7 4-99 4 99
8 4+ 4
9 9-12 9 12
10 16+ 16
11 14+ 14
12 9-14 9 14
13 7-14 7 14
14 8-14 8 14
15 6+ 6
16 2-5 2 5
17 1½-3 1 3
18 1½-5 1 5
19 9+ 9
20 5-8 5 8
21 10-21 10 21
22 8+ 8
23 6-14 6 14
24 5+ 5
25 10-16 10 16
26 10-14 10 14
27 11-16 11 16
28 12-16 12 16
29 9-16 9 16
30 7+ 7
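To get to the statistics mentioned in the question (for example a mean of the suggested ages), a rough follow-up sketch building on the Lage/Uage split above: convert the bounds to numbers, let open-ended bins like "12+" keep only a lower bound, and average the bin midpoints. The sample data and column names are taken from the question; the midpoint interpretation is an assumption.

import pandas as pd

df = pd.DataFrame({"Ages": ["6-12", "12+", "7-12", "1½-3", "4-99"]})  # small sample
df["Lage"] = df["Ages"].str.split("[-+]").str[0]
df["Uage"] = df["Ages"].str.split("[-+]").str[-1]

# "12+" leaves Uage empty and "1½" is not a plain number, so coerce to NaN.
df["Lage"] = pd.to_numeric(df["Lage"], errors="coerce")
df["Uage"] = pd.to_numeric(df["Uage"], errors="coerce")

# Midpoint of each bin; mean(axis=1) skips NaN, so "12+" falls back to 12.
df["MidAge"] = df[["Lage", "Uage"]].mean(axis=1)
print(df["MidAge"].mean())  # a rough "average suggested age" across sets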

Get unique rows from csv

I have a csv file with over 5k rows with the following structure:
Source Target LinkId LinkName Throughput
==================================================
1 12 1250 link1250 5 //return
1 12 3250 link3250 14 //return
1 14 1250 link1250 5
1 14 3250 link3250 14
1 18 1250 link1250 5
1 18 3250 link3250 14
1 25 250 link250 2 //return
2 12 2250 link2250 5 //return
2 12 5250 link5250 14 //return
2 14 2250 link2250 5
2 14 5250 link5250 14
2 18 2250 link2250 5
2 18 5250 link5250 14
2 58 50 link50 34
I would now like to filter the CSV to display only the lines highlighted above, i.e., it should be filtered so that there is only one entry per LinkId, irrespective of the other columns. So I would expect something like this:
Source Target LinkId LinkName Throughput
==================================================
1 12 1250 link1250 5
1 12 3250 link3250 14
2 12 2250 link2250 5
2 12 5250 link5250 14
2 58 50 link50 34
and so on. Could someone suggest an easy way to do this in Excel?
If you don't care about keeping the duplicates, you can select all cells and go to Data > Remove Duplicates.
If you don't want to delete any data, you can use a Pivot Table with the existing table as its source.
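If handling the file outside Excel is an option, a pandas alternative (not what either suggestion above describes) that keeps the first row per LinkId; the file name links.csv is hypothetical:

import pandas as pd

# Keep only the first occurrence of each LinkId, regardless of the other columns.
df = pd.read_csv("links.csv")  # hypothetical file name
unique_links = df.drop_duplicates(subset="LinkId", keep="first")
unique_links.to_csv("links_unique.csv", index=False)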

Excel Pivot table - get maximum for a period of 24 hours

I have an Excel file with:
Days of the week and 24 hours for each day.
Each hour I get some points.
I would like to calculate the maximum cumulative points I can get within 24 hours.
[TEST.XLSX]
2 Columns:
Monday Points
0 34
1 32
2 4
3 54
4 12
5 55
6 4
7 4
8 555
9 787
10 8
11 76
12 78
13 8
14 656
15 7
16 4
17 45
18 54
19 543
20 56
21 65
22 4
23 3
Tuesday
0 56
1 7
2 333
3 9
4 876
5 3333
6 3333
7 76
8 3333
9 465
10 7
11 6
12 5
13 6
14 7
15 6
16 7
17 65
18 555555555
19 6
20 5
21 4
22 6
23 6
Wednesday
0 6
1 7
...
Thanks for your help!
Use real date/time values in your hours column. Delete the rows with the day text. Instead, use a formula that increments from a starting date/time. For example: cell A2 contains the date and midnight time for Nov 17. Cell A3, copied down, contains the formula
=A2+TIME(1,0,0)
which increments by one hour.
Now you can build a pivot table. Group the date/time values by day and hour. Show the subtotal for the day and set its value field settings to Max.
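The same grouping can be sketched outside Excel too. A rough pandas version, assuming one row per hour with a real timestamp (as suggested above) and reading "maximum of cumulative points within 24 hours" as the largest total over one calendar day:

import pandas as pd

# Made-up data: one timestamp per hour over two days, points recycled
# from the first few Monday values in the question.
hours = pd.date_range("2024-11-17", periods=48, freq="60min")
points = [34, 32, 4, 54, 12, 55] * 8  # 48 hourly values
df = pd.DataFrame({"Timestamp": hours, "Points": points})

# Sum the points per calendar day, then take the largest daily total.
daily_totals = df.set_index("Timestamp")["Points"].resample("D").sum()
print(daily_totals.max())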
