I have a CSV file with over 5,000 rows and the following structure:
Source Target LinkId LinkName Throughput
==================================================
1 12 1250 link1250 5 //return
1 12 3250 link3250 14 //return
1 14 1250 link1250 5
1 14 3250 link3250 14
1 18 1250 link1250 5
1 18 3250 link3250 14
1 25 250 link250 2 //return
2 12 2250 link2250 5 //return
2 12 5250 link5250 14 //return
2 14 2250 link2250 5
2 14 5250 link5250 14
2 18 2250 link2250 5
2 18 5250 link5250 14
2 58 50 link50 34
I would now like to filter the CSV to show only the lines marked //return above, i.e., so that there is only one entry per LinkId regardless of the other columns. I would expect something like this:
Source Target LinkId LinkName Throughput
==================================================
1 12 1250 link1250 5
1 12 3250 link3250 14
1 25 250 link250 2
2 12 2250 link2250 5
2 12 5250 link5250 14
2 58 50 link50 34
and so on. Could someone suggest an easy way to do this in Excel?
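If you don't need to keep the duplicate rows, select all cells and go to Data > Remove Duplicates. In the dialog, tick only the LinkId column so that rows are compared on that column alone; the first row for each LinkId is kept.
If you don't want to delete any data, you can build a Pivot Table that uses the existing table as its source.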
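If working outside Excel is an option, the same "one row per LinkId" filter can be sketched in pandas. This is only an illustration under assumptions: the sample text below is a shortened copy of the rows in the question, the whitespace separator is assumed from how the table is displayed, and the names `links` and `unique_links` are made up for the example.

```python
import io
import pandas as pd

# Shortened copy of the sample rows from the question (assumed whitespace-separated)
raw = """Source Target LinkId LinkName Throughput
1 12 1250 link1250 5
1 12 3250 link3250 14
1 14 1250 link1250 5
1 14 3250 link3250 14
1 25 250 link250 2
2 12 2250 link2250 5
2 58 50 link50 34
"""

links = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Keep the first row seen for each LinkId, ignoring all other columns
unique_links = links.drop_duplicates(subset="LinkId", keep="first")
print(unique_links)
```

`keep="first"` mirrors what Remove Duplicates does: the earliest row for each LinkId survives and the original row order is preserved.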
I have a dataframe with multiple columns and 700+ rows, and a series of 29 rows (indexes 0 to 28). I want to add a new column to the dataframe by looking up each value of an existing column in the index of the series.
This is the dataframe. The series index holds the same codes as the "Reason for absence" column:
ID Reason for absence Month of absence Day of the week Seasons
0 11 26 7 3 1
1 36 0 7 3 1
2 3 23 7 4 1
3 7 7 7 5 1
4 11 23 7 5 1
5 3 23 7 6 1
6 10 22 7 6 1
7 20 23 7 6 1
8 14 19 7 2 1
9 1 22 7 2 1
10 20 1 7 2 1
11 20 1 7 3 1
12 20 11 7 4 1
13 3 11 7 4 1
14 3 23 7 4 1
15 24 14 7 6 1
16 3 23 7 6 1
17 3 21 7 2 1
18 6 11 7 5 1
19 33 23 8 4 1
20 18 10 8 4 1
21 3 11 8 2 1
22 10 13 8 2 1
23 20 28 8 6 1
24 11 18 8 2 1
25 10 25 8 2 1
26 11 23 8 3 1
27 30 28 8 4 1
28 11 18 8 4 1
29 3 23 8 6 1
30 3 18 8 2 1
31 2 18 8 5 1
32 1 23 8 5 1
33 2 18 8 2 1
34 3 23 8 2 1
35 10 23 8 2 1
36 11 24 8 3 1
37 19 11 8 5 1
38 2 28 8 6 1
39 20 23 8 6 1
40 27 23 9 3 1
41 34 23 9 2 1
42 3 23 9 3 1
43 5 19 9 3 1
44 14 23 9 4 1
This is the series, s_conditions:
0 Not absent
1 Infectious and parasitic diseases
2 Neoplasms
3 Diseases of the blood
4 Endocrine, nutritional and metabolic diseases
5 Mental and behavioural disorders
6 Diseases of the nervous system
7 Diseases of the eye
8 Diseases of the ear
9 Diseases of the circulatory system
10 Diseases of the respiratory system
11 Diseases of the digestive system
12 Diseases of the skin
13 Diseases of the musculoskeletal system
14 Diseases of the genitourinary system
15 Pregnancy and childbirth
16 Conditions from perinatal period
17 Congenital malformations
18 Symptoms not elsewhere classified
19 Injury
20 External causes
21 Factors influencing health status
22 Patient follow-up
23 Medical consultation
24 Blood donation
25 Laboratory examination
26 Unjustified absence
27 Physiotherapy
28 Dental consultation
dtype: object
I tried this:
df1.insert(loc=0, column="Reason_for_absence", value=s_conditions)
Output (this is wrong, because I need the Reason_for_absence column to follow the values of "Reason for absence", looked up in the index of s_conditions):
Reason_for_absence ID Reason for absence \
0 Not absent 11 26
1 Infectious and parasitic diseases 36 0
2 Neoplasms 3 23
3 Diseases of the blood 7 7
4 Endocrine, nutritional and metabolic diseases 11 23
5 Mental and behavioural disorders 3 23
6 Diseases of the nervous system 10 22
7 Diseases of the eye 20 23
8 Diseases of the ear 14 19
9 Diseases of the circulatory system 1 22
10 Diseases of the respiratory system 20 1
11 Diseases of the digestive system 20 1
12 Diseases of the skin 20 11
13 Diseases of the musculoskeletal system 3 11
14 Diseases of the genitourinary system 3 23
15 Pregnancy and childbirth 24 14
16 Conditions from perinatal period 3 23
17 Congenital malformations 3 21
18 Symptoms not elsewhere classified 6 11
19 Injury 33 23
20 External causes 18 10
21 Factors influencing health status 3 11
22 Patient follow-up 10 13
23 Medical consultation 20 28
24 Blood donation 11 18
25 Laboratory examination 10 25
26 Unjustified absence 11 23
27 Physiotherapy 30 28
28 Dental consultation 11 18
29 NaN 3 23
30 NaN 3 18
31 NaN 2 18
32 NaN 1 23
I get descriptions only for the first 29 rows and NaN values after that. Instead, I need every row's description to be looked up in the series by that row's "Reason for absence" value.
While the question is a bit confusing, it seems the goal is to match the series index against the dataframe's "Reason for absence" column. If that is correct, below is a small example of how to do it. Keep in mind that the resulting dataframe will be ordered by the 'Reason for Absence Numerical' column rather than by the original row order. If my understanding is incorrect, please clarify the question so we can better assist you.
import pandas as pd

# Small stand-ins for the real dataframe and series
d = {'ID': [11, 36, 3], 'Reason for Absence Numerical': [3, 2, 1], 'Day of the Week': [4, 2, 6]}
dataframe = pd.DataFrame(data=d)
s = {0: 'Not absent', 1: 'Neoplasms', 2: 'Injury', 3: 'Diseases of the eye'}
disease_series = pd.Series(data=s)

def add_series_to_df(df, series, index_val):
    # Keep only the rows whose numeric reason equals index_val
    df_filtered = df[df['Reason for Absence Numerical'] == index_val].copy()
    series_filtered = series[series.index == index_val]
    if not df_filtered.empty:
        # Attach the matching description from the series
        df_filtered['Reason for Absence Text'] = series_filtered.item()
    return df_filtered

# One filtered frame per series index value, concatenated in index order
x = [add_series_to_df(dataframe, disease_series, index_val) for index_val in disease_series.index]
new_df = pd.concat(x)
print(new_df)
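A shorter route to the same lookup, offered as a sketch and not as the answerer's method, is `Series.map`: when it is passed another Series, each value is looked up in that Series' index. The data below is a four-row excerpt of the question's tables, and the new column name 'Reason for absence text' is made up for the example.

```python
import pandas as pd

# Four-row excerpt of the question's dataframe
df = pd.DataFrame({
    'ID': [11, 36, 3, 7],
    'Reason for absence': [26, 0, 23, 7],
})

# Excerpt of s_conditions: index = numeric code, value = description
s_conditions = pd.Series({
    0: 'Not absent',
    7: 'Diseases of the eye',
    23: 'Medical consultation',
    26: 'Unjustified absence',
})

# Look up each code in the series index; codes missing from the series become NaN
df['Reason for absence text'] = df['Reason for absence'].map(s_conditions)
print(df)
```

Unlike the concatenation approach, this keeps the dataframe in its original row order.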
I have a time series as a dataframe. The first column is the week number, and the second holds the values for that week. The first week (22) and the last week (48) are the lower and upper bounds of the time series. Some weeks are missing; for example, there is no week 27 or 28. I would like to resample this series so that there are no missing weeks. Where a week is inserted, I would like the corresponding value to be zero. This is my data:
week value
0 22 1
1 23 2
2 24 2
3 25 3
4 26 2
5 29 3
6 30 3
7 31 3
8 32 7
9 33 4
10 34 5
11 35 4
12 36 2
13 37 3
14 38 10
15 39 5
16 40 7
17 41 10
18 42 11
19 43 15
20 44 9
21 45 13
22 46 5
23 47 6
24 48 2
I am wondering whether this can be achieved in pandas without writing a loop from scratch. I have looked into DataFrame.resample, but can't get the results I am looking for.
I would set week as the index, then reindex with the fill_value option:
import numpy as np
start, end = df['week'].agg(['min','max'])
df.set_index('week').reindex(np.arange(start, end+1), fill_value=0).reset_index()
Output (head):
week value
0 22 1
1 23 2
2 24 2
3 25 3
4 26 2
5 27 0
6 28 0
7 29 3
8 30 3
I downloaded a dataset in .csv format from Kaggle about LEGO sets. There's an "Ages" column like this:
df['Ages'].unique()
array(['6-12', '12+', '7-12', '10+', '5-12', '8-12', '4-7', '4-99', '4+',
'9-12', '16+', '14+', '9-14', '7-14', '8-14', '6+', '2-5', '1½-3',
'1½-5', '9+', '5-8', '10-21', '8+', '6-14', '5+', '10-16', '10-14',
'11-16', '12-16', '9-16', '7+'], dtype=object)
These categories are the suggested ages for playing with each LEGO set.
I intend to do some statistical analysis with these age bins. For example, I want to check the mean of these suggested ages.
However, each of them is a string:
type(df.loc[0, 'Ages'])
str
I don't know how to work with the data.
I've already checked "How to categorize a range of values in Pandas DataFrame".
But imagine there are 100 unique bins. It's not reasonable to prepare a list of 100 labels, one for each category. There should be a better way.
I'm not entirely sure what output you are looking for, but see whether the code and output below help.
df['Lage'] = df['Ages'].str.split('[-+]').str[0]
df['Uage'] = df['Ages'].str.split('[-+]').str[-1]
or
df['Lage'] = df['Ages'].str.extract(r'(\d+)', expand=False)  # drops the ½ in rows 17 and 18
df['Uage'] = df['Ages'].str.split('[-+]').str[-1]
Input
Ages
0 6-12
1 12+
2 7-12
3 10+
4 5-12
5 8-12
6 4-7
7 4-99
8 4+
9 9-12
10 16+
11 14+
12 9-14
13 7-14
14 8-14
15 6+
16 2-5
17 1½-3
18 1½-5
19 9+
20 5-8
21 10-21
22 8+
23 6-14
24 5+
25 10-16
26 10-14
27 11-16
28 12-16
29 9-16
30 7+
Output1
Ages Lage Uage
0 6-12 6 12
1 12+ 12
2 7-12 7 12
3 10+ 10
4 5-12 5 12
5 8-12 8 12
6 4-7 4 7
7 4-99 4 99
8 4+ 4
9 9-12 9 12
10 16+ 16
11 14+ 14
12 9-14 9 14
13 7-14 7 14
14 8-14 8 14
15 6+ 6
16 2-5 2 5
17 1½-3 1½ 3
18 1½-5 1½ 5
19 9+ 9
20 5-8 5 8
21 10-21 10 21
22 8+ 8
23 6-14 6 14
24 5+ 5
25 10-16 10 16
26 10-14 10 14
27 11-16 11 16
28 12-16 12 16
29 9-16 9 16
30 7+ 7
Output2
Ages Lage Uage
0 6-12 6 12
1 12+ 12
2 7-12 7 12
3 10+ 10
4 5-12 5 12
5 8-12 8 12
6 4-7 4 7
7 4-99 4 99
8 4+ 4
9 9-12 9 12
10 16+ 16
11 14+ 14
12 9-14 9 14
13 7-14 7 14
14 8-14 8 14
15 6+ 6
16 2-5 2 5
17 1½-3 1 3
18 1½-5 1 5
19 9+ 9
20 5-8 5 8
21 10-21 10 21
22 8+ 8
23 6-14 6 14
24 5+ 5
25 10-16 10 16
26 10-14 10 14
27 11-16 11 16
28 12-16 12 16
29 9-16 9 16
30 7+ 7
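To get from these string bounds to something a mean can be computed on, one possible sketch follows. It rests on assumptions not stated in the question: '½' is read as .5, an open-ended range such as '12+' is given no upper bound, and the midpoint of such a range falls back to its lower bound. The names `bounds` and `midpoint` are made up for the example.

```python
import pandas as pd

# Small excerpt of the Ages values from the question
ages = pd.Series(['6-12', '12+', '1½-3', '4-99'], name='Ages')

# Assumption: '1½' means 1.5
clean = ages.str.replace('½', '.5', regex=False)

# Capture the lower bound and, when present, the upper bound
pattern = r'^(?P<lower>\d+(?:\.\d+)?)(?:-(?P<upper>\d+(?:\.\d+)?))?\+?$'
bounds = clean.str.extract(pattern).apply(pd.to_numeric)

# Row-wise mean skips NaN, so an open-ended range uses its lower bound only
midpoint = bounds.mean(axis=1)

print(bounds)
print(midpoint.mean())
```

Whether the midpoint is the right summary for a range like '4-99' is a modelling choice; averaging `bounds['lower']` alone may be a more meaningful statistic for "minimum suggested age".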
I have a dataframe like this:
videoId viewedMinutes totalMinutes user_drop TotalUsers
1017479 0 5 8 34
1017479 1 5 3 34
1017479 2 5 2 34
1017479 4 5 3 34
1017479 5 5 19 34
1036704 0 16 1 14
1036704 1 16 2 14
1036704 2 16 2 14
1036704 3 16 1 14
1036704 5 16 1 14
1036704 6 16 1 14
1036704 8 16 2 14
So I want to create a new column, active_users, computed minute by minute, which will look something like this:
videoId viewedMinutes totalMinutes user_drop TotalUsers active_users
1017479 0 5 8 34 34 (1st is fixed)
1017479 1 5 3 34 26(34(active_users)-8(user_drop))
1017479 2 5 2 34 23(26-3)
1017479 4 5 3 34 21(23-2)
1017479 5 5 19 34 18(21-3)
1036704 0 16 1 14 14
1036704 1 16 2 14 13
1036704 2 16 2 14 11
1036704 3 16 1 14 9
1036704 5 16 1 14 8
1036704 6 16 1 14 7
1036704 8 16 2 14 6
So it is like subtracting diagonally, with the first and last values remaining fixed. I also want this calculation to run separately for each unique videoId, not across the whole dataframe.
Then, with a for loop, I want to draw a scatter plot for each unique videoId in Plotly, with minutes on the x-axis and the number of active users on the y-axis. The graph will look something like this: retention.jpg
Use groupby with a custom function built from cumsum and shift. The first row of each group becomes NaN, which the sub method ignores thanks to the parameter fill_value=0:
s = df.groupby('videoId')['user_drop'].transform(lambda x: x.cumsum().shift())
df['active'] = df['TotalUsers'].sub(s, fill_value=0).astype(int)
print(df)
videoId viewedMinutes totalMinutes user_drop TotalUsers active
0 1017479 0 5 8 34 34
1 1017479 1 5 3 34 26
2 1017479 2 5 2 34 23
3 1017479 4 5 3 34 21
4 1017479 5 5 19 34 18
5 1036704 0 16 1 14 14
6 1036704 1 16 2 14 13
7 1036704 2 16 2 14 11
8 1036704 3 16 1 14 9
9 1036704 5 16 1 14 8
10 1036704 6 16 1 14 7
11 1036704 8 16 2 14 6
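The same numbers can also be reached without a Python-level function, as a sketch: the users dropped before a given row equal the per-group running total of user_drop minus that row's own user_drop. The dataframe below rebuilds the question's sample, and the name `dropped_before` is made up for the example.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'videoId': [1017479] * 5 + [1036704] * 7,
    'viewedMinutes': [0, 1, 2, 4, 5, 0, 1, 2, 3, 5, 6, 8],
    'user_drop': [8, 3, 2, 3, 19, 1, 2, 2, 1, 1, 1, 2],
    'TotalUsers': [34] * 5 + [14] * 7,
})

# Running total of drops within each video, excluding the current row
dropped_before = df.groupby('videoId')['user_drop'].cumsum() - df['user_drop']

# Users still watching at the start of each minute
df['active'] = df['TotalUsers'] - dropped_before
print(df)
```

Because `cumsum` on a grouped column returns a result aligned to the original index, no NaN handling or integer cast is needed here.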
EDIT:
For the scatter plot, use:
for i, g in df.groupby('videoId'):
    ax = g.plot.scatter(x='viewedMinutes', y='active')
    ax.set_title(i, fontsize=20)
I have an Excel workbook with:
Days of the week and 24 hours for each day.
Each hour I get some points.
I would like to calculate the maximum cumulative points I can get within 24 hours.
[TEST.XLSX]
2 Columns:
Monday Points
0 34
1 32
2 4
3 54
4 12
5 55
6 4
7 4
8 555
9 787
10 8
11 76
12 78
13 8
14 656
15 7
16 4
17 45
18 54
19 543
20 56
21 65
22 4
23 3
Tuesday
0 56
1 7
2 333
3 9
4 876
5 3333
6 3333
7 76
8 3333
9 465
10 7
11 6
12 5
13 6
14 7
15 6
16 7
17 65
18 555555555
19 6
20 5
21 4
22 6
23 6
Wednesday
0 6
1 7
...
Thanks for your help!
Use real date/time values in your hours column. Delete the rows that contain the day text. Instead, use a formula that increments from a starting date/time. For example: cell A2 contains the date and midnight time for Nov 17. Cell A3, copied down, contains the formula
=A2+TIME(1,0,0)
which increments by one hour.
Now you can build a pivot table. Group the date/time value by day and hour. Show the subtotal for the day and set its value field settings to Max.
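For comparison, the same "real timestamps first, then aggregate" idea can be sketched in pandas. Several things here are assumptions: the year in the start date is hypothetical (the answer only says Nov 17), only the Monday and Tuesday values from the question are used, and "maximum cumulative points within 24 hours" is read in two possible ways: the per-day maximum, as in the pivot table, and the largest total over any 24 consecutive hours.

```python
import pandas as pd

# Monday and Tuesday points copied from the question
monday = [34, 32, 4, 54, 12, 55, 4, 4, 555, 787, 8, 76,
          78, 8, 656, 7, 4, 45, 54, 543, 56, 65, 4, 3]
tuesday = [56, 7, 333, 9, 876, 3333, 3333, 76, 3333, 465, 7, 6,
           5, 6, 7, 6, 7, 65, 555555555, 6, 5, 4, 6, 6]
points_list = monday + tuesday

# One timestamp per hour, like =A2+TIME(1,0,0); the start date is hypothetical
hours = pd.Timestamp("2025-11-17") + pd.to_timedelta(list(range(len(points_list))), unit="h")
points = pd.Series(points_list, index=hours, name="Points")

# Reading 1: highest hourly value per calendar day (what the pivot's Max shows)
daily_max = points.groupby(points.index.normalize()).max()

# Reading 2: largest total over any 24 consecutive hours
rolling_total = points.rolling(24).sum()
best_total = rolling_total.max()
best_window_end = rolling_total.idxmax()

print(daily_max)
print(best_total, best_window_end)
```

The first 23 entries of `rolling_total` are NaN because a full 24-hour window is not yet available; `max` and `idxmax` skip them.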