How can I delete duplicates in three columns using two criteria (the first two columns)? - python-3.x

This is my data set:
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
2 2018 6 62 47 18
3 2018 6 62 47 18
4 2018 6 62 47 18
The last three columns already contain the sums for each year and week. I need to get rid of the duplicates so that the table contains only unique rows (for the example above):
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
4 2018 6 62 47 18
I tried grouping the data, but it does not work as expected: it does what I need, but only for one column.
df.groupby(['Year created', 'Week created']).size()
And output:
Year created Week created
2017 48 2
49 25
50 54
51 36
52 1
2018 1 17
2 50
3 37
But the result is just one column, and I don't know which one: even if I split the data into three parts and repeat the procedure for each part, I get the same result (as above) for all of them.

I believe you need drop_duplicates:
df = df.drop_duplicates(['Year created', 'Week created'])
print (df)
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
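As a minimal sketch of the difference, assuming the five sample rows from the question: groupby(...).size() only counts the rows in each (year, week) group, so its single column is a row count and not any of the SUM_ columns, while drop_duplicates keeps one full row per key. Passing keep="last" keeps the row with index 4, as in the question's desired output.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Year created": [2018, 2018, 2018, 2018, 2018],
    "Week created": [1, 6, 6, 6, 6],
    "SUM_New": [17, 62, 62, 62, 62],
    "SUM_Closed": [0, 47, 47, 47, 47],
    "SUM_Open": [82, 18, 18, 18, 18],
})
keys = ["Year created", "Week created"]

# size() returns the number of rows per group, not a data column.
counts = df.groupby(keys).size()

# One row per (year, week): the first occurrence by default, or the last one.
first_rows = df.drop_duplicates(subset=keys)
last_rows = df.drop_duplicates(subset=keys, keep="last")

print(counts)
print(first_rows)
print(last_rows)
```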

df2 = df.drop_duplicates(['Year created', 'Week created', 'SUM_New', 'SUM_Closed'])
print(df2)
hope this helps.

Related

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns, listed by date. I'm trying to find the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject):
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in (lotimport()['ozlotto'].columns.tolist()):
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
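import numpy as np
import pandas as pd
# df is the DataFrame of drawn numbers, with the dates as the index
freqvalues = np.unique(df.to_numpy(), return_counts=True)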
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0) returns the maximum of each column.
df.max(axis=1) returns the maximum of each row (one value per index entry).
Ok so the final answer I came up with was a mix of a few things including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas series and append each column to it ignoring dates to prevent conflicts.
Once I have the series, I follow Vioxini's suggestion: use numpy to get the counts of the unique values and turn the values into the index. After that, I sort by count in descending order and return the top 10 values.
Below is the resulting code, I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject):
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in (lotimport()['ozlotto'].columns.tolist()):
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)
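Note that Series.append was removed in pandas 2.0, so the loop above only runs on older versions (pd.concat is the usual replacement). As a hedged alternative sketch, the same top-10 table can be built without any loop by stacking all columns into one Series and calling value_counts. The small `draws` frame below is a hypothetical stand-in for lotimport()['ozlotto'], not the real data.

```python
import pandas as pd

# Hypothetical three-draw sample standing in for lotimport()['ozlotto'].
draws = pd.DataFrame(
    {"1": [4, 1, 1], "2": [5, 17, 3], "3": [7, 26, 9]},
    index=pd.to_datetime(["2020-07-07", "2020-06-30", "2020-06-23"]),
)
draws.index.name = "Date"

# stack() flattens every column into a single Series;
# value_counts() counts each number and sorts by count, descending.
top = draws.stack().value_counts().head(10).rename("Frequency")
top.index.name = "Numbers"

print(top)
```

This reads the CSV files only once, whereas calling lotimport() inside the loop re-reads them on every iteration.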

SUMIFS/SUMPRODUCT for 2D data with multiple possible values in 2nd direction

I have been struggling with the following:
I have a data sheet from which I want to sum the amounts per week for a group of projects, where the group of projects is user input. Schematically, this "data" sheet looks like this:
A B C D E F G
1 YEAR 2017 2017 2017 2017 2017 2017
2 WEEK 40 41 42 43 44 45
3 ProjectA 100 101 102 104 100 85
4 ProjectB 80 80 85 82 80 82
5 ProjectC 60 60 60 60 60 60
6 ProjectD 105 108 112 116 120 122
Next, the projects to sum are user input, so in another sheet ("projects") the user would enter:
A
1 ProjectA
2 ProjectC
3
4
5
Then in the third sheet, I would have to show the summed data per week:
A B C D E F
1 2017 2017 2017 2017 2017 2017
2 40 41 42 43 44 45
3
Now the big question is, what formula could I use in row 3 of this last sheet?
What I have tried so far is: (in A3)
{=SUM(IF(data!B1:G1=A1;IF(data!B2:G2=A2;IF(data!A3:A6=projects!A1:A5;data!B3:G6))))}
This gives me a #N/A error. If I replace projects!A1:A5 by projects!A1, everything works fine, but then that's not much of a summation anymore :)
I have tried other versions with SUMIFS and SUMPRODUCT but those get me nowhere closer to where I'd like to be.
So, any help would be greatly appreciated.
(One last note, I am not able/allowed to change or add anything in the "data" sheet)
Use SUMPRODUCT:
=SUMPRODUCT((Data!$B$2:$G$2=A2)*(Data!$B$1:$G$1=A1)*(ISNUMBER(MATCH(Data!$A$3:$A$6,projects!$A:$A,0))),Data!$B$3:$G$6)
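For readers who want to check the result outside Excel, here is a rough pandas sketch of the same logic, assuming the "data" sheet values from the question: the MATCH/ISNUMBER part becomes a row mask over the project names, and the year and week conditions become the column labels. The variable names are illustrative only.

```python
import pandas as pd

# Values copied from the "data" sheet in the question.
data = pd.DataFrame(
    [[100, 101, 102, 104, 100, 85],
     [80, 80, 85, 82, 80, 82],
     [60, 60, 60, 60, 60, 60],
     [105, 108, 112, 116, 120, 122]],
    index=["ProjectA", "ProjectB", "ProjectC", "ProjectD"],
    columns=pd.MultiIndex.from_product(
        [[2017], [40, 41, 42, 43, 44, 45]], names=["YEAR", "WEEK"]
    ),
)

# User input from the "projects" sheet (blank cells omitted).
selected = ["ProjectA", "ProjectC"]

# Keep only the selected project rows, then sum each (year, week) column.
weekly_totals = data.loc[data.index.isin(selected)].sum()

print(weekly_totals)
```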

find the number of skip rows between records

I have a requirement to get the row number of the next matching value, i.e.:
Number 1 Number 2 Number 3 Number 4 Number 5 Number 6
16 33 28 20 23 14
13 12 27 29 2 32
31 25 9 28 17 10
11 22 14 3 18 13
12 39 22 32 25 24
37 40 33 18 9 3
4 35 17 24 7 12
16 3 38 8 17 24
Now 16 is present in the 7th row, so the skipped rows are 6. 33 is present in the 6th row, so the skipped rows are 5. Similarly, 28 is present in the 3rd row, so the skipped rows are 1.
output will be :
6 4 1 19 10 2
Assume that 20 and 23 are found in the 20th and 11th rows respectively. Skipped rows = row number of the next occurrence of that number - present row number.
I am not able to build a formula for this. MATCH should work, I guess, but I am not sure.
Put this formula in the first cell:
=AGGREGATE(15,6,ROW($A$3:$F$22)/($A$3:$F$22=A2),1) - ROW($A$3)
Then drag/copy across
If you want to drag down (put the results in columnar form):
=AGGREGATE(15,6,ROW($A$3:$F$22)/($A$3:$F$22=INDEX($2:$2,ROW(1:1))),1) - ROW($A$3)
Put it in the first cell and drag/copy down.
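To sanity-check the formula against the sample data, here is a small Python sketch of the same computation. Like the formula, it counts the rows strictly between the first row and the next row that contains the value. The function name `rows_skipped` is made up for this illustration, and only the eight rows shown in the question are used, so 20 and 23 (which do not recur in them) come back as None.

```python
# The eight data rows shown in the question.
rows = [
    [16, 33, 28, 20, 23, 14],
    [13, 12, 27, 29, 2, 32],
    [31, 25, 9, 28, 17, 10],
    [11, 22, 14, 3, 18, 13],
    [12, 39, 22, 32, 25, 24],
    [37, 40, 33, 18, 9, 3],
    [4, 35, 17, 24, 7, 12],
    [16, 3, 38, 8, 17, 24],
]

def rows_skipped(rows, start=0):
    """For each value in rows[start], count the rows between it and the
    next row containing that value (None if it never appears again)."""
    result = []
    for value in rows[start]:
        gap = None
        for offset, later in enumerate(rows[start + 1:]):
            if value in later:
                gap = offset  # number of rows passed over before the match
                break
        result.append(gap)
    return result

skipped = rows_skipped(rows)
print(skipped)  # [6, 4, 1, None, None, 2]
```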

Excel indexmatch, vlookup

I have a holiday calendar for several years in one table. Can anyone help me arrange this data by week and show the holiday against each week? I want to reference this data in other worksheets, so arranging it this way will help me use formulae on other sheets. I want column A to hold the week numbers, column B to show the holiday for year 1, column C to show the holiday for year 2, etc.
Fiscal Week
2015 2014 2013 2012
Valentine's Day 2 2 2 3
President's Day 3 3 3 4
St. Patrick's Day 7 7 7 7
Easter 10 12 9 11
Mother's Day 15 15 15 16
Memorial Day 17 17 17 18
Flag Day 20 19 19 20
Father's Day 21 20 20 21
Independence Day 22 22 22 23
Labor Day 32 31 31 32
Columbus Day 37 37 37 37
Thanksgiving 43 43 43 43
Christmas 47 47 47 48
New Year's Day 48 48 48 49
ML King Day 51 51 51 52
It's not too clear what year 1 is, so I'm going to assume that's 2015, and year 2 is 2014, etc.
Here's how you could set it up, if I understand correctly. Use this INDEX/MATCH formula (pseudo-formula):
=Iferror(Index([holiday names range],match([week number],[2015's week numbers in your table],0)),"")
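With the holiday names in A3:A17, the 2015 week numbers in B3:B17, and the week numbers to look up in column H, it looks like this (in the cell next to the week numbers):
=IFERROR(INDEX($A$3:$A$17,MATCH($H3,B$3:B$17,0)),"")
You can then drag the formula over, and the matching range (B3:B17 in the formula above) will "slide over" to the next year's column as you drag.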

Calculate consecutive streak in Excel row

I am trying to calculate two values: current streak and longest streak.
Each record is on one row and contains a name followed by values.
Each of those values is between 1 and 200.
Example:
John Doe 14 16 25 18 40 65 101 85 14 19 18 9 3
Jane Doe 24 22 18 5 8 22 17 17 15 2 1 5 22
Jim Doe 40 72 66 29 25 28
Jan Doe 27 82 22 17 18 9 6 7 9 13
For each row, I'm trying to find the "current" streak and "longest" streak.
The values have to be <= 24 to be counted. Data goes left to right.
John: Current 2; Long 5
Jane: Current 13; Long 13
Jim: Current 0; Long 0
Jan: Current 0; Long 8
What would be a formula to calculate the current and long in their own cell on that same row (would have to go before data)?
For current run, assuming data in C2:Z2, try this array formula:
=IFERROR(MATCH(TRUE,C2:Z2>24,0)-1,COUNT(C2:Z2))
Confirm with CTRL+SHIFT+ENTER
For longest streak try this version based on the cell references used in your comment
=MAX(FREQUENCY(IF(P7:BB7<=24,COLUMN(P7:BB7)),IF(P7:BB7>24,COLUMN(P7:BB7))))
Again confirm with CTRL+SHIFT+ENTER
or to allow blanks in the range (which would end a streak) you can use this version
=MAX(FREQUENCY(IF(P7:BB7<>"",IF(P7:BB7<=24,COLUMN(P7:BB7))),IF((P7:BB7="")+(P7:BB7>24),COLUMN(P7:BB7))))
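The logic of the two formulas can be cross-checked with a short Python sketch, assuming the four sample rows from the question. As in the first formula, the current streak is counted from the leftmost value until the first value above 24; the longest streak is the longest unbroken run of values at or below 24. The function names are illustrative, not part of any library.

```python
def current_streak(values, limit=24):
    """Count values from the left until the first one above the limit."""
    count = 0
    for v in values:
        if v > limit:
            break
        count += 1
    return count

def longest_streak(values, limit=24):
    """Length of the longest run of consecutive values at or below the limit."""
    best = run = 0
    for v in values:
        run = run + 1 if v <= limit else 0
        best = max(best, run)
    return best

# Sample rows copied from the question.
records = {
    "John Doe": [14, 16, 25, 18, 40, 65, 101, 85, 14, 19, 18, 9, 3],
    "Jane Doe": [24, 22, 18, 5, 8, 22, 17, 17, 15, 2, 1, 5, 22],
    "Jim Doe": [40, 72, 66, 29, 25, 28],
    "Jan Doe": [27, 82, 22, 17, 18, 9, 6, 7, 9, 13],
}

streaks = {
    name: (current_streak(vals), longest_streak(vals))
    for name, vals in records.items()
}
print(streaks)
```

The results agree with the expected values listed in the question (John 2 and 5, Jane 13 and 13, Jim 0 and 0, Jan 0 and 8).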
