I have a table where data may be provided on a monthly or yearly basis. The data looks something like this:
Item Date Name
Class 1 12/31/2010 David
Class 1 12/31/2011 David
Class 1 12/31/2012 David
Class 1 12/31/2010 Moses
Class 1 12/31/2011 Moses
Class 1 12/31/2012 Moses
Class 1 01/31/2012 Shelly
Class 1 02/28/2012 Shelly
Class 1 03/31/2012 Shelly
Class 1 04/30/2012 Shelly
Class 1 05/31/2012 Shelly
Class 1 06/30/2012 Shelly
Class 1 07/31/2012 Shelly
Class 1 08/31/2012 Shelly
Class 1 09/30/2012 Shelly
Class 1 10/31/2012 Shelly
Class 1 11/30/2012 Shelly
Class 1 12/31/2012 Shelly
Class 2 01/31/2012 Shelly
Class 2 02/28/2012 Shelly
Class 2 03/31/2012 Shelly
Class 2 04/30/2012 Shelly
Class 2 05/31/2012 Shelly
Class 2 06/30/2012 Shelly
Class 2 07/31/2012 Shelly
Class 2 08/31/2012 Shelly
Class 2 09/30/2012 Shelly
Class 2 10/31/2012 Shelly
Class 2 11/30/2012 Shelly
Class 2 12/31/2012 Shelly
Class 2 01/31/2012 David
Class 2 02/28/2012 David
Class 2 03/31/2012 David
Class 2 04/30/2012 David
Class 2 05/31/2012 David
Class 2 12/31/2011 Soni
Class 2 12/31/2012 Soni
For a given combination of Name and Item, the date interval can be either monthly or yearly. I want to add a calculated column named Flag: if the user entered monthly data, set the value to Yes, otherwise No.
So the rows for Class 1 - Shelly, and all Class 2 rows except Soni, should be flagged Yes.
Can anyone please guide me on this? When I try OVER with Intersect, the result is blank for some of the rows.
Assuming you have at least 2 months of data for each Item / Name pairing, you can approach this in two ways. Note that unless you have at least 2 months of data, you can't tell whether, for a given year, you are receiving data monthly or yearly.
Insert Calculated Column DatePart("year",[Date]) as [Year]
Insert Calculated Column If(Count([Item]) OVER (Intersect([Name],[Item],[Year]))>1,"Yes","No") as [Flag]
Another expression you may find useful applies a rank / row number to each Year | Month | Item | Name pairing. Here is that expression, which you can use to see how many months of data you have for that pairing (using Max()) or to do other aggregations / logical checks.
RankReal(Date(DatePart("year",[Date]),DatePart("month",[Date]),1),"asc",[Name],[Item],DatePart("year",[Date]),"ties.method=minimum") as [RowRank]
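For readers more comfortable in pandas than in Spotfire, here is a rough sketch of the same idea as the first approach: count the rows per Name / Item / Year and flag any group with more than one row as monthly. The column names follow the sample table above, and the small frame built here is only illustrative, not your real data.

import pandas as pd

# a few illustrative rows from the sample table above
df = pd.DataFrame({
    'Item': ['Class 1', 'Class 1', 'Class 1', 'Class 1'],
    'Date': ['12/31/2011', '12/31/2012', '01/31/2012', '02/28/2012'],
    'Name': ['David', 'David', 'Shelly', 'Shelly'],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Year'] = df['Date'].dt.year

# more than one row for a Name / Item / Year means the data is monthly
counts = df.groupby(['Name', 'Item', 'Year'])['Date'].transform('count')
df['Flag'] = (counts > 1).map({True: 'Yes', False: 'No'})
print(df)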
I have a pandas dataframe, shown below, where the info column holds the same value for every row of a given id:
id text info
1 great boy,police
1 excellent boy,police
2 nice girl,mother,teacher
2 good girl,mother,teacher
2 bad girl,mother,teacher
3 awesome grandmother
4 superb grandson
I want to spread the list elements across the rows of each id, like this:
id text info
1 great boy
1 excellent police
2 nice girl
2 good mother
2 bad teacher
3 awesome grandmother
4 superb grandson
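For anyone who wants to run the answers below, the sample frame can be rebuilt like this (a minimal sketch using the data shown above):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 4],
    'text': ['great', 'excellent', 'nice', 'good', 'bad', 'awesome', 'superb'],
    'info': ['boy,police', 'boy,police',
             'girl,mother,teacher', 'girl,mother,teacher', 'girl,mother,teacher',
             'grandmother', 'grandson'],
})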
Let us try
df['new'] = df.loc[~df.id.duplicated(),'info'].str.split(',').explode().values
df
id text info new
0 1 great boy,police boy
1 1 excellent boy,police police
2 2 nice girl,mother,teacher girl
3 2 good girl,mother,teacher mother
4 2 bad girl,mother,teacher teacher
5 3 awesome grandmother grandmother
6 4 superb grandson grandson
Take advantage of the fact that 'info' is duplicated.
df['info'] = df['info'].drop_duplicates().str.split(',').explode().to_numpy()
Output:
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
One way using pandas.DataFrame.groupby.transform.
Note that this assumes:
each value in info splits (on ',') into exactly as many elements as there are rows for that id
the value of info is identical within each id.
df["info"] = df.groupby("id")["info"].transform(lambda x: x.str.split(",").iloc[0])
print(df)
Output:
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
Create a temp variable with the cumulative row position (0-based) within each info group:
temp = df.groupby('info').cumcount()
Then use a list comprehension, indexing each split info string at that position:
df['info'] = [ent.split(',')[pos] for ent, pos in zip(df['info'], temp)]
df
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
Or try apply:
df['info'] = pd.DataFrame({'info': df['info'].str.split(','), 'n': df.groupby('id').cumcount()}).apply(lambda x: x['info'][x['n']], axis=1)
Output:
>>> df
id text info
0 1 great boy
1 1 excellent police
2 2 nice girl
3 2 good mother
4 2 bad teacher
5 3 awesome grandmother
6 4 superb grandson
The sample DataFrame df has 3 columns that identify a given person: name, nick_name and initials. They can differ slightly in how they are written, but looking at the three columns together it is possible to overcome these differences, separate out all the rows for a given person, and normalize the 3 columns to a single value per person.
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':range(9), 'name':['Theodore', 'Thomas', 'Theodore', 'Christian', 'Theodore', 'Theodore R', 'Thomas', 'Tomas', 'Cristian'], 'nick_name':['Tedy', 'Tom', 'Ted', 'Chris', 'Ted', 'Ted', 'Tommy', 'Tom', 'Chris'], 'initials':['TR', 'Tb', 'TRo', 'CS', 'TR', 'TR', 'tb', 'TB', 'CS']})
>>> df
ID name nick_name initials
0 0 Theodore Tedy TR
1 1 Thomas Tom Tb
2 2 Theodore Ted TRo
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore R Ted TR
6 6 Thomas Tommy tb
7 7 Tomas Tom TB
8 8 Cristian Chris CS
In this case the desired output is as follows:
ID name nick_name initials
0 0 Theodore Ted TR
1 1 Thomas Tom TB
2 2 Theodore Ted TR
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore Ted TR
6 6 Thomas Tom TB
7 7 Thomas Tom TB
8 8 Christian Chris CS
The common value can be anything as long as it is normalized to the same value; for example, either Theodore or Theodore R is fine for name.
My actual DataFrame has about 4000 rows. Could someone suggest an efficient algorithm for this?
You'll want to use Levenshtein distance to identify similar strings. A good Python package for this is fuzzywuzzy. Below I use a basic dictionary approach to collect similar rows together, then overwrite each cluster with a designated master row. Note that this leaves a CSV with many duplicate rows; I don't know if that is what you want, but if not, it is easy enough to take the duplicates out (see the short sketch after the script).
import pandas as pd
from itertools import chain
from fuzzywuzzy import fuzz

def cluster_rows(df):
    # group rows whose name is within the fuzz threshold of an existing cluster key
    row_clusters = {}
    threshold = 90
    name_rows = list(df.iterrows())
    for i, nr in name_rows:
        name = nr['name']
        new_cluster = True
        for other in row_clusters.keys():
            if fuzz.ratio(name, other) >= threshold:
                row_clusters[other].append(nr)
                new_cluster = False
        if new_cluster:
            row_clusters[name] = [nr]
    return row_clusters

def normalize_rows(row_clusters):
    # copy the first (master) row's values onto every other row in the cluster
    for name in row_clusters:
        master = row_clusters[name][0]
        for row in row_clusters[name][1:]:
            for key in row.keys():
                row[key] = master[key]
    return row_clusters

if __name__ == '__main__':
    df = pd.read_csv('names.csv')
    rc = cluster_rows(df)
    normalized = normalize_rows(rc)
    pd.DataFrame(chain(*normalized.values())).to_csv('norm-names.csv')
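As noted above, normalize_rows copies the master row onto every other row in a cluster, so the rows within a cluster end up identical. If you do want the duplicates removed, the last line of the script could be replaced with something like this (a sketch, same columns as names.csv):

# rows within a cluster are identical after normalization, so a plain
# drop_duplicates collapses each cluster to its master row
norm = pd.DataFrame(chain(*normalized.values()))
norm.drop_duplicates().to_csv('norm-names.csv', index=False)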
I have a quick question. I have a table with the players' names and the percentage of matches won:
Rank Country Name Matches Won %
1 ESP Rafael Nadal 89.06%
2 SRB Novak Djokovic 83.82%
3 SUI Roger Federer 83.61%
4 RUS Daniil Medvedev 73.75%
5 AUT Dominic Thiem 72.73%
6 GRE Stefanos Tsitsipas 67.95%
7 JPN Kei Nishikori 67.44%
and I have another table like this, with the ace percentage:
Rank Country Name Ace %
1 USA John Isner 26.97%
2 CRO Ivo Karlovic 25.47%
3 USA Reilly Opelka 24.81%
4 CAN Milos Raonic 24.63%
5 USA Sam Querrey 20.75%
6 AUS Nick Kyrgios 20.73%
7 RSA Kevin Anderson 17.82%
8 KAZ Alexander Bublik 17.06%
9 FRA Jo Wilfried Tsonga 14.29%
...
85 ESP RAFAEL NADAL 6.85%
My question is: can I align my two tables so that, for example, my data is based on matches won? So I would have, for example:
Rank Country Name Matches % Aces %
1 ESP RAFAEL NADAL 89.06% 6.85%
Like this for all the players.
I agree with the comment above that it would be easiest to import both and then use XLOOKUP() to add the Aces % column to the first set of data. If you import the first data set to Sheet1 and the second data set to Sheet2, with the player name in column C of both (the ranks differ between the two tables, so the name is the key to match on), your XLOOKUP() in Sheet1 column E would look something like:
=XLOOKUP(C2, Sheet2!C:C, Sheet2!D:D)
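If you happen to be working in Python rather than Excel, the same join can be sketched with pandas by merging on the player name. The tiny frames and column names below are only assumptions based on the tables shown above:

import pandas as pd

matches = pd.DataFrame({'Rank': [1], 'Country': ['ESP'],
                        'Name': ['Rafael Nadal'], 'Matches Won %': ['89.06%']})
aces = pd.DataFrame({'Rank': [85], 'Country': ['ESP'],
                     'Name': ['RAFAEL NADAL'], 'Ace %': ['6.85%']})

# normalise the case of the names before joining, since the two tables differ
matches['key'] = matches['Name'].str.lower()
aces['key'] = aces['Name'].str.lower()

combined = matches.merge(aces[['key', 'Ace %']], on='key', how='left').drop(columns='key')
print(combined)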
Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-4', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-04 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to calculate each person's last visit interval in days; in this example, the outcome should be
last_visited_interval name
0 1 John
1 2 Mary
Since '2015-3-5' and '2015-3-6' have an interval of 1, and '2016-3-6' and '2016-3-8' have an interval of 2.
I tried
df.groupby('name').agg(last_visited_interval=('visited',lambda x: x.diff().dt.days.last())),
but got the exception of
last() missing 1 required positional argument: 'offset'
How should I do it?
If you check Series.last, it works differently: it selects the final periods of a time series based on a date offset (hence the missing 'offset' argument error). It is also not GroupBy.last, because inside the lambda you are working with a Series. So you can use Series.iloc or Series.iat instead:
df.groupby('name').agg(last_visited_interval=('visited',lambda x:x.diff().dt.days.iat[-1]))
last_visited_interval
name
John 1.0
Mary 2.0
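An alternative that avoids the lambda entirely (a sketch using the same column names) is to compute the day differences first and then take the last one per group with GroupBy.last():

# day difference between consecutive visits within each name
df['interval'] = df.groupby('name')['visited'].diff().dt.days

# last interval per person
out = df.groupby('name')['interval'].last().rename('last_visited_interval').reset_index()
print(out)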
I have a dataset in Excel that I need to transpose. It is survey data: the first column is the month of the survey, the second is unique to each company, the third is a sector code for that company (which can change over time), the fourth is a Size variable, and then there are Question number and Answer columns. I want to be able to build pivot tables on this, but as I understand it I need each question in its own column to cross-tabulate in the pivot table, e.g. what companies answered on question 2 depending on their answer to question 1. How can I transpose the data?
From this
Period Company Sector Size Question Answer
201601 101 Cons Small 1 2
201601 101 Cons Small 2 1
201601 101 Cons Small 3 2
201601 102 Int Small 1 3
201601 102 Int Small 2 1
201601 102 Int Small 3 1
201602 101 Cons Small 1 3
201602 101 Cons Small 2 2
201602 101 Cons Small 3 1
201602 102 Int Small 1 3
201602 102 Int Small 2 1
201602 102 Int Small 3 2
To this
Period Company Sector Size Question1 Question2 Question3
201601 101 Cons Small 2 1 2
201601 102 Int Small 3 1 1
201602 101 Cons Small 3 2 1
201602 102 Int Small 3 1 2
There can be up to about 30 questions in one file, about 1500-2000 companies and in my first files I will have 4 months. The companies are grouped on at most 5 sectors and two different sizes.
Thanks to a comment from Doug Glancy I could figure out how to do this.
Create a Pivot Table with all columns in Row Labels except for Question and Answer. Then put Question in Column Labels and Answer in Values. Choose to sum the values.
To get the format correct, in the Pivot Table Tools - Design menu, choose Subtotals - Do Not Show Subtotals. Copy the resulting table into a new workbook without the totals column and row.
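If the same data ever ends up in Python instead of Excel, the equivalent reshape can be sketched with pandas.pivot_table; the column names below follow the example above, and aggfunc='first' keeps the single answer per cell:

import pandas as pd

# a few rows of the long survey data shown above
long = pd.DataFrame({
    'Period': [201601] * 6,
    'Company': [101, 101, 101, 102, 102, 102],
    'Sector': ['Cons', 'Cons', 'Cons', 'Int', 'Int', 'Int'],
    'Size': ['Small'] * 6,
    'Question': [1, 2, 3, 1, 2, 3],
    'Answer': [2, 1, 2, 3, 1, 1],
})

# one column per question, one row per Period / Company / Sector / Size
wide = long.pivot_table(index=['Period', 'Company', 'Sector', 'Size'],
                        columns='Question', values='Answer', aggfunc='first')
wide = wide.add_prefix('Question').reset_index()
print(wide)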