Why isn't the format of each dataframe changing by date? - python-3.x

I have an ordered dict:
OrderedDict([('Sheet1', name newdate
0 rob 3-2020
1 will 2-2020
2 john 1-2020), ('Sheet2', name newdate
0 william 1-2020
1 tim 2-2020
2 james 3-2020), ('Sheet3', name newdate
0 eric 5-2020
1 jim 4-2020
2 evan 6-2020)])
I try to run this code in order to change the date column to date format and to get the order of the dataframes from earliest to latest:
for sheet, df in company_dict.items():
df['newdate'] = pd.to_datetime(df['newdate'])
df = df.sort_values(by="newdate")
I get:
OrderedDict([('Sheet1', name newdate
0 rob 2020-03-01
1 will 2020-02-01
2 john 2020-01-01), ('Sheet2', name newdate
0 william 2020-01-01
1 tim 2020-02-01
2 james 2020-03-01), ('Sheet3', name newdate
0 eric 2020-05-01
1 jim 2020-04-01
2 evan 2020-06-01)])
The dates are now in date format, but the row order in each df didn't change.
I'm looking for it to look like:
OrderedDict([('Sheet1', name newdate
0 john 2020-01-01
1 will 2020-02-01
2 rob 2020-03-01), ('Sheet2', name newdate
0 william 2020-01-01
1 tim 2020-02-01
2 james 2020-03-01), ('Sheet3', name newdate
0 jim 2020-04-01
1 eric 2020-05-01
2 evan 2020-06-01)])
any ideas?

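Write the sorted result back into the dictionary. Inside the loop, df['newdate'] = ... modifies the DataFrame stored in the dictionary, which is why the dates changed, but df = df.sort_values(...) only rebinds the local name df to a new, sorted DataFrame and leaves the dictionary entry untouched: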
for sheet, _ in company_dict.items():
company_dict[sheet]['newdate'] = pd.to_datetime(company_dict[sheet]['newdate'])
company_dict[sheet] = company_dict[sheet].sort_values(by="newdate")
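A minimal, self-contained sketch of the same idea. The company_dict built here is a hypothetical stand-in for the one in the question, the explicit format="%m-%Y" assumes the dates are month-year strings, and reset_index(drop=True) is an addition so the index runs 0, 1, 2 as in the desired output:

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical stand-in for the company_dict from the question.
company_dict = OrderedDict([
    ("Sheet1", pd.DataFrame({"name": ["rob", "will", "john"],
                             "newdate": ["3-2020", "2-2020", "1-2020"]})),
    ("Sheet3", pd.DataFrame({"name": ["eric", "jim", "evan"],
                             "newdate": ["5-2020", "4-2020", "6-2020"]})),
])

for sheet, df in company_dict.items():
    # Column assignment mutates the DataFrame object held by the dict.
    df["newdate"] = pd.to_datetime(df["newdate"], format="%m-%Y")
    # sort_values returns a NEW DataFrame, so store it back under the same key.
    company_dict[sheet] = df.sort_values(by="newdate").reset_index(drop=True)

print(company_dict["Sheet1"])
```

Replacing the value of an existing key while iterating over items() is safe because the set of keys does not change.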

Related

pandas: groupby + store in another dataframe

I asked a similar question last week and now I have a similar issue, but I cannot adapt the answer I received to this case.
Basically, I have a dataframe called comms which looks like this:
articleID Material commentScore
1234 News 0.75
1234 News -0.1
5678 Sport 1.33
5678 News 0.75
5678 Fashion 0.02
7412 Politics -3.45
and another dataframe called arts and it looks like this:
articleID wordCount byLine
1234 1524 John
5678 9824 Mary
7412 3713 Sam
I would like to simply count how many comms there are for each articleID, and store this number in a new column of the arts dataframe named commentNumber.
I think I have to use groupby, count() and maybe merge, but I can't figure out how.
Expected output
articleID wordCount byLine commentNumber
1234 1524 John 2
5678 9824 Mary 3
7412 3713 Sam 1
Thanks in advance!
Andrea
Use groupby() and then count() on one column. Finally, map the result onto the articleID column of arts.
arts['commentNumber'] = arts['articleID'].map(comms.groupby('articleID')['Material'].count())
print(arts)
articleID wordCount byLine commentNumber
0 1234 1524 John 2
1 5678 9824 Mary 3
2 7412 3713 Sam 1
Use Series.map with Series.value_counts:
arts['commentNumber'] = arts['articleID'].map(comms['articleID'].value_counts())
print (arts)
articleID wordCount byLine commentNumber
0 1234 1524 John 2
1 5678 9824 Mary 3
2 7412 3713 Sam 1
Alternative:
from collections import Counter
arts['commentNumber'] = arts['articleID'].map(Counter(comms['articleID']))
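Since the question mentions merge, here is a hedged sketch of that route as well. The article 9999 is made up for illustration, to show what happens to an article with no comments; the fillna(0) step is an assumption about the desired behavior in that case:

```python
import pandas as pd

comms = pd.DataFrame({
    "articleID": [1234, 1234, 5678, 5678, 5678, 7412],
    "Material": ["News", "News", "Sport", "News", "Fashion", "Politics"],
    "commentScore": [0.75, -0.1, 1.33, 0.75, 0.02, -3.45],
})
arts = pd.DataFrame({
    "articleID": [1234, 5678, 7412, 9999],  # 9999 is a made-up article without comments
    "wordCount": [1524, 9824, 3713, 100],
    "byLine": ["John", "Mary", "Sam", "Alex"],
})

# size() counts rows per group, including rows that contain missing values.
counts = comms.groupby("articleID").size().rename("commentNumber").reset_index()

# A left merge keeps every article; articles without comments get NaN, filled with 0 here.
arts = arts.merge(counts, on="articleID", how="left")
arts["commentNumber"] = arts["commentNumber"].fillna(0).astype(int)
print(arts)
```

With the map-based answers, an article without comments would get NaN from groupby or value_counts, so the same fillna(0) step may be useful there too.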

Code to detect Sunday to Saturday date windows and modify Dataframe

I'm trying to write code that takes a table of date windows and modifies them to fit a Sun-Sat template.
I have the data saved as follows:
Index Name: From: To:
1 Joe Doe 6/1/2020 6/8/2020
2 Joe Doe 6/14/2020 6/23/2020
3 Brandon Smith 5/9/2020 5/20/2020
4 Brandon Smith 5/26/2020 5/28/2020
5 Brandon Smith 5/12/2020 5/24/2020
6 Brandon Smith 5/26/2020 5/31/2020
7 Sarah Roberts 6/3/2020 6/25/2020
8 Sarah Roberts 6/15/2020 6/23/2020
I would like to create new From: and To: columns that capture only the windows of 7, 14, 21... days running from a Sunday to a Saturday.
For example: index 1 would not apply, index 2 would be transformed to run from the 14th to the 20th, and so forth.
The resulting table that I was hoping to get would look like this:
Index Name: From: To: From_new: To_new
1 Joe Doe 6/1/2020 6/8/2020 NA NA
2 Joe Doe 6/14/2020 6/23/2020 6/14/2020 6/20/2020
3 Brandon Smith 5/9/2020 5/20/2020 5/10/2020 5/16/2020
4 Brandon Smith 5/26/2020 5/28/2020 NA NA
5 Brandon Smith 5/12/2020 5/24/2020 5/17/2020 5/23/2020
6 Brandon Smith 5/26/2020 5/31/2020 NA NA
7 Sarah Roberts 6/3/2020 6/25/2020 6/7/2020 6/20/2020
8 Sarah Roberts 6/15/2020 6/23/2020 NA NA
I've tried to loop through each record and look at the start week day, if it's Sunday then run to the next Saturday, but then I get confused if it runs for another whole week after that, or if it's not Sunday to begin with.
Thanks in advance.
You don't need a loop. The solution was in this SO post. All credit should go to @ifly6. :)
Having said that, this should work for you:
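# Assumes 'From:' and 'To:' are already datetime columns (e.g. converted with pd.to_datetime).
# Roll each start date forward to the next Sunday (weekday 6), keeping dates that are already Sundays.
df['From_new'] = df['From:'] + pd.offsets.Week(weekday=6)
df.loc[df['From:'].dt.weekday == 6, 'From_new'] = df.loc[df['From:'].dt.weekday == 6, 'From:']
# Roll each end date back to the previous Saturday (weekday 5), keeping dates that are already Saturdays.
df['To_new'] = df['To:'] - pd.offsets.Week(weekday=5)
df.loc[df['To:'].dt.weekday == 5, 'To_new'] = df.loc[df['To:'].dt.weekday == 5, 'To:']
# Blank out rows where no full Sunday-to-Saturday week fits inside the window.
df.loc[df['To_new'] < df['From_new'], 'From_new'] = pd.NaT
df.loc[df['From_new'].isna(), 'To_new'] = pd.NaT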
Output:
Index Name: From: To: From_new To_new
1 Joe Doe 2020-06-01 2020-06-08 NaT NaT
2 Joe Doe 2020-06-14 2020-06-23 2020-06-14 2020-06-20
3 Brandon Smith 2020-05-09 2020-05-20 2020-05-10 2020-05-16
4 Brandon Smith 2020-05-26 2020-05-28 NaT NaT
5 Brandon Smith 2020-05-12 2020-05-24 2020-05-17 2020-05-23
6 Brandon Smith 2020-05-26 2020-05-31 NaT NaT
7 Sarah Roberts 2020-06-03 2020-06-25 2020-06-07 2020-06-20
8 Sarah Roberts 2020-06-15 2020-06-23 NaT NaT
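As a cross-check, the same snapping can be sketched with plain weekday arithmetic instead of pd.offsets.Week. This is an alternative formulation, not the method from the linked post; it uses a subset of the rows from the question and assumes the dates are month/day/year strings:

```python
import pandas as pd

df = pd.DataFrame({
    "Name:": ["Joe Doe", "Joe Doe", "Brandon Smith", "Sarah Roberts"],
    "From:": ["6/1/2020", "6/14/2020", "5/9/2020", "6/3/2020"],
    "To:": ["6/8/2020", "6/23/2020", "5/20/2020", "6/25/2020"],
})
df["From:"] = pd.to_datetime(df["From:"], format="%m/%d/%Y")
df["To:"] = pd.to_datetime(df["To:"], format="%m/%d/%Y")

# pandas weekdays: Monday=0 ... Saturday=5, Sunday=6.
# Days to move forward to reach a Sunday (0 if the date is already a Sunday).
days_to_sunday = (6 - df["From:"].dt.weekday) % 7
# Days to move back to reach a Saturday (0 if the date is already a Saturday).
days_since_saturday = (df["To:"].dt.weekday - 5) % 7

df["From_new"] = df["From:"] + pd.to_timedelta(days_to_sunday, unit="D")
df["To_new"] = df["To:"] - pd.to_timedelta(days_since_saturday, unit="D")

# If the snapped end falls before the snapped start, no full week fits: use NaT.
valid = df["To_new"] >= df["From_new"]
df["From_new"] = df["From_new"].where(valid)
df["To_new"] = df["To_new"].where(valid)
print(df)
```

Because the start is always a Sunday and the end always a Saturday, any surviving window automatically spans a whole number of weeks.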

Separate a name into first and last name using Pandas

I have a DataFrame that looks like this:
name birth
John Henry Smith 1980
Hannah Gonzalez 1900
Michael Thomas Ford 1950
Michelle Lee 1984
And I want to create two new columns, "middle" and "last" for the middle and last names of each person, respectively. People who have no middle name should have None in that data frame.
This would be my ideal result:
name middle last birth
John Henry Smith 1980
Hannah None Gonzalez 1900
Michael Thomas Ford 1950
Michelle None Lee 1984
I have tried different approaches, such as this:
df['middle'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 2 else None)
df['last'] = df['name'].map(lambda x: x.split(" ")[1] if x.count(" ")== 1 else x.split(" ")[2])
I even made some functions that try to do the same thing more carefully, but I always get the same error: "List Index out of range". This is weird because if I go about printing df.iloc[i,0].split(" ") for i in range(len(df)), I do get lists with length 2 or length 3 only.
I also printed x.count(" ") for all x in the "name" column and I always got either 1 or 2 as a result. There are no single names.
This is my first question so thank you so much!
Use Series.str.split with expand = True.
df2 = (df['name'].str
.split(' ',expand = True)
.rename(columns = {0:'name',1:'middle',2:'last'}))
new_df = df2.assign(middle = df2['middle'].where(df2['last'].notnull()),
last = df2['last'].fillna(df2['middle']),
birth = df['birth'])
print(new_df)
name middle last birth
0 John Henry Smith 1980
1 Hannah NaN Gonzalez 1900
2 Michael Thomas Ford 1950
3 Michelle NaN Lee 1984
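An alternative sketch that avoids fixed positions for the last name: split each name into a list and take the last element, whatever the list length. It assumes every name has either two or three parts, as described in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John Henry Smith", "Hannah Gonzalez", "Michael Thomas Ford", "Michelle Lee"],
    "birth": [1980, 1900, 1950, 1984],
})

# Split on any run of whitespace; each cell becomes a list of name parts.
parts = df["name"].str.split()

# Second part is a middle name only when there are exactly three parts; otherwise NaN.
df["middle"] = parts.str[1].where(parts.str.len() == 3)
# Index -1 always picks the final part, so two-part names cannot go out of range.
df["last"] = parts.str[-1]
df["name"] = parts.str[0]

print(df[["name", "middle", "last", "birth"]])
```

Calling str.split() without an explicit separator also tolerates repeated or trailing spaces, which split(" ") would turn into empty strings.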

manipulating pandas dataframe - conditional

I have a pandas dataframe that looks like this:
ID Date Event_Type
1 01/01/2019 A
1 01/01/2019 B
2 02/01/2019 A
3 02/01/2019 A
I want to be left with:
ID Date
1 01/01/2019
2 02/01/2019
3 02/01/2019
Where my condition is:
If the ID is the same AND the dates are within 2 days of each other then drop one of the rows.
If however the dates are more than 2 days apart then keep both rows.
How do I do this?
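I believe you first need to convert the values to datetimes with to_datetime, then compute the difference between consecutive dates within each ID using groupby and diff. Keep the first row of each group (where the difference is null, tested with isnull()) together with the rows whose difference is greater than the timedelta threshold: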
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
s = df.groupby('ID')['Date'].diff()
df = df[(s.isnull() | (s > pd.Timedelta(2, 'd')))]
print (df)
ID Date Event_Type
0 1 2019-01-01 A
2 2 2019-01-02 A
3 3 2019-01-02 A
Check solution with another data:
print (df)
ID Date Event_Type
0 1 01/01/2019 A
1 1 04/01/2019 B <-difference 3 days
2 2 02/01/2019 A
3 3 02/01/2019 A
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
s = df.groupby('ID')['Date'].diff()
df = df[(s.isnull() | (s > pd.Timedelta(2, 'd')))]
print (df)
ID Date Event_Type
0 1 2019-01-01 A
1 1 2019-01-04 B
2 2 2019-01-02 A
3 3 2019-01-02 A
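A self-contained sketch of the same filter on unsorted input. The sample rows are made up for illustration; the explicit sort is an addition, on the assumption that the real data may not already be ordered by date within each ID:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Date": ["06/01/2019", "01/01/2019", "02/01/2019", "02/01/2019"],
    "Event_Type": ["B", "A", "C", "A"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# diff() compares each row with the previous row of its group,
# so order the rows by ID and Date first.
df = df.sort_values(["ID", "Date"])
gap = df.groupby("ID")["Date"].diff()

# Keep the first row of each ID (gap is NaT) and rows more than 2 days after the previous row.
kept = df[gap.isna() | (gap > pd.Timedelta(days=2))]
print(kept)
```

Note that each row is compared with the row immediately before it, not with the last row that was kept, so a long chain of closely spaced dates collapses to its first row.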

Cannot convert object to date after groupby

I have been successful with converting while working with a different dataset a couple days ago. However, I cannot apply the same technique to my current dataset. The set looks as:
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
Datez Volumez
0 2016-09-19 6.300000e+07
1 2016-09-20 3.382694e+07
2 2016-09-26 4.000000e+05
3 2016-09-27 4.900000e+09
4 2016-09-28 5.324995e+08
totalHist.dtypes
Datez object
Volumez float64
dtype: object
This used to do the trick:
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'], format='%d-%m-%Y')
totalHist.dtypes
which now is giving me:
KeyError: 'Datez'
During handling of the above exception, another exception occurred:
How can I fix this? I am doing this groupby before trying:
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
totalHist.head()
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
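You can just use .rename() to rename your columns. Assigning into totalHist.columns.values edits the underlying array in place, which pandas does not support: the new labels may print correctly while lookups such as totalHist['Datez'] still raise a KeyError.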
Generate some data (in same format as OP)
d = ['1/1/2018','1/2/2018','1/3/2018',
'1/3/2018','1/4/2018','1/2/2018','1/1/2018','1/5/2018']
df = pd.DataFrame(d, columns=['Date'])
df['Trading_Value'] = [1000,1005,1001,1001,1002,1009,1010,1002]
print(df)
Date Trading_Value
0 1/1/2018 1000
1 1/2/2018 1005
2 1/3/2018 1001
3 1/3/2018 1001
4 1/4/2018 1002
5 1/2/2018 1009
6 1/1/2018 1010
7 1/5/2018 1002
GROUP BY
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
print(totalHist.head())
Date Trading_Value
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Rename columns
totalHist.rename(columns={'Date':'Datez','Trading_Value':'Volumez'}, inplace=True)
print(totalHist)
Datez Volumez
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Finally, convert to datetime
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'])
print(totalHist.dtypes)
Datez datetime64[ns]
Volumez int64
dtype: object
This was done with Python 3.6.7 and pandas 0.23.4.
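On newer pandas versions (0.25 or later, which is an assumption beyond the versions tested above), the aggregation and both renames can be sketched in a single chain using named aggregation. The total_hist name and the month/day/year format are choices made for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1/1/2018", "1/2/2018", "1/3/2018", "1/3/2018",
             "1/4/2018", "1/2/2018", "1/1/2018", "1/5/2018"],
    "Trading_Value": [1000, 1005, 1001, 1001, 1002, 1009, 1010, 1002],
})

total_hist = (
    df.groupby("Date", as_index=False)
      .agg(Volumez=("Trading_Value", "sum"))  # named aggregation sets the output column name
      .rename(columns={"Date": "Datez"})
)

# Assumes the strings are month/day/year, which is a guess about the sample data.
total_hist["Datez"] = pd.to_datetime(total_hist["Datez"], format="%m/%d/%Y")
print(total_hist.dtypes)
```

Passing an explicit format avoids any ambiguity between day-first and month-first strings such as 1/2/2018.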
