How to put a number of datetime values into a subset of a dataframe while keeping the data type? - python-3.x

I have a dataframe with names as the index and a column of birth dates, e.g.
> df_birthdate
date
Paul 2009-03-07
Peter 2000-06-23
Pauline 2001-03-03
Paula 2002-02-17
> type(df_birthdate.date[0])
pandas._libs.tslibs.timestamps.Timestamp
> df_huge = pd.DataFrame({'School': ['A','A','A','A','B','B','B','B']})
> df_huge['new_date'] = ''
> idx_t = df_huge.School == 'A'
I also have a huge dataframe called df_huge into which I want to put these dates. I know that the order won't change.
df_huge.loc[idx_t, "new_date"] = df_birthdate.values
The above code works for me in most cases; however, when the 'date' column is in datetime format, applying .values means the data I put into df_huge is no longer in datetime format. Any suggestion for putting 'date' from df_birthdate into a specific location of df_huge? Many thanks.

You can omit df_huge['new_date'] = '', which assigns empty strings to the column:
idx_t = df_huge.School == 'A'
df_huge.loc[idx_t, "new_date"] = df_birthdate.to_numpy()
print (df_huge)
School new_date
0 A 2009-03-07
1 A 2000-06-23
2 A 2001-03-03
3 A 2002-02-17
4 B NaT
5 B NaT
6 B NaT
7 B NaT
print (df_huge.dtypes)
School object
new_date datetime64[ns]
dtype: object
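For reference, a minimal self-contained sketch of the whole flow (re-creating the two dataframes from the question) shows the point: assigning the datetime values into a fresh column, rather than one pre-filled with empty strings, keeps the datetime64[ns] dtype:
import pandas as pd

df_birthdate = pd.DataFrame(
    {'date': pd.to_datetime(['2009-03-07', '2000-06-23', '2001-03-03', '2002-02-17'])},
    index=['Paul', 'Peter', 'Pauline', 'Paula'])
df_huge = pd.DataFrame({'School': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})

idx_t = df_huge.School == 'A'
df_huge.loc[idx_t, 'new_date'] = df_birthdate['date'].to_numpy()  # no string placeholder column first

print(df_huge.dtypes)  # new_date should come out as datetime64[ns]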

Related

How to convert data frame for time series analysis in Python?

I have a dataset of around 13,000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform time series analysis where the x axis would be the date (each day) and the y axis would be the frequency of text on the corresponding date.
I think that if I create a new data frame with the unique dates and the number of texts on each corresponding date, that would solve my problem.
How can I create a new column with the frequency of text for each day?
Thanks in Advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field on each day, independently of the value of the text field.
Or, count the number of occurrences of each unique value of the text field on each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date']) #convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one:
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
For option two:
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function will create one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in this row. We can then simply make a sum aggregation with a groupby by the Date field.
If you are not familiar with groupby and aggregation operations, I recommend that you read this guide first.
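If you literally want the frequency as a new column on each original row (as the question phrases it), a groupby/transform sketch like the one below should also work; it assumes you start again from the small df defined above, before the groupby steps:
df['daily_count'] = df.groupby('Date')['Text'].transform('count')
print(df)
        Date Text  daily_count
0 2018-01-01    A            3
1 2018-01-01    B            3
2 2018-01-01    C            3
3 2018-01-02    A            1
4 2018-01-03    A            1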

pandas to_datetime() function is not converting the date 08-12-1600 in a dataframe

raw_data = {'Event': ['A','B','C','D', 'E'],
'dates': ['08-12-1600','26-09-1400', '04-11-1991','25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns = ['Event', 'dates'])
df_1['dates'] = pd.to_datetime(df_1['dates'])
The above code gives an error because of the date 08-12-1600; if that date is removed, it works fine. What could be the reason for this?
The error is:
Out of bounds nanosecond timestamp: 1600-08-12 00:00:00
That is because the provided dates are outside the range of Timestamp.
pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
Details here
If we need the dates even when they are out of range
Then we can convert them to Period using the code below:
raw_data = {'Event': ['A','B','C','D', 'E'],
'dates': ['08-12-1600','26-09-1400', '04-11-1991','25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns = ['Event', 'dates'])
def conv(x):
    day, month, year = tuple(x.split('-'))
    return pd.Period(year=int(year), month=int(month), day=int(day), freq="D")
df_1['dates'] = df_1.dates.apply(conv)
df_1
Output
Event dates
0 A 1600-12-08
1 B 1400-09-26
2 C 1991-11-04
3 D 1991-03-25
4 E 1991-05-10
If we can ignore dates outside the range
df_1['dates'] = pd.to_datetime(df_1.dates, errors='coerce')
df_1
Output
Event dates
0 A NaT
1 B NaT
2 C 1991-04-11
3 D 1991-03-25
4 E 1991-10-05
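Note that the coerced output above parses the dd-mm-yyyy strings month-first where it can (for example 04-11-1991 becomes 1991-04-11). If the strings are meant to be read day-first, a sketch using to_datetime's dayfirst flag keeps the day/month order while still coercing the out-of-range dates to NaT:
df_1['dates'] = pd.to_datetime(df_1.dates, dayfirst=True, errors='coerce')
df_1
Output
Event dates
0 A NaT
1 B NaT
2 C 1991-11-04
3 D 1991-03-25
4 E 1991-05-10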
Bonus Fact
Why can a Timestamp only hold values spanning roughly 584 years (1677-2262)?
Since Timestamps provide nanosecond precision and are stored in a 64-bit integer, they can only cover about 584 years at nanosecond resolution within 64-bit integer space.
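As a rough back-of-the-envelope check (a sketch in plain Python arithmetic):
ns_per_year = 1e9 * 60 * 60 * 24 * 365.25   # nanoseconds in an average year
span_years = 2 ** 63 / ns_per_year           # signed 64-bit range on one side of the epoch
print(span_years)                            # ~292 years either side of 1970, i.e. roughly 1677-2262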

get rows by date regardless of format of date in pandas

I have data as follows:
Col1,ColDate
a,2020-09-11 08:43:00
b,2020-09-12 09:43:00
c,13-09-2020 09:43:00
d,09/16/2020 10:43:00
e,09/19/2020 12:43:00
f,09/12/2020 15:43:00
The intention is to get all rows between 11th Sep and 13th Sep, regardless of the format. In pandas I am trying the following:
df[df["ColDate"].between('11-09-2020','13-09-2020')]
I get an empty dataframe.
You can try this,
df[pd.to_datetime(df['ColDate']).dt.strftime('%d-%m-%Y').between('11-09-2020','13-09-2020')]
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
2 c 13-09-2020 09:43:00
5 f 09/12/2020 15:43:00
but it's really hard to say which part will be treated as the month and which as the day, because the date formats are mixed.
Please check the snippet. You can first convert ColDate with pd.to_datetime and then apply a mask over it like this:
df['ColDate'] = pd.to_datetime(df['ColDate'])
mask = (df['ColDate'] > '2020-09-11') & (df['ColDate'] <='2020-09-13')
df = df.loc[mask]
Output
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
5 f 2020-09-12 15:43:00
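Note that the mask above cuts off at midnight on 2020-09-13, so a row later in the day on the 13th (like 13-09-2020 09:43:00) is excluded. If whole days should be inclusive, one option (a sketch, assuming ColDate has been parsed with pd.to_datetime as in the snippet above) is to normalize the timestamps to midnight before comparing:
mask = df['ColDate'].dt.normalize().between('2020-09-11', '2020-09-13')
df = df.loc[mask]
With this, the row for the 13th is kept as well.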

To extract distinct values for all categorical columns in dataframe

I have a situation where I need to print the number of distinct values for each of the categorical columns in my data frame.
The dataframe looks like this:
Gender Function Segment
M IT LE
F IT LM
M HR LE
F HR LM
The output should give me the following:
Variable_Name Distinct_Count
Gender 2
Function 2
Segment 2
How to achieve this?
Use nunique, then pass the resulting Series into a new dataframe and set the column names.
df_unique = df.nunique().to_frame().reset_index()
df_unique.columns = ['Variable','DistinctCount']
print(df_unique)
Variable DistinctCount
0 Gender 2
1 Function 2
2 Segment 2
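If the exact column names from the expected output matter, a slightly shorter variant (a sketch using rename_axis together with Series.reset_index's name argument) should give the same result:
df_unique = df.nunique().rename_axis('Variable_Name').reset_index(name='Distinct_Count')
print(df_unique)
  Variable_Name  Distinct_Count
0        Gender               2
1      Function               2
2       Segment               2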
This is not elegant, but it produces the expected output:
new_data = {'Variable_Name':[],'Distinct_Count':[]}
for i in list(df):
    new_data['Variable_Name'].append(i)
    new_data['Distinct_Count'].append(df[i].nunique())
new_df = pd.DataFrame(new_data)
print(new_df)
Output:
Variable_Name Distinct_Count
0 Gender 2
1 Function 2
2 Segment 2

How can I count categorical columns by month in Pandas?

I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month and store the combined count of A and B in a column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
    # Remove everything more granular than day by splitting the stringified version of the date.
    df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')), format="%Y-%m-%d")
    # Set the time column as the index and drop redundant time column now that time is indexed. Do this op in-place.
    df = df.set_index(df.date)
    df.drop('date', inplace=True, axis=1)
    # Group all events by (year, month) and count category by values.
    counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
    counted_events.index.names = ["year", "month", "category"]
    return counted_events
which gives me the following:
year month category
2017 1 A 2
B 1
2 C 3
A 1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?
I tried this, so I'm posting it, though I like @Scott Boston's solution better; here I combine the A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
a_or_b_count c_count
Year Month
2017 1 3.0 0.0
2 1.0 3.0
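An alternative sketch, starting again from the raw data in the question and using crosstab with to_period, keeps the counts as integers instead of the floats produced by unstack().fillna(0):
df.date = pd.to_datetime(df.date, format='%Y-%m-%d')
counts = pd.crosstab(df.date.dt.to_period('M'),
                     df.category.replace({'A': 'AB', 'B': 'AB'}))
counts.columns = ['a_or_b_count', 'c_count']
print(counts)
         a_or_b_count  c_count
date
2017-01             3        0
2017-02             1        3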
