I have a dataset of around 13,000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform a time series analysis where the x axis would be the date (each day) and the y axis would be the frequency of the text on the corresponding date.
I think that if I create a new data frame with unique dates and the number of texts on each corresponding date, that would solve my problem.
Sample data
How can I create a new column with the frequency of the text for each day? For example:
Thanks in advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field on each day, independently of the value of the text field.
Or count the number of occurrences of each unique value of the text field on each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date'])  # convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then, for option one:
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
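Since the goal is a daily time series, it may also help to reindex the result onto a continuous daily range, so that days with no text at all appear with a count of 0 instead of being missing. A small sketch, continuing from the counted frame above:
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')  # every calendar day in the span
df = df.reindex(full_range, fill_value=0)  # days with no rows get a count of 0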
For option two:
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function creates one column per possible value of the Text field. Each column is then a boolean indicator telling us, for each row of the dataframe, which value of the Text field occurred in that row. We can then simply do a sum aggregation with a groupby on the Date field.
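For what it's worth, starting from the original df, pd.crosstab should produce the same table in one step, without creating the dummy columns first:
counts = pd.crosstab(df['Date'], df['Text'])  # one row per date, one column per text value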
If you are not familiar with groupby and aggregation operations, I recommend that you read this guide first.
Related
I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but I can't seem to make it work. I have tried to transpose it and to pivot it, but I can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point for each hour from 00:00 to 20:00 each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Judging from the details you have provided, I think you are dealing with time series data acquired on different dates at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we find the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from your date because %f only parses up to microseconds (six digits), so the nanosecond part would complicate parsing. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new, empty DataFrame in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # this builds a row of 3 elements, which will be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # here we find the data for 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # here we find the data for 03:00 on the same day
    new_df.loc[len(new_df)] = new_row_data  # now we insert the row at the last position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
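For reference, pandas can also do this reshaping without an explicit loop, which sidesteps the concern about generating the 00:00 through 20:00 columns by hand. A sketch, assuming each (day, time) pair occurs at most once:
df["date"] = pd.to_datetime(df["date"])
wide = (df.assign(day=df["date"].dt.date, time=df["date"].dt.strftime("%H:%M"))
          .pivot(index="day", columns="time", values="value")
          .reset_index())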
I have a function which sends automated messages to clients, and takes as input all the columns from a dataframe like the one below.
name     phone     status   date
name_1   phone_1   sending  today
name_2   phone_2   sending  yesterday
I iterate through the dataframe with a pandas apply (axis=1) and use the values in the columns of each row as inputs to my function. At the end, after sending, it changes the status to "sent". The thing is, I only want to send to the clients whose date reference is "today". With pandas.apply(axis=1) this is perfectly doable, but in order to slice out the clients with the "today" value, I need to:
create a new dataframe with today's value,
remove it from the original, and then
reappend it to the original.
I thought about running through the whole dataframe and ignore the rows which have dates different than "today", but if my dataframe keeps growing, I'm afraid of the whole process becoming slower.
I saw examples of this being done with mask, although usually people only use one column, and I need more than just the one. Is there any way to do this with pandas apply?
Thank you.
I think you can use .loc to filter the data and then apply the function to the result.
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.334781 0.521263 0.402030 0.973504 0.903314
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 0.902107 0.226398 0.596697 0.489761 0.535270
If we want to double the values of the rows where the value in the first column is > 0.3, we first select those rows:
In [16]: df.loc[df[0] > 0.3]
Out[16]:
0 1 2 3 4
2 0.334781 0.521263 0.402030 0.973504 0.903314
4 0.902107 0.226398 0.596697 0.489761 0.535270
In [18]: df.loc[df[0] > 0.3] = df.loc[df[0] > 0.3].apply(lambda x: x*2, axis=1)
In [19]: df
Out[19]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.669563 1.042527 0.804061 1.947008 1.806628
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 1.804213 0.452797 1.193394 0.979522 1.070540
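Applied to the original question, the same pattern would look something like the sketch below; send_message is a hypothetical stand-in for the real sending function, and the column names are the ones shown above:
mask = df['date'] == 'today'
df.loc[mask].apply(lambda row: send_message(row['name'], row['phone']), axis=1)  # send_message is hypothetical
df.loc[mask, 'status'] = 'sent'  # mark only today's clients as sent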
I have a data frame as in the image below. I want to extract the rows of the data frame whose year and month are '1395/01'. I used the code below, but I know it is not correct, because it slices the Series of rows rather than each string. Can anyone show me a way without using nested for loops?
df[df['Date'][:7] == '1395/01']
I might use str.match here:
df[df['Date'].str.match(r'^1395/01')]
But in general it is usually preferable to store dates as datetime and not text. Also, the year 1395 seems dubious (unless these are Solar Hijri dates).
You can use loc and str.startswith to filter your dataframe.
Sample:
df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})
print(df)
Date
0 1395/01/01
1 1395/02/01
2 1395/01/01
3 1395/05/01
Solution:
print(df.loc[df['Date'].str.startswith('1395/01'), :])
Date
0 1395/01/01
2 1395/01/01
If you would like to extract year and month for all rows, you can use str.slice:
df['Extracted Date'] = df['Date'].str.slice(0, 7)
print(df)
Date Extracted Date
0 1395/01/01 1395/01
1 1395/02/01 1395/02
2 1395/01/01 1395/01
3 1395/05/01 1395/05
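Equivalently, the .str accessor supports slice notation directly, which reads closest to the original attempt:
df['Extracted Date'] = df['Date'].str[:7]  # the slice is applied element-wise to each string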
I have data as follows:
Col1,ColDate
a,2020-09-11 08:43:00
b,2020-09-12 09:43:00
c,13-09-2020 09:43:00
d,09/16/2020 10:43:00
e,09/19/2020 12:43:00
f,09/12/2020 15:43:00
The intention is to get all rows between 11th Sep and 13th Sep, regardless of the format. In pandas, I am trying the following:
df[df["ColDate"].between('11-09-2020','13-09-2020')]
I get an empty dataframe.
You can try this:
df[pd.to_datetime(df['ColDate']).dt.strftime('%d-%m-%Y').between('11-09-2020','13-09-2020')]
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
2 c 13-09-2020 09:43:00
5 f 09/12/2020 15:43:00
but it is really hard to say which part will be considered the month and which the day, because the date formats are jumbled.
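One more caveat: recent pandas versions (2.0+) no longer guess a format per element by default, so the pd.to_datetime call above may raise on mixed inputs there. Something like this may be needed instead (format='mixed' exists from pandas 2.0):
parsed = pd.to_datetime(df['ColDate'], format='mixed')  # opt in to per-element format inference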
Please check the snippet. You can first convert your ColDate to datetime with pd.to_datetime and then apply a mask over it, like this:
df['ColDate'] = pd.to_datetime(df['ColDate'])
mask = (df['ColDate'] > '2020-09-11') & (df['ColDate'] <='2020-09-13')
df = df.loc[mask]
Output
Col1 ColDate
0 a 2020-09-11 08:43:00
1 b 2020-09-12 09:43:00
5 f 2020-09-12 15:43:00
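The same filter can also be written with Series.between, which is inclusive on both ends by default:
df = df[df['ColDate'].between('2020-09-11', '2020-09-13')]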
I have two dataframes as shown below:
time browncarbon blackcarbon
181.7335 0.105270 NaN
181.3809 0.166545 0.001217
181.6197 0.071581 NaN
422 rows x 3 columns
start end toc
179.9989 180.0002 155.0
180.0002 180.0016 152.0
180.0016 180.0030 151.0
1364 rows x 3 columns
The first dataframe has a time column with an instant every four minutes. The second dataframe has two time columns spaced every two minutes. These time columns do not start and end at the same time; however, they contain data collected over the same day. How could I make another dataframe containing:
time browncarbon blackcarbon toc
422 rows X 4 columns
There is a related answer on Stack Overflow, however, that is applicable only when the time columns are datetime or timestamp objects. The link is: How to join two dataframes for which column values are within a certain range?
Addendum 1: The multiple start and end rows that get encapsulated into one of the time rows should still correspond to a single toc value, as they do right now; however, that value should be the average of the matching toc rows, which is not the case presently.
Addendum 2: Merging two pandas dataframes with complex conditions
We create an artificial key column and do an outer merge on it to get the Cartesian product back (all matches between the rows). Then we filter, with .query, all the rows where time falls between start and end.
Note: I edited the value of one row so we get a match (see row 0 in the example dataframes at the bottom).
df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')\
.query('(time >= start) & (time <= end)')\
.drop(['key', 'start', 'end'], axis=1)
output
time browncarbon blackcarbon toc
1 180.0008 0.10527 NaN 152.0
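Addendum 1 (averaging when several start/end intervals capture the same time) can be handled on top of the same Cartesian product. A sketch:
matched = df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')\
    .query('(time >= start) & (time <= end)')
toc_by_time = matched.groupby('time', as_index=False)['toc'].mean()  # average all matching toc rows per time
result = df1.merge(toc_by_time, on='time', how='left')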
Example dataframes used:
df1:
time browncarbon blackcarbon
0 180.0008 0.105270 NaN
1 181.3809 0.166545 0.001217
2 181.6197 0.071581 NaN
df2:
start end toc
0 179.9989 180.0002 155.0
1 180.0002 180.0016 152.0
2 180.0016 180.0030 151.0
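On pandas 1.2 and later, the artificial key column is no longer necessary; a cross merge does the same thing:
df1.merge(df2, how='cross')\
    .query('(time >= start) & (time <= end)')\
    .drop(['start', 'end'], axis=1)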
Since the start and end intervals are mutually exclusive, we may be able to create new columns in df2 containing all the integer values in the range floor(start) to floor(end). Later, add a floor(time) column to df1 and take a left outer join of df1 and df2. I think that should do it, except that you may have to remove NaN values and extra columns if required. If you send me the csv files, I may be able to send you the script. I hope I answered your question.
Perhaps you could just convert your columns to Timestamps and then use the answer in the other question you linked
from pandas import Timestamp
from dateutil.relativedelta import relativedelta as rd
def to_timestamp(x):
    return Timestamp(2000, 1, 1) + rd(days=x)
df['start_time'] = df.start.apply(to_timestamp)
df['end_time'] = df.end.apply(to_timestamp)
Your 2nd data frame is too short to reflect a meaningful merge, so I modified it a little:
df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016, 181.3, 181.5, 181.7],
'end': [180.0002, 180.0016, 180.003, 181.5, 185.7, 181.8],
'toc': [155.0, 152.0, 151.0, 150.0, 149.0, 148.0]})
import numpy as np

df1['Rank'] = np.arange(len(df1))  # remember the original row order
new_df = pd.merge_asof(df1.sort_values('time'), df2,
left_on='time',
right_on='start')
gives you:
time browncarbon blackcarbon Rank start end toc
0 181.3809 0.166545 0.001217 1 181.3 181.5 150.0
1 181.6197 0.071581 NaN 2 181.5 185.7 149.0
2 181.7335 0.105270 NaN 0 181.7 181.8 148.0
after which you can drop the extra columns and sort_values on Rank. For example:
new_df.sort_values('Rank').drop(['Rank','start','end'], axis=1)
gives:
time browncarbon blackcarbon toc
2 181.7335 0.105270 NaN 148.0
0 181.3809 0.166545 0.001217 150.0
1 181.6197 0.071581 NaN 149.0
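One caveat: merge_asof with the default direction='backward' only guarantees time >= start, so a time that falls past its interval's end would still be matched to it. If that can happen in the real data, the overshoots can be masked out before dropping the start and end columns, for example:
new_df.loc[new_df['time'] > new_df['end'], 'toc'] = np.nan  # invalidate matches that fall outside the interval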