So, I am working with a DataFrame that has around 20 columns, but only three columns are really of importance.
Index  ID        Date                 Time_difference
1      01-40-50  2021-12-01 16:54:00  0 days 00:12:00
2      01-10     2021-10-11 13:28:00  2 days 00:26:00
3      03-48-58  2021-11-05 16:54:00  2 days 00:26:00
4      01-40-50  2021-12-06 19:34:00  7 days 00:26:00
5      03-48-58  2021-12-09 12:14:00  1 days 00:26:00
6      01-10     2021-08-06 19:34:00  0 days 00:26:00
7      03-48-58  2021-10-01 11:44:00  0 days 02:21:00
There are 90 unique IDs and a few thousand rows in total. What I want to do is:
Create a plot for each unique ID
Each plot with a y-axis of 'Time_difference' and an x-axis of 'Date'
Each plot with a trendline
Optimally, a plot that has the average of all other plots
Would appreciate any input as to how to start this! Thank you!
For future documentation, solved it as follows:
First, transforming the timedelta to a number of hours:
df['hour_difference'] = (df['Time_difference'].dt.days * 24
                         + df['Time_difference'].dt.seconds / 60 / 60)
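(A side note, not part of the original solution: the same conversion can be written in one step with the dt.total_seconds() accessor.)
df['hour_difference'] = df['Time_difference'].dt.total_seconds() / 3600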
Then creating a list with all unique entries of the ID:
id_list = df['ID'].unique()
And last, the for-loop for the plotting:
import matplotlib.pyplot as plt

for i in id_list:
    df.loc[df['ID'] == i].plot(x='Date', y='hour_difference', figsize=(15, 4))
    plt.title(i, fontsize=18)                   # label the title
    plt.xlabel('Date', fontsize=12)             # label the x-axis
    plt.ylabel('Hour difference', fontsize=12)  # label the y-axis
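The question also asked for a trendline and for an average across all IDs, which the solution above does not cover. A minimal sketch of one way to do it (my own addition, assuming pandas is imported as pd, 'Date' is a datetime column, and the hour_difference column from above exists):

import numpy as np

for i in id_list:
    sub = df.loc[df['ID'] == i].sort_values('Date')
    x = sub['Date'].map(pd.Timestamp.toordinal)          # dates as plain numbers for the fit
    slope, intercept = np.polyfit(x, sub['hour_difference'], 1)
    plt.plot(sub['Date'], sub['hour_difference'], label=i)
    plt.plot(sub['Date'], slope * x + intercept, '--')   # linear trendline per ID

# rough stand-in for "the average of all other plots": mean per date over all IDs
df.groupby('Date')['hour_difference'].mean().plot(figsize=(15, 4))
plt.show()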
Related
So, I am working with a DataFrame that has around 20 columns, but only two columns are really of importance.
Index  ID        Date
1      01-40-50  2021-12-01 16:54:00
2      01-10     2021-10-11 13:28:00
3      03-48-58  2021-11-05 16:54:00
4      01-40-50  2021-12-06 19:34:00
5      03-48-58  2021-12-09 12:14:00
6      01-10     2021-08-06 19:34:00
7      03-48-58  2021-10-01 11:44:00
There are 90 different IDs and a few thousand rows in total. What I want to do is:
Group the entries by ID
Order the rows within each ID by Date
Then calculate the difference from one timestamp to the next
And create a column that holds those differences (to then visualize them for the 90 different IDs)
While I thought it would be easy to do with the groupby function, I am having quite a bit of trouble. Would appreciate any input as to how to start this! Thank you!
You can do it this way:
>>> df.groupby("ID")["Date"].apply(lambda x: x.sort_values().diff())
ID        Index
01-10     6                  NaT
          2    65 days 17:54:00
01-40-50  1                  NaT
          4     5 days 02:40:00
03-48-58  7                  NaT
          3    35 days 05:10:00
          5    33 days 19:20:00
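If the differences should live in the original frame as a new column (as the question asked), here is a small sketch of my own; it relies on the row index being preserved through the sort, so the assignment aligns back to the original rows:

df['Time_difference'] = df.sort_values('Date').groupby('ID')['Date'].diff()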
I have this Python code:
counting_bach_new = counting_bach.groupby(['User Name', 'time_diff', 'Logon Time']).size()
print("\ncounting_bach_new")
print(counting_bach_new)
...getting this neat result:
counting_bach_new
User Name  time_diff            Logon Time
122770     -132 days +21:38:00  1             1
           -122 days +00:41:00  1             1
123526     -30 days +12:04:00   1             1
           -29 days +16:39:00   1             1
           -27 days +18:16:00   1             1
                                             ..
201685     -131 days +21:21:00  1             1
202047     -106 days +10:14:00  1             1
202076     -132 days +10:22:00  1             1
           -132 days +14:46:00  1             1
           -131 days +21:21:00  1             1
So how do I add a new column that sums the counts from the existing column? The rightmost column of 1's can be disregarded; what I would like is a new column that sums the number of 'time_diff' observations per 'User Name', i.e. the count of rows listed for each user (counting either 'time_diff' or 'Logon Time' would do). For User Name 122770 the new column should sum to 2, for 123526 it should sum to 3, and so on.
I made several attempts, including the following (which did not work):
counting_bach_new.groupby('User Name').agg(MySum=('Logon Time', 'sum'), MyCount=('Logon Time', 'count'))
Any help would be appreciated. Thank you for your kind support. Christmas greetings from #Hubsandspokes
Use DataFrame.join with Series.reset_index:
df = (counting_bach_new.to_frame('count')
          .join(counting_bach_new.reset_index()
                                 .groupby('User Name')
                                 .agg(MySum=('Logon Time', 'sum'),
                                      MyCount=('Logon Time', 'count')),
                on='User Name'))
print (df)
                                            count  MySum  MyCount
User Name  time_diff            Logon Time
122770     -132 days +21:38:00  1               1      2        2
           -122 days +00:41:00  1               1      2        2
123526     -30 days +12:04:00   1               1      3        3
           -29 days +16:39:00   1               1      3        3
           -27 days +18:16:00   1               1      3        3
201685     -131 days +21:21:00  1               1      1        1
202047     -106 days +10:14:00  1               1      1        1
202076     -132 days +10:22:00  1               1      3        3
           -132 days +14:46:00  1               1      3        3
           -131 days +21:21:00  1               1      3        3
If I understand the request correctly, try:
counting_bach_new.reset_index().groupby(['User Name'])['Logon Time'].count()
If you need to keep the original number of rows, try:
counting_bach_new.reset_index().groupby(['User Name'])['Logon Time'].transform('count')
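(A usage sketch of my own: because transform keeps the original index, its result can be assigned straight back as a new column.)

out = counting_bach_new.reset_index()
out['MyCount'] = out.groupby('User Name')['Logon Time'].transform('count')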
I am using a CSV with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would become:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")  # pd.datetime is deprecated
df = pd.read_csv("Accumulative.csv", quotechar="'", usecols=["Day", "Accumulative Number"],
                 index_col=False, parse_dates=["Day"], date_parser=dateparse,
                 na_values=['.', '??'])
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can do it with groupby and factorize:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x: x.factorize()[0]) == n
df_sub = df[m].copy()
You can try filtering the dataframe to rows where the day is less than 12, then taking the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# forward-fill the gaps (the accumulative column is named 'Number' here)
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0
I want to calculate a cumulative average over every 3 rows of the Value field. The figure above shows the 'cumulative average' column, which is the expected output. I tried the offset method, but it gives the average at every 3-row interval, not the cumulative average over every 3 consecutive rows.
Use Series.rolling with mean and then Series.shift:
N = 3
df = pd.DataFrame({'Value': [6,9,15,3,27,33]})
df['Cum_sum'] = df['Value'].rolling(N).mean().shift(-N+1)
print (df)
   Value  Cum_sum
0      6     10.0
1      9      9.0
2     15     15.0
3      3     21.0
4     27      NaN
5     33      NaN
I have multiple dataframes holding data from different years.
The data in dataframes are:
>>> its[0].head(5)
Crocs
date
2017-01-01 46
2017-01-08 45
2017-01-15 43
2017-01-22 43
2017-01-29 41
>>> its[1].head(5)
Crocs
date
2018-01-07 23
2018-01-14 21
2018-01-21 23
2018-01-28 21
2018-02-04 25
>>> its[2].head(5)
Crocs
date
2019-01-06 90
2019-01-13 79
2019-01-20 82
2019-01-27 82
2019-02-03 81
I tried to plot all these dataframes in a single figure (graph); I managed that, but it was not what I wanted.
I plotted the dataframes using the following code
>>> for p in its:
...     plt.plot(p.index, p.values)
>>> plt.show()
and I got the following graph
but this is not what I wanted
I want the graph to be like this
Simply put, I want the graph to ignore the years and plot only by month and day
You can try converting the datetime index to integer time-series positions based on month and day, and plotting against those:
df3 = pd.concat(its,axis=1)
xindex= df3.index.month*30 + df3.index.day
plt.plot(xindex,df3)
plt.show()
If you want date-like labels rather than plain integers, you can add xticks to the figure:
labels = (df3.index.month*30).astype(str)+"-" + df3.index.day.astype(str)
plt.xticks(df3.index.month*30 + df3.index.day, labels)
plt.show()
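An alternative sketch (my own suggestion, not part of the answer above): move every date onto a single dummy year, so matplotlib keeps real date formatting on the x-axis.

import matplotlib.pyplot as plt

# `its` is the list of yearly dataframes from the question.
# Replacing the year assumes no index falls on Feb 29.
for p in its:
    common = p.index.map(lambda d: d.replace(year=2000))
    plt.plot(common, p.values, label=str(p.index[0].year))
plt.legend()
plt.show()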