Sorting data frames with time in hours/days/months format - python-3.x

I have a data frame with times in a column, but not in sorted order. I want to sort them in ascending order. Can someone suggest a direct function or code to sort a data frame on such a time column?
My input data frame:
Time       data1
1 month    43.391588
13 h       31.548372
14 months  41.956652
3.5 h      31.847388
Expected data frame:
Time       data1
3.5 h      31.847388
13 h       31.548372
1 month    43.391588
14 months  41.956652

You need to replace the units with numeric conversion factors first by Series.replace, then convert to numbers by pandas.eval, and last use this helper column for sorting by DataFrame.sort_values. Note that ' months' comes before ' month' in the mapping so the longer unit is replaced first; otherwise '14 months' would become '14*30*24s', which pd.eval cannot parse:
d = {' months': '*30*24', ' month': '*30*24', ' h': '*1'}
df['sort'] = df['Time'].replace(d, regex=True).map(pd.eval)
df = df.sort_values('sort')
print (df)
Time data1 sort
3 3.5 h 31.847388 3.5
1 13 h 31.548372 13.0
0 1 month 43.391588 720.0
2 14 months 41.956652 10080.0
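If anything looks off, the intermediate result can be inspected before pd.eval; the regex replacement alone produces plain arithmetic strings:
print(df['Time'].replace(d, regex=True))
# 0     1*30*24
# 1        13*1
# 2    14*30*24
# 3       3.5*1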

First you should check what type of data you have in your dataframe, since this will indicate how you may proceed: use df.dtypes, or in your case df.index.dtype.
The preferred option for sorting dataframes is df.sort_values().
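A minimal sketch of that check, assuming the df from the question above:
print(df.dtypes)
# Time      object   <- plain strings: sorting on Time directly compares lexicographically ('13 h' < '3.5 h')
# data1    float64   <- numeric: df.sort_values('data1') already works as-is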

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and pivot it, but can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point for each hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Assuming from the details you have provided, I think you are dealing with time-series data and that you have readings from different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we compute the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from the date string because strptime's %f directive only parses up to six fractional digits (microseconds), so the nanosecond part would fail to parse. We convert each string to a datetime object, take its date, and keep the unique values.
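As an aside, pd.to_datetime parses these nanosecond ISO strings directly, so the trimming can be avoided; an equivalent one-line sketch:
unique_days = pd.to_datetime(df["date"]).dt.date.unique()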
Now we create a new empty df in desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements, to be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value at 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value at 03:00 for the same day
    new_df.loc[len(new_df)] = new_row_data  # insert the row at the last position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
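For the real data with hourly points from 00:00 to 20:00, hard-coding one column per hour and looping does not scale well. A pivot-based sketch of the same reshape (an alternative, assuming df still holds the raw "date" strings and the "value" column; the helper columns "day" and "hour" are names I introduce here) builds the columns automatically:
import pandas as pd

dt_parsed = pd.to_datetime(df["date"])  # pandas parses the nanosecond ISO strings directly
wide = (df.assign(day=dt_parsed.dt.date, hour=dt_parsed.dt.strftime("%H:%M"))
          .pivot_table(index="day", columns="hour", values="value"))
print(wide)
# hour        02:00  03:00
# day
# 2022-04-29      5      6
# 2022-05-29      5      7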

how to get percentage of columns to sum of row in python [duplicate]

This question already has an answer here:
Normalize rows of pandas data frame by their sums [duplicate]
(1 answer)
Closed 2 years ago.
I have very high-dimensional data with more than 100 columns. As an example, I am sharing a simplified version of it below:
date product price amount
11/17/2019 A 10 20
11/24/2019 A 10 20
12/22/2020 A 20 30
15/12/2019 C 40 50
02/12/2020 C 40 50
I am trying to calculate each column as a percentage of the total row sum, as illustrated below:
date product price amount
11/17/2019 A 10/(10+20) 20/(10+20)
11/24/2019 A 10/(10+20) 20/(10+20)
12/22/2020 A 20/(20+30) 30/(20+30)
15/12/2019 C 40/(40+50) 50/(40+50)
02/12/2020 C 40/(40+50) 50/(40+50)
Is there any way to do this efficiently for high dimensional data? Thank you.
In addition to the provided link (Normalize rows of pandas data frame by their sums), you need to select the specific columns, as your first two columns are non-numeric:
cols = df.columns[2:]
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)
Out[1]:
date product price amount
0 11/17/2019 A 0.3333333333333333 0.6666666666666666
1 11/24/2019 A 0.3333333333333333 0.6666666666666666
2 12/22/2020 A 0.4 0.6
3 15/12/2019 C 0.4444444444444444 0.5555555555555556
4 02/12/2020 C 0.4444444444444444 0.5555555555555556
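With more than 100 columns, selecting by position is fragile; a variation that picks the numeric columns by dtype instead (assuming every numeric column should be part of the row sum):
cols = df.select_dtypes(include='number').columns
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)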

Add new rows to dataframe using existing rows from previous year

I'm creating a Pandas dataframe from an existing file and it ends up essentially like this.
import pandas as pd
import datetime
data = [[i, i+1] for i in range(14)]
index = pd.date_range(start=datetime.date(2019,1,1), end=datetime.date(2020,2,1), freq='MS')
columns = ['col1', 'col2']
df = pd.DataFrame(data, index, columns)
Notice that this doesn't go all the way up to the present -- often the file I'm pulling from is a month or two behind. What I then need to do is add on any missing months and fill them with the same value as the previous year.
So in this case I need to add another row that is
2020-03-01 2 3
It could be anywhere from 0-2 rows that need to be added to the end of the dataframe at a given point in time. What's the best way to do this?
Note: The data here is not real so please don't take advantage of the simple pattern of entries I gave above. It was just a quick way to fill two columns of a table as an example.
If I understand your problem, then the following should help you. This does assume that you always have data from 12 months ago, however. You can define a new DataFrame which includes the months up to the most recent date.
# First create the new index. Get the most recent date and add an offset.
start, end = df.index[-1] + pd.DateOffset(), pd.Timestamp.now()
index_new = pd.date_range(start, end, freq='MS')
Then create your new DataFrame:
# Get the data from the previous year.
data = df.loc[index_new - pd.DateOffset(years=1)].values
df_new = pd.DataFrame(data, index = index_new, columns=df.columns)
which looks like
col1 col2
2020-03-01 2 3
Then just use:
pd.concat([df, df_new], axis=0)
Which gives
col1 col2
2019-01-01 0 1
2019-02-01 1 2
2019-03-01 2 3
... ... ...
2020-02-01 13 14
2020-03-01 2 3
Note
This also works for cases where the number of months missing is greater than 1.
Edit
Slightly different variation
# Create series with missing months added.
# Get the corresponding data 12 months prior.
s = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
fill = df.loc[s[~s.isin(df.index)] - pd.DateOffset(years=1)]
# Reindex the original dataframe
df = df.reindex(s)
# Find the dates to fill and replace with lagged data
df.iloc[-1 * fill.shape[0]:] = fill.values
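Both snippets assume a row exists exactly 12 months before every missing month: with modern pandas, df.loc raises a KeyError if any lagged date is absent, and df.iloc[-1 * fill.shape[0]:] selects the whole frame when fill is empty (since -0 is 0). A defensive sketch, not part of the original answer, that fills only the months whose prior-year row exists:
missing = s[~s.isin(df.index)]
lagged = missing - pd.DateOffset(years=1)
ok = lagged.isin(df.index)        # keep only months that have a prior-year row
df = df.reindex(s)                # missing months appear as NaN rows
df.loc[missing[ok]] = df.loc[lagged[ok]].values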

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
import statistics
import pandas as pd

values = [55, 6, 77, 75, 9, 127, 13]
finallist = pd.DataFrame(values)
finallist.columns = ['Numbers']
The line below gives me the average of rows 0:2 in the Numbers column. So selecting the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as I move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected .iloc[0:2].shift(1) to shift the mean function down one row while still applying it to two rows in total, but I got a value of NaN.
What's happening in your shift(1) approach is that you're actually shifting the values in your data "down" by one position, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two values, which evaluates to NaN (a mean over NaN is NaN), and then you assign that single value to every element of the Averages column here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire column, like this:
import pandas as pd

values = [55, 6, 77, 75, 9, 127, 13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
Numbers Averages
0 55 55.0
1 6 30.5
2 77 41.5
3 75 76.0
4 9 42.0
5 127 68.0
6 13 70.0
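Note that rolling() uses a trailing window, so row 1 holds the mean of rows 0 and 1. If you instead want Excel's drag behaviour, where the first cell holds the average of A1:A2 (the current and the next row), one small variation is to shift the result up:
df['Averages'] = df['Numbers'].rolling(2).mean().shift(-1)  # row 0 = mean of rows 0-1; the last row becomes NaN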

How to calculate the time difference between rows grouped by id and extract only the most recent ones?

I want to calculate the number of days between two rows with a groupby, and then extract only one row per group with the latest date. I don't want all the rows with the same id value; I only want the most recent one, with the number of days as a new column.
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-27 16:36:04
2 A 2016-11-29 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-28 16:35:46
I want the output as
id no of days
0 A 4(approx)
1 B 3(approx)
So what I want is only one row per id (for example, the one for id A with the most recent date and time) and to omit the rest of the rows.
IIUC
df.time=pd.to_datetime(df.time)
df.groupby('id').time.apply(lambda x : (x.max()-x.min()).days)
Out[1186]:
id
A 4
B 3
Name: time, dtype: int64
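To get the exact two-column frame from the question rather than a named Series, a small follow-up sketch using agg and reset_index:
out = (df.groupby('id')['time']
         .agg(lambda x: (x.max() - x.min()).days)
         .reset_index(name='no of days'))
print(out)
#   id  no of days
# 0  A           4
# 1  B           3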
