Convert a numerical relative index (=months) to datetime - python-3.x

Given is a Pandas DataFrame with a numerical index representing the relative number of months:
df = pd.DataFrame(columns=['A', 'B'], index=np.arange(1,100))
df
A B
1 NaN NaN
2 NaN NaN
3 NaN NaN
...
How can the index be converted to a DateTimeIndex by specifying a start date (e.g., 2018-11-01)?
magic_function(df, start='2018-11-01', delta='month')
A B
2018-11-01 NaN NaN
2018-12-01 NaN NaN
2019-01-01 NaN NaN
...
I would favor a general solution that also works with arbitrary deltas, e.g. daily or yearly series.

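Use pd.date_range with the month-start frequency alias 'MS'. Other frequency aliases (for example 'D' for daily) work the same way:
idx = pd.date_range(start='2018-11-01', periods=len(df), freq='MS')
df.index = idx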
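Since the question asks for something general, the date_range approach can be wrapped in a small helper. This is only a sketch: the function name relative_index_to_datetime is hypothetical, and it assumes the index is 1-based, non-empty, and made of integers. Unlike the two-liner above, it looks dates up by index value, so gaps in the relative index are preserved.

```python
import numpy as np
import pandas as pd


def relative_index_to_datetime(df, start, freq="MS"):
    """Hypothetical helper: map a 1-based relative integer index onto dates."""
    # Convert the 1-based relative index to 0-based step counts.
    steps = np.asarray(df.index) - 1
    # Generate enough dates to cover the largest step, then pick by position.
    dates = pd.date_range(start=start, periods=int(steps.max()) + 1, freq=freq)
    out = df.copy()
    out.index = dates[steps]
    return out


df = pd.DataFrame(columns=["A", "B"], index=np.arange(1, 100))
monthly = relative_index_to_datetime(df, "2018-11-01", freq="MS")
daily = relative_index_to_datetime(df, "2018-11-01", freq="D")
print(monthly.index[:3])
print(daily.index[:3])
```

Any frequency alias accepted by pd.date_range can be passed as freq, so the same helper should cover daily or yearly series as well.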

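I'm not sure about Pandas, but with plain datetime can't you just do this? (The month arithmetic has to include the start month and must never produce month 0.)
import datetime
start = datetime.date(2018, 1, 1)
months = 15
# Count months from zero so that year rollover and December work correctly
total = start.month - 1 + months
adjusted = start.replace(year=start.year + total // 12, month=total % 12 + 1)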

Related

Assign array values to NaN Dataframe Pandas

I am trying to fill a dataframe which originally has NaN values with the same number of values taken from an array. All the values in the dictionary leagueList (NFL,NBA, etc.) are individual dataframes.
Sorry, I can't place them here as the post will become too long.
The idea behind the loop below is to get the series of paired t-tests (p_value) between all leagues in the dataframe and compare them based on columns called 'win_loss_ratio'.
The resulting array, which has the same number of values as the empty dataframe, should be used to replace the NaN values in the dataframe, but I am stuck on this part. How can this be accomplished?
leagueList={'NFL':NFL,'NBA':NBA,'NHL':NHL,'MLB':MLB}
df = pd.DataFrame(columns = leagueList, index = leagueList)
print(df)
NFL NBA NHL MLB
NFL NaN NaN NaN NaN
NBA NaN NaN NaN NaN
NHL NaN NaN NaN NaN
MLB NaN NaN NaN NaN
# Double loop for making all possible league combinations
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']], df_comb[['win_loss_ratio_y']])
        print(p_value)
[nan]
[0.94179205]
[0.03088317]
[0.80206949]
[0.94179205]
[nan]
[0.02229705]
[0.95053998]
[0.03088317]
[0.02229705]
[nan]
[0.00070784]
[0.80206949]
[0.95053998]
[0.00070784]
[nan]
Put the p-values into a list, then either use .fillna or just construct the DataFrame straight away:
import pandas as pd
from scipy import stats
#some sample data
NFL = pd.DataFrame([.5,.6,.7], columns=['win_loss_ratio'])
NBA = pd.DataFrame([.7,.5,.3], columns=['win_loss_ratio'])
NHL = pd.DataFrame([.4,.3,.2], columns=['win_loss_ratio'])
MLB = pd.DataFrame([.9,.8,.9], columns=['win_loss_ratio'])
leagueList={'NFL':NFL,'NBA':NBA,'NHL':NHL,'MLB':MLB}
#Double loop for making all possible league combinations
rows = []
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']], df_comb[['win_loss_ratio_y']])
        rows.append(p_value[0])
n=len(leagueList)
data = [rows[i * n:(i + 1) * n] for i in range((len(rows) + n - 1) // n )]
df = pd.DataFrame(data, columns = leagueList, index = leagueList)
Output:
print (df.to_string())
NFL NBA NHL MLB
NFL NaN 0.622036 0.12169 0.057191
NBA 0.622036 NaN 0.07418 0.092735
NHL 0.121690 0.074180 NaN 0.013560
MLB 0.057191 0.092735 0.01356 NaN
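As an alternative to slicing the flat list by hand, the list can be reshaped into an n-by-n array with NumPy before building the DataFrame. The sketch below reuses the p-values printed in the question, in the order the double loop produced them; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

names = ["NFL", "NBA", "NHL", "MLB"]

# p-values copied from the question's output, in double-loop (row-major) order
p_values = [
    np.nan, 0.94179205, 0.03088317, 0.80206949,
    0.94179205, np.nan, 0.02229705, 0.95053998,
    0.03088317, 0.02229705, np.nan, 0.00070784,
    0.80206949, 0.95053998, 0.00070784, np.nan,
]

n = len(names)
# Reshape the flat list into an n x n grid: one row per outer-loop league
df = pd.DataFrame(np.reshape(p_values, (n, n)), index=names, columns=names)
print(df.to_string())
```

This relies on the loop visiting the leagues in the same order as names, which holds here because both come from the same dictionary order.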

Pivoting a repeating Time Series Data

I am trying to pivot this data so that I get one column per state and metric, e.g. AK_positive, AK_probableCases, AK_negative, AL_positive, and so on.
You can get the data here, df = pd.read_csv('https://covidtracking.com/api/states/daily.csv')
Just flatten the original MultiIndex column into tuples using .to_flat_index(), and rearrange tuple elements into a new column name.
df_pivoted.columns = [f"{i[1]}_{i[0]}" for i in df_pivoted.columns.to_flat_index()]
Result:
# start from April
df_pivoted[df_pivoted.index >= 20200401].head(5)
AK_positive AL_positive AR_positive ... WI_grade WV_grade WY_grade
date ...
20200401 133.0 1077.0 584.0 ... NaN NaN NaN
20200402 143.0 1233.0 643.0 ... NaN NaN NaN
20200403 157.0 1432.0 704.0 ... NaN NaN NaN
20200404 171.0 1580.0 743.0 ... NaN NaN NaN
20200405 185.0 1796.0 830.0 ... NaN NaN NaN

How do I remove nan values from a dataframe in Python? dropna() does not seem to be working for me

How do I remove NaN values from a dataframe in Python? I already tried dropna(), but that did not work for me. Also, is NaN different from nan? I am using Pandas.
When I print the data frame, the missing values show up as nan rather than NaN.
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
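If the values print as nan, they are probably the string 'nan' rather than real missing values, so dropna() ignores them. You can replace them with np.nan using replace() and then use dropna().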
import numpy as np
df = df.replace('nan', np.nan)
df = df.dropna()
Update:
Original dataframe:
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
Applied df.replace('nan', np.nan):
1 2.11358 0.649067060588935
2 NaN 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 NaN
Applied df.dropna():
1 2.11358 0.649067060588935
3 2.10066 0.3653980276694516
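The steps above can be reproduced with a small frame. This sketch assumes the columns were read in as strings, which is one plausible reason for the lowercase nan; the column names x and y are placeholders.

```python
import numpy as np
import pandas as pd

# Values stored as strings, so "nan" is ordinary text and not a missing value
df = pd.DataFrame(
    {
        "x": ["2.11358", "nan", "2.10066", "2.10545"],
        "y": ["0.649067060588935", "0.6094130485307419", "0.3653980276694516", "nan"],
    },
    index=[1, 2, 3, 4],
)

# dropna() sees no missing values yet, so no rows are removed
rows_before = len(df.dropna())

# Turn the text "nan" into a real NaN, drop those rows, then convert to floats
cleaned = df.replace("nan", np.nan).dropna().astype(float)

print(rows_before)
print(cleaned)
```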

Trying to append a single row of data to a pandas DataFrame, but instead adds rows for each field of input

I am trying to add a row of data to a pandas DataFrame, but it keeps adding a separate row for each piece of data. I feel I am missing something very simple and obvious, but what it is I do not know.
import pandas
colNames = ["ID", "Name", "Gender", "Height", "Weight"]
df1 = pandas.DataFrame(columns = colNames)
df1.set_index("ID", inplace=True, drop=False)
i = df1.shape[0]
person = [{"ID":i},{"Name":"Jack"},{"Gender":"Male"},{"Height":177},{"Weight":75}]
df1 = df1.append(pandas.DataFrame(person, columns=colNames))
print(df1)
Output:
ID Name Gender Height Weight
0 0.0 NaN NaN NaN NaN
1 NaN Jack NaN NaN NaN
2 NaN NaN Male NaN NaN
3 NaN NaN NaN 177.0 NaN
4 NaN NaN NaN NaN 75.0
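You are using too many squiggly brackets. Each pair creates a separate dictionary, and pandas turns every dictionary in the list into its own row. All of your data should be inside one pair of squiggly brackets, which creates a single Python dictionary. Change that line to: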
person = [{"ID":i,"Name":"Jack","Gender":"Male","Height":177,"Weight":75}]
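To make the difference visible, the sketch below builds a frame from both versions of the list and compares their shapes. It uses pd.concat to add a second row, because DataFrame.append was removed in pandas 2.0; the second person's values are made up for the example.

```python
import pandas as pd

col_names = ["ID", "Name", "Gender", "Height", "Weight"]

# Five one-key dictionaries: pandas creates five rows, mostly NaN
wrong = pd.DataFrame(
    [{"ID": 0}, {"Name": "Jack"}, {"Gender": "Male"}, {"Height": 177}, {"Weight": 75}],
    columns=col_names,
)

# One dictionary with five keys: pandas creates a single complete row
right = pd.DataFrame(
    [{"ID": 0, "Name": "Jack", "Gender": "Male", "Height": 177, "Weight": 75}],
    columns=col_names,
)

# Adding another row with concat (illustrative values)
other = pd.DataFrame(
    [{"ID": 1, "Name": "Jill", "Gender": "Female", "Height": 165, "Weight": 60}],
    columns=col_names,
)
df1 = pd.concat([right, other], ignore_index=True)

print(wrong.shape, right.shape, df1.shape)
```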

Pandas append returns DF with NaN values

I'm appending data from a list to pandas df. I keep getting NaN in my entries.
Based on what I've read I think I might have to mention the data type for each column in my code.
dumps = []; features_df = pd.DataFrame()
for i in range (int(len(ids)/50)):
    dumps = sp.audio_features(ids[i*50:50*(i+1)])
    for i in range (len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index = True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
Calling append() in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame and then use loc with the next index label as the insertion point.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
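The advice quoted above can be sketched as follows: collect one dictionary per row in a plain Python list and build the DataFrame once at the end. The row contents here are stand-in data, not the real audio-feature dictionaries from the question.

```python
import pandas as pd

rows = []
for i in range(100):
    # Each dictionary becomes one row; keys become column names
    rows.append({"n": i, "square": i * i})

# Build the DataFrame in a single step instead of growing it inside the loop
features_df = pd.DataFrame(rows)
print(features_df.head())
```

If each batch already arrives as a list of dictionaries, extending rows with the batch and constructing the DataFrame after the loop should work the same way.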

Resources