Generate daterange and insert in a new column of a dataframe - python-3.x

Problem statement: Create a dataframe with multiple columns and populate one column with daterange series of 5 minute interval.
Tried solution:
Created a dataframe initially with just one row / 5 columns (all "NAN") .
Command used to generate daterange:
rf = pd.date_range('2000-1-1', periods=5, freq='5min').
O/P of rf :
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:05:00',
'2000-01-01 00:10:00', '2000-01-01 00:15:00',
'2000-01-01 00:20:00'],
dtype='datetime64[ns]', freq='5T')
When I try to assign rf to one of the columns of df (df['column1'] = rf)., it is throwing exception as shown below (copying the last line of exception).
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2879, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
Though I understood the issue, I don't know the solution. I'm looking for a easy way to achieve this.

I think, I was slowly understanding the power/usage of dataframes.
Initially create a dataframe :
df = pd.DataFrame(index=range(100),columns=['A','B','C'])
Then created a date_range.
date = pd.date_range('2000-1-1', periods=100, freq='5T')
Using "assign" function , added date_range as new column to already created dataframe (df).
df = df.assign(D=date)
Final O/P of df:
df[:5]
A B C D
0 NaN NaN NaN 2000-01-01 00:00:00
1 NaN NaN NaN 2000-01-01 00:05:00
2 NaN NaN NaN 2000-01-01 00:10:00
3 NaN NaN NaN 2000-01-01 00:15:00
4 NaN NaN NaN 2000-01-01 00:20:00

Your dataframe has only one row and you try to insert data for five rows.

Related

TypeError: '(slice(None, 59, None), slice(None, None, None))' is an invalid key

I am having the below table where I want to remove these rows with NaN values.
date Open ... Real Lower Band Real Upper Band
0 2020-07-08 08:05:00 2.1200 ... NaN NaN
1 2020-07-08 09:00:00 2.1400 ... NaN NaN
2 2020-07-08 09:30:00 2.1800 ... NaN NaN
3 2020-07-08 09:35:00 2.2000 ... NaN NaN
4 2020-07-08 09:40:00 2.1710 ... NaN NaN
5 2020-07-08 09:45:00 2.1550 ... NaN NaN
These NaN values are til row no. 58
For this, I wrote the following code. But the above error occurred.
data.drop(data[:59,:],inplace= True)
print(data)
Please help me!
There are many options to choose from:
Drop rows by index label.
df.drop(list(range(59)), axis=0, inplace=True)
Drop if nans in selected columns.
df.dropna(axis=0, subset=['Real Upper Band'], inplace=True)
Select rows to keep by index label slice
df = df.loc[59:, :] # 59 is the label in index, if index was date then replace 59 with corresponding datetime
Select rows to keep by integer index slice (similar to slicing a list)
df = df.iloc[59:, :] # 59 is the 0-index row number, regardless of what index is set on df
Filter with .loc and boolean array returned by .isna()
df = df.loc[~df['Real Upper Band'].isna(), :]
Remember that loc and iloc work with two dimensions when applied to dataframes, it is recomended to use full slice : to avoid ambiguity and improve performance according to the docs https://pandas.pydata.org/docs/user_guide/indexing.html
You want to keep rows from 59-th on, so the shortest code you can run is:
data = data[59:]

Trying to append a single row of data to a pandas DataFrame, but instead adds rows for each field of input

I am trying to add a row of data to a pandas DataFrame, but it keeps adding a separate row for each piece of data. I feel I am missing something very simple and obvious, but what it is I do not know.
import pandas
colNames = ["ID", "Name", "Gender", "Height", "Weight"]
df1 = pandas.DataFrame(columns = colNames)
df1.set_index("ID", inplace=True, drop=False)
i = df1.shape[0]
person = [{"ID":i},{"Name":"Jack"},{"Gender":"Male"},{"Height":177},{"Weight":75}]
df1 = df1.append(pandas.DataFrame(person, columns=colNames))
print(df1)
Output:
ID Name Gender Height Weight
0 0.0 NaN NaN NaN NaN
1 NaN Jack NaN NaN NaN
2 NaN NaN Male NaN NaN
3 NaN NaN NaN 177.0 NaN
4 NaN NaN NaN NaN 75.0
You are using too many squiggly brackets. All of your data should be inside one pair of squiggly brackets. This creates a single python dictionary. Change that line to:
person = [{"ID":i,"Name":"Jack","Gender":"Male","Height":177,"Weight":75}]

Pandas append returns DF with NaN values

I'm appending data from a list to pandas df. I keep getting NaN in my entries.
Based on what I've read I think I might have to mention the data type for each column in my code.
dumps = [];features_df = pd.DataFrame()
for i in range (int(len(ids)/50)):
dumps = sp.audio_features(ids[i*50:50*(i+1)])
for i in range (len(dumps)):
print(list(dumps[0].values()))
features_df = features_df.append(list(dumps[0].values()), ignore_index = True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
append() strategy in a tight loop isn't a great way to do this. Rather, you can construct an empty DataFrame and then use loc to specify an insertion point. The DataFrame index should be used.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from data-2 column, where the weight given to 'data-2' column depends on corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
df.loc[idx,'data-3'] = df.loc[idx, 'data-2']*df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)

Convert a numerical relative index (=months) to datetime

Given is a Pandas DataFrame with a numerical index representing the relative number of months:
df = pd.DataFrame(columns=['A', 'B'], index=np.arange(1,100))
df
A B
1 NaN NaN
2 NaN NaN
3 NaN NaN
...
How can the index be converted to a DateTimeIndex by specifying a start date (e.g., 2018-11-01)?
magic_function(df, start='2018-11-01', delta='month')
A B
2018-11-01 NaN NaN
2018-12-01 NaN NaN
2019-01-01 NaN NaN
...
I would favor a general solution that also works with arbitrary deltas, e.g. daily or yearly series.
Using date_range
idx=pd.date_range(start='2018-11-01',periods =len(df),freq='MS')
df.index=idx
I'm not sure with Pandas, but with plain datetime can't you just do this?
import datetime
start=datetime.date(2018,1,1)
months = 15
adjusted = start.replace(year=start.year + int(months/12), month=months%12)

Resources