Impute missing date (YYYY-WW) - python-3.x

I have a dataframe like below and I need to insert rows where the date is missing or omitted (note these are weekly dates):
A B C
'alpha' 2006-01 12
'beta' 2006-02 4
'kappa' 2006-04 2
Required result is something like:
A B C
'alpha' 2006-01 12
'beta' 2006-02 4
'gamma' 2006-03 0
'kappa' 2006-04 2
Can it be done?

Create an index based on the YYYY-WW format using to_datetime, resample to weekly and fix the NaN values. The strftime/strptime format codes in the Python docs are useful here.
# Set index to datetime - based on yyyy-ww format (the '-1' makes weeks start Monday)
df.index = pd.to_datetime(df['B'] + '-1', format='%Y-%W-%w')
# Resample to weekly - Monday start
df_new = df.resample('W-MON').first().fillna(0)
# Correct format of 'B' column back to yyyy-ww
df_new['B'] = df_new.index.strftime('%Y-%W')
# Optional step to reset index
df_new.reset_index(drop=True, inplace=True)
print(df_new)
[out]
A B C
0 'alpha' 2006-01 12.0
1 'beta' 2006-02 4.0
2 0 2006-03 0.0
3 'kappa' 2006-04 2.0
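Note that fillna(0) turns C into a float column (12.0, 4.0, ...) and puts the number 0 into the text column A. If you want C back as integers, a minimal sketch (assuming the column name C from above):
df_new['C'] = df_new['C'].astype(int)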

Use resample() with an adequate filling function:
df['B'] = pd.to_datetime(df['B'] + '-1', format='%Y-%W-%w')
df = df.set_index('B').resample('W-MON').asfreq().reset_index()
Then you can fill the NaNs in each column individually.
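For example, a minimal sketch of filling each column on its own, using the column names A and C from the question (the 'missing' placeholder is just an assumption, use whatever makes sense for your data):
# fill the text column with a placeholder and the numeric column with 0
df['A'] = df['A'].fillna('missing')
df['C'] = df['C'].fillna(0)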

Related

How to reformat time series to fill in missing entries with NaNs?

I have a problem that involves converting time series from one
representation to another. Each item in the time series has
attributes "time", "id", and "value" (think of it as a measurement
at "time" for sensor "id"). I'm storing all the items in a
Pandas dataframe with columns named by the attributes.
The set of "time"s is a small set of integers (say, 32),
but some of the "id"s are missing "time"s/"value"s. What I want to
construct is an output dataframe with the form:
id time0 time1 ... timeN
val0 val1 ... valN
where the missing "value"s are represented by NaNs.
For example, suppose the input looks like the following:
time id value
0 0 13
2 0 15
3 0 20
2 1 10
3 1 12
Then, assuming the set of possible times is 0, 2, and 3, the
desired output is:
id time0 time1 time2 time3
0 13 NaN 15 20
1 NaN NaN 10 12
I'm looking for a Pythonic way to do this since there are several
million rows in the input and around 1/4 million groups.
You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, you can use the more general pivot_table.
For your example, the simple pivot is sufficient:
>>> df = df.pivot(index="id", columns="time", values="value")
time 0 2 3
id
0 13.0 15.0 20.0
1 NaN 10.0 12.0
To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:
# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)
# name them "time#"
df.columns = "time" + df.columns.astype(str)
# remove the column index name "time"
df = df.rename_axis(None, axis=1)
Final df:
time0 time1 time2 time3
id
0 13.0 NaN 15.0 20.0
1 NaN NaN 10.0 12.0
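If the input contained duplicate (id, time) pairs, pivot would raise an error; a minimal sketch using pivot_table instead, where the mean aggregation is only an assumption - pick whatever suits your data:
df = df.pivot_table(index="id", columns="time", values="value", aggfunc="mean")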

Creating a single file from multiple files (Python 3.x)

I can't figure out a great way to do this, but I have 2 files with a standard date and value format.
File 1 File 2
Date Value Date Value
4 7.0 1 9.0
5 5.5 . .
6 4.0 7 2.0
I want to combine files 1 and 2 to get the following:
Combined Files
Date Value1 Value2 Avg
1 NaN 9.0 9.0
2 NaN 9.0 9.0
3 NaN 8.5 8.5
4 7.0 7.5 7.25
5 5.5 5.0 5.25
6 4.0 3.5 3.75
7 NaN 2.0 2.0
How would I attempt this? I figured I should make a masked array with the date going from 1 to 7 and then just append the files together, but I don't know how I would do that with file 1. Any pointers on where to look would be appreciated.
Using Python 3.x
EDIT:
I solved my own problem!
I am sure there is a better way to streamline this. My solution doesn't use the example above; I just threw in my code.
import numpy as np
import matplotlib.dates as mdates
from datetime import datetime
from glob import glob

def extractFiles(Dir, newDir, newDir2):
    fnames = glob(Dir)
    farray = np.array(fnames)
    ## Dates range from 723911 to 737030
    dateArray = np.arange(723911, 737030) # Store the dates
    dataArray = [] # Store the data. This needs to be a list, not np.array!
    for f in farray:
        ## Extract the data
        CH4 = np.genfromtxt(f, comments='#', delimiter=None, dtype=float).T
        myData = np.full(dateArray.shape, np.nan) # Create a NaN-filled array
        myDate = np.array([])
        ## Convert the given datetime into something more usable
        for x, y in zip(*CH4[1:2], *CH4[2:3]):
            myDate = np.append(myDate,
                mdates.date2num(datetime.strptime('{}-{}'.format(int(x), int(y)), '%Y-%m')))
        ## Find where the dates are the same and place the appropriate concentration value
        for i in range(len(CH4[3])):
            idx = np.where(dateArray == myDate[i])
            myData[idx] = CH4[3, i]
        ## Store all values in the list
        dataArray.append(myData)
    ## Convert the list to a numpy array and save it to a txt file
    dataArray = np.vstack((dateArray, dataArray))
    np.savetxt(newDir, dataArray.T, fmt='%1.2f', delimiter=',')
    ## Find the average of the data to plot
    avg = np.nanmean(dataArray[1:].T, 1)
    avg = np.vstack((dateArray, avg))
    np.savetxt(newDir2, avg.T, fmt='%1.2f', delimiter=',')
    return avg
Here is my answer based on the information you gave me:
import pandas as pd
import os
# I stored two Excel files in a subfolder of this sample code
# Code
# ----Files
# -------- File1.xlsx
# -------- File2.xlsx
# Here I am saving the path to a variable
file_path = os.path.join(*[os.getcwd(), 'Files', ''])
# Define an empty DataFrame that we then fill with the files' information
final_df = pd.DataFrame()
# file_number is used to increment the Value column name based on the number of files we load:
# the first file becomes Value1, the second Value2, and so on
file_number = 1
# os.listdir "has a look" into the "Files" folder and returns a list of the files contained in it,
# ['File1.xlsx', 'File2.xlsx'] in our case
for file in os.listdir(file_path):
    # Load the Excel file with pandas' read_excel function
    df = pd.read_excel(file_path + file)
    # Rename the column "Value" to "Value" + the file_number
    df = df.rename(columns={'Value': 'Value' + str(file_number)})
    # Check if the DataFrame already contains values
    if not final_df.empty:
        # If there are values already, merge them with the new values
        final_df = final_df.merge(df, how='outer', on='Date')
    else:
        # Otherwise "initialize" final_df with the first Excel file that we loaded
        final_df = df
    # At the end, increment the file number by one to continue with the next file
    file_number += 1
# Get all column names that contain "Value"
value_columns = [w for w in final_df.columns if 'Value' in w]
# Create a new column with the average over all value columns that we found
final_df['Avg'] = final_df.apply(lambda x: x[value_columns].mean(), axis=1)
# Sort the dataframe based on the Date
sorted_df = final_df.sort_values('Date')
print(sorted_df)
The print will output this:
Date Value1 Value2 Avg
3 1 NaN 9.0 9.00
4 2 NaN 9.0 9.00
5 3 NaN 8.5 8.50
0 4 7.0 7.5 7.25
1 5 5.5 5.0 5.25
2 6 4.0 3.5 3.75
6 7 NaN 2.0 2.00
Please be aware that this does not pay attention to the file names and just loads one file after another in alphabetical order.
But this has the advantage that you can put as many files in there as you want.
If you need to load them in a specific order, I can probably help you with that as well.
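For example, a minimal sketch that loads the files sorted by the number in their name; file_number_key is a made-up helper for illustration and assumes names like File1.xlsx, File2.xlsx (os, pd and file_path are taken from the code above):
import re
# Sort by the number embedded in the file name, so File10.xlsx comes after File2.xlsx
def file_number_key(name):
    match = re.search(r'\d+', name)
    return int(match.group()) if match else 0
for file in sorted(os.listdir(file_path), key=file_number_key):
    df = pd.read_excel(file_path + file)
    # ... same merging logic as in the loop above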

Pandas groupby and append the original values. Count the mean of per row

I have a dataframe of IDs and Values, where the IDs mark repeated trials and the Values are the results.
I want to group by ID so that the Values for the same ID end up in adjacent columns, and then calculate the mean of each row.
>>>df
ID Value
0 1 1.1
1 2 1.2
2 3 2.4
3 1 1.7
4 2 4.3
5 3 2.2
>>>groups = df.groupby(by='ID')
#Now I cannot figure it what to do for my desired output.
I want the output like
ID Value_1 Value_2 Mean
0 1 1.1 1.7 1.4
1 2 1.2 4.3 2.75
2 3 2.4 2.2 2.3
Create a counter column per group with GroupBy.cumcount and DataFrame.assign, reshape with DataFrame.pivot, rename the columns with DataFrame.add_prefix, add a new column filled with the row means, and finish with some cleanup - DataFrame.reset_index and DataFrame.rename_axis:
df = (df.assign(g = df.groupby('ID').cumcount().add(1))
        .pivot(index='ID', columns='g', values='Value')
        .add_prefix('Value_')
        .assign(Mean = lambda x: x.mean(axis=1))
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
ID Value_1 Value_2 Mean
0 1 1.1 1.7 1.40
1 2 1.2 4.3 2.75
2 3 2.4 2.2 2.30
One possible solution, assuming that you have 2 rows for each ID:
Define a function to be applied to groups:
def fn(grp):
    vals = grp.Value.values
    return [vals[0], vals[-1], grp.Value.mean()]
Then apply it and "move" the ID column from the index to a regular column:
df2 = df.groupby('ID').apply(fn).apply(pd.Series).reset_index()
Finally, set proper column names:
df2.columns=[ 'ID', 'Value_1', 'Value_2', 'Mean' ]
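Applied to the sample dataframe above, df2 should then match the output of the first approach:
print(df2)
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30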

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas Dataframe and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected, because if you check DataFrame.std:
Normalized by N-1 by default. This can be changed using the ddof argument
With a single element you divide by 0, so if you have only one column and ask for the sample standard deviation across columns, you get all missing values.
Sample:
inp_df = pd.DataFrame({'salary':[10,20,30],
                       'num_months':[1,2,3],
                       'no_of_hours':[2,5,6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column with a single [] to get a Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get std of Series - get a scalar:
print (inp_df['salary'].std())
10.0
Select one column with double [[]] to get a one-column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get the std of the DataFrame per index (the default, axis=0) - a one-element Series:
print (inp_df[['salary']].std())
#same like
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get the std of the DataFrame across columns (axis=1) - all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If you change the default ddof=1 to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std by two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64

python-3: how to create a new pandas column as subtraction of two consecutive rows of another column?

I have a pandas dataframe
x
1
3
4
7
10
I want to create a new column y as y[i] = x[i] - x[i-1] (and y[0] = x[0]).
So the above data frame will become:
x y
1 1
3 2
4 1
7 3
10 3
How can I do that with Python 3? Many thanks.
Using .shift() and fillna():
df['y'] = (df['x'] - df['x'].shift(1)).fillna(df['x'])
To explain what this is doing, if we print(df['x'].shift(1)) we get the following series:
0 NaN
1 1.0
2 3.0
3 4.0
4 7.0
These are your values from 'x' shifted down one row. The first row gets NaN because there is no value above it to shift down. So, when we do:
print(df['x'] - df['x'].shift(1))
We get:
0 NaN
1 2.0
2 1.0
3 3.0
4 3.0
These are your subtracted values, but in the first row we get a NaN again. To clear this, we use .fillna(), telling it to take the value from df['x'] whenever a null value is encountered.
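As a side note, subtracting a shifted column is exactly what Series.diff does, so an equivalent one-liner would be:
df['y'] = df['x'].diff().fillna(df['x'])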
