Creating a single file from multiple files (Python 3.x) - python-3.x

I can't figure out a great way to do this, but I have 2 files with a standard date and value format:
File 1            File 2
Date  Value       Date  Value
4     7.0         1     9.0
5     5.5         .     .
6     4.0         7     2.0
I want to combine files 1 and 2 to get the following:
Combined Files
Date  Value1  Value2  Avg
1     NaN     9.0     9.0
2     NaN     9.0     9.0
3     NaN     8.5     8.5
4     7.0     7.5     7.25
5     5.5     5.0     5.25
6     4.0     3.5     3.75
7     NaN     2.0     2.0
How would I attempt this? I figured I should make a masked array with the date going from 1 to 7 and then just append the files together, but I don't know how I would do that with file 1. Any pointers on where to look would be appreciated.
Using Python 3.x
EDIT:
I solved my own problem!
I am sure there is a better way to streamline this. My solution doesn't use the example above; I just threw in my code.
from glob import glob
from datetime import datetime

import numpy as np
import matplotlib.dates as mdates

def extractFiles(Dir, newDir, newDir2):
    fnames = glob(Dir)
    farray = np.array(fnames)
    ## Dates range from 723911 to 737030
    dateArray = np.arange(723911, 737030)  # Store the dates
    dataArray = []  # Store the data. This needs to be a list, not an np.array!
    for f in farray:
        ## Extract the data
        CH4 = np.genfromtxt(f, comments='#', delimiter=None, dtype=float).T
        myData = np.full(dateArray.shape, np.nan)  # NaN-filled array (acts as the "mask")
        myDate = np.array([])
        ## Convert the given datetime into something more usable
        for x, y in zip(*CH4[1:2], *CH4[2:3]):
            myDate = np.append(myDate,
                mdates.date2num(datetime.strptime('{}-{}'.format(int(x), int(y)), '%Y-%m')))
        ## Find where the dates are the same and place the appropriate concentration value
        for i in range(len(CH4[3])):
            idx = np.where(dateArray == myDate[i])
            myData[idx] = CH4[3, i]
        ## Store all values in the list
        dataArray.append(myData)
    ## Convert the list to a numpy array and save it in a txt file
    dataArray = np.vstack((dateArray, dataArray))
    np.savetxt(newDir, dataArray.T, fmt='%1.2f', delimiter=',')
    ## Find the average of the data to plot
    avg = np.nanmean(dataArray[1:].T, 1)
    avg = np.vstack((dateArray, avg))
    np.savetxt(newDir2, avg.T, fmt='%1.2f', delimiter=',')
    return avg
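A hypothetical call might look like this (the glob pattern and output paths are placeholders, not from the original post):
# hypothetical usage; adjust the glob pattern and output file names to your own data
avg = extractFiles('data/ch4_*.txt', 'combined_ch4.csv', 'avg_ch4.csv')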

Here is my answer based on the information you gave me:
import pandas as pd
import os

# I stored two Excel files in a subfolder of this sample code
# Code
# ----Files
# -------- File1.xlsx
# -------- File2.xlsx
# Here I am saving the path to a variable
file_path = os.path.join(*[os.getcwd(), 'Files', ''])
# I define an empty DataFrame that we then fill with the files' information
final_df = pd.DataFrame()
# file_number will be used to increment the Value column based on the number of files that we load.
# The first file will be Value1, the second will lead to Value2
file_number = 1
# os.listdir "has a look" into the "Files" folder and returns a list of the files contained in there,
# ['File1.xlsx', 'File2.xlsx'] in our case
for file in os.listdir(file_path):
    # we load the Excel file with the pandas function "read_excel"
    df = pd.read_excel(file_path + file)
    # Rename the column "Value" to "Value" + the "file_number"
    df = df.rename(columns={'Value': 'Value' + str(file_number)})
    # Check if the DataFrame already contains values
    if not final_df.empty:
        # If there are values already, we merge them together with the new values
        final_df = final_df.merge(df, how='outer', on='Date')
    else:
        # Otherwise we "initialize" our final_df with the first Excel file that we loaded
        final_df = df
    # at the end we increment the file number by one to continue to the next file
    file_number += 1
# get all column names that have "Value" in them
value_columns = [w for w in final_df.columns if 'Value' in w]
# Create a new column for the average, built over all of the value columns we found
final_df['Avg'] = final_df.apply(lambda x: x[value_columns].mean(), axis=1)
# Sort the dataframe based on the Date
sorted_df = final_df.sort_values('Date')
print(sorted_df)
The print will output this:
   Date  Value1  Value2   Avg
3     1     NaN     9.0  9.00
4     2     NaN     9.0  9.00
5     3     NaN     8.5  8.50
0     4     7.0     7.5  7.25
1     5     5.5     5.0  5.25
2     6     4.0     3.5  3.75
6     7     NaN     2.0  2.00
Please be aware that this does not pay attention to the file names and just loads one file after another in alphabetical order.
But this has the advantage that you can put as many files in there as you want.
If you need to load them in a specific order I can probably help you with that as well.
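If the order matters, one option is to sort the directory listing before looping. A minimal sketch (not part of the original answer; the helper name and the "File<number>.xlsx" naming scheme are assumptions):
# sort the files numerically so File1.xlsx is always processed before File2.xlsx, etc.
import re

def file_order(name):
    match = re.search(r'\d+', name)   # first number that appears in the file name
    return int(match.group()) if match else 0

for file in sorted(os.listdir(file_path), key=file_order):
    ...  # same loading/merging code as in the loop above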

Related

How to reformat time series to fill in missing entries with NaNs?

I have a problem that involves converting time series from one
representation to another. Each item in the time series has
attributes "time", "id", and "value" (think of it as a measurement
at "time" for sensor "id"). I'm storing all the items in a
Pandas dataframe with columns named by the attributes.
The set of "time"s is a small set of integers (say, 32),
but some of the "id"s are missing "time"s/"value"s. What I want to
construct is an output dataframe with the form:
id  time0  time1  ...  timeN
    val0   val1   ...  valN
where the missing "value"s are represented by NaNs.
For example, suppose the input looks like the following:
time  id  value
0     0   13
2     0   15
3     0   20
2     1   10
3     1   12
Then, assuming the set of possible times is 0, 2, and 3, the
desired output is:
id  time0  time1  time2  time3
0   13     NaN    15     20
1   NaN    NaN    10     12
I'm looking for a Pythonic way to do this since there are several
million rows in the input and around 1/4 million groups.
You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, you can use the more general pivot_table.
For your example, the simple pivot is sufficient:
>>> df = df.pivot(index="id", columns="time", values="value")
>>> df
time     0     2     3
id
0     13.0  15.0  20.0
1      NaN  10.0  12.0
To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:
# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)
# name them "time#"
df.columns = "time" + df.columns.astype(str)
# remove the column index name "time"
df = df.rename_axis(None, axis=1)
Final df:
    time0  time1  time2  time3
id
0    13.0    NaN   15.0   20.0
1     NaN    NaN   10.0   12.0
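As mentioned above, pivot raises an error when the same id/time pair occurs more than once. A minimal pivot_table sketch for that case (the duplicate row below is invented for illustration):
import pandas as pd

# hypothetical input with a duplicate measurement for (id=0, time=2)
df = pd.DataFrame({"time":  [0, 2, 2, 3, 2, 3],
                   "id":    [0, 0, 0, 0, 1, 1],
                   "value": [13, 15, 17, 20, 10, 12]})

# pivot_table aggregates the duplicates instead of raising; here we average them
wide = df.pivot_table(index="id", columns="time", values="value", aggfunc="mean")
print(wide)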

How to create a dataframe from records that has Timestamps as index?

I am trying to create a dataframe in which I have Timestamps as the index, but it throws an error. I can use the same methodology to create a dataframe when the index is not a Timestamp. The following code is a bare-minimum example:
Works fine:
pd.DataFrame.from_dict({'1':{'a':1,'b':2,'c':3},'2':{'a':1,'c':4},'3':{'b':6}})
output:
   1    2    3
a  1  1.0  NaN
b  2  NaN  6.0
c  3  4.0  NaN
BREAKS
o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
output:
KeyError Traceback (most recent call last)
<ipython-input-627-f9a075f611c0> in <module>
1 o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
2
----> 3 pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in <listcomp>(.0)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
KeyError: Timestamp('2017-11-01 00:00:00')
Please help me understand the behavior and what I am missing. Also, how do I go about creating a dataframe from records that has Timestamps as indices?
Change from_records to from_dict (just like in your working example)
and everything executes fine.
Another, optional hint: since you are creating a Pandas DataFrame, use the pandas-native way to create the datetime values used as column names:
o = pd.date_range(start='2017-11-01', periods=3)
Edit
I noticed that if you create the o object the way I proposed (as a date_range), you can even use from_records.
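For concreteness, a minimal sketch of the suggested fix applied to the failing snippet from the question (the columns come out as a DatetimeIndex; as noted above, the same dictionary should also work with from_records once o is built with pd.date_range):
import numpy as np
import pandas as pd

o = np.arange(np.datetime64('2017-11-01 00:00:00'),
              np.datetime64('2017-11-01 00:00:00') + np.timedelta64(3, 'D'),
              np.timedelta64(1, 'D'))

# from_dict keeps the original dictionary keys, so the datetime64 keys are fine here
df = pd.DataFrame.from_dict({o[0]: {'a': 1, 'b': 2, 'c': 3},
                             o[1]: {'a': 1, 'c': 4},
                             o[2]: {'b': 6}})
print(df)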
Edit 2
You wrote that you want datetime objects as the index, whereas
your code attempts to set them as column names.
If you want datetime objects as the index, run something like:
df = pd.DataFrame.from_records({'1': {o[0]: 1, o[1]: 2, o[2]: 3},
                                '2': {o[0]: 1, o[2]: 4}, '3': {o[1]: 6}})
The result is:
            1    2    3
2017-11-01  1  1.0  NaN
2017-11-02  2  NaN  6.0
2017-11-03  3  4.0  NaN
Another way to create the above result is:
df = pd.DataFrame.from_records([{'1':1, '2':1}, {'1':2, '3':6}, {'1':3, '2':4}], index=o)

Table has several columns with the same type of information

My table has 4 columns: order_id, item_id_1, item_id_2 and item_id_3. The last three columns cover the same type of information (the ids of products). I want to transform this table into a 2-column table with "order_id" and "item_id", so that each column covers a unique type of information. That means that if 3 products were ordered under a particular order_id, I will get three rows (instead of one) in my new table.
This will allow me, for example, to perform a 'groupby' operation on the 'item_id' column to count how many times a particular product was ordered.
What is this table transformation process called?
For example, if you have a dataframe like this -
df = pd.DataFrame({'order_id':[1,2,3], 'item_id_1':['a','b','c'], 'item_id_2':['x','y',''], 'item_id_3':['','q','']})
df
   order_id item_id_1 item_id_2 item_id_3
0         1         a         x
1         2         b         y         q
2         3         c
(pd.melt(df, id_vars=['order_id'],
         value_vars=['item_id_1', 'item_id_2', 'item_id_3'],
         # the melted values are the product ids, so name the value column 'item_id'
         # (and the original column name 'item_num')
         var_name='item_num', value_name='item_id')
   .replace('', np.nan)
   .dropna()
   .sort_values(['order_id'])
   .reset_index(drop=True)
   [['order_id', 'item_id']])
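With the column names above, this should give something like the following (the exact row order within each order_id may vary):
   order_id item_id
0         1       a
1         1       x
2         2       b
3         2       y
4         2       q
5         3       c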
So I'm not aware of any method that allows you to expand rows automatically as you're suggesting, but you can easily reach your goal without one. Let's start from a similar data frame; I put nan in the cells of items that have not been ordered:
import pandas as pd
import numpy as np
data = {'order_id':[1,2,3],'item_id_1':[11,12,13],'item_id_2':[21,np.nan,23],'item_id_3':[31,np.nan,np.nan]}
df = pd.DataFrame(data)
cols = ['item_id_1','item_id_2','item_id_3']
print(df)
Out:
   order_id  item_id_1  item_id_2  item_id_3
0         1         11       21.0       31.0
1         2         12        NaN        NaN
2         3         13       23.0        NaN
Then you can define a new empty data frame to fill by iterating through the rows of the initial one. For every item, a new row is added to the empty data frame with the same order_id and a different item_id.
new_df = pd.DataFrame(columns=['order_id', 'item_id'])  # , 'item_num']
for ind, row in df.iterrows():
    new_row = {}
    new_row['order_id'] = row['order_id']
    for col in cols:  # for num, col in enumerate(cols):
        item = row[col]
        if not pd.isna(item):
            new_row['item_id'] = item
            # new_row['item_num'] = num + 1
            # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
            new_df = new_df.append(new_row, ignore_index=True)
print(new_df)
Out:  # shape (6, 2), ok because 6 items have been ordered
   order_id  item_id
0       1.0     11.0
1       1.0     21.0
2       1.0     31.0
3       2.0     12.0
4       3.0     13.0
5       3.0     23.0
If you want, you could also add a third column to keep track of the category of each item (i.e. whether it was item_1, 2 or 3) by uncommenting the lines in the code, which gives you this output:
   order_id  item_id  item_num
0       1.0     11.0       1.0
1       1.0     21.0       2.0
2       1.0     31.0       3.0
3       2.0     12.0       1.0
4       3.0     13.0       1.0
5       3.0     23.0       2.0

Impute missing date (YYYY-WW)

I have a dataframe like the one below, and I need to insert rows where a date is missing or omitted (note these are weekly dates):
A        B        C
'alpha'  2006-01  12
'beta'   2006-02  4
'kappa'  2006-04  2
Required result is something like:
A        B        C
'alpha'  2006-01  12
'beta'   2006-02  4
'gamma'  2006-03  0
'kappa'  2006-04  2
Can it be done?
Create an index based on the YYYY-WW format using to_datetime, resample to weekly, and fix the NaN values (the Python strftime/strptime format-code documentation is useful for the %Y/%W/%w directives used below).
# Set index to datetime - based on yyyy-ww format (the '-1' makes weeks start Monday)
df.index = pd.to_datetime(df['B'] + '-1', format='%Y-%W-%w')
# Resample to weekly - Monday start
df_new = df.resample('W-MON').first().fillna(0)
# Correct format of 'B' column back to yyyy-ww
df_new['B'] = df_new.index.strftime('%Y-%W')
# Optional step to reset index
df_new.reset_index(drop=True, inplace=True)
print(df_new)
[out]
         A        B     C
0  'alpha'  2006-01  12.0
1   'beta'  2006-02   4.0
2        0  2006-03   0.0
3  'kappa'  2006-04   2.0
Use resample() with an adequate filling function. Update:
df['B'] = pd.to_datetime(df['B'])  # for the 'YYYY-WW' strings above, pass an explicit format as in the previous answer
df = df.set_index('B').resample('1D').asfreq().reset_index()
Then you can refill the NaNs of every particular column individually.
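A minimal sketch of that idea adapted to the weekly 'YYYY-WW' data from the question (the Monday-start parsing and the empty/zero fill values are my assumptions, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'A': ["'alpha'", "'beta'", "'kappa'"],
                   'B': ['2006-01', '2006-02', '2006-04'],
                   'C': [12, 4, 2]})

# parse YYYY-WW as the Monday of that week, as in the previous answer
df['B'] = pd.to_datetime(df['B'] + '-1', format='%Y-%W-%w')
df = df.set_index('B').resample('W-MON').asfreq().reset_index()

# refill the NaNs of each particular column individually
df['A'] = df['A'].fillna('')   # no label known for the inserted week
df['C'] = df['C'].fillna(0)    # assume a count of zero for the missing week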

Pandas groupby and append the original values. Count the mean of per row

I have a dataframe of IDs and Values, where the IDs mark repetitions of a trial and the Values are the results.
I want to group by ID so that the Values for the same ID end up in adjacent columns. Finally, I want to calculate the mean of each row.
>>> df
   ID  Value
0   1    1.1
1   2    1.2
2   3    2.4
3   1    1.7
4   2    4.3
5   3    2.2
>>> groups = df.groupby(by='ID')
# Now I cannot figure out what to do for my desired output.
I want the output like:
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30
Use DataFrame.assign to create a new column from a per-group counter built with GroupBy.cumcount, reshape with DataFrame.pivot, rename the columns with DataFrame.add_prefix, add a new column filled with the row means, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df = (df.assign(g=df.groupby('ID').cumcount().add(1))
        .pivot(index='ID', columns='g', values='Value')
        .add_prefix('Value_')
        .assign(Mean=lambda x: x.mean(axis=1))
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30
One possible solution, assuming that you have 2 rows for each ID:
Define a function to be applied to groups:
def fn(grp):
    vals = grp.Value.values
    return [vals[0], vals[-1], grp.Value.mean()]
Then apply it and "move" ID column from index to regular column:
df2 = df.groupby('ID').apply(fn).apply(pd.Series).reset_index()
And the last point is to set proper column names:
df2.columns = ['ID', 'Value_1', 'Value_2', 'Mean']
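For the sample df above, printing df2 should give the same table as the first answer:
   ID  Value_1  Value_2  Mean
0   1      1.1      1.7  1.40
1   2      1.2      4.3  2.75
2   3      2.4      2.2  2.30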
