What data structure is returned by pandas.read_excel, and how can I reference the columns of the underlying DataFrames?

I am trying to plot columns using pandas running in an IPython environment with Python 3.4.3. Using the read_excel function, I try to convert an Excel file to a DataFrame as follows:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_excel('/Path/to/file.xlsx', sheetname='Sheet1')
print(data)
which results in
{'Sheet1':          Day   a   b      c   d
0     Monday  24   1   34.0   3
1    Tuesday   4   7    8.0   2
2  Wednesday   3   6    3.0   1
3   Thursday   2   6    4.0   0
4     Friday   1  34  -11.5  -1
5   Saturday   0   2  -21.0  -2
6     Sunday  -1   4  -30.5  -3}
I know this output is not right, as it doesn't match what I get when a test excel file is made from scratch. It also prevents me from even printing the columns using:
print(data.columns)
which returns
AttributeError: 'dict' object has no attribute 'columns'
Is there a simple way to reformat the data so columns can be referenced and graphed?

I think data is a dictionary of dataframes, with one entry per sheet of your excel file; you should be able to access the individual dataframes with data['Sheet1'].
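A minimal sketch of that approach (the path, sheet name, and column names are taken from the question; note that newer pandas versions spell the keyword sheet_name rather than sheetname):
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_excel('/Path/to/file.xlsx', sheetname='Sheet1')
df = data['Sheet1']  # pick the DataFrame out of the dict first

print(df.columns)  # now works: Index(['Day', 'a', 'b', 'c', 'd'], dtype='object')
df.plot(x='Day', y=['a', 'b', 'c', 'd'])  # plot the numeric columns against Day
plt.show()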

Related

How to fill NaN values based on the top and bottom strings with highest frequency

I have a dataframe of string values with missing values in it. It needs to be filled according to the conditions below.
For each NaN value, check the 3 rows above and the 3 rows below, and replace the NaN with the most frequent value among those 6 rows.
If two strings occur with equal (highest) frequency among those 6 rows, replace the NaN with the one that appears at the lowest index of the 6 rows.
My DataFrame:
     reading
0       talk
1       kill
2        NaN
3   vertical
4       type
5       kill
6        NaN
7   vertical
8   vertical
9       type
10   durable
11       NaN
12   durable
13  vertical
Expected output:
     reading
0       talk
1       kill
2       kill
3   vertical
4       type
5       kill
6   vertical
7   vertical
8   vertical
9       type
10   durable
11  vertical
12   durable
13  vertical
Here is the minimum reproducible code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'reading': ['talk', 'kill', np.nan, 'vertical', 'type', 'kill', np.nan,
                               'vertical', 'vertical', 'type', 'durable', np.nan, 'durable', 'vertical']})
def filldf(df):
    # Do the logic here
    return df
I am not sure how to approach this problem. Any help would be appreciated!
If you don't have too many NaN values, you can iterate over the indices of the NaN "reading" rows, take the mode of the surrounding 6 values, and assign the results back. One subtlety: Series.mode() returns multiple modes sorted alphabetically, so mode().iloc[0] would break ties alphabetically rather than by position; to honour the lowest-index rule, pick the mode that occurs earliest in the window:
msk = df['reading'].isna()
# window of up to 3 rows above and 3 below (max() keeps the slice inside the frame);
# among the modal values, keep the one that occurs at the lowest index
first_mode = lambda w: w.loc[w.isin(w.mode()).idxmax()]
df.loc[msk, 'reading'] = [first_mode(df.loc[max(0, i-3):i+3, 'reading'].dropna())
                          for i in df.index[msk]]
Output:
     reading
0       talk
1       kill
2       kill
3   vertical
4       type
5       kill
6   vertical
7   vertical
8   vertical
9       type
10   durable
11  vertical
12   durable
13  vertical

Split up time series per year for plotting

I would like to plot a time series, starting Oct-2015 and ending Feb-2018, in one graph, with each year as a single line. The values are int64 and live in a Pandas DataFrame; the date is a datetime64[ns] column of the same DataFrame.
How would I create a graph from Jan-Dec with 4 lines, one per year?
graph['share_price'] and graph['date'] are used. I have tried Grouper, but that somehow takes the Oct-2015 values and mixes them with the January values of all the other years.
This groupby is close to what I want, but I lose the information about which year each entry of the list belongs to.
graph.groupby('date').agg({'share_price':lambda x: list(x)})
I have also created a DataFrame with 4 columns, one per year, but I still don't know how to arrange those columns so that I can plot the graph the way I want.
You can achieve this by:
extracting the year from the date
replacing the dates by the equivalent without the year
setting both the year and the date as index
unstacking the values by year
At this point, each year will be a column, and each date within the year a row, so you can just plot normally.
Here's an example.
Assuming that your DataFrame looks something like this:
>>> import pandas as pd
>>> import numpy as np
>>> index = pd.date_range('2015-10-01', '2018-02-28')
>>> values = np.random.randint(-3, 4, len(index)).cumsum()
>>> df = pd.DataFrame({
... 'date': index,
... 'share_price': values
... })
>>> df.head()
        date  share_price
0 2015-10-01            0
1 2015-10-02            3
2 2015-10-03            2
3 2015-10-04            5
4 2015-10-05            4
>>> df.set_index('date').plot()
You would transform the DataFrame as follows:
>>> df['year'] = df.date.dt.year
>>> df['date'] = df.date.dt.strftime('%m-%d')
>>> unstacked = df.set_index(['year', 'date']).share_price.unstack(-2)
>>> unstacked.head()
year   2015  2016  2017  2018
date
01-01   NaN  28.0 -16.0  21.0
01-02   NaN  29.0 -14.0  22.0
01-03   NaN  29.0 -16.0  22.0
01-04   NaN  26.0 -15.0  23.0
01-05   NaN  25.0 -16.0  21.0
And just plot normally:
unstacked.plot()

Can you resample a series without dates?

I have a time series spanning months 1 to 420 (35 years). I would like to convert it to an annual series, using the average of the 12 months in each year, so I can put it in a dataframe I have with annual datapoints. I have it set up using a range with steps of 12, but it gets kind of messy. Ideally I would like to use the resample function, but I'm having trouble since there are no dates. Is there any way around this?
There's no need to resample in this case. Just use groupby with integer division to obtain the average over the years.
import numpy as np
import pandas as pd
# Sample Data
np.random.seed(123)
df = pd.DataFrame({'Months': np.arange(1, 421, 1),
                   'val': np.random.randint(1, 10, 420)})
# Yearly average: months 1-12 -> year 0, 13-24 -> year 1; subtract 1 before // to get this grouping
df.groupby((df.Months-1)//12).val.mean().reset_index().rename(columns={'Months': 'Year'})
Outputs:
    Year       val
0      0  3.083333
1      1  4.166667
2      2  5.250000
3      3  4.416667
4      4  5.500000
5      5  4.583333
...
31    31  5.333333
32    32  5.000000
33    33  6.250000
34    34  5.250000
Feel free to add 1 to the Year column or whatever you need to make it consistent with the indexing in your other annual df. Alternatively, you could just use df.groupby((df.Months+11)//12).val.mean() to get the Year to start at 1.
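If you do want to use resample, a minimal sketch under one assumption: attach a synthetic monthly DatetimeIndex (the start date below is arbitrary, chosen only to anchor the 12-month groups) and resample annually:
# give the 420 monthly values an arbitrary monthly DatetimeIndex
df.index = pd.date_range('2000-01-01', periods=420, freq='MS')
annual = df['val'].resample('YS').mean()  # 'AS' in older pandas versions
This produces the same 35 yearly means as the groupby above, just labelled by the synthetic years.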

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calorie intake. It has 4 columns: cal, day, month, year. It looks like this:
      cal  month  year  day
3668.4333      1  2002   10
3652.2498      1  2002   11
3647.8662      1  2002   12
3646.6843      1  2002   13
...
3661.9414      2  2003   14
# data types
cal      float64
month      int64
year       int64
day        int64
I am trying to do some simple time series analysis, and hence would like to parse month, year, and day into a single datetime column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)
My questions are: (1) how do I parse the dates, and (2) how do I define the data type of the new column? I know there are quite a few other similar questions and answers, but I can't make it work so far.
You can use the parse_dates parameter of read_csv and define the column names in a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
  year_month_day        cal
0     2002-01-10  3668.4333
1     2002-01-11  3652.2498
2     2002-01-12  3647.8662
3     2002-01-13  3646.6843
4     2003-02-14  3661.9414
print (df.dtypes)
year_month_day    datetime64[ns]
cal                      float64
dtype: object
Then you can rename the column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
        date        cal
0 2002-01-10  3668.4333
1 2002-01-11  3652.2498
2 2002-01-12  3647.8662
3 2002-01-13  3646.6843
4 2003-02-14  3661.9414
Or better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
       dates        cal
0 2002-01-10  3668.4333
1 2002-01-11  3652.2498
2 2002-01-12  3647.8662
3 2002-01-13  3646.6843
4 2003-02-14  3661.9414
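From there, for the simple time series analysis mentioned in the question, a short follow-up sketch (the monthly resample is just an illustration):
df = df.set_index('dates').sort_index()
print (df['cal'].resample('MS').mean())  # e.g. average calorie intake per month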

Using relative positioning with Python 3.5 and pandas

I am formatting some csv files, and I need to add columns that use other columns for arithmetic, like in Excel: B3 = SUM(A1:A3)/3, then B4 = SUM(A2:A4)/3. I've looked up relative indexing and haven't found what I'm trying to do.
def formula_columns(csv_list, dir_env):
    for file in csv_list:
        df = pd.read_csv(dir_env + file)
        avg_12(df)
        print(df[10:20])

# Create AVG(12) Column
def avg_12(df):
    df['AVG(12)'] = df['Price']
    # Right here I want to set each value of 'AVG(12)' to equal
    # the sum of the value of price from its own index plus the
    # previous 11 indexes
    df.loc[:10, 'AVG(12)'] = 0
I would imagine this to be a common task, so I assume I'm looking in the wrong places. If anyone has some advice I would appreciate it. Thanks.
That can be done with the rolling method:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 5, 10), columns = ['A'])
df
Out[151]:
   A
0  2
1  4
2  1
3  1
4  4
5  2
6  4
7  2
8  4
9  1
Take the averages of A1:A3, A2:A4 etc:
df.rolling(3).mean()
Out[152]:
          A
0       NaN
1       NaN
2  2.333333
3  2.000000
4  2.000000
5  2.333333
6  3.333333
7  2.666667
8  3.333333
9  2.333333
This requires pandas 0.18 or later. For earlier versions, use pd.rolling_mean():
pd.rolling_mean(df['A'], 3)
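Applied to the AVG(12) column from the question, a sketch (assuming the same 'Price' column; the fillna(0) mimics the df.loc[:10, 'AVG(12)'] = 0 line in the original code):
df['AVG(12)'] = df['Price'].rolling(12).mean().fillna(0)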
