pandas - convert Panel into DataFrame using lookup table for column headings - python-3.x

Is there a neat way to do this, or would I be best off writing a loop that creates a new DataFrame, looking into the Panel when constructing each column?
I have a 3D array of data that I have put into a Panel, and I want to reorganise it based on a 2D lookup table keyed on two of the axes, so that it becomes a DataFrame whose column labels are taken from my lookup table using the nearest value. A kind of double-VLOOKUP type of thing.
The main thing I am trying to achieve is to be able to quickly locate a time series of data based on the label. If there is a better way, please let me know!
My data is in a Panel that looks like this, with latitude on the items axis and longitude on the minor axis.
data
Out[920]:
<class 'pandas.core.panel.Panel'>
Dimensions: 53 (items) x 29224 (major_axis) x 119 (minor_axis)
Items axis: 42.0 to 68.0
Major_axis axis: 2000-01-01 00:00:00 to 2009-12-31 21:00:00
Minor_axis axis: -28.0 to 31.0
and my lookup table is like this:
label_coords
Out[921]:
lat lon
label
2449 63.250122 -5.250000
2368 62.750122 -5.750000
2369 62.750122 -5.250000
2370 62.750122 -4.750000
I'm kind of at a loss. Quite new to python in general and only really started using pandas yesterday.
Many thanks in advance! Sorry if this is a duplicate, I couldn't find anything that was about the same type of question.
Andy

I figured out a loop-based solution and thought I may as well post it in case someone else has this type of problem.
I changed the way my label coordinates dataframe was being read so that the labels were a column, then used the pivot function:
label_coord = label_coord.pivot('lat','lon','label')
this then produces a dataframe where the labels are the values and lat/lon are the index/columns
then used this loop, where data is a panel as in the question:
data_labelled = pd.DataFrame()
for i in label_coord.columns:      # longitude
    for j in label_coord.index:    # latitude
        lbl = label_coord[i][j]
        data_labelled['%s' % lbl] = data[j][i]
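For reference, the same labelling can be done without the nested loop by building one column per row of the lookup table in a single pass. The sketch below uses a tiny made-up array and exact coordinate matches (the sizes and values are illustrative, not from the question; a real dataset would need a nearest-value step, and newer pandas versions have removed Panel entirely):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 3-D data: time x latitude x longitude
times = pd.date_range('2000-01-01', periods=4, freq='D')
lats = [62.75, 63.25]
lons = [-5.75, -5.25]
values = np.arange(len(times) * len(lats) * len(lons)).reshape(
    len(times), len(lats), len(lons))

# Lookup table: one (lat, lon) pair per label, as in the question
label_coords = pd.DataFrame(
    {'lat': [63.25, 62.75, 62.75], 'lon': [-5.25, -5.75, -5.25]},
    index=pd.Index([2449, 2368, 2369], name='label'))

# One labelled column per row of the lookup table; exact index matching
# is used here for brevity
cols = {}
for label, row in label_coords.iterrows():
    i = lats.index(row['lat'])
    j = lons.index(row['lon'])
    cols[label] = values[:, i, j]

data_labelled = pd.DataFrame(cols, index=times)
```

Each time series can then be located directly by label, e.g. `data_labelled[2449]`.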

Related

Subset misses values

I'm pretty new to coding, but it seems that my subset is missing values and I'm wondering what I am doing wrong. I have a data frame called df_envel with 4 columns: Elevation, distance, profil, date. I am trying to subset this data frame to get only the rows where Elevation equals -0.1 m. I have tried multiple subset methods, but all of them miss some -0.1 values and put some NA's in instead. Here are the subset code lines I tried, which all return the same number of values:
f<- df_envel[which(df_envel$Elevation=='-0.1'),]
f<- df_envel %>% filter(Elevation == '-0.1')
f<- subset(df_envel, Elevation %in% '-0.1')
Does anybody know what I might be doing wrong?
I finally resolved it by changing the data frame into a matrix, converting it to numeric, subsetting, and then turning it back into a data frame. I don't really know why it worked, but it did!
df_envel <- as.matrix(df_envel)
df_envel[,c(1,2)] <- as.numeric(df_envel[,c(1,2)])
f <- df_envel[ which(df_envel[,'Elevation']=='0'),]
f <- as.data.frame(f)
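The underlying pitfall here is language-independent: values that print as -0.1 may be stored as text, or as floats that are not bit-for-bit equal to the literal -0.1, so exact equality silently drops rows. A minimal pandas reproduction of the same effect (made-up data, not from the question):

```python
import numpy as np
import pandas as pd

# Two of these values print as -0.1 but are not exactly equal to it
df = pd.DataFrame({'Elevation': [-0.1, -0.1 + 1e-12, 0.2 - 0.3]})

exact = df[df['Elevation'] == -0.1]               # misses the near-misses
tolerant = df[np.isclose(df['Elevation'], -0.1)]  # catches all three
```

Comparing with a tolerance (or rounding first) is generally safer than exact equality on floating-point columns.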

Z-score normalization in pandas DataFrame (python)

I am using Python 3 (Spyder), and I have a table of type pandas.core.frame.DataFrame. I want to z-score normalise the values in that table (from each value, subtract the mean of its row and divide by the sd of its row), so that each row has mean = 0 and sd = 1. I have tried 2 approaches.
First approach
from scipy.stats import zscore
zetascore_table=zscore(table,axis=1)
Second approach
rows=table.index.values
columns=table.columns
import numpy as np
for i in range(len(rows)):
    for j in range(len(columns)):
        table.loc[rows[i], columns[j]] = (table.loc[rows[i], columns[j]] - np.mean(table.loc[rows[i], :])) / np.std(table.loc[rows[i], :])
table
Both approaches seem to run, but when I check the mean and sd of each row they are not 0 and 1 as they are supposed to be, but other float values. I don't know what the problem could be.
Thanks in advance for your help!
The code below calculates a z-score for each value in a column of a pandas df. It then saves the z-score in a new column (here, called 'num_1_zscore'). Very easy to do.
from scipy.stats import zscore
import pandas as pd
# Create a sample df
df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})
# Calculate the zscores and drop zscores into new column
df['num_1_zscore'] = zscore(df['num_1'])
display(df)
Sorry, thinking about it, I found myself another, easier way to calculate the z-score (subtract the mean of each row and divide the result by the sd of the row) than the for loops:
table = table.T           # transpose, since the functions work column-wise
sd = np.std(table)
mean = np.mean(table)
numerator = table - mean  # numerator in the formula for the z-score
z_score = numerator / sd
z_norm_table = z_score.T  # transpose back: the initial table, but with all values z-scored by row
I checked, and now the mean of each row is 0 or very close to 0 and the sd is 1 or very close to 1, so this was working for me. Sorry, I have little experience with coding, and sometimes easy things require a lot of trials until I figure out how to solve them.
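The transpose trick above can also be written without transposing, using pandas' axis-aware arithmetic. A minimal sketch (random sample data; note that `ddof=0` matches `np.std`, whereas pandas' `.std()` defaults to the sample standard deviation, `ddof=1`):

```python
import numpy as np
import pandas as pd

# Sample table: 3 rows x 5 columns of random values
table = pd.DataFrame(np.random.default_rng(0).normal(size=(3, 5)))

# Row-wise z-score in one vectorised step: subtract each row's mean,
# divide by each row's population standard deviation
z = table.sub(table.mean(axis=1), axis=0).div(table.std(axis=1, ddof=0), axis=0)
```

The `ddof` mismatch between numpy and pandas is one common reason row sds come out as "1 or very close to 1" rather than exactly 1.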

Use excel to analyze lab data and present preliminary findings?

I am trying to build an Excel file to take soil lab test results, organise them, and assign them preliminary labels.
A sample test will include pH, SAR/ESP, and EC readings. Based on those readings I want to assign the results the label Normal, Saline, Saline-Sodic, or Sodic.
Each label has an associated range of values for each criterion. The simplest way to visualise what I'm looking for is a graph with two axes (SAR/ESP vs EC) with 4 quadrants, 3 of which refer to the same pH range.
I have a simple IF/THEN setup going right now that basically assigns each result all the possible labels based on each category, then assigns it the label that comes up the most. However, this is slow and ugly. Is there a way to consolidate this, so that when I import a table where each row is a test, I can have one column calculating this?
For example ph is evaluated:
=IF($I$2<=8.5,"A B D","C")
With A = Saline, B = Saline-Sodic, C= Sodic, D = Normal.
Then SAR is evaluated:
=IF($I$3<=13,"A D","B C")
etc.
Then:
=COUNTIF($B$9:$B$12,"A*")
Iterated for each label.
The most frequent label is then picked out:
=INDEX(Table1[Column1],MATCH(MAX(Table1[Column3]),Table1[Column3],0))
This is working properly.
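The voting scheme can be collapsed into a single decision rule, since the quadrants are defined by threshold comparisons. Here is a sketch in Python rather than a worksheet formula; the thresholds (pH 8.5, SAR 13, EC 4 dS/m) follow the common USDA-style convention and are assumptions, not values taken from the question's workbook:

```python
def classify_soil(ph, sar, ec):
    """Assign a preliminary soil label from pH, SAR and EC readings.

    Thresholds are assumed (pH 8.5, SAR 13, EC 4 dS/m); adjust them
    to match the ranges used in the actual workbook.
    """
    saline = ec > 4
    sodic = sar > 13
    if saline and sodic:
        return 'Saline-Sodic'
    if saline:
        return 'Saline'
    if sodic or ph > 8.5:
        return 'Sodic'
    return 'Normal'
```

The same cascade translates directly into one nested IF (or IFS) formula per row, avoiding the intermediate label-voting cells.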

Very Basic Python 3.6 Plotting Issue

So I have a rather easy question regarding some plotting issues. I don't have the greatest level of Python knowledge; it's been a few months since I looked at it, and there isn't anything I can see that would help me.
I have the following data frame:
Date Open High Low Close Adj Close Volume
0 11/01/2018 86.360001 87.370003 85.930000 86.930000 86.930000 143660001
1 10/01/2018 87.000000 87.190002 85.980003 86.080002 86.080002 108223002
This isn't all of the data; there's 3000+ rows of it.
QUESTION: I'm trying to plot Adj Close vs. Date. However, due to the index column, which I don't actually want, I end up with a plot of Adj Close vs. the index column. No use obviously.
I've used:
bp['Adj Close'].plot(label='BP',figsize=(16,8),title='Adjusted Closing Price')
So really it's a case of, where do I put the ['Date'] part into the code, so the Index column isn't used?
Many thanks for any help.
You first need to convert the column with to_datetime:
bp['Date'] = pd.to_datetime(bp['Date'])
and then use x and y parameters in DataFrame.plot:
bp.plot(x='Date', y='Adj Close', label='BP',figsize=(16,8),title='Adjusted Closing Price')
Or set the Date column as the index with set_index and then use Series.plot:
bp.set_index('Date')['Adj Close'].plot(label='BP',figsize=(16,8),title='Adjusted Closing Price')
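One caveat worth checking: the sample rows (11/01/2018 followed by 10/01/2018) look like consecutive day-first dates, but `pd.to_datetime` parses ambiguous dates month-first by default, which would silently turn them into 1 Nov and 1 Oct. Assuming the data really is day-first, pass `dayfirst=True` (or an explicit format):

```python
import pandas as pd

# Two rows mimicking the sample data from the question
bp = pd.DataFrame({'Date': ['11/01/2018', '10/01/2018'],
                   'Adj Close': [86.93, 86.08]})

# dayfirst=True parses these as 11 Jan and 10 Jan rather than 1 Nov and 1 Oct
bp['Date'] = pd.to_datetime(bp['Date'], dayfirst=True)
```

An explicit `format='%d/%m/%Y'` is even safer, since it raises on anything unexpected instead of guessing.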

How can I calculate values in a Pandas dataframe based on another column in the same dataframe

I am attempting to create a new column of values in a Pandas dataframe that are calculated from another column in the same dataframe:
df['ema_ideal'] = df['Adj Close'].ewm(span=df['ideal_moving_average'], min_periods=0, ignore_na=True).mean
However, I am receiving the error:
ValueError: The truth of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any(), or a.all().
If I have the span set to 30, or some integer, I do not receive this error. Also, ideal_moving_average is a column of float.
My two questions are:
Why exactly am I receiving the error?
How can I incorporate the column values from ideal_moving_average into the df['ema_ideal'] column (subquestion as I am new to Pandas - is this column a Series within the dataframe?)
Thanks for the help!
EDIT: Example showing Adj Close data, in bad formatting
Date Open High Low Close Adj Close
2017-01-03 225.039993 225.830002 223.880005 225.240005 222.073914
2017-01-04 225.619995 226.750000 225.610001 226.580002 223.395081
2017-01-05 226.270004 226.580002 225.479996 226.399994 223.217606
2017-01-06 226.529999 227.750000 225.899994 227.210007 224.016220
2017-01-09 226.910004 227.070007 226.419998 226.460007 223.276779
2017-01-10 226.479996 227.449997 226.009995 226.460007 223.276779
I think something like this will work for you:
df['ema_ideal'] = df.apply(lambda x: df['Adj Close'].ewm(span=x['ideal_moving_average'], min_periods=0, ignore_na=True).mean().loc[x.name], axis=1)
Providing axis=1 to DataFrame.apply allows you to access the data row-wise like you need; the trailing .loc[x.name] keeps only the value for the current row, so a single column is assigned.
There's absolutely no issue creating a dataframe column from another dataframe.
The error you're receiving is a different matter: it is raised when you try to combine Series with the logical keywords and, or, not, etc.
In general, to avoid this error you must compare Series element-wise, using for example & instead of and, or ~ instead of not, or using numpy to do the element-wise comparison.
Here, the issue is that you're trying to use a Series as the span of your ema, and pandas' ewm function only accepts a scalar span.
You could, for example, calculate the ema for each possible period, and then regroup them into a Series that you set as the ema_ideal column of your dataframe.
For anyone wondering, the problem was that span could not take multiple values, which was happening when I tried to pass df['ideal_moving_average'] into it. Instead, I used the below code, which seemed to go line by line passing the value for that row into span.
df['30ema'] = df['Adj Close'].ewm(span=df.iloc[-1]['ideal_ma'], min_periods=0, ignore_na=True).mean()
EDIT: I will accept this as correct for now, until someone shows that it doesn't work or can create something better, thanks for the help.
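Since ewm() only takes a scalar span, a genuinely row-varying span has to be applied by hand via the EWMA recurrence ema[t] = alpha[t] * x[t] + (1 - alpha[t]) * ema[t-1], with alpha[t] = 2 / (span[t] + 1). A minimal sketch with made-up numbers (column names borrowed from the question; this is an illustration, not the accepted answer's method):

```python
import numpy as np
import pandas as pd

# Toy data: prices plus a per-row ideal span
df = pd.DataFrame({'Adj Close': [225.24, 226.58, 226.40, 227.21, 226.46],
                   'ideal_moving_average': [10.0, 10.0, 20.0, 20.0, 30.0]})

# Apply the EWMA recurrence manually, letting alpha change per row
ema = np.empty(len(df))
ema[0] = df['Adj Close'].iloc[0]
for t in range(1, len(df)):
    alpha = 2.0 / (df['ideal_moving_average'].iloc[t] + 1.0)
    ema[t] = alpha * df['Adj Close'].iloc[t] + (1.0 - alpha) * ema[t - 1]

df['ema_ideal'] = ema
```

This is O(n) and matches ewm(span=s, adjust=False) whenever the span column is constant.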
