read_excel reads cells with formulae as NaN (instead of the actual value)

path = "C:\\Users\\Adam\\Desktop\\Stock Trackers\\New folder\\New folder\\Stock Tracker WK36 mTeam.xlsx"
df = pd.read_excel(path,usecols="A:D,R",index_col=None)
Column R (in the file referenced above) is a column with a simple SUM formula. When I use read_excel as above, columns A-D in the resulting dataframe df are fine, as these are constants, but column R comes back as all NaN. How can I use pandas to read the underlying cell value instead of getting NaN?

If you are interested in getting the results of an Excel formula computation into a dataframe: given an Excel sheet where the Total Qty column is a formula of the form SUM(D:F), the last column is a formula of the form G*C, and the formula in cell H5 is SUM(H2:H4), reading it directly with pandas.read_excel(fileName_) yields:
   item Description  Unit Cost  Part A Qty  Part B Qty  Part C Qty  Total Qty  Total Cost
0   1.0    System A      25.10         1.0         2.0         1.0        4.0       100.4
1   2.0      Part B      15.25         3.0         0.0         3.0        6.0        91.5
2   3.0      Part C       6.30         6.0         5.0         1.0       12.0        75.6
3   NaN         Sum        NaN         NaN         NaN         NaN        NaN       267.5
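A note on why column R can still come back as NaN: pandas (via openpyxl) reads the cached result that Excel stored for each formula cell, not the formula itself. If the workbook was last written by a tool that does not compute formulas (for example a script using openpyxl), there is no cached result, and the cell reads as NaN. A minimal sketch to inspect what is actually stored, assuming for illustration that the formula sits in cell R2:

from openpyxl import load_workbook

path = "Stock Tracker WK36 mTeam.xlsx"  # adjust to your file

# data_only=True returns the value Excel cached for the formula cell;
# None here is what pandas turns into NaN
wb = load_workbook(path, data_only=True)
print(wb.active["R2"].value)

# the default (data_only=False) returns the formula text instead
wb_formulas = load_workbook(path)
print(wb_formulas.active["R2"].value)

If the cached value is None, opening the file in Excel and re-saving it (so Excel computes and stores the results) should let read_excel pick up the numbers.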

Related

Pandas apply with eval not giving NaN as result when NaN is in the column it's calculating on

I have to support the ability for users to run any formula against a frame to produce a new column.
I may have a frame that looks like
dim01 dim02 msr01
0 A 25 1.0
1 B 26 5.3
2 C 53 NaN
I interpret the user's input, allowing them to run a formula built from supported functions, standard operators, and other columns.
So a formula might look like SQRT([msr01]*100+7)
I convert the user input to Python syntax so this would evaluate to something like
formula_str = '(math.sqrt((row.msr01*100)+7))'
I then apply it to my pandas dataframe like this:
data_frame['msr02'] = data_frame.apply(lambda row: eval(formula_str), axis=1)
This was working well until I hit data with a NaN in a column used in the calculation. When that happens, I get a frame like this in return:
dim01 dim02 msr01 msr02
0 A 25 1.0 10.344
1 B 26 5.3 23.173
2 C 53 NaN 7.342
So it appears that the eval is not evaluating the NaN correctly.
I am using a lexer/parser to ensure that the user-supplied formula isn't dangerous and to convert the everyday user syntax into Python functions that work against pandas columns.
Any advice on how to fix this?
Perhaps I should include something in the lambda that checks whether any required column is NaN and just hard-codes the result to NaN in that case? But that doesn't seem like the best solution to me.
I did see this question, which is similar, but I didn't think it answered my exact need.
So you can try the vectorized equivalent:
df.msr01.mul(100).add(7)**0.5
Out[716]:
0 10.34408
1 23.17326
2 NaN
Name: msr01, dtype: float64
Also, with your original code:
df.apply(lambda row: eval(formula_str), axis=1)
Out[714]:
0 10.34408
1 23.17326
2 NaN
dtype: float64
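If you can translate user formulas to NumPy functions instead of math functions, NaN handling comes for free, because NumPy ufuncs operate on whole columns and propagate NaN element-wise. A minimal sketch, rebuilding the three-row frame from the question:

import numpy as np
import pandas as pd

data_frame = pd.DataFrame({'dim01': ['A', 'B', 'C'],
                           'dim02': [25, 26, 53],
                           'msr01': [1.0, 5.3, np.nan]})

# np.sqrt works on the whole column at once and propagates NaN,
# so the last row comes out as NaN with no special-casing
data_frame['msr02'] = np.sqrt(data_frame['msr01'] * 100 + 7)
print(data_frame)

This is also much faster than apply with eval, since it avoids a Python-level loop over the rows.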

Pandas converts integer numbers to real numbers when reading from Excel

I recently started exploring python for analyzing excel data.
I have an Excel file with two worksheets, each containing one matrix (m = 1000 rows and n = 999 columns). The elements of the two matrices are related: one matrix holds displacement values and the other holds the force values corresponding to each displacement. The displacements and corresponding forces are obtained from m = 1000 numerical simulations with n = 999 increments. Is it possible to identify the force values that correspond only to displacement values that are integer numbers? Or, alternatively, is it possible to replace all the decimal numbers in the matrix of displacements with 0? I tried reading the Excel file into a pandas dataframe, but all values in the matrix of displacements are presented as "real numbers" (e.g. the numbers 1, 2, 3 from Excel appear with a floating point, as 1., 2., 3., in Python).
Thank you for your attention.
Let's make an example on a smaller scale (3 × 3).
I prepared an Excel file with 2 sheets and read them:
displ = pd.read_excel('Input_2.xlsx', 'Displ')
forces = pd.read_excel('Input_2.xlsx', 'Forces')
Both DataFrames contain:
displ:
     C1    C2    C3
0  10.0  12.1  11.3
1  12.5  13.0  13.5
2  12.6  13.6  13.8

forces:
      C1     C2     C3
0  120.1  130.2  140.3
1  150.4  160.5  170.6
2  180.7  190.8  200.9
To identify elements of displ containing integer numbers
(actually, still float numbers, but with the fractional
parts == 0.0), you can run:
displ.mod(1.0) == 0.0
and you will get:
C1 C2 C3
0 True False False
1 False True False
2 False False False
And to get corresponding force values and NaN
for other values, you can run:
forces.where(displ.mod(1.0) == 0.0)
getting:
C1 C2 C3
0 120.1 NaN NaN
1 NaN 160.5 NaN
2 NaN NaN NaN
Another option is to get a list of indices in displ where
the corresponding element has zero fractional part.
Actually it is a Numpy function, so it operates on the
underlying Numpy array and returns integer (zero-based)
indices:
ind = np.nonzero((displ.mod(1.0) == 0.0).values)
The result is:
(array([0, 1], dtype=int64), array([0, 1], dtype=int64))
so it is a 2-tuple of indices: row indices and column indices.
You can also retrieve a list of indicated elements from
forces, actually also from the underlying Numpy array,
running:
forces.values[ind]
The result is:
array([120.1, 160.5])
To replace "integer" elements of displ with zeroes, you
can run:
displ.mask(displ.mod(1.0) == 0.0, 0, inplace=True)
Now displ contains:
C1 C2 C3
0 0.0 12.1 11.3
1 12.5 0.0 13.5
2 12.6 13.6 13.8
Note that the "wanted" elements are still float zeroes (0.0). This is by design in pandas: each column has a single dtype that must fit all of its elements (in this case, float).
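Putting the pieces together, here is a self-contained sketch that reproduces the steps above, with in-memory frames standing in for the two Excel sheets:

import numpy as np
import pandas as pd

# Stand-ins for the 'Displ' and 'Forces' sheets used above
displ = pd.DataFrame({'C1': [10.0, 12.5, 12.6],
                      'C2': [12.1, 13.0, 13.6],
                      'C3': [11.3, 13.5, 13.8]})
forces = pd.DataFrame({'C1': [120.1, 150.4, 180.7],
                       'C2': [130.2, 160.5, 190.8],
                       'C3': [140.3, 170.6, 200.9]})

is_whole = displ.mod(1.0) == 0.0                   # True where displacement has no fractional part
print(forces.where(is_whole))                      # forces at whole-number displacements, NaN elsewhere
print(forces.values[np.nonzero(is_whole.values)])  # array([120.1, 160.5])
print(displ.mask(is_whole, 0))                     # whole-number displacements replaced by 0.0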

Associate values in Excel between different columns

I have two different series of data that look something like this
A B
1 0.998
2 0.9975
3 0.997
4 0.9967
5 0.9962
6 0.9960
.
.
.
and
C D
1 240.5
1.3 249.5
1.7 241.45
2 239.0
2.5 124.5
3 125.6
3.4 235.1
3.5 236.4
.
.
.
How can I merge the two in Excel so that the end result looks like this?
C D E
1 240.5 0.998
1.3 249.5 0.998
1.7 241.45 0.998
2 239.0 0.9975
2.5 124.5 0.9975
3 125.6 0.997
3.4 235.1 0.997
3.5 236.4 0.997
Essentially, for each value in column C, I need to add the B value corresponding to its integer part in the A/B series. The whole dataset is 3500 rows long, so I am looking for an automated solution before I resort to painstakingly pasting each value into position.
In column E you can create a formula that uses your first table (A/B) as a LOOKUP table. Truncate the value in Column C as your lookup value.
So in column E, use a formula something like,
=LOOKUP(TRUNC(Cx), A1:An, B1:Bn)
where x is the row number of your C/D/E table, and n is the last row in your A/B table.
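If the same data is handled in pandas rather than Excel, a rough equivalent of the LOOKUP/TRUNC approach is to merge on the truncated key. A sketch, with column names taken from the tables above:

import numpy as np
import pandas as pd

ab = pd.DataFrame({'A': [1, 2, 3],
                   'B': [0.998, 0.9975, 0.997]})
cd = pd.DataFrame({'C': [1, 1.3, 1.7, 2, 2.5, 3, 3.4, 3.5],
                   'D': [240.5, 249.5, 241.45, 239.0, 124.5, 125.6, 235.1, 236.4]})

# TRUNC equivalent: the integer part of C is the lookup key into A/B
cd['A'] = np.trunc(cd['C']).astype(int)
result = cd.merge(ab, on='A').drop(columns='A').rename(columns={'B': 'E'})
print(result)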

What is the difference between interpolation and imputation?

I just learned that you can handle missing data/NaN with imputation and interpolation. What I found is that interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points, while imputation means replacing the missing data with, e.g., the mean of the column. But are there any differences beyond that? When is it best practice to use each of them?
Interpolation
Interpolation (linear) is basically a straight line between two given points where data points between these two are missing:
[Figure: the two red points are known; the blue point in between is missing and lies on the straight line connecting them. Source: Wikipedia]
Okay, nice explanation, but show me with data.
First of all, the formula for linear interpolation between two known points (x0, y0) and (x1, y1) is the following:
y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
Let's say we have the three data points from the graph above:
df = pd.DataFrame({'Value':[0, np.NaN, 3]})
Value
0 0.0
1 NaN
2 3.0
As we can see row 1 (blue point) is missing.
So, following the formula from above with (x0, y0) = (0, 0) and (x1, y1) = (2, 3), the value at x = 1 is:
0 + (1 - 0) * (3 - 0) / (2 - 0) = 1.5
If we interpolate these using the pandas method Series.interpolate:
df['Value'].interpolate()
0 0.0
1 1.5
2 3.0
Name: Value, dtype: float64
For a bigger dataset it would look as follows:
df = pd.DataFrame({'Value':[1, np.NaN, 4, np.NaN, np.NaN,7]})
Value
0 1.0
1 NaN
2 4.0
3 NaN
4 NaN
5 7.0
df['Value'].interpolate()
0 1.0
1 2.5
2 4.0
3 5.0
4 6.0
5 7.0
Name: Value, dtype: float64
Imputation
When we impute the data with the (arithmetic) mean, we follow the following formula:
sum(all points) / n
So for our second dataframe we get:
(1 + 4 + 7) / 3 = 4
So if we impute our dataframe with Series.fillna and Series.mean:
df['Value'].fillna(df['Value'].mean())
0 1.0
1 4.0
2 4.0
3 4.0
4 4.0
5 7.0
Name: Value, dtype: float64
I will answer the second part of your question i.e. when to use what.
We use both techniques depending upon the use case.
Imputation:
If you are given a dataset of patients with a disease (say pneumonia) and there is a feature called body temperature, then any null values in this feature can be replaced by the average value, i.e. imputation.
Interpolation:
If you are given a dataset of a company's share price, you know that the market is closed every Saturday and Sunday, so those are missing values. These values can be filled in from the surrounding Friday and Monday values, i.e. interpolation.
So, you can choose the technique depending upon the use case.
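For the share-price case specifically, pandas can weight the interpolation by the actual time gaps if the series has a DatetimeIndex. A small sketch with made-up prices (the Friday and Monday values are known, the weekend is missing):

import numpy as np
import pandas as pd

idx = pd.date_range('2023-01-06', '2023-01-09', freq='D')      # Friday through Monday
prices = pd.Series([100.0, np.nan, np.nan, 106.0], index=idx)

# method='time' spaces the filled values according to the time gaps
print(prices.interpolate(method='time'))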

How to change the location of the column name in pandas?

Exercise 1: What are the minimum and maximum prices of each variety of wine? Create a DataFrame whose index is the variety category and whose values are the min and max thereof.
answ = pd.DataFrame()
answ['price_max_variety'] = reviews.groupby('variety').price.max()
answ['price_min_variety'] = reviews.groupby('variety').price.min()
answ.head()
Output:
             price_max_variety  price_min_variety
variety                                            # <- this row is my problem: what is it?
Abouriou                  75.0               15.0
Agiorgitiko               66.0               10.0
Aglianico                180.0                6.0
Aidani                    27.0               27.0
Airen                     10.0                8.0
I would like to do this right. I have no idea how to make this display properly, and because the words involved are so generic I can't find relevant information.
The blank grey row that you've pointed out is there to make room for the name of the DataFrame's index, which is variety. That index came from the default behaviour of df.groupby(): the grouped-by column(s) end up in the index of the resulting DataFrame.
To override this, try df.groupby('variety', as_index=False). Or, if you have a DataFrame with an index that you want to move into a column, run df.reset_index().
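A sketch of both options, assuming a reviews frame with variety and price columns as in the exercise:

import pandas as pd

# Option 1: keep 'variety' as a regular column from the start
answ = reviews.groupby('variety', as_index=False).agg(
    price_max_variety=('price', 'max'),
    price_min_variety=('price', 'min'),
)

# Option 2: build it as in the question, then move the index into a column
answ = pd.DataFrame()
answ['price_max_variety'] = reviews.groupby('variety').price.max()
answ['price_min_variety'] = reviews.groupby('variety').price.min()
answ = answ.reset_index()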
