First explaining the dataframe, the values of columns '0-156', '156-234', '234-546' .... '> 76830' is the percentage distribution for each range of distances in meters, totaling 100%.
Column 'Cell Name' refers to the data element of the other columns and the column 'Distance' is the column that will trigger the desired sum.
I need to sum the values of the columns '0-156', '156-234', '234-546' .... '> 76830' which are less than the value of the 'Distance' (Meters) column.
Below creation code for testing.
import pandas as pd
# initialize list of lists
data = [['Test1',0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
['Test2',0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
['Test3',0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Cell Name','0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','>76830','Distance'])
Example of what should be done:
The value of column 'Distance' = 158.412286772863 therefore would have to sum the values <= of the following columns, 0-156, '156-234' totalizing 19.43116162 %.
Thanks so much!
As I understand it, you want to sum up all the percentage values in a row, where the lower value of the column-description (in case of '0-156' it would be 0, in case of '156-234' it would be 156, and so on...) is smaller than the value in the distance column.
First I would suggest, that you transform your string-like column-names into values, as an example:
lowerlimit=df.columns[2]
>>'156-234'
Then read the string only till the '-' and make it a number
int(lowerlimit[:lowerlimit.find('-')])
>> 156
You can loop this through all your columns and make a new row for the lower limits.
For a bit more simplicity I left out the first column for your example, and added another first row with the lower limits of each column, that you could generate as described above. Then this code works:
data = [[0,156,234,546,1014,1950,3510,6630,11430,30030,53430,76830,1e-23],[0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
[0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
[0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','76830-','Distance'])
df['lastindex']=None
df['sum']=None
After creating basically your dataframe, I add two columns 'lastindex' and 'sum'.
Then I am searching for the last index in every row, that is has its lower limit below the distance given in that row (df.iloc[x,-3]); afterwards I'm summing up the respective columns in that row.
for i in np.arange(1,len(df)):
df.at[i,'lastindex']=np.where(df.iloc[0,:-3]<df.iloc[i,-3])[0][-1]
df.at[i,'sum']=sum(df.iloc[i][0:df.at[i,'lastindex']+1])
I hope, this is helpful. Best, lepakk
Related
I'm working with trading data and Pandas. Given a 4-column OHLC pandas DataFrame that is 100 rows in length, I'm trying to calculate if an "Upper Shadow" exists or not for an individual row and store the result in its own column. To calculate if an "Upper Shadow" exists all you have to do is take the high (H) value of the row and subtract the open (O) value if the close (C) value is less than the open value. Otherwise, you have to subtract the close value.
Right now I'm naively doing this in a for loop where I iterate over each row with an if statement.
for index, row in df.iterrows():
if row["close"] >= row["open"]:
df.at[index,"upper_shadow"]=float(row["high"]) - float(row["close"])
else:
df.at[index,"upper_shadow"]=float(row["high"]) - float(row["open"])
Is there a better way to do this?
You can use np.maximum to calculate the maximum of close and open in a vectorized way:
import numpy as np
df['upper_shadow'] = df['high'] - np.maximum(df['close'], df['open'])
I think #Psidom's solution is what you are looking for. However the following piece of code is another way of writing what you already have using apply-lambda...
df["upper_shadow"] = df.apply(lambda row: float(row["high"]) - float(row["close"]) if row["close"] >= row["open"] else float(row["high"]) - float(row["open"]),axis=1)
Actually I want to find the maximum values of the teachers_prefix with the teacher's prefix (for eg: in this case mrs, and the value which is 639471
Input can be found
Try using pd.Series.idxmax(). This returns the index of the column/series where you get the max value.
#grouped_df[[grouped_df.idxmax()]] #If series
grouped_df.loc[grouped_df.idxmax()] #If dataFrame
teacher_prefix
mrs 639471
Name: sum, dtype: int64
If you want one or more than one rows based on top n largest values -
grouped_df.nlargest(2, 'sum')
I am trying to assign the values from a single row in a DataFrame to multiple rows. I have a DF def_security where the first row looks like this (the column headers are AGG and SPY, and the row index is the date)
AGG SPY
2006-01-01 95 21
The rest of the DF all have zeros.
AGG SPY
2006-01-02 0 0
...........
I would like to assign the same value as the first row (the values are calculated and not assigned scalars) to the next 250 rows of def_security. The column headers are user-input and the number of columns or the column headers are not pre-defined. However, there are same number of columns in each row
I am trying with the code
def_security.iloc[1:251] = def_security.iloc[0]
but it is returning error msg "could not broadcast input array from shape(250) into shape (250,2)".
What is the easiest way to do this ?
You were nearly right :-)
Try this:
def_security.iloc[1:251] = def_security.iloc[0].values
I assume this code would work:
def_security.iloc[1:251,-2]=def_security.iloc[0].at['AGG']
def_security.iloc[1:251,-1]=def_security.iloc[0].at['SPY']
What is the meaning of below lines., especially confused about how iloc[:,1:] is working ? and also data[:,:1]
data = np.asarray(train_df_mv_norm.iloc[:,1:])
X, Y = data[:,1:],data[:,:1]
Here train_df_mv_norm is a dataframe --
Definition: pandas iloc
.iloc[] is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array.
For example:
df.iloc[:3] # slice your object, i.e. first three rows of your dataframe
df.iloc[0:3] # same
df.iloc[0, 1] # index both axis. Select the element from the first row, second column.
df.iloc[:, 0:5] # first five columns of data frame with all rows
So, your dataframe train_df_mv_norm.iloc[:,1:] will select all rows but your first column will be excluded.
Note that:
df.iloc[:,:1] select all rows and columns from 0 (included) to 1 (excluded).
df.iloc[:,1:] select all rows and columns, but exclude column 1.
To complete the answer by KeyMaker00, I add that data[:,:1] means:
The first : - take all rows.
:1 - equal to 0:1 take columns starting from column 0,
up to (excluding) column 1.
So, to sum up, the second expression reads only the first column from data.
As your expression has the form:
<variable_list> = <expression_list>
each expression is substituted under the corresponding variable (X and Y).
Maybe it will complete the answers before.
You will know
what you get,
its shape
how to use it with de column name
df.iloc[:,1:2] # get column 1 as a DATAFRAME of shape (n, 1)
df.iloc[:,1:2].values # get column 1 as an NDARRAY of shape (n, 1)
df.iloc[:,1].values # get column 1 as an NDARRAY of shape ( n,)
df.iloc[:,1] # get column 1 as a SERIES of shape (n,)
# iloc with the name of a column
df.iloc[:, df.columns.get_loc('my_col')] # maybe there is some more
elegants methods
I have data of 100 x 101. I want to convert them in series e.g. for first row all column data then for 2nd row all column data and so on. It means the result will be three columns only. The first column with row numbers, the 2nd column with column numbers and the 3rd column with the value for that respective row and column.
Could you please help me doing this conversion in MATLAB.
Available data are in ASCII format and it is possible to open in both MATLAB and Excel.
This can be done by find:
A = rand(100,101);
[data(:,1), data(:,2), data(:,3)] = find(A);
data = sortrows(data,[1 2]);
Note that this is highly inefficient, as you are storing 3 values where you only need to store 1 (the element's actual value). For accessing a specific element, say row 31, column 43, you simply do A(31,43), where you index the matrix.
The file size of data is indeed three times larger than that of A:
whos
Name Size Bytes Class Attributes
A 100x101 80800 double
data 10100x3 242400 double
You can use the ind2sub function that is faster and make more sense in this situation:
tic
A = rand(100,101);
[data(:,1), data(:,2), data(:,3)] = find(A);
data = sortrows(data,[1 2]);
toc
tic
B = A' ;
[data_B(:,1), data_B(:,2)] = ind2sub(size(B), 1:length(B(:)));
data_B(:,3) = B(:);
toc
The output for the timing is as follow:
Elapsed time is 0.002130 seconds (first method)
Elapsed time is 0.000525 seconds (second method).