Dynamically filtering a Pandas DataFrame based on user input - python-3.x

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code, from the callback function of a Plotly Dash dashboard, is triggered when points on a scatter graph are selected. The selected points are passed in as points, a list of dictionaries whose keys ('a' to 'd' below) hold various properties of each point. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow, as each iteration evaluates its conditions against the whole DataFrame; that is manageable once, but not inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (since the | operator is used rather than &).
Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.
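(As an aside, recent pandas versions can express a half-open interval with between directly; a minimal sketch, assuming pandas 1.3 or later, where the inclusive parameter accepts string values:)
# Inclusive at the lower end only, exclusive at the upper end (pandas >= 1.3)
df['C'].between(point['c'], point['d'], inclusive='left')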
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join([f'((A == {point["a"]}) & (B == {point["b"]}) '
                    f'& (C >= {point["c"]}) & (C < {point["d"]}))'
                    for point in points])
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
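(On the second point, a column subset can still be applied to the result of query; a small sketch, reusing list_of_column_names from above:)
filtered = df.query(query)[list_of_column_names]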
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
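(For completeness, the same OR-combination can be built without updating rows_boolean inside the loop by collecting the per-point masks and reducing them in one step; a sketch only, not benchmarked, assuming numpy is installed:)
import numpy as np

masks = [(df['A'] == p['a']) & (df['B'] == p['b'])
         & (df['C'] >= p['c']) & (df['C'] < p['d'])
         for p in points]
rows_boolean = np.logical_or.reduce(masks)
filtered = df.loc[rows_boolean, list_of_column_names]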

Related

Python warning with Pandas DataFrame "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

First post / total Python novice, so please be patient with my slow understanding!
I have a dataframe containing a list of transactions by order of transaction date.
I've appended a new field/column called ["DB/CR"] which, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' when "-" is absent.
Since the transactions are in date order, I've also included another new field/column called [Top x], which I want to populate with an incrementing number (starting at 1), maintained independently for debits and for credits.
As such, I have created a simple loop with an associated 'if' / 'elif' (it could probably use else, as the field is binary) that runs from row 0 to the last row of the df and, depending on whether the row is 1) "Debit" or 2) "Credit", increments the corresponding counter: integer 'i' for debits and integer 'ii' for credits.
The code works as expected in terms of output of the 'Top x'; however, I always receive a warning "A value is trying to be set on a copy of a slice from a DataFrame".
In trying to perfect my script so it runs without any warnings, I've been trying to understand what I'm doing incorrectly, but I can't see it in terms of my use case.
I'd appreciate it if someone could kindly shed light on this / propose how the code needs to be refactored to avoid this warning.
Code (the df source data is an imported csv):
# top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i + 1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii + 1
        df["Top x"][ind] = ii
Interpreter
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use label-based assignment with .loc, e.g. df.loc[ind, "Top x"] = i instead of df["Top x"][ind] = i
Use iterrows() to iterate over the DataFrame. However, updating the DataFrame while iterating over it is not recommended; refer to the iterrows() documentation:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
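(As a faster, warning-free alternative, a minimal sketch assuming the "DB/CR" and "Top x" column names from the question: the per-group counter can be computed without an explicit loop using groupby().cumcount().)
# Number debits and credits independently, starting at 1 within each group
df["Top x"] = df.groupby("DB/CR").cumcount() + 1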

Is there a pandas function that can create a dataframe of the mean, median, and mode of selected columns?

My attempt:
# Compute the mean, median and variance for the variables sph, acous and dur. Compare their level of variability.
sad_mean = dat_songs[['spch', 'acous', 'dur']].mean()
sad_mode = dat_songs[['spch', 'acous', 'dur']].mode()
sad_median = dat_songs[['spch', 'acous', 'dur']].median()
sad_mmm = pd.DataFrame({'mean':sad_mean, 'median':sad_median, 'mode':sad_mode})
sad_mmm
Which outputs this
First of all, the median column is not right at all, and I want to know how to fix that too.
Secondly, I feel like I have seen some quicker or shorter way to do this with a simple function with pandas.
My data head for reference
Simply try dat_songs.describe(). Descriptive statistics will be produced for all the numerical columns.
For selected columns:
dat_songs[['spch', 'acous', 'dur']].describe()
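(If the mean, median and mode specifically are wanted in one table, a small sketch assuming the column names from the question: agg builds the mean and median in one step, and because mode() can return several rows per column, keeping only the first mode makes the result rectangular.)
cols = ['spch', 'acous', 'dur']
sad_mmm = dat_songs[cols].agg(['mean', 'median']).T  # rows: spch/acous/dur, columns: mean/median
sad_mmm['mode'] = dat_songs[cols].mode().iloc[0]     # first mode per column, aligned by column name
sad_mmm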

Why is for-loop variable skipping every second row when iterating over sqlite database?

I have set-up a sqlite database which records some world-trading data.
I am iterating over the database and want to extract some data, which will flow into another method of that class (works - no problem).
However, iterating over the database I have two variables "row" & "l":
for row in database("..."):
    l = self.c.fetchone()
Strangely, half of the data ends up in the variable "row" and the other half in "l". It took me forever to figure this out, and I still have no idea why it happens. If I iterate over a list/db with "row", shouldn't "row" hold all the data, one row per iteration?
I tried accessing the rows through "row" and "l" in different ways, from within a new loop; I rewrote and restructured the loops, but then I got too much data and over 2000 entry points. I also used fetchmany() and made another (outer) loop to iterate over.
for row in self.c.execute("SELECT order_number,quotaStart,valid FROM volume"):
    l = self.c.fetchone()
    count += 1
    print(count, ">>", row)
    print(count, ">>", l)
I expect the data to be accessible through "row" or "l" - but not one half in one variable and the other half in the other?
You are mixing up two different ways of accessing the results of your query: iterating over the cursor and calling fetchone() on the same cursor. Each call to fetchone() consumes the next row of the same result set, so the for loop and fetchone() each receive every other row. The simplest way to do it is:
for row in self.c.execute("SELECT order_number,quotaStart,valid FROM volume"):
    print(count, ">>", row)
Or alternatively:
self.c.execute("SELECT order_number,quotaStart,valid FROM volume"):
while l = self.c.fetchone():
print(count, ">>", l)

Name variable based on string MATLAB

I have a variable that is created by a loop. The variable is large enough and in a complicated enough form that I want to save the variable each time it comes out of the loop with a different name.
PM25 is my variable, but I want to save it as PM25_year, in which the year changes based on str = fname(13:end).
PM25 = permute(reshape(E',[c,r/nlay,nlay]),[2,1,3]); % Reshape and permute to achieve the right shape. Each face of the 3D should be one day
str = fname(13:end); % The year
% Third dimension is organized so that the data for each site is on a face
save('PM25_str', 'PM25_Daily_US.mat', '-append')
The str would be a year, like 2008. So the variable saved would be PM25_2008, then PM25_2009, etc. as it is created.
Defining new variables based on data isn't considered best practice, but you can store your data more efficiently using a cell array. You can store even a large, complicated variable like your PM25 variable within a single cell. Here's how you could go about doing it:
Place your PM25 data for each year into the cell array C using your loop:
for i = 1:numberOfYears
    C{i} = PM25;
end
Resulting in something like this:
C = { PM25_2005, PM25_2006, PM25_2007 };
Now let's say you want to obtain your variable for the year 2006. This is easy (assuming you aren't skipping years). The first year of your data will correspond to position 1, the second year to position 2, etc. So to find the index of the year you want:
minYear = 2005;
yearDesired = 2006;
index = yearDesired - minYear + 1;
PM25_2006 = C{index};
You can do this using eval, but note that it's often not considered good practice. eval may be a security risk, as it allows user input to be executed as code. A better way to do this may be to use a cell array or an array of objects.
That said, I think this will do what you want:
for year = 2008:2014
    % note the doubled quote: E'' produces E' inside the sprintf string
    eval(sprintf('PM25_%d = permute(reshape(E'',[c,r/nlay,nlay]),[2,1,3]);', year));
    save('PM25_Daily_US.mat', sprintf('PM25_%d', year), '-append');
end
I do not recommend setting variables like this, since there is no way to track them and it completely prevents all the error checking that MATLAB does beforehand; this kind of code is handled entirely at runtime.
Anyway, in case you have a really good reason for doing this, I recommend that you use the function assignin:
assignin('caller', ['myvar',num2str(1)], 63);

Converting Excel functions into R

I have two excel functions that I am trying to convert into R:
numberShares
=IF(AND(N213="BOH",N212="BOH")=TRUE,P212,IF(AND(N213="BOH",N212="Sell")=TRUE,ROUNDDOWN(Q212/C213,0),0))
marketValue
=IF(AND(N212="BOH",N213="BOH")=TRUE,C213*P212,IF(AND(N212="Sell",N213="Sell")=TRUE,Q212,IF(AND(N212="BOH",N213="Sell")=TRUE,P212*C213,IF(AND(N212="Sell",N213="BOH")=TRUE,Q212))))
The cells that are referenced include:
c = closing price of a stock
n = position values of either "buy or hold" or "sell"
p = number of Shares
q = market value, assuming $10,000 initial equity (number of shares*closing price)
and the tops of the two output columns that I am trying to recreate look like this:
output
So far, in R I have constructed a dataframe with the necessary four columns:
data.frame
I just don't know how to write the functions that will populate the number of shares and market value columns. For loops? ifelse?
Again, thank you!!
Convert the AND()'s to the infix "&", the "=" to "==", and the IF's to ifelse(), and you are halfway there. The problem will be in converting your cell references to array or matrix references, and for that task we would have needed a better description of the data layout:
numberShares <-
    ifelse( N213 == "BOH" & N212 == "BOH",
            # Perhaps PosVal[213] == "BOH" & PosVal[212] == "BOH"
            # ... and very possibly the 213 should be 213:240 and the 212 should be 212:239
            P212,
            ifelse( N213 == "BOH" & N212 == "Sell",
                    round(Q212/C213, digits = 0),
                    0))
(You seem to be returning incommensurate values, which seems pretty questionable.) Assuming this is correct code despite my misgivings, the next translation involves applying the same substitutions in this structure (although you seem to be missing an else-consequent in the last IF function):
marketValue <-
    IF( AND(N212="BOH", N213="BOH")=TRUE,
        C213*P212,
        IF( AND(N212="Sell", N213="Sell")=TRUE,
            Q212,
            IF( AND(N212="BOH", N213="Sell")=TRUE,
                P212*C213,
                IF( AND(N212="Sell", N213="BOH")=TRUE,
                    Q212))))
(Your testing for AND(., .) = TRUE is, I believe, unnecessary in Excel and certainly unnecessary in R.)
