Altair's selection and transform_filter via binding_range slider for datetime values doesn't seem to work with an equality condition or the selector itself

I want to bind a range slider with datetime values to filter a chart down to the data for one particular date. Using the stocks data, the x-axis should show the companies and the y-axis the price of the stocks on the particular day the user selects via the range slider.
Based on inputs from this answer and this issue, I have the following code. With the inequality condition in transform_filter it shows something once the slider is moved past one particular value, but the chart is empty for the rest of the range.
What is peculiar is that with the inequality operator it at least shows something, but everything is empty when the condition is ==.
import altair as alt
import pandas as pd
from vega_datasets import data

source = data.stocks()

def timestamp(t):
    return pd.to_datetime(t).timestamp()

# 86400 seconds is the difference between consecutive days
slider = alt.binding_range(step=86400,
                           min=timestamp(min(source['date'])),
                           max=timestamp(max(source['date'])))
select_date = alt.selection_single(fields=['date'], bind=slider,
                                   init={'date': timestamp(min(source['date']))})

alt.Chart(source).mark_bar().encode(
    x='symbol',
    y='price',
).add_selection(select_date).transform_filter(alt.datum.date == select_date.date)
Since the output is empty, I am inclined to conclude that it's the transform_filter that is causing issues, but I have been at this for more than 6 hours now and have tried all the permutations and combinations of alt.expr.toDate and other conversions here and there, yet I cannot get it to work.
I also tried just transform_filter(select_date.date) and transform_filter(date), along with other things, but nothing quite works.
The expected output is that the heights of the bars change (due to the data being filtered on date) as the user drags the slider.
Any help would be really appreciated.

There are several issues here:
In Vega-Lite, timestamps are expressed in milliseconds, not seconds (see the quick check after this list).
You are filtering on equality between a numerical timestamp and a string representation of a date.
Even if you parse the date in the filter expression, Python date parsing and JavaScript date parsing behave differently, so the results will generally not match. Even within JavaScript, date parsing can vary from browser to browser; all of this means that filtering on exact equality between a Python timestamp and a JavaScript timestamp is generally problematic.
The data you are using has monthly timestamps, so the slider step should account for this.
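To illustrate the first point, here is a quick check of the seconds-versus-milliseconds mismatch (a sketch; the date is arbitrary):

import pandas as pd

ts_seconds = pd.Timestamp('2004-01-01').timestamp()  # pandas returns seconds since the epoch
ts_millis = ts_seconds * 1000                        # Vega-Lite expects milliseconds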
Keeping all that in mind, the best course would probably be to adjust the slider values and filter on matching year and month, rather than trying to achieve equality in the exact timestamp. The result looks like this:
import altair as alt
from vega_datasets import data
import pandas as pd

source = data.stocks()

def timestamp(t):
    return pd.to_datetime(t).timestamp() * 1000

slider = alt.binding_range(
    step=30 * 24 * 60 * 60 * 1000,  # 30 days in milliseconds
    min=timestamp(min(source['date'])),
    max=timestamp(max(source['date'])))

select_date = alt.selection_single(
    fields=['date'],
    bind=slider,
    init={'date': timestamp(min(source['date']))},
    name='slider')

alt.Chart(source).mark_bar().encode(
    x='symbol',
    y='price',
).add_selection(select_date).transform_filter(
    "(year(datum.date) == year(slider.date[0])) && "
    "(month(datum.date) == month(slider.date[0]))"
)
You can view the result here: vega editor.

Related

Python warning with Pandas DataFrame "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

First post / total Python novice, so be patient with my slow understanding!
I have a dataframe containing a list of transactions in order of transaction date.
I've appended a new field/column called ["DB/CR"] that, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' in the absence of "-".
Since the transactions are in date order, I've included another new field/column called ["Top x"], which I want to hold an incremental, independent number (starting at 1) for debits and credits on a segregated basis.
As such, I have created a simple loop with an associated if/elif statement (it could probably use else, as it's binary) that walks the df from row 0 to the last row and, on seeing 1) "Debit" or 2) "Credit", increments the corresponding counter independently: integer 'i' for "Debit" and integer 'ii' for "Credit".
The code works as expected in terms of the 'Top x' output; however, I always receive the warning "A value is trying to be set on a copy of a slice from a DataFrame".
Trying to perfect my script so it runs without warnings, I've been trying to understand what I'm doing incorrectly, but I'm not getting it in terms of my use case.
I'd appreciate it if someone could kindly shed light on this / propose how the code needs to be refactored to avoid the warning.
Code (the df source data is an imported csv):
# top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i + 1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii + 1
        df["Top x"][ind] = ii
Interpreter output:
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc[ind, "Top x"] = i (and likewise = ii in the elif branch) instead of the chained df["Top x"][ind] = i; the chained indexing is what triggers the SettingWithCopyWarning.
Alternatively, use iterrows() to iterate over the DF. However, updating a DF while iterating over it is not preferable; see the iterrows() documentation here:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
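A loop-free alternative is also worth considering (a sketch assuming the column names from the question; the sample "Amount" values are hypothetical):

import pandas as pd

# hypothetical sample mirroring the question's columns
df = pd.DataFrame({"Amount": ["-10.00", "20.00", "-5.00", "15.00"]})
df["DB/CR"] = ["Debit" if "-" in a else "Credit" for a in df["Amount"]]

# number debits and credits independently: no loop, no chained indexing
df["Top x"] = df.groupby("DB/CR").cumcount() + 1

groupby("DB/CR").cumcount() numbers the rows of each group in order of appearance, which is exactly the segregated running count the loop builds by hand.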

Is there a pandas function that can create a dataframe of the mean, median, and mode of selected columns?

My attempt:
# Compute the mean, median and variance for the variables sph, acous and dur. Compare their level of variability.
sad_mean = dat_songs[['spch', 'acous', 'dur']].mean()
sad_mode = dat_songs[['spch', 'acous', 'dur']].mode()
sad_median = dat_songs[['spch', 'acous', 'dur']].median()
sad_mmm = pd.DataFrame({'mean':sad_mean, 'median':sad_median, 'mode':sad_mode})
sad_mmm
Which outputs this (screenshot not reproduced here).
First of all, the median column is not right at all, and I want to know how to fix that too.
Secondly, I feel like I have seen a quicker or shorter way to do this with a simple pandas function.
My data head for reference (screenshot not reproduced here).
Simply try dat_songs.describe(). Descriptive statistics will be shown for all the numerical columns.
For selected columns:
dat_songs[['spch', 'acous', 'dur']].describe()
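If you specifically want mean, median and mode in a single frame, a sketch along these lines (same column names assumed) also works. Note that mode() returns a DataFrame, because a column can have several modes; that shape mismatch is one reason the naive combination in the question misbehaves:

import pandas as pd

cols = ['spch', 'acous', 'dur']
sad_mmm = dat_songs[cols].agg(['mean', 'median'])

# mode() may return several rows per column; keep the first mode only
sad_mmm.loc['mode'] = dat_songs[cols].mode().iloc[0]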

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'a' to 'd'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow, as it scans the whole data frame on every iteration; running the comparison once is manageable, but not inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & is used).
Note also that I am aware of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators; however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join([f'((A == {point["a"]}) & (B == {point["b"]}) & '
                    f'(C >= {point["c"]}) & (C < {point["d"]}))'
                    for point in points])
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
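A third idea I have sketched is to handle all the equality conditions with a single inner merge and then apply the range check vectorised (hypothetical; it uses the column/key names above and assumes no row matches more than one point, otherwise a drop_duplicates step is needed):

import pandas as pd

# one inner merge covers all the equality conditions at once
points_df = pd.DataFrame(points).rename(columns={'a': 'A', 'b': 'B'})
candidates = df.merge(points_df, on=['A', 'B'], how='inner')

# half-open range check, inclusive only at the lower end
in_range = (candidates['C'] >= candidates['c']) & (candidates['C'] < candidates['d'])
filtered = candidates.loc[in_range, list_of_column_names]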
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)

Query date range and product size from xlsx file

I'm using Python 3.6 to do this. Below are just a few of the important columns that I'm interested in querying.
Auto-Gen Index : Product Container : Ship Date  : .......
0              : Large Box         : 2017-01-09 : .......
1              : Large Box         : 2012-07-15 : .......
2              : Small Box         : 2012-07-18 : .......
3              : Large Box         : 2012-07-31 : .......
I would like to query the rows whose product container is Large Box and whose ship date falls within July 2012.
import pandas as pd

file_name = r'''Sample-Superstore-Subset-Excel.xlsx'''
df = pd.read_excel(file_name, sheet_name=my_sheet)
lb = df.loc[df['Product Container'] == 'Large Box']  # get Large Box rows
july = lb[(lb['Ship Date'] > '2012-07-01') & (lb['Ship Date'] < '2012-07-31')]
I just wonder how to do this using a query / where condition in Python (pd.query())?
If your question is when to use loc vs where, see my answer here:
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
If you are asking when to use query, AFAIK there is no real reason to besides performance. If you have a very large dataset, query is expected to be faster. More on high-level performance here.
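For completeness, here is a sketch of the question's filter written with query() (assuming 'Ship Date' was parsed as datetime, e.g. via parse_dates in read_excel; the backtick quoting of spaced column names needs pandas >= 0.25):

# same filter as the loc version above, in one query expression
july = df.query(
    "`Product Container` == 'Large Box' and "
    "'2012-07-01' <= `Ship Date` <= '2012-07-31'"
)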

groupby select value only if match

I got my data sorted correctly, but now I'm trying to find a way to group by the "first non-empty string value". Is there a way to do this without changing the rest of the data? first() was close, but not quite what I needed:
grouped = sortedvals.groupby(['name']).first().reset_index()
This doesn't work if the first value is empty, i.e. '', 2 (my goal is to return 2), but it does work for everything else.
Use the replace function to replace blank values with np.nan:
import numpy as np

grouped = sortedvals.replace('', np.nan).groupby(['name']).first().reset_index()
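This works because GroupBy.first() returns the first non-NaN value per column within each group. A tiny hypothetical demo:

import numpy as np
import pandas as pd

sortedvals = pd.DataFrame({'name': ['a', 'a'], 'val': ['', 2]})
grouped = sortedvals.replace('', np.nan).groupby(['name']).first().reset_index()
print(grouped)  # the value for group 'a' is 2, not ''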
