Trying to plot a pandas dataframe groupby with Bokeh - python-3.x

New here but I've been searching for hours now and can't seem to find the solution for this. What I'm trying to do is display an aggregate of a dataframe in a Bokeh chart. I tried using a groupby object but I get an error when passing the groupby object to the ColumnDataSource (as mentioned in the post below).
how use bokeh vbar chart parameter with groupby object?
Here's some sample code I'm using:
import numpy as np
import pandas
from bokeh.models import ColumnDataSource
df = pandas.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
group = df.groupby("A")
source = ColumnDataSource(group)
Getting this error:
ValueError: expected a dict or pandas.DataFrame, got <pandas.core.groupby.DataFrameGroupBy object at 0x103f7bfd0>
Any ideas on how to plot the groupby object in a chart with Bokeh?
Thanks in advance!

I haven't used Bokeh, but from what I can see you are passing a pandas.core.groupby.DataFrameGroupBy while ColumnDataSource expects a pd.DataFrame. The underlying issue is that groupby creates a data structure that resembles key/value storage: each group in the groupby object has a key and a value, and that value is the DataFrame you are looking for. Running the code below will help you understand the data structure that results from applying groupby() to a DataFrame:
groups = df.groupby('A')
for group in groups:
    # get the group key
    key = group[0]
    # get the group DataFrame
    group_df = group[1]
Notice that I renamed group = df.groupby('A') to groups = df.groupby('A'), since the object holds the whole collection of groups rather than a single group.
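If the goal is a bar chart of per-group aggregates, one straightforward route is to aggregate first and pass the resulting plain DataFrame to ColumnDataSource. Below is a minimal sketch with made-up cat and value columns (the height argument may be plot_height in older Bokeh versions):
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# Hypothetical data with a categorical column so the grouping is meaningful
df = pd.DataFrame({'cat': list('xxyyzz'), 'value': np.random.randn(6)})

# Aggregate first, then hand the plain DataFrame to ColumnDataSource
agg = df.groupby('cat', as_index=False)['value'].mean()
source = ColumnDataSource(agg)

p = figure(x_range=list(agg['cat']), height=300)
p.vbar(x='cat', top='value', width=0.8, source=source)
show(p)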

Related

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
    Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
    df = spark.sql('select * from Merchant_Segments_View limit 5')
    return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: percentile_approx(x, 0.5) for x in df.columns if x not in exclusion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)
Try createGlobalTempView; it worked for me. For example:
df.createGlobalTempView("people")
(I don't know the root cause of why the local temp view does not work.)
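For reference, global temporary views are registered in the global_temp database, so they have to be queried through that namespace. A minimal sketch mirroring the function from the question (assuming a SparkSession is available as spark, as in the original code):
def Median_Product_Revenue_Temp2(Merchant_Segments):
    # Global temp views live in the global_temp database
    Merchant_Segments.createGlobalTempView('Merchant_Segments_View')
    # They must be referenced as global_temp.<view_name>
    return spark.sql('select * from global_temp.Merchant_Segments_View limit 5')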
I managed to avoid using dynamic SQL for calculating the median across columns by using the following code:
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.
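Put together as a self-contained sketch outside Foundry (the data rows and the REVENUE/FEES columns are made up; BANK_NAME, BUS_SEGMENT, and the exclusion list come from the post above):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data standing in for the Merchant_Segments input
df = spark.createDataFrame(
    [('A', 'Retail', 10.0, 5.0), ('A', 'Retail', 20.0, 0.0), ('B', 'Online', 30.0, 7.0)],
    ['BANK_NAME', 'BUS_SEGMENT', 'REVENUE', 'FEES'],
)

exclusion_list = ['BANK_NAME', 'BUS_SEGMENT']
group_list = ['BANK_NAME', 'BUS_SEGMENT']

# percentile_approx is called through F.expr; nullif(col, 0) excludes zeros from the median
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
df_result.show()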

AttributeError: 'RangeIndex' object has no attribute 'inferred_freq'

I'm trying to do forecasting in Python 3.x, so I wrote the following code:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts_log)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
But I'm getting the error message:
AttributeError: 'RangeIndex' object has no attribute 'inferred_freq'
Can you please help me resolve the issue?
You need to make sure that your pandas Series object ts_log has a DatetimeIndex with an inferred frequency.
For example:
ts_log.index
>>> DatetimeIndex(['2014-01-01', ... '2017-12-31'],
dtype='datetime64[ns]', name='Date', length=1461, freq='D')
Notice the attribute freq='D'; it means pandas infers that the Series is indexed daily (D = daily).
To achieve this, I assume your data has a column called 'Date'. Here's the code to do it:
# Convert the 'Date' column from string to datetime (skip if already done)
ts_log['Date'] = pd.to_datetime(ts_log['Date'])
# Set the column 'Date' as index (skip if already done)
ts_log = ts_log.set_index('Date')
# Specify datetime frequency
ts_log = ts_log.asfreq('D')
For frequency other than Daily, refer here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
For statsmodels==0.10.1, and where ts_log is not a DataFrame or is a DataFrame without a datetime index, use the following:
decomposition = seasonal_decompose(ts_log, freq=1)
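For completeness, here is a minimal end-to-end sketch with synthetic daily data (the series and dates are made up; only the decomposition call mirrors the question):
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily series with a weekly pattern, standing in for ts_log
idx = pd.date_range('2020-01-01', periods=120, freq='D')
ts_log = pd.Series(np.sin(2 * np.pi * idx.dayofweek / 7) + 0.1 * np.random.randn(120),
                   index=idx)

# With a DatetimeIndex that has freq='D', the frequency is inferred automatically
decomposition = seasonal_decompose(ts_log)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid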

How to perform functions to a newly added column in a spark data frame using pyspark

I was trying to create a new column in PySpark, but when I try to do some operations using that column, it shows an error like this:
AttributeError: 'NoneType' object has no attribute 'show'
my code is
autodata1=autodata.withColumn('pricePerMPG',(col('PRICE')/(col('MPG-CITY')+col('MPG-HWY')/2))).show(truncate=False)
autodata1.show()
from pyspark.sql.functions import max
max = autodata1.agg({"pricePerMPG": "max"}).collect()[0]
print(max)
Can anyone help me to solve this?
autodata1=autodata.withColumn('pricePerMPG',(col('PRICE')/(col('MPG-CITY')+col('MPG-HWY')/2))).show(truncate=False)
Here autodata is a DataFrame, but because you append .show() at the end, the call returns None, which is why autodata1 is not a DataFrame.
show() triggers an action and returns None.
autodata1=autodata.withColumn('pricePerMPG',(col('PRICE')/(col('MPG-CITY')+col('MPG-HWY')/2)))
autodata1.show(truncate=False)
from pyspark.sql.functions import max
max = autodata1.agg({"pricePerMPG": "max"}).collect()[0]
print(max)
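As a small variation on the snippet above (building on the same autodata1, which is assumed here), you can use an explicit alias and avoid shadowing Python's built-in max:
import pyspark.sql.functions as F

# Aggregate with an alias so the value is easy to read back from the Row
max_row = autodata1.agg(F.max('pricePerMPG').alias('max_pricePerMPG')).collect()[0]
print(max_row['max_pricePerMPG'])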

how to convert the type of an object from "pandas.core.groupby.generic.SeriesGroupBy" to "pandas.core.series.Series"?

I have a variable of type "pandas.core.groupby.generic.SeriesGroupBy" which I got from grouping various fields of a pandas DataFrame. I would like to convert that variable into a pandas Series, but my attempt is not working and generates a lot of errors.
Here is the code which I have tried:
w = data.groupby(['dt', 'b'])['w']
w = pd.Series(w)
When I try to run this code, it's taking a lot of time to execute and also generating a lot of errors.
I am getting a pandas Series as follows:
But, I am expecting something similar to this:
Is there any other way to group a column of a DataFrame and store the groups inside a pandas Series?
Pandas groupby objects are iterable. Using a list comprehension, you can extract the partitioned sub-series. Try:
list_of_series = [s for _, s in data.groupby(['dt', 'b'])['w']]
list_of_series is a list and should contain your desired pandas series.
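If you also need to keep track of which group each sub-series came from, a dictionary keyed by the group labels works as well; here is a sketch with made-up data for the dt, b, and w columns:
import pandas as pd

# Hypothetical frame mirroring the column names from the question
data = pd.DataFrame({
    'dt': ['2021-01', '2021-01', '2021-02'],
    'b':  ['x', 'x', 'y'],
    'w':  [1.0, 2.0, 3.0],
})

series_by_key = {key: s for key, s in data.groupby(['dt', 'b'])['w']}
print(series_by_key[('2021-01', 'x')])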

Pandas iterate over group with SeriesGroupBy objects to Burst Data in MatPlotLib

I am attempting to iterate through an index of a grouped-by dataframe (see comments in code).
import pandas as pd
import numpy as np

df = pd.DataFrame({'COL1': ['A', 'A', 'B', 'B'], 'COL2': [1, 1, 2, 2], 'COL3': [2, 3, 4, 6]})

# Here, I'm creating a copy of 'COL2' so that it's still a column after I assign it to
# the index. I'm guessing there's a better way to do this in the groupby line.
df['COL2_copy'] = df['COL2']
df = df.groupby(['COL2_copy'], as_index=True)

# This will actually be a more complex function (creating a chart in MatPlotLib based on the
# data frame slice (group) per iteration through the index ('COL2')).
# I'll just keep it simple for now.
def pandfun(df):
    # Here's the real issue:
    df['COL4'] = np.trunc(np.round(df['COL3'] * 100, 0))
    return df['COL4']

pandfun(df)
TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'int'
The desired results are 200 and 300 for the first group and 400 and 600 for the second group. To summarize, what I believe to be the main problem is that I want to select individual groups of rows by index (i.e. COL2 == 1) and, within each group, refer to individual rows for a calculation.
I am taking this approach because I'll actually be using this with a MatPlotLib function that I created and I want to "burst" the data into one chart for each group in the dataframe, where each chart refers to individual row data for a given group.
I did this instead:
1. Get a unique list of values from COL2 and create a copy of df:
ulist = pd.unique(df['COL2'].ravel())
df2 = df.copy()
2. Iterate over that list, selecting the matching rows in COL2:
for i in ulist:
    df = df2.loc[df2['COL2'] == i]
3. Within each iteration, apply the function.
4. Within the MatPlotLib function, enter the following code at the top:
df = df.reset_index()
This served to reset the selection of rows after each iteration.
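As an alternative sketch (a standard groupby iteration rather than the unique-list approach above), you can loop over the groups directly; each group is a plain DataFrame, so the column arithmetic from pandfun works and the result can be handed to a per-group plotting function:
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'A', 'B', 'B'],
                   'COL2': [1, 1, 2, 2],
                   'COL3': [2, 3, 4, 6]})

# Each group_df is a regular DataFrame, so the column math works inside the loop
for key, group_df in df.groupby('COL2'):
    col4 = np.trunc(np.round(group_df['COL3'] * 100, 0))
    print(key, col4.tolist())   # 1 [200.0, 300.0] then 2 [400.0, 600.0]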
