Quinlan Attribute C5.0 - decision-tree

Error in UseMethod("QuinlanAttributes") :
no applicable method for 'QuinlanAttributes' applied to an object of class "logical"
I get this error whenever I run my code. I have installed several packages, but the error keeps repeating.

It seems that C50 does not accept Boolean (logical) features.
You can simply drop that column or convert its values to 0/1.
If tdata$Windy is the Boolean feature, replace its values.
library(C50)
tdata <- read.csv('play.csv', header = TRUE, sep = ",")
# C5.0 has no method for logical columns, so recode Windy as 0/1 first
tdata$Windy <- as.integer(tdata$Windy)
xdata <- data.frame(tdata$Outlook, tdata$Temperature, tdata$Humidity, tdata$Windy)
ydata <- tdata$Play
treeModel <- C5.0(x = xdata, y = ydata)
summary(treeModel)

It's an old question, but I had the same issue today and realized it was due to read_sav().
I solved it by applying haven::as_factor to the columns that should be factors.
library(haven)
library(dplyr)
data <- read_sav("datafile.sav")
data <- mutate(data, across(ends_with("_fct"), haven::as_factor))


spatstat integer overflow error in Kcross and crosspairs

I'm working with spatstat 2.3-4 in R 4.1.0 on a 64-bit Windows 10 Pro machine.
Recently I ran into an integer overflow error while using Kcross with a large number of points (i.e. the number of point pairs exceeded .Machine$integer.max). For example:
W <- as.owin(list(xrange = c(688.512, 17879.746) , yrange = c(-27996.842, -7759.813)))
cells1 <- runifpoint(n = 8062, win = W)
cells2 <- runifpoint(n = 1768988, win = W)
cells3 <- superimpose(tumor = cells1 , bcell = cells2)
Kcross(cells3 , r = seq(0,200,by=5) , "tumor" , "bcell" , correction="none") # error
# Error in if (nXY <= 1024) { : missing value where TRUE/FALSE needed
# In addition: Warning message: In nX * nY : NAs produced by integer overflow
8062 * 1768988 > .Machine$integer.max
# [1] TRUE
After a lot of struggling I realized that the error comes from this part of crosspairs:
if (spatstat.options("crosspairs.newcode")) {
nXY <- nX * nY
if (nXY <= 1024) {
nsize <- 1024
}
I could "fix" the error by changing a spatstat option: spatstat.options("crosspairs.newcode" = FALSE).
Is this the right way to deal with the error?
UPDATE:
As Adrian.Baddeley answered below, there is now a new spatstat.geom version on GitHub (currently: v2.4.-0.029) in which the bug is fixed. The new version works fine without the change of the options.
The bug is fixed in the development version of spatstat.geom, available at the GitHub repository.
This is a bug in some relatively new code that speeds up the underlying function crosspairs.ppp(). Until a new version of spatstat.geom is released, you can work around the problem by setting spatstat.options("crosspairs.newcode" = FALSE), as suggested.
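The size of the overflow can be checked with exact integer arithmetic. The sketch below uses Python rather than R, purely as an illustration, because Python integers do not overflow; the names INT32_MAX, n_x and n_y are chosen here and do not come from spatstat.

```python
# Largest value a signed 32-bit integer can hold; this equals
# .Machine$integer.max in R.
INT32_MAX = 2**31 - 1

n_x = 8062       # number of "tumor" points in the question
n_y = 1768988    # number of "bcell" points in the question

# Python integers have arbitrary precision, so this product is exact.
n_pairs = n_x * n_y

print(n_pairs)               # 14261581256
print(n_pairs > INT32_MAX)   # True
```

The product is well beyond the 32-bit range, which is why nX * nY evaluates to NA in R and the following comparison fails.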

slider.value values not getting updated using ColumnDataSource(Dataframe).data

I have been working on a COVID-19 analysis for a dashboard, using a JSON data source that I have converted to a DataFrame. I am plotting a bar chart of "Days to reach deaths" over a "States" x-axis (categorical values), and I am trying to use a function that updates the plot from slider.value. When I run bokeh serve with --log-level=DEBUG, I get the following error:
Can someone point me to what might be causing the issue, or suggest an alternative? I am new to Python, so any help is appreciated.
The code is below:
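import math

import numpy as np
import pandas as pd
import requests
from bokeh.io import curdoc
from bokeh.layouts import row
from bokeh.models import ColumnDataSource, HoverTool, Panel, Slider, Tabs
from bokeh.plotting import figure

cases_summary = requests.get('https://api.rootnet.in/covid19-in/stats/history')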
json_data = cases_summary.json()
#Data Cleaning
cases_summary=pd.json_normalize(json_data['data'], record_path='regional', meta='day')
cases_summary['loc']=np.where(cases_summary['loc']=='Nagaland#', 'Nagaland', cases_summary['loc'])
cases_summary['loc']=np.where(cases_summary['loc']=='Madhya Pradesh#', 'Madhya Pradesh', cases_summary['loc'])
cases_summary['loc']=np.where(cases_summary['loc']=='Jharkhand#', 'Jharkhand', cases_summary['loc'])
#Calculate cumulative days since 1st case for each state
cases_summary['day_count']=(cases_summary['day'].groupby(cases_summary['loc']).cumcount())+1
#Initial plot for default slider value=35
days_reach_death_count=cases_summary.loc[(cases_summary['deaths']>=35)].groupby(cases_summary['loc']).head(1).reset_index()
slider = Slider(start=10, end=max(cases_summary['deaths']), value=35, step=10, title="Total Deaths")
source = ColumnDataSource(data=dict(days_reach_death_count[['loc','day_count', 'deaths']]))
q = figure(x_range=days_reach_death_count['loc'], plot_width=1200, plot_height=600, sizing_mode="scale_both")
q.title.align = 'center'
q.title.text_font_size = '17px'
q.xaxis.axis_label = 'State'
q.yaxis.axis_label = 'Days since 1st Case'
q.xaxis.major_label_orientation = math.pi/2
q.vbar('loc', top='day_count', width=0.9, source=source)
deaths = slider.value
q.title.text = 'Days to reach %d Deaths' % deaths
hover = HoverTool(line_policy='next')
hover.tooltips = [('State', '@loc'),
                  ('Days since 1st Case', '@day_count'),  # @$name gives the value corresponding to the legend
                  ('Deaths', '@deaths')
                  ]
q.add_tools(hover)
def update(attr, old, new):
    days_death_count = cases_summary.loc[(cases_summary['deaths'] >= slider.value)].groupby(cases_summary['loc']).head(1).reindex()
    source.data = [ColumnDataSource().from_df(days_death_count)]
slider.on_change('value', update)
layout = row(q, slider)
tab = Panel(child=layout, title="New Confirmed Cases since Day 1")
tabs= Tabs(tabs=[tab])
curdoc().add_root(tabs)
Your code has two issues:
(critical) source.data must be a dictionary, but you're assigning it a list.
(minor) from_df is a class method, so you don't need to construct an instance to call it.
Try using source.data = ColumnDataSource.from_df(days_death_count) instead.
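To make the expected shape concrete, here is a minimal sketch without Bokeh or pandas: source.data wants one plain dict that maps each column name to a sequence, with all sequences the same length. The helper rows_to_columns and the sample rows are hypothetical, made up only for illustration.

```python
def rows_to_columns(rows, columns):
    """Turn a list of row dicts into a dict of equal-length column lists."""
    return {name: [row[name] for row in rows] for name in columns}


# Made-up rows standing in for the filtered data frame.
rows = [
    {"loc": "State A", "day_count": 40, "deaths": 35},
    {"loc": "State B", "day_count": 28, "deaths": 36},
]

new_data = rows_to_columns(rows, ["loc", "day_count", "deaths"])

# Inside the slider callback the assignment would then be
#     source.data = new_data      # a dict, not [new_data]
```

ColumnDataSource.from_df produces the same kind of dict from a DataFrame, which is why its result can be assigned to source.data directly.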

Updating multiple line plots dynamically in callback in bokeh

I have a use case where I have multiple line plots (with legends) that I need to update based on a column condition. In the example below, the column data source changes depending on the selected country. The problem is that the number of columns in the data source is not fixed, and even the column types can vary. So when a new country is selected and the callback updates the data source, I get this error:
Error: attempted to retrieve property array for nonexistent field 'pay_conv_7d.content'.
I am guessing this is because the pay_conv_7d.content column doesn't exist in the new data source, while the lines that use it are already in my plot. I have tried to fix this in various ways (for example, making the columns common to all country selections by adding the missing columns to the data source in the callback), but I still get errors.
Is there a clean way to update multiple line plots from a callback without a lot of hacks? Any insights or help would be really appreciated. Thanks in advance! :)
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral11

def setup_multiline_plots(x_axis, y_axis, title_text, data_source, plot):
    num_categories = len(data_source.data['categories'])
    legends_list = list(data_source.data['categories'])
    colors_list = Spectral11[0:num_categories]
    # xs = [data_source.data['%s.'%x_axis].values] * num_categories
    # ys = [data_source.data[('%s.%s')%(y_axis,column)] for column in data_source.data['categories']]
    # data_source.data['x_series'] = xs
    # data_source.data['y_series'] = ys
    # plot.multi_line('x_series', 'y_series', line_color=colors_list,legend='categories', line_width=3, source=data_source)
    plot_list = []
    for (colr, leg, column) in zip(colors_list, legends_list, data_source.data['categories']):
        xs, ys = '%s.'%x_axis, ('%s.%s')%(y_axis,column)
        plot.line(xs,ys, source=data_source, color=colr, legend=leg, line_width=3, name=ys)
        plot_list.append(ys)
    data_source.data['plot_names'] = data_source.data.get('plot_names',[]) + plot_list
    plot.title.text = title_text

def update_plot(country, timeseries_df, timeseries_source,
                aggregate_df, aggregate_source, category,
                plot_pay_7d, plot_r_pay_90d):
    aggregate_metrics = aggregate_df.loc[aggregate_df.country == country]
    aggregate_metrics = aggregate_metrics.nlargest(10, 'cost')
    category_types = list(aggregate_metrics[category].unique())
    timeseries_df = timeseries_df[timeseries_df[category].isin(category_types)]
    timeseries_multi_line_metrics = get_multiline_column_datasource(timeseries_df, category, country)
    # len_series = len(timeseries_multi_line_metrics.data['time.'])
    # previous_legends = timeseries_source.data['plot_names']
    # current_legends = timeseries_multi_line_metrics.data.keys()
    # common_legends = list(set(previous_legends) & set(current_legends))
    # additional_legends_list = list(set(previous_legends) - set(current_legends))
    # for legend in additional_legends_list:
    #     zeros = pd.Series(np.array([0] * len_series), name=legend)
    #     timeseries_multi_line_metrics.add(zeros, legend)
    # timeseries_multi_line_metrics.data['plot_names'] = previous_legends
    timeseries_source.data = timeseries_multi_line_metrics.data
    aggregate_source.data = aggregate_source.from_df(aggregate_metrics)

def get_multiline_column_datasource(df, category, country):
    df_country = df[df.country == country]
    df_pivoted = pd.DataFrame(df_country.pivot_table(index='time', columns=category, aggfunc=np.sum).reset_index())
    df_pivoted.columns = df_pivoted.columns.to_series().str.join('.')
    categories = list(set([column.split('.')[1] for column in list(df_pivoted.columns)]))[1:]
    data_source = ColumnDataSource(df_pivoted)
    data_source.data['categories'] = categories
    return data_source  # update_plot reads .data from the returned source
Recently I had to update data on a Multiline glyph. Check my question if you want to take a look at my algorithm.
I think you can update a ColumnDataSource in at least three ways:
You can create a dataframe to instantiate a new CDS
cds = ColumnDataSource(df_pivoted)
data_source.data = cds.data
You can create a dictionary and assign it to the data attribute directly
d = {
'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}
data_source.data = d
Here, if you need columns of different sizes or empty columns, you can fill the gaps with NaN values to keep the column lengths equal. I think this is the solution to your question:
import numpy as np
d = {
'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
'xs1': [[17.0, 166.0], [np.nan], [np.nan]],
'ys1': [[179.0, 169.0], [np.nan], [np.nan]],
'xs2': [[np.nan], [np.nan], [np.nan]],
'ys2': [[np.nan], [np.nan], [np.nan]]
}
data_source.data = d
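The same idea carries over to the flat columns used by plot.line in the question: before assigning the new data, add an all-NaN column for every field that an existing glyph still refers to. The sketch below is plain Python; pad_missing_columns is a hypothetical helper, and 'pay_conv_7d.video' is an invented column name (only 'time.' and 'pay_conv_7d.content' appear in the question).

```python
import math


def pad_missing_columns(new_data, required_columns):
    """Return a copy of new_data with an all-NaN column for each missing name."""
    # Every column in a data source has the same length; take it from any one.
    n_rows = len(next(iter(new_data.values()), []))
    padded = dict(new_data)
    for name in required_columns:
        if name not in padded:
            padded[name] = [math.nan] * n_rows
    return padded


# Data for a newly selected country that has no 'content' category.
new_data = {
    "time.": [1, 2, 3],
    "pay_conv_7d.video": [0.1, 0.2, 0.3],
}
# Fields that the lines already on the plot refer to.
required = ["time.", "pay_conv_7d.video", "pay_conv_7d.content"]

padded = pad_missing_columns(new_data, required)

# In the callback the assignment would then be
#     timeseries_source.data = padded
```

Keeping the full set of column names in one place, such as the 'plot_names' list the question already maintains, means the padding and the assignment can happen in a single step.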
Or, if you only need to modify a few values, you can use the patch method. Check the documentation here.
The following example shows how to patch entire column elements. In this case,
source = ColumnDataSource(data=dict(foo=[10, 20, 30], bar=[100, 200, 300]))
patches = {
'foo' : [ (slice(2), [11, 12]) ],
'bar' : [ (0, 101), (2, 301) ],
}
source.patch(patches)
After this operation, the value of the source.data will be:
dict(foo=[11, 12, 30], bar=[101, 200, 301])
NOTE: It is important to make the update in one go to avoid performance issues.
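To see what the patch example does to each column, the rules can be modelled on plain Python lists. This is only a sketch of the semantics, not Bokeh's implementation, and apply_patches is a hypothetical helper: each patch is an (index, value) pair, where an integer index replaces one element and a slice replaces a run of elements.

```python
def apply_patches(data, patches):
    """Apply (index, value) patches to a dict of lists, in place."""
    for column, updates in patches.items():
        for index, value in updates:
            # Works for both cases: an int index assigns a single element,
            # a slice index assigns a sequence of elements.
            data[column][index] = value
    return data


data = dict(foo=[10, 20, 30], bar=[100, 200, 300])
patches = {
    "foo": [(slice(2), [11, 12])],
    "bar": [(0, 101), (2, 301)],
}

result = apply_patches(data, patches)
print(result)  # {'foo': [11, 12, 30], 'bar': [101, 200, 301]}
```

Here slice(2) covers the first two elements of foo, so 10 and 20 become 11 and 12, while bar has its elements at positions 0 and 2 replaced individually.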

R blpapi forward spread as points or outright

I am using the blpapi package in R to download FX forward prices. In the call, I want to specify whether forward prices are downloaded as points or as outright prices. I have tried the following:
conn <- blpConnect()
sdate <- as.Date("1998-12-31")
edate <- Sys.Date()-1
vFWD <- c("EURAUD1M Curncy")
opts.daily <- c("periodicitySelection"="DAILY","nonTradingDayFillMethod"="PREVIOUS_VALUE","nonTradingDayFillOption"="NON_TRADING_WEEKDAYS")
opts.monthly <- c("periodicitySelection"="MONTHLY","nonTradingDayFillMethod"="PREVIOUS_VALUE","nonTradingDayFillOption"="NON_TRADING_WEEKDAYS")
opts.fwd <- c("FWD_CURVE_QUOTE_FORMAT"="OUTRIGHTS")
dfwd <- bdh(securities = vFWD, c("PX_LAST"), start.date = sdate, end.date = edate, options = opts.daily, overrides = opts.fwd, con = defaultConnection())
For Java, the answer is here: In Bloomberg API how do you specify to get FX forwards as a spread rather than absolute values?
Use "OUTRIGHT", not "OUTRIGHTS" as your override option value.

using as.ppp on data frame to create marked process

I am using a data frame to create a marked point process with the as.ppp function, but I get the error: is.numeric(x) is not TRUE. The data I am using is as follows:
dput(head(pointDataUTM[,1:2]))
structure(list(POINT_X = c(439845.0069, 450018.3603, 451873.2925,
446836.5498, 445040.8974, 442060.0477), POINT_Y = c(4624464.56,
4629024.646, 4624579.758, 4636291.222, 4614853.993, 4651264.579
)), .Names = c("POINT_X", "POINT_Y"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I can see that the first two columns are numeric, so I do not know why it is a problem.
> str(pointDataUTM)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5028 obs. of 31 variables:
$ POINT_X : num 439845 450018 451873 446837 445041 ...
$ POINT_Y : num 4624465 4629025 4624580 4636291 4614854 ...
I also checked for NA values and found none:
> sum(is.na(pointDataUTM$POINT_X))
[1] 0
> sum(is.na(pointDataUTM$POINT_Y))
[1] 0
When I tried even only the first two columns of the data.frame, the error I get on using as.ppp is this:
Error: is.numeric(x) is not TRUE
5. stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), ch), call. = FALSE, domain = NA)
4. stopifnot(is.numeric(x))
3. ppp(X[, 1], X[, 2], window = win, marks = marx, check = check)
2. as.ppp.data.frame(pointDataUTM[, 1:2], W = studyWindow)
1. as.ppp(pointDataUTM[, 1:2], W = studyWindow)
Could someone tell me what is the mistake here and why I get the not numeric error?
Thank you.
The critical check is whether pointDataUTM[,1] is numeric, rather than pointDataUTM$POINT_X.
Since pointDataUTM is a tbl object, and tbl is a function from the dplyr package, what is probably happening is that the subset operator for the tbl class returns a data frame, not a numeric vector, when a single column is extracted, whereas the $ operator returns a numeric vector.
I suggest you convert your data to data.frame using as.data.frame() before calling as.ppp.
In the next version of spatstat we will make our code more robust against this kind of problem.
I'm on my phone, so I can't check, but I think it happens because you have a tibble and not a data.frame. Please try converting it to a data.frame with as.data.frame first.
