Is there a better interface for adding Highcharts support to Zeppelin? - apache-spark

Apache Zeppelin has good support for AngularJS, but there is a gap between Scala and JavaScript.
I am trying to add Highcharts support to Zeppelin to fill this gap. The main goal is to plot simply and directly from a Spark DataFrame.
After a couple of rounds of refactoring, I came up with the following interface.
github.com/knockdata/zeppelin-highcharts
Here are two options. Which option is better?
Option A
This is an example of plotting with Highcharts.
highcharts(bank,
  "marital",
  List("name" -> "age", "y" -> avg(col("balance")), "orderBy" -> col("age")),
  new Title("Marital Job Average Balance").x(-20),
  new Subtitle("Source: Zeppelin Tutorial").x(-20),
  new XAxis("Age").typ("category"),
  new YAxis("Balance(¥)").plotLines(Map("value"->0, "width"->1)),
  new Tooltip().valueSuffix("¥"),
  new Legend().layout("vertical").align("right").verticalAlign("middle")
)
Here is the code without the extra options.
highcharts(bank,
  "marital",
  List("name" -> "age",
       "y" -> avg(col("balance")),
       "orderBy" -> col("age")))
Option B
I came up with this option, inspired by @honnix's answer. It has more syntactic sugar.
highcharts(bank).series("marital")
  .data("name" -> "age", "y" -> avg(col("balance")))
  .orderBy(col("age"))
  .title(Title("Marital Job Average Balance").x(-20))
  .subtitle(Subtitle("Source: Zeppelin Tutorial").x(-20))
  .xAxis(XAxis("Age").typ("category"))
  .yAxis(YAxis("Balance(¥)").plotLines("value"->0, "width"->1))
  .tooltip(Tooltip().valueSuffix("¥"))
  .legend(Legend().layout("vertical").align("right").verticalAlign("middle"))
  .plot
A simple plot without options would be:
highcharts(bank).series("marital")
  .data("name" -> "age", "y" -> avg(col("balance")))
  .orderBy(col("age"))
  .plot
It will generate a chart here.

It would be good to have some kind of chaining methods to pass in those parameters, because putting a few lists together in one apply() method is a little bit hard to read.

Importing data and fitting a survival model

I am trying to import data from Stata to R and fit a survival model. I did the following:
library(haven)
data <- read_dta("C:/Users/user/Desktop/data.dta")
View(data)
install.packages(c("survival", "survminer"))
library("survival")
library("survminer")
It worked well. However, I got errors:
data("data")
Warning message:
In data("data") : data set ‘data’ not found
fit <- survfit(Surv(data$finaltime, data$GSTATUS_DTHCNS_KI) , data = data)
Error in survfit.Surv(Surv(data$finaltime, data$GSTATUS_DTHCNS_KI), data = data) :
the survfit function requires a formula as its first argument
I wonder if you can tell me how to fix this.
The issue is you aren't supplying a formula. As noted in the documentation for survfit one must now supply a formula:
Older releases of the code also allowed the specification for a
single curve to omit the right hand of the formula, i.e.,
survfit(Surv(time, status)), in which case the formula argument is not
actually a formula. Handling this case required some non-standard and
fairly fragile manipulations, and this case is no longer supported.
Here is an example of a fix, where ~ 1 would be replaced by the formula that fits your research question:
fit <- survfit(Surv(data$finaltime, data$GSTATUS_DTHCNS_KI) ~ 1 , data = data)
summary(fit)
See help("survfit.formula") for more information.

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code, from the callback function of a Plotly Dash dashboard, is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'a' to 'd'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
# Start with an all-False mask that shares df's index, so | aligns row by row.
rows_boolean = pandas.Series(False, index=df.index)
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow because it scans the whole DataFrame. Running it once is manageable, but not once per selected point inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & operator is used).
Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.
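As an aside on the comparison above: in pandas 1.3 and later, Series.between accepts inclusive="left", which keeps the lower bound and excludes the upper one. A minimal sketch with made-up values (the series s below is purely illustrative and is not part of the dashboard code):

```python
import pandas as pd

# Illustrative values only; not taken from the real DataFrame.
s = pd.Series([1.0, 2.0, 3.0])

# inclusive="left" means lower <= value < upper (requires pandas >= 1.3).
mask = s.between(1.0, 3.0, inclusive="left")
```

This only tidies the per-point condition; it does not by itself remove the loop over points.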
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join(
    f'((A == {point["a"]}) & (B == {point["b"]}) '
    f'& (C >= {point["c"]}) & (C < {point["d"]}))'
    for point in points
)
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
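One possible direction, offered as an unbenchmarked sketch rather than a definitive answer: turn points into a small DataFrame, inner-join it to df on the two equality keys, and apply the half-open interval test once to the joined rows. The helper name filter_by_points and the temporary _pos column are hypothetical, and the sketch assumes every point carries the keys 'a', 'b', 'c' and 'd' used in the loop above.

```python
import numpy as np
import pandas as pd


def filter_by_points(df, points, columns):
    """Return rows of df matching at least one point:
    A == a, B == b and c <= C < d (hypothetical helper)."""
    keep = np.zeros(len(df), dtype=bool)
    if points:
        # One row per selected point; rename the equality keys to match df.
        pts = pd.DataFrame(points, columns=["a", "b", "c", "d"])
        pts = pts.rename(columns={"a": "A", "b": "B"})

        # Remember each row's position so matches can be mapped back to df.
        keyed = df[["A", "B", "C"]].copy()
        keyed["_pos"] = np.arange(len(df))

        # The join handles A == a and B == b for all points in one pass.
        candidates = keyed.merge(pts, on=["A", "B"], how="inner")

        # Half-open interval: inclusive lower bound, exclusive upper bound.
        hit = (candidates["C"] >= candidates["c"]) & (candidates["C"] < candidates["d"])

        # A row may match several points; setting True repeatedly acts as OR.
        keep[candidates.loc[hit, "_pos"].to_numpy()] = True
    return df.loc[keep, columns]
```

It would be called as filtered = filter_by_points(df, points, list_of_column_names). Whether this beats the loop depends on the data: if many points share the same (A, B) pair, the joined frame can grow large, so it is worth timing on the real 680,000-row DataFrame.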

If-Then-Else in Ruta

Is there something like if-then-else available in Ruta? I'd like to do something like:
if there's at least one term from catA, then label the document with "one"
else if there's at least one term from catB, then label the document with "two"
else label the document with "three".
All the best
Philipp
There is no language structure for if-then-else in UIMA Ruta (2.7.0).
You need to duplicate some parts of the rule in order to model the else part, e.g., something like the following:
Document{CONTAINS(CatA) -> One};
Document{-CONTAINS(CatA), CONTAINS(CatB) -> Two};
Document{-CONTAINS(CatA), -CONTAINS(CatB) -> Three};
You could also check if the previous rule has matched and depend on that.
What the rule should actually look like depends mainly on the type system and how you want to model the information (features?).
DISCLAIMER: I am a developer of UIMA Ruta
I think you are asking about If-else-if in Ruta. This is possible using "ONLYFIRST"
PACKAGE uima.ruta.example;
DECLARE CatA,CatB,CatC;
"CatA"->CatA;
"CatB"->CatB;
"CatC"->CatC;
DECLARE one,two,three;
ONLYFIRST Document{}{
Document{CONTAINS(CatA) -> one};
Document{CONTAINS(CatB) -> two};
Document{CONTAINS(CatC) -> three};
}

Updating a single field in a record with Haskell @

I need to update one field of a very large default record.
As the default may change I don't want to rebuild the entire record manually.
Now I have come across the following way of doing this, but I am not sure how it works:
unaggregate :: MyResult -> MyResult
unaggregate calc@MyResult{..} = calc{ the_defaults = the_override
                                        `mappend` the_defaults }
  where
    the_override = create ("aggregation" := False)
I have tried searching for 'Haskell @ operator' in Google but it does not return immediately useful information.
I saw somewhere that calc@MyResult{..} does pattern matching on variables, but I don't see what the variable calc does for the MyResult record...
Also I have looked up mappend (and Monoids) and I am not sure how these work either...
Thank you for any help
The @ symbol is called an "as-pattern". In the example above, you can use calc to mean the whole record. Usually you'd use it like this: calc@(MyResult someResult), so you can have both the whole thing and the pieces that you're matching. You can do the same thing with lists (myList@(myHead:myTail)) or tuples (myTuple@(myFst, mySnd)). It's pretty handy!
MyResult{..} uses RecordWildcards, which is a neat extension! But RecordWildcards doesn't help you update just one field of a record.
You can do this instead: calc { theFieldYouWantToUpdate = somethingNew }.

How to use a vector of strings to call dataframe columns by its header

In R, I want to use a subset of a dataframe 'RL' by selecting specific headers (e.g. 'RL$age01', etc.). I generate the selected headers as a vector of strings:
v = c('ID', sprintf("sex%02d", seq(1,15)), sprintf("age%02d", seq(1,15)))
and the dataframe index as:
c = sprintf('RL$%s', v)
How can I evaluate these strings to call the dataframe columns by header and rearrange them in a matrix, in the sense of x = cbind(RL$ID, RL$age01, ...)?
cbind(c) does not work, and neither does using things like eval(), parse() or expression().
Thanks for any help
Rafael
Just use
RL[,v]
Just noticed this was already mentioned in the comments.
