R DiagrammeR package Mermaid text using actual calculation results - text

I would like to use the DiagrammeR package for a simple flow chart in my R Markdown document. However, I couldn't figure out how to put actual output from a data table into the node text. Suppose I have a simple database query that returns total records, patient count and year information for three different cohorts.
I want to create a diagram using Mermaid. The code looks like this:
Total = paste0('Records:',b1$records,' Patients:',b1$patients,' Year:',b1$year)
# (Records:1000 Patients:822 Year:5)
Sub1 = paste0('Records:',b2$records,' Patients:',b2$patients,' Year:',b2$year)
Sub2 = paste0('Records:',b3$records,' Patients:',b3$patients,' Year:',b3$year)
mermaid("
graph TB
A[Total] --> B{Sub1} --> C{Sub2}
")
Instead of showing Records:1000 Patients:822 Year:5 in node A, the diagram shows the literal word "Total".
Any suggestions on how to do this correctly?
Thanks!

You are one step away from what you'd like to achieve. Please try this simple example below to see the logic:
library(DiagrammeR)
Structure:
DiagrammeR(
"
graph TB
A[Question] -->B[Answer]
"
)
1. Define answer node:
B <- paste0("There are ", nrow(iris), " records")
2. Combine it with other components, using ; to separate statements:
results <- paste0("graph TB; A[How many rows does iris have?]-->", "B[", B, "]")
3. Call 'results' in DiagrammeR:
DiagrammeR(diagram = results)
The final plot should refresh when your calculation updates.
The plot that calls your calculation

Related

Python Warning Panda Dataframe "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

First post / total Python novice, so please be patient with my slow understanding!
I have a dataframe containing a list of transactions by order of transaction date.
I've appended a new field/column called ["DB/CR"] that, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' in the absence of "-".
Noting that the transactions are in date order, I've included another new field/column called ["Top x"]. I want it to hold an incrementing number (starting at 1) that is counted independently for debits and for credits.
As such, I have created a simple loop with an associated 'if' / 'elif' statement (I could probably use else, as it's binary) that loops through the data set from row 0 to the last row in the df and, depending on whether the row is 1) "Debit" or 2) "Credit", increments a separate counter for each: integer 'i' for "Debit" and integer 'ii' for "Credit".
The code works as expected in terms of output of the 'Top x'; however, I always receive a warning "A value is trying to be set on a copy of a slice from a DataFrame".
Trying to perfect my script so that it runs without any warnings, I've been trying to understand what I'm doing incorrectly, but I'm not getting it in terms of my use case.
I'd appreciate it if someone could kindly shed light on / propose how the code needs to be refactored to avoid receiving this warning.
Code (the df source data is an imported csv):
#top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i+1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii+1
        df["Top x"][ind] = ii
Interpreter
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc[ind, "Top x"] = i (and likewise with ii), which sets the value with a single .loc[row, column] call instead of chained indexing.
Use iterrows() to iterate over the DF. However, updating the DF while iterating over it is not recommended.
Refer to the documentation for iterrows():
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
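As a minimal sketch of the two fixes discussed above, the snippet below first keeps the original loop but writes through .loc[row, column], and then computes the same counters without a loop using groupby().cumcount(). The sample data and the extra column name "Top x loop" are hypothetical stand-ins for the imported CSV; the vectorized variant is one possible alternative, not something proposed in the answers.

```python
import pandas as pd

# Hypothetical sample standing in for the imported CSV.
df = pd.DataFrame({"Amount": ["-10.00", "25.00", "-3.50", "-7.25", "40.00"]})
df["DB/CR"] = ["Debit" if a.startswith("-") else "Credit" for a in df["Amount"]]

# Fix 1: same loop as in the question, but each write is a single
# .loc[row, column] assignment, so no chained indexing takes place.
df["Top x loop"] = 0
i = 0
ii = 0
for ind in df.index:
    if df.loc[ind, "DB/CR"] == "Debit":
        i += 1
        df.loc[ind, "Top x loop"] = i
    else:
        ii += 1
        df.loc[ind, "Top x loop"] = ii

# Fix 2: no loop. cumcount() numbers the rows within each group from 0,
# in their existing (date) order, so adding 1 gives the running counter.
df["Top x"] = df.groupby("DB/CR").cumcount() + 1

print(df)
```

Both columns should end up identical; the groupby version avoids row-by-row writes, which tends to matter once the frame gets large.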

Dynamically filtering a Pandas DataFrame based on user input

I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed to points as a list of dictionaries containing various properties with keys 'A' to 'C'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow because each iteration scans the whole data frame. Running it once is manageable, but not once per selected point inside a loop.
Note that these filters are not additive, as in this example; each successive iteration of the for loop increases, rather than decreases, the size of filtered (as | rather than & operator is used).
Note also that I am aware of the existence of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators, however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join(
    f'((A == {point["a"]}) & (B == {point["b"]}) '
    f'& (C >= {point["c"]}) & (C < {point["d"]}))'
    for point in points
)
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
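One possible direction, sketched here under assumptions rather than offered as a benchmarked answer: turn points into a small DataFrame, join it to df once on the equality keys A and B, and then apply the half-open interval test on C to all candidate pairs in a single vectorized step. The sample data, the helper column names row_id, lo and hi, and the contents of list_of_column_names are all hypothetical. Whether this beats the boolean loop depends on how many rows share each (A, B) pair, so it would need timing on the real data.

```python
import pandas as pd

# Hypothetical stand-in for the ~680,000-row frame from the post.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": ["x", "x", "y", "y", "z"],
    "C": [0.5, 1.5, 2.5, 3.5, 4.5],
    "D": [10, 20, 30, 40, 50],
})
list_of_column_names = ["A", "C", "D"]

# Selected points, using the same dictionary keys as in the post.
points = [
    {"a": 1, "b": "x", "c": 0.0, "d": 1.0},
    {"a": 2, "b": "y", "c": 2.5, "d": 4.0},
]

# One row per selected point; rename so the join keys match df's columns.
pts = pd.DataFrame(points).rename(
    columns={"a": "A", "b": "B", "c": "lo", "d": "hi"}
)

# Join once on the equality conditions, carrying df's index along as row_id.
candidates = df.rename_axis("row_id").reset_index().merge(pts, on=["A", "B"])

# Interval test, inclusive at the lower end only: lo <= C < hi.
hits = candidates[(candidates["C"] >= candidates["lo"])
                  & (candidates["C"] < candidates["hi"])]

# A row can match several points (the OR in the loop); keep it once,
# in the original row order.
row_ids = sorted(hits["row_id"].unique())
filtered = df.loc[row_ids, list_of_column_names]

print(filtered)
```

Unlike the query-string approach, this keeps the column selection via list_of_column_names and does not grow a query string with the number of selected points.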

Best Way to "tag" data for fast parsing through matlab?

I collect data into an Excel sheet through a LabVIEW program. The data is collected continuously at a regular interval, and events are marked in one of the columns of the file, with TaskA_0 representing the start of an event and TaskA_1 representing the end. This is a snippet of the data:
Time Data 1 Data 2 Data 3 Data 4 Event Name
13:38:41.888 0.719460527 0.701654664 0.221332969 0.012234448 Task A_0
13:38:41.947 0.437707516 0.588673334 0.524042112 0.309975646 Task A_1
13:38:42.021 0.186847503 0.589175696 0.393891242 0.917737946 Task B_0
13:38:42.115 0.44490411 0.073132298 0.897701096 0.633815257 Task B_1
13:38:42.214 0.833793601 0.004524633 0.40950937 0.808966844 Task C_0
13:38:42.314 0.953997375 0.055717025 0.914080619 0.166492915 Task C_1
13:38:42.414 0.245698313 0.066643778 0.515709814 0.606289696 Task D_0
13:38:42.514 0.248038367 0.862138045 0.025489223 0.352926629 Task D_1
Currently I load this into MATLAB using xlsread, and then run strfind to locate the row indices of the event markers in order to break my data up into tasks, where each task is the data in the adjacent columns between TaskA_0 and TaskA_1 (here there is no data in between, but normally there is; there are also normally blank cells between event names). Is this the best method for doing this? Once I have the data in separate variables, I perform identical actions on each variable, usually basic statistics and some data plotting. If I want to batch process my data, I have to rewrite these lines over and over to get the data broken up by task, which even I know is wrong and horribly inefficient, but I don't know how to do this better.
[Data,Text]= xlsread('C:\TestData.xlsx',2); %time column and event name column end up in text, as does the data headers, hence the +1 for the row indices
IndexTaskAStart = find(~cellfun(@isempty,strfind(Text(:,2),'TaskA_0')))+1;
IndexTaskAEnd = find(~cellfun(@isempty, strfind(Text(:,2),'TaskA_1')))+1;
TaskAData = Data(IndexTaskAStart:IndexTaskAEnd,:);
Now I can perform analysis on columns in TaskAData, and repeat the process for the remaining tasks.
Presuming you cannot change the format of the files, but do know which tasks you're searching for, you can still automate the search by creating a list of task names, just appending _0 and _1 onto the task names to search. Then do not create individual named variables but store in a cell array for easier looping:
tasknames = {'Task A', 'Task B', 'Task C'};
for n = 1:numel(tasknames)
    first = find(~cellfun(@isempty,strfind(Text(:,2),[tasknames{n},'_0'])))+1;
    last = find(~cellfun(@isempty, strfind(Text(:,2),[tasknames{n},'_1'])))+1;
    task_data{n} = Data(first:last, :);
    % whatever other analysis you require goes here
end
If there are a large number of tasknames but they follow some pattern, you might prefer to create them on the fly instead of preallocating a list in tasknames.

Converting Excel functions into R

I have two excel functions that I am trying to convert into R:
numberShares
=IF(AND(N213="BOH",N212="BOH")=TRUE,P212,IF(AND(N213="BOH",N212="Sell")=TRUE,ROUNDDOWN(Q212/C213,0),0))
marketValue
=IF(AND(N212="BOH",N213="BOH")=TRUE,C213*P212,IF(AND(N212="Sell",N213="Sell")=TRUE,Q212,IF(AND(N212="BOH",N213="Sell")=TRUE,P212*C213,IF(AND(N212="Sell",N213="BOH")=TRUE,Q212))))
The cells that are referenced include:
c = closing price of a stock
n = position values of either "buy or hold" or "sell"
p = number of Shares
q = market value, assuming $10,000 initial equity (number of shares*closing price)
and the tops of the two output columns that I am trying to recreate look like this:
output
So far, in R I have constructed a dataframe with the necessary four columns:
data.frame
I just don't know how to write the functions that will populate the number of shares and market value columns. For loops? ifelse?
Again, thank you!!
Convert the AND()'s to infix "&"; the "=" to "=="; and the IF's to ifelse() and you are halfway there. The problem will be in converting your cell references to array or matrix references, and for that task we would have needed a better description of the data layout:
numberShares <-
  ifelse( N213=="BOH" & N212=="BOH",
          # Perhaps PosVal[213] == "BOH" & PosVal[212] == "BOH"
          # ... and very possibly the 213 should be 213:240 and the 212 should be 212:239
          P212,
          ifelse( N213=="BOH" & N212=="Sell",
                  trunc(Q212/C213),  # Excel's ROUNDDOWN(x, 0) truncates toward zero; round() would not
                  0))
(You seem to be returning incommensurate values, which seems pretty questionable.) Assuming this is correct code despite my misgivings, the next translation involves applying the same substitutions to this structure (although you seem to be missing an else-consequent in the last IF function):
marketValue <-
  IF( AND(N212="BOH", N213="BOH")=TRUE,
      C213*P212,
      IF( AND(N212="Sell",N213="Sell")=TRUE,
          Q212,
          IF( AND(N212="BOH",N213="Sell")=TRUE,
              P212*C213,
              IF( AND(N212="Sell",N213="BOH")=TRUE,
                  Q212))))
(Your testing for AND( .,.)=TRUE is I believe unnecessary in Excel and certainly unnecessary in R.)

Access list element using get()

I'm trying to use get() to access a list element in R, but am getting an error.
example.list <- list()
example.list$attribute <- c("test")
get("example.list") # Works just fine
get("example.list$attribute") # breaks
## Error in get("example.list$attribute") :
## object 'example.list$attribute' not found
Any tips? I am looping over a vector of strings which identify the list names, and this would be really useful.
Here's the incantation that you are probably looking for:
get("attribute", example.list)
# [1] "test"
Or perhaps, for your situation, this:
get("attribute", eval(as.symbol("example.list")))
# [1] "test"
# Applied to your situation, as I understand it...
example.list2 <- example.list
listNames <- c("example.list", "example.list2")
sapply(listNames, function(X) get("attribute", eval(as.symbol(X))))
# example.list example.list2
# "test" "test"
Why not simply:
example.list <- list(attribute="test")
listName <- "example.list"
get(listName)$attribute
# or, if both the list name and the element name are given as arguments:
elementName <- "attribute"
get(listName)[[elementName]]
If your strings contain more than just object names, e.g. operators like here, you can evaluate them as expressions as follows:
> string <- "example.list$attribute"
> eval(parse(text = string))
[1] "test"
If your strings are all of the type "object$attribute", you could also parse them into object/attribute, so you can still get the object, then extract the attribute with [[:
> parsed <- unlist(strsplit(string, "\\$"))
> get(parsed[1])[[parsed[2]]]
[1] "test"
flodel's answer worked for my application, so I'm gonna post what I built on it, even though this is pretty uninspired. You can access each list element with a for loop, like so:
#============== List with five elements of non-uniform length ================#
example.list=
list(letters[1:5], letters[6:10], letters[11:15], letters[16:20], letters[21:26])
#===============================================================================#
#====== for loop that names and concatenates each consecutive element ========#
derp=c(); for(i in 1:length(example.list))
{derp=append(derp,eval(parse(text=example.list[i])))}
derp #Not a particularly useful application here, but it proves the point.
I'm using code like this for a function that calls certain sets of columns from a data frame by the column names. The user enters a list with elements that each represent different sets of column names (each set is a group of items belonging to one measure), and the big data frame containing all those columns. The for loop applies each consecutive list element as the set of column names for an internal function* applied only to the currently named set of columns of the big data frame. It then populates one column per loop of a matrix with the output for the subset of the big data frame that corresponds to the names in the element of the list corresponding to that loop's number. After the for loop, the function ends by outputting that matrix it produced.
Not sure if you're looking to do something similar with your list elements, but I'm happy I picked up this trick. Thanks to everyone for the ideas!
"Second example" / tangential info regarding application in graded response model factor scoring:
Here's the function I described above, just in case anyone wants to calculate graded response model factor scores* in large batches... Each column of the output matrix corresponds to an element of the list (i.e., a latent trait with ordinal indicator items specified by column name in the list element), and the rows correspond to the rows of the data frame used as input. Each row should presumably contain mutually dependent observations, as from a given individual, to whom the factor scores in the same row of the output matrix belong. Also, I feel I should add that if all the items in a given list element use the exact same Likert scale rating options, the graded response model may be less appropriate for factor scoring than a rating scale model (cf. http://www.rasch.org/rmt/rmt143k.htm).
'grmscores'=function(ColumnNameList,DataFrame) {require(ltm) #(Rizopoulos,2006)
x = matrix ( NA , nrow = nrow ( DataFrame ), ncol = length ( ColumnNameList ))
for(i in 1:length(ColumnNameList)) #flodel's magic featured below!#
{x[,i]=factor.scores(grm(DataFrame[, eval(parse(text= ColumnNameList[i]))]),
resp.patterns=DataFrame[,eval(parse(text= ColumnNameList[i]))])$score.dat$z1}; x}
Reference
*Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response theory analyses, Journal of Statistical Software, 17(5), 1-25. URL: http://www.jstatsoft.org/v17/i05/
