I’m a developer who wants to use SystemML to run R-style code from our business people on a Spark cluster.
I’ve studied http://apache.github.io/systemml/dml-language-reference but haven’t found an implementation of the R function which or any alternative functionality. Does anyone have an idea how I could achieve the following?
Given
v = c(1, 4, NA, 2, 5, NA)
Expected indexes where the value meets the condition = int[] 2 5
v2 = which(v > 2)
Expected indexes where is.na returns TRUE = int[] 3 6
v3 = which(is.na(v))
I’ve already considered the functions replace() and removeEmpty(), but they don’t exactly meet my needs.
Thanks a lot in advance
Kuno
Just in case someone else stumbles over the same problem: R's which() can be emulated with the following workaround:
v2 = removeEmpty(target=seq(1,length(v)) * (v>2), margin="rows")
Furthermore, SystemML does not allow NA, so you would need to replace it with 0 or NaN (e.g., 0/0 = NaN). The extraction would then look like (v == 0) or (v != v), where the latter accounts for the fact that any comparison with NaN is false, so NaN is the only value that is not equal to itself.
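For readers more comfortable with Python, the same mask-multiply-then-compact logic can be sketched in NumPy; this is only an illustration of the idea, not SystemML code, and the vector is the one from the question with NA encoded as NaN:

import numpy as np

# The example vector from the question, NA encoded as NaN
v = np.array([1, 4, np.nan, 2, 5, np.nan])

# which(v > 2): build the 1-based positions, keep those where the mask holds,
# mirroring removeEmpty(target=seq(1, length(v)) * (v > 2), margin="rows")
idx = np.arange(1, len(v) + 1)
v2 = idx[np.nan_to_num(v) > 2]   # -> [2 5]

# which(is.na(v)): NaN is the only value not equal to itself
v3 = idx[v != v]                 # -> [3 6]
print(v2, v3)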
First post / total Python novice, so be patient with my slow understanding!
I have a dataframe containing a list of transactions by order of transaction date.
I've appended a new field/column called ["DB/CR"] that, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' in the absence of "-".
Since the transactions are in date order, I've included another new field/column called ["Top x"], which I want to populate with an incremental number (starting at 1) for debits and credits independently.
As such, I have created a simple loop with an associated 'if'/'elif' (probably could use 'else', as it's binary) that walks from row 0 to the last row of the df and, depending on whether the row is 1) "Debit" or 2) "Credit", increments a separate counter for each: integer 'i' for "Debit" and integer 'ii' for "Credit".
The code works as expected in terms of the 'Top x' output; however, I always receive the warning "A value is trying to be set on a copy of a slice from a DataFrame".
Trying to perfect my script so it runs without warnings, I've been trying to understand what I'm doing incorrectly, but I'm not getting it in terms of my use case.
I'd appreciate it if someone could kindly shed light on / propose how the code needs to be refactored to avoid this warning.
Code (the df source data is an imported csv):
# top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i + 1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii + 1
        df["Top x"][ind] = ii
Interpreter output:
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc[ind, "Top x"] = i (and df.loc[ind, "Top x"] = ii in the elif branch): pass the row label and the column name to a single .loc call instead of chaining df["Top x"][ind].
Alternatively, use iterrows() to iterate over the DataFrame; however, updating a DataFrame while iterating over it is not advisable. Refer to the iterrows() documentation:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
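For a loop-free alternative, the two counters can be expressed as a grouped cumulative count. A minimal runnable sketch with made-up data (only the column names "DB/CR" and "Top x" come from the question):

import pandas as pd

# Hypothetical stand-in for the imported CSV
df = pd.DataFrame({"Amount": ["-5.00", "12.00", "-7.50", "3.25"],
                   "DB/CR": ["Debit", "Credit", "Debit", "Credit"]})

# cumcount() numbers the rows of each "DB/CR" group starting at 0, so +1
# reproduces the independent i/ii counters without any loop and without
# triggering SettingWithCopyWarning.
df["Top x"] = df.groupby("DB/CR").cumcount() + 1
print(df)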
This returns all the first 'nd's as expected
select="osm/way/nd[1]"
This returns all the lasts:
select="osm/way/nd[last()]"
This returns both:
select="osm/way/nd[position() = 1 or position() = last()]"
Is there a syntax that avoids the position() function?
Something like this, but that actually works?
select="osm/way/nd[[1] or [last()]]"
There has been some debate about allowing a new syntax to select a range (https://github.com/qt4cg/qtspecs/issues/50#issuecomment-799228627), e.g. osm/way/nd[#1,last()] might work in a future XPath 4. Currently, however, it is all up in the air amid a lot of debate, and it is questionable whether a new operator would be more helpful than writing osm/way/nd[position() = (1, last())].
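For completeness, a small self-contained check of the position()-based expression, using Python's lxml with a made-up OSM-like snippet (note that lxml implements XPath 1.0, so the XPath 2.0 sequence form position() = (1, last()) would not work there):

from lxml import etree

# Hypothetical, minimal OSM-like document
xml = b"""
<osm>
  <way>
    <nd ref="a"/><nd ref="b"/><nd ref="c"/><nd ref="d"/>
  </way>
</osm>
"""
root = etree.fromstring(xml)

# First and last nd of each way via the position() predicate
# (the path is relative to the <osm> root element)
nodes = root.xpath("way/nd[position() = 1 or position() = last()]")
print([n.get("ref") for n in nodes])  # ['a', 'd']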
Me again, asking another Kusto-related question (I really wish there were a thorough video tutorial on this somewhere).
I have a summarize statement that produces two columns for the y axis and one for the x axis.
Now I want to relabel the two y-axis columns to show a string that I also got from the database and already put into a variable with let.
This basically looks like this:
let android_col = strcat("Android: ", toscalar(customEvents
| where application_Version contains secondLatestVersionAndroid));
let iOS_col = strcat("iOS: ", toscalar(customEvents
| where application_Version contains secondLatestVersionIOS));
... some Kusto magic ...
| summarize
Android = 100 - (round((countif(hasUnhandledErrorAndroid == 1 ) * 100.0 ) / countif(isAndroid == 1), 2)),
iOS = 100 - (round((countif(hasUnhandledErroriOS == 1) * 100.0 ) / countif(isIOS == 1), 2))
by Time
| render timechart with (ytitle="crashfree users in %", xtitle="date", legend=visible)
Now I want the summarize to display not Android and iOS but the values of android_col and iOS_col.
Is that possible?
Best regards
Maverick
Generally, it's suggested to have predefined column names; otherwise various features don't work. For example, IntelliSense won't know the names of the columns, as they would be determined only at run time. Also, if you create a function that returns a dynamic schema, you won't be able to run this function from other clusters.
However, if you do want to change column names, you definitely have a way to do it by using various plugins, for example bag_unpack, pivot and others.
As for courses on Kusto, there are actually several excellent courses on Pluralsight (all are free):
How to start with Microsoft Azure Data Explorer
Basic KQL
Azure Data Explorer – Advanced KQL
The usage of "toscalar" in this query looks wrong; it seems to me that you should use the "extend" operator with the same logic to create the additional columns.
I would appreciate suggestions for a more computationally efficient way to dynamically filter a Pandas DataFrame.
The size of the DataFrame, len(df.index), is around 680,000.
This code from the callback function of a Plotly Dash dashboard is triggered when points on a scatter graph are selected. These points are passed in as points, a list of dictionaries containing various properties relating to columns 'A' to 'C', accessed below via the keys 'a' to 'd'. This allows the user to select a subset of the data in the pandas.DataFrame instance df for cross-filtering analysis.
rows_boolean = pandas.Series([False] * len(df.index))
for point in points:
    current_condition = ((df['A'] == point['a']) & (df['B'] == point['b'])
                         & (df['C'] >= point['c']) & (df['C'] < point['d']))
    rows_boolean = rows_boolean | current_condition
filtered = df.loc[rows_boolean, list_of_column_names]
The body of this for loop is very slow, as it evaluates conditions over the whole data frame on every iteration; that is manageable once, but not inside a loop.
Note that these filters do not narrow the result; each successive iteration of the for loop increases, rather than decreases, the size of filtered (since the | rather than the & operator is used).
Note also that I am aware of the method df['C'].between(point['c'], point['d']) as an alternative to the last two comparison operators; however, I only want this comparison to be inclusive at the lower end.
Solutions I have considered
Searching the many frustratingly similar posts on SO reveals a few ideas which get some of the way:
Using pandas.DataFrame.query() will require building a (potentially very large) query string as follows:
query = ' | '.join(f'((A == {point["a"]}) & (B == {point["b"]})'
                   f' & (C >= {point["c"]}) & (C < {point["d"]}))'
                   for point in points)
filtered = df.query(query)
My main concern here is that I don’t know how efficient the query method becomes when the query passed has several dozen (or even several hundred) conditions strung together. This solution also currently does not allow the selection of columns using list_of_column_names.
Another possible solution could come from implementing something like this.
To reiterate, speed is key here, so I'm not just after something that works, but something that works a darn sight faster than my boolean implementation above:
There should be one-- and preferably only one --obvious way to do it. (PEP 20)
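One direction not raised in the thread is to resolve the equality part of every condition with a single merge and leave only the range check to a vectorised comparison. A sketch with hypothetical stand-ins for df, points and list_of_column_names (only the column names and keys come from the question):

import pandas as pd

# Hypothetical stand-ins for the question's objects
df = pd.DataFrame({"A": [1, 1, 2, 2],
                   "B": [5, 6, 5, 6],
                   "C": [0.1, 0.5, 0.9, 1.5],
                   "D": ["w", "x", "y", "z"]})
points = [{"a": 1, "b": 5, "c": 0.0, "d": 1.0},
          {"a": 2, "b": 6, "c": 1.0, "d": 2.0}]
list_of_column_names = ["C", "D"]

points_df = pd.DataFrame(points)

# One merge resolves all (A == a) & (B == b) pairs at once; reset_index()
# keeps each original row id so rows can be mapped back afterwards.
candidates = df.reset_index().merge(points_df,
                                    left_on=["A", "B"], right_on=["a", "b"])

# Vectorised range check, inclusive only at the lower end
in_range = (candidates["C"] >= candidates["c"]) & (candidates["C"] < candidates["d"])

# A row may satisfy several points, so de-duplicate on the original index
keep = candidates.loc[in_range, "index"].unique()
filtered = df.loc[keep, list_of_column_names]
print(filtered)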
I am trying to replace some nan values in a few columns using a calculation from other columns.
i.e.
nancolumn = column1.value + column2.value
My first attempt didn't work, i.e. there are still NaN values:
indecies = list(list(map(tuple, np.where(np.isnan(df['nancolumn']))))[0])
newValue = df.iloc[indecies]['column1'] + df.iloc[indecies]['column2']
df.iloc[indecies]['nancolumn'] = newValue
I then found a specific index that I wanted to replace, 1805, and tried just replacing this data point's value with 1.0. The result is still NaN:
df.iloc[1805]['nancolumn'] = 1.0
I tried using fillna() and isnan():
df[np.isnan(df)] = 1
I get this error for the isnan() attempt:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
df.iloc[1805]['nancolumn'].dtype
dtype('float64')
I know I'm missing something simple, but I can't figure it out.
Can someone please help?
I found out that it's best to reference the column first and then the index, like below:
df['nancolumn'].iloc[1805] = 1.0
Although I still don't really understand the difference. If anyone has an explanation, that would be helpful.
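The difference is that df.iloc[1805]['nancolumn'] first materialises the row (a copy when the frame has mixed dtypes), so the assignment lands on that temporary object, while df['nancolumn'].iloc[1805] writes through a view of the single column. The idiomatic route avoids chained indexing altogether; a runnable sketch with a tiny stand-in frame (the column names come from the question):

import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the question's DataFrame
df = pd.DataFrame({"column1": [1.0, 2.0, 3.0],
                   "column2": [10.0, 20.0, 30.0],
                   "nancolumn": [11.0, np.nan, np.nan]})

# Fill every NaN in 'nancolumn' from the other two columns in one step
df["nancolumn"] = df["nancolumn"].fillna(df["column1"] + df["column2"])

# Single-cell assignment in one .loc call: row label first, then column,
# so the value is written straight into df rather than onto a copy
df.loc[df.index[1], "nancolumn"] = 99.0
print(df)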