How many events are stored in my PredictionIO event server? - apache-spark

I imported an unknown number of events into my PIO eventserver and now I want to know that number (in order to measure and compare recommendation engines). I could not find an API for that, so I had a look at the MySQL database my server uses. I found two tables:
mysql> select count(*) from pio_event_1;
+----------+
| count(*) |
+----------+
| 6371759 |
+----------+
1 row in set (8.39 sec)
mysql> select count(*) from pio_event_2;
+----------+
| count(*) |
+----------+
| 2018200 |
+----------+
1 row in set (9.79 sec)
Both tables look very similar, so I am still unsure.
Which table is relevant? What is the difference between pio_event_1 and pio_event_2?
Is there a command or REST API where I can look up the number of stored events?

You could go through the Spark shell, as described in the troubleshooting docs.
Launch the shell with:
pio-shell --with-spark
Then find all events for your app and count them:
import io.prediction.data.store.PEventStore
PEventStore.find(appName="MyApp1")(sc).count
You could also filter to find different subsets of events by passing more parameters to find; see the API docs for details. The LEventStore is also an option.
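For example, counting only certain event types is a matter of passing eventNames to find. A minimal sketch in Scala, assuming the parameter names from the PEventStore API docs and made-up event names:
import io.prediction.data.store.PEventStore

// Count only specific event types for one app.
// "rate" and "buy" are example event names; replace them with yours.
val n = PEventStore.find(
  appName = "MyApp1",
  eventNames = Some(Seq("rate", "buy"))
)(sc).count
println(n)
Counting through PEventStore also sidesteps the question of which pio_event_N table belongs to which app, since you address the data by appName.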

If your event server is backed by PostgreSQL, connect to the database with psql:
\c db_name
List the tables:
\dt
Run the count query:
select count(*) from pio_event_1;
PHP
<?php
$dbconn = pg_connect("host=localhost port=5432 dbname=db_name user=postgres");
$result = pg_query($dbconn, "select count(*) from pio_event_1");
if (!$result) {
    echo "An error occurred.\n";
    exit;
}
// Not the best way, but output the total number of events.
while ($row = pg_fetch_row($result)) {
    echo '<P><center>' . number_format($row[0]) . ' Events</center></P>';
}
?>
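If your event store is PostgreSQL and you would rather stay inside Spark (for instance in pio-shell), the same count can be read through Spark's JDBC data source. A minimal sketch, assuming a SparkSession named spark, the PostgreSQL JDBC driver on the classpath, and placeholder connection details:
// Read one event table over JDBC and count its rows.
val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/db_name")  // placeholder URL
  .option("dbtable", "pio_event_1")
  .option("user", "postgres")                                 // placeholder credentials
  .option("password", "secret")
  .load()
println(events.count())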

Related

Azure Custom Log Alert - Not Firing

I am trying to debug an issue with an Azure Alert not firing. This alert should run every 30 minutes and find any devices that have not emitted a heartbeat in the last 30 minutes up to the hour. In addition, an alert should only be fired once for each device until it becomes healthy again.
The Kusto query is:
let missedHeartbeatsFrom30MinsAgo = traces
| where message == "Heartbeat"
| summarize arg_max(timestamp, *) by tostring(customDimensions.id)
| project Id = customDimensions_id, LastHeartbeat = timestamp
| where LastHeartbeat < ago(30m);
let missedHeartbeatsFrom1HourAgo = traces
| where message == "Heartbeat"
| summarize arg_max(timestamp, *) by tostring(customDimensions.id)
| project Id = customDimensions_id, LastHeartbeat = timestamp
| where LastHeartbeat <= ago(1h);
let unhealthyIds = missedHeartbeatsFrom30MinsAgo
| join kind=leftanti missedHeartbeatsFrom1HourAgo on Id;
let deviceDetails = customEvents
| where name == "Heartbeat"
| distinct tostring(customDimensions.deviceId), tostring(customDimensions.fullName)
| project Id = customDimensions_deviceId, FullName = customDimensions_fullName;
unhealthyIds |
join kind=leftouter deviceDetails on Id
| project Id, FullName, LastHeartbeat
| order by FullName asc
The rules for this alert are:
When I pull the plug on a device, wait ~30 minutes, and run the query manually in App Insights, I see the device in the results data set. However, no alert gets generated (nothing shows up in the Alerts history page and no one in the Action Group gets notified). Any help in this matter would be greatly appreciated!
Your KQL query takes quite a while to execute and consumes a lot of resources each time it runs.
Optimize the query so it uses fewer resources and returns its result quickly.
Make sure the Status of your alert processing rule is Enabled.
Once that is done, make sure the alert condition is set to fire when the number of results is greater than or equal to 1, so the rule fires whenever the query returns at least one unhealthy device.
If the alert still does not fire, delete it, run the query in the query editor to confirm it returns results, and create a new alert rule.

Grafana azure log analytics transfer query from logs

I have this query that works in Azure Logs when I set the scope to the specific Application Insights resource I want to use:
let usg_events = dynamic(["*"]);
let mainTable = union pageViews, customEvents, requests
| where timestamp > ago(1d)
| where isempty(operation_SyntheticSource)
| extend name =replace("\n", "", name)
| where '*' in (usg_events) or name in (usg_events)
;
let queryTable = mainTable;
let cohortedTable = queryTable
| extend dimension =tostring(client_CountryOrRegion)
| extend dimension = iif(isempty(dimension), "<undefined>", dimension)
| summarize hll = hll(user_Id) by tostring(dimension)
| extend Users = dcount_hll(hll)
| order by Users desc
| serialize rank = row_number()
| extend dimension = iff(rank > 5, 'Other', dimension)
| summarize merged = hll_merge(hll) by tostring(dimension)
| project ["Country or region"] = dimension, Counts = dcount_hll(merged);
cohortedTable
but trying to use the same query in Grafana just gives an error:
"'union' operator: Failed to resolve table expression named 'pageViews'"
This is the same error I get in Azure Logs if I don't set the scope to the specific Application Insights resource. So my question is: how do I make Grafana target this specific scope inside the logs? The query just gets the countries of the users that log in.
As far as I know, there is currently no option or feature to set a scope in Grafana.
The scope selector is available only in the Azure Log Analytics workspace.
If you want this feature, please raise a request in the Grafana Community, where such issues are officially addressed.

Excluding data in KQL SLA charts

We are showing SLA charts for URLs, VPNs and VMs. If there is any planned scheduled maintenance, we want to exclude those timings from the KQL SLA charts, since it is known downtime.
We are disabling alerts via PowerShell during this window and passing the columns below to a Log Analytics custom table.
"resourcename": "$resourcename",
"Alertstate": "Enabled",
"Scheduledmaintenance" : "stop",
"Environment" : "UAT",
"timestamp": "$TimeStampField",
Now we want to join the SLA chart queries with the custom table data and exclude the scheduled maintenance time range from the SLA charts.
Adding the query as requested:
url_json_CL
| where Uri_s contains "xxxx"
| extend Availablity = iff(StatusCode_d ==200,1.000,0.000)
| extend urlhit = 1.000
| summarize PassCount = sum(Availablity), TestCount = sum(urlhit) by Uri_s ,ClientName_s
| extend AVLPERCENTAGE = ((PassCount / TestCount ) * 100)
| join kind=leftouter
( scheduledmaintenance2_CL
| where ResourceName_s == "VMname"
| where ScheduledMaintenance_s == "start"
| extend starttime = timestamp_t)
on ClientName_s
| join kind= leftouter
(scheduledmaintenance2_CL
| where ResourceName_s == "VMname"
| where ScheduledMaintenance_s == "stop"
| extend stoptime = timestamp_t )
on ClientName_s
| extend excludedtime=stoptime - starttime
| project ClientName_s, ResourceName_s, excludedtime, AVLPERCENTAGE , Uri_s
| top 3 by ClientName_s desc
You can perform cross-resource log queries in Azure Monitor.
Using the Application Insights explorer, you can query Log Analytics workspace custom tables as well:
workspace("/subscriptions/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx/resourcegroups/rgname/providers/Microsoft.OperationalInsights/workspaces/workspacename").Event | count
Using the Log Analytics logs explorer, you can query the Application Insights availability results:
app("applicationinsightsinstancename").availabilityResults
You can use either of the above options to query the required tables and join them. Please refer to the documentation on joins.
Hope this helps.

How to cache subquery result in WITH clause in Spark SQL

I wonder if Spark SQL supports caching the result of a query defined in a WITH clause.
The Spark SQL query is something like this:
with base_view as
(
select some_columns from some_table
WHERE
expensive_udf(some_column) = true
)
... multiple query join based on this view
While this query works with Spark SQL, I noticed that the UDF was applied to the same data set multiple times.
In this use case, the UDF is very expensive. So I'd like to cache the query result of base_view so the subsequent queries would benefit from the cached result.
P.S. I know you can create and cache a table with the given query and then reference it in the subqueries. In this specific case, though, I can't create any tables or views.
That is not possible. The WITH result cannot be persisted after execution or substituted into a new Spark SQL invocation.
The WITH clause allows you to give a name to a temporary result set so it can be reused several times within a single query. I believe what he's asking for is a materialized view.
This can be done by executing several SQL queries.
// first cache the subquery result as a table
spark.sql("""
CACHE TABLE base_view AS
select some_columns
from some_table
WHERE
expensive_udf(some_column) = true""")
// then use it
spark.sql("""
... multiple query join based on this view
""")
Not sure if you are still interested in the solution, but the following is a workaround to accomplish the same:-
spark.sql("""
| create temp view my_view
| as
| WITH base_view as
| (
| select some_columns
| from some_table
| WHERE
| expensive_udf(some_column) = true
| )
| SELECT *
| from base_view
""");
spark.sql("""CACHE TABLE my_view""");
Now you can use the my_view temp view to join to other tables as shown below-
spark.sql("""
| select mv.col1, t2.col2, t3.col3
| from my_view mv
| join tab2 t2
| on mv.col2 = t2.col2
| join tab3 t3
| on mv.col3 = t3.col3
""");
Remember to uncache the view after using-
spark.sql("""UNCACHE TABLE my_view""");
Hope this helps.

Azure Log Analytics - Search REST API - How to Paginate through results

When grabbing search results using the Azure Log Analytics Search REST API,
I'm able to receive only the first 5000 results (as per the specs at the top of the document), but I know there are many more (from the "total" attribute in the metadata of the response).
Is there a way to paginate so as to get the entire result set?
One hacky way would be to break the desired time range down iteratively until the "total" is less than 5000 for each sub-range, and repeat this across the entire desired time range, but that is guesswork that would cost many redundant requests.
While there doesn't appear to be a way to paginate using the REST API itself, you can use the query to perform the pagination. The two key operators here are top and skip.
Suppose you want page n with page size x (starting at page 1); then append to your query:
query | skip (n-1) * x | top x
For a full reference list, see https://learn.microsoft.com/en-us/azure/log-analytics/log-analytics-search-reference
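As a quick illustration of that arithmetic, the paged query string could be built like this (a hypothetical helper; note the answer below mentions that skip may no longer be available in your workspace):
// Build "query | skip (n-1) * x | top x" for a 1-based page number n and page size x.
def pagedQuery(baseQuery: String, page: Int, pageSize: Int): String = {
  require(page >= 1 && pageSize >= 1, "page and pageSize must be positive")
  s"$baseQuery | skip ${(page - 1) * pageSize} | top $pageSize"
}

// pagedQuery("Heartbeat | sort by TimeGenerated desc", 3, 1000)
//   => "Heartbeat | sort by TimeGenerated desc | skip 2000 | top 1000"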
Yes, the skip operator is not available anymore, but if you want to build pagination there is still an option. You need the total count of entries, some simple math, and two opposite sort orders.
Prerequisites for this query are the values ContainerName, Namespace, Page and PageSize.
I'm using it in a Workbook where these values are set by fields.
let containers = KubePodInventory
| where ContainerName matches regex '^.*{ContainerName}$' and Namespace == '{Namespace}'
| distinct ContainerID
| project ContainerID;
let TotalCount = toscalar(ContainerLog
| where ContainerID in (containers)
| where LogEntry contains '{SearchText}'
| summarize CountOfLogs = count()
| project CountOfLogs);
ContainerLog
| where ContainerID in (containers)
| where LogEntry contains '{SearchText}'
| extend Log=replace(#'(\x1b\[[0-9]*m|\x1b\[0 [0-9]*m)','', LogEntry)
| project TimeGenerated, Log
| sort by TimeGenerated asc
| take {PageSize}*{Page}
| top iff({PageSize}*{Page} > TotalCount, TotalCount - ({PageSize}*({Page} - 1)) , {PageSize}) by TimeGenerated desc;
// The '| extend' is not needed if the logs do not contain the annoying special characters (ANSI escape codes)
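The take/top arithmetic in the last two lines is easier to follow outside KQL; a minimal sketch of the same windowing logic (a hypothetical helper, not part of the query):
// Number of rows the final 'top ... desc' keeps for a given page:
// every page gets pageSize rows except the last one, which gets the remainder.
def rowsOnPage(totalCount: Int, page: Int, pageSize: Int): Int =
  if (pageSize * page > totalCount) totalCount - pageSize * (page - 1)
  else pageSize

// With totalCount = 23 and pageSize = 10:
//   rowsOnPage(23, 1, 10) == 10, rowsOnPage(23, 2, 10) == 10, rowsOnPage(23, 3, 10) == 3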
