How to dynamically limit time range?

I have two sourcetypes:
A defines the period of activities:
_time, entity, start_time, end_time, activity, ...
B defines the 2D position of the entities:
_time, entity, x, y, ....
Now I want to extract only those rows in B whose entities' _time falls within the periods defined in A. How can I do that? It seems I can't make a time comparison with the 'join' command.

You're right, join won't be much help here. I've found that the Splunk way to match up information from two indexes is to start with both indexes and manipulate the heterogeneous events as if they were a single index.
In this case, one approach uses streamstats to produce events that are denormalized so that each position event carries the relevant activity fields. First, make sure each event from index A uses start_time as its _time. Then use streamstats to carry forward, for each entity, the latest non-null start_time, end_time, and activity values (which come from index A) onto the events that lack them (the position events from index B). Finally, filter out any events whose _time is later than the carried-forward end time, which would be any position event that falls outside an activity window.
index=A OR index=B
| eval _time=coalesce(start_time, _time)
| streamstats latest(start_time) as activity_start_time, latest(end_time) as activity_end_time, latest(activity) as activity by entity
| where _time<=activity_end_time
Keep in mind that this approach assumes that activities are neatly ordered, so that no activity overlaps another. This would be a bit trickier if activities can overlap.
Another method I sometimes use is transaction instead of streamstats. It gives much more control over the logic around when one activity starts and ends, and it results in a single event per activity with multi-valued fields for the positions. You'd want to start with a single "point" field for each position if you took this route.

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now, if I wanted to find out cases where a specific column's value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS Before, t2.Col12 AS After
FROM table1 t1 INNER JOIN table1 t2 ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2
AND t1.Key3 = t2.Key3 WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all columns that changed across all records. I don't care about the actual values that changed, and it doesn't even have to be per row. The 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change, so that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
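If you would rather stay with the self-join approach, the per-column comparison can be generated programmatically instead of written by hand. Below is a minimal PySpark sketch, assuming the table is registered as table1, spark is an active SparkSession, and every column other than the three keys should be checked; treat it as an illustration rather than a drop-in solution.
from pyspark.sql import functions as F

keys = ["Key1", "Key2", "Key3"]
t1 = spark.table("table1").alias("t1")
t2 = spark.table("table1").alias("t2")

# Self-join on the three keys, then flag, per non-key column, whether the two
# sides differ (null-safe comparison, so NULL -> value also counts as a change).
value_cols = [c for c in t1.columns if c not in keys]
joined = t1.join(t2, keys)

flags = joined.select([
    (~F.col(f"t1.{c}").eqNullSafe(F.col(f"t2.{c}"))).cast("int").alias(c)
    for c in value_cols
])

# One summary row: for each column, 1 if any pair of versions ever differed.
summary = flags.agg(*[F.max(c).alias(c) for c in value_cols]).first().asDict()
changed_columns = [name for name, flag in summary.items() if flag == 1]
print(changed_columns)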

PySpark Design Pattern for Combining Values Based on Criteria

Hi, I am new to PySpark and want to create a function that takes a table of duplicate rows and a dict of {field_name: [{"source": ..., "approach": ...}, ...]} as input and creates a new record. The new record will be equal to the first non-null value in the priority list, where each "approach" is a function.
For example, the input table looks like this for a specific component:
And given this priority dict:
The output record should look like this:
The new record looks like this because, for each field, the selected function dictates how the value is chosen (e.g. phone is 0.75 because Amazon's most complete record is null, so you coalesce to the next approach in the list, which is the value of phone for Google's most complete record, 0.75).
Essentially, I want to write a PySpark function that groups by component and then applies the appropriate function for each column to get the correct value. While I have a function that "works", its performance is terrible because I am naively looping through each component, then each column, then each approach in the list to build the record.
Any help is much appreciated!
I think you can solve this using pyspark.sql.functions.when. See this blog post for some complicated usage patterns. You're going to want to group by id and then use when expressions to implement your logic. For example, 'title': {'source': 'Google', 'approach': 'first record'} can be implemented as
from pyspark.sql.functions import col, first, when

df.groupBy("id").agg(
    first(when(col("source") == "Google", col("title")), ignorenulls=True).alias("title")
)
'Most recent' and 'most complete' are more complicated and may require some self-joins, but you should still be able to use when clauses to get the aggregates you need.
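Taking that idea further, the whole priority dict can usually be collapsed into a single groupBy/agg: build one aggregate per source in priority order and coalesce them, so the first non-null candidate wins and the per-component Python loop disappears. A rough sketch, assuming the input DataFrame df has id and source columns and using a made-up priorities dict; the 'most recent' and 'most complete' approaches would need their own aggregate expressions (or pre-computed window columns) in place of first().
from pyspark.sql import functions as F

# Hypothetical priority config: output field -> ordered list of preferred sources.
priorities = {
    "title": ["Google", "Amazon"],
    "phone": ["Amazon", "Google"],
}

agg_exprs = []
for field, sources in priorities.items():
    # One candidate aggregate per source; coalesce picks the first non-null one.
    candidates = [
        F.first(F.when(F.col("source") == s, F.col(field)), ignorenulls=True)
        for s in sources
    ]
    agg_exprs.append(F.coalesce(*candidates).alias(field))

result = df.groupBy("id").agg(*agg_exprs)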

Maintain statistics across rows in Accumulo

I am relatively new to Accumulo, so would greatly appreciate general tips for doing this better.
I have rowIds that are made up of a time component and a geographic component. I'd like to maintain statistics (counts, sums, etc.) in an iterator of some sort, but would like to emit mutations to other rows as part of the ingest. In other words, as I insert a row:
<timeA>_<geoX> colFam:colQual value
In addition to the mutation above, I'd like to maintain stats in separate rows in the same table (or a different one) as follows:
timeA_countRow colFam:colQual count++
geoX_countRow colFam:colQual count++
timeA_sumRow colFam:colQual sum += value
geoX_sumRow colFam:colQual sum += value
What is the best way to accomplish such a thing? I have definitely seen the stats combiner, but that works within a single row to my understanding. I'd like to maintain stats based on parts of the key...
Thanks!
In addition to the mutation above, I'd like to maintain stats in separate rows in the same table (or a different one) as follows
This is something that fundamentally does not work with Accumulo. You cannot know, within the confines of an Iterator, about data in a separate row. That's why the StatsCombiner is written in the context of a single row. Any other row is not guaranteed to be contained in the Tablet (physical data boundary).
A common approach is to maintain this information client-side via a separate table or locality group with a SummingCombiner. When you insert an update for a specific column, you also submit an update to your stats table.
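For example, the ingest path can fan each incoming key/value out into the extra stat mutations on the client. A sketch in Python, where write_mutation(table, row, colfam, colqual, value) is a hypothetical helper standing in for your Accumulo BatchWriter calls, and the stats table is assumed to have a SummingCombiner configured, so only deltas are written:
def ingest(write_mutation, time_part, geo_part, colfam, colqual, value):
    # The original data mutation: <timeA>_<geoX> colFam:colQual value
    write_mutation("data", f"{time_part}_{geo_part}", colfam, colqual, value)

    # Count deltas, keyed by each part of the rowId; the SummingCombiner rolls
    # these up server-side into running counts.
    write_mutation("stats", f"{time_part}_countRow", colfam, colqual, 1)
    write_mutation("stats", f"{geo_part}_countRow", colfam, colqual, 1)

    # Sum deltas, same idea but accumulating the value itself.
    write_mutation("stats", f"{time_part}_sumRow", colfam, colqual, value)
    write_mutation("stats", f"{geo_part}_sumRow", colfam, colqual, value)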
You could also look into Fluo which allows you to perform cross-row transactions. This is a different beast than normal Accumulo and is still in beta.

Joining on a view in esqueleto

I have an sql view V which has a 0:1 correspondence to a table X. I would like to join this view onto another table, Y, which has a reference to X (type XId).
I have specified the view as I would any other table in persistent. V's id column is a reference to X, but declaring the view as a table in persistent naturally gives it the type VId
instead of XId. And so I can't join the view onto Y because the types don't match up.
I realize I can do this with rawSql, but my query also has an IN clause, which doesn't seem to play well with a list of values (using rawSql).
Another option is to select the XId column twice in the view, and specify the extra one as having type XId in the model definition.
Lastly I could fall back to inserting the view query inline or doing the query entirely with raw sql, skipping persistent's interpolation.
Is there a way to do this without resorting to the methods above?
I'd prefer to use esqueleto if possible.
I haven't found a proper solution to this yet.
For the time being I am selecting each primary key twice in the view, e.g.
... SELECT id, id AS xId...
along with adding the corresponding table's key type to the second selected id in the view schema:
XView sql=xView
...
xId XId

Building a pagination cursor

I have activities that are stored in a graph database. Multiple activities are grouped and aggregated into 1 activity in some circumstances.
A processed activity feed could look like this:
Activity 1
Activity 2
Grouped Activity
Activity 3
Activity 4
Activity 5
Activities have an updated timestamp and a unique id.
The activities are ordered by their updated time and in the case of a grouped activity, the most recent updated time within its child activities is used.
Activities can be inserted anywhere in the list (for example, if we start following someone, their past activities would be inserted into the list).
Activities can be removed from anywhere in the list.
Due to the amount of data, using the timestamp with microseconds can still result in conflicts (2 items can have the same timestamp).
Cursor identifiers should be unique and stable. Adding and removing feed items should not change the identifier.
I would like to introduce cursor-based paging to allow clients to paginate through the feed, similar to Twitter's. There doesn't seem to be much information on how these are built; I have only found this blog post talking about implementing them. However, it seems to have a problem if the cursor's identifier happens to point to an item that was removed.
With the above in mind, how can I produce an identifier that can be used as a cursor? Initially, I considered combining the timestamp with the unique id: 1371813798111111.myuniqueid. However, if the item at 1371813798111111.myuniqueid is deleted, I can get the items with the 1371813798111111 timestamp, but would not be able to determine which item with that timestamp I should start with.
Another approach I had was to assign an incrementing number to each feed result. Since the numbers are incrementing and in order, if a number/id is missing, I can just choose the next one. However, the problem with this is that the cursor ids will change if I start removing and adding feed items in the middle of the feed. One solution I had for this was to leave a huge gap between consecutive numbers, but it is difficult to determine how new items can be added to the space between numbers in a deterministic way. In addition, as new items are added and the gaps fill up, we end up with the same problem.
Simply put, if I have a list of items where items can be added and removed from anywhere in the list, what is the best way to generate an id for each list item such that if the item for the id is deleted, I can still determine its position in the list?
You need an additional (or existing) column that increases sequentially for every new row added to the target table. Let's call this column seq_id.
When the client requests the first page:
GET /api/v1/items?sort_by={sortingFieldName}&size={count}
where sortingFieldName is the name of the field by which we sort.
What happens under the hood:
SELECT * FROM items
WHERE ... // apply search params
ORDER BY sortingFieldName, seq_id
LIMIT :count
Response:
{
  "data": [...],
  "cursor": {
    "prev_field_name": "{result[0].sortingFieldName}",
    "prev_id": "{result[0].seq_id}",
    "next_field_name": "{result[count-1].sortingFieldName}",
    "next_id": "{result[count-1].seq_id}",
    "prev_results_link": "/api/v1/items?size={count}&cursor=bw_{prev_field_name}_{prev_id}",
    "next_results_link": "/api/v1/items?size={count}&cursor=fw_{next_field_name}_{next_id}"
  }
}
The next part of the cursor will not be present in the response if we retrieved fewer than count rows.
The prev part of the cursor will not be present in the response if there was no cursor in the request or there is no data to return.
When the client performs a request again, it needs to use the cursor. Forward cursor:
GET /api/v1/items?size={count}&cursor=fw_{next_field_name}_{next_id}
What happens under the hood:
SELECT * FROM items
WHERE ... // apply search params
AND ((sortingFieldName = :cursor.next_field_name AND seq_id > :cursor.next_id) OR
sortingFieldName > :cursor.next_field_name)
ORDER BY sortingFieldName, seq_id
LIMIT :count
Or backward cursor:
GET /api/v1/items?size={count}&cursor=bw_{prev_field_name}_{prev_id}
What happens under the hood:
SELECT * FROM items
WHERE ... // apply search params
AND ((sortingFieldName = :cursor.prev_field_name AND seq_id < :cursor.prev_id) OR
sortingFieldName < :cursor.prev_field_name)
ORDER BY sortingFieldName DESC, seq_id DESC
LIMIT :count
The response will be similar to the previous one.
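Putting this together is mostly token handling plus a keyset query. Here is a minimal Python sketch, assuming a db.query(sql, params) helper that returns rows as dicts and the fw_/bw_ token format above; sort_field must come from a server-side whitelist, never from raw user input, since it is interpolated into the SQL, and the extra search filters from the WHERE clause are omitted.
def encode_cursor(direction, field_value, seq_id):
    # Plain "fw_<field value>_<seq_id>" / "bw_..." token, as in the response above.
    return f"{direction}_{field_value}_{seq_id}"

def decode_cursor(token):
    direction, rest = token.split("_", 1)
    field_value, seq_id = rest.rsplit("_", 1)   # field value itself may contain "_"
    return direction, field_value, int(seq_id)

def fetch_page(db, count, sort_field, cursor=None):
    if cursor is None:
        direction = "fw"
        sql = f"SELECT * FROM items ORDER BY {sort_field}, seq_id LIMIT %(count)s"
        params = {"count": count}
    else:
        direction, field_value, seq_id = decode_cursor(cursor)
        op, order = (">", "ASC") if direction == "fw" else ("<", "DESC")
        sql = (
            f"SELECT * FROM items"
            f" WHERE ({sort_field} = %(fv)s AND seq_id {op} %(sid)s)"
            f"    OR {sort_field} {op} %(fv)s"
            f" ORDER BY {sort_field} {order}, seq_id {order}"
            f" LIMIT %(count)s"
        )
        params = {"fv": field_value, "sid": seq_id, "count": count}

    rows = db.query(sql, params)
    if direction == "bw":
        rows = list(reversed(rows))        # hand results back in ascending order

    # prev only when the request carried a cursor; next only when the page is full.
    prev_cursor = (
        encode_cursor("bw", rows[0][sort_field], rows[0]["seq_id"])
        if cursor is not None and rows else None
    )
    next_cursor = (
        encode_cursor("fw", rows[-1][sort_field], rows[-1]["seq_id"])
        if len(rows) == count else None
    )
    return rows, prev_cursor, next_cursor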
