Elasticsearch dsl - large unique list of single column in python - python-3.x

I have a large set of Windows event logs and I am trying to get a unique listing of users from a single column for a single event ID. The code below runs, but takes an extremely long time. How would you use Python's elasticsearch-dsl and elasticsearch-py to accomplish this?
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch([localhostmines], timeout=30)  # localhostmines: host list defined elsewhere
s = Search(using=es, index="logindex-*").filter('term', EventID="4624")

users = set()
for hit in s.scan():
    users.add(hit.TargetUserName)
print(users)
The TargetUserName column contains user names as strings, and the EventID column contains Windows event IDs as strings.

You need to use a terms aggregation, which will do exactly what you expect.
s = Search(using=es, index="logindex-*").filter('term', EventID="4624")
s.aggs.bucket('per_user', 'terms', field='TargetUserName')

response = s.execute()
for user in response.aggregations.per_user.buckets:
    print(user.key, user.doc_count)
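One caveat: by default a terms aggregation returns only the top 10 buckets. If you expect a large number of distinct users, raise the size on the bucket (the value below is an arbitrary assumption) or look at the composite aggregation to page through all buckets:
s.aggs.bucket('per_user', 'terms', field='TargetUserName', size=10000)  # size is an arbitrary upper bound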

Related

Is there a more efficient way to interact with ItemPaged objects from azure-data-tables SDK function query_entities?

The quickest method I have found is to convert the ItemPaged object to a list using list(), after which I can manipulate/extract with a pandas DataFrame. However, if I have millions of results the process can be quite time-consuming, especially if I only want every nth result over a certain time frame. Typically I would have to query the entire time frame and then loop again to keep only every nth element. Does anyone know a more efficient way to use query_entities, or a more efficient way to return every nth item from ItemPaged (more specifically from table.query_entities)? A portion of my code is below:
import pandas as pd
from azure.data.tables import TableServiceClient

connection_string = "connection string here"
service = TableServiceClient.from_connection_string(conn_str=connection_string)
table_string = ""
table = service.get_table_client(table_string)
entities = table.query_entities(query_filter, select=select)  # filter/select values omitted in the question
results = pd.DataFrame(list(entities))
After reproducing this on my end, one way to achieve your requirement is to use get_entity() instead of query_entities(). Below is the complete code that worked for me.
entity = tableClient.get_entity(partition_key='<PARTITION_KEY>', row_key='<ROW_KEY>')  # tableClient is a TableClient for the target table
print("Results using get_entity:")
print(entity)
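As for taking every nth item: ItemPaged is itself a lazy iterator, so one possible approach (a sketch only; the table name, filter string and step below are placeholder assumptions) is to step through it with itertools.islice instead of materializing the full list first. This still walks every page the server returns, but it avoids holding millions of rows in memory at once.
from itertools import islice

import pandas as pd
from azure.data.tables import TableServiceClient

connection_string = "connection string here"
service = TableServiceClient.from_connection_string(conn_str=connection_string)
table = service.get_table_client("mytable")  # table name is a placeholder

n = 10  # keep every 10th entity; arbitrary example
entities = table.query_entities(query_filter="PartitionKey eq 'sensor1'")  # filter is a placeholder
results = pd.DataFrame(islice(entities, 0, None, n))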

Timeseries differencing - ArangoDB (AQL or Python)

I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
  "_key": "....",
  "data": 26,
  "timecaptured": 1643488638.946702
}
where timecaptured is, for now, a UTC timestamp.
What I want to do is get the duration between consecutive observations. With SQL I could do this with LAG, for example, but with ArangoDB and AQL I am struggling to see how to do this at the database level. So effectively I want the difference in timestamps between two documents in time order. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
  SORT d.timecaptured
  WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
  LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
  RETURN timediff
We simply aggregate the sum of the previous and the current document's timecaptured values; subtracting the current document's timecaptured from that sum recovers the previous document's timecaptured, and from there we can easily calculate the requested difference.
I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
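A tiny worked example of the trick, with made-up values (shown in Python purely for illustration):
prev, curr = 100.0, 105.5    # made-up timecaptured values of two consecutive documents
s = prev + curr              # what SUM() produces over the two-document window
print(curr - (s - curr))     # 5.5 -- the difference, recovered without direct access to prev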
However, neither approach is very straightforward or obvious. I have put adding an APPEND aggregate function, usable in WINDOW and COLLECT operations, on my TODO list.
The WINDOW function doesn't give you direct access to the data in the sliding window but here is a rather clever workaround:
FOR doc IN collection
  SORT doc.timecaptured
  WINDOW { preceding: 1 }
    AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
  LET timediff = doc.timecaptured - d[0].timecaptured
  RETURN MERGE(doc, {timediff})
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous document's timecaptured value from the current document's. In the case of the first record, d[0] is actually the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured, on the other hand, will give you the negated timestamp for the first record, because d[1] is null (there is no previous document) and evaluates to 0.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
  FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)
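Since the question also mentions Python: either query can be executed server-side and only the computed differences streamed back, so nothing needs to be pulled into pandas. A minimal sketch using the python-arango driver (the endpoint, database name, credentials and collection name are assumptions):
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")        # assumed endpoint
db = client.db("mydb", username="root", password="passwd")  # assumed credentials

aql = """
FOR d IN docs
  SORT d.timecaptured
  WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
  LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
  RETURN timediff
"""

# The cursor streams results in batches; only the computed differences cross the wire.
for diff in db.aql.execute(aql):
    print(diff)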

How to use Django iterator with value list?

I have a Profile table with a huge number of rows. I was trying to filter out profiles based on super_category and account_id (these are fields on the Profile model).
Assume I have lists of IDs in the form of bulk_account_ids and super_categories:
list_of_ids = Profile.objects.filter(
    account_id__in=bulk_account_ids,
    super_category__in=super_categories,
).values_list('id', flat=True)
list_of_ids = list(list_of_ids)
SomeTask.delay(ids=list_of_ids)
This particular query is timing out while it gets evaluated in the second line.
Can I use .iterator() at the end of the query to optimize this, i.e. list(list_of_ids.iterator())? If not, what else can I do?
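list(list_of_ids.iterator()) still fetches every row; .iterator() mainly avoids the queryset cache rather than making the SQL itself faster. One alternative is to stream the IDs and hand them to the task in batches; a sketch, assuming SomeTask can accept batches (the chunk and batch sizes below are arbitrary):
from itertools import islice

ids = (
    Profile.objects
    .filter(account_id__in=bulk_account_ids, super_category__in=super_categories)
    .values_list('id', flat=True)
    .iterator(chunk_size=2000)   # stream rows instead of loading them all at once
)

while True:
    batch = list(islice(ids, 10000))   # hand the task manageable chunks
    if not batch:
        break
    SomeTask.delay(ids=batch)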

Insert values into API request dynamically?

I have an API request I'm writing to query OpenWeatherMap's API to get weather data. I am using a city_id number to submit a request for a unique place in the world. A successful API query looks like this:
r = requests.get('http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id=4456703&units=imperial')
The key part of this is 4456703, which is a unique city_id.
I want the user to choose a few cities, which then I'll look through a JSON file for the city_ID, then supply the city_ID to the API request.
I can add multiple city_ID's by hard coding. I can also add city_IDs as variables. But what I can't figure out is if users choose a random number of cities (could be up to 20), how can I insert this into the API request. I've tried adding lists and tuples via several iterations of something like...
#assume the user already chose 3 cities, city_ids are below
city_ids = [763942, 539671, 334596]
r = requests.get(f'http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_ids}&units=imperial')
Maybe a list is not the right data type to use?
Successful code would look something like...
r = requests.get(f'http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_id1},{city_id2},{city_id3}&units=imperial')
Except, as I stated previously, the user could choose 3 cities or 10 so that part would have to be updated dynamically.
You can use string methods and a list comprehension to join all the values of the list into a single string, and then format that into the API URL as follows:
city_ids_list = [763942, 539671, 334596]
city_ids_string = ','.join([str(city) for city in city_ids_list]) # Would output "763942,539671,334596"
r = requests.get('http://api.openweathermap.org/data/2.5/group?APPID=333de4e909a5ffe9bfa46f0f89cad105&id={city_ids}&units=imperial'.format(city_ids=city_ids_string))
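An alternative sketch that avoids building the query string by hand is to let requests encode the parameters from a dict (the key and IDs below are the ones from the question):
import requests

city_ids = [763942, 539671, 334596]
params = {
    'APPID': '333de4e909a5ffe9bfa46f0f89cad105',
    'id': ','.join(str(city) for city in city_ids),   # works for any number of cities
    'units': 'imperial',
}
r = requests.get('http://api.openweathermap.org/data/2.5/group', params=params)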
Hope it helps, good luck.

Most appeared search between a particular time

I have a search log with the fields time, place, and query. I want to find the most queried word from a particular place between particular times. All of the fields, namely date, time, and query_string, are chararrays. I have the Pig script below, but it does not do what is required.
Data = LOAD 'data' USING CustomPigStorage();
FClients = FILTER Data BY NOT(country is null);
Clients = FOREACH FClients GENERATE date,time, country,query_string as query;
grp = group Clients by (query, country, date, time);
wth_count = foreach grp generate FLATTEN(group), COUNT(Clients) as count;
For example, I want the result to be "between 2pm and 3 pm, hello was searched 4 times from USA".
I am basically confused by the COUNT() function; I'm relatively new to Pig. I believe my COUNT() here is counting the total number of records I have.
Your query looks correct: COUNT(Clients) returns the number of records in the bag that came from Clients and belong to the group. To see this, you can remove COUNT from the "wth_count" statement, save the results into a file and then look at it:
wth_count = foreach grp generate group, Clients;
store wth_count into 'path';
Your potential problem might be that you are using the date and time columns in the GROUP BY, and they produce too many groups. To mitigate this you could write a static Java function that takes the date and time and returns a single value for the range (the bucketing logic is sketched below); for example, 12-07-2012, 14.05.03 is converted into "12-07-2012 14h" and 12-07-2012, 14.05.05 into "12-07-2012 14h". This creates a key that covers the interval between 2pm and 3pm and puts all of the matching records from Clients into that group's bag.
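The bucketing logic itself is trivial; here it is sketched in Python purely for illustration (the date and time formats are assumed to match the example above; in Pig you would implement the same thing as a Java or Jython UDF):
def hour_bucket(date_str, time_str):
    # Collapse a timestamp to its hour so that all records from the same
    # hour share one group key, e.g. ('12-07-2012', '14.05.03') -> '12-07-2012 14h'
    hour = time_str.split('.')[0]
    return date_str + ' ' + hour + 'h'

print(hour_bucket('12-07-2012', '14.05.03'))  # 12-07-2012 14h
print(hour_bucket('12-07-2012', '14.05.05'))  # 12-07-2012 14h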
