I have ~1 TB of old Apache log data I would like to index in OpenSearch. Logs are per day and structured like: s3://bucket/logdata/year/year_month_day.json.gz
I plan to use Logstash for the ingest and am wondering how best to set up the index (or indices) to get good performance.
I would like to index per day, but how do I extract the date from the log file name above so I can get it right in the Logstash conf file?
index => "%{+YYYY.MM.dd}" will work for future log files, but how do I solve it for the old ones?
You can do it like this using the dissect filter, which can parse the date components from the S3 object key and reassemble them into a new field called log_date:
dissect {
  mapping => {
    # pull the date parts out of the S3 object key
    "[@metadata][s3][key]" => "%{ignore}/logdata/%{+ignore}/%{year}_%{month}_%{day}.json.gz"
  }
  # rebuild an ISO-style date from the captured parts
  add_field => {
    "log_date" => "%{year}-%{month}-%{day}"
  }
  remove_field => ["ignore"]
}
Then in your output section you can reference that new field in order to build your index name:
index => "your-index-%{log_date}"
PS: another option is to parse the year_month_day part as a single token and then replace the _ characters with - using mutate/gsub.
In my experience, daily indices can quickly get out of control: they vary greatly in size, and with a decent retention period the cluster can end up oversharded. I would recommend setting up rollover instead (via ILM, or ISM, which is OpenSearch's equivalent), with a policy based on both index age (7 or 30 days, depending on logging volume) and primary shard size (a common threshold is 50 GB). You can add a delete phase to the same policy, based on your retention period.
This way you'll get optimal indexing and search performance, as well as uniform load distribution and resource usage.
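For reference, a minimal sketch of such a policy created through OpenSearch's ISM plugin API; the policy id, index pattern, thresholds, and credentials below are illustrative assumptions, not values from this answer, and min_primary_shard_size needs a reasonably recent OpenSearch version:

# Hypothetical ISM rollover + delete policy; adjust thresholds and the index
# pattern to your own naming. Rollover also needs writes to go through an alias.
import requests

policy = {
    "policy": {
        "description": "Roll over by age or primary shard size, delete after retention",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [{"rollover": {"min_index_age": "30d",
                                          "min_primary_shard_size": "50gb"}}],
                "transitions": [{"state_name": "delete",
                                 "conditions": {"min_index_age": "365d"}}],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
        "ism_template": {"index_patterns": ["apache-logs-*"], "priority": 100},
    }
}

resp = requests.put(
    "https://localhost:9200/_plugins/_ism/policies/apache-logs-rollover",
    json=policy,
    auth=("admin", "admin"),   # placeholder credentials
    verify=False,              # only acceptable against a local test cluster
)
print(resp.status_code, resp.json())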
I'm writing data to Kusto using azure-kusto-spark.
I see that this write has high latency. Looking at the debug logs from the Spark cluster, I can see that the KustoConnector polls during the write. I believe there is a long default polling interval. Is there a way to configure it to a lower value?
In the azure-kusto-spark codebase I see this piece of code, which I think is responsible for the polling:
def finalizeIngestionWhenWorkersSucceeded(
...
DelayPeriodBetweenCalls,
(writeOptions.timeout.toMillis / DelayPeriodBetweenCalls + 5).toInt,
res => res.isDefined && res.get.status == OperationStatus.Pending,
res => finalRes = res,
maxWaitTimeBetweenCalls = KDSU.WriteMaxWaitTime.toMillis.toInt)
.await(writeOptions.timeout.toMillis, TimeUnit.MILLISECONDS)
....
I'm not sure I understand it correctly.
The polling operation just checks whether the data was ingested into Kusto. It has a max timeout, but this is not what's impacting the latency.
I believe the latency comes from the ingestion batching policy on your Kusto DB - see the details here: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/batchingpolicy
By default it's 5 minutes, so you might want to reduce this time if doing so won't have any negative impact (see the doc). Note that the policy should be changed on the database, not the table, for the following reason: the connector creates a temporary table in the same Kusto DB and first inserts into that temporary table. So even if you change the policy of your destination table, it will still take at least 5 minutes to write to the temporary one.
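If it helps, here is a hedged sketch of lowering the database-level batching policy with the Python Kusto SDK; the cluster URI, database name, and the 30-second/500-item thresholds are placeholders, not values from this answer:

# Hypothetical example: shorten the ingestion batching window on the database.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://<your-cluster>.kusto.windows.net"   # placeholder cluster URI
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

# Management command: the batching policy is set on the database, not the table,
# so the connector's temporary table picks it up as well.
command = """.alter database MyDatabase policy ingestionbatching
'{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'"""

print(client.execute_mgmt("MyDatabase", command).primary_results[0])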
I am fetching data from Kafka topics and storing it in Delta Lake (Parquet) format. I wish to find the number of messages fetched on a particular day.
My thought process: read the directory where the data is stored in Parquet format using Spark and apply a count over the ".parquet" files for a particular day. This returns a count, but I am not really sure that's the correct way.
Is this correct? Are there other ways to count the number of messages fetched from a Kafka topic for a particular day (or duration)?
The messages we consume from a topic not only have a key and value but also carry other information, like a timestamp, which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, depending on the topic configuration. If the topic's timestamp type is CREATE_TIME, the timestamp in the producer record is used by the broker, whereas if the topic is configured with LOG_APPEND_TIME, the broker overwrites the timestamp with its local time while appending the record.
So wherever you store the messages, if you keep the timestamp you can easily track the per-day or per-hour message rate.
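For example, a minimal sketch, assuming the Kafka timestamp column was preserved when the stream was written to Delta (the path is an assumption, and spark is the usual active session as in the other examples):

from pyspark.sql import functions as F

# Count messages per day based on the Kafka record timestamp kept in the table.
df = spark.read.format("delta").load("/delta/kafka_events")   # placeholder path
daily_counts = (df
                .groupBy(F.to_date("timestamp").alias("day"))
                .count()
                .orderBy("day"))
daily_counts.show()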
Alternatively, you can use a Kafka dashboard such as Confluent Control Center (licensed) or Grafana (free), or any other tool, to track the message flow.
In our case, while consuming and storing or processing the messages, we also route the message metadata to Elasticsearch and visualize it with Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information, without counting rows between two versions, is to use the Delta table history. It has several advantages: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing a MERGE operation (which isn't possible by comparing .count() across versions, because an update replaces the actual value and a delete removes the row).
For example, for plain appends, the following code will count all rows written by normal append operations (for other operations, like MERGE/UPDATE/DELETE, we may need to look at other metrics):
from delta.tables import *

# Sum the 'numOutputRows' metric over all history entries that fall inside the
# day; replace 'begin_of_day' / 'end_of_day' with the actual timestamps.
df = (DeltaTable.forName(spark, "ml_versioning.airbnb").history()
      .filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")
      .selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")
      .groupBy().sum())
I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries, but that is more time to pull from the server and more time and memory to plot. When looking at larger trends the small movements are noise, so I would be better off retrieving the same number of documents (720) but only every 3rd one - still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data, but it seems extremely inefficient and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo for your problem.
The way forward would be to archive your data smartly, in fragments.
You can store your data in a collection that houses no more than a week's or a month's worth of data; a new month/week means storing the data in a different collection. That way you won't be doing a full collection scan and won't be fetching every single document, as you mentioned in your problem. Your application code decides which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more of a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0" and run a weekly scheduler in the background that moves the data from "week0" to history collections "week1", "week2", and so on. The fragmentation logic depends on your requirements.
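A rough sketch of that weekly rotation with pymongo; the database/collection names and the scheduling mechanism are assumptions:

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["metrics"]

def rotate_weekly() -> None:
    # Archive the live collection under a dated name; the application keeps
    # writing to "week0", which MongoDB recreates on the next insert.
    if "week0" in db.list_collection_names():
        db["week0"].rename(f"week_{datetime.now(timezone.utc):%Y_%m_%d}")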
I think the $bucketAuto stage might help you with this.
You can do something like this:
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // here you'll put the field you need, in your example 'temperature'
      buckets: 5 // this is the number of documents you want back, so for a sample of 500 documents put 500 here
    }
  }
])
Each document in the result of the above query will look something like this:
{
  "_id": {
    "max": 3,
    "min": 1
  },
  "count": 2
}
If you had grouped by temperature, each document would contain the minimum and maximum temperature found in that bucket.
You might have another problem. The docs state not to rely on natural ordering: "This ordering is an internal implementation feature, and you should not rely on any particular structure within it."
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
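A small sketch of that idea with pymongo, assuming readings land on a 5-second grid and a field called ts stores epoch seconds (the collection and field names are made up):

from pymongo import MongoClient, DESCENDING

coll = MongoClient("mongodb://localhost:27017")["sensors"]["temperature"]

n = 3            # keep every nth document
step = 5 * n     # documents are written every 5 seconds

# $mod keeps only timestamps that fall on every nth 5-second slot; this assumes
# the stored epoch seconds are aligned to the 5-second grid, otherwise the
# remainder needs to match your actual offset.
cursor = (coll.find({"ts": {"$mod": [step, 0]}})
              .sort("ts", DESCENDING)
              .limit(720))

for doc in cursor:
    print(doc["ts"], doc.get("temperature"))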
I have created an MVP for a Node.js project; the following features are relevant to the question I am about to ask:
1- The application has a list of IP addresses with CRUD actions.
2- The application pings each IP address every 5 seconds.
3- Against each IP address it displays its status (alive or dead) and the uptime if alive.
I created a working MVP on Node.js with the help of net-ping, Express, Mongo, and Angular. Now I have a new feature request:
"to calculate the round-trip time (latency) for each ping that is generated for each IP address and populate a bar chart, or any type of chart, that will display the RTT (latency) history (1 month to 1 year) of every connection"
I need to store the response of each ping in the database. Assuming the best case, where each document I store is 0.5 kB in size, that makes 9.5 MB of data per day, 285 MB per month, and 3.4 GB per year for a single IP address, and I am going to have 100-200 IP addresses in my application.
What is the best solution (including paid ones) that will suit my requirements best, considering the app can scale further?
Time-series data requires special treatment from a database perspective, as it challenges traditional database management in terms of capacity, query performance, read/write optimisation targets, etc.
I wouldn't recommend storing this data in a traditional RDBMS or an object/document database.
The best option is to use a specialised time-series database engine, like InfluxDB, that supports downsampling (aggregation) and raw-data retention rules.
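As a hedged illustration only (the bucket, org, token, measurement, and tag names below are placeholders), writing one ping result with the InfluxDB 2.x Python client could look like this; downsampling and retention would then be configured on the bucket/server side:

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One RTT sample, tagged by IP so it can be charted per connection.
point = Point("ping_rtt").tag("ip", "10.0.0.1").field("rtt_ms", 12.3)
write_api.write(bucket="monitoring", record=point)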
So I changed the schema design for the time-series data after reading this, and that massively reduced the numbers in my size calculation.
The previous schema looked like this:
{
timestamp: ISODate("2013-10-10T23:06:37.000Z"),
type: "Latency",
value: 1000000
},
{
timestamp: ISODate("2013-10-10T23:06:38.000Z"),
type: "Latency",
value: 15000000
}
Size of each document: 0.22 kB
Number of documents created in an hour: 720
Size of data generated in an hour: 0.22 * 720 = 158.4 kB
Size of data generated by one IP address in a day: 158.4 * 24 ≈ 3.7 MB
Since each timestamp is just a 5-second increment on the previous one, the schema can be optimized to cut out the redundant data.
The new schema looks like this:
{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"), // truncated to the hour
  type: "Latency",
  values: { // will contain data for all pings in the specific hour
    0: 999999,
    …
    37: 1000000,
    38: 1500000,
    …
    719: 2000000
  }
}
Size of each document: 0.5 kB
Number of documents created in an hour: 1
Size of data generated in an hour: 0.5 kB
Size of data generated by one IP address in a day: 0.5 * 24 = 12 kB
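A quick sketch of writing one ping result into such an hourly bucket with pymongo; the database, collection, and field names here are assumptions layered on top of the schema above:

from datetime import datetime, timezone
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["monitor"]["latency"]

def record_ping(ip: str, rtt_micros: int, ts: datetime) -> None:
    # One document per IP per hour; each reading is $set at its 5-second slot.
    hour = ts.replace(minute=0, second=0, microsecond=0)
    slot = (ts.minute * 60 + ts.second) // 5      # 0..719 within the hour
    coll.update_one(
        {"ip": ip, "timestamp_hour": hour, "type": "Latency"},
        {"$set": {f"values.{slot}": rtt_micros}},
        upsert=True,
    )

record_ping("10.0.0.1", 1000000, datetime.now(timezone.utc))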
So I am assuming the size of the data will no longer be an issue, and although there is a debate about what type of storage should be used in such scenarios to ensure the best performance, I am going to trust MongoDB in my case.
I am trying to come up with a partition key strategy based on a DateTime that doesn't result in the Append-Only write bottleneck often described in best practices guidelines.
Basically, if you partition by something like YYYY-MM-DD, all your writes for a particular day will end up in the same partition, which will reduce write performance.
Ideally, a partition key should even distribute writes across as many partitions as possible.
To accomplish this while still basing the key on a DateTime value, I need a way to assign what amounts to buckets of DateTime values, where the number of buckets is a predetermined number per time interval - say 50 a day. The assignment of a DateTime to a bucket should be as random as possible - but always the same for a given value. The reason is that I need to be able to always get the correct partition given the original DateTime value. In other words, this is like a hash.
Lastly, and critically, I need the partition key to be sequential at some aggregate level. So while the DateTime values for a given interval, say 1 day, would be randomly distributed across X partition keys, all the partition keys for that day would fall within a queryable range. This would allow me to query all rows for my aggregate interval and then sort them by the DateTime value to get the correct order.
Thoughts? This must be a fairly well known problem that has been solved already.
How about using the millisecond component of the datetime stamp, mod 50? That would give you your random distribution throughout the day, the value itself would be sequential, and you could easily calculate the PartitionKey in the future given the original timestamp.
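A hedged sketch of that bucketing, just to make the idea concrete (the bucket count, key format, and function names are assumptions):

from datetime import datetime, timezone

BUCKETS = 50

def partition_key(ts: datetime) -> str:
    # Millisecond component mod 50 spreads writes across buckets, yet the key
    # is fully reproducible from the original timestamp and still embeds the day.
    bucket = (ts.microsecond // 1000) % BUCKETS
    return f"{ts:%Y%m%d}-{bucket:02d}"            # e.g. '20130115-07'

def day_partition_keys(day: datetime) -> list:
    # All partitions holding a given day's rows; query each and sort the
    # results client-side by the original timestamp.
    return [f"{day:%Y%m%d}-{b:02d}" for b in range(BUCKETS)]

print(partition_key(datetime.now(timezone.utc)))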
To add to Eoin's answer, below is the code I used to simulate his solution:
var buckets = new SortedDictionary<int, List<DateTime>>();
var rand = new Random();

for (int i = 0; i < 1000; i++)
{
    var dateTime = DateTime.Now;
    var bucket = (int)(dateTime.Ticks % 53);   // 53 buckets, keyed off the timestamp

    if (!buckets.ContainsKey(bucket))
        buckets.Add(bucket, new List<DateTime>());

    buckets[bucket].Add(dateTime);
    Thread.Sleep(rand.Next(0, 20));            // 0-20 ms between simulated requests
}
So this should simulate roughly 1000 requests coming in, each anywhere between 0 and 20 milliseconds apart.
This resulted in a fairly even distribution across the 53 "buckets". It also avoided, as expected, the append-only/prepend-only anti-pattern.