We are currently setting up an ELK stack to consume our various logs. At the moment we just use one index per day and only have logs from the various applications we develop (approx. 100M docs). The next step is to look at other types of logs, e.g. IIS logs, event logs from network devices (NLB), and maybe documents post-processed into statistical data. So the question now is what kind of index strategy we should use.
In terms of post-processing documents into statistical data, my initial idea is to create a single index per application (maybe per year/month depending on the amount of data). Does that make sense?
As for IIS logs and the various event logs, we could just append them to the daily application logs (currently 20k logs per day). However, I'm not sure mixing log types is a good idea, although it would be easier to maintain. Another strategy is to create individual indexes for the different log types (application, IIS, event logs).
Any recommendations or references to good blog posts/info is greatly appreciated.
I recommend different ES types for different log types. Structured logging is important: you'll want to query IIS logs differently from Elmah error logs, for instance. You may also want to change the index settings of a particular field, e.g. to make it not_analyzed.
You could keep everything in the same rolling daily/weekly/monthly index and keep them separate by type. Something like this for a daily index example:
PUT 20150227

PUT 20150227/iis/_mapping
{
  "properties": {
    "value1": {
      "type": "string"
    }
  }
}

PUT 20150227/errors/_mapping
{
  "properties": {
    "value2": {
      "type": "string"
    }
  }
}
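As a hedged sketch of the not_analyzed setting mentioned above (the clientIp field is just a hypothetical example, and this is the pre-2.x string syntax):

PUT 20150227/iis/_mapping
{
  "properties": {
    "clientIp": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}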
Related
I've been mulling over how to solve a given problem in Beam and thought I'd reach out to a larger audience for some advice. At present things only seem to be working sporadically, and I was curious if someone could act as a sounding board to see if this workflow makes sense.
The primary high-level goal is to read records from Kafka that may be out of order and need to be windowed in event time according to another property found on the records, eventually emitting the contents of those windows and writing them out to GCS.
The current pipeline looks roughly like the following:
val partitionedEvents = pipeline
.apply("Read Events from Kafka",
KafkaIO
.read<String, Log>()
.withBootstrapServers(options.brokerUrl)
.withTopic(options.incomingEventsTopic)
.withKeyDeserializer(StringDeserializer::class.java)
.withValueDeserializerAndCoder(
SpecificAvroDeserializer<Log>()::class.java,
AvroCoder.of(Log::class.java)
)
.withReadCommitted()
.commitOffsetsInFinalize()
// Set the watermark to use a specific field for event time
.withTimestampPolicyFactory { _, previousWatermark -> WatermarkPolicy(previousWatermark) }
.withConsumerConfigUpdates(
ImmutableMap.of<String, Any?>(
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest",
ConsumerConfig.GROUP_ID_CONFIG, "log-processor-pipeline",
"schema.registry.url", options.schemaRegistryUrl
)
).withoutMetadata()
)
.apply("Logging Incoming Logs", ParDo.of(Events.log()))
.apply("Rekey Logs by Tenant", ParDo.of(Events.key()))
.apply("Partition Logs by Source",
// This is a custom function that will partition incoming records by a specific
// datasource field
Partition.of(dataSources.size, Events.partition<KV<String, Log>>(dataSources))
)
dataSources.forEach { dataSource ->
// Store a reference to the data source name to avoid serialization issues
val sourceName = dataSource.name
val tempDirectory = Directories.resolveTemporaryDirectory(options.output)
// Grab all of the events for this specific partition and apply the source-specific windowing
// strategies
partitionedEvents[dataSource.partition]
.apply(
"Building Windows for $sourceName",
SourceSpecificWindow.of<KV<String, Log>>(dataSource)
)
.apply("Group Windowed Logs by Key for $sourceName", GroupByKey.create())
.apply("Log Events After Windowing for $sourceName", ParDo.of(Events.logAfterWindowing()))
.apply(
"Writing Windowed Logs to Files for $sourceName",
FileIO.writeDynamic<String, KV<String, MutableIterable<Log>>>()
.withNumShards(1)
.by { row -> "${row.key}/${sourceName}" }
.withDestinationCoder(StringUtf8Coder.of())
.via(Contextful.fn(SerializableFunction { logs -> Files.stringify(logs.value) }), TextIO.sink())
.to(options.output)
.withNaming { partition -> Files.name(partition)}
.withTempDirectory(tempDirectory)
)
}
In a simpler, bulleted form, it might look like this:
- Read records from a single Kafka topic
- Key all records by their tenant
- Partition the stream by another event property
- Iterate through the known partitions from the previous step
- Apply custom windowing rules for each partition (related to the datasource's custom window rules)
- Group windowed items by key (tenant)
- Write tenant-key pair groupings to GCS via FileIO
The problem is that the incoming Kafka topic contains out-of-order data across multiple tenants (e.g. events for tenant1 might be streaming in now, but then a few minutes later you'll get them for tenant2 in the same partition, etc.). This would cause the watermark to bounce back and forth in time, as each incoming record's timestamp is not guaranteed to continually increase, which sounds like it would be a problem, but I'm not certain. It certainly seems that while data is flowing through, some files are simply not being emitted at all.
The custom windowing function is extremely simple and was aimed at emitting a single window once the allowed lateness and windowing duration have elapsed:
object SourceSpecificWindow {
fun <T> of(dataSource: DataSource): Window<T> {
return Window.into<T>(FixedWindows.of(dataSource.windowDuration()))
.triggering(Never.ever())
.withAllowedLateness(dataSource.allowedLateness(), Window.ClosingBehavior.FIRE_ALWAYS)
.discardingFiredPanes()
}
}
However, it seemed inconsistent since we'd see logging come out after the closing of the window, but not necessarily files being written out to GCS.
Does anything seem blatantly wrong or incorrect with this approach? The data can come in out of order within the source (i.e. right now, 2 hours ago, 5 minutes from now) and covers multiple tenants, but the aim is to try and ensure that one tenant that keeps up to date won't drown out tenants whose data arrives further in the past.
Would we potentially need another Beam application or something to "split" this single stream of events into sub-streams that are each processed independently (so that each has its own watermark)? Is that where a SplittableDoFn would come in? I'm running on the SparkRunner, which doesn't appear to support that, but it seems as though it'd be a valid use case.
Any advice would be greatly appreciated or even just another set of eyes. I'd be happy to provide any additional details that I could.
Environment
Currently running against SparkRunner
While this may not be the most helpful response, I'll be transparent about the end result. Eventually the logic required for this specific use case extended far beyond the built-in capabilities of Apache Beam, primarily in the area of windowing and the governance of time.
The solution we landed on was to switch our preferred streaming technology from Apache Beam to Apache Flink, which, as you might imagine, was quite a leap. The stateful-centric nature of Flink allowed us to more easily handle our use cases and define custom eviction criteria (and ordering) around windowing, at the cost of losing a layer of abstraction.
I'm new to Node.js and Cloud Functions for Firebase, so I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved sorted by that field, which can easily be done client side.
The issue is that, if the database grows big, I'm worried it will either take too long to return and/or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions that stores a cache of the top N objects and updates it via a listener whenever the score of any object changes.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach it? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side, but I'd really like to have this both for performance and for learning purposes.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason is, first, that the Firebase database does not expose a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement it. Second, and most important, I'm pretty sure it won't scale up to millions of entries; it will have at most 10K, and Firebase has rules for optimizing sorted reads. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are used, such as authentication, request method checks, etc.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp(functions.config().firebase);

exports.request_all_levels = functions.https.onRequest((req, res) => {
  const ref = admin.database().ref('CustomLevels');
  ref.once('value').then(function(snapshot) {
    res.status(200).send(JSON.stringify(snapshot.val()));
  });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, their resources are still limited. So reading a huge list of items just to determine the latest 10 is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date on every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the full list of items for the latest 10 upon every write, which is an O(something-with-n) operation.
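A rough sketch of that idea as a Realtime Database trigger follows (untested; the /CustomLevels and /top10 paths and the numeric score values are assumptions about your data layout, and it uses the same pre-1.0 functions API style as your snippet):

// Hedged sketch: maintain a denormalized /top10 node on every score write.
// Assumptions: scores live at /CustomLevels/{levelId}/score and /top10 maps level keys
// to scores; deletions and score decreases are ignored for brevity.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp(functions.config().firebase);

exports.maintainTop10 = functions.database.ref('/CustomLevels/{levelId}/score')
  .onWrite(event => {
    const top10Ref = admin.database().ref('/top10');
    // Only the (at most 10) cached entries plus the incoming one are considered.
    return top10Ref.once('value').then(snapshot => {
      const entries = [];
      snapshot.forEach(child => {
        if (child.key !== event.params.levelId) {
          entries.push({ key: child.key, score: child.val() });
        }
      });
      entries.push({ key: event.params.levelId, score: event.data.val() });
      entries.sort((a, b) => b.score - a.score);
      const top10 = {};
      entries.slice(0, 10).forEach(e => { top10[e.key] = e.score; });
      return top10Ref.set(top10);
    });
  });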
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
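For instance, such a running average can be updated in place from just the previous average and a count, with no need to re-read the full list; a minimal sketch, reusing the admin handle from the sketch above and assuming a hypothetical /stats node:

// Hedged sketch: fold one new score into a stored running average inside a transaction.
const statsRef = admin.database().ref('/stats');
function addScoreToAverage(newScore) {
  return statsRef.transaction(stats => {
    stats = stats || { count: 0, avg: 0 };
    stats.avg = (stats.avg * stats.count + newScore) / (stats.count + 1);
    stats.count += 1;
    return stats;
  });
}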
Currently I'm doing filtered replication by monitoring the below resource:
_changes?filter=_selector&include_docs=true&attachments=true&limit=20
As you can see, I'm using a selector defined by
"selector": {
"type": "Property"
}
and everything is working great. Now I need to add another criterion, which is a geospatial index: I want to replicate documents with locations within a radius, e.g.
lat=-11.05987446&lon=12.28339928&radius=100
How can I use the above filtered replication technique to replicate only documents within a radius?
Thanks
The selector filter for _changes is not backed by an index - it just uses the same syntax as Query selectors, which currently does not support geospatial operations.
I think you have 3 options:
1. Use a bounding box
Your selector would then look something like:
"selector": {
"type": "Property",
"lat": {
"$gt": -11
},
"lat": {
"$lt": 11
},
"lon": {
"$gt": 12
},
"lon": {
"$lt": 14
}
}
Perhaps you could then further restrict the results on the client if you need exactly a radial search.
2. Implement a radius search in a JavaScript filter
This means dropping the use of the `selector` filter. It would be relatively slow (anything that involves JavaScript in Couch/Cloudant will be), but it gives you exactly the result you want.
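A sketch of what such a filter function might look like (stored in a hypothetical design document, e.g. _design/geo under filters.radius; the lat/lon fields on the docs and the lat, lon and radius query parameters are assumptions and would be passed via query_params on the replication):

function (doc, req) {
  // Only consider docs of the replicated type that actually carry coordinates
  if (doc.type !== 'Property' || doc.lat == null || doc.lon == null) return false;
  var lat = Number(req.query.lat), lon = Number(req.query.lon);
  var radius = Number(req.query.radius); // kilometres
  var toRad = function (d) { return d * Math.PI / 180; };
  // Haversine distance between the doc's location and the requested centre
  var dLat = toRad(doc.lat - lat), dLon = toRad(doc.lon - lon);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(lat)) * Math.cos(toRad(doc.lat)) *
          Math.sin(dLon / 2) * Math.sin(dLon / 2);
  var distance = 6371 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
  return distance <= radius;
}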
3. Run a query and replicate the resulting ids
Use a search or geospatial query to get the set of _ids you need and use a doc_id based replication to fetch them.
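A rough sketch of such a doc_id based replication request (database names and ids here are placeholders):

POST /_replicate
Content-Type: application/json

{
  "source": "https://examples.cloudant.com/properties",
  "target": "properties-local",
  "doc_ids": ["property-001", "property-002", "property-003"]
}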
Lastly, it's worth considering whether you really want replication (which implies the ability for documents to sync both ways) or if you're just caching / copying data to the client. Replication carries some overhead beyond just copying the data (it needs to figure out the delta between the client and server, retrieve the rev history for each doc, etc) so, if you don't need to write the docs back to the server, maybe you don't need it.
If you do go down the replication route, you may need to handle cases where documents that you previously replicated no longer match the query so updates do not propagate in subsequent replications.
If not, you may be better off just running a query with include_docs=true and manually inserting the documents to a local database.
The _selector filter for the changes feed isn't backed by an index; it's basically a handy shortcut to achieve the same thing as a JavaScript filter, but much faster because it's executed directly in Erlang.
As it's not index-backed, you can't tap into a geo-index that way.
You'd be better off running either a bounding box or radius query to get the ids, and then fetching those documents with a _bulk_get or with a POST to _all_docs with the ids in the body.
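For example, the _all_docs fetch could look something like this (database name and ids are placeholders):

POST /properties/_all_docs?include_docs=true
Content-Type: application/json

{
  "keys": ["property-001", "property-002", "property-003"]
}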
https://cloudant.com/wp-content/uploads/Cloudant-Geospatial-technical-overview.pdf
I have a DocumentDB instance with about 4,000 documents. I just configured Azure Search to search and index it. This worked fine at first. Yesterday I updated the documents and indexed fields, along with one UDF to index a complex field. Now the indexer is reporting that DocumentDB is returning RequestRateTooLargeException. The docs on that error suggest throttling calls, but it seems like Azure Search would need to be the one doing that. Is there a workaround?
Azure Search uses the DocumentDB client SDK, which retries internally with an appropriate timeout when it encounters a RequestRateTooLarge error. However, this only works if there are no other clients using the same DocumentDB collection concurrently. Check whether you have other concurrent users of the collection; if so, consider adding capacity to the collection.
This could also happen because, due to some other issue with the data, the DocumentDB indexer isn't able to make forward progress - it will then retry on the same data and may potentially hit the same data problem again, akin to a poison message. If you observe that a specific document (or a small number of documents) causes indexing problems, you can choose to ignore them. I'm pasting an excerpt from the documentation we're about to publish:
Tolerating occasional indexing failures
By default, an Azure Search indexer stops indexing as soon as even a single document fails to be indexed. Depending on your scenario, you can choose to tolerate some failures (for example, if you repeatedly re-index your entire datasource). Azure Search provides two indexer parameters to fine-tune this behavior:
maxFailedItems: The number of items that can fail indexing before an indexer execution is considered a failure. Default is 0.
maxFailedItemsPerBatch: The number of items that can fail indexing in a single batch before an indexer execution is considered a failure. Default is 0.
You can change these values at any time by specifying one or both of these parameters when creating or updating your indexer:
PUT https://service.search.windows.net/indexers/myindexer?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]
{
  "dataSourceName" : "mydatasource",
  "targetIndexName" : "myindex",
  "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 5 }
}
Even if you choose to tolerate some failures, information about which documents failed is returned by the Get Indexer Status API.
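For reference, that status check might look like the following (same placeholder service and indexer names as the example above):

GET https://service.search.windows.net/indexers/myindexer/status?api-version=[api-version]
api-key: [admin key]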
I've got a mid-size Elasticsearch index (1.46TB or ~1e8 docs). It's running on 4 servers which each have 64GB of RAM, split evenly between Elasticsearch and the OS (for caching).
I want to try out the new "Significant terms" aggregation so I fired off the following query...
{
  "query": {
    "ids": {
      "type": "document",
      "values": [
        "xCN4T1ABZRSj6lsB3p2IMTffv9-4ztzn1R11P_NwTTc"
      ]
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}
This should compare the body of the specified document with the rest of the index and find terms that are significant to the document but not common in the index.
Unfortunately, this invariably results in a
ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: CircuitBreakingException[Data too large, data would be larger than limit of [25741911654] bytes];
after a minute or two and seems to imply I haven't got enough memory.
The elastic servers in question are actually VMs, so I shut down other VMs and gave each elastic instance 96GB and each OS another 96GB.
The same problem occurred (different numbers, took longer). I haven't got hardware to hand with more than 192GB of memory available so can't go higher.
Are aggregations not meant for use against the index as a whole? Am I making a mistake with regards to the query format?
There is a warning on the documentation for this aggregation about RAM use on free-text fields for very large indices [1]. On large indices it works OK for lower-cardinality fields with a smaller vocabulary (e.g. hashtags) but the combination of many free-text terms and many docs is a memory-hog. You could look at specifying a filter on the loading of FieldData cache [2] for the Body field to trim the long-tail of low-frequency terms (e.g. doc frequency <2) which would reduce RAM overheads.
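A hedged sketch of such a fielddata frequency filter on the Body field (the index name is a placeholder, the min threshold is just an example, and this is the 1.x string-field syntax):

PUT my_index/document/_mapping
{
  "properties": {
    "Body": {
      "type": "string",
      "fielddata": {
        "filter": {
          "frequency": {
            "min": 2
          }
        }
      }
    }
  }
}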
I have used a variation of this algorithm before where only a sample of the top-matching docs were analysed for significant terms and this approach requires less RAM as only the top N docs are read from disk and tokenised (using TermVectors or an Analyzer). However, for now the implementation in Elasticsearch relies on a FieldData cache and looks up terms for ALL matching docs.
One more thing - when you say you want to "compare the body of the document specified", note that the usual mode of operation is to compare a set of documents against the background, not just one. All analysis is based on doc-frequency counts, so with a sample set of just one doc all terms will have a foreground frequency of 1, meaning you have less evidence to reinforce any analysis.
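For example, here is a hedged sketch of the same aggregation run over a foreground set selected by a query (the query text is just a placeholder) instead of a single id:

{
  "query": {
    "match": {
      "Body": "some topic of interest"
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}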