I am using the Logstash 5.4.2 persistent queue. I have a config file that reads input through JDBC, applies some transformations, and writes the output to MongoDB. But when I run it, Logstash inserts only a few records, say 6,000, whereas the actual output should be 300,000 records, and then the main pipeline shuts down. When I look at the page files in the data folder, they contain many more records. How do I flush that data to the output before or after the pipeline shuts down? My persistent queue settings are as follows.
pipeline.workers: 2
pipeline.output.workers: 1
pipeline.batch.size: 50
pipeline.batch.delay: 5
pipeline.unsafe_shutdown: false
config.test_and_exit: false
config.reload.automatic: false
queue.type: persisted
queue.page_capacity: 1gb
queue.max_events: 0
queue.max_bytes: 4gb
queue.checkpoint.acks: 1024
queue.checkpoint.writes: 1024
queue.checkpoint.interval: 1000
Is there any way to flush all data from the persistent queue to the output during main pipeline shutdown, or any workaround to handle this issue?
Thanks in Advance!
Your queue.checkpoint.writes is set to 1024, which is the default. You need to set it to 1 to guarantee maximum durability for input events, i.e.
queue.checkpoint.writes: 1
Please remember that this involves heavy disk writing, which will severely impact performance.
Related
I am trying to feed a CSV data file to Logstash using Filebeat. Unfortunately the messages arrive out of order. Is there any way to correct this?
Could this be caused by TCP or by the pipeline? Logstash starts with logstash.javapipeline / pipeline_id=>"main", "pipeline.workers"=>8.
I tried:
filebeat - output to console - pass
filebeat - output to logstash (localhost) - logstash w/o filter; output to stdout - fail (wrong order of messages)
By default, order is not guaranteed in Logstash, as the events in a batch can be reordered during filter processing and some events can be processed faster than others.
If you need your events in order, you will have to set pipeline.workers to 1, which means that only 1 CPU will be used to process your messages.
Also, set pipeline.ordered to auto in logstash.yml.
Setting pipeline.workers to 1 will make Logstash process the events in the order they are received, but since it will use only 1 CPU, it can impact performance if you have a high rate of events per second.
This is the part of the documentation about ordering events.
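For reference, a minimal logstash.yml sketch of the two settings discussed above (everything else keeps its defaults):
pipeline.workers: 1
pipeline.ordered: auto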
I have an application where I read CSV files, do some transformations, and then push them to Elasticsearch from Spark itself, like this:
input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + type)
  .save()
I have several nodes, and on each node I run 5-6 spark-submit commands that push to Elasticsearch.
I am frequently getting errors like:
Could not write all entries [13/128] (Maybe ES was overloaded?). Error sample (first [5] error messages):
rejected execution of org.elasticsearch.transport.TransportService$7#32e6f8f8 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#4448a084[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 451515]]
My Elasticsearch cluster has the following specs:
9 nodes, each with 1 TB of disk space, at least 15 GB of RAM, and more than 8 cores
I have modified the following parameters for Elasticsearch:
spark.es.batch.size.bytes=5000000
spark.es.batch.size.entries=5000
spark.es.batch.write.refresh=false
Could anyone suggest what I can fix to get rid of these errors?
This occurs because the bulk requests are coming in faster than the Elasticsearch cluster can process them, and the bulk request queue is full.
The default bulk queue size is 200.
Ideally you should handle this on the client side:
1) by reducing the number of spark-submit commands running concurrently
2) by retrying in case of rejections, tweaking es.batch.write.retry.count and
es.batch.write.retry.wait
Example:
es.batch.write.retry.wait = "60s"
es.batch.write.retry.count = 6
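For reference, a rough sketch of how these options could be passed on the Spark side, building on the write from the question (input and docType stand in for whatever DataFrame and type value you already use; the batch-size value is only illustrative):
import org.apache.spark.sql.SaveMode

input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.batch.size.entries", "1000")    // smaller bulk requests per task (illustrative value)
  .option("es.batch.write.retry.count", "6")  // retry rejected bulk requests instead of failing the job
  .option("es.batch.write.retry.wait", "60s") // back off between retries
  .option("es.resource", "{date}/" + docType)
  .save()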
On the Elasticsearch cluster side:
1) check whether there are too many shards per index and try reducing the count.
This blog has a good discussion of the criteria for tuning the number of shards.
2) as a last resort, increase thread_pool.bulk.queue_size.
Check this blog for an extensive discussion of bulk rejections.
The bulk queue in your ES cluster is hitting its capacity (200). Try increasing it. See this page for how to change the bulk queue capacity:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
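A minimal sketch of that change, assuming an Elasticsearch 5.x node where thread pool settings are static, so it goes into elasticsearch.yml on each data node and needs a restart (500 is only an illustrative value):
thread_pool.bulk.queue_size: 500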
Also check this other SO answer, where the OP had a very similar issue that was fixed by increasing the bulk pool size.
Rejected Execution of org.elasticsearch.transport.TransportService Error
We have a Spark Streaming application that reads data from a Kafka queue with a receiver, does some transformations, and writes the output to HDFS. The batch interval is 1 minute, and we have already tuned the backpressure and spark.streaming.receiver.maxRate parameters, so it works fine most of the time.
But we still have one problem. When HDFS is totally down, the batch job hangs for a long time (say HDFS is down for 4 hours; the job will hang for those 4 hours), but the receiver does not know that the job has not finished, so it keeps receiving data for the next 4 hours. This causes an OOM exception, the whole application goes down, and we lose a lot of data.
So, my question is: is it possible to let the receiver know that the job is not finishing, so that it receives less (or even no) data, and when the job finishes, it starts receiving more data to catch up? In the scenario above, when HDFS is down, the receiver would read less data from Kafka, the blocks generated over those 4 hours would be really small, and the receiver and the whole application would stay up; once HDFS recovers, the receiver would read more data and start catching up.
You can enable back pressure by setting the property spark.streaming.backpressure.enabled=true. This will dynamically modify your batch sizes and will avoid situations where you get an OOM from queue build up. It has a few parameters:
spark.streaming.backpressure.pid.proportional - response signal to error in last batch size (default 1.0)
spark.streaming.backpressure.pid.integral - response signal to accumulated error - effectively a dampener (default 0.2)
spark.streaming.backpressure.pid.derived - response to the trend in error (useful for reacting quickly to changes, default 0.0)
spark.streaming.backpressure.pid.minRate - the minimum rate as implied by your batch frequency, change it to reduce undershoot in high throughput jobs (default 100)
The defaults are pretty good, but I simulated the response of the algorithm to various parameters here.
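As a rough sketch of how these properties could be wired up (the app name, the rate values, and the 1-minute batch interval are placeholders; the property names are the ones discussed above):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-hdfs")                            // placeholder app name
  .set("spark.streaming.backpressure.enabled", "true")    // let the rate controller throttle the receiver
  .set("spark.streaming.backpressure.pid.minRate", "10")  // lower floor so the receiver slows right down when the sink stalls
  .set("spark.streaming.receiver.maxRate", "5000")        // hard upper bound per receiver, as already tuned in the question

val ssc = new StreamingContext(conf, Seconds(60))         // 1-minute batch interval, as in the question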
I have a Spark Streaming service where I process data and detect anomalies based on an offline-generated model. I feed data into this service from a log file, which is streamed using the following command:
tail -f <logfile>| nc -lk 9999
Here the Spark Streaming service is reading data from port 9999. However, I observe that the last few lines are being dropped, i.e. Spark Streaming does not receive those log lines, or they are not processed.
However, I also observed that if I simply take the logfile as standard input instead of tailing it, no lines are dropped:
nc -q 10 -lk 9999 < logfile
Can anyone explain why this is happening? And what would be a better way to stream log data to a Spark Streaming instance?
In Spark Streaming, data comes in over the wire and constitutes a block on every block interval. Each block is replicated to other machines (according to your storage level) as soon as it is formed. Once a batch interval elapses, every block formed since the last batch interval tick becomes part of a new RDD. It is only once you have formed this RDD that you can schedule a job, so the data collected during batch interval n is processed during batch interval n+1.
So, the possible culprits for "losing a bit of data towards the end" could be:
you are observing your input file at the same time as you are monitoring Spark's input. If you look at instant t, a bit after n batch intervals have elapsed, your log file has produced the data for n batches and then some ("a little bit more"). At that point, though, the beginning of the next batch (n+1) is still in the data-collection phase, in the form of blocks on your receiver. No data has been lost; the processing of batch n+1 has simply not started yet.
or your application assumes it's receiving a similar number of elements in each RDD and does not process the potentially (much) smaller last batch's RDD correctly.
or you're stopping your application or the data feed before the last batch interval elapses (you need to wait n+1 batch intervals to see the processing of n batches of data); a graceful stop, as sketched after this list, lets the already-received data be processed first.
or there is something weird occurring with the system clock of your executors. Have you thought of synchronizing them with ntp?
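On the stopping point, a rough sketch of a graceful shutdown around a socket source like the one in the question (the app name, the processing stand-in, and the 10-second batch interval are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("log-anomaly-stream")   // placeholder app name
val ssc = new StreamingContext(conf, Seconds(10))             // placeholder batch interval

val lines = ssc.socketTextStream("localhost", 9999)           // same source as in the question
lines.foreachRDD(rdd => println(s"got ${rdd.count()} lines")) // stand-in for the real processing

ssc.start()
ssc.awaitTermination()
// When stopping the job yourself, prefer a graceful stop so blocks that were
// already received are turned into batches and processed before shutdown:
// ssc.stop(stopSparkContext = true, stopGracefully = true)
// Alternatively, setting spark.streaming.stopGracefullyOnShutdown=true makes a
// SIGTERM trigger the same graceful stop.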
Logstash is running.
How long does it take from adding a single line to a log file until Logstash recognizes the new line and starts to transform and output it?
With a simple Bash script I measure from 99 ms up to 800 ms, including a transformation. It's clear that the latency depends on the Logstash transformation, the disk, the OS, and the CPU. But how does Logstash recognize the file change? Is there an internal timer? Does Logstash poll the file?
Logstash's file input polls the files being watched at the interval set in the stat_interval parameter, which currently (Logstash 1.5) defaults to 1, i.e. every second.
In other words, assuming that
Logstash isn't behind on reading any of the log files monitored by a particular file input, and
the Logstash process isn't CPU-starved (it usually runs at priority 19 so heavy CPU usage by other processes could cause scheduling delays),
new events will on average get picked up within 500 ms and in the worst case within 1000 ms.
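For illustration, a minimal file-input sketch with the polling interval spelled out (the path is a placeholder, and 1 second is the default anyway):
input {
  file {
    path => "/var/log/myapp.log"   # placeholder path
    stat_interval => 1             # poll the watched file for changes every second (the default)
  }
}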