Solr becoming unavailable while adding documents - node.js

We have a Solr 4.5.1 (Debian) installation that is being filled by a node.js app. Both reside on the same machine. We are adding about 250 to 350 documents per second. Solr is configured with auto-commit (1000 ms soft / 15000 ms hard).
After roughly 100 to 150 seconds Solr becomes unavailable for one to five seconds, and http.request() returns EADDRNOTAVAIL. In the meantime we have a workaround that retries adding the document every 500 ms (see the sketch below). After one to ten attempts Solr becomes available again and everything works fine (until the next break).
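For illustration, a stripped-down version of the workaround looks roughly like this (the port, core name and document shape are placeholders, not our real setup):

var http = require('http');

function addDocument(doc, attempt, callback) {
  var body = JSON.stringify([doc]);
  var req = http.request({
    host: 'localhost',
    port: 8983,                 // assumed default Solr port
    path: '/solr/base/update/json?commit=false&wt=json',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body)
    }
  }, function (res) {
    res.resume();               // drain the response so the socket is released
    res.on('end', function () { callback(null); });
  });

  req.on('error', function (err) {
    if (err.code === 'EADDRNOTAVAIL' && attempt < 10) {
      // Solr "coffee break": retry in 500 ms, up to 10 times.
      setTimeout(function () { addDocument(doc, attempt + 1, callback); }, 500);
    } else {
      callback(err);
    }
  });

  req.end(body);
}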
We are wondering whether this is normal, or whether the described behaviour could be due to some misconfiguration.
There is no error or warning entry in the Solr log file. Notably, while Solr is unavailable, the CPU load of the corresponding Solr Java process drops from about 30 % to almost zero.
Of course our node.js app always waits for the response of each http.request, so there should not be any parallel requests that could cause an overflow.
What could be the reason that Solr makes these "coffee breaks"?
Thanks for any hint!
EDIT:
Looking into the Solr log (with auto-commit enabled) while the error occurs, the log file says:
80583779 [commitScheduler-9-thread-1] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
80583788 [commitScheduler-9-thread-1] INFO org.apache.solr.search.SolrIndexSearcher – Opening Searcher#1263c036 main
80583789 [commitScheduler-9-thread-1] INFO org.apache.solr.update.UpdateHandler – end_commit_flush
80583789 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener sending requests to Searcher#1263c036 main{StandardDirectoryReader(segments_1lv:67657:nrt _opz(4.5.1):C1371694/84779:delGen=1 _p29(4.5.1):C52149/1067:delGen=1 _q0p(4.5.1):C48456 _q0z(4.5.1):C2434 _q19(4.5.1):C2583 _q1j(4.5.1):C3195 _q1t(4.5.1):C2633 _q23(4.5.1):C3319 _q1c(4.5.1):C351 _q2n(4.5.1):C3277 _q2x(4.5.1):C2618 _q2d(4.5.1):C2621 _q2w(4.5.1):C201)}
80583789 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener done.
80583789 [searcherExecutor-5-thread-1] INFO org.apache.solr.core.SolrCore – [base] Registered new searcher Searcher#1263c036 main{StandardDirectoryReader(segments_1lv:67657:nrt _opz(4.5.1):C1371694/84779:delGen=1 _p29(4.5.1):C52149/1067:delGen=1 _q0p(4.5.1):C48456 _q0z(4.5.1):C2434 _q19(4.5.1):C2583 _q1j(4.5.1):C3195 _q1t(4.5.1):C2633 _q23(4.5.1):C3319 _q1c(4.5.1):C351 _q2n(4.5.1):C3277 _q2x(4.5.1):C2618 _q2d(4.5.1):C2621 _q2w(4.5.1):C201)}
80593917 [commitScheduler-6-thread-1] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
80593931 [commitScheduler-6-thread-1] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onCommit: commits: num=2
So at first glance it looks like Solr refuses the HTTP connections while performing a soft commit. BUT:
there have been many soft commits before without refusals,
there are many soft commits afterwards without refusals,
and (most importantly) when auto-commit is disabled entirely, the HTTP requests from node are refused anyway (more or less at the same point as with auto-commit enabled).
The Solr log file with auto-commit disabled just stops here (while adding the documents):
153314 [qtp1112461277-10] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d009155]} 0 0
153317 [qtp1112461277-16] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d009156]} 0 0
153320 [qtp1112461277-17] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d009157]} 0 1
153322 [qtp1112461277-18] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d009158]} 0 0
153325 [qtp1112461277-13] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d009159]} 0 0
153329 [qtp1112461277-19] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d00915a]} 0 1
153331 [qtp1112461277-12] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d00915b]} 0 0
153334 [qtp1112461277-15] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d00915c]} 0 0
153336 [qtp1112461277-11] INFO org.apache.solr.update.processor.LogUpdateProcessor – [base] webapp=/solr path=/update/json params={commit=false&wt=json} {add=[52c14621fc45c82d4d00915d]} 0 0
So there is no hint any more that committing or indexing could be the reason why Solr refuses any (any!) HTTP request for 1-3 seconds.
EDIT2:
It is also remarkable that if we switch root logging to ALL, Solr becomes slower (obviously, due to the more verbose logging), but the error also vanishes: there are no refusal periods any more. This again looks like a timing problem...
EDIT3:
In the meantime I found out that the unavailability of Solr only affects my node application. While Solr is unavailable for node.js, I can still make requests from the Solr Web Admin. I also tried to connect from node.js to a different webserver while Solr was not reachable, and that works! So this is weird: my node.js app cannot access Solr for some seconds, but any other app can. Any idea what the reason for that could be?

If you are doing a full indexing run, it's a bad idea to use auto-commit.
What happens is that at every hard commit a new index segment is written. Your policy means a commit happens every 15000 documents, which (at roughly 300 docs/sec) can trigger a merge/optimization cycle about every 50 seconds, and during optimization the Solr server refuses to serve queries because its resources are tied up in the merge. So if you are doing a big/bulk/full indexing run, comment out the auto-commits; it will also speed up your indexing.
Also try disabling the transaction log for big indexing runs, as it is an overhead in a bulk indexing scenario.
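For example, from a node.js client (as used in the question) you could leave auto-commit off during the bulk load and issue one explicit commit at the end; the host, port and core name below are illustrative:

var http = require('http');

// Send a single explicit commit once the bulk load has finished,
// instead of relying on auto-commit during indexing.
function commitIndex(callback) {
  var req = http.request({
    host: 'localhost',
    port: 8983,                 // assumed default Solr port
    path: '/solr/base/update?commit=true&wt=json',
    method: 'GET'
  }, function (res) {
    res.resume();
    res.on('end', function () { callback(null); });
  });
  req.on('error', callback);
  req.end();
}

// after all documents have been added:
// commitIndex(function (err) { if (err) console.error(err); });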
regards
Rajat

Please provide more data:
1. The volume of documents per indexing cycle (documents per minute or anything like that).
2. What you are using Solr for (near-real-time search, or search with a delay).
3. What your Solr configuration is (master/slave etc. and their purpose).
As far as I have seen, the commit process works like this: while indexing, Solr queues up the indexing requests during a commit and indexes them when it is free again, so although it may seem that indexing has stopped, it goes on.
Also look into the warm-up count for the caches. I am assuming that you have a master/slave configuration; on the master the cache warm-up count should be zero, because you are clearing all caches every 15 seconds anyway.
Secondly
I feel it is highly unlikely that you will run into this halting problem on a production system,
because there you will have one large index and the rest will be small files, so merging will only involve small segments. But I would be in a better position to answer the question if you could answer the questions I asked.
regards
Rajat

Finally I found the answer: it is a problem with Node's default global HTTP agent. See the full description here.
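I will not repeat the full description here, but the core of the fix is to reuse sockets instead of opening a new connection (and consuming a new ephemeral local port) for every single document. A minimal sketch of that idea; the keepAlive option requires a Node version that supports it, and the option values are just examples:

var http = require('http');

// One shared agent that keeps connections to Solr open and reuses them.
var solrAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });

var body = JSON.stringify([{ id: 'example-doc' }]);
var req = http.request({
  host: 'localhost',
  port: 8983,                   // assumed default Solr port
  path: '/solr/base/update/json?commit=false&wt=json',
  method: 'POST',
  agent: solrAgent,             // use the shared keep-alive agent
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body)
  }
}, function (res) {
  res.resume();
});
req.end(body);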


Key with version number error in YugabyteDB cluster?

[Question posted by a user on YugabyteDB Community Slack]
I'm just running some negative tests against YugabyteDB and I'm facing an issue. I'm running a 3-node cluster (master and tserver running on the same node). When I stop one node and start it up again, the tserver does not boot up and logs this:
F20220506 06:50:49 ../../src/yb/tserver/tablet_server_main.cc:220] Invalid argument (yb/util/universe_key_manager.cc:73): Could not init Tablet Manager: Failed to open tablet metadata for tablet: eb1e5457022f42c084148ca8fa4ba5c6: Failed to load tablet metadata for tablet id eb1e5457022f42c084148ca8fa4ba5c6: Could not load Raft group metadata from /data/yugabyte/data/yb-data/tserver/tablet-meta/eb1e5457022f42c084148ca8fa4ba5c6: Key with version number c7b91fad-dd60-404f-8846-cab568e52468 does not exist.
# 0x7fcdace5ee4c yb::LogFatalHandlerSink::send()
# 0x7fcdaa5e28ee google::LogMessage::SendToLog()
# 0x7fcdaa5dfa7a google::LogMessage::Flush()
# 0x7fcdaa5e3169 google::LogMessageFatal::~LogMessageFatal()
# 0x4124ea yb::tserver::(anonymous namespace)::TabletServerMain()
# 0x7fcda6811825 __libc_start_main
# 0x410f99 _start
# (nil) (unknown)
The only way to start it up is to remove the old data.
My steps were:
1.- Cluster up with 3 servers
2.- Create a table with 3 partitions on different tablet ids (confirmed via UI)
3.- Insert 3 different rows into different partitions
4.- SELECT * working fine
5.- Shut down one tablet server
6.- SELECT * working fine
7.- Start up the tablet server again (error)
We are running using config file:
/usr/local/yugabyte/src/yugabyte-2.11.0.1/bin/./yb-tserver --flagfile /data/yugabyte/etc/tserver.conf
and config:
--tserver_master_addrs=ip1:7100,ip2:7100,ip3:7100
--rpc_bind_addresses=fqdn
--server_broadcast_addresses=ip1
--enable_ysql
--pgsql_proxy_bind_address=ip1:5433
--cql_proxy_bind_address=ip1:9042
--fs_data_dirs=/data/yugabyte/data
--placement_cloud=cloud
--placement_region=reg
--placement_zone=zone
--use_client_to_server_encryption=true
--certs_for_client_dir=/data/yugabyte/ssl
--certs_dir=/data/yugabyte/ssl
--use_node_to_node_encryption=true
--ysql_enable_auth=true
--log_dir=/data/yugabyte/logs
--ssl_protocols=tls12,tls13
--ysql_pg_conf=pgaudit.log='DDL',pgaudit.log_level=notice,pgaudit.log_client=ON,log_min_messages=notice,log_line_prefix='\%m \%r \%u \%d [\%p]'
Looks like the key is not in memory:
/usr/local/yugabyte/src/yugabyte-2.11.0.1/bin/yb-admin -master_addresses $master get_universe_config
{"version":2,"replicationInfo":{"liveReplicas":{"numReplicas":3,"placementBlocks":[{"cloudInfo":{"placementCloud":"cloud","placementRegion":"region","placementZone":"zone"},"minNumReplicas":1}]}},"clusterUuid":"dccea8cb-9790-48ba-8a05-6218a8e875a4","encryptionInfo":{"encryptionEnabled":true,"universeKeyRegistryEncoded":"sZTzNciYu6b1KxZonpJx6v7CDDvexiv1jh/HIEAOkpV4YRrIZbIK9jtajdEMmVEUy706+dmz8bmnZvy6/n33u+qS7fzRSOTPOlpxYI6+k1lSM6bu2DRTTffhZtaiKN15gy8a3ifaZV7xJ9QJ3z9SvFYzb96+KDWw","keyPath":"/data/yugabyte/rest/universe_key","latestVersionId":"c7b91fad-dd60-404f-8846-cab568e52468","keyInMemory":false}}
KeyInMemory: False
We are using encryption at rest, but the file with the key should be there.
Am I doing something wrong?
/usr/local/yugabyte/src/yugabyte-2.11.0.1/bin/yb-admin -master_addresses fqdn:7100 all_masters_have_universe_key_in_memory 7e13c99e-5278-4abd-ab78-79f70d6c2679
Error running all_masters_have_universe_key_in_memory: Operation failed. Try again. (yb/tools/yb-admin_client_ent.cc:1027): Unable to check whether master has universe key in memory.: Node fqdn:7100 does not have universe key in memory
The above error seems to indicate the tserver is looking for a key - “c7b91fad-dd60-404f-8846-cab568e52468” in order to open the file, but can’t find it.
I realize now that there's an actual key file on the masters. That approach has been deprecated in favour of more secure in-memory keys. I did a little digging through the code, and sure enough, we don't actually support sending on-disk keys to tablet servers on restart.
The command you have to run is this one:
yb-admin -master_addresses <master-addresses> add_universe_keys_to_all_masters <key_id> <key_path>
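With the key id and key path shown in the get_universe_config output above, that would be something like:
/usr/local/yugabyte/src/yugabyte-2.11.0.1/bin/yb-admin -master_addresses $master add_universe_keys_to_all_masters c7b91fad-dd60-404f-8846-cab568e52468 /data/yugabyte/rest/universe_key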
and then right after that it should work:
/usr/local/yugabyte/src/yugabyte-2.11.0.1/bin/yb-admin -master_addresses $master all_masters_have_universe_key_in_memory 1
Node fqdn1:7100 has universe key in memory: 1
Node fqdn2:7100 has universe key in memory: 1
Node fqdn3:7100 has universe key in memory: 1

ELK - Time difference calculation on Logstash

I need to calculate the time difference between two events from the log below and display it in Kibana. I have a unique identifier and a message to identify the events.
2019-12-20 13:00:22.493 [http-nio-8080-exec-2933] INFO c.g.i.g.m.c.GRInternalController.getGRForPO(68) - Request Started
2019-12-20 13:00:30.882 [http-nio-8080-exec-2933] INFO c.g.i.g.m.s.i.GoodsReceiptServiceImpl.getGRForPO(1647) - Request Completed
2019-12-20 13:01:02.570 [http-nio-8080-exec-2940] INFO c.g.i.g.m.c.GRInternalController.getGRForPO(68) - Request Started
2019-12-20 13:01:09.930 [http-nio-8080-exec-2940] INFO c.g.i.g.m.s.i.GoodsReceiptServiceImpl.getGRForPO(1647) - Request Completed
Unique identifiers of the events: [http-nio-8080-exec-2933], [http-nio-8080-exec-2940]
Time diff of [http-nio-8080-exec-2933] : 8.389
Time diff of [http-nio-8080-exec-2940] : 7.36
Can someone please suggest a solution? Thanks in advance.
We can use a ruby filter in Logstash to achieve this:
ruby {
  # time1/time2 are assumed to hold the timestamps of the two correlated
  # events (e.g. "Request Completed" and "Request Started" for one thread).
  code => "
    duration = (DateTime.parse(event.get('time1')).to_time.to_f * 1000 -
                DateTime.parse(event.get('time2')).to_time.to_f * 1000) rescue nil
    event.set('timedifference', duration)
  "
}
The time difference will be in milliseconds.

Logstash 6.2 - full persistent queue (wrong mapping?)

My queue is almost full and I see these errors in my log file:
[2018-05-16T00:01:33,334][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"2018.05.15-el-mg_papi-prod", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x608d85c1>], :response=>{"index"=>{"_index"=>"2018.05.15-el-mg_papi-prod", "_type"=>"doc", "_id"=>"mHvSZWMB8oeeM9BTo0V2", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [papi_request_json.query.disableFacets]", "caused_by"=>{"type"=>"i_o_exception", "reason"=>"Current token (VALUE_TRUE) not numeric, can not use numeric value accessors\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper#56b8442f; line: 1, column: 555]"}}}}}
[2018-05-16T00:01:37,145][INFO ][org.logstash.beats.BeatsHandler] [local: 0:0:0:0:0:0:0:1:5000, remote: 0:0:0:0:0:0:0:1:50222] Handling exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 69
[2018-05-16T00:01:37,147][INFO ][org.logstash.beats.BeatsHandler] [local: 0:0:0:0:0:0:0:1:5000, remote: 0:0:0:0:0:0:0:1:50222] Handling exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 84
...
[2018-05-16T15:28:09,981][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2018-05-16T15:28:09,982][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2018-05-16T15:28:09,982][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
If I understand the first warning correctly, the problem is with the mapping. I have a lot of files in my Logstash queue folder. My questions are:
How do I empty my queue? Can I just delete all files from the Logstash queue folder (all logs would be lost) and then resend all the data to Logstash to the proper index?
How can I determine where exactly the problem in the mapping is, or which servers are sending data of the wrong type?
I have a pipeline on port 5000 named testing-pipeline, just for checking from Nagios whether Logstash is active. What are those [INFO ][org.logstash.beats.BeatsHandler] logs?
If I understand correctly, [INFO ][logstash.outputs.elasticsearch] is just logging about retrying to process the Logstash queue?
All servers run Filebeat 6.2.2. Thank you for your help.
All pages in the queue could be deleted, but that is not the proper solution. In my case, the queue was full because there were events whose mapping did not match the index mapping. In Elasticsearch 6 you cannot send documents with a different mapping to the same index, so the logs piled up in the queue (even if there is only one wrong event, all the others will not be processed). So how do you process all the data you can and skip the wrong events? The solution is to configure a DLQ (dead letter queue): every event that gets a response code 400 or 404 is moved to the DLQ so the others can be processed. The data in the DLQ can be processed later with a separate pipeline.
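For reference, a minimal sketch of that setup (the paths are illustrative): enable the DLQ in logstash.yml, then read it back with the dead_letter_queue input in a separate pipeline.

# logstash.yml
dead_letter_queue.enable: true

# separate pipeline that re-processes the dead letter queue
input {
  dead_letter_queue {
    path => "/var/lib/logstash/data/dead_letter_queue"   # under path.data
    commit_offsets => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "dlq-reprocessed"
  }
}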
The wrong mapping can be identified from the error log entry "error"=>{"type"=>"mapper_parsing_exception", ...}. To pinpoint the exact place with the wrong mapping, you have to compare the mapping of the events with the mapping of the indices.
[INFO ][org.logstash.beats.BeatsHandler] was caused by the Nagios server. The check did not consist of a valid Beats request, which is why the Handling exception appears. The check should test whether the Logstash service is active; now I check the Logstash service on localhost:9600, for more info here.
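For reference, a minimal check against the Logstash monitoring API on that port can be as simple as:
curl -s http://localhost:9600/?pretty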
[INFO ][logstash.outputs.elasticsearch] means that Logstash is trying to process the queue, but the index is locked ([FORBIDDEN/12/index read-only / allow delete (api)]) because the indices were set to a read-only state. When there is not enough space on the server, Elasticsearch automatically configures indices as read-only. This can be changed via cluster.routing.allocation.disk.watermark.low, for more info here.

CouchDB log does not work in map nor reduce function?

The docs say there is a log method to write a message to the log file at INFO level.
I tried it, but it does not work (CouchDB 1.6.1).
First I start monitoring the log file:
tail -f couch.log
I can see that the log file is being appended to, and other INFO messages appear, like:
[Tue, 06 Jan 2015 08:16:10 GMT] [info] [<0.321.0>] 192.168.1.43 - - GET /test/ 200
[Tue, 06 Jan 2015 08:16:10 GMT] [info] [<0.323.0>] 192.168.1.45 - - GET /test/ 200
I tried log in views (both temporary and persistent views); the message never appears, while the other INFO messages keep being appended. The view responds correctly. I also tried adding new documents and then triggering the view, still nothing.
function(doc) {
  log('LOG NEVER APPEARS');
  emit(null, doc);
}
Does anybody know what the reason is?
What probably happened is that you created the view and the log function wrote its message while the index was first being built. By the time you tailed the log, the index had already been created and the view code was not run again. Try adding some data and calling the view again, or change the view, save it, and call the view again; you should then see the logged message.
This happens because the view code is run only when the underlying data in CouchDB changes. Once a view is created, it grows by indexing the new data into the already existing index; if the data does not change, the code inside the view will not be run.
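As an illustration (only the test database name comes from the log lines above; the design document and view names are made up, and the built-in fetch requires a recent Node version), adding a document and querying the view again would look something like this:

const base = 'http://127.0.0.1:5984/test';

async function triggerView() {
  // 1. Change the underlying data by adding a new document.
  await fetch(base, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ created_at: new Date().toISOString() })
  });

  // 2. Query the view; indexing the new document runs the map function,
  //    so log('LOG NEVER APPEARS') should now show up in couch.log.
  const res = await fetch(base + '/_design/example/_view/all');
  console.log(await res.json());
}

triggerView().catch(console.error);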

phpcassa creating column family

I get a very strange error when creating a column family with phpcassa. Here is my code:
$sys = new SystemManager("127.0.0.1:9160");
$attr = array("comparator" => "UTF8Type");
$data = $sys->create_column_family("my_key_space", "user_likes", $attr);
I'm not entirely sure this is valid code, but I'm fairly confident it is. This is the error I get:
TTransportException [ 0 ]: TSocket: timed out reading 4 bytes from 127.0.0.1:9160
I get this error after a really long wait, maybe 30-60 seconds, but any other code, like retrieving or inserting data, works perfectly. What could it be?
I believe the attribute name should be "comparator_type" instead of "comparator".
As for why the server isn't responding, you'll probably find an Exception or stack trace in your Cassandra logs. If you're using an up-to-date version of Cassandra (like 1.1.5 or 1.1.6), I suggest opening a ticket in the Cassandra JIRA, because it should be returning an error instead of timing out.
