I have a Java XPages Domino application running on my server that serves as an API for handling the Rooms & Resources database remotely (its main role is obtaining reservations for a set of rooms and updating them periodically).
Everything was fine when testing, but once I put the app on my production server, I got a crash after some time:
Domino version: Release 10.0.1FP3 August 09, 2019
OS Version: Windows/2016 10.0 [64-bit]
Error Message = PANIC: semaphore invalid or not allocated
SharedDPoolSize = 33554432
FaultRecovery = 0x00010012
Cleanup Script Timeout= 600
Crash Limits = 3 crashes in 5 minutes
StaticHang = Virtual Thread [ nHTTP: 0674: 0011] (Native thread [ nHTTP: 0674: 145c]) (0x674/0x11/0x170000145C)
ConfigFileSem = ( SEM:#0:0x1520000010D) n=0, wcnt=-1, Users=-1, Owner=[ : 0000]
FDSem = ( RWSEM:#53:0x410f) rdcnt=-1, refcnt=0 Writer=[ : 0000], n=53, wcnt=-1, Users=0, Owner=[ : 0000]
<## ------ Notes Data -> OS Data -> Semaphores -> SEM Info (Time 10:34:34) ------ ##>
SpinLockIterations = 1500
FirstFreeSem = 819
SemTableSize = 827
############################################################
### thread 46/89: [ nHTTP: 0674: 145c] FATAL THREAD (Panic)
### FP=0xF3AC3F61E8, PC=0x7FFFC6DD5AC4, SP=0xF3AC3F61E8
### stkbase=0xF3AC400000, total stksize=1048576, used stksize=40472
### EAX=0x00000004, EBX=0x00000000, ECX=0x00001c6c, EDX=0x00000000
### ESI=0x000927c0, EDI=0x00001c6c, CS=0x00000033, SS=0x0000002b
### DS=0x00000000, ES=0x00000000, FS=0x00000000, GS=0x00000000 Flags=0x1700000246
############################################################
[ 1] 0x7FFFC6DD5AC4 ntdll.ZwWaitForSingleObject+20 (10,0,0,F3AC3F6300)
[ 2] 0x7FFFC3464ABF KERNELBASE.WaitForSingleObjectEx+143 (10,F3AC3F69B0,7FFF00000000,1c6c)
#[ 3] 0x7FFFB326DAD0 nnotes.OSRunExternalScript+1808 (5,0,424,0)
#[ 4] 0x7FFFB3269E9C nnotes.FRTerminateWindowsResources+1532 (5,23B45D80D50,0,1)
#[ 5] 0x7FFFB326BA23 nnotes.OSFaultCleanupExt+1395 (0,7f60,0,F3AC3F7C70)
#[ 6] 0x7FFFB326B4A7 nnotes.OSFaultCleanup+23 (7f60,7FFFB3DE7E30,0,200000000)
#[ 7] 0x7FFFB32D6D76 nnotes.OSNTUnhandledExceptionFilter+390 (F3AC3F7B50,7FFFB485A818,F3AC3F7C70,FFFFEB865BDB003)
#[ 8] 0x7FFFB326E70A nnotes.Panic+1066 (5dc,125851500347E41,7FF786A7B4A0,23B1D91F9A8)
#[ 9] 0x7FFFB329FDD6 nnotes.OSLockSemInt+70 (23B1D91F9A4,145c,7FF786A84578,7FF786A84578)
#[10] 0x7FFFB32A04ED nnotes.OSLockWriteSem+77 (23B1D92AA18,7FF786A84578,23B14EA41B0,7FF786A84578)
#[11] 0x7FFFAC74DDC1 nlsxbe.ANDatabase::ANDRemoveCalendar+33 (23B1D92AA18,7FF786A84578,0,23B18FFCBA8)
#[12] 0x7FFFAC881CBB nlsxbe.ANCalendar::`scalar deleting destructor'+91 (7FF786A84578,23B1BB6FC78,0,1)
#[13] 0x7FFFAC7FFAF7 nlsxbe.Java_lotus_domino_local_NotesBase_RecycleObject2+471 (23B159C7A00,23B1BB6FC78,23B1BB6FC70,0)
#[14] 0x7FFFAC7FF91A nlsxbe.Java_lotus_domino_local_NotesBase_RecycleObject+42 (23B159C7A00,23B1BB6FC78,23B1BB6FC70,23B159C7A00)
Most of the operations rely on searching for a room by its internet address in the $Rooms view of names.nsf, then opening the corresponding RnR database and getting all reservation documents for that specific room. Sometimes (although very rarely) I also open the user's calendar and create/update reservations.
At first I thought it was caused by some kind of memory leak, so I went through all the code and recycle()d everything I could find (and I did find some places with obvious handle leaks), but it didn't help at all.
What bothers me is that the crashes happened at almost identical times of day (4 days apart, several minutes after 10 AM).
What could be the cause of this crash? I'm not good at reading crash dumps, but I can see that the first call in the fatal call stack is a RecycleObject, followed by some calendar-related calls.
I have no idea where I should look in my code; why would recycle() even crash the server? Does the ANCalendar frame suggest that I shouldn't look at the code that accesses the database directly, but rather at the code that opens the user's calendar?
Update
Studying the crash logs, I managed to find the place where the crash occurred. It's my appointment creation code, which uses NotesCalendar.createEntry() on the user's calendar. The code looks like this:
Session session = reDatabase.getParent();
Name nnOrganizer = session.createName(session.getEffectiveUserName());
String organizerEmail = "";
DirectoryNavigator nav = session.getDirectory().lookupNames("$Users", nnOrganizer.getCommon(), "InternetAddress");
if (nav.findFirstMatch() && !nav.getFirstItemValue().isEmpty()) {
    organizerEmail = (String) nav.getFirstItemValue().get(0);
}
Recycler.recycle(nav);

Name nnResource = session.createName(roomName);
DbDirectory dir = session.getDbDirectory(session.getServerName());
Database mdb = dir.openMailDatabase();
NotesCalendar cal = session.getCalendar(mdb);

String dStart = DateUtil.formatICalendar(dtStart);
String dEnd = DateUtil.formatICalendar(dtEnd);
String iCalEntry = "BEGIN:VCALENDAR\n";
// Rest of iCalendar string goes here
iCalEntry += "END:VEVENT\n" +
             "END:VCALENDAR\n";

cal.setAutoSendNotices(true);
String apptUNID = "";
try {
    NotesCalendarEntry entry = cal.createEntry(iCalEntry);
    Document doc = entry.getAsDocument();
    apptUNID = doc.getItemValueString("ApptUNID");
    Recycler.recycle(doc, entry);
} catch (NotesException ex) {
    System.out.println("Couldn't create appointment!: " + ex.toString());
    throw ex;
} finally {
    Recycler.recycle(mdb, cal, nnOrganizer, nnResource, dir, reDatabase, session);
}
return apptUNID; // return the UNID of the created entry (if any)
Considering the fatal call stack starts with a RecycleObject call, is there anything wrong with my recycling here? Can I recycle the calendar entry directly after creating it? It's still somewhat confusing to me, because this code works fine on my test server. Is there anything wrong with it?
This is the last code executed when creating an appointment; an HTTP response containing the apptUNID is sent directly after calling the above function.
I am trying to use tf.train.batch to enqueue images in multiple threads. When the number of threads is 1, the code works fine, but when I set a higher number of threads I receive an error:
Failed precondition: Attempting to use uninitialized value Variable
[[Node: Variable/read = Identity[T=DT_INT32, _class=["loc:@Variable"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable)]]
The main thread has to run for some time (under one second) to index the database of folders and put it into a tensor.
I tried to use sess.run([some_image]) before running the tf.train.batch loop. In that case the workers fail in the background first with the same error, and after that I receive my images.
I tried to use time.sleep(), but it does not seem to be possible to delay the workers.
I tried adding a dependency to the batch:
g = tf.get_default_graph()
with g.control_dependencies([init_one, init_two]):
    example_batch = tf.train.batch([my_image], 100)  # batch size as in the repro code below
where init_one and init_two are tf.initialize_all_variables() and tf.initialize_local_variables().
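For reference, my understanding (possibly wrong) is that both initializers should be run before the queue runners are started, roughly as sketched below; I am not sure whether that alone stops the workers from racing ahead:

import tensorflow as tf

# Rough sketch of the startup order I believe the queue-runner API expects:
# build the graph, run the initializers, and only then start the queue runners.
init_global = tf.initialize_all_variables()
init_local = tf.initialize_local_variables()   # needed for the num_epochs counter of slice_input_producer

sess = tf.Session()
sess.run([init_global, init_local])            # initialize before any worker thread touches the variables

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

# ... dequeue loop with sess.run(...) goes here ...

coord.request_stop()
coord.join(threads)
sess.close()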
The most relevant issue I could find is: https://github.com/openai/universe-starter-agent/issues/44
Is there a way I could synchronize the worker threads with the main thread so that they don't race ahead and die out?
A similar, easy-to-reproduce error with variable initialization happens when the epoch counter is set to anything other than None. Are there any potential solutions? I've added the code needed to reproduce the error below:
from glob import glob
import os
import tensorflow as tf


def index_the_database(database_path):
    """Indexes the av4 database and returns the index list plus two lists of filesystem paths: ligand files and receptor files."""
    ligand_file_list = []
    receptor_file_list = []
    for ligand_file in glob(os.path.join(database_path, "*_ligand.av4")):
        receptor_file = "/".join(ligand_file.split("/")[:-1]) + "/" + ligand_file.split("/")[-1][:4] + '.av4'
        if os.path.exists(receptor_file):
            ligand_file_list.append(ligand_file)
            receptor_file_list.append(receptor_file)
    index_list = range(len(ligand_file_list))
    return index_list, ligand_file_list, receptor_file_list


# database_path (the location of the av4 database) is defined elsewhere
index_list, ligand_file_list, receptor_file_list = index_the_database(database_path)

ligand_files = tf.convert_to_tensor(ligand_file_list, dtype=tf.string)
receptor_files = tf.convert_to_tensor(receptor_file_list, dtype=tf.string)

# num_epochs creates a local variable (the epoch counter) behind the scenes
filename_queue = tf.train.slice_input_producer([ligand_files, receptor_files], num_epochs=10, shuffle=True)
serialized_ligand = tf.read_file(filename_queue[0])
serialized_receptor = tf.read_file(filename_queue[1])
image_one = tf.reduce_sum(tf.exp(tf.decode_raw(serialized_receptor, tf.float32)))
image_batch = tf.train.batch([image_one], 100, num_threads=100)

init_two = tf.initialize_all_variables()
init_one = tf.initialize_local_variables()

sess = tf.Session()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
sess.run([init_one])
sess.run([init_two])

while True:
    print "next"
    sess.run([image_batch])
When configuring watchers, what would be the purpose of including both of these settings under a watcher:
singleton = True
numprocesses = 1
The documentation states that setting singleton has the following effect:
singleton:
If set to True, this watcher will have at the most one process. Defaults to False.
I read that as negating the need to specify numprocesses; however, in the GitHub repository they provide an example:
https://github.com/circus-tent/circus/blob/master/examples/example6.ini
Included here as well, where they specify both:
[circus]
check_delay = 5
endpoint = tcp://127.0.0.1:5555
pubsub_endpoint = tcp://127.0.0.1:5556
stats_endpoint = tcp://127.0.0.1:5557
httpd = True
debug = True
httpd_port = 8080
[watcher:swiss]
cmd = ../bin/python
args = -u flask_app.py
warmup_delay = 0
numprocesses = 1
singleton = True
stdout_stream.class = StdoutStream
stderr_stream.class = StdoutStream
So I would assume they do something different and in some way work together?
numprocesses is the initial number of processes for a given watcher. In the example you provided it is set to 1, but a user can typically add more processes as needed (for example with the incr command).
singleton only allows a maximum of 1 process running for a given watcher, so it forbids you from incrementing the number of processes dynamically.
The code below, from the circus test suite, describes it well:
@tornado.testing.gen_test
def test_singleton(self):
    # yield self._stop_runners()
    yield self.start_arbiter(singleton=True, loop=get_ioloop())
    cli = AsyncCircusClient(endpoint=self.arbiter.endpoint)

    # adding more than one process should fail
    yield cli.send_message('incr', name='test')
    res = yield cli.send_message('list', name='test')
    self.assertEqual(len(res.get('pids')), 1)

    yield self.stop_arbiter()
I am trying to use Brightway's ParallelMonteCarlo and MultiMonteCarlo classes but have run into a KeyError. I am in a Brightway project with an LCI database:
In [1] bw.databases
Out [1] Brightway2 databases metadata with 2 objects:
biosphere3
ecoinvent 3_2 CutOff
Selecting an activity and a method:
In [2] db = bw.Database('ecoinvent 3_2 CutOff')
act = db.random()
myMethod = ('CML 2001', 'climate change', 'GWP 100a')
My code is as follows:
In [3] ParallelMC_LCA = bw.ParallelMonteCarlo({act:1},
method = myMethod,
iterations=1000,
cpus=mp.cpu_count())
results = np.array(ParallelMC_LCA.calculate())
and
In [4] act1 = db.random()
act2 = db.random()
multiMC_LCA = bw.MultiMonteCarlo(demands = [{act1:1}, {act2:1}],
method = myMethod,
iterations = 10)
results = np.array(multiMC_LCA.calculate())
Both give me a KeyError: 'ecoinvent 3_2 CutOff'.
My question is: why?
This is a known issue due to differences in how multiprocessing works on Windows and Unix. Specifically, on Windows the project is not set correctly, causing a KeyError. As such, it isn't a Stack Overflow question.
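In the meantime, a single-process fallback avoids multiprocessing entirely. A minimal sketch, assuming the standard MonteCarloLCA iterator API (the iteration count is illustrative):

import numpy as np
import brightway2 as bw

# Sequential Monte Carlo: no worker processes, so the Windows multiprocessing issue is not triggered.
# `act` and `myMethod` are the activity and method tuple from the question.
mc = bw.MonteCarloLCA({act: 1}, method=myMethod)
scores = np.array([next(mc) for _ in range(1000)])  # one LCIA score per iteration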
I'm using JBoss Rules, but I ran into memory issues after using it. Using a profiling tool I collected a heap dump and got the following result:
One instance of "org.drools.reteoo.ReteooStatefulSession" loaded by
"sun.misc.Launcher$AppClassLoader # 0x7f899fdb6d88" occupies 657,328,888 (78.91%) bytes.
The memory is accumulated in one instance of "org.drools.reteoo.ReteooStatefulSession"
loaded by "sun.misc.Launcher$AppClassLoader # 0x7f899fdb6d88".
Keywords
sun.misc.Launcher$AppClassLoader # 0x7f899fdb6d88
org.drools.reteoo.ReteooStatefulSession
The code I used for JBoss rules is given below.
// Build the initial knowledge base and session
kbase = KnowledgeBaseFactory.newKnowledgeBase();
ksession = kbase.newStatefulKnowledgeSession();

final String str = CISMSRemotingUtils.getFullConfigFilePath("change-set.xml");
final String filePath = str.replaceAll(" ", "%20");

// Knowledge agent that monitors change-set.xml and rebuilds the knowledge base
aconf = KnowledgeAgentFactory.newKnowledgeAgentConfiguration();
aconf.setProperty("drools.agent.newInstance", "false");
kagent = KnowledgeAgentFactory.newKnowledgeAgent("Agent", aconf);
kagent.applyChangeSet(ResourceFactory.newFileResource(filePath));

// Replace the initial knowledge base and session with ones backed by the agent
kbase = kagent.getKnowledgeBase();
ksession = kbase.newStatefulKnowledgeSession();

// Rescan the change set every 3600 seconds
sconf = ResourceFactory.getResourceChangeScannerService().newResourceChangeScannerConfiguration();
sconf.setProperty("drools.resource.scanner.interval", "3600");
ResourceFactory.getResourceChangeScannerService().configure(sconf);
ResourceFactory.getResourceChangeNotifierService().start();
ResourceFactory.getResourceChangeScannerService().start();
This piece of code is in the class constructor, and the rules are fired inside the class like this:
ksession.insert(data);
ksession.fireAllRules();
I'm using Drools 5.4.0.
Can anyone help me to identify the problem?