Neo4j is running out of memory / GC errors

Part of my bachelor thesis involves loading a .nt file (N-Triples format) into a Neo4j database. After reading up on possible implementations, I found that an implementation using a SailRepositoryConnection from OpenRDF should work just like I need it to, but I ran into heavy OOM/GC problems.
Libraries that I use:
- Blueprints 2.5
- OpenRDF Sesame 2.7.8
- Neo4j 2.0.3
The system that I use:
- Fedora 19 64-bit
- 4 GB RAM
- Java version 1.7.0_51
My code is the following:
public void loadFile() throws SailException {
    // load file
    Neo4j2Graph neo4jGraph = new Neo4j2Graph(this.DB_DIRECTORY);
    BatchGraph<TransactionalGraph> neo = new BatchGraph<TransactionalGraph>(neo4jGraph, VertexIDType.NUMBER, 1);

    GraphSail sail = new GraphSail(neo4jGraph);
    sail.initialize();

    SailRepositoryConnection connection;
    try {
        connection = new SailRepository(sail).getConnection();

        URL url = f.toURI().toURL(); // getClass().getResource( this.f.getCanonicalPath() );
        // System.out.println( "Loading " + url + ": " );
        connection.add(url, null, RDFFormat.NTRIPLES);

        connection.commit();
        connection.close();
    } catch (Exception e) {
        e.printStackTrace(System.out);
    }
    // System.out.print( "Done." );

    sail.shutDown();
    neo4jGraph.shutdown();
    neo.shutdown();
}
With very small .nt files (like 4,800 triples) it works just fine.
But when trying to load a .nt file that contains 1.8 million triples, the following error occurs:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:53)
at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:199)
at org.apache.lucene.search.Searcher.createNormalizedWeight(Searcher.java:168)
at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:664)
at org.apache.lucene.search.Query.weight(Query.java:103)
at org.neo4j.index.impl.lucene.Hits.<init>(Hits.java:101)
at org.neo4j.index.impl.lucene.LuceneIndex.search(LuceneIndex.java:427)
at org.neo4j.index.impl.lucene.LuceneIndex.query(LuceneIndex.java:314)
at org.neo4j.index.impl.lucene.LuceneIndex.get(LuceneIndex.java:229)
at org.neo4j.kernel.impl.coreapi.AbstractAutoIndexerImpl$IndexWrapper.get(AbstractAutoIndexerImpl.java:176)
at com.tinkerpop.blueprints.impls.neo4j2.Neo4j2Graph.getVertices(Neo4j2Graph.java:369)
at com.tinkerpop.blueprints.oupls.sail.GraphSail$DataStore.findVertex(GraphSail.java:297)
at com.tinkerpop.blueprints.oupls.sail.GraphSailConnection.getOrCreateVertex(GraphSailConnection.java:241)
at com.tinkerpop.blueprints.oupls.sail.GraphSailConnection.addStatementInternal(GraphSailConnection.java:208)
at com.tinkerpop.blueprints.oupls.sail.GraphSailConnection.addStatementInternal(GraphSailConnection.java:165)
at org.openrdf.sail.helpers.SailConnectionBase.addStatement(SailConnectionBase.java:471)
at org.openrdf.repository.sail.SailRepositoryConnection.addWithoutCommit(SailRepositoryConnection.java:281)
at org.openrdf.repository.base.RepositoryConnectionBase.add(RepositoryConnectionBase.java:469)
at org.openrdf.repository.util.RDFInserter.handleStatement(RDFInserter.java:207)
at org.openrdf.rio.ntriples.NTriplesParser.parseTriple(NTriplesParser.java:319)
at org.openrdf.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:193)
at org.openrdf.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:132)
at org.openrdf.repository.util.RDFLoader.loadInputStreamOrReader(RDFLoader.java:325)
[Full GC 963967K->963934K(963968K), 5.8010570 secs]
at org.openrdf.repository.util.RDFLoader.load(RDFLoader.java:222)
at org.openrdf.repository.util.RDFLoader.load(RDFLoader.java:180)
at org.openrdf.repository.base.RepositoryConnectionBase.add(RepositoryConnectionBase.java:253)
at src4neo2graph.QnL2.loadFile(QnL2.java:181)
at main.Main.main(Main.java:52)
I also tried running it with 4 GB on another system, but the same errors occurred.
So I tried various JVM options for garbage collection (not in that specific order, but I tried them all at some point):
-Xms2G -Xmx2G -XX:PermSize=256M -XX:MaxPermSize=256M -XX:+UseConcMarkSweepGC
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:MaxHeapFreeRatio=95
But that didn't help either.
I've searched around a lot, and this seems to be the implementation everyone is using for loading a .nt file into a Neo4j DB, and as I wrote, it works just fine with a small number of triples.
So basically: Is there something wrong with my implementation?
Are there better ones?
What can I do to make this work, besides buying more RAM (as I said, I tested with more on another system)?
Thanks in advance.

You shouldn't need to actually break your file into chunks; just count the statements as they are added and commit() after each nth statement. There is a convenience class, SailLoader (source), in blueprints-graph-sail which does this for you, e.g.
public void loadFile() throws Exception {
    File f = new File("/tmp/sp2bench-1000000.nt.gz");

    Neo4j2Graph neo4jGraph = new Neo4j2Graph(this.DB_DIRECTORY);
    try {
        GraphSail sail = new GraphSail(neo4jGraph);
        sail.initialize();

        try {
            SailLoader loader = new SailLoader(sail);
            loader.setBufferSize(1000); // this is the default
            loader.setVerbose(true);
            loader.load(f);
        } finally {
            sail.shutDown();
        }
    } finally {
        neo4jGraph.shutdown();
    }
}

I'm not sure how TinkerPop works internally, but I guess everything happens in a single transaction. In Neo4j, a transaction is first built up in memory, and upon finishing the tx it is persisted and the memory used is freed on the next GC. If you have large transactions, you need a suitable amount of RAM.
If there's no easy way to control transaction sizes within TinkerPop, consider splitting your input file into smaller chunks, importing them one by one, and making sure to call .commit() after each chunk.
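Rather than physically splitting the file, the same effect can be had by committing every N statements while parsing, which is also what the answer above suggests. Here is a minimal sketch along those lines, reusing the Sesame 2.7 / GraphSail classes already shown in the question (the method name, batch size, and base URI are made-up example values, and this is an untested illustration, not a drop-in fix):
public void loadFileInBatches(GraphSail sail, File file) throws Exception {
    final SailRepositoryConnection connection = new SailRepository(sail).getConnection();
    final int batchSize = 10000; // example value; tune to your heap

    RDFParser parser = Rio.createParser(RDFFormat.NTRIPLES);
    parser.setRDFHandler(new RDFHandlerBase() {
        private long count = 0;

        @Override
        public void handleStatement(Statement st) throws RDFHandlerException {
            try {
                connection.add(st);
                if (++count % batchSize == 0) {
                    connection.commit();  // keep each Neo4j transaction small
                    connection.begin();   // start the next batch
                }
            } catch (RepositoryException e) {
                throw new RDFHandlerException(e);
            }
        }
    });

    connection.begin();
    parser.parse(new FileInputStream(file), "http://example.org/base/");
    connection.commit(); // commit the last partial batch
    connection.close();
}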

Did you figure out a solution? I have the same problem. Did you divide your files into chunks and load them, or do something else? If you did divide your files into chunks, do your queries still work as they did before?
To answer your question: if you're not planning to use Neo4j as a pure triple store, here's an alternative that loads your Turtle file really fast and creates a graph that can be queried using Cypher.
http://michaelbloggs.blogspot.de/2013/05/importing-ttl-turtle-ontologies-in-neo4j.html
Hope it helps!

Related

In Kotlin Native, how to keep an object around in a separate thread, and mutate its state from any other thread without using C pointers?

I'm exploring Kotlin Native and have a program with a bunch of Workers doing concurrent stuff
(running on Windows, but this is a general question).
Now, I wanted to add simple logging. A component that simply logs strings by appending them as new lines to a file that is kept open in 'append' mode.
(Ideally, I'd just have a "global" function...
fun log(text:String) {...}
...that I would be able to call from anywhere, including from "inside" other workers and that would just work. The implication here is that it's not trivial to do this because of Kotlin Native's rules regarding passing objects between threads (TLDR: you shouldn't pass mutable objects around. See: https://github.com/JetBrains/kotlin-native/blob/master/CONCURRENCY.md#object-transfer-and-freezing ).
Also, my log function would ideally accept any frozen object. )
What I've come up with are solutions using DetachedObjectGraph:
First, I create a detached logger object
val loggerGraph = DetachedObjectGraph { FileLogger("/foo/mylogfile.txt")}
and then use loggerGraph.asCPointer() to get a COpaquePointer to the detached graph:
val myPointer = loggerGraph.asCPointer()
Now I can pass this pointer into the workers (via the producer lambda of the Worker's execute function) and use it there. Or I can store the pointer in a @ThreadLocal global var.
For the code that writes to the file, whenever I want to log a line, I have to create a DetachedObjectGraph object from the pointer again,
and attach() it in order to get a reference to my fileLogger object:
val fileLogger = DetachedObjectGraph(myPointer).attach()
Now I can call a log function on the logger:
fileLogger.log("My log message")
This is what I've come up with looking at the APIs that are available (as of Kotlin 1.3.61) for concurrency in Kotlin Native,
but I'm left wondering what a better approach would be ( using Kotlin, not resorting to C ). Clearly it's bad to create a DetachedObjectGraph object for every line written.
One could pose this question in a more general way: How to keep a mutable resource open in a separate thread ( or worker ), and send messages to it.
Side comment: Having Coroutines that truly use threads would solve this problem, but the question is about how to solve this task with the APIs currently ( Kotlin 1.3.61 ) available.
You definitely shouldn't use DetachedObjectGraph in the way presented in the question. There's nothing to prevent you from trying to attach on multiple threads, or, if you pass the same pointer around, from trying to attach to an invalid one after another thread has attached to it.
As Dominic mentioned, you can keep the DetachedObjectGraph in an AtomicReference. However, if you're going to keep DetachedObjectGraph in an AtomicReference, make sure the type is AtomicRef<DetachedObjectGraph?> and busy-loop while the DetachedObjectGraph is null. That will prevent the same DetachedObjectGraph from being used by multiple threads. Make sure to set it to null, and repopulate it, in an atomic way.
However, does FileLogger need to be mutable at all? If you're writing to a file, it doesn't seem so. Even if so, I'd isolate the mutable object to a separate worker and send log messages to it rather than doing a DetachedObjectGraph inside an AtomicRef.
In my experience, DetachedObjectGraph is super uncommon in production code. We don't use it anywhere at the moment.
To isolate mutable state to a Worker, something like this:
class MutableThing<T:Any>(private val worker:Worker = Worker.start(), producer:()->T){
    private val arStable = AtomicReference<StableRef<T>?>(null)

    init {
        worker.execute(TransferMode.SAFE, {Pair(arStable, producer).freeze()}){
            it.first.value = StableRef.create(it.second()).freeze()
        }
    }

    fun <R> access(block:(T)->R):R{
        return worker.execute(TransferMode.SAFE, {Pair(arStable, block).freeze()}){
            it.second(it.first.value!!.get())
        }.result
    }
}

object Log{
    private val fileLogger = MutableThing { FileLogger() }

    fun log(s:String){
        fileLogger.access { fl -> fl.log(s) }
    }
}

class FileLogger{
    fun log(s:String){}
}
The MutableThing uses StableRef internally. producer makes the mutable state you want to isolate. To log something, call Log.log, which will wind up calling the mutable FileLogger.
To see a basic example of MutableThing, run the following test:
@Test
fun goIso(){
    val mt = MutableThing { mutableListOf("a", "b")}
    val workers = Array(4){Worker.start()}
    val futures = mutableListOf<Future<*>>()

    repeat(1000) { rcount ->
        val future = workers[rcount % workers.size].execute(
            TransferMode.SAFE,
            { Pair(mt, rcount).freeze() }
        ) { pair ->
            pair.first.access {
                val element = "ttt ${pair.second}"
                println(element)
                it.add(element)
            }
        }
        futures.add(future)
    }

    futures.forEach { it.result }
    workers.forEach { it.requestTermination() }

    mt.access {
        println("size: ${it.size}")
    }
}
The approach you've taken is pretty much correct and the way it's supposed to be done.
The thing I would add is: instead of passing a pointer around, you should pass around a frozen FileLogger, which internally holds a reference to an AtomicRef<DetachedObjectGraph>; the attaching and detaching should be done internally, especially since a DetachedObjectGraph is invalid once it has been attached.

primaryValues does not behave as expected

In our PoC, we have a cache in PARTITIONED mode with 2 backups, and we started 3 nodes. 100 entries were loaded into the cache, and we did the steps below to retrieve them.
public void perform() throws GridException {
    final GridCache<Long, Entity> cache = g.cache("cache");
    GridProjection proj = g.forCache("cache");

    Collection<Collection<Entity>> list = proj.compute().broadcast(
        new GridCallable<Collection<Entity>>() {
            @Override public Collection<Entity> call() throws Exception {
                Collection<Entity> values = cache.primaryValues();
                System.out.println("List size on each Node: " + values.size());
                // console from each node shows 28,38,34 respectively, which is correct
                return values;
            }
        }).get();

    for (Collection<Entity> e : list) {
        System.out.println("list size when arrives on main Node :" + e.size());
        // console shows 28 for three times, which is not correct
    }
}
I assume that primaryValues() is supposed to take the value of each element returned by primaryEntrySet() and put it into a Collection. I also tried using primaryEntrySet(), and it works without this problem.
The way GridGain serializes cache collections is by reference, which may not be very intuitive. I have filed a Jira issue with the Apache Ignite project (which is the next version of the GridGain open source edition): https://issues.apache.org/jira/browse/IGNITE-38
In the meantime, please try the following from your GridCallable, which should work:
return new ArrayList(cache.primaryValues());
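For example, the GridCallable from the question's snippet could be adapted like this (Entity, cache, and the log line come from the code above; this is just a sketch of the workaround, not tested against your setup):
new GridCallable<Collection<Entity>>() {
    @Override public Collection<Entity> call() throws Exception {
        Collection<Entity> values = cache.primaryValues();
        System.out.println("List size on each Node: " + values.size());
        // Copy into a plain java.util.ArrayList so the result is serialized
        // by value rather than as a reference to the cache-backed collection.
        return new ArrayList<Entity>(values);
    }
}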

How to use Apache OpenNLP in a node.js application

What is the best way to use Apache Open NLP with node.js?
Specifically, I want to use the Name Entity Extraction API. Here is what it says about it (the documentation is terrible; it's a new project, I think):
http://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind
From the docs:
To use the Name Finder in a production system it's strongly recommended to embed it directly into the application instead of using the command line interface. First the name finder model must be loaded into memory from disk or another source. In the sample below it's loaded from disk.
InputStream modelIn = new FileInputStream("en-ner-person.bin");

try {
    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    if (modelIn != null) {
        try {
            modelIn.close();
        }
        catch (IOException e) {
        }
    }
}
There are a number of reasons why the model loading can fail:
- Issues with the underlying I/O
- The version of the model is not compatible with the OpenNLP version
- The model is loaded into the wrong component, for example a tokenizer model is loaded with the TokenNameFinderModel class
- The model content is not valid for some other reason
After the model is loaded the NameFinderME can be instantiated.
NameFinderME nameFinder = new NameFinderME(model);
The initialization is now finished and the Name Finder can be used.
The NameFinderME class is not thread safe, it must only be called from
one thread. To use multiple threads multiple NameFinderME instances
sharing the same model instance can be created. The input text should
be segmented into documents, sentences and tokens. To perform entity
detection an application calls the find method for every sentence in
the document. After every document clearAdaptiveData must be called to
clear the adaptive data in the feature generators. Not calling
clearAdaptiveData can lead to a sharp drop in the detection rate after
a few documents. The following code illustrates that:
for (String document[][] : documents) {
    for (String[] sentence : document) {
        Span nameSpans[] = nameFinder.find(sentence);
        // do something with the names
    }
    nameFinder.clearAdaptiveData();
}
The following snippet shows a call to find:
String[] sentence = new String[]{
    "Pierre",
    "Vinken",
    "is",
    "61",
    "years",
    "old",
    "."
};
Span nameSpans[] = nameFinder.find(sentence);
The nameSpans arrays contains now exactly one Span which marks the
name Pierre Vinken. The elements between the begin and end offsets are
the name tokens. In this case the begin offset is 0 and the end offset
is 2. The Span object also knows the type of the entity. In this case
its person (defined by the model). It can be retrieved with a call to
Span.getType(). Additionally to the statistical Name Finder, OpenNLP
also offers a dictionary and a regular expression name finder
implementation.
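For reference, the quoted steps combine into something like the following minimal Java sketch (the model path and the hard-coded sentence are just the example values from the docs; adjust them to your own data):
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NameFinderExample {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained person-name model (example path).
        InputStream modelIn = new FileInputStream("en-ner-person.bin");
        TokenNameFinderModel model;
        try {
            model = new TokenNameFinderModel(modelIn);
        } finally {
            modelIn.close();
        }

        NameFinderME nameFinder = new NameFinderME(model);

        // One already-tokenized sentence, as in the docs above.
        String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};

        Span[] nameSpans = nameFinder.find(sentence);
        for (Span span : nameSpans) {
            String[] nameTokens = Arrays.copyOfRange(sentence, span.getStart(), span.getEnd());
            System.out.println(span.getType() + ": " + Arrays.toString(nameTokens));
        }

        // Reset adaptive data after each document to avoid a drop in detection rate.
        nameFinder.clearAdaptiveData();
    }
}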
Check out this Node.js library:
https://github.com/mbejda/Node-OpenNLP
https://www.npmjs.com/package/opennlp
Just do npm install opennlp and look at the examples on GitHub:
var nameFinder = new openNLP().nameFinder;
nameFinder.find(sentence, function(err, results) {
console.log(results)
});

How to use IObservable/IObserver with ConcurrentQueue or ConcurrentStack

I realized that, when I'm trying to process items in a concurrent queue using multiple threads while multiple threads can be putting items into it, the ideal solution would be to use Reactive Extensions with the concurrent data structures.
My original question is at:
While using ConcurrentQueue, trying to dequeue while looping through in parallel
So I am curious if there is any way to have a LINQ (or PLINQ) query that will continuously be dequeueing as items are put into it.
I am trying to get this to work in a way where I can have n number of producers pushing into the queue and a limited number of threads to process, so I don't overload the database.
If I could use Rx framework then I expect that I could just start it, and if 100 items are placed in within 100ms, then the 20 threads that are part of the PLINQ query would just process through the queue.
There are three technologies I am trying to get working together:
- Rx Framework (Reactive LINQ)
- PLINQ
- System.Collections.Concurrent structures
Drew is right; I think the ConcurrentQueue, even though it sounds perfect for the job, is actually the underlying data structure that the BlockingCollection uses. It seems very back-to-front to me too.
Check out chapter 7 of this book*
http://www.amazon.co.uk/Parallel-Programming-Microsoft-NET-Decomposition/dp/0735651590/ref=sr_1_1?ie=UTF8&qid=1294319704&sr=8-1
and it will explain how to use the BlockingCollection and have multiple producers and multiple consumers each taking off the "queue". You will want to look at the "GetConsumingEnumerable()" method and possibly just call .ToObservable() on that.
*the rest of the book is pretty average.
Edit:
Here is a sample program that I think does what you want:
class Program
{
    private static ManualResetEvent _mre = new ManualResetEvent(false);

    static void Main(string[] args)
    {
        var theQueue = new BlockingCollection<string>();

        theQueue.GetConsumingEnumerable()
            .ToObservable(Scheduler.TaskPool)
            .Subscribe(x => ProcessNewValue(x, "Consumer 1", 10000000));

        theQueue.GetConsumingEnumerable()
            .ToObservable(Scheduler.TaskPool)
            .Subscribe(x => ProcessNewValue(x, "Consumer 2", 50000000));

        theQueue.GetConsumingEnumerable()
            .ToObservable(Scheduler.TaskPool)
            .Subscribe(x => ProcessNewValue(x, "Consumer 3", 30000000));

        LoadQueue(theQueue, "Producer A");
        LoadQueue(theQueue, "Producer B");
        LoadQueue(theQueue, "Producer C");

        _mre.Set();

        Console.WriteLine("Processing now....");
        Console.ReadLine();
    }

    private static void ProcessNewValue(string value, string consumerName, int delay)
    {
        Thread.SpinWait(delay);
        Console.WriteLine("{1} consuming {0}", value, consumerName);
    }

    private static void LoadQueue(BlockingCollection<string> target, string prefix)
    {
        var thread = new Thread(() =>
        {
            _mre.WaitOne();
            for (int i = 0; i < 100; i++)
            {
                target.Add(string.Format("{0} {1}", prefix, i));
            }
        });
        thread.Start();
    }
}
I don't know how best to accomplish this with Rx, but I would recommend just using BlockingCollection<T> and the producer-consumer pattern. Your main thread adds items into the collection, which uses ConcurrentQueue<T> underneath by default. Then you have a separate Task that you spin up ahead of that which uses Parallel::ForEach over the BlockingCollection<T> to process as many items from the collection as makes sense for the system concurrently. Now, you will probably also want to look into using the GetConsumingPartitioner method of the ParallelExtensions library in order to be most efficient since the default partitioner will create more overhead than you want in this case. You can read more about this from this blog post.
When the main thread is finished you call CompleteAdding on the BlockingCollection<T> and Task::Wait on the Task you spun up to wait for all the consumers to finish processing all the items in the collection.

SubSonic2.2 SharedDbConnectionScope and TransactionScope Transaction Confusion

ARGH!!!
There seems to be a little confusion surrounding the SharedDbConnectionScope and TransactionScope objects used to wrap your SubSonic queries within a transaction.
The docs suggest placing the using SharedDbConnectionScope block around the using TransactionScope block...
using(SharedDbConnectionScope scope = new SharedDbConnectionScope())
{
    using(TransactionScope ts = new TransactionScope())
    {
        // do something

        ts.Complete();
    }
}
Then other questions here, such as "Subsonic: Using SharedDbConnectionScope together with TransactionScope seems to be broken", suggest the docs are wrong and the two objects should be the other way around...
using(TransactionScope ts = new TransactionScope())
{
    using(SharedDbConnectionScope scope = new SharedDbConnectionScope())
    {
        // do something

        ts.Complete();
    }
}
But looking into the source code, I am even more confused.
The SqlQuery.cs code file has a number of ExecuteTransaction overloads. For example...
public static void ExecuteTransaction(List<SqlQuery> queries)
{
    using(SharedDbConnectionScope scope = new SharedDbConnectionScope())
    {
        using(TransactionScope ts = new TransactionScope())
        {
            foreach(SqlQuery q in queries)
                q.Execute();
        }
    }
}
Umm... Interesting... It matches the docs, but... where's the ts.Complete() call?
How is that supposed to commit the transaction? As far as I can see it will always roll back. And it is the same for all the ExecuteTransaction overloads!
But here is the real kicker...
The TransactionWithDtcOffTests.cs file has some nice tests, except they have set up the SharedDbConnectionScope and TransactionScope the other way around!
using(TransactionScope ts = new TransactionScope())
{
    using(SharedDbConnectionScope connScope = new SharedDbConnectionScope())
    {
        // <snip />
    }
}
I haven't had the opportunity to run the tests for SubSonic 2.2, but I assume someone has and they passed.
Finally...
Can someone give me the definitive answer on how transactions in SubSonic 2.2 should be set up? Are the docs indeed wrong? Should the source for the ExecuteTransaction overloads and the tests be aligned to whichever way is actually correct?
The SharedConnectionScope (SCS) block must be inside a TransactionScope (TS).
The purpose of the SCS is to prevent escalating the transaction to the MSDTC if possible, so having the TS using block inside of a SCS using block makes little sense to me.
In any case, every TS block must have a Complete() call for the transaction to be committed.
Personally, I found that when using SQL 2005, SCS must be inside TS, and when using SQL 2000 (with MSDTC), SCS must wrap TS.
I hope this helps...
