Gridgain: Query running slow - gridgain

I'm using Gridgain 6.5.5 version with 3 nodes(on same host).
I'm performing the following tests:
Test 1:
Load data using GridDataLoader initially from a .csv file.
Run sample queries.
Test 2:
With the data loaded, I'm trying to run the same queries.
Even if I don't load the data, Gridgain is taking more time (> 10 min)to execute the query
The config file I'm using the example-cache.xml.
Is it because of some configuration errors?

Given the heapsize I guess GC pauses could be one of the reasons.

Related

Submitting multiple runs to the same node on AzureML

I want to perform hyperparameter search using AzureML. My models are small (around 1GB) thus I would like to run multiple models on the same GPU/node to save costs but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
script="train.py",
compute_target="gpu_cluster",
environment="env_name",
arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there but if this is done the run fails due to an MPI error that reads as if the cluster is configured to only allow one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run to one GPU and additionally providing a MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my train script to train multiple models in parallel, s.t. each run wraps multiple trainings. I would like to avoid this option though, because then logging and checkpoint writes become increasingly messy and it would require a large refactor of the train pipeline. Also this functionality seems so basic that I hope there is a way to do this gracefully. Any ideas?
Use Run.create_children method which will start child runs that are “local” to the parent run, and don’t need authentication.
For AMLcompute max_concurrent_runs map to maximum number of nodes that will be used to run a hyperparameter tuning run.
So there would be 1 execution per node.
single service deployed but you can load multiple model versions in the init then the score function, depending on the request’s param, uses particular model version to score.
or with the new ML Endpoints (Preview).
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs

How does Cassandra stress test determine threadcount?

I ran a Cassandra stress-test and the output came back to between 4 and 913 threadcounts. What causes Cassandra to increase and stop the threadcount?
When you use Cassandra Stress, I see these tests
First, Cassandra starts with a small amount of thread and displays the result, then it also raises the thread until (This number seems to depend on the cluster that Cassandra has attached to stress and allowed to connect. Given this parameter, thread are counted.)ends the test.
And in the end, the results of all the tests with the number of thread they used in testing them in
As you can see above, the system I tested on was able to run 32 threads and the test was completed with the same amount and the results were displayed.

spark embedded (same JVM) for <400k text file process

Can we do a mini spark job embedded in our app? Any example? Reason : want to process part of a file and give results quicker than submitting a regular job. File is only 500 lines. But do not want to keep 2 code bases - just the one that is used for the large files too. File size is less than an MB.
I want to process the file in the same JVM that my client code is running. Want a single executor kicked off from within same JVM, via a flag in the config. (so a few jobs will have this flag set and others wont. Those that do not will run as usual on the cluster.)

How to tune "spark.rpc.askTimeout"?

We have a spark 1.6.1 application, which takes input from two kafka topics and writes the result to another kafka topic. The application receives some large (approximately 1MB) files in the first input topic and some simple conditions from the second input topic. If the condition is satisfied, the file is written to output topic else held in state (we use mapWithState).
The logic works fine for less (few hundred) number of input files, but fails with org.apache.spark.rpc.RpcTimeoutException and recommendation is to increase spark.rpc.askTimeout. After increasing from default (120s) to 300s the ran fine longer but crashed with the same error after 1 hour. After changing the value to 500s, the job ran fine for more than 2 hours.
Note: We are running the spark job in local mode and kafka is also running locally in the machine. Also, some time I see warning "[2016-09-06 17:36:05,491] [WARN] - [org.apache.spark.storage.MemoryStore] - Not enough space to cache rdd_2123_0 in memory! (computed 2.6 GB so far)"
Now, 300s seemed large enough a timeout considering all local configuration. But any idea, how to come up to an ideal timeout value instead of just using 500s or higher based on testing, as I see crashed cases using 800s and cases suggesting to use 60000s?
I was facing the same problem, I found this page saying that under heavy workloads it is wise to set spark.network.timeout(which controls all the timeouts, also the RPC one) to 800. At the moment it solved my problem.

Running multiple Apache Nutch fetch map tasks on a Hadoop Cluster

I am unable to run multiple fetch Map taks for Nutch 1.7 on Hadoop YARN.
I am using the bin/crawl script and did the following tweaks to trigger a fetch with multiple map tasks , however I am unable to do so.
Added maxNumSegments and numFetchers parameters to the generate phase.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
Removed the topN paramter and removed the noParsing parameter because I want the parsing to happen at the time of fetch.
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
The generate phase is not generating more than one segment.
And as a result the fetch phase is not creating multiple map tasks, also I belive the script is written it does not allow the fecth to fecth multiple segemnts even if the generate were to generate multiple segments.
Can someone please let me know , how they go the script to run in a distributed Hadoop cluster ? Or if there is a different version of script that should be used?
Thanks.
Are you using Nutch 1.xx for this? In this case, the Generator class looks for a flag called "mapred.job.tracker" and tries to see if it is local. This property has been deprecated in Hadoop2 and the default value is set to local. You will have to overwrite the value of the property to something other than local and the Generator will generate multiple partitions for the segments.
I've recently faced this problem and thought it'd be a good idea to build upon Keith's answer to provide a more thorough explanation about how to solve this issue.
I've tested this with Nutch 1.10 and Hadoop 2.4.0.
As Keith said the if block on line 542 in Generator.java reads the mapred.job.tracker property and sets the value of variable numLists to 1 if the property is local. This variable seems to control the number of reduce tasks and has influence in the number of map tasks.
Overwriting the value of said property in mapred-site.xml fixes this:
<property>
    <name>mapred.job.tracker</name>
    <value>distributed</value>
</property>
(Or any other value you like except local).
The problem is this wasn't enough in my case to generate more than one fetch map task. I also had to update the value of the numSlaves parameter in the runtime/deploy/bin/crawl script. I didn't find any mentions of this parameter in the Nutch 1.x docs so I stumbled upon it after a bit of trial and error.
#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################
# set the number of slaves nodes
numSlaves=3
# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`
...

Resources