ElasticSearch Scroll API with multi threading - multithreading

First of all, I want to let you guys know that I know the basic work logic of how ElasticSearch Scroll API works. To use Scroll API, first, we need to call search method with some scroll value like 1m, then it will return a _scroll_id that will be used for the next consecutive calls on Scroll until all of the doc returns within loop. But the problem is I just want to use the same process on multi-thread basis, not on serially. For example:
If I have 300000 documents, then I want to process/get the docs this way
The 1st thread will process initial 100000 documents
The 2nd thread will process next 100000 documents
The 3rd thread will process remaining 100000 documents
So my question is as I didn't find any way to set the from value on scroll API how can I make the scrolling process faster with threading. Not to process the documents in a serialized manner.
My sample python code
if index_name is not None and doc_type is not None and body is not None:
es = init_es()
page = es.search(index_name,doc_type, scroll = '30s',size = 10, body = body)
sid = page['_scroll_id']
scroll_size = page['hits']['total']
# Start scrolling
while (scroll_size > 0):
print("Scrolling...")
page = es.scroll(scroll_id=sid, scroll='30s')
# Update the scroll ID
sid = page['_scroll_id']
print("scroll id: " + sid)
# Get the number of results that we returned in the last scroll
scroll_size = len(page['hits']['hits'])
print("scroll size: " + str(scroll_size))
print("scrolled data :" )
print(page['aggregations'])

Have you tried a sliced scroll? According to the linked docs:
For scroll queries that return a lot of documents it is possible to
split the scroll in multiple slices which can be consumed
independently.
and
Each scroll is independent and can be processed in parallel like any
scroll request.
I have not used this myself (the largest result set I need to process is ~50k documents) but this seems to be what you're looking for.

You should used sliced scroll for that, see https://github.com/elastic/elasticsearch-dsl-py/issues/817#issuecomment-372271460 on how to do it in python.

I met the same problem as yours, but the doc size is 1.4 million. I've had to use concurrency method and use 10 threads for data writting.
I wrote the code with Java thread pool, and you can find the similar way in Python.
public class ControllerRunnable implements Runnable {
private String i_res;
private String i_scroll_id;
private int i_index;
private JSONArray i_hits;
private JSONObject i_result;
ControllerRunnable(int index_copy, String _scroll_id_copy) {
i_index = index_copy;
i_scroll_id = _scroll_id_copy;
}
#Override
public void run(){
try {
s_logger.debug("index:{}", i_index );
String nexturl = m_scrollUrl.replace("--", i_scroll_id);
s_logger.debug("nexturl:{}", nexturl);
i_res = get(nexturl);
s_logger.debug("i_res:{}", i_res);
i_result = JSONObject.parseObject(i_res);
if (i_result == null) {
s_logger.info("controller thread parsed result object NULL, res:{}", i_res);
s_counter++;
return;
}
i_scroll_id = (String) i_result.get("_scroll_id");
i_hits = i_result.getJSONObject("hits").getJSONArray("hits");
s_logger.debug("hits content:{}\n", i_hits.toString());
s_logger.info("hits_size:{}", i_hits.size());
if (i_hits.size() > 0) {
int per_thread_data_num = i_hits.size() / s_threadnumber;
for (int i = 0; i < s_threadnumber; i++) {
Runnable worker = new DataRunnable(i * per_thread_data_num,
(i + 1) * per_thread_data_num);
m_executor.execute(worker);
}
// Wait until all threads are finish
m_executor.awaitTermination(1, TimeUnit.SECONDS);
} else {
s_counter++;
return;
}
} catch (Exception e) {
s_logger.error(e.getMessage(),e);
}
}
}

scroll must be synchronous, this is the logic.
You can use multi thread, this is exactly why elasticsearch is good for: parallelism.
An elasticsearch index, is composed of shards, this is the physical storage of your data. Shards can be on the same node or not (better).
Another side, the search API offers a very nice option: _preference(https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html)
So back to your app:
Get the list of index shards (and nodes)
Create a thread by shard
Do the scroll search on each thread
Et voilĂ !
Also, you could use the elasticsearch4hadoop plugin, which do exactly that for Spark / PIG / map-reduce / Hive.

Related

content script heavy calculation causes poor performance (chrome extension)

I have a simple chrome extension, and I'm trying to do some page analysis through the content.js. this is the code:
console.log("content.js running.."); //debug
var fromDOM = new XMLSerializer().serializeToString(document);
console.log(fromDOM)
var i = 0;
var item;
for (item in fromDOM) {
var x = fromDOM[item];
if (x == "/"){
i++;
console.log(i);
chrome.runtime.sendMessage({lala: i});
}
}
This code searches for any occurence of "/" in the page and sends a message to a background script (that currently does nothing).
This for loop alone causes any tab I load to load slower than usual.. affecting user performance.
What am I doing wrong here? I can't do heavy lifting on content.js scripts? or is there a better way I'm missing.
Assuming you want to process the current HTML of the page:
Use document.documentElement.innerHTML
Use string methods like indexOf to get the position of each / without enumerating the long HTML string character by character.
Accumulate all the positions in an array and send it in one message since sending of a message is an expensive operation that involves internal JSON.stringify+JSON.parse.
Don't use console.log when devtools is open as it does a lot of extra processing to format the messages. And generally prefer debugging interactively - there's a panel in devtools to inspect and set breakpoints in content scripts so you can debug the code, view the variables, and so on.
const html = document.documentElement.innerHTML;
const slashes = [];
let pos = -1;
do {
pos = html.indexOf('/', pos + 1);
if (pos >= 0) {
slashes.push(pos);
}
} while (pos >= 0);
chrome.runtime.sendMessage({lala: slashes});
Now your background listener will receive an array of character positions - not really useful per se, but that's just an example. You can put more info inside the array to make it more meaningful.

Sink to java list possible with Hazelcast Jet?

I have a list of accounts and perform a hashjoin on ticks and return the accounts with ticks data. But after hashjoin I have drainTo lListJet and then read it with DistributedStream and return it.
public List<Account> populateTicksInAccounts(List<Account> accounts) {
...
...
Pipeline p = Pipeline.create();
BatchSource<Tick> ticksSource = Sources.list(TICKS_LIST_NAME);
BatchSource<Account> accountSource = Sources.fromProcessor(AccountProcessor.of(accounts));
p.drawFrom(ticksSource)
.hashJoin(p.drawFrom(accountSource), JoinClause.joinMapEntries(Tick::getTicker), accountMapper())
.drainTo(Sinks.list(TEMP_LIST));
jet.newJob(p).join();
IListJet<Account> list = jet.getList(TEMP_LIST);
return DistributedStream.fromList(list).collect(DistributedCollectors.toIList());
}
Is it possible to drainTo to java List instead of lListJet after performing a hashjoin?
Something like below is possible?
IListJet<Account> accountWithTicks = new ArrayList<>();
p.drawFrom(ticksSource)
.hashJoin(p.drawFrom(accountSource), JoinClause.joinMapEntries(Tick::getTicker), accountMapper())
.drainTo(<CustomSinkProcessor(accountWithTicks)>);
return accountWithTicks;
where in CustomSinkProcessor will take empty java list and return with the accounts?
Keep in mind that the code you submit to Jet for execution runs outside the process where you submit it from. While it would be theoretically possible to provide the API you're asking for, under the hood it would just have to perform some tricks to run the code on each member of the cluster, let all members send their results to one place, and fill up a list to return to you. It would go against the nature of distributed computing.
If you think it will help the readability of your code, you can write a helper method such as this:
public <T, R> List<R> drainToList(GeneralStage<T> stage) {
String tmpListName = randomListName();
SinkStage sinkStage = stage.drainTo(Sinks.list(tmpListName));
IListJet<R> tmpList = jet.getList(tmpListName);
try {
jet.newJob(sinkStage.getPipeline()).join();
return new ArrayList<>(tmpList);
} finally {
tmpList.destroy();
}
}
Especially note the line
return new ArrayList<>(tmpList);
as opposed to your
IListJet<Account> list = jet.getList(TEMP_LIST);
return DistributedStream.fromList(list).collect(DistributedCollectors.toIList());
This just copies one Hazelcast list to another one and returns a handle to it. Now you have leaked two lists in the Jet cluster. They don't automatically disappear when you stop using them.
Even the code I provided can still be leaky. The JVM process that runs it can die during Job.join() without reaching finally. Then the temporary list lingers on.
No, it's not, due to the distributed nature of Jet. The sink will execute in multiple parallel processors (workers). It can't add to plain Collection. The sink has to be able to insert items on multiple cluster members.

Reading Real Time Table - Selenium

I'm trying to read a dynamic table, which is updated 1-3 times per second. I'm using Selenium, in Python 3.x, but if you have a solution for other languages I can work it out as well.
My question is: what is the best practice for reading frequently updated tables?
What I've tried:
driver.wait.until along with expected_conditions
re-read the table with a call to find_elements if a stale exception is thrown
Neither of them is working, due to the high refreshing rate. I can successfully retrieve the table for a moment, but when I try to access its rows the moment after, I get a stale exception. It's worth to say that when I try the same code in the same table when there are less frequent updates everything works fine.
I'm not posting any code for the moment, as I'd be interested in knowing what more experienced people do in this case.
My naive thinking: Being non-expert (but keen to learn) in web scraping nor in any web-related languages, I'd say that if this was a problem with dynamic data, I'd take a pointer or a reference to the actual table (and then looping dynamically on the rows). Is that possible in this framework?
We usually get stale element exception when the Webelement has been changed at present when compared to its attributes at the time of webelement's creation.
Let's say the intent is to print second data element in a table every seconds, our code looks like this, (Sorry for giving the code in Java)
//This will work if the page is static
WebElement element = driver.findElement(By.xpath("//td[2]"));
for(int i = 0; i< 10;i++)
{
System.out.println(element.getText());
Thread.sleep(1000);
}
To make this work for dynamic loading tables / refreshing tables we need to initiate the webelement before the each iteration something like this,
//This will work for dynamic content
WebElement element = null;
for(int i = 0; i< 10;i++)
{
element = driver.findElement(By.xpath("//td[2]"));
System.out.println(element.getText());
Thread.sleep(1000);
}
In the case, if you need to get the i'th cell value in a table, we can parameter the value inside the xpath such as,
//In this case we need the fifth cell value
int j = 5;
WebElement element = null;
for(int i = 0; i< 10;i++)
{
element = driver.findElement(By.xpath("//td["+j+"]"));
System.out.println(element.getText());
Thread.sleep(1000);
}
In the case if you need to have all five cell values,
WebElement element = null;
for(int i = 1; i<=5;i++)
{
element = driver.findElement(By.xpath("//td["+i+]"));
System.out.println(element.getText());
Thread.sleep(1000);
}
Just construct a loop accordingly.
Hope this helps you. Thanks.

Parallel.ForEach Ordered Execution

I am trying to execute parallel functions on a list of objects using the new C# 4.0 Parallel.ForEach function. This is a very long maintenance process. I would like to make it execute in the order of the list so that I can stop and continue execution at the previous point. How do I do this?
Here is an example. I have a list of objects: a1 to a100. This is the current order:
a1, a51, a2, a52, a3, a53...
I want this order:
a1, a2, a3, a4...
I am OK with some objects being run out of order, but as long as I can find a point in the list where I can say that all objects before this point were run. I read the parallel programming csharp whitepaper and didn't see anything about it. There isn't a setting for this in the ParallelOptions class.
Do something like this:
int current = 0;
object lockCurrent = new object();
Parallel.For(0, list.Count,
new ParallelOptions { MaxDegreeOfParallelism = MaxThreads },
(ii, loopState) => {
// So the way Parallel.For works is that it chunks the task list up with each thread getting a chunk to work on...
// e.g. [1-1,000], [1,001- 2,000], [2,001-3,000] etc...
// We have prioritized our job queue such that more important tasks come first. So we don't want the task list to be
// broken up, we want the task list to be run in roughly the same order we started with. So we ignore tha past in
// loop variable and just increment our own counter.
int thisCurrent = 0;
lock (lockCurrent) {
thisCurrent = current;
current++;
}
dothework(list[thisCurrent]);
});
You can see how when you break out of the parallel for loop you will know the last list item to be executed, assuming you let all threads finish prior to breaking. I'm not a big fan of PLINQ or LINQ. I honestly don't see how writing LINQ/PLINQ leads to maintainable source code or readability.... Parallel.For is a much better solution.
If you use Parallel.Break to terminate the loop then you are guarenteed that all indices below the returned value will have been executed. This is about as close as you can get. The example here uses For but ForEach has similar overloads.
int n = ...
var result = new double[n];
var loopResult = Parallel.For(0, n, (i, loopState) =>
{
if (/* break condition is true */)
{
loopState.Break();
return;
}
result[i] = DoWork(i);
});
if (!loopResult.IsCompleted &&
loopResult.LowestBreakIteration.HasValue)
{
Console.WriteLine("Loop encountered a break at {0}",
loopResult.LowestBreakIteration.Value);
}
In a ForEach loop, an iteration index is generated internally for each element in each partition. Execution takes place out of order but after break you know that all the iterations lower than LowestBreakIteration will have been completed.
Taken from "Parallel Programming with Microsoft .NET" http://parallelpatterns.codeplex.com/
Available on MSDN. See http://msdn.microsoft.com/en-us/library/ff963552.aspx. The section "Breaking out of loops early" covers this scenario.
See also: http://msdn.microsoft.com/en-us/library/dd460721.aspx
For anyone else who comes across this question - if you're looping over an array or list (rather than an IEnumberable ), you can use the overload of Parallel.Foreach that gives the element index to maintain original order too.
string[] MyArray; // array of stuff to do parallel tasks on
string[] ProcessedArray = new string[MyArray.Length];
Parallel.ForEach(MyArray, (ArrayItem,loopstate,ArrayElementIndex) =>
{
string ProcessedArrayItem = TaskToDo(ArrayItem);
ProcessedArray[ArrayElementIndex] = ProcessedArrayItem;
});
As an alternate suggestion, you could record which object have been run and then filter the list when you resume exection to exclude the objects which have already run.
If this needs to be persistent across application restarts, you can store the ID's of the already executed objects (I assume here the objects have some unique identifier).
For anybody looking for a simple solution, I have posted 2 extension methods (one using PLINQ and one using Parallel.ForEach) as part of an answer to the following question:
Ordered PLINQ ForAll
Not sure if question was altered as my comment seems wrong.
Here improved, basically remind that parallel jobs run in out of your control order.
ea printing 10 numbers might result in 1,4,6,7,2,3,9,0.
If you like to stop your program and continue later.
Problems alike this usually endup in batching workloads.
And have some logging of what was done.
Say if you had to check 10.000 numbers for prime or so.
You could loop in batches of size 100, and have a prime log1, log2, log3
log1= 0..99
log2=100..199
Be sure to set some marker to know if a batch job was finished.
Its a general aprouch since the question isnt that exact either.

Multi-threading on a foreach loop?

I want to process some data. I have about 25k items in a Dictionary. IN a foreach loop, I query a database to get results on that item. They're added as value to the Dictionary.
foreach (KeyValuePair<string, Type> pair in allPeople)
{
MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
MySqlDataReader reader2 = comd.ExecuteReader();
Dictionary<string, Dictionary<int, Log>> allViews = new Dictionary<string, Dictionary<int, Log>>();
while (reader2.Read())
{
if (!allViews.ContainsKey(reader2.GetString("src")))
{
allViews.Add(reader2.GetString("src"), reader2.GetInt32("time"));
}
}
reader2.Close();
reader2.Dispose();
allPeople[pair.Key].View = allViews;
}
I was hoping to be able to do this faster by multi-threading. I have 8 threads available, and CPU usage is about 13%. I just don't know if it will work because it's relying on the MySQL server. On the other hand, maybe 8 threads would open 8 DB connections, and so be faster.
Anyway, if multi-threading would help in my case, how? o.O I've never worked with (multiple) threads, so any help would be great :D
MySqlDataReader is stateful - you call Read() on it and it moves to the next row, so each thread needs their own reader, and you need to concoct a query so they get different values. That might not be too hard, as you naturally have many queries with different values of pair.Key.
You also need to either have a temp dictionary per thread, and then merge them, or use a lock to prevent concurrent modification of the dictionary.
The above assumes that MySQL will allow a single connection to perform concurrent queries; otherwise you may need multiple connections too.
First though, I'd see what happens if you only ask the database for the data you need ("SELECT src,time FROMlogsWHERE IP = '" + pair.Key + "' GROUP BY src") and use GetString(0) and GetInt32(1) instead of using the names to look up the src and time; also only get the values once from the result.
I'm also not sure on the logic - you are not ordering the log events by time, so which one is the first returned (and so is stored in the dictionary) could be any of them.
Something like this logic - where each of N threads only operates on the Nth pair, each thread has its own reader, and nothing actually changes allPeople, only the properties of the values in allPeople:
private void RunSubQuery(Dictionary<string, Type> allPeople, MySqlConnection con, int threadNumber, int threadCount)
{
int hoppity = 0; // used to hop over the keys not processed by this thread
foreach (var pair in allPeople)
{
// each of the (threadCount) threads only processes the (threadCount)th key
if ((hoppity % threadCount) == threadNumber)
{
// you may need con per thread, or it might be that you can share con; I don't know
MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", con);
using (MySqlDataReader reader = comd.ExecuteReader())
{
var allViews = new Dictionary<string, Dictionary<int, Log>>();
while (reader.Read())
{
string src = reader.GetString(0);
int time = reader.GetInt32(1);
// do whatever to allViews with src and time
}
// no thread will be modifying the same pair.Value, so this is safe
pair.Value.View = allViews;
}
}
++hoppity;
}
}
This isn't tested - I don't have MySQL on this machine, nor do I have your database and the other types you're using. It's also rather procedural (kind of how you would do it in Fortran with OpenMPI) rather than wrapping everything up in task objects.
You could launch threads for this like so:
void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
{
lock (allPeople)
{
const int threadCount = 8; // the number of threads
// if it takes 18 seconds currently and you're not at .net 4 yet, then you may as well create
// the threads here as any saving of using a pool will not matter against 18 seconds
//
// it could be more efficient to use a pool so that each thread takes a pair off of
// a queue, as doing it this way means that each thread has the same number of pairs to process,
// and some pairs might take longer than others
Thread[] threads = new Thread[threadCount];
for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
{
threads[threadNumber] = new Thread(new ThreadStart(() => RunSubQuery(allPeople, connection, threadNumber, threadCount)));
threads[threadNumber].Start();
}
// wait for all threads to finish
for (int threadNumber = 0; threadNumber < threadCount; ++threadNumber)
{
threads[threadNumber].Join();
}
}
}
The extra lock held on allPeople is done so that there is a write barrier after all the threads return; I'm not quite sure if it's needed. Any object would do.
Nothing in this guarantees any performance gain - it might be that the MySQL libraries are single threaded, but the server certainly can handle multiple connections. Measure with various numbers of threads.
If you're using .net 4, then you don't have to mess around creating the threads or skipping the items you aren't working on:
// this time using .net 4 parallel; assumes that connection is thread safe
static void RunQuery(Dictionary<string, Type> allPeople, MySqlConnection connection)
{
Parallel.ForEach(allPeople, pair => RunPairQuery(pair, connection));
}
private static void RunPairQuery(KeyValuePair<string, Type> pair, MySqlConnection connection)
{
MySqlCommand comd = new MySqlCommand("SELECT src,time FROM `logs` WHERE IP = '" + pair.Key + "' GROUP BY src", connection);
using (MySqlDataReader reader = comd.ExecuteReader())
{
var allViews = new Dictionary<string, Dictionary<int, Log>>();
while (reader.Read())
{
string src = reader.GetString(0);
int time = reader.GetInt32(1);
// do whatever to allViews with src and time
}
// no iteration will be modifying the same pair.Value, so this is safe
pair.Value.View = allViews;
}
}
The biggest problem that comes to mind is that you are going to use multithreading to add values to a dictionary, which isn't thread safe.
You'll have to do something like this to make it work, and you might not get that much of a benefit from implementing it this was as it still has to lock the dictionary object to add a value.
Assumptions:
There is a table People in your
database
There are alot of people in
your database
Each database query adds overhead you are doing one db query for each of the people in your database I would suggest it was faster to get all the data back in one query then to make repeated calles
select l.ip,l.time,l.src
from logs l, people p
where l.ip = p.ip
group by l.ip, l.src
Try this with a loop in a single thread, I belive this will be much faster then your existing code.
With in your existing code another thing you can do is to take the creation of the MySqlCommand out of the loop, prepare it in advance and just change the parameter. This should speed up execution of the SQL. see http://dev.mysql.com/doc/refman/5.0/es/connector-net-examples-mysqlcommand.html#connector-net-examples-mysqlcommand-prepare
MySqlCommand comd = new MySqlCommand("SELECT * FROM `logs` WHERE IP = ?key GROUP BY src", con);
comd.prepare();
comd.Parameters.Add("?key","example");
foreach (KeyValuePair<string, Type> pair in allPeople)
{
comd.Parameters[0].Value = pair.Key;
If you are using mutiple threads, each thread will still need there own Command, at lest in MS-SQL this would still be faster even if you recreated and prepared the statment every time, due to the ability for the SQL server to be able to cache the execution plan of a paramertirised statment.
Before you do anything else, find out exactly where the time is being spent. Check the execution plan of the query. The first thing I'd suspect is a missing index on logs.IP.
18 minutes for something like this seems much too long to me. Even if you can cut the execution time in eight by adding more threads (which is unlikely!) you still end up using more than 2 minutes. You could probably read the whole 25k rows into memory in less than five seconds and do the necessary processing in memory...
EDIT: Just to clarify, I'm not advocating actually doing this in memory, just saying that it looks like there's a bigger bottleneck here that can be removed.
I think if you are running this on a multi core machine you could gain benefits from multi threading.
However the way I would approach it is to first look at unblocking the thread you are currently using by making asynchronous database calls. The call backs will execute on background threads, so you will get some multi core benefit there and you won't be blocking threads waiting for the db to come back.
For IO intensive apps like this example sounds like you are likely to see improved throughput depending on what load the db can handle. Assuming the db scales to handle more than one concurrent request you should be good.
Thanks everyone for your help. Currently I am using this
for (int i = 0; i < 8; i++)
{
ThreadPool.QueueUserWorkItem(addDistinctScres, i);
}
ThreadPool to run all the threads. I use the method provided by Pete Kirkham, and I'm creating a new connection per thread.
Times went down to 4 minutes.
Next I'll make something wait for the callback of the threadpool? before performing other functions.
I think the bottleneck now is the MySQL server, because the CPU usage has drops.
#odd parity I thought about that, but the real thing is waaay more than 25k rows. Idk if that'd work.
This sound like the perfect job for map/reduce, i am not a .Net-programmer, but this seems like a reasonable guide:
http://ox.no/posts/minimalistic-mapreduce-in-net-4-0-with-the-new-task-parallel-library-tpl

Resources