Reading Real Time Table - Selenium - python-3.x

I'm trying to read a dynamic table that is updated 1-3 times per second. I'm using Selenium with Python 3.x, but if you have a solution in another language I can adapt it as well.
My question is: what is the best practice for reading frequently updated tables?
What I've tried:
driver.wait.until along with expected_conditions
re-reading the table with a call to find_elements when a stale element exception is thrown
Neither works, because of the high refresh rate. I can successfully retrieve the table for a moment, but when I try to access its rows a moment later, I get a stale element exception. It's worth noting that when I run the same code on the same table while updates are less frequent, everything works fine.
I'm not posting any code for the moment, as I'd be interested in knowing what more experienced people do in this case.
My naive thinking: not being an expert (though keen to learn) in web scraping or in any web-related language, I'd say that if this were a problem with dynamic data, I'd take a pointer or reference to the actual table and then loop dynamically over its rows. Is that possible in this framework?

We usually get a stale element exception when the WebElement has changed relative to its attributes at the time the WebElement was located.
Let's say the intent is to print the second cell of a table once per second. Our code would look like this (sorry for giving the code in Java):
// This will work if the page is static
WebElement element = driver.findElement(By.xpath("//td[2]"));
for (int i = 0; i < 10; i++)
{
    System.out.println(element.getText());
    Thread.sleep(1000);
}
To make this work for dynamically loading or refreshing tables, we need to re-locate the WebElement before each iteration, something like this:
// This will work for dynamic content
WebElement element = null;
for (int i = 0; i < 10; i++)
{
    element = driver.findElement(By.xpath("//td[2]"));
    System.out.println(element.getText());
    Thread.sleep(1000);
}
If you need to get the i-th cell value in the table, we can parameterize the value inside the XPath, such as:
// In this case we need the fifth cell value
int j = 5;
WebElement element = null;
for (int i = 0; i < 10; i++)
{
    element = driver.findElement(By.xpath("//td[" + j + "]"));
    System.out.println(element.getText());
    Thread.sleep(1000);
}
And if you need all five cell values:
WebElement element = null;
for (int i = 1; i <= 5; i++)
{
    element = driver.findElement(By.xpath("//td[" + i + "]"));
    System.out.println(element.getText());
    Thread.sleep(1000);
}
Just construct a loop accordingly.
Hope this helps you. Thanks.
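For the Python 3 asker, here is a minimal sketch of the same re-locate-before-each-read idea, with a small retry in case the table refreshes between the find and the read. The URL, XPath, and retry count are illustrative assumptions, not taken from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
import time

driver = webdriver.Chrome()
driver.get("http://example.com/table")  # placeholder URL

for _ in range(10):
    # Re-locate the cell on every pass so we never hold on to a stale reference
    for attempt in range(3):
        try:
            cell = driver.find_element(By.XPATH, "//td[2]")
            print(cell.text)
            break
        except StaleElementReferenceException:
            # The table refreshed between the find and the read; look it up again
            continue
    time.sleep(1)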

Related

ElasticSearch Scroll API with multi threading

First of all, I want to let you know that I understand the basic logic of how the ElasticSearch Scroll API works: to use it, we first call the search method with some scroll value like 1m; it returns a _scroll_id that is then used for consecutive calls to scroll, in a loop, until all of the docs have been returned. The problem is that I want to run the same process on a multi-thread basis, not serially. For example:
If I have 300000 documents, then I want to process/get the docs this way:
The 1st thread will process the initial 100000 documents
The 2nd thread will process the next 100000 documents
The 3rd thread will process the remaining 100000 documents
So my question: since I didn't find any way to set a from value on the scroll API, how can I make the scrolling process faster with threading, rather than processing the documents serially?
My sample Python code:
if index_name is not None and doc_type is not None and body is not None:
    es = init_es()
    page = es.search(index_name, doc_type, scroll='30s', size=10, body=body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    # Start scrolling
    while scroll_size > 0:
        print("Scrolling...")
        page = es.scroll(scroll_id=sid, scroll='30s')
        # Update the scroll ID
        sid = page['_scroll_id']
        print("scroll id: " + sid)
        # Get the number of results returned in the last scroll
        scroll_size = len(page['hits']['hits'])
        print("scroll size: " + str(scroll_size))
        print("scrolled data:")
        print(page['aggregations'])
Have you tried a sliced scroll? According to the Elasticsearch docs:
For scroll queries that return a lot of documents it is possible to
split the scroll in multiple slices which can be consumed
independently.
and
Each scroll is independent and can be processed in parallel like any
scroll request.
I have not used this myself (the largest result set I need to process is ~50k documents) but this seems to be what you're looking for.
You should use a sliced scroll for that; see https://github.com/elastic/elasticsearch-dsl-py/issues/817#issuecomment-372271460 for how to do it in Python.
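A minimal sketch of a sliced scroll with the low-level elasticsearch-py client, assuming three slices and a match_all query; the index name and the process() handler are placeholders, and the exact client signature varies by version:
from threading import Thread
from elasticsearch import Elasticsearch

def scroll_slice(es, index, slice_id, max_slices):
    # Each slice is an independent scroll that can be consumed in parallel
    body = {
        "slice": {"id": slice_id, "max": max_slices},
        "query": {"match_all": {}},
    }
    page = es.search(index=index, scroll="2m", size=1000, body=body)
    sid = page["_scroll_id"]
    while page["hits"]["hits"]:
        for doc in page["hits"]["hits"]:
            process(doc)  # hypothetical per-document handler
        page = es.scroll(scroll_id=sid, scroll="2m")
        sid = page["_scroll_id"]
    es.clear_scroll(scroll_id=sid)

es = Elasticsearch()
threads = [Thread(target=scroll_slice, args=(es, "my-index", i, 3)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()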
I met the same problem, but with a doc size of 1.4 million, and I had to use a concurrent approach with 10 threads for the data writing.
I wrote the code with a Java thread pool; you can find a similar way in Python.
public class ControllerRunnable implements Runnable {
    private String     i_res;
    private String     i_scroll_id;
    private int        i_index;
    private JSONArray  i_hits;
    private JSONObject i_result;

    ControllerRunnable(int index_copy, String _scroll_id_copy) {
        i_index = index_copy;
        i_scroll_id = _scroll_id_copy;
    }

    @Override
    public void run() {
        try {
            s_logger.debug("index:{}", i_index);
            String nexturl = m_scrollUrl.replace("--", i_scroll_id);
            s_logger.debug("nexturl:{}", nexturl);
            i_res = get(nexturl);
            s_logger.debug("i_res:{}", i_res);
            i_result = JSONObject.parseObject(i_res);
            if (i_result == null) {
                s_logger.info("controller thread parsed result object NULL, res:{}", i_res);
                s_counter++;
                return;
            }
            i_scroll_id = (String) i_result.get("_scroll_id");
            i_hits = i_result.getJSONObject("hits").getJSONArray("hits");
            s_logger.debug("hits content:{}\n", i_hits.toString());
            s_logger.info("hits_size:{}", i_hits.size());
            if (i_hits.size() > 0) {
                int per_thread_data_num = i_hits.size() / s_threadnumber;
                for (int i = 0; i < s_threadnumber; i++) {
                    Runnable worker = new DataRunnable(i * per_thread_data_num,
                            (i + 1) * per_thread_data_num);
                    m_executor.execute(worker);
                }
                // Wait until all threads are finished
                m_executor.awaitTermination(1, TimeUnit.SECONDS);
            } else {
                s_counter++;
                return;
            }
        } catch (Exception e) {
            s_logger.error(e.getMessage(), e);
        }
    }
}
The scroll calls themselves must be synchronous; that is the logic.
You can use multiple threads, though; that is exactly what Elasticsearch is good for: parallelism.
An Elasticsearch index is composed of shards; this is the physical storage of your data. Shards can be on the same node or, better, on different nodes.
On the other side, the search API offers a very nice option: _preference (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html)
So back to your app:
Get the list of index shards (and nodes)
Create a thread by shard
Do the scroll search on each thread
Et voilà !
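A rough sketch of the first scroll call in Python; the scroll loop that follows would be the same as in the sliced-scroll sketch above. The index name and shard number are assumptions:
from elasticsearch import Elasticsearch

es = Elasticsearch()
# preference="_shards:<n>" pins this search, and the scroll that follows it,
# to a single shard; run one of these per shard, each in its own thread.
page = es.search(index="my-index", scroll="2m", size=1000,
                 preference="_shards:0",
                 body={"query": {"match_all": {}}})
sid = page["_scroll_id"]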
Also, you could use the elasticsearch4hadoop plugin, which does exactly that for Spark / Pig / MapReduce / Hive.

Remove a previously added and() clause from com.datastax.driver.core.querybuilder.Select.Where

for (int i = 0; i < mycolumns.length; i++)
{
    where.and(QueryBuilder.eq(COLNAME, mycolumns[i]));
    // how to remove the above and() call?
}
In every iteration of the loop, I want to execute the query and then substitute the value in the next loop iteration.
I'm not completely clear on what you are trying to accomplish. I am guessing that you are trying to update multiple rows sharing a primary key, one row at a time?
Unfortunately this isn't possible, since when you call where.and you are adding data to the Where object, and it returns a reference to that same Where object.
In short, Where is not immutable, and neither is the Statement it belongs to, so you won't get a new copy each time you call it; rather, you get an updated version of the same Where object.
What you could do is generate your Statement again (whether it be QueryBuilder.update, delete, or insert) in the loop, like:
for (int i = 0; i < mycolumns.length; i++) {
    Statement stmt = QueryBuilder.update("tableName")
            .where(eq("key", 1))
            .and(QueryBuilder.eq(COLNAME, mycolumns[i]));
    session.execute(stmt);
}

What is the most efficient method of using Request.UserLanguages to render a page based on the browser language?

I am making a page that pulls the user's preferred language from the browser via Request.UserLanguages, which returns a two-letter code (e.g. "en") or a detailed code (e.g. "en-GB").
I basically get the string of user languages (in order of preference) and store them in a string array. Then I loop to check whether the language code in the first position of that array matches any of the codes for a certain language (another hard-coded string array).
Is there a better way to do this? I'm noticing increased load time and am worried additional languages will further slow the page load...
if (!IsPostBack)
{
    // Possible language codes to check the client machine against
    String[] compJapaneseLang = { "ja-jp", "ja", "jp", "jpn", "euc", "shift-jis" };

    // Get the client machine's language preferences
    String[] userLang = Request.UserLanguages;

    // Loop through the possible Japanese language codes
    for (int i = 0; i < compJapaneseLang.Length; i++)
    {
        // IF JAPANESE
        if (userLang.GetValue(0).ToString().ToLowerInvariant()
                .Equals(compJapaneseLang.GetValue(i).ToString().ToLowerInvariant()))
            cc.JapeneseObject();
    }
}
Thanks!
Storing them in a list turned out best; there is not really much else one can do.

Dynamic data structures in C#

I have data in a database, and my code is accessing it using LINQ to Entities.
I am writing some software where I need to be able to create a dynamic script. Clients may write the scripts, but it is more likely that they will just modify them. A script would specify something like this:
Dataset data = GetDataset("table_name", "field = '1'");
if (data.Read())
{
    string field = data["field"];
    while (cway.Read())
    {
        // do some other stuff
    }
}
So the script above is going to read data from the database table called 'table_name' into a list of some kind, based on the filter I have specified, 'field = '1''. It is going to read particular fields and perform normal comparisons and calculations.
The most important thing is that this has to be dynamic: I can specify any table in our database and any filter, and I must then be able to access any field.
I am using a script engine that means the script I am writing has to be written in C#. Datasets are outdated and I would rather keep away from them.
Just to reiterate, I am not really set on keeping the above format, and I can define any methods I want behind the scenes for my C# script to call. The above could end up like this, for instance:
var data = GetData("table_name", "field = '1'");
while (data.ReadNext())
{
    var value = data.DynamicField;
}
Could I use reflection, for instance? But perhaps that would be too slow. Any ideas?
If you want to read a DataReader dynamically, it's a pretty easy step:
ArrayList al = new ArrayList();
SqlDataReader dataReader = myCommand.ExecuteReader();
if (dataReader.HasRows)
{
    while (dataReader.Read())
    {
        string[] fields = new string[dataReader.FieldCount];
        for (int i = 0; i < dataReader.FieldCount; ++i)
        {
            fields[i] = dataReader[i].ToString();
        }
        al.Add(fields);
    }
}
This returns an ArrayList of string arrays, each sized dynamically to however many fields the reader has.

Parallel.ForEach Ordered Execution

I am trying to execute parallel functions on a list of objects using the new C# 4.0 Parallel.ForEach function. This is a very long maintenance process. I would like to make it execute in the order of the list so that I can stop and continue execution at the previous point. How do I do this?
Here is an example. I have a list of objects: a1 to a100. This is the current order:
a1, a51, a2, a52, a3, a53...
I want this order:
a1, a2, a3, a4...
I am OK with some objects being run out of order, as long as I can find a point in the list where I can say that all objects before it have been run. I read the C# parallel programming whitepaper and didn't see anything about this, and there isn't a setting for it in the ParallelOptions class.
Do something like this:
int current = 0;
object lockCurrent = new object();

Parallel.For(0, list.Count,
    new ParallelOptions { MaxDegreeOfParallelism = MaxThreads },
    (ii, loopState) =>
    {
        // Parallel.For chunks the task list up, with each thread getting a chunk to work on...
        // e.g. [1-1,000], [1,001-2,000], [2,001-3,000] etc.
        // We have prioritized our job queue so that more important tasks come first, so we
        // don't want the task list broken up; we want it run in roughly the order we started
        // with. So we ignore the passed-in loop variable and just increment our own counter.
        int thisCurrent = 0;
        lock (lockCurrent)
        {
            thisCurrent = current;
            current++;
        }
        dothework(list[thisCurrent]);
    });
You can see how, when you break out of the parallel for loop, you will know the last list item to have been executed, assuming you let all threads finish before breaking. I'm not a big fan of PLINQ or LINQ; I honestly don't see how writing LINQ/PLINQ leads to maintainable or readable source code. Parallel.For is a much better solution.
If you use Parallel.Break to terminate the loop, then you are guaranteed that all indices below the returned value will have been executed. This is about as close as you can get. The example here uses For, but ForEach has similar overloads.
int n = ...
var result = new double[n];

var loopResult = Parallel.For(0, n, (i, loopState) =>
{
    if (/* break condition is true */)
    {
        loopState.Break();
        return;
    }
    result[i] = DoWork(i);
});

if (!loopResult.IsCompleted &&
    loopResult.LowestBreakIteration.HasValue)
{
    Console.WriteLine("Loop encountered a break at {0}",
        loopResult.LowestBreakIteration.Value);
}
In a ForEach loop, an iteration index is generated internally for each element in each partition. Execution takes place out of order, but after a break you know that all iterations lower than LowestBreakIteration will have been completed.
Taken from "Parallel Programming with Microsoft .NET" http://parallelpatterns.codeplex.com/
Available on MSDN. See http://msdn.microsoft.com/en-us/library/ff963552.aspx. The section "Breaking out of loops early" covers this scenario.
See also: http://msdn.microsoft.com/en-us/library/dd460721.aspx
For anyone else who comes across this question: if you're looping over an array or list (rather than an IEnumerable), you can use the overload of Parallel.ForEach that provides the element index, to maintain the original order too.
string[] MyArray; // array of stuff to do parallel tasks on
string[] ProcessedArray = new string[MyArray.Length];

Parallel.ForEach(MyArray, (ArrayItem, loopstate, ArrayElementIndex) =>
{
    string ProcessedArrayItem = TaskToDo(ArrayItem);
    ProcessedArray[ArrayElementIndex] = ProcessedArrayItem;
});
As an alternate suggestion, you could record which objects have been run, and then filter the list when you resume execution to exclude those objects.
If this needs to persist across application restarts, you can store the IDs of the already-executed objects (I assume here that the objects have some unique identifier).
For anybody looking for a simple solution, I have posted 2 extension methods (one using PLINQ and one using Parallel.ForEach) as part of an answer to the following question:
Ordered PLINQ ForAll
Not sure if the question was altered, as my earlier comment now seems wrong. Here is an improved answer. Basically, remember that parallel jobs run in an order outside of your control: printing 10 numbers might result in 1, 4, 6, 7, 2, 3, 9, 0.
If you would like to stop your program and continue later, problems like this usually end up as batched workloads, with some logging of what was done. Say you had to check 10,000 numbers for primality: you could loop in batches of size 100 and keep a log per batch, so log1 = 0..99, log2 = 100..199, and so on. Be sure to set some marker to know whether a batch job was finished. It's a general approach, since the question isn't that exact either.
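A language-agnostic sketch of that batching idea, in Python for brevity; the batch size and the load_finished_batches / do_work / mark_batch_finished helpers are hypothetical:
from concurrent.futures import ThreadPoolExecutor

items = list(range(10000))  # the work list, e.g. numbers to test for primality
BATCH = 100                 # assumed batch size

done = load_finished_batches()  # hypothetical: read batch markers written earlier
with ThreadPoolExecutor() as pool:
    for start in range(0, len(items), BATCH):
        if start in done:
            continue  # this batch completed in a previous run; skip it
        batch = items[start:start + BATCH]
        # Within a batch the execution order is uncontrolled; between batches it is
        # strict, so on restart "all batches before this marker are done" holds.
        results = list(pool.map(do_work, batch))  # do_work is a hypothetical worker
        mark_batch_finished(start, results)       # hypothetical: persist the marker/log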
