Elasticsearch: pagination of continuously updating data

I am new to Elasticsearch.
I am capturing network traffic and indexing it into Elasticsearch.
When I search, I don't want to specify a fixed size. I learned that I have to use pagination, but I couldn't understand how it works and I couldn't find an example.
I need some advice.
Sorry about my English, and thanks for your advice.
My search code is written in Python.

Be aware that newly indexed data is not immediately visible to searches: an operation called refresh has to be performed first (the default value of refresh_interval is 1 second).
You can find some documentation about pagination in the Elasticsearch definitive guide.
If you have lots of data indexed continuously (e.g. log lines), you could use a scroll and filter the results to the last 5 minutes, for example (here is a simple example of this filtering). Then you can launch another scroll search for the next 5 minutes, and so on.
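As a rough sketch of the scroll-per-time-window idea above: the helpers below build a range query for one window and walk all of its scroll pages. The field name "@timestamp", the index name "traffic", and all function names are assumptions made for illustration. The client is expected to behave like the elasticsearch-py client, whose search() accepts scroll and size and whose scroll() accepts scroll_id and scroll. Newer client versions may prefer passing the query as keyword arguments instead of body.

```python
from datetime import datetime, timedelta


def window_query(field: str, start: datetime, end: datetime) -> dict:
    """Range query matching documents with start <= field < end."""
    return {
        "query": {
            "range": {field: {"gte": start.isoformat(), "lt": end.isoformat()}}
        },
        "sort": ["_doc"],  # index order; relevance ordering is not needed here
    }


def scroll_hits(client, index, body, page_size=500, keep_alive="2m"):
    """Yield every hit for `body`, fetching one scroll page at a time."""
    page = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        page = client.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)


def time_windows(start: datetime, end: datetime, step=timedelta(minutes=5)):
    """Split [start, end) into consecutive windows of at most `step`."""
    while start < end:
        yield start, min(start + step, end)
        start += step


# Hypothetical usage (needs a reachable cluster):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# for lo, hi in time_windows(t0, t1):
#     for hit in scroll_hits(es, "traffic", window_query("@timestamp", lo, hi)):
#         print(hit["_source"])
```

Using a half-open window (gte / lt) means that a document sitting exactly on a boundary is returned by only one of two adjacent windows.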

Related

Is there a better way to fetch table data with more than 20,000 entries?

I've recently been trying to use Selenium (with the Chrome driver) to get some data from a web page. Normally the table shows up to 30 entries per page across multiple pages, but I changed one of its arguments so that it can now show up to 30,000.
The problem is that when I use my code to fetch the data, it takes too long.
I split it into multiple pages with 2,000 entries per page, but it still takes too long.
This is the code I used to get the data.
It took roughly 3 to 5 minutes when I tried to get 1,000 entries.
texts = [t.text for t in driver.find_elements_by_xpath("//div[@class='datagrid_class']/div/table[@class='table1']/tbody/tr/td")]
I just want to check whether anyone has a better idea for this.
Thank you in advance for your kind advice!

You can use JavaScript to get the data much faster. Try the code below:
texts = driver.execute_script('return [...document.querySelectorAll("div.datagrid_class table.table1 tbody tr td")].map(e=>e.textContent)')
You can also find some more examples here and here.
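The script above returns one flat list of cell texts. If you need rows again, a small helper like the following could regroup them. This is only a sketch: the name rows_from_cells is made up, and it assumes every row of the table has the same number of cells, which you would have to check on the real page.

```python
def rows_from_cells(cells, columns):
    """Group a flat list of cell texts into rows of `columns` cells each."""
    if columns <= 0:
        raise ValueError("columns must be positive")
    return [cells[i:i + columns] for i in range(0, len(cells), columns)]


# Hypothetical usage with the `texts` list from the answer above:
# rows = rows_from_cells(texts, columns=5)
```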

YouTube API v3 search returns different results from the YouTube site

I'm trying to do a search with the v3 API using this URL:
https://www.googleapis.com/youtube/v3/search?part=id,snippet&channelId=UCtVd0c0tGXuTSbU5d8cSBUg&maxResults=10&order=date&q=game&key=[API_KEY]
but this returns only one playlist.
When I do this search directly on the YouTube site, it returns more results:
https://www.youtube.com/user/YouTubeDev/search?query=game
Why does this happen? Am I doing something wrong?

We ran into a similar issue when we tried to search for large amounts of content. This is especially evident if you use publishedAfter and publishedBefore to set the time range you're looking for to a very small range (say, 1 hour). Even when we got down to very small result sets (back when we tried it, you could only paginate around 20 times on the API using pageToken, so this was when our totalResults was less than 1,000), we were actually finding as few as 540 items.
We reached out to YouTube, and our contacts there confirmed that totalResults is just an estimate and is not actually accurate. You may get up to the number of items specified, but there is no guarantee that you will get exactly that. Your best bet is to capture as much as you can, and then scan for more data using a different time range.
Source: Reddit
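To illustrate scanning with different time ranges, here is a minimal sketch that splits a period into smaller windows and builds one search request URL per window using publishedAfter and publishedBefore. The function names and the window size are assumptions chosen for the example. It only builds the URLs and does not send the requests or follow pageToken.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"


def rfc3339(moment):
    """Format an aware datetime as an RFC 3339 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def windowed_search_urls(channel_id, query, start, end, step, api_key):
    """Build one search URL per time window of length `step` in [start, end)."""
    urls = []
    while start < end:
        stop = min(start + step, end)
        params = {
            "part": "id,snippet",
            "channelId": channel_id,
            "q": query,
            "maxResults": 50,
            "publishedAfter": rfc3339(start),
            "publishedBefore": rfc3339(stop),
            "key": api_key,
        }
        urls.append(SEARCH_URL + "?" + urlencode(params))
        start = stop
    return urls
```

If a window still reports more results than you can page through, you could shrink step for that window and try again.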

In the first one you are using the search->list method, which is searching for channels?
In the second one you are doing a playlist search inside the channel.
You can do the same on the API via playlists->list.
(Or, if you want the videos inside the channel directly, use videos->list.)

Might be a bug. If so and not yet filed, you can file it here: https://code.google.com/p/gdata-issues/issues/list?q=label%3aAPI-YouTube
The problem seems to be caused by the parameter order=date.
Adding order to the "YouTube query" (using channel) makes no difference: https://www.youtube.com/channel/UCtVd0c0tGXuTSbU5d8cSBUg/search?query=game&order=date . However, omitting order from the "API request" gives the same result as the site (6 items): https://www.googleapis.com/youtube/v3/search?part=id,snippet&channelId=UCtVd0c0tGXuTSbU5d8cSBUg&maxResults=10&q=game&key=YOUR-API-KEY-HERE
Note that when order=date is used in the API request, only 1 item is shown, while the same response shows "totalResults": 6 (which seems to be right). I did not try all values, but using order=relevance does not cause this problem.

Drupal, Solr & Facet API - Persistent facet links in blocks

I need to produce facet blocks from two vocabularies on my site. I am using Views and a patched version of Views Infinite Scroll to generate the search page from my search index, and I have tweaked everything I could in the facet display settings to try to produce the requested results, to no avail.
I do not need keyword searches. I need to show all taxonomy terms in each facet at all times, and to be able to select a single criterion at a time from each vocabulary. So, never more than one selection at a time from each facet block.
Why are you using Solr to store data and generate your search page if you do not need keyword search and are trying to go against the native workings of Solr facets, I hear you say? For performance reasons: that is why I am using Solr to store and serve the results. I have even gone as far as pushing rendered nodes to the index with the help of the somewhat obscure search_api_solr_view_modes module.
I could take two separate routes:
Create a custom block, load all the taxonomy terms, and alter the output of the term link to point to the view and provide the TID for the View. The active filter data could be obtained from the view arguments. I know how to do that but feel it is the wrong way to go about it: if I am working with Solr, I should be using a facet, not a custom block.
Build a custom Facet block that has this exact behaviour. After reading a lot of documentation, I got kind of discouraged about the possibility of doing this simply, without having to develop a Facet plugin, which is kind of out of my league.
Any advice is appreciated.
Here is a screenshot of the interface I have to produce.
http://imageshack.com/a/img834/9836/kr0i.png
Each taxonomy term has to be persistent, i.e., produce a link even if there are no nodes indexed under this term.
Selecting a term in one of the vocabularies will deselect previously selected terms.
Clicking on the x next to a term will remove it from the active search criteria.

Have a look at this: https://drupal.org/project/ajax_facets . This might get you to where you need to be, sans your infinite scroll. There is a YouTube video that goes with it: http://www.youtube.com/watch?v=pBj3OkXLyWs
I'd appreciate hearing whether it works, as I haven't tried it myself.

Solr behind Drupal returns too many results for specific query

We've got Solr sat behind one of our client's Drupal 7 websites, and while it's working well, it returns too many results for what should be quite specific queries. (It also has relevance/weighting problems; but I'm hoping that solving this problem will remove the - literally - irrelevant results.)
For example, searching for the phrase 'particular phrase in london' should return the node with that as its title, quite high up; I don't even think that any other content should be returned. But I find that it's returning lots of content, purely on the fact that it mentions "London"!
Frivolously, searching for the ridiculous phrase 'piecrusts in london' returns a lot of results too, apparently just because they mention London. No content on the site mentions actual piecrusts.
When I search for 'particular phrase in london', here are the parameters that end up in the catalina.out log on the server (whitespace added for clarity):
{spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1
&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR
&spellcheck.q=particular+phrase+in+london
&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0
&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200
&facet.date=ds_created
&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR
&f.bundle.facet.mincount=1&hl.fl=content,ts_comments
&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,
label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,
tos_name,tm_node,zs_entity
&start=0&facet.sort=count&f.bundle.facet.limit=50&q=special+phrase+in+london
&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR
&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0
&facet.field=im_field_health_topic&facet.field=bundle
&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50}
hits=1998 status=0 QTime=14
Note that these parameters have been built by Drupal's Apache Solr module; I don't believe we've got any particular custom code of our own that's doing anything to it.
This corresponds to the following URL, if entered directly in the browser:
http://example.com:8081/solr/CLIENT/select?spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR&spellcheck.q=particular+phrase+in+London&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200&facet.date=ds_created&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR&f.bundle.facet.mincount=1&hl.fl=content,ts_comments&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,tm_node,zs_entity&start=0&facet.sort=count&f.bundle.facet.limit=50&q=particular+phrase+in+London&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0&facet.field=im_field_health_topic&facet.field=bundle&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50
This URL returns nearly 2000 results - that's most of the content on the site! I've experimented with removing the query parameters one at a time, and the only ones that make any difference seem to be qf and q: if I remove qf, zero results; if I remove q, I get more results back!
I guess there are two questions here:
Is there anything in these parameters that tells Solr "don't worry if 'particular phrase', or 'piecrusts' appears: just collate the results for 'london'" and then order by relevancy? I would add that I think 'in' is mentioned in the stopwords file, so we can probably ignore the effect of that (?)
Or is this something in the (standard Drupal) schema that I need to change?
I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content.

As mentioned in the post at http://drupal.org/node/1783454, the Apache Solr Search Integration module makes use of the mm (minimum should match) param, which is more or less configured to affect rankings by how closely the keywords match the dataset. Looking through the docs, there are other ways you can use the parameter to affect rankings as well. Therefore the results produced by Apache Solr Search Integration are weighted more closely to the AND operator, even though it will return more results as you add more keywords. The benefit of this param is that in cases where the user enters keywords that are too restrictive, results will still be returned. Displaying no results is a really quick way to guide people away from your site.
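For experimenting with mm directly against Solr, something like the sketch below could help. It only builds a select URL, and it rests on assumptions: the function name is made up, defType=edismax is set explicitly because mm is a (e)dismax parameter and the handler defaults of your install are unknown here, and how the Drupal module itself sets or exposes mm is not shown. With mm=100%, every optional keyword clause has to match, which is the behaviour the client asked for.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, keywords, query_fields, min_should_match="100%"):
    """Build a Solr /select URL that sets the mm parameter explicitly."""
    params = [
        ("defType", "edismax"),     # assumption: query parsed by (e)dismax
        ("q", keywords),
        ("mm", min_should_match),   # "100%" = all keyword clauses must match
        ("wt", "json"),
    ]
    params += [("qf", field) for field in query_fields]
    return base_url + "/select?" + urlencode(params)


# Hypothetical usage with two of the qf values from the question:
# print(solr_select_url("http://example.com:8081/solr/CLIENT",
#                       "particular phrase in london",
#                       ["content^40", "label^21.0"]))
```

Comparing the hit count for mm=100% against the module's default should show whether mm is what lets the London-only documents through.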

How are you displaying the search?
Maybe you could use Solr Views to limit the search range?
http://drupal.org/project/apachesolr_views
thanks
Nick

Get Google Discussion search results

I would like to get the results retrieved by Google Discussion Search, like this. Notice the Discussion tab on the bottom left side.
I prefer to use Python and the Google Custom Search API, but I am not sure whether they support the Discussion search, so any option is welcome.

There does not appear to be any way to tailor the API results to fetch 'discussion' results, but the API can filter by file type, so why not try a q query looking for the RSS feeds produced by forums?
https://www.googleapis.com/customsearch/v1?key= your key &cx= your_id &q=diablo 3 forum RSS&fileType=rss&alt=json
Many forums tend to have a latest/last post feed, and monitoring these for changes should produce a good stream of data.
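Since the asker prefers Python, the request URL above could be assembled along these lines. This is only a sketch: the function name is made up and the key and engine id are placeholders. urlencode takes care of escaping the spaces in the query.

```python
from urllib.parse import urlencode


def custom_search_url(api_key, engine_id, query, file_type=None):
    """Build a Custom Search request URL with an optional fileType filter."""
    params = {"key": api_key, "cx": engine_id, "q": query, "alt": "json"}
    if file_type:
        params["fileType"] = file_type
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)


# Hypothetical usage; replace the placeholders with your own key and id:
# print(custom_search_url("YOUR_KEY", "YOUR_ID", "diablo 3 forum RSS", "rss"))
```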
