Azure-search document count issue - azure

We are using azure-search and are experiencing some strange issues. We have two environments (internal production and external production), both of which have their own separate index (IndexInternal and IndexExternal).
I am using the Red Dog search portal to check the document size and count (https://github.com/reddog-io/RedDog.Search.Portal).
What I have experienced is this:
Re-indexed internal production site and there are 14500 documents
Re-indexed external production site and there are 14500 documents
After 3 or 4 days I checked the indexes and the counts seem to have changed drastically:
Internal production site now has 24500 documents
External production site now has 24500 documents
When I query the indexes there doesn't seem to be anything unusual in there.
There is no way that 10000 documents have been pushed up since the re-index of the sites.
Has anyone any idea as to what could be happening here?
Thanks in advance,
Adam

The count shown by Azure Search is not updated in real time. Is it possible that you looked at the count soon after the documents were uploaded, and that we showed you a count based on the numbers we had at that moment? Also, please remember that it takes some time for your data to become searchable after you index the documents.
Liam
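One way to cross-check the figure a portal reports is to ask the service for the count directly. The sketch below builds a request against the Count Documents REST endpoint (`/docs/$count`). The service name, index name, key and API version are placeholders that are not from this thread, so adjust them to your own service.

```python
import urllib.request

# Assumption: replace with an API version your service supports.
API_VERSION = "2020-06-30"


def build_count_request(service, index, api_key, api_version=API_VERSION):
    """Build (but do not send) a GET request for an index's document count."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/$count?api-version={api_version}"
    )
    return urllib.request.Request(url, headers={"api-key": api_key})


def fetch_count(request):
    """Send the request; the response body is expected to be a plain integer."""
    with urllib.request.urlopen(request) as response:
        # utf-8-sig drops a byte-order mark if the service sends one.
        return int(response.read().decode("utf-8-sig").strip())
```

Calling `fetch_count(build_count_request("my-service", "my-index", "<key>"))` a few times over several minutes should show whether the number is still settling.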

Related

ElasticSearch - search statistic - like google analytics

I am looking into using ElasticSearch as a search engine for one of the projects I am working on.
There is still one thing which I need to find an answer for, and I hope someone in here can help.
The customer wants to be able to see some search statistics, like Google Analytics: most searched words, new search words and so on.
Is there a way to easily set up this type of search statistics? My idea is that ElasticSearch stores a search history of the search requests made to the REST API. Then my customer can use Kibana or some other visual tool to monitor the search history of ElasticSearch.
Hope someone can help me with an answer for this.
Regards
Jacob
You could lower the slow log threshold so that it captures all requests; however, this will produce large log files that require maintenance. Alternatively, you could write an application that handles all of your ES requests: it takes the search phrase, indexes it in a separate index (i.e. your search history index), and then deals with the actual request as normal, returning the response to the user.
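A minimal sketch of that second approach, assuming a client object that exposes `index(index=..., document=...)` and `search(index=..., query=...)`. The `InMemoryClient` stand-in, the `search-history` index name and the `content` field are all hypothetical and only there so that the example runs without a cluster.

```python
from datetime import datetime, timezone


class InMemoryClient:
    """Hypothetical stand-in for a real search client, for illustration only."""

    def __init__(self):
        self.indexed = []

    def index(self, index, document):
        self.indexed.append((index, document))

    def search(self, index, query):
        # A real client would return actual hits here.
        return {"hits": {"hits": []}}


def search_with_history(client, index, phrase, history_index="search-history"):
    """Record the search phrase in a separate index, then run the real search."""
    client.index(
        index=history_index,
        document={
            "phrase": phrase,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
    # Assumed query shape: a simple match on a field called "content".
    return client.search(index=index, query={"match": {"content": phrase}})
```

With the phrases stored as ordinary documents, a terms aggregation over the history index (or a Kibana visualisation on top of it) would give the "most searched" view.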

Solr - most frequent searched words

I'm trying to organize a solr search engine. I've already set up the misspelling system and the suggestions.
However I can't seem to find how to retrieve the top 10 most searched words/terms/keywords in solr/lucene. How can I get this? I want to display those on my homepage.
Solr does not provide this kind of feature out of the box. There is the StatsComponent, which provides you with all kinds of statistics, but all of those are numeric only.
Depending on how you access Solr (directly or via your own app) you could intercept all calls and log the query string. I did this in a recent project where I logged all queries to a database. If you submit all keywords to another core on your Solr server, you can run facet queries on your search terms as described by Hyque.
You could use a facet for retrieving the Top X words like this:
http://yourservergoeshere/solr/select?q=*&wt=xml&indent=true&facet=true&facet.query=*&facet.field=message&facet.limit=10&facet.mincount=1
The value of facet.field depends on the field you want to facet on. With facet.limit you'll (obviously) limit the number of results to 10. You'll find the facet results at the end of the response, starting with "facet_counts".
Edit: I really should go to bed earlier. I didn't see the "most searched" in your question. Sorry for that.
Apache Solr does not provide any such capability as of today. There is a desire for this and a JIRA ticket corresponding to it. You can vote for it if you'd like to see it in Solr some day: https://issues.apache.org/jira/browse/SOLR-10359.
The stats component provides statistics, but they are mostly numeric in nature. You could parse server logs and come up with a way to build a Frequently Searched Terms list (e.g. pump those logs into SiLK or Kibana for visualization).
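As a rough sketch of the log-parsing idea, the snippet below counts the `q` parameter across request log lines. It assumes each line contains a `params={...}` section holding the URL-encoded request parameters; check this against your own log format, and note that a query containing `}` would confuse this simple pattern.

```python
import re
from collections import Counter
from urllib.parse import parse_qs

# Assumed log layout: "... params={q=foo&wt=json} hits=3 status=0 QTime=1"
PARAMS_RE = re.compile(r"params=\{([^}]*)\}")


def top_queries(log_lines, limit=10):
    """Return the most frequent query strings as (query, count) pairs."""
    counts = Counter()
    for line in log_lines:
        match = PARAMS_RE.search(line)
        if not match:
            continue
        for query in parse_qs(match.group(1)).get("q", []):
            query = query.strip().lower()
            # Skip the match-all query; it is not a user search term.
            if query and query != "*:*":
                counts[query] += 1
    return counts.most_common(limit)
```

Feeding it the lines of a log file, for example `top_queries(open("solr.log"))`, gives a top-10 list that could be cached and shown on a homepage.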
If you can change the front end and add some JavaScript code to the UI, or can intercept the search request and make async or batch calls to APIs for tracking, you can use SearchStax Analytics, which provides search analytics that track searches, clicks, cart actions, revenue, etc.

The New Google Takeout

Is there a way to include one's search history within Google Takeout?
https://www.google.com/takeout/
Takeout purports to let you download everything stored within your Google account.
As far as I can tell, no. It's an obvious and mysterious gap in service.
You can download your recent Google searches via an RSS feed.
https://www.google.com/history/?output=rss
You can add parameters to the URL: num is the number of searches to return (the maximum is 1000) and start is how many searches back to begin from. Like so:
https://www.google.com/history/?output=rss&num=1000&start=4000
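To page through the feed in chunks, the URLs can be generated rather than typed by hand. This is only a sketch built on the num and start parameters described above; the helper name and the assumption that 1000 is the largest accepted num are mine, and the requests would still have to be made from a logged-in session.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/history/"
MAX_PER_REQUEST = 1000  # assumed upper bound for num, per the text above


def history_feed_urls(total, page_size=MAX_PER_REQUEST):
    """Build one feed URL per page needed to cover `total` searches."""
    page_size = min(page_size, MAX_PER_REQUEST)
    urls = []
    for start in range(0, total, page_size):
        params = {
            "output": "rss",
            "num": min(page_size, total - start),
            "start": start,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls
```

For example, `history_feed_urls(2500)` yields three URLs with start values 0, 1000 and 2000.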
Unfortunately, the feed becomes incomplete (as in not actually all of your searches) after a few thousand entries. I have over 40,000 searches on Google, but I can only go back 7000 on this RSS feed. Bummer. This means we still don't have access to all of the data they hold about us.
Please prove me wrong!
Today is the last day to delete your search history, as suggested by the EFF (https://www.eff.org/deeplinks/2012/02/how-remove-your-google-search-history-googles-new-privacy-policy-takes-effect), before the new Google terms come into force, linking that history with all the other Google products. So if you can't grab it today, delete it so that it is eventually partially anonymised, or be tentacularised.

Drupal search engine does not index my custom nodes!

About an hour ago somebody posted a question about the Drupal search engine that went roughly like this:
I know Drupal should index anything that is returned by node_view() but this is not happening for my custom content. Also: are there better alternatives to Drupal's built-in functionality?
As the question was removed while I was answering, and I didn't want to throw away 20 minutes of my life for nothing ;) I thought I would re-create the question. I hope this is fine by the rules of SO! :)
The Drupal search engine is probably not the most celebrated feature of Drupal, but is fairly solid, sophisticated and reliable. There are plenty of modules that enhance or substitute it but - at least in my experience - there is not a commonly accepted "better way" to manage searching and indexing.
However, for very big and busy sites people prefer to use external tools altogether, like a google searchbox or even dedicated software or hardware, like solr / lucene or google search appliance (GSA).
The link I provided above - however - sorts the search-related modules by descending usage statistics, so you will find the most commonly used ones on the first page. One that I personally like for English-language sites is the porter-stemmer, which indexes words by their stem (e.g. highness, highest and higher will all be returned as matches for the word "high").
That was the general information on search and Drupal. As for your problem, there are a number of things you could check to track it down:
Has your cron.php been executed lately? Indexing is done as part of the cron run, so if you do not have a crontab set or if you haven't executed it by hand, your node will likely not have been indexed yet.
Are the settings correct? Settings for the search module are located at http://example.com/admin/settings/search : is your minimum word length sufficient for your needs (the default is 3 letters)?
Has 100% of the site been indexed? (You can check that from the settings page.) If it has not, and running cron.php doesn't solve the matter, look further down.
Does a re-index solve the problem? Especially if you inserted data by means of SQL queries directly on the Drupal tables, chances are Drupal hasn't realised the content of the node has changed and therefore doesn't update the index.
Is the node you are trying to find visible? Search results for unpublished nodes, or for nodes that require higher-than-yours permissions to be viewed, are not returned, AFAIK.
As for the "stuck indexing", that happened to me once as well. It turned out that some PHP code within a node body triggered a PHP exception when the node was being indexed; as a result the indexing process halted and none of the following nodes were indexed either.
Hope this helps. Good luck!

How Prevent Google Duplicate Content Problem | Multi Site

I'm about to launch a set of multi-domain affiliate sites which have one thing in common: their content. Reading about the problem with duplicate content and Google, I'm a little worried that the parent domain or the sub-sites could get banned from the search engine for duplicated content.
If I have 100 sites with similar look and feel and basically same content with some minor element changes, how will I go on preventing banning, indexing these correctly?
Should I just prevent the sub-sites from being indexed completely with robots?
If so, how will people be able to find their site? I actually think the parent is the only one that should be indexed to avoid this, but I would love to hear other experts' thoughts.
Google have recently released an update that allows you to include a link tag in the head of pages that use duplicated content, pointing to the original version. They're called canonical links, and they exist for the exact reason you mention: to be able to use duplicated content without penalisation.
For more information look here:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
This doesn't mean that your sites with duplicated content will be ranked well for the duplicated content, but it does mean the original is "protected". For decent ranking of the duplicated sites you will need to provide unique content.
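For illustration, a small helper that a sub-site's page template could call to emit the canonical link tag pointing at the original page. The function name and the example domain are hypothetical; the tag itself follows the `rel="canonical"` form described in the linked post.

```python
from html import escape
from urllib.parse import urljoin


def canonical_link_tag(original_site, path):
    """Return a <link rel="canonical"> tag pointing at the original page."""
    href = urljoin(original_site, path)
    # Escape the URL so characters such as & or " cannot break the attribute.
    return f'<link rel="canonical" href="{escape(href, quote=True)}" />'


# Placed inside <head> on each duplicated page of a sub-site.
print(canonical_link_tag("https://www.example.com", "/products/widget"))
```

Each duplicated page should point at its own counterpart on the original site, not at the original site's homepage.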
"If I have 100 sites with similar look and feel and basically same content with some minor element changes, how will I go on preventing banning, indexing these correctly?"
Unfortunately for you, this is exactly what Google downgrades in its search listings, to make search results more relevant, and less rigged / gamed.
Fortunately for us (i.e. users of Google), their techniques generally work.
If you want 100s of sites, to be properly ranked, you'll need to make sure they each have unique content.
You won't get banned straight away. You will have to be reported by a person.
I would suggest launching with the duplicate content and then iterating over it in time, creating unique content that is dispersed across your network. This will ensure that not all sites are spammy copies of each other and will result in Google picking up the content as fresh.
I would say go ahead with it, but try to work in as much unique content as possible, especially where it matters most (page titles, headings, etc).
Even if the sites did get banned (more likely they would just have results omitted, but it is certainly possible they would be banned in your situation), you would be in basically the same spot as if you had decided to "noindex" all the sites.

Resources