How to implement Lucene .Net search on Azure webrole

How to implement Lucene .Net search on Azure webrole - azure

I'm using AzureDirectory and Lucene .NET 2.9.4 but I have wo problems:
searcher doesn't seems to be so fast. I'm indexing with these settings:
indexWriter.SetUseCompoundFile(false);
indexWriter.SetMergeFactor(1000);
index is around 3.5gb and it has 12.126.436 docs.
To create the indexSearcher it takes around 5 min or more even if index is already on local disk. Is the index too big? I tried to perform a single term search using MultiFieldQueryParser on two fields. TermVector on fields is off
Everywhere is suggested to create only an instance of indexSearcher and share it between queries (in fact it is slow to be created) but I don't know how to share the Searcher singleton (it is the class that perform the search) between various web requests. If I create the singleton on the webrole class, then how can I use that instance to perform the search? At this moment every web requests recreates the singleton.
Thanks a lot

I have actually used that exact version of Lucene.NET with AzureDirectory and it doesn't work well. AzureDirectory in my opinion is not written for production scale.
If you look at the source code for AzureDirectory, it is using:
older version of Lucene as a base (2.3x)
exceptions are thrown everywhere (hard to debug/catch the right ones in production)
it uses the old storage API (pre 1.8 version of the SDK)
I ended up creating my own dedicated Virtual Machine and using the .net 3.0.3 Lucene.Net library. Works like a champ in that environment, since I do not need to implement AzureDirectory.
You should have only ONE IndexWriter that is easy to implement with a storage queue. You can have multiple IndexReaders if you want to limit them write a IndexReader pool (like a SQL connection pool). I have multiple of those run fine with no exceptions flying around like they where with AzureDirectory.
My environment is a bit different lots of smaller indexes....not one massive one.

Maybe this is the AzureDirectory that people are talking about, maybe not - I tweaked this in order to get better performance. While I won't claim that it's production-grade and rock solid, it may help you over the AzureDirectory that you are currently using.
Hope it helps,

Related

arangodb 3.1 foxx docs?

I think that arangodb is presently the best nosql db and that foxx microservices are a great resource.
Alas, the related docs that comes with the 3.xxx version can help build only a minimalistic service.
Also, many apps you can find as examples in the arangodb store have been developed with deprecated tools (eg. controllers, repositories).
And while the wizard available in the web interface easily allows to create a new service, I don't understand why a new collection, prefixed with the mount point, has to be created. So a complete REST API is generated with a great documentation, but it is absolutely useless unless I change the name of an already existing collection. Why is that ???

The generator is meant as a quick boilerplate generator to allow you to build prototypes more easily. In practice it's not a great starting point for real-world projects (especially if you already have created collections manually) but if you just quickly need a REST API you can expand with your own logic it can come in handy.
As you've read the docs I'm sure you've followed this Getting Started guide: https://docs.arangodb.com/3/Manual/Foxx/GettingStarted.html
In it, the reasoning for prefixed vs non-prefixed collection names is given as such:
Because we have hardcoded the collection name, multiple copies of the service installed alongside each other in the same database will share the same collection. Because this may not always be what you want, the Foxx context also provides the collectionName method which applies a mount point specific prefix to any given collection name to make it unique to the service. It also provides the collection method, which behaves almost exactly like db._collection except it also applies the prefix before looking the collection up.
On the technical side the documentation for the Context#collection method further specifies what the method does:
Passes the given name to collectionName, then looks up the collection with the prefixed name.
The documentation for Context#collectionName:
Prefixes the given name with the collectionPrefix for this service.
And finally Context#collectionPrefix:
The prefix that will be used by collection and collectionName to derive the names of service-specific collections. This is derived from the service's mount point, e.g. /my-foxx becomes my_foxx.
So, yes, if you just want to use a collection shared by all your services the unprefixed version (using the db object directly) is the way to go. But this often encourages tight coupling between different services, defeating the purpose of having them as separate services in the first place and becomes problematic when you need multiple instances of the same service but don't want them to share data, so most examples encourage you to use the module.context.collection method instead.

Azure Mobile Services - complex processing

I am fairly new to Azure and mobile services, and all the examples and tutorials I can find for the table and API scripts are fairly simplistic.
If I have some processes that are fairly complex and rely on pulling information from many different tables and processing contingent on that data, should I be doing that somewhere other than the API scripts? I am new to node.js as well so maybe that's the problem but I was wondering if there is a more appropriate place for business logic, such as some bridge I need to add to my stack?

There are a lot of examples of how to use MSSql object which is used to query tables and Node in general available. A healthy search will reveal just about anything you need. Since you said you are new to Node.js consider using the .NET backend instead. It is based on Entity Framework and there are lots of Entity framework examples out there for you too. Finally, there are some really good examples of complex logic being used in the back ends in the sample code available: http://azure.microsoft.com/en-us/develop/mobile/ios-samples/ (pick your client OS) and here: http://azure.microsoft.com/blog/topics/mobile/ and here: http://blogs.msdn.com/b/azuremobile/
Let us know if you have specific questions!

Entity Framework migrations on legacy database

We have several legacy SQL Server databases that we occasionally make schema changes to. We currently have a utility written in C++ that allows users to update their DB's with these schema changes. The utility currently generates dynamic sql to create all DB objects. I am looking into redoing this and thought EF migrations might be a good way to go. I have read up a bit on the subject and I have a general idea of how it works. But I'm having a bit of a hard time figuring out how I would set it up to replace our current procedure (or if it is even possible). Currently, a client could be on any one of a number of previous versions. I'm assuming I would have to go back to the oldest possible version and create my model/initial migration from that, then generate incremental migrations for each version change in order to support updates from all versions. Is that a correct assumption? Also, currently our clients could be using sql server 2000, 2005, or 2008. Would this have any effect on how I would set things up (or if I even could)? Further, the goal is to create a utility with a (C# - probably WPF) UI that the user can use to manipulate the migrations (up or down, preferably). I've seen a lot of examples of how to manipulate migrations from command-line within package manager but not a lot of stuff on how to create a utility with a friendly UI for upgrading/downgrading DB's in production. Also, I have not seen anything that shows how to create stored procedures in a migration (our DBs rely on some stored procedures). I'm assuming that, if nothing else, I can use the Sql() method to generate a SQL query to create a SP. Is that correct? Is there a better way?
I know my questions are a bit non-specific and I apologize for that. But I'm still in the beginning processes of learning this and I'd like to get an idea of whether or not this is a good way to go. Any guidance would be greatly appreciated.
Thanks,
Dennis

Firstly, on SQL Server support, Entity Framework doesn't really support SQL Server 2000. See this question:
EntityFramework SQL Server 2000?
On the question of supporting all the multiple versions, you have the right idea about needing to generate an initial migration for the oldest version first then incrementally altering the model and generating migrations to support the later versions. This will be a pain as the migrations are opinionated about how they represent the model in the database and you will be doing a lot of messing about to end up with a model and a set of migrations that fully represent that. Specific concerns are indexes, column lengths, data types, stored procedures, triggers, functions, partitioning.
The Sql() function gets you around most issues, though also helpful in the migrations are functions like CreateIndex and AlterColumn.
For automating this, the migrations are definitely available as powershell cmdlets which are themselves just .Net objects so can be called programmatically.
As this question is a year old, I assume you will have made a decision on whether to do this. My opinion is that it is hard to see that it's worth the effort. If you were re-platforming the code base that uses this database to Entity Framework then it would make sense. Otherwise there are bound to be better tools out there for database version management. My first port of call would be Redgate.

Solr vs. ElasticSearch [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
What are the core architectural differences between these technologies?
Also, what use cases are generally more appropriate for each?

Update
Now that the question scope has been corrected, I might add something in this regard as well:
There are many comparisons between Apache Solr and ElasticSearch available, so I'll reference those I found most useful myself, i.e. covering the most important aspects:
Bob Yoplait already linked kimchy's answer to ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?, which summarizes the reasons why he went ahead and created ElasticSearch, which in his opinion provides a much superior distributed model and ease of use in comparison to Solr.
Ryan Sonnek's Realtime Search: Solr vs Elasticsearch provides an insightful analysis/comparison and explains why he switched from Solr to ElasticSeach, despite being a happy Solr user already - he summarizes this as follows:
Solr may be the weapon of choice when building standard search
applications, but Elasticsearch takes it to the next level with an
architecture for creating modern realtime search applications.
Percolation is an exciting and innovative feature that singlehandedly
blows Solr right out of the water. Elasticsearch is scalable, speedy
and a dream to integrate with. Adios Solr, it was nice knowing you. [emphasis mine]
The Wikipedia article on ElasticSearch quotes a comparison from the reputed German iX magazine, listing advantages and disadvantages, which pretty much summarize what has been said above already:
Advantages:
ElasticSearch is distributed. No separate project required. Replicas are near real-time too, which is called "Push replication".
ElasticSearch fully supports the near real-time search of Apache
Lucene.
Handling multitenancy is not a special configuration, where
with Solr a more advanced setup is necessary.
ElasticSearch introduces
the concept of the Gateway, which makes full backups easier.
Disadvantages:
Only one main developer [not applicable anymore according to the current elasticsearch GitHub organization, besides having a pretty active committer base in the first place]
No autowarming feature [not applicable anymore according to the new Index Warmup API]
Initial Answer
They are completely different technologies addressing completely different use cases, thus cannot be compared at all in any meaningful way:
Apache Solr - Apache Solr offers Lucene's capabilities in an easy to use, fast search server with additional features like faceting, scalability and much more
Amazon ElastiCache - Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory cache in the cloud.
Please note that Amazon ElastiCache is protocol-compliant with Memcached, a widely adopted memory object caching system, so code, applications, and popular tools that you use today with existing Memcached environments will work seamlessly with the service (see Memcached for details).
[emphasis mine]
Maybe this has been confused with the following two related technologies one way or another:
ElasticSearch - It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Apache Lucene.
Amazon CloudSearch - Amazon CloudSearch is a fully-managed search service in the cloud that allows customers to easily integrate fast and highly scalable search functionality into their applications.
The Solr and ElasticSearch offerings sound strikingly similar at first sight, and both use the same backend search engine, namely Apache Lucene.
While Solr is older, quite versatile and mature and widely used accordingly, ElasticSearch has been developed specifically to address Solr shortcomings with scalability requirements in modern cloud environments, which are hard(er) to address with Solr.
As such it would probably be most useful to compare ElasticSearch with the recently introduced Amazon CloudSearch (see the introductory post Start Searching in One Hour for Less Than $100 / Month), because both claim to cover the same use cases in principle.

I see some of the above answers are now a bit out of date. From my perspective, and I work with both Solr(Cloud and non-Cloud) and ElasticSearch on a daily basis, here are some interesting differences:
Community: Solr has a bigger, more mature user, dev, and contributor community. ES has a smaller, but active community of users and a growing community of contributors
Maturity: Solr is more mature, but ES has grown rapidly and I consider it stable
Performance: hard to judge. I/we have not done direct performance benchmarks. A person at LinkedIn did compare Solr vs. ES vs. Sensei once, but the initial results should be ignored because they used non-expert setup for both Solr and ES.
Design: People love Solr. The Java API is somewhat verbose, but people like how it's put together. Solr code is unfortunately not always very pretty. Also, ES has sharding, real-time replication, document and routing built-in. While some of this exists in Solr, too, it feels a bit like an after-thought.
Support: there are companies providing tech and consulting support for both Solr and ElasticSearch. I think the only company that provides support for both is Sematext (disclosure: I'm Sematext founder)
Scalability: both can be scaled to very large clusters. ES is easier to scale than pre-Solr 4.0 version of Solr, but with Solr 4.0 that's no longer the case.
For more thorough coverage of Solr vs. ElasticSearch topic have a look at https://sematext.com/blog/solr-vs-elasticsearch-part-1-overview/ . This is the first post in the series of posts from Sematext doing direct and neutral Solr vs. ElasticSearch comparison. Disclosure: I work at Sematext.

I see that a lot of folks here have answered this ElasticSearch vs Solr question in terms of features and functionality but I don't see much discussion here (or elsewhere) regarding how they compare in terms of performance.
That is why I decided to conduct my own investigation. I took an already coded heterogenous data source micro-service that already used Solr for term search. I switched out Solr for ElasticSearch then I ran both versions on AWS with an already coded load test application and captured the performance metrics for subsequent analysis.
Here is what I found. ElasticSearch had 13% higher throughput when it came to indexing documents but Solr was ten times faster. When it came to querying for documents, Solr had five times more throughput and was five times faster than ElasticSearch.

Since the long history of Apache Solr, I think one strength of the Solr is its ecosystem. There are many Solr plugins for different types of data and purposes.
Search platform in the following layers from bottom to top:
Data
Purpose: Represent various data types and sources
Document building
Purpose: Build document information for indexing
Indexing and searching
Purpose: Build and query a document index
Logic enhancement
Purpose: Additional logic for processing search queries and results
Search platform service
Purpose: Add additional functionalities of search engine core to provide a service platform.
UI application
Purpose: End-user search interface or applications
Reference article : Enterprise search

I have been working on both solr and elastic search for .Net applications.
The major difference what i have faced is
Elastic search :
More code and less configuration, however there are api's to change
but still is a code change
for complex types, type within types i.e nested types(wasn't able to achieve in solr)
Solr :
less code and more configuration and hence less maintenance
for grouping results during querying(lots of work to achieve in
elastic search in short no straight way)

I have created a table of major differences between elasticsearch and Solr and splunk, you can use it as 2016 update:

While all of the above links have merit, and have benefited me greatly in the past, as a linguist "exposed" to various Lucene search engines for the last 15 years, I have to say that elastic-search development is very fast in Python. That being said, some of the code felt non-intuitive to me. So, I reached out to one component of the ELK stack, Kibana, from an open source perspective, and found that I could generate the somewhat cryptic code of elasticsearch very easily in Kibana. Also, I could pull Chrome Sense es queries into Kibana as well. If you use Kibana to evaluate es, it will further speed up your evaluation. What took hours to run on other platforms was up and running in JSON in Sense on top of elasticsearch (RESTful interface) in a few minutes at worst (largest data sets); in seconds at best. The documentation for elasticsearch, while 700+ pages, didn't answer questions I had that normally would be resolved in SOLR or other Lucene documentation, which obviously took more time to analyze. Also, you may want to take a look at Aggregates in elastic-search, which have taken Faceting to a new level.
Bigger picture: if you're doing data science, text analytics, or computational linguistics, elasticsearch has some ranking algorithms that seem to innovate well in the information retrieval area. If you're using any TF/IDF algorithms, Text Frequency/Inverse Document Frequency, elasticsearch extends this 1960's algorithm to a new level, even using BM25, Best Match 25, and other Relevancy Ranking algorithms. So, if you are scoring or ranking words, phrases or sentences, elasticsearch does this scoring on the fly, without the large overhead of other data analytics approaches that take hours--another elasticsearch time savings.
With es, combining some of the strengths of bucketing from aggregations with the real-time JSON data relevancy scoring and ranking, you could find a winning combination, depending on either your agile (stories) or architectural(use cases) approach.
Note: did see a similar discussion on aggregations above, but not on aggregations and relevancy scoring--my apology for any overlap.
Disclosure: I don't work for elastic and won't be able to benefit in the near future from their excellent work due to a different architecural path, unless I do some charity work with elasticsearch, which wouldn't be a bad idea

If you are already using SOLR, remain stick to it. If you are starting up, go for Elastic search.
Maximum major issues have been fixed in SOLR and it is quite mature.

Imagine the use case:
A lot(100+) of small(10Mb-100Mb, 1000-100000 documents) search indexes.
They are using by a lot of applications (microservices)
Each application can use more than one index
Small by size index, yes. But huge load(hundreds search-requests per second) and requests are complex (multiple aggregations, conditions and so on)
Downtimes are not allowed
All of that is working years long, and constantly growing.
Idea to have individual ES instance per each index - is huge overhead in this case.
Based on my experience, this kind of use case is very complex to support with Elasticsearch.
Why?
FIRST.
The major problem is fundamental back compatibility disregard.
Breaking changes are so cool!
(Note: imagine SQL-server which require you to do small change in all your SQL-statements, when upgraded... can't imagine it. But for ES it's normal)
Deprecations which will dropped in next major release are so sexy!
(Note: you know, Java contain some deprecations, which 20+ years old, but still working in actual Java version...)
And not only that, sometimes you even have something which nowhere documented (personally came across only once but... )
So. If you want to upgrade ES (because you need new features for some app or you want to get bug fixes) - you are in hell. Especially if it is about major version upgrade.
Client API will not back compatible. Index settings will not back compatible.
And upgrade all app/services same moment with ES upgrade is not realistic.
But you must do it time to time. No other way.
Existing indexes is automatically upgraded? - Yes. But it not help you when you will need to change some old-index settings.
To live with that, you need constantly invest a lot of power in ... forward compatibility of you apps/services with future releases of ES.
Or you need to build(and anyway constantly support) some kind of middleware between you app/services and ES, which provide you back compatible client API.
(And, you can't use Transport Client (because it required jar upgrade for every minor version ES upgrade), and this fact do not make your life easier)
Is it looks simple & cheap? No, it's not. Far from it.
Continuous maintenance of complex infrastructure which based on ES, is way to expensive in all possible senses.
SECOND.
Simple API ? Well... no really.
When you is really using complex conditions and aggregations.... JSON-request with 5 nested levels is whatever, but not simple.
Unfortunately, I have no experience with SOLR, can't say anything about it.
But Sphinxsearch is much better it this scenario, becasue of totally back compatible SphinxQL.
Note:
Sphinxsearch/Manticore are indeed interesting. It's not Lucine based, and as result seriously different. Contain several unique features from the box which ES do not have and crazy fast with small/middle size indexes.

I have use Elasticsearch for 3 years and Solr for about a month, I feel elasticsearch cluster is quite easy to install as compared to Solr installation. Elasticsearch has a pool of help documents with great explanation. One of the use case I was stuck up with Histogram Aggregation which was available in ES however not found in Solr.

Add an nested document in solr very complex and nested data search also very complex. but Elastic Search easy to add nested document and search

I only use Elastic-search. Since I found solr is very hard to start.
Elastic-search's features:
Easy to start, very few setting. Even a newbie can setup a cluster step by step.
Simple Restful API which using NoSQL query. And many language libraries for easy accessing.
Good document, you can read the book: . There is a web version on official website.

mvc-mini-profiler - working with a load balanced web role (azure et al)

I believe that the mvc mini profiler is a bit of a 'God-send'
I have incorporated it in a new MVC project which is targeting the Azure platform.
My question is - how to handle profiling across server (role instance) barriers?
Is this is even possible?

I don't understand why you would need to profile these apps any differently. You want to profile how your app behaves on the production server - go ahead and do it.
A single request will still be executed on a single instance, and you'll get the data from that same instance. If you want to profile services located on a different physical tier as well, that would require different approaches; involving communication through internal endpoints which I'm sure the mini profiler doesn't support out of the box. However, the modification shouldn't be that complicated.
However, would you want to profile physically separated tiers, I would go about it in a different way. Specifically, profile each tier independantly. Because that's how I would go about optimizing it. If you wrap the call to your other tier in a profiler statement, you can see where the problem lies and still be able to solve it.

By default the mvc-mini-profiler stores and delivers its results using HttpRuntime.Cache. This is going to cause some problems in a multi-instance environment.
If you are using multiple instances, then some ways you might be able to make this work are:
to change the Http Cache to an AppFabric Cache implementation (or some MemCached implementation)
to use an alternative Storage strategy for your profile results (the code includes SqlServerStorage as an example?)
Obviously, whichever strategy you choose will require more time/resources than just the single instance implementation.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string