Lucene.NET 3 and Azure: unable to complete indexing

I'm trying to use the latest version of Lucene.NET (applied to my project via NuGet) in an Azure web role.
The original web application (MVC4) was built to run either in a traditional web host or in Azure: in the former case it uses a file-system based Lucene directory, writing the Lucene index into an *App_Data* subdirectory; in the latter it uses the AzureDirectory installed from NuGet (Lucene.Net.Store.Azure).
The documents being indexed either come from the web or from files uploaded locally, as some of the collections to index are closed and rather small. To start with, I am trying with one of these small closed sets, which counts about 1,000 files totaling a couple of GB.
When I index this set locally in my development environment, the indexing completes and I can successfully use it for searching. When instead I try indexing on Azure, it fails to complete, and I have no clue about the exact problem: I added both Elmah and NLog to log any problem, but nothing gets registered in Elmah or in the monitoring tools configured from the Azure console. Only once did I get an error from NLog: an out-of-memory exception thrown by the Lucene index writer at the end of the process, when committing the document additions. So I tried:
explicitly setting a very low RAM buffer size by calling SetRAMBufferSizeMB(10.0) on my writer.
committing multiple times, e.g. every 200 documents added.
removing any call to Optimize after the indexing completes (see also http://blog.trifork.com/2011/11/21/simon-says-optimize-is-bad-for-you/ on this).
targeting either the file system or the Azure storage.
scaling the web role VM up to the Large size.
Most of these attempts fail at different stages: sometimes the indexing stops after 100-200 documents, other times it gets up to 800-900; when I'm lucky, it even completes. That has happened only with the file system, and never with Azure storage: I have never managed to complete an indexing run against it.
The essential part of my Lucene code is very simple:
IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
writer.SetRAMBufferSizeMB(10.0);
where directory is an instance of FSDirectory or AzureDirectory, depending on the test being executed. I then add documents with their fields (using UpdateDocument, as one of the fields represents a unique ID). Once finished, I call writer.Dispose(). If required by the test, I call writer.Commit() several times before the final Dispose; this usually helps the process get further before hitting the memory exception.
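In outline, the add/commit loop I am describing boils down to something like the sketch below (GetDocuments, the "id" field name and the batch size of 200 are placeholders standing in for my real code):

int count = 0;
foreach (Document doc in GetDocuments())  // placeholder for the real document source
{
    // one field holds a unique ID, so UpdateDocument replaces any earlier version of the same document
    writer.UpdateDocument(new Term("id", doc.Get("id")), doc);
    if (++count % 200 == 0)
    {
        writer.Commit();  // flush pending additions every 200 documents (only in some tests)
    }
}
writer.Dispose();  // final commit and close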
Could anyone suggest a hint to help me get the indexing to complete?

The error seems to hold the key: Lucene is running out of memory while indexing.
From my perspective you have two options:
Allocate more memory to the RAM buffer, which actually improves your performance (refer to the Lucene documentation on the subject) or,
Reduce the number of documents between each commit.
You could try unit testing your indexing job under several different configurations (more RAM vs. less documents) until you come up with a suitable combination for your application.
On the other hand, if the problem is specific to the Azure side, you might want to try using a local file cache instead of a RAM cache.
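For the local file cache idea, AzureDirectory can be given a cache Directory when it is constructed. A minimal sketch, assuming the (CloudStorageAccount, catalog, cacheDirectory) overload of the Lucene.Net.Store.Azure package (verify it against your version), with connectionString, localCachePath, "MyCatalog" and analyzer as placeholders you would supply yourself:

CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
// cache index segments on the role's local disk instead of in RAM
Lucene.Net.Store.Directory cache = FSDirectory.Open(new DirectoryInfo(localCachePath));
AzureDirectory azureDirectory = new AzureDirectory(account, "MyCatalog", cache);  // constructor signature assumed, check your package version
IndexWriter writer = new IndexWriter(azureDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
writer.SetRAMBufferSizeMB(32.0);  // keep the RAM buffer explicit and modest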

Related

Will filtering using a filepath wildcard in Azure cause storage usage to increase?

Azure experts: I am a DS (data scientist) who recently took over managing the pipelines after our DE (data engineer) left, so I'm quite new to Azure. I recently ran into an issue where our Azure storage usage increased quite significantly. I found some clues, but I'm now stuck and would be really grateful if an expert could help me through it.
Looking into the usage, the increase seems to be mainly due to a significant rise in transactions and ingress for the ListFilesystemDir API.
I traced the usage profile back to the times when our pipelines run and located the main usage in a Copy Data activity, which copies data from a folder in ADLS to a SQL DW staging table. The amount of data actually transferred is only tens of MB, similar to before, so that should not be the reason for the substantial increase. The main difference I found is in the transactions and ingress of the ListFilesystemDir API, which show an ingress volume of hundreds of MB and ~500k transactions during the pipeline run. The surprising thing is that the usage of ListFilesystemDir was very small before (tens of kB and ~1k transactions).
[Screenshot: recent storage usage of the ListFilesystemDir API]
Looking at the pipeline, I notice it uses a filepath wildcard to filter and select all the most recently updated files within a higher-level folder.
[Screenshot: pipeline]
My questions are:
Is there any change to the usage/limits/definition of the ListFilesystemDir API that could cause such an issue?
Does applying a wildcard path filter mean Azure needs to list all the files under the root folder and then filter, so that it leads to a huge number of transactions (see the sketch below for what I imagine is happening)? Although it didn't seem to cause this issue before, has there been any change to how this wildcard filter uses the API?
Any suggestions for solving the problem?
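To illustrate what I mean in the second question, my mental model of the wildcard filter is roughly the sketch below. This is only a hypothetical illustration using the Azure.Storage.Files.DataLake SDK, not the actual pipeline code; serviceClient, the container and folder names and the .csv check are all placeholders:

DataLakeFileSystemClient fileSystem = serviceClient.GetFileSystemClient("my-container");
// a wildcard such as root-folder/*/part-*.csv cannot be resolved server-side,
// so every path under the folder gets listed (one directory-listing call per page of results)
foreach (PathItem item in fileSystem.GetPaths("root-folder", recursive: true))
{
    if (item.IsDirectory != true && item.Name.EndsWith(".csv"))
    {
        // only at this point is the wildcard-style filter applied, on the client side
    }
}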
Thank you very much!

How to reduce used storage in an Azure Function running a container

A little bit of background first. Maybe, hopefully, this can save someone else some trouble and frustration. Skip down to the TL;DR for the actual question.
We currently have a couple of genetics workflows related to gene sequencing running in Azure Batch. Some of them are quite light, and I'd like to move them to an Azure Function running a Docker container. For this purpose I created a Docker image, based on the azure-functions image, containing Anaconda with the necessary packages to run our most common, lighter workflows. My initial attempt produced a huge image of ~8GB. Moving to Miniconda and a couple of other adjustments reduced the image size to just shy of 3.5GB. Still quite large, but it should be manageable.
For this function I created an Azure Function running on an App Service Plan on the P1V2 tier, in the belief that I would have 250GB of storage to work with, as stated in the tier description:
After a couple of fixes I ran into issues loading my first image (the large one), where the log indicated that there was no space left on the device. This puzzled me, since the quota stated that I'd used some 1.5MB of the 250GB total. At this point I reduced the image size and could at least deploy the image successfully again. After enabling SSH support, I logged in to the container via SSH and ran df -h.
Okay, so the function does not have the advertised 250GB of storage available at runtime. It only has about 34GB. I spent some time searching the documentation but could not find anything indicating that this should be the case. I did find this related SO question, which clarified things a bit. I also found this still-open issue on the Azure Functions GitHub repo. It seems more people are running into the same issue and are not aware of the local storage limitation of the SKU. I might have overlooked something, so if this is in fact documented, I'd be happy if someone could point me to it.
Now, the reason I need some storage is that I need to fetch the raw data file, which can be anything from a handful of MBs to several GBs, and the workflow then produces multiple files varying between a few bytes and several GBs. The intention was, however, not to store these on the function instance but to complete the workflow and then store the resulting files in blob storage.
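For the upload step itself the files never need to stay on the instance for long. A minimal sketch of what I have in mind, using the Azure.Storage.Blobs SDK (the connection string, container and file names are placeholders; the same pattern exists in the Python SDK the workflows use):

// stream a finished result file straight into blob storage, then delete the local copy
BlobContainerClient container = new BlobContainerClient(connectionString, "workflow-results");
container.CreateIfNotExists();
BlobClient blob = container.GetBlobClient("run-0001/output.vcf");
using (FileStream stream = File.OpenRead("/tmp/output.vcf"))
{
    blob.Upload(stream, overwrite: true);  // uploaded in blocks, no second local copy needed
}
File.Delete("/tmp/output.vcf");  // free the local storage as soon as the upload succeeds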
TL;DR
You do not get the advertised storage capacity on the local instance for functions running on an App Service Plan. You get around 20/60/80GB depending on the SKU.
I need 10-30GB of local storage temporarily until the workflow has finished and the resulting files can be stored elsewhere.
How can I reduce the spent storage on the local instance?
Finally, the actual question. You might have noticed on the screenshot from the df -h command that of the available 34GB, a whopping 25GB is already used, which leaves 7.6GB to work with. I already mentioned that my image is ~3.5GB in size. So how come a total of 25GB is used, and is there any chance at all to reduce this, aside from shrinking my image? That being said, even if I removed my image completely (freeing 3.5GB of storage) it would still not be quite enough. Maybe the function simply needs over 20GB worth of stuff to run?
Note: It is not a result of cached docker layers or the like since I have tried scaling the app service plan which clears the cached layers/images and re-downloads the image.
Moving up a tier gives me 60GB of total available storage on the instance, which is enough, but it feels like overkill when I don't need the rest of what that tier offers.
Attempted solution 1
One thing I have tried, which might be of help to others, is mounting a file share on the function instance. This can be done with very little effort, as shown in the MS docs. Great, now I could write directly to a file share, saving me some headache, and finally move on. Or so I thought. While this mostly worked great, it still threw an exception at some point indicating that it had run out of space on the device, leading me to believe that it may be using local storage as temporary storage, a buffer, or something similar. I will continue looking into it and see if I can figure that part out.
Any suggestions or alternative solutions will be greatly appreciated. I might just decide to move away from Azure Functions for this specific workflow. But I'd still like to clear things up for future reference.
Thanks in advance
niknoe

Azure App Service - Running Solr on Jetty - LockObtainFailedException after Azure maintenance

I'm running a single (not scaled) Solr instance on an Azure App Service.
The App Service runs Java 8 and a Jetty 9.3 container.
Everything works really well, but when Azure decides to swap to another VM, sometimes the JVM doesn't seem to shut down gracefully and we run into issues.
One of the reasons for Azure to swap to another VM is infrastructure maintenance: for example, Windows updates are installed and your app is moved to another machine.
To prevent downtime, Azure spins up the new app and, when it's ready, swaps over to it. That seems fine, but it does not appear to work well with Solr's locking mechanism.
We are using the default native lockType, which should be fine since we're only running a single instance. Solr should remove the write.lock file during shutdown, but this does not always seem to happen.
The Azure Diagnostics tools clearly show this event happening:
And the memory usage shows both apps:
During the start of the second instance, Solr tries to lock the index, but this is not possible because the first one is still using it (it also holds the write.lock file). Sometimes the first instance doesn't remove the write.lock file, and this is where the problems start.
The second solr instance will never work correctly without manual intervention (manually deleting the write.lock file).
The solr logs:
Caused by: org.apache.solr.common.SolrException: Index dir 'D:\home\site\wwwroot\server\solr\****\data\index/' of core '*****' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: native
and
org.apache.lucene.store.LockObtainFailedException
What can be done about this?
I was thinking of changing the lockType to a memory-based lock, but I'm not sure if that would work because both instances are alive at the same time during a short period of time.
You could try and set WEBSITE_DISABLE_OVERLAPPED_RECYCLING=1
Overlapped recycling makes it so that before the current instance of an app is shut down, a new instance starts. It can in some cases cause file locking issues, in which case you can try turning it off:
Reference
If you would like to run Solr without any locks at all, you can do so by specifying <lockType>none</lockType> in your solrconfig.xml instead of the usual <lockType>native</lockType>.
Obviously, you need to be careful with this mechanism, since different Solr instances could try to change the index at the same time, which could lead to corruption.
All available lock types are listed there

How to share Azure Redis Cache between environments?

We want to save a few bucks and share our 1GB dedicated Azure Redis Cache between Development, Test, QA and maybe even production.
Is there a better way than prefixing all keys with an environment string like "Dev_[key]", "Test_[key]", etc.?
We are using the StackExchange Redis client for .NET.
PS: We tried using the cheap 250MB cache (shared infrastructure), but had very slow performance. Read operations were consistently between 600-800ms without any load (for a ~300KB object). Upgrading to the dedicated 1GB service changed that to 30-40ms. See more here: StackExchange.Redis with Azure Redis is unusably slow or throws timeout errors
One approach is to use multiple Redis databases. I'm assuming this is available in your environment :)
Some advantages over prefixing your keys might be:
data is kept separate, you can flushdb in test and not touch the production data
keys are smaller and consume less memory
The main disadvantage would be not taking advantage of multiple cores, like you could do if you ran multiple instances of Redis on the same server. Obviously not an issue in this case. Also note that this feature is not deprecated, like one of the answers suggests.
Another thing I've seen people complain about is that databases are numbered, they don't have meaningful names. Some people create a hash in database 0 that maps each number to a name.
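For what it's worth, with the StackExchange.Redis client both approaches are one-liners. A sketch, assuming the usual Azure Redis connection string format (host name, access key and database numbers here are placeholders):

// one shared multiplexer; the database number keeps environments apart
ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(
    "mycache.redis.cache.windows.net:6380,password=<access key>,ssl=True,abortConnect=False");

IDatabase devDb = redis.GetDatabase(1);   // Development
IDatabase testDb = redis.GetDatabase(2);  // Test / QA

// or, in combination, transparent key prefixing via StackExchange.Redis.KeyspaceIsolation
IDatabase prefixed = redis.GetDatabase(0).WithKeyPrefix("Dev_");
prefixed.StringSet("user:42", "cached value");  // stored under the key "Dev_user:42"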
Here is another idea to save some bucks: use separate Redis cache instances for each environment - no problems with the keys - but stop them when you are not using them, such as on weekends and during the night. Probably more than 50% of the time they are not in use. I think it would be easy to start and stop them with a PowerShell script; we are using AWS and it is possible there.
From what I can see, Redis persistence in Azure is not enabled yet, but they have started working on it (http://feedback.azure.com/forums/169382-cache/status/191763) - it would be nice to take an RDB snapshot before stopping and then load it again on start. So if you need to save some values and reload them on start, you have to do it manually (with your own service).

Creating a sub site in SharePoint takes a very long time

I am working on a MOSS 2007 project and have customized many parts of it. There is a problem on the production server where it takes a very long time (more than 15 minutes, sometimes failing due to timeouts) to create a sub site (even with the built-in site templates), while on the development server it only takes 1 to 2 minutes.
Both servers have the same configuration, with 8 CPU cores and 8GB of RAM, and each uses a separate database server with the same configuration. The content database size is around 100GB, and there are more than a hundred sub sites.
What could be the reason it takes so much time on the production server? Is there any configuration or anything else I need to take care of?
Update:
Today I had the chance to check the environment with my clients, but site creation was fast this time, even though they said they hadn't changed any configuration on the server.
I also used the chance to examine the database. The disk fragmentation was quite high, at 49%, so I suggested they run a defrag. I also asked for the database file growth to be increased to 100MB, up from the default 1MB.
So my suspicion is that some processes were previously running heavily on the server, and that's why it took so much time.
Update 2:
Yesterday my client reported that site creation was slow again, so I went to check. When I looked at the database, I found that the content db is actually only around 30GB instead of the reported 100GB, so it's still far below the recommended size.
One thing that got my attention is that the site collection recycle bin was holding almost 5 million items, and whenever I tried to browse it, it took a long time to open and the whole site collection became inaccessible.
Since the web application settings are at the defaults (30 days before cleanup, and 50% size for the second-stage recycle bin), is this normal, or is it also a potential problem?
Actually, there is also another web application using the same database server with a 100GB content db, and it's always fast, but the one with 30GB is slow. Both have the same setup, just different data.
What should I check next?
Yes, it's normal out of the box if you haven't turned the second-stage recycle bin off or set a site quota. If a site quota has not been set, then the growth of the second-stage recycle bin is not limited...
The second-stage recycle bin is by default limited to 50% of the site quota; in other words, if you have a site quota of 100GB, you would have a second-stage recycle bin of 50GB. If a site quota has not been set, there are no growth limitations...
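If the huge recycle bin does turn out to be the bottleneck, one option is to empty the site collection recycle bin programmatically rather than through the UI. This is only a hedged sketch against the standard MOSS 2007 object model (the URL is a placeholder); run it on a farm server and test it in dev first:

// console app referencing Microsoft.SharePoint.dll
using (SPSite site = new SPSite("http://yourwebapp/sites/yoursite"))
{
    SPRecycleBinItemCollection recycleBin = site.RecycleBin;  // site collection recycle bin items
    Console.WriteLine("Items in the site collection recycle bin: " + recycleBin.Count);
    recycleBin.DeleteAll();  // permanently deletes everything in the recycle bin
}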
I second everything Nat has said and emphasize splitting the content database. There are instructions on how to do this, provided you have multiple site collections and not a single massive one.
Also check that your SharePoint databases are in good shape. Have you tried DBCC CHECKDB? Do you have SQL Server maintenance plans configured to reindex and reduce fragmentation? Read these resources on TechNet (particularly the database maintenance article) for details.
Finally, see if there is anything more you can do to isolate the SQL Server as the problem. Are there any other applications with databases on the same SQL Server and are they having problems? Are you running performance monitoring on the SQL Server or SharePoint servers that show any bottlenecks?
Back up the production database, restore it to dev, and attach it to your dev SharePoint server.
Try to create a site. If it still takes forever there, you can assume the problem lies with the prod database; if it's fast, the problem is more likely with the production environment itself.
Regardless, at 100GB you are running up against the limit for a content database and should be planning to put content into more than one. You will know why when you try to back up the database. Searching should also be starting to take a good long time by now.
So, long term, you are going to have to plan on splitting your web sites out into different content databases.
--Responses--
Yeah, database size is really just about whether SQL Server can handle it. 100GB is just the "any more than this and it starts to be a pain" rule of thumb. Full search crawls will also start to take a while.
Given that you do not have access to the production database and that creating a sub-site is primarily a database operation, there is nothing you can really do to figure out what the issue is.
You could try creating a subsite while doing a trace of the Dev database and look at the tables those commands reference to see if there is a smoking gun, but without production access you are really hampered.
Does the production system serve pages and documents at a reasonable speed?
See if you can start getting some stats from the database during site creation to find out what work is being done. SQL Server has some great tools for that now.
