Search solutions to index a Windows SMB file share

I'm identifying search solutions that can crawl Windows SMB file shares and expose the search results through an API (with security trimming). So far, it seems like SharePoint and Elasticsearch can handle this, but are there other possible solutions?

This tool (a filesystem indexer that uses Elasticsearch as a backend) was made specifically for this use case:
https://github.com/shirosaidev/diskover
You could, for example, create a Linux machine, mount the SMB share there read-only, and let diskover index that path.
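If you go that route, the search API comes mostly for free, since diskover writes into ordinary Elasticsearch indices that you can query directly. Below is a minimal sketch with the elasticsearch-py client; the "diskover-*" index pattern and the "filename"/"path_parent" fields are assumptions based on diskover's mapping, so verify them against your version, and note that security trimming still has to be layered on top by filtering hits against the share's ACLs.

```python
# Minimal sketch: query a diskover-built Elasticsearch index over an SMB share.
# Assumptions: Elasticsearch on localhost:9200, an index matching "diskover-*",
# and file documents carrying "filename" and "path_parent" fields.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def search_files(term, index="diskover-*", size=25):
    """Return (parent path, filename) tuples for files matching `term`."""
    body = {
        "query": {"query_string": {"query": term,
                                   "fields": ["filename", "path_parent"]}},
        "size": size,
    }
    resp = es.search(index=index, body=body)
    return [(hit["_source"]["path_parent"], hit["_source"]["filename"])
            for hit in resp["hits"]["hits"]]

if __name__ == "__main__":
    for parent, name in search_files("*.pdf"):
        print(f"{parent}/{name}")
```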

Related

Copy files from one Azure VM to another with a file watch

I'm trying to set up a situation where I drop files into a folder on one Azure VM, and they're automatically copied to another Azure VM. I was thinking about mapping a drive from the receiver to the sender and using a file watch/copy program to send the files over the mapped drive.
What's a good recommendation for a file watch/copy program that's simple and efficient, and what security setups do I need to get the two Azure boxes to "talk" to each other? They're in the same account/resource group/etc, so I'm not going outside of a virtual network or anything like that.
By default, VMs in the same virtual network can talk to each other (this is true even if default NSGs are applied). So you wouldn't have to do anything special to get that type of communication working.
To answer the second part, you might want to consider just using built-in File Classification Infrastructure (FCI) rules to execute a short script to do the copy. See this link for a short intro to FCI rules.
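If FCI feels heavyweight, a plain polling watcher is often enough. Here is a minimal sketch in Python, assuming a local drop folder and a mapped/UNC target (both paths are placeholders):

```python
# Minimal sketch: poll a drop folder and copy new files to a mapped/UNC path.
# The paths are placeholders; adjust the polling interval to taste.
import shutil
import time
from pathlib import Path

SOURCE = Path(r"C:\drop")                 # folder you drop files into
TARGET = Path(r"\\receiver-vm\incoming")  # mapped drive or UNC path

def watch(interval=5):
    seen = set()
    while True:
        for f in SOURCE.iterdir():
            if f.is_file() and f.name not in seen:
                shutil.copy2(f, TARGET / f.name)  # copy with timestamps
                seen.add(f.name)
        time.sleep(interval)

if __name__ == "__main__":
    watch()
```

Note that this doesn't detect files that are still being written; in practice you'd wait for the file size to stabilize or use a rename-when-done convention before copying.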
Alternatively, you could use a service such as Azure Files to share files between those servers using CIFS. It really depends on why you are trying to have a copy of the file on two servers.
Hope that helps!

File server in container

I can see that some web and DB servers can run as containers, and it occurred to me: why not implement a file server the same way, with centralized storage (e.g. a SAN)?
Has anyone tried this before, or does anyone have any recommendations?
My basic idea is to use 2-3 Docker images to create the file servers (mostly Windows servers), all mounting the same storage. For the front end, I may go with DFS Namespaces to normalize the UNC path.
Windows-based images have the Server service disabled out of the box, and it's impossible to start it since the required drivers are removed as well. This will not be possible with Windows containers.

Use Microsoft Azure as a computing cluster

My lab just got a sponsorship from Microsoft Azure and I'm exploring how to utilize it. I'm new to industrial-level cloud services and pretty confused by the sheer amount of terminology and concepts. In short, here is my scenario:
I want to run the same algorithm on multiple datasets, i.e. data parallelism.
The algorithm is implemented in C++ on Linux (Ubuntu 16.04). I did my best to use static linking, but it still depends on some dynamic libraries. However, these dynamic libraries can easily be installed with apt.
Each dataset is structured, meaning the data (images, other files, ...) are organized in folders.
The ideal system configuration would be a bunch of identical VMs and a shared file system, so I can submit my jobs with 'qsub' from a script or something. Is there a way to do this on Azure?
I investigated the Batch service, but had trouble installing dependencies after creating compute nodes. I also had trouble with storage: so far I have only seen examples of using Batch with Blob storage, which is unstructured.
So are there any other services in Azure that can meet my requirements?
I somehow figured it out myself based on this article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create an HPC Pack cluster with a Windows head node and a set of Linux compute nodes. There are several useful templates in the Marketplace.
From the head node, we can execute commands on the Linux compute nodes, either from HPC Cluster Manager or using "clusrun" in PowerShell. We can easily install dependencies on the compute nodes via apt-get.
Create a file share inside one of the storage accounts. This can be mounted by all machines in the cluster.
One glitch here is that, for encryption reasons, you cannot mount the file share on Linux machines outside Azure. There are two solutions I can think of: (1) mount the file share on the Windows head node and share it from there via FTP or SSH; (2) create another Linux VM (as a bridge), mount the file share on that VM, and use "scp" to communicate with it from outside. Since I'm not familiar with Windows, I adopted the latter solution.
For the executable, I simply uploaded the binary compiled on my local machine. Most dependencies are statically linked, but there are still a few dynamic objects. I uploaded those to Azure and set LD_LIBRARY_PATH when executing programs on the compute nodes.
Job submission is done on the Windows head node. To make it more flexible, I wrote a Python script that writes XML job description files, which the Job Manager can load to create jobs (a sketch follows below). Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
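For reference, the script boiled down to something like the sketch below. The element and attribute names here are illustrative placeholders, not the authoritative schema; check the job description XML format in the MSDN link above before relying on them.

```python
# Minimal sketch: generate an HPC Pack job description XML with the standard
# library. Element/attribute names are illustrative; consult the documented
# job XML schema before relying on them.
import xml.etree.ElementTree as ET

def write_job_xml(path, job_name, commands):
    job = ET.Element("Job", Name=job_name, MinCores="1", MaxCores="1")
    tasks = ET.SubElement(job, "Tasks")
    for i, cmd in enumerate(commands, start=1):
        ET.SubElement(tasks, "Task", Name=f"task{i}", CommandLine=cmd)
    ET.ElementTree(job).write(path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    # One task per dataset: same binary, different input folder.
    datasets = ["/mnt/share/data1", "/mnt/share/data2", "/mnt/share/data3"]
    write_job_xml("job.xml", "experiment",
                  [f"/mnt/share/bin/algo {d}" for d in datasets])
```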
I believe there should be a more elegant solution with the Azure Batch service, but so far my small cluster runs pretty well with HPC Pack. Hope this post can help somebody.
Azure Files could provide you a shared file solution for your Ubuntu boxes; details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
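For completeness, here is a minimal sketch of that mount wrapped in Python; the account name, share name, and storage key are placeholders, and it assumes cifs-utils is installed (apt-get install cifs-utils) and root privileges.

```python
# Minimal sketch: mount an Azure file share over CIFS from Python.
# <account>, <share>, and <storage-key> are placeholders.
import subprocess

ACCOUNT = "<account>"
SHARE = "<share>"
KEY = "<storage-key>"
MOUNT_POINT = "/mnt/azfiles"

subprocess.run(
    ["mount", "-t", "cifs",
     f"//{ACCOUNT}.file.core.windows.net/{SHARE}", MOUNT_POINT,
     "-o", f"vers=3.0,username={ACCOUNT},password={KEY},"
           f"dir_mode=0777,file_mode=0777"],
    check=True,  # raise if the mount fails
)
```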
Again, depending on your requirements, you can create a pseudo folder structure in Blob storage by using containers and "/" in the naming strategy of your blobs.
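Here is a minimal sketch of that naming strategy with the azure-storage-blob v12 Python SDK; the connection string, container, and blob names are placeholders.

```python
# Minimal sketch: treat "/"-separated blob names as folders.
# The connection string and names are placeholders.
from azure.storage.blob import ContainerClient

conn_str = "<storage-connection-string>"
container = ContainerClient.from_connection_string(conn_str, "datasets")

# Upload under a pseudo folder: the "/" in the name is the whole trick.
with open("image01.png", "rb") as f:
    container.upload_blob("data1/images/image01.png", f, overwrite=True)

# List one "folder" level: delimiter="/" groups blobs by prefix.
for item in container.walk_blobs(name_starts_with="data1/", delimiter="/"):
    print(item.name)  # e.g. "data1/images/"
```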
To David's point: whilst Batch is generally looked at for these kinds of workloads, it may not fit your solution. VM Scale Sets (https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale your compute capacity, either by load or on a schedule, depending on your workload's behavior.

File Server vs NAS for hosting media files

I have a web portal where my users can login and submit artworks (image, documents, etc.). This web portal is hosted in 2 load-balanced web servers.
Because of this load balancing, I'm thinking of using a NAS as centralized media file storage for my web portal. I'm considering a NAS because it's cheaper than a file server and easier to maintain.
Now the questions are:
File hosting - Is there any NAS device that can act as a file hosting server, or do I need to create a virtual path on my web servers to the NAS? This can be achieved easily with a file server: I can just bind a separate domain to it, something like media.mydomain.com, so all media files are served through that domain. I don't mind serving the media files through a virtual path from my web servers, something like mydomain.com/media. I would like to know whether a NAS can support either of these approaches, and whether it's secure, easy to set up, etc.
Performance - This is more important because reads and writes are quite intensive. I have never used a NAS before. I'm thinking of getting two hard drives (2 TB, 15,000 RPM) configured as RAID-1. Would this match the performance of a common file server? I know the answer is relative, but I just want to see how a NAS can be used for file hosting, not just as a file-sharing device.
My web servers are running Windows Server 2008 R2 with IIS 7.5. I would appreciate it if anyone could also share best practices for integrating a NAS with Windows Server/IIS.
Thanks.
A NAS provides a shared location for information on a private network (at the least, you shouldn't expose NAS technologies such as NFS and CIFS over the internet) and is not really designed as a web file host. That is not to say you can't configure a NAS as a web file host using IIS/Apache/nginx, but then you don't need your web servers. NAS setup is well documented for both Windows Server and most Unix/Linux distros, and both are relatively easy. A NAS is as secure as it is designed to be; you can use a variety of access control methods to secure it, depending on your implementation.
This really depends on your concurrent users and the kind of load you expect them to put on the system. For the most part, a 1 Gb LAN connection and a 15,000 RPM hard drive should give a NAS ample performance for a decent number of concurrent users, but I can't say for certain: if a user sits there downloading hundreds of files at a time, you can have issues. As with any web technology, wrap limits around user usage to prevent one user from bringing down your entire system.
I'm not sure how you are defining a file server (a NAS is a file server), but if you think of a file server as a website that hosts files, a NAS will provide the same if not better performance based on where the device sits in relation to your web servers (again, depending on utilization). If you are worried about performance, you can always build a larger RAID array using RAID 5, RAID 6, or RAID 10, or use SSDs to increase storage performance. For the most part, the hardware constraints in any NAS are storage speed, network speed, RAM, and CPU. Again, this really depends on utilization, so test well, benchmark, and monitor performance.
Microsoft provides a tuning document for Server 2008 R2 that is useful: http://msdn.microsoft.com/en-us/library/windows/hardware/gg463392.aspx
In my opinion, your architecture would be your two web servers referencing the NAS as a shared location, either using a virtual directory pointed at the NAS for your files or handling the NAS location in code (code provides a whole plethora of options around security and usage).

How to access an Azure storage blob as if it were a local file?

Is there a way to "mount" a file for read/write access? I realize that I can GetBlobReference and BlobStream but what if I just want to give a file path to a library that doesn't understand Azure?
One example: A "Logger" library that just appends text to a specified file in some fashion.
Another (more realistic) example: a data source for an .sdf file (SQL Server Compact 4)
You can mount a page blob as an NTFS file system using XDrive. See this blog post.
As Oliver says, you can use XDrive for this, but if you do, please consider that only one role instance can have write access to the drive, so it's not a good solution for load-balanced SQL CE.
For the logger scenario, you could also consider using some algorithm on page blobs; see http://msdn.microsoft.com/en-us/library/ee691964.aspx. This won't be "normal file access", but it could be used across multiple roles and role instances.
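The core of any such algorithm is respecting the page blob's 512-byte page alignment when appending records. A minimal sketch of that bookkeeping, independent of any particular SDK (the upload callable is a stub you'd wire to your storage library, and coordinating offsets across instances, e.g. by giving each instance its own region of the blob, is the part the linked article's approach has to solve):

```python
# Minimal sketch: append-style logging onto a page blob. Page blobs are
# written in 512-byte pages, so each record is padded to a page boundary
# and written at the next free offset. The upload itself is a stub; wire
# it to whatever storage SDK you use.
PAGE = 512

class PageBlobLogger:
    def __init__(self, upload_page):
        self.offset = 0                 # next free byte, always page-aligned
        self.upload_page = upload_page  # callable(data: bytes, offset: int)

    def append(self, message: str):
        data = message.encode("utf-8")
        padded_len = -(-len(data) // PAGE) * PAGE  # round up to page size
        self.upload_page(data.ljust(padded_len, b"\x00"), self.offset)
        self.offset += padded_len

# Usage with a dummy uploader that just prints what would be written:
log = PageBlobLogger(lambda data, off: print(f"write {len(data)}B at {off}"))
log.append("hello world")
```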
