Centralized Indexing on a Server for Windows 7 Search

I have read this interesting article series:
http://www.windowsnetworking.com/articles-tutorials/windows-7/Exploring-Windows-7s-New-Search-Features-Part1.html
http://www.windowsnetworking.com/articles-tutorials/windows-7/Exploring-Windows-7s-New-Search-Features-Part2.html
http://www.windowsnetworking.com/articles-tutorials/windows-7/Exploring-Windows-7s-New-Search-Features-Part3.html
This article ends with "Sadly though, if you want to index network locations then you will be forced to cache the locations that you want to index. Searches of network volumes are still possible even without indexing those locations, but require a bit more effort than a typical search."
I have plenty of files on a network drive; searching them is very slow and misses files, and I would like to have them indexed so that I can have fast searches. The files are edited from time to time, so the index must be kept up to date with the changes. Making the files available offline is not an option, as it would defeat the purpose of a network drive.
I was wondering: is there a solution to this problem, such as software that runs on an independent machine, with Windows Search on the workstations connecting to that machine when searching the network drive?
I've searched a bit and there is software that can index files, such as Solr (http://lucene.apache.org/solr/), which is built on Lucene.
Is there any software out there that does the whole thing?
Anyone ever done something like this?
And if it's not possible, why?
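To be clearer about what I picture: a scheduled job on a dedicated box walks the share and pushes file metadata into Solr, and workstations query that box instead of relying on Windows Search. A very rough sketch of what I mean (the host name "indexbox", the core name "netdocs", the share path and the dynamic field names are all placeholders, and I'm assuming a default-ish Solr schema):

    import os
    import requests

    SOLR_URL = "http://indexbox:8983/solr/netdocs"   # hypothetical Solr core
    SHARE_ROOT = r"\\fileserver\projects"            # hypothetical network share

    def index_share():
        # Walk the share and collect basic metadata for every file.
        docs = []
        for dirpath, _dirs, filenames in os.walk(SHARE_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                docs.append({
                    "id": path,                          # the path doubles as the unique key
                    "filename_s": name,                  # *_s / *_l are default dynamic fields
                    "modified_l": int(os.path.getmtime(path)),
                })
        # Push everything to Solr's JSON update handler and commit.
        requests.post(SOLR_URL + "/update?commit=true", json=docs).raise_for_status()

    def search(term):
        # Workstations would query this core instead of searching the share directly.
        resp = requests.get(SOLR_URL + "/select",
                            params={"q": "filename_s:*%s*" % term, "wt": "json"})
        resp.raise_for_status()
        return [doc["id"] for doc in resp.json()["response"]["docs"]]

    if __name__ == "__main__":
        index_share()
        print(search("budget"))

Re-running the indexing job would overwrite documents with the same id, which is how edited files would stay current. Is there something packaged that does this end to end?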

Related

Creating a file server in Azure

Our company has an on-prem file server that I'd like to move to the cloud. I followed these directions and was successfully able to map a drive on my local work computer to connect to an Azure File Share. Our company has about 20 locations, ~5 TB of data (mostly "office" type of files) in total, and about 500 users accessing them.
There are two issues I would like to improve but I'm not sure how:
There's somewhat of a lag when opening files. Other than increasing our office's internet speed, is there anything to be done to make it faster? Would some kind of site-to-site VPN help? Would adding some type of server or VM in the "middle" (maybe one per location?) to cache the files reduce the lag?
Also, we have and use an Office 365 subscription. What's the easiest way to use our existing AD structure to transfer over the NTFS permissions that are currently in place?
I Googled around and found a bunch of companies advertising their services, notable among them was Talon Storage. But it seems like something that could be done without hiring a company. What I'm hoping for is a DIY direction to optimally solve these issues. Perhaps there's a standard or commonly recommended solution for such issues. Any guidance would be greatly appreciated.
L-A-T-E-N-C-Y. The number one enemy of any cloud-based file server attempt. It ranges from annoying to downright unusable, depending on how far you are from the Azure datacenter of choice.
Imagine a poor soul trying to "stream" a large 20-meg Excel file with 20 references to external files. What used to take maybe 8 seconds on-prem will now take 40 in the cloud (on a good day). It's game over for productivity. Your marketing department that sometimes used to cut video in iMovie over the network? Those days are over.
I understand this is not the answer you were after, but it's the crude reality.
Do not panic, there are solutions; here's a good one: https://azure.microsoft.com/en-us/services/storsimple/
I'm sure you wanted to get rid of boxes not buy more, but it is what it is.

A good distributed general-purpose filesystem for my case?

I've been researching the idea of using a distributed file system along with my dedicated servers instead of going with Amazon S3, and the results are nothing but massive headaches!
My project has the following characteristics/requirements:
User files are stored on dedicated servers. Each file is stored on 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (it only holds IDs/file metadata)
Files/data add up to around 50TB. Naturally, the data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to set up & maintain
Handles data storage so that I only have to care about removing/adding servers if the need arises (i.e. add new servers to the filesystem's server pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50TB of users files on amazon is way too expensive compared to using dedicated servers (provided it's good enough).
PS. My app isn't live yet, so I'm open to suggestions if someone has a good idea that fits well in my case.
EDIT: I'm not trying to make an S3 clone; I just need to use my existing hosting infrastructure to build a small-scale cloud solution. My question is about finding the right distributed file system to handle/automate this.
We recently switched from an expensive storage solution to the open-source LizardFS for our distributed storage needs. It is quite simple to set up and scale once you understand the basic concept.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview, but forget about shadow masters and metaloggers for now. What you need to know is that there are:
a master: regulates the traffic (make sure it has enough CPU)
chunkservers: actually store the data; use any kind of off-the-shelf hardware with a bunch of hard disks attached
clients: just simple mount points, so you can get a giant 50TB mount if you want. The master tells the client where to find/store the files; the actual data is transferred straight between client and chunkserver.
You can add as many chunkservers as you want, and the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding hard drives or adding servers. They don't have to be actual bare-metal machines, but that is probably the cheapest option.
There are 2 amazing features in LizardFS that allow geo-replication.
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): how important are your files? You can define, at the file or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes, and a goal of 4 copies on SSD drives for all new files.
Those same goals can also be used for geo-replication: you define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will be served files from the closest server.
The ease of setting it up is what sold LizardFS to me. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology is. So I spent quite a lot of time researching who uses it.
Orange Poland (a large telecom provider) is one of the users.
And Cloudweavers/OpenNebula actually built a business around it, selling complete solutions.
Won't it take more than one person a few months a year to manage these servers? That will cost some money; then you have the cost of hosting the data yourself, and then you have the added huge cost that the business/system you are building is not obviously scalable. In addition, any likely investor will be turned away by a complex home-grown data hosting system. How will you ensure integrity/security on par with Amazon? Your maximum savings per year look like $30,000 or so.
You could save money by building a de-duplicated storage system where you only store the unique chunks of data (also see rsync). I don't know how redundant your data is, though.
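To make the de-duplication idea concrete, here's a toy sketch of content-addressed chunk storage (the chunk size and store path are arbitrary; a real system would also need locking, garbage collection and replication):

    import hashlib
    import os

    CHUNK_SIZE = 4 * 1024 * 1024          # arbitrary 4 MB chunks
    STORE_DIR = "/var/dedup-store"        # hypothetical storage location

    def store_file(path):
        """Split a file into chunks and store each unique chunk once, keyed by hash."""
        manifest = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest = hashlib.sha256(chunk).hexdigest()
                chunk_path = os.path.join(STORE_DIR, digest)
                if not os.path.exists(chunk_path):      # only unique chunks hit disk
                    with open(chunk_path, "wb") as out:
                        out.write(chunk)
                manifest.append(digest)
        return manifest                                  # persist this list per user file

    def restore_file(manifest, dest):
        """Rebuild the original file from its ordered list of chunk hashes."""
        with open(dest, "wb") as out:
            for digest in manifest:
                with open(os.path.join(STORE_DIR, digest), "rb") as chunk_file:
                    out.write(chunk_file.read())

Duplicated data across user files then costs nothing extra, since identical chunks hash to the same key.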
I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.

Distributing a bundle of files across an extranet

I want to be able to distribute bundles of files, about 500 MB per bundle, to all machines on a corporate "extranet" (which is basically a few LANs connected using various private mechanisms, including leased lines and VPN).
The total number of hosts is roughly 100, and the goal is to get a copy of the bundle from one host onto all the other hosts reliably, quickly, and efficiently. One important issue is that some hosts are grouped together on single fast LANs in which case the network I/O should be done once from one group to the next and then within each group between all the peers. This is as opposed to a strict central server system where multiple hosts might each fetch the same bundle over a slow link, rather than once via the slow link and then between each other quickly.
A new bundle will be produced every few days, and occasionally old bundles will be deleted (but that problem can be solved separately).
The machines in question happen to run recent Linuxes, but bonus points will go to solutions which are at least somewhat cross-platform (in which case the bundle might differ per platform but maybe the same mechanism can be used).
That's pretty much it. I'm not opposed to writing some code to handle this, but it would be preferable if it were one of bash, Python, Ruby, Lua, C, or C++.
I think all these problems have been solved by modern research into P2P networking and are well packaged in convenient forms. A bit of scripting and BitTorrent should solve this: torrent clients exist for all modern OSes, so you just need a script on each machine that checks a location for a new torrent file, starts the download, and deletes the old bundle once the download has finished.
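A rough sketch of that per-machine script (the drop location is a placeholder, and the torrent-client invocation is deliberately generic rather than any specific client's real flags):

    import glob
    import os
    import subprocess
    import time

    DROP_DIR = "/srv/bundle-torrents"   # shared location where new .torrent files appear

    def add_to_client(torrent_path):
        # Placeholder invocation: replace with your client's real CLI
        # (transmission, rtorrent, etc. all offer something along these lines).
        subprocess.run(["torrent-client", "--add", torrent_path], check=True)

    def main(poll_seconds=60):
        seen = set()
        while True:
            for torrent in sorted(glob.glob(os.path.join(DROP_DIR, "*.torrent"))):
                if torrent not in seen:
                    add_to_client(torrent)
                    seen.add(torrent)
                    # Removing the previous bundle once this download completes
                    # depends on the client's status reporting, so it is left out here.
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        main()

Because BitTorrent peers exchange pieces with each other, hosts on the same fast LAN will mostly pull from one another rather than all crossing the slow link.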
What about rsync?
I'm going to suggest you use compie's idea of rsync to copy the files, in which case you can use a scripting language of your choice.
On the propagating system you will need a script containing some representation of the hosts and a matrix between them weighted by link speed. You then need to calculate a minimum spanning tree from that information. From that, you can send messages to the systems you intend to propagate to, detailing the MST and the bundle to fetch, whereupon that script/daemon begins the transfer. Each host then contacts the hosts reachable over the fastest links...
You could implement it in bash (Python might be better) or as a custom C daemon.
When you update the network you'll need to update the matrix based on latest information.
See: Prim's Algorithm.
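For illustration, a minimal sketch of that MST step (host names and link costs are made up; lower cost means a faster link):

    import heapq

    # cost[a][b] = relative cost of sending the bundle from a to b
    cost = {
        "hq":     {"lan1-a": 10, "lan2-a": 12},
        "lan1-a": {"hq": 10, "lan1-b": 1, "lan2-a": 8},
        "lan1-b": {"lan1-a": 1},
        "lan2-a": {"hq": 12, "lan1-a": 8, "lan2-b": 1},
        "lan2-b": {"lan2-a": 1},
    }

    def prim(source):
        """Prim's algorithm: return the cheapest set of links reaching every host."""
        visited = {source}
        heap = [(w, source, dst) for dst, w in cost[source].items()]
        heapq.heapify(heap)
        tree = []                      # (src, dst) edges the bundle travels over
        while heap and len(visited) < len(cost):
            w, src, dst = heapq.heappop(heap)
            if dst in visited:
                continue
            visited.add(dst)
            tree.append((src, dst))
            for nxt, w2 in cost[dst].items():
                if nxt not in visited:
                    heapq.heappush(heap, (w2, dst, nxt))
        return tree

    print(prim("hq"))
    # [('hq', 'lan1-a'), ('lan1-a', 'lan1-b'), ('lan1-a', 'lan2-a'), ('lan2-a', 'lan2-b')]

Note how the slow cross-site link is traversed only once; the copies within each LAN go over the cheap 1-cost edges.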

network drive file sharing

For the better part of 10+ years we have relied on various mapped network drives to allow file sharing: one drive letter for sharing files between teams, a separate file share for the entire organization, a third for personal use, etc. I would like to move away from this and am trying to decide whether an ECM/SharePoint-type solution, or a home-grown app, is worth the cost and the way to go, or whether we should simply keep relying on login scripts/mapped drives for file sharing due to their relative simplicity. Does anyone have any experience within their own organization, or thoughts on this?
Thanks.
SharePoint is very good at document sharing.
Documents generally follow a process for approval, have permissions, live in clusters... and these things lend themselves well to SharePoint's document libraries.
However, there are some things that don't lend themselves well to living inside SharePoint... do you have a virtual hard drive (.vhd) file that you want to share with a workmate? Not such a good idea to try and put a 20GB file into SharePoint.
SharePoint can handle large files, and so can SQL Server behind it... but do you want your SQL Server bandwidth being saturated by such large files? Do you want your backup of SQL Server to hold copies of such large files multiple times?
I believe that there are a few Microsoft partners who offer the ability to disassociate file blobs from the SharePoint database, so that SharePoint can hold the metadata and a file system holds the actual files, and SharePoint simply becomes the gateway to manage access, permissions, and offer a centralised interface to files throughout an organisation. This would offer you the best of both worlds.
Right now though, I consider SharePoint ideal for documents, and I keep large files (that are not document centric) on Windows file shares.
Definitely, use a tool.
The main benefit here is version control: being able to jump easily to a previous version, diffing, and seeing who modified what (see most VCSs' blame/annotate tools, which print out the text file showing when and by whom each line was modified).
Second, you can probably benefit from issue tracking/task tracking.
Other benefits include web access from the internet, having a wiki (which can be great in some situations), etc.
I use Subversion + Redmine at work, and I find it highly useful. Test a few solutions and you will surely find further advantages for your case.
One thing that can be overlooked in the change to a document management tool is the planning required around how much is going to be stored, and information architecture issues like where different content is going to end up.
SharePoint in particular is easy to set up without a good plan going forward, and is particularly vulnerable to difficulties later on when things get too busy.
I would not recommend a home-grown app for something like this. The problem has been solved by off-the-shelf tools, and growing one from scratch is going to cost a huge amount and not get you anywhere near the features for the money.
Did I mention how important planning your security groups and document areas (IA) was?
If you just need document storage then SharePoint can do very well. WSS is even free, and it provides very good document storage capabilities.
But you have to plan carefully, as updating existing applications is painful. If you decide to go with SharePoint, then I can give you a few pieces of advice off the top of my head:
Pay attention to security configuration (user groups, privileges, ...)
Plan your document libraries well, as it is not easy to just move documents between them
Also consider limiting the number of versions a document can have, because SharePoint stores full copies of each version, not just the changes
Don't use InfoPath :) we have had very bad experience with it (just don't tell this to the managers)
If you don't really need to change the graphical look of SharePoint, then don't bother with it, as it brings many problems (I'm talking about custom master pages and custom site templates)
Try to use as much OOB (out-of-the-box) stuff as possible, because developing your own web parts not only costs more, it can also be quite complicated
Make sure to turn on search indexing. This is quite tricky, because it is turned off by default, and then you will be as surprised as I was that search is not working :)
If you just deploy it and load 10,000 documents into it, then you will surely have problems with it later. If you give a little thought to structure, then you will end up with really good document storage.
Migrating is very probably worth the cost in the long term. You will gain reliability, versioning, traceability, and extensibility.
Be sure to first identify the groups/rights, and to identify which links need to be fixed (maybe you have applications that use links to the shares).
An open-source alternative to SharePoint is Alfresco; it is very good for CIFS (Windows shares) too.

ensuring uploaded files are safe

My boss has come to me and asked how to ensure that a file uploaded through a web page is safe. He wants people to be able to upload PDFs and TIFF images (and the like), and his real concern is someone embedding a virus in a PDF that is then viewed/altered (and the virus executed). I just read something about a procedure that can be used to destroy steganographic information embedded in images by altering the least significant bits. Could a similar process be used to ensure that a virus isn't implanted? Does anyone know of any programs that can scrub files?
Update:
So the team argued about this a little bit, and one developer found a post about letting the uploaded file land on the file system and having the antivirus software that protects the network check the files there. The poster essentially said that it was too difficult to use the API or the command line for a couple of products. This seems a little kludgy to me, because we are planning on storing the files in the DB, and I haven't had to scan files for viruses before. Does anyone have any thoughts or experience with this?
http://www.softwarebyrob.com/2008/05/15/virus-scanning-from-code/
I'd recommend running your uploaded files through antivirus software such as ClamAV. I don't know about scrubbing files to remove viruses, but this will at least allow you to detect and delete infected files before you view them.
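For example, a rough sketch of calling clamscan from code, assuming ClamAV is installed on the server and the upload is written to a temporary file before it goes into the database:

    import subprocess

    def is_clean(path):
        # clamscan exit codes: 0 = clean, 1 = virus found, 2 = error.
        result = subprocess.run(
            ["clamscan", "--no-summary", path],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return True
        if result.returncode == 1:
            return False              # infected: reject/delete the upload
        raise RuntimeError("clamscan failed: " + result.stderr.strip())

    # usage during upload handling (reject_upload is a hypothetical handler in your app):
    # if not is_clean(tmp_path):
    #     reject_upload()

Shelling out is slower than talking to the clamd daemon directly, but it keeps the integration trivial.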
Viruses embedded in image files are unlikely to be a major problem for your application. What will be a problem is JAR files. Image files with JAR trailers can be loaded from any page on the Internet as a Java applet, with same-origin bindings (cookies) pointing into your application and your server.
The best way to handle image uploads is to crop, scale, and transform them into a different image format. Images should have different sizes, hashes, and checksums before and after transformation. For instance, Gravatar, which provides the "buddy icons" for Stack Overflow, forces you to crop your image, and then translates it to a PNG.
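A rough sketch of that decode/scale/re-encode step, using Pillow (the size bound and file names are arbitrary choices):

    from PIL import Image

    def sanitize_image(upload_path, output_path, max_size=(512, 512)):
        # Decode the upload, scale it down, and re-encode it as a fresh PNG so
        # none of the original bytes (JAR trailers, appended payloads, etc.) survive.
        with Image.open(upload_path) as img:
            img = img.convert("RGB")          # drop exotic modes and metadata
            img.thumbnail(max_size)           # scale down in place, preserving aspect ratio
            img.save(output_path, format="PNG")

    # sanitize_image("incoming/avatar.gif", "store/avatar.png")
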
Is it possible to construct a malicious PDF or DOC file that will exploit vulnerabilities in Word or Acrobat? Probably. But ClamAV is not going to do a very good job at stopping those attacks; those aren't "viruses", but rather vulnerabilities in viewer software.
It depends on your company's budget but there are hardware devices and software applications that can sit between your web server and the outside world to perform these functions. Some of these are hardware firewalls with anti-virus software built in. Sometimes they are called application gateways or application proxies.
Here are links to an open source gateway that uses Clam-AV:
http://en.wikipedia.org/wiki/Gateway_Anti-Virus
http://gatewayav.sourceforge.net/faq.html
You'd probably need to chain an actual virus scanner to the upload process (the same way many virus scanners ensure that a file you download in your browser is safe).
In order to do this yourself, you'd have to keep it up to date, which means keeping libraries of virus definitions around, which is likely beyond the scope of your application (and may not even be feasible depending on the size of your organization).
Yes, ClamAV should scan the file regardless of the extension.
Use a reverse proxy setup such as
www <-> HAVP <-> webserver
HAVP (http://www.server-side.de/) is a way to scan HTTP traffic through ClamAV or other commercial antivirus software. It will prevent users from downloading infected files.
If you need HTTPS or anything else, you can put another reverse proxy, or a web server in reverse-proxy mode, in front of HAVP to handle the SSL.
Nevertheless, it does not act at upload time, so it will not prevent the files from being stored on your servers, but it will prevent them from being downloaded and thus propagated. So use it together with regular file scanning (e.g. clamscan).
