I am investigating whether it is feasible to deploy search servers to the cloud, and one of the questions I have revolves around data security. Currently all of our fields (except a few used for faceting) are indexed and not stored (except for the ID, which we use to retrieve the document after the search has completed).
If for some reason the servers within the cloud were compromised, would it be possible for that person to reverse-engineer our data from the indexes, even without the fields being stored?
It depends on the security level you need and the sensitivity of the document content...
With the configuration you describe it wouldn't be possible to rebuild the original documents as exact "clones"... BUT it would be possible to recover enough information to gain a lot of knowledge about the content... depending on the context this could be damaging...
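To make that concrete, here is a rough sketch (my own illustration, not anything from your setup) of what someone with plain HTTP access to the Solr port could do, assuming the stock /terms request handler (the TermsComponent from the example solrconfig) is enabled; the host, core and field names are invented. Even with nothing stored, the indexed terms themselves can be enumerated:

// Hypothetical illustration: enumerate indexed terms from a reachable Solr instance.
// Assumes a /terms handler is configured; host, core and field names are made up.
// Uses Node 18+ where fetch and URL are built in.
async function dumpTerms(field) {
  const url = new URL('http://solr.internal:8983/solr/mycore/terms');
  url.searchParams.set('terms.fl', field);      // field whose terms we want
  url.searchParams.set('terms.limit', '1000');  // top N terms by document frequency
  url.searchParams.set('wt', 'json');
  const res = await fetch(url);
  const data = await res.json();
  // The JSON response holds a flat array per field: [term1, freq1, term2, freq2, ...]
  return data.terms[field];
}
dumpTerms('body_text').then(terms => console.log(terms));

A list of the most frequent terms per field is often enough to infer names, products, or topics, which is exactly the "a lot of knowledge about the content" risk described above.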
An important point:
If you use the cloud-based servers to build the index and they get compromised, THEN there may be no need for any "reversing" at all, depending on your configuration: at least for any document you index after the servers are compromised, because to build the index the document gets sent over as-is (for example when using http://wiki.apache.org/solr/ExtractingRequestHandler)...
As Yahia says, it's possible to get some information. If you're really concerned about this, use an encrypted file system, as Amazon suggests.
We're starting to use SharePoint 2013 to manage our department's process documentation and I have some questions about best practices for site structure. I'm a little surprised I can't find the answers via web search, since this seems like a basic question every new SharePoint user must deal with.
Moving from a file share environment, I'm trying hard to get out of that mindset and I understand the many benefits of SharePoint over file shares. I also understand why creating folders in SharePoint forces arbitrary divisions on files whereas one large set of documents with metadata lets you filter and group the files based on different needs.
What's confusing me is that I also read that it's better to have too many sub-sites than not enough. It seems like sub-sites can easily become pseudo-folders and I'm not sure where that line is crossed.
Here's an example.
We have a SharePoint site devoted to our department. We've created a sub-site dedicated to an application we developed to load data into our business systems. It mainly holds technical documentation about the application. This application supports many different data sources, each with its own set of user instructions for loading, its own schedule (calendar), contact lists, supporting files, etc. There's no compelling reason to separate them to restrict access. However, there doesn't seem to be a lot of value in having them all in the same sub-site, either, since someone working on a job will only want to see the docs and supporting files for that data source. I just can't foresee someone wanting to view supporting files across all data sources, although I could see someone wanting to see the schedule for all data sources combined.
My question is: should I create separate sub-sites under the application for each data source, or do I just store everything in the application sub-site and use metadata and views to group things by data source? Putting all the items for a specific data source into its own sub-site seems much simpler to manage and present than having to specify metadata for every new file and create a lot of views. However, I can't shake the feeling that I'm still using file share thinking. Or maybe I'm just missing some basic concept of SharePoint.
Any advice or links to good discussions of this topic would be greatly appreciated. Thanks.
I would recommend that you use metadata and views to separate data within one repository/site (a rough sketch follows the list of reasons below).
My reasons are as follows:
In SharePoint, it is recommended to use metadata rather than "evil" folders (or sub-sites, in your case). Keep in mind that maintaining multiple sub-sites requires a big administrative effort in the long term; for example, some sites will inherit permissions while others have unique permissions.
As time passes and people rotate, it becomes unclear where existing data was stored and where new data should go, especially with a large number of sub-sites.
Since confidentiality is not a concern in your case, keeping the data centralized and open to people working in related areas increases sharing and collaboration. In contrast, using sub-sites increases the possibility of data silos.
People are lazy :). Take me as an example: I don't want to remember all those xyz URLs; I want to go to one place and be able to fetch everything that I need.
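As a rough illustration of the metadata-and-views idea (the site URL, library name and the "DataSource" column below are invented for the example, they are not something SharePoint ships with): keep one document library, add one column per distinguishing attribute, and each "view" of a data source is just a filtered query.

// Hypothetical sketch: filter one document library by a custom "DataSource"
// metadata column via SharePoint 2013's REST API, instead of one sub-site per source.
async function filesForDataSource(dataSource) {
  const endpoint =
    "https://contoso.sharepoint.com/sites/DeptApp/_api/web/lists/" +
    "getbytitle('Documents')/items" +
    "?$select=FileLeafRef,DataSource,Modified" +
    "&$filter=DataSource eq '" + dataSource + "'";
  const res = await fetch(endpoint, {
    headers: { Accept: 'application/json;odata=verbose' },
    credentials: 'include' // assumes you are already signed in to the site in the browser
  });
  const data = await res.json();
  return data.d.results; // the metadata column does the job a folder or sub-site would have done
}
// filesForDataSource('SalesFeed').then(console.log);

A saved list view with the same filter gives non-technical users the same "one data source at a time" experience without any sub-sites to administer.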
I need to create a search engine that crawls through a list of websites and searches them for a query. Those websites all return data in various formats and structures, and I need to collect specific info (in a single, uniform structure) from all of them.
Is there a way I can do that with an existing engine such as Google Custom Search Engine? Or am I better off creating one of my own? If so, what's the first step I should take towards learning how to index and search these websites efficiently, without filling up my servers with useless junk?
So to sum up: besides running a query through each of these websites' search boxes, I need to handle the results from each of them appropriately and lay them out in a unified structure, all in one place. All the results are to be parsed and extracted into 4-6 fields (unless, of course, there is a way to do this with Google CSE).
Google CSE provides some interfaces to the standard Google web search. You can control the user interface and the search parameters, but you have no control over the indexing, nor any direct access to the index data.
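If the hosted CSE is enough for you, its Custom Search JSON API can be called directly; here is a minimal sketch (the API key, engine ID and the field mapping are placeholders I made up, and it assumes Node 18+ with built-in fetch):

// Rough sketch of querying the Google Custom Search JSON API from Node.
// API_KEY and ENGINE_ID (the "cx" value) are placeholders you create in the
// Google control panels.
const API_KEY = 'YOUR_API_KEY';
const ENGINE_ID = 'YOUR_ENGINE_ID';

async function cseSearch(query) {
  const url = new URL('https://www.googleapis.com/customsearch/v1');
  url.searchParams.set('key', API_KEY);
  url.searchParams.set('cx', ENGINE_ID);
  url.searchParams.set('q', query);
  const res = await fetch(url);
  if (!res.ok) throw new Error('CSE request failed: ' + res.status);
  const data = await res.json();
  // Normalize each hit into the small uniform structure the question asks for.
  return (data.items || []).map(item => ({
    title: item.title,
    link: item.link,
    snippet: item.snippet,
    source: new URL(item.link).hostname
  }));
}

cseSearch('example query').then(results => console.log(results));

Note that you only get what Google exposes per result (title, link, snippet, some page metadata); extracting richer fields still means fetching and parsing the target pages yourself.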
You might be more interested in the Google Search APIs that are available with GAE. These are quite different: they are search services in which you provide the data and control the indexes.
Here in December 2018, with Google CSE, we can define a set of websites to run our requests against. Google CSE offers up to 2000 website sources to include, and up to 5000 sources overall.
A simple comparison:
Google CSE provides a strong API, custom requests, and nothing to run on your own server, but in contrast it permits only 100 requests per day on the free tier.
Developing a new search engine could be worthwhile for a small set of websites, and it gives you a search engine customized to the business needs, but it requires time, infrastructure, money, and the development of search-engine algorithms: indexing, storage and analysis.
To sum up, it depends on which side you really need.
I am new to Redis and would like to store web analytics for a website, both globally and per user activity.
Below is what I have so far.
// add every unique visitor IP to a set
client.sadd('visitors', ip);
// record hits per IP in a hash of counters
client.hincrby('hits', ip, 1);
The above works fine so far: I do get the number of distinct IPs and a hit counter per IP.
The problem comes when storing the activities performed by each IP, i.e. storing the links they clicked and the searches they did, with a datetime.
Can someone please shed some light on how best to manage this?
Thanks
"the problem comes when storing the activities performed by each IP"
You will need a separate structure for storing these.
The simplest rational structure is to have a "list of actions by session". Take a look at the sorted set commands, which provide a basic framework for creating a list of actions within a session.
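For example, a minimal sketch in the same node_redis style as your snippet (the 'activity:<ip>' key pattern and the shape of the event blob are just my own convention):

// Log one event per action into a per-IP sorted set; the score is the
// timestamp in milliseconds, the member is a small JSON blob.
function logActivity(client, ip, action, detail) {
  var event = JSON.stringify({ action: action, detail: detail, at: new Date().toISOString() });
  client.zadd('activity:' + ip, Date.now(), event);
}

// Usage:
// logActivity(client, ip, 'click', '/products/42');
// logActivity(client, ip, 'search', 'red shoes');

// Replay what an IP did within a time window, oldest first:
// client.zrangebyscore('activity:' + ip, startMs, endMs, function (err, events) {
//   events.forEach(function (e) { console.log(JSON.parse(e)); });
// });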
This will get you something quickly. However, this is probably not what you really want; in fact, Redis is probably not useful for this at all.
If you want to re-trace an entire site visit you really want to connect to some sort of true analytics framework. There are dozens of website tracking tools that provide this type of functionality, so it's not really clear that building one is very efficient.
Let's say I want to build a specialised catalog of information that organisations can provide about themselves. We agree on a metadata standard, and they include this information on their websites.
Is it possible to use Google's infrastructure somehow to solve the problem of discovering sites with that metadata, and regularly re-spidering to pick up any updates?
The way this kind of problem is often solved seems to involve "registering" the site with the central index, which then builds infrastructure to regularly visit each registered site. But I wonder if it can be done more smartly, without the need to formally "register".
For example, presumably you could make part of the metadata standard a unique string, which you could then literally Google search for. Then you'd process the rest of the page. But is there a more streamlined, smarter, more formal way to do this?
Plone CMS: how heavy are search requests compared to typical CMS GET requests?
I fear that on a large site (0.5 million documents) enabling search capability is asking for a DoS. If so, how can this threat be mitigated? Can search run on a different ZEO instance?
Plone's portal_catalog is rather efficient/fast/optimized. It's not like an SQL query, where you can construct searches that take minutes to complete.
The heavy part is usually "waking up" objects when presenting the search results; you should work as much as possible with the metadata (the so-called "brains") that the catalog returns. This is what Plone tries to do by default anyway.
But still, you can use a separate ZEO instance for handling search requests if you feel that this may be a bottleneck. Just make sure requests for /search and /search_form (or, generically, /search*) end up at this specific ZEO instance. How you do this is rather specific to your current load-balancing setup (Apache, Squid, nginx, etc.).
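For instance, with nginx in front, something along these lines would do it. This is a sketch only: the ports and upstream names are made up, and the usual Plone VirtualHostMonster rewriting is omitted for brevity.

# Route /search* to a dedicated ZEO client, everything else to the main one.
upstream plone_main   { server 127.0.0.1:8080; }
upstream plone_search { server 127.0.0.1:8081; }

server {
    listen 80;

    # /search, /search_form, /search_rss, ... go to the search ZEO client.
    location ^~ /search {
        proxy_pass http://plone_search;
    }

    # Everything else goes to the main ZEO client.
    location / {
        proxy_pass http://plone_main;
    }
}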
With that many documents you want to be investigating a dedicated search system; Plone's text indexes are really not that great. Take a look at http://plone.org/products/collective.solr for a Plone integration with http://lucene.apache.org/solr/, or http://pypi.python.org/pypi/collective.gsa if you have a Google Search Appliance.
Plone's search engine is awesome in that it is fully integrated and ships with the default install. But when your site grows to the area of 500k documents, you generally want a more solid search.
We have used Solr with great success in large projects, and several integrations with Plone already exist:
http://plone.org/products/collective.solr
http://plone.org/products/alm.solrindex
http://plone.org/products/collective.recipe.solrinstance