Cannot crawl complex URLs without setting a site-wide rule to 'crawl as http content' - SharePoint

I have pages within a site containing a control that uses a query string to provide dynamic data to the user (http://site/pages/example.aspx?id=1).
I can get my content source to index these dynamic pages only if I create a rule which sets the root site (http://site/*) to 'include complex urls' and 'crawl sharepoint content as http content'. This is NOT acceptable as changing the crawling protocol from SharePoint's to HTTP will prevent any metadata from being collected on the indexed items. The managed metadata feature is a critical component to our SharePoint applications.
To dispel any doubt as to whether this is simply a configuration error on my part, refer to http://social.technet.microsoft.com/Forums/en-US/sharepointsearch/thread/4ff26b26-84ab-4f5f-a14a-48ab7ec121d5 . The issue described there is exactly my problem, but the solution is unusable for the reasons mentioned above.
Keep in mind this is for an external publishing site and my search scope is being trimmed using content classes to only include documents/pages (STS_List_850 and STS_ListItem_DocumentLibrary). Creating a new web site content source and adding it to my scope presents 2 problems: duplicate content in scope and no content class defining it that I know of.
What options do I have?

Just a thought: maybe you could create two content sources, one (SharePoint) for the items and their metadata and one (HTTP) for the dynamic pages, then set crawl rules on each to exclude the other's content. Would that solve your problem?

I have decided to take a different approach to this problem, as combining dynamic HTTP content and SharePoint content into one scope is a non-trivial problem better suited to an entirely new project rather than the retrofit I was attempting.
If you have dynamic content from a separate system which you want to crawl without sacrificing SharePoint metadata from the rest of your site, it seems the only option is to write a BCS application/search connector, crawl the two content sources separately, and combine them with a scope and possibly an extended Core Results web part. Good luck!

Azure CDN Ignore query strings purpose

I know the difference between the Azure CDN query string modes and I have read a helpful example of query string modes, but...
I don't understand the purpose of "Ignore query strings" or how it can be useful on a real dynamic website.
For example, suppose we have a product purchase website with a URL similar to www.myweb.com/products?id=3
If we use "Ignore query strings"... does this mean that if a user later requests product 4 (www.myweb.com/products?id=4), they will receive the page for product 3?
I think I'm not understanding Azure CDN correctly: I'm seeing it as a CDN for dynamic content, whereas Azure CDN is only used for static content, as this article explains:
Standard content delivery network (CDN) capability includes the ability to cache files closer to end users to speed up delivery of static files.
Is this correct? Any help or example on the subject is welcome.
Yes, if you have selected the "Ignore query strings" query string caching behavior (which is the default), then in your case, after the initial request for www.myweb.com/products?id=3, that POP server will serve the same cached content for subsequent requests, no matter the query string value, until its cache period expires.
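To illustrate, with "Ignore query strings" the edge effectively builds its cache key from the URL without the query string, roughly like this sketch (an illustration only, not Azure's actual implementation):
using System;

static class CdnCacheKeyExample
{
    // With "Ignore query strings", the edge keys its cache on the URL
    // without the query string, so ?id=3 and ?id=4 hit the same entry.
    public static string CacheKeyIgnoringQueryString(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Path);
    }
}

// CdnCacheKeyExample.CacheKeyIgnoringQueryString("http://www.myweb.com/products?id=3") -> "http://www.myweb.com/products"
// CdnCacheKeyExample.CacheKeyIgnoringQueryString("http://www.myweb.com/products?id=4") -> "http://www.myweb.com/products"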
And for the second question: a CDN is all about serving static files. To my understanding, what the article refers to is dynamic site acceleration, which is a bunch of techniques for optimizing how a dynamic website's content is served. Unlike a static website, a dynamic website's assets (static files, e.g. images, js, css, html) are loaded dynamically based on user behavior.
Now that it's clearer to me, I will answer my own question:
Azure CDN - Used to cache static content, even on dynamic web pages.
For the example in the question, all product pages download the same JavaScript and CSS content; Azure CDN is used for those types of files. A real example using "Ignore query strings":
User A requests www.myweb.com/products?id=3: jquery-versionX.js and mystyles.css are not cached yet, so they are fetched from the origin server and delivered to the user.
User B requests www.myweb.com/products?id=4: since we are using "Ignore query strings", the jquery-versionX.js and mystyles.css files are already cached and are served to the user without requesting them from the origin again.
User C requests www.myweb.com/products?id=3: since we are using "Ignore query strings", the jquery-versionX.js and mystyles.css files are already cached and are served to the user without requesting them from the origin again.
Redis or similar - Used to cache dynamic content (database queries, for example).
For the example in the question, each product has different information, which is obtained by a database query. We can store those query results or JSON objects in a Redis cache. A real example:
User A requests www.myweb.com/products?id=3: product 3 is not cached, so it is fetched from the server and delivered to the user.
User B requests www.myweb.com/products?id=4: product 4 is not cached, so it is fetched from the server and delivered to the user.
User C requests www.myweb.com/products?id=3: product 3 is now cached, so the server is not queried and the user receives it from the cache.
Summary:
Both methods can be used simultaneously: Azure CDN for static content and Redis or similar for dynamic content.
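As an illustration of the dynamic side, a cache-aside lookup with the StackExchange.Redis client could look roughly like the following sketch (the key format, the 10-minute expiry and GetProductFromDatabase are assumptions for the example, not code from my site):
using System;
using StackExchange.Redis;

public class ProductCache
{
    private static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost");

    // Cache-aside: try Redis first, fall back to the database, then cache the result.
    public string GetProductJson(int id)
    {
        IDatabase db = Redis.GetDatabase();
        string key = "product:" + id;             // hypothetical key format

        string cached = db.StringGet(key);
        if (cached != null)
            return cached;                        // e.g. user C asking for product 3 again

        string json = GetProductFromDatabase(id); // placeholder for the real query
        db.StringSet(key, json, TimeSpan.FromMinutes(10));
        return json;
    }

    private string GetProductFromDatabase(int id)
    {
        // The real implementation would query the database and serialize to JSON.
        return "{\"id\":" + id + "}";
    }
}
With this pattern, the first request for a given id pays the database cost and every later request within the expiry window is served from Redis.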

Serving some files from Azure CDN to only certain users

Suppose I have a website that is served by an Azure CDN endpoint (via files that have been uploaded to blob storage).
I want the minified website content to be available to everyone -- that part is easy, since that's what the CDN does by default.
Ideally, I would also have the sourcemaps available on that same CDN (so that the default behavior of //# sourceMappingURL=0-8d1d0e3cc4594b2c2758.js.map within my JS files would "just work"). However, I'd like for those sourcemaps to only be served to a subset of users.
Is there a way of accomplishing this scenario? I'm happy to define "subset" in any way that would make this scenario work (e.g., being connected to a certain VPN, being in a certain IP-address range, using Fiddler to set a secret header, etc.)
Thanks!
I assume that what you need is a system that, in production, serves sourcemaps to a certain group of users, for instance a team of developers, but not to everyone; the sourcemaps should not be publicly accessible.
There are different alternatives that can help achieve this goal.
On the one hand, we can try to use a rules engine that analyzes the incoming HTTP traffic and returns one response or another depending on the criteria deemed appropriate.
These rules engines allow you to customize how HTTP requests are handled by defining a set of match conditions on the incoming requests and actions to be performed when the match conditions apply.
Azure CDN provides two types of rules engines: a standard rules engine for Azure CDN from Microsoft, and a premium one from Verizon, which provides more advanced features.
How you use these rules engines depends largely on how you need to identify your user group and on how you want to condition the response your application offers to a sourcemap request.
For instance, one of the standard rules engine's match conditions - also available in the premium rules engine - is the remote IP address the request comes from: that could be a good criterion for discriminating between your different subsets of users.
Or, as you suggested with the use of Fiddler, you can inspect the incoming request headers in search of a custom one.
The Azure CDN Verizon Premium rules engine provides more advanced match conditions based on browser, device type, and so on.
Once the users have been identified, the system must consider the action to take depending on whether they belong to one or another group.
Both the standard and the Verizon rules engines provide actions that could be relevant for this purpose.
I think the best option, if you can use the Verizon rules engine, would be to deny access for HTTP requests sent by users who do not belong to the group allowed to access the sourcemaps.
Other options, although I think they are more difficult to implement if you are working with webpack and a SPA, would be to redirect the requests received from one subset of users to files which contain the sourcemaps - or to different index.html pages if your frontend is a SPA, each with different js and css resources, with or without sourcemaps - or to rewrite the URL to directly deliver a different set of files.
Another possible action could be to not include the inline sourcemap location in your minified files and instead take advantage of the ability to modify response headers and append a SourceMap header that points to the actual sourcemaps. This header would only be sent for the desired user group. Again, depending on how you are building your frontend, this may not be an easy task.
Finally, if you are using webpack and the SourceMapDevToolPlugin to build your frontend, you can use the publicPath option to point your sourcemaps, in production, to a non-public, more developer-oriented URL location. This is the approach followed in this article. I think this approach is also worth looking into.
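If you go down that last road, one way to make the sourcemap location non-public is to host the .map files behind a small app of your own rather than on the public CDN endpoint, and have it check something only your developers send. A minimal ASP.NET Core sketch of that idea (the X-SourceMap-Token header name and token value are made up for the example; an IP-range check would work the same way):
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    // Refuse sourcemap requests that do not carry the shared-secret header.
    if (context.Request.Path.Value?.EndsWith(".map") == true &&
        context.Request.Headers["X-SourceMap-Token"] != "my-secret-token")
    {
        context.Response.StatusCode = StatusCodes.Status403Forbidden;
        return;
    }
    await next();
});

// Serve the sourcemap files themselves from wwwroot (e.g. wwwroot/sourcemaps/*.map).
app.UseStaticFiles();
app.Run();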

Script file not being loaded through ScriptLink custom action

I am having trouble with script link custom actions. I am building a SharePoint app, and I successfully added a site-scope custom action pointing to a script file in the Style Library, as I want this particular script to be injected to all the pages of my SharePoint site.
While it works in certain situations, the script link injection breaks without apparent reason under certain conditions. For example, when I arrive on my root web, the script will be injected. But if I go to a certain link within this web (for example Home or Site Contents), the file that is supposed to be injected will simply not be fetched from the Style Library and therefore never injected, resulting in an uncaught ReferenceError when I try to call one of the script's functions. The weirdest part is that a page refresh through Ctrl+F5 will fetch the script file without any problem, regardless of whether the page originally fetched it when first accessed, and the page will keep the script until it is accessed through a link again.
I've read up on SharePoint caching, thinking it may be the cause of my problem, but the trouble is that the articles I found mostly talk about cache-induced errors when updating a file, while I am only trying to access one.
One thing to note is that, due to limitations, I am adding the script link custom action through code. Here's an example of what this kind of call currently looks like in my app:
// Load the site collection's existing custom actions, then register a new one.
context.Load(context.Site.UserCustomActions);
context.ExecuteQuery();
UserCustomAction customAction = context.Site.UserCustomActions.Add();
customAction.Name = "MyScriptLink";
customAction.Location = "ScriptLink";
customAction.Sequence = 100;
customAction.ScriptSrc = "~SiteCollection/Style Library/MySite/MyScript.js";
customAction.Update();
context.ExecuteQuery();
So, what's going on here? Why is my script not injected on certain pages? Why does a refresh on these exact same pages manage to fetch the file without any problem?
Found it! Three words: Minimum Download Strategy. Disable it (either through code or through site settings); it messes with your page redirect behavior within a SharePoint site.
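For reference, disabling MDS through code is a couple of CSOM lines in the same style as the snippet above (a sketch assuming SharePoint 2013+ and an existing ClientContext named context):
// Turn off Minimal Download Strategy for the web, then push the change.
Web web = context.Web;
web.EnableMinimalDownload = false;
web.Update();
context.ExecuteQuery();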
Edit: If you still want MDS enabled on your site, there is a solution

Does adding a DNS TXT record slow down loading times?

Google Webmaster Tools offers several methods to verify ownership of websites. Meta tags, DNS records, linking to a Google Analytics account, or uploading an HTML file to the server. My website has already been verified through the HTML file method, but I'd like to make my verification more resilient with Google (yes, they do actually recommend more than one method of verification). I don't want to make our usage of Google any more public than it already is, so adding meta tags is out of the picture - as well as using a Google Analytics account, as we don't utilize that for visitor reporting.
This brings up my original question, if I choose to add a DNS record in the form of the following:
TXT Record google-site-verification=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
How would adding this TXT record affect site loading times and overall performance - especially in terms of new visitors who must perform a new DNS lookup? Substantial, marginal?
Most likely marginal, but we're pinching pennies here and trying to squeeze every last bit of optimization out of our server box. Any feedback and/or your own speed tests would be more than welcome!
Typical users will not see the TXT records; they only request A (or AAAA) records to access your services. People interested in the TXT record need to ask for it explicitly:
dig TXT your.fqdn.com
So there is no effect on site load times.

How to check if my website is being accessed using a crawler?

How can I check whether a certain page is being accessed by a crawler, or by a script that fires continuous requests?
I need to make sure that the site is only being accessed from a web browser.
Thanks.
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
It would take a bit of work to engineer a solution.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or anything else legitimate, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it's accessing and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point when trying to determine whether it's humanly possible to consume the data that fast. This is tricky, and you'll most likely have to rely on cookies, since an IP can be shared by multiple users.
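A rough sketch of that third idea (not from the linked answer): count requests per visitor over a short sliding window with System.Runtime.Caching and flag rates no human would reach. The 30-requests-per-10-seconds threshold and the visitor key are assumptions you would tune:
using System;
using System.Runtime.Caching;

public static class RequestRateCheck
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    // Returns true when this visitor has made more than maxRequests
    // within the last windowSeconds.
    public static bool LooksAutomated(string visitorKey,
                                      int maxRequests = 30,
                                      int windowSeconds = 10)
    {
        string cacheKey = "rate_" + visitorKey;
        var fresh = new RequestCounter();

        // AddOrGetExisting returns null when the entry was just added,
        // otherwise it returns the counter already in the cache.
        var counter = (RequestCounter)Cache.AddOrGetExisting(
            cacheKey, fresh, DateTimeOffset.Now.AddSeconds(windowSeconds)) ?? fresh;

        lock (counter)
        {
            counter.Count++;
            return counter.Count > maxRequests;
        }
    }

    private class RequestCounter
    {
        public int Count;
    }
}
You would call LooksAutomated from somewhere like Application_BeginRequest or a base page, keyed on a session or anonymous-ID cookie, falling back to the client IP when no cookie is present.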
You can use the robots.txt file to block access for crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
User-agent: *
Disallow: /
Save that as robots.txt at the site root, and no well-behaved automated system should check your site (although badly behaved crawlers are free to ignore it).
I had a similar issue in my web application: I created some bulky data in the database for each user that browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers, because I wanted my site indexed and findable; I just wanted to avoid creating useless data and to reduce the time taken to crawl.
I solved the problem in the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (available since 2.0), which indicates whether the browser is a search engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
ASP.NET HTML:
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
ASP.NET Javascript:
<script type="text/javascript">
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>
</script>
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but maybe it is useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking a button. Well, some crawlers do process JavaScript, and it is very likely they would trigger the onclick event of a button element, but not of a non-interactive element such as a div. The following is the HTML/JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
<div
class="all rndCorner"
style="cursor:pointer;border:3;border-style:groove;text-align:center;font-size:medium;font-weight:bold"
onclick="$TodoApp.$AddSampleTree()">
Please click here to create your own set of sample tasks to do
</div>
This approach has been working impeccably until now, although crawlers could become even cleverer, maybe after reading this article :D
