crawling intranet credentials issues

crawling intranet credentials issues - search

I've been trying to crawl and index the intranet and the internet. But It doesn't work at all, I think it's due to proxy/security restrictions. I get the indexed parsed to true but the content length is -1 so it crawls nothing. Is there anyway I can put the credentiels I have on the intranet to crawl it in open search server knowing that it has only basic/Digest or NTLM authentication?
Is there anyway to configure the proxy on oss more than just on the one in the crawler tab?
I have set the credentials but it doesn't seem like oss recognises the proxy of the company so it doesn't give me the box to enter the credentials.

Since version 1.5.4, OpenSearchServer supports authentication on proxy.
Here is the GITHub issue:
https://github.com/jaeksoft/opensearchserver/issues/589
For now, the v1.5.4 is only available as nightly build.

Related

Detect that a Browser is on the Intranet

I've got a requirement to detect if a webpage is being served on the internet or intranet, i.e. assuming a url of https://accessibleanyway.com, is the phone connected to the work wifi or to something else like their home wifi or the phone network?
What different ways are there to do this?
(1) Use WebRTC to get the local ip address. Not widely supported
(2) Try to access a local web page using jsonp/cors/iframe
The problem with 2 is that the webpage is https and the local resource is likely to be http which you can't do in IE afaik. If I make the local resource https then it's via a self cert which means installing CAs on the phones (can you buy certificates for the intranet anymore?)
Any suggestions?

The problem with (2) was that the same page was trying to use http and https, and even with an iframe you get issues.
What you could do instead is start on a http loading page, use an iframe to access a local resource which you can only access if you are on the intranet, jsonp will work fine for this. Once that's worked or failed, redirect to your start page with some token in the querystring to indicate that you are on the intranet or not
NB jumping from http to https would probably have some security issues if you are on the same website (authentication cookies being initially visible), but I would have thought it would be fine if you are going to a different one
Obviously there'll be some security needed around the token as otherwise the user could just generate their own but that's a different matter which depends on individual setups. It would obviously have to be generated by a server call, otherwise someone could just read the client code.
NB I think the IP address approach is never going to work as you have no way of knowing what a companies intranet setup looks like until you go there, so it's not a generic answer

Has anyone experienced Cloudflare 403 Errors with zombie.js web scraping?/

We're looking to do some scraping on a specific URL that uses cloudflare. Has anyone experienced issues using Zombie.js/user-agents while trying to crawl cloudflare hosted sites.
Would love some help!

I am trying to interface to an API on a client's site and I am getting a 403 error indeed. The request doesn't even reach my server.
Turning security to "essentially off" did not help. The final solution was to white-list the developer machine's IP.
The error is triggered on a single URL (json serving API) with a Java client with standards compliant libraries.
Solution:
1. try to set a rule to allow direct access for that URL
2. try setting security to weaker and weaker ("essentially off")
3. if both fails: try whitelisting
4. set up an alternate non-cloudflare url (direct.domain.com)
These will of course only work if you can negotiate with the site owners.
Backup solution: use an embedded browser that you can "frame" and "remote control" or a testing framework that does the same through a plugin, and extract the content from there (if you can)
Hope this helps.

You're probably triggering one of our security features by trying to scrape a site on us. The only option, really, would be to ask the site owner to whitelist your IP(s) to override the behavior.

How to stop "only secure content is displayed" post-SSL update? CRM 2011 OnPrem

We recently updated a CRM 2011 on premise instance to use SSL i.e. https. I wasn't involved in the server part of the updates. Everything works fine except at initial login, IE displays the "Only secure content is displayed" warning. If I look at the source of the page, I see a bunch of http://... refs to microsoft sites for example. So presumably that is the source of the issue. The landing page doesn't have any custom "stuff" on it, all OOTB.
What can we do to get around this? I know we could change an IE setting but that isn't an option for us. Is there some IIS voodoo tthat we can use? Surely we don't have to go through all http refs in the web app and change them?

I know we could change an IE setting but that isn't an option for us. Is there some IIS voodoo tthat we can use?
Man, I wish. Even when we get HTML e-mails with images in them we get that message.
Because it's a security setting and it's the browser causing the error message and not the server, there really isn't much we can do about it on the server side except for serving all content over SSL.
That being said, it seems really strange that out of the box content is giving you errors.
It's possible that using a re-write rule on IIS will stop this from happening, as all the content on your server is capable of being served in SSL, but CRM is not requesting it - I'm just hoping that this doesn't break any customization and links to external services.

The site uses SSL, but Google Chrome has detected insecure content on the page

I'm using SSL on my website and it is giving me the lock with yellow triangle icon ("The site uses SSL, but Google Chrome has detected insecure content on the page.")
On clicking the lock icon it says:
Your connection to domainname is encrypted with 256-bit encryption. However, this page includes other resources which are not secure. These resources can be viewed by others while in transit, and can be modified by an attacker to change the look of the page. The connection uses TLS 1.0. The connection is encrypted using AES_256_CBC, with SHA1 for message authentication and DHE_RSA as the key exchange mechanism. The connection is not compressed.
How do I ensure I get the green lock?

You must have resources (images, stylesheets, scripts, etc...) which are embedded on the page but are not served over https. Make sure all your resources are served over https, and that warning should go away.

I had the same problem and it occoured because I included a script from Google Analytics using HTTP.
With a provider like Google, one can simply change HTTP to HTTPS - and it will work. This will not work with all providers.
If you are trying to load something from a website that you own, you will have to secure that website with HTTPS.
Google Chrome will detect this and automatically not load the insecure content (from the HTTP domain) which may take away some functionality from the website.
Certain AV/Malware softwares will also detect this and give a security warning which may frighten your visitors away.
If you are using Google Chrome, then you might not notice such a warning, because the AV/Malware software never sees this HTTP-link because it is blocked by Google Chrome.
And if you do not have the kind of AV/Malware software that detects this then you may never notice such a warning while the visitors are.
What you must do is:
Install Google Chrome and go to the website.
Click on "Tools >> JavaScript Console" and see if any warning appears. (this has been commented by Brad Koch on the question as well)
Go throught the different pages on your website and see if any errors appear - if so, then go change the URLs to HTTPS (if this is possible) or find another provider for this javascript.

make sure all references to resources such as images, js files, css files, ads, etc are served through https. If the uri to the resource is relative, e.g. /images/logo.png, then the resource is fetched from the same host and port and protocol as the page itself, in your case https. I would use fiddler to find what files get fetched over http:// when the page is loaded.

This is what I get when I go to TOOLS and then click on JAVASCRIPT CONSOLE ::
Failed to load resource chrome://thumb/https://accounts.google.com/ServiceLogin?service=chromiumsyn...
s%3A%2F%2Fwww.google.com%2Fintl%2Fen-US%2Fchrome%2Fblank.html%3Fsource%3D1
Failed to load resource chrome://thumb/http://www.xe.com/
What do I do after this?

How to prevent SSL urls from leaking info?

I was using google SSL search (https:www.google.com) with the expectation that my search would be private. However, my search for 'toasters' produced this query:
https://encrypted.google.com/search?hl=en&source=hp&q=toasters&aq=f
As you can see, my employer can still log this and see what the search was. How can I make sure that when someone searches on my site using SSL (using custom google search) their search terms isn't made visible.

The URL is sent over SSL. Of course a user can see the URL in their own browser, but it isn't visible as it transits the network. Your employer can't log it unless they are the other end of the SSL connection. If your employer creates a CA certificate and installs it in your browser, they could use a proxy to spoof Google host names, but otherwise, the traffic is secure.

HTTPS protects the entire HTTP exchange, including the URL, so the only thing someone intercepting network traffic will be able to determine is that there was communication between the browser and your site (or Google in this case). Even without the innards, that information can be useful.
Unless you have full administrative control over the systems making the queries, you should assume that anything transpiring on them can be intercepted or logged. Browsers typically store history and cache pages in files on the local disk which can be read by administrators. You also can't verify that the browser itself hasn't been recompiled with code to log sites that were visited, even in "private" mode.

Presumably your employer provides you with a PC, the software on it, the LAN connection to its own corporate network, the internet proxy and corporate firewall, maybe DNS servers, etc etc.
So you are exposed to traffic sniffing and tracing at many different levels. Even if you browse to a url over SSL TLS, you have to assume that the contents of your http session can be recorded. Do you always check that the cert in your browser is from google and not your employer's proxy? Do you know what software sits between your browser and your network card, etc.
However, if you had complete control over the client, then you could be sure that no-one external to your https conversation with google would be able to see the url you are requesting.
Google still knows what you're up to, but that's a private matter between your search engine and your conscience ;)

to add to what #erickson said, read this. SSL will protect the data between the connected parties. If you need to hide that link from the boss then disable the browser caching of the sites visited, i.e. disable or delete the history data.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

crawling intranet credentials issues - search

Since version 1.5.4, OpenSearchServer supports authentication on proxy. Here is the GITHub issue: https://github.com/jaeksoft/opensearchserver/issues/589 For now, the v1.5.4 is only available as nightly build.

Related

Detect that a Browser is on the Intranet

Has anyone experienced Cloudflare 403 Errors with zombie.js web scraping?/

How to stop "only secure content is displayed" post-SSL update? CRM 2011 OnPrem

The site uses SSL, but Google Chrome has detected insecure content on the page

How to prevent SSL urls from leaking info?

Categories

Resources