Has anyone experienced Cloudflare 403 Errors with zombie.js web scraping?

We're looking to do some scraping on a specific URL that is behind Cloudflare. Has anyone experienced issues using Zombie.js or particular user agents while trying to crawl Cloudflare-hosted sites?
Would love some help!

I am trying to interface with an API on a client's site and am indeed getting a 403 error. The request doesn't even reach my server.
Turning security down to "essentially off" did not help. The final solution was to whitelist the developer machine's IP.
The error is triggered on a single URL (a JSON-serving API) by a Java client using standards-compliant libraries.
Solution:
1. try to set a rule to allow direct access for that URL
2. try setting security weaker and weaker ("essentially off")
3. if both fail: try whitelisting the client IP
4. set up an alternate non-Cloudflare URL (direct.domain.com)
These will of course only work if you can negotiate with the site owners.
Backup solution: use an embedded browser that you can "frame" and "remote control", or a testing framework that does the same through a plugin, and extract the content from there (if you can).
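To make that backup route concrete, here is a minimal sketch using Puppeteer as the remote-controlled headless browser. Puppeteer is just one possible tool (the answer above doesn't name one), and the URL and user-agent string are placeholders:

```typescript
// Minimal sketch: drive a real (headless) browser and pull the response body out of it.
// Puppeteer is an assumption here; any embedded/remote-controlled browser would do.
import puppeteer from "puppeteer";

async function fetchViaBrowser(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // A realistic user agent avoids some of the cruder bot heuristics.
    await page.setUserAgent(
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    );
    await page.goto(url, { waitUntil: "networkidle2" });
    // For a JSON-serving endpoint, the body text is the payload itself.
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}

// Placeholder URL - replace with the endpoint you are actually allowed to access.
fetchViaBrowser("https://example.com/api/data.json").then(console.log);
```

Note that this only helps against basic checks; Cloudflare's JavaScript challenges and bot detection can still block automated browsers, which is why the whitelisting options above remain the reliable path.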
Hope this helps.

You're probably triggering one of our security features by trying to scrape a site on us. The only option, really, would be to ask the site owner to whitelist your IP(s) to override the behavior.

Related

Detect that a Browser is on the Intranet

I've got a requirement to detect whether a web page is being viewed from the internet or the intranet, i.e. given a URL like https://accessibleanyway.com, is the phone connected to the work WiFi, or to something else like home WiFi or the mobile network?
What different ways are there to do this?
(1) Use WebRTC to get the local IP address. Not widely supported.
(2) Try to access a local web page using JSONP/CORS/an iframe.
The problem with (2) is that the web page is HTTPS and the local resource is likely to be HTTP, which you can't mix in IE as far as I know. If I make the local resource HTTPS then it's via a self-signed certificate, which means installing CAs on the phones (can you even buy certificates for an intranet anymore?)
Any suggestions?
The problem with (2) was that the same page was trying to use HTTP and HTTPS, and even with an iframe you get issues.
What you could do instead is start on an HTTP loading page and use an iframe to access a local resource that is only reachable from the intranet; JSONP will work fine for this. Once that has succeeded or failed, redirect to your start page with a token in the query string indicating whether or not you are on the intranet.
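A rough sketch of that probe (the hostname intranet.example.local and the /start redirect are purely illustrative; adapt them to your setup):

```typescript
// Minimal sketch of the JSONP probe: try to load a script from a host that is
// only resolvable/reachable on the intranet, then redirect with the result.
function probeIntranet(timeoutMs = 3000): void {
  let settled = false;

  // The intranet endpoint is expected to respond with: intranetProbeCallback();
  (window as any).intranetProbeCallback = () => {
    settled = true;
    window.location.href = "/start?intranet=1"; // token handling omitted here
  };

  const script = document.createElement("script");
  script.src = "http://intranet.example.local/probe?callback=intranetProbeCallback";
  script.onerror = () => {
    if (!settled) window.location.href = "/start?intranet=0";
  };
  document.head.appendChild(script);

  // If nothing answers in time, assume we are not on the intranet.
  setTimeout(() => {
    if (!settled) window.location.href = "/start?intranet=0";
  }, timeoutMs);
}

probeIntranet();
```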
NB jumping from HTTP to HTTPS would probably have some security issues if you are staying on the same website (authentication cookies being initially visible), but I would have thought it would be fine if you are going to a different one.
Obviously there will need to be some security around the token, as otherwise the user could just generate their own, but that's a different matter which depends on individual setups. It would obviously have to be generated by a server call, otherwise someone could just read the client code.
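The answer leaves the exact mechanism open; purely as an illustration, here is one way a server could issue and verify a short-lived signed token (the secret name and token layout are made up for the sketch):

```typescript
// Hedged sketch: HMAC-sign the intranet flag plus a timestamp on the server so
// users cannot forge or reuse the querystring token.
import { createHmac, timingSafeEqual } from "crypto";

const SECRET = process.env.INTRANET_TOKEN_SECRET ?? "change-me";

export function issueToken(onIntranet: boolean): string {
  const payload = `${onIntranet ? 1 : 0}.${Date.now()}`;
  const sig = createHmac("sha256", SECRET).update(payload).digest("hex");
  return `${payload}.${sig}`;
}

export function verifyToken(token: string, maxAgeMs = 60_000): boolean {
  const [flag, ts, sig] = token.split(".");
  if (!flag || !ts || !sig) return false;
  const expected = createHmac("sha256", SECRET).update(`${flag}.${ts}`).digest("hex");
  return (
    sig.length === expected.length &&
    timingSafeEqual(Buffer.from(sig), Buffer.from(expected)) &&
    Date.now() - Number(ts) < maxAgeMs // reject stale tokens
  );
}
```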
NB I think the IP address approach is never going to work, as you have no way of knowing what a company's intranet setup looks like until you go there, so it's not a generic answer.

Hide referral information when my site users click on external links

I apologize for my lack of knowledge on how the intricacies of the web work ahead of time.
I run a fairly large deal site (let's call it dealsite.com) and we send a lot of traffic to Amazon.com. Is there any way for me to hide from Amazon that the users are coming from dealsite.com? I do not want Amazon to know that we (dealsite.com) are the ones sending the traffic.
Maybe strip certain cookies?
Send outbound traffic through a proxy?
I am not doing anything illegal and these are real users not bots.
By using rel="noreferrer" on your links, you can prevent Amazon from learning that their traffic is coming from your site, and you don't need to set up a proxy, VPN, or cookie redirects.
Browsers generally send the referring page along with the request for the new page in the HTTP Referer request header, and that's how sites track where their visitors come from. So, for example, a user would click through to Amazon.com from Dealsite.com, and the request would include an HTTP Referer telling Amazon.com that the user was linked from Dealsite.com.
To prevent web sites like Amazon from learning that their traffic came from your site, stop your links from sending the HTTP Referer. In HTML5, just add rel="noreferrer" to your links, and referral information will not be sent to the linked site. The noreferrer link type is only supported in newer browsers, so I suggest using knu's noreferrer polyfill to make sure it works on older browsers too.
So far this will prevent referrer information from being sent for 99.9% of your users - the only ones who will still send it are users on old browsers with JavaScript disabled. To get to 100%, you could require JavaScript to be enabled in order to click those particular links.
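For links that get rendered without the attribute, here is a small client-side sketch (illustrative only, not the polyfill itself) that adds it to outbound anchors:

```typescript
// Minimal sketch: add rel="noreferrer" to every outbound link so the browser
// omits the Referer header. Putting the attribute directly in the HTML does the
// same thing; this is just a programmatic variant for markup you don't control.
document.querySelectorAll<HTMLAnchorElement>('a[href^="http"]').forEach((a) => {
  if (new URL(a.href).origin !== window.location.origin) {
    a.relList.add("noreferrer"); // preserves existing rel tokens such as "nofollow"
  }
});
```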
Disclaimer: this is not the thorough idea you're looking for. I ran out of space in the comments so I posted it as an answer. A couple of possible solutions come to mind.
Proxy servers: multiple distributed proxy servers, to be specific. You can round-robin your users through these servers before hitting Amazon, so that the inbound traffic to Amazon from dealsite.com keeps rotating. The disadvantage is that this will be slow, depending on where the proxy servers reside, so it's not the most ideal solution for an e-commerce site, but it works. The major advantage is that the implementation is very simple.
VPN tunneling: extremely similar to the proxy servers. Tunnel to another server over VPN and send the redirect to Amazon from there. You'll get a new (non-dealsite.com) IP from the VPN server on that network, and your original IP will be masked.
Redirects from the user (still a work in progress): for this one I was thinking you could store the info you need from dealsite.com in a cookie and then have the user's machine redirect to Amazon by itself. That way the inbound traffic to Amazon comes from the user's IP and not dealsite.com's. If you need to get back to the dealsite session from Amazon, you could use the previously saved cookie to do so.
I'll add to this answer if I find something better.
Edit 1: A few more hours of research brought me to the Tor project. This might be useful, but be wary: many security experts advise against using Tor. See here

How to stop the "Only secure content is displayed" warning post-SSL update? CRM 2011 OnPrem

We recently updated a CRM 2011 on premise instance to use SSL i.e. https. I wasn't involved in the server part of the updates. Everything works fine except at initial login, IE displays the "Only secure content is displayed" warning. If I look at the source of the page, I see a bunch of http://... refs to microsoft sites for example. So presumably that is the source of the issue. The landing page doesn't have any custom "stuff" on it, all OOTB.
What can we do to get around this? I know we could change an IE setting but that isn't an option for us. Is there some IIS voodoo that we can use? Surely we don't have to go through all the http refs in the web app and change them?
I know we could change an IE setting but that isn't an option for us. Is there some IIS voodoo that we can use?
Man, I wish. Even when we get HTML e-mails with images in them we get that message.
Because it's a security setting and it's the browser causing the error message and not the server, there really isn't much we can do about it on the server side except for serving all content over SSL.
That being said, it seems really strange that out of the box content is giving you errors.
It's possible that using a rewrite rule in IIS will stop this from happening, since all the content on your server is capable of being served over SSL but CRM is not requesting it that way - I'm just hoping that this doesn't break any customizations or links to external services.

Crawling intranet credentials issues

I've been trying to crawl and index the intranet and the internet, but it doesn't work at all; I think it's due to proxy/security restrictions. The pages show as indexed and parsed (true), but the content length is -1, so nothing is actually crawled. Is there any way to supply the credentials I have for the intranet so that OpenSearchServer can crawl it, given that it only has Basic/Digest or NTLM authentication?
Is there any way to configure the proxy in OSS beyond the setting in the crawler tab?
I have set the credentials, but it doesn't seem like OSS recognises the company proxy, so it never gives me the box to enter the credentials.
Since version 1.5.4, OpenSearchServer has supported proxy authentication.
Here is the GitHub issue:
https://github.com/jaeksoft/opensearchserver/issues/589
For now, v1.5.4 is only available as a nightly build.

How to enable custom URLs in Google Chrome?

For local development I'm running a local web server with virtual hosts to manage multiple web projects, each requiring its own URL. Normally I use URLs like myproject.com.local, and the real project is located at myproject.com. Everything works fine in Safari, IE or Firefox, but Google Chrome throws a 404. As far as I know Chrome has some kind of intelligent address bar. Is there any possibility of getting it working with all domains?
Best Regards,
Bernd
I think it should work with all domains, as long as your workstation's DNS can resolve the name to an IP address. Also, check if you have any proxy settings in Chrome; sometimes it helps to tick the 'Bypass proxy for local domains' checkbox (somewhere in the settings).
Also make sure that when you request non-standard domains or port numbers you put http:// in front of your URL.
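If you want to double-check that resolution from a script rather than the browser, here is a tiny sketch using Node's dns module (assuming Node.js is installed; the hostname is a placeholder):

```typescript
// Check whether the local-development hostname resolves via the OS resolver
// (which includes the hosts file).
import { lookup } from "dns";

lookup("myproject.com.local", (err, address) => {
  if (err) {
    console.error("Name does not resolve locally:", err.code);
  } else {
    console.log("Resolves to", address); // typically 127.0.0.1 for a local vhost
  }
});
```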
Good luck.
