Scraping AWS with Puppeteer runs locally but fails on Heroku - node.js

I know it sounds a lot like other issues here in Stackoverflow, bear with me, it's not (not that I could tell)
I have a scraping app (using Puppeteer) that I use to scrape an Amazon public page.
It works great, I've debugged it by setting the headless: false and I see it works, and it gives me back the expected result.
The same app fails on Heroku, but the problem is not with launching or using Puppeteer (I have several indications), but probably because I'm being identified as a robot.
The error returned is:
waiting for selector `#link_continue input` failed: timeout 30000ms exceeded
Important to say that the error is a generic Puppeteer error that indicates that the selector I'm waiting for just doesn't appear on-page.
I know it should as it's a selector on the first page I navigate to, and it works locally (as mentioned before) - the selector always exists if the page loads.
I had the exactly same error when I've tried to run the scraping on my local machine before setting a User-Agent header. But at that time I could use the headless:false so I saw in my eyes that I'm being rejected due to illegal operations on their page (robots-like operations) so I was redirected to an error page that didn't contain this selector on it.
For this reason, I suspect it recognizes me as a robot, but I don't know how to debug it, it drives me crazy.
Now, if you'd like to reproduce the problem:
You need to wait for the mentioned selector on this site:
https://sellercentral.amazon.com/hz/fba/profitabilitycalculator/index
and then deploy it to Heroku and try to run it maybe 2-3 times
** Two questions: **
How can I proceed from here, I'm 99.9% sure it's the same issue I had previously, but I can't verify... any suggestions?
Given that this is actually the problem, can anyone suggest an easy-to-use/deploy hosting that also allow easy VPN configuration? I think Heroku doesn't give you to do that unless you have an enterprise account
Thanks

I would like to point out that Amazon is very good at blocking IPs. It is very likely that they already blacklisted IPs of cloud services like Heroku, Azure, etc... Previously I have observed services like Cloudflare, Akamai etc... blacklisting these known IPs.
In this scenario Rotating proxies could help you to avoid getting blocked.

Related

ASP.NET Core 5 MVC web app returning bad request errors in some pages after deployment to IIS

I have tried everything. I configured Windows Server 2019 according to Microsoft documentation and I successfully deployed a .NET 5 web application to the IIS.
I can get to the login page. I can even get to the forgot password page and they show themselves fine. However when I try to do any action (send the forgot password link or login to the page) I get a "Bad Request" from the server. I haven't found a way to explain why.
I have tried several, and I mean several things found Googling around but nothing helps. This include disabling https within the .NET Core application, trying to get a detailed error page using the app.UseDeveloperExceptionPage(); instruction inside Startup, etc etc but nothing works. I always receive this page trying to execute any action:
If someone could help or point me into the right direction, I will really, REALLY appreciate it.
Thank you
PD: In case it has anything to do with the problem, the error, at least the two that I can reproduce (because I can't even log in), happens, I think (maybe don't) when redirecting to another page in Microsoft Identity.
EDIT: code was asked by one of you. Thank you.
As you see, there's nothing specific in the forgot password screen for my application. This is scaffold code from Microsoft Identity. I even edited it and just let one line of code inside it, which is the default return code anyway as follow:
public async Task<IActionResult> OnPostAsync()
{
return RedirectToPage("./ForgotPasswordConfirmation");
}
As you can see, there's nothing special with that code. Here's the html that calls it, again, is a scaffold of Microsoft Identity with little to no changes (by little, I mean, maybe some CSS and a new value of view data):
But then again, forgot password page actually shows and seems well in the front end, but immediately I try to enter my email and click enter in this page, (also, just a scaffold of Microsoft Identity):
Nothing happens. I receive the bad request. There's NO magic nor custom code here. Something silly is going on.
EDIT II: YES, locally it works perfectly. The strange behavior happens only when deployed to IIS.
EDIT III: I coded and enabled logging in my .NET Core APP and wrote that to a file, and I think I finally got, at least the error (not the reason yet):
But why?? Cookies are enabled in the server browser without avail, same issue. Someone has a better idea than disabling anti forgery rules to login and forgot password pages?
Thank you
For some reason, when I deployed the first version of my app into IIS, I thought it was a good idea to just browse it from the IIS link. Of course, in a new mounted Windows Server 2019, IE is still the default browser. I connected directly to the IP of my web app via VPN, but used Chrome this time. Guess what? All problems disappeared. Yes, it's a bad idea to try to use a modern framework like .NET Core Identity with IE.

Cypress issue with connection to the site is not secure?

I'm testing the website which have request to optimizely api to do some checking.
It request to url like https://cdn.optimizely.com/datafiles/XXX.json I suppose that this site required secure network.
I tried to open the url in cypress chrome and I get this error
This page isn’t workingcdn.optimizely.com didn’t send any data.
ERR_EMPTY_RESPONSE
But when I tried with the same network in chrome, I get fine response.
I need to be able to load the url to test my site.
Is there any solution to this matter. Please advice.
The resource returns 403 status code, that most likely indicates you don't have sufficient rights to see it:
Your Chrome outside Cypress runs might be set up differently, might already have session cookies.
You most likely need to figure out how to log into some account on the site throught Cypress.
Since cypress will not load optimizely neither nor google-analytics or any . My work around solution is by using cy.intercept() function in before/beforeEach
The code looks something like
cy.intercept('https://cdn.optimizely.com/datafiles/XXX8.json', {
"version": "4"
}
Reference: cypress-example-recipes

Why does "CORS Anywhere" suddenly returns the error: "Missing required request header. Must specify one of: origin,x-requested-with"

I use "CORS Anywhere":
https://github.com/Rob--W/cors-anywhere
Everything worked just fine until a few days ago.
Now, every request I make returns the same error:
Missing required request header. Must specify one of: origin,x-requested-with
for example:
https://cors-anywhere.herokuapp.com/https://www.instagram.com/adidas/
You can try for yourself, every request returns the same error.
I uploaded the code to my server and i still have the same problem.
I think CORS anywhere's owner got finally fed up with people abusing his service, kindly and freely provided for development purposes only:
Please see the related issue: https://github.com/Rob--W/cors-anywhere/issues/301
The demo server of CORS Anywhere (cors-anywhere.herokuapp.com) is meant to be a demo of this project. But abuse has become so common that the platform where the demo is hosted (Heroku) has asked me to shut down the server, despite efforts to counter the abuse (rate limits in #45 and #164, and blocking other forms of requests). Downtime becomes increasingly frequent (e.g. recently #300, #299, #295, #294, #287) due to abuse and its popularity.
EDIT: Turns out that I'm a few years too late, oh well, better late than never :D

azure 502 bad gateway

has anyone seen this before so I am getting a 502 bad gateway error on my app, the issue I have is that the detailed error information I am getting says my requested url is https://SOX:80/api however my site is configured to use https://sox.domain.com and the site largely works pulling the various JS files required
my app service name is SOX in the azure dashboard so I assume that is where it is picking up SOX from but I have no idea why it is using this.
So overall the issue had me perplexed... however with more testing I soon figured out what was going on.
my backend is Dotnet core Azure throwing the 502 bad gateway was its way of handling exceptions ultimately the problem was code based.
I am mentioning this purely so that it will help others
my first issue was based on cert handling it seems dotnet runs in a container that is specified by your app name as i mentioned above https://SOX:80
the below was causing my issues
sslPolicyErrors = X509StoreStoreHelper.ValidateSSLPolicy(cert.Thumbprint, cert);
after commenting this out for testing my problem went away(we are putting in a proper fix )
my second issue came from using an unsupported view in Azure SQL master.sys.master_files which again just threw a 502 bad gateway error referencing https://SOX:80
please note I have used https://SOX:80 as a reference to mask the real site.
hope this helps the next person.
Based on your description, I have checked your site (https://sox.azurewebsites.net/) and found that it contains three static files (index.html,generic.html,elements.html). I viewed your website in Chrome incognito window as follows:
I did not find any requests against https://SOX:80/api in your html page or JavaScript files. Please try to access your website in a new incognito window to isolate the cache issue or just press CTRL + F5 to refresh your current page to narrow this issue. Moreover, you need to check whether you have configured URL Rewrite. If you still could not solve this issue, you need to update your question with the details for us to reproduce this issue.

Heroku Node.js sample facebook app does not work in Google Chrome

The Heroku app i'm trying to get to work (code here):
https://github.com/heroku/facebook-template-nodejs
"Unsafe Javascript attempt to access frame with URL" errors occur when the page is loaded in chrome.
The login button takes you to facebook but does not actually log you into the app and gives the same errors.
Has anyone got this app to work on Chrome or can anyone advise as to how to patch it up?
P.S. it seems to work fine on Mozilla.
Almost certain this is a cross domain policy issue, as stated above. Generally speaking, you just need to add the correct header info to the response.
Access-Control-Allow-Origin: *
In Node, I think it is just a matter of adding it as another header in the response, using
response.writeHead
See http://nodejs.org/api/http.html#http_response_writehead_statuscode_reasonphrase_headers
Oh, and there's explicit instructions on how to do it if you're using Express. I see no reason why it can't work using plain old node then.
http://enable-cors.org/server_expressjs.html
So I looked at your link, in your case I think you just have to enter the header info prior to using any other express app methods.
As to why it works in Firefox and not Chrome, not sure. Both support CORS many versions back. Maybe you have some Chrome extension that's interfering.

Resources