Setting a User Agent in scrape-it - node.js

I'm using scrape-it in my node.js scraping tool (for identifying proper keyword usage), but some websites identify it as a bot and return no content. Is there a way to configure a known User-Agent header for the GET request to get past the block?

You can set the headers, including User-Agent, by passing a request options object (instead of just the URL) as the first argument to scrapeIt:
scrapeIt({
    url: "http://example.com",
    headers: { "User-Agent": "known-user-agent-of-choice" }
}, {
    // some scrapeHTML options ...
})
.then(
    // some code ...
);
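A fuller, runnable sketch of the same idea (the h1 selector, the UA string, and the logged shape are illustrative assumptions, not part of the original answer):
const scrapeIt = require("scrape-it");

scrapeIt({
    url: "http://example.com",
    headers: {
        // Many sites block the default library user agent; a browser-style string often helps.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    }
}, {
    title: "h1"   // scrapeHTML option: take the text of the first <h1> as "title"
}).then(page => {
    console.log(page);   // the scraped result (exact shape depends on your scrape-it version)
});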

Related

getting 403 error while sending file to GitHub via REST using nodejs

I want to send multiple files to a GitHub repository via nodejs. I tried several approaches and ended up using the node-rest-client module. I tried the code below to send a sample file to a repository called 'metadata', but after the POST I get the error message "Request forbidden by administrative rules. Please make sure your request has a User-Agent header". Please let me know if anyone has faced this error before and how to get rid of it.
convertval = "somedata";
var dataObj = {
"message": "my commit message",
"committer": {
"name": "Scott Chacon",
"email": "ravindra.devagiri#gmail.com"
},
"content": "bXkgbmV3IGZpbGUgY29udGVudHM="
}
debugger;
var Client = require('node-rest-client').Client;
var client = new Client()
var args = {
data: dataObj,
headers: { 'Content-Type': 'application/json' },
};
client.post("https://api.github.com/repos/metadata/contents", args, function (data, response) {
console.log("file send: True : " + data);
});
According to the GitHub REST API documentation:
All API requests MUST include a valid User-Agent header. Requests with
no User-Agent header will be rejected.
First of all, you need to set a 'User-Agent' header on your request; any descriptive, non-empty value works (GitHub suggests your username or the name of your application). Refer to this link.
Second, the endpoint you are trying to call may require authentication. Generate a personal access token from here and add it to your request headers: 'Authorization': 'token <personal_access_token>'.
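For example, a sketch of the same node-rest-client call with both headers added (the user-agent value, the token, and the full repository path are placeholders):
var Client = require('node-rest-client').Client;
var client = new Client();

var args = {
    data: dataObj,
    headers: {
        'Content-Type': 'application/json',
        'User-Agent': 'my-github-uploader',               // any descriptive, non-empty value
        'Authorization': 'token <personal_access_token>'
    }
};

client.post("https://api.github.com/repos/<owner>/<repo>/contents/<path>", args, function (data, response) {
    console.log(response.statusCode, data);
});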
If you're using Git extensively in your code, I suggest using Nodegit.
Edit:
I don't think sending multiple files in a single request is possible with the 'Contents' endpoint group (link).
You can check out the Git Data API instead (as discussed here), sketched below.
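A very rough sketch of that Git Data flow for committing several files in one commit, using node-fetch here purely for illustration (owner, repo, branch, file paths, and token are placeholders; the endpoints are from the GitHub REST docs):
const fetch = require("node-fetch");

const API = "https://api.github.com/repos/<owner>/<repo>";
const HEADERS = {
    "User-Agent": "my-github-uploader",
    "Authorization": "token <personal_access_token>",
    "Content-Type": "application/json"
};
const gh = (path, options) =>
    fetch(API + path, Object.assign({ headers: HEADERS }, options)).then(r => r.json());

(async () => {
    // 1. Latest commit on the branch, and the tree it points to
    const ref = await gh("/git/refs/heads/master");
    const baseCommit = await gh("/git/commits/" + ref.object.sha);

    // 2. One blob per file you want to commit
    const blob = await gh("/git/blobs", {
        method: "POST",
        body: JSON.stringify({ content: "file contents", encoding: "utf-8" })
    });

    // 3. A new tree containing those blobs, layered on top of the old tree
    const tree = await gh("/git/trees", {
        method: "POST",
        body: JSON.stringify({
            base_tree: baseCommit.tree.sha,
            tree: [{ path: "metadata/file1.txt", mode: "100644", type: "blob", sha: blob.sha }]
        })
    });

    // 4. A commit pointing at the new tree, then move the branch ref to it
    const commit = await gh("/git/commits", {
        method: "POST",
        body: JSON.stringify({ message: "my commit message", tree: tree.sha, parents: [ref.object.sha] })
    });
    await gh("/git/refs/heads/master", { method: "PATCH", body: JSON.stringify({ sha: commit.sha }) });
})();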

postman POST request doesn't work in nodejs (accessing 3rd party API)

I've been playing around with some web scraping, but I've run into an issue I can't figure out: using a nodejs server (on my local computer) I cannot get past a permission error barring me from accessing the data. What confuses me most is that using the Chrome extension "Postman" I don't run into the permission errors, but using the code generated by Postman, I do (as well as with variations of my own scratch code).
Do I have to be using a live server? Do I need to include some extra items in the headers that aren't being put there by Postman? Is there some layer of security around the API that for some reason Postman has access to but a local machine doesn't?
Any light that can be shed would be of use. Note that there is no public documentation of the SmithsFoodAndDrug API (that I can find), so there aren't necessarily API keys to use. But the fact that Postman can access the information makes me think I should be able to on a node server without any special authentication set up.
In Summary:
I'm looking at SmithsFoodAndDrug product information, and found the API where they are grabbing information from.
I figured out the headers needed in order to get local price information on products (on top of the json body format for the POST request)
Using postman I can generate the POST request and retrieve the desired API results
Using nodejs (and the code generated by Postman to replicate the request) with both the 'request' module and the standard 'http' module, I receive permission errors from the server.
Details: (assume gathering data on honeycrisp apples (0000000003283) with division-id of 706 and store-id of 00144)
http://www.smithsfoodanddrug.com/products/api/products/details
Headers are 'division-id' and 'store-id'. The body is in the format {"upcs":["XXX"],"filterBadProducts":false}, where XXX is the specific product code.
The following is a portion of the JSON response (which is what I want):
{"products": [
{
"brandName": null,
"clickListItem": true,
"countryOfOrigin": "Check store for country of origin details",
"customerFacingSize": "price $2.49/lb",
...
"calculatedPromoPrice": "2.49",
"calculatedRegularPrice": "2.99",
"calculatedReferencePrice": null,
"displayTemplate": "YellowTag",
"division": "706",
"minimumAdvertisedPrice": null,
"orderBy": "Unit",
"regularNFor": "1",
"referenceNFor": "1",
"referencePrice": null,
"store": "00144",
"endDate": "2018-09-19T00:00:00",
"priceNormal": "2.55",
"priceSale": "2.12",
"promoDescription": "About $2.12 for each",
"promoType": null,
...
"upc": "0000000003283",
...
}
],
"coupons": {},
"departments": [],
"priceHasError": false,
"totalCount": 1 }
When using the code given by Postman to replicate the request, I get the error 'You don't have permission to access "http://www.smithsfoodanddrug.com/products/api/products/details" on this server. Reference #18.1f3de93f.1536955806.1989a2b1.'
// Code given by postman
var request = require("request");

var options = {
    method: 'POST',
    url: 'http://www.smithsfoodanddrug.com/products/api/products/details',
    headers: {
        'postman-token': 'ad9638c1-1ea5-1afc-925e-fe753b342f91',
        'cache-control': 'no-cache',
        'store-id': '00144',
        'division-id': '706',
        'content-type': 'application/json'
    },
    body: { upcs: ['0000000003283'], filterBadProducts: false },
    json: true
};

request(options, function (error, response, body) {
    if (error) throw new Error(error);
    console.log(body);
});
Change the headers so that only the store-id and division-id are sent:
headers:
{
    'store-id': '00144',
    'division-id': '706'
    //'content-type': 'application/json'
}
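Putting that together, the full request might look like this; the added User-Agent header is an extra assumption beyond the original answer, included because a browser-style UA often helps with bot blocking:
var request = require("request");

var options = {
    method: 'POST',
    url: 'http://www.smithsfoodanddrug.com/products/api/products/details',
    headers: {
        'store-id': '00144',
        'division-id': '706',
        'User-Agent': 'Mozilla/5.0'   // assumption: mimic a browser UA
    },
    body: { upcs: ['0000000003283'], filterBadProducts: false },
    json: true
};

request(options, function (error, response, body) {
    if (error) throw new Error(error);
    console.log(body);
});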

Request a GET failed in Node

I am making a GET request to a 3rd party API service from my node back-end.
I am getting a response of 403 forbidden:
request("http://www.giantbomb.com/api/search/?api_key=my_api_key&field_list=name,image,id&format=json&limit=1&query=street%20fighter%203&resources=game",(err,res,body) => {
console.log(body);
})
Querying the same request in my browser returns the expected results.
Any idea why this can happen?
EDIT:
Logging the response body, I receive the following page (without the JS):
<h1>Wordpress RSS Reader, Anonymous Bot or Scraper Blocked</h1>
<p>
Sorry we do not allow WordPress plugins to scrape our site. They tend to be used maliciously to steal our content. We do not allow scraping of any kind.
You can load our RSS feeds using any other reader but you may not download our content.
<a href='/feeds'>Click here more information on our feeds</a>
</p>
<p>
Or you're running a bot that does not provide a unique user agent.
Please provide a UNIQUE user agent that describes you. Do not use a default user agent like "PHP", "Java", "Ruby", "wget", "curl" etc.
You MUST provide a UNIQUE user agent. ...and for God's sake don't impersonate another bot like Google Bot that will for sure
get you permanently banned.
</p>
<p>
Or.... maybe you're running an LG Podcast player written by a 10 year old. Either way, Please stop doing that.
</p>
This service requires a User-Agent header; see this example:
const rp = require('request-promise');

const options = {
    method: 'GET',
    uri: 'http://www.giantbomb.com/api/search/?api_key=my_api_key&field_list=name,image,id&format=json&limit=1&query=street%20fighter%203&resources=game',
    headers: { 'User-Agent': 'test' },
    json: true
};

rp(options)
    .then(result => {
        // process result
    })
    .catch(e => {
        // handle error
    });
Include a User-Agent header in the request like this:
var options = {
    url: 'http://www.giantbomb.com/api/search/?api_key=my_api_key&field_list=name,image,id&format=json&limit=1&query=street%20fighter%203&resources=game',
    headers: {
        'User-Agent': 'request'
    }
};

request(options, (err, res, body) => {
    console.log(body);
});

IBM Social Business SmartCloud OAuth 2.0 POSTing new events

I am trying to post an event to my IBM Social Business SmartCloud account. I have been able to grant access to the application and get the access and refresh tokens, but when posting the new event I get a 401 error: "No 'Access-Control-Allow-Origin' header is present on the requested resource."
function postEvent() {
    var postString = '{' +
        '"actor": {' +
        '"id": "@me"' +
        '},' +
        '"verb": "post",' +
        '"title": "${share}",' +
        '"content": "This event is my <b>first content</b>",' +
        '"updated": "2012-01-01T12:00:00.000Z",' +
        '"object": {' +
        '"summary": "My Summary",' +
        '"objectType": "note",' +
        '"id": "someid",' +
        '"displayName": "My displayName",' +
        '"url": "mydomain.com"' +
        '}}';
    $.ajax({
        url: 'https://apps.na.collabserv.com/connections/opensocial/basic/rest/activitystreams/@me/@all?format=json&access_token=<access_token>',
        data: postString,
        contentType: 'application/json',
        method: 'POST',
        dataType: 'json',
        headers: {
            // Set any custom headers here.
            // If you set any non-simple headers, your server must include these
            // headers in the 'Access-Control-Allow-Headers' response header.
            'Content-Type': 'application/json',
            'Origin': 'https://mydomain.com/',
            'Access-Control-Allow-Headers': '*',
            'Access-Control-Allow-Origin': '*'
        }
    }).done(function (data) {
        console.log(data);
    });
}
So this is the proxy method using file_get_contents that I used to get it working in PHP; cURL did not work.
$post = file_get_contents('https://apps.na.collabserv.com/connections/opensocial/basic/rest/activitystreams/@me/@all?format=json', FALSE, stream_context_create(array(
    'http' => array(
        'method' => 'POST',
        'header' => "Authorization: Bearer $access_token\r\n" .
                    "Content-type: application/json\r\n" .
                    "Content-length: " . strlen($json_data) . "\r\n",
        'content' => $json_data,
    ),
)));
Now the issue is that other people cannot see my posting even though we're following each other.
I got the JSON structure for embedding a webpage with og tags for a video into my IBM SB activity stream, so the video opens directly in my stream with a thumbnail without linking out in a new window:
$json_data = '{"content":"https://mydomain.com/somevideo/",
"attachments":[{"objectType":"link",
"displayName":"My Display Name",
"url":"https://mydomain.com/somevideo/",
"summary":"My summary",
"image":{
"url":"{thumbnail}/api/imageProxy?url=http%3a%2f%2fmydomain.com%2fsomevideo%2fthumbnail.jpg",
"alt":"My Display Name"
},
"connections":{
"video":{
"connections":{"mime-type":"application/x-shockwave-flash"},
"width":"853",
"height":"480",
"url":"https://mydomain.com/somevideo/"
}
}
}
]
}';
And you post to this URL:
https://apps.na.collabserv.com/connections/opensocial/rest/ublog/@me/@all?format=json
As long as your code is hosted on a different domain than apps.na.collabserv.com, you won't be able to access the REST API with JavaScript alone.
In this situation, it is the browser blocking you. The cross-origin headers won't help, because the backend is not configured to enable CORS requests.
You can work around this by accessing the REST API through an AJAX proxy deployed on the same domain as your page, as sketched below.
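A minimal sketch of such a proxy in Node/Express (the module choice, route name, and token variable are illustrative assumptions, not from the original answer):
const express = require("express");
const fetch = require("node-fetch");

const app = express();
app.use(express.json());

// The page calls /proxy/activitystream on its own origin; the server forwards the
// request to IBM SmartCloud with the OAuth token, so the browser never hits CORS.
app.post("/proxy/activitystream", async (req, res) => {
    const upstream = await fetch(
        "https://apps.na.collabserv.com/connections/opensocial/basic/rest/activitystreams/@me/@all?format=json",
        {
            method: "POST",
            headers: {
                "Authorization": "Bearer " + process.env.ACCESS_TOKEN, // your OAuth access token
                "Content-Type": "application/json"
            },
            body: JSON.stringify(req.body)
        }
    );
    res.status(upstream.status).json(await upstream.json());
});

app.listen(3000);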

Modify HTTP responses from a Chrome extension

Is it possible to create a Chrome extension that modifies HTTP response bodies?
I have looked in the Chrome Extension APIs, but I haven't found anything to do this.
In general, you cannot change the response body of an HTTP request using the standard Chrome extension APIs.
This feature is being requested at 104058: WebRequest API: allow extension to edit response body. Star the issue to get notified of updates.
If you want to edit the response body for a known XMLHttpRequest, inject code via a content script to override the default XMLHttpRequest constructor with a custom (full-featured) one that rewrites the response before triggering the real event. Make sure that your XMLHttpRequest object is fully compliant with Chrome's built-in XMLHttpRequest object, or AJAX-heavy sites will break.
In other cases, you can use the chrome.webRequest or chrome.declarativeWebRequest APIs to redirect the request to a data:-URI. Unlike the XHR-approach, you won't get the original contents of the request. Actually, the request will never hit the server because redirection can only be done before the actual request is sent. And if you redirect a main_frame request, the user will see the data:-URI instead of the requested URL.
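For the webRequest route, a minimal Manifest V2 sketch of the data:-URI redirect; the match pattern and replacement markup are placeholders, and the extension needs the webRequest, webRequestBlocking, and matching host permissions:
chrome.webRequest.onBeforeRequest.addListener(
    function (details) {
        // Serve a data:-URI instead of the real document; the request never reaches the server.
        return {
            redirectUrl: "data:text/html;charset=utf-8," + encodeURIComponent("<h1>Replaced body</h1>")
        };
    },
    { urls: ["*://example.com/*"], types: ["main_frame"] },
    ["blocking"]
);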
I just released a Devtools extension that does just that :)
It's called tamper, it's based on mitmproxy and it allows you to see all requests made by the current tab, modify them and serve the modified version next time you refresh.
It's a pretty early version but it should be compatible with OS X and Windows. Let me know if it doesn't work for you.
You can get it here http://dutzi.github.io/tamper/
How this works
As @Xan commented below, the extension communicates through Native Messaging with a Python script that extends mitmproxy.
The extension lists all requests using chrome.devtools.network.onRequestFinished.
When you click one of the requests, it downloads its response using the request object's getContent() method, and then sends that response to the Python script, which saves it locally.
It then opens the file in an editor (using call for OS X or subprocess.Popen for Windows).
The Python script uses mitmproxy to listen to all communication made through that proxy; if it detects a request for a file that was saved, it serves the saved file instead.
I used Chrome's proxy API (specifically chrome.proxy.settings.set()) to set a PAC file as the proxy setting. That PAC file redirects all communication to the Python script's proxy, roughly as sketched below.
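That proxy setup might look roughly like this (the local proxy address is a placeholder for wherever the mitmproxy-based script listens):
chrome.proxy.settings.set({
    value: {
        mode: "pac_script",
        pacScript: {
            // Send everything through the local mitmproxy instance.
            data: 'function FindProxyForURL(url, host) { return "PROXY 127.0.0.1:8080"; }'
        }
    },
    scope: "regular"
}, function () {
    console.log("Proxy settings applied");
});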
One of the greatest things about mitmproxy is that it can also modify HTTPS communication. So you have that as well :)
Like @Rob W said, I've overridden XMLHttpRequest, and this is the result for modifying any XHR requests on any site (working like a transparent modification proxy):
var _open = XMLHttpRequest.prototype.open;
window.XMLHttpRequest.prototype.open = function (method, URL) {
    var _onreadystatechange = this.onreadystatechange,
        _this = this;
    _this.onreadystatechange = function () {
        // catch only completed 'api/search/universal' requests
        if (_this.readyState === 4 && _this.status === 200 && ~URL.indexOf('api/search/universal')) {
            try {
                //////////////////////////////////////
                // THIS IS ACTIONS FOR YOUR REQUEST //
                // EXAMPLE:                         //
                //////////////////////////////////////
                var data = JSON.parse(_this.responseText); // {"fields": ["a","b"]}
                if (data.fields) {
                    data.fields.push('c', 'd');
                }
                // rewrite responseText
                Object.defineProperty(_this, 'responseText', { value: JSON.stringify(data) });
                /////////////// END //////////////////
            } catch (e) {}
            console.log('Caught! :)', method, URL /*, _this.responseText*/);
        }
        // call original callback
        if (_onreadystatechange) _onreadystatechange.apply(this, arguments);
    };
    // detect any onreadystatechange changing
    Object.defineProperty(this, "onreadystatechange", {
        get: function () {
            return _onreadystatechange;
        },
        set: function (value) {
            _onreadystatechange = value;
        }
    });
    return _open.apply(_this, arguments);
};
For example, this code can be used successfully in Tampermonkey to make such modifications on any site :)
Yes. It is possible with the chrome.debugger API, which grants an extension access to the Chrome DevTools Protocol, which supports HTTP interception and modification through its Network API.
This solution was suggested by a comment on Chrome Issue 487422:
For anyone wanting an alternative which is doable at the moment, you can use chrome.debugger in a background/event page to attach to the specific tab you want to listen to (or attach to all tabs if that's possible, haven't tested all tabs personally), then use the network API of the debugging protocol.
The only problem with this is that there will be the usual yellow bar at the top of the tab's viewport, unless the user turns it off in chrome://flags.
First, attach a debugger to the target:
chrome.debugger.getTargets((targets) => {
    let target = /* Find the target. */;
    let debuggee = { targetId: target.id };
    chrome.debugger.attach(debuggee, "1.2", () => {
        // TODO
    });
});
Next, send the Network.setRequestInterceptionEnabled command, which will enable interception of network requests:
chrome.debugger.getTargets((targets) => {
    let target = /* Find the target. */;
    let debuggee = { targetId: target.id };
    chrome.debugger.attach(debuggee, "1.2", () => {
        chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
    });
});
Chrome will now begin sending Network.requestIntercepted events. Add a listener for them:
chrome.debugger.getTargets((targets) => {
    let target = /* Find the target. */;
    let debuggee = { targetId: target.id };
    chrome.debugger.attach(debuggee, "1.2", () => {
        chrome.debugger.sendCommand(debuggee, "Network.setRequestInterceptionEnabled", { enabled: true });
    });
    chrome.debugger.onEvent.addListener((source, method, params) => {
        if (source.targetId === target.id && method === "Network.requestIntercepted") {
            // TODO
        }
    });
});
In the listener, params.request will be the corresponding Request object.
Send the response with Network.continueInterceptedRequest:
Pass a base64 encoding of your desired HTTP raw response (including HTTP status line, headers, etc!) as rawResponse.
Pass params.interceptionId as interceptionId.
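Putting those two parameters together, a rough sketch of the reply (the replacement body here is a placeholder):
chrome.debugger.onEvent.addListener((source, method, params) => {
    if (method === "Network.requestIntercepted") {
        // Build a complete raw HTTP response, then base64-encode it.
        const body = JSON.stringify({ hello: "world" });   // placeholder replacement body
        const rawResponse =
            "HTTP/1.1 200 OK\r\n" +
            "Content-Type: application/json\r\n" +
            "\r\n" +
            body;
        chrome.debugger.sendCommand(source, "Network.continueInterceptedRequest", {
            interceptionId: params.interceptionId,
            rawResponse: btoa(rawResponse)
        });
    }
});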
Note that I have not tested any of this, at all.
While Safari has this feature built-in, the best workaround I've found for Chrome so far is to use Cypress's intercept functionality. It cleanly allows me to stub HTTP responses in Chrome. I call cy.intercept, then cy.visit(<URL>), and it intercepts and provides a stubbed response for a specific request the visited page makes. Here's an example:
cy.intercept('GET', '/myapiendpoint', {
    statusCode: 200,
    body: {
        myexamplefield: 'Example value',
    },
})
cy.visit('http://localhost:8080/mytestpage')
Note: You may also need to configure Cypress to disable some Chrome-specific security settings.
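If you do need that, the relevant option is chromeWebSecurity; a minimal cypress.config.js sketch (Cypress 10+ style, an assumption beyond the original answer):
const { defineConfig } = require("cypress");

module.exports = defineConfig({
    chromeWebSecurity: false,   // relaxes Chrome-specific same-origin restrictions during tests
});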
The original question was about Chrome extensions, but I notice that it has branched out into different methods, going by the upvotes on answers that have non-Chrome-extension methods.
Here's a way to kind of achieve this with Puppeteer. Note the caveat mentioned on the originalContent line: the fetched response may be different from the original response in some circumstances.
With Node.js:
npm install puppeteer node-fetch@2.6.7
Create this main.js:
const puppeteer = require("puppeteer");
const fetch = require("node-fetch");

(async function () {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    page.on('request', async (request) => {
        let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
        console.log("REQUEST:", url);
        let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
        if (url === "https://example.com") {
            request.respond({
                status: 200,
                contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
                body: originalContent.replace(/example/gi, "TESTING123"),
            });
        } else {
            request.continue();
        }
    });
    await page.goto("https://example.com");
})();
Run it:
node main.js
With Deno:
Install Deno:
curl -fsSL https://deno.land/install.sh | sh # linux, mac
irm https://deno.land/install.ps1 | iex # windows powershell
Download Chrome for Puppeteer:
PUPPETEER_PRODUCT=chrome deno run -A --unstable https://deno.land/x/puppeteer@16.2.0/install.ts
Create this main.js:
import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', async (request) => {
    let url = request.url().replace(/\/$/g, ""); // remove trailing slash from urls
    console.log("REQUEST:", url);
    let originalContent = await fetch(url).then(r => r.text()); // TODO: Pass request headers here for more accurate response (still not perfect, but more likely to be the same as the "actual" response)
    if (url === "https://example.com") {
        request.respond({
            status: 200,
            contentType: 'text/html; charset=utf-8', // For JS files: 'application/javascript; charset=utf-8'
            body: originalContent.replace(/example/gi, "TESTING123"),
        });
    } else {
        request.continue();
    }
});
await page.goto("https://example.com");
Run it:
deno run -A --unstable main.js
(I'm currently running into a TimeoutError with this that will hopefully be resolved soon: https://github.com/lucacasonato/deno-puppeteer/issues/65)
Yes, you can modify HTTP response in a Chrome extension. I built ModResponse (https://modheader.com/modresponse) that does that. It can record and replay your HTTP response, modify it, add delay, and even use the HTTP response from a different server (like from your localhost)
The way it works is to use the chrome.debugger API (https://developer.chrome.com/docs/extensions/reference/debugger/), which gives you access to Chrome DevTools Protocol (https://chromedevtools.github.io/devtools-protocol/). You can then intercept the request and response using the Fetch Domain API (https://chromedevtools.github.io/devtools-protocol/tot/Fetch/), then override the response you want. (You can also use the Network Domain, though it is deprecated in favor of the Fetch Domain)
The nice thing about this approach is that it will just work out of the box. No desktop app installation required. No extra proxy setup. However, it will show a debugging banner in Chrome (which you can hide by passing an argument to Chrome), and it is significantly more complicated to set up than the other APIs.
For examples on how to use the debugger API, take a look at the chrome-extensions-samples: https://github.com/GoogleChrome/chrome-extensions-samples/tree/main/mv2-archive/api/debugger/live-headers
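For orientation, a rough sketch of the Fetch-domain flow described above (the tab id, URL pattern, and replacement body are placeholders; error handling is omitted):
const someTabId = 123;                       // placeholder: id of the tab you want to intercept
const target = { tabId: someTabId };

chrome.debugger.attach(target, "1.3", () => {
    // Pause matching responses so they can be rewritten before the page sees them.
    chrome.debugger.sendCommand(target, "Fetch.enable", {
        patterns: [{ urlPattern: "*", requestStage: "Response" }]
    });
});

chrome.debugger.onEvent.addListener((source, method, params) => {
    if (method === "Fetch.requestPaused") {
        chrome.debugger.sendCommand(source, "Fetch.fulfillRequest", {
            requestId: params.requestId,
            responseCode: 200,
            responseHeaders: [{ name: "Content-Type", value: "application/json" }],
            body: btoa(JSON.stringify({ mocked: true }))   // body must be base64-encoded
        });
    }
});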
I've just found this extension and it does a lot of other things, but modifying API responses in the browser works really well: https://requestly.io/
Follow these steps to get it working:
Install the extension
Go to HttpRules
Add a new rule and add a url and a response
Enable the rule with the radio button
Go to Chrome and you should see the response is modified
You can have multiple rules with different responses and enable/disable them as required. I've not found a way to return a different response per request when the URL is the same, unfortunately.
