http - change request url? - node.js

Is it possible to change the url of an HTTP request without redirection?
For example instead of:
request 1
GET /user/abc123/ HTTP/1.1
HTTP/1.1 301 Moved Permanently
Location: /files/abc123
request 2
GET /files/abc123 HTTP/1.1
HTTP/1.1 200 OK
.
.
[filecontent]
I could respond with the file directly, while letting the client know that it was redirected:
single request
GET /user/abc123/ HTTP/1.1
HTTP/1.1 200 OK
Location: /files/abc123
.
.
[filecontent]

As far as I know, it's not possible to do this with HTTP. Redirection in HTTP specifically means that the client is supposed to send a second request.
I think what you want is more akin to specifying a "canonical url" for some resources, and then having this canonical url displayed in the browsers location bar.
RFC 6596 specifies a way to specify canonical urls with <link rel="canonical">. However, it does not specify what a browser should do with it, if anything. Google uses it to make better choices about which urls to index.
Other than using <link> tags, it's also possible to specify relationships between resources via the HTTP Link header, i.e. Link: </better-url>; rel=canonical. See http://www.w3.org/wiki/LinkHeader . I'm not sure if this would be picked up by Google though. The page at http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394 doesn't mention Google supports it. Browsers surely will disregard it, as they do with practically any link tag, stylesheets being the notable exception.
If the content in question is an HTML document, you could use the HTML5 history API for this. Specifically, use the history.replaceState method. I don't think achieving something similar is possible with other types of content.
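A minimal sketch of the history.replaceState idea, assuming the page was served at /user/abc123/ and should display /files/abc123 in the location bar. The helper name showCanonicalUrl is hypothetical, and the history object is passed in as an argument so the function can be tried outside a browser:

```javascript
// Replace the URL of the current history entry without reloading the page
// and without adding a new entry. `historyApi` is window.history in a browser.
function showCanonicalUrl(historyApi, canonicalUrl) {
  historyApi.replaceState(null, '', canonicalUrl);
}

// In a browser: showCanonicalUrl(window.history, '/files/abc123');
```

The new URL has to be same-origin with the current page, so this only changes the path, query and fragment that the browser displays.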
Edit
The Content-Location header may actually fit what you want quite well.
From section 14.14 of HTTP 1.1 RFC:
The Content-Location entity-header field MAY be used to supply the resource location for the entity enclosed in the message when that entity is accessible from a location separate from the requested resource's URI. A server SHOULD provide a Content-Location for the variant corresponding to the response entity; especially in the case where a resource has multiple entities associated with it, and those entities actually have separate locations by which they might be individually accessed, the server SHOULD provide a Content-Location for the particular variant which is returned.
Content-Location = "Content-Location" ":" ( absoluteURI | relativeURI )
The value of Content-Location also defines the base URI for the entity.
The Content-Location value is not a replacement for the original requested URI; it is only a statement of the location of the resource corresponding to this particular entity at the time of the request. Future requests MAY specify the Content-Location URI as the request-URI if the desire is to identify the source of that particular entity.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
See also What is the purpose of the HTTP header field “Content-Location”?
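A rough sketch of how the Content-Location idea could look with a plain Node.js request handler. The /user/... to /files/... mapping and the function names are assumptions made for illustration, not something the RFC or the question prescribes:

```javascript
// Hypothetical mapping from the requested URL to the file's own URL.
function contentLocationFor(requestUrl) {
  const match = /^\/user\/([^/]+)\/?$/.exec(requestUrl);
  return match ? '/files/' + match[1] : null;
}

// Request handler with the (req, res) shape expected by http.createServer.
function handler(req, res) {
  const location = contentLocationFor(req.url);
  if (!location) {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not found');
    return;
  }
  // Serve the content in a single response and state where it also lives.
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Content-Location': location
  });
  res.end('[filecontent]');
}

// To serve it: require('http').createServer(handler).listen(3012);
```

The client still sees a single 200 response; the header is informational and browsers are not expected to change the address bar because of it.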

Well, it is possible, but it feels a bit dirty.
A quick demo:
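var express = require('express');
var app = express();
// Rewrite the URL internally, then hand the request on to the next matching route.
app.get('/user/abc123', function(req, res, next) {
  // Express matches routes against req.url; req.path is a getter derived from it.
  req.url = '/files/abc123';
  next();
});
app.get('/files/abc123', function(req, res) {
  // Tell the client where the content actually lives.
  res.set('Location', req.url);
  res.send('files!');
});
app.listen(3012);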

This is my simple approach: rewrite not just originalUrl but also the URL that is used for routing. My suggestion:
app.use(function(req, res, next) {
  console.log("request", req.originalUrl);
  const removeOnRoutes = '/not-wanted-route-part';
  req.originalUrl = req.originalUrl.replace(removeOnRoutes, '');
  // req.path is a read-only getter derived from req.url, so rewrite req.url instead.
  req.url = req.url.replace(removeOnRoutes, '');
  return next();
});
This way /not-wanted-route-part/users becomes /users.
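One caveat with a plain string replace is that it removes the first occurrence of the text wherever it appears in the URL, not only at the start. A small sketch of a stricter variant, assuming the unwanted part is always a leading path segment (stripPrefix is a hypothetical helper name):

```javascript
// Remove `prefix` only when it is the leading path segment of `url`.
function stripPrefix(url, prefix) {
  if (url === prefix) return '/';
  if (url.startsWith(prefix + '/') || url.startsWith(prefix + '?')) {
    const rest = url.slice(prefix.length);
    // Keep a valid path when only a query string is left.
    return rest.startsWith('?') ? '/' + rest : rest;
  }
  return url; // prefix absent or not at the start: leave the URL alone
}

// Inside a middleware: req.url = stripPrefix(req.url, '/not-wanted-route-part');
```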

Related

How to ignore specific files to be loaded when I use route parameters in Express

When I make a GET request with route parameters in Express with mongoose, as in the following code, I sometimes see the browser try to load unexpected files such as favicon.ico, robots.txt, humans.txt, sitemap.xml, ads.txt, etc., and a 404 error shows up in the browser console.
app.get("/:userId", ...);
By referring to this Q&A, I figured out that if I don't use the route parameters right after the root route, as in the following code, it doesn't happen.
app.get("/user/:userId", ...);
In the same Q&A, however, there seems to be another way that uses req.url to ignore those unexpected file requests, but it isn't explained in detail.
How do you do that?
All that's meant in that other answer is that you could examine req.url in your route handler and make sure it is not a known special name. In this specific case, it's probably simpler to use req.params.userId instead of req.url, but you could also use req.url in the same way.
const specials = new Set(["favicon.ico", "robots.txt", "humans.txt", "sitemap.xml", "ads.txt"]);
app.get("/:userId", (req, res, next) => {
  // if it's a special URL, then skip it here
  if (specials.has(req.params.userId)) {
    next();
    return;
  }
  // process your route here
});
Personally, I wouldn't recommend this solution because it presupposes perfect knowledge of all possible special filenames. I never use top-level wildcard routes because they ruin the ability to use your server for anything else.
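If the top-level route has to stay, one way around listing every special filename is to accept only values that look like a user ID and pass everything else on. This is a sketch under an assumption: that user IDs are 24-character hex strings like a MongoDB ObjectId, which the question does not actually state. The names USER_ID_PATTERN and looksLikeUserId are made up for the example:

```javascript
// Assumed ID format: 24 hex characters, like a MongoDB ObjectId string.
const USER_ID_PATTERN = /^[0-9a-f]{24}$/i;

// True only for path segments that match the assumed ID format.
function looksLikeUserId(segment) {
  return USER_ID_PATTERN.test(segment);
}

// Inside the route handler one could then write:
// if (!looksLikeUserId(req.params.userId)) { return next(); }
```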

Cannot GET (Passed url as parameter)

Can routes in express not take a full URL as a parameter?
For example,
router.get("/new/:url", <some function>);
gives me the Cannot GET error when the :url is https://www.google.com
You can't pass a full URL in this format. This kind of route is used to receive parameters sent by the client:
router.get("/new/:url", <some function>);
// inside the handler, the url is available as a route parameter
req.params.url // use your URL
You should encode the url parameter before sending it. Your example, encoded, would be https%3A%2F%2Fwww.google.com. On the server side you can decode the parameter to get the original value back.
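A short sketch of the encode/decode round trip using the built-in encodeURIComponent and decodeURIComponent. The variable names are illustrative only:

```javascript
// Client side: encode the full URL so that it fits into a single path segment.
const target = 'https://www.google.com';
const encoded = encodeURIComponent(target);
const requestPath = '/new/' + encoded;
console.log(requestPath); // prints "/new/https%3A%2F%2Fwww.google.com"

// Server side: decode the segment to get the original URL back.
const decoded = decodeURIComponent(encoded);
console.log(decoded === target); // prints "true"
```

As far as I know, Express already decodes route parameters, so req.params.url would normally arrive decoded and the explicit decode step is only needed when working with the raw path.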
I think you may not be familiar with how ExpressJS routing works: your url https://www.google.com contains //, and / is what separates route segments.
In your case, ExpressJS supports regex routes, so I think the following route will work for you:
app.get("/new/:protocol(http:|https:|ftp:)?/?/:url", <some function>);
In the route above, you are bound to a limited set of protocols: http, https and ftp. You can add more protocols with the | separator (an "or" condition). If you don't know what the protocol will be, you can use the following instead:
app.get("/new/:protocol?/?/:url", <some function>);
In both routes above, ? means optional, so the routes work fine for
/new/www.google.com
/new/https://www.google.com
and in your function, you can prepend the protocol to the url like this:
function newUrl(req, res) {
  // protocol is optional in the route, so only prepend it when present
  if (req.params.protocol)
    req.params.url = req.params.protocol + '//' + req.params.url;
  console.log(req.params.url);
}

URL parameters before the domain

I have a question about routing and urls in general. My question regards parameters or queries in the url before the domain itself. For example:
http://param.example.com/
I ask this question because I am using ExpressJS and you only define your routes after the domain itself like:
http://example.com/yourRoute
Which would be added to the end of the url. What I want is to be able to store parameters before the domain itself.
Thank you in advance!
FYI
I do know how to use parameters and queries, I just don't know how I would go about inserting them before the domain.
You can create an if statement which can look at the sub-domain through the express req.headers.host variable which contains the domain of the request. For example:
-- google.com/anything/another
req.headers.host => "google.com"
-- docs.google.com/anything/
req.headers.host => "docs.google.com"
So, working off this, in your route you can call next() if the request doesn't match the form you want.
router.get('/', function(req, res, next) {
  if (req.headers.host === "sub.google.com") {
    //Code for res goes here
  } else {
    //Moves on to next route option b/c it didn't match
    next();
  }
});
This can be expanded on a lot, and many packages have been created to accomplish this (e.g. subdomain). Disclaimer: you may need to account for the use of www. with some urls.
Maybe this vhost middleware is useful for your situation: https://github.com/expressjs/vhost#using-with-connect-for-user-subdomains
Otherwise a similar approach would work: create a middleware function that parses the url and stores the extracted value in an attribute of the request.
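A minimal sketch of such a middleware, assuming a known base domain such as example.com. The names extractSubdomain, subdomainParam and the req.subdomainParam attribute are invented for this example and are not part of Express or vhost:

```javascript
// Return the part of `host` in front of `baseDomain`, or null if there is none.
function extractSubdomain(host, baseDomain) {
  if (!host) return null;
  const hostname = host.split(':')[0].toLowerCase(); // drop an optional port
  const suffix = '.' + baseDomain.toLowerCase();
  if (!hostname.endsWith(suffix)) return null;
  const sub = hostname.slice(0, -suffix.length);
  // Treat a bare "www" like no parameter at all.
  return sub === '' || sub === 'www' ? null : sub;
}

// Middleware factory with the (req, res, next) shape used by Express.
function subdomainParam(baseDomain) {
  return function (req, res, next) {
    req.subdomainParam = extractSubdomain(req.headers.host, baseDomain);
    next();
  };
}

// Usage in an Express app: app.use(subdomainParam('example.com'));
```

Later routes can then read req.subdomainParam much like a route parameter, instead of comparing req.headers.host against fixed strings.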
So I would use something like
router.get('/myRoute', function(req, res, next) {
  req.headers.host == ":param.localhost.com"
  //rest of code
});
I think I understand what you are saying, but I will do some testing and some further reading upon the headers of my request.
EDIT: Right now it seems like an unnecessary hassle to continue with this because I am also working with React-router at the moment. So for the time being I am just going to use my params after the /.
I thank you for your time and your answers. Have a nice day!

NodeJs web crawler file extension handling

I'm developing a web crawler in nodejs. I've created a unique list of the urls found in the body of the crawled website. But some of them have extensions like jpg, mp3, mpeg ... I want to avoid crawling those that have such extensions. Is there any simple way to do that?
Two options stick out.
1) Use path to check every URL
As stated in comments, you can use path.extname to check for a file extension. For example:
var path = require('path');
var test = "http://example.com/images/banner.jpg";
path.extname(test); // '.jpg'
This would work, but this feels like you'll wind up having to create a list of file types you can crawl or you must avoid. That's work.
Side note -- be careful using path. Typically, url is your best tool for parsing links because path is aimed at files/directories, not urls. On some systems (Windows), using path to manipulate a url can result in drama because of the slashes involved. Fair warning!
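Following the side note, here is a sketch that lets the URL class do the parsing and applies extname only to the pathname, using the POSIX flavour of path so the result does not depend on the operating system. The skip list and the function names are assumptions for illustration:

```javascript
const path = require('path');

// Extensions the crawler should skip (hypothetical list; extend as needed).
const SKIPPED = new Set(['.jpg', '.mp3', '.mpeg']);

// Lower-cased extension of the URL's pathname, or '' when there is none.
function extensionOf(link) {
  const pathname = new URL(link).pathname; // excludes query string and fragment
  return path.posix.extname(pathname).toLowerCase();
}

// True when the link does not end in one of the skipped extensions.
function shouldCrawl(link) {
  return !SKIPPED.has(extensionOf(link));
}
```

Parsing first matters for links with a query string: path.extname on the raw string "banner.jpg?size=2" would report ".jpg?size=2" rather than ".jpg".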
2) Get the HEAD for each link & see if content-type is set to text/html
You may have reasons to avoid making more network calls. If so, this isn't an option. But if it is OK to make additional calls, you could grab the HEAD for each link and check the MIME type stored in content-type.
Something like this:
var http = require('http');
var headersOptions = {
  method: "HEAD",
  host: "example.com", // hostname only, without the protocol
  path: "/articles/content.html"
};
var req = http.request(headersOptions, function (res) {
  // you will probably need to also do things like check
  // HTTP status codes so you handle 404s, 301s, and so on
  var contentType = res.headers['content-type'] || ''; // header may be missing
  if (contentType.indexOf("text/html") > -1) {
    // do something like queue the link up to be crawled
    // or parse the link or put it in a database or whatever
  }
});
req.end();
One benefit is that you only grab the HEAD, so even if the file is a gigantic video or something, it won't clog things up. You get the HEAD, see the content-type is a video or whatever, then move along because you aren't interested in that type.
A second benefit is that you don't have to keep track of file names, because you're using a standard MIME type to differentiate html from other data formats.
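The content-type check itself can be pulled into a small helper so that a missing header or a charset suffix does not trip it up. This is a sketch; isHtml is a hypothetical name:

```javascript
// True when a Content-Type header value denotes an HTML document.
function isHtml(contentType) {
  if (typeof contentType !== 'string') return false; // header may be missing
  // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
  const mimeType = contentType.split(';')[0].trim().toLowerCase();
  return mimeType === 'text/html';
}

// In the response callback: if (isHtml(res.headers['content-type'])) { ... }
```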

Throw 404 from shows / lists

I query the view like this:
/db/_design/myviewname/_view/foo?key=%22ABC123%22
The result is the following:
{
total_rows: 3,
offset: 3,
rows: [ ]
}
All good.
Since no doc was found I'd like to throw a 404 from a show or list.
Is that possible?
According to the wiki, you can issue redirect responses via Show/List functions. As such, it is also possible to send out arbitrary HTTP status codes (like 404).
function (head, req) {
start({ code: 404 });
}
I'm not sure if 404 would be the right choice here. It really means not found.
From the W3 HTTP/1.1 rfc2616:
10.4.5 404 Not Found
The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
I think there is a more appropriate response status code: 204 No Content, which sounds more like what you really want to tell the client.
10.2.5 204 No Content
The server has fulfilled the request but does not need to return an entity-body, and might want to return updated metainformation. The response MAY include new or updated metainformation in the form of entity-headers, which if present SHOULD be associated with the requested variant.
If the client is a user agent, it SHOULD NOT change its document view from that which caused the request to be sent. This response is primarily intended to allow input for actions to take place without causing a change to the user agent's active document view, although any new or updated metainformation SHOULD be applied to the document currently in the user agent's active view.
The 204 response MUST NOT include a message-body, and thus is always terminated by the first empty line after the header fields.
To set a custom status code, specify it in the object passed to the start function, like this:
function(head, req) {
  start({ "code": 204 });
}

Resources