Can I use wildcard pattern matching in Libsoup server? - linux

I am using libsoup to implement an HTTP server. I want to catch all URLs that match a wildcard pattern of the form
"/foo/*/bar/"
but I don't know how to do this.
How can I implement this using the libsoup and GLib libraries?
My current code is:
soup_server_add_handler (server, "/foo/*/bar/", NULL, server_callback,
unregister_callback, data);
The above doesn't work when I request the URL "/foo/abc/bar".
Please let me know whether this is possible in libsoup and what the correct syntax to pass to soup_server_add_handler() is.

soup_server_add_handler() doesn't take wildcards in its path. You will have to add a handler for / and then, inside the callback, examine the path it was called with to see if it matches your wildcard expression.
There is a merge request that adds something like this to Soup, but it is stalled.

Related

A regex route does not pass req.params

I have the following two routes in an Express router:
router.get("/v|verses", (req, res) => ...
router.get("/v|verses/:book", (req, res) => ....
Why does an invocation of /verses/john route to the first one with req.params an empty object?
It works fine if I don't use a regular expression but have separate routes for /v and /verses.
You need to change to the following /v(?:|erses)?/:book.
From the Express documentation:
Express uses path-to-regexp for matching the route paths; see the path-to-regexp documentation for all the possibilities in defining route paths. Express Route Tester is a handy tool for testing basic Express routes, although it does not support pattern matching.
When /v|verses/:book is evaluated through the Express Route Tester tool, the resulting regex is /^\/v|verses\/((?:[^\/]+?))(?:\/(?=$))?$/i, which fails because of the way the alternation is used: the regex says to match either something that starts with ^\/v (a plain /v) OR something that ends with verses\/((?:[^\/]+?))(?:\/(?=$)) (basically verses/<anything>).
Alternation is tried in order and the first alternative that matches wins, so the input "/verses/john" only matches the first alternative (its leading /v) and never the second, which is the one that contains the :book capture group. You can also see this on Regex101.
One thing that you need to keep in mind is that Express uses an old version of the path-to-regexp library - the Express dependency is 0.1.7 whereas the current package version is 6.1.0. I'm not sure why Express is not using a newer version - the older one doesn't seem to fully support some groupings, so it produces an invalid regular expression for them.
One option is to pass in a regular expression directly, so you could go for app.get(/^\/(?:v|verses)\/((?:[^\/]+?))(?:\/(?=$))?$/, (req, res) => {}) - similar to what SHOULD be generated but done by hand. However, it's not readable and you don't get the mapping of req.params.book, you just get.
Another option is to supply an array of paths: app.get(['/verses/:book', '/v/:book'], (req, res) => {}). This is a valid way to map multiple paths to one handler, and you could simply go with that.
Finally, to fix the syntax itself, you need /v(?:|erses)?/:book - a v optionally followed by erses or nothing in a non-capturing group. If you use a normal capturing group, then /verses/john produces req.params of type: {0: erses, book: john}. So, with this, you get the correct pattern here /^\/((?:v|verses))\/((?:[^\/]+?))(?:\/(?=$))?$/i. See on Regex101.
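To see the difference outside Express, both patterns can be run directly against the sample paths. This is a minimal sketch using plain JavaScript regular expressions only; the helper name capture is hypothetical, the first pattern is the generated one written with a plain alternation bar, and the second is the hand-grouped variant.

```javascript
// Generated pattern: the top-level "|" splits the WHOLE regex in two,
// so the first alternative is just "^\/v" and has no capture group.
const broken = /^\/v|verses\/((?:[^\/]+?))(?:\/(?=$))?$/i;

// Hand-grouped pattern: the alternation is confined to the first segment.
const grouped = /^\/(?:v|verses)\/((?:[^\/]+?))(?:\/(?=$))?$/i;

// Hypothetical helper: returns the full match and the first capture group.
function capture(re, path) {
  const m = re.exec(path);
  return m === null ? null : { matched: m[0], book: m[1] };
}

const a = capture(broken, "/verses/john");  // stops after the leading "/v"
const b = capture(grouped, "/verses/john"); // captures the book segment
const c = capture(grouped, "/v/john");      // short form works as well

console.log(a, b, c);
```

With the broken pattern the match succeeds on the leading /v alone, so the capture group that should hold the book is never filled.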

expressjs pattern to match the rest of the path

I'm trying to create an endpoint that contains an actual path that I extract and use as a parameter. For instance, in the following path:
/myapi/function/this/is/the/path
I want to match "/myapi/function/" to my function, and pass the parameter "this/is/the/path" as the parameter to that function.
If I try this it obviously doesn't work because it only matches the first element of the path:
app.get("/myapi/function/:mypath")
If I try this it works, but it doesn't show up in req.params, I instead have to parse req.path which is messy because the logic has to know about the whole path, not just the parameter:
app.get("/myapi/function/*")
In addition, the use of wildcard routing seems to be discouraged as bad practice. I'm not sure I understand what alternative the linked article is trying to suggest, and I'm not using the query as part of a database call nor am I uploading any information.
What's the proper way to do this?
You can use a wildcard:
app.get("/myapi/function/*")
and then read the matched path from
req.params[0]
// Example
//
// For the route "/myapi/function/this/is/my/path"
// You will get output "this/is/my/path"

NodeJs web crawler file extension handling

I'm developing a web crawler in Node.js. I've built a list of the unique URLs found in the body of each crawled page, but some of them have file extensions like jpg, mp3, mpeg ... I want to avoid crawling the URLs that have such extensions. Is there a simple way to do that?
Two options stick out.
1) Use path to check every URL
As stated in comments, you can use path.extname to check for a file extension. Thus, this:
var path = require("path");
var test = "http://example.com/images/banner.jpg";
path.extname(test); // '.jpg'
This would work, but this feels like you'll wind up having to create a list of file types you can crawl or you must avoid. That's work.
Side note -- be careful using path. Typically, url is your best tool for parsing links because path is aimed at files/directories, not urls. On some systems (Windows), using path to manipulate a url can result in drama because of the slashes involved. Fair warning!
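One way to combine the two modules is to let url do the parsing and hand only the pathname to path. The sketch below assumes a hypothetical skip list (SKIPPED_EXTENSIONS) and a hypothetical helper name (shouldSkip); it uses the WHATWG URL class and path.posix, both of which ship with Node.js.

```javascript
const path = require("path");

// Hypothetical skip list; adjust it to whatever the crawler should ignore.
const SKIPPED_EXTENSIONS = new Set([".jpg", ".mp3", ".mpeg"]);

// Hypothetical helper: true when the link points at a skipped file type.
function shouldSkip(link) {
  // URL strips the scheme, host, query string and fragment for us.
  const pathname = new URL(link).pathname;
  // path.posix always treats "/" as the separator, whatever the platform.
  const ext = path.posix.extname(pathname).toLowerCase();
  return SKIPPED_EXTENSIONS.has(ext);
}

console.log(shouldSkip("http://example.com/images/banner.jpg?size=large"));
console.log(shouldSkip("http://example.com/articles/content.html"));
```

Parsing first means a query string such as ?size=large never ends up inside the extension, and path.posix sidesteps the backslash handling that the Windows flavour of path applies.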
2) Get the HEAD for each link & see if content-type is set to text/html
You may have reasons to avoid making more network calls. If so, this isn't an option. But if it is OK to make additional calls, you could grab the HEAD for each link and check the MIME type stored in content-type.
Something like this:
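var http = require("http");

var headersOptions = {
  method: "HEAD",
  host: "example.com", // host name only, without the "http://" scheme
  path: "/articles/content.html"
};
var req = http.request(headersOptions, function (res) {
  // you will probably need to also do things like check
  // HTTP status codes so you handle 404s, 301s, and so on
  var contentType = res.headers['content-type'] || ""; // header may be absent
  if (contentType.indexOf("text/html") > -1) {
    // do something like queue the link up to be crawled
    // or parse the link or put it in a database or whatever
  }
});
// network failures (DNS errors, refused connections) arrive as 'error' events
req.on("error", function (err) {
  console.error(err.message);
});
req.end();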
One benefit is that you only grab the HEAD, so even if the file is a gigantic video or something, it won't clog things up. You get the HEAD, see the content-type is a video or whatever, then move along because you aren't interested in that type.
A second benefit is that you don't have to keep track of file extensions, because you're using a standard MIME type to differentiate HTML from other data formats.

Get params using ExpressJS

I'd like to get a specific param using ExpressJS with "#" instead of "?" in the url...
My URL :
http://localhost:3000/#access_token=LMkdfkdmsklmfdkslklmdskfmsda
I'd like to get "access_token" and "req.params.access_token" doesn't work...
Anthony
Short answer: you can't.
Longer answer: fragment identifiers (that's the part after the #) are supposed to be evaluated on the client and are not supposed to be sent to server. Your express app has no way of knowing them.
You could try to convert them to query parameters or path variables (i.e. by handling fragment identifier change in javascript) to make them visible server-side.
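As a rough illustration of that conversion, the client-side sketch below moves the key/value pairs from the fragment into the query string before the page calls the server. The function name fragmentToQuery is hypothetical; URL and URLSearchParams are standard in browsers and are also available in Node.js.

```javascript
// Hypothetical helper: rewrites "...#a=1&b=2" as "...?a=1&b=2".
function fragmentToQuery(href) {
  const url = new URL(href);
  // url.hash includes the leading "#", so drop it before parsing.
  const fragmentParams = new URLSearchParams(url.hash.slice(1));
  for (const [key, value] of fragmentParams) {
    url.searchParams.set(key, value);
  }
  // An empty hash removes the fragment from the serialized URL.
  url.hash = "";
  return url.href;
}

const rewritten = fragmentToQuery(
  "http://localhost:3000/#access_token=LMkdfkdmsklmfdkslklmdskfmsda"
);
console.log(rewritten);
```

Once the value travels in the query string it is part of the request, so the server can read it like any other query parameter.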

ExpressJS Route Parameter with Slash

I'm using ExpressJS. I want to pass a URL as a parameter.
app.get('/s/:url', function(req, res) {
console.log(req.params.url);
});
/s/sg.com //sg.com
/s/www.sg.com //www.sg.com
/s/http://sg.com //http://sg.com
/s/http://sg.com/folder //http://sg.com/folder
How can I correct the route so that everything after /s/ is treated as the parameter, including slashes?
Thanks
Uh, if you want to stick a URL inside of another URL, you need to URL-encode it. If you want to stick one in there raw and suffer the consequences, just use app.get('/s/*'... and then manually parse out the URL with req.url.slice(3). But hear me now and believe me later, URL encoding is the right way to do this, via the encodeURIComponent function that is built in to JavaScript and works in both the browser and node.js.
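A minimal sketch of the encoding round trip, using only the built-in encodeURIComponent and decodeURIComponent. The helper names buildLookupPath and readLookupPath are hypothetical, and the sample URL is taken from the question.

```javascript
// Hypothetical helper: builds the request path for a target URL.
function buildLookupPath(targetUrl) {
  // ":" and "/" are percent-encoded, so the result is a single path segment.
  return "/s/" + encodeURIComponent(targetUrl);
}

// Hypothetical helper: recovers the original URL from the request path.
function readLookupPath(requestPath) {
  return decodeURIComponent(requestPath.slice("/s/".length));
}

const encoded = buildLookupPath("http://sg.com/folder");
const decoded = readLookupPath(encoded);

console.log(encoded);
console.log(decoded);
```

Because the encoded value contains no raw slashes, it fits in a single path segment such as the :url parameter from the question.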
