ElasticSearch document ID constraints

For ElasticSearch document IDs, are there any character constraints or restrictions?
I am particularly interested in whether a forward slash '/' would cause any issues. I have some news feeds which I would like to index. The problem is that the database containing this data has its UID set to the URL of the news feed. Don't ask me why it was designed this way, because I don't have a clue.
I want to use the same identifier (the URL) for the ElasticSearch document. I have successfully used GUIDs, alphanumeric IDs and numeric IDs without problems.
If I can't use a URL, what would be the best workaround? Should I encode the entire URL?
Thanks

There are no constraints. Forward slashes can be used, but in order to use such an ID in the REST API, it has to be URL-encoded:
$ curl -XPUT "localhost:9200/id-test-index/rec/1+1%2F2" -d '{"field" : "one and a half"}'

Slash "/" URL encoding is broken: https://github.com/elasticsearch/elasticsearch/issues/2903
Slash "/" is no longer broken. This issue has been fixed.

Related

IBM Domino - internet sites + substitution with '?' in incoming rule errors

For an upcoming larger XPages project we need to use substitution rules to provide SEO-friendly URLs. We need to define rules similar to this one:
Incoming URL pattern: /*/products?*
Replacement pattern: /web.nsf/view.xsp?lang=*&*
This substitution should work with URL e.g.:
/cz/products?start=1&count=20
and substitute to
/web.nsf/view.xsp?lang=cz&start=1&count=20
But we just found out that when the incoming rule contains '?', it simply returns Error 404. We found this reported here: http://www-10.lotus.com/ldd/nd8forum.nsf/DateAllFlatWeb/a8162420467d5b45852576c7007fc045?OpenDocument.
Is there any workaround or fix for this situation? The documentation doesn't mention such a limitation, which is, in fact, very significant because we are not able to redefine the rule to fit our (very common) situation.
Any idea how to fix this?
I do not think you can solve that issue easily.
If you can't find an easy solution, I would suggest you look at these two approaches:
Build a DSAPI filter and define your custom substitutions there (that way speed won't be affected).
Point requests to a single XPage and let that XPage inspect the incoming request and redirect it to the appropriate place (based on your custom substitutions), as sketched below.
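To sketch the second approach with the URLs from the question: map only the path part (so no '?' is needed in the rule), e.g. /cz/products, to a single dispatcher XPage; assuming the query string is passed through to that XPage, it can read the language from the path and start/count from the query string, and then redirect to /web.nsf/view.xsp?lang=cz&start=1&count=20.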

Comma in the URL - Google ignores everything after the comma

My site has links that contain a comma, e.g. http://example.com/productName,product2123123.html
I submitted a sitemap with these links, and Google Webmaster Tools reports that the URLs are not found.
It looks like Google ignores everything after the comma and tries to index http://example.com/productName, which is a wrong URL, so the site returns a 404.
Is this a bug in Google? Or must I change my site's routing, or change the comma to "%2C"? Could that remove my existing pages from Google?
I'm not sure if this will solve it, but these links may help you understand the problem better:
Using commas in URL's can break the URL sometimes?
Are you using commas in your URLs? Here’s what you need to know.
Google SEO News and Discussion Forum
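For reference, if you do go the "%2C" route, the example URL from the question would be written with the comma percent-encoded: http://example.com/productName%2Cproduct2123123.html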

Notes 9, rewriting URLs

How do you rewrite a URL in Notes 9 XPages?
Let's say I have:
www.example.com/myapp.nsf/page-name
How do I get rid of that .nsf part:
www.example.com/page-name
I don't want to set up lots of manual redirects, because my pages are dynamically generated, like WordPress.
I've read this: http://www.ibm.com/developerworks/lotus/library/ls-Web_site_rules/
It does not address the issue.
If you use substitution rules like the following, you can get rid of the db.nsf part and call your XPages directly as example.com/xpage1.xsp:
Rule (substitution): /db.nsf/* -> /db.nsf/*
Rule (substitution): /* -> /db.nsf/*
However, you have to "manually" generate your URLs without the db.nsf part in e.g. menus, because the XPages runtime will include the db.nsf part in the URLs if you use, for instance, the openPage simple action. An example is shown below.
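For example (hypothetical page name and parameter), a hand-built menu link would point to
www.example.com/xpage1.xsp?id=1234
whereas the runtime-generated link would be
www.example.com/db.nsf/xpage1.xsp?id=1234
which still works with the rules above but exposes the db.nsf part.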
To completely control what is going in and out, put your Domino server behind an Apache HTTP server and use mod_rewrite. On Domino 9.0 for Windows you can use mod_domino.
You can do it with a mix of substitutions, a "URL pattern" and partial refresh.
I had the same problem; my customers want clean URLs for SEO.
My URLs now look like this:
www.myserver.de/products/financesoftware/anyproduct
First I used one substitution to cover the folder, database and XPage part of the URL.
My substitution: "/products" -> "/web/techdemo.nsf/product.xsp"
The problem with this is that any update on the page (in redirect mode) gives the user back the "dirty" URL.
I solved this by using partial refreshes only.
Last but not least, I use my own slash pattern at the end of the XPage call (.xsp).
In my case that is the "/financesoftware/anyproduct/" part.
I used facesContext.getExternalContext().getRequestPathInfo() to resolve that URL part.
Currently I use good old RegExp to get the slash-separated parameters back out of the URL, but I am investigating a REST solution at the moment.
I haven't actually done this, but I just saw the option yesterday while looking for something else. In your XPage, go to All Properties and look at 'navigationRules' and 'pageBaseUrl'. I think you will find what you are looking for there.

& Ampersand in URL

I am trying to figure out how to use the ampersand symbol in a URL.
Having seen it here: http://www.indeed.co.uk/B&Q-jobs I wish to do something similar.
I'm not exactly sure what the server is going to call when the URL is accessed.
Is there a way to grab a request like this with .htaccess and rewrite to a specific file?
Thanks for your help!
Ampersands are commonly used in a query string. Query strings are one or more variables at the end of the URL that the page uses to render content, track information, etc. Query strings typically look something like this:
http://www.website.com/index.php?variable=1&variable=2
Notice how the first special character in the URL after the file extension is a ?. This designates the start of the query string.
In your example, there is no ?, so no query string is started. According to RFC 1738, ampersands are not valid URL characters except for their designated purposes (to link variables in a query string together), so the link you provided is technically invalid.
The way around that invalidity, and what is likely happening, is a rewrite. A rewrite informs the server to show a specific file based on a pattern or match. For example, an .htaccess rewrite rule that may work with your example could be:
RewriteEngine on
RewriteRule ^/?B&Q-(.*)$ /scripts/b-q.php?variable=$1 [NC,L]
This rule would find any URLs starting with http://www.indeed.co.uk/B&Q- and show the content of http://www.indeed.co.uk/scripts/b-q.php?variable=jobs instead.
For more information about Apache rewrite rules, check out their official documentation.
Lastly, I would recommend against using ampersands in URLs, even when doing rewrites, unless they are part of the query string. The purpose of an ampersand in a URL is to string variables together in a query string. Using it outside that purpose is not correct and may cause confusion in the future.
A URI like /B&Q-jobs gets sent to the server encoded like this: /B%26Q-jobs. However, when it gets sent through the rewrite engine, the URI has already been decoded so you want to actually match against the & character:
RewriteRule ^/?B&Q-jobs$ /a/specific/file.html [L]
This makes it so when someone requests /B&Q-jobs, they actually get served the content at /a/specific/file.html.
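Step by step, with the same file path as above: the browser sends the request as GET /B%26Q-jobs, Apache decodes it to /B&Q-jobs before mod_rewrite runs, the RewriteRule matches, and the content of /a/specific/file.html is returned while the original URL stays in the address bar.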

Nutch 1.2 - Why won't Nutch crawl URLs with query strings?

I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in crawl-urlfilter.txt so it looks like this now:
# skip urls with these characters
#-[]
#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed every filter, telling Nutch to accept all URLs it finds on my website.
Does anyone have any suggestions? Or is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue I am having? Or am I doing something wrong?
See my previous question here Adding URL parameter to Nutch/Solr index and search results
The first 'Edit' should answer your question.
The default URL filter contains this rule:
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
You have to comment it out or modify it as:
# skip URLs containing certain characters as probable queries, etc.
-[*!#]
By default, crawlers shouldn't crawl links with query strings, to avoid spam and fake search-result pages.
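For example (hypothetical URL): with the default -[?*!#=] rule active, a link such as http://yoursite.com/page.php?id=3&sort=asc is skipped because it contains '?' and '='; with that line commented out or changed to -[*!#], the same link passes the filter and gets crawled.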
