Config for finding a URL in HTML content - search

Can anybody help me configure Sphinx for the best matching of a URL (or part of a URL) in HTML content?
My config:
index base_index
{
docinfo = extern
mlock = 0
morphology = none
min_word_len = 3
charset_type = utf-8
charset_table = 0..9, A..Z->a..z, a..z
enable_star = 1
blend_chars = _, -, #, /, .
html_strip = 0
}
I use SphinxAPI on the backend (PHP) with SPH_MATCH_EXTENDED mode.
I don't understand how the search works. If I search for "domain.com" I get 37 results; for "www.domain.com", 643 results. But why? "domain.com" is a substring of "www.domain.com", so in theory the first query should return more results.
FreeBSD 9.2. Sphinx 2.1.2
16 distributed indexes (147Gb)

This is a bit late, but here are my thoughts anyway.
It looks like when you search for www.domain.com, Sphinx is actually looking for the terms www, domain, and com. If you search for just domain.com, it looks for domain and com. That is probably why www.domain.com returns more results: www appears frequently throughout the index.
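A rough Python model of that tokenization (a deliberate simplification I'm assuming here, not Sphinx's actual tokenizer; it just splits a term on the blend_chars from the config above):

```python
import re

# Hypothetical simplification: with '.' listed in blend_chars, Sphinx keeps
# the whole blended token AND indexes its parts as separate keywords.
BLEND_CHARS = r"[_\-#/.]"

def blended_tokens(term):
    parts = [p for p in re.split(BLEND_CHARS, term) if p]
    return [term] + parts

print(blended_tokens("www.domain.com"))  # ['www.domain.com', 'www', 'domain', 'com']
print(blended_tokens("domain.com"))      # ['domain.com', 'domain', 'com']
```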
Since you're searching URLs, I would set up stopwords depending on how you want to search. Personally, I would make www, com, org, and basically all top-level domains stopwords. You might want to leave the top-level domains in and just make www a stopword; that would allow you to weight a com higher than a net in the search results.
If you set up your stopwords right, when someone searches for domain.com, Sphinx actually just looks for hits of domain in the index, whether it be domain.com, domain.org, or domain.net.
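A minimal sketch of that stopwords setup (the file path is hypothetical, and the word list is just the one suggested above):

```
# /usr/local/sphinx/stopwords.txt -- one word per line
www
com
org
net
```

Then reference it from the index definition in sphinx.conf:

```
index base_index
{
    ...
    stopwords = /usr/local/sphinx/stopwords.txt
}
```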

Related

How to make Apache treat a query string as a file name?

I mirrored a site to a local server with wget, and the local file names look like this:
comments
comments?id=123
Locally these are static files that show unique content.
But when I access the second file in a browser, it keeps showing the content of the file comments and appends the query string ?id=123 to it, so it does not show the content of the file comments?id=123.
It loads the correct file if I manually encode the ? as %3F in the browser address bar and type:
comments%3Fid=123
Is there a way to fix this? Maybe make Apache stop treating ? as the query separator and treat it as a file-name character? Or make a URL rewrite that changes ? into %3F?
Edit: Indeed, too many problems are caused by ? in file names and requests. I ended up using the wget option --restrict-file-names=windows, which converts ? into @ when saving the file name.
The short answer is "don't do that."
The longer answer is that ? is a reserved character in URLs, using it as part of a filename is going to cause problems forever, and the recommended solution is to pick a different character to use in those filenames. There are many to choose from - just avoid ?, &, %, and # and you'll probably be fine.
If you insist on keeping the file name (or if you don't have an option) try:
RewriteCond %{QUERY_STRING} (.+)
RewriteRule (.*) $1\%3F%1 [NE,L]
However, this will fire any time there is a query string, which is likely not what you want.
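The percent-encoding the asker did by hand can be done with Python's standard library; this sketch just illustrates why %3F works where a raw ? does not:

```python
from urllib.parse import quote, unquote

# '?' is reserved in URLs: it separates the path from the query string.
# A literal '?' in a path segment must be percent-encoded as %3F.
name = "comments?id=123"
encoded = quote(name, safe="=")   # keep '=' readable, encode the '?'
print(encoded)                    # comments%3Fid=123
assert unquote(encoded) == name   # decoding restores the original file name
```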

.htaccess condition that works on many conditions inside

I want to try something like an if statement in .htaccess:
I want to redirect each ?sp=SOMETHING to a different ?p=NNN (some number).
I have about 100 ?sp= pages.
And I don't want Apache to work through 100 rules on each page load.
If there is another method to solve this, I'd be happy to know.
if(RewriteCond %{HTTP_HOST ^?sp=}{
RewriteRule ^?sp=bar ?p=5
RewriteRule ^?sp=foo ?p=9
RewriteRule ^?sp=tin ?p=15
}
There is no logic connecting the ?sp= and ?p= values.
Update: I don't have access to the server config.
This can be done with the RewriteMap directive (if you have access to the server configuration, as pointed out in a comment. No idea why they thought that needed to be restricted...). For example:
RewriteMap sp_to_s txt:/path/to/map.txt
RewriteCond %{QUERY_STRING} ^sp=(.*)$
RewriteRule ^(.*)$ $1?p=${sp_to_s:%1|0} [L]
(the 0 is the default value used when the key is not found in the map).
Here's a sample map.txt:
bar 5
foo 9
tin 15
There are more ways to use the map feature; see the documentation for mod_rewrite for details.
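The ${map:key|default} lookup behaves like a dictionary lookup with a fallback; in Python terms (just a model of the semantics, not anything Apache runs):

```python
# Model of Apache's ${sp_to_s:key|default} RewriteMap lookup:
# return the mapped value, or the default when the key is absent.
sp_to_s = {"bar": "5", "foo": "9", "tin": "15"}

def rewrite_map_lookup(key, default="0"):
    return sp_to_s.get(key, default)

print(rewrite_map_lookup("foo"))      # 9
print(rewrite_map_lookup("missing"))  # 0
```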

htaccess rewrite url parameter with defined values

I need to rewrite URLs like this
www.url.com?host_interface=220%2C222%2C770
into this (the one the user should see):
www.url.com/host_interface-usb,firewire,pcie.html
For this example I need to replace these parameter values:
220 = usb
222 = firewire
770 = pcie
The %2C in the raw URL is the URL-encoded comma (, = %2C), so the parameter actually looks like
220%2C222%2C770
Do I need to translate this too?
Besides the parameter values, the
?host_interface=
needs to be rewritten to
host_interface-
and it also needs the ending
.html
So in the end I want to place several parameter mappings like
xyz = abc
in the .htaccess so that all the parameters get rewritten.
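One possible sketch for the mapping part, assuming mod_rewrite with access to the server configuration (RewriteMap is not available from .htaccess alone); the map name, file path, and internal target script are hypothetical, and the rule serves the pretty URL by internally rewriting it back to the query-string form:

```apache
# /path/to/iface_rev.txt (hypothetical) -- pretty value -> numeric code:
#   usb 220
#   firewire 222
#   pcie 770
RewriteMap iface_rev txt:/path/to/iface_rev.txt

# /host_interface-usb,firewire,pcie.html -> ?host_interface=220,222,770
# (a literal comma in the internal query string is equivalent to %2C,
# since %2C decodes to a comma before the application sees the value)
RewriteRule ^host_interface-([a-z]+),([a-z]+),([a-z]+)\.html$ /index.php?host_interface=${iface_rev:$1},${iface_rev:$2},${iface_rev:$3} [L]
```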

URLs with the symbol "%" at the end cause an HTTP error - how can I prevent it with .htaccess?

I have a question about some of the URLs in my access_log. Some external sites link to me with URLs like http://domain.com/url_name.htm% (yes, with a %).
Then my server returns an HTTP error. I need to redirect these broken URLs to the correct ones, and I thought of .htaccess.
I only need to detect the % symbol as the last character of the URL and redirect without it:
http://domain.com/url_name.htm% --> http://domain.com/url_name.htm
How can I do this? I tried some samples meant for the ? symbol, but I had no luck.
Thanks!
I already found the problem...
It seems that such malformed URLs never reach the vhost, so these requests never read the .htaccess.
The only way to solve this is to add the ErrorDocument 400 directive in httpd.conf... It is not the best option for servers with several vhosts, because all of them will get the same behaviour, but I think it is the only way in this case.
Quotation from Apache documentation:
Although most error messages can be overridden, there are certain circumstances where the internal messages are used regardless of the setting of ErrorDocument. In particular, if a malformed request is detected, normal request processing will be immediately halted and the internal error message returned. This is necessary to guard against security problems caused by bad requests.
Thanks anyway!!
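The server-level workaround described above would look something like this in httpd.conf (the error-document path is hypothetical):

```apache
# Server-level config: malformed requests are rejected before vhost and
# .htaccess processing, so ErrorDocument 400 must live in httpd.conf.
ErrorDocument 400 /errors/bad_request.html
```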
This page is super helpful about the .htaccess rules.
http://www.helicontech.com/isapi_rewrite/doc/RewriteRule.htm
I saw a few solutions to this that use a small PHP script too. E.g. this one replaces #:
.htaccess
RewriteRule old.php redirect.php?url=http://example.com/new.php|hash [R=301,QSA,L]
redirect.php
<?php
// Restore the '#' that was smuggled through the rewrite as '|'
$new_url = str_replace("|", "#", $_GET['url']);
header("Location: " . $new_url, true, 301);
die;
?>

Is it possible to get a list of .com.eg domains programmatically?

I want to create a spider for Egyptian domains. I was wondering if there is any method I can use to communicate with domain servers to get a list of all domains ending in .com.eg?
You could parse Google's results for a "site:.com.eg" search.
Here's the code, using Python 2 and the third-party xgoogle library:
from xgoogle.search import GoogleSearch, SearchError

results = []
try:
    gs = GoogleSearch("site:.com.eg")
    gs.results_per_page = 100
    page = 1
    while page < 10:
        gs.page = page
        results += gs.get_results()
        page += 1
except SearchError, e:
    print "Search failed: %s" % e

for res in results:
    print res.url
I got a list of hundreds of ".com.eg" domains with this script.
Some registries offer a way to download the "zone file", which is a list of all domains registered with the registry. I looked on http://www.nic.eg, but I can't read Arabic and they don't offer an English translation of most pages.
You will be looking for something like VeriSign's TLD Zone Access Program.
Alexa's Top Sites page offers a download of the top 1 million domain names globally. There are just 43 names from .com.eg in that list though; here they are:
google.com.eg
vodafone.com.eg
telecomegypt.com.eg
yellowpages.com.eg
efa.com.eg
etisalat.com.eg
nbe.com.eg
carrefour.com.eg
link.com.eg
edita.com.eg
gom.com.eg
vodafonelive.com.eg
travian.com.eg
nilesat.com.eg
toyotaegypt.com.eg
faisalbank.com.eg
oriflame.com.eg
nsgb.com.eg
skoool.com.eg
betterhome.com.eg
espace.com.eg
mcsd.com.eg
banquemisr.com.eg
mobileshop.com.eg
st.com.eg
egyptinmypocket.com.eg
hyperone.com.eg
resala.com.eg
arabbank.com.eg
nestle.com.eg
eaec.com.eg
elman.com.eg
nas.com.eg
nissan.com.eg
asset.com.eg
tech.com.eg
selaheltelmeez.com.eg
mh.com.eg
cookdoor.com.eg
siemens.com.eg
bmisr-payment.com.eg
citystars.com.eg
global-id.com.eg
No, but it's possible to get all of the IP address ranges in Egypt.
This might help.
No, Pasha, you can't do that.
No server will give you this information.
