Why don't browsers accept slightly misspelled URLs?

I'm north of 50, and can't always make the connection from my eyes to my fingers while looking at the screen rather than the keys, resulting in things like "www,something.tld" being entered into whichever program I'm using (Windows run dialog, browser, something else).
Wouldn't the browser developers have addressed minor flubs like this a LONG time ago, and accepted them for the most obvious result?
Or is there something else implied by the above that I'm missing?
dL

Consider the fact that there are now over a billion Web sites on the internet (http://www.theatlantic.com/technology/archive/2015/09/how-many-websites-are-there/408151/).
This means that there is a good chance that www.example.com may be a completely different Web site from www.exaample.com, www.examplee.com, etc.
Also, in your specific case with the comma (www,something.tld), modern Web browsers will usually do a Web search (using Yahoo, Google, Bing, etc.) for address-bar text that doesn't look like a valid domain name, especially when there is a comma in it, because the browser has no way of knowing whether you really meant to type the comma.
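As an illustration of that fallback, here is a minimal sketch of the heuristic in Python; the regular expression and the wording of the results are simplifications assumed for the demo, not any browser's actual rules.

import re

def handle_address_bar_input(text: str) -> str:
    """Decide whether to treat address-bar input as a URL or a search query."""
    # Very rough hostname check: dot-separated labels of letters, digits,
    # and hyphens. Real browsers use a full URL parser with many more rules.
    looks_like_host = re.fullmatch(r"[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+", text)
    if text.startswith(("http://", "https://")) or looks_like_host:
        return "navigate to " + text
    # "www,something.tld" fails the check because of the comma, so the
    # browser falls back to a web search instead of guessing a correction.
    return "search for " + text

print(handle_address_bar_input("www.example.com"))    # navigate to www.example.com
print(handle_address_bar_input("www,something.tld"))  # search for www,something.tld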

Computers only ever do what they're told, and they take you literally when you ask them to do something, no matter how ridiculous it may seem. That is their purpose, and it allows people to do ridiculous things with them - sometimes. It also allows for very similar yet distinct URLs.
Since commas aren't valid characters in domain names, it is fairly safe to say that an address typed with a comma can't be a real URL. As for why it hasn't been fixed: developers like to be lazy / efficient, and there hasn't really been a need to fix it when most people will just use a search engine like Google and click links instead of manually entering addresses. Plus, most people have favourites in their browser to go to instead of searching each time.
Recently Chrome has gotten better with misspelled URLs: if a page doesn't exist, it will offer to search for pages containing similar words.
There may be a plugin out there that converts commas into full stops, but I have yet to find it.

The number of websites is a good argument, but I think it cannot be applied to commas.
Although there are nearly 2 billion websites out there (https://www.millforbusiness.com/how-many-websites-are-there/), that alone doesn't explain why this kind of minor issue goes unfixed.
No website uses a comma in its domain name, so if you type a comma instead of a dot, the address can't possibly resolve to any website.
I agree with the question and do not understand why this is still an issue.

Related

.htaccess 5000 redirects with no pattern

Hi, I'm sorry if this is a vague question, but all the solutions I find seem to assume there is a pattern when doing thousands of redirects.
My issue is that we inherited a new website and were asked to make it live, host it, and basically be the people working on it from now on; the owners weren't happy with the developers that built the new site.
The old site is on a totally different platform from the new one, and even though there is a vague pattern in terms of products, categories, etc., I don't see anything that I could reliably use as a shortcut without losing lots of direct product redirects.
Some products have more words in the new URLs than the old ones, categories have been moved around and recategorized, etc.
What is the standard practice when the sitemap has this many URLs (approx 5000)?
Interestingly enough, I googled the site using site:www.example.com and 32 pages showed up, so I was tempted to simply do the redirects for these and leave the rest. Is this a big no-no?
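A common way to handle thousands of pattern-less redirects is a lookup table rather than 5000 individual rules. Below is a minimal sketch, assuming the old-to-new mapping lives in a hypothetical redirects.csv: it generates an Apache RewriteMap text file that a single rewrite rule can consult (note that RewriteMap must go in the server or vhost config, not in .htaccess itself).

import csv

with open("redirects.csv", newline="") as src, open("redirects.map", "w") as dst:
    for old_path, new_path in csv.reader(src):
        # RewriteMap text files are just "key value" pairs, one per line.
        dst.write(f"{old_path} {new_path}\n")

# The generated file is then wired up once in the Apache config:
#   RewriteEngine On
#   RewriteMap redirects txt:/path/to/redirects.map
#   RewriteCond ${redirects:%{REQUEST_URI}} !=""
#   RewriteRule ^ ${redirects:%{REQUEST_URI}} [R=301,L]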

Which points should be noted and observed while building a web application in a way that it can function well on most web browsers?

I'm working mainly with Java web applications (mostly with JSF, JavaServer Faces). I'm less concerned with the rest of the technologies.
Since different web browsers behave somewhat differently from one another, any web application should be designed in such a way that it renders and behaves predictably in most browsers (maybe not all). Which points should be kept in mind so that a web application functions almost identically on most browsers?
What are the major differences among different browsers which should be noted by web application developers?
You have to check all of these points before developing web applications in any language...
Almost all web developers (ahem! – perhaps that should read “quite a lot of web developers”) are aware of the need to check how their site looks in a variety of browsers. How far you go obviously depends on the resources available – not everyone is in a position to check Windows, Mac, Unix and Linux platforms. The minimum test would probably be:
Firefox, as that has the best standards compliance and is the second most-used browser;
Internet Explorer for Windows – currently the most widely used browser. It is essential to check both versions 6 and 7, as version 7 fixed quite a lot of bugs in 6 but introduced a new set of its own. (Microsoft is however still kicking developers in the teeth by not making it possible to install both versions on the same computer; you will either need two computers or one of the work-arounds available on the net.) Version 5 should preferably also be checked; as of spring 2008 the number of users is not yet negligible. However it is now uncommon enough that you needn’t worry about cosmetic issues; as long as the site is readable that should be sufficient.
Opera – growing in popularity due to its speed and pretty good standards compliance.
For some time I also recommended checking Netscape 4 as well, as it often produces radically different results from any other browser, and was very popular for a long time. However the number of users of this bug-ridden browser is now so small (under 0.1% and decreasing) that it can now probably safely be ignored.
Check printed pages
Print some of the pages on a normal printer (i.e. with a paper size of A4 or Letter) and check that they appear sensibly. Due to the somewhat limited formatting options available for printing, you probably can’t achieve an appearance comparable to a document produced by a word-processor, but you should at least be able to read the text easily, and not have lines running off the right-hand side of the page. It is truly extraordinary how many site authors fail to think of this most elementary of operations.
You should also consider using CSS to adjust the appearance of the page when printed. For example you could – probably should – suppress the printing of information which is not relevant to the printed page, such as navigation bars. This can be done using an "@media print" block in your stylesheet, or a separate stylesheet pulled in with "@import ... print".
Some sites provide separate “printer friendly” versions of their pages, which the user can select and print. While this may occasionally be necessary as a last resort, it significantly increases the amount of work needed to maintain the site, is inconvenient for the reader and shouldn’t usually be needed.
Switch Javascript off
There are unfortunately quite a number of Internet sites which abuse Javascript by, for example, generating unwanted pop-ups and irritating animations. There are also a number of Javascript-related security holes in browsers, especially Internet Explorer. As a result a lot of readers switch Javascript off – indeed I often do myself. (I have a page giving the reasons in more detail.) Some organisations even block the usage of Javascript completely. Furthermore few, if any, search engines support Javascript.
It is therefore important to check that your site still functions with Javascript disabled. A lot of sites rely – quite unnecessarily – on Javascript for navigation, with the result that the lack of Javascript renders the site unusable.
Clearly if you need it for essential content, that functionality will be lost. But there is no reason why the basic text of the site should be unavailable.
Avoid nearly-meaningless messages like “Javascript needed to view this site”. If you have something worth showing, tell the user what it is, e.g. “enable Javascript to see animation of solar system”.
Switch plug-ins off
The considerations for plug-ins (such as Flash or Java) are very similar to those for Javascript above. Check the site with any plug-ins disabled. The basic text and navigation should still work.
Interest the reader sufficiently, and he might just go to the trouble of downloading the plug-in. Greet him with a blank screen or a "You need Flash to read this site" message and he will probably go away, never to return.
Switch images off
If scanning a number of sites quickly for information, many readers (including myself) switch images off, for quick loading. Other people cannot view images. So switch images off and check that the site is readable and navigable. This means, in particular, checking that sensible ALT texts have been provided for images. (This check is similar to using a text browser, but not quite the same).
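If you want to automate part of that check, here is a small sketch using only Python's standard library that flags img tags with no ALT text; the sample page string is made up for the demo.

from html.parser import HTMLParser

class AltChecker(HTMLParser):
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and not attrs.get("alt"):
            print("img missing alt text:", attrs.get("src", "?"))

page = '<p><img src="logo.png"><img src="chart.png" alt="Sales by quarter"></p>'
AltChecker().feed(page)  # prints: img missing alt text: logo.png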
It's worth taking a look at this link for more info.

When designing a website, do you need to consider users who disable CSS?

Have we finally got to the point where we assume CSS2, and hope for CSS3?
(Not looking for discussion, if the answer is "yes, you idiot", go for it...)
You should always take into consideration users who
A. use screen readers and text-only browsers
B. are on mobile devices
C. are not human (i.e. search engine spiders)
By having a good separation of content and style, you should be able to address each of these with ease. As far as users who have CSS disabled, in this day and age, I don't think a designer should concern themselves over it too much. It's certainly not worth spending a significant amount of time and resources on.
What is your target audience and what is your cost for supporting (or not supporting) certain clients?
In addition to the fine points made by pst and ttreat31, I'll add that using semantic markup will generally let your document be readable with CSS disabled (i.e. using the browser's default CSS).
There may be a few quirks (forms come to mind), but generally I find with my own pages, they are plenty readable.
You, and your business, will probably survive if you require CSS. But you'll probably do better if you DON'T require it.
By catering for non-CSS cases, you'll write better markup, with better-structured content. You'll mitigate cross-browser problems, and develop a more robust API. Search engines will be able to parse and 'understand' your content that much better.
Allowing for 'no CSS' is much more about the philosophies relating to web standards and good coding practises than it is actually about the common final rendering.
I don't take any effort to help users who disable CSS or javascript. If I worked on a site which counted on attracting new customers and had lots of first time hits, then I would probably try and give non-javascript users a scaled down set of features. But I would never bother with users who disable CSS. I think that is probably a very small minority.
I often surf in the terminal using links or lynx when my computer is overloaded and I just can't have Firefox, Java, and some Flash applications taking half of my RAM. Text-only browsers don't have advanced CSS or Javascript support.
Many server administrators might do a similar thing, as most servers are headless, and some administrators might be too lazy to open their other laptop just for a quick browse. People using screen readers get much the same view as a text-only browser, except that the content is read aloud instead of displayed as text.
When using text browsers, I don't expect any fancy colors or tables; usually I just need some quick information. So, IMO, you should at least make all the page's essential information available as plain HTML.
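As a rough way to preview that text-only view without installing links or lynx, here is a small standard-library Python sketch that strips the tags and dumps the remaining text; the sample HTML is made up for the demo.

from html.parser import HTMLParser

class TextDump(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

dumper = TextDump()
dumper.feed("<h1>Products</h1><p>All <em>essential</em> info as plain HTML.</p>")
print(" ".join(dumper.parts))  # Products All essential info as plain HTML.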

how can I protect scraping of certain data on my web pages? [closed]

I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to have images generated in place of those numbers, but if the image is not warped as with a captcha, won't scripts be able to decipher the number anyway? Also, how much of a performance hit would loading images be vs. text?
The only way to make sure bad guys don't get your data is not to share it with anyone. Any other solution is essentially entering an arms race with the screen-scrapers. At one point or another, one of you will find the arms race too costly to continue. If the data you are sharing has any perceptible value, the screen-scrapers will probably be very determined.
It's not possible.
You use javascript and encrypt the page, using document.write() calls after decrypting. I either scrape from the browser's display or feed the page through a JS engine to get the output.
You use Flash. I can poke into the flash file and get the values. You encrypt them in the flash and I can just run it then grab the output from the interpreter's display as a sequence of images.
You use images and I can just feed them through an OCR.
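To make the OCR point concrete, here is a sketch of the attacker's side; it assumes the third-party pytesseract and Pillow packages (plus a local Tesseract install), and price.png is a hypothetical screenshot of the protected number.

from PIL import Image
import pytesseract

# --psm 7 treats the image as a single text line; the whitelist keeps digits only.
digits = pytesseract.image_to_string(
    Image.open("price.png"),
    config="--psm 7 -c tessedit_char_whitelist=0123456789.",
)
print(digits.strip())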
You're in an arms race. What you need to do is make your information so useful and your pages so easy to use that you become the authority source. It's also handy to change your output formats regularly to keep up, but screen scrapers can handle that unless you make fairly radical changes. Radical changes drive users away because the page is continually unfamiliar to them.
Your image solution won't help much, and images are far less efficient. A number is usually only a few bytes long in HTML encoding. Images start at a few hundred bytes and expand to 1 KB or more depending on how large you want them. Images also will not render in the font the user has selected for their browser window, and are useless to people who use assistive devices (e.g. visually impaired people).
Apart from the images, you could display the numbers using JavaScript or flash.
You could also use CSS to position individual digits using various combinations of absolute or relative positions.
You could also use JavaScript to help you create these DIVs.
The point is just to obfuscate enough that it becomes really hard.
One more solution is to use images of segments or single dots and re-construct the images of the digits using CSS, a bit like a dot-matrix display.
You could litter the source of the page with these absolutely positioned DIVs and again make it more difficult to reconstruct by creating them dynamically.
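As a toy sketch of that idea, the following Python function emits the digits of a number in shuffled source order and relies on absolute positioning to reassemble them on screen; the markup and styling here are invented for the demo.

import random

def obfuscated_digits(number):
    spans = []
    for slot, digit in enumerate(number):
        # Each digit carries its true on-screen slot in an inline style.
        spans.append(f'<span style="position:absolute;left:{slot}em">{digit}</span>')
    random.shuffle(spans)  # scramble source order; the page still shows "42.50"
    return '<div style="position:relative">' + "".join(spans) + "</div>"

print(obfuscated_digits("42.50"))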
At any rate, you can't stop a determined scraper from getting to the data: it doesn't take a lot to automate a web browser and take screenshots that can be fed to an OCR.
There is nothing stopping anyone from paying someone pennies to read the data manually anyway.
The point is: how determined are your opponents (or users)?
It's a bit like the software protection business: making things hard enough that you would deter casual 'pirates' is not too hard, and it's a fairly good approach in general.
However, if there is much value in the data you present, there is nothing you can really do to protect it.
All you can do is make it hard enough that casual 'thieves' will prefer to continue paying for your services rather than circumvent them.
Javascript would probably be the easiest to implement, but you could get really creative and have large blocks of numbers with only certain ones viewable, by placing layers on top of the invalid numbers, blending the wrong numbers into the background, or making them invisible via CSS and semi-randomly generated class names.
I can't believe I'm promoting a common malware scripting tactic, but...
You could encode the numbers as obfuscated Javascript that gets rendered at runtime.
Generate an image containing those numbers and display the image. :-)
I think you guys are being too reactive with these solutions. Javascript, CAPTCHAs, even litigation and the DMCA process don't address the complex, adaptive nature of web scraping and data theft. Don't you think the "ideal" solution to prevent malicious bots and website scraping would be something working as a real-time, proactive mitigation strategy? Very similar to a Content Protection Network. Just say'n.
Examples:
IBM - IBM ISS Data Security Services
DISTIL - www.distil.it
Can you provide a little more detail on what it is you're doing? Certainly there's a performance hit to create an image instead of dumping out the text of a number, but how often would you be doing this per day?
Using JavaScript is the same as using text. It's trivial to reverse engineer.
Use animated numbers using Flash. It may not be foolproof, but it would make it harder to crack.
What about posting a lot of dummy numbers and showing the right ones with external CSS? Just as long as the scraper doesn't start to parse the external CSS.
Don't output the numbers, i.e. prefix
echo $secretNumber;
with //.
For all those that recommend using Javascript or CSS to obfuscate the numbers, there's probably a way around it. Firefox has a plugin called Abduction, which saves the page to a file as an image. You could probably modify this plugin to save the image and then analyze it to find the secret number that is trying to be hidden.
Basically, if there's enough incentive behind scraping these numbers from the page, it will be done. Otherwise, just post a regular number and make it easier on your users, so they won't have to worry about not being able to copy and paste the number, or other such problems that result from this trickery.
Just do something unexpected and weird (different every time) with the CSS box model. Force them to actually use a browser-backed screen scraper.
I don't think this is possible. You can make their job harder (use images, as some suggested here), but that is all you can do; you can't stop a determined person from getting the data. If you don't want them to scrape your data, don't publish it, as simple as that...
Assuming these numbers are updated often (if they aren't, then protecting them is moot, since a human can just transcribe them by hand), you can limit automated scraping via throttling. An automated script would have to hit your site often to check for updates; if you can limit those checks, you win, without resorting to obfuscation.
For pointers on throttling see this question.
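As a sketch of what such throttling can look like at the application level (a single-process, in-memory version with invented limits; a real deployment would throttle at the web server or in a shared store such as Redis):

import time

WINDOW_SECONDS = 60   # invented limits for the demo
MAX_REQUESTS = 30
hits = {}             # client_ip -> timestamps of recent requests

def allow_request(client_ip):
    now = time.monotonic()
    recent = [t for t in hits.get(client_ip, []) if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS:
        hits[client_ip] = recent
        return False  # over the limit: serve HTTP 429 instead of the data
    recent.append(now)
    hits[client_ip] = recent
    return True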

Search by hash?

I had the idea of a search engine that would index web items like other search engines do now but would only store the file's title, url and a hash of the contents.
This way it would be easy to find items on the web if you already had them and didn't know where they came from or wanted to know all the places that something appeared.
It would be more useful for non-textual items like images, executables and archives.
I was wondering if there is already something similar?
Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.
In general, there are several flavors available: hashes for strings (such as simhash), for sets or 0/1 features (such as min-wise hashes), and for real-valued vectors.
The main trick for numerical hashes is basically dimensionality reduction. For strings, the idea is to come up with a representation that's robust in the face of minor edits.
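Here is a toy simhash sketch along those lines, using MD5 over words purely for the demo; similar strings come out with a small Hamming distance between their hashes.

import hashlib

def simhash(text, bits=64):
    weights = [0] * bits
    for word in text.split():
        # Any stable per-word hash works; MD5 is used here just for the demo.
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(bin(a ^ b).count("1"))  # few differing bits for similar strings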
I'm also doing a little research in this field, although I guess stackoverflow might not be the right place for nascent work.
The question seems to focus on exact match hashes, which we understand better than nearest-neighbor approaches, and are indeed worthwhile, especially if people can share tags and other metadata that way.
As @rjmunro notes, hash-based searching is a popular idea in the P2P world, and Bitzi did pretty much this. They have since shut down, and their Bitpedia (Digital Media Encyclopedia) isn't hosted there any more, though at least some of it is still available at Archive.org.
Bitzi also produced software like Bitcollider (SourceForge.net), and the Magnet URI scheme, which allows for specifying a file by hash and is thus a content-based identifier. Various applications support searching various databases via Magnet URIs, as described at that Wikipedia page.
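As a small illustration of such content-based identifiers, here is a sketch that computes a file's SHA-1 and builds the urn:sha1 form of a Magnet URI (Base32-encoded, per that convention); the file name is hypothetical.

import base64
import hashlib

def magnet_for(path):
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    # The urn:sha1 convention encodes the digest in Base32.
    return "magnet:?xt=urn:sha1:" + base64.b32encode(sha1.digest()).decode()

print(magnet_for("song.mp3"))  # hypothetical file name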
The same idea is popular in the password-cracking scene - see e.g. findmyhash - Python script to crack hashes using online services etc.
Going a step further, I think it would be great if there were databases and online repositories identifying content by hash and providing tags and other metadata about the content from various perspectives. Then I could leave my music collection in its pristine state (no wasted backup space and time), but still tag them myself and add other metadata, via external tag databases. If my applications knew how to grab the tags, it would seem much better than the current system where we modify and copy around big files just to move tags from e.g. my desktop to my phone.
See a related idea at Metadata Independent Hashing for Media Identification & P2P Transfer Optimisation (pdf).
Well, for images, there's http://tineye.com, which will one-up that, and find you similar images too.
It's not a bad idea. Sometimes I find myself stumbling upon some file and trying to figure out where it came from :) But how are you going to track an item's source? Content can be obtained in various ways - web browser, download manager, or simply copying from a network share.
If I understand your proposal right, http://bitzi.com/ has done this for a while.
