Search by hash?

I had the idea of a search engine that would index web items like other search engines do now, but would store only each file's title, URL and a hash of its contents.
That way it would be easy to find an item on the web if you already had a copy but didn't know where it came from, or to find all the places where something appears.
It would be most useful for non-textual items like images, executables and archives.
I was wondering: does something similar already exist?

Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.
In general, there are several flavors available: hashes for strings (such as simhash), for sets or 0/1 features (such as min-wise hashes), and for real vectors.
So far, the main trick for numerical hashes is basically dimension reduction. For strings, the idea is to come up with a representation that's robust to minor edits.
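As a minimal sketch of the flavor for real vectors, here is random-hyperplane hashing in Python (all names and sizes are illustrative, not from any particular library): nearby vectors land on the same side of most random hyperplanes, so their bit signatures differ in only a few positions.

import random

# Each signature bit records which side of a random hyperplane the vector
# falls on; similar vectors get signatures with small Hamming distance.
DIM, BITS = 128, 32   # arbitrary example sizes
random.seed(0)
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def lsh_signature(vec):
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (dot >= 0)
    return bits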
I'm also doing a little research in this field, although I guess Stack Overflow might not be the right place for nascent work.

The question seems to focus on exact-match hashes, which are better understood than nearest-neighbor approaches and are indeed worthwhile, especially if people can share tags and other metadata that way.
As #rjmunro notes, hash-based searching is a popular idea in the P2P world, and Bitzi did pretty much this. Bitzi has since shut down, and its Bitpedia (Digital Media Encyclopedia) is no longer hosted there, though at least some of it is still available at Archive.org.
Bitzi also produced software like Bitcollider (SourceForge.net),
and the Magnet URI scheme, which allows a file to be specified by hash and is thus a content-based identifier. Various applications support searching various databases via Magnet URIs, as described on that Wikipedia page.
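To make the idea concrete, here is a minimal Python sketch that turns a file into a Gnutella-style magnet link using the Base32-encoded SHA-1 form (exact URI parameters vary by client, so treat this as illustrative):

import base64
import hashlib

def magnet_for_file(path):
    # Hash in chunks so large files need not fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    digest_b32 = base64.b32encode(h.digest()).decode("ascii")
    return "magnet:?xt=urn:sha1:" + digest_b32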
The same idea is popular in the password-cracking scene - see e.g. findmyhash, a Python script to crack hashes using online services, etc.
Going a step further, I think it would be great if there were databases and online repositories identifying content by hash and providing tags and other metadata about the content from various perspectives. Then I could leave my music collection in its pristine state (no wasted backup space and time) but still tag the files myself and add other metadata via external tag databases. If my applications knew how to fetch the tags, that would be much better than the current system, where we modify and copy around big files just to move tags from, say, my desktop to my phone.
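A purely hypothetical sketch of what such a lookup could look like; the service URL and response shape are invented for illustration, and no such API is implied to exist:

import hashlib
import json
import urllib.request

def lookup_tags(path, service="https://tags.example.org/api/"):
    # Identify the file by its content hash, never by name or location.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    # Hypothetical endpoint returning e.g. {"title": ..., "tags": [...]}.
    with urllib.request.urlopen(service + digest) as resp:
        return json.load(resp)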
See a related idea at Metadata Independent Hashing for Media Identification & P2P Transfer Optimisation (pdf).

Well, for images, there's http://tineye.com, which will one-up that and find you similar images too.
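TinEye's algorithm isn't public, but a simple perceptual "average hash" illustrates the general idea behind similar-image search (a sketch using Pillow; real services are far more sophisticated):

from PIL import Image

def average_hash(path, size=8):
    # Shrink to a size x size grayscale thumbnail, then record which
    # pixels are brighter than the mean. Near-duplicate images produce
    # nearly identical bits, so compare hashes by Hamming distance.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (p >= avg)
    return bits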

It's not a bad idea. Sometimes I stumble upon some file and try to figure out where it came from :) But how are you going to track an item's sources? Content can be obtained in various ways: a web browser, a download manager, or simply copying from a network share.

If I understand your proposal right, http://bitzi.com/ has done this for a while.

Related

How do I store versions of constantly changing text at low cost?

I have been building an online text editor, just for the learning experience. I'm curious about the best way to store multiple versions of a constantly changing text file.
I've looked at a variety of options and have yet to see a cheap, scalable one.
I've looked into Google Cloud Storage and Amazon S3. The only issue is that frequent save requests quickly add up in cost. I'd like files to be saved practically instantly, and also versioned every so often. I've also looked into data deduplication, which looks like a great option, but I have not yet found a way to do it without writing my own software.
Any and all advice would be greatly appreciated. Thanks!
This is a very broad question, but the basic answer is usually some flavor of Operational Transform. Basically, you don't want to be constantly sending the entire document back and forth between the user(s) and the server, nor do you want to overwrite the whole document repeatedly. Instead you want to store diffs. Then you need to handle the fact that multiple users might be changing the file simultaneously, possibly in different areas, and deal with that effectively.
Wikipedia has some good, formal discussion of the idea: https://en.wikipedia.org/wiki/Operational_transformation
You wouldn't need all of that for a document that will only be edited by one person at a time, but even then, the answer is to think in terms of diffs from previous versions and only occasionally persist whole snapshots.
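As a minimal illustration of the diff idea (plain delta storage, not OT itself), Python's standard difflib can both produce a delta and recover a version from it:

import difflib

old = "the quick brown fox\njumps over the dog\n".splitlines(keepends=True)
new = "the quick brown fox\njumps over the lazy dog\n".splitlines(keepends=True)

# Store the delta instead of a second full copy of the document.
delta = list(difflib.ndiff(old, new))

# Later, recover either side straight from the delta (2 = the newer one).
rebuilt = list(difflib.restore(delta, 2))
assert rebuilt == new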

Unchangeable EXIF data

Do you know if there is any EXIF data that cannot be changed?
In my case I want to know the real creation date of a JPEG image. I thought EXIF data was the best way to find it, but I realized that software like XnView can change it. So is there any way I can know the real creation date of an image?
On another note, is it possible to tell whether EXIF data has been modified?
Thanks for any help, and sorry for my bad English.
Have a good day!
:)
In principle, it is not possible to be sure the data hasn't been edited, although it may take a great deal of skill to edit it undetectably. Some of the major camera makers (Canon and Nikon, possibly others) offer an "image authentication" feature in their pro model cameras which is designed to make it impossible to modify the image after it has been taken. They do this for the benefit of people doing legal work - evidence shots and the like. To use this, you have to switch it on (via the camera settings) before you take the picture. Even with these, though, it is still possible to alter the data: both the Canon and Nikon authentication systems have been cracked (presumably with considerable difficulty).
As for normal pictures, yes, these are very easy to alter. However, many (most?) programs which can edit EXIF data leave their own signs. For example, Adobe Photoshop always adds its own name somewhere in the EXIF, apparently whether you want it to or not. You can see this with many different EXIF viewers, especially the more advanced ones like PhotoME. (Which, sadly, is no longer maintained.)
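A small sketch with Pillow that reads such fields; "Software" and "DateTime" are names from the standard EXIF tag table, while "photo.jpg" is just a placeholder:

from PIL import Image
from PIL.ExifTags import TAGS

def exif_by_name(path):
    # Map numeric EXIF tag ids to their standard names.
    exif = Image.open(path).getexif()
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

tags = exif_by_name("photo.jpg")
print(tags.get("Software"), tags.get("DateTime"))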
Short answer: yes, it is always possible to edit EXIF, and almost always possible to do it undetectably, but it may require the right tools and quite a lot of skill. You can't ever be certain it has not been done.

When designing a website, do you need to consider users who disable CSS?

Have we finally got to the point where we can assume CSS2 and hope for CSS3?
(Not looking for discussion; if the answer is "yes, you idiot", go for it...)
You should always take into consideration users who
A. use screen readers and text-only browsers
B. are on mobile devices
C. are not human (e.g. search engine spiders)
By having a good separation of content and style, you should be able to address each of these with ease. As far as users who have CSS disabled, in this day and age, I don't think a designer should concern themselves over it too much. It's certainly not worth spending a significant amount of time and resources on.
What is your target audience and what is your cost for supporting (or not supporting) certain clients?
In addition to the fine points made by pst and ttreat31, I'll add that using semantic markup will generally let your document be readable with CSS disabled (i.e. using the browser's default CSS).
There may be a few quirks (forms come to mind), but generally I find with my own pages, they are plenty readable.
You, and your business, will probably survive if you require CSS. But you'll probably do better if you DON'T require it.
By catering for non-CSS cases, you'll write better markup, with better-structured content. You'll mitigate cross-browser problems, and develop a more robust API. Search engines will be able to parse and 'understand' your content that much better.
Allowing for 'no CSS' is much more about the philosophies relating to web standards and good coding practices than it is about the common final rendering.
I don't take any effort to help users who disable CSS or javascript. If I worked on a site which counted on attracting new customers and had lots of first time hits, then I would probably try and give non-javascript users a scaled down set of features. But I would never bother with users who disable CSS. I think that is probably a very small minority.
I often surf in the terminal using links or lynx when my computer is overloaded and I just can't have Firefox, Java, and some Flash applications taking half of my RAM. Text-only browsers don't have advanced CSS or Javascript support.
Many server administrators might do a similar thing, as most servers are headless, and some administrators might be too lazy to open another laptop just for a quick browse. People using screen readers get much the same view as a text-only browser, except the content is read aloud instead of displayed.
When using text browsers, I don't expect any fancy colors or tables; usually I just need some quick information. So, IMO, you should at least make all the page's essential information available as plain HTML.

online trading bot

I want to code a trading bot for Magic: The Gathering Online. This bot should wait until someone offers to trade, accept, look through the cards available from the other trader (the information is shown on screen), and perform other similar functions. I have several questions:
How can it know that someone is offering a trade?
How can it know that the other trader has some card (the information is shown in pictures)?
I just cannot imagine right now how to do this, as I have no experience with it; until now I've been coding only console programs for my physics work.
First, you should note that some online games forbid bots, as they can give certain players unfair advantages. The MTGO Terms of Service do not seem to say anything about this, though they do put restrictions on anything that might negatively impact the service. They have also said that there is a possibility they will add an API in the future, so they don't seem to be against the idea of automation, but they are not supporting it at the moment. Tread carefully here, but it looks like it should be OK to write a bot as long as it is not harmful or abusive. This is not legal advice, and it would be a good idea to ask the folks who run MTGO for permission. Edit: since I wrote this, it has been pointed out that there are lots of bots already, so there should be no problem writing one.
Assuming that it is not forbidden by the terms of service, but they do not have an API, you will have to find a way to detect what's going on, and control the game automatically. There's a pretty good series of articles on writing poker bots (archived copy), which has some good information on how to inject a DLL into an application, scrape the screen, and control the application. That might provide you with a starting point for doing this sort of thing.
You might also want to look for tools that other people have already written for doing this. It looks like there are several existing MTGO bots, but they all seem a bit sketchy (there have been some reports of them stealing passwords), so be careful there.
Edit
Since this answer still seems to be getting upvotes, I should probably update it with some more useful information. Since writing this, I have found a great UI automation system called Sikuli. It allows you to write programs in Python that automate a GUI. It includes image recognition features which make it very easy to recognize buttons, cards, and other UI elements; you just take a screenshot, crop it down to include just the thing you're interested in, and do fuzzy image matching (so that changing backgrounds and the like doesn't cause the match to fail). It even includes a custom IDE that allows you to embed those screenshots directly in your source code, so you can see exactly what the code is looking for. Here's an example from the documentation (apologies for the code formatting, doing images inline in code is not easy given StackOverflow's restricted subset of HTML):
def resizeApp(app, dx, dy):
    switchApp(app)
    # In the Sikuli IDE, find() and Pattern() take inline screenshot
    # images; those images cannot be reproduced here, so the Pattern()
    # argument is shown empty.
    corner = find(Pattern().targetOffset(3, 14))
    drop_point = corner.getTarget().offset(dx, dy)
    dragDrop(corner, drop_point)

resizeApp("Safari", 50, 50)
This is much easier to get started with than the techniques mentioned in the article linked above, of injecting a DLL into the process you are debugging. Sikuli runs entirely at the UI level, so you never have to modify the program you are automating or worry about changes to the internals breaking your script.
One thing it is a bit poor at is handling text; it has OCR features, but they aren't all that good. If the text is selectable, however, you can select the text, copy it, and then look directly at the clipboard.
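For example, something along these lines (a sketch in Sikuli's Python scripting; Env.getClipboard() is part of its API, and the shortcuts assume a Windows-style UI):

# Select the focused text, copy it, then read it from the clipboard.
type("a", KeyModifier.CTRL)   # Ctrl+A: select all
type("c", KeyModifier.CTRL)   # Ctrl+C: copy
card_name = Env.getClipboard().strip()
print(card_name)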
If I were to write a bot to automate something without a good API or text-based interface, Sikuli is probably the first tool I would reach for.
This answer is constructed from my comments.
What you are trying to do is hard, whichever way you go about it.
Arguably the easiest way is to totally mimic the user: the application presses buttons, moves the mouse, etc. The downside is that it depends on being able to recognise the screen.
This is easier if you can alter the game's files, as you can then just reskin (change the image/texture of) the required cards to a single unique colour.
The major downside is that you have to keep the game as the top-level window, or run the game in a virtual machine. Neither of which is ideal.
Another method is to read the process's memory. You may be able to find a published list of memory locations, which would make things simpler; otherwise it involves a lot of hard work with a debugger to deduce the memory addresses. It also helps (a lot) to be able to understand assembly.
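On Linux, for instance, once a debugger has yielded an address, reading it can be as simple as the sketch below (pid, address and length are placeholders; this requires appropriate ptrace permissions over the target process):

def read_process_memory(pid, address, length):
    # Read raw bytes from another process's address space via procfs.
    # Fails with a permission error unless you have ptrace rights.
    with open("/proc/%d/mem" % pid, "rb") as mem:
        mem.seek(address)
        return mem.read(length)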
The third method is to intercept the packets and alter them. This is easier than the method above because (at least for me) it is easier to reverse-engineer the protocol, as you have less information to deal with. It is just a matter of setting up a packet sniffer, performing an action with one variable different (for example, the card) and comparing the differences.
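As a starting point, here is a sketch with scapy (this assumes the game's traffic is unencrypted TCP, and the port number is a placeholder you would have to discover first):

from scapy.all import Raw, sniff

def show_payload(pkt):
    # Print raw TCP payloads so two captures can be diffed by hand.
    if pkt.haslayer(Raw):
        print(bytes(pkt[Raw].load))

# Capture 20 packets to or from the (placeholder) game port.
sniff(filter="tcp port 17000", prn=show_payload, count=20)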
The thing you need to check is that you are not breaking the EULA. I don't know how this game works, but most of the games I have come across have an EULA that prohibits (i.e. you get banned for) doing any of the things I have mentioned.

how can I protect scraping of certain data on my web pages? [closed]

I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to have images generated in place of those numbers, but if the image is not warped as with a captcha, won't scripts be able to decipher the number anyway? Also, how much of a performance hit would loading images be versus text?
The only way to make sure the bad guys don't get your data is not to share it with anyone. Any other solution essentially enters an arms race with the screen scrapers. At one point or another, one of you will find the arms race too costly to continue. If the data you are sharing has any perceptible value, the screen scrapers will probably be very determined.
It's not possible.
You use JavaScript and encrypt the page, using document.write() calls after decrypting. I either scrape from the browser's display or feed the page through a JS engine to get the output.
You use Flash. I can poke into the flash file and get the values. You encrypt them in the flash and I can just run it then grab the output from the interpreter's display as a sequence of images.
You use images and I can just feed them through an OCR.
You're in an arms race. What you need to do is make your information so useful and your pages so easy to use that you become the authority source. It's also handy to change your output formats regularly to keep up, but screen scrapers can handle that unless you make fairly radical changes. Radical changes drive users away because the page is continually unfamiliar to them.
Your image solution won't help much, and images are far less efficient. A number is usually only a few bytes long in HTML encoding. Images start at a few hundred bytes and grow to 1 KB or more depending on how large you want them. Images also will not render in the font the user has selected for their browser window, and are useless to people who use assistive technologies (e.g. visually impaired people).
Apart from the images, you could display the numbers using JavaScript or flash.
You could also use CSS to position individual digits using various combinations of absolute or relative positions.
You could also use JavaScript to help you create these DIVs.
The point is just to obfuscate enough that it becomes really hard.
One more solution is to use images of segments or single dots and re-construct the images of the digits using CSS, a bit like a dot-matrix display.
You could litter the source of the page with these absolutely positioned DIVs and again make it more difficult to reconstruct by creating them dynamically.
At any rate, you can't stop a determined scraper from getting to the data: it doesn't take a lot to automate a web browser and take screenshots that can be fed to an OCR.
There is nothing stopping anyone from paying someone pennies to get the data manually anyway.
The point is: how determined are your opponents (users?)?
It's a bit like the software protection business: making things hard enough that you would deter casual 'pirates' is not too hard, and it's a fairly good approach in general.
However, if there is much value in the data you present, there is nothing you can really do to protect it.
All you can do is make it hard enough that casual 'thieves' will prefer to continue paying for your services rather than circumvent them.
Javascript would probably be the easiest to implement, but you could get really creative and have large blocks of numbers with certain ones being viewable by placing layers on top of the invalid numbers, blending the wrong numbers into the background, or making them invisible via css and semi-randomly generated class names.
I can't believe I'm promoting a common malware scripting tactic, but...
You could encode the numbers as encoded Javascript that gets rendered at runtime.
Generate an image containing those numbers and display the image. :-)
I think you guys are being too reactive with these solutions. JavaScript, CAPTCHA, even litigation and the DMCA process don't address the complex adaptive nature of web scraping and data theft. Don't you think the "ideal" solution to prevent malicious bots and website scraping would be something working as a real-time proactive mitigation strategy? Very similar to a Content Protection Network. Just say'n.
Examples:
IBM - IBM ISS Data Security Services
DISTIL - www.distil.it
Can you provide a little more detail on what it is you're doing? Certainly there's a performance hit to create an image instead of dumping out the text of a number, but how often would you be doing this per day?
Using JavaScript is the same as using text. It's trivial to reverse engineer.
Use animated numbers in Flash. It may not be foolproof, but it would make it harder to crack.
What about posting a lot of dummy numbers and showing the right ones with external CSS? Just as long as the scraper doesn't start to parse the external CSS.
Don't output the numbers, i.e. prefix
echo $secretNumber;
with //.
For all those who recommend using JavaScript or CSS to obfuscate the numbers: there's probably a way around it. Firefox has a plugin called Abduction, which basically saves the page to a file as an image. You could probably modify this plugin to save the image and then analyze it to find the secret number that is trying to be hidden.
Basically, if there's enough incentive to scrape these numbers from the page, it will be done. Otherwise, just post a regular number and make things easier on your users, so they won't have to worry about being unable to copy and paste the number, or other such problems that result from this trickery.
Just do something unexpected and weird (different every time) with the CSS box model. Force them to actually use a browser-backed screen scraper.
I don't think this is possible. You can make their job harder (use images, as some have suggested here), but that is all you can do; you can't stop a determined person from getting the data. If you don't want people to scrape your data, don't publish it. As simple as that...
Assuming these numbers are updated often (if they aren't, then protecting them is completely moot, as a human can just transcribe them by hand), you can limit automated scraping via throttling. An automated script would have to hit your site often to check for updates; if you can limit these checks, you win, without resorting to obfuscation.
For pointers on throttling see this question.
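As a rough illustration, a per-client token bucket is one common way to throttle (names and limits here are arbitrary; call allow() before serving each request):

import time
from collections import defaultdict

RATE = 1.0    # tokens replenished per second (arbitrary example values)
BURST = 5.0   # maximum bucket size

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(client_ip):
    # Refill the client's bucket for the elapsed time, then spend a token.
    b = _buckets[client_ip]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False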
