I am having trouble summing up this question in one line, so I will first state my understanding, in case the question rests on wrong facts or assumptions. As I understand it (please correct me if I am wrong), captchas work like so:
Have numerous images, and associate each image name/source to its displayed characters value.
Display image, then have user input what they see.
Match user's input against the character value which is associated with that image's name/source.
Assuming my understanding is correct: given an unlimited amount of time, can't one associate image names/sources with the displayed characters, increasing the chance of cracking the captcha as they gather more associations?
In that case, wouldn't the security strength of a captcha be proportional to the size of its image database?
NOTICE: As I suspected, my question was based on a wrong understanding.
Short answer: these are dynamic images, generated on request rather than stored as a fixed set of files. You won't even find them in the source code.
Wikipedia has a good explanation of this. Alternatively, check out the related questions on SO.
Edit: Go to this page, where you can see an example of a captcha. Use Firebug to inspect the HTML for the image and you will see something like this:
<img height="57" width="300" src="http://www.google.com/recaptcha/api/image?c=03AHJ_VutaG4ahxWuQv0e6edYypp_FM8QuFIZkG75AnAm8iu3WRmwQ41jqcvojmKmbSKXxkf_s9fk61-axEp77_omKZZEYliE35BND_hXNh3Jac6ZUAeD08wOMZPj4W2s-A39vAI84eim5q-z9kFnmoSmon1jG2LmmFw" style="display: block;">
Did you notice the source? It does not point to an image file.
You can copy this URL and open it in a browser to generate the image. So you could develop an application that downloads the image, scans for color changes in the pixels, and tries to match letters and numbers. But notice that almost all the letters and numbers are connected or very close together, so it is difficult to separate the individual characters.
Even if you manage to separate them, most of the characters are not perfectly formed. Example:
(source: watblog.com)
I'm developing a webapp that will need to download the HTML from a website, then iterate through the code and try to find a specific but ever-changing value (in our case, the price of the product).
For this, I was thinking about asking the user (upon installation and setup) to provide the system with a few lines of html from the page (that has the price) and then from then on, every time we need to fetch the price we would try to search for those lines and find the price.
Now, I believe this is a horrible and slow way of doing this, but since there are no rules and the HTML can be totally different from one website to another (even the same website might change), I couldn't find a better way.
One improvement I thought about was to iterate through the page the first time and record the line at which we find the code. On subsequent runs, we would start the search a few lines before the expected location. Any thoughts on how I can improve on this?
I posted this question on https://cstheory.stackexchange.com/ but they commented that it's not on topic and that I should post it here.
I have the code for the above and if needed I can post it, I'm simply thinking that there must be a better, faster way of doing this.
This is actually something I tried for a project recently (using BeautifulSoup and Python). The solution that worked for me was to work out CSS selectors (which can map to jQuery selectors) that targeted the elements containing the values I was looking for. In my case I was able to narrow the full document down to just the elements that contained what I was looking for, but if you couldn't get exactly what you were after, you could combine this with some extra logic, such as testing whether the text looks like a price (via regex) or checking what it is next to.
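The "narrow down, then test whether it looks like a price" idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: the class-name match below is a crude stand-in for a real CSS selector (BeautifulSoup's `select` would do this properly), and `ClassTextCollector`, `find_prices` and `PRICE_RE` are names made up for the example.

```python
import re
from html.parser import HTMLParser

# Assumed price shape: a currency symbol, digits, optional thousands groups,
# optional two-digit fraction. Adjust for the sites you actually target.
PRICE_RE = re.compile(r"^\s*[$€£]\s?\d+(?:[.,]\d{3})*(?:[.,]\d{2})?\s*$")

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.buffer).strip())
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def find_prices(html, css_class):
    """Return the text of matching elements that also look like a price."""
    collector = ClassTextCollector(css_class)
    collector.feed(html)
    collector.close()
    return [text for text in collector.results if PRICE_RE.match(text)]
```

The regex step is what filters out elements that share the class but hold something else, such as a "Free shipping" label.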
I am currently working on customizing a corporate GSA box for the client's website with Google's XSL stylesheet.
Unfortunately I do not have direct access to the box; using it involves a meeting with a fellow in another timezone, so my opportunities to experiment and learn are very limited.
One of the biggest issues we are having is whether or not it is possible to get more characters/words into a resulting search snippet in the XML output. More specifically, the returned field.
I've gone through a lot of the documentation for this and so far I've only found the tlen value for Title length, but not snippet length.
I know there are some parameters (hidden fields) to customize some options in the search form, but I am not finding anything relating to this. Since I cannot access the Administrator Control Panel itself, I have no idea what options are there. Can anyone point me to something that will help with this? It's greatly appreciated; I'm striking out on this.
BTW, we are currently on version 6.14, I believe.
The GSA administrator can increase the length of the snippet. This is a system-wide setting and cannot be customized at query time.
See this
http://www.google.com/support/enterprise/static/gsa/docs/admin/70/admin_console_help/serve_query_expansion.html#snippets
Changing the Snippet Length in Search Results
By default the Search Appliance, for most languages, will return Snippets with the length of 160 characters. Some exceptions to this rule are CJK languages for which Snippets are by default 240 characters. To Change the Snippet Length:
Under Snippet Generation, type a number in the Snippet Length box. The number must be bigger or equal to 0 and not bigger than 1024.
Click Apply Snippets Generation Settings.
<div id="temp_1333021214801">
<input type="text"/>
</div>
$browser.text_field(:xpath,".//*[@id='temp_1333018770709']/input").set("apple")
I am getting the error "unable to locate element", because the ID changes dynamically.
Please help me to set the text in the text field.
It seems your dynamic id is temp_ followed by digits, so this should do it, given the information above:
browser.div(:id, /temp_\d+/).text_field.set 'something'
One issue with my solution is that it assumes the id will always be temp_ followed by a run of consecutive digits, which seems to be the case in your sample above. It also assumes there is no other div matching div(:id, /temp_\d+/) in the DOM of that page, which most likely will not be an issue.
If you have dynamic IDs I can suggest the following:
Code to object counts. For example
$browser.text_field(:index => 2)
gives the third text_field on the page.
Code to what is around the thing you're trying to find.
$browser.div(:name => 'mydiv').text_field(:index=>2)
gives the third text field in the div called 'mydiv'.
HOWEVER
If your front-end is less-than-testable in this way, I highly suggest you put time into thinking over your commitment to automated testing in the first place. Any minor change to the software is going to have you working until 9pm, pulling your hair out and rocking back and forth as you update all your scripts, so unless code maintenance is your weekend hobby, think about semi-automation or exploratory testing or manual scripts. Talk to development (whoever that might be. It might be you!) or the higher-ups (unless that's you too) to see if it can be made more testable. Also, don't use XPaths unless you take some deviant pleasure in it.
Hope that was helpful, I can't do anything specific without the source HTML.
I know I can use
<mt:EntryAssets lastn="1">
<img src="<$mt:AssetThumbnailURL width="100"$>" />
</mt:EntryAssets>
to show the 'last' asset... how do I show the 'first' or 'oldest' asset?
[I'll point out here that "first" and "oldest" are not necessarily the same question. You'll see why this is important below. Given the snippet you used, I'm going to assume what you're asking for is first as in position within the entry content. Sorry for length, but this is one of my pet bugs.]
Technically, you can't. That bug (summarized further down if you don't have an Fbz account) has finally been attached to a milestone, so hopefully this won't always be the case.
Practically, reversing the sort order will usually output what you expect:
<mt:entryassets limit="1" sort_order="ascend">
...as long as you compose your entries top-to-bottom, and don't later mess with the assets much.
The underlying problem is that the current EntryAssets implementation doesn't actually take your content into account. It just loads a list of associated assets and then sorts them by the created_on dates of the assets themselves, not by the physical order they appear in or even by when they were attached to that particular entry. So as an extreme example, if you insert five images into a post, my snippet above will return the first image, as expected. If you later reverse their order and save, it'll still output that same image, which is now the (ordinal) last one. So, back to what I said at top, you're thinking "first" and MT is always giving you "oldest." And this requires an even further assumption that you're always uploading the assets at time of composition. If one of them was already in the system from, say, two years ago, it's going to get returned because it's just older than everything else.
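A toy model of the difference, purely for illustration: this is not MT's code, and the `Asset` class with its `position` field is hypothetical (the whole point above is that MT does not track position at all).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Asset:
    name: str
    created_on: date  # when the asset was uploaded to the system
    position: int     # where it appears in the entry body (0 = top); hypothetical


def oldest_asset(assets):
    """What an ascending sort on created_on with limit 1 effectively returns."""
    return min(assets, key=lambda a: a.created_on)


def first_asset(assets):
    """What the asker probably wants: the asset nearest the top of the entry."""
    return min(assets, key=lambda a: a.position)


# An entry whose top image is new, followed by a logo uploaded two years earlier.
assets = [
    Asset("new-photo.jpg", date(2010, 3, 1), position=0),
    Asset("old-logo.png", date(2008, 3, 1), position=1),
]
```

Here `oldest_asset(assets)` picks old-logo.png while `first_asset(assets)` picks new-photo.jpg, which is the mismatch described above.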
If you're using MT4.3x with the Entry Asset Manager in the sidebar of the composition screen and use it to attach (rather than insert) assets, this is going to get even more complicated, because there's no way to distinguish between assets that were associated with the entry in each of these ways.
So.
If you absolutely need the returned asset to be predictable, you'll need to actually distinguish it from the group in some way. There's this suggestion to tag the asset with "#first" or something similar. It's not great, but you'll at least know what you're getting (assuming you only tag one asset per entry as such). If you've got custom fields available, you might see if it makes more sense to create a separate "featured/thumbnail image" asset field that it would go into, so that you could explicitly test for it. It'll ultimately depend somewhat on why you want to extract this particular asset.
I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, Google: http://stackoverflow.com shows the description as
A language-independent collaboratively edited question and answer site for programmers. Questions and answers displayed by user votes and tags.
This is the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Different search engines have different algorithms to get the description out of the page if/when it lacks the description meta tag. Some ignore the tag even if it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
You may want to check AboutUs.org (i.e. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how Google does this:
Webmasters/Site owners Help
Adding a URL to google
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting. Some sources are better than others.
For "audiotuts.com", Google has a worse description than AboutUs.com.
Google
Nov 18th in General by Joel Falconer · 1. Recently, an AUDIOTUTS reader asked me about creative process. While this is a topic that can’t be made into a ...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for musicians, producers and audio junkies! It is the sister site of the popular PSDTUTS, VECTORTUTS and NETTUTS.
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
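A rough sketch of that heuristic, using only the standard library. It is not the product's actual code: the names `BlockCollector` and `summarize`, and the defaults for `min_words` (the n above), `max_sentences` and `max_words` (the x above), are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "title"}  # text inside these is never summary material


class BlockCollector(HTMLParser):
    """Collect the meta description and the text chunks bounded by P/DIV tags."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []
        self._buffer = []
        self._skip = 0

    def flush(self):
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip() or None
        elif tag in SKIP_TAGS:
            self._skip += 1
        elif tag in ("p", "div"):
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in ("p", "div"):
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buffer.append(data)


def summarize(html, min_words=4, max_sentences=3, max_words=50):
    """Meta description if present, else the first block with 2+ sentence-like runs."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    if collector.meta_description:
        return collector.meta_description
    for block in collector.blocks:
        # A "sentence" here is any period-delimited run of more than min_words words.
        sentences = [s.strip() for s in block.split(".") if len(s.split()) > min_words]
        if len(sentences) > 1:
            words = (". ".join(sentences[:max_sentences]) + ".").split()
            return " ".join(words[:max_words])
    return None
```

The word threshold is what skips navigation blocks such as "Home. About. Contact."; splitting on periods will still trip over abbreviations and decimals, which matches the "not 100% accurate" caveat.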