Search Algorithm for a web application that needs to look for a specific value - search

I'm developing a webapp that will need to download the html form a website and then iterate through the code and try to find a specific but ever changing value (in our case it will be the price for the product).
For this, I was thinking about asking the user (upon installation and setup) to provide the system with a few lines of html from the page (that has the price) and then from then on, every time we need to fetch the price we would try to search for those lines and find the price.
Now, I believe this is a horrible and slow way of doing this and since there are no rules and the html can be totally different from one website to another (even the same website might change) I couldn't find a better way.
One improvement that I thought about was to iterate through the first time and record the line at which we find the code. Once found, the subsequent times we would then start from a few lines before the expected location and start the search. Any Thoughts on how I can improve on this?
I posted this question on https://cstheory.stackexchange.com/ but they commented that it's not on topic and that I should post it here.
I have the code for the above and if needed I can post it, I'm simply thinking that there must be a better, faster way of doing this.

This is actually something I tried for a project recently (using BeautifulSoup and Python). The solution that worked for me was to workout CSS selectors (which can map to jQuery selectors) that targeted the elements that contained the values I was looking for. In my case I was able to narrow down the full document to just the elements that contained what I was looking for but if you couldn't get exactly what you where after you could combine this with some extra lactic like test to see if it looks like a price (via regex) or test what it is next to.

Related

what are disadvantages of using ":xpath" attribute to identify an object using Watir?

I am using :xpath attribute frequently to identify an element for my automation scripts using Watir and found it really amazing. It is least changing attribute so less work to maintain automated scripts.. off course for those elements which can't be identified otherwise easily through :id, :name, :value attributes..
I am bit concerned to take some expert advise before building so many automated scripts using :xpath.
What is disadvantage of using :xpath to identify an object using Watir?
Do :xpath value of an element will be same in IE, Chrome and FF?when
Is there anything else important i should be aware about using :xpath?
Thanks
The xpath should always be the same in all browsers.
The problem with using xpath is that it is the easiest locator to break, as the locator for the element is dependant on nothing else in that xpath changing. e.g. if you are locating a results table on a page using an xpath and at a later date another table gets added above the table, then the xpath will be broken and your tests will fail until you update the xpath. If that table was located using an id then adding the second table wouldn't break anything as the new table would have a different id.
If the pages you're working on don't have id's and it isn't an option to add some/ ask for some to be added then remember that in watir you can use multiple locators.
e.g. #browser.table(class: 'results_table', text: /Original results table/)
This is a silly example but hopefully it illustrates the point. If there are cases when using multiple locators still won't work for any reason, then I would look into using css selectors instead of xpath as you should be able to achieve the same things but it will be less brittle.
The issue of how often tests break isn't too important in a small test suite, especially if you're the only one working on the tests. However, a couple of years from now when you have hundreds of tests to maintain and two or three people sharing the codebase you can end up spending longer fixing old tests that you spend writing the new ones. It's worth doing anything you reasonably can to minimise this as you go along as doing a rewrite later will always take longer.
Hopefully some of this helps!

Searching Single Pages with Dynamic Content

I have a slight problem I have been trying to address for a client I have been working with. We have 4 sets of single pages that are loading content from a database using PHP based upon a get string that is provided. These pages that are generated are optimized well for SEO and have alt tags for images and Content that we need to be able to search using a search feature.
Now i had assumed (An everyone knows what assuming gets you) that these pages by default would be able to be searched by the concrete 5 built in search feature. But it doesn't work. If I search for a word that I know is definitely on one of these pages even multiple times no results are found.
How can I make Concrete5 search these pages. If its no do able by a default or by a plugin, then can someone please offer some advice on how to fix this. This is an important feature and must be completed.
EDIT: See my comment below. I still need some help or direction here as CSE inst much of an option.
EDIT2: It may be viable for me to install a crawler and a custom search engine to address my problems. I was thinking of spider. Any other suggestions on that or other options are much appreciated!
Unfortunately C5 doesn't provide a way to do this -- the only way to tap into the search index is with blocks. And even if you created a phony block just to pass content from the single_page through to the search index, there's no way to say that some content is from one URL while other content is from another URL (which you'd need to do since your single_page controller is handling many different URL's).
I don't know of a way to achieve what you want to do (and it appears that nobody else does either -- http://www.concrete5.org/community/forums/customizing_c5/make-content-in-single-pages-searchable/ ), other than building your own internal search engine.
EDIT: I just did some digging, and thought that perhaps you could manually insert records into the PageSearchIndex table and specify the searchable content and the desired path there -- but this won't work because it relies on one cID (collection id, a.k.a. page id) per entry -- so you'd only be able to insert one record for the top-level single_page path.
I think the simplest solution here would be to create your own searching infrastructure for your single_pages (like some kind of function in the controller that would return an array of page paths and searchable content for each one), then override the search block and perform an additional search of your single_page -- then combine the results on the search results page there. Or just use google site search for your site, which will actually crawl the pages and hence find your various single_page urls: https://www.google.com/cse/
Best of luck.
I have not tested this, but maybe you can put a function getSearchableContent() in the single pages controller like you do for blocks. This would return the string to be searched. Would look something like this:
function getSearchableContent() {
// ... compose searchstring depending on the queried content.
return $searchstring;
}
But I don't know if this works for dynamic content. If not, I'd look into C5's search index core classes and try to extend them for your project.

Modx - Extend site_content - Add new table

Currently, we're running revolution 2.2. On site_content, we have some tags that are ran for crawling twitter. I want to start tracking the number of results for each tag as results come in, to determine which tags don't return that many results, etc.
So I was thinking that I should create a new table (twitter_data), and have a foreign key that will link it to the search tag ID, which is stored in site_content.
What is the best path to accomplish this? Should I create my table then run the reverse schema tool, outlined here?
http://rtfm.modx.com/display/revolution20/Reverse+Engineer+xPDO+Classes+from+Existing+Database+Table#ReverseEngineerxPDOClassesfromExistingDatabaseTable-CreatingaMySQLtable
I also found this, but not sure if this is what I should be looking into:
http://rtfm.modx.com/display/revolution20/Using+Custom+Database+Tables+in+your+3rd+Party+Components
Probably not - if you can avoid modifying the core modx schema do so. an external table may be your best option, but requires a fair bit of work.
though if you can explain wht you mean by 'tags' a little better [html tags? snippets? content tags? not sure what you mean] there may be other options. for example. one of our clients wanted to count page hits [and didn't want to use google to do it] so all we did was to create a template variable bound to each page they wanted to count and then updated that appropriate variable by writing plugin to fire on the onpageload or onpagerender event. [I don't ermember exactly which or what it was called]
Basically, you may be able to do this by writng a plugin rather than trying to extend anything or add snippets/chunks.

can you have "variables" in text in google sites?

Sorry, this is a bad question. I don't even know what the title should be. I'm a total noob at making websites so this might be easy to find but I just don't know the terminology to search for. I cannot find anything about how to do this...
What I want to do is have something like references/variables that I can use in a block of text and it will automatically get replaced with whatever value should be there. Best way I can think of to describe it would be if I was using the site as a design doc for a game or something, I would be able to type in [Title] or something similar on any page and when it loads that text would be replaced with whatever my Title is. That way If I ever change titles, names, classes, races, places, items, etc... they would only have to be changed in 1 place and the change would be reflected everywhere.
I notice if I add a link to a page it will automatically use the Title of that page as the text of the link. That is almost exactly what I want. Except when I change the Title of the other page the text of the link remains as the original text. It doesn't get updated to the new Title and that is not at all what I want.
Also, I want to do this in Google Sites and as simply as possible. I don't really want to use a database. I was hoping Google Sites would have some kind of funcionality for this.
I don't believe this is possible (on Google Sites) and likely you need to consider a hosted solution.
Quoting the answer from this relevant post:
You should consider hosting your solution using Google's App Engine
instead of Google Sites. You can set it up so it uses PHP (see link
below), you can configure it to use your domain name and you get
enough CPU, disk and bandwidth allowance to serve around five million
page views for free each month, if you are serving more than that,
their prices are extremely competitive.
Google App Engine:
http://code.google.com/appengine/docs/whatisgoogleappengine.html How
to setup PHP using Google App Engine: http://blog.caucho.com/?p=187
Also I'm not sure how your PHP skills are but if you're unfamiliar with it then this should help to get you started.

Movable Type 4: How can I show the last asset?

I know i can use
<mt:EntryAssets lastn="1">
<img src="<$mt:AssetThumbnailURL width="100"$>" />
</mt:EntryAssets>
to show the 'last' asset...how do I show the 'first' or 'oldest' assest?
[I'll point out here that "first" and "oldest" are not necessarily the same question.
You'll see why this is important below. Given the snippet you used, I'm going to assume what you're asking for is first as in position within the entry content. Sorry for length, but this is one of my pet bugs.]
Technically, you can't. That bug(summarized further down if you don't have an Fbz account) has finally been attached to a milestone, so hopefully this won't always be the case.
Practically, reversing the sort order will usually probably output what you expect:
<mt:entryassets limit="1" sort_order="ascend">
...as long as you compose your entries top-to-bottom, and don't later mess with the assets much
The underlying problem is that the current EntryAssets implementation doesn't actually take your content into account. It just loads a list of associated assets and then sorts them by the created_on dates of the assets themselves, not what physical order they appear in or even when they were attached to that particular entry. So as an extreme example, if you insert five images into a post, my snippet above will return the first image, as expected. If you later reverse their order and save, it'll still output that same image, which is now the (ordinal) last one. So, back to what I said at top, you're thinking "first" and MT is always giving you "oldest." And this requires an even further assumption that you're always uploading the assets at time of composition. If one of them was already in the system from say, two years ago, it's going to get returned because it's just older than everything else.
If you're using MT4.3x with the Entry Asset Manager in the sidebar of the composition screen and use it to attach(rather than insert) assets, this is going to get even more complicated, because there's no way to distinguish between assets that were associated with the entry via each manner.
So.
If you absolutely need the returned asset to be predictable, you'll need to actually distinguish it from the group in some way. There's this suggestion to tag the asset with "#first" or something similar. It's not great, but you'll at least know what you're getting(assuming you only tag one asset per entry as such). If you've got custom fields available, you might see if it makes more sense to create a separate "featured/thumbnail image" asset field that it would go into so that you could explicitly test for it. It'll ultimately depend some upon why you're wanting to extract this particular asset.

Resources