Extract parameters and result contents from website - web

I have a website where I can input a list of strings and it'll display the results of each in the same format (basically a table).
What I want to do is save the results together with their corresponding parameters (the input strings I searched for) and write them to a file to analyze later. In other words, I want to capture my input and the output it returns. For example, if I search "stack" on Google, I want my output file to contain "stack" and all the results displayed for that search.
I've done some research on web and screen scraping, but I can't find anything that fits my needs. I looked into the cURL functions in PHP, but it looks like they can only get the contents of a specific URL, which I don't have since I'll be repeating the searches frequently.
I also looked into the HTML Agility Pack and HttpWatch, but they don't seem to be able to extract contents this dynamically.
I was wondering if there are any ideas or tips that I could use. I was thinking of writing a plugin or application that captures the parameters of my request (the input strings) and the results sent back from the server, but I'm not really sure how to do this. Or maybe there's an existing tool that I wasn't able to find?
Thanks in advance!
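One possible way to store each result next to the query that produced it, sketched under assumptions: the results page is a plain HTML table, and the page has already been fetched. The names `TableRows`, `rows_for_query`, `to_csv`, and the sample HTML are hypothetical, and only Python's standard library is used.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished rows
        self.row = None   # cells of the row being read, or None outside a row
        self.cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def rows_for_query(query, html):
    """Return the table rows of one result page, each prefixed with its query."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[query] + row for row in parser.rows]


def to_csv(rows):
    """Serialize rows to CSV text that can be written to a file."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical result page for the query "stack".
sample_html = (
    "<table><tr><td>result A</td><td>1</td></tr>"
    "<tr><td>result B</td><td>2</td></tr></table>"
)
print(to_csv(rows_for_query("stack", sample_html)))
```

Appending the CSV text for every query to one file gives a single table whose first column is always the input string.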

Related

Converting-punycode-with-dash-character-to-unicode

This is in reference to this topic on the page here:
Converting punycode with dash character to Unicode
//Javascript Punycode converter derived from example in RFC3492.
I don't know where to place the input domain 清华大学.cn to get the JavaScript to work. I am not a real programmer.
I want to use the JS code on that page to convert IDN domain names to Punycode if possible. I'm using a ColdFusion HTML page to process the JS. Then I'll save the Punycode to our SQL database.
Example: 清华大学.cn needs to be converted to Punycode.
I can use any number of online converters, but that won't help. It has to be automated with a script. FYI, the Punycode for 清华大学.cn is xn--xkry9kk1bz66a.cn.
HERE IS MY PROBLEM:
Even after copying the JS code into Dreamweaver, I have no idea where to place the domain 清华大学.cn in the JavaScript code so that it gets converted. I can't see any hint of where the input goes, if there is one. I can figure things out okay if I have some hint about where to begin.
I just need to know where to place the input, or for someone to tell me this can't be done with the JavaScript example on that page.
We are using ColdFusion 19 and SQL on our under-construction domain marketplace website. We want to accept IDN domains for listing, and I am hoping your JS will do what I want.
If I'm totally wrong, then perhaps someone can suggest other JS code that will convert the domain to the correct Punycode.
After searching, I found a close answer that I can at least work with, I hope. I needed an HTML input form to process the JavaScript.
I found that information here.
How to convert domain names with greek characters to an ascii URL?
I then copied the page, inserted the JavaScript as puny.js, and it works. Now I need to figure out how to capture the input "id" and "label for" so I can save the result into SQL using ColdFusion. I'm not sure if this can be done, but at least this somewhat answers my question. Maybe it's the best I'm going to get here on Stack Overflow.
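For comparison, the same conversion can be sketched on the server side in Python rather than JavaScript or ColdFusion. This assumes Python's built-in `idna` codec, which follows the older IDNA 2003 rules, is acceptable for the domains involved; `to_ascii_domain` and `to_unicode_domain` are hypothetical helper names. For the domain in the question, the expected ASCII form is the one the asker quotes, xn--xkry9kk1bz66a.cn.

```python
def to_ascii_domain(domain):
    """Encode each label of an internationalized domain name as Punycode (ASCII)."""
    return domain.encode("idna").decode("ascii")


def to_unicode_domain(domain):
    """Decode an ASCII (xn--) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


ascii_name = to_ascii_domain("清华大学.cn")
print(ascii_name)                     # ASCII form, safe to store in a database column
print(to_unicode_domain(ascii_name))  # round trip back to the Unicode form
```

The ASCII form is the one to store and compare, since it is what DNS actually uses; the Unicode form is only needed for display.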

Getting thumbnails in OpenSearchServer search results

I need an alternative to Google Custom Search for a website I look after. It has to be something that will crawl a website, index it, allow fiddling with priorities, and then accept search queries via REST or something similar and return XML, JSON, etc. It needs to run on a Windows Server instance.
So, I'm up and running with http://www.opensearchserver.com/ and it seems to do the trick, but I can't, for the life of me, work out how to get thumbnail images in the results. I've searched the documentation and read everything I could, but I can't find out how to do this (or get my head around it).
I'm crawling standard web pages and they all have thumbnail metadata, which I'm assuming can be parsed somehow and included in the JSON results?
Any pointers at all would be very helpful, thanks!
I figured this out. In case anyone else is struggling, here's how I did it. The answer is in the documentation; it's just not that simple.
Read: http://www.opensearchserver.com/documentation/faq/crawling/how_to_extract_specific_information_from_web_pages.md - it contains the method
Assume you set up a 'web crawler' index.
Assuming you're using a meta thumbnail like this:
<meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg">
Go into Schema / Fields. Add a new field called 'thumbnail' with index no, store yes, vector no, analyser Text, copy of blank. Save that.
Now go to Schema / Parser list and edit the HTML parser. Go to 'Field mapping' and add a new regex for the thumbnail in the HTML. We map from 'htmlSource' to 'thumbnail' with the matching regex.
My imperfect regex (that works though) is:
htmlSource -> linked in: thumbnail -> captured by:
(?s)<meta name="thumbnail" content="(.*?)">
Now SAVE this and go to Crawl / Manual crawl, enter a URL that has a thumbnail, and check whether the field now appears in the list below once the page is read. If not, check your regex, and check that you actually saved the HTML parser changes.
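The regex itself can be checked outside OpenSearchServer before crawling. The sketch below applies the same pattern with Python's `re` module to a hypothetical page; `extract_thumbnail` is an illustrative helper name, and Python's regex engine is assumed to treat this particular pattern the same way as the one used by the crawler.

```python
import re

# The pattern from the answer: (?s) lets "." match newlines, (.*?) captures lazily.
THUMBNAIL_RE = re.compile(r'(?s)<meta name="thumbnail" content="(.*?)">')


def extract_thumbnail(html_source):
    """Return the thumbnail URL captured by the pattern, or None if it does not match."""
    match = THUMBNAIL_RE.search(html_source)
    return match.group(1) if match else None


page = """<html><head>
<meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg">
</head><body>...</body></html>"""
print(extract_thumbnail(page))
```

The pattern depends on the exact attribute order and on the tag ending in `">`, which is probably why the answer calls it imperfect: a tag written as `content="..." />` would not match as intended.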
To get the thumb in your results, simply add the fieldname to the JSON you send with the query:
"returnedFields": [
"url",
"thumbnail"
],

Search Algorithm for a web application that needs to look for a specific value

I'm developing a webapp that will need to download the HTML from a website and then iterate through the code to find a specific but ever-changing value (in our case, the price of the product).
For this, I was thinking about asking the user (upon installation and setup) to provide the system with a few lines of HTML from the page (the lines that contain the price). From then on, every time we need to fetch the price, we would search for those lines and find the price.
Now, I believe this is a horrible and slow way of doing it, but since there are no rules and the HTML can be totally different from one website to another (even the same website might change), I couldn't find a better way.
One improvement I thought about was to iterate through the whole page the first time and record the line at which we find the code. On subsequent runs we would start the search a few lines before the expected location. Any thoughts on how I can improve on this?
I posted this question on https://cstheory.stackexchange.com/ but they commented that it's not on topic and that I should post it here.
I have the code for the above and if needed I can post it, I'm simply thinking that there must be a better, faster way of doing this.
This is actually something I tried for a project recently (using BeautifulSoup and Python). The solution that worked for me was to work out CSS selectors (which can map to jQuery selectors) that targeted the elements containing the values I was looking for. In my case I was able to narrow the full document down to just those elements. If you can't get exactly what you are after, you could combine this with some extra logic, such as testing whether the text looks like a price (via a regex) or checking what it sits next to.
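A minimal sketch of that idea, with one substitution stated plainly: instead of BeautifulSoup and full CSS selectors, it uses Python's standard-library `html.parser` and matches on a single class name, so it runs without extra packages. The class name `price`, the helper names, and the price pattern are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Rough "looks like a price" test: a currency symbol followed by digits.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")
# Elements that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ClassText(HTMLParser):
    """Collect the text inside every element that carries a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []    # text pieces of the current match
        self.results = []   # finished text of each match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.chunks).strip())
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_price(html, class_name):
    """Return the first price-like text inside elements with class_name, or None."""
    parser = ClassText(class_name)
    parser.feed(html)
    parser.close()
    for text in parser.results:
        match = PRICE_RE.search(text)
        if match:
            return match.group(0)
    return None


page = '<div class="product"><h1>Widget</h1><span class="price">$19.99</span></div>'
print(find_price(page, "price"))
```

Storing a selector (here, a class name) per site instead of raw HTML lines means a layout change elsewhere on the page does not break the lookup, and the regex check catches the case where the selector starts matching something that is not a price.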

Best non-interactive approach to enter a string into a formular field and get the resulting text

On a website I have access to, there are several input fields. In the sixth field I need to enter an input string from a list of 10000 strings; a new page then appears, for which I just need to count the number of lines. In the end I would like a table with two columns: the input string and the number of resulting lines. Since I would otherwise have to enter all 10000 strings manually, I wonder what the best approach is to enter a string into a generic form field and get the resulting text. I have heard about curl, but I am not sure whether it is the easiest option.
P.S.
Example of the interactive way: I type some string of words into Google search and get a new page with the search results. I have previously entered my Google username and password, so the results will probably be filtered according to my profile.
Example of the non-interactive way: a script somehow supplies my user information and search query and saves the search results to a text file. Imagine the same idea, but for a more complicated website like this.
What you want to do is send an HTTP POST with specific data. This can be done with any proper HTTP client code, for example libcurl (or the pycurl binding, or even the curl command line tool). In response to the POST you will probably get a redirect and then the results, or you will need to make a separate request for the results. Then you're done and can go back to do the next POST. Repeat until all POSTs are done.
What you may need to take into account is that you may have to deal with cookies and possibly follow a redirect from the POST. A good approach is to record a "manual session" in a browser (use Firebug, LiveHTTPHeaders, etc.) and then use that recording to help you repeat the same thing with an HTTP client.
A decent tutorial to get some starting up details on this kind of work can be found here: http://curl.haxx.se/docs/httpscripting.html
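The POST loop described above might look roughly like this in Python, using the standard-library `urllib` in place of libcurl. The URL and the field name in the commented usage are hypothetical and would come from the recorded browser session; the demo at the bottom uses canned responses so that nothing is sent over the network.

```python
import http.cookiejar
from urllib import parse, request


def make_opener():
    """Build an opener that keeps cookies between requests; redirects are followed by default."""
    jar = http.cookiejar.CookieJar()
    return request.build_opener(request.HTTPCookieProcessor(jar))


def post_form(opener, url, fields):
    """Send one HTTP POST with form-encoded fields and return the response body as text."""
    body = parse.urlencode(fields).encode("utf-8")
    with opener.open(url, data=body, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def count_lines(strings, fetch):
    """Call fetch(s) for each input string and pair it with the number of lines returned."""
    return [(s, len(fetch(s).splitlines())) for s in strings]


# Real usage (hypothetical URL and field name, not executed here):
#   opener = make_opener()
#   fetch = lambda s: post_form(opener, "https://example.com/search", {"field6": s})
#   table = count_lines(all_strings, fetch)

# Offline demo with canned responses standing in for the server.
canned = {"alpha": "line 1\nline 2\nline 3", "beta": "only line"}
table = count_lines(["alpha", "beta"], canned.get)
for text, line_count in table:
    print(f"{text}\t{line_count}")
```

Keeping the fetch step as a parameter makes it easy to test the counting logic first and plug in the real POST once the form fields are known.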
You could also use JMeter to run all the POSTs. You can use the CSV input to supply the 10000 strings. Then you save the results as XML and extract the necessary data.

Website Input/Output using a premade .exe (Need direction)

I am trying to build a part of a website which takes in a text passage as input, and outputs the same text passage, except with the definition of each word appearing when the user rolls over (or clicks) any given word. I have a pre-made .exe file which maps input words to their definitions (takes in words from standard input and outputs the definition to standard output).
My problem, then, is to run user input through the .exe file on the website's server, then put the output back onto the page. It seems like a fairly trivial problem, but I have no idea where to start.
So my questions are: is this project even possible? If so, what languages/tools do I need to be able to use in order to implement it? Are there keywords that describe what I'm talking about that I could use to look up tutorials/solutions on the Web?
I have rudimentary knowledge of PHP, HTML, and Javascript, but so little experience that I can't judge whether (and how) they can be used to approach this problem.
Note: I do not have access to the .exe source, so I must use the .exe itself as my input-output mechanism.
With AJAX and PHP, you can accomplish this with minimal effort.
JavaScript's AJAX features would send the word you input to the PHP page, and from the PHP page, you can run the external exe file with the sent word as an argument (sanitize it, please. People can inject code which will explode your servers!):
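<?php
// Read the word sent by the AJAX request. MAKE SURE YOU SANITIZE THIS.
// If you don't, system security goes down the toilet.
$word = $_POST['input_word'] ?? '';
// escapeshellarg() quotes the value so it cannot inject extra shell commands.
// PHP concatenates strings with '.', not '+'.
exec('myprogram.exe ' . escapeshellarg($word), $output_array);
print_r($output_array);
?>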
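The question says the program reads words from standard input, whereas the PHP snippet passes the word as a command-line argument. The standard-input variant can be sketched in Python as follows; this is only an illustration of the idea, not a replacement for the PHP page. `lookup` is a hypothetical helper, and a tiny Python one-liner stands in for the real .exe, whose name and behavior are unknown.

```python
import subprocess
import sys


def lookup(command, text, timeout=10):
    """Run an external program, send text on standard input, return its standard output."""
    completed = subprocess.run(
        command,              # argument list, so no shell ever parses the user's text
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise CalledProcessError if the program exits with an error
    )
    return completed.stdout


# Stand-in for the real .exe: a small program that upper-cases whatever it reads.
stand_in = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(lookup(stand_in, "hello world"))
```

Because the user's text travels over standard input and the command is an argument list rather than a shell string, the injection risk mentioned in the answer does not arise in this variant.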
