Getting data from a browser by screen-scraping - browser

I have gone through several relevant-looking questions, but they did not contain the answer I am looking for. So, here is my question:
I have several web applications at my workplace, which are written using different frameworks, and the authors are long gone, so I cannot ask them for feature updates. Hence I have to go through the same grueling sequence of actions every day to get the data, which amounts to a file of a few kilobytes.
I tried parsing the page source, but the authors' programming techniques were all over the place. Some even intentionally obscured the code so that the data does not show up as text, and there is no reason for this, as the code they wrote is a company asset. Long story short, I realized that if I can copy and paste the textual content of these pages, I can process that data much more easily than by parsing the page source to get the text (which is sometimes totally impossible).
So, I am now looking for a browser plug-in (in windows or linux environments) or equivalent text based tools on windows or linux, which will load these pages and save the text on the screen to file(s) when invoked.
Despite how hard I tried, I am coming up empty handed.
I do not want to utilize the services of a third-party screen-scraping web site, as the data is company confidential and not accessible by outside parties. Everything has to happen on the client end, as I do not have access to the servers these apps are running on (mostly IIS on Windows at the front end and an Oracle DB at the back end). The middle tier, as I have explained before, is anyone's wild guess, ranging from native Oracle apps to WebLogic to Tomcat to some in-house developed Java/JavaScript stuff.
Thanks for all the help in advance

After searching for an answer for well over a year, I came to realize that, as long as I use a modern version of Windows, AutoHotkey is my savior.
I open the web page, maximize it, place my cursor (mousemove, x, y), left-click (mouseclick, L), then send Ctrl-A followed by Ctrl-C.
Voilà! Everything is in the clipboard. Then I activate my Unix session (winactivate PuTTY), send the appropriate key presses to launch the editor of my choice (which is vi), and finally send Shift-Insert to paste the clipboard into my document. Then save and exit, of course.
As an added bonus, right after my document is saved, I can invoke the script of my choice to parse this file and give me back the portion(s) I am interested in.
I know it is not bullet proof, but for my purpose, it helps to a great extent. As a matter of fact, I can do whatever I want with this method.
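The parsing step mentioned above depends entirely on what the pasted page looks like, so the following is only a minimal sketch in Python. It assumes the interesting portion sits between two recognizable marker lines; the function name and the marker strings are hypothetical and would have to be adapted to each page.

```python
def extract_between(lines, start_marker, end_marker):
    """Return the lines strictly between the first line containing
    start_marker and the next line containing end_marker.

    Both markers are hypothetical; pick strings that reliably appear
    in the pasted page text. Returns [] if start_marker never occurs.
    """
    collected = []
    inside = False
    for line in lines:
        if not inside:
            # Still looking for the start of the interesting block.
            if start_marker in line:
                inside = True
            continue
        if end_marker in line:
            break
        collected.append(line.rstrip("\n"))
    return collected
```

A typical call would read the file saved from vi with `open(path).readlines()` and pass the result in along with the two markers.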

What about something like this: http://www.nirsoft.net/utils/htmlastext.html
Freeware that converts an HTML page to text

Any of links, lynx or w3m will do what you want. They are text browsers, and you can dump the text of a web page with, for example:
w3m -dump http://www.google.com > g.txt
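Since the question involves several pages, the same `w3m -dump` call could be wrapped in a small Python script that saves one text file per page. This is a sketch under assumptions: the function names and the name-to-URL mapping are made up for illustration, and pages that need a login or rely on JavaScript may not render usefully in a text browser.

```python
import subprocess
from pathlib import Path


def dump_command(url):
    """Build the argument list for `w3m -dump URL`."""
    return ["w3m", "-dump", url]


def dump_pages(pages, out_dir, run=subprocess.run):
    """Save the rendered text of each page to out_dir/<name>.txt.

    `pages` maps an output name to a URL (both hypothetical here).
    `run` is injectable so the function can be tried without w3m installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, url in pages.items():
        # check=True raises CalledProcessError if w3m exits with an error.
        result = run(dump_command(url), capture_output=True, text=True, check=True)
        target = out_dir / f"{name}.txt"
        target.write_text(result.stdout, encoding="utf-8")
        written.append(target)
    return written
```

For example, `dump_pages({"g": "http://www.google.com"}, "dumps")` would write `dumps/g.txt`, mirroring the one-liner above.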

Related

Script to paste a specific string into a text field with a hotkey

I am trying to find a way to paste a predefined string upon entering a specific keyboard sequence, on any app.
For example, if I have to paste a URL or a password into a field, I can keep said password in a hidden script, and when I press, say, [ctrl] + [5], it would write "example123" into the text field where my cursor is.
Ideally without copying to the clipboard (I'd prefer to keep what I have on my clipboard and also avoid pasting a password or such elsewhere by mistake).
I have tried every solution I've found so far, including xclip, xdotool and xvkbd. All of them either do not work or are really inconsistent: they only paste the string sometimes, and when they do, it's usually only part of the string ("ample123" instead of "example123").
I thought of using the compose key, which I use heavily anyway to write in French on a US keyboard, but it seems to support only single-character output, as nothing is printed when I modify my .XCompose to include custom output sequences of length > 1.
I am using Ubuntu 18.04 with Gnome as a DE. Ideally something that also works when logging back (like compose keys).
You need to walk the Document Object Model of either Gnome or your web page. My concern is that a desktop script won't be able to reach into the web page, because you need to establish a target to send the string to. I see in your question that you tried the "x{tool-name}" utilities to grab the text field element. Delivering the string really isn't the problem; the problem is getting hold of the text box's GUI element programmatically. The easiest way to get access to it in a user-loaded web page is the WebExtensions API, which is how extensions are built for most modern browsers. Otherwise, if you can get away with only having access to Gnome's GUI, I would try LDTP. It's a library used for testing, but it looks like it can be used for automation too.
For keyboard shortcuts:
It really shouldn't matter what the script is doing to how you want to activate it. I would just go to Gnome/Settings/Keyboard and set the path to where I saved the script to be the Command. If you go the WebExtension route, you will want to build the shortcut into your extension.

Search through scripts of (multiple) cimplicity screens

We are using Cimplicity to operate some installations at our plant. The frontend consists of a lot of .cim files, which are the screens presented to the operator. These files are built with 'cimedit', which is basically a graphical click and drag program with which you can assemble the screens. Each object you drag onto the screen has the option to run a script, which brings me to my problem.
Because each screen contains a lot of small scripts and functions it is hard to keep track of what does what. For example I'm trying to figure out where a certain table from my database is being accessed or updated. Since the files all seem to be compressed (or so) I can't use a regular 'search the contents of this file' search.
Things I've tried so far are searching using windows, with the content option enabled and also tried the compression option. This had no success. It makes sense because like I said, the files seem to be compressed, so the actual script is not stored in plain text.
So, my question in short:
How do I search all the scripts of (preferably multiple) cimplicity screens?
Any tips on how to search compressed files are also very much appreciated.
I stumbled upon another stackoverflow post while searching for a better windows search tool and ended up finding this post: https://superuser.com/questions/26593/best-way-to-confidently-search-files-and-contents-in-windows-without-using-an
This post recommends Agent Ransack, and it is actually possible to search through the .cim files with this tool.
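For a scriptable alternative, a raw byte search over the files can be sketched in Python. The assumption here, which the thread does not confirm, is that the script text inside the .cim files is stored uncompressed but possibly as UTF-16, which a plain-text search would miss. If the files really are compressed, this finds nothing. The function names are hypothetical.

```python
from pathlib import Path


def encoded_variants(term):
    """Byte patterns for the term in two encodings common in Windows files."""
    return {
        "utf-8": term.encode("utf-8"),
        "utf-16-le": term.encode("utf-16-le"),
    }


def search_files(root, pattern, term):
    """Yield (path, encoding) for each file under root that matches the
    glob pattern and whose raw bytes contain the term (case-sensitive)."""
    variants = encoded_variants(term)
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        for encoding, needle in variants.items():
            if needle in data:
                yield path, encoding
```

Calling `list(search_files("screens", "*.cim", "MY_TABLE"))` would list every screen file mentioning the (hypothetical) table name, together with the encoding in which it was found.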

Making Chrome/Firefox Reuse Existing Opened File

I've added logic in Emacs to automatically call browse-url on a DMD generated html documentation file upon completion of a special build finish hook I've written.
For this to be usable, I now want the call to open a new browser tab only the first time it is made, and on every later call to just reload the tab already showing the doc file.
Is this possible, preferably in Google Chrome?
I've scanned command line arguments for both GC and FF but have found nothing.
I suspect some Javascript/HTML-5 may do the trick but I know nothing about that.
For Firefox look into browse-url-firefox-new-window-is-tab and / or browse-url-maybe-new-window. You could follow the execution path from the definition of browse-url-default-browser, all in the browse-url.el.
But the basic idea is that you could just look at how, for example, browse-url-firefox is implemented, write the one that does exactly what you want (launches Firefox in the way that you need), and set it to be the browse-url-browser-function. The value of this variable must be a function which is called from browse-url.
What is interesting (and perhaps something similar is available for Google Chrome) is MozRepl. It runs in Mozilla browsers, and there is an Emacs binding for talking to this REPL (an interactive JavaScript interpreter). Using it you can have very fine-grained control over the behaviour of the browser, including, but not limited to, creating new GUI components (using XUL), manipulating browser windows and so on. Whether it is worth it probably depends on how much time you are willing to spend on it.

DDE/IPC in linux gui?

There used to be a Dynamic Data Exchange API (a type of IPC) in Windows, which allowed sending notifications with parameters to a running process; the process would then grab focus and carry out the operation. Is there anything similar in X Window/Gnome?
For example, when I get my phpunit errors, they come with a file path and line number. I was wondering whether, using a bash or perl script or the like, I could grab the output and make the line below clickable
protected/tests/controllers/CmsControllerTest.php:17
so that it quickly focuses my eclipse, opens the file and moves the cursor to the right line number.
phpunit and eclipse is just for examples. enough said.
The usual way to address this, would be to make the functionality an eclipse plugin.
There are lots of examples on how to write such plugins.
Moreover, you can probably lean on/reuse rather feature complete existing views (Problems view, Tasks view etc.) so making it look beautiful and matching eclipse should be a breeze.
Alternatively, there is a rich API that you could use to implement your own IPC channel to talk with your test runner outside Eclipse. An example of that is eclimd, the Vim-eclipse integration thing. Specifically, look at its behaviour in 'Headed Eclipse' mode.
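Whichever route is taken, the first step is pulling the file path and line number out of the test output. Here is a minimal Python sketch, assuming the locations appear on their own line in the `path/to/File.php:17` form shown in the question; the regular expression and function name are illustrative, and handing the result to Eclipse is left out.

```python
import re

# Matches a whole line such as protected/tests/controllers/CmsControllerTest.php:17
# Assumes paths contain no spaces or colons (so no Windows drive letters).
LOCATION = re.compile(r"^(?P<path>[^\s:]+\.php):(?P<line>\d+)$")


def find_locations(output):
    """Return (path, line_number) pairs found in the test runner output."""
    locations = []
    for raw in output.splitlines():
        match = LOCATION.match(raw.strip())
        if match:
            locations.append((match.group("path"), int(match.group("line"))))
    return locations
```

The output of the test run could be piped into a script that calls `find_locations(sys.stdin.read())` and passes each pair on to whatever channel the editor offers.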

how can I extract text contents from GUI apps in linux?

I want to extract text contents from GUI apps. Here are 2 examples:
example 1:
Suppose I opened firefox, and input url : www.google.com
how can I extract the string "www.google.com" from firefox using my own app ?
example 2:
open calculator(using gcalctool),then input 1+1
How can I extract the string "1+1" of calculator from my own program?
In brief, what I want is to find out whether there is a way to extract the text contents from any widget of a GUI application.
Thanks
I don't think there's a generic way to do this, at least not a very elegant one.
Some inelegant ideas:
You might be able to modify the X window system or even some toolkit framework to extract what is being displayed in specific window elements as text.
You could take a screenshot and use an OCR library to convert the pixels back into text for the interesting areas.
You could recompile the apps of interest to add some kind of mechanism for asking them questions.
You could use something like xtest to inject events highlighting the region of interest and copying it to the clipboard.
I believe firefox and gcalctool are for examples only and you just want to know in general how to pass output of one application to other application.
There are many ways to do that on Linux, like:
piping
application1 | application2
btw here is the Firefox command line manual if you want to start firefox on Ubuntu with a URL. eg:
firefox "$url"
where $url is a variable whose value can be www.mozilla.org
That sounds difficult. Supposing you're running X11, you can very easily grab a picture of a window (see "man xwd"); however, there is no easy way to get to the text unless it's selected and therefore copied to the clipboard.
Alternatively, if you only want to capture user input, this is quite easy to do, too, by activating the X11 record extension: put this in your /etc/X11/xorg.conf:
Section "Module"
Load "record"
#Load other modules you need ...
EndSection
Though it may prove difficult to use, too; see "example code for Xorg/X11 record extension fails".

Resources