I need to read a URL with the SSJS fromJson() function.
For example, the Data access API for a Notes view:
http://{host}/{database}/api/data/collections/name/{name}
How can I do this?
P.S. I think (I don't know if it's true) that if I use Java code
(for example, the URLReader class from this blogger), I
lose the authors/readers functionality, because it is my server and not the current user that reads the stream.
I'll explain why I'm trying to understand this...
I need to use the jQuery DataTables plugin in my app.
I need complete server-side processing because I have over 10,000 documents in every view.
This jQuery plugin sends parameters to a specific URL (my XAgent, I think), so I plan to create an XAgent that reads these parameters and parses the JSON from the Data API for the output.
I need this because I need a faster response.
Oliver Busse's solution is very slow because it loads all the entries of my view into JSON (I have many entries), and I wait 30 to 40 seconds for this operation.
I gather from the PS that you're specifically looking to fetch JSON on the server from itself, while retaining user authentication information. Sven's post there does a bit of that, but I think that the most reliable way would be to grab the Authorization and Cookie headers from the request and then pass them along in your URL request. This answer has a good basis for doing this with Java. You could expand that to do something like this (which, granted, I haven't tested, but it's a starting point):
HttpServletRequest req = (HttpServletRequest)FacesContext.getCurrentInstance().getExternalContext().getRequest();
String authorization = req.getHeader("Authorization");
String cookie = req.getHeader("Cookie");
URL myURL = new URL("http://foo.com");
HttpURLConnection myURLConnection = (HttpURLConnection)myURL.openConnection();
if(StringUtil.isNotEmpty(authorization)) {
myURLConnection.setRequestProperty("Authorization", authorization);
}
if(StringUtil.isNotEmpty(cookie)) {
myURLConnection.setRequestProperty("Cookie", cookie);
}
myURLConnection.setRequestMethod("GET");
myURLConnection.setDoInput(true);
// No setDoOutput(true) here: a GET request sends no body.
myURLConnection.connect();
InputStream is = null;
try {
is = myURLConnection.getInputStream();
String result = StreamUtil.readString(is);
} finally {
StreamUtil.close(is);
myURLConnection.disconnect();
}
Ideally, you would also fetch the server host name, protocol, and port from the request.
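As a rough sketch of that last point, the base URL could be assembled from the scheme, host name, and port instead of being hard-coded. The class and method names here are made up for illustration; in a servlet or XPages context the three values would come from req.getScheme(), req.getServerName(), and req.getServerPort(), and the path in main is only a placeholder.

```java
final class SelfUrl {

    /** Builds "scheme://host[:port]", leaving the port out when it is the scheme's default. */
    static String baseUrl(String scheme, String host, int port) {
        boolean defaultPort = ("http".equalsIgnoreCase(scheme) && port == 80)
                || ("https".equalsIgnoreCase(scheme) && port == 443);
        return scheme + "://" + host + (defaultPort ? "" : ":" + port);
    }

    public static void main(String[] args) {
        // Hypothetical values; in real code read them from the current HttpServletRequest.
        String base = baseUrl("http", "example.com", 8080);
        System.out.println(base + "/mydb.nsf/api/data/collections");
    }
}
```

The resulting string can then be passed to new URL(...) in place of the hard-coded address.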
Eric's comment is also wise: if this is something you can do with the normal classes, that's going to be more flexible and less problem-prone, due to how fiddly server-self HTTP calls can be.
As I mentioned in my comment, this approach forces your call to go through a client-side call to the Domino Data Service, and it complicates what would otherwise be a normal handle on a View (and its contents) via a cross-NSF call (e.g. var vwEnt:ViewEntryCollection = session.getDatabase("serverName", "path/myDb.nsf").getView("viewName").getAllEntries();).
As a blog post of mine previously outlined, you can definitely achieve this the way Jesse's answer (curse you, fast typer Jesse!) describes. Something I include in my "grab bag" of tools is a Java class that's a starting point for getting JSON-formatted content. Here's the link (here's one with basic authorization in a request header) and the class I generally start from:
package com.eric.awesome;
import java.net.URL;
import java.net.URLConnection;
import java.io.BufferedReader;
import com.google.gson.*;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.MalformedURLException;
import org.apache.commons.validator.routines.*;
/**
* Class with a single, public, static method to provide a REST consumer
* which returns data as a JsonObject.
*
* @author Eric McCormick, @edm00se
*
*/
public class CustJsonConsumer {
/**
* Method for receiving HTTP JSON GET request against a RESTful URL data source.
*
* @param myUrlStr the URL of the REST endpoint
* @return JsonObject containing the data from the REST response.
* @throws IOException
* @throws MalformedURLException
* @throws ParseException
*/
public static JsonObject GetMyRestData( String myUrlStr ) throws IOException, MalformedURLException {
JsonObject myRestData = new JsonObject();
try{
UrlValidator defaultValidator = new UrlValidator();
if(defaultValidator.isValid(myUrlStr)){
URL myUrl = new URL(myUrlStr);
URLConnection urlCon = myUrl.openConnection();
urlCon.setConnectTimeout(5000);
InputStream is = urlCon.getInputStream();
InputStreamReader isR = new InputStreamReader(is);
BufferedReader reader = new BufferedReader(isR);
StringBuffer buffer = new StringBuffer();
String line = "";
while( (line = reader.readLine()) != null ){
buffer.append(line);
}
reader.close();
JsonParser parser = new JsonParser();
myRestData = (JsonObject) parser.parse(buffer.toString());
return myRestData;
}else{
myRestData.addProperty("error", "URL failed validation by Apache Commons URL Validator");
return myRestData;
}
}catch( MalformedURLException e ){
e.printStackTrace();
myRestData.addProperty("error", e.toString());
return myRestData;
}catch( IOException e ){
e.printStackTrace();
myRestData.addProperty("error", e.toString());
return myRestData;
}
}
}
To invoke it from SSJS as a POJO, you would want to do something like:
importPackage(com.eric.awesome);
var urlStr = "http://{host}/{database}/api/data/collections/name/{name}";
var myStuff:JsonObject = CustJsonConsumer.GetMyRestData(urlStr);
I have the following step:
@Given("Request specifications are set with base uri {string}")
public void setRequestsSpec(String baseUri){
requestSpecification = new RequestSpecBuilder()
.setBaseUri(baseUri)
.addFilter(new ResponseLoggingFilter())//log request and response for better debugging. You can also only log if a request fails.
.addFilter(new RequestLoggingFilter())
.addFilter(new RcAllureFilter())
.build();
}
Then I have:
@When("^Azure Login Request Executed$")
public void azureLoginExecuted() {
response =
given() //Add x-www-form-urlencoded body params:
.spec(testContext().getRequestSpec())
.formParam(GRANT_TYPE_KEY, GRANT_TYPE_VALUE)
.formParam(AUTO_TEAM_CLIENT_ID_KEY, AUTO_TEAM_CLIENT_ID_VALUE)
.formParam(AUTO_TEAM_CLIENT_SECRET_KEY, AUTO_TEAM_CLIENT_SECRET_VALUE)
.formParam(RESOURCE_KEY, RESOURCE_VALUE)
.when()
.post(AUTO_TEAM_TENANT_ID + RESOURCE); //Send the request along with the resource
setAuthorizationToken();
}
How can I extract the request's details, such as the URI, headers, and parameters, from it?
I cannot find a class from which I can extract the request details.
In the RequestSpecification class, I can hardly find any getter methods.
I need these values in order to build a formatted log message.
Is there another way?
If you are trying to get the details from a RequestSpecification, you can do it like this:
RequestSpecification spec = new RequestSpecBuilder().setContentType(ContentType.JSON).addHeader("h1", "h2")
.build();
QueryableRequestSpecification queryable = SpecificationQuerier.query(spec);
System.out.println(" Content is " + queryable.getContentType());
System.out.println(" Header is " + queryable.getHeaders().getValue("h1"));
But in your scenario you want the request details too, so the best way would be to use a RequestLoggingFilter, which accepts a PrintStream (which in turn can wrap a ByteArrayOutputStream that can be converted to a String). The basic idea is to use RequestLoggingFilter with a PrintStream and then use any code to save the PrintStream's output to a String. You can use a StringWriter too.
RequestSpecification spec = new RequestSpecBuilder().build();
StringWriter requestWriter = new StringWriter();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PrintStream printStream = new PrintStream(baos);
Response response = given().spec(spec).contentType(ContentType.JSON)
.filter(new RequestLoggingFilter(printStream)).get("https://jsonplaceholder.typicode.com/todos/1");
printStream.flush();
System.out.println(baos);
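The capture step on its own, without REST Assured, might look like the sketch below. PrintCapture and its capture method are names I made up for illustration; the idea is just that anything written to the PrintStream ends up in the returned String, which is what you would then put into your formatted log message. With REST Assured, the stream would be handed to new RequestLoggingFilter(printStream) as in the snippet above.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.function.Consumer;

final class PrintCapture {

    /** Runs the writer against an in-memory PrintStream and returns everything it printed. */
    static String capture(Consumer<PrintStream> writer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            writer.accept(out);
            out.flush(); // make sure nothing is left in the PrintStream's own buffer
            return buffer.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for whatever a logging filter would print.
        String logged = capture(out -> out.println("Request method: GET"));
        System.out.print("Captured: " + logged);
    }
}
```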
I am writing a DialogFlow service.
My service receives DialogFlow fulfilment requests via an API gateway that is inserting some additional authorization headers that I need to pick up in my service.
I am using the DialogFlowApp handleRequest method in my code, but I cannot get access to the headers from the request handler.
The headers are passed to the handleRequest method in the usual way:
@RequestMapping(value = "/google", method = RequestMethod.POST)
public String handleGoogle(@RequestBody String form, HttpServletRequest request) {
String jsonResponse = "";
try {
jsonResponse = google.handleRequest(form, getHeadersMap(request)).get();
} catch (Exception e) {
System.out.println("Error Response "+e);
}
return jsonResponse;
}
I have tried accessing the headers using something like this:
public ActionResponse genericIntentHandler(ActionRequest request) {
Argument a = request.getArgument("someval");
//Or
Object b = request.getParameter("someval");
}
Neither of these works.
Looking at the code on github, the request handler seems to add the headers to the AogRequest class
https://github.com/actions-on-google/actions-on-google-java/blob/master/src/main/kotlin/com/google/actions/api/impl/AogRequest.kt
However, the specific function called to create the class seems to do nothing with the headerMap in the constructor.
Could someone please help.
We are using the Gmail API Java Client version 1.19.0. Has anyone successfully implemented a working mock object that could be used for stubbing requests such as:
gmailClient.users().history().list("me").setStartHistoryId(startHistoryId).setPageToken(pageToken).execute();
Essentially, we would like to stub the above call and create a specific response, to test different business scenarios.
Below is a working example for the question above. There is no need for PowerMock; Mockito alone is enough.
@Before
public void init() throws Exception{
ListHistoryResponse historyResponse = new ListHistoryResponse();
historyResponse.setHistoryId(BigInteger.valueOf(1234L));
List<History> historyList = new ArrayList<>();
History historyEntry = new History();
Message message = new Message();
message.setId("123456");
message.setThreadId("123456");
List<Message> messages = new ArrayList<>();
messages.add(message);
historyEntry.setMessages(messages);
historyList.add(historyEntry);
mock = mock(Gmail.class);
Gmail.Users users = mock(Gmail.Users.class);
Gmail.Users.History history = mock(Gmail.Users.History.class);
Gmail.Users.History.List list = mock(Gmail.Users.History.List.class);
when(mock.users()).thenReturn(users);
when(users.history()).thenReturn(history);
when(history.list("me")).thenReturn(list);
when(list.setStartHistoryId(BigInteger.valueOf(123L))).thenReturn(list);
when(list.setPageToken(null)).thenReturn(list);
when(list.execute()).thenReturn(historyResponse);
}
You can mock the classes as long as they're not final, etc. What's the limitation here? (I haven't looked at the source code for the Google Java client libraries, but this shouldn't be Gmail-specific. If you've found someone doing it for another Google Java client API, you should be able to reuse it.)
There is also a MockHttpTransport helper class for such a scenario. Please consult the documentation chapter "HTTP Unit Testing":
HttpTransport transport = new MockHttpTransport() {
@Override
public LowLevelHttpRequest buildRequest(String method, String url) throws IOException {
return new MockLowLevelHttpRequest() {
@Override
public LowLevelHttpResponse execute() throws IOException {
MockLowLevelHttpResponse response = new MockLowLevelHttpResponse();
response.addHeader("custom_header", "value");
response.setStatusCode(404);
response.setContentType(Json.MEDIA_TYPE);
response.setContent("{\"error\":\"not found\"}");
return response;
}
};
}
};
Is anyone aware of a method to dynamically combine/minify all the h:outputStylesheet resources and then combine/minify all h:outputScript resources in the render phase? The combined/minified resource would probably need to be cached with a key based on the combined resource String or something to avoid excessive processing.
If this feature doesn't exist I'd like to work on it. Does anyone have ideas on the best way to implement something like this? A Servlet filter would work I suppose, but the filter would have to do more work than necessary: basically examining the whole rendered output and replacing matches. Implementing something in the render phase seems like it would work better, as all of the static resources are available without having to parse the entire output.
Thanks for any suggestions!
Edit: To show that I'm not lazy and will really work on this with some guidance, here is a stub that captures Script Resources name/library and then removes them from the view. As you can see I have some questions about what to do next ... should I make http requests and get the resources to combine, then combine them and save them to the resource cache?
package com.davemaple.jsf.listener;
import java.util.ArrayList;
import java.util.List;
import javax.faces.component.UIComponent;
import javax.faces.component.UIOutput;
import javax.faces.component.UIViewRoot;
import javax.faces.context.FacesContext;
import javax.faces.event.AbortProcessingException;
import javax.faces.event.PhaseEvent;
import javax.faces.event.PhaseId;
import javax.faces.event.PhaseListener;
import javax.faces.event.PreRenderViewEvent;
import javax.faces.event.SystemEvent;
import javax.faces.event.SystemEventListener;
import org.apache.log4j.Logger;
/**
* A Listener that combines CSS/Javascript Resources
*
* @author David Maple <d@davemaple.com>
*
*/
public class ResourceComboListener implements PhaseListener, SystemEventListener {
private static final long serialVersionUID = -8430945481069344353L;
private static final Logger LOGGER = Logger.getLogger(ResourceComboListener.class);
@Override
public PhaseId getPhaseId() {
return PhaseId.RESTORE_VIEW;
}
/*
* (non-Javadoc)
* @see javax.faces.event.PhaseListener#afterPhase(javax.faces.event.PhaseEvent)
*/
public void afterPhase(PhaseEvent event) {
FacesContext.getCurrentInstance().getViewRoot().subscribeToViewEvent(PreRenderViewEvent.class, this);
}
/*
* (non-Javadoc)
* @see javax.faces.event.PhaseListener#beforePhase(javax.faces.event.PhaseEvent)
*/
public void beforePhase(PhaseEvent event) {
//nothing here
}
/*
* (non-Javadoc)
* @see javax.faces.event.SystemEventListener#isListenerForSource(java.lang.Object)
*/
public boolean isListenerForSource(Object source) {
return (source instanceof UIViewRoot);
}
/*
* (non-Javadoc)
* @see javax.faces.event.SystemEventListener#processEvent(javax.faces.event.SystemEvent)
*/
public void processEvent(SystemEvent event) throws AbortProcessingException {
FacesContext context = FacesContext.getCurrentInstance();
UIViewRoot viewRoot = context.getViewRoot();
List<UIComponent> scriptsToRemove = new ArrayList<UIComponent>();
if (!context.isPostback()) {
for (UIComponent component : viewRoot.getComponentResources(context, "head")) {
if (component.getClass().equals(UIOutput.class)) {
UIOutput uiOutput = (UIOutput) component;
if (uiOutput.getRendererType().equals("javax.faces.resource.Script")) {
String library = uiOutput.getAttributes().get("library").toString();
String name = uiOutput.getAttributes().get("name").toString();
// make https requests to get the resources?
// combine them and save to resource cache?
// insert new UIOutput script?
scriptsToRemove.add(component);
}
}
}
for (UIComponent component : scriptsToRemove) {
viewRoot.getComponentResources(context, "head").remove(component);
}
}
}
}
This answer doesn't cover minifying and compression. Minifying of individual CSS/JS resources is better delegated to build scripts like the YUI Compressor Ant task. Manually doing it on every request is too expensive. Compression (I assume you mean GZIP?) is better delegated to the servlet container you're using. Manually doing it is overcomplicated. On Tomcat, for example, it's a matter of adding a compression="on" attribute to the <Connector> element in /conf/server.xml.
The SystemEventListener is already a good first step (apart from the unnecessary PhaseListener). Next, you'd need to implement a custom ResourceHandler and Resource. That part is not exactly trivial. You'd need to reinvent quite a lot if you want to be JSF implementation independent.
First, in your SystemEventListener, you'd create a new UIOutput component representing the combined resource so that you can add it using UIViewRoot#addComponentResource(). You need to set its library attribute to something unique which is understood by your custom resource handler. You need to store the combined resources in an application-wide variable along with a unique name based on the combination of the resources (an MD5 hash maybe?) and then set this key as the name attribute of the component. Storing them in an application-wide variable has a caching advantage for both the server and the client.
Something like this:
String combinedResourceName = CombinedResourceInfo.createAndPutInCacheIfAbsent(resourceNames);
UIOutput component = new UIOutput();
component.setRendererType(rendererType);
component.getAttributes().put(ATTRIBUTE_RESOURCE_LIBRARY, CombinedResourceHandler.RESOURCE_LIBRARY);
component.getAttributes().put(ATTRIBUTE_RESOURCE_NAME, combinedResourceName + extension);
context.getViewRoot().addComponentResource(context, component, TARGET_HEAD);
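For the "unique name based on the combination of the resources" part, here is a minimal sketch of what the MD5 idea could look like. CombinedResourceKey and the "library:name" identifier format joined with "|" are assumptions of mine for illustration, not what CombinedResourceInfo actually does.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class CombinedResourceKey {

    /** Returns a 32-character hex MD5 of the resource identifiers, in the given order. */
    static String of(List<String> resourceIds) {
        // Order matters: scripts combined in a different order are a different resource.
        String joined = String.join("|", resourceIds);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(joined.getBytes(StandardCharsets.UTF_8));
            // Zero-pad so that leading zero bytes are not dropped from the hex string.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship MD5.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical "library:name" identifiers.
        System.out.println(of(Arrays.asList("js:jquery.js", "js:app.js")) + ".js");
    }
}
```

MD5 is used here only as a short, stable cache key, not for anything security-related.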
Then, in your custom ResourceHandler implementation, you'd need to implement the createResource() method accordingly to create a custom Resource implementation whenever the library matches the desired value:
@Override
public Resource createResource(String resourceName, String libraryName) {
if (RESOURCE_LIBRARY.equals(libraryName)) {
return new CombinedResource(resourceName);
} else {
return super.createResource(resourceName, libraryName);
}
}
The constructor of the custom Resource implementation should grab the combined resource info based on the name:
public CombinedResource(String name) {
setResourceName(name);
setLibraryName(CombinedResourceHandler.RESOURCE_LIBRARY);
setContentType(FacesContext.getCurrentInstance().getExternalContext().getMimeType(name));
this.info = CombinedResourceInfo.getFromCache(name.split("\\.", 2)[0]);
}
This custom Resource implementation must provide a proper getRequestPath() method returning an URI which will then be included in the rendered <script> or <link> element:
@Override
public String getRequestPath() {
FacesContext context = FacesContext.getCurrentInstance();
String path = ResourceHandler.RESOURCE_IDENTIFIER + "/" + getResourceName();
String mapping = getFacesMapping();
path = isPrefixMapping(mapping) ? (mapping + path) : (path + mapping);
return context.getExternalContext().getRequestContextPath()
+ path + "?ln=" + CombinedResourceHandler.RESOURCE_LIBRARY;
}
Now, the HTML rendering part should be fine. It'll look something like this:
<link type="text/css" rel="stylesheet" href="/playground/javax.faces.resource/dd08b105bf94e3a2b6dbbdd3ac7fc3f5.css.xhtml?ln=combined.resource" />
<script type="text/javascript" src="/playground/javax.faces.resource/2886165007ccd8fb65771b75d865f720.js.xhtml?ln=combined.resource"></script>
Next, you have to intercept on combined resource requests made by the browser. That's the hardest part. First, in your custom ResourceHandler implementation, you need to implement the handleResourceRequest() method accordingly:
@Override
public void handleResourceRequest(FacesContext context) throws IOException {
if (RESOURCE_LIBRARY.equals(context.getExternalContext().getRequestParameterMap().get("ln"))) {
streamResource(context, new CombinedResource(getCombinedResourceName(context)));
} else {
super.handleResourceRequest(context);
}
}
Then you have to do the whole lot of work of implementing the other methods of the custom Resource implementation accordingly such as getResponseHeaders() which should return proper caching headers, getInputStream() which should return the InputStreams of the combined resources in a single InputStream and userAgentNeedsUpdate() which should respond properly on caching related requests.
@Override
public Map<String, String> getResponseHeaders() {
Map<String, String> responseHeaders = new HashMap<String, String>(3);
SimpleDateFormat sdf = new SimpleDateFormat(PATTERN_RFC1123_DATE, Locale.US);
sdf.setTimeZone(TIMEZONE_GMT);
responseHeaders.put(HEADER_LAST_MODIFIED, sdf.format(new Date(info.getLastModified())));
responseHeaders.put(HEADER_EXPIRES, sdf.format(new Date(System.currentTimeMillis() + info.getMaxAge())));
responseHeaders.put(HEADER_ETAG, String.format(FORMAT_ETAG, info.getContentLength(), info.getLastModified()));
return responseHeaders;
}
@Override
public InputStream getInputStream() throws IOException {
return new CombinedResourceInputStream(info.getResources());
}
@Override
public boolean userAgentNeedsUpdate(FacesContext context) {
String ifModifiedSince = context.getExternalContext().getRequestHeaderMap().get(HEADER_IF_MODIFIED_SINCE);
if (ifModifiedSince != null) {
SimpleDateFormat sdf = new SimpleDateFormat(PATTERN_RFC1123_DATE, Locale.US);
try {
info.reload();
return info.getLastModified() > sdf.parse(ifModifiedSince).getTime();
} catch (ParseException ignore) {
return true;
}
}
return true;
}
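Since the constants are not shown, here is a small standalone sketch of the date handling behind Last-Modified and If-Modified-Since. The pattern string is my assumption of what PATTERN_RFC1123_DATE would contain, and HttpDates is a made-up helper name. One deliberate difference from the snippet above: HTTP dates only have one-second resolution, so this sketch compares whole seconds; otherwise a resource whose timestamp has a millisecond part would always look newer than the header the client sends back.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class HttpDates {

    // Assumed RFC 1123 pattern, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    private static final String PATTERN_RFC1123 = "EEE, dd MMM yyyy HH:mm:ss zzz";

    // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat sdf = new SimpleDateFormat(PATTERN_RFC1123, Locale.US);
        sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
        return sdf;
    }

    static String format(long epochMillis) {
        return newFormat().format(new Date(epochMillis));
    }

    /** True when the resource changed after the date the client sent, or the header is unusable. */
    static boolean needsUpdate(long lastModifiedMillis, String ifModifiedSince) {
        if (ifModifiedSince == null) {
            return true;
        }
        try {
            long clientSeconds = newFormat().parse(ifModifiedSince).getTime() / 1000;
            return lastModifiedMillis / 1000 > clientSeconds;
        } catch (ParseException e) {
            return true; // unparseable header: serve the resource again
        }
    }

    public static void main(String[] args) {
        System.out.println(format(System.currentTimeMillis()));
    }
}
```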
I have a complete working proof of concept here, but it's too much code to post as an SO answer. The above was just a partial example to point you in the right direction. I assume that the missing method/variable/constant declarations are self-explanatory enough for you to write your own; otherwise let me know.
Update: as per the comments, here's how you can collect resources in CombinedResourceInfo:
private synchronized void loadResources(boolean forceReload) {
if (!forceReload && resources != null) {
return;
}
FacesContext context = FacesContext.getCurrentInstance();
ResourceHandler handler = context.getApplication().getResourceHandler();
resources = new LinkedHashSet<Resource>();
contentLength = 0;
lastModified = 0;
for (Entry<String, Set<String>> entry : resourceNames.entrySet()) {
String libraryName = entry.getKey();
for (String resourceName : entry.getValue()) {
Resource resource = handler.createResource(resourceName, libraryName);
resources.add(resource);
try {
URLConnection connection = resource.getURL().openConnection();
contentLength += connection.getContentLength();
long lastModified = connection.getLastModified();
if (lastModified > this.lastModified) {
this.lastModified = lastModified;
}
} catch (IOException ignore) {
// Can't and shouldn't handle it here anyway.
}
}
}
}
(The above method is called by the reload() method and by the getters that depend on one of the properties which are set there.)
And here's what the CombinedResourceInputStream looks like:
final class CombinedResourceInputStream extends InputStream {
private List<InputStream> streams;
private Iterator<InputStream> streamIterator;
private InputStream currentStream;
public CombinedResourceInputStream(Set<Resource> resources) throws IOException {
streams = new ArrayList<InputStream>();
for (Resource resource : resources) {
streams.add(resource.getInputStream());
}
streamIterator = streams.iterator();
streamIterator.hasNext(); // We assume it to be always true; CombinedResourceInfo won't be created anyway if it's empty.
currentStream = streamIterator.next();
}
@Override
public int read() throws IOException {
int read = -1;
while ((read = currentStream.read()) == -1) {
if (streamIterator.hasNext()) {
currentStream = streamIterator.next();
} else {
break;
}
}
return read;
}
@Override
public void close() throws IOException {
IOException caught = null;
for (InputStream stream : streams) {
try {
stream.close();
} catch (IOException e) {
if (caught == null) {
caught = e; // Don't throw it yet. We have to continue closing all other streams.
}
}
}
if (caught != null) {
throw caught;
}
}
}
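As a side note, the JDK already ships java.io.SequenceInputStream, which reads a series of streams one after another. The sketch below shows it as a possible alternative to the hand-written class; it is not what the proof of concept uses, and SequenceDemo with its in-memory inputs is made up purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SequenceDemo {

    /** Presents the given streams as one stream, read in list order. */
    static InputStream concat(List<InputStream> streams) {
        return new SequenceInputStream(Collections.enumeration(streams));
    }

    /** Concatenates in-memory "resources" and reads the combined stream back as text. */
    static String combine(String... parts) {
        List<InputStream> streams = new ArrayList<>();
        for (String part : parts) {
            streams.add(new ByteArrayInputStream(part.getBytes(StandardCharsets.UTF_8)));
        }
        try (InputStream in = concat(streams)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(combine("var a = 1;", "var b = 2;"));
    }
}
```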
Update 2: a concrete and reusable solution is available in OmniFaces. See also the CombinedResourceHandler showcase page and API documentation for more detail.
You may want to evaluate JAWR before implementing your own solution. I've used it in a couple of projects and it was a big success. It was used in JSF 1.2 projects, but I think it will be easy to extend it to work with JSF 2.0. Just give it a try.
The CombinedResourceHandler provided by OmniFaces is an excellent utility, but I'd also like to share this excellent Maven plugin: resources-optimizer-maven-plugin. It can be used to minify/compress JS/CSS files and/or aggregate them into fewer resources at build time rather than dynamically at runtime, which I believe makes it a more performant solution.
Also have a look at this excellent library: webutilities
I have another solution for JSF 2. It might also work with JSF 1, but I do not know JSF 1, so I cannot say. The idea works mainly with components from h:head and also works for stylesheets. The result
is always one JavaScript (or stylesheet) file per page! It is hard for me to describe, but I will try.
I override the standard JSF ScriptRenderer (or StylesheetRenderer) and configure the renderer
for the h:outputScript component in faces-config.xml.
The new renderer no longer writes the script tag; instead it collects all resources
in a list. So the first resource to be rendered is the first item in the list, the next one follows,
and so on. After the last h:outputScript component is rendered, you have to render one script tag
for the JavaScript file of this page. I do this by overriding the h:head renderer.
Now comes the idea:
I register a filter! The filter looks for the request for this one script tag. When this request comes in,
I get the list of resources for this page. Now I can fill the response from the list of
resources. The order will be correct, because the JSF rendering put the resources into the list
in the correct order. After the response is filled, the list should be cleared. You can also do more
optimizations, because you have the code in the filter.
I have code that works superbly. My code can also handle browser caching and dynamic script rendering.
If anybody is interested, I can share the code.
I would like to create a program that will enter a string into the text box on a site like Google (without using their public API) and then submit the form and grab the results. Is this possible? Grabbing the results will require the use of HTML scraping I would assume, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out query strings/parameters?
Thanks
Theory
What I would do is create a little program that can automatically submit any form data to any place and come back with the results. This is easy to do in Java with HTTPUnit. The task goes like this:
1. Connect to the web server.
2. Parse the page.
3. Get the first form on the page.
4. Fill in the form data.
5. Submit the form.
6. Read (and parse) the results.
The solution you pick will depend on a variety of factors, including:
Whether you need to emulate JavaScript
What you need to do with the data afterwards
Which languages you are proficient in
Application speed (is this for one query or 100,000?)
How soon the application needs to be working
Is it a one off, or will it have to be maintained?
For example, you could try the following applications to submit the data for you:
Lynx
curl
wget
Then grep (awk, or sed) the resulting web page(s).
Another trick when screen scraping is to download a sample HTML file and parse it manually in vi (or VIM). Save the keystrokes to a file and then whenever you run the query, apply those keystrokes to the resulting web page(s) to extract the data. This solution is not maintainable, nor 100% reliable (but screen scraping from a website seldom is). It works and is fast.
Example
A semi-generic Java class to submit website forms (specifically dealing with logging into a website) is below, in the hopes that it might be useful. Do not use it for evil.
import java.io.FileInputStream;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Properties;
import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.SubmitButton;
import com.meterware.httpunit.WebClient;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;
public class FormElements extends Properties
{
private static final String FORM_URL = "form.url";
private static final String FORM_ACTION = "form.action";
/** These are properly provided property parameters. */
private static final String FORM_PARAM = "form.param.";
/** These are property parameters that are required; must have values. */
private static final String FORM_REQUIRED = "form.required.";
private Hashtable fields = new Hashtable( 10 );
private WebConversation webConversation;
public FormElements()
{
}
/**
* Retrieves the HTML page, populates the form data, then sends the
* information to the server.
*/
public void run()
throws Exception
{
WebResponse response = receive();
WebForm form = getWebForm( response );
populate( form );
form.submit();
}
protected WebResponse receive()
throws Exception
{
WebConversation webConversation = getWebConversation();
GetMethodWebRequest request = getGetMethodWebRequest();
// Fake the User-Agent so the site thinks that encryption is supported.
//
request.setHeaderField( "User-Agent",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040913" );
return webConversation.getResponse( request );
}
protected void populate( WebForm form )
throws Exception
{
// First set all the .param variables.
//
setParamVariables( form );
// Next, set the required variables.
//
setRequiredVariables( form );
}
protected void setParamVariables( WebForm form )
throws Exception
{
for( Enumeration e = propertyNames(); e.hasMoreElements(); )
{
String property = (String)(e.nextElement());
if( property.startsWith( FORM_PARAM ) )
{
String fieldName = getProperty( property );
String propertyName = property.substring( FORM_PARAM.length() );
String fieldValue = getField( propertyName );
// Skip blank fields (most likely, this is a blank last name, which
// means the form wants a full name).
//
if( "".equals( fieldName ) )
continue;
// If this is the first name, and the last name parameter is blank,
// then append the last name field to the first name field.
//
if( "first_name".equals( propertyName ) &&
"".equals( getProperty( FORM_PARAM + "last_name" ) ) )
fieldValue += " " + getField( "last_name" );
showSet( fieldName, fieldValue );
form.setParameter( fieldName, fieldValue );
}
}
}
protected void setRequiredVariables( WebForm form )
throws Exception
{
for( Enumeration e = propertyNames(); e.hasMoreElements(); )
{
String property = (String)(e.nextElement());
if( property.startsWith( FORM_REQUIRED ) )
{
String fieldValue = getProperty( property );
String fieldName = property.substring( FORM_REQUIRED.length() );
// If the field starts with a ~, then copy the field.
//
if( fieldValue.startsWith( "~" ) )
{
String copyProp = fieldValue.substring( 1, fieldValue.length() );
copyProp = getProperty( copyProp );
// Since the parameters have been copied into the form, we can
// eke out the duplicate values.
//
fieldValue = form.getParameterValue( copyProp );
}
showSet( fieldName, fieldValue );
form.setParameter( fieldName, fieldValue );
}
}
}
private void showSet( String fieldName, String fieldValue )
{
System.out.print( "<p class='setting'>" );
System.out.print( fieldName );
System.out.print( " = " );
System.out.print( fieldValue );
System.out.println( "</p>" );
}
private WebForm getWebForm( WebResponse response )
throws Exception
{
WebForm[] forms = response.getForms();
String action = getProperty( FORM_ACTION );
// Not supposed to break out of a for-loop, but it makes the code easy ...
//
for( int i = forms.length - 1; i >= 0; i-- )
if( forms[ i ].getAction().equalsIgnoreCase( action ) )
return forms[ i ];
// Sadly, no form was found.
//
throw new Exception( "No form with action '" + action + "' was found." );
}
private GetMethodWebRequest getGetMethodWebRequest()
{
return new GetMethodWebRequest( getProperty( FORM_URL ) );
}
private WebConversation getWebConversation()
{
if( this.webConversation == null )
this.webConversation = new WebConversation();
return this.webConversation;
}
public void setField( String field, String value )
{
Hashtable<String, String> fields = getFields();
fields.put( field, value );
}
private String getField( String field )
{
Hashtable<String, String> fields = getFields();
String result = fields.get( field );
return result == null ? "" : result;
}
private Hashtable getFields()
{
return this.fields;
}
public static void main( String args[] )
throws Exception
{
FormElements formElements = new FormElements();
formElements.setField( "first_name", args[1] );
formElements.setField( "last_name", args[2] );
formElements.setField( "email", args[3] );
formElements.setField( "comments", args[4] );
// try-with-resources closes the stream even if load() throws.
try( FileInputStream fis = new FileInputStream( args[0] ) )
{
formElements.load( fis );
}
formElements.run();
}
}
An example properties file would look like:
$ cat com.mellon.properties
form.url=https://www.mellon.com/contact/index.cfm
form.action=index.cfm
form.param.first_name=name
form.param.last_name=
form.param.email=emailhome
form.param.comments=comments
# Submit Button
#form.submit=submit
# Required Fields
#
form.required.to=zzwebmaster
form.required.phone=555-555-1212
form.required.besttime=5 to 7pm
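To make the role of the form.param.* entries concrete, here is a minimal, self-contained sketch of how they map local field names to the remote form's parameter names. It is not part of the FormElements class; the class name FormParamMappingSketch and the method parameterMapping are made up for illustration, and the prefix value "form.param." is an assumption inferred from the properties file above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
final class FormParamMappingSketch {
    // Assumed prefix, inferred from the example properties file.
    static final String FORM_PARAM = "form.param.";

    /** Maps each local field name to the remote form parameter it fills, skipping blank entries. */
    static Map<String, String> parameterMapping(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> mapping = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith(FORM_PARAM)) {
                continue; // form.url, form.required.*, etc. are handled elsewhere
            }
            String remoteName = props.getProperty(key).trim();
            if (remoteName.isEmpty()) {
                continue; // blank value: the remote form has no such field
            }
            mapping.put(key.substring(FORM_PARAM.length()), remoteName);
        }
        return mapping;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "form.url=https://www.mellon.com/contact/index.cfm",
                "form.param.first_name=name",
                "form.param.last_name=",
                "form.param.email=emailhome",
                "form.param.comments=comments",
                "form.required.to=zzwebmaster");
        // Prints {comments=comments, email=emailhome, first_name=name}
        System.out.println(parameterMapping(text));
    }
}
```

The blank form.param.last_name entry is dropped here, which is the case where the class above appends the last name to the first name field instead.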
Run it with a command similar to the following (substitute the path to HTTPUnit and the FormElements class for $CLASSPATH):
java -cp $CLASSPATH FormElements com.mellon.properties "John" "Doe" "John.Doe#gmail.com" "To whom it may concern ..."
Legality
Another answer mentioned that it might violate terms of use. Check into that first, before you spend any time looking into a technical solution. Extremely good advice.
Most of the time, you can just send a simple HTTP POST request.
I'd suggest you try playing around with Fiddler to understand how the web works.
Nearly all the programming languages and frameworks out there have methods for sending raw requests.
And you can always program against the Internet Explorer ActiveX control. I believe many programming languages support it.
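As a rough illustration of the "simple HTTP POST request" approach, here is a sketch using only the JDK's HttpURLConnection (Java 10 or later for the URLEncoder overload used). The class name SimplePostSketch, its methods, and the field names in main are placeholders rather than anything the target site is known to accept, so treat it as a starting point.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only.
final class SimplePostSketch {
    /** Encodes the fields as an application/x-www-form-urlencoded body. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Sends the fields as a form POST and returns the HTTP status code. */
    static int post(String url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "John Doe");            // placeholder field names
        fields.put("comments", "To whom it may concern ...");
        // Shows the request body without touching the network; call
        // post(url, fields) with a URL you are allowed to submit to.
        System.out.println(encodeForm(fields));
    }
}
```

Watching a real browser submission in a tool such as Fiddler shows which field names and hidden inputs the server actually expects.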
I believe this would put you in violation of the terms of use (consult a lawyer about that: programmers are not good at giving legal advice!), but, technically, you could search for foobar by just visiting the URL http://www.google.com/search?q=foobar and, as you say, scraping the resulting HTML. You'll probably also need to fake the User-Agent HTTP header and maybe some others.
Maybe there are search engines whose terms of use do not forbid this; you and your lawyer might be well advised to look around to see if this is indeed the case.
Well, here's the HTML from the Google page:
<form action="/search" name=f><table cellpadding=0 cellspacing=0><tr valign=top>
<td width=25%> </td><td align=center nowrap>
<input name=hl type=hidden value=en>
<input type=hidden name=ie value="ISO-8859-1">
<input autocomplete="off" maxlength=2048 name=q size=55 title="Google Search" value="">
<br>
<input name=btnG type=submit value="Google Search">
<input name=btnI type=submit value="I'm Feeling Lucky">
</td><td nowrap width=25% align=left>
<font size=-2> <a href=/advanced_search?hl=en>
Advanced Search</a><br>
<a href=/preferences?hl=en>Preferences</a><br>
<a href=/language_tools?hl=en>Language Tools</a></font></td></tr></table>
</form>
If you know how to make an HTTP request from your favorite programming language, just give it a try and see what you get back. Try this for instance:
http://www.google.com/search?hl=en&q=Stack+Overflow
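For instance, a minimal Java sketch of that request could look like the following. The class name SearchUrlSketch, its methods, and the User-Agent string are hypothetical, and whether a given site permits automated requests is a terms-of-use question the code cannot answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class SearchUrlSketch {
    /** Builds the search URL, encoding the query for use in a query string. */
    static String buildSearchUrl(String query) {
        return "http://www.google.com/search?hl=en&q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    /** Fetches the page as text, sending an explicit User-Agent header. */
    static String fetch(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", userAgent);
            try (InputStream in = conn.getInputStream()) {
                // Assumes a UTF-8 response for simplicity.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Prints http://www.google.com/search?hl=en&q=Stack+Overflow
        System.out.println(buildSearchUrl("Stack Overflow"));
        // To actually retrieve the page (network access required):
        // String html = fetch(buildSearchUrl("Stack Overflow"), "Mozilla/5.0 (example)");
    }
}
```

Encoding the query with URLEncoder matters as soon as it contains spaces, ampersands, or non-ASCII characters; concatenating the raw text into the URL would produce a malformed or misleading request.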
If you download Cygwin and add Cygwin\bin to your path, you can use curl to retrieve a page and grep/sed/whatever to parse the results. Why fill out the form when, with Google, you can use the query string parameters anyway? With curl, you can also post data, set header info, etc. I use it to call web services from the command line.