I'm having issues importing HtmlUnit (htmlunit.sf.net) into a Groovy script.
I'm currently just using the example script that was on the web, and it gives me: unable to resolve class com.gargoylesoftware.htmlunit.WebClient
The script is:
import com.gargoylesoftware.htmlunit.WebClient
client = new WebClient()
html = client.getPage('http://www.msnbc.msn.com/')
println html.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')
I downloaded the source from the website and placed the com folder (and all its contents) where my script was located.
Does anyone know what issue I'm encountering? I'm not quite sure why it won't import it
You could use Grape to fetch the dependency for you at script runtime. The easiest way is to add a @Grab annotation to your import statement.
Like this:
@Grab('net.sourceforge.htmlunit:htmlunit:2.7')
import com.gargoylesoftware.htmlunit.WebClient
client = new WebClient()
// Added as HtmlUnit had problems with the JavaScript
client.javaScriptEnabled = false
html = client.getPage('http://www.msnbc.msn.com/')
println html.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')
There's only one problem: the page seems to be a little too much for HtmlUnit to chew on. When I ran the code I got an OutOfMemoryError every time. I'd suggest downloading the HTML the normal way instead and then using something like NekoHTML or TagSoup to parse the HTML into XML and work with it that way.
This example uses TagSoup to work with HTML as XML in Groovy: http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/
You just need to download the zip file, extract the jar file(s), and place them on the classpath when compiling... You don't need the source.
http://sourceforge.net/projects/htmlunit/files/htmlunit/2.8/htmlunit-2.8.zip/download
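For example, once the jar is extracted next to your script, something along these lines should work (the script name here is just a placeholder, and HtmlUnit's dependency jars from the zip's lib folder need to be on the classpath as well):
groovy -cp htmlunit-2.8.jar myscript.groovy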
I want to load a config value (something like JSON, YAML, XML or INI) from a Jenkins pipeline script. When I try to use org.yaml.snakeyaml.Yaml I get
Scripts not permitted to use new org.yaml.snakeyaml.Yaml
I know I can unlock org.yaml.snakeyaml.Yaml, but the message tells me that this does not seem to be the standard way of loading config files.
Is there a way of loading config files that is already unlocked?
If anyone is looking for a YAML parser in a Jenkinsfile, I recommend the following:
def yamlData = readYaml file: 'cae.yaml'
Reference: https://jenkins.io/doc/pipeline/steps/pipeline-utility-steps/#code-readyaml-code-read-yaml-from-files-in-the-workspace-or-text
Try using the JsonSlurper:
import groovy.json.JsonSlurper

def config = new JsonSlurper().parse(new File("config.json"))
I have Python 3.3 installed.
I'm using the example they use on their site:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
The only thing that happens when I run it is I get this:
======RESTART=========
I know I am a rookie, but I figured the example from Python's own website should work.
It doesn't. What am I doing wrong? Eventually I want to run the script from the website below, but I think urllib is not going to work as it is on that site. Can someone tell me if the code will work with Python 3.3?
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/
I think I see what's probably going on. You're likely using IDLE, and when it starts a new run of a program, it prints the
======RESTART=========
line to tell you that a fresh program is starting. That means that all the variables currently defined are reset and/or deleted, as appropriate.
Since your program didn't print any output, you didn't see anything.
The two lines I suggested adding were just tests to figure out what was going on; they're not needed in general. [Unless the window itself is automatically closing, which it shouldn't.] But as a rule, if you want to see output, you'll have to print what you're interested in.
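For example, just adding a print to the snippet from the question makes the result visible (the decode call is only there to turn the bytes into readable text):
import urllib.request

response = urllib.request.urlopen('http://python.org/')
html = response.read()
print(html[:300].decode('utf-8', errors='replace'))  # show the first few hundred characters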
Your example works for me. However, I suggest using requests instead of urllib2 (which that tutorial uses and which was merged into urllib.request in Python 3).
Simplified, the example you linked to would look like this:
from bs4 import BeautifulSoup
import requests
resp = requests.get("http://www.wunderground.com/history/airport/KBUF/2007/12/16/DailyHistory.html")
soup = BeautifulSoup(resp.text, "html.parser")  # name a parser explicitly
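From there you can pull whatever you need out of the parsed tree; for example, as a quick sanity check (not part of the original article):
print(soup.title.string)                      # page title
for a in soup.find_all('a', href=True)[:10]:  # first ten links
    print(a['href'])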
I have a PDF file and I want to verify whether the links in it are proper. Proper in the sense that all URLs specified link to web pages and nothing is broken. I am looking for a simple utility or a script which can do it easily.
Example:
$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
Remaining broken links and page numbers in which it appears are logged in brokenlinks.txt
I have no idea whether something like that exists, so I googled and searched Stack Overflow as well, but did not find anything useful yet. I would like to know if anyone has an idea about it!
Updated to make the question clear.
You can use pdf-link-checker
pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
To install it with pip:
pip install pdf-link-checker
Unfortunately, one dependency (pdfminer) is broken. To fix it:
pip uninstall pdfminer
pip install pdfminer==20110515
I suggest first using the Linux command-line utility pdftotext (see its man page).
The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions. See http://foolabs.com/xpdf/download.html.
Once installed, you could process the PDF file through pdftotext:
pdftotext file.pdf file.txt
Once processed, a simple Perl script can search the resulting text file for http URLs and retrieve them using LWP::Simple. LWP::Simple's get('http://...') will allow you to validate the URLs with a code snippet such as:
use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;
That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:
m/http[^\s]+/i
"http followed by one or more not-space characters" - assuming the URLs are property URL encoded.
There are two lines of enquiry with your question.
Are you looking for regex verification that the link contains key information such as http:// and valid TLD codes? If so, I'm sure a regex expert will drop by, or have a look at regexlib.com, which contains lots of existing regexes for dealing with URLs.
Or, if you want to verify that a website exists, then I would recommend Python + Requests, as you could script out checks to see whether websites exist and don't return error codes.
It's a task which I'm currently undertaking for pretty much the same purpose at work. We have about 54k links to get processed automatically.
Collect links by enumerating them using an API, by dumping the PDF as text and linkifying the result, or by saving it as HTML with PDFMiner.
Make requests to check them: there are a plethora of options depending on your needs.
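As a minimal sketch of the "make requests to check them" step with the Requests library (the URL list and the brokenlinks.txt name are just placeholders):
import requests

urls = ['http://www.example.com/', 'http://www.example.com/missing']

broken = []
for url in urls:
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code >= 400:
            broken.append((url, resp.status_code))
    except requests.RequestException as exc:
        broken.append((url, str(exc)))

# log the broken ones, report the rest
with open('brokenlinks.txt', 'w') as out:
    for url, reason in broken:
        out.write('{0}\t{1}\n'.format(url, reason))

print('{0} of {1} links are OK'.format(len(urls) - len(broken), len(urls)))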
https://stackoverflow.com/a/42178474/1587329's advice was the inspiration for writing this simple tool (see gist):
'''loads pdf file in sys.argv[1], extracts URLs, tries to load each URL'''
import urllib
import sys
import PyPDF2
# credits to stackoverflow.com/questions/27744210
def extract_urls(filename):
    '''extracts all urls from filename'''
    PDFFile = open(filename, 'rb')
    PDF = PyPDF2.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()
    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if pageObject.has_key(key):
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                if u[ank].has_key(uri):
                    yield u[ank][uri]

def check_http_url(url):
    urllib.urlopen(url)

if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)
Save to filename.py and run it as python filename.py pdfname.pdf. Note that the script uses Python 2 APIs (urllib.urlopen and dict.has_key), so it needs a Python 2 interpreter.
I have a schema file and I want to generate the class files directly into memory instead of the file system. I have searched a lot, but everywhere I find only APIs that generate the Java files into the file system.
Can anyone please provide links to an API that generates the Java source files directly into memory?
Thanks,
Harish
I haven't leveraged this code in the way you described, but this fragment might point you in the right direction:
import com.sun.codemodel.*;
import com.sun.tools.xjc.*;
import com.sun.tools.xjc.api.*;
SchemaCompiler sc = XJC.createSchemaCompiler();
sc.setEntityResolver(new YourEntityResolver());  // your own EntityResolver implementation
sc.setErrorListener(new YourErrorListener());    // your own ErrorListener implementation
sc.parseSchema(SYSTEM_ID, element);              // SYSTEM_ID and element identify the schema you parsed
S2JJAXBModel model = sc.bind();
From the resulting S2JJAXBModel you can call generateCode(...) to get a JCodeModel, and JCodeModel.build(...) accepts a CodeWriter, so supplying a CodeWriter that writes to in-memory buffers instead of the file system should get you the rest of the way.
I am trying to build a tiny app to read from an XML file and display it in a widget. I don't know which widget to use exactly: QTextBrowser, QTextEdit or QWebView. I can't seem to find a good explanation, so please help as much as you can. Before I go on: I'm very new to Python and PyQt, and my programming isn't good at all.
I suggest you first parse the XML content into a DOM object, and then show whatever you want from that object in your widget. For the first part (detailed info here):
from xml.dom import minidom
dom = minidom.parse('my_xml.xml')
print(dom.toxml()) # .toxml() creates a string from the dom object
def print_some_info(node):
    print('node representation: {0}'.format(node))
    print('.nodeName: ' + node.nodeName)
    print('.nodeValue: {0}'.format(node.nodeValue))
    for child in node.childNodes:
        print_some_info(child)

print_some_info(dom)
(using e.g. an xml example in file 'my_xml.xml' from here)
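For the second part, a read-only QTextEdit is probably the simplest of the three widgets you mention for plain display. A minimal sketch, assuming PyQt5 and the same 'my_xml.xml' file:
import sys
from xml.dom import minidom
from PyQt5.QtWidgets import QApplication, QTextEdit

dom = minidom.parse('my_xml.xml')

app = QApplication(sys.argv)
view = QTextEdit()
view.setReadOnly(True)
view.setPlainText(dom.toprettyxml())  # or build your own string from the dom object
view.show()
sys.exit(app.exec_())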