Using rvest to scrape from a password-protected website

I have been successfully using rvest, on an HP EliteBook (running Windows 7), to access data on a password-protected website for some time.
The code I was successfully running is:
# Load required packages
library(rvest)
# Connect to confidential live report
URL <- url("http://username:password#urlofpasswordprotectedsite")
# Read in data
RawData <- read_html(URL)
# Identify the table
RawDataTable <- html_nodes(RawData, "table")
RawDataTable1 <- html_table(RawDataTable[1], fill = TRUE)
# Make data.frame
RawData <- as.data.frame(RawDataTable1)
However, when I now attempt to scrape the data via R, I am faced with the following error:
Error in open.connection(x, "rb") : cannot open the connection
This error also happened when I worked on a Mac, but I was content to stick with the HP for running the analysis. I am able to load the following without issue:
htmlpage <- read_html("http://forecast.weather.gov/MapClick.php?lat=42.27925753000045&lon=-71.41616624299962#.V17UH-IrKHs")
Is this error due to my machine or is there a recent update to the rvest package that may be throwing the error?
Thank you.

I managed to solve this by using:
html_session("http://username:password#urlofpasswordprotectedsite")
in place of:
url("http://username:password#urlofpasswordprotectedsite")

Related

How to solve no such node error in pytables and h5py

I built an HDF5 dataset using PyTables. It contains thousands of nodes, each node being an image stored without compression (of shape 512x512x3). When I run a deep learning training loop (with a PyTorch dataloader) on it, it randomly crashes, saying that the node does not exist. However, it is never the same node that is missing, and when I open the file myself to verify whether the node is there, it is ALWAYS there.
I am running everything sequentially, as I thought the fault might be multithreaded/multiprocess access to the file, but that did not fix the problem. I have tried a LOT of things, but it never works.
Does anyone have an idea what to do? Should I add a timer between calls to give the machine time to reallocate the file?
Initially I was working with PyTables only, but in an attempt to solve my problem I tried loading the file with h5py instead. Unfortunately, it did not work any better.
Here is the error I get with h5py: "RuntimeError: Unable to get link info (bad symbol table node signature)"
The exact error may change, but every time it says "bad symbol table node signature".
PS: I cannot share the code because it is huge and part of a bigger codebase that is my company's property. I can still share the part below to show how I load the images:
with h5py.File(dset_filepath, "r", libver='latest', swmr=True) as h5file:
    node = h5file["/train_group_0/sample_5"]  # <- this line breaks
    target = node.attrs.get('TITLE').decode('utf-8')
    img = Image.fromarray(np.uint8(node))
    return img, int(target.strip())
Before accessing the dataset (node), add a test to confirm it exists. While you're adding checks, do the same for the attribute 'TITLE'. If you are going to use hard-coded path names (like '/train_group_0'), you should check that every node in the path exists (for example, does 'train_group_0' exist?). Or use one of the recursive visitor functions (.visit() or .visititems()) to be sure you only access existing nodes.
Modified h5py code with rudimentary checks looks like this:
sample = 'sample_5'
with h5py.File(dset_filepath, 'r', libver='latest', swmr=True) as h5file:
    if sample not in h5file['/train_group_0'].keys():
        print(f'Dataset Read Error: {sample} not found')
        return None, None
    else:
        node = h5file[f'/train_group_0/{sample}']  # <- this line breaks
        img = Image.fromarray(np.uint8(node))
        if 'TITLE' not in node.attrs.keys():
            print(f'Attribute Read Error: TITLE not found')
            return img, None
        else:
            target = node.attrs.get('TITLE').decode('utf-8')
            return img, int(target.strip())
You said you were working with PyTables. Here is code to do the same with the PyTables package:
import tables as tb

sample = 'sample_5'
# PyTables open_file() does not take h5py's libver/swmr arguments
with tb.open_file(dset_filepath, 'r') as h5file:
    if sample not in h5file.get_node('/train_group_0'):
        print(f'Dataset Read Error: {sample} not found')
        return None, None
    else:
        node = h5file.get_node(f'/train_group_0/{sample}')  # <- this line breaks
        img = Image.fromarray(np.uint8(node[:]))  # node[:] reads the array into memory
        if 'TITLE' not in node._v_attrs:
            print(f'Attribute Read Error: TITLE not found')
            return img, None
        else:
            target = node._v_attrs['TITLE'].decode('utf-8')
            return img, int(target.strip())
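As a further diagnostic, the recursive visitor functions mentioned above can list every node that actually exists at the moment the file is opened. Here is a minimal h5py sketch; the file name is a placeholder and the node path follows the question's layout:
import h5py

dset_filepath = 'dataset.h5'  # placeholder path
wanted = 'train_group_0/sample_5'  # visititems() names have no leading slash

with h5py.File(dset_filepath, 'r', libver='latest', swmr=True) as h5file:
    existing = []
    h5file.visititems(lambda name, obj: existing.append(name))
    if wanted in existing:
        node = h5file[wanted]
        print(node.shape)
    else:
        print(f'Dataset Read Error: {wanted} not found')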

How do I open a .xls file in ArcMap without Office installed?

I am trying to open and copy xls documents in the Catalog window of ArcMap 10.6.1.
Every time I try to expand the table I get this error message:
"Failed to connect to database. General function failure.
The external table doesn't have the expected format." (translated from German)
I've tried installing (then updating) "2007 Office System Driver: Data Connectivity Components" and "Microsoft Access Database Engine 2010 Redistributable" but that didn't help.
I'm running ArcGIS 10.6.1 on Windows Server 2012 R2 without an Office installation.
Interestingly, on another machine running ArcGIS 10.5.1 with the same OS and no Office, it works fine!
Try this, which can help you read and use data from a .xls file in the arcpy (Python) window:
import arcpy
import xlrd

inputExcel = xlrd.open_workbook(r"C:\Users\username\In_excel.xls")  # raw string avoids backslash escape issues
sheetNames = inputExcel.sheet_names()
excelSheet = inputExcel.sheet_by_name(sheetNames[0])

for x in range(0, excelSheet.nrows):
    for y in range(0, excelSheet.ncols):
        cellValue = excelSheet.cell_value(x, y)
        print(cellValue)
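If the goal is to get the sheet into something the Catalog window can add without the Office driver, one option (not part of the original answer, just a sketch built on the same xlrd calls, with a hypothetical output path) is to dump the sheet to a CSV, which ArcMap reads natively:
import csv
import xlrd

inputExcel = xlrd.open_workbook(r"C:\Users\username\In_excel.xls")
excelSheet = inputExcel.sheet_by_index(0)

# ArcMap 10.x ships Python 2.7; on Python 3 use open(path, "w", newline="") instead of "wb".
with open(r"C:\Users\username\In_excel.csv", "wb") as outFile:
    writer = csv.writer(outFile)
    for x in range(excelSheet.nrows):
        writer.writerow(excelSheet.row_values(x))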

Why would Tensorflow not save run_metadata?

I was simply trying to generate a summary that would show the run_metadata as follows:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary = sess.run([x, y], options=run_options, run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%d' % step)
train_writer.add_summary(summary, step)
I made sure the path to the logs folder exists, which is confirmed by the fact that the summary file is generated, but no metadata is present. To be honest, I am not sure a metadata file is actually generated, but when I open TensorBoard the graph looks fine and the session runs dropdown menu is populated. When I select any of the runs, a progress bar saying "Parsing metadata.pbtxt" appears, then stops and hangs about halfway through.
This prevents me from gathering any additional info about my graph. Am I missing something? A similar issue happened when trying to run this tutorial locally (MNIST summary tutorial). I feel like I am missing something simple. Does anyone have an idea what could cause this issue? Why would TensorBoard hang when trying to load session run data?
I can't believe I made it work right after posting the question, but here it goes. I noticed that this line:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
was giving me an error, so I removed the params and turned it into:
run_options = tf.RunOptions()
without realizing that this is what caused the metadata not to be parsed. Once I researched the error message:
Couldn't open CUDA library cupti64_90.dll
I looked into this GitHub thread and moved the file into the bin folder. After that I ran my code again with the trace_level param, got no errors, and the metadata was successfully parsed.
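For reference, here is a minimal, self-contained TF 1.x sketch of the pattern the question is aiming at; the tiny graph, the './logs' directory, and the step tag are placeholders, not the original code:
import tensorflow as tf

# Tiny placeholder graph so the trace has something to record.
a = tf.constant(3.0, name='a')
b = tf.constant(4.0, name='b')
total = tf.add(a, b, name='total')
tf.summary.scalar('total', total)
merged = tf.summary.merge_all()

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    train_writer = tf.summary.FileWriter('./logs', sess.graph)
    summary, _ = sess.run([merged, total],
                          options=run_options,
                          run_metadata=run_metadata)
    # Pass the RunMetadata object itself (not a path) and a unique tag per step.
    train_writer.add_run_metadata(run_metadata, 'step0')
    train_writer.add_summary(summary, 0)
    train_writer.close()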

Get data into Graphite/Carbon using Python3

I've got a Grafana docker image running with Graphite/Carbon. Getting data in using the CLI works, for example:
echo "local.random.diceroll $(((RANDOM%6)+1)) `date +%s`" | nc localhost 2003;
The following Python 2 code also works:
sock = socket.socket()
sock.connect((CARBON_SERVER, CARBON_PORT))
sock.sendall(message)
sock.close()
message is a string containing "key value timestamp", and this works: the data can be found. So the Grafana docker image is accepting data.
I wanted to get this working in Python 3, but the sendall function requires bytes as its parameter. The code change is:
sock = socket.socket()
sock.connect((CARBON_SERVER, CARBON_PORT))
sock.sendall(str.encode(message))
sock.close()
Now the data isn't inserted, and I can't figure out why. I tried this on a remote machine (same network) and on the local server. I also tried several packages (graphiti, graphiteudp), but they all seem to fail to insert the data, and they don't show any error message.
The simple example for graphiteudp on its GitHub page doesn't work either.
Any idea what I'm doing wrong?
You can add \n to the message you send. I have tried it with Python 3, and that works.
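A minimal Python 3 sketch of that fix, reusing the metric name from the nc example (the server and port values are assumptions for a local setup):
import socket
import time

CARBON_SERVER = 'localhost'  # assumption: Carbon on the local host, as in the nc example
CARBON_PORT = 2003

# Plaintext protocol: "metric value timestamp", terminated by a newline.
message = 'local.random.diceroll 4 %d\n' % int(time.time())

sock = socket.socket()
sock.connect((CARBON_SERVER, CARBON_PORT))
sock.sendall(message.encode('utf-8'))  # bytes, with the trailing \n
sock.close()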

Changing IOPub data rate in Jupyterlab

I will preface this by saying that I am very new to Python and PostgreSQL, and to programming in general. I am currently querying data from a PostgreSQL server and storing the data in a Python program running on JupyterLab 0.32.1. Up until this point I have had no problems querying this data, but now I am receiving an error.
import psycopg2 as p

ryandata = p.connect(dbname="agent_rating")
rcurr = ryandata.cursor()
rcurr.execute("SELECT ordlog_id FROM eta")
data = rcurr.fetchall()

mylist = []
for i in range(len(data)):
    orderid = data[i]
    mylist.append(orderid)

print(mylist)
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
Can anyone help fix this?
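The error message itself points at the fix: raise NotebookApp.iopub_data_rate_limit. A sketch of the config-file route follows; the file location and the exact values are assumptions, and the file can be created with "jupyter notebook --generate-config" if it does not exist yet. The same setting can also be passed on the command line as the --NotebookApp.iopub_data_rate_limit flag when starting the server.
# ~/.jupyter/jupyter_notebook_config.py
c.NotebookApp.iopub_data_rate_limit = 1.0e10  # bytes/sec, up from the 1e6 default
c.NotebookApp.rate_limit_window = 3.0         # secs, unchanged from the report above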

Resources