Unable to Load Resource Using getResource - groovy

I am simply trying to load a file from a package's resources folder. I have the following project structure:
I have tried the following in an attempt to load each of the txt files from the Populator.groovy script:
File file = new File(Populator.class.getResource("/names/first-names.txt").getFile())
The above results in a FileNotFoundException if any methods are called on the file instance. The path returned is correct, and the file is indeed where the path specifies. I am also using very similar methods to extract resources in the modules above, and no errors occur there. What's going on here?

Why not
File file = new File(Populator.class.getResource("/names/first-names.txt").toURI())
Not sure why you want it as a file though? Wouldn't an input stream do?
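If an input stream does do, a minimal Groovy sketch (assuming the same /names/first-names.txt resource on the classpath) could look like this:

// Read the whole classpath resource as text without going through a File
def text = Populator.class.getResource("/names/first-names.txt").text

// Or process it line by line from a stream
Populator.class.getResourceAsStream("/names/first-names.txt").withReader { reader ->
    reader.eachLine { line -> println line }
}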

Related

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks which have removed mount points that were previously established within the environment. We are using databricks to find points in polygons, as laid out in the databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read a GeoJSON file from ADLS into the notebook and then broadcast it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why did we previously have to use "/dbfs/" when reading GeoJSON but not CSV files, before the changes to our environment?
What is the correct way to read GeoJSON files into Databricks without a mount point set?
Why does our process fail when trying to read the second temp GeoJSON file we create?
Thanks in advance for any assistance - very new to Databricks...!
Pandas uses the local file API to access files, and you were accessing files on DBFS via /dbfs, which exposes that local file API. In your specific case, the problem is that even though you used dbutils.fs.cp, you didn't specify that you wanted to copy the file locally, so by default it was copied onto DBFS at the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights). As a result the local file API doesn't see it; you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
But the better way is to copy the file locally; you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
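Putting it together, here is a minimal sketch of that copy/read/clean-up cycle as a helper you could reuse for both files (the wasbs:// path is the one from the question; dbutils and sc are the notebook-provided globals, and geopandas is assumed to be installed on the cluster):

import geopandas as gpd

def read_geojson_via_local_copy(wasbs_path, local_name):
    # Copy from blob storage to the driver's local disk (note the file:// prefix)
    dbutils.fs.cp(wasbs_path, f"file:///tmp/{local_name}")
    # GeoPandas goes through the local file API, so read the plain /tmp path
    gdf = gpd.read_file(f"/tmp/{local_name}")
    # Remove the local copy once it has been loaded
    dbutils.fs.rm(f"file:///tmp/{local_name}")
    return gdf

nights = read_geojson_via_local_copy(
    "wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson",
    "temp_nights")
a_nights = sc.broadcast(nights)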

Autorenaming duplicate filename downloads in chrome/puppeteer/ubuntu

I'm downloading PDF files using headful (not headless) Chromium and Puppeteer. I call a JavaScript function in the browser context and the download starts. The file name comes as-is from the server. Issue: many of the files I download into a directory have the same name coming from the server, and instead of auto-suffixing an index like (1) to the file name, Chrome overwrites the existing file.
Since the file is downloaded by calling a JS function (which I have inspected as well), I don't have access to the PDF URL. The download is triggered by the function call, so I have no control over the file names.
I have a list of the file names, but that in no way helps in changing a filename on the fly if a file with that name already exists on the machine.
Config: Ubuntu 18.04, Puppeteer 1.18.1
I assume it's a configuration issue with either the Nautilus file manager or with Chrome. Is it possible to configure either of these two?
I cannot see an option within Node.js to rename the file before it's downloaded. A workaround is to download each file into a temp folder, then move it to the required folder, checking whether a file with that name already exists and renaming it if so (see the sketch below), but that adds a lot of overhead. It would be great to have Chrome or Nautilus do the task.
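For reference, a minimal Node.js sketch of that workaround (the temp and target directory paths are placeholders; it assumes the download into the temp directory has already completed and that both directories are on the same filesystem):

const fs = require('fs');
const path = require('path');

// Move a finished download from tempDir to targetDir, appending " (1)", " (2)", ...
// to the base name if the target name is already taken.
function moveWithoutOverwrite(tempDir, targetDir, fileName) {
  const ext = path.extname(fileName);          // e.g. ".pdf"
  const base = path.basename(fileName, ext);   // e.g. "pdf_name"
  let candidate = fileName;
  let counter = 1;
  while (fs.existsSync(path.join(targetDir, candidate))) {
    candidate = `${base} (${counter})${ext}`;
    counter += 1;
  }
  fs.renameSync(path.join(tempDir, fileName), path.join(targetDir, candidate));
  return candidate;
}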
Function which triggers the download:
await page.evaluate((doc_index, arg1, arg2) => openDocument(String(doc_index), String(arg1), String(arg2), 'ABC', '', '', 'XYZ'), doc_index, arg1, arg2)
Expected behaviour: when the above function is called and the PDF starts downloading into the set folder, if a PDF of the same name already exists, the new PDF should be renamed to something like pdf_name.pdf(1) or the like.

Spark - Folder with same name as text file automatically created after RDD?

I placed a text file named Linecount2.txt in HDFS and built a simple RDD to count the number of lines using Spark.
val lines = sc.textFile("user/root/hdpcd/Linecount2.txt")
lines.count()
This works.
But when I tried using the same text file with the aforementioned path, I received the error:
"org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
When I looked into that path, I could see that a folder called 'Linecount2.txt' had been created. Hence the path for the file is now
("user/root/hdpcd/Linecount2.txt/Linecount2.txt")
Then, after defining that path, I was able to run it successfully.
The third time I tried this, I got the same error because the input path doesn't exist.
When I went through the path,
Why does this happen?
There is a difference between putting an HDFS file at user/root/hdpcd/Linecount2.txt and at /user/root/hdpcd/Linecount2.txt (or, more simply, at hdpcd/Linecount2.txt when you are already the root user).
The leading slash is very important if you want to place a file in an absolute directory rather than under your current user's home directory, which is the default otherwise.
You've not given your hdfs put command, but the issue here is simply the difference between absolute and relative paths, and it's not Spark specifically that's the problem.
Also, hdfs put will say that a file already exists if you try to place it in the same location, so the fact that you were able to upload twice should be an indication that your path was incorrect.
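As an illustration (assuming the upload is run as the root user, so the HDFS home directory is /user/root; the local file name is just a placeholder):

hdfs dfs -put Linecount2.txt /user/root/hdpcd/Linecount2.txt   # absolute path
hdfs dfs -put Linecount2.txt hdpcd/Linecount2.txt              # relative path, resolves to the same location

val lines = sc.textFile("/user/root/hdpcd/Linecount2.txt")     // note the leading slash
lines.count()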

Line 2 ERROR The file NuxeoCSV-USERDOC.pdf does not exist

When I want to add an attachment (CSV) to a file using the Nuxeo CSV Import addon, I get this issue:
Line 2 ERROR The file NuxeoCSV-USERDOC.pdf does not exist
This is the csv file :
name,"type","dc:title","dc:description","file:content","dc:nature","dc:source"
nuxeo-csv-userdoc,"File","Nuxeo CSV User documentation","This is the user guide for Nuxeo CSV","NuxeoCSV-USERDOC.pdf","procedure","http://doc.nuxeo.com"
Nuxeo-csv-sample-3,"File","Nuxeo CSV Sample","This a second file imported with Nuxeo CSV","Nuxeo-csv-sample-3.odt","article","http://doc.nuxeo.com"
The documentation asks for some changes in the conf file, but I don't understand the last line. How am I supposed to add the path, and how can I add nuxeo.csv.blobs.folder? Just by pasting it?
Configuration:
The Nuxeo CSV addon enables users to create File documents and upload their main attachment at the same time. This requires configuring where the server will take the attachments from. This is done by adding the parameter nuxeo.csv.blobs.folder in the server's nuxeo.conf and giving it a value that is a local path to a folder that can be accessed by the server.
Thanks in advance.
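For what it's worth, that parameter is a plain key=value line added to nuxeo.conf; a sketch of what it might look like (the folder path below is only an example, and the files referenced in the CSV, such as NuxeoCSV-USERDOC.pdf, would need to be placed in that folder):

# Example only: local folder the CSV importer takes attachments from
nuxeo.csv.blobs.folder=/opt/nuxeo/csv-blobs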

How can I write my properties file if I want to get a log file every hour using log4j?

I have made my properties file OK, but what should I do if I want to put the log file in a folder related to the date?
For example, today is 2015-12-29; at 10:30 I started my Java project, and the relevant log4j.properties entries are the following:
log4j.appender.inforlog=org.apache.log4j.DailyRollingFileAppender
log4j.appender.inforlog.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.inforlog.File=D:/inforLogs/2015/12/searchrecord
When it reaches 11:00, there will be a log file named searchrecord.2015-12-29-10 in "D:/inforLogs/2015/12/". When it reaches 2016-01-01, the log file will still be placed in "D:/inforLogs/2015/12/", but I want it to go into "D:/inforLogs/2016/01/" by writing the properties file properly. What should I do?
I have resolved the problem myself; here is the properties file:
log4j.appender.inforlog.DatePattern='s/'yyyy'/'MM'/searchrecord-'dd'_'HH'.log'
log4j.appender.inforlog.File=D:/inforLog
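For completeness, a sketch of the full appender definition with those two lines in place (the layout lines are assumptions for illustration, not from the question). At roll-over, DailyRollingFileAppender appends the formatted DatePattern to the File value, so D:/inforLog is rolled to e.g. D:/inforLogs/2015/12/searchrecord-29_10.log, with the quoted 's/' supplying the trailing "s/" of the inforLogs folder and yyyy'/'MM creating the year/month directories:

log4j.appender.inforlog=org.apache.log4j.DailyRollingFileAppender
log4j.appender.inforlog.File=D:/inforLog
log4j.appender.inforlog.DatePattern='s/'yyyy'/'MM'/searchrecord-'dd'_'HH'.log'
# The layout below is an assumption for illustration
log4j.appender.inforlog.layout=org.apache.log4j.PatternLayout
log4j.appender.inforlog.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c - %m%n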
