Spark create a temp directory structure on each node

Spark create a temp directory structure on each node - apache-spark

I am working on a spark java wrapper which uses third party libraries, which will read files from a hard coded directory name say "resdata" from where job executes. I know this is twisted but will try to explain.
when I execute the job it is trying to find the required files in the path something like this below,
/data/Hadoop/yarn/local//appcache/application_xxxxx_xxx/container_00_xxxxx_xxx/resdata
I am assuming it is looking for the files in the current data directory , under that looking for directory name "resdata". At this point I don't know how to configure the current directory to any path on hdfs or local.
So looking for options to create directory structure similar to what the third party libraries expecting and copying required files over there. This I need to do on each node. I am working on spark 2.2.0
Please help me in achieving this?

just now got the answer I need to put all the files under resdata directory and zip it say restdata.zip, pass the file using the options "--archives" . Then each node will have directory restdata.zip/restdata/file1 etc

Related

Entering a proper path to files on DBFS

I uploaded files to DBFS:
/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
I tried to access them by pandas and I always receive information that such files don't exist.
I tried to use the following paths:
/dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
./FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv
What is funny, when I check them by dbutils.fs.ls I see all the files.
I found this solution, and I tried it already: Databricks dbfs file read issue
Moved them to a new folder:
dbfs:/new_folder/
I tried to access them from this folder, but still, it didn't work for me. The only difference is that I copied files to a different place.
I checked as well the documentation: https://docs.databricks.com/data/databricks-file-system.html
I use Databricks Community Edition.
I don't understand what I'm doing wrong and why it's happening like that.
I don't have any other ideas.

The /dbfs/ mount point isn't available on the Community Edition (that's a known limitation), so you need to do what is recommended in the linked answer:
dbutils.fs.cp(
'dbfs:/FileStore/shared_uploads/name_surname#xxx.xxx/file_name.csv',
'file:/tmp/file_name.csv')
and then use /tmp/file_name.csv as input parameter to Pandas' functions. If you'll need to write something to DBFS, then you do other way around - write to local file /tmp/..., and copy that file to DBFS.

Python - working with unkown directory name versions

I'm using python to download an application from a content distribution network. The application downloads as self extract file cabinet. When executed it creates a directory using a version naming format. For example app-version2/../app.exe, Thus I cannot rely on the folder name as it may change in the future. I'm trying to find the best way to work with the content inside the folder without depending on the actual folder name.
My idea was to rename the folder using os.listdir() and then os.rename(app-version2, myapp) This would work but is not automated. What would be the best automated method to find a folder name that contains version numbers and change that to something more static?

Assuming you want to find the path of the directory which begins with app, you can accomplish this using pathlib.Path:
from pathlib import Path
app_path = next(Path().glob('app*'))
This will give you the path to the first file or directory in your current directory whose name begins with "app".

Python ImportError when executing through windows scheduler

I did some searches for this topic and found some prior threads, but I did not understand any of them as I am still a total beginner in Python.
I have a Python script which has some long string variables stored in various .py files in a sub-directory. I'm importing the .py files from that sub-directory when I run the script. There is a __init__.py file in the sub-directory. The only reason I'm using this setup is that the long string variables which I'm storing in those other files would make the code very difficult to read as they are SQL strings and can span 50-100 lines each.
Everything works perfectly when I run this script through PyCharm.
However, when I run the script through Windows Scheduler or a batch file, I get an ImportError for all of the .py files in the sub-directory. The problem is definitely related to the python script not knowing where to look for those .py files when it's run through Windows Scheduler. But I'm not sure how to fix it.
The action for the scheduler task is to run the python exe
D:\Python35\python.exe
with the argument as the script
D:\python\tableaudatasourcebuilds\dcitechnicalperformance\dcitechnicalperformance0.py
So the full action looks like:
D:\Python35\python.exe "D:\python\tableaudatasourcebuilds\dcitechnicalperformance\dcitechnicalperformance0.py"
The subdirectory which stores the long string variables .py files is:
D:\python\tableaudatasourcebuilds\dcitechnicalperformance\dcitechnicalperformance0\
The imports look like:
from dcitechnicalperformance.dcitechnicalperformance0.dciquer import nzsqldciwk
Does anyone know how to address this problem? Any help is much appreciated.

Good afternoon,
First of all i don't know how much sense there is to store long SQL querys on a module, I'm not by any means an expert, but something like a JSON file (or hell, even store them in a table inside the sql) seems like a better approach.
About your problem I think it resides on the current directory where the task is launched, let me explain:
In PyCharm when you run the code it launches from the location of the file, and with so, it's able to find the directory with the module.
With the scheduled task it may be launching in another directory and so, it's unable to find the module as the directory is not present.
If you decide to stick with your reproach a plausible solution would be to create a .bat file that browses to the project location:
#ECHO OFF
D:
cd D:\python\tableaudatasourcebuilds\dcitechnicalperformance\
D:\Python35\python.exe dcitechnicalperformance0.py
And that should work.

Default path for file import Julia

I have created a package and am now creating my tests within the package. For one test my inputs are a set of files and my outputs will be a different set a files created within the test.
I am saving the input files in the test directory of my package and would like to save the output files there too. Since others may run this test, I do not want to specify the input/output file location using my own path eg /home/myname/.julia/v4.0/MyPackage/test/MyInputFile.txt
How do I specify that the input location is within the package's test folder?
So basically how do I tell Julia to look in the packages's folder under the test directory and not have to worry about specifying the entire path including user name etc?
For example currently I have to say
readtable(/home/myname/.julia/v4.0/MyPackage/test/MyInputFile.txt, separator = '\t', header = false)
But I'd like to just be able to say
readtable(/MyPackage/test/MyInputFile.txt, separator = '\t', header = false)
so that no matter who the user of the package is and where they may store the package, they can still run the test?
I know that LOAD_PATH gives the path Julia looks for packages but I can't find any information on where it looks when importing files.

joinpath(Pkg.dir("MyPackage"), "test") is what you need.

As #GnimucK mentioned in a comment, a better solution is
dirname(#__FILE__)
Why is this better? A package could be installed and used from somewhere else (not the standard package directory). Pkg.dir is "stupid" and does not know better. This is rare, of course, and in most cases it won't matter.

Access test resources within Haskell tests

This is probably a basic question but I've been Googling for a while on it... I have a Cabal-ized Haskell project and I'm in the process of writing integration tests for it. I want to be able to include test resources for my project in the same repo and access them in tests. For example, here are a couple things I want to accomplish:
1) Check a dummy database instance into my repo, including a shell script that spins up a database process. I want to write an Hspec integration test that spins up the database process, makes some calls to it, and then shuts it down. So I need to be able to find the shell script so I can use System.Process.createProcess on it.
2) Check in paired "input" and "output" files. My test should process each of the input files and compare them to a corresponding output file to make sure they match. (I've read about "golden" but it doesn't seem to solve the problem of finding/reading the input files in the first place?)
In short, how can I go about creating a "resources" folder in the root folder of my Haskell project and find the path to it inside tests?

Have a look at an existing project that uses input and output file.
For example, take haddock, the source code is at https://github.com/haskell/haddock. They have the test files under a folder (https://github.com/haskell/haddock/tree/master/html-test/ref) and they are referenced as extra-source-files in the cabal file (https://github.com/haskell/haddock/blob/master/haddock.cabal). Then the test code (https://github.com/haskell/haddock/blob/master/html-test/run.lhs) uses some CPP macro (__FILE__) to get the current directory, and can then resolve the files relative to that folder.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string