Spark: Traverse HDFS subfolders and find all files with name "X" - apache-spark

I have a HDFS path and I want to traverse through all the subfolders and find all the files within that have the name "X".
I have tried to do this:
FileSystem.get( sc.hadoopConfiguration )
.listStatus( new Path("hdfs://..."))
.foreach( x => println(x.getPath))
But this only searches for files within 1 level and I want all levels.

You need to get all the files recursively. Loop through the path and get all the files, if it is a directory call the same function once again.
Below is a simple code you can modify as your configuration and test.
var fileSystem : FileSystem = _
var configuration: Configuration = _
def init() {
configuration = new Configuration
fileSystem = FileSystem.get(configuration)
val fileStatus: Array[FileStatus] = fileSystem.listStatus(new Path(""))
getAllFiles(fileStatus)
}
def getAllFiles(fileStatus: Array[FileStatus]) {
fileStatus.map(fs => {
if (fs.isDirectory)
getAllFiles(fileSystem.listStatus(fs.getPath))
else fs
})
}
Also filter the files that contains 'X' after getting the file list.

Related

How to get a list of all folders that list in a specific s3 location using spark in databricks?

Currently, I am using this code but it gives me all folders plus sub-folders/files for a specified s3 location. I want only the names of the folder live in s3://production/product/:
def get_dir_content(ls_path):
dir_paths = dbutils.fs.ls(ls_path)
subdir_paths = [get_dir_content(p.path) for p in dir_paths if p.isDir() and p.path != ls_path]
flat_subdir_paths = [p for subdir in subdir_paths for p in subdir]
return list(map(lambda p: p.path, dir_paths)) + flat_subdir_paths
paths = get_dir_content('s3://production/product/')
[print(p) for p in paths]
Current output returns all folders plus sub-directories where files live which is too much. I only need the folders that live on that hierachical level of the specifiec s3 location (no deeper levels). How do I teak this code?
just use dbutils.fs.ls(ls_path)

Creating an empty folder on Dropbox with Python. Is there a simpler way?

Here's my sample code which works:
import os, io, dropbox
def createFolder(dropboxBaseFolder, newFolder):
# creating a temp dummy destination file path
dummyFileTo = dropboxBaseFolder + newFolder + '/' + 'temp.bin'
# creating a virtual in-memory binary file
f = io.BytesIO(b"\x00")
# uploading the dummy file in order to cause creation of the containing folder
dbx.files_upload(f.read(), dummyFileTo)
# now that the folder is created, delete the dummy file
dbx.files_delete_v2(dummyFileTo)
accessToken = '....'
dbx = dropbox.Dropbox(accessToken)
dropboxBaseDir = '/test_dropbox'
dropboxNewSubDir = '/new_empty_sub_dir'
createFolder(dropboxBaseDir, dropboxNewSubDir)
But is there a more efficient/simpler way to do the task ?
Yes, as Ronald mentioned in the comments, you can use the files_create_folder_v2 method to create a new folder.
That would look like this, modifying your code:
import dropbox
accessToken = '....'
dbx = dropbox.Dropbox(accessToken)
dropboxBaseDir = '/test_dropbox'
dropboxNewSubDir = '/new_empty_sub_dir'
res = dbx.files_create_folder_v2(dropboxBaseDir + dropboxNewSubDir)
# access the information for the newly created folder in `res`

How do yoy recursively compare two directories in node

Id like to perform a comparison of two directories and all files within sub folders. The folder structure will be the same for both directories the files may be different. Call them directory A and directory B.
From that id like to create a directory C and directory D. All files in B that are newer than A or that are not found in A should copy over to C. Files missing from B that are found in A should be copied to directory D.
Id like to use node and either a library or run some other CLI tool like git perhaps that can do what I described without too much effort.
What would be some good approaches to accomplish this?
Get the list of filenames of both directories as two arrays, then find the difference between them.
const _ = require('lodash');
const fs = require('fs');
const aFiles = fs.readdirSync('/path/to/A');
const bFiles = fs.readdirSync('/path/to/B');
_.difference(aFiles, bFiles).forEach(v => {
// Files missing from B that are found in A should be copied to directory D
// Move file v to directory D
});
_.difference(bFiles, aFiles).forEach(v => {
// Files missing from A that are found in B should be copied to directory C
// Move file v to directory C
});
There's an npm package for this called dir-compare:
const dircompare = require('dir-compare');
const options = { compareSize: true };
const path1 = "dir1";
const path2 = "dir2";
const res = dircompare.compareSync(path1, path2, options)
console.log(res);

How to find missing files?

I have several files (with the same dim) in a folder called data for certain dates:
file2011001.bin named like this "fileyearday"
file2011009.bin
file2011020.bin
.
.
file2011322.bin
certin dates(files) are missing. What I need is just loop through these files
if file2011001.bin exist ok, if not copy any file in the directory and name it file2011001.bin
if file2011002.bin exist ok, if not copy any file in the directory and name it file2011002.bin and so on untill file2011365.bin
I can list them in R:
dir<- list.files("/data/", "*.bin", full.names = TRUE)
I wonder if it is possible thru R or any other language!
Pretty much what you'd expect:
AllFiles = paste0("file", 2010:2015, 0:364, ".bin")
for(file in AllFiles)
{
if(file.exists(file))
{
## do something
}
}

How to copy files in Groovy

I need to copy a file in Groovy and saw some ways to achieve it on the web:
1
new AntBuilder().copy( file:"$sourceFile.canonicalPath",
tofile:"$destFile.canonicalPath")
2
command = ["sh", "-c", "cp src/*.txt dst/"]
Runtime.getRuntime().exec((String[]) command.toArray())
3
destination.withDataOutputStream { os->
source.withDataInputStream { is->
os << is
}
}
4
import java.nio.file.Files
import java.nio.file.Paths
Files.copy(Paths.get(a), Paths.get(b))
The 4th way seems cleanest to me as I am not sure how good is it to use AntBuilder and how heavy it is, I saw some people reporting issues with Groovy version change.
2nd way is OS dependent, 3rd might not be efficient.
Is there something in Groovy to just copy files like in the 4th statement or should I just use Java for it?
If you have Java 7, I would definitely go with
Path source = ...
Path target = ...
Files.copy(source, target)
With the java.nio.file.Path class, it can work with symbolic and hard links. From java.nio.file.Files:
This class consists exclusively of static methods that operate on
files, directories, or other types of files. In most cases, the
methods defined here will delegate to the associated file system
provider to perform the file operations.
Just as references:
Copy files from one folder to another with Groovy
http://groovyconsole.appspot.com/view.groovy?id=8001
My second option would be the ant task with AntBuilder.
If you are doing this in code, just use something like:
new File('copy.bin').bytes = new File('orig.bin').bytes
If this is for build-related code, this would also work, or use the Ant builder.
Note, if you are sure the files are textual you can use .text rather than .bytes.
If it is a text file, I would go with:
def src = new File('src.txt')
def dst = new File('dst.txt')
dst << src.text
I prefer this way:
def file = new File("old.file")
def newFile = new File("new.file")
Files.copy(file.toPath(), newFile.toPath())
To append to existing file :
def src = new File('src.txt')
def dest = new File('dest.txt')
dest << src.text
To overwrite if file exists :
def src = new File('src.txt')
def dest = new File('dest.txt')
dest.write(src.text)
I'm using AntBuilder for such tasks. It's simple, consistent, 'battle-proven' and fun.
2nd approach is too OS-specific (Linux-only in your case)
3rd it too low-level and it eats up more resources. It's useful if you need to transform the file on the way: change encoding for example
4th looks overcomplicated to me... NIO package is relatively new in JDK.
In the end of the day, I'd go for 1st option. There you can switch from copy to scp task, without re-developing the script almost from scratch
This is the way using platform independent groovy script. If anyone has questions please ask in the comments.
def file = new File("java/jcifs-1.3.18.jar")
this.class.classLoader.rootLoader.addURL(file.toURI().toURL())
def auth_server = Class.forName("jcifs.smb.NtlmPasswordAuthentication").newInstance("domain", "username", "password")
def auth_local = Class.forName("jcifs.smb.NtlmPasswordAuthentication").newInstance(null, "local_username", "local_password")
def source_url = args[0]
def dest_url = args[1]
def auth = auth_server
//prepare source file
if(!source_url.startsWith("\\\\"))
{
source_url = "\\\\localhost\\"+ source_url.substring(0, 1) + "\$" + source_url.substring(1, source_url.length());
auth = auth_local
}
source_url = "smb:"+source_url.replace("\\","/");
println("Copying from Source -> " + source_url);
println("Connecting to Source..");
def source = Class.forName("jcifs.smb.SmbFile").newInstance(source_url,auth)
println(source.canRead());
// Reset the authentication to default
auth = auth_server
//prepare destination file
if(!dest_url.startsWith("\\\\"))
{
dest_url = "\\\\localhost\\"+ dest_url.substring(0, 1) + "\$" +dest_url.substring(2, dest_url.length());
auth = auth_local
}
def dest = null
dest_url = "smb:"+dest_url.replace("\\","/");
println("Copying To Destination-> " + dest_url);
println("Connecting to Destination..");
dest = Class.forName("jcifs.smb.SmbFile").newInstance(dest_url,auth)
println(dest.canWrite());
if (dest.exists()){
println("Destination folder already exists");
}
source.copyTo(dest);
For copying files in Jenkins Groovy
For Linux:
try {
echo 'Copying the files to the required location'
sh '''cd /install/opt/
cp /install/opt/ssl.ks /var/local/system/'''
echo 'File is copied successfully'
}
catch(Exception e) {
error 'Copying file was unsuccessful'
}
**For Windows:**
try {
echo 'Copying the files to the required location'
bat '''#echo off
copy C:\\Program Files\\install\\opt\\ssl.ks C:\\ProgramData\\install\\opt'''
echo 'File is copied successfully'
}
catch(Exception e) {
error 'Copying file was unsuccessful'
}

Resources