Reading whole directory with Spark except one file - apache-spark

I have the following directory containing these CSV files:
/data/one.csv
/data/two.csv
/data/three.csv
/data/four.csv
If I want to read everything, I can simply do:
/data/*.csv
but I can not seem to read everything, except four.csv.
This:
/data/*[^four]*.csv
seemed to work but I think that if the list of files would be bigger, than this way of reading would probably be wrong (because of double wildcards).
Is there a good way to do this? I also am aware that:
/data/{one,two,three,^four}.csv
would solve this specific case, but I need the except method for future needs.
Thank you very much!

I am not 100% sure that this method will work, but you can try. You can use Bash/Python or whatever script to scan all the csv files in the folder, but not with the names four.csv.
The input for spark will be (assuming you have files: one.csv, two.csv, three.csv, four.csv, five.csv ,...up to n.csv files)
PathToFiles=[/data/one.csv, /data/two.csv, /data/three.csv, /data/five.csv, ..., /data/n.csv]
Then you can use (the code is in python)
filesRDD = spark.sparkContext.wholeTextFiles(",".join(PathToFiles))
I have wrote similar code in java, in my impression, it works and you can try.

Related

Need Help in a lot of things

In my company
I need to organized all files and folder from ALL servers into a Text, basally Inside the Platform Smartsheet.
I was writing all the names of the files and folder by hand, then, I realized that exist so many files, like crazy, will never end write them all!
So I tried to find an easy solution, and I find here a code in Windows that help a lot:
Code: Tree /f /a >file.txt (I try to convert txt to CSV online, but if I changed the code to this in the end: >file.csv also works xD don't need to convert)
Anyway, I need a solution to have in that code the SIZE of all files and folders, the boss wanna know the size of the all files too.
Then, other problem as, If I import to Smartsheet / Excel the tree effect (this is important the tree) don't works very well, and I'm crazy to figure out a way to do something like this: The tree effect while importing to smartsheet / excel need automatic have a hide / show lines (with symbol + and -). Like this photos:
with + : https://ibb.co/cLjQLpF
with - : https://ibb.co/g3Y2pps
To explain better: + means Folder and I want all folder (+) in a tree.
If exist a program that do this automatic much better, but I don't find no one.
Thanks
regards

How to create same path to directory on different platforms in Python?

In the piece of code I have, there are many instances where I have the following line
'/home/myname/directory'
For example, I have the following lines of code
filepath = os.listdir('/home/myname/directory')
for content in filepath
# do something
In the next part of the project, I have to share the code with some one else. I know this person runs openSUSE. If I want code to create that specific directory with the same path to the directory as mine, what do I need to include?
I know its going to involve the OS module but i am not sure which functions and methods to use.
Your code can check the existence of the directory and create it if not found:
if not os.path.exists("/home/myname/directory"):
os.makedirs("/home/myname/directory")
# do something

cv2.VideoCapture directory of images

I though this would be simple, but i have been caught by the simplest of puzzles which i can't find the answer to anywhere,
I have some code which reads images and then OpenCV looks for differences.
I read files with the following command
vs = cv2.VideoCapture("/home/andrew/images/image_%6d.jpg")
and this work perfectly with images called image_000000.jpg image_000001.jpg
However i don't want to rename my images so i would like to read files called
MDAlarm_20180921-031140.jpg whcih contain the date then time.
What is the printf format for this ? as what ever I try it does not work i.e no files found or do the files need to start from 0 , so i need to append an index
starting at 000000?
Lastly once i have this working how can i tell which file is being processed ?
Many Thanks
Andrew

Add a Timestamp to the End of Filenames with Grunt

During my Grunt tasks, add a unique string to the end of my filenames. I have tried grunt-contrib-copy and grunt-filerev. Neither have been able to do what I need them to...
Currently my LESS files are automatically compiled on 'save' in Sublime Text 3 (so this does not yet occur in my grunt tasks). Then, I open my terminal and run 'grunt', which concatenates (combines) my JS files. After this is done, then grunt should rename 'dist/css/main.css' and 'dist/js/main.js' with a "version" at the end of the filename.
I have tried:
grunt-contrib-copy ('clean:expired' deletes the concatenated JS before grunt-contrib-copy' can rename the file)
grunt-filerev ('This only worked on the CSS files for some reason, and it inserted the version number BEFORE the '.css'. Not sure why it didn't work on the JS files.')
Here's my Gruntfile.js
So, to be clear, I am not asking for "code review" I simply need to know how I can incorporate a "rename" process so that when the tasks are complete, I will have 'dist/css/main.css12345 & dist/js/main.js12345' with no 'dist/css/main.css' or 'dist/js/main.js' left in their respective directories.
Thanks in advance for any help!
UPDATE: After experimenting with this, I ended up using grunt-contrib-rename and it works great! I beleieve the same results can be achieved via grunt-contrib-copy, in fact I know it does the same thing. So either will work. As far as support for regex, not sure if both support it, so may be something else worth looking into before choosing one of these plugins :)
Your rename:dist looks like it should do what you want, you just need to move clean:dist to be the first task that runs (so it deletes things from the prior build rather than the current build). The order of tasks is defined by the array on this last line:
grunt.registerTask('default', ['jshint:dev', 'concat:dist', 'less:dist', 'csslint:dist', 'uglify:dist', 'cssmin:dist', 'clean:dist', 'rename:dist']);
That said, I'm not sure why you want this behavior. The more common thing to do is to insert a hash of the file into the filename before the file extension.
The difference between a hash and a timestamp is that the hash value will always be the same so long as the file contents don't change - so if you only change one file, the compiled output for just that file will be different and thus browsers only need to re-downloaded that one file while using cached versions of every other file.
The difference between putting this number before the file extension and after the extension is that a lot of tools (like your IDE) have behavior that changes based on the extension.
For this more standard goal, there are tons of ways to accomplish it but one of the more common is to combine grunt-filerev with grunt-usemin which will create properly named files and also update your HTML file(s) to reference these new file names
I'm not sure to understand completely what end you want, but if you add a var timestamp = new Date().getTime(); at the beginning of your gruntfile and concatenate to your dest param that should do the job.
dest: 'dist/js/main.min.js' + timestamp
Is it what your looking for?

Cannot zip files with the same name?

I could not believe this: it seems that the zip specification does not allow two different files with the same file name going into one zip file.
In my case I use an external file to specify all the files I wanna zip.
This could look like this:
../Website1/favicon.ico
../Website2/favicon.ico
and there we are, that's not possible, despite keeping the directory structure. You would expect the name to be <../Website1/favicon.ico> rather than but that does not seem to be the case, I get:
"Invalid ZIP request (cannot repeat names in Zip file)"
with WinZip. I tried the same with 7Zip - same result.
Strangely googling did not show many hits that really fit but those I found seem to confirm my findings. That's hard to believe since this limitation is very severe. I actually struggle to understand why this did not hit me a couple of decades earlier.
Am I overlooking something very basic here?
To be precise:
Adding these two files:
C:\Temp\Website1\FavIcon
C:\Temp\Website2\FavIcon
results in a single file; the last Add wins...
This however:
Website1\FavIcon
Website2\FavIcon
results in a zip file that contains both files.

Resources