I am trying to concatenate many numpy files. I am using Cygwin, and this is the command I run:
ls | sort --field-separator = --key 2 -h | xargs -rn 4 cat >All_Numpy_Files.npy
Let's suppose that I have 100 files. Creating the final file takes a long time, but in the end all I find in the resulting file is just the first file in the list.
The shape of the resulting file is (1, 800) instead of (100, 8000).
cat >> will append to an already existing file
cat > will start from scratch every time and overwrite the previous content
http://www.tldp.org/LDP/abs/html/io-redirection.html
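A quick illustration of the difference (demo.txt is just an illustrative file name):
echo first > demo.txt      # demo.txt contains only "first"
echo second > demo.txt     # > truncates: demo.txt now contains only "second"
echo third >> demo.txt     # >> appends: demo.txt now contains "second" and "third"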
I have a problem with xargs and curl.
I have a URL list in a "urls" file, and I need to download the contents of each URL, limited to the first 9 lines, and save it all to one output file (or one file per result; it doesn't matter).
xargs -P 4 -n 1 curl < urls | head -n 9 > outputfile
The problem is that only the first result is saved to the file; all the others give the error "(23) Failed writing body". Even when I don't save the results to a file, the "(23) Failed writing body" error appears in the console.
In sum:
I need to download the first 9 lines of XXXX URLs from a file and save them to one output file, or one file per URL.
The problem occurs on both Cygwin (Windows 10) and macOS.
Your pipeline applies head to the combined output of everything xargs runs, so only the first 9 lines in total get through; once head exits, the pipe closes and the remaining curl processes fail with "(23) Failed writing body". Try this instead.
xargs -P 4 -i sh -c 'curl {} | head -n 9' <urls >outputfile
This will probably mix up the output lines of parallel fetches uncontrollably. If you want to avoid that, maybe look at GNU parallel. If that's unacceptable, maybe write each to a separate temporary file and concatenate and delete the temporary files when the fetching is done.
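A rough sketch of that temporary-file approach, assuming GNU xargs and coreutils, URLs with no whitespace or quote characters in them, and a scratch directory name of my own choosing:
mkdir -p tmp
nl -ba -nrz urls | xargs -P 4 -n 2 sh -c 'curl -s "$2" | head -n 9 > "tmp/$1"' _
cat tmp/* > outputfile
rm -r tmp
nl prefixes each URL with a zero-padded line number, which becomes the temporary file name, so the final cat reassembles the results in the same order as the urls file.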
I have a large .gz file (2.1 GB) that I am trying to load into R, but it is large enough that I have to split it into pieces and load each individually before recombining them. However, I am having difficulty splitting the file in a way that preserves the structure of the data. The file itself, with the exception of the first two rows, is a 56318 x 9592 matrix with non-homogeneous entries.
I'm using Ubuntu 16.04. First, I tried using the split command from terminal as suggested by this link (https://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts?rq=1)
$ split --lines=10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
Doing this, though, creates far more files than I expected (since my matrix has 57000 rows, I was hoping for 6 files of 10000 rows each). When reading one of these into R and checking the dimensions, I see that each is a 62x9592 matrix, indicating that the columns have all been preserved but I'm getting significantly fewer rows than I had hoped for. Furthermore, when reading it in, I get an error about an unexpected end of file. My conclusion is that it's not being read in the way I want.
I found two possible alternatives here - https://superuser.com/questions/381394/unix-split-a-huge-gz-file-by-line
In particular, I've tried decompressing with gunzip and then passing the output through to split (on the assumption that the compression is perhaps what led to the unexpected end of file). I tried
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
but, doing this, I ended up with the exact same splits that I had previously. I have the same problem replacing "zcat" with "gunzip -c", which should have sent the uncompressed output to the split command.
Another answer on that link suggested piping to head or tail with something like zcat, for example
$ zcat originalFile.gct.gz | head -n 10000 >> "originalFile.gct.gz.1"
With zcat, this works perfectly, and it's exactly what I want. The dimension for this ends up being 10000x9592, so this is the ideal solution. One thing that I'll note is that this output is an ASCII text file rather than a compressed file, and I'm perfectly OK with that.
However, I want to be able to do this until the end of the file, creating an additional output file for every 10000 rows. For this particular case it's not a huge deal to make the six by hand, but I have tens of files like this, some of which are >10 GB. My question, then, is: how can I use the split command so that it takes the first 10000 lines of the unzipped file and outputs them, automatically updating the suffix for each new file? Basically, I want the output that I got from using "head", but with "split" so that I can do it over the entire file.
Here is the solution that ended up working for me:
$ zcat originalFile.gct.gz | split -l 10000 - "originalFile.gct.gz-"
As Guido mentioned in the comment, my original command
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
was discarding the output of zcat, and split was once again reading directly from the compressed file. By using "-" as split's input argument, I was able to pass the standard output from zcat into split, and now the piping works as I expected.
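As a side note, if you ever want the chunks to come out compressed as well, GNU split has a --filter option that fits into the same pipe (a sketch, assuming a reasonably recent coreutils; the prefix is the same as above):
$ zcat originalFile.gct.gz | split -l 10000 --filter='gzip > $FILE.gz' - "originalFile.gct.gz-"
The single quotes matter here: $FILE is expanded by split for each chunk, not by the shell.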
When you want to control your splitting better, you can use awk.
You mentioned that the first two rows were special.
Try something like
zcat originalFile.gct.gz |
awk 'BEGIN {j=1} NR<3 {next} {print > ("originalFile.gct.part" j); i++} i%5==0 {j++}'
When you want your output files compressed, modify the awk command: let it print the names of the completed files and use xargs to gzip them.
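A rough sketch of that idea, reusing the 5-line chunks from above (swap the 5 for 10000 to match the question; the part names are only illustrative):
zcat originalFile.gct.gz |
awk 'BEGIN {j=1; p="originalFile.gct.part"} NR<3 {next} {print > (p j); i++}
     i%5==0 {close(p j); print p j; j++}
     END {if (i%5 != 0) {close(p j); print p j}}' |
xargs -r gzip
Each part name is printed only after its file has been closed, so xargs can safely gzip it while awk keeps writing the next part; the END block takes care of a final, partially filled part.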
If splitting based on the content of the file works for you, try:
zcat originalFile.gct.gz | awk -F$',' '{print $0 | "gzip > /tmp/file_"$1".gct.gz";}'
An example line of my file was:
2014,daniel,2,1,2,3
So I was splitting the file by year (the first column) using the variable $1,
getting an output of:
/tmp/file_2014.gct.gz
/tmp/file_2015.gct.gz
/tmp/file_2016.gct.gz
/tmp/file_2017.gct.gz
/tmp/file_2018.gct.gz
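One thing to keep in mind with this approach: awk keeps one gzip pipe open per distinct key, so if the key column has many distinct values you can hit the open-file limit. A hedged variant that closes each pipe as soon as the key changes (this assumes the input is already grouped by that column, as a file sorted by year would be):
zcat originalFile.gct.gz | awk -F',' '{cmd = "gzip > /tmp/file_" $1 ".gct.gz"; if (cmd != prev) {if (prev != "") close(prev); prev = cmd} print $0 | cmd}'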
I need a command line or a bash script to move the first 80 files (sorted by name) in a folder (which contains 30000 files) into new folders, each storing a chunk of 80 files for individual processing with ImageMagick.
I have tried ls pathtofolder/Pictures/* | head -80 | xargs -I{} cp {} pathtofolder/OutputFolder and other similar commands, but the files (named Pictures%d.jpg) are copied in a weird order (like 1 to 5, then 10 to 16, then 100 to 160, and so on, completing 80 files in total).
The easiest way I found was the convert image-%d.jpg[1-5] syntax, as this page says, but it doesn't seem to work (I tried convert -delay 3.33 -loop 0 pathtofolder/Pictures%d.jpg[100-180] pathtofolder/Test.gif); it throws this error:
zsh: no matches found:
/home/naldrek/Videos/Pictures/Pictures%d.jpg[100-180]
I tried other things too, and I read a lot over the internet. Can't make it work.
How about a straightforward solution like
for F in $(ls -U | sort -V | head -80); do   # sort -V: natural sort, so Pictures2.jpg comes before Pictures10.jpg
    cp "$F" /path/to/target
    convert /path/to/target/"$F" ...         # your ImageMagick processing here
done
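If the goal is eventually to do this for all 30000 files, not just the first 80, a rough sketch (assuming GNU coreutils, file names without spaces, and output folder names of my own choosing) is to split the naturally sorted file list into groups of 80 and copy each group into its own numbered folder:
ls -v pathtofolder/Pictures/*.jpg | split -l 80 - chunk_
n=1
for list in chunk_*; do
    mkdir -p pathtofolder/OutputFolder/$n
    xargs -a "$list" cp -t pathtofolder/OutputFolder/$n
    n=$((n+1))
done
rm chunk_*
ls -v sorts the names naturally (2 before 10), split writes the list out in 80-line pieces, and xargs -a feeds each piece to cp -t.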
Here is my problem:
I have a folder where multiple files are stored with a specific name format:
Name_of_file.TypeMM-DD-YYYY-HH:MM
where MM-DD-YYYY-HH:MM is the time of its creation. There could be multiple files with the same name but not the same time of course.
What I want is a script that keeps the 3 newest versions of each file.
So, I found one example there:
Deleting oldest files with shell
But I don't want to delete a fixed number of files; I want to keep a certain number of the newest ones. Is there a way to make that find command parse out Name_of_file and keep only the 3 newest?
Here is the code I've tried so far, but it's not exactly what I need.
find /the/folder -type f -name 'Name_of_file.Type*' -mtime +3 -delete
Thanks for help!
So I decided to add my final solution in case anyone else would like it. It's a combination of the two solutions given.
ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}" | awk 'NR > 3' | xargs rm
One line, super efficient. If anything changes in the date or name pattern, just change the grep -P pattern to match it. This way you can be sure that only files matching this pattern get deleted.
Can you be extra, extra sure that the timestamp on the file is the exact same timestamp on the file name? If they're off a bit, do you care?
The ls command can sort files by timestamp order. You could do something like this:
$ ls -t | awk 'NR > 3' | xargs rm
The ls -t lists the files by modification time, newest first.
The awk 'NR > 3' prints out the list of files except for the first three lines, which are the three newest.
The xargs rm will remove the files that are older than the first three.
Now, this isn't the exact solution. There are possible problems with xargs because file names might contain weird characters or whitespace. If you can guarantee that's not the case, this should be okay.
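For what it's worth, a whitespace-safe sketch of the same idea (assuming a reasonably recent GNU find/sort/tail/cut/xargs; like the pipeline above, it keeps the 3 newest overall, not per name):
find . -maxdepth 1 -type f -printf '%T@\t%p\0' | sort -z -rn | tail -z -n +4 | cut -z -f 2- | xargs -0 -r rm --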
Also, you probably want to group the files by name, and keep the last three. Hmm...
ls | sed -E 's/[0-9]{2}-[0-9]{2}-[0-9]{4}-[0-9]{2}:[0-9]{2}$//' | sort -u | while read -r file
do
    ls -t "$file"* | awk 'NR > 3' | xargs rm
done
The ls will list all of the files in the directory. The sed will strip the trailing MM-DD-YYYY-HH:MM timestamp from each name. The sort -u will make sure you only have the unique file names. Thus
file1.txt-01-12-1950
file2.txt-02-12-1978
file2.txt-03-12-1991
Will be reduced to just:
file1.txt
file2.txt
These are fed through the loop: the ls -t "$file"* lists all of the files that start with that base name, newest first, the awk strips out the first three lines (the three newest), and the xargs rm deletes the rest, leaving only the newest three.
Assuming we're using the date in the filename to date the archive file, and that it is possible to change the date format to YYYY-MM-DD-HH:MM (as established in the comments above), here's a quick-and-dirty shell script to keep the newest 3 versions of each file within the present working directory:
#!/bin/bash
KEEP=3 # number of versions to keep
while read FNAME; do
NODATE=${FNAME:0:-16} # get filename without the date (remove last 16 chars)
if [ "$NODATE" != "$LASTSEEN" ]; then # new file found
FOUND=1; LASTSEEN="$NODATE"
else # same file, different date
let FOUND="FOUND + 1"
if [ $FOUND -gt $KEEP ]; then
echo "- Deleting older file: $FNAME"
rm "$FNAME"
fi
fi
done < <(\ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}")
Example run:
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2011-12-12-12:11
some_file.exe2012-01-11-23:11
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
[me#home]$ ./delete_old.sh
- Deleting older file: some_file.exe2012-01-11-23:11
- Deleting older file: some_file.exe2011-12-12-12:11
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
Essentially, by changing the date in the file name to the form YYYY-MM-DD-HH:MM, a plain string sort (such as the one done by ls) automatically groups similar files together, sorted by date-time.
The ls -r on the last line simply lists all files within the current working directory and prints the results in reverse order, so newer archive files appear first.
We pass the output through grep to extract only files that are in the correct format.
The output of that command combination is then looped through (see the while loop) and we can simply start deleting after 3 occurrences of the same filename (minus the date portion).
This pipeline will get you the 3 newest files (by modification time) in the current dir
stat -c $'%Y\t%n' file* | sort -n | tail -3 | cut -f 2-
To get all but the 3 newest:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2-
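And if the goal is to delete everything except the 3 newest, that second pipeline can feed rm directly (a sketch, assuming GNU xargs and file names without embedded newlines):
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2- | xargs -d '\n' -r rm --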
I have uploaded a file to a Linux computer, but I do not know its name. How can I list files sorted by their creation date so that I can find it?
ls -lat
will show a list of all files sorted by date. When listing with the -l flag, adding the -t flag sorts by modification date. If you only need the filename (for a script, maybe), then try something like:
ls -lat | head -2 | tail -1 | awk '{print $9}'
This will list all files as before, take the first 2 rows (the first one will be something like 'total 260'), then take the last of those (the row that shows the details of the newest file), and then print the 9th column, which contains the filename.
find / -cmin -5
will print the files whose status changed (which includes creation) in the last five minutes. Increase the period one minute at a time until you find your file.
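If searching from / is too noisy, a narrower sketch (assuming the upload landed somewhere under your home directory):
find ~ -type f -cmin -5 2>/dev/null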
Assuming you know the folder where you'll be searching, the easiest solution is:
ls -t | head -1
# use -A in case the file can start with a dot
ls -tA | head -1
ls -t will sort by time, newest first (from ls --help itself)
head -1 will only keep 1 line at the top of anything
Use ls -lUt or ls -lUtr, as you wish. You can take a look at the ls documentation by typing man ls in a terminal.