Using linux sort on multiple files - linux

Is there a way I can run the following command with Linux for many files at once?
$ sort -nr -k 2 file1 > file2

I assume you have many input files, and you want to create a sorted version of each of them. I would do this using something like
for f in file*
do
    sort -nr -k 2 "$f" > "$f.sort"
done
Now, this has the small problem that if you run it again, it will not only sort all the files again, it will also create file1.sort.sort to go with file1.sort. There are various ways to fix that. We can fix the second problem by creating sorted files whose names don't begin with "file":
for f in file*
do
    sort -nr -k 2 "$f" > "sorted.$f"
done
But that's kind of weird, and I wouldn't want files named like that. Alternatively, we could use a slightly more clever script that checks whether the file needs sorting, and avoids both problems:
for f in file*
do
    if expr "$f" : '.*\.sort' > /dev/null
    then
        : no need to sort
    elif test -e "$f.sort"
    then
        : already sorted
    else
        sort -nr -k 2 "$f" > "$f.sort"
    fi
done
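If your shell is bash, a shorter sketch of the same idea (assuming the extglob option is available) keeps the .sort files out of the glob in the first place:
shopt -s extglob
for f in file!(*.sort)
do
    # skip files that already have a sorted counterpart
    [ -e "$f.sort" ] || sort -nr -k 2 "$f" > "$f.sort"
done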

Related

pasting many files to a single large file

I have many text files in a directory like 1.txt 2.txt 3.txt 4.txt ... 2000.txt and I want to paste them to make one large file.
In this regard I did something like
paste *.txt > largefile.txt
but the above command reads the .txt files in an arbitrary order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with just ulimit -n 99999.
Without raising the soft limit, you can go with a temporary file that accumulates the results of each "round" of ulimit -n files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
    paste accumulator.txt "$@" > accumulator.txt.sav;
    mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
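For completeness, a sketch of the whole thing assembled (assuming GNU xargs; the accumulator is named without a .txt suffix here so the *.txt glob cannot pick it up):
touch accumulated.out
printf "%s\n" *.txt | sort -n |
xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
    paste accumulated.out "$@" > accumulated.out.sav;
    mv accumulated.out.sav accumulated.out
' _
cat accumulated.out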
Instead of using the wildcard * to enumerate all the files in a directory, if your file names are numbered sequentially you can list them explicitly in order and concatenate them into one large file. The output order of a * expansion can differ between environments and may not be what you expect.
Below is a simple example
$ for i in $(seq 20); do echo "$i" > "$i.txt"; done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt, containing the numbers 1 to 20 respectively
$ cat {1..20}.txt
# show the contents of all the files in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them into a large file named 1_20.txt
In bash, as in any other shell, glob expansions are sorted lexicographically (in collation order), not numerically. With numbered files this sadly means that 2.txt sorts after 10.txt and 11.txt, and depending on your locale 11.txt may even sort before 1.txt: the names are compared character by character, so their numeric value is never taken into account.
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%05d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (e.g. 1234.txt), in which case paste would complain about the nonexistent name. So you can do
shopt -s extglob nullglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one occurrence of pattern, and nullglob makes patterns that match nothing disappear instead of being passed on literally. So this will exclude the missing files but keep the order.
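A small self-contained check of that behaviour (a sketch; the file names here are made up for illustration):
shopt -s extglob nullglob
for i in 1 2 3 5 6; do echo "$i" > "$i.txt"; done   # note: 4.txt is deliberately missing
paste ?({1..6}.txt)    # pastes 1.txt 2.txt 3.txt 5.txt 6.txt in order, silently skipping 4.txt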

Unable to cat ~9000 files using command line

I am trying to cat ~9000 fasta-like files into one larger file. All of the files are in a single subfolder. I keep getting the "argument list too long" error.
This is a sample name from one of the files
efetch.fcgi?db=nuccore&id=CL640905.1&rettype=fasta&retmode=text
They are considered a document type file by the computer.
You can't use cat * > concatfile as you have limits on command line size. So take them one at a time and append:
ls | while read -r; do cat "$REPLY" >> concatfile; done
(Make sure concatfile doesn't exist beforehand.)
EDIT: As user6292850 rightfully points out, I might be overthinking it. This suffices, if your files don't have too weird names:
ls | xargs cat > concatfile
(but files with spaces in them, for example, would blow it up)
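A more robust variant (a sketch, assuming GNU xargs) copes with spaces in the names; printf is a shell builtin, so it is not subject to the argument-length limit, and the output is written one directory up so the glob cannot pick it up:
printf '%s\0' ./* | xargs -0 cat > ../concatfile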
There is a limit on how many arguments you can place on the command line.
You could use a while loop fed by find to handle this:
while read -r file; do
    cat "${file}" >> path/to/output_file
done < <(find path/to/input_folder -maxdepth 1 -type f -print)
This will bypass the problem of an expanded glob with too many arguments.
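Alternatively, find can batch the arguments for you (a sketch, assuming the ~9000 files sit directly in the current directory and the output is written one level up so find does not see it):
find . -maxdepth 1 -type f -exec cat {} + > ../concatfile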

Alternative to ls in shell-script compatible with nohup

I have a shell-script which lists all the file names in a directory and store them in a new file.
The problem is that when I execute this script with the nohup command, it lists the first name four times instead of listing the correct names.
Commenting the problem with other programmers they think that the problem may be the ls command.
Part of my code is the following:
for i in $( ls -1 ./Datasets/); do
awk '{print $1}' ./genes.txt | head -$num_lineas | tail -1 >> ./aux
let num_lineas=$num_lineas-1
done
Do you know an alternative to ls that works well with nohup?
Thanks.
Don't use ls to feed the loop, use:
for i in ./Datasets/*; do
or if subdirectories are of interest
for i in ./Datasets/*/*; do
Lastly, and more correctly, use find if you need the entire tree below Datasets:
find ./Datasets -type f | while IFS= read -r file; do
    (do stuff with "$file")
done
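If the file names might even contain newlines, a null-delimited variant (assuming GNU find and bash) is the safe form of the same loop:
find ./Datasets -type f -print0 | while IFS= read -r -d '' file; do
    (do stuff with "$file")
done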
Others frown on this because the command substitution word-splits the output (so it breaks on file names containing spaces or glob characters), but if your names are plain there is nothing wrong with also using find as:
for file in $(find ./Datasets -type f); do
    (do stuff with "$file")
done
Just choose the syntax that most closely meets your needs.
First of all, don't parse ls! A simple glob will suffice. Secondly, your awk | head | tail chain can be simplified by only printing the first column of the line that you're interested in using awk. Thirdly, you can redirect the output of your loop to a file, rather than using >>.
Incorporating all of those changes into your script:
for i in Datasets/*; do
    awk -v n="$(( num_lineas-- ))" 'NR==n{print $1}' genes.txt
done > aux
Every time the loop goes round, the value of $num_lineas will decrease by 1.
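If the loop really only exists to step through as many lines of genes.txt as there are entries in Datasets, the whole thing can be done in one awk pass; a sketch, assuming num_lineas holds the starting line number as in the original script:
files=(Datasets/*)
# collect the wanted first fields, then print them from line $num_lineas downwards
awk -v start="$num_lineas" -v count="${#files[@]}" '
    NR > start - count && NR <= start { line[NR] = $1 }
    END { for (i = start; i > start - count; i--) print line[i] }
' genes.txt > aux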
In terms of your problem with nohup, I would recommend looking into using something like screen, which is known to be a better solution for maintaining a session between logins.

cat | sort csv file by name in bash

I have a bunch of csv files that I want to save into one file, ordered by name.
I use
cat *.csv | sort -t\ -k2 -n *.csv > output.csv
This works fine for names like a001, a002, a010, a100
but in my files the names are messed up a bit, so they are like a1, a2, a10, a100
and the command I wrote arranges things like this:
cn201
cn202
cn202
cn203
cn204
cn99
cn98
cn97
cn96
..
cn9
can anyone please help me ?
Thanks
If I understand correctly, you want to use the -V (version-sort) flag instead of -n. This is only available on GNU sort, but that's probably the one you are using.
However, it depends how you want the prefixes to be sorted.
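With -V, the command might look something like this (assuming GNU sort and the same space delimiter as in your original command):
sort -t' ' -k2 -V *.csv > output.csv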
If you don't have the -V option, sort allows you to be more precise about what characters constitute a sort key.
sort -t' ' -k2.3n *.csv > output.csv
The .3 tells sort that the key to sort on starts with the 3rd character of the second field, effectively skipping the cn prefix. You can put the n directly in the field specifier, which saves you two whole characters, but more importantly for more complex sorts, allows you to treat just that key as a number, rather than applying -n globally (which is only an issue if you specify multiple keys with several uses of -k).
The sort version on the live server is 5.97, from 2006,
so a few things did not work correctly.
However, the code below is my solution:
#!/bin/bash
echo "This script reads all CSVs into a single file (clusters.csv) in this directory"
for filers in *.csv
do
    echo "" >> clusters.csv
    echo "--------------------------------" >> clusters.csv
    echo "$filers" >> clusters.csv
    echo "--------------------------------" >> clusters.csv
    cat "$filers" >> clusters.csv
done
or, if you want to keep it simple, inside one command (FNR > 1 skips the first line of each file):
awk 'FNR > 1' *.csv > clusters.csv
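If the files share an identical header line and you want to keep it exactly once, a common variation of the same idea is:
awk 'NR==1 || FNR>1' *.csv > clusters.csv
NR==1 keeps the very first line (the first file's header) and FNR > 1 keeps the non-header lines of every file.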

Count/Enumerate files in folder filtered by content

I have a folder with lots of files with some data. Not every file has a complete data set.
The complete data sets all have a common string of the form 'yyyy-mm-dd' on the last line, so I thought I might filter with something like tail -n 1, but I have no idea how to do that.
Any idea how to do something like that in a simple script or bash command?
for f in *
do
    tail -n 1 "$f" |
    grep -qE '^[0-9]{4}-[01][0-9]-[0-3][0-9]$' &&
    echo "$f"
done
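To count the complete files instead of listing them, the same loop can simply feed wc:
for f in *
do
    tail -n 1 "$f" |
    grep -qE '^[0-9]{4}-[01][0-9]-[0-3][0-9]$' &&
    echo "$f"
done | wc -l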
