Unable to cat ~9000 files using command line - cat

I am trying to cat ~9000 FASTA-like files into one larger file. All of the files are in a single subfolder. I keep getting the "Argument list too long" error.
This is a sample name from one of the files:
efetch.fcgi?db=nuccore&id=CL640905.1&rettype=fasta&retmode=text
The computer considers them to be document-type files.

You can't use cat * > concatfile as you have limits on command line size. So take them one at a time and append:
ls | while read; do cat "$REPLY" >> concatfile; done
(Make sure concatfile doesn't exist beforehand.)
EDIT: As user6292850 rightly points out, I might be overthinking it. This suffices if your files don't have overly weird names:
ls | xargs cat > concatfile
(but files with spaces in them, for example, would blow it up)
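If the names might contain spaces or newlines, a null-delimited variant avoids that problem. This is a sketch, not from the original answer: printf is a shell builtin, so the glob is not subject to the kernel's argument-length limit, and xargs -0 (a GNU/BSD extension) splits on NUL bytes:
printf '%s\0' * | xargs -0 cat > concatfile
(As above, make sure concatfile doesn't already exist in the same directory, or the glob will pick it up.)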

There is a limit on how many arguments you can place on the commandline.
You could use a while loop reading from find to handle this:
while read -r file; do
cat "${file}" >> path/to/output_file
done < <(find path/to/input_folder -maxdepth 1 -type f -print)
This will bypass the problem of an expanded glob with too many arguments.
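Another option worth sketching: let find both select the files and batch the cat calls itself with -exec ... +, which also sidesteps the argument limit. The path is a placeholder; the ! -name test keeps the output file out of the input set in case it lives in the same directory:
find path/to/folder -maxdepth 1 -type f ! -name concatfile -exec cat {} + > concatfile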

Related

Batch renaming files using variable extracted from file text

Apologies if this has been answered, but I've spent several hours experimenting and searching online for the solution.
I have a folder with several thousand text files named e.g. '1.dat', '2.dat', '3.dat' etc. I would like to rename all of the files by extracting an 8-digit numerical ID from within the file text (the ID is always on the last line in columns 65-73), so that '1.dat' becomes '60741308.dat' etc.
I have made it as far as extracting the ID (using tail and cut) from a single file and assigning it to a variable, which I can then use to rename that file, but I am unable to make it work as a batch process in a 'for' loop.
Here is what I have tried:
for i in *.dat
tmpname=$(tail -1 $i| cut -c 65-73)
mv $i $tmpname.dat
done
I get the following error: bash: syntax error near unexpected token `tmpname=$(tail -1 $i| cut -c 65-73)'
Any help much appreciated.
The syntax of a for loop in Bash is:
for i in {1..10}
do
echo $i
done
I can see that you are missing the do keyword in your example. So the correct version would be:
for i in *.dat
do
tmpname=$(tail -1 "$i" | cut -c 65-73)
mv "$i" "$tmpname.dat"
done
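If some files might lack the ID, or two files could yield the same ID, a slightly more defensive variant might look like this (a sketch; the emptiness check and mv -n, which refuses to overwrite an existing file, are additions not in the original answer):
for i in *.dat; do
    tmpname=$(tail -n 1 "$i" | cut -c 65-73)
    # skip files where nothing could be extracted
    if [ -z "$tmpname" ]; then
        echo "no ID found in $i, skipping"
        continue
    fi
    mv -n "$i" "$tmpname.dat"
done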

Batch copy and rename multiple files in the same directory

I have 20 files like:
01a_AAA_qwe.sh
01b_AAA_asd.sh
01c_AAA_zxc.sh
01d_AAA_rty.sh
...
The file names share a similar format: they begin with 01 and match the pattern 01*AAA*.sh.
I wish to copy and rename files in the same directory, changing the number 01 to 02, 03, 04, and 05:
02a_AAA_qwe.sh
02b_AAA_asd.sh
02c_AAA_zxc.sh
02d_AAA_rty.sh
...
03a_AAA_qwe.sh
03b_AAA_asd.sh
03c_AAA_zxc.sh
03d_AAA_rty.sh
...
04a_AAA_qwe.sh
04b_AAA_asd.sh
04c_AAA_zxc.sh
04d_AAA_rty.sh
...
05a_AAA_qwe.sh
05b_AAA_asd.sh
05c_AAA_zxc.sh
05d_AAA_rty.sh
...
I wish to copy the 20 01*.sh files to 02*.sh, 03*.sh, 04*.sh, and 05*.sh. This will bring the total number of files in the folder to 100.
I'm really not sure how I can achieve this. I was trying to use a for loop in a bash script, but I'm not even sure what I should use as the loop index.
for i in {1..4}; do
cp 0${i}*.sh 0${i+1}*.sh
done
does not work.
There are going to be a lot of ways to slice-n-dice this one ...
One idea using a for loop, printf + brace expansion, and xargs:
for f in 01*.sh
do
printf "%s\n" {02..05} | xargs -r -I PFX cp ${f} PFX${f:2}
done
The same thing but saving the printf in a variable up front:
printf -v prefixes "%s\n" {02..05}
for f in 01*.sh
do
<<< "${prefixes}" xargs -r -I PFX cp ${f} PFX${f:2}
done
Another idea using a pair of for loops:
for f in 01*.sh
do
for i in {02..05}
do
cp "${f}" "${i}${f:2}"
done
done
Starting with:
$ ls -1 0*.sh
01a_AAA_qwe.sh
01b_AAA_asd.sh
01c_AAA_zxc.sh
01d_AAA_rty.sh
All of the proposed code snippets leave us with:
$ ls -1 0*.sh
01a_AAA_qwe.sh
01b_AAA_asd.sh
01c_AAA_zxc.sh
01d_AAA_rty.sh

02a_AAA_qwe.sh
02b_AAA_asd.sh
02c_AAA_zxc.sh
02d_AAA_rty.sh

03a_AAA_qwe.sh
03b_AAA_asd.sh
03c_AAA_zxc.sh
03d_AAA_rty.sh

04a_AAA_qwe.sh
04b_AAA_asd.sh
04c_AAA_zxc.sh
04d_AAA_rty.sh

05a_AAA_qwe.sh
05b_AAA_asd.sh
05c_AAA_zxc.sh
05d_AAA_rty.sh

NOTE: blank lines added for readability
You can't do multiple copies in a single cp command, except when copying a bunch of files to a single target directory. cp will not do the name mapping automatically. Wildcards are expanded by the shell; they're not seen by the commands themselves, so it's not possible for cp to do this kind of pattern matching.
To add 1 to a variable, use $((i+1)).
You can use the shell substring expansion operator to get the part of the filename after the first two characters.
for i in {1..4}; do
for file in 0${i}*.sh; do
fileend=${file:2}
cp "$file" "0$((i+1))$fileend"
done
done
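For reference, here is what the two expansions used above produce for one of the sample names (results shown as comments):
f=01a_AAA_qwe.sh
i=1
echo "${f:2}"              # a_AAA_qwe.sh  (everything after the first two characters)
echo "0$((i+1))${f:2}"     # 02a_AAA_qwe.sh  (the target name built inside the loop)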

pasting many files to a single large file

I have many text files in a directory, like 1.txt 2.txt 3.txt 4.txt ... 2000.txt, and I want to paste them together to make one large file.
To do this I tried something like
paste *.txt > largefile.txt
but the above command reads the .txt files in a seemingly random order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with just ulimit -n 99999.
Without raising the soft limit, go with a temporary file that would accumulate results each "round" of ulimit -n count of files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
paste accumulator.txt "$@" > accumulator.txt.sav;
mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
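For clarity, here are the two pieces above assembled into one runnable sketch. One adjustment: the accumulator is named accumulator.out here so that the *.txt glob does not pick it up as an input:
touch accumulator.out
printf "%s\n" *.txt | sort -n | xargs -d '\n' -n "$(($(ulimit -n) - 1))" sh -c '
  paste accumulator.out "$@" > accumulator.out.sav
  mv accumulator.out.sav accumulator.out
' _
cat accumulator.out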
If your file names are sequentially numbered, you can list them explicitly in order and concatenate them into one large file, instead of relying on the wildcard * to enumerate all the files in the directory. The output order of a * expansion can differ between environments and may not be what you expect.
Below is a simple example:
$ for i in `seq 20`;do echo $i > $i.txt;done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt with number 1 to 20 in each file respectively
$ cat {1..20}.txt
# show content of all file in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them to a large file named 1_20.txt
In bash, or any other shell, glob expansions are done in lexicographical order. When files are numbered, this sadly means that 11.txt < 1.txt < 2.txt. This weird ordering comes from the fact that, lexicographically, 1 sorts before the dot character (".").
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%05d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (e.g. 1234.txt). So you can do
shopt -s extglob nullglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one occurrence of the pattern, and with nullglob any pattern that matches nothing simply disappears, so this will exclude the missing files but keep the order.
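For reference, a quick illustration of the {n..m} and {n..m..s} forms (the step form needs a reasonably recent bash; results shown as comments):
echo {3..7}.txt      # 3.txt 4.txt 5.txt 6.txt 7.txt
echo {1..10..3}      # 1 4 7 10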

How do I copy the beginning of multiple files in Linux?

I want to copy a bunch of files (*.txt) from one directory to another in Ubuntu. I want to reduce them in size, so I am using head to get the first 100 lines of each.
I want the new files to keep their original names but be in the subdirectory small/.
I have tried:
head -n 100 *.txt > small/*.txt
but this creates one file called *.txt.
I have also tried:
head -n 100 *.txt > small/
but this gives an "Is a directory" error.
It's got to be easy, right? But I am pretty bad at Linux.
Any help is much appreciated.
You'll have to create a loop instead:
for file in *.txt; do
head -n 100 "$file" > small/"$file"
done
This loops through all the .txt files, runs head -n 100 on each of them, and writes the result to a new file of the same name in the small/ directory.
Try
for f in *.txt; do
head -n 100 "$f" > "small/$f"
done
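Both loops assume the small/ directory already exists; if it might not, create it first. A combined sketch:
mkdir -p small          # make sure the target directory exists
for f in *.txt; do
    head -n 100 "$f" > "small/$f"
done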

recursively "normalize" filenames

I mean getting rid of special chars in filenames, etc.
I have made a script that can recursively rename files [http://pastebin.com/raw.php?i=kXeHbDQw]:
e.g.: before:
THIS i.s my file (1).txt
after running the script:
This-i-s-my-file-1.txt
But when I wanted to test it "fully", with filenames like these:
¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÂÃÄÅÆÇÈÊËÌÎÏÐÑÒÔÕ×ØÙUÛUÝÞßàâãäåæçèêëìîïðñòôõ÷øùûýþÿ.txt
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&'()*+,:;<=>?#[\]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£.txt
it fails [http://pastebin.com/raw.php?i=iu8Pwrnr]:
$ sh renamer.sh directorythathasthefiles
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£': No such file or directory
mv: cannot stat `./áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†....and so on
$
so "mv" can't handle special chars.. :\
i worked on it for many hours..
does anyone has a working one? [that can handle chars [filenames] in that 2 lines too?]
mv handles special characters just fine. Your script doesn't.
In no particular order:
You are using find to find all directories, and then running ls on each directory separately.
Why use for DEPTH in... if you can do exactly the same with one command?
find -maxdepth 100 -type d
Which makes the arbitrary depth limit unnecessary
find -type d
Don't ever parse the output of ls, especially if you can let find handle that, too
find -not -type d
Make sure it works in the worst possible case:
find -not -type d -print0 | while read -r -d '' FILENAME; do
This stops read from eating certain escapes and choking on filenames with new-line characters.
You are repeating the entire ls | replace cycle for every single character. Don't - it kills performance. Loop over all the files in each directory once, and just use multiple sed's, or multiple replacements in one sed command.
sed 's/á/a/g; s/í/i/g; ...'
(I was going to suggest sed 'y/áí/ai/', but unfortunately that doesn't seem to work with Unicode. Perhaps perl -CS -Mutf8 -pe 'y/áí/ai/' would.)
You're still thinking in ASCII: "other special chars - ASCII Codes 33.. ..255". Don't.
These days, most systems use Unicode in UTF-8 encoding, which has a much wider range of "special" characters - so big that listing them out one by one becomes pointless. (It is even multibyte - "e" is one byte, "ė" is three bytes.)
True ASCII has 128 characters. What you currently have in mind are the ISO 8859 character sets (sometimes called "ANSI") - in particular, ISO 8859-1. But they go all the way up to 8859-16, and only the "ASCII" part stays the same.
echo -n $(command) is rather useless.
There are much easier ways to find the directory and basename given a path. For example, you can do
directory=$(dirname "$path")
oldname=$(basename "$path")
# filter $oldname
mv "$path" "$directory/$newname"
Do not use egrep to check for errors. Check the program's return code. (Like you already do with cd.)
And instead of filtering out other errors, do...
if [[ -e $directory/$newname ]]; then
echo "target already exists, skipping: $oldname -> $newname"
continue
else
mv "$path" "$directory/$newname"
fi
The ton of sed 's/------------/-/g' calls can be changed to a single regexp:
sed -r 's/-{2,}/-/g'
The [ ]s in tr [foo] [bar] are unnecessary. They just cause tr to replace [ with [, and ] with ].
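A quick demonstration (the brackets simply map to themselves):
$ echo 'abc[]' | tr '[ab]' '[xy]'
xyc[]
$ echo 'abc[]' | tr 'ab' 'xy'
xyc[]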
Seriously?
echo "$FOLDERNAME" | sed "s/$/\//g"
How about this instead?
echo "$FOLDERNAME/"
And finally, use detox.
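If detox is available, a typical invocation might look like this (flags as documented in the detox man page; double-check on your system, and many versions also offer a dry-run option to preview the renames):
detox -v -r path/to/directory    # -r recurses into subdirectories, -v prints each rename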
Try something like:
find . -type f -print0 | awk 'BEGIN {RS="\x00"} { printf "%s\x00", $0; gsub("[^[:alnum:]]", "-"); printf "%s\0", $0 }' | xargs -0 -L 2 mv
Use of xargs(1) ensures that each filename is passed as exactly one parameter. awk(1) is used to emit the new filename right after the old one.
One more trick: sed -E -e 's/-+/-/g' (note the -E for extended regexps) will replace groups of more than one "-" with exactly one.
Assuming the rest of your script is right, your problem is that you are using read but you should use read -r. Notice how the backslash disappeared:
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&'()*+,:;<=>?#[\]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£.txt
áíüűúöőóéÁÍÜŰÚÖŐÓÉ!"#$%&\'()*+,:;<=>?#[]^_`{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£
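A minimal demonstration of the difference (the backslash before [ survives only with -r):
$ printf '%s\n' 'a\[b.txt' | { read name; echo "$name"; }
a[b.txt
$ printf '%s\n' 'a\[b.txt' | { read -r name; echo "$name"; }
a\[b.txt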
Ugh...
Some tips to clean up your script:
** Use sed to do translation on multiple characters at once, that'll clean things up and make it easier to manage:
dev:~$ echo 'áàaieeé!.txt' | sed -e 's/[áàã]/a/g; s/[éè]/e/g'
aaaieee!.txt
** rather than renaming the file for each change, run all your filters then do one move
$ NEWNAME='áàaieeé!.txt'
$ NEWNAME="$(echo "$NEWNAME" | sed -e 's/[áàã]/a/g; s/[éè]/e/g')"
$ NEWNAME="$(echo "$NEWNAME" | sed -e 's/aa*/a/g')"
$ echo $NEWNAME
aieee!.txt
** rather than doing a ls | read ... loop, use:
for OLDNAME in "$DIR"/*; do
blah
blah
blah
done
** separate out your path traversal and renaming logic into two scripts. One script finds the files which need to be renamed, one script handles the normalization of a single file. Once you learn the 'find' command, you'll realize you can toss the first script :)
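Putting several of these tips together, a minimal sketch (only the sample replacements from above are included; extend the sed list for your full character set):
#!/bin/bash
# sketch: walk the tree, filter each basename, rename only when it is safe to do so
find . -type f -print0 | while read -r -d '' path; do
    directory=$(dirname "$path")
    oldname=$(basename "$path")
    newname=$(printf '%s' "$oldname" |
        sed -e 's/[áàã]/a/g; s/[éè]/e/g' \
            -e 's/[^[:alnum:].]/-/g' \
            -e 's/-\{2,\}/-/g')
    [ "$newname" = "$oldname" ] && continue
    if [ -e "$directory/$newname" ]; then
        echo "target already exists, skipping: $oldname -> $newname"
    else
        mv "$path" "$directory/$newname"
    fi
done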
