How do I loop over multiple files to extract specific columns and save as separate files? - linux

I have numerous *.txt files. I want to extract columns 3 and 5 from each of these files and save them as new files, keeping their original names with a new_ prefix. I have the bash loop below, but it doesn't do what I want. Can someone please help me with this?
for i in *.txt; do
cut -f 3,5 $i > /media/owner/new_$i_assembly.txt
done

Simple approach:
for f in *.txt; do
cut -d$'\t' -f3,5 "$f" > "/media/owner/new_${f}_assembly.txt"
done
In case the fields could be separated by whitespace other than tabs, you may use the following awk approach:
for f in *.txt; do
awk '{ print $3,$5 }' OFS='\t' "$f" > "/media/owner/new_${f}_assembly.txt"
done

You have to tell Bash explicitly where the variable name ends, otherwise it picks up the characters that follow and tries to expand a variable named $i_assembly instead:
for i in *.txt; do
cut -f 3,5 "$i" > "/media/owner/new_${i}_assembly.txt"
done
If you don't want the extension included in your new name, use the parameter expansion ${i%.*}, which removes everything from the last . (inclusive) to the end of the string.
for i in *.txt; do
cut -f 3,5 "$i" > "/media/owner/new_${i%.*}_assembly.txt"
done
If you decide on a different approach that might produce paths, not just filenames (for example: **/*.txt), you can use parameter expansion once again to get only the name of your file:
for i in **/*.txt; do
base=${i##*/}
base=${base%.*}
cut -f 3,5 "$i" > "/media/owner/new_${base}_assembly.txt"
done
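The two expansions can be checked quickly on a hypothetical path (subdir/data.txt is only an illustration):

```shell
# hypothetical path, for illustration only
i='subdir/data.txt'
base=${i##*/}      # strip everything up to the last slash -> data.txt
echo "$base"
echo "${base%.*}"  # strip from the last dot onward -> data
```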
Also note that TAB is the default delimiter for cut, so you don't need to specify it with the -d option:
-d, --delimiter=DELIM
use DELIM instead of TAB for field delimiter
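A quick sanity check of that default: feed cut a tab-separated line and select fields 3 and 5 with no -d option at all.

```shell
# cut splits on TAB by default, so no -d option is needed
printf 'a\tb\tc\td\te\n' | cut -f3,5
# prints: c and e, tab-separated
```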

Related

How to sort and print array listing of specific file type in shell

I am trying to write a loop with which I want to extract text file names in all sub-directories and append certain strings to them. Additionally, I want the file names sorted by the number after ^.
For example, I have three sub directories mydir1, mydir2, mydir3. I have,
in mydir1,
file223^1.txt
file221^2.txt
file666^3.txt
in mydir2,
file111^1.txt
file4^2.txt
In mydir3,
file1^4.txt
file5^5.txt
The expected result final.csv:
STRINGmydir1file223^1
STRINGmydir1file221^2
STRINGmydir1file666^3
STRINGmydir2file111^1
STRINGmydir2file4^2
STRINGmydir3file1^4
STRINGmydir3file5^5
This is the code I tried:
for dir in my*/; do
array=(${dir}/*.txt)
IFS=$'\n' RGBASE=($(sort <<<"${array[#]}"));
for RG in ${RGBASE[#]}; do
RGTAG=$(basename ${RG/.txt//})
echo "STRING${dir}${RGTAG}" >> final.csv
done
done
Can someone please explain what is wrong with my code? Also, there could be other better ways to do this, but I want to use the for-loop.
The output with this code:
$ cat final.csv
STRINGdir1file666^3.txt
STRINGdir2file4^2.txt
STRINGdir3file5^5.txt
As a starting point which works for your special case, I have a two-liner for this.
mapfile -t array < <( find my* -name "*.txt" -printf "STRING^^%H^^%f\n" | cut -d"." -f1 | LANG=C sort -t"^" -k3,3 -k6 )
printf "%s\n" "${array[@]//^^/}"
To restrict the directory depth, you can add -maxdepth with the number of subdirectories to search. The find command can also use a regex in the search, which is applied to the whole path; that can be used to work on a more complex directory tree.
The difficulty was sorting on two positions with a delimiter.
My idea was to add a delimiter which can easily be removed afterwards.
The sort command can only handle one delimiter, so I had to use the double hat as a delimiter, which can be removed without removing the single hat in the filenames.
A solution using decorate-sort-undecorate idiom could be:
printf "%s\n" my*/*.txt |
sed -E 's_(.*)/(.*)\^([0-9]+).*_\1\t\3\tSTRING\1\2^\3_' |
sort -t$'\t' -k1,1 -k2,2n |
cut -f3
assuming filenames don't contain tab or newline characters.
A basic explanation: The printf prints each pathname on a separate line. The sed converts the pathname dir/file^number.txt into dir\tnumber\tSTRINGdirfile^number (\t represents a tab character). The aim is to use the tab character as a field separator in the sort command. The sort sorts the lines by the first (lexicographically) and second fields (numerically). The cut discards the first and second fields; the remaining field is what we want.
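To see the intermediate "decorated" form, you can run the sed step on a single hypothetical pathname from the question (note that \t as a tab in the replacement is a GNU sed extension):

```shell
# dir/file^number.txt -> dir<TAB>number<TAB>STRINGdirfile^number
echo 'mydir1/file223^1.txt' |
sed -E 's_(.*)/(.*)\^([0-9]+).*_\1\t\3\tSTRING\1\2^\3_'
# prints (tabs between fields): mydir1  1  STRINGmydir1file223^1
```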

Renaming files in bash script using mv

I recovered many MOV files from a faulty hard drive however software named these files as per block location:
263505816.mov etc...
I wrote a small script that uses the mediainfo application so I can read the date and time created and rename the files accordingly:
for f in *.mov
do
MODIFIED=$(mediainfo -f $f |grep -m 1 "Encoded date" |sort -u |awk -F "UTC " '{print $2}')
DATECREATED=$(echo $MODIFIED |cut -d' ' -f 1)
TIMECREATED=$(echo $MODIFIED |cut -d' ' -f 2 |tr -s ':' '-')
mv $f "$DATECREATED $TIMECREATED.mov"
done
This works fine, but when I modify the mv statement by adding two words at the end:
mv $f "$DATECREATED $TIMECREATED Holidays 2011.mov"
I get the following:
mv: target ‘ Holidays 2011.mov’ is not a directory
I know I have to mark the white spaces in some way, because mv is misled into thinking the last argument is a directory. Other articles do not mention using multiple variables in conjunction with mv, which is why I'm asking for guidance.
Many thanks,
I think it worked the first time because *.mov expanded only to numeric files like the one in your example, but after the first execution of your script, it renamed the files introducing spaces in the form $DATECREATED $TIMECREATED.mov. Basically, the second time around, mv is attempting to move $DATECREATED and $TIMECREATED.mov to your target file, and since it's not a directory, it fails.
You can solve this by quoting $f. Try it like this:
mv "$f" "$DATECREATED $TIMECREATED Holidays 2011.mov"
In fact, it's recommended that you always quote variables unless you're sure they won't contain special characters.
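A small sketch of why the quotes matter: a helper function that just counts its arguments stands in for mv here, and the filename with spaces is hypothetical.

```shell
# count the arguments a command receives (a stand-in for mv)
nargs() { echo $#; }

f='2011-06-01 12-30-00.mov'  # hypothetical renamed file containing spaces
nargs $f     # unquoted: word splitting produces 2 arguments
nargs "$f"   # quoted: a single argument, as mv expects
```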
Before the $DATECREATED and after the .mov you have some characters that look like double quotes but are not; my guess is they are typographic quotes. Try replacing them with normal double quotes ". This is in addition to the suggestion to also quote "$f" (with the right double quotes ;-)

Changing the file names and copying into different directory

I have about 1,000 files. I want to rename these files by cutting a few characters out of each file name and copying them to some other directory.
Ex: Original file name.
vfcon062562~19.xml
vfcon058794~29.xml
vfcon072009~3.xml
vfcon071992~10.xml
vfcon071986~2.xml
vfcon071339~4.xml
vfcon069979~43.xml
The required output cuts the ~ and the following characters.
O/P Ex:
vfcon058794.xml
vfcon062562.xml
vfcon069979.xml
vfcon071339.xml
vfcon071986.xml
vfcon071992.xml
vfcon072009.xml
But I want to place them in a different directory.
If you are using bash or similar you can use the following simple loop:
for input in vfcon*xml
do
mv "$input" "targetDir/$(echo "$input" | awk -F~ '{print $1".xml"}')"
done
Or in a single line:
for input in vfcon*xml; do mv "$input" "targetDir/$(echo "$input" | awk -F~ '{print $1".xml"}')"; done
This uses awk to separate everything before ~ using it as a field separator and printing the first column and appending ".xml" to create the output file name. All this is prepended with the targetDir which can be a full path.
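The awk step can be tried in isolation on one of the names from the question:

```shell
# everything before the ~ becomes field 1; append ".xml" to rebuild the name
echo 'vfcon062562~19.xml' | awk -F~ '{print $1".xml"}'
# prints: vfcon062562.xml
```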
If you are using csh / tcsh then the syntax of the loop will be slightly different but the commands will be the same.
I like to make sure that my data set is correct prior to changing anything so I would put that into a variable first and then check over it.
files=$(ls vfcon*xml)
echo $files | less
Then, like @Stefan said, use a loop:
for i in $files
do
mv "$i" "$(echo "$i" | sed 's/~[0-9]*//')"
done
If you need help with bash you can use http://www.shellcheck.net/
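For the sed variant, a pattern like ~[0-9]* removes the tilde and every following digit, so multi-digit suffixes such as ~19 are handled too; a quick check on one name:

```shell
# strip the tilde and all digits after it, leaving the .xml extension intact
echo 'vfcon062562~19.xml' | sed 's/~[0-9]*//'
# prints: vfcon062562.xml
```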

greping and replacing 2 file paths

I am currently working on a project where I would like to change a picture in multiple places.
old file dir: /images/icons/helpPop.png
new file dir: /public/website_pngs/icons-buttons/button_question_mark.png
I want to try
grep -rl 'images/icons/helpPop' . | xargs sed -i 's/images/icons/helpPop/public/website_pngs/icons-buttons/button_question_mark/g'
But I know that will not work. I am looking into delimiters and would like some extra advice please.
some suggestions:
if the old/new dirs are fixed, try to save them into variables
use a different delimiter in sed, e.g. "s#$old#$new#g" or "s|$old|$new|g" (double quotes!!)
to be safe, you'd better escape the . (period/dot) in your old dir text, since it matches any char in a regex; it could be a dangerous operation with plain text like that. (fortunately you don't have .* in your path ^_^)
Or do you want the working code?
EDIT for comment:
you don't need to "set up" the delimiter; just use it, the same way you would use /. Take a look at this example:
kent$ echo "/////"|sed 's#/#?#g'
?????
kent$ echo "/////"|sed 's^/^?^g'
?????
kent$ echo "/////"|sed 's%/%?%g'
?????
kent$ echo "/////"|sed 's;/;?;g'
?????
for i in $(grep -rl 'images/icons/helpPop' .); do cp "$i" "$i.bak"; sed 's/images\/icons\/helpPop/public\/website_pngs\/icons-buttons\/button_question_mark/g' < "$i.bak" > "$i"; done

Linux shell script to add leading zeros to file names

I have a folder with about 1,700 files. They are all named like 1.txt or 1497.txt, etc. I would like to rename all the files so that all the filenames are four digits long.
I.e., 23.txt becomes 0023.txt.
What is a shell script that will do this? Or a related question: How do I use grep to only match lines that contain \d.txt (i.e., one digit, then a period, then the letters txt)?
Here's what I have so far:
for a in [command i need help with]
do
mv $a 000$a
done
Basically, run that three times, with commands there to find one digit, two digits, and three digit filenames (with the number of initial zeros changed).
Try:
for a in [0-9]*.txt; do
mv "$a" "$(printf %04d.%s "${a%.*}" "${a##*.}")"
done
Change the filename pattern ([0-9]*.txt) as necessary.
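The core of the trick is printf's zero-padded %04d format; for a hypothetical name like 23.txt:

```shell
a='23.txt'                                # hypothetical input name
printf '%04d.%s\n' "${a%.*}" "${a##*.}"   # pad the stem, keep the extension
# prints: 0023.txt
```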
A general-purpose enumerated rename that makes no assumptions about the initial set of filenames:
X=1
for i in *.txt; do
mv "$i" "$(printf %04d.%s "$X" "${i##*.}")"
X=$((X+1))
done
On the same topic:
Bash script to pad file names
Extract filename and extension in bash
Using the rename (prename in some cases) script that is sometimes installed with Perl, you can use Perl expressions to do the renaming. The script skips renaming if there's a name collision.
The command below renames only files that have four or fewer digits followed by a ".txt" extension. It does not rename files that do not strictly conform to that pattern. It does not truncate names that consist of more than four digits.
rename 'unless (/0+[0-9]{4}.txt/) {s/^([0-9]{1,3}\.txt)$/000$1/g;s/0*([0-9]{4}\..*)/$1/}' *
A few examples:
Original    Becomes
1.txt       0001.txt
02.txt      0002.txt
123.txt     0123.txt
00000.txt   00000.txt
1.23.txt    1.23.txt
Other answers given so far will attempt to rename files that don't conform to the pattern, produce errors for filenames that contain non-digit characters, perform renames that produce name collisions, try and fail to rename files that have spaces in their names, and possibly cause other problems.
for a in *.txt; do
b=$(printf %04d.txt "${a%.txt}")
if [ "$a" != "$b" ]; then
mv -- "$a" "$b"
fi
done
One-liner:
ls | awk '/^([0-9]+)\.txt$/ { printf("%s %04d.txt\n", $0, $1) }' | xargs -n2 mv
How do I use grep to only match lines that contain \d.txt (IE 1 digit, then a period, then the letters txt)?
grep -E '^[0-9]\.txt$'
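The anchors ^ and $ are what restrict the match to exactly one digit; with a few sample names:

```shell
# only the single-digit name survives; 23.txt and 1497.txt are rejected
printf '%s\n' 1.txt 23.txt 1497.txt | grep -E '^[0-9]\.txt$'
# prints: 1.txt
```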
Let's assume you have files of type .dat in your folder. Just copy this code to a file named run.sh, make it executable by running chmod +x run.sh, and then execute it using ./run.sh:
#!/bin/bash
num=0
for i in *.dat
do
a=$(printf "%05d" "$num")
mv "$i" "filename_$a.dat"
num=$((num + 1))
done
This will convert all files in your folder to filename_00000.dat, filename_00001.dat, etc.
This version also supports handling strings before (and after) the number. Basically you can do any regex matching plus printf, as long as your awk supports it. It supports whitespace characters (except newlines) in filenames too.
for f in *.txt ;do
mv "$f" "$(
awk -v f="$f" '{
if ( match(f, /^([a-zA-Z_-]*)([0-9]+)(\..+)/, a)) {
printf("%s%04d%s", a[1], a[2], a[3])
} else {
print(f)
}
}' <<<''
)"
done
To only match single-digit text files, you can do...
$ ls | grep '^[0-9]\.txt$'
One-liner hint:
a=0; while [ -f "./result/result$(printf "%03d" $a).txt" ]; do a=$((a+1)); done
RESULT="result/result$(printf "%03d" $a).txt"
To provide a solution that's cautiously written to be correct even in the presence of filenames with spaces:
#!/usr/bin/env bash
pattern='%04d' # pad with four digits: change this to taste
# enable extglob syntax: +([[:digit:]]) means "one or more digits"
# enable the nullglob flag: If no matches exist, a glob returns nothing (not itself).
shopt -s extglob nullglob
for f in [[:digit:]]*; do # iterate over filenames that start with digits
suffix=${f##+([[:digit:]])} # find the suffix (everything after the last digit)
number=${f%"$suffix"} # find the number (everything before the suffix)
printf -v new "$pattern" "$number" "$suffix" # pad the number, then append the suffix
if [[ $f != "$new" ]]; then # if the result differs from the old name
mv -- "$f" "$new" # ...then rename the file.
fi
done
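The two expansions at the heart of that script can be traced on a hypothetical name with a space in it (requires bash with extglob enabled):

```shell
#!/usr/bin/env bash
shopt -s extglob                 # +([[:digit:]]) needs extglob

f='23 backup.txt'                # hypothetical filename
suffix=${f##+([[:digit:]])}      # everything after the leading digits: ' backup.txt'
number=${f%"$suffix"}            # the leading digits: '23'
printf '%04d%s\n' "$number" "$suffix"
# prints: 0023 backup.txt
```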
There is a rename.ul command from the util-linux package, installed by default (at least in Ubuntu).
Its usage is (see man rename.ul):
rename [options] expression replacement file...
The command will replace the first occurrence of expression with the given replacement for the provided files.
While forming the command you can use:
rename.ul -nv replace-me with-this in-all?-these-files*
to make no changes, but print the changes the command would make. When you're sure, just re-execute the command without the -n (no-act) and -v (verbose) options.
For your case the commands are:
rename.ul "" 000 ?.txt
rename.ul "" 00 ??.txt
rename.ul "" 0 ???.txt
