Linux: Move files that have more than 100 commas in one line

I have 100 files in a specific directory, each containing several records with fields delimited by commas.
I need a Linux command that checks the lines in each file
and, if any line contains more than 100 commas, moves the file to another directory.
Is it possible?

Updated Answer
Although my original answer below is functional, Glenn's (@glennjackman) suggestion in the comments is far more concise, idiomatic, eloquent and preferable - as follows:
#!/bin/bash
mkdir subdir
for f in file*; do
awk -F, 'NF>100{exit 1}' "$f" || mv "$f" subdir
done
It relies on awk's exit status being 0 by default, and setting it to 1 only when it encounters a line that means the file needs moving.
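As a quick sanity check on a single file (the name sample.csv is just an illustration), you can inspect that exit status directly:
# exit status 0: no long line found, the file stays; exit status 1: the file should be moved
awk -F, 'NF>100{exit 1}' sample.csv && echo "stays" || echo "needs moving"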
Original Answer
This will tell you whether any line in a file has more than 100 fields:
awk -F, 'NF>100{print 1;exit} END{print 0}' someFile
It prints 1 and exits without parsing the rest of the file as soon as it finds a line with more than 100 fields, and prints 0 at the end if it never does. Note that awk's NF counts comma-separated fields, so a line with more than 100 commas has NF greater than 101; adjust the threshold to taste.
If you want to move them as well, use
#!/bin/bash
mkdir subdir
for f in file*; do
if [[ $(awk -F, 'NF>100{print 1;exit}END{print 0}' "$f") != "0" ]]; then
echo mv "$f" subdir
fi
done
Try this and see if it selects the correct files, and, if you like it, remove the word echo and run it again so it actually moves them. Back up first!

Related

bash/awk/unix detect changes in lines of csv files

I have a timestamp in this format:
(normal_file.csv)
timestamp
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
19/02/2002
The dates are usually uniform; however, there are files with an irregular date pattern, such as this example:
(abnormal_file.csv)
timestamp
19/02/2002
19/02/2003
19/02/2005
19/02/2006
In my directory, there are hundreds of files, a mix of normal and abnormal ones like the two above.
I want to write a bash or awk script that detects the date pattern in all files of a directory. The abnormal files should be moved automatically to a new, separate directory (let's say dir_different/).
Currently, I have tried the following:
#!/bin/bash
mkdir dir_different
for FILE in *.csv;
do
# pipe 1: detect the changes in the line
# pipe 2: print the timestamp column (first column, columns are comma-separated)
awk '$1 != prev {print ; prev = $1}' < $FILE | awk -F , '{print $1}'
done
If the timestamps in a given file are normal, then only a single timestamp will be printed; but for abnormal files, multiple dates will be printed.
I am not sure how to separate the abnormal files from the normal files, and I have tried the following:
do
output=$(awk 'FNR==3{print $0}' $FILE)
echo ${output}
if [[ ${output} =~ ([[:space:]]) ]]
then
mv $FILE dir_different/
fi
done
Or is there an easier method to detect changes in lines and separate files that have different lines? Thank you for any suggestions :)
Assuming that none of your "normal" CSV files have trailing newlines, this should do the separation just fine:
#!/bin/bash
mkdir -p dir_different
for FILE in *.csv;
do
if awk '{a[$1]++}END{if(length(a)<=2){exit 1}}' "$FILE" ; then
echo mv "$FILE" dir_different
fi
done
After a dry-run just get rid of the echo :)
Edit:
{a[$1]++} This bit creates an array a indexed by the first field of each line; the count for an index is incremented every time the same value is seen again.
END{if(length(a)<=2){exit 1}} This checks how many elements are in the array. If there are fewer than three (which should be the case if there's always the same date and we only get 1 header, 1 date), the processing exits with status 1.
"$FILE" is part of the bash script, not awk, and I quoted your variable out of habit; should you ever have files with spaces in their names, you'll see why :)
So, a "normal" file contains only two different lines:
timestamp
dd/mm/yyyy
Testing if a file is normal is thus as simple as:
[ $(sort -u file.csv | wc -l) -eq 2 ]
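With the two sample files from the question (and no stray blank lines) you would see, for example:
$ sort -u normal_file.csv | wc -l
2
$ sort -u abnormal_file.csv | wc -l
5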
This leads to the following possible solution:
#!/usr/bin/env bash
mkdir -p dir_different
for FILE in *.csv;
do
if [ $(sort -u "$FILE" | wc -l) -ne 2 ] ; then
echo mv "$FILE" dir_different
fi
done

List directories and their files grouping them on one line for tokenization

I want to group each directory name with its files in a bash script.
For example, if I type ls /home/maindir/*
I get home/maindir/dir1: file1 file2\n file3
home/maindir/dir2: file1 file2
The directory names and their files are not separated by a predictable delimiter, because there are cases where file1 and file2 in the same directory have a newline between them, so I want the directory name and its file list tokenized on one line with a known delimiter.
Example output with newline delimiter:
home/maindir/dir1: file1 file2 file3\n
home/maindir/dir2: file1 file2\n
home/maindir/dir3: file1 file2 file4\n
I originally used an unquoted interpolation trick.
For example, if you have strings in a file, one per line, and you want them joined onto one line, you don't have to use paste -
file named foo:
a
b
c
then you can say:
echo $(<foo)
and you get
a b c
But that could cause issues with filenames, especially if they have embedded special chars or whitespace.
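For example, word splitting collapses runs of whitespace (and glob characters would expand), which is usually not what you want when the words are filenames:
$ printf 'a  b\nc\n' > foo    # note the two spaces between a and b
$ echo $(<foo)
a b c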
Thanks to Gordon Davisson for a simple upgrade!
for d in /home/maindir/* # includes full path each time
do [[ -d "$d" ]] || continue # ignore nondirectories
cd "$d" # go there to make filenames path-bare
echo "$d:" *
done
Note that this still includes subdirectories. Do you need to skip those?
If you want to be more careful -
for d in /home/maindir/*
do [[ -d "$d" ]] || continue
cd "$d"
dir="$d: "
hit=0
for f in *
do if [[ -f "$f" ]]
then hit=1
dir="$dir $f "
fi
done
(( $hit )) && printf "$dir\n"
done
This one should also work on files with embedded spaces &c.

How to create dynamic headers on a text file using BASH script

I have 5 big text files in a directory, with millions of records delimited by pipes. All I want to do is, when I run the BASH script, it should create a header on the first line of each file, like this:
TCR1|A|B|C|D|E|F|# of records
The first word (TCR1) is the new name of the file and the last one is the number of records; both of them should change for each text file. So, when I run the script once, it should find the 5 text files in the directory and add the header as shown above. The output should look like this in each text file:
a.txt
TCR1|A|B|C|D|E|F|# of records in first text file
b.txt
TCR2|A|B|C|D|E|F|# of records in second text file
c.txt
TCR3|A|B|C|D|E|F|# of records in third text file
d.txt
TCR4|A|B|C|D|E|F|# of records in fourth text file
e.txt
TCR5|A|B|C|D|E|F|# of records in fifth text file
I think this is probably what you mean, though your question is very poorly posed:
#!/bin/bash
# Don't crash if no text files present and allow upper/lowercase "txt/TXT"
shopt -s nullglob nocaseglob
# Declare "lines" to be numeric, rather than string
declare -i lines
for f in *.txt; do
lines=$(wc -l < "$f")
echo "$f|A|B|C|D|E|F|$lines"
cat "$f"
done
I don't understand the TCR thing, but maybe this is what you want:
#!/bin/bash
# Declare "lines" to be numeric, rather than string
declare -i lines
for f in *.txt; do
lines=$(wc -l < "$f")
TCRthing="unknown"
[ "$f" == "a.txt" ] && TCRthing="TCR1"
[ "$f" == "b.txt" ] && TCRthing="TCR2"
[ "$f" == "c.txt" ] && TCRthing="TCR3"
[ "$f" == "d.txt" ] && TCRthing="TCR4"
[ "$f" == "e.txt" ] && TCRthing="TCR5"
echo "$TCRthing|A|B|C|D|E|F|$lines"
cat "$f"
done
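Both loops above write the header plus the file contents to standard output rather than modifying the files. If you actually want the header inserted as the first line of each file, a minimal sketch using a temporary file could look like this (the TCR numbering simply follows glob order, which is an assumption):
#!/bin/bash
n=0
for f in *.txt; do
n=$((n+1))                            # TCR number follows glob (alphabetical) order
lines=$(wc -l < "$f")                 # record count of the original file
{ echo "TCR${n}|A|B|C|D|E|F|$lines"; cat "$f"; } > "$f.tmp" && mv "$f.tmp" "$f"
done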
Note that there are simpler, more idiomatic ways of doing this. For example, you could just run:
more *.txt
and then press Ctrl+G to get a status line showing which file you are viewing, where you have reached, and how many lines each file has. You can also press :n to move to the next file and :p to move to the previous file, and 1G to go back to the top of the current file and G to go to the bottom of the current file.

Simple sed substitution

I have a text file with a list of files with the structure ABC123456A or ABC123456AA. What I would like to do is check whether the file ABC123456ZZP also exists, i.e. I want to substitute the letter(s) after ABC123456 with ZZP.
Can I do this using sed?
Like this?
X=ABC123456 ; echo ABC123456AA | sed -e "s,\(${X}\).*,\1ZZP,"
You could use sed as wilx suggests but I think a better option would be bash.
while read file; do
base=${file:0:9}
[[ -f ${base}ZZP ]] && echo "${base}ZZP exists!"
done < file
This will loop over each line in file;
base is set to the first 9 characters of the line (read strips leading and trailing whitespace);
then it checks whether a file named base followed by ZZP exists, and prints a message if it does.
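For reference, the ${file:0:9} expansion simply takes a fixed-width prefix:
$ file="ABC123456AA"
$ echo "${file:0:9}"
ABC123456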
Look:
$ str="ABC123456AA"
$ echo "${str%[[:alpha:]][[:alpha:]]*}"
ABC123456
so do this:
while IFS= read -r tgt; do
tgt="${tgt%[[:alpha:]][[:alpha:]]*}ZZP"
[[ -f "$tgt" ]] && printf "%s exists!\n" "$tgt"
done < file
It will still fail for file names that contain newlines (let us know if you have that situation), but unlike the other posted solutions it will work for file names with other than 9 key characters, and for file names containing spaces, commas, backslashes, globbing characters, etc., and it is efficient.
Since you have now said that you only need the first 9 characters of each line and you were happy with piping every line to sed, here's another solution you might like:
cut -c1-9 file |
while IFS= read -r tgt; do
[[ -f "${tgt}ZZP" ]] && printf "%sZZP exists!\n" "$tgt"
done
It'd be MUCH more efficient and more robust than the sed solution, and similar in both respects to the other shell solutions.

Linux shell script to add leading zeros to file names

I have a folder with about 1,700 files. They are all named like 1.txt or 1497.txt, etc. I would like to rename all the files so that all the filenames are four digits long.
I.e., 23.txt becomes 0023.txt.
What is a shell script that will do this? Or a related question: How do I use grep to only match lines that contain \d.txt (i.e., one digit, then a period, then the letters txt)?
Here's what I have so far:
for a in [command i need help with]
do
mv $a 000$a
done
Basically, run that three times, with commands there to find one-digit, two-digit, and three-digit filenames (with the number of initial zeros changed).
Try:
for a in [0-9]*.txt; do
mv $a `printf %04d.%s ${a%.*} ${a##*.}`
done
Change the filename pattern ([0-9]*.txt) as necessary.
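For example, assuming the directory contains 1.txt, 23.txt and 456.txt, the loop above would leave:
$ ls
1.txt  23.txt  456.txt
$ for a in [0-9]*.txt; do mv $a `printf %04d.%s ${a%.*} ${a##*.}`; done
$ ls
0001.txt  0023.txt  0456.txt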
A general-purpose enumerated rename that makes no assumptions about the initial set of filenames:
X=1;
for i in *.txt; do
mv "$i" "$(printf %04d.%s "$X" "${i##*.}")"
let X="$X+1"
done
On the same topic:
Bash script to pad file names
Extract filename and extension in bash
Using the rename (prename in some cases) script that is sometimes installed with Perl, you can use Perl expressions to do the renaming. The script skips renaming if there's a name collision.
The command below renames only files that have four or fewer digits followed by a ".txt" extension. It does not rename files that do not strictly conform to that pattern. It does not truncate names that consist of more than four digits.
rename 'unless (/0+[0-9]{4}.txt/) {s/^([0-9]{1,3}\.txt)$/000$1/g;s/0*([0-9]{4}\..*)/$1/}' *
A few examples:
Original     Becomes
1.txt        0001.txt
02.txt       0002.txt
123.txt      0123.txt
00000.txt    00000.txt
1.23.txt     1.23.txt
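Before committing, you can preview the renames with the Perl rename script's dry-run option (-n):
rename -n 'unless (/0+[0-9]{4}.txt/) {s/^([0-9]{1,3}\.txt)$/000$1/g;s/0*([0-9]{4}\..*)/$1/}' *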
Other answers given so far will attempt to rename files that don't conform to the pattern, produce errors for filenames that contain non-digit characters, perform renames that produce name collisions, or try and fail to rename files that have spaces in their names, among other possible problems.
for a in *.txt; do
b=$(printf %04d.txt ${a%.txt})
if [ $a != $b ]; then
mv $a $b
fi
done
One-liner:
ls | awk '/^([0-9]+)\.txt$/ { printf("%s %04d.txt\n", $0, $1) }' | xargs -n2 mv
How do I use grep to only match lines that contain \d.txt (IE 1 digit, then a period, then the letters txt)?
grep -E '^[0-9]\.txt$'
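For example, with hypothetical files 1.txt, 23.txt, 456.txt and 1497.txt in the directory:
$ ls
1.txt  23.txt  456.txt  1497.txt
$ ls | grep -E '^[0-9]\.txt$'
1.txt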
Let's assume you have files with the extension .dat in your folder. Just copy this code to a file named run.sh, make it executable by running chmod +x run.sh, and then execute it using ./run.sh:
#!/bin/bash
num=0
for i in *.dat
do
a=$(printf "%05d" "$num")
mv "$i" "filename_$a.dat"
num=$((num + 1))
done
This will convert all files in your folder to filename_00000.dat, filename_00001.dat, etc.
This version also handles strings before (and after) the number. Basically you can do any regex matching plus printf, as long as your awk supports it (the three-argument match() used below is a GNU awk extension). It supports whitespace characters (except newlines) in filenames too.
for f in *.txt ;do
mv "$f" "$(
awk -v f="$f" '{
if ( match(f, /^([a-zA-Z_-]*)([0-9]+)(\..+)/, a)) {
printf("%s%04d%s", a[1], a[2], a[3])
} else {
print(f)
}
}' <<<''
)"
done
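For example, with hypothetical files log12.txt and report7.csv.txt (and GNU awk, which provides the three-argument match()), the loop above would rename them like this:
$ ls
log12.txt  report7.csv.txt
$ # run the loop above
$ ls
log0012.txt  report0007.csv.txt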
To only match single-digit text files, anchor the pattern:
$ ls | grep '^[0-9]\.txt$'
One-liner hint (finds the next unused name of the form result/resultNNN.txt):
a=0; while [ -f ./result/result`printf "%03d" $a`.txt ]; do a=$((a+1)); done
RESULT=result/result`printf "%03d" $a`.txt
To provide a solution that's cautiously written to be correct even in the presence of filenames with spaces:
#!/usr/bin/env bash
pattern='%04d%s' # pad the number to four digits, then append the suffix: change the %04d to taste
# enable extglob syntax: +([[:digit:]]) means "one or more digits"
# enable the nullglob flag: If no matches exist, a glob returns nothing (not itself).
shopt -s extglob nullglob
for f in [[:digit:]]*; do # iterate over filenames that start with digits
suffix=${f##+([[:digit:]])} # find the suffix (everything after the leading run of digits)
number=${f%"$suffix"} # find the number (everything before the suffix)
printf -v new "$pattern" "$number" "$suffix" # pad the number, then append the suffix
if [[ $f != "$new" ]]; then # if the result differs from the old name
mv -- "$f" "$new" # ...then rename the file.
fi
done
There is a rename.ul command from the util-linux package, installed by default (at least in Ubuntu).
Its usage is (see man rename.ul):
rename [options] expression replacement file...
The command will replace the first occurrence of expression with the given replacement for the provided files.
While forming the command you can use:
rename.ul -nv replace-me with-this in-all?-these-files*
to make no changes but see what changes the command would make. When you are sure, just re-execute the command without the -v (verbose) and -n (no-act) options.
For your case the commands are:
rename.ul "" 000 ?.txt
rename.ul "" 00 ??.txt
rename.ul "" 0 ???.txt
