This is my script:
SourceFile='/root/Document/Source/'
FND=$(find $SourceFile. -regextype posix-regex -iregex "^.*/ABCDEF_555_[0-9]{5}\.txt$")
echo $FND
I've tried using awk but haven't gotten perfect results.
File Name:
ABCDEF_555_12345.txt
ABCDEF_555_54321.txt
ABCDEF_555_11223.txt
BEFORE
File Content from ABCDEF_555_12345.txt:
no|name|address|pos_code
1|rick|ABC|12342
2|rock|ABC|12342
3|Robert|DEF|54321
File Content from ABCDEF_555_54321.txt:
no|id|name|city
1|0101|RIZKI|JKT
2|0102|LALA|SMG
3|0302|ROY|YGY
I want to append a column that shows the file name to every row starting from the 2nd, append a column header name_file to the first row, and I want to change the contents of the original files.
AFTER
file: ABCDEF_555_12345.txt
no|name|address|pos_code|name_file
1|rick|ABC|12342|ABCDEF_555_12345.txt
2|rock|ABC|12342|ABCDEF_555_12345.txt
3|Robert|DEF|54321|ABCDEF_555_12345.txt
file: ABCDEF_555_54321.txt
no|id|name|city|name_file
1|0101|RIZKI|JKT|ABCDEF_555_54321.txt
2|0102|LALA|SMG|ABCDEF_555_54321.txt
3|0302|ROY|YGY|ABCDEF_555_54321.txt
Please shed some light on a solution :))
Thanks :))
The best solution is to use awk.
If it's the first line (NR == 1), print the line and append |name_file.
For all other lines print the line and append the filename using the FILENAME variable:
awk 'NR == 1 {print $0 "|name_file"; next;}{print $0 "|" FILENAME;}' foo.txt
You can also use it with multiple files:
find . -iname "*.txt" -print0 | xargs -0 awk '
NR == 1 {print $0 "|name_file"; next;}
FNR == 1 {next;} # Skip headers of subsequent files
{print $0 "|" FILENAME;}'
My first solution used the paste command.
Paste allows you to concatenate files horizontally (compared to cat which concatenates vertically).
To achieve this with paste:
first, concatenate the first line of your file (head -n1 foo.txt) with the column header (echo "name_file"). The paste command accepts the -d flag to define the separator between columns.
second, extract all lines except the first (tail -n+2 foo.txt) and concatenate them with as many repetitions of foo.txt as required (using a for loop that computes the number of lines to fill).
The solution looks like this:
paste -d'|' <(head -n1 foo.txt) <(echo "name_file")
paste -d'|' <(tail -n+2 foo.txt) <(for i in $(seq $(tail -n+2 foo.txt | wc -l)); do echo "foo.txt"; done)
no|name|address|pos_code|name_file
1|rick|ABC|12342|foo.txt
2|rock|ABC|12342|foo.txt
3|Robert|DEF|54321|foo.txt
However, the awk solution should be preferred because it is clearer (a single call, fewer process substitutions, and so on) and faster.
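For reference, the awk.sh and paste.sh scripts timed below are not shown here; presumably they just wrap the commands above, something like:
# awk.sh (assumed contents)
awk 'NR == 1 {print $0 "|name_file"; next;}{print $0 "|" FILENAME;}' foo.txt
# paste.sh (assumed contents)
paste -d'|' <(head -n1 foo.txt) <(echo "name_file")
paste -d'|' <(tail -n+2 foo.txt) <(for i in $(seq $(tail -n+2 foo.txt | wc -l)); do echo "foo.txt"; done)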
$ wc -l foo.txt
100004 foo.txt
$ time ./awk.sh >/dev/null
./awk.sh > /dev/null 0,03s user 0,01s system 98% cpu 0,041 total
$ time ./paste.sh >/dev/null
./paste.sh > /dev/null 0,38s user 0,33s system 154% cpu 0,459 total
Using find and GNU awk:
My find implementation doesn't have -regextype posix-regex and I used posix-extended instead, but since you got the correct results with yours, it should be fine.
srcdir='/root/Document/Source/'
find "$srcdir" -regextype posix-regex -iregex ".*/ABCDEF_555_[0-9]{5}\.txt$"\
-exec awk -i inplace -v fname="{}" '
BEGIN{ OFS=FS="|"; sub(/.*\//, "", fname) } # set field separators / extract filename
{ $(NF+1)=NR==1 ? "name_file" : fname; print } # add header field / filename, print line
' {} \;
The pathname found by find is passed to awk in variable fname. In the BEGIN block the filename is extracted from the path.
The files are modified "in place", so make sure you make a backup of your files before running this.
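If you would rather let gawk keep the backups itself, the inplace extension supports a backup suffix; the variable name depends on the gawk version (INPLACE_SUFFIX in gawk 4.1, inplace::suffix in gawk 5.0+), so treat this as a sketch:
find "$srcdir" -regextype posix-regex -iregex ".*/ABCDEF_555_[0-9]{5}\.txt$" \
 -exec awk -i inplace -v INPLACE_SUFFIX=.bak -v fname="{}" '
 BEGIN{ OFS=FS="|"; sub(/.*\//, "", fname) }
 { $(NF+1)=NR==1 ? "name_file" : fname; print }
 ' {} \;   # each original is kept as <file>.bak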
Related
I have a directory with many text files. I want to search for a given string in specific lines of the files (like searching for 'abc' in only the 2nd and 3rd line of each file). Then, when I find a match, I want to print line 1 of the matching file.
My approach: I do a grep search with the -n option, store the output in a different file, and then search that file for the line number. Then I try to get the file name and print its first line.
Using this approach I'm not able to get the name of the right file, and even if I could, the approach is very lengthy.
Is there a better and faster solution to this?
Eg.
1.txt
file 1
one
two
2.txt
file 2
two
three
I want to search for "two" in line 2 of each file using grep and then print the first line of the file with the match. In this example that would be 2.txt, and the output should be "file 2".
I know it is easier using sed/awk but is there any way to do this using grep?
Use sed instead (GNU sed):
parse.sed
1h # Save the first line to hold space
2,3 { # On lines 2 and 3
/my pattern/ { # Match `my pattern`
x # If there is a match bring back the first line
p # and print it
:a; n; ba # Loop to the end of the file
}
}
Run it like this:
sed -snf parse.sed file1 file2 ...
Or as a one-liner:
sed -sn '1h; 2,3 { /my pattern/ { x; p; :a; n; ba; } }' file1 file2 ...
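For the question's exact case (the pattern two on line 2 only), the address shrinks to a single line and the end-of-file loop is not needed; against the example 1.txt and 2.txt this prints file 2:
sed -sn '1h; 2 { /two/ { x; p; } }' 1.txt 2.txt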
You might want to emit the filename as well, e.g. with your example data:
parse2.sed
1h # Save the first line to hold space
2,3 { # On lines 2 and 3
/two/ { # Match `two`
F # Output the filename of the file currently being processed
x # If there is a match bring back the first line
p # and print it
:a; n; ba # Loop to the end of the file
}
}
Run it like this:
sed -snf parse2.sed file1 file2 | paste -d: - -
Output:
file1:file 1
file2:file 2
$ awk 'FNR==2{if(/one/) print line; nextfile} FNR==1{line=$0}' 1.txt 2.txt
file 1
$ awk 'FNR==2{if(/two/) print line; nextfile} FNR==1{line=$0}' 1.txt 2.txt
file 2
FNR holds the line number within the current file being read
use FNR>=2 && FNR<=3 if you need a range of lines (see the sketch after this list)
FNR==1{line=$0} saves the contents of the first line for later use
nextfile should be supported by most implementations, but the solution will still work (more slowly) if you need to remove it
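A sketch of the range variant mentioned above, run against the question's files; note that it prints both first lines here, since 1.txt has "two" on line 3:
$ awk 'FNR>=2 && FNR<=3{if(/two/){print line; nextfile}} FNR==1{line=$0}' 1.txt 2.txt
file 1
file 2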
With grep and bash:
# Grep for a pattern and print filename and line number
grep -Hn one file[12] |
# Loop over matches where f=filename, n=match-line-number and s=matched-line
while IFS=: read -r f n s; do
# If match was on line 2 or line 3
# print the first line of the file
(( n == 2 || n == 3 )) && head -n1 "$f"
done
Output:
file 1
Only using grep, cut and | (pipe):
grep -rnw pattern dir | grep ":line_num:" | cut -d':' -f 1
Explanation
grep -rnw pattern dir
It returns the name of the file(s) where the pattern was found, along with the line number.
Its output will be something like this:
path/to/file/file1(.txt):8:some pattern 1
path/to/file/file2(.txt):4:some pattern 2
path/to/file/file3(.txt):2:some pattern 3
Now I'm using another grep to get the file with the right line number (e.g., the file that contains the pattern on line 2):
grep -rnw pattern dir | grep ":2:"
Its output will be:
path/to/file/file3(.txt):2:some pattern 3
Now I'm using cut to get the filename
grep -rnw pattern dir | grep ":2:" | cut -d':' -f 1
It will output the file name like this
path/to/file/file3(.txt)
P.S. If you want to remove the "path/to/file/" prefix from the filename, you can use rev, then cut, then rev again; you can try this yourself or see the code below.
grep -rnw pattern dir | grep ":2:" | cut -d':' -f 1 | rev | cut -d'/' -f 1 | rev
I use the command wc -l to count the number of lines in my text files (I also want to sort everything through a pipe), like this:
wc -l $directory-path/*.txt | sort -rn
The output includes a "total" line, which is the sum of the lines of all the files:
10 total
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
Is there any way to suppress this summary line? Or, even better, to change its wording? For example, instead of "10" the word "lines", and instead of "total" the word "file".
Yet another sed solution!
1. Short and quick
As the total comes on the last line, $d is the sed command for deleting the last line.
wc -l $directory-path/*.txt | sed '$d'
2. With a header line added:
wc -l $directory-path/*.txt | sed '$d;1ilines total'
Unfortunately, there is no alignment.
3. With alignment: formatting the left column at 11 characters wide.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
/^ *[0-9]\+ total$/d;
1i\      lines file'
This will do the job:
lines file
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
4. But what if your wc version really did put the total on the 1st line?
This one is for fun, because I don't believe there is a wc version that puts the total on the 1st line, but...
This version drops the total line wherever it appears and adds a header line at the top of the output.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
1{
/^ *[0-9]\+ total$/ba;
bb;
:a;
s/^.*$/ lines file/
};
bc;
:b;
1i\ lines file' -e '
:c;
/^ *[0-9]\+ total$/d
'
This is more complicated because we must not blindly drop the 1st line: if it happens to be the total line, it is replaced by the header instead.
This is actually fairly tricky.
I'm basing this on the GNU coreutils version of the wc command. Note that the total line is normally printed last, not first (see my comment on the question).
wc -l prints one line for each input file, consisting of the number of lines in the file followed by the name of the file. (The file name is omitted if there are no file name arguments; in that case it counts lines in stdin.)
If and only if there's more than one file name argument, it prints a final line containing the total number of lines and the word total. The documentation indicates no way to inhibit that summary line.
Other than the fact that it's preceded by other output, that line is indistinguishable from output for a file whose name happens to be total.
So to reliably filter out the total line, you'd have to read all the output of wc -l and remove the final line only if the output is more than one line long. (Even that can fail if you have files with newlines in their names, but you can probably ignore that possibility.)
A more reliable method is to invoke wc -l on each file individually, avoiding the total line:
for file in $directory-path/*.txt ; do wc -l "$file" ; done
And if you want to sort the output (something you mentioned in a comment but not in your question):
for file in $directory-path/*.txt ; do wc -l "$file" ; done | sort -rn
If you happen to know that there are no files named total, a quick-and-dirty method is:
wc -l $directory-path/*.txt | grep -v ' total$'
If you want to run wc -l on all the files and then filter out the total line, here's a bash script that should do the job. Adjust the *.txt as needed.
#!/bin/bash
wc -l *.txt > .wc.out
lines=$(wc -l < .wc.out)
if [[ $lines -eq 1 ]] ; then
cat .wc.out
else
(( lines-- ))
head -n $lines .wc.out
fi
rm .wc.out
Another option is this Perl one-liner:
wc -l *.txt | perl -e '@lines = <>; pop @lines if scalar @lines > 1; print @lines'
@lines = <> slurps all the input into an array of strings. pop @lines discards the last line if there is more than one, i.e., if the last line is the total line.
The program wc always displays the total when there are two or more files (fragment of wc.c):
if (argc > 2)
report ("total", total_ccount, total_wcount, total_lcount);
return 0;
So the easiest approach is to use wc on only one file at a time, and let find present the files to wc one after the other:
find $dir -name '*.txt' -exec wc -l {} \;
Or, as specified by liborm:
dir="."
find $dir -name '*.txt' -exec wc -l {} \; | sort -rn | sed 's/\.txt$//'
This is a job tailor-made for head:
wc -l $directory-path/*.txt | head --lines=-1
This way, everything still runs as a single pipeline (note that a negative count for --lines requires GNU head).
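Combined with the sorting from the question (strip the total first, then sort), a sketch:
wc -l $directory-path/*.txt | head --lines=-1 | sort -rn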
Can you use another wc?
The POSIX wc specification (man -s1p wc) says:
If more than one input file operand is specified, an additional line shall be written, of the same format as the other lines, except that the word total (in the POSIX locale) shall be written instead of a pathname and the total of each column shall be written as appropriate. Such an additional line, if any, is written at the end of the output.
You said the total line was the first line; the manual states it is the last (in your output it comes first only because sort -rn moved the largest count to the top), and other wc implementations don't show it at all. Removing the first or last line blindly is dangerous, so I would grep -v the line with the total (in the POSIX locale...), or just grep for the slash that is part of every other line:
wc -l $directory-path/*.txt | grep "/"
This is not the most optimized way, since you could use combinations of cat, echo, coreutils, awk, sed, tac, etc., but it will get you what you want:
wc -l ./*.txt | awk 'BEGIN{print "Line\tFile"}1' | sed '$d'
wc -l ./*.txt extracts the line counts. awk 'BEGIN{print "Line\tFile"}1' adds the header titles; the trailing 1 is an always-true pattern that prints every input line unchanged. sed '$d' prints all lines except the last one (the total).
Example Result
Line File
6 ./test1.txt
1 ./test2.txt
You can solve it (and many other problems that appear to need a for loop) quite succinctly using GNU Parallel like this:
parallel wc -l ::: tmp/*txt
Sample Output
3 tmp/lines.txt
5 tmp/unfiltered.txt
42 tmp/file.txt
6 tmp/used.txt
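Note that GNU Parallel prints results as jobs finish, so the order above is not deterministic; the -k flag keeps the output in input order, and the question's sort can be added at the end:
parallel -k wc -l ::: tmp/*txt | sort -rn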
The simplicity of using just grep -c
I rarely use wc -l in my scripts because of these issues; I use grep -c instead. Though it is not as efficient as wc -l, we don't need to worry about other issues like the summary line, whitespace, or forking extra processes.
For example:
/var/log# grep -c '^' *
alternatives.log:0
alternatives.log.1:3
apache2:0
apport.log:160
apport.log.1:196
apt:0
auth.log:8741
auth.log.1:21534
boot.log:94
btmp:0
btmp.1:0
<snip>
Very straightforward for a single file:
line_count=$(grep -c '^' my_file.txt)
Performance comparison: grep -c vs wc -l
/tmp# ls -l *txt
-rw-r--r-- 1 root root 721009809 Dec 29 22:09 x.txt
-rw-r----- 1 root root 809338646 Dec 29 22:10 xyz.txt
/tmp# time grep -c '^' *txt
x.txt:7558434
xyz.txt:8484396
real 0m12.742s
user 0m1.960s
sys 0m3.480s
/tmp/# time wc -l *txt
7558434 x.txt
8484396 xyz.txt
16042830 total
real 0m9.790s
user 0m0.776s
sys 0m2.576s
Similar to Mark Setchell's answer, you can also use xargs with an explicit replacement string:
ls | xargs -I% wc -l %
That way xargs doesn't send all the inputs to wc at once, but one operand at a time.
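To see why this avoids the summary line, compare the two forms (a sketch; with very many files xargs may still batch per invocation):
ls *.txt | xargs wc -l        # many files per wc invocation -> wc prints a total
ls *.txt | xargs -I% wc -l %  # one file per wc invocation -> no total line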
Shortest answer:
ls | xargs -l wc
What about using sed with a pattern-removal command, as below? It removes the total line if it is present (but would also remove the entries for any files with "total" in their names).
wc -l $directory-path/*.txt | sort -rn | sed '/total/d'
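Anchoring the pattern narrows the false positives: since the glob only matches *.txt, no per-file line can end in " total", so this is safe here:
wc -l $directory-path/*.txt | sort -rn | sed '/ total$/d'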
I have multiple (more than 100) text files in a directory, such as
files_1_100.txt
files_101_200.txt
The contents of the files are the names of some variables; for example, files_1_100.txt contains some variable names numbered between 1 and 100:
"var.2"
"var.5"
"var.15"
Similarly, files_201_300.txt contains some variable names numbered between 201 and 300:
"var.203"
"var.227"
"var.285"
and files_1001_1100.txt contains:
"var.1010"
"var.1006"
"var.1025"
I can merge them using the command
cat files_*00.txt > ../all_files.txt
However, the contents do not follow the numeric order of the source files. For example, all_files.txt shows
"var.1010"
"var.1006"
"var.1025"
"var.1"
"var.5"
"var.15"
"var.203"
"var.227"
"var.285"
So, how can I ensure that the contents of files_1_100.txt come first, followed by files_201_300.txt and then files_1001_1100.txt, so that all_files.txt contains
"var.1"
"var.5"
"var.15"
"var.203"
"var.227"
"var.285"
"var.1010"
"var.1006"
"var.1025"
Let me try it out, but I think that this will work:
ls file*.txt | sort -n -t _ -k2 -k3 | xargs cat
The idea is to take your list of files, sort it, and then pass it to the cat command.
The sort uses several options:
-n - use a numeric sort rather than alphabetic
-t _ - divide the input (the filename) into fields using the underscore character
-k2 -k3 - sort first by the 2nd field and then by the 3rd field (the 2 numbers)
You have said that your files are named files_1_100.txt, files_101_200.txt, etc. If that means (as it seems to indicate) that the first numeric "chunk" is always unique, then you can leave off the -k3 flag. That flag is needed only if you could end up with, for instance, files_100_2.txt and files_100_10.txt, where you have to look at the 2nd numeric "chunk" to determine the preferred order.
Depending on the number of files you are working with, you may find that the glob (file*.txt) overwhelms the shell's argument-length limit and causes "argument list too long" errors. If that's the case you could do it like this:
ls | grep '^file.*\.txt$' | sort -n -t _ -k2 -k3 | xargs cat
You can use printf with sort and pipe that to xargs cat:
printf "%s\0" f*txt | sort -z -t_ -nk2 | xargs -0 cat > ../all_files.txt
Note that the whole pipeline works on NUL-terminated filenames, which makes sure this command works even for filenames with spaces, newlines, etc.
If your filenames are free from any special characters or whitespace, then the other answers should be easy solutions.
Otherwise, try this rename-based approach:
$ ls files_*.txt
files_101_200.txt files_1_100.txt
$ rename 's/files_([0-9]*)_([0-9]*)/files_000$1_000$2/;s/files_0*([0-9]{3})_0*([0-9]{3})/files_$1_$2/' files_*.txt
$ ls files_*.txt
files_001_100.txt files_101_200.txt
$ cat files_*.txt > outputfile.txt
$ rename 's/files_0*([0-9]*)_0*([0-9]*)/files_$1_$2/' files_*.txt
The default ordering of cat file_* (shell glob expansion) is alphabetical rather than numeric.
List them in numerical order and then cat each one, appending the output to some file.
ls -1 | sort -n | xargs -i cat {} >> file.out
You could try using a for loop, adding the files one by one (ls -v sorts the files correctly even when the numbers are not zero-padded):
for i in $(ls -v files_*.txt)
do
cat "$i" >> ../all_files.txt
done
or, more conveniently, on a single line:
for i in $(ls -v files_*.txt) ; do cat "$i" >> ../all_files.txt ; done
You could also do this with awk by sorting ARGV numerically (on the first number in each name) before any input is read:
awk 'BEGIN {
  # insertion-sort ARGV[1..ARGC-1] on the 2nd "_"-separated field (the first number)
  for (i = 2; i <= ARGC - 1; i++) {
    split(ARGV[i], curr, "_")
    tmp = ARGV[i]
    for (j = i - 1; j >= 1; j--) {
      split(ARGV[j], prev, "_")
      if (prev[2] + 0 <= curr[2] + 0)
        break
      ARGV[j + 1] = ARGV[j]    # shift larger entries right
    }
    ARGV[j + 1] = tmp
  }
}1' files_*00.txt    # the bare 1 pattern prints every input line
I have about 3000 files in a folder. My files contain data like this:
VISITERM_0 VISITERM_20 VISITERM_35 ..... and so on
The files do not all contain the same values as above; the numbers vary from 0 to 99.
I want to find out how many files in the folder contain each of the VISITERMs. For example, if VISITERM_0 is present in 300 files in the folder, then I need it to print
VISITERM_0 300
Similarly, if there are 1000 files that contain VISITERM_1, I need it to print
VISITERM_1 1000
So, I want to print the VISITERMs and the number of files that contain them, from VISITERM_0 through VISITERM_99.
I made use of this grep command:
grep VISITERM_0 * -l | wc -l
However, this is for a single term, and I want to loop from VISITERM_0 through VISITERM_99. Please help!
#!/bin/bash
# ^^- the above is important; #!/bin/sh would allow only POSIX syntax
# use a C-style for loop, which is a bash extension
for ((i=0; i<100; i++)); do
# Calculate number of matches...
num_matches=$(find . -type f -exec grep -l -e "VISITERM_$i" '{}' + | wc -l)
# ...and print the result.
printf 'VISITERM_%d\t%d\n' "$i" "$num_matches"
done
Here is a GNU awk solution (GNU because of the multi-character RS) that should do it:
awk -v RS=" |\n" '{n=split($1,a,"VISITERM_");if (n==2 && a[2]<100) b[a[2]]++} END {for (i in b) print "VISITERM_"i,b[i]}' *
Example:
cat file1
VISITERM_0 VISITERM_320 VISITERM_35
cat file2
VISITERM_0 VISITERM_20 VISITERM_32
VISITERM_20 VISITERM_42 VISITERM_11
Gives:
awk -v RS=" |\n" '{n=split($1,a,"VISITERM_");if (n==2 && a[2]<100) b[a[2]]++} END {for (i in b) print "VISITERM_"i,b[i]}' file*
VISITERM_0 2
VISITERM_11 1
VISITERM_20 2
VISITERM_32 1
VISITERM_35 1
VISITERM_42 1
How it works:
awk -v RS=" |\n" ' # Set record selector to space or new line
{n=split($1,a,"VISITERM_") # Split record using "VISITERM_" as separator and store hits of split in "n"
if (n==2 && a[2]<100) # If "n" is "2" (does contain "ISITERM_") and has number less "100"
b[a[2]]++} # Count the hit of each number and stor it in array "b"
END {for (i in b) # Walk trough array "b"
print "VISITERM_"i,b[i]} # Print the hits
' file* # Read the files
PS: If everything is on a single line, change to RS=" "; then it should work in most awks.
I have a file structure that looks like this
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
and I would like to find, in each directory, the path to the .bin file whose name is the highest number. So the output I am looking for would be
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
The highest number a file can have is 9.
Question
How do I do that in BASH?
I have come as far as find . | grep bin | sort
Globs are guaranteed to be expanded in lexical order.
for dir in ./*/
do
files=("$dir"*)          # create an array of the directory's entries
echo "${files[@]: -1}"   # access its last member
done
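Run from the question's top directory, this prints exactly the desired output:
$ for dir in ./*/; do files=("$dir"*); echo "${files[@]: -1}"; done
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin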
What about using awk? You can get the FIRST occurrence really simply:
[ghoti@pc ~]$ cat data1
./501.res/1.bin
./503.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$ awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' data1
./501.res/1.bin
./503.res/1.bin
./504.res/1.bin
[ghoti@pc ~]$
To get the last occurrence you could pipe through a couple of sorts:
[ghoti@pc ~]$ sort -r data1 | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
./501.res/1.bin
./503.res/2.bin
./504.res/1.bin
[ghoti@pc ~]$
Given that you're using "find" and "grep", you could probably do this:
find . -name \*.bin -type f -print | sort -r | awk 'BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1' | sort
How does this work?
The find command has many useful options, including the ability to select your files by glob, select the type of file, etc. Its output you already know, and that becomes the input to sort -r.
First, we sort our input data in reverse (sort -r). This ensures that within any directory, the highest-numbered file shows up first. That result gets fed into awk.
FS is the field separator, which makes $2 into things like "/501", "/503", etc. Awk scripts have sections of the form condition {action} which get evaluated for each line of input. If a condition is missing, the action runs on every line. If "1" is the condition and there is no action, it prints the line. So this script breaks down as follows:
a[$2] {next} - If the array element a with subscript $2 (e.g. "/501") is set (true), just jump to the next line. Otherwise...
{a[$2]=1} - set the array a subscript $2 to 1, so that in future the first condition will evaluate as true, then...
1 - print the line.
The output of this awk script will be the data you want, but in reverse order. The final sort puts things back in the order you'd expect.
Now ... that's a lot of pipes, and sort can be a bit resource-hungry when you ask it to deal with millions of lines of input at once. This solution will be perfectly sufficient for small numbers of files, but if you're dealing with large quantities of input, let us know, and I can come up with an all-in-one awk solution (which will take longer than 60 seconds to write).
UPDATE
Per Dennis' sage advice, the awk script I included above could be improved by changing it from
BEGIN{FS="."} a[$2] {next} {a[$2]=1} 1
to
BEGIN{FS="."} $2 in a {next} {a[$2]} 1
While this is functionally identical, the advantage is that you simply define array members rather than assigning values to them, which may save memory or CPU depending on your implementation of awk. At any rate, it's cleaner.
Tested:
find . -type d -name '*.res' | while read dir; do
find "$dir" -maxdepth 1 | sort -n | tail -n 1
done
I came up with something like this:
for dir in $(find . -mindepth 1 -type d | sort); do
file=$(ls "$dir" | sort | tail -n 1);
[ -n "$file" ] && (echo "$dir/$file");
done
Maybe it can be made simpler.
If invoking a shell from within find is an option, try this:
find * -type d -exec sh -c "echo -n './'; ls -1 {}/*.bin | sort -n -r | head -n 1" \;
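Embedding {} inside the sh -c string works with some find implementations but is fragile (and unsafe for unusual directory names); a more robust sketch passes each directory as a positional parameter:
find . -mindepth 1 -type d -exec sh -c 'ls "$1"/*.bin | sort -n -r | head -n 1' _ {} \;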
And here is a one-liner:
find . -mindepth 1 -type d | sort | sed -e "s/.*/ls & | sort | tail -n 1 | xargs -I{} echo &\/{}/" | bash