Linux: Append Word Count to Each Line of a File

I'm quite new to Linux. I've seen some straightforward answers for appending a constant/unchanging word to the end of each line of a file (e.g. "shell script add suffix each line").
However, I'd like to know how to append the word count of each line of a .csv file to the end of that line, so that:
word1, word2, word3
foo1, foo2
bar1, bar2, bar3, bar4
Becomes:
word1, word2, word3, 3
foo1, foo2, 2
bar1, bar2, bar3, bar4, 4
I am working with comma-separated values, so if there is a quicker/simpler way to do it by counting the commas rather than the items, that would work as well.
Cheers!

Simple awk solution:
awk -F ',' '{print $0", "NF}' file.csv
The -F option specifies the field separator, a comma in your case.
$0 contains the entire line.
NF is the built-in variable holding the number of fields in the line.
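On the sample input from the question, this prints:
$ awk -F ',' '{print $0", "NF}' file.csv
word1, word2, word3, 3
foo1, foo2, 2
bar1, bar2, bar3, bar4, 4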

You can use this:
while read -r line; do
    n=$(echo "$line" | wc -w)
    echo "$line, $n"
done < inputfile.txt
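Since the values are comma-separated, a minimal pure-bash sketch that counts the separators instead of the words (assuming no quoted or embedded commas) would be:
while read -r line; do
    commas=${line//[^,]/}                 # keep only the commas
    echo "$line, $(( ${#commas} + 1 ))"   # field count = commas + 1
done < inputfile.txt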

A simple (yet most likely slow) bash script could do the trick:
#!/bin/bash
newfile=$1.tmp
while read -r l ; do
    printf '%s ' "$l" >> "$newfile"    # line plus a trailing space
    echo "$l" | wc -w >> "$newfile"    # word count lands on the same output line
done < "$1"
Then move the files according to your liking (using a temp file keeps you safe ...).
for file:
one,
one, two,
one, two, three,
I get:
one, 1
one, two, 2
one, two, three, 3
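To finish, one could replace the original with the temp file once the loop completes, e.g. (inside the same script, after checking the output looks right):
mv -- "$newfile" "$1"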

Related

echo without trimming the space in awk command

I have a file consisting of multiple rows like this:
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN|0000000010000.00|6761857316|508998|6011|GL
I have to split column 11 into 4 different columns, in place, based on character counts.
This is the 11th column, which also contains extra (padding) spaces:
SHOP NO.5,6,7 RUNWAL GRCHEMBUR MHIN
This is what I have done:
ls *.txt *.TXT| while read line
do
subName="$(cut -d'.' -f1 <<<"$line")"
awk -F"|" '{ "echo -n "$11" | cut -c1-23" | getline ton;
"echo -n "$11" | cut -c24-36" | getline city;
"echo -n "$11" | cut -c37-38" | getline state;
"echo -n "$11" | cut -c39-40" | getline country;
$11=ton"|"city"|"state"|"country; print $0
}' OFS="|" $line > $subName$output
done
But when echoing the 11th column, the extra spaces get trimmed, which throws off the character counts. Is there any way to echo without trimming spaces?
Actual output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR MHIN|||0000000010000.00|6761857316|508998|6011|GL
Expected Output
10|EQU000000001|12345678|3456||EOMCO042|EOMCO042|31DEC2018|16:51:17|31DEC2018|SHOP NO.5,6,7 RUNWAL GR|CHEMBUR|MH|IN|0000000010000.00|6761857316|508998|6011|GL
The least annoying way to code this that I've found so far is:
perl -F'\|' -lane '$F[10] = join "|", unpack "a23 A13 a2 a2", $F[10]; print join "|", @F'
It's fairly straightforward:
Iterate over lines of input; split each line on | and put the fields in @F.
For the 11th field ($F[10]), split it into fixed-width subfields using unpack, trimming trailing spaces from the second subfield (A instead of a).
Reassemble subfields by joining with |.
Reassemble the whole line by joining with | and printing it.
I haven't benchmarked it in any way, but it's likely much faster than the original code that spawns multiple shell and cut processes per input line because it's all done in one process.
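If unpack's template letters are unfamiliar, here is a toy example (made-up string, not the question's data) showing the a/A difference:
perl -e 'print join "|", unpack "a5 A5 a2", "abc  de   XY"'
# prints: abc  |de|XY  -- a5 keeps the padding, A5 strips trailing spaces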
A complete solution would wrap it in a shell loop:
for file in *.txt *.TXT; do
outfile="${file%.*}$output"
perl -F'\|' -lane '...' "$file" > "$outfile"
done
Or if you don't need to trim the .txt part (and you don't have too many files to fit on the command line):
perl -i.out -F'\|' -lane '...' *.txt *.TXT
This simply places the output for each input file foo.txt in foo.txt.out.
A pure-bash implementation of all this logic:
#!/usr/bin/env bash
shopt -s nocaseglob extglob
for f in *.txt; do
subName=${f%.*}
while IFS='|' read -r -a fields; do
location=${fields[10]}
ton=${location:0:23}; ton=${ton%%+([[:space:]])}
city=${location:23:12}; city=${city%%+([[:space:]])}
state=${location:36:2}
country=${location:38:2}
fields[10]="$ton|$city|$state|$country"
printf -v out '%s|' "${fields[@]}"
printf '%s\n' "${out:0:$(( ${#out} - 1 ))}"
done <"$f" >"$subName.out"
done
It's slower (if I did this well, by about a factor of 10) than pure awk would be, but much faster than the awk/shell combination proposed in the question.
Going into the constructs used:
All the ${varname%...} and related constructs are parameter expansion. The specific ${varname%pattern} construct removes the shortest possible match for pattern from the end of the value in varname, or the longest match if % is replaced with %%.
Using extglob enables extended globbing syntax, such as +([[:space:]]), which is equivalent to the regex syntax [[:space:]]+.
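A quick illustration of the trailing-space trim (toy value; extglob must be on):
shopt -s extglob
s='CHEMBUR      '
echo "[${s%%+([[:space:]])}]"    # longest run of trailing whitespace removed -> [CHEMBUR]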

Set an external variable in awk

I have written a script in which I want to count the number of columns in data.txt. My problem is that I am unable to get x, set inside the awk script, back into the shell.
Any help would be highly appreciated.
while read p; do
x=1;
echo $p | awk -F' ' '{x=NF}'
echo $x;
file="$x"".txt";
echo $file;
done <$1
data.txt file:
4495125 94307025 giovy115p#live.it 94307025.094307025 12443
stazla deva1a23#gmail.com 1992/.:\1
1447585 gioao_87#hotmail.it h1st#1
saknit tomboro#seznam.cz 1233 1990
Expected output:
5.txt
3.txt
3.txt
4.txt
My output:
1.txt
1.txt
1.txt
1.txt
You just cannot export a variable set inside Awk back to the shell context. In your example, the value of x set inside awk (from NF) will not be reflected outside.
You need to use command substitution ($(..)) syntax to get the value of NF and use it later:
x=$(echo "$p" | awk '{print NF}')
Now x will contain the column count of each line. Note that you don't need -F' ', since whitespace is already the default delimiter in awk.
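Putting that back into the question's loop, a minimal corrected sketch:
while read -r p; do
    x=$(echo "$p" | awk '{print NF}')
    file="$x.txt"
    echo "$file"
done < "$1"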
Besides, your requirement can be done fully in Awk itself:
awk 'NF{print NF".txt"}' file
Here the NF{..} pattern ensures that the actions inside {..} apply only to non-empty rows. Then, for each row, we print the field count with the extension .txt appended.
Awk processes a line at a time; running a separate Awk script for each line inside a shell while read loop is horrendously inefficient. See also https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
Maybe something like this:
awk '{ print >(NF ".txt") }' data.txt
to create a file with the five-column rows in 5.txt, the four-column ones in 4.txt, the three-column ones in 3.txt, etc. for each unique column count.
The Awk variable NF contains the number of fields (by default, Awk splits fields on runs of whitespace; use -F to change to some other separator), and the expression (NF ".txt") simply produces a string concatenation of the number of fields with the suffix .txt, which we pass as a file name to the print redirection.
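One caveat: some awk implementations cap the number of simultaneously open output files. If the input can yield many distinct column counts, a defensive (slower) sketch closes each file after writing; >> is used because each write reopens the file:
awk '{ f = NF ".txt"; print >> f; close(f) }' data.txt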
With bash:
while read -r p; do p=($p); echo "${#p[@]}.txt"; done < file
or shorter:
while read -r -a p; do echo "${#p[@]}.txt"; done < file
Output:
5.txt
3.txt
3.txt
4.txt

How to append and print Numbers from a text file?

I need to start from file.txt, which contains entries like this:
1
2
3
4
5
I need to print the following:
1,100
2,100
3,100
4,100
5,100
I have attempted this, but am receiving an invalid number error:
printf '%d,100\n' "$(< file.txt)"
You can use awk:
$ awk '{printf "%s,100\n", $0}' file
1,100
2,100
3,100
4,100
5,100
You could use
while read -r in; do echo "$in,100"; done < file.txt
Your error is caused by printf receiving the whole file as a single argument instead of line by line.
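A line-by-line fix along those lines (a sketch):
while IFS= read -r n; do
    printf '%d,100\n' "$n"
done < file.txt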
This is one of the rare occurrences of "too many quotes". Observe:
$ cat file.txt
1
2
3
4
5
$ var=$(< file.txt)
$ echo "$var" # Quotes preserve original whitespace
1
2
3
4
5
$ echo $var # No quotes reduce all whitespace to single spaces
1 2 3 4 5
The quotes make echo "see" just a single argument, namely the formatted file contents. Without the quotes, every line becomes an argument to echo, and they're printed separated by just spaces.
So, you can solve your problem with
$ printf '%d,100\n' $(< file.txt)
1,100
2,100
3,100
4,100
5,100
Another solution would be to use sed:
$ sed 's/$/,100/' file.txt
1,100
2,100
3,100
4,100
5,100
This substitutes the end of the line, $, with ,100, for each line.
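If you want to edit the file in place rather than print to stdout, GNU sed supports -i (BSD/macOS sed wants -i '' instead):
sed -i 's/$/,100/' file.txt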
How about:
( set -f; set -- $(< file.txt)
printf '%d,100\n' "$@" )
Here set -f disables globbing, the unquoted $(< file.txt) word-splits the file into positional parameters, and printf reapplies its format once per argument.
Another terse awk variant, demonstrated here with seq 6 as input:
seq 6 | awk '$0=$0",100"'
1,100
2,100
3,100
4,100
5,100
6,100

Split string at special character in bash

I'm reading filenames from a text file line by line in a bash script. However, the lines look like this:
/path/to/myfile1.txt 1
/path/to/myfile2.txt 2
/path/to/myfile3.txt 3
...
/path/to/myfile20.txt 20
So there is a second column containing an integer, separated by a space. I only need the part of the string before the space.
I have only found solutions using a for-loop. But I need something that explicitly looks for the space character in my string and splits it at that point.
In principle I need the equivalent of Matlab's strsplit(str,delimiter).
If you are already reading the file with something like
while read -r line; do
(and you should be), then pass two arguments to read instead:
while read -r filename somenumber; do
read will split the line on whitespace and assign the first field to filename and any remaining field(s) to somenumber.
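For the lines shown, a minimal sketch (list.txt stands in for your actual text file):
while read -r filename somenumber; do
    printf 'file=%s num=%s\n' "$filename" "$somenumber"
done < list.txt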
Three (of many) solutions:
# Using awk
echo "$string" | awk '{ print $1 }'
# Using cut
echo "$string" | cut -d' ' -f1
# Using sed
echo "$string" | sed 's/\s.*$//g'
If you need to iterate through each line of the file anyway, you can cut off everything behind the first space with bash:
while read -r line ; do
    # bash pattern substitution removes the first space
    # and everything that follows it
    echo "${line// *}"
done < file
This should work too:
line="${line% *}"
This cuts the string at its last occurrence of a space, so it will work even if the path contains spaces (as long as the number is separated by the final space).
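A quick check with a path containing a space (made-up value):
line='/path/to/my file.txt 20'
echo "${line% *}"    # -> /path/to/my file.txt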
while read -r line
do
{ rev | cut -d' ' -f2- | rev >> result.txt; } <<< "$line"
done < input.txt
This solution will work even if you have spaces in your filenames.

grep two strings as variables to use in a script

Could you please help me grep the strings mentioned below, on the 3rd line of the file.txt involved, and use them as variables?
file.txt
line1: some words with 123@domain.com
line2: some words
line3: path = /aaa/bbb/domain.com/user@domain.com/ccc/123@test.com/
So I need to grep "user@domain.com" and "123@test" from line 3 to use as variables in a script like:
#!/bin/bash
var1 = some_code result as "user@domain.com"
var2 = some_code result as "123@test"
run_a_command $var1 $var2
Thanks in advance,
If the format of the file is the same as you have shown, then you could do:
arr=($(awk -F'/' '/path/{print $5,$7}' file))   # extract the desired 2 fields
arr[1]=${arr[1]%\.com}                          # remove the suffix ".com"
run_a_command "${arr[0]}" "${arr[1]}"
Depending on the file content, you may also want to adjust the awk extraction. You can also check whether one or both array elements are empty, if that is a possibility. If it's always the third line, you can add an NR==3 check in the awk pattern-matching part: arr=($(awk -F'/' 'NR==3 && /path/{print $5,$7}' file)).
If the input file has a more complex format (e.g. multiple such lines in the input file), then you should update the question, as any solution depends on that.
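To see which '/'-separated fields that sample line yields (and hence why $5 and $7 are picked), a quick debugging sketch:
awk -F'/' '/path/{for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i}' file
On the sample line, field 5 is user@domain.com and field 7 is 123@test.com.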
What about:
grep -o -E '\/[^\/]+@[^\/\.]+' INFILE | sed "s/\///g"
Maybe the following is what you are looking for?
grep -o -E '\/[^\/]+@[^\/]+(\/|$)' INFILE | sed "s/\///g"
