Why does the wc command count one more character than expected?

The following is the content stored in my file:
This is my Input
Using the wc -c command, we can get the number of characters stored in the file.
I edited the file with Vim on Ubuntu and expected the output to be 16, but wc -c returns 17.
Why is the output like this? There isn't even a carriage return at the end of the line. So what is the 17th character?

Of course there is a newline (Enter) in the file; you just can't see it. Consider these two examples:
echo -n "This is my Input" | wc -c
16
because -n suppresses the trailing newline, but
echo "This is my Input" | wc -c
17
To see the newline, look at this example:
echo "This is my Input" | od -c
od dumps files in octal and other formats. -c selects ASCII characters or backslash escapes.
And here is an example of using od on a file:
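A minimal sketch of that kind of check on a file, assuming a scratch file created with mktemp (the variable names and file contents are illustrative, not taken from the original post):

```shell
#!/bin/sh
# Write the sample line, terminated by a newline, to a scratch file.
sample=$(mktemp)
printf 'This is my Input\n' > "$sample"

# od -c shows every byte; the final newline appears as \n.
dump=$(od -c "$sample")
printf '%s\n' "$dump"

rm -f "$sample"
```

The last character listed in the dump is \n, which is the extra byte that wc -c counts.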

On Linux, when Vim saves a buffer, it terminates every line, including the last one, with a newline character.
You can open your file and enter :%!xxd to view a hex dump, or run hexdump yourfile directly.
0000000: 5468 6973 2069 7320 6d79 2049 6e70 7574 This is my Input
0000010: 0a .
~
~
~
There you can see that 0a has been appended at the end of the file.
So when you use wc -c to count the bytes in this file, it returns 17, which includes the newline character.
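To compare the two counts directly, the sketch below measures a file as Vim would save it and then the same bytes with the newline filtered out. It assumes a scratch file from mktemp; the variable names are illustrative.

```shell
#!/bin/sh
f=$(mktemp)
printf 'This is my Input\n' > "$f"

# Total bytes, including the terminating newline.
total_bytes=$(wc -c < "$f" | tr -d ' ')
# Bytes left once every newline is removed.
visible_bytes=$(tr -d '\n' < "$f" | wc -c | tr -d ' ')

echo "with newline: $total_bytes, without: $visible_bytes"
rm -f "$f"
```

The tr -d ' ' step only strips the padding that some wc implementations print before the number.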

The string you are giving as input has no newline, but echo appends one, and wc -c counts the newline that echo emits.
For example,
echo k | wc -c
returns 2: 1 for k and 1 for the newline appended by echo.
While
echo -n k | wc -c
returns 1 because -n suppresses the newline.
wc -c always counts a newline if one is present in its input.
You can try
printf k | wc -c
It returns 1.
See what it does in a file:
bash-4.1$ echo 1234 > newfile
bash-4.1$ cat newfile
1234
bash-4.1$ cat -e newfile
1234$
bash-4.1$ printf 1234 > newfile
bash-4.1$ cat newfile
1234bash-4.1$ cat -e newfile
1234bash-4.1$
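If you would rather test for the trailing newline than read it off cat -e output, something along these lines may help. This is a sketch, not part of the original answer; ends_with_newline is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if the last byte of the file is a newline.
# $(tail -c 1 ...) is empty in that case, because command
# substitution strips the trailing newline. An empty file would
# also give an empty string, so it is excluded with -s.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

f=$(mktemp)
echo 1234 > "$f"
if ends_with_newline "$f"; then echo_result=yes; else echo_result=no; fi
printf 1234 > "$f"
if ends_with_newline "$f"; then printf_result=yes; else printf_result=no; fi
echo "echo: $echo_result, printf: $printf_result"
rm -f "$f"
```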

You have 17 because of the trailing newline character (\n) that terminates the line; it is not a NUL (\0) character.

Related

why does echo -n "100" | wc -c output 3?

I was playing around with a few Linux commands and found that echo -n "100" | wc -c outputs 3. I knew that 100 could be stored in a single byte as 1100100, so I could not understand why this happened. I guess it is because of some terminal encoding, is it? I also found that if I touch test.txt, run echo -n "100" > test.txt, and then execute wc ./test.txt -c, I get the same output. Here my guess is to blame the file encoding; am I right?

100 is three characters long, hence wc giving you 3. If you left out the -n to echo it'd show 4, because echo would be printing out a newline too in that case.

When you echo -n 100, you are showing a string with 3 characters.
When you want to show a character with ascii value 100, use
echo -n "d"
# Check
echo -n "d" | xxd -b
I found the value "d" with man ascii. If you don't want to use the man page, use
printf "\\$(printf "%o" 100)"
# Check
printf "\\$(printf "%o" 100)" | xxd -b
# wc returns 1 here
printf "\\$(printf "%o" 100)" | wc -c

That's the expected behavior:
$ wc --help
...
-c, --bytes print the byte counts
-m, --chars print the character counts
...
$ man echo
...
-n do not output the trailing newline
...
$ echo -n 'abc' | wc -c
3
$ echo -n 'абс' | wc -c # russian symbols
6
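The distinction between digits, byte values, and multi-byte characters can be sketched as follows. The variable names are illustrative, and the script is assumed to be saved as UTF-8; the wc -m result depends on the current locale, so it is printed but not relied on.

```shell
#!/bin/sh
# Three ASCII digits: three bytes.
digits=$(printf '100' | wc -c | tr -d ' ')
# The single byte with value 100 (octal 144), i.e. the letter d.
one_byte=$(printf '\144' | wc -c | tr -d ' ')
# Three Cyrillic letters, two bytes each when encoded as UTF-8.
cyrillic_bytes=$(printf 'абс' | wc -c | tr -d ' ')
# wc -m counts characters instead of bytes; locale dependent.
cyrillic_chars=$(printf 'абс' | wc -m | tr -d ' ')

echo "digits=$digits one_byte=$one_byte bytes=$cyrillic_bytes chars=$cyrillic_chars"
```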

how to get last line first 2 character of a file in linux

I have files in the format below, generated by another system:
12;453453;TBS;OPPS;
12;453454;TGS;OPPS;
12;453455;TGS;OPPS;
12;453456;TGS;OPPS;
20;787899;THS;CLST;
33;786789;
I have to check whether the last line contains 33; if it does, copy the file(s) to another location, otherwise discard the file.
Currently I am doing the following:
tail -1 abc.txt >> c.txt
awk '{print substr($0,0,2)}' c.txt
Then the output is saved to another variable and used for the copy.
Can anyone suggest a simpler way?
Thank you!

Imagine you have the following input file:
$ cat file
a
b
c
d
e
agc
Then you can run the following commands (awk, sed, grep, cut) to get the first 2 chars of the last line:
AWK
$ awk 'END{print substr($0,1,2)}' file   # awk string indices start at 1
ag
SED
$ sed -n '$s/^\(..\).*/\1/p' file
ag
GREP
$ tail -1 file | grep -oE '^..'
ag
CUT
$ tail -1 file | cut -c '1-2'
ag
BASH SUBSTRING
line=$(tail -1 file); echo ${line:0:2}
All of those commands do what you are looking for. The awk command operates only on the last line of the file, so you no longer need tail. The sed command takes the last line of the file in its pattern space, replaces everything after the first 2 chars with nothing, and then prints the pattern space (the first 2 chars of the last line). Another solution is to tail the last line of the file and extract the first 2 chars using grep or cut; by piping those 2 commands you can do it in one step without intermediate variables or files.
Now if you want to put everything in one script, this becomes:
$ more file check_2chars.sh
::::::::::::::
file
::::::::::::::
a
b
c
d
e
33abc
::::::::::::::
check_2chars.sh
::::::::::::::
#!/bin/bash
s1=$(tail -1 file | cut -c 1-2) #you can use other commands from this post
s2=33
if [ "$s1" == "$s2" ]
then
echo "match" #implement the copy/discard logic
fi
Execution:
$ ./check_2chars.sh
match
I will let you implement the copy/discard logic
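One possible way to fill in the copy step is sketched below. It is not the answerer's code: copy_if_33 is a hypothetical helper, "discard" is assumed to mean "do not copy", and the destination directory is assumed to exist already.

```shell
#!/bin/sh
# copy_if_33 FILE DESTDIR
# Copies FILE into DESTDIR when the last line of FILE starts with 33.
# Returns non-zero (and copies nothing) otherwise.
copy_if_33() {
    prefix=$(tail -n 1 "$1" | cut -c 1-2)
    if [ "$prefix" = "33" ]; then
        cp "$1" "$2"/
    else
        return 1
    fi
}

# Usage (paths are placeholders):
#   copy_if_33 abc.txt /some/destination || echo "abc.txt skipped"
```

Returning a status instead of printing "match" lets the caller decide what to do with files that fail the check.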

Given the task of either copying or deleting files based on their contents, shell variables aren't necessary.
Using the sed F (print file name) command and xargs, the whole task can be done in just one line:
find | xargs -l sed -n '${/^33/!F}' | xargs -r rm ; cp * dest/dir/
Or preferably, with GNU sed:
sed -sn '${/^33/!F}' * | xargs -r rm ; cp * dest/dir/
Or if all the filenames contain no whitespace:
rm -r $(sed -sn '${/^33/!F}' *) ; cp * dest/dir/
That assumes all the files in the current directory are to be tested.
sed looks at the last line ($) of every file, and runs what's in the curly braces.
If any of those last lines do not begin with 33 (/^33/!), sed outputs just those unwanted file names (F).
Supposing the unwanted files are named foo and baz, those are piped to xargs, which runs rm foo baz.
At this point the only files left are the ones that should be copied over to dest/dir/: cp * dest/dir/.
It's efficient: cp and rm need only be run once.
If a shell variable must be used, here are two more methods:
Using tail and bash, store first two chars of the last line to $n:
n="$(tail -1 abc.txt)" n="${n:0:2}"
Here's a more portable POSIX shell version:
n="$(tail -1 abc.txt)" n="${n%${n#??}}"
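The POSIX expansion above can be wrapped in a small function, sketched here with a hypothetical name, first_two. One deliberate difference from the one-liner: the inner expansion is quoted, so that glob characters in the rest of the line are treated literally rather than as a pattern.

```shell
#!/bin/sh
# first_two STRING: print the first two characters of STRING.
# For strings shorter than two characters this prints an empty line.
first_two() {
    rest=${1#??}                   # everything after the first two characters
    printf '%s\n' "${1%"$rest"}"   # strip that suffix literally, keeping the prefix
}

first_two '33;786789;'
first_two '33*[x]'
```

Both calls print 33; without the inner quotes, the second one would not, because *[x] would be read as a glob pattern.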

You may explicitly test with sed for the last line ($) starting with 33 (/^33.*/):
echo " 12;453453;TBS;OPPS;
12;453454;TGS;OPPS;
12;453455;TGS;OPPS;
12;453456;TGS;OPPS;
20;787899;THS;CLST;
33;786789;" | sed -n "$ {/^33.*/p}"
33;786789;
If you store the result in a variable, you may test it for being empty or not:
lastline33=$(echo " 12;453453;TBS;OPPS;
12;453454;TGS;OPPS;
12;453455;TGS;OPPS;
12;453456;TGS;OPPS;
20;787899;THS;CLST;
33;786789;" | sed -n "$ {/^33.*/p}")
echo $(test -n "$lastline33" && echo not null || echo null)
not null
You probably want the regular expression to include the semicolon, because otherwise it would also match 330, 331, ... 339, 33401345 and so on. Maybe the context rules those out, but including the semicolon seems like a good idea to me:
lastline33=$(sed -n "$ {/^33;.*/p}" abc.txt)

Suppressing summary information in `wc -l` output

I use the command wc -l to count the number of lines in my text files (I also want to sort everything through a pipe), like this:
wc -l $directory-path/*.txt | sort -rn
The output includes a "total" line, which is the sum of the lines of all files:
10 total
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
Is there any way to suppress this summary line? Or even better, to change the way the summary line is worded? For example, instead of "10", the word "lines" and instead of "total" the word "file".

Yet another sed solution!
1. Short and quick
As the total comes on the last line, $d is the sed command for deleting the last line.
wc -l $directory-path/*.txt | sed '$d'
2. with header line addition:
wc -l $directory-path/*.txt | sed '$d;1ilines total'
Unfortunately, there is no alignment.
3. With alignment: formatting the left column at a width of 11 chars.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
/^ *[0-9]\+ total$/d;
1i\ lines filename'
This will do the job:
lines file
5 ./directory/1.txt
3 ./directory/2.txt
2 ./directory/3.txt
4. But if your wc version really could put total on the 1st line:
This one is for fun, because I don't believe there is a wc version that puts total on the 1st line, but...
This version drops the total line wherever it appears and adds a header line at the top of the output.
wc -l $directory-path/*.txt |
sed -e '
s/^ *\([0-9]\+\)/ \1/;
s/^ *\([0-9 ]\{11\}\) /\1 /;
1{
/^ *[0-9]\+ total$/ba;
bb;
:a;
s/^.*$/ lines file/
};
bc;
:b;
1i\ lines file' -e '
:c;
/^ *[0-9]\+ total$/d
'
This is more complicated because we won't drop the 1st line, even if it's the total line.

This is actually fairly tricky.
I'm basing this on the GNU coreutils version of the wc command. Note that the total line is normally printed last, not first (see my comment on the question).
wc -l prints one line for each input file, consisting of the number of lines in the file followed by the name of the file. (The file name is omitted if there are no file name arguments; in that case it counts lines in stdin.)
If and only if there's more than one file name argument, it prints a final line containing the total number of lines and the word total. The documentation indicates no way to inhibit that summary line.
Other than the fact that it's preceded by other output, that line is indistinguishable from output for a file whose name happens to be total.
So to reliably filter out the total line, you'd have to read all the output of wc -l, and remove the final line only if the output is more than one line long. (Even that can fail if you have files with newlines in their names, but you can probably ignore that possibility.)
A more reliable method is to invoke wc -l on each file individually, avoiding the total line:
for file in $directory-path/*.txt ; do wc -l "$file" ; done
And if you want to sort the output (something you mentioned in a comment but not in your question):
for file in $directory-path/*.txt ; do wc -l "$file" ; done | sort -rn
If you happen to know that there are no files named total, a quick-and-dirty method is:
wc -l $directory-path/*.txt | grep -v ' total$'
If you want to run wc -l on all the files and then filter out the total line, here's a bash script that should do the job. Adjust the *.txt as needed.
#!/bin/bash
wc -l *.txt > .wc.out
lines=$(wc -l < .wc.out)
if [[ lines -eq 1 ]] ; then
cat .wc.out
else
(( lines-- ))
head -n $lines .wc.out
fi
rm .wc.out
Another option is this Perl one-liner:
wc -l *.txt | perl -e '@lines = <>; pop @lines if scalar @lines > 1; print @lines'
@lines = <> slurps all the input into an array of strings. pop @lines discards the last line if there are more than one, i.e., if the last line is the total line.
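Because the total line appears only when more than one file name is given, the same filtering can also be keyed on the argument count instead of on the output length. A minimal sketch, assuming no file names contain newlines; wc_no_total is a hypothetical function name.

```shell
#!/bin/sh
# wc_no_total FILE...: per-file line counts without the summary line.
wc_no_total() {
    if [ "$#" -gt 1 ]; then
        # More than one operand: wc appends a total line, so drop the last line.
        wc -l "$@" | sed '$d'
    else
        # Zero or one operand: wc prints no total line.
        wc -l "$@"
    fi
}

# Usage: wc_no_total ./*.txt | sort -rn
```

Unlike grep -v ' total$', this keeps the count for a file that is itself named total.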

The wc program always displays the total when there are two or more files (fragment of wc.c):
if (argc > 2)
report ("total", total_ccount, total_wcount, total_lcount);
return 0;
So the easiest approach is to run wc on only one file at a time, and have find pass the files to wc one after the other:
find $dir -name '*.txt' -exec wc -l {} \;
Or as specified by liborm.
dir="."
find $dir -name '*.txt' -exec wc -l {} \; | sort -rn | sed 's/\.txt$//'

This is a job tailor-made for head:
wc -l | head --lines=-1
This way, you can still run in one process.

Can you use another wc?
The POSIX wc manual page (man -s1p wc) says:
If more than one input file operand is specified, an additional line shall be written, of the same format as the other lines, except that the word total (in the POSIX locale) shall be written instead of a pathname and the total of each column shall be written as appropriate. Such an additional line, if any, is written at the end of the output.
You said the total line was the first line; the manual states it's the last, and other wc implementations don't show it at all. Removing the first or last line is dangerous, so I would grep -v the line with the total (in the POSIX locale...), or just grep for the slash that's part of all other lines:
wc -l $directory-path/*.txt | grep "/"

Not the most optimized way, since you can use combinations of cat, echo, coreutils, awk, sed, tac, etc., but this will get you what you want:
wc -l ./*.txt | awk 'BEGIN{print "Line\tFile"}1' | sed '$d'
wc -l ./*.txt will extract the line counts. awk 'BEGIN{print "Line\tFile"}1' will add the header titles; the trailing 1 is an always-true pattern, so awk prints every input line unchanged. sed '$d' will print all lines except the last one.
Example Result
Line File
6 ./test1.txt
1 ./test2.txt

You can solve it (and many other problems that appear to need a for loop) quite succinctly using GNU Parallel like this:
parallel wc -l ::: tmp/*txt
Sample Output
3 tmp/lines.txt
5 tmp/unfiltered.txt
42 tmp/file.txt
6 tmp/used.txt

The simplicity of using just grep -c
I rarely use wc -l in my scripts because of these issues. I use grep -c instead. Though it is not as efficient as wc -l, we don't need to worry about other issues like the summary line, white space, or forking extra processes.
For example:
/var/log# grep -c '^' *
alternatives.log:0
alternatives.log.1:3
apache2:0
apport.log:160
apport.log.1:196
apt:0
auth.log:8741
auth.log.1:21534
boot.log:94
btmp:0
btmp.1:0
<snip>
Very straight forward for a single file:
line_count=$(grep -c '^' my_file.txt)
Performance comparison: grep -c vs wc -l
/tmp# ls -l *txt
-rw-r--r-- 1 root root 721009809 Dec 29 22:09 x.txt
-rw-r----- 1 root root 809338646 Dec 29 22:10 xyz.txt
/tmp# time grep -c '^' *txt
x.txt:7558434
xyz.txt:8484396
real 0m12.742s
user 0m1.960s
sys 0m3.480s
/tmp/# time wc -l *txt
7558434 x.txt
8484396 xyz.txt
16042830 total
real 0m9.790s
user 0m0.776s
sys 0m2.576s
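Apart from speed, the two commands differ on a file whose last line has no terminating newline: wc -l counts newline characters, while grep -c '^' counts lines, including an unterminated final one. A small sketch, using a scratch file and illustrative variable names:

```shell
#!/bin/sh
f=$(mktemp)
printf 'first\nsecond' > "$f"   # the last line has no terminating newline

wc_count=$(wc -l < "$f" | tr -d ' ')    # counts newline characters
grep_count=$(grep -c '^' "$f")          # counts lines, terminated or not

echo "wc -l: $wc_count, grep -c: $grep_count"
rm -f "$f"
```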

Similar to Mark Setchell's answer, you can also use xargs with an explicit replacement string:
ls | xargs -I% wc -l %
Then xargs doesn't send all the inputs to wc at once, but one input line at a time.

Shortest answer:
ls | xargs -l wc

What about using sed with the pattern removal option as below which would only remove the total line if it is present (but also any files with total in them).
wc -l $directory-path/*.txt | sort -rn | sed '/total/d'

How to keep blank lines at the end of a file when I use the cat command in a shell script

The file a.txt has two blank lines at the end:
[yaxin@oishi tmp]$ cat -n a.txt
1 jhasdfj
2
3 sdfjalskdf
4
5
And my script is:
[yaxin@oishi tmp]$ cat t.sh
#!/bin/sh
a=`cat a.txt`
a_length=`echo "$a" | awk 'END {print NR}'`
echo "$a"
echo $a_length
[yaxin@oishi tmp]$ sh t.sh
jhasdfj
sdfjalskdf
3
With debugging enabled:
[yaxin@oishi tmp]$ sh -x t.sh
++ cat a.txt
+ a='jhasdfj
sdfjalskdf'
++ echo 'jhasdfj
sdfjalskdf'
++ awk 'END {print NR}'
+ a_length=3
+ echo 'jhasdfj
sdfjalskdf'
jhasdfj
sdfjalskdf
+ echo 3
3
The cat command steals the blank lines at the end of the file. How can I solve this problem?

The cat command does not steal anything. It is the command substitution that does. man bash says:
Bash performs the expansion by executing command and replacing the command substitution with the standard output of the command, with any trailing newlines deleted. Embedded newlines are not deleted
If you want to store the output of a command in a variable, you can add && echo . after the command, store the output, and then remove the final . from the stored value.
Also, to count the number of lines in a file, the canonical way is to run wc -l:
wc -l < a.txt
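The && echo . idea from this answer can be sketched as follows, using a scratch file with the same contents as a.txt in the question. The variable names are illustrative.

```shell
#!/bin/sh
f=$(mktemp)
printf 'jhasdfj\n\nsdfjalskdf\n\n\n' > "$f"   # five lines, the last two blank

# Plain command substitution: trailing newlines are stripped.
plain=$(cat "$f")

# Sentinel trick: the trailing . protects the newlines, then is removed.
kept=$(cat "$f" && echo .)
kept=${kept%.}

plain_lines=$(printf '%s\n' "$plain" | wc -l | tr -d ' ')
kept_lines=$(printf '%s' "$kept" | wc -l | tr -d ' ')
echo "plain: $plain_lines, kept: $kept_lines"
rm -f "$f"
```

Note that the kept value is printed with '%s' and no added newline, since it already ends with the file's own newlines.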

You don't need cat command here, directly use awk like this:
awk 'END {print NR}' a.txt
Your problem is in storing cat's output in a shell variable. Even this will give the right output (though it is a case of UUOC, a useless use of cat):
cat a.txt | awk 'END {print NR}'
Update: When you try to do this:
a=`cat a.txt`
OR else:
a=$(cat a.txt)
The pitfall is that command substitution, i.e. a command inside backquotes as you have it, or inside $(), strips trailing newlines.
You can do this trick to get trailing newlines stored in a shell variable:
a=`cat a.txt; echo ';'`
a="${a%;}"
Test the variable value:
echo "$a"
printf "%q" "$a"
Then output will show newlines as well:
jhasdfj
sdfjalskdf
$'jhasdfj\n\nsdfjalskdf\n\n\n'

bash line concatenation during variable interpolation

$ cat isbndb.sample | wc -l
13
$ var=$(cat isbndb.sample); echo $var | wc -l
1
Why is the newline character missing when I assign the string to the variable? How can I keep the newline character from being converted into a space?
I am using bash.

You have to quote the variable to preserve the newlines.
$ var=$(cat isbndb.sample); echo "$var" | wc -l
And cat is unnecessary in both cases:
$ wc -l < isbndb.sample
$ var=$(< isbndb.sample); echo "$var" | wc -l
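The effect of the quotes can be seen by counting lines both ways. This is a sketch with a three-line scratch file standing in for isbndb.sample; the variable names are illustrative.

```shell
#!/bin/sh
f=$(mktemp)
printf 'one\ntwo\nthree\n' > "$f"
var=$(cat "$f")

# Unquoted: the shell splits $var on whitespace (newlines included)
# and echo joins the resulting words with single spaces.
unquoted_lines=$(echo $var | wc -l | tr -d ' ')

# Quoted: the value is passed to echo as one word, newlines intact.
quoted_lines=$(echo "$var" | wc -l | tr -d ' ')

echo "unquoted: $unquoted_lines, quoted: $quoted_lines"
rm -f "$f"
```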
Edit:
Bash normally strips extra trailing newlines from a file when it assigns its contents to a variable. You have to resort to some tricks to preserve them. Try this:
IFS='' read -d '' var < isbndb.sample; echo "$var" | wc -l
Setting IFS to null prevents the file from being split on the newlines and setting the delimiter for read to null makes it accept the file until the end of file.

var=($(< file))
echo ${#var[@]}
