Add a prefix to beginning of each line depending on its size - linux

I have a file with postal codes and city names like this :
1234 foo
4321 foobar
64324 foofoobar
92001 bar
with a \t between the number and the city name. I would like to add the prefix 0 to each line whose postal code has only 4 digits, using sed or a shell script, to get:
01234 foo
04321 foobar
64324 foofoobar
92001 bar
Thanks for the help.

Assuming all the postcodes are numeric, you can use the printf command in awk for the task, as per the following transcript (the v characters are there just to show where the tab stops are):
pax> printf "v\tv\tv\n" ; cat infile
v v v
1234 rio xyz
4321 munich abc
64324 perth def
92001 paris qqq
pax> awk 'BEGIN {OFS = "\t"} {arg1 = $1; $1 = ""; printf "%05d%s\n", arg1, $0}' infile
01234 rio xyz
04321 munich abc
64324 perth def
92001 paris qqq
The awk command first saves and blanks out the first field (a) of each line, then prints the zero-padded value followed by the modified line.
You'll notice I've also set the output field separator to a tab character since that appears to be what you're using. That may not be necessary; it just depends on how closely you want the output data to match the input.
(a) Technically, assigning to $1 just sets the field to an empty string; the field itself still exists. That's why no tab is needed between the %05d and %s in the format string: the rebuilt $0 still begins with the separator.
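A variation on the same idea, offered only as a sketch and not as part of the answer above, is to split and rejoin on tabs and reformat $1 in place. The sample data is rebuilt with printf so the tabs are explicit; the variable names input and padded are illustrative.

```shell
# Sample input: postal code, a tab, then the city name (assumed layout).
input=$(printf '1234\tfoo\n4321\tfoobar\n64324\tfoofoobar\n92001\tbar\n')

# Split and rejoin on tabs, zero-padding the first field to five digits.
# The trailing "1" is an always-true pattern that prints the rebuilt line.
padded=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = OFS = "\t" } { $1 = sprintf("%05d", $1) } 1')

printf '%s\n' "$padded"
```

Like the original, this assumes purely numeric codes; %05d would turn a non-numeric first field into 00000.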

sed 's/^[0-9]\{4\}[^0-9]/0&/' filename
(But paxdiablo's answer is more legible, I think.)

Related

grep string after first occurrence of numbers

How do I get a string after the first occurrence of a number?
For example, I have a file with multiple lines:
34 abcdefg
10 abcd 123
999 abc defg
I want to get the following output:
abcdefg
abcd 123
abc defg
Thank you.
You could use Awk for this: loop through the fields of each line up to NF (the number of fields in the line) and, at the first field that contains a digit, print the field after it. The break statement exits the for loop after that first match.
awk '{ for(i=1;i<=NF;i++) if ($i ~ /[[:digit:]]+/) { print $(i+1); break } }' file
It is not clear exactly what you want, but you can try to express it in sed.
Remove everything until the first digit, the next digits and any spaces.
sed 's/[^0-9]*[0-9]\+ *//'
Imagine the following input file:
001 ham
03spam
3 spam with 5 eggs
A quick solution with awk would be:
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $1}' <file>
This replaces the first match of any run of non-digit characters followed by a run of digits with the empty string (""). Because $0 is modified, awk re-splits it, so you can print either the first field or the whole remaining line. The command above gives exactly the following output:
ham
spam
spam
If you are interested in the remainder of the line
awk '{sub(/[^0-9]*[0-9]+ */,"",$0); print $0}' <file>
This will have as an output :
ham
spam
spam with 5 eggs
Be aware that an extra " *" is needed in the regular expression to remove the spaces that follow the number. Without it, the leading space is kept:
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $0}' <file>
 ham
spam
 spam with 5 eggs
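The difference is easy to miss on screen, so here is a small sketch that runs both variants on the sample input and brackets each output line to make leading spaces visible. The variable names are illustrative only.

```shell
# Sample input from above; the second line has no space after the number.
input=$(printf '001 ham\n03spam\n3 spam with 5 eggs\n')

# With " *": the spaces after the number are consumed by the match.
trimmed=$(printf '%s\n' "$input" | awk '{ sub(/[^0-9]*[0-9]+ */, ""); print }')

# Without " *": the space after the number stays at the start of the line.
untrimmed=$(printf '%s\n' "$input" | awk '{ sub(/[^0-9]*[0-9]+/, ""); print }')

# Bracket each line so that leading spaces are visible.
printf '%s\n' "$trimmed" | sed 's/.*/[&]/'
printf '%s\n' "$untrimmed" | sed 's/.*/[&]/'
```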
You can remove the first run of digits and spaces using sed:
sed -E 's/[0-9 ]+//' file
grep can do the job:
$ grep -o -P '(?<=[0-9] ).*' inputFIle
abcdefg
abcd 123
abc defg
For completeness, here is a solution with perl:
$ perl -lne 'print $1 if /[0-9]+\s*(.*)/' inputFIle
abcdefg
abcd 123
abc defg

Swapping the first word with itself 3 times only if there are 4 words only using sed

Hi, I'm trying to solve a problem using only sed commands and without using a pipeline. I am allowed, however, to redirect the result of a sed command to a file or to read from a file.
EX:
sed s/dog/cat/ >| tmp
or
sed s/dog/cat/ < tmp
Anyway, let's say I have a file F1 whose contents are:
Hello hi 123
if a equals b
you
one abc two three four
dany uri four 123
The output should be:
if if if a equals b
dany dany dany uri four 123
Explanation: the program must only print lines that have exactly 4 words and when it prints them it must print the first word of the line 3 times.
I've tried doing commands like this:
sed '/[^ ]*.[^ ]*.[^ ]*/s/[^ ]\+/& & &/' F1
or
sed 's/[^ ]\+/& & &/' F1
but I can't figure out how to check with sed that a line has exactly 4 words.
Any help will be appreciated.
$ sed -En 's/^([^[:space:]]+)([[:space:]]+[^[:space:]]+){3}$/\1 \1 &/p' file
if if if a equals b
dany dany dany uri four 123
The above requires a sed that supports EREs via the -E option (e.g. GNU and OS X sed).
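For a sed without -E, the same pattern can probably be written with POSIX basic regular expressions, using \{1,\} in place of + and \(...\)\{3\} for the repeated group. The following is a sketch of that translation, not a tested replacement on every sed; the pipe from printf is there only to supply the sample data, and sed alone does the work.

```shell
# Contents of F1 from the question.
input=$(printf 'Hello hi 123\nif a equals b\nyou\none abc two three four\ndany uri four 123\n')

# One non-space word, then exactly three more (space run + word), anchored
# at both ends; -n plus the p flag prints only the lines that matched.
result=$(printf '%s\n' "$input" |
  sed -n 's/^\([^[:space:]]\{1,\}\)\([[:space:]]\{1,\}[^[:space:]]\{1,\}\)\{3\}$/\1 \1 &/p')

printf '%s\n' "$result"
```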
If the fields are tab separated
sed 'h;s/[^[:blank:]]//g;s/[[:blank:]]\{3\}//;/^$/!d;x;s/\([^[:blank:]]*[[:blank:]]\)/\1\1\1/' infile

Extract values from a fixed-width column

I have text file named file that contains the following:
Australia AU 10
New Zealand NZ 1
...
If I use the following command to extract the country names from the first column:
awk '{print $1}' file
I get the following:
Australia
New
...
Only the first word of each country name is output.
How can I get the entire country name?
Try this:
$ awk '{print substr($0,1,15)}' file
Australia
New Zealand
To complement Raymond Hettinger's helpful POSIX-compliant answer:
It looks like your country-name column is 23 characters wide.
In the simplest case, if you don't need to trim trailing whitespace, you can just use cut:
# Works, but has trailing whitespace.
$ cut -c 1-23 file
Australia
New Zealand
Caveat: GNU cut is not UTF-8 aware, so if the input is UTF-8-encoded and contains non-ASCII characters, the above will not work correctly.
To trim trailing whitespace, you can take advantage of GNU awk's nonstandard FIELDWIDTHS variable:
# Trailing whitespace is trimmed.
$ awk -v FIELDWIDTHS=23 '{ sub(" +$", "", $1); print $1 }' file
Australia
New Zealand
FIELDWIDTHS=23 declares the first field (reflected in $1) to be 23 characters wide.
sub(" +$", "", $1) then removes trailing whitespace from $1 by replacing any nonempty run of spaces (" +") at the end of the field ($1) with the empty string.
However, your Linux distro may come with Mawk rather than GNU Awk; use awk -W version to determine which one it is.
For a POSIX-compliant solution that trims trailing whitespace, extend Raymond's answer:
# Trailing whitespace is trimmed.
$ awk '{ c=substr($0, 1, 23); sub(" +$", "", c); print c}' file
Australia
New Zealand
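To try the substr approach without the original file, the sketch below builds a small fixed-width sample with printf. The 23-character name column follows the width guessed above; the 4-character code column and the variable names are assumptions made for illustration.

```shell
# Hypothetical fixed-width sample: a 23-character name column, a 4-character
# code column, then a number.
input=$(printf '%-23s%-4s%s\n' 'Australia' 'AU' '10' 'New Zealand' 'NZ' '1')

# Take the first 23 characters, then strip the padding spaces at the end.
names=$(printf '%s\n' "$input" |
  awk '{ c = substr($0, 1, 23); sub(/ +$/, "", c); print c }')

printf '%s\n' "$names"
```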
To get rid of the last two columns:
awk 'NF>2 && NF-=2' file
NF>2 is a guard that keeps only records with more than 2 fields. If your data is consistent, you can drop it and simply use:
awk 'NF-=2' file
This doesn't apply when the values themselves contain spaces, but often they don't:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz etc...
In these cases it's really easy to get, say, the IMAGE column by using tr to squeeze runs of spaces:
$ docker ps | tr --squeeze-repeats ' '
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz
Now you can pipe this (without the pesky header row) to cut:
$ docker ps | tr --squeeze-repeats ' ' | tail -n +2 | cut -d ' ' -f 2
foo
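The same pipeline can be tried without docker installed. The sketch below uses a made-up, space-aligned listing (not real docker output) and tr -s, the short form of --squeeze-repeats; the variable names are illustrative.

```shell
# Made-up listing in the style of a space-aligned table (not real docker output).
listing=$(printf '%s\n' \
  'ID      IMAGE   STATUS' \
  'abc123  nginx   Up' \
  'def456  redis   Exited')

# Squeeze runs of spaces, drop the header row, keep the second column.
images=$(printf '%s\n' "$listing" | tr -s ' ' | tail -n +2 | cut -d ' ' -f 2)

printf '%s\n' "$images"
```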

Get string between characters in Linux

I have text file, which has data like this:
asd.www.aaa.com
abc.abc.co
look at me
asd.www.bbb.com
bzc.bzc.co
asd.www.ddd.com
hello world
www.eee.com
xx.yy.z
I want the strings that are surrounded like this: "asd.www.[i want this string].com".
So my output will be like:
aaa
bbb
ddd
try:
grep -Po '^asd\.www\.\K[^.]*(?=\.com)' file
If asd could be in the middle of the string, remove the leading ^.
There could be other corner cases, such as greedy matching; it depends on your input.
I suggested cut originally but I misread your question. So I'm going to post an alternative with awk instead. You are looking for the third column of your text input where there are a total of four columns.
awk -F '.' '{ if ($4 != "") print $3 }' file.txt
It splits each line on . and prints column $3 only if column $4 is not empty. This should yield the following given your example text:
aaa
bbb
ddd
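Where grep -P is not available, a sed substitution with a capture group is one possible alternative. This is a sketch that assumes each wanted line consists of nothing but asd.www.<name>.com; the variable names are illustrative.

```shell
# Sample data from the question.
input=$(printf '%s\n' asd.www.aaa.com abc.abc.co 'look at me' asd.www.bbb.com \
  bzc.bzc.co asd.www.ddd.com 'hello world' www.eee.com xx.yy.z)

# Keep only the part between "asd.www." and ".com"; -n plus the p flag
# suppresses every line that does not match.
middle=$(printf '%s\n' "$input" | sed -n 's/^asd\.www\.\([^.]*\)\.com$/\1/p')

printf '%s\n' "$middle"
```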

Extract Lines when Column K is empty with AWK/Perl

I have data that looks like this:
foo 78 xxx
bar yyy
qux 99 zzz
xuq xyz
They are tab delimited.
How can I extract lines where column 2 is empty, yielding
bar yyy
xuq xyz
I tried this but doesn't seem to work:
awk '$2==""' myfile.txt
You need to specifically set the field separator to a TAB character:
> cat qq.in
foo 78 xxx
bar yyy
qux 99 zzz
xuq xyz
> cat qq.in | awk 'BEGIN {FS="\t"} $2=="" {print}'
bar yyy
xuq xyz
The default behaviour for awk is to treat an FS of SPACE (the default) as a special case. From the man page:
In the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines. (my italics)
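The effect of that special case can be seen by running the same test with and without an explicit tab separator. This is a small sketch with the sample data rebuilt via printf so that the tabs are explicit; the variable names are illustrative.

```shell
# Tab-delimited sample; rows 2 and 4 have an empty second column.
input=$(printf 'foo\t78\txxx\nbar\t\tyyy\nqux\t99\tzzz\nxuq\t\txyz\n')

# Default FS: runs of blanks collapse, so "bar<TAB><TAB>yyy" has $2 == "yyy".
default_fs=$(printf '%s\n' "$input" | awk '$2 == ""')

# FS set to a single tab: every tab separates fields, so $2 can be empty.
tab_fs=$(printf '%s\n' "$input" | awk 'BEGIN { FS = "\t" } $2 == ""')

if [ -z "$default_fs" ]; then
  echo 'default FS: no lines matched'
fi
printf '%s\n' "$tab_fs"
```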
perl -F'\t' -lane 'print if $F[1] eq q//' myfile.txt
Command Switches
-F tells Perl what delimiter to autosplit on (tabs in this case)
-a enables autosplit mode, splitting each line on the specified delimiter to populate an array @F
-l automatically appends a newline "\n" at the end of each printed line
-n processes the file line-by-line
-e treats the first quoted argument as code and not a filename
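grep -P '^[^\t]+\t\t' myfile.txt
This matches each line whose first column is followed immediately by two tabs, i.e. nothing between the first and second tab. The -P option is specific to GNU grep; basic and extended grep patterns do not treat \t as a tab.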
