Printing First Variable in Awk but Only If It's Less than X - linux

I have a file with words and I need to print only the lines that are less than or equal to 4 characters but I'm having trouble with my code. There is other text on the end of the lines but I shortened it for here.
file:
John Doe
Jane Doe
Mark Smith
Abigail Smith
Bill Adams
What I want to do is print the names that have 4 characters or fewer.
What I've tried:
awk '$1 <= 4 {print $1}' inputfile
What I'm hoping to get:
John
Jane
Mark
Bill
So far, I've got nothing. Either it prints out everything, with no length restrictions or it doesn't even print anything at all. Could someone take a look at this and see what they think?
Thanks

First, let's understand why
awk '$1 <= 4 {print $1}' inputfile
does not do what you want. $1 <= 4 compares the value of the first field with 4, not its length. A field such as
John
does not look like a number, so AWK compares it with 4 as a string, and since "John" sorts after "4" nothing is printed. Forcing a numeric comparison (for example $1+0 <= 4) does not help either. As the GNU AWK manual's Strings And Numbers section puts it
A string is converted to a number by interpreting any numeric prefix
of the string as numerals(...)Strings that can’t be interpreted as
valid numbers convert to zero.
Therefore the numeric value of John, from GNU AWK's point of view, is zero, and every line would pass that test.
In order to get the desired output you can use the length function, which returns the number of characters, as follows
awk 'length($1)<=4{print $1}' inputfile
or alternatively a pattern that matches from 0 to 4 characters, that is
awk '$1~/^.{0,4}$/{print $1}' inputfile
where $1~ means check whether the 1st field matches, . denotes any character, {0,4} means from 0 to 4 repetitions, ^ is the beginning of the string and $ is the end of the string (these two are required, as otherwise longer strings would also match, since they contain a substring matching .{0,4})
Both commands, for an inputfile containing
John Doe
Jane Doe
Mark Smith
Abigail Smith
Bill Adams
give output
John
Jane
Mark
Bill
(tested in gawk 4.2.1)
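If the limit should not be hard-coded, one option is to pass it in with -v. The following is only a minimal sketch, assuming a POSIX awk; sample, short_names and max are hypothetical names introduced here for illustration, and the data is the sample from the question.

```shell
#!/bin/sh
# sample: hypothetical helper that prints the question's sample data.
sample() {
    printf '%s\n' 'John Doe' 'Jane Doe' 'Mark Smith' 'Abigail Smith' 'Bill Adams'
}

# short_names: print the first field when it has at most $1 characters.
# "max" is an assumed variable name; -v copies the shell value into awk.
short_names() {
    awk -v max="$1" 'length($1) <= max { print $1 }'
}

sample | short_names 4
```

Calling short_names with a different number changes the cut-off without touching the awk program.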

Related

grep string after first occurrence of numbers

How do I get a string after the first occurrence of a number?
For example, I have a file with multiple lines:
34 abcdefg
10 abcd 123
999 abc defg
I want to get the following output:
abcdefg
abcd 123
abc defg
Thank you.
You could use Awk for this: loop through all the fields in each line up to NF (the number of fields in the line) and, once a field containing digits is found, print the field next to it. The break statement exits the for loop after the first match. Note that this prints only the single field that follows the number, not the rest of the line.
awk '{ for(i=1;i<=NF;i++) if ($i ~ /[[:digit:]]+/) { print $(i+1); break } }' file
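The loop above prints a single field. If the whole remainder of the line is wanted, as in the question's expected output, a possible variant is to delete the leading number instead of searching for it. This is a sketch under the assumption that the number is the first run of digits on the line; sample and after_number are hypothetical names.

```shell
#!/bin/sh
# sample: hypothetical helper that prints the question's sample data.
sample() {
    printf '%s\n' '34 abcdefg' '10 abcd 123' '999 abc defg'
}

# after_number: drop everything up to and including the first run of
# digits, plus any whitespace that follows it, then print what is left.
after_number() {
    awk '{ sub(/^[^0-9]*[0-9]+[[:space:]]*/, ""); print }'
}

sample | after_number
```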
It is not clear exactly what you want, but you can try to express it in sed.
Remove everything up to the first digit, then the digits that follow and any spaces.
sed 's/[^0-9]*[0-9]\+ *//'
Imagine the following input file:
001 ham
03spam
3 spam with 5 eggs
A quick solution with awk would be :
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $1}' <file>
This line replaces the first run of non-digit characters (possibly empty) followed by a run of digits with an empty string (""). This way $0 is redefined and you can print either its first field or the whole remainder of the line. This line gives exactly the following output.
ham
spam
spam
If you are interested in the remainder of the line
awk '{sub(/[^0-9]*[0-9]+ */,"",$0); print $0}' <file>
This will have as an output :
ham
spam
spam with 5 eggs
Be aware that the extra " *" is needed in the regular expression to remove the spaces that follow the number. Without it, the first and third lines keep a leading space:
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $0}' <file>
 ham
spam
 spam with 5 eggs
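Leading spaces are easy to miss in printed output, so one way to compare the two regular expressions is to wrap each result in brackets. This is just an illustrative sketch; sample, strip_number and strip_number_and_spaces are hypothetical names, and the data is the three-line input used above.

```shell
#!/bin/sh
# sample: hypothetical helper that prints the three input lines.
sample() {
    printf '%s\n' '001 ham' '03spam' '3 spam with 5 eggs'
}

# Without " *": the spaces after the number stay in $0.
strip_number() {
    awk '{ sub(/[^0-9]*[0-9]+/, ""); print "[" $0 "]" }'
}

# With " *": the spaces after the number are removed as well.
strip_number_and_spaces() {
    awk '{ sub(/[^0-9]*[0-9]+ */, ""); print "[" $0 "]" }'
}

sample | strip_number
sample | strip_number_and_spaces
```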
You can remove the first run of digits and spaces on each line using sed:
sed -E 's/[0-9 ]+//' file
grep can do the job:
$ grep -o -P '(?<=[0-9] ).*' inputFIle
abcdefg
abcd 123
abc defg
For completeness, here is a solution with perl:
$ perl -lne 'print $1 if /[0-9]+\s*(.*)/' inputFIle
abcdefg
abcd 123
abc defg

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is the same as the sample shown, then the following code may help.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
With the parentheses you store in groups what's in between.
First group is everything after name=" till the first ". [^"] means "not a double-quote".
Second group is simply "one or more digits at the end of the line, preceded by a space".
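To watch the two groups at work, the sample lines can be piped through the same substitution. This is a sketch only: sample and name_and_count are hypothetical names, and -E is assumed here in place of -r to request extended regular expressions.

```shell
#!/bin/sh
# sample: hypothetical helper that prints the question's sample data.
sample() {
    printf '%s\n' \
        'application{id="1821", name="app-name_01"} 0' \
        'application{id="1822", name="myapp-02", optionalFlag="false"} 1' \
        'application{id="1823", optionalFlag="false", name="app_name_public"} 3'
}

# name_and_count: keep group 1 (the text inside name="...") and
# group 2 (the digits at the end of the line), separated by a space.
name_and_count() {
    sed -E 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/'
}

sample | name_and_count
```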

Add a prefix to beginning of each line depending on its size

I have a file with postal codes and city names like this :
1234 foo
4321 foobar
64324 foofoobar
92001 bar
with a \t between the numbers and the city name. I would like to add the prefix 0 to each line with 4 numbers, using sed or a shell script
01234 foo
04321 foobar
64324 foofoobar
92001 bar
Thanks for the help.
Assuming all the postcodes are numeric, you can use the printf command in awk for the task, as per the following transcript (the v characters are there just to show where the tab stops are):
pax> printf "v\tv\tv\n" ; cat infile
v v v
1234 rio xyz
4321 munich abc
64324 perth def
92001 paris qqq
pax> awk 'BEGIN {OFS = "\t"} {arg1 = $1; $1 = ""; printf "%05d%s\n", arg1, $0}' infile
01234 rio xyz
04321 munich abc
64324 perth def
92001 paris qqq
The awk command first extracts and removes the first argument(a) from each line, then formats it along with the changed line.
You'll notice I've also set the output field separator to a tab character since that appears to be what you're using. That may not be necessary, it just depends on how closely you want the output data to match the input.
(a) Technically it just sets it to an empty string, the argument itself still exists. That's why there's no tab needed between the %05d and %s in the format string, since the tab is still there.
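A possible variation on the same idea is to split and join on tabs and rewrite only the first field, which avoids blanking $1 by hand. This is a minimal sketch assuming, like the answer above, that every code is purely numeric and that a tab separates it from the rest; pad_codes is a hypothetical name.

```shell
#!/bin/sh
# pad_codes: zero-pad the first tab-separated field to five digits.
# Assigning to $1 makes awk rebuild the line using OFS (a tab here),
# and the trailing 1 is an always-true pattern that prints the line.
pad_codes() {
    awk 'BEGIN { FS = OFS = "\t" } { $1 = sprintf("%05d", $1) } 1'
}

printf '1234\tfoo\n64324\tfoofoobar\n' | pad_codes
```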
sed 's/^[0-9]\{4\}[^0-9]/0&/' filename
(But paxdiablo's answer is more legible, I think.)

How to merge two documents in Bash?

Right now I have a bash shell script that takes the input of a text file with the syntax for example, "Smith, Bob". The end goal is to take the first letter of the first name and append the first 7 characters of the last name. I am currently in a pickle.
echo "Extracting first letter"
cut -d "," -f2 $1 > first.txt
cut -b2 first.txt > second.txt
echo "First letter extracted"
echo "Extracting 7 characters"
cut -d "," -f1 $1 > letters.txt
cat second.txt | tr '[:upper:]' '[:lower:]' > lowernames.txt
I have two files, one with the first letter, the other with the first 7 characters, but can't combine the two. Any suggestions?
Here are three solutions, one using sed, one using awk, and one using python:
Using sed
Here is a sed solution. Using the same test file as sehe:
$ cat file
Smith, Bob
Doe, John
Snow, John
Pattitucci, John
$ sed -E 's/([^,]{1,7})[^,]*,\s*(\S).*/\2\1/' file
BSmith
JDoe
JSnow
JPattitu
How it works
The idea is to capture the first 7 letters of the last name to group 1 and the first letter of the first name to group 2. The regex to do that consists of the following parts:
([^,]{1,7})
This captures up to seven characters of the last name.
[^,]*,
This matches any characters after the first seven of the last name and the comma which follows.
\s*
This matches any spaces which follow the comma
(\S)
This matches the first character of the first name
.*
This matches any remaining characters of the first name.
Using awk
$ awk -F', *' '{print substr($2,1,1) substr($1,1,7)}' file
BSmith
JDoe
JSnow
JPattitu
How it works
-F', *'
This declares the field separator to be a comma followed by zero or more spaces
substr($1,1,7)
This selects the first seven characters of the last name
substr($2,1,1)
This selects the first character of the first name
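The tr call in the question suggests the result may be wanted in lowercase. If so, one option is to wrap the concatenation in awk's tolower. This is a sketch under that assumption; sample and usernames are hypothetical names, and the data is the test file shown above.

```shell
#!/bin/sh
# sample: hypothetical helper that prints the test file's contents.
sample() {
    printf '%s\n' 'Smith, Bob' 'Doe, John' 'Snow, John' 'Pattitucci, John'
}

# usernames: first letter of the first name, then up to seven
# characters of the last name, with the result folded to lowercase.
usernames() {
    awk -F', *' '{ print tolower(substr($2, 1, 1) substr($1, 1, 7)) }'
}

sample | usernames
```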
Using python
$ python3 -c 'for line in open("file"): last, first=line.strip().split(", "); print(first[:1] + last[:7])'
BSmith
JDoe
JSnow
JPattitu
You can do this without any external process:
while read surname firstname
do
surname="${surname%,}"
echo "${firstname:0:1}${surname:0:7}"
done
See it Live On IdeOne
Input
Smith, Bob
Doe, John
Snow, John
Pattitucci, John
Output
BSmith
JDoe
JSnow
JPattitu
using awk :
awk -F ', ' '{printf("%s%s\n",substr($2,1,1),substr($1,1,7))}' file
input:
Smith, Bob
Doe, John
Snow, John
Pattitucci, John
output:
BSmith
JDoe
JSnow
JPattitu
The input text is split on ', '; substr then extracts the first character of the 2nd field and the first seven characters of the 1st field.

Expand one column while preserving another

I am trying to get column one repeated for every value in column two which needs to be on a new line.
cat ToExpand.txt
Pete horse;cat;dog
Claire car
John house;garden
My first attempt:
cat expand.awk
BEGIN {
FS="\t"
RS=";"
}
{
print $1 "\t" $2
}
awk -f expand.awk ToExpand.txt
Pete horse
cat
dog
Claire car
John
garden
The desired output is:
Pete horse
Pete cat
Pete dog
Claire car
John house
John garden
Am I on the right track here or would you use another approach? Thanks in advance.
You could also change the FS value into a regex and do something like this:
awk -F"\t|;" -v OFS="\t" '{for(i=2;i<=NF;i++) print $1, $i}' ToExpand.txt
Pete horse
Pete cat
Pete dog
Claire car
John house
John garden
I'm assuming that:
The first tab is the delimiter for the name
There's only one tab delimiter - If tab delimited data occurs after the ; section use fedorqui's implementation.
It's using an alternate form of setting the OFS value ( using the -v flag ) and loops over the fields after the first to print the expected output.
You can think of RS in your example as making "lines" out of your data ( records really ) and your print block is acting on those "lines"(records) instead of the normal newline. Then each record is further parsed by your FS. That's why you get the output from your first attempt. You can explore that by printing out the value of NF in your example.
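The suggestion to print NF can be tried like this. It is a small sketch that assumes the two columns in ToExpand.txt are separated by a single tab, as the FS setting in the question implies; count_fields is a hypothetical name and the data is fed in with printf instead of a file.

```shell
#!/bin/sh
# count_fields: use the question's FS and RS settings and report,
# for each record, its number and how many fields awk found in it.
count_fields() {
    awk 'BEGIN { FS = "\t"; RS = ";" } { print NR ": NF=" NF }'
}

printf 'Pete\thorse;cat;dog\nClaire\tcar\nJohn\thouse;garden\n' | count_fields
```

The third record runs from dog up to house, across two newlines, because only the semicolon ends a record; that is why it reports three tab-separated fields.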
Try:
awk '{gsub(/;/,ORS $1 OFS)}1' OFS='\t' file
This replaces every occurrence of a semicolon with a newline, the first field and the output field separator.
Output:
Pete horse
Pete cat
Pete dog
Claire car
John house
John garden

Resources