How to merge two documents in Bash? - linux

Right now I have a bash shell script that takes a text file as input, with lines in the form "Smith, Bob". The end goal is to take the first letter of the first name and append the first 7 characters of the last name. I am currently in a pickle.
echo "Extracting first letter"
cut -d "," -f2 $1 > first.txt
cut -b2 first.txt > second.txt
echo "First letter extracted"
echo "Extracting 7 characters"
cut -d "," -f1 $1 > letters.txt
cat second.txt | tr '[:upper:]' '[:lower:]' > lowernames.txt
I have two files, one with the first letter, the other with the first 7 characters, but can't combine the two. Any suggestions?

Here are three solutions, one using sed, one using awk, and one using python:
Using sed
Here is a sed solution. Using the same test file as sehe:
$ cat file
Smith, Bob
Doe, John
Snow, John
Pattitucci, John
$ sed -E 's/([^,]{1,7})[^,]*,\s*(\S).*/\2\1/' file
BSmith
JDoe
JSnow
JPattitu
How it works
The idea is to capture the first 7 letters of the last name in group 1 and the first letter of the first name in group 2. The regex to do that consists of the following parts:
([^,]{1,7})
This captures up to seven characters of the last name.
[^,]*,
This matches any characters after the first seven of the last name and the comma which follows.
\s*
This matches any spaces which follow the comma.
(\S)
This matches the first character of the first name.
.*
This matches any remaining characters of the first name.
Using awk
$ awk -F', *' '{print substr($2,1,1) substr($1,1,7)}' file
BSmith
JDoe
JSnow
JPattitu
How it works
-F', *'
This declares the field separator to be a comma followed by zero or more spaces
substr($1,1,7)
This selects the first seven characters of the last name
substr($2,1,1)
This selects the first character of the first name
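The script in the question also lowercases its result with tr. Here is a minimal sketch of the same awk approach with that step folded in, assuming every input line looks like "Last, First". The function name make_usernames is made up for illustration:

```shell
# make_usernames: hypothetical helper. Reads "Last, First" lines from the
# given files (or from stdin) and prints the first initial followed by up
# to 7 characters of the last name, all lowercased.
make_usernames() {
  awk -F', *' '{ print tolower(substr($2, 1, 1) substr($1, 1, 7)) }' "$@"
}

printf 'Smith, Bob\nPattitucci, John\n' | make_usernames
```

tolower is a standard awk function, so this should not need gawk specifically.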
Using python
$ python3 -c 'for line in open("file"): last, first=line.strip().split(", "); print(first[:1] + last[:7])'
BSmith
JDoe
JSnow
JPattitu

You can do this without any external process:
while read -r surname firstname
do
    surname="${surname%,}"
    echo "${firstname:0:1}${surname:0:7}"
done
Input
Smith, Bob
Doe, John
Snow, John
Pattitucci, John
Output
BSmith
JDoe
JSnow
JPattitu
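The loop above splits on whitespace, so a surname containing a space would end up partly in firstname. A possible variation, sketched under the assumption that bash is the shell and that each line contains exactly one comma, splits on the comma instead. The function name usernames_from_stdin and the choice to drop spaces from the surname are my own assumptions, not part of the answer:

```shell
# usernames_from_stdin: hypothetical name. Needs bash (uses local and
# ${var:offset:length}). Splits each line at the comma so that surnames
# containing spaces are kept together.
usernames_from_stdin() {
  local surname firstname
  while IFS=, read -r surname firstname; do
    # Remove the spaces that follow the comma.
    firstname="${firstname#"${firstname%%[! ]*}"}"
    # Assumption: spaces inside the surname are dropped before truncating.
    surname="${surname// /}"
    printf '%s%s\n' "${firstname:0:1}" "${surname:0:7}"
  done
}

printf 'Smith, Bob\nvan der Berg, Jan\n' | usernames_from_stdin
```

Like the original loop, this uses only shell builtins and starts no external process per line.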

Using awk:
awk -F ', ' '{printf("%s%s\n",substr($2,1,1),substr($1,1,7))}' file
input:
Smith, Bob
Doe, John
Snow, John
Pattitucci, John
output:
BSmith
JDoe
JSnow
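JPattitu
The input text is split on ', '; substr($2,1,1) extracts the first character of the second field (the first name) and substr($1,1,7) extracts the first seven characters of the first field (the last name).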
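For the literal question in the title, joining two files line by line, paste is the usual tool. A minimal sketch, assuming both files have the same number of lines; join_lines is a made-up name and seven.txt is a hypothetical intermediate file:

```shell
# join_lines: hypothetical helper that glues line N of the first file to
# line N of the second file with nothing in between.
# In paste's -d list, \0 stands for the empty string, not a NUL byte.
join_lines() {
  paste -d '\0' "$1" "$2"
}

# Possible use with the files from the question (seven.txt is hypothetical):
#   cut -c1-7 letters.txt > seven.txt
#   join_lines second.txt seven.txt
```

The answers above avoid the temporary files altogether, which is usually simpler.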

Related

Printing First Variable in Awk but Only If It's Less than X

I have a file with words and I need to print only the first words that are 4 characters or fewer, but I'm having trouble with my code. There is other text on the end of the lines but I shortened it for here.
file:
John Doe
Jane Doe
Mark Smith
Abigail Smith
Bill Adams
What I want to do is print the names that have 4 characters or fewer.
What I've tried:
awk '$1 <= 4 {print $1}' inputfile
What I'm hoping to get:
John
Jane
Mark
Bill
So far, I've got nothing. Either it prints out everything, with no length restrictions or it doesn't even print anything at all. Could someone take a look at this and see what they think?
Thanks
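First, let's understand why
awk '$1 <= 4 {print $1}' inputfile
does not do what you want. $1 <= 4 never looks at the length of the field; it compares the field's value with 4. A field such as
John
does not look like a number, so awk treats it as a string and compares it with the string "4". In the usual collation order letters sort after digits, so the test is false and nothing is printed. If you force a numeric comparison instead, for example with $1+0 <= 4, the string is converted to a number. As the GNU AWK manual section Strings And Numbers puts it
A string is converted to a number by interpreting any numeric prefix
of the string as numerals(...)Strings that can’t be interpreted as
valid numbers convert to zero.
Therefore the numeric value of John from GNU AWK's point of view is zero, 0 <= 4 is true, and every line is printed.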
To get the desired output you can use the length function, which returns the number of characters, as follows
awk 'length($1)<=4{print $1}' inputfile
or alternatively match a pattern of 0 to 4 characters, that is
awk '$1~/^.{0,4}$/{print $1}' inputfile
where $1~ means check whether the 1st field matches, . denotes any character, {0,4} means from 0 to 4 repetitions, ^ is the beginning of the string and $ is the end of the string (these 2 anchors are required, as otherwise longer strings would match too, since they contain a substring matching .{0,4})
Both commands, for inputfile
John Doe
Jane Doe
Mark Smith
Abigail Smith
Bill Adams
give output
John
Jane
Mark
Bill
(tested in gawk 4.2.1)
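If the limit needs to change, the same length test can take it as a parameter. A small sketch; the function name short_first_words and the awk variable max are made up for illustration:

```shell
# short_first_words: hypothetical helper. Prints the first field of every
# line whose first field has at most MAX characters.
# Usage: short_first_words MAX [file...]
short_first_words() {
  max=$1
  shift
  awk -v max="$max" 'length($1) <= max { print $1 }' "$@"
}

printf 'John Doe\nAbigail Smith\nBill Adams\n' | short_first_words 4
```

Here the comparison is numeric on both sides, because length() returns a number and a value passed with -v that looks like a number is treated as one.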

How can I use grep and regular expression to display names with just 3 characters

I am new to grep and UNIX. I have a sample of data and want to display all the first names that contain exactly three characters, e.g. Lee_example, but I am having some difficulty doing that. I am currently using cat file.txt|grep -E "[A-Z][a-z]{2}", but it displays all the names that contain at least 3 characters rather than exactly 3.
Sample data
name
number
Lee_example
1
Hector_exaple
2
You need to match the _ after the first name.
grep -E "[A-Z][a-z]{2}_"
With awk:
awk -F_ 'length($1)==3{print $1}'
-F_ tells awk to split the input lines at _. length($1)==3 checks whether the first field (the name) is 3 characters long, and {print $1} prints the name in that case.
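One caveat with the grep pattern: without an anchor it can match in the middle of a longer name, for example the Lee_ inside a hypothetical McLee_other. A sketch that anchors the match at the start of the line and prints only the name, assuming the name always starts the line and begins with a capital letter; three_letter_names is a made-up name:

```shell
# three_letter_names: hypothetical helper. Keeps lines that start with a
# capital letter, two lowercase letters and an underscore, then prints
# only the part before the first underscore.
three_letter_names() {
  grep -E '^[A-Z][a-z]{2}_' "$@" | cut -d_ -f1
}

printf 'Lee_example\nHector_exaple\nMcLee_other\n' | three_letter_names
```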

Filtering by author and counting all numbers in txt file - Linux terminal, bash

I need help with two things.
1) file.txt is a list of films in which the author, year of publication, and title are on separate lines, e.g.
author1
year1
title1
author2
year2
title2
author3
year3
title3
author4
year4
title4
I need to show only book titles whose author is "Joanne Rowling"
2)
one.txt contains numbers and letters for example like:
dada4dawdaw54 232dawdawdaw 53 34dadasd
77dkwkdw
65 23 laka 23
I need to sum all of them and get the total; here it should be 561
I tried something like that:
awk '{for(i=1;i<=NF;i++)s+=$i}END{print s}' plik2.txt
but it doesn't make sense
For the 1st question, the solution of okulkarni is great.
For the 2nd question, one solution is
sed 's/[^0-9]/ /g' one.txt | awk '{for(i=1;i<=NF;i++) sum+= $i} END { print sum}'
The sed command converts all non-numeric characters into spaces, while the awk command sums the numbers, line by line.
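The same idea can be done in a single awk process by splitting each line on runs of non-digit characters. A minimal sketch, assuming only non-negative integers matter (signs and decimal points are treated as separators); sum_numbers is a made-up name:

```shell
# sum_numbers: hypothetical helper. Adds up every run of digits found in
# the given files (or stdin).
sum_numbers() {
  awk '{
    # Split the line wherever there is a run of non-digit characters.
    n = split($0, parts, /[^0-9]+/)
    # Empty pieces (line starts or ends with a non-digit) add 0.
    for (i = 1; i <= n; i++) sum += parts[i]
  }
  END { print sum + 0 }' "$@"
}

printf 'a1b22 3\nx10\n' | sum_numbers
```

The sample input here is made up; its digit runs are 1, 22, 3 and 10.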
For the first question, you just need to use grep. Specifically, you can do grep -A 2 "Joanne Rowling" file.txt. This will show all lines with "Joanne Rowling" and the two lines immediately after.
For the second question, you can also use grep by doing grep -Eo '[0-9]+' one.txt | paste -sd+ - | bc. grep prints each number on its own line, paste joins those lines with a + between them, and bc evaluates the resulting sum.
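If only the titles are wanted, without the author and year lines that grep -A 2 also prints, awk can track the position inside each record. A sketch under the assumption that the file consists strictly of three-line records (author, year, title) with no blank lines; titles_by_author is a made-up name:

```shell
# titles_by_author: hypothetical helper.
# Usage: titles_by_author AUTHOR [file...]
# Assumes records of exactly three lines: author, year, title.
titles_by_author() {
  author=$1
  shift
  awk -v author="$author" '
    NR % 3 == 1 { current = $0 }                  # author line: remember it
    NR % 3 == 0 && current == author { print }    # title line of a match
  ' "$@"
}

printf 'Joanne Rowling\n1997\nTitle A\nSomeone Else\n2001\nTitle B\n' |
  titles_by_author 'Joanne Rowling'
```

The names and titles in the sample input are placeholders.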

grep string after first occurrence of numbers

How do I get a string after the first occurrence of a number?
For example, I have a file with multiple lines:
34 abcdefg
10 abcd 123
999 abc defg
I want to get the following output:
abcdefg
abcd 123
abc defg
Thank you.
You could use Awk for this: loop through the fields of each line up to NF (the last field) and, at the first field that contains a digit, print the field that follows it. The break statement exits the for loop after that first match. Note that this prints only the next field, not the whole remainder of the line.
awk '{ for(i=1;i<=NF;i++) if ($i ~ /[[:digit:]]+/) { print $(i+1); break } }' file
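To get the whole remainder of the line with the same loop, the fields after the match can be joined back together. A sketch, assuming the number stands alone as a field and that collapsing runs of whitespace to single spaces is acceptable; rest_after_number is a made-up name:

```shell
# rest_after_number: hypothetical helper. For each line, finds the first
# field made up only of digits and prints all fields after it, joined by
# single spaces.
rest_after_number() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^[0-9]+$/) {
        out = ""
        for (j = i + 1; j <= NF; j++)
          out = out (j > i + 1 ? " " : "") $j
        print out
        break
      }
    }
  }' "$@"
}

printf '34 abcdefg\n10 abcd 123\n999 abc defg\n' | rest_after_number
```

Unlike the one-liner, the match here requires the field to be entirely digits, which is a stricter test than "contains a digit".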
It is not clear what you exactly want, but you can try to express it in sed.
Remove everything until the first digit, the next digits and any spaces.
sed 's/[^0-9]*[0-9]\+ *//'
Imagine the following input file:
001 ham
03spam
3 spam with 5 eggs
A quick solution with awk would be :
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $1}' <file>
This replaces the first run of non-digit characters (possibly empty) followed by a run of digits with the empty string (""). Because $0 is modified, awk re-splits the record, so you can print the new first field or the remainder of the line. This line gives exactly the following output.
ham
spam
spam
If you are interested in the remainder of the line
awk '{sub(/[^0-9]*[0-9]+ */,"",$0); print $0}' <file>
This will have as an output :
ham
spam
spam with 5 eggs
Be aware that the extra " *" in the regular expression is needed to remove the spaces that follow the number. Without it,
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $0}' <file>
keeps those leading spaces in the output:
 ham
spam
 spam with 5 eggs
You can remove digits and whitespaces using sed:
sed -E 's/[0-9 ]+//' file
grep can do the job:
$ grep -o -P '(?<=[0-9] ).*' inputFIle
abcdefg
abcd 123
abc defg
For completeness, here is a solution with perl:
$ perl -lne 'print $1 if /[0-9]+\s*(.*)/' inputFIle
abcdefg
abcd 123
abc defg

Extract values from a fixed-width column

I have text file named file that contains the following:
Australia              AU      10
New Zealand            NZ       1
...
If I use the following command to extract the country names from the first column:
awk '{print $1}' file
I get the following:
Australia
New
...
Only the first word of each country name is output.
How can I get the entire country name?
Try this:
$ awk '{print substr($0,1,15)}' file
Australia
New Zealand
To complement Raymond Hettinger's helpful POSIX-compliant answer:
It looks like your country-name column is 23 characters wide.
In the simplest case, if you don't need to trim trailing whitespace, you can just use cut:
# Works, but has trailing whitespace.
$ cut -c 1-23 file
Australia
New Zealand
Caveat: GNU cut is not UTF-8 aware, so if the input is UTF-8-encoded and contains non-ASCII characters, the above will not work correctly.
To trim trailing whitespace, you can take advantage of GNU awk's nonstandard FIELDWIDTHS variable:
# Trailing whitespace is trimmed.
$ awk -v FIELDWIDTHS=23 '{ sub(" +$", "", $1); print $1 }' file
Australia
New Zealand
FIELDWIDTHS=23 declares the first field (reflected in $1) to be 23 characters wide.
sub(" +$", "", $1) then removes trailing whitespace from $1 by replacing any nonempty run of spaces (" +") at the end of the field ($1) with the empty string.
However, your Linux distro may come with Mawk rather than GNU Awk; use awk -W version to determine which one it is.
For a POSIX-compliant solution that trims trailing whitespace, extend Raymond's answer:
# Trailing whitespace is trimmed.
$ awk '{ c=substr($0, 1, 23); sub(" +$", "", c); print c}' file
Australia
New Zealand
To get rid of the last two columns:
awk 'NF>2 && NF-=2' file
NF>2 is a guard that keeps only records with more than 2 fields. If your data is consistent you can drop it and simply use
awk 'NF-=2' file
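Not every awk implementation rebuilds $0 when NF is decreased, so the NF-=2 trick may not be portable. A sketch of an alternative that removes the last two whitespace-separated fields with sub() instead; drop_last_two is a made-up name, and unlike the guarded version it passes lines with fewer than three fields through unchanged:

```shell
# drop_last_two: hypothetical helper. Strips the last two
# whitespace-separated fields (and the whitespace before them) from
# every line that has at least three fields.
drop_last_two() {
  awk '{
    sub(/[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]*$/, "")
    print
  }' "$@"
}

printf 'Australia AU 10\nNew Zealand NZ 1\n' | drop_last_two
```

Because $0 is edited in place rather than rebuilt from fields, spacing inside the remaining text is preserved.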
This isn't relevant in the case where your data has spaces, but often it doesn't:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz etc...
In these cases it's really easy to get, say, the IMAGE column using tr to remove multiple spaces:
$ docker ps | tr --squeeze-repeats ' '
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz
Now you can pipe this (without the pesky header row) to cut:
$ docker ps | tr --squeeze-repeats ' ' | tail -n +2 | cut -d ' ' -f 2
bar
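When no column value contains spaces, awk can do the squeeze, header skip and field selection in one step, since its default field splitting already treats runs of blanks as a single separator. A sketch with made-up input standing in for the docker ps output; second_column is a made-up name:

```shell
# second_column: hypothetical helper. Skips the header row and prints the
# second whitespace-separated field of every remaining line.
second_column() {
  awk 'NR > 1 { print $2 }' "$@"
}

printf 'CONTAINER ID   IMAGE   COMMAND\nfoo            bar     baz\n' | second_column
```

As with the cut version, the header "CONTAINER ID" is two words, but only the data rows are split, so field 2 of a data row is the IMAGE value.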
