Print first N words of a file - linux

Is there any way to print the first N words of a file? I've tried cut but it reads a document line-by-line. The only solution I came up with is:
sed ':a;N;$!ba;s/\n/δ/g' file | cut -d " " -f -20 | sed 's/δ/\n/g'
Essentially, replacing newlines with a character that doesn't exist in the file, applying cut with space as the delimiter, and then restoring the newlines.
Is there any better solution?

You could use awk to print the first n words:
$ awk 'NR<=8{print;next}{exit}' RS='[[:blank:]]+|\n' file
This would print the first 8 words. Each word is output on a separate line; are you looking to keep the original format of the file?
Edit:
The following will preserve the original format of the file:
awk -v n=8 'n==c{exit}n-c>=NF{print;c+=NF;next}{for(i=1;i<=n-c;i++)printf "%s ",$i;print x;exit}' file
Demo:
$ cat file
one two
thre four five six
seven 8 9
10
$ awk -v n=8 'n==c{exit}n-c>=NF{print;c+=NF;next}{for(i=1;i<=n-c;i++)printf "%s ",$i;print x;exit}' file
one two
thre four five six
seven 8
A small caveat: if the last line printed doesn't use a single space as a separator, that line will lose its formatting.
$ cat file
one two
thre four five six
seven 8 9
10
# the 8th word fell on 3rd line: this line will be formatted with single spaces
$ awk -v n=8 'n==c{exit}n-c>=NF{print;c+=NF;next}{for(i=1;i<=n-c;i++)printf "%s ",$i;print x;exit}' file
one two
thre four five six
seven 8

Assuming words are sequences of non-whitespace characters separated by whitespace, you can use tr to convert the document to one-word-per-line format and then take the first N lines:
tr -s ' \011' '\012' < file | head -n $N
where N=20 or whatever value you want for the number of words. Note that tr is a pure filter; it only reads from standard input and only writes to standard output. The -s option 'squeezes' out duplicate replacements, so you get one newline per sequence of blanks or tabs in the input. (If there is leading white space in the file, you get an initial blank line. There are various ways to deal with that, such as grabbing the first N+1 lines of output instead, or filtering out blank lines.)
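For example, to take the first 5 words while also dropping any blank line that leading whitespace would produce (a small sketch run against the sample file from the awk answer above):
$ tr -s ' \011' '\012' < file | grep -v '^$' | head -n 5
one
two
thre
four
five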

Using GNU awk so we can set the RS to a regexp and access the matching string using RT:
$ cat file
the quick
brown fox jumped over
the
lazy
dog's back
$ gawk -v c=3 -v RS='[[:space:]]+' 'NR<=c{ORS=(NR<c?RT:"\n");print}' file
the quick
brown
$ gawk -v c=6 -v RS='[[:space:]]+' 'NR<=c{ORS=(NR<c?RT:"\n");print}' file
the quick
brown fox jumped over
$ gawk -v c=9 -v RS='[[:space:]]+' 'NR<=c{ORS=(NR<c?RT:"\n");print}' file
the quick
brown fox jumped over
the
lazy
dog's

Why not try turning your words into lines, and then just using head -n 20 instead?
For example:
for i in `cat somefile`; do echo $i; done | head -n 20
It's not elegant, but it does have considerably less line-noise regex.
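If you'd rather skip the explicit loop, xargs can do the same words-to-lines conversion (a sketch; note that xargs gives quotes and backslashes in the input special treatment, so this assumes plain text):
xargs -n1 < somefile | head -n 20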

One way with perl:
perl -lane 'push @a,@F;END{print "@a[0..9]"}' file
Note: indexing starts at zero so the example will print the first ten words. The words will be printed on a single line separated by a single space.
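For instance, run against the sample file from the awk answer above (which happens to contain exactly ten words), it would give:
$ perl -lane 'push @a,@F;END{print "@a[0..9]"}' file
one two thre four five six seven 8 9 10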

Related

How can I find the number of 8 letter words that do not contain the letter "e", using the grep command?

I want to find the number of 8 letter words that do not contain the letter "e" in a number of text files (*.txt). In the process I ran into two issues: my lack of understanding in quantifiers and how to exclude characters.
I'm quite new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i ".*[^e].*"
I need to include the cat command because grep otherwise includes the names of the text files in its output. The second pipe is there to get all the words into a list, and it works, but the last pipe was meant to find all the words that do not have the letter "e" in them, and it doesn't seem to work. (I thought ".*" matched any number of any character, including none, followed by a character that is not an "e", followed by another ".*".)
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]"
This command works to find the words that contain 8 characters, but it is quite clumsy, because I have to repeat "[a-z]" 8 times. I thought it could also be "[a-z]{8}", but that doesn't seem to work.
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]" | grep -i ".*[^e].*"
So finally, this would be my best guess; however, the third pipe is clumsy and the last pipe doesn't work.
You may use this grep:
grep -hEiwo '[a-df-z]{8}' *.txt
Here:
[a-df-z]{8}: Matches exactly 8 letters, none of which is e
-h: Don't print filename in output
-i: Ignore case search
-o: Print matches only
-w: Match complete words
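A quick sanity check with a throwaway file (the file name and contents are invented for illustration); piping through wc -l then gives the count the question asks for:
$ printf 'cat sandwich elephant backpack computer\n' > sample.txt
$ grep -hEiwo '[a-df-z]{8}' sample.txt
sandwich
backpack
$ grep -hEiwo '[a-df-z]{8}' sample.txt | wc -l
2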
If you are OK with GNU awk, and assuming that you want to print only the exact words (and there could be multiple matches in a line), you could try the following.
awk -v IGNORECASE="1" '{for(i=1;i<=NF;i++){if($i~/^[a-df-z]{8}$/){print $i}}}' *.txt
Or, without the use of IGNORECASE, one could try:
awk '{for(i=1;i<=NF;i++){if(tolower($i)~/^[a-df-z]{8}$/){print $i}}}' *.txt
NOTE: This assumes you want only exact 8-letter matches within the lines; 8-letter words followed by a punctuation mark will be excluded.
Here is a crazy thought with GNU awk:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{c+=NF}END{print c}' file
Or if you want to make it work only on a select set of characters:
awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' file
What this does is define each field to be a run of 8 characters (\w for word-constituent characters, or [a-df-z] for the selected set) enclosed by word boundaries (\< and \>). This is done with FPAT (note the gawk manual's "Gory Details" on escaping).
Sometimes you might also have words which contain diacritics, so you have to expand the character set. Then this might be the best solution:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /e/) c++}END{print c}' file
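As a rough illustration with the same invented sample file as in the grep example above, both the restricted-set version and the \w-plus-filter version report the same count (this assumes awk is GNU awk, which these answers require):
$ printf 'cat sandwich elephant backpack computer\n' > sample.txt
$ awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' sample.txt
2
$ awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /e/) c++}END{print c}' sample.txt
2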

Linux Bash: extracting text from file into variable

I haven't found anything that clearly answers my question. Although very close, I think...
I have a file with a line:
# Skipsdata for serienummer 1158
I want to extract the 4 digit number at the end and put it into a variable. This number changes from file to file, so I can't just search for "1158", but the "# Skipsdata for serienummer" part always remains the same.
I believe that either grep, sed or awk may be the answer, but I'm not 100% clear on their usage.
Using awk:
numberRequired=$(awk '/# Skipsdata for serienummer/{print $NF}' file)
printf "%s\n" "$numberRequired"
1158
You can use grep with the -o switch, which prints only the matched part instead of the whole line.
Print all numbers at the end of lines from file yourFile
grep -Po '\d+$' yourFile
Print all four digit numbers at the end of lines as described in your question:
grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
-P enables perl style regexes which support \d and especially \K.
\d matches any digit (0-9).
\d{4} matches exactly four digits.
\K lets grep forget the previously matched part, such that only the part afterwards is printed.
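For example, against the line from the question, and capturing the result into a variable (a sketch; the variable name serial is arbitrary):
$ printf '# Skipsdata for serienummer 1158\n' > yourFile
$ grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
1158
$ serial=$(grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile)
$ echo "$serial"
1158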
There are multiple ways to find your number. Assuming the input data is in a file called inputfile:
mynumber=$(sed -n 's/# Skipsdata for serienummer //p' <inputfile) will print only the number and ignore all the other lines;
mynumber=$(grep '^# Skipsdata for serienummer' inputfile | cut -d ' ' -f 5) will filter the relevant lines first, then only output the 5th field (the number)
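If you'd rather not call an external tool at all, bash's built-in regex matching can do the same job (a minimal sketch, assuming the marker line occurs once in inputfile and the number is always 4 digits):
while IFS= read -r line; do
  if [[ $line =~ ^"# Skipsdata for serienummer "([0-9]{4})$ ]]; then
    mynumber=${BASH_REMATCH[1]}
  fi
done < inputfile
echo "$mynumber"   # prints 1158 for the example line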

Use tr to replace single new lines but not multiple new lines

Hi I have a file with data in the following format:
262353824192
Motley Crue Too Fast For Love Vinyl LP Leathur Records LR123 rare 3rd pressing
http://www.ebay.co.uk/itm/Motley-Crue-Too-Fast-Love-Vinyl-LP-Leathur-Records-LR123-rare-3rd-pressing-/262353824192
301870324112
TRAFFIC Same UK 1st press vinyl LP in gatefold / booklet sleeve Island pink eye
http://www.ebay.co.uk/itm/TRAFFIC-Same-UK-1st-press-vinyl-LP-gatefold-booklet-sleeve-Island-pink-eye-/301870324112
141948187203
NOW That's What I Call Music LP'S Joblot 2-14 MINT CONDITION Vinyl
http://www.ebay.co.uk/itm/NOW-Thats-Call-Music-LPS-Joblot-2-14-MINT-CONDITION-Vinyl-/141948187203
I would like replace the single new lines with a pipe, but leave the double new lines as they are. I have tried:
tr '\n' '|' < text.txt
But this replaces all new lines with | so the separate products are no longer on different lines. I basically want a | delimiter between the product number, title and url, but each separate product on a different line. How can I achieve this?
Use tr and a little bit of sed:
tr "\n" "|" < text.txt | sed 's/||\+/\n/g'
You could use awk to do this:
awk '/^$/ { print } /./ { printf("%s|", $0) } END { print "" }' text.txt
This will find any blank line and just print it as-is. If it finds any value on the line, it will use printf and stick a pipe after it. At the end of processing it prints a newline character to finish up.
This has already been partially answered HERE, but not completely.
I would add an additional transform to change double newlines to some character (hash in this case), then replace the hashes with a newline (or two if you want to go back to the original formatting of those) after changing the single newlines to be pipes.
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\n/#/g' -e 's/\n/|/g' -e 's/#/\n/g'
This gives the output:
262353824192|Motley Crue Too Fast For Love Vinyl LP Leathur Records LR123 rare 3rd pressing|http://www.ebay.co.uk/itm/Motley-Crue-Too-Fast-Love-Vinyl-LP-Leathur-Records-LR123-rare-3rd-pressing-/262353824192
301870324112|TRAFFIC Same UK 1st press vinyl LP in gatefold / booklet sleeve Island pink eye|http://www.ebay.co.uk/itm/TRAFFIC-Same-UK-1st-press-vinyl-LP-gatefold-booklet-sleeve-Island-pink-eye-/301870324112
141948187203|NOW That's What I Call Music LP'S Joblot 2-14 MINT CONDITION Vinyl|http://www.ebay.co.uk/itm/NOW-Thats-Call-Music-LPS-Joblot-2-14-MINT-CONDITION-Vinyl-/141948187203
awk to the rescue!
awk -F'\n' -v RS= -v OFS='|' '{$1=$1;printf "%s", $0 RT}' file
this preserves the spacing between paragraphs and yields 3 lines of output, as in the original file.
I made a very specific solution to your problem with awk (specific because it assumes you always have the same number of new lines between the groups of records).
awk 'BEGIN {RS="\n\n\n"; FS="\n"; OFS="|"} {print $1,$2,$3}' < text.txt
It sets the record separator to 3 newlines, the field separator to one newline, and the output field separator to a pipe. Then, for each record (every block separated by 3 newlines), it prints the first 3 fields (which are separated by one newline) and separates them with a pipe on output.
Just use sed:
sergey#x50n:~> cat in.txt | tr '\n' '|' | sed -e 's/||\+/\n\n/g; s/|$/\n/'
262353824192|Motley Crue Too Fast For Love Vinyl LP Leathur Records LR123 rare 3rd pressing|http://www.ebay.co.uk/itm/Motley-Crue-Too-Fast-Love-Vinyl-LP-Leathur-Records-LR123-rare-3rd-pressing-/262353824192
301870324112|TRAFFIC Same UK 1st press vinyl LP in gatefold / booklet sleeve Island pink eye|http://www.ebay.co.uk/itm/TRAFFIC-Same-UK-1st-press-vinyl-LP-gatefold-booklet-sleeve-Island-pink-eye-/301870324112
141948187203|NOW That's What I Call Music LP'S Joblot 2-14 MINT CONDITION Vinyl|http://www.ebay.co.uk/itm/NOW-Thats-Call-Music-LPS-Joblot-2-14-MINT-CONDITION-Vinyl-/141948187203
First we replace all newlines with a pipe using tr as in your example.
Then the first expression in the sed command (i.e. s/||\+/\n\n/g) replaces all occurrences of more than one pipe with two newlines. You may also replace them with a single newline if you do not want blank lines between the lines of output. The second expression of the sed command replaces the trailing pipe with a newline to produce more readable output (or a more "conventional" empty line at the end of the file).
Also note that \+ in sed regex is a GNU extension. Thus if you are using non-GNU implementation of sed (FreeBSD, AIX or so), use standard syntax: |||* instead of ||\+.
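For reference, the same pipeline with the standard |||* syntax substituted in (the \n in the replacement text may also need adjusting on some non-GNU seds):
tr '\n' '|' < in.txt | sed -e 's/|||*/\n\n/g; s/|$/\n/'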

using awk or sed to print all columns from the n-th to the last [duplicate]

This question already has answers here:
awk to print all columns from the nth to the last with spaces
(4 answers)
Using awk to print all columns from the nth to the last
(27 answers)
This is NOT a duplicate of another question.
All previous questions/solutions posted on Stack Overflow have the same issue: additional spaces get collapsed into a single space.
Example (1.txt)
filename Nospaces
filename One space
filename Two spaces
filename Three spaces
Result:
awk '{$1="";$0=$0;$1=$1}1' 1.txt
One space
Two spaces
Three spaces
awk '{$1=""; print substr($0,2)}' 1.txt
One space
Two spaces
Three spaces
Specify the field separator with the -F option so awk does not collapse multiple spaces:
awk -F "[ ]" '{$1="";$0=$0;$1=$1}1' 1.txt
awk -F "[ ]" '{$1=""; print substr($0,2)}' 1.txt
If you define a field as any number of non-space characters followed by any number of space characters, then you can remove the first N like this:
$ sed -E 's/([^[:space:]]+[[:space:]]*){1}//' file
Nospaces
One space
Two spaces
Three spaces
Change {1} to {N}, where N is the number of fields to remove. If you only want to remove 1 field from the start, then you can remove the {1} entirely (as well as the parentheses which are used to create a group):
sed -E 's/[^[:space:]]+[[:space:]]*//' file
Some versions of sed (e.g. GNU sed) allow you to use the shorthand:
sed -E 's/(\S+\s*){1}//' file
If there may be some white space at the start of the line, you can add a \s* (or [[:space:]]*) to the start of the pattern, outside of the group:
sed -E 's/\s*(\S+\s*){1}//' file
The problem with using awk is that whenever you touch any of the fields on given record, the entire record is reformatted, causing each field to be separated by OFS (the Output Field Separator), which is a single space by default. You could use awk with sub if you wanted but since this is a simple substitution, sed is the right tool for the job.
To preserve whitespace in awk, you'll have to use regular expression substitutions or use substrings. As soon as you start modifying individual fields, awk has to recalculate $0 using the defined (or implicit) OFS.
Referencing Tom's sed answer:
awk '{sub(/^([^[:blank:]]+[[:blank:]]+){1}/, "", $0); print}' 1.txt
Use cut:
cut -d' ' -f2- a.txt
prints all columns from the second to the last and preserves whitespace.
Working awk code with no leading space, supporting multiple spaces in the columns, and printing from the n-th column (pass the starting column number in as a variable; note that index() locates the first occurrence of that field's text, so this can misbehave if the same text also appears earlier in the line):
awk -v n=2 '{ print substr($0, index($0, $n)) }' 1.txt

Count the number of occurrences in a string. Linux

Okay so what I am trying to figure out is how do I count the number of periods in a string and then cut everything up to that point but minus 2. Meaning like this:
string="aaa.bbb.ccc.ddd.google.com"
number_of_periods="5"
number_of_periods=`expr $number_of_periods-2`
string=`echo $string | cut -d"." -f$number_of_periods`
echo $string
result: "aaa.bbb.ccc.ddd"
The way that I was thinking of doing it was sending the string to a text file and then just grepping for the number of occurrences like this:
grep -c "." infile
The reason I don't want to do that is because I want to avoid creating another text file for I do not have permission to do so. It would also be simpler for the code I am trying to build right now.
EDIT
I don't think I made it clear but I want to make finding the number of periods more dynamic because the address I will be looking at will change as the script moves forward.
If you don't need to count the dots, but just remove the penultimate dot and everything after it, you can use Bash's built-in string manipulation.
${string%substring}
Deletes shortest match of $substring from back of $string.
Example:
$ string="aaa.bbb.ccc.ddd.google.com"
$ echo ${string%.*.*}
aaa.bbb.ccc.ddd
Nice and simple and no need for sed, awk or cut!
What about this:
echo "aaa.bbb.ccc.ddd.google.com"|awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
(further shortened by a helpful comment from @steve)
gives:
aaa.bbb.ccc.ddd
The awk command:
awk 'BEGIN{FS=OFS="."}{NF=NF-2}1'
works by separating the input line into fields (FS) by ., then joining them as output (OFS) with ., but the number of fields (NF) has been reduced by 2. The final 1 in the command is responsible for the print.
This will reduce a given input line by eliminating the last two period separated items.
This approach is "shell-agnostic" :)
Perhaps this will help:
#!/bin/sh
input="aaa.bbb.ccc.ddd.google.com"
number_of_fields=$(echo $input | tr "." "\n" | wc -l)
interesting_fields=$(($number_of_fields-2))
echo $input | cut -d. -f-${interesting_fields}
grep -o "\." <<<"aaa.bbb.ccc.ddd.google.com" | wc -l
5
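Tying the pieces together so the cut range adapts to however many periods the string contains (a sketch reusing the sample value; one less than the number of dots is the number of leading fields to keep, i.e. two fewer than the number of fields):
string="aaa.bbb.ccc.ddd.google.com"
dots=$(grep -o "\." <<<"$string" | wc -l)
echo "$string" | cut -d. -f-$((dots - 1))
# prints: aaa.bbb.ccc.ddd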
