Ignore the first few lines and last few lines of a file in Linux

I have a file like the one shown below and would like to print every line except the first two and the last three in Linux. I tried an awk command but had no luck. Are there any other options? I suppose I am doing something wrong, but with my minimal experience in computer science I cannot figure out what it is. I am using the following command:
awk '{if(NR>2){c++}else if(FNR<=c-3){print $0}}' samplefile.out > sampleout.txt
entry0 45
entry0 42
entry1 41
entry2 78
entry3 89
entry4 68
entryn 58
entryn 33
etnryn 52
Thanks

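awk cannot look ahead, so you'll have to buffer the lines.
awk 'NR>2{if(z!="")print z;z=y;y=x;x=$0}' file
Only three lines are held at any time, so there is practically zero memory overhead. Note that the z!="" test also drops empty lines from the output; testing NR>5 instead avoids that.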

You can do it with a combination of tail and head (the negative count in head -n -3 requires GNU head):
tail -n +3 samplefile.out | head -n -3 > sampleout.txt

Try this:
awk 'NR>2{a[++j]=$0}END{for (i=1;i<=j-3;i++){print a[i]}}' samplefile.out
There is no way to know how many lines the file has without reading it first, so the lines have to be saved as they are read.
If the file is very big, the head + tail combination may be better because it avoids the memory overhead.
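Where a head that accepts negative counts is not available, a two-pass approach gives the same result without buffering the file. The following is only a sketch: the function name trim_lines and its argument order are made up for illustration, and it assumes the file ends with a newline so that wc -l counts every line.

```shell
# trim_lines FIRST LAST FILE (hypothetical helper): print FILE without its
# first FIRST lines and its last LAST lines, reading the file twice.
trim_lines() {
    first=$1
    last=$2
    file=$3
    total=$(wc -l < "$file")              # pass 1: count the lines
    start=$((first + 1))                  # first line to keep
    end=$((total - last))                 # last line to keep
    [ "$end" -ge "$start" ] || return 0   # nothing left to print
    sed -n "${start},${end}p" "$file"     # pass 2: print the kept range
}
```

Called as trim_lines 2 3 samplefile.out > sampleout.txt, it should keep the lines from entry1 41 through entry4 68 of the sample.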

You may also try this; it uses an array as a circular buffer that holds only the last few lines:
$ cat file
entry0 45
entry0 42
entry1 41
entry2 78
entry3 89
entry4 68
entryn 58
entryn 33
etnryn 52
$ awk 'NR>first+last{print A[NR%last]}{A[NR%last]=$0}' first=2 last=3 file
entry1 41
entry2 78
entry3 89
entry4 68

Related

Even after `sort`, `uniq` is still repeating some values

Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz
(It is a gzip-compressed archive that contains a file called Wiki-Vote.txt)
The first few lines of the file, as printed by head -n 10 Wiki-Vote.txt:
# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt
# Wikipedia voting on promotion to administratorship (till January 2008).
# Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId ToNodeId
30 1412
30 3352
30 5254
30 5543
30 7478
3 28
I want to find the number of nodes in the graph (although it's already given in line 3). I ran the following command:
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l
Explanation:
/^#/ matches all the lines that start with #, and !/^#/ matches all the lines that don't.
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt prints the first and second column of every matched line, each value on its own line.
| sort sorts that output.
| uniq should display only the unique values, but it doesn't.
| wc -l counts the resulting lines, and the count is wrong.
The result of the above command is 8491, which is not 7115 (as mentioned in line 3). I don't know why uniq repeats the values. I can tell that it does because awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail returns
992
993
993
994
994
995
996
998
999
999
which contains repeated values. Could someone please run the code and tell me that I am not the only one getting the wrong answer, and please help me figure out why I'm getting this result?
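The file has DOS line endings: each line ends with \r\n (CR LF). The \r stays attached to the last field, so every value printed from $2 carries a trailing CR while the values printed from $1 do not.
You can inspect your tail output with, for example, hexdump -C (the lines starting with # were added by me):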
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort | uniq | tail | hexdump -C
00000000  39 39 32 0a 39 39 33 0a  39 39 33 0d 0a 39 39 34  |992.993.993..994|
#                                           ^^ HERE
00000010  0a 39 39 34 0d 0a 39 39  35 0d 0a 39 39 36 0a 39  |.994..995..996.9|
#                     ^^              ^^
00000020  39 38 0a 39 39 39 0a 39  39 39 0d 0a              |98.999.999..|
#                                        ^^
0000002c
Because uniq sees two different lines, one with a CR and one without, neither is removed. Remove the CR characters before sorting. Note that sort -u is better than sort | uniq.
$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | tr -d '\r' | sort -u | wc -l
7115
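The CR can also be stripped inside awk itself, which saves the extra tr process. This is a minimal sketch, assuming the data lines end in \r\n; the function name count_nodes is made up for illustration.

```shell
# count_nodes FILE (hypothetical helper): count the distinct node IDs in an
# edge list whose lines may end in CR LF.
count_nodes() {
    awk '!/^#/ {
        sub(/\r$/, "")   # drop a trailing CR; awk then re-splits the fields
        print $1
        print $2
    }' "$1" | sort -u | wc -l
}
```

Run as count_nodes Wiki-Vote.txt, it should report the same count as the tr pipeline above.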

awk split adds whole string to array position 1 (reason unknown)

So I have a .txt file that looks like this:
mona 70 77 85 77
john 85 92 78 80
andreja 89 90 85 94
jasper 84 64 81 66
george 54 77 82 73
ellis 90 93 89 88
I have created a grades.awk script that contains the following code:
{
FS=" "
names=$1
vi1=$2
vi2=$3
vi3=$4
rv=$5
#printf("%s ",names);
split(names,nameArray," ");
printf("%s\t",nameArray[1]); # prints the whole array of names for some reason, instead of just the name at position 1 in the array ("john")
}
So my question is: how do I split this correctly? Am I doing something wrong?
How do you read a file line by line and word by word correctly? I need to add each column into its own array. I've been searching for the answer for quite some time now and can't fix my problem.
Here is a template to calculate the average grade per student:
$ awk '{sum=0; for(i=2;i<=NF;i++) sum+=$i;
printf "%s\t%5.2f\n", $1, sum/(NF-1)}' file
mona 77.25
john 83.75
andreja 89.50
jasper 73.75
george 71.50
ellis 90.00
printf("%s\t",nameArray[1])
is doing exactly what you want it to do. However, it is called once per input line and prints one name each time, and since you never output a newline between the names, you get just one line of output. Change it to:
printf("%s\n",nameArray[1])
There are a few other issues with your code, of course (e.g. you're setting FS in the wrong place and unnecessarily, and names only ever contains one word, so splitting it into an array doesn't make sense), but I think that's what you were asking about specifically.
If that's not all you want then edit your question to clarify what you're trying to do and add concise, testable sample input and expected output.
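For the part of the question about putting each column into its own array, one possible sketch follows. awk already splits every line into $1..$NF, so the arrays only need to be indexed by row (and column). The function name column_averages and the array names are made up for illustration, and the per-column average is just one example of what the stored values could be used for.

```shell
# column_averages FILE (hypothetical helper): store the names and every
# grade column in arrays, then report the average of each grade column.
column_averages() {
    awk '
    {
        names[NR] = $1               # column 1: one name per row
        for (i = 2; i <= NF; i++)
            col[i, NR] = $i          # column i, row NR
        if (NF > nf) nf = NF         # remember the widest row
    }
    END {
        printf "names:"
        for (r = 1; r <= NR; r++) printf " %s", names[r]
        printf "\n"
        for (i = 2; i <= nf; i++) {
            sum = 0
            for (r = 1; r <= NR; r++) sum += col[i, r]
            printf "column %d average: %.2f\n", i, sum / NR
        }
    }' "$1"
}
```

The arrays are filled while the file is read and only used in the END block, once every row is known.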

How to delimit file with "\t\n" on a Mac

I have a document whose records are separated by "\t\n". Within a record, fields are separated by either "\t" or "\n".
Normally, this should be a straightforward awk query:
BEGIN {
RS='\t\n';
}
{
print;
print "Next entry:";
}
However, on a Mac, regular expressions do not seem to be supported in RS (maybe I'm not doing something right?). So I tried RS="\t\n"; however, this is interpreted as RS='\t | \n'. I get similar problems running awk from the command line:
awk 1 RS='\t\n' ORS='abc' input > output
This replaces the \t's but leaves the \n's alone.
Next try: using tr. This obviously fails for sequences of more than one character, since \t and \n are also used individually in the rows.
Next:
sed -e '/\t\n/s//NextEntry:/g' input > output
However, this doesn't work, although entering any ASCII character sequence instead of \t\n does.
I read the manual. It says that \t is not supported in sed strings. Fair enough:
sed -e '/\x9\xa/s//abc/' input > output
That still doesn't work. Idea: use tr to replace \t and \n with characters unused in the input file, use sed to change them to what I want, and then use tr to change the remaining characters back to what they should be.
tr: Illegal byte sequence
It turns out that the f6 character makes tr fail completely.
I went through the suggestions in "Sed not recognizing \t instead it is treating it as 't' why?". They might work for replacement strings (except the "Pasting tab into command prompt via CTRL+V" suggestion; the shell just rejected that paste), but they did not seem to help in my case.
Maybe it's because it's a Mac? Maybe it's because that's the text I'm looking for, not replacing with? Maybe it's the combination with \n?
Any other suggestions?
UPDATE:
I found the thread "How can I replace a newline (\n) using sed?". Apparently, I am unable even to replace a \n with the string "abc" using the suggestions in that thread.
EDIT: Hex head of source file:
5a 20 4e 4f 09 0a 41 53 20 4f 46 20 30 31 2d 30
34 2d 30 35 20 45 4d 50 4c 4f 59 45 45 0a 47 52
4f 55 50 09 48 49 52 45 20 44 41 54 45 09 53 41
4c 41 52 59 09 4a 4f 42 20 54 49 54 4c 45 09 0a
4a 4f 42 20 4c 45 56 45 4c 0a 53 45 52 49 45 53
09 41 50 50 54 20 54 59 50 45 09 0a 50 41 59 20
53 54 41 54 55 53 0a f6
Unfortunately, BSD awk, as also used on macOS, doesn't support multi-character record separators (RS) at all (in line with POSIX); only a single, literal character is supported.
BSD sed, as also used on macOS, supports only \n in regexes - any other escapes, including hex ones (e.g., \x09) are not supported.
See this answer of mine for a comprehensive comparison of GNU and BSD sed.
Assuming that your sed command works in principle, you can use an ANSI C-quoted string ($'\t') to splice a literal tab character into your sed script (this assumes bash, ksh, or zsh):
sed -e ':a' -e '$!{N;ba' -e '}' -e '/'$'\t''\n/s//NextEntry:/g'
Note that, in order to replace newlines, you must instruct sed to read the entire file into memory first, which is what -e ':a' -e '$!{N;ba' -e '}' does (the BSD Sed-compatible form of the common GNU sed idiom :a;$!{N;ba}).
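In a plain POSIX sh, where $'\t' is not available, the tab can be captured with printf instead. The following is a sketch of the same slurp-and-replace idea; the variable tab and the function name join_entries are made up for illustration.

```shell
tab=$(printf '\t')   # a literal tab character, no ANSI C quoting needed

# join_entries FILE (hypothetical helper): read the whole file into the
# pattern space, then replace every tab+newline pair with "NextEntry:".
join_entries() {
    sed -e ':a' -e '$!{N;ba' -e '}' -e "s/${tab}\n/NextEntry:/g" "$1"
}
```

A tab+newline at the very end of the file is left alone, because sed does not keep the final newline in the pattern space. If the stray f6 byte triggers an "illegal byte sequence" error here as well, prefixing the call with LC_ALL=C may help, although that was not tried in this thread.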

AWK convert minutes and seconds to only seconds

I have a file that I'm trying to convert into just seconds. It looks like this:
49.80
1:03.40
1:01.00
1:00.40
1:01.00
I've tried:
awk -F: '{seconds=($1*60);seconds=seconds+$2;print seconds}' file
which outputs:
2988
63.4
61
60.4
61
I've also tried:
sed 's/^/:/g' file | awk -F: '{seconds=($2*60);seconds=seconds+$3;print seconds}'
which outputs the same results. I would like to obtain these results:
49.80
63.4
61
60.4
61
Just add a check for whether the record contains both minutes and seconds:
awk -F: 'NF==1; NF==2{seconds=($1*60);seconds=seconds+$2;print seconds}'
NF is the number of fields in each record (row).
NF==1 is true if the record contains only one field, in which case no calculation is needed. There is no {action} part associated with this check, so awk performs the default action and prints the entire line.
NF==2 is true if the record contains two fields. awk then takes the associated action, which performs the calculation.
Test
$ awk -F: 'NF==1; NF==2{seconds=($1*60);seconds=seconds+$2;print seconds}' file
49.80
63.4
61
60.4
61
The last column in your colon-separated list holds the seconds, the next-to-last the minutes, and the one before that the hours (although hours were not mentioned in the question). I'd put this structure into code:
awk -F':' '{print (NF>2 ? $(NF-2)*3600 : 0) + (NF>1 ? $(NF-1)*60 : 0) + $(NF)}'
It checks whether there are enough columns (e.g. NF>1) and then uses the column's value (e.g. $(NF-1)*60), or 0 if the column is missing.
Test:
> awk -F':' '{print $0": "(NF>2?$(NF-2)*3600:0)+(NF>1?$(NF-1)*60:0)+$(NF)}'
49.80: 49.8
1:03.40: 63.4
1:01.00: 61
1:00.40: 60.4
1:01.00: 61
23:14:12.1: 83652.1
A compact variant, which is only correct for input with at most one colon per line:
$ awk -F: '{print (NF-1)*$1*60 + $NF}' file
49.8
63.4
61
60.4
61
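The answers above can be generalised to any number of colon-separated parts by folding the fields from left to right. This is only a sketch; the function name to_seconds is made up for illustration.

```shell
# to_seconds FILE (hypothetical helper): convert [[h:]m:]s values to seconds.
to_seconds() {
    awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++)
            s = s * 60 + $i   # scale the running total by 60, add the next field
        print s
    }' "$1"
}
```

Note that print formats non-integer numbers with awk's default output format, so trailing zeros are dropped (49.80 comes out as 49.8); use printf if a fixed number of decimals is needed.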

remove portion of a column in a .tab in unix

Can someone please help? I'm trying to delete the last portion (everything following "_c0") of the second column of the following list in the bash shell, e.g. the "_seq1" in this particular list. I do not want to change anything in the remaining columns.
Thanks!
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3
XP_003962102 comp1000054_c0_seq1 24.07 54 41 0 164 3
Here you go, a simple substitution using sed:
sed -e s/_seq1//
Using sed:
sed -i.bak 's/^\(.*_c0\)[^ ]*\( .*\)$/\1\2/' file
OR using awk:
awk '{sub(/_c0[^ ]*/, "_c0", $2)} 1' file
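The question describes a .tab file. If the columns really are tab-separated (an assumption, since the listing above shows spaces), modifying $2 with the default settings makes awk rebuild the line with single spaces. A sketch that keeps the tabs follows; the function name strip_seq is made up for illustration.

```shell
# strip_seq FILE (hypothetical helper): cut everything after "_c0" in the
# second column of a tab-separated file, keeping the tabs intact.
strip_seq() {
    awk 'BEGIN { FS = OFS = "\t" }          # split and re-join on tabs
         { sub(/_c0.*/, "_c0", $2) }        # trim only the second column
         1' "$1"                            # print every (modified) line
}
```

Because the field separator is a tab, the pattern can safely consume the rest of the second column without touching the other columns.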
