How to extract multiple params from string using sed or awk - linux

I have a log file which looks like this:
2010/01/12/ 12:00 some un related alapha 129495 and the interesting value 45pts
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
2010/01/13/ 09:00 some un related alapha 345678 and the interesting value 60pts
I'd like to plot the date time string vs the interesting value using gnuplot. In order to do that I'm trying to parse the above log file into a csv file which looks like this (not all lines in the log have a plottable value):
2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60
How can I do this with sed or awk?
I can extract the initial characters with something like:
cat partial.log | sed -e 's/^\(.\{17\}\).*/\1/'
but how can I extract the end values?
I've been trying to do this to no avail!
Thanks

Although this is a really old question with many answers, you can do it without the use of external tools like sed or awk (hence platform-independently). You can do it with gnuplot alone (even with the version available at the time of the OP's question: gnuplot 4.4.0, March 2010).
However, from your example data and description it is not clear whether the value of interest
is strictly in the 12th column or
is always in the last column or
could be in any column but is always followed by pts
For all 3 cases there are gnuplot-only (hence platform-independent) solutions.
Assumption is that column separator is space.
ad 1. The simplest solution: with u 1:12, gnuplot will simply ignore the non-numerical rest of a column value, e.g. 45pts will be interpreted as 45.
ad 2. and 3. If you extract the last column as a string, gnuplot will fail and stop if you try to convert a non-numerical value into a floating point number via real(). Hence, you have to test yourself, via your own function isNumber(), whether the column value at least starts with a number and hence can be converted by real(). In case the string is not a number you could set the value to 1/0 or NaN. However, in earlier gnuplot versions the line of a lines(points) plot will then be interrupted.
In newer gnuplot versions (>=4.6.0) you could set the value to NaN and avoid interruptions via set datafile missing NaN, which, however, is not available in gnuplot 4.4.
Furthermore, in gnuplot 4.4 NaN is simply set to 0.0 (GPVAL_NAN = 0.0).
You can work around this with the "trick" which is also used below.
Data: SO7353702.dat
2010/01/12/ 12:00 some un related alapha 129495 and the interesting value 45pts
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
2010/01/13/ 09:00 some un related alapha 345678 and the interesting value 60pts
2010/01/15/ 09:00 some un related alapha 345678 62pts and nothing
2010/01/17/ 09:00 some un related alapha 345678 and nothing
2010/01/18/ 09:00 some un related alapha 345678 and the interesting value 70.5pts
2010/01/19/ 09:00 some un related alapha 345678 and the interesting value extra extra 64pts
2010/01/20/ 09:00 some un related alapha 345678 and the interesting value 0.66e2pts
Script: (works for gnuplot>=4.4.0, March 2010)
### extract numbers without external tools
reset
FILE = "SO7353702.dat"
set xdata time
set timefmt "%Y/%m/%d/ %H:%M"
set format x "%b %d"
isNumber(s) = strstrt('+-.',s[1:1])>0 && strstrt('0123456789',s[2:2])>0 \
|| strstrt('0123456789',s[1:1])>0
# Version 1:
plot FILE u 1:12 w lp pt 7 ti "value in the 12th column"
pause -1
# Version 2:
set datafile separator "\t"
getLastValue(col) = (s=word(strcol(col),words(strcol(col))), \
isNumber(s) ? (t0=t1, real(s)) : (y0))
plot t0=NaN FILE u (t1=timecolumn(1), y0=getLastValue(1), t0) : (y0) w lp pt 7 \
ti "value in the last column"
pause -1
# Version 3:
getPts(s) = (c=strstrt(s,"pts"), c>0 ? (r=s[1:c-1], p=word(r,words(r)), isNumber(p) ? \
(t0=t1, real(p)) : y0) : y0)
plot t0=NaN FILE u (t1=timecolumn(1),y0=getPts(strcol(1)),t0):(y0) w lp pt 7 \
ti "value anywhere with trailing 'pts'"
### end of script
Result: the script produces three plots, one each for Version 1, Version 2 and Version 3 (images not reproduced here).

Bash
#!/bin/bash
while read -r a b line
do
[[ $line =~ ([0-9]+)pts$ ]] && echo "$a $b, ${BASH_REMATCH[1]}"
done < file
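Run against the sample log (assuming it is saved as file), this should print:
2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60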

Try:
awk 'NF==12{sub(/pts/,"",$12); printf "%s %s, %s\n", $1, $2, $12}' file
Input:
2010/01/12/ 12:00 some un related alapha 129495 and the interesting value 45pts
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
2010/01/13/ 09:00 some un related alapha 345678 and the interesting value 60pts
Output:
2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60
Updated for your new requirements:
Command:
awk 'NF==12{gsub(/\//,"-",$1); sub(/pts/,"",$12); printf "%s%s %s\n", $1, $2, $12}' file
Output:
2010-01-12-12:00 45
2010-01-13-09:00 60
HTH Chris

It is indeed possible. For instance, with extended regular expressions (so that ( ) and { } act as metacharacters) and a space before the captured digits (so the greedy .* does not swallow them):
sed -nE 's!([0-9]{4}/[0-9]{2}/[0-9]{2}/ [0-9]{2}:[0-9]{2}).* ([0-9]+)pts$!\1, \2!p' file
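Applied to the sample log, this should print:
2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60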

awk '/pts/{ gsub(/pts/,"",$12);print $1,$2", "$12}' yourFile
output:
2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60
[Update: based on your new requirement]
How can I modify the above to look like:
2010-01-12-12:00 45
2010-01-13-09:00 60
awk '/pts/{ gsub(/pts/,"",$12);a=$1$2OFS$12;gsub(/\//,"-",a);print a}' yourFile
The command above will give you:
2010-01-12-12:00 45
2010-01-13-09:00 60

sed can be made more readable:
nn='[0-9]+'
n6='[0-9]{6}'
n4='[0-9]{4}'
n2='[0-9]{2}'
rx="^($n4/$n2/$n2/ $n2:$n2) .+ $n6 .+ ($nn)pts$"
sed -nre "s|$rx|\1 \2|p" file
output
2010/01/12/ 12:00 45
2010/01/13/ 09:00 60

I'd do that in two pipeline stages, first awk then sed:
awk '$NF ~ /[[:digit:]]+pts/ { print $1, $2", "$NF }' |
sed 's/pts$//'
By using $NF instead of a fixed number, you work with the final field, regardless of what the unrelated text looks like and how many fields it occupies.
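The same could also be collapsed into a single awk invocation (a minimal sketch under the same assumption about the last field; partial.log is the file name from the question):
awk '$NF ~ /^[0-9]+pts$/ { sub(/pts$/, "", $NF); print $1, $2", "$NF }' partial.log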

Related

How to remove 1 instance of each (identical) line in a text file in Linux?

There is a file:
Mary
Mary
Mary
Mary
John
John
John
Lucy
Lucy
Mark
I need to get
Mary
Mary
Mary
John
John
Lucy
I cannot work out how to get the lines ordered according to how many times each line is repeated in the text, i.e. the most frequently occurring lines must be listed first.
If your file is already sorted (most-frequent words at top, repeated words only in consecutive lines) – your question makes it look like that's the case – you could reformulate your problem to: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:
awk 'prev==$0{print}{prev=$0}'
or if you prefer an approach that looks more familiar if coming from other programming languages:
awk '{if(prev==$0)print;prev=$0}'
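For example, applied to the sample file above (assuming it is saved as file):
awk 'prev==$0{print}{prev=$0}' file
Mary
Mary
Mary
John
John
Lucy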
Partially working solutions below. I'll keep them for reference, maybe they are helpful to somebody else.
If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1.
awk '
    { lines[$0]++ }
    END {
        for (line in lines) {
            for (i = 1; i < lines[line]; ++i) {
                print line
            }
        }
    }
'
Since you mentioned that the most frequent line must come first, you have to sort first:
sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;++i){$1="";print}}' | cut -c2-
Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See Is there a way to completely delete fields in awk, so that extra delimiters do not print?
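One way to avoid that reformatting (a sketch along the same lines, stripping uniq -c's count prefix with sub() instead of emptying a field):
sort | uniq -c | sort -nr | awk '{ n = $1; sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); for (i = 1; i < n; i++) print }'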
Don't sort for no reason:
nawk '_[$-__]--'
gawk '__[$_]++'
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
For 1 GB+ files, you can speed things up a bit by preventing FS from splitting unnecessary fields:
mawk2 '__[$_]++' FS='\n'
For 100 GB inputs, one idea would be to use parallel to create, say, 10 instances of awk, piping the full 100 GB to each instance but assigning each of them a particular range to partition on (e.g. instance 4 handles lines beginning with F-Q, etc.).
Instead of outputting everything and then attempting to sort the monstrosity, each instance would simply tally up and print a frequency report of how many copies ("Nx") of each unique line ("Lx") it has recorded.
From there one could sort the much smaller report along the column holding the Lx's, then pipe it to one more awk that prints out Nx copies of each line Lx.
That is probably a lot faster than trying to sort 100 GB.
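A rough sketch of that partition-and-tally idea (the file name big.txt and the three first-character ranges are placeholders, not the setup benchmarked below):
#!/usr/bin/env bash
# Each awk instance reads the whole file but only counts lines in its own
# first-character range, writing a small "count<TAB>line" report.
for range in A-H I-Q R-Z; do
    awk -v pat="^[$range]" '$0 ~ pat { n[$0]++ }
        END { for (l in n) printf "%d\t%s\n", n[l], l }' FS='\n' big.txt > "tally.$range" &
done
wait
# The per-partition reports are far smaller than the input: sort them (here by
# count, most frequent first) and re-expand, printing each line one time fewer
# than it occurred to match the original filter's output.
sort -t$'\t' -k1,1nr tally.* |
    awk '{ n = $1; sub(/^[0-9]+\t/, ""); for (i = 1; i < n; i++) print }'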
I created a test scenario by cloning 71 shuffled copies of a raw file with these stats:
uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.
i.e. 8.12 million unique rows spanning 154 MB,
resulting in a 10.6 GB test file:
in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%
rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.
Even when using just a single instance of awk, it finished filtering the 10.6 GB in about 13.25 minutes, which is reasonable given that it is tracking 8.1 million unique hash keys.
in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%
out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]
( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )
783.31s user 15.51s system 100% cpu 13:12.78 total
5e5f8bbee08c088c0c4a78384b3dd328 stdin

how to convert floating number to integer in linux

I have a file that look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to convert it to look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0 0 0 0 0 0 0 0 0 0 0
I basically want to convert the 0.0 or 1.0 or 2.0 to 0, 1, 2.
I tried to use this command but it doesn't give me the correct output:
cat dosage.txt | printf "%d\n" "$2" 2>/dev/null
Does anyone know how to do this using an awk or sed command?
Thank you.
how to convert floating number to integer in linux (...) using awk
You might use int function of GNU AWK, consider following simple example, let file.csv content be
name,x,y,z
A,1.0,2.1,3.5
B,4.7,5.9,7.0
then
awk 'BEGIN{FS=OFS=","}NR==1{print;next}{for(i=2;i<=NF;i+=1){$i=int($i)};print}' file.csv
gives output
name,x,y,z
A,1,2,3
B,4,5,7
Explanation: I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). I print the first row as-is and instruct GNU AWK to go to the next line, i.e. do nothing else for that line. For all but the first line I use a for loop to apply int to the fields from the 2nd to the last; after that is done, I print the altered line.
(tested in GNU Awk 5.0.1)
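Note that int() truncates toward zero rather than rounding, e.g.:
awk 'BEGIN { print int(3.9), int(-3.9) }'
prints 3 -3.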
This might work for you (GNU sed):
sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta' file
Presuming you want to remove the period and trailing digits from all floating point numbers (where n.n represents a minimum example of such a number).
Match a space or start-of-line, followed by one or more digits, followed by a period, followed by one or more digits, followed by a space or end-of-line and remove the period and the digits following it. Do this for all such numbers through out the file (globally).
N.B. The substitution must be performed twice (hence the loop) because the trailing space of one floating point number may overlap with the leading space of another. The ta command is enacted when the previous substitution is true and causes sed to branch to the a label at the start of the sed cycle.
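For example, the loop matters when two numbers share a single separating space:
echo '1.5 2.5 3.5' | sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta'
should give 1 2 3, whereas a single pass would leave the middle 2.5 untouched.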
Maybe this will help. This regex captures the whole-number part in a group and removes the rest. Regexes can often be fooled by unexpected input, so make sure that you test this against all forms of input data, as I did (partially) for this example.
echo 1234.5 345 a.2 g43.3 546.0 234. hi | sed 's/\b\([0-9]\+\)\.[0-9]\+/\1/g'
outputs
1234 345 a.2 g43.3 546 234. hi
It is important to note that this is based on GNU sed (standard on Linux), so it should not be assumed to work on systems that use an older sed (like on FreeBSD).

How do I use grep to get numbers larger than 50 from a txt file

I am relatively new to grep and unix. I am trying to get the names of people who have won more than 50 races from a txt file. So far the code I have used is cat file.txt|grep -E "[5-9][0-9]$", but this is only giving me numbers from 50-99. How could I get it from 50 to 200? Thank you!!
driver    races  wins
Some_Man  90     160
Some_Man  10     80
the above is similar to the format of the data, although it is not tabulated.
Do you have to use grep? You could use awk like this:
awk '{if($[replace with the field number]>50) print $2}' < file.txt
assuming your fields are delimited by spaces; otherwise you could use the -F flag to specify a delimiter.
If you must use grep, then it's a regular expression like you did. To cover 50 to 200 you would do:
cat file.txt|grep -E "(\b[5-9][0-9]|\b1[0-9][0-9]|\b200)$"
Input:
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
Awk would be a better candidate for this:
awk '$4>=50 && $4<=200 { print $0 }' file
Check whether the fourth space-delimited field ($4; change this to whatever field number it actually is) is both greater than or equal to 50 and less than or equal to 200, and print the line ($0) if the condition is met.
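For example, with the Input shown above the wins are in the 5th field, so the drivers who won between 50 and 200 races would be listed with:
awk '$5>=50 && $5<=200 { print $3 }' file
Lewis_Hamilton
Sebastian_Vettel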

Filtering by author and counting all numbers im txt file - Linux terminal, bash

I need help with two things.
1) The file.txt has the format of a list of films, in which the author, year of publication and title are on separate lines, e.g.
author1
year1
title1
author2
year2
title2
author3
year3
title3
author4
year4
title4
I need to show only book titles whose author is "Joanne Rowling"
2)
one.txt contains numbers and letters for example like:
dada4dawdaw54 232dawdawdaw 53 34dadasd
77dkwkdw
65 23 laka 23
I need to sum all of them and get the total score; here it should be 561.
I tried something like that:
awk '{for(i=1;i<=NF;i++)s+=$i}END{print s}' plik2.txt
but it doesn't work as intended.
For the 1st question, the solution of okulkarni is great.
For the 2nd question, one solution is
sed 's/[^0-9]/ /g' one.txt | awk '{for(i=1;i<=NF;i++) sum+= $i} END { print sum}'
The sed command converts all non-numeric characters into spaces, while the awk command sums the numbers, line by line.
For the first question, you just need to use grep. Specifically, you can do grep -A 2 "Joanne Rowling" file.txt. This will show all lines with "Joanne Rowling" and the two lines immediately after.
For the second question, you can also use grep by doing grep -Eo '[0-9]+' one.txt | paste -sd+ | bc. This will put a + between every number found by grep and then add them up using bc.
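Going back to the first question: if only the titles themselves are wanted, a small awk sketch (assuming each record really is author, year, title on three consecutive lines, as in the example):
awk '/Joanne Rowling/ { getline; getline; print }' file.txt
The match lands on the author line; the two getline calls skip to the title line, which is then printed.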

column with empty datapoints

date daily weekly monthly
1 11 88
2 12
3 45 44
4 54
5 45
6 45 66
7 77
8 78
9 71 99 88
For empty data points in the weekly column, the plot is plotting values from the monthly column.
The monthly column plot and the daily column plot are perfect.
Please suggest something more than set datafile missing ' ' and set datafile separator "\t".
Alas, Gnuplot doesn't support field-based data files; the only current solution is to preprocess the file. awk is well suited for the task (note that if the file contains hard tabs you need to adjust FIELDWIDTHS):
awk '$3 ~ /^ *$/ { $3 = "?" } $4 ~ /^ *$/ { $4 = "?" } 1' FIELDWIDTHS='6 7 8 7' infile > outfile
This replaces empty fields (/^ *$/) in column 3 and 4 with question marks, which means undefined to Gnuplot. The 1 at the end of the awk script invokes the default rule: { print $0 }.
If you send awk's output to outfile, you can for example now plot the file like this:
set key autotitle columnhead out
set style data linespoint
plot 'outfile' using 1:2, '' using 1:3, '' using 1:4
If anyone runs into this, I recommend updating to at least the 4.6.5 Gnuplot version.
This is because from Gnuplot 4.6.4 update:
* CHANGE treat empty fields in a csv file as "missing" rather than "bad"
And there seemed to be a (related?) bugfix in 4.6.5:
* FIX empty first field in a tab-separated-values file was incorrectly ignored
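So with gnuplot >= 4.6.5 and a genuinely tab-separated data file (one tab between fields, so an empty field is just two consecutive tabs), the preprocessing step should no longer be needed; a minimal sketch, reusing the plot commands from the answer above:
set datafile separator "\t"
set key autotitle columnhead out
set style data linespoints
plot 'infile' using 1:2, '' using 1:3, '' using 1:4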
