Extract substring from first column - linux

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...

If your actual Input_file matches the sample shown, the following code may help.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3

Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
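A variant of the non-gawk approach, as a sketch only: since name=" is 6 characters long and the match ends with one closing quote, the name can be cut out with substr offsets instead of a second gsub. The rows are the ones from the question; the variable names input and result are just for illustration.

```shell
# Sample rows copied from the question.
input='application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3'

# match() sets RSTART/RLENGTH for the whole name="..." token.
# Skip the 6 characters of name=" and drop the closing quote (6 + 1 = 7).
result=$(printf '%s\n' "$input" |
  awk 'match($0, /name="[^"]*"/) { print substr($0, RSTART + 6, RLENGTH - 7), $NF }')
printf '%s\n' "$result"
```

Lines without a name="..." part are skipped, the same as in the commands above.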

Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
With the parentheses you store whatever is matched in between them in a group.
The first group is everything after name=" up to the next ". [^"] means "not a double quote".
The second group is simply "one or more digits at the end of the line, preceded by a space".
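To see the two groups at work, here is a small sketch run against the sample rows from the question. It uses -E (accepted by both GNU and BSD sed) instead of -r, matches the closing quote explicitly, and drops the g flag, since the pattern covers the whole line anyway. The variable names are only for illustration.

```shell
# Sample rows copied from the question.
input='application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3'

# \1 = text between name=" and the next double quote
# \2 = digits at the end of the line, preceded by a space
result=$(printf '%s\n' "$input" |
  sed -E 's/.*name="([^"]+)".* ([0-9]+)$/\1 \2/')
printf '%s\n' "$result"
```

Lines that do not match the pattern are printed unchanged; add -n and a trailing p flag if they should be dropped instead.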

Related

Removing leading 0 from third column

I'm trying to remove the first 0 from the third column in my CSV file
tel.csv -
test,01test,01234567890
test,01test,09876054321
I have been trying to use the following with no luck -
cat tel.csv | sed 's/^0*//'
Something like:
sed 's/^\([^,]*\),\([^,]*\),0\(.*\)$/\1,\2,\3/' file.csv
Or awk
awk 'BEGIN{FS=OFS=","}{sub(/^0/, "", $3)}1' file.csv
Assumptions:
3rd column consists of only numbers (0-9)
3rd column could have multiple leading 0's
Adding a row with a 3rd column that has multiple leading 0's:
$ cat tel.csv
test,01test,01234567890
test,01test,09876054321
test,02test,00001234567890
One awk idea:
$ awk 'BEGIN{FS=OFS=","}{$3=$3+0}1' tel.csv
test,01test,1234567890
test,01test,9876054321
test,02test,1234567890
Where: adding 0 to a number ($3+0) has the side effect of removing leading 0's.
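Note that $3+0 turns the field into a number, so it only fits while the assumptions above hold. If the column might contain something other than digits, a string-based variant is a possible alternative. This is a sketch; the third row is made up for illustration.

```shell
# First two rows follow the sample file; the third row is hypothetical.
input='test,01test,01234567890
test,02test,00001234567890
test,03test,000abc'

# sub(/^0+/, ...) strips all leading zeros as text, without numeric conversion.
result=$(printf '%s\n' "$input" |
  awk 'BEGIN{FS=OFS=","} {sub(/^0+/, "", $3)} 1')
printf '%s\n' "$result"
```

One caveat: a field consisting only of zeros becomes empty with this variant, whereas $3+0 would leave a single 0.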
If the third field is the last field, as it is in the sample lines:
sed 's/,0\([^,]*\)$/,\1/' file

Splitting the first column of a file in multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split, just replace would do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub replaces all occurrences; when no third argument is given, it operates on $0.
The trailing 1 is a shortcut for {print}: a pattern that is always true, with the default action {print}.
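Both remarks can be checked with a quick sketch. The line is a shortened version of the sample, and tr is only there to make the tabs visible as |; the variable names are for illustration.

```shell
# Shortened sample line with real tab characters between the columns.
line=$(printf '1_number_column_ranking_+\t100\t200')

# Trailing 1 versus an explicit {print}: same behavior.
short=$(printf '%s\n' "$line" | awk 'BEGIN{FS=OFS="\t"} {gsub("_","\t",$1)} 1' | tr '\t' '|')
long=$(printf '%s\n' "$line" | awk 'BEGIN{FS=OFS="\t"} {gsub("_","\t",$1)} {print}' | tr '\t' '|')

# gsub without a third argument works on the whole line ($0).
whole=$(printf '%s\n' "$line" | awk '{gsub("_","\t")} 1' | tr '\t' '|')

printf '%s\n' "$short" "$long" "$whole"
```

All three print the same line here, because the underscores occur only in the first column.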
Another awk, if the "_" appears only in the first column.
Split the input on the regex "[_\t]" and do a dummy assignment like $1=$1 in the main section, so that $0 is rebuilt with OFS="\t"
$ cat steveman.txt
1_number_column_ranking_+ 100 200i Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200i Target "Hello"
$
Thanks @Ed, updated from -F"[_\t]+" to -F"[_\t]", which avoids merging empty fields.
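The difference between the two separators only shows up when a field is empty. A small sketch with a made-up line containing two consecutive underscores (tabs shown as | via tr):

```shell
# Hypothetical line: an empty field sits between the two underscores.
line=$(printf 'a__b\tc')

# Single-character class: every separator starts a new field.
single=$(printf '%s\n' "$line" | awk -F"[_\t]" 'BEGIN{OFS="\t"} {$1=$1; print}' | tr '\t' '|')

# With +, a run of separators counts as one, so the empty field disappears.
plus=$(printf '%s\n' "$line" | awk -F"[_\t]+" 'BEGIN{OFS="\t"} {$1=$1; print}' | tr '\t' '|')

printf '%s\n' "$single" "$plus"
```

The first command keeps the empty column (a||b|c), the second one loses it (a|b|c), which would shift all later columns by one.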

Split and compare in awk

I want to split a field and compare values in an awk command.
Input file (tab-delimited)
1 aaa 1|3
2 bbb 3|3
3 ccc 0|2
Filter conditions
First column value > 1
First value of the third column, split by "|", > 2
Process
Check whether the first column value is greater than 1
Split the third column value by "|"
Check whether the first value of the third column is greater than 2
Print the line only if that first value is greater than 2
Command line (example)
awk -F "\t" '{if($1>1 && ....?) print}' file
Output
2 bbb 3|3
Please let me know the command line for the above processing.
You can set the field separator to either tab or pipe and check the 1st and 3rd values:
awk -F'\t|\\|' '$1>1 && $3>2' file
or
awk -F"\t|\\\\|" '$1>1 && $3>2' file
You can read about all this character escaping in this comprehensive answer by Ed Morton in awk: fatal: Invalid regular expression when setting multiple field separators.
Otherwise, you can split the 3rd field and check the value of the first slice:
awk -F"\t" '{split($3,a,"|")} $1>1 && a[1]>2' file
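A sketch of the split approach run against the sample input, using the two conditions from the question (first column > 1, first part of the third column > 2). The +0 is an extra precaution to force a numeric comparison; the variable names are for illustration.

```shell
# Tab-delimited sample rows from the question.
input=$(printf '1\taaa\t1|3\n2\tbbb\t3|3\n3\tccc\t0|2')

# split() puts the parts of $3 into a[1], a[2]; a single-character
# separator such as "|" is taken literally, not as a regex.
result=$(printf '%s\n' "$input" |
  awk -F'\t' '{split($3, a, "|")} $1 > 1 && a[1] + 0 > 2' | tr '\t' ' ')
printf '%s\n' "$result"
```

Only the second row passes both checks; the third row fails because the first part of its third column is 0.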

How to format decimal space using awk in linux

original file :
a|||a 2 0.111111
a|||book 1 0.0555556
a|||is 2 0.111111
Now I need to format the third column with 6 decimal places.
I tried awk {'print $1,$2; printf "%.6f\t",$3'}
but the output is not what I want.
result :
a|||a 2
0.111111 a|||book 1
0.055556 a|||is 2
That's weird. How can I modify just the third column?
Your print statement is adding a newline character. Include the third field in it, formatted with the sprintf() function, like:
awk '{print $1,$2, sprintf("%.6f", $3)}' infile
That yields:
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111
print adds a newline at the end of what it prints, whereas printf by default doesn't. This means a newline is added after every second field and none is added after the third.
You can use printf for the whole string and manually add a newline.
Also, I'm not sure why you are adding a tab to the end of the lines, so I removed that:
awk '{printf "%s %d %.6f\n",$1,$2,$3}' file
a|||a 2 0.111111
a|||book 1 0.055556
a|||is 2 0.111111
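Another option, sketched here under the assumption that the columns are separated by single spaces: assign the formatted value back to $3 and print the whole record. That way the command does not need to list every column, so it keeps working if more columns are added later.

```shell
# Sample rows copied from the question.
input='a|||a 2 0.111111
a|||book 1 0.0555556
a|||is 2 0.111111'

# Overwrite only the third field; the trailing 1 prints the rebuilt record.
result=$(printf '%s\n' "$input" | awk '{$3 = sprintf("%.6f", $3)} 1')
printf '%s\n' "$result"
```

Assigning to a field makes awk rebuild the line with OFS (a single space by default), so set OFS if the file uses tabs.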

Grep find lines that have 4,5,6,7 and 9 in zip code column

I'm using grep to display all lines that have ONLY 4,5,6,7 and 9 in the zipcode column.
How do I display only the lines of the file that contain the numbers 4,5,6,7 and 9 in the zipcode field?
A sample row is:
15 m jagger mick 41 4th 95115
Thanks
I am going to assume you meant "How do I use grep to..."
If all of the lines in the file have a 5 digit zip at the end of each line, then:
egrep "[45679]{5}$" filename
Should give you what you want.
If there might be whitespace between the zip and the end of the line, then:
egrep "[45679]{5}[[:space:]]*$" filename
would be more robust.
If the problem is more general than that, please describe it more accurately.
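The following regex should fetch the desired result when the zip code is the last field; the leading [[:space:]] keeps the match from starting in the middle of the field:
egrep "[[:space:]][45679]+$" file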
If by "grep" you mean, "the correct tool", then the solution you seek is:
awk '$7 ~ /^[45679]*$/' input
This will print all lines of input in which the 7th field consists only of the characters 4,5,6,7, and 9. If you want to specify 'the last column' rather than the 7th, try
awk '$NF ~ /^[45679]*$/' input
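A quick way to try this out, as a sketch: the first row is the sample from the question, the other two rows are made up for illustration. It uses + instead of * so that an empty last field does not match, and shows a grep -E counterpart that anchors on the whitespace before the zip code.

```shell
# Row 1 is from the question; rows 2 and 3 are hypothetical.
input='15 m jagger mick 41 4th 95115
16 f doe jane 12 main 94567
17 m roe rick 7 oak 45679'

# awk: the last field must consist only of the digits 4, 5, 6, 7 and 9.
result=$(printf '%s\n' "$input" | awk '$NF ~ /^[45679]+$/')

# grep: whitespace, then exactly five allowed digits, then end of line.
result_grep=$(printf '%s\n' "$input" | grep -E '[[:space:]][45679]{5}[[:space:]]*$')

printf '%s\n' "$result"
```

Both commands drop the first row, because 95115 contains the digit 1.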
