How to keep values from a messy table based on string characteristics

I have a tricky file.asv (values separated with #) that contains lines with mismatched columns.
Example:
name#age#city#lat#long
eric#paris#4.4283333333333331e+01#-1.0550000000000000e+02
dan#43#berlin#3.1366000000000000e+01#-1.0371500000000000e+02
london##2.5250000000000000e+01#1.0538333000000000e+02
The latitude and longitude values are pretty consistent. They have 22 or 23 characters (depending on whether the sign is negative or positive, in which case it is absent), and are always in scientific notation. I would like to keep only the latitude and longitude from each line.
Expected output:
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
Headers are not strictly necessary; I can add them later. I could also work with separate latitude and longitude outputs and then paste them together. Is there any sed or awk command I could use?

Use this awk:
awk 'BEGIN{OFS=FS="#"} {print $(NF-1),$NF}' file
Here:
OFS - Output Field Separator
FS - Input Field Separator
NF - Number of Fields
Assuming latitude and longitude are always the last two fields, $(NF-1) and $NF will print them.
Test:
$ awk 'BEGIN{OFS=FS="#"} {print $(NF-1),$NF}' file
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02

A simple grep would do, assuming the -o option is available:
$ grep -o '[^#]*#[^#]*$' file.asv
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02
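If you'd rather drop the header line too (the question notes headers are optional), a stricter pattern that only matches the scientific-notation pair could work; a sketch, assuming the mantissa format shown in the sample:
$ grep -oE '[-]?[0-9]\.[0-9]+e[+-][0-9]{2}#[-]?[0-9]\.[0-9]+e[+-][0-9]{2}$' file.asv
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02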

Trying to select fields using a regular expression:
$ cat ll.awk
function rep(c, n,    ans) {       # repeat `c' `n' times
    while (n--) ans = ans c
    return ans
}
function build_re(    d, s, os) {  # build a regexp `r'
    d = "[0-9]"                    # a digit
    s = "[+-]"                     # sign
    os = s "?"                     # optional sign
    r = os d "[.]" rep(d, 16) "e" s d d   # adjust here
    r = "^" r "$"                  # match the entire string
}
function process(    sep, line, i) {
    for (i = 1; i <= NF; i++) {
        if (!($i ~ r)) continue    # skip non-matching fields
        line = line sep $i; sep = FS
    }
    if (length(line)) print line
}
BEGIN {
    build_re()
    FS = "#"
}
{   # called on every line of input
    process()
}
Usage:
$ awk -f ll.awk file.txt
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02

One more, in GNU awk, using gensub():
$ awk '{print gensub(/(.+)((#[^#]+){2})$/,"\\2","g",$0)}' file
#lat#long
#4.4283333333333331e+01#-1.0550000000000000e+02
#3.1366000000000000e+01#-1.0371500000000000e+02
#2.5250000000000000e+01#1.0538333000000000e+02
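The leading # appears because the captured group includes the separator. A small tweak that captures only the last two fields should avoid it (a sketch, assuming each data line has at least three fields; the two-field header line is left untouched since the pattern does not match it):
$ awk '{print gensub(/.*#([^#]+#[^#]+)$/,"\\1","g",$0)}' file
lat#long
4.4283333333333331e+01#-1.0550000000000000e+02
3.1366000000000000e+01#-1.0371500000000000e+02
2.5250000000000000e+01#1.0538333000000000e+02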


Check if a word from one file exists in another file and print the matching line

I have a file containing some specific words. I have another file containing URLs, which include words from file1.
I would like to print the URLs that match each word in file1. If a word is not found in file2, then return "no matching".
I tried with awk and grep, and used if conditions as well, but did not get the expected results.
File1:
abc
Def
XYZ
File2:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
Output can be like:
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Etc.
Tried:
file=/bin/file1.txt
for i in `cat $file1`;
do
a=$i
echo "$a:" | awk '$repos.txt ~ $a {printf $?}'
done
Tried some other ways like if condition with grep and all... but no luck.
abc means it should only search for abc, not abcd.
You appear to want case-insensitive matching.
An awk solution:
$ cat <<'EOD' >file1
abc
Def
XYZ
missing
EOD
$ cat <<'EOD' >file2
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
EOD
$ awk '
# create lowercase versions
{
    lc = tolower($0)
}
# loop over lines of file1
# store search strings in array
# key is search string, value will be results found
NR==FNR {
    h[lc]
    next
}
# loop over lines of file2
# if search string found, append line to results
{
    for (s in h)
        if (lc ~ s)
            h[s] = h[s] "\n" $0
}
# loop over search strings and print results
# if no result, show error message
END {
    for (s in h)
        print s ":" (h[s] ? h[s] : "\nno matching")
}
' file1 file2
missing:
no matching
def:
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
$
Your attempt is pretty far from the mark. Probably learn the basics of the shell and Awk before you proceed.
Here is a simple implementation which avoids reading lines with for.
while IFS='' read -r word; do
    echo "$word:"
    grep -F "$word" File2
done <File1
If you want to match case-insensitively, use grep -iF.
The requirement to avoid substring matches is a complication. The -w option to grep nominally restricts matching to entire words, but the definition of "word" characters includes the underscore, so you can't use that directly. A manual approximation might look like
grep -iE "(^|[^a-z])$word([^a-z]|$)" File2
but this might not work with all grep implementations.
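If your grep is GNU grep built with PCRE support, lookarounds give a cleaner approximation (an untested sketch; -P is a GNU extension):
grep -iP "(?<![a-z])$word(?![a-z])" File2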
A better design is perhaps to prefix each output line with the word(s) it matched, and only loop over the input file once.
awk 'NR==FNR { w[$0] = "(^|[^a-z])" $0 "([^a-z]|$)"; next }
    { m = ""
      for (a in w) if ($0 ~ w[a]) m = m (m ? "," : "") a
      if (m) print m ":" $0 }' File1 File2
In brief, we collect the search words in the array w from the first input file. When reading the second input file, we collect matches on all the search words in m; if m is non-empty, we print its value followed by the input line which matched.
Again, if you want case-insensitive matching, use tolower() where appropriate.
Demo, featuring lower-case comparisons: https://ideone.com/iTWpFn
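For reference, a case-insensitive variant of the same script could look like this (a sketch that lowercases the regexps and the input lines while keeping the original words as array keys):
awk 'NR==FNR { w[$0] = "(^|[^a-z])" tolower($0) "([^a-z]|$)"; next }
    { line = tolower($0); m = ""
      for (a in w) if (line ~ w[a]) m = m (m ? "," : "") a
      if (m) print m ":" $0 }' File1 File2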

How to change a specific character from upper case to lower case using awk or sed or python?

I have a line with a string of characters (BCCDDDCDCCDDDDDDABCDABCABDBACBDCAACCBBCABACBCCABCACBCDCCCBDBACDCBBCBCBCCCACADAACCABABADBCBAABBBCCBB).
I'd like to change a specific character (e.g. the 4th character) to lower case.
I have tried this awk command:
awk '{for (i=1; i<=NF; ++i) { $i=toupper(substr($i,1,1)) tolower(substr($i,2)); } print }' input > output
input file contains the string
"BCCDDDCDCCDDDDDDABCDABCABDBACBDCAACCBBCABACBCCABCACBCDCCCBDBACDCBBCBCBCCCACADAACCABABADBCBAABBBCCBB"
This awk command gives this output:
"Bccdddcdccddddddabcdabcabdbacbdcaaccbbcabacbccabcacbcdcccbdbacdcbbcbcbcccacadaaccababadbcbaabbbccbb"
How do I change all the characters except the 4th one to lowercase?
On that line you can use something like this. With -F '', every letter becomes its own field, which can be accessed as $i.
$ cat line
BCCDDDCDCCDDDDDDABCDABCABDBACBDCAACCBBCABACBCCABCACBCDCCCBDBACDCBBCBCBCCCACADAACCABABADBCBAABBBCCBB
$ awk -F '' '{
    for (i=1; i<=NF; ++i) {
        if (i != 4) {
            printf("%s", tolower($i))
        } else {
            printf("%s", $i)
        }
    }
    print "" }' line
bccDddcdccddddddabcdabcabdbacbdcaaccbbcabacbccabcacbcdcccbdbacdcbbcbcbcccacadaaccababadbcbaabbbccbb
With GNU sed, would you please try:
sed -E 's/(.{3})(.)/\1\L\2/' YourFile
With Python, assuming the variable s is assigned to the line:
print(s[0:3] + s[3:4].lower() + s[4:])
An example in Python of a function that does it; I'll let you work out the logic and adapt it to your problem.
def f(s):
    # your result
    res = ""
    # iterate through each character in the string
    for i in range(len(s)):
        # on some condition, here the 4th character, add a lowercase character to your result
        if i == 3:
            res += s[i].lower()
        # on some other condition (a placeholder for you to fill in), an uppercase character...
        elif condition_2_to_create:
            res += s[i].upper()
        # or just the character as it is in the string, without modification
        else:
            res += s[i]
    # then return your result
    return res
Change 4th letter of every line to lowercase:
gawk '{$0 = substr($0,1,3) tolower(substr($0,4,1)) substr($0,5)} 1' YourFile
Or:
gawk -F '' '{$4=tolower($4)} 1' OFS='' YourFile
Or with Perl (note that Perl's substr is zero-indexed, so the 4th character is at offset 3):
perl -pe 'substr($_,3,1) = lc(substr($_,3,1))' YourFile

AWK print every other column, starting from the last column (and next to last column) for N iterations (print from right to left)

Hopefully someone out there in the world can help me, and anyone else with a similar problem, find a simple solution to capturing data. I have spent hours trying to write a one-liner to solve what I thought was a simple problem involving awk, a CSV file, and saving the output as a bash variable. In short, here's the nut...
The Missions:
1) To output every other column, starting from the LAST COLUMN, with a specific iteration count.
2) To output every other column, starting from NEXT TO LAST COLUMN, with a specific iteration count.
The Data (file.csv):
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
Desired results for Mission 1:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Desired results for Mission 2:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
My Attempts:
The closest I have come to solving any of the above problems is an ugly pipe (which is OK for skinning a cat) for Mission 1. However, it doesn't use a declared iteration count (which should be 5). Also, I'm completely lost on solving Mission 2.
Any help simplifying the below and solving Mission 2 will be HELLA appreciated!
outcome=$( awk 'BEGIN {FS = "#"} {for (i = 0; i <= NF; i += 2) printf ("%s%c", $(NF-i), i + 2 <= NF ? "#" : "\n");}' file.csv | sed 's/##.*//g' | awk -F# '{for (i=NF;i>0;i--){printf $i"#"};printf "\n"}' | sed 's/#$//g' | awk -F# '{$1="";print $0}' OFS=# | sed 's/^#//g' );
Also, if doing a loop for a specific number of iterations is helpful in solving this problem, the magic number is 5. Maybe a solution could be a for loop counting from right to left, skipping every other column on each iteration, with the starting column declared as an awk variable (just a thought; I have no idea how to do this).
Thank you for looking over this problem.
There are certainly more elegant ways to do this, but I am not really an awk person:
Part 1:
awk -F# '{ x = ""; for (f = NF; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Part 2:
awk -F# '{ x = ""; for (f = NF - 1; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
The literal 5 in each of those is your "number of iterations."
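If you'd rather declare the iteration count than hard-code it, the same loop can take it via -v (a sketch; the extra f > 0 test guards against counts larger than the number of available fields):
awk -F# -v n=5 '{ x = ""; for (f = NF; f > NF - 2 * n && f > 0; f -= 2) x = x ? $f "#" x : $f; print x }' file.csv
For Mission 2, start the loop at f = NF - 1 instead.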
Sample data:
$ cat mission.dat
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
One awk solution:
NOTE: OP can add logic to validate the input parameters.
$ cat mission
#!/bin/bash
# format: mission { 1 | 2 } { number_of_fields_to_display }

mission=${1}                  # assumes user inputs "1" or "2"
offset=$(( mission - 1 ))     # subtract one to determine awk/NF offset
iteration_count=${2}          # assume for now this is a positive integer

awk -F"#" -v offset=${offset} -v itcnt=${iteration_count} 'BEGIN { OFS=FS }
{
    # we will start by counting fields backwards until we run out of fields
    # or we hit "itcnt==iteration_count" fields
    loopcnt=0
    for (i=NF-offset; i>=0; i-=2)   # offset=0 for mission=1; offset=1 for mission=2
    {
        loopcnt++
        if (loopcnt > itcnt)
            break
        fstart=i                    # keep track of the field we want to start with
    }
    # now print our fields starting with field number "fstart";
    # prefix the first printf with an empty string, then each successive
    # field is prefixed with OFS=#
    pfx = ""
    for (i=fstart; i<=NF-offset; i+=2)
    {
        printf "%s%s", pfx, $i
        pfx=OFS
    }
    # terminate the line of output with a linefeed
    printf "\n"
}
' mission.dat
Some test runs:
###### mission #1
# with offset/iteration = 4
$ mission 1 4
2.25#1.5#1#3.25
5.25#4#3#3.25
.2#.5#1#3.75
13.75#13#8.5#6
.2#.5#3#6.5
10.25#10.5#11#12.75
# with offset/iteration = 5
$ mission 1 5
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
# with offset/iteration = 6
$ mission 1 6
12#2#2.25#1.5#1#3.25
7#9#5.25#4#3#3.25
4#4#.2#.5#1#3.75
3#8#13.75#13#8.5#6
10#1#.2#.5#3#6.5
8#7#10.25#10.5#11#12.75
###### mission #2
# with offset/iteration = 4
$ mission 2 4
4#3#1#1
6#5#4#2
1#1#2#3
8#8#6#4
3#2#3#5
7#7#8#6
# with offset/iteration = 5
$ mission 2 5
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
# with offset/iteration = 6;
# notice we pick up field #1 = empty string so output starts with a '#'
$ mission 2 6
#SayWhat#4#3#1#1
#Smarty#6#5#4#2
#IfYouLike#1#1#2#3
#LaughingHard#8#8#6#4
#AtFunny#3#2#3#5
#PunchLines#7#7#8#6
This is probably not exactly what you're asking, but perhaps it will give you an idea.
$ awk -F_ -v skip=4 -v endoff=0 '
    BEGIN { OFS=FS }
    { offset = (NF-endoff) % skip
      for (i=offset; i<=NF-endoff; i+=skip)
          printf "%s", $i (i >= (NF-endoff) ? ORS : OFS) }' file
112_116_120
122_126_130
132_136_140
142_146_150
You specify the number of columns to skip and the end offset as input variables. Here, the end offset is set to zero (the last column) and the skip is 4.
For clarity I used the input file
$ cat file
_111_112_113_114_115_116_117_118_119_120
_121_122_123_124_125_126_127_128_129_130
_131_132_133_134_135_136_137_138_139_140
_141_142_143_144_145_146_147_148_149_150
Changing FS to match your format should work.
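One caveat when adapting it to the question's file.csv: when (NF-endoff) is an exact multiple of skip, the offset becomes 0 and $0 (the whole line) would be printed, so a small guard is needed. A sketch for Mission 1's format, without any iteration limit (so it returns all six alternating columns):
$ awk -F'#' -v skip=2 -v endoff=0 '
    BEGIN { OFS=FS }
    { offset = (NF-endoff) % skip
      if (offset == 0) offset = skip
      for (i=offset; i<=NF-endoff; i+=skip)
          printf "%s", $i (i >= (NF-endoff) ? ORS : OFS) }' file.csv
12#2#2.25#1.5#1#3.25
7#9#5.25#4#3#3.25
4#4#.2#.5#1#3.75
3#8#13.75#13#8.5#6
10#1#.2#.5#3#6.5
8#7#10.25#10.5#11#12.75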

How to separate number and unit from variable when using awk

In a 10-line awk script I need to split the content of a variable into a number variable and a unit variable. Here is a simplified example:
~$ echo 139506MB | awk '{
    ex = index("KMGTPEZY", substr($1, length($1)));
    val = substr($1, 0, length($1) - 2);
    print ex " " val
}'
0 139506
I know the unit part is always 2 characters, but for some reason ex always comes out as 0 instead of identifying MB as I was hoping.
Question
Any idea why ex doesn't contain the unit?
The logic in your index() call is wrong: the character you've extracted is not part of the string you've defined, hence the return value of 0 you are seeing.
For a regex approach using GNU Awk, the match() function can store captured groups in an array. You could do as below; the captured groups are stored in the array ary, from which you can access elements 1 and 2.
echo 139506MB | gawk 'match($0, /([[:digit:]]+)([[:alpha:]]+)/, ary) {print ary[1] ary[2]}'
Your substr() call is substr($1, length($1)) which will return only the last character of $1 (B). This character is not part of the string KMGTPEZY.
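Both steps can be checked in isolation:
$ echo 139506MB | awk '{ print substr($1, length($1)) }'
B
$ awk 'BEGIN { print index("KMGTPEZY", "B") }'
0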
$ echo '139506MB' | awk '{ n=$1+0; sub(n,"",$1); print $1,n }'
MB 139506
This uses the fact that converting a string to a number discards everything from the first non-digit. This allows us to store the number in n using $1+0 (forcing the first field to be interpreted as a number). We then remove the number from the original line using sub(). The number and the remaining text are then printed.
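The coercion itself is easy to verify; everything from the first non-numeric character is simply ignored:
$ awk 'BEGIN { print "139506MB" + 0 }'
139506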
Using GNU awk and split's optional fourth argument seps, abusing .B as the separator to split the number and the unit:
$ echo 139506MB | awk '{split($1,a,/.B/,seps);print seps[1],a[1]}'
MB 139506
Also, regarding your code: you (try to) take the index of M in the string KMGTPEZY, so I assume you are looking for ex==2. Fixing the substr as below:
$ echo 139506MB | awk '{
    ex = index("KMGTPEZY", substr($1, length($1)-1, 1));  # was substr($1, length($1))
    # ex = substr($1, length($1)-1, 1);  # uncomment to get the unit letter instead
    val = substr($1, 0, length($1) - 2);
    print ex " " val
}'
2 139506
Maybe you should update the OP with the expected output.
The following awk may help you too:
str="139506MB"
echo "$str" | awk '
match($0, /[0-9]+/) {
    val = substr($0, RSTART+RLENGTH)
    if (val ~ /[a-zA-Z]+/) {
        print substr($0, RSTART, RLENGTH), val
    }
}'
The first issue is here:
substr($1, length($1))
You are getting the last character of the string, which is "B". There is no "B" in "KMGTPEZY", so index returns 0.
I don't think you need to use index at all. To use substr:
ex = substr($1, length($1) - 1);
val = substr($1, 0, length($1) - 2);
Testing:
$ awk '{ print substr($1, length($1) - 1), substr($1, 0, length($1) - 2) }' <<< '139506MB'
MB 139506

What linux commands can I use to sort columns in a tab-separated text file?

I need to compare two versions of the same file. Both are tab-separated and have this form:
<filename1><tab><Marker11><tab><Marker12>...
<filename2><tab><Marker21><tab><Marker22><tab><Marker22>...
So each row has a different number of markers (the number varies between 1 and 10), and they all come from a small set of possible markers. A file looks like this:
fileX<tab>Z<tab>M<tab>A
fileB<tab>Y
fileM<tab>M<tab>C<tab>B<tab>Y
What I need is:
Sort the file by rows
Sort the markers in each row so that they are in alphabetical order
So for the example above, the result would be
fileB<tab>Y
fileM<tab>B<tab>C<tab>M<tab>Y
fileX<tab>A<tab>M<tab>Z
It's easy to do #1 using sort, but how do I do #2?
UPDATE: It's not a duplicate of this post, since my rows have different lengths and I need the entries after the filename in each row sorted individually, i.e. the only column that gets preserved is the first one.
awk solution:
awk 'BEGIN{ FS=OFS="\t"; PROCINFO["sorted_in"]="#ind_str_asc" }
{
    split($0,b,FS); delete b[1]; asort(b); r=""
    for (i in b) r = (r!="") ? r OFS b[i] : b[i]
    a[$1] = r
}
END{ for (i in a) print i, a[i] }' file
The output:
fileB Y
fileM B C M Y
fileX A M Z
PROCINFO["sorted_in"]="#ind_str_asc" - sort mode
split($0,b,FS); - split the line into array b by FS (field separator)
asort(b) - sort marker values
All you need is:
awk 'BEGIN { FS=OFS="\t" }
{ for (i=2; i<=NF; i++) arr[$1][$i] }
END {
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for (i in arr) {
        printf "%s", i
        for (j in arr[i]) {
            printf "%s%s", OFS, arr[i][j]
        }
        print ""
    }
}
' file
The above uses GNU awk for true multi-dimensional arrays plus sorted_in.
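For awks without true multidimensional arrays, one workaround is to let the shell sort each row's markers instead; a rough sketch, assuming bash and no tabs inside marker values:
while IFS=$'\t' read -r name markers; do
    printf '%s\t%s\n' "$name" "$(printf '%s\n' "$markers" | tr '\t' '\n' | sort | paste -sd '\t' -)"
done < file | sort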
