Compare items in two files and write the combined result to a new file using AWK

Greetings!
I have pairs of files captured on two nodes in a network. Each file records, per TCP segment, the send/receive time, IP ID number, segment type, sequence numbers, and so on.
For the same TCP flow, the sender side looks like this:
1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
1420862364.780810 50371 seq 20296:21744
....
and the receiver side looks like this (1 second delay; the segment with IP ID 50371 was lost):
1420862364.778332 50369 seq 17400:18848
1420862364.780798 50370 seq 18848:20296
....
I want to compare the IP identification numbers in the two files and write the result to a new file like this:
1420862364.778332 1420862365.778332 50369 seq 17400:18848 o
1420862364.780798 1420862365.780798 50370 seq 18848:20296 o
1420862364.780810 1420862365.780810 50371 seq 20296:21744 x
which includes the arrival time on the receiver side. The ID field is compared: when the same value is not found on the receiver side (packet loss), an x is appended; otherwise an o is appended.
I already have code like this:
awk 'ARGIND==1 {w[$2]=$1}
ARGIND==2 {
    flag=0;
    for(a in w)
        if($2==a) {
            flag=1;
            print $1,w[a],$2,$3,$4;
            break;
        }
    if(!flag)
        print $1,"x",$2,$3,$4;
}' file2 file1 >file3
but it doesn't work on Linux: it returns right after I press Enter and leaves only an empty file.
The shell script containing this code has been made executable with chmod +x.
Please help. My code is not well organized, so any new one-liner would be appreciated.
Thank you for your time.

ARGIND is gawk-specific btw so check your awk version. – Ed Morton
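For comparison, the join the question describes can be sketched outside awk. The Python below is only a minimal illustration, not the asker's code: the function name merge_by_ip_id and the "-" placeholder for a missing arrival time are assumptions made for this example. It indexes the receiver records by IP ID, then walks the sender records in order so the output keeps the sender's ordering.

```python
def merge_by_ip_id(sender_lines, receiver_lines):
    """Join sender records with receiver arrival times on the IP ID (field 2)."""
    # Map IP ID -> arrival timestamp seen on the receiver side.
    arrival = {}
    for line in receiver_lines:
        fields = line.split()
        if len(fields) >= 2:
            arrival[fields[1]] = fields[0]

    merged = []
    for line in sender_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sent, ip_id, rest = fields[0], fields[1], fields[2:]
        if ip_id in arrival:
            merged.append(" ".join([sent, arrival[ip_id], ip_id, *rest, "o"]))
        else:
            # Assumption: "-" stands in for the arrival time of a lost segment.
            merged.append(" ".join([sent, "-", ip_id, *rest, "x"]))
    return merged
```

IP IDs are 16-bit values and wrap around, so for long captures the lookup key would probably need more than the ID alone, for example the ID combined with the sequence range.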

Is there any command to do fuzzy matching in Linux based on multiple columns

I have two CSV files.
File 1
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2
PID,FNAME,MNAME,LNAME,GENDER,DOB
S2,66M,J,Rock,F,1995
S3,David,HM,Lee,M,1990
S0,Marc,HM,Robert,M,2000
S1,Marc,MS,Robert,M,2000
S6,Paul,,Row,M,2008
S7,Sam,O,Baby,F,2018
What I want to do is use the crosswalk file, File 2, to recover the PID of each observation in File 1 based on the columns FNAME, MNAME, LNAME, GENDER, and DOB. Because the corresponding information in File 1 is incomplete, I'm thinking of using fuzzy matching to recover as many PIDs as possible (taking the level of accuracy into account, of course). For example, the observations with FNAME "Paul" and LNAME "Row" in File 1 should be assigned the same PID because there is only one similar observation in File 2. But for the observations with FNAME "Marc" and LNAME "Robert", Marc,MS,Robert,M,2000,201211.0 should be assigned PID "S1", Marc,H,Robert,M,2000,201211.0 PID "S0", and Marc,M,Robert,M,,201211.0 either "S0" or "S1".
Since I want to fill in as many of File 1's PIDs as possible while keeping accuracy high, I am considering three steps. First, use a command that assigns a PID to an observation in File 1 if and only if FNAME, MNAME, LNAME, GENDER, and DOB all match exactly. The output should be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,
3,Marc,H,Robert,M,2000,201211.0,
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,
3,David,M,Lee,,1990,201211.0,
5,Paul,ABC,Row,F,2008,201211.0,
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,
4,66,J,Rock,,1995,201211.0,
Next, write another command that requires DOB to match exactly but uses fuzzy matching on FNAME, MNAME, LNAME, and GENDER to recover the PIDs of the File 1 observations that were not identified in the first step. The output after these two steps is supposed to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,
6,66M,,Rock,F,,201211.0,
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
In the final step, use a new command that does fuzzy matching on all the relevant columns, namely FNAME, MNAME, LNAME, GENDER, and DOB, to fill in the PIDs of the remaining observations. The final output is expected to be
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
I need to keep the order of File 1's observations, so it must be a kind of left outer join. Because my original data is about 100 GB, I want to use Linux tools to deal with it.
But I have no idea how to complete the last two steps with awk or any other Linux command. Can anyone help? Thank you.

Here is a shot at it with GNU awk (using PROCINFO["sorted_in"] to pick the most suitable candidate). It hashes file2's values per field and attaches the PID to each value, like field[2]["66M"]="S2", and for each record in file1 counts the PID matches and prints the PID with the highest count:
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
}
NR==FNR {                              # file2
    for(i=1;i<=6;i++)                  # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                      # file1
    for(i=1;i<=6;i++) {                # fields 1-6
        if($i in field[i]) {           # if value matches
            split(field[i][$i],t,FS)   # get PIDs
            for(j in t) {              # and
                matches[t[j]]++        # increase PID counts
            }
        } else {                       # if no value match
            for(j in field[i])         # for all field values
                if($i~j || j~$i)       # "go fuzzy" :D
                    matches[field[i][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                # the best match first
        print $0,i
        delete matches
        break                          # we only want the best match
    }
}
Output:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,PID
2,66M,J,Rock,F,1995,201211.0,S2
3,David,HM,Lee,M,,201211.0,S3
6,66M,,Rock,F,,201211.0,S2
0,David,H M,Lee,,1990,201211.0,S3
3,Marc,H,Robert,M,2000,201211.0,S0
6,Marc,M,Robert,M,,201211.0,S1
6,Marc,MS,Robert,M,2000,201211.0,S1
3,David,M,Lee,,1990,201211.0,S3
5,Paul,ABC,Row,F,2008,201211.0,S6
3,Paul,ACB,Row,,,201211.0,S6
4,David,,Lee,,1990,201211.0,S3
4,66,J,Rock,,1995,201211.0,S2
The "fuzzy match" here is naive, if($i~j || j~$i), but feel free to replace it with any approximate matching algorithm; for example, there are a few implementations of the Levenshtein distance algorithm floating around the internet. Rosetta Code seems to have one.
You didn't mention how big file2 is, but if it's way beyond your memory capacity, you may want to consider splitting the files somehow.
Update: A version that maps file1 fields to file2 fields (as mentioned in comments):
BEGIN {
    FS=OFS=","
    PROCINFO["sorted_in"]="@val_num_desc"
    map[1]=1                           # map file1 fields to file2 fields
    map[2]=3
    map[3]=4
    map[4]=2
    map[5]=5
    map[7]=6
}
NR==FNR {                              # file2
    for(i=1;i<=6;i++)                  # fields 1-6
        if($i!="") {
            field[i][$i]=field[i][$i] (field[i][$i]==""?"":OFS) $1  # attach PID to value
        }
    next
}
{                                      # file1
    for(i in map) {
        if($i in field[map[i]]) {      # if value matches
            split(field[map[i]][$i],t,FS)  # get PIDs
            for(j in t) {              # and
                matches[t[j]]++        # increase PID counts
            }
        } else {                       # if no value match
            for(j in field[map[i]])    # for all field values
                if($i~j || j~$i)       # "go fuzzy" :D
                    matches[field[map[i]][j]]+=0.5  # fuzzy is half a match
        }
    }
    for(i in matches) {                # the best match first
        print $0,i
        delete matches
        break                          # we only want the best match
    }
}
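The answer suggests swapping the regex test for an approximate-matching algorithm such as Levenshtein distance. As a rough sketch of what that could look like, here is a plain dynamic-programming version in Python. The names levenshtein and is_fuzzy_match and the max_distance threshold of 1 are illustrative assumptions, not part of the awk script above.

```python
def levenshtein(a, b):
    """Return the edit distance (insert, delete, substitute) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_fuzzy_match(x, y, max_distance=1):
    """Treat two field values as a fuzzy match if they are within max_distance edits."""
    return levenshtein(x, y) <= max_distance
```

With this threshold, "HM" and "H M" (one insertion) would count as a fuzzy match, while "ABC" and "ACB" (two substitutions) would not. A suitable threshold depends on the data.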

Reading console output from mplayer to parse track's position/length

When you run mplayer, it will display the playing track's position and length (among some other information) through, what I'd assume is, stdout.
Here's a sample output from mplayer:
MPlayer2 2.0-728-g2c378c7-4+b1 (C) 2000-2012 MPlayer Team
Cannot open file '/home/pi/.mplayer/input.conf': No such file or directory
Failed to open /home/pi/.mplayer/input.conf.
Cannot open file '/etc/mplayer/input.conf': No such file or directory
Failed to open /etc/mplayer/input.conf.
Playing Bomba Estéreo - La Boquilla [Dixone Remix].mp3.
Detected file format: MP2/3 (MPEG audio layer 2/3) (libavformat)
[mp3 @ 0x75bc15b8]max_analyze_duration 5000000 reached
[mp3 @ 0x75bc15b8]Estimating duration from bitrate, this may be inaccurate
[lavf] stream 0: audio (mp3), -aid 0
Clip info:
album_artist: Bomba Estéreo
genre: Latin
title: La Boquilla [Dixone Remix]
artist: Bomba Estéreo
TBPM: 109
TKEY: 11A
album: Unknown
date: 2011
Load subtitles in .
Selected audio codec: MPEG 1.0/2.0/2.5 layers I, II, III [mpg123]
AUDIO: 44100 Hz, 2 ch, s16le, 320.0 kbit/22.68% (ratio: 40000->176400)
AO: [pulse] 44100Hz 2ch s16le (2 bytes per sample)
Video: no video
Starting playback...
A: 47.5 (47.4) of 229.3 (03:49.3) 4.1%
The last line (A: 47.5 (47.4) of 229.3 (03:49.3) 4.1%) is what I'm trying to read but, for some reason, it's never received by the Process.OutputDataReceived event handler.
Am I missing something? Is mplayer using some non-standard way of outputting the "A:" line to the console?
Here's the code in case it helps:
Public Overrides Sub Play()
    player = New Process()
    player.EnableRaisingEvents = True
    With player.StartInfo
        .FileName = "mplayer"
        .Arguments = String.Format("-ss {1} -endpos {2} -volume {3} -nolirc -vc null -vo null ""{0}""",
                                   tmpFileName,
                                   mTrack.StartTime,
                                   mTrack.EndTime,
                                   100)
        .CreateNoWindow = False
        .UseShellExecute = False
        .RedirectStandardOutput = True
        .RedirectStandardError = True
        .RedirectStandardInput = True
    End With
    AddHandler player.OutputDataReceived, AddressOf DataReceived
    AddHandler player.ErrorDataReceived, AddressOf DataReceived
    AddHandler player.Exited, Sub() KillPlayer()
    player.Start()
    player.BeginOutputReadLine()
    player.BeginErrorReadLine()
    waitForPlayer.WaitOne()
    KillPlayer()
End Sub
Private Sub DataReceived(sender As Object, e As DataReceivedEventArgs)
    If e.Data = Nothing Then Exit Sub
    If e.Data.Contains("A: ") Then
        ' Parse the data
    End If
End Sub

Apparently, the only solution is to run mplayer in "slave" mode, as explained here: http://www.mplayerhq.hu/DOCS/tech/slave.txt
In this mode we can send commands to mplayer (via stdin) and the response (if any) will be sent via stdout.
Here's a very simple implementation that displays mplayer's current position (in seconds):
using System;
using System.Threading;
using System.Diagnostics;
using System.Collections.Generic;
namespace TestMplayer {
    class MainClass {
        private static Process player;
        public static void Main(string[] args) {
            String fileName = "/home/pi/Documents/Projects/Raspberry/RPiPlayer/RPiPlayer/bin/Electronica/Skrillex - Make It Bun Dem (Damian Marley) [Butch Clancy Remix].mp3";
            player = new Process();
            player.EnableRaisingEvents = true;
            player.StartInfo.FileName = "mplayer";
            player.StartInfo.Arguments = String.Format("-slave -nolirc -vc null -vo null \"{0}\"", fileName);
            player.StartInfo.CreateNoWindow = false;
            player.StartInfo.UseShellExecute = false;
            player.StartInfo.RedirectStandardOutput = true;
            player.StartInfo.RedirectStandardError = true;
            player.StartInfo.RedirectStandardInput = true;
            player.OutputDataReceived += DataReceived;
            player.Start();
            player.BeginOutputReadLine();
            player.BeginErrorReadLine();
            Thread getPosThread = new Thread(GetPosLoop);
            getPosThread.Start();
        }
        private static void DataReceived(object o, DataReceivedEventArgs e) {
            Console.Clear();
            Console.WriteLine(e.Data);
        }
        private static void GetPosLoop() {
            do {
                Thread.Sleep(250);
                player.StandardInput.Write("get_time_pos" + Environment.NewLine);
            } while(!player.HasExited);
        }
    }
}

I ran into the same problem with another application that works in a similar way (dbPowerAmp). In my case, the process wrote to the stdout buffer using Unicode encoding, so I had to set StandardOutputEncoding and StandardErrorEncoding to Unicode before I could start reading.
Your problem seems to be the same: if "A" cannot be found in the output, even though the output you posted clearly contains it, then the character probably differs when decoded with the encoding you are currently using to read the output.
So try setting the proper encoding when reading the process output; try setting both of these to Unicode:
ProcessStartInfo.StandardOutputEncoding
ProcessStartInfo.StandardErrorEncoding

Using "read" instead of "readline", and treating the input as binary, will probably fix your problem.
First off, yes, mplayer slave mode is probably what you want. However, if you're determined to parse the console output, it is possible.
Slave mode exists for a reason, and if you're half serious about using mplayer from within your program, it's worth a little time to figure out how to use it properly. That said, I'm sure there are situations where the wrapper is the appropriate approach. Maybe you want to pretend that mplayer is running normally and control it from the console, but secretly monitor the file position to resume it later? The wrapper might be easier than translating all of mplayer's keyboard commands into slave mode commands.
Your problem is likely that you're trying to use "readline" from within python on an endless line. That line of output contains \r instead of \n as the line separator, so readline will treat it as a single endless line. sed also fails this way, but other commands (such as grep) treat \r as \n under some circumstances.
Handling of \r is inconsistent and can't be relied on. For instance, my version of grep treats \r as \n when matching if the output is a console, and uses \n to separate the output. But if the output is a pipe, it treats \r like any other character.
For instance:
mplayer TMBG-Older.mp3 2>/dev/null | tr '\r' '\n' | grep "^A: " | sed 's/^A: *\([0-9.]*\) .*/\1/' | tail -n 1
I'm using "tr" here to force it to '\n', so other commands in the pipe can deal with it in a consistent manner.
This pipeline of commands outputs a single line, containing ONLY the ending position in seconds, with decimal point. But if you remove the "tr" command from this pipe, bad things happen. On my system, it shows only "0.0" as the position, as "sed" doesn't deal well with the '\r' line separators, and ALL the position updates are treated as the same line.
I'm fairly sure python doesn't handle \r well either, and that's likely your problem. If so, using "read" instead of "readline" and treating it like binary is probably the correct solution.
There are other problems with this approach, though. Buffering is a big one. ^C causes this command to output nothing; mplayer must quit gracefully for it to show anything at all, as pipelines buffer things, and buffers get discarded on SIGINT.
If you really wanted to get fancy, you could probably cat several input sources together, tee the output several ways, and REALLY write a wrapper around mplayer. A wrapper that's fragile, complicated, and might break every time mplayer is updated, a user does something unexpected, or the name of the file being played contains something weird, SIGSTOP or SIGINT. And probably other things that I haven't thought of.
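The "read instead of readline, treat it as binary" suggestion could look roughly like the Python sketch below. It is an illustration under assumptions rather than a tested mplayer wrapper: the function name iter_positions, the chunk size, and the regular expression for the "A:" status line are all made up for this example, and the sample bytes only imitate the status line shown in the question.

```python
import io
import re

# Matches a status record such as b"A:  47.5 (47.4) of 229.3 ..."
STATUS_RE = re.compile(rb"^A:\s*([0-9.]+)")


def iter_positions(stream, chunk_size=4096):
    """Yield playback positions (seconds) from a binary stream whose status
    updates are separated by carriage returns rather than newlines."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Treat both \r and \n as record separators; the last piece may be
        # an incomplete record, so keep it for the next chunk.
        *records, pending = re.split(rb"[\r\n]", pending)
        for record in records:
            match = STATUS_RE.match(record)
            if match:
                yield float(match.group(1))
    # Flush a final record that was not terminated by a separator.
    match = STATUS_RE.match(pending)
    if match:
        yield float(match.group(1))


# Demo on canned bytes instead of a live mplayer process.
sample = io.BytesIO(b"Starting playback...\nA:  47.5 (47.4) of 229.3\rA:  47.8 (47.7) of 229.3\r")
print(list(iter_positions(sample)))
```

With a real process the stream would be something like the stdout pipe of subprocess.Popen. A plain read(n) on such a pipe may block until n bytes have arrived, so a small chunk size or read1 is probably the better choice there.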

Force lshosts command to return megabytes for "maxmem" and "maxswp" parameters

When I type "lshosts" I am given:
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
server1 X86_64 Intel_EM 60.0 12 191.9G 159.7G Yes ()
server2 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
server3 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
I am trying to return maxmem and maxswp as megabytes, not gigabytes when lshosts is called. I am trying to send Xilinx ISE jobs to my LSF, however the software expects integer, megabyte values for maxmem and maxswp. By doing debugging, it appears that the software grabs these parameters using the lshosts command.
I have already checked in my lsf.conf file that:
LSF_UNIT_FOR_LIMTS=MB
I have tried searching the IBM Knowledge Base, but to no avail.
Do you use a specific command to specify maxmem and maxswp units within the lsf.conf, lsf.shared, or other config files?
Or does LSF force return the most practical unit?
Any way to override this?

LSF_UNIT_FOR_LIMITS should work, if you completely drained the cluster of all running, pending, and finished jobs. According to the docs, MB is the default, so I'm surprised.
That said, you can use something like this to transform the results:
$ cat to_mb.awk
function to_mb(s) {
    e = index("KMG", substr(s, length(s)))   # 1 for K, 2 for M, 3 for G
    m = substr(s, 1, length(s) - 1)          # numeric part without the unit suffix
    return m * 10^((e-2) * 3)
}
{ print $1 " " to_mb($6) " " to_mb($7) }
$ lshosts | tail -n +2 | awk -f to_mb.awk
server1 191900 159700
server2 191900 191200
server3 191900 191200
The to_mb function should also handle 'K' or 'M' units, should those pop up.
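The same conversion can be sketched in Python, which makes the unit table explicit. This is a hedged equivalent of the awk function, not LSF behavior: the decimal factors (1 G = 1000 M) simply mirror the awk version, and treating a value with no suffix as already being in megabytes is an assumption of this example.

```python
def to_mb(value):
    """Convert an lshosts-style size such as '191.9G' to megabytes."""
    factors = {"K": 1e-3, "M": 1.0, "G": 1e3}  # decimal factors, as in the awk version
    suffix = value[-1].upper()
    if suffix in factors:
        return float(value[:-1]) * factors[suffix]
    return float(value)  # assumption: a bare number is already in MB


def to_mb_int(value):
    """Round to a whole number of megabytes, for tools that expect integers."""
    return round(to_mb(value))
```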

If LSF_UNIT_FOR_LIMITS is defined in lsf.conf, lshosts will always print the output as a floating point number, and in some versions of LSF the parameter is defined as 'KB' in lsf.conf upon installation.
Try searching for any definitions of the parameter in lsf.conf and commenting them all out so that the parameter is left undefined. I think in that case it defaults to printing the values as integers in megabytes.
(Don't ask me why it works this way)

entering text in a file at specific locations by identifying the number being integer or real in linux

I have an input like below
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
Each segment whose 2nd entry is an integer, like 1, runs for thousands of lines; then a segment starts whose 2nd entry is a real number, like 3.58077402e+01.
Before each segment begins I have to insert text, like this:
*Revolved
*Gripped
*Crippled
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
*Cracked
*Crippled
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
So I need to insert specific text at those locations. Note that the file is space-delimited, not tab-delimited, and that the lines starting with * must begin at the very left of the line with no leading spaces. The format of the rest of the file should be kept as well.
Any suggestions with sed or awk would be highly appreciated!
The text at the beginning could be entered directly, since that is the start of the file, so it is not the main problem. The problematic part is the second group of lines: identifying where the second entry has turned into a real number.

An awk with fixed strings:
awk 'BEGIN{print "*Revolved\n*Gripped\n*Crippled"}
$2~/\+/&&!pr{print "*Cracked\n*Crippled";pr=1}1' yourfile
$2~/\+/&&!pr : true when a + character is found in field $2 (a real number) and the pr flag has not been set yet.
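Testing for a "+" only catches real numbers with a positive exponent. As a hedged alternative sketch in Python, the switch can be detected by checking whether the second field parses as an integer. The function names insert_headers and is_integer are made up for this example; the header strings are the ones from the question.

```python
def is_integer(token):
    """Return True if token parses as a plain integer (e.g. '1', '-5')."""
    try:
        int(token)
        return True
    except ValueError:
        return False


def insert_headers(lines):
    """Prepend the first header block and insert the second one at the first
    line whose second field is not an integer. Lines are passed through as-is."""
    out = ["*Revolved", "*Gripped", "*Crippled"]
    seen_real = False
    for line in lines:
        fields = line.split()
        if not seen_real and len(fields) > 1 and not is_integer(fields[1]):
            out += ["*Cracked", "*Crippled"]
            seen_real = True
        out.append(line)  # original spacing is preserved
    return out
```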

Counting TCP retransmissions

I would like to know if there is a way to count the number of TCP retransmissions that occurred in a flow, in LINUX. Either on the client side or the server side.

Looks like netstat -s solves my purpose.

You can see TCP retransmissions for a single TCP flow using Wireshark. The "follow TCP stream" filter will allow you to see a single TCP stream. And the tcp.analysis.retransmission one will show retransmissions.
For more details, this serverfault question may be useful: https://serverfault.com/questions/318909/how-passively-monitor-for-tcp-packet-loss-linux

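The Linux kernel exposes TCP counters through the proc pseudo-filesystem. The 13th field of the Tcp: lines in /proc/net/snmp is RetransSegs, the total number of retransmitted segments; the related TCPSynRetrans counter is reported in the TcpExt: lines of /proc/net/netstat.
For example:
awk '$1 ~ "Tcp:" { print $13 }' /proc/net/snmp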
Per documentation:
* TCPSynRetrans
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
explanation below::
--
TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
You can also adjust the retransmission-related settings through procfs, under the /proc/sys directory. The sysctl utility is a handy shorthand for this.
sysctl -a | grep retrans
net.ipv4.neigh.default.retrans_time_ms = 1000
net.ipv4.neigh.docker0.retrans_time_ms = 1000
net.ipv4.neigh.enp1s0.retrans_time_ms = 1000
net.ipv4.neigh.lo.retrans_time_ms = 1000
net.ipv4.neigh.wlp6s0.retrans_time_ms = 1000
net.ipv4.tcp_early_retrans = 3
net.ipv4.tcp_retrans_collapse = 1
net.ipv6.neigh.default.retrans_time_ms = 1000
net.ipv6.neigh.docker0.retrans_time_ms = 1000
net.ipv6.neigh.enp1s0.retrans_time_ms = 1000
net.ipv6.neigh.lo.retrans_time_ms = 1000
net.ipv6.neigh.wlp6s0.retrans_time_ms = 1000
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
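Picking a counter by column number is brittle, so a lookup by name may be safer. The sketch below parses text in the /proc/net/snmp layout (a "Tcp:" line of names followed by a "Tcp:" line of values) into a dictionary. The function name parse_snmp_tcp is an assumption for this example. Note that these counters are system-wide totals, not per-flow figures.

```python
def parse_snmp_tcp(text):
    """Return the Tcp counters from /proc/net/snmp-formatted text as a dict."""
    header = None
    for line in text.splitlines():
        if not line.startswith("Tcp:"):
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields  # the first Tcp: line holds the counter names
        else:
            # the second Tcp: line holds the matching integer values
            return dict(zip(header, map(int, fields)))
    return {}
```

On a Linux machine this could be used as parse_snmp_tcp(open("/proc/net/snmp").read())["RetransSegs"]. Sampling it before and after a transfer gives a rough delta, assuming little other TCP traffic on the host.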
