Linux - parsing data, what language to use - linux

I am looking to parse data out of a 'column'-based format. I keep running into issues where I feel I am 'hacking' bash/awk commands together to pull out the strings and numbers; if the numbers/text come in a different format, the script might fail unexpectedly and I will get errors.
Data:
RSSI (dBm): -86 Tx Power: 0
RSRP (dBm): -114 TAC: 4r5t (12341)
RSRQ (dB): -10 Cell ID: efefwg (4261431)
SINR (dB): 2.2
My method:
Using bash and awk
#!/bin/bash
DATA_OUTPUT=$(get_data)
RSSI=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSSI" {print $3}')
RSRP=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSRP" {print $3}')
RSRQ=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSRQ" {print $3}')
SINR=$(echo "${DATA_OUTPUT}" | awk '$1 == "SINR" {print $3}')
TX_POWER=$(echo "${DATA_OUTPUT}" | awk '$4 == "Tx" {print $6}')
echo "$SINR"
echo ">$SINR<"
However, the output of the above comes out very strange:
2.2 # that's fine!
<2.2 # what??? expecting >2.2<
Little things like this make me question using awk and bash to parse the data. Should I use C++ or some other language? Or is there a better way of doing this?
Thank you

This should be your starting point (the match() can be simplified or removed if your input data is tab-separated or has fixed-width fields):
$ cat file
RSSI (dBm): -86 Tx Power: 0
RSRP (dBm): -114 TAC: 4r5t (12341)
RSRQ (dB): -10 Cell ID: efefwg (4261431)
SINR (dB): 2.2
$ cat tst.awk
{
    tail = $0
    while ( match(tail,/[^:]+:[[:space:]]+[^[:space:]]+[[:space:]]*([^[:space:]]*$)?/) )
    {
        nvPair = substr(tail,RSTART,RLENGTH)
        sub(/ \([^)]+\):/,":",nvPair)    # remove (dB) or (dBm)
        sub(/:[[:space:]]+/,":",nvPair)  # remove spaces after :
        sub(/[[:space:]]+$/,"",nvPair)   # remove trailing spaces
        split(nvPair,tmp,/:/)
        name2value[tmp[1]] = tmp[2]      # name2value["RSSI"] = "-86"
        tail = substr(tail,RSTART+RLENGTH)
    }
}
END {
    for (name in name2value) {
        value = name2value[name]
        printf "%s=\"%s\"\n", name, value
    }
}
$ awk -f tst.awk file
Tx Power="0"
RSSI="-86"
TAC="4r5t (12341)"
Cell ID="efefwg (4261431)"
RSRP="-114"
RSRQ="-10"
SINR="2.2"
Hopefully it's clear that in the above script, after the match() loop, you can simply say things like print name2value["Tx Power"] to print the value of that key phrase.
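For instance, replacing the END loop above with direct lookups (a minimal sketch):
END { printf "SINR=%s, Tx Power=%s\n", name2value["SINR"], name2value["Tx Power"] }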
If your data was created in DOS, run dos2unix or tr -d '^M' on it first, where ^M means a literal control-M character.

Your data contains DOS-style \r\n line endings. When you do this
echo ">$SINR<"
the actual output is
>2.2\r<
The carriage return sends the cursor back to the start of the line, so the trailing < overwrites the leading >.
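You can reproduce the effect in any terminal (here the \r is injected with bash's $'...' quoting):
$ printf '>%s<\n' $'2.2\r'
<2.2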
You can do this:
DATA_OUTPUT=$(get_data | sed 's/\r$//')
But instead of parsing the output over and over, I'd rewrite like this:
while read -ra fields; do
    case ${fields[0]} in
        RSSI) rssi=${fields[2]};;
        RSRP) rsrp=${fields[2]};;
        RSRQ) rsrq=${fields[2]};;
        SINR) sinr=${fields[2]};;
    esac
    if [[ ${fields[3]} == "Tx" ]]; then tx_power=${fields[5]}; fi
done < <(get_data | sed 's/\r$//')
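After the loop, the values are ordinary shell variables, e.g. (a small usage sketch):
printf 'RSSI=%s RSRP=%s RSRQ=%s SINR=%s TX=%s\n' "$rssi" "$rsrp" "$rsrq" "$sinr" "$tx_power"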

Related

How to hash a particular column in a CSV file | linux |

I have a scenario where I want to hash some columns of a CSV file.
How can I do that with the data below?
ID|NAME|CITY|AGE
1|AB1|BBC|12
2|AB2|FGD|17
3|AB3|ASD|18
4|AB4|SDF|19
5|AB5|ASC|22
The columns NAME and AGE should get hashed with random-looking values,
like the output below:
ID|NAME|CITY|AGE
1|68b329da9111314099c7d8ad5cb9c940|BBC|77bAD9da9893er34099c7d8ad5cb9c940
2|69b32fga9893e34099c7d8ad5cb9c940|FGD|68bAD9da989yue34099c7d8ad5cb9c940
3|46b329da9893e3403453d8ad5cb9c940|ASD|60bfgD9da9893e34099c7d8ad5cb9c940
4|50Cd29da9893e34099c7d8ad5cb9c940|SDF|67bAD9da98973e34099c7d8ad5cb9c940
5|67bAD9da9893e34099c7d8ad5cb9c940|ASC|67bAD9da11893e34099c7d8ad5cb9c940
When I tested the code below, it gives me the same value for every row of the NAME column; it should give randomized values:
awk '{
    tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
    tmp | getline cksum
    close(tmp)
    $2=cksum
    print
}' < sample.csv
output :
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
You may use it like this:
awk 'function hash(s, cmd, hex, line) {
    cmd = "openssl md5 <<< \"" s "\""
    if ( (cmd | getline line) > 0 )
        hex = line
    close(cmd)
    return hex
}
BEGIN {
    FS = OFS = "|"
}
NR == 1 {
    print
    next
}
{
    print $1, hash($2), $3, hash($4)
}' file
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|2737b49252e2a4c0fe4c342e92b13285
2|157aa4a48373eaf0415ea4229b3d4421|FGD|4d095eeac8ed659b1ce69dcef32ed0dc
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|cf4278314ef8e4b996e1b798d8eb92cf
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|3bb50ff8eeb7ad116724b56a820139fa
5|427872b1ac3a22dc154688ddc2050516|ASC|2fc57d6f63a9ee7e2f21a26fa522e3b6
You have to specify | as input and output field separators. Otherwise $2 is not what you expect, but an empty string.
awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
    tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
    tmp | getline cksum
    close(tmp)
    $2=cksum
    print
}' sample.csv
prints
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|12
2|157aa4a48373eaf0415ea4229b3d4421|FGD|17
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|18
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|19
5|427872b1ac3a22dc154688ddc2050516|ASC|22
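To see why the original failed: with awk's default whitespace field splitting, a line like 1|AB1|BBC|12 contains no spaces, so the whole line is $1 and $2 is empty:
$ echo '1|AB1|BBC|12' | awk '{ print "<" $2 ">" }'
<>
(68b329da9893e34099c7d8ad5cb9c940 from the question's output is indeed the md5 of the bare newline that echo emits for an empty $2.)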
Example using GNU datamash to do the hashing and some awk to rearrange the columns it outputs:
$ datamash -t'|' --header-in -f md5 2,4 < input.txt | awk 'BEGIN { FS=OFS="|"; print "ID|NAME|CITY|AGE" } { print $1, $5, $3, $6 }'
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9
Note that the MD5 hashes in this answer differ (at the time of writing) from the ones in the other answers; that's because those approaches add a trailing newline to the strings being hashed, producing incorrect results if you want the exact hash:
$ echo AB1 | md5sum
d44aec35a11ff6fa8a800120dbef1cd7 -
$ echo -n AB1 | md5sum
1109867462b2f0f0470df8386036243c -
You might consider using a language that has support for md5 included, or at least cache the md5 results (I assume that the name and age columns have a limited domain, one smaller than the number of lines).
Perl has support for md5 out of the box:
perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) {
    $F[$_] = md5_hex($F[$_]) for (1,3);
    print join "|", @F
} else { print }'
online demo: https://ideone.com/xg6cxZ (to my surprise, ideone has perl available in bash)
Digest::MD5 is a core module; any perl installation should have it
-M'Digest::MD5 qw(md5_hex)' - this loads the md5_hex function
-l - handle line endings
-F'\|' - autosplit fields on | (this implies -a and -n)
2..eof - range operator (or flip-flop, as some want to call it) - true between line 2 and the end of the file
$F[$_] = md5_hex($F[$_]) - replace field $_ with its md5 sum
for (1,3) - statement modifier that runs the statement for 1 and 3, aliasing $_ to them
print join "|", @F - print the modified fields
else { print } - this handles the header
Note about speed: on my machine this processes ~100,000 lines in about 100 ms, compared with an awk variant of this answer that does 5,000 lines in ~1 minute 14 seconds (I wasn't patient enough to wait for 100,000 lines):
time perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3); print join "|", @F } else { print }' <sample2.txt > out4.txt
real 0m0.121s
user 0m0.118s
sys 0m0.003s
$ time awk -F'|' -v OFS='|' -i md5.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >out2.txt
real 1m14.205s
user 0m50.405s
sys 0m35.340s
md5.awk defines the md5 function as such:
$ cat md5.awk
function md5(str, cmd, l, hex) {
    cmd = "/bin/echo -n " str " | openssl md5 -r"
    if ( (cmd | getline l) > 0 )
        hex = substr(l, 1, 32)
    close(cmd)
    return hex
}
I'm using /bin/echo because there are some variants of shell where echo doesn't have -n
I'm using -n mostly because I want to be able to compare the results with the perl results
substr(l, 1, 32) - on my machine openssl md5 doesn't return just the sum; it also includes the file name - see: https://ideone.com/KGMWPe - substr keeps only the 32 hex digits (awk strings are 1-indexed, so the start offset must be 1; starting at 0 silently drops the last digit)
I'm using a separate file because it seems much cleaner, and because I can switch between function implementations fairly easy
As I was saying in the beginning, if you really want to use awk, at least cache the result of the openssl tool.
$ cat md5memo.awk
function md5(str, cmd, l, hex) {
    if (cache[str])
        return cache[str]
    cmd = "/bin/echo -n " str " | openssl md5 -r"
    if ( (cmd | getline l) > 0 )
        hex = substr(l, 1, 32)
    close(cmd)
    cache[str] = hex
    return hex
}
With the above caching, the results improve dramatically:
$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >outmemo.txt
real 0m0.192s
user 0m0.141s
sys 0m0.085s
[savuso@localhost hash]$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <sample2.txt >outmemof.txt
real 0m0.281s
user 0m0.222s
sys 0m0.088s
However, your mileage may vary: sample2.txt has 100,000 lines, with 5 different values for $2 and 40 different values for $4. Real-life data may vary!
Note: I just realized that my awk implementation doesn't handle headers, but you can get that from the other answers; a possible sketch follows.
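For completeness, a hedged sketch (untested) of adding header passthrough to the cached variant, reusing md5memo.awk from above:
awk -F'|' -v OFS='|' -i md5memo.awk '
    FNR == 1 { print; next }            # pass the header through unhashed
    { print $1, md5($2), $3, md5($4) }
' sample2.txt > out.txt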

How to export daily disk usage to csv format in shell scripting?

My script is as below. When we run the script, it automatically saves the disk space usage in separate cells.
SIZES_1=`df -h | awk 'FNR == 1 {print $1","$2","$3","$4","$5","$6}'`
SIZES_2=`df -h | awk 'FNR == 2 {print $1","$2","$3","$4","$5","$6}'`
SIZES_3=`df -h | awk 'FNR == 3 {print $1","$2","$3","$4","$5","$6}'`
SIZES_4=`df -h | awk 'FNR == 4 {print $1","$2","$3","$4","$5","$6}'`
SIZES_5=`df -h | awk 'FNR == 5 {print $1","$2","$3","$4","$5","$6}'`
SIZES_6=`df -h | awk 'FNR == 6 {print $1","$2","$3","$4","$5","$6}'`
SIZES_7=`df -h | awk 'FNR == 7 {print $1","$2","$3","$4","$5","$6}'`
SIZES_8=`df -h | awk 'FNR == 8 {print $1","$2","$3","$4","$5","$6}'`
echo `date +%Z-%Y-%m-%d_%H-%M-%S` >>/home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_1" >> /home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_2" >> /home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_3" >> /home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_4" >> /home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_5" >> /home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_6" >> /home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_7" >> /home/jeevagan/test_scripts/sizes/excel.csv
echo "$SIZES_8" >> /home/jeevagan/test_scripts/sizes/excel.csv
This script is okay for my machine.
My concern is that if somebody else's machine has many file systems, my script won't fetch the usage of all of them. How can I make it grab all of them automatically?
Assuming you want all filesystems, you can simplify that to:
printf '%s\n' "$(date +%Z-%Y-%m-%d_%H-%M-%S)" >> excel.csv
df -h | awk '{print $1","$2","$3","$4","$5","$6}' >> excel.csv
I would simplify this to
{ date +%Z-%F_%H-%M-%S; df -h | tr -s ' ' ','; } >> excel.csv
Group commands so only a single redirect is needed
Squeeze spaces and replace them with a single comma using tr
No need for echo `date` or similar: it's the same as just date
date +%Y-%m-%d is the same as date +%F
Notice that this has a little flaw in that the first line of the output of df -h, which looks something like this originally
Filesystem Size Used Avail Use% Mounted on
has a space in the heading of the last column, so it becomes
Filesystem,Size,Used,Avail,Use%,Mounted,on
with an extra comma. The original awk solution just cut off the last word of the line, though. Similarly, spaces in paths would trip up this solution.
To fix the comma problem, you could for example run
sed -i 's/Mounted,on$/Mounted on/' excel.csv
every now and then.
As an aside, to replace all field separators in awk, instead of
awk '{print $1","$2","$3","$4","$5","$6}'
you can use
awk 'BEGIN { OFS = "," } { $1 = $1; print }'
or, shorter,
awk -v OFS=',' '{$1=$1}1'
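The $1=$1 assignment forces awk to rebuild the record with the new OFS. You can see the effect (including the header flaw noted above) by feeding it the header line:
$ echo 'Filesystem Size Used Avail Use% Mounted on' | awk -v OFS=',' '{$1=$1}1'
Filesystem,Size,Used,Avail,Use%,Mounted,on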

Remove lines containing space in unix

Below is my comma-separated input.txt file. I want to read the columns and write to output.txt only the lines in which no column contains a space.
Content of input.txt:
1,Hello,world
2,worl d,hell o
3,h e l l o, world
4,Hello_Hello,World#c#
5,Hello,W orld
Content of output.txt:
1,Hello,world
4,Hello_Hello,World#c#
Is it possible to achieve this using awk? Please help!
A simple way to filter out lines with spaces is using inverted matching with grep:
grep -v ' ' input.txt
If you must use awk:
awk '!/ /' input.txt
Or perl:
perl -ne '/ / || print' input.txt
Or pure bash:
while read line; do [[ $line == *' '* ]] || echo "$line"; done < input.txt
# or
while read line; do [[ $line =~ ' ' ]] || echo "$line"; done < input.txt
UPDATE
To check if let's say field 2 contains space, you could use awk like this:
awk -F, '$2 !~ / /' input.txt
To check if let's say field 2 OR field 3 contains space:
awk -F, '!($2 ~ / / || $3 ~ / /)' input.txt
For your follow-up question in comments
To do the same using sed, I only know these awkward solutions:
# remove lines if 2nd field contains space
sed -e '/^[^,]*,[^,]* /d' input.txt
# remove lines if 2nd or 3rd field contains space
sed -e '/^[^,]*,[^,]* /d' -e '/^[^,]*,[^,]*,[^,]* /d' input.txt
For your 2nd follow-up question in comments
To disregard leading spaces in the 2nd or 3rd fields:
awk -F', *' '!($2 ~ / / || $3 ~ / /)' input.txt
# or perhaps what you really want is this:
awk -F', *' -v OFS=, '!($2 ~ / / || $3 ~ / /) { print $1, $2, $3 }' input.txt
This can also be done easily with sed
sed '/ /d' input.txt
try this one-liner
awk 'NF==1' file
as @jwpat7 pointed out, it won't give correct output if the line has only leading spaces; in that case, the following line with a regex should do, but it has already been posted in janos's answer.
awk '!/ /' file
or
awk -F' *' 'NF==1'
Pure bash for the fun of it...
#!/bin/bash
while read line
do
    if [[ ! $line =~ " " ]]
    then
        echo "$line"
    fi
done < input.txt
columnWithSpace=2
ColumnBef=$(( columnWithSpace - 1 ))
sed "/^\([^,]*,\)\{${ColumnBef}\}[^ ,]* /d" input.txt
if you know the column directly (for example, column 3):
sed '/^\([^,]*,\)\{2\}[^ ,]* /d' input.txt
If you can trust the input to always have no more than three fields, simply finding a space somewhere after a comma is sufficient.
grep ',.* ' input.txt
If there can be (or usually are) more fields, you can pull that off with grep -E and a suitable ERE, but you are fast approaching the point at which the equivalent Awk solution will be more readable and maintainable.
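For example, a hedged sketch of that ERE approach for removing lines whose second field contains a space (adjust the {1} repetition count for other fields):
grep -vE '^([^,]*,){1}[^,]* ' input.txt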

Print columns from specific line of file?

I'm looking at files that all have a different version number that starts at column 18 of line 7.
What's the best way with Bash to read (into a $variable) the string on line 7, from column, i.e. "character," 18 to the end of the line? What about to the 5th to last character of the line?
sed way:
variable=$(sed -n '7s/^.\{17\}//p' file)
EDIT (thanks to commenters): If by columns you mean fields (separated with tabs or spaces), the command can be changed to
variable=$(sed -n '7s/^\(\s\+\S\+\)\{17\}//p' file)
You have a number of different ways you can go about this, depending on the utilities you want to use. One of your options is to make use of Bash's substring expansion in any of the following ways:
sed
line=1
string=$(sed -n "${line}p" /etc/passwd)
echo "${string:17}"
awk
line=1
string=$(awk "NR==${line} {print}; {next}" /etc/passwd)
echo "${string:17}"
coreutils
line=1
string=`{ head -n $line | tail -n1; } < /etc/passwd`
echo "${string:17}"
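For the follow-up about stopping at the 5th-to-last character: in bash 4.2+, the substring expansion also accepts a negative length, which trims from the end, e.g.:
# characters 18 through the 5th-to-last, inclusive (bash 4.2+)
echo "${string:17:-4}"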
Use
var=$(head -n 7 filename | tail -n 1 | cut -f 18-)
or
var=$(awk 'NR == 7 {delim = ""; for (i = 18; i <= NF; i++) {printf "%s%s", delim, $i; delim = OFS}; printf "\n"}' filename)
If you mean "characters" instead of "fields":
var=$(head -n 7 filename | tail -n 1 | cut -c 18-)
or
var=$(awk 'NR == 7 {print substr($0, 18)}' filename)
If by 'columns' you mean 'fields':
a=$( awk 'NR==7{ print $18 }' file )
If you really want the 18th byte through the end of line 7, do:
a=$( sed -n 7p file | cut -b 18- )

Sort file beginning at a certain line

I'd like to be able to sort a file, but only from a certain line onward. From the manual, sort can't restrict itself to part of a file, so I'll need a second utility to do this. read? Or possibly awk? Here's the file I'd like to be able to sort:
tar --exclude-from=$EXCLUDE_FILE --exclude=$BACKDEST/$PC-* \
-cvpzf $BACKDEST/$BACKUPNAME.tar.gz \
/etc/X11/xorg.conf \
/etc/X11/xorg.conf.1 \
/etc/fonts/conf.avail.1 \
/etc/fonts/conf.avail/60-liberation.conf \
So for this case, I'd like to begin sorting at line three. I'm thinking I'll have to write a loop to do this, something like
cat backup.sh | while read LINE; do echo $LINE | sort; done
I'm pretty new to this, and the script looks like it's missing something. Also, I'm not sure how to begin at a certain line number.
Any ideas?
Something like this?
(head -n 2 backup.sh; tail -n +3 backup.sh | sort) > backup-sorted.sh
You may have to fix up the last line of the input... it probably doesn't have the trailing \ for the line continuation, so you might end up with a broken backup-sorted.sh if you just do the above.
You might want to consider using tar's --files-from (or -T) option, and having the sorted list of files in a data file instead of the script itself.
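Under that approach, the backup script might look something like this (a hedged sketch; filelist.txt is a hypothetical file holding one path per line):
# the file list lives in its own file, where plain sort can handle all of it
sort -o filelist.txt filelist.txt
tar --exclude-from="$EXCLUDE_FILE" --exclude="$BACKDEST/$PC-*" \
    -cvpzf "$BACKDEST/$BACKUPNAME.tar.gz" \
    --files-from=filelist.txt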
clumsy way:
len=$(cat FILE | wc -l)
sortable_len=$((len-3))
head -3 FILE > OUT
tail -$sortable_len FILE | sort >> OUT
I'm sure someone will post an elegant 1-liner shortly.
Sort the lines excluding the (2-line) header, just for viewing:
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' | sort
Sort the lines excluding the (2-line) header and send the output to another file.
Method #1:
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/stderr"; else print $0}' 2> file_sorted.txt | sort >> file_sorted.txt
Method #2:
cat file.txt | awk '{if (NR < 3) print $0 > "file_sorted.txt"; else print $0}' | sort >> file_sorted.txt
You could try this:
(read line; echo "$line"; sort) < file.txt
It takes one line and echoes it, then sorts the rest. You can also:
cat file.txt | (read line; echo "$line"; sort)
For two lines, just repeat the read and echo:
(read line; echo "$line"; read line; echo "$line"; sort) < file.txt
Using awk:
awk '{ if ( NR > 2 ) { print $0 } }' file.txt | sort
NR is a built-in awk variable and contains the current record/line number. It starts at 1.
Extending Vigneswaran R's answer using awk:
Use tty to get your current terminal's device file, print the first two lines directly to your terminal (no, it won't run the input) from within awk, and pipe the rest to sort:
tty
/dev/pts/3
cat file.txt | awk '{if (NR < 3) print $0 > "/dev/pts/3"; else print $0}' | sort
