Bash script to isolate words in a file - linux

Here is my initial input data to be extracted:
david ex1=10 ex2=12 quiz1=5 quiz2=9 exam=99
judith ex1=8 ex2=16 quiz1=4 quiz2=10 exam=90
sam ex1=8 quiz1=5 quiz2=11 exam=85
song ex1=8 ex2=20 quiz2=11 exam=87
How do I extract each word so it is formatted this way:
david
ex1=10
ex2=12
etc...
As I eventually want to have output like this:
david 12 99
judith 16 90
sam 0 85
song 20 87
when I run my program with the command:
./marks ex2 exam < file

Supposing your input file is named input.txt, just replace each space character with a newline character using the tr command-line tool:
tr ' ' '\n' < input.txt
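For example, piping the first line of the sample data through it gives:
$ head -n1 input.txt | tr ' ' '\n'
david
ex1=10
ex2=12
quiz1=5
quiz2=9
exam=99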
For your second request, you need to extract specific fields from each line, so the cut and awk commands may be useful. Note that my example is certainly improvable: hard-coding field positions picks the wrong value on lines where a field is missing, such as sam's line, which has no ex2.
while read p; do
    echo -n "$(echo $p | cut -d ' ' -f1) "                    # name
    echo -n "$(echo $p | cut -d ' ' -f3 | cut -d '=' -f2) "   # ex2 val
    echo -n $(echo $p | awk -F"exam=" '{ print $2 }')         # exam val
    echo
done < input.txt

This script does what you want:
#!/bin/bash
a="$*"                               # the field names passed as arguments, e.g. "ex2 exam"
awk -v a="$a" -F'[[:space:]=]+' '
BEGIN {
    n = split(a, b)                  # split field names into array b
}
{
    printf "%s ", $1                 # print first field (the name)
    for (i=1; i<=n; i++) {           # loop through fields to search for, in argument order
        f = 0                        # unset "found" flag
        for (j=2; j<=NF; j+=2)       # loop through remaining fields, 2 at a time
            if ($j == b[i]) {        # if field matches value in array
                printf "%s ", $(j+1)
                f = 1                # set "found" flag
            }
        if (!f) printf "0 "          # add 0 if field not found
    }
    print ""                         # add newline
}' file
Testing it out
$ ./script.sh ex2 exam
david 12 99
judith 16 90
sam 0 85
song 20 87

Related

Linux, bash: Determine the row number of a cell that is in a specific column and has a specific content

Determine the row number of a cell that is in a specific column and has specific content.
Remarks:
The column heading counts as a line.
An empty field in a column still counts as a row.
The fields of the csv are separated by commas.
Given:
The following csv file is given:
file.csv
col_o2g,col_dgjdhu,col_of_interest,,
1234567890,tg75fjksfh,$kj56hahb,,
dsewsf,1234567890,,,
khhhdg,5gfj578fj,1234567890,,
,57ijf6ehg,46h%sgf,,
ubthfgfv,zts576fufj,256hf%(",,
Given variables:
# col variable
col=col_of_interest
# variable with the value of the field of interest
value_of_interest=1234567890
# output variable
# that's the part I am looking for
wanted_line_number=
What I have:
LINE_CNT=$(awk -F'[\t ]*,[\t ]*' -v col="${col}" '
    FNR==1 {
        for (i=1; i<=NF; ++i) {
            if ($i == col) {
                col = i;
                break;
            }
        }
        if (i > NF) {
            exit 1;
        }
    }
    FNR>1 {
        if ($col) maxc = FNR;
    }
    END {
        print maxc;
    }' file.csv)
echo line count of lines from column $col
echo "$LINE_CNT"
Wanted output:
echo "The wanted line number are:"
echo $wanted_line_number
output: 4
I have been trying to decipher your question, so let me know whether I did it right or not. I guess in your case you don't know how many columns are present in the csv file, and also you don't know whether the first line is the header or not.
For the second remark I have no automatic solution, so you need to indicate via an input parameter whether line 1 is a header or not.
Let me show you with a test case
$ more test.csv
col_1,col_2,col_3,col_4
1234567890,tg75fjksfh,kj56hahb,dkdkdkd
dsewsf,1234567890,,dkdkdk
khhhdg,5gfj578fj,1234567890,akdkdkd
ubthfgfv,zts576fufj,256hf,,
Then you want to know the position of the column of interest in your csv, and also the line where the value of interest is located. Here is my example script (which can certainly be improved). Keep in mind that I hardcoded the test.csv file name into the script.
$ cat check_csv.sh
column_of_interest=$1
value_of_interest=$2
with_header=$3
# check which column is the one
if [[ $with_header = "Y" ]]; then
    num_cols=$(head -n 1 test.csv | awk -F ',' '{ print NF }')
    echo "csv contains $num_cols columns"
    iteration=0
    for i in $(head -n 1 test.csv | tr ',' '\n')
    do
        iteration=$(expr $iteration + 1)
        counter=$(echo $i | egrep -i "$column_of_interest" | wc -l)
        if [ $counter -eq 1 ]
        then
            echo "Column of interest $i is located on number $iteration"
            export my_col_is=$iteration
        fi
    done
    # find the line that contains the value of interest
    iteration=0
    while IFS= read -r line
    do
        iteration=$(expr $iteration + 1)
        if [[ $iteration -gt 1 ]]
        then
            is_there=$(echo $line | awk -v temp=$my_col_is -F ',' '{print $temp}' | egrep -i "$value_of_interest" | wc -l)
            if [ $is_there -gt 0 ]
            then
                echo "Value of interest $value_of_interest is present on line $iteration"
            fi
        fi
    done < test.csv
fi
Running the example: I want to know the position of column col_2, and the lines where the value 1234567890 appears in that column. I use the third parameter to indicate that the file has a header:
$ more test.csv
col_1,col_2,col_3,col_4
1234567890,tg75fjksfh,kj56hahb,dkdkdkd
dsewsf,1234567890,,dkdkdk
khhhdg,5gfj578fj,1234567890,akdkdkd
ubthfgfv,zts576fufj,256hf,,
$ ./check_csv.sh col_2 1234567890 Y
csv contains 4 columns
Column of interest col_2 is located on number 2
Value of interest 1234567890 is present on line 3
With lines duplicated
$ more test.csv
col_1,col_2,col_3,col_4
1234567890,tg75fjksfh,kj56hahb,dkdkdkd
dsewsf,1234567890,,dkdkdk
khhhdg,5gfj578fj,1234567890,akdkdkd
ubthfgfv,zts576fufj,256hf,,
dsewsf,1234567890,,dkdkdk
dsewsf,1234567890,,dkdkdk
$ ./check_csv.sh col_2 1234567890 Y
csv contains 4 columns
Column of interest col_2 is located on number 2
Value of interest 1234567890 is present on line 3
Value of interest 1234567890 is present on line 6
Value of interest 1234567890 is present on line 7
$
If you want to handle files without a header, you only need to copy the code, dropping the head -n 1 parts; but in that case you cannot get the column names, so you won't know which column is which.
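A minimal sketch of that no-header variant (hypothetical, in the same style as above; the column of interest is now passed by position rather than by name):
# hypothetical no-header variant: column is addressed by number, not name
column_number=$1
value_of_interest=$2
iteration=0
while IFS= read -r line
do
    iteration=$(expr $iteration + 1)
    is_there=$(echo $line | awk -v temp=$column_number -F ',' '{print $temp}' | egrep -i "$value_of_interest" | wc -l)
    if [ $is_there -gt 0 ]
    then
        echo "Value of interest $value_of_interest is present on line $iteration"
    fi
done < test.csv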
col="col_of_interest"
value_of_interest="1234567890"
awk -v FS="," -v coi="$col" -v voi="$value_of_interest" \
'NR==1{
for(i=1; i<=NF; i++){
if(coi==$i){
y=i
}
}
next
}
{if($y==voi){print NR}}' file
Output:
4
See: GNU awk: String-Manipulation Functions (split), Arrays in awk, 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR and man awk
file=./input.csv
d=,
# get column number for col_of_interest
c=$(head -n1 "$file" | grep -oE "[^$d]+" | grep -niw "$col" | cut -d: -f1)
# print column with cut and get line numbers for 1234567890
[ "$c" -gt 0 ] && wanted_line_number=$(cut -d$d -f$c "$file" | grep -niw "$value_of_interest" | cut -d: -f1)
printf "The wanted line number are: %b\n" $wanted_line_number

Find field length from one file and extract the same length of data from another fixed length file and store the field and data in new file

I have one file file1.dml and another fixed length data file file2.dat.
The data in file1.dml looks like:
start
integer(16) field1 ;
string(1) filed2 ;
string(80) filed3 ;
decimal(16.2) field4;
string(1) newline = "\n";
end;
The data in file2.dat looks like:
12345678 ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1234567890
I need the output file to look like below:
field1="12345678 "
filed2="A"
filed3="BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB "
field4="1234567890 "
newline="\n"
I have written the function below, which accepts file1.dml and file2.dat and generates the exact result, but I want to simplify this using awk. Thanks in advance for any help.
function myfunc1
{
    if [ "$1" == "" -a "$2" == "" -a ! -f "$1" -a ! -f "$2" ]; then
        echo "Input files not present"
    else
        dml_file=$1     # input parameter, dml file
        cntl_file=$2    # input parameter, dat file
        start_pos=1
        end_pos=0
        cat "$dml_file" | sed '1d' | sed '$d' | while IFS= read line
        do
            counter=`echo $line | cut -d'(' -f2 | cut -d')' -f1`
            fld_name=`echo $line | cut -d'(' -f2 | cut -d')' -f2 | sed 's/;//g'`
            # check decimal or not
            if [[ $counter == +([0-9]) ]]; then
                end_pos=$((counter+start_pos))
            else
                counter1=`echo $counter | cut -d'.' -f1`
                counter=$counter1
                end_pos=$((counter1+end_pos))
            fi
            newline_check=`echo $fld_name | grep -i 'newline' | wc -l`
            if [ $newline_check -gt 0 ]; then
                fld_name="newline"
                fld_val="\n"
                # write below line in one file
                echo "$fld_name : \"$fld_val\""
            else
                fld_val=`cat $cntl_file | cut -c$start_pos-$end_pos`
                # write below line in one file
                echo "$fld_name : \"$fld_val\""
            fi
            start_pos=$((start_pos+counter))
        done
    fi
}
Since file2.dat only has a single line, I'd start by reading it into a variable so that we don't have to continually scan the file; e.g.:
$ IFS= read -r rawdata < file2.dat # 'IFS=' needed in order to retain trailing white space
$ echo ".${rawdata}." # periods included to show that trailing white space retained
.12345678 ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1234567890 .
At this point we can pass the rawdata variable to an awk solution:
awk -v rd="${rawdata}" -F'[();=]' '
BEGIN       { s = 1 ; nl = "\\n" }
/start|end/ { next }
/newline/   { gsub(/ /,"",$4)
              nl = $4
              next
            }
            { split($2,a,".")
              len = a[1]
              gsub(/ /,"",$3)
              fname = $3
              printf "%s=\"%s\"\n", fname, substr(rd,s,len)
              s += len
            }
END         { printf "newline=%s\n", nl }
' file1.dml
Where:
-v rd="${rawdata}" - define awk variable rd as containing the current value of ${rawdata}
-F '[();=]' - define 4 different input/field delimiters ((, ), ;, =); $2=field length, $3=field name, $4=newline character(s)
BEGIN { s = 1 ; nl = "\\n" } - initialize our start position in rd and a default newline string (\n)
/start|end/ { next } - ignore lines that contain start or end
/newline/ .... - if we see the pattern newline then remove spaces and set nl to this new value ($4)
next - stop processing for current line and go to next line of input
NOTE: for the rest of the lines in our input file:
split($2,a,".") / len = a[1] - split the 'length' field ($2) based on a period delimiter into array a, then set len to the first element of array a
gsub(/ /,"",$3) / fname = $3 - remove white space from the field name ($3) and assign resulting value to local variable fname
printf ... - output our line of data; use s and len to pull the desired substring from rd
s += len - add current len to s to get a new start position for next pass through the logic
END ... - when all input processing is done, print our newline record to stdout
Running the above generates the following:
field1="12345678 "
filed2="A"
filed3="BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB "
field4="1234567890 "
newline="\n"

How to find a list of words (thousands) in a list of tsv files (hundreds), with output as the number of matches for each string in each file, in linux?

I have hundreds of tsv files with the following structure (example):
GH1 123 family1
GH2 23 family2
.
.
.
GH4 45 family4
GH6 34 family6
And I have a text file with a list of words (thousands):
GH1
GH2
GH3
.
.
.
GH1000
I want output that contains the number of occurrences of each word in each file, like this:
GH1 GH2 GH3 ... GH1000
filename1 1 1 0... 4
.
.
.
filename2 2 3 1... 0
I tried this code but it gives me only zeros:
for file in *.tsv; do
    echo $file >> output.tsv
    cat fore.txt | while read line; do
        awk -F "\t" '{print $1}' $file | grep -wc $line >> output.tsv
        echo "\t" >> output.tsv
    done
done
Use the following script, and just redirect stdout to the output.txt file.
#!/bin/bash
while read p; do
    echo -n "$p "
done < words.txt
echo ""
for file in *.tsv; do
    echo -n "$file = "
    while read p; do
        # double quotes so $p expands in the sed program; -w so GH1 does not also count GH1000
        COUNT=$(sed "s/$p/$p\n/g" "$file" | grep -cw "$p")
        echo -n "$COUNT "
    done < words.txt
    echo ""
done
Here is a simple Awk script which collects a list like the one you describe.
awk 'BEGIN { printf "\t" }
NR==FNR    { a[$1] = n = FNR               # remember each word and its index
             printf "\t%s", $1; next }     # and print it into the header row
FNR==1     { if (f) { printf "%s", f       # at each new file, print the previous file
                      for (i=1; i<=n; i++)
                          printf "\t%s", 0+b[i] }
             printf "\n"
             delete b
             f = FILENAME }
$1 in a    { b[$1]++ }' fore.txt *.tsv /etc/motd
To avoid repeating the big block in an END rule, we add a short sentinel file at the end whose only purpose is to supply a file after the last one, so that the final real file's counts get printed while the sentinel's own counts are never reported. The sentinel must be non-empty (hence /etc/motd rather than /dev/null), since the FNR==1 rule only fires when a line is actually read.
The shell's while read loop is slow and inefficient and somewhat error-prone (you basically always want read -r and handling incomplete text files is hairy); in addition, the brute-force method will require reading the word file once per iteration, which incurs a heavy I/O penalty.
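If the exact matrix layout is not essential, a rough single-pass-per-file alternative (a sketch assuming GNU grep; words that never occur in a file are simply omitted rather than reported as 0):
for file in *.tsv; do
    echo "== $file"
    # -o prints each match on its own line, -w matches whole words,
    # -F treats the patterns as fixed strings, -f reads them from fore.txt
    cut -f1 "$file" | grep -owFf fore.txt | sort | uniq -c
done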

Easy way of selecting certain lines from a file in a certain order

I have a text file with many lines. I also have a selected number of lines I want to print out, in a certain order. Let's say, for example, "5, 3, 10, 6". In this order.
Is there some easy and "canonical" way of doing this? (with "standard" Linux tools, and bash)
When I tried the answers from this question
Bash tool to get nth line from a file
they always print the lines in the order they appear in the file.
A one-liner using sed (note that it invokes sed once per requested line):
for i in 5 3 10 6 ; do sed -n "${i}p" < ff; done
A rather efficient method, if your file is not too large, is to read it all into memory, in an array, one line per element, using mapfile (a Bash ≥ 4 builtin):
mapfile -t array < file.txt
Then you can echo all the lines you want in any order, e.g.,
printf '%s\n' "${array[4]}" "${array[2]}" "${array[9]}" "${array[5]}"
to print the lines 5, 3, 10, 6. Now you may find it a bit awkward that array indices start at 0, so that you have to offset your numbers. This can easily be cured with the -O option of mapfile:
mapfile -t -O 1 array < file.txt
this will start assigning to array at index 1, so that you can print your lines 5, 3, 10 and 6 as:
printf '%s\n' "${array[5]}" "${array[3]}" "${array[10]}" "${array[6]}"
Finally, you want to make a wrapper function for this:
printlines() {
local i
for i; do printf '%s\n' "${array[i]}"; done
}
so that you can just state:
printlines 5 3 10 6
And it's all pure Bash, no external tools!
As @glennjackmann suggests in the comments, you can make the helper function also take care of reading the file (passed as argument):
printlinesof() {
# $1 is filename
# $2,... are the lines to print
local i array
mapfile -t -O 1 array < "$1" || return 1
shift
for i; do printf '%s\n' "${array[i]}"; done
}
Then you can use it as:
printlinesof file.txt 5 3 10 6
And if you also want to handle stdin:
printlinesof() {
# $1 is filename or - for stdin
# $2,... are the lines to print
local i array file=$1
[[ $file = - ]] && file=/dev/stdin
mapfile -t -O 1 array < "$file" || return 1
shift
for i; do printf '%s\n' "${array[i]}"; done
}
so that
printf '%s\n' {a..z} | printlinesof - 5 3 10 6
will also work.
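For instance, with the alphabet as input it prints the 5th, 3rd, 10th and 6th lines:
$ printf '%s\n' {a..z} | printlinesof - 5 3 10 6
e
c
j
f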
Here is one way using awk:
awk -v s='5,3,10,6' 'BEGIN{split(s, a, ","); for (i=1; i<=length(a); i++) b[a[i]]=i}
b[NR]{data[NR]=$0} END{for (i=1; i<=length(a); i++) print data[a[i]]}' file
Testing:
cat file
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
Line 11
Line 12
awk -v s='5,3,10,6' 'BEGIN{split(s, a, ","); for (i=1; i<=length(a); i++) b[a[i]]=i}
b[NR]{data[NR]=$0} END{for (i=1; i<=length(a); i++) print data[a[i]]}' file
Line 5
Line 3
Line 10
Line 6
First, generate a sed expression that would print the lines with a number at the beginning that you can later use to sort the output:
#!/bin/bash
lines=(5 3 10 6)
sed=''
i=0
for line in "${lines[#]}" ; do
sed+="${line}s/^/$((i++)) /p;"
done
for i in {a..z} ; do echo $i ; done \
| sed -n "$sed" \
| sort -n \
| cut -d' ' -f2-
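On the alphabet input this prints the requested lines in the requested order:
e
c
j
f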
I'd probably use Perl, though:
for c in {a..z} ; do echo $c ; done \
| perl -e 'undef @lines{@ARGV};
           while (<STDIN>) {
               $lines{$.} = $_ if exists $lines{$.};
           }
           print @lines{@ARGV};
' 5 3 10 6
You can also use Perl instead of hacking with sed in the first solution:
for c in {a..z} ; do echo $c ; done \
| perl -e '%lines = map { $ARGV[$_], ++$i } 0 .. $#ARGV;
           while (<STDIN>) {
               print "$lines{$.} $_" if exists $lines{$.};
           }
' 5 3 10 6 | sort -n | cut -d' ' -f2-
l=(5 3 10 6)
printf "%s\n" {a..z} |
sed -n "$(printf "%d{=;p};" "${l[@]}")" |   # print line number (=) and line (p) for each wanted line
paste - - | {                               # pair each number with its text, tab-separated
    while IFS=$'\t' read -r nr text; do
        line[nr]=$text                      # index the selected lines by number
    done
    for n in "${l[@]}"; do                  # emit them in the requested order
        echo "${line[n]}"
    done
}
You can use the nl trick: number the lines in the input and join that output with the list of actual line numbers. Additional sorts are needed to make the join possible, as join needs sorted input (so the nl trick is used once more to number the expected lines):
#! /bin/bash
LINES=(5 3 10 6)
lines=$( IFS=$'\n' ; echo "${LINES[*]}" | nl )
for c in {a..z} ; do
echo $c
done | nl \
| grep -E '^\s*('"$( IFS='|' ; echo "${LINES[*]}")"')\s' \
| join -12 -21 <(echo "$lines" | sort -k2n) - \
| sort -k2n \
| cut -d' ' -f3-

how to merge similar lines in linux

I have a file test.txt on my linux system which has data in the following format:
first second third fourth 10
first second third fourth 20
fifth sixth seventh eighth 10
mmm nnn ooo ppp 10
mmm nnn ooo ppp 20
I need to modify the format as below:
first second third fourth 10 20
fifth sixth seventh eighth 10 0
mmm nnn ooo ppp 10 20
I have tried the following code:
cat test.txt | sed 'N;s/\n/ /' | awk -F" " '{if ($1~$5){print $1" "$2" "$3" "$4" "$8} else { print $0 }}'
But this is not giving the required output. When there is a line which doesn't have a similar line below it, this command fails. Can you suggest a solution for this?
Here is one way to do it:
awk '{
    last = $NF; $NF = ""                        # detach the trailing number
    if ($0 == previous) {                       # same key as the line before
        tail = tail " " last                    # collect its value
    }
    else {
        if (previous != "") {
            if (split(tail, foo) == 1) tail = tail " 0"   # pad lone values with 0
            print previous tail
        }
        previous = $0
        tail = last
    }
}
END {
    if (previous != "") {
        if (split(tail, foo) == 1) tail = tail " 0"       # pad the final group too
        print previous tail
    }
}
'
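Fed the sample test.txt on standard input (awk ' ... ' < test.txt), it prints:
first second third fourth 10 20
fifth sixth seventh eighth 10 0
mmm nnn ooo ppp 10 20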
Perl solution (note that keys %h returns the groups in no particular order, and lone values are not padded with a trailing 0):
perl -ne '/^(.*) (\S+)/ and push @{ $h{$1} }, $2 }{ print "$_ @{$h{$_}}\n" for keys %h' < test.txt
Reuse of my earlier solution, just for fun:
cat file.txt | sort | while read L
do
    y=`echo $L | rev | cut -f2- -d' ' | rev`    # the line minus its last field
    {
        test "$x" = "$y" && echo -n " `echo $L | awk '{print $NF}'`"   # same key: append value
    } ||
    {
        x="$y"; echo -en "\n$L"                 # new key: start a new output line
    }
done
