Compare fields in two files and print to another in shell - linux

Please help me with the below. I want to compare two files and print values to a third file in shell scripting.
The files are formatted as follows:
a_file - col1 col2 col3
b_file - col1 col2
a_file:
1 P I
1 1Q JI
b_file:
1 I
How can I compare the 1st and 3rd fields of a_file to the 1st and 2nd fields of b_file?

Well, if I understand your question correctly, the simplest way is to read both files line by line, compare the specified columns and write the customized output to a third file.
# read 3 columns from file descriptor 11 and 2 columns from file descriptor 12
while read -u 11 acol1 acol2 acol3 && read -u 12 bcol1 bcol2; do
    # do comparisons between the columns
    if [ "$acol1" = "$bcol1" ] && [ "$acol3" = "$bcol2" ]; then
        echo "Comparison true"
        # print the values to the third file
        echo "$acol1 $bcol1 $acol3 $bcol2" >>3rd_file
    else
        echo "Comparison false"
    fi
# bind a_file to file descriptor 11, and b_file to file descriptor 12
done 11<a_file 12<b_file
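For example, with the sample a_file and b_file from the question, saving the loop above as compare.sh (a file name chosen here for illustration) and running it compares only the first pair of lines, because b_file has a single line and the second read fails at end of file:
$ bash compare.sh
Comparison true
$ cat 3rd_file
1 1 I I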

Related

randomly choose the files and add it to the data present in another folder

I have two folders (DATA1 and DATA2), and inside each there are 3 subfolders (folder1, folder2 and folder3) as shown below:
DATA1
folder1/*.txt contain 5 files
folder2/*.txt contain 4 files
folder3/*.txt contain 10 files
DATA2
folder1/*.txt contain 8 files
folder2/*.txt contain 9 files
folder3/*.txt contain 10 files
As depicted above, there is a varying number of files with different names in each subfolder, and each file contains two columns of data as shown below:
1 -2.4654174805e+01
2 -2.3655626297e+01
3 -2.2654634476e+01
4 -2.1654865265e+01
5 -2.0653873444e+01
6 -1.9654104233e+01
7 -1.8654333115e+01
8 -1.7653341293e+01
9 -1.6654792786e+01
10 -1.5655022621e+01
I just want to add the data folder-wise by choosing the second column of a file at random.
I mean the second column of a random file from DATA2/folder1/*.txt will be added to the second column of a file in DATA1/folder1/*.txt; similarly DATA2/folder2/*.txt will be added to DATA1/folder2/*.txt, and so on.
Most importantly, the first column of any file must not be disturbed; only the second column is manipulated. Finally, I want to save the data.
Can anybody suggest a solution for this?
My directory and data structure are attached here:
https://i.fluffy.cc/2RPrcMxVQ0RXsSW1lzf6vfQ30jgJD8qp.html
I want to add the data folder-wise (from DATA2 to DATA1). First, enter DATA2/folder1, randomly choose any file and select its second column (each file consists of two columns). Then add the selected second column to the second column of any file present inside DATA1/folder1 and save the result to the OUTPUT folder.
Since there's no code to start from, this won't be a ready-to-use answer but rather a few building blocks that may come in handy.
I'll show how to find all files, select a random file, select a random column and extract the value from that column. Duplicating and adapting this for selecting a random file and column to add the value to is left as an exercise to the reader.
#!/bin/bash
IFS=
# a function to generate a random number
prng() {
    # You could use $RANDOM instead but it gives a narrower range.
    echo $(( $(od -An -N4 -t u4 < /dev/urandom) % $1 ))
}
# Find files
readarray -t files < <(find DATA2/folder* -mindepth 1 -maxdepth 1 -name '*.txt')
# Debug print-out of the files array
declare -p files
echo "Found ${#files[@]} files"
# List files one-by-one
for file in "${files[@]}"
do
    echo "$file"
done
# Select a random file
fileno=$(prng ${#files[@]})
echo "Selecting file number $fileno"
filename=${files[$fileno]}
echo "which is $filename"
lines=$(wc -l < "$filename")
echo "and it has $lines lines"
# Add 1 since awk numbers its lines from 1 and up
rndline=$(( $(prng $lines) + 1 ))
echo "selecting value in column 2 on line $rndline"
value=$(awk -v rndline="$rndline" '{ if (NR == rndline) print $2 }' "$filename")
echo "which is $value"
# now pick a random file and line in the other folder using the same technique
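As a rough continuation of that exercise (this part is not from the original answer), the same helpers can pick a target file in DATA1/folder1, add the extracted value to column 2 of a random line, and write the result to an OUTPUT directory; the OUTPUT name and the single-folder scope are assumptions here, and looping over the other folders is analogous:
# sketch only: reuses prng() and $value from the script above
readarray -t targets < <(find DATA1/folder1 -mindepth 1 -maxdepth 1 -name '*.txt')
target=${targets[$(prng ${#targets[@]})]}
targetline=$(( $(prng $(wc -l < "$target")) + 1 ))
mkdir -p OUTPUT
# add $value to column 2 of the chosen line, leaving column 1 untouched
# (note: awk's default output format may round the printed number)
awk -v add="$value" -v line="$targetline" \
    '{ if (NR == line) $2 = $2 + add; print }' "$target" > "OUTPUT/$(basename "$target")"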

Iterate over columns of a csv file bash script

I have been trying to iterate over a file with 20 columns in two ways. The first one creates an array with the names of the columns and then passes it to a for loop, but it does not work.
#!/bin/bash
a="${#}"
columns=('$col2' col3 col4 col5 col6 coll7 col8 col9 col10 col11 col12 col13 col14 col15 col16
         col17 col18 col19 col20)
for elem in ${columns[*]}
do
    while IFS=, read -r col1 col2 col3 col4 col5 col6 coll7 col8 col9 col10 col11 col12; do
        b+=($elem)
    done < $a
    printf '%s\n' "${b[*]}"
done
The other method seems to take the entire row, which is not the idea; I would like to take each column individually, not row by row. However, this one did not work either; it looks like there is a problem with the way the for loop is written.
#!/bin/bash
a="${#}"
while IFS= read -r line; do
    IFS=, read -ra fields <<<"$line"
    for (( i=${#fields[@]} ; i >= 1 ; i-- ))
    do
        printf '%s' "${fields[i]}"
    done
done < $a
I have the following table, which represents some sales per year. I would like to take the information for each product per year and sum it, in order to verify the total, because in some cases the total value is incorrect. For example, in the year 2004, if you sum each product (45.000 + 70.000 + 100.000), the result is not the 323.000 mentioned in the table.
https://easyupload.io/3lpz6p
While Bash may not be the most efficient way to verify the total of each column, there is a way to do it and I had fun finding the solution.
Background
I have found a way to obtain the total of the values in one column using Bash. My idea was to use the read command and make sure to only read the values in the column specified by the user in the variable col_num. This works because the read command goes through a file row by row. In the example below, I specified col_num to be 0, which means the read command will go through the .csv input line by line while grabbing only the values in the first column. I then added the values the Bash way, with arithmetic expansion.
I believe that only a small adjustment to my code is needed to enable it to go through all 20 columns before it terminates. But since you only have 20 columns, I thought it would not be too bad to increment col_num for every column.
My solution
#!/bin/bash
{
    # this reads the first row, which has the column names, so we will not go
    # through that row in the loop below
    read
    # this is where you specify the column number
    col_num=0
    # var2 specifies the number of values (length) grabbed per line. we only
    # want one value from each line so let it be 1
    var2=1
    while IFS=, read -a arr
    do
        final_arr+=$(echo "${arr[@]:$col_num:$var2} + ")
        column_total="$(($final_arr 0)) "
    done
    echo $column_total
} < input.csv
Important note: The script works for every column except for the last column, probably because it does not end with a comma.
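For comparison (this is an alternative, not part of the answer above), awk can total a column in one line and does not suffer from the last-column problem; here it sums column 2 of input.csv and skips the header row:
awk -F, 'NR > 1 { sum += $2 } END { print sum }' input.csv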

column, width parameter not working

I'm running RHEL 7 at work; column -V reports column from util-linux 2.23.2.
I have csv files that contain some columns with long strings.
I want to view the csv as a table, and limit the column width since I'm
typically not interested in spot checking the long strings.
cat foo_bar.csv | column -s"," -t -c5
It appears that the column width is not being limited to 10 chars.
I wonder if this is a bug, or if I'm doing it wrong and can't see it?
Test input, test.csv
co1,col2,col3,col4,col5
1,2,3,longLineOfTextThatIdoNotWantToInspectAndWouldLikeToLimit,5
Running the command I think is correct:
cat test.csv | column -s"," -t -c5
co1 col2 col3 col4 col5
1 2 3 longLineOfTextThatIdoNotWantToInspectAndWouldLikeToLimit 5
The -c or --columns option does not do what you think it does. By default,
column looks at all lines to find the longest one. If column can fit 2 of
those lines within a width of 80, then every 2 input lines are placed on one output line:
$ cat file
1 this is a short line
2 this is a short line
3 this line needs to be 39 or less char
4 this line needs to be 39 or less char
$ column file
1 this is a short line 3 this line needs to be 39 or less char
2 this is a short line 4 this line needs to be 39 or less char
$ column -x file
1 this is a short line 2 this is a short line
3 this line needs to be 39 or less char 4 this line needs to be 39 or less char
If you put -c lower than 80, it’s going to make it less likely that you get
more than 1 column:
$ column -c70 file
1 this is a short line
2 this is a short line
3 this line needs to be 39 or less char
4 this line needs to be 39 or less char
So, simply put, column cannot do what you want it to do. Awk can do this:
BEGIN {
    FS = ","
}
{
    for (x = 1; x <= NF; x++) {
        printf "%s%s", substr($x, 1, 5), x == NF ? "\n" : "\t"
    }
}
Result:
co1 col2 col3 col4 col5
1 2 3 longL 5
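To reproduce that result, the awk program above can be saved to a file (trunc.awk is just a name chosen here) and run against the test input; piping through column realigns the truncated, tab-separated fields:
awk -f trunc.awk test.csv | column -s$'\t' -t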
I was looking for a solution to a similar problem (truncating columns in docker ps output). I did not want to use awk and instead solved it using sed.
In this example I limit the output of docker ps to 30 characters per column.
docker ps -a --format "table {{.ID}},{{.Names}},{{.Image}},{{.State}},{{.Networks}}" | \
sed "s/\([^,]\{30\}\)[^,]*/\1/g" | \
column -s "," -t
The pattern matches 30 non-delimiter ([^,]) characters in a group followed by the rest of the non-delimiter characters (if the column is less than 30 characters then it doesn't match and is left alone). The replacement is just the group of 30 characters; the rest of the column is discarded.
Just for fun, you could also do a mid-column truncation in case there is useful information at both ends of the column.
docker ps -a --format "table {{.ID}},{{.Names}},{{.Image}},{{.State}},{{.Networks}}" | \
sed "s/\([^,]\{14\}\)[^,]*\([^,]\{14\}\)/\1..\2/g" | \
column -s "," -t
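The same sed trick applies directly to the test.csv from the question; for example, truncating every field to at most 10 characters (an arbitrary limit chosen here) before aligning:
sed 's/\([^,]\{10\}\)[^,]*/\1/g' test.csv | column -s "," -t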

Mapping lines to columns in *nix

I have a text file that was created when someone pasted from Excel into a text-only email message. There were originally five columns.
Column header 1
Column header 2
...
Column header 5
Row 1, column 1
Row 1, column 2
etc
Some of the data is single-word, some has spaces. What's the best way to get this data into column-formatted text with unix utils?
Edit: I'm looking for the following output:
Column header 1 Column header 2 ... Column header 5
Row 1 column 1 Row 1 column 2 ...
...
I was able to achieve this output by manually converting the data to CSV in vim by adding a comma to the end of each line, then manually joining each set of 5 lines with J. Then I ran the csv through column -ts, to get the desired output. But there's got to be a better way next time this comes up.
Perhaps a perl-one-liner ain't "the best" way, but it should work:
perl -ne 'BEGIN{ $fields_per_line=5; $field_separator="\t";
                 $line_break="\n" }
          chomp;
          print $_,
                $. % $fields_per_line ? $field_separator : $line_break;
          END{ print $line_break }' INFILE > OUTFILE.CSV
Just substitute the "5", "\t" (tab) and "\n" (newline) as needed.
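As a side note (not from the answers above), paste can do the same reshaping in one short command: each - reads one line from standard input, so five of them join every group of five lines with tabs, and column can then align the result:
paste - - - - - < INFILE | column -s$'\t' -t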
You would have to use a script that reads lines and keeps a counter. When the program reaches the line you want, use the cut command with space as a delimiter to get the word you want:
counter=0
lineNumber=3
while read -r line
do
    counter=$((counter + 1))
    if [ "$counter" -eq "$lineNumber" ]
    then
        echo "$line" | cut -d" " -f 4
    fi
done < INFILE

Conditional Awk hashmap match lookup

I have 2 tabular files. One file contains a mapping of only 50 key values, called lookup_file.txt.
The other file, data.txt, has the actual tabular data with 30 columns and millions of rows.
I would like to replace the id column of the second file with the values from lookup_file.txt.
How can I do this? I would prefer using awk in a bash script.
Also, is there a hashmap data structure I can use in bash for storing the 50 key/value pairs rather than another file?
Assuming your files have comma-separated fields and the "id column" is field 3:
awk '
BEGIN{ FS=OFS="," }
NR==FNR { map[$1] = $2; next }
{ $3 = map[$3]; print }
' lookup_file.txt data.txt
If any of those assumptions are wrong, clue us in if the fix isn't obvious...
EDIT: and if you want to avoid the (IMHO negligible) NR==FNR test performance impact, this would be one of those very rare cases when use of getline is appropriate:
awk '
BEGIN{
    FS=OFS=","
    while ( (getline line < "lookup_file.txt") > 0 ) {
        split(line,f)
        map[f[1]] = f[2]
    }
}
{ $3 = map[$3]; print }
' data.txt
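As a quick self-contained check of the first command (the sample data here is made up for illustration, not taken from the question):
printf '1,homer\n2,marge\n' > lookup_file.txt
printf 'a,b,1,d\ne,f,2,h\n' > data.txt
awk 'BEGIN{ FS=OFS="," } NR==FNR { map[$1]=$2; next } { $3=map[$3]; print }' lookup_file.txt data.txt
which prints:
a,b,homer,d
e,f,marge,h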
You could use a mix of "sort" and "join" via bash instead of having to write it in awk/sed and it is likely to be even faster:
key.csv (id, name)
1,homer
2,marge
3,bart
4,lisa
5,maggie
data.csv (name,animal,owner,age)
snowball,dog,3,1
frosty,yeti,1,245
cujo,dog,5,4
Now, you need to sort both files first on the user id columns:
cat key.csv | sort -t, -k1,1 > sorted_keys.csv
cat data.csv | sort -t, -k3,3 > sorted_data.csv
Now join the 2 files:
join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t , sorted_keys.csv sorted_data.csv > replaced_data.csv
This should produce:
snowball,dog,bart,1
frosty,yeti,homer,245
cujo,dog,maggie,4
This:
-o "2.1 2.2 1.2 2.4"
specifies which columns from the 2 files you want in your final output.
It is pretty fast for finding and replacing multiple gigs of data compared to other scripting languages. I haven't done a direct comparison to SED/AWK, but it is much easier to write a bash script wrapping this than writing in SED/AWK (for me at least).
Also, you can speed up the sort by using an upgraded version of GNU coreutils so that you can do the sort in parallel:
cat data.csv | sort --parallel=4 -t, -k3,3 > sorted_data.csv
4 being how many threads you want to run. I was told that 2 threads per machine core will usually max out the machine, but if it is dedicated just to this, that is fine.
There are several ways to do this, but if you want an easy one-liner without much in the way of validation, I would go with an awk/sed solution.
Assume the following:
the files are tab delimited
you are using bash shell
the id in the data file is in the first column
your files look like this:
lookup
1 one
2 two
3 three
4 four
5 five
data
1 col2 col3 col4 col5
2 col2 col3 col4 col5
3 col2 col3 col4 col5
4 col2 col3 col4 col5
5 col2 col3 col4 col5
I would use awk and sed to accomplish this task like this:
awk '{print "sed -i s/^"$1"/"$2"/ data"}' lookup | bash
What this does is go through each line of lookup and write the following to stdout:
sed -i s/^1/one/ data
sed -i s/^2/two/ data
and so on.
It then pipes each line to the shell (| bash), which executes the sed expression. -i is for in-place editing; you may want -i.bak to create a backup file. Note that you can change the extension to whatever you would like.
The sed expression looks for the id at the start of the line, as indicated by the ^. You don't want to be replacing an 'id' in a column that might not contain an id.
Your output would look like the following:
one col2 col3 col4 col5
two col2 col3 col4 col5
three col2 col3 col4 col5
four col2 col3 col4 col5
five col2 col3 col4 col5
Of course, your ids are probably not simply 1 to one, 2 to two, etc., but this might get you started in the right direction. And I use the term right very loosely.
The way I'd do this is to use awk to write an awk program to process the larger file:
awk -f <(awk '
    BEGIN {print " BEGIN{"}
          {printf " a[\"%s\"]=\"%s\";",$1,$2}
    END   {print " }";
           print " {$1=a[$1];print $0}"}
' lookup_file.txt
) data.txt
That assumes that the id column is column 1; if not, you need to change both instances of $1 in $1=a[$1]
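To make the trick concrete: for the tab-delimited lookup sample shown earlier in this thread (1 one, 2 two, ...), the inner awk would print roughly the following text (spacing approximate), and that text is itself the awk program that the outer awk -f then applies to data.txt:
 BEGIN{
 a["1"]="one"; a["2"]="two"; a["3"]="three"; a["4"]="four"; a["5"]="five"; }
 {$1=a[$1];print $0}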
