Join 2 files by common column header (without awk/sed) - linux

Basically I want to get all records from file2, but filter out columns whose header doesn't appear in file1
Example:
file1
Name Location
file2
Name Phone_Number Location Email
Jim 032131 xyz xyz@qqq.com
Tim 037903 zzz zzz@qqq.com
Pimp 039141 xxz xxz@qqq.com
Output
Name Location
Jim xyz
Tim zzz
Pimp xxz
Is there a way to do this without awk or sed, but still using coreutils tools? I've tried doing it with join, but couldn't get it working.

ALL_COLUMNS=$(head -n1 file2)
for COLUMN in $(head -n1 file1); do
    JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Explanation:
ALL_COLUMNS=$(head -n1 file2)
This saves all of file2's column names so we can search them next.
for COLUMN in $(head -n1 file1); do
    JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
For every column in file1, we look for the position of the column with the same name in file2 and append it to JOIN_FORMAT in the form "2.<column_number>,".
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Once the option string is complete (2.1,2.3,), we pass it to join, stripping the trailing ,.
join prints the unpairable lines from the second file provided (-a2 -> file2), but only the columns specified in the -o option.
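For the sample file1 and file2 above, JOIN_FORMAT ends up as "2.1,2.3,", so after the trailing comma is stripped the script effectively runs the command below; it should print the requested columns (a sketch of the expanded invocation, nothing beyond what the loop builds):
join -a2 -o 2.1,2.3 /dev/null file2
# Name Location
# Jim xyz
# Tim zzz
# Pimp xxz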

Not very efficient, but works for your example:
#!/bin/bash
read -r -a cols < file1
echo "${cols[@]}"
read -r -a header < <(head -n1 file2)
keep=()
for (( i=0; i<${#header[@]}; i++ )); do
    for c in "${cols[@]}"; do
        if [[ ${header[i]} == "$c" ]]; then
            keep+=("$i")
        fi
    done
done
while read -r -a data; do
    for idx in "${keep[@]}"; do
        printf '%s ' "${data[idx]}"
    done
    printf '\n'
done < <(tail -n+2 file2)
Tools used: head and tail. They aren't essential, though. And bash, of course.
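A quick usage sketch, assuming the script above is saved as filter.sh in the same directory as file1 and file2 (the script name is just for illustration):
$ bash filter.sh
Name Location
Jim xyz
Tim zzz
Pimp xxz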

Related

Fastest way to compare hundreds of thousands of files, and create output results file in bash

I have the following:
- A values file, values.txt
- Directory structure: ./dataset/label/author/files.txt
- Tens of thousands of files.txt's
- A file called targets.txt, which contains the location of every files.txt
Example targets.txt
./dataset/tallperson/Jabba/awesome.txt
./dataset/fatperson/Detox/toxic.txt
I have a file called values.txt, which contains hundreds of thousands of lines of values. These values are things like "aef", "; i", "jfk", etc.: random 3-character lines.
I also have tens of thousands of files, each of which also contains hundreds to thousands of lines. Each line is again a random 3-character string.
values.txt was created from the values of every files.txt, so there is no value in any files.txt that isn't contained in values.txt. values.txt contains NO repeating values.
Example:
./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
The above is just example data. Each file will contain hundreds of lines. And values.txt will contain hundreds of thousands of lines.
My goal here is to make one file where each line represents one input file. Each line will contain N comma-separated values, one per line of values.txt, where each value is simply the number of times that file contains the corresponding line of values.txt.
The result should look something like this, with line 1 being file1.txt and line 2 being file2.txt.
Result.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,
Now, the last thing is: after getting this result I would like to add a label. The label is equivalent to the Nth parent directory of the file; for this example, let's say the 2nd parent directory. Therefore the label would be "weirdperson" or "awesomeperson". As a result, the new Results.txt file would look like this.
Results.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
I would like a way to accomplish all of this, but I need it to be fast as I am working with a very large scale dataset.
This is my current code, but it's too slow. The bottleneck is line 2.
Script. Each file located at "./dataset/label/author/file.java"
1 while IFS= read file_name; do
2 cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" "$file_name" | xargs printf "%d," >> Results.txt;
3 label=$(echo "$file_name" | cut -d '/' -f 3);
4 printf "$label\n" >> Results.txt;
5 done < targets.txt
------------
To REPLICATE this problem. Do the following:
mkdir -p dataset/{label1,label2}
touch file1.txt; chmod 777 file1.txt
touch file2.txt; chmod 777 file2.txt
echo "Enter anything here" > file1.txt
echo "Enter something here too" > file2.txt
mv file1.txt ./dataset/label1
mv file2.txt ./dataset/label2
find ./dataset/ -type f -name "*.txt" | while IFS= read -r file_name; do cat "$file_name" | sed -e "s/.\{3\}/&\n/g" | sort -u > "$(dirname "$file_name")/modified-$(basename "$file_name")"; done
find ./dataset/ -type f -name "modified-*.txt" | xargs -d '\n' -I {} echo {} >> targets.txt
xargs cat < targets.txt | sort -u > values.txt
Run as-is, the above should give you a values.txt similar to the one below. If there are any lines with fewer or more than 3 characters for some reason, delete them.
any
e
Ent
er
eth
he
her
ing
ng
re
som
thi
too
You should also get a targets.txt file like this:
./dataset/label2/modified-file2.txt
./dataset/label1/modified-file1.txt
From here, the goal is to check every file in targets.txt, count for each value in values.txt how many times the file contains it, and output the results with the label to Results.txt.
The following script will work for this example, but I need it to be way faster for large scale operations.
while IFS= read file_name; do
cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d," >> Results.txt;
label=$(echo "$file_name" | cut -d '/' -f 3);
printf "$label\n" >> Results.txt;
done < targets.txt
Here's another example
Example 2:
./dataset/weirdperson/Crooked/file1.txt
LOL
LOL
HAHA
./dataset/awesomeperson/Mild/file2.txt
LOL
LOL
LOL
values.txt
LOL
HAHA
Result.txt
2,1,weirdperson
3,0,awesomeperson
Here's a solution in Python, using its ordered dictionary datatype.
import os
from collections import OrderedDict

# read samples from values.txt into an OrderedDict.
# each dict key is a line from the file
# (including the trailing newline, but that doesn't matter)
# each dict value is 0
with open('values.txt', 'r') as f:
    samplecount0 = OrderedDict((sample, 0) for sample in f.readlines())

# get list of filenames from targets.txt
with open('targets.txt', 'r') as f:
    targets = [t.rstrip('\n') for t in f.readlines()]

# for each target,
#   read its lines of samples
#   increment the corresponding count in samplecount
#   print out samplecount in a single line separated by commas
#   each line also has the 2nd-to-last directory component of the target's pathname
for target in targets:
    with open(target, 'r') as f:
        # copy samplecount0 to samplecount so we don't have to read the values.txt file again
        samplecount = samplecount0.copy()
        # for each sample in the target file, increment the samplecount dict entry
        for tsample in f.readlines():
            samplecount[tsample] += 1
    output = ','.join(str(v) for v in samplecount.values())
    output += ',' + os.path.basename(os.path.dirname(os.path.dirname(target)))
    print(output)
Output:
$ python3 doit.py
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Try this:
<targets.txt xargs -n1 -P4 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
The -P4 lets you run the jobs from targets.txt in parallel. The short awk script matches each line of values.txt against the target file and prints 0 or 1 followed by a comma. Then sed is used to append the 3rd component of the folder path to the end of the line. The sed line looks strange because I used the unprintable character $'\x01' as the separator for the s command.
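For readability, the same substitution can be written with an ordinary delimiter once the label is stored in a variable (a sketch equivalent to the sed above, assuming the label contains no |, & or \ characters):
label=$(cut -d/ -f3 <<<"$1")                       # third path component, e.g. "weirdperson"
awk '...' "$1" values.txt | sed "s|\$|$label\\n|"  # append the label and a final newline
The \x01 delimiter in the original exists precisely so no such assumption about the label is needed.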
Tested with:
mkdir -p ./dataset/weirdperson/Crooked
cat <<EOF >./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
EOF
mkdir -p ./dataset/awesomeperson/Mild/
cat <<EOF >./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
EOF
cat <<EOF >values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
EOF
cat <<EOF >targets.txt
./dataset/weirdperson/Crooked/file1.txt
./dataset/awesomeperson/Mild/file2.txt
EOF
measure_start() {
    declare -g ttic_start
    echo "==> Test $* <=="
    ttic_start=$(date +%s.%N)
}
measure_end() {
    local end
    end=$(date +%s.%N)
    local start
    start="$ttic_start"
    ttic_runtime=$(python -c "print(${end} - ${start})")
    echo "Runtime: $ttic_runtime"
    echo
}
measure_start original
while IFS= read file_name; do
cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d,"
label=$(echo "$file_name" | cut -d '/' -f 3);
printf "$label\n"
done < targets.txt
measure_end
measure_start first try with bash
nl -w1 values.txt | sort -k2.2 > values_sorted.txt
< targets.txt xargs -n1 -P0 bash -c "
sort -t$'\t' \"\$1\" |
join -t$'\t' -12 -21 -eEMPTY -a1 -o1.1,2.1 values_sorted.txt - |
sort -s -n -k1.1 |
sed 's/.*\tEMPTY/0/;t;s/.*/1/' |
tr '\n' ',' |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
measure_start second try with awk
<targets.txt xargs -n1 -P0 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
Outputs:
==> Test original <==
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Runtime: 0.133769512177
==> Test first try with bash <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0322473049164
==> Test second try with awk <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0180222988129

How to use uniq after printf

I have a lot of files which I need to concatenate together by common prefix. I have an idea, but I do not know how to solve this problem:
files:
NAME1_C001_xxx.tsv
NAME1_C001_yyy.tsv
NAME2_C001_xxx.tsv
NAME2_C001_yyy.tsv
I want to print just the unique prefixes - NAME1 and NAME2. The length of the prefix and suffix varies, but the prefix is always followed by _C001.
My solution is:
for i in *.tsv
do prefix=$(printf "%s\n" "${i%_C001*}")
    cat "${prefix}"_C001_xxx.tsv "${prefix}"_C001_yyy.tsv > "${i%_C001*}".merged.tsv
done;
But this solution is not very good: I process each prefix twice.
Thank you for any help.
EDITED:
One solution thanks to anubhava:
for i in $(printf "%s\n" *.tsv | awk -F '_C001' '!seen[$1]++{print $1}')
do
    cat "${i}"_C001_xxx.tsv "${i}"_C001_yyy.tsv > "${i}".merged.tsv
done;
You don't need printf at all here; it's just an unnecessary wrapper around the parameter substitution you are already using.
for i in *.tsv
do prefix=${i%_C001*}
    [[ -f $prefix.merged.tsv ]] && continue # Avoid doing the same prefix twice
    cat "${prefix}"_* > "$prefix.merged.tsv"
done
As your filenames don't contain any newlines, you can pipe your list to an awk command to print unique prefixes, using _C001 as the field separator:
printf "%s\n" *.tsv | awk -F '_C001' '!seen[$1]++{print $1}'
NAME1
NAME2
You can also use _ as FS in awk:
printf "%s\n" *.tsv | awk -F _ '!seen[$1]++{print $1}'

Shell nested for loop and string comparison

I have two files
file1
104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000
and file2
103.27.24.114:99999999
103.27.24.114:88888888888
104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000
and in file2 there are two new lines
103.27.24.114:99999999
103.27.24.114:88888888888
So I want to check if there are new lines in file2.
for i in $(cat $2)
do
    for j in $(cat $1)
    do
        if [ $i = $j ]; then
            echo $i
        fi
    done
done
./program file1 file2
but I don't get the expected output. I think my if statement is not working correctly. What am I doing wrong?
Your problem is probably that you are looping over every line in file1 for each line in file2.
The comm utility does what you want, but it assumes both files are sorted.
$ sort file1 -o file1
$ sort file2 -o file2
$ comm -13 file1 file2
103.27.24.114:99999999
103.27.24.114:88888888888
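If you'd rather not sort the files in place, process substitution gives the same result while leaving file1 and file2 untouched:
$ comm -13 <(sort file1) <(sort file2)
103.27.24.114:88888888888
103.27.24.114:99999999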
This is what diff is for. Example:
$ diff dat/newdat1.txt dat/newdat2.txt
0a1,2
> 103.27.24.114:99999999
> 103.27.24.114:88888888888
Where newdat1.txt and newdat2.txt are:
104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000
and
103.27.24.114:99999999
103.27.24.114:88888888888
104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000
You can simply test the return of diff with or without output depending on the options and your needs. (e.g. if diff -q $file1 $file2 >/dev/null; then echo same; else echo differ; fi)
#!/bin/bash
for n in $(diff file1 file2); do
    if [ -z "$firstLineDiscarded" ]; then
        firstLineDiscarded=TRUE
    elif [ $n != ">" ]; then
        echo $n
    fi
done
If you're not attached to that particular approach this seems to work.
Of course it breaks down if the input syntax changes (includes spaces in the data), but for this strict application... maybe good enough.
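If GNU diff is available, its line-format options can print the added lines directly, which sidesteps the word-splitting problem entirely (a sketch, not part of the answer above; %L stands for the full contents of each line):
diff --unchanged-line-format='' --old-line-format='' --new-line-format='%L' file1 file2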

Calculate Word occurrences from file in bash

I'm sorry for the very noob question, but I'm kind of new to bash programming (I started a few days ago). Basically what I want to do is keep one file with all the word occurrences from another file.
I know I can do this:
sort | uniq -c | sort
The thing is that after that I want to take a second file, calculate the occurrences again and update the first one. Then I take a third file, and so on.
What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow.
I'm pretty sure there is a very efficient way with a command or two, using uniq, but I can't figure it out.
Could you please lead me to the right way?
I'm also pasting the code I wrote:
#!/bin/bash
# count the number of word occurrences from a file and write them to another file #
# the words are listed from the most frequent to the least frequent one           #
touch .check            # used to check the occurrences. Temporary file
touch distribution.txt  # final file with all the occurrences calculated
page=$1                 # contains the file I'm calculating
occurrences=$2          # temporary file for the occurrences
# takes all the words from the file $page and orders them by occurrences
cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check
# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
    word=${words}       # word I'm calculating
    strlen=${#word}     # word's length
    # I use a blacklist to skip banned words (for example very small or insignificant words, like articles and prepositions)
    if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
    then
        # if the word was never found before it writes it with 1 occurrence
        if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
        then
            echo "$word: 1" | cat >> $occurrences
        # else it calculates the occurrences
        else
            old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
            let "new=old+1"
            sed -i "s/^$word: $old$/$word: $new/g" $occurrences
        fi
    fi
done
rm .check
# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt
Well, I'm not sure I've got the point of what you are trying to do, but I would do it this way:
while read file
do
    cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list
Now you have statistics for all your files, and you can simply aggregate them:
while read file
do
cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'
Example of usage:
$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF
$ while read file; do
> cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list
$ while read file
> do
> cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head
3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell

checking equality of a part of two files

Is it possible to check if the first lines of two files are equal using diff (or another simple bash command)?
[More generally: checking equality of the first/last k lines, or even lines i to j.]
To diff the first k lines of two files:
$ diff <(head -k file1) <(head -k file2)
Similarly, to diff the last k lines:
$ diff <(tail -k file1) <(tail -k file2)
To diff lines i to j:
diff <(sed -n 'i,jp' file1) <(sed -n 'i,jp' file2)
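For example, with concrete numbers substituted for i and j, comparing lines 3 through 7 of each file looks like:
diff <(sed -n '3,7p' file1) <(sed -n '3,7p' file2)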
My solution seems rather basic and beginner-level compared to dogbane's above, but here it is all the same!
echo "Comparing the first line from file $1 and $2 to see if they are the same."
FILE1=`head -n 1 $1`
FILE2=`head -n 1 $2`
echo $FILE1 > tempfile1.txt
echo $FILE2 > tempfile2.txt
if diff "tempfile1.txt" "tempfile2.txt"; then
echo Success
else
echo Fail
fi
My solution uses the filterdiff program of the patchutils program collection. The following command shows the difference between file1 and file2 from line number j to k:
diff -U 0 file1 file2 | filterdiff --lines j-k
The command below displays the first line of both files:
krithika.450> head -1 temp1.txt temp4.txt
==> temp1.txt <==
Starting CXC <...> R5x BCMBIN (c) AB 2012
==> temp4.txt <==
Starting CXC <...> R5x BCMBIN (c) AB 2012
The command below displays yes if the first lines of both files are equal:
krithika.451> head -1 temp4.txt temp1.txt | awk '{if(NR==2)p=$0;if(NR==5){q=$0;if(p==q)print "yes"}}'
yes
krithika.452>
