Join 2 files by common column header (without awk/sed) - linux
Basically I want to get all records from file2, but filter out columns whose header doesn't appear in file1
Example:
file1
Name Location
file2
Name Phone_Number Location Email
Jim 032131 xyz xyz@qqq.com
Tim 037903 zzz zzz@qqq.com
Pimp 039141 xxz xxz@qqq.com
Output
Name Location
Jim xyz
Tim zzz
Pimp xxz
Is there a way to do this without awk or sed, but still using coreutils tools? I've tried doing it with join, but couldn't get it working.
ALL_COLUMNS=$(head -n1 file2)
for COLUMN in $(head -n1 file1); do
    JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Explanation:
ALL_COLUMNS=$(head -n1 file2)
It saves all the column names from file2, so we can look up each wanted column's position in the next step.
for COLUMN in $(head -n1 file1); do
    JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
For every column in file1, we look up the position of the column with the same name in file2 and append it to JOIN_FORMAT in the form "2.<column_number>,".
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Once the option string is complete (2.1,2.3, in this example), we pass it to join, stripping the trailing ,.
join prints the unpairable lines from the second file provided (-a2 -> file2), restricted to the columns specified in the -o option. Since the first file is /dev/null, every line of file2 is unpairable, so all of them get printed.
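For the sample files above, the loop builds the option string 2.1,2.3, so the command that finally runs is equivalent to this (expanded here only for illustration):

join -a2 -o 2.1,2.3 /dev/null file2

which prints fields 1 and 3 (Name and Location) of every line of file2, header included.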
Not very efficient, but works for your example:
#!/bin/bash
read -r -a cols < file1    # wanted column names
echo "${cols[@]}"          # print the output header

read -r -a header < <(head -n1 file2)
keep=()                    # indices of the columns to keep
for (( i=0; i<${#header[@]}; i++ )) ; do
    for c in "${cols[@]}" ; do
        if [[ ${header[i]} == "$c" ]] ; then
            keep+=("$i")
        fi
    done
done

while read -r -a data ; do
    for idx in "${keep[@]}" ; do
        printf '%s ' "${data[idx]}"
    done
    printf '\n'
done < <(tail -n+2 file2)
Tools used: head and tail. They aren't essential, though. And bash, of course.
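For reference, running it against the sample files should give the requested output (assuming the script is saved as filter.sh, a name used here purely for illustration; file1 and file2 are hard-coded inside):

bash filter.sh
Name Location
Jim xyz
Tim zzz
Pimp xxz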
Related
Fastest way to compare hundreds of thousands of files, and create output results file in bash
I have the following:
- Values File, values.txt
- Directory Structure: ./dataset/label/author/files.txt
- Tens of thousands of files.txt's
- A file called targets.txt, which contains the location of every files.txt

Example targets.txt:

./dataset/tallperson/Jabba/awesome.txt
./dataset/fatperson/Detox/toxic.txt

I have a file called values.txt, which contains hundreds of thousands of lines of values. These values are things like "aef", "; i", "jfk", etc. - random 3-character lines. I also have tens of thousands of files, each of which also contains hundreds to thousands of lines. Each line again holds a random 3-character value. values.txt was created using the values of each files.txt, therefore there is no value in any file.txt that isn't contained in values.txt. values.txt contains NO repeating values.

Example:

./dataset/weirdperson/Crooked/file1.txt

LOL
hel
lo
how
are
you
on
thi
s f
ine
day

./dataset/awesomeperson/Mild/file2.txt

I a
m v
ery
goo
d.
Tha
nks
LOL

values.txt

are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL

The above is just example data. Each file will contain hundreds of lines, and values.txt will contain hundreds of thousands of lines.

My goal here is to make one file where each line is a file. Each line will contain N values, where each value corresponds to the line in values.txt, and each value is separated by a comma. Each value is calculated simply by how many times each file contains the value of each line in values.txt. The result should look something like this, with line 1 being file1.txt and line 2 being file2.txt:

Result.txt

1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,

Now, the last thing is that after getting this result I would like to add a label. The label is equivalent to the Nth parent directory of the file. For this example, let's say the 2nd parent directory, therefore the label would be "tallperson" or "shortperson". As a result, the new Results.txt file would look like this:

Results.txt

1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson

I would like a way to accomplish all of this, but I need it to be fast as I am working with a very large scale dataset. This is my current code, but it's too slow. The bottleneck is line 2.

Script (each file located at "./dataset/label/author/file.java"):

1 while IFS= read file_name; do
2     cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" "$file_name" | xargs printf "%d," >> Results.txt;
3     label=$(echo "$file_name" | cut -d '/' -f 3);
4     printf "$label\n" >> Results.txt;
5 done < targets.txt

------------

To REPLICATE this problem, do the following:

mkdir -p dataset/{label1,label2}
touch file1.txt; chmod 777 file1.txt
touch file2.txt; chmod 777 file2.txt
echo "Enter anything here" > file1.txt
echo "Enter something here too" > file2.txt
mv file1.txt ./dataset/label1
mv file2.txt ./dataset/label2

find ./dataset/ -type f -name "*.txt" | while IFS= read file_name; do cat $file_name | sed -e "s/.\{3\}/&\n/g" | sort -u > $modified-file_name; done
find ./dataset/ -type f -name "modified-*.txt" | xargs -d '\n' -I {} echo {} >> targets.txt
xargs cat < targets.txt | sort -u > values.txt

With the above UNCHANGED, you should get a values.txt with something similar to the below. If there are any lines with less or more than 3 characters for some reason, please delete the line.

any
e
Ent
er
eth
he
her
ing
ng
re
som
thi
too

You should get a targets.txt file:

./dataset/label2/modified-file2.txt
./dataset/label1/modified-file1.txt

From here:
The goal is to check every file in targets.txt and count how many values from values.txt the file contains, then output the results with the label to Results.txt. The following script will work for this example, but I need it to be way faster for large scale operations.

while IFS= read file_name; do
    cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d," >> Results.txt;
    label=$(echo "$file_name" | cut -d '/' -f 3);
    printf "$label\n" >> Results.txt;
done < targets.txt

Here's another example.

Example 2:

./dataset/weirdperson/Crooked/file1.txt

LOL
LOL
HAHA

./dataset/awesomeperson/Mild/file2.txt

LOL
LOL
LOL

values.txt

LOL
HAHA

Result.txt

2,1,weirdperson
3,0,awesomeperson
Here's a solution in Python, using its ordered dictionary datatype.

import os
from collections import OrderedDict

# read samples from values.txt into an Ordered Dict.
# each dict key is a line from the file
# (including the trailing newline, but that doesn't matter)
# each dict value is 0
with open('values.txt', 'r') as f:
    samplecount0 = OrderedDict((sample, 0) for sample in f.readlines())

# get list of filenames from targets.txt
with open('targets.txt', 'r') as f:
    targets = [t.rstrip('\n') for t in f.readlines()]

# for each target,
#   read its lines of samples
#   increment the corresponding count in samplecount
#   print out samplecount in a single line separated by commas
#   each line also has the 2nd-to-last directory component of the target's pathname
for target in targets:
    with open(target, 'r') as f:
        # copy samplecount0 to samplecount so we don't have to read the values.txt file again
        samplecount = samplecount0.copy()
        # for each sample in the target file, increment the samplecount dict entry
        for tsample in f.readlines():
            samplecount[tsample] += 1
    output = ','.join(str(v) for v in samplecount.values())
    output += ',' + os.path.basename(os.path.dirname(os.path.dirname(target)))
    print(output)

Output:

$ python3 doit.py
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Try this:

<targets.txt xargs -n1 -P4 bash -c "
    awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
    sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --

The -P4 lets you parallelize the jobs in targets.txt. The short awk script matches the lines and prints 0 or 1 followed by a comma. Then sed is used to append the 3rd part of the folder path to the end of the line. The sed line looks strange because I used the unprintable character $'\x01' as the separator for the s command.

Tested with:

mkdir -p ./dataset/weirdperson/Crooked
cat <<EOF >./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
EOF
mkdir -p ./dataset/awesomeperson/Mild/
cat <<EOF >./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
EOF
cat <<EOF >values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
EOF
cat <<EOF >targets.txt
./dataset/weirdperson/Crooked/file1.txt
./dataset/awesomeperson/Mild/file2.txt
EOF

measure_start() {
    declare -g ttic_start
    echo "==> Test $* <=="
    ttic_start=$(date +%s.%N)
}
measure_end() {
    local end
    end=$(date +%s.%N)
    local start
    start="$ttic_start"
    ttic_runtime=$(python -c "print(${end} - ${start})")
    echo "Runtime: $ttic_runtime"
    echo
}

measure_start original
while IFS= read file_name; do
    cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d,"
    label=$(echo "$file_name" | cut -d '/' -f 3);
    printf "$label\n"
done < targets.txt
measure_end

measure_start first try with bash
nl -w1 values.txt | sort -k2.2 > values_sorted.txt
< targets.txt xargs -n1 -P0 bash -c "
    sort -t$'\t' \"\$1\" |
    join -t$'\t' -12 -21 -eEMPTY -a1 -o1.1,2.1 values_sorted.txt - |
    sort -s -n -k1.1 |
    sed 's/.*\tEMPTY/0/;t;s/.*/1/' |
    tr '\n' ',' |
    sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end

measure_start second try with awk
<targets.txt xargs -n1 -P0 bash -c "
    awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
    sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end

Outputs:

==> Test original <==
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Runtime: 0.133769512177

==> Test first try with bash <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0322473049164

==> Test second try with awk <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0180222988129
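A side note on the sed fragment that appends the label, since it admittedly looks strange: with an ordinary delimiter it is roughly equivalent to the following sketch (for a single target path in $1, and assuming the label contains no character that is special in the replacement, such as | or &; the original uses $'\x01' as the delimiter precisely so it cannot collide with the label, and also appends a newline):

label=$(cut -d/ -f3 <<<"$1")   # 3rd path component, e.g. "weirdperson" for ./dataset/weirdperson/Crooked/file1.txt
sed "s|\$|$label|"             # append the label to the end of the line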
How to use uniq after printf
I have a lot of files which I need to concatenate together by common prefix. I have an idea, but I do not know how to solve this problem.

files:

NAME1_C001_xxx.tsv
NAME1_C001_yyy.tsv
NAME2_C001_xxx.tsv
NAME2_C001_yyy.tsv

I want to print just the unique prefixes - NAME1 and NAME2. The lengths of the prefix and suffix vary, but the prefix is always followed by _C001.

My solution is:

for i in *.tsv
do
    prefix=$(printf "%s\n" "${i%_C001*}")
    cat "${prefix}"_C001_xxx.tsv "${prefix}"_C001_yyy.tsv > ${i%_C001*}.merged.tsv
done

But this solution is not very good: I process each prefix twice. Thank you for any help.

EDITED: One solution, thanks to anubhava:

for i in $(printf "%s\n" *.tsv | awk -F '_C001' '!seen[$1]++{print $1}')
do
    cat "${i}"_C001_xxx.tsv "${i}"_C001_yyy.tsv > "${i}".merged.tsv
done
You don't need printf at all here; it's just an unnecessary wrapper around the parameter substitution you are already using.

for i in *.tsv
do
    prefix=${i%_C001*}
    [[ -f $prefix.merged.tsv ]] && continue  # Avoid doing the same prefix twice
    cat "${prefix}"_* > "$prefix.merged.tsv"
done
As your filenames don't contain any newlines, you can pipe your list to an awk command to print unique prefixes, using _C001 as the field separator:

printf "%s\n" *.tsv | awk -F '_C001' '!seen[$1]++{print $1}'

NAME1
NAME2

You can also use _ as the FS in awk:

printf "%s\n" *.tsv | awk -F _ '!seen[$1]++{print $1}'
Shell nested for loop and string comparison
I have two files.

file1:

104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000

and file2:

103.27.24.114:99999999
103.27.24.114:88888888888
104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000

In file2 there are two new lines:

103.27.24.114:99999999
103.27.24.114:88888888888

So I want to check if there are new lines in the file:

for i in $(cat $2)
do
    for j in $(cat $1)
    do
        if [ $i = $j ]; then
            echo $i
        fi
    done
done

./program file1 file2

but I don't get the expected output. I think that my if statement is not working fine. What am I doing wrong?
Your problem is probably that you are looping over every line in file1 for each line in file2. The comm utility does what you want, but it assumes both files are sorted.

$ sort file1 -o file1
$ sort file2 -o file2
$ comm -13 file1 file2
103.27.24.114:99999999
103.27.24.114:88888888888
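If you would rather leave the input files untouched, the same comm invocation can read from process substitutions instead (a small sketch, assuming bash):

comm -13 <(sort file1) <(sort file2)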
This is what diff is for. Example:

$ diff dat/newdat1.txt dat/newdat2.txt
0a1,2
> 103.27.24.114:99999999
> 103.27.24.114:88888888888

Where newdat1.txt and newdat2.txt are:

104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000

and

103.27.24.114:99999999
103.27.24.114:88888888888
104.128.225.208:8000
103.27.24.114:80
104.128.225.208:8000

You can simply test the return of diff, with or without output, depending on the options and your needs. (e.g. if diff -q $file1 $file2 >/dev/null; then echo same; else echo differ; fi)
#!/bin/bash
for n in $(diff file1 file2); do
    if [ -z "$firstLineDiscarded" ]; then
        firstLineDiscarded=TRUE
    elif [ $n != ">" ]; then
        echo $n
    fi
done

If you're not attached to that particular approach, this seems to work. Of course it breaks down if the input syntax changes (includes spaces in the data), but for this strict application... maybe good enough.
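As the caveat above says, word-splitting the diff output breaks on data containing spaces. A slightly more robust sketch of the same idea, keeping only the lines added in file2, could be:

diff file1 file2 | sed -n 's/^> //p'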
Calculate Word occurrences from file in bash
I'm sorry for the very noob question, but I'm kind of new to bash programming (started a few days ago). Basically what I want to do is keep one file with all the word occurrences of another file.

I know I can do this:

sort | uniq -c | sort

The thing is that after that I want to take a second file, calculate the occurrences again and update the first one. After that I take a third file and so on. What I'm doing at the moment works without any problem (I'm using grep, sed and awk), but it looks pretty slow. I'm pretty sure there is a very efficient way with just a command or so, using uniq, but I can't figure it out. Could you please lead me to the right way? I'm also pasting the code I wrote:

#!/bin/bash
# count the number of word occurrences from a file and write them to another file
#
# the words are listed from the most frequent to the least frequent
#
touch .check             # used to check the occurrences. Temporary file
touch distribution.txt   # final file with all the occurrences calculated

page=$1           # contains the file I'm calculating
occurrences=$2    # temporary file for the occurrences

# takes all the words from the file $page and orders them by occurrences
cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check

# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
    word=${words}        # word I'm calculating
    strlen=${#word}      # word's length
    # I use a blacklist to skip banned words (for example very small or uninfluential words, like articles and prepositions)
    if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
    then
        # if the word was never found before it writes it with 1 occurrence
        if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
        then
            echo "$word: 1" | cat >> $occurrences
        # else it calculates the occurrences
        else
            old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
            let "new=old+1"
            sed -i "s/^$word: $old$/$word: $new/g" $occurrences
        fi
    fi
done

rm .check

# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt
Well, I'm not sure that I've got the point of the thing you are trying to do, but I would do it this way:

while read file
do
    cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list

Now you have statistics for all your files, and now you simply aggregate them:

while read file
do
    cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'

Example of usage:

$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF
$ while read file; do
>     cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list
$ while read file
> do
>     cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if ($2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head
3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell
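The awk one-liner above relies on the sort -k2 so that equal words are adjacent. An equivalent aggregation, as a sketch using an awk associative array (no pre-sorting needed; this assumes the stat.* files produced by the first loop are the only files matching that glob):

cat stat.* | awk '{count[$2] += $1} END {for (w in count) print count[w], w}' | sort -rn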
checking equality of a part of two files
Is it possible to check if the first lines of two files are equal using diff (or another simple bash command)? [More generally: checking equality of the first/last k lines, or even lines i to j.]
To diff the first k lines of two files:

$ diff <(head -n k file1) <(head -n k file2)

Similarly, to diff the last k lines:

$ diff <(tail -n k file1) <(tail -n k file2)

To diff lines i to j:

diff <(sed -n 'i,jp' file1) <(sed -n 'i,jp' file2)

(Here k, i and j stand for actual line numbers, e.g. head -n 4.)
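Since the question is ultimately about checking equality, the exit status of diff can be tested directly. For instance, for the first lines (a sketch along the same lines):

if diff -q <(head -n 1 file1) <(head -n 1 file2) >/dev/null; then
    echo "first lines are equal"
else
    echo "first lines differ"
fi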
My solution seems rather basic and beginner when compared to dogbane's above, but here it is all the same!

echo "Comparing the first line from file $1 and $2 to see if they are the same."
FILE1=`head -n 1 $1`
FILE2=`head -n 1 $2`
echo $FILE1 > tempfile1.txt
echo $FILE2 > tempfile2.txt

if diff "tempfile1.txt" "tempfile2.txt"; then
    echo Success
else
    echo Fail
fi
My solution uses the filterdiff program from the patchutils collection. The following command shows the differences between file1 and file2 from line number j to k:

diff -U 0 file1 file2 | filterdiff --lines j-k
The below command displays the first line of both files:

krithika.450> head -1 temp1.txt temp4.txt

==> temp1.txt <==
Starting CXC <...> R5x BCMBIN (c) AB 2012

==> temp4.txt <==
Starting CXC <...> R5x BCMBIN (c) AB 2012

The below command displays yes if the first line in both files is equal:

krithika.451> head -1 temp4.txt temp1.txt | awk '{if(NR==2)p=$0;if(NR==5){q=$0;if(p==q)print "yes"}}'
yes
krithika.452>
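A shorter check in the same spirit, shown here only as a sketch, compares the two first lines directly with a shell test:

if [ "$(head -n 1 temp1.txt)" = "$(head -n 1 temp4.txt)" ]; then echo yes; fi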