I have the following:
-A values file, values.txt
-Directory structure: ./dataset/label/author/files.txt
-Tens of thousands of files.txt files
-A file called targets.txt, which contains the location of every files.txt
Example targets.txt
./dataset/tallperson/Jabba/awesome.txt
./dataset/fatperson/Detox/toxic.txt
I have a file called values.txt, which contains hundreds of thousands of lines of values. These values are things like "aef", "; i", "jfk", etc.: random 3-character lines.
I also have tens of thousands of files, each of which contains hundreds to thousands of lines. Each of those lines is also a random 3-character line.
values.txt was created from the values of every files.txt, so there is no value in any files.txt that isn't contained in values.txt. values.txt contains NO repeating values.
Example:
./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
The above is just example data. Each file will contain hundreds of lines, and values.txt will contain hundreds of thousands of lines.
My goal here is to make one file, where each line represents one input file. Each line will contain N values, where the Nth value corresponds to the Nth line of values.txt, and each value is separated by a comma. Each value is calculated simply as how many times the file contains the value on that line of values.txt.
The result should look something like this, with line 1 being file1.txt and line 2 being file2.txt.
Result.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,
Now, the last thing is: after getting this result, I would like to add a label. The label is equivalent to the Nth parent directory of the file. For this example, let's say the 2nd parent directory. Therefore the label would be "weirdperson" or "awesomeperson". As a result, the new Results.txt file would look like this.
Results.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
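For reference, a minimal sketch of just the label extraction, using the same cut call as the script below (the path is taken from the example above):
path=./dataset/weirdperson/Crooked/file1.txt
label=$(echo "$path" | cut -d/ -f3)   # 3rd slash-separated field = 2nd parent directory
echo "$label"                         # -> weirdperson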
I would like a way to accomplish all of this, but I need it to be fast, as I am working with a very large-scale dataset.
This is my current code, but it's too slow. The bottleneck is line 2, which spawns a separate grep process for every line of values.txt, for every single file.
Script (each file is located at "./dataset/label/author/file.java"):
1 while IFS= read file_name; do
2 cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" "$file_name" | xargs printf "%d," >> Results.txt;
3 label=$(echo "$file_name" | cut -d '/' -f 3);
4 printf "$label\n" >> Results.txt;
5 done < targets.txt
------------
To REPLICATE this problem, do the following:
mkdir -p dataset/{label1,label2}
touch file1.txt; chmod 777 file1.txt
touch file2.txt; chmod 777 file2.txt
echo "Enter anything here" > file1.txt
echo "Enter something here too" > file2.txt
mv file1.txt ./dataset/label1
mv file2.txt ./dataset/label2
find ./dataset/ -type f -name "*.txt" | while IFS= read file_name; do cat $file_name | sed -e "s/.\{3\}/&\n/g" | sort -u > $modified-file_name; done
find ./dataset/ -type f -name "modified-*.txt" | xargs -d '\n' -I {} echo {} >> targets.txt
xargs cat < targets.txt | sort -u > values.txt
With the above UNCHANGED, you should get a values.txt similar to the one below. If any lines have fewer or more than 3 characters for some reason, delete them (see the one-liner after this list).
any
e
Ent
er
eth
he
her
ing
ng
re
som
thi
too
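A minimal way to drop those malformed lines, assuming a POSIX grep (-x matches whole lines, and . never matches a newline, so only exact 3-character lines survive; values.tmp is just a scratch name):
grep -x '...' values.txt > values.tmp && mv values.tmp values.txt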
You should also get a targets.txt file like this:
./dataset/label2/modified-file2.txt
./dataset/label1/modified-file1.txt
From here, the goal is to check every file in targets.txt and count how many times it contains each value in values.txt, then output the results, with the label, to Results.txt.
The following script works for this example, but I need it to be much faster for large-scale operations.
while IFS= read file_name; do
cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d," >> Results.txt;
label=$(echo "$file_name" | cut -d '/' -f 3);
printf "$label\n" >> Results.txt;
done < targets.txt
Here's another example
Example 2:
./dataset/weirdperson/Crooked/file1.txt
LOL
LOL
HAHA
./dataset/awesomeperson/Mild/file2.txt
LOL
LOL
LOL
values.txt
LOL
HAHA
Result.txt
2,1,weirdperson
3,0,awesomeperson
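For illustration, a minimal single-pass sketch of the counting step for Example 2, using plain awk (the file paths are the ones from the example; the label still has to be appended separately):
awk 'NR==FNR { order[FNR] = $0; count[$0] = 0; n = FNR; next }  # 1st file: remember value order
     ($0 in count) { count[$0]++ }                              # 2nd file: tally occurrences
     END { for (i = 1; i <= n; i++) printf "%d,", count[order[i]]; print "" }' \
    values.txt ./dataset/weirdperson/Crooked/file1.txt
# -> 2,1,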
Here's a solution in Python, using its ordered dictionary datatype.
import os
from collections import OrderedDict

# Read samples from values.txt into an OrderedDict.
# Each dict key is a line from the file
# (including the trailing newline, but that doesn't matter);
# each dict value is 0.
with open('values.txt', 'r') as f:
    samplecount0 = OrderedDict((sample, 0) for sample in f.readlines())

# Get the list of filenames from targets.txt.
with open('targets.txt', 'r') as f:
    targets = [t.rstrip('\n') for t in f.readlines()]

# For each target:
#   read its lines of samples,
#   increment the corresponding count in samplecount,
#   print out samplecount in a single line separated by commas.
# Each line also gets the 2nd-to-last directory component of the target's pathname.
for target in targets:
    with open(target, 'r') as f:
        # Copy samplecount0 to samplecount so we don't have to re-read values.txt.
        samplecount = samplecount0.copy()
        # For each sample in the target file, increment the samplecount dict entry.
        for tsample in f.readlines():
            samplecount[tsample] += 1
    output = ','.join(str(v) for v in samplecount.values())
    output += ',' + os.path.basename(os.path.dirname(os.path.dirname(target)))
    print(output)
Output:
$ python3 doit.py
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
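To build the final file rather than print to the terminal, redirect the output:
python3 doit.py > Results.txt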
Try this:
<targets.txt xargs -n1 -P4 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
The -P4 lets you run four of the jobs from targets.txt in parallel. The short awk script loads the target file into an array, then checks each line of values.txt against it and prints 1 or 0 followed by a comma. Then sed is used to append the 3rd component of the folder path to the end of the line. The sed command looks strange because I used the unprintable character $'\x01' as the separator for the s command.
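Here is the sed trick in isolation, with a hypothetical literal label (GNU sed treats the \n in the replacement as a newline, which supplies the line ending that awk's printf never emitted):
printf '1,0,1,' | sed $'s\x01$\x01mylabel\\n\x01'
# -> 1,0,1,mylabel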
Tested with:
mkdir -p ./dataset/weirdperson/Crooked
cat <<EOF >./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
EOF
mkdir -p ./dataset/awesomeperson/Mild/
cat <<EOF >./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
EOF
cat <<EOF >values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
EOF
cat <<EOF >targets.txt
./dataset/weirdperson/Crooked/file1.txt
./dataset/awesomeperson/Mild/file2.txt
EOF
measure_start() {
declare -g ttic_start
echo "==> Test $* <=="
ttic_start=$(date +%s.%N)
}
measure_end() {
local end
end=$(date +%s.%N)
local start
start="$ttic_start"
ttic_runtime=$(python -c "print(${end} - ${start})")
echo "Runtime: $ttic_runtime"
echo
}
measure_start original
while IFS= read file_name; do
cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d,"
label=$(echo "$file_name" | cut -d '/' -f 3);
printf "$label\n"
done < targets.txt
measure_end
measure_start first try with bash
nl -w1 values.txt | sort -k2.2 > values_sorted.txt
< targets.txt xargs -n1 -P0 bash -c "
sort -t$'\t' \"\$1\" |
join -t$'\t' -12 -21 -eEMPTY -a1 -o1.1,2.1 values_sorted.txt - |
sort -s -n -k1.1 |
sed 's/.*\tEMPTY/0/;t;s/.*/1/' |
tr '\n' ',' |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
measure_start second try with awk
<targets.txt xargs -n1 -P0 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
Outputs:
==> Test original <==
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Runtime: 0.133769512177
==> Test first try with bash <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0322473049164
==> Test second try with awk <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0180222988129
------------
Say I have 8b1f 0008 0231 49f6 0300 f1f3 75f4 0c72 f775 0850 7676 720c 560d 75f0 02e5 ce00 0861 1302 0000 0000. How can I easily get a binary file from that without copying and pasting into a hex editor?
Use:
% xxd -r -p in.txt out.bin
See the xxd(1) man page.
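A quick worked sketch with the first few words from the question (xxd -r -p ignores whitespace and line breaks; in.txt and out.bin are arbitrary names):
printf '8b1f 0008 0231 49f6\n' > in.txt
xxd -r -p in.txt out.bin
xxd out.bin   # inspect the round-tripped bytes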
This version will work with binary input too:
cat /bin/sh \
| od -A n -v -t x1 \
| tr -d '\r' \
| xxd -r -p \
| md5sum && md5sum /bin/sh
The extra tr -d '\r' is only needed if you're dealing with DOS text files...
Processing one byte at a time (-t x1) avoids endianness differences when parts of the pipe run on different systems.
All the present answers refer to the convenient xxd -r approach, but for situations where xxd is not available or convenient, here is a more portable (and more flexible, but more verbose and less efficient) solution, using only POSIX shell syntax plus printf's \x escapes, which most shells support. It also compensates for an odd number of digits in the input:
un_od() {
    # Strip whitespace, left-pad with a 0 if the digit count is odd,
    # turn every hex pair into a \xHH escape, and let printf emit the bytes.
    printf -- "$(
        tr -d '\t\r\n ' | sed -e 's/^\(.\(.\{2\}\)*\)$/0\1/' -e 's/\(.\{2\}\)/\\x\1/g'
    )"
}
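Usage sketch (assumes a printf that understands \x escapes, such as bash's builtin):
printf '74 65 73 74 0a' | un_od   # -> test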
By the way: you don't specify whether your input is big-endian or little-endian, or whether you want big/little-endian output. Usually input such as in your question would be big-endian/network-order (e.g., as created by od -t x1 -An -v), and would be expected to transform to big-endian output. I presume xxd just assumes that default if not told otherwise, and this solution does that too. If byte-swapping is needed, how you do the byte-swapping also depends on the word-size of the system (e.g., 32 bit, 64 bit) and very rarely the byte-size (you can almost always assume 8-bit bytes - octets - though).
The functions below use a more complex version of the binary -> od -> binary trick to portably byte-swap binary data, conditional on system endianness and accounting for system word size. The algorithm works for anything up to a 72-bit word size (beyond 9 bytes it breaks, because seq -s '' 10 yields 12345678910, where '10' occupies two characters and the tr mapping needs exactly one character per byte position):
if { sed --version 2>/dev/null || :; } | head -n 1 | grep -q 'GNU sed'; then
    _sed() { sed -r "$@"; }
else
    _sed() { sed -E "$@"; }
fi
sys_bigendian() {
    # Dump the single byte 'I' (0x49) as a 16-bit octal word: the 6th digit
    # of that word is 0 on a big-endian system and 1 on a little-endian one.
    return $(
        printf 'I' | od -t o2 | head -n 1 | \
        _sed -e 's/^[^ \t]+[ \t]+([^ \t]+)[ \t]*$/\1/' | cut -c 6
    )
}
sys_word_size() { expr $(getconf LONG_BIT) / 8; }
byte_swap() {
    _wordsize=$1
    # Dump one octal byte per line, glue _wordsize bytes back together into
    # backslash-separated words, then re-emit each word's bytes in reverse order.
    od -An -v -t o1 | _sed -e 's/^[ \t]+//' | tr -s ' ' '\n' | \
    paste -d '\\' $(for _cnt in $(seq $_wordsize); do printf -- '- '; done) | \
    _sed -e 's/^/\\/' -e '$ s/\\+$//' | \
    while read -r _word; do
        # _word holds this word's bytes as \nnn octal escapes (4 chars each);
        # tr decodes the escapes and maps the position digits to bytes in reverse.
        _thissize=$(expr $(printf '%s' "$_word" | wc -c) / 4)
        printf '%s' "$(seq -s '' $_thissize)" | tr -d '\n' | \
        tr "$(seq -s '' $_thissize -1 1)" "$_word"
    done
    unset _wordsize _word _thissize
}
You can use the above to output file contents in big-endian format regardless of system endianness:
if sys_bigendian; then
cat /bin/sh
else
cat /bin/sh | byte_swap $(sys_word_size)
fi
Here is a way to reverse od output:
echo "test" | od -A x -t x1 | sed -e 's|^[0-f]* *||' | xxd -r -p
test