Randomly choose files and add their data to the data present in another folder - linux

I have two folders (DATA1 and DATA2), and inside each there are 3 subfolders (folder1, folder2 and folder3) as shown below:
DATA1
folder1/*.txt contain 5 files
folder2/*.txt contain 4 files
folder3/*.txt contain 10 files
DATA2
folder1/*.txt contain 8 files
folder2/*.txt contain 9 files
folder3/*.txt contain 10 files
As depicted above, each subfolder contains a varying number of files with different names, and each file contains two columns of data as shown below:
1 -2.4654174805e+01
2 -2.3655626297e+01
3 -2.2654634476e+01
4 -2.1654865265e+01
5 -2.0653873444e+01
6 -1.9654104233e+01
7 -1.8654333115e+01
8 -1.7653341293e+01
9 -1.6654792786e+01
10 -1.5655022621e+01
I just want to add the data folder-wise by choosing the second columns of files randomly.
I mean the second column of a randomly chosen file from DATA2/folder1/*.txt will be added to the second column of a file in DATA1/folder1/*.txt; similarly DATA2/folder2/*.txt will be added to DATA1/folder2/*.txt, and so on.
Most importantly, I don't want to disturb the first-column values in any folder; only the second column should be manipulated. Finally, I want to save the data.
Can anybody suggest a solution for this?
My directory and data structure are attached here:
https://i.fluffy.cc/2RPrcMxVQ0RXsSW1lzf6vfQ30jgJD8qp.html
I want to add the data folder-wise (from DATA2 to DATA1). First, enter DATA2/folder1, randomly choose any file and select its second column (as each file consists of two columns). Then add the selected second column to the second column of any file present inside DATA1/folder1 and save the result to the OUTPUT folder.

Since there's no code to start from, this won't be a ready-to-use answer but rather a few building blocks that may come in handy.
I'll show how to find all files, select a random file, select a random line and extract the value from its second column. Duplicating and adapting this for selecting a random file and column to add the value to is left as an exercise to the reader (a sketch of that step follows the building blocks below).
#!/bin/bash
IFS=
# a function to generate a random number
prng() {
    # You could use $RANDOM instead but it gives a narrower range.
    echo $(( $(od -An -N4 -t u4 < /dev/urandom) % $1 ))
}
# Find files
readarray -t files < <(find DATA2/folder* -mindepth 1 -maxdepth 1 -name '*.txt')
# Debug print-out of the files array
declare -p files
echo "Found ${#files[@]} files"
# List files one-by-one
for file in "${files[@]}"
do
    echo "$file"
done
# Select a random file
fileno=$(prng "${#files[@]}")
echo "Selecting file number $fileno"
filename=${files[$fileno]}
echo "which is $filename"
lines=$(wc -l < "$filename")
echo "and it has $lines lines"
# Add 1 since awk numbers its lines from 1 and up
rndline=$(( $(prng "$lines") + 1 ))
echo "selecting value in column 2 on line $rndline"
value=$(awk -v rndline="$rndline" 'NR == rndline { print $2 }' "$filename")
echo "which is $value"
# now pick a random file and line in the other folder using the same technique
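For the folder-wise addition itself, here is a minimal sketch of one way to finish the job, assuming the OUTPUT layout described in the question and that each DATA1 file and its randomly chosen DATA2 partner have the same number of lines (the names are illustrative):
#!/bin/bash
# Sketch: for each folder, pair every DATA1 file with a randomly chosen
# DATA2 file and add the second columns together, keeping column 1 intact.
for folder in folder1 folder2 folder3
do
    mkdir -p "OUTPUT/$folder"
    readarray -t donors < <(find "DATA2/$folder" -maxdepth 1 -name '*.txt')
    for target in "DATA1/$folder"/*.txt
    do
        # pick a random DATA2 file for this DATA1 file
        # ($RANDOM is enough for a handful of files; the prng function above
        # could be used instead for a wider range)
        donor=${donors[RANDOM % ${#donors[@]}]}
        # paste the two files side by side: columns 1 and 2 come from the
        # DATA1 file, column 4 is the second column of the DATA2 file
        paste "$target" "$donor" |
            awk '{ printf "%d %.10e\n", $1, $2 + $4 }' \
            > "OUTPUT/$folder/$(basename "$target")"
    done
done
If the two files in a pair can have different line counts, the trailing lines will have empty donor fields and their sums will be wrong, so check that first.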

Related

Get the most frequently appearing phrase (not word) in a file in bash

My file is
cat a.txt
a
b
aa
a
a a
I am trying to get the most frequently appearing phrase (not word).
My code is:
tr -c '[:alnum:]' '[\n*]' < a.txt | sort | uniq -c | sort -nr
4 a
1 b
1 aa
1
I need
2 a
1 b
1 aa
1 a a
sort a.txt | uniq -c | sort -rn
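Applied to the sample a.txt above, this counts whole lines, so "a a" stays a single phrase and the output is (the order of the ties with count 1 may vary):
2 a
1 b
1 aa
1 a a
(uniq -c pads the counts with leading spaces; they are omitted here for readability.)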
When you say “in Bash”, I’m going to assume that no external programs are allowed in this exercise. (Also, what is a phrase? I’m going to assume that there is one phrase per line and that no extra preprocessing (such as whitespace trimming) is needed.)
frequent_phrases() {
  local -Ai phrases
  local -ai {dense_,}counts
  local phrase
  local -i count i
  while IFS= read -r phrase; do            # Step 0
    ((++phrases["${phrase}"]))
  done
  for phrase in "${!phrases[@]}"; do       # Step 1
    ((count = phrases["${phrase}"]))
    ((++counts[count]))
    local -a "phrases_$((count))"
    local -n phrases_ref="phrases_$((count))"
    phrases_ref+=("${phrase}")
  done
  dense_counts=("${!counts[@]}")           # Step 2
  for ((i = ${#counts[@]} - 1; i >= 0; --i)); do  # Step 3
    ((count = dense_counts[i]))
    local -n phrases_ref="phrases_$((count))"
    for phrase in "${phrases_ref[@]}"; do
      printf '%d %s\n' "$((count))" "${phrase}"
    done
  done
}
frequent_phrases < a.txt
Steps taken by the frequent_phrases function (marked in code comments):
Read lines (phrases) into an associative array while counting their occurrences. This yields a mapping from phrases to their counts (the phrases array).
Create a reverse mapping from counts back to phrases. Obviously, this will be a “multimap”, because multiple different phrases can occur the same number of times. To avoid assumptions around separator characters disallowed in a phrase, we store lists of phrases for each count using dynamically named arrays (instead of a single array). For example, all phrases that occur 11 times will be stored in an array called phrases_11.
Besides the map inversion (from (phrase → count) to (count → phrases)), we also gather all known counts in an array called counts. Values of this array (representing how many different phrases occur a particular number of times) are somewhat useless for this task, but its keys (the counts themselves) are a useful representation of a sparse set of counts that can be (later) iterated in a sorted order.
We compact our sparse array of counts into a dense array of dense_counts for easy backward iteration. (This would be unnecessary if we were to just iterate through the counts in increasing order. A reverse order of iteration is not that easy in Bash, as long as we want to implement it efficiently, without trying all possible counts between the maximum and 1.)
We iterate through all known counts backwards (from highest to lowest) and for each count we print out all phrases that occur that number of times. Again, for example, phrases that occur 11 times will be stored in an array called phrases_11.
Just for completeness, to print out (also) the extra bits of statistics we gathered, one could extend the printf command like this:
printf 'count: %d, phrases with this count: %d, phrase: "%s"\n' \
"$((count))" "$((counts[count]))" "${phrase}"

rank grep result by entries' timestamp

I would like to rank log entries by the timestamp of each entry.
Let's say my grep result is like this, with each entry having a different number of fields and the time in a different column:
a, 3, time:123
b, time:124, 4
c, time:122, 5
How should I pipe the result so that it looks like this?
c, time:122, 5
a, 3, time:123
b, time:124, 4
Would you try the following:
while IFS= read -r line; do
    [[ $line =~ time:([0-9]+) ]] && printf "%s\t%s\n" "${BASH_REMATCH[1]}" "$line"
done < file | sort -n | cut -f 2-
It first extracts the time after the time: substring.
Then it prepends the time before the line using a tab as a delimiter.
It numerically sorts the lines.
Finally it cuts off the 1st field.
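With the sample lines above, the intermediate stream just before sort -n looks like this (a tab separates the two fields), which is why the numeric sort followed by cut -f 2- recovers the desired order:
123     a, 3, time:123
124     b, time:124, 4
122     c, time:122, 5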
A general solution (a sketch in awk follows this outline) is:
for each line:
detect log format
extract timestamp column based on detected format
convert timestamp into sortable-form
print sortable-form + column delimiter + original line
pipe output of previous stage into something that sorts on the new first column
pipe output of previous stage into something that strips off the new first column
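A rough awk implementation of that outline, assuming the timestamp is a plain integer after a literal time: marker (if it were a wall-clock date, the "sortable form" step would convert it to epoch seconds instead):
awk '{
    # detect/extract the timestamp field, wherever it sits on the line
    for (i = 1; i <= NF; i++)
        if (match($i, /time:[0-9]+/)) {
            ts = substr($i, RSTART + 5, RLENGTH - 5)   # sortable form: the bare number
            break
        }
    # print sortable form + tab + original line
    print ts "\t" $0
}' file | sort -n | cut -f 2-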

How to extract names of compound present in sub files?

I have a list of 15000 compound names (file name: uniq-compounds) which contains the names of 15000 folders. Each folder has a sub-file, out.pdbqt, which contains the compound name in its 3rd row (Name = 1-tert-butyl-5-oxo-N-[2-(3-pyridinyl)ethyl]-3-pyrrolidinecarboxamide). I want to extract those 15000 names, out of the 50,000 folders, by providing the uniq-compounds file (it contains folder names, e.g. ligand_*).
directory and subfiles
sidra --- 50,000 folders (ligand_00001 - ligand50,000) --- each contains a sub-file (out.pdbqt) --- that contains the names (shown below).
Another file (uniq-compound) contains the 15000 folder names (the ones whose compound names I want).
out.pdbqt
MODEL 1
REMARK VINA RESULT: -6.0 0.000 0.000
REMARK Name = 1-tert-butyl-5-oxo-N-[2-(3-pyridinyl)ethyl]-3-pyrrolidinecarboxamide
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: N_1 and C_7
Assuming uniq-compound.txt contains the folder names, each folder contains an out.pdbqt, and the compound name appears in the 3rd row of out.pdbqt, the script below will work:
#!/bin/bash
while IFS= read -r line; do
    awk 'FNR == 3 {print $4}' "$line"/out.pdbqt
done < uniq-compound.txt
The loop iterates through uniq-compound.txt line by line; for each line (i.e. folder name) it uses awk to print the 4th column of the 3rd line of the out.pdbqt file inside that folder.
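Note that $4 only captures the name up to the first whitespace, which is fine for single-token names like the example. A slightly more defensive variant (a sketch, assuming the "REMARK Name = " layout shown above) strips the prefix instead and tags each name with its folder:
#!/bin/bash
while IFS= read -r dir; do
    # remove the "REMARK Name = " prefix and print "folder: compound-name"
    awk -v d="$dir" 'FNR == 3 { sub(/^REMARK +Name += +/, ""); print d ": " $0 }' "$dir"/out.pdbqt
done < uniq-compound.txt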

Mapping lines to columns in *nix

I have a text file that was created when someone pasted from Excel into a text-only email message. There were originally five columns.
Column header 1
Column header 2
...
Column header 5
Row 1, column 1
Row 1, column 2
etc
Some of the data is single-word, some has spaces. What's the best way to get this data into column-formatted text with unix utils?
Edit: I'm looking for the following output:
Column header 1 Column header 2 ... Column header 5
Row 1 column 1 Row 1 column 2 ...
...
I was able to achieve this output by manually converting the data to CSV in vim by adding a comma to the end of each line, then manually joining each set of 5 lines with J. Then I ran the csv through column -ts, to get the desired output. But there's got to be a better way next time this comes up.
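One util-only way to do the same thing, as a sketch (assuming exactly five fields per record, as above, and that no field contains a tab): let paste re-wrap every five lines into one tab-separated row, then align it with column:
paste - - - - - < file.txt | column -t -s $'\t'
Here file.txt stands for the pasted-from-Excel text; each "-" tells paste to take one more line from standard input for the current output row.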
Perhaps a perl-one-liner ain't "the best" way, but it should work:
perl -ne 'BEGIN{ $fields_per_line = 5; $field_separator = "\t";
                 $line_break = "\n" }
          chomp;
          print $_,
                $. % $fields_per_line ? $field_separator : $line_break;
          END{ print $line_break }' INFILE > OUTFILE.CSV
Just substitute the "5", "\t" (tab) and "\n" (newline) as needed.
You could use a script that reads the file line by line and keeps a counter. When it reaches the line you want, use the cut command with a space as the delimiter to get the word you want:
counter=0
lineNumber=3
while read -r line
do
    counter=$((counter + 1))
    if [ "$counter" -eq "$lineNumber" ]
    then
        echo "$line" | cut -d" " -f 4
    fi
done

Fast removing duplicate rows between multiple files

I have 10k files with 80k rows each and need to compare them, and either delete the duplicate lines or replace them by "0". This has to be ultrafast since I have to do it 1000+ times.
The following script is fast enough for files with fewer than 100 rows. It is tcsh:
import csv
foreach file ( `ls -1 *` )
split -l 1 ${file} ${file}.
end
find *.* -type f -print0 | xargs -0 sha512sum | awk '($1 in aa){print $2}(!($1 in aa)){aa[$1]=$2}' | xargs -I {} cp rowzero {}
cat ${file}.* > ${file}.filtered
where "rowzero" is just a file with a... zero. I have tried python but haven't found a fast way. I have tried pasting them and doing all nice fast things (awk, sed, above commands, etc.) but the i/o slows to incredible levels when the file has over more than e.g. 1000 columns. I need help, thanks a million hours!.
ok this is so far the fastest code that I could make, which works on a transposed and "cat" input. As explained before, "cat"-ed input ">>" works fine however "paste" or "pr" code gives nightmares pasting another column in, say, +1GB files, and that is why we need to transpose. e.g.
each original file looks like
1
2
3
4
...
if we transpose and cat the first file with others the input for the code will look like:
1 2 3 4 ..
1 1 2 4 ..
1 1 1 4 ..
The code will return the original (i.e. re-transposed, pasted) format, with the minor detail that the rows come out shuffled:
1
1 2
1 2 3
2 3 4
..
The repeated rows were effectively removed. Below is the code.
HOWEVER, THE CODE IS NOT GENERAL! It only works with 1-digit integers since the awk array indexes are not sorted. Could someone help to generalize it? Thanks!
{for(ii=1;ii<=NF;ii++){aa[ii,$ii]=$ii}}END{mm=1; for (n in aa) {split(n, bb, SUBSEP);
if (bb[1]==mm){cc=bb[2]; printf ( "%2s", cc)}else{if (mm!=bb[1]){printf "\n%2s", bb[2] }; mm=bb[1]}}}
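One possible generalization of that idea, as a sketch (untested against the real data): remember, for every column position, each distinct value in first-seen order, and iterate the positions numerically in END, so neither the number of digits nor awk's unordered array traversal matters:
{
    if (NF > maxnf) maxnf = NF
    for (ii = 1; ii <= NF; ii++) {
        if (!((ii, $ii) in seen)) {        # first time this value shows up at position ii
            seen[ii, $ii] = 1
            vals[ii] = vals[ii] " " $ii    # append, preserving first-seen order
        }
    }
}
END {
    # one output line per original row position, duplicates already dropped
    for (ii = 1; ii <= maxnf; ii++)
        print substr(vals[ii], 2)          # drop the leading separator space
}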
