grouping input files with incremental names to divide a computation into smaller ones - linux

I created a script to run a function that takes a specific file and compares it to several others. There are more than 900 files to scan for matches, incrementally named like this:
A001-abc.txt A001-efg.txt [..] A002-efg.txt A002-hjk.txt [..] A120-xwz.txt (no whitespaces)
The script looks like this:
#!/bin/bash
for f in *.txt; do
    bedtools function /mydir/myfile.txt *.txt > output.txt
done
Now, the computation is extremely intensive and can't be completed. How can I group the files by their incremental prefix and run the computation on A001-abc, A001-def, then move on to A002-abc, A002-def, and so on through the last prefix?
I was thinking that by finding a way to specify the name with an incremental variable and then grouping the files, I could divide the computation into many smaller ones. Is that reasonable and doable?
Many thanks!

The problem might be that you're not using $f. Try using this line in the for-loop instead. Also, use >> so the output appends instead of overwriting the output file.
bedtools function /mydir/myfile.txt "$f" >> output.txt
But if you still want to break it up, you might try a double loop like this:
for n in $(seq --format="%03.f" 1 120); do
    for f in "A$n"*; do
        bedtools function /mydir/myfile.txt "$f" >> "${n}output.txt"
    done
done
The values of $n from the output of the seq command are 001 002 .. 120.
So the inner loop loops through A001* A002* .. A120*.
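If hard-coding the 1..120 range ever becomes awkward, a minimal sketch (under the assumption that every file follows the A001-abc.txt pattern above) could derive the group prefix directly from each filename:
#!/bin/bash
# Sketch only: take the part of each name before the first "-" as the group,
# and send each group's results to its own output file.
for f in *.txt; do
    prefix=${f%%-*}                                   # e.g. A001
    bedtools function /mydir/myfile.txt "$f" >> "${prefix}_output.txt"
done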

Related

How can I make each line in a file a variable?

I am using a Unix-based program. I want to automate the process so I don't have to copy and paste the data one entry at a time. For this, I need to feed each line of data in a file to the code as a variable.
The program converts xyz coordinates to local coordinates. How can I run the coordinates in the xyz_coordinates file I created, one by one, through the code below? In the program I use, the conversion command works like this:
echo 4208830.039709186 2334850.551667509 4171267.377406844 -6.753E-01 4.493E-01 2.849E-01 | xyz2env.py
and this is the file i am trying to run:
2679689.926729193 -727950.9964290063 5722789.538975053 7.873E-02 3.466E-01 6.410E-01
2679689.927123377 -727950.9971557076 5722789.540522 7.912E-02 3.458E-01 6.425E-01
2679689.930567728 -727950.9979971027 5722789.550832021 8.257E-02 3.450E-01 6.528E-01
2679689.931029495 -727950.9992263148 5722789.549927638 8.303E-02 3.438E-01 6.519E-01
2679689.929031829 -727950.9981009626 5722789.546359798 8.103E-02 3.449E-01 6.484E-01
........
and it goes on like this. Also, there are blank lines between the data lines. Will this be a problem?
You can use xargs to invoke the command with a specific number of arguments (6 in your case); it also has the advantage of skipping empty lines automatically:
< file.txt xargs -n 6 xyz2env.py
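If xyz2env.py actually reads the coordinates from stdin, as in the echo example above, rather than taking them as arguments, a hedged alternative is a line-by-line loop that skips the blank lines:
# Sketch only: pipe each non-empty line of the file to xyz2env.py via stdin,
# mirroring the "echo ... | xyz2env.py" usage shown above.
while IFS= read -r line; do
    [ -n "$line" ] && echo "$line" | xyz2env.py
done < file.txt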

Best way to identify similar text inside strings?

I have a list of phrases; actually it's an Excel file, but I can extract each single line if needed.
I need to find the lines that are quite similar. For example, one line can be:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)
and some lines later I can have the same line, or this one:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)
As you can see, these two lines are pretty much the same: not equal in this case, but about 98% similar.
The main problem is that I have to process about 45k lines, so I'm looking for a quick and maybe visual way to do that.
The first thing that came to my mind was to compare the 1st line to the 2nd, then the 3rd, up to the end, then do the same with the 2nd, the 3rd, and so on up to the second-to-last, and build a kind of score: for example, the 1st line is 100% similar to line 42, 99% to line 522, ..., 21% to line 22142, etc.
But that's only one idea, maybe not the best one.
Maybe there's already a good program/script/online service out there; I searched but couldn't find one, so in the end I'm asking here.
Does anyone know a good way (if this is possible), or a script or online service, to achieve this?
One thing you can do is write a script which does the following:
Extract the data from the csv file
Define a regex that can capture the similarity; a Python example could be:
[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)
or something along those lines; refer to the documentation.
The problem you have is that you are not looking for an exact match, but a fuzzy, LIKE-style one.
This is a problem even databases have never really solved, and it results in a full table scan.
So we're unlikely to solve it outright.
However, I'd like to propose that you consider alternatives:
You could decide to limit the differences to specific character sets.
In the above example, you were ignoring numbers but respecting letters.
If we can assume that this rule will always hold true, then we can perform a text replace on the string.
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°) ==> ANTIBRATING SSPIRING JOINT (type _) mod. GA_ (temp.max_°)
Now we can deal with this problem by performing an exact string comparison. This can be done by hashing: the easiest way is to feed a hashmap/hashset, or a database with a hash index on the column where you store this adjusted text.
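As a minimal shell sketch of that normalize-then-compare idea (assuming the phrases sit one per line in a hypothetical phrases.txt, and that digits are the only part allowed to differ):
# Sketch only: collapse digit runs to "_" and print every normalized line
# that occurs more than once, i.e. each group of near-identical phrases.
sed 's/[0-9][0-9]*/_/g' phrases.txt | sort | uniq -d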
You could decide to trade time for space.
For example, you can feed the strings to a service that builds lots of different index variations on them: feed Elasticsearch with your data, then perform analytic queries on it.
Fuzzy search is the key.
I found several projects and ideas, but the one I used is tre-agrep. I know it's quite old, but in this case it works for me. I created this little script to help me build a list of differences, so I can check them manually against my file:
#!/bin/bash
########## CONFIGURATIONS ##########
original_file=/path/jjj.txt
t_agrep_bin="$(command -v tre-agrep)"
destination_file=/path/destination_file.txt
distance=1
########## CONFIGURATIONS ##########
lines=$(grep -c "" "$original_file")
if [[ -s "$destination_file" ]]; then
    rm -f "$destination_file"
fi
start=1
while IFS= read -r line; do
    echo "Checking line $start/$lines"
    lista=$("$t_agrep_bin" -"$distance" -B --colour -s -n -i "$line" "$original_file")
    echo "$lista" | awk -F ':' '{print $1}' ORS=' ' >> "$destination_file"
    echo >> "$destination_file"
    start=$((start+1))
done < "$original_file"

better way to avoid use of ls inside of variable

Not sure how to correctly title this; please change it if you prefer.
Given that my code actually works, I'd like a peer review to increase its quality.
I have a folder full of .zip files. These files are streams of data (identifiable by their stream name), offloaded daily. There can be more than one file per stream per day, so I need to grab the latest one. I can't rely on the POSIX timestamp for this, so the files expose the timestamp in their name.
Filename example:
XX_XXYYZZ_XYZ_05_AB00C901_T001_20170808210052_20170808210631.zip
The last two fields are timestamps, and I'm interested in the second-to-last.
The other fields are not relevant (for now).
I've previously stored the stream name (in this case XYZ_05_AB00C901_T001) in the variable $stream.
I have this line of code:
match=$(ls "$streamPath"/*.zip|grep "$stream"|rev|cut -d'_' -f2|rev|sort|tail -1)
What it does is search the given path for files matching the stream, cut out the timestamp and sort the results. Now that I know the last timestamp for this stream, I can ls again, this time grepping for $stream and $match together, and I'm done:
streamFile=$(ls "$streamPath"/*.zip|grep "$stream.*$match\|$match.*$stream")
Question time:
Is there a better way to achieve my goal? Probably more than one; I'd prefer a one-liner solution, though.
ShellCheck advises me that it would be better to use a for loop or a while loop instead of ls, to be able to handle unusual filenames (which I'm not facing ATM, but who knows), but I'm not so sure about it (it seems more complicated to me).
Thanks.
O.
Thanks to the page suggested by Cyrus I chose to go with this solution:
echo "$file"|grep "$stream"|rev|cut -d'_' -f2|rev|sort|tail -1
done < <(find "$streamPath" -maxdepth 1 -type f -name '*.zip' -print0)
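If a one-liner is still preferred, a hedged variant of the same idea (assuming GNU find and the fixed underscore layout shown above, where the second-to-last timestamp is always the 7th field) could be:
# Sketch only: newest file for $stream by its second-to-last timestamp field,
# without parsing ls output; prints the bare filename because of -printf '%f\n'.
streamFile=$(find "$streamPath" -maxdepth 1 -type f -name "*${stream}*.zip" -printf '%f\n' \
    | sort -t '_' -k 7,7 | tail -1)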

Bash script key/value pair regardless of bash version

I am writing a curl bash script to test web services. I will have file_1, which contains the URL paths:
/path/to/url/1/{dynamic_path}.xml
/path/to/url/2/list.xml?{query_param}
Since the values between {} are dynamic, I am creating a separate file which will hold the values for these params. The input would be key-value pairs, i.e.:
dynamic_path=123
query_param=shipment
By combining the two files, the input should become:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
This is the background of my problem. Now my questions:
I am doing it in a bash script, and the approach I am using is to first read the file with the parameters, parse it on '=', and store it as key/value pairs so it will be easy to replace; i.e., for each URL I will find the substring between {} and use that text as the key to fetch the value from the array.
My approach sounds okay (at least to me), BUT I just realized that
declare -A input_map is only supported in bash 4.0 and higher. Now, I am not 100% sure what the target environment for my script will be, since it could run in multiple departments.
Is there anything better you could suggest? Any other approach? Any other design?
P.S:
This is the first time I am working on a bash script.
Here's a risky way to do it, assuming the values are in a file named "values":
. values
eval "$( sed 's/^/echo "/; s/{/${/; s/$/"/' file_1 )"
Basically, stick a dollar sign in front of the braces and transform each line into an echo statement.
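For illustration (an assumption about the intermediate result, given the example file_1 above), the sed step produces the following script, which eval then executes with dynamic_path and query_param already set by sourcing "values":
echo "/path/to/url/1/${dynamic_path}.xml"
echo "/path/to/url/2/list.xml?${query_param}"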
More effort, with awk:
awk '
NR==FNR { split($0, a, /=/); v[a[1]] = a[2]; next }
(i = index($0, "{")) && (j = index($0, "}")) {
    key = substr($0, i+1, j-i-1)
    print substr($0, 1, i-1) v[key] substr($0, j+1)
}
' values file_1
There are many ways to do this. You seem to be thinking of putting all inputs in a hashmap and then iterating over that hashmap. In shell scripting it's more common and practical to process things as a stream using pipelines.
For example, your inputs could be in a csv file:
123,shipment
345,order
Then you could process this file like this:
while IFS=, read -r path param; do
    sed -e "s/{dynamic_path}/$path/" -e "s/{query_param}/$param/" file_1
done < input.csv
The output will be:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
/path/to/url/1/345.xml
/path/to/url/2/list.xml?order
But this is just an example; there are many other ways.
You should definitely start by writing a proof of concept and testing it on your deployment server. This example should work in old versions of bash, too.
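If you still want the key/value lookup itself without associative arrays, here is a minimal sketch that should also run on bash 3.x (it assumes each URL line contains exactly one {param} placeholder and that the pairs live in the "values" file above):
# Sketch only: look each key up in the values file instead of an associative array.
while IFS= read -r url; do
    key=${url#*\{}; key=${key%%\}*}                 # text between { and }
    value=$(grep "^${key}=" values | cut -d'=' -f2)
    printf '%s\n' "${url/\{${key}\}/${value}}"      # substitute the placeholder
done < file_1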

How to create a txt file whose columns are the descending sub-directories in Linux?

My data follow the structure:
../data/study_ID/FF_Number/Exam_Number/date,
where the data dir contains 176 participants' sub-directories. The ID number represents the participant's ID, and each of the following sub-directories represents some experimental number.
I want to create a txt file with one line per participant and the following columns: study_ID, FF_Number, Exam_Number and date.
However, it gets a bit more complicated, as I want to divide the participants into chunks of ~15-20 participants per chunk for the following analysis.
Any suggestions?
Cheers.
Hmm, nobody?
You should redirect the output of the "find" command (consider the -type d and -maxdepth switches) and probably parse it with sed, replacing "/" with spaces. Piping through the "cut" and "column -t" commands, and "sort" and "uniq", may also be useful; a minimal sketch follows below. Do the names, apart from FF and ID, contain spaces or special characters, e.g. related to the names of participants?
It should be possible to get the TXT with a "one-liner" and a few pipes.
You should try, and post the first results of your work on this :)
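As a starting point, a minimal sketch of that pipeline (an illustration only, not the author's exact command; it assumes a data/study_ID/FF_Number/Exam_Number/date layout under the current directory):
# Sketch only: one line per date directory, columns study_ID FF_Number Exam_Number date.
find data -mindepth 4 -maxdepth 4 -type d \
    | sed 's|^data/||; s|/| |g' \
    | sort | column -t > participants.txt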
EDIT: Alright, I created for myself a structure with several thousand directories and subdirectories, numbered by participant, exam number, etc., which looks like this (maybe it's not identical to what you have, but don't worry). Studies are numbered from 5 to 150, FF from 45 to 75, and dates from 2012_01_00 to 2012_01_30, which makes a really huge quantity of directories in total.
/Users/pwadas/bzz/data
/Users/pwadas/bzz/data/study_005
/Users/pwadas/bzz/data/study_005/05_Num
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_00
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_01
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_02
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_03
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_04
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_05
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_06
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_07
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_08
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_09
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_10
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_11
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_12
Now, I want (quoting) a "txt file with one line per participant and the following columns: study ID, FF_number, Exam_Number and date."
So I use the following one-liner:
find /Users/pwadas/bzz/data -type d | head -n 5000 |cut -d'/' -f5-7 | uniq |while read line; do echo -n "$line: " && ls -d /Users/pwadas/bzz/$line/*Exam/* | perl -0pe 's/.*2012/2012/g;s/\n/ /g' && echo ; done > out.txt
and here is the output (the first few lines of out.txt). The lines are very long, so I cut them to the first 80-90 characters:
dtpwmbp:data pwadas$ cat out.txt |cut -c1-90
data:
data/study_005:
data/study_005/05_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/06_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/07_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/08_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
dtpwmbp:data pwadas$
I hope this helps a little and that you'll be able to modify it according to your needs and patterns; that seems to be all I can do :) You should analyze the one-liner, especially the "cut" command and the perl-regex part, which removes newlines and the leading directory name from the "ls" output. This is probably far from optimal, but beautifying is not the point here, I guess :)
So, good luck :)
PS. "head" command limits output for N first lines, you'll probably want to skip out
| head .. |
part.
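The chunking part of the question isn't addressed above; one hedged option (assuming out.txt ends up holding one line per participant, as intended) is to simply split the result file into fixed-size pieces:
# Sketch only: split out.txt into chunks of ~18 lines each,
# producing chunk_00, chunk_01, ... (the -d numeric suffixes need GNU split).
split -l 18 -d out.txt chunk_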
