better way to avoid use of ls inside of variable - linux

Not sure how to title this correctly; please change it if you prefer.
Given that my code actually works, I'd like a peer review to improve its quality.
I have a folder full of .zip files. These files are streams of data (identifiable by their stream name), offloaded daily. There can be more than one file per day for a given stream, so I need to grab the latest one in time order. I can't rely on the POSIX timestamp for this, so the files carry their timestamps in the filename.
Filename example:
XX_XXYYZZ_XYZ_05_AB00C901_T001_20170808210052_20170808210631.zip
The last two fields are timestamps, and I'm interested in the second-to-last one.
The other fields are not needed (for now).
I've previously stored the stream name (in this case XYZ_05_AB00C901_T001) in the variable $stream.
I have this line of code:
match=$(ls "$streamPath"/*.zip|grep "$stream"|rev|cut -d'_' -f2|rev|sort|tail -1)
What it does is search the given path for files matching the stream, cut out the timestamp, and sort. Now that I know the last timestamp for this stream, I can ls again, this time grepping for $stream and $match together, and I'm done:
streamFile=$(ls "$streamPath"/*.zip|grep "$stream.*$match\|$match.*$stream")
Question time:
Is there a better way to achieve my goal? Probably more than one; I'd prefer a one-liner solution, though.
ShellCheck advises me that it would be better to use a for loop or a while loop instead of ls, to be able to handle unusual filenames (which I'm not facing at the moment, but who knows), but I'm not so sure about it (it seems more complicated to me).
Thanks.
O.

Thanks to the page suggested by Cyrus, I chose to go with this solution:
while IFS= read -r -d '' file; do
    echo "$file"|grep "$stream"|rev|cut -d'_' -f2|rev|sort|tail -1
done < <(find "$streamPath" -maxdepth 1 -type f -name '*.zip' -print0)
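For reference, a fuller sketch of the loop-based approach ShellCheck hints at might look like this (only a sketch, assuming bash and GNU find; $stream and $streamPath are the variables from the question):

#!/bin/bash
# Pick the newest .zip for a given stream, keyed on the second-to-last
# underscore-separated field of the filename (the start timestamp).
# $stream and $streamPath are assumed to be set earlier, as in the question.
latest=""
latestTs=""
while IFS= read -r -d '' file; do
    base=${file##*/}                                        # strip the directory part
    [[ $base == *"$stream"* ]] || continue                  # keep only this stream
    ts=$(printf '%s\n' "$base" | rev | cut -d'_' -f2 | rev) # second-to-last field
    if [[ $ts > $latestTs ]]; then                          # lexical compare works for YYYYMMDDhhmmss
        latestTs=$ts
        latest=$file
    fi
done < <(find "$streamPath" -maxdepth 1 -type f -name '*.zip' -print0)
streamFile=$latest

This keeps the same second-to-last-field logic as the original pipeline, but never parses ls output, so odd filenames are handled safely.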

Related

grouping input files with incremental names to divide a computation into smaller ones

I created a script to run a function that takes a specific file and compares it to several others. There are more than 900 files to scan for matches, incrementally named like this:
A001-abc.txt A001-efg.txt [..] A002-efg.txt A002-hjk.txt [..] A120-xwz.txt (no whitespaces)
The script looks like this:
#!/bin/bash
for f in *.txt; do
    bedtools function /mydir/myfile.txt *.txt > output.txt
done
Now, the computation is extremely intensive and can't be completed. How can I group the files according to their incremental names and perform the computation on files A001-abc, A001-efg, then go to A002-abc, A002-efg, and so on up to A120-xwz?
I was thinking that by finding a way to specify the name with an incremental variable and then grouping them, I could divide the computation in many smaller ones. Is it reasonable and doable?
Many thanks!
The problem might be that you're not using $f. Try using this line in the for-loop instead. Also, use >> so the output appends instead of overwriting the output file.
bedtools function /mydir/myfile.txt "$f" >> output.txt
But if you still want to break it up, you might try a double loop like this:
for n in $(seq --format="%03.f" 1 120); do
    for f in "A$n"*; do
        bedtools function /mydir/myfile.txt "$f" >> "${n}output.txt"
    done
done
The values of $n produced by the seq command are 001 002 .. 120.
So the inner loop loops through A001* A002* .. A120*.
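One caveat: if some prefix has no matching files, the unmatched glob "A$n"* is passed to bedtools literally. A small, hedged refinement (assuming bash) is to enable nullglob so empty groups are simply skipped:

#!/bin/bash
shopt -s nullglob    # an unmatched glob now expands to nothing instead of itself
for n in $(seq --format="%03.f" 1 120); do
    for f in "A$n"*; do
        bedtools function /mydir/myfile.txt "$f" >> "${n}output.txt"
    done
done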

Best way to identify similar text inside strings?

I have a list of phrases; actually it's an Excel file, but I can extract each single line if needed.
I need to find the lines that are very similar to each other. For example, one line can be:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)
and some lines later I can have the same line, or this one:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)
As you can see, these two lines are almost the same; not identical in this case, but about 98% similar.
The main problem is that I have to process about 45k lines, so I'm looking for a way to do this quickly, and maybe visually.
The first thing that came to mind was to compare the 1st line to the 2nd, then the 3rd, and so on until the end, then the 2nd line against the 3rd through the last, and so on, building a kind of score: for example, line 1 is 100% similar to line 42, 99% to line 522, ... 21% to line 22142, etc.
But that's only one idea, and maybe not the best one.
Maybe there's already a good program/script/online service out there; I searched but couldn't find one, so in the end I asked here.
Does anyone know a good way (if this is possible), a script, or an online service to achieve this?
One thing you can do is write a script which does the following:
Extract the data from the CSV file.
Define a regex which can capture the similarity; a Python example can be:
[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)
Or something along those lines; refer to the documentation.
The problem you have is that you are not looking for an exact match, but for a "like" (approximate) match.
This is a problem even databases have never really solved, and it results in a full table scan.
So we're unlikely to solve it.
However, I'd like to propose that you consider alternatives:
You could decide to limit the differences to specific character sets.
In the example above, you were ignoring numbers but respecting letters.
If we can assume that this rule will always hold true, then we can perform a text replace on the string.
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°) ==> ANTIBRATING SSPIRING JOINT (type _) mod. GA_ (temp.max_°)
Now, we can deal with this problem by performing an exact string comparison. This can be done by hashing. The easiest way is to feed a hashmap/hashset or a database with a hash index on the column where you will store this adjusted text.
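A minimal shell sketch of that normalize-then-compare idea (assuming GNU sed, and that the phrases have been exported to a plain-text file called phrases.txt, which is a placeholder name):

# Collapse every run of digits into "_" so near-duplicates become identical,
# then count how many input lines share each normalized form.
sed 's/[0-9]\+/_/g' phrases.txt | sort | uniq -c | sort -rn | head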
You could decide to trade time for space.
For example, you can feed the strings to a service which will build lots of different variations of indexes on your string. For example, feed elasticsearch with your data, and then perform analytic queries on it.
Fuzzy search is the key.
I found several projects and ideas, but the one I used is tre-agrep. I know it's quite old, but in this case it works for me. I created this little script to help me build a list of differences, so I can check it manually against my file:
#!/bin/bash
########## CONFIGURATIONS ##########
original_file=/path/jjj.txt
t_agrep_bin="$(command -v tre-agrep)"
destination_file=/path/destination_file.txt
distance=1
########## CONFIGURATIONS ##########
lines=$(grep -c "" "$original_file")    # total line count, for the progress message
if [[ -s "$destination_file" ]]; then
    rm -f "$destination_file"
fi
start=1
while IFS= read -r line; do
    echo "Checking line $start/$lines"
    # list the records within $distance errors of the current line (-n prefixes each with its record number)
    lista=$("$t_agrep_bin" -"$distance" -B --colour -s -n -i "$line" "$original_file")
    echo "$lista" | awk -F ':' '{print $1}' ORS=' ' >> "$destination_file"
    echo >> "$destination_file"
    start=$((start+1))
done < "$original_file"

Move non-sequential files to new directory

I have no previous programming experience. I know this question has been asked before, or the answer is out there, but for the life of me I cannot find it. I have searched Google for hours trying to figure this out. I am working on a Red Hat Linux computer and it is in bash.
I have a directory of files 0-500 in /directory/.
They are named as such,
/directory/filename_001, /directory/filename_002, and so forth.
After running my analysis for my research, I have a listofnumbers.txt (a text file, with each row being a number) containing the numbers that I am interested in. For example,
015
124
187
345
412
A) Run a command from the list of files the files from the list of numbers? Our code looks like this:
g09slurm filename_001.com filename_001.log
Is there a way to write something like:
find value (row1 of listofnumbers.txt) then g09slurm filename_row1value.com filename_row1value.log
find value (row2 of listofnumbers.txt) then g09slurm filename_row2value.com filename_row2value.log
find value (row3 of listofnumbers.txt) then g09slurm filename_row3value.com filename_row3value.log
etc etc
B) Move the selected files from the list to a new directory, so I can rename them sequentially, then run a sequential number command?
Thanks.
First, read the list of numbers into an array:
readarray -t myarray < /path/to/filename.txt
Next, we'll get all the filenames based on those numbers and move them:
cd /path/to/directory
mv -t /path/to/new_directory "${myarray[@]/#/filename_}"
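If each number actually has both a .com and a .log file, a hedged variant of that move (same placeholder paths as above) would be:

readarray -t nums < /path/to/filename.txt    # -t drops the trailing newlines
for n in "${nums[@]}"; do
    mv -t /path/to/new_directory "filename_${n}.com" "filename_${n}.log"
done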
After this... honestly, I got bored. Stack Overflow is about helping people who make a good start at a problem, and you've done zero work toward figuring this out (other than writing "I promise I tried google").
I don't even understand what
Run a command from the list of files the files from the list of numbers
means.
To rename them sequentially (once you've moved them), you'll want to do something based on this code:
for i in *; do
    # your stuff here
done
You should be able to research and figure stuff out. You might have to work through some bash tutorials; here's a reasonable starting place.
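As a starting point for part A, a sketch along these lines might work, assuming the numbers in listofnumbers.txt match the filename suffixes exactly (015, 124, ...):

#!/bin/bash
# Run g09slurm only for the numbers listed in listofnumbers.txt (one per line).
while IFS= read -r num; do
    g09slurm "filename_${num}.com" "filename_${num}.log"
done < listofnumbers.txt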

Grep Recursively for files in specific group of sub folders

I've searched both Google and Stack Overflow, but haven't been able to find an answer to this question. It might be because I'm searching the wrong terms, but hopefully someone can help!
I have a file structure similar to this:
/logs/ServiceA/Prod/2017/10/01/11/logFileForHour
/logs/ServiceA/Prod/2017/10/01/12/logFileForHour
/logs/ServiceA/Prod/2017/10/01/13/logFileForHour
/logs/ServiceB/SubService1/Prod/2017/10/01/12/logFileForHour
/logs/ServiceC/SubService1/Prod/Mirror/2017/10/01/12/logFileForHour
/logs/ServiceC/SubService1/Beta/2017/10/01/12/logFileForHour
Each hour's folder contains the aggregate of the logs from all hosts running that service. Those hourly folders are aggregated into daily folders, which are aggregated into monthly folders and so on. Then the logs are aggregated by Stage (Prod/Demo/Dev) and then further by Service/SubService.
I need a way to grep for a common identifier across all PROD services, and sub services and would like to try and do this with a single grep if at all possible. I know the hour that the request was placed in.
Ideally I'd be able to use a filepath of /logs/*/Prod/*/2017/10/01/12/* if I wanted all prod logs from all services and sub-services during the 12:00 hour of October 1st, 2017, but this only works if there is exactly one folder where each asterisk is, when in reality there could be 1 or more folders for the first asterisk and 0 or more folders for the second.
Any help you all could provide would be greatly appreciated!
You could try something like the following:
find logs -type d -name Prod | xargs grep -r SEARCHSTRING
but beware I couldn't test it...
EDIT after clarification:
find logs -type f -path '*/Prod/*' -a -path '*/2017/10/01/12/*'
which is: find all files under a path that matches Prod and also matches the date/hour path you are looking for (then add the grep)
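For example, to run the search in the same command (SEARCHSTRING being a placeholder for the common identifier):

find logs -type f -path '*/Prod/*' -a -path '*/2017/10/01/12/*' \
    -exec grep -H 'SEARCHSTRING' {} +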

What command to search for ID in .bz2 file?

I am new to Linux and I'm trying to look for an ID number within a .bz2 file. It seems like a fairly straightforward requirement; however, I cannot find the correct command anywhere online. I believe I need to use bzgrep.
I want to look for '123456' in the file Bulk9876.bz2
How would I construct this command?
You probably just need to tell grep that it's okay to parse that data as text:
bzgrep -a 123456 Bulk9876.bz2
If you're trying to view the compressed data (rather than decompressing it and searching the decompressed data), just use grep -a ….
Otherwise, it might make sense to verify that the desired string is even present in the file; bunzip2 it and grep -a the decompressed file. If that works, the problem is in your bzgrep instance (which is odd because it should be using the same decompression library as bunzip2).
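That verification could look like this (a sketch, using the filename from the question; -k keeps the original archive):

bunzip2 -k Bulk9876.bz2    # writes an uncompressed copy named Bulk9876
grep -a 123456 Bulk9876    # search the decompressed data as text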
