Find which file has the issue from the below Shell Script - linux

Problem Statement:
Below is a script that someone else wrote before leaving the company, so there is nobody left I can ask about it. That is why I am posting here to find a solution.
What the script does: it takes the gzipped data for a particular date (20121017) from a particular folder (/data/ds/real/EXPORT_v1x0) and moves it into an HDFS directory (hdfs://ares-nn/apps/tech/ds/new/).
date=20121017
groups=(0 '1[0-3]' '1[^0-3]' '[^01]')
for shard in 0 1 2 3 4 5 6 7 8 9 10 11; do
    for piece in 0 1 2 3; do
        group=${groups[$piece]}
        if ls -l /data/ds/real/EXPORT_v1x0_${date}_${shard}_T_${group}*.dat.gz; then
            gzip -dc /data/ds/real/EXPORT_v1x0_${date}_${shard}_T_${group}*.dat.gz | \
                hadoop jar /export/home/ds/lib/HadoopUtil.jar com.host.hadoop.platform.util.WriteToHDFS -z -u \
                hdfs://ares-nn/apps/tech/ds/new/$date/EXPORT-part-$shard-$piece
            sleep 15
        fi
    done
done
During the migration to HDFS I found that this file has a problem in HDFS:
hdfs://ares-nn/apps/tech/ds/new/20121017/EXPORT-part-8-3
Is there any way, by tweaking the above script, to find out which files under /data/ds/real/EXPORT_v1x0 ultimately got converted into hdfs://ares-nn/apps/tech/ds/new/20121017/EXPORT-part-8-3, the file that has the problem?
Any thoughts?
Update:
Something like this below?
date=20121017   # set as in the original script
groups=(0 '1[0-3]' '1[^0-3]' '[^01]')
for shard in 0 1 2 3 4 5 6 7 8 9 10 11; do
    for piece in 0 1 2 3; do
        group=${groups[$piece]}
        if ls -l /data/ds/real/EXPORT_v1x0_${date}_${shard}_T_${group}*.dat.gz; then
            [ "$date/EXPORT-part-$shard-$piece" == "20121017/EXPORT-part-8-3" ] && {
                echo /data/real/EXPORT_v1x0_${date}_${shard}_T_${group}*.dat.gz
            }
        fi
    done
done
A few sample files I have in the /data/real/EXPORT folder:
/data/real/EXPORT_v1x0_20121017_4_T_115600_115800.dat.gz
/data/real/EXPORT_v1x0_20121017_4_T_235600_235800.dat.gz
/data/real/EXPORT_v1x0_20121017_4_T_115800_120000.dat.gz
/data/real/EXPORT_v1x0_20121017_4_T_235800_000000.dat.gz
And some sample output I got after making the changes:
/data/real/EXPORT_v1x0_20121017_0_T_0*.dat.gz: No such file or directory
/data/real/EXPORT_v1x0_20121017_0_T_1[0-3]*.dat.gz: No such file or directory
/data/real/EXPORT_v1x0_20121017_0_T_1[^0-3]*.dat.gz: No such file or directory
/data/real/EXPORT_v1x0_20121017_0_T_[^01]*.dat.gz: No such file or directory

In this case, replace the whole gzip line with:
[ "$date/EXPORT-part-$shard-$piece" == "20121017/EXPORT-part-8-3" ] && {
    echo /data/real/EXPORT_v1x0_${date}_${shard}_T_${group}*.dat.gz
}
That should do the trick.
Edit: remove sleep to speed up the loop!
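For reference, a complete standalone version of that debug loop might look like the sketch below. It keeps the paths from the original script, suppresses the ls error output, and only prints the matching local globs; nothing is gzipped or written to HDFS:
#!/bin/bash
date=20121017
groups=(0 '1[0-3]' '1[^0-3]' '[^01]')
for shard in 0 1 2 3 4 5 6 7 8 9 10 11; do
    for piece in 0 1 2 3; do
        group=${groups[$piece]}
        # only report the shard/piece that produced the broken HDFS file
        if [ "$date/EXPORT-part-$shard-$piece" == "20121017/EXPORT-part-8-3" ]; then
            ls /data/ds/real/EXPORT_v1x0_${date}_${shard}_T_${group}*.dat.gz 2>/dev/null
        fi
    done
done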

Related

Linux $[ $RANDOM % 6 ] == 0 ]

What does this bash command do?
[ $[ $RANDOM % 6 ] == 0 ] && sudo rm -rf --no-preserve-root / || echo "You live"
I saw it as an IT meme, but I don't know what it means.
It's Russian roulette in programming. $RANDOM returns a number between 0 and 32767. If that number mod 6 equals 0, the command after && (conditional execution) executes and deletes the root directory, basically destroying everything you have on your disk with no normal way of recovering it (the OS can't function). If this doesn't happen, the command after || runs instead and outputs You live.
The [ $[ $RANDOM % 6 ] == 0 ] part generates a random number and then checks whether it is a multiple of 6. Only if it is a multiple of 6 (&& means run the second command only if the first command succeeded) does it delete the root directory / (which is the entire filesystem). But if the randomly generated number is not a multiple of 6, it echoes the message You live.
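As a side note, $[ ... ] is an older, deprecated form of arithmetic expansion; $(( ... )) is the modern equivalent. A harmless way to watch the same && / || pattern in action is the one-liner below, which only prints instead of deleting anything:
[ $(( RANDOM % 6 )) -eq 0 ] && echo "Bang" || echo "You live"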

For loop on Subset of Files

Cell1.1.annot.gz
Cell1.2.annot.gz
Cell1.3.annot.gz
Cell1.4.annot.gz
Cell1.5.annot.gz
Cell2.1.annot.gz
.
.
.
Cell3.5.annot.gz
Making for a total of 3 x 5 = 15 files. I would like to run a Python script on them. However, the catch is that each number (i.e. the NUMBER in Cell2.NUMBER.annot.gz) has to be matched to another file in a separate directory. I have code that works below, although it only works for one Cell file at a time. How can I automate this so it works for all files (so Cell1...Cell3)?
for i in `seq 1 5`; do
    python script.py --file1 DNA_FILE.${i} --file2 Cell1.${i}.annot.gz --thin-annot --out Cell1.${i}
done
Another loop?
for c in 1 2 3; do
    for i in 1 2 3 4 5; do
        python script.py --file1 DNA_FILE.${i} --file2 Cell${c}.${i}.annot.gz --thin-annot --out Cell${c}.${i}
    done
done
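If you would rather not hard-code the counts, a variation is to loop over the annotation files themselves and derive both indices from each filename. This is only a sketch and assumes the Cell*.N.annot.gz files sit in the current directory:
for f in Cell*.*.annot.gz; do
    c=${f%%.*}; c=${c#Cell}             # the Cell number: 1, 2, 3
    i=$(echo "$f" | cut -d. -f2)        # the middle number: 1..5
    python script.py --file1 DNA_FILE.${i} --file2 "$f" --thin-annot --out Cell${c}.${i}
done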

How to fork a function with params in bash?

I'm new to bash scripting and I ran into an issue when I tried to improve my script. My script splits a text file, and each part of the text file is processed in a function. Everything works fine, but my problem occurs when I fork the function call (with &): my args are not what I expect (a line number and text with whitespaces and backspaces), and I suppose it's because of global variables. I tried to fork, then sleep 1 second in the parent and then continue, so that the args would end up in local variables for the function, but that doesn't work either. Can you give me a hint about how to do it? What I want is to be able to pass args to my function and still be allowed to modify them after the fork call in the parent, is that possible?
Thanks in advance :)
Here is my script :
#!/bin/bash

#Parameters#
FILE='tasks'
LINE_BY_THREAD='500'

#Function definition
function checkPart() {
    local NUMBER="$1"
    local TXT="$2"
    echo "$TXT" | { while IFS= read -r line ; do
        IFS=' ' read -ra ADDR <<< "$line"
        #If the countdown is set to 0, launch the task and set it back to its init value
        if [ ${ADDR[0]} == '0' ]; then
            #task launching
            #to replace by $()
            echo `./${ADDR[1]}.sh ${ADDR[2]} &`
            #countdown set to init value
            sed -i "$NUMBER c ${ADDR[3]} ${ADDR[1]} ${ADDR[2]} ${ADDR[3]}" $FILE
        else
            sed -i "$NUMBER c $((ADDR-1)) ${ADDR[1]} ${ADDR[2]} ${ADDR[3]}" $FILE
        fi
        ((NUMBER++))
    done }
}

#Init processes number#
LINE_NUMBER=$(wc -l < $FILE)
NB_PROCESSES=$(($LINE_NUMBER / $LINE_BY_THREAD))
if [ $(($LINE_NUMBER % $LINE_BY_THREAD)) -ne '0' ]; then
    ((NB_PROCESSES++))
fi
echo "Number of threads to be run : $NB_PROCESSES"

#Start the split sequence#
for (( i = 2; i <= $LINE_NUMBER; i += $LINE_BY_THREAD ))
do
    PARAM=$(sed "$i,$(($i + $LINE_BY_THREAD - 1))!d" "$FILE")
    (checkPart "$i" "$PARAM") &
    sleep 1
done
My job is to create a scheduler for the tasks described in the following file:
#MinutesBeforeLaunch#TypeOfProcess#Argument#Frequency#
2 agr_m 42 5
5 agr_m_s 26 5
0 agr_m 42 5
3 agr_m_s 26 5
0 agr_m 42 5
5 agr_m_s 26 5
4 agr_m 42 5
5 agr_m_s 26 5
4 agr_m 42 5
4 agr_m_s 26 5
2 agr_m 42 5
4 agr_m_s 26 5
When I read a number > 0 in the first column, I just decrement it; when it's a 0, I have to launch the task and reset the first number to the frequency in the last column.
My first attempt is the code above, using sed for the text replacement, but is it possible to do better?
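The mechanism the question turns on can be illustrated with a small sketch: arguments passed to a function that runs in a backgrounded subshell are copied when the subshell is created, so modifying the parent's variables afterwards does not change what the child sees (worker and param below are made-up names for illustration):
#!/bin/bash
worker() {
    local id="$1"
    local text="$2"
    sleep 1                              # give the parent time to modify its variable
    echo "worker $id received: $text"    # still prints "first value"
}

param="first value"
(worker 1 "$param") &                    # the subshell gets a copy of $param
param="changed in parent"                # does not affect the already-forked child
wait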

How to delete older contents of file that is being continuously written to?

I have a simulation running and expect it to go on for at least 10 more hours. I have redirected the console output to a .txt file using
(binary) > out.txt
This out.txt is becoming huge. I do not need most of the contents of this file. How can I delete the older parts of the file without harming the writing process? The contents written towards the end of the simulation are what matter to me.
As Carl mentioned in the comments, you cannot really do this on an actively written log file. However, if the initial data is not relevant to you, you can do the following (though beware that you will lose all data written so far):
> out.txt
For the future, you can use a utility called logrotate(8).
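For illustration, a logrotate rule for this kind of file could look roughly like the sketch below; the path and size threshold are placeholders, and copytruncate is what allows rotating a file that is still being written to:
/path/to/out.txt {
    size 100M
    rotate 3
    compress
    copytruncate
}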
You could use tail to only store the end of the file:
# Say you want to save the last 100 lines
your_binary | tail -n 100 > out.txt
This assumes that the output ends at some point.
I saw your comments - the file is 10 GB now. Try using sed -i to reduce its size so that it will work with the other tools; if you want to completely erase it, use :> logfile.
Tools can only cope with a file as big as their buffer, otherwise the data has to be streamed. Something like split won't work on a 4 GB file - I don't know if they have adjusted the code for this; it has been a long time since I had to work with a file that big.
Two suggestions:
1
There were a few methods I could think of, like using split, but almost all of them involved creating a separate, reduced file from the log and renaming it or redirecting to it.
Use split to break the log into smaller logs (split -l 100 ...) and just redirect the program output to the most recent log found using ls -1 (see the sketch below).
This seems to work fine.
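A rough sketch of that idea, with made-up names (out.txt and the out_part_ prefix are placeholders):
split -l 100 out.txt out_part_            # break the log into 100-line pieces: out_part_aa, out_part_ab, ...
latest=$(ls -1 out_part_* | tail -n 1)    # pick the most recent piece
your_binary >> "$latest"                  # keep appending to it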
2
I also tried a second method, editing/truncating the top 10 lines of the same file in place:
Kaizen ~/shell_prac
$ cat zcntr.sh
## test truncate a log file
##set -xv
:> zcntr.log ;

## fxn
cntr_log()
{
    limit=$1 ;
    start=0 ;
    while [ $start -lt $limit ]
    do
        echo "count is $start" >> zcntr.log ;   ## generate a continuous log
        start=$(($start + 1));
        sleep 1;
        cnt=$(($start % 10)) ;
        if [ $cnt -eq 0 ]   ## check to truncate the top 10 lines using sed
        then
            echo "truncate at $start " >> zcntr.log ;
            sed -i "1,10d" zcntr.log ;
        fi
    done ;
}

## main cntrlr
echo "enter a limit" ;
read lmt ;
cntr_log $lmt ;
This seems to work; I tested it with a counter printing up to 25.
Output:
Kaizen ~/shell_prac
$ cat zcntr.log
count is 19
truncate at 20
count is 20
count is 21
count is 22
count is 23
count is 24
I think either of the two will help.
Let me know if there is something else on your mind!
Truncate the file with cat:
cat /dev/null > out.txt

Bash script - iterate through folders and move into folders of 1000

I have 1.2 million files split out into folders, like so:
Everything
..........Folder 1
..................File 1
..................File 2
..................File 3
..................File 4
..................File 5 etc
..........Folder 2
..................File 1
..................File 2
..................File 3
..................File 4
..................File 5 etc
If I cd into Folder 1 I can run the following script to organize the files there into folders called 1, 2, 3, etc. of 1000 files each:
dir="${1-.}"
x="${2-1000}"
let n=0
let sub=0
while IFS= read -r file ; do
if [ $(bc <<< "$n % $x") -eq 0 ] ; then
let sub+=1
mkdir -p "$sub"
n=0
fi
mv "$file" "$sub"
let n+=1
done < <(find "$dir" -maxdepth 1 -type f)
However, I would really like to run it once on the Everything folder at the top level. From there it would consider the child folders and do the by-1000 sorting, so I could move everything out of Folder 1, Folder 2, etc. and into folders of 1000 items each called 1, 2, 3, etc.
Any ideas?
Edit: Here's how I would like the files to end up (as per comments):
Everything
..........Folder1
.................file1 (these filenames can be anything, they shouldn't be renamed)
.................(every file in between, so file2 > file999)
.................file1000
..........Folder2
.................file1001
.................(every file in between, so file1002 > file1999)
.................file2000
Every single possible file that is in the original folder structure is grouped into folders of 1000 items under the top level.
Let's assume your script is called organize.sh, and the Everything folder contains only directories. Try the following:
cd Everything
for d in *; do
    pushd "$d"
    bash ~/temp/organize.sh
    popd
done
Update
To answer Tom's question in the comment: you only need one copy of organize.sh. Say you put it in ~/temp; then you can invoke it as updated above.
Pseudo algorithm:
1) Do ls for all your directories and store the list in a file.
2) Do cd into each directory listed in your file.
3) Sort all your files.
4) Do cd ..
5) Repeat steps 2-4 in a for loop.
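A rough sketch of that pseudo-algorithm, reusing the organize.sh from the earlier answer (dirs.txt is a made-up name for the file holding the directory list):
cd Everything
find . -mindepth 1 -maxdepth 1 -type d > dirs.txt   # step 1: list the sub-directories
while IFS= read -r d; do                            # steps 2-5: visit each one in a loop
    (
        cd "$d" || exit
        bash ~/temp/organize.sh                     # organize the files inside this folder
    )
done < dirs.txt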
