I have a list of files, one per line, in a file filein.txt, like so:
mikesfile.php
ericsfile.php
subdir1/johnsfile.php
subdir1/davidsfile.php
subdir1/subdir2/ashleysfile.php
subdir1/subdir2/zoesfile.php
I need my bash script to read that file line by line, md5sum the corresponding files, and then write each file name along with its md5 to a new file named fileout.txt.
For example:
e14086108b4d5d191c22b0a085694e4a - mikesfile.php
ebadb70de710217a7d4d4c9d114b8145 - ericsfile.php
b40bb5dfb23bf89b3011ff82d9cb0b0b - subdir1/johnsfile.php
d03e9b7306cb1f6c019b574437f54db0 - subdir1/davidsfile.php
f840a8d2ea7342303c807b6cb6339fd1 - subdir1/subdir2/ashleysfile.php
3560e05d5ccdad6900a5dfed1a4a8154 - subdir1/subdir2/zoesfile.php
I've been messing around with this:
while read line; do echo -n "$line" | md5sum; done < filein > fileout
But it just dumps the md5 hashes and completely omits the corresponding filenames. I've searched all over trying to remedy this, to no avail.
I'd very much appreciate your help in combining the two and properly writing them to the output file as shown. Many thanks in advance.
bash + awk solution:
while read -r fn; do
    echo "$fn" | md5sum | awk -v fn="$fn" '{ print $0, fn }'
done < filein > fileout
$ cat fileout
1db757c4f098cebf93639f00e55bc88d - mikesfile.php
f063a35599d453721ada5d0c8fcc0185 - ericsfile.php
a8c3a721d12b432c94c23d463fb5a93f - subdir1/johnsfile.php
a4aa114d977c75153aac382574229d3a - subdir1/davidsfile.php
abd77236c393266115acda48ddb4f9a0 - subdir1/subdir2/ashleysfile.php
18ba7e37f42a837d33a7fb3e56a618b5 - subdir1/subdir2/zoesfile.php
Just a straight application of md5sum and printf will do it for you, e.g.
$ while read -r name; do
    printf "%s %s %s\n" $(md5sum <<<"$name") "$name"
  done <filein.txt
Output
1db757c4f098cebf93639f00e55bc88d - mikesfile.php
f063a35599d453721ada5d0c8fcc0185 - ericsfile.php
a8c3a721d12b432c94c23d463fb5a93f - subdir1/johnsfile.php
a4aa114d977c75153aac382574229d3a - subdir1/davidsfile.php
abd77236c393266115acda48ddb4f9a0 - subdir1/subdir2/ashleysfile.php
18ba7e37f42a837d33a7fb3e56a618b5 - subdir1/subdir2/zoesfile.php
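Note that both snippets above hash the filename string itself, which is why md5sum prints the "-" stdin marker. If the goal was actually to hash the contents of each listed file, as the question's wording suggests, a minimal sketch (assuming the listed paths exist relative to the current directory) would be:
while IFS= read -r fn; do
    md5sum "$fn"        # hashes the file's contents, not the name
done < filein.txt > fileout.txt
md5sum then prints each file's path in place of the "-" marker, so no extra formatting is needed.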
I'm a beginner at bash scripting and have been writing a script to check different log files, and I'm a bit stuck here.
clientlist=/path/to/logfile/which/consists/of/client/names
#i will grep only the client name from the file which has multiple log lines
clients=$(grep --color -i 'list of client assets:' $clientlist | cut -d":" -f1 )
echo "Clients : $clients"
#For example "Clients: Apple
# Samsung
# Nokia"
#number of clients may vary from time to time
assets=("$clients".log)
echo assets: "$assets"
The code above greps the client names from the log file, and I'm trying to use each grepped client name to construct a log file name for that client.
The number of clients is indefinite and may vary from time to time.
The code I have returns the client names as one whole string:
assets: Apple
Samsung
Nokia.log
and I'm a bit unsure how to split the string and process each name one by one to get the .log asset name for each client. How can I do this?
Apple.log
Samsung.log
Nokia.log
(Apologies if I have misunderstood the task)
Using awk
If your input file (I'll call it clients.txt) is:
Clients: Apple
Samsung
Nokia
The following awk step:
awk '{print $NF".log"}' clients.txt
outputs:
Apple.log
Samsung.log
Nokia.log
(You can pipe straight into awk and omit the file name if the piped stream has the same content as the file in the example above.)
It is highly likely that a simple awk procedure can perform the entire task, beginning with the 'clientlist' you process with grep (awk has all the functionality of grep built in), but I'd need to know the structure of the original file to extract the client names.
One awk idea:
assets=( $(awk -F: '/list of client assets:/ {print $2".log"}' "${clientlist}") )
# or
mapfile -t assets < <(awk -F: '/list of client assets:/ {print $2".log"}' "${clientlist}")
Where:
-F: - define input field delimiter as :
/list of client assets:/ - for lines that contain the string list of client assets:, print the 2nd :-delimited field and append the string .log to the end
One sed idea:
assets=( $(sed 's/.*://; s/$/.log/' "${clientlist}") )
# or
mapfile -t assets < <(sed 's/.*://; s/$/.log/' "${clientlist}")
Where:
s/.*:// - strip off everything up to the :
s/$/.log/ - replace end of line with .log
Both generate:
$ typeset -p assets
declare -a assets=([0]="Apple.log" [1]="Samsung.log" [2]="Nokia.log")
$ echo "${assets[@]}"
Apple.log Samsung.log Nokia.log
$ printf "%s\n" "${assets[@]}"
Apple.log
Samsung.log
Nokia.log
$ for i in "${!assets[@]}"; do echo "assets[$i] = ${assets[$i]}"; done
assets[0] = Apple.log
assets[1] = Samsung.log
assets[2] = Nokia.log
NOTE: the alternative answers using mapfile address the issue referenced in Charles Duffy's comment (see bash pitfall #50); readarray is a synonym for mapfile.
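As a quick illustration of that pitfall, with a hypothetical client name containing a space (the unquoted assets=( $(...) ) form would split it into two elements, while mapfile keeps each line intact):
$ mapfile -t assets < <(printf '%s\n' 'Big Corp.log' 'Nokia.log')   # 'Big Corp' is a made-up name
$ typeset -p assets
declare -a assets=([0]="Big Corp.log" [1]="Nokia.log")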
I would like to read a config file that should look similar to what is shown below:
source/path:blocksize,offset,seek,count
source/path2:blocksize,offset,seek
source/path3:blocksize,offset
Where source/path, source/path2 and source/path3 are paths to some binary files, and blocksize, offset, seek and count are the respective values for the dd command.
Note that the fields may vary: some binary files may not have a seek value, or may be missing both the seek and count values for the dd command.
How should I split the above lines to compose dd commands like these:
dd if=${source/path} bs=${blocksize} seek=${seek} count=${count}
dd if=${source/path} bs=${blocksize} seek=${seek}
dd if=${source/path} bs=${blocksize}
It is OK if the above format needs to be modified to make it easier to parse, because I have run out of all the possibilities that my naive mind can think of.
Hope this helps:
$ cat <<EOF | while read line; do arr=($(sed 's/[,:]/ /g' <<< $line)); echo "source:${arr[0]} block:${arr[1]} offset:${arr[2]} seek:${arr[3]} count:${arr[4]}"; done
source/path:blocksize,offset,seek,count
source/path2:blocksize,offset,seek
source/path3:blocksize,offset
EOF
source:source/path block:blocksize offset:offset seek:seek count:count
source:source/path2 block:blocksize offset:offset seek:seek count:
source:source/path3 block:blocksize offset:offset seek: count:
General Idea:
#!/usr/bin/env bash
your_command | while read -r line; do
arr=($(sed 's/[,:]/ /g' <<< "$line"))
echo "source:${arr[0]} block:${arr[1]} offset:${arr[2]} seek:${arr[3]} count:${arr[4]}"
# Do whatever processing & validation you want here
# access from array : ${arr[0]}....${arr[n]}
#
done
If you're reading from a file, then:
#!/usr/bin/env bash
while read -r line; do
arr=($(sed 's/[,:]/ /g' <<< "$line"))
echo "source:${arr[0]} block:${arr[1]} offset:${arr[2]} seek:${arr[3]} count:${arr[4]}"
# Do whatever processing & validation you want here
# access from array : ${arr[0]}....${arr[n]}
#
done < "path/to/your-file"
I have a file containing a certain number of occurrences of a particular string, "Thermochemistry". I'm trying to write a script to grep the 500 lines above each occurrence of this string and create a new file for each, numbered accordingly. Here is what I've tried:
occurrences="$(grep -c 'Thermochemistry' $FILE)"
for (( i=1; i<=$occurrences; i++)); do
grep -B500 -m "$i" 'Thermochemistry' $FILE > newfile_"$i".tmp
done
If there are 20 occurrences of 'Thermochemistry' in the file, I wanted it to create 20 new files, called newfile_1.tmp to newfile_20.tmp, but it doesn't work.
Can anyone help?
In addition to the magic command from oguz ismail, you could use the following awk:
awk '/Thermochemistry/ { close(f); f = "newfile_" (++n) ".tmp"
        for (i = (FNR > 500 ? FNR - 500 : 1); i < FNR; ++i) print b[i%500] > f
        print > f                      # include the matching line itself
     }
     { b[FNR%500] = $0 }               # keep a rolling buffer of the last 500 lines
    ' file
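A possible alternative sketch using GNU grep's own context handling: grep -B500 prints the 500 lines above each match (plus the match) and separates the groups with a "--" line, which awk can use to start a new numbered file. Caveat: grep merges groups whose context overlaps, so when two matches are closer than 500 lines apart this produces fewer files than there are occurrences, and a literal "--" line in the data would also trigger a split.
grep -B500 'Thermochemistry' "$FILE" |
awk 'BEGIN { n = 1 }
     /^--$/ { close("newfile_" n ".tmp"); n++; next }   # group separator: start the next file
     { print > ("newfile_" n ".tmp") }'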
I've got to process a large number of txt files in a folder using bash scripting.
Each file contains millions of rows, and they are formatted like this:
File #1:
en ample_1 200
it example_3 24
ar example_5 500
fr.b example_4 570
fr.c example_2 39
en.n bample_6 10
File #2:
de example_3 4
uk.n example_5 50
de.n example_4 70
uk example_2 9
en ample_1 79
en.n bample_6 1
...
I've got to filter by "en" or "en.n", find duplicate occurrences in the second column, sum the third column, and get a sorted file like this:
en ample_1 279
en.n bample_6 11
Here is my script:
#! /bin/bash
clear
BASEPATH=<base_path>
FILES=<folder_with_files>
TEMP_UNZIPPED="tmp"
FINAL_RES="pg-1"
#iterate each file in folder and apply grep
INDEX=0
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" > log
for i in ${BASEPATH}${FILES}
do
FILENAME="${i%.*}"
if [ $INDEX = 0 ]; then
VAR=$(gunzip $i)
#-e -> multiple condition; -w exact word; -r grep recursively; -h remove file path
FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $FINAL_RES)
INDEX=1
#remove file to free space
rm $FILENAME
else
VAR=$(gunzip $i)
FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $TEMP_UNZIPPED)
cat $TEMP_UNZIPPED >> $FINAL_RES
#AWK BLOCK
#create array a indexed with page title and adding frequency parameter as value.
#eg. a['ciao']=2 -> the second time I find "ciao", I sum previous value 2 with the new. This is why i use "+=" operator
#for each element in array I print i=page_title and array content such as frequency
PARSING=$(awk '{ page_title=$1" "$2;
frequency=$3;
array[page_title]+=frequency
}END{
for (i in array){
print i,array[i] | "sort -k2,2"
}
}' $FINAL_RES)
echo "$PARSING" > $FINAL_RES
#END AWK BLOCK
rm $FILENAME
rm $TEMP_UNZIPPED
fi
done
mv $FINAL_RES $BASEPATH/06/01/
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" >> log
Everything works, but it takes a very long time to execute. Does anyone know how to get the same result in less time and with fewer lines of code?
The UNIX shell is an environment from which to manipulate files and processes and sequence calls to tools. The UNIX tool the shell calls to manipulate text is awk, so just use it:
$ awk '$1~/^en(\.n)?$/{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}' file | sort
en ample_1 279
en.n bample_6 11
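If the inputs are the gzipped files from your script, a sketch of the same idea applied across the whole folder (the path and the *.gz glob are assumptions based on your variable names):
zcat "${BASEPATH}${FILES}"/*.gz |
awk '$1~/^en(\.n)?$/{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}' |
sort > "$FINAL_RES"
This makes a single pass over all the data and never writes an uncompressed temporary file to disk.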
Your script has too many issues to comment on individually, which indicates you are a beginner at shell programming - get the books Shell Scripting Recipes by Chris Johnson and Effective Awk Programming, 4th Edition, by Arnold Robbins.
I'm trying to write a shell script that searches for text within a file and prints out the text and associated information to a separate file.
From this file containing list of gene IDs:
DDIT3 ENSG00000175197
DNMT1 ENSG00000129757
DYRK1B ENSG00000105204
I want to search for these gene IDs (ENSG*) and their RPKM1 and RPKM2 values in a gtf file:
chr16 gencodeV7 gene 88772891 88781784 0.126744 + . gene_id "ENSG00000174177.7"; transcript_ids "ENST00000453996.1,ENST00000312060.4,ENST00000378384.3,"; RPKM1 "1.40735"; RPKM2 "1.61345"; iIDR "0.003";
chr11 gencodeV7 gene 55850277 55851215 0.000000 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST00000425977.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
and print/write them to a separate output file:
Gene_ID RPKM1 RPKM2
ENSG00000108270 7.81399 8.149
ENSG00000101126 12.0082 8.55263
I've done it on the command line for each ID using:
grep -w "ENSGno" rnaseq.gtf| awk '{print $10,$13,$14,$15,$16}' > output.file
but when it comes to writing the shell script, I've tried various combinations of for, while, read and do, and changing the variables, but without success. Any ideas would be great!
You can do something like:
while read -r line
do
    var=$(echo "$line" | awk '{print $2}')
    grep -w "$var" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.file
done < geneIDs.file
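If you'd rather avoid running one grep per ID, and also want the cleaned three-column output with the header, a single-pass awk sketch is below. It assumes the column positions match the sample gtf lines above and that the .N version suffix on gene_id should be ignored when matching:
awk 'BEGIN { print "Gene_ID", "RPKM1", "RPKM2" }
     NR==FNR { ids[$2]; next }            # first file: remember the gene IDs
     {
         gsub(/[";]/, "")                 # strip quotes and semicolons
         split($10, a, ".")               # drop the .N version suffix
         if (a[1] in ids) print a[1], $14, $16
     }' geneIDs.file rnaseq.gtf > output.file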