How to read files listed in another file - Linux

This script lists the unit-*-slides.txt files from a directory into filelist.txt; then, for each entry in that list, it opens the file and writes the count of lines matching "st^" to a file. But it is not processing the files in numeric order (1, 2, 3, 4, ...); it processes them like 10, 1, 2, 3, 4, ...
How can I read them in order?
#!/bin/sh
#
outputdir=filelist
mkdir -p "$outputdir"
dest=$outputdir
cfile=filelist.txt
ofile="combine-slide.txt"
output=file-list.txt
path=/home/user/Desktop/script
ls $path/unit-*-slides.txt | sort -n -t '-' -k 2 > $dest/$cfile
echo "Generating files list..."
echo "Done"
#Combining
while IFS= read file
do
    if [ -f "$file" ]; then
        tabs=$(cat unit-*-slides.txt | grep "st^" | split -l 200)
    fi
done < "$dest/$cfile"
echo "Combining Done........!"

Try with sort -n
tabs=$(cat $( ls unit-*-slides.txt | sort -n ) | grep "st^" | split -l 200)
sort -n means numeric sort, so the output of ls is ordered by the unit number.
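For example, a plain lexicographic sort puts unit-10 before unit-2, while a numeric sort on the second '-'-separated field does not (a quick demonstration with made-up file names):
printf '%s\n' unit-1-slides.txt unit-10-slides.txt unit-2-slides.txt | sort
# unit-1-slides.txt
# unit-10-slides.txt
# unit-2-slides.txt
printf '%s\n' unit-1-slides.txt unit-10-slides.txt unit-2-slides.txt | sort -n -t '-' -k 2
# unit-1-slides.txt
# unit-2-slides.txt
# unit-10-slides.txt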

Related

Bash script to automatically email variable filename and corresponding modified datetime stamp

In short, I am trying to find duplicate files (which have been downloaded from an FTP site by another script) and email the business about all the duplicate files and their corresponding modified dates.
The file /path/test_mail.txt contains one file name per line (two filenames in this case), e.g.:
abc.xlsx
def.xlsx
In the code below, I am trying to find the modified datetime stamp for the first filename, pair it with the respective filename, and send an email; the loop then does the same for the second one.
This is using stat:
for val in '/path/test_mail.txt'; do
{ stat path/$val | grep 'Modify: ' | cut -d' ' -f2,3,4 | awk -F"." '{print $1}' ; } |
$val
done |
mail -s "Duplicate file found ${DATE}" abc@xyz.com
I also tried another way, using ls -ltr:
for val in '/path/tj_mail.txt'; do
{ ls -ltr /path/$val | cut -d' ' -f6,7,8 | find $val / -path
$val
done |
mail -s "Duplicate file found ${DATE}" abc@xyz.com
I was expecting the email body to be approximately like:
Duplicate Filename - xyz.xlsx Uploaded time - 2020-02-17 11:18:10
Duplicate Filename - abc.xlsx Uploaded time - 2020-02-17 11:18:10
The question below is optional, but it would be great if you could help me!
I am also using another script to find the duplicate filenames in the directory. It works perfectly fine. But I am wondering if I could fit it into the same code above in one single script file, so that it would be crisp and easy!
{
    DATE=`date +"%Y-%m-%d"`
    dirname=/path
    tempfile=myTempfileName
    find $dirname -type f > $tempfile
    cat $tempfile | sed 's_.*/__' | sort | uniq -d |
    while read fileName
    do
        grep "$fileName" $tempfile
    done
} | tee '/path/tj_var.txt' | awk -F"/" '{print $NF}' | tee '/path/tj_var.txt' | sort -u | tee '/path/tj_mail.txt' | mail -s "Duplicate file found ${DATE}" abc@xyz.com
This is my actual code
path = /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation
printf "%s" "$(</marketsource/scripts/tj_mail.txt)" | while IFS= read -r filename; do
mtime=$(stat -c %y "/path/$filename")
printf 'Duplicate Filename - %s Uploaded time - %s\n' "$filename" "$mtime"
done | mail -s "Duplicate file found ${DATE}" tipalli@allegisgroup.com
This is the error!
./tj_mail1.ksh: line 1: path: command not found
stat: cannot stat `/path/AirTimeActs_2020-02-08.xlsx': No such file or directory
A little more!!
My aim is to find any duplicate files. If there aren't any duplicates (the find output is empty), enter the if condition, perform the 'mv' commands, and exit the script entirely; if there are duplicate files, skip the if condition, pipe the duplicate files onward, and perform the mail and date-stamp operation.
{
    DATE=`date +"%Y-%m-%d"`
    dirname=/marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation
    tempfile=myTempfileName
    find $dirname -type f > $tempfile
    cat $tempfile | sed 's_.*/__' | sort | uniq -d |
    while read fileName
    do
        grep "$fileName" $tempfile
    done
}
if ["$fileName" == ""]; then
mv /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/*.xlsx /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/Archive
mv /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/*.csv /marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/Archive
exit 1
fi | tee '/marketsource/scripts/tj_var.txt' | awk -F"/" '{print $NF}' | tee '/marketsource/scripts/tj_var.txt' | sort -u | tee '/marketsource/scripts/tj_mail.txt'
DATE=`date +"%Y-%m-%d"`
printf "%s\n" "$(</marketsource/scripts/tj_mail.txt)" | while IFS= read -r filename; do
mtime=$(stat -c %y "/marketsource/SrcFiles/Target_Shellscript_Autodownload/Airtime_Activation/$filename")
printf 'Duplicate Filename - %s Uploaded time - %s\n\n' "$filename" "$mtime"
done | mail -s "Duplicate file found ${DATE}" ti@allegisgroup.com
I assume the file /path/test_mail.txt has been prepared by some other script (as a list of duplicate files) and the task is to add the modification time of the files listed in /path/test_mail.txt and format the output as shown in the question.
while IFS= read -r filename; do
    mtime=$(stat -c %y "/path/$filename")
    printf 'Duplicate Filename - %s Uploaded time - %s\n' "$filename" "$mtime"
done < "/path/test_mail.txt"
Instead of reading the file /path/test_mail.txt, you could also put this into a pipe, like this:
somehow_print_duplicate_file_names | while IFS= read -r filename; do
    mtime=$(stat -c %y "/path/$filename")
    printf 'Duplicate Filename - %s Uploaded time - %s\n' "$filename" "$mtime"
done | somehow_send_mail
You could add some error handling in case stat fails.
mtime=$(stat -c %y "/path/$filename" 2>/dev/null || echo "unknown (stat failed)")
or use stat's error message
mtime=$(stat -c %y "/path/$filename" 2>&1)
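As for the optional part, one way to fold the duplicate detection and the mail step into a single script is sketched below. This is only a sketch under the question's assumptions (the /path directory, the tj_mail.txt list name, and the abc@xyz.com address are placeholders taken from the question):
#!/bin/sh
DATE=$(date +"%Y-%m-%d")
dirname=/path
tempfile=$(mktemp)

# collect all files, then keep only the basenames that occur more than once
find "$dirname" -type f > "$tempfile"
sed 's_.*/__' "$tempfile" | sort | uniq -d > /path/tj_mail.txt

if [ -s /path/tj_mail.txt ]; then
    # duplicates found: mail one line per duplicate with its modification time
    while IFS= read -r filename; do
        mtime=$(stat -c %y "$dirname/$filename" 2>&1)
        printf 'Duplicate Filename - %s Uploaded time - %s\n' "$filename" "$mtime"
    done < /path/tj_mail.txt | mail -s "Duplicate file found ${DATE}" abc@xyz.com
else
    # no duplicates: archive everything and stop
    mv "$dirname"/*.xlsx "$dirname"/Archive
    mv "$dirname"/*.csv "$dirname"/Archive
fi
rm -f "$tempfile"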

Fastest way to compare hundreds of thousands of files, and create output results file in bash

I have the following:
- A values file, values.txt
- A directory structure: ./dataset/label/author/files.txt
- Tens of thousands of files.txt's
- A file called targets.txt, which contains the location of every files.txt
Example targets.txt
./dataset/tallperson/Jabba/awesome.txt
./dataset/fatperson/Detox/toxic.txt
I have a file called values.txt, which contains hundreds of thousands of lines of values. These values are things like "aef", "; i", "jfk", etc.: random 3-character lines.
I also have tens of thousands of files, each of which also contains hundreds to thousands of lines, each likewise a random 3-character line.
values.txt was created from the values of every files.txt, so there is no value in any file.txt that is not contained in values.txt. values.txt contains NO repeating values.
Example:
./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
The above is just example data. Each file will contain hundreds of lines. And values.txt will contain hundreds of thousands of lines.
My goal here is to make one file, where each line represents a file. Each line will contain N values, where each value corresponds to a line in values.txt, and the values are separated by commas. Each value is simply the number of times the file contains the value on that line of values.txt.
The result should look something like this. With line 1 being file1.txt and line 2 being file2.txt.
Result.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,
Now, the last thing: after getting this result I would like to add a label. The label is equivalent to the Nth parent directory of the file. For this example, let's say the 2nd parent directory, so the label would be "weirdperson" or "awesomeperson". As a result, the new Results.txt file would look like this:
Results.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
I would like a way to accomplish all of this, but I need it to be fast as I am working with a very large scale dataset.
This is my current code, but it's too slow. The bottleneck is line 2.
Script. Each file located at "./dataset/label/author/file.java"
1 while IFS= read file_name; do
2 cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" "$file_name" | xargs printf "%d," >> Results.txt;
3 label=$(echo "$file_name" | cut -d '/' -f 3);
4 printf "$label\n" >> Results.txt;
5 done < targets.txt
------------
To REPLICATE this problem, do the following:
mkdir -p dataset/{label1,label2}
touch file1.txt; chmod 777 file1.txt
touch file2.txt; chmod 777 file2.txt
echo "Enter anything here" > file1.txt
echo "Enter something here too" > file2.txt
mv file1.txt ./dataset/label1
mv file2.txt ./dataset/label2
find ./dataset/ -type f -name "*.txt" | while IFS= read -r file_name; do sed -e "s/.\{3\}/&\n/g" "$file_name" | sort -u > "$(dirname "$file_name")/modified-$(basename "$file_name")"; done
find ./dataset/ -type f -name "modified-*.txt" | xargs -d '\n' -I {} echo {} >> targets.txt
xargs cat < targets.txt | sort -u > values.txt
With the above UNCHANGED, you should get a values.txt similar to the one below. If there is any line with fewer or more than 3 characters for some reason, please delete it.
any
e
Ent
er
eth
he
her
ing
ng
re
som
thi
too
You should also get a targets.txt file:
./dataset/label2/modified-file2.txt
./dataset/label1/modified-file1.txt
From here, the goal is to check every file in targets.txt, count how many times it contains each value in values.txt, and output the results with the label to Results.txt.
The following script works for this example, but I need it to be far faster for large-scale operations.
while IFS= read file_name; do
    cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d," >> Results.txt;
    label=$(echo "$file_name" | cut -d '/' -f 3);
    printf "$label\n" >> Results.txt;
done < targets.txt
Here's another example
Example 2:
./dataset/weirdperson/Crooked/file1.txt
LOL
LOL
HAHA
./dataset/awesomeperson/Mild/file2.txt
LOL
LOL
LOL
values.txt
LOL
HAHA
Result.txt
2,1,weirdperson
3,0,awesomeperson
Here's a solution in Python, using its ordered dictionary datatype.
import os
from collections import OrderedDict

# read samples from values.txt into an OrderedDict:
# each dict key is a line from the file
# (including the trailing newline, but that doesn't matter),
# each dict value is 0
with open('values.txt', 'r') as f:
    samplecount0 = OrderedDict((sample, 0) for sample in f.readlines())

# get list of filenames from targets.txt
with open('targets.txt', 'r') as f:
    targets = [t.rstrip('\n') for t in f.readlines()]

# for each target:
#   read its lines of samples,
#   increment the corresponding count in samplecount,
#   print out samplecount in a single line separated by commas;
#   each line also gets the 2nd-to-last directory component of the target's pathname
for target in targets:
    with open(target, 'r') as f:
        # copy samplecount0 to samplecount so we don't have to read values.txt again
        samplecount = samplecount0.copy()
        # for each sample in the target file, increment the samplecount dict entry
        for tsample in f.readlines():
            samplecount[tsample] += 1
    output = ','.join(str(v) for v in samplecount.values())
    output += ',' + os.path.basename(os.path.dirname(os.path.dirname(target)))
    print(output)
Output:
$ python3 doit.py
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Try this:
<targets.txt xargs -n1 -P4 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
The -P4 lets you run 4 of the jobs from targets.txt in parallel. The short awk script reads the lines of the target file into an array, then prints 1 or 0, each followed by a comma, for every line of values.txt, depending on whether that line occurred in the target. Then sed is used to append the 3rd component of the folder path to the end of the line. The sed line looks strange because I used the unprintable character $'\x01' as the separator for the s command.
Tested with:
mkdir -p ./dataset/weirdperson/Crooked
cat <<EOF >./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
EOF
mkdir -p ./dataset/awesomeperson/Mild/
cat <<EOF >./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
EOF
cat <<EOF >values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
EOF
cat <<EOF >targets.txt
./dataset/weirdperson/Crooked/file1.txt
./dataset/awesomeperson/Mild/file2.txt
EOF
measure_start() {
declare -g ttic_start
echo "==> Test $* <=="
ttic_start=$(date +%s.%N)
}
measure_end() {
local end
end=$(date +%s.%N)
local start
start="$ttic_start"
ttic_runtime=$(python -c "print(${end} - ${start})")
echo "Runtime: $ttic_runtime"
echo
}
measure_start original
while IFS= read file_name; do
cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d,"
label=$(echo "$file_name" | cut -d '/' -f 3);
printf "$label\n"
done < targets.txt
measure_end
measure_start first try with bash
nl -w1 values.txt | sort -k2.2 > values_sorted.txt
< targets.txt xargs -n1 -P0 bash -c "
sort -t$'\t' \"\$1\" |
join -t$'\t' -12 -21 -eEMPTY -a1 -o1.1,2.1 values_sorted.txt - |
sort -s -n -k1.1 |
sed 's/.*\tEMPTY/0/;t;s/.*/1/' |
tr '\n' ',' |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
measure_start second try with awk
<targets.txt xargs -n1 -P0 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
Outputs:
==> Test original <==
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Runtime: 0.133769512177
==> Test first try with bash <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0322473049164
==> Test second try with awk <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0180222988129
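Note that the awk version prints 0/1 membership flags rather than the per-file counts shown in Example 2. If you need actual counts, a count-based variant of the same idea might look like this (a sketch with the same structure as above, not benchmarked):
<targets.txt xargs -n1 -P0 bash -c "
awk 'NR==FNR{c[\$0]++;next} {printf \"%d,\", c[\$0]+0}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --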

Bash substring from position not printing

I am using the format ${string:start:length} to extract the file name from wget's .listing file, line by line.
The format for the file is something I think we are all familiar with:
04-30-13 01:41AM 7033614 some_archive.zip
04-29-13 08:13PM <DIR> DIRECTORY NAME 1
04-29-13 05:41PM <DIR> DIRECTORY NAME 2
All file names start at pos:40, so setting :start to 39, with no :length, should (and does) return the file name for each line:
#!/bin/bash
cat .listing | while read line; do
    file="${line:39}"
    echo $file
done
Correctly returns:
some_archive.zip
DIRECTORY NAME 1
DIRECTORY NAME 2
However, if I get any more creative, it breaks:
#!/bin/bash
cat .listing | while read line; do
    file="${line:39}"
    dir=$(echo $line | egrep -o '<DIR>' | head -n1)
    if [ $dir ]; then
        echo "the file $file is a $dir"
    fi
done
Returns:
$ ./test.sh
is a <DIR>ECTORY NAME 1
is a <DIR>ECTORY NAME 2
What gives? I lose "the file " and the rest of the text looks like it prints on top of "the file DIRECTORY NAME 1" from pos:0.
It's weird, what's it on account of?
The answer, as I am learning more and more with Linux as I progress, is non-printing control characters.
Adding a pipe through egrep to keep only printable characters solved the problem:
#!/bin/bash
cat .listing | while read line; do
    file=$(echo ${line:39} | egrep -o '[[:print:]]+' | head -n1)
    dir=$(echo $line | egrep -o '<DIR>' | head -n1)
    if [ $dir ]; then
        echo "the file $file is a $dir"
    fi
done
Correctly returns:
$ ./test.sh
the file DIRECTORY NAME 1 is a <DIR>
the file DIRECTORY NAME 2 is a <DIR>
I wish there were a better way to visualize these control characters, but what the above does is basically take the string segment, pull out the first run of printable characters, and assign it to the variable.
I assume there is a control character at the end of the line that returns the cursor to the beginning of the line, causing the rest of the echo to be printed there, overwriting the previous characters.
Odd.
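It turns out there is a way to see them: cat -A (GNU coreutils) or od -c makes control characters visible:
cat -A .listing        # shows ^M for each carriage return and $ at end of line
od -c .listing | head  # shows \r explicitly
Here it would reveal a \r (carriage return) before each line feed, which is exactly what sends the cursor back to column 0.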
You can remove the \r control characters from the whole file by adding the tr command at the start of your pipeline:
#!/bin/bash
cat .listing | tr -d '\015' | while read line; do
    file="${line:39}"
    dir=$(echo $line | egrep -o '<DIR>' | head -n1)
    if [ $dir ]; then
        echo "the file $file is a $dir"
    fi
done
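Alternatively, if it is installed on your system, dos2unix performs the same CRLF cleanup in place:
dos2unix .listing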

In Linux, how can I merge the output of two console commands into a text file?

I'm obtaining the first line and the last 10,000 lines of a CSV as follows:
head workrace.csv -n 1
tail workrace.csv -n 10000
How can I merge the output into a single text file? I can redirect the above commands into two separate text files and then concatenate the files. Is there a way to do this without needing intermediary text files?
You can either run both commands in a subshell:
( head workrace.csv -n 1 ; tail workrace.csv -n 10000 ) > result.txt
or, you can use the >> redirection operator to add contents to a file:
head workrace.csv -n 1 > result.txt
tail workrace.csv -n 10000 >> result.txt
Some other options not mentioned by choroba:
F=workrace.csv
{ head -n 1 $F; tail -n 10000 $F; } > result.txt # no subshell
awk 'NR==1 || NR>k-10000' k="$( wc -l < $F )" $F > result.txt
exec > result.txt # truncate result.txt and direct output of remaining commands to it
head -n 1 $F
tail -n 10000 $F
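Another bash-specific option is process substitution, which likewise avoids intermediary files:
cat <(head -n 1 $F) <(tail -n 10000 $F) > result.txt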

File name printed twice when using wc

To print the number of lines in all ".txt" files of the current folder, I am using the following script:
for f in *.txt; do
    l="$(wc -l "$f")"
    echo "$f" has "$l" lines
done
But in the output I am getting:
lol.txt has 2 lol.txt lines
Why is lol.txt printed twice (especially after the 2)? I guessed some sort of stream flush was required, but I don't know how to achieve that in this case. So what changes should I make in the script to get the output as:
lol.txt has 2 lines
You can remove the filename with cut:
for f in *.txt; do
    l="$(wc -l "$f" | cut -f1 -d' ')"
    echo "$f" has "$l" lines
done
The filename is printed twice because wc -l "$f" prints the filename after the number of lines, and your echo prints "$f" as well. Try changing it to cat "$f" | wc -l.
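That is, something like:
for f in *.txt; do
    l=$(cat "$f" | wc -l)   # wc reads stdin here, so it prints no filename
    echo "$f has $l lines"
done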
wc prints the filename, so you could just write the script as:
ls *.txt | while read f; do wc -l "$f"; done
or, if you really want the verbose output, try
ls *.txt | while read f; do wc -l "$f" | awk '{print $2, "has", $1, "lines"}'; done
There is a trick here: get wc to read stdin and it won't print a file name:
for f in *.txt; do
    l=$(wc -l < "$f")
    echo "$f" has "$l" lines
done
