How to empty a number of files matched by a wildcard before appending new data to them? - python-3.x

I have a set of .txt files named my_file_1.txt, my_file_2.txt, ..., my_file_n.txt, where n is a finite integer. As my Python code runs (in a directory with path ~/simulation/some_code), it appends data to these files using the following for loop:
for realization in np.arange(1, n+1):
    # Identifying the file path
    some_name = 'output/{}/Info/{}/{}/parameter_{:.3f}/my_file_{}.txt'.format(size, name, status, value, realization)
    # do some stuff
    with open(some_name, "a") as filename:
        print('{}'.format(some_list), file=filename)
However, these files are not empty to begin with and need to be emptied. To do so, I run the following line ahead of time (in the home directory ~/, which is two levels up from the directory of the above code) to make sure the files are empty before being appended to:
os.system('> output/{}/Info/{}/{}/parameter_{:.3f}/my_file_*.txt'.format(size, name, status, value))
I expected the * symbol to act as a wildcard and empty all matching text files, but instead the files keep accumulating data on top of the previous runs' contents. Am I using the * wildcard incorrectly? Is this problem fixable without changing the paths in my code?

Your understanding of the wildcard is correct. The mistake is with the redirection: a redirection (>) takes exactly one target, so the wildcard is not expanded into multiple output files (depending on the shell, you will either get an "ambiguous redirect" error or a single literal file named my_file_*.txt). You can use the program tee to duplicate stdout to multiple files like this:
echo -n | tee *
(echo -n echoes nothing, so each file is truncated to zero length.)
The pipe | passes the stdout of echo -n to the stdin of tee, and the wildcard expands to all files in the directory, as if you had written:
echo -n | tee my_file_1.txt my_file_2.txt ... my_file_n.txt
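Applied to the question's code, a minimal sketch (untested; it reuses the same format fields as the original os.system call) would be:
os.system('echo -n | tee output/{}/Info/{}/{}/parameter_{:.3f}/my_file_*.txt'.format(size, name, status, value))
Here the shell expands the wildcard into the argument list of tee, so every matching file is truncated to empty in one command.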

Related

Is it possible to display a file's contents and delete that file in the same command?

I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
For cases where:
• it is possible to control the name of the temporary text file, and
• the file is not used by other code,
it may be possible to pass "/dev/stdout" as the name of the output file.
Regarding portability, see the Stack Exchange question "how portable ... /dev/stdout".
POSIX 7 says they are extensions (Base Definitions, Section 2.1.1 Requirements):
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example, /dev/stdin, /dev/stdout, and /dev/stderr)
Using the mandatorily supported /dev/tty instead would force output to the "current" terminal, making it impossible to pipe the output of the whole command into a different program (or a log file), or to use the program when no terminal is connected (cron jobs or other automation tools).
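For example, a sketch with the AWS CLI (my-function is a placeholder name; this assumes the output file is passed as the CLI's positional outfile argument):
aws lambda invoke --function-name my-function /dev/stdout
Note that the CLI also prints its invocation metadata to stdout, so the payload and the metadata may interleave.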
No, you cannot easily remove the lines of a file while displaying them. It would be highly inefficient, as it would require removing characters from the beginning of the file each time you read a line. Current filesystems are pretty good at truncating files at the end, but not at the beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
    head -n 1 output.json
    sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each removal of the first line with sed -i 1d copies the whole content of the file except that first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by doing something like this:
while [ -s output.json ]
do
    line=$(head -n 1 output.json)
    printf -- '%s\n' "$line"
    # collapse the first line (its length plus the newline) off the front of the file
    fallocate -c -o 0 -l $((${#line}+1)) output.json
done
But this does not account for varying newline conventions (namely DOS-formatted newlines), and fallocate does not always work on xfs, among other issues.
Since you are trying to consume a file as it is being created, without leaving a trace of its existence on disk, you are essentially asking for pipe functionality. In my opinion, you should look into how your output.json file is produced; hopefully you can pipe it into a script of your own.
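For instance, a minimal sketch of that approach using a named pipe (produce_output is a hypothetical stand-in for whatever currently writes output.json):
mkfifo /tmp/output.fifo                 # a named pipe instead of a regular file
produce_output > /tmp/output.fifo &     # hypothetical producer, writing in the background
cat /tmp/output.fifo                    # display the data as it arrives; nothing accumulates on disk
rm /tmp/output.fifo                     # remove the pipe itself when done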

grep empty output file

I made a shell script whose purpose is to find files that don't contain a particular string, then display the first line of each that isn't empty or otherwise useless. My script works well in the console, but for some reason when I try to redirect the output to a .txt file, the file comes out empty.
Here's my script:
#!/bin/bash
# takes user input.
echo "Input substance:"
read substance
echo "Listing media without $substance:"
cd media
# finds names of files that don't feature the given substance, then puts them inside an array.
searchresult=($(grep -L "$substance" *))
# iterates over the array and prints the first line of each file - it contains both the number and the medium name.
# however, some files start with "Microorganisms", and the actual number and name appear after several empty lines;
# the script checks for that occurrence and prints the first line that doesn't match these criteria.
for i in "${searchresult[@]}"
do
    grep -m 1 -v "Microorganisms\|^$" "$i"
done >> output.txt
I've tried moving the >> output.txt to right after the grep line inside the loop, switching >> to >, adding 2>&1, and using tee. No go.
I'm honestly feeling utterly stuck as to what the issue could be. I'm sure there's something I'm missing, but I'm nowhere near good enough with this to notice. I would very much appreciate any help.
EDIT: Added files to better illustrate what I'm working with. Sample inputs I tried: Glucose, Yeast extract, Agar. Link to files [140kB] - the folder was unzipped beforehand.
The script was given full permissions to execute. I don't think the output is being overwritten, because even if I don't iterate and just run a single line of the loop, the file is empty.

Iterate through files in a directory, create output files, linux

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.
(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:
Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)
Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)
Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result into a new file that has the same name as the initial file, but with ".txt" at the end.
Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.
for sequence in ./sequences/{.,}*;
do
    echo "$sequence";
    blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
    cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#" > ${sequence}_top_hits.txt
done
When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".
If I can provide any further information to clarify anything, please let me know.
If you're only interested in *.fa files I would limit your input to only those matching files like this:
for sequence in sequences/*.fa;
do
I can propose the following improvements:
for fasta_file in ./sequences/*.fa # ";" is not necessary if you already have a new line for your "do"
do
    # ${variable%something} is the part of $variable
    # before the string "something"
    # basename path/to/file is the name of the file
    # without the full path
    # $(some command) allows you to use the result of the command as a string
    # Combining the above, we can form a string based on our fasta file
    # This string can be useful to name stuff in a clean manner later
    sequence_name=$(basename "${fasta_file%.fa}")
    echo "${sequence_name}"
    # Create a directory for the results for this sequence
    # -p option avoids a failure in case the directory already exists
    mkdir -p "${sequence_name}"
    # Define the name of the file for the results
    # (including our previously created directory in its path)
    blast_results="${sequence_name}/${sequence_name}_blast.txt"
    blastp -query "${fasta_file}" -db database.faa \
        -out "${blast_results}" \
        -evalue 1e-10 -outfmt 7
    # Define a file name for the top hits
    top_hits="${sequence_name}/${sequence_name}_top_hits.txt"
    # alternatively, using "%":
    #top_hits=${blast_results%_blast.txt}_top_hits.txt
    # No need to cat: awk can take a file as argument
    awk '/hits found/{getline;print}' "${blast_results}" \
        | grep -v "#" > "${top_hits}"
done
I made more intermediate variables, with (hopefully) meaningful names.
I used \ to escape line ends and allow putting commands in several lines.
I hope this improves code readability.
I haven't tested. There may be typos.
You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders, you need to create those directories somewhere, using
mkdir 'folder_name'
and then point your output options at those folders, something like this:
'command' -o /path/to/output/folder
To test this script out, run each line one by one and make sure each works by itself before combining them.
One last thing: be careful with your use of semicolons. It should look something like this:
for filename in *.fa; do 'command'; done
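Putting those pieces together, a minimal sketch (results/ is a hypothetical output folder; the blastp flags are taken from the question):
for filename in sequences/*.fa; do
    name=$(basename "$filename" .fa)   # strip the directory and the .fa suffix
    mkdir -p "results/$name"           # one output folder per input file
    blastp -query "$filename" -db database.faa \
        -out "results/$name/$name.txt" -evalue 1e-10 -outfmt 7
done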

How to use sed command to delete lines without backup file?

I have a large file, 130 GB in size.
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
So I need to reduce the file size by deleting the lines which start with "Nov 2". I gave the following command:
sed -i '/Nov 2/d' syslog.log
The file is so large that I can't edit it using the vim editor either.
When I trigger the sed command, it creates a backup file as well, but I don't have much space left in root. Please give an alternate solution to delete particular lines from this file without using extra space on the server.
It does not create a real backup file. sed is a stream editor: when applied to a file with option -i, it streams that file through the sed process, writes the output to a new (temporary) file, and when everything is done, renames the new file to the original name.
(There are options to create backup files as well, but you didn't give them, so I won't mention that further.)
In your case you have a very large file and don't want to create any copy, however temporary. For this you need to open the file for reading and writing at the same time; then your filtering process can overwrite the original in place. Afterwards, you will have to truncate the file at the point where the writing ended.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, lets say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This opens the file for reading and writing as stdout (the 1<>x redirection, which does not truncate), and for reading as stdin (<x). The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This will effectively overwrite the beginning of the original file.
You will not know in advance how long the output is, so after the new output ends, the remaining original contents of the file will appear:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will perform the step above and additionally print out the number of bytes written to the terminal; in this example the output will be 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i. e. without interaction, you can use this:
length=$( (grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c )
truncate -s "$length" x
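Applied to the original syslog.log problem, a sketch following the same pattern (untested, especially at the 130 GB scale) would be:
length=$( (grep -v 'Nov 2' <syslog.log | tee /dev/stderr 1<>syslog.log) |& wc -c )
truncate -s "$length" syslog.log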
I cannot guarantee that this will work for files >2 GB or >4 GB on your machine; depending on your operating system (32-bit?) and the versions of the installed tools, you might run into large-file issues. I'd perform tests with large files first (>4 GB, as this is typically a limit for many things), then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort during the running of the process (power failure, signal caught, etc.) will leave the file in an undefined state. But re-running the command again after such a mishap will in most cases produce the correct output; some lines might be doubled, but not more than a single line should be corrupted then.
The output must be smaller than the input, of course, otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there will be missing (or truncated at the start).

How do you format output string in bash script for input by another script?

I need to unzip a bunch of student assignment (jar) files so that I can use a script to submit the contents to the Moss (Stanford) plagiarism detection server. I did the same thing in Java, which was trivial, but I'm trying to re-implement it as a bash script.
I am trying to do the following:
1. Get a list of student names (each student has a directory).
2. In each student directory, sub-directories exist numbered from 1 to the latest submission. I need to get the directory with the highest number.
3. Inside each of those submission directories is a jar file that I need. I copy each jar into a temp directory with the same name as the student and unzip it.
4. I need that temp directory listing formatted as a string in the form
/tempDir/studentName1/*.languageExt /tempDir/studentName2/*.languageExt
The student directory has the basic structure:
Student_Root_Directory:
    Student1
    Student2

Student1
    Sub-Directories: 1 2 3 4 5
    1: student1.jar
    2: student1.jar
    ...

Student2
    Sub-Directories: 1 2 3
    1: student2.jar
    ...
To do the first 3 steps above I did:
#!/bin/bash
# Extract all jar files into a temp directory called /home/moss/tempJarFiles/studentName
# $1 is the command line argument that contains the path to the institution submission dir.
# $2 is the language extension: .c, .cpp, .java, .py
students=`ls $1`
student_dir=$1
languageExt=$2
mossDir="/home/moss"
tempDir="/home/moss/tempJarStorage"

for student in $students
do
    latestSubmissionDir=`ls -t $student_dir/$student | head -1`
    for jarDir in $latestSubmissionDir
    do
        mkdir $tempDir/$student
        cp $student_dir/$student/$jarDir/*.jar $tempDir/$student
        unzip -d $tempDir/$student/ -o -j $tempDir/$student/$student.jar *.$languageExt
        rm $tempDir/$student/$student.jar
    done
done
...which results in a number of student directories being created in a temp directory that contains only the unzipped contents for the student submissions.
I need the ls output of the new temp directories formatted as a string that contains:
/tempDir/studentName1/*.languageExt /tempDir/studentName2/*.languageExt
I have tried variations on
find "$tempDir" -iname "*.$languageExt" -printf "%p/*.$languageExt"
(both with -iname and without), but I either get output that contains extra directory information, such as $tempDir/*.languageExt (when I just need the subdirectories, $tempDir/$studentName/*.languageExt), or output where the path of every source file is listed individually, such as:
$tempDir/$studentName/studentNameA.java
$tempDir/$studentName/studentNameB.java
when I only need
$tempDir/$studentName/*.java
I think this should be really easy and I'm just over thinking it. Any hints for improving the script also appreciated.
Here's a revised version of the script that may work:
#!/bin/bash
# Extract all jar files into a temp directory called /home/moss/tempJarFiles/studentName
# $1 is the command line argument that contains the path to the institution submission dir.
# $2 is the language extension: c, cpp, java, py
students_dir=$1
languageExt=$2

studentPathsT=( "$students_dir"/*/ )

mossDir='/home/moss'
tempDir='/home/moss/tempJarStorage'

for studentPathT in "${studentPathsT[@]}"; do
    student=$(basename "$studentPathT")
    mkdir "$tempDir/$student"
    submissionDirsT=( "$studentPathT"*/ )
    latestSubmissionDirT=${submissionDirsT[${#submissionDirsT[@]}-1]}
    cp "$latestSubmissionDirT"*.jar "$tempDir/$student/"
    unzip -d "$tempDir/$student/" -o -j "$tempDir/$student/*.jar" "*.$languageExt"
    rm "$tempDir/$student"/*.jar
done

# Note that at this point `"$tempDir"/*/*.$languageExt` would expand
# to all extracted submission files, across all students.

# Finally, output each student's extracted files as an unexpanded glob à la
# /{tempDir}/{studentName1}/*.{languageExt}
for pT in "$tempDir"/*/; do
    echo "$pT*.$languageExt"
    # Note: If there is a chance that your filenames contain
    # embedded newlines (rare in practice), using `echo` won't work properly,
    # as @Charles Duffy points out.
    # If that is a concern, use
    #     printf '%s\0' "$pT*.$languageExt"
    # and process the output with a utility that can process NUL characters
    # as separators, such as `xargs -0`.
done
It avoids using ls and only uses pathname expansion and array variables, so as to properly deal with paths that contain embedded spaces and other shell metacharacters.
The suffix ...T in variable names indicates that a particular path or array of paths is /-terminated, i.e., that it ends in a /.
The assumption is that the numbered subdirectories do not go beyond 9, as the implicit lexical sorting of pathname expansion is relied upon; if the numbers go higher, explicit numerical sorting must be applied.
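For instance, a sketch of such explicit numerical sorting (assuming GNU basename with its -a option and purely numeric subdirectory names):
latest=$(basename -a "$studentPathT"*/ | sort -n | tail -n 1)   # highest-numbered submission
latestSubmissionDirT="$studentPathT$latest/"                    # re-append the trailing /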
Note that the globs (pathname patterns) passed to unzip are intentionally double-quoted, as they should be interpreted by unzip, not the shell.
Note that, based on your original code, I've assumed that $languageExt does NOT start with . (e.g., cpp rather than .cpp), despite what your comment says.
