Bash for loop not writing to file - linux

I often work like this:
for skra in `ls *txt` ; do paste foo.csv <(cut -f 5 $skra) > foo.csv; done
to loop through a directory by using 'ls'.
Now I don't understand why this command does not add a column to foo.csv on every iteration.
What is happening under the hood? It seems like foo.csv is not saved on every iteration.
The output I get is only field 5 from the last file, not even the original foo.csv column that I get if I just run paste foo.csv bar.txt
EDIT:
All files are tab delimited
foo.csv is just one column in the beginning
example.txt as seen in vim with set list:
(101,6352)(11174,51391)(10000,60000)^INC_044048.1^I35000^I6253^I0.038250$
(668,7819)(23384,69939)(20000,70000)^INC_044048.1^I45000^I7153^I0.034164$
(2279,8111)(32691,73588)(30000,80000)^INC_044048.1^I55000^I5834^I0.031908$
Here is a python script that does what I want:
import pandas
rammi = []
with open('window.list') as f:
    for line in f:
        nafn = line.strip()
        df = pandas.read_csv(nafn, header=None, names=[nafn], sep='\t', usecols=[4])
        rammi.append(df)
frame = pandas.concat(rammi, axis=1)
frame.to_csv('rammi.allra', sep='\t', encoding='utf-8')
It pastes column 4 (the fifth field) from all files into one (initially I wanted to retain one original column, but it was not necessary). The question was really about why bash does not update the output file in the for loop.

As already noted in the comments, opening foo.csv for output will truncate it in most shells before paste even gets to read it. (Even if that were not the case, reopening the file and running cut and paste on every iteration looks quite inefficient.)
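For reference, a minimal sketch of the usual workaround if you want to keep the paste/cut loop: write each iteration's result to a temporary file and move it back over foo.csv (foo.tmp is an arbitrary name here, and the ls call is replaced by a plain glob):
for skra in *.txt; do
    paste foo.csv <(cut -f 5 "$skra") > foo.tmp && mv foo.tmp foo.csv
done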
If you don’t mind keeping all the data in memory at one point in time, a simple AWK or Bash script can do this type of processing without any further processes such as cut or paste.
awk -F'\t' ' { lines[FNR] = lines[FNR] "\t" $5 }
END { for (l in lines) print substr(lines[l], 2) }' \
*.txt > foo.csv
(The output should not be called .csv, but I’m sticking with the naming from the question nonetheless.)
Actually, one doesn’t really need awk for this, Bash will do:
#!/bin/bash
lines=()
for file in *.txt; do
    declare -i i=0
    while IFS=$'\t' read -ra line; do
        lines[i++]+=$'\t'"${line[4]}"
    done < "$file"
done
printf '%s\n' "${lines[@]/#?}" > foo.csv
(As a side note, "${lines[@]:1}" would remove the first line (the first array element), not the first (\t) character of each line. (This particular expansion syntax works differently for strings (scalars) and arrays in Bash.) Hence "${lines[@]/#?}" (another way to express the removal of the first character), which does get applied to each array element.)
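A quick way to see the difference, using a throwaway array where a plain x stands in for the leading tab:
$ arr=(xa xb xc)
$ printf '%s\n' "${arr[@]:1}"    # drops the first element
xb
xc
$ printf '%s\n' "${arr[@]/#?}"   # drops the first character of each element
a
b
c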

Related

Linux - Delete all lines from a given line number

I am trying to delete a file's contents from a supplied line number using sed. The problem is that sed isn't accepting the variable I supply to it
line_num=$(grep -n "debited" file.csv | sed -n '2 s/:.*//p') && sed -i.bak "$line_num,$d" file.csv
The idea is to delete all lines from a file after & including the second occurrence of the pattern.
I'm not stubborn about sed; awk & perl would do too.
It seems like you want to delete the rest of the file once a pattern (debited) shows up for the second time, including that line.
Then you can truncate it, using tell for the length of what has been read up to that line:
perl -e'while (<>) {
    if ( ($cnt += /debited/) == 2 ) { truncate $ARGV, $len; exit }
    $len = tell;
}' file
Here the $ARGV variable has the "current" file (when reading from <>). Feel free to introduce a variable with the pattern instead of the literal (debited), based on your context.
This can be made to look far nicer in a little script but it seems that a command-line program ("one-liner") is needed in the question.
I always suggest ed for editing files over trying to use sed to do it; a program intended from the beginning to work with a file instead of a stream of lines just works better for most tasks.
The idea is to delete all lines from a file after & including the second occurence[sic] of the pattern
Example:
$ cat demo.txt
a
b
c
debited 12
d
e
debited 14
f
g
h
$ printf "%s\n" '/debited/;//,$d' w | ed -s demo.txt
$ cat demo.txt
a
b
c
debited 12
d
e
The ed command /pattern/;//,$d first sets the current line cursor to the first one that matches the basic regular expression pattern, then moves it to the next match of the pattern and deletes everything from there to the end of the file. Then w writes the changed file back to disk.
You're doing lots of unnecessary steps; this will do what you want:
$ awk '/debited/{c++} c==2{exit}1' file
It deletes the second occurrence of the pattern and everything after it.
To replace the original file (and create a backup):
$ awk ... file > t && mv -b --suffix=.bak t file
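For example, against the demo.txt shown in the answer above, this prints everything before the second match:
$ awk '/debited/{c++} c==2{exit}1' demo.txt
a
b
c
debited 12
d
e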

Creating 3 column TAB file using name of files in directory

I have over 100 files in a directory with format xxx_1_sequence.fastq.gz and xxx_2_sequence.fastq.gz
The goal is to create a TAB file with 3 columns in this format:
xxx ---> xxx_1_sequence.fastq.gz ---> xxx_2_sequence.fastq.gz
where ---> is a tab.
I was thinking of creating a for loop or maybe using string manipulation in order to achieve this. My knowledge is rudimentary at this stage, so any help would be much appreciated.
Would you please try the following:
shopt -s extglob                    # enable extended pattern matching
suffix="sequence.fastq.gz"
for f in !(*_2_"$suffix"); do       # skip the *_2_* files
    f=${f%_1_"$suffix"}             # strip the tail, keeping just the xxx prefix
    if [[ -f ${f}_1_$suffix && -f ${f}_2_$suffix ]]; then
        # check the existence of both files just in case
        printf "%s\t%s\t%s\n" "$f" "${f}_1_$suffix" "${f}_2_$suffix"
    fi
done
If your files are in a directory called files:
paste -d '\t' \
<(printf "%s\n" files/*_1_sequence.fastq.gz | sort) \
<(printf "%s\n" files/*_2_sequence.fastq.gz | sort) \
| sed 's/\(.*\)_1_sequence.fastq.gz/\1\t\1_1_sequence.fastq.gz/' \
> out.tsv
Explanation:
printf "%s\n" will print every argument in a new line. So:
printf "%s\n" files/*_1_sequence.fastq.gz | sort
prints a sorted list of the first type of files (the second column in your output). And of course it's symmetrical with *_2_sequence.fastq.gz (the third column).
(We probably don't need the sort part, but it helps clarify the intention.)
The syntax <(some shell command) runs some shell command, puts its output into a temporary input file, and passes that file as an argument. You can see the temporary file like so:
$ echo <(echo a) <(echo b)
/dev/fd/63 /dev/fd/62
So we are passing 2 (temporary) files to paste. If each of those files has N lines, then paste outputs N lines, where line number K is the concatenation of line K of each of the files, in order.
For example, if line 4 of the first file is hello and line 4 of the second file is world, paste will have hello\tworld as line 4 of the output. But instead of trusting the default, we're setting the delimiter to TAB explicitly with -d '\t'.
That gives us the last 2 columns of our tab-separated-values file, but the first column is the * part of *_1_sequence.fastq.gz, which is where sed comes in.
We tell sed to replace \(.*\)_1_sequence.fastq.gz with \1\t\1_1_sequence.fastq.gz. .* will match anything, and \(some-pattern\) tells sed to remember the text that matched the pattern.
The first parentheses in sed's regex can be read back into the replacement pattern as \1, which is why we have \1_1_sequence.fastq.gz in the replacement pattern.
But now we can also use \1 to create the first column of our tsv, which is why we have \1\t.
Thank you for the help guys - I was thrown into a coding position a week ago with no prior experience and have been struggling.
I ended up with this:
printf "%s\n" *_1_sequence.fastq.gz | sort | sed 's/\(.*\)_1_sequence.fastq.gz/\1\t\1_1_sequence.fastq.gz\t\1_2_sequence.fastq.gz/' > NULLARBORformat.tab
and it does the job perfectly!

How can I copy parts of a text file and paste them in a new one

I have a text file containing the HTML code of different websites, like this textfile:
I want to copy the source code one at a time and put it in a different text file, because I want to compare it with another text file containing the same source code in order to find out if the website has been updated. Each time I copy the next source code to the new file, the old one should be deleted, so basically the new text file must contain only one source code at a time.
I have been able to copy the source code of the first page only, but I don't know how to read the file from where I left off in order to copy the next source code.
input="./Desktop/sourcecode0.txt"
while read -r var
do
    if [ "$var" != "</html>" ]
    then
        echo "$var" >> "./Desktop/htmlcode.txt"
        continue
    elif [ "$var" == "</html>" ]
    then
        echo "$var" >> "./Desktop/htmlcode.txt"
        break
    fi
done < "$input"
I would rather recommend using sed (the stream editor) for this; the above you can do with:
sed '/<\/html>/q' sample.html
sed '/<\/html>/q' input.html >> htmlcode.txt
What the above does: sed by default prints all lines, and on the regexp <\/html> the q command prints that line and quits.
Could you provide an example of what exactly you need for "copy the next source code"?
If I got you right, you want to split sourcecode0.txt into separate files, where each file will contain one <html></html> block.
For this task you can use
split -p '<html>' ~/Desktop/test.txt htmlcode_
That will create files with names htmlcode_aa, htmlcode_ab, htmlcode_ac, and so on; the number of files depends on the number of <html></html> blocks.
If you want, you can later add .txt to each file by calling
find ~/Desktop/htmlcode_a* | xargs -I '{}' mv {} {}.txt
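As a side note, split -p is a BSD/macOS option; if your split does not have it, GNU csplit can do a similar pattern-based split. A rough sketch with the same input file (it keeps the htmlcode_ prefix but produces numeric suffixes like htmlcode_00):
csplit -s -z -f htmlcode_ ~/Desktop/test.txt '/<html>/' '{*}'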

Bash: Read in file, edit line, output to new file

I am new to linux and new to scripting. I am working in a linux environment using bash. I need to do the following things:
1. read a txt file line by line
2. delete the first line
3. remove the middle part of each line after the first
4. copy the changes to a new txt file
Each line after the first has three sections, the first always ends in .pdf and the third always begins with R0 but the middle section has no consistency.
Example of 2 lines in the file:
R01234567_High Transcript_01234567.pdf High School Transcript R01234567
R01891023_Application_01891023127.pdf Application R01891023
Here is what I have so far. I'm just reading the file, printing it to screen and copying it to another file.
#! /bin/bash
cd /usr/local/bin;
#echo "list of files:";
#ls;
for index in *.txt;
do
    echo "file: ${index}";
    echo "reading..."
    exec<${index}
    value=0
    while read line
    do
        #value='expr ${value} +1';
        echo ${line};
    done
    echo "read done for ${index}";
    cp ${index} /usr/local/bin/test2;
    echo "file ${index} moved to test2";
done
So my question is, how can I delete the middle bit of each line, after .pdf but before the R0...?
Using sed:
sed 's/^\(.*\.pdf\).*\(R0.*\)$/\1 \2/g' file.txt
This will remove everything between .pdf and R0 and replace it with a single space.
Result for your example:
R01234567_High Transcript_01234567.pdf R01234567
R01891023_Application_01891023127.pdf R01891023
The Hard, Unreliable Way
It's a bit verbose, and much less efficient than what would make sense if we knew that the fields were separated by tab literals, but the following loop does this processing in pure native bash with no external tools:
shopt -s extglob
while IFS= read -r line; do
    [[ $line = *".pdf"*R0* ]] || continue   # ignore lines that don't fit our format
    filename=${line%%.pdf*}.pdf
    id=R0${line##*R0}
    printf '%s\t%s\n' "$filename" "$id"
done
${line%%.pdf*} returns everything before the first .pdf in the line; ${line%%.pdf*}.pdf then appends .pdf to that content.
Similarly, ${line##*R0} expands to everything after the last R0; R0${line##*R0} thus expands to the final field starting with R0 (presuming that that's the only instance of that string in the file).
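A quick sanity check of those two expansions against one of the sample lines (assigned here with $'...' so the tabs are literal):
line=$'R01234567_High Transcript_01234567.pdf\tHigh School Transcript\tR01234567'
echo "${line%%.pdf*}.pdf"   # R01234567_High Transcript_01234567.pdf
echo "R0${line##*R0}"       # R01234567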
The Easy Way (Using Tab Delimiters)
If cat -t file (on MacOS) or cat -A file (on Linux) shows ^I sequences between the fields (but not within the fields), use the following instead:
while IFS=$'\t' read -r filename title id; do
    printf '%s\t%s\n' "$filename" "$id"
done
This reads the three tab separated fields into variables named filename, title and id, and emits the filename and id fields.
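A minimal usage sketch under that assumption, with file.txt and newfile.txt as placeholder names:
while IFS=$'\t' read -r filename title id; do
    printf '%s\t%s\n' "$filename" "$id"
done < file.txt > newfile.txt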
Updated answer assuming tab delim
Since there is a tab delimiter, this is a cinch for awk. Borrowing from my originally deleted answer and @geek1011's deleted answer:
awk -F"\t" '{print $1, $NF}' infile.txt
Here awk splits each record in your file by tab, then prints the first field $1 and the last field $NF where NF is the built in awk variable for the record's Number of Fields; by prepending a dollar sign, it says "The value of the last field in the record".
Original answer assuming space delimiter
Leaving this here in case someone has space delimited nonsense like I originally assumed.
You can use awk instead of using bash to read through the file:
awk 'NR>1{firstRec=""; for(i=1; $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt
awk reads files line by line and processes each record it comes across. Fields are delimited automatically by white space. The first field is $1, the second is $2 and so on. awk has built in variables; here we use NF which is the Number of Fields contained in the record, and NR which is the record number currently being processed.
This script does the following:
If the record number is greater than 1 (not the header) then
Loop through each field (separated by white space here) until we find a field that has "pdf" in it ($i!~/pdf/). Store everything we find up until that field in a variable called firstRec separated by a space (firstRec=firstRec" "$i).
print out the firstRec, then print out whatever field we stopped iterating on (the one that contains "pdf") which is $i, and finally print out the last field in the record, which is $NF (print firstRec,$i,$NF)
You can direct this to another file:
awk 'NR>1{firstRec=""; for(i=1; $i!~/pdf/; ++i) firstRec=firstRec" "$i} NR>1{print firstRec,$i,$NF}' yourfile.txt > outfile.txt
sed may be a cleaner way to go here since, if your pdf filename has more than one consecutive space, awk's default whitespace splitting will lose those multiple spaces.
You can use sed on each line like that:
line="R01234567_High Transcript_01234567.pdf High School Transcript R01234567"
echo "$line" | sed 's/\.pdf.*R0/\.pdf R0/'
# output
R01234567_High Transcript_01234567.pdf R01234567
This replaces anything between .pdf and R0 with a single space.
It doesn't deal with some edge cases, but it is simple and clear.

How can I remove the last character of a file in unix?

Say I have some arbitrary multi-line text file:
sometext
moretext
lastline
How can I remove only the last character (the e, not the newline or null) of the file without making the text file invalid?
A simpler approach (outputs to stdout, doesn't update the input file):
sed '$ s/.$//' somefile
$ is a Sed address that matches the last input line only, thus causing the following function call (s/.$//) to be executed on the last line only.
s/.$// replaces the last character on the (in this case last) line with an empty string; i.e., effectively removes the last char. (before the newline) on the line.
. matches any character on the line, and following it with $ anchors the match to the end of the line; note how the use of $ in this regular expression is conceptually related, but technically distinct from the previous use of $ as a Sed address.
Example with stdin input (assumes Bash, Ksh, or Zsh):
$ sed '$ s/.$//' <<< $'line one\nline two'
line one
line tw
To update the input file too (do not use if the input file is a symlink):
sed -i '$ s/.$//' somefile
Note:
On macOS, you'd have to use -i '' instead of just -i; for an overview of the pitfalls associated with -i, see the bottom half of this answer.
If you need to process very large input files and/or performance / disk usage are a concern and you're using GNU utilities (Linux), see ImHere's helpful answer.
truncate
truncate -s-1 file
Removes one (-1) character from the end of the same file. Exactly as a >> will append to the same file.
The problem with this approach is that it doesn't retain a trailing newline if it existed.
The solution is:
if [ -n "$(tail -c1 file)" ]   # if the file does not end with a trailing newline
then
    truncate -s-1 file         # remove one char, as the question requests
else
    truncate -s-2 file         # remove the last two characters
    echo "" >> file            # add the trailing newline back
fi
This works because tail takes the last byte (not char).
It takes almost no time even with big files.
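To see what that check does, here is a small sketch with two throwaway files (the names are arbitrary; printf is used so the trailing newline is explicit):
printf 'abc\n' > with_nl
printf 'abc'   > without_nl
[ -n "$(tail -c1 with_nl)" ]    && echo "no newline at end" || echo "newline at end"   # prints: newline at end
[ -n "$(tail -c1 without_nl)" ] && echo "no newline at end" || echo "newline at end"   # prints: no newline at end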
Why not sed
The problem with a sed solution like sed '$ s/.$//' file is that it reads the whole file first (taking a long time with large files), then you need a temporary file (of the same size as the original):
sed '$ s/.$//' file > tempfile
rm file; mv tempfile file
And then move the tempfile to replace the file.
Here's another using ex, which I find not as cryptic as the sed solution:
printf '%s\n' '$' 's/.$//' wq | ex somefile
The $ goes to the last line, the s deletes the last character, and wq is the well known (to vi users) write+quit.
After a whole bunch of playing around with different strategies (and avoiding sed -i or perl), the best way I found to do this was with:
sed '$! { P; D; }; s/.$//' somefile
If the goal is to remove the last character in the last line, this awk should do:
awk '{a[NR]=$0} END {for (i=1;i<NR;i++) print a[i];sub(/.$/,"",a[NR]);print a[NR]}' file
sometext
moretext
lastlin
It stores all the data in an array, then prints it out, changing the last line.
Just a remark: sed -i will temporarily remove the file (it writes the result to a temporary file and renames it over the original).
So if you are tailing the file, you'll get a "No such file or directory" warning until you reissue the tail command.
EDITED ANSWER
I created a script and put your text inside a file on my Desktop. This test file is saved as "old_file.txt":
sometext
moretext
lastline
Afterwards I wrote a small script to take the old file and eliminate the last character in the last line
#!/bin/bash
no_of_new_line_characters=`wc '/root/Desktop/old_file.txt'|cut -d ' ' -f2`
let "no_of_lines=no_of_new_line_characters+1"
sed -n 1,"$no_of_new_line_characters"p '/root/Desktop/old_file.txt' > '/root/Desktop/my_new_file'
sed -n "$no_of_lines","$no_of_lines"p '/root/Desktop/old_file.txt'|sed 's/.$//g' >> '/root/Desktop/my_new_file'
Opening the new file I created showed the output as follows:
sometext
moretext
lastlin
I apologize for my previous answer (wasn't reading carefully)
sed 's/.$//' filename | tee newFilename
This should do your job.
A couple perl solutions, for comparison/reference:
(echo 1a; echo 2b) | perl -e '$_=join("",<>); s/.$//; print'
(echo 1a; echo 2b) | perl -e 'while(<>){ if(eof) {s/.$//}; print }'
I find the first read-whole-file-into-memory approach can be generally quite useful (less so for this particular problem). You can now do regex's which span multiple lines, for example to combine every 3 lines of a certain format into 1 summary line.
For this problem, truncate would be faster and the sed version is shorter to type. Note that truncate requires a file to operate on, not a stream. Normally I find sed to lack the power of perl and I much prefer the extended-regex / perl-regex syntax. But this problem has a nice sed solution.
