For my project I am handling large data files. When these data files come in they are "uncleaned", and I need to clean them before I can calculate the required functions from them. In each file the first 9 lines are text with information such as the time and the number of atoms, while the next 10000 lines are trajectory data; this pattern repeats until a certain time.
Now I have written code that cleans the text out of it given by:
homedir=$(pwd)   # remember the starting directory
for ex in 0 #5
do
    dirname="ex-$ex"
    cd "$dirname"
    dirname2="Tq-0.25-N10000"
    cd "$dirname2"
    for i in $(seq 1 1 100)
    do
        dirname3="tr-$i"
        cd "$dirname3"
        mv traj-passive-afterquench.atom traj-afterquench
        sed -i "1,9d" traj-afterquench
        awk '{if((NR-1) % 10009 <= 9999){print $0}}' traj-afterquench > test
        cd ..   # leave tr-$i
    done
    cd ..   # leave Tq-0.25-N10000
    cd ..   # leave ex-$ex
done
But now I want to create another file that keeps only the time values. These are located on lines 2+10009*i, where i runs over the timesteps until the end of the file. How would I write code that removes every line except the ones given by this formula?
If you have GNU sed:
sed '2~10009!d' file
should do the job.
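If your sed is not GNU sed, the same selection can be done portably in awk; a minimal sketch based on the formula above (the output name times-only is just an example):
# keep only lines 2, 10011, 20020, ... i.e. lines where (NR - 2) is a multiple of 10009
awk '(NR - 2) % 10009 == 0' file > times-only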
I often work like this:
for skra in `ls *txt` ; do paste foo.csv <(cut -f 5 $skra) > foo.csv; done
to loop through a directory by using ls.
Now I don't understand why this command does not add a column to foo.csv on every iteration.
What is happening under the hood? It seems like foo.csv is not saved on every iteration.
The output I get is field 5 from the last file only, not even the original foo.csv that I get if I just run paste foo.csv bar.txt.
EDIT:
All files are tab delimited
foo.csv is just one column in the beginning
example.txt as seen in vim with set list:
(101,6352)(11174,51391)(10000,60000)^INC_044048.1^I35000^I6253^I0.038250$
(668,7819)(23384,69939)(20000,70000)^INC_044048.1^I45000^I7153^I0.034164$
(2279,8111)(32691,73588)(30000,80000)^INC_044048.1^I55000^I5834^I0.031908$
Here is a python script that does what I want:
import pandas
rammi=[]
with open('window.list') as f:
    for line in f:
        nafn = line.strip()
        df = pandas.read_csv(nafn, header=None, names=[nafn], sep='\t', usecols=[4])
        rammi.append(df)
frame = pandas.concat(rammi, axis=1)
frame.to_csv('rammi.allra', sep='\t', encoding='utf-8')
It pastes the fifth column (usecols=[4], 0-indexed) from all files into one file (initially I wanted to retain one original column, but it was not necessary). The question was really about why bash does not update foo.csv inside the for loop.
As already noted in the comments, opening foo.csv for output will truncate it in most shells. (Even if that was not the case, opening the file and running cut and paste repeatedly looks quite inefficient.)
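If you would rather keep the loop from the question, one common workaround is to write to a scratch file and move it into place on each iteration; a minimal sketch (foo.tmp is an arbitrary temporary name):
for skra in *.txt; do
    paste foo.csv <(cut -f 5 "$skra") > foo.tmp && mv foo.tmp foo.csv
done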
If you don’t mind keeping all the data in memory at one point in time, a simple AWK or Bash script can do this type of processing without any further processes such as cut or paste.
awk -F'\t' ' { lines[FNR] = lines[FNR] "\t" $5 }
END { for (l in lines) print substr(lines[l], 2) }' \
*.txt > foo.csv
(The output should not be called .csv, but I’m sticking with the naming from the question nonetheless.)
Actually, one doesn’t really need awk for this, Bash will do:
#!/bin/bash
lines=()
for file in *.txt; do
    declare -i i=0
    while IFS=$'\t' read -ra line; do
        lines[i++]+=$'\t'"${line[4]}"
    done < "$file"
done
printf '%s\n' "${lines[@]/#?}" > foo.csv
(As a side note, "${lines[@]:1}" would remove the first line (array element), not the first (\t) character of each line; this particular expansion syntax works differently for strings (scalars) and arrays in Bash. Hence "${lines[@]/#?}", another way to express the removal of the first character, which does get applied to each array element.)
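A tiny illustration of the difference between the two expansions, using a throwaway array made up for the demo:
a=($'\tx' $'\ty' $'\tz')
printf '%s\n' "${a[@]:1}"    # slices the array: drops the first element, the tabs stay
printf '%s\n' "${a[@]/#?}"   # strips the first character of every element: prints x, y, z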
I have many text files in a directory, like 1.txt 2.txt 3.txt 4.txt ... 2000.txt, and I want to paste them to make one large file.
To this end I did something like
paste *.txt > largefile.txt
but the above command does not read the .txt files in numerical order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks, and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
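For example, with files named 1.txt, 2.txt and 10.txt, the numeric sort hands paste the names in the intended order (a small illustration assuming those file names):
$ printf "%s\n" *.txt | sort -n
1.txt
2.txt
10.txt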
When dealing with many files, you may hit the open-file limit, ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with, for example, ulimit -n 99999.
Without raising the soft limit, use a temporary file that accumulates the results of each "round" of ulimit -n files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
paste accumulator.txt "$#" > accumulator.txt.sav;
mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
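Putting the pieces together (the ... above stands for the sorted file list from the first command), the whole thing might look like this sketch:
touch accumulator.txt
printf "%s\n" *.txt | sort -n |
    xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
        paste accumulator.txt "$@" > accumulator.txt.sav
        mv accumulator.txt.sav accumulator.txt
    ' _
cat accumulator.txt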
Instead of using the wildcard * to enumerate all the files in a directory: if your file names follow a sequential pattern, you can list all the files in order explicitly and concatenate them into one large file. The output order of the * expansion can differ between environments and does not necessarily work the way you expect.
Below is a simple example
$ for i in `seq 20`; do echo $i > $i.txt; done
# create 20 test files 1.txt, 2.txt, ..., 20.txt, each containing its own number
$ cat {1..20}.txt
# show the contents of all files in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them into a large file named 1_20.txt
In bash, as in any other shell, glob expansions are sorted lexicographically rather than numerically. With numbered files this sadly means orderings like 1.txt, 10.txt, 11.txt, 2.txt (and in some locales even 11.txt before 1.txt), because the names are compared character by character instead of as numbers.
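A quick way to see the misordering, using a few throwaway files in a scratch directory:
$ touch 1.txt 2.txt 10.txt
$ echo *.txt
1.txt 10.txt 2.txt
(the exact order can vary with the locale, but 10.txt sorts before 2.txt either way)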
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%05d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that it is possible that a file is missing (e.g. 1234.txt). So you can do
shopt -s extglob nullglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one occurrence of pattern, and with nullglob a pattern that matches no file disappears from the command line, so missing files are skipped while the order is kept.
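For illustration, brace expansion itself always produces the names in numeric order, regardless of which files exist (a small made-up range):
$ echo {8..11}.txt
8.txt 9.txt 10.txt 11.txt
$ echo {0..20..5}.txt
0.txt 5.txt 10.txt 15.txt 20.txt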
I am trying to create a file called depths that has the name of the sample, the gene, and the number of times that gene appears in the sample. The code below is what I have currently, but the output just has the file names. Example file name: ERR034597.MTCYB.sam
I want the file to have ERR034597 MTCYB 327, for example.
for i in genes/${i}.sam
filename=$(basename $i)
n_rows=$(cat $i | wc -l)
echo $filename $n_rows > depths
Here
for i in genes/${i}.sam
you're using the variable i before it has been assigned, so this won't work. What you probably want to do is
for i in genes/*.sam; do
    filename=$(basename "$i")
    n_rows=$(wc -l < "$i")
    echo "$filename" $n_rows > depths
done
And just another note. It's good practice to avoid unnecessary calls to cat and always quote the variables holding filenames.
If I understand what you are attempting, then you need a few more steps to isolate the first part of the filename (e.g. ERR034597) and the gene (e.g. MTCYB) before writing the information to depths. You also need to consider whether you are replacing the contents of depths on each iteration (using >) or appending to depths with >>.
Since your tag is [Linux], all we can presume is that you have a POSIX shell, not necessarily an advanced shell like bash. To remove the .sam extension from the filename and then separate it into the first part and the gene before obtaining the line count, you can do something similar to the following:
#!/bin/sh
:> depths # truncate depths (optional - if required)
for i in genes/*.sam; do # loop over all .sam files
filename="$(basename "$i")" # remove path from name
filename="${filename%.sam}" # trim .sam extension from name
gene="${filename##*.}" # trim to last '.' save as gene
filename="${filename%.$gene}" # remove gene from end of name
n_rows=$(wc -l < "$i") # get number of lines in file
echo "$filename $gene $n_rows" >> depths # append values to depths
done
Which would result in depths containing lines similar to:
ERR034597 MTCYB 92
(where the test file contained 92 lines)
Look things over and let me know if you have further questions.
I have two files, data.txt and results.txt. Assuming there are 5 lines in data.txt, I want to copy all of these lines and paste them into results.txt starting at line number 4.
Here is a sample below:
Data.txt file:
stack
ping
dns
ip
remote
Results.txt file:
# here are some text
# please do not edit these lines
# blah blah..
this is the 4th line that data should go on.
I've tried sed with various combinations but couldn't make it work; I'm not sure it fits this purpose at all.
sed -n '4p' /path/to/file/data.txt > /path/to/file/results.txt
The above command copies line 4 only, which isn't what I'm trying to achieve. As I said above, I need to copy all lines from data.txt and paste them into results.txt, but it has to start from line 4 without modifying or overwriting the first 3 lines.
Any help is greatly appreciated.
EDIT:
I want to write the copied data starting from line number 4 of results.txt. So, I want to leave the first 3 lines unmodified and overwrite the rest of the file with the data copied from data.txt.
Here's a way that works well from cron. Less chance of losing data or corrupting the file:
# preserve first lines of results
head -3 results.txt > results.TMP
# append new data
cat data.txt >> results.TMP
# rename output file atomically in case of system crash
mv results.TMP results.txt
You can use process substitution to give cat a FIFO that it can read from:
cat <(head -3 results.txt) data.txt > results.txt
head -n 3 /path/to/file/results.txt > /path/to/file/results.tmp
cat /path/to/file/data.txt >> /path/to/file/results.tmp
mv /path/to/file/results.tmp /path/to/file/results.txt
If you can use awk:
awk 'NR!=FNR || NR<4' results.txt data.txt
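This prints the merged result to standard output, so to actually update the file you would redirect to a temporary file and move it into place; a sketch (results.new is just a scratch name):
awk 'NR!=FNR || NR<4' results.txt data.txt > results.new && mv results.new results.txt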
I have a file called flw.py and would like to write a bash script that will replace some text in the file (take out the last two lines and add two new lines). I apologize if this seems like a stupid question. A thorough explanation would be appreciated since I am still learning to script. Thanks!
head -n -2 flw.py > tmp # (1)
echo "your first new line here..." >> tmp # (2)
echo "your second new line here...." >> tmp #
mv tmp flw.py # (3)
Explanation:
head normally prints out the first ten lines of a file. The -n argument can change the number of lines printed out. So if you wanted to print out the first 15 lines you would use head -n 15. If you give negative numbers to head it means the opposite: print out all lines but the last N lines. Which happens to be what you want: head -n -2
Then we redirect the output of our head command to a temporary file named tmp. > does the redirecting magic here. tmp now contains everything of flw.py but the last two lines.
Next we add the two new lines by using the echo command. We append the output of the echo "your first new line here..." to our tmp file. >> appends to an existing file, whereas > will overwrite an existing file.
We do the same thing for the second line we want to append.
Last, we move the tmp file to flw.py and the job is done.
You can use a single sed command to get the result you expect:
sed -n 'N;$!P;$!D;a\line1\nline2' flw.py
Example:
cat flw.py
1
2
3
4
5
sed -n 'N;$!P;$!D;a\line1\nline2' flw.py
Output :
1
2
3
line1
line2
Note:
Use the -i option to update the file in place.
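So, assuming GNU sed, the in-place version of the same command would look something like this (-n is still needed so that only the P-printed lines and the appended text end up in the file):
sed -n -i 'N;$!P;$!D;a\line1\nline2' flw.py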