How can I copy parts of a text file and paste them in a new one - linux

I have a text file containing the HTML code of different websites, one after another, like this one: textfile
I want to copy the source code of one site at a time and put it in a different text file, because I want to compare it with another text file containing the same source code, in order to find out whether the website has been updated. Each time I copy the next source code to the new file, the old one should be deleted, so the new text file must contain only one source code at a time.
I have been able to copy the source code of the first page only, but I don't know how to read the file from where I left off in order to copy the next source code.
input="./Desktop/sourcecode0.txt"
while read -r var
do
if [ "$var" != "</html>" ]
then
echo "$var" >> "./Desktop/htmlcode.txt"
continue
elif [ "$var" == "</html>" ]
then
echo "$var" >> "./Desktop/htmlcode.txt"
break
fi
done < "$input"

I would recommend using sed (the stream editor) for this instead; what your loop above does can be written as:
sed '/<\/html>/q' sample.html
sed '/<\/html>/q' input.html >> htmlcode.txt
What this does: sed by default prints every line, and on the first line matching the regexp <\/html>, the q command prints that line and quits.
Could you provide an example of what exactly you mean by "copy the next source code"?
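If the goal is to pull out the Nth <html>…</html> block rather than the first, a counter-based awk one-liner is one option (a minimal sketch, assuming the blocks sit back to back in sourcecode0.txt as in the question):
# print only the 2nd block: b counts how many <html> tags have been seen
awk -v n=2 '/<html>/{ b++ } b == n' ./Desktop/sourcecode0.txt > ./Desktop/htmlcode.txt
Using > rather than >> overwrites htmlcode.txt on each run, which matches the requirement that the file hold only one source code at a time.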

If I got you right, you want to split sourcecode0.txt into several files, each containing one <html></html> block.
For this task you can use
split -p '<html>' ~/Desktop/test.txt htmlcode_
That will create files named htmlcode_aa, htmlcode_ab, htmlcode_ac, and so on; the number of files depends on the number of <html></html> blocks. (Note that the -p option exists in BSD/macOS split but not in GNU split.)
If you want, you can add .txt to each file afterwards by calling
find ~/Desktop/htmlcode_a* | xargs -I '{}' mv {} {}.txt
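On GNU systems, csplit can do the same job; a rough equivalent of the command above (a sketch, assuming the same input file) is:
# split at every <html> line; -z drops empty pieces, -f sets the output prefix
csplit -z -f htmlcode_ ~/Desktop/test.txt '/<html>/' '{*}'
This produces htmlcode_00, htmlcode_01, and so on, one file per block.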

Related

Bash for loop not writing to file

I often work like this:
for skra in `ls *txt` ; do paste foo.csv <(cut -f 5 $skra) > foo.csv; done
looping through a directory by using ls.
Now I don't understand why this command does not add a column to foo.csv in every iteration.
What is happening under the hood? It seems foo.csv is not saved on every iteration.
The output I get is field 5 from the last file, not even the original foo.csv, which I do get if I only run paste foo.csv bar.txt.
EDIT:
All files are tab-delimited.
foo.csv has just one column at the beginning.
example.txt as seen in vim with set list:
(101,6352)(11174,51391)(10000,60000)^INC_044048.1^I35000^I6253^I0.038250$
(668,7819)(23384,69939)(20000,70000)^INC_044048.1^I45000^I7153^I0.034164$
(2279,8111)(32691,73588)(30000,80000)^INC_044048.1^I55000^I5834^I0.031908$
Here is a python script that does what I want:
import pandas

rammi = []
with open('window.list') as f:
    for line in f:
        nafn = line.strip()
        df = pandas.read_csv(nafn, header=None, names=[nafn], sep='\t', usecols=[4])
        rammi.append(df)
frame = pandas.concat(rammi, axis=1)
frame.to_csv('rammi.allra', sep='\t', encoding='utf-8')
It pastes column 5 (usecols=[4], zero-based) from all files into one (initially I wanted to retain one original column, but it was not necessary). The question was about bash not updating the output file within the for loop.
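For reference, the loop as written can be rescued by pasting into a temporary file and moving it back after each iteration, so the redirection never truncates the file that is still being read (a minimal sketch of that idea):
cp foo.csv tmp.csv
for skra in *.txt; do
    # write to a second temp file, then replace the first
    paste tmp.csv <(cut -f 5 "$skra") > tmp2.csv && mv tmp2.csv tmp.csv
done
mv tmp.csv foo.csv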
As already noted in the comments, opening foo.csv for output will truncate it in most shells. (Even if that was not the case, opening the file and running cut and paste repeatedly looks quite inefficient.)
If you don’t mind keeping all the data in memory at one point in time, a simple AWK or Bash script can do this type of processing without any further processes such as cut or paste.
awk -F'\t' '{ lines[FNR] = lines[FNR] "\t" $5; if (FNR > n) n = FNR }
     END { for (l = 1; l <= n; l++) print substr(lines[l], 2) }' \
    *.txt > foo.csv
(The output should not be called .csv, but I’m sticking with the naming from the question nonetheless.)
Actually, one doesn't really need awk for this; Bash will do:
#!/bin/bash
lines=()
for file in *.txt; do
    declare -i i=0
    while IFS=$'\t' read -ra line; do
        lines[i++]+=$'\t'"${line[4]}"
    done < "$file"
done
printf '%s\n' "${lines[@]/#?}" > foo.csv
(As a side note, "${lines[@]:1}" would remove the first line, not the first (\t) character of each line; this particular expansion syntax works differently for strings (scalars) and arrays in Bash. Hence "${lines[@]/#?}", another way to express the removal of the first character, which does get applied to each array element.)
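A quick way to see the difference between the two expansions, using a throwaway two-element array (hypothetical names, purely for illustration):
arr=($'\ta' $'\tb')
printf '%s\n' "${arr[@]:1}"   # slices the array: prints only the second element
printf '%s\n' "${arr[@]/#?}"  # strips the first character of each element: prints a, then b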

Shell Script With sed and Random number

How can I make a shell script that receives one or more text files and removes whitespace and blank lines from them? After that, each new file should have a random 2-digit number appended to its name.
For example, File1.txt generates File1_56.txt.
Tried this:
#!/bin/bash
for file in "$*"; do
    sed -e '/^$/d;s/[[:blank:]]//g' $* >> "$*_$$.txt"
done
But when I give 2 files as input, the script merges them into one single file, whereas I want a separate output file for each input file.
Try:
#!/bin/bash
for file in "$@"; do
    sed -e '/^$/d;s/[[:blank:]]//g' "$file" >> "${file%.txt}_$$.txt"
done
Notes
To loop over each argument without word splitting or other hazards, use for file in "$@", not for file in "$*".
To run the sed command on one file instead of all of them, specify "$file" as the input, not $*.
To save the output to the correct file, use "${file%.txt}_$$.txt", where ${file%.txt} is an example of suffix removal: it removes the final .txt from the file name.
$$ is the process ID. The title mentions a "random" number; if you want a truly random number, replace $$ with $RANDOM.
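Since the question asks for a 2-digit number specifically, $RANDOM (which ranges from 0 to 32767) can be reduced modulo 100 and zero-padded; a minimal sketch of that variant:
#!/bin/bash
for file in "$@"; do
    printf -v n '%02d' $((RANDOM % 100))   # zero-padded value in 00..99
    sed -e '/^$/d;s/[[:blank:]]//g' "$file" > "${file%.txt}_${n}.txt"
done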

Linux command to grab lines similar between files

I have one file that has one word per line.
I have a second file that has many words per line.
I would like to go through each line in the first file and, for every line of the second file in which it is found, copy that line from the second file into a new third file.
Is there a way to do this with a simple Linux command?
Edit: Thanks for the input, but I should be more specific:
The first file is just a list of numbers (one number per line).
463463
43454
33634
The second file is very messy, and I am only looking for the number string to appear somewhere in a line (not necessarily as a separate word). So, for instance,
ewjleji jejeti ciwlt 463463.52%
would count as a hit. I think what was suggested to me does not work in this case (please forgive me for having to edit the question for not being detailed enough).
If n is the number of lines in your first file and m is the number of lines in your second file, then you can solve this problem in O(nm) time in the following way:
cat firstfile | while read word; do
    grep "$word" secondfile >> thirdfile
done
If you need to solve it more efficiently than that, I don't think there are any builtin utilities for it, however.
As for your edit, this method does work the way you describe.
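It is worth noting that grep can also read all the patterns from a file in a single pass with its standard -f option; adding -F treats them as fixed strings rather than regexps, which is safe here since the patterns are plain numbers:
grep -F -f firstfile secondfile > thirdfile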
Here is a short script that will do it. It takes 3 command-line arguments: 1 - a file with one word per line, 2 - a file with the many lines you want to match against each word from file 1, and 3 - your output file:
#!/bin/bash

## test input and show usage on error
test -n "$1" && test -n "$2" && test -n "$3" || {
    printf "Error: insufficient input, usage: %s file1 file2 file3\n" "${0//*\//}"
    exit 1
}

while read line || test -n "$line"; do
    grep "$line" "$2" 1>>"$3" 2>/dev/null
done <"$1"
example:
$ cat words.txt
me
you
them
$ cat lines.txt
This line is for me
another line for me
maybe another for me
one for you
another for you
some for them
another for them
here is one that doesn't match any
$ bash ../lines.sh words.txt lines.txt outfile.txt
$ cat outfile.txt
This line is for me
another line for me
maybe another for me
some for them
one for you
another for you
some for them
another for them
(Yes, I know that "me" also matches "some" in the example file, but that's not really the point.)
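If those accidental substring hits are unwanted, grep's -w flag restricts matches to whole words; and because grep -f scans the second file in a single pass, each matching line is printed at most once:
grep -w -f words.txt lines.txt > outfile.txt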

Copy text from multiple files, same names to different path in bash (linux)

I need help copying content from various files into others (same name and format, different path).
For example, $HOME/initial/baby.desktop has text which I need to write into $HOME/scripts/baby.desktop. This is very simple for a single file, but I have 2500 files in $HOME/initial/ and the same number in $HOME/scripts/ with corresponding names (same names and format). I want to append (copy) the content of each file under path A to the end of the file with the same name under path B, without erasing what is already in the file under path B.
So: content of $HOME/initial/*.desktop appended to $HOME/scripts/*.desktop. I tried the following, but it doesn't work:
cd $HOME/initial/
for i in $( ls *.desktop ); do egrep "Icon" $i >> $HOME/scripts/$i; done
Firstly, I would back up $HOME/initial and $HOME/scripts, because there is lots of scope for people misunderstanding your question. Like this:
cd $HOME
tar -cvf initial.tar initial
tar -cvf scripts.tar scripts
That will put all the files in $HOME/initial into a single tarfile called initial.tar and all the files in $HOME/scripts into a single tarfile called scripts.tar.
Now for your question... in general, if you want to put the contents of FileB onto the end of FileA, the command is
cat FileB >> FileA
Note the DOUBLE ">>" which means "append" rather than single ">" which means overwrite.
So, I think you want to do this:
cd $HOME/initial
cat SomeFile >> $HOME/scripts/SomeFile
where SomeFile is the name of any file you choose to test with, e.g. baby.desktop. I would check that has worked and then, if you are happy with it, go ahead and run the same command inside a loop:
cd $HOME/initial
for SOURCE in *
do
    DESTINATION="$HOME/scripts/$SOURCE"
    echo Appending "$SOURCE" to "$DESTINATION"
    #cat "$SOURCE" >> "$DESTINATION"
done
When the output looks correct, remove the "#" at the start of the penultimate line and run it again.
I solved it; for anyone who wants to learn how, the solution is very simple:
using sed
I needed only the matching (pattern) line, e.g. "Icon=/usr/share/some_picture.png", copied from $HOME/initial/example.desktop into the file with the same name and format, $HOME/scripts/example.desktop, but I had a lot of .desktop files (2500 of them):
cd $HOME/initial
STRING_LINE=`grep -l -R "Icon=" *.desktop`
for i in $STRING_LINE; do sed -ne '/Icon=/ p' $i >> $HOME/scripts/$i ; done
_________
If you just need to copy everything to the other file with the same name and format,
using cat
cd $HOME/initial
STRING_LINE=`grep -l -R "Icon=" *.desktop`
for i in $STRING_LINE; do cat $i >> $HOME/scripts/$i ; done
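For file names that might contain spaces, a quoting-safe variant of the same idea (a sketch, assuming the same layout) is:
cd "$HOME/initial" || exit
grep -l "Icon=" *.desktop | while IFS= read -r f; do
    sed -n '/Icon=/p' "$f" >> "$HOME/scripts/$f"
done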

Split files according to a field and save in subdirectory created using the root name

I am having trouble with several bits of code. I am unfortunately no expert in Linux Bash programming, so I have tried unsuccessfully all day to find something that works for my task, and I was hoping you could guide me in the right direction.
I have many large files that I would like to split according to the third field within each of them. I would like to keep the header in each of the sub-files, and save the created sub-files in new directories created from the root names of the files.
The initial files stored in the original directory are:
Downloads/directory1/Levels_CHG_Lab_S_sample1.txt
Downloads/directory1/Levels_CHG_Lab_S_sample2.txt
Downloads/directory1/Levels_CHG_Lab_S_sample3.txt
and so on..
Each of these files has 200 columns, and column 3 contains values from 1 through 10.
I would like to split each of the files above based on the value of this column and store the sub-files in sub-folders; so, for example, sub-folder Downloads/directory1/sample1 will contain 10 files (each with the header line) derived by splitting the file Downloads/directory1/Levels_CHG_Lab_S_sample1.txt.
I have now tried many different steps, with no success. I must be making this more complicated than it is, since the code I have tried looks awful…
Here is the code I am trying to work from:
FILES=Downloads/directory1/
for f in $FILES
do
    # Create folder with root name by stripping file names
    fname=${echo $f | sed 's/.txt//;s/Levels_CHG_Lab_S_//'}
    echo "Creating sub-directory [$fname]"
    mkdir "$fname"
    # Save the header
    awk 'NR==1{print $0}' $f > header
    # Split each file by third column
    echo "Splitting file $f"
    awk 'NR>1 {print $0 > $3".txt" }' $f
    # Move newly created files in sub directory
    mv {1..10}.txt $fname   # I have no idea how to specify the files just created
    # Loop through the sub-files to attach the header row:
    for subfile in $fname
    do
        cat header $subfile >> tmp_file
        mv -f tmp_file $subfile
    done
done
All these steps seem very complicated to me; I would very much appreciate it if you could help me solve this the right way. Thank you very much for your help.
-fra
You have a few problems with your code right now. First of all, at no point do you list the contents of your downloads directory. You are simply setting the FILES variable to a string that is the path to that directory. You would need something like:
FILES=$(ls Downloads/directory1/*.txt)
You also never cd to the Downloads/directory1 folder, so your mkdir would create directories in cwd; probably not what you want.
If you know that the numbers in column 3 always range from 1 to 10, I would just pre-populate those files with the header line before you split the file.
Try this code to do what you want (untested):
BASEDIR=Downloads/directory1
FILES=$(ls ${BASEDIR}/*.txt)
for f in $FILES; do
    # Create folder with root name by stripping file names
    dirname=$(basename $f | sed 's/\.txt$//;s/Levels_CHG_Lab_S_//')
    dirname="${BASEDIR}/${dirname}/"
    echo "Creating sub-directory [$dirname]"
    mkdir "$dirname"
    # Save the header to each file
    HEADER_LINE=$(head -n1 $f)
    for i in {1..10}; do
        echo "${HEADER_LINE}" > ${dirname}/${i}.txt
    done
    # Split each file by third column
    echo "Splitting file $f"
    awk -v dirname=${dirname} 'NR>1 {filename=dirname$3".txt"; print $0 >> filename }' $f
done
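As an alternative to pre-populating the ten header files, each file can be split in a single awk pass that writes the header the first time a given output file is touched (a sketch meant to replace the header and split steps inside the same for f loop; it also copes with column-3 values outside 1..10):
awk -v dir=${dirname} '
    NR == 1 { hdr = $0; next }        # remember the header line
    {
        out = dir $3 ".txt"
        if (!(out in seen)) {         # first record for this value of column 3
            seen[out] = 1
            print hdr > out           # write the header once
        }
        print > out                   # append; awk keeps the file open
    }' $f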
