wget: reading from a list with id numbers and urls - text

In a .txt file, I have 500 lines containing an id number and a website homepage URL, in the following way
id_345 http://www.example1.com
id_367 http://www.example2.org
...
id_10452 http://www.example3.net
Using wget and the -i option, I am trying to download part of these websites recursively, but I would like to store the files in a way that is linked to the id number (storing the files in a directory named after the id number, or, the best option but I think the most difficult to achieve, storing the HTML content in a single .txt file named after the id number).
Unfortunately, the -i option cannot read a file like the one I am using.
How can I link the websites' content with their connected id?
Thanks
P.s.: I imagine that to do this I have to 'go out' from wget and call it through a script. If so, please take into account that I am a newbie in this area (just some Python experience) and that in particular I am not yet able to follow the logic and the code in bash scripts: step-by-step explanations for dummies are therefore very welcome.

Get site recursively with wget -P ... -r -l ... in Python, with parallel processing (gist is here):
import multiprocessing, subprocess, re

def getSiteRecursive(id, url, depth=2):
    cmd = "wget -P " + id + " -r -l " + str(depth) + " " + url
    subprocess.call(cmd, shell=True)

input_file = "site_list.txt"
jobs = []
max_jobs = multiprocessing.cpu_count() * 2 + 1
with open(input_file) as f:
    for line in f:
        id_url = re.compile("\s+").split(line)
        if len(id_url) >= 2:
            try:
                print "Grabbing " + id_url[1] + " into " + id_url[0] + " recursively..."
                if len(jobs) >= max_jobs:
                    jobs[0].join()
                    del jobs[0]
                p = multiprocessing.Process(target=getSiteRecursive, args=(id_url[0], id_url[1], 2,))
                jobs.append(p)
                p.start()
            except Exception, e:
                print "Error for " + id_url[1] + ": " + str(e)
                pass
for j in jobs:
    j.join()
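The same idea can also be sketched with a thread pool (multiprocessing.dummy), which caps concurrency for you and removes the manual max_jobs/join bookkeeping. This is a Python 3 sketch, not part of the original answer; build_cmd just reproduces the wget command string, and the actual subprocess call is left commented out so the sketch stays side-effect free:

```python
from multiprocessing.dummy import Pool  # thread-based pool, same API as multiprocessing.Pool
import subprocess

def build_cmd(pair, depth=2):
    # same command the answer's getSiteRecursive builds
    id_, url = pair
    return "wget -P " + id_ + " -r -l " + str(depth) + " " + url

def fetch(pair):
    # to actually download, uncomment the next line
    # subprocess.call(build_cmd(pair), shell=True)
    return build_cmd(pair)

pairs = [("id_345", "http://www.example1.com"), ("id_367", "http://www.example2.org")]
with Pool(4) as pool:  # the pool size caps concurrency, like max_jobs above
    print(pool.map(fetch, pairs))
```

The pool blocks in map() until all jobs finish, which replaces the trailing join() loop.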
Get single page into named file with Python:
import urllib2, re

input_file = "site_list.txt"
# open the site list file
with open(input_file) as f:
    # loop through lines
    for line in f:
        # split out the id and url
        id_url = re.compile("\s+").split(line)
        print "Grabbing " + id_url[1] + " into " + id_url[0] + ".html..."
        try:
            # try to get the web page
            u = urllib2.urlopen(id_url[1])
            # save the GET response data to the id file (appended with ".html")
            localFile = open(id_url[0] + ".html", 'wb+')
            localFile.write(u.read())
            localFile.close()
            print "got " + id_url[0] + "!"
        except:
            print "Could not get " + id_url[0] + "!"
            pass
Example site_list.txt:
id_345 http://www.stackoverflow.com
id_367 http://stats.stackexchange.com
Output:
Grabbing http://www.stackoverflow.com into id_345.html...
got id_345!
Grabbing http://stats.stackexchange.com into id_367.html...
got id_367!
Directory listing:
get_urls.py
id_345.html
id_367.html
site_list.txt
And if you prefer command line or shell scripting, you can use awk to read each line, split it at spaces (the default), pipe the constructed commands to a loop, and execute each with backticks:
awk '{print "wget -O " $1 ".html " $2}' site_list.txt | while read line ; do `$line` ; done
Breakdown...
awk '{print "wget -O " $1 ".html " $2}' site_list.txt |
Use the awk tool to read each line of the site_list.txt file and
split each line at spaces (default) into variables ($1, $2, $3,
etc.), so that your id is in $1 and your url is in $2.
Use the awk print command to construct the call to wget.
Add the pipe operator | to send the output to the next command.
Next we do the wget call:
while read line ; do `$line` ; done
Loop through the prior command's output line by line, storing each line in the $line variable, and execute it with the backtick operator, which interprets the text and runs it as a command.
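The same loop can be sketched in Python with an argument list rather than backtick evaluation, which sidesteps quoting problems if an id or URL ever contains shell metacharacters. This is an illustrative sketch assuming wget is on PATH; wget_cmd and run_all are hypothetical helper names:

```python
import shutil
import subprocess

def wget_cmd(id_, url):
    # the same command the awk print builds: wget -O <id>.html <url>
    return ["wget", "-O", id_ + ".html", url]

def run_all(list_file="site_list.txt"):
    with open(list_file) as f:
        for line in f:
            parts = line.split()  # default split at whitespace, like awk
            if len(parts) >= 2:
                subprocess.call(wget_cmd(parts[0], parts[1]))

if shutil.which("wget"):  # only attempt downloads when wget is installed
    try:
        run_all()
    except FileNotFoundError:
        pass  # no site_list.txt in the current directory
```

Passing a list to subprocess.call avoids invoking a shell at all, so nothing in the line is re-interpreted the way `$line` inside backticks would be.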

Related

How do I move a line of text in a file to the top if it contains a word (need for multiple files)

I have a file structure as follows:
folder\spkgbuild
folder2\spkgbuild
folder3\spkgbuild
each spkgbuild has the following:
name=packagename
version=1.0
release=1
# depends : packages
# description : blah blah
build()
{
}
I want to move # description to the top of the file and # depends under it, like this:
# description : blah blah
# depends : packages
name=packagename
version=1.0
release=1
build()
{
}
Any idea how?
And a far from robust Linux one-liner:
find -type f -name spkgbuild -printf "%h\n" | xargs -L 1 bash -c 'cd $0 ; cat <(grep -xE "^# des.*" spkgbuild) <(grep -xE "^# dep.*" spkgbuild) <(grep -xvE "^# de.*" spkgbuild) > spkgbuild.1;mv spkgbuild.1 spkgbuild'
Explanation:
Find all files named spkgbuild
Pass the directory each lives in to xargs, cd into each directory containing the file, use three greps to rearrange the lines, write the output to a temporary file (spkgbuild.1), and mv that file over spkgbuild when done.
There it is. I feel like I should be paid for this, though:
filesToModify = ["add your path here", "and here", "and so on"]
linesToMove = ["# description :", "# depends :"]  # you can add more

for fName in filesToModify:
    content = ""
    with open(fName, "r") as fl:
        lines = []
        for line in fl:
            moved = False
            for start in linesToMove:
                if line.startswith(start):
                    lines.insert(0, line + ("\n" if not line.endswith("\n") else ""))
                    moved = True
                    break
            if not moved:
                lines.append(line)
        content = "".join(lines)
    with open(fName, "w") as fl:
        fl.write(content)
Then run it like: python script.py

Trouble converting file types by calling a program from the terminal

I'm trying to automate a process that takes in a 'pdb' file from user input and then uses that input file in a program called 'Antechamber', run from the terminal, which outputs a 'mol2' file.
Here is my code:
import sys
inFile = sys.argv[tetrafluoroborate.pdb]
outFile = sys.argv[tetrafluoroborate.mol2]
p = 'antechamber' + ' -i ' + inFile + ' fi pdb o- ' + outFile + ' -fo mol2'
subprocess.call(p)
The Antechamber program takes four flags: '-i' is the input file, '-fi' is the input file format, '-o' is the output file, and '-fo' is the output file format.
When I run the script, I get:
Traceback (most recent call last):
File "test.py", line 4, in <module>
inFile = sys.argv[tetrafluoroborate.pdb]
NameError: name 'tetrafluoroborate' is not defined
I'm new to coding, and I appreciate any help.
Thank you!
If you need to call the antechamber script from python and the input and output names will have the same prefix, I would suggest a python script like this:
import sys
import subprocess

prefix = sys.argv[1]
inFile = prefix + '.pdb'
outFile = prefix + '.mol2'
pcommand = 'antechamber' + ' -i ' + inFile + ' -fi pdb -o ' + outFile + ' -fo mol2'
subprocess.call(pcommand, shell=True)
Then you can call from the command line by:
python test.py tetrafluoroborate

Bash output limited to echo only

I am writing a bash script to handle my backups. I have created a message function controller that uses functions to handle email, log and output.
So the structure is as:
message_call(i, "This is the output")
Message Function
-> Pass to email function
--> Build email file
-> Pass to log function
--> Build log file
-> Pass to echo function (custom)
--> Format and echo input dependent on $1 as a switch and $2 as the output message
When I echo I want nice clean output that consists only of messages passed to the echo function. I can point all output to /dev/null, but I am struggling to suppress all output except for the echo command.
Current output sample:
craig#ubuntu:~/backup/functions$ sudo ./echo_function.sh i test
+ SWITCH=i
+ INPUT=test
+ echo_function
+ echo_main
+ echo_controller i test
+ '[' i == i ']'
+ echo_info test
+ echo -e '\e[32m\e[1m[INFO]\e[0m test'
[INFO] test
+ echo test
test
+ '[' i == w ']'
+ '[' i == e ']'
Above I ran the echo function alone and the output I want is on line 10, all other output in the sample I don't want.
If you have the line set -x in your script, comment it out. If not, try adding set +x at the top of your script.
If you want to hide all the output from everything except what you're explicitly doing in your echo function you could do something like this:
exec 7>&1 # save a copy of current stdout
exec >/dev/null # redirect everyone else's stdout to /dev/null
ls # output goes to /dev/null
echo My Message >&7 # output goes to "old" stdout
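The same file-descriptor juggling can be sketched in Python with os.dup/os.dup2, which is essentially what the exec redirections do under the hood (an illustrative sketch, not part of the original answer):

```python
import os
import sys

sys.stdout.flush()
saved = os.dup(1)                       # like `exec 7>&1`: keep a copy of stdout
devnull = os.open(os.devnull, os.O_WRONLY)
os.dup2(devnull, 1)                     # like `exec >/dev/null`: silence fd 1

os.write(1, b"this is hidden\n")        # goes to /dev/null
os.write(saved, b"My Message\n")        # like `>&7`: goes to the original stdout

os.dup2(saved, 1)                       # restore stdout
os.close(saved)
os.close(devnull)
```

Because dup2 swaps the descriptor itself, anything the script runs in between (child processes included) inherits the silenced stdout, just as commands after `exec >/dev/null` do.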

grep lines before and after in aix/ksh shell

I want to extract lines before and after a matched pattern.
eg: if the file contents are as follows
absbasdakjkglksagjgj
sajlkgsgjlskjlasj
hello
lkgjkdsfjlkjsgklks
klgdsgklsdgkldskgdsg
I need to find 'hello' and display the line before and after it.
the output should be
sajlkgsgjlskjlasj
hello
lkgjkdsfjlkjsgklks
This is possible with GNU grep, but I need a method that works in AIX / ksh where no GNU tools are installed.
sed -n '/hello/{x;G;N;p;};h' filename
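If Python happens to be on the AIX box, the same one-before/one-after window can be sketched portably; around() here is an illustrative helper, not a standard tool:

```python
import re
import sys

def around(lines, pattern, before=1, after=1):
    # collect each matching line together with its neighbours
    pat = re.compile(pattern)
    out = []
    for i, line in enumerate(lines):
        if pat.search(line):
            out.extend(lines[max(0, i - before):i + after + 1])
    return out

if __name__ == "__main__" and len(sys.argv) == 3:
    # usage sketch: python around.py <pattern> <file>
    with open(sys.argv[2]) as f:
        sys.stdout.writelines(around(f.readlines(), sys.argv[1]))
```

One caveat: unlike grep -A/-B, overlapping matches print their shared neighbour lines twice; fine for occasional use, but not a drop-in replacement.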
I've found it is generally less frustrating to build the GNU coreutils once, and benefit from many more features http://www.gnu.org/software/coreutils/
Since you'll have Perl on the machine, you could use the following code, but you'd probably do better to install the GNU utilities. This has options -b n1 for lines before and -f n2 for lines following the match. It works with PCRE matches (so if you want case-insensitive matching, add an i after the regex instead of using a -i option). I haven't implemented -v or -l; I didn't need those.
#!/usr/bin/env perl
#
# @(#)$Id: sgrep.pl,v 1.7 2013/01/28 02:07:18 jleffler Exp $
#
# Perl-based SGREP (special grep) command
#
# Print lines around the line that matches (by default, 3 before and 3 after).
# By default, include file names if more than one file to search.
#
# Options:
#   -b n1   Print n1 lines before match
#   -f n2   Print n2 lines following match
#   -n      Print line numbers
#   -h      Do not print file names
#   -H      Do print file names

use warnings;
use strict;
use constant debug => 0;
use Getopt::Std;

my(%opts);

sub usage
{
    print STDERR "Usage: $0 [-hnH] [-b n1] [-f n2] pattern [file ...]\n";
    exit 1;
}

usage unless getopts('hnf:b:H', \%opts);
usage unless @ARGV >= 1;

if ($opts{h} && $opts{H})
{
    print STDERR "$0: mutually exclusive options -h and -H specified\n";
    exit 1;
}

my $op = shift;
print "# regex = $op\n" if debug;

# print file names if -h omitted and more than one argument
$opts{F} = (defined $opts{H} || (!defined $opts{h} and scalar @ARGV > 1)) ? 1 : 0;
$opts{n} = 0 unless defined $opts{n};
my $before = (defined $opts{b}) ? $opts{b} + 0 : 3;
my $after  = (defined $opts{f}) ? $opts{f} + 0 : 3;
print "# before = $before; after = $after\n" if debug;

my @lines = ();   # Accumulated lines
my $tail  = 0;    # Line number of last line in list
my $tbp_1 = 0;    # First line to be printed
my $tbp_2 = 0;    # Last line to be printed

# Print lines from @lines in the range $tbp_1 .. $tbp_2,
# leaving $leave lines in the array for future use.
sub print_leaving
{
    my ($leave) = @_;
    while (scalar(@lines) > $leave)
    {
        my $line = shift @lines;
        my $curr = $tail - scalar(@lines);
        if ($tbp_1 <= $curr && $curr <= $tbp_2)
        {
            print "$ARGV:" if $opts{F};
            print "$curr:" if $opts{n};
            print $line;
        }
    }
}

# General logic:
# Accumulate each line at end of @lines.
# ** If current line matches, record range that needs printing.
# ** When the line array contains enough lines, pop line off front and,
#    if it needs printing, print it.
# At end of file, empty line array, printing requisite accumulated lines.
while (<>)
{
    # Add this line to the accumulated lines
    push @lines, $_;
    $tail = $.;
    printf "# array: N = %d, last = $tail: %s", scalar(@lines), $_ if debug > 1;

    if (m/$op/o)
    {
        # This line matches - set range to be printed
        my $lo = $. - $before;
        $tbp_1 = $lo if ($lo > $tbp_2);
        $tbp_2 = $. + $after;
        print "# $. MATCH: print range $tbp_1 .. $tbp_2\n" if debug;
    }

    # Print out any accumulated lines that need printing
    # Leave $before lines in array.
    print_leaving($before);
}

continue
{
    if (eof)
    {
        # Print out any accumulated lines that need printing
        print_leaving(0);
        # Reset for next file
        close ARGV;
        $tbp_1 = 0;
        $tbp_2 = 0;
        $tail  = 0;
        @lines = ();
    }
}
I had a situation where I was stuck with a slow telnet session on a tablet, believe it or not, and I couldn't write a Perl script very easily with that keyboard. I came up with this hacky maneuver that worked in a pinch for me with AIX's limited grep. This won't work well if your grep returns hundreds of lines, but if you just need one line and one or two above/below it, this could do it. First I ran this:
cat -n filename |grep criteria
By including the -n flag, I see the line number of the data I'm seeking, like this:
2543 my crucial data
Since cat gives the line number 2 spaces before and 1 space after, I could grep for the line number right before it like this:
cat -n filename |grep " 2542 "
I ran this a couple of times to give me lines 2542 and 2544 that bookended line 2543. Like I said, it's definitely fallible, like if you have reams of data that might have " 2542 " all over the place, but just to grab a couple of quick lines, it worked well.

Split one file into multiple files based on delimiter

I have one file with -| as a delimiter after each section; I need to create separate files for each section using Unix.
example of input file
wertretr
ewretrtret
1212132323
000232
-|
ereteertetet
232434234
erewesdfsfsfs
0234342343
-|
jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|
Expected result in File 1
wertretr
ewretrtret
1212132323
000232
-|
Expected result in File 2
ereteertetet
232434234
erewesdfsfsfs
0234342343
-|
Expected result in File 3
jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|
A one-liner, no programming (except the regexp etc.):
csplit --digits=2 --quiet --prefix=outfile infile "/-|/+1" "{*}"
tested on:
csplit (GNU coreutils) 8.30
Notes about usage on Apple Mac
"For OS X users, note that the version of csplit that comes with the OS doesn't work. You'll want the version in coreutils (installable via Homebrew), which is called gcsplit." — @Danial
"Just to add, you can get the version for OS X to work (at least with High Sierra). You just need to tweak the args a bit csplit -k -f=outfile infile "/-\|/+1" "{3}". Features that don't seem to work are the "{*}", I had to be specific on the number of separators, and needed to add -k to avoid it deleting all outfiles if it can't find a final separator. Also if you want --digits, you need to use -n instead." — @Pebbl
awk '{f="file" NR; print $0 " -|"> f}' RS='-\\|' input-file
Explanation (edited):
RS is the record separator; this solution uses a GNU awk extension that allows it to be more than one character. NR is the record number.
The print statement prints a record followed by " -|" into a file whose name contains the record number.
Debian has csplit, but I don't know if that's common to all/most/other distributions. If not, though, it shouldn't be too hard to track down the source and compile it...
I solved a slightly different problem, where the file contains a line naming the file that the following text should go into. This Perl code does the trick for me:
#!/path/to/perl -w
# comment the line below for UNIX systems
use Win32::Clipboard;

# Get command line flags
#print ($#ARGV, "\n");
if ($#ARGV == 0) {
    print STDERR "usage: ncsplit.pl --mff -- filename.txt [...] \n\nNote that no space is allowed between the '--' and the related parameter.\n\nThe mff is found on a line followed by a filename. All of the contents of filename.txt are written to that file until another mff is found.\n";
    exit;
}

# this package sets the ARGV count variable to -1;
use Getopt::Long;
my $mff = "";
GetOptions('mff' => \$mff);

# set a default $mff variable
if ($mff eq "") {$mff = "-#-"};
print ("using file switch=", $mff, "\n\n");

while ($_ = shift @ARGV) {
    if (-f "$_") {
        push @filelist, $_;
    }
}

# Could be more than one file name on the command line,
# but this version throws away the subsequent ones.
$readfile = $filelist[0];
open SOURCEFILE, "<$readfile" or die "File not found...\n\n";
#print SOURCEFILE;

while (<SOURCEFILE>) {
    /^$mff (.*$)/o;
    $outname = $1;
    # print $outname;
    # print "right is: $1 \n";
    if (/^$mff /) {
        open OUTFILE, ">$outname";
        print "opened $outname\n";
    }
    else {print OUTFILE "$_"};
}
The following command works for me. Hope it helps.
awk 'BEGIN{file = 0; filename = "output_" file ".txt"}
     /-\|/ {getline; file ++; filename = "output_" file ".txt"}
     {print $0 > filename}' input
You can also use awk. I'm not very familiar with awk, but the following did seem to work for me. It generated part1.txt, part2.txt, part3.txt, and part4.txt. Do note that the last partn.txt file this generates is empty. I'm not sure how to fix that, but I'm sure it could be done with a little tweaking. Any suggestions anyone?
awk_pattern file:
BEGIN{ fn = "part1.txt"; n = 1 }
{
print > fn
if (substr($0,1,2) == "-|") {
close (fn)
n++
fn = "part" n ".txt"
}
}
bash command:
awk -f awk_pattern input.file
Here's a Python 3 script that splits a file into multiple files based on a filename provided by the delimiters. Example input file:
# Ignored
######## FILTER BEGIN foo.conf
This goes in foo.conf.
######## FILTER END
# Ignored
######## FILTER BEGIN bar.conf
This goes in bar.conf.
######## FILTER END
Here's the script:
#!/usr/bin/env python3

import os
import argparse

# global settings
start_delimiter = '######## FILTER BEGIN'
end_delimiter = '######## FILTER END'

# parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input filename")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")
args = parser.parse_args()

# read the input file
with open(args.input_file, 'r') as input_file:
    input_data = input_file.read()

# iterate through the input data by line
input_lines = input_data.splitlines()
while input_lines:
    # discard lines until the next start delimiter
    while input_lines and not input_lines[0].startswith(start_delimiter):
        input_lines.pop(0)

    # corner case: no delimiter found and no more lines left
    if not input_lines:
        break

    # extract the output filename from the start delimiter
    output_filename = input_lines.pop(0).replace(start_delimiter, "").strip()
    output_path = os.path.join(args.output_dir, output_filename)

    # open the output file
    print("extracting file: {0}".format(output_path))
    with open(output_path, 'w') as output_file:
        # while we have lines left and they don't match the end delimiter
        while input_lines and not input_lines[0].startswith(end_delimiter):
            output_file.write("{0}\n".format(input_lines.pop(0)))

        # remove end delimiter if present
        if input_lines:
            input_lines.pop(0)
Finally here's how you run it:
$ python3 script.py -i input-file.txt -o ./output-folder/
Use csplit if you have it.
If you don't, but you have Python... don't use Perl.
Lazy reading of the file
Your file may be too large to hold in memory all at once - reading line by line may be preferable. Assume the input file is named "samplein":
$ python3 -c "from itertools import count
with open('samplein') as file:
    for i in count():
        firstline = next(file, None)
        if firstline is None:
            break
        with open(f'out{i}', 'w') as out:
            out.write(firstline)
            for line in file:
                out.write(line)
                if line == '-|\n':
                    break"
cat file| ( I=0; echo -n "">file0; while read line; do echo $line >> file$I; if [ "$line" == '-|' ]; then I=$[I+1]; echo -n "" > file$I; fi; done )
and the formatted version:
#!/bin/bash
cat FILE | (
    I=0
    echo -n "" > file0
    while read line
    do
        echo $line >> file$I
        if [ "$line" == '-|' ]
        then
            I=$[I+1]
            echo -n "" > file$I
        fi
    done
)
This is the sort of problem I wrote context-split for:
http://stromberg.dnsalias.org/~strombrg/context-split.html
$ ./context-split -h
usage:
./context-split [-s separator] [-n name] [-z length]
-s specifies what regex should separate output files
-n specifies how output files are named (default: numeric)
-z specifies how long numbered filenames (if any) should be
-i include line containing separator in output files
operations are always performed on stdin
Here is Perl code that will do the thing:
#!/usr/bin/perl
open(FI, "file.txt") or die "Input file not found";
$cur = 0;
open(FO, ">res.$cur.txt") or die "Cannot open output file $cur";
while (<FI>)
{
    print FO $_;
    if (/^-\|/)
    {
        close(FO);
        $cur++;
        open(FO, ">res.$cur.txt") or die "Cannot open output file $cur";
    }
}
close(FO);
Try this Python script:
import os
import argparse

delimiter = '-|'

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input txt")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")
args = parser.parse_args()

counter = 1
output_filename = 'part-' + str(counter)
with open(args.input_file, 'r') as input_file:
    for line in input_file.read().split('\n'):
        if delimiter in line:
            counter = counter + 1
            output_filename = 'part-' + str(counter)
            print('Section ' + str(counter) + ' Started')
        else:
            # skips empty lines (change the condition if you want empty lines too)
            if line.strip():
                output_path = os.path.join(args.output_dir, output_filename + '.txt')
                with open(output_path, 'a') as output_file:
                    output_file.write("{0}\n".format(line))
ex:
python split.py -i ./to-split.txt -o ./output-dir
