Get wget to download only new items from a list - linux

I've got a file that contains a list of file paths. I’m downloading them like this with wget:
wget -i cram_download_list.txt
However, the list is long and my session gets interrupted. I'd like to check the directory to see which files already exist, and only download the outstanding ones.
I've been trying to come up with an option involving comm, but can't work out how to loop it in with wget.
File contents look like this:
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
I’m currently trying to do something like this:
ls *.cram | sed 's/^/ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/run\/ERR323\/ERR3239480\//' > downloaded.txt
comm -3 <(sort cram_download_list.txt) <(sort downloaded.txt) | tr -d " \t" > to_download.txt
wget -i to_download.txt

I'd like to check the directory to see which files already exist, and only download the outstanding ones.
To get that behavior you can use the -nc (long form --no-clobber) flag. It skips any download that would overwrite an existing file. So in your case:
wget -nc -i cram_download_list.txt
Beware that this solution does not handle partially downloaded files.

wget -c -i <(find -type f -name '*.cram' -printf '%f$\n' |\
grep -vf - cram_download_list.txt )
This finds files ending in cram and prints each filename followed by a $ and a newline. The result is used as an inverted regex match list for your download list, i.e. it removes any line ending in one of the existing file names from your download list.
Added:
-c to finish partially downloaded files (i.e. resume downloads)
Note: this does not handle spaces or newlines in file names well, but these are FTP URLs so that should not be a problem in the first place.

If you also want to handle partially transferred files, you need to pass wget the complete set of URLs so that it can check each file's length. That means that for this scenario the only way is:
wget -c -i cram_download_list.txt
Files which are already complete will only be checked and then skipped.
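For reference, a minimal sketch of a working comm-based variant of the question's attempt (assuming bash for the process substitutions, and that the downloaded files sit in the current directory). It compares basenames instead of reconstructing full URLs, since the run directory differs per file:
# basenames that appear in the list but not on disk
comm -23 <(awk -F/ '{print $NF}' cram_download_list.txt | sort) <(ls *.cram | sort) > missing_names.txt
# pull the matching full URLs back out of the list
grep -Ff missing_names.txt cram_download_list.txt > to_download.txt
wget -i to_download.txt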

Related

Obtaining a flat list of all blobs in a .git/objects/ folder

In the .git/objects/ folder there are many folders with files within such as ab/cde.... I understand that these are actually blobs abcde...
Is there a way to obtain a flat file listing of all blobs under .git/objects/ with no / being used as a delimiter between ab and cde in the example above? For e.g.
abcde....
ab812....
74axs...
I tried
/.git/objects$ du -a .
This does list recursively all folders and files within the /objects/ folder but the blobs are not listed since the command lists the folder followed by the filename (as the OS recognizes them, as opposed to git). Furthermore, the du command does not provide a flat listing in a single column -- it provides the output in two columns with a numeric entry (disk usage) in the first column.
I think you should start around here (git version 2.37.2):
git rev-list --all --objects --filter=object:type=blob
Doing it this way offers the advantage of not only checking the directory where the unpacked objects are but also the objects that are already packed (which are not in that directory anymore).
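Each output line is the blob's hash, usually followed by a path where that blob occurs. To inspect any hash from the list you can use git cat-file (the hash below is only a placeholder):
# print the object's type (should be "blob")
git cat-file -t 6187c3f3d6c61c1a6ea475afb64265b83e73ec26
# print the blob's content
git cat-file -p 6187c3f3d6c61c1a6ea475afb64265b83e73ec26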
If you are in the .git/objects/ folder
Try this.
find . -type f | sed -e 's/^\.\///' | sed -e 's/\///'
sed -e takes a sed script, which here is a find/replace pattern.
's/^\.\///' finds the leading ./ that find prints and replaces it with '', which is nothing; the sed command therefore removes it.
\ in the pattern is an escape character.
After the first sed command ends, the results will be (on Linux):
61/87c3f3d6c61c1a6ea475afb64265b83e73ec26
To remove the /, which is the directory separator, the second sed does:
sed -e 's/\///'
If you are in the directory which contains .git, try this:
find .git/objects/ -type f | sed -e 's/\.git\/objects\///' | sed -e 's/\///'
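The two sed invocations can also be combined into a single call with two expressions, a minor variant of the same command:
find .git/objects/ -type f | sed -e 's/\.git\/objects\///' -e 's/\///'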

How to search multiple DOCX files for a string within a Word field?

Is there any Windows app that will search for a string of text within fields in a Word (DOCX) document? Apps like Agent Ransack and its big brother FileLocator Pro can find strings in the Word docs but seem incapable of searching within fields.
For example, I would like to be able to find all occurrences of the string "getProposalTranslations" within a collection of Word documents that have fields with syntax like this:
{ AUTOTEXTLIST \t "<wr:out select='$.shared_quote_info' datasource='getProposalTranslations'/>" }
Note that string doesn't appear within the text of the document itself but rather only within a field. Essentially the DOCX file is just a zip file, I believe, so if there's a tool that can grep within archives, that might work. Note also that I need to be able to search across hundreds or perhaps thousands of files in many directories, so unzipping the files one by one isn't feasible. I haven't found anything on my own and thought I'd ask here. Thanks in advance.
This script should accomplish what you are trying to do. Let me know if that isn't the case. I don't usually write entire scripts because it can hurt the learning process, so I have commented each command so that you might learn from it.
#!/bin/sh
# Create ~/tmp/WORDXML folder if it doesn't exist already
mkdir -p ~/tmp/WORDXML
# Iterate through each file passed to this script
for FILE in "$@"; do
{
    # unzip it into ~/tmp/WORDXML
    # > /dev/null 2>&1 discards all output to the terminal
    unzip -d ~/tmp/WORDXML "$FILE" > /dev/null 2>&1
    # find all of the xml files
    find ~/tmp/WORDXML -type f -name '*.xml' | \
    # open them in xmllint to make them pretty. Discard errors.
    xargs xmllint --recover --format 2> /dev/null | \
    # search for and report if found
    grep 'getProposalTranslations' && echo " [^ found in file '$FILE']"
    # remove the temporary contents
    rm -rf ~/tmp/WORDXML/*
}; done
# remove the temporary folder
rm -rf ~/tmp/WORDXML
Save the script wherever you like. Name it whatever you like. I'll name it docxfind. Make it executable by running chmod +x docxfind. Then you can run the script like this (assuming your terminal is running in the same directory): ./docxfind filenames...
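Since the documents are spread across many directories, one way to hand every .docx under a tree to the script in one go (a sketch, assuming GNU find and the script saved as ./docxfind):
find . -name '*.docx' -exec ./docxfind {} +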

Using wget to download images and saving with specified filename

I'm using wget in the Mac terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename it's saved as instead of using the basename. The filename is specified in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use the second column as the -O output name.
I'm sure it's a simple change to my above command but after reading dozens of different posts on here and other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator it's very easy:
cat images.txt | while read url name; do wget "${url}" -O "${name}"; done
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
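If the columns are separated by a comma (or a mix of commas and spaces), a minimal sketch that handles both is to add the comma to IFS for the read:
while IFS=', ' read -r url name; do
    wget "${url}" -O "${name}"
done < images.txt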

Compare files content with similar names on two folders

I have two folders (I'll use database names as example):
MongoFolder/
CassandraFolder/
These two folders have similar files inside like:
MongoFolder/
  MongoFile
  MongoStatus
  MongoConfiguration
  MongoPlugin
CassandraFolder/
  CassandraFile
  CassandraStatus
  CassandraConfiguration
Those files also have very similar content, differing only in the database name: they all contain code or configuration where only the name Mongo or Cassandra changes.
How can I compare these two folders, so the result is the files missing from one or the other (for example the file CassandraPlugin missing from CassandraFolder), while also confirming that the contents of matching files are similar except for the database name?
This will give you the names of the missing files (minus the database name): once the database names are stripped, matching files produce identical lines, and uniq -u keeps only the lines that appear once, i.e. the unmatched ones:
find MongoFolder/ CassandraFolder/ | \
sed -e s/Mongo//g -e s/Cassandra//g | sort | uniq -u
Output:
Folder/Plugin
The following provides a full diff, including missing files and changed content:
cp -r CassandraFolder cmpFolder
# rename files
find cmpFolder -name "Cassandra*" -print | while read file; do
mongoName=`echo "$file" | sed 's/Cassandra/Mongo/'`
mv "$file" "$mongoName"
done
# fix content
find cmpFolder -type f -exec perl -pi -e 's/Cassandra/Mongo/g' {} \;
# inspect result
diff -r MongoFolder cmpFolder # or use a gui tool like kdiff3
I haven't tested this though; feel free to fix bugs or to ask if something specific is unclear.
Instead of mv you can also use rename, but that differs between Linux flavours; see the sketch below.
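For instance, with the Perl-based rename shipped on Debian and Ubuntu (a sketch; the util-linux rename on other distributions takes plain from/to arguments instead of a substitution expression):
rename 's/Cassandra/Mongo/' cmpFolder/Cassandra*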

Question on grep

Out of many results returned by grepping a particular pattern, if I want to use all the results one after the other in my script, how can I go about it? For example, I grep for .der in a certificate folder, which returns many results. I want to use each and every .der certificate listed by the grep command. How can I use the files from the grep result one after the other?
Are you actually grepping content, or just filenames? If it's file names, you'd be better off using the find command:
find /path/to/folder -name "*.der" -exec some other commands {} ";"
It should be quicker in general.
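For instance, to print the subject of every certificate (a sketch, assuming the .der files are DER-encoded X.509 certificates and the openssl CLI is installed):
find /path/to/folder -name "*.der" -exec openssl x509 -inform der -noout -subject -in {} ";"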
One way is to use grep -l. This ensures you get each file only once; -l prints only the name of each file that contains a match, not the matches themselves.
Then, you can loop on the results:
for file in `grep ....`
do
# work on $file
done
Also note that if you have spaces in your filenames, there is a ton of possible issues. See Looping through files with spaces in the names on the Unix&Linux stackexchange.
You can use the output as part of a for loop, something like:
for cert in $(grep '\.der' *) ; do
echo ${cert} # or something else
done
Of course, if those der things are actually files (and you're using ls | grep to get them), you can directly use the files:
for cert in *.der ; do
echo ${cert} # or something else
done
In both cases, you may need to watch out for arguments with embedded spaces.
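A space-safe variant of both loops, assuming bash and that the .der certificates live under the current directory, is a null-delimited find loop:
find . -name '*.der' -print0 | while IFS= read -r -d '' cert; do
    echo "${cert}" # or something else
done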
