CPanel/VPS - Search all messages for URL - search

I need a command that will search through the body text of all emails (sent/received/junk) for all email accounts on a VPS server for a particular URL (which may be embedded in text). Possibly a grep command?
Apologies if this is obvious!

You can use a very simple and basic grep command:
grep -ri 'particular url' /home/*/mail/*
-i: ignore case, so the search is not case sensitive
-r: recursive; searches through all files and folders under the given path
Give it a try!
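If the URL contains characters that are special in regular expressions (dots, slashes, question marks), a fixed-string search is safer. A minimal sketch, assuming the same cPanel maildir layout and a hypothetical URL:
grep -rliF 'http://example.com/some/page' /home/*/mail/*
Here -F treats the pattern as a literal string rather than a regex, and -l prints only the names of the matching mail files instead of every matching line.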

Related

Linux command select specific directory

I have only two folders under a given directory. Is there any method to choose the second directory based on the order and not on the folder name?
Example: (I want to enter under doc2)
#ls
doc1 doc2
If you really want to use ls,
cd "$(ls -d */ | sed -n '2p')"
enters the second directory listed by ls, regardless of how many directories ls returns.
Parsing ls output is generally not a good idea, although it will work in most cases and causes no harm if you are only using it interactively for quick navigation. You should not rely on it in scripts.
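If you want to avoid parsing ls entirely, here is a minimal sketch using the shell's own globbing (bash; it assumes the directory contains at least two subdirectories):
dirs=(*/)
cd "${dirs[1]}"
The glob */ expands to the directories in sorted order, and "${dirs[1]}" is the second element of the array (bash arrays are zero-indexed).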
You can use the tail command to get the last entry, which with only two folders is the second one:
ls | tail -1

Grep on a file that is being written by another application

I am using grep to match a string in a file, but at the same time another application might be writing to the same file.
In that case, what will grep do?
Will it still allow the other application to write to the file, or will it block access to it?
Also, if it does allow access, will my grep results reflect the file's contents from before the write or after?
Basically, I want grep not to lock access to the file; if it does, is there an alternative that prevents it from doing so?
My sample command:
egrep -r -i "regex" /directory/*
grep does not lock the file, so it is safe to run while the file is being actively written by another application. grep simply reads the file as it exists on disk: the results reflect whatever content was present when grep read each part of the file, so lines appended after grep has passed that point will not appear in the output.
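If you also want to catch lines as they are appended, a common approach is to follow the file and filter the stream. A minimal sketch, assuming a single hypothetical log file rather than the recursive search above:
tail -f /directory/app.log | grep -i "regex"
Note that grep may buffer its output when writing to a pipe; with GNU grep you can add --line-buffered to see matches immediately.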

wget: downloading all files in directories/subdirectories

Basically on a webpage there is a list of directories, and each of these has further subdirectories. The subdirectories contain a number of files, and I want to download to a single location on my linux machine one file from each subdirectory which has the specific sequence letters 'RMD' in it.
E.g., say the main webpage links to directories dir1, dir2, dir3..., and each of these has subdirectories dir1a, dir1b..., dir2a, dir2b... etc. I want to download files of the form:
webpage/dir1/dir1a/file321RMD210
webpage/dir1/dir1b/file951RMD339
...
webpage/dir2/dir2a/file416RMD712
webpage/dir2/dir2b/file712RMD521
The directories/subdirectories are not sequentially numbered like in the above example (that was just me making it simpler to read) so is there a terminal command that will recursively go through each directory and subdirectory and download every file with the letters 'RMD' in the file name?
The website in question is http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/.
I hope that's enough information.
An answer with a lot of remarks:
If the website supports FTP, you are better off using @MichaelBaldry's answer. This answer aims to give a way to do it with wget (but this is less efficient for both server and client).
You can only use the -r flag if the website provides a directory listing (the -r flag makes wget find links in the fetched pages and then download those pages as well).
The following method is inefficient for both server and client and can result in a huge load if the pages are generated dynamically. The website you mention furthermore specifically asks not to fetch data that way.
wget -e robots=off -r -k -nv -nH -l inf -R jpg,jpeg,gif,png,tif --reject-regex '(.*)\?(.*)' --no-parent 'http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'
with:
wget: the program you want to call;
-e robots=off: ignore the website's request not to download it automatically;
-r: download recursively;
-k: convert the links in downloaded pages so they work locally;
-nv: less verbose output;
-nH: do not create a top-level directory named after the host;
-R jpg,jpeg,gif,png,tif: reject the downloading of media files (the small images);
--reject-regex '(.*)\?(.*)': do not follow or download query pages (the sorting links on index pages);
-l inf: keep recursing to an unlimited depth;
--no-parent: prevent wget from fetching links in the parent of the starting URL (for instance the .. link to the parent directory).
wget downloads the files breadth-first, so you will have to wait a long time before it eventually starts fetching the real data files.
Note that wget has no way to guess the directory structure on the server side. It only finds links in the pages it fetches and uses those to build a dump of the "visible" files. If the webserver does not list all available files, wget will miss the ones it never sees a link to.
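Since the goal is only the files with 'RMD' in their names, you can also let wget filter by file name. A sketch under the assumption that the index pages link to the data files directly (wget still has to fetch the HTML index pages to discover links, and removes them afterwards because they do not match the accept list):
wget -e robots=off -r -np -nH -l inf -A '*RMD*' --reject-regex '(.*)\?(.*)' 'http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'
The -A pattern is matched against file names only, so directories are still traversed normally.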
I've noticed this site supports the FTP protocol, which is a far more convenient way of browsing files and folders (it's meant for transferring files, not web pages).
Get an FTP client (there are lots of them about), open ftp://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/ and you can probably just highlight all the folders in there and hit download.
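If you prefer the command line, wget can also recurse over FTP and filter by file name. A minimal sketch, assuming the FTP tree mirrors the HTTP listing:
wget -r -np -nH -l inf -A '*RMD*' 'ftp://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'
Over FTP, wget reads real directory listings instead of scraping HTML pages, so it only transfers the files that match the pattern.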
One solution using saxon-lint :
saxon-lint --html --xpath 'string-join(//a/@href, "^M")' http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/ |
awk '/SOL/{print "http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/"$0}' |
while read url; do
    saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url" |
    awk -vurl="$url" '/SOL/{print url$0}'
done |
while read url2; do
    saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url2" |
    awk -vurl2="$url2" '/RME/{print url2$0}'
done |
xargs wget
Replace the "^M" with a literal Control+M character (Unix) or \r\n on Windows.

Using wget but ignore url parameters

I want to download the contents of a website where the URLs are built as
http://www.example.com/level1/level2?option1=1&option2=2
Within the URL, only http://www.example.com/level1/level2 is unique for each page; the values of option1 and option2 change. In fact, every unique page can have hundreds of different notations due to these variables. I am using wget to fetch all the site's content. Because of this I have already downloaded more than 3 GB of data. Is there a way to tell wget to ignore everything after the URL's question mark? I can't find it in the man pages.
You can use --reject-regex to specify the pattern to reject the specific URL addresses, e.g.
wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/
This will mirror the website, but it will ignore addresses containing a question mark, which is useful for mirroring wiki sites.
wget2 has this built in via options --cut-url-get-vars and --cut-file-get-vars.
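A minimal sketch of the wget2 variant, assuming wget2 is installed and you want a recursive mirror with the query strings stripped:
wget2 --recursive --cut-url-get-vars --cut-file-get-vars http://www.example.com/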
This does not help with the download itself, but for those who have already downloaded these files: you can quickly rename them to remove the question mark and everything after it as follows:
rename -v -n 's/[?].*//' *[?]*
The above command does a trial run and shows you how files will be renamed. If everything looks good with the trial run, then run the command again without the -n (nono) switch.
Problem solved. I noticed that the URLs I want to download are all search-engine friendly, with descriptions formed using dashes:
http://www.example.com/main-topic/whatever-content-in-this-page
All other URLs had references to the CMS. I got all I needed with
wget -r http://www.example.com -A "*-*"
This did the trick. Thanks for sharing your thoughts!
@kenorb's answer using --reject-regex is good, but it did not work for me on an older version of wget. Here is the equivalent using wildcards that works with GNU Wget 1.12:
wget --reject "*\?*" -m -c --content-disposition http://example.com/

Find a string in Perforce file without syncing

Not sure if this is possible or not, but I figured I'd ask to see if anyone knows. Is it possible to find a file containing a string in a Perforce repository? Specifically, is it possible to do so without syncing the entire repository to a local directory first? (It's quite large - I don't think I'd have room even if I deleted lots of stuff - that's what the archive servers are for anyhow.)
There are any number of tools that can search through files in a local directory (I personally use Agent Ransack, but it's just one of many), but these will not search a remote Perforce depot, unless there's some (preferably free) tool I'm not aware of that has this capability, or maybe some hidden feature within Perforce itself?
p4 grep is your friend. From the Perforce blog:
'p4 grep' allows users to use simple file searches as well as regular expressions to search through file contents of head as well as earlier revisions of files stored on the server. While not every single option of a standard grep is supported, the most important options are available. Here is the syntax of the command according to 'p4 help grep':
p4 grep [ -a -i -n -v -A after -B before -C context -l -L -t -s -F -G ] -e pattern file[revRange]...
See also the manual page.
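For example, a minimal sketch that searches the head revisions under a hypothetical depot path:
p4 grep -i -e "the_search_string" //depot/some/project/...
-i makes the search case-insensitive, -e introduces the pattern, and the trailing ... wildcard tells Perforce to search every file under that path.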
Update: Note that there is a limitation on the number of files that Perforce will search in a single p4 grep command. Presumably this is to help keep the load on the server down. This manifests as an error:
Grep revision limit exceeded (over 10000).
If you have sufficient Perforce permissions, you can use p4 configure to increase the dm.grep.maxrevs setting from its default of 10K to something larger, e.g. to set it to 1 million:
p4 configure set dm.grep.maxrevs=1M
If you do not have permission to change this, you can work around it by splitting the p4 grep up into multiple commands over the subdirectories. You may need to split further into sub-subdirectories, etc., depending on your depot structure.
For example, this command can be used in a bash shell to search each subdirectory of //depot/trunk one at a time. It makes use of the p4 dirs command to obtain the list of subdirectories from the server.
for dir in $(p4 dirs //depot/trunk/*); do
    p4 grep -s -i -e the_search_string "$dir/..."
done
Actually, solved this one myself. p4 grep indeed does the trick. Doc here. You have to carefully narrow it down before it'll work properly - on our server at least you have to get it down to < 10000 files. I also had to redirect the output to a file instead of printing it out in the console, adding > output.txt, because there's a limit of 4096 chars per line in the console and the file paths are quite long.
It's not something you can do with the standard Perforce client tools. One helpful command might be p4 print, but I would not expect it to be much faster than syncing.
This is a big 'if', but if you have access to the server you can run Agent Ransack on the Perforce server's storage directory. Perforce stores all versioned files on disk; only the metadata is in a database.
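A minimal sketch of that server-side approach, assuming shell access and a hypothetical P4ROOT location (text revisions are kept in RCS-style ,v archive files, so a match may come from any revision, not just the head):
grep -ril "the_search_string" /path/to/p4root/depot/
This lists the archive files that contain the string somewhere in their revision history; the ,v archive paths generally mirror the corresponding depot paths.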
