bash - remove improper words - linux

I have a file with a bunch of words, many of which don't make much sense, such as 'completemakes', or even numbers mixed with letters/words. What I need is a tool to spell-check them: if a word exists in the dictionary, leave it; if not, delete it.
What would be a good way of doing this in bash?
Thanks

You can script Aspell.

I had some fun with getting a single quote character in here, but hey, it should be as hard to read as it was to write, right? (assuming your words are listed in words.txt)
awk 'system("grep -i -q " "'"'"'^"$0"$'"'"'" " /usr/share/dict/words") == 0 {print $0};' words.txt
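If you'd rather script Aspell itself, here is a minimal sketch (it assumes one word per line in words.txt and that aspell is installed; misspelled.txt and clean.txt are arbitrary names). "aspell list" prints the words it does not recognize, which we then subtract from the file:
# Collect the words Aspell flags as misspelled, then drop them
# from the word list with an exact, whole-line match (-x -F).
aspell list < words.txt > misspelled.txt
grep -v -x -F -f misspelled.txt words.txt > clean.txt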

Related

How to capitalize and replace characters in shell script in one echo

I am trying to find a way to capitalize and replace dashes of a string in one echo. I do not have the ability to use multiple lines for reassigning the string value.
For example:
string='test-e2e-uber' needs to echo $string as TEST_E2E_UBER
I currently can do one or the other by utilizing
${string^^} for capitalization
${string//-/_} for replacement
However, when I try to combine them it does not appear to work (bad substitution error).
Is there a correct syntax to achieve this?
echo ${string^^//-/_}
This does not directly answer your question, but the following script achieves what you want:
declare -u string='test-e2e-uber'
echo ${string//-/_}
You can do that directly with the 'tr' command, in just one 'echo'
echo "$string" | tr "-" "_" | tr "[:lower:]" "[:upper:]"
TEST_E2E_UBER
A single 'tr' can in fact do both conversions in one command (e.g. tr 'a-z-' 'A-Z_'), but I used a pipe here to keep the two steps separate
or you could do something similar with 'awk'
echo "$string" | awk '{gsub("-","_",$0)} {print toupper($0)}'
TEST_E2E_UBER
In this case, 'gsub' replaces the hyphens, then the whole record is printed in uppercase
Why do you dislike having two successive assignment statements so much? If you really hate it, you will have to resort to some external program to do the task for you, such as
string=$(tr a-z- A-Z_ <<<$string)
but I would consider it a waste of resources to create a child process for such a simple operation.
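For reference, this is the two-step version those assignments amount to, using just the two expansions already named in the question:
string='test-e2e-uber'
string=${string//-/_}    # replace the hyphens first
echo "${string^^}"       # then echo in uppercase: TEST_E2E_UBER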

concatenate two strings and one variable using bash

I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I get the following output:
_1.fastq.gze_11
_1.fastq.gze_12
The string after the variable seems to overwrite the string before it.
I appreciate any help
Regards
By the way, your idiom "for f in `cat files.csv`" should be avoided. Refer: Dangerous Backticks
while read f
do
echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
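With the sample files.csv above, that produces exactly the requested names:
$ xargs printf 'fastq/%s_1.fastq.gze\n' < files.csv
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze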
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo "fastq/${f}_1.fastq.gze"
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...
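If you would rather fix the line endings up front, one sketch (assuming GNU sed, which understands the \r escape) is to strip trailing carriage returns before reading:
# Drop the CR from CRLF endings, then loop safely with read.
sed 's/\r$//' files.csv | while read -r f
do
  echo "fastq/${f}_1.fastq.gze"
done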

Embedding quotation marks in command string generated by AWK?

I need to match all instances of strings in one file, with a master list in another. However, if my string is abc I want only that, not abcdef, abc1234 and so on.
So, a word boundary for the regex? Right now, I'm using a simple awk one liner:
cat results_file| sort -k 1| awk -F" " '{ print $1" /home/owner/file_2_search"}'|
xargs -L 1 /bin/grep -i
However, to force a word boundary, I'd need to grep string\b and the quotes (single or double) seem to be required.
In awk, \b is a special character, so you need \\b ... and then the quoted quotes ... (argh) ... Or am I missing something and overdoing this?
This is a Linux box, so presumably gawk. I have gone over quoting rules for awk, and realize this has got to be simple (and not complex ... but), but am not seeing it.
Had meant to post as an answer, not a comment. Will try to pose a more readable question, but confess to having second thoughts about doing this as a one-liner in the first place -- may be best to follow an alternate method. Appreciate the willingness to help.
--Joe
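One way to sidestep the quoting entirely is to let grep supply the word boundary via its -w (match whole words) flag; a sketch along the lines of the original pipeline:
# -w makes grep match abc but not abcdef or abc1234,
# so no \b (and no embedded quotes) is needed in the pattern.
sort -k 1 results_file | awk '{print $1}' |
  xargs -I{} grep -i -w {} /home/owner/file_2_search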

Delete Repeated Characters without back-referencing with SED

Let's say we have a file that contains
HHEELLOO
HHYYPPOOTTHHEESSIISS
and we want to delete repeated characters. To my knowledge we can do this with
s/\([A-Z]\)\1/\1/g
This is a homework problem and the professor said he wants us to try the exercises without back-referencing or extended regular expressions. Is that possible on this one? I would appreciate it if anyone could point me in the right direction, thanks!
The only reasonable way to do this is to use the right tool for the job, in this case tr:
$ tr -s 'A-Z' < file
HELO
HYPOTHESIS
If you were going to use sed for that specific problem though then it'd just be:
$ sed 's/\(.\)./\1/g' file
HELO
HYPOTHESIS
If that's not what you're looking for then edit your question to show more truly representative sample input and expected output.
Here's one way:
s/AA/A/g
s/BB/B/g
...
s/ZZ/Z/g
As a one-liner:
sed 's/AA/A/g; s/BB/B/g; ...'
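If typing out all 26 rules is the objection, the sed script can be generated instead; a sketch in bash (the generated commands still contain no back-references, so it stays within the exercise's constraint):
# Build "s/AA/A/g;s/BB/B/g; ... s/ZZ/Z/g;" and hand it to sed.
expr=''
for c in {A..Z}; do
  expr+="s/$c$c/$c/g;"
done
sed "$expr" file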

How do I grep for entire, possibly wrapped, lines of code?

When searching code for strings, I constantly run into the problem that I get meaningless, context-less results. For example, if a function call is split across 3 lines, and I search for the name of a parameter, I get the parameter on a line by itself and not the name of the function.
For example, in a file containing
...
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
grepping for MY_CONSTANT would return a line that looked like this:
MY_CONSTANT,
Likewise, in a comment block:
/////////////////////////////////////////
// FIXMESOON, do..while is the wrong choice here, because
// it makes the wrong thing happen
/////////////////////////////////////////
Grepping for FIXMESOON gives the very frustrating answer:
// FIXMESOON, do..while is the wrong choice here, because
When there are thousands of hits, single line results are a little meaningless. What I would like to do is have grep be aware of the start and stop points of source code lines, something as simple as having it consider ";" as the line separator would be a good start.
Bonus points if you can make it return the entire comment block if the hit is inside a comment.
I know you can't do this with grep alone. I also am aware of the option to have grep return a certain number of lines of context. Any suggestions on how to accomplish under Linux? FYI my preferred languages are C and Perl.
I'm sure I could write something, but I know that somebody must have already done this.
Thanks!
You can use pcregrep with the -M option (multiline matching; pcregrep is grep with Perl-compatible regular expressions). Something like:
pcregrep -M ";*\R*.*thingtosearchfor.*\R*.*;.*" file
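Along the lines of the question's own suggestion of treating ";" as the line separator, there is also a rough tr-based sketch (file.c is a stand-in name; note that the original line numbers are lost, since everything is joined onto one line first):
# Join all lines, then re-split on ';' so each statement is one line.
tr '\n' ' ' < file.c | tr ';' '\n' | grep MY_CONSTANT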
Here's an example using awk.
$ cat file
blah1
blah2
function1 ("test",
MY_CONSTANT,
(some *really) - long / expression);
function2( one , two )
blah3
blah4
$ awk -vRS=")" '/function1/{gsub(".*function1","function1");print $0RT}' file
function1 ("test",
MY_CONSTANT,
(some *really)
The concept behind it: RS is the record separator. By setting it to ")", every record in the file is separated by ")" instead of by a newline. That makes it easy to find "function1", since you can then "grep" for it. If you don't use awk, the same concept can be applied by "splitting" on ")".
You could write a command line using grep with the options that print the line number and the filename, pipe those results into awk to parse out those columns, and then use a little script of your own to display the N lines surrounding each match? :)
If this isn't an academic endeavour you could just use cscope (for C code only though). If you are willing to drop the requirement to search in comments ctags should be enough (and it also supports Perl).
I had a situation in which I had an XML file full of the names of zip files in an XML-style format, that is, with angle brackets around the names of the files, say example.zip<\stuff>
I used awk to change all the angle brackets into newlines, then used grep :)
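A rough sketch of the same idea, using tr instead of awk (files.xml is a stand-in name, and it assumes the brackets only ever delimit tags):
# Break the tag soup onto separate lines, then grep for the names.
tr '<>' '\n\n' < files.xml | grep '\.zip'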
