.txt delete lines containing - text

There are two text files: 1.txt and 2.txt. 1.txt contains URLs separated by line breaks and 2.txt contains words separated by line breaks, one word per line. I want to delete URLs from 1.txt that contain words from 2.txt. What is the most convenient way to do that?
For example:
1.txt:
website1.com
website2word1.com
webword2site3.com
2.txt:
word1
word2
After processing, 1.txt should look like this:
website1.com
The files are quite large. The first file contains a million lines (that's after being split, there are multiple files) and the second one contains 10,000 lines.

You can simply write a Java program that reads each URL from 1.txt, checks it against the words in 2.txt, and writes the URL to 3.txt only if it contains none of those words.
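As an alternative to a full Java program, the same filtering can be sketched on the command line with grep. This is a minimal sketch, not a tuned solution: it assumes every line of 2.txt is a literal word (no regular expressions) and that 2.txt has no empty lines, since an empty pattern would match every URL. The scratch directory and the name 3.txt are only for illustration.

```shell
# Work in a scratch directory so the example does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' website1.com website2word1.com webword2site3.com > 1.txt
printf '%s\n' word1 word2 > 2.txt

# -F: treat each line of 2.txt as a fixed string, not a regex
# -v: keep only the lines that match none of the patterns
# -f: read the patterns from a file
grep -vFf 2.txt 1.txt > 3.txt

cat 3.txt
```

Once 3.txt looks right, it can replace the original with mv 3.txt 1.txt. Writing to a separate file first avoids truncating 1.txt while grep is still reading it.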

Related

Linux Sort words alphabetically and make a file for each letter

I want to write a shell script which automatically creates 26 dictionary files, where the first file contains all the words starting with a or A, the second all the words starting with b or B, and so on, with each dictionary file sorted. For example, if I had a file that had the words Lime, Apple, Orange, Avacado, Apricot, Lemon, then I want a new file that contains, in order, Apple, Apricot, Avacado, a file that contains just Orange, and a file that contains Lemon, Lime.
I thought about doing this using sort, so it could be:
sort sample.txt
but that would not put each section of words into a new file. So I thought of doing:
sort sample.txt > [a-z].txt
but that just makes one new file titled [a-z].txt
How do I make different alphabetically sorted files from the list of words in the file? I want it to be like a.txt, b.txt, etc with each containing all the words that start with that letter.
You can do this with awk:
awk '{ print $0 >> toupper(substr($0,1,1))"_wordsfile" }' <(sort wordsfilemaster)
Here wordsfilemaster is the original dictionary file. sort runs on that file and its output is fed to awk through process substitution. awk then appends each line to a file whose name is the first character of the line, converted to upper case, followed by "_wordsfile", so files such as A_wordsfile or O_wordsfile get created.
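If the lowercase names from the question (a.txt, b.txt, ...) are preferred, the same idea can be adapted. The following is a sketch under a few assumptions: one word per line, only words starting with an ASCII letter are wanted, and the letter files may be overwritten on each run. sample.txt is the file name from the question; the scratch directory is only for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' Lime Apple Orange Avacado Apricot Lemon > sample.txt

# LC_ALL=C keeps the sort order and the [a-z] range predictable across locales.
# sort -f ignores case, so "apple" and "Apple" sort next to each other.
LC_ALL=C sort -f sample.txt | LC_ALL=C awk '
{
    first = tolower(substr($0, 1, 1))
    # Skip lines that do not start with an ASCII letter.
    if (first ~ /^[a-z]$/)
        print > (first ".txt")
}'

cat a.txt l.txt o.txt
```

Inside awk, `>` truncates each output file only when it is first opened during the run, so later lines for the same letter are appended in sorted order. Only letters that actually occur get a file, so fewer than 26 files may be created.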

how to print two strings in a line one with space delimiter and another between two strings in Linux

I have a file with more than 100 lines.
But only some lines have specific pattern like abc.
My question is that I want two things to print
5th word of line which has pattern abc.
words between 2 distinct strings (xxx, yyy).
Say for example my file has the content below:
This is first line.
Second line has abc pattern with xxx as first separator and yyy as second separator.
This is third line.
Again fourth line has same pattern abc with separators xxx and yyy.
And so on.
The required output is like below:
pattern as first separator and
same and
I tried many approaches in Linux, but whenever I managed to print the 5th word I could not print the content between xxx and yyy, and vice versa.
Can anyone help me, please?
Let me answer your question:
My question is that I want two things to print
5th word of line which has pattern abc.
words between 2 distinct strings (xxx, yyy).
You can use awk for both parts of your question:
awk '/abc/{print $5}' input_file.txt
awk '/xxx.*yyy/{if(match($0,"xxx.*yyy")){print substr($0,RSTART,RLENGTH)}}' input_file.txt
If you need to combine both requirements in one command:
awk '/abc/{print $5} /xxx.*yyy/{if(match($0,"xxx.*yyy")){print substr($0,RSTART,RLENGTH)}}' input_file.txt
OUTPUT:
pattern
xxx as first separator and yyy
same
xxx and yyy
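The output above includes the markers xxx and yyy and puts the two results on separate lines. To get closer to the single-line output requested in the question, a variant along the following lines might work. It is a sketch that assumes the markers are exactly the three-character strings xxx and yyy and that both appear on every line containing abc; output.txt is a hypothetical file name.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > input_file.txt <<'EOF'
This is first line.
Second line has abc pattern with xxx as first separator and yyy as second separator.
This is third line.
Again fourth line has same pattern abc with separators xxx and yyy.
EOF

awk '/abc/ && match($0, /xxx.*yyy/) {
    # Drop the 3-character markers from both ends of the match.
    between = substr($0, RSTART + 3, RLENGTH - 6)
    # Trim the spaces left over next to the markers.
    gsub(/^ +| +$/, "", between)
    print $5, between
}' input_file.txt > output.txt

cat output.txt
```

For the sample input this prints "pattern as first separator and" and "same and", which matches the required output in the question.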

Example Grep search pattern to search for all text between two words

I have multiple text files from which I need to extract a variable number of characters between two specific words, "".
Can someone give me an example grep pattern that will find all characters of any kind, including spaces, between these two words so I can then replace with a blank space? Thank you.
I don't have any example code I can put in my question, I am using a text editing program and I would like to find all the text between two unique words in the file and delete it, the text editing program allows the use of grep patterns.
You can use the grep command below to match a pattern such as word1, followed by any other text, followed by word2:
grep -o -E "word1(\b).*word2(\b)" file.txt
As you may notice this command's output also includes word1 and word2.
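Since the goal in the question is to replace the text between the two words with a blank space, a substitution may be closer to what is needed than a plain search. Below is a command-line sketch with sed, on the assumption that both words sit on the same line; word1, word2 and the sample sentence are placeholders. A text editor with grep-style search and replace would use a similar pattern, though its replacement syntax may differ.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'keep word1 remove all of this word2 keep' > file.txt

# Capture the two boundary words and put a single space between them.
# .* is greedy, so it runs up to the last word2 on the line.
sed -E 's/(word1).*(word2)/\1 \2/' file.txt > cleaned.txt

cat cleaned.txt
```

This works line by line, so text that spans several lines between word1 and word2 is not touched by this sketch.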

For every string in file1.txt check if it exists in file2.txt then do something

I have two text files, file1.txt and file2.txt.
Both of them have a single string on each line. The strings in file1.txt are unique (no duplicates), as are the strings in file2.txt.
The files have different numbers of strings.
file1.txt    file2.txt
FFF          AAA
GGG          BBB
ZZZ          CCC
             ZZZ
I'd like to compare those files so that, for every string in file1.txt, if it exists in file2.txt then it's OK. If not, then write that string to another file (file3.txt).
In this example, file3.txt would be:
file3.txt
FFF
GGG
I'd like to use the command shell, doing something like:
cat file1.txt | while read a; do something on file2.txt ...
but that is not compulsory.
See the man page for grep, specifically the -f option.
grep -vf file2.txt file1.txt
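By default grep treats each line of file2.txt as a regular expression and matches it anywhere in a line, which can give surprising results if the strings contain characters such as "." or if one string is a substring of another. A slightly stricter sketch, assuming whole-line literal comparison is what is wanted:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' FFF GGG ZZZ > file1.txt
printf '%s\n' AAA BBB CCC ZZZ > file2.txt

# -F: patterns are fixed strings, not regular expressions
# -x: a pattern must match the whole line
# -v: print the lines of file1.txt that match no pattern
# -f: read the patterns from file2.txt
grep -Fxvf file2.txt file1.txt > file3.txt

cat file3.txt
```

With the sample data from the question, file3.txt ends up containing FFF and GGG.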
Your best bet would be to read in the input from file 2, put it in a sorted list (or even better, a balanced search tree) and then as you read in each line from file1, go through the tree or do a binary search of the list to find if the string exists.
The idea is that you want to do the processing once, to make the list of allowed values as easy to check as possible. Putting them in a binary search tree means that you first compare a string against the word in the middle (alphabetically) of list 2. If it comes before that word, you take the left branch (which contains the words that come before the word you just compared to); if it comes after, you only have to look at the right branch.
Similarly, if using a list, you look at the word in the middle of the list and then can remove half of the remaining list from consideration each iteration. This means you only have to do log n steps to check each of the words in List1 against the n words in list2.
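The same "preprocess file2 once, then look up each string of file1" idea can be sketched in awk. Note that this swaps in awk's built-in associative array (a hash lookup) instead of the sorted list or search tree described above, so it illustrates the approach rather than the binary search itself. It assumes file2.txt fits in memory and is not empty; with an empty file2.txt the NR == FNR test would also be true for file1.txt.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' FFF GGG ZZZ > file1.txt
printf '%s\n' AAA BBB CCC ZZZ > file2.txt

# First file (NR == FNR): remember every line of file2.txt as an array key.
# Second file: print only the lines of file1.txt that are not among the keys.
awk 'NR == FNR { seen[$0]; next } !($0 in seen)' file2.txt file1.txt > file3.txt

cat file3.txt
```

Each lookup is a single array membership test, so file1.txt is read only once and neither file needs to be sorted.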

Paste two text lists (one list a file) into one list separated by semicolon

An example of the process/output would be:
File1:
hello
world
File2:
foo
bar
Resulting file after concatenation:
File3:
hello;foo
world;bar
This is for a large list of arbitrary text (no wildcards, but the lines are aligned as above).
I cannot figure out how to do this with the paste command under Ubuntu.
paste -d';' File1 File2 > File3
cat concatenates by lines (or, more accurately, doesn't care what the contents are).
What you seem to need is something more like paste.
$ paste -d\; file1 file2
hello;foo
world;bar
