Filter a file with another file in bash - Linux

I have a file with numbers, for example:
$cat file
31038467
32048169
33058564
34088662
35093964
31018168
31138061
31208369
31538163
31798862
and another file, for example:
$cat file2
31208369
33058564
34088662
31538163
31038467
Then I need another file with the lines that are in the first file but not in the second:
cat $output
35093964
31018168
31138061
31798862
32048169
My real file has around 12,000,000 lines.
How can I do it?

Is
grep -f file2 -v -F -x file1
sufficient?
NOTE 1: Please specify in the question if you actually need it to be time/memory optimized.
NOTE 2: Get rid of any blank lines in file2.
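For a file with that many lines, an awk anti-join is a common alternative worth sketching. This is a hedged example, assuming one value per line in both files and that file2 fits comfortably in memory; file3 is just a placeholder name for the output:
awk 'NR==FNR { skip[$0]; next } !($0 in skip)' file2 file1 > file3
awk first loads every line of file2 as a key of the skip array (while NR==FNR), then prints only the lines of file1 that are not keys of that array, preserving file1's order and reading each file only once.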

Related

Automating and looping through a batch script

I'm new to batch scripting. I want to iterate through a list and use the output content to replace a string in another file.
ls -l somefile | grep .txt | awk 'print $4}' | while read file
do
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
done
When I run the code I get the error
sed: 1: "s/dataFile/"$torepla ...": bad flag in substitute command: '$'
Example of somefile, which has a list of file paths:
foo/name/xxx/2020-01-01.txt
foo/name/xxx/2020-01-02.txt
foo/name/xxx/2020-01-03.txt
However, my desired output is to use the list of file paths in somefile to replace a string in the content of file2. Something like this:
This is the directory of locations where data from /Team/foo/name/xxx/2020-01-01.txt ............
I'm not sure if I understand your desired outcome, but hopefully this will help you to figure out your problem:
You have three files in a directory:
TEAM/foo/name/xxx/2020-01-02.txt
TEAM/foo/name/xxx/2020-01-03.txt
TEAM/foo/name/xxx/2020-01-01.txt
And you have another file called to_be_changed.txt which contains the text This is the directory of locations where data from TO_BE_REPLACED ............ If you want to grab the filenames of your three files and insert them into to_be_changed.txt, you can do it with:
while read file
do
filename="$file"
sed "s/TO_BE_REPLACED/${filename##*/}/g" to_be_changed.txt >> changed.txt
done < <(find ./TEAM/ -name "*.txt")
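The ${filename##*/} parameter expansion strips everything up to and including the last slash, so only the bare filename (for example 2020-01-01.txt) is substituted into the line.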
And you will then have made a file called changed.txt which contains:
This is the directory of locations where data from 2020-01-02.txt ............
This is the directory of locations where data from 2020-01-03.txt ............
This is the directory of locations where data from 2020-01-01.txt ............
Is this what you're trying to achieve? If you need further clarification I'm happy to edit this answer to provide more details/explanation.
ls -l somefile | grep .txt | awk 'print $4}' | while read file
No. No, no, nono.
ls -l somefile is only going to show somefile unless it's a directory.
(Don't name a directory "somefile".)
If you mean somefile.txt, please clarify in your post.
grep .txt is going to look through the lines presented for the three characters txt preceded by any character (the dot is a regex wildcard). Since you asked for a long listing of somefile it shouldn't find any, so nothing should be passed along.
awk 'print $4}' is a typo (the opening brace is missing), so awk will fail with a syntax error.
Keep it simple. What I suspect you meant was
for file in *.txt
Then in
toreplace="/Team/$file"
sed 's/dataFile/"$toreplace"/$file/ file2 > /tmp/test.txt
it's unclear what you expect $file to be; awk's $4 from an ls -l seems unlikely.
Assuming it's the filenames from the for above, then try
sed "s,dataFile,/Team/$file," file2 > /tmp/test.txt
Does that help? Correct me as needed. Sorry if I seem harsh.
Welcome to SO. ;)

Is there a way to compare N files at once, and only leave lines unique to each file?

Background
I have five files that I am trying to make unique relative to each other. In other words, I want to make it so that the lines of text in each file have no commonality with each other.
Attempted solution
So far, I have been able to run the grep -vf command comparing one file with the other four, like so:
grep -vf file2.txt file1.txt
grep -vf file3.txt file1.txt
...
This prints the lines in file1 that are not in file2, nor in file3, etc. However, this becomes cumbersome because I would need to do it for every pairing of files. In other words, to truly reduce each file to the lines found only in that file, I would have to feed every combination of files into the grep -vf command. Given that this sounds cumbersome to me, I wanted to know...
Question
What is the command (or series of commands) in Linux to find the lines of text in each file that are mutually exclusive to all the other files?
You could just do:
awk '!a[$0]++ { out=sprintf("%s.out", FILENAME); print > out}' file*
This will write the unique lines of each input file to file.out (so, for example, the lines kept from file1.txt end up in file1.txt.out). Each line is written to the output file of the input file in which it first appears, and any later duplicates of that same line are suppressed.
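If instead you literally want each file reduced to the lines that appear in no other file (rather than first-occurrence-wins), here is a hedged grep-based sketch. It assumes the inputs are named file1.txt through file5.txt; /tmp/others.txt and the .unique suffix are just placeholder names:
for f in file[1-5].txt
do
# gather every line of the other four files, deduplicated
others=$(ls file[1-5].txt | grep -vxF "$f")
sort -u $others > /tmp/others.txt
# keep only the lines of $f that never appear in the other files
grep -vxFf /tmp/others.txt "$f" > "$f.unique"
done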

How to use sed to comment and add lines in a config-file

I am looking for a way to achieve the following:
A certain directory contains 4 (config) files:
File1
File2
File3
File4
I want my bash script to read in each of the files, one by one. In each file, look for a certain line starting with "params: ". I want to comment out this line and then in the next line put "params: changed according to my will".
I know there are a lot of handy tools such as sed to aid with these kind of tasks. So I gave it a try:
sed -ri 's/params:/^\\\\*' File1.conf
sed -ri '/params:/params: changed according to my will' File1.conf
Questions: Does the first line really substitute the regex params: with \\ following a copy of the entire line in which params: was found? I am not sure I can use the * here.
And how would I make sure these commands are executed for all four files?
This command will comment out every line beginning with params: in your files and append a new line of text after it:
sed -E -i 's/^(params:.*)$/\/\/\1\nYOUR NEW LINE HERE/g'
The pattern ^(params:.*)$ matches any whole line beginning with params:, and the parentheses make it a capturing group.
That group is then used in the replacement part of the sed command via \1, a back-reference to the first capturing group. So the replacement comments out the original line, adds a line break, and finally your new text.
You can execute this for all your files at once simply by running sed -E -i 's/^(params:.*)$/\/\/\1\nYOUR NEW LINE HERE/g' file1 file2 file3 file4
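As a hypothetical before/after (the contents of File1 below are invented purely to illustrate, and -i is dropped so the result prints to stdout):
$ cat File1
params: old value
other_setting: foo
$ sed -E 's/^(params:.*)$/\/\/\1\nparams: changed according to my will/' File1
//params: old value
params: changed according to my will
other_setting: foo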
Hope this helps!
You can do this:
for i in **conf
do
cp "$i" "$i.bak"
sed -i 's/\(params:\)\(.*\)$/#\1\2\n\1new value/' "$i"
done
With \(params:\)\(.*\):
match params: and store it in \1
match the text that follows (.*) and store it in \2
Then create two lines:
the original line, commented out: #\1\2\n
the new line with your wanted value: \1new value
This might work for you (GNU sed and parallel):
parallel --dry-run -q sed -i 's/^params:/#&/;T;aparams: bla bla' {} ::: file[1-4]
Run this in the desired directory; if the commands look correct, remove the --dry-run option and run it for real.
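If you would rather not depend on parallel, a rough equivalent as a plain loop; this is a sketch that assumes GNU sed and that the appended line is the text you actually want:
for f in file[1-4]
do
# comment out the params: line, then append the replacement line after it
sed -i 's/^params:/#&/;T;aparams: changed according to my will' "$f"
done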

How to copy data from file to another file starting from specific line

I have two files, data.txt and results.txt. Assuming there are 5 lines in data.txt, I want to copy all of these lines and paste them into results.txt starting from line number 4.
Here is a sample below:
data.txt file:
stack
ping
dns
ip
remote
results.txt file:
# here are some text
# please do not edit these lines
# blah blah..
this is the 4th line that data should go on.
I've tried sed with various combinations but couldn't make it work; I'm not sure it's the right tool for this purpose anyway.
sed -n '4p' /path/to/file/data.txt > /path/to/file/results.txt
The above code copies line 4 only, which isn't what I'm trying to achieve. As I said above, I need to copy all lines from data.txt and paste them into results.txt, but starting from line 4 and without modifying or overwriting the first 3 lines.
Any help is greatly appreciated.
EDIT:
I want to overwrite results.txt starting from line number 4 with the copied data. That is, I want to leave the first 3 lines untouched and overwrite the rest of the file with the data copied from data.txt.
Here's a way that works well from cron. Less chance of losing data or corrupting the file:
# preserve first lines of results
head -3 results.txt > results.TMP
# append new data
cat data.txt >> results.TMP
# rename output file atomically in case of system crash
mv results.TMP results.txt
You can use process substitution to give cat a FIFO holding the first three lines; write to a temporary file first so the redirection doesn't truncate results.txt before head reads it:
cat <(head -3 results.txt) data.txt > results.tmp && mv results.tmp results.txt
# stash the first 3 lines before rewriting the file, otherwise the redirection truncates it first
head -n 3 /path/to/file/results.txt > /tmp/results.head
cat /tmp/results.head /path/to/file/data.txt > /path/to/file/results.txt
If you can use awk:
awk 'NR!=FNR || NR<4' results.txt data.txt
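To actually rewrite results.txt with that output, a sketch (results.tmp is just a scratch name):
awk 'NR!=FNR || NR<4' results.txt data.txt > results.tmp && mv results.tmp results.txt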

Search for lines in a file that contain the lines of a second file

So I have a first file with an ID on each line, for example:
458-12-345
466-44-3-223
578-4-58-1
599-478
854-52658
955-12-32
Then I have a second file. It has an ID on each line followed by information, for example:
111-2457-1 0.2545 0.5484 0.6914 0.4222
112-4844-487 0.7475 0.4749 0.1114 0.8413
115-44-48-5 0.4464 0.8894 0.1140 0.1044
....
The first file only has 1000 lines, with the IDs of the info I need, while the second file has more than 200,000 lines.
I used the following bash command in a fedora with good results:
cat file1.txt | while read line; do cat file2.txt | egrep "^$line\ "; done > file3.txt
However, I'm now trying to replicate the results in Ubuntu, and the output is a blank file. Is there a reason for this not to work in Ubuntu?
Thanks!
You can grep for several strings at once:
grep -f id_file data_file
Assuming that id_file contains all the IDs and data_file contains the IDs and data.
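Since each ID should only match at the start of a line (as in your egrep "^$line\ " attempt), a hedged refinement is to turn the IDs into anchored patterns before handing them to grep; patterns.txt is just a placeholder name here:
# prefix each ID with ^ and suffix it with a space so it only matches the first field
sed 's/^/^/; s/$/ /' id_file > patterns.txt
grep -f patterns.txt data_file > file3.txt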
Typical job for awk:
awk 'FNR==NR{i[$1]=1;next} i[$1]{print}' file1 file2
This will print the lines of the second file whose first field appears in the first file. For even more speed, use mawk.
This line works fine for me in Ubuntu:
cat 1.txt | while read line; do cat 2.txt | grep "$line"; done
However, this may be slow, as the second file (200,000 lines) will be grepped 1,000 times (once for each line of the first file).
