Identifying matching lines across multiple files in Linux

I'm working on a script to help me identify matching lines across multiple files. I've had luck with the comm command:
comm -12 <(sort file1.txt) <(sort file2.txt)
This works and can be daisy-chained to cover any number of files. My problem is that with more than two files the command becomes very hard to read. I'm wondering whether there is a sleeker option available, or whether I should dig in and write a script to do the work for me. It would be a simple loop and maybe a prompt to collect file names.
I have a working solution; I'm just trying to see if there is a better way that doesn't involve reinventing the wheel. Thoughts appreciated.
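For illustration, the "simple loop" idea could look roughly like the sketch below. It keeps a running intersection in a temporary file and narrows it with comm -12 once per input file. The function name common_lines and the cl_ variable names are made up for this example, and it assumes mktemp is available and that every file is sorted under the same locale.

```shell
# common_lines FILE1 FILE2 [FILE3 ...]
# Hypothetical helper: prints the lines that occur in every file given.
common_lines() {
    if [ "$#" -lt 2 ]; then
        echo "usage: common_lines FILE1 FILE2 [FILE3 ...]" >&2
        return 1
    fi
    cl_acc=$(mktemp) || return 1
    cl_tmp=$(mktemp) || { rm -f "$cl_acc"; return 1; }

    # Start with the sorted, de-duplicated lines of the first file.
    sort -u "$1" > "$cl_acc"
    shift

    # Narrow the running intersection with each remaining file.
    # comm reads the second input from stdin when given "-".
    for cl_file in "$@"; do
        sort -u "$cl_file" | comm -12 "$cl_acc" - > "$cl_tmp"
        mv "$cl_tmp" "$cl_acc"
    done

    cat "$cl_acc"
    rm -f "$cl_acc"
}
```

It would be called as common_lines file1.txt file2.txt file3.txt. Because each file is passed through sort -u, duplicate lines inside a single file are reported only once.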

Related

Can we use awk, sed and grep simultaneously for string operation in linux

I got this question in an interview: can we use awk, sed, and grep simultaneously?
I am not sure why or how, but is it possible to use all of them simultaneously to manipulate strings from a file?

That makes little sense as asked: you use awk, sed and grep to alter a string or to filter information from it, so why would you want to do all of that at the same time? I believe the question is about using several of these commands in one single line, like grep "INF" file.txt | awk '{print $1}'. That is one command line containing both grep and awk: grep keeps only the lines containing "INF", and awk then shows only the first column. Strictly speaking the shell starts both processes at once and they run concurrently, but they do not do the same job simultaneously: each line passes through grep first and through awk afterwards.

The correct answer is yes, because awk has co-processes (the |& operator, a GNU awk extension): you can start the other two from awk.
The interviewer wanted to hear you reason about why this is or is not a good idea, and to gauge your level of competence with these tools.
For example, in some scenarios this could be a really bad idea, because you can end up editing the file while reading from it, leading to confusing or unpredictable results.
Generally you don't need grep if you are using awk, and finally, IMHO, mixing sed and awk in a "simultaneous" script is poor form because it can be hard to understand what's being done.
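To illustrate the point that grep is usually redundant next to awk, here is a minimal sketch comparing the pipeline from the earlier answer with a single awk call. The contents of file.txt are invented sample lines, not data from the question.

```shell
# Sample input (hypothetical log lines, for illustration only).
printf '%s\n' '10:01 INF service started' \
              '10:02 ERR disk full' \
              '10:03 INF service stopped' > file.txt

# Two processes: grep filters the lines, awk prints the first column.
grep "INF" file.txt | awk '{print $1}'

# One process: awk does both the filtering and the printing.
awk '/INF/ {print $1}' file.txt
```

Both commands print the same two timestamps. The awk-only form saves a process and keeps the whole rule in one place.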

Merge, sort, maintain line order

This probably sounds contradictory, so let me explain. I have a number of log files that log4j writes to different files and rotates. What I want to do is merge them into fewer files.
How I started to go about doing this:
- use awk to concat multi-line entries into one line into a separate file.
- cat awk output files to 1 file.
- sort the cat file
- awk to separate the concatenated lines.
But I see that the sort is putting entries with the same second/ms in a different order than they appeared in their original output file. It may not be a HUGE deal. But, I don't like it. Any ideas for how I go about doing what I want (maintaining their original line order while sorting)? I would rather not write my own program and would like to use native linux utils if possible. But, I am open to the "best" way of doing this (Perl, Python, etc..).
I thought about cat'ing the output files from highest to lowest (log4j rotate files) so that I wouldn't have to sort. But that only solves the problem for files writing to the same log file (file1.0.log, file1.1.log, etc..). But this doesn't help when needing to merge file2 with file1.
Thank you,
Gregg

What you are talking about is "stable" sorting. There is a -s (--stable) option on sort that should give you what you want.
Stability in sorting algorithms
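
As a small sketch of how the stable sort behaves, assume each already-joined log entry starts with its timestamp in the first whitespace-separated field. The file names and log lines below are invented for the example. Note that -s is provided by GNU and BSD sort but is not part of POSIX.

```shell
# Two hypothetical logs whose entries share the same timestamps.
printf '%s\n' '10:00:01 A first' '10:00:02 A second' > a.log
printf '%s\n' '10:00:01 B first' '10:00:02 B second' > b.log

# Sort on the timestamp field only (-k1,1). With -s, entries whose
# keys compare equal stay in the order in which they were read,
# so a.log entries come before b.log entries for the same timestamp.
cat a.log b.log | sort -s -k1,1 > merged.log
cat merged.log
```

Without -s, sort falls back to comparing the whole line when the keys are equal, which is what reorders entries that share a timestamp.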

Ensure Linux (suse) Programs Level across multiple servers with cksum

We have a GOLD image that new servers are imaged from when they are created.
Over time some of these servers have drifted out of sync due to poorly managed rollouts.
I would like to scan the bin folders on all of these servers and compare them to what the GOLD image has, writing the result to an output file (i.e. if a file is different, flag it; if it is the same, say Same; if it is missing, say Missing; if it is there but not on gold, say Addition?).
I was going to accomplish this as shown below.
On the Gold Image, run the following example:
# Loop over the glob instead of parsing ls, so cksum gets a full, quoted path.
for x in /bin/*
do
cksum "$x" >> /data/OnGold.lst
done
Distribute this file to all of the servers along with another script that will execute the same thing with a different log name.
After that script executes, another script will diff the two files and report on the differences, based on the cksum or on whether files are missing from, or in addition to, OnGold.lst.
(This is where I could use some advice on the best way to achieve this. Does anyone know of open source tools that could accomplish the same thing? I'm pretty sure diff would do the trick, as it will show whether items are missing or added, but I don't know how to turn that into a report format.)
Any help would be greatly appreciated.
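
One possible shape for the reporting step, as a hedged sketch rather than a finished tool: awk reads the gold list first, then classifies every entry of the server list. The function name compare_cksum is hypothetical, the labels are the ones proposed in the question, and the sketch assumes the usual cksum output of "checksum size path", a non-empty gold list, and paths without spaces.

```shell
# compare_cksum GOLD_LIST SERVER_LIST
# Hypothetical helper: prints one "Status path" line per file.
compare_cksum() {
    awk '
        # First file (gold list): remember checksum and size per path.
        NR == FNR { gold[$3] = $1 " " $2; next }

        # Second file (server list): classify each path against gold.
        {
            seen[$3] = 1
            if (!($3 in gold))              print "Addition", $3
            else if (gold[$3] == $1 " " $2) print "Same", $3
            else                            print "Different", $3
        }

        # Anything on gold that the server list never mentioned.
        END { for (p in gold) if (!(p in seen)) print "Missing", p }
    ' "$1" "$2"
}
```

For example, compare_cksum /data/OnGold.lst /data/OnServer.lst | sort would group the report by status. The second file name here is only a placeholder for whatever log name the per-server script uses.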

Something like .htaccess in Linux

I have a directory with a lot of files (more than 4,000,000). All filenames follow the same pattern:
PREFIX-XXXXXX-YY.ext
where
XXXXXX contains letters and digits
YY contains digits
ext is the file extension (.txt, .jpg)
The directory structure itself is 12MB, so listing or searching this directory takes a long time. I split the content of this directory into subdirectories based on the filename, more precisely on the first character of XXXXXX in the pattern above,
i.e.
main_directory/A/PREFIX-AXXXXX-YY.ext
main_directory/B/PREFIX-BXXXXX-YY.ext
main_directory/1/PREFIX-1XXXXX-YY.ext
Is there an easy way in Linux to make a rule so that when I type a command such as
test:/home/usr/admin # ls main_directory/PREFIX-AXXXXX-*
I get a list of filenames from the main_directory/A/ directory? This rule MUST apply only to main_directory.

You can't have this at file-system layer, not without creating links and circling back to your original problem. I can think of two easy ways out.
Take 1: scripting
You could write a short script to rewrite the names for you.
Suppose you had a rewrite script that took PREFIX-AXXXX-* and outputted main_directory/A/PREFIX-AXXXX-*. You could then change your ls line to:
$ ls `rewrite PREFIX-AXXXXX-*`
This can be easily accomplished with sed, awk or any other on-the-fly text transformation tool.
Shell programs are composable for a reason! :)
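As a rough sketch of such a rewrite helper, written here as a shell function around sed: it assumes the pattern always starts with a prefix followed by a single hyphen and that the character right after that hyphen names the subdirectory, as in the question's layout. The -E flag is supported by GNU and BSD sed.

```shell
# rewrite PATTERN
# Hypothetical helper: turns PREFIX-AXXXXX-* into
# main_directory/A/PREFIX-AXXXXX-* by reusing the first character
# after the prefix as the subdirectory name.
rewrite() {
    printf '%s\n' "$1" | sed -E 's|^([^-]+-)(.)|main_directory/\2/\1\2|'
}
```

Quote the pattern when calling it, so the shell does not try to expand it in the current directory, and leave the command substitution unquoted so the rewritten pattern is expanded: ls $(rewrite 'PREFIX-AXXXXX-*').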
Take 2: embed a faster file-system
You could do away with the restructuring and rewriting names by using a faster file-system, mounted in your main directory. XFS sounds good for this. It should remove your performance concerns without further ado.
This requires a deeper understanding of what's going on to be effective for day-to-day usage, however.
Edit: Here's an article on how to create virtual user-space file-systems.
Edit 2: actually no, I don't think XFS would cut it. Maybe another file-system, though.

Merge two files in linux and ignore any repetition

Can anyone provide me with a shell script in Linux that merges two files and saves the result in a third file? However, if any lines are common to both files, those lines should be saved only once. Please ask if you need any more details. Thanks in advance!

Simplest way:
cat one two | sort -u > third
But this is probably not what you want...
You mentioned merging in your question: what do you mean by that? If it's not as simple as I assumed in my code above, provide sample files and tell us what you want to achieve.
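
If the original line order matters, one alternative to sort -u is the awk idiom below, shown as a sketch. It prints a line only the first time it is seen, so nothing is reordered. The file names follow the command above and the file contents are invented sample data.

```shell
# Hypothetical sample files.
printf '%s\n' 'pear' 'apple' 'fig' > one
printf '%s\n' 'apple' 'kiwi' 'pear' > two

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true exactly once per distinct line and awk prints it then.
awk '!seen[$0]++' one two > third
cat third
```

Here third ends up containing pear, apple, fig, kiwi, in that order: all of one, followed by the lines of two that were not already present.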
