Merge, sort, maintain line order - linux

This probably sounds contradictory, so let me explain. I have a number of logs that are written through log4j to different files and rotated. What I want to do is merge them into fewer files.
How I started to go about doing this:
- use awk to join each multi-line entry into one line, writing the result to a separate file.
- cat the awk output files into 1 file.
- sort the concatenated file.
- use awk to split the joined lines apart again.
But I see that sort is putting entries with the same second/ms in a different order than they appeared in their original output file. It may not be a HUGE deal, but I don't like it. Any ideas on how to do what I want (maintaining the original line order while sorting)? I would rather not write my own program and would like to use native Linux utilities if possible, but I am open to the "best" way of doing this (Perl, Python, etc.).
I thought about cat'ing the output files from highest to lowest (log4j rotated files) so that I wouldn't have to sort. But that only solves the problem for files belonging to the same log (file1.0.log, file1.1.log, etc.); it doesn't help when I need to merge file2 with file1.
Thank you,
Gregg

What you are describing is "stable" sorting. GNU sort has a -s (--stable) option that should give you what you want.
See also: Stability in sorting algorithms
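A minimal sketch of how -s might be applied here. The file names, the sample log lines and the assumption that the timestamp occupies the first two whitespace-separated fields are all hypothetical; adjust -k to match the real log4j layout. Note that -s only helps when the sort key is restricted to the timestamp, since otherwise the whole line is compared.

```shell
#!/bin/sh
# Hypothetical sample data: two entries in file1.log share the same millisecond.
workdir=$(mktemp -d)
printf '%s\n' \
  '2024-01-01 10:00:00,001 zeta first' \
  '2024-01-01 10:00:00,001 alpha second' > "$workdir/file1.log"
printf '%s\n' \
  '2024-01-01 10:00:00,000 from file2' > "$workdir/file2.log"

# -k1,2 : compare only fields 1-2 (assumed here to be the date and time)
# -s    : stable sort, so lines with equal keys keep their input order
# LC_ALL=C gives plain byte-wise comparison, independent of the locale
LC_ALL=C sort -s -k1,2 "$workdir/file1.log" "$workdir/file2.log" \
  > "$workdir/merged.log"

cat "$workdir/merged.log"
```

With this input, "zeta first" stays ahead of "alpha second" even though a whole-line comparison would reverse them.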

Related

Identifying matching lines across multiple files in Linux

I'm working on a script to help me identify matching lines across multiple files. I've had luck with the comm command:
comm -12 <(sort file1.txt) <(sort file2.txt)
This works and can be daisy-chained to expand to X number of files. My problem is that anything more than two files starts becoming very unreadable. I'm wondering if there is a sleeker option available or if I should dig in and write a script to do the work for me. It'd be a simple loop and maybe a prompt to collect file names.
I have a working solution; I'm just trying to see if there is a better way that doesn't involve reinventing the wheel. Thoughts appreciated.
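One possible shape for the loop mentioned above, as a sketch only. The function name common_lines is made up for illustration. It uses sort -u rather than plain sort, so duplicate lines within a file are collapsed, which differs slightly from the comm command shown in the question.

```shell
#!/bin/sh
# common_lines FILE...  (hypothetical helper)
# Prints the lines that appear in every file given, by intersecting
# the files one at a time with comm -12.
common_lines() {
  acc=$(mktemp) || return 1
  LC_ALL=C sort -u "$1" > "$acc"
  shift
  for f in "$@"; do
    next=$(mktemp) || return 1
    # comm needs both inputs sorted in the same collation order
    LC_ALL=C sort -u "$f" | LC_ALL=C comm -12 "$acc" - > "$next"
    mv "$next" "$acc"
  done
  cat "$acc"
  rm -f "$acc"
}

# Usage: common_lines file1.txt file2.txt file3.txt
```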

Ensure Linux (suse) Programs Level across multiple servers with cksum

We have a GOLD image that new servers are imaged from when they are created.
Over time some of these servers have drifted out of sync due to poorly managed rollouts.
I would like to scan the bin folders of all of these servers and compare them to what the GOLD image has, writing the result to an output file. (i.e., if a file differs, flag it as different; if it is the same, say Same; if it is missing, say Missing; if it is present but not on GOLD, say Addition?)
I was going to accomplish this as shown below.
On the GOLD image, run the following example.
# Glob /bin directly so cksum receives full paths (parsing ls output yields bare names)
for x in /bin/*
do
cksum "$x" >> /data/OnGold.lst
done
Distribute this file to all of the servers, along with another script that does the same thing but writes to a different log name.
After that script executes, another script will diff the two files and report on the differences, based on the cksum or on whether files are missing from or in addition to OnGold.lst.
(This is where I could use some advice on the best way to achieve this, or on any open source tools that could accomplish the same thing. I'm pretty sure diff would do the trick, as it will show whether items are missing or added, but I don't know how to format its output as a report.)
Any help would be greatly appreciated.
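One way the report step could look, as a sketch and not a finished tool. The function name compare_cksum_lists is hypothetical. It assumes both lists are in cksum's usual "checksum size path" format, that paths contain no whitespace, and that the gold list is not empty.

```shell
#!/bin/sh
# compare_cksum_lists GOLD_LIST SERVER_LIST  (hypothetical helper)
# Prints one line per file: Same, Different, Missing or Addition.
compare_cksum_lists() {
  awk '
    # First file (gold list): remember checksum and size for each path
    NR == FNR { gold[$3] = $1 " " $2; next }
    # Second file (server list): classify each path against the gold list
    {
      seen[$3] = 1
      if (!($3 in gold))                 print "Addition " $3
      else if (gold[$3] == ($1 " " $2))  print "Same " $3
      else                               print "Different " $3
    }
    # Anything in the gold list that the server never reported
    END { for (f in gold) if (!(f in seen)) print "Missing " f }
  ' "$1" "$2" | LC_ALL=C sort -k2,2
}

# Usage: compare_cksum_lists /data/OnGold.lst /data/OnServer.lst
```

The second file name in the usage line is only a placeholder for whatever the per-server script writes.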

Something like .htaccess in Linux

I have a directory with a lot of files (more than 4,000,000). All filenames follow the same pattern:
PREFIX-XXXXXX-YY.ext
where
XXXXXX contains letters and digits
YY contains digits
ext is the file extension (.txt, .jpg)
The directory structure itself is 12MB, so listing or searching this directory takes a long time. I split the contents of this directory into subdirectories based on the filename, more precisely on the first character of XXXXXX in the pattern above.
e.g.
main_directory/A/PREFIX-AXXXXX-YY.ext
main_directory/B/PREFIX-BXXXXX-YY.ext
main_directory/1/PREFIX-1XXXXX-YY.ext
Is there an easy way in Linux to make a rule so that when I type a command such as
test:/home/usr/admin # ls main_directory/PREFIX-AXXXXX-*
I will get a list of filenames from main_directory/A/ directory? This rule MUST work only for main_directory.

You can't have this at the file-system layer, not without creating links and circling back to your original problem. I can think of two easy ways out.
Take 1: scripting
You could write a short script to rewrite the names for you.
Suppose you had a rewrite script that took PREFIX-AXXXXX-* and output main_directory/A/PREFIX-AXXXXX-*. You could then change your ls line to:
$ ls `rewrite PREFIX-AXXXXX-*`
This can be easily accomplished with sed, awk or any other on-the-fly text transformation tool.
Shell programs are composable for a reason! :)
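A minimal sketch of such a rewrite helper, written as a shell function. It assumes the literal PREFIX- and main_directory names from the question, and that the character right after PREFIX- selects the subdirectory. The pattern has to be quoted when it is passed in so the shell does not expand it too early.

```shell
#!/bin/sh
# rewrite PATTERN...  (hypothetical helper)
# Maps PREFIX-AXXXXX-* to main_directory/A/PREFIX-AXXXXX-*
rewrite() {
  for pattern in "$@"; do
    rest=${pattern#PREFIX-}                    # drop the literal prefix
    first=$(printf '%s' "$rest" | cut -c1)     # first character of XXXXXX
    printf 'main_directory/%s/%s\n' "$first" "$pattern"
  done
}

# Usage (the unquoted $(...) lets the shell expand the rewritten glob):
#   ls $(rewrite 'PREFIX-AXXXXX-*')
```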
Take 2: embed a faster file-system
You could do away with the restructuring and rewriting names by using a faster file-system, mounted in your main directory. XFS sounds good for this. It should remove your performance concerns without further ado.
This requires a deeper understanding of what's going on to be effective for day-to-day usage, however.
Edit: Here's an article on how to create virtual user-space file-systems.
Edit 2: actually no, I don't think XFS would cut it. Maybe another file-system, though.

how to merge 2 big files [closed]

Suppose I have 2 files, 100G each, and I want to merge them into one and then delete the originals. On Linux we can use
cat file1 file2 > final_file
But that needs to read 2 big files and then write an even bigger one. Is it possible to just append one file to the other, so that no IO is required? Since the metadata of a file contains its location and length, I am wondering whether it is possible to change the metadata to do the merge, so that no IO happens.

Can you merge two files without writing one file onto the other?
Only in obscure theory. Since disk storage is always based on blocks, and filesystems therefore store things on block boundaries, you could only append one file to another without rewriting if the first file ended perfectly on a block boundary. There are some rare filesystem configurations that use tail packing, but that would only help if the first file were already using the tail block of the previous file.
Unless that perfect scenario occurs or your filesystem is able to mark a partial block in the middle of the file (I've never heard of this), this won't work. Just to kick the edge case around, there's also no way, short of changing the kernel interface, to make such a call (re: Link to a specific inode)
Can we make this better than doubling the size of both files?
Yes, we can use the append (>>) operation instead.
cat file2 >> file1
That will still use the space consumed by file2 twice over until we can delete it.
Can we avoid using extra space?
No. Unless somebody comes back with something I don't know, you're basically out of luck there. It's possible to truncate a file, forgetting about the existence of the end of it, but there is no way to forget about the existence of the start unless we get back to modifying inodes directly and altering the kernel interface to the filesystem, since that's definitely not a POSIX operation.
What about writing a little bit at a time, then deleting what we wrote?
No again. Since we can't chop the start of a file off, we'd have to rewrite everything from the point of interest all the way to the end of the file. This would be very costly for IO and only useful after we've already read half the file.
What about sparse files?
Maybe! Sparse files allow us to store a long string of zeroes without using up nearly that much space. If we were to read file2 in large chunks starting at the end, we could write those blocks to the end of file1. file1 would immediately look (and read) as if it were the same size as both, but it would be corrupted until we were done because everything we hadn't written would be full of zeroes.
Explaining all this is another answer in itself, but if you can do a sparse allocation, you would be able to use only your chunk read size plus a little bit extra in disk space to perform this operation. For a reference talking about sparse blocks in the middle of files, see http://lwn.net/Articles/357767/ or do a search involving the term SEEK_HOLE.
Why is this "maybe" instead of "yes"? Two parts: you'd have to write your own tool (at least we're on the right site for that), and sparse files are not universally respected by file systems and other processes alike. Fortunately you probably won't have to worry about other processes respecting your file, but you will have to worry about setting the right flags and making sure your filesystem is amenable. Last of all, you'll still be reading and re-writing the length of file2, which isn't what you want. This method does mean you can append with just a small amount of extra disk space, though, rather than using at least 2*file2 worth of space.
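A rough sketch of the chunked approach described above, assuming GNU dd (for the skip_bytes and seek_bytes flags) and GNU truncate. The function name append_in_place and the default chunk size are made up for illustration. It is not crash-safe: if it is interrupted, file1 contains a hole and file2 has already lost its tail.

```shell
#!/bin/sh
# append_in_place DST SRC [CHUNK_BYTES]  (hypothetical helper)
# Moves SRC onto the end of DST one chunk at a time, starting from the
# end of SRC and truncating SRC after each chunk has been copied.
append_in_place() {
  dst=$1
  src=$2
  chunk=${3:-1048576}
  dst_size=$(wc -c < "$dst")
  src_size=$(wc -c < "$src")
  # Offset of the last (possibly partial) chunk in SRC
  offset=$(( (src_size - 1) / chunk * chunk ))
  while [ "$offset" -ge 0 ]; do
    # Copy one chunk to its final position; writing past the current end
    # of DST leaves a sparse hole that earlier chunks will fill in later
    dd if="$src" of="$dst" bs="$chunk" count=1 \
       iflag=skip_bytes,fullblock oflag=seek_bytes \
       skip="$offset" seek=$(( dst_size + offset )) \
       conv=notrunc status=none || return 1
    # Give the copied chunk's space back before reading the next one
    truncate -s "$offset" "$src"
    offset=$(( offset - chunk ))
  done
  rm -f "$src"
}

# Usage: append_in_place file1 file2
```

Whether the extra disk usage really stays near one chunk depends on the filesystem freeing truncated blocks promptly and supporting sparse files.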

You can do it like this:
cat file2 >> file1
file1 will become the full content.

No, it is not possible to merge (on Linux) two big files by working on their meta-data.
You might consider some kind of database for your work.
As Alexandre noticed, you can append one big file to another, but this still requires a lot of data copying.

Merge two files in linux and ignore any repetition

Can anyone provide me with a shell script in Linux that merges two files and saves the result in a third file? However, if there is any common data in both files, the common lines should only be saved once. Please ask if you need any more details. Thanks in advance!!

Simplest way:
cat one two | sort -u > third
But this is probably not what you want...
You mentioned merging in your question: what do you mean by that? If it's not as simple as I assumed in my code above, provide sample files and tell us what you want to achieve.
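If the reordering done by sort is the problem, one common alternative is an awk filter that keeps only the first occurrence of each line and preserves the input order. This is a sketch under the assumption that "common data" means identical whole lines. The function name merge_unique is made up for illustration.

```shell
#!/bin/sh
# merge_unique FILE1 FILE2 OUTPUT  (hypothetical helper)
# Concatenates FILE1 and FILE2, printing each distinct line only the
# first time it is seen, so the original order is preserved.
merge_unique() {
  awk '!seen[$0]++' "$1" "$2" > "$3"
}

# Usage: merge_unique one two third
```

Unlike sort -u, this keeps every distinct line in memory, which may matter for very large files.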
