Easy way to merge files in Linux, based on line timestamp? - linux

We currently have a process which grabs distinct log files off a remote system and places them all in a single consolidated file for analysis.
The lines are all of the form:
2023-02-08 20:39:32 Textual stuff goes here.
so the process is rather simple:
cat source_log_file_* | sort > consolidated_log_file
Now this works fine for merging the individual files into a coherent, ordered file, but it has the problem that it also reorders lines within each source log file where they have the same timestamp. For example, the left side below is modified to the right side:
2023-02-08 20:39:32 First ==> 2023-02-08 20:39:32 First
2023-02-08 20:39:32 Second ==> 2023-02-08 20:39:32 Fourth
2023-02-08 20:39:32 Third ==> 2023-02-08 20:39:32 Second
2023-02-08 20:39:32 Fourth ==> 2023-02-08 20:39:32 Third
This makes analysis rather difficult, as the sequence within a source log file is changed.
I could temporarily insert a sequence number (per source file) between the timestamp and the text, and remove it from the consolidated file, but I was wondering if it were possible to do a merge of the files based on timestamp rather than a sort.
By that, I mean: open every source log file (each already sorted correctly) and, until they're all processed, grab the line with the earliest timestamp and append it to the consolidated file. This way, the order of lines is preserved within each source log file, while the lines from the separate files are sequenced correctly.
I can write a program to do that if need be, but I was wondering if there was an easy way to do it with standard tools.

You need to use just the timestamp as the sort key, not the whole line. Then use the --stable option to keep the lines in their original order if they have the same timestamp.
sort -k 1,2 --stable source_log_file_* > consolidated_log_file
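Since each source file is already sorted, GNU sort's merge mode does exactly the merge described in the question, without re-sorting. A minimal sketch with the same key options:
# -m merges already-sorted inputs instead of sorting them; with -s,
# lines whose keys compare equal are taken from the earlier file first
sort -m -s -k 1,2 source_log_file_* > consolidated_log_file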

Related

Cannot append to a file: Append replaces the content

The following command does not append but replaces the content
echo 0 >> /sys/block/nvme0n1/queue/nomerges
I don't want to replace the content but append to it. But I'm curious: is there something special about this file?
It also doesn't allow more than one character as its input.
Look at https://serverfault.com/questions/865787/what-does-the-nomerge-mean-in-linux-system
It might help you understand that there are only 3 values the file can take.
Also:
nomerges enables the user to disable the lookup logic involved with IO
merging requests in the block layer. By default (0) all merges are
enabled. When set to 1 only simple one-hit merges will be tried. When
set to 2 no merge algorithms will be tried (including one-hit or more
complex tree/hash lookups).
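Appending has no meaning here: sysfs attributes are not regular files, so each write() is handed to the driver's store() handler, which parses the value and sets it. A quick illustration, assuming the same nvme0n1 device from the question (run as root):
echo 2 > /sys/block/nvme0n1/queue/nomerges    # disable all merge attempts
echo 0 >> /sys/block/nvme0n1/queue/nomerges   # ">>" still just sets the value
cat /sys/block/nvme0n1/queue/nomerges         # prints 0; the earlier 2 is gone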

Bash Script Efficient For Loop (Different Source File)

First of all, I'm a beginner in bash scripting. I usually code in Java, but this particular task requires me to create some bash scripts in Linux. FYI, I've already made a working script, but I think it's not efficient enough because of the large files I'm dealing with.
The problem is simple: I have 2 logs that I need to compare, making some corrections in one of them... I'll call them log A and log B. These 2 logs have different formats; here is an example:
01/04/2015 06:48:59|9691427842139609|601113090635|PR|30.00|8.3|1.7| <- log A
17978712 2015-04-01 06:48:44 601113090635 SUCCESS DealerERecharge.<-log B
17978714 2015-04-01 06:48:49 601113090635 SUCCESS DealerERecharge.<-log B
As you can see, there is a gap in the timestamps. The log B line that actually matches log A is the one with ID 17978714, because it is the closest in time. The largest time gap I've seen is 1 minute. I can't use range logic, because if there is more than one line in log B within the 1-minute range, all of those lines will show up in my regenerated log.
The script I made contains a for loop which iterates over the timestamps of log A until it hits something in log B (the first one it hits is the closest).
Inside the for loop I have this line of code which makes the loop slow.
LINEOUTPUT=$(less $File2 | grep "Validation 1" | grep "Validation 2" | grep "Timestamp From Log A")
I've read some samples using sed, but the problem is I have 2 more validations to consider before matching on the timestamp.
The validations work as a filter to narrow down the exact match between logs A and B.
Additional info: I tried doing some benchmark tests on the script by timing the loop. One thing I've noticed is that even though I only use 1 pipeline in that script, each loop iteration is still slow.
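Most of the per-iteration cost comes from spawning less and several greps for every line of log A; reading log B once and doing the matching in a single pass avoids that. A rough, untested gawk sketch, assuming log A is pipe-delimited with the timestamp in field 1 and the subscriber number in field 3, log B is whitespace-delimited as in the samples above, and gawk is available (mktime is a gawk extension):
# closest-match.awk -- usage: gawk -f closest-match.awk logB logA
function epoch_b(d, t,    a, x) {            # d = "YYYY-MM-DD", t = "HH:MM:SS"
    split(d, a, "-"); split(t, x, ":")
    return mktime(a[1] " " a[2] " " a[3] " " x[1] " " x[2] " " x[3])
}
function epoch_a(s,    p, a, x) {            # s = "DD/MM/YYYY HH:MM:SS"
    split(s, p, " "); split(p[1], a, "/"); split(p[2], x, ":")
    return mktime(a[3] " " a[2] " " a[1] " " x[1] " " x[2] " " x[3])
}
NR == FNR {                                  # first file: log B
    n[$4]++                                  # group entries by subscriber number
    bid[$4, n[$4]] = $1
    bts[$4, n[$4]] = epoch_b($2, $3)
    next
}
{                                            # second file: log A (pipe-delimited)
    split($0, f, "|")
    ta = epoch_a(f[1]); key = f[3]
    best = ""; bestdiff = 61                 # ignore gaps of over one minute
    for (i = 1; i <= n[key]; i++) {
        d = bts[key, i] - ta; if (d < 0) d = -d
        if (d < bestdiff) { bestdiff = d; best = bid[key, i] }
    }
    print $0, (best != "" ? "matched B id " best : "no match")
}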

Matching text files from a list of system numbers

I have ~60K bibliographic records, which can be identified by system number. These records also hold full text (individual text files named by system number).
I have lists of system numbers in bunches of 5K, and I need to find a way to copy only the text files from each 5K list.
All text files are stored in a directory (/fulltext) and are named something along these lines:
014776324.txt.
The 5K lists are plain text files stored in separate directories (e.g. /5k_list_1, /5k_list_2, ...), where each system number matches a .txt file.
For example: bibliographic record 014776324 matches 014776324.txt.
I am struggling to find a way to copy only the corresponding text files into the 5k_list_* folders.
Any idea?
Thanks indeed,
Let's assume we invoke the following script this way:
./the-script.sh fulltext 5k_list_1 5k_list_2 [...]
Or more succinctly:
./the-script.sh fulltext 5k_list_*
Then try using this (totally untested) script:
#!/usr/bin/env bash
set -eu                      # abort on errors and on unset variables
src_dir=$1                   # first argument: directory to copy files from
shift 1
for list_dir; do             # iterates over the remaining arguments
    # assumes each line of the list file starts with a system number
    while read -r sys_num _rest; do
        cp "$src_dir/$sys_num.txt" "$list_dir/"
    done < "$list_dir/list.txt"
done

Bash to get timestamp from file list and compare it to filename

Implementing a Git repository for a project, we are including the DB structure by generating a dump in the post-commit hook on deployment.
What I would like to have is a simple versioning system for the file, based on the timestamp of the last change to the table structure.
After finding this post, with the suggestion to check the dates of the *.frm files in the MySQL data dir, I thought the solution would be to implement it based on that last date as part of the generated file. That is:
1. Find out the latest date-time of the DB's files (e.g. in /var/lib/mysql/databaseX/) via an ls command (of the type ls -la *.frm).
2. Compare that value (the last-changed file) with that of a certain file (e.g. /project/dump_2012102620001.sql), where the numbers correspond to the last generated dump.
3. If a file's timestamp is after that date, generate the mysqldump command; otherwise skip, so the dump does not get generated and committed as a change to Git.
Unfortunately my Linux console/bash knowledge is far from up to the task, and I have not found any similar script to use.
You can use [[ file1 -ot file2 ]] to test whether file1 is older than file2.
last=$(ls -tr /path/to/db/files/*.frm | tail -n1)    # most recently modified .frm
dump=/project/dump_2012102620001.sql                 # the last generated dump
if [[ $dump -ot $last ]]; then
    create_new_dump
fi
You can save yourself a lot of grief by just dumping the table structure every time with the appropriate mysqldump command; this is relatively lightweight, since it won't include the table contents. Strip out the variable timestamp information at the top and compare with the previous file. Store it if it's different.
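A minimal sketch of that approach, assuming a database named databaseX and hypothetical file locations (--no-data makes mysqldump emit structure only; the grep drops the comment lines that change between runs):
mysqldump --no-data databaseX \
  | grep -vE '^-- (MySQL dump|Host:|Dump completed)' > /tmp/schema.sql
if ! cmp -s /tmp/schema.sql /project/schema.sql; then
    mv /tmp/schema.sql /project/schema.sql    # structure changed: store it
fi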

Filename manipulation in cygwin

I am running cygwin on Windows 7. I am using a signal processing tool and basically performing alignments. I had about 1200 input files. Each file is of the format given below.
input_file_format = "AC_XXXXXX.abc"
The first step required building indexes for all the input files; this was done with the tool's build-index command, and now each file has 6 indexes associated with it. Therefore I now have about 1200*6 = 7200 index files. The indexes are of the form given below.
indexes_format = "AC_XXXXXX.abc.1",
"AC_XXXXXX.abc.2",
"AC_XXXXXX.abc.3",
"AC_XXXXXX.abc.4",
"AC_XXXXXX.abc.rev.1",
"AC_XXXXXX.abc.rev.1"
Now, I need to use these indexes to perform the alignment. All the 6 indexes of each file are called together and the final operation is done as follows.
signal-processing-tool ..\path-to-indexes\AC_XXXXXX.abc ..\Query file
Where AC_XXXXXX.abc is the index prefix associated with that particular input file. All 6 index files are called with AC_XXXXXX.abc*.
My problem is that I need to use only the first 14 characters of the index file names for the final operation.
When I use the code below, the alignment is not executed.
for file in indexes/*; do ./tool $file|cut -b1-14 Project/query_file; done
I'd appreciate help with this!
First of all, keep in mind that $file will always start with "indexes/", so taking the first 14 characters would always include that folder name at the beginning.
To use first 14 characters in a variable, use ${file:0:14}, where 0 is the starting string index, and 14 is the length of the desired substring.
Alternatively, if you want to use cut, you need to run it in a command substitution: for file in indexes/*; do ./tool $(echo $file|cut -c 1-14) Project/query_file; done. (I changed the argument for cut to -c, for characters instead of bytes.)
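Putting both points together, a minimal untested sketch that strips the directory name first, keeps the question's 14-character prefix, and runs the tool only once per prefix (bash 4+ for the associative array; the tool name and paths are placeholders from the question):
declare -A seen
for file in indexes/*; do
    name=$(basename "$file")               # e.g. AC_XXXXXX.abc.1
    prefix=${name:0:14}                    # first 14 characters, per the question
    [[ ${seen[$prefix]:-} ]] && continue   # all 6 indexes share one prefix
    seen[$prefix]=1
    ./tool "indexes/$prefix" Project/query_file
done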
