How do I randomly merge two input files to one output file using unix tools? - linux

I have two text files, of different sizes, which I would like to merge into one file with the content mixed randomly; this is to create realistic data for some unit tests. One text file contains the true cases, the other the false ones.
I would like to use standard Unix tools to create the merged output. How can I do this?

Random sort using -R (GNU sort; note that -R sorts by a hash of each line, so duplicate lines end up adjacent rather than scattered):
$ sort -R file1 file2 -o file3

My version of sort does not support -R. So here is an alternative using awk: insert a random number in front of each line, sort on those numbers, then strip the numbers off.
awk '{print int(rand()*1000), $0}' file1 file2 | sort -n | awk '{$1=""; sub(/^ /,""); print}'

This adds a random number to the beginning of each line with awk (seeded with srand() so the order changes between runs), sorts based on that number, and then removes it. This will even work if you have duplicates (as pointed out by choroba) and is slightly more cross-platform.
awk 'BEGIN { srand() } { print rand(), $0 }' file1 file2 |
sort -n |
cut -f2- -d" "
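If GNU coreutils is available, shuf does the whole job in one step; a minimal alternative, using the same file1/file2 as above:
cat file1 file2 | shuf > file3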

Related

How can I get the second column of a very large csv file using linux command?

I was given this question during an interview. I said I could do it in Java or Python, e.g. with something like xreadlines() to traverse the whole file and fetch the column, but the interviewer wanted me to use just Linux commands. How can I achieve that?
You can use the command awk. Below is an example of printing out the second column of a file:
awk -F, '{print $2}' file.txt
And to store it, you redirect it into a file:
awk -F, '{print $2}' file.txt > output.txt
You can use cut:
cut -d, -f2 /path/to/csv/file
I'd add to Andreas' answer, but can't comment yet.
With CSV, you have to give awk a field-separator argument, or it will treat fields as bounded by whitespace instead of commas. (Obviously, CSV that uses a different field separator will need a different character to be declared.)
awk -F, '{print $2}' file.txt
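One caveat: splitting on -F, breaks if a quoted field itself contains a comma. A sketch of one workaround, assuming GNU awk (gawk), whose FPAT variable defines fields by their content rather than by a separator:
gawk -v FPAT='([^,]+)|("[^"]+")' '{print $2}' file.txt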

variable assignment is not working in rhel6 linux

file1
ABY37499|ANK37528|DEL37508|SRILANKA|195203230000|445500759
ARJU7499|CHA38008|DEL37508|SRILANKA|195203230000|445500759
IB1704174|ANK37528|DEL37508|SRILANKA|195203230000|445500759
IB1704174|CHA38008|DEL37508|SRILANKA|195203230000|445500759
ABY37500|ANK37529|DEL37509|BRAZIL|195203240000|445500757
ARJU7500|CHA38009|DEL37509|BRAZIL|195203240000|445500757
IB1704175|ANK37529|DEL37509|BRAZIL|195203240000|445500757
I want to convert the fifth-column date to another format; script below:
#!/bin/sh
dt="%Y-%m-%d %H:%M"
awk -F '|' '{print $5}' file1 | sed 's/.\{8\}/& /g'> f1.txt
aa=`(date -f f1.txt +"$dt")`
echo "$aa"
awk -F '|' '$5=$aa' file1
echo "$aa" got desired output but i cannot assign $aa to $5 please help me.
Thanks
I corrected my answer after the comment from Etan Reisner.
from AWK man:
The input is read in units called records, and processed by the rules
of your program one record at a time. By default, each record is one
line. Each record is automatically split into chunks called fields.
This makes it more convenient for programs to work on the parts of a
record.
Fields are stored in variables $1, $2, ...
And
The contents of a field, as seen by awk, can be changed within an awk
program; this changes what awk perceives as the current input record.
see the man page
thus, this expression:
awk -F '|' '$5=$aa' file1
does not have the effect of substituting the fifth column of file1. Inside the single quotes the shell never expands $aa, so awk sees an uninitialized awk variable aa, and $aa evaluates to $0, the whole record. In any case, awk only changes its in-memory copy of the record, not the file itself; you have to write the modified output to a second file.
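If the goal is to hand a shell variable to awk, the usual mechanism is the -v option. A minimal sketch, assuming a single fixed replacement value (this only helps if every row should get the same date; d is an illustrative name):
dt="1952-03-23 00:00"
awk -F'|' -v OFS='|' -v d="$dt" '{$5=d} 1' file1 > file1.new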
Maybe this could help you, using sed:
echo 195203240000 | sed -n -e "s_\(....\)\(..\)\(..\)\(..\)\(..\)_\1-\2-\3 \4:\5_p"
1952-03-24 00:00
This awk script should do what you want.
It isn't exactly pretty, but it works, assuming the input format is consistent. (Note the -F'|' and OFS='|' pair: the input is pipe-delimited, and setting OFS keeps the pipes when awk rebuilds the record.)
awk -F'|' -v OFS='|' '{$5=sprintf("%s-%s-%s %s:%s",
substr($5,1,4), substr($5,5,2), substr($5,7,2),
substr($5,9,2), substr($5,11,2))} 7' file1 > file1.new
It assigns the new value for the field to $5 and then uses 7 (as a truthy value) to get the default awk {print} action to print the modified line.
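For the first input line, for example, this produces:
ABY37499|ANK37528|DEL37508|SRILANKA|1952-03-23 00:00|445500759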

bash script: how to diff two files after stripping prefix from each line?

I have two log files. Each line is formatted as follows:
<timestamp><rest of line>
with this timestamp format:
2015-10-06 04:35:55.909 REST OF LINE
I need to diff the two files modulo the timestamps, i.e. I need to compare lines of the two files without their timestamps. What linux tools should I use?
I am on a RedHat 6 machine running bash, if that makes a difference.
You don't need to create temp files: use bash process substitution
diff <(cut -d" " -f3- log1) <(cut -d" " -f3- log2)
I would first generate the two files to compare, with the timestamp prefix removed, using the cut command like this:
cut -f 3- -d " " file_to_compare > cut_file
And then use the diff command on the two results.
You can use 'cut':
cut -b25- file1 > file1cut
cut -b25- file2 > file2cut
diff file1cut file2cut
(With the timestamp format above, the timestamp plus the following space occupy the first 24 bytes, so the line proper starts at byte 25.)
To print all fields but the first two the awk utility (and programming language) can be used:
awk '{$1=$2=""; print $0}' file1 > newfile1
awk '{$1=$2=""; print $0}' file2 > newfile2
diff newfile1 newfile2
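If you want to avoid the temporary files here too, this combines with the process substitution shown above:
diff <(awk '{$1=$2=""; print $0}' file1) <(awk '{$1=$2=""; print $0}' file2)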
Well, as you're looking for a tool, why not just use Kompare? It's very powerful and well known among developers who use Linux.
https://www.kde.org/applications/development/kompare/
https://docs.kde.org/trunk5/en/kdesdk/kompare/using.html
Kompare is a GUI front-end that enables differences between source files to be viewed and merged. It can be used to compare differences in files or the contents of folders, it supports a variety of diff formats, and it provides many options to customize the information level displayed.

Sed, Awk for combining the output of two cut statements

I'm trying to combine the outputs below into one command. The issue is that the field I'm trying to grab is in reverse order. I was told that cut doesn't support a "reverse" option and to use awk instead, but that didn't end up working for my purpose. I'm taking the output of ls -l against /dev/block to return the partitions, and automatically building a dd if=/of= pair for each output line.
I tried piping the output to awk:
cut -d' ' -f23,25 ... | awk '{print $2,$1}'
however, when I then used sed to add the prefix and suffix, the fields were not in the appropriate order.
I built the two statements below which individually return the expected output, just looking for the "right" way to combine both of these statements in the most efficient manner using sed / awk.
ls -l /dev/block/platform/msm_sdcc.1/by-name/ | cut -d' ' -f 25 | sed "s/^/dd if=/"
ls -l /dev/block/platform/msm_sdcc.1/by-name/ | cut -d' ' -f 23 | sed "s/.*/of=\/external_sd\/&.dsk/"
Any assistance will be appreciated.
Thank you.
If you're already using awk, I don't think you'll need cut or sed. You can probably do something like the following, though I'll have to trust you on the field numbers
ls -l /dev/block/platform/msm_sdcc.1/by-name | awk '{print "dd if=/"$25 " of=/" $23 ".dsk"}'
awk will split on all whitespace, not just the space character, so it's possible the fields will shift some, though it may be more reliable too.
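For what it's worth, ls -l output in a by-name directory is usually a column of symlinks ("name -> target"), and with awk's whitespace splitting those typically land in $9 and $11. A sketch under that assumption (the field numbers are a guess and may differ on your system):
ls -l /dev/block/platform/msm_sdcc.1/by-name | awk '$9 != "" {print "dd if=" $11 " of=/external_sd/" $9 ".dsk"}'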

Unix (ksh) script to read file, parse and output certain columns only

I have an input file that looks like this:
"LEVEL1","cn=APP_GROUP_ABC,ou=dept,dc=net","uid=A123456,ou=person,dc=net"
"LEVEL1","cn=APP_GROUP_DEF,ou=dept,dc=net","uid=A123456,ou=person,dc=net"
"LEVEL1","cn=APP_GROUP_ABC,ou=dept,dc=net","uid=A567890,ou=person,dc=net"
I want to read each line, parse and then output like this:
A123456,ABC
A123456,DEF
A567890,ABC
In other words, retrieve the user id from "uid=" and the identifier that follows "cn=APP_GROUP_". Repeat for each input record, writing to a new output file.
Note that the column positions aren't fixed, so I can't rely on fixed positions; I'm guessing I have to search for the "uid=" string and somehow use its position?
Any help much appreciated.
You can do this easily with sed:
sed 's/.*cn=APP_GROUP_\([^,]*\).*uid=\([^,]*\).*/\2,\1/'
The regex captures the two desired strings and outputs them in reverse order with a comma between them. You might need to adjust the context of the captures, depending on the precise nature of your data: .*uid= is greedy, so it will match up to the last uid= on the line if there is more than one.
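For example, run against the sample input (input.txt standing in for your file), it produces exactly the desired output:
sed 's/.*cn=APP_GROUP_\([^,]*\).*uid=\([^,]*\).*/\2,\1/' input.txt
A123456,ABC
A123456,DEF
A567890,ABC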
You can use awk to split into columns by ',' and then split by '=' and grab the result (note this extracts just the uid, not the group). You can do it easily as awk -F, '{print $5}' | awk -F= '{print $2}'
Take a look at this line looking at the example you provided:
cat file | awk -F, '{ print $5}' | awk -F= '{print $2}'
A123456
A123456
A567890
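To get the full uid,group output in a single pass, here is a sketch using POSIX awk's match() with RSTART/RLENGTH (input.txt again standing in for your file; the offsets 13 and 4 are the lengths of "cn=APP_GROUP_" and "uid="):
awk '{
  match($0, /cn=APP_GROUP_[^,]*/); grp = substr($0, RSTART+13, RLENGTH-13)
  match($0, /uid=[^,]*/); uid = substr($0, RSTART+4, RLENGTH-4)
  print uid "," grp
}' input.txt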
