Split sorted file without cutting blocks - linux

I have a large file (20 GB) consisting of 30 million data records. The first field of each record is a non-unique key. The file is sorted by this key. I'd like to split this file into chunks, using anything available in a bash shell, such that the chunks are approximately the same size in bytes and all records with the same key go into the same chunk in the same order as in the original file.
I obviously do not want awk -F";" '{print > $1}' theFile because I'd prefer on the order of 10 large chunks, not one file per key. Also, split alone won't cut it, because I need a way to keep identical keys together.

You can pre-process the file such that split knows where it is allowed to split.
Here, we insert the null byte \0 to mark that splitting is allowed. Afterwards, we remove all \0 from the generated files. This assumes that your original data never contains \0.
awk -F\; '$1!=last {last=$1; if(NR>1) printf "\0"} 1' file > file.tmp
split -t\\0 --filter 'tr -d \\0 > "$FILE"' -n l/10 file.tmp file.
rm file.tmp
You can adapt split's options to your liking. Here we split into 10 chunks. Due to our pre-processing and the changed delimiter -t\\0, the chunk option l/… keeps identical keys together.
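If you would rather cap each chunk at a byte size instead of fixing the number of chunks, split's -C (--line-bytes) option should combine with the same record separator. An untested sketch along the same lines (the 2G size is only an example):
split -t\\0 --filter 'tr -d \\0 > "$FILE"' -C 2G file.tmp file.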
To verify that everything worked, you can run
for i in file.*; do
    echo "--- $i ---"
    head -n1 "$i"
    echo "[...]"
    tail -n1 "$i"
done
I generated a testfile using
n=1""000""000; paste -d\; <(shuf -i1-500 -rn$n) <(shuf -rn$n /usr/share/dict/words) | sort -t\; -k1,1n > file
and got
--- file.aa ---
1;abbesses
[...]
55;Zoroaster
--- file.ab ---
56;abase
[...]
107;zoologists
--- file.ac ---
108;abattoir
[...]
and so on.

You mention split as a tag, but are you aware that the UNIX/Linux command split exists specifically for this purpose?
The man-page mentions (amongst others):
-b, --bytes=SIZE
put SIZE bytes per output file
There are plenty of examples all over the internet.
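For example, to cut the 20 GB file into roughly 2 GB pieces (the output prefix chunk. is just a placeholder):
split -b 2G theFile chunk.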
Edit: apparently split is not a good option
I've created a file, try.txt, with the following content:
x1 a b
x1 b c
x1 c d
x2 a b
x2 a d
x3 a b
x3 a c
x3 b c
x3 c c
First, I need to know how many lines there are per key:
Linux prompt>cat try.txt | awk '{print $1}' | sort | uniq -c
3 x1
2 x2
4 x3
(Remark: uniq -c shows the count of unique entries)
So, there are 3 "x1" lines, 2 "x2" lines, and 4 "x3" lines. Now let's take those parts:
Linux prompt>head -n 3 try.txt
x1 a b
x1 b c
x1 c d
Linux prompt>head -n $((3+2)) try.txt | tail -n 2
x2 a b
x2 a d
Linux prompt>head -n $((3+2+4)) try.txt | tail -n 4
x3 a b
x3 a c
x3 b c
x3 c c
It's not an entirely scripted solution, but I guess it might be helpful for you.
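To script the same idea, something along these lines should work (a rough sketch, assuming the space-separated try.txt layout; the part.* output names are made up):
start=1
cut -d' ' -f1 try.txt | uniq -c | while read count key; do
    # lines start..start+count-1 all share the same key
    head -n $((start + count - 1)) try.txt | tail -n "$count" > "part.$key"
    start=$((start + count))
done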

Related

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every line that has the same first two columns.
For example, the data will look like:
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems to be working, but since I am learning to code better, I was wondering if there is a better way to go about this. Plus, my code generates several intermediate files that I then have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand the awk program; associative arrays and fields.
If you reference an awk array element that does not exist yet, it is created as an empty container -- ready for anything you put into it. You can use that as a counter.
You state "If first two columns are equal..."
The sort puts the file in the desired order. The expression a[$1,$2] uses the values of the first two fields as a unique key into an associative array.
You then state "...select top 3 based on descending order of 3rd column..."
Once again, the sort puts the file into the desired order, and the expression a[$1,$2]++ counts the entries for that key. Now just count up to three.
awk programs are organized into blocks of condition { action }. The condition a[$1,$2]++<3 is true for the first three lines of each key and becomes false once more than 3 of the same pattern have been seen.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
A great place to start is the online book Effective AWK Programming by Arnold D. Robbins.
@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file by the first two columns primarily and by the 3rd column in reverse numeric order secondarily, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
    if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
        ((c++))
    else
        c=0
    fi
    if (( c < 3 )) ; then
        echo $c1 $c2 $n
        l1=$c1
        l2=$c2
    fi
done

Finding the number of substrings in a file

I'm trying to write a very small program that will count the number of substrings in a large text file. All it will do is take the first 2000 lines of the text file, find any "TTT" substrings, count them, and set a variable to that total. I'm a bit new to shell, so any help would be greatly appreciated!
#!/bin/bash
$counter=(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
For what it's worth, you might find awk better suited for this task:
awk -F"ttt" '{j=(NF-1)+j}END{print j}' filename
This will split each record in your file by delimiter "ttt". Then it counts the number of fields, subtracts one, and adds that to the total.
A file like:
ttt tttttt something
1 5 ttt
tt
one more ttt record
Would be split (visualizing with pipe delim) like:
| || something
1 5 |
tt
one more | record
Counting the number of fields per record:
4
2
1
2
Subtracting one from that:
3
1
0
1
Which totals to 5, which is how many "ttt" substrings are present.
To incorporate this into your script (and to fix your other issue):
#!/bin/bash
counter=$(awk -F"ttt" '{j=(NF-1)+j}END{print j}' filename)
echo $counter
The change here is that when we set a variable in Bash we don't include the $ sign at the front. Only in referencing the variable do we include the $.
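A minimal illustration of the difference:
counter=5        # assignment: no $ on the left-hand side
echo "$counter"  # reference: here the $ is needed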
You have some minor syntax errors there; you probably meant this:
counter=$(head -2000 [file name] | grep TTT | grep -o TTT | wc -l)
echo $counter
Notice the tiny changes I made there to make it work.
By the way, the grep TTT in the middle is redundant; you can simply drop it, that is:
counter=$(head -2000 [file name] | grep -o TTT | wc -l)
grep can also count for you: counter=$(grep -c TTT "$infile") counts the matching lines (not individual occurrences, so a line with two TTTs counts once). You can limit the search with -m NUM, --max-count=NUM, which makes grep stop at the end of the file or as soon as NUM matching lines have been found.
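A small sketch of the difference (file and variable names are placeholders), restricted to the first 2000 lines as in the question:
lines=$(head -2000 file.txt | grep -c TTT)         # lines containing TTT at least once
hits=$(head -2000 file.txt | grep -o TTT | wc -l)  # every TTT occurrence, as in the question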

How to add the number of identical lines next to the line itself? [duplicate]

I have file file.txt which look like this
a
b
b
c
c
c
I want to know the command which takes file.txt as input and produces the output
a 1
b 2
c 3
I think uniq is the command you are looking for. The output of uniq -c is a little different from your format, but this can be fixed easily.
$ uniq -c file.txt
1 a
2 b
3 c
If you want to count the occurrences, you can use uniq with -c.
If the file is not sorted, you have to sort it first:
$ sort file.txt | uniq -c
1 a
2 b
3 c
If you really need the line first followed by the count, swap the columns with awk
$ sort file.txt | uniq -c | awk '{ print $2 " " $1}'
a 1
b 2
c 3
You can use this awk:
awk '!seen[$0]++{ print $0, (++c) }' file
a 1
b 2
c 3
seen is an array that holds only the unique items: the first time an index is populated, its value is incremented from 0 to 1, so the condition is true only for the first occurrence of a line. In the action we print the record and an incrementing counter.
Update: based on the comment below, if the intent is to get a repeat count in the 2nd column, then use this awk command:
awk 'seen[$0]++{} END{ for (i in seen) print i, seen[i] }' file
a 1
b 2
c 3

Joining a pair of lines with specific starting points

I know that with
cat current.txt | sed 'N;s/\n/,/' > new.txt
I can convert
A
B
C
D
E
F
to
A,B
C,D
E,F
What I would like to do is the following:
A
B
C
D
E
F
to
A,D
B,E
C,F
I'd like to join 1 with 4, 2 with 5, 3 with 6 and so on.
Is this possible with sed? Any idea how it could be achieved?
Thank you.
Try printing in columns:
pr -s, -t -2 current.txt
This is longer than I was hoping, but:
$ lc=$(( $(wc -l current.txt | sed 's/ .*//') / 2 ))
$ paste <(head -"$lc" current.txt) <(tail -"$lc" current.txt) | column -t -o,
The variable lc stores the number of lines in current.txt divided by two. Then head and tail are used to print the first lc and last lc lines, respectively (i.e. the first and second half of the file); then paste puts the two halves side by side and column changes the tabs to commas.
An awk version
awk '{a[NR]=$0} NR>3 {print a[NR-3]","$0}' current.txt
A,D
B,E
C,F
This solution is easy to adjust if you want a different interval.
Just change NR>3 and NR-3 to the desired number.
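For example, to pair line 1 with line 3, line 2 with line 4, and so on (an interval of 2):
awk '{a[NR]=$0} NR>2 {print a[NR-2]","$0}' current.txt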

"Minus" operation on two files using Linux commands

I have 4 files sorted alphabetically, A, B, C, and D.
These files contain a single string on each line.
Essentially, what needs to happen is that anything in B gets deleted from A.
The result of that will then be stripped of anything in C.
And similarly, the result of that will be stripped of D.
Is there a way to do this using Linux commands?
comm is good for this, either:
cat B C D | sort | comm -2 -3 A -
or:
comm -2 -3 A B | comm -2 -3 - C | comm -2 -3 - D
depending on what's easier/clearer for your script.
grep -x -v -f B A | grep -x -v -f C | grep -x -v -f D
The -v switch is an inverse match (i.e. match all except). The -f switch takes a file with a list of patterns to match. The -x switch forces it to match whole lines (so that lines that are substrings of other lines don't cause the longer lines to be removed).
Look at the join command, in particular its -v option: join -v 1 prints the lines of the first file that have no match in the second file, which is exactly the "minus" operation you need (both inputs must be sorted, which yours are).
join -v1 A B | join -v1 - C | join -v1 - D
