How to Randomize Columns of a Text File in Linux

I would like to take a text file and for each line randomize the words/columns. The files could contain millions of rows, so efficiency is important. I've tried the Google route, but everything I find is related to sorting lines randomly and not the columns.
For example taking a simple file like this (I'll use numbers, but they could be words):
111 222 333 444 555
555 666 777 888 999 000
000 333 555 777
The output might look like the following:
222 111 555 444 333
777 555 666 000 999 888
777 333 000 555

Perl to the rescue:
perl -MList::Util=shuffle -lne 'print join " ", shuffle split' < input.txt > output.txt
-l appends a newline after print
-n processes the input line by line
split splits the input line on whitespace
shuffle (from List::Util) shuffles the list randomly
join " " creates one string from the list, putting a space between members.

Related

shell script to find frequency of array elements

I want to find the frequency of array elements.
array=(111 111 222 111 777 555 666 777)
I found this command:
(IFS=$'\n'; sort <<< "${array[*]}") | uniq -c
That prints:
3 111
1 222
1 555
1 666
2 777
But I want to print the array element first and then its frequency as a percentage, like this:
111 %
222 %
555 %
666 %
777 %
This does what you ask for:
#!/usr/bin/env bash
array=(111 111 222 111 777 555 666 777)
(
IFS=$'\n'
sort <<< "${array[*]}"
) |
uniq -c |
xargs -l1 bash -c '
pc=$(bc -l <<<"(100*$1)/$0")
LC_NUMERIC=C
printf "%s %02.02f%%\n" "$2" "$pc"
' "${#array[#]}"
xargs -l1 bash -c: take each line and run the inline bash script, with the array size plus the line's two fields passed in as arguments.
The inline bash script itself:
# Receives the array size as argument 0
# The count as argument 1
# The value as argument 2
# Computes the percent with bc calculator
pc=$(bc -l <<<"(100*$1)/$0")
# Switches numeric format to C, POSIX
# so Bash printf "%f" can understand the output from bc
LC_NUMERIC=C
# Format the value and its percent frequency
printf "%s %02.02f%%\n" "$2" "$pc"
Sample output:
111 37.50%
222 12.50%
555 12.50%
666 12.50%
777 25.00%
Also much simpler with awk:
#!/usr/bin/env bash
array=(111 111 222 111 777 555 666 777)
(IFS=$'\n';awk '{arr[$1]++}END{for(k in arr)printf"%s %02.02f%%\n",k,(100*arr[k])/NR}' <<< "${array[*]}")
The awk script:
{
# Populates an associative array
# with argument 1 as key, and occurrences counter as value
arr[$1]++
}
# Once the lines are parsed
END {
# Print and format the associative array
# with its key and the percent frequency
for(k in arr) {
printf "%s %02.02f%%\n", k, (100*arr[k])/NR
}
}
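If the values sit in a plain file, one per line, instead of a bash array, the same awk script can read the file directly; a sketch assuming a hypothetical values.txt:
# values.txt holds one value per line, e.g. the eight numbers above
awk '{arr[$1]++} END{for(k in arr) printf "%s %02.02f%%\n", k, (100*arr[k])/NR}' values.txt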

linux command to delete the last column of csv

How can I write a Linux command to delete the last column of a tab-delimited CSV?
Example input
aaa bbb ccc ddd
111 222 333 444
Expected output
aaa bbb ccc
111 222 333
It is easier to remove the first field than the last, so we reverse the content, remove the first field, and then reverse it again.
Here is an example for a "CSV"
rev file1 | cut -d "," -f 2- | rev
Replace the "file1" and the "," with your file name and the delimiter accordingly.
You can use cut for this. You specify a delimiter with option -d and then give the field numbers (option -f) you want to keep in the output. Each line of the input is treated individually. For the four-column example above, keeping everything except the last field looks like this:
cut -d$'\t' -f 1-3 < my.csv > new.csv
If you instead want to strip a column from the middle, list the ranges around it, for example:
cut -d$'\t' -f 1-2,4 < my.csv > new.csv
The $'\t' is bash notation for a string containing a single tab character. Note that cut needs fixed field numbers, so this only works when every line has the same number of columns.
You can use the command below, which deletes the last column of a tab-delimited CSV regardless of the number of fields:
sed -r 's/(.*)\s+\S+$/\1/'
For example:
echo "aaa bbb ccc ddd 111 222 333 444" | sed -r 's/(.*)\s+\S+$/\1/'

Linux shell command to copy text data from a file to another

file_1 contents:
aaa 111 222 333
bbb 444 555 666
ccc 777 888 999
file_2 contents:
ddd
eee
fff
How do I copy only part of the text from file_1 to file_2, so that file_2 becomes:
ddd 111 222 333
eee 444 555 666
fff 777 888 999
Try with awk:
awk 'NR==FNR{a[FNR]=$2FS$3FS$4;next} {print $0, a[FNR]}' file_1 file_2
Explanation:
NR is the overall input line number and FNR is the line number within the current file; you can see that by running:
$ awk '{print NR,FNR}' file_1 file_2
1 1
2 2
3 3
4 1
5 2
6 3
So, the condition NR==FNR is only true when reading the first file, and that's when the columns $2, $3, and $4 get saved in a[FNR]. After reading file_1, the condition NR==FNR becomes false and the block {print $0, a[FNR]} is executed, where $0 is the whole line in file_2.
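If the two files are guaranteed to line up row for row, a paste/cut sketch gives the same result (assuming single-space delimiters, as in the example):
# glue fields 2- of file_1 onto the matching lines of file_2
paste -d' ' file_2 <(cut -d' ' -f2- file_1)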

Linux: check duplicate from a list of txt

What is the best way to remove duplicates from a list of lists?
I have a lot of txt files:
List1.txt
111
222
333
444
...
List300.txt
555
666
777
888
Now I have a new file, List301.txt, but I need to check for duplicates against the others and remove them.
List301.txt
111
666
999
aaa
bbb
I was trying a set-style approach, like this:
cat List* |sort |uniq -u |xargs -i grep {} List* > ListFinal.txt
List1.txt:222
List1.txt:333
List1.txt:444
List300.txt:555
List300.txt:777
List300.txt:888
List301.txt:999
List301.txt:aaa
List301.txt:bbb
Is there a better way to list only 999, aaa and bbb, or to remove 111 and 666 from List301.txt?
Thanks~
If you have to keep the new file (List301.txt) in the same directory, you can do it with gawk:
awk -v fn="List301.txt" 'FILENAME!=fn{a[$0];next}{b[$0]}
END{for(x in b)if(!(x in a))print x}' *.txt
Just change the fn value to apply the one-liner to your new file.
If you can first move the new file to another directory, say new/, you could:
awk 'FILENAME!="new/List301.txt"{a[$0];next}!($0 in a)' *.txt new/List301.txt
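A sort-based sketch does the same job without awk, assuming the older lists really are named List1.txt through List300.txt (so brace expansion can name them without picking up the new file) and that they all exist:
# lines of List301.txt that appear in none of the older lists
comm -13 <(sort -u List{1..300}.txt) <(sort List301.txt)
comm -13 prints only the lines unique to the second input, which here means the genuinely new entries 999, aaa and bbb.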

Slice 3TB log file with sed, awk & xargs?

I need to slice several TB of log data, and would prefer the speed of the command line.
I'll split the file up into chunks before processing, but need to remove some sections.
Here's an example of the format:
uuJ oPz eeOO 109 66 8
uuJ oPz eeOO 48 0 221
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 2 9 771
mxmx lo uUui 577 765 27878456
The gaps between the first 3 alphanumeric strings are spaces. Everything after that is tabs. Lines are separated with \n.
I want to keep only the last line in each group.
If there's only 1 line in a group, it should be kept.
Here's the expected output:
uuJ oPz eeOO 9 674 3
kf iiiTti oP 88 909 19
mxmx lo uUui 577 765 27878456
How can I do this with sed, awk, xargs and friends, or should I just use something higher level like Python?
awk -F '\t' '
NR==1   {key=$1}                 # remember the key (space-separated prefix) of the first line
$1!=key {print line; key=$1}     # key changed: emit the last line of the previous group
        {line=$0}                # always remember the current line
END     {print line}             # emit the last line of the final group
' file_in > file_out
Try this:
awk 'BEGIN{FS="\t"}
{if($1!=prevKey) {if (NR > 1) {print lastLine}; prevKey=$1} lastLine=$0}
END{print lastLine}'
It saves the last line and prints it only when it notices that the key has changed.
This might work for you:
sed ':a;$!N;/^\(\S*\s\S*\s\S*\)[^\n]*\n\1/s//\1/;ta;P;D' file
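A different sketch, not taken from the answers above: reverse the file, keep the first line seen for each key, and reverse back. It needs two passes through the data plus one array entry per distinct key, so weigh that against the size of the log:
tac file_in | awk -F'\t' '!seen[$1]++' | tac > file_out
Unlike the streaming awk versions, this keeps only the final occurrence of each key even when a key reappears later in the file.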
