I have some CSV/tabular data in a file, like so:
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
(They're not always numbers, just random comma-separated values. Single-digit numbers are easier for an example, though.)
I want to shuffle a random 40% of any of the columns. As an example, say the 3rd one. So perhaps 3 and 1 get swapped with each other. Now the third column is:
1 << Came from the last position
8
5
7
3 << Came from the first position
I am trying to do this in place in a file from within a bash script that I am working on, and I am not having much luck. I keep wandering down some pretty crazy and fruitless grep rabbit holes that leave me thinking that I'm going the wrong way (the constant failure is what's tipping me off).
I tagged this question with a litany of things because I'm not entirely sure which tool(s) I should even be using for this.
Edit: I'm probably going to end up accepting Rubens' answer, however wacky it is, because it directly contains the swapping concept (which I guess I could have emphasized more in the original question), and it allows me to specify a percentage of the column for swapping. It also happens to work, which is always a plus.
For someone who doesn't need this, and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).
A word of warning, however, on Rubens' solution. I took this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "";
...
}
printf "\n";
removed the printf "\n"; and moved the newline character up like this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "\n";
...
}
because just having "" on the else case was causing awk to write broken characters at the end of each line (\00). At one point, it even managed to replace my entire file with Chinese characters. Although, honestly, this probably involved me doing something extra stupid on top of this problem.
This will work for a specifically designated column, but should be enough to point you in the right direction. This works on modern bash shells including Cygwin:
paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
The operative feature is "process substitution".
The paste command joins files horizontally, and the three pieces are split from the original file via cut, with the second piece (the column to be randomized) run through the shuf command to reorder the lines. Here's the output from running it a couple of times:
$ cat test.dat
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,1,2
8,3,8,0
4,9,7,3
8,5,3,3
5,6,5,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,8,2
8,3,1,0
4,9,3,3
8,5,7,3
5,6,5,9
Algorithm:
create a vector with n pairs, from 1 to number of lines, and the respective value in the line (for the selected column), and then sort it randomly;
find how many lines should be randomized: num_random = percentage * num_lines / 100;
select the first num_random entries from your randomized vector;
you may sort the selected lines randomly, but it should be already randomly sorted;
printing output:
i = 0
for num_line, value in column; do
if num_line not in random_vector:
print value; # printing non-randomized value
else:
print random_vector[i]; # randomized entry
i++;
done
Implementation:
#! /bin/bash
infile=$1
col=$2
n_lines=$(wc -l < ${infile})
prob=$(bc <<< "$3 * ${n_lines} / 100")
# Selected lines
tmp=$(tempfile)
paste -d ',' <(seq 1 ${n_lines}) <(cut -d ',' -f ${col} ${infile}) \
| sort -R | head -n ${prob} > ${tmp}
# Rewriting file
awk -v "col=$col" -F "," '
(NR == FNR) {id[$1] = $2; next}
(FNR == 1) {
i = c = 1;
for (v in id) {value[i] = id[v]; ++i;}
}
{
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "";
if (i != col) {printf "%s%c", $i, delim; continue;}
if (FNR in id) {printf "%s%c", value[c], delim; c++;}
else {printf "%s%c", $i, delim;}
}
printf "\n";
}
' ${tmp} ${infile}
rm ${tmp}
In case you want a close approach to in-placement, you may pipe the output back to the input file, using sponge.
Execution:
To execute, simply use:
$ ./script.sh <inpath> <column> <percentage>
As in:
$ ./script.sh infile 3 40
1,7,3,2
8,3,8,0
4,9,1,3
8,5,7,3
5,6,5,9
Conclusion:
This allows you to select the column, randomly sort a percentage of entries in that column, and replace the new column in the original file.
This script goes as proof like no other, not only that shell scripting is extremely entertaining, but that there are cases where it should definitely be used not. (:
I'd use a 2-pass approach that starts by getting a count of the number of lines and read the file into an array, then use awk's rand() function to generate random numbers to identify the lines you'll change and then rand() again to determine which pairs of those lines you will swap and then swap the array elements before printing. Something like this PSEUDO-CODE, rough algorithm:
awk -F, -v pct=40 -v col=3 '
NR == FNR {
array[++totNumLines] = $0
next
}
FNR == 1{
pctNumLines = totNumLines * pct / 100
srand()
for (i=1; i<=(pctNumLines / 2); i++) {
oldLineNr = rand() * some factor to produce a line number that's in the 1 to totNumLines range but is not already recorded as processed in the "swapped" array.
newLineNr = ditto plus must not equal oldLineNr
swap field $col between array[oldLineNr] and array[newLineNr]
swapped[oldLineNr]
swapped[newLineNr]
}
next
}
{ print array[FNR] }
' "$file" "$file" > tmp &&
mv tmp "$file"
Related
Hii experts i have a big text file that contain many columns.Now i want to extract each column in separate text file serially with adding two strings on the top.
suppose i have a input file like this
2 3 4 5 6
3 4 5 6 7
2 3 4 5 6
1 2 2 2 2
then i need to extract each column in separate text file with two strings on the top
file1.txt file2.txt .... filen.txt
s=5 s=5
r=9 r=9
2 3
3 4
2 3
1 2
i tried script as below:but it doesnot work properly.need help from experts.Thanks in advance.
#!/bin/sh
for i in $(seq 1 1 5)
do
echo $i
awk '{print $i}' inp_file > file_$i
done
Could you please try following, written and tested with shown samples in GNU awk. Following doesn't have close file function used because your sample shows you have only 5 columns in Input_file. Also created 2 awk variables which will be printed before actual column values are getting printed to output file(named var1 and var2).
awk -v var1="s=5" -v var2="r=9" '
{
count++
for(i=1;i<=NF;i++){
outputFile="file"i".txt"
if(count==1){
print (var1 ORS var2) > (outputFile)
}
print $i > (outputFile)
}
}
' Input_file
In case you can have more than 5 or more columns then better close output files kin backend using close option, use this then(to avoid error too many files opened).
awk -v var1="s=5" -v var2="r=9" '
{
count++
for(i=1;i<=NF;i++){
outputFile="file"i".txt"
if(count==1){
print (var1 ORS var2) > (outputFile)
}
print $i >> (outputFile)
}
close(outputFile)
}
' Input_file
Pretty simple to do in one pass through the file with awk using its output redirection:
awk 'NR==1 { for (n = 1; n <= NF; n++) print "s=5\nr=9" > ("file_" n) }
{ for (n = 1; n <= NF; n++) print $n > ("file_" n) }' inp_file
With GNU awk to internally handle more than a dozen or so simultaneously open files:
NR == 1 {
for (i=1; i<=NF; i++) {
out[i] = "file" i ".txt"
print "s=5" ORS "r=9" > out[i]
}
}
{
for (i=1; i<=NF; i++) {
print $i > out[i]
}
}
or with any awk just close them as you go:
NR == 1 {
for (i=1; i<=NF; i++) {
out[i] = "file" i ".txt"
print "s=5" ORS "r=9" > out[i]
close(out[i])
}
}
{
for (i=1; i<=NF; i++) {
print $i >> out[i]
close(out[i])
}
}
split -nr/$(wc -w <(head -1 input) | cut -d' ' -f1) -t' ' --additional-suffix=".txt" -a4 --numeric-suffix=1 --filter "cat <(echo -e 's=5 r=9') - | tr ' ' '\n' >\$FILE" <(tr -s '\n' ' ' <input) file
This uses the nifty split command in a unique way to rearrange the columns. Hopefully it's faster than awk, although after spending a considerable amount of time coding it, testing it, and writing it up, I find that it may not be scalable enough for you since it requires a process per column, and many systems are limited in user processes (check ulimit -u). I submit it though because it may have some limited learning usefulness, to you or to a reader down the line.
Decoding:
split -- Divide a file up into subfiles. Normally this is by lines or by size but we're tweaking it to use columns.
-nr/$(...) -- Use round-robin output: Sort records (in our case, matrix cells) into the appropriate number of bins in a round-robin fashion. This is the key to making this work. The part in parens means, count (wc) the number of words (-w) in the first line (<(head -1 input)) of the input and discard the filename (cut -d' ' -f1), and insert the output into the command line.
-t' ' -- Use a single space as a record delimiter. This breaks the matrix cells into records for split to split on.
--additional-suffix=".txt" -- Append .txt to output files.
-a4 -- Use four-digit numbers; you probably won't get 1,000 files out of it but just in case ...
--numeric-suffix=1 -- Add a numeric suffix (normally it's a letter combination) and start at 1. This is pretty pedantic but it matches the example. If you have more than 100 columns, you will need to add a -a4 option or whatever length you need.
--filter ... -- Pipe each file through a shell command.
Shell command:
cat -- Concatenate the next two arguments.
<(echo -e 's=5 r=9') -- This means execute the echo command and use its output as the input to cat. We use a space instead of a newline to separate because we're converting spaces to newlines eventually and it is shorter and clearer to read.
- -- Read standard input as an argument to cat -- this is the binned data.
| tr ' ' '\n' -- Convert spaces between records to newlines, per the desired output example.
>\$FILE -- Write to the output file, which is stored in $FILE (but we have to quote it so the shell doesn't interpret it in the initial command).
Shell command over -- rest of split arguments:
<(tr -s '\n' ' ' < input) -- Use, as input to split, the example input file but convert newlines to spaces because we don't need them and we need a consistent record separator. The -s means only output one space between each record (just in case we got multiple ones on input).
file -- This is the prefix to the output filenames. The output in my example would be file0001.txt, file0002.txt, ..., file0005.txt.
I need to write a shell script that does the following which I am showing below with an example.
Suppose I have a file cars.txt which depicts a table like this
Person|Car|Country
The '|' is the separator. So the first two lines goes like this
Michael|Ford|USA
Rahul|Maruti|India
I have to write a shell script which will find the lines in the cars.txt file that has the country as USA and will print it like
USA|Ford|Michael
I am not very adept with Unix so I need some help here.
Will this do?
while read -r i; do
NAME="$(cut -d'|' -f1 <<<"$i")"
MAKE="$(cut -d'|' -f2 <<<"$i")"
COUNTRY="$(cut -d'|' -f3 <<<"$i")"
echo "$COUNTRY|$MAKE|$NAME"
done < <(grep "USA$" cars.txt)
Updated To Locate USA Not 1st Line As Provided in Your Question
Using awk you can do what you are attempting in a very simple manner, e.g.
$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS}' cars.txt
USA|Ford|Michael
India|Maruti|Rahul
Explanation
awk -F'|' read the file using '|' as the Field-Separator, specified as -F'|' at the beginning of the call, or as FS within the command itself,
/USA/ locate only lines containing "USA",
for (i = NF; i >= 1; i--) - loop over fields in reverse order,
printf "%s%s", $i, i==1 ? RS : FS - output the field followed by a '|' (FS) if i is not equal 1 or by the Record-Separator (RS) which is a "\n" by default if i is equal 1. The form test ? true_val : false_val is just the ternary operator that tests if i == 1 and if so provides RS for output, otherwise provides FS for output.
It will be orders of magnitude faster than spawning 8-subshells using command substitutions, grep and cut (plus the pipes).
Printing Only The 1st Occurrence of Line Containing "USA"
To print only the first line with "USA", all you need to do is exit after processing, e.g.
$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS; exit}' cars.txt
USA|Ford|Michael
Explanation
simply adding exit to the end of the command will cause awk to stop processing records after the first one.
While both awk and sed take a little time to make friends with, together they provide the Unix-Swiss-Army-Knife for text processing. Well worth the time to learn both. It only takes a couple of hours to get a good base by going through one of the tutorials. Good luck with your scripting.
I'm trying to find out how to use the "awk" command, in order to display a word that shows up multiple times in a file(txt). In addition, how can you display the name of this/those file/s?
ex: first sentence first file.
Second sentence followed by the second word.
This should display: "first" and "second"
I assume with -i you mean comparison / counting should be ignoring case.
If I understand your requirements correctly an command like this should work:
awk '{ for( i=1; i<=NF; i++){ cnt[ tolower( $i ) ]++; if (cnt[$i] > 1) {print $i} } }' yourfile | sort -u
It prints these words for your example:
first
second
sentence
the
If you need a case sensitive counting, just delete tolower .
For each line in the file, the script iterates through each word (the for( i=1 i <= NF; i++) loop):
increments for each word a counter ( cnt[ tolower( $i) ]++ )
if the count is larger than 1 the word is printer
the pipe to sort -u sorts the output and removes the duplicates from the output.
I'm trying to get specific columns of a csv file (that Header contains "SOF" in case). Is a large file and i need to copy this columns to another csv file using Shell.
I've tried something like this:
#!/bin/bash
awk ' {
i=1
j=1
while ( NR==1 )
if ( "$i" ~ /SOF/ )
then
array[j] = $i
$j += 1
fi
$i += 1
for ( k in array )
print array[k]
}' fil1.csv > result.csv
In this case i've tried to save the column numbers that contains "SOF" in the header in an array. After that copy the columns using this numbers.
Preliminary note: contrary to what one may infer from the code included in the OP, the values in the CSV are delimited with a semicolon.
Here is a solution with two separate commands:
the first parses the first line of your CSV file and identifies which fields must be exported. I use awk for this.
the second only prints the fields. I use cut for this (simpler syntax and quicker than awk, especially if your file is large)
The idea is that the first command yields a list of field numbers, separated with ",", suited to be passed as parameter to cut:
# Command #1: identify fields
fields=$(awk -F";" '
{
for (i = 1; i <= NF; i++)
if ($i ~ /SOF/) {
fields = fields sep i
sep = ","
}
print fields
exit
}' fil1.csv
)
# Command #2: export fields
{ [ -n "$fields" ] && cut -d";" -f "$fields" fil1.csv; } > result.csv
try something like this...
$ awk 'BEGIN {FS=OFS=","}
NR==1 {for(i=1;i<=NF;i++) if($i~/SOF/) {col=i; break}}
{print $col}' file
there is no handling if the sought out header doesn't exist so should print the whole line.
This link might be helpful for you :
One of the useful commands you probably need is "cut"
cut -d , -f 2 input.csv
Here number 2 is the column number you want to cut from your csv file.
try this one out :
awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }' filename | grep SOF | awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }'
I have a text file containing 10 hundreds of lines, with different lengths. Now I want to select N lines randomly, save them in another file, and remove them from the original file.
I've found some answers to this question, but most of them use a simple idea: sort the file and select first or last N lines. unfortunately this idea doesn't work to me, because I want to preserve the order of lines.
I tried this piece of code, but it's very slow and takes hours.
FILEsrc=$1;
FILEtrg=$2;
MaxLines=$3;
let LineIndex=1;
while [ "$LineIndex" -le "$MaxLines" ]
do
# count number of lines
NUM=$(wc -l $FILEsrc | sed 's/[ \r\t].*$//g');
let X=(${RANDOM} % ${NUM} + 1);
echo $X;
sed -n ${X}p ${FILEsrc}>>$FILEtrg; #write selected line into target file
sed -i -e ${X}d ${FILEsrc}; #remove selected line from source file
LineIndex=`expr $LineIndex + 1`;
done
I found this line the most time consuming one in the code:
sed -i -e ${X}d ${FILEsrc};
is there any way to overcome this problem and make the code faster?
Since I'm in hurry, may I ask you to send me complete c/c++ code for doing this?
A simple O(n) algorithm is described in:
http://en.wikipedia.org/wiki/Reservoir_sampling
array R[k]; // result
integer i, j;
// fill the reservoir array
for each i in 1 to k do
R[i] := S[i]
done;
// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
j := random(1, i); // important: inclusive range
if j <= k then
R[j] := S[i]
fi
done
Generate all your offsets, then make a single pass through the file. Assuming you have the desired number of offsets in offsets (one number per line) you can generate a single sed script like this:
sed "s!.*!&{w $FILEtrg\nd;}!" offsets
The output is a sed script which you can save to a temporary file, or (if your sed dialect supports it) pipe to a second sed instance:
... | sed -i -f - "$FILEsrc"
Generating the offsets file left as an exercise.
Given that you have the Linux tag, this should work right off the bat. The default sed on some other platforms may not understand \n and/or accept -f - to read the script from standard input.
Here is a complete script, updated to use shuf (thanks #Thor!) to avoid possible duplicate random numbers.
#!/bin/sh
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
# Add a line number to each input line
nl -ba "$FILEsrc" |
# Rearrange lines
shuf |
# Pick out the line number from the first $MaxLines ones into sed script
sed "1,${MaxLines}s!^ *\([1-9][0-9]*\).*!\1{w $FILEtrg\nd;}!;t;D;q" |
# Run the generated sed script on the original input file
sed -i -f - "$FILEsrc"
[I've updated each solution to remove selected lines from the input, but I'm not positive the awk is correct. I'm partial to the bash solution myself, so I'm not going to spend any time debugging it. Feel free to edit any mistakes.]
Here's a simple awk script (the probabilities are simpler to manage with floating point numbers, which don't mix well with bash):
tmp=$(mktemp /tmp/XXXXXXXX)
awk -v total=$(wc -l < "$FILEsrc") -v maxLines=$MaxLines '
BEGIN { srand(); }
maxLines==0 { exit; }
{ if (rand() < maxLines/total--) {
print; maxLines--;
} else {
print $0 > /dev/fd/3
}
}' "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
As you print a line to the output, you decrement maxLines to decrease the probability of choosing further lines. But as you consume the input, you decrease total to increase the probability. In the extreme, the probability hits zero when maxLines does, so you can stop processing the input. In the other extreme, the probability hits 1 once total is less than or equal to maxLines, and you'll be accepting all further lines.
Here's the same algorithm, implemented in (almost) pure bash using integer arithmetic:
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
tmp=$(mktemp /tmp/XXXXXXXX)
total=$(wc -l < "$FILEsrc")
while read -r line && (( MaxLines > 0 )); do
(( MaxLines * 32768 > RANDOM * total-- )) || { printf >&3 "$line\n"; continue; }
(( MaxLines-- ))
printf "$line\n"
done < "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
Here's a complete Go program :
package main
import (
"bufio"
"fmt"
"log"
"math/rand"
"os"
"sort"
"time"
)
func main() {
N := 10
rand.Seed( time.Now().UTC().UnixNano())
f, err := os.Open(os.Args[1]) // open the file
if err!=nil { // and tell the user if the file wasn't found or readable
log.Fatal(err)
}
r := bufio.NewReader(f)
var lines []string // this will contain all the lines of the file
for {
if line, err := r.ReadString('\n'); err == nil {
lines = append(lines, line)
} else {
break
}
}
nums := make([]int, N) // creates the array of desired line indexes
for i, _ := range nums { // fills the array with random numbers (lower than the number of lines)
nums[i] = rand.Intn(len(lines))
}
sort.Ints(nums) // sorts this array
for _, n := range nums { // let's print the line
fmt.Println(lines[n])
}
}
Provided you put the go file in a directory named randomlines in your GOPATH, you may build it like this :
go build randomlines
And then call it like this :
./randomlines "path_to_my_file"
This will print N (here 10) random lines in your files, but without changing the order. Of course it's near instantaneous even with big files.
Here's an interesting two-pass option with coreutils, sed and awk:
n=5
total=$(wc -l < infile)
seq 1 $total | shuf | head -n $n \
| sed 's/^/NR == /; $! s/$/ ||/' \
| tr '\n' ' ' \
| sed 's/.*/ & { print >> "rndlines" }\n!( &) { print >> "leftover" }/' \
| awk -f - infile
A list of random numbers are passed to sed which generates an awk script. If awk were removed from the pipeline above, this would be the output:
{ if(NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 ) print > "rndlines"; else print > "leftover" }
So the random lines are saved in rndlines and the rest in leftover.
Mentioned "10 hundreds" lines should sort quite quickly, so this is a nice case for the Decorate, Sort, Undecorate pattern. It actually creates two new files, removing lines from the original one can be simulated by renaming.
Note: head and tail cannot be used instead of awk, because they close the file descriptor after given number of lines, making tee exit thus causing missing data in the .rest file.
FILE=input.txt
SAMPLE=10
SEP=$'\t'
<$FILE nl -s $"SEP" -nln -w1 |
sort -R |
tee \
>(awk "NR > $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.rest) \
>(awk "NR <= $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.sample) \
>/dev/null
# check the results
wc -l $FILE*
# 'remove' the lines, if needed
mv $FILE.rest $FILE
This might work for you (GNU sed, sort and seq):
n=10
seq 1 $(sed '$=;d' input_file) |
sort -R |
sed $nq |
sed 's/.*/&{w output_file\nd}/' |
sed -i -f - input_file
Where $n is the number of lines to extract.