Transpose a file in Unix/Linux

I have a file something like this:
1111,K1
2222,L2
3333,LT50
4444,K2
1111,LT50
5555,IA
6666,NA
1111,NA
2222,LT10
The output that I need is:
1111,K1,LT50,NA
2222,L2,LT10
3333,LT50
4444,K2
5555,IA
6666,NA
The number in the first column may repeat any number of times, but the output I need should be sorted and unique.

awk -F"," '{a[$1]=a[$1]FS$2}END{for(i in a) print i,a[i]}' file | sort
If you have a big file, you can try printing the items out every few lines eg 50000
BEGIN{FS=","}
{ a[$1]=a[$1]FS$2 }
NR%50000==0 {
for(i in a) { print a[i] }
delete a #delete array so it won't take up memory
}
END{
for(i in a){ print a[i] }
}
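Assuming the chunked script above is saved as, say, chunked.awk (a hypothetical name), it could be run like this:
awk -f chunked.awk file | sort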

Here is a readable approach using a non-standard tool, the SQLite shell. The database is in-memory (no file argument is given), and group_concat joins the values with a comma by default.
echo 'create table tmp (a int, b text);
.separator ,
.import file.txt tmp
.output out.txt
SELECT a, group_concat(b) FROM tmp GROUP BY a ORDER BY a ASC;
.output stdout
.quit' | sqlite3

This is a solution in Python. The script reads data from stdin.
#!/usr/bin/env python
import sys

d = {}
for line in sys.stdin:
    key, value = line.strip().split(',')
    d.setdefault(key, []).append(value)
for key in sorted(d):
    print("%s,%s" % (key, ','.join(d[key])))

Here's one in Perl, but it isn't going to be particularly efficient:
#!/usr/bin/perl -w
use strict;

my %lines;
while (<>) {
    chomp;
    my ($key, $value) = split /,/;
    $lines{$key} .= "," if $lines{$key};
    $lines{$key} .= $value;
}
for my $key (sort keys %lines) {
    print "$key,$lines{$key}\n";
}
Use like this:
$ ./command <file >newfile
You will likely have better luck with a multiple-pass solution, though. I don't really have time to write that for you. Here's an outline (see the sketch after this list):
1. Grab and remove the first line from the file.
2. Parse through the rest of the file, concatenating any matching line and removing it.
3. At the end of the file, output your new long line.
4. If the file still has content, loop back to step 1.
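For instance, a rough shell sketch of that outline (untested; file and newfile are the placeholder names from the usage example above, and the loop destroys its working copy):
#!/bin/sh
cp file work
: > newfile
while [ -s work ]; do
    # Step 1: grab the first key from the working copy.
    key=$(head -n 1 work | cut -d, -f1)
    # Steps 2-3: concatenate every value belonging to that key into one long line.
    awk -F, -v k="$key" '$1 == k { line = line FS $2 } END { print k line }' work >> newfile
    # Step 2 (removal): drop the matched lines before the next pass.
    awk -F, -v k="$key" '$1 != k' work > work.tmp && mv work.tmp work
done
Sort newfile afterwards if the keys are needed in order.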

Related

How to split the data in a Unix file

I have a file on a Unix (Solaris) system with data like below:
[TYPEA]:/home/typeb/file1.dat
[TYPEB]:/home/typeb/file2.dat
[TYPEB]:/home/typeb/file3.dat
[TYPE_C]:/home/type_d/file4.dat
[TYPE_C]:/home/type_d/file5.dat
[TYPE_C]:/home/type_d/file6.dat
I want to separate the headings like below
[TYPEA]
/home/typeb/file1.dat
[TYPEB]
/home/typeb/file2.dat
/home/typeb/file3.dat
[TYPE_C]
/home/type_d/file4.dat
/home/type_d/file5.dat
/home/type_d/file6.dat
Files of the same type have to be grouped under one heading.
Please help me with any logic to achieve this without hardcoding.
Assuming the input is sorted by type like in your example,
awk -F : '$1 != prev { print $1 } { print $2; prev=$1 }' file
If there are more than 2 fields (paths containing colons, say) you will need to adjust the second clause, for example as shown below.
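A possible adjustment (untested): save the key before stripping everything up to the first colon, so paths may themselves contain colons:
awk -F : '$1 != prev { print $1 } { prev = $1; sub(/^[^:]*:/, ""); print }' file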
sed 'H;$ !b
x
s/\(\(\n\)\(\[[^]]\{1,\}]\):\)/\1\2\1/g
:cycle
s/\(\n\[[^]]\{1,\}]\)\(.*\)\1/\1\2/g
t cycle
s/^\n//' YourFile
This is a POSIX sed version, a bit unreadable due to the presence of [ in the pattern. Notes:
- it allows : in the label or the file/path
- it fails if lines with the same label are separated by a line with another label (the sample seems to be ordered)
If you can use Perl you will be able to make use of hashes to create a simple data structure:
#!/usr/bin/perl
use warnings;
use strict;

my %h;
while (<>) {
    chomp;
    my ($key, $value) = split /:/;
    $h{$key} = [] unless exists $h{$key};
    push @{$h{$key}}, $value;
}
foreach my $key (sort keys %h) {
    print "$key\n";
    foreach my $value (@{$h{$key}}) {
        print "$value\n";
    }
}
In action:
perl script.pl file
[TYPEA]
/home/typeb/file1.dat
[TYPEB]
/home/typeb/file2.dat
/home/typeb/file3.dat
[TYPE_C]
/home/type_d/file4.dat
/home/type_d/file5.dat
/home/type_d/file6.dat
If you like it, there is a whole tutorial on solving this simple problem. It's worth reading.

How can I split a large file with tab-separated values into smaller files while keeping lines together in a single file based on the first value?

I have a file (currently about 1 GB, 40M lines), and I need to split it into smaller files based on a target file size (target is ~1 MB per file).
The file contains multiple lines of tab-separated values. The first column has an integer value. The file is sorted by the first column. There are about 1M values in the first column, so each value has on average 40 lines, but some may have 2 and others may have 100 or more lines.
12\t...
12\t...
13\t...
14\t...
15\t...
15\t...
15\t...
16\t...
...
2584765\t...
2586225\t...
2586225\t...
After splitting the file, any distinct first value must only appear in a single file. E.g. when I read a smaller file and find a line starting with 15, it is guaranteed that no other files contain lines starting with 15.
To be clear, this does not mean mapping all lines that start with a specific value to one file per value; it only means that no value's lines may end up spread across two files.
Is this possible with the commandline tools available on a Unix/Linux system?
The following will try to split every 40,000 records, but postpone the split if the next record has the same key as the previous.
awk -F '\t' 'BEGIN { i=1; s=0; f=sprintf("file%05i", i) }
NR % 40000 == 0 { s=1 }
s==1 && $1!=k { close(f); f=sprintf("file%05i", ++i); s=0 }
{ k=$1; print >>f }' input
List all the keys by looking at only the first column (awk) and making them unique (sort -u). Then, for each of these keys, select only the lines that start with the key (grep) and redirect them into a file named after the key.
Oneliner:
for key in `awk '{print $1;}' file_to_split | sort -u` ; do grep -e "^$key\\s" file_to_split > splitted_file_$key ; done
Or multiple lines for a script file and better readability:
for key in `awk '{print $1;}' file_to_split | sort -u`
do
grep -e "^$key\\s" file_to_split > splitted_file_$key
done
This is not especially efficient, as it parses the file many times.
I am also not sure the for command can cope with such a large input from the `` subcommand.
On Unix systems you can usually use Perl as well, so here is a Perl solution:
#!/usr/local/bin/perl
use strict;

my $last_key = '';
my $store = '';
my $c = 0;
my $max_size = 1000000;

while (<>) {
    my @fields = split(/\t/);
    my $key = $fields[0];
    if ($last_key ne $key) {
        store() if (length($store) > $max_size);
    }
    $store .= $_;
    $last_key = $key;
}
store();

sub store {
    $c++;
    open(O, ">", "${c}.part");
    print O $store;
    close O;
    $store = '';
}
Save it as x.pl and use it like:
perl x.pl bigfile.txt
It splits your entries into files 1.part, 2.part, ... and tries to keep each of them around $max_size bytes.
HTH

Efficient way to replace strings in one file with strings from another file

I searched for similar problems and could not find anything that suits my needs exactly:
I have a very large HTML file scraped from multiple websites and I would like to replace all
class="key->from 2nd file"
with
style="xxxx"
At the moment I use sed - it works well but only with small files
while read key; do
    sed -i "s/class=\"$key\"/style=\"xxxx\"/g" file_to_process
done < keys
When I try to process something larger, it takes ages.
Example:
keys - Count: 1233 lines
file_to_process - Count: 1946 lines
It takes about 40 s to complete only 1/10 of the processing I need:
real 0m40.901s
user 0m8.181s
sys 0m15.253s
Untested since you didn't provide any sample input and expected output:
awk '
NR==FNR { keys = keys sep $0; sep = "|"; next }
{ gsub("class=\"(" keys ")\"","style=\"xxxx\"") }
1' keys file_to_process > tmp$$ &&
mv tmp$$ file_to_process
I think it's time to Perl (untested):
my $keyfilename = 'somekeyfile'; # or pick up from script arguments
open KEYFILE, '<', $keyfilename or die("Could not open key file $keyfilename\n");
my %keys = map { chomp; $_ => 1 } <KEYFILE>; # construct a map for lookup speed
close KEYFILE;

my $htmlfilename = 'somehtmlfile'; # or pick up from script arguments
open HTMLFILE, '<', $htmlfilename or die("Could not open html file $htmlfilename\n");
my $newchunk = qq/style="xxxx"/;
for my $line (<HTMLFILE>) {
    my $newline = $line;
    while ($line =~ m/(class="([^"]+)")/g) {
        if (defined($keys{$2})) {
            $newline =~ s/\Q$1\E/$newchunk/g;
        }
    }
    print $newline;
}
This uses a hash for lookups of keys, which should be reasonably fast, and does this only on the key itself when the line contains a class statement.
Try to generate one very long sed script from the keys file, with a substitution command per key, something like:
s/class=\"key1\"/style=\"xxxx\"/g; s/class=\"key2\"/style=\"xxxx\"/g ...
and run that file with sed -f.
This way you will read the input file only once; a sketch follows.
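For instance (untested; replace.sed and file_processed are hypothetical names, and each line of keys is assumed to be a literal class name with no regex metacharacters):
sed 's|.*|s/class="&"/style="xxxx"/g|' keys > replace.sed
sed -f replace.sed file_to_process > file_processed
The first sed turns every key into one substitution command; the second applies all of them in a single pass over the input.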
Here's one way using GNU awk:
awk 'FNR==NR { array[$0]++; next } { for (i in array) { a = "class=\"" i "\""; gsub(a, "style=\"xxxx\"") } }1' keys.txt file.txt
Note that the keys in keys.txt are taken as the whole line, including whitespace. If leading and trailing whitespace could be a problem, use $1 instead of $0. Unfortunately I cannot test this properly without some sample data. HTH.
First convert your keys file into a sed or-pattern which looks like this: key1|key2|key3|.... This can be done using the tr command. Once you have this pattern, you can use it in a single sed command.
Try the following:
sed -i -r "s/class=\"($(tr '\n' '|' < keys | sed 's/|$//'))\"/style=\"xxxx\"/g" file

Implement tail with awk

I am struggling with this awk code which should emulate the tail command
num=$1;
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[$i]
}
So what I'm trying to achieve here is a tail command emulated by awk.
For example, cat somefile | awk -f tail.awk 10
should print the last 10 lines of a text file. Any suggestions?
All of these answers store the entire source file. That's a horrible idea and will break on larger files.
Here's a quick way to store only the number of lines to be outputted (note that the more efficient tail will always be faster because it doesn't read the entire source file!):
awk -vt=10 '{o[NR%t]=$0}END{i=(NR<t?0:NR);do print o[++i%t];while(i%t!=NR%t)}'
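For example (a quick check of the golfed version):
seq 1 100 | awk -vt=10 '{o[NR%t]=$0}END{i=(NR<t?0:NR);do print o[++i%t];while(i%t!=NR%t)}'
prints the numbers 91 through 100.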
More legibly (and with less code golf):
awk -v tail=10 '
{
    output[NR % tail] = $0
}
END {
    if (NR < tail) {
        i = 0
    } else {
        i = NR
    }
    do {
        i = (i + 1) % tail
        print output[i]
    } while (i != NR % tail)
}'
Explanation of legible code:
This uses the modulo operator to store only the desired number of items (the tail variable). As each line is parsed, it is stored on top of older array values (so line 11 gets stored in output[1]).
The END stanza sets an increment variable i to either zero (if we've got fewer than the desired number of lines) or else the number of lines, which tells us where to start recalling the saved lines. Then we print the saved lines in order. The loop ends when we've returned to that first value (after we've printed it).
You can replace the if/else stanza (or the ternary clause in my golfed example) with just i = NR if you don't care about getting blank lines to fill the requested number (echo "foo" |awk -vt=10 … would have nine blank lines before the line with "foo").
for(i=NR-num;i<=NR;i++)
print vect[$i]
In awk, $i means field number i of the current record, not the variable i. Use just plain i:
for(i=NR-num;i<=NR;i++)
print vect[i]
The full code that worked for me is:
#!/usr/bin/awk -f
BEGIN {
    num = ARGV[1];
    # Make that arg empty so awk doesn't interpret it as a file name.
    ARGV[1] = "";
}
{
    vect[NR] = $0;
}
END {
    for (i = NR - num; i <= NR; i++)
        print vect[i]
}
You should probably add some code to the END block to handle the case when NR < num.
You also need to add -v num=10 to the awk command line to set the value of num, and to start at NR-num+1 in your final loop, otherwise you'll end up with num+1 lines of output. A sketch combining both fixes follows.
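An untested sketch of such an END block (it clamps the start index so nothing breaks when fewer than num lines were read):
END {
    start = NR - num + 1
    if (start < 1) start = 1   # fewer lines than requested: print them all
    for (i = start; i <= NR; i++)
        print vect[i]
}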
This might work for you. It buffers the last v lines in the string a: each line is appended, and once more than v lines have been read, everything up to the first newline is trimmed off:
awk '{a=a b $0;b=RS;if(NR<=v)next;a=substr(a,index(a,RS)+1)}END{print a}' v=10

How to divide a data file's column A by column B using Perl

I was given a text file with a whole bunch of data sorted in columns. Each of the columns is separated by commas.
How could I divide one column by another column to print an output answer? I am using Perl right now, so it has to be done in Perl. How could I do this?
This is what I have so far:
#!/usr/bin/perl
open (FILE, 'census2008.txt');
while (<FILE>) {
    chomp;
    ($sumlev, $stname, $ctyname, $popestimate2008, $births2008, $deaths2008) = split(",");
}
close (FILE);
exit;
There are several options:
Read the file in line by line, split the columns on ',' and divide the relevant columns (don't forget to handle the divide-by-zero error)
Do the same thing as a one-liner:
$ perl -F/,/ -lane 'print( $F[1] == 0 ? "" : $F[3]/$F[1] )' file.txt
Utilize a ready-to-use CPAN module like Text::CSV
Of course, there are more unorthodox/crazy/unspeakable alternatives à la TMTOWTDI ™, so one could:
Parse out the relevant columns with a regex and divide the matches:
if (/^\d*,(\d+),\d*,(\d+)/) { say $2/$1 if $1 != 0; }
Do it with s///e:
$ perl -ple 's!^\d*,(\d+),\d*,(\d+).*$! $1 == 0 ? "" : $2/$1 !e' file.txt
Get the shell to do the dirty work via backticks:
sub print_divide { say `cat file.txt | some_command_line_command` }
#!/usr/bin/env perl
# Divides column 1 by column 2 of some ','-delimited file,
# read from standard input.
# Usage:
# $ cat data.txt | 8458760.pl
while (<STDIN>) {
    my @values = split(/,/, $_);
    print $values[0] / $values[1] . "\n";
}
If you have fixed width columns of data you could use 'unpack' along the lines of:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my ($sumlev, $stname, $ctyname, $popest, $births, $deaths)
        = unpack("A2xA10xA15xA7xA5xA5");
    printf "%-15s %4.2f\n", $ctyname, $births/$deaths;
}
__DATA__
10,Main      ,My City        ,  10000,  200,  150
12,Poplar    ,Somewhere      ,   3000,   90,  100
13,Maple     ,Your Place     ,   9123,  100,   90
