I have a huge csv file with 500.000+ lines. I want to add an amount to the "Price" column via the terminal in Ubuntu. I tried using awk (best solution?) but I don't know how. (I also need to keep the header in the new file)
Here is an example of the file
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"379,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"882,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4306,00";"0";"Minolta"
For example, I want to add 125 to the price so that the output is:
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"504,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"1007,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4431,00";"0";"Minolta"
$ awk 'BEGIN {FS=OFS="\";\""} NR>1 {$3 = sprintf("%.2f", $3+125)}1' p.txt
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"504,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"1007,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4431,00";"0";"Minolta"
Note that this requires a value of environment variable LC_NUMERIC that expects , as the decimal separator (I had mine set to LC_NUMERIC="de_DE", e.g.).
For more DRYness you can pass in the amount you want to add with -v:
$ awk -v n=125 'BEGIN {FS=OFS="\";\""} NR>1 {$3 = sprintf("%.2f", $3+n)}1' p.txt
If you don't care so much about the formatting (that is, if "4431" instead of "4431,00" is acceptable), you can skip the sprintf:
$ awk -v n=125 'BEGIN {FS=OFS="\";\""} NR>1 {$3+=n}1' p.txt
EDIT: Set FS and OFS in BEGIN block, instead of independently via -v, as suggested in the comments (to better ensure that they receive the same value, since it's important that they be the same).
Perl to the rescue! Save as add-price, run as perl add-price input.csv 125.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
my ($file, $add) = @ARGV;
my $csv = 'Text::CSV'->new({ binary => 1,
sep_char => ';',
eol => "\n",
always_quote => 1,
}) or die 'Text::CSV'->error_diag;
open my $IN, '<', $file or die $!;
open my $OUT, '>', "$file.new" or die $!;
while (my $row = $csv->getline($IN)) {
if (1 != $csv->record_number) {
my $value = $row->[2];
$value =~ s/,/./;
$value = sprintf "%.2f", $value + $add;
$value =~ s/\./,/;
$row->[2] = $value;
}
$csv->print($OUT, $row);
}
close $OUT or die $!;
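For comparison, here is a minimal Python sketch of the same parse, modify, write approach. It is not part of the answer above; the function name add_to_price and the in-memory sample are assumptions made for illustration. Like the Perl script, it converts the decimal comma itself, so it does not depend on LC_NUMERIC.

```python
import csv
import io
from decimal import Decimal

def add_to_price(src, dst, amount, column="Price"):
    """Copy semicolon-separated CSV from src to dst, adding amount to one column."""
    reader = csv.reader(src, delimiter=";", quotechar='"')
    writer = csv.writer(dst, delimiter=";", quotechar='"',
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    header = next(reader)
    writer.writerow(header)              # keep the header unchanged
    idx = header.index(column)
    for row in reader:
        # Prices use a decimal comma; Decimal avoids binary float rounding.
        price = Decimal(row[idx].replace(",", ".")) + Decimal(amount)
        row[idx] = f"{price:.2f}".replace(".", ",")
        writer.writerow(row)

# Small in-memory sample taken from the question.
sample = (
    '"Productno.";"Description";"Price";"Stock";"Brand"\n'
    '"/5PL0006";"Drum Unit";"379,29";"10";"Kyocera"\n'
    '"000J";"Drum, 7033/7040 200.000";"4306,00";"0";"Minolta"\n'
)
out = io.StringIO()
add_to_price(io.StringIO(sample), out, "125")
print(out.getvalue(), end="")
```

For a real file you would pass two file objects opened with newline="" instead of the StringIO objects. The rows are streamed one at a time, so 500,000+ lines should not be a memory problem.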
You can also use PHP with the parsecsv-for-php library: https://github.com/parsecsv/parsecsv-for-php
First download the library, put it in a new folder, and add a copy of your CSV file to that folder. Make sure to use a copy: the library's save method can wipe the data in your CSV file if you use it incorrectly.
With this library you can parse the file and modify the values directly.
<?php
// !!! Make a copy of your csv file before executing this
// Require the Parse CSV library , that you can find there : https://github.com/parsecsv/parsecsv-for-php
require_once 'parsecsv.lib.php';
// Instantiate it
$csv = new parseCSV();
// Load your file
$csv->auto('data.csv');
// Iterate through each data row
foreach ($csv->data as $i => $row) {
// Prices use a decimal comma, so convert it to a dot before doing arithmetic
$new_price = (float) str_replace(',', '.', $row["Price"]) + 125;
// Format the price with two decimals and restore the decimal comma
$new_price = number_format($new_price, 2, ',', '');
// Modify only the Price value of the ith row; the other columns stay as they are
$csv->data[$i]["Price"] = $new_price;
}
// Save once, after all rows have been modified
$csv->save();
If you aren't concerned about the possibility of ';' occurring within the first two fields, and if you don't want to be bothered with dependence on environment variables, then consider:
awk -F';' -v add=125 '
function sum(s, d) { # global: q, add
gsub(q, "", s);
split(s,d,",");
return (add+d[1])","d[2];
}
BEGIN {OFS=FS; q="\""; }
NR>1 {$3 = q sum($3) q}
{print} '
This preserves the double-quotes ("). Using your input, the above script produces:
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"504,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"1007,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4431,00";"0";"Minolta"
I have a really big XML file. It contains certain incrementing numbers, which I would like to replace with different incrementing numbers. I searched and found an approach that someone had suggested here before. Unfortunately, I can't get it to work.
In the code below, all instances of 40960 should be replaced with 41984, all instances of 40961 with 41985, and so on. Nothing happens. What am I doing wrong?
use strict;
use warnings;
my $old = 40960;
my $new = 41984;
my $string;
my $file = 'file.txt';
rename($file, $file.'.bak');
open(IN, '<'.$file.'.bak') or die $!;
open(OUT, '>'.$file) or die $!;
$old++;
$new++;
for (my $i = 0; $i < 42; $i++) {
while(<IN>) {
$_ =~ s/$old/$new/g;
print OUT $_;
}
}
close(IN);
close(OUT);
Other answers give you better solutions to your problem. Mine concentrates on explaining why your code didn't work.
The core of your code is here:
$old++;
$new++;
for (my $i = 0; $i < 42; $i++) {
while(<IN>) {
$_ =~ s/$old/$new/g;
print OUT $_;
}
}
You increment the values of $old and $new outside of your loops. And you never change those values again. So you're only making the same substitution (changing 40961 to 41985) 42 times. You never try to change any other numbers.
Also, look at the while loop that reads from IN. On your first iteration (when $i is 0) you read all of the data from IN and the file pointer is left at the end of the file. So when you go into the while loop again on your second iteration (and all subsequent iterations) you read no data at all from the file. You need to reset the file pointer to the start of your file at the end of each iteration.
Oh, and the basic logic is wrong. If you think about it, you'll end up writing each line to the output file 42 times. You need to do all possible substitutions before writing the line. So your inner loop needs to be the outer loop (and vice versa).
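The point about the file pointer is not specific to Perl. Here is a small Python sketch (my own illustration, using an in-memory stream instead of a real file) showing that a second pass over an already-consumed stream reads nothing until you rewind it:

```python
import io

# Stand-in for the input file; two lines of made-up data.
data = io.StringIO("a 40961\nb 40962\n")

first_pass = [line for line in data]    # consumes the whole stream
second_pass = [line for line in data]   # nothing left to read

print(len(first_pass), len(second_pass))

data.seek(0)                            # rewind, like seek(IN, 0, 0) in Perl
third_pass = [line for line in data]
print(len(third_pass))
```

Rewinding would make the inner loop read data again, but it would also write every line once per pass, which is why swapping the loops is the better fix.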
Putting those suggestions together, you need something like this:
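my $old = 40960;
my $change = 1024;
while (<IN>) {
# Easier way to write your loop
for my $i ( 0 .. 41 ) {
# Derive both numbers from $i so that $old is never modified
# and every line starts again from the first number
my $from = $old + $i;
my $to = $from + $change;
# Use \b to mark word boundaries
s/\b$from\b/$to/g;
}
# Print each output line only once
print OUT $_;
}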
Here's an example that works line by line, so the size of file is immaterial. The example assumes you want to replace things like "45678", but not "fred45678". The example also assumes that there is a range of numbers, and you want them replaced with a new range offset by a constant.
#!/usr/bin/perl
use strict;
use warnings;
use constant MIN => 40000;
use constant MAX => 90000;
use constant DIFF => +1024;
sub repl { $_[0] >= MIN && $_[0] <= MAX ? $_[0] + DIFF : $_[0] }
while (<>) {
s/\b(\d+)\b/repl($1)/eg;
print;
}
exit(0);
Invoked with the file you want to transform as an argument, it produces altered output on stdout. With the following input ...
foo bar 123
40000 50000 60000 99999
fred60000
fred 60000 fred
... it produces this output.
foo bar 123
41024 51024 61024 99999
fred60000
fred 61024 fred
There are a couple of classic Perlisms here, but the example shouldn't be hard to follow if you RTFM appropriately.
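If the Perlisms get in the way, the same idea (find every standalone number, shift it only if it falls in the range) might look like this in Python. This is a sketch of the technique, not a translation anyone above posted; the name shift_numbers is made up, and the constants are copied from the Perl example.

```python
import re

MIN, MAX, DIFF = 40000, 90000, 1024

def shift_numbers(line):
    """Add DIFF to every standalone number in [MIN, MAX]; leave the rest alone."""
    def repl(match):
        n = int(match.group(1))
        # Out-of-range numbers are returned exactly as they were matched.
        return str(n + DIFF) if MIN <= n <= MAX else match.group(1)
    # \b keeps "fred60000" intact because there is no word boundary
    # between "fred" and the digits.
    return re.sub(r"\b(\d+)\b", repl, line)

for line in ["foo bar 123", "40000 50000 60000 99999", "fred60000", "fred 60000 fred"]:
    print(shift_numbers(line))
```

Each line is scanned once, whatever the size of the range, because the decision is made in the callback rather than by trying one regex per number.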
Here is an alternative way which reads the input file into a string and does all the substitutions at once:
use strict;
use warnings;
{
my $old = 40960;
my $new = 41984;
my ($regexp) = map { qr/$_/ } join '|', map { $old + $_ } 0..41;
my $file = 'file.txt';
rename($file, $file.'.bak');
open(IN, '<'.$file.'.bak') or die $!;
my $str = do {local $/; <IN>};
close IN;
$str =~ s/($regexp)/do_subst($1, $old, $new)/ge;
open(OUT, '>'.$file) or die $!;
print OUT $str;
close OUT;
}
sub do_subst {
my ( $old, $old_base, $new_base ) = @_;
my $i = $old - $old_base;
my $new = $new_base + $i;
return $new;
}
Note: Can probably be made more efficient by using Regexp::Assemble
I am working on a bioinformatics project where I am looking at very large genomes. Seg only reads 135 lines at a time, so when we feed the genomes in, it gets overloaded. I am trying to create a Perl command that will split the input into 135-line sections. The character limit would be 10,800, since there are 80 columns. This is what I have so far:
#!/usr/bin/perl
use warnings;
use strict;
my $str =
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';
substr($str,17) = "";
print "$str";
It splits at the 17th character but only prints that section; I want it to continue printing the rest of the data. How do I add a command that shows the rest of the data? It should keep splitting at every 17th character. (Then, of course, I can go back and scale it up to the size I actually need.)
I assume that the "very large genome" is stored in a very large file, and that it is fine to collect data by number of lines (and not by number of characters) since this is the first mentioned criterion.
Then you can read the file line by line and collect lines until there are 135 of them. Then hand them off to a program or routine that processes them, empty your buffer, and keep going.
use warnings;
use strict;
use feature 'say';
my $file = shift || 'default_filename.txt';
my $num_lines_to_process = 135;
open my $fh, '<', $file or die "Can't open $file: $!";
my $line_counter = 0; # initialized so that the == test below does not warn
my @buffer;
while (<$fh>) {
chomp;
if ($line_counter == $num_lines_to_process)
{
process_data(\@buffer);
@buffer = ();
$line_counter = 0;
}
push @buffer, $_;
++$line_counter;
}
process_data(\@buffer) if @buffer; # last batch
sub process_data {
my ($rdata) = @_;
say for @$rdata; say '---'; # print data for a test
}
If your processing application/routine wants a string, you can append to a string every time instead of adding to an array, $buffer .= $_; and clear that by $buffer = ''; as needed.
If you need to pass a string but there is also some use of an array while collecting data (intermediate checks/pruning/processing?), then collect lines into an array and use as needed, and join into a string before handing it off, my $data = join '', @buffer;
You can also make use of the $. variable and the modulo operator (%)
while (<$fh>) {
chomp;
push @buffer, $_;
if ($. % $num_lines_to_process == 0) # every $num_lines_to_process
{
process_data(\@buffer);
@buffer = ();
}
}
process_data(\@buffer) if @buffer; # last batch
In this case we need to first store a line and then check its number, since $. (line number read from a filehandle, see docs linked above) starts from 1 (not 0).
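The batching logic itself is language-independent. As an illustration only (the generator name batches is hypothetical), here is how the "collect N lines, hand them off, repeat" pattern could be sketched in Python:

```python
from itertools import islice

def batches(lines, size=135):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))   # take the next `size` items, or fewer at the end
        if not batch:
            return
        yield batch

# Tiny demonstration with 7 items and a batch size of 3.
for batch in batches(range(7), size=3):
    print(batch)
```

Passing an open file handle as `lines` reads it lazily, so only one batch is held in memory at a time. The final, shorter batch is yielded like any other, which corresponds to the "last batch" line in the Perl code.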
substr returns the removed part of a string; you can just run it in a loop:
while (length $str) {
my $substr = substr $str, 0, 17, "";
print $substr, "\n";
}
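The loop above cuts a fixed-width piece off the front of the string until nothing is left. For readers more at home in Python, a rough equivalent using slicing (the helper name chunks is an assumption) would be:

```python
def chunks(text, width=17):
    """Split text into consecutive pieces of `width` characters; the last may be shorter."""
    return [text[i:i + width] for i in range(0, len(text), width)]

# Small demonstration with a width of 3.
for piece in chunks("abcdefgh", 3):
    print(piece)
```

Unlike the substr loop, this leaves the original string untouched.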
Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many records (as defined by a specific EOL such as "^%!") had how many occurrences of a specific character (sorted descending by the number of occurrences).
For example, with this sample file:
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
Suggested input: delimiter EOL and filename as arguments.
bash/perl some_script_name ",|" "^%!" samplefile
Desired output:
occs count
3 1
2 1
1 2
0 2
This is because the 1st record had one delimiter, 2nd record had 2, 3rd record had 0, 4th record had 3, 5th record had 1, 6th record had 0.
Bonus pts if you can make the delimiter and EOL argument accept hex input (ie 2C7C) or normal character input (ie ,|) .
Script:
#!/usr/bin/perl
use strict;
$/ = $ARGV[1];
open my $fh, '<', $ARGV[2] or die $!;
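my @records = <$fh> and close $fh;
# Quote the delimiter so that characters such as | are matched literally
my $delim = quotemeta $ARGV[0];
my %counts;
# split in scalar context returns the number of fields; delimiters = fields - 1
$counts{(split /$delim/, $_, -1) - 1}++ for @records;
delete $counts{-1};
print "$_\t$counts{$_}\n" for (sort { $b <=> $a } keys %counts);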
Test:
perl script.pl ',|' '^%!' samplefile
Output:
3 1
2 1
1 2
0 2
This is what perl lives for:
#!perl -w
use 5.12.0;
my ($delim, $eol, $file) = @ARGV;
open my $fh, '<', $file or die "error opening $file $!";
$/ = $eol; # input record separator
my %counts;
while (<$fh>) {
my $matches = () = $_ =~ /(\Q$delim\E)/g; # "goatse" operator
$counts{$matches}++;
}
say "occs\tcount";
foreach my $num (reverse sort keys %counts) {
say "$num\t$counts{$num}";
}
(if you haven't got 5.12, remove the "use 5.12" line and replace the say with print)
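Nobody has addressed the bonus part (hex input) so far, so here is a hedged Python sketch of the counting plus one possible way to accept hex. The "0x" prefix is purely my own convention for telling hex apart from literal text (otherwise "2C7C" would be ambiguous), and the names parse_token and count_occurrences are made up for this example.

```python
from collections import Counter

def parse_token(arg):
    """Assumed convention: a "0x" prefix marks hex bytes, e.g. 0x2C7C means ",|"."""
    if arg.lower().startswith("0x"):
        return bytes.fromhex(arg[2:]).decode("ascii")
    return arg

def count_occurrences(text, delim, eol):
    """Map 'delimiters per record' to 'number of records with that many'."""
    records = text.split(eol)
    if records and records[-1].strip() == "":
        records.pop()                    # drop the empty tail after the final EOL
    return Counter(record.count(delim) for record in records)

# The sample file from the question, held in memory.
sample = (
    "jdk,|ljn^%!dk,|sn,|fgc^%!\n"
    "ydfsvuyx^%!67ds5,|bvujhy,|s6d75\n"
    "djh,|sudh^%!nhjf,|^%!fdiu^%!\n"
)
counts = count_occurrences(sample, parse_token("0x2C7C"), parse_token("^%!"))
print("occs\tcount")
for occs in sorted(counts, reverse=True):
    print(f"{occs}\t{counts[occs]}")
```

Only the trailing whitespace-only record is dropped here; empty records in the middle of the file are still counted as having zero delimiters.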
A solution in awk:
BEGIN {
RS="\\^%!"
FS=",\\|"
max_occ = 0
}
{
if(match($0, "^ *$")) { # This is here to deal with the final separator.
next
}
if(NF - 1 > max_occ) {
max_occ = NF - 1
}
count[NF - 1]=count[NF - 1] + 1
}
END {
printf("occs count\n")
for(i = max_occ; i >= 0; i--) { # descending, as the question asks
printf("%s %s\n", i, count[i] + 0) # + 0 prints 0 for counts that never occurred
}
}
Well, there's one more empty record at the end of the file, which has 0. So, here's a script to do what you wanted. Adding headers and otherwise tweaking the printf output is left as an exercise for you. :)
Basically, read the whole file in, split it into records, and for each record, use a /g regex to count the sub-delimiters. Since /g returns a list of all matches, use @{[]} to make an arrayref, then deref that in scalar context to get a count. There has to be a more elegant solution to that particular part of the problem, but whatever; it's perl line noise. ;)
user@host[/home/user]
$ ./test.pl ',|' '^%!' test.in
3 1
2 1
1 2
0 3
user@host[/home/user]
$ cat test.in
jdk,|ljn^%!dk,|sn,|fgc^%!
ydfsvuyx^%!67ds5,|bvujhy,|s6d75
djh,|sudh^%!nhjf,|^%!fdiu^%!
user@host[/home/user]
$ cat test.pl
#!/usr/bin/perl
my( $subdelim, $delim, $in,) = @ARGV;
$delim = quotemeta $delim;
$subdelim = quotemeta $subdelim;
my %counts;
# $! (not $?) holds the error message of a failed open
open(F, $in) or die qq{Failed opening $in: $!\n};
foreach( split(/$delim/, join(q{}, <F>)) ){
$counts{ scalar(@{[m/.*?($subdelim)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
And here's a modified version which only keeps fields which contain at least one non-space character. That removes the last field, but also has the consequence of removing any other empty fields. It also uses $/ and \Q\E to avoid a couple of explicit function calls (thanks, Alex). And, like the previous one, it works under strict and warnings.
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
foreach ( grep(/\S/, <F>) ){
$counts{ scalar(@{[m/.*?(\Q$subdelim\E)/g]}) }++;
}
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
If you really only want to remove the last record unconditionally, I'm partial to using pop:
#!/usr/bin/perl
my( $subdelim, $delim, $in ) = @ARGV;
local $/=$delim;
my %counts;
open(F, $in) or die qq{Failed opening $in: $!\n};
my @lines = <F>;
pop @lines;
$counts{ scalar(@{[m/.*?(\Q$subdelim\E)/g]}) }++ foreach (@lines);
printf( qq{%i% 4i\n}, $_, $counts{$_} ) foreach (sort {$b<=>$a} keys %counts);
I wrote a super simple script:
#!/usr/bin/perl -w
use strict;
open (F, "<ids.txt") || die "fail: $!\n";
my @ids = <F>;
foreach my $string (@ids) {
chomp($string);
print "$string\n";
}
close F;
This is producing an expected output of all the contents of ids.txt:
hello
world
these
annoying
sourcecode
lines
Now I want to add a file-extension: .txt for every line. This line should do the trick:
#!/usr/bin/perl -w
use strict;
open (F, "<ids.txt") || die "fail: $!\n";
my @ids = <F>;
foreach my $string (@ids) {
chomp($string);
$string .= ".txt";
print "$string\n";
}
close F;
But the result is as follows:
.txto
.txtd
.txte
.txtying
.txtcecode
Instead of appending ".txt" to my lines, the first 4 letters of my string are replaced by ".txt". Since I want to check whether some files exist, I need the full filename with extension.
I have tried to chop, chomp, to substitute (s/\n//), joins and whatever. But the result is still a replacement instead of an append.
Where is the mistake?
chomp does not remove both \r and \n if the file has DOS line endings and you are running on Linux/Unix; it removes only the trailing \n.
What you are seeing is actually the original string, a carriage return, and the extension, which overwrites the first 4 characters on the display.
If the incoming file has DOS/Windows line endings you must remove both:
s/\R+$//
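The effect is easy to reproduce outside Perl. In this small Python sketch (an illustration, with rstrip("\n") standing in for what chomp does by default on Unix), repr makes the stray carriage return visible:

```python
line = "hello\r\n"                     # a line from a file with DOS line endings

only_newline = line.rstrip("\n")       # removes the \n but leaves the \r behind
print(repr(only_newline + ".txt"))     # the \r now sits in front of ".txt"

clean = line.rstrip("\r\n")            # strip both CR and LF
print(repr(clean + ".txt"))
```

When the first result is printed to a terminal, the \r moves the cursor back to the start of the line, so ".txt" is drawn over "hell" and you see ".txto".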
A useful debugging technique, when you are not quite sure why your data ends up with the value it has, is to dump it with Data::Dumper:
#!/usr/bin/perl -w
use strict;
use Data::Dumper ();
$Data::Dumper::Useqq = 1; # important to be able to actually see differences in whitespace, etc
open (F, "<ids.txt") || die "fail: $!\n";
my @ids = <F>;
foreach my $string (@ids) {
chomp($string);
print "$string\n";
print Data::Dumper::Dumper( { 'string' => $string } );
}
close F;
Have you tried this?
foreach my $string (@ids) {
chomp($string);
print $string.".txt\n";
}
I'm not sure what's wrong with your code, though. These results are strange.
What's the best way to combine two csv files and append the results to the same line in perl?
For example, one CSV file looks like
1234,user1,server
4323,user2,server
532,user3,server
The second looks like
user1,owner
user2,owner
user3,owner1
The result I want it to look like is
1234,user1,server,owner
4323,user2,server,owner
532,user3,server,owner1
The users are not in order so I'll need to search the first csv file which I've stored in an array to see which users match then apply the owner to the end of the line.
So far I've read both files into arrays, and then I get lost.
I would post the code, but it's part of a much larger script.
This sounds most suited for a hash. First read the one file into a hash, then add the other. Might add warnings for values that exist in one file but not the other.
Something like:
use warnings;
use strict;
use Text::CSV;
use autodie;
my %data;
my $file1 = "user.csv";
my $file2 = "user2.csv";
my $csv = Text::CSV->new ( { binary => 1 } );
open my $fh, '<', $file1;
while (my $row = $csv->getline($fh)) {
my ($num, $user, $server) = @$row;
$data{$user} = { 'num' => $num, 'server' => $server };
}
open $fh, '<', $file2;
while (my $row = $csv->getline($fh)) {
my ($user, $owner) = @$row;
if (not defined $data{$user}) {
# warning? something else appropriate
} else {
$data{$user}{'owner'} = $owner;
}
}
for my $user (keys %data) {
print join(',', $data{$user}{'num'}, $user, $data{$user}{'server'},
$data{$user}{'owner'}), "\n";
}
Edit: As recommended in comments and other answers, I changed the method of extracting the data to using Text::CSV instead of split. I'm not too familiar with the module, but it seems to be working in my testing.
Looks like a direct application for the join command (tied with sort). This assumes that the data is as simple as shown - no commas embedded in strings or anything nasty.
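# -kN,N restricts the sort key to the join field alone
sort -t, -k2,2 file1 > file1.sorted
sort -t, -k1,1 file2 > file2.sorted
# -o restores the column order of file1 and appends the owner
join -t, -1 2 -2 1 -o 1.1,1.2,1.3,2.2 file1.sorted file2.sorted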
With bash, you could do it all on one line.
If you really want to do it in Perl, then you need to use a hash keyed by the user column, potentially with an array of entries per hash key. You then iterate through the keys of one of the hashes, pulling the matching values from the other and printing the data. If you're in Perl, you can use the Text::CSV module to get accurate CSV splitting.
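To make the "hash keyed by the user column" idea concrete, here is a minimal sketch in Python, where a dict plays the role of the hash. It assumes the simple case from the question (one row per user, no blank lines), and the function name join_owner is hypothetical.

```python
import csv
import io

def join_owner(users_file, owners_file):
    """Append the owner from owners_file to each (num, user, server) row."""
    # Build the lookup table first: user -> owner.
    owners = {user: owner for user, owner in csv.reader(owners_file)}
    joined = []
    for num, user, server in csv.reader(users_file):
        # Users without a match get an empty owner instead of raising an error.
        joined.append([num, user, server, owners.get(user, "")])
    return joined

# In-memory stand-ins for the two files; the second one is deliberately out of order.
users = "1234,user1,server\n4323,user2,server\n532,user3,server\n"
owners = "user3,owner1\nuser1,owner\nuser2,owner\n"
for row in join_owner(io.StringIO(users), io.StringIO(owners)):
    print(",".join(row))
```

Only the second file is held in memory, and the output keeps the row order of the first file, so no sorting is needed.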
Assuming the 1st has 2 commas, and the 2nd only one, you will get all lines of the 1st file, but only the matching ones of the 2nd:
my %content;
while( <$file1> ) {
chomp;
/,(.+),/;
$content{$1} = "$_,";
}
while( <$file2> ) {
chomp;
/(.+),(.+)/;
$content{$1} .= $2;
}
print "$content{$_}\n" for sort keys %content;
import csv
files=['h21.csv', 'h20.csv','h22.csv']
lineCount=0
for file in files:
    with open(file,'r') as f1:
        csv_reader=csv.reader(f1, delimiter=',')
        with open('testout1.csv','a' ,newline='') as f2:
            csv_writer=csv.writer(f2,delimiter=',')
            if lineCount==0:
                csv_writer.writerow(["filename","sno","name","age"])
                lineCount += 1
            next(csv_reader,None)
            for row in csv_reader:
                data=[file]+row
                csv_writer.writerow(data)