Combine CSV files - linux

What's the best way to combine two CSV files in Perl and append the results to the same line?
For example, one CSV file looks like
1234,user1,server
4323,user2,server
532,user3,server
The second looks like
user1,owner
user2,owner
user3,owner1
The result I want looks like
1234,user1,server,owner
4323,user2,server,owner
532,user3,server,owner1
The users are not in order, so I'll need to search the first CSV file (which I've stored in an array) to see which users match, then append the owner to the end of the line.
So far I've read both files into arrays, and then I get lost.
I would post the code, but it's part of a much larger script.

This sounds most suited for a hash. First read the one file into a hash, then add the other. Might add warnings for values that exist in one file but not the other.
Something like:
use warnings;
use strict;
use Text::CSV;
use autodie;
my %data;
my $file1 = "user.csv";
my $file2 = "user2.csv";
my $csv = Text::CSV->new ( { binary => 1 } );
open my $fh, '<', $file1;
while (my $row = $csv->getline($fh)) {
    my ($num, $user, $server) = @$row;
    $data{$user} = { 'num' => $num, 'server' => $server };
}

open $fh, '<', $file2;
while (my $row = $csv->getline($fh)) {
    my ($user, $owner) = @$row;
    if (not defined $data{$user}) {
        # warning? something else appropriate
    } else {
        $data{$user}{'owner'} = $owner;
    }
}

for my $user (keys %data) {
    print join(',', $data{$user}{'num'}, $user, $data{$user}{'server'},
               $data{$user}{'owner'}), "\n";
}
Edit: As recommended in comments and other answers, I changed the method of extracting the data to using Text::CSV instead of split. I'm not too familiar with the module, but it seems to be working in my testing.

Looks like a direct application for the join command (tied with sort). This assumes that the data is as simple as shown - no commas embedded in strings or anything nasty.
sort -t, -k 2 file1 > file1.sorted
sort -t, -k 1 file2 > file2.sorted
join -t, -1 2 -2 1 file1.sorted file2.sorted
With bash, you could do it all on one line.
If you really want to do it in Perl, then you need to use a hash keyed by the user column, potentially with an array of entries per hash key. You then iterate through the keys of one of the hashes, pulling the matching values from the other and printing the data. You can use the Text::CSV module to get accurate CSV splitting.
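Here is a minimal sketch of that hash-of-arrays idea (the file names are placeholders, and a plain split is assumed to be safe only because the sample data has no embedded commas; swap in Text::CSV otherwise):
use strict;
use warnings;

# Hash keyed by user; each key holds an array of owners in case a user repeats.
my %owners;
open my $fh, '<', 'file2.csv' or die "Can't open file2.csv: $!";
while (<$fh>) {
    chomp;
    my ($user, $owner) = split /,/;
    push @{ $owners{$user} }, $owner;
}
close $fh;

# Walk the first file and append every matching owner to the line.
open $fh, '<', 'file1.csv' or die "Can't open file1.csv: $!";
while (<$fh>) {
    chomp;
    my (undef, $user) = split /,/;
    my @extra = $owners{$user} ? @{ $owners{$user} } : ();
    print join(',', $_, @extra), "\n";
}
close $fh;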

Assuming the first file has two commas per line and the second only one, you will get all lines of the first file, but only the matching ones of the second:
my %content;
while( <$file1> ) {
    chomp;
    /,(.+),/;
    $content{$1} = "$_,";
}
while( <$file2> ) {
    chomp;
    /(.+),(.+)/;
    $content{$1} .= $2;
}
print "$content{$_}\n" for sort keys %content;

import csv

files = ['h21.csv', 'h20.csv', 'h22.csv']
lineCount = 0
for file in files:
    with open(file, 'r') as f1:
        csv_reader = csv.reader(f1, delimiter=',')
        with open('testout1.csv', 'a', newline='') as f2:
            csv_writer = csv.writer(f2, delimiter=',')
            if lineCount == 0:
                csv_writer.writerow(["filename", "sno", "name", "age"])
                lineCount += 1
            next(csv_reader, None)
            for row in csv_reader:
                data = [file] + row
                csv_writer.writerow(data)

Related

How can I split my data in small enough chunks to feed to Seq?

I am working on a bioinformatics project where I am looking at very large genomes. Seg only reads 135 lines at a time, so when we feed the genomes in, it gets overloaded. I am trying to create a Perl command that will split the input into 135-line sections. The character limit would be 10,800 since there are 80 columns. This is what I have so far:
#!/usr/bin/perl
use warnings;
use strict;
my $str =
'>AATTCCGG
TTCCGGAA
CCGGTTAA
AAGGTTCC
>AATTCCGG';
substr($str,17) = "";
print "$str";
It splits at the 17th character but only prints that section; I want it to continue printing the rest of the data. How do I add a command that allows the rest of the data to be shown? It should split at every 17th character and keep going. (Then of course I can go back in and scale it up to the size I actually need.)
I assume that the "very large genome" is stored in a very large file, and that it is fine to collect data by number of lines (and not by number of characters), since that is the first criterion mentioned.
Then you can read the file line by line and assemble lines until there are 135 of them. Then hand them off to a program or routine that processes them, empty your buffer, and keep going.
use warnings;
use strict;
use feature 'say';
my $file = shift || 'default_filename.txt';
my $num_lines_to_process = 135;
open my $fh, '<', $file or die "Can't open $file: $!";
my ($line_counter, @buffer) = (0);
while (<$fh>) {
    chomp;
    if ($line_counter == $num_lines_to_process) {
        process_data(\@buffer);
        @buffer = ();
        $line_counter = 0;
    }
    push @buffer, $_;
    ++$line_counter;
}
process_data(\@buffer) if @buffer;  # last batch

sub process_data {
    my ($rdata) = @_;
    say for @$rdata; say '---';  # print data for a test
}
If your processing application/routine wants a string, you can append to a string every time instead of adding to an array, $buffer .= $_; and clear that by $buffer = ''; as needed.
If you need to pass a string but there is also some use of an array while collecting data (intermediate checks/pruning/processing?), then collect lines into an array and use as needed, and join into a string before handing it off, my $data = join '', @buffer;
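For instance, a minimal variant of the loop above with a string buffer (assuming process_data now takes a plain string rather than an array reference):
my ($line_counter, $buffer) = (0, '');
while (<$fh>) {
    $buffer .= $_;                            # keep newlines; chomp first if unwanted
    if (++$line_counter == $num_lines_to_process) {
        process_data($buffer);                # hand off one full chunk
        ($line_counter, $buffer) = (0, '');
    }
}
process_data($buffer) if length $buffer;      # last, possibly short, batch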
You can also make use of the $. variable and the modulo operator (%)
while (<$fh>) {
    chomp;
    push @buffer, $_;
    if ($. % $num_lines_to_process == 0) {  # every $num_lines_to_process lines
        process_data(\@buffer);
        @buffer = ();
    }
}
process_data(\@buffer) if @buffer;  # last batch
In this case we need to first store a line and then check its number, since $. (the number of the last line read from a filehandle; see perlvar) starts from 1 (not 0).
substr returns the removed part of a string; you can just run it in a loop:
while (length $str) {
    my $substr = substr $str, 0, 17, "";
    print $substr, "\n";
}

Perl/Linux filtering large file with content of another file

I'm filtering a 580 MB file using the content of another smaller file.
File1 (smaller file)
chr start End
1 123 150
2 245 320
2 450 600
File2 (large file)
chr pos RS ID A B C D E F
1 124 r2 3 s 4 s 2 s 2
1 165 r6 4 t 2 k 1 r 2
2 455 t2 4 2 4 t 3 w 3
3 234 r4 2 5 w 4 t 2 4
I would like to capture lines from File2 if the following criteria are met:
File2.Chr == File1.Chr && File2.Pos > File1.Start && File2.Pos < File1.End
I've tried using awk, but it runs very slowly; I was also wondering if there is a better way to accomplish this.
Thank you.
Here is the code that I’m using:
#!/usr/bin/perl -w
use strict;
use warnings;
my $bed_file = "/data/1000G/Hotspots.bed";#File1 smaller file
my $SNP_file = "/data/1000G/SNP_file.txt";#File2 larger file
my $final_file = "/data/1000G/final_file.txt"; #final output file
open my $in_fh, '<', $bed_file
or die qq{Unable to open "$bed_file" for input: $!};
while ( <$in_fh> ) {
    my $line_str = $_;
    my @data = split(/\t/, $line_str);
    next if /\b(?:track)\b/;  # skip header line
    my $chr = $data[0]; $chr =~ s/chr//g; print "chr is $chr\n";
    my $start = $data[1]-1; print "start is $start\n";
    my $end = $data[2]+1; print "end is $end\n";
    my $cmd1 = "awk '{if(\$1==chr && \$2>$start && \$2<$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file"; print "cmd1\n";
    my $cmd2 = `awk '{if(\$1==chr && \$2>$start && \$2<$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file`; print "cmd2\n";
}
Read the small file into a data structure and check every line of the other file against it.
Here I read it into an array, each element being an arrayref with fields from a line. Then each line of the data file is checked against the arrayrefs in this array, comparing fields per requirements.
use warnings 'all';
use strict;
my $ref_file = 'reference.txt';
open my $fh, '<', $ref_file or die "Can't open $ref_file: $!";
my @ref = map { chomp; [ split ] } grep { /\S/ } <$fh>;

my $data_file = 'data.txt';
open $fh, '<', $data_file or die "Can't open $data_file: $!";

# Drop header lines
my $ref_header = shift @ref;
my $data_header = <$fh>;

while (<$fh>)
{
    next if not /\S/;  # skip empty lines
    my @line = split;
    foreach my $refline (@ref)
    {
        next if $line[0] != $refline->[0];
        if ($line[1] > $refline->[1] and $line[1] < $refline->[2]) {
            print "@line\n";
        }
    }
}
close $fh;
This prints out the correct lines from the provided samples. It allows multiple lines to match. If that shouldn't be possible, add last in the if block to exit the foreach once a match is found.
A few comments on the code. Let me know if more can be useful.
When reading the reference file, <$fh> is used in list context so it returns all lines, and grep filters out the empty ones. The map first chomps the newline and then makes an arrayref by [ ], with elements being fields on the line obtained by split. The output list is assigned to @ref.
When we reuse $fh it is first closed (if it was open) so there is no need for a close.
I store the header lines just so, perhaps to print or check. We really only need to exclude them.
Another way, this time storing the smaller file in a Hash of Arrays (HoA) based on the 'chr' field:
use strict;
use warnings;
my $small_file = 'small.txt';
my $large_file = 'large.txt';
open my $small_fh, '<', $small_file or die $!;
my %small;
while (<$small_fh>){
    next if $. == 1;
    my ($chr, $start, $end) = split /\s+/, $_;
    push @{ $small{$chr} }, [$start, $end];
}
close $small_fh;

open my $large_fh, '<', $large_file or die $!;
while (my $line = <$large_fh>){
    my ($chr, $pos) = (split /\s+/, $line)[0, 1];
    if (defined $small{$chr}){
        for (@{ $small{$chr} }){
            if ($pos > $_->[0] && $pos < $_->[1]){
                print $line;
            }
        }
    }
}
Put them into a SQLite database, do a join. This will be much faster and less buggy and use less memory than trying to write something yourself. And it's more flexible, now you can just do SQL queries on the data, you don't have to keep writing new scripts and reparsing the files.
You can import them by parsing and inserting yourself, or you can convert them to CSV and use SQLite's CSV import ability. Converting to CSV with data that simple can be as easy as s{ +}{,}g, or you can use the full-blown and very fast Text::CSV_XS.
Your tables look like this (you'll want to use better names for the tables and fields).
create table file1 (
    chr   integer not null,
    start integer not null,
    end   integer not null
);

create table file2 (
    chr integer not null,
    pos integer not null,
    rs  integer not null,
    id  integer not null,
    a   char not null,
    b   char not null,
    c   char not null,
    d   char not null,
    e   char not null,
    f   char not null
);
Create some indexes on the columns you'll be searching on. Indexes will slow down the import, so make sure you do this after the import.
create index chr_file1 on file1 (chr);
create index chr_file2 on file2 (chr);
create index pos_file2 on file2 (pos);
create index start_file1 on file1 (start);
create index end_file1 on file1 (end);
And do the join.
select *
from file2
join file1 on file1.chr == file2.chr
where file2.pos between file1.start and file1.end;
1,124,r2,3,s,4,s,2,s,2,1,123,150
2,455,t2,4,2,4,t,3,w,3,2,450,600
You can do this in Perl via DBI and the DBD::SQLite driver.
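A minimal sketch of that route, assuming the tables and indexes above have already been created in a database file (the database and input file names are placeholders, and the input is taken to be whitespace-separated as in the samples):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=genome.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# Load a whitespace-separated file into a table, skipping its header line.
sub import_file {
    my ($file, $table, $ncols) = @_;
    my $sth = $dbh->prepare(
        "insert into $table values (" . join(',', ('?') x $ncols) . ')');
    open my $fh, '<', $file or die "Can't open $file: $!";
    <$fh>;                                 # skip the header line
    $sth->execute(split ' ') while <$fh>;
    close $fh;
}

import_file('file1.txt', 'file1', 3);
import_file('file2.txt', 'file2', 10);
$dbh->commit;

# The same join as above, driven from Perl.
my $rows = $dbh->selectall_arrayref(
    'select file2.* from file2
       join file1 on file1.chr = file2.chr
      where file2.pos between file1.start and file1.end');
print join(',', @$_), "\n" for @$rows;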
As said before, calling awk on every iteration is very slow. A full awk solution would be possible, but since I just saw a Perl solution, here's my Python solution, in case the OP doesn't mind:
create a dictionary from the small file: chr => list of (start, end) pairs
iterate through the big file, try to match the chr, and check whether the position falls between one of the start/end pairs.
Code:
with open("smallfile.txt") as f:
next(f) # skip title
# build a dictionary with chr as key, and list of start,end as values
d = collections.defaultdict(list)
for line in f:
toks = line.split()
if len(toks)==3:
d[int(toks[0])].append((int(toks[1]),int(toks[2])))
with open("largefile.txt") as f:
next(f) # skip title
for line in f:
toks = line.split()
chr_tok = int(toks[0])
if chr_tok in d:
# key is in dictionary
pos = int(toks[1])
if any(lambda x : t[0]<pos<t[1] for t in d[chr_tok]):
print(line.strip())
We could be slightly faster by sorting the list of tuples and applying bisect to avoid a linear search. That is necessary only if the list of tuples in the "small" file is big.
awk power with a single pass. Your code iterates over file2 as many times as there are lines in file1, so the execution time grows linearly. Please let me know if this single-pass solution is slower than the other solutions.
awk 'NR==FNR {
    i = b[$1];           # get the next index for the chr
    a[$1][i][0] = $2;    # store start
    a[$1][i][1] = $3;    # store end
    b[$1]++;             # increment the next index
    next;
}
{
    p = 0;
    if ($1 in a) {
        for (i in a[$1]) {
            if ($2 > a[$1][i][0] && \
                $2 < a[$1][i][1])
                p = 1    # set p if $2 in range
        }
    }
}
p {print}' file1 file2
One-Liner
awk 'NR==FNR {i = b[$1];a[$1][i][0] = $2; a[$1][i][1] = $3; b[$1]++;next; }{p = 0;if ($1 in a){for(i in a[$1]){if($2>a[$1][i][0] && $2<a[$1][i][1])p=1}}}p' file1 file2

Perl: String in Substring or Substring in String

I'm working with DNA sequences in a file, and this file is formatted something like this, though with more than one sequence:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
I need to be able to tell if a variable (which is also a sequence) matches any of the sequences in the file, and what the name of the matched sequence is, if any. Because of the nature of these sequences, my entire variable could be contained in a line of the file, or a line of the file could be part of my variable.
Right now my code looks something like this:
use warnings;
use strict;
my $filename = "/users/me/file/path/file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open file, "<$filename" or die "Can't find file";
my @Name;
my @Sequence;
my $inx = 0;
while (<file>){
    $Name[$inx] = <file>;
    $Sequence[$inx] = <file>;
    $indx++;
}unless(index($Sequence[$inx], $exampleentry) != -1 || index($exampleentry, $Sequence[$inx]) != -1){
    $returnval = "The sequence matches: ". $Name[$inx];
}
print $returnval;
However, even when I purposely set $entry as a match from the file, I still return The sequence does not match any in the file. Also, when running the code, I get Use of uninitialized value in index at thiscode.pl line 14, <file> line 3002. as well as Use of uninitialized value within #Name in concatenation (.) or string at thiscode.pl line 15, <file> line 3002.
How can I perform this search?
I will assume that the purpose of this script is to determine whether $exampleentry matches any record in the file file.txt. A record here describes a DNA sequence and corresponds to three consecutive lines in the file. The variable $exampleentry will match the sequence if it matches the third line of the record. A match here means that either
$exampleentry is a substring of $line, or
$line is a substring of $exampleentry,
where $line refers to the corresponding line in the file.
First, consider the input file file.txt:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
In the program you try to read these two lines using three calls to readline. Accordingly, the last call to readline will return undef since there are no more lines to read.
It therefore seems reasonable that the last two lines in file.txt are malformed, and that the correct format should be:
>name of sequence
EXAMPLESEQUENCE
ATCGATCGATCG
If I now understand you correctly, I hope this could solve your problem:
use feature qw(say);
use strict;
use warnings;
my $filename = "file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open (my $fh, '<', $filename ) or die "Can't find file: $!";
my @name;
my @sequence;
my $inx = 0;

while (<$fh>) {
    chomp ($name[$inx] = <$fh>);
    chomp ($sequence[$inx] = <$fh>);
    if (
        index($sequence[$inx], $exampleentry) != -1
        || index($exampleentry, $sequence[$inx]) != -1
    ) {
        $returnval = "The sequence matches: " . $name[$inx];
        last;
    }
}
say $returnval;
Notes:
I have changed variable names to follow snake_case convention. For example, the variable @Name is better written using all lower case as @name.
I changed the open() call to follow the new recommended 3-parameter style, see Don't Open Files in the old way for more information.
Used feature say instead of print
Added a chomp after each readline to avoid storing newline characters in the arrays.

Change value of CSV in terminal

I have a huge CSV file with 500,000+ lines. I want to add an amount to the "Price" column via the terminal in Ubuntu. I tried using awk (best solution?) but I don't know how. (I also need to keep the header in the new file.)
Here is an example of the file
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"379,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"882,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4306,00";"0";"Minolta"
I want to for example, add 125 to the price so the output is:
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"504,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"1007,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4431,00";"0";"Minolta"
$ awk 'BEGIN {FS=OFS="\";\""} NR>1 {$3 = sprintf("%.2f", $3+125)}1' p.txt
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"504,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"1007,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4431,00";"0";"Minolta"
Note that this requires a value of the environment variable LC_NUMERIC that uses , as the decimal separator (I had mine set to LC_NUMERIC="de_DE", for example).
For more DRYness you can pass in the amount you want to add with -v:
$ awk -v n=125 'BEGIN {FS=OFS="\";\""} NR>1 {$3 = sprintf("%.2f", $3+n)}1' p.txt
If you don't care so much about the formatting (that is, if "4431" instead of "4431,00" is acceptable), you can skip the sprintf:
$ awk -v n=125 'BEGIN {FS=OFS="\";\""} NR>1 {$3+=n}1' p.txt
EDIT: Set FS and OFS in BEGIN block, instead of independently via -v, as suggested in the comments (to better ensure that they receive the same value, since it's important that they be the same).
Perl to the rescue! Save as add-price, run as perl add-price input.csv 125.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
my ($file, $add) = @ARGV;

my $csv = 'Text::CSV'->new({ binary       => 1,
                             sep_char     => ';',
                             eol          => "\n",
                             always_quote => 1,
                           }) or die 'Text::CSV'->error_diag;

open my $IN,  '<', $file       or die $!;
open my $OUT, '>', "$file.new" or die $!;

while (my $row = $csv->getline($IN)) {
    if (1 != $csv->record_number) {
        my $value = $row->[2];
        $value =~ s/,/./;
        $value = sprintf "%.2f", $value + $add;
        $value =~ s/\./,/;
        $row->[2] = $value;
    }
    $csv->print($OUT, $row);
}
close $OUT or die $!;
You can also use PHP and this fantastic library: https://github.com/parsecsv/parsecsv-for-php.
First download the library, add it to a new folder, and put a copy of your CSV file in that folder (make sure to use a copy; the save method of this library can delete the data of your CSV file if you do not use it properly).
With this library you can parse and modify the values directly:
<?php
// !!! Make a copy of your csv file before executing this
// Require the Parse CSV library, which you can find here: https://github.com/parsecsv/parsecsv-for-php
require_once 'parsecsv.lib.php';

// Instantiate it
$csv = new parseCSV();

// Load your file
$csv->auto('data.csv');

// Get the number of data rows
$nb_data_rows = count($csv->data) - 1;

// Iterate through each data row.
for ($i = 0; $i <= $nb_data_rows; $i++) {
    // Convert the decimal comma to a dot so PHP can do arithmetic, then add 125
    $new_price = str_replace(',', '.', $csv->data[$i]["Price"]) + 125;

    // Format the price back with two decimals and a decimal comma
    $new_price = number_format($new_price, 2, ',', '');

    // Modify the ith row of your csv data
    $csv->data[$i] = array(
        "Productno."  => $csv->data[$i]["Productno."],
        "Description" => $csv->data[$i]["Description"],
        "Price"       => $new_price,
        "Stock"       => $csv->data[$i]["Stock"],
        "Brand"       => $csv->data[$i]["Brand"],
    );
}

// Save once, after all rows have been updated
$csv->save();
If you aren't concerned about the possibility of ';' occurring within the first two fields, and if you don't want to be bothered with dependence on environment variables, then consider:
awk -F';' -v add=125 '
  function sum(s, d) {  # global: q, add
    gsub(q, "", s);
    split(s, d, ",");
    return (add+d[1]) "," d[2];
  }
  BEGIN { OFS=FS; q="\""; }
  NR>1  { $3 = q sum($3) q }
  { print }'
This preserves the double-quotes ("). Using your input, the above script produces:
"Productno.";"Description";"Price";"Stock";"Brand"
"/5PL0006";"Drum Unit";"504,29";"10";"Kyocera"
"00096103";"Main pcb HUK, OP6w";"1007,00";"0";"OKI"
"000J";"Drum, 7033/7040 200.000";"4431,00";"0";"Minolta"

Using Perl or Linux built-in command-line tools, how can I quickly map one integer to another?

I have a text file mapping of two integers, separated by commas:
123,456
789,555
...
It's 120Megs... so it's a very long file.
I keep having to search for a value in the first column and return the second, e.g., look up 789 --returns--> 555, and I need to do it FAST, using regular Linux built-ins.
I'm doing this right now and it takes several seconds per look-up.
If I had a database I could index it. I guess I need an indexed text file!
Here is what I'm doing now:
my $lineFound=`awk -F, '/$COLUMN1/ { print $2 }' ../MyBigMappingFile.csv`;
Is there any easy way to pull this off with a performance improvement?
The hash suggestions are the natural way an experienced Perler would do this, but they may be suboptimal in this case. They scan the entire file and build a large, flat data structure in linear time. Cruder methods can short-circuit with a worst case of linear time, and usually less in practice.
I first made a big mapping file:
my $LEN = shift;
for (1 .. $LEN) {
    my $rnd = int rand( 999 );
    print "$_,$rnd\n";
}
With $LEN passed on the command line as 10000000, the file came out to 113MB. Then I benchmarked three implementations. The first is the hash lookup method. The second slurps the file and scans it with a regex. The third reads line-by-line and stops when it matches. Complete implementation:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw{timethese};

my $FILE  = shift;
my $COUNT = 100;
my $ENTRY = 40;

slurp(); # Initial file slurp, to get it into the hard drive cache

timethese( $COUNT, {
    'hash'       => sub { hash_lookup( $ENTRY ) },
    'scalar'     => sub { scalar_lookup( $ENTRY ) },
    'linebyline' => sub { line_lookup( $ENTRY ) },
});

sub slurp
{
    open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
    local $/;    # slurp mode for this sub only
    my $s = <$fh>;
    close $fh;
    return $s;
}

sub hash_lookup
{
    my ($entry) = @_;
    my %data;

    open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
    while( <$fh> ) {
        my ($name, $val) = split /,/;
        $data{$name} = $val;
    }
    close $fh;

    return $data{$entry};
}

sub scalar_lookup
{
    my ($entry) = @_;
    my $data = slurp();
    my ($val) = $data =~ /^ $entry , (\d+) $/xm;
    return $val;
}

sub line_lookup
{
    my ($entry) = @_;
    my $found;

    open( my $fh, '<', $FILE ) or die "Can't open $FILE: $!\n";
    while( <$fh> ) {
        my ($name, $val) = split /,/;
        if( $name == $entry ) {
            $found = $val;
            last;
        }
    }
    close $fh;

    return $found;
}
Results on my system:
Benchmark: timing 100 iterations of hash, linebyline, scalar...
      hash: 47 wallclock secs (18.86 usr + 27.88 sys = 46.74 CPU) @ 2.14/s (n=100)
linebyline: 47 wallclock secs (18.86 usr + 27.80 sys = 46.66 CPU) @ 2.14/s (n=100)
    scalar: 42 wallclock secs (16.80 usr + 24.37 sys = 41.17 CPU) @ 2.43/s (n=100)
(Note I'm running this off an SSD, so I/O is very fast, which perhaps makes that initial slurp() unnecessary. YMMV.)
Interestingly, the hash implementation is just as fast as linebyline, which isn't what I expected. By using slurping, scalar may end up being faster on a traditional hard drive.
However, by far the fastest is a simple call to grep:
$ time grep '^40,' int_map.txt
40,795
real 0m0.508s
user 0m0.374s
sys 0m0.046s
Perl could easily read that output and split apart the comma in hardly any time at all.
Edit: Never mind about grep. I misread the numbers.
120 meg isn't that big. Assuming you've got at least 512MB of ram, you could easily read the whole file into a hash and then do all of your lookups against that.
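A minimal sketch of that (the mapping file name comes from the question; the keys looked up are just examples):
use strict;
use warnings;

my %map;
open my $fh, '<', 'MyBigMappingFile.csv' or die "Can't open mapping file: $!";
while (<$fh>) {
    chomp;
    my ($from, $to) = split /,/;
    $map{$from} = $to;
}
close $fh;

# After the one-time load, every lookup is a constant-time hash access.
for my $key (789, 123) {
    print "$key => ", $map{$key} // 'not found', "\n";
}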
use:
sed -n "/^$COLUMN1/{s/.*,//p;q}" file
This optimizes your code in three ways:
1) No needless splitting each line in two on ",".
2) You stop processing the file after the first hit.
3) sed is faster than awk.
This should more than halve your search time.
HTH Chris
It all depends on how often the data change and how often in the course of a single script invocation you need to look up.
If there are many lookups during each script invocation, I would recommend parsing the file into a hash (or array if the range of keys is narrow enough).
If the file changes every day, creating a new SQLite database might or might not be worth your time.
If each script invocation needs to look up just one key, and if the data file changes often, you might get an improvement by slurping the entire file into a scalar (minimizing memory overhead) and doing a pattern match on that (instead of parsing each line).
#!/usr/bin/env perl
use warnings; use strict;
die "Need key\n" unless #ARGV;
my $lookup_file = 'lookup.txt';
my ($key) = #ARGV;
my $re = qr/^$key,([0-9]+)$/m;
open my $input, '<', $lookup_file
or die "Cannot open '$lookup_file': $!";
my $buffer = do { local $/; <$input> };
close $input;
if (my ($val) = ($buffer =~ $re)) {
print "$key => $val\n";
}
else {
print "$key not found\n";
}
On my old slow laptop, with a key towards the end of the file:
C:\Temp> dir lookup.txt
...
2011/10/14 10:05 AM 135,436,073 lookup.txt
C:\Temp> tail lookup.txt
4522701,5840
5439981,16075
7367284,649
8417130,14090
438297,20820
3567548,23410
2014461,10795
9640262,21171
5345399,31041
C:\Temp> timethis lookup.pl 5345399
5345399 => 31041
TimeThis : Elapsed Time : 00:00:03.343
This example loads the file into a hash (which takes about 20 seconds for 120 MB on my system). Subsequent lookups are then nearly instantaneous. This assumes that each number in the left column is unique. If that's not the case, you would need to push right-hand numbers that share the same left-hand number onto an array; a sketch of that variant follows the example output below.
use strict;
use warnings;
my ($csv) = @ARGV;
my $start=time;
open(my $fh, $csv) or die("$csv: $!");
$|=1;
print("loading $csv... ");
my %numHash;
my $p=0;
while(<$fh>) { $p+=length; my($k,$v)=split(/,/); $numHash{$k}=$v }
print("\nprocessed $p bytes in ",time()-$start, " seconds\n");
while(1) { print("\nEnter number: "); chomp(my $i=<STDIN>); print($numHash{$i}) }
Example usage and output:
$ ./lookup.pl MyBigMappingFile.csv
loading MyBigMappingFile.csv...
processed 125829128 bytes in 19 seconds
Enter number: 123
322
Enter number: 456
93
Enter number:
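And a minimal sketch of the duplicate-key variant mentioned above, where each key maps to an array of values instead of a single one:
use strict;
use warnings;

my ($csv) = @ARGV;
open(my $fh, '<', $csv) or die("$csv: $!");

my %numHash;
while (<$fh>) {
    chomp;
    my ($k, $v) = split /,/;
    push @{ $numHash{$k} }, $v;    # keep every value seen for this key
}
close $fh;

while (1) {
    print "\nEnter number: ";
    chomp(my $i = <STDIN>);
    print join(', ', @{ $numHash{$i} || [] }), "\n";
}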
Does it help if you cp the file to /dev/shm and query the mapping with awk/sed/perl/grep/ack/whatever?
Don't tell me you are working on a machine with 128 MB of RAM. :)
