Related
I would like to know if there will be a way to transform a csv to the JSON format suitable for the Tabulator library?
The idea would be to have a format as seen on excel :
- the first cell on the top left, empty
- columns A, B, C... AA, AB... according to the number of cells on the longest row
- the line number automatically on the first cell of each line)
I had the idea of doing it directly with loops, but it takes a lot of time I find. I don't see any other way.
Thank you for the help.
Check the following function, I hope this is what you are looking for...
let csvfile = 'title1,title2,title3,title4\n1,2,3,4\n11,22,33,44' //YOUR CSV FILE
let capLetters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' // ALPHABET SET
let finalJson = [];
let headers;
let line =[];
convertCSV2JSON(csvfile)
function convertCSV2JSON(csv) {
line = csv.split("\n"); //PARSE ALL AVAILABLE LINES INTO ARRAY
result = [];
headers = line[0].split(","); //PARSE ALL AVAILABLE STRING NAMES INTO ARRAY AND KEEP ONLY THE FIRST ONE (HEADER)
line.slice(1).forEach(function(item,i){ //RUN EACH ITEM EXCLUDING COLUMN NAMES
var obj = {};
if(line[i] === null || line[i] === undefined) {
}else{
var entries = line[i+1].split(","); // SEPARATE FOUND ENTRIES EXCLUDING COLUMN NAMES (i+1)
for(var j = 0; j < entries.length; j++) { // PARSE ENTRIES
obj[convert2Letters(j)] = entries[j]; // ASSIGN A LETTER AS COLUMN NAME
}
}
finalJson.push(obj);
})
console.log(finalJson);
}
function convert2Letters(iteration) {
let readyLetter = ''
while (iteration >= 0) {
readyLetter += capLetters[iteration % 26]
iteration = Math.floor(iteration / 26) - 1
}
return readyLetter
}
The fuzzy part was at foreach() function, because you cannot initiate index at your preference... slice() did the trick!
Moreover convert2Letters() function takes an array of letters and on each iteration finds the modulus of 26 letters, removing by one shows you the next combination...
Example:
If you have 30 columns it will give 30 % 26 = 4
4 corresponds to capLetters[4] which is 'E'
calculate next: iteration = Math.floor(iteration / 26) - 1 which means on every 26 increment (0,26,52,78...) it will give (-1,0,1,2...) corresponding. So a 30 columns iteration will have 0 as result which corresponds to capLetters[0] = 'A'
Resulting: 30 columns will give letters 'EA'
I found this interview question floating around, and after having given much thought to it, I couldn't really develop a sound algorithm for it.
Given a string of numbers in sequential order, find the missing number.The range of numbers is not given.
Sample Input:"9899100101103104105"
Answer:102
This is a simple problem.
Guess the number of digits for the first number
Read numbers from the string one by one. If the previous number you have read is x, the next number must be either x + 1 or x + 2. If it is x + 2, remember x + 1 as the missed number, continue until the end of the string anyway to verify that the initial guess was correct. If you read something else than x + 1 or x + 2, the initial guess was wrong and you need to restart with (next) guess.
With your example:
9899100101103104105
First guess length 1
read 9
the next number should be either 10 or 11. Read the next two digits, you get 89.
That is incorrect, so the initial guess was wrong.
Second guess length 2
read 98
the next number should be either 99 or 100. Read the next two digits for 99
the next number should be either 100 or 101. Read the next three digits for 100
... 101
... 103 (remember 102 as the missed number)
... 104
... 105
end of input
Guess of length 2 was verified as correct guess and 102 reported as missing number.
The only dififcult part, of course, is figuring out how many digits the numbers have. I see two approaches.
Try a certain number of digits for the first number, decide what the following number should therefore be (there'll be two options, depending on whether the missing number is the second one), and see if that matches the following string of digits. If so, continue on. If the string doesn't fit the pattern, try again with a different number of digits.
Look at the starting and ending portions of the string, and reason the number of digits based on that and the length of the string. This one's a little more handwavey.
digits=1
parse the string like the first number conatins digits digits only.
parse the next number and check if it is sequential correct related to the last parsed one
if it decreases, digit+=1, goto 1.
if it is 2 higher than the last parsed, you might found the gap, parse the rest, if parsing the restis not an increasing sequence, digit+=1, goto 2, otherwise you have found the gap.
if it is 1 higher than the last parsed number, goto 3.
digit+=1, goto 2. (I am not sure if this case can ever happen)
Example:
given: "131416".
1. digits=1
2. parse '1'
3. parse '3'
4. it does not decrease
5. possibly found the gap: parse the rest '1416' fails, because '1' != '4'
=> digit+=1 (digit=2) goto 2
2. parse '13'
3. parse '14'
4. it does not decrease
5. it is no 2 higher than the last parsed one (13)
6. it is 1 higher (14 = 13+1) => goto 3
3. parse '16'
4. it does not decrease
5. possibly found the gap: parse the rest '' passed because nothing more to parse,
=> found the gab: '15' is the missing number
Here is a working C# solution you can check in LINQPad:
void Main()
{
FindMissingNumberInString("9899100101103104105").Dump("Should be 102");
FindMissingNumberInString("78910121314").Dump("Should be 11");
FindMissingNumberInString("99899910011002").Dump("Should be 1000");
// will throw InvalidOperationException, we're missing both 1000 and 1002
FindMissingNumberInString("99899910011003");
}
public static int FindMissingNumberInString(string s)
{
for (int digits = 1; digits < 4; digits++)
{
int[] numbers = GetNumbersFromString(s, digits);
int result;
if (FindMissingNumber(numbers, out result))
return result;
}
throw new InvalidOperationException("Unable to determine the missing number fro '" + s + "'");
}
public static int[] GetNumbersFromString(string s, int digits)
{
var result = new List<int>();
int index = digits;
int number = int.Parse(s.Substring(0, digits));
result.Add(number);
while (index < s.Length)
{
string part;
number++;
digits = number.ToString().Length;
if (s.Length - index < digits)
part = s.Substring(index);
else
part = s.Substring(index, digits);
result.Add(int.Parse(part));
index += digits;
}
return result.ToArray();
}
public static bool FindMissingNumber(int[] numbers, out int missingNumber)
{
missingNumber = 0;
int? found = null;
for (int index = 1; index < numbers.Length; index++)
{
switch (numbers[index] - numbers[index - 1])
{
case 1:
// sequence continuing OK
break;
case 2:
// gap we expect to occur once
if (found == null)
found = numbers[index] - 1;
else
{
// occured twice
return false;
}
break;
default:
// not the right sequence
return false;
}
}
if (found.HasValue)
{
missingNumber = found.Value;
return true;
}
return false;
}
This can likely be vastly simplified but during exploratory coding I like to write out clear and easy to understand code rather than trying to write it in as few lines of code or as fast as possible.
May I introduce you to the problem that destroyed my weekend. I have biological data in 4 columns
#ID:::12345/1 ACGACTACGA text !"#$%vwxyz
#ID:::12345/2 TATGACGACTA text :;<=>?VWXYZ
I would like to use awk to edit the first column to replace characters : and / with -
I would like to convert the string in the last column with a comma-separated string of decimals that correspond to each individual ASCII character (any character ranging from ASCII 33 - 126).
#ID---12345-1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345-2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
The first part is easy, but i'm stuck with the second. I've tried using awk ordinal functions and sprintf; I can only get the former to work on the first char in the string and I can only get the latter to convert hexidecimal to decimal and not with spaces. Also tried bash function
$ od -t d1 test3 | awk 'BEGIN{OFS=","}{i = $1; $1 = ""; print $0}'
But don't know how to call this function within awk.
I would prefer to use awk as I have some downstream manipulations that can also be done in awk.
Many thanks in advance
Using the ordinal functions from the awk manual, you can do it like this:
awk -f ord.awk --source '{
# replace : with - in the first field
gsub(/:/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=ord($4)
for(i=2;i<=length($4);i++) {
res=res","ord(substr($4,i))
}
$4=res
}1' file
Output:
#ID---12345/1 ACGACTACGA text 33,34,35,36,37,118,119,120,121,122
#ID---12345/2 TATGACGACTA text 58,59,60,61,62,63,86,87,88,89,90
Here is ord.awk (taken as is from: http://www.gnu.org/software/gawk/manual/html_node/Ordinal-Functions.html)
# ord.awk --- do ord and chr
# Global identifiers:
# _ord_: numerical values indexed by characters
# _ord_init: function to initialize _ord_
BEGIN { _ord_init() }
function _ord_init( low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str, c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}
function chr(c)
{
# force c to be numeric by adding 0
return sprintf("%c", c + 0)
}
If you don't want to include the whole of ord.awk, you can do it like this:
awk 'BEGIN{ _ord_init()}
function _ord_init(low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
{
# replace : with - in the first field
gsub(/:/,"-",$1)
# calculate the ordinal by looping over the characters in the fourth field
res=_ord_[substr($4,1,1)]
for(i=2;i<=length($4);i++) {
res=res","_ord_[substr($4,i,1)]
}
$4=res
}1' file
Perl soltuion:
perl -lnae '$F[0] =~ s%[:/]%-%g; $F[-1] =~ s/(.)/ord($1) . ","/ge; chop $F[-1]; print "#F";' < input
The first substitution replaces : and / in the first field with a dash, the second one replaces each character in the last field with its ord and a comma, chop removes the last comma.
I have a file with entries seperated by an empty space. For example:
example.txt
24676 256 218503341 2173
13236272 500 1023073758 5089
2230304 96 15622969 705
0 22 0 526
13277 28 379182 141
I would like to print, in the command line, the outcome of "column 1/ column 3" or simila. I believe it can be done with awk. However, some entries are 0, hence division by 0 gives:
fatal: division by zero attempted
In a more advanced case, I would like to find the median value (or some percentile) of the division.
There are many ways to ignore the row with a zero divisor, including:
awk '$3 != 0 { print $1/$3 }' your-data-file
awk '{ if ($3 != 0) print $1/$3 }' your-data-file
The question changed — to print 0 instead. The answer is not much harder:
awk '{ if ($3 != 0) print $1/$3; else print 0 }' your-data-file
Medians and other percentiles are much fiddlier to deal with. It's easiest if the data is in sorted order. So much easier that I'd expect to use a numeric sort and then process the data from there.
I dug out an old shell script which computes descriptive statistics - min, max, mode, median, and deciles of a single numeric column of data:
: "#(#)$Id: dstats.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
# Calculate Descriptive Statistics: min, max, median, mode, deciles
sort -n $* |
awk 'BEGIN { max = -999999999; min = 999999999; }
{ # Accumulate basic data
count[$1]++;
item[++n] = $1;
if ($1 > max) max = $1;
if ($1 < min) min = $1;
}
END { # Print Descriptive Statistics
printf("# Count = %d\n", n);
printf("# Min = %d\n", min);
decile = 1;
for (decile = 10; decile < 100; decile += 10)
{
idx = int((decile * n) / 100) + 1;
printf("# %d%% decile = %d\n", decile, item[idx]);
if (decile == 50)
median = item[idx];
}
printf("# Max = %d\n", max);
printf("# Median = %d\n", median);
for (i in count)
{
if (count[i] > count[mode])
mode = i;
}
printf("# Mode = %d\n", mode);
}'
The initial values of min and max are not exactly scientific. It serves to illustrate a point.
(This 1997 version is almost identical to its 1991 predecessor - all except for the version information line is identical, in fact. So, the code is over 20 years old.)
Here's one solution:
awk '
$3 != 0 { vals[$NR]=$1/$3; sum += vals[$NR]; print vals[$NR] }
$3 == 0 { vals[$NR]=0; print "skipping division by 0" }
END { sort vals; print "Mean = " sum/$NR ", Median ~ " vals[$NR/2] }
' < your_file
This will calculate, print, and accumulate the quotients if the 3rd column is not zero. When it reaches the end of your file (which should not have an empty line), it will print the mean and median of all the quotients, assuming 0 for each line in which it would have divided by zero.
In awk, $n means the nth field, starting with 1, and $NR means the number of records (that is, the number of lines) that have been processed. Each quotient is stored in the array vals, enabling us to calculate the median value.
In real life, the median is defined as the "middle" item given an odd number of elements, or the mean of the two "middle" items given an even number of elements.
And you're on your own when it comes to implementing the sort function!
Is there a method in Perl (not BioPerl) to find the number of each two consecutive letters.
I.e., number of AA, AC, AG, AT, CC, CA, ... in a sequence like this:
$sequence = 'AACGTACTGACGTACTGGTTGGTACGA'
PS: We can make it manually by using the regular expression, i.e., $GC=($sequence=~s/GC/GC/g) which return the number of GC in the sequence.
I need an automated and generic way.
You had me confused for a while, but I take it you want to count the dinucleotides in a given string.
Code:
my #dinucs = qw(AA AC AG CC CA CG);
my %count;
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
for my $dinuc (#dinucs) {
$count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}
Output from Data::Dumper:
$VAR1 = {
"AC" => 5,
"CC" => "",
"AG" => "",
"AA" => 1,
"CG" => 3,
"CA" => ""
};
Close to TLP's answer, but without substitution:
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my #dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}#dinucs;
for my $dinuc (#dinucs) {
while($sequence=~/$dinuc/g) {
$count{$dinuc}++;
}
}
Benchmark:
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my #dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}#dinucs;
my $count = -3;
my $r = cmpthese($count, {
'match' => sub {
for my $dinuc (#dinucs) {
while($sequence=~/$dinuc/g) {
$count{$dinuc}++;
}
}
},
'substitute' => sub {
for my $dinuc (#dinucs) {
$count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}
}
});
Output:
Rate substitute Match
Substitute 13897/s -- -11%
Match 15622/s 12% --
Regex works if you're careful, but there's a simple solution using substr that will be faster and more flexible.
(As of this posting, the regex solution marked as accepted will fail to correctly count dinucleotides in repeated regions like 'AAAA...', of which there are many in naturally occurring sequences.
Once you match 'AA', the regex search resumes on the third character, skipping the middle 'AA' dinucleotide. This doesn't affect the other dinucleotides since if you have 'AC' at one position, you're guaranteed not to have it in the next base, naturally. The particular sequence given in the question will not suffer from this problem since no base appears three times in a row.)
The method I suggest is more flexible in that it can count words of any length; extending the regex method to longer words is complicated since you have to do even more gymnastics with your regex to get an accurate count.
sub substrWise {
my ($seq, $wordLength) = #_;
my $cnt = {};
my $w;
for my $i (0 .. length($seq) - $wordLength) {
$w = substr($seq, $i, $wordLength);
$cnt->{$w}++;
}
return $cnt;
}
sub regexWise {
my ($seq, $dinucs) = #_;
my $cnt = {};
for my $d (#$dinucs) {
if (substr($d, 0,1) eq substr($d, 1,1) ) {
my $n = substr($d, 0,1);
$cnt->{$d} = ($seq =~ s/$n(?=$n)/$n/g); # use look-ahead
} else {
$cnt->{$d} = ($seq =~ s/$d/$d/g);
}
}
return $cnt;
}
my #dinucs = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
use Test::More tests => 1;
my $rWise = regexWise($sequence, \#dinucs);
my $sWise = substrWise($sequence, 2);
$sWise->{$_} //= '' for #dinucs; # substrWise will not create keys for words not found
# this seems like desirable behavior IMO,
# but i'm adding '' to show that the counts match
is_deeply($rWise, $sWise, 'verify equivalence');
use Benchmark qw(:all);
cmpthese(100000, {
'regex' => sub {
regexWise($sequence, \#dinucs);
},
'substr' => sub {
substrWise($sequence, 2);
}
Output:
1..1
ok 1 - verify equivalence
Rate regex substr
regex 11834/s -- -85%
substr 76923/s 550% --
For longer sequences (10-100 kbase), the advantage is not as pronounced, but it still wins by about 70%.