Incorrect Empirical Semivariogram Value - geospatial

My gstat program for calculating empirical semivariogram on walker lake data is as follows
library(sp)     # for coordinates()
library(gstat)  # for variogram()

data <- read.table("C:/Users/chandan/Desktop/walk470.csv", sep = ",", header = TRUE)
coordinates(data) <- ~x + y
walk.var1 <- variogram(v ~ x + y, data = data, width = 5, cutoff = 100)
The result is as follows
np dist gamma
1 105 3.836866 32312.63
2 459 8.097102 44486.82
3 1088 12.445035 60230.48
4 985 17.874264 76491.36
5 1579 22.227711 75103.67
6 1360 27.742246 83595.83
7 1747 32.291155 91248.20
8 1447 37.724524 97610.65
9 2233 42.356048 85857.03
10 1794 47.537644 93263.63
11 2180 52.295711 98282.98
12 2075 57.601882 91589.39
13 2848 62.314646 91668.70
14 2059 67.627847 95803.45
15 2961 72.310575 91975.76
16 2240 77.648900 95858.87
17 3067 82.379802 88123.56
18 2463 87.641359 87568.94
19 2746 92.334788 97991.56
20 2425 97.754121 93914.31
I have written my own code for the same problem using the classical sample variogram estimator. The number of points (np) and the distances come out exactly as in the gstat output, but the gamma values do not. Why is that, and what should I do to make them match the gstat output exactly?
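For reference, the classical (Matheron) estimator I implemented is essentially the following (a minimal R sketch with my own function name and binning; the bin boundaries may not line up exactly with gstat's defaults):

# gamma(h) = 1 / (2 * |N(h)|) * sum over pairs in bin h of (z_i - z_j)^2
classical_variogram <- function(x, y, z, width = 5, cutoff = 100) {
  d    <- as.matrix(dist(cbind(x, y)))   # pairwise distances
  dz2  <- outer(z, z, "-")^2             # squared differences of the variable
  keep <- upper.tri(d) & d > 0 & d <= cutoff
  bins <- cut(d[keep], breaks = seq(0, cutoff, by = width))
  data.frame(np    = as.vector(table(bins)),
             dist  = tapply(d[keep],   bins, mean),
             gamma = tapply(dz2[keep], bins, mean) / 2)
}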
Thanks in advance...

Getting Beautiful soup to read in tables

This is my first post on Stack Overflow, and I am also fairly new to the world of Beautiful Soup, so please be kind.
I am trying to learn web scraping, and in particular, I want to be able to extract data from a government website. See code extract below:
import requests
from bs4 import BeautifulSoup

url = 'https://data.gov.sg/dataset/rainfall-monthly-total?view_id=be92ce92-d02e-4920-9479-ff6da87b45e8&resource_id=778814b8-1b96-404b-9ac9-68d6c00e637b'

# Persist parameters across requests
s = requests.Session()
r = s.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# Look for the iframe and grab its src attribute
iframe_src = soup.select_one("iframe").attrs["src"]
r = s.get(iframe_src)
print("Data table is extracted from:", iframe_src)
soup = BeautifulSoup(r.content, "html.parser")
# print(soup.prettify())

rows = soup.findAll("tr", {'class': ['odd', 'even']})
dat = []
for row in rows:
    dat.append([str(td) for td in row.findAll('td')])
As you can see, I was just trying to extract the table data directly, with the eventual goal of converting it to CSV. However, for some reason, I can't seem to see any table data when I execute the code.
Upon further reading, I learnt that BeautifulSoup does not work with JavaScript. I suspect that is the cause, and my preliminary investigation of the data set seems to indicate this is the case, but I am not sure. Thus:
1) Is the lack of output indeed due to the page being rendered with JavaScript?
2) I read that Selenium can be used - how do I incorporate it here?
3) I have multiple pages; how do I get all the data into one CSV table?
Thank you in advance.
If you are learning to web scrape with Beautiful Soup, I think you have the right idea in going after a table (although once you have learned how to extract <table> tags with Beautiful Soup, you should really use pandas, as it makes this a lot easier). But tables are a good way to understand how to work through tags in HTML.
I would advise starting with a different site. You are correct that this page is JavaScript-rendered, so you would need the HTML rendered first (by something like Selenium) before parsing it. Since you are just starting off, don't over-complicate things by having to learn bs4 and Selenium at the same time. Here's a good site to practice scraping a table: still a little tricky, but a static table to play on.
This particular table can be pulled from a separate request URL that returns a JSON response. You can take that JSON and easily turn it into a table with pandas (see code below). An API like this, when available, is the preferred method: the data comes in a nice JSON structure (sometimes not as nice, but at least consistent), and you don't necessarily have to iterate through "pages":
Code:
import pandas as pd
import requests
url = 'https://data.gov.sg/dataset/5942f8bd-4240-4f68-acd2-a5a276958237/resource/778814b8-1b96-404b-9ac9-68d6c00e637b/data'
payload = {'limit':'2000'}
# Persists parameters across requests
s = requests.Session()
jsonData = s.get(url, params=payload).json()
df = pd.DataFrame(jsonData['records'])
Output:
print (df)
_id month total_rainfall
0 459 2020-03 108.8
1 458 2020-02 65
2 457 2020-01 88.4
3 456 2019-12 421.5
4 455 2019-11 137.4
5 454 2019-10 176.8
6 453 2019-09 22.8
7 452 2019-08 11.8
8 451 2019-07 12.2
9 450 2019-06 173.8
10 449 2019-05 69
11 448 2019-04 174.8
12 447 2019-03 72.2
13 446 2019-02 31.6
14 445 2019-01 63.6
15 444 2018-12 172.6
16 443 2018-11 169.6
17 442 2018-10 234.4
18 441 2018-09 144.4
19 440 2018-08 121.6
20 439 2018-07 143.2
21 438 2018-06 182.6
22 437 2018-05 132.2
23 436 2018-04 61.2
24 435 2018-03 44.6
25 434 2018-02 14.8
26 433 2018-01 287
27 432 2017-12 371.2
28 431 2017-11 268.6
29 430 2017-10 120.8
.. ... ... ...
429 30 1984-06 255
430 29 1984-05 186.5
431 28 1984-04 153.1
432 27 1984-03 361.3
433 26 1984-02 470.4
434 25 1984-01 251.2
435 24 1983-12 370.6
436 23 1983-11 228.8
437 22 1983-10 212.7
438 21 1983-09 170.8
439 20 1983-08 262.2
440 19 1983-07 190
441 18 1983-06 94
442 17 1983-05 160.8
443 16 1983-04 33.6
444 15 1983-03 18.6
445 14 1983-02 5.6
446 13 1983-01 246
447 12 1982-12 494.1
448 11 1982-11 134
449 10 1982-10 39.3
450 9 1982-09 54.1
451 8 1982-08 169.5
452 7 1982-07 76.3
453 6 1982-06 59.5
454 5 1982-05 102.2
455 4 1982-04 157
456 3 1982-03 160.8
457 2 1982-02 27.8
458 1 1982-01 107.1
[459 rows x 3 columns]
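From there, writing out the CSV the question ultimately asks for is one line (the file name here is just an example):

df.to_csv('rainfall_monthly_total.csv', index=False)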

Create new file from two files with a common (unsorted) column

This is probably a very basic problem but I am stumped.
I am attempting to create a new file from two large tab-delimited files that share a common column. The heads of the two files are:
file1
k141_1 319 4 0
k141_2 400 9 0
k141_3 995 43 0
k141_4 670 21 0
k141_5 372 8 0
k141_6 359 9 0
k141_7 483 18 0
k141_8 1826 76 0
k141_9 566 15 0
k141_10 462 14 0
file2
U k141_1 0
U k141_11 0
U k141_24 0
U k141_30 0
C k141_32 2 18 77133,212695,487010, 5444279,5444689,68971626, TIEYSSLHACRSTLEDPT, cellular organisms; Bacteria;
C k141_38 1566886 16 1566886, 50380646, ELVMDREAWCAAIHGV, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium sp. WCM 7299;
U k141_46 0
C k141_57 186802 23 1496,1776046,1776047, 64601048,64601468,64601628,64603689,64604310,64605360,71436886,71436980,71437249,71437272,71437295, CLLYTSDAADDLLCVDLGGRRII, cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales;
U k141_64 0
C k141_73 131567 14 287,305,1496,2209,1483596, 47871795,47873311,47873322,47880313,47880625,53485494,53485498,62558724,71434583,71434608, LSRGLGDVYKRQIL,SCLVGSEMCIRDRY,YLSLIHISEPTRQE, cellular organisms;
I want the new file to contain all 4 columns from file 1 plus the 8th column of file 2 (the taxonomic information separated by semicolons).
I have attempted to sort the files on the common column, but the resulting orders are not the same despite the columns containing exactly the same values.
For example,
[user#compute02 Contigs]$ sort -k 1 file1 | head
k141_1000 312 253 0
k141_1001 553 13 0
k141_1002 518 19 0
k141_1003 812 30 0
k141_1004 327 13 0
k141_1005 454 18 0
k141_100 595 20 0
k141_1006 1585 78 0
k141_1007 537 23 0
[user#compute02 Contigs]$ sort -k 2 file2 | head
U k141_1 0
C k141_1000 305 26 305, 62554095,62558735, PVSYTHLRAHETRGNLVCRLLLEKKK, cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Burkholderiaceae; Ralstonia; Ralstonia solanacearum;
C k141_1001 946362 11 946362, 5059526, SGRNGLPLKVR, cellular organisms; Eukaryota; Opisthokonta; Choanoflagellida; Craspedida; Salpingoecidae; Salpingoeca; Salpingoeca rosetta;
C k141_1002 131567 15 287,305,2209,1483596, 47870166,47873029,47873592,53485045,55518854,62558495, RTCLLYTSPSPRDKR,NLSLIHISEPTRQEA,EPVSYTHLRAHETRG, cellular organisms;
C k141_100 2 14 287,1496,1776047, 53544868,64603691,71437007, SRSSAASDVYKRQV, cellular organisms; Bacteria;
U k141_1003 0
C k141_1004 2 14 518,1776046,1776047, 28571314,64603094,64605737, LFFFNDTATTEIYT, cellular organisms; Bacteria;
U k141_1005 0
C k141_1006 948 13 948, 73024016, QAPLSMGFSRQEY, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Anaplasmataceae; Anaplasma; phagocytophilum group; Anaplasma phagocytophilum;
C k141_1007 287 14 287, 50594737, RRQRQMCIRDRVGS, cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Pseudomonadaceae; Pseudomonas; Pseudomonas aeruginosa group; Pseudomonas aeruginosa;
Any assistance would be greatly appreciated :)
This solution should work.
for i in `cat file1.txt | awk -F" " '{print $1}'`
do
    F1=`grep -w $i file1.txt`
    F2=`grep -w $i file2.txt | awk -F" " '{$1=$2=$3=$4=$5=$6=$7=""; print $0}'`
    echo $F1 $F2
done
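For what it's worth, a single awk pass over both files avoids re-grepping the files for every ID. This is only a sketch, assuming the file names from the question, that the contig ID is field 1 of file1 and field 2 of file2, and that the taxonomy starts at field 8 of file2:

# Read file2 first (NR==FNR), storing the taxonomy fields (8 and up) keyed by the contig ID,
# then append the stored taxonomy to each line of file1.
awk 'NR==FNR { tax = ""; for (i = 8; i <= NF; i++) tax = tax $i " "; t[$2] = tax; next }
     { print $0 "\t" t[$1] }' file2 file1 > merged_file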

read many lines with specific position

Thank you for the time you spent reading this; maybe it is a newbie question.
I have a file of 10081 lines, this is an example of the file (a nordic seismic bulletin):
2016 1 8 0921 21.5 L -22.382 -67.835 148.9 OSC 18 0.3 4.7LOSC 1
2016 1 8 1515 43.7 L -20.762 -67.475 188.7 OSC 16 .30 3.7LOSC 1
2016 1 9 0529 35.9 L -18.811 -67.278 235.9 OSC 16 0.5 3.9LOSC 1
2016 110 313 55.6 L -22.032 -67.375 172.0 OSC 14 .30 3.0LOSC 1
2016 110 1021 36.5 L -16.923 -66.668 35.0 OSC 16 0.4 4.5LOSC 1
I tried the following code to extract some information from the file and save it in a separate file:
awk 'NR==1 {print substr($0,24,7), substr($0,32,7), substr($0,40,5)}' select.inp > lat_lon_depth.xyz
substr($0,24,7) means that I take 7 characters starting from the 24th position, which is
the latitude information (-22.382), and the same for the others (longitude from the 32nd position with 7 characters, and depth from the 40th position with 5 characters).
So the question: is it possible to go through all the lines of the file and get all the latitude, longitude and depth values?
Thank you for the time
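For what it's worth, simply dropping the NR==1 condition makes awk apply the same substr() extraction to every line of the file instead of only the first one:

awk '{print substr($0,24,7), substr($0,32,7), substr($0,40,5)}' select.inp > lat_lon_depth.xyz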

SAS Data organization

Dataset Sample
I have a data set like the one in the attached picture, and I want only the observations that have the same numsecur every year.
How do I do this with SAS proc sql? Would this be easier to do in Stata? If so, what procedure can I use?
You look like a new user to Stack Overflow. Welcome. Your question is getting downvoted for at least three reasons:
1) It's not really clear what you want from your description of the problem and the data you're providing.
2) You haven't shown any attempts at what you've tried.
3) Providing your data as a picture is not great. If you're going to provide data, it's most helpful to provide it in a form that is easy for others to consume in their programs. After all, you're asking for our help, so make it easier for us to help you. If you included something like the following, we would just have to copy and paste to create your dataset to work with:
DATA test;
INPUT ID YEAR EXEC SUM;
DATALINES;
1573 1997 50 1080
1581 1997 51 300
1598 1996 54 80
1598 1998 54 80
1598 1999 54 80
1602 1996 55 112.6
1602 1997 55 335.965
;
RUN;
That being said, the following MAY give you what you're looking for, but it's only a guess, as I'm not sure this is really what you're asking:
proc sql noprint;
  create table testout as
    select *, count(*) as cnt
    from test
    group by sum
    having cnt > 1;
quit;
Are you asking to show all rows where the same SUM value appears more than once, or something else?
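If the intent is instead to keep only the IDs whose SUM stays the same across every year they appear (which is one way to read "same numsecur every year"), a variant of the same query might be closer. This is only an untested sketch:

proc sql noprint;
  create table testout2 as
    select *
    from test
    group by id
    having count(distinct sum) = 1 and count(*) > 1;
quit;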
Assuming I understand your question correctly, you would like to keep the observations from the same company/individual only if the company has the same numsecur every year. So, here is what I would try using STATA:
input ID YEAR EXEC SUM
1573 1997 50 1080 //
1581 1997 51 300 //
1598 1996 54 80 //
1598 1998 54 80 //
1598 1999 54 80 //
1602 1996 55 112.6 //
1602 1997 55 335.965 //
1575 1997 50 1080 //
1575 1998 51 1080 //
1595 1996 54 80 //
1595 1998 54 30 //
1595 1999 54 80 //
1605 1996 55 112.6 //
1605 1997 55 335.965 //
end
bysort ID SUM: gen drop=cond(_N==1, 0,_n)
drop if drop==0
The results (based on my data) show:
ID YEAR EXEC SUM drop
1. 1575 1997 50 1080 1
2. 1575 1998 51 1080 2
3. 1595 1999 54 80 1
4. 1595 1996 54 80 2
5. 1598 1996 54 80 1
6. 1598 1998 54 80 2
7. 1598 1999 54 80 3

Parse complex text file into a row of field names and second row of values

I am trying to parse a text file whose first part consists of lines that contain mostly text and a single number (with '#' at the beginning of each line). The second part of the file is made up of lines with multiple numbers, each relating to a single structure. As I need to combine these output files for several hundred cases, it would be a great help if I could process each one of these files into a single row of data. I'm having trouble doing it with combinations of bash/perl/awk. Can anyone suggest a way I can do this? (Example file below.)
Thank you for your consideration.
Best wishes,
-S
# Title Segmentation Statistics
#
# generating_program mri_segstats
# cvs_version $Id: mri_segstats.c,v 1.75.2.9 2013/02/16 00:09:33 greve Exp $
# cmdline mri_segstats --seg mri/aseg.mgz --sum stats/aseg.stats --pv mri/norm.mgz --empty --brainmask mri/brainmask.mgz --brain-vol-from-seg --excludeid 0 --excl-ctxgmwm --supratent --subcortgray --in mri/norm.mgz --in-intensity-name norm --in-intensity-units MR --etiv --surf-wm-vol --surf-ctx-vol --totalgray --euler --ctab /mnt/glusterfs/salsoman/freesurfer/ASegStatsLUT.txt --subject WCA_0162_T1_FS
# sysname Linux
# hostname barley15.stanford.edu
# machine x86_64
# user salsoman
# anatomy_type volume
#
# SUBJECTS_DIR /mnt/glusterfs/salsoman/output/FS
# subjectname WCA_0162_T1_FS
# Measure BrainSeg, BrainSegVol, Brain Segmentation Volume, 1089921.000000, mm^3
# Measure BrainSegNotVent, BrainSegVolNotVent, Brain Segmentation Volume Without Ventricles, 993734.000000, mm^3
# Measure BrainSegNotVentSurf, BrainSegVolNotVentSurf, Brain Segmentation Volume Without Ventricles from Surf, 993214.631437, mm^3
# Measure lhCortex, lhCortexVol, Left hemisphere cortical gray matter volume, 240339.518738, mm^3
# Measure rhCortex, rhCortexVol, Right hemisphere cortical gray matter volume, 236468.599276, mm^3
# Measure Cortex, CortexVol, Total cortical gray matter volume, 476808.118013, mm^3
# Measure lhCorticalWhiteMatter, lhCorticalWhiteMatterVol, Left hemisphere cortical white matter volume, 191135.667925, mm^3
# Measure rhCorticalWhiteMatter, rhCorticalWhiteMatterVol, Right hemisphere cortical white matter volume, 180013.845498, mm^3
# Measure CorticalWhiteMatter, CorticalWhiteMatterVol, Total cortical white matter volume, 371149.513423, mm^3
# Measure SubCortGray, SubCortGrayVol, Subcortical gray matter volume, 52383.000000, mm^3
# Measure TotalGray, TotalGrayVol, Total gray matter volume, 604954.118013, mm^3
# Measure SupraTentorial, SupraTentorialVol, Supratentorial volume, 991108.631437, mm^3
# Measure SupraTentorialNotVent, SupraTentorialVolNotVent, Supratentorial volume, 902611.631437, mm^3
# Measure SupraTentorialNotVentVox, SupraTentorialVolNotVentVox, Supratentorial volume voxel count, 900542.000000, mm^3
# Measure Mask, MaskVol, Mask Volume, 1694747.000000, mm^3
# Measure BrainSegVol-to-eTIV, BrainSegVol-to-eTIV, Ratio of BrainSegVol to eTIV, 0.624390, unitless
# Measure MaskVol-to-eTIV, MaskVol-to-eTIV, Ratio of MaskVol to eTIV, 0.970881, unitless
# Measure lhSurfaceHoles, lhSurfaceHoles, Number of defect holes in lh surfaces prior to fixing, 239, unitless
# Measure rhSurfaceHoles, rhSurfaceHoles, Number of defect holes in rh surfaces prior to fixing, 227, unitless
# Measure SurfaceHoles, SurfaceHoles, Total number of defect holes in surfaces prior to fixing, 466, unitless
# Measure EstimatedTotalIntraCranialVol, eTIV, Estimated Total Intracranial Volume, 1745576.756023, mm^3
# SegVolFile mri/aseg.mgz
# SegVolFileTimeStamp 2013/03/27 19:34:08
# ColorTable /mnt/glusterfs/salsoman/freesurfer/ASegStatsLUT.txt
# ColorTableTimeStamp 2013/02/25 22:23:16
# InVolFile mri/norm.mgz
# InVolFileTimeStamp 2013/03/27 14:00:28
# InVolFrame 0
# PVVolFile mri/norm.mgz
# PVVolFileTimeStamp 2013/03/27 14:00:28
# Excluding Cortical Gray and White Matter
# ExcludeSegId 0 2 3 41 42
# VoxelVolume_mm3 1
# TableCol 1 ColHeader Index
# TableCol 1 FieldName Index
# TableCol 1 Units NA
# TableCol 2 ColHeader SegId
# TableCol 2 FieldName Segmentation Id
# TableCol 2 Units NA
# TableCol 3 ColHeader NVoxels
# TableCol 3 FieldName Number of Voxels
# TableCol 3 Units unitless
# TableCol 4 ColHeader Volume_mm3
# TableCol 4 FieldName Volume
# TableCol 4 Units mm^3
# TableCol 5 ColHeader StructName
# TableCol 5 FieldName Structure Name
# TableCol 5 Units NA
# TableCol 6 ColHeader normMean
# TableCol 6 FieldName Intensity normMean
# TableCol 6 Units MR
# TableCol 7 ColHeader normStdDev
# TableCol 7 FieldName Itensity normStdDev
# TableCol 7 Units MR
# TableCol 8 ColHeader normMin
# TableCol 8 FieldName Intensity normMin
# TableCol 8 Units MR
# TableCol 9 ColHeader normMax
# TableCol 9 FieldName Intensity normMax
# TableCol 9 Units MR
# TableCol 10 ColHeader normRange
# TableCol 10 FieldName Intensity normRange
# TableCol 10 Units MR
# NRows 45
# NTableCols 10
# ColHeaders Index SegId NVoxels Volume_mm3 StructName normMean normStdDev normMin normMax normRange
1 4 41962 41962.4 Left-Lateral-Ventricle 22.0753 10.2057 3.0000 94.0000 91.0000
2 5 2150 2149.7 Left-Inf-Lat-Vent 37.5636 16.3886 5.0000 89.0000 84.0000
3 7 8273 8273.3 Left-Cerebellum-White-Matter 88.0903 11.6908 21.0000 123.0000 102.0000
4 8 35427 35427.4 Left-Cerebellum-Cortex 56.4255 12.5475 2.0000 92.0000 90.0000
5 10 6087 6086.7 Left-Thalamus-Proper 92.2098 11.7928 50.0000 124.0000 74.0000
6 11 5101 5100.7 Left-Caudate 75.0335 9.9708 29.0000 100.0000 71.0000
7 12 4773 4773.0 Left-Putamen 75.7113 6.2195 48.0000 95.0000 47.0000
8 13 1178 1177.6 Left-Pallidum 86.3354 6.2568 59.0000 104.0000 45.0000
9 14 2973 2973.1 3rd-Ventricle 27.5508 11.3394 9.0000 77.0000 68.0000
10 15 2403 2403.0 4th-Ventricle 26.8237 11.9581 6.0000 79.0000 73.0000
11 16 18347 18347.2 Brain-Stem 82.1731 12.0144 15.0000 116.0000 101.0000
12 17 3824 3824.2 Left-Hippocampus 66.7333 8.6661 26.0000 100.0000 74.0000
13 18 2087 2087.1 Left-Amygdala 63.9856 7.2932 37.0000 91.0000 54.0000
14 24 2094 2094.0 CSF 36.2929 14.6972 12.0000 90.0000 78.0000
15 26 340 340.0 Left-Accumbens-area 69.8967 8.7139 37.0000 87.0000 50.0000
16 28 2969 2969.5 Left-VentralDC 94.9737 13.6527 44.0000 122.0000 78.0000
17 30 76 75.9 Left-vessel 58.3205 11.6736 27.0000 80.0000 53.0000
18 31 1103 1102.6 Left-choroid-plexus 51.7182 16.3692 12.0000 100.0000 88.0000
19 43 38108 38108.2 Right-Lateral-Ventricle 20.2269 10.2570 0.0000 92.0000 92.0000
20 44 2165 2165.0 Right-Inf-Lat-Vent 30.2048 13.6808 0.0000 80.0000 80.0000
21 46 9715 9715.4 Right-Cerebellum-White-Matter 86.9395 8.3909 25.0000 115.0000 90.0000
22 47 41688 41688.2 Right-Cerebellum-Cortex 57.5291 10.3208 9.0000 91.0000 82.0000
23 49 4769 4769.3 Right-Thalamus-Proper 82.0576 12.2446 18.0000 106.0000 88.0000
24 50 4587 4587.1 Right-Caudate 69.9613 12.7863 11.0000 103.0000 92.0000
25 51 4694 4694.4 Right-Putamen 69.9372 7.9141 48.0000 91.0000 43.0000
26 52 1407 1406.8 Right-Pallidum 88.0501 5.7841 57.0000 105.0000 48.0000
27 53 3160 3159.6 Right-Hippocampus 63.3511 8.9283 17.0000 95.0000 78.0000
28 54 1877 1877.4 Right-Amygdala 57.3686 8.5163 20.0000 83.0000 63.0000
29 58 376 376.0 Right-Accumbens-area 70.4901 9.9104 41.0000 96.0000 55.0000
30 60 2973 2972.7 Right-VentralDC 89.6143 14.1755 29.0000 120.0000 91.0000
31 62 105 105.1 Right-vessel 50.1458 12.1126 21.0000 78.0000 57.0000
32 63 2843 2842.7 Right-choroid-plexus 46.3759 13.8319 6.0000 115.0000 109.0000
33 72 68 67.9 5th-Ventricle 42.4444 11.2861 26.0000 83.0000 57.0000
34 77 25325 25325.0 WM-hypointensities 71.8650 16.2379 5.0000 112.0000 107.0000
35 78 0 0.0 Left-WM-hypointensities 0.0000 0.0000 0.0000 0.0000 0.0000
36 79 0 0.0 Right-WM-hypointensities 0.0000 0.0000 0.0000 0.0000 0.0000
37 80 153 153.1 non-WM-hypointensities 50.4551 16.1478 18.0000 88.0000 70.0000
38 81 0 0.0 Left-non-WM-hypointensities 0.0000 0.0000 0.0000 0.0000 0.0000
39 82 0 0.0 Right-non-WM-hypointensities 0.0000 0.0000 0.0000 0.0000 0.0000
40 85 350 349.6 Optic-Chiasm 66.0833 15.7641 24.0000 102.0000 78.0000
41 251 806 805.6 CC_Posterior 119.2646 18.1322 57.0000 150.0000 93.0000
42 252 252 251.7 CC_Mid_Posterior 109.1685 16.3862 51.0000 150.0000 99.0000
43 253 295 295.4 CC_Central 113.3418 16.2739 77.0000 140.0000 63.0000
44 254 294 293.7 CC_Mid_Anterior 115.1645 17.9396 72.0000 149.0000 77.0000
45 255 657 657.4 CC_Anterior 124.1047 22.5045 54.0000 166.0000 112.0000
Have you tried Talend Open Studio / Data Integration? TOS is capable of automating complex transformations like this. The final executable of your data transformation job would be a jar file that you can easily call from your shell script. It takes some time to get started with TOS, but it's very powerful. The product is licensed under GPL v2 and has quite an active community.
Of course, you can code up some awk/sed/perl orgy and you will get to a result, but in your case it might become quite unreadable and unmaintainable for such a complex transformation.
HTH, Michael
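That said, for just the "# Measure" lines, the "one row of field names, one row of values" idea is compact in awk. This is only a sketch: the input file name and the tab-separated output layout are my assumptions, and the numbered table at the bottom of the file would need similar handling.

# Split on commas; on "# Measure" lines the value is the next-to-last field and the short
# name is the last word of the first field. Collect them, then print a name row and a value row.
awk -F', *' '
/^# Measure/ {
    split($1, a, " ")                 # $1 is e.g. "# Measure BrainSeg" -> a[3] == "BrainSeg"
    names = names a[3] "\t"
    vals  = vals  $(NF-1) "\t"
}
END { print names; print vals }
' aseg.stats > measures_row.txt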
