plot data organized by rows - python-3.x

The data are organized with the first column holding the year and the next twelve columns holding the monthly averages for that year. I need to plot month on the x-axis and monthly average on the y-axis. The data look like this:
1871 -0.107 0.004 -0.503 -0.650 -0.379 0.025 0.317 0.027 -0.732 -0.685 0.037 0.566
1872 0.376 -0.241 -0.904 -1.019 0.367 0.282 -0.061 0.597 0.779 0.818 1.070 1.203
1873 0.831 0.762 0.379 -0.028 0.014 0.349 0.189 0.428 -0.170 0.643 0.859 0.317
1874 0.063 0.125 -0.068 -0.124 0.365 0.535 0.693 1.298 0.554 0.566 0.889 0.185
1875 -0.369 -0.764 -1.238 0.111 0.683 0.696 0.505 1.008 1.210 0.945 -0.307 -0.184
I would like output similar to this graph: (image not included)
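A minimal matplotlib sketch of one way to do this, with the first two data rows inlined so it runs standalone (in practice you would read the lines from your file): each row becomes one line on the plot, month on x, monthly average on y.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Inline copy of the first two rows; in practice, read these from your file.
data = """\
1871 -0.107 0.004 -0.503 -0.650 -0.379 0.025 0.317 0.027 -0.732 -0.685 0.037 0.566
1872 0.376 -0.241 -0.904 -1.019 0.367 0.282 -0.061 0.597 0.779 0.818 1.070 1.203
"""

months = list(range(1, 13))
for line in data.splitlines():
    fields = line.split()
    year, values = fields[0], [float(v) for v in fields[1:]]
    plt.plot(months, values, label=year)  # one line per year

plt.xlabel('Month')
plt.ylabel('Monthly average')
plt.legend(title='Year')
plt.savefig('monthly_averages.png')
```

To read from a file instead, replace `data.splitlines()` with iteration over `open('yourfile.txt')`.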

Related

Python function takes a long time to return after final statement

I have a Python program I'm using to parse through a bunch of data. For some reason, it takes a long time for the callee to return control to the caller after its final statement.
parser.py
import win32com.client as win32
import docx

def gen_report(self):
    xl = win32.Dispatch('Excel.Application')
    doc = docx.Document()
    # Data shuffling here. Most of the parsing has been completed previously, so this function
    # primarily just copies charts from Excel into a Word document and then adds headings
    # and such to the Word doc.
    doc.save(doc_name)
    xl.Quit()
    print('Report generated successfully!')
gui.py
def parse_thread(self):
    parser.gen_report()
    print('Report generation returned')
The final statement in gen_report() is the print statement, so I would expect the print statement in parse_thread() to execute almost immediately afterward. Instead, it sometimes takes up to 15 seconds after the first print statement before the second one runs.
I thought this might have something to do with garbage collection after parsing a lot of data, but disabling garbage collection did not change the performance at all. Running the program under cProfile shows that parse_thread takes up most of the execution time, the majority of which I assume is the hang after report generation returns. Garbage collection doesn't even show up among the 20 most expensive entries:
ncalls tottime percall cumtime percall filename:lineno(function)
1 20.374 20.374 32.621 32.621 gui.py:97(parse_thread)
41 3.564 0.087 3.564 0.087 {method 'InvokeTypes' of 'PyIDispatch' objects}
2 2.316 1.158 2.316 1.158 {built-in method pythoncom.CoCreateInstance}
7 0.655 0.094 0.655 0.094 {built-in method PIL._imaging.grabclipboard_win32}
1 0.649 0.649 8.662 8.662 Parser.py:386(gen_report)
9 0.504 0.056 0.504 0.056 {method 'encode' of 'ImagingEncoder' objects}
834432 0.353 0.000 0.356 0.000 worksheet.py:247(_get_cell)
73864 0.317 0.000 0.530 0.000 _writer.py:70(lxml_write_cell)
1956380 0.228 0.000 0.228 0.000 worksheet.py:347(<genexpr>)
683154 0.221 0.000 0.531 0.000 worksheet.py:216(cell)
606673 0.202 0.000 0.682 0.000 worksheet.py:515(<genexpr>)
62 0.199 0.003 0.428 0.007 worksheet.py:339(max_row)
224 0.175 0.001 0.175 0.001 {method 'Bind' of 'PyITypeComp' objects}
9 0.156 0.017 0.156 0.017 {method 'close' of '_io.BufferedRandom' objects}
73800 0.132 0.000 0.273 0.000 _reader.py:177(parse_cell)
126 0.128 0.001 0.128 0.001 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
48 0.099 0.002 0.099 0.002 {method 'GetTypeComp' of 'PyITypeInfo' objects}
151270 0.098 0.000 0.144 0.000 worksheet.py:793(_move_cell)
3 0.098 0.033 0.827 0.276 _reader.py:350(bind_cells)
757136 0.090 0.000 0.090 0.000 worksheet.py:373(<genexpr>)
Any thoughts on why it might be taking so long for control to return to the caller?
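One thing worth checking: the gap between a function's last statement and control returning to the caller is often the interpreter tearing down the frame, i.e. deallocating large local objects (and, with win32com, releasing COM references). A minimal sketch with a hypothetical stand-in for gen_report that timestamps both sides of the boundary:

```python
import time

def gen_report_sim():
    # Hypothetical stand-in for the real gen_report: build a large local
    # structure, print, then let the frame tear down.  Deallocating large
    # locals (and releasing COM objects) happens after the last statement
    # but before control returns to the caller.
    big = [[0] * 10 for _ in range(100_000)]
    print('last statement done at', time.perf_counter())
    return len(big)

t0 = time.perf_counter()
rows = gen_report_sim()
print(f'control returned after {time.perf_counter() - t0:.3f}s ({rows} rows built)')
```

If the two timestamps differ by seconds in your program, the time is going into teardown rather than your code; explicitly releasing the heavy objects (e.g. `del` or closing workbooks) before the final print would confirm it.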

How to interpret python cProfile output

I am running cProfile on my Python script to see where I can improve performance, and using snakeviz to visualize the results. The output is pretty opaque, however; how do I interpret it? Here are the first few lines:
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
242059 0.626 0.000 0.914 0.000 pulp.py:585(__init__)
1 0.413 0.413 0.413 0.413 {built-in method _winapi.WaitForSingleObject}
343978/302 0.293 0.000 0.557 0.002 pulp.py:750(addInPlace)
4159617 0.288 0.000 0.288 0.000 pulp.py:165(__hash__)
112 0.282 0.003 0.282 0.003 {method 'read' of '_ssl._SSLSocket' objects}
1913398 0.172 0.000 0.245 0.000 {built-in method builtins.isinstance}
1 0.171 0.171 0.185 0.185 Betfair_Run_Sheet.pyx:243(betfairFinalArray)
377866 0.168 0.000 0.293 0.000 pulp.py:637(addterm)
2255 0.161 0.000 0.161 0.000 mps_lp.py:249(<listcomp>)
1 0.148 0.148 0.570 0.570 mps_lp.py:174(writeMPS)
117214 0.139 0.000 0.444 0.000 pulp.py:820(__mul__)
2 0.136 0.068 0.196 0.098 pulp.py:1465(variablesDict)
5 0.135 0.027 0.135 0.027 {method 'do_handshake' of '_ssl._SSLSocket' objects}
427 0.111 0.000 0.129 0.000 <frozen importlib._bootstrap_external>:914(get_data)
71 0.108 0.002 0.108 0.002 {built-in method _imp.create_dynamic}
2093 0.102 0.000 0.102 0.000 {built-in method nt.stat}
I am using PuLP, so I am aware it takes the lion's share of the time, but the specifics of the setup are not clear from the above. For example, the first line of output seems to refer to line 585 of my script, but that is not where I call or set up the PuLP part at all.
Same with the <listcomp> entry nine rows down; there is no list comprehension on that line of my script.
Other things like {method 'do_handshake' of '_ssl._SSLSocket' objects} I don't have a clue what they mean.
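Worth noting: the filename:lineno(function) column names the file where each function is defined, so pulp.py:585 is line 585 inside the installed PuLP package, not your script, and the <listcomp> lives in mps_lp.py; entries in curly braces like {method 'do_handshake' of '_ssl._SSLSocket' objects} are C built-ins (that one is an SSL handshake, i.e. network activity). To map a hot spot back to your own code, pstats can print its callers. A small self-contained sketch, where work() is just a placeholder workload standing in for your script:

```python
import cProfile
import io
import pstats

def work():
    # Placeholder workload standing in for your script.
    return sum(i * i for i in range(10_000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

out = io.StringIO()
stats = pstats.Stats(pr, stream=out)
stats.sort_stats('cumulative').print_stats(5)  # top 5 entries by cumulative time
stats.print_callers('genexpr')                 # which functions call into a given hot spot
print(out.getvalue())
```

Running `stats.print_callers('pulp')` against your own profile would show which lines of your script are driving the pulp.py:585 calls.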

complex excel table lookup returning value in row below

I have a complex Excel table lookup that I need help with. I have tried an INDEX/MATCH/MATCH formula, but this lookup needs to match against values on the same row, not against column headings. Essentially, I need to pull a number based on two parameters from the sample table below. An example makes the most sense:
Example 1
Parameter 1: B010, Parameter 2: 60.3
Based on the table I would need the formula to pull in 0.05
The B010 part is pretty simple. 60.3 is greater than 60, so it goes to the column with 60.6, and the formula needs to pull the number BELOW the one it finds in that row. Result = 0.05
Example 2
Parameter 1: B010, Parameter 2: 45, result: 0.075
Example 3
Parameter 1: B012, Parameter 2: 65, result: 0.033
Is this even possible to do in excel?
B010 48.5 51.6 54.3 57.3 60 60.6 61.1 61.8 62.3 62.8
0.075 0.07 0.065 0.634 0.055 0.05 0.045 0.04 0.035 0.033
B012 48.5 51.6 54.3 57.3 60 60.6 61.1 61.8 62.3 62.8
0.075 0.07 0.065 0.06 0.055 0.05 0.045 0.04 0.035 0.033
B013 48.5 51.6 54.3 57.3 60 60.6 61.1 61.8 62.3 62.8
0.075 0.07 0.065 0.06 0.055 0.05 0.045 0.04 0.035 0.033
H010 48.5 51.6 54.3 57.3 60 61 62.1 63 64 64.9
0.065 0.06 0.055 0.05 0.045 0.042 0.04 0.038 0.035 0.033
H020 48.5 51.6 54.3 57.3 60 61.6 63.1 64.6 66.3 67.7
0.125 0.119 0.113 0.107 0.105 0.1 0.095 0.09 0.085 0.08
Enter as an array formula (Ctrl+Shift+Enter):
=INDEX(B1:K10,MATCH(N2,A1:A10,0)+1,MATCH(TRUE,INDEX(B1:K10,MATCH(N2,A1:A10,0),)>N3,0))
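The same two-parameter lookup is easy to express outside Excel, which may help verify the formula. A Python sketch of the logic, with the table trimmed to two codes from the sample; the fall-back to the last column when the value exceeds every threshold is an assumption inferred from Example 3 (the array formula above would return #N/A in that case):

```python
# Thresholds row and results row per code, trimmed from the sample table.
table = {
    "B010": ([48.5, 51.6, 54.3, 57.3, 60, 60.6, 61.1, 61.8, 62.3, 62.8],
             [0.075, 0.07, 0.065, 0.634, 0.055, 0.05, 0.045, 0.04, 0.035, 0.033]),
    "B012": ([48.5, 51.6, 54.3, 57.3, 60, 60.6, 61.1, 61.8, 62.3, 62.8],
             [0.075, 0.07, 0.065, 0.06, 0.055, 0.05, 0.045, 0.04, 0.035, 0.033]),
}

def lookup(code, value):
    thresholds, results = table[code]
    # The first threshold strictly greater than the value selects the column;
    # the answer is the number in the row below that threshold.
    for t, r in zip(thresholds, results):
        if t > value:
            return r
    return results[-1]  # assumed fall-back when value exceeds every threshold

print(lookup("B010", 60.3))  # 0.05
print(lookup("B010", 45))    # 0.075
print(lookup("B012", 65))    # 0.033
```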

Conditional formatting in gnuplot

I have thickness measurements taken on a flat plate. Is there any Excel-like conditional formatting option in gnuplot? I would like four different plots showing the values that are:
Higher than a given number, say, in this case, 0.5
Lower than 0.5
Between a given range, say, 0.5 and 0.51
Exactly 0.5
How do I modify the code below?
set pm3d map
splot 't.dat' matrix
Here is my data file
0.509 0.510 0.515 0.529 0.521 0.516 0.515
0.511 0.506 0.512 0.528 0.524 0.517 0.512
0.510 0.506 0.506 0.530 0.524 0.522 0.505
0.511 0.509 0.513 0.516 0.511 0.520 0.510
0.524 0.516 0.512 0.511 0.507 0.518 0.492
0.525 0.521 0.515 0.517 0.518 0.522 0.500
0.530 0.521 0.513 0.512 0.511 0.519 0.503
0.562 0.516 0.510 0.516 0.522 0.518 0.508
0.520 0.518 0.512 0.517 0.518 0.518 0.510
0.510 0.509 0.503 0.507 0.523 0.519 0.522
0.506 0.500 0.424 0.507 0.523 0.527 0.519
0.509 0.430 0.500 0.513 0.519 0.528 0.524
0.506 0.503 0.503 0.506 0.513 0.528 0.533
0.506 0.517 0.519 0.524 0.524 0.526 0.528
0.525 0.517 0.499 0.520 0.521 0.524 0.518
0.519 0.518 0.516 0.519 0.521 0.520 0.519
0.521 0.502 0.515 0.518 0.518 0.523 0.522
0.515 0.519 0.519 0.534 0.524 0.525 0.516
0.517 0.510 0.522 0.532 0.533 0.530 0.525
0.520 0.457 0.526 0.530 0.530 0.531 0.524
0.530 0.520 0.531 0.529 0.527 0.526 0.524
Thanks!
You can do conditional plots like this, but the conditional filtering will leave "holes" in your graph.
Without formatting:
set pm3d map
splot "./data" matrix
Plotting only values larger than 0.5:
set pm3d map
splot "./data" matrix u 1:2:($3 > 0.5 ? $3 : 1/0)
You'll see that some points are missing. The more aggressively you filter, the more points go missing, to the extreme that you might have nothing to plot (e.g. with < 0.5), because there is no way to interpolate. What you can do instead is clamp out-of-range points to a fixed value: for instance, if a value is smaller than 0.5, substitute 0.5; if it's larger than 0.53, substitute 0.53:
set pm3d map
splot "./data" matrix u 1:2:($3 < 0.5 ? 0.5 : $3 > 0.53 ? 0.53 : $3)
For nicer display with these small matrices, you can think about interpolation:
set pm3d map interpolate 32,32
splot "./data" matrix u 1:2:($3 < 0.5 ? 0.5 : $3 > 0.53 ? 0.53 : $3)

RODBC read error where excel column contains leading NAs

I have been reading Excel sheets into R using the RODBC package and have hit an issue with the Excel ODBC driver. Columns that contain sufficiently many leading NAs are coerced to logical.
In Excel the data appears as follows:
period n n.ft n.pt
1/02/1985 0.008 NA 0.025
1/03/1985 -0.003 NA -0.024
1/04/1985 0.002 NA 0.015
1/05/1985 0.006 NA 0.012
1/06/1985 0.001 NA 0.003
1/07/1985 0.005 NA 0.010
1/08/1985 0.006 NA 0.001
1/09/1985 0.007 NA 0.013
1/10/1985 -0.002 NA 0.009
1/11/1985 0.013 NA 0.019
1/12/1985 -0.004 NA -0.021
1/01/1986 0.008 NA 0.009
1/02/1986 0.002 NA 0.009
1/03/1986 0.002 -0.003 1.000
1/04/1986 0.010 -0.003 0.041
1/05/1986 0.000 -0.001 -0.004
1/06/1986 0.005 0.003 0.005
1/07/1986 -0.003 0.005 0.012
1/08/1986 -0.001 -0.003 -0.021
1/09/1986 0.003 -0.001 0.012
1/10/1986 0.003 0.003 0.010
1/11/1986 -0.003 0.003 -0.003
1/12/1986 0.003 -0.003 0.022
1/01/1987 0.001 0.013 -0.004
1/02/1987 0.004 -0.004 0.011
1/03/1987 0.004 0.008 0.005
1/04/1987 0.000 0.002 -0.002
1/05/1987 0.001 0.002 0.006
1/06/1987 0.004 0.010 0.00
I read in the data with:
require(RODBC)
conexcel <- odbcConnectExcel(xls.file="C:/data/example.xls")
s1 <- 'SOx'
dd <- sqlFetch(conexcel, s1)
odbcClose(conexcel)
This reads the entire second column in as NA. I think this is because the column is guessed to be logical, so the subsequent numbers are treated as invalid and hence NA.
> str(dd)
'data.frame': 29 obs. of 4 variables:
$ period: POSIXct, format: "1985-02-01" "1985-03-01" ...
$ n : num 0.00833 -0.00338 0.00157 0.00562 0.00117 ...
$ n#ft : logi NA NA NA NA NA NA ...
$ n#pt : num 0.02515 -0.02394 0.0154 0.01224 0.00301 ...
I am trying to find a way to prevent this coercion to logical, which I think is causing the bad read.
I found this Q+A by searching SO; however, I am at work and have no hope of being permitted to edit the registry to change the DWORD default, as suggested (I understand that the value set there determines how many rows are scanned before the driver guesses the data type and bombs my read).
Right now, I'm thinking the best solution is to invert the data in Excel and read it into R upside-down.
I love a good hack, but surely there's a better solution?
This is not a bug but a feature of ODBC (note the lack of R), as documented here:
http://support.microsoft.com/kb/257819/en-us
(long page; search for "mixed data type").
Since reading Excel files with ODBC is rather limited, I prefer one of the alternatives mentioned by Gabor, with a preference for XLConnect.
