I have been reading Excel sheets into R using the RODBC package and have hit an issue with the Excel ODBC driver. Columns that contain (sufficient) leading NAs are coerced to logical.
In Excel the data appears as follows:
period n n.ft n.pt
1/02/1985 0.008 NA 0.025
1/03/1985 -0.003 NA -0.024
1/04/1985 0.002 NA 0.015
1/05/1985 0.006 NA 0.012
1/06/1985 0.001 NA 0.003
1/07/1985 0.005 NA 0.010
1/08/1985 0.006 NA 0.001
1/09/1985 0.007 NA 0.013
1/10/1985 -0.002 NA 0.009
1/11/1985 0.013 NA 0.019
1/12/1985 -0.004 NA -0.021
1/01/1986 0.008 NA 0.009
1/02/1986 0.002 NA 0.009
1/03/1986 0.002 -0.003 1.000
1/04/1986 0.010 -0.003 0.041
1/05/1986 0.000 -0.001 -0.004
1/06/1986 0.005 0.003 0.005
1/07/1986 -0.003 0.005 0.012
1/08/1986 -0.001 -0.003 -0.021
1/09/1986 0.003 -0.001 0.012
1/10/1986 0.003 0.003 0.010
1/11/1986 -0.003 0.003 -0.003
1/12/1986 0.003 -0.003 0.022
1/01/1987 0.001 0.013 -0.004
1/02/1987 0.004 -0.004 0.011
1/03/1987 0.004 0.008 0.005
1/04/1987 0.000 0.002 -0.002
1/05/1987 0.001 0.002 0.006
1/06/1987 0.004 0.010 0.00
I read in the data with:
require(RODBC)
conexcel <- odbcConnectExcel(xls.file="C:/data/example.xls")
s1 <- 'SOx'
dd <- sqlFetch(conexcel, s1)
odbcClose(conexcel)
This reads the entire second column in as NA. I think this is because the column's type is guessed to be logical (from the leading NAs), so the subsequent numbers are assessed as invalid and hence returned as NA.
> str(dd)
'data.frame': 29 obs. of 4 variables:
$ period: POSIXct, format: "1985-02-01" "1985-03-01" ...
$ n : num 0.00833 -0.00338 0.00157 0.00562 0.00117 ...
$ n#ft : logi NA NA NA NA NA NA ...
$ n#pt : num 0.02515 -0.02394 0.0154 0.01224 0.00301 ...
I am trying to find a way to prevent this coercion to logical, which I think is what causes the subsequent loss of values.
I found this Q+A by searching SO; however, I am at work and have no hope of being permitted to edit the registry to change the default for the TypeGuessRows DWORD, as suggested (I understand that the value set there determines how many rows Microsoft samples before guessing the data type and bombing my read).
Right now, I'm thinking that the best solution is to invert the data in Excel and read it into R upside down.
I love a good hack, but surely there's a better solution?
This is not a bug, but a feature of ODBC (note the lack of an R), as documented here:
http://support.microsoft.com/kb/257819/en-us
(long page; search for "mixed data type").
Since reading Excel files with ODBC is rather limited, I prefer one of the alternatives mentioned by Gabor, with a preference for XLConnect.
Related
I have a Python program I'm using to parse through a bunch of data. For some reason, it takes a long time for the callee to return control to the caller after its final statement.
parser.py
import docx                      # python-docx, used to build the Word report
import win32com.client as win32  # pywin32 COM interface to Excel

def gen_report(self):
    xl = win32.Dispatch('Excel.Application')
    doc = docx.Document()
    # Data shuffling here. Most of the parsing has been completed previously,
    # so this function primarily just copies charts from Excel into a Word
    # document and then adds headings and such to the Word doc.
    doc.save(doc_name)
    xl.Quit()
    print('Report generated successfully!')
gui.py
def parse_thread(self):
    parser.gen_report()
    print('Report generation returned')
The final statement in gen_report() is the print statement, and I would expect the print statement in parse_thread() to execute almost immediately afterward. Instead, it sometimes takes up to 15 seconds after the first print statement before the second one executes.
I thought this might have something to do with parsing a lot of data and garbage collection, but disabling garbage collection did not change the performance at all. I was also able to run the program with cProfile and see that parse_thread takes up most of the execution time, the majority of which I assume is while the program hangs after returning from report generation. Garbage collection doesn't even show up in the 20 items that take the most time:
ncalls tottime percall cumtime percall filename:lineno(function)
1 20.374 20.374 32.621 32.621 gui.py:97(parse_thread)
41 3.564 0.087 3.564 0.087 {method 'InvokeTypes' of 'PyIDispatch' objects}
2 2.316 1.158 2.316 1.158 {built-in method pythoncom.CoCreateInstance}
7 0.655 0.094 0.655 0.094 {built-in method PIL._imaging.grabclipboard_win32}
1 0.649 0.649 8.662 8.662 Parser.py:386(gen_report)
9 0.504 0.056 0.504 0.056 {method 'encode' of 'ImagingEncoder' objects}
834432 0.353 0.000 0.356 0.000 worksheet.py:247(_get_cell)
73864 0.317 0.000 0.530 0.000 _writer.py:70(lxml_write_cell)
1956380 0.228 0.000 0.228 0.000 worksheet.py:347(<genexpr>)
683154 0.221 0.000 0.531 0.000 worksheet.py:216(cell)
606673 0.202 0.000 0.682 0.000 worksheet.py:515(<genexpr>)
62 0.199 0.003 0.428 0.007 worksheet.py:339(max_row)
224 0.175 0.001 0.175 0.001 {method 'Bind' of 'PyITypeComp' objects}
9 0.156 0.017 0.156 0.017 {method 'close' of '_io.BufferedRandom' objects}
73800 0.132 0.000 0.273 0.000 _reader.py:177(parse_cell)
126 0.128 0.001 0.128 0.001 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
48 0.099 0.002 0.099 0.002 {method 'GetTypeComp' of 'PyITypeInfo' objects}
151270 0.098 0.000 0.144 0.000 worksheet.py:793(_move_cell)
3 0.098 0.033 0.827 0.276 _reader.py:350(bind_cells)
757136 0.090 0.000 0.090 0.000 worksheet.py:373(<genexpr>)
Any thoughts on why it might be taking so long for control to return to the caller?
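One way to narrow this down (a sketch, not part of the original program): bracket each suspect call with time.perf_counter so you can see whether the delay is inside doc.save, inside xl.Quit, or in the return itself. The timed helper below is hypothetical instrumentation; the demonstration uses a stand-in sleep rather than the real COM calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block takes; drop-in instrumentation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Usage sketch inside gen_report (names taken from the question):
#   with timed("doc.save"):
#       doc.save(doc_name)
#   with timed("xl.Quit"):
#       xl.Quit()

# Demonstration with a stand-in delay:
with timed("sleep"):
    time.sleep(0.05)
```

If neither bracketed call accounts for the gap, the time is being spent after gen_report returns, which would point at something in the calling thread rather than in the report generation itself.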
I have tested the same Python code on two different computers. On the first one the code takes 9 s; on the second one (a more powerful machine, with 16M of RAM versus the first one's 8M) it takes 185 s. Analysing it with cProfile, the most critical call in both cases is WaitForSingleObject. Analysing one specific function, I can see that the critical part is the OCR with Tesseract. Why such different performance on these two machines?
The main lines from cProfile for this specific function are:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.002 0.002 115.398 115.398 bpl-Redonda4.py:261(pega_stack_nome_jogadores)
18 0.000 0.000 0.001 0.000 pytesseract.py:106(prepare)
18 0.000 0.000 0.118 0.007 pytesseract.py:116(save_image)
18 0.000 0.000 0.000 0.000 pytesseract.py:140(subprocess_args)
18 0.000 0.000 115.186 6.399 pytesseract.py:162(run_tesseract)
18 0.001 0.000 115.373 6.410 pytesseract.py:199(run_and_get_output)
12 0.000 0.000 76.954 6.413 pytesseract.py:295(image_to_string)
12 0.000 0.000 76.954 6.413 pytesseract.py:308()
6 0.000 0.000 38.419 6.403 pytesseract.py:328(image_to_boxes)
6 0.000 0.000 38.419 6.403 pytesseract.py:345()
18 0.000 0.000 0.060 0.003 pytesseract.py:97(cleanup)
18 0.000 0.000 115.096 6.394 subprocess.py:979(wait)
18 115.096 6.394 115.096 6.394 {built-in method _winapi.WaitForSingleObject}
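Since nearly all the time is spent in WaitForSingleObject, i.e. waiting for the tesseract subprocess to finish, it may help to time the external binary directly and take pytesseract out of the picture. The sketch below times a stand-in command (sys.executable) because I can't assume tesseract is on PATH; substitute your actual invocation, e.g. ["tesseract", "input.png", "out"]. If the bare subprocess already differs by an order of magnitude between the two machines, the difference is in Tesseract itself (binary version, language data, antivirus scanning, etc.), not in the Python code.

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run an external command and return (elapsed_seconds, returncode)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True)
    return time.perf_counter() - start, result.returncode

# Stand-in command; replace with your tesseract command line.
elapsed, rc = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f}s, exit {rc}")
```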
I’m trying to speed up some code that should run fast on both Linux and Windows. However, the same code takes 131 seconds on Fedora 25, while on Windows 7 it takes only 90 seconds (both computers with 8 GB of RAM, with an i7 and an i5 processor, respectively). I’m using Python 3.5 on Fedora and 3.6 on Windows.
The code is the following:
from math import ceil
from multiprocessing import Process, Queue, cpu_count

import numpy as np

nprocs = cpu_count()
chunksize = ceil(nrFrames / nprocs)
queue = Queue()
jobs = []
for i in range(nprocs):
    start = chunksize * i
    if i == nprocs - 1:
        end = nrFrames
    else:
        end = chunksize * (i + 1)
    trjCoordsProcess = DAH_Coords[start:end]
    p = Process(target=is_hbond, args=(queue, trjCoordsProcess, distCutOff,
                                       angleCutOff, AList, DList, HList))
    p.start()
    jobs.append(p)

HbondFreqMatrix = queue.get()
for k in range(nprocs - 1):
    HbondFreqMatrix = np.add(HbondFreqMatrix, queue.get())
for proc in jobs:
    proc.join()

def is_hbond(queue, processCoords, distCutOff, angleCutOff,
             possibleAPosList, donorsList, HCovBoundPosList):
    for frame in range(len(processCoords)):
        pass  # do stuff
    queue.put(HbondProcessFreqMatrix)
The start method of each process is actually considerably faster in Linux than in Windows. However, each iteration inside the is_hbond function takes 2.5 times longer in Linux (0.5 s vs 0.2 s).
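As a side note, the start/end arithmetic above can produce an empty or even inverted final chunk when nrFrames is small relative to nprocs (e.g. 5 frames over 4 processes gives a start past the end of the data). A minimal, self-contained sketch of the same chunking logic, with the bounds clamped (the helper name is just illustrative):

```python
from math import ceil

def chunk_bounds(nr_frames, nprocs):
    """Mirror the start/end arithmetic from the snippet above, clamped so
    the chunks always tile range(nr_frames) exactly."""
    chunksize = ceil(nr_frames / nprocs)
    bounds = []
    for i in range(nprocs):
        start = min(chunksize * i, nr_frames)
        end = nr_frames if i == nprocs - 1 else min(chunksize * (i + 1), nr_frames)
        bounds.append((start, end))
    return bounds

print(chunk_bounds(10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
```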
The profiler gives the following information:
Windows
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.167 0.167 84.139 84.139 calculateHbonds
4 0.000 0.000 52.039 13.010 \Python36\lib\multiprocessing\queues.py:91(get)
4 0.000 0.000 51.928 12.982 \Python36\lib\multiprocessing\connection.py:208(recv_bytes)
4 0.018 0.004 51.928 12.982 \Python36\lib\multiprocessing\connection.py:294(_recv_bytes)
4 51.713 12.928 51.713 12.928 {built-in method _winapi.WaitForMultipleObjects}
4 0.000 0.000 30.811 7.703 \Python36\lib\multiprocessing\process.py:95(start)
4 0.000 0.000 30.811 7.703 \Python36\lib\multiprocessing\context.py:221(_Popen)
4 0.000 0.000 30.811 7.703 \Python36\lib\multiprocessing\context.py:319(_Popen)
4 0.000 0.000 30.809 7.702 popen_spawn_win32.py:32(__init__)
8 1.958 0.245 30.804 3.851 \Python36\lib\multiprocessing\reduction.py:58(dump)
8 28.846 3.606 28.846 3.606 {method 'dump' of '_pickle.Pickler' objects}
Linux
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.203 0.203 123.169 123.169 calculateHbonds
4 0.000 0.000 121.450 30.362 /python3.5/multiprocessing/queues.py:91(get)
4 0.000 0.000 121.300 30.325 /python3.5/multiprocessing/connection.py:208(recv_bytes)
4 0.019 0.005 121.300 30.325 /python3.5/multiprocessing/connection.py:406(_recv_bytes)
8 0.000 0.000 121.281 15.160 /python3.5/multiprocessing/connection.py:374(_recv)
8 121.088 15.136 121.088 15.136 {built-in method posix.read}
1 0.000 0.000 0.082 0.082 /python3.5/multiprocessing/context.py:98(Queue)
17/4 0.000 0.000 0.082 0.021 <frozen importlib._bootstrap>:939(_find_and_load_unlocked)
16/4 0.000 0.000 0.082 0.020 <frozen importlib._bootstrap>:659(_load_unlocked)
4 0.000 0.000 0.052 0.013 /python3.5/multiprocessing/process.py:95(start)
4 0.000 0.000 0.052 0.013 /python3.5/multiprocessing/context.py:210(_Popen)
4 0.000 0.000 0.052 0.013 /python3.5/multiprocessing/context.py:264(_Popen)
4 0.000 0.000 0.051 0.013 /python3.5/multiprocessing/popen_fork.py:16(__init__)
4 0.000 0.000 0.051 0.013 /python3.5/multiprocessing/popen_fork.py:64(_launch)
4 0.050 0.013 0.050 0.013 {built-in method posix.fork}
Is there a reason why this might be the case? I know the multiprocessing module works differently in Linux and Windows due to the lack of os.fork in Windows, but I thought Linux should be faster.
Any ideas on how to make it faster in Linux?
Thank you!
Edit: After testing the same on OS X and Linux, I can confirm that the following only happens on OS X. On Linux it literally runs at a thousand fps, as I suspected it would. Any explanation? I would much prefer developing on the Mac, thanks to TextMate.
Here's a simple loop that does almost nothing, and still runs very slowly. Can anyone explain why? FPS averages a little over 30, i.e. it takes a little over 30 ms for each pass of the loop. Window size does not seem to affect this at all; even a tiny window size like (50, 50) gives the same fps.
I find this weird; I would expect that any contemporary hardware could do a thousand fps for such a simple loop, even when we update every pixel every time. From the profile I can see that {built-in method get} and {built-in method update} combined seem to take around 30 ms of time per call. Is that really the best we can get without using dirty rects?
import pygame

pygame.init()
clock = pygame.time.Clock()
fps = 1000
# milliseconds from last frame
new_time, old_time = None, None
done = False
while not done:
    clock.tick(fps)
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            done = True
    # show fps and milliseconds
    if new_time:
        old_time = new_time
    new_time = pygame.time.get_ticks()
    if new_time and old_time:
        pygame.display.set_caption("fps: " + str(int(clock.get_fps())) + " ms: " + str(new_time - old_time))
    pygame.display.update()
Here's the beginning of a cProfile of the main function.
94503 function calls (92211 primitive calls) in 21.011 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.026 0.026 21.011 21.011 new_main.py:34(main)
652 14.048 0.022 14.048 0.022 {built-in method get}
652 5.864 0.009 5.864 0.009 {built-in method update}
1 0.444 0.444 0.634 0.634 {built-in method init}
651 0.278 0.000 0.278 0.000 {built-in method set_caption}
72/1 0.000 0.000 0.151 0.151 <frozen importlib._bootstrap>:2234(_find_and_load)
72/1 0.000 0.000 0.151 0.151 <frozen importlib._bootstrap>:2207(_find_and_load_unlocked)
71/1 0.000 0.000 0.151 0.151 <frozen importlib._bootstrap>:1186(_load_unlocked)
46/1 0.000 0.000 0.151 0.151 <frozen importlib._bootstrap>:1122(_exec)
46/1 0.000 0.000 0.151 0.151 <frozen importlib._bootstrap>:1465(exec_module)
74/1 0.000 0.000 0.151 0.151 <frozen importlib._bootstrap>:313(_call_with_frames_removed)
54/1 0.004 0.000 0.151 0.151 {built-in method exec}
1 0.000 0.000 0.151 0.151 macosx.py:1(<module>)
1 0.000 0.000 0.150 0.150 pkgdata.py:18(<module>)
25/3 0.000 0.000 0.122 0.041 <frozen importlib._bootstrap>:1156(_load_backward_compatible)
8/1 0.026 0.003 0.121 0.121 {method 'load_module' of 'zipimport.zipimporter' objects}
1 0.000 0.000 0.101 0.101 __init__.py:15(<module>)
1 0.000 0.000 0.079 0.079 config_reader.py:115(build_from_config)
2 0.000 0.000 0.056 0.028 common.py:43(reset_screen)
2 0.055 0.027 0.055 0.027 {built-in method set_mode}
72/71 0.001 0.000 0.045 0.001 <frozen importlib._bootstrap>:2147(_find_spec)
70/69 0.000 0.000 0.043 0.001 <frozen importlib._bootstrap>:1934(find_spec)
70/69 0.001 0.000 0.043 0.001 <frozen importlib._bootstrap>:1902(_get_spec)
92 0.041 0.000 0.041 0.000 {built-in method load_extended}
6 0.000 0.000 0.041 0.007 new_map.py:74(add_character)
6 0.000 0.000 0.041 0.007 new_character.py:32(added_to_map)
6 0.001 0.000 0.041 0.007 new_character.py:265(__init__)
1 0.000 0.000 0.038 0.038 macosx.py:14(Video_AutoInit)
1 0.038 0.038 0.038 0.038 {built-in method InstallNSApplication}
1 0.036 0.036 0.036 0.036 {built-in method quit}
65 0.001 0.000 0.036 0.001 re.py:277(_compile)
49 0.000 0.000 0.036 0.001 re.py:221(compile)
The answer to this ended up being that the retina display under OS X is the differentiating factor. Running it even on an external display on the same Mac works fine. But moving the window to the retina display makes it sluggish. With or without an external monitor connected.
On the other hand, it runs just fine on the same retina display under Linux. It is unclear what the difference in the display managers / rendering is that causes this, but I doubt there is anything one could do about it.
Changing the game resolution to fullscreen helped me.
Try this:
window = pygame.display.set_mode((0, 0), pygame.FULLSCREEN)
instead of:
window = pygame.display.set_mode((winx, winy))
In addition to my previous question: how can I extract values in column format from a large text file? And how can I extract the values of a specific column? The text file looks like the attached image.
I want to extract the RESOL and FSC values, as given in the text, to plot them in Numbers or Excel.
To be more general: what should be done if I want to extract the Part_FSC and CC values later?
Thanks
Text File:
Opening MRC/CCP4 file for WRITE...
File : map2.mrc
NX, NY, NZ: 600 600 600
MODE : real
SYMMETRY REDUNDANCY: 1
Fraction_mask, Fraction_particle = 0.1193 0.0111
C sqrt sqrt
C NO. RESOL RING RAD FSPR FSC Part_FSC Part_SSNR Rec_SSNR CC
C 2 952.50 0.0017 0.09 1.000 0.000 400.0145 479.40 0.1222
C 3 476.25 0.0033 0.19 1.000 0.000 159.3959 159.98 0.1586
C 4 317.50 0.0050 0.92 0.999 0.000 48.2248 43.27 0.0155
C 5 238.12 0.0067 0.42 1.000 0.000 88.3074 76.69 0.2637
C 6 190.50 0.0083 0.48 0.999 0.000 64.0162 56.25 0.4148
C 7 158.75 0.0100 1.41 0.992 0.000 17.1695 15.64 0.1282
C 8 136.07 0.0117 5.56 0.954 0.000 6.8244 6.47 0.0171
C 9 119.06 0.0133 1.49 0.993 0.000 16.1918 16.42 0.2729
C 10 105.83 0.0150 1.68 0.990 0.000 12.8313 13.83 0.3729
C 11 95.25 0.0167 3.55 0.969 0.000 6.8012 7.95 0.2624
C 12 86.59 0.0183 16.00 0.830 0.000 2.5273 3.13 0.0826
Untested, but should work. On the C-prefixed rows, RESOL is field 3 and FSC is field 6:
awk '$1 == "C"{printf "%s\t%s\n", $3,$6}' <filename>
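For the more general case (grabbing Part_FSC and CC later), a small Python sketch that maps header names to field positions may be easier to extend than hard-coded field numbers. The column names below are taken from the C header line in your file; the function name is just illustrative. Note that "RING RAD" is a single two-word column, so it is listed once and data rows end up one field shorter than the header row.

```python
def extract_columns(lines, wanted=("RESOL", "FSC")):
    """Pull named columns from the C-prefixed data rows of the log above.

    Column order follows the header line in the file:
    NO. RESOL RING RAD FSPR FSC Part_FSC Part_SSNR Rec_SSNR CC
    """
    names = ["NO.", "RESOL", "RING RAD", "FSPR", "FSC",
             "Part_FSC", "Part_SSNR", "Rec_SSNR", "CC"]
    idx = {n: i + 1 for i, n in enumerate(names)}  # field 0 is the leading "C"
    rows = []
    for line in lines:
        fields = line.split()
        # Data rows have exactly 10 fields; the header row splits into 11,
        # so it is skipped automatically.
        if fields and fields[0] == "C" and len(fields) == len(names) + 1:
            rows.append(tuple(float(fields[idx[n]]) for n in wanted))
    return rows

sample = [
    "C       NO.  RESOL  RING RAD  FSPR   FSC  Part_FSC  Part_SSNR  Rec_SSNR    CC",
    "C        2  952.50  0.0017   0.09  1.000    0.000   400.0145    479.40  0.1222",
    "C        3  476.25  0.0033   0.19  1.000    0.000   159.3959    159.98  0.1586",
]
print(extract_columns(sample, wanted=("Part_FSC", "CC")))
```

Writing the result out as tab-separated lines gives something Numbers or Excel can open directly, same as the awk one-liner.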