The following Excel VBA code reads data from a file without opening it:
Dim rgTarget As Range
Set rgTarget = ActiveSheet.Range("A1:C3") 'where to put the copied data.
rgTarget.FormulaArray = "='C:\MyPath\[Closed File.xlsx]Sheet1'!$A$1:$C$3"
rgTarget.Formula = rgTarget.Value 'convert formulas to values.
Even if I put a very large amount of additional data in the file, so it takes 25 seconds to open via File > Open or Workbooks.Open, this code still runs almost instantaneously and gets the correct data.
A colleague of mine insists that if it is successfully reading data from the file, then ipso facto it must be opening the file. But obviously it's doing something different than Workbooks.Open.
It isn't about caching; even if I restart Windows, then start Excel and run that code before opening the large file, it still runs almost instantaneously and gets the correct data. And no matter how many times I open the large file via File > Open or Workbooks.Open, it takes 25 seconds each time.
Also FWIW, when I open the large file via Workbooks.Open or File > Open, Task Manager shows Excel's memory usage increases by about 154,000 KB, nearly quadrupling Excel's total memory usage. But running the above code shows only a tiny increase in Excel's memory usage, about 12 KB. The file uses about 29 KB on the hard drive (Excel files are compressed when stored).
I am trying to load this CSV file into a pandas data frame using
import pandas as pd
filename = '2016-2018_wave-IV.csv'
df = pd.read_csv(filename)
However, despite my PC not being particularly slow (8 GB RAM, 64-bit Python) and the file being somewhat but not extraordinarily large (< 33 MB), loading the file takes more than 10 minutes. It is my understanding that this shouldn't take nearly that long, and I would like to figure out what's behind it.
(As suggested in similar questions, I have tried using the chunksize and usecols parameters (EDIT: and also low_memory), yet without success; so I believe this is not a duplicate but has more to do with the file or the setup.)
Could someone give me a pointer? Many thanks. :)
I was testing the file you shared, and the problem is that this CSV file has leading and trailing double quotes on every line (so pandas thinks the whole line is one column). They have to be removed before processing, for example by using sed on Linux, by processing and re-saving the file in Python, or by replacing all the double quotes in a text editor.
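For example, a minimal Python sketch of the "process and re-save in Python" option (the output file name is just a placeholder) could look like this:

# Strip the leading and trailing double quote from every line and re-save the file.
# The output file name is a placeholder; adjust it to your actual path.
with open('2016-2018_wave-IV.csv', 'r', encoding='utf-8') as src, \
     open('2016-2018_wave-IV_clean.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        stripped = line.rstrip('\n')
        if stripped.startswith('"') and stripped.endswith('"'):
            stripped = stripped[1:-1]  # drop only the outer quotes
        dst.write(stripped + '\n')

The sed equivalent would be something like sed 's/^"//; s/"$//' input.csv > output.csv.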
To summarize and expand the answer by @Hubert Dudek:
The issue was with the file; not only did it include "s at the start of every line, but also within the lines themselves. After I fixed the former, the latter caused the column attribution to be messed up.
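As a sketch of the full fix (assuming the file has been re-saved without the outer quotes, as above), the remaining double quotes inside the lines can be treated as literal text rather than as quoting characters:

import csv
import pandas as pd

# QUOTE_NONE stops pandas from interpreting the stray double quotes inside the
# lines as field quoting, which is what was breaking the column attribution.
df = pd.read_csv('2016-2018_wave-IV_clean.csv', quoting=csv.QUOTE_NONE)
print(df.shape)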
I am simulating Blackjack games, where decisions are made according to Basic Strategy. Each decision is made by reading a value from one of the cells of an .xlsx file. When trying to simulate, let's say, 100 000 games, it takes a long time. I use these two lines to read the decision:
bs = pandas.read_excel('BasicStrategy.xlsx', sheet_name = 'Soft')
decision = bs.iloc[player_result - 1, dealer_hand[0] - 2]
Since the decisions are just numbers in a table, what would decrease the time my program takes to execute? As I understand it, the whole sheet is read every time a decision has to be made, but I only need one value, so how can I read only one value? I have not used numpy before, but would it work in this case, and if so, would it be faster? Any advice will be much appreciated.
Read the values into a variable once, and from then on refer to that variable. If you're accessing the file over a network, then save the variable as a local/temp file. Then when your program is started it can read the local file back into a variable.
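A minimal sketch of that idea, reusing the file and indexing from the question (the pickle cache path is just an assumed location):

import os
import pandas as pd

CACHE = 'basic_strategy_soft.pkl'  # assumed local cache file

# Load the strategy table once; every later decision is an in-memory lookup.
if os.path.exists(CACHE):
    bs = pd.read_pickle(CACHE)
else:
    bs = pd.read_excel('BasicStrategy.xlsx', sheet_name='Soft')
    bs.to_pickle(CACHE)  # later runs skip the slow Excel read

def decide(player_result, dealer_hand):
    # Same indexing as in the question, but against the cached DataFrame.
    return bs.iloc[player_result - 1, dealer_hand[0] - 2]

Converting the table to a NumPy array once (bs.to_numpy()) would make each lookup slightly cheaper still, but the main win is simply not re-reading the file for every decision.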
I wrote my own terminal program that reads data from a microcontroller over the serial port. The data is presented as follows:
0C82949>0D23949>0A75249> etc...
These are ASCII. Some things to note: all elements start with a header of the form >0xx, where xx is a pair of characters, such as >0C8 or >0D2. The header tells me what the rest of the data is; for example, if >0C8 is the speed of the car, then 2949 holds the actual speed. The microcontroller writes the data very fast, so at any one time I can see about 40 elements. I want to quickly search this for a ">0C8" entry and only print out ">0C82949" out of the bunch.
An example, if I only want 0D2:
Read from Serial Port: >0C82949>0D23949>0A75249>
Output: 0D23949
Would anyone know how to do this? I am aware that since the data arrives so fast I would have to create threads, which I can do; I am just not sure how to approach the parsing. Any ideas would be greatly appreciated.
I am using Visual C++
You can parse the data and split it on each > character, creating separate strings. For each string, just search for the desired substring. You may use strstr, CString::Find, or string::find.
There is no need to create a separate thread; the search operation is quite trivial and won't take much CPU.
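A rough sketch of that approach using std::string (the header to look for is passed in as a parameter; no threads, since the search is cheap):

#include <iostream>
#include <string>
#include <vector>

// Split the buffer on '>' and return every element that starts with the wanted header.
std::vector<std::string> FindElements(const std::string& buffer, const std::string& header)
{
    std::vector<std::string> matches;
    std::size_t start = 0;
    while (start < buffer.size())
    {
        std::size_t end = buffer.find('>', start + 1);   // next delimiter (or npos)
        std::string element = (end == std::string::npos)
                                  ? buffer.substr(start)
                                  : buffer.substr(start, end - start);
        if (element.compare(0, header.size(), header) == 0)
            matches.push_back(element);
        if (end == std::string::npos)
            break;
        start = end;
    }
    return matches;
}

int main()
{
    std::string data = ">0C82949>0D23949>0A75249>";
    for (const auto& e : FindElements(data, ">0D2"))
        std::cout << e << "\n";   // prints ">0D23949"
}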
I'm trying to read a 24 GB XML file in C, but it won't work. I'm printing out the current position using ftell() as I read it in, but once it gets to a big enough number, it goes back to a small number and starts over, never even getting 20% through the file. I assume this is a problem with the range of the variable used to store the position (a long), which can go up to about 4,000,000,000 according to http://msdn.microsoft.com/en-us/library/s3f49ktz(VS.80).aspx, while my file is 25,000,000,000 bytes in size. A long long should work, but how would I change what my compiler (Cygwin/MinGW32) uses, or get it to have fopen64?
The ftell() function returns a long, which on 32-bit systems is only 32 bits wide, so it tops out around 2 GB signed (4 GB even if treated as unsigned). You can't fit the file offset for a 24 GB file into a 32-bit long.
You may have the ftell64() function available, or the standard fgetpos() function may return a larger offset to you.
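For example, with the Microsoft CRT (which both MSVC and MinGW link against), fpos_t is a 64-bit integer, so fgetpos() keeps working past 4 GB even where ftell() wraps. A minimal sketch, with the file name as a placeholder:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("huge.xml", "rb");   /* placeholder file name */
    if (!f)
        return 1;

    fpos_t pos;                          /* 64-bit (__int64) with the Microsoft CRT */
    char buf[1 << 16];
    long long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
    {
        total += (long long)n;           /* track the offset ourselves, 64-bit safe */
        fgetpos(f, &pos);                /* valid past 4 GB, unlike ftell() */
        /* ... process buf[0 .. n-1] ... */
    }

    fclose(f);
    return 0;
}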
You might try using the OS-provided file functions CreateFile and ReadFile. According to the File Pointers topic, the file position is stored as a 64-bit value.
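A hedged sketch of that route (the file name is a placeholder); SetFilePointerEx takes a 64-bit LARGE_INTEGER, so seeking anywhere in a 24 GB file works:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("huge.xml", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* Seek to an offset well past 4 GB using a 64-bit LARGE_INTEGER. */
    LARGE_INTEGER offset;
    offset.QuadPart = 20LL * 1024 * 1024 * 1024;   /* 20 GB */
    if (!SetFilePointerEx(h, offset, NULL, FILE_BEGIN))
    {
        CloseHandle(h);
        return 1;
    }

    char buf[65536];
    DWORD bytesRead = 0;
    if (ReadFile(h, buf, sizeof buf, &bytesRead, NULL))
        printf("read %lu bytes at the 20 GB mark\n", (unsigned long)bytesRead);

    CloseHandle(h);
    return 0;
}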
Unless you can use a 64-bit method as suggested by Loadmaster, I think you will have to break the file up.
This resource seems to suggest it is possible using _telli64(). I can't test this though, as I don't use mingw.
I don't know of a way to do this with a single file. It's a bit of a hack, but if splitting the file up properly isn't a real option, you could write a few functions that temporarily split it: one that uses ftell() to move through the file and switches to a new file when it reaches the split point, and another that stitches the files back together before exiting. An absolutely botched-up approach, but if no better solution comes to light, it could be a way to get the job done.
I found the answer. Instead of using fopen, fseek, fread, fwrite... I'm using _open, _lseeki64, _read, _write, and I am able to write and seek in files larger than 4 GB.
Edit: It seems the latter functions are about 6x slower than the former ones. I'll give the bounty to anyone who can explain that.
Edit: Oh, I learned here that read() and friends are unbuffered. What is the difference between read() and fread()?
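For reference, a minimal sketch of that low-level approach (the file name is a placeholder). Since these calls are unbuffered, each _read goes straight to the OS, which is also the likely explanation for the slowdown; reading in large chunks per call narrows the gap with fread():

#include <stdio.h>
#include <fcntl.h>
#include <io.h>

int main(void)
{
    /* _O_BINARY avoids CR/LF translation; the file name is a placeholder. */
    int fd = _open("huge.xml", _O_RDONLY | _O_BINARY);
    if (fd == -1)
        return 1;

    /* _lseeki64 takes and returns a 64-bit offset, so seeking past 4 GB works. */
    __int64 pos = _lseeki64(fd, 20000000000LL, SEEK_SET);
    if (pos == -1)
    {
        _close(fd);
        return 1;
    }

    /* Unbuffered: read in large chunks rather than many small calls. */
    char buf[1 << 16];
    int n = _read(fd, buf, sizeof buf);
    (void)n;

    _close(fd);
    return 0;
}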
Even if the ftell() in the Microsoft C library returns a 32-bit value and thus obviously will return bogus values once you reach 2 GB, just reading the file should still work fine. Or do you need to seek around in the file, too? For that you need _ftelli64() and _fseeki64().
Note that unlike some Unix systems, you don't need any special flag when opening the file to indicate that it is in some "64-bit mode". The underlying Win32 API handles large files just fine.
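A small sketch of the buffered route with those functions (the file name is a placeholder):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("huge.xml", "rb");   /* no special "64-bit mode" flag needed */
    if (!f)
        return 1;

    /* _fseeki64/_ftelli64 use __int64 offsets, so positions past 4 GB are fine. */
    if (_fseeki64(f, 20000000000LL, SEEK_SET) == 0)
    {
        __int64 pos = _ftelli64(f);      /* reports the true 64-bit position */
        char buf[4096];
        size_t n = fread(buf, 1, sizeof buf, f);
        (void)pos;
        (void)n;
    }

    fclose(f);
    return 0;
}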