ANTLR4 performance issue for high-volume files

We are facing performance issues with ANTLR when parsing Oracle files. The files to be converted are fairly large (17, 24, and 38 MB). Building the parse tree takes a lot of time and memory, and we eventually get an OutOfMemoryError core dump. We tried disabling parse-tree construction, but that does not work for us: the walker then has nothing to traverse and produces an empty output file. We tried BufferedInputStream in place of FileInputStream, and we also tried BufferedTokenStream, UnbufferedCharStream, and UnbufferedTokenStream in place of the equivalent streams for the parser and lexer. None of these options helped; the parse tree still takes a lot of memory and time to build and traverse. Running with a 2 GB heap did not help either, as memory use grows beyond that and we get an OOM dump.
From online forums this seems to be a very common problem when ANTLR parses large input files. The suggested alternatives are either to break the input into multiple smaller files, or to skip listeners and visitors altogether and create objects directly from grammar actions, storing them in hash maps or vectors.
Are there any good reference examples where setBuildParseTree = false is used?
Can the ANTLR4 Java parser handle very large files, or can it stream files?
Is it possible to parse a big file with ANTLR?
Have you encountered any such ANTLR problems in the past, and if so, how were they handled? Any suggestions that would help reduce the memory footprint and improve performance, specific to ANTLR?
The input files mostly contain SELECT and INSERT statements, but the volume is large. A sample:
INSERT INTO crmuser.OBJECT_CONFIG_DETAILS(
ATTRIBCONFIGID,OBJCONFIGID,ATTRIBNAME,PARENTNAME,ISREQUIRED
,ISSELECTED,READACCESS,WRITEACCESS,DEFAULTLABEL,CONFIGLABEL
,DATATYPE,ISCOMPOSITE,ISMANDATORY,ATTRIBSIZE,ATTRIBRANGE
,ATTRIBVALUES,ISWRITABLE)
VALUES (
91933804, 1682878, 'ACCOUNTS_EXTBO.RELATIVE_MEMBER_ID', 'ACCOUNTS_EXTBO',
'N', 'Y', 'F', 'F', 'ACCOUNTS_EXTBO.RELATIVE_MEMBER_ID',
'ACCOUNTS_EXTBO.RELATIVE_MEMBER_ID', 'String', 'N', 'N', 50,
null, null, 'N')
;
INSERT INTO crmuser.OBJECT_CONFIG_DETAILS(
ATTRIBCONFIGID,OBJCONFIGID,ATTRIBNAME,PARENTNAME,ISREQUIRED
,ISSELECTED,READACCESS,WRITEACCESS,DEFAULTLABEL,CONFIGLABEL
,DATATYPE,ISCOMPOSITE,ISMANDATORY,ATTRIBSIZE,ATTRIBRANGE
,ATTRIBVALUES,ISWRITABLE)
VALUES (
91933805, 1682878, 'ACCOUNTS_EXTBO.ELIGIBILITY_CRITERIA', 'ACCOUNTS_EXTBO',
'N', 'Y', 'F', 'F', 'ACCOUNTS_EXTBO.ELIGIBILITY_CRITERIA',
'ACCOUNTS_EXTBO.ELIGIBILITY_CRITERIA', 'String', 'N', 'N', 50,
null, null, 'N')
;
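One rough sketch of the "break the input into smaller pieces" workaround mentioned above, assuming (as in the sample) that every statement ends with a ';' on a line of its own. The file name is a placeholder, and parse_statement is a hypothetical stand-in for whatever per-statement parsing is actually done, so only one statement's parse tree is alive at a time and memory stays bounded:

def iter_statements(path, encoding='utf-8'):
    # Yield one SQL statement at a time; only the current statement is held in memory.
    buf = []
    with open(path, 'r', encoding=encoding) as f:
        for line in f:
            if line.strip() == ';':
                yield ''.join(buf)
                buf = []
            else:
                buf.append(line)
    if buf:
        yield ''.join(buf)   # trailing statement with no closing ';' line

def parse_statement(stmt):
    # Placeholder: this is where the real per-statement parser call would go.
    pass

for stmt in iter_statements('big_oracle_dump.sql'):   # hypothetical file name
    parse_statement(stmt)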

Related

Reading a 20 GB CSV file in Python

I am trying to read a 20 GB file in Python from a remote path. The code below reads the file in chunks, but if for any reason the connection to the remote path is lost, I have to restart the entire reading process. Is there a way I can continue from the last row read and keep appending to the list I am trying to create? Here is my code:
import pandas as pd
from tqdm import tqdm

chunksize = 100000
df_list = []  # list to hold the batch dataframes
for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    df_list.append(df_chunk)
train_df = pd.concat(df_list)
Do you have much more than 20 GB of RAM? Because you're reading the entire file into RAM and representing it as Python objects. That df_list.append(df_chunk) is the culprit.
What you need to do is:
read it by smaller pieces (you already do);
process it piece by piece;
discard the old piece after processing. Python's garbage collection will do it for you unless you keep a reference to the spent chunk, as you currently do in df_list.
Note that you can keep the intermediate / summary data in RAM the whole time. Just don't keep the entire input in RAM the whole time.
Or get 64GB / 128GB RAM, whichever is faster for you. Sometimes just throwing more resources at a problem is faster.
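A minimal sketch of that piece-by-piece pattern, reusing the names from the question; the per-column sum is just a stand-in for whatever per-chunk processing is actually needed:

import pandas as pd

chunksize = 100000
totals = None  # small running summary kept in RAM instead of the raw chunks

for df_chunk in pd.read_csv(pathtofile, chunksize=chunksize, engine='python'):
    partial = df_chunk.sum(numeric_only=True)       # process the piece
    totals = partial if totals is None else totals.add(partial, fill_value=0)
    # df_chunk is no longer referenced after this point, so the garbage
    # collector can reclaim it before the next chunk is read

print(totals)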

The effect of otherwise inconsequential calls to readlines() on the speed of open(filepath) in Python 3.7

I'm working on a university assignment that consists in building a hash table.
One of the tasks requires that we measure the time taken to associate a list of keys with some value, and measure the effects of different hash bases and different table capacities on the execution time. Incidentally, collisions are resolved by linear probing.
We're asked to read the keys from a file and associate them with the value of 1, as in:
with open(filename, 'r', encoding='utf-8-sig') as data:
    for line in data.readlines():
        line = line.strip('\n')
        hashtable[line] = 1
However, given certain combinations of suboptimal hash bases and table capacities, such as 1 and 250,727 respectively, or 3 and 250,727, the excerpt above is intolerably slow. It seems to run indefinitely. For instance, it didn't complete after running for several hours!
Curiously, if I add some expression like len(data.readlines()) or type(data.readlines()), accessing the file object before entering the loop, then the program completes in less than one second with the same parameters.
with open(filename, 'r', encoding='utf-8-sig') as data:
    len(data.readlines())
    for line in data.readlines():
        line = line.strip('\n')
        hashtable[line] = 1
Can anyone clarify this for me?
Thank you!
The length is calculated based on where the file starts (the memory location at which the file starts) and where the file ends (the memory location at which the file ends).
That is an O(1) operation, which is why it takes so little time.
On the other hand, insertion into a hashtable is an O(1) operation per entry, so for n entries the total time is O(n). But if you are using linear probing for collision handling, an element may start near the first position and end up probed all the way to the end of the table, so the worst-case complexity becomes O(n^2). If you could upload the file somewhere and share the link, I could offer some more analysis.
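For illustration, a rough sketch of a linear-probing insert, assuming the usual polynomial string hash parameterised by the base (the exact hash in the assignment may differ). With a poor base such as 1, many keys collapse onto the same slots, which is what makes the probe runs so long:

def insert(table, key, value, hash_base):
    capacity = len(table)
    # Polynomial string hash; a poor base (e.g. 1) maps many keys to the
    # same region of the table.
    h = 0
    for ch in key:
        h = (h * hash_base + ord(ch)) % capacity
    probes = 0
    while table[h] is not None and table[h][0] != key:
        h = (h + 1) % capacity   # linear probing: try the next slot
        probes += 1
        if probes >= capacity:
            raise RuntimeError("hash table is full")
    table[h] = (key, value)
    return probes                # handy for measuring clustering

table = [None] * 250727          # capacity from the question
insert(table, "example_key", 1, hash_base=1)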

How to design spark program to process 300 most recent files?

Situation
New small files come in periodically, and I need to run a calculation over the most recent 300 of them. So essentially there is a window moving forward: the size of the window is 300 files, and I need to do the calculation on that window.
One important thing to note is that this is not Spark Streaming, because in Spark Streaming the unit/scope of a window is time, whereas here the unit/scope is a number of files.
Solution 1
I will maintain a dict of size 300. Each time a new file comes in, I turn it into a Spark data frame and put it into the dict, then make sure the oldest entry is popped out if the length of the dict exceeds 300.
After that I merge all the data frames in the dict into a bigger one and do the calculation.
The above process runs in a loop; every time a new file comes in, we go through the loop again.
pseudo code for solution 1
for file in file_list:
    data_frame = get_data_frame(file)
    my_dict[timestamp] = data_frame   # timestamp taken from the file

    # evict stale entries; not only unpersist, but also delete,
    # to make sure the memory is actually released
    for ts in list(my_dict.keys()):
        if ts older than 24 hours:    # pseudo condition
            my_dict[ts].unpersist()
            del my_dict[ts]

    # pop one data frame from the dict and union the rest onto it
    _, big_data_frame = my_dict.popitem()
    for ts in my_dict.keys():
        df = my_dict.get(ts)
        big_data_frame = big_data_frame.unionAll(df)

    # Then we run SQL on big_data_frame to get the report
problem for solution 1
We always hit "out of memory" or "GC overhead limit exceeded" errors.
question
Do you see anything inappropriate in solution 1?
Is there any better solution?
Is this the right kind of situation in which to use Spark?
One observation: you probably don't want to use popitem, because the keys of a Python dictionary are not sorted, so you can't guarantee that you're popping the earliest item. Instead, I would recreate the dictionary each time from a sorted list of timestamps. Assuming your filenames are just timestamps:
my_dict = {file:get_dataframe(file) for file in sorted(file_list)[-300:]}
I'm not sure whether this will fix your problem; can you paste the full stack trace of the error into the question? It's possible that the problem is happening in the Spark merge/join (not included in your question).
My suggestion is streaming, but not windowed by time: you would still have a window and sliding interval set, say 60 seconds.
So every 60 seconds you get a DStream of file contents, in 'x' partitions. These 'x' partitions represent the files you drop onto HDFS or the file system.
This way you can keep track of how many files/partitions have been read; if there are fewer than 300, wait until the count reaches 300, and then start processing.
If it's possible to keep track of the most recent files, or to just discover them once in a while, then I'd suggest doing something like
sc.textFile(','.join(files));
or if it's possible to identify specific pattern to get those 300 files, then
sc.textFile("*pattern*");
You can even pass comma-separated patterns, but then files that match more than one pattern may be read more than once.
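A hedged sketch combining the two suggestions above: pick the 300 most recently modified files by mtime and hand them to a single textFile call. The directory and pattern are placeholders, and sc is assumed to be the usual SparkContext available on the driver:

import glob
import os

all_files = glob.glob("/data/incoming/*")   # placeholder location

# Pick the 300 most recently modified files (the window is a file count, not a time span).
recent = sorted(all_files, key=os.path.getmtime, reverse=True)[:300]

# One RDD over exactly those 300 files, instead of 300 per-file data frames
# unioned together in a driver-side dict.
rdd = sc.textFile(",".join(recent))
# ... run the aggregation / SQL over `rdd` here ...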

Processing bzipped json file in Spark?

I have about 200 files in S3, e.g. a_file.json.bz2; each line of these files is a record in JSON format, but some fields were serialised with pickle.dumps, e.g. a datetime field. Each file is about 1 GB after bzip compression. Now I need to process these files in Spark (pyspark, actually), but I couldn't even get the individual records out. So what would be the best practice here?
The ds.take(10) gives
[(0, u'(I551'),
(6, u'(dp0'),
(11, u'Vadv_id'),
(19, u'p1'),
(22, u'V479883'),
(30, u'p2'),
(33, u'sVcpg_id'),
(42, u'p3'),
(45, u'V1913398'),
(54, u'p4')]
Apparently the file is not being split on record boundaries.
Thank you.
I had this issue reading gpg-encrypted files. You can use wholeTextFiles as Daniel suggests, but you have to be careful when reading large files, as the entire file will be loaded into memory before processing. If the file is too large, it can crash the executor. I used parallelize and flatMap. Maybe something along the lines of
import bz2
import glob

def read_fun_generator(filename):
    # Yield the file's lines one at a time so only one line is in memory per record.
    with bz2.open(filename, 'rb') as f:
        for line in f:
            yield line.strip()

bz2_filelist = glob.glob("/path/to/files/*.bz2")
rdd_from_bz2 = sc.parallelize(bz2_filelist).flatMap(read_fun_generator)
You can access the input file-by-file (instead of line-by-line) via SparkContext.wholeTextFiles. You can then use flatMap to uncompress and parse the lines in your own code.
In fact the problem is caused by pickle. Looking at the file content after decompression, it is indeed
(I551
(dp0
Vadv_id
p1
V479883
p2
sVcpg_id
p3
V1913398
p4
which is hard to parse. I know I could just call pickle.load(file) multiple times to get the objects out, but I cannot find a quick way to do that in Spark, where I can only access the loaded files line by line. Also, the records in this file have variable fields and lengths, which makes it even harder to hack around.
I ended up re-generating these bz2 files from the source, because it was actually easier and faster. I also learnt that Spark and Hadoop support bz2 compression out of the box, so no additional action is required.
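For reference, the "call pickle.load repeatedly" route could be pushed into the same parallelize-plus-flatMap pattern shown above. This is only a sketch: the path is a placeholder and the files are assumed to be reachable from each executor (e.g. downloaded or mounted locally), with sc being the usual SparkContext:

import bz2
import pickle

def iter_records(path):
    # Repeatedly unpickle objects from the decompressed stream until it is exhausted.
    with bz2.open(path, 'rb') as f:          # binary mode: pickle data is not text
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break

bz2_filelist = ["/local/path/to/a_file.json.bz2"]   # placeholder paths
records = sc.parallelize(bz2_filelist).flatMap(iter_records)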

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a file of about 16 million lines). Is there a faster utility I can use instead?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
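As a small illustration of step 1, the random-key file could be produced with a few lines of Python. The file names follow the paste example above, and random.random() stands in for whatever random source is preferred:

import random

# One random key per input line, written ahead of time so the randomness
# is not generated while the sort itself is running.
with open("string_data.txt") as src, open("random_number_file.txt", "w") as keys:
    for _ in src:
        keys.write("%.17f\n" % random.random())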
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize: the number of lines to keep in RAM when writing the output. The bigger the better (unless you run out of RAM), because the total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count the lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) It also gives a measurement of how long one full read of the file takes, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual work. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory, so we split the task.
We go through sourceFile reading all lines, keeping in memory only those lines whose indices fall in the first batchSize entries of orderArray. Once we have all of them, we write them to outFile in the required order, and that is batchSize/linesCount of the work done.
Next we repeat the whole process again and again, taking the next part of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
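A rough Python sketch of the same multi-pass idea (not the tool's actual code), assuming the line-oriented model described above:

import random

def shuffle_file(source_path, out_path, batch_size):
    # Pass 0: count the lines; also a rough measure of one full sequential read.
    with open(source_path, 'r', encoding='utf-8') as f:
        line_count = sum(1 for _ in f)

    # Global shuffled order for the whole file (random.shuffle is Fisher-Yates).
    order = list(range(line_count))
    random.shuffle(order)

    with open(out_path, 'w', encoding='utf-8') as out:
        for start in range(0, line_count, batch_size):
            wanted = order[start:start + batch_size]   # source lines for the next output slice
            wanted_set = set(wanted)
            kept = {}
            # One sequential pass over the source, keeping only this batch's lines.
            with open(source_path, 'r', encoding='utf-8') as src:
                for idx, line in enumerate(src):
                    if idx in wanted_set:
                        kept[idx] = line
            # Write them out in the globally shuffled order.
            for idx in wanted:
                out.write(kept[idx])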
Why it works?
Because all we do is read the source file from start to end. There are no seeks forward or backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.
