PROC GROOVY to parse large XML into SAS

We are reading 3-4 GB XML files using the SAS XML Mapper, but when we PROC COPY the data from the XML engine to a SAS dataset it takes almost 5 to 6 minutes, which is too long for us since we have to process 3000 files a day. We run about 10 files in parallel, and one table has roughly 230 columns.
Is there a faster way to process the XML?
Can we use PROC GROOVY? Would it be more efficient? If so, can anyone provide sample code?
I searched online but could not find one.
The XML contains PII data and is huge, around 3 GB.
The code being run is very simple and straightforward:
filename NHL "/path/ODM.xml";
filename map "/path/odm_map.map";
libname NHL xmlv2 xmlmap=map;
proc copy in=nhl out=work;
run;
Total tables created: 54, of which more than 14 tables have ~18,000 records and the remaining tables have ~1,000 records.
The log window shows:
NOTE: PROCEDURE COPY used (Total process time):
real time 4:03.72
user cpu time 4:00.68
system cpu time 1.17 seconds
memory 32842.37k
OS Memory 52888.00k
Timestamp 19/05/2020 03:14:43 PM
Step Count 4 Switch Count 802
Page Faults 3
Page Reclaims 17172
Page Swaps 0
Voluntary Context Switches 3662
Involuntary Context Switches 27536
Block Input Operations 504
Block Output Operations 56512
SAS version: 9.4_M2
Total memory size on our server is MEMSIZE=3221225472.
There are 3000 files in total, of which about 1000 are 3 to 4 GB, some are around 1 GB, and 1000 are only a few KB. The smaller files are processed quickly; the problem is only with the big files, which use almost the entire CPU.
The copy time from the XML engine goes down when we reduce the number of files, but for that to happen we would have to change the map file or the input XML.
We have already raised SAS tracks and asked the same question in the SAS Communities, still with no luck; it looks like a limitation of the parser itself.
Any idea about the shredder in Teradata? Would it be more efficient?

I would do this in two pieces: first convert the XML to a plain-text format (CSV or similar), and then read that into SAS. SAS isn't going to be very fast at converting XML into SAS datasets; it's just not something SAS is optimized for. You're using almost entirely CPU time, so you're not disk limited - you're limited by SAS's ability to parse the XML file.
Write a program in a more optimized language that can parse the XML much faster, and then read its output into SAS. Python might be one option - it's not super optimized either, but I suspect it's better suited to this sort of thing than SAS - or an even lower-level language (like C/C++) might be your best bet.
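If you go the Python route, here is a minimal sketch of that first step, using xml.etree.ElementTree.iterparse to stream the file and write one table out as CSV. The element name "ItemData" and its attribute-based layout are placeholders for whatever your XML map actually extracts, so treat this as the shape of a solution rather than a drop-in:

# Sketch only: stream a multi-GB XML file and emit one table as CSV.
# "ItemData" and the attribute layout are hypothetical placeholders; adapt
# them to the elements your XML map actually pulls out.
import csv
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path, record_tag="ItemData"):
    with open(csv_path, "w", newline="") as out:
        writer = None
        # iterparse streams the document, so memory stays bounded
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag.endswith(record_tag):
                row = dict(elem.attrib)
                if writer is None:
                    writer = csv.DictWriter(out, fieldnames=sorted(row))
                    writer.writeheader()
                # missing attributes become empty cells; extra ones are dropped
                writer.writerow({k: row.get(k, "") for k in writer.fieldnames})
                # clear the processed element; for truly flat memory you would
                # also keep a handle on the root and clear it periodically
                elem.clear()

xml_to_csv("/path/ODM.xml", "/path/ODM_itemdata.csv")

SAS can then read the resulting CSV with a simple DATA step or PROC IMPORT, avoiding the XML engine entirely.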

Related

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10 MB) gzipped files, and I know that's inefficient for Spark. I am running a simple batch job to convert these files to Parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning for the driver indexing the files (it looks like around 20 minutes before the batch starts). In the UI for the single directory, there is one task after these 20 minutes that looks like the conversion itself.
However, with individual filenames, this indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files there are two tasks: (1) listing the leaf files for the 8 million paths, and then (2) the job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory, so when given a list of paths it has to do a LIST call on each. For S3 that means 8 million LIST calls against the S3 servers, which are rate limited to about 3,000/second, ignoring details like client thread count, HTTP connections, etc. With LIST billed at $0.005 per 1,000 calls, those 8 million requests come to about $40. And because each LIST returns nothing, the client falls back to a HEAD request, which adds another S3 API call per path, doubling execution time and adding another $32 to the query cost.
In contrast, listing a directory with 8 million entries kicks off a single LIST request for the first 1,000 entries, followed by 7,999 follow-up requests for the remaining pages. Recent s3a releases asynchronously prefetch the next page of results (faster still if the incremental list iterators are used): one thread fetches, one processes, and the whole listing costs you about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even before you consider EC2 server costs.
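To make the pagination concrete, here is a rough boto3 sketch of what a single "directory" listing looks like at the S3 API level; the bucket and prefix names are placeholders. Each page corresponds to one LIST request returning up to 1,000 keys, so 8 million objects cost roughly 8,000 requests rather than 8 million:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total = 0
# each page is one LIST request returning up to 1,000 keys
for page in paginator.paginate(Bucket="input_bucket_name", Prefix="data/"):
    total += len(page.get("Contents", []))

print(total, "objects listed")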

Liferay: huge DLFileRank table

I have a Liferay 6.2 server that has been running for years and is starting to take a lot of database space, despite limited actual content.
Table           Size    Number of rows
---------------------------------------
DLFileRank      5 GB    16 million
DLFileEntry     90 MB   60,000
JournalArticle  2 GB    100,000
The size of the DLFileRank table seems abnormally large to me (if it is actually normal, please let me know).
While the file ranking feature of Liferay is nice to have, we would not really mind resetting it if it halves the size of the database.
Question: Would a DELETE FROM DLFileRank be safe? (Stop Liferay, run that SQL command, maybe set dl.file.rank.enabled=false in portal-ext.properties, start Liferay again.)
Is there any better way to do it?
Bonus if there is a way to keep recent ranking data and throw away only the old data (not a strong requirement).
Wow. According to the documentation here (Ctrl-F rank), I'd not have expected the number of entries to be so high - did you configure those values differently?
Set the interval in minutes on how often CheckFileRankMessageListener
will run to check for and remove file ranks in excess of the maximum
number of file ranks to maintain per user per file. Defaults:
dl.file.rank.check.interval=15
Set this to true to enable file rank for document library files.
Defaults:
dl.file.rank.enabled=true
Set the maximum number of file ranks to maintain per user per file.
Defaults:
dl.file.rank.max.size=5
And according to the implementation of CheckFileRankMessageListener, it should be enough to just trigger DLFileRankLocalServiceUtil.checkFileRanks() yourself (e.g. through the scripting console). Why you accumulate such a large number of file ranks is beyond me...
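For example, something like this in the scripting console (a sketch in the console's Python flavor; the package path is what I would expect for 6.2, so verify it against your installation):

# Liferay 6.2 scripting console, Python flavor (sketch; package path assumed)
from com.liferay.portlet.documentlibrary.service import DLFileRankLocalServiceUtil

# Runs the same cleanup that CheckFileRankMessageListener triggers on its
# schedule, trimming ranks above dl.file.rank.max.size per user per file
DLFileRankLocalServiceUtil.checkFileRanks()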
As you might know, I will never be quoted as saying that direct database manipulation is the way to go - in fact, I refuse to even think about the problem that way.

How big the spark stream window could be?

I have some data flows that need to be calculated. I am thinking about using Spark Streaming to do this job, but there is one thing I am not sure about and worried about.
My requirements are as follows:
Data comes in as CSV files every 5 minutes. I need reports on the data of the most recent 5 minutes, 1 hour, and 1 day. So if I set up a Spark Streaming job to do this calculation, I need a batch interval of 5 minutes, and I also need to set up two windows, 1 hour and 1 day.
Every 5 minutes, 1 GB of data comes in, so the one-hour window will process 12 GB (60/5) of data and the one-day window will process 288 GB (24*60/5) of data.
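For reference, this is roughly the setup I have in mind in PySpark DStream terms (a sketch; the input directory and the simple counts stand in for my real aggregation):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="csv-window-sketch")
ssc = StreamingContext(sc, 300)                 # 5-minute batch interval

lines = ssc.textFileStream("/data/incoming")    # new CSV files arrive every 5 minutes

# Sliding windows over the same stream: 1 hour and 1 day, both sliding every 5 minutes
hourly = lines.window(windowDuration=3600, slideDuration=300)
daily = lines.window(windowDuration=86400, slideDuration=300)

hourly.count().pprint()                         # placeholder for the real report
daily.count().pprint()

ssc.start()
ssc.awaitTermination()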
I do not have much experience with Spark, so this worries me.
Can Spark handle such a big window?
How much RAM do I need to process those 288 GB of data? More than 288 GB of RAM? (I know this may depend on my disk I/O, CPU, and the calculation pattern, but I just want a rough estimate based on experience.)
If calculating over one day / one hour of data is too expensive in a stream, do you have any better suggestions?

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility that I can use in its place?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize - the number of lines to keep in RAM when writing the output. The bigger the better (unless you run out of RAM), because the total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count the lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) It also gives a measurement of how much time it takes to read the whole file once, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
Now that we know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual work. We need to emit the lines from sourceFile in the order we just computed, but we can't read the whole file into memory, so we split the task.
We go through sourceFile reading all lines, keeping in memory only those lines whose positions fall in the first batchSize entries of orderArray. Once we have all of them, we write them to outFile in the required order, and that is batchSize/linesCount of the work done.
Next we repeat the whole process again and again, taking the next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
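A minimal Python sketch of the same idea (HugeFileProcessor itself is a separate tool; this only illustrates the batched, sequential-read algorithm described above):

import random

def shuffle_file(source_path, out_path, batch_size):
    # Pass 0: count lines by reading the whole file once
    with open(source_path) as f:
        lines_count = sum(1 for _ in f)

    # Global shuffled order of line indices (Fisher-Yates via random.shuffle)
    order = list(range(lines_count))
    random.shuffle(order)

    with open(out_path, "w") as out:
        # One full sequential read of the source per batch of the order array
        for start in range(0, lines_count, batch_size):
            batch = order[start:start + batch_size]
            wanted = {line_no: pos for pos, line_no in enumerate(batch)}
            buffered = [None] * len(batch)
            with open(source_path) as src:
                for line_no, line in enumerate(src):
                    pos = wanted.get(line_no)
                    if pos is not None:
                        buffered[pos] = line
            out.writelines(buffered)

shuffle_file("strings.txt", "shuffled.txt", batch_size=2_000_000)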
Why does it work?
Because all we do is read the source file from start to end. There are no forward/backward seeks, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using batchSize of 3 500 000. With batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118000 lines per second.

How to deal with a large amount of logs and redis?

Say I have about 150 requests per second coming in to an API (Node.js), which are then logged in Redis. At that rate, a moderately priced RedisToGo instance will fill up every hour or so.
The logs are only needed to generate daily/monthly/annual statistics: which was the top requested keyword, which was the top requested URL, the total number of requests per day, etc. No super heavy calculations, but a somewhat time-consuming run through arrays to find the most frequent element in each.
If I analyze and then dump this data (with a setInterval function in Node, maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if all of a sudden I have to deal with, say, 2,500 requests per second?
All of a sudden I'm dealing with ~4.5 GB of data per hour, about 2.25 GB every 30 minutes. Even with how fast Redis/Node are, it would still take a minute to calculate the most frequent requests.
Questions:
What will happen to the Redis instance while 2.25 GB worth of data is being processed (from a list, I imagine)?
Is there a better way to deal with potentially large amounts of log data than moving it into Redis and then flushing it out periodically?
IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You will be better served by collecting your logs on a single server and writing them to a filesystem.
Now, what you can do with Redis is calculate your statistics in real time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need.
For instance, for each log line, you could pipeline the following commands to Redis:
zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req
This will calculate the top keywords, top URLs, and number of requests for the current day. At the end of the day:
# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec
# Keep only the 100 top keyword and url of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101
# Aggregate monthly statistics for keyword
multi
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec
# Aggregate monthly statistics for url
multi
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec
# Aggregate number of requests of the month
get tmp:nb_req
incrby month:nb_req <result of the previous command>
del tmp:nb_req
At the end of the month, the process is completely similar (using zunionstore or get/incr on monthly data to aggregate the yearly data).
The main benefit of this approach is that the number of operations performed per log line is limited, while the monthly and yearly aggregations can still be calculated easily.
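For illustration, a minimal sketch of the per-line updates batched through a pipeline, shown here with the Python Redis client (redis-py 3.x argument order assumed); the commands are exactly the ones above and map one-to-one to any Node client:

import redis

r = redis.Redis()

def log_request(keyword, url):
    pipe = r.pipeline()
    pipe.zincrby("day:top:keyword", 1, keyword)
    pipe.zincrby("day:top:url", 1, url)
    pipe.incr("day:nb_req")
    pipe.execute()   # one round trip for all three commands

log_request("my_keyword", "my_url")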
How about using Flume or Chukwa (or perhaps even Scribe) to move the log data to a different server (if available)? You could then store the log data using Hadoop/HBase or any other disk-based store.
https://cwiki.apache.org/FLUME/
http://incubator.apache.org/chukwa/
https://github.com/facebook/scribe/

Resources