Liferay: huge DLFileRank table

I have a Liferay 6.2 server that has been running for years and is starting to take up a lot of database space, despite limited actual content.
Table            Size    Number of rows
----------------------------------------
DLFileRank       5 GB    16 million
DLFileEntry      90 MB   60,000
JournalArticle   2 GB    100,000
The size of the DLFileRank table seems abnormally large to me (if it is totally normal, please let me know).
While the file ranking feature of Liferay is nice to have, we would not really mind resetting it if it halves the size of the database.
Question: Would a DELETE FROM DLFileRank be safe? (stop Liferay, run that SQL command, maybe set dl.file.rank.enabled=false in portal-ext.properties, start Liferay again)
Is there any better way to do it?
Bonus if there is a way to keep recent ranking data and throw away only the old data (not a strong requirement).

Wow. According to the documentation here (Ctrl-F rank), I'd not have expected the number of entries to be so high - did you configure those values differently?
Set the interval in minutes on how often CheckFileRankMessageListener
will run to check for and remove file ranks in excess of the maximum
number of file ranks to maintain per user per file. Defaults:
dl.file.rank.check.interval=15
Set this to true to enable file rank for document library files.
Defaults:
dl.file.rank.enabled=true
Set the maximum number of file ranks to maintain per user per file.
Defaults:
dl.file.rank.max.size=5
And according to the implementation of CheckFileRankMessageListener, it should be enough to just trigger DLFileRankLocalServiceUtil.checkFileRanks() yourself (e.g. through the scripting console, as sketched below). Why you accumulate such a large number of file ranks is beyond me...
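If you want to try that, here is a minimal sketch for the Control Panel's script console, assuming its Python (Jython) engine is available; the call is the same one the listener makes on its schedule:

# Liferay 6.2 script console, language: Python.
# Runs the same cleanup CheckFileRankMessageListener performs periodically:
# it trims file ranks in excess of dl.file.rank.max.size per user per file.
from com.liferay.portlet.documentlibrary.service import DLFileRankLocalServiceUtil

DLFileRankLocalServiceUtil.checkFileRanks()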
As you might know, you will never hear me say that direct database manipulation is the way to go - in fact, I refuse to even think about the problem that way.

Related

PROC GROOVY to parse large XML into SAS

We tried reading 3-4 GB XML files using the SAS XML Mapper, but when we PROC COPY the data from the XML engine to a SAS dataset it takes almost 5 to 6 minutes, which is too much time for us since we have to process 3,000 files a day. We run almost 10 files in parallel. One table has almost 230 columns.
Is there any other, faster way to process the XML?
Can we use PROC GROOVY? Will it be efficient? If yes, can anyone provide sample code?
I tried searching online but was not able to find any.
The XML contains PII data and is huge, at 3 GB.
The code being run is very simple and straightforward:
filename NHL "/path/ODM.xml";
filename map "/path/odm_map.map";
libname NHL xmlv2 xmlmap=map;
proc copy in=nhl out=work;
run;
Total tables created: 54, of which more than 14 tables have ~18,000 records and the remaining tables have ~1,000 records.
The log window shows:
NOTE: PROCEDURE COPY used (Total process time):
      real time                      4:03.72
      user cpu time                  4:00.68
      system cpu time                1.17 seconds
      memory                         32842.37k
      OS Memory                      52888.00k
      Timestamp                      19/05/2020 03:14:43 PM
      Step Count                     4  Switch Count  802
      Page Faults                    3
      Page Reclaims                  17172
      Page Swaps                     0
      Voluntary Context Switches     3662
      Involuntary Context Switches   27536
      Block Input Operations         504
      Block Output Operations        56512
SAS Version : 9.4_M2
The total memsize on our server is MEMSIZE=3221225472 (3 GB).
There are 3,000 files in total, of which about 1,000 are 3 to 4 GB, some are around 1 GB, and about 1,000 are only a few KB. The smaller files are processed quickly; the problem is only with the big files. Processing uses almost the entire CPU.
The copy time from the XML engine varies when we reduce the number of files, but for that to happen we would have to change the map file or the input XML.
We have already raised SAS tracks and asked the same question in the SAS Communities, still with no luck. It looks like it is a limitation of the parser itself.
Any idea about the shredder in Teradata? Would it be efficient?
I would do this in two pieces: first convert the XML to a flat text format, then read that into SAS. SAS isn't going to be very fast at converting XML into datasets; it's just not something SAS is optimized for. You're using nearly entirely CPU time, so you're not disk-limited - you're limited by SAS's ability to parse the XML file.
Write a program in a more optimized language that can parse the XML much faster, and then read the results of that into SAS. Python might be one option - it's not super optimized either, but I suspect it's more optimized for this sort of thing than SAS - or an even lower-level language (like C/C++) might be your best bet.
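As a minimal sketch of that two-piece approach in Python: stream-parse the file with lxml's iterparse (so memory stays flat on multi-GB inputs) and write a flat CSV that SAS can read cheaply. The element name ItemData and its attributes are assumptions standing in for the real ODM structure:

import csv
from lxml import etree

INPUT = "/path/ODM.xml"        # the same file the SAS filename statement points at
OUTPUT = "/path/records.csv"   # flat file for SAS to read afterwards

# If the ODM file declares a default namespace, the tag must be written in
# Clark notation, e.g. "{http://www.cdisc.org/ns/odm/v1.3}ItemData".
with open(OUTPUT, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["item_oid", "value"])          # hypothetical columns
    for _, elem in etree.iterparse(INPUT, tag="ItemData"):
        writer.writerow([elem.get("ItemOID"), elem.get("Value")])
        elem.clear()                                # free the finished element
        while elem.getprevious() is not None:
            del elem.getparent()[0]                 # drop already-processed siblings

SAS can then load the CSV with an ordinary DATA step or PROC IMPORT, which is far cheaper than parsing through the XMLV2 engine.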

node.js persistent store json with minimal performance penalty

I'm running a node web server using the express module and would like to include the following features in it:
primarily, track every visitor's source IP, time, and unique or repeated visits by saving them to a JSON file.
secondly, if someone hits my server more than 10 times in the last 15 seconds looking for vulnerabilities (non-existent pages), collect those attempts in a buffer (that holds 30 seconds' worth of data) and, once the threshold is reached, start blocking that source IP for X number of hours.
I'm interested in finding out the fastest way to save this information with very minimal performance penalty.
My choice so far is to create a RAMDISK and save this info into a continuous file on that RAMDISK.
The visitor info gets written to a database every few minutes.
The info for notorious visitors will be reset every 30 seconds so as to keep the lookup quick.
The question I have is: is writing to a RAMDISK the fastest way to retain information (so it's not lost during a crash), or is there a better/faster way to achieve this goal?
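For illustration, the threshold check described above is a small sliding-window counter that can be kept entirely in memory; here is a sketch (in Python, though the same structure translates directly to node) using the question's 15-second/10-hit/X-hour parameters:

import time
from collections import defaultdict, deque

WINDOW = 15          # seconds to look back
THRESHOLD = 10       # hits allowed inside the window
BLOCK_HOURS = 1      # stand-in for the question's "X number of hours"

hits = defaultdict(deque)   # ip -> timestamps of recent probes for missing pages
blocked_until = {}          # ip -> unix time at which the block expires

def register_miss(ip, now=None):
    # Record a request for a non-existent page; return True if ip is blocked.
    now = now if now is not None else time.time()
    if blocked_until.get(ip, 0) > now:
        return True
    q = hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW:       # evict entries older than the window
        q.popleft()
    if len(q) > THRESHOLD:
        blocked_until[ip] = now + BLOCK_HOURS * 3600
        q.clear()
        return True
    return False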

Spark streaming - waiting for data for window aggregations?

I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by considering that the windows were only partial calculations. I modeled the backend (Cassandra) to receive more than 1 record for each window. The actual value of any given window would be the addition of all -potentially partial- records found for that window.
So, a perfect window would be 1 record, a split window would be 2 records, late-arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in terms of writes vs reads, it made for a good compromise.
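For illustration, read-time reconciliation under that model might look like the sketch below; the record layout (host, metric, window_start, partial_value) is an assumption, not the platform's actual schema:

from collections import defaultdict

# Rows as the streaming job might have written them: one or more
# (possibly partial) records per window, including late arrivals.
rows = [
    ("host-a", "cpu", "11:00", 40.0),   # window split across two batches
    ("host-a", "cpu", "11:00", 12.5),   # late arrival inside the age threshold
    ("host-a", "cpu", "12:00", 55.0),   # perfect window: a single record
]

totals = defaultdict(float)
for host, metric, window_start, partial in rows:
    totals[(host, metric, window_start)] += partial   # reconcile at read time

for key, value in sorted(totals.items()):
    print(key, value)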
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
Also, I don't believe you can tell your batches when to start/stop, etc.
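As a sketch of that batch-style alternative (field names assumed from the question's { host | metric | value | time-stamp } format): bucketing by event time means a record that arrives late still counts toward the hour it belongs to.

from collections import defaultdict
from datetime import datetime

# Records read back from storage once the hour's data is considered complete.
records = [
    ("host-a", "cpu", 40.0, "2016-01-10 11:05:00"),
    ("host-b", "cpu", 12.5, "2016-01-10 11:55:00"),
    ("host-a", "cpu", 55.0, "2016-01-10 12:10:00"),
]

hourly = defaultdict(float)
for host, metric, value, ts in records:
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(minute=0, second=0)
    hourly[(host, metric, hour.isoformat())] += value   # keyed by event-time hour

for key, total in sorted(hourly.items()):
    print(key, total)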

In New Relic RPM, I get reports with an Apdex index listed. What is the subscript meaning?

This sounds ridiculous, but New Relic RPM reports an Apdex index in a form like this:
0.92(3.5)
Where the 3.5 is subscripted.
What does the 3.5 mean? I can't find the definition anywhere, and yet there it is in my reports, staring me in the face.
The bracketed/subscripted number is the threshold (in seconds) for your Apdex score. So, in your case, if the full application response (page load) is less than 3.5s then that satisfies the requirement. If your app responds slower than the threshold then your Apdex score is impacted.
This threshold is customizable, so you can select what is appropriate for your application type.
You can read more about Apdex in our docs.
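For reference, the score itself comes from the standard Apdex formula (the published Apdex definition, not anything New Relic-specific): responses up to T count as satisfied, responses up to 4T count as tolerating at half weight, and the rest are frustrated.

def apdex(response_times, t=3.5):
    # Apdex_T = (satisfied + tolerating / 2) / total
    # satisfied: response <= T; tolerating: T < response <= 4T
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# e.g. apdex([1.2, 3.0, 5.0, 16.0]) == (2 + 0.5) / 4 == 0.625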
The subscripted number is your target response time for that tier. On the user agent (browser), the high-water mark is 7 seconds. You should check US-Only and make this number 2 to 4 seconds to be world class.
The app server tier must respond much faster. The high-water-mark default that NR sets is .5 seconds (500 milliseconds); a world-class page buffer flush would average 50-200 ms.
Remember all this information is about aggregated averages and not instance data which will have many outliers and have a broad distribution.

How to deal with a large amount of logs and redis?

Say I have about 150 requests per second coming in to an API (node.js), which are then logged in Redis. At that rate, a moderately priced RedisToGo instance will fill up every hour or so.
The logs are only necessary to generate daily/monthly/annual statistics: which was the top requested keyword, which was the top requested URL, total number of requests daily, etc. No super-heavy calculations, but a somewhat time-consuming run through arrays to see which is the most frequent element in each.
If I analyze and then dump this data (with a setInterval function in node, maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if all of a sudden I have to deal with, say, 2,500 requests per second?
All of a sudden I'm dealing with ~4.5 GB of data per hour - about 2.25 GB every 30 minutes. Even with how fast Redis and Node are, it'd still take a minute to calculate the most frequent requests.
Questions:
What will happen to the Redis instance while 2.25 GB worth of data is being processed? (from a list, I imagine)
Is there a better way to deal with potentially large amounts of log data than moving it to redis and then flushing it out periodically?
IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You will be better served by collecting your logs on a single server and writing them to a filesystem.
Now, what you can do with Redis is calculate your statistics in real time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need to calculate.
For instance, for each log line, you could pipeline the following commands to Redis:
zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req
This will calculate the top keywords, top urls and number of requests for the current day. At the end of the day:
# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec
# Keep only the top 100 keywords and urls of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101
# Aggregate monthly statistics for keyword
multi
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec
# Aggregate monthly statistics for url
multi
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec
# Aggregate number of requests of the month
get tmp:nb_req
incrby month:nb_req <result of the previous command>
del tmp:nb_req
At the end of the month, the process is completely similar (using zunionstore or get/incrby on monthly data to aggregate the yearly data).
The main benefit of this approach is that the number of operations done for each log line is limited, while the monthly and yearly aggregations can easily be calculated.
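For illustration, the per-line updates could be issued from Python with redis-py so that each log line costs a single round trip; the client setup here is an assumption:

import redis

r = redis.Redis()   # assumes a local instance; point at your RedisToGo host as needed

def log_line(keyword, url):
    # Batch the three counters into one round trip (wrapped in MULTI/EXEC by default)
    pipe = r.pipeline()
    pipe.zincrby("day:top:keyword", 1, keyword)
    pipe.zincrby("day:top:url", 1, url)
    pipe.incr("day:nb_req")
    pipe.execute()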
How about using Flume or Chukwa (or perhaps even Scribe) to move the log data to a different server (if available)? You could then store the log data using Hadoop/HBase or any other disk-based store.
https://cwiki.apache.org/FLUME/
http://incubator.apache.org/chukwa/
https://github.com/facebook/scribe/
