Best way to split up large array in a Node.js environment - node.js

I’m pulling data from the Cloudflare API, getting all web request logs for a very high traffic website in a certain time frame (less than 7 days of data).
The Cloudflare API takes start and end parameters for the dates you want to pull logs from. Start can be no more than 7 days in the past, and the difference between start and end cannot be greater than one hour. So, in order to pull the data I need (usually 3-4 days' worth), I wrote some custom code to generate a range of dates, one hour apart, from the start to the end I need.
With this range, I query the API in a loop and concat each array response onto a single large array, since I need to do analysis on all the data. This array typically ends up with ~1 million entries (objects). I'm sure you can see the problem here.
I'm using Deno (a Node.js alternative) and, at first, the program wouldn't even run because it ran out of memory. However, I figured out a workaround by passing the V8 flag --max-old-space-size=8000 to the run command. It runs now with my massive array, but it's very slow and my computer essentially becomes a brick while it's running.
My question is, how can I better deal with data of this size, specifically in a Node.js style environment?
Proposed Idea (please tell me if it’s stupid)
Deno gives a nice interface for creating temp directories and files, so I was thinking of saving the data from each API request in a temporary .json file and then reading the files back where I need them, since the next step for this data is to filter it down.
Would this approach improve speed?
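For reference, here is a minimal sketch of what that temp-file idea could look like, assuming Deno's built-in file APIs; fetchHour is a placeholder for the existing Cloudflare API call and the chunk naming is arbitrary:
// Sketch of the proposed approach: write each hourly API response to its own
// temp .json file instead of concatenating everything into one giant array.
// fetchHour() is a placeholder for the existing Cloudflare API call.
declare function fetchHour(start: string, end: string): Promise<unknown[]>;

async function saveHourlyResponses(ranges: { start: string; end: string }[]): Promise<string> {
  const dir = await Deno.makeTempDir({ prefix: "cf-logs-" });
  for (let i = 0; i < ranges.length; i++) {
    const entries = await fetchHour(ranges[i].start, ranges[i].end);
    // One file per hour keeps any single read small; later steps can filter
    // one chunk at a time instead of holding ~1M objects in memory.
    await Deno.writeTextFile(`${dir}/chunk-${i}.json`, JSON.stringify(entries));
  }
  return dir;
}
Whether this improves wall-clock speed depends on how the next step reads the files; the clear win is that peak memory stays at one hour's worth of data rather than the whole range.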

To elaborate on my comment, the following awk script counts the number of log entries by IP. I'd start there, and then grep that IP to list the visited resources.
./ip-histogram mylogfile.log
# Output
1 127.0.0.1
3 127.0.2.2
mylogfile.log
127.0.2.2 - - [28/Jul/2006:10:27:10 -0300] "GET /foo" 200 3395
127.0.2.2 - - [28/Jul/2006:10:27:10 -0300] "GET /bar" 200 3395
127.0.2.2 - - [28/Jul/2006:10:27:10 -0300] "GET /baz" 200 3395
127.0.0.1 - - [28/Jul/2006:10:22:04 -0300] "GET /foo" 200 2216
ip-histogram
#!/usr/bin/awk -f
# Counts the frequency of each IP in a log file.
# Expects the IP to be in the first ($1) column.
#
# Sample Output:
# ./ip-histogram mylogfile.log
# 12 1.1.1.1
# 18 2.2.2.2
{
    histogram[$1]++
}

END {
    for (ip in histogram)
        print histogram[ip], ip | "sort -n"
}
If you save the 1hr responses as mylog0001.log, mylog0002.log, and so on, you can aggregate them all with:
./ip-histogram mylog*.log
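One wrinkle: the Cloudflare responses are JSON objects rather than access-log lines, so a small conversion step is needed before the awk script applies. A rough sketch in Deno, where the field names (ClientIP, ClientRequestURI) are assumptions about the response shape and should be adjusted to whatever the API actually returns:
// Convert one hour's worth of JSON entries into an IP-first, whitespace-
// delimited log file that the ip-histogram script above can consume.
// Field names here are assumptions; adjust to the real response shape.
type LogEntry = { ClientIP: string; ClientRequestURI: string };

async function writeHourlyLog(entries: LogEntry[], index: number): Promise<void> {
  const lines = entries
    .map((e) => `${e.ClientIP} ${e.ClientRequestURI}`)
    .join("\n");
  // Zero-pad the index so the files sort and glob cleanly (mylog0001.log, ...)
  const name = `mylog${String(index + 1).padStart(4, "0")}.log`;
  await Deno.writeTextFile(name, lines + "\n");
}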

Related

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10 MB) and gzipped, and I know that's inefficient for Spark. I am running a simple batch job to convert these files to Parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in a whole directory, there isn't as much delay at the beginning for the driver indexing files (looks like around 20 minutes before the batch starts). In the UI for 1 directory, there is 1 task after this 20 minutes which looks like the conversion itself.
However, with individual filenames, this indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files, there are 2 tasks: (1) the first is listing leaf files for the 8 million paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory,
so when given a list of paths it has to do a LIST call on each one,
which for S3 means 8M LIST calls against the S3 servers,
which are rate limited to about 3K/second, ignoring details like thread count on the client, HTTP connections, etc.
And with LIST billed at $0.005 per 1,000 calls, 8M requests comes to $40.
Oh, and as each of those LIST calls returns nothing, the client falls back to a HEAD, which adds another S3 API call, doubling execution time and adding another $32 to the query cost.
In contrast,
listing a directory with 8M entries kicks off a single LIST request for the first 1K entries,
and 7,999 follow-ups.
s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread fetches, one processes, and the whole listing will cost you about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
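To make the arithmetic explicit, using only the rates quoted above (actual S3 pricing varies by region): 8,000,000 individual paths means 8,000,000 LIST calls, i.e. 8,000 x $0.005 = $40, plus the follow-up HEAD calls; one directory listing of the same 8,000,000 entries needs only ~8,000 paged LIST calls, i.e. 8 x $0.005 = $0.04, the four cents mentioned above.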

How to extract Requests/Seconds (Throughput) of a performance test using Locust?

I am running a Locust performance test against an API and I need to plot a requests/second vs response time plot. I can see req/s as a parameter in the results of the tests. Is there a library/class from which I can directly access this parameter?
Have you looked at using the master report / slave report event hook (depending on where you want to log it from)?
https://docs.locust.io/en/stable/api.html#locust.events.EventHook
You haven't said how you want to plot it, but we use something similar to shunt the metrics into a database to report on.
I think you can use the _requests.csv and _distribution.csv files that get generated if you pass in the --csv flag. These contain the requests/s column as well as the response times for different percentiles, and also min, max, median and avg.
https://docs.locust.io/en/stable/retrieving-stats.html

node.js persistent store json with minimal performance penalty

I'm running a node web server using express module and would like to include the following features in it:
primarily, track every visitor's source IP, time, and unique or repeated visits by saving them to a JSON file.
secondly, if someone hits my server more than 10 times in the last 15 seconds looking for vulnerabilities (non-existent pages), then collect those attempts in a buffer (that holds 30 seconds' worth of data) and, once the threshold is reached, start blocking that source IP for X number of hours.
I'm interested in finding out the fastest way to save this information with very minimal performance penalty.
My choice so far is to create a RAMDISK and save this info into a continuous file on that RAMDISK.
The visitor info gets written to a database every few minutes.
The info for notorious visitors will be reset every 30 seconds so as to keep the lookup quick.
The question I have is: is writing to a RAMDISK the fastest way to retain this information (so it's not lost during a crash), or is there a better/faster way to achieve this goal?
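For illustration only, here is a minimal in-memory sketch of the buffering and periodic flushing described above (it does not address surviving a crash, which is what the RAMDISK is for; flushToDatabase is a placeholder):
// Keep recent hits in memory, flush visitor info to a database every few
// minutes, and reset the "notorious visitor" window every 30 seconds.
// flushToDatabase() is a placeholder for whatever persistent store is used.
declare function flushToDatabase(batch: { ip: string; time: number; path: string }[]): Promise<void>;

const visitors: { ip: string; time: number; path: string }[] = [];
const hitsPerIp = new Map<string, number>(); // source IP -> hits in the current window

function recordHit(ip: string, path: string): void {
  visitors.push({ ip, time: Date.now(), path });
  hitsPerIp.set(ip, (hitsPerIp.get(ip) ?? 0) + 1);
}

setInterval(() => hitsPerIp.clear(), 30_000); // reset the 30-second window

setInterval(async () => {
  const batch = visitors.splice(0, visitors.length); // drain the buffer
  if (batch.length > 0) await flushToDatabase(batch);
}, 5 * 60_000);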

How to deal with a large amount of logs and redis?

Say I have about 150 requests coming in every second to an api (node.js) which are then logged in Redis. At that rate, the moderately priced RedisToGo instance will fill up every hour or so.
The logs are only necessary to generate daily/monthly/annual statistics: which was the top requested keyword, which was the top requested URL, the total number of requests daily, etc. No super heavy calculations, but a somewhat time-consuming run through arrays to see which is the most frequent element in each.
If I analyze and then dump this data (with a setInterval function in node maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if all of sudden I have to deal with, say, 2500 requests per second?
All of a sudden I'm dealing with ~4.5 GB of data per hour, about 2.25 GB every 30 minutes. Even with how fast redis/node are, it'd still take a minute to calculate the most frequent requests.
Questions:
What will happen to the redis instance while 2.25 GB worth of data is being processed? (from a list, I imagine)
Is there a better way to deal with potentially large amounts of log data than moving it to redis and then flushing it out periodically?
IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You will be better served by collecting your logs on a single server and writing them to a filesystem.
Now what you can do with Redis is trying to calculate your statistics in real-time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need to calculate.
For instance, for each log line, you could pipeline the following commands to Redis:
zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req
This will calculate the top keywords, top urls and number of requests for the current day. At the end of the day:
# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec
# Keep only the 100 top keywords and urls of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101
# Aggregate monthly statistics for keyword
multi
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec
# Aggregate monthly statistics for url
multi
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec
# Aggregate number of requests of the month
get tmp:nb_req
incrby month:nb_req <result of the previous command>
del tmp:nb_req
At the end of the month, the process is completely similar (using zunionstore or get/incrby on the monthly data to aggregate the yearly data).
The main benefit of this approach is that the number of operations per log line is limited, while the monthly and yearly aggregations can still be calculated easily.
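As a concrete illustration, the per-log-line pipelining could look something like this from Node, assuming the ioredis client (any client that supports pipelining works the same way):
import Redis from "ioredis";

const redis = new Redis();

// Pipeline the three per-line commands so each log line costs one round trip.
async function recordLogLine(keyword: string, url: string): Promise<void> {
  await redis
    .pipeline()
    .zincrby("day:top:keyword", 1, keyword)
    .zincrby("day:top:url", 1, url)
    .incr("day:nb_req")
    .exec();
}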
How about using Flume or Chukwa (or perhaps even Scribe) to move the log data to a different server (if available)? You could store the log data using Hadoop/HBase or any other disk-based store.
https://cwiki.apache.org/FLUME/
http://incubator.apache.org/chukwa/
https://github.com/facebook/scribe/

Flickr API returning duplicate photos

I've come across a confusing issue with the flickr API.
When I do a photo search (flickr.photos.search) and request high page numbers, I often get duplicate photos returned for different page numbers.
Here are three URLs; they should each return three different sets of images, yet they, bizarrely, return the same images:
http://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=ca3035f67faa0fcc72b74cf6e396e6a7&tags=gizmo&tag_mode=all&per_page=3&page=6820
http://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=ca3035f67faa0fcc72b74cf6e396e6a7&tags=gizmo&tag_mode=all&per_page=3&page=6821
http://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=ca3035f67faa0fcc72b74cf6e396e6a7&tags=gizmo&tag_mode=all&per_page=3&page=6822
Has anyone else come across this?
I seem to be able to recreate this on any tag search.
Cheers.
After further investigation it seems there's an undocumented "feature" built into the API which never allows you to get more than 4000 photos returned from flickr.photos.search.
So whilst 7444 pages are reported as available, it will only let you load the first 1333.
It is possible to retrieve more than 4000 images from flickr; your query has to be paginated by (for example) temporal range such that the total number of images from that query is not more than 4000. You can also use other parameters such as bounding box to limit the total number of images in the response.
For example, if you are searching with the tag 'dogs', this is what you can do (binary search over the time range):
Specify a minimum date and a maximum date in the request URL, such as Jan 1st, 1990 and Jan 1st, 2015.
Inspect the total number of images in the response. If it is more than 4000, divide the time range in two and keep working on the first half until a query returns fewer than 4000 images. Once it does, request all the pages for that sub-range, then move on to the next interval and repeat until (a) the required number of images has been collected, or (b) the whole initial time interval has been searched.
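A sketch of that splitting logic, where countPhotos and fetchAllPages are hypothetical wrappers around flickr.photos.search with min/max date parameters:
// Recursively split the time range until each sub-range holds <= 4000 photos,
// then page through that sub-range. Both helpers below are hypothetical
// wrappers around flickr.photos.search.
declare function countPhotos(minDate: Date, maxDate: Date): Promise<number>;
declare function fetchAllPages(minDate: Date, maxDate: Date): Promise<void>;

async function collectRange(minDate: Date, maxDate: Date): Promise<void> {
  const total = await countPhotos(minDate, maxDate);
  if (total <= 4000) {
    await fetchAllPages(minDate, maxDate); // safe to page through everything
    return;
  }
  const mid = new Date(Math.floor((minDate.getTime() + maxDate.getTime()) / 2));
  await collectRange(minDate, mid);  // first half of the interval
  await collectRange(mid, maxDate);  // second half of the interval
}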
