Say I have a rather large associative array of 100k elements like so:
$resources = array(
    'stone' => 'current_stone.gif',
    'stick' => 'uglystick.jpg',
    ...
);
It is stored in a file called resources.php and will never change at runtime.
I would like to have this data in Zend OPcache so that it is shared across all processes (saving memory) and possibly to speed up lookups.
My current assumption is that, in this form, this array will not get stored in OPcache, as it's not defined as a static structure anywhere.
How would I go about making sure this data gets into opcache?
The answer above is incorrect: variables are stored in the opcode cache.
You can test it by creating a large set of data, storing it in a PHP file, and checking the cache statistics with opcache_get_status(). The result includes a "scripts" array that lists all cached files and the memory they use.
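For example, a minimal check of a single cached file could look like this (a sketch; the path is the same example file used in the dump below):

<?php
// Inspect OPcache statistics for one cached file.
$status = opcache_get_status();
if ($status !== false && isset($status['scripts']['/test/noto.php'])) {
    $info = $status['scripts']['/test/noto.php'];
    printf("hits=%d, memory=%d bytes\n", $info['hits'], $info['memory_consumption']);
}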
I wrote several cache files that contain nothing but one massive array each. One of them is 24.1 MB.
Here is the output of print_r(opcache_get_status()):
Array
(
[opcache_enabled] => 1
...
[scripts] => Array
(
...
[/test/noto.php] => Array
(
[full_path] => /test/noto.php
[hits] => 3
[memory_consumption] => 24591120
[last_used] => Sat Nov 24 21:09:58 2018
[last_used_timestamp] => 1543086598
[timestamp] => 1543086378
)
...
)
)
So it is definitely stored in memory with all its data.
To make sure of this, I made several files with a combined size of about 300 MB of data. Without OPcache it takes about 1.5 seconds to load them all. After the initial load they get cached, and then it takes 2 ms to load all the data: almost a 1000x difference.
So storing the data in PHP files that are cached by Zend OPcache, and loading them with include (wrapped in try..catch (\Throwable $e) to handle errors), is by far the most efficient way to cache data.
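For illustration, a minimal sketch of that pattern (the file names, the __DIR__ path, and the empty-array fallback are assumptions, not from the original post):

<?php
// resources.php – generated once, never changed at runtime.
// OPcache keeps the compiled array in shared memory after the first include.
return [
    'stone' => 'current_stone.gif',
    'stick' => 'uglystick.jpg',
    // ... ~100k more entries
];

<?php
// consumer.php – load the cached array; subsequent requests are served
// from OPcache instead of re-parsing the large literal.
try {
    $resources = include __DIR__ . '/resources.php';
} catch (\Throwable $e) {
    $resources = [];   // fall back to an empty map if the cache file is broken
}

echo $resources['stone'], "\n";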
No, you can't store variables in OPcache, but statics in classes work:
class Resource {
    static $image = [
        'stone' => 'current_stone.gif',
        'stick' => 'uglystick.jpg',
        ...
    ];
}
...
echo Resource::$image['stone'], "\n";
This saves all of the opcodes initialising the array, but OPcache will still deep-copy the version of Resource::$image in the compiled script in SMA into the corresponding class static property in the process space. So you will still have a copy of the HashTable in each of the active processes using Resource, though the strings themselves will be interned and hence shared across all active PHP requests that use this class.
If you are using a class autoloader, you don't even need to do anything other than refer to Resource::$image..., and the autoloader will do the mapping for you.
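A minimal autoloader sketch of that setup (the classes/ directory layout and file names are assumptions):

<?php
// autoload.php – map class names to files under ./classes (layout assumed).
spl_autoload_register(function (string $class): void {
    $file = __DIR__ . '/classes/' . $class . '.php';   // e.g. classes/Resource.php
    if (is_file($file)) {
        require $file;
    }
});

// The first reference to Resource triggers the autoloader; the compiled class,
// including the static array, is served from OPcache on later requests.
echo Resource::$image['stone'], "\n";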
Related
I have a directory in S3 containing millions of small files. They are small (<10 MB), gzipped files, and I know that's inefficient for Spark. I am running a simple batch job to convert these files to Parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in the whole directory, there isn't as much delay at the beginning while the driver indexes the files (it looks like around 20 minutes before the batch starts). In the UI, for the single directory there is one task after those 20 minutes, which looks like the conversion itself.
However, with individual filenames, the indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files there are two tasks: (1) listing leaf files for the 8 million paths, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory,
so when given a list of paths it has to do a list call on each,
which for S3 means 8M LIST calls against the S3 servers,
which are rate-limited to about 3K/second, ignoring details like thread count on the client, HTTP connections, etc.
With LIST billed at $0.005 per 1,000 calls, 8M requests comes to $40.
And as each LIST returns nothing, the client falls back to a HEAD, which adds another S3 API call, doubling the execution time and adding another $32 to the query cost.
In contrast,
listing a dir with 8M entries kicks off a single LIST request for the first 1K entries,
plus 7,999 follow-ups.
s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread to fetch, one to process. It will cost you about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
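For reference, a minimal PySpark sketch of the directory-based approach (the bucket names, output path, and the s3a:// scheme are placeholders/assumptions, not from the original post):

# Read the whole prefix with one recursive listing, then write Parquet once.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gz-to-parquet").getOrCreate()

df = spark.read.csv("s3a://input_bucket_name/data/")            # single listing of the prefix
df.write.mode("overwrite").parquet("s3a://output_bucket_name/data_parquet/")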
I use node-influx, and influx.query(...) uses too much heap.
In my application I have something like
const data = await influx.query(`SELECT "f_power" as "power", "nature", "scope", "participantId"
    FROM "power"
    WHERE
        "nature" =~ /^Conso$|^Prod$|^Autoconso$/ AND
        "source" =~ /^$|^One$|^Two$/ AND
        "participantId" =~ /${participantIdFilter.join('|')}/ AND
        "time" >= '${req.query.dateFrom}' AND
        "time" <= '${req.query.dateTo}'
    GROUP BY "timestamp"
`);
const wb = dataToWb(data);
XLSX.writeFile(wb, filename);
The data is a result set of about 50 MB (I used this code to measure it),
and the heap used by this method is about 350 MB (measured with process.memoryUsage().heapUsed).
I'm surprised by the difference between these two values...
So is it possible to make this query less resource-intensive?
Actually, I use the data to make an xlsx file, and the generation of this file leads to a Node process out-of-memory error. The call to XLSX.writeFile(wb, filename) uses about 100 MB, which on its own is not enough to fill my 512 MB of RAM. So I figured it is the heap used by the influx query that is never collected by the GC.
I don't understand why the generation causes this error. Why can't V8 free the memory used by the query before running a method that executes later and in another context?
The node-influx (1.x) client reads the whole response, parses it into JSON, and transforms it into the results (data) array. There are a lot more intermediate objects on the heap while the response is being processed. You should also run the Node garbage collector before and after the query to get a better estimate of how much heap it takes. You can currently control the result memory usage only by reducing the result columns or rows (by limit, time, or aggregations). You can also split the work into queries with smaller results to reduce the maximum heap usage caused by temporary objects. Of course, that is paid for in time and in the complexity of your code.
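For instance, a rough way to measure the retained heap could look like this (a sketch; it assumes Node is started with --expose-gc and that the query text is the one shown above):

// run with: node --expose-gc measure.js
async function measureQuery(influx, queryText) {
  global.gc();                                    // collect garbage before the query
  const before = process.memoryUsage().heapUsed;

  const data = await influx.query(queryText);     // the InfluxQL query from above

  global.gc();                                    // drop temporaries created while parsing
  const after = process.memoryUsage().heapUsed;
  console.log('heap retained by result:', after - before, 'bytes');
  return data;
}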
You can expect less memory usage with the 2.x client (https://github.com/influxdata/influxdb-client-js). It does not create intermediate JSON objects, it internally processes results in smaller chunks, and it has an observable-like API that lets the user decide how to represent a result row. It uses Flux as the query language and requires InfluxDB 2.x or InfluxDB 1.8 (with the 2.x API enabled).
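A minimal sketch of the row-by-row API in the 2.x client (the url, token, org, and Flux query here are placeholders):

const { InfluxDB } = require('@influxdata/influxdb-client');

const queryApi = new InfluxDB({ url: 'http://localhost:8086', token: 'my-token' })
  .getQueryApi('my-org');

const fluxQuery = 'from(bucket: "power") |> range(start: -1d)';

queryApi.queryRows(fluxQuery, {
  next(row, tableMeta) {
    const o = tableMeta.toObject(row);   // process one row at a time, no big array in memory
    // ...append to a stream / worksheet here...
  },
  error(e) {
    console.error(e);
  },
  complete() {
    console.log('done');
  },
});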
I have about ~300 MB of data (~180k JSON objects) that gets updated once every 2-3 days.
This data is divided into three "collections" that I must keep up to date.
I decided to go the Node.js way, but any solution in a language I know (Java, Python) is welcome.
Whenever I perform a batched set using the Node.js firebase-admin client, not only does it consume an aberrant amount of RAM (about 4-6 GB!), but it also tends to crash with errors whose cause isn't clear (I got to page 4 of a Google search without a meaningful answer).
My code is frankly simple, this is it:
var collection = db.collection("items");
var batch = db.batch();

array.forEach(item => {
    var ref = collection.doc(item.id);
    batch.set(ref, item);
});

batch.commit().then((res) => {
    console.log("YAY", res);
});
I haven't found anywhere whether there is a limit on the number of writes within a short span of time (I'd expect 50-60k writes to be easy for a backend the size of Firebase), nor why this can run up 4-6 GB of allocated RAM.
I can confirm that when the errors are thrown, or the RAM usage clogs my laptop (whichever happens first), I am still at less than 1-4% of my daily usage quotas, so that is not the issue.
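For what it's worth, a Firestore WriteBatch is documented to accept at most 500 operations, so 180k sets have to be split across many batches in any case. A hedged sketch of that chunking (the chunk handling and the sequential awaiting are choices made here for illustration, not from the original post):

const CHUNK_SIZE = 500;                      // documented per-batch limit

async function writeInChunks(db, array) {
  const collection = db.collection("items");
  for (let i = 0; i < array.length; i += CHUNK_SIZE) {
    const batch = db.batch();
    for (const item of array.slice(i, i + CHUNK_SIZE)) {
      batch.set(collection.doc(item.id), item);
    }
    await batch.commit();                    // commit sequentially to keep memory flat
    console.log(`committed ${Math.min(i + CHUNK_SIZE, array.length)} / ${array.length}`);
  }
}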
I have a query that may return up to 2000 documents.
Within these documents I need six pcdata items returned as string values.
Since the document sizes range from small to very large, there is a possibility of an expanded tree cache error.
I am looking at xdmp:spawn-function to break up my result set.
I will pass wildcard values, based on a known "unique key structure", and will know the maximum number of results possible; each wildcard value will return 100 documents max.
Note: The pcdata for the unique key structure does have a range index on it.
Am I on the right track with the code below?
The task server will create three tasks.
The task server will allow multiple queries to run, but what stops them all running simultaneously and blowing out the expanded tree cache?
i.e. What, if anything, forces one thread to wait for another? Or one task to wait for another, so they do not all blow out the expanded tree cache together?
xquery version "1.0-ml";

let $messages :=
  (: each wildcard value will return 100 documents max :)
  for $message in ("WILDCARDVAL1", "WILDCARDVAL2", "WILDCARDVAL3")
  let $_ := xdmp:log("Starting")
  return
    xdmp:spawn-function(
      function() {
        let $_ := xdmp:sleep(5000)
        let $_ := xdmp:log(concat("Searching on wildcard val=", $message))
        return concat("100 pcdata items from the matched documents for ", $message)
      },
      <options xmlns="xdmp:eval">
        <result>true</result>
        <transaction-mode>update-auto-commit</transaction-mode>
      </options>)
return $messages
The Task Server configuration listed in the Admin UI defines the maximum number of simultaneous threads. If more tasks are spawned than there are threads, they are queued (FIFO I think, although ML9 has task priority options that modify that behavior), and the first queued task takes the next available thread.
The <result>true</result> option will force the spawning query to block until the tasks return. The tasks themselves are run independently and in parallel, and they don't wait on each other to finish. You may still run into problems with the expanded tree cache, but by splitting up the query into smaller ones, it could be less likely.
For a better understanding of why you are blowing out the cache, take a look at the functions xdmp:query-trace() and xdmp:query-meters(). Using the Task Server is more of a brute force solution, and you will probably get better results by optimizing your queries using information from those functions.
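For example, a quick way to look at cache activity while tuning (the search below is only a placeholder query; substitute your real one):

xquery version "1.0-ml";

(: Evaluate a representative query, then inspect the meters for cache hits/misses. :)
let $sample := cts:search(fn:doc(), cts:word-query("WILDCARDVAL1"))[1 to 10]
return (fn:count($sample), xdmp:query-meters())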
If you can't make your query more selective than 2000 documents, but you only need a few string values, consider creating range indexes on those values and using cts:values to select only those values directly from the index, filtered by the query. That method would avoid forcing the database to load documents into the cache.
It might be more efficient to use MarkLogic's capability to return co-occurrences, or even 3+ tuples of value combinations from within documents, using functions like cts:values. You can blend in a [cts:uri-reference](http://docs.marklogic.com/cts:uri-reference) to get the document URI returned as part of the tuples.
It requires having range indexes on all of those values, though.
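A hedged sketch of that lexicon-based approach (the element names are invented placeholders, and it assumes element range indexes plus the URI lexicon are configured):

xquery version "1.0-ml";

(: Pull value tuples straight from the range indexes, constrained by the key query. :)
cts:value-tuples(
  (cts:element-reference(xs:QName("field-one")),
   cts:element-reference(xs:QName("field-two")),
   cts:uri-reference()),
  (),
  cts:element-value-query(xs:QName("unique-key"), "WILDCARDVAL1*", "wildcarded")
)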
HTH!
I don't know what changed--things were working relatively well with our Lucene implementation. But now the number of files in the index directory just keeps growing. It started with _0 files, then _1 files appeared, then _2 and _3 files. I am passing false for the 'create' parameter of the IndexWriter constructor if there are existing files in that directory when it begins:
indexWriter = new IndexWriter(azureDirectory, analyzer, (azureDirectory.ListAll().Length == 0), IndexWriter.MaxFieldLength.UNLIMITED);
if (indexWriter != null)
{
    // Set the number of segments to save in memory before writing to disk.
    indexWriter.MergeFactor = 1000;
    indexWriter.UseCompoundFile = false;
    indexWriter.SetRAMBufferSizeMB(800);
    ...
    indexWriter.Dispose();
    indexWriter = null;
}
Maybe it's related to the UseCompoundFile flag?
Every couple of minutes, I create a new IndexWriter, process 10,000 documents, and then dispose of the IndexWriter. The index works, but the growing number of files is a real problem, because I'm using AzureDirectory, which copies every file out of Azure into a cache directory before starting the Lucene write.
Thanks.
This is the normal behavior. If you want a single index segment you have some options:
Use compound files
Use a low MergeFactor (2 is the minimum) if you use LogMergePolicy, which is the default policy for Lucene 3.0. Note that the MergeFactor property you set on the IndexWriter is just a convenience that forwards to mergePolicy.MergeFactor as long as mergePolicy is an instance of LogMergePolicy.
Run an optimization after each update to your index
Low merge factors and optimizing after each update can have serious drawbacks for the performance of your app, which will depend on the type of indexing you do.
See this link, which documents the effects of MergeFactor in a bit more detail; a small sketch of the options above follows after it:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/index/LogMergePolicy.html#setMergeFactor%28%29
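For illustration, a minimal Lucene.NET 3.0.x sketch of those options (the API names follow that version; treat it as a sketch, not the poster's exact setup):

using (var writer = new IndexWriter(azureDirectory, analyzer,
                                    azureDirectory.ListAll().Length == 0,
                                    IndexWriter.MaxFieldLength.UNLIMITED))
{
    writer.UseCompoundFile = true;   // option 1: pack each segment into a single .cfs file
    writer.MergeFactor = 2;          // option 2: the lowest allowed value, merges aggressively

    // ... AddDocument / UpdateDocument calls for the 10,000 documents ...

    writer.Optimize();               // option 3: merge everything down to one segment
    writer.Commit();
}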