Dask worker resources attribute values

Are resource values limited to numbers? In all the examples I have seen, the value is always a number, and per the docs it is converted to a float. So we cannot have something like resources={'GRAPHIC_CARD': 'NVIDIA Quadro M5000M'}. Is that true?

Yes, I believe resources have numerical values. You could do something like:
{'NVIDIA Quadro M5000M': 1, 'Tesla V100': 2}
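If the goal is to route tasks to machines with a specific GPU model, the usual pattern is to encode the card as a named, countable resource on the worker and request it at submit time. A minimal sketch, assuming a distributed scheduler is already running (the scheduler address and the resource/function names are placeholders):
# Shell: start a worker that advertises one unit of a custom resource
#   dask-worker scheduler-host:8786 --resources "QUADRO_M5000M=1"

from dask.distributed import Client

client = Client('scheduler-host:8786')  # placeholder address

def render(frame):
    # placeholder GPU-bound task
    return frame

# Run only on workers advertising at least 1 unit of QUADRO_M5000M
future = client.submit(render, 42, resources={'QUADRO_M5000M': 1})
print(future.result())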

Related

How to divide two large numpy matrices element-wise without getting killed?

I have two matrices A and B, both of shape 150000 x 150000.
I want to divide each element of A by the corresponding element of B, element-wise.
The way I currently do it is:
res = A/B
I do get the output for small matrices, but for large matrices like mine the process gets killed. Any suggestions on how to do this efficiently?
Sample data:
A =
[
[2.2,3.3,4.4]
[2.2,3.3,4.4]
[2.2,3.3,4.4]]
B=
[
[1,3.3,4.4]
[2.2,1,4.4]
[2.2,3.3,1]]
res =
[
[2.2,1,1]
[1,3.3,1]
[1,1,4.4]]
This is a 3 x 3 example; I'm working with a 150000 x 150000 matrix.
You could try using pandas and setting the values to a smaller dtype memory-wise, or at least check how much memory each value takes; Python typically uses float64, which in some cases is far more than you need. Use
pd.to_numeric(s, errors='coerce')
or
pd.to_numeric(column, downcast='integer')
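To see why the dtype matters here, a quick back-of-the-envelope check (not from the original answer; numbers are for the stated 150000 x 150000 shape):
import numpy as np

n = 150_000
print(n * n * np.dtype(np.float64).itemsize / 1e9)  # ~180 GB per float64 matrix
print(n * n * np.dtype(np.float32).itemsize / 1e9)  # ~90 GB per float32 matrix
# Even downcast, A, B and res together exceed typical RAM, so chunking or
# memory mapping (see the next answer) is usually still needed.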
If the problem is a limitation on memory allocation (i.e. you have enough RAM for A and B, but not enough for A, B and res all together) then in-place division /= will do the job in the memory space already allocated for A without having to allocate memory for a new array res. Of course, you'll overwrite the original content of A in the process:
A /= B
But if you end up needing to use arrays that are too large to fit in RAM, then you should explore numpy.memmap, which is designed for this purpose. See for example "Working with big data in python and numpy, not enough ram, how to save partial results on disc?"
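A minimal sketch of the memmap approach, assuming A and B have already been stored on disk as raw float32 arrays (file names and the block size are placeholders):
import numpy as np

n = 150_000
A = np.memmap('A.dat', dtype=np.float32, mode='r', shape=(n, n))
B = np.memmap('B.dat', dtype=np.float32, mode='r', shape=(n, n))
res = np.memmap('res.dat', dtype=np.float32, mode='w+', shape=(n, n))

block = 1_000  # rows processed per step; tune to the RAM you have
for start in range(0, n, block):
    stop = min(start + block, n)
    res[start:stop] = A[start:stop] / B[start:stop]  # only this slice is in RAM

res.flush()  # make sure the result is written back to disk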

Is it worth converting 64-bit integers to 32-bit (or 16-bit) ints in a Spark dataframe?

I have a dataframe that contains ~4bn records. Many of the columns are 64-bit ints, but they could be truncated into 32-bit or 16-bit ints without data loss. When I try converting the data types using the following function:
from pyspark.sql.types import ShortType

def switchType(df, colName):
    df = df.withColumn(colName + "SmallInt", df[colName].cast(ShortType()))
    df = df.drop(colName)
    return df.withColumnRenamed(colName + 'SmallInt', colName)

positionsDf = switchType(positionsDf, "FundId")
# repeat for 4 more cols...
print(positionsDf.cache().count())
This shows as taking 54.7 MB in RAM. When I don't do this, it shows as 56.7 MB in RAM.
So, is it worth trying to truncate ints at all?
I am using Spark 2.0.1 in standalone mode.
If you plan to write it out in a format that stores numbers in binary (Parquet, Avro), it may save some space. For calculations there will probably be no difference in speed.
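For example, a quick way to check the on-disk effect is to write both versions out as Parquet and compare directory sizes (a sketch; the path is a placeholder):
# dataframe with the downcast columns
positionsDf.write.mode("overwrite").parquet("/tmp/positions_short")
# write the original 64-bit dataframe the same way to another path,
# then compare sizes, e.g. with `du -sh` or `hdfs dfs -du -h`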
OK, for the benefit of anyone else who stumbles across this: if I understand it correctly, it depends on your JVM implementation (so it is machine/OS specific), but in my case it makes little difference. I'm running Java 1.8.0_102 on RHEL 7 64-bit.
I tried it with a larger dataframe (3tn+ records). The dataframe contains 7 columns of type short/long, and 2 as doubles:
As longs - 59.6Gb
As shorts - 57.1Gb
The tasks I used to create this cached dataframe also showed no real difference in execution time.
What is nice to note is that the storage size does seem to scale linearly with the number of records. So that is good.

Cassandra metrics: difference between latency and total latency

I'm using Cassandra 2.2 and sending the Cassandra metrics to Graphite using pluggable metrics.
I've looked in org.apache.cassandra.metrics.ColumnFamily and saw there is a "count" attribute in both ReadLatency and ReadTotalLatency.
What is the difference between the two count attributes?
My main goal is to get the latency per read/write; how do you advise me to get it?
Thanks!
org.apache.cassandra.metrics.ColumnFamily.ReadTotalLatency is a Counter which gives the sum of all read latencies.
org.apache.cassandra.metrics.ColumnFamily.ReadLatency is a Timer which gives insight into how long the reads are taking; it reports attributes like min, max, mean, 75percentile, 90percentile, 99percentile.
For your purpose you should use ReadLatency and WriteLatency.
Difference between the two "count" attributes
org.apache.cassandra.metrics.ColumnFamily.ReadTotalLatency is a Counter.
Its "count" attribute provides the sum of all read latencies.
org.apache.cassandra.metrics.ColumnFamily.ReadLatency is a Timer.
Its "count" attribute provides the count of Timer#update calls.
To get the recent latency per read/write
Use attributes like "min", "max", "mean", "75percentile", "90percentile", "99percentile".
Cassandra 2.2.7 uses DecayingEstimatedHistogramReservoir for Timer's reservoir, which makes recent values more significant.
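As a concrete illustration of how the two attributes combine (a sketch, not from the answer; Cassandra reports these latencies in microseconds, but verify the unit for your version):
def mean_read_latency_us(total_t1, count_t1, total_t2, count_t2):
    """Mean latency per read between two Graphite samples, in microseconds.

    total_* are ReadTotalLatency.count values, count_* are ReadLatency.count values.
    """
    reads = count_t2 - count_t1
    if reads <= 0:
        return None  # no reads in the interval
    return (total_t2 - total_t1) / reads

# illustrative numbers only:
print(mean_read_latency_us(1_000_000, 2_000, 1_090_000, 2_300))  # -> 300.0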

Traversing the optimum path between nodes

In a graph where there are multiple paths from point (:A) to (:B) through node (:C), I'd like to extract the paths from (:A) to (:B) that go through nodes of type (c:C) where c.Value is maximum. For instance, connect all movies with only their oldest common actors.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always his correct name.
Conversely, I noticed that the following query returns both the correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, a.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without sorting, which may be costly?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently using Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.
For those who want to do similar things, consider reading this post from #_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies rather than just the first, as with head(collect()).

Spark: getting cumulative frequency from frequency values

My question is rather simple to answer in a single-node environment, but I don't know how to do the same thing in a distributed Spark environment. What I have now is a "frequency plot", in which for each item I have the number of times it occurs. For instance, it may be something like this: (1, 2), (2, 3), (3, 1), which means that 1 occurred 2 times, 2 occurred 3 times, and so on.
What I would like to get is the cumulative frequency for each item, so the result I need from the example data above is: (1, 2), (2, 3+2=5), (3, 1+3+2=6).
So far, I have tried to do this using mapPartitions, which gives the correct result if there is only one partition... otherwise obviously not.
How can I do that?
Thanks.
Marco
I don't think what you want is possible as a distributed transformation in Spark unless your data is small enough to be aggregated into a single partition. Spark functions work by distributing jobs to remote processes, and the only way to communicate back is using an action which returns some value, or using an accumulator. Unfortunately, accumulators can't be read by the distributed jobs, they're write-only.
If your data is small enough to fit in memory on a single partition/process, you can coalesce(1), and then your existing code will work. If not, but a single partition will fit in memory, then you might use a local iterator:
var total = 0L
rdd.sortBy(_._1).toLocalIterator.foreach(tuple => {
  total = total + tuple._2
  println((tuple._1, total)) // or write to local file
})
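For PySpark users, a rough equivalent of the snippet above (a sketch, assuming an RDD of (item, count) pairs; the running total lives on the driver):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([(1, 2), (2, 3), (3, 1)])

total = 0
for item, count in rdd.sortBy(lambda kv: kv[0]).toLocalIterator():
    total += count
    print((item, total))  # or write to a local file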
If I understood your question correctly, it really looks like a fit for one of the combiner functions – take a look at different versions of aggregateByKey or reduceByKey functions, both located here.

Resources