I fear downvotes, but anyway: just like an ArrayList has a contiguous memory allocation and a LinkedList has scattered (non-contiguous) allocations, how does a HashMap occupy memory? Does it also take random chunks of memory? Could I get a brief memory diagram of how a map's buckets and the linked lists inside them are laid out?
I hope this is not a silly question. I did not find much information about a Map's memory-allocation layout.
EDIT: This question has nothing to do with debugging/profiling; it is just about how a HashMap fits into memory. I was unclear about that.
It's a combination of both.
There is an underlying, contiguous array that backs HashMap. The elements of this array are actually singly linked lists. Each time you add a key-value pair to the map, the key is hashed and a linked list entry is added to the corresponding slot of the backing array (i.e. the slot corresponding to the key's hash value).
For instance, a map that maps k to v might look like this:
  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |
+-X-+-X-+-↓-+-X-+-X-+-X-+-X-+-X-+
          ↓
          ↓
        +---+
        | k |
        | - |
        | v |
        +---+
There is a long "table" that backs the map, and an entry object that holds the specific k-to-v pairing.
It would probably be best for you to take a look at the HashMap source for yourself.
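To make the layout concrete, here is a toy sketch of the chaining scheme described above (Python, purely illustrative; the real JDK HashMap also resizes its table and, since Java 8, converts long chains into trees):

class Entry:
    """One node of a bucket's singly linked list; lives wherever the allocator puts it."""
    def __init__(self, key, value, next_entry=None):
        self.key = key
        self.value = value
        self.next = next_entry

class ToyHashMap:
    def __init__(self, capacity=8):
        # the contiguous backing array ("table"); each slot heads a chain
        self.table = [None] * capacity

    def _index(self, key):
        # reduce the hash to a slot of the backing array
        return hash(key) % len(self.table)

    def put(self, key, value):
        i = self._index(key)
        node = self.table[i]
        while node is not None:      # walk the chain looking for the key
            if node.key == key:
                node.value = value   # overwrite an existing mapping
                return
            node = node.next
        # not found: prepend a new entry to the chain in slot i
        self.table[i] = Entry(key, value, self.table[i])

    def get(self, key):
        node = self.table[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return None

So the table is one contiguous block, while the Entry objects hanging off it sit in arbitrary heap locations - hence "a combination of both".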
A HashMap is always backed by an array; the key's hash code is reduced to an index into that array (in the JDK, the array element is an Entry heading a chain). The backing table itself therefore occupies contiguous memory, while the entry objects it points to are allocated wherever the JVM places them.
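For reference, that reduction looks roughly like this in JDK 8 (sketched in Python; the table length is always a power of two, and the high bits of the hash are spread into the low bits first):

def slot(h, table_length):
    h &= 0xFFFFFFFF                # model Java's 32-bit int hash
    h ^= h >> 16                   # spread the high bits (as HashMap.hash does)
    return h & (table_length - 1)  # == h % table_length when length is a power of two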
I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id   | name | age |
|------|------|-----|
| 1112 | Bob  | 54  |
| 1123 | Sue  | 23  |
| 1234 | Jim  | 37  |
| 1251 | Mel  | 58  |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do a table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds while the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 were on the first partition and you had explicitly set the limit to 1,
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if the number of values that satisfy the predicate is smaller than the limit, or the values of interest reside in later partitions, Spark will have to scan all the data.
If you want to make it work faster you'll need the data bucketed (on disk) or partitioned (in memory) by the field of interest, or to use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the unfortunately inactive Succinct).
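For the bucketing route, a hedged sketch (table and column names are made up; this requires saveAsTable and hence a metastore, and how much a point lookup benefits depends on the Spark version's bucket-pruning support):

(df.write
    .bucketBy(256, "id")    # hash ids into 256 buckets on disk
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("my_table_bucketed"))

spark.table("my_table_bucketed").filter("id = 1234").show()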
But really, if you need fast lookups, use a proper database - this is what they are designed for.
Similar to Kafka's log compaction, there are quite a few use cases where it is required to keep only the last update for a given key and to use the result, for example, for joining other data.
How can this be achieved in Spark Structured Streaming (preferably using PySpark)?
For example, suppose I have the table
key | time | value
----------------------------
A | 1 | foo
B | 2 | foobar
A | 2 | bar
A | 15 | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B | 2 | foobar
A | 15 | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step. I suppose I would need something like a dropDuplicates() function that keeps the last occurrence rather than the first.
Please note that this question is explicitly about Structured Streaming and about how to solve the problem without constructs that waste the aggregation step (hence, anything with window functions or max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in Structured Streaming.)
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time inside the flatMapGroupsWithState function, and store the last row in the GroupState.
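Note that flatMapGroupsWithState is Scala/Java-only; the closest PySpark equivalent is applyInPandasWithState, which only arrived in Spark 3.4. A rough sketch under that assumption (schemas and names are illustrative, not a drop-in solution):

import pandas as pd
from pyspark.sql.streaming.state import GroupStateTimeout

def keep_latest(key, pdf_iter, state):
    # state holds the (time, value) of the newest row seen so far for this key
    latest_time, latest_value = state.get if state.exists else (-1, None)
    for pdf in pdf_iter:
        newest = pdf.loc[pdf["time"].idxmax()]
        if newest["time"] > latest_time:
            latest_time, latest_value = int(newest["time"]), newest["value"]
    state.update((latest_time, latest_value))
    yield pd.DataFrame({"key": [key[0]], "time": [latest_time], "value": [latest_value]})

latest = (stream
    .groupBy("key")
    .applyInPandasWithState(
        keep_latest,
        outputStructType="key string, time long, value string",
        stateStructType="time long, value string",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout))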
I frequently come across the use case where I have a (time-ordered) Spark dataframe with values from which I would like to know the differences between consecutive rows:
>>> df.show()
+-----+----------+----------+
|index| c1| c2|
+-----+----------+----------+
| 0.0|0.35735932|0.39612636|
| 1.0| 0.7279809|0.54678476|
| 2.0|0.68788993|0.25862947|
| 3.0| 0.645063| 0.7470685|
+-----+----------+----------+
The question on how to do this has been asked before in a narrower context:
pyspark, Compare two rows in dataframe
Date difference between consecutive rows - Pyspark Dataframe
However, I find the answers rather involved:
a separate module "Window" must be imported
for some data types (datetimes) a cast must be done
and then, using "lag", the rows can finally be compared
It strikes me as odd that this cannot be done with a single API call, for example like so:
>>> import pyspark.sql.functions as f
>>> df.select(f.diffs(df.c1)).show()
+----------+
| diffs(c1)|
+----------+
| 0.3706 |
| -0.0400 |
| -0.0428 |
| null |
+----------+
What is the reason for this?
There are a few basic reasons:
In general, the distributed data structures used in Spark are not ordered. In particular, any lineage containing a shuffle phase / exchange can output a structure with non-deterministic order.
As a result, when we talk about a Spark DataFrame we actually mean a relation, not a DataFrame as known from local libraries like Pandas, and without an explicit ordering, comparing consecutive rows is simply not meaningful.
Things are even fuzzier when you realize that the sorting methods used in Spark rely on shuffles and are not stable.
Even if you ignore possible non-determinism, handling partition boundaries is a bit involved and typically breaks lazy execution. In other words, you cannot access an element to the left of the first element of a given partition, or to the right of its last element, without a shuffle, an additional action, or a separate data scan.
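For reference, this is why the existing answers go through Window: an explicit ordering has to be imposed before neighbouring rows can be compared. A sketch reproducing the hypothetical diffs output above:

import pyspark.sql.functions as f
from pyspark.sql import Window

# no partitionBy: everything is pulled into a single partition,
# which is exactly the cost the ordering guarantee demands
w = Window.orderBy("index")

df.select((f.lead("c1").over(w) - f.col("c1")).alias("diffs(c1)")).show()
# the last row has no successor, hence the null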
I have some data in a tab-delimited file on HDFS that looks like this:
label | user_id | feature
------------------------------
pos | 111 | www.abc.com
pos | 111 | www.xyz.com
pos | 111 | Firefox
pos | 222 | www.example.com
pos | 222 | www.xyz.com
pos | 222 | IE
neg | 333 | www.jkl.com
neg | 333 | www.xyz.com
neg | 333 | Chrome
I need to transform it to create a feature vector for each user_id to train a org.apache.spark.ml.classification.NaiveBayes model.
My current approach is essentially the following:
Load the raw data into a DataFrame
Index the features with StringIndexer
Go down to the RDD, group by user_id, and map the feature indices into a sparse Vector.
The kicker is this... the data is already pre-sorted by user_id. What's the best way to take advantage of that? It pains me to think about how much needless work may be occurring.
In case a little code is helpful to understand my current approach, here is the essence of the map:
val featurization = (vals: (String, Iterable[Row])) => {
  // create a Seq of all the feature indices
  // Note: the indexing was done in a previous step not shown
  val seq = vals._2.map(x => (x.getDouble(1).toInt, 1.0D)).toSeq
  // create the sparse vector
  val featureVector = Vectors.sparse(maxIndex, seq)
  // convert the string label into a Double
  val label = if (vals._2.head.getString(2) == "pos") 1.0 else 0.0
  (label, vals._1, featureVector)
}

d.rdd
  .groupBy(_.getString(1))
  .map(featurization)
  .toDF("label", "user_id", "features")
Let's start with your other question:
If my data on disk is guaranteed to be pre-sorted by the key which will be used for a group aggregation or reduce, is there any way for Spark to take advantage of that?
It depends. If the operation you apply can benefit from map-side aggregation, then you can gain quite a lot from presorted data without any further intervention in your code. Data sharing the same key should be located on the same partitions and can be aggregated locally before the shuffle.
Unfortunately it won't help much in this particular scenario. Even if you enable map-side aggregation (groupBy(Key) doesn't use it, so you'd need a custom implementation) or aggregate over feature vectors (you'll find some examples in my answer to How to define a custom aggregation function to sum a column of Vectors?), there is not much to gain. You can save some work here and there, but you still have to transfer all the indices between nodes.
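For instance (an illustrative Python sketch, assuming rows with user_id at index 1 as in the question's snippet), combiner-style operations merge values per key inside each partition before anything crosses the network, which is where presorted, co-located keys pay off:

# groupByKey ships every (key, value) pair across the network;
# reduceByKey first combines locally on each partition (map-side aggregation)
pairs = d.rdd.map(lambda row: (row[1], 1))        # key by user_id
counts = pairs.reduceByKey(lambda a, b: a + b)    # local combine, then shuffle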
If you want to gain more you'll have to do a little more work. I can see two basic ways to leverage the existing order:
Use a custom Hadoop input format to yield complete records (label, id, all features) instead of reading the data line by line. If your data has a fixed number of lines per id, you could even try NLineInputFormat and apply mapPartitions to aggregate records afterwards.
This is definitely the more verbose solution, but it requires no additional shuffling in Spark.
Read the data as usual, but use a custom partitioner for groupBy. As far as I can tell a RangePartitioner should work just fine, but to be sure you can try the following procedure:
use mapPartitionsWithIndex to find the minimum / maximum id per partition.
create a partitioner which keeps minimum <= id < maximum on the current (i-th) partition and pushes the maximum to partition i + 1.
use this partitioner for groupBy(Key).
This is probably the friendlier solution, but it requires at least some shuffling. If the expected number of records to move is low (<< #records-per-partition) you can even handle this without a shuffle, using mapPartitions and broadcast*, although having the data partitioned can be more useful and cheaper to get in practice (a sketch of this boundary-based partitioner follows after the footnote).
* You can use an approach similar to this: https://stackoverflow.com/a/33072089/1560062
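A rough PySpark sketch of that second option (illustrative only; it assumes rows keyed by user_id at index 1, as in the question's code, and partitions already ordered by user_id):

from bisect import bisect_right

# 1. find the maximum user_id on each (already sorted) partition
def max_id(idx, rows):
    last = None
    for row in rows:
        last = row[1]
    if last is not None:
        yield (idx, last)

maxima = [m for _, m in sorted(d.rdd.mapPartitionsWithIndex(max_id).collect())]

# 2. partitioner: ids below a partition's maximum stay where they are,
#    ids equal to it are pushed one partition to the right, so a group
#    can never straddle a partition boundary (capped for the global max)
def boundary_partitioner(user_id):
    return min(bisect_right(maxima, user_id), len(maxima) - 1)

# 3. group with the order-aware partitioner
grouped = d.rdd.groupBy(lambda row: row[1],
                        numPartitions=len(maxima),
                        partitionFunc=boundary_partitioner)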
In a resource allocation problem I have n bucket sizes and m resources. Resources should be allocated to buckets in such a way that utilization is maximized. I need to write the algorithm in Node.js.
Here's the problem: let's say I have 2 buckets of sizes 50 and 60 respectively, and resource sizes 20, 25, 40. The following is a more proper representation, with possible solutions:
Solution 1:
| Bucket Size | Resource(s) allocated | Utilization   |
|-------------|-----------------------|---------------|
| 50          | 20, 25                | 45/50 = 0.9   |
| 60          | 40                    | 40/60 = 0.667 |
Total Utilization in this case is >1.5
Solution 2:
| Bucket Size | Resource(s) allocated | Utilization   |
|-------------|-----------------------|---------------|
| 50          | 25                    | 25/50 = 0.5   |
| 60          | 20, 40                | 60/60 = 1.0   |
Total Utilization in this case is 1.5
Inference:
-- The knapsack approach will return Solution 2, because it optimizes based on the larger bucket size first.
-- The brute-force approach will return both solutions. One concern I have with this approach: given that I have to use Node.js, which is single-threaded, I am a little skeptical about performance when n (buckets) and m (resources) are very large.
Will brute force do just fine, or is there a better way/algorithm to solve this problem? Also, is the concern I cited above valid in any sense?
The knapsack problem (and this is a knapsack problem) is NP-complete, which means you can find an exact solution only by brute force, or with algorithms whose worst-case complexity is the same as brute force but which can do better in the average case...
given that I have to use Node.js, which is single-threaded, I am a little skeptical about performance when n (buckets) and m (resources) are very large
I am not sure you know how this works. Unless you create child threads and manage them (which is not that easy), every standard language runs in one thread, and therefore on one processor. And if you want more processors that badly, you can create child threads even in Node.js.
Also, in complexity questions it does not matter if a solution takes a constant multiple longer to run. In your case, I suppose that multiple would be 4, if you have a quad-core.
There are two good approaches:
1) Backtracking - basically an advanced brute-force mechanism which, in some cases, can return a solution much faster.
2) Dynamic programming - if your items have relatively small values then, while classic brute force is not able to find a solution for 200 items within the expected lifetime of the universe itself, the dynamic approach can give you a solution in (milli)seconds.
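To illustrate the dynamic approach, here is a minimal 0/1 knapsack table for a single bucket (sketched in Python rather than Node.js; since utilization is just the filled size, each resource's value equals its weight):

def best_fill(capacity, sizes):
    # best[c] = largest total size achievable within capacity c
    best = [0] * (capacity + 1)
    for s in sizes:
        # iterate capacities downwards so each resource is used at most once
        for c in range(capacity, s - 1, -1):
            best[c] = max(best[c], best[c - s] + s)
    return best[capacity]

print(best_fill(50, [20, 25, 40]))  # 45, i.e. 20 + 25 (Solution 1's first bucket)
print(best_fill(60, [20, 25, 40]))  # 60, i.e. 20 + 40 (Solution 2's second bucket)

This runs in O(m * capacity) per bucket, which is why it beats brute force whenever the bucket sizes are modest integers.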