SPARK df.show() function algorithm - apache-spark

Recently I was asked in an interview about the algorithm behind Spark's df.show() function.
How will Spark decide from which executor(s) the records will be fetched?

Without undermining @thebluephantom's and @Hristo Iliev's answers (each gives some insight into what's happening under the hood), I also wanted to add my answer to this list.
I came to the same conclusion(s), albeit by observing the behaviour of the underlying partitions.
Partitions have an index associated with them. This is seen in the code below (taken from the original Spark source code here).
trait Partition extends Serializable {
def index: Int
:
So amongst partitions, there is an order.
And as already mentioned in other answers, df.show() is the same as df.show(20), i.e. it shows the top 20 rows. So the underlying partition indexes determine which partition (and hence executor) those rows come from.
The partition indexes are assigned at the time of read, or (re-assigned) during a shuffle.
Here is some code to see this behaviour:
val df = Seq((5,5), (6,6), (7,7), (8,8), (1,1), (2,2), (3,3), (4,4)).toDF("col1", "col2")
// above sequence is defined out of order - to make behaviour visible
// see partition structure
df.rdd.glom().collect()
/* Array(Array([5,5]), Array([6,6]), Array([7,7]), Array([8,8]), Array([1,1]), Array([2,2]), Array([3,3]), Array([4,4])) */
df.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|5 |5 |
|6 |6 |
|7 |7 |
|8 |8 |
+----+----+
only showing top 4 rows
*/
In the above code, we see 8 partitions (each inner Array is a partition). This is because the session's default parallelism on my machine is 8, which determines how many partitions a locally created dataframe is split into.
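As a quick check of where that number comes from, the snippet below reads the session's default parallelism (a hedged sketch; the actual value depends on your master setting and core count):
// Hedged sketch: a locally created Dataset is split according to the session's
// default parallelism, which happened to be 8 on the machine used here.
spark.sparkContext.defaultParallelism // e.g. 8
df.rdd.getNumPartitions               // 8 in this example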
Now let's repartition the dataframe.
// Now let's repartition df
val df2 = df.repartition(2)
// lets see the partition structure
df2.rdd.glom().collect()
/* Array(Array([5,5], [6,6], [7,7], [8,8], [1,1], [2,2], [3,3], [4,4]), Array()) */
// lets see output
df2.show(4,false)
/*
+----+----+
|col1|col2|
+----+----+
|5 |5 |
|6 |6 |
|7 |7 |
|8 |8 |
+----+----+
only showing top 4 rows
*/
In the above code, the top 4 rows came from the first partition (which actually has all elements of the original data in it). Also note the skew in partition sizes, since no partitioning column was mentioned.
Now let's try creating 3 partitions:
val df3 = df.repartition(3)
// lets see partition structures
df3.rdd.glom().collect()
/*
Array(Array([8,8], [1,1], [2,2]), Array([5,5], [6,6]), Array([7,7], [3,3], [4,4]))
*/
// And lets see the top 4 rows this time
df3.show(4, false)
/*
+----+----+
|col1|col2|
+----+----+
|8 |8 |
|1 |1 |
|2 |2 |
|5 |5 |
+----+----+
only showing top 4 rows
*/
From the above code, we observe that Spark went to the first partition and tried to get 4 rows. Since only 3 were available, it grabbed those, then moved on to the next partition and got one more row. Thus, the order that you see from show(4, false) is due to the underlying data partitioning and the index ordering amongst partitions.
This example uses show(4), but this behaviour can be extended to show() or show(20).

It's simple.
In Spark 2+, show() calls showString() to format the data as a string and then prints it. showString() calls getRows() to get the top rows of the dataset as a collection of strings. getRows() calls take() to take the original rows and transform them into strings. take() simply wraps head(). head() calls limit() to build a limit query and executes it. limit() adds a Limit(n) node on top of the logical plan, which is really a GlobalLimit(n, LocalLimit(n)). Both GlobalLimit and LocalLimit are subclasses of OrderPreservingUnaryNode that override its maxRows (in GlobalLimit) or maxRowsPerPartition (in LocalLimit) methods. The logical plan now looks like:
GlobalLimit n
+- LocalLimit n
+- ...
This goes through analysis and optimisation by Catalyst, where limits are removed if something down the tree produces fewer rows than the limit, and ends up as CollectLimitExec(m) (where m <= n) in the execution strategy, so the physical plan looks like:
CollectLimit m
+- ...
CollectLimitExec executes its child plan, then checks how many partitions the RDD has. If none, it returns an empty dataset. If one, it runs mapPartitionsInternal(_.take(m)) to take the first m elements. If more than one, it applies take(m) on each partition in the RDD using mapPartitionsInternal(_.take(m)), builds a shuffle RDD that collects the results in a single partition, then again applies take(m).
In other words, it depends (because of the optimisation phase), but in the general case it takes the top rows of the concatenation of the top rows of each partition, and so it involves all executors holding a part of the dataset.
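If you want to verify this chain yourself, a minimal sketch is to print the plans of an explicit limit on any DataFrame df (the exact node names vary by Spark version, so treat the expected output as an assumption):
// Hedged sketch: inspect the logical and physical plans behind show(4).
// The logical plan should contain GlobalLimit 4 / LocalLimit 4; the physical plan
// should contain either a CollectLimit node or GlobalLimit/LocalLimit nodes,
// depending on the Spark version.
df.limit(4).explain(true)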
OK, perhaps not so simple.

Not a great question, as it is not something you would rely on in production.
show() is a smart action that looks at what you have in terms of transformations.
show() is in fact show(20). If you just call show, it looks at the first and subsequent partitions to get 20 rows. An order by is also optimised. A count, on the other hand, does need complete processing.
There are many posts about this online, by the way.

Related

Spark: What is the difference between repartition and repartitionByRange?

I went through the documentation here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
It says:
for repartition: the resulting DataFrame is hash partitioned.
for repartitionByRange: the resulting DataFrame is range partitioned.
A previous question also mentions this. However, I still don't understand how exactly they differ and what the impact will be when choosing one over the other.
More importantly, if repartition does hash partitioning, what impact does providing columns as its argument have?
I think it is best to look into the difference with some experiments.
Test Dataframes
For this experiment, I am using the following two Dataframes (I am showing the code in Scala, but the concepts are identical for the Python API):
// Dataframe with one column "value" containing the values ranging from 0 to 1000000
val df = Seq(0 to 1000000: _*).toDF("value")
// Dataframe with one column "value" containing the number 0 roughly a million times (1000001 to be exact) plus the numbers 5000, 10000 and 100000
val df2 = Seq((0 to 1000000).map(_ => 0) :+ 5000 :+ 10000 :+ 100000: _*).toDF("value")
Theory
repartition applies hash partitioning (HashPartitioning) when one or more columns are provided, and round-robin partitioning (RoundRobinPartitioning) when no column is provided. If one or more columns are provided (hash partitioning), those values are hashed and used to determine the partition number by calculating something like partition = hash(columns) % numberOfPartitions. If no column is provided (round-robin partitioning), the data is distributed evenly across the specified number of partitions.
repartitionByRange will partition the data based on a range of the column values. This is usually used for continuous (not discrete) values such as any kind of numbers. Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.
It is also worth mentioning that for both methods, if numPartitions is not given, the Dataframe is by default partitioned into the number of partitions configured by spark.sql.shuffle.partitions in your Spark session, and the result may be further coalesced by Adaptive Query Execution (available since Spark 3.x).
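Both settings mentioned above can be read (and overridden) from the session configuration; a small sketch, assuming a stock Spark session:
import org.apache.spark.sql.functions.col

// Hedged sketch: default values assumed for an unmodified session.
spark.conf.get("spark.sql.shuffle.partitions")                             // "200" unless overridden
spark.conf.get("spark.sql.execution.rangeExchange.sampleSizePerPartition") // "100" unless overridden
// Without an explicit numPartitions, the shuffle falls back to spark.sql.shuffle.partitions
// (and may be coalesced afterwards by AQE in Spark 3.x).
df.repartitionByRange(col("value")).rdd.getNumPartitions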
Test Setup
Based on the given test data, I always apply the same code:
import org.apache.spark.sql.functions._

val testDf = df
  // here I will insert the partition logic
  .withColumn("partition", spark_partition_id()) // SQL built-in function that returns the actual partition id
  .groupBy(col("partition"))
  .agg(
    count(col("value")).as("count"),
    min(col("value")).as("min_value"),
    max(col("value")).as("max_value"))
  .orderBy(col("partition"))

testDf.show(false)
Test Results
df.repartition(4, col("value"))
As expected, we get 4 partitions, and because the values of df range from 0 to 1000000, their hashed values produce a well-distributed Dataframe.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |249911|12 |1000000 |
|1 |250076|6 |999994 |
|2 |250334|2 |999999 |
|3 |249680|0 |999998 |
+---------+------+---------+---------+
df.repartitionByRange(4, col("value"))
Also in this case we get 4 partitions, but this time the min and max values clearly show the range of values within each partition. The data is almost equally distributed, with roughly 250000 values per partition.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |244803|0 |244802 |
|1 |255376|244803 |500178 |
|2 |249777|500179 |749955 |
|3 |250045|749956 |1000000 |
+---------+------+---------+---------+
df2.repartition(4, col("value"))
Now we are using the other Dataframe, df2. Here, the hashing algorithm only ever hashes the values 0, 5000, 10000 or 100000. Since the hash of the value 0 is always the same, all zeros end up in the same partition (in this case partition 3). The other three partitions each contain only one value.
+---------+-------+---------+---------+
|partition|count |min_value|max_value|
+---------+-------+---------+---------+
|0 |1 |100000 |100000 |
|1 |1 |10000 |10000 |
|2 |1 |5000 |5000 |
|3 |1000001|0 |0 |
+---------+-------+---------+---------+
df2.repartition(4)
Without using the content of the column "value", the repartition method distributes the rows on a round-robin basis. All partitions have almost the same amount of data.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |250002|0 |5000 |
|1 |250002|0 |10000 |
|2 |249998|0 |100000 |
|3 |250002|0 |0 |
+---------+------+---------+---------+
df2.repartitionByRange(4, col("value"))
This case shows that the Dataframe df2 is not well suited to repartitioning by range, as almost all values are 0. Therefore, we end up with only two non-empty partitions, where partition 0 contains all the zeros.
+---------+-------+---------+---------+
|partition|count |min_value|max_value|
+---------+-------+---------+---------+
|0 |1000001|0 |0 |
|1 |3 |5000 |100000 |
+---------+-------+---------+---------+
Using df.explain you can get a lot of information about these operations.
I'm using this DataFrame for the example:
df = spark.createDataFrame([(i, f"value {i}") for i in range(1, 22, 1)], ["id", "value"])
Repartition
Depending on whether a key expression (column) is specified or not, the partitioning method differs. It is not always hash partitioning, as you assumed.
df.repartition(3).explain(True)
== Parsed Logical Plan ==
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange RoundRobinPartitioning(3)
+- Scan ExistingRDD[id#0L,value#1]
We can see in the generated physical plan that RoundRobinPartitioning is used:
Represents a partitioning where rows are distributed evenly across
output partitions by starting from a random target partition number
and distributing rows in a round-robin fashion. This partitioning is
used when implementing the DataFrame.repartition() operator.
When using repartition by column expression:
df.repartition(3, "id").explain(True)
== Parsed Logical Plan ==
'RepartitionByExpression ['id], 3
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
RepartitionByExpression [id#0L], 3
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
RepartitionByExpression [id#0L], 3
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange hashpartitioning(id#0L, 3)
+- Scan ExistingRDD[id#0L,value#1]
Now the picked partitioning method is hashpartitioning.
In the hash partitioning method, a hash is calculated for every key expression (the DataFrame API uses Murmur3 rather than Java's Object.hashCode) to determine the destination partition_id via a modulo: hash(key) % numPartitions.
RepartitionByRange
This partitioning method creates numPartitions consecutive and not overlapping ranges of values based on the partitioning key. Thus, at least one key expression is required and needs to be orderable.
df.repartitionByRange(3, "id").explain(True)
== Parsed Logical Plan ==
'RepartitionByExpression ['id ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
RepartitionByExpression [id#0L ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
RepartitionByExpression [id#0L ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange rangepartitioning(id#0L ASC NULLS FIRST, 3)
+- Scan ExistingRDD[id#0L,value#1]
Looking at the generated physical plan, we can see that rangepartitioning differs from the two others described above by the presence of the ordering clause in the partitioning expression. When no explicit sort order is specified in the expression, it uses ascending order by default.
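As a side note, a sort direction can be passed explicitly when descending ranges are needed. A hedged sketch, written in Scala like the earlier experiments on this page and assuming an equivalent df with a numeric id column:
import org.apache.spark.sql.functions.col

// Hedged sketch: pass an explicit sort order instead of a bare column.
// The rangepartitioning expression in the physical plan should then show a DESC
// ordering rather than the default ASC NULLS FIRST.
df.repartitionByRange(3, col("id").desc).explain(true)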
Some interesting links:
Repartition Logical Operators — Repartition and RepartitionByExpression
Range partitioning in Apache SparkSQL
hash vs range partitioning

using DataSet.repartition in Spark 2 - several tasks handle more than one partition

We have a Spark Streaming application (Spark 2.1 running on Hortonworks 2.6) and use DataSet.repartition (on a DataSet<Row> read from Kafka) in order to repartition the DataSet<Row>'s partitions according to a given column (called block_id).
We start with a DataSet<Row> containing 50 partitions and end up (after the call to DataSet.repartition) with a number of partitions equal to the number of unique block_ids.
The problem is that DataSet.repartition does not behave as we expected: when we look at the event timeline of the Spark job that runs the repartition, we see that some tasks handle 1 block_id while fewer tasks handle 2, or even 3 or 4, block_ids.
It seems that DataSet.repartition ensures that all the rows with the same block_id end up inside a single partition, but not that each task that creates a partition handles only one block_id.
The result is that the repartition job (which runs inside the streaming application) takes as long as its longest task (which is the task that handles the most block_ids).
We tried playing with the number of vcores given to the streaming app - from 10 to 25 to 50 (we have 50 partitions in the original RDD read from Kafka) - but the result was the same: there is always at least one task that handles more than one block_id.
We even tried increasing the batch time; again, that didn't help us achieve the goal of one task handling one block_id.
To give an example, here are the event timeline and the tasks table describing a run of the repartition Spark job:
Event timeline - the two tasks in red are the ones that handle two block_ids.
Tasks table - the two tasks in red are the same two from above; notice that the duration of each of them is twice the duration of all other tasks (which handle only one block_id).
This is a problem for us because the streaming application is delayed by these long tasks, and we need a solution that will enable us to repartition a DataSet while having each task handle only one block_id.
And if that's not possible, then maybe it is possible on a JavaRDD, since in our case the DataSet<Row> we repartition is created from a JavaRDD.
There are 2 problems you need to consider:
Having a custom partitioner that ensures a uniform data distribution, 1 block_id per partition
Sizing the cluster so that you have enough executors to run all tasks (block_ids) simultaneously
As you've seen, a simple repartition on the DataFrame doesn't guarantee a uniform distribution. When you repartition by block_id it will use the HashPartitioner, with the formula:
Utils.nonNegativeMod(key.hashCode, numPartitions)
See: https://github.com/apache/spark/blob/branch-2.2/core/src/main/scala/org/apache/spark/Partitioner.scala#L80-L88
It's very possible that two or more keys are assigned to the same partition_id, since the partition_id is the key's hashCode modulo numPartitions.
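For reference, nonNegativeMod is just a modulo that never returns a negative index. A hedged sketch of the idea (not the Spark source verbatim):
// Conceptual sketch: map any hashCode, including negative ones,
// to a partition id in the range [0, numPartitions).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

nonNegativeMod("block_42".hashCode, 8) // always between 0 and 7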
What you need can be achieved by using the RDD with a custom partitioner. The easiest approach is to extract the list of distinct block_ids before repartitioning.
Here's a simple example. Let's say you have 5 blocks (2, 3, 6, 8, 9) and your cluster has 8 executors (it can run up to 8 tasks simultaneously), so we're over-provisioned by 3 executors:
scala> spark.conf.get("spark.sql.shuffle.partitions")
res0: String = 8
scala> spark.conf.get("spark.default.parallelism")
res1: String = 8
// Basic class to store dummy records
scala> case class MyRec(block_id: Int, other: String)
defined class MyRec
// Sample DS
scala> val ds = List((2,"A"), (3,"X"), (3, "B"), (9, "Y"), (6, "C"), (9, "M"), (6, "Q"), (2, "K"), (2, "O"), (6, "W"), (2, "T"), (8, "T")).toDF("block_id", "other").as[MyRec]
ds: org.apache.spark.sql.Dataset[MyRec] = [block_id: int, other: string]
scala> ds.show
+--------+-----+
|block_id|other|
+--------+-----+
| 2| A|
| 3| X|
| 3| B|
| 9| Y|
| 6| C|
| 9| M|
| 6| Q|
| 2| K|
| 2| O|
| 6| W|
| 2| T|
| 8| T|
+--------+-----+
// Default partitioning gets data distributed as uniformly as possible (record count)
scala> ds.rdd.getNumPartitions
res3: Int = 8
// Print records distribution by partition
scala> ds.rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.toList))).toDF("partition_id", "block_ids").show
+------------+--------------+
|partition_id| block_ids|
+------------+--------------+
| 0| [[2,A]]|
| 1|[[3,X], [3,B]]|
| 2| [[9,Y]]|
| 3|[[6,C], [9,M]]|
| 4| [[6,Q]]|
| 5|[[2,K], [2,O]]|
| 6| [[6,W]]|
| 7|[[2,T], [8,T]]|
+------------+--------------+
// repartitioning by block_id leaves 4 partitions empty and assigns 2 block_ids (6,9) to same partition (1)
scala> ds.repartition('block_id).rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.toList))).toDF("partition_id", "block_ids").where(size('block_ids) > 0).show(false)
+------------+-----------------------------------+
|partition_id|block_ids |
+------------+-----------------------------------+
|1 |[[9,Y], [6,C], [9,M], [6,Q], [6,W]]|
|3 |[[3,X], [3,B]] |
|6 |[[2,A], [2,K], [2,O], [2,T]] |
|7 |[[8,T]] |
+------------+-----------------------------------+
// Create a simple mapping for block_id to partition_id to be used by our custom partitioner (logic may be more elaborate or static if the list of block_ids is static):
scala> val mappings = ds.map(_.block_id).dropDuplicates.collect.zipWithIndex.toMap
mappings: scala.collection.immutable.Map[Int,Int] = Map(6 -> 1, 9 -> 0, 2 -> 3, 3 -> 2, 8 -> 4)
//custom partitioner assigns partition_id according to the mapping arg
scala> class CustomPartitioner(mappings: Map[Int,Int]) extends org.apache.spark.Partitioner {
| override def numPartitions: Int = mappings.size
| override def getPartition(rec: Any): Int = { mappings.getOrElse(rec.asInstanceOf[Int], 0) }
| }
defined class CustomPartitioner
// Repartition DS using new partitioner
scala> val newDS = ds.rdd.map(r => (r.block_id, r)).partitionBy(new CustomPartitioner(mappings)).toDS
newDS: org.apache.spark.sql.Dataset[(Int, MyRec)] = [_1: int, _2: struct<block_id: int, other: string>]
// Display evenly distributed block_ids
scala> newDS.rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.toList))).toDF("partition_id", "block_ids").where(size('block_ids) > 0).show(false)
+------------+--------------------------------------------+
|partition_id|block_ids |
+------------+--------------------------------------------+
|0 |[[9,[9,Y]], [9,[9,M]]] |
|1 |[[6,[6,C]], [6,[6,Q]], [6,[6,W]]] |
|2 |[[3,[3,X]], [3,[3,B]]] |
|3 |[[2,[2,A]], [2,[2,K]], [2,[2,O]], [2,[2,T]]]|
|4 |[[8,[8,T]]] |
+------------+--------------------------------------------+

How to add column and records on a dataset given a condition

I'm working on a program that flags data as OutOfRange based on the values present in certain columns.
I have three columns: Age, Height, and Weight. I want to create a fourth column called OutOfRange and assign it a value of 0 (false) or 1 (true) if the values in those three columns exceed a specific threshold.
If Age is lower than 18 or higher than 60, that row will be assigned a value of 1 (0 otherwise). If Height is lower than 5, that row will be assigned a value of 1 (0 otherwise), and so on.
Is it possible to create a column and then add/overwrite values in that column? It would be awesome if I could do that with Spark. I know SQL, so if there is anything I can do with the dataset.SQL() function please let me know.
Given a dataframe as
+---+------+------+
|Age|Height|Weight|
+---+------+------+
|20 |3 |70 |
|17 |6 |80 |
|30 |5 |60 |
|61 |7 |90 |
+---+------+------+
You can apply the when function to implement the logic explained in the question:
import org.apache.spark.sql.functions._
df.withColumn("OutOfRange", when(col("Age") < 18 || col("Age") > 60 || col("Height") < 5, 1).otherwise(0))
which would result in the following dataframe:
+---+------+------+----------+
|Age|Height|Weight|OutOfRange|
+---+------+------+----------+
|20 |3 |70 |1 |
|17 |6 |80 |1 |
|30 |5 |60 |0 |
|61 |7 |90 |1 |
+---+------+------+----------+
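Since the question mentions knowing SQL, the same logic can also be expressed through spark.sql on a temporary view. A hedged sketch (the view name "people" is just an example):
// Hedged sketch: the same OutOfRange logic via Spark SQL on a temp view.
df.createOrReplaceTempView("people")
val result = spark.sql("""
  SELECT Age, Height, Weight,
         CASE WHEN Age < 18 OR Age > 60 OR Height < 5 THEN 1 ELSE 0 END AS OutOfRange
  FROM people
""")
result.show(false)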
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I can do that with Spark. I know SQL so if there is anything I can do with the dataset.SQL() function please let me know.
This is not possible without recreating the Dataset altogether, since Datasets are inherently immutable.
However you can save the Dataset as a Hive table, which will allow you to do what you want to do. Saving the Dataset as a Hive table will write the contents of your Dataset to disk under the default spark-warehouse directory.
df.write.mode("overwrite").saveAsTable("my_table")
// Add a row
spark.sql("insert into my_table (Age, Height, Weight, OutOfRange) values (20, 30, 70, 1)")
// Update a row
spark.sql("update my_table set OutOfRange = 1 where Age > 30")
// ...
Hive support must be enabled for the Spark session at the time of instantiation in order to do this.

Can you nest a Spark Dataframe in another Dataframe?

In Spark I want to be able to parallelise over multiple dataframes.
The method I am trying is to nest dataframes in a parent dataframe, but I am not sure of the syntax or if it is possible.
For example I have the following 2 dataframes:
df1:
+-----------+---------+--------------------+------+
|id |asset_id | date| text|
+-----------+---------+--------------------+------+
|20160629025| A1|2016-06-30 11:41:...|aaa...|
|20160423007| A1|2016-04-23 19:40:...|bbb...|
|20160312012| A2|2016-03-12 19:41:...|ccc...|
|20160617006| A2|2016-06-17 10:36:...|ddd...|
|20160624001| A2|2016-06-24 04:39:...|eee...|
df2:
+--------+--------------------+--------------+
|asset_id| best_date_time| Other_fields|
+--------+--------------------+--------------+
| A1|2016-09-28 11:33:...| abc|
| A1|2016-06-24 00:00:...| edf|
| A1|2016-08-12 00:00:...| hij|
| A2|2016-07-01 00:00:...| klm|
| A2|2016-07-10 00:00:...| nop|
So I want to combine these to produce something like this:
+--------+--------------------+-------------------+
|asset_id| df1| df2|
+--------+--------------------+-------------------+
| A1| [df1 - rows for A1]|[df2 - rows for A1]|
| A2| [df1 - rows for A2]|[df2 - rows for A2]|
Note, I don't want to join or union them as that would be very sparse (I actually have about 30 dataframes and thousands of assets each with thousands of rows).
I then plan to do a groupByKey on this so that I get something like this that I can call a function on:
[('A1', <pyspark.resultiterable.ResultIterable object at 0x2534310>), ('A2', <pyspark.resultiterable.ResultIterable object at 0x25d2310>)]
I'm new to Spark, so any help is greatly appreciated.
TL;DR It is not possible to nest DataFrames but you can use complex types.
In this case you could for example (Spark 2.0 or later):
from pyspark.sql.functions import collect_list, struct

df1_grouped = (df1
    .groupBy("asset_id")
    .agg(collect_list(struct("id", "date", "text"))))

df2_grouped = (df2
    .groupBy("asset_id")
    .agg(collect_list(struct("best_date_time", "Other_fields"))))

df1_grouped.join(df2_grouped, ["asset_id"], "fullouter")
but you have to be aware that:
It is quite expensive.
It has limited applications. In general nested structures are cumbersome to use and require complex and expensive (especially in PySpark) UDFs.

HashMap and ConcurrentHashMap internal working

I got somewhat confused about my understanding when the questions below were asked in a quiz:
1) ConcurrentHashMap: As per my understanding, there is no lock needed to get values (corresponding to a key) from this map.
The question is: if this is true, suppose t1 is writing (by taking a lock on a segment/bucket) and t2 tries to read the same key; t2 will not get the correct value, and thus t2 sees an inconsistent value.
2) HashMap: As per my understanding, before an element is added to a hash bucket, the bucket index H is calculated as hashcode % 16 (which gives values from 0 to 15) for the key (key.hashCode()), and the element is then added to the bucket whose index is H.
Note: there are 16 buckets in the default implementation, which can be pictured as an array of linked lists:
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |10 | 11 |12 |13 |14 |15 |
array at memory address 2000
You could well say this is a duplicate of Segmentation in ConcurrentHashMap, Regarding internal working of concurrent hashmap, HashMap or ConcurrentHashMap at Java Controllers?, etc. But I need to clear up these doubts; a few links/blogs with a good explanation will work for me. Thanks.
Use this link for studying ConcurrentHashMap and HashMap
For the 1st part: what you are saying is correct. get() doesn't take a lock on the segment if it finds the key in the ConcurrentHashMap. If another thread is modifying the structure of the map (e.g. via put()) and hasn't finished yet, you may get a stale value, but it won't throw a ConcurrentModificationException. The updated value will be reflected if the modifying operation completes before the retrieval operation.
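To make the first point concrete, here is a minimal sketch (written in Scala, like the rest of this page, against java.util.concurrent.ConcurrentHashMap): a reader running concurrently with a writer sees either the old or the new value, never a corrupted entry, and no exception is thrown.
import java.util.concurrent.ConcurrentHashMap

// Hedged sketch: a read racing a write on the same key returns either the old
// or the new value, depending on timing; it never throws ConcurrentModificationException.
val map = new ConcurrentHashMap[String, String]()
map.put("k", "old")

val writer = new Thread(() => { map.put("k", "new"); () })
val reader = new Thread(() => println(map.get("k"))) // prints "old" (stale) or "new"
writer.start(); reader.start()
writer.join(); reader.join()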
I hope this helps.

Resources