I am new to Scala but I know Java. As far as I understand, == in Scala acts like .equals in Java, i.e. it compares values, while eq in Scala acts like == in Java, i.e. it compares references (memory addresses) rather than values.
However, after running the code below:
val greet_one_v1 = "Hello"
val greet_two_v1 = "Hello"
println(
  (greet_one_v1 == greet_two_v1),
  (greet_one_v1 eq greet_two_v1)
)
val greet_one_v2 = new String("Hello")
val greet_two_v2 = new String("Hello")
println(
  (greet_one_v2 == greet_two_v2),
  (greet_one_v2 eq greet_two_v2)
)
I get the following output:
(true,true)
(true,false)
My theory is that the initialisation of these strings differs. Hence, how is val greet_one_v1 = "Hello" different from val greet_one_v2 = new String("Hello")? Or, if my theory is incorrect, why do I have different outputs?
As correctly answered by Luis Miguel Mejía Suárez, the answer lies in string interning, which the JVM (Java Virtual Machine) performs automatically for string literals. To get a genuinely new String object it has to be constructed explicitly with new, as in my example above; otherwise the JVM reuses the same pooled instance for equal literal values as an optimisation.
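For illustration, a minimal sketch of how interning interacts with eq and == (assuming a plain Scala REPL on the JVM; intern() is the standard java.lang.String method that returns the pooled instance):
val a = "Hello"             // literal, taken from the JVM string pool
val b = "Hello"             // the same pooled instance as a
val c = new String("Hello") // explicitly allocated, bypasses the pool
println(a eq b)             // true  - same reference
println(a eq c)             // false - different objects
println(a == c)             // true  - == delegates to equals and compares the value
println(a eq c.intern())    // true  - intern() hands back the pooled instance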
I am getting the scan result in a string like this:
DriverId=60cb1daa20056c0c92ebe457,Amount=10.0
I want to retrieve the driver id and the amount from this string.
How can I retrieve them?
Please help...
It depends on your overall format. Basic operations like substring extraction, as suggested by @iLoveYou3000, work fine if you really do have this fixed format.
If the keys are dynamic, or could be changed in the future, you could also use more general approaches, for instance using split():
val attributeStrings = input.split(",")
val attributesMap = attributeStrings.map { it.split("=") }.associate { it[0] to it[1] }
val driverId = attributesMap["DriverId"]
val amount = attributesMap["Amount"]?.toDouble() // or ?.toBigDecimal(); the lookup is nullable, hence the safe call
This is one of the possible ways that I could think of.
val driverID = str.substringAfter("DriverId=", "").substringBefore(",", "")
val amount = str.substringAfter("Amount=", "")
I want to convert the following Scala code into Python 3:
class xyz {
  def abc(): Unit = {
    val clazz: Class[_] = this.getClass()
    var fields: List[String] = getFields(clazz)
    val method = clazz.getDeclaredMethods()
    val methodname = method.getName()
    val supper = clazz.getSuperclass()
    println(clazz)
    println(fields)
    println(method)
  }
}
Class[_] equivalent in Python
Class[_] is a static type. Python doesn't have static types, so there is no equivalent to the static type Class[_] in Python.
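For context, a small Scala-side illustration of what the static type Class[_] expresses: a java.lang.Class value whose type parameter is left unknown (an existential/wildcard type). The values below are hypothetical, not taken from the question:
val c1: Class[_] = "hello".getClass // a Class[String] used as a Class[_]
val c2: Class[_] = classOf[Int]     // a Class[Int] used as a Class[_]
println(c1.getName)                 // java.lang.String
println(c2.getName)                 // int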
I want to convert the following Scala code into Python 3:
class xyz {
  def abc(): Unit = {
    val clazz: Class[_] = this.getClass()
    var fields: List[String] = getFields(clazz)
    val method = clazz.getDeclaredMethods()
    val methodname = method.getName()
    val supper = clazz.getSuperclass()
  }
  def mno(): Unit = {
    println("hello")
  }
}
abc is simply a NO-OP (*). mno just prints to stdout. So, the equivalent in Python is:
class xyz:
    def abc(self):
        pass

    def mno(self):
        print("hello")
Note that I made abc and mno instance methods, even though it makes no sense. (But that's the same for the Scala version.)
(*) Someone who knows more about the corner cases and side effects of Java reflection can correct me here. Maybe this triggers some kind of classloader refresh or something like that?
You can't get a one-to-one correspondence, simply because Python classes are organized very differently from JVM classes.
The equivalent of getClass() is type;
there is no equivalent to Class#getFields because fields aren't necessarily defined on a class in Python, but see How to list all fields of a class (and no methods)?.
Similarly for getSuperclass(): Python classes can have more than one superclass, so __bases__ returns a tuple of base classes instead of just one.
When I call the map function of an RDD it is not being applied. It works as expected for a scala.collection.immutable.List but not for an RDD. Here is some code to illustrate:
val list = List("a", "d", "c", "d")
list.map(l => {
  println("mapping list")
})
val tm = sc.parallelize(list)
tm.map(m => {
  println("mapping RDD")
})
The result of the above code is:
mapping list
mapping list
mapping list
mapping list
But notice "mapping RDD" is not printed to screen. Why is this occurring ?
This is part of a larger issue where I am trying to populate a HashMap from an RDD :
def getTestMap(dist: RDD[(String)]) = {
  var testMap = new java.util.HashMap[String, String]()
  dist.map(m => {
    println("populating map")
    testMap.put(m, m)
  })
  testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null.
Is this due to lazy evaluation?
Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it; rather, it happens on the worker executing that closure. Try looking into the logs of the workers.
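As a minimal sketch (reusing sc and the list from the question, purely for illustration), an action such as count() or collect() is what actually forces the map to run:
val tm = sc.parallelize(List("a", "d", "c", "d"))
val mapped = tm.map { m =>
  println("mapping RDD: " + m) // printed on the executors, so look in their logs
  m
}
mapped.count()                    // an action: only now does the map closure execute
mapped.collect().foreach(println) // collect() brings the results back to the driver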
A similar thing is happening with the hashMap population in the second part of the question. Even when the map does run, the same piece of code is executed on each partition, on separate workers, against a serialized (and 'cleaned') copy of the closure, so any entries put into that copy of testMap never reach the instance on the driver. Either way, the map returned by getTestMap stays empty on the driver, which is why testM.get("a") returns null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data should fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m , m)).toMap.asJava
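As a quick usage check (reusing tm from the question; the output assumes the sample list above), the collected map now behaves as expected on the driver:
val testM = getTestMap(tm)
println(testM.get("a")) // prints "a" rather than null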
A sample skeleton of the code is roughly as follows, where I am basically reading an RDD from BigQuery and selecting all data points where the my_field_name value is null:
JavaPairRDD<String, GenericData.Record> input = sc
    .newAPIHadoopRDD(hadoopConfig, AvroBigQueryInputFormat.class, LongWritable.class, GenericData.Record.class)
    .mapToPair(tuple -> {
        GenericData.Record record = tuple._2;
        // Problematic!! I want to read my_field_name of this BQ row, but the value returned makes no sense
        Object rawValue = record.get(my_field_name);
        String partitionValue = rawValue == null ? "EMPTY" : rawValue.toString();
        return new Tuple2<String, GenericData.Record>(partitionValue, record);
    }).cache();

JavaPairRDD<String, GenericData.Record> emptyData =
    input.filter(tuple -> StringUtils.equals("EMPTY", tuple._1));

emptyData.values().saveAsTextFile(my_file_path);
However, the output RDD seems totally unexpected. In particular, the value of my_field_name looks completely random. After a little debugging, it seems the filtering does what is expected, but the problem is that the value I extract from GenericData.Record (basically record.get(my_field_name)) seems random.
After I switched from AvroBigQueryInputFormat to GsonBigQueryInputFormat to read BQ rows as JSON instead, this code seems to work correctly.
However, ideally I really want to use Avro (which should be much faster than handling JSON), but its current behavior in my code is quite confusing. Am I just using AvroBigQueryInputFormat wrong?
I have a Java String array which contains 45 strings, which are basically column names:
String[] fieldNames = {"colname1","colname2",...};
Currently I am storing the above array of Strings in the Spark driver in a static field. My job is running slowly, so I am trying to refactor the code. I am using the above String array while creating a DataFrame:
DataFrame dfWithColNames = sourceFrame.toDF(fieldNames);
I want to do the above using a broadcast variable so that Spark doesn't ship the huge String array to every executor. I believe we can do something like the following to create the broadcast:
String[] brArray = sc.broadcast(fieldNames, String[].class); // gives a compilation error
DataFrame df = sourceFrame.toDF(???); // how do I use the above broadcast? Can I pass brArray as-is?
I am new to Spark.
This is a bit of an old question; however, I hope my solution helps somebody.
In order to broadcast any object (it could be a single POJO or a collection) with Spark 2+, you first need a method like the following that creates a ClassTag for you:
private static <T> ClassTag<T> classTag(Class<T> clazz) {
    return scala.reflect.ClassManifestFactory.fromClass(clazz);
}
Next, you use the SparkContext from your SparkSession to broadcast your object as before:
sparkSession.sparkContext().broadcast(
    yourObject,
    classTag(YourObject.class)
);
In the case of a collection, say a java.util.List, you use the following:
sparkSession.sparkContext().broadcast(
    yourObject,
    classTag(List.class)
);
The return value of sc.broadcast is of type Broadcast<String[]> and not String[]. When you want to access the value, you simply call value() on the variable. For your example it would look like:
Broadcast<String[]> broadcastedFieldNames = sc.broadcast(fieldNames);
DataFrame df = sourceFrame.toDF(broadcastedFieldNames.value());
Note that if you are writing this in Java, you probably want to wrap the SparkContext in a JavaSparkContext. That makes everything easier, and you can then avoid having to pass a ClassTag to the broadcast function.
You can read more about broadcast variables at http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
ArrayList<String> dataToBroadcast = new ArrayList<>();
dataToBroadcast.add("string1");
...
dataToBroadcast.add("stringn");

// Creating the broadcast variable.
// No need to write the classTag code by hand; use akka.japi.Util, which is available.
Broadcast<ArrayList<String>> strngBrdCast = spark.sparkContext().broadcast(
    dataToBroadcast,
    akka.japi.Util.classTag(ArrayList.class));

// Here is the catch: when you are iterating over a Dataset,
// Spark will actually run it in distributed mode. So if you try to access
// your object directly (e.g. dataToBroadcast), it would be null,
// because you didn't ask Spark to explicitly send that outside variable to each
// machine where this code runs in parallel.
// So you need to use a Broadcast variable. (This is the most common use of Broadcast.)
someSparkDataSetWhere.foreach((row) -> {
    ArrayList<String> stringlist = strngBrdCast.value();
    ...
    ...
});