val conf = new SparkConf(true)
.setAppName("Streaming Example")
.setMaster("spark://127.0.0.1:7077")
.set("spark.cassandra.connection.host","127.0.0.1")
.set("spark.cleaner.ttl","3600")
.setJars(Array("your-app.jar"))
Let's say I am creating a Spark Streaming application.
What should be the content of the "your-app.jar" file? Do I have to create it manually in my local file system and pass the path, or is it a compiled Scala file built with sbt?
If it's a Scala file, please help me write the code.
Since I am a beginner, I am just trying to run some sample code.
The setJars method of the SparkConf class takes external JARs that need to be distributed to the cluster, e.g. any external drivers like JDBC.
You do not have to pass your own application JAR here, if that's what you are asking.
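For illustration, a rough sketch of what that could look like (the JAR path below is just a placeholder, not a real artifact):
import org.apache.spark.SparkConf

val conf = new SparkConf(true)
  .setAppName("Streaming Example")
  .setMaster("spark://127.0.0.1:7077")
  // external dependencies (e.g. a JDBC driver) that Spark should ship to the executors;
  // the path here is only a placeholder
  .setJars(Array("/path/to/your-jdbc-driver.jar"))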
According to various docs, a custom Aggregator in Spark must be written in Java/Scala.
https://medium.com/swlh/apache-spark-3-0-remarkable-improvements-in-custom-aggregation-41dbaf725903
I have built and compiled a test implementation of a custom aggregator, but would now like to register and invoke it through PySpark and SparkSQL.
I tried spark.udf.registerJavaUDAF ..., but that only seems to work with the older-style UDAF functions, not the new Aggregators.
How can I register a new Aggregator function written in Java through PySpark, if at all possible? (I know how to pass the JAR to spark-submit etc.; the problem is the registration call.)
I'm not sure what the correct approach is, but I was able to get the following to work.
In your Java class that extends Aggregator:
// This is assumed to be part of: com.example.java.udaf
// MyUdaf is the class that extends Aggregator
// Needed imports: org.apache.spark.sql.SparkSession, org.apache.spark.sql.Encoders,
//                 org.apache.spark.sql.functions
// I'm using Encoders.LONG() as an example; change this as needed.
// Change the registered Spark SQL name, `myUdaf`, as needed.
// (If you don't want to hardcode the "myUdaf" string, you can pass it in as a parameter.)
// Expose UDAF registration -- this method is what lets you register the UDAF from Python.
public static void register(SparkSession spark) {
    spark.udf().register("myUdaf", functions.udaf(new MyUdaf(), Encoders.LONG()));
}
Then in Python:
from pyspark.sql import SparkSession

udaf_jar_path = "..."

# Running in standalone mode
spark = SparkSession.builder \
    .appName("udaf_demo") \
    .config("spark.jars", udaf_jar_path) \
    .master("local[*]") \
    .getOrCreate()

# Register using the registration function provided by the Java class
spark.sparkContext._jvm.com.example.java.udaf.MyUdaf.register(spark._jsparkSession)
As a bonus, you can use this same registration function in Java:
// Running in standalone mode
SparkSession spark = SparkSession
.builder()
.master("local[*]")
.appName("udaf_demo")
.getOrCreate();
MyUdaf.register(spark);
Then you should be able to use this directly in Spark SQL:
SELECT
col0
, myUdaf(col1)
FROM some_table
GROUP BY 1
I tested this with a simple summation and it worked reasonably well. For summing 1M numbers, the Python version was ~150ms slower than the Java one (local testing using standalone mode, with both run directly within my IDEs). Compared to the built-in sum it was about half a second slower.
An alternative approach is to use Spark native functions. I haven't directly used this approach; however, I have used the spark-alchemy library which does. See their repo for more details.
I'm new to Spark and my understanding is this:
JARs are like a bundle of Java code files.
Each library that I install that internally uses Spark (or PySpark) has its own jar files that need to be available to both the driver and the executors in order for them to execute the package API calls that the user interacts with. These jar files are like the backend code for those API calls.
Questions:
Why are these jar files needed? Why could it not have sufficed to have all the code in Python? (I guess the answer is that Spark was originally written in Scala, and there it distributes its dependencies as JARs. So, to avoid recreating that mountain of code, the Python libraries just call that Java code from the Python interpreter through some converter that translates between the two. Please tell me if I have understood this right.)
You specify these jar file locations while creating the Spark context via spark.driver.extraClassPath and spark.executor.extraClassPath. These are outdated parameters, though, I guess. What is the current way to specify these jar file locations?
Where do I find these jars for each library that I install, for example synapseml? What is the general idea about where the jar files for a package are located? Why don't the libraries make it clear where their specific jar files are going to be?
I understand I might not be making sense here, and what I have mentioned above is partly just my hunch about how it must be happening.
So, can you please help me understand this whole business with jars, and how to find and specify them?
Each library that I install that internally uses Spark (or PySpark) has its own jar files
Can you tell us which library you are trying to install?
Yes, external libraries can have jars even if you are writing code in Python.
Why?
These libraries must be using some UDFs (User Defined Functions). Spark runs the code in a Java runtime. If these UDFs were written in Python, there would be a lot of serialization and deserialization overhead due to converting the data into something readable by Python.
Java and Scala UDFs are usually faster; that's why some libraries ship with jars.
Why could it not have sufficed to have all the code in python?
Same reason: Scala/Java UDFs are faster than Python UDFs.
What is the current way to specify these jar file locations?
You can use the spark.jars.packages property. It will copy the jars to both the driver and the executors.
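For example, the same property can be set from Scala when building the session (a minimal sketch; the coordinates and repository are the synapseml ones used in the PySpark snippet below):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")
  // Maven coordinates; Spark resolves the artifact and ships it to driver and executors
  .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4")
  // extra resolver, as in the PySpark snippet below
  .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
  .getOrCreate()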
Where do I find these jars for each library that I install, for example synapseml? What is the general idea about where the jar files for a package are located?
https://github.com/microsoft/SynapseML#python
They have mentioned there which jars are required, i.e. com.microsoft.azure:synapseml_2.12:0.9.4.
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.getOrCreate()
import synapse.ml
Can you try the above snippet?
I have written a simple program to join the orders and order_items files which are in HDFS.
My Code to read the data:
val orders = sc.textFile ("hdfs://quickstart.cloudera:8022/user/root/retail_db/orders/part-00000")
val orderItems = sc.textFile ("hdfs://quickstart.cloudera:8022/user/root/retail_db/order_items/part-00000")
I got the below exception:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://quickstart.cloudera:8020/user/root/retail_db, expected: file:///
Can you please let me know the issue here? Thanks!!
You are currently using the Cloudera Quickstart VM, which most likely means you are running Spark 1.6, as those are the parcels that can be installed directly from Cloudera Manager and that is the default version for CDH 5.x.
If that is the case, Spark on YARN points to HDFS by default, so you don't need to specify hdfs://.
Simply do this:
val orderItems = sc.textFile ("/user/cloudera/retail_db/order_items/part-00000")
Note that I also changed the path to /user/cloudera. Make sure your current user has permissions.
The hdfs:// prefix is only needed if you are using Spark standalone.
I'm working on a scenario where I want to broadcast the Spark context and get it on the other side. Is it possible in any other way? If not, can someone explain why?
Any help is highly appreciated.
final JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.milliseconds(2000));
final JavaSparkContext context = jsc.sc();
final Broadcast<JavaSparkContext> broadcastedFieldNames = context.broadcast(context);
Here's what I'm trying to achieve:
1. We have an XML event that is coming from Kafka.
2. In the XML event we have one HDFS file path (hdfs:localhost//test1.txt).
3. We are using the Spark streaming context to create a DStream and fetch the XML. Using a map function we are reading the file path in each XML.
4. Now we need to read the file from HDFS (hdfs:localhost//test1.txt).
To read it I need sc.textFile, so I'm trying to broadcast the Spark context to the executors for a parallel read of the input file.
Currently we are using an HDFS file read, but that will not read in parallel, right?
You can't delete rows using Apache Spark, but if you use Spark as an OLAP engine to run SQL queries, you can also check out Apache CarbonData (incubating): it provides support for updating and deleting records and is built on top of Spark.
Is it possible to create an RDD using data from the master or a worker? I know that there is an option, sc.textFile(), which sources the data from the local system (driver); similarly, can we use something like "master:file://input.txt"? I am accessing a remote cluster, my input data size is large, and I cannot log in to the remote cluster.
I am not looking for S3 or HDFS. Please suggest if there is any other option.
Data in an RDD is always controlled by the Workers, whether it is in memory or located in a data-source. To retrieve the data from the Workers into the Driver you can call collect() on your RDD.
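For illustration, a tiny sketch of that (assuming an existing SparkContext named sc, as in the other snippets here):
val rdd = sc.parallelize(1 to 100)        // the data lives on the workers
val localData: Array[Int] = rdd.collect() // collect() pulls every element back to the driver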
You should put your file on HDFS or a filesystem that is available to all nodes.
The best way to do this is, as you stated, to use sc.textFile. To do that you need to make the file available on all nodes in the cluster. Spark provides an easy way to do this via the --files option for spark-submit. Simply pass the option followed by the path to the file that you need copied.
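Files distributed this way can then be resolved on each node with SparkFiles.get; a rough sketch, assuming the file was passed as --files input.txt (a hypothetical name):
import org.apache.spark.SparkFiles
import scala.io.Source

// Inside a task, look up the node-local copy of the distributed file and read it
val lineCounts = sc.parallelize(1 to 4).map { _ =>
  Source.fromFile(SparkFiles.get("input.txt")).getLines().size
}.collect()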
You can access the file on Hadoop by creating a Hadoop configuration.
import org.apache.spark.deploy.SparkHadoopUtil

// Hadoop configuration as seen by Spark
val hadoopConfig = SparkHadoopUtil.get.conf
// FileSystem instance for the scheme of `fileName` (e.g. hdfs:// or file://)
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(fileName), hadoopConfig)
val fsPath = new org.apache.hadoop.fs.Path(fileName)
Once you get the path you can copy, delete, move or perform any operations.
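For instance, a short sketch of such operations, using the fs and fsPath values from above (the local destination path is just an example):
if (fs.exists(fsPath)) {
  // copy the file out of HDFS to a local path
  fs.copyToLocalFile(fsPath, new org.apache.hadoop.fs.Path("/tmp/local-copy"))
  // or delete it; the boolean enables recursive deletion for directories
  // fs.delete(fsPath, true)
}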