I'm looking to use Spark for some ETL, which will mostly consist of "update" statements (a column is a set that will be appended to, so a simple insert likely won't work). As such, issuing CQL queries to import the data seems like the best option. Using the Spark Cassandra Connector, I see I can do this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md#connecting-manually-to-cassandra
Now I don't want to open a session and close it for every row in the source (am I right in not wanting this? Usually, I have one session for the entire process, and keep using that in "normal" apps). However, it says that the connector is serializable, but the session is obviously not. So, wrapping the whole import inside a single "withSessionDo" seems like it'll cause problems. I was thinking of using something like this:
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

class CassandraStorage(conf: SparkConf) {
  val session = CassandraConnector(conf).openSession()

  def store(t: Thingy): Unit = {
    // session.execute cql goes here
  }
}
Is this a good approach? Do I need to worry about closing the session? Where / how best would I do that? Any pointers are appreciated.
You actually do want to use withSessionDo, because it won't open and close a session on every access. Under the hood, withSessionDo uses a JVM-level session. This means you will only have one session object per cluster configuration per node.
This means code like
val connector = CassandraConnector(sc.getConf)
sc.parallelize(1 to 10000000).map(row => connector.withSessionDo(session => stuff))
will only ever create one Cluster and one Session object on each executor JVM, regardless of how many cores each machine has.
For efficiency I would still recommend using mapPartitions to minimize cache checks:
sc.parallelize(1 to 10000000)
  .mapPartitions(it => connector.withSessionDo(session =>
    it.map(row => /* do stuff here */ row)))
In addition, the session object also uses a prepare cache, which lets you cache a prepared statement in your serialized code, and it will only ever be prepared once per JVM (all other calls will return the cached reference).
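To make that concrete, here is a rough sketch combining mapPartitions with the prepare cache. The table ks.thingy, its columns, and the rdd of (id, tag) pairs are made-up names for illustration; session.prepare goes through the connector's cache, so repeated calls on the same JVM return the same prepared statement:

import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)

// rdd: RDD[(String, String)] of (id, tag) pairs -- hypothetical input
rdd.mapPartitions { it =>
  connector.withSessionDo { session =>
    // Prepared once per JVM; later calls hit the prepare cache
    val stmt = session.prepare("UPDATE ks.thingy SET tags = tags + ? WHERE id = ?")
    it.map { case (id, tag) =>
      session.execute(stmt.bind(java.util.Collections.singleton(tag), id))
    }
  }
}.count() // force evaluation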
Can someone explain, and point to documentation that describes, the behavior of
select * from <keyspace.table>
Let's assume I have a 5-node cluster; how does the Cassandra DataStax driver behave when such queries are issued?
(Fetch size was set to 500.)
Is this a proper way to pull data? Does it cause any performance issues?
No, that's really a very bad way to pull data. Cassandra shines when it fetches data by at least the partition key (which identifies the server that holds the actual data). When you do a select * from table, the request is sent to a coordinator node, which needs to pull all the data from all servers and send it back through that single coordinator, overloading it and most probably leading to timeouts if you have enough data in the cluster.
If you really need to perform a full fetch of the data from the cluster, it's better to use something like the Spark Cassandra Connector, which reads data by token ranges, fetching the data directly from the nodes that hold it, and doing this in parallel. You can of course implement the token range scan in the Java driver, something like this, but it will require more work on your side compared to using Spark.
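For illustration, a minimal sketch of a full-table read with the Spark Cassandra Connector; the contact point, keyspace and table names are placeholders. sc.cassandraTable splits the scan into token ranges and reads them in parallel from the replicas:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("full-table-scan")
  .set("spark.cassandra.connection.host", "10.0.0.1") // placeholder contact point
val sc = new SparkContext(conf)

// Reads the whole table, split by token range and executed in parallel on the executors
val rows = sc.cassandraTable("my_keyspace", "my_table")
println(rows.count())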
I have a very basic understanding of Spark and I am trying to find something that can help me achieve the following:
Have a pool of objects shared across all the nodes, asynchronously.
What I am currently thinking is: let's say there are ten nodes, numbered 1 to 10.
If I have a single object, I will have to make my object synchronous in order for it to be accessible by any node. I do not want that.
Second option is, I can have a pool of say 10 objects.
I want to write my code in such a way that the node number 1 always uses the object number 1, the node number 2 always uses the object number 2 and so on..
A sample approach would be, before performing a task, get the thread ID and use the object number (threadID % 10). This would result in a lot of collisions and would not work.
Is there a way that I can somehow get a nodeID or processID, and make my code fetch an object according to that ID ? Or some other way to have an asynchronous pool of objects on my cluster?
I apologize if this sounds trivial; I am just getting started and cannot find many resources pertaining to my question online.
PS: I am using a Spark Streaming + Kafka + YARN setup, if it matters.
Spark automatically partitions the data across all available cluster nodes; you don't need to control or keep track of where the partitions are actually stored. Some RDD operations also require shuffling, which is fully managed by Spark, so you can't rely on the layout of the partitions.
Sharing an object only makes sense if it's immutable. Each worker node receives a copy of the original object, and any local changes to it will not be reflected on other nodes. If that's what you need, you can use sc.broadcast() to efficiently distribute an object across all workers prior to a parallel operation.
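As a rough sketch of the broadcast approach (the pool type, its contents, and inputRdd are hypothetical; broadcast only makes sense because the object is treated as read-only on the workers):

// Hypothetical read-only lookup object we want available on every node
case class LookupPool(entries: Map[String, String])

val pool = LookupPool(Map("a" -> "1", "b" -> "2"))

// Ship one copy to each executor; workers must treat it as immutable
val poolBc = sc.broadcast(pool)

// inputRdd: RDD[String] -- hypothetical input
val enriched = inputRdd.map { key =>
  // The broadcast value is fetched once per executor and then reused
  key -> poolBc.value.entries.getOrElse(key, "unknown")
}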
So I'm asking if anyone knows a way to change Spark properties (e.g. spark.executor.memory, spark.shuffle.spill.compress, etc.) at runtime, so that a change may take effect between tasks/stages during a job...
So I know that...
1) The documentation for Spark 2.0+ (and previous versions too) states that once the Spark Context has been created, it can't be changed at runtime.
2) SparkSession.conf.set may change a few things for SQL, but I was looking at more general, all-encompassing configurations.
3) I could start a new context in the program with new properties, but the case here is to actually tune the properties once a job is already executing.
Ideas...
1) Would killing an Executor force it to read a configuration file again, or does it just get what was already configured at the beginning of the job?
2) Is there any command to force a "refresh" of the properties in spark context?
So hoping there might be a way or other ideas out there (thanks in advance)...
After submitting a Spark application, we can change some parameter values at runtime and some we cannot.
By using the spark.conf.isModifiable() method, we can check whether a parameter value can be modified at runtime. If it returns true, we can modify the parameter value; otherwise, we can't modify it at runtime.
Examples:
>>> spark.conf.isModifiable("spark.executor.memory")
False
>>> spark.conf.isModifiable("spark.sql.shuffle.partitions")
True
So based on the above testing, we can't modify the spark.executor.memory parameter value at runtime.
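The same check works from Scala, and modifiable (mostly spark.sql.*) properties can then be set between jobs; spark below is assumed to be an existing SparkSession:

// Assuming an existing SparkSession named `spark`
println(spark.conf.isModifiable("spark.executor.memory"))        // false: fixed at startup
println(spark.conf.isModifiable("spark.sql.shuffle.partitions")) // true: a runtime SQL conf

// Modifiable properties can be changed while the application is running
spark.conf.set("spark.sql.shuffle.partitions", "64")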
No, it is not possible to change settings like spark.executor.memory at runtime.
In addition, there are probably not too many great tricks in the direction of 'quickly switching to a new context', as the strength of Spark is that it can pick up data and keep going. What you are essentially asking for is a map-reduce framework. Of course you could rewrite your job into this structure and divide the work across multiple Spark jobs, but then you would lose some of the ease and performance that Spark brings (though possibly not all).
If you really think the request makes sense on a conceptual level, you could consider making a feature request. This can be through your Spark supplier, or directly by logging a Jira on the Apache Spark project.
I am trying to load a huge amount of data from Spark into HBase. I am using the saveAsNewAPIHadoopDataset method.
I am creating an ImmutableBytesWritable and a Put and saving them as required, as below.
dataframe.mapPartitions { rows =>
  rows.map { eachRow =>
    // Row key: uniqueId + authTime
    val rowKey = Seq(eachRow.getAs[String]("uniqueId"), eachRow.getAs[String]("authTime")).mkString(",")
    val put = new Put(Bytes.toBytes(rowKey))
    // userCF is the column family (Array[Byte]), defined elsewhere
    val fields = eachRow.schema.fields
    for (i <- 0 until fields.length) {
      put.addColumn(userCF, Bytes.toBytes(fields(i).name), Bytes.toBytes(String.valueOf(eachRow.get(i))))
    }
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
  }
}.saveAsNewAPIHadoopDataset(job.getConfiguration)
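For context, a rough sketch of how the Hadoop job referenced above as job might be configured for HBase's TableOutputFormat; the ZooKeeper quorum and table name are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "zk-host:2181")      // placeholder quorum
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "user_events") // placeholder table name

val job = Job.getInstance(hbaseConf)
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[org.apache.hadoop.hbase.client.Put])
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])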
My data is about 30 GB, sitting in HDFS across 60 files.
When I submit the same job with 10 files at a time, everything goes fine.
But when I submit everything at once, it gives this error. The error is really frustrating and I have tried everything within possibility. I'm really wondering what made it run successfully when the data was 5 GB and what made it fail when it is 30 GB.
Has anyone faced this kind of issue?
That's because ImmutableBytesWritable is not serializable. When there is a shuffle, Apache Spark tries to serialize it to send it to another node. The same would happen if you tried to take some rows or collect them on the driver.
There are really only two approaches.
Do not use it if you're shuffling. If you just need to put each record from disk into the database, then it looks like shuffling is not required; make sure it isn't. If you need to preprocess your data before it goes to the database, keep it in some other serializable format and convert it to the required one only when you save it.
Use another serializer. Apache Spark comes with Kryo (make sure you're using Spark 2.0.0 - Kryo has been updated there and it fixes a few nasty concurrency bugs). In order to use it, you have to configure it. It's not hard, but it requires a bit of code.
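For instance, a rough sketch of the Kryo configuration, assuming the rest of the job (the mapPartitions and saveAsNewAPIHadoopDataset calls) stays as it is:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("hbase-bulk-put")
  // Switch from Java serialization to Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register the classes that may cross a shuffle boundary
  .registerKryoClasses(Array(classOf[ImmutableBytesWritable], classOf[Put]))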
My program has an issue with Oracle query performance. I believe the SQL itself performs well, because it returns quickly in SQL*Plus.
But when my program has been running for a long time, like a week, the SQL query (using JDBC) becomes slower (in my logs, the query time is much longer than when I first started the program). When I restart my program, query performance returns to normal.
I think it could be something wrong with the way I use the PreparedStatement, because the SQL I'm using does not use placeholders ("?") at all - it's just a complex select query.
The query process is done by a util class. Here is the pertinent code building the query:
public List<String[]> query(String sql, String[] args) {
    Connection conn = null;
    PreparedStatement preStatm = null;
    ResultSet rs = null;
    try {
        conn = openConnection();
        conn.setAutoCommit(true);
        ....
        preStatm = conn.prepareStatement(sql);
        .... //set prepared statement arg code
        rs = preStatm.executeQuery();
        ....
    } finally {
        //close rs
        //close preStatm
        //close connection
    }
}
In my case, args is always null, so the caller just passes a query SQL string to this query method. Is it possible that this usage could slow down DB queries after the program has been running for a long time? Or should I use Statement instead, or pass args with "?" placeholders in the SQL? How can I find out the root cause of my issue? Thanks.
Maybe the problem is in the JDBC statement cache... see the Oracle spec.
Try turning it off,
or try reinitializing the driver from time to time (e.g. once per day).
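As a rough sketch of turning implicit statement caching off with the Oracle driver (this assumes the ojdbc driver is on the classpath and that your driver version exposes these oracle.jdbc.OracleConnection methods; the connection URL and credentials are placeholders):

import java.sql.{Connection, DriverManager}
import oracle.jdbc.OracleConnection

// Placeholder connection details
val conn: Connection = DriverManager.getConnection(
  "jdbc:oracle:thin:@//db-host:1521/SERVICE", "user", "password")

// Unwrap to the Oracle-specific interface and disable implicit statement caching
val oraConn = conn.unwrap(classOf[OracleConnection])
oraConn.setImplicitCachingEnabled(false)
oraConn.setStatementCacheSize(0)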
You first need to look at data that will help you see where you are spending most of your time; guessing is not an option when performance tuning.
So I would recommend getting solid data that pinpoints the layer presenting the issue (Java or the DB).
For this I would suggest looking at AWR and ASH reports when the problem is most noticeable. Also collect data on the JVM (you can use JConsole and/or JVisualVM).
When first diagnosing bad performance I always apply the "USE" method: Utilization, Saturation and Errors.
So first, look for Errors in the logs.
Then look for any resource becoming Saturated (CPUs, memory, etc.).
Finally, look at the Utilization of each resource; a client-server layout will make this easier. If that is not the case, you will need to drill down to the process level to know whether it's Java or the DB.
Once you have collected this data you can direct your tuning efforts accordingly. Guessing instead will only make you waste time and sometimes even mask problems or induce new ones.
You can come back later with this data and we can take a look!