I'm using FiloDB 0.4 with Cassandra 2.2.5 as the column and meta store, and I'm trying to insert data into it using Spark Streaming 1.6.1 + Jobserver 0.6.2. I use the following code to insert data:
messages.foreachRDD(parseAndSaveToFiloDb)
private static Function<JavaPairRDD<String, String>, Void> parseAndSaveToFiloDb = initialRdd -> {
    final List<RowWithSchema> parsedMessages = parseMessages(initialRdd.collect());
    final JavaRDD<Row> rdd = javaSparkContext.parallelize(createRows(parsedMessages));
    final DataFrame dataFrame = sqlContext.createDataFrame(rdd, generateSchema(rawMessages));
    dataFrame.write().format("filodb.spark")
        .option("database", keyspace)
        .option("dataset", dataset)
        .option("row_keys", rowKeys)
        .option("partition_keys", partitionKeys)
        .option("segment_key", segmentKey)
        .mode(saveMode).save();
    return null;
};
The segment key is ":string /0", the row key is set to a column that is unique for each row, and the partition key is set to a column that is constant for all rows. In other words, my whole test data set goes to a single segment on a single partition. When I use a single one-node Spark instance, everything works fine and all data gets inserted, but when I run two separate one-node Spark instances (not as a cluster) at the same time, I lose about 30-60% of the data, even if I send messages one by one with an interval of several seconds between them.
I checked that dataFrame.write() is executed for each message, so the issue happens after this line.
When I set the segment key to a column that is unique for each row, all data reaches Cassandra/FiloDB.
Please suggest solutions for the scenario with two separate Spark instances.
#psyduck, this is most likely because, in version 0.4, data for each partition can only be ingested on one node at a time. So to stick with the current version, you would need to split your data into multiple partitions and then ensure each worker gets only one partition. The easiest way to achieve this is to sort your data by partition key.
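A minimal Scala sketch of that idea (partCol is a hypothetical partition-key column; the option values mirror the ones in your code):
// Sort so that rows belonging to one FiloDB partition end up together before the write.
val sortedDf = dataFrame.sort("partCol")
sortedDf.write.format("filodb.spark")
  .option("database", keyspace)
  .option("dataset", dataset)
  .option("row_keys", rowKeys)
  .option("partition_keys", partitionKeys)
  .option("segment_key", segmentKey)
  .mode(saveMode)
  .save()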
I would highly encourage you to move to the latest version though - master (Spark 2.x / Scala 2.11) or the spark1.6 branch (Spark 1.6 / Scala 2.10). The latest version has many changes that are not in 0.4 and that would solve your problem:
Using Akka Cluster to automatically route your data to the right ingestion node. In this case, with the same data model, your data would all go to the right node and no data would be lost.
TimeUUID-based chunkIDs, so even if multiple workers (in case of a split brain) somehow write to the same partition, data loss is avoided.
A new "segment-less" data model, so you don't need to define any segment keys; it is more efficient for both reads and writes.
Feel free to reach out on our mailing list, https://groups.google.com/forum/#!forum/filodb-discuss
Related
I have a requirement to scan a table containing 100 million records in production. The search will be made on the first clustering key. The requirement is to find the unique partition keys where the first clustering key matches a condition. The table looks like the following -
employeeid, companyname , lastdateloggedin, floorvisted, swipetimestamp
Partition Key - employeeid
Clustering Key - companyname , lastdateloggedin
I would like to get select distinct(employeeid), company, swipetimestamp where companyname = 'XYZ'. This is an SQL representation of what I would like to fetch from the table.
SparkConf conf = new SparkConf().set("spark.cassandra.connection.enabled", "true")
.set("spark.cassandra.auth.username", "XXXXXXXXXX")
.set("spark.cassandra.auth.password", "XXXXXXXXX")
.set("spark.cassandra.connection.host", "hostname")
.set("spark.cassandra.connection.port", "29042")
.set("spark.cassandra.connection.factory", ConnectionFactory.class)
.set("spark.cassandra.connection.cluster_name", "ZZZZ")
.set("spark.cassandra.connection.application_name", "ABC")
.set("spark.cassandra.connection.local_dc", "DC1")
.set("spark.cassandra.connection.cachedClusterFile", "/tmp/xyz/test.json")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.input.fetch.size_in_rows","10000") //
.set("spark.driver.allowMultipleContexts","true")
.set("spark.cassandra.connection.ssl.trustStore.path", "sampleabc-spark-util/src/main/resources/x.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "cassandrasam");
CassandraJavaRDD<CassandraRow> ctable = javaFunctions(jsc).cassandraTable("keyspacename", "employeedetails").
select("employeeid", "companyname","swipetimestamp").where("companyname= ?","XYZ");
List<CassandraRow> cassandraRows = ctable.distinct().collect();
This code ran in non-production with close to 5 million records. Since this is production, I would like to approach this query with caution. Questions -
What configuration should be present in my SparkConf?
Could the Spark job ever bring down the DB because of the large table?
Could running that job starve Cassandra of threads at that moment?
I would recommend using the DataFrame API instead of RDDs - theoretically, SCC may do more optimizations for that API. If you have a condition on the first clustering column, this condition should be pushed down by SCC to Cassandra and the filtering will happen there. You can check that by calling .explain on the DataFrame and verifying that you have rules marked with * in the PushedFilters part.
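For example, a minimal sketch with the DataFrame API (keyspace, table and column names taken from the question; spark is assumed to be a SparkSession configured for SCC):
import org.apache.spark.sql.cassandra._
val employees = spark.read.cassandraFormat("employeedetails", "keyspacename").load()
val result = employees
  .filter("companyname = 'XYZ'")
  .select("employeeid", "companyname", "swipetimestamp")
  .distinct()
result.explain() // the companyname condition should show up under PushedFilters, marked with *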
Regarding config - use the default value of spark.cassandra.input.fetch.size_in_rows - if the value is too high, you have a higher chance of getting timeouts. You can still bring down nodes even with the default value, as SCC reads with LOCAL_ONE, and that can overload individual nodes. Sometimes reading with LOCAL_QUORUM is faster because it doesn't overload individual nodes as much and avoids restarting tasks that are reading data.
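If you do want to read with LOCAL_QUORUM, that is a single connector setting (a hedged example; merge it into your existing SparkConf):
// SCC read consistency; the default is LOCAL_ONE
conf.set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")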
And I recommend making sure that you're using the latest Spark Cassandra Connector - 2.5.0 - it has a lot of new optimizations and new functionality.
We are using DataStax Cassandra (DSE) 6.0 with Spark enabled. We have 10 GB of data coming in every day. All queries are based on date. We have one huge table with 40 columns. We are planning to generate reports using Spark. What is the best way to set up this data, given that we keep getting data every day and save it for around one year in one table?
We tried using different partitioning schemes, but most of our keys are based on date.
No code, just need a suggestion.
Our queries should be fast. We have 9 nodes with 256 GB RAM and 44-core CPUs.
Having the data organized in daily partitions isn't a very good design - in that case, only RF (replication factor) nodes will be active during the day writing the data, and then again at report generation time.
Because you'll be accessing that data only from Spark, you can use the following approach - have some bucket field as the partition key, for example a uniformly generated random number, the timestamp as a clustering column, and maybe another uuid column to guarantee uniqueness of records, something like this:
create table test.sdtest (
b int,
ts timestamp,
uid uuid,
v1 int,
primary key(b, ts, uid));
The maximum value used when generating b should be selected so that partitions are neither too big nor too small, so we can read them effectively.
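On the write side, a minimal sketch (assuming an input DataFrame events that already has ts, uid and v1 columns; 100 buckets is a hypothetical choice):
import org.apache.spark.sql.functions._
val withBucket = events.withColumn("b", (rand() * 100).cast("int")) // uniform bucket 0..99
withBucket.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "sdtest", "keyspace" -> "test"))
  .mode("append")
  .save()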
And then we can run Spark code like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T00:00:00+0000' as timestamp) AND ts < cast('2019-03-11T00:00:00+0000' as timestamp)")
The trick here is that we distribute data across the nodes by using a random partition key, so all nodes handle the load both while writing the data and while generating the report.
If we look into physical plan for that Spark code (formatted for readability):
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [b#23,ts#24,v1#25]
PushedFilters: [*GreaterThanOrEqual(ts,2019-03-10 00:00:00.0),
*LessThan(ts,2019-03-11 00:00:00.0)], ReadSchema: struct<b:int,ts:timestamp,v1:int>
We can see that both conditions will be pushed down to DSE at the CQL level - this means that Spark won't load all the data into memory and filter it; instead, all filtering will happen in Cassandra and only the necessary data will be returned. And because we're spreading requests across multiple nodes, reading could be faster (this needs testing) than reading one giant partition. Another benefit of this design is that it is easy to delete old data using Spark, with something like this:
val toDel = sc.cassandraTable("test", "sdtest").where("ts < '2019-08-10T00:00:00+0000'")
toDel.deleteFromCassandra("test", "sdtest", keyColumns = SomeColumns("b", "ts"))
In this case, Spark will perform very efficient range/row deletions that generate fewer tombstones.
P.S. It's recommended to use DSE's version of the Spark connector as it may have more optimizations.
P.P.S. Theoretically, we could merge ts and uid into one timeuuid column, but I'm not sure that it would work with DataFrames.
I'm trying to write a batch job to process a couple hundred terabytes that currently sit in an HBase database (in an EMR cluster in AWS), all in a single large table. For every row I process, I need to get additional data from a lookup table (a simple integer-to-string mapping) that is in a second HBase table. We'd be doing 5-10 lookups per row.
My current implementation uses a Spark job that distributes partitions of the input table to its workers, in the following shape:
Configuration hBaseConfig = newHBaseConfig();
hBaseConfig.set(TableInputFormat.SCAN, convertScanToString(scan));
hBaseConfig.set(TableInputFormat.INPUT_TABLE, tableName);
JavaPairRDD<ImmutableBytesWritable, Result> table = sparkContext.newAPIHadoopRDD(hBaseConfig, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
table.map(val -> {
    // some preprocessing
}).foreachPartition(p -> {
    p.forEachRemaining(row -> {
        // code that does the lookup
    });
});
The problem is that the lookup table is too big to fit in the workers' memory. They all need access to all parts of the lookup table, but their access pattern would significantly benefit from a cache.
Am I right in thinking that I cannot use a simple map as a broadcast variable because it'd need to fit into memory?
Spark uses a shared nothing architecture, so I imagine there won't be an easy way to share a cache across all workers, but can we build a simple LRU cache for every individual worker?
How would I implement such a local worker cache that gets the data from the lookup table in HBase on a cache miss? Can I somehow distribute a reference to the second table to all workers?
I'm not set on my choice of technology, apart from HBase as the data source. Is there a framework other than Spark which could be a better fit for my use case?
You have a few options for dealing with this requirement:
1- Use RDD or Dataset joins
You can load both of your HBase tables as Spark RDDs or Datasets and then do a join on your lookup key.
Spark will split both RDDs into partitions and shuffle content around so that rows with the same keys end up on the same executors.
By managing the number of partitions within Spark, you should be able to join two tables of arbitrary size.
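A rough Scala sketch of this option (sc is a SparkContext; the table names, the cf column family and the qualifiers are hypothetical, and the key extraction depends on your actual schema):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
def loadTable(table: String) = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, table)
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
}
// Key the main table by the lookup id stored in cf:lookup_id. Keep only plain
// serializable values (Int, String) before the shuffle, since Result itself is
// not Java-serializable; if each row needs several lookups, flatMap to one
// (lookupId, row) pair per lookup instead.
val main = loadTable("main_table").map { case (k, r) =>
  (Bytes.toInt(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("lookup_id"))),
   Bytes.toString(k.copyBytes()))
}
// Key the lookup table by its integer row key; the value is the mapped string.
val lookup = loadTable("lookup_table").map { case (k, r) =>
  (Bytes.toInt(k.copyBytes()),
   Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))))
}
val joined = main.join(lookup) // rows with equal keys are shuffled to the same executor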
2- Broadcast a resolver instance
Instead of broadcasting a map, you can broadcast a resolver instance that does HBase lookups and keeps a temporary LRU cache. Each executor will get a copy of this instance and can manage its own cache, and you can invoke it within your foreachPartition() code.
Beware: the resolver instance needs to implement Serializable, so you will have to declare the cache, HBase connection and HBase Configuration properties as transient so they are initialized on each executor.
I run such a setup in Scala on one of the projects I maintain: it works and can be more efficient than a straight Spark join if you know your access patterns and manage your cache efficiently.
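A sketch of what such a resolver could look like in Scala (the table name, column family and qualifier are hypothetical; connection cleanup is omitted for brevity):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Table}
import org.apache.hadoop.hbase.util.Bytes
class LookupResolver(maxEntries: Int) extends Serializable {
  // Everything non-serializable is transient and lazily re-created on each executor.
  @transient private lazy val connection: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())
  @transient private lazy val table: Table =
    connection.getTable(TableName.valueOf("lookup_table"))
  // Simple LRU cache via an access-ordered LinkedHashMap.
  @transient private lazy val cache =
    new java.util.LinkedHashMap[Int, String](16, 0.75f, true) {
      override def removeEldestEntry(e: java.util.Map.Entry[Int, String]): Boolean =
        size() > maxEntries
    }
  def resolve(id: Int): String = synchronized {
    val cached = cache.get(id)
    if (cached != null) cached
    else {
      val result = table.get(new Get(Bytes.toBytes(id)))
      val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
      cache.put(id, value)
      value
    }
  }
}
You would then broadcast one instance (for example sc.broadcast(new LookupResolver(100000))) and call resolver.value.resolve(id) inside your foreachPartition() code.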
3- Use the HBase Spark connector to implement your lookup logic
Apache HBase has recently incorporated improved HBase-Spark connectors.
The documentation is pretty sparse right now, so you need to look at the JIRA tickets and the documentation of the previous incarnation of these tools, Cloudera's SparkOnHBase, but the last unit test in the test suite looks pretty much like what you want.
I have no experience with this API though.
I have about 100 GB of time series data in Hadoop. I'd like to use Spark to grab all data from 1000 different time ranges.
I have tried this using Apache Hive by creating an extremely long SQL statement that has about 1000 'OR BETWEEN X AND Y OR BETWEEN Q AND R' clauses.
I have also tried using Spark. In this approach I've created a dataframe that holds the time ranges in question and loaded it into Spark with:
spark_session.createDataFrame()
and
df.registerTempTable()
With this, I'm doing a join between the newly created timestamp dataframe and the larger set of timestamped data.
This query is taking an extremely long time and I'm wondering if there's a more efficient way to do this.
Especially if the data is not partitioned or ordered in any special way, you or Spark will need to scan it all no matter what.
I would define a predicate given the set of time ranges:
import org.apache.spark.rdd.RDD
import scala.collection.immutable.Range

val ranges: List[Range] = ??? // load your ranges here

def matches(timestamp: Int): Boolean = {
  // Not efficient: a better data structure than a List should be used,
  // but this is just an example
  ranges.exists(_.contains(timestamp))
}

val data: RDD[(Int, T)] = ??? // load the data in an RDD keyed by timestamp
val filtered = data.filter(x => matches(x._1))
You can do the same with DataFrame/DataSet and UDFs.
This works well if the set of ranges is provided in the driver. If instead it comes from a table, like the 100 GB data, first collect it back to the driver, provided it's not too big.
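A minimal sketch of the UDF variant mentioned above (the range values here are made up; df is assumed to be the large DataFrame with a numeric timestamp column):
import org.apache.spark.sql.functions.{col, udf}
val ranges: Seq[(Long, Long)] = Seq((1000L, 2000L), (5000L, 6000L)) // collected on the driver
val inAnyRange = udf { ts: Long => ranges.exists { case (lo, hi) => ts >= lo && ts <= hi } }
val filtered = df.filter(inAnyRange(col("timestamp")))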
Your Spark job goes through the 100 GB dataset to select the relevant data.
I don't think there is a big difference between using SQL or the DataFrame API, as under the hood the full scan happens anyway.
I would consider restructuring your data so that it is optimised for specific queries.
In your case, partitioning by time can give quite a significant improvement (for example, a Hive table with partitioning).
If you perform the search using the same field that was used for partitioning, the Spark job will only look into the relevant partitions.
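A hedged sketch of that layout (table and column names are hypothetical; df is the incoming data with a timestamp column):
import org.apache.spark.sql.functions.{col, to_date}
// Write the data partitioned by a derived date column...
df.withColumn("event_date", to_date(col("timestamp")))
  .write.partitionBy("event_date").mode("append").saveAsTable("events_by_date")
// ...so a filter on that column only touches the matching partitions.
val weekly = spark.table("events_by_date")
  .filter("event_date BETWEEN '2019-01-01' AND '2019-01-07'")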
I'm using Spark with MongoDB. I want to know how the input RDD is split across the different worker nodes in the cluster, because my job is to club two records (one a request, the other a response) into one, based on the msg_id and flag fields (the flag indicates request or response); msg_id is the same in both records. Since Spark splits the input RDD with each split going to a different node, how do I handle the case where the request record is on one node and the response record is on another node?
Firstly, the Spark master does not split data. It just controls the workers.
Secondly, RDD splits (while reading from external sources) are decided by InputSplits, implemented through the input format. This part is fairly similar to MapReduce. So in your case, the RDD splits (or partitions, in Spark terms) are decided by the MongoDB input format.
In your case, I believe what you are looking for is to co-locate all records for a msg_id on one node. That can be achieved by partitioning a pair RDD keyed by msg_id, for example with partitionBy and a HashPartitioner.
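A minimal sketch of that co-location step, assuming comma-separated records with msg_id as the first field and a hypothetical partition count of 100:
import org.apache.spark.HashPartitioner
val keyed = records.map(line => (line.split(",")(0), line)) // (msg_id, record)
val coLocated = keyed.partitionBy(new HashPartitioner(100)) // all records for a msg_id land in one partition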
RDDs are built based on your transformations (subject to the scenario), and the master has little room to play a role here. Refer to this link: How does Spark parallelize slices to tasks/executors/workers?
In your case, you may need to use the groupBy() or groupByKey() (the latter is not recommended) transformations to group your values based on the key (msg_id).
For example
val baseRDD = sc.parallelize(Array("1111,REQUEST,abcd","1111,RESPONSE,wxyz","2222,REQUEST,abcd","2222,RESPONSE,wxyz"))
//convert the base RDD to a key/value pair RDD
val keyValRDD = baseRDD.map { line => (line.split(",")(0), line) }
//Group it by message_id
val groupedRDD = keyValRDD.groupBy(keyvalue => keyvalue._1)
groupedRDD.saveAsTextFile("c:\\result")
Result :
(1111,CompactBuffer((1111,1111,REQUEST,abcd), (1111,1111,RESPONSE,wxyz)))
(2222,CompactBuffer((2222,2222,REQUEST,abcd), (2222,2222,RESPONSE,wxyz)))
In the above case, the possibility of having all the values for a key in the same partition is high (subject to data volume and the computing resources available at run time).