I am inserting and updating many entries in a Cassandra table using the Python Cassandra driver. Currently my code looks like this:
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect('db')
for a in list:
    if bool:
        # calculate b
        session.execute("UPDATE table SET col2 = %s WHERE col1 = %s", (b, a))
    else:
        # calculate b
        session.execute("INSERT INTO table(col1, col2) VALUES(%s, %s)", (a, b))
This method of insertion and update is quite slow as the number of entries in the list (all are unique) which are to be inserted is very large. Is there any faster way of doing this?
Generally for this scenario, you will see the best performance by increasing the number of concurrent writes to Cassandra.
You can do this with the DataStax Python driver for Cassandra using execute_concurrent.
From your description, it is worth noting that in your case there is no difference between an UPDATE and an INSERT in Cassandra: both are upserts, so you can simply run the INSERT statement from your else clause for all values of (a, b).
You will want to create a prepared statement.
Rather than doing the inserts one at a time in your for-loop, consider pre-computing groups of (a,b) pairs as input for execute_concurrent; you can also write a generator or generator expression as input for execute_concurrent.
Example:
parameters = ((a, calculate_b(a)) for a in my_list)
execute_concurrent_with_args(my_session, my_prepared_statement, parameters)
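A slightly fuller sketch of the same idea follows. It assumes the DataStax driver is installed and that `session` is already connected; the helper names `build_parameters` and `upsert_all`, the table name `my_table`, and the concurrency value are hypothetical placeholders rather than anything from the question.

```python
from typing import Callable, Iterable, Iterator, Tuple


def build_parameters(keys: Iterable, calculate_b: Callable) -> Iterator[Tuple]:
    """Lazily yield (col1, col2) pairs so the full input is never held in memory."""
    return ((a, calculate_b(a)) for a in keys)


def upsert_all(session, keys, calculate_b, concurrency=50):
    """Write every (a, b) pair with one prepared INSERT, many requests in flight."""
    # Imported here so build_parameters stays usable without the driver installed.
    from cassandra.concurrent import execute_concurrent_with_args

    # my_table is a placeholder name; prepared statements use ? markers.
    insert = session.prepare("INSERT INTO my_table (col1, col2) VALUES (?, ?)")
    results = execute_concurrent_with_args(
        session,
        insert,
        build_parameters(keys, calculate_b),
        concurrency=concurrency,
        raise_on_first_error=False,  # collect failures instead of stopping early
    )
    # Each result is a (success, result_or_exception) pair; return the failures.
    return [outcome for success, outcome in results if not success]
```

Returning the failed results rather than raising makes it possible to retry only the writes that did not succeed; whether that is appropriate depends on your consistency requirements.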
As far as I know, Apache Spark doesn't have functionality that imitates the SQL UPDATE command, that is, a way to change a single value in a column given a certain condition. The only workaround is the following command, which I was told to use (here on Stack Overflow): withColumn(columnName, where('condition', value));
However, the condition has to be of Column type, meaning I have to use the built-in column filtering functions Spark provides (equalTo, isin, lt, gt, etc.). Is there a way to use an SQL statement instead of those built-in functions?
The problem is that I'm given a text file with SQL conditions, like WHERE ID > 5 or WHERE AGE != 50, etc. I then have to label values based on those conditions. I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can get around this?
I found a way around this:
Split your dataset into two sets, the rows you want to update and the rows you don't want to update:
Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", lit("updateValue"));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);
This, however, doesn't keep the records in the same order as the original dataset, so if order matters to you, this won't meet your needs. Note also that except() returns distinct rows, so duplicate rows in the original dataset are collapsed.
In PySpark you have to use .subtract instead of .except.
If you are using a DataFrame, you can register it as a temporary table
using df.registerTempTable("events")
Then you can query it like this:
sqlContext.sql("SELECT * FROM events "+)
A when clause translates into a CASE expression, which corresponds to the SQL CASE clause.
Example
scala> val condition_1 = when(col("col_1").isNull,"NA").otherwise("AVAILABLE")
condition_1: org.apache.spark.sql.Column = CASE WHEN (col_1 IS NULL) THEN NA ELSE AVAILABLE END
You can also chain when clauses:
scala> val condition_2 = when(col("col_1") === col("col_2"),"EQUAL").when(col("col_1") > col("col_2"),"GREATER").
| otherwise("LESS")
condition_2: org.apache.spark.sql.Column = CASE WHEN (col_1 = col_2) THEN EQUAL WHEN (col_1 > col_2) THEN GREATER ELSE LESS END
scala> val new_df = df.withColumn("condition_1",condition_1).withColumn("condition_2",condition_2)
If you still want to use a table, you can register your DataFrame / Dataset as a temporary table and run SQL queries against it:
df.createOrReplaceTempView("tempTable") // Spark 2.1+
df.registerTempTable("tempTable") // Spark 1.6
Now you can run SQL queries:
spark.sql("your query goes here, with CASE clause and WHERE condition") // Spark 2.1+
sqlContext.sql("your query goes here, with CASE clause and WHERE condition") // Spark 1.6
If you are using a Java Dataset, you can update it as shown in the code here:
Dataset ratesFinal1 = ratesFinal.filter(" on_behalf_of_comp_id != 'COMM_DERIVS' ");
ratesFinal1 = ratesFinal1.filter(" status != 'Hit/Lift' ");
Dataset ratesFinalSwap = ratesFinal1.filter (" on_behalf_of_comp_id in ('SAPPHIRE','BOND') and cash_derivative != 'cash'");
ratesFinalSwap = ratesFinalSwap.withColumn("ins_type_str",functions.lit("SWAP"));
Adding a new column with the value from an existing column:
ratesFinalSTW = ratesFinalSTW.withColumn("action", ratesFinalSTW.col("status"));
I have a question about ordering an RDD by date, finding the holes in it, and filling them. For example, suppose I have these records in my database:
20160410,"info1"
20160409,"info2"
20160407,"info3"
20160404,"info4"
For my purposes I also need the holes, because they affect my calculations, so I would like to end up with something like this:
Some(20160410,"info1")
Some(20160409,"info2")
None
Some(20160407,"info3")
None
None
Some(20160404,"info4")
What is the best strategy for doing that?
This is a small, incomplete code excerpt:
val records = bdao // RDD[(String, List[RecordPO])]
  .findRecords
  .filter(_.getRecDate >= startDate)
  .filter(_.getRecDate < endDate)
  .keyBy(_.getId)
  .aggregateByKey(List[RecordPO]())((list, value) => value +: list, _ ++ _)
...
/* transformations */
...
val finalRecords = .... // RDD[(String, List[Option[RecordPO]])]
Thanks in advance
You will need to create a DataFrame of all the dates you want to see in the resulting dataset (for example, all dates from 20160404 to 20160410). Then perform a left outer join of this DataFrame with your records, and you will get None wherever a date has no record.
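The same idea can be sketched without Spark, in plain Python, to show the logic of the join. This is only an illustration under assumptions: the function name `fill_date_gaps` is hypothetical, dates are taken to be integers in yyyymmdd form as in the question, and the range is taken from the earliest to the latest record rather than from explicit start and end dates.

```python
from datetime import datetime, timedelta


def fill_date_gaps(records):
    """Return records newest-first, with None for every day that has no record.

    records: iterable of (yyyymmdd_int, info) pairs, at most one per day.
    """
    # Index the records by calendar day (the "right side" of the left join).
    by_day = {
        datetime.strptime(str(day), "%Y%m%d").date(): (day, info)
        for day, info in records
    }
    if not by_day:
        return []
    first, last = min(by_day), max(by_day)
    # Every day in the range, newest first (the "left side" of the join).
    all_days = [last - timedelta(days=i) for i in range((last - first).days + 1)]
    # dict.get returns None for missing days, which plays the role of the hole.
    return [by_day.get(day) for day in all_days]
```

In Spark, the list of all days would be its own DataFrame or RDD and the dictionary lookup would be the left outer join.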
Aggregating multiple columns:
I have a dataframe input.
I would like to apply different aggregation functions per grouped columns.
In the simple case, I can do this, and it works as intended:
val x = input.groupBy("user.lang").agg(Map("user.followers_count" -> "avg", "user.friends_count" -> "avg"))
However, if I add more aggregation functions for the same column, the earlier ones are silently dropped, for instance:
val x = input.groupBy("user.lang").agg(Map("user.followers_count" -> "avg", "user.followers_count" -> "max", "user.friends_count" -> "avg"))
As I am passing a map it is not exactly surprising. How can I resolve this problem and add another aggregation function for the same column?
It is my understanding that this could be a possible solution:
val x = input.groupBy("user.lang").agg(avg($"user.followers_count"), max($"user.followers_count"), avg("user.friends_count"))
This, however, returns an error: error: not found: value avg.
New column naming:
In the first case, I end up with new column names such as: avg(user.followers_count AS `followers_count`), avg(user.friends_count AS `friends_count`). Is it possible to define a new column name for the aggregation process?
I know that using SQL syntax might be a solution for this, but my goal eventually is to be able to pass arguments via command line (group by columns, aggregation columns and functions) so I'm trying to construct the pipeline that would allow this.
Thanks for reading this!
I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed from the documentation is that they sorted key-line-number pairs as opposed to key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:
var keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
var keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98)))
var result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
var unionResult = result.union(keyValueRDD_2)
var finalResult = unionResult.foldByKey("")(_+_)
When I do a union on the result RDD and keyValueRDD_2 RDD and print the output of the unionResultRDD, the result and keyValueRDD_2 are not interleaved. In other words, it looks like the unionResult RDD has the keyValueRDD_2 contents followed by the result RDD contents. However, when I do a foldByKey operation which combines the values of same key into a single key-value pair, the sorted order is destroyed. I need to do a fold by key operation in order to save the result as the original key-record pair. Is there an alternate rdd function that could be used to achieve this?
Any tips or suggestions would be quite useful.
Thanks
The union method just puts two RDDs one after the other, except if they have the same partitioner. Then it joins the partitions.
What you want to do is impossible.
When you have one RDD sorted (keyValueRDD_1) and another unsorted RDD with the same keys (keyValueRDD_2) then the only way to get the second RDD sorted is to sort it.
The existence of the sorted RDD does not help us sort the second RDD.
The Databricks article talks about an optimization that happens locally on the executors. After the shuffle step, the records are only roughly sorted: each partition now covers a range of keys, but the records within each partition are unsorted.
Now you have to sort each partition locally, and this is where the prefix optimization helps with cache locality.
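To make the local step concrete, here is a plain-Python sketch of sorting small (key, position) pairs instead of whole records and only touching the full records at the end. It is an illustration of the data flow only, not the Databricks implementation; the function name `sort_records_by_key_prefix` is hypothetical, and Python itself will not show the cache benefit that the layout gives in a lower-level runtime.

```python
def sort_records_by_key_prefix(records, key_len=10):
    """Sort records by their first key_len characters via (key, position) pairs."""
    # Build compact pointers: the key plus the index of the full record.
    pointers = [(record[:key_len], index) for index, record in enumerate(records)]
    # Only the small pairs are compared and moved during the sort.
    pointers.sort()
    # Materialize the full records once, in sorted order.
    return [records[index] for _, index in pointers]
```

Because the position is the second element of each pair, records with equal keys keep their original relative order.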
I'm looking for a good way to store data associated with a time range, in order to be able to efficiently retrieve it later.
Each entry of data can be simplified as (start time, end time, value). I will need to later retrieve all the entries which fall inside a (x, y) range. In SQL, the query would be something like
SELECT value FROM data WHERE starttime <= x AND endtime >= y
Can you suggest a structure for the data in Cassandra which would allow me to perform such queries efficiently?
This is an oddly difficult thing to model efficiently.
I think using Cassandra's secondary indexes (along with a dummy indexed value which is unfortunately still needed at the moment) is your best option. You'll need to use one row per event with at least three columns: 'start', 'end', and 'dummy'. Create a secondary index on each of these. The first two can be LongType and the last can be BytesType. See this post on using secondary indexes for more details. Since you have to use an EQ expression on at least one column for a secondary index query (the unfortunate requirement I mentioned), the EQ will be on 'dummy', which can always be set to 0. (This means that the EQ index expression will match every row and essentially be a no-op.) You can store the rest of the event data in the row alongside start, end, and dummy.
In pycassa, a Python Cassandra client, your query would look like this:
from pycassa.index import *
start_time = 12312312000
end_time = 12312312300
start_exp = create_index_expression('start', start_time, GT)
end_exp = create_index_expression('end', end_time, LT)
dummy_exp = create_index_expression('dummy', 0, EQ)
clause = create_index_clause([start_exp, end_exp, dummy_exp], count=1000)
for result in entries.get_indexed_slices(clause):
    # do stuff with result
    pass
There should be something similar in other clients.
The alternative that I considered first involved OrderPreservingPartitioner, which is almost always a Bad Thing. For the index, you would use the start time as the row key and the finish time as the column name. You could then perform a range slice with start_key=start_time and column_finish=finish_time. This would scan every row after the start time and only return those with columns before the finish_time. Not very efficient, and you have to do a big multiget, etc. The built-in secondary index approach is better because nodes will only index local data and most of the boilerplate indexing code is handled for you.
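The scan described for the ordered alternative can be sketched in memory to show why it is inefficient. This is a minimal illustration, not Cassandra code: the function name `entries_within` is hypothetical, entries are assumed to be (start, end, value) tuples already sorted by start time, and both bounds are treated as inclusive.

```python
import bisect


def entries_within(entries, start_time, finish_time):
    """Return values of entries that start at or after start_time
    and end at or before finish_time.

    entries: list of (start, end, value) tuples sorted by start.
    """
    starts = [start for start, _, _ in entries]
    # Jump to the first entry whose start is >= start_time (the ordered row key).
    first = bisect.bisect_left(starts, start_time)
    # Everything after that point still has to be scanned and filtered by end.
    return [value for _, end, value in entries[first:] if end <= finish_time]
```

The ordering only helps with the lower bound on the start time; every later entry is still visited to check its end time, which mirrors the cost described above.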