Connection issue python with Cassandra - python-3.x

I am trying to fetch data from a Cassandra cluster. The issue is that it sometimes fetches a few rows and sometimes no rows at all.
I am using Python 3.8.5 and the latest DataStax Python driver.
Below is the code
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
cluster = Cluster(['xx.xx.x.xx','xx.xx.x.xx','xx.xx.x.xx'])
session = cluster.connect('keyspace')
query = "SELECT * from table ALLOW FILTERING" # Allow Filtering is for test
statement = SimpleStatement(query)
for user_row in session.execute(statement):
    print(user_row)
The same statement sometimes returns results and sometimes returns nothing.
I have enabled tracing as suggested to see the execution; below is the result:
0:00:00.000150 Parsing SELECT * from table ALLOW FILTERING
0:00:00.000271 Preparing statement
0:00:00.000425 Computing ranges to query
0:00:00.000654 Submitting range requests on 769 ranges with a concurrency of 1 (0.0 rows per range expected)
0:00:00.001804 Submitted 1 concurrent range requests
0:00:00.001839 Executing seq scan across 0 sstables for (min(-9223372036854775808), min(-9223372036854775808)]
0:00:00.001924 Read 0 live rows and 0 tombstone cells
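One thing worth ruling out (this is an assumption on my part, not something shown in the question) is the read consistency level: with the driver default of ONE, each run may be served by a different replica, and a replica that has not yet received the writes will return zero rows. A minimal sketch that pins reads to QUORUM, reusing the placeholders from the code above:
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(['xx.xx.x.xx', 'xx.xx.x.xx', 'xx.xx.x.xx'])
session = cluster.connect('keyspace')

# QUORUM requires a majority of replicas to answer, which rules out a
# single out-of-date replica returning an empty result.
statement = SimpleStatement("SELECT * from table ALLOW FILTERING",
                            consistency_level=ConsistencyLevel.QUORUM)
for user_row in session.execute(statement):
    print(user_row)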

Related

Spark SQL performance issue on large table

We are connecting to Teradata from Spark SQL with the below API:
Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);
We are facing an issue: when we execute the above logic on a large table with millions of rows, we see the extra query below being executed every time, and this results in a performance hit on the DB.
The information below came from the DBA; we don't have any logs on the Spark SQL side.
SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
1
1
1
1
1
1
1
1
1
Can you please clarify why this query is executing, or is there any chance that this type of query is issued by our own code while checking the row count of the dataframe?
Please provide your inputs on this.
I searched various mailing lists and even created a Spark question.
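One possible source of a query like that from the application itself (this is a guess, since we only have the DBA's log): when Spark's JDBC reader needs no actual columns from the table, for example for a row-count or emptiness check on the dataframe, it can send a constant projection such as SELECT 1 FROM <table>. A hedged sketch of what such a check might look like in PySpark (connection details are placeholders mirroring the Java snippet above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-count-check").getOrCreate()

# Placeholder connection details, mirroring the Java snippet above (hypothetical values)
connection_url = "jdbc:teradata://host/DATABASE=db"
table_query = "ONE_MILLION_ROWS_TABLE"
connection_properties = {"user": "user", "password": "password"}

jdbc_df = spark.read.jdbc(connection_url, table_query, properties=connection_properties)

# count() (or an emptiness check) needs no columns, so the query sent to the
# database can degenerate to "SELECT 1 FROM ONE_MILLION_ROWS_TABLE"
if jdbc_df.count() > 0:
    print("table has rows")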

Spark request only a partial sorting for row_number().over partitioned window

Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

max_rank = 500
ranking_order = (Window.partitionBy('category', 'id')
                 .orderBy(F.col('primary').desc(), F.col('secondary')))
df_ranked = (df
             .withColumn('rank', F.row_number().over(ranking_order))
             .where(F.col('rank') <= max_rank))
df_ranked.explain()
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
For context, right now my query is taking 3.5 hours on a beefy 4 worker x 40 core cluster, and the shuffle write time surrounding this query (including some projections not listed above) appears to be my main bottleneck, so I'm trying to cut down the amount of data as soon as possible.
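For comparison, the global (non-partitioned) top-N case can be checked quickly; this sketch assumes a dataframe df with a value column and is only meant to show the plan difference:
from pyspark.sql import functions as F

# Global top-N: the plan should show TakeOrderedAndProject instead of a full Sort.
df.orderBy(F.desc('value')).limit(500).explain()

# The windowed version above instead shows a Sort of every row within each
# partition before the row_number filter, which is the behaviour in question.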

Cassandra execute_async request lose data

I need to insert a huge amount of data using the DataStax Python driver for Cassandra. Because of the volume, execute() is too slow; execute_async() is much faster.
But I ran into data loss when using execute_async(). If I use execute(), everything is fine. If I use execute_async() (for the SAME insert queries), only about 5-7% of my requests execute correctly (and no errors occur). If I add time.sleep(0.01) after every 1000 insert requests (still using execute_async()), it is fine again.
No data loss (case 1):
for query in queries:
    session.execute(query)
No data loss (case 2):
counter = 0
for query in queries:
    session.execute_async(query)
    counter += 1
    if counter % 1000 == 0:
        time.sleep(0.01)
Data loss:
for query in queries:
    session.execute_async(query)
Is there any reason why this could happen?
Cluster has 2 nodes
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
DataStax Python driver version 3.14.0
Python 3.6
Since execute_async is a non-blocking query, your code is not waiting for completion of the request before proceeding. The reason you probably observe no data loss when you add a 10ms sleep after every 1,000 requests is that it gives enough time for the outstanding requests to be processed before you read the data back.
You need something in your code that waits for completion of the requests before reading data back, i.e.:
futures = []
for query in queries:
    futures.append(session.execute_async(query))
for f in futures:
    f.result()  # blocks until the query is complete
You may want to evaluate using execute_concurrent for submitting many queries and having the driver manage the concurrency level for you.
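A minimal sketch of that execute_concurrent approach (the table schema and contact points here are made up for illustration):
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('keyspace')     # placeholder keyspace

insert = session.prepare("INSERT INTO table (id, value) VALUES (?, ?)")  # hypothetical schema
params = [(i, 'value-{}'.format(i)) for i in range(100000)]

# The driver keeps at most `concurrency` requests in flight and only returns
# once every statement has completed, so nothing is silently dropped.
results = execute_concurrent_with_args(session, insert, params, concurrency=50)
for success, result_or_exc in results:
    if not success:
        print(result_or_exc)   # the exception raised for that statement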

Spark window function on dataframe with large number of columns

I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~500 (500 features in ML parlance). The total number of rows in the dataset is ~160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, as in Python pandas), where each empty feature value gets set to the previously available value for that column, per ID and ordered by Date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) //first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
  originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2h without completing.
Am I doing something wrong, at the code level? Given the size of the data, and the instances, I would assume it should finish in a reasonable amount of time? And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those two lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently they can be very slow for a large number of columns, and I was also getting StackOverflowErrors because of that.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes during query plan execution that makes repeated withColumn calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)
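For anyone doing the same thing from PySpark, a sketch of the equivalent single-select rewrite (paths are placeholders):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
raw = spark.read.option("header", "true").csv("s3://bucket/input")   # placeholder path

window = (Window.partitionBy("ID").orderBy("DATE")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# Build every forward-filled column in one select instead of repeated withColumn calls
filled = raw.select([
    F.coalesce(F.col(c), F.last(F.col(c), ignorenulls=True).over(window)).alias(c)
    for c in raw.columns
])
filled.write.option("header", "true").csv("s3://bucket/output")      # placeholder path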

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 23 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have three choices:
Run 1 query at a time.
Run 2 queries the first time, and then one at a time to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1, and so on. This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data for both hours 0 and 1, and start processing the data for hour 0. In the meantime, the data for hour 1 arrives, so when you finish processing hour 0 you can prefetch the data for hour 2 and start processing hour 1. And so on. In this way you can speed up data processing. Of course, depending on your timings (data processing and query times) you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need the data for both hour 3 and hour 6, then you could try to group the data by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join them at the application level (you already know why you should avoid IN for multiple partitions), as shown in the sketch below. Remember that your data is already ordered, so the application-level effort is very small.
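A rough sketch of that fan-out, shown with the Python driver for brevity (the Java driver's executeAsync follows the same pattern; contact point and keyspace are placeholders, table and column names are taken from the question):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('keyspace')     # placeholder keyspace

stmt = session.prepare(
    "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?")

# One async query per hour; each result is already ordered within its partition,
# so the application-level merge is cheap.
futures = [session.execute_async(stmt, ('x', '10-10-2016', hour))
           for hour in range(24)]
rows = [row for future in futures for row in future.result()]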
