I need to insert a huge amount of data using the Python DataStax driver for Cassandra, so a blocking execute() request won't do; execute_async() is much faster.
But I faced a problem of losing data when calling execute_async(). If I use execute(), everything is ok. But if I use execute_async() (for the SAME insert queries), only about 5-7% of my requests executed correctly (and no errors occurred). And if I add time.sleep(0.01) after every 1000 insert requests (when using execute_async()), it's ok again.
No data loss (case 1):

for query in queries:
    session.execute(query)
No data loss (case 2):

counter = 0
for query in queries:
    session.execute_async(query)
    counter += 1
    if counter % 1000 == 0:
        time.sleep(0.01)
Data loss:

for query in queries:
    session.execute_async(query)
Is there any reason why this could happen?
Cluster has 2 nodes
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
DataStax Python driver version 3.14.0
Python 3.6
Since execute_async is a non-blocking query, your code does not wait for a request to complete before proceeding. The reason you probably observe no data loss when you add a 10ms sleep after every 1000 executions is that it gives the requests enough time to be processed before you read the data back.
You need something in your code that waits for completion of the requests before reading data back, i.e.:

futures = []
for query in queries:
    futures.append(session.execute_async(query))

for f in futures:
    f.result()  # blocks until the query is complete
You may want to evaluate using execute_concurrent for submitting many queries and having the driver manage the concurrency level for you.
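For illustration, a minimal sketch of that approach using execute_concurrent_with_args; the contact points, keyspace, table, and column names here are placeholders, not from your schema:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])          # placeholder contact points
session = cluster.connect('my_keyspace')  # placeholder keyspace

# Prepare once, bind many times; the driver keeps at most `concurrency`
# requests in flight and raises on the first failed request by default.
insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")
params = [(i, 'value-%d' % i) for i in range(100000)]

results = execute_concurrent_with_args(session, insert, params, concurrency=100)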
Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

max_rank = 500

ranking_order = (Window.partitionBy(['category', 'id'])
                 .orderBy(F.col('primary').desc(), F.col('secondary')))

df_ranked = (df
             .withColumn('rank', F.row_number().over(ranking_order))
             .where(F.col('rank') <= max_rank))

df_ranked.explain()
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
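For reference, a minimal sketch of that global top-n pattern (value is a placeholder column, not from my schema):

# A global sort plus limit can be planned as TakeOrderedAndProject,
# which avoids fully sorting the table.
top_n = df.orderBy(F.desc('value')).limit(500)
top_n.explain()  # look for TakeOrderedAndProject in the physical plan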
For context, right now my query is taking 3.5 hours on a beefy 4 worker x 40 core cluster, and the shuffle write time surrounding this query (including some projections not listed above) appears to be my biggest bottleneck, so I'm trying to cut down the amount of data as soon as possible.
When I run a query through Robo 3T, mongod, or pymongo from the command line, the query takes 50 milliseconds to return the results. Running the same query using pymongo in AWS Lambda takes 15-16 seconds. Not all queries are this slow, just address queries in my case (queries for a name take under 1 second). I'm using Python 3.6, pymongo 3.7, and MongoDB 3.6.
I don't believe it's a cold start issue, because I run two queries in a row, this being the second, and the first query still takes less than a second. I've also tried running it multiple times in a row and get the same results every time. The function only uses 57MB of the 128MB allotted, so I don't believe it could be a CPU issue, and increasing the memory allocation (which also scales CPU in Lambda) didn't change the speed at all.
MongoDB query
db.getCollection('CA').find({'$and': [{'OWNER_ZIP': {'$regex': '^95120'}},{'OWNER_STREET_1': '123 MAIN STREET'}]})
pymongo query
cursor = list(db_unclaimed.find({'$and': [{'OWNER_ZIP': {'$regex': '^95120'}},{'OWNER_STREET_1': '123 MAIN STREET'}]}).skip(0).limit(50))
Python function I'm using
def searchAddress(zipcode, address, page_size, page_num):
    print('Searching by address...')
    print(address)
    skips = page_size * (int(page_num) - 1)
    cursor = list(db_unclaimed.find({'$and': [{'OWNER_ZIP': {'$regex': zipcode}}, {'OWNER_STREET_1': address}]}).skip(skips).limit(page_size))
    for document in cursor:
        print(document)
    return cursor
I would expect the query to take close to the same amount of time in Lambda as it does using the other three methods, even if it might be a bit slower. Does anyone have any ideas as to what could be causing this?
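In case it helps, this is how I've been comparing query plans between environments (a sketch; the same find as above plus explain()):

# Diagnostic sketch: print the plan MongoDB chose. An anchored regex like
# ^95120 can use an index on OWNER_ZIP (IXSCAN); otherwise it's a COLLSCAN.
plan = db_unclaimed.find({'$and': [
    {'OWNER_ZIP': {'$regex': '^95120'}},
    {'OWNER_STREET_1': '123 MAIN STREET'},
]}).explain()
print(plan['queryPlanner']['winningPlan'])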
I am using pyspark and Flask for an interactive Spark-as-a-service application.
My application receives requests with some parameters and returns a response. My code is here:
from pyspark.sql import functions as F
from pyspark.sql import types

# First I define a UDF
def dict_list(x, y):
    return dict(zip(map(str, x), map(str, y)))

dict_list_udf = F.udf(dict_list,
                      types.MapType(types.StringType(), types.StringType()))
# Then I read my table from Cassandra
df2 = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="property_change", keyspace="strat_keyspace_cassandra_raw2") \
    .load()
@app.route("/test/<serviceMatch>/<matchPattern>")
def getNodeEntries1(serviceMatch, matchPattern):
    result_df = df2.filter(df2.id.like(matchPattern + "%") & (df2.property_name == serviceMatch)) \
        .groupBy("property_name") \
        .agg(F.collect_list("time").alias('time'), F.collect_list("value").alias('value'))
    return json.dumps(result_df.withColumn('values', dict_list_udf(result_df.time, result_df.value)).select('values').take(1))
When I start my server (using spark-submit) and use Postman for the GET request, it takes about 13 seconds the first time to give me a response, and after that every other response takes approximately 3 seconds. Serving users with a 13-second delay at first is not acceptable. I am a new Spark user, and I assume this behaviour is due to the nature of Spark, but I do not know what exactly is causing it. Maybe something about caching, or compiling an execution plan like SQL queries. Is there any chance that I could solve this problem? P.S. I am a new user, so sorry if my question is not clear enough.
Such a delay is fully expected. Skipping over the simple fact that Spark is not designed to be embedded directly in an interactive application (nor is it suitable for real-time queries), there is simply significant overhead for:
Initializing context.
Acquiring resources from the cluster manager.
Fetching metadata from Cassandra.
The question is whether it makes sense to use Spark here at all. If you need close to real-time responses and you collect full results to the driver, using the native Cassandra connector would be a much better choice.
However, if you plan to execute logic that is not supported by Cassandra itself, then all you can do is accept the cost of such an indirect architecture.
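For illustration, a minimal sketch of the direct driver approach mentioned above, using the DataStax Python driver. The contact points are placeholders, and it assumes property_name and id can be used as equality predicates in your schema (the prefix match your Spark code does with like is not something plain CQL supports without a SASI index):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # placeholder contact points
session = cluster.connect('strat_keyspace_cassandra_raw2')

# Assumes (property_name, id) can be queried by equality in your schema.
stmt = session.prepare(
    "SELECT time, value FROM property_change "
    "WHERE property_name = ? AND id = ?")

def get_values(service_match, node_id):
    rows = session.execute(stmt, (service_match, node_id))
    # Same shape as the Spark UDF output: a {time: value} map of strings
    return {str(r.time): str(r.value) for r in rows}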
I am looking to measure how much time is taken in my Spark job for the IO part of reading from an external DB. My code is
val query = s"""
  |(
  |  select
  |  ...
  |) as project_data_tmp""".stripMargin

sparkSession.time(
  sparkSession.read.jdbc(
    url = msqlURLWithCreds,
    table = query,
    properties = new Properties()
  )
)
sparkSession.time doesn't seem to do anything in-depth enough to measure the full load time of the SQL.
The web UI gives me timing for the entire stage.
The green box is where I read and call cache() on the DataFrame.
The only way I could come up with to split this into a separate stage was to perform an operation that required shuffling data; but that introduced its own overheads.
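Presumably sparkSession.time returns immediately because read.jdbc is lazy and only builds a plan. A sketch of what I mean, in PySpark terms with placeholder names mirroring the Scala snippet above (the Scala shape is the same): timing an action that actually scans the rows does include the JDBC IO.

import time

df = spark.read.jdbc(url=msql_url_with_creds, table=query, properties={})

start = time.time()
df.foreach(lambda _: None)  # forces a full read; nothing is collected to the driver
print('JDBC read took %.1f seconds' % (time.time() - start))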
Thanks,
Brent
I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? For example: SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 24 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:

Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.

but actually you'd query 24 partitions.
What you probably meant is that what is now 24 partitions used to be a single partition in your old design, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) the data is still ordered by timestamp, you have three choices:
Run one query at a time.
Run two queries the first time, and then one at a time, to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data, and, when finished, run the query for hour 1, and so on. This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run two queries in parallel to get the data for both hour 0 and hour 1, and start processing the data for hour 0. In the meantime, the data for hour 1 arrives, so when you finish processing hour 0 you can prefetch the data for hour 2 and start processing hour 1. And so on. In this way you can speed up data processing. Of course, depending on your timings (data processing and query times), you should tune the number of "prefetch" queries.
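A minimal sketch of that prefetch loop, shown with the DataStax Python driver's execute_async for brevity (the Java driver's executeAsync supports the same pattern; process() is a placeholder for your processing code):

query = session.prepare(
    "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?")

# Keep one query in flight ahead of the hour currently being processed.
pending = session.execute_async(query, ('x', '10-10-2016', 0))
for hour in range(24):
    rows = pending.result()  # wait for the data of `hour`
    if hour < 23:
        pending = session.execute_async(query, ('x', '10-10-2016', hour + 1))
    process(rows)            # placeholder for your data processing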
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need the data for both hours 3 and 6, then you could try to group the queries by "dependency" (e.g. query hours 3 and 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join them at the application level (you already know why you should avoid IN for multiple partitions). Remember that your data is already ordered, so the application-level effort would be very small.
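A sketch of that last case, reusing the prepared query from the sketch above: since the hour partitions are disjoint and each one is already ordered by timestamp, consuming the futures in hour order yields a globally ordered result.

futures = [session.execute_async(query, ('x', '10-10-2016', hour))
           for hour in range(24)]

all_rows = []
for f in futures:                 # consume the futures in hour order
    all_rows.extend(f.result())   # each partition is already sorted by timestamp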