I have been using the cassandra-stress tool to evaluate my Cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test (
    ID uuid,
    Time timestamp,
    Value double,
    Date timestamp,
    PRIMARY KEY ((ID, Date), Time)
) WITH COMPACT STORAGE;
I have captured this schema in a custom YAML profile and run cassandra-stress with n=10000, threads=100, and the rest of the options at their defaults (cl=one, mode=native cql3, etc.). The Cassandra cluster is a 3-node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
  partitions: fixed(100)
  select: fixed(1)/2
  batchtype: UNLOGGED
columnspecs:
  - name: Time
    size: fixed(1000)
  - name: ID
    size: uniform(1..100)
  - name: Date
    size: uniform(1..10)
  - name: Value
    size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and Time: fixed(1000), the number of rows getting inserted is 10 million (10000 * 1000 = 10,000,000).
The number of row keys/partitions is 10000 (i.e. n), of which 100 partitions are taken per batch (which means 100 * 1000 = 100000 key-value pairs), and out of those, 50000 key-value pairs are processed at a time (because of select: fixed(1)/2, i.e. ~50%).
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of[100000..100000] total rows in the partitions)
The results that I get are the following for consecutive runs with the same configuration as above:
Run  Total_ops  Op_rate  Partition_rate  Row_rate  Time
1    56         19       1885            943246    3.0
2    46         46       4648            2325498   1.0
3    27         30       2982            1489870   0.9
4    59         19       1932            966034    3.1
5    100        17       1730            865182    5.8
Now what I need to understand are as follows:
Which of these metrics is the throughput, i.e. the number of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it is the Row_rate, can I safely conclude that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why does Total_ops vary so drastically between runs? Does the number of threads have anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row rate is the number of CQL rows you inserted into your database per second. For your table, a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
Partition rate is the number of partitions C* had to construct per second. A partition is the data structure that holds and orders data in Cassandra; data with the same partition key ends up located on the same node. The partition rate equals the number of unique partition-key values, in your case unique (ID, Date) combinations, written in the time window.
Op rate is the number of actual CQL operations performed per second. With your settings the tool is running unlogged batches to insert the data, and each batch contains approximately 100 partitions (unique combinations of ID and Date), which is why Op rate * 100 ~= Partition rate.
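As a rough sanity check of these relationships, here is the arithmetic for run 1 of the results table above (a sketch only; the 500 comes from select: fixed(1)/2 applied to the 1000 Time values per partition):
# Sanity check using run 1 from the results table above.
op_rate = 19                       # unlogged batches per second
partitions_per_batch = 100         # insert: partitions: fixed(100)
rows_selected_per_partition = 500  # half of the 1000 Time values per partition
partition_rate = op_rate * partitions_per_batch           # 1900, reported: 1885
row_rate = partition_rate * rows_selected_per_partition   # 950000, reported: 943246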
Total ops counts all operations, reads and writes, so if you had any read operations those would be included as well.
I would suggest changing your batch size to match your real workload, or keeping it at one partition per batch (partitions: fixed(1) in the insert section), depending on how your application actually writes. This should give a more realistic scenario. It is also important to run much longer than just 100 total operations to get a real sense of your system's capabilities; some of the biggest difficulties appear once the dataset grows beyond the amount of RAM on the machine.
Related
I have to run 6 different calculations (sums and averages by day) on a parquet file that contains 1 year of data at the day level. The problem is that the file is too big and Jupyter crashes in the process, so I divided the file into 12 monthly parquet files. I tested whether the server could run the calculations on 1 month of data in a reasonable time, and it did. I want to avoid writing 72 different queries (6 calculations * 12 months). The result of each calculation has to be saved to a parquet file and then joined into a final table. How would you recommend automating this process in PySpark? I would appreciate any suggestions. Thanks.
This is an example of the code I have to run in each of the 12 parts of the data:
from pyspark.sql.functions import avg

month1 = spark.read.parquet("s3://af/my_folder/month1.parquet")
month1.createOrReplaceTempView("month1")
month1sum = spark.sql("select id, date, sum(sessions) as sum_num_sessions from month1 group by 1, 2 order by 1 asc")
month1sum.write.mode("overwrite").parquet("s3://af/my_folder/month1sum.parquet")
month1sum.createOrReplaceTempView("month1sum")
month_1_calculation = month1sum.groupBy('date').agg(avg('sum_num_sessions').alias('avg_sessions'))
month_1_calculation.write.mode("overwrite").parquet("s3://af/my_folder/month_1_calculation.parquet")
Quick approach: how about a for loop?
from pyspark.sql.functions import avg

for i in range(1, 13):
    # read one month of raw data and register it as a temp view
    month = spark.read.parquet(f"s3://af/my_folder/month{i}.parquet")
    month.createOrReplaceTempView(f"month{i}")
    # per-id daily session totals
    monthsum = spark.sql(f"select id, date, sum(sessions) as sum_num_sessions from month{i} group by 1, 2 order by 1 asc")
    monthsum.write.mode("overwrite").parquet(f"s3://af/my_folder/month{i}sum.parquet")
    monthsum.createOrReplaceTempView(f"month{i}sum")
    # daily average of the per-id totals
    month_calculation = monthsum.groupBy('date').agg(avg('sum_num_sessions').alias('avg_sessions'))
    month_calculation.write.mode("overwrite").parquet(f"s3://af/my_folder/month_{i}_calculation.parquet")
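Since the question also asks for the per-month results to be joined into a final table, after the loop you could read the 12 outputs back and union them. A minimal sketch, assuming the output paths written above (the final path name is a placeholder):
# Read the 12 per-month calculation files back and combine them into one final table.
final = spark.read.parquet(*[
    f"s3://af/my_folder/month_{i}_calculation.parquet" for i in range(1, 13)
])
final.write.mode("overwrite").parquet("s3://af/my_folder/final_calculation.parquet")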
Long-term approach: Spark is designed to handle big data, so no matter how large your data is, as long as you have sufficient hardware (cores and memory) and the right configuration, Spark should be able to take care of it. Adjusting the number of cores, executor memory, driver memory, and the degree of parallelism (by changing the number of partitions) should resolve your issue without having to split the file.
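As an illustration of those knobs, here is a minimal sketch of setting them when building the SparkSession (the values are placeholders, not recommendations):
from pyspark.sql import SparkSession

# Illustrative resource settings; tune them to your cluster.
spark = (SparkSession.builder
         .appName("monthly-aggregations")
         .config("spark.executor.instances", "8")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "10g")
         .config("spark.driver.memory", "8g")
         .config("spark.sql.shuffle.partitions", "400")  # shuffle parallelism
         .getOrCreate())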
I have a parquet dataframe, with the following structure:
ID String
DATE Date
480 other feature columns of type Double
I have to replace each of the 480 feature columns with their corresponding weighted moving averages, with a window of 250.
Initially, I am trying to do this for a single column, with the following simple code:
var data = sparkSession.read.parquet("s3://data-location")
var window = Window.rowsBetween(-250, Window.currentRow - 1).partitionBy("ID").orderBy("DATE")
data.withColumn("Feature_1", col("Feature_1").divide(avg("Feature_1").over(window))).write.parquet("s3://data-out")
The input data contains 20 million rows, and each ID has about 4,000-5,000 dates associated with it.
I have run this on an AWS EMR cluster (m4.xlarge instances), with the following results for one column:
4 executors x 4 cores x 10 GB + 1 GB for YARN overhead (so 2.5 GB per task, 16 concurrently running tasks): took 14 minutes
8 executors x 4 cores x 10 GB + 1 GB for YARN overhead (so 2.5 GB per task, 32 concurrently running tasks): took 8 minutes
I have tweaked the following settings, with the hope of bringing the total time down:
spark.memory.storageFraction 0.02
spark.sql.windowExec.buffer.in.memory.threshold 100000
spark.sql.constraintPropagation.enabled false
The second one helped prevent some spilling seen in the logs, but none helped with the actual performance.
I do not understand why it takes so long for just 20 million records. I know that computing the weighted moving average requires 20M x 250 (the window size) averages and divisions, but with 16 cores (first run) I don't see why it would take so long. I can't imagine how long it would take for the remaining 479 feature columns!
I have also tried increasing the default number of shuffle partitions by setting:
spark.sql.shuffle.partitions 1000
but even with 1000 partitions, it didn't bring the time down.
I also tried sorting the data by ID and DATE before calling the window aggregations, without any benefit.
Is there any way to improve this, or do window functions generally run slowly for a use case like mine? This is only 20M rows, nowhere near what Spark can process with other types of workloads.
Your dataset size is approximately 70 GB.
If I understood it correctly, for each ID the job sorts all records by date and then takes the preceding 250 records to compute the average. Since you need to apply this to more than 400 columns, I would recommend trying bucketing when the parquet files are created, to avoid the shuffling. Writing the bucketed parquet files takes a considerable amount of time, but deriving all 480 columns should then take far less than 480 x 8 minutes.
Please try bucketing, or repartitioning plus sortWithinPartitions, while creating the parquet files (a sketch follows below), and let me know if it works.
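For illustration, a minimal PySpark sketch of both suggestions (the question's code is Scala, but the DataFrameWriter API is analogous); the bucket count, table name and output path are placeholder assumptions:
raw = spark.read.parquet("s3://data-location")

# Option 1: bucket by ID and sort by DATE inside each bucket, so the window can later
# be evaluated without a full shuffle/sort. Note that bucketBy requires saveAsTable
# rather than a plain path-based parquet write.
(raw.write
    .bucketBy(64, "ID")
    .sortBy("DATE")
    .mode("overwrite")
    .saveAsTable("features_bucketed"))

# Option 2: repartition by ID and sort within partitions before writing plain parquet.
(raw.repartition("ID")
    .sortWithinPartitions("DATE")
    .write.mode("overwrite")
    .parquet("s3://data-location-sorted"))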
I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * FROM mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); causes a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 23 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
"Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * FROM mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); causes a lot of overhead on the coordinator node."
but actually you'd query 24 partitions.
What you probably meant is that you had a design where what is now 24 partitions used to be a single partition, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) the data is still ordered by timestamp, you have three choices:
Run one query at a time.
Run two queries the first time, and then one at a time, to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1, and so on. This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. The first time, you could run two queries in parallel to get the data for both hour 0 and hour 1, and start processing the data for hour 0. In the meantime, the data for hour 1 arrives, so when you finish processing hour 0 you can prefetch the data for hour 2 and start processing hour 1, and so on. In this way you can speed up data processing. Of course, depending on your timings (data processing and query times), you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need the data for both hour 3 and hour 6, then you could try to group the queries by "dependency" (e.g. query hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join the results at the application level (you already know why you should avoid IN for multiple partitions). Remember that your data is already ordered, so the application-level effort would be very small.
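For illustration, a minimal sketch of the 24-queries-in-parallel approach using the DataStax Python driver (the same pattern applies with the Java driver's executeAsync); the contact point and keyspace are placeholders:
from cassandra.cluster import Cluster

# Placeholder contact point and keyspace.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

stmt = session.prepare(
    "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?")

# Fire all 24 queries asynchronously, then gather the rows in hour order.
futures = [session.execute_async(stmt, ('x', '10-10-2016', hour)) for hour in range(24)]
rows = [row for future in futures for row in future.result()]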
I am running a large job that consolidates about 55 streams (tags) of samples (one sample per record) at irregular times over two years into 15-minute averages. There are about 1.1 billion records in 23k streams in the raw dataset, and these 55 streams make up about 33 million of those records.
I calculated a 15-minute index and am grouping by it to get the average value; however, I seem to have exceeded the maximum number of dynamic partitions on my Hive job, despite cranking it way up to 20k. I could increase it further, I suppose, but it already takes a while to fail (about 6 hours, although I reduced that to 2 by reducing the number of streams to consider), and I don't actually know how to calculate how many I really need.
Here is the code:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions=50000;
SET hive.exec.max.dynamic.partitions.pernode=20000;
DROP TABLE IF EXISTS sensor_part_qhr;
CREATE TABLE sensor_part_qhr (
    tag       STRING,
    tag0      STRING,
    tag1      STRING,
    tagn_1    STRING,
    tagn      STRING,
    timestamp STRING,
    unixtime  INT,
    qqFr2013  INT,
    quality   INT,
    count     INT,
    stdev     DOUBLE,
    value     DOUBLE
)
PARTITIONED BY (bld STRING);
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
       min(tag),
       min(tag0),
       min(tag1),
       min(tagn_1),
       min(tagn),
       min(timestamp),
       min(unixtime),
       qqFr2013,
       min(quality),
       count(value),
       stddev_samp(value),
       avg(value)
FROM sensor_part_subset
WHERE tag1 = 'Energy'
GROUP BY tag, qqFr2013;
And here is the error message:
Error during job, obtaining debugging information...
Examining task ID: task_1442824943639_0044_m_000008 (and more) from job job_1442824943639_0044
Examining task ID: task_1442824943639_0044_r_000000 (and more) from job job_1442824943639_0044
Task with the most failures(4):
-----
Task ID:
task_1442824943639_0044_r_000000
URL:
http://headnodehost:9014/taskdetails.jsp?jobid=job_1442824943639_0044&tipid=task_1442824943639_0044_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:283)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException:
[Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions.
The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode.
Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:747)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
at org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:498)
at org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:521)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:232)
... 7 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 520 Reduce: 140 Cumulative CPU: 7409.394 sec HDFS Read: 0 HDFS Write: 393345977 SUCCESS
Job 1: Map: 9 Reduce: 1 Cumulative CPU: 87.201 sec HDFS Read: 393359417 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 days 2 hours 4 minutes 56 seconds 595 msec
Can anyone give me some ideas as to how to calculate how many dynamic partitions I might need for a job like this?
Or maybe I should be doing this differently? I am running Hive 0.13, by the way, on Azure HDInsight.
Update:
Corrected some of the numbers above.
Reduced it to 3 streams operating on 211k records and it finally succeeded.
Started experimenting: reduced the partitions per node to 5k, and then to 1k, and it still succeeded.
So I am not blocked anymore, but I am thinking I would have needed millions of dynamic partitions to do the whole dataset in one go (which is what I really wanted to do).
Dynamic partition columns must be specified last among the columns in the SELECT statement (and in the same order as in the PARTITION clause) when inserting into sensor_part_qhr. In your query the last expression in the SELECT list is avg(value), so Hive uses that value as the dynamic partition value for bld and creates one partition per distinct average, which is why the partition count explodes. Select the actual building identifier as the last column and add it to the GROUP BY (assuming sensor_part_subset carries such a column), and the number of dynamic partitions will drop to the number of buildings.
I am using Cassandra in my app, and it has started eating up disk space much faster than I expected and much faster than the documentation suggests. Consider this most simple example:
CREATE TABLE sizer (
    id ascii,
    time timestamp,
    value float,
    PRIMARY KEY (id, time)
) WITH compression = {'sstable_compression': ''};
I am turning off compression on purpose to see how many bytes each record will take.
Then I insert a few values, run nodetool flush, and check the size of the data file on disk to see how much space it took.
The results show a huge waste of space: each record takes 67 bytes, and I am not sure how that is possible.
My id is 13 bytes long and is saved only once in the data file, since it is always the same for testing purposes.
According to: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/architecture/architecturePlanningUserData_t.html
Size should be:
timestamp should be 8 bytes
value as column name takes 6 bytes
column value float takes 4 bytes
column overhead 15 bytes
TOTAL: 33 bytes
For testing's sake, my id is always the same, so I actually have only 1 row, if I understood correctly.
So my question is: how do I end up using 67 bytes instead of 33?
The data file size scales as expected: I tried inserting 100, 1000 and 10000 records, and it always works out to 67 bytes per record.
There are three overheads discussed on that page. One is the column overhead, which you have accounted for. The second is the row overhead. And if your replication_factor is greater than 1, there is an overhead for that as well.