I am working with Redis and want to store more than 1 million keys, each as a hash with 5 fields.
At first it looked like this:
key : {
a : 1,
b : 2,
c : 3,
d : 4,
e : 5
}
1M keys used about 130MB of memory. Then I found a post from the Instagram developers: http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value
So I adapted my approach to:
key1 : {
1a : 1,
1b : 2,
1c : 3,
1d : 4,
1e : 5,
2a : 1,
...
}
key2 : {
1a : 1,
1b : 2,
1c : 3,
1d : 4,
1e : 5,
2a : 1,
...
}
...
This used more than 200MB for just 500k entries. So is there a better structure within Redis for my data?
//EDIT
OK, so I have an object named 'user' that I want to save in Redis with 5 fields ('name', 'email', 'age', 'gender', 'active'). At first I used the HMSET command to create an entry in Redis via the Node.js 'redis' package:
userId : {
name : Emil Example,
email : email@stackoverflow.com,
age : 20,
gender : m,
active : true
}
So I created 1M entries of dummy data and checked the memory usage via the 'INFO' command: about 130MB. Then I read the Instagram blog and their use of buckets, so I implemented it with my data (userID: '12345'):
12 : {
345name : Emil Example,
345email : email@stackoverflow.com,
345age : 20,
345gender : m,
345active : true
}
Of course, every userID beginning with '12' falls into this bucket. After a run with 500k dummy entries, I had a usage of more than 500MB. So I am asking myself whether I did something wrong, or whether there are other optimization options for my case.
P.S.: Of course I adjusted the 'hash-zipmap-max-entries' setting in the config file and started Redis with it.
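For reference, writing one bucketed user from redis-cli looks like this (same data as above, with userID '12345' split into bucket key '12' and field prefix '345'):
HMSET 12 345name "Emil Example" 345email "email@stackoverflow.com" 345age 20 345gender m 345active true
HGET 12 345name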
For example, suppose I have a basic_info type:
CREATE TYPE basic_info (first_name text, last_name text, nationality text)
And a table like this:
CREATE TABLE student_stats (id int PRIMARY KEY, grade text, basics FROZEN<basic_info>)
And I have millions of records in the table.
If I add a field to basic_info like this:
ALTER TYPE basic_info ADD address text;
I want to ask: what happens in Cassandra when you add a new field to a UDT that is currently used as a column in a table? The reason for this question is that I am afraid some side effects will happen when the table contains a lot of data (millions of records). It would be best if you could explain what happens from start to end.
The fields of a UDT are described in the table system_schema.types. When you add a new field, the entry for that type is updated inside Cassandra, but no changes to the data on disk happen (SSTables are immutable). Instead, when Cassandra reads data, it checks whether the field is present, and if it is not (because it was never set, or it's a new field of the UDT), it returns null for that value without modifying the data on disk.
For example, if I have the following type and a table that uses it:
CREATE TYPE test.udt (
id int,
t1 int
);
CREATE TABLE test.u2 (
id int PRIMARY KEY,
u udt
)
And I have some data in the table, so I get:
cqlsh> select * from test.u2;
id | u
----+----------------
5 | {id: 1, t1: 3}
If I add a field to the UDT with alter type test.udt add t2 int;, I immediately see null as the value of the new UDT field:
cqlsh> select * from test.u2;
id | u
----+--------------------------
5 | {id: 1, t1: 3, t2: null}
And if I do sstabledump on the SSTable, I can see that it contains only old data:
[
{
"partition" : {
"key" : [ "5" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 46,
"liveness_info" : { "tstamp" : "2019-07-28T09:33:12.019Z" },
"cells" : [
{ "name" : "u", "path" : [ "id" ], "value" : 1 },
{ "name" : "u", "path" : [ "t1" ], "value" : 3 }
]
}
]
}
]
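If you want to double-check the updated type definition itself, it is visible in the system_schema.types table mentioned above (a quick check, assuming Cassandra 3.x or later where that table exists):
cqlsh> select field_names, field_types from system_schema.types where keyspace_name = 'test' and type_name = 'udt';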
See also my answer about adding/removing columns
I have 20 columns in a table in Cassandra. Will there be a performance impact in performing
select * from table where partitionKey = 'test';
I am not able to figure this out from this link:
https://wiki.apache.org/cassandra/CassandraLimitations
1) What will be the consequences of having too many columns (say 20) in a Cassandra table?
Unless you have a lot of rows in the partition, I don't see an impact from having 20 columns. As stated in the documentation that you linked:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
So, unless you are expecting to have more than 100 million rows in a single partition (2 billion cells / 20 columns = 100 million rows), I don't see why 20 columns would be an issue. Keep in mind that Cassandra is a column-family store, which means that Cassandra can store a large number of columns per partition.
Having said that, I would personally recommend not going over 100 MB per partition; it might cause you problems in the future with streaming during repairs.
===============================
To answer your comment: keep in mind that partitions and rows are two different things in Cassandra. A partition is only equal to a row if there are no clustering columns. For instance, take a look at this table creation and the values we insert, and then look at the sstabledump:
CREATE TABLE tt2 (foo int, bar int, mar int, PRIMARY KEY (foo, bar));
INSERT INTO tt2 (foo, bar, mar) VALUES (1, 2, 3);
INSERT INTO tt2 (foo, bar, mar) VALUES (1, 3, 4);
sstabledump:
./cassandra/tools/bin/sstabledump ~/cassandra/data/data/tk/tt2-1386f69005bd11e89c0bbfb5c1157523/mc-1-big-Data.db
[
{
"partition" : {
"key" : [ "1" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 32,
"clustering" : [ "2" ],
"liveness_info" : { "tstamp" : "2018-01-30T12:57:36.362483Z" },
"cells" : [
{ "name" : "mar", "value" : 3 }
]
},
{
"type" : "row",
"position" : 32,
"clustering" : [ "3" ],
"liveness_info" : { "tstamp" : "2018-01-30T12:58:03.538482Z" },
"cells" : [
{ "name" : "mar", "value" : 4 }
]
}
]
}
]
Also, if you use the -d option, it might make it easier for you to see the internal representation. As you can see, for the same partition, we have 2 distinct rows:
./cassandra/tools/bin/sstabledump -d ~/cassandra/data/data/tk/tt2-1386f69005bd11e89c0bbfb5c1157523/mc-1-big-Data.db
[1]#0 Row[info=[ts=1517317056362483] ]: 2 | [mar=3 ts=1517317056362483]
[1]#32 Row[info=[ts=1517317083538482] ]: 3 | [mar=4 ts=1517317083538482]
I have a simple job with trigger = 15 seconds, source = Kafka and sink = S3. Is it possible to find how much time it took to download messages from Kafka? Or, say, if I had sink = Console, which brings the data back to the driver, is it possible to find how much time it took to download the data from Kafka and how much time to bring it back to the driver?
From the driver I get the following progress for the query while writing to S3. Is it possible to tell how much of the triggerExecution = 44 seconds was spent downloading the 99998 rows from Kafka?
Streaming query made progress: {
  "id" : "1383g52b-8de4-4e95-a3s9-aea73qe3ea56",
  "runId" : "1206f5tc-t503-44r0-bc0c-26ce404w6724",
  "name" : null,
  "timestamp" : "2017-08-25T01:42:10.000Z",
  "numInputRows" : 99998,
  "inputRowsPerSecond" : 1666.6333333333334,
  "processedRowsPerSecond" : 2263.9860535669814,
  "durationMs" : {
    "addBatch" : 42845,
    "getBatch" : 3,
    "getOffset" : 68,
    "queryPlanning" : 6,
    "triggerExecution" : 44169,
    "walCommit" : 1245
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[kafka_topic]]",
    "startOffset" : {
      "kafka_topic" : {
        "2" : 20119244,
        "4" : 20123550,
        "1" : 20124601,
        "3" : 20113622,
        "0" : 20114208
      }
    },
    "endOffset" : {
      "kafka_topic" : {
        "2" : 20139245,
        "4" : 20143531,
        "1" : 20144592,
        "3" : 20133663,
        "0" : 20134192
      }
    },
    "numInputRows" : 99998,
    "inputRowsPerSecond" : 1666.6333333333334,
    "processedRowsPerSecond" : 2263.9860535669814
  } ],
  "sink" : {
    "description" : "FileSink[s3://s3bucket]"
  }
}
Thanks!
You should find the answers to your questions by reviewing StreamingQuery.lastProgress.durationMs.
In the order of their calculation, the durations tell you the following:
getOffset is the time to get the offsets from all the sources.
getBatch is the time to get the streaming Datasets (aka batches) from all the sources (one by one, sequentially).
addBatch is the time to write the streaming Dataset to the sink.
With that said...
Is it possible to find how much time it took to download messages from Kafka?
That's the addBatch duration (since that's when the Dataset gets executed as an RDD on the executors).
Is it possible to tell how much of the triggerExecution = 44 seconds was spent downloading the 99998 rows from Kafka?
You'd have to sum the addBatch durations from the StreamingQuery.recentProgress array.
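For example, a minimal sketch (it assumes query is your running StreamingQuery handle; durationMs is a java.util.Map, hence the null-safe lookup):
import org.apache.spark.sql.streaming.StreamingQuery

// Total time spent in addBatch (writing each micro-batch to the sink)
// across the query's recent progress snapshots.
def totalAddBatchMs(query: StreamingQuery): Long =
  query.recentProgress
    .map(p => Option(p.durationMs.get("addBatch")).map(_.longValue).getOrElse(0L))
    .sum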
Since the reading from Kafka and the processing of the read records are pipelined, it is pretty hard to find the exact time taken to read.
And many times this is not important because processing is the bottleneck rather than reading from Kafka. So the real question is, why do you care about the exact Kafka read time?
I am new to Elasticsearch. I have a huge amount of data to index using Elasticsearch.
I am using Apache Spark to index the data from a Hive table into Elasticsearch.
As part of this functionality, I wrote a simple Spark script:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // provides saveJsonToEs on RDD[String]

object PushToES {
  def main(args: Array[String]) {
    val Array(inputQuery, index, host) = args
    val sparkConf = new SparkConf().setMaster("local[1]").setAppName("PushToES")
    sparkConf.set("....", host)
    sparkConf.set("....", "9200")
    val sc = new SparkContext(sparkConf)
    val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val ps = hiveSqlContext.sql(inputQuery)
    ps.toJSON.saveJsonToEs(index)
  }
}
After that I generate a jar and submit the job using spark-submit:
spark-submit --jars ~/*.jar --master local[*] --class com.PushToES *.jar "select * from gtest where day=20170711" gest3 localhost
Then I execute the command below to check the count:
curl -XGET 'localhost:9200/test/test_test/_count?pretty'
The first time, it shows the count properly:
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I execute the same curl command a second time, it gives a result like below:
{
"count" : 20,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
If I execute the same command a third time, I get:
{
"count" : 30,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
But I don't understand why it adds the count to the existing count in the index every time.
Please let me know how I can resolve this issue, i.e. no matter how many times I execute the job, I should get the same (correct) count value of 10.
I am expecting the result below for this case, because the correct count value is 10 (a count(*) query on the Hive table returns 10 every time).
{
"count" : 10,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
Thanks in advance.
If you want to "replace" the data each time you run, and not "append" to it, then you have to configure your Spark Elasticsearch properties for that scenario.
The first thing you need to do is to have an ID in your documents and to tell Elasticsearch which is your id "column" (if you come from a DataFrame) or key (in JSON terms).
This is documented here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
For cases where the id (or other metadata fields like ttl or timestamp) of the document needs to be specified, one can do so by setting the appropriate mapping namely es.mapping.id. Following the previous example, to indicate to Elasticsearch to use the field id as the document id, update the RDD configuration (it is also possible to set the property on the SparkConf though due to its global effect it is discouraged):
EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "id"))
A second configuration key is available to control what kind of operation elasticsearch-hadoop performs when writing data, but the default is correct for your use case:
es.write.operation (default index)
The write operation elasticsearch-hadoop should perform - can be any of:
index (default)
new data is added while existing data (based on its id) is replaced (reindexed).
create
adds new data - if the data already exists (based on its id), an exception is thrown.
update
updates existing data (based on its id). If no data is found, an exception is thrown.
upsert
known as merge or insert if the data does not exist, updates if the data exists (based on its id).
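Applied to your job, a minimal sketch (not a drop-in replacement: it assumes your Hive rows carry a unique column, here hypothetically called user_id, and that es.nodes/es.port are already set on the SparkConf as in your code):
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

// Keying each document by a unique column makes re-runs overwrite
// the same documents instead of appending new ones.
val ps = hiveSqlContext.sql(inputQuery)
ps.saveToEs(index, Map("es.mapping.id" -> "user_id"))
If you prefer to keep the toJSON/saveJsonToEs path, the same Map("es.mapping.id" -> "user_id") configuration can be passed to saveJsonToEs as well.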
Is there any way to range query rows with a composite row key when using random partitioning?
I'm working with column families created via CQL v3 like this:
CREATE TABLE products (
    rowkey CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type) PRIMARY KEY,
    prod_id varchar,
    class_id varchar,
    date varchar
);
The data in the table looks like this:
RowKey: 6:3:2:19
=> (column=class_id, value=254, timestamp=1346800102625002)
=> (column=date, value=2034, timestamp=1346800102625000)
=> (column=prod_id, value=1922, timestamp=1346800102625001)
-------------------
RowKey: 0:14:1:16
=> (column=class_id, value=144, timestamp=1346797896819002)
=> (column=date, value=234, timestamp=1346797896819000)
=> (column=prod_id, value=4322, timestamp=1346797896819001)
-------------------
I'm trying to find a way to range query over these composite row keys, analogous to how we slice query over composite columns. The following approach sometimes actually succeeds in returning something useful, depending on the start and stop keys I choose:
Composite startKey = new Composite();
startKey.addComponent(0, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(1, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(2, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(3, "3", Composite.ComponentEquality.EQUAL);
Composite stopKey = new Composite();
stopKey.addComponent(0, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(1, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(2, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(3, "6" , Composite.ComponentEquality.GREATER_THAN_EQUAL);
RangeSlicesQuery<Composite, String, String> rangeSlicesQuery =
HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily(columnFamilyName);
rangeSlicesQuery.setKeys(startKey,stopKey);
rangeSlicesQuery.setRange("", "", false, 3);
Most of the time the database returns this:
InvalidRequestException(why:start key's md5 sorts after end key's md5.
this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
Does somebody have an idea whether something like this can be achieved WITHOUT using the order-preserving partitioner? Do I have to build a custom row key index for this use case?
Thanks a lot!
Additional information:
What I'm trying to do is store sales transaction data in a table which uses both composite row keys to encode date/time/place and composite columns to store information about the sold items.
The set of items per transaction varies in size and includes information about the size, color and quantity of every item:
{ ... items :
[ { item_id : 43523 , size : 050 , color : 123 , qty : 1 } ,
{ item_id : 64233 , size : 048 , color : 834 , qty : 1 } ,
{ item_id : 23984 , size : 000 , color : 341 , qty : 3 } ,
… ] }
There’s also information about where and when the transaction happened including a unique transaction id:
{ trx_id : 23324827346, store_id : 8934 , date : 20110303 , time : 0947 , …
My initial approach was to put every item in a separate row and let the application group items back together by transaction id. That's working fine. But now I'm trying to leverage the structuring capabilities of composite columns to persist the nested item data within a representation (per item) like this:
item_id:’size’ = <value> ; item_id:’color’ = <value> ; item_id:’qty’ = <value> ; …
43523:size = 050 ; 43523:color = 123 ; 43523:qty = 1 ; …
The rest of the data would be encoded in a composite row key like this:
date : time : store_id : trx_id
20110303 : 0947 : 001 : 23324827346
I need to be able to run queries like: all items that were sold between the dates 20110301 and 20110310, between times 1200 and 1400, in stores 25-50. What I have achieved so far with composite columns is using one wide row per store and putting all the rest of the data into 3 different composite columns per item:
date:time:<type>:prod_id:transaction_id = <value> ; …
20110303:0947:size:43523:23324827346 = 050 ;
20110303:0947:color:43523:23324827346 = 123 ;
20110303:0947:qty:43523:23324827346 = 1 ;
It’s working, but it doesn’t really look highly efficient.
Is there any other alternative?
You're creating one row per partition, so it should be clear that RandomPartitioner will not give you ordered range queries.
You can do ordered ranges within a partition, which is very common, e.g. http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
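For illustration, a hedged CQL 3 sketch of that pattern applied to your data (hypothetical table and column names; one partition per store and date, ordered by time within the partition):
CREATE TABLE sales_by_store_date (
    store_id int,
    date     text,
    time     text,
    trx_id   bigint,
    item_id  int,
    size     text,
    color    text,
    qty      int,
    PRIMARY KEY ((store_id, date), time, trx_id, item_id)
);

-- ordered range query inside a single partition
SELECT * FROM sales_by_store_date
WHERE store_id = 25 AND date = '20110303'
  AND time >= '1200' AND time < '1400';
The stores 25-50 part would then be one such query per store (client-side fan-out), since RandomPartitioner does not order partitions.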