I have about 1 million records of juridical judgments that I want to manage in Spark/Hadoop. My question is: can I query Spark/Hadoop to get all records at once, like a full table scan? And can I paginate efficiently, for example fetching records 800,000 to 800,050?
My problem is that I use Elasticsearch for full-text search, but to get results 800,000 to 800,050 I am forced to use the scroll API, which is very slow because it starts at 0 and walks forward in chunks of 10,000 records. My goal is to "jump" straight to record 800,000 without paging through chunks of 10,000.
Yes, Hive or Spark SQL can be used to query offset ranges of a dataset, but they won't help with full-text search out of the box.
MongoDB can do both, since it offers full-text search indexes in addition to ordinary pagination, much as Elasticsearch does with Lucene.
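As for the Hive/Spark SQL route, a rough sketch of offset-range pagination (the table and column names here are assumptions, not from the question) is to number the rows once by a stable ordering and then select the desired range:

-- Illustrative only: "judgments" and "judgment_id" are assumed names.
SELECT *
FROM (
    SELECT j.*,
           ROW_NUMBER() OVER (ORDER BY judgment_id) AS rn
    FROM judgments j
) numbered
WHERE rn BETWEEN 800000 AND 800050;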
I am using the following Cassandra data model:
ruleid - bigint
patternid - bigint
key - string
value - string
time - timestamp
event_uuid - time-based uuid
partition key - (ruleid, patternid)
clustering key - event_uuid, ordered descending
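In CQL, that model corresponds roughly to the following table (the table name is an assumption; CQL's text type stands in for string):

CREATE TABLE events_by_pattern (
    ruleid      bigint,
    patternid   bigint,
    key         text,
    value       text,
    time        timestamp,
    event_uuid  timeuuid,
    PRIMARY KEY ((ruleid, patternid), event_uuid)
) WITH CLUSTERING ORDER BY (event_uuid DESC);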
Our ingestion rate is around 100 records per second per pattern id, and there might be 10,000+ pattern ids.
Our query is fairly straightforward: we fetch the last 100,000 records, ordered by the descending uuid, filtered by the partition key.
For our use case we also need to perform around 5 deletes per second per pattern id on this table.
However, this creates tombstones and causes read timeouts when querying the datastore again.
How can we overcome this issue?
It sounds like you are storing records into the table, doing some transformation/processing on the records, then deleting them.
But since you're deleting rows within partitions (instead of the partitions themselves), you have to iterate over the deleted rows (tombstones) to get to the live records.
The real problem, though, is reading too many rows at once, which won't perform well. Retrieving 100K rows in a single query is going to be slow, so consider paging through the result set.
With the limited information you've provided, this is not an easy problem to solve. Cheers!
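A sketch of what that paging could look like at the CQL level, using the clustering key as the page boundary (the page size and bind markers are illustrative; the drivers can also page transparently for you via their fetch size):

-- First page: newest events for one (ruleid, patternid) partition
SELECT * FROM events_by_pattern
WHERE ruleid = ? AND patternid = ?
LIMIT 1000;

-- Subsequent pages: restart just below the last event_uuid returned by the previous page
SELECT * FROM events_by_pattern
WHERE ruleid = ? AND patternid = ?
  AND event_uuid < ?
LIMIT 1000;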
I am basically substituting for another programmer.
Problem Description:
There are 11 Hive tables, each with 8 to 11 columns. All of these tables share around 5 columns whose names are the same but whose values differ.
For example, Table A has mobile_no, date, and duration columns, and so does Table B, but the values are not the same. The other columns have different names from table to table.
In all tables the data types are string, integer, and double, i.e. simple types. String values are at most 100 characters long.
Each table contains around 50 million rows. My requirement is to merge these 11 tables, keeping their columns as they are, into one big table.
Our Spark cluster has 20 physical servers, each with 36 cores (72 counting virtual cores) and 512 GB of RAM. The Spark version is 2.2.x.
I have to merge them efficiently in terms of both memory and speed.
Can you help me with this problem?
N.B.: please let me know if you have questions.
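One rough way to sketch this in Spark SQL (purely illustrative, not a vetted answer; apart from mobile_no, date, and duration every name below is made up) is a CREATE TABLE ... AS SELECT over a UNION ALL of the aligned columns:

CREATE TABLE merged_table AS
SELECT mobile_no, `date`, duration, other_a AS extra_col, 'table_a' AS source_table FROM table_a
UNION ALL
SELECT mobile_no, `date`, duration, other_b AS extra_col, 'table_b' AS source_table FROM table_b
UNION ALL
SELECT mobile_no, `date`, duration, CAST(NULL AS STRING) AS extra_col, 'table_c' AS source_table FROM table_c;
-- ...and so on for the remaining tables, padding columns a table lacks with typed NULLs
-- so that every branch of the UNION ALL has the same column list.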
I have a table in my database that is indexed over three columns: PropertyId, ConceptId, and Sequence. It has about 90,000 rows.
Now, when I run this query, the total time required is greater than 2 minutes:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
However, if I paginate the query like so:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the aggregate time (x goes from 0 to 8) required is only around 20 seconds.
This seems counterintuitive to me: pagination requires additional operations over and above the simpler query, and we are adding the latency of sequential network calls on top, because I haven't parallelized this query at all. And I know it's not a caching issue, because running these queries one after the other does not change the latencies very much.
So, my question is this: why is one so much faster than the other?
This seems counterintuitive to me because the pagination requires additional operations over and beyond simpler queries
Pagination queries sometimes work very fast, if you have the right index...
For example, with the query below
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the maximum number of rows you might read is only 20,000. Below is an execution-plan fragment which demonstrates this:
RunTimeCountersPerThread Thread="0" ActualRows="60" ActualRowsRead="60"
but with the plain SELECT * query you are reading all of the rows.
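For reference, the "right index" here is one whose key order matches the ORDER BY; a sketch (the index name is illustrative):

-- Index whose key order matches the ORDER BY (name is made up)
CREATE NONCLUSTERED INDEX IX_MSC_NPV_Prop_Concept_Seq
    ON MSC_NPV (PropertyId, ConceptId, Sequence);
-- With it in place, OFFSET ... FETCH can walk the index in order and stop after
-- offset + fetch rows instead of sorting the whole table first.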
After a prolonged search into what's going on here, I discovered that the reason behind this difference in performance (> 2 minutes) was that the database is hosted on Azure. Since Azure partitions any tables you host on it across multiple partitions (i.e. multiple machines), running a query like:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
would run more slowly, because the query pulls data in from all the partitions before ordering it, which could result in multiple queries across multiple partitions of the same table. By paginating the query over the indexed properties I was looking at a particular partition and querying the part of the table stored there, which is why it performed significantly better than the un-paginated query.
To prove this, I ran another query:
SELECT *
FROM MSC_NPV
ORDER BY Narrative
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
This query ran anemically compared to the first paginated query, because Narrative is not part of the primary key and therefore is not used by Azure to build a partition key. So ordering on Narrative required the same work as the first query, plus additional operations on top of that, because the entire table had to be retrieved first.
I'm processing a Twitter feed by storing the tweets in a table in MemSQL. The table has fields like tweet_id, posted_time, body, etc.
The table grows by around 5 million tweets per day, for a total of about a billion tweets stored for the whole period so far.
The table is stored as a columnstore, with tweet_id as the shard key and posted_time as the columnstore clustering column.
It is working fine for all real-time analytics so far, and returns answers in under a second if you query a single day. The wider your date filters, the slower the query.
The requirement is to generate a word cloud from the body field of the tweets. My question is: what is the best way to do it? I need the query to be efficient (seconds, not minutes).
Keep in mind the following:
Joins are not efficient on a table this big.
Taking the body field of a few million tweets, breaking it down into words, and then aggregating the words to find the top ones is not efficient.
I believe a separate table will be needed. What could the design for this table be? Suggestions, please.
Finally, my MemSQL cluster has 5 nodes, with a total of 1 TB of RAM and 192 cores.
I don't think MemSQL is the best way to do this. Your best bet is to index the tweets with a search server/library like Apache Solr, or to use Apache Lucene directly as your backend. That way, the queries needed for a word cloud, such as "give me the counts of the top-ranked n words sorted by count", would return in seconds.
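If the data has to stay inside MemSQL, the separate table the question hints at could be a pre-tokenized word-count table maintained at ingest time. A rough sketch only, with every name and size made up:

-- One row per (day, word), populated by the ingestion pipeline that tokenizes tweet bodies
CREATE TABLE tweet_words (
    posted_day DATE NOT NULL,
    word       VARCHAR(64) NOT NULL,
    cnt        BIGINT NOT NULL,
    SHARD KEY (word),
    KEY (posted_day) USING CLUSTERED COLUMNSTORE
);

-- Word cloud for a date range: top 100 words by count
SELECT word, SUM(cnt) AS total
FROM tweet_words
WHERE posted_day BETWEEN '2017-01-01' AND '2017-01-31'
GROUP BY word
ORDER BY total DESC
LIMIT 100;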
I construct a SELECT query using the DataStax Java driver and set the limit using the LIMIT option. But I see another property that can be set too:
setFetchSize(int size)
DEFAULT_FETCH_SIZE is 5000 according to the docs:
http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/QueryOptions.html#DEFAULT_FETCH_SIZE
Does this mean that if I have around 10,000 columns in a row (i.e. rows in the partition) and I run a query with a limit of 3, it will always fetch the default of 5,000 rows and then take the last 3 rows from that?
I thought that with LIMIT the query fetches just the 3 values by default when used like this. Can someone clarify?
LIMIT sets the maximum number of rows the query returns in total, while setFetchSize sets the maximum number of rows that are returned to the client in one round trip (one page).
In Cassandra, LIMIT does not work quite the same as the LIMIT of MySQL or other RDBMSs.
By default, when you execute a SELECT query in cqlsh it only displays up to 10,000 rows, and then shows a message that the output was cut off at that limit. So if you have 50,000 records in the database and you execute a plain SELECT, only 10,000 records will appear; if you instead run SELECT * FROM table LIMIT 50000, all 50,000 rows will be displayed.
I think the difference between the fetch size and the limit is the same as in JDBC with other databases like MySQL.
So, the LIMIT will limit your query results to those that fall within the range specified.
The fetch size is the number of rows physically retrieved from the database at one time by the driver as you scroll through a query ResultSet with next().
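To make the distinction concrete, a small sketch (the table name and values are made up):

-- LIMIT is part of the query itself: at most 3 rows come back in total.
SELECT * FROM events WHERE id = 'abc' LIMIT 3;
-- The fetch size never appears in the CQL text; it is set on the driver
-- (e.g. setFetchSize(5000)) and only controls how many of the matching rows
-- are shipped to the client per network round trip.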