Find similar items in a dataset - search

I have a dataset of 500 mobile devices with 10 attributes, namely
Date|Company|ModelName|Price|HardDisk|RAM|Colour|Display size|Cam1|Cam2
A sample of the dataset is given below:
24/10/2015 | walmart | Samsung Galaxy Note 4 N910H 32GB Unlocked GSM OctaCore Cell Phone-N910H 32GB GOLD | 599.99 | 32 | N/A | cell gold | N/A | 10.2 | 16
25/10/2015 | walmart | Samsung Galaxy Note 5 SM-N920i Gold International Model Unlocked GSM Mobile Phone | 717.95 | 32 | N/A | gold | N/A | 5.7 | 16
26/10/2015 | amazon | T-Mobile AllShare Cast Wireless Hub | 65.15 | N/A | N/A | streaming | N/A | N/A | N/A
I have to find the most similar or unique devices, or remove duplicate mobile devices from the dataset, taking into account the various attributes of the devices.
I have explored many similarity algorithms like Jaccard similarity, cosine similarity and Levenshtein distance, but they seem to work only on attributes of the same datatype.
Please suggest an algorithm or approach that could work on this kind of mixed-datatype dataset, taking into account almost all attributes.

You can compute the hash code of each row and then use the difference of the hash codes as a similarity measure.
Obviously, this depends on all the attributes.
It is very good for finding exact duplicates!
It may not be good for your application - but you did not specify what counts as good for your application.
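A minimal sketch of the duplicate-finding part of this idea, assuming the rows come from a pipe-delimited file like the sample above (the file name and the normalization step are illustrative assumptions, not part of your setup):
import hashlib
from collections import defaultdict

def row_hash(fields):
    # Hash a normalized row; rows with identical attributes get identical hashes.
    normalized = '|'.join(f.strip().lower() for f in fields)
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()

groups = defaultdict(list)
with open('devices.psv', encoding='utf-8') as f:  # hypothetical file name
    for line_no, line in enumerate(f, start=1):
        fields = line.rstrip('\n').split('|')
        groups[row_hash(fields)].append(line_no)

# Hashes shared by more than one line mark exact duplicates.
duplicates = {h: lines for h, lines in groups.items() if len(lines) > 1}
print(duplicates)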

Related

Sizing Azure Kubernetes Services (AKS) Cluster

I am trying to size my AKS clusters. What I understood and followed is that the number of microservices and their replica counts are the primary parameters. The resource usage of each microservice, and the predicted growth of that usage over the coming years, also needs to be considered. But all this information seems too scattered to arrive at a number for AKS sizing. By sizing I mean: how many nodes should be assigned, what the configuration of the nodes should be, how many pods to plan for, how many IP addresses to reserve based on the number of pods, and so on.
Is there any standard matrix here, or a practical way of calculation to compute AKS cluster sizing, based on anyone's experience?
No, I am pretty sure there is none (and how could there be?). Just take your pod cpu/memory usage and sum it up; that gives you an estimate of the resources needed to run your stuff, then add the k8s system services on top of that (a rough sketch of this is shown below).
Also, like Peter mentions in his comment, you can always scale your cluster, so such detailed up-front planning seems a bit unreasonable.
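A back-of-the-envelope sketch of that summing approach, in Python; every number and the 20% headroom factor are placeholder assumptions, not recommendations:
import math

# Rough AKS sizing: sum pod requests across microservices, add headroom for
# system pods and growth, then divide by the allocatable capacity of one node.
microservices = [
    # (replicas, cpu_millicores_per_pod, memory_mib_per_pod) - example values
    (3, 250, 512),
    (2, 500, 1024),
    (4, 100, 256),
]

total_cpu_m = sum(r * cpu for r, cpu, _ in microservices)
total_mem_mib = sum(r * mem for r, _, mem in microservices)

# Assumed allocatable capacity of the chosen node SKU after OS/kubelet
# reservations (roughly a 2 vCPU / 8 GiB VM) - check your actual SKU.
node_cpu_m, node_mem_mib = 1900, 5500
headroom = 1.2  # assumed 20% buffer for k8s system services and growth

nodes = max(
    math.ceil(total_cpu_m * headroom / node_cpu_m),
    math.ceil(total_mem_mib * headroom / node_mem_mib),
)
print(f"estimated node count: {nodes}")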
Actually, you may be interested in the sizing of your nodes; things like memory, CPU, networking and disk are directly linked to the node you choose. For example:
Not all memory and CPU in a node can be used to run Pods. The resources are partitioned into 4:
Memory and CPU reserved for the operating system and system daemons such as SSH
Memory and CPU reserved for the kubelet and Kubernetes agents such as the CRI
Memory reserved for the hard eviction threshold
Memory and CPU available to Pods
CPU and memory available for Pods:
| Memory (GB) | % available | CPU (cores) | % available |
|-------------|-------------|-------------|-------------|
| 1 | 0.00% | 1 | 84.00% |
| 2 | 32.50% | 2 | 90.00% |
| 4 | 53.75% | 4 | 94.00% |
| 8 | 66.88% | 8 | 96.50% |
| 16 | 78.44% | 16 | 97.75% |
| 64 | 90.11% | 32 | 98.38% |
| 128 | 92.05% | 64 | 98.69% |
| 192 | 93.54% | | |
| 256 | 94.65% | | |
Other things to consider are disk and networking, for example:
| Node size | Maximum disks | Maximum disk IOPS | Maximum throughput (MBps) |
|-----------|---------------|-------------------|---------------------------|
| Standard_DS2_v2 | 8 | 6,400 | 96 |
| Standard_B2ms | 4 | 1,920 | 22.5 |
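To get a quick feel for what those percentages mean in practice, here is a small sketch that turns the memory/CPU table above into allocatable capacity per node (the percentages are copied from the table; the 8 GB / 2 vCPU example is arbitrary):
# Approximate memory and CPU left for Pods on a node, using the table above.
mem_available_pct = {1: 0.0, 2: 32.5, 4: 53.75, 8: 66.88, 16: 78.44,
                     64: 90.11, 128: 92.05, 192: 93.54, 256: 94.65}
cpu_available_pct = {1: 84.0, 2: 90.0, 4: 94.0, 8: 96.5, 16: 97.75,
                     32: 98.38, 64: 98.69}

def allocatable(memory_gb, cpu_cores):
    # Returns (GiB, cores) actually usable by Pods for the given node size.
    return (memory_gb * mem_available_pct[memory_gb] / 100,
            cpu_cores * cpu_available_pct[cpu_cores] / 100)

print(allocatable(8, 2))  # an 8 GB / 2 vCPU node -> roughly (5.35 GiB, 1.8 cores)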

Best way to filter to a specific row in pyspark dataframe

I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do a table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds while the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes only as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 were on the first partition and you had explicitly set the limit to 1,
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if the limit is smaller than the number of values that satisfy the predicate, or if the values of interest reside in further partitions, Spark will have to scan all the data.
If you want to make it work faster you'll need the data bucketed (on disk) or partitioned (in memory) using the field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the unfortunately inactive Succinct).
But really, if you need fast lookups, use a proper database - this is what they are designed for.
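As a rough sketch of the bucketing option mentioned above (the source path, table name and bucket count are assumptions; bucketBy requires writing with saveAsTable to a metastore-backed table):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical source

# Bucket (and sort) on the lookup key so an equality filter on id only has to
# read the matching bucket files instead of scanning every partition.
(df.write
   .bucketBy(256, "id")  # bucket count is an assumption - tune for your data
   .sortBy("id")
   .mode("overwrite")
   .saveAsTable("events_bucketed_by_id"))

spark.table("events_bucketed_by_id").filter("id = 1234").show()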

Efficient Cassandra DB design to retrieve a summary of time series financial data

I am looking to use the Apache Cassandra database to store a time series of 1-minute OHLCV financial data for ~1000 symbols. This will need to be updated in real time as data is streamed in. All entries where time > 24 hr old are not needed and should be discarded.
Assuming there are 1000 symbols with entries for each minute from the past 24 hrs, the total number of entries will amount to 1000*(60*24) = 1,440,000.
I am interested in designing this database to efficiently retrieve a slice of all symbols from the past [30m, 1h, 12h, 24h] with fast query times. Ultimately, I need to retrieve the OHLCV that summarises this slice. The resulting output would be {symbol, FIRST(open), MAX(high), MIN(low), LAST(close), SUM(volume)} of the slice for each symbol. This essentially summarises the 1m OHLCV entries and creates a [30m, 1h, 12h, 24h] OHLCV from the time of the query. E.g. if I want to retrieve the past 1h OHLCV at 1:32pm, the query will give me a 1h OHLCV that represents data from 12:32pm-1:32pm.
What would be a good design to meet these requirements? I am not concerned with the database's footprint on disk. The real issue is achieving fast query times that are light on CPU and RAM.
I have come up with a simple and naive way to store each record with clustering ordered by time:
CREATE TABLE symbols (
    symbol text,
    time timestamp,
    open double,
    high double,
    low double,
    close double,
    volume double,
    PRIMARY KEY (symbol, time)
) WITH CLUSTERING ORDER BY (time DESC);
But I am not sure how to select from this to meet my requirements. I would rather design it specifically for my query, and duplicate data if necessary.
Any suggestions will be much appreciated.
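For reference, once a time slice of 1-minute bars has been fetched for a symbol, the summary described above is just a small fold over that slice; a client-side sketch with a hypothetical input shape:
def summarize(bars):
    # Collapse ordered 1-minute OHLCV bars (oldest first) into a single bar.
    return {
        "open":   bars[0]["open"],                 # FIRST(open)
        "high":   max(b["high"] for b in bars),    # MAX(high)
        "low":    min(b["low"] for b in bars),     # MIN(low)
        "close":  bars[-1]["close"],               # LAST(close)
        "volume": sum(b["volume"] for b in bars),  # SUM(volume)
    }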
While not based on Cassandra, Axibase Time Series Database can be quite relevant to this particular use case. It supports SQL with time-series syntax extensions to aggregate data into periods of arbitrary length.
An OHLCV query for a 15-minute window might look as follows:
SELECT date_format(datetime, 'yyyy-MM-dd HH:mm:ss', 'US/Eastern') AS time,
FIRST(t_open.value) AS open,
MAX(t_high.value) AS high,
MIN(t_low.value) AS low,
LAST(t_close.value) AS close,
SUM(t_volume.value) AS volume
FROM stock.open AS t_open
JOIN stock.high AS t_high
JOIN stock.low AS t_low
JOIN stock.close AS t_close
JOIN stock.volume AS t_volume
WHERE t_open.entity = 'ibm'
AND t_open.datetime >= '2018-03-29T14:32:00Z' AND t_open.datetime < '2018-03-29T15:32:00Z'
GROUP BY PERIOD(15 MINUTE, END_TIME)
ORDER BY datetime
Note the GROUP BY PERIOD clause above which does all the work behind the scenes.
Query results:
| time | open | high | low | close | volume |
|----------------------|----------|---------|----------|---------|--------|
| 2018-03-29 10:32:00 | 151.8 | 152.14 | 151.65 | 152.14 | 85188 |
| 2018-03-29 10:47:00 | 152.18 | 152.64 | 152 | 152.64 | 88065 |
| 2018-03-29 11:02:00 | 152.641 | 153.04 | 152.641 | 152.69 | 126511 |
| 2018-03-29 11:17:00 | 152.68 | 152.75 | 152.43 | 152.51 | 104068 |
You can use a Type 4 JDBC driver, API clients or just curl to run these queries.
I'm using sample 1-minute data for the above example which you can download from Kibot as described in these compression tests.
Also, ATSD supports scheduled queries to materialize minutely data into OHLCV bars of longer duration, say for long-term retention.
Disclaimer: I work for Axibase.

Cassandra isolation model

I have a use case for Cassandra where I need to store multiple rows of data belonging to different customers. I'm new to Cassandra, and I need to provide a permissions model where only one customer's data is accessible at a time from a base permissions role, but all of it is accessible from a 'supervisor' role. Essentially, whenever a query is made, one customer cannot see another customer's data, except when the query is made by a supervisor. We have to enforce security as part of the design.
The data could look like this:
| id | customer name | data column1... |
|----|---------------|-----------------|
| 0  | customer1     | 3               |
| 1  | customer2     | 23              |
| 2  | customer3     | 33              |
| 3  | customer3     | 32              |
Is something like this easily doable with Cassandra?
The way you have modeled this is a perfectly good way to do multi-tenancy. This is how UserGrid models multiple tenants, and it is used in several large-scale applications.
A couple of drawbacks to be up-front about:
It doesn't help with a "noisy neighbor" problem and unequal tenants
Application code has to manage the tenant security (a rough sketch of this is shown below)
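As a rough sketch of that second drawback, here is what an application-level tenant check could look like, assuming a hypothetical table customer_data partitioned by customer_name and the DataStax Python driver:
from cassandra.cluster import Cluster

# Hypothetical keyspace and table: customer_data is partitioned by customer_name.
session = Cluster(['127.0.0.1']).connect('my_keyspace')

def fetch_customer_rows(role, own_customer, target_customer):
    # Application-level tenant check: only a supervisor may read another tenant.
    if role != 'supervisor' and own_customer != target_customer:
        raise PermissionError('cross-tenant access denied')
    return session.execute(
        "SELECT * FROM customer_data WHERE customer_name = %s",
        (target_customer,))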

Read time increases by 1ms for every 100 rows

I have been stuck with this issue for almost a week now and would like your suggestions and help. I am getting read latency problems even for a simple table. I created a simple table with 4k rows: reading 500 rows takes about 5 ms, increasing to 1000 rows takes ~10 ms, and reading all 4k rows takes around 50 ms. I tried checking stats, network, iostat, tpstats and heap, but couldn't get a clue as to what the issue is. Could anyone help me with what more I need to do to resolve this high-priority issue assigned to me? Thank you very much in advance.
Tracing session: b4287090-0ea5-11e5-a9f9-bbcaf44e5ebc
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
Execute CQL3 query | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 0
Parsing select * from location_eligibility_by_type12; [SharedPool-Worker-1] | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 33
Preparing statement [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 62
Computing ranges to query [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 101
Submitting range requests on 1537 ranges with a concurrency of 1537 (1235.85 rows per range expected) [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 314
Submitted 1 concurrent range requests covering 1537 ranges [SharedPool-Worker-1] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 6960
Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] [SharedPool-Worker-2] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 7033
Read 4007 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-09 07:47:36.045000 | 10.65.133.202 | 84055
Scanned 1 rows and matched 1 [SharedPool-Worker-2] | 2015-06-09 07:47:36.046000 | 10.65.133.202 | 84109
Request complete | 2015-06-09 07:47:36.052498 | 10.65.133.202 | 91498
Selecting lots of rows in Cassandra often takes unpredictably long, since the query gets routed to more machines.
It's best to avoid such schemas if you need high read performance. A better approach is to store the data in a single wide row and spread the read load between nodes by using a higher replication factor. Wide rows are generally preferable: http://www.slideshare.net/planetcassandra/cassandra-summit-2014-39677149
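A rough sketch of that single-partition idea, assuming a hypothetical wide-row layout and the DataStax Python driver (the table, bucket key and columns are illustrative):
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# Hypothetical wide-row layout: rows that are read together share one bucket
# partition key, so the read becomes a single-partition slice instead of a
# cluster-wide range scan like the 1537-range scan in the trace above.
session.execute("""
    CREATE TABLE IF NOT EXISTS eligibility_wide (
        bucket int,
        id uuid,
        payload text,
        PRIMARY KEY (bucket, id)
    )""")

rows = session.execute(
    "SELECT * FROM eligibility_wide WHERE bucket = %s LIMIT 500", (0,))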
