Understanding the time and space complexity of a NetworkX function in Python in relation to system configuration - python-3.x

I am trying to find all the cycles in my data using the networkx.simple_cycles() function in Python. As the NetworkX documentation puts it, "This is a non-recursive, iterator/generator version of Johnson's algorithm". Johnson's algorithm has a time complexity of O((nodes + edges) * (cycles + 1)) and a space complexity of O(nodes + edges).
I have created a directed graph from transaction data, with each edge going from a From node to a To node. I am removing the nodes which have an in-degree or out-degree of 0, as they cannot contribute to cycles (elementary circuits); a sketch of this approach is shown below.
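For reference, here is a minimal sketch of that approach; the column names From/To, the CSV input, and the repeated pruning loop are assumptions for illustration, not necessarily the exact code used:

import networkx as nx
import pandas as pd

# Hypothetical transaction data with "From" and "To" columns
df = pd.read_csv("transactions.csv")
G = nx.from_pandas_edgelist(df, source="From", target="To", create_using=nx.DiGraph())

# Repeatedly drop nodes with in-degree or out-degree 0; they cannot lie on a cycle
while True:
    dead_ends = [n for n in G if G.in_degree(n) == 0 or G.out_degree(n) == 0]
    if not dead_ends:
        break
    G.remove_nodes_from(dead_ends)

# simple_cycles returns a generator; iterating it avoids holding every cycle in memory
cycle_count = sum(1 for _ in nx.simple_cycles(G))
print(cycle_count)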
My graph details after removing the nodes with in-degree or out-degree 0 are as follows:
Records        Nodes remaining   Edges   Cycles
1.09 million   39278             26324   no output (the code keeps running)
1.08 million   38664             25710   5612438
1.05 million   36737             23784   69671
1.01 million   34393             21566   3079
1.00 million   33841             21125   3072
I want to understand how time and space complexity relate to system configuration. Given all these details about an algorithm, how can I work out what system configuration will be suitable for a particular size of data? Why am I not able to run this algorithm beyond 1.08 million records?
My system configuration:
AMD PRO A12-8800B (2.1 GHz, up to 3.4 GHz, 2MB Cache, 4 Cores) with AMD Radeon R7 Graphics.
8GB RAM.
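Plugging the figures above into Johnson's bound gives a rough, hedged sense of why the largest run never finishes; the bound is only a proxy for actual work, not a precise prediction:

# Johnson's time bound is O((n + e) * (c + 1)); space for the algorithm itself stays O(n + e)
datasets = {
    "1.00M records": (33841, 21125, 3072),
    "1.01M records": (34393, 21566, 3079),
    "1.05M records": (36737, 23784, 69671),
    "1.08M records": (38664, 25710, 5612438),
}
for name, (n, e, c) in datasets.items():
    print(name, (n + e) * (c + 1))

The bound grows from roughly 1.7e8 at 1.00 million records to roughly 3.6e11 at 1.08 million, i.e. about 2000 times more work for an 8% larger dataset, because the cycle count explodes; at 1.09 million records the cycle count is unknown and may be far larger still. Memory mainly becomes a problem if the cycles are materialized (for example with list(nx.simple_cycles(G))): storage then grows with the total length of all cycles and can approach or exceed 8 GB of RAM, whereas the generator itself stays within O(nodes + edges).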

Related

Spark Window performance issues

I have a parquet dataframe, with the following structure:
ID String
DATE Date
480 other feature columns of type Double
I have to replace each of the 480 feature columns with their corresponding weighted moving averages, with a window of 250.
Initially, I am trying to do this for a single column, with the following simple code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

var data = sparkSession.read.parquet("s3://data-location")
var window = Window.rowsBetween(-250, Window.currentRow - 1).partitionBy("ID").orderBy("DATE")
data.withColumn("Feature_1", col("Feature_1").divide(avg("Feature_1").over(window))).write.parquet("s3://data-out")
The input data contains 20 million rows, and each ID has about 4,000-5,000 associated dates.
I have run this on an AWS EMR cluster (m4.xlarge instances), with the following results for one column:
4 executors x 4 cores x 10 GB + 1 GB for YARN overhead (so 2.5 GB per task, 16 concurrently running tasks), took 14 minutes
8 executors x 4 cores x 10 GB + 1 GB for YARN overhead (so 2.5 GB per task, 32 concurrently running tasks), took 8 minutes
I have tweaked the following settings, with the hope of bringing the total time down:
spark.memory.storageFraction 0.02
spark.sql.windowExec.buffer.in.memory.threshold 100000
spark.sql.constraintPropagation.enabled false
The second one helped prevent some spilling seen in the logs, but none helped with the actual performance.
I do not understand why it takes so long for just 20 million records. I know that to compute the weighted moving average it needs to do 20M x 250 (the window size) averages and divisions, but with 16 cores (first run) I don't see why it would take so long. I can't imagine how long it would take for the remaining 479 feature columns!
I have also tried increasing the default shuffle partitions, by setting:
spark.sql.shuffle.partitions 1000
but even with 1000 partitions, it didn't bring the time down.
I also tried sorting the data by ID and DATE before calling the window aggregations, without any benefit.
Is there any way to improve this, or do window functions generally run slowly for my use case? This is only 20M rows, nowhere near what Spark can process with other types of workloads.
Your dataset size is approximately 70 GB.
If I understood it correctly, for each ID it sorts all the records by date and then takes the preceding 250 records to compute the average. Since you need to apply this to more than 400 columns, I would recommend trying bucketing when the parquet files are created, to avoid the shuffle. Writing the bucketed parquet file takes a considerable amount of time, but deriving all 480 columns should then take much less than 8 minutes x 480.
Please try bucketing, or repartition plus sortWithinPartitions, while creating the parquet file, and let me know if it works; a rough sketch follows below.
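The original snippet is Scala, but purely as an illustration of the suggestion above, here is a hedged PySpark sketch; the bucket count of 200 and the table name features_bucketed are assumptions, and whether the shuffle is actually avoided depends on your Spark version and query plan:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One-time cost: persist the data bucketed and sorted by the window keys
# (bucketBy requires saveAsTable rather than a plain parquet path)
(spark.read.parquet("s3://data-location")
    .write.bucketBy(200, "ID").sortBy("DATE")
    .format("parquet").mode("overwrite")
    .saveAsTable("features_bucketed"))

# Window aggregations keyed on ID/DATE can then reuse the bucketing and sort order
w = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-250, -1)
data = spark.table("features_bucketed")
data = data.withColumn("Feature_1", F.col("Feature_1") / F.avg("Feature_1").over(w))
data.write.parquet("s3://data-out")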

YCSB low read throughput cassandra

The YCSB Endpoint benchmark would have you believe that Cassandra is the golden child of NoSQL databases. However, recreating the results on our own boxes (8 cores with hyperthreading, 60 GB memory, 2 x 500 GB SSDs), we are seeing dismal read throughput for workload B (read-mostly, i.e. 95% read, 5% update).
The cassandra.yaml settings are exactly the same as the Endpoint settings, barring the different IP addresses and our disk configuration (1 SSD for data, 1 for the commit log). While their throughput is ~38,000 operations per second, ours is ~16,000 more or less regardless of the threads/number of client nodes, i.e. one worker node with 256 threads will report ~16,000 ops/sec, while 4 nodes will each report ~4,000 ops/sec.
I've set the readahead value to 8KB for the SSD data drive. I'll put the custom workload file below.
When analyzing disk I/O and CPU usage with iostat, the read throughput is consistently ~200,000 KB/s, which suggests that the YCSB cluster throughput should be higher (records are 100 bytes). ~25-30% of CPU time sits in %iowait and 10-25% is spent in user time.
top and nload stats show no obvious bottleneck (<50% memory usage, and 10-50 Mbit/s on a 10 Gb/s link).
# The name of the workload class to use
workload=com.yahoo.ycsb.workloads.CoreWorkload
# There is no default setting for recordcount but it is
# required to be set.
# The number of records in the table to be inserted in
# the load phase or the number of records already in the
# table before the run phase.
recordcount=2000000000
# There is no default setting for operationcount but it is
# required to be set.
# The number of operations to use during the run phase.
operationcount=9000000
# The offset of the first insertion
insertstart=0
insertcount=500000000
core_workload_insertion_retry_limit = 10
core_workload_insertion_retry_interval = 1
# The number of fields in a record
fieldcount=10
# The size of each field (in bytes)
fieldlength=10
# Should read all fields
readallfields=true
# Should write all fields on update
writeallfields=false
fieldlengthdistribution=constant
readproportion=0.95
updateproportion=0.05
insertproportion=0
readmodifywriteproportion=0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=usertable
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
The key was increasing native_transport_max_threads in the cassandra.yaml file.
Along with the increased settings mentioned in the comments (more connections in the YCSB client as well as higher concurrent reads/writes in Cassandra), Cassandra jumped to ~80,000 ops/sec.
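For reference, these knobs live in cassandra.yaml; the values below are illustrative assumptions only, not the poster's settings, and should be tuned to your hardware rather than copied:

# cassandra.yaml (excerpt, example values only)
native_transport_max_threads: 256
concurrent_reads: 64
concurrent_writes: 64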

Cassandra Stress Test results evaluation

I have been using the cassandra-stress tool to evaluate my cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test(
ID uuid,
Time timestamp,
Value double,
Date timestamp,
PRIMARY KEY ((ID,Date), Time)
) WITH COMPACT STORAGE;
I have parsed this information in a custom yaml file and used parameters n=10000, threads=100 and the rest are default options (cl=one, mode=native cql3, etc). The Cassandra cluster is a 3 node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
  partitions: fixed(100)
  select: fixed(1)/2
  batchtype: UNLOGGED
columnspecs:
  - name: Time
    size: fixed(1000)
  - name: ID
    size: uniform(1..100)
  - name: Date
    size: uniform(1..10)
  - name: Value
    size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and time: fixed(1000), the number of rows getting inserted is 10 million. (10000*1000=10000000)
The number of row-keys/partitions is 10000 (i.e. n), within which 100 partitions are taken at a time (which means 100 * 1000 = 100000 key-value pairs), out of which 50000 key-value pairs are processed at a time (this is because of select: fixed(1)/2 ~ 50%).
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of[100000..100000] total rows in the partitions)
The results that I get are the following for consecutive runs with the same configuration as above:
Run Total_ops Op_rate Partition_rate Row_Rate Time
1 56 19 1885 943246 3.0
2 46 46 4648 2325498 1.0
3 27 30 2982 1489870 0.9
4 59 19 1932 966034 3.1
5 100 17 1730 865182 5.8
Now, what I need to understand is the following:
Which among these metrics is the throughput i.e, No. of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it’s the Row_rate, can I safely conclude here that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why is it that the Total_ops vary so drastically in every run ? Has the number of threads got anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row Rate is the number of CQL Rows that you have inserted into your database. For your table a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
The Partition Rate is the number of Partitions C* had to construct. A Partition is the data-structure which holds and orders data in Cassandra, data with the same partition key ends up located on the same node. This Partition rate is equal to the number of unique values in the Partition Key that were inserted in the time window. For your table this would be unique values for (ID,Date)
Op Rate is the number of actual CQL operations that had to be performed. From your settings it is running unlogged batches to insert the data. Each insert contains approximately 100 partitions (unique combinations of ID and Date), which is why Op Rate * 100 ~= Partition Rate.
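As a rough sanity check of those ratios against Run 1 above (assuming the 50000 selected rows are spread over 100 partitions, i.e. about 500 rows per partition):

# Figures taken from Run 1 of the results table above
op_rate = 19           # operations/sec
partition_rate = 1885  # partitions/sec
row_rate = 943246      # rows/sec

print(partition_rate / op_rate)   # ~99, close to the ~100 partitions per batch
print(row_rate / partition_rate)  # ~500, close to 50000 rows / 100 partitions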
Total OP should include all operations, read and write. So if you have any read operations those would also be included.
I would suggest changing your batch size to match your workload, or keep it at 1 depending on your actual database usage. This should provide a more realistic scenario. Also it's important to run much longer than just 100 total operations to really get a sense of your system's capabilities. Some of the biggest difficulties come when the size of the dataset increases beyond the amount of RAM in the machine.

Cassandra nodetool "compactionstats" meaning of displayed values

I cannot find documentation on the "compactionstats":
While using nodetool compactionstats, what do the numerical values on the completed and total columns mean?
My column family has a total data size of about 360 GB but my compaction status displays:
pending tasks: 7
compaction type keyspace column family completed total unit progress
Compaction Test Message 161257707087 2475323941809 bytes 6.51%
Meanwhile I can see "completed" increasing slowly (and the progress too ;-)).
But how is this "total" computed? Why is it 2.5 TB when I have only 360 GB of data?
You must have compression on. total is the total number of uncompressed bytes comprising the set of sstables that are being compacted together. If you grep the cassandra log file for lines containing Compacting you will find the sstables that are part of a compaction. If you sum these sizes and multiply by the inverse of your compression ratio for the column family you will get pretty close to the total. By default this can be a bit difficult to verify on a multi-core system because the number of simultaneous compactions defaults to the number of cores.
You can also verify this answer by looking at the code:
AbstractCompactionIterable - getCompactionInfo() uses the bytesRead and totalBytes fields of that class. totalBytes is final and is computed in the constructor by summing getLengthInBytes() from each file that is part of the compaction.
The scanners vary, but the length in bytes returned by CompressedRandomAccessReader is the uncompressed size of the file.
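As a rough back-of-the-envelope check of the 360 GB vs 2.5 TB figures above (assuming the pending compactions cover roughly the whole column family):

on_disk_bytes = 360 * 10**9          # ~360 GB of compressed data on disk
total_uncompressed = 2475323941809   # "total" reported by compactionstats
print(on_disk_bytes / total_uncompressed)  # ~0.145, a plausible compression ratio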

Memory management scenario with MongoDB & Node.JS

I'm implementing a medium-scale marketing e-commerce affiliation site, which has the following estimates:
Total Size of Data: 5 - 10 GB
Indexes on Data: 1 GB approx (which I wanted to be in memory)
Disk Size (fast I/O): 20-25 GB
Memory: 2 GB
App development: node.js
Working set estimation of Query: Average 1-2 KB, Maximum 20-30 KB of text base article
I'm trying to understand whether MongoDB would be the right choice of database. The indexes should fit in memory, though they will take up a sizeable share of it, and I have noticed that after a query MongoDB keeps memory occupied (roughly the size of the result set) to cache the results. Within 8 hours I expect the queries to touch almost 95% of the data; in that scenario, how will MongoDB cope with such limited memory, given that the node.js app instance runs on the same server?
Would MongoDB be the right choice for this scenario, or should I go for another JSON-based NoSQL database?
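As an aside, one hedged way to observe how much memory mongod actually holds while such a query mix runs is to poll serverStatus; this is a PyMongo sketch for illustration only (the app itself is node.js), the connection string is an assumption, and the wiredTiger keys exist only on the WiredTiger storage engine:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# Resident and virtual memory of the mongod process, in MB
print(status["mem"]["resident"], status["mem"]["virtual"])

# Cache usage, reported only when the WiredTiger engine is in use
cache = status.get("wiredTiger", {}).get("cache", {})
print(cache.get("bytes currently in the cache"),
      cache.get("maximum bytes configured"))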
