Subtract two dataframes to get the value difference

Subtract two dataframes to get the value difference - python-3.x

I have a dataframe on which I get a daily aggregate for particular dates.Below is the dataframe for the date 2018-02-11 where I have found out the mean, min, max, std
cpu cpu cpu cpu mem mem mem mem load load load load drops drops drops drops latency latency latency latency gw_latency gw_latency gw_latency gw_latency upload upload upload upload download download download download sap_drops sap_drops sap_drops sap_drops sap_latency sap_latency sap_latency sap_latency
mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std
date
2018-02-11 4.282442748 0 17 4.361148065 13.61068702 0 27 6.123815451 3.891450382 0 47.62 6.426298507 1.526717557 0 100 12.30842628 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Similarly I have another dataframe for the date 2018-02-12 for which I found the mean, min, max, std
cpu cpu cpu cpu mem mem mem mem load load load load drops drops drops drops latency latency latency latency gw_latency gw_latency gw_latency gw_latency upload upload upload upload download download download download sap_drops sap_drops sap_drops sap_drops sap_latency sap_latency sap_latency sap_latency
mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std
date
2018-02-12 5.726315789 0 21 2.938315053 22.30526316 0 23 3.581474037 6.06 0 44.75 6.798944285 0.5263157895 0 100 7.254762501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is code below
import pandas as pd
df = pd.read_csv("metrics.csv", parse_dates=["date"])
df.set_index("date", inplace=True)
df_prev = df.loc['2018-02-11'].resample('D')['cpu', 'mem', 'load', 'drops', 'latency',
'gw_latency', 'upload', 'download', 'sap_drops',
'sap_latency'].agg(['mean', 'min', 'max', 'std']).fillna(0)
df_next = df.loc['2018-02-12'].resample('D')['cpu', 'mem', 'load', 'drops', 'latency',
'gw_latency', 'upload', 'download', 'sap_drops',
'sap_latency'].agg(['mean', 'min', 'max', 'std']).fillna(0)
Now I want to subtract the two dataframes to get the value difference for each of the column.This is what I do
df_diff = df_next.sub(df_prev, fill_value=0)
print(df_diff)
But it doesn't subtract anything and I also get the dates which doesn't make any sense since I only want the stats difference.
cpu cpu cpu cpu mem mem mem mem load load load load drops drops drops drops latency latency latency latency gw_latency gw_latency gw_latency gw_latency upload upload upload upload download download download download sap_drops sap_drops sap_drops sap_drops sap_latency sap_latency sap_latency sap_latency
mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std
date
2018-02-11 -4.282442748 0 -17 -4.361148065 -13.61068702 0 -27 -6.123815451 -3.891450382 0 -47.62 -6.426298507 -1.526717557 0 -100 -12.30842628 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2018-02-12 5.726315789 0 21 2.938315053 22.30526316 0 23 3.581474037 6.06 0 44.75 6.798944285 0.5263157895 0 100 7.254762501 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
As you can see it doesn't do any subtraction at all.Why is that happening?
PS I ultimately want to find out the percentage difference between the stats of the two dates.Is there any direct way to do that?

To get the difference
df_next - df_prev.values
To get the % change,
(df_next - df_prev.values)/(df_prev.values) * 100

Related

dsbulk unload is failing on large table

trying to unload data from a huge table, below is the command used and output.
$ /home/cassandra/dsbulk-1.8.0/bin/dsbulk unload --driver.auth.provider PlainTextAuthProvider --driver.auth.username xxxx --driver.auth.password xxxx --datastax-java-driver.basic.contact-points 123.123.123.123 -query "select count(*) from sometable with where on clustering column and partial pk -- allow filtering" --connector.name json --driver.protocol.compression LZ4 --connector.json.mode MULTI_DOCUMENT -maxConcurrentFiles 1 -maxRecords -1 -url dsbulk --executor.continuousPaging.enabled false --executor.maxpersecond 2500 --driver.socket.timeout 240000
Setting dsbulk.driver.protocol.compression is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.protocol.compression instead.
Setting dsbulk.driver.auth.* is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.auth-provider.* instead.
Operation directory: /home/cassandra/logs/COUNT_20210423-070104-108326
total | failed | rows/s | p50ms | p99ms | p999ms
1 | 1 | 0 | 109,790.10 | 110,058.54 | 110,058.54
Operation COUNT_20210423-070104-108326 completed with 1 errors in 1 minute and 50 seconds.
Here are the dsbulk records --
cassandra#somehost> cd logs
cassandra#somehost> cd COUNT_20210423-070104-108326/
cassandra#somehost> ls
operation.log unload-errors.log
cassandra#somehost> cat operation.log
2021-04-23 07:01:04 WARN Setting dsbulk.driver.protocol.compression is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.protocol.compression instead.
2021-04-23 07:01:04 WARN Setting dsbulk.driver.auth.* is deprecated and will be removed in a future release; please configure the driver directly using --datastax-java-driver.advanced.auth-provider.* instead.
2021-04-23 07:01:04 INFO Operation directory: /home/cassandra/logs/COUNT_20210423-070104-108326
2021-04-23 07:02:55 WARN Operation COUNT_20210423-070104-108326 completed with 1 errors in 1 minute and 50 seconds.
2021-04-23 07:02:55 INFO Records: total: 1, successful: 0, failed: 1
2021-04-23 07:02:55 INFO Memory usage: used: 212 MB, free: 1,922 MB, allocated: 2,135 MB, available: 27,305 MB, total gc count: 4, total gc time: 98 ms
2021-04-23 07:02:55 INFO Reads: total: 1, successful: 0, failed: 1, in-flight: 0
2021-04-23 07:02:55 INFO Throughput: 0 reads/second
2021-04-23 07:02:55 INFO Latencies: mean 109,790.10, 75p 110,058.54, 99p 110,058.54, 999p 110,058.54 milliseconds
2021-04-23 07:02:58 INFO Final stats:
2021-04-23 07:02:58 INFO Records: total: 1, successful: 0, failed: 1
2021-04-23 07:02:58 INFO Memory usage: used: 251 MB, free: 1,883 MB, allocated: 2,135 MB, available: 27,305 MB, total gc count: 4, total gc time: 98 ms
2021-04-23 07:02:58 INFO Reads: total: 1, successful: 0, failed: 1, in-flight: 0
2021-04-23 07:02:58 INFO Throughput: 0 reads/second
2021-04-23 07:02:58 INFO Latencies: mean 109,790.10, 75p 110,058.54, 99p 110,058.54, 999p 110,058.54 milliseconds
cassandra#somehost> cat unload-errors.log
Statement: com.datastax.oss.driver.internal.core.cql.DefaultBoundStatement#1083fef9 [0 values, idempotence: <UNSET>, CL: <UNSET>, serial CL: <UNSET>, timestamp: <UNSET>, timeout: <UNSET>]
SELECT batch_id from .... allow filtering (Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded))
at com.datastax.oss.dsbulk.executor.api.subscription.ResultSubscription.toErrorPage(ResultSubscription.java:534)
at com.datastax.oss.dsbulk.executor.api.subscription.ResultSubscription.lambda$fetchNextPage$1(ResultSubscription.java:372)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.setFinalError(CqlRequestHandler.java:447) [4 skipped]
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.access$700(CqlRequestHandler.java:94)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.processRetryVerdict(CqlRequestHandler.java:859)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.processErrorResponse(CqlRequestHandler.java:828)
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler$NodeResponseCallback.onResponse(CqlRequestHandler.java:655)
at com.datastax.oss.driver.internal.core.channel.InFlightHandler.channelRead(InFlightHandler.java:257)
at java.lang.Thread.run(Thread.java:748) [24 skipped]
Caused by: com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
Cassandra's system.log snippet ----
DEBUG [ScheduledTasks:1] 2021-04-23 00:01:48,539 MonitoringTask.java:152 - 1 operations timed out in the last 5015 msecs:
<SELECT * FROM my query being run with limit - LIMIT 5000>, total time 10004 msec, timeout 10000 msec/cross-node
INFO [ScheduledTasks:1] 2021-04-23 00:02:38,540 MessagingService.java:1302 - RANGE_SLICE messages were dropped in last 5000 ms: 0 internal and 1 cross node
. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 10299 ms
INFO [ScheduledTasks:1] 2021-04-23 00:02:38,551 StatusLogger.java:114 -
Pool Name Active Pending Completed Blocked All Time Blocked
ReadStage 1 0 1736872997 0 0
ContinuousPagingStage 0 0 586 0 0
RequestResponseStage 0 0 1483193130 0 0
ReadRepairStage 0 0 9079516 0 0
CounterMutationStage 0 0 0 0 0
MutationStage 0 0 351841038 0 0
ViewMutationStage 0 0 0 0 0
CommitLogArchiver 0 0 32961 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 12034828 0 0
MemtableReclaimMemory 0 0 68612 0 0
PendingRangeCalculator 0 0 9 0 0
AntiCompactionExecutor 0 0 0 0 0
GossipStage 0 0 20137208 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 3798 0 0
MigrationStage 0 0 8 0 0
MemtablePostFlush 0 0 338955 0 0
PerDiskMemtableFlushWriter_0 0 0 66297 0 0
ValidationExecutor 0 0 247600 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 41757 0 0
InternalResponseStage 0 0 525242 0 0
AntiEntropyStage 0 0 767527 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 958717934 0 65
CompactionManager 0 0
MessagingService n/a 0/0
Cache Type Size Capacity KeysToSave
KeyCache 104857216 104857600 all
RowCache 0 0 all

The problem is that you are running an unload command in DSBulk to perform a SELECT COUNT() which means it has to do a full table scan to return a single row.
In addition, ALLOW FILTERING is not recommended unless you restrict the query to a single partition. In any case, the performance of ALLOW FILTERING is very unpredictable even in optimal situations.
I suggest that you instead use the DSBulk count command which is optimised for counting rows or partitions in Cassandra. For details, see the Counting data with DSBulk example.
There's also additional examples in this DSBulk Counting blog post which Alex Ott already linked in his answer. Cheers!

Expand select count(*) from sometable with where on clustering column and partial pk -- allow filtering with an additional condition on the token ranges, like this: and partial pk token(full_pk) > :start and token(full_pk) <= :end - in this case, DSBulk will perform many queries against specific token ranges that are sent to multiple nodes, and won't create the load onto the single node as in your case.
Look into the documentation for -query option, and for 4th blog in this series of blog posts about DSBulk, that could provide more information & examples: 1, 2, 3, 4, 5, 6

Excel: Match and Index based on range

I am stumped with the following problem and not sure how to accomplish it in excel. Here is an example of the data:
A B
1 Date Stock_Return
2 Jan-95 -5.2%
3 Feb-95 2.1%
4 Mar-95 3.7%
5 Apr-95 6.9%
6 May-95 6.5%
7 Jun-95 -5.6%
8 Jul-95 6.6%
9 Aug-95 6.2%
What I would like is to have the dates returned which fall within a certain return range and sorted from low to high.
For example:
1 2 3 4 5
Below -7% 0 0 0 0 0
-7% to -5% Jun-95 Jan-95 0 0 0
-5% to -3% 0 0 0 0 0
-3% to 0% 0 0 0 0 0
0% to 3% Feb-95 0 0 0 0
3% to 5% Mar-95 0 0 0 0
5% to 7% Aug-95 May-95 Jul-95 Apr-95 0
I thought Index and Match might make the most sense but when I drag across columns it doesn't work. Any help is very much appreciated.

You can use AGGREGATE function:
=IFERROR(AGGREGATE(14,6,$A$2:$A$9/(($B$2:$B$9>$D2)*($B$2:$B$9<=$E2)),COLUMN(A1)),"0")

If you have Excel O365, you can use the FILTER function:
F2: =IFERROR(TRANSPOSE(FILTER($A$2:$A$9,(F2<=$B$2:$B$9)*(G2>=$B$2:$B$9))),"")
and fill down.

Is there a way to prevent heap exhausted in common lisp

I am not too familiar with garbage collection in lisp, and I wonder how it is possible to manage it in order to avoid the
fatal error: Heap exhausted during garbage collection in *inferior-lisp*.
SLIME 2.20 and SBCL 2.0.1
Heap exhausted during garbage collection: 0 bytes available, 16 requested.
Gen Boxed Code Raw LgBox LgCode LgRaw Pin Alloc Waste Trig WP GCs Mem-age
1 19685 0 1 0 0 0 5 644710992 359856 21474836 19686 0 1.0000
2 29304 0 1 0 0 0 13 960070208 196032 21474836 29305 0 0.0000
3 0 0 0 0 0 0 0 0 0 21474836 0 0 0.0000
4 0 0 0 0 0 0 0 0 0 21474836 0 0 0.0000
5 367 1 130 34 0 15 51 17101008 823088 38575844 547 13 0.0000
6 485 2 221 55 0 10 0 24716944 612720 2000000 773 0 0.0000
7 15224 0 1 0 0 0 0 498860064 32736 2000000 15225 0 0.0000
Total bytes allocated = 2145459216
Dynamic-space-size bytes = 2147483648
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = true
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 84761(tid 0x700000026000):
Heap exhausted, game over.
Error opening /dev/tty: Device not configured
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>
I am using an algorithm to solve combinatorial issues, and as you can guess, the field of search increase exponentially. It is no point to increase the dynamic-space-size, because this will not solve the issue. So, the idea is to stop the process before heap exhausted. Something like a condition when a memory limit is reached for instance.
Any help is welcome.

It seems that there is no real answer to my question.
I found some guidelines with which I will manage my issue.
Here are the links for the record: https://sourceforge.net/p/sbcl/mailman/message/30184781/ and https://sourceforge.net/p/sbcl/mailman/message/18414749/

Cassandra write performance

We have this Cassandra cluster and would like to know if current performance is normal and what we can do to improve it.
Cluster is formed of 3 nodes located on the same datacenter with total capacity of 465GB and 2GB of Heap per node. Each node has 8 cores and 8GB or RAM. Version of different components are cqlsh 5.0.1 | Cassandra 2.1.11.872 | DSE 4.7.4 | CQL spec 3.2.1 | Native protocol v3
The workload is described as follows:
Keyspace use org.apache.cassandra.locator.SimpleStrategy placement strategy and replication factor of 3 (this is very important for us)
Workload consist mainly of write operations into a single table. The table schema is as follows:
CREATE TABLE aiceweb.records (
process_id timeuuid,
partition_key int,
collected_at timestamp,
received_at timestamp,
value text,
PRIMARY KEY ((process_id, partition_key), collected_at, received_at)
) WITH CLUSTERING ORDER BY (collected_at DESC, received_at ASC)
AND read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
Write operations come from a NodeJS based API server. The Nodejs driver provided by Datastax is used (version recently updated from 2.1.1 to 3.2.0). The code in charge of performing the write request will group write operations per Primary Key and additionally it will limit the request size to 500 INSERTs per request. The write operation is performed as a BATCH. The only options explicitly set are prepare:true, logged:false.
OpsCenter reflect a historial level of less than one request per second in the last year using this setup (each write request been a BATCH of up to 500 operations directed to the same table and the same partition). Write request latency has been at 1.6ms for 90% of requests for almost the entire year but lately it has increased up to more than 2.6ms for the 90% of requests. Os Load has been below 2.0 and Disk Utilization has been below 5% most of the time with few peaks at 7%. Average Heap usage has been 1.3GB the entire year with peaks at 1.6GB even though currently this peak is rising during the last month.
The problem with this setup is that API performance has been degrading the entire year. Currently the BATCH operation can take from 300ms up to more than 12s (leading to a operation timeout). In some cases the NodeJS driver report all Cassandra drivers down even when OpsCenter report all nodes alive and healthy.
Compaction Stats shows always 0 on each node and nodetool tpstats is showing something like:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 10554 0 0
ReadStage 0 0 687567 0 0
RequestResponseStage 0 0 767898 0 0
MutationStage 0 0 393407 0 0
ReadRepairStage 0 0 411 0 0
GossipStage 0 0 1314414 0 0
CacheCleanupExecutor 0 0 48 0 0
MigrationStage 0 0 0 0 0
ValidationExecutor 0 0 126 0 0
Sampler 0 0 0 0 0
MemtableReclaimMemory 0 0 497 0 0
InternalResponseStage 0 0 126 0 0
AntiEntropyStage 0 0 630 0 0
MiscStage 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MemtableFlushWriter 0 0 485 0 0
PendingRangeCalculator 0 0 4 0 0
MemtablePostFlush 0 0 7879 0 0
CompactionExecutor 0 0 263599 0 0
AntiEntropySessions 0 0 3 0 0
HintedHandoff 0 0 8 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 0
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 0
_TRACE 0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0
Any help or suggestion with this problem will be deeply appreciated. Feel free to request any other information you need to analyze it.
Best regards

You are staying with same amount of requests, or the workload is growing?
Looks like the server is overloaded (maybe network).

I would try to find a reproducer and run the reproducer with tracing enabled - hopefully that will help to understand what the issue is (especially if you compare it to a trace in which the latency is good).
There is an example on how to enable query tracing and retrive the output via nodejs driver examples retrieve-query-trace.js (can be found on https://github.com/datastax/nodejs-driver)

libjpeg not exact pixel values even with quality of 100

I am writing a program to read some text files and write it to a JPEG file using libjpeg. When I set the quality to 100 (withjpeg_set_quality), there is actually no quality degradation in grayscale. However, when I move to RGB, even with a quality of 100, there seems to be compression.
When I give this input to convert to a grayscale JPEG image it works nicely and gives me a clean JPEG image:
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 255 0 0 0
255 0 0 0 0
The (horizontally flipped) output is:
Now when I assume that array was the Red color, and use the following two arrays for the Green and Blue colors respectively:
0 0 0 0 0
0 0 0 0 0
0 0 255 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 255
0 0 0 255 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
This is the color output I get:
While only 5 input pixels have any color value, the surrouding pixels have also gotten a value when converted to color. For both the grayscale image and RGB image the quality was set to 100.
I wanted to see what is causing this and how I can fix it so the colors are also only used for the pixels that actually have an input value?

You are getting errors from the RGB->YCbCr conversion. That is impossible to avoid in the large because there is not a 1:1 mapping between the two color spaces.

The fix is easy - just don't use jpeg. Png is a better choice for your use case.
What you are seeing is result of how jpeg compression works, there is such a thing as "lossless jpeg" but its really a completely different file format that isn't well supported.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Subtract two dataframes to get the value difference - python-3.x

To get the difference df_next - df_prev.values To get the % change, (df_next - df_prev.values)/(df_prev.values) * 100

Related

dsbulk unload is failing on large table

Excel: Match and Index based on range

Is there a way to prevent heap exhausted in common lisp

Cassandra write performance

libjpeg not exact pixel values even with quality of 100

Categories

Resources