There is a QR code with error correction level M and size 41x41. I found that its capacity is 154 alphanumeric characters. How is that capacity distributed between the rows of the code? What is the maximum string size for one row? I didn't find exact information on this.
There are several sentences in the data string to be decoded, and they have different lengths. Some of them are large.
We have Cassandra 3.4.4.
In the system.log we have a lot of messages like this:
INFO [CompactionExecutor:2] 2020-09-16 13:42:52,916 PerSSTableIndexWriter.java:211 - Rejecting value (size 1.938KiB, maximum 1.000KiB) for column payload (analyzed false) at /opt/cassandra/data/cfd/request-c8399780225a11eaac403f5be58182da/md-609-big SSTable.
What is the significance of these messages?
These entries appear several hundred times per second, and the log rotates every minute.
The symptoms you describe tell me that you have added a SASI index on the payload column of the cfd.request table where there wasn't one before.
Those messages are logged because Cassandra is going through the data trying to index it, and the payload column has too much data in it. The maximum term size for SASI is 1024 bytes, but in the example you posted the term size was 1.9 KB.
If the column only contains ASCII characters, the maximum term length is 1024 characters, since each ASCII character is 1 byte. If the column contains extended Unicode such as Chinese or Japanese characters, the maximum term length is shorter, since each of those characters takes up 3 bytes.
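To illustrate the byte-versus-character distinction, here is a quick standalone check in Python (not Cassandra code, just an illustration of why a 1024-byte limit is not a 1024-character limit):

# 1024 ASCII characters versus 1024 CJK characters, measured in UTF-8 bytes
ascii_term = "a" * 1024
cjk_term = "漢" * 1024

print(len(ascii_term), len(ascii_term.encode("utf-8")))  # 1024 characters -> 1024 bytes
print(len(cjk_term), len(cjk_term.encode("utf-8")))      # 1024 characters -> 3072 bytes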
You don't have a SASI analyzer configured on the index (analyzed false), so the whole column value is treated as a single term. If you use the standard SASI analyzer, the column value will get tokenised and broken up into multiple, shorter terms, and you won't see those indexing failures logged.
If you're interested in the detailed fix steps, see https://community.datastax.com/questions/8370/. Cheers!
There is a text file of size 20 GB. Each line of the file is a JSON object. The odd lines describe the even lines that immediately follow them.
The goal is to divide the big file into chunks with a maximum size of 10 MB, with the constraint that each file should contain an even number of lines so the pairing doesn't get lost. What would be the most efficient way to do that?
My research so far has made me lean towards:
1. The split utility in Linux. Is there any way to always export an even number of lines based on the size?
2. A modified version of a divide-and-conquer algorithm. Would this even work?
3. Estimating the average number of lines that meet the 10 MB criterion and iterating through the file, exporting a chunk whenever it meets the criterion (see the sketch below).
I'm thinking that option 1 would be the most efficient, but I wanted to get the opinion of experts here.
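For what it's worth, here is a minimal sketch of option 3 in Python. It streams the file two lines at a time (one metadata line plus its data line) and starts a new output file whenever the next pair would push the current chunk past 10 MB. The file names and the 10 MB threshold are placeholders; the input is assumed to have an even number of lines.

MAX_CHUNK = 10 * 1024 * 1024  # 10 MB target per output file

def split_pairs(path, prefix="chunk"):
    chunk_index, chunk_size, out = 0, 0, None
    with open(path, "rb") as src:
        # zip(src, src) pulls lines two at a time: (metadata line, data line);
        # a trailing unpaired line would be dropped, so an even line count is assumed
        for meta, data in zip(src, src):
            pair_size = len(meta) + len(data)
            # open a new output file if nothing is open yet or this pair wouldn't fit
            if out is None or chunk_size + pair_size > MAX_CHUNK:
                if out is not None:
                    out.close()
                out = open(f"{prefix}_{chunk_index:05d}.jsonl", "wb")
                chunk_index += 1
                chunk_size = 0
            out.write(meta)
            out.write(data)
            chunk_size += pair_size
    if out is not None:
        out.close()

split_pairs("big_file.jsonl")  # "big_file.jsonl" is a placeholder name

Because every pair is written as soon as it is read, the whole 20 GB file is processed in a single streaming pass without loading it into memory.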
I have a large data set containing as variables fileid, year, and about 1000 words (each word is a separate variable). All entries come from company reports indicating the year, a unique fileid, and the absolute frequency of each word in that report. Now I want some descriptive statistics: the number of words not used at all, the mean of the words, the variance of the words, and the top percentile of the words. How can I program that in Stata?
Caveat: You are probably better off using a text processing package in R or another program. But since no one else has answered, I'll give it a Stata-only shot. There may be an ado file already built that is much better suited, but I'm not aware of one.
I'm assuming that
each word is a separate variable
means that there is a variable word_profit that takes a value k from 0 to K where word_profit[i] is the number of times profit is written in the i-th report, fileid[i].
Mean of words
collapse (mean) word_* will give you the average number of times the words are used. Adding a by(year) option will give you those means by year. To make this more manageable than a very wide one-observation dataset, you'll want to run the following after the collapse:
gen temp = 1
reshape long word_, i(temp) j(str) string
rename word_ count
drop temp
Variance of words
collapse (sd) word_* will give you the standard deviations. To get variances, just square the standard deviations.
Number of words not used at all
Without a bit more clarity, I don't have a good idea of what you want here. You could count zeros for each word with:
foreach var of varlist word_* {
    gen zero_`var' = (`var' == 0)
}
collapse (sum) zero_*
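In the spirit of the caveat above about using another program: here is a rough pandas sketch of the same statistics, assuming a data set with columns fileid, year, and one word_* count column per word. The file name, column names, and the reading of "top percentile" are all assumptions.

import pandas as pd

# Hypothetical layout: one row per report, columns fileid, year, word_profit, word_loss, ...
df = pd.read_csv("word_counts.csv")
words = df.filter(like="word_")           # the ~1000 word-count columns

means = words.mean()                       # mean frequency of each word across reports
variances = words.var()                    # variance of each word's frequency
never_used = (words.sum() == 0).sum()      # number of words that never appear in any report
top_pct = means.quantile(0.99)             # one possible reading of "top percentile"

# Means by year, analogous to collapse (mean) word_*, by(year)
by_year_means = df.groupby("year").mean(numeric_only=True).filter(like="word_")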
Let's first quote:
Combined size of all of the properties in an entity cannot exceed 1MB.
(for a row/entity) from MSDN.
My question is: since everything is XML data, what is that 1 MB measured in? 1 MB of ASCII characters, 1 MB of UTF-8 characters, or something else?
Sample:
Row1: PartitionKey="A", RowKey="A", Data="A"
Row2: PartitionKey="A", RowKey="A", Data="A" (this is a UTF-8 Unicode A)
Are Row1 and Row2 the same size (in length), or is Row2.Length = Row1.Length + 1?
Single columns such as "Data" in your example are limited to 64 KB of binary data, and single rows are limited to 1 MB of data. Strings are encoded into binary in UTF-8 format, so the limit is whatever the byte size ends up being for your string. If you want a column to store more than 64 KB of data, you can use a technique such as FAT Entity, which is provided by Lokad (https://github.com/Lokad/lokad-cloud-storage/blob/master/Source/Lokad.Cloud.Storage/Azure/FatEntity.cs). The technique is pretty simple: you encode your string to binary and then split the binary across multiple columns. When you want to read the string from the table, you just re-join the columns and convert the binary back to a string.
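The split/rejoin idea is straightforward. Here is a language-agnostic sketch in Python (not the Lokad code, just an illustration of the technique), assuming a 64 KB per-column budget:

COLUMN_LIMIT = 64 * 1024  # 64 KB of binary data per property

def to_columns(text):
    # Encode the string to UTF-8 and split the bytes into column-sized pieces.
    # A multi-byte character may be split across two pieces, which is fine because
    # the pieces are re-joined before decoding.
    raw = text.encode("utf-8")
    return [raw[i:i + COLUMN_LIMIT] for i in range(0, len(raw), COLUMN_LIMIT)]

def from_columns(columns):
    # Re-join the pieces and decode back to the original string.
    return b"".join(columns).decode("utf-8")

pieces = to_columns("some very long payload ...")
assert from_columns(pieces) == "some very long payload ..."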
Azure table row size calculation is quite involved and includes both the size of the property name and its value plus some overhead.
http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/30/how-the-size-of-an-entity-is-caclulated-in-windows-azure-table-storage.aspx
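Roughly, the calculation described in the linked post looks like the sketch below. The constants are reproduced from memory, so treat them as approximate and check the post for the authoritative version; only string properties are handled here.

def entity_size_estimate(partition_key, row_key, properties):
    # Rough estimate of an Azure table entity's size, per the linked post (approximate).
    size = 4                                   # fixed per-entity overhead
    size += 2 * len(partition_key + row_key)   # keys counted at 2 bytes per character
    for name, value in properties.items():     # properties: dict of string name -> string value
        size += 8                              # per-property overhead
        size += 2 * len(name)                  # property name, 2 bytes per character
        size += 4 + 2 * len(value)             # string value: 4 bytes + 2 bytes per character
    return size

print(entity_size_estimate("A", "A", {"Data": "A"}))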
Edit: removed an earlier statement that said the size calculation was slightly inaccurate. It is in fact quite accurate.
When the size of a code base is reported in lines, is it more usual/standard to report the raw wc count, or nonblank, noncomment lines? I'm not asking which measure should be used, only: if I see a number given with no other information, which measure is it, at best guess, more likely to be?
I'd say nonblank, noncomment lines. For many languages in the C family tree, this is approximately the number of semi-colons.
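As a concrete illustration of the difference, here is a small Python sketch that compares the raw wc-style line count with a crude nonblank, noncomment count (plus a semicolon count) for a C-family source file. The comment handling is deliberately naive (it ignores /* ... */ blocks), and the file name is a placeholder.

def line_counts(path):
    raw, code, semicolons = 0, 0, 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            raw += 1                      # what `wc -l` would report
            stripped = line.strip()
            if stripped and not stripped.startswith("//"):
                code += 1                 # crude nonblank, noncomment count
            semicolons += stripped.count(";")
    return {"raw": raw, "nonblank_noncomment": code, "semicolons": semicolons}

print(line_counts("main.c"))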