Lines counted raw or nonblank noncomment - statistics

When the size of a code base is reported in lines, is it more usual/standard to report the raw wc count, or nonblank, noncomment lines? I'm not asking which measure should be used; only, if I see a number given with no other information, which measure it is more likely to be, at best guess.

I'd say nonblank, noncomment lines. For many languages in the C family tree, this is approximately the number of semi-colons.
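As a rough illustration of the difference, here is a minimal counter in Python for C-family sources. It is only a sketch: it assumes whole-line // and /* ... */ comments and ignores comments that share a line with code, which real tools such as cloc handle.

# Sketch: raw line count vs. nonblank/noncomment line count.
# Only whole-line // and /* ... */ comments are recognized.
def count_lines(path):
    raw = effective = 0
    in_block = False
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            raw += 1
            s = line.strip()
            if in_block:
                if "*/" in s:
                    in_block = False
                continue
            if not s or s.startswith("//"):
                continue
            if s.startswith("/*"):
                in_block = "*/" not in s
                continue
            effective += 1
    return raw, effective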

Related

Determine the number of search results upon using different sets of separators (Excel)

I would like to ask for your help formulating an Excel formula to compare the total number of search results when using different sets of separator characters.
As I have multiple columns with content, as in the example below, I thought it would be possible to count the search results in some way, treating each column separately (which is in fact what I would prefer).
A
1 L-516-S-221-S-223
2 H-140.STR3
3 ST0 XP 23-9
4 etc.......
Preferably, I would like to use a varying set of separator characters in order to determine the impact of that set on the number of search results. Logically, an increasing number of separators will return more results (depending, of course, on which separators occur in the cell values).
The set of characters that I would like to experiment with is: "-_ .,;: " (the set includes a space).
Hopefully this makes sense and someone is able to help me out. Thank you.
In your example, "-" on its own will detect all three instances, but for an overview you might construct a grid (say, your separators in B1:H1, including a space rather than an empty cell), put each column in ColumnA in turn (maybe via links), and then enter a formula in B2 such as:
=--ISNUMBER(FIND(B$1,$A2))
copied across to ColumnH and down to suit.
Alternative formula (answering a slightly different question: counting occurrences rather than merely detecting them):
=IF(LEN($A2)-LEN(SUBSTITUTE($A2,B$1,""))>0,LEN($A2)-LEN(SUBSTITUTE($A2,B$1,""))+1,0)
The LEN difference counts how many times the separator occurs, and adding 1 gives the number of pieces it splits the value into; "L-516-S-221-S-223", with five dashes, returns 6. This assumes, for example, no trailing spaces and that separators are always separated. Results are not necessarily cumulative.
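If you want to sanity-check the grid outside Excel, here is a quick Python equivalent of the LEN/SUBSTITUTE occurrence count, using the sample values from the question:

# Count occurrences of each candidate separator in each value,
# mirroring LEN($A2)-LEN(SUBSTITUTE($A2,sep,"")).
values = ["L-516-S-221-S-223", "H-140.STR3", "ST0 XP 23-9"]
separators = list("-_ .,;:")  # includes the space
for v in values:
    print(v, {sep: v.count(sep) for sep in separators})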

Excel conditional formatting based on multiple cells and values

I am trying to implement various conditional formatting rules on a specific database. I looked for an answer around here but cannot find anything similar. It might not be possible, but it is worth a try.
I am performing various data cleansing and validation.
Here is the case (small sample; I am working with 100k data entries in this particular file):
Ultimately, what I want is a formula that will compare the low-level Description characters after the last "UNDERSCORE" to the characters after the last "UNDERSCORE" of the higher level (highlighted). If they do not match, then highlight the cell.
Asking for too much, yes, no, maybe? I am open to any other suggestions on how I can perform various data cleaning and validation!
Thank you!
If you must use the last "UNDERSCORE" character and can't depend on the suffixes being four characters, the formula becomes quite complex. For simplicity's sake, I assumed the higher level is always the lower level minus its last five characters; if you must find it by the last "DASH" character instead, this will be a lot longer.
Use this formula to highlight the cells, defining the two names LEVELS and DESCRS to refer to the two columns:
=IFNA(MID(B2,FIND("[]",SUBSTITUTE(B2,"_","[]",LEN(B2)-LEN(SUBSTITUTE(B2,"_",""))))+1,999)<>MID(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1),FIND("[]",SUBSTITUTE(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1),"_","[]",LEN(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1))-LEN(SUBSTITUTE(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1),"_",""))))+1,999),FALSE)
This uses a very nice trick with SUBSTITUTE to find the last occurrence of a character: SUBSTITUTE(B2,"_","[]",n) replaces only the n-th occurrence of "_", and since n is computed as LEN(B2)-LEN(SUBSTITUTE(B2,"_","")), the total number of underscores, the "[]" marker lands exactly on the last one.
BTW, I would probably write a Perl program to parse the data and find errors.
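A rough sketch of that idea in Python rather than Perl (the file name and the tab-separated level/description layout are assumptions, as is finding the higher level by dropping the last five characters, per the formula above):

# Flag rows whose description suffix (after the last "_") differs
# from the suffix of the higher level's description.
import csv

with open("levels.tsv", newline="") as f:          # hypothetical export
    rows = list(csv.reader(f, delimiter="\t"))     # columns: level, descr

descr_by_level = {level: descr for level, descr in rows}

for level, descr in rows:
    parent_descr = descr_by_level.get(level[:-5])  # higher level = level minus 5 chars
    if parent_descr is None:
        continue
    if descr.rsplit("_", 1)[-1] != parent_descr.rsplit("_", 1)[-1]:
        print(f"mismatch: {level}: {descr!r} vs {parent_descr!r}")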

The most efficient way to break a huge file into chunks of at most 10 MB

There is a text file 20 GB in size. Each line of the file is a JSON object. The odd lines describe the even lines that immediately follow them.
The goal is to divide the big file into chunks of at most 10 MB, where each chunk must contain an even number of lines so the pairing doesn't get lost. What would be the most efficient way to do that?
My research so far has me leaning towards:
1. The split utility in Linux. Is there any way to always export an even number of lines based on the size?
2. A modified version of a divide & conquer algorithm. Would this even work?
3. Estimating the average number of lines that meets the 10 MB criterion and iterating through the file, exporting a chunk whenever it meets the criterion (see the sketch below).
I'm thinking that option 1 would be the most efficient, but I wanted to get the opinion of experts here.
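For what it's worth, option 3 reduces to a single streaming pass if you just track the size of the current chunk: read the lines in pairs, and start a new chunk whenever adding the next pair would cross 10 MB. A minimal sketch in Python (the output naming scheme is made up):

# Stream the file once, writing metadata/data line pairs to the current
# chunk and rolling to a new chunk once 10 MB would be exceeded.
# Pairs are never split, so every chunk has an even number of lines.
MAX_BYTES = 10 * 1024 * 1024

def split_pairs(path):
    chunk_idx = written = 0
    out = open(f"chunk_{chunk_idx:05d}.jsonl", "wb")
    with open(path, "rb") as f:
        while True:
            meta = f.readline()
            data = f.readline()
            if not data:                      # EOF; assumes well-formed pairs
                break
            pair = meta + data
            if written and written + len(pair) > MAX_BYTES:
                out.close()
                chunk_idx, written = chunk_idx + 1, 0
                out = open(f"chunk_{chunk_idx:05d}.jsonl", "wb")
            out.write(pair)
            written += len(pair)
    out.close()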

Descriptive statistics in Stata - Word frequencies

I have a large data set containing as variables fileid, year, and about 1000 words (each word is a separate variable). Each line entry comes from a company report, indicating the year, a unique fileid, and the absolute frequency of each word in that report. Now I want some descriptive statistics: the number of words not used at all, the mean of words, the variance of words, and the top percentile of words. How can I program that in Stata?
Caveat: You are probably better off using a text processing package in R or another program. But since no one else has answered, I'll give it a Stata-only shot. There may be an ado file already built that is much better suited, but I'm not aware of one.
I'm assuming that "each word is a separate variable" means that there is a variable word_profit taking a value k from 0 to K, where word_profit[i] is the number of times "profit" is written in the i-th report, fileid[i].
Mean of words
collapse (mean) word_* will give you the average number of times the words are used. Adding a by(year) option will give you those means by year. To make this more manageable than a very wide one-observation dataset, you'll want to run the following after the collapse:
gen temp = 1
* go from one very wide row to one row per word
reshape long word_, i(temp) j(str) string
rename word_ count
drop temp
Variance of words
collapse (sd) word_* will give you the standard deviation (note the statistic is called sd, not std). To get variances, just square the standard deviations.
Number of words not used at all
Without a bit more clarity, I don't have a good idea of what you want here. You could count zeros for each word with:
foreach var of varlist word_* {
    gen zero_`var' = (`var' == 0)
}
collapse (sum) zero_*
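Following the caveat above about being better off in another program: for comparison, here are the same summaries in Python with pandas (the CSV name and its layout of fileid/year plus word_* columns are assumptions):

# Rough pandas equivalent of the Stata steps above.
import pandas as pd

df = pd.read_csv("reports.csv")          # fileid, year, word_* columns
words = df.filter(like="word_")

means = words.mean()                     # mean use of each word
variances = words.var()                  # variance of each word
never_used = (words.sum() == 0).sum()    # words never used in any report
top_words = means.quantile(0.99)         # 99th percentile of mean usage

print(never_used, top_words)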

Converting various file sizes to bytes

I have a column of "file sizes" that has been output poorly, in that it's not consistent. For example, values may be "4GB", "32 MB", "320 KB", "932 bytes", etc. I need to convert these all to a standard unit so that I can add them up for a report.
Consider this approach (a sketch follows the list):
Pick one display format; perhaps choose bytes.
For each cell:
determine its scale. This will likely involve string parsing, looking for an "ends with" match against some valid range of possibilities: "bytes", "kb", "mb", "gb", "kilobytes", "gigabytes". Convert to lower case first, to ensure sanity, and consider misspellings as well!
extract the number. Use a variation of a VBA numeric regex to extract the numbers, and watch out for decimals!
Your output will be (the number) * (the scale in bytes).
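A minimal sketch of that approach in Python (the accepted unit spellings beyond the question's examples are guesses; extend the table to taste, and swap in 1024-based multipliers if your sizes are binary):

import re

# Lower-cased unit spellings -> byte multiplier (decimal interpretation).
SCALES = {
    "bytes": 1, "b": 1,
    "kb": 10**3, "kilobytes": 10**3,
    "mb": 10**6, "megabytes": 10**6,
    "gb": 10**9, "gigabytes": 10**9,
}

def to_bytes(text):
    # One (possibly decimal) number, optional whitespace, then the unit.
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*", text)
    if not m:
        raise ValueError(f"unparseable size: {text!r}")
    number = float(m.group(1))
    unit = m.group(2).lower() or "bytes"   # bare numbers count as bytes
    return number * SCALES[unit]

print(sum(to_bytes(s) for s in ["4GB", "32 MB", "320 KB", "932 bytes"]))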
Here's a very unsophisticated answer, but it might make this a very quick fix for you if exact byte counts are not all-important: just do a simple text search and replace.
Replace "KB" (and "kilobytes" and other variations) with "000", "MB" with "000000", and "GB" with "000000000"; replace "bytes" with "". Then convert the cell/column type to numeric.
It won't be as easy if the values are given with decimals ("4.32 MB"), but your examples should work fine.
I would say you have two options:
1: require that all this data be in units of bytes (probably not feasible if the data already exists)
2: use a regex to separate the number from the unit, then use a switch statement (or loop or whatever you like) to perform the correct multiplications to get the number in bytes (probably the easier of the two).
Edit:
The regex would look something like this:
(\d+\.?\d*) *(.*)
This will capture the number (including any decimal part) and the unit separately, and ignore any whitespace between the two (you will still need to trim the input to the regex, as leading and trailing whitespace can cause some grief).
Bytes, kilobytes, megabytes, etc. are all related by fixed factors (1000 in the decimal/SI convention, 1024 in the common binary one). Just pick a standard unit for your report (say, megabytes), and multiply or divide values given in other units to get the values you need.
