Converting various file sizes to bytes - excel

I have a column of "file sizes" that was output inconsistently. For example, values may be "4GB", "32 MB", "320 KB", "932 bytes", etc. I need to convert these all to a standard unit so that I can add them up for a report.

Consider this approach:
1: Pick one display format, perhaps bytes.
2: For each cell, determine its scale. This would likely involve string parsing, looking for "ends with" some valid range of possibilities: "bytes", "kb", "mb", "gb", "kilobytes", "gigabytes". Convert to lower case first, to ensure sanity. Consider misspellings as well!
3: Extract the number. Use a variation of a VBA numeric regex to extract the numbers. Watch out for decimals!
4: Your output will be (the number) * (the scale in bytes).
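The steps above can be sketched outside Excel as follows. This is only an illustration in Python, not VBA; the names `UNIT_BYTES`, `SIZE_RE` and `to_bytes` are made up for the example, and the decimal (1000-based) multipliers are an assumption. Use 1024-based values if that is what your source system meant.

```python
import re

# Assumption: decimal (SI) multipliers. Swap in 1024-based values if needed.
UNIT_BYTES = {
    "b": 1, "byte": 1, "bytes": 1,
    "kb": 10**3, "kilobyte": 10**3, "kilobytes": 10**3,
    "mb": 10**6, "megabyte": 10**6, "megabytes": 10**6,
    "gb": 10**9, "gigabyte": 10**9, "gigabytes": 10**9,
}

# A number (optionally with decimals), optional spaces, then the unit letters.
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$")

def to_bytes(text):
    """Return the size in bytes, or None if the cell cannot be parsed."""
    match = SIZE_RE.match(text.lower())
    if match is None:
        return None
    number, unit = match.groups()
    scale = UNIT_BYTES.get(unit)
    if scale is None:
        return None  # unknown or misspelled unit: flag it for manual review
    return round(float(number) * scale)

cells = ["4GB", "32 MB", "320 KB", "932 bytes"]
print(sum(to_bytes(c) for c in cells))  # 4032320932
```

Returning None for anything unrecognised makes misspelled units easy to find before they silently distort the total.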

Here's a very unsophisticated answer - but it might make this a very quick fix for you, if exact byte counts are not all important. Just do a simple text search and replace.
Replace "KB" (and "kilobytes" and other variations) with "000", "MB" with "000000" and "GB" with "000000000". "bytes" you replace with "". Then convert the cell/column type to numeric.
It won't be as easy if the values are given with decimals ("4.32 MB"), but your examples should work fine.
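To see why decimals are a problem for the search-and-replace idea, here is a small Python sketch of the same substitution. The `REPLACEMENTS` table and `replace_units` are hypothetical names for this example, and only the short spellings are covered.

```python
# Hypothetical replacement table; longer spellings such as "kilobytes"
# would need their own entries, listed before "bytes".
REPLACEMENTS = [
    ("bytes", ""),
    ("KB", "000"),
    ("MB", "000000"),
    ("GB", "000000000"),
]

def replace_units(text):
    """Swap each unit suffix for the matching run of zeros, then drop spaces."""
    for unit, zeros in REPLACEMENTS:
        text = text.replace(unit, zeros)
    return text.replace(" ", "")

print(replace_units("32 MB"))    # 32000000
print(replace_units("4.32 MB"))  # 4.32000000, which still reads as 4.32
```

The zeros land after the decimal point, so "4.32 MB" ends up as the number 4.32 rather than 4320000.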

I would say you have two options:
1: require that all this data be in units of bytes (probably not feasible if the data already exists)
2: use a regex to separate the number from the unit, then use a switch statement (or loop or whatever you like) to perform the correct multiplications to get the number in bytes (probably the easier of the two).
Edit:
The regex would look something like this:
(\d*) *(.*)
This will capture the numbers and units separately and ignore any whitespace between the two (you will still need to trim the input to the regex, as leading and trailing whitespace can cause some grief).
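As a rough check of how that pattern behaves, here it is in Python's `re` module (my choice of language for the sketch; `split_size` is a made-up helper). It shows both why trimming matters and that `\d*` stops at a decimal point.

```python
import re

PATTERN = re.compile(r"(\d*) *(.*)")

def split_size(text):
    """Return (number_text, unit_text) using the pattern from the answer."""
    number, unit = PATTERN.match(text.strip()).groups()
    return number, unit.lower()

print(split_size("  320 KB "))   # ('320', 'kb')
print(split_size("4.32 MB"))     # ('4', '.32 mb')

# Without trimming, the leading spaces are consumed by " *" and the
# number ends up in the unit group instead.
print(PATTERN.match("  320 KB ").groups())  # ('', '320 KB ')
```

If decimals can occur, a number group such as `(\d+(?:\.\d+)?)` would be one way to capture them.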

Bytes, kilobytes, megabytes, etc. are all metric units. Just pick a standard unit for your report (say, megabytes), and multiply or divide values given in different units to get the values you need.

Related

Excel: How to add two numbers that has unit prefixes?

I'm trying to add numbers that have unit prefixes appended at the end (milli, micro, nano, pico, etc). For example, we have these two columns below:
Obviously doing something like =A2+A3+A4 and =B2+B3+B4 would not work. How would you resolve this? Thanks
Assuming you have no Excel version constraints (per the tags listed in your question), put all the suffixes as delimiters inside {}, separated by commas, in TEXTSPLIT, then define the conversion rules in XLOOKUP. We use SUBSTITUTE(col, nums, "") as the input of XLOOKUP to extract the unit of measure.
=BYCOL(A2:B4, LAMBDA(col, LET(nums, 1*TEXTSPLIT(col,{"ms","us"},,1),
units, XLOOKUP(SUBSTITUTE(col, nums, ""), {"us";"ms"},{1;1000}),
SUM(nums * units))))
The above formula converts the result to a common unit of microseconds (us), i.e. to the lower unit, so milliseconds get converted by multiplying by 1000. If the unit of measure is not found, it returns #N/A; this can be customized by adding a fourth parameter to XLOOKUP. If you want the result in milliseconds, then replace {1;1000} with {0.001;1} or VSTACK(10^-3;1), for example.
If you would like to have everything in seconds, you can use the trick of combining a power with the XMATCH index position to generate the multiplier. I took the idea from this question: How to convert K, M, B formatted strings to plain numbers? Check the answer from @pgSystemTester (for Google Sheets, but it can be adapted to Excel). I included nanoseconds too.
=BYCOL(A2:B4,LAMBDA(col,LET(nums,1*TEXTSPLIT(col,{"ms","us","ns"},,1),
units, 1000^(-IFERROR(XMATCH(RIGHT(col,2), {"ms";"us";"ns"}),0)),
SUM(nums * units))))
Under this approach, seconds is the output unit. Because "s" is not part of the XMATCH lookup_array argument, its multiplier is 1 (the result of 1000^0), so values with no unit and values in seconds (s) are treated the same way.
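The position-as-exponent trick may be easier to follow outside a formula. Below is a minimal Python sketch of the same idea, not a translation of the Excel functions; `PREFIXED_UNITS`, `VALUE_RE` and `to_seconds` are names invented for the example, and the sample values are made up.

```python
from decimal import Decimal
import re

# Position in this list sets the exponent: "ms" -> 1000^-1, "us" -> 1000^-2, ...
PREFIXED_UNITS = ["ms", "us", "ns"]

VALUE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*$")

def to_seconds(text):
    """Convert a value such as '250us' to seconds using the list position."""
    match = VALUE_RE.match(text)
    if match is None:
        raise ValueError(f"cannot parse {text!r}")
    number, unit = match.groups()
    # Units not in the list (including "s" or no unit) get exponent 0,
    # mirroring the IFERROR(..., 0) fallback in the formula.
    exponent = PREFIXED_UNITS.index(unit) + 1 if unit in PREFIXED_UNITS else 0
    return Decimal(number) * Decimal(1000) ** -exponent

values = ["5ms", "250us", "3s", "2"]  # made-up sample data
print(sum(to_seconds(v) for v in values))
```

Decimal is used here only to avoid binary floating-point noise in the small multipliers.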
Notes:
In my initial version I used INDEX, but as @P.b pointed out in the comments, it is not really necessary to remove the second empty column; instead, we can use the ignore_empty input argument of TEXTSPLIT. Thanks.
You can use TEXTBEFORE instead of TEXTSPLIT, as follows: TEXTBEFORE(A2:A4,{"ms","us"})

String parsing in optimal way

Suppose I have a string such as onehhhtwominusthreehhkkseveneightjnine
Now I want to parse this string to get the numbers out of it. For example, this string should return an array, [one,two,minusthree,seven,eight,nine].
The order of the Integers should be maintained.
Can anyone Please suggest an optimal way to do this parsing? Thanks.
(You haven't mentioned a programming language?)
I would probably search for "minus" and check the number(s) that follow it. Then search for "one", then "two", noting their indexes. This would provide enough information to map and output the results, and order, that you need.
Another option is to look at each character in order, comparing each to the 10 choices. I couldn't tell you which is the most efficient - I think it depends on the possible total string length. I'd probably write both and profile them.
If the string to search is not of inordinate length then I suspect that the second approach might be more efficient. This is because, as soon as you have a match, you can eliminate searching the following (known) length of characters.
That is, if you have "abceightd", once you discover the "e" and its "eight" you can skip four characters. You can also skip the a, b, and c anyway, as they are not the beginning character for any of the 10 choices.
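For what it's worth, the second (character-by-character) approach could look something like this in Python. The language is my pick since none was given, and `WORDS` and `scan_numbers` are names made up for the sketch. A "minus" is only attached when a number word follows it directly.

```python
WORDS = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def scan_numbers(text):
    """Walk the string once, jumping past each word as soon as it matches."""
    found = []
    pending_minus = False
    i = 0
    while i < len(text):
        if text.startswith("minus", i):
            pending_minus = True
            i += len("minus")
            continue
        word = next((w for w in WORDS if text.startswith(w, i)), None)
        if word is not None:
            found.append(("minus" if pending_minus else "") + word)
            i += len(word)  # skip the known length of the matched word
        else:
            i += 1  # this character does not start any known word
        pending_minus = False
    return found

print(scan_numbers("onehhhtwominusthreehhkkseveneightjnine"))
# ['one', 'two', 'minusthree', 'seven', 'eight', 'nine']
```

Overlapping words (for example "twone") are not handled; the first match simply wins.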
I am assuming your choices are:
one, two, three, four, five, six, seven, eight, nine, minus
Assuming that a) you have access to regular expressions in your choice of programming language and b) your possible choices are as Andy G has assumed... then this regular expression can pick out the numbers grouped with their associated minus, if present:
/((?:minus)*(?:one|two|three|four|five|six|seven|eight|nine))/g
Applied to your example string using JavaScript's RegExp.prototype.exec(), for example, this extracts:
one
two
minusthree
seven
eight
nine
You could easily place a space after any minus matched if required. Does this help at all?
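The same pattern works in other regex engines too. Here is a quick Python check (Python rather than JavaScript is just my choice for the sketch), including the optional space after "minus":

```python
import re

NUMBER_RE = re.compile(
    r"(?:minus)*(?:one|two|three|four|five|six|seven|eight|nine)"
)

text = "onehhhtwominusthreehhkkseveneightjnine"

# With no capturing groups, findall returns each whole match in order.
matches = NUMBER_RE.findall(text)
print(matches)
# ['one', 'two', 'minusthree', 'seven', 'eight', 'nine']

# Optional: put a space after any "minus" that was matched.
spaced = [m.replace("minus", "minus ") for m in matches]
print(spaced)
```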

Excel conditional formatting based on multiple cells and values

I am trying to apply various conditional formatting rules to a specific database. I looked for an answer around here but could not find anything similar. It might not be possible, but it is worth a try.
I am performing various data cleansing and validation.
Here is the case: (small sample, working with 100k data entries in this particular file)
Ultimately, what I want is a formula that will compare the characters after the last "UNDERSCORE" in the low-level Description to the characters after the last "UNDERSCORE" in the higher level (highlighted). If they do not match, then highlight the cell.
Asking for too much, yes, no, maybe? I am open to any other suggestions on how can I perform various data cleaning and validation!
Thank you!
If you must use the last "UNDERSCORE" character, and can't depend on the suffixes being four characters, the formula becomes quite complex. For simplicity's sake, I assumed the higher level is always missing the last five characters of the lower level. If you must go by the last "DASH" character, then this will be a lot longer.
Use this formula to highlight the cells, defining the two names LEVELS and DESCRS to be the two columns:
=IFNA(MID(B2,FIND("[]",SUBSTITUTE(B2,"_","[]",LEN(B2)-LEN(SUBSTITUTE(B2,"_",""))))+1,999)<>MID(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1),FIND("[]",SUBSTITUTE(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1),"_","[]",LEN(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1))-LEN(SUBSTITUTE(INDEX(DESCRS,MATCH(LEFT(A2,LEN(A2)-5),LEVELS,0),1),"_",""))))+1,999),FALSE)
This uses a very nice trick with SUBSTITUTE to find the last occurrence of a character.
BTW, I would probably write a Perl program to parse the data and find errors.
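As a rough idea of what such a script would do (sketched in Python rather than Perl), the check boils down to the snippet below. The sample rows are made up, since the original data isn't shown. `after_last_underscore` and `mismatched_levels` are hypothetical names, and the five-character parent rule is the same assumption the formula makes.

```python
def after_last_underscore(text):
    """Return the text following the final underscore (whole text if none)."""
    return text.rsplit("_", 1)[-1]

def mismatched_levels(rows, suffix_len=5):
    """rows maps level ID -> description; return IDs whose tail differs
    from the tail of the higher level's description."""
    flagged = []
    for level, description in rows.items():
        # Assumption: the higher level is the ID minus its last five characters.
        parent = rows.get(level[:-suffix_len])
        if parent is None:
            continue  # no higher level found; comparable to the FALSE fallback
        if after_last_underscore(description) != after_last_underscore(parent):
            flagged.append(level)
    return flagged

# Made-up sample rows
rows = {
    "10-20": "PUMP_STATION_NORTH",
    "10-20-0001": "PUMP_MOTOR_NORTH",
    "10-20-0002": "PUMP_VALVE_SOUTH",
}
print(mismatched_levels(rows))  # ['10-20-0002']
```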

Calculating all number from a text file (division)

I have a file or a text which contains huge numbers. This is how it looks:
2622256647732477952, 3146707977278973440, 3776049572734768128, 4531259487281721344, 5437511384738065408, 6525013661685678080, 7830016394022813696, 9396019672827375616, 11275223607392849920, 13530268328871419904,
I want to divide every number by a factor of 100. Is there any fast way to do this? Notepad++ maybe? Or any third-party editor which is able to do such stuff?
It's around 1000 numbers, so it would be pretty time consuming to do this manually.
All the numbers seem to be integers. If that is true, and if they are all above 100 (the divisor), why not just use a regular expression to insert a decimal point in every number?
In Notepad++ try:
Search string: (\d+)(\d{2})
Replace string: $1.$2
Check "Regular expression" box and hit "Replace all".
Edit:
In the special case you mention in your comment, where the decimals should just be disregarded, you can simply use (\d+)\d{2} as search string and $1 as the replace string. Note that the result won't be rounded to the nearest integer though (11189 should become 112 really, but you'll get 111).
Other options include importing the string into Excel or other spreadsheet software and using a formula there, or writing a JavaScript snippet to split the string up and divide each number, etc.
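If you go the scripting route, the same regex substitutions carry over almost unchanged. Here is a small sketch in Python (instead of JavaScript; note that Python writes the backreferences as \1 and \2). The 11189 in the sample is added just to show the rounding difference.

```python
import re

text = "2622256647732477952, 11189, 4531259487281721344"

# Insert a decimal point before the last two digits of every number.
with_decimals = re.sub(r"(\d+)(\d{2})", r"\1.\2", text)
print(with_decimals)
# 26222566477324779.52, 111.89, 45312594872817213.44

# Drop the last two digits instead (truncates, does not round).
truncated = re.sub(r"(\d+)\d{2}", r"\1", text)
print(truncated)
# 26222566477324779, 111, 45312594872817213

# Round to the nearest integer using exact integer arithmetic.
rounded = ", ".join(str((int(n) + 50) // 100) for n in re.findall(r"\d+", text))
print(rounded)
# 26222566477324780, 112, 45312594872817213
```

Python integers have arbitrary precision, so numbers this large are divided exactly, which a floating-point division would not guarantee.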

Finding similar strings in large datasets

I'm using Levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.
I'm trying to make this more efficient, and one technique I came up with was to calculate the Levenshtein distance only on strings that are of similar length. I thought about also filtering on the initial character, i.e. if the string to search starts with b then I'll run the calculation only on the strings that start with b. But I'm not sure if I could assume this to work all the time.
I was wondering if you all have a better way of getting this done?
Thanks
One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
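A minimal sketch of that idea in Python, assuming three-character substrings as in the ABCDEF example. The helpers `grams`, `build_index` and `similar` are names invented here, and the word list is made up.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def grams(text, n=3):
    """All length-n substrings of text (empty for shorter strings)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(strings, n=3):
    """Map each substring to the set of stored strings containing it."""
    index = defaultdict(set)
    for s in strings:
        for g in grams(s, n):
            index[g].add(s)
    return index

def similar(query, index, max_distance=2, n=3):
    """Score only the strings that share at least one substring with query."""
    candidates = set()
    for g in grams(query, n):
        candidates |= index.get(g, set())
    scored = [(levenshtein(query, c), c) for c in candidates]
    return sorted((d, c) for d, c in scored if d <= max_distance)

words = ["ABCDEF", "ABCXEF", "ZBCDEG", "QQQQQQ"]  # made-up data
index = build_index(words)
print(similar("ABCDEF", index))
# [(0, 'ABCDEF'), (1, 'ABCXEF'), (2, 'ZBCDEG')]
```

As noted, a close match that shares no three-character substring with the query would be missed, and strings shorter than three characters never become candidates.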
As an alternative to building a database of substrings, you could build a suffix array (http://en.wikipedia.org/wiki/Suffix_array) and LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for strings in the suffix array starting ABCDEF, BCDEF, CDEF, and DEF.
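To make the suffix array lookup concrete, here is a deliberately naive Python sketch. It sorts the suffixes directly, which is not linear time, and it leaves out the LCP array; a real build would use a proper construction algorithm. The function names and the "\x00" separator are assumptions for the example.

```python
def build_suffix_array(strings, separator="\x00"):
    """Concatenate the strings and return (text, sorted suffix start positions)."""
    text = separator.join(strings) + separator
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return text, order

def find_occurrences(text, order, pattern):
    """Binary-search the suffix array for suffixes that start with pattern."""
    lo, hi = 0, len(order)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[order[mid]:order[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # All matching suffixes sit next to each other in sorted order.
    while lo < len(order) and text.startswith(pattern, order[lo]):
        hits.append(order[lo])
        lo += 1
    return sorted(hits)

def owner(text, position, separator="\x00"):
    """Recover the stored string that contains a given text position."""
    start = text.rfind(separator, 0, position) + 1
    return text[start:text.index(separator, position)]

strings = ["ABCDEF", "XXCDEF", "ABCXYZ"]  # made-up data
text, order = build_suffix_array(strings)
hits = find_occurrences(text, order, "CDEF")
print(sorted({owner(text, h) for h in hits}))
# ['ABCDEF', 'XXCDEF']
```

The strings found this way are only candidates; you would still compute the edit distance on each of them.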
