String Concatenation - string

Following up on my previous question (Link) about concatenating a multipart string of variable lengths: I used the method suggested there by rkhayrov, and my function now looks like this:
local sToReturn = string.format( "\t%03s\t%-25s\t%-7s\n\t", "S. No.", "UserName", "Score" )
SQLQuery = assert( Conn:execute( string.format( [[SELECT username, totalcount FROM chatstat ORDER BY totalcount DESC LIMIT %d]], iLimit ) ) )
DataArray = SQLQuery:fetch ({}, "a")
i = 1
while DataArray do
    sTemp = string.format( "%03s\t%025s\t%-7d", tostring(i), DataArray.username, DataArray.totalcount )
    sToReturn = sToReturn..sTemp.."\n\t"
    DataArray = SQLQuery:fetch ({}, "a")
    i = i + 1
end
But even now, the score values still do not line up as required. The maximum length of a username is 25. I've used %025s inside the while loop because I want the usernames to be right-justified, while the %-25s in the header is meant to make the word UserName centre-justified.
EDIT
Current output:
Required Output:
Displaying the list of top 5 chit-chatters.
S. No.  UserName  Score
1       Keeda     9440
2       _2.2_™    7675
3       aim       7057
4       KGBRULES  6770
5       Guddu     6322
I think it's because of the difference in fonts, but since most of the clients have the Windows 7 default fonts (Tahoma/Verdana at 11px), I need the best possible result for at least that.

"I think it's because of difference in fonts"
It is. string.format pads by inserting whitespace, which only lines up in a fixed-width font (i.e. one where all characters, including whitespace, have the same width).
"since most of the clients have Windows 7 default fonts (Tahoma/Verdana at 11px)"
In what? How are they viewing your output? Do you write it to a text file that they then open in the editor of their choice (likely Notepad)? Then this approach will simply not work.
I don't know enough about your output requirements to steer you any further, but it's worth noting that everyone has a browser, so HTML output is very portable.

string.format doesn't truncate: the width of the field is a minimum, not a maximum. You'll have to truncate the strings to 25 characters yourself with something like DataArray.username:sub(1, 25) (Lua string indices start at 1).
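The same minimum-versus-maximum behaviour can be seen in Python's printf-style formatting, shown here purely as an illustration since the question itself is about Lua. The helper format_row and its name_width parameter are hypothetical names, and the sketch assumes the result is viewed in a fixed-width font:

```python
def format_row(rank, username, score, name_width=25):
    """Build one table row; the name is truncated, then right-justified."""
    name = username[:name_width]  # explicit truncation: the width alone never cuts
    return f"{rank:>3}  {name:>{name_width}}  {score:<7}"

# A field width is only a minimum, so a long string is not cut ...
print("%5s|" % "abcdefgh")
# ... while a precision acts as a maximum for %s.
print("%.5s|" % "abcdefgh")

print(format_row(1, "Keeda", 9440))
print(format_row(2, "a_username_that_is_longer_than_25_chars", 7675))
```

Because every row is built from fixed-width fields, all rows have the same length, which is what makes the columns line up in a monospaced display.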

I'd remove the tabs from the string.format call and rely only on the justification provided by %25s. It won't be perfect, but it will probably be closer.
Use a fixed-width font if you can.

ArangoDB AQL: Find Gaps In Sequential Data

I've been given data to build an application that has sequential data in the form of part numbers of products: "000000", "000001", "000002", "000010", "000011" .... The previous application was an old MS Access database that didn't have any gap filling features in the part number generator, hence the gap between "000002" and "000010" (Yes, they are also strings, but I can work with that...).
We could continue to increment based on the last value and ignore the gaps, however, in an attempt to use all numbers available to us with our naming scheme, we'd like to be able to fill the gaps. Our naming scheme describes the "product family" with the first two digits such that: [00]0000 would be a different family from [02]0000.
I can find the starting and ending values using something like:
let query = `
  LET first = (
    MIN(
      FOR part in part_search
        SEARCH STARTS_WITH(part.PartNumber, #family)
        RETURN part.PartNumber
    )
  )
  LET last = (
    MAX(
      FOR part in part_search
        SEARCH STARTS_WITH(part.PartNumber, #family)
        RETURN part.PartNumber
    )
  )
  RETURN { first, last }
`
The above example returns: {first: "000000", last: "000915"}
Using ArangoDB and AQL, how could I go about finding these gaps? I've found some SQL examples but I feel the features of AQL are a bit more limiting.
Thanks in advance!
To start with, I think your best bet for getting min/max values is using aggregates:
FOR part in part_search
  SEARCH STARTS_WITH(part.PartNumber, #family)
  COLLECT x = 1
  AGGREGATE first = MIN(part.PartNumber), last = MAX(part.PartNumber)
  RETURN {
    first: first,
    last: last
  }
But that won't really help when trying to find gaps. And you're right - SQL has several logical constructs that could help (like using variables and cursor iteration), but even that would be a pattern I would discourage.
The better path might be to do a "brute force" approach - compare a table containing your existing numbers with a table of all numbers, using a native method like JOIN to find the difference. Here's how you might do that in AQL:
LET allNumbers = 0..9999
LET existingParts = (
  FOR part in part_search
    SEARCH STARTS_WITH(part.PartNumber, #family)
    LET childId = RIGHT(part.PartNumber, 4)
    RETURN TO_NUMBER(childId)
)
RETURN MINUS(allNumbers, existingParts)
The x..y construct creates a sequence (an array of numbers), which we use as the full set of possible numbers. Then, we want to return only the "non-family" part of the ID (I'm calling it "child"), which needs to be numeric to compare with the previous set. Then, we use MINUS to remove elements of existingParts from the allNumbers list.
One thing to note: that query returns only the "child" portion of the part number, so you would have to join it back to the family number later. Alternatively, you could skip the string-splitting and get "fancy" with your list creation:
LET allNumbers = TO_NUMBER(CONCAT(#family, '0000'))..TO_NUMBER(CONCAT(#family, '9999'))
LET existingParts = (
  FOR part in part_search
    SEARCH STARTS_WITH(part.PartNumber, #family)
    RETURN TO_NUMBER(part.PartNumber)
)
RETURN MINUS(allNumbers, existingParts)
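To sanity-check the set-difference idea outside the database, here is a minimal Python sketch of the same logic. The function name find_gaps, its width parameter, and the sample part list are assumptions made for illustration; it also assumes part numbers are six-character, zero-padded strings whose first two characters are the family:

```python
def find_gaps(existing, family, width=6):
    """Return the unused part numbers of one family as zero-padded strings."""
    child_width = width - len(family)
    # Numeric "child" part of every existing number in this family
    used = {int(p[len(family):]) for p in existing if p.startswith(family)}
    # Every possible child number that is not in use (the MINUS step)
    return [
        f"{family}{n:0{child_width}d}"
        for n in range(10 ** child_width)
        if n not in used
    ]

parts = ["000000", "000001", "000002", "000010", "000011", "020000"]
gaps = find_gaps(parts, "00")
print(gaps[:3])   # the first few free numbers in family "00"
print(len(gaps))  # how many numbers are still free
```

Re-padding the result here restores the leading zeros that are lost when the strings are converted to numbers, which is the "join it back to the family number" step mentioned above.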

FuzzyWuzzy for very similar records in Python

I have a dataset with which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy in this way
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
Where t is the string and dev2 is the list to compare against. My problem is that the list sometimes contains very similar records, and the scorers provided by FuzzyWuzzy seem to be lacking. I've tested with token_sort_ratio, token_set_ratio, partial_token_sort_ratio, partial_token_set_ratio, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one wanted is obviously the second one, but character by character the first one is closer, and it is a different league.
This happens with teams as well. If, let's say, I have the string Buchtholz, I would obtain Buchtholz II before I get TSV Buchtholz.
My main guess now would be to try and weight the presence and absence of several characters more heavily, like single capital letters at the end of the string, so if there is a difference in the letter or an absence it is weighted as less close. Or for () and special characters.
I don't know if there is a way to take this into account or you guys have a better approach to get the string that really matches.
Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions, the number sequences were more important than the text. For example, when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000", the 250 and 2000 parts of the string are important; otherwise I get "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON", which is a better result.
I put the similarity match process through two steps: 1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio). 2. Get the number sequences of these results and pass them through another round of matching with a stricter scorer (token_sort_ratio). Doing this gave me the better result in the example I showed above.
Below are some blocks of code that could be of assistance.
Here's a function to get the numbers from a string. (In your case you might use this to exclude numbers from your string instead?)
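def get_numbers_from_string(description):
    # Keep digits, '.' and '-'; replace every other character with a space
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    # Collapse runs of whitespace so the numbers are separated by single spaces
    numbers = ' '.join(numbers.split())
    return numbers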
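As a quick check of what this helper extracts, the sketch below runs it on the three valve descriptions from the example. The helper is repeated so the snippet runs on its own; only the standard library is used, and no fuzzy scores are computed here:

```python
def get_numbers_from_string(description):
    # Keep digits, '.' and '-'; replace every other character with a space
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    # Collapse runs of whitespace so the numbers are separated by single spaces
    return ' '.join(numbers.split())

query = "SLURRY VALVE 250MM RAGMAX 2000"
candidates = [
    "SLURRY VALVE 50MM RAGMAX 2000",
    "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON",
]

print(get_numbers_from_string(query))  # 250 2000
for text in candidates:
    print(get_numbers_from_string(text), "<-", text)
```

The number sequence of the wanted candidate contains both 250 and 2000, while the unwanted one contains 50 instead of 250, which is the difference the second matching round is meant to pick up.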
And here is a portion of the code I used to put the description match through two rounds:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz, process
try:
    # Round 1: get close matches from goods movements that have material numbers
    df_material = pd.DataFrame(process.extract(description,
                                               corpus_material,
                                               scorer=fuzz.token_set_ratio),
                               columns=['Similar Text', 'Score'])
    if df_material['Score'][df_material['Score'] >= cut_off_accuracy_materials].count() >= 1:
        similar_text = df_material['Similar Text'].iloc[0]
        score = df_material['Score'].iloc[0]
        if nr_description_numbers > 4:
            # Round 2: if there are multiple matches found, then get the best number combination match
            df_material = df_material[df_material['Score'] >= cut_off_accuracy_materials]
            new_corpus = list(df_material['Similar Text'])
            new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
            df_material['numbers'] = new_corpus
            df_numbers = pd.DataFrame(process.extract(description_numbers,
                                                      new_corpus,
                                                      scorer=fuzz.token_sort_ratio),
                                      columns=['numbers', 'Score'])
            similar_text = df_material['Similar Text'][df_material['numbers'] == df_numbers['numbers'].iloc[0]].iloc[0]
            nr_score = df_numbers['Score'].iloc[0]
except Exception:
    # The excerpt ends before the original except clause; re-raising is only a placeholder.
    raise
Hope it helps, and good luck.

how to chop a string till "_" in ssrs using a cube report

I have a string (46318g_orchidpinkminisneakpead). I want to chop off everything up to and including the "_", so the result should be "orchidpinkminisneakpead". What expression should I use to get this value in SSRS? The length of the part to be chopped off varies: sometimes it is 7 characters, sometimes 8 or 9.
I am building the report using a cube so no changes can be done on the back-end.
Try:
=MID(Fields!YourField.Value,
InStr(Fields!YourField.Value,"_")+1,
LEN(Fields!YourField.Value)
)
In your case,
=MID(
"46318g_orchidpinkminisneakpead",
InStr("46318g_orchidpinkminisneakpead","_")+1,
LEN("46318g_orchidpinkminisneakpead")
)
should produce orchidpinkminisneakpead
Let me know if this helps.
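The index arithmetic in that expression can be followed step by step in a short Python sketch. This is only an illustration of the logic, not SSRS syntax; the function name after_first_underscore is made up, and it assumes, as in VB, that InStr is 1-based and yields 0 when the separator is missing:

```python
def after_first_underscore(value: str) -> str:
    """Return everything after the first '_' (the whole value if there is none)."""
    # str.find is 0-based and returns -1 when not found, so adding 1 gives
    # the 1-based position used by InStr, with 0 meaning "not found".
    pos = value.find("_")  + 1
    # Mid(value, pos + 1, ...) in 1-based terms starts at 0-based index pos.
    return value[pos:]

print(after_first_underscore("46318g_orchidpinkminisneakpead"))
print(after_first_underscore("1234567_abc"))  # prefix length does not matter
```

Since the start position is derived from where the underscore is found, the prefix can be 7, 8, or 9 characters long without any change to the expression.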

How to remove percent character from a string in Cognos?

I have a string field with mostly numeric values like 13.4, but some have 13.4%. I am trying to use the following expression to remove the % symbols and retain just the numeric values to convert the field to integer.
Here is what I have so far in the expression definition of Cognos 8 Report Studio:
IF(POSITION('%' IN [FIELD1]) = NULL) THEN
/*** this captures rows with valid data **/
([FIELD1])
ELSE
/** trying to remove the % sign from rows with data like this 13.4% **/
(SUBSTRING([FIELD1]), 1, POSITION('%' IN [FIELD1])))
Any hints/help is much appreciated.
An easy way to do this is to use the trim() function. The following will remove any trailing % characters:
TRIM(trailing '%',[FIELD1])
The approach you are using is feasible. However, the syntax you are using is not compatible with the version of Report Studio that I'm familiar with. Below you will find an updated expression which works for me.
IF ( POSITION( '%'; [FIELD1]) = 0) THEN
( [FIELD1] )
ELSE
( SUBSTRING( [FIELD1]; 1; POSITION( '%'; [FIELD1]) - 1 ) )
Since character positions in strings are 1-based in Cognos, it's important to subtract 1 from the position returned by POSITION(). Otherwise the substring would still include the percent sign, and only the characters after it would be cut off.
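The off-by-one reasoning can be checked with a small Python sketch that mimics the 1-based convention. It is an illustration of the logic only, not Cognos syntax, and the function name strip_percent is hypothetical:

```python
def strip_percent(value: str) -> str:
    """Return the text before the first '%', or the whole value if there is none."""
    # 1-based position of '%', with 0 meaning "not found" (as in the expression above)
    pos = value.find("%") + 1
    if pos == 0:
        return value
    # Characters 1 .. pos-1 in 1-based terms; using pos instead would keep the '%'
    return value[:pos - 1]

for raw in ["13.4", "13.4%"]:
    print(raw, "->", float(strip_percent(raw)))

# Equivalent of trimming trailing '%' characters
print("13.4%".rstrip("%"))
```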
Another note: what you are doing here is data cleansing. It's usually more advantageous to push these chores down to a lower level of the data retrieval chain, e.g. the Data Warehouse or at least the Framework Manager model, so that at the reporting level you can use this field as numeric field directly.

Oracle/SQL - Removing undefined chars from string

I currently have an assignment where I have to handle data from a lot of countries. My customer has given me a list of acceptable characters; let's call it:
'aber =*'
All other characters should just be changed to '_'.
I know the conversion for my country's specific chars (æøå), easily done with something like
select replace ('Ål', 'Å', 'AA') from dual;
But how would I go about removing all unwanted "noise" without splitting it up into a char-by-char comparison?
For example "bear*2 = fear" should become "bear*_ = _ear" as 2 and f are not in the accepted list.
This works in Oracle 10g and up. One approach is to use the regular expression function regexp_replace():
select regexp_replace('bear*2 = fear', '[^aber =*]', '_') as res
from dual
res
------------------------------
bear*_ = _ear
Find out more about regexp_replace() function.
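For comparison, the same negated character class can be tried out in Python's re module. This is a sketch for experimenting with the pattern, not Oracle syntax; the function name mask_disallowed is hypothetical, and re.escape is used on the assumption that the real list of accepted characters may contain characters such as ] or - that are special inside a class:

```python
import re

def mask_disallowed(text: str, allowed: str, replacement: str = "_") -> str:
    """Replace every character that is not in `allowed` with `replacement`."""
    # [^...] matches any single character NOT listed; escaping keeps
    # characters like ']' or '-' from being read as class syntax.
    pattern = "[^" + re.escape(allowed) + "]"
    return re.sub(pattern, replacement, text)

print(mask_disallowed("bear*2 = fear", "aber =*"))  # bear*_ = _ear
```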
