How to create random data for the struct in Julia? - struct

I have created a structure, txn:
struct txn
txn_id::Int64
bank::String
branch::String
teller::String
customer::String
account::String
timestamp::DateTime
dr_cr::String
amount::Float64
end
Please guide me in creating random data for the struct!
Update 1:
With reference to Professor Bogumił Kamiński's advice, adding domains of fields as follows:
a) txn_id is a unique integer(auto incrementing)
b) bank is a 20 character Legal Entity Identifier
c) branch is an 8 or 11 character Business Identifier Code (SWIFT-BIC)
d) teller is a 9 digit Social Security Number
e) customer is a 9 digit Social Security Number or a 20 character Legal Entity Identifier
f) account is a 34 character International Bank Account Number (IBAN)
g) timestamp is an ISO 8601 date-time.
h) dr_cr is in (dr, cr)
i) amount > 0.0000

This question is not specified precisely enough: "random" is an ambiguous term, so what you ask for does not have a single solution.
In general, if you say "random" you should specify the domain and the distribution over that domain. Given this, here is a solution that automatically detects the field types of the struct and populates them with uniform pseudorandom samples from prespecified domains.
using Dates
domains = Dict(Int => 1:10,
String => string.('a':'z'),
DateTime => DateTime("2019-01-01"):Day(1):DateTime("2019-05-30"))
struct Txn
txn_id::Int64
bank::String
branch::String
teller::String
customer::String
account::String
timestamp::DateTime
dr_cr::String
amount::Int64
end
Txn_rand() = Txn(map(t -> rand(domains[t]), fieldtypes(Txn))...)
And now you can write:
julia> Txn_rand()
Txn(3, "q", "f", "j", "m", "z", 2019-03-10T00:00:00, "c", 1)
julia> Txn_rand()
Txn(8, "e", "o", "m", "l", "z", 2019-04-05T00:00:00, "p", 5)
julia> Txn_rand()
Txn(3, "k", "u", "c", "z", "y", 2019-03-13T00:00:00, "x", 1)
EDIT
Given the comment here is how I would approach the generation of the Txn structure (you can probably be more specific e.g. by giving a closed list of bank and branch values etc. as you probably have it - then use the approach proposed above):
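using Dates, Random
# Start at 0 so that the first generated txn_id is 1.
global TXN_ID_COUNTER = 0
# Note: this needs amount::Float64 as in the question; with amount::Int64 the
# constructor throws an InexactError for non-integer amounts.
function Txn_rand()
    global TXN_ID_COUNTER += 1
    Txn(TXN_ID_COUNTER,
        randstring('A':'Z', 20),                                          # bank: 20 character LEI
        randstring('A':'Z', rand(Bool) ? 8 : 11),                         # branch: 8 or 11 character BIC
        randstring('0':'9', 9),                                           # teller: 9 digit SSN
        rand(Bool) ? randstring('0':'9', 9) : randstring('A':'Z', 20),    # customer: SSN or LEI
        randstring('A':'Z', 2) * randstring('0':'9', 32),                 # account: 34 character IBAN-like string
        rand(DateTime("2019-01-01"):Second(1):DateTime("2019-05-30")),    # timestamp
        rand(["dr", "cr"]),                                               # dr_cr
        rand(1:100000) / 10000                                            # amount > 0 with 4 decimal places
    )
end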
I also omit validation of the generated fields (which in general you could do as some of them have validation rules).

Related

Generate A Random String With A Set of Banned Substrings

I want to generate a random string of a fixed length L. However, there is a set of "banned" substrings all of length b that cannot appear in the string. Is there a way to algorithmically generate this parent string?
Here is a small example:
I want a string that is 10 characters long -> XXXXXXXXXX
The banned substrings are {'AA', 'CC', 'AD'}
The string ABCDEFGHIJ is a valid string, but AABCDEFGHI is not.
For a small example it is relatively easy to randomly generate and then check the string, but as the set of banned substrings gets larger (or the length of the banned substrings gets smaller), the probability of randomly generating a valid string rapidly decreases.
This will be a fairly efficient approach, but it requires a lot of theory.
First take your list of banned strings; this walkthrough uses AA, BB, AD. This can trivially be turned into a regular expression that matches any of them, namely /AA|BB|AD/, which you can then turn into a Nondeterministic Finite Automaton (NFA) and then a Deterministic Finite Automaton (DFA) for matching the regular expression. See these lecture notes for an example of how to do that. In this case the DFA will have the following states:
1. Matched nothing (yet)
2. Matched A
3. Matched B
4. End of match
And the transition rules, one per state, will be:
1. If A go to state 2, if B go to state 3, else go to state 1.
2. If A or D go to state 4, if B go to state 3, else go to state 1.
3. If B go to state 4, if A go to state 2, else go to state 1.
4. Match complete, we're done. (Stay in state 4 forever.)
Now normally a DFA is used to find a match. We're going to use it as a way to find ways to avoid a match. So how will we do that?
The trick here is dynamic programming.
What we will do is create a table of:
by position in the string
by state of the match
how many ways there are to get here
how many ways we got here from the previous (position, state) pairs
In other words we go forward and create a table which starts like this:
[
    {1: {'count': 1}},
    {1: {'count': 24, 'prev': {1: 24}},
     2: {'count': 1, 'prev': {1: 1}},
     3: {'count': 1, 'prev': {1: 1}},
    },
    {1: {'count': 623, 'prev': {1: 576, 2: 23, 3: 24}},
     2: {'count': 25, 'prev': {1: 24, 3: 1}},
     3: {'count': 25, 'prev': {1: 24, 2: 1}},
     4: {'count': 3, 'prev': {2: 2, 3: 1}},
    },
    ...
]
By the time this is done, we know exactly how many ways we can wind up at the end of the string with a match (state 4), partial match (states 2 or 3) or not currently matching (state 1).
Now we generate the random string backwards. First we randomly pick the final state among the states that are not a completed match (1, 2 or 3), with odds based on the count. Then from that final state's prev entry we can pick the state we were in before that final one, again with odds based on the counts. Then we randomly pick a letter that causes that transition. This way we are picking the letter/state combination uniformly at random from all solutions.
Now we know what state we were in at the second to last letter. Pick the previous state, then pick a letter in the same way.
Continue back until finally we know the state we're in after we've picked the first letter, and then pick that letter.
It is a lot of logic to write, but it is all deterministic in time, and will let you pick random strings all day long once you've done the initial analysis.
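The idea above can be sketched in Python. This is a minimal sketch under stated assumptions, not the answer's exact procedure: the helper names (build_automaton, count_completions, random_string) are hypothetical, the automaton states are the proper prefixes of the banned strings rather than hand-numbered states, and it counts valid completions from each state and then samples forward, which is the mirror image of the backward pass described above. Both variants pick uniformly among all valid strings.

```python
import random
import string


def build_automaton(banned, alphabet):
    """Return (states, delta); delta[state, letter] is the next state, or None if a banned string would be completed."""
    # States are the proper prefixes of the banned strings, including "".
    prefixes = {w[:i] for w in banned for i in range(len(w))}
    delta = {}
    for state in prefixes:
        for letter in alphabet:
            text = state + letter
            if any(text.endswith(w) for w in banned):
                delta[state, letter] = None
                continue
            # Fall back to the longest suffix that is still a prefix of a banned string.
            while text not in prefixes:
                text = text[1:]
            delta[state, letter] = text
    return prefixes, delta


def count_completions(prefixes, delta, alphabet, length):
    """ways[k][state] = number of valid k-letter continuations from state."""
    ways = [{state: 1 for state in prefixes}]
    for _ in range(length):
        prev = ways[-1]
        ways.append({
            state: sum(prev[delta[state, c]] for c in alphabet
                       if delta[state, c] is not None)
            for state in prefixes
        })
    return ways


def random_string(banned, length, alphabet=string.ascii_uppercase, rng=random):
    """Pick a string of the given length uniformly among those with no banned substring."""
    prefixes, delta = build_automaton(banned, alphabet)
    ways = count_completions(prefixes, delta, alphabet, length)
    if ways[length][""] == 0:
        raise ValueError("no valid string of this length exists")
    state, out = "", []
    for remaining in range(length, 0, -1):
        # Choose each letter with probability proportional to its number of completions.
        r = rng.randrange(ways[remaining][state])
        for letter in alphabet:
            nxt = delta[state, letter]
            if nxt is None:
                continue
            weight = ways[remaining - 1][nxt]
            if r < weight:
                break
            r -= weight
        out.append(letter)
        state = nxt
    return "".join(out)


print(random_string(["AA", "BB", "AD"], 10))
```

The counts are exact Python integers, so the table can be built once and reused for as many samples of the same length as needed.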
There are two ways.
The first: generate random strings repeatedly until you get one in which the banned string does not appear.
#include <bits/stdc++.h>
using namespace std;
const int N = 1110000;
char ban[] = "AA";
char rnd[N];
int main() {
srand(time(0));
int n = 100;
do {
for (int i = 0; i < n; i++) rnd[i] = 'A'+rand()%26;
rnd[n] = 0;
} while (strstr(rnd, ban));
cout << rnd << endl;
}
I think this is the easiest way to implement.
However, the expected running time of this method can be as bad as O((26/25)^n*(n+b)), which blows up when the string to be created is very long and the banned string is very short.
For example, if ban="A" and n=10000, it will exceed any time limit!
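To get a feel for how fast the retry count grows, here is a small sketch of the arithmetic for the single-banned-letter case. The function name expected_attempts_log10 is hypothetical. It assumes each letter is drawn uniformly and independently, so a whole string is valid with probability (25/26)^n and the expected number of attempts is (26/25)^n.

```python
import math


def expected_attempts_log10(n, alphabet_size=26):
    """log10 of the expected number of attempts when one single letter is banned."""
    p_valid_letter = (alphabet_size - 1) / alphabet_size
    # Expected attempts = 1 / p_valid_letter**n; work in logarithms to avoid overflow.
    return -n * math.log10(p_valid_letter)


print(expected_attempts_log10(100))    # short string: a few dozen attempts on average
print(expected_attempts_log10(10000))  # long string: an astronomically large exponent
```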
The second: build the string one character at a time, checking after each character whether the banned string has appeared.
If you want to do it this way, you need to know the KMP algorithm.
With this approach we cannot use the standard library search function strstr.
#include <bits/stdc++.h>
using namespace std;
const int N = 1110000;
char ban[] = "A";
char rnd[N];
int b = strlen(ban);
int n = 10;
// Build the KMP failure table for ban.
int *pre_kmp() {
    int *pie = new int[b];
    pie[0] = 0;
    int k = 0;
    for (int i = 1; i < b; i++) {
        while (k > 0 && ban[k] != ban[i])
            k = pie[k - 1];
        if (ban[k] == ban[i])
            k = k + 1;
        pie[i] = k;
    }
    return pie;
}
int main() {
    srand(time(0));
    int *pie = pre_kmp();
    int matched_pos = 0;
    for (int cur = 0; cur < n; cur++) {
        int before = matched_pos;  // match state before trying a character
        do {
            matched_pos = before;  // a rejected character must not change the state
            rnd[cur] = 'A' + rand() % 26;
            while (matched_pos > 0 && ban[matched_pos] != rnd[cur])
                matched_pos = pie[matched_pos - 1];
            if (ban[matched_pos] == rnd[cur])
                matched_pos = matched_pos + 1;
        } while (matched_pos == b);  // retry if this character completes ban
    }
    cout << rnd << endl;
    delete[] pie;
}
This algorithm's time complexity will be O(26 * n * b).
Of course you could also search manually without using the KMP algorithm. However, in this case, the time complexity becomes O(n*b), so the time complexity becomes very large when both the ban string length and the generator string length are long.
Hope this helps you.
Simple approach:
Create a map where the keys are the banned substrings excluding the final letter, and the values are the lists of allowed letters. Alternatively you can have the values be the lists of banned letters with slight modification.
Generate random letters, using the set of allowed letters for the (b-1)-length substring at the end of the letters already generated, or the full alphabet if that substring doesn't match a key.
This is O(n*b) unless there's a way to update the substring of the last b-1 characters in O(1).
Ruby Code
def randstr(strlen, banned_strings)
b = banned_strings[0].length # All banned strings are length b
alphabet = ('a'..'z').to_a
prefix_to_legal_letters = Hash.new {|h, k| h[k] = ('a'..'z').to_a}
banned_strings.each do |s|
prefix_to_legal_letters[s[0..-2]].delete(s[-1])
end
str = ''
while str.length < strlen
letters_to_use = alphabet
if str.length >= b-1
str_end = str[(-b+1)..-1]
if prefix_to_legal_letters.has_key?(str_end)
letters_to_use = prefix_to_legal_letters[str_end]
end
end
str += letters_to_use.sample()
end
return str
end
A silly example to show this works. The only legal letter after 'a' is 'z'. Every 'a' in the output is followed by a 'z'.
randstr(1000, ['aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao',
'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay'])
=> "gkowhxkhrknrxkbjxbjwiqohvvazwrjxjdekrujdyprjnmbjuklqsjdwzidhpgzzmnfbyjuptbpyezfeeydgdkpznvjfwziazrzohwvnitnfupdqxivtvkbazpvqzdzzsneslrazmhbjojyqowhvjhsrdgpbejicitprxzmkhgsuvvlyfizmhohorazemyhtbazvhvazdmnjmjzoggwmjjnrqxcmrdhxozbsjjdqgfjorazmtwtvvujpgivdxijowgxnkuxovncnivazmtykliqiielsfixuflfsgqbpevazozfsvfynhxyjpxtuolqooowazpyoukssknxdntzjjbqazxjttdblepsjzqgxmxvtrmjbgvuyfvspdrrohmtwhtdxfcvidswxtzbznszsqorpxdywbytsitxeziudmvlnluwmcqtfydxlocltozovhusbblfutbqjfjeslverzctxazyprazxzmazxwbdfkwxdwdqxnqhbcliwuitsnnpscbsjitoftblgjycpnxqsikpjqysmqiazdazwwjmeazxcbejthnlsskhazxazlrceyjtbmcpazscazvsjkqhiqfbjygjhyqazsbjymsovojfxynygzwmlhkmpvswpweqkkvmbrxhazpmiqrazcgprlbywmqpyvtphydniazovrkolzbslsosjvdqkgrjmcorqtgeazfwskjuhndszliiirtncmzrzhocyazyrhhpbcsmneuiktyswvgqwkzswkjnyuazggnreeccyidvrbxuskrlchjxnrrpljilogxmicjvmoeequbpkursrqsisqtfkruswnyftdgbjhwvcrlcnfecyfdnslmxztlbfxjhgeslqedrflthlhnlwopmsdjgochxwxhfhvqcixvxdjixcazggmexidtlhymkiyyfuhxufvxyfazmmwsbrlooqwfphgfhvthspvmyiazdazggpeuhnpjmzsazfxmsukpd"
Note that this code and approach need modification if the OP's statement that all banned substrings are length b is modified -- either directly (by giving banned substrings of varying lengths), or implicitly (by having a (b-1)-length substring for which every letter is banned, in which case the (b-1)-length substring is effectively banned).
The modification here is the obvious one of checking all the possible key lengths in the map; it's still O(n*b) assuming b is the longest banned substring.
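A minimal Python sketch of that modification follows. It is an illustration under assumptions, not a port of the Ruby code: the name randstr_varlen is hypothetical, the map stores banned letters per prefix rather than allowed letters, and a dead end (every letter banned after the current suffix) simply raises an error instead of backtracking.

```python
import random
import string


def randstr_varlen(length, banned_strings, alphabet=string.ascii_lowercase, rng=random):
    """Generate a random string, checking every key length in the prefix map."""
    # Map each banned string minus its last letter to the letters banned after it.
    banned_next = {}
    for s in banned_strings:
        banned_next.setdefault(s[:-1], set()).add(s[-1])
    key_lengths = sorted({len(key) for key in banned_next})

    out = []
    while len(out) < length:
        blocked = set()
        for k in key_lengths:
            if k <= len(out):
                tail = "".join(out[len(out) - k:])  # last k letters ("" when k == 0)
                blocked |= banned_next.get(tail, set())
        allowed = [c for c in alphabet if c not in blocked]
        if not allowed:
            raise ValueError("dead end: every letter is banned after this suffix")
        out.append(rng.choice(allowed))
    return "".join(out)


print(randstr_varlen(40, ["aa", "abc", "z"]))
```

Each step looks at one suffix per distinct key length, so the cost per generated letter is bounded by the number of distinct banned-string lengths times the cost of building the suffix.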

Calculating the number of characters inside a list in python

I made a list with some characters in it and looped through it to count the occurrences of a specific character, but it returns the total number of characters in the list instead of the count of the one I asked for. Take a look at my code; if someone can help I will appreciate it!
This is the code:
array = ['a', 'b', 'c', 'a']
sum_of_as = 0
for i in array:
    if str('a') in array:
        sum_of_as += 1
print(f'''The number of a's in this array are {sum_of_as}''')
If you know the list is only ever going to contain single letter strings, as per your example, or if you are searching for a word in a list of words, then you can simply use
list_of_strings = ["a", "b,", "c", "d", "a"]
list_of_strings.count("a")
Be aware though that it will not count things such as
l = ["ba", "a", "c"] where the response would be 1 as opposed to 2 when searching for a.
The below examples do account for this, so it really does depend on your data and use case.
list_of_strings = ["a", "b,", "c", "d", "ba"]
count = sum(string.count("a") for string in list_of_strings)
print(count)
>>> 2
The above iterates over each element of the list and sums the number of times the letter "a" is found in it, using str.count()
str.count() is a method that returns how many times the string you supply occurs in the string you call the method on.
This is the equivalent of doing
count = 0
list_of_strings = ["a", "b,", "c", "d", "ba"]
for string in list_of_strings:
    count += string.count("a")
print(count)
name = "david"
print(name.count("d"))
>>> 2
The if str('a') in array evaluates to True in every for-loop iteration, because there is 'a' in the array.
Try to change the code to if i == "a":
array = ["a", "b", "c", "a"]
sum_of_as = 0
for i in array:
    if i == "a":
        sum_of_as += 1
print(sum_of_as)
Prints:
2
OR:
Use list.count:
print(array.count("a"))

Recursion for combinations of objects but with limited spaces

In Python, I'm trying to develop a recursive for loop that can produce a list of lists with X objects in Y combinations. For instance, if X= 26 (say the alphabet) and Y=5 (length of word), I need to generate a list of all possible 5 letter words. The program is not using letters, however, it's using objects that are already in a list X long. Any suggestions?
I presume I need a repetitive counter for Y and an iterative recursive for loop for X, but I keep getting hung up on the details. I've tried drawing it out on paper, but the recursive nature is making my head hurt.
Edit: Per the answer below, I developed the following script that does not require recursion:
list1 = ["v", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n"]
def combo(object_list, spots):
    pool = len(object_list)
    total = pool**spots
    permlist = list()
    wordlist = list()
    z = []
    for i in range(total):
        print("top", wordlist)
        wordlist = []
        z = base(i,pool,spots)
        for q in z:
            j=-1
            for a in object_list:
                j+=1
                if int(q) == int(j):
                    wordlist.append(a)
        permlist.append(wordlist)
    return permlist
def base(number, base, digits):
    remainder = []
    result = [0,0]
    while number >0:
        dividend = number // int(base)
        remainder.append(str(number % int(base)))
        number = dividend
        result = list(reversed(remainder))
    while len(result) < digits:
        result.insert(0,0)
    return result
print (combo(list1,4))
One easy way to generate all the possibilities here requires only one loop, and no recursion.
You can think of this as a simple problem in counting. Using your analogy, we can think of a 5 letter word as a 5-digit number in base 26. The number of possible results for this example would be 26^5 = 11,881,376. So, we count from 0 to 11,881,375, then convert each value to base 26, and treat each base-26 digit as an index into (in this case) the alphabet.
In your case, the number of digits and the base are both probably different (and it sounds like the "base" you're using may not be a constant either), but neither of those poses a particularly difficult problem.
Based on your comments, you only want "numbers" in which each digit is unique. Given how easy it is to just generate all the combinations (simple counting) it's probably easiest to generate every number in the computed range, then filter out any that have repeated "digits".
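The counting approach, including the filter for repeated "digits", can be sketched as below. The names digit_tuples and words are hypothetical, and the sketch assumes the objects live in an indexable sequence.

```python
def digit_tuples(base, length):
    """Count from 0 to base**length - 1, yielding each value as a list of digits."""
    for value in range(base ** length):
        digits = [0] * length
        # Peel off the least significant digit first and store it at the end.
        for pos in range(length - 1, -1, -1):
            value, digits[pos] = divmod(value, base)
        yield digits


def words(objects, length, unique=False):
    """Yield every arrangement of `length` objects; skip repeats when unique=True."""
    for digits in digit_tuples(len(objects), length):
        if unique and len(set(digits)) < length:
            continue  # some "digit" (object index) appears more than once
        yield [objects[d] for d in digits]


for word in words(["x", "y", "z"], 2, unique=True):
    print(word)
```

In the standard library, itertools.product(objects, repeat=length) and itertools.permutations(objects, length) produce the same sequences (as tuples) without the manual base conversion.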

Rust: BTreeMap equality+inequality range query

I want to call range on a BTreeMap, where the keys are tuples like (a,b). Say we have:
(1, 2) => "a"
(1, 3) => "b"
(1, 4) => "c"
(2, 1) => "d"
(2, 2) => "e"
(2, 3) => "f"
The particularity is that I want all the entries that have a specific value for the first field, but a range on the second field, i.e. I want all entries where a = 1 AND 1 < b <= 4. The RangeBounds operator in that case is not too complicated, it would be (Excluded((1, 1)), Included((1, 4))). If I have an unbounded range, say a = 1 AND b > 3, we would have the following RangeBounds: (Excluded((1, 3)), Included((1, i64::max_value()))).
The problem arises when the type inside of the tuple does not have a maximum value, for instance a string (CStr specifically). Is there a way to solve that problem? It would be useful to be able to use Unbounded inside of the tuple, but I don't think it's right. The less interesting solution would be to have multiple layers of datastructures (for instance a hashmap for the first field, where keys map to... a BTreeMap). Any thoughts?
If the first field of your tuple is an integer type, then you can use an exclusive bound on the next integer value, paired with an empty CStr. (I'm assuming that <&CStr>::default() is the "smallest" value in &CStr's total order.)
let range = my_btree_map.range((Excluded((1, some_cstr)), Excluded((2, <&CStr>::default()))));
If the first field is of a type for which it is difficult or impossible to obtain the "next greater value", then a combination of range and take_while will give the correct results, though with a little overhead.
let range = my_btree_map
.range((Excluded((1, some_cstr)), Unbounded))
.take_while(|&((i, _), _)| *i == 1);

python dictionaries when keys are numbers [duplicate]

I have a question about dictionary properties in Python when the keys are numbers.
In my case, when I print a dictionary with number keys the output is sorted by key, but when the keys are strings the dictionary is unordered. I want to know the rule behind this.
l = {"one" : "1", "two" : "2", "three" : "3"}
print(l)
l = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}
print(l)
l = {2: "two", 3: "three", 4: "four", 1: "one", 5: "five"}
print(l)
result:
{'three': '3', 'two': '2', 'one': '1'}
{1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five'}
{1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five'}
Python uses a hash table to store dictionaries, so in the Python version shown here there is no guaranteed order in dictionaries or in other objects that rely on a hash function, such as sets. (Since Python 3.7, dict preserves insertion order; sets remain unordered.)
As for the index of an item in a hash object, Python calculates it with the following code in hashtable.c:
key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);
Since the hash value of an integer is the integer itself, the index depends directly on the number: it is the bitwise AND of the number and (ht->num_buckets - 1), which is a constant for a given table size.
Consider the following example with a set, which also uses a hash table:
>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])
For number 33 we have :
33 & (ht->num_buckets - 1) = 1
That is:
'0b100001' & '0b111'= '0b1' # 1 the index of 33
Note in this case (ht->num_buckets - 1) is 8-1=7 or 0b111.
And for 1919 :
'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919
And for 333 :
'0b101001101' & '0b111' = '0b101' # 5 the index of 333
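The same index arithmetic can be reproduced directly in Python. This is only a sketch of the masking step: the function name initial_index is hypothetical, the table size of 8 is the one assumed in the text above, and collision handling (what happens when two keys land on the same index) is left out.

```python
def initial_index(key, num_buckets=8):
    """Initial slot for key in a table with num_buckets slots (a power of two)."""
    return hash(key) & (num_buckets - 1)


# Small non-negative ints hash to themselves in CPython, so only the low bits matter.
for n in (0, 1919, 2000, 3, 45, 33, 333, 5):
    print(n, bin(n), initial_index(n))
```

Keys such as 5, 45 and 333 share the same low three bits, so they compete for the same initial slot and their final order depends on collision resolution.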
For more details about Python's hash function, it is worth reading the following comment from the Python source code:
Major subtleties ahead: Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness. Python doesn't: its most
important hash functions (for strings and ints) are very regular in common
cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
This isn't necessarily bad! To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints.
The same is approximately true when keys are "consecutive" strings. So this
gives better-than-random behavior in common cases, and that's very desirable.
OTOH, when collisions occur, the tendency to fill contiguous slices of the
hash table makes a good collision resolution strategy crucial. Taking only
the last i bits of the hash code is also vulnerable: for example, consider
the list [i << 16 for i in range(20000)] as a set of keys. Since ints are their own hash codes, and this fits in a dict of size 2**15, the last 15 bits of every hash code are all 0: they all map to the same table index.
But catering to unusual cases should not slow the usual ones, so we just take
the last i bits anyway. It's up to collision resolution to do the rest. If
we usually find the key we're looking for on the first try (and, it turns
out, we usually do -- the table load factor is kept under 2/3, so the odds
are solidly in our favor), then it makes best sense to keep the initial index
computation dirt cheap.
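The pathological key set mentioned in the quoted comment is easy to check. The sketch below only applies the "take the last i bits" step to each key, assuming a table of size 2**15 as in the quote; it does not model the collision resolution that CPython performs afterwards.

```python
mask = 2**15 - 1  # keeps the last 15 bits, i.e. a table of size 2**15

# Keys from the quoted comment: every key has its low 16 bits equal to zero.
shifted_keys = [i << 16 for i in range(20000)]
shifted_slots = {hash(k) & mask for k in shifted_keys}

# A contiguous range of ints, for contrast.
contiguous_slots = {hash(k) & mask for k in range(20000)}

print(len(shifted_slots))     # number of distinct initial slots for the shifted keys
print(len(contiguous_slots))  # number of distinct initial slots for 0..19999
```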
