How to interpret a sensor value whose width is not a multiple of 8 bits - sensors

I am trying to make my sensor more accurate, but after calibrating it I ran into the problem that I no longer know how to read what it measures. For a 16-bit value I can just use
angY = i2c.mem_read(2, 0x68, 0x45)
VangY = (angY[0] * 256) + angY[1]
but now there is an additional bit on the right. How should I read it? I can't make it count as a half; that would be useless.

I guess I just have to make my last bit count as 1 and multiply the rest by 2.
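A minimal sketch of that idea, in the same style as the code above, assuming the extra bit is exposed as the most significant bit of the register that follows the two value bytes (the register layout here is an assumption; check the sensor's datasheet):

    # Read 3 bytes starting at 0x45: the two value bytes plus the register
    # assumed to hold the extra bit in its MSB.
    raw = i2c.mem_read(3, 0x68, 0x45)
    # Shift the old 16-bit value left by one and bring in the extra LSB.
    VangY = ((raw[0] * 256 + raw[1]) << 1) | (raw[2] >> 7)

This makes the last bit count as 1 and, in effect, multiplies the rest by 2.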

Related

How to handle various units in a single attribute / feature using Pandas?

I have a dataset I am cleaning, where one of the attributes/features has values in various units. For example, some of the values are as follows:
1 kg; 6 LB; 900 gms; 32 oz; etc.
If I use the standard scaler it will not be fair, as the values are in different units, so I cannot treat them as they are.
Please suggest how to handle such data.
I would recommend converting the different values to the same unit first. For example, you can convert every value to kg, or whatever suits you best, and then apply the standard scaler.
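A minimal sketch of that approach in pandas, assuming the unit strings from the question and standard conversion factors to kilograms:

    import pandas as pd

    # Assumed conversion factors to kg for the units seen in the data.
    TO_KG = {"kg": 1.0, "lb": 0.453592, "gms": 0.001, "oz": 0.0283495}

    def to_kg(value):
        amount, unit = value.split()
        return float(amount) * TO_KG[unit.lower()]

    s = pd.Series(["1 kg", "6 LB", "900 gms", "32 oz"])
    weights_kg = s.apply(to_kg)  # one consistent unit; safe to standard-scale now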
Thanks all. I did some research and found that I need to convert the various units into standard units that follow international norms, namely SI units (https://www.nist.gov/pml/weights-and-measures/metric-si/si-units); the same suggestion was given by #sharmajee499.
Moving ahead with this approach. Though this is going to be a lot of manual code, it seems there is no direct, short and easy way.
Please do post if you have a better solution.

How can I better optimize a search in possible Fantasyland constructions in Pineapple poker?

So, a bit of explanation to preface the question. In variants of Open Face Chinese poker, you are dealt cards one at a time, which are to be placed into three different rows; the goal is to make each row increasingly better and, of course, to get the best hands possible. A difference from normal poker is that the top row only contains three cards, so three of a kind is the best possible hand you can get there. In a variant of this called Pineapple, which is what I'm working on a bot for, you are dealt three cards at a time after the initial 5, and you discard one of those three cards each round.
Now, there's a special rule called Fantasyland: if you get a pair of queens or better in the top row, and still manage to get successively better hands in the middle and bottom rows, your next round becomes a Fantasyland round. This is a round where you are dealt all 15 cards at once and are free to construct the best three rows possible (rows of 3, 5, and 5 cards, discarding 2 of them). Each row yields a certain number of points (royalties, as they're called) depending on which hand is constructed, and each successive row needs better and better hands to yield the same number of points.
Trying to optimize solutions for this seemed like a natural starting point, and one of the most interesting parts as well, so I started working on it. My first attempt, which is also where I'm stuck, was to use Simulated Annealing to do local search optimization. The energy/evaluation function is the number of points, and at first I tried a move/neighbor function that simply swaps two cards at random, having first placed them as they were drawn. This worked decently, managing a mean of around 6 points per hand, which isn't bad, but I often noticed that I could spot better solutions by swapping more than one pair of cards at the same time. So I changed the move/neighbor function to swap several pairs of cards at once, and also tried swapping a random number of pairs (between 1 and 3, later up to 5), which yielded slightly better results, but I still often spot better solutions by simply taking a look.
If anyone is reading this and understands the problem, any idea on how to better optimize this search? Should I use a different move/neighbor function, different Annealing parameters, or perhaps a different local search method, or even some kind of non-local search? All ideas are welcome and deeply appreciated.
You haven't indicated a performance requirement, so I'll assume that this should work quickly enough to be usable in a game with human players. It can't take an hour to find the solution, but you don't need it in a millisecond, either.
I'm wondering if simulated annealing is the right method. This might be an opportunity for brute force.
One can make a very fast algorithm for evaluating poker hands. Consider an encoding of the cards where 13 bits encode the card value and 4 bits encode the suit. OR together the cards in the hand and you can quickly identify pairs, triples, straights, and flushes.
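A minimal sketch of that kind of bitmask evaluation (this particular encoding, one bit per rank plus per-suit counts, is one possible choice, not the only one; note that pairs and trips additionally need per-rank counts, since ORing collapses duplicates):

    # cards: list of (rank, suit) with rank 0..12 (deuce..ace) and suit 0..3.
    def encode(cards):
        rank_bits = 0
        suit_counts = [0, 0, 0, 0]
        for rank, suit in cards:
            rank_bits |= 1 << rank   # OR the cards together, as described
            suit_counts[suit] += 1
        return rank_bits, suit_counts

    def is_flush(suit_counts):
        return max(suit_counts) == 5  # all five cards share one suit

    def is_straight(rank_bits):
        for lo in range(9):           # windows from 2-6 up to T-A
            if ((rank_bits >> lo) & 0b11111) == 0b11111:
                return True
        return (rank_bits & 0b1000000001111) == 0b1000000001111  # the wheel A-5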
At first glance, there would seem to be 15! (13,076,743,680,000) possible positions for all the cards which are dealt, but there are other symmetries and restrictions that reduce the meaningful combinations and limit the space that must be explored.
One important constraint is that the bottom row must have a higher score than the middle row and that the middle row must have a higher score than the top row.
There are 3003 sets of bottom cards: COMBINATIONS(15 cards, 5 at a time) = 15!/(5!(15-5)!) = 3003. For each set of possible bottom cards, there are COMBINATIONS(10 cards, 5 at a time) = 10!/(5!(10-5)!) = 252 sets of middle cards. The top row has COMBINATIONS(5 cards, 3 at a time) = 5!/(3!(5-3)!) = 10. With no further optimization, a brute force approach would require evaluating 3003 * 252 * 10 = 7,567,560 positions. I suspect that this can be evaluated within an acceptable response time.
A further optimization uses the constraint that each row must be worth less than the row below. If the middle row is worth more than the bottom row, the top row can be ignored by pruning the tree at that point, which removes a factor of 10 for those cases.
Also, since the bottom row must be worth more than the middle and top rows, there may be some minimum score the bottom row must achieve before it is worth trying middle rows. Rejecting a bottom row prunes 2520 cases (252 * 10) from the tree.
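A brute-force sketch of that enumeration with the pruning described above; hand_rank(), hand_rank3(), and royalties() are assumed helpers returning a comparable hand strength (for 5- and 3-card rows) and the royalty points:

    from itertools import combinations

    def best_arrangement(cards15):
        best = None
        for bottom in combinations(cards15, 5):
            rest10 = [c for c in cards15 if c not in bottom]
            bottom_rank = hand_rank(bottom)
            for middle in combinations(rest10, 5):
                if hand_rank(middle) > bottom_rank:
                    continue  # prune: middle may not outrank bottom
                rest5 = [c for c in rest10 if c not in middle]
                for top in combinations(rest5, 3):
                    if hand_rank3(top) > hand_rank(middle):
                        continue  # prune: top may not outrank middle
                    score = royalties(bottom, middle, top)
                    if best is None or score > best[0]:
                        best = (score, bottom, middle, top)
        return best

With the 3003 * 252 * 10 positions counted above, this is small enough to enumerate exhaustively.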
I understand that there is a way to use simulated annealing for estimating solutions for discrete problems. My use of simulated annealing has been limited to continuous problems with edge constraints. I don't have a good intuition for how to apply SA to discrete problems. Many discrete problems lend themselves to an exhaustive search, provided the search space can be trimmed by exploiting symmetries and constraints in the particular problem.
I'd love to know the solution you choose and your results.

Divide Data Into Quartiles

I have a dataset that contains admission rates of all providers that we work with. I need to divide that data into quartiles, so that each provider can see where their rate lies in comparison to other providers. The rate ranges from 7% to 89%. Can anyone suggest how to do this? I am not sure if this is the right place to ask this question, but if somebody can help me with this, I would really appreciate it.
The other concern is that if a provider's numbers are really small, e.g. 2/4 = 50%, the provider might fall into a worse quartile, but that doesn't mean the provider's performance is bad, because the numbers are so small. I hope this makes sense. Please let me know if I can clarify further.
There are ways to obtain quantiles without doing a complete sort, but unless you have huge amounts of data there is no point in implementing those algorithms if you don't already have them available. Presuming you have a sort() function available, all you need to do is:
Given n data points.
Sort the data points.
Find the n/4, n/2 and 3*n/4th points in the sorted data, which are your quartiles.
As you say, if n is less than some number (that you'll have to decide for yourself) you may want to say that the quartile result is "not applicable" or some such.
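A minimal sketch with pandas (the rates here are made up; qcut splits the sorted values into four equal-sized groups):

    import pandas as pd

    rates = pd.Series([0.07, 0.12, 0.28, 0.35, 0.41, 0.50, 0.62, 0.89])
    quartile = pd.qcut(rates, 4, labels=["Q1", "Q2", "Q3", "Q4"])
    # Each provider now has a Q1..Q4 label for where their rate falls.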
First concern: for small n, do not use quartiles. What counts as "small" is necessarily a judgment call.

NLP - Improving Running Time and Recall of Fuzzy string matching

I have a working algorithm, but its running time is horrible. Yes, I knew from the start that it would be slow, but not this slow: for just 200,000 records, the program runs for more than an hour.
Basically what I am doing is:
for each searchfield in search fields
    for each sample in samples
        do a q-gram matching
    if there are matches then return them
    else
        split the searchfield into uniwords
        for each sample in samples
            split the sample into uniwords
            for each uniword in the sample
                if the uniword is a known abbreviation
                    then search the dictionary for its full word or other known abbreviations
                else do a jaro-winkler matching
            average the distances of all the uniwords
            if the average is above the threshold then mark it as a match and break
        end for
        if there is a match, make a comment that it matched one of the samples partially
    end else
end for
Yes, this code is very loop-happy. I am using brute force because recall is very important. So I'm wondering how I can make it faster, since I will be running it not just on 200,000 records but on millions, and the client's computers are not high-end (1-2 GB of RAM, Pentium 4 or Dual-Core; the computer where I test this program is a Dual Core with 4 GB of RAM). I came across TF/IDF, but I don't know if it will be sufficient. And I wonder how Google manages to search in real time.
Thanks in advance!
Edit:
This program is a data filter. From 200,000 dummy records (the actual data is about 12M), I must filter out data that is irrelevant to the samples (500 dummy samples; I still do not know how many actual samples there will be).
With the given dummy data and samples, the running time was about 1 hour, but after tinkering here and there I have reduced it to 10-15 minutes. I did this by grouping the fields and samples that begin with the same character (ignoring special and non-meaningful words, e.g. the, a, an) and matching fields only against samples with the same first character. I know there is a problem with this: what if a field is misspelled in its first character? But I think the number of those is negligible. The samples are always spelled correctly, since they are maintained.
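A minimal sketch of that first-character blocking, using difflib's SequenceMatcher from the standard library as a stand-in for the Jaro-Winkler distance (the stopword list and threshold are assumptions):

    from collections import defaultdict
    from difflib import SequenceMatcher

    STOPWORDS = {"the", "a", "an"}

    def first_char(text):
        # First character of the first meaningful word.
        for word in text.lower().split():
            if word not in STOPWORDS:
                return word[0]
        return ""

    def build_blocks(samples):
        blocks = defaultdict(list)
        for sample in samples:
            blocks[first_char(sample)].append(sample)
        return blocks

    def match(field, blocks, threshold=0.85):
        # Only compare against samples sharing the first meaningful character.
        for sample in blocks.get(first_char(field), []):
            if SequenceMatcher(None, field.lower(), sample.lower()).ratio() >= threshold:
                return sample
        return None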
What is your programming language? I would guess q = 2 or 3 is sufficient. I would also suggest moving from unigrams to higher-order n-grams.

Pin Generation

I am looking to develop a system in which I need to assign every user a unique PIN code for security. The user will only enter this PIN code as a means of identifying himself, so I don't want a user to be able to guess another user's PIN code. Assuming the maximum number of users I will have is 100,000, how long should this PIN code be?
e.g. 1234 4532 3423
Should I generate this code via some sort of algorithm, or should I randomly generate it?
Basically, I don't want people to be able to guess other people's PIN codes, and it should support a sufficient number of users.
I am sorry if my question sounds a bit confusing, but I would gladly clarify any doubts.
Thank you very much.
UPDATE
After reading all the posts below, I would like to add some more detail.
What I am trying to achieve is something very similar to a scratch card.
A user is given a card, which he/she must scratch to find the PIN code.
Using this PIN code, the user must be able to access my system.
I cannot add extra security (e.g. a username and password), as that would deter users from using the scratch card. I want to make the PIN code as difficult as possible to guess within these limitations.
Thank you all for your amazing replies again.
4 random digits should be plenty if you append them to a unique, known user id (which could also be a number) [as recommended by starblue].
A pseudo-random number generator should also be fine. You can store these in the DB using reversible encryption (AES) or one-way hashing.
The main concern is how many times a person can enter the PIN incorrectly before being locked out. This should be low, say around three; this will stop people guessing other people's numbers.
Any longer than 6 digits and people will forget them, or worse, write them on a post-it note on their monitor.
Assuming an account locks after 3 incorrect attempts, then a 4-digit PIN plus a user-ID component, UserId (999999) + Pin (1234), gives an attacker a 3/10000 chance of guessing a given user's PIN. Is this acceptable? If not, make the PIN length 5 and get 3/100000.
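A minimal sketch of issuing that user-id + PIN pair in Python, using the secrets module so the PIN is unpredictable (the id width and formatting are assumptions):

    import secrets

    def issue_credential(user_id):
        pin = f"{secrets.randbelow(10000):04d}"  # 4 random digits, keeps leading zeros
        return f"{user_id:06d}", pin

    # e.g. user 42 might get ("000042", "0719"); allow only ~3 wrong tries.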
May I suggest an alternative approach? Take a look at Perfect Paper Passwords, and the derivatives it prompted.
You could use this "as is" to generate one-time PINs, or simply to generate a single PIN per user.
Bear in mind, too, that duplicate PINs are not of themselves an issue: any attack would then simply have to try multiple user-ids.
(Mileage warning: I am definitely not a security expert.)
Here's a second answer: from re-reading, I assume you don't want a user-id as such - you're just validating a set of issued scratch cards. I also assume you don't want to use alphabetic PINs.
You need to choose a PIN length such that the probability of guessing a valid PIN is less than 1/(The number of attempts you can protect against). So, for example, if you have 1 million valid PINs, and you want to protect against 10000 guesses, you'll need a 10-digit PIN.
If you use John Graham-Cumming's version of the Perfect Paper Passwords system, you can:
Configure this for (say) 10-digit decimal pins
Choose a secret IV/key phrase
Generate (say) the first million passwords(/PINs)
I suspect this is a generic procedure that could, for example, be used to generate 25-alphanumeric product ids, too.
Sorry for doing it by successive approximation; I hope that comes a bit nearer to what you're looking for.
If we assume 100,000 users maximum, then they can have unique PINs in the range 0-99,999, i.e. 5 digits.
However, this would make it easier to guess the PINs with the maximum number of users.
If you can restrict the number of attempts on the PIN then you can have a shorter PIN.
eg. maximum of 10 failed attempts per IP per day.
It also depends on the value of what you are protecting and how catastrophic it would be if the odd one did get out.
I'd go for 9 digits if you want to keep it short or 12 digits if you want a bit more security from automated guessing.
To generate the PINs, I would take a high-resolution timestamp along with some salt and maybe a pseudo-random number, generate a hash, and use the first 9 or 12 digits. Make sure there is a reasonable and random delay between new PIN generations, so don't generate them in a loop, and if possible make generation user-initiated.
eg. Left(Sha1(DateTime + Salt + PseudoRandom),9)
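A Python sketch of that pseudocode (the salt is an assumed system-wide secret; this mirrors the answer's recipe rather than a vetted cryptographic design):

    import hashlib
    import os
    import time

    SALT = b"system-wide secret"  # assumption

    def generate_pin(length=9):
        material = str(time.time_ns()).encode() + SALT + os.urandom(16)
        digest = hashlib.sha1(material).hexdigest()
        return str(int(digest, 16))[:length]  # first `length` decimal digits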
Lots of great answers so far: simple, effective, and elegant!
I'm guessing the application is somewhat lottery-like, in that each user gets a scratch card and uses it to ask your application if "he's already won!" So, from that perspective, a few new issues come to mind:
War-dialing, or its Internet equivalent: can a rogue user hit your app repeatedly, say guessing every 10-digit number in succession? If that's a possibility, consider limiting the number of attempts from a particular location. An effective way might be simply to refuse to answer more than, say, one attempt every 5 seconds from the same IP address. This makes machine-driven attacks inefficient and avoids the lockout problem (a minimal throttling sketch follows these three points).
Lockout problem: If you lock an account permanently after any number of failed attempts, you're prone to denial of service attacks. The attacker above could effectively lock out every user unless you reactivate the accounts after a period of time. But this is a problem only if your PINs consist of an obvious concatenation of User ID + Key, because an attacker could try every key for a given User ID. That technique also reduces your key space drastically because only a few of the PIN digits are truly random. On the other hand, if the PIN is simply a sequence of random digits, lockout need only be applied to the source IP address. (If an attempt fails, no valid account is affected, so what would you "lock"?)
Data storage: if you really are building some sort of lottery-like system you only need to store the winning PINs! When a user enters a PIN, you can search a relatively small list of PINs/prizes (or your equivalent). You can treat "losing" and invalid PINs identically with a "Sorry, better luck next time" message or a "default" prize if the economics are right.
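As promised, a minimal sketch of the per-IP throttling suggested under the first point (in-memory only; a real deployment would need shared storage):

    import time

    last_attempt = {}  # ip -> time of the most recent PIN attempt

    def allow_attempt(ip, min_interval=5.0):
        now = time.monotonic()
        if now - last_attempt.get(ip, 0.0) < min_interval:
            return False  # throttled: too soon since the last try from this IP
        last_attempt[ip] = now
        return True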
Good luck!
The question should be, "how many guesses are necessary on average to find a valid PIN code, compared with how many guesses attackers are making?"
If you generate 100,000 5-digit codes, then every possible code is valid, so it takes exactly 1 guess. This is unlikely to be good enough.
If you generate 100,000 n-digit codes, then on average it takes 10^(n-5) guesses to find a valid one. To work out whether this is good enough, you need to consider how your system responds to a wrong guess.
If an attacker (or, all attackers combined) can make 1000 guesses per second, then clearly n has to be pretty large to stop a determined attacker. If you permanently lock out their IP address after 3 incorrect guesses, then since a given attacker is unlikely to have access to more than, say, 1000 IP addresses, n=9 would be sufficient to thwart almost all attackers. Obviously if you will face distributed attacks, or attacks from a botnet, then 1000 IP addresses per attacker is no longer a safe assumption.
If in future you need to issue further codes (more than 100 000), then obviously you make it easier to guess a valid code. So it's probably worth spending some time now making sure of your future scaling needs before fixing on a size.
Given your scratch-card use case, if users are going to use the system for a long time, I would recommend allowing them (or forcing them) to "upgrade" their PIN code to a username and password of their choice after the first use of the system. Then you gain the usual advantages of username/password, without discarding the ease of first use of just typing the number off the card.
As for how to generate the numbers: presumably you'll store each one you generate, in which case I'd say generate them randomly and discard duplicates. If you generate them using any kind of algorithm, and someone figures out the algorithm, then they can figure out valid PIN codes. If you select an algorithm such that it's not possible for someone to figure it out, then that is almost a pseudo-random number generator anyway (the other property of PRNGs being that they're evenly distributed, which helps here too, since it makes codes harder to guess), in which case you might as well just generate them randomly.
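A minimal sketch of generate-randomly-and-discard-duplicates, using a set for the duplicate check:

    import secrets

    def generate_unique_pins(count, digits=10):
        pins = set()
        while len(pins) < count:
            pins.add(f"{secrets.randbelow(10 ** digits):0{digits}d}")
        return pins  # `count` distinct PINs, leading zeros preserved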
One pitfall if you use a random number generator: if you treat PINs as integers, you will never get a PIN like "00038384882", because an integer never begins with 0; all your PINs would effectively start with digits 1-9.
I have seen many real PIN numbers that include, and even begin with, zeros, so treating PINs as integers eliminates a large block of otherwise valid codes (a simple counting exercise tells you exactly how many).
I think you should instead pick digits 0-9 at random, one at a time, and build your PIN as a string.
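A minimal sketch of building the PIN digit by digit as a string, so leading zeros survive:

    import secrets
    import string

    def random_pin(length=11):
        return "".join(secrets.choice(string.digits) for _ in range(length))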
If you want to generate scratch-card-type PIN codes, then you should use large numbers, about 13 digits long; also, like credit card numbers, they should have a checksum or verification digit embedded in the number itself. You need an algorithm that generates a PIN from some initial data, which can be a sequence number. The resulting PIN must be unique for each number in the sequence, so that if you generate 100,000 PIN codes they are all different.
This way you can verify a number first, rather than only checking it against a database.
I once wrote something for that purpose, I can't give you the code but the general idea is this:
Prepare a space of 12 digits
Format the number as five digits (00000 to 99999) and spread it along the space in a certain way. For example, the number 12345 can be spread as __3_5_2_4__1. You can vary the way you spread the number depending on whether it's an even or odd number, or a multiple of 3, etc.
Based on the value of certain digits, generate more digits (for example, if the third digit is even, then create an odd number and put it in the first open space; otherwise create an even number and put it in the second open space), e.g. _83_5_2_4__1
Once you have generated 6 digits, you will have only one open space. You should always leave the same open space (for example the next-to-last space). You will place the verification digit in that place.
To generate the verification digit you must perform some arithmetic operations on the number you have generated, for example adding all the digits in the odd positions and multiplying them by some other number, then subtracting all the digits in the even positions, and finally adding all the digits together (you must vary the algorithm a little based on the value of certain digits). In the end you have a verification digit which you include in the generated pin code.
So now you can validate your generated pin codes. For a given pin code, you generate the verification digit and check it against the one included in the pin. If it's OK then you can extract the original number by performing the reverse operations.
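The exact spreading scheme above is this answer's own invention, but as an illustration of the verification-digit arithmetic it describes, here is the standard Luhn check digit (a stand-in, not the author's algorithm):

    def luhn_check_digit(payload):
        # Double every second digit from the right of the payload, subtract 9
        # from anything over 9, sum everything, and pick the digit that makes
        # the total a multiple of 10.
        total = 0
        for i, ch in enumerate(reversed(payload)):
            d = int(ch)
            if i % 2 == 0:
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return str((10 - total % 10) % 10)

    pin = "83752419043"
    full_pin = pin + luhn_check_digit(pin)  # verify before any database lookup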
It doesn't sound so good because it looks like security through obscurity but it's the only way you can use this. It's not impossible for someone to guess a pin code but being a 12-digit code with a verification digit, it will be very hard since you have to try 1,000,000,000,000 combinations and you just have 100,000 valid pin codes, so for every valid pin code there are 10,000,000 invalid ones.
I should mention that this is useful for disposable pin codes; a person uses one of these codes only once, for example to charge a prepaid phone. It's not a good idea to use these pins as authentication tokens, especially if it's the only way to authenticate someone (you should never EVER authenticate someone only through a single piece of data; the very minimum is username+password)
It seems you want to use the pin code as the sole means of identification for users.
A workable solution would be to use the first five digits to identify the user, and append four digits as a PIN code. If you don't want to store PINs, they can be computed by applying a cryptographically secure hash (SHA-1 or better) to the user number plus a system-wide secret code.
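A minimal sketch of that computation, using HMAC-SHA256 as the "SHA-1 or better" hash (the secret value and the 4-digit truncation are assumptions):

    import hashlib
    import hmac

    SECRET = b"system-wide secret code"  # assumption: kept server-side

    def pin_for_user(user_number):
        digest = hmac.new(SECRET, str(user_number).encode(), hashlib.sha256).hexdigest()
        return f"{int(digest, 16) % 10000:04d}"  # deterministic 4-digit PIN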
Should I generate this code via some sort of algorithm?
No. It will be predictable.
Or should I randomly generate it?
Yes. Use a cryptographic random generator, or let the user pick their own PIN.
In theory 4 digits will be plenty as ATM card issuers manage to support a very large community with just that (and obviously, they can't be and do not need to be unique). However in that case you should limit the number of attempts at entering the PIN and lock them out after that many attempts as the banks do. And you should also get the user to supply a user ID (in the ATM case, that's effectively on the card).
If you don't want to limit them in that way, it may be best to ditch the PIN idea and use a standard password (which is essentially what your PIN is, just with a very short length and limited character set). If you absolutely must restrict it to numerics (because you have a PIN pad or something) then consider making 4 a (configurable) minimum length rather than the fixed length.
You shouldn't store the PIN in the clear anywhere (salt and hash it like a password); however, given the short length and limited character set, it will always be vulnerable to a brute-force search if there is an easy way to verify guesses.
There are various other schemes that can be used as well, if you can tell us more about your requirements (is this a web app? embedded system? etc).
There's a difference between guessing the PIN of a target user and that of any valid user. From your use case, it seems that the PIN is used to gain access to a certain resource, and it is that resource that attackers may be after, not the particular identities of users. If that's indeed the case, you will need to make valid PIN numbers sufficiently sparse among all possible numbers with the same number of digits.
As mentioned in some answers, you need to make your PIN sufficiently random, regardless if you want to generate it from an algorithm. The randomness is usually measured by the entropy of the PIN.
Now, let's say your PIN has entropy N bits, and there are 2^M users in your system (M < N); then the probability that a random guess yields a valid PIN is 2^(M-N). From there you can determine whether that probability is low enough given N and M, or compute the required N from the desired probability and M.
There are various ways to generate the PINs so that you won't have to remember every PIN you generated. But you will need a very long PIN to make it secure. This is probably not what you want.
I've done this before with PHP and a MySQL database. I had a permutations function that would first check that the number of required codes ($n, at length $l, with $c possible characters) could actually be created before starting the generation process.
Then I'd store each new code in the database and let UNIQUE KEY errors tell me when there was a collision (duplicate), and keep going until I had $n successfully created codes. You could of course do this in memory, but I wanted to keep the codes for use in an MS Word mail merge, so I exported them as a CSV file.
