How to compare two dicts based on roundoff values in python? - python-3.x

I need to check if two dicts are equal. If the values, rounded off to 6 decimal places, are equal, then the program must say that they are equal. For example, the following two dicts are equal
{'A': 0.00025037208557341116}
and
{'A': 0.000250372085573415}
Can anyone suggest how to do this? My dictionaries are large (more than 8000 entries) and I need to access these values multiple times for other calculations.

Test each key as you produce the second dict iteratively. Looking up a key/value pair in the dict you are comparing against is cheap (amortised constant time), and you can round the values as you find them.
You are essentially performing a set difference to test for equality of the keys, which requires at least a full loop over the smaller of the two sets. If you already need to loop to generate one of the dicts, you are at an advantage, as that gives you the shortest route to detecting inequality as soon as possible.
To test for two floats being the same within a set tolerance, see What is the best way to compare floats for almost-equality in Python?.
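A minimal sketch of the comparison (the function name is hypothetical): check that the key sets match, then compare each pair of values with math.isclose() and an absolute tolerance, rather than rounding, since two values can round differently on either side of a boundary:

```python
import math

def dicts_almost_equal(d1, d2, abs_tol=1e-6):
    """Return True if both dicts have the same keys and every
    pair of corresponding values differs by at most abs_tol."""
    if d1.keys() != d2.keys():
        return False
    return all(math.isclose(d1[k], d2[k], abs_tol=abs_tol) for k in d1)

print(dicts_almost_equal(
    {'A': 0.00025037208557341116},
    {'A': 0.000250372085573415}))  # True
```

Since dict lookups are constant time, this comparison is a single O(n) pass over the smaller dict.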

Related

creating a function using only loops, conditionals, and variables

Two numbers are twins if they both consist of the same digits. For instance: 134425 and 12345 are twins but 189 and 18 are not. Create a function areTwins(num1,num2) which takes in two numbers and returns True if they are twins, and False otherwise. You are only allowed to use conditionals, functions, and loops.
I was thinking of using two helper functions: one to get a number's length, and the other to check the appearance of each digit (0 to 9) in the two numbers and compare them, but I am lost on the second helper function. Can someone please help?
I would put the digits of each number into an array, sort both arrays, and remove all duplicates. Then you just need to compare them and it's done!
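A sketch of that approach in Python, collecting each number's digits without duplicates and comparing the sorted results (str() and sorted() are built-in functions, so arguably within the stated rules):

```python
def areTwins(num1, num2):
    """Two numbers are twins if they consist of the same set of digits."""
    digits1 = []
    digits2 = []
    # Collect each number's digits as characters, skipping repeats.
    for ch in str(num1):
        if ch not in digits1:
            digits1.append(ch)
    for ch in str(num2):
        if ch not in digits2:
            digits2.append(ch)
    # Sorting both deduplicated lists makes them directly comparable.
    return sorted(digits1) == sorted(digits2)

print(areTwins(134425, 12345))  # True
print(areTwins(189, 18))        # False
```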

Smartest way to filter a list using a different list

I have two lists. One of them is essentially representing keys (dates), the other the values.
I really just need the values themselves, but I want to get all values that lie between two dates. Ideally, I'd also like to use a certain sampling frequency to, say, get the values for all first days of the week between my two dates (i.e. sampling every 7th day).
I can easily filter my dates between two dates by calling .filter(e => e > start && e < end), and combine it with my prices array into its own object and then mapping it or something.
But since I'll be running this on large datasets in AWS, I'd need to be quite efficient with the way I do this. What would be the computationally least expensive algorithm to achieve what I want?
The best way would probably be a simple for loop, or actually, probably a binary search, but is there a less ugly way of doing it? I really enjoy chaining stream operations.
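If the dates are already sorted, the binary-search idea combines naturally with fixed-step sampling: find the window bounds with two binary searches, then slice with a step. A sketch of the algorithm (Python used for illustration; the list names and toy data are hypothetical):

```python
from bisect import bisect_left, bisect_right

def sample_between(dates, values, start, end, step=7):
    """Return every step-th value whose date lies strictly between
    start and end. Assumes dates is sorted and parallel to values."""
    lo = bisect_right(dates, start)  # first index with dates[i] > start
    hi = bisect_left(dates, end)     # first index with dates[i] >= end
    return values[lo:hi:step]

dates = list(range(1, 31))           # stand-in for real, sorted dates
values = [d * 10 for d in dates]
print(sample_between(dates, values, 5, 25))  # [60, 130, 200]
```

The two binary searches cost O(log n) and the slice touches only the sampled elements, so this avoids filtering the whole list on every query.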

When I combine two pandas columns with zip into a dict it reduces my samples

I have two columns in pandas: df.lat and df.lon.
Both have a length of 3897 and contain 556 NaN values.
My goal is to combine both columns and make a dict out of them.
I use the code:
dict(zip(df.lat,df.lon))
This creates a dict, but with one element fewer than my original columns.
I used len() to confirm this. I cannot figure out why the dict has one element
fewer than my columns when both columns have the same length.
Another problem is that the dict contains only the raw values, without the keys "lat" and "lon".
Maybe someone here has an idea?
You may get a shorter dict if there are repeated values in df.lat: a dictionary cannot have duplicate keys, so each repeated key keeps only the last value it was paired with, and the earlier pairs are lost.
A more flexible approach may be to use the df.to_dict() native method in pandas. In this example the orientation you want is probably 'records'. Full code:
df[['lat', 'lon']].to_dict('records')
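The duplicate-key effect is easy to reproduce with plain lists (the coordinates below are made up); the later pair silently overwrites the earlier one, which is exactly how one entry goes missing:

```python
lats = [52.5, 48.1, 52.5]  # note the repeated 52.5
lons = [13.4, 11.6, 9.9]

# The second 52.5 key overwrites the first, so one entry is lost.
d = dict(zip(lats, lons))
print(d)                   # {52.5: 9.9, 48.1: 11.6}
print(len(lats), len(d))   # 3 2
```

By contrast, to_dict('records') keeps every row and labels each value with its column name, so nothing is dropped.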

How to apply an sklearn pipeline to a list of features depending on availability

I have a pandas dataframe with 10 features (all floats). Based on the characteristics of the features (e.g., their means), the dataframe can be broken into 4 subsets: mean < 0, mean within (0, 1), mean within (1, 100), and mean >= 100.
A different pipeline will be applied to each subset; however, not all subsets will always be present. For example, the dataset might contain only mean < 0; or only mean < 0 and mean (1, 100); or all 4 subsets.
The question is how to apply the pipelines depending on the availability of the subsets.
The problem is that there will be 7 different combinations in total:
all subsets exist, only 3 exist, only 2 exist, or only 1 exists.
How can I assign different pipelines depending on the availability of the subsets without using a nested if/else (10 if/else branches)?
if subset1 exists:
    make_column_transformer((pipeline1, subset1))
elif subset2 exists:
    make_column_transformer((pipeline2, subset2))
elif subset3 exists:
    make_column_transformer((pipeline3, subset3))
elif subset1 and subset2 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2))
elif subset3 and subset2 exist:
    make_column_transformer((pipeline3, subset3), (pipeline2, subset2))
elif subset1 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline3, subset3))
elif subset1 and subset2 and subset3 exist:
    make_column_transformer((pipeline1, subset1), (pipeline2, subset2), (pipeline3, subset3))
Is there a better way to avoid this nested if/else (considering that we could have 10 different subsets)?
The way to apply different transformations to different sets of features is with ColumnTransformer [1]. You can then keep a list of column names per case, filled in based on the conditions you want; each transformer will process the columns in its list, e.g. cols_mean_lt0 = [...], etc.
Having said that, your approach doesn't look good to me. You probably want to scale the features so they all have the same mean and std. Depending on the algorithm you'll use, this may be mandatory or not.
[1] https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
EDIT:
ColumnTransformer takes transformers, each of which is a (name, transformer, columns) tuple. What you want is multiple transformers, each of which processes different columns. The columns entry can be a 'string or int, array-like of string or int, slice, boolean mask array or callable'. This is where I suggest you pass a list of column names.
This way, you can have three transformers, one for each of your cases. To decide which columns each transformer processes, create three lists, one per transformer. This is simple to do: in a loop, check each column's mean and append the column name to the list for the corresponding transformer.
Hope this helps!
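A sketch of that loop using the standard library only (the column data and the pipeline placeholder strings are made up; in practice the values come from your DataFrame and the placeholders are real pipeline objects). Because empty lists are filtered out at the end, missing subsets need no special-casing and the if/else ladder disappears:

```python
from statistics import mean

# Hypothetical columns; in practice iterate over df.columns and use df[col].mean().
columns = {
    'a': [-1.0, -2.0, -3.0],
    'b': [0.2, 0.4, 0.6],
    'c': [10.0, 20.0, 30.0],
}

cols_lt0, cols_0_1, cols_1_100, cols_ge100 = [], [], [], []
for name, values in columns.items():
    m = mean(values)
    if m < 0:
        cols_lt0.append(name)
    elif m < 1:
        cols_0_1.append(name)
    elif m < 100:
        cols_1_100.append(name)
    else:
        cols_ge100.append(name)

# Keep only the (pipeline, columns) pairs whose column list is non-empty,
# then unpack them into make_column_transformer / ColumnTransformer.
buckets = [('pipeline1', cols_lt0), ('pipeline2', cols_0_1),
           ('pipeline3', cols_1_100), ('pipeline4', cols_ge100)]
transformers = [(p, cols) for p, cols in buckets if cols]
print(transformers)  # [('pipeline1', ['a']), ('pipeline2', ['b']), ('pipeline3', ['c'])]
```

With this structure, any of the 2^n - 1 subset combinations is handled by the same four lines, regardless of how many subsets exist.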

Looking for a way to distinguish identical string entries for index use

I am writing a function in Python 3.5.2 to read chemical formulas (e.g. CaBr2) and produce a list of the element names and their coefficients.
The general rundown of how I am doing it: I have a for loop that skips the first letter, then appends the previous element whenever it reaches a capital letter, a number, or the end of the string. I did this with the index of my iteration (via .index()), fetching the entry at that index minus 1 or 2, depending on the specifics. For the given example it would skip C, read a but do nothing, reach B, append the translation of Ca to my name list, and append 1 to my coefficient list.
This works perfectly for formulas with unique characters, but with something like CaCl2, the index found for the second C is not 2 but 0, since .index() does not differentiate between the two occurrences. How can my function refer to the value at the previous index(es) without running into this problem? Keep in mind that inputs can be of any length, capitalization cannot change, and there can be any number of repeated characters.
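One way around the problem is to avoid .index() entirely: iterate over the characters directly (or with enumerate() if positions are needed) and carry the current symbol in a variable instead of looking backwards by index. A sketch of that idea (the function name is hypothetical and the element-name translation table is omitted):

```python
def parse_formula(formula):
    """Split a formula like 'CaCl2' into element symbols and counts."""
    elements, counts = [], []
    symbol, digits = '', ''
    for ch in formula:
        if ch.isupper() and symbol:
            # A new element starts: flush the one we were building.
            elements.append(symbol)
            counts.append(int(digits) if digits else 1)
            symbol, digits = ch, ''
        elif ch.isdigit():
            digits += ch
        else:
            symbol += ch
    # Flush the final element at the end of the string.
    elements.append(symbol)
    counts.append(int(digits) if digits else 1)
    return elements, counts

print(parse_formula('CaCl2'))  # (['Ca', 'Cl'], [1, 2])
```

Because repeated characters like the two Cs never need to be located by value, duplicates cause no ambiguity.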
