Calculate dot product between two dictionaries millions of times

I have two dictionaries: d1 = {'string1': number1, ..., 'string5000000': number5000000}, which does not change, and many small dictionaries d_i = {'str1': num1, ..., 'str50': num50} (i = 2, 3, ..., a few million). I want to compute a dot product between these dictionaries, i.e. for every key in dictionary d_i that also exists in d1, multiply the two numbers and add the product to a running sum.
The problem is that the first dictionary is extremely big and there are millions of small dictionaries.
How do I do that fast? Can I use some big data techniques for that?

You can put your data into a pandas DataFrame and then compute the dot product between Series. It can be faster, but in your case I would measure how much time the plain-Python implementation takes compared to pandas.
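As a baseline to measure against, here is a minimal sketch of the plain-Python approach (d1 is the big dictionary from the question; small_dicts is a hypothetical list holding the d_i). It iterates over the ~50 keys of each small dictionary and looks them up in the big one, so the work per product is proportional to the small dictionary's size.

def dict_dot(big, small):
    # Sum the products over the keys the two dictionaries share.
    return sum(value * big[key] for key, value in small.items() if key in big)

results = [dict_dot(d1, d) for d in small_dicts]

Timing this on a sample of the small dictionaries against the pandas version should quickly show which one wins.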

Related

Comparing the word-counts of two files, accounting for the number of occurrences

I'm currently working on a program which is supposed to find exploits for vulnerabilities in web applications by looking at the Document Object Model (DOM) of the application.
One approach for narrowing down the number of possible DB entries is to filter them further by comparing the word counts of the DOM and of the database entry.
I already have two dicts (actually DataFrames, but shown as dicts here for better presentation), each containing the top 10 words in descending order of their number of occurrences in the text.
word_count_dom = {"Peter": 10, "is": 6, "eating": 2, ...}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1, ...}
Now I would like to calculate some kind of value that represents how similar the two dicts are while accounting for the number of occurrences.
Currently I'm using:
len([value for value in word_count_dom if value in word_count_db])
>>> 3
but this does not account for the number of occurrences at all.
Looking at the example, I would like the program to give more value to the "is" match, because of its generally better ratio of ranking position to number of occurrences.
Just an idea:
Compute for each dict the relative probability of each entry (e.g. among all the top counts, "Peter" occurs 20% of the time). Do this for each word occurring in either dict, and then use something like:
https://en.wikipedia.org/wiki/Bhattacharyya_distance
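A rough sketch of that idea (dict contents taken from the question; the helper names are placeholders): normalise each count dict into a probability distribution over the union of the words, then compute the Bhattacharyya coefficient and distance.

import math

def to_probs(counts, vocabulary):
    # Relative frequency of each word among the top counts; 0 if the word is absent.
    total = sum(counts.values())
    return {word: counts.get(word, 0) / total for word in vocabulary}

def bhattacharyya_distance(counts_a, counts_b):
    vocabulary = set(counts_a) | set(counts_b)
    p = to_probs(counts_a, vocabulary)
    q = to_probs(counts_b, vocabulary)
    coefficient = sum(math.sqrt(p[word] * q[word]) for word in vocabulary)
    # Coefficient 1 means identical distributions, 0 means no overlap at all.
    return -math.log(coefficient) if coefficient > 0 else math.inf

word_count_dom = {"Peter": 10, "is": 6, "eating": 2}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1}
print(bhattacharyya_distance(word_count_dom, word_count_db))

A smaller distance means the two word distributions are more alike, and a shared word whose relative frequencies are close (like "is" here) contributes more to the similarity than a shared word with very different frequencies.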

Function to maximise the number of chosen sets so that no number is repeated across them

I am trying to solve the following problem in Python, but I cannot find a solution. I have many lists (hundreds of thousands); each list contains from 1 to 16 numbers that range from 0 to N. I want to maximise the number of sets that I can pool so that the union of the chosen subsets does not contain any repeated numbers.
For instance:
List1 = [2,4,1012]
List2 = [0,1,3]
List3 = [1,2]
List4 = [5,8]
Result: [2,4,1012], [5,8], [0,1,3]
If I chose List3 instead, only 2 subsets could be selected without repeating elements.
Thanks in advance
I have tried using networkX graphs, but I cannot grasp how to reduce the graph to find the optimal answer or answers.
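For reference, picking the largest possible collection of pairwise-disjoint lists is the maximum set packing problem, which is NP-hard in general, so an exact answer over hundreds of thousands of lists is expensive. A simple greedy heuristic (only a sketch, and not guaranteed to be optimal) keeps each list whose elements have not been used yet:

def greedy_disjoint(lists):
    used = set()
    chosen = []
    # Consider smaller lists first: they consume fewer numbers per set kept.
    for candidate in sorted(lists, key=len):
        elements = set(candidate)
        if used.isdisjoint(elements):
            chosen.append(candidate)
            used |= elements
    return chosen

lists = [[2, 4, 1012], [0, 1, 3], [1, 2], [5, 8]]
print(greedy_disjoint(lists))  # [[1, 2], [5, 8]]: greedy keeps List3 and misses the optimal 3 sets

On this example the greedy pass keeps only two lists, which is exactly the trap described in the question, so getting the true optimum generally needs an exact method such as an integer-programming formulation.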

Check if the lines in the dataframe roughly correspond to each other

I have a data frame with names of cities in Morocco and another one with similar names that were not coded correctly. Here's the first one:
>>> df[['new_regiononame']].head()
new_regiononame
0 Grand Casablanca-Settat
1 Fès-Meknès
2 Souss-Massa
3 Laayoune-Sakia El Hamra
4 Fès-Meknès
and here's the other one, whose names I want to change to the ones used in the first data frame (at least it is read correctly when displayed):
>>> X_train[['S02Q03A_Region']].head()
S02Q03A_Region
10918 Fès-Meknès
1892 Rabat-Salé-Kénitra
6671 Casablanca-Settat
4837 Marrakech-Safi
6767 Casablanca-Settat
How can I check if the lines in the dataframe roughly correspond to each other and, if so, rename X_train rows by df ones?
So far I only know how to extract which rows in X_train have exact equivalents in df:
X_train['S02Q03A_Region'][X_train['S02Q03A_Region'].isin(df['new_regiononame'].unique())]
The Levenshtein distance could do the job here.
The Levenshtein distance gives you the distance between two words by calculating the number of single-character edits needed to convert one word into the other. You could establish a reasonable threshold for comparing one dataframe column to the other, such as:
- Does it start with the same character?
- Are the lengths of the city names only x characters apart?
- Is the Levenshtein distance less than y?
etc.
The code to calculate the Levenshtein distance, here between the correctly encoded name and a mis-encoded variant of it, is:
import nltk
nltk.edit_distance("Fès-Meknès", "FÃ¨s-MeknÃ¨s")
Output:
4
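Building on that, one possible mapping step (a sketch; the cutoff value is an assumption to tune on your data, not something tested here): for each region name in X_train, pick the closest name in df by edit distance and accept it only if the distance stays below the cutoff.

import nltk

known_names = df['new_regiononame'].unique()
MAX_DISTANCE = 5  # hypothetical cutoff; tune it against your data

def closest_name(name):
    # Closest known region name by Levenshtein distance.
    best = min(known_names, key=lambda candidate: nltk.edit_distance(name, candidate))
    return best if nltk.edit_distance(name, best) <= MAX_DISTANCE else name

X_train['S02Q03A_Region'] = X_train['S02Q03A_Region'].map(closest_name)

Names that are further than the cutoff from every known region are left untouched, so you can inspect the remaining mismatches with the isin() check from the question.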

Efficient way to convert a string column with comma-separated floats into 2D NumPy array?

I have a Pandas dataset with one string column containing a series of floats separated by commas (as one big string):
4677579 -0.26751,0.559024,0.690269,0.0446298,0.967434,...
5193597 -0.587694,0.15063,-0.163831,-0.696565,0.972488...
596398 -0.732648,0.69533,0.722288,-0.0453636,0.435788...
5744417 -0.354733,-0.782564,-0.301927,0.263827,0.96237...
2464195 -0.326526,0.341944,0.330533,-0.250108,0.673552...
So, the first row has the following format:
{4677579: '-0.26751,0.559024,[..skipped..],-0.394059,0.974787'}
I need to convert that column into a (preferably 2D NumPy) array of floats:
array([[-2.67510e-01, 5.59024e-01, 6.90269e-01, skipped, 4.45222e-01, -1.82369e-01],
[-5.87694e-01, 1.50630e-01, -1.63831e-01, skipped, 9.47768e-01]])
The problem is that the dataset is very large (>10M rows, >400 floats/row), and my naive approaches to conversion, like:
vectordata.apply(lambda x: np.fromstring(x, sep=','))
or
vectordata.apply(lambda x: list(map(float,x.split(','))))
just time out (they work OK if the total size of the dataset is <5M rows, though).
Any ideas how one could optimize this operation to make it work on a large dataset?
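One direction worth trying (a sketch, assuming every row holds the same number of floats and vectordata is the string column shown above): join all rows into a single string, parse it with one NumPy call instead of once per row, and reshape the flat result.

import numpy as np

joined = ','.join(vectordata)                      # one big comma-separated string
flat = np.fromstring(joined, dtype=np.float64, sep=',')
matrix = flat.reshape(len(vectordata), -1)         # one row per original row

This moves the parsing loop out of Python and into NumPy; comparing it to the per-row apply() on a few hundred thousand rows should show whether the speed-up is enough before running it on the full dataset.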

Astropy get table length

How can I get the length (i.e. number of rows) of an astropy Table? From the documentation, there are several ways of having the table length printed out, such as t.info(). However, I can't use this information in a script.
How do I assign the length of a table to a variable?
In Python, the len() built-in function typically gives the length/size of a collection-like object. For example, the length of a 1-D list is given like:
>>> a = [1, 2, 3]
>>> len(a)
3
For a table you could ask what the "size" means: the number of rows? The number of columns? The total number of items in the table? But it sounds like you want the number of rows. In Python, this will almost always be given by len() on table-like objects as well (arguably, anything that does otherwise is a mistake). You can see this by analogy with how you might construct a table-like data structure from simple nested Python lists:
>>> t = [
... [1, 2, 3],
... [4, 5, 6],
... [7, 8, 9]
... ]
Here each "row" is represented by a single list nested in outer lists, so len(t) gives th number of rows. In fact this is just a convention and can be broken if need-be. For example you could also treat the above t as list of columns for some column-oriented data.
But in Python we typically assume 2-dimensional arrays to be row-oriented unless otherwise stated--to remember you can see that the syntax for a nested list as I wrote above looks row-oriented.
The logic extends to Numpy arrays and other more complicated data structures built on them such as Astropy's Table or Pandas DataFrames.
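Concretely (a minimal sketch): len() on an astropy Table returns the number of rows, and the columns container has its own length.

from astropy.table import Table

t = Table({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

n_rows = len(t)           # 3, the number of rows
n_cols = len(t.columns)   # 2, the number of columns

So for the original question, len(t) is the value to assign to a variable in a script.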
