When I combine two pandas columns with zip into a dict it reduces my samples - python-3.x

I have two columns in pandas: df.lat and df.lon.
Both have a length of 3897 and contain 556 NaN values.
My goal is to combine both columns and make a dict out of them.
I use the code:
dict(zip(df.lat,df.lon))
This creates a dict, but with one element fewer than my original columns.
I used len() to confirm this. I cannot figure out why the dict has one element fewer than my columns when both columns have the same length.
Another problem is that the dict contains only the raw values, not the keys "lat" and "lon".
Maybe someone here has an idea?

You may end up with a shorter dict if there are repeated values in df.lat: a dict can't have duplicate keys, so each repeated latitude keeps only the last longitude zipped with it, and the earlier pairs are dropped.
A more flexible approach may be to use the native df.to_dict() method in pandas. In this case the orientation you want is probably 'records'. Full code:
df[['lat', 'lon']].to_dict('records')
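For illustration, a minimal sketch with made-up coordinates; the duplicated latitude 52.5 is an assumption chosen to reproduce the shrinking dict:
import pandas as pd

# Two rows share the same latitude, mimicking the suspected duplicate.
df = pd.DataFrame({'lat': [52.5, 48.1, 52.5], 'lon': [13.4, 11.6, 9.9]})

d = dict(zip(df.lat, df.lon))
print(len(df), len(d))  # 3 2 -- the duplicate key 52.5 keeps only its last lon, 9.9

records = df[['lat', 'lon']].to_dict('records')
print(records)  # [{'lat': 52.5, 'lon': 13.4}, {'lat': 48.1, 'lon': 11.6}, {'lat': 52.5, 'lon': 9.9}]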

Related

How to drop columns from a pandas DataFrame that have elements containing a string?

This is not about dropping columns whose name contains a string.
I have a dataframe with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as "invalid value encountered in double_scalars (XYZ)", where XYZ is a placeholder for the column name.
I would like to delete all columns that contain, in any of their elements, the string "invalid".
Purging columns containing strings in general would work too. I want to clean the dataframe up so I can fit a machine learning model to it, so removing any and all columns that are not boolean or real would work.
This must be a duplicate question, but I can only find answers on how to remove a column with a specific column name.
You can use df.select_dtypes(include=[float, bool]) or df.select_dtypes(exclude=['object']).
Docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
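A quick sketch of this; the toy frame is an assumption:
import pandas as pd

df = pd.DataFrame({'x': [1.5, 2.5], 'flag': [True, False], 'note': ['ok', 'invalid value (x)']})
# Object-dtype columns hold the garbage strings; excluding them keeps numeric/bool columns.
print(df.select_dtypes(exclude=['object']).columns.tolist())  # ['x', 'flag']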
Use apply to build a mask checking whether each column contains 'invalid', and then pass the negated mask to the second (column) position of .loc:
df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]
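The same mask on toy data; column names and values are assumptions:
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0],
    'b': ['ok', 'invalid value encountered in double_scalars (b)'],
    'c': [True, False],
})
# apply() runs the lambda once per column, giving a boolean DataFrame;
# .any() collapses it to one True/False per column.
bad = df.apply(lambda col: col.astype(str).str.contains('invalid')).any()
print(df.loc[:, ~bad].columns.tolist())  # ['a', 'c'] -- column 'b' is dropped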

How do I get additional column name information in a pandas group by / nlargest calculation?

I am comparing pairs of strings using six fuzzywuzzy ratios, and I need to output the top three scores for each pair.
This line does the job:
final2_df = final_df[['nameHiringOrganization', 'mesure', 'name', 'valeur']].groupby(['nameHiringOrganization', 'name'])['valeur'].nlargest(3)
However, the Excel output table lacks the 'mesure' column, which contains the ratio's name. This is annoying, because then I'm not able to identify which of the six ratios works best for any given pair.
I thought selecting the columns at the beginning might work (final_df[['columns', ...]]), but it doesn't seem to.
Any thought on how I might add that info?
Many thanks in advance!
I think it is possible to use another solution here: sort by the 3 columns with DataFrame.sort_values and then use GroupBy.head:
final2_df = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'],
ascending=[True, True, False])
.groupby(['nameHiringOrganization', 'name'])
.head(3))
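For illustration, a minimal sketch on made-up data showing that head(3) keeps every column, including 'mesure' (all values below are assumptions):
import pandas as pd

final_df = pd.DataFrame({
    'nameHiringOrganization': ['org1'] * 4,
    'name': ['alice'] * 4,
    'mesure': ['ratio', 'partial_ratio', 'token_sort_ratio', 'token_set_ratio'],
    'valeur': [88, 95, 91, 70],
})
final2_df = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'],
                                  ascending=[True, True, False])
                     .groupby(['nameHiringOrganization', 'name'])
                     .head(3))
print(final2_df)  # the three highest 'valeur' rows per pair, 'mesure' intact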

Pandas read_table with duplicate names

When reading a table while specifying duplicate column names (say, two distinct names repeated), pandas 0.16.1 copies the last two columns of the data over and over again.
In [1]:
df = pd.read_table('Datasets/tbl.csv', header=0, names=['one','two','one','two','one'])
df
tbl.csv contains a table with 5 different columns; instead of all of them being read, the last two are repeated.
Out[1]:
        one       two       one       two       one
0  0.132846  0.120522  0.132846  0.120522  0.132846
1 -0.059710 -0.151850 -0.059710 -0.151850 -0.059710
2  0.003686  0.011072  0.003686  0.011072  0.003686
3 -0.220749 -0.029358 -0.220749 -0.029358 -0.220749
The actual table has different values in every column. Here, the same two columns (corresponding to the last two in the file) are repeated. No error or warning is given.
Do you think this is a bug or is it intended? I find it very dangerous to silently change an input like that. Or is it my ignorance?
Using duplicate values in indexes is inherently problematic.
They lead to ambiguity: code that you think works fine can suddenly fail on DataFrames with a non-unique index. argmax, for instance, can lead to a similar pitfall when a DataFrame has duplicates in its index.
It's best to avoid putting duplicate values in (row or column) indexes if you can. If you need to use a non-unique index, use it with care, and double-check the effect duplicate values have on the behavior of your code.
In this case, you could instead read the file with its own header and overwrite the column labels afterwards:
df = pd.read_table('Datasets/tbl.csv', header=0)
df.columns = ['one','two','one','two','one']
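A self-contained sketch of that workaround; the inline CSV here is made up to stand in for tbl.csv:
import io
import pandas as pd

csv = "a,b,c,d,e\n0.1,0.2,0.3,0.4,0.5\n1.1,1.2,1.3,1.4,1.5\n"
# Parsing with the file's own header keeps all five data columns;
# the duplicate labels are assigned only after parsing.
df = pd.read_csv(io.StringIO(csv), header=0)
df.columns = ['one', 'two', 'one', 'two', 'one']
print(df)  # five distinct data columns under the duplicate labels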

How to compare two dicts based on roundoff values in python?

I need to check if two dicts are equal. If the values rounded off to 6 decimal places are equal, then the program must say that they are equal. For example, the following two dicts are equal:
{'A': 0.00025037208557341116}
and
{'A': 0.000250372085573415}
Can anyone suggest how to do this? My dictionaries are big (more than 8000 entries) and I need to access these values multiple times for other calculations.
Test each key as you produce the second dict iteratively. Looking up a key/value pair in the dict you are comparing against is cheap (constant average cost), and you can round the values as you find them.
You are essentially performing a set difference to test the keys for equality, which requires at least a full loop over the smaller of the sets. If you already need to loop to generate one of the dicts, you are at an advantage, as that gives you the shortest route to detecting inequality as early as possible.
To test for two floats being the same within a set tolerance, see What is the best way to compare floats for almost-equality in Python?.
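A minimal sketch of such a comparison; the helper name and the abs_tol=1e-6 tolerance (standing in for "equal to 6 decimal places") are assumptions:
import math

def dicts_equal(d1, d2, tol=1e-6):
    # Same key set, and every value pair within the tolerance.
    if d1.keys() != d2.keys():
        return False
    return all(math.isclose(d1[k], d2[k], abs_tol=tol) for k in d1)

print(dicts_equal({'A': 0.00025037208557341116}, {'A': 0.000250372085573415}))  # True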

Sort a dictionary according to list size and then alphabetically by key in Python?

So I have a tiny problem. I got help here a while ago with sorting a dictionary whose keys each map to a list, ordering the keys by how many values their lists contain: keys with the fewest values on the left, keys with the most on the right. That worked great. Now, I know how to sort dictionary keys alphabetically, but I can't get it to work combined with the above.
I'm trying to sort the dictionary below first by how many values each key's list contains... and then alphabetically when two key lists contain the same number of values.
so before I would have this:
Dict = {"anna":[1,2,3],"billy":[1,2],"cilla":[1,2,3,4],"cecillia":[1,2,3,4],"dan":[1]}
And afterwards, if everything goes well, I would like to have:
Dict = {"dan":[1],"billy":[1,2],"anna":[1,2,3],"cecillia":[1,2,3,4],"cilla":[1,2,3,4]}
As you see above, cecillia comes before cilla since they both have 4 values in their lists... and dan comes first since he has the fewest values in his list. I hope this makes sense. What I have right now, which gives the result below, is:
ascending = sorted(Dict, key=lambda x: len(Dict[x]))
this gives me for example:
{"dan":[1],"billy":[1,2],"anna":[1,2,3],"cilla":[1,2,3,4],"cecillia":[1,2,3,4]}
So it works, but only for the number of values in each list. Now when I call
ascending.sort()
it sorts the keys alphabetically, but the ordering from fewest to most values is gone. Anyone know how to combine the two? I would greatly appreciate it.
Dictionaries cannot be kept sorted, so convert the dictionary to a list of (key, value) tuples. Sort it alphabetically first, then by list length; because Python's sort is stable, entries with equal lengths stay in alphabetical order:
D = sorted(Dict.items())           # alphabetical by key
D.sort(key=lambda kv: len(kv[1]))  # stable sort by list length keeps ties alphabetical
See http://wiki.python.org/moin/HowTo/Sorting . BTW, the feature you are relying on here is that the sorting algorithm is stable (which apparently was not the case back when I was programming in Python).
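For what it's worth, on Python 3.7+ dicts preserve insertion order, so you can rebuild an ordered dict from the sorted pairs; a sketch using one composite key instead of two passes:
Dict = {"anna": [1, 2, 3], "billy": [1, 2], "cilla": [1, 2, 3, 4],
        "cecillia": [1, 2, 3, 4], "dan": [1]}
# Sort by list length first, then alphabetically by key for ties.
ordered = dict(sorted(Dict.items(), key=lambda kv: (len(kv[1]), kv[0])))
print(ordered)
# {'dan': [1], 'billy': [1, 2], 'anna': [1, 2, 3], 'cecillia': [1, 2, 3, 4], 'cilla': [1, 2, 3, 4]}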
