Pandas read_table with duplicate names - python-3.x

When reading a table while specifying duplicate column names - let's say two different names - pandas 0.16.1 will copy the last two columns of the data over and over again.
In [1]:
df = pd.read_table('Datasets/tbl.csv', header=0, names=['one','two','one','two','one'])
df
tbl.csv contains a table with 5 different columns, but instead of all five being read, the last two are repeated.
Out[1]:
one two one two one
0 0.132846 0.120522 0.132846 0.120522 0.132846
1 -0.059710 -0.151850 -0.059710 -0.151850 -0.059710
2 0.003686 0.011072 0.003686 0.011072 0.003686
3 -0.220749 -0.029358 -0.220749 -0.029358 -0.220749
The actual table has different values in every column. Here, the same two columns (corresponding to the two last ones in the file) are repeated. No error or warning is given.
Do you think this is a bug or is it intended? I find it very dangerous to silently change an input like that. Or is it my ignorance?

Using duplicate values in indexes is inherently problematic.
They lead to ambiguity: code that you think works fine can suddenly fail on DataFrames with a non-unique index. argmax, for instance, can lead to a similar pitfall when a DataFrame has duplicates in its index.
It's best to avoid putting duplicate values in (row or column) indexes if you can. If you need a non-unique index, use it with care.
Double-check the effect duplicate values have on the behavior of your code.
In this case, you could use
df = pd.read_csv('data', header=None)
df.columns = ['one','two','one','two','one']
instead.
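A minimal, self-contained sketch of that workaround, using an in-memory CSV with made-up values in place of tbl.csv: read with the real header, then overwrite the labels, and note the ambiguity that duplicate labels introduce on selection.

```python
import io
import pandas as pd

# Hypothetical 5-column file standing in for tbl.csv
csv = io.StringIO("a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10\n")

# Read with the file's own header, then overwrite the column labels;
# the data itself is untouched, unlike passing duplicate `names`.
df = pd.read_csv(csv, header=0)
df.columns = ['one', 'two', 'one', 'two', 'one']

# Selecting a duplicated label returns ALL matching columns --
# this is the ambiguity duplicate indexes introduce.
print(df['one'].shape)  # (2, 3): two rows, three columns named 'one'
```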

Related

How to drop columns from a pandas DataFrame that have elements containing a string?

This is not about dropping columns whose name contains a string.
I have a dataframe with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as invalid value encountered in double_scalars (XYZ), where `XYZ` is a filler for the column name.
I would like to delete all columns that contain, in any of their elements, the string invalid.
Purging columns that contain strings in general would work too. What I want is to clean the frame up so I can fit a machine learning model to it, so removing any/all columns that are not boolean or real would work.
This must be a duplicate question, but I can only find answers to how to remove a column with a specific column name.
You can use df.select_dtypes(include=[float,bool]) or df.select_dtypes(exclude=['object'])
Link to docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
Use apply to build a mask checking whether each column contains invalid, and then pass the negated mask as the column indexer of .loc:
df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]
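A self-contained sketch combining both answers, on a hypothetical frame with one garbage column (all names and values are made up):

```python
import pandas as pd

# Hypothetical frame: two usable columns plus one garbage string column
df = pd.DataFrame({
    'x': [1.0, 2.0],
    'flag': [True, False],
    'err': ['ok', 'invalid value encountered in double_scalars (XYZ)'],
})

# Option 1: keep only non-string dtypes
clean = df.select_dtypes(exclude=['object'])

# Option 2: drop every column whose elements contain 'invalid'
mask = df.apply(lambda col: col.astype(str).str.contains('invalid')).any()
clean2 = df.loc[:, ~mask]

print(list(clean.columns), list(clean2.columns))  # ['x', 'flag'] twice
```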

When I combine two pandas columns with zip into a dict it reduces my samples

I have two columns in pandas: df.lat and df.lon.
Both have a length of 3897 and contain 556 NaN values.
My goal is to combine both columns and make a dict out of them.
I use the code:
dict(zip(df.lat,df.lon))
This creates a dict, but with one element less than my original columns.
I used len() to confirm this. I cannot figure out why the dict has one element less than my columns, when both columns have the same length.
Another problem is that the dict contains only the raw values, without the keys "lat" and "lon".
Maybe someone here has an idea?
You may get a different length if there are repeated values in df.lat: a dictionary cannot have duplicate keys, so the repeated entries are silently dropped.
A more flexible approach may be to use the df.to_dict() native method in pandas. In this example the orientation you want is probably 'records'. Full code:
df[['lat', 'lon']].to_dict('records')
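A minimal sketch with hypothetical coordinates, showing both the silent key collision and the to_dict('records') alternative:

```python
import pandas as pd

# Hypothetical coordinates with one duplicated latitude
df = pd.DataFrame({'lat': [52.5, 48.1, 52.5], 'lon': [13.4, 11.6, 9.9]})

# zip-based dict: the repeated key 52.5 keeps only the LAST value,
# so one pair is silently lost
d = dict(zip(df.lat, df.lon))
print(len(d))  # 2, not 3

# to_dict('records') keeps every row and the column names
records = df[['lat', 'lon']].to_dict('records')
print(records[0])  # {'lat': 52.5, 'lon': 13.4}
```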

How do I get additional column name information in a pandas group by / nlargest calculation?

I am comparing pairs of strings using six fuzzywuzzy ratios, and I need to output the top three scores for each pair.
This line does the job:
final2_df = final_df[['nameHiringOrganization', 'mesure', 'name', 'valeur']].groupby(['nameHiringOrganization', 'name'])['valeur'].nlargest(3)
However, the excel output table lacks the 'mesure' column, which contains the ratio's name. This is annoying, because then I'm not able to identify which of the six ratios works best for any given pair.
I thought selecting columns at the beginning might work (final_df[['columns', ...]]), but it doesn't seem to.
Any thought on how I might add that info?
Many thanks in advance!
I think another solution is possible here: sort by the 3 columns with DataFrame.sort_values and then use GroupBy.head:
final2_df = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'],
                                  ascending=[True, True, False])
                     .groupby(['nameHiringOrganization', 'name'])
                     .head(3))
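A runnable sketch on made-up data (only the column names come from the question), showing that sort_values plus head(3) keeps every column, including 'mesure':

```python
import pandas as pd

# Hypothetical scores: one organization/name pair with four ratio measures
final_df = pd.DataFrame({
    'nameHiringOrganization': ['acme'] * 4,
    'name': ['dev'] * 4,
    'mesure': ['ratio', 'partial_ratio', 'token_sort', 'token_set'],
    'valeur': [70, 95, 80, 90],
})

# Sort within each group by descending score, then keep the top three
# rows per group; all columns survive, unlike with nlargest
top3 = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'],
                             ascending=[True, True, False])
                .groupby(['nameHiringOrganization', 'name'])
                .head(3))
print(top3['mesure'].tolist())  # ['partial_ratio', 'token_set', 'token_sort']
```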

Python-Hypothesis: specifying and managing NaN values

I'm trying to use Hypothesis to generate a set of dataframes that I'll merge together. I want each individual column to be allowed to have NaN values, and I want to allow Hypothesis to generate some wacky examples.
But I mostly want to focus on examples where there is at least one row in each dataframe with actual values - and in particular, I'd like to be able to generate dataframes with some information shared between corresponding columns, such that a merged dataframe is not empty. (E.g. I want some values from 'store' in store.csv to overlap with values from 'store' in train.csv.)
I have some example code here that generates NaN values and wacky examples all over the place, but most of the generated examples contain very few non-NaN values. (A dataframe strategy starts on line 57.)
Any suggestions for how to create slightly more 'realistic' examples? Thanks!
Your solution looks fine to me, but here are two more tactics that might help:
Use the fill=st.nothing() argument to columns and series to disable filling behaviour. This makes the entries dense instead of sparse(ish), so there's a substantial runtime cost but a noticeable change in example density. Alternatively, fill=st.floats(allow_nan=False) might be cheaper and still work!
Use a .filter(...) on the strategy to reject dataframes without any NaN-free rows. A typical rule of thumb is to avoid .filter when it would reject more than half the examples, and to look for an alternative when it rejects more than a tenth... but this could easily be combined with the first point.
Answering my own question, but I'd love to hear other answers.
I ended up doing two things:
1) Requiring that the end user not give garbage files. (Just because we have a magical property-generation framework doesn't absolve us of the responsibility of having common sense, which I forgot.)
2) Testing for things that are reasonable accidents but not absolute garbage, by requiring that each dataframe have at least one row with no NaNs. With that requirement, I generate the non-NaN dataframe, and then add some NaNs afterward.
From there, ipython and .example() make it easy to see what's going on.
Example code below (google_files and google_weeks are custom strategies previously created)
# Create dataframes from the strategies above
# We'll create dataframes with all non-NaN values, then add NaNs to rows
# after the fact
df = draw(data_frames([
        column('file', elements=google_files),
        column('week', elements=google_weeks),
        column('trend', elements=integers(min_value=0, max_value=100))],
    index=range_indexes(min_size=1, max_size=100)))
# Add the NaNs
# With other dataframes, this ended up getting written into a function
rows = len(df)
df.loc[rows + 1] = [np.NaN, '2014-01-05 - 2014-01-11', 42]
df.loc[rows + 2] = ['DE_BE', np.NaN, 42]
df.loc[rows + 3] = ['DE_BE', '2014-01-05 - 2014-01-11', np.NaN]
df.loc[rows + 4] = [np.NaN, np.NaN, np.NaN]

Pandas read_excel removes columns under empty header

I have an Excel file where A1,A2,A3 are empty but A4:A53 contains column names.
In R, when you read that data, the column names for A1,A2,A3 would be "X_1,X_2,X_3", but pandas.read_excel simply skips the first three columns, ignoring them. The problem is that the number of columns in each file is dynamic, so I cannot parse a fixed column range, and I cannot edit the files to add "dummy names" for A1,A2,A3.
Use parameter skip_blank_lines=False, like so:
pd.read_excel('your_excel.xlsx', header=None, skip_blank_lines=False)
This stackoverflow question (finally) pointed me in the right direction:
Python Pandas read_excel doesn't recognize null cell
The pandas.read_excel docs don't contain any info about this since it is one of the keyword arguments passed through to the underlying parser, but you can find it in the general io docs here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
A quick fix would be to pass header=None to pandas' read_excel() function, manually insert the missing values into the first row (it now will contain the column names), then assign that row to df.columns and drop it after. Not the most elegant way, but I don't know of a builtin solution to your problem
EDIT: by "manually insert" I mean some messing with fillna(), since this appears to be an automated process of some sort
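A minimal sketch of that header=None idea, simulating the frame read_excel would return when the first row mixes real names and blank cells (the data and the `X_…` placeholder naming, mimicking R, are hypothetical):

```python
import numpy as np
import pandas as pd

# Simulated result of pd.read_excel(..., header=None): row 0 holds the
# column names, with the first two header cells blank
df = pd.DataFrame([[np.nan, np.nan, 'c3', 'c4'],
                   [1, 2, 3, 4],
                   [5, 6, 7, 8]])

# Fill the blank header cells with placeholder names, promote the row
# to column labels, then drop it from the data
header = df.iloc[0].copy()
for i in header[header.isna()].index:
    header[i] = f'X_{i + 1}'
df.columns = header
df = df.iloc[1:].reset_index(drop=True)
print(list(df.columns))  # ['X_1', 'X_2', 'c3', 'c4']
```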
I realize this is an old thread, but I solved it by specifying the column names and naming the final empty column, rather than importing with no names and then having to deal with a row of names in the data (I also used usecols). See below:
use_cols = 'A:L'
column_names = ['Col Name1', 'Col Name 2', 'Empty Col']
df = pd.read_excel(self._input_path, usecols=use_cols, names=column_names)
