I have the following dataframe df with 3 rows, where the 3rd row consists of all empty strings. I am trying to drop all rows in which every column is empty, but somehow the rows are not getting dropped. Below is my snippet.
import pandas as pd
d = {'col1': [1, 2, ''], 'col2': [3, 4, '']}
df = pd.DataFrame(data=d)
df = df.dropna(how='all')
Can you suggest what I am doing wrong?
You don't have NaN values. You have '', which is not NaN. So:
df[df.ne('').any(axis=1)]
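For reference, a runnable sketch of both approaches on the data from the question: keeping rows where any column is non-empty, and the equivalent route of converting the empty strings to missing values first so that dropna behaves as expected.

```python
import pandas as pd

d = {'col1': [1, 2, ''], 'col2': [3, 4, '']}
df = pd.DataFrame(data=d)

# Option 1: keep rows where at least one column is non-empty
out = df[df.ne('').any(axis=1)]

# Option 2: turn '' into missing values first, then dropna works as expected
out2 = df.replace('', pd.NA).dropna(how='all')

print(out)
print(out2)
```

Both leave only the first two rows; option 2 is useful when you want real missing values downstream anyway.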
I am trying to scan a column in a df that should only contain values 0-9. I want to exclude or flag columns in this dataframe that contain alphanumerical values.
df_analysis[df_analysis['unique_values'].astype(str).str.contains(r'^[0-9]*$', na=True)]
import pandas as pd
df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"],
                   "numbers": [6, 4, 5],
                   "more_numbers": [1, 2, 3],
                   "mixed": ["wef", 8, 9]})
print(df.select_dtypes(include=["int64"]).columns.to_list())
print(df.select_dtypes(include=["object"]).columns.to_list())
Create a dataframe with multiple columns. Use .select_dtypes to find the columns that are integers and return them as a list. You can add "float64" or any other numeric type to the include list.
Output:
['numbers', 'more_numbers']
['string', 'mixed']
I have a pandas DataFrame which contains values that are not correct:
data = {'Model':['A', 'B', 'A', 'B', 'A'], 'Value':[20, 40, 20, 40, -1]}
df = pd.DataFrame(data)
df
Out[46]:
Model Value
0 A 20
1 B 40
2 A 20
3 B 40
4 A -1
I would like to replace -1 with the unique value of model A.
In this case it should be 20.
How do I go about it? I have tried the following.
In my case it's a large DF with 2 million rows.
df2 = df[df.Model != -1]
pd.merge(df, df2, on='Model', how='left')
Out:
MemoryError: Unable to allocate 5.74 TiB for an array with shape (788568381621,) and data type int64
You don't need to merge, which creates all possible pairs of rows with the same Model. The following will do:
df['Value'] = df['Value'].mask(df['Value'] == -1).groupby(df['Model']).transform('first')
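A runnable sketch of the mask/groupby idea on the sample data. Note the mask condition must hide the -1 sentinels (turn them into NaN), so that transform('first') broadcasts the first valid value of each Model group back over every row.

```python
import pandas as pd

df = pd.DataFrame({'Model': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [20, 40, 20, 40, -1]})

# Replace the -1 sentinels with NaN, then fill each Model group
# with its first valid value
df['Value'] = df['Value'].mask(df['Value'] == -1).groupby(df['Model']).transform('first')
print(df)
```

Row 4's -1 becomes 20, the first valid Value seen for model A.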
Or you can also use map:
s = (df[df['Value'] != -1].drop_duplicates('Model')
.set_index('Model')['Value'])
df['Value'] = df['Model'].map(s)
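Run on the sample frame, the map approach builds a one-row-per-Model lookup table and then maps every row through it:

```python
import pandas as pd

df = pd.DataFrame({'Model': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [20, 40, 20, 40, -1]})

# One valid Value per Model, used as a lookup table
s = (df[df['Value'] != -1].drop_duplicates('Model')
       .set_index('Model')['Value'])
df['Value'] = df['Model'].map(s)
print(df['Value'].tolist())  # -> [20, 40, 20, 40, 20]
```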
Here's a quick solution (it assumes the valid value is the largest in each group, which holds here because -1 is smaller than every real Value):
df['Value'] = df.groupby('Model')['Value'].transform('max')
I am scraping data from a website which builds a Pandas dataframe with different column names dependent on the data available on the site. I have a vector of column names, say:
colnames = ['column1', 'column2', 'column3', 'column5']
which are the columns of a postgres database in which I wish to store the scraped data.
The problem I am having is that the way I have had to set up the scraping to get all the data I require, I end up grabbing some columns for which I have no use and which aren't in my postgres database. These columns will not have the same names each time, as some pages have extra data, so I can't simply exclude the column names I don't want, as I don't know what all of these will be. There will also be columns in my postgres database for which the data will not be scraped every time.
Hence, when I try and upload the resulting dataframe to postgres, I get the error:
psycopg2.errors.UndefinedColumn: column "column4" of relation "my_db" does not exist
This leads to my question:
How do I subset the resulting pandas dataframe using the column names I have stored in the vector, given some of the columns may not exist in the dataframe? I have tried my_dt = my_dt[colnames], which returns the error:
KeyError: ['column1', 'column2', 'column3'] not in index
Reproducible example:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']
test = df[subset_columns]
Any help will be appreciated.
You can simply do:
colnames = ['column1', 'column2', 'column3', 'column5']
df[df.columns.intersection(colnames)]
(The older df.columns & colnames spelling is deprecated.)
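For example, `Index.intersection` keeps only the names that actually exist in the dataframe and preserves the dataframe's own column order, so the missing 'column5' is silently dropped:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
colnames = ['column1', 'column2', 'column3', 'column5']

# Keep only the requested columns that are present in df
subset = df[df.columns.intersection(colnames)]
print(subset.columns.to_list())  # -> ['column1', 'column2', 'column3']
```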
I managed to find a fix, though I still don't understand why the initial KeyError listed a vector of columns rather than just the elements which weren't columns of my dataframe:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']
column_match = set(subset_columns) & set(df.columns)
df = df[column_match]
Out[69]:
column2 column1 column3
0 2 1 3
1 6 5 7
I want to iterate over all the index rows of my first dataframe.
And if an index also exists in the indexes of the second dataframe, I want to return that row.
I see that df1.loc[2] returns the data in the row where the index is 2.
How can I iterate over all of the indexes in both dataframes?
You can use .join between the dataframes to get the rows with matching indexes.
In [1]: import pandas as pd
   ...: a = pd.DataFrame({'a': [1, 3]}, index=[1, 2])
   ...: b = pd.DataFrame({'b': [3, 4]}, index=[2, 5])
   ...: a.join(b, how='inner')
Out[1]:
   a  b
2  3  3
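If you need the rows of the first dataframe themselves rather than a joined frame, a sketch using `Index.intersection` on the same data:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 3]}, index=[1, 2])
b = pd.DataFrame({'b': [3, 4]}, index=[2, 5])

# Select rows of `a` whose index label also appears in `b`
common = a.index.intersection(b.index)
print(a.loc[common])
```

This avoids iterating row by row, which is rarely needed in pandas.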
I have a dataframe with three columns containing 220 datapoints. Now I need to make one column the key and another column the value and remove the third column. How do I do that?
I have created the dataframe by scraping Wikipedia in order to create a keyword search. Now I need to create an index of terms contained, for which dictionaries are the most effective. How do I create a dictionary out of a dataframe where one column is the key for another column?
I have used a sample dataframe with 3 columns and 3 rows, as you have not provided the actual data. You can replace it with your data and column names.
I have used a for loop with iterrows() to loop over each row.
Code:
import pandas as pd
df = pd.DataFrame({'Alphabet': ['A', 'B', 'C'],
                   'Number': [1, 2, 3],
                   'To_Remove': [10, 15, 8]})
sample_dictionary = {}
for index, row in df.iterrows():
    sample_dictionary[row['Alphabet']] = row['Number']
print(sample_dictionary)
Output:
{'A': 1, 'B': 2, 'C': 3}
You can use the Pandas function
pd.DataFrame.to_dict
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
Example
import pandas as pd
# Original dataframe
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [0.5, 0.75, 1.0],
                   'col3': [0.1, 0.9, 1.9]},
                  index=['a', 'b', 'c'])
# To dictionary
dictionary = df.to_dict()
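Note that to_dict() on the whole frame nests by column. For the key/value use case in the question (one column as keys, another as values), a shorter sketch on sample data like the loop answer's is to select the value column with the key column as index and convert that Series:

```python
import pandas as pd

df = pd.DataFrame({'Alphabet': ['A', 'B', 'C'],
                   'Number': [1, 2, 3],
                   'To_Remove': [10, 15, 8]})

# Keys come from 'Alphabet', values from 'Number';
# 'To_Remove' is simply never selected
mapping = df.set_index('Alphabet')['Number'].to_dict()
print(mapping)
```

This avoids the explicit iterrows loop and gives the same result, {'A': 1, 'B': 2, 'C': 3}.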