I have a dataframe with three columns containing 220 datapoints. Now I need to make one column the key and the other column the value and remove the third column. How do I do that?
I created the dataframe by scraping Wikipedia in order to build a keyword search. Now I need to create an index of the terms contained, for which dictionaries are the most effective. How do I create a dictionary out of a dataframe where one column is the key for another column?
I have used a sample dataframe with 3 columns and 3 rows, as you have not provided the actual data; you can replace it with your own data and column names.
I used a for loop with iterrows() to loop over each row, building the dictionary from the key and value columns and simply ignoring the third column.
Code:
import pandas as pd
df = pd.DataFrame(
    {'Alphabet': ['A', 'B', 'C'],
     'Number': [1, 2, 3],
     'To_Remove': [10, 15, 8]})
sample_dictionary = {}
for index, row in df.iterrows():
    sample_dictionary[row['Alphabet']] = row['Number']
print(sample_dictionary)
Output:
{'A': 1, 'B': 2, 'C': 3}
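If you prefer to avoid the explicit loop, the same mapping can be built in one step; this is a minimal sketch using the same sample column names:

```python
import pandas as pd

df = pd.DataFrame(
    {'Alphabet': ['A', 'B', 'C'],
     'Number': [1, 2, 3],
     'To_Remove': [10, 15, 8]})

# Zip the key column against the value column; the third column is
# simply never referenced. .tolist() converts the values to plain
# Python ints rather than NumPy scalars.
mapping = dict(zip(df['Alphabet'].tolist(), df['Number'].tolist()))
print(mapping)  # {'A': 1, 'B': 2, 'C': 3}
```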
You can use the Pandas method
pd.DataFrame.to_dict
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
Example
import pandas as pd
# Original dataframe
df = pd.DataFrame({'col1': [1, 2, 3],
'col2': [0.5, 0.75, 1.0],
'col3':[0.1, 0.9, 1.9]},
index=['a', 'b', 'c'])
# To dictionary: make col1 the keys and col2 the values, dropping col3
dictionary = df.set_index('col1')['col2'].to_dict()
I am trying to scan a column in a df that should only contain the values 0-9. I want to exclude or flag columns in this dataframe that contain alphanumeric values. My attempt so far:
df_analysis[df_analysis['unique_values'].astype(str).str.contains(r'^[0-9]*$', na=True)]
import pandas as pd
df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"], "numbers": [6, 4, 5], "more_numbers": [1, 2, 3], "mixed": ["wef", 8, 9]})
print(df.select_dtypes(include=["int64"]).columns.to_list())
print(df.select_dtypes(include=["object"]).columns.to_list())
Create dataframe with multiple columns. Use .select_dtypes to find the columns that are integers and return them as a list. You can add "float64" or any other numeric type to the include list.
Output:
['numbers', 'more_numbers']
['string', 'mixed']
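The regex idea from the question can also be made to work column-by-column; this is a sketch (note that Series.str.fullmatch requires pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"],
                   "numbers": [6, 4, 5],
                   "more_numbers": [1, 2, 3],
                   "mixed": ["wef", 8, 9]})

# One boolean per column: True when every value, rendered as a string,
# consists solely of the digits 0-9.
digits_only = df.astype(str).apply(lambda col: col.str.fullmatch(r'[0-9]+').all())

print(df.columns[digits_only].tolist())   # columns to keep
print(df.columns[~digits_only].tolist())  # columns to exclude or flag
```

Unlike select_dtypes, this also catches object columns whose values happen to all be digits.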
I have following dataframe df with 3 rows where 3rd row consists of all empty strings. I am trying to drop all the rows which has all the columns empty but somehow the rows are not getting dropped. Below is my snippet.
import pandas as pd
d = {'col1': [1, 2, ''], 'col2': [3, 4, '']}
df = pd.DataFrame(data=d)
df = df.dropna(how='all')
Please suggest where I am going wrong.
You don't have NaN values; '' is an empty string, not NaN, so dropna finds nothing to drop. Keep only the rows that have at least one non-empty value instead:
df = df[df.ne('').any(axis=1)]
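Alternatively, you can convert the empty strings to real missing values first, so that dropna behaves as expected; a sketch:

```python
import pandas as pd
import numpy as np

d = {'col1': [1, 2, ''], 'col2': [3, 4, '']}
df = pd.DataFrame(data=d)

# Turn '' into NaN, then dropna(how='all') removes rows that are
# entirely empty.
df = df.replace('', np.nan).dropna(how='all')
print(df)
```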
I want to iterate over all the index rows of my first dataframe.
And if this index exists in the indexes of the second dataframe, I want to return this line.
I see that df1.loc[2] returns the data in the row where the index is 2.
How can I iterate over all of the indexes in both dataframes?
You can use .join between the dataframes to get the rows with the same indexes:
In [1]: import pandas as pd
...: a = pd.DataFrame({'a': [1, 3]}, index=[1, 2])
...:
...: b = pd.DataFrame({'b': [3, 4]}, index=[2, 5])
...: a.join(b, how='inner')
Out[1]:
a b
2 3 3
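If you only want the matching rows of the first dataframe, without pulling in the second dataframe's columns, an index intersection is an alternative; a sketch using the same sample frames:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 3]}, index=[1, 2])
b = pd.DataFrame({'b': [3, 4]}, index=[2, 5])

# Select the rows of `a` whose index labels also appear in `b`.
common = a.loc[a.index.intersection(b.index)]
print(common)
```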
I have a pandas dataframe that looks something like this:
df=pd.DataFrame({'a':['A','B','C','A'], 'b':[1,4,1,3], 'c':[0,6,1,0], 'd':[1,0,0,5]})
I want a dataframe grouped by the values in column 'a', with the corresponding rows saved as dictionaries in a new column 'dict'. The key-value pairs are the column names and the values in each column, respectively. If a value in column 'a' has multiple entries (e.g. 'A' occurs twice), a list of dictionaries should be created for that value.
How can I do this? (Please ignore the grammatical mistakes, and do ask if the question sounds too vague.)
Don't do this. Pandas was never designed to hold lists/tuples/dicts in series/columns. You can concoct expensive workarounds, but these are not recommended.
The main reason holding lists in a series is not recommended is that you lose the vectorised functionality that comes with NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like a list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
See also: What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.
But if you really need it:
df = df.groupby('a').apply(lambda x: x.to_dict('records')).reset_index(name='dict')
print (df)
a dict
0 A [{'a': 'A', 'b': 1, 'c': 0, 'd': 1}, {'a': 'A'...
1 B [{'a': 'B', 'b': 4, 'c': 6, 'd': 0}]
2 C [{'a': 'C', 'b': 1, 'c': 1, 'd': 0}]
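If a plain Python dict keyed by 'a' is enough, rather than a new dataframe column, a dict comprehension over the groups avoids the object-dtype column entirely; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'A'],
                   'b': [1, 4, 1, 3],
                   'c': [0, 6, 1, 0],
                   'd': [1, 0, 0, 5]})

# Map each value of 'a' to the list of row-dicts in its group.
records = {key: group.to_dict('records') for key, group in df.groupby('a')}
print(records['B'])
```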
Say I have a pandas dataframe. I can access the columns either by their name or by their index.
Is there a simple way in which I can retrieve the column index given its name?
Use get_loc on the columns Index object to return the ordinal index value:
In [283]:
df = pd.DataFrame(columns=list('abcd'))
df
Out[283]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []
In [288]:
df.columns.get_loc('b')
Out[288]:
1
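If you need the positions of several columns at once, Index.get_indexer accepts a list of labels; a small sketch:

```python
import pandas as pd

df = pd.DataFrame(columns=list('abcd'))

# get_loc returns a single ordinal position; get_indexer looks up
# several labels in one call (returning -1 for missing labels).
print(df.columns.get_loc('b'))             # 1
print(df.columns.get_indexer(['b', 'd']))  # [1 3]
```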
What do you mean by index exactly?
I bet you are referring to index as a list index, right?
Because Pandas has another kind of index too.
From my first understanding, you can do the following:
my_df = pd.DataFrame(columns=['A', 'B', 'C'])
my_columns = my_df.columns.tolist()
print(my_columns)  # yields ['A', 'B', 'C'], so you can recover the index with a plain list lookup
my_columns.index('C')  # yields 2