I am new to Pandas and was wondering how to delete a specific row using the row id. Currently, I have a CSV file that contains data about different students. I do not have any headers in my CSV file.
data.csv:
John 21 34 87 ........ #more than 100 columns of data
Abigail 18 45 53 ........ #more than 100 columns of data
Norton 19 45 12 ........ #more than 100 columns of data
data.py:
I have a list that has a record of some names.
names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']
I opened my CSV file in Python and used a list comprehension to read all the names in the first column and store them in a list named 'student_list'.
Now, for every element in student_list, if the element does not appear in the 'names' list, I want to delete that row from my CSV file. In this example, I want to delete John and Norton since they do not appear in the names list. How can I achieve this using pandas? Or is there a better alternative than pandas for this problem?
I have tried the following code below:
import csv
import pandas as pd

csv_filename = 'data.csv'
with open(csv_filename, 'r') as readfile:
    reader = csv.reader(readfile, delimiter=',')
    student_list = [row[0] for row in reader]  # returns John, Abigail and Norton
for student in student_list:
    if student not in names:
        id = student_list.index(student)  # grab the index of the student who is not found in the names list
        # using pandas
        df = pd.read_csv(csv_filename)  # read the data.csv file
        df.drop(df.index[id], in_place=True)  # delete the row for the student who does not exist in the names list
        df.to_csv(csv_filename, index=False, sep=',')  # write the csv file back with no index
    else:
        print("Student name found in names list")
I am not able to delete the data properly. Can anybody explain?
You can just use a filter to filter out the ids you don't want.
Example:
import pandas as pd
from io import StringIO
data = """
1,John
2,Beckey
3,Timothy
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, names=['id', 'name'])
unwanted_ids = [3]
new_df = df[~df.id.isin(unwanted_ids)]
You could also use the filter to get the indices and drop those rows from the original dataframe in place. Example:
df.drop(df[df.id.isin([3])].index, inplace=True)
Update for updated question:
df = pd.read_csv(csv_filename, sep='\t', header=None, names=['name', 'age'])
# keep only the wanted names and reset the index so it starts from 0
# drop=True makes sure the old index is dropped instead of added as a column
df = df[df.name.isin(names)].reset_index(drop=True)
# if you really want the index to start from 1 you can use this
df.index = df.index + 1
df.to_csv(csv_filename, index=False, sep=',')
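If your file really has no header row and more than 100 columns, as described, you do not need to name every column; you can filter on the first column directly. A minimal sketch, assuming whitespace-separated fields as in the sample shown (use sep=',' if the file is truly comma-separated):
import pandas as pd

names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']

# header=None stops pandas from treating the first student as a header row
df = pd.read_csv('data.csv', sep=r'\s+', header=None)

# column 0 holds the student names; keep only the wanted rows
df = df[df[0].isin(names)].reset_index(drop=True)
df.to_csv('data.csv', index=False, header=False)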
I am comparing two CSV files. I need to know which rows match by comparing specific columns. The output needs to be the rows that match.
Data:
CSV 1:
name, age, occupation
Alice,51,accountant
John,23,driver
Peter,32,plumber
Jose,50,doctor
CSV 2:
name, age, occupation
Alice,51,dentist
Ted,43,carpenter
Peter,32,plumber
Jose,50,doctor
Desired Output:
Rather than returning a boolean, I would like to output the matching rows when only comparing the name and age columns:
Alice,51,accountant
Alice,51,dentist
Peter,32,plumber
Jose,50,doctor
Code:
I am comparing the two CSVs to see whether the columns in the 'columns_to_compare_test' list match, which returns a boolean, True or False.
# read two csv files into dataframes
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')
# compare only the name and age columns
columns_to_compare_test = ['name', 'age']
# prints True or False
print(df1[columns_to_compare_test].equals(df2[columns_to_compare_test]))
# output: False
Thank you.
I'd suggest the following:
import pandas as pd

# load the csv files
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')

# compare only these columns, element-wise, and require that
# all of them match within a row
cols = ['name', 'age']
mask = df1[cols].eq(df2[cols]).all(axis=1)

# select the matching rows from both frames and stack them
result = pd.concat([df1[mask], df2[mask]])

# remove duplicates, e.g. Jose
result = result.drop_duplicates()
print(result.head())
eq compares the two dataframes element-wise, and all(axis=1) makes sure that both name and age match within a row.
You can then use the resulting boolean series to select the relevant rows in both dfs and concatenate them with pd.concat. This will result in the following output:
name age occupation
0 Alice 51 accountant
2 Peter 32 plumber
3 Jose 50 doctor
0 Alice 51 dentist
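Note that eq compares the two frames row by row by position, so this only finds matches that happen to sit on the same line in both files (as they do in your sample). If matching rows can appear at different positions, merging on the compared columns is an alternative. A minimal sketch, assuming the header names are clean (no stray spaces):
import pandas as pd

df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')

keys = ['name', 'age']

# name/age pairs that occur in both files, regardless of row position
both = df1[keys].merge(df2[keys]).drop_duplicates()

# pull the full rows for those pairs from each file, then stack and dedupe
result = pd.concat([df1.merge(both), df2.merge(both)]).drop_duplicates()
print(result)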
Please let me know if you wanted to achieve a different effect.
I'm reading the following excel sheet into a dataframe.
I want to split it into three dataframes by product. The tables will always be delimited by a single blank column in between, but each table can have a different number of columns.
Based on the article mentioned in the comment, you can process it as follows.
import pandas as pd

#### Read the excel file into a dataframe
df = pd.read_excel('test.xlsx', index_col=None, header=None)

#### Find the empty columns and list them
empcols = [col for col in df.columns if df[col].isnull().all()]
df.fillna('', inplace=True)

#### Split into runs of consecutive columns of valid data
allcols = list(range(len(df.columns)))
start = 0
colslist = []
for sepcol in empcols:
    colslist.append(allcols[start:sepcol])
    start = sepcol + 1
colslist.append(allcols[start:])

#### Extract each run of columns of valid data and store it in a dictionary
dfdic = {}
for i in range(len(colslist)):
    wkdf = df.iloc[:, colslist[i]].copy()  # copy so the drop below works on its own frame, not a view
    title = ''.join(wkdf.iloc[0].tolist())  # first row holds the table title
    wkcols = wkdf.iloc[1].tolist()          # second row holds the column headers
    wkdf.drop(wkdf.index[[0, 1]], inplace=True)  # drop the title and header rows
    wkdf.columns = wkcols
    dfdic[title] = wkdf.reset_index(drop=True)

#### Display each DataFrame stored in the dictionary
dfkeys = dfdic.keys()
for k in dfkeys:
    print(k)
    print(dfdic[k])
    print()
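As a quick illustration of the empty-column detection step on a toy layout (two tables separated by one fully blank column; the values here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['Product A', '',    np.nan, 'Product B', ''],
    ['id',        'qty', np.nan, 'id',        'price'],
    [1,           10,    np.nan, 5,           2.5],
])

empcols = [col for col in df.columns if df[col].isnull().all()]
print(empcols)  # [2] -- only the separator column is entirely NaN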
I have an Excel file with a number of sheets and I want to delete the duplicates from all of the sheets. I used the code below:
df = df.drop_duplicates(subset='Month', keep='last')
After that I save the df:
df.to_excel(path, index=False)
but it only removes the duplicates from the first sheet, and the saved file contains only one sheet.
I would suggest treating each sheet of your document as a separate dataframe, then removing the duplicates from each one in a loop. Note that calling to_excel with a plain filename inside the loop overwrites the whole file on every pass, which is why you end up with a single sheet; writing through a pd.ExcelWriter keeps every sheet. This is a quick draft of the concept for 2 sheets:
import pandas as pd

xls = pd.ExcelFile('myFile.xls')
xls_dfs = []
df1 = pd.read_excel(xls, 'Sheet1')
xls_dfs.append(df1)
df2 = pd.read_excel(xls, 'Sheet2')
xls_dfs.append(df2)

# write each deduplicated sheet back under its own name
# (saved as .xlsx, since writing legacy .xls needs the old xlwt engine)
with pd.ExcelWriter('myFile.xlsx') as writer:
    for i, df in enumerate(xls_dfs):
        df = df.drop_duplicates(subset='Month', keep='last')
        df.to_excel(writer, sheet_name='Sheet' + str(i + 1), index=False)
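If the number of sheets is not fixed, the same idea works for however many sheets the file contains. A sketch, assuming every sheet has a 'Month' column:
import pandas as pd

xls = pd.ExcelFile('myFile.xls')
with pd.ExcelWriter('myFile_deduped.xlsx') as writer:
    for sheet in xls.sheet_names:
        df = pd.read_excel(xls, sheet)
        df = df.drop_duplicates(subset='Month', keep='last')
        df.to_excel(writer, sheet_name=sheet, index=False)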
I'm getting the error:
ValueError: Wrong number of items passed 3, placement implies 1.
What I want to do is import a dataset, count the duplicated values, drop the duplicates, and add a column which says that there were x duplicates of that row.
This is to sort a dataset of 13 000 rows and 45 columns.
I've tried different solutions found online, but none of them seem to help. I'm pretty new to programming and all help is really appreciated.
import pandas as pd

# Making file ready
data = pd.read_excel(r'Some file.xlsx', header=0)
data.rename(columns={'Dato': 'Last ordered', 'ArtNr': 'Item No:'}, inplace=True)

# Formatting dates
pd.to_datetime(data['Last ordered'], format='%Y-%m-%d %H:%M:%S')

# Creates new table content and order
df = data[['Item No:', 'Last ordered', 'Description']]
df['Last ordered'] = df['Last ordered'].dt.strftime('%Y-/%m-/%d')
df = df.sort_values('Last ordered', ascending=False)

# Adds total sold quantity column
df['Quantity'] = df.groupby('Item No:').transform('count')
df2 = df.drop_duplicates('Item No:').reset_index(drop=True)

# Prints to environment and creates new excel file
print(df2)
df2.to_excel(r'New Sorted File.xlsx')
I expect it to provide a new excel file with columns:
Item No | Last ordered | Description | Quantity
And I want to be able to add other columns from the original dataset later on if I need to.
The problem is at this line:
df['Quantity'] = df.groupby('Item No:').transform('count')
The right-hand side of the assignment is a dataframe, and you are trying to fit it into a single column. You need to select just one of its columns. Something like
df['Quantity'] = df.groupby('Item No:').transform('count')['Description']
should work.
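For illustration, here is a minimal reproduction of the error with made-up data (only the column names follow your script; the values are placeholders):
import pandas as pd

df = pd.DataFrame({
    'Item No:': ['A', 'A', 'B'],
    'Last ordered': ['2019-01-01', '2019-01-02', '2019-01-03'],
    'Description': ['x', 'x', 'y'],
})

# transform over the whole groupby counts every non-key column, so the
# result is a dataframe with several columns, not a single column
counts = df.groupby('Item No:').transform('count')
print(counts.shape)  # (3, 2): several columns cannot be placed into one

# selecting one column first yields a series, which assigns cleanly
df['Quantity'] = df.groupby('Item No:')['Description'].transform('count')
print(df)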
I have an excel file which contains 2 columns, and I want to convert these columns to a dictionary, using the header as the key and the rows as the values.
file1.xlsx
name price
Merry 5000
John 6500
Nat 4800
The dictionary should be like this:
List1 = {'name': ['Merry', 'John', 'Nat'], 'price': [5000, 6500, 4800]}
Please help?
Use the pandas library for this:
import pandas as pd
df = pd.read_excel("path to your excel file")
list1 = df.to_dict(orient='list')
Here is the documentation for df.to_dict.
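For the sample data above, this produces one list per column (sketched here with an inline dataframe instead of the real file):
import pandas as pd

df = pd.DataFrame({'name': ['Merry', 'John', 'Nat'], 'price': [5000, 6500, 4800]})
list1 = df.to_dict(orient='list')
print(list1)  # {'name': ['Merry', 'John', 'Nat'], 'price': [5000, 6500, 4800]}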