I have an Excel file which contains 2 columns, and I want to convert these columns to a dictionary, using the headers as keys and the rows as values.
file1.xlsx
name price
Merry 5000
John 6500
Nat 4800
The dictionary should be like this:
List1 = {'name': ['Merry', 'John', 'Nat'], 'price': [5000, 6500, 4800]}
Please help?
Use the pandas library for this:
import pandas as pd
df = pd.read_excel("path to your excel file")
list1 = df.to_dict(orient='list')
Here is the documentation for df.to_dict: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html
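For illustration, here is a self-contained sketch with the example data built inline instead of read from disk, so the snippet runs without the file:
import pandas as pd

# Stand-in for pd.read_excel("file1.xlsx") so the example runs anywhere
df = pd.DataFrame({'name': ['Merry', 'John', 'Nat'],
                   'price': [5000, 6500, 4800]})

list1 = df.to_dict(orient='list')
print(list1)
# {'name': ['Merry', 'John', 'Nat'], 'price': [5000, 6500, 4800]}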
I have 25 sheets in an Excel file, and I want the list of column names (top row/header) from each of the sheets.
Can you specify how you want the answers collected? Do you want all the column names from each sheet in the same list or dataframe?
Assuming you want to collect the results into one DataFrame, where each row represents one sheet and each column holds one column name: the general idea is to loop over the sheets, calling pd.read_excel() with a different sheet_name each time.
import pandas as pd

n_sheets = 25
df = pd.DataFrame()
for i in range(n_sheets):
    # Read only the header row (nrows=1) of sheet i, without treating it as column labels
    sheet_i_col_names = pd.read_excel('file.xlsx', sheet_name=i, header=None, nrows=1)
    # df.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, sheet_i_col_names], ignore_index=True)
The resulting DataFrame can be further manipulated based on your specific requirements.
Output from my example Excel file, which only had 4 sheets:
Alternatively, you can pass a list to the sheet_name argument. In that case you get back a dictionary, which I find less useful, and int_sheet_names must be a list rather than a NumPy array.
n_sheets = 25
int_sheet_names = list(range(n_sheets))
# Returns a dict mapping sheet index -> one-row DataFrame of headings
sheets_dict = pd.read_excel('file.xlsx', sheet_name=int_sheet_names, header=None, nrows=1)
Output as a dictionary when passing a list to sheet_name kwarg
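If you do go the dictionary route, here is a small follow-up sketch (assuming the sheets_dict variable from above) that flattens each one-row frame into a plain list of header names:
# Maps sheet index -> list of that sheet's column headings
col_names = {sheet: frame.iloc[0].tolist()
             for sheet, frame in sheets_dict.items()}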
I am new to Pandas and was wondering how to delete a specific row using the row id. Currently, I have a CSV file that contains data about different students. I do not have any headers in my CSV file.
data.csv:
John 21 34 87 ........ #more than 100 columns of data
Abigail 18 45 53 ........ #more than 100 columns of data
Norton 19 45 12 ........ #more than 100 columns of data
data.py:
I have a list that has a record of some names.
names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']
I opened my CSV file in Python and used a list comprehension to read all the names in the first column and store them in a list assigned to the variable 'student_list'.
Now, for every element in student_list, if the element is not in the 'names' list, I want to delete that row from my CSV file. In this example, I want to delete John and Norton, since they do not appear in the names list. How can I achieve this using pandas? Or is there a better alternative than pandas for this problem?
I have tried the following code below:
import csv
import pandas as pd

csv_filename = 'data.csv'
with open(csv_filename, 'r') as readfile:
    reader = csv.reader(readfile, delimiter=',')
    student_list = [row[0] for row in reader] #returns John, Abigail and Norton.

for student in student_list:
    if student not in names:
        id = student_list.index(student) #grab the index of the student in student_list who's not found in the names list.
        #using pandas
        df = pd.read_csv(csv_filename) #read data.csv file
        df.drop(df.index[id], inplace=True) #delete the row id for the student who does not exist in names list.
        df.to_csv(csv_filename, index=False, sep=',') #save the csv file with no index
    else:
        print("Student name found in names list")
I am not able to delete the data properly. Can anybody explain?
You can just use a filter to filter out the ids you don't want.
Example:
import pandas as pd
from io import StringIO
data = """
1,John
2,Beckey
3,Timothy
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, names=['id', 'name'])
unwanted_ids = [3]
new_df = df[~df.id.isin(unwanted_ids)]
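For the example frame above, new_df keeps only the first two rows:
print(new_df)
#    id    name
# 0   1    John
# 1   2  Beckey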
You could also use a filter and get the indices to drop the rows in the original dataframe. Example:
df.drop(df[df.id.isin([3])].index, inplace=True)
Update for updated question:
df = pd.read_csv(csv_filename, sep='\t', header=None, names=['name', 'age'])
# keep only names wanted and reset index starting from 0
# drop=True makes sure to drop old index and not add it as column
df = df[df.name.isin(names)].reset_index(drop=True)
# if you really want index starting from 1 you can use this
df.index = df.index + 1
df.to_csv(csv_filename, index = False, sep=',')
I have a file that has 312759 rows but only one column, with all the header names crammed into that one row, so I need to separate the rows into their own values and columns. The DataFrame is 312759 rows × 1 column, but I need 312759 rows × approx. 40 headers/columns. I am new to Python and to the Stack Overflow community, so any help would be appreciated.
Read the data using pandas:
import pandas as pd
read = pd.read_csv('output.csv')
# Drop the first 5 rows by their index labels, in place
read.drop(read.head(5).index, inplace=True)
then save it back as a .csv file
read.to_csv("output2.csv")
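If everything landed in a single column because read_csv guessed the wrong delimiter, two hedged options may help (the ';' separator and the 'output.csv' file name are assumptions; replace them with your own):
import pandas as pd

# Option 1: tell read_csv the real delimiter up front (';' is a guess)
df = pd.read_csv('output.csv', sep=';')

# Option 2: split an already-loaded single text column into many columns
raw = pd.read_csv('output.csv', header=None, names=['line'])
df = raw['line'].str.split(';', expand=True)
df.columns = df.iloc[0]          # promote the first row to the header
df = df.drop(index=0).reset_index(drop=True)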
I want to apply filters to a spreadsheet using Python. Which module is more useful: pandas or something else?
Filtering within your pandas dataframe can be done with loc (in addition to some other methods). What I THINK you're looking for is a way to export dataframes to Excel and apply a filter within Excel.
XlsxWriter (by John McNamara) satisfies pretty much all xlsx/pandas use cases and has great documentation here --> https://xlsxwriter.readthedocs.io/.
Auto-filtering is an option :) https://xlsxwriter.readthedocs.io/worksheet.html?highlight=auto%20filter#worksheet-autofilter
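Here is a minimal sketch of the autofilter call on its own (the file name and data are made up for the example):
import xlsxwriter

workbook = xlsxwriter.Workbook('filtered.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write_row(0, 0, ['Region', 'Sales'])  # header row
worksheet.write_row(1, 0, ['East', 100])
worksheet.write_row(2, 0, ['West', 90])
worksheet.autofilter(0, 0, 2, 1)  # first_row, first_col, last_row, last_col
workbook.close()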
I am not sure if I understand your question right, but the combination of pandas and qgrid might help you.
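A minimal sketch, assuming you work inside a Jupyter notebook with qgrid installed (the example frame is made up):
import pandas as pd
import qgrid

df = pd.DataFrame({'name': ['Joe', 'Bob', 'Alice'],
                   'dept': ['Marketing', 'IT', 'Marketing']})
# Renders an interactive, filterable/sortable grid widget in the notebook
qgrid.show_grid(df, show_toolbar=True)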
Simple filtering in pandas can be accomplished using the .loc DataFrame method.
In [4]: data = ({'name': ['Joe', 'Bob', 'Alice', 'Susan'],
...: 'dept': ['Marketing', 'IT', 'Marketing', 'Sales']})
In [5]: employees = pd.DataFrame(data)
In [6]: employees
Out[6]:
name dept
0 Joe Marketing
1 Bob IT
2 Alice Marketing
3 Susan Sales
In [7]: marketing = employees.loc[employees['dept'] == 'Marketing']
In [8]: marketing
Out[8]:
name dept
0 Joe Marketing
2 Alice Marketing
You can also use .loc with .isin to select multiple values in the same column
In [9]: marketing_it = employees.loc[employees['dept'].isin(['Marketing', 'IT'])]
In [10]: marketing_it
Out[10]:
name dept
0 Joe Marketing
1 Bob IT
2 Alice Marketing
You can also pass multiple conditions to .loc, joined with the and (&) or or (|) operators, to select values from multiple columns
In [11]: joe = employees.loc[(employees['dept'] == 'Marketing') & (employees['name'] == 'Joe')]
In [12]: joe
Out[12]:
name dept
0 Joe Marketing
Here is an example of adding an autofilter to a worksheet exported from Pandas using XlsxWriter:
import pandas as pd
# Create a Pandas dataframe by reading some data from a space-separated file.
df = pd.read_csv('autofilter_data.txt', sep=r'\s+')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_autofilter.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object. We also turn off the
# index column at the left of the output dataframe.
df.to_excel(writer, sheet_name='Sheet1', index=False)
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Get the dimensions of the dataframe.
(max_row, max_col) = df.shape
# Make the columns wider for clarity.
worksheet.set_column(0, max_col - 1, 12)
# Set the autofilter.
worksheet.autofilter(0, 0, max_row, max_col - 1)
# Add an optional filter criteria. The placeholder "Region" in the filter
# is ignored and can be any string that adds clarity to the expression.
worksheet.filter_column(0, 'Region == East')
# It isn't enough to just apply the criteria. The rows that don't match
# must also be hidden. We use Pandas to figure out which rows to hide.
for row_num in (df.index[(df['Region'] != 'East')].tolist()):
worksheet.set_row(row_num + 1, options={'hidden': True})
# Close the Pandas Excel writer and output the Excel file.
writer.close()
Output:
The data used in this example is here.
What are the Python 3 options to efficiently (in performance and memory) extract sheet names from a very large .xlsx file, and, for a given sheet, the column names?
I've tried using pandas:
For sheet names using pd.ExcelFile:
xl = pd.ExcelFile(filename)
return xl.sheet_names
For column names using pd.ExcelFile:
xl = pd.ExcelFile(filename)
df = xl.parse(sheetname, nrows=2, **kwargs)
df.columns
For column names using pd.read_excel, with and without nrows (pandas > v0.23):
df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
df.columns
However, both pd.ExcelFile and pd.read_excel seem to read the entire .xlsx into memory and are therefore slow.
Thanks a lot!
Here is the easiest way I can share with you:
# read the sheet file
import pandas as pd
my_sheets = pd.ExcelFile('sheet_filename.xlsx')
my_sheets.sheet_names
According to this SO question, reading Excel files in chunks is not supported (see this issue on GitHub), and using nrows will still read the whole file into memory first.
Possible solutions:
Convert the sheet to CSV, and read that in chunks (see the sketch after this list).
Use something other than pandas. See this page for a list of alternative libraries.
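For the CSV route, a minimal sketch (the file name 'sheet1.csv' is an assumption):
import pandas as pd

# Header only: nrows=0 parses just the first line, so the column names
# come back without loading any data rows.
cols = pd.read_csv('sheet1.csv', nrows=0).columns.tolist()

# Chunked read: only `chunksize` rows are held in memory at a time.
for chunk in pd.read_csv('sheet1.csv', chunksize=10_000):
    pass  # replace with your own per-chunk processing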
I think this would help:
from openpyxl import load_workbook

# read_only mode streams the file instead of loading it all into memory
workbook = load_workbook(filename, read_only=True)
data = {}  # for storing each sheet's headings keyed by sheet title
for sheet in workbook.worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        data[sheet.title] = value  # a tuple with the heading of each column
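For the sheet-name half of the question, the same read-only workbook already has what you need; a one-liner assuming the workbook object above:
# Sheet names live in the workbook metadata, so this is cheap even for
# very large files opened with read_only=True.
print(workbook.sheetnames)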
This program lists all the sheets in the Excel file.
Pandas is used here.
import pandas as pd
with pd.ExcelFile('yourfile.xlsx') as xlsx:
    sh = xlsx.sheet_names
print("This workbook has the following sheets:", sh)