I am trying to create a CSV file where, if certain columns have the same values, I merge the rows with those matching values into one row.
eg:
Input:
Party_No  install_date  Start_date  End_date    Product_family  Version  City  state
111       24-05-2018    25-05-2019  21-03-2020  storage         1        LA    USA
111       24-05-2018    25-05-2019  21-03-2020  storage         1        KA    USA
111       24-05-2018    25-05-2019  21-03-2020  storage         2        PA    UK
Output:
Party_No  install_date  Start_date  End_date    Product_family  Version  City      state
111       24-05-2018    25-05-2019  21-03-2020  storage         1,2      LA,KA,PA  UK,USA
In my case: if the values of party_number, item_install_date, Contract_subline_start_date, Contract_subline_end_date and Instance_family are the same, I will merge the rows with those values into one row. Every other column will hold comma-separated values.
Input CSV file link
Expected output CSV link
Code I tried:
import pandas as pd
import numpy as np

df = pd.read_csv("Export.csv")
df.fillna(0, inplace=True)
pf = df.groupby(['PARTY_NUMBER', 'ITEM_INSTALL_DATE', 'CONTRACT_SUBLINE_START_DATE',
                 'CONTRACT_SUBLINE_END_DATE', 'INSTANCE_PRODUCT_FAMILY']).agg([','.join])
pf.to_csv("result1.csv", index=False)
Adding unique (or set when order is not important):
df.groupby(['...']).agg(lambda x: ','.join(x.unique()))  # or: set(x)
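A fuller sketch putting this together (the upper-case column names are assumed to match the Export.csv header used above; values are cast to str so numeric columns such as the version can be joined too):

import pandas as pd

df = pd.read_csv("Export.csv")
df.fillna(0, inplace=True)

key_cols = ['PARTY_NUMBER', 'ITEM_INSTALL_DATE', 'CONTRACT_SUBLINE_START_DATE',
            'CONTRACT_SUBLINE_END_DATE', 'INSTANCE_PRODUCT_FAMILY']

# cast everything to str so ','.join works on numeric columns too;
# as_index=False keeps the key columns as regular columns in the result
pf = (df.astype(str)
        .groupby(key_cols, as_index=False)
        .agg(lambda x: ','.join(x.unique())))
pf.to_csv("result1.csv", index=False)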
Related
I am comparing two CSV files. I need to know which rows match by comparing specific columns. The output needs to be the rows that match.
Data:
CSV 1:
name,age,occupation
Alice,51,accountant
John,23,driver
Peter,32,plumber
Jose,50,doctor
CSV 2:
name,age,occupation
Alice,51,dentist
Ted,43,carpenter
Peter,32,plumber
Jose,50,doctor
Desired Output:
Rather than returning a boolean, I would like to output the matching rows when only comparing the name and age columns:
Alice,51,accountant
Alice,51,dentist
Peter,32,plumber
Jose,50,doctor
Code:
I am comparing the two CSVs to see if the columns in the 'columns_to_compare_test' list match. I return a boolean, true or false.
import pandas as pd

# read two csv files into dataframes
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')
# compare only the name and age columns
columns_to_compare_test = ['name', 'age']
# print True or False
print(df1[columns_to_compare_test].equals(df2[columns_to_compare_test]))
# output: False
Thank you.
I'd suggest the following:
import pandas as pd

# load csv
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')

# look for rows where both name and age match (compared position-wise)
cols = ['name', 'age']
mask = df1[cols].eq(df2[cols]).all(axis=1)

# select the matching rows from both frames and stack them
# (pd.concat replaces the removed DataFrame.append)
result = pd.concat([df1[mask], df2[mask]])
# remove duplicates e.g. Jose
result = result.drop_duplicates()
print(result.head())
eq compares the two frames element-wise (unlike equals, which returns a single boolean for the whole frame), and all(axis=1) makes sure that both age and name match.
You can finally use the resulting boolean Series to select the relevant rows in both dfs and concatenate them. This will result in the following output:
name age occupation
0 Alice 51 accountant
2 Peter 32 plumber
3 Jose 50 doctor
0 Alice 51 dentist
Please let me know if you wanted to achieve a different effect.
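Note that the element-wise comparison above only matches rows that sit at the same position in both files. If matching rows can appear at different positions, a merge on the key columns is more robust; here is a sketch, assuming the same test.csv and test2.csv files:

import pandas as pd

df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')

keys = ['name', 'age']

# inner merge keeps only the name/age pairs that appear in both files
matched = df1[keys].merge(df2[keys]).drop_duplicates()

# pull the full rows for those pairs from each file and stack them
result = pd.concat([df1.merge(matched, on=keys),
                    df2.merge(matched, on=keys)]).drop_duplicates()
print(result)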
I am new to Pandas and was wondering how to delete a specific row using the row id. Currently, I have a CSV file that contains data about different students. I do not have any headers in my CSV file.
data.csv:
John 21 34 87 ........ #more than 100 columns of data
Abigail 18 45 53 ........ #more than 100 columns of data
Norton 19 45 12 ........ #more than 100 columns of data
data.py:
I have a list that has a record of some names.
names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']
I opened my CSV file in Python and used a list comprehension to read all the names in the first column and store them in a list assigned to the variable 'student_list'.
Now, for all elements in the student_list, if the element is not seen in the 'names' list, I want to delete that element in my CSV file. In this example, I want to delete John and Norton since they do not appear in the names list. How can I achieve this using pandas? Or, is there a better alternative out there than compared to using pandas for this problem?
I have tried the following code below:
csv_filename = 'data.csv'

with open(csv_filename, 'r') as readfile:
    reader = csv.reader(readfile, delimiter=',')
    student_list = [row[0] for row in reader]  # returns John, Abigail and Norton

for student in student_list:
    if student not in names:
        # grab the index of the student in student_list who's not found in the names list
        id = student_list.index(student)
        # using pandas
        df = pd.read_csv(csv_filename)  # read data.csv file
        df.drop(df.index[id], in_place=True)  # delete the row for the student who does not exist in the names list
        df.to_csv(csv_filename, index=False, sep=',')  # write the csv file back with no index
    else:
        print("Student name found in names list")
I am not able to delete the data properly. Can anybody explain?
You can just use a filter to filter out the ids you don't want.
Example:
import pandas as pd
from io import StringIO
data = """
1,John
2,Beckey
3,Timothy
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, names=['id', 'name'])
unwanted_ids = [3]
new_df = df[~df.id.isin(unwanted_ids)]
You could also use a filter and get the indices to drop the rows in the original dataframe. Example:
df.drop(df[df.id.isin([3])].index, inplace=True)
Update for the updated question:
# the file has no header row; read it whitespace-separated and
# treat the first column (index 0) as the student name
df = pd.read_csv(csv_filename, sep=r'\s+', header=None)
# keep only the names wanted and reset the index starting from 0
# drop=True makes sure to drop the old index and not add it as a column
df = df[df[0].isin(names)].reset_index(drop=True)
# if you really want the index starting from 1 you can use this
df.index = df.index + 1
df.to_csv(csv_filename, index=False, sep=',', header=False)
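If you would rather avoid pandas entirely, the same filtering works with the csv module alone; a sketch, assuming the comma delimiter from the question's own reader:

import csv

names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']
csv_filename = 'data.csv'

# read every row and keep only those whose first field is a wanted name
with open(csv_filename, 'r', newline='') as readfile:
    reader = csv.reader(readfile, delimiter=',')
    kept_rows = [row for row in reader if row and row[0] in names]

# write the surviving rows back out
with open(csv_filename, 'w', newline='') as writefile:
    csv.writer(writefile, delimiter=',').writerows(kept_rows)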
I would like to list out each label/string that exists in specific columns. Such labels will appear multiple times in a column (e.g. Fleet, Travel, etc.), e.g.:
Column1 Column2
Facility Machine
Fleet Other
Travel Leased Vehicles
...... .......
How do I write the code to extract the labels into a numpy array?
Thank you.
Desired output
eg. feature_labels = np.array(['Column1_Facility', 'Column1_Fleet', 'Column2_Machine', etc
numpy has the char module for quasi-vectorized string operations. You could, for example, use np.char.add:
import functools as ft
import numpy as np

data = np.array([['Column1', 'Column2'],
                 ['Facility', 'Machine'],
                 ['Fleet', 'Other'],
                 ['Travel', 'Leased Vehicles'],
                 ['......', '.......']])
data
# array([['Column1', 'Column2'],
# ['Facility', 'Machine'],
# ['Fleet', 'Other'],
# ['Travel', 'Leased Vehicles'],
# ['......', '.......']], dtype='<U15')
ft.reduce(np.char.add, (data[:1], '_', data[1:]))
# array([['Column1_Facility', 'Column2_Machine'],
# ['Column1_Fleet', 'Column2_Other'],
# ['Column1_Travel', 'Column2_Leased Vehicles'],
# ['Column1_......', 'Column2_.......']], dtype='<U31')
I am not completely sure I fully understand the question, but here is my attempt:
df = pd.DataFrame({'Column1': ['Facility', 'Fleet', 'Travel'], 'Column2': ['Machine', 'Other', 'Leased Vehicles']})
df
#Outputs:
Column1 Column2
0 Facility Machine
1 Fleet Other
2 Travel Leased Vehicles
Then iterate over the columns to prefix each feature name with its column name, as you want:
for col in df.columns:
    df[col] = df[col].apply(lambda x: f'{col}_{x}')
The above would give you:
Column1 Column2
0 Column1_Facility Column2_Machine
1 Column1_Fleet Column2_Other
2 Column1_Travel Column2_Leased Vehicles
And now you can simply extract the values of each column:
df.Column1.values
Result:
array(['Column1_Facility', 'Column1_Fleet', 'Column1_Travel'],
dtype=object)
EDIT:
If you want to list only the unique values in a column:
Column1 Column2
0 Column1_Facility Column2_Machine
1 Column1_Fleet Column2_Other
2 Column1_Travel Column2_Leased Vehicles
3 Column1_Facility Column2_Machine
You'll need to use:
df.Column1.unique()
Result:
array(['Column1_Facility', 'Column1_Fleet', 'Column1_Travel'],
dtype=object)
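If the goal is the single feature_labels array from the question rather than a modified frame, the prefixing and collection can be combined; a sketch using the same example frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Column1': ['Facility', 'Fleet', 'Travel'],
                   'Column2': ['Machine', 'Other', 'Leased Vehicles']})

# prefix each value with its column name and collect the unique
# labels of every column into one flat array
feature_labels = np.concatenate([(col + '_' + df[col]).unique()
                                 for col in df.columns])
print(feature_labels)
# ['Column1_Facility' 'Column1_Fleet' 'Column1_Travel'
#  'Column2_Machine' 'Column2_Other' 'Column2_Leased Vehicles']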
I want to apply filters to a spreadsheet using Python. Which module is more useful: Pandas or any other?
Filtering within your pandas dataframe can be done with loc (in addition to some other methods). What I THINK you're looking for is a way to export dataframes to Excel and apply a filter within Excel.
XlsxWriter (by John McNamara) satisfies pretty much all xlsx/pandas use cases and has great documentation here --> https://xlsxwriter.readthedocs.io/.
Auto-filtering is an option :) https://xlsxwriter.readthedocs.io/worksheet.html?highlight=auto%20filter#worksheet-autofilter
I am not sure if I understand your question right. Maybe the combination of pandas and qgrid might help you.
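To illustrate, a minimal sketch of the qgrid suggestion for a Jupyter notebook ('data.csv' is a placeholder file name):

import pandas as pd
import qgrid

df = pd.read_csv('data.csv')  # placeholder input file

# show_grid renders the frame as an interactive grid widget
# with filter and sort controls on every column
widget = qgrid.show_grid(df, show_toolbar=True)
widget  # display the widget in the notebook

# after filtering interactively, read the filtered frame back
filtered_df = widget.get_changed_df()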
Simple filtering in pandas can be accomplished using the .loc DataFrame method.
In [4]: data = ({'name': ['Joe', 'Bob', 'Alice', 'Susan'],
...: 'dept': ['Marketing', 'IT', 'Marketing', 'Sales']})
In [5]: employees = pd.DataFrame(data)
In [6]: employees
Out[6]:
name dept
0 Joe Marketing
1 Bob IT
2 Alice Marketing
3 Susan Sales
In [7]: marketing = employees.loc[employees['dept'] == 'Marketing']
In [8]: marketing
Out[8]:
name dept
0 Joe Marketing
2 Alice Marketing
You can also use .loc with .isin to select multiple values in the same column
In [9]: marketing_it = employees.loc[employees['dept'].isin(['Marketing', 'IT'])]
In [10]: marketing_it
Out[10]:
name dept
0 Joe Marketing
1 Bob IT
2 Alice Marketing
You can also pass multiple conditions to .loc using an and (&) or or (|) statement to select values from multiple columns
In [11]: joe = employees.loc[(employees['dept'] == 'Marketing') & (employees['name'] == 'Joe')]
In [12]: joe
Out[12]:
name dept
0 Joe Marketing
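For example, selecting rows where the dept is IT or the name is Susan, using or (|) on the same frame:
In [13]: it_or_susan = employees.loc[(employees['dept'] == 'IT') | (employees['name'] == 'Susan')]
In [14]: it_or_susan
Out[14]:
    name   dept
1    Bob     IT
3  Susan  Sales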
Here is an example of adding an autofilter to a worksheet exported from Pandas using XlsxWriter:
import pandas as pd
# Create a Pandas dataframe by reading some data from a space-separated file.
df = pd.read_csv('autofilter_data.txt', sep=r'\s+')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_autofilter.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object. We also turn off the
# index column at the left of the output dataframe.
df.to_excel(writer, sheet_name='Sheet1', index=False)
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Get the dimensions of the dataframe.
(max_row, max_col) = df.shape
# Make the columns wider for clarity.
worksheet.set_column(0, max_col - 1, 12)
# Set the autofilter.
worksheet.autofilter(0, 0, max_row, max_col - 1)
# Add an optional filter criteria. The placeholder "Region" in the filter
# is ignored and can be any string that adds clarity to the expression.
worksheet.filter_column(0, 'Region == East')
# It isn't enough to just apply the criteria. The rows that don't match
# must also be hidden. We use Pandas to figure out which rows to hide.
for row_num in df.index[df['Region'] != 'East'].tolist():
    worksheet.set_row(row_num + 1, options={'hidden': True})
# Close the Pandas Excel writer and output the Excel file.
writer.close()
Output:
The data used in this example is here.
How do I access the first column in this dataframe?
If I refer to it by the column name ('Group11...'), I get an error 'Not in index'.
[image: First Column]
iloc returns data based on a numeric index; here, all rows for the first (Python 0-indexed) column.
df.iloc[:,0]
What you are referring to is the index of the dataframe. So, if your dataframe is called df, you can access the index using df.index.
Otherwise, if you want to refer to the index as a column, you need to turn it into a column first using pandas.DataFrame.reset_index.
Here's a reproducible example showing the two methods of accessing the index:
from io import StringIO
import pandas as pd
data = """Group11.Primary.Phrase|count|num_cat
CP|4|4
DA|1|1
FW|7|7
"""
df = pd.read_csv(StringIO(data), sep="|", index_col=0)
print("here's how the dataframe looks like")
print(df.head())
print("here's how to access the index")
print(df.index)
print("if you want to turn the index values into a list")
print(list(df.index))
print("you can also reset_index as a column and access it")
df = df.reset_index()
print(df["Group11.Primary.Phrase"])
Running the above code, gives you the following output:
here's what the dataframe looks like
count num_cat
Group11.Primary.Phrase
CP 4 4
DA 1 1
FW 7 7
here's how to access the index
Index(['CP', 'DA', 'FW'], dtype='object', name='Group11.Primary.Phrase')
if you want to turn the index values into a list
['CP', 'DA', 'FW']
you can also reset_index as a column and access it
0 CP
1 DA
2 FW
Name: Group11.Primary.Phrase, dtype: object
You can reset the index and then access the column by its name, i.e.
If you have a dataframe like
count num_cat
Group11.Primary.Phrase
CP 4 4
DA 1 1
FW 7 7
Then, after resetting the index, you can access the column by its name:
df = df.reset_index()
df['Group11.Primary.Phrase']
Output:
0 CP
1 DA
2 FW
Here is a link to the docs: Indexing and Selecting Data. In your case, you would index df['Group11'].
In [9]: df
Out[9]:
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [12]: df[['A', 'B']]
Out[12]:
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02  1.212112 -0.173215
2000-01-03 -0.861849 -2.104569