I am comparing two CSV files. I need to know which rows match by comparing specific columns. The output needs to be the rows that match.
Data:
CSV 1:
name,age,occupation
Alice,51,accountant
John,23,driver
Peter,32,plumber
Jose,50,doctor
CSV 2:
name,age,occupation
Alice,51,dentist
Ted,43,carpenter
Peter,32,plumber
Jose,50,doctor
Desired Output:
Rather than returning a boolean, I would like to output the matching rows when comparing only the name and age columns:
Alice,51,accountant
Alice,51,dentist
Peter,32,plumber
Jose,50,doctor
Code:
I am comparing the two CSVs to see if the columns in the 'columns_to_compare_test' list match. My current code only returns a single boolean, True or False.
import pandas as pd

# read two csv files into dataframes
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')
# compare only the name and age columns
columns_to_compare_test = ['name', 'age']
# print True or False
print(df1[columns_to_compare_test].equals(df2[columns_to_compare_test]))
# output: False
Thank you.
I'd suggest the following:
import pandas as pd

# load the CSV files
df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')
# look for rows where name and age match, element-wise
cols = ['name', 'age']
mask = df1[cols].eq(df2[cols]).all(axis=1)
# select the matching rows from both frames and stack them
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
result = pd.concat([df1[mask], df2[mask]])
# remove duplicates, e.g. Jose
result = result.drop_duplicates()
print(result)
Your code uses equals(), which compares the two DataFrames as a whole and returns a single boolean. eq() instead compares them element by element, and all(axis=1) makes sure that both name and age match within each row.
You can then use the resulting boolean Series to select the relevant rows in both DataFrames and concatenate them. This results in the following output:
name age occupation
0 Alice 51 accountant
2 Peter 32 plumber
3 Jose 50 doctor
0 Alice 51 dentist
Please let me know if you wanted to achieve a different effect.
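As an aside, the eq() approach requires matching rows to sit at the same positions in both files. If that is not guaranteed, a hedged sketch using an inner merge on the compared columns would give the same output regardless of row order:
import pandas as pd

df1 = pd.read_csv('test.csv', sep=',', encoding='UTF-8')
df2 = pd.read_csv('test2.csv', sep=',', encoding='UTF-8')
# keep the rows of each frame whose (name, age) pair appears in the other
keys = ['name', 'age']
matches = pd.concat([df1.merge(df2[keys], on=keys),
                     df2.merge(df1[keys], on=keys)]).drop_duplicates()
print(matches)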
Related
I have a client data df with 200+ columns, say A,B,C,D...X,Y,Z. There's a CAMPAIGN_ID column in this df. I have another data file, mapping_csv, that has CAMPAIGN_ID and the set of columns I need from df. I need to split df into one CSV file per campaign, each containing the rows for that campaign and only the columns given in mapping_csv.
I am getting a TypeError, as below.
TypeError: unhashable type: 'list'
This is what I tried.
import datetime as dt
import pandas as pd

for campaign in df['CAMPAIGN_ID'].unique():
    df2 = df[df['CAMPAIGN_ID'] == campaign]
    # remove blank columns
    df2.dropna(how='all', axis=1, inplace=True)
    # drop columns that only contain "0000-00-00"
    for column in df2.columns:
        if df2[column].unique()[0] == "0000-00-00" and df2[column].unique().shape[0] == 1:
            df2 = df2.drop(column, axis=1)
    # drop columns that only contain '0'
    for column in df2.columns:
        if df2[column].unique()[0] == '0' and df2[column].unique().shape[0] == 1:
            df2 = df2.drop(column, axis=1)
    # select required columns
    df2 = df2[mapping_csv.loc[mapping_csv['CAMPAIGN_ID'] == campaign, 'Variable_List'].str.replace(" ", "").str.split(",")]
    file_shape = df2.shape[0]
    filename = "cart_" + str(dt.date.today().strftime('%Y%m%d')) + "_" + campaign + "_rowcnt_" + str(file_shape)
    df2.to_csv(filename + ".csv", index=False)
Any help will be appreciated.
This addresses your core problem.
df = pd.DataFrame(dict(id=['foo', 'foo', 'bar', 'bar'],
                       a=[1, 2, 3, 4], b=[5, 6, 7, 8], c=[1, 2, 3, 4]))
mapper = dict(foo=['a', 'b'], bar=['b', 'c'])
for each_id in df.id.unique():
    # select this id's rows and only the columns mapped to that id
    df_id = df.loc[df.id == each_id, mapper[each_id]]
    print(df_id)
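To connect this back to the question's setup, here is a minimal hedged sketch; the mapping_csv column names CAMPAIGN_ID and Variable_List are taken from the question, and the output filename scheme is illustrative:
# build a campaign -> column-list dict from the mapping file (assumed layout)
mapper = {row['CAMPAIGN_ID']: row['Variable_List'].replace(' ', '').split(',')
          for _, row in mapping_csv.iterrows()}

for campaign in df['CAMPAIGN_ID'].unique():
    # rows for this campaign, restricted to its mapped columns
    out = df.loc[df['CAMPAIGN_ID'] == campaign, mapper[campaign]]
    out.to_csv(f'campaign_{campaign}.csv', index=False)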
I'm reading an Excel sheet into a dataframe.
I want to split it into three dataframes by product. The tables will always be delimited by a single blank column in between, but each table can have a different number of columns.
Based on the article linked in the comment, you can process it as follows.
import pandas as pd

#### Read excel file to dataframe
df = pd.read_excel('test.xlsx', index_col=None, header=None)
#### Find empty columns and list them
empcols = [col for col in df.columns if df[col].isnull().all()]
df.fillna('', inplace=True)
#### Split into consecutive columns of valid data
allcols = list(range(len(df.columns)))
start = 0
colslist = []
for sepcol in empcols:
    colslist.append(allcols[start:sepcol])
    start = sepcol + 1
colslist.append(allcols[start:])
#### Extract consecutive columns of valid data and store them in a dictionary
dfdic = {}
for i in range(len(colslist)):
    # copy() avoids SettingWithCopyWarning when dropping rows below
    wkdf = df.iloc[:, colslist[i]].copy()
    title = ''.join(wkdf.iloc[0].tolist())
    wkcols = wkdf.iloc[1].tolist()
    wkdf.drop(wkdf.index[[0, 1]], inplace=True)
    wkdf.columns = wkcols
    dfdic[title] = wkdf.reset_index(drop=True)
#### Display each DataFrame stored in the dictionary
for k in dfdic.keys():
    print(k)
    print(dfdic[k])
    print()
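If you want to test the splitting logic without an Excel file, here is a minimal sketch with an in-memory frame standing in for the read_excel result; the two-table layout is an assumption mimicking the sheet described in the question:
import numpy as np
import pandas as pd

# two small tables side by side, separated by one all-NaN column
df = pd.DataFrame([
    ['Product A', '',    np.nan, 'Product B', ''],
    ['price',     'qty', np.nan, 'price',     'qty'],
    [10,          1,     np.nan, 20,          2],
    [30,          3,     np.nan, 40,          4],
])
Running the code above with this df in place of the read_excel call should print the two tables under the keys 'Product A' and 'Product B'.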
I have a sample dataframe as given below.
import pandas as pd
import numpy as np

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns contains lists. I want to unwrap those lists while preserving the datatypes of the values inside them, for all columns.
The final output should look something like what is shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
For the rows without missing values, both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists, with possible missing values such as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column, skipping missing values
for c in list_cols:
    # If all non-null lists in column c contain a single item:
    if (df[c].dropna().str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        # join lists, passing NaN through untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
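Applied to the sample frame from the question, this loop should leave the missing values in place while unwrapping everything else (output sketched by hand from the sample data; exact alignment and dtypes may differ):
print(df)
#   ID  Age     Sex         Interest
# 0  A   20    Male     Dance, Music
# 1  B   21    Male    Dance, Sports
# 2  C   19  Female  Hiking, Surfing
# 3  D  NaN     NaN              NaN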
I am new to Pandas and was wondering how to delete a specific row using the row id. Currently, I have a CSV file that contains data about different students. I do not have any headers in my CSV file.
data.csv:
John 21 34 87 ........ #more than 100 columns of data
Abigail 18 45 53 ........ #more than 100 columns of data
Norton 19 45 12 ........ #more than 100 columns of data
data.py:
I have a list that has a record of some names.
names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']
I opened my CSV file in Python and used a list comprehension to read all the names in the first column and store them in a list assigned to the variable student_list.
Now, for every element in student_list, if the element does not appear in the names list, I want to delete that row in my CSV file. In this example, I want to delete John and Norton since they do not appear in the names list. How can I achieve this using pandas? Or is there a better alternative than pandas for this problem?
I have tried the following code below:
import csv
import pandas as pd

csv_filename = 'data.csv'
with open(csv_filename, 'r') as readfile:
    reader = csv.reader(readfile, delimiter=',')
    student_list = [row[0] for row in reader]  # returns John, Abigail and Norton

for student in student_list:
    if student not in names:
        # grab the index of the student who's not found in the names list
        id = student_list.index(student)
        # using pandas
        df = pd.read_csv(csv_filename)  # read data.csv file
        df.drop(df.index[id], inplace=True)  # delete the row for that student
        df.to_csv(csv_filename, index=False, sep=',')  # write the file back without the index
    else:
        print("Student name found in names list")
I am not able to delete the data properly. Can anybody explain?
You can just use a filter to filter out the ids you don't want.
Example:
import pandas as pd
from io import StringIO
data = """
1,John
2,Beckey
3,Timothy
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, names=['id', 'name'])
unwanted_ids = [3]
new_df = df[~df.id.isin(unwanted_ids)]
You could also use a filter to get the indices and drop the rows in the original dataframe. Example:
df.drop(df[df.id.isin([3])].index, inplace=True)
Update for updated question:
# the file has no header row; read it and name the first column
df = pd.read_csv(csv_filename, sep='\t', header=None)
df = df.rename(columns={0: 'name'})
# keep only the names wanted and reset the index starting from 0
# drop=True makes sure to drop the old index and not add it as a column
df = df[df['name'].isin(names)].reset_index(drop=True)
# if you really want the index to start from 1 you can use this
df.index = df.index + 1
df.to_csv(csv_filename, index=False, sep=',')
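For a self-contained check of this approach, here is a hedged sketch on the sample rows from the question; the three number columns stand in for the 100+ real ones, and tab separation is assumed:
import pandas as pd
from io import StringIO

names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']
data = "John\t21\t34\t87\nAbigail\t18\t45\t53\nNorton\t19\t45\t12\n"
df = pd.read_csv(StringIO(data), sep='\t', header=None)
df = df.rename(columns={0: 'name'})
df = df[df['name'].isin(names)].reset_index(drop=True)
print(df)
#       name   1   2   3
# 0  Abigail  18  45  53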
I would like to list out each label/string that exists in specific columns. Such labels can appear multiple times in a column (e.g. Fleet, Travel, etc.). For example:
Column1 Column2
Facility Machine
Fleet Other
Travel Leased Vehicles
...... .......
How do I write the code to extract the labels into a numpy array?
Thank you.
Desired output:
feature_labels = np.array(['Column1_Facility', 'Column1_Fleet', 'Column2_Machine', ...])
numpy has the char module for quasi-vectorized string operations. You could for example use np.char.add:
import functools as ft
import numpy as np

data = np.array([['Column1', 'Column2'],
                 ['Facility', 'Machine'],
                 ['Fleet', 'Other'],
                 ['Travel', 'Leased Vehicles'],
                 ['......', '.......']])

# prepend the header row to every data row, joined by '_'
ft.reduce(np.char.add, (data[:1], '_', data[1:]))
# array([['Column1_Facility', 'Column2_Machine'],
#        ['Column1_Fleet', 'Column2_Other'],
#        ['Column1_Travel', 'Column2_Leased Vehicles'],
#        ['Column1_......', 'Column2_.......']], dtype='<U31')
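Continuing from the snippet above, the flat feature_labels array the question asks for can be obtained by raveling the 2-D result; np.unique additionally sorts and deduplicates:
feature_labels = np.unique(
    ft.reduce(np.char.add, (data[:1], '_', data[1:])).ravel())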
I am not completely sure I fully understand the question, but here is my attempt:
df = pd.DataFrame({'Column1': ['Facility', 'Fleet', 'Travel'], 'Column2': ['Machine', 'Other', 'Leased Vehicles']})
df
#Outputs:
Column1 Column2
0 Facility Machine
1 Fleet Other
2 Travel Leased Vehicles
Then iterate over the columns, prepending the column name to each feature as you want:
for col in df.columns:
    df[col] = df[col].apply(lambda x: f'{col}_{x}')
The above would give you:
Column1 Column2
0 Column1_Facility Column2_Machine
1 Column1_Fleet Column2_Other
2 Column1_Travel Column2_Leased Vehicles
And now you can simply extract the values of each column:
df.Column1.values
Result:
array(['Column1_Facility', 'Column1_Fleet', 'Column1_Travel'],
dtype=object)
EDIT:
If you want to list only the unique values in a column:
Column1 Column2
0 Column1_Facility Column2_Machine
1 Column1_Fleet Column2_Other
2 Column1_Travel Column2_Leased Vehicles
3 Column1_Facility Column2_Machine
You'll need to use:
df.Column1.unique()
Result:
array(['Column1_Facility', 'Column1_Fleet', 'Column1_Travel'],
dtype=object)
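Finally, a hedged one-liner variant that builds the flat feature_labels array directly from the original, unprefixed df, without modifying the frame:
import numpy as np

feature_labels = np.concatenate(
    [(col + '_' + df[col].astype(str)).unique() for col in df.columns])
feature_labels
# array(['Column1_Facility', 'Column1_Fleet', 'Column1_Travel',
#        'Column2_Machine', 'Column2_Other', 'Column2_Leased Vehicles'],
#       dtype=object)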