Merging rows in CSV where row[0] is a duplicate - python-3.x

I have a CSV file with the columns student_id,guardian_email,guardian_first_name,guardian_last_name.
In many cases a student has both a mom and a dad with info, so the student has more than one row in the CSV file.
For example, the original CSV looks like this:
student_id,guardian_email,guardian_first_name,guardian_last_name
12345,momemail@google.com,Jane,Doe
12345,dademail@google.com,John,Doe
98765,coollady@yahoo.com,Mary,Poppins
99999,soccermom@bing.net,Laura,Croft
99999,blackbelt@karate.com,Chuck,Norris
Using Python, I want it to output this:
student_id,guardian_email,guardian_first_name,guardian_last_name,guardian_email2,guardian_first_name2,guardian_last_name2
12345,momemail@google.com,Jane,Doe,dademail@google.com,John,Doe
98765,coollady@yahoo.com,Mary,Poppins,,,
99999,soccermom@bing.net,Laura,Croft,blackbelt@karate.com,Chuck,Norris
Any help is greatly appreciated!

Use groupby() + cumcount() to track the position within each student_id group, then pivot():
df['s'] = df.groupby('student_id').cumcount() + 1   # 1 for the first guardian, 2 for the second
df = df.pivot(index='student_id', columns='s',
              values=['guardian_email', 'guardian_first_name', 'guardian_last_name'])
df.columns = [f"{x}_{y}" for x, y in df.columns]    # flatten the MultiIndex columns
df = df.sort_index(axis=1).reset_index()
OR
use groupby() + cumcount() to track the position, then unstack():
df = df.assign(s=df.groupby('student_id').cumcount() + 1).set_index(['student_id', 's']).unstack()
df.columns = [f"{x}_{y}" for x, y in df.columns]
df = df.sort_index(axis=1).reset_index()
Now if you print df you will get your expected output.
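For reference, a minimal self-contained sketch of the pivot approach on the question's sample data (newer pandas requires keyword arguments for pivot(); note the first guardian's columns come out suffixed as guardian_email_1 rather than matching the desired headers exactly):
import pandas as pd
from io import StringIO

csv_text = """student_id,guardian_email,guardian_first_name,guardian_last_name
12345,momemail@google.com,Jane,Doe
12345,dademail@google.com,John,Doe
98765,coollady@yahoo.com,Mary,Poppins
99999,soccermom@bing.net,Laura,Croft
99999,blackbelt@karate.com,Chuck,Norris"""

df = pd.read_csv(StringIO(csv_text))
df['s'] = df.groupby('student_id').cumcount() + 1   # guardian number per student
df = df.pivot(index='student_id', columns='s',
              values=['guardian_email', 'guardian_first_name', 'guardian_last_name'])
df.columns = [f"{x}_{y}" for x, y in df.columns]    # e.g. guardian_email_1, guardian_email_2
df = df.sort_index(axis=1).reset_index()
print(df)   # student 98765 gets NaN in the *_2 columns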
Update:
Try:
import pandas as pd

def guardianemailfinal():
    path = r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\guardian_email.csv'
    df = pd.read_csv(path, sep=',')
    df['s'] = df.groupby('student_id').cumcount() + 1
    df = df.pivot(index='student_id', columns='s',
                  values=['guardian_email', 'guardian_first_name', 'guardian_last_name'])
    df.columns = [f"{x}_{y}" for x, y in df.columns]
    df = df.sort_index(axis=1).reset_index()
    df.to_csv(r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\output.csv', index=False, sep=',')
    return df

# Finally, call the function:
df = guardianemailfinal()
Note: if you now print df you will see the modified dataframe, and if you check your path you will find the 'output.csv' file.


Change one element of column heading in CSV using Pandas

I have created a CSV file which looks like this:
RigName,Date,DrillingMiles,TrippingMiles,CasingMiles,LinerMiles,JarringMiles,TotalMiles,Comments
0,08 July 2021,19.21,63.05,43.16,45.41,8.52,0,"Tested all totals. Edge cases for multiple clicks.
"
1,09 July 2021,19.21,63.05,43.16,45.41,8.52,0,"Test entry#2.
"
I wish to change 'RigName' to something the user inputs. I have tried various ways of replacing the word 'RigName' with user input. One of them is this:
df = pd.read_csv('ton_miles_record.csv')
user_input = 'Rig805'
df.columns = df.columns.str.replace('RigName', user_input)
df.to_csv('new_csv.csv', header=True, index=False)
However, no matter what I do, the header in the CSV file always comes out as this:
Unnamed:0,Date,DrillingMiles,TrippingMiles,CasingMiles,LinerMiles,JarringMiles,TotalMiles,Comments
Why am I getting 'Unnamed: 0' instead of the user input value?
Also, is there a way to change 'RigName' to something else by referring to its position, so that I can make multiple positional changes in the future?
Zubin, you would need to change the column name by looking at the columns as a list. The code below should do the trick. The same code also shows how to access the column by position...
import pandas as pd
df = pd.read_csv('ton_miles_record.csv')
user_input = 'Rig805'
df.columns.values[0] = user_input
df.to_csv('new_csv.csv', header=True, index=False)
After 3 hours of trial and error (and a lot of searching in vain), I solved it by doing this:
df = pd.read_csv('ton_miles_record.csv')
user_input = 'SD555'
df.rename(columns={df.columns[1]: user_input}, inplace=True)
df.to_csv('new_csv.csv', index=False)
I hope this helps someone else struggling as I was.
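To rename several columns purely by position, a hedged sketch of the same rename idea (the positions and new names here are illustrative):
import pandas as pd

df = pd.read_csv('ton_miles_record.csv')
# map positions to new names, then rename via the current labels at those positions
position_renames = {0: 'Rig805', 1: 'RecordDate'}   # illustrative positions/names
df = df.rename(columns={df.columns[i]: name for i, name in position_renames.items()})
df.to_csv('new_csv.csv', index=False)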

Find maximum from an unindexed column of mixed integers and strings in Python

I have data without column headers, and I want to find the max and min latitude and longitude, which are at index[2] and index[4], in rows where index[3]=N and index[5]=E.
Example of the data is as follows:
1,5,6,9n,8,4,9,0
3,66,t,87,5,8
2,s,1.23,N,1.39,E
1,2,1.45,N,1.26,E,2N,9
7,-3,5,L,67,34,K,78,6,4
I have tried the following:
with open ("E:\\abc\xyz.txt", "r") as file1:
Lines = file1.readlines()
data = Lines
for line in Lines:
spline = line.split(",")
l1= []
l1.append(spline[2])
print(l1)
I want to get a combined list so that I can take max(). However, I am not able to figure it out.
The printed result is as follows:
['6']
['t']
['1.23']
['1.45']
['5']
Any help is appreciated.
Thank you all.
I ultimately found a solution for the question.
I used
pd.read_csv
and the max() of index[2] & [4]
Thank you once again
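For reference, a minimal sketch of that pandas approach (the path, the 10-field row width, and the N/E filtering are assumptions based on the question):
import pandas as pd

# read the headerless, ragged file; names=range(10) pads short rows with NaN
# (assumes no row has more than 10 fields, as in the sample)
df = pd.read_csv(r"E:\abc\xyz.txt", header=None, names=range(10))

# keep only rows that carry coordinates: 'N' at index 3 and 'E' at index 5
coords = df[(df[3] == "N") & (df[5] == "E")]

lat = pd.to_numeric(coords[2], errors="coerce")
lon = pd.to_numeric(coords[4], errors="coerce")
print(lat.max(), lat.min(), lon.max(), lon.min())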

How do I batch rename columns in pyspark efficiently?

I am trying to batch rename my columns in PySpark from:
'collect_list(Target_Met_1)[1]' --> 'AB11'
'collect_list(Target_Met_1)[2]' --> 'AB12'
'collect_list(Target_Met_2)[1]' --> 'AB21'
'collect_list(Target_Met_1)[150]' --> 'AB150'
How do I go about it programmatically? Right now, I can manually change the names using:
df.withColumnRenamed('collect_list(Target_Met_1)[1]', 'AB11')
But if I have 500 columns, that's not efficient. I realize that another way to rename them would be to use something like a UDF, but I cannot figure out the best possible approach.
I have split the columns and that's not the problem. The problem is around renaming the column.
Never mind, figured it out. Essentially I had to use a list comprehension to rename the columns. I was splitting the columns mentioned in the link above. Here is what did the trick:
df = df.select('1', '2', '3', *[df[col][i].alias("AB" + str(i + 1) + col) for col in columns for i in range(max_dict[col])])
To rename all columns you can use the method toDF:
import re
df.toDF(*['AB' + ''.join(re.findall(r'\d+', i)) for i in df.columns])
Something like this can help too. It's a rename function similar to the Pandas rename functionality.
def rename_cols(map_dict):
    """
    Rename a bunch of columns in a data frame
    :param map_dict: Dictionary of old column names to new column names
    :return: Function for use in transform
    """
    def _rename_cols(df):
        for old, new in map_dict.items():
            df = df.withColumnRenamed(old, new)
        return df
    return _rename_cols
And you can use it like
spark_df.transform(rename_cols(dict(old1='new1', old2='new2', old3='new3')))
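A minimal usage sketch with the rename_cols helper above (DataFrame.transform needs Spark 3.0+; the session setup and column names are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['old1', 'old2', 'old3'])

# transform() applies the inner _rename_cols function to the dataframe
renamed = df.transform(rename_cols({'old1': 'new1', 'old2': 'new2', 'old3': 'new3'}))
print(renamed.columns)   # ['new1', 'new2', 'new3']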

How to save tuple output from a for loop to a DataFrame in Python

I have some data, 33k rows x 57 columns.
Some columns contain data that I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back into my data set.
I have a problem with saving the tuple output from the for loop.
I am using tuples to create a good translation; .join and .append are not working in my case. I have tried many approaches without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're trying to write to is only a view of the dataframe, not the dataframe itself. (Wrapping the generator in square brackets to make it an explicit list comprehension doesn't hurt either.) So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is a faster way to do it?
I am doing it for 12 columns in one for loop, as shown above.
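One faster alternative, as a hedged sketch with placeholder data and dictionary: Series.apply avoids iterrows() and the string-concatenation round trip entirely:
import pandas as pd

slownik = {'a': 'x', 'b': 'y'}                      # placeholder translation dictionary
data = pd.DataFrame({'1st_service': ['ab', 'ba']})  # placeholder data

# translate every character through the dictionary, one column at a time;
# apply() on the Series is usually noticeably faster than row iteration
data['1st_service'] = data['1st_service'].apply(
    lambda s: tuple(slownik.get(znak) for znak in s)
)
print(data)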

Writing multiple columns into CSV files using Python

I am a Python beginner. I am trying to write multiple lists into separate columns in a CSV file.
In my CSV file, I would like to have
2.9732676520000001 0.0015852047556142669 1854.1560636319559
4.0732676520000002 0.61902245706737125 2540.1258143280334
4.4032676520000003 1.0 2745.9167395368572
Following is the code that I wrote.
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]
with open('output/' + file, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
This isn't running properly; I'm getting the error message "non-empty format string passed to object.__format__". I'm having a hard time catching what is going wrong. Could anyone spot what's going wrong with my code?
You are better off using pandas
import pandas as pd
df=[2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS=[1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak=[0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"
# pandas can convert a list of lists to a dataframe.
# each list is a row thus after constructing the dataframe
# transpose is applied to get to the user's desired output.
df = pd.DataFrame([df, int_peak, CS])
df = df.transpose()
# write the data to the specified output path: "output/" + file_name
# without adding the index of the dataframe to the output
# and without adding a header to the output.
# => these parameters are set to match the desired output.
df.to_csv("output/"+file_name, index=False, header=None)
The output CSV looks like this:
2.973268 0.001585 1854.156064
4.073268 0.619022 2540.125814
4.403268 1.000000 2745.916740
However, to fix your code, you need to use a file name variable other than file. I changed that in your code as follows:
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"
with open('/tmp/' + file_name, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
and it works. The output is as follows:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
If you need to write only a few selected columns to CSV, then you should use the columns option:
csv_data = df.to_csv(columns=['Name', 'ID'])
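For instance, a small sketch ('Name' and 'ID' are the placeholder column names from the snippet above):
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'b'], 'ID': [1, 2], 'Extra': [0, 1]})
# write only the selected columns; without a path argument,
# to_csv() returns the CSV text as a string instead of writing a file
df.to_csv('selected.csv', columns=['Name', 'ID'], index=False)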
