Pandas text file to CSV - python-3.x

Trying to output only certain columns of a .txt file to .csv
PANDAS documentation and this answer got me this far:
import pandas as pd
read_file = pd.read_csv (r'death.txt')
header = ['County', 'Crude Rate']
read_file.to_csv (r'death.csv', columns=header, index=None)
But I receive an error:
KeyError: "None of [Index(['County', 'Crude Rate'], dtype='object')] are in the [columns]"
This is confusing as the .txt file I'm using is the following for hundreds of rows (from a government database):
"Notes" "County" "County Code" Deaths Population Crude Rate
"Autauga County, AL" "01001" 7893 918492 859.3
"Baldwin County, AL" "01003" 30292 3102984 976.2
"Barbour County, AL" "01005" 5197 499262 1040.9
I notice the first three columns have titles enclosed in quotes, and the last three do not. I have experimented with including quotes in my columns sequence (e.g. ""County"") but no luck. Based upon the error, I realize there is some discrepancy between column titles as I have typed them and how they are read in this script.
Any help in understanding this discrepancy is appreciated.

You are reading the file with the default options, so pandas splits each line on commas:
read_file = pd.read_csv (r'death.txt')
Change it to
read_file = pd.read_csv (r'death.txt', sep="\t")
Then check the column names:
read_file.columns
Index(['Notes', 'County', 'County Code', 'Deaths', 'Population', 'Crude Rate'], dtype='object')
You should filter your columns first, and then save. If the columns look right:
read_file[['County', 'Crude Rate']].to_csv (r'death.csv', index=False)
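To see the effect of sep="\t" without the real file, here is a minimal sketch on an in-memory sample. It assumes the file is tab-delimited and that the "Notes" field is blank on data rows; the sample string and the dtype argument for "County Code" are my additions, not part of the original question.

```python
import io
import pandas as pd

# Assumed layout: tab-delimited, with an empty "Notes" field on each data row.
sample = (
    '"Notes"\t"County"\t"County Code"\tDeaths\tPopulation\tCrude Rate\n'
    '\t"Autauga County, AL"\t"01001"\t7893\t918492\t859.3\n'
    '\t"Baldwin County, AL"\t"01003"\t30292\t3102984\t976.2\n'
    '\t"Barbour County, AL"\t"01005"\t5197\t499262\t1040.9\n'
)

# sep="\t" splits on tabs; the quotes around header names are stripped on read.
# dtype=str for "County Code" keeps the leading zero in codes such as 01001.
df = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"County Code": str})
print(df.columns.tolist())

# Select the two columns first, then serialise them (to a string here, not a file).
subset = df[["County", "Crude Rate"]]
csv_out = subset.to_csv(index=False)
print(csv_out)
```

Once the column names print without quotes, the KeyError from the question should no longer occur, because 'County' and 'Crude Rate' match the parsed header exactly.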

Related

merging rows in csv where row[0] is a duplicate

I have a csv file that has student_id,guardian_email,guardian_first_name,guardian_last_name
In many cases, the student has a mom and dad with info, so the student has more than one row in the csv file.
For example the original csv would look like this:
student_id,guardian_email,guardian_first_name,guardian_last_name
12345,momemail#google.com,Jane,Doe
12345,dademail#google.com,John,Doe
98765,coollady#yahoo.com,Mary,Poppins
99999,soccermom#bing.net,Laura,Croft
99999,blackbelt#karate.com,Chuck,Norris
using python, I want it to output this:
student_id,guardian_email,guardian_first_name,guardian_last_name,guardian_email2,guardian_first_name2,guardian_last_name2
12345,momemail#google.com,Jane,Doe,dademail#google.com,John,Doe
98765,coollady#yahoo.com,Mary,Poppins,,,
99999,soccermom#bing.net,Laura,Croft,blackbelt#karate.com,Chuck,Norris
Any help is greatly appreciated!
Use groupby()+cumcount() to track each guardian's position within a student, and then pivot() (recent pandas versions require keyword arguments here):
df['s']=df.groupby('student_id').cumcount()+1
df=df.pivot(index='student_id',columns='s',values=['guardian_email','guardian_first_name','guardian_last_name'])
df.columns=[f"{x}_{y}" for x,y in df.columns]
df=df.sort_index(axis=1).reset_index()
OR
use groupby()+cumcount() to track position and then unstack():
df=df.assign(s=df.groupby('student_id').cumcount()+1).set_index(['student_id','s']).unstack()
df.columns=[f"{x}_{y}" for x,y in df.columns]
df=df.sort_index(axis=1).reset_index()
If you print df now, you get one row per student, with the guardian columns suffixed by position (guardian_email_1, guardian_email_2, and so on) and sorted alphabetically.
Update:
Try:
import pandas as pd
def guardianemailfinal():
    path=r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\guardian_email.csv'
    df=pd.read_csv(path,sep=',')
    # number each guardian within a student: 1, 2, ...
    df['s']=df.groupby('student_id').cumcount()+1
    df=df.pivot(index='student_id',columns='s',values=['guardian_email','guardian_first_name','guardian_last_name'])
    df.columns=[f"{x}_{y}" for x,y in df.columns]
    df=df.sort_index(axis=1).reset_index()
    df.to_csv(r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\output.csv',index=False,sep=',')
    return df
# Finally, call the function:
df=guardianemailfinal()
Note: printing df now shows the modified dataframe, and 'output.csv' is written to the path given in the function.
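If the exact column order and names from the question matter (guardian_email, ..., guardian_email2, ...), one possible variation is to reorder the pivoted columns explicitly. This is only a sketch on the sample rows from the question; the variable names (fields, ordered, wide) are mine, and the naming rule (no suffix for the first guardian) is inferred from the desired output shown above.

```python
import io
import pandas as pd

csv_text = """student_id,guardian_email,guardian_first_name,guardian_last_name
12345,momemail#google.com,Jane,Doe
12345,dademail#google.com,John,Doe
98765,coollady#yahoo.com,Mary,Poppins
99999,soccermom#bing.net,Laura,Croft
99999,blackbelt#karate.com,Chuck,Norris
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number the guardians within each student: 1, 2, ...
df["s"] = df.groupby("student_id").cumcount() + 1

# One row per student; columns become (field, guardian number) pairs.
wide = df.pivot(index="student_id", columns="s")

# Put all fields of guardian 1 first, then guardian 2, keeping the field order.
fields = ["guardian_email", "guardian_first_name", "guardian_last_name"]
n_max = int(df["s"].max())
ordered = [(f, k) for k in range(1, n_max + 1) for f in fields]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(ordered))

# Flatten the names: no suffix for the first guardian, "2", "3", ... afterwards.
wide.columns = [f if k == 1 else f"{f}{k}" for f, k in wide.columns]
wide = wide.reset_index()
print(wide.to_csv(index=False))
```

Students with a single guardian end up with empty cells in the second guardian's columns, which matches the 98765 row in the desired output.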

Change one element of column heading in CSV using Pandas

I have created a CSV file which looks like this:
RigName,Date,DrillingMiles,TrippingMiles,CasingMiles,LinerMiles,JarringMiles,TotalMiles,Comments
0,08 July 2021,19.21,63.05,43.16,45.41,8.52,0,"Tested all totals. Edge cases for multiple clicks.
"
1,09 July 2021,19.21,63.05,43.16,45.41,8.52,0,"Test entry#2.
"
I wish to change the 'RigName' to something the user inputs. I have tried various ways of changing the word 'RigName' to user input. One of them is this:
df= pd.read_csv('ton_miles_record.csv')
user_input = 'Rig805'
df.columns = df.columns.str.replace('RigName', user_input)
df.to_csv('new_csv.csv', header=True, index=False)
However no matter what I do, the result in the csv file always comes to this:
Unnamed:0,Date,DrillingMiles,TrippingMiles,CasingMiles,LinerMiles,JarringMiles,TotalMiles,Comments
Why am I getting 'Unnamed: 0' instead of the user input value?
Also, is there a way to change 'RigName' to something else by calling its position? To make multiple changes to any word in its position in future?
Zubin, you would need to change the column name by looking at the columns as a list. The code below should do the trick. The same code also shows how to access a column by position:
import pandas as pd
df= pd.read_csv('ton_miles_record.csv')
user_input = 'Rig805'
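df = df.rename(columns={df.columns[0]: user_input})  # rename by position; safer than writing into df.columns.values, since an Index is meant to be immutable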
df.to_csv('new_csv.csv', header=True, index=False)
After 3 hours of trial and error (and a lot of searching in vain), I solved it by doing this:
df= pd.read_csv('ton_miles_record.csv')
user_input = 'SD555'
df.rename(columns={ df.columns[1]: user_input}, inplace=True)
df.to_csv('new_csv.csv', index=False)
I hope this helps someone else struggling as I was.
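For anyone who wants to try renaming by position without touching a file, here is a small sketch. The three-column sample is a shortened, made-up version of the file in the question, and the position variable is my own name for the index of the column to rename.

```python
import io
import pandas as pd

# Shortened, hypothetical sample with the same first header cell as the question.
csv_text = "RigName,Date,TotalMiles\n0,08 July 2021,0\n1,09 July 2021,0\n"
df = pd.read_csv(io.StringIO(csv_text))

user_input = "Rig805"
position = 0  # which header cell to rename, counted from 0

# Look up the current name at that position and map it to the new one.
df = df.rename(columns={df.columns[position]: user_input})

out = df.to_csv(index=False)
print(out)
```

Changing position lets you rename any other header cell the same way, and passing several entries in the dictionary renames several columns at once.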

Problems with graphing excel data off an internet source with dates

This is my first post on Stack Overflow, and I'm pretty new to programming, especially Python. I'm in engineering and am learning Python to complement that going forward, mostly for math and graphing applications.
Basically my question is how do I download csv excel data off a source (in my case stock data from google), and plot only certain rows against the date. For myself I want the date against the close value.
Right now the error message I'm getting is timedata '5-Jul-17' does not match '%d-%m-%Y'
previously I was also getting tuple data does not match
The description of the opened csv data in excel is
7 columns (Date, Open, High, Low, Close, AdjClose, Volume), with the date formatted as 2017-05-30.
I'm sure there are other errors as well unfortunately
I would really be grateful for any help on this,
thank you in advance!
--edit--
Upon fiddling some more, I don't think names and dtypes are necessary; when I check the matrix dimensions without those identifiers I get (250L, 6L), which seems right. Now my main problem is converting the dates to something usable. My error now is that strptime only accepts strings, so I'm not sure what to use (see updated code below).
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
def graph_data(stock):
    # getting the data off google finance
    data = np.genfromtxt('urlgoeshere'+stock+'forthecsvdata', delimiter=',',
                         skip_header=1)
    # checking format of matrix
    print data.shape  # returns (250L, 6L)
    time_format = '%d-%m-%Y'
    # I only want the 1st column (dates) and 5th column (close), all rows
    date = data[:,0][:,]
    close = data[:,4][:,]
    dates = [datetime.strptime(date, time_format)]
    # plotting section
    plt.plot_date(dates,close, '-')
    plt.legend()
    plt.show()
graph_data('stockhere')
Assuming the dates in the csv file are in the format '5-Jul-17', the proper format string to use is %d-%b-%y.
In [6]: datetime.strptime('5-Jul-17','%d-%m-%Y')
ValueError: time data '5-Jul-17' does not match format '%d-%m-%Y'
In [7]: datetime.strptime('5-Jul-17','%d-%b-%y')
Out[7]: datetime.datetime(2017, 7, 5, 0, 0)
See the Python documentation on strptime() behavior.
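The "strptime only accepts strings" error in the question most likely comes from passing the whole column to strptime at once. A minimal sketch of converting one string at a time is below; the raw_dates list is made-up sample data, and %b assumes English month abbreviations (it depends on the locale).

```python
from datetime import datetime

# Hypothetical date strings in the same format as the error message.
raw_dates = ["5-Jul-17", "6-Jul-17", "7-Jul-17"]

# %d = day of month, %b = abbreviated month name, %y = two-digit year
time_format = "%d-%b-%y"

# strptime takes a single string, so convert element by element.
dates = [datetime.strptime(d, time_format) for d in raw_dates]
print(dates[0])
```

Note that np.genfromtxt reads every column as float by default, so a date column comes back as nan unless the dates are read as strings first.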

How to use Pandas to display specific columns from csv file?

I have a csv file with a number of columns in it. It is for students. I want to display only male students and their names. I used 1 for male students and 0 for female students. My code is:
import pandas as pd
data = pd.read_csv('normalizedDataset.csv')
results = pd.concat([data['name'], ['students']==1])
print results
I have got this error:
TypeError: cannot concatenate a non-NDFrame object
Can anyone help please. Thanks.
You can specify to read only certain column names of your data when you load your csv. Then use loc to locate all values where students equals 1.
data = pd.read_csv('normalizedDataset.csv', usecols=['name', 'students'])
data = data.loc[data.students == 1, :]
BTW, your original error is because you are trying to concatenate a dataframe with False.
>>> ['students']==1
False
No need to concat; you're stripping things away, not building.
Try:
data[data['students']==1]['name']
To provide clarity on why you were getting the error:
The second thing you were trying to concat was:
['students']==1
which is not an NDFrame object. You'd want to replace that with:
data[data['students']==1]['students']
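Both answers boil down to a boolean mask. Here is a small sketch on a made-up frame (the names and values are invented; only the column names 'name' and 'students' come from the question):

```python
import pandas as pd

# Invented sample data; 1 marks a male student, 0 a female student.
data = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar", "Lina"],
    "students": [1, 0, 1, 0],
})

# data["students"] == 1 is a boolean Series; .loc keeps the rows where it is True
# and the second argument picks the column to return.
male_names = data.loc[data["students"] == 1, "name"]
print(male_names.tolist())
```

Doing the row filter and the column selection in a single .loc call also avoids the chained indexing of data[...]['name'].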

Matplotlib: Import and plot multiple time series with legends direct from .csv

I have several spreadsheets containing data saved as comma delimited (.csv) files in the following format: The first row contains column labels as strings ('Time', 'Parameter_1'...). The first column of data is Time and each subsequent column contains the corresponding parameter data, as a float or integer.
I want to plot each parameter against Time on the same plot, with parameter legends which are derived directly from the first row of the .csv file.
My spreadsheets have different numbers of (columns of) parameters to be plotted against Time; so I'd like to find a generic solution which will also derive the number of columns directly from the .csv file.
The attached minimal working example shows what I'm trying to achieve using np.loadtxt (minus the legend); but I can't find a way to import the column labels from the .csv file to make the legends using this approach.
np.genfromtxt offers more functionality, but I'm not familiar with this and am struggling to find a way of using it to do the above.
Plotting data in this style from .csv files must be a common problem, but I've been unable to find a solution on the web. I'd be very grateful for your help & suggestions.
Many thanks
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('Data.csv', skiprows=1, delimiter=',') # skip the column labels
cols = data.shape[1] # get the number of columns in the array
for n in range (1,cols):
    plt.plot(data[:,0],data[:,n]) # plot each parameter against time
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
plt.show()
Here's my minimal working example for the above using genfromtxt rather than loadtxt, in case it is helpful for anyone else.
I'm sure there are more concise and elegant ways of doing this (I'm always happy to get constructive criticism on how to improve my coding), but it makes sense and works OK:
import numpy as np
import matplotlib.pyplot as plt
# dtype=None infers a type per column; because the header row is included, every column is read as strings
arr = np.genfromtxt('Data.csv', delimiter=',', dtype=None, encoding='utf-8')
names = arr[0] # select the first row of data = column names
values = arr[1:].astype(float) # the remaining rows, converted from strings to numbers
for n in range (1,len(names)): # plot each column in turn against column 0 (= time)
    plt.plot (values[:,0],values[:,n],label=names[n])
plt.legend()
plt.show()
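As a possible alternative, genfromtxt can take the labels from the header row itself via names=True, which gives a structured array with one named field per column. The sketch below uses the example data from the question as an in-memory string and leaves out the plotting calls; the series dictionary is just my way of showing the result.

```python
import io
import numpy as np

csv_text = """Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""

# names=True takes the field names from the first row; values default to float.
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

names = arr.dtype.names   # column labels, in file order
time = arr[names[0]]      # first column = Time
series = {name: arr[name] for name in names[1:]}

for name, values in series.items():
    print(name, values)
```

Each entry could then be drawn with plt.plot(time, values, label=name), so the number of columns and the legend labels both come straight from the file.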
The function numpy.genfromtxt is meant more for broken tables with missing values than for what you're trying to do. What you can do is simply open the file before handing it to numpy.loadtxt and read the first line yourself. Then you don't even need to skip it. Here is an edited version of your code that reads the labels and makes the legend:
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
# open the file
with open('Data.csv') as f:
    # read the names of the columns first
    names = f.readline().strip().split(',')
    # np.loadtxt can also handle already open files
    data = np.loadtxt(f, delimiter=',') # no skip needed anymore
cols = data.shape[1]
for n in range (1,cols):
    # labels go in here
    plt.plot(data[:,0],data[:,n],label=names[n])
plt.xlabel('Time',fontsize=14)
plt.ylabel('Parameter values',fontsize=14)
#And finally the legend is made
plt.legend()
plt.show()
