CSV manipulation problem. A little complex, and I'd like the solution not to use pandas - python-3.x

CSV file:
Acct,phn_1,phn_2,phn_3,Name,Consent,zipcode
1234,45678,78906,,abc,NCN,10010
3456,48678,,78976,def,NNC,10010
Problem:
Based on the Consent value, which has one letter per phone (in the 1st row, the 1st N is for phn_1, the C is for phn_2, and so on), I need to retain only the consented phn column and move the remaining phn columns to the end of the file.
The below is what I have. My approach isn't that great, I feel. I'm trying to get the index of the individual Ns and Cs and map each index to the corresponding phone, but I'm unable to iterate through the phn headers and compare them against the indices of the Ns and Cs.
with open('file.csv', 'rU') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            data.setdefault(header, list()).append(value)
# print(data)
Consent = data['Consent']
for i in range(len(Consent)):
    # print(list(Consent[i]))
    for idx, val in enumerate(list(Consent[i])):
        # print(idx, val)
        if val == 'C':
            # print("C")
            print(idx)
        else:
            print("N")
Could someone provide me with the solution for this?
Please Note: Do not want the solution to be by using pandas.

You’ll find my answer in the comments of the code below.
import csv
def parse_csv(file_name):
    """Reorder each row so the consented phone number comes first."""
    # Prepare the output. Note that all rows of a CSV file must have the same structure.
    # So it is actually not possible to put the phone numbers with no consent at the end
    # of the file, but what you can do is to put them at the end of the row.
    # To ensure that the structure is the same on all rows, you need to put all phone numbers
    # at the end of the row. That means the phone number with consent is duplicated, and that
    # is not very efficient.
    # I chose to put the result in a string, but you can use other types.
    output = "Acct,phn,Name,Consent,zipcode,phn_1,phn_2,phn_3\n"
    with open(file_name, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # Search the letter “C” in “Consent” and get the position of the first match.
            # Add one to the result because the “phn_×” keys are 1-based and not 0-based.
            first_c_pos = row["Consent"].find("C") + 1
            # If there is no “C”, then the “phn” key is empty.
            if first_c_pos == 0:
                row["phn"] = ""
            # If there is at least one “C”, create a key string that will take the values
            # phn_1, phn_2 or phn_3.
            else:
                key = f"phn_{first_c_pos}"
                row["phn"] = row[key]
            # Add the current row to the result string.
            output += ",".join([
                row["Acct"], row["phn"], row["Name"], row["Consent"],
                row["zipcode"], row["phn_1"], row["phn_2"], row["phn_3"]
            ])
            output += "\n"
    # Return the string.
    return output

if __name__ == "__main__":
    output = parse_csv("file.csv")
    print(output)
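If you'd rather write the reordered rows straight back to a CSV file than build one big string, the same logic works with csv.DictWriter. A sketch under the same assumptions as the answer above (the function name parse_csv_to_file and the destination file argument are made up here):

```python
import csv

def parse_csv_to_file(src, dst):
    # Same column layout as the string-building version above
    fieldnames = ["Acct", "phn", "Name", "Consent", "zipcode",
                  "phn_1", "phn_2", "phn_3"]
    with open(src, "r", newline="") as infile, \
         open(dst, "w", newline="") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # find() returns -1 when no "C" is present, so +1 maps "no consent" to 0
            first_c_pos = row["Consent"].find("C") + 1
            row["phn"] = row[f"phn_{first_c_pos}"] if first_c_pos else ""
            writer.writerow(row)
```

DictWriter also takes care of quoting, which the manual ",".join() approach would get wrong if a field ever contained a comma.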

How to get all but last element outside of loop in Python

I have two problems with this code.
I need to get the string list calibration_values out of the loop so I can use them elsewhere in the program.
I need all of the elements in the string list except the last one. I can't use pop because calibration_values is a string, not a list, and variable[:-1] truncates the last number by one digit instead of giving me all but the last element of the list. I thought maybe I needed to get the values out of the function for [:-1] to work, but that takes me back to issue #1.
What am I doing wrong here?
import csv

def Parse_CSV(file, string):
    global cal_string
    global calibration_values
    cal_string = []
    calibration_values = []
    with open(file, "r") as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        reader2 = csv.DictReader(csv_file)
        for row in reader:
            for column in row:
                if string in column:
                    # print("Found search string:", string, " in ", file, ". Writing Calibration String\n")
                    cal_string = row
                    calibration_values = cal_string[1]
                    print(calibration_values)
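For what it's worth, both issues usually go away if the function returns the matching row instead of setting globals, and if you slice the list rather than the string it contains. A minimal sketch (the parse_csv name and the sample row are made up for illustration):

```python
import csv

def parse_csv(path, search):
    """Return the first row that contains `search` in any column, else None."""
    with open(path, "r", newline="") as csv_file:
        for row in csv.reader(csv_file):
            if any(search in column for column in row):
                return row
    return None

# Slicing the *list* keeps whole elements;
# slicing one of its strings would drop a character instead.
row = ["CAL", "1.25", "2.50", "3.75"]
all_but_last = row[:-1]   # ['CAL', '1.25', '2.50']
```

Because the row is returned, the caller can slice it wherever it is needed, with no globals involved.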

Why are the values in my dictionary returning as a list within a list for each element?

I've got a file with an id and lineage info for species.
For example:
162,Bacteria,Spirochaetes,Treponemataceae,Spirochaetia,Treponema
174,Bacteria,Spirochaetes,Leptospiraceae,Spirochaetia,Leptospira
192,Bacteria,Proteobacteria,Azospirillaceae,Alphaproteobacteria,Azospirillum
195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
199,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
201,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
2829358,,,,,
2806529,Eukaryota,Nematoda,,,
I'm writing a script where I need to get the counts for each lineage depending on user input (i.e. if genus, then I would be looking at the last word in each line such as Treponema, if class, then the fourth, etc).
I'll later need to convert the counts into a dataframe but first I am trying to turn this file of lineage info into a dictionary where depending on user input, that lineage info (i.e. let's say genus) is the key, and the id is the value. This is because there can be multiple ids that match to the same lineage info such as ids 195, 197, 199, 201 would all return a hit for Campylobacter.
Here is my code:
def create_dicts(filename):
    '''Transforms the unique_taxids_lineage_allsamples file into a dictionary.
    Note: There can be multiple ids mapping to the same lineage info. Therefore ids should be values.'''
    # Creating a genus_dict
    unique_ids_dict = {}  # the main dict to return
    phylum_dict = 2       # third item in line
    family_dict = 3       # fourth item in line
    class_dict = 4        # fifth item in line
    genus_dict = 5        # sixth item in line
    type_of_dict = input("What type of dict? 2=phylum, 3=family, 4=class, 5=genus\n")
    with open(filename, 'r') as f:
        content = f.readlines()
        for line in content:
            key = line.split(",")[int(type_of_dict)].strip("\n")  # lineage info
            value = line.split(",")[0].split("\n")  # the id, there can be multiple mapping to the same key
            if key in unique_ids_dict:  # if the lineage info is already a key
                unique_ids_dict[key].append(value)
            else:
                unique_ids_dict[key] = value
    return unique_ids_dict
I had to add the .split("\n") at the end of value because I kept getting an error saying a str object doesn't have the attribute append.
I am trying to get a dictionary like the following if the user input was 5 for genus:
unique_ids_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', '197', '199', '201'], '': ['2829358', '2806529']}
But instead I am getting the following:
unique_ids_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', ['197'], ['199'], ['201']], '': ['2829358', ['2806529']]} ##missing str "NONE" haven't figured out how to convert empty strings to say "NONE"
Also, if anyone knows how to convert all empty hits into "NONE" or something of the following that would be great. This is sort of a secondary question so if needed I can open this as a separate question.
Thank you!
SOLVED ~~~~
Needed to use extend instead of append.
To change the empty-string key into a named one, I used dict.pop after my if statement:
unique_ids_dict["NONE"] = unique_ids_dict.pop("")
Thank you!
def create_dicts(filename):
    '''Transforms the unique_taxids_lineage_allsamples file into a dictionary.
    Note: There can be multiple ids mapping to the same lineage info. Therefore ids should be values.'''
    # Creating a genus_dict
    unique_ids_dict = {}  # the main dict to return
    phylum_dict = 2       # third item in line
    family_dict = 3       # fourth item in line
    class_dict = 4        # fifth item in line
    genus_dict = 5        # sixth item in line
    type_of_dict = input("What type of dict? 2=phylum, 3=family, 4=class, 5=genus\n")
    with open(filename, 'r') as f:
        content = f.readlines()
        for line in content:
            key = line.split(",")[int(type_of_dict)].strip("\n")  # lineage info
            value = line.split(",")[0].split("\n")  # the id, there can be multiple mapping to the same key
            if key in unique_ids_dict:  # if the lineage info is already a key, extend its list
                unique_ids_dict[key].extend(value)
            else:
                unique_ids_dict[key] = value
    return unique_ids_dict
This worked for me. Using extend on list not append.
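The difference is easy to see in isolation: append adds its argument as a single element (nesting the list), while extend adds each element of its argument individually.

```python
ids = ['195']
ids.append(['197'])   # the whole list becomes one element
print(ids)            # ['195', ['197']]

ids = ['195']
ids.extend(['197'])   # each element is merged in
print(ids)            # ['195', '197']
```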
I suggest working with Pandas; it's much simpler, and it also lets you assign header names:
import pandas as pd

def create_dicts(filename):
    """
    Transforms the unique_taxids_lineage_allsamples file into a
    dictionary.
    Note: There can be multiple ids mapping to the same lineage info.
    Therefore ids should be values.
    """
    # Reading File:
    content = pd.read_csv(
        filename,
        names=("ID", "Kingdom", "Phylum", "Family", "Class", "Genus")
    )
    # Printing input and choosing clade to work with:
    print("\nWhat type of dict?")
    print("- Phylum")
    print("- Family")
    print("- Class")
    print("- Genus")
    clade = input("> ").capitalize()
    # Replacing empty values with string 'None':
    content = content.where(pd.notnull(content), "None")
    # Selecting columns and aggregating according to the chosen
    # clade and ID:
    series = content.groupby(clade)["ID"].unique()
    # Creating dict:
    content_dict = series.to_dict()
    # If you do not want to work with Numpy arrays, just create
    # another dict of lists:
    content_dict = {k: list(v) for k, v in content_dict.items()}
    return content_dict

if __name__ == "__main__":
    d = create_dicts("temp.csv")
    print(d)
temp.csv:
162,Bacteria,Spirochaetes,Treponemataceae,Spirochaetia,Treponema
174,Bacteria,Spirochaetes,Leptospiraceae,Spirochaetia,Leptospira
192,Bacteria,Proteobacteria,Azospirillaceae,Alphaproteobacteria,Azospirillum
195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
199,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
201,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
829358,,,,,
2806529,Eukaryota,Nematoda,,,
I hope this is what you wanted to do.

I read a line on a csv file and want to know the item number of a word

The header line in my csv file is:
Number,Name,Type,Manufacturer,Material,Process,Thickness (mil),Weight (oz),Dk,Orientation,Pullback distance (mil),Description
I can open it and read the line, with no problems:
infile = open('CS_Data/_AD_LayersTest.csv','r')
csv_reader = csv.reader(infile, delimiter=',')
for row in csv_reader:
But I want to find out what the item number is for the "Dk".
The problem is that not only can the items be in any order as decided by the user in a different application. There can also be up to 25 items in the line.
How do I quickly determine which item is "Dk", so I can write Dk = (row[i]) for it and extract it from all the data after the header?
I have tried this below on each of the potential 25 items and it works, but it seems like a waste of time, energy and my OCD.
while True:
    try:
        if (row[0]) == "Dk":
            DkColumn = 0
            break
        elif (row[1]) == "Dk":
            DkColumn = 1
            break
        ...
        elif (row[24]) == "Dk":
            DkColumn = 24
            break
        else:
            f.write('Stackup needs a "Dk" column.')
            break
    except:
        print("Exception occurred")
        break
Can't you get the index of the column (using list.index()) that has the value Dk in it? Something like:
infile = open('CS_Data/_AD_LayersTest.csv', 'r')
csv_reader = csv.reader(infile, delimiter=',')

# Store the header
headers = next(csv_reader, None)

# Get the index of the 'Dk' column
dkColumnIndex = headers.index('Dk')

for row in csv_reader:
    # Access values that belong to the 'Dk' column
    rowDkValue = row[dkColumnIndex]
    print(rowDkValue)
In the code above, we store the first line of the CSV as a list in headers. We then search the list to find the index of the item that has the value 'Dk'. That will be the column index.
Once we have that column index, we can then use it in each row to access the particular index, which will correspond to the column which Dk is the header of.
Use the pandas library to keep your column order and get access to each column by name:
row["column_name"]
import pandas as pd

df = pd.read_csv(
    "CS_Data/_AD_LayersTest.csv",
    usecols=["Number", "Name", "Type", ...])
for index, row in df.iterrows():
    # do something
If I understand your question correctly, and you're not interested in using pandas (as suggested by Mikey - you should really consider his suggestion, however), you should be able to do something like the following:
with open('CS_Data/_AD_LayersTest.csv', 'r') as infile:
    csv_reader = csv.reader(infile, delimiter=',')
    header = next(csv_reader)
    col_map = {col_name: idx for idx, col_name in enumerate(header)}
    for row in csv_reader:
        row_dk = row[col_map['Dk']]
One solution would be to use pandas.
import pandas as pd
df=pd.read_csv('CS_Data/_AD_LayersTest.csv')
Now you can access 'Dk' easily as long as the file is read correctly.
dk=df['Dk']
and you can access individual values of dk like
for i in range(0, 10):
    temp_var = df.loc[i, 'Dk']
or however you want to access those indexes.

Python: How to sum a column in a CSV file while skipping the header row

Trying to sum a column in a csv file that has a header row at the top. I'm trying to use this for loop, but it just returns zero. Any thoughts?
CSVFile = open('Data103.csv')
CSVReader = csv.reader(CSVFile)  # you don't pass a file name directly to csv.reader
CSVDataList = list(CSVReader)    # stores the csv file as a list of lists
print(CSVDataList[0][16])

total = 0
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
print(total)
Here is what the data sample looks like in txt:
Value,Value,Value, "15,500.05", 00.00, 00.00
So the items are delimited by commas, except where a value itself contains a comma, in which case the value is wrapped in double quotes. It's a pretty standard file with a header row and about 1k lines of data across 18 columns.
You might want to use Pandas.
import pandas as pd
df = pd.read_csv('/path/to/file.csv')
column_sum = df['column_name'].sum()
It seems that you've over-indented the line that does the sum. It should be like this:
for row in CSVReader:
    if CSVReader.line_num == 1:
        continue
    total += int(row[16])
Otherwise you'll only sum the values for the first row, which is exactly the one you want to skip.
EDIT:
Since you said the previous change doesn't work, I'd suggest working with the excellent Python lib called rows.
With the following CSV (fruits.csv):
id,name,amount
1,apple,3
2,banana,6
3,pineapple,2
4,lemon,5
You can access columns directly by their name instead of their index:
import rows

data = rows.import_from_csv('fruits.csv')
for fruit_data in data:
    print(fruit_data.name, fruit_data.amount)

# output:
# apple 3
# banana 6
# pineapple 2
# lemon 5
NEW EDIT:
After you've provided the data, I believe in your case you could do something like:
import rows

data = rows.import_from_csv('Data103.csv')
print(data.field_names[16])  # prints the field name

total = 0
for row in data:
    value = row.<column_name>
    value = value.replace(',', '')  # remove commas
    total += float(value)
print(total)
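For completeness, the standard csv module alone can do the same sum: skip the header with next() and strip the embedded commas before converting. A sketch (the function name sum_column is made up; column index 16 and values like "15,500.05" are assumed from the question):

```python
import csv

def sum_column(path, index):
    """Sum the numeric values in one column, skipping the header row."""
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            # Quoted values such as "15,500.05" contain commas,
            # so strip them before converting to float.
            total += float(row[index].replace(",", "").strip())
    return total
```

Note that csv.reader already handles the double-quoted fields, so each quoted value arrives as a single string with its internal commas intact.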

How to loop through csv and be able to go back a step or proceed line by line?

I'm trying to look up a time for a user. Let's say they input 13 (minutes); my code scrolls through the csv, finds each row that has 13 in the time column, and prints the rows one at a time. I don't know how to give the user the option of revisiting a previous step. My code currently just reverses the order of the whole csv and starts from the bottom, even over rows that are not the selected 13-minute rows.
I'm a total newbie so please try to explain as simple as possible.. Thanks
Please see code:
import csv
import pandas as pd

def time():
    while True:
        find = input("Please enter a time in minutes(rounded)\n"
                     "> ")
        if len(find) < 1:
            continue
        else:
            break
    print("Matches will appear below\n"
          "If no matches were made\n"
          "You will return back to the previous menu.\n"
          "")
    count = -1
    with open("work_log1.csv", 'r') as fp:
        reader = csv.DictReader(fp, delimiter=',')
        for row in reader:
            count += 1
            if find == row["time"]:
                for key, value in row.items():  # This part iterates through csv dict with no problems
                    print(key, ':', value)
                changee = input("make change? ")
                if changee == "back":  # ISSUE HERE******
                    # I'm trying to use this to reverse the order of what's been printed.
                    # Unfortunately it doesn't start from the last item, it starts from
                    # the bottom of the csv. Once I find out how to iterate through it
                    # properly, then I can apply my changes.
                    for row in reversed(list(reader)):
                        for key, value in row.items():
                            print(key, ':', value)
                make_change = input("make change? or go back")
                if make_change == 'y':
                    new_change = input("input new data: ")
                    fp = pd.read_csv("work_log1.csv")
                    fp.set_value(count, "time", new_change)  # This part makes the changes to the row I'm currently on
                    fp.to_csv("work_log1.csv", index=False)
    print("")
You can always keep a list of the last n lines as a history so you can step back through it: after reading a new line, drop the oldest entry with history.pop(0) and add the new one with history.append(last_line).
Alternatively, you can wrap this logic around the file object's seek() function.
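That history idea can be sketched with collections.deque, whose maxlen bound drops the oldest entries automatically (the visit and go_back names are made up, and the rows are illustrative):

```python
from collections import deque

history = deque(maxlen=5)  # keep at most the 5 most recent rows

def visit(row):
    """Record each row as it is read so we can step back later."""
    history.append(row)

def go_back():
    """Return the previously visited row, or None if there is none."""
    if len(history) >= 2:
        history.pop()        # drop the current row
        return history[-1]   # the previous one is now last
    return None

for row in [{"time": "13"}, {"time": "7"}, {"time": "13"}]:
    visit(row)

print(go_back())  # {'time': '7'}
```

A deque with maxlen bounds memory use, so even a very long csv only ever keeps the last few rows around for the "back" option.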
