Specific Fields Python3 - python-3.x

I am trying to select specific fields from my Qdata.txt file and use fields[2] to calculate the average for each year separately. My code gives only the total average.
The data file looks like this (the first day of each year is 0101 and the last is 1231):
Date 3700300 6701500
20000101 21.00 223.00
20000102 20.00 218.00
. .
20001231 7.40 104.00
20010101 6.70 104.00
. .
20130101 8.37 111.63
. .
20131231 45.00 120.98
import sys
td=open("Qdata.txt","r") # open file Qdata
total=0
count=0
row1=True
for row in td :
    if (row1) :
        row1=False # row1 is for topic
    else:
        fields=row.split()
        try:
            total=total+float(fields[2])
            count=count+1
        # Errors.
        except IndexError:
            continue
        except ValueError:
            print("File is incorrect.")
            sys.exit()
print("Average in 2000 was: ",total/count)

You could use itertools.groupby using the first four characters as the key for grouping.
import itertools

with open("data.txt") as f:
    next(f) # skip first line
    groups = itertools.groupby(f, key=lambda s: s[:4])
    for k, g in groups:
        print(k, [s.split() for s in g])
This gives you the entries grouped by year, for further processing.
Output for your example data:
2000 [['20000101', '21.00', '223.00'], ['20000102', '20.00', '218.00'], ['20001231', '7.40', '104.00']]
2001 [['20010101', '6.70', '104.00']]
2013 [['20130101', '8.37', '111.63'], ['20131231', '45.00', '120.98']]
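To get from these groups to the per-year averages the question asks for, something like the following might work. It is a minimal sketch, not a tested solution for the real file: yearly_averages and its column parameter are names made up for this illustration, the sample rows are copied from the question, and it assumes the header line has already been skipped and that the rows are sorted by date (groupby only groups adjacent items).

```python
import itertools

def yearly_averages(lines, column=2):
    """Return {year: mean of the chosen column}; assumes lines are sorted by date."""
    # Ignore blank lines and split the rest into fields.
    rows = (line.split() for line in lines if line.strip())
    averages = {}
    # The first four characters of the date field are the year.
    for year, group in itertools.groupby(rows, key=lambda fields: fields[0][:4]):
        values = [float(fields[column]) for fields in group]
        averages[year] = sum(values) / len(values)
    return averages

# Sample rows taken from the question (header line already removed).
sample = [
    "20000101 21.00 223.00",
    "20000102 20.00 218.00",
    "20001231 7.40 104.00",
    "20010101 6.70 104.00",
]

for year, average in yearly_averages(sample).items():
    print("Average in {} was: {:.2f}".format(year, average))
```

With a real file you would pass the open file object instead of the sample list, after calling next(f) to skip the header.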

You could create a dict (or even a defaultdict) for total and count instead:
import sys
from collections import defaultdict
td=open("Qdata.txt","r") # open file Qdata
total=defaultdict(float)
count=defaultdict(int)
row1=True
for row in td :
    if (row1) :
        row1=False # row1 is for topic
    else:
        fields=row.split()
        try:
            year = int(fields[0][:4])
            total[year] += float(fields[2])
            count[year] += 1
        # Errors.
        except IndexError:
            continue
        except ValueError:
            print("File is incorrect.")
            sys.exit()
print("Average in 2000 was: ",total[2000]/count[2000])

Every year separately? You have to divide your input into groups. Something like this might be what you want:
from collections import defaultdict

td = open("Qdata.txt", "r")  # same file as in the question
row1 = True
year_sums = defaultdict(list)
for row in td:
    if row1:
        row1 = False  # skip the header row
        continue
    fields = row.split()
    if len(fields) < 3:
        continue  # skip blank or incomplete lines
    year = fields[0][:4]
    year_sums[year].append(float(fields[2]))
td.close()
for year in year_sums:
    average = sum(year_sums[year]) / len(year_sums[year])
    print("Average in {} was: {}".format(year, average))
This is just example code, so treat it as a starting point rather than a tested solution, but it should give you an idea of what you can do. year_sums is a defaultdict containing lists of values grouped by year. You can then use it for other statistics if you want.

Related

How to sum specific values from two different txt files in python

I have 2 txt files with names and scores. For example:
File 1            File 2            Desired Output
Name     Score    Name     Score    Name     Score
Michael  20       Michael  30       Michael  50
Adrian   40       Adrian   50       Adrian   90
Jane     60                         Jane     60
I want to sum the scores that belong to the same name and print them. I tried to pair names and scores in two different dictionaries and then merge the dictionaries. However, I can't keep the same names with different scores, so I'm stuck here. I've written something like the following:
d1=dict()
d2=dict()
with open('data1.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d1[test[i]] = test[i + 1]
        i += 2
    del d1['Name']
with open('data2.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d2[test[i]] = test[i + 1]
        i += 2
    del d2['Name']
z = dict(d2.items() | d1.items())
Using a dictionary comprehension should get you what you are after. I have assumed the contents of the files are:
File1.txt:
Name Score
Michael 20
Adrian 40
Jane 60
File2.txt:
Name Score
Michael 30
Adrian 50
Then you can get a total as:
with open("file1.txt", "r") as file_in:
    next(file_in) # skip header
    file1_data = dict(row.split() for row in file_in if row.strip())  # ignore blank lines
with open("file2.txt", "r") as file_in:
    next(file_in) # skip header
    file2_data = dict(row.split() for row in file_in if row.strip())  # ignore blank lines
result = {
    key: int(file1_data.get(key, 0)) + int(file2_data.get(key, 0))
    for key
    in set(file1_data).union(file2_data) # could also use file1_data.keys()
}
print(result)
This should give you a result like:
{'Michael': 50, 'Jane': 60, 'Adrian': 90}
Use a defaultdict:
from collections import defaultdict
name_scores = defaultdict(int)
files = ('data1.txt', 'data2.txt')
for file in files:
    with open(file, 'r') as f:
        next(f, None)  # skip the header line
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            name, score = line.split()
            name_scores[name] += int(score)
edit: The loop skips the header line and any blank lines, and str.split() already discards trailing whitespace. Adjust this if your files are laid out differently.
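Another way to express the same idea is collections.Counter, whose + operator adds the values of matching keys. The sketch below is only an illustration: read_scores is a hypothetical helper, and the two lists stand in for the lines of the two files from the question. It assumes every data line is exactly "Name Score" with a non-negative integer score, which is also how the header line gets skipped.

```python
from collections import Counter

def read_scores(lines):
    """Parse 'Name Score' lines into a Counter, skipping the header and blank lines."""
    scores = Counter()
    for line in lines:
        parts = line.split()
        # The header ("Name Score") fails the isdigit() check and is ignored.
        if len(parts) == 2 and parts[1].isdigit():
            scores[parts[0]] += int(parts[1])
    return scores

# Stand-ins for the contents of the two files in the question.
file1 = ["Name Score", "Michael 20", "Adrian 40", "Jane 60"]
file2 = ["Name Score", "Michael 30", "Adrian 50"]

total = read_scores(file1) + read_scores(file2)
print(dict(total))
```

One design caveat: adding Counters drops any key whose combined value is zero or negative, so this only fits data where scores are positive.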

Python: How to obtain desired list?

I'm trying to learn Spark so I'm totally new to it.
I have a file with thousands of lines where each one is structured like:
LFPG;EDDW;00;E170;370;LFPG;EDDW;189930555;150907;1826;!!!!;AFR1724;AFR;0;AFR1724-LFPG-EDDW-20150907180000;N;0;;;245382;;;150907;1800;0;;X;;;;;;;;;;370;;0;20150907175700;AA45458743;;;;;NEXE;NEXE;;;;;20150907180000;;;;245382;;;;;;;;;;;;;;;;;;;;;;;;;;;;AFR;;;;;;;;;;;0
The above line represents flight information from an airplane, it took off from LFPG (1st element) and landed in EDDW (2nd element), the rest of the information is not relevant for the purpose.
I'd like to print or save in a file the top ten busiest airports based on the total number of aircraft movements, that is, airplanes that took off or landed in an airport.
So in a sense, the desired output would be:
AIRPORT_NAME #TOTAL_MOVEMENTS #TAKE-OFFs #LANDINGS
I have already implemented this program in Python and would like to convert it to the MapReduce paradigm using Spark.
# Libraries
import sys
from collections import Counter
import collections
from itertools import chain
from collections import defaultdict
# START
# Defining default program argument
if len(sys.argv)==1:
    fileName = "airports.exp2"
else:
    fileName = sys.argv[1]
takeOffAirport = []
landingAirport = []
# Reading file
lines = 0 # Counter for file lines
try:
    with open(fileName) as file:
        for line in file:
            words = line.split(';')
            # Relevant data, item1 and item2 from each file line
            origin = words[0]
            destination = words[1]
            # Populating lists
            landingAirport.append(destination)
            takeOffAirport.append(origin)
except IOError:
    print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
airports_dict = defaultdict(list)
# Merge lists into a dictionary key:value
for key, value in chain(Counter(takeOffAirport).items(),
                        Counter(landingAirport).items()):
    # 'AIRPOT_NAME':[num_takeOffs, num_landings]
    airports_dict[key].append(value)
# Sum key values and add it as another value
for key, value in airports_dict.items():
    #'AIRPOT_NAME':[num_totalMovements, num_takeOffs, num_landings]
    airports_dict[key] = [sum(value),value]
# Sort dictionary by the top 10 total movements
airports_dict = sorted(airports_dict.items(),
                       key=lambda kv:kv[1], reverse=True)[:10]
airports_dict = collections.OrderedDict(airports_dict)
# Print results
print("\nAIRPORT"+ "\t\t#TOTAL_MOVEMENTS"+ "\t#TAKEOFFS"+ "\t#LANDINGS")
for k in airports_dict:
    print(k,"\t\t", airports_dict[k][0],
          "\t\t\t", airports_dict[k][1][1],
          "\t\t", airports_dict[k][1][0])
A test file can be downloaded from: https://srv-file7.gofile.io/download/YCnWxr/traffic1day.exp2
So far I've been able to get the first and second elements from the file, but I don't quite know how to implement the filter or reduce step to count how often each airport appears in each list, and then merge both lists into records that hold the airport name, the sum of take-offs and landings, and the individual numbers of take-offs and landings.
from pyspark import SparkContext, SparkConf
if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[*]")
    sc = SparkContext(conf = conf)
    airports = sc.textFile("traffic1hour.exp2", minPartitions=4)
    airports = airports.map(lambda line : line.split('\n'))
    takeOff_airports = airports.map(lambda sub: (sub[0].split(';')[0]))
    landing_airports = airports.map(lambda sub: (sub[0].split(';')[1]))
    takeOff_airports.saveAsTextFile("takeOff_airports.txt")
    landing_airports.saveAsTextFile("landing_airport.txt")
Any hint or guidance would be much appreciated.
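One way to think about the map and reduce steps, sketched here in plain Python rather than Spark so it can be run without a cluster: emit two records per flight line, one for the origin and one for the destination, each carrying a (take-offs, landings) pair, then add the pairs per airport. The function names to_movements, add_counts and top_airports are invented for this illustration, and the sample lines are shortened, made-up flights. In PySpark the first two functions would presumably be passed to the RDD methods flatMap and reduceByKey, followed by something like takeOrdered for the top ten, but that wiring is an untested assumption.

```python
def to_movements(line):
    """Map step: one flight line -> two (airport, (takeoffs, landings)) records."""
    fields = line.split(';')
    origin, destination = fields[0], fields[1]
    return [(origin, (1, 0)), (destination, (0, 1))]

def add_counts(left, right):
    """Reduce step: add two (takeoffs, landings) pairs element-wise."""
    return (left[0] + right[0], left[1] + right[1])

def top_airports(lines, n=10):
    """Return up to n rows of (airport, total, takeoffs, landings), busiest first."""
    totals = {}
    for line in lines:
        for airport, counts in to_movements(line):
            totals[airport] = add_counts(totals.get(airport, (0, 0)), counts)
    rows = [(airport, t + l, t, l) for airport, (t, l) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]

# Made-up, shortened flight lines in the same semicolon-separated layout.
sample = ["LFPG;EDDW;00", "EDDW;LFPG;00", "LFPG;EGLL;00"]

print("AIRPORT\t#TOTAL_MOVEMENTS\t#TAKEOFFS\t#LANDINGS")
for airport, total, takeoffs, landings in top_airports(sample):
    print(airport, total, takeoffs, landings, sep="\t")
```

Keeping both counts in one pair means a single reduce pass yields take-offs, landings and their sum, so there is no need to build two separate lists and merge them afterwards.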

pd.rename key KeyError: 'New_Name'

Edit 12/07/19: The problem was in fact not with the pd.rename function but with the fact that I did not return the pandas DataFrame from the function, and as a result the column change did not exist when printing, i.e.
def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)
    return as_pandas  # <- this was missing
Please see the user comment below and upvote them for finding this error for me.
Alternatively, you can continue reading.
The data can be downloaded from this link, but I have also added a sample dataset below. The file is not formatted as a typical CSV file; I believe it may have been an assessment piece related to the Hidden Decision Tree article. I have included the portion of the code that deals with the file format mentioned above and lets the user rename the column.
The problem occurred when I tried to create a renaming function:
def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)
However, it seems to work when I set the column names inside the rename function:
def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)
    return as_pandas
Sample Dataset
Title URL Date Unique Pageviews
oupUrl=tutorials 18-Apr-15 5608
"An Exclusive Interview with Data Expert, John Bottega" http://www.datasciencecentral.com/forum/topics/an-exclusive-interview-with-data-expert-john-bottega?groupUrl=announcements 10-Jun-14 360
Announcing Composable Analytics http://www.datasciencecentral.com/forum/topics/announcing-composable-analytics 15-Jun-14 367
Announcing the release of Spark 1.5 http://www.datasciencecentral.com/forum/topics/announcing-the-release-of-spark-1-5 12-Sep-15 156
Are Extreme Weather Events More Frequent? The Data Science Answer http://www.datasciencecentral.com/forum/topics/are-extreme-weather-events-more-frequent-the-data-science-answer 5-Oct-15 204
Are you interested in joining the University of California for an empiricalstudy on 'Big Data'? http://www.datasciencecentral.com/forum/topics/are-you-interested-in-joining-the-university-of-california-for-an 7-Feb-13 204
Are you smart enough to work at Google? http://www.datasciencecentral.com/forum/topics/are-you-smart-enough-to-work-at-google 11-Oct-15 3625
"As a software engineer, what's the best skill set to have for the next 5-10years?" http://www.datasciencecentral.com/forum/topics/as-a-software-engineer-what-s-the-best-skill-set-to-have-for-the- 12-Feb-16 2815
A Statistician's View on Big Data and Data Science (Updated) http://www.datasciencecentral.com/forum/topics/a-statistician-s-view-on-big-data-and-data-science-updated-1 21-May-14 163
A synthetic variance designed for Hadoop and big data http://www.datasciencecentral.com/forum/topics/a-synthetic-variance-designed-for-hadoop-and-big-data?groupUrl=research 26-May-14 575
A Tough Calculus Question http://www.datasciencecentral.com/forum/topics/a-tough-calculus-question 10-Feb-16 937
Attribution Modeling: Key Analytical Strategy to Boost Marketing ROI http://www.datasciencecentral.com/forum/topics/attribution-modeling-key-concept 24-Oct-15 937
Audience expansion http://www.datasciencecentral.com/forum/topics/audience-expansion 6-May-13 223
Automatic use of insights http://www.datasciencecentral.com/forum/topics/automatic-use-of-insights 27-Aug-15 122
Average length of dissertations by higher education discipline. http://www.datasciencecentral.com/forum/topics/average-length-of-dissertations-by-higher-education-discipline 4-Jun-15 1303
This is the full code that produces the Key Error:
def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)
def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)
def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'},
                     inplace=True)
def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader
# Get each column of data including the heading and separate each element
# i.e. Title, URL, Date, Page Views
# and save to string_of_rows with comma separator for storage as a csv
# file.
def get_columns_of_data(*args):
    # Function that accept variable length arguments
    string_of_rows = str()
    num_cols = len(args)
    try:
        if num_cols > 0:
            for number, element in enumerate(args):
                if number == (num_cols - 1):
                    string_of_rows = string_of_rows + element + '\n'
                else:
                    string_of_rows = string_of_rows + element + ','
    except UnboundLocalError:
        print('Empty file \'or\' No arguments received, cannot be zero')
    return string_of_rows
def open_file(file_name):
    try:
        with open(file_name) as csv_file_in, open('HDT_data5.txt', 'w') as csv_file_out:
            csv_read = csv.reader(csv_file_in, delimiter='\t')
            for row in csv_read:
                try:
                    row[0] = row[0].replace(',', '')
                    csv_file_out.write(get_columns_of_data(*row))
                except TypeError:
                    continue
            print("The file name '{}' was successfully opened and read".format(file_name))
    except IOError:
        print('File not found \'OR\' Not in current directory\n')
# All acronyms used in variable naming correspond to the function at time
# of return from function.
# csv_list being a list of the v file contents the remainder i.e. 'st' of
# csv_list_st = split_title().
def main():
    open_file('HDTdata3.txt')
    multi_sets = open_as_dataframe('HDT_data5.txt')
    # change_column_names(multi_sets)
    change_column_names(multi_set, 'Old_Name', 'New_Name')
    print(multi_sets)
main()
I cleaned up your code so it would run. You were changing the column names but not returning the result. Try the following:
import pandas as pd
import numpy as np
import math
def set_new_columns(as_pandas):
    titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
                   'Machine_Learning', 'Data_Science', 'Data',
                   'Analytics']
    for number, word in enumerate(titles_list):
        as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)
def title_length(as_pandas):
    # Insert new column header then count the number of letters in 'Title'
    as_pandas.insert(len(as_pandas.columns), 'Title_Length', 0)
    as_pandas['Title_Length'] = as_pandas['Title'].map(str).apply(len)
# Although it is log, percentage of change is inverse linear comparison of
# logX1 - logX2
# therefore you could think of it as the percentage change in Page Views
# map
# function allows for function to be performed on all rows in column
# 'Page_Views'.
def log_page_view(as_pandas):
    # Insert new column header
    as_pandas.insert(len(as_pandas.columns), 'Log_Page_Views', 0)
    as_pandas['Log_Page_Views'] = as_pandas['Page_Views'].map(lambda x: math.log(1 + float(x)))
def change_to_numeric(as_pandas):
    # Check for missing values then convert the column to numeric.
    as_pandas = as_pandas.replace(r'^\s*$', np.nan, regex=True)
    as_pandas['Page_Views'] = pd.to_numeric(as_pandas['Page_Views'],
                                            errors='coerce')
    # replace() builds a new DataFrame, so it has to be handed back to the caller
    return as_pandas
def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)
    return as_pandas
def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader
# Get each column of data including the heading and separate each element
# i.e. Title, URL, Date, Page Views
# and save to string_of_rows with comma separator for storage as a csv
# file.
def get_columns_of_data(*args):
    # Function that accept variable length arguments
    string_of_rows = str()
    num_cols = len(args)
    try:
        if num_cols > 0:
            for number, element in enumerate(args):
                if number == (num_cols - 1):
                    string_of_rows = string_of_rows + element + '\n'
                else:
                    string_of_rows = string_of_rows + element + ','
    except UnboundLocalError:
        print('Empty file \'or\' No arguments received, cannot be zero')
    return string_of_rows
def open_file(file_name):
    import csv
    try:
        with open(file_name) as csv_file_in, open('HDT_data5.txt', 'w') as csv_file_out:
            csv_read = csv.reader(csv_file_in, delimiter='\t')
            for row in csv_read:
                try:
                    row[0] = row[0].replace(',', '')
                    csv_file_out.write(get_columns_of_data(*row))
                except TypeError:
                    continue
            print("The file name '{}' was successfully opened and read".format(file_name))
    except IOError:
        print('File not found \'OR\' Not in current directory\n')
# All acronyms used in variable naming correspond to the function at time
# of return from function.
# csv_list being a list of the v file contents the remainder i.e. 'st' of
# csv_list_st = split_title().
def main():
    open_file('HDTdata3.txt')
    multi_sets = open_as_dataframe('HDT_data5.txt')
    multi_sets = change_column_names(multi_sets)
    multi_sets = change_to_numeric(multi_sets)
    log_page_view(multi_sets)
    title_length(multi_sets)
    set_new_columns(multi_sets)
    print(multi_sets)
main()

Never resets list

I am trying to create a calorie counter. The standard input goes like this:
python3 calories.txt < test.txt
Inside calories.txt, each food is in the following format: apples 500
The problem I am having is that whenever I calculate the values for a person, the list never seems to return to an empty list.
import sys
food = {}
eaten = {}
finished = {}
total = 0
#mappings
def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1]
            food[key] = value
def calculate(x):
    a = []
    for keys,values in x.items():
        for c in values:
            try:
                a.append(int(food[c]))
            except:
                a.append(100)
        print("before",a)
        a = []
        total = sum(a) # Problem here
        print("after",a)
        print(total)
def main():
    calories(sys.argv[1])
    for line in sys.stdin:
        lines = line.strip().split(',')
        for c in lines:
            values = lines[0]
            keys = lines[1:]
            eaten[values] = keys
        calculate(eaten)
if __name__ == '__main__':
    main()
Edit - forgot to include what test.txt would look like:
joe,almonds,almonds,blue cheese,cabbage,mayonnaise,cherry pie,cola
mary,apple pie,avocado,broccoli,butter,danish pastry,lettuce,apple
sandy,zuchini,yogurt,veal,tuna,taco,pumpkin pie,macadamia nuts,brazil nuts
trudy,waffles,waffles,waffles,chicken noodle soup,chocolate chip cookie
How to make it easier on yourself:
When reading the calorie data, convert the calories to int() as soon as possible; then there is no need to convert them every time you want to sum something up.
A dictionary has a .get(key, defaultvalue) accessor, so "if the food is not found, use 100 as the default" becomes a one-liner without try: ... except:.
This works for me when the second file is passed as a command-line argument and opened as a file, instead of being piped into the program through sys.stdin with <.
I modified some of the parsing to remove whitespace and to return a list of (name, cal) tuples from calculate.
May it help you to fix it to your liking:
import sys

food = {}   # food name -> calories
eaten = {}  # person -> list of foods eaten

def calories(x):
    with open(x,"r") as file:
        for line in file:
            lines = line.strip().split()
            key = " ".join(lines[0:-1])
            value = lines[-1].strip() # ensure no whitespaces in
            food[key] = int(value)
def getCal(foodlist, defValueUnknown = 100):
    """Get sum / total calories of a list of ingredients, unknown cost 100."""
    return sum( food.get(x,defValueUnknown ) for x in foodlist) # calculate it, if unknown assume 100
def calculate(x):
    a = []
    for name,foods in x.items():
        a.append((name, getCal(foods))) # append as tuple to list for all names/foods eaten
    return a
def main():
    calories(sys.argv[1])
    with open(sys.argv[2]) as f: # parse as file, not piped in via sys.stdin
        for line in f:
            lines = line.strip().split(',')
            for c in lines:
                values = lines[0].strip()
                keys = [x.strip() for x in lines[1:]] # ensure no whitespaces in
                eaten[values] = keys
    calced = calculate(eaten) # calculate after all are read into the dict
    print (calced)

if __name__ == '__main__':
    main()
Output:
[('joe', 1400), ('mary', 1400), ('sandy', 1600), ('trudy', 1000)]
Using sys.stdin and piping just led to my console blinking and waiting for manual input; maybe that is VS related...
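The same approach can be kept free of global state by passing the data in as arguments, which also makes it indifferent to whether the lines come from a file or from sys.stdin. This is only a sketch under assumptions: parse_calories and totals_per_person are hypothetical names, and the calorie values in the sample are invented, not taken from the real calories file.

```python
def parse_calories(lines):
    """Build {food name: calories} from lines like 'apple pie 300'."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # Everything before the last field is the (possibly multi-word) name.
            table[" ".join(parts[:-1])] = int(parts[-1])
    return table

def totals_per_person(lines, table, default=100):
    """Return {name: total calories}; foods missing from the table count as default."""
    totals = {}
    for line in lines:
        name, *foods = [item.strip() for item in line.split(",")]
        if name:
            # A fresh sum per person, so nothing carries over between people.
            totals[name] = sum(table.get(food, default) for food in foods)
    return totals

# Invented calorie values, for illustration only.
table = parse_calories(["apple pie 300", "cola 150"])
print(totals_per_person(["joe,cola,cola,unknown dish", "mary,apple pie"], table))
```

Because the total is computed per line with sum(), there is no shared list that has to be reset between people. To read piped input, pass sys.stdin as the first argument of totals_per_person.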

Creating a dictionary to count the number of occurrences of Sequence IDs

I'm trying to write a function to count the number of each sequence ID that occurs in this file (it's a sample blast file)
The picture above is the input file I'm dealing with.
def count_seq(input):
    dic1={}
    count=0
    for line in input:
        if line.startswith('#'):
            continue
        if line.find('hits found'):
            line=line.split('\t')
            if line[1] in dic1:
                dic1[line]+=1
            else:
                dic1[line]=1
    return dic1
Above is my code, which when called just returns empty brackets: {}
I'm trying to count how many times each of the sequence IDs (the second element of the last 13 lines) occurs, e.g. FO203510.1 occurs 4 times.
Any help would be appreciated immensely, thanks!
Maybe this is what you're after:
def count_seq(input_file):
    dic1={}
    with open(input_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line.startswith('#'):
                line = line.split()
                seq_id = line[1]
                if seq_id not in dic1:
                    dic1[seq_id] = 1
                else:
                    dic1[seq_id] += 1
    return dic1
print(count_seq("blast_file"))
This is a fitting case for collections.defaultdict. Let f be the file object. Assuming the sequences are in the second column, it's only a few lines of code as shown.
from collections import defaultdict
d = defaultdict(int)
seqs = (line.split()[1] for line in f if not line.strip().startswith("#"))
for seq in seqs:
    d[seq] += 1
See if it works!
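collections.Counter is a close relative of defaultdict(int) that does the counting itself and can also report the most frequent IDs. The sketch below assumes, as the answers above do, that the sequence ID is the second whitespace-separated column and that comment lines start with #. count_sequence_ids is a made-up name, and the sample lines (including the ID AB000001.1) are invented stand-ins for the real BLAST output, which is not shown in the question.

```python
from collections import Counter

def count_sequence_ids(lines):
    """Count the second column of every non-comment line that has at least two fields."""
    ids = (
        line.split()[1]
        for line in lines
        if not line.lstrip().startswith("#") and len(line.split()) > 1
    )
    return Counter(ids)

# Invented sample lines in a tab-separated, BLAST-like layout.
sample = [
    "# BLASTN 2.2.29+",
    "# 3 hits found",
    "query1\tFO203510.1\t98.5",
    "query1\tFO203510.1\t97.0",
    "query2\tAB000001.1\t90.1",
    "",
]

counts = count_sequence_ids(sample)
print(counts.most_common())
```

The length check skips blank or one-column lines, which would otherwise raise an IndexError when the second field is read.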

Resources