I'm trying to learn Spark so I'm totally new to it.
I have a file with thousands of lines where each one is structured like:
The line above represents flight information for one airplane: it took off from LFPG (1st element) and landed in EDDW (2nd element). The rest of the information is not relevant for this purpose.
I'd like to print or save in a file the top ten busiest airports based on the total number of aircraft movements, that is, airplanes that took off or landed in an airport.
So in a sense, the desired output would be:
I have already implemented this program in Python and would like to rewrite it with the MapReduce paradigm using Spark.
# Libraries
import sys
from collections import Counter
import collections
from itertools import chain
from collections import defaultdict
# Defining default program argument
if len(sys.argv)==1:
    fileName = "airports.exp2"
else:
    fileName = sys.argv[1]
takeOffAirport = []
landingAirport = []
# Reading file
lines = 0 # Counter for file lines
try:
    with open(fileName) as file:
        for line in file:
            words = line.split(';')
            # Relevant data, item1 and item2 from each file line
            origin = words[0]
            destination = words[1]
            # Populating lists
            takeOffAirport.append(origin)
            landingAirport.append(destination)
            lines += 1
except IOError:
    print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
airports_dict = defaultdict(list)
# Merge lists into a dictionary key:value
for key, value in chain(Counter(takeOffAirport).items(),
                        Counter(landingAirport).items()):
    # 'AIRPORT_NAME':[num_takeOffs, num_landings]
    airports_dict[key].append(value)
# Sum key values and add it as another value
for key, value in airports_dict.items():
    #'AIRPORT_NAME':[num_totalMovements, [num_takeOffs, num_landings]]
    airports_dict[key] = [sum(value),value]
# Sort dictionary by the top 10 total movements
airports_dict = sorted(airports_dict.items(),
                       key=lambda kv:kv[1], reverse=True)[:10]
airports_dict = collections.OrderedDict(airports_dict)
# Print results
for k in airports_dict:
    print(k,"\t\t", airports_dict[k][0],
          "\t\t\t", airports_dict[k][1][1],
          "\t\t", airports_dict[k][1][0])
A test file can be downloaded from:
So far I've been able to get the first and second elements of each line, but I don't know how to apply filter or reduce to count how often each airport appears in each list. After that I need to merge both lists into one result holding the airport name, the sum of takeoffs and landings, and the separate numbers of takeoffs and landings.
from pyspark import SparkContext, SparkConf
if __name__ == "__main__":
    conf = SparkConf().setAppName("airports").setMaster("local[*]")
    sc = SparkContext(conf = conf)
    airports = sc.textFile("traffic1hour.exp2", minPartitions=4)
    airports = airports.map(lambda line : line.split('\n'))
    takeOff_airports = airports.map(lambda sub: (sub[0].split(';')[0]))
    landing_airports = airports.map(lambda sub: (sub[0].split(';')[1]))
Any hint or guidance would be much appreciated.
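One way to picture the map/reduce version, sketched in plain Python so it runs without Spark: map every line to two tagged pairs, reduce by key by adding the pairs element-wise, then keep the ten largest totals. The sample lines and the helper names (`to_pairs`, `add_counts`, `top_airports`) are made up for illustration. In PySpark the three steps would roughly correspond to `flatMap`, `reduceByKey` and `takeOrdered` on the RDD, but this sketch does not call Spark itself.

```python
from functools import reduce
from itertools import chain

def to_pairs(line):
    """Map step: one line -> [(origin, (1, 0)), (destination, (0, 1))]."""
    fields = line.strip().split(';')
    return [(fields[0], (1, 0)), (fields[1], (0, 1))]

def add_counts(a, b):
    """Reduce step: add (takeoffs, landings) tuples element-wise."""
    return (a[0] + b[0], a[1] + b[1])

def top_airports(lines, n=10):
    # flatMap equivalent: flatten the per-line pair lists into one stream
    pairs = chain.from_iterable(to_pairs(line) for line in lines)

    # reduceByKey equivalent: fold all pairs into one dict of running totals
    def fold(totals, pair):
        key, counts = pair
        totals[key] = add_counts(totals.get(key, (0, 0)), counts)
        return totals
    totals = reduce(fold, pairs, {})

    # takeOrdered equivalent: sort by total movements, descending
    rows = [(k, t + l, t, l) for k, (t, l) in totals.items()]
    return sorted(rows, key=lambda r: (-r[1], r[0]))[:n]

# Hypothetical sample lines; only the first two fields matter
sample = ["LFPG;EDDW;x", "EDDW;LFPG;y", "LFPG;EGLL;z"]
for name, total, takeoffs, landings in top_airports(sample):
    print(name, total, takeoffs, landings)
```

Carrying a `(takeoffs, landings)` tuple through the reduce step avoids building two separate lists and merging them afterwards.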


what is wrong with this Pandas and txt file code

I'm using pandas to open a CSV file that contains data from Spotify. I also have a txt file that contains various artist names from that CSV file. What I'm trying to do is read each row of the txt file and automatically search for that value with the function I've written.
import pandas as pd
import time
df = pd.read_csv("data.csv")
df = df[['artists', 'name', 'year']]
def buscarA():
    start = time.time()
    newdf = (df.loc[df['artists'].str.contains(art)])
    stop = time.time()
    tempo = (stop - start)
    print (newdf)
    e = ('{:.2f}'.format(tempo))
    print (e)
with open("teste3.txt", "r") as f:
    for row in f:
        art = row
        buscarA()
but the output is always the same:
Empty DataFrame
Columns: [artists, name, year]
Index: []
The problem is that when you iterate over the lines of a file in Python, each line keeps its trailing line break, so you have to strip it off.
Suppose the first line of your teste3.txt file is "James Brown". It is read as "James Brown\n", which does not match anything in the search.
Changing the last chunk of your code to:
with open("teste3.txt", "r") as f:
    for row in f:
        art = row.strip()
        buscarA()
should work.
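To see the effect in isolation, here is a small sketch that does not need the CSV at all. The artist list and the `io.StringIO` stand-in for teste3.txt are invented for the example, and plain substring matching plays the role of `str.contains`.

```python
import io

artists = ["James Brown", "Nina Simone"]                 # stand-in for df['artists']
fake_file = io.StringIO("James Brown\nNina Simone\n")    # stand-in for teste3.txt

def search(term):
    """Return the artists whose name contains `term`."""
    return [a for a in artists if term in a]

raw_hits, stripped_hits = [], []
for row in fake_file:
    raw_hits.append(search(row))               # row still ends with "\n"
    stripped_hits.append(search(row.strip()))  # line break removed

print(raw_hits)
print(stripped_hits)
```

The first list stays empty for every row, which is the same symptom as the empty DataFrame in the question.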

how to write a fastq file from other file

I was asked to read two files (left and right reads), Aip02.R1.fastq and Aip02.R2.fastq, and produce an interleaved fasta file using the zip function. The left and right files are fastq files, but when I zip them together to make a new fastq file, the write function no longer works. It gives me the error "SeqRecord (id=) has an invalid sequence."
#!/usr/bin/env python3
# Import Seq, SeqRecord, and SeqIO
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
leftReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq", "fastq")
rightReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq","fastq")
A= zip(leftReads,rightReads)
SeqIO.write(SeqRecord(list(A)), "interleave.fastq", "fastq")
Your forward and reverse sequences probably have the same ID. So use the following code to add a suffix to the IDs. I used /1 and /2 here, but things like .f and .r are also used.
from Bio import SeqIO
def interleave(iter1, iter2) :
    # zip is lazy in Python 3, so it replaces Python 2's itertools.izip
    for (forward, reverse) in zip(iter1, iter2):
        assert forward.id == reverse.id
        forward.id += "/1"
        reverse.id += "/2"
        yield forward
        yield reverse
leftReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq", "fastq")
rightReads = SeqIO.parse("/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq","fastq")
with open("interleave.fastq", "w") as handle:
    count = SeqIO.write(interleave(leftReads, rightReads), handle, "fastq")
print("{} records written to interleave.fastq".format(count))
This code can become slow for large fastq files. For example, see here, where they report that it takes 14 minutes to create a 2 GB fastq file. They give this faster approach instead:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
file_f = "/scratch/AiptasiaMiSeq/fastq/Aip02.R1.fastq"
file_r = "/scratch/AiptasiaMiSeq/fastq/Aip02.R2.fastq"
count = 0
# Plain open() replaces the "rU" mode, which Python 3.11 removed
with open(file_f) as f_handle, open(file_r) as r_handle, \
        open("interleave.fastq", "w") as handle:
    f_iter = FastqGeneralIterator(f_handle)
    r_iter = FastqGeneralIterator(r_handle)
    for (f_id, f_seq, f_q), (r_id, r_seq, r_q) in zip(f_iter, r_iter):
        assert f_id == r_id
        count += 2
        #Write out both reads with "/1" and "/2" suffix on ID
        handle.write("@%s/1\n%s\n+\n%s\n@%s/2\n%s\n+\n%s\n"
                     % (f_id, f_seq, f_q, r_id, r_seq, r_q))
print("{} records written to interleave.fastq".format(count))
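If Biopython is not at hand, the same interleaving idea can be sketched with the standard library only. This assumes every record is exactly four lines (no wrapped sequences), which real files may not guarantee. The helper `read_fastq` and the one-record inputs are invented for the example.

```python
import io

def read_fastq(handle):
    """Yield (title, sequence, quality) from a strict 4-line-per-record FASTQ."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()                      # the "+" separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual             # drop the leading "@"

def interleave(left, right, out):
    """Write left/right records alternately, tagging IDs with /1 and /2."""
    count = 0
    for (f_id, f_seq, f_q), (r_id, r_seq, r_q) in zip(read_fastq(left), read_fastq(right)):
        assert f_id == r_id, "paired reads must share an ID"
        out.write("@%s/1\n%s\n+\n%s\n@%s/2\n%s\n+\n%s\n"
                  % (f_id, f_seq, f_q, r_id, r_seq, r_q))
        count += 2
    return count

# Invented one-record inputs standing in for the R1 and R2 files
left = io.StringIO("@read1\nACGT\n+\nIIII\n")
right = io.StringIO("@read1\nTTGA\n+\nHHHH\n")
out = io.StringIO()
print(interleave(left, right, out), "records written")
```

Because `zip` pulls one record from each side at a time, neither file is loaded into memory in full.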

How can I tell Python to look for an element only if it exists?

I want to scrape information from supermarket products but taking into account that some of the info (the origin of the product) isn't always available.
I am trying to iterate over a dataframe of links of a supermarket. From each of them, I want to get some information. However, the origin of the products isn't always available. I don't know how to make Python look for 'origin' only when it is available. I've tried the following code:
import urllib.request
from bs4 import BeautifulSoup
import csv
import os
import pandas as pd
dir = ''
file = 'data.xlsx'
# create and write headers to a list
rows = []
rows.append(['Brand', 'Product', 'Product_Number', 'Gross_Weight', 'Origin'])
# Change working directory:
os.chdir(dir)
# Retrieve current working directory ('cwd'):
cwd = os.getcwd()
# Load spreadsheet:
xl = pd.ExcelFile(file)
# Load a sheet into a DataFrame by name: df1
df = xl.parse(sheetname)
for index, row in df.iterrows():
    # specify the url
    urlpage = row['link']
    # query the website and return the html to the variable 'page'
    page = urllib.request.urlopen(urlpage)
    # parse the html using beautiful soup and store in variable 'soup'
    soup = BeautifulSoup(page, 'html.parser')
    # find results within table
    results = soup.find_all('dl', attrs={'class': 'des_info clearfix'})
    #print('Number of results', len(results))
    for result in results:
        # find all columns per result
        data = result.find_all('dd')
        # check that columns have data
        if len(data) == 0:
            continue
        # write columns to variables
        brand = data[0].getText()
        product = data[1].getText()
        number = data[2].getText()
        weight = data[3].getText()
        if data[4].getText() == None:
            origin = 0
        else:
            origin = data[4].getText()
        # write each result to rows
        rows.append([brand, product, number, weight, origin])
I get the following error:
if data[4].getText() == None:
IndexError: list index out of range
I would like to get all the data ordered in a list and, if the origin isn't available for one item, a zero.
You can use a try statement:
# write columns to variables
brand = data[0].getText()
product = data[1].getText()
number = data[2].getText()
weight = data[3].getText()
try:
    origin = data[4].getText()
except IndexError:
    # data has no fifth element, so the origin is missing
    origin = 0
You could also check the length of data. Since data[4] is the fifth element, the list needs at least five items:
if len(data) >= 5:
    origin = data[4].getText()
else:
    origin = 0
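If the same check is needed for several columns, it can be wrapped in a small helper. This is only a sketch: `text_or_default` is a made-up name, and the plain string lists stand in for the text of each dd element.

```python
def text_or_default(items, index, default=0):
    """Return items[index] if that position exists, otherwise `default`."""
    return items[index] if index < len(items) else default

# Invented rows standing in for the text of each <dd> element
with_origin = ["BrandA", "Milk", "123", "1kg", "Spain"]
without_origin = ["BrandB", "Rice", "456", "2kg"]

print(text_or_default(with_origin, 4))
print(text_or_default(without_origin, 4))
```

The helper assumes a non-negative index; with real BeautifulSoup results you would call `.getText()` on the returned element only when it is not the default.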

Having issues computing the average of compound sentiment values for each text file in a folder

# below is the sentiment analysis code written for sentence-level analysis
import glob
import os
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize
# Next, VADER is initialized so I can use it within the Python script
sid = SentimentIntensityAnalyzer()
# I will also initialize the 'english.pickle' function and give it a short name
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#Each of the text file is listed from the folder speeches
files = glob.glob(os.path.join(os.getcwd(), 'cnn_articles', '*.txt'))
text = []
#iterate over the list getting each file
for file in files:
    #open the file and then call .read() to get the text
    with open(file) as f:
        text.append(f.read())
text_str = "\n".join(text)
# This breaks up the paragraph into a list of strings.
sentences = tokenizer.tokenize(text_str )
sent = 0.0
count = 0
# Iterating through the list of sentences and extracting the compound scores
for sentence in sentences:
    count +=1
    scores = sid.polarity_scores(sentence)
    sent += scores['compound'] #Adding up the overall compound sentiment
    # print(sent, file=open('cnn_compound.txt', 'a'))
if count != 0:
    sent = float(sent / count)
print(sent, file=open('cnn_compound.txt', 'a'))
With these lines of code, I have been able to get the average of all the compound sentiment values across all the text files. What I really want is the average compound sentiment value for each text file, so that if I have 10 text files in the folder, I get 10 floating point values, one per file. Then I can plot these values against each other.
Kindly assist me, as I am very new to Python.
# below is the sentiment analysis code written for sentence-level analysis
import os, string, glob, pandas as pd, numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import sentiment
from nltk import word_tokenize
# Next, VADER is initialized so I can use it within the Python script
sid = SentimentIntensityAnalyzer()
exclude = set(string.punctuation)
# I will also initialize the 'english.pickle' function and give it a short name
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
#Each of the text file is listed from the folder speeches
files = glob.glob(os.path.join(os.getcwd(), 'cnn_articles', '*.txt'))
text = []
cnt = 0
#iterate over the list getting each file
for file in files:
    f = open(file).read().split('.')
    cnt +=1
    count = (len(f))
    sent = 0.0  # reset for each file so the average is per article
    for sentence in f:
        if sentence not in exclude:
            scores = sid.polarity_scores(sentence)
            sent += scores['compound']
    average = round((sent/count), 4)
    t = [cnt, average]
    text.append(t)
df = pd.DataFrame(text, columns=['Article Number', 'Average'])
#df.to_csv(r'Result.txt', header=True, index=None, sep='"\t\"+"\t\"', mode='w')
df.to_csv('cnn_result.csv', index=None)
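The key point is that the running total has to start from zero for every file. Here is a minimal sketch of just that part, with no NLTK dependency: `average_per_file` and `toy_score` are invented names, and `toy_score` merely stands in for `sid.polarity_scores(sentence)['compound']`.

```python
def average_per_file(documents, score):
    """Return {name: mean score of its sentences}, one entry per document."""
    averages = {}
    for name, text in documents.items():
        sentences = [s.strip() for s in text.split('.') if s.strip()]
        total = 0.0                      # reset for every file
        for sentence in sentences:
            total += score(sentence)
        averages[name] = round(total / len(sentences), 4) if sentences else 0.0
    return averages

# Toy scorer standing in for sid.polarity_scores(sentence)['compound']
def toy_score(sentence):
    return 1.0 if "good" in sentence else -1.0 if "bad" in sentence else 0.0

docs = {"a.txt": "This is good. This is good.", "b.txt": "This is bad. Nothing here."}
print(average_per_file(docs, toy_score))
```

Each file gets its own value, so ten files give ten numbers that can be plotted against each other.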

Python Multiprocessing throwing out results based on previous values

I am trying to learn how to use multiprocessing and have managed to get the code below to work. The goal is to work through every combination of the variables within the CostlyFunction by setting n equal to some number (right now it is 100 so the first 100 combinations are tested). I was hoping I could manipulate w as each process returned its list (CostlyFunction returns a list of 7 values) and only keep the results in a given range. Right now, w holds all 100 lists and then lets me manipulate those lists but, when I use n=10MM, w becomes huge and costly to hold. Is there a way to evaluate CostlyFunction's output as the workers return values and then 'throw out' values I don't need?
if __name__ == "__main__":
    import csv
    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    #width = -36000000/1000
    #fronteir = [None]*1000
    n = 100
    currtime = time()
    po = Pool()
    res = po.map_async(CostlyFunction,((i,) for i in range(n)))
    w = res.get()
    spamwriter = csv.writer(csvFile, delimiter=',')
    print(('2: parallel: time elapsed:', time() - currtime))
Unfortunately, Pool doesn't have a 'filter' method; otherwise, you might've been able to prune your results before they're returned. Pool.imap is probably the best solution you'll find for dealing with your memory issue: it returns an iterator over the results from CostlyFunction.
For sorting through the results, I made a simple list-based class called TopList that stores a fixed number of items. All of its items are the highest-ranked according to a key function.
from collections import UserList
def keyfunc(a):
    return a[5] # This would be the sixth item in a result from CostlyFunction
class TopList(UserList):
    def __init__(self, key, *args, cap=10): # cap is the largest number of results
        super().__init__(*args)             # you want to store
        self.cap = cap
        self.key = key
    def add(self, item):
        self.data.append(item)
        self.data.sort(key=self.key, reverse=True)
        del self.data[self.cap:] # drop everything ranked below the cap
Here's how your code might look:
if __name__ == "__main__":
    import csv
    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    n = 100
    currtime = time()
    po = Pool()
    best = TopList(keyfunc)
    result_iter = po.imap(CostlyFunction, ((i,) for i in range(n)))
    for result in result_iter:
        best.add(result)
    spamwriter = csv.writer(csvFile, delimiter=',')
    print(('2: parallel: time elapsed:', time() - currtime))
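Re-sorting the list on every insert is fine for a small cap. If the cap grows, a bounded min-heap does the same pruning with less work per result. The sketch below swaps in `heapq` for the list-based class; `keep_top` and the fake result generator are made up for illustration, with the generator standing in for the iterator returned by `po.imap`.

```python
import heapq

def keep_top(results, key, cap=10):
    """Consume `results` lazily and return the `cap` items with the largest key."""
    heap = []                                  # min-heap of (key, order, item)
    for order, item in enumerate(results):
        entry = (key(item), order, item)       # `order` breaks ties without comparing items
        if len(heap) < cap:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heapreplace(heap, entry)     # drop the current smallest
    return [item for _, _, item in sorted(heap, reverse=True)]

# Hypothetical stand-in for the iterator returned by po.imap(CostlyFunction, ...)
fake_results = ([i, (i * 7) % 10] for i in range(10))
print(keep_top(fake_results, key=lambda r: r[1], cap=3))
```

Only `cap` results are held at any time, so memory stays flat however large n gets. For a one-off selection, `heapq.nlargest(cap, results, key=key)` gives the same items in a single call.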
