remove empty strings from spark RDD - apache-spark

I have an RDD that I am tokenizing like this to get a list of tokens:
import re
data = sqlContext.read.load('file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
data = data.rdd.map(lambda x: x.desc)
stopwords = set(sc.textFile('stopwords.txt').collect())
# normalize each document, split it on whitespace and ; , #, then drop stopwords
tokens = (data.map(lambda document: document.strip().lower())
              .map(lambda document: re.split(r"[\s;,#]", document))
              .map(lambda words: [str(w) for w in words if w not in stopwords]))
>>> print tokens.take(5)
[['35', 'year', 'wild', 'elephant', 'named', 'sidda', 'villagers', 'manchinabele', 'dam', 'outskirts', 'bengaluru', '', 'cared', 'wildlife', 'activists', 'suffered', 'fracture', 'developed', 'mu'], ['tamil', 'nadu', 'vivasayigal', 'sangam', 'reiterates', 'demand', 'declaring', 'tamil', 'nadu', 'drought', 'hit', 'sanction', 'compensation', 'affected', 'farmers'], ['triggers', 'rumours', 'income', 'tax', 'raids', 'quarries'], ['', 'president', 'barack', 'obama', 'ordered', 'intelligence', 'agencies', 'review', 'cyber', 'attacks', 'foreign', 'intervention', '2016', 'election', 'deliver', 'report', 'leaves', 'office', 'january', '20', '', '2017'], ['death', 'note', 'driver', '', 'bheema', 'nayak', '', 'special', 'land', 'acquisition', 'officer', '', 'alleging', 'laundered', 'mining', 'baron', 'janardhan', 'reddys', 'currency', 'commission', '']]
There are a few '' items in the lists that I am unable to remove. How can I remove them?
This is not working:
tokens = tokens.filter(lambda lst: filter(None, lst))

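This should work:
tokens = tokens.map(lambda lst: filter(None, lst))
RDD.filter expects a function that returns a boolean and uses it to keep or drop each element as a whole. In your case the function returns a list, and any non-empty list is truthy, so every list is kept unchanged. map applies the function to each list and keeps its result instead. On Python 3, filter returns a lazy iterator, so wrap it as list(filter(None, lst)).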
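To see the difference without a Spark cluster, here is a minimal sketch in plain Python. The helper name drop_empty and the sample rows are assumptions made for illustration; the two list comprehensions only stand in for what tokens.map(...) and tokens.filter(...) would do to each element.

```python
def drop_empty(tokens):
    """Return the tokens with empty strings removed."""
    return [t for t in tokens if t]

# Hypothetical sample rows, shortened from the output in the question.
rows = [['35', 'year', '', 'cared'], ['', 'president', '20', '', '2017']]

# Stand-in for tokens.map(drop_empty): every list is transformed.
mapped = [drop_empty(row) for row in rows]

# Stand-in for tokens.filter(drop_empty): the result is only used as a
# truth value, so each non-empty list is kept exactly as it was.
filtered = [row for row in rows if drop_empty(row)]

print(mapped)
print(filtered)
```

With a real RDD the same helper could be passed directly, as in tokens.map(drop_empty).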

Related

Tokenize in Python

I am trying to build a function in Python that allows me to tokenize a character string. I have written the following function:
import nltk
def tokenize(string):
    words = nltk.word_tokenize(string)
    return words
This function returns the following:
tokenize("Hello. What’s your name?")
['Hello', '.', 'What', '’', 's', 'your', 'name', '?']
But I need it to return the following:
['Hello', '.', 'What’s', 'your', 'name', '?']
How could I implement this?
Thank you
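One possible approach, sketched here with a regular expression instead of NLTK, is to treat an apostrophe between word characters as part of the word. The function name tokenize_keep_apostrophes and the pattern are assumptions for illustration, not NLTK behavior. The pattern accepts both the straight (') and the typographic (’) apostrophe.

```python
import re

# A word, optionally followed by apostrophe + word parts (What’s, don't),
# or else any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

def tokenize_keep_apostrophes(text):
    """Split text into words and punctuation, keeping contractions whole."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize_keep_apostrophes("Hello. What’s your name?")
print(tokens)  # ['Hello', '.', 'What’s', 'your', 'name', '?']
```

This does not reproduce the rest of what nltk.word_tokenize does, so it is best seen as a starting point.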

How to create a nested dict from list with blank keys across the board?

I know that dict.fromkeys, used as rt_dict = dict.fromkeys(['name', 'description', 'model'], ''), gets me halfway there, but how do I adjust it to achieve my desired result of something like:
{'name': '', 'description': {'year': '', 'make': ''}, 'model': ''}
All keys without nested dictionaries should have blank values. All values of the nested dictionaries should be blank IF they do not have nested dictionaries.
It is not clear what your input looks like, but this works for a list in which nested keys are given as single-entry dicts:
input = ['name', {'description': ['year', 'make']}, 'model']
result = {}
for key in input:
    if isinstance(key, dict):
        # single-entry dict: its key maps to a dict of blank sub-keys
        result[next(iter(key))] = dict.fromkeys(next(iter(key.values())), '')
    else:
        result[key] = ''
Output:
{'name': '', 'description': {'year': '', 'make': ''}, 'model': ''}
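If the nesting can go deeper than one level, the same idea can be written recursively. This is a sketch under the same assumed input shape (plain strings for blank keys, dicts mapping a key to a list of sub-keys). The name blank_dict is made up for illustration.

```python
def blank_dict(spec):
    """Build a dict of blank values from a list of keys.

    Each item is either a string (blank value) or a dict that maps a key
    to a nested list of the same form.
    """
    result = {}
    for item in spec:
        if isinstance(item, dict):
            for key, children in item.items():
                result[key] = blank_dict(children)  # recurse into sub-keys
        else:
            result[item] = ''
    return result

nested = blank_dict(['name', {'description': ['year', 'make']}, 'model'])
print(nested)
```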

How do I pass a list as a parameter in a user-defined function?

How do I pass a list as a parameter in a function?
I am trying to form a user-defined function called 'get_all_latitude' where it will extract the latitude according to its listing id from a dataset. An excerpt of the dataset (it is a list of dictionaries) is as follows:
{
'listing_id': '1133718',
'survey_id': '1280',
'host_id': '6219420',
'room_type': 'Shared room',
'country': '',
'city': 'Singapore',
'borough': '',
'neighborhood': 'MK03',
'reviews': 9.0,
'overall_satisfaction': 4.5,
'accommodates': '12',
'bedrooms': '1.0',
'bathrooms': '',
'price': 74.0,
'minstay': '',
'last_modified': '2017-05-17 09:10:25.431659',
'latitude': 1.293354,
'longitude': 103.769226,
'location': '0101000020E6100000E84EB0FF3AF159409C69C2F693B1F43F'
}
This is my progress thus far:
def get_all_latitude(data, list_id):
    new_list = []
    for row in data:
        if row['listing_id'] == list_id:
            new_list.append(row['latitude'])
    return new_list
This works if I only have one listing id as the second argument (e.g. get_all_latitude(airbnb_data, '1133718')), but I am wondering how I can get it to work with a list (e.g. get_all_latitude(airbnb_data, ['10350448','13507262','13642646'])), as I do not know how to write it so that it unpacks the elements of a list.
Try this:
def get_all_latitude(data, list_id):
    new_list = []
    for row in data:
        if row['listing_id'] in list_id:
            new_list.append(row['latitude'])
    return new_list
Or, if you want a separate list for each listing id:
def get_all_latitude(data, list_ids):
    new_lists = {list_id: list() for list_id in list_ids}
    for row in data:
        if row['listing_id'] in new_lists:
            new_lists[row['listing_id']].append(row['latitude'])
    return new_lists
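One caveat with the membership test: if a single string is passed, row['listing_id'] in list_id becomes a substring check, so '113' would match '1133718'. A small sketch that accepts either a single id or a list of ids is shown below. The rows other than listing '1133718' are hypothetical values added for illustration.

```python
def get_all_latitude(data, list_ids):
    """Return the latitudes of all rows whose listing_id is in list_ids."""
    if isinstance(list_ids, str):
        list_ids = [list_ids]      # treat a single id as a one-element list
    wanted = set(list_ids)         # set membership is an exact match
    return [row['latitude'] for row in data if row['listing_id'] in wanted]

airbnb_data = [
    {'listing_id': '1133718', 'latitude': 1.293354},
    {'listing_id': '113', 'latitude': 1.3},        # hypothetical row
    {'listing_id': '10350448', 'latitude': 1.31},  # hypothetical row
]

single = get_all_latitude(airbnb_data, '1133718')
several = get_all_latitude(airbnb_data, ['1133718', '10350448'])
print(single)
print(several)
```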

finding non matching records in pandas

I would like to identify if a set of records is not represented by a distinct list of values; so in this example of:
import pandas as pd
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'],
'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['9', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan'],
'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['9', '5', '6', '7'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice'],
'sport' : ['football','football','football','football']}
df_c = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['1', '3', '5'],
'first_name': ['Alex', 'Allen', 'Ayoung'],
'last_name': ['Anderson', 'Ali', 'Atiches'],
'sport' : ['football','football','football']}
df_d = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
frames = [df_a,df_b,df_c,df_d]
frame = pd.concat(frames)
frame = frame.sort_values(by='subject_id')
raw_data = {
'sport':['soccer','football','softball']
}
sportlist = pd.DataFrame(raw_data,columns=['sport'])
Desired output: I would like to get a list of first_name and last_name pairs that do not play football. I would also like to be able to return a list of all the records, since softball is not represented in the original list.
I tried using merge with the how='outer' and indicator=True options, but since there is a record that plays soccer, there is a match. And filtering on 'right_only' yields no records, since it was not populated in the original data.
Thanks,
aem
If you only want to get the names of people who do not play football all you need to do is:
frame[frame.sport != 'football']
Which would select only those persons who are not playing football.
If it has to be a list, you can additionally call to_records(index=False):
frame[frame.sport != 'football'][['first_name', 'last_name']].to_records(index=False)
which returns a NumPy record array of tuples (call .tolist() on it for a plain Python list):
[('Alex', 'Anderson'), ('Amy', 'Ackerman'), ('Allen', 'Ali'),
('Alice', 'Aoni'), ('Brian', 'Black'), ('Ayoung', 'Atiches'),
('Bran', 'Balwner'), ('Bryce', 'Brice'), ('Betty', 'Btisan'),
('Billy', 'Bonder')]
You can also use the .loc indexer in pandas:
frame.loc[frame['sport'].ne('football'), ['first_name','last_name']].values.tolist()
[['Alex', 'Anderson'],
['Amy', 'Ackerman'],
['Allen', 'Ali'],
['Alice', 'Aoni'],
['Brian', 'Black'],
['Ayoung', 'Atiches'],
['Bran', 'Balwner'],
['Bryce', 'Brice'],
['Betty', 'Btisan'],
['Billy', 'Bonder']]
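Both answers above select the rows whose sport is not football, so a person who has a soccer row and a football row still appears. If the goal is people with no football record at all, plus the sports from sportlist that never occur, the following sketch shows one way to do it. It uses a reduced, hypothetical subset of the data to stay short, and "no football record at all" is an assumed reading of the question.

```python
import pandas as pd

# Reduced, hypothetical subset of the question's data.
frame = pd.DataFrame({
    'first_name': ['Alex', 'Amy', 'Alex', 'Billy'],
    'last_name': ['Anderson', 'Ackerman', 'Anderson', 'Bonder'],
    'sport': ['soccer', 'soccer', 'football', 'football'],
})
sportlist = pd.DataFrame({'sport': ['soccer', 'football', 'softball']})

# Sports in the reference list that never occur in the records.
missing_sports = sportlist.loc[~sportlist['sport'].isin(frame['sport']), 'sport'].tolist()

# People with no football row: left-merge all people against the football
# players and keep those found only on the left side.
names = ['first_name', 'last_name']
everyone = frame[names].drop_duplicates()
footballers = frame.loc[frame['sport'] == 'football', names].drop_duplicates()
merged = everyone.merge(footballers, on=names, how='left', indicator=True)
never_football = merged.loc[merged['_merge'] == 'left_only', names].values.tolist()

print(missing_sports)
print(never_football)
```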

Why are these two methods of printing 2D lists different (Python)?

board = [[] for i in range(3)]
for i in board:
    for j in range(3):
        i.append(' ')
for i in board:
    print(i)
'''
[' ', ' ', ' ']
[' ', ' ', ' ']
[' ', ' ', ' ']'''
print(i for i in board)  # <generator object <genexpr> at 0x0000026E45CB69E8>
Why do the last two lines print two different things?
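In short, the for loop calls print once per row, while print(i for i in board) passes a single argument, a generator expression, to print. A generator is lazy, so print shows the generator object itself rather than its items. The sketch below illustrates this and shows two ways to print the rows. The variable names are chosen for illustration.

```python
board = [[' '] * 3 for _ in range(3)]  # three independent rows

gen = (row for row in board)
print(gen)          # prints the generator object itself, not the rows

for row in gen:     # iterating the generator is what produces the rows
    print(row)

# Unpacking passes each row as its own argument to a single print call.
print(*board, sep='\n')

printed_lines = [str(row) for row in board]  # the text print shows per row
```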
