finding non matching records in pandas - python-3.x

I would like to identify whether a set of records is not represented by a distinct list of values. For example:
import pandas as pd
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'],
'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['9', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan'],
'sport' : ['soccer','soccer','soccer','soccer','soccer']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['9', '5', '6', '7'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice'],
'sport' : ['football','football','football','football']}
df_c = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
raw_data = {
'subject_id': ['1', '3', '5'],
'first_name': ['Alex', 'Allen', 'Ayoung'],
'last_name': ['Anderson', 'Ali', 'Atiches'],
'sport' : ['football','football','football']}
df_d = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name','sport'])
frames = [df_a,df_b,df_c,df_d]
frame = pd.concat(frames)
frame = frame.sort_values(by='subject_id')
raw_data = {
'sport':['soccer','football','softball']
}
sportlist = pd.DataFrame(raw_data,columns=['sport'])
Desired output: I would like to get a list of first_name and last_name pairs that do not play football. I would also like to be able to return a list of all the records, since softball is not represented in the original list.
I tried using merge with the how='outer', indicator=True options, but since there is a record that plays soccer there is a match. And 'right_only' yields no records, since softball was not populated in the original data.
Thanks,
aem

If you only want to get the names of people who do not play football, all you need to do is:
frame[frame.sport != 'football']
This selects only the rows for people who are not playing football.
If it has to be a list of name pairs, you can select the two name columns and call to_records(index=False):
frame[frame.sport != 'football'][['first_name', 'last_name']].to_records(index=False)
which returns a NumPy record array of tuples (call .tolist() on it if you need a plain Python list):
[('Alex', 'Anderson'), ('Amy', 'Ackerman'), ('Allen', 'Ali'),
('Alice', 'Aoni'), ('Brian', 'Black'), ('Ayoung', 'Atiches'),
('Bran', 'Balwner'), ('Bryce', 'Brice'), ('Betty', 'Btisan'),
('Billy', 'Bonder')]

You can also use the .loc indexer in pandas:
frame.loc[frame['sport'].ne('football'), ['first_name','last_name']].values.tolist()
[['Alex', 'Anderson'],
['Amy', 'Ackerman'],
['Allen', 'Ali'],
['Alice', 'Aoni'],
['Brian', 'Black'],
['Ayoung', 'Atiches'],
['Bran', 'Balwner'],
['Bryce', 'Brice'],
['Betty', 'Btisan'],
['Billy', 'Bonder']]
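Neither answer covers the second half of the question (softball is in sportlist but never appears in frame), and filtering rows by sport != 'football' still returns people such as Alex Anderson, who has both a soccer and a football record. A minimal sketch of one way to handle both, assuming a reduced stand-in for frame (the rows below are illustrative, not the full data from the question):

```python
import pandas as pd

# Reduced, illustrative stand-in for the question's concatenated `frame`.
frame = pd.DataFrame({
    "first_name": ["Alex", "Amy", "Alex", "Billy"],
    "last_name": ["Anderson", "Ackerman", "Anderson", "Bonder"],
    "sport": ["soccer", "soccer", "football", "football"],
})
sportlist = pd.DataFrame({"sport": ["soccer", "football", "softball"]})

# Sports listed in sportlist that never occur in frame.
missing_sports = sportlist.loc[
    ~sportlist["sport"].isin(frame["sport"]), "sport"
].tolist()

# People with no football row at all: left-merge the unique names against
# the football players and keep the names found only on the left side.
names = frame[["first_name", "last_name"]].drop_duplicates()
football = frame.loc[
    frame["sport"] == "football", ["first_name", "last_name"]
].drop_duplicates()
merged = names.merge(
    football, on=["first_name", "last_name"], how="left", indicator=True
)
never_football = merged.loc[
    merged["_merge"] == "left_only", ["first_name", "last_name"]
].values.tolist()

print(missing_sports)
print(never_football)
```

Whether "does not play football" means "has a non-football row" or "has no football row at all" depends on the intent of the question; the merge above implements the stricter reading.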

Related

Why list remains unchanged Python

After iterating through a list to change each value to an integer, the list remains unchanged with all the values still being strings.
As a result, sorting does not get applied either
a = ['14', '22', '4', '52', '54', '59']
for ea in a:
    ea = int(ea)
a.sort()
print(a)
Output: '14', '22', '4', '52', '54', '59'
Should be : 4, 14, 22, 52, 54, 59
Your code is not changing the list itself. You are creating a new variable, converting it to an int, and throwing it away.
Use this instead
a = ['14', '22', '4', '52', '54', '59']
a = list(map(int, a)) #this converts the strings into integers and assigns the new list to a
a.sort() #this sorts it
print (a)
ea = int(ea) does not change the element within the list. Since the list itself is never modified (which you can see if you print the list before sorting it), the sort operation is doing its job correctly: it is sorting strings here, not integer values.
You could change your loop to provide the index and modify the original entries in the list by using the enumerate function as follows:
a = ['14', '22', '4', '52', '54', '59']
for index, ea in enumerate(a):
    a[index] = int(ea)
a.sort()
print(a)
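The difference between rebinding the loop variable and actually changing the list can be checked directly. The sketch below is only an illustration; the names unchanged, as_ints and as_strings are made up for this example:

```python
a = ['14', '22', '4', '52', '54', '59']

# Rebinding the loop variable only changes the local name, not the list.
for ea in a:
    ea = int(ea)
unchanged = a == ['14', '22', '4', '52', '54', '59']

# A list comprehension builds a new list of ints, which then sorts numerically.
as_ints = sorted([int(ea) for ea in a])

# If the strings themselves should be kept, sort with a numeric key instead.
as_strings = sorted(a, key=int)

print(unchanged)
print(as_ints)
print(as_strings)
```

Using key=int is an alternative when the values must stay strings but should be ordered by their numeric value.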

Extracting Rows by specific keyword in Python (Without using Pandas)

My csv file looks like this:-
ID,Product,Price
1,Milk,20
2,Bottle,200
3,Mobile,258963
4,Milk,24
5,Mobile,10000
My code of extracting row is as follow :-
def search_data():
    fin = open('Products/data.csv')
    word = input() # "Milk"
    found = {}
    for line in fin:
        if word in line:
            found[word]=line
            return found
search_data()
When I run the code above, I get this output:-
{'Milk': '1,Milk ,20\n'}
If I search for "Milk", I want to get all the rows that have "Milk" as the Product.
Note:- Do this in plain Python only; don't use Pandas.
Expected output should be like this:-
[{"ID": "1", "Product": "Milk ", "Price": "20"},{"ID": "4", "Product": "Milk ", "Price": "24"}]
Can anyone tell me what I am doing wrong?
In your script, every time you assign found[word]=line it overwrites the value that was stored before, because a dictionary holds only one value per key. A better approach is to load all the data and then do the filtering:
If file.csv contains:
ID Product Price
1 Milk 20
2 Bottle 200
3 Mobile 10,000
4 Milk 24
5 Mobile 15,000
Then this script:
# load data:
with open('file.csv', 'r') as f_in:
    lines = [line.split() for line in map(str.strip, f_in) if line]
data = [dict(zip(lines[0], l)) for l in lines[1:]]
# print only items with 'Product': 'Milk'
print([i for i in data if i['Product'] == 'Milk'])
Prints only items with Product == Milk:
[{'ID': '1', 'Product': 'Milk', 'Price': '20'}, {'ID': '4', 'Product': 'Milk', 'Price': '24'}]
EDIT: If your data is separated by commas (,), you can use the csv module to read it:
File.csv contains:
ID,Product,Price
1,Milk ,20
2,Bottle,200
3,Mobile,258963
4,Milk ,24
5,Mobile,10000
Then the script:
import csv
# load data:
with open('file.csv', 'r', newline='') as f_in:
    csvreader = csv.reader(f_in, delimiter=',', quotechar='"')
    lines = [line for line in csvreader if line]
data = [dict(zip(lines[0], l)) for l in lines[1:]]
# print only items with 'Product': 'Milk'
print([i for i in data if i['Product'].strip() == 'Milk'])
Prints:
[{'ID': '1', 'Product': 'Milk ', 'Price': '20'}, {'ID': '4', 'Product': 'Milk ', 'Price': '24'}]
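The same load-then-filter idea can also be sketched with csv.DictReader, which builds one dictionary per row from the header line. This is only an illustration: the function name search_rows and its column parameter are made up here, and io.StringIO stands in for the real file so the example runs without anything on disk:

```python
import csv
import io

# Stand-in for the CSV file from the question.
sample = io.StringIO(
    "ID,Product,Price\n"
    "1,Milk ,20\n"
    "2,Bottle,200\n"
    "3,Mobile,258963\n"
    "4,Milk ,24\n"
    "5,Mobile,10000\n"
)

def search_rows(f, word, column="Product"):
    """Return every row whose `column` equals `word`, ignoring surrounding spaces."""
    reader = csv.DictReader(f)
    return [dict(row) for row in reader if row[column].strip() == word]

matches = search_rows(sample, "Milk")
print(matches)
```

Collecting the matches in a list, rather than in a dictionary keyed by the search word, is what keeps every matching row instead of only one.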

Concat multiple CSV rows into 1 in python

I am trying to concatenate the CSV rows. I tried to convert the CSV rows to a list with pandas, but 'nan' values get appended because some cells are empty.
I also tried using zip, but it concatenates column values.
with open(i) as f:
    lines = f.readlines()
res = ""
for i, j in zip(lines[0].strip().split(','), lines[1].strip().split(',')):
    res += "{} {},".format(i, j)
print(res.rstrip(','))
for line in lines[2:]:
    print(line)
I have data as below,
Input data:-
Input CSV Data
Expected Output:-
Output CSV Data
There are more than 3 rows; only a sample is given here.
Please suggest a way to achieve the above task without creating a new file, and point to any specific function or sample code.
This assumes your first line contains the correct number of columns. It reads the whole file, ignores empty cells ( ",,,,,," ), and accumulates enough data points to fill one row before moving on to the next row:
Write test file:
with open("f.txt", "w") as f:
    f.write("""Circle,Year,1,2,3,4,5,6,7,8,9,10,11,12
abc,2018,,,,,,,,,,,,
2.2,8.0,6.5,9,88,,,,,,,,,,
55,66,77,88,,,,,,,,,,
5,3.2,7
def,2017,,,,,,,,,,,,
2.2,8.0,6.5,9,88,,,,,,,,,,
55,66,77,88,,,,,,,,,,
5,3.2,7
""")
Process test file:
data = []  # all rows
temp = []  # holds cells until enough are collected for one row
with open("f.txt", "r") as r:
    # get the header and its length
    title = r.readline().rstrip().split(",")
    lenTitel = len(title)
    data.append(title)
    # process all remaining lines of the file
    for l in r:
        t = l.rstrip().split(",")  # read one line's data
        temp.extend(x for x in t if x)  # drops all empty ,, pieces, even in between
        # once enough data has accumulated, store one full row and keep the rest
        if len(temp) >= lenTitel:
            data.append(temp[:lenTitel])
            temp = temp[lenTitel:]
# any leftover cells form a final (possibly short) row
if temp:
    data.append(temp)
print(data)
Output:
[['Circle', 'Year', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'],
['abc', '2018', '2.2', '8.0', '6.5', '9', '88', '55', '66', '77', '88', '5', '3.2', '7'],
['def', '2017', '2.2', '8.0', '6.5', '9', '88', '55', '66', '77', '88', '5', '3.2', '7']]
Remarks:
Your file can't have leading newlines, otherwise the size of the title is incorrect.
Newlines in between do no harm.
You cannot have "empty" cells; they get eliminated.
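The accumulate-and-slice idea described in this answer can also be written as "flatten all non-empty cells, then cut them into header-sized chunks". The sketch below is a variant under the same assumptions (first line is the header, empty cells are dropped); the function name regroup and the sample text are invented for illustration:

```python
import csv
import io

def regroup(text):
    """Rebuild rows of header width from non-empty cells spread over many lines."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    width = len(header)
    # Flatten every remaining line, skipping empty cells and blank lines.
    cells = [cell for row in rows[1:] for cell in row if cell]
    # Slice the flat list into consecutive chunks of header width.
    body = [cells[i:i + width] for i in range(0, len(cells), width)]
    return [header] + body

sample = "A,B,C\nx,1,,\n2\n\ny,3\n4,,\n"
result = regroup(sample)
print(result)
```

Slicing in steps of the header width also copes with a single input line that carries more than one row's worth of cells.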
As long as nothing weird is going on in the files, something like this should work:
with open(i) as f:
    result = []
    for line in f:
        result += line.strip().split(',')
print(result)

Python CSV - Writing different lists in the same CSV-file

SEE UPDATE BELOW!
For my Python program I need to write 3 different lists to a csv file, each in a different column. Each list has a different size.
import csv
l1 = ['1', '2', '3', '4', '5']
l2 = ['11', '22', '33', '44']
l3 = ['111', '222', '333']
headerNames = ['Name1', 'Name2', 'Name3']
f = 'test.csv'
resultFile = open(f, 'w', newline='')
resultWriter = csv.writer(resultFile, delimiter=';')
resultWriter.writerow(headerNames)
for r in l3:
    resultFile.write(';' + ';' + r + '\n')
for r in l2:
    resultFile.write(';' + r + '\n')
for r in l1:
    resultFile.write(r + '\n')
resultFile.close()
Unfortunately this doesn't work. The values of each list are written below the previous list, each list in its own column. I would prefer to have the list values written beside one another, just like this:
1;11;111
2;22;222
etc.
I am sure there is an easy way to get this done, but after hours of trying I still cannot figure it out.
UPDATE:
I tried the following. It is progress, but I am still not there yet.
import csv
f = input('filename: ')
l1 = ['1', '2', '3', '4', '5']
l2 = ['11', '22', '33', '44']
l3 = ['111', '222', '333']
headerNames = ['Name1', 'Name2', 'Name3']
rows = zip(l1, l2, l3)
with open(f, 'w', newline='') as resultFile:
    resultWriter = csv.writer(resultFile, delimiter=';')
    resultWriter.writerow(headerNames)
    for row in rows:
        resultWriter.writerow(row)
It writes the data in the format I would like; however, the values 4, 5 and 44 are not written.
Your first attempt does not use the csv module properly, and it does not transpose the rows the way your second attempt does.
In the second attempt, zip stops as soon as the shortest list ends. You want itertools.zip_longest instead (with a fill value of 0, for instance):
import csv
import itertools
f = "out.csv"
l1 = ['1', '2', '3', '4', '5']
l2 = ['11', '22', '33', '44']
l3 = ['111', '222', '333']
headerNames = ['Name1', 'Name2', 'Name3']
rows = itertools.zip_longest(l1, l2, l3, fillvalue=0)
with open(f, 'w', newline='') as resultFile:
    resultWriter = csv.writer(resultFile, delimiter=';')
    resultWriter.writerow(headerNames)
    resultWriter.writerows(rows) # write all rows in a row :)
Output file contains:
Name1;Name2;Name3
1;11;111
2;22;222
3;33;333
4;44;0
5;0;0
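To see the difference between zip and itertools.zip_longest without touching the file system, the rows can be written to an in-memory buffer. This is a sketch only: the io.StringIO buffer, the empty-string fill value and the explicit lineterminator are choices made for this illustration, not part of the answer above:

```python
import csv
import io
import itertools

l1 = ['1', '2', '3', '4', '5']
l2 = ['11', '22', '33', '44']
l3 = ['111', '222', '333']

# zip stops at the shortest input, so only three rows survive.
short_rows = list(zip(l1, l2, l3))

# zip_longest pads the shorter lists; an empty string leaves the cell blank.
rows = itertools.zip_longest(l1, l2, l3, fillvalue='')

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
writer.writerow(['Name1', 'Name2', 'Name3'])
writer.writerows(rows)
print(buffer.getvalue())
```

An empty fill value keeps the padded cells blank, which may be preferable to 0 when the columns hold names or other non-numeric data.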

remove empty strings from spark RDD

I have an RDD which I am tokenizing like this to give me list of tokens
data = sqlContext.read.load('file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
data = data.rdd.map(lambda x: x.desc)
stopwords = set(sc.textFile('stopwords.txt').collect())
tokens = (data.map(lambda document: document.strip().lower())
              .map(lambda document: re.split(r"[\s;,#]", document))
              .map(lambda word: [str(w) for w in word if w not in stopwords]))
>>> print tokens.take(5)
[['35', 'year', 'wild', 'elephant', 'named', 'sidda', 'villagers', 'manchinabele', 'dam', 'outskirts', 'bengaluru', '', 'cared', 'wildlife', 'activists', 'suffered', 'fracture', 'developed', 'mu'], ['tamil', 'nadu', 'vivasayigal', 'sangam', 'reiterates', 'demand', 'declaring', 'tamil', 'nadu', 'drought', 'hit', 'sanction', 'compensation', 'affected', 'farmers'], ['triggers', 'rumours', 'income', 'tax', 'raids', 'quarries'], ['', 'president', 'barack', 'obama', 'ordered', 'intelligence', 'agencies', 'review', 'cyber', 'attacks', 'foreign', 'intervention', '2016', 'election', 'deliver', 'report', 'leaves', 'office', 'january', '20', '', '2017'], ['death', 'note', 'driver', '', 'bheema', 'nayak', '', 'special', 'land', 'acquisition', 'officer', '', 'alleging', 'laundered', 'mining', 'baron', 'janardhan', 'reddys', 'currency', 'commission', '']]
There are a few '' items in the lists which I am unable to remove. How can I remove them?
This is not working:
tokens = tokens.filter(lambda lst: filter(None, lst))
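This should work:
tokens = tokens.map(lambda lst: list(filter(None, lst)))
RDD.filter expects a function that returns a boolean, and it uses that result to keep or drop each element as a whole. In your case the function returns a list, and any non-empty list counts as true, so every list is kept unchanged. map, by contrast, replaces each list with its cleaned version. The list(...) wrapper matters on Python 3, where the built-in filter returns a lazy iterator rather than a list.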
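The keep-or-drop versus transform distinction can be reproduced without Spark. The sketch below is a plain-Python analogy using ordinary lists rather than RDDs, with a shortened sample of the tokens from the question; the names kept and cleaned are made up for this illustration:

```python
tokens = [['35', 'year', '', 'wild'], ['', 'president', 'barack', '']]

# Used as a predicate: the returned list is only tested for truthiness,
# so each inner list is kept or dropped as a whole and '' items survive.
kept = [lst for lst in tokens if list(filter(None, lst))]

# Used as a transformation: each inner list is replaced by its cleaned copy.
cleaned = [list(filter(None, lst)) for lst in tokens]

print(kept)
print(cleaned)
```

The first form corresponds to what the question's tokens.filter(...) call does, the second to the tokens.map(...) call in the answer.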
