Compare pandas dataframes and check overlaps? - python-3.x

I am trying my hand at spam filters. I tried several methods to label text files as spam, and as a result I have three dataframes. They basically look like this:
df_method_1 = pd.DataFrame({'file': ['A', 'B', 'C'], 'spam': ['1', '0', '0']})
df_method_2 = pd.DataFrame({'file': ['A', 'B', 'C'], 'spam': ['1', '1', '0']})
df_method_3 = pd.DataFrame({'file': ['A', 'B', 'C'], 'spam': ['1', '1', '0']})
I am now trying to create a dataframe showing whether a file was labeled as spam and, if so, by which method(s).
In the best case, I can create a dataframe containing the following information:
df_summary = pd.DataFrame({'file': ['A','B' ,'C'], 'spam': ['All methods', 'Method 2 & Method 3', 'No method']})
To be clear, I am only after the information itself; the exact strings do not matter.
I tried pandas.DataFrame.isin() to make it happen, but I failed. Any ideas how to do this?

How about merge()?
df1.merge(df2, on="file").merge(df3, on="file")
  file spam_x spam_y spam
0    A      1      1    1
1    B      0      1    1
2    C      0      0    0
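To go from the merged columns to the "which methods flagged it" summary the question asks for, one possible sketch (the `suffixes` argument, the `spam_3` rename, and the `flagged_by` column name are assumptions, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'file': ['A', 'B', 'C'], 'spam': ['1', '0', '0']})
df2 = pd.DataFrame({'file': ['A', 'B', 'C'], 'spam': ['1', '1', '0']})
df3 = pd.DataFrame({'file': ['A', 'B', 'C'], 'spam': ['1', '1', '0']})

# merge so each method keeps its own spam column
merged = df1.merge(df2, on='file', suffixes=('_1', '_2')).merge(df3, on='file')
merged = merged.rename(columns={'spam': 'spam_3'})

# per file, collect the names of the methods that flagged it as spam
methods = ['Method 1', 'Method 2', 'Method 3']
merged['flagged_by'] = merged[['spam_1', 'spam_2', 'spam_3']].apply(
    lambda row: [m for m, v in zip(methods, row) if v == '1'], axis=1)
print(merged[['file', 'flagged_by']])
```

A list of method names avoids hard-coding the display strings, as the question requested.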


Custom xticks in seaborn heatmap

I have the following heatmap (just a minimal working example; my data is huge!)
df = pd.DataFrame({'set1': ['0', '2', '2'],
                   'set2': ['1', '2', '0'],
                   'set3': ['0', '2', '1'],
                   'set4': ['1', '4', '1']
                   }).T.astype(float)
sns.heatmap(df, yticklabels = df.index, xticklabels = df.columns)
How can I show only the xticks of the columns where all rows are >= 2? In this example that means keeping only the '1' xtick.
So in this image the '0' and '2' column names should not appear; only '1', because that is the only column where all values are greater than or equal to 2.
The problem is that the x axis gets too crowded. I want to show only the xticklabels of the columns where all values are >= 2, while still plotting everything.
Mask the DataFrame
This removes the columns in which not every value is >= the specified value:
# create a Boolean mask of df
mask = df.ge(2)
# apply the mask to df and dropna
dfm = df[mask].dropna(axis=1)
# plot the masked df
ax = sns.heatmap(dfm)
mask
0 1 2
set1 False True True
set2 False True False
set3 False True False
set4 False True False
Mask the xtick labels
Labels for columns in which not every value is >= the specified value are replaced with '':
# create a Boolean mask of df
mask = df.ge(2).all()
# use the mask to update a list of labels
cols = [col if m else '' for (col, m) in zip(df.columns, mask)]
# plot with custom labels
ax = sns.heatmap(df, xticklabels=cols)
mask
0 False
1 True
2 False
dtype: bool
Are you looking to show the same heatmap, but only show xticklabels where ALL values are >= 2? One way to do this is to not pass df.columns to heatmap, but to mask the labels and show only the ones you want. See if this is what you are looking for...
df = pd.DataFrame({'set1': ['0', '2', '2'],
                   'set2': ['1', '2', '0'],
                   'set3': ['0', '2', '1'],
                   'set4': ['1', '4', '1']
                   }).T.astype(float)
cols = []  # declare a new list to be used for xticklabels
for col in df.columns:
    if col in set(df.columns).intersection(df[df >= 2].T.dropna().index):
        cols.append(col)  # if all values are >= 2, add to the label list
    else:
        cols.append('')   # else, add a blank
sns.heatmap(df, yticklabels=df.index, xticklabels=cols)  # plot using the new list

Python extract unknown string from dataframe column

New to python - using v3. I have a dataframe column that looks like
object
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Training New"}},"objectType":"Activity"}
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Influx"}},"objectType":"Activity"}
{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Social"}},"objectType":"Activity"}
{"id":"http://Demo/2.18","definition":{"name":{"en-US":"Personal"}},"objectType":"Activity"}
I need to extract the activity, which starts in a variable place and is of variable length. I do not know what the activities are. All the questions I've found are to extract a specific string or pattern, not an unknown one. If I use the code below
dataExtract['activity'] = dataExtract['object'].str.find('en-US":"')
Will give me the start index and this
dataExtract['activity'] = dataExtract['object'].str.rfind('"}}')
Will give me the end index. So I have tried combining these
dataExtract['activity'] = dataExtract['object'].str[dataExtract['object'].str.find('en-US":"'):dataExtract['object'].str.rfind('"}}')]
But that just generates "NaN", which is clearly wrong. What syntax should I use, or is there a better way to do it? Thanks
I suggest converting the values to nested dictionaries and then extracting the value by its nested keys:
#if necessary
#import ast
#dataExtract['object'] = dataExtract['object'].apply(ast.literal_eval)
dataExtract['activity'] = dataExtract['object'].apply(lambda x: x['definition']['name']['en-US'])
print (dataExtract)
object activity
0 {'id': 'http://Demo/1.7', 'definition': {'name... Time Training New
1 {'id': 'http://Demo/1.7', 'definition': {'name... Time Influx
2 {'id': 'http://Demo/1.7', 'definition': {'name... Social
3 {'id': 'http://Demo/2.18', 'definition': {'nam... Personal
Details:
print (dataExtract['object'].apply(lambda x: x['definition']))
0 {'name': {'en-US': 'Time Training New'}}
1 {'name': {'en-US': 'Time Influx'}}
2 {'name': {'en-US': 'Social'}}
3 {'name': {'en-US': 'Personal'}}
Name: object, dtype: object
print (dataExtract['object'].apply(lambda x: x['definition']['name']))
0 {'en-US': 'Time Training New'}
1 {'en-US': 'Time Influx'}
2 {'en-US': 'Social'}
3 {'en-US': 'Personal'}
Name: object, dtype: object
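If the column actually holds raw JSON strings rather than Python dicts, `json.loads` works in place of the commented-out `ast.literal_eval` step. A self-contained sketch, with sample data reconstructed from the question:

```python
import json
import pandas as pd

# sample column reconstructed from the question (assumed to hold raw JSON strings)
dataExtract = pd.DataFrame({'object': [
    '{"id":"http://Demo/1.7","definition":{"name":{"en-US":"Time Training New"}},"objectType":"Activity"}',
    '{"id":"http://Demo/2.18","definition":{"name":{"en-US":"Personal"}},"objectType":"Activity"}',
]})

# parse each string into a nested dict, then drill down to the activity name
parsed = dataExtract['object'].apply(json.loads)
dataExtract['activity'] = parsed.apply(lambda d: d['definition']['name']['en-US'])
print(dataExtract['activity'].tolist())
```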

Export Python dict of nested lists of varying lengths to csv. If nested list has > 1 entry, expand to column before moving to next key

I have the following dictionary of lists
d = {1: ['1','B1',['C1','C2','C3']], 2: ['2','B2','C15','D12'], 3: ['3','B3'], 4: ['4', 'B4', 'C4', ['D1', 'D2']]}
writing that to a csv using
with open('test.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(d.values())
gives me a csv that looks like
A  B   C                   D
1  B1  ['C1','C2','C3']
2  B2  C15                 D12
3  B3
4  B4  C4                  ['D1','D2']
If there is a multiple item list in the value (nested list?), I would like that list to be expanded down the column like this
A B C D
1 B1 C1
1 C2
1 C3
2 B2 C15 D12
3 B3
4 B4 C4 D1
4 D2
I'm fairly new to python and can't seem to figure out a way to do what I need after a few days sifting through forums and banging my head on the wall. I think I may need to break apart the nested lists, but I need to keep them tied to their respective "A" value. Columns A and B will always have 1 entry, columns C and D can have 1 to X number of entries.
Any help is much appreciated
Seems like it might be easier to make a list of lists, with appropriately-located empty spaces, than what you're doing. Here's something that might do:
import csv
from itertools import zip_longest

def condense(dct):
    # get the maximum number of columns of any list
    num_cols = len(max(dct.values(), key=len)) - 1
    # ignore the key, it's not really relevant
    for _, v in dct.items():
        # first, memorize the index of this list,
        # since we need to repeat it no matter what
        idx = v[0]
        # next, use zip_longest to make a correspondence.
        # We deliberately build a 2D list,
        # and will later withdraw elements from it one by one.
        matrix = [([] if elem is None else
                   [elem] if not isinstance(elem, list) else
                   elem[:]  # soft copy to avoid altering the original dict
                   ) for elem, _ in zip_longest(v[1:], range(num_cols), fillvalue=None)
                  ]
        # now, output the top row of the matrix as long as it has contents
        while any(matrix):
            # If a column in the matrix is empty, we put an empty string.
            # Otherwise, we remove elements as we pass through them,
            # progressively emptying the matrix top-to-bottom
            # as we output each row.
            # *-notation is more convenient than concatenating these two lists.
            yield [idx, *((col.pop(0) if col else '') for col in matrix)]
        # e.g. for key 0 and a matrix that looks like this:
        # [['a1', 'a2'],
        #  ['b1'],
        #  ['c1', 'c2', 'c3']]
        # this would yield the following three lists before moving on:
        # ['0', 'a1', 'b1', 'c1']
        # ['0', 'a2', '',   'c2']
        # ['0', '',   '',   'c3']
        # where '' parses into an empty column in the resulting CSV.
The biggest thing to note here is that I use isinstance(elem, list) as a shorthand to check whether the element is a list (which you need to be able to do, one way or another, to flatten or expand lists as we do here). If you have more complicated or more varied data structures, you'll need to improvise with this check - maybe write a helper function isiterable() that tries to iterate through the element and returns a boolean based on whether doing so raised an error.
That done, we can call condense() on d and have the csv module deal with the output.
headers = ['A', 'B', 'C', 'D']
d = {1: ['1','B1',['C1','C2','C3']], 2: ['2','B2','C15','D12'], 3: ['3','B3'], 4: ['4', 'B4', 'C4', ['D1', 'D2']]}
# condense(d) produces
# [['1', 'B1', 'C1', '' ],
# ['1', '', 'C2', '' ],
# ['1', '', 'C3', '' ],
# ['2', 'B2', 'C15', 'D12'],
# ['3', 'B3', '', '' ],
# ['4', 'B4', 'C4', 'D1' ],
# ['4', '', '', 'D2' ]]
with open('test.csv', "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(condense(d))
Which produces the following file:
A,B,C,D
1,B1,C1,
1,,C2,
1,,C3,
2,B2,C15,D12
3,B3,,
4,B4,C4,D1
4,,,D2
This is equivalent to your expected output. Hopefully the solution is sufficiently extensible for you to apply it to your non-MVCE problem.

Concat multiple CSV rows into 1 in python

I am trying to concat the CSV rows. I tried converting the CSV rows to a list with pandas, but it appends 'nan' values because some fields are empty.
Also, I tried using zip but it concats column values.
with open(i) as f:
    lines = f.readlines()
    res = ""
    for i, j in zip(lines[0].strip().split(','), lines[1].strip().split(',')):
        res += "{} {},".format(i, j)
    print(res.rstrip(','))
    for line in lines[2:]:
        print(line)
I have data as below,
Input data: [input CSV screenshot]
Expected output: [output CSV screenshot]
There are more than 3 rows; only a sample is given here.
Please suggest a way to achieve the above without creating a new file, and point me to a relevant function or sample code.
This assumes your first line contains the correct number of columns. It reads the whole file, ignores empty data (",,,,,,"), accumulates enough data points to fill one row, then switches to the next row:
Write test file:
with open("f.txt", "w") as f:
    f.write("""Circle,Year,1,2,3,4,5,6,7,8,9,10,11,12
abc,2018,,,,,,,,,,,,
2.2,8.0,6.5,9,88,,,,,,,,,,
55,66,77,88,,,,,,,,,,
5,3.2,7
def,2017,,,,,,,,,,,,
2.2,8.0,6.5,9,88,,,,,,,,,,
55,66,77,88,,,,,,,,,,
5,3.2,7
""")
Process test file:
data = []  # all data
temp = []  # data storage until enough found, then moved into data
with open("f.txt", "r") as r:
    # get the header and its length
    title = r.readline().rstrip().split(",")
    lenTitel = len(title)
    data.append(title)
    # process all remaining lines of the file
    for l in r:
        t = l.rstrip().split(",")        # read one line's data
        temp.extend(x for x in t if x)   # eliminates all empty ,, pieces, even in between
        # if enough data has accumulated, put it as a sublist into data, keep the rest
        if len(temp) > lenTitel:
            data.append(temp[:lenTitel])
            temp = temp[lenTitel:]
if temp:
    data.append(temp)
print(data)
Output:
[['Circle', 'Year', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'],
['abc', '2018', '2.2', '8.0', '6.5', '9', '88', '55', '66', '77', '88', '5', '3.2', '7'],
['def', '2017', '2.2', '8.0', '6.5', '9', '88', '55', '66', '77', '88', '5', '3.2', '7']]
Remarks:
your file can't have leading newlines, otherwise the size of the title is incorrect.
newlines in between do no harm.
you cannot have "empty" cells - they get eliminated.
As long as nothing weird is going on in the files, something like this should work:
with open(i) as f:
    result = []
    for line in f:
        result += line.strip().split(',')
    print(result)
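Building on the header-width idea from the first answer, that flattened list can also be re-chunked into rows of header length in two lines. A sketch with assumed sample data (two logical rows spilled across several physical lines):

```python
import csv
import io

# assumed sample: a header line followed by rows whose cells spill across lines
raw = """Circle,Year,1,2
abc,2018,,
2.2,8.0,,
def,2017,
5,3.2
"""

reader = csv.reader(io.StringIO(raw))
header = next(reader)
width = len(header)

# flatten every non-empty cell, then slice the flat list back into header-width rows
cells = [cell for row in reader for cell in row if cell]
rows = [cells[i:i + width] for i in range(0, len(cells), width)]
print([header] + rows)
```

Like the accepted approach, this assumes no logical row is missing values, since empty cells are discarded before re-chunking.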

Matrix input from a text file(python 3)

I'm trying to find a way to input a matrix from a text file;
for example, a text file would contain
1 2 3
4 5 6
7 8 9
And it would make a matrix with those numbers and put it in matrix = [[1,2,3],[4,5,6],[7,8,9]]
And then this has to be compatible with the way I print the matrix:
print('\n'.join([' '.join(map(str, row)) for row in matrix]))
So far,I tried this
path = input('enter file location')
f = open(path, 'r')
matrix = [map(int, line.split(',')) for line in f if line.strip() != ""]
All it does is give me map objects, and it raises an error when I try to print the matrix.
What am I doing wrong? matrix should contain the matrix read from the text file, not map objects, and I don't want to use an external library such as numpy.
Thanks
You can use list comprehension as such:
myfile.txt:
1 2 3
4 5 6
7 8 9
>>> matrix = open('myfile.txt').read()
>>> matrix = [item.split() for item in matrix.split('\n')[:-1]]
>>> matrix
[['1', '2', '3'], ['4', '5', '6'], ['7', '8', '9']]
>>>
You can also create a function for this:
>>> def matrix(file):
... contents = open(file).read()
... return [item.split() for item in contents.split('\n')[:-1]]
...
>>> matrix('myfile.txt')
[['1', '2', '3'], ['4', '5', '6'], ['7', '8', '9']]
>>>
This works with both Python 2 (e.g. 2.7.10) and Python 3 (e.g. 3.6.4):
rows = 3
cols = 3
with open('in.txt') as f:
    data = []
    for i in range(0, rows):
        data.append(list(map(int, f.readline().split()[:cols])))
print(data)
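If the row count isn't known in advance, a list comprehension over the non-blank lines converts every token to int in one pass and stays compatible with the join-based printing from the question. A minimal sketch (the sample file is written inline for illustration):

```python
# write a small sample file matching the question (assumed content)
with open('in.txt', 'w') as f:
    f.write('1 2 3\n4 5 6\n7 8 9\n')

# read every non-blank line and convert each whitespace-separated token to int
with open('in.txt') as f:
    matrix = [[int(tok) for tok in line.split()] for line in f if line.strip()]

# the join-based printing from the question works unchanged
print('\n'.join(' '.join(map(str, row)) for row in matrix))
```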
