Splitting a list entry in Python - python-3.x

I am importing a CSV file into a list in Python. When I split each line into list elements and then print an index, the entry is printed like this:
2000-01-03,3.745536,4.017857,3.631696,3.997768,2.695920,133949200
How would I split this list so that I could print just a single element, like this?
2000-01-03
Here is my code so far:
def main():
    list = []
    filename = "AAPL.csv"
    with open(filename) as x:
        for line in x.readlines():
            val = line.strip('\n').split(',')
            list.append(val)
    print(list[2])

Your current code builds a list of lists, more precisely a list (of rows) of lists (of fields).
To extract one single element, say the first field of the third row, you could do:
...
print(list[2][0])
But except for trivial tasks, you should use the csv module when processing CSV files, because it is robust against corner cases like newlines or field separators contained inside fields. Your code could become:
import csv

def main():
    list = []
    filename = "AAPL.csv"
    with open(filename) as x:
        rd = csv.reader(x)
        for val in rd:  # the reader is an iterator of lists of fields
            list.append(val)
    print(list[2][0])
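As a quick illustration of those corner cases (made-up data, not from AAPL.csv): a quoted field containing the separator stays in one piece with csv.reader, whereas a plain split(',') breaks it apart.

import csv
import io

sample = io.StringIO('2000-01-03,"a, quoted, field",4.017857\n')
print(next(csv.reader(sample)))
# ['2000-01-03', 'a, quoted, field', '4.017857']
print('2000-01-03,"a, quoted, field",4.017857'.split(','))
# ['2000-01-03', '"a', ' quoted', ' field"', '4.017857']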

Related

Instead of printing to console create a dataframe for output

I am currently comparing the text of one file to that of another file.
The method: for each row in the source text file, check each row in the compare text file.
If the word is present in the compare file then write the word and write 'present' next to it.
If the word is not present, then write the word and write 'not_present' next to it.
So far I can do this fine by printing to the console output, as shown below:
import sys

filein = 'source.txt'
compare = 'compare.txt'
source = 'source.txt'

# change to lower case
with open(filein, 'r+') as fopen:
    string = ""
    for line in fopen.readlines():
        string = string + line.lower()
with open(filein, 'w') as fopen:
    fopen.write(string)

# search and list
with open(compare) as f:
    searcher = f.read()
if not searcher:
    sys.exit("Could not read data :-(")

# search and output the results
with open(source) as f:
    for item in (line.strip() for line in f):
        if item in searcher:
            print(item, ',present')
        else:
            print(item, ',not_present')
the output looks like this:
dog ,present
cat ,present
mouse ,present
horse ,not_present
elephant ,present
pig ,present
What I would like is to put this into a pandas dataframe, preferably with 2 columns: one for the word and a second for its state. I can't seem to get my head around doing this.
I am making several assumptions here, including:
Compare.txt is a text file consisting of a list of single words, 1 word per line.
Source.txt is a free-flowing text file, which includes multiple words per line, each word separated by a space.
When comparing to determine whether a compare word is in source, it is found if and only if no punctuation marks (i.e. " ' , . ? etc.) are appended to the word in source.
The output dataframe will only contain the words found in compare.txt.
The final output is a printed version of the pandas dataframe.
With these assumptions:
import pandas as pd
from collections import defaultdict

compare = 'compare.txt'
source = 'source.txt'
rslt = defaultdict(list)

def getCompareTxt(fid: str) -> list:
    clist = []
    with open(fid, 'r') as cmpFile:
        for line in cmpFile.readlines():
            clist.append(line.lower().strip('\n'))
    return clist

cmpList = getCompareTxt(compare)

if cmpList:
    with open(source, 'r') as fsrc:
        items = []
        for item in (line.strip().split(' ') for line in fsrc):
            items.extend(item)
    print(items)
    for cmpItm in cmpList:
        rslt['Name'].append(cmpItm)
        if cmpItm in items:
            rslt['State'].append('Present')
        else:
            rslt['State'].append('Not Present')
    df = pd.DataFrame(rslt, index=range(len(cmpList)))
    print(df)
else:
    print('No compare data present')
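With the example words from the console output above (and their hypothetical presence/absence results), the printed dataframe would look roughly like this:

       Name        State
0       dog      Present
1       cat      Present
2     mouse      Present
3     horse  Not Present
4  elephant      Present
5       pig      Present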

Converting list to dictionary, and tokenizing the key values - possible?

So basically I have a folder of files I'm opening and reading into python.
I want to search these files and count the keywords in each file, to make a dataframe like the attached image.
I have managed to open and read these files into a list, but my problem is as follows:
Edit 1:
I decided to try and import the files as a dictionary instead. It works, but when I try to lower-case the values, I get a 'list' object attribute error - even though in my variable explorer, it's defined as a dictionary.
import os

filenames = os.listdir('.')
file_dict = {}
for file in filenames:
    with open(file) as f:
        items = [i.strip() for i in f.read().split(",")]
    file_dict[file.replace(".txt", "")] = items

def lower_dict(d):
    new_dict = dict((k, v.lower()) for k, v in d.items())
    return new_dict

print(lower_dict(file_dict))
Output:
AttributeError: 'list' object has no attribute 'lower'
Pre-edit post:
1. Each list value doesn't retain the filename key. So I don't have the rows I need.
2. I can't conduct a search of keywords in the list anyway, because it is not tokenized. So I can't count the keywords per file.
Here's my code for opening the files, converting them to lowercase and storing them in a list.
How can I transform this into a dictionary retaining the filename, with tokenized key values? Additionally, is it better to somehow import the file and contents into a dictionary directly? Can I still tokenize and lower-case everything?
import os
import nltk

# create list of filenames to loop over
filenames = os.listdir('.')

# create empty lists for storage
Lcase_content = []
tokenized = []
num = 0

# read files from folder, convert to lower case
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read()
            # convert to lower-case value
            Lcase_content.append(content.lower())
            ## these two lines below don't work - index out of range error
            tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num])
            num = num + 1
You can compute the count of each token by using collections.Counter. collections.Counter can take a list of strings and return a dictionary-like Counter with each token as a key and the count of that token as the value. Since NLTK's word_tokenize takes a string and returns a list of tokens, to get a dictionary with tokens and their counts you can basically do this:
Counter(nltk.tokenize.word_tokenize(text))
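For example, with a made-up sentence (this assumes the NLTK punkt tokenizer data has been downloaded):

from collections import Counter
import nltk

text = "the cat sat on the mat"
print(Counter(nltk.tokenize.word_tokenize(text)))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})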
Since you want your file names as the index (first column), make it a nested dictionary, with a file name as the key and another dictionary of tokens and counts as the value, which looks like this:
{'file1.txt': Counter({'cat': 4, 'dog': 0, 'squirrel': 12, 'sea horse': 3}),
'file2.txt': Counter({'cat': 11, 'dog': 4, 'squirrel': 17, 'sea horse': 0})}
If you are familiar with Pandas, you can convert your dictionary to a Pandas dataframe. It will make your life much easier when working with any tsv/csv/excel file, since you can export the resulting dataframe as a csv file. Make sure you apply .lower() to your file content and include orient='index' so that the file names become your index.
import os
import nltk
from collections import Counter
import pandas as pd

result = dict()
filenames = os.listdir('.')
for filename in filenames:
    if filename.endswith(".txt"):
        with open(os.path.join('.', filename)) as file:
            content = file.read().lower()
            result[filename] = Counter(nltk.tokenize.word_tokenize(content))

df = pd.DataFrame.from_dict(result, orient='index').fillna(0)
df['total words'] = df.sum(axis=1)
df.to_csv('words_count.csv', index=True)
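With the hypothetical nested dictionary shown earlier, the resulting dataframe (and the exported words_count.csv) would look roughly like this:

           cat  dog  squirrel  sea horse  total words
file1.txt    4    0        12          3           19
file2.txt   11    4        17          0           32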
Re: your first attempt: since your items is a list (see [i.strip() for i in f.read().split(",")]), you can't apply .lower() to it; .lower() only works on individual strings.
Re: your second attempt: your tokenized list is empty because it was initialized as tokenized = []. That's why tokenized[num] = nltk.tokenize.word_tokenize(tokenized[num]) with num = 0 gives you the index out of range error.
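If the goal of the first attempt is simply to lower-case every string, a minimal sketch (reusing the file_dict structure from the question, with made-up sample data) would be to lower-case each string inside every list value:

def lower_dict(d):
    # lower-case each string inside every list value, not the list itself
    return {k: [s.lower() for s in v] for k, v in d.items()}

file_dict = {"file1": ["Cat", "DOG"], "file2": ["Mouse"]}  # made-up sample data
print(lower_dict(file_dict))
# {'file1': ['cat', 'dog'], 'file2': ['mouse']}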

How do I delete rows in one CSV based on another CSV

I am working with two CSV files, both contain only one column of data, but are over 50,000 rows. I need to compare the data from CSV1 against CSV2 and remove any data that displays in both of these files. I would like to print out the final list of data as a 3rd CSV file if possible.
The CSV files contain usernames. I have tried running deduplication scripts but realize that this does not remove entries found in both CSV files entirely since it only removes the duplication of a username. This is what I have been currently working with but I can already tell that this isn't going to give me the results I am looking for.
import csv

AD_AccountsCSV = open("AD_Accounts.csv", "r")
BA_AccountsCSV = open("BA_Accounts.csv", "r+")

def Remove(x, y):
    final_list = []
    for item in x:
        if item not in y:
            final_list.append(item)
    for i in y:
        if i not in x:
            final_list.append(i)
    print(final_list)
The way that I wrote this code would print the results within the terminal after running the script but I realize that my output may be around 1,000 entries.
# define the paths
fpath1 = "/path/to/file1.csv"
fpath2 = "/path/to/file2.csv"
fpath3 = "/path/to/your/file3.csv"

with open(fpath1) as f1, open(fpath2) as f2, open(fpath3, "w") as f3:
    l1 = [line.strip() for line in f1.readlines()]
    l2 = [line.strip() for line in f2.readlines()]
    # keep only the entries that appear in exactly one of the two files
    not_in_both = [x for x in set(l1 + l2) if (x in l1) != (x in l2)]
    for x in not_in_both:
        print(x, file=f3)
The with open() as ... clause takes care of closing the file.
You can combine several file openings under with.
Assuming that the elements in the files are the only things on each line, I used a simple readlines() and stripped the trailing newline character from each line (readlines() itself keeps it). Otherwise this step becomes more complicated.
List comprehensions make it easy to filter lists by a condition.
The default end='\n' in print() adds a newline at the end of each print, so every entry lands on its own line in the output file.
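Since the question mentions files of over 50,000 rows each, it may be worth noting (a sketch, not part of the original answer) that a set symmetric difference performs the same filtering without repeated list membership tests:

with open(fpath1) as f1, open(fpath2) as f2, open(fpath3, "w") as f3:
    s1 = {line.strip() for line in f1}
    s2 = {line.strip() for line in f2}
    for item in sorted(s1 ^ s2):  # ^ keeps the items present in exactly one of the two sets
        print(item, file=f3)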
Following the approach you used:
For formatting code, please follow official style guides, e.g.
https://www.python.org/dev/peps/pep-0008/
def select_exclusive_accounts(path_to_f1, path_to_f2, path_to_f3):
    # you have quite huge indentations - use 4 spaces!
    with open(path_to_f1) as f1, open(path_to_f2) as f2, \
            open(path_to_f3, "w") as f3:
        in_f1 = f1.readlines()
        in_f2 = f2.readlines()
        for item in in_f1:
            if item not in in_f2:
                f3.write(item)
        for i in in_f2:
            if i not in in_f1:
                f3.write(i)

select_exclusive_accounts("AD_Accounts.csv",
                          "BA_Accounts.csv",
                          "exclusive_accounts.csv")
Also, no imports are needed here because only built-in Python functionality is used.

How to make tokenize not treat contractions and their counter parts as the same when comparing two text files?

I am currently working on a data structure that is supposed to compare two text files and make a list of the strings they have in common. My program receives the contents of the two files as two strings a and b (one file's content per variable). I then use the sent_tokenize function in a for loop to break each string into sentences, and the sentences are stored in a set to avoid duplicate entries, so all duplicate lines within each variable are removed before the comparison. I then compare the two variables to each other and keep only the strings they have in common. I have a bug that occurs in that last comparison step: the program treats contractions and their expanded counterparts as the same when it should not. For example it reads "Should not" and "Shouldn't" as the same and produces an incorrect answer. I want it to not treat contractions and their expanded counterparts as the same.
import nltk

def sentences(a, b):  # the variables store the contents of the files in the form of strings
    a_placeholder = a
    set_a = set()
    a = []
    for punctuation_a in nltk.sent_tokenize(a_placeholder):
        if punctuation_a not in set_a:
            set_a.add(punctuation_a)
            a.append(punctuation_a)
    b_placeholder = b
    set_b = set()
    b = []
    for punctuation_b in nltk.sent_tokenize(b_placeholder):
        if punctuation_b not in set_b:
            set_b.add(punctuation_b)
            b.append(punctuation_b)
    a_new = a
    for punctuation in a_new:
        if punctuation not in set_b:
            set_a.remove(punctuation)
            a.remove(punctuation)
        else:
            pass
    return []

Python read file contents into nested list

I have this file that contains something like this:
OOOOOOXOOOO
OOOOOXOOOOO
OOOOXOOOOOO
XXOOXOOOOOO
XXXXOOOOOOO
OOOOOOOOOOO
And I need to read it into a 2D list so it looks like this:
[[O,O,O,O,O,O,X,O,O,O,O],[O,O,O,O,O,X,O,O,O,O,O],[O,O,O,O,X,O,O,O,O,O,O],[X,X,O,O,X,O,O,O,O,O,O],[X,X,X,X,O,O,O,O,O,O,O],[O,O,O,O,O,O,O,O,O,O,O]]
I have this code:
ins = open(filename, "r" )
data = []
for line in ins:
number_strings = line.split() # Split the line on runs of whitespace
numbers = [(n) for n in number_strings]
data.append(numbers) # Add the "row" to your list.
return data
But it doesn't seem to be working because the O's and X's do not have spaces between them. Any ideas?
Just use data.append(list(line.rstrip())). list() accepts a string as an argument and splits it into its individual characters.
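Putting that into the original loop, a minimal sketch (the filename grid.txt is just a placeholder) could look like this:

def read_grid(filename):
    data = []
    with open(filename) as ins:
        for line in ins:
            data.append(list(line.rstrip()))  # each character becomes its own list element
    return data

grid = read_grid("grid.txt")
print(grid[0])  # e.g. ['O', 'O', 'O', 'O', 'O', 'O', 'X', 'O', 'O', 'O', 'O']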
