Add selected columns of a file as values to a dictionary - python-3.x

I am analyzing a small corpus, and want to create a dictionary based on 500k text files.
These text files consist of numbered lines with tab-separated columns of strings (or numbers), e.g.:
1 string1 string2 string3 # ...and so on, but I only need columns 2-4
2 string1 string2 string3 # ...and so on...
3 string1 string2 string3 # ...and so on...
4 string1 string2 string3 # ...and so on...
# ...and so on...
This is a simplification: the words are not necessarily the same in every line, but they do repeat across the whole corpus.
I want to create a dictionary with second column (with "string1") as a key and 3rd and 4th columns as values for that key, but also with a sum of all repetitions of a specific key within that corpus.
Should be something like this:
my_dict = {
    "string1": [99, "string2", "string3"],
    "other_token": [51, "other_lemma", "other_category"],
    # ...and so on (each dictionary key must be unique)...
}
So, "string1" stands for tokens, number is a counter for these tokens, "string2" stands for lemma, "string3" stands for category (some of them need to be omitted, as in the code below).
I've managed (with a lot of help from Stack Overflow) to piece together some code:
import os
import re

test_paths = ["path1", "path2", "path3"]
cat_to_omit = ["CAT1", "CAT2"]
tokens = {}
for path in test_paths:
    for file in os.listdir(path):
        file_path = os.path.join(path, file)
        with open(file_path) as f:
            for line in f:
                if re.match(r"^\d+", line):  # keep only lines starting with a number; some don't, and I don't need those
                    check = line.split()[3]
                    if check not in cat_to_omit:  # omit some categories that I don't need
                        token = line.lower().split()[1]
                        tokens[token] = tokens.get(token, 0) + 1
print(tokens)
As expected, right now I am only getting "string1" (the token) as a key, with the count of that token's occurrences in my corpus as the value. How can I store a list of three values for each key (token):
1. the counter, which I already have as the only value for each key,
2. the lemma, which should be taken from column 3 ("string2"),
3. the category, which should be taken from column 4 ("string3").
I just don't understand how to turn my "key: value" dictionary into a "key: 3 values" one.
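A minimal sketch of one way to do it, using the question's column layout (line number, token, lemma, category) and its `cat_to_omit` list; the sample `line` is an invented stand-in for one corpus row. The trick is to store a three-element list per token and update only its first element on repeats:

```python
# assumed sample line: "line number<TAB>token<TAB>lemma<TAB>category"
line = "1\tString1\tstring2\tCAT3"

cat_to_omit = ["CAT1", "CAT2"]          # as in the question's code
tokens = {}

cols = line.lower().split()
if cols[0].isdigit() and cols[3].upper() not in cat_to_omit:
    token, lemma, category = cols[1], cols[2], cols[3]
    if token in tokens:
        tokens[token][0] += 1           # bump the counter; lemma/category stay
    else:
        tokens[token] = [1, lemma, category]

print(tokens)                           # {'string1': [1, 'string2', 'cat3']}
```

The same `if token in tokens` / `else` branch can replace the `tokens.get(token, 0) + 1` line in the loop above, since `dict.get` alone cannot update one element of the stored list.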

Related

Count the number of times a word is repeated in a text file

I need to write a program that prompts for the name of a text file and prints the words with the maximum and minimum frequency, along with their frequency (separated by a space).
This is my text
I am Sam
Sam I am
That Sam-I-am
That Sam-I-am
I do not like
that Sam-I-am
Do you like
green eggs and ham
I do not like them
Sam-I-am
I do not like
green eggs and ham
Code:
file = open(fname, 'r')
dict1 = []
for line in file:
    line = line.lower()
    x = line.split(' ')
    if x in dict1:
        dict1[x] += 1
    else:
        dict1[x] = 1
Then I wanted to iterate over the keys and values to find the max and min frequency, but at that point my console says:
TypeError: list indices must be integers or slices, not list
I don't know what that means either.
For this problem the expected result is:
Max frequency: i 5
Min frequency: you 1
You are using a list instead of a dictionary to store the word frequencies. You can't store key-value pairs in a list like this; you need a dictionary. Here is how you could modify your code to use a dictionary for the word frequencies:
file = open(fname, 'r')
word_frequencies = {}  # use a dictionary to store the word frequencies
for line in file:
    line = line.lower()
    words = line.split()  # split() with no argument also discards the trailing newline
    for word in words:
        if word in word_frequencies:
            word_frequencies[word] += 1
        else:
            word_frequencies[word] = 1
Then iterate over the keys and values to find the min and max frequency:
# iterate over the keys and values in the word_frequencies dictionary
# and find the words with the max and min frequency
max_word = None
min_word = None
max_frequency = 0
min_frequency = float('inf')
for word, frequency in word_frequencies.items():
    if frequency > max_frequency:
        max_word = word
        max_frequency = frequency
    if frequency < min_frequency:
        min_word = word
        min_frequency = frequency
Finally, print the results:
print("Max frequency:", max_word, max_frequency)
print("Min frequency:", min_word, min_frequency)
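The loop above can also be collapsed with `collections.Counter`, which does the counting and the max lookup for you. A sketch, with an inline sample standing in for the file contents (an assumption for the demo):

```python
from collections import Counter

# inline sample standing in for the file contents
text = """i am sam
sam i am
that sam-i-am"""

word_frequencies = Counter(text.split())
max_word, max_frequency = word_frequencies.most_common(1)[0]
min_word, min_frequency = min(word_frequencies.items(), key=lambda kv: kv[1])

print("Max frequency:", max_word, max_frequency)  # Max frequency: i 2
print("Min frequency:", min_word, min_frequency)  # Min frequency: that 1
```

`most_common(1)` returns the single highest-count pair; ties keep first-seen order, matching the expected output in the question.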

How to count strings in specified field within each line of one or more csv files

I am writing a Python (3) program to count strings in a specified field within each line of one or more csv files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
The script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
The script is executed again, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
and the error occurs here:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos]  ## <-- line generating the error
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
Judging from the output, I'd say that the field positions in the csv file do not influence the count of the strings. If string uniqueness is case-insensitive, remember to use yourstring.lower() so that different-case matches are counted as one. Also keep in mind that if your text is large, the number of unique strings could be very large as well, so some sort of sorting should be in place to make sense of the output (otherwise it may be a long list of counts with a large portion of them being just 1s).
Now, to get a count of unique strings, the collections module is an easy way to go.
import collections

file = open('yourfile.txt', encoding="utf8")
a = file.read()

# if you have some words you'd like to exclude
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['<media', 'omitted>', "it's", 'two', 'said']))

# make an empty key-value dict to hold matched words and their counts
wordcount = {}
for word in a.lower().split():  # use the delimiter you want (a comma, I think?)
    # strip punctuation so it isn't counted as part of a word
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace("\"", "")
    word = word.replace("!", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict now contains each word and its frequency. After that, just sort it using collections.Counter and print it out.
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
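As for the original TypeError itself: the field numbers arrive from the command line as strings ("1", "2", ...), so they must be converted to integers before being used as list indices. A hedged sketch of the whole flow, using the standard csv module and an in-memory sample in place of the real files so it is self-contained:

```python
import csv
import io

def count_fields(csv_texts, fields):
    """Count values in the given 1-based field positions across csv inputs."""
    fieldcnt = {}
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text), skipinitialspace=True)
        next(reader)                     # skip the header row
        for row in reader:
            for pos in fields:           # pos is an int, so row[pos - 1] is valid
                value = row[pos - 1]
                fieldcnt[value] = fieldcnt.get(value, 0) + 1
    return fieldcnt

sample = "Field1, Field2, Field3, Field4\nA, B, C, D\nA, E, F, G\n"
fields = [int(n) for n in "1,2,3,4".split(",")]  # the fix: str -> int
print(count_fields([sample], fields))
# {'A': 2, 'B': 1, 'C': 1, 'D': 1, 'E': 1, 'F': 1, 'G': 1}
```

In the real script, `[int(n) for n in sys.argv[1].split(",")]` would replace the hard-coded `"1,2,3,4"`, and each filename argument would be opened instead of the inline `sample`.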
I hope this solves your problem. Lemme know if you face problems.

How to print multiple multiline strings from a list onto the same line

I have a list containing string patterns for digits 0-3. I am trying to print them on the same line, so that print(digits[1]+col+digits[2]+col+digits[3]) prints '1 2 3' using the # pattern strings at the respective list indices, but I can only get the number patterns printed on their own.
# Create strings for each number 0-3 and store in digits list.
zero = '#'*3+'\n'+'#'+' '+'#'+'\n'+'#'+' '+'#'+'\n'+'#'+' '+'#'+'\n'+'#'*3
one = '#\n'.rjust(4)*6
two = '#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3+'\n'+'#'.ljust(3)+'\n'+'#'*3
three = '#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3+'\n'+'#'.rjust(3)+'\n'+'#'*3
digits = [zero, one, two, three]
col = '\n'.ljust(1)*6 # A divider column between each printed digit.
print(digits[1]+col+digits[2]+col+digits[3],end='')
The result of the above code.
One way to solve this is by transposing the digits matrix: right now each index in the digits list holds a complete digit, but if we store the horizontal rows at each index it will print properly.
I think it would be better represented in code: https://repl.it/#pavanskipo/DirectTriangularSlash
# Digits split horizontally into rows
digits_rev = [digits[0].split("\n"),
              digits[1].split("\n"),
              digits[2].split("\n"),
              digits[3].split("\n")]
for i in range(len(digits) + 1):
    print(digits_rev[0][i] + '\t' +
          digits_rev[1][i] + '\t' +
          digits_rev[2][i] + '\t' +
          digits_rev[3][i])
click on the link and hit run, let me know if it works
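The same transpose can be written with zip, which pairs up row 0 of every digit, then row 1, and so on, without hard-coding the row count. A sketch using two simplified 3x5 patterns (invented here for brevity) instead of the question's full digits list:

```python
digits = ["###\n  #\n###\n#  \n###",   # a simplified 3x5 "2"
          "###\n  #\n###\n  #\n###"]   # a simplified 3x5 "3"

# zip(*...) pairs up row 0 of every digit, then row 1, and so on
lines = ["\t".join(parts) for parts in zip(*(d.split("\n") for d in digits))]
print("\n".join(lines))
```

This prints the "2" and "3" side by side, one tab apart, and works unchanged for any number of digits of equal height.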

Matlab Reading List of List from a Text File as String

I am a Java developer and new to Matlab. I have a file something like that:
Label_X sdfasf sadfl asdf a fasdlkjf asd
Label_Y lmdfgl ldfkgldkj dkljdkljdlkjdklj
Label_X sfdsa sdfsafasfsafasf 234|3#ert 44
Label_X sdfsfdsf____asdfsadf _ dsfsd
Label_Y !^dfskşfsşk o o o o 4545
What I want is:
A vector (array) includes labels:
Label Array:
Label_X
Label_Y
Label_X
Label_X
Label_Y
and a List (has five elements for our example) and every element of list has elements size of delimited strings. I mean
Element Number   Value (list of strings)              Element size of value list
--------------   ----------------------------------   --------------------------
1                sdfasf,sadfl,asdf,a,fasdlkjf,asd     6
2                lmdfgl,ldfkgldkj,dkljdkljdlkjdklj    3
3                sfdsa,sdfsafasfsafasf,234|3#ert,44   4
4                sdfsfdsf____asdfsadf,_,dsfsd         3
5                !^dfskşfsşk,o,o,o,o,4545             6
I know it is pretty simple with Java but I don't know how to implement it in Matlab.
PS: Here is what I am doing: I have a text file containing tweets. The first word of each row is the label, and the remaining words are the words associated with that label. I want a list of labels and a list of lists holding the words for each label.
This probably isn't optimal, but it should do the trick:
all = textread('test.txt', '%s', 'delimiter', '\n', 'whitespace', '');
List = cell(size(all));
for i = 1:numel(all)
    [List{i}.name, remain] = strtok(all{i}, ' ');
    List{i}.content = '';
    j = 0;
    while size(remain, 2) > 0
        j = j + 1;
        [temp, remain] = strtok(remain, ' ');
        List{i}.content = [List{i}.content temp ','];
    end
    List{i}.size = j;
end
The best construct for this in Matlab is the cell. Cells can contain one object of any type and are typically found in arrays themselves. Something like this should work and be pretty efficient (assuming you don't expect more than 10,000 lines):
output = cell(10000, 1); % set to the maximum number of lines you ever expect
output_names = cell(size(output));
output_used = false(size(output));
fid = fopen('filename.txt', 'r');
index = 0;
while ~feof(fid)
    index = index + 1;
    line = fgetl(fid); % fgetl, unlike fgets, drops the trailing newline
    splited_names = regexp(line, '\s+', 'split'); % split on whitespace
    output{index} = splited_names(2:end);
    output_names{index} = splited_names(1);
    output_used(index) = true;
end
fclose(fid);
output = output(output_used);
output_names = output_names(output_used);

How do I read a delimited file with strings/numbers with Octave?

I am trying to read a text file containing digits and strings using Octave. The file format is something like this:
A B C
a 10 100
b 20 200
c 30 300
d 40 400
e 50 500
but the delimiter can be space, tab, comma or semicolon. The textread function works fine if the delimiter is space/tab:
[A,B,C] = textread ('test.dat','%s %d %d','headerlines',1)
However it does not work if the delimiter is a comma/semicolon. I tried to use dlmread:
dlmread ('test.dat', ';', 1, 0)
but it does not work because the first column is a string.
Basically, with textread I can't specify the delimiter and with dlmread I can't specify the format of the first column. Not with the versions of these functions in Octave, at least. Has anybody ever had this problem before?
textread allows you to specify the delimiter; it honors the property arguments of strread. The following code worked for me:
[A,B,C] = textread( 'test.dat', '%s %d %d' ,'delimiter' , ',' ,1 )
I couldn't find an easy way to do this in Octave currently. You could use fopen() to loop through the file and manually extract the data. I wrote a function that would do this on arbitrary data:
function varargout = coltextread(fname, delim)
  % Initialize the variable-length output argument
  varargout = cell(nargout, 1);
  % Initialize elements of the cell array to nested cell arrays
  % (this syntax is due to {:} producing a comma-separated list)
  [varargout{:}] = deal(cell());
  fid = fopen(fname, 'r');
  while true
    % Get the current line
    ln = fgetl(fid);
    % Stop at EOF
    if ln == -1
      break;
    endif
    % Split the line into components and parse numbers
    elems = strsplit(ln, delim);
    nums = str2double(elems);
    nans = isnan(nums);
    % Special case: all strings (header line)
    if all(nans)
      continue;
    endif
    % Find the indices of the NaNs
    % (i.e. the indices of the strings in the original data)
    idxnans = find(nans);
    % Assign each element of the current line
    % into the corresponding cell array of varargout
    for i = 1:nargout
      % Detect whether the current index is a string or a number
      if any(ismember(idxnans, i))
        varargout{i}{end+1} = elems{i};
      else
        varargout{i}{end+1} = nums(i);
      endif
    endfor
  endwhile
  fclose(fid);
endfunction
It accepts two arguments: the file name, and the delimiter. The function is governed by the number of return variables that are specified, so, for example, [A B C] = coltextread('data.txt', ';'); will try to parse three different data elements from each row in the file, while A = coltextread('data.txt', ';'); will only parse the first elements. If no return variable is given, then the function won't return anything.
The function ignores rows that have all-strings (e.g. the 'A B C' header). Just remove the if all(nans)... section if you want everything.
By default, the 'columns' are returned as cell arrays, although the numbers within those arrays are actually converted numbers, not strings. If you know that a cell array contains only numbers, then you can easily convert it to a column vector with: cell2mat(A)'.