Assigning and reading multidimensional arrays in Python - python-3.x

I'm stumped.
for a in range(0,500): # 500 is a highly variable number but using it for example purposes
    b = findall(r'<(.*?)>', d) # d will return a highly variable number of matches, anywhere from 45 to 10000
    c.append([b])
print(c[0][1])
This returns the error because everything from 'b' goes into c[0][0]. I can understand this. The question is how do I split 'b' apart so I can put it into c so I can
print(c[0][234])
and get it to give me back the 235th item, i.e. element 234 of line 0?
As I said above, the number of times I go through 'b' will be variable. At least for now, until I get the entire file prepped, I can only say that 'b' will end up way north of 10,000 and probably closer to 100,000 by the time I have finished all the data collection. The number of elements stored can and will vary widely depending on the file they come from. They all come from a csv file, but I'm hoping not to add any 'complexity' by having to deal with the csv module, since I've never used it before and that will probably just lead to more questions.
I have tried something similar to the following (with different variable names, so that things matched up appropriately):
d = list(zip(*(e.split(',') for e in b)))
All this did was split on each and every letter instead of on the comma.

Your error is coming from the square brackets you have in c.append([b]). The brackets create an extra list that contains the list b. So rather than a two dimensional data structure, you're ending up with three dimensions. Your indexing fails because c[0][1] is trying to get a second value from the middle list (which only ever has one item in it).
You might get what you want with c[0][0][1] instead. But you probably don't actually want that extra level in your data structure. You can avoid creating it by using: c.append(b)
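A minimal sketch of the difference, assuming `findall` in the question is `re.findall`. The `lines` list and its contents are made-up stand-ins for the real input:

```python
import re

# Illustrative stand-ins for the real input (an assumption, not the asker's data).
lines = ["<a>,<b>,<c>", "<x>,<y>"]

nested = []  # built with append([b]): three levels deep
flat = []    # built with append(b): two levels deep

for d in lines:
    b = re.findall(r'<(.*?)>', d)  # list of matches for this line
    nested.append([b])             # wraps b in an extra one-item list
    flat.append(b)                 # stores b directly

print(nested[0][0][1])  # the extra level needs a third index
print(flat[0][1])       # two indexes are enough here
```

With `append(b)`, `flat[row][col]` addresses match number `col` of line number `row`, which is the layout the question asks for.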

Related

counting most commonly used words across thousands of text files using Octave

I've accumulated a list of over 10,000 text files in Octave. I've got a function which cleans up the contents of each file, normalizing things in various ways (lowercase, reducing repeated whitespace, etc). I'd like to distill from all these files a list of words that appear in at least 100 files. I'm not at all familiar with Octave data structs or cell arrays or the Octave sorting functions, and was hoping someone could help me understand how to:
initialize an appropriate data structure (word_appearances) to count how many emails contain a particular word
loop thru the unique words that appear in an email string and increment for each of those words the count I'm tracking in word_appearances -- ideally we'd ignore words less than two chars in length and also exclude a short list of stop_words.
reduce word_appearances to only contain words that appear some number of times, e.g, min_appearances=100 times.
sort the words in word_appearances alphabetically and either export this as a .MAT file or as a CSV file like so:
1 aardvark
2 albatross
etc.
I currently have this code to loop through my files one by one:
for i = 1:total_files
  filename = char(files{i}(1));
  printf("processing %s\n", filename);
  file_contents = jPreProcessFile(readFile(filename))
endfor
Note that the file_contents that comes back is pretty clean -- usually just a bunch of words, some repeated, separated by single spaces like so:
email market if done right is both highli effect and amazingli cost effect ok it can be downright profit if done right imagin send your sale letter to on million two million ten million or more prospect dure the night then wake up to sale order ring phone and an inbox full of freshli qualifi lead we ve been in the email busi for over seven year now our list ar larg current and deliver we have more server and bandwidth than we current need that s why we re send you thi note we d like to help you make your email market program more robust and profit pleas give us permiss to call you with a propos custom tailor to your busi just fill out thi form to get start name email address phone number url or web address i think you ll be delight with the bottom line result we ll deliv to your compani click here emailaddr thank you remov request click here emailaddr licatdcknjhyynfrwgwyyeuwbnrqcbh
Obviously, I need to create the word_appearances data structure such that each element in it specifies a word and how many files have contained that word so far. My primary point of confusion is what sort of data structure word_appearances should be, how I would search this data structure to see if some new word is already in it, and if found, increment its count, otherwise add a new element to word_appearances with count=1.
Octave has containers.Map to hold key-value pairs. This is the simple usage:
% initialize map
m = containers.Map('KeyType', 'char', 'ValueType', 'int32');
% add a new word with count 1, otherwise increment its existing count
word = 'hello';
if !m.isKey(word)
  m(word) = 1;
else
  m(word) += 1;
endif
This is one way to extract most frequent words from a map like the one above:
counts = m.values;
[sorted_counts, indices] = sort(cell2mat(counts));
top10_indices = indices(end:-1:end-9);
top10_words = m.keys(top10_indices);
I must warn you though, Octave may be pretty slow at this task, considering that you have thousands of files. Use it only if running time isn't that important for you.
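If running time does matter, the same counting scheme can be sketched outside Octave. The following Python version is only an illustration: the function names, the `docs` sample, and the `stop_words` set are hypothetical. It counts how many documents contain each word, then keeps the words that reach a minimum count, sorted alphabetically:

```python
from collections import Counter

def count_word_appearances(documents, stop_words=frozenset(), min_length=2):
    """Count, for each word, how many documents contain it at least once."""
    appearances = Counter()
    for text in documents:
        unique_words = set(text.split())  # each word counts once per document
        appearances.update(
            w for w in unique_words
            if len(w) >= min_length and w not in stop_words
        )
    return appearances

def frequent_words(appearances, min_appearances):
    """Return the words seen in at least min_appearances documents, sorted."""
    return sorted(w for w, n in appearances.items() if n >= min_appearances)

# Tiny made-up sample standing in for the cleaned file contents.
docs = ["email market if done right", "email list ar larg", "done right email"]
counts = count_word_appearances(docs, stop_words={"if"})
words = frequent_words(counts, min_appearances=2)
print(words)
```

The `set` per document is what makes this a count of files containing the word rather than a count of total occurrences.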

How can I separate consecutive strings without any delimiters?

My input data is a VCF (Variant Call Format) file. Each line that I am interested in looks like this:
chrI 22232 DEL00BED N <DEL> . PASS SUPP=1159;SUPP_VEC=11111111111111111111111011111111111
I want to count the presence (1) of a specific deletion in a specific position (22232) supported by n samples. For this reason, I looked at SUPP_VEC= values, however, I don't know how to split each value as 1) it is a string, and 2) doesn't have delimiters. How could I add a space between every character? or How could I split/ count the values from SUPP_VEC= for Python3?
I was also curious to know what SUPP means. I found one record with SUPP=2 and checked in Excel whether the presence (1) / absence (0) values in SUPP_VEC added up to the value of SUPP; however, I could only count 1 instead of 2. Perhaps somebody knows what SUPP means.
The reason for my procedure is to have a frequency table for a specific deletion type.
I hope I made myself clear.
Thank you in advance.
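A Python string is already a sequence of characters, so no delimiter is needed to split or count the values. The sketch below uses a shortened, hypothetical record (the `SUPP=3;SUPP_VEC=1101` values are made up) and assumes the INFO field is the eighth whitespace-separated column, as in the line shown above. `parse_info` is an illustrative helper, not part of any VCF library:

```python
def parse_info(info_field):
    """Turn 'KEY=value;KEY2=value2' into a dict (keys without '=' map to True)."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info

# Hypothetical, shortened record; real lines carry a much longer SUPP_VEC.
line = "chrI\t22232\tDEL00BED\tN\t<DEL>\t.\tPASS\tSUPP=3;SUPP_VEC=1101"

fields = line.split()          # split the record on whitespace
info = parse_info(fields[7])   # INFO is the 8th column

supp_vec = info["SUPP_VEC"]
flags = [int(ch) for ch in supp_vec]  # one 0/1 value per sample
n_present = supp_vec.count("1")       # number of samples with the deletion
spaced = " ".join(supp_vec)           # the same values separated by spaces

print(flags, n_present, spaced)
```

Collecting `n_present` per position and deletion type across all lines would give the frequency table described above.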

How to use the .insert method to add values to a list

I've been working on an algorithm that involves genetic code. I started by associating all 4 genetic bases, A, C, T, G with a list. A is 1,0,0,0. C is 0,1,0,0. T is 0,0,1,0 and G is 0,0,0,1. There are two different genetic codes, one being the original one and the other being one that was genetically mutated. The algorithm is going to come to conclusions of the data given based on the difference between the two genetic codes. But first, I need to sort of preprocess the data before I can work on the algorithm making conclusions.
What I'm trying to do is this: when the code sees a letter in the original code, it should look at the letter in the same position in the copy version. If you look at the code below, an example would be checking whether the first letter in each (A & C) or the second letter in each (T & T) is the same. If they are, the list should not change. For example, in the 2nd position, T & T are the same, which means the list would stay the same and be 0,0,1,0. However, if they are not the same, for example A & C, then the algorithm should overlap them and add both letters. So the code would be 1,1,0,0.
So far, this is what the code is looking like:
A = [1,0,0,0]
C = [0,1,0,0]
T = [0,0,1,0]
G = [0,0,0,1]
original = [A,T,T,G,C,T,A]
copy = [C,T,T,A,T,A,A]
final = original # In case you were wondering the purpose of this line is to make a new variable to hold the end result.
for i,v in enumerate(original):
    if v == copy[i]:
        print(v)
    else:
        print(final.insert(i,copy[i]))
When I run it I get "list index out of range" and I tried to play with it a little and delete the final = original and for some reason it works but instead of combining the two different letters when it should, it just says None.
I'm pretty new to programming so this could be a simple issue but I was wondering how I can actually go about making the two letters from two different lists, overlap if they are different.
Lists are mutable in Python. In your code, final = original does not create a new list; it makes final a second reference to the same list that original names, so any change made through either name is visible through both (there is only one underlying list). Aliasing mutable objects like this is a common source of pain for coders. You can use final = original.copy() to make a copy and operate on it safely. See other discussions on SO, such as "Are Python Lists mutable?". It is easy to trip over this when you are starting out.
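A short sketch of both points, under the assumption that "overlap" means an element-wise OR of the two four-item encodings. With the encodings from the question, A and C overlap to 1,1,0,0. Note also that `list.insert` always returns `None`, which is why printing its result shows `None`. The names `mutated` and `overlap` are illustrative:

```python
A = [1, 0, 0, 0]
C = [0, 1, 0, 0]
T = [0, 0, 1, 0]
G = [0, 0, 0, 1]

original = [A, T, T, G, C, T, A]
mutated = [C, T, T, A, T, A, A]  # called "copy" in the question

alias = original               # same list object under a second name
independent = original.copy()  # new outer list (shallow copy)
print(alias is original)        # True
print(independent is original)  # False

def overlap(x, y):
    """Element-wise OR of two base encodings; identical bases are unchanged."""
    return [a | b for a, b in zip(x, y)]

# Build a new list instead of inserting into the one being iterated over.
final = [overlap(o, m) for o, m in zip(original, mutated)]
print(final)
```

Building `final` with a comprehension leaves `original` untouched, so the loop never grows the list it is walking over.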

Looking for a way to distinguish identical string entries for index use

I am making a function in Python 3.5.2 that reads chemical structures (e.g. CaBr2) and then gives a list with the names of the elements and their coefficients.
The general rundown of how I am doing it: I have a for loop that skips the first letter. It then appends the previous element when it reaches one of: a capital letter, a number, or the end. I did this with the index of my iteration, and then get the entry at index(iteration)-1 or -2 depending on the specifics. For the given example it would skip C, read a but do nothing, reach B and append the translation of Ca to my name list, and append 1 to my coefficient list.
This works perfectly for structures with unique entries, but with something like CaCl2, the index of the iteration at the second C is not 2 but zero, as index doesn't differentiate between the two. How would I be able to have variables in my function equal to the value at the previous index(es) without running into this problem? Keep in mind that inputs can be of any length, capitalization cannot change, and there could be any number of repeated values.
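The underlying issue is that `str.index` returns the position of the first matching character, so repeated letters always map to the earliest one. Two possible ways around it are sketched below, assuming simple formulas without parentheses: `enumerate` supplies each character's own position, and a regular expression can pair every symbol with its optional count directly. `parse_formula` is a hypothetical helper name:

```python
import re

def parse_formula(formula):
    """Split a simple formula such as 'CaCl2' into (symbol, count) pairs."""
    pairs = []
    # One capital letter, optional lowercase letters, optional digits.
    for symbol, digits in re.findall(r"([A-Z][a-z]*)(\d*)", formula):
        pairs.append((symbol, int(digits) if digits else 1))
    return pairs

# enumerate yields the true position of every character, repeats included.
positions_of_c = [i for i, ch in enumerate("CaCl2") if ch == "C"]

print(parse_formula("CaCl2"))
print(positions_of_c)
```

Translating the symbols into element names would then be a dictionary lookup on the first item of each pair.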

How to convert a string containing non-numeric values into numeric values?

I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata does not like the fact that the variable contains fractional numbers.
If I understand you correctly, the interpretation as string arises from one and possibly two facts:
The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
Commas indicate decimal places. destring can easily cope with this: see the help for its dpcomma option. Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that
destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.
You can write a for loop to convert a comma to a period. I don't quite know your variables but imagine you have a variable gdppercap with information like 1234,343 and you want that to be 1234.343 before you do the destring.
For example:
forvalues x = 1(1)10 {
    replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .) if substr(gdppercap, `x', 1) == ","
}

Resources