Splitting the output obtained by Counter in Python and pushing it to Excel

I am using Counter to count every word in the descriptions of 20,000 products and see how many times each word repeats; for example, 'pipette' repeats 1282 times. To do this I have split column A into columns P, Q, R, S, T, U and V:
df["P"] = df["A"].str.split(n=10).str[0]
df["Q"] = df["A"].str.split(n=10).str[1]
df["R"] = df["A"].str.split(n=10).str[2]
df["S"] = df["A"].str.split(n=10).str[3]
df["T"] = df["A"].str.split(n=10).str[4]
df["U"] = df["A"].str.split(n=10).str[5]
df["V"] = df["A"].str.split(n=10).str[6]
This shows the split products.
Then I count each of the columns individually and add the counters together to get the total count for each word:
d = Counter(df['P'])
e = Counter(df['Q'])
f = Counter(df['R'])
g = Counter(df['S'])
h = Counter(df['T'])
i = Counter(df['U'])
j = Counter(df['V'])
m = d+e+f+g+h+i+j
print(m)
This is the output I obtained using Counter.
Now I want to transfer the output to an Excel sheet with the keys in one column and the values in another.
Am I using the right method to do so? If so, how do I push them into separate columns?
Note: the length of each key is different.
I also want to convert all the items in column 'A' to lower case so that the counter does not count the same word twice. How should I go about it?

I've been learning Python for just a couple of months, but I'll give it a shot. I'm sure there are better ways to do the same thing. Maybe we can both learn something from this question. Let me know how it turns out. Good luck.
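import pandas as pd
# m is the combined Counter from the question
num = len(m.keys())
df = pd.DataFrame(columns=['Key', 'Value'])
# add one row per (word, count) pair
for i, j, k in zip(range(num), m.keys(), m.values()):
    df.loc[i] = [j, k]
# a CSV file opens directly in Excel
df.to_csv('Your_Project.csv')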

Related

Sliding window over a string using python

I am working on a dataset as part of my course practice and am stuck on a particular step. I have done it in R, but I wish to do the same in Python. I am comparatively new to Python and so need help.
The data set has a column named 'Seq' with 5000+ records. Another column, 'MainSeq', holds the longer sequence that contains each Seq value as a substring. I need to check that Seq is present in MainSeq at the given start position and then print the 7 letters before and after each letter of Seq, i.e.
I have a value in col 'MainSeq' as 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.
Col 'Seq' contains the value JKLMNO
Start position of J = 10 and O = 15
I need to create a new column that takes the 7 letters before and after each letter from J to O, i.e. each entry has a total length of 15
CDEFGHI**J**KLMNOPQ
DEFGHIJ**K**LMNOPQR
EFGHIJK**L**MNOPQRS
FGHIJKL**M**NOPQRST
GHIJKLM**N**OPQRSTU
HIJKLMN**O**PQRSTUV
I know how to apply the logic to a specific seq. But since I have 5000+ seq records, I need a way to apply the same logic to all of them.
seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
i = seq.index('J')
j = seq.index('O')
value = 7
for mid in range(i, 1+j):
    print(seq[mid-value:mid+value+1])
I'm not sure this will do exactly what you want, since you haven't supplied much data to test with, but it might work or at least give you a start.
import pandas as pd
df = pd.DataFrame({'MainSeq':['ABCDEFGHIJKLMNOPQRSTUVWXYZ','ABCDEFGHIJKLMNOPQRSTUVWXYZ'], 'Seq':'JKLMNO'})
def get_sequences(seq, letters, value):
    # seq.index(letter) finds the first occurrence of each letter in seq
    sequences = [seq[seq.index(letter)-value:seq.index(letter)+value+1] for letter in letters]
    return sequences
df['new_seq'] = df.apply(lambda row : get_sequences(row['MainSeq'], row['Seq'], 7), axis = 1)
df = df.explode('new_seq')
print(df)
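Two edge cases may be worth handling: a letter that appears more than once in MainSeq (index only finds the first one), and a letter closer than 7 positions to the start (a negative slice start wraps around). A minimal sketch, assuming the 1-based start position from the question is available for each record; the function name flanking_windows and its parameters are illustrative, not from the original post:

```python
def flanking_windows(main_seq, start, length, flank=7):
    """Return one window per letter of the sub-sequence.

    start is the 1-based position of the first letter in main_seq,
    length is the number of letters in the sub-sequence.
    """
    windows = []
    first = start - 1  # convert to a 0-based index
    for mid in range(first, first + length):
        lo = max(mid - flank, 0)  # clamp so the slice never wraps around
        windows.append(main_seq[lo:mid + flank + 1])
    return windows


main_seq = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for window in flanking_windows(main_seq, start=10, length=len("JKLMNO")):
    print(window)
```

With pandas this could be applied per row through df.apply(..., axis=1), reading the start position from whichever column holds it; windows near either end of the sequence simply come out shorter than 15 letters.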

How can I optimise my code and make it readable?

The task is:
The user enters a number. You take one digit from the left and one from the right and sum them. Then you take the rest of the number and sum every digit in it. That gives you two results. You have to sort them from largest to smallest and join them into one number. I solved it, but I don't like how it looks. The task is pretty simple, but my code looks like trash. Maybe I should use more built-in functions and libraries. If so, could you recommend some? Thank you.
a = int(input())
b = [int(i) for i in str(a)]
closesum = 0
d = []
e = ""
farsum = b[0] + b[-1]
print(farsum)
b.pop(0)
b.pop(-1)
print(b)
for i in b:
    closesum += i
print(closesum)
d.append(int(closesum))
d.append(int(farsum))
print(d)
for i in sorted(d, reverse = True):
    e += str(i)
print(int(e))
input()
You can use reduce
from functools import reduce
a = [0,1,2,3,4,5,6,7,8,9]
print(reduce(lambda x, y: x + y, a))
# 45
and you can just pass in a shortened list instead of popping elements: b[1:-1]
The first two lines:
str_input = input() # input will always read strings
num_list = [int(i) for i in str_input]
The for loop at the end is unnecessary, and there is no need to sort only two elements. You can use a simple if..else condition to print what you want.
You don't need a loop to sum a slice of a list. You can also use join to concatenate a list of strings without looping. Sort the two sums as integers and convert them to strings afterwards with map(str, ...); sorting the strings instead compares them character by character, so '9' would sort ahead of '10'.
farsum = b[0] + b[-1]
closesum = sum(b[1:-1])
"".join(map(str, sorted((farsum, closesum), reverse=True)))

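Putting the suggestions from the answers together, the whole task might be wrapped in one small function. This is only a sketch: the name combine_digit_sums is made up for illustration, and it assumes the input is a string of at least two digits.

```python
def combine_digit_sums(number_text):
    """Sum the outer digits and the inner digits, then join the two sums,
    larger first, into a single integer."""
    digits = [int(ch) for ch in number_text]
    far_sum = digits[0] + digits[-1]   # first and last digit
    close_sum = sum(digits[1:-1])      # everything in between
    # Sort numerically, then convert to strings for joining.
    ordered = sorted((far_sum, close_sum), reverse=True)
    return int("".join(map(str, ordered)))


# far = 1 + 5 = 6, close = 2 + 3 + 4 = 9
print(combine_digit_sums("12345"))
```

Sorting the integers rather than their string forms matters once one of the sums reaches two digits: for "91119" the sums are 18 and 3, and a string sort would put "3" first.
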
Find count of a column in pandas dataframe based on condition

I am using the method below to count rows in a pandas DataFrame that has 55k rows. It runs inside a for loop over a site list (4000 sites). The loop takes many minutes to complete when the line below is included.
for i in g_sitelist:
    x = len(dfreglist[(dfreglist['site'] == i) & (dfreglist['isactive'] == 1)])
Is there a better way to do this, so that the loop completes within a second?
You can use value_counts():
site_counts = dfreglist[dfreglist['isactive'].eq(1)]['site'].value_counts()
This gives a Series of the active site values and their counts, which you can then iterate over.
Use numpy: convert each column to an array and call np.sum:
import numpy as np
m = (dfreglist['isactive'].values == 1)
for i in g_sitelist:
    x = np.sum((dfreglist['site'].values == i) & m)
Faster solution:
df = dfreglist[dfreglist['site'].isin(g_sitelist) & (dfreglist['isactive'].values == 1)]
out = df['site'].value_counts()
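One detail to watch with value_counts is that sites with no active rows do not appear in the result at all. A possible way around this is to reindex the counts against the site list. The sketch below uses a tiny made-up frame in place of the real 55k-row one; the sample data and the name active_counts are illustrative only.

```python
import pandas as pd

# Hypothetical sample standing in for the real dfreglist.
dfreglist = pd.DataFrame({
    "site": ["a", "a", "b", "c", "a", "b"],
    "isactive": [1, 1, 0, 1, 0, 1],
})
g_sitelist = ["a", "b", "c", "d"]

# Count active rows per site once, then align to the full site list
# so that sites without any active row get a count of 0.
active_counts = (
    dfreglist.loc[dfreglist["isactive"].eq(1), "site"]
    .value_counts()
    .reindex(g_sitelist, fill_value=0)
)

for site in g_sitelist:
    x = active_counts[site]  # a lookup instead of filtering the frame again
    print(site, x)
```

The filtering happens once, outside the loop, so each iteration is only a label lookup.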

MATLAB: Write Dynamic matrix to Excel

I'm using MATLAB R2009a and following this example:
http://uk.mathworks.com/help/matlab/matlab_external/using-a-matlab-application-as-an-automation-client.html
I'd like to edit it so that I can write a matrix of unknown size into a column in an Excel sheet, without explicitly stating the range. I've attempted it this way:
%Put MATLAB data into the worksheet
Hop = [47; 53; 93; 10]; %Pretend I don't know what size this matrix is.
p = length(Hop);
p = strcat('A',num2str(p));
eActivesheetRange = e.Activesheet.get('Range','A1:p');
eActivesheetRange.Value = Hop;
However, this errors out. I've tried several variations of this to no avail. For example, using 'A:B' puts the array in columns A and B in Excel and a NaN into every cell beyond my array. As I only want column A filled, I also tried a plain ('Range','A'), which errors out as well.
Thanks in advance for any advice you can offer.
You're having issues because you're trying to use your variable p in a string directly:
range = 'A1:p';
'A1:p'
This isn't going to work, you want to include the value of p. There are a number of ways you can do this.
In the code you have provided, p already holds a cell reference (say 'A10'), so if you wanted to append that to your range, you'd perform string concatenation:
p = 'A10';
range = strcat('A1:', p);
I personally prefer to use sprintf to place the number directly into my strings rather than concatenating a bunch of strings.
p = 10;
range = sprintf('A1:A%d', p)
'A1:A10'
So if we adapt your code to use this we should get
range = sprintf('A1:A%d', numel(Hop));
eActivesheetRange = e.Activesheet.get('Range', range);
eActivesheetRange.Value = Hop;
Also just to be a little explicit, I would use numel rather than length as length is ambiguous. Also, I would flatten Hop into a column vector just to make sure that it's the proper dimension to be written to the spreadsheet.
eActivesheetRange.Value = Hop(:);
Essentially, the idea is to replace xx in 'B1:Bxx' with the number of elements in your matrix.
I tried this:
e = actxserver('Excel.Application');
eWorkbook = e.Workbooks.Add;
e.Visible = 1;
eSheets = e.ActiveWorkbook.Sheets;
eSheet1 = eSheets.get('Item',1);
eSheet1.Activate;
A = [1 2 3 4];
eActivesheetRange = e.Activesheet.get('Range','A1:A4');
eActivesheetRange.Value = A;
The above is taken directly from the link you shared. What you are trying to do fails because 'A1:p' is a literal string: the p inside the quotes is just the character p, not the value of your variable p. To avoid this, try the following:
B = randi([0 10],10,1)
eActivesheetRange = e.Activesheet.get('Range',['B1:B' num2str(numel(B))]);
eActivesheetRange.Value = B;
Here, num2str(numel(B)) will pass in a string, which is the number of elements in B. This is variable in the sense that it depends on the number of elements in B.

Combination of one of every string group in all possible combinations and orders in matlab

So I forgot a string. I know it consists of three substrings, and I know a few possibilities for each substring. So all I need to do is go through all possible combinations and orders until I find the one I forgot. But since humans can only hold four items in their working memory (definitely an upper limit for me), I can't keep tabs on which ones I have already examined.
So say I have n sets of m strings, how do I get all strings that have a length of n substrings consisting of one string from each set in any order?
I saw an example of how to do it in a nested loop, but then I have to specify the order. The example is for n = 3 with different m's. I'm not sure how to make this more general:
first = {'Hoi','Hi','Hallo'};
second = {'Jij','You','Du'};
third = {'Daar','There','Da','LengthIsDifferent'};
for iF = 1:length(first)
    for iS = 1:length(second)
        for iT = 1:length(third)
            [first{iF}, second{iS}, third{iT}]
        end
    end
end
About this question: it does not solve this problem because it presumes that the order of the sets to choose from is known.
This generates the cartesian product of the indices using ndgrid.
Then uses some cellfun-magic to get all the strings. Afterwards it just cycles through all the permutations and appends those.
first = {'Hoi','Hi','Hallo'};
second = {'Jij','You','Du'};
third = {'Daar','There','Da','LengthIsDifferent'};
Vs = {first, second, third};
%% Create cartesian product
Indices = cellfun(@(X) 1:numel(X), Vs, 'uni', 0);
[cartesianProductInd{1:numel(Vs)}] = ndgrid(Indices{:});
AllStringCombinations = cellfun(@(A,I) A(I(:)), Vs, cartesianProductInd,'uni',0);
AllStringCombinations = cat(1, AllStringCombinations{:}).';%.'
%% Permute what we got
AllStringCombinationsPermuted = [];
permutations = perms(1:numel(Vs));
for i = 1:size(permutations,1)
    AllStringCombinationsPermuted = [AllStringCombinationsPermuted; ...
        AllStringCombinations(:,permutations(i,:));];
end
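For comparison, the same two-step idea (cartesian product of the groups, repeated for every ordering of the groups) can be sketched in Python with itertools. This is not MATLAB and not part of the answer above; it only illustrates the logic, and the names groups and candidates are made up for the example.

```python
from itertools import permutations, product

first = ["Hoi", "Hi", "Hallo"]
second = ["Jij", "You", "Du"]
third = ["Daar", "There", "Da", "LengthIsDifferent"]
groups = [first, second, third]

# For every ordering of the groups, take one string from each group
# in that order and concatenate them into a single candidate.
candidates = [
    "".join(choice)
    for ordering in permutations(groups)
    for choice in product(*ordering)
]

# 3 * 3 * 4 choices in each of the 3! orderings of the groups
print(len(candidates))
```

Because the groups are held in a list, adding a fourth group needs no extra loop, which is the generalisation the question asks for.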

Resources