Multiple criteria unique list - excel

I have a list named ChargeType and a list named TaskCode. ChargeType contains "Professional Fees", "Disbursements", and "Subcontractors". TaskCode contains say 0527014.05.01, 0527014.05.02, 0527014.07.01, 0527014.04.01, 0527014.02.01 etc. I'm trying to create a unique list from another named range (JobClass). However, I can't seem to get the wildcard to work for me.
My formula right now is {=IFERROR(INDEX(JobClass,MATCH(0,IF(ChargeType="Professional Fees",IF(TaskCode="0527014.05"&"*",COUNTIF($D$6:D6,JobClass),""),""),0)),"")} (dragged down of course). So I want it to pick out the unique items from JobClass if the ChargeType is "Professional Fees" and the TaskCode is any of the 0527014.05 items. However, it is returning blanks.
If I don't use the wildcard and instead put the entire 0527014.05.01 it works, but only gives me Intermediate I, Principal I, and Intermediate II*. But I need it to work for all of the .05's. What am I doing wrong?
PTH = TaskCode, JobClass/Ref = JobClass, BudgetType = ChargeType
Desired output in this case would be a unique list of Intermediate I, Principal I, Intermediate II, Senior I

Related

Python: (partial) matching elements of a list to DataFrame columns, returning entry of a different column

I am a beginner in python and have encountered the following problem: I have a long list of strings (I took 3 now for the example):
ENSEMBL_IDs = ['ENSG00000040608',
'ENSG00000070371',
'ENSG00000070413']
which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):
genes_list = (['ENSG00000040608.28', 'RTN4R'],
['ENSG00000070371.91', 'CLTCL1'],
['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)
The task I want to perform is conceptually not that difficult: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0] (which are partial matches: each element of ENSEMBL_IDs is contained within column 0 of genes_df, as outlined above). If the element of ENSEMBL_IDs matches the element in genes_df.iloc[:,0] (which it does, apart from the extra numbers after the period, ".XX"), I want to return the corresponding value stored in column 1 of the genes_df DataFrame: the actual gene name, 'RTN4R' as an example.
I want to store these in a list. So, in the end, I would be left with a list like follows:
`genenames = ['RTN4R', 'CLTCL1', 'DGCR2']`
Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.
I think I am looking for something along the lines of:
`genenames = []
for i in ENSEMBL_IDs:
    if i in genes_df.iloc[:,0]:
        genenames.append(# corresponding value in genes_df.iloc[:,1])`
I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.
Thank you for your help!
Thanks also for the edit, English is not my first language, so the improvements were insightful.
You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)
out = genes_df.loc[m, 1].tolist()
Or use a regex with str.match:
pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)
out = genes_df.loc[m, 1].tolist()
Output: ['RTN4R', 'CLTCL1', 'DGCR2']
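Note that the isin and str.match masks return the gene names in the row order of genes_df, not in the order of ENSEMBL_IDs. If the order of ENSEMBL_IDs matters, one possible alternative is a dictionary lookup. The following is a minimal sketch in plain Python using the sample data from the question; name_by_id is a hypothetical helper name, and it assumes every versioned ID contains exactly one gene per unversioned ID:

```python
ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']
genes_list = (['ENSG00000040608.28', 'RTN4R'],
              ['ENSG00000070371.91', 'CLTCL1'],
              ['ENSG00000070413.17', 'DGCR2'])

# Map each ID, stripped of its ".XX" version suffix, to the gene name.
name_by_id = {versioned_id.split('.', 1)[0]: name
              for versioned_id, name in genes_list}

# Look the IDs up in the order they appear in ENSEMBL_IDs.
genenames = [name_by_id[ensembl_id] for ensembl_id in ENSEMBL_IDs]
print(genenames)  # ['RTN4R', 'CLTCL1', 'DGCR2']
```

A missing ID raises a KeyError here, which is acceptable only because the question states that every ID is present in column 0.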

Indexing inner lists in "list of lists"

bigList = [list1,list2,list3]
I need a way to get the name of list1 using the zero index
bigList[0] just gives you all the items in the list
My original code prints all the items in the lists, while the modified one says the index is out of range. I want it to give me the illness names, not all the symptoms.
Original code
https://i.stack.imgur.com/HDlA9.png
Modified
https://i.stack.imgur.com/5Qdh2.png

I'm trying to clean my data but it returns the wrong column

I am trying to take one of my imported data sets df19 and clean information out of it to create a second variable noneu19 where, you guessed it, EU countries are removed from the column Destination
Here is what I ran
noneu19=df19
noneu19["Destination"] = noneu19[~noneu19["Destination"].apply(str).str.contains('UK')]
noneu19["Destination"] = noneu19[~noneu19["Destination"].apply(str).str.contains('SWEDEN')]
noneu19["Destination"] = noneu19[~noneu19["Destination"].apply(str).str.contains('SPAIN')]
...
set(noneu19["Destination"])
(The ... replaces the 25 other lines)
What it returns is the list of data from a completely separate column, 'Location', for some reason.
If I do set(df19['Destination']) it returns the list that I am trying to clean, so it is not a problem in the original data set. Is there a way that I can do it easier/cleaner/better or a way to troubleshoot why it is returning the wrong column?
Thanks
You can create a list with all the countries in the EU, such as
EU = ['SPAIN', 'ITALY', ..., 'EU_COUNTRY']
then use the isin function like this:
noneu19 = df19.loc[~df19["Destination"].isin(EU)].copy()
The function isin will check if an element of that very column is contained in the list you pass as the argument.
Approaching the problem this way gives you more readable and easier-to-maintain code.
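A small runnable sketch of this approach follows. The DataFrame contents and the shortened EU list are made-up stand-ins; only the names df19, noneu19, Destination and Location come from the question. Two details are worth noting: noneu19 = df19 in the question only binds a second name to the same DataFrame, so the original is modified as well, and isin compares whole values rather than substrings as str.contains does.

```python
import pandas as pd

# Hypothetical stand-in for df19; the values are invented for illustration.
df19 = pd.DataFrame({
    "Location": ["A", "B", "C", "D"],
    "Destination": ["SPAIN", "USA", "ITALY", "JAPAN"],
})

# Shortened, illustrative list; the real one would name every EU country.
EU = ["SPAIN", "ITALY"]

# Keep the rows whose Destination is not in EU.
# .copy() makes noneu19 independent of df19.
noneu19 = df19.loc[~df19["Destination"].isin(EU)].copy()

print(sorted(set(noneu19["Destination"])))  # ['JAPAN', 'USA']
print(len(df19))  # 4, the original is left untouched
```

Because whole rows are filtered instead of a single column being reassigned, the other columns stay aligned with Destination.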

count occurrences of a string in a structure

I have a structure mydata and I need to access one of its fields mydata.myfield, and within that field, access another field mydata.myfield.mysecondfield. In the last field, mydata.myfield.mysecondfield I need to check how many times a particular string ('apple') occurs.
I have tried with:
aaa=unique(mydata.myfield.mysecondfield,'apple')
bbb=cellfun(@(x) sum(ismember(mydata.myfield.mysecondfield,x)),aaa,'un',0)
but I get this error: Attempt to reference field of non-structure array.
The structure contains fields with both strings and numeric values.
The underlying problem may be due to the fact that the structure is a little bit different from how you describe it. Following your question, I created it as follows:
mydata = struct();
mydata.myfield.mysecondfield = {'apple' 'apple' 'orange' 'banana' 'apple' 'kiwi'};
and since I'm not getting the same error you get, I think the underlying types may be slightly mismatching. Anyway, given mydata defined as above, if you change your code as follows, it should work but it will return the count of every unique occurrence within the field:
aaa = unique(mydata.myfield.mysecondfield);
bbb = cellfun(@(x) sum(ismember(mydata.myfield.mysecondfield,x)),aaa,'un',0)
bbb =
4×1 cell array
[3]
[1]
[1]
[1]
If you only want to count the number of apple occurrences, you should use the following approach instead:
apple_count = sum(strcmp(mydata.myfield.mysecondfield,'apple')); % 3

Using load with data from cells

In my code I'm trying to use load with entries from a cell, but it is not working. The portion of my code below produces a 3 dimensional array of strings. The strings represent the paths to file names.
for i = 1:Something
for j = 1:Something Different
for k = 1: Yet Something Something Different
DataPath{j,k,i} = 'F:\blah\blah\blah\fileijk'; % file changes based on i, j, and k
end
end
end
In the next part of the code I want to use load to open the files using the path names defined in the code above. I do this using the code below.
Dummy = DataPath{l,(k-1)*TSRRange+m};
Data = load(Dummy);
The idea is for Dummy to take the string content out of DataPath so I can use it in load. By doing this I thought that Dummy would be defined as a string and not a cell, but this is not the case. How do I pull the string out of DataPath so I can use it with load? Thanks.
I have to load the data this way because the data is located in multiple folders. I can post more of the code if needed, but it is complex.
Dummy is a cell because you assigned a 3D cell array but are accessing it as a 2D cell with Dummy = DataPath{l,(k-1)*TSRRange+m}.
I don't believe that you can expect to access all cell elements in this way. Instead, use three indices, just as you did when creating it.
