I have a python dictionary like below
car_dict = {
    'benz': {'usa': 876456, 'uk': 965471},
    'audi': {'usa': 523487, 'uk': 456879},
    'bmw': {'usa': 754235, 'uk': 543298}
}
I need the output like below
benz,876456,965471
audi,523487,456879
bmw,754235,543298
and also in sorted form, like below
audi,523487,456879
benz,876456,965471
bmw,754235,543298
Please help me in getting both outputs
You could do this:
car_dict= {
'benz': {'usa':876456, 'uk':965471},
'audi' : {'usa':523487, 'uk':456879},
'bmw': {'usa':754235, 'uk':543298}
}
cars = []
for car in car_dict:
    # Build one "name,usa,uk" string per car
    cars.append('{},{},{}'.format(
        car,
        car_dict[car]['usa'],
        car_dict[car]['uk']
    ))
cars = sorted(cars)
for car in cars:
    print(car)
Result
audi,523487,456879
benz,876456,965471
bmw,754235,543298
Explanation
Loop through each car and append a formatted string with the model, USA number and UK number to a list. Sort the list alphabetically, then print each entry.
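The question asks for both the original order and the sorted order. A minimal sketch covering both, assuming Python 3.7+ (where a dict keeps its insertion order); the helper name format_rows is made up for illustration:

```python
car_dict = {
    'benz': {'usa': 876456, 'uk': 965471},
    'audi': {'usa': 523487, 'uk': 456879},
    'bmw': {'usa': 754235, 'uk': 543298},
}

def format_rows(cars, sort_by_name=False):
    """Return one 'name,usa,uk' string per car (hypothetical helper)."""
    # Iterating a dict yields its keys in insertion order (Python 3.7+)
    names = sorted(cars) if sort_by_name else list(cars)
    return ['{},{},{}'.format(name, cars[name]['usa'], cars[name]['uk'])
            for name in names]

unsorted_rows = format_rows(car_dict)
sorted_rows = format_rows(car_dict, sort_by_name=True)

print('\n'.join(unsorted_rows))
print('\n'.join(sorted_rows))
```

The first print gives the benz, audi, bmw order of the input, and the second gives the alphabetical order.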
To print the data
# Use a list comprehension to build sorted rows of car, USA and UK values
data = [[car] + list(regions.values()) for car, regions in sorted(car_dict.items(), key=lambda x:x[0])]
for row in data:
    print(*row, sep = ',')
Output
audi,523487,456879
benz,876456,965471
bmw,754235,543298
Explanation
Sort items by car
for car, regions in sorted(car_dict.items(), key=lambda x:x[0])
Each inner list in the comprehension is a row of car, USA and UK values
[car] + list(regions.values())
Print each row comma delimited
for row in data:
    print(*row, sep = ',')
Are you looking to print the output, or just to organize the data in a better format for analysis?
If the latter, I would use pandas and do the following:
import pandas as pd
pd.DataFrame(car_dict).transpose().sort_index()
To view the output in the terminal the way you requested,
for index, row in pd.DataFrame(car_dict).transpose().sort_index().iterrows():
    print('{},{},{}'.format(index, row['usa'], row['uk']))
will print this out:
audi,523487,456879
benz,876456,965471
bmw,754235,543298
Find below the solution,
car_dict = {
'benz': {'usa':876456, 'uk':965471},
'audi' : {'usa':523487, 'uk':456879},
'bmw': {'usa':754235, 'uk':543298}
}
keys = list(car_dict.keys())
keys.sort()
for i in keys:
    # sep=',' gives the comma-separated output the question asks for
    print(i, car_dict[i]['usa'], car_dict[i]['uk'], sep=',')
I like the list comprehension answer given by @DarrylG; however, you do not really need the lambda expression in this case.
sorted() compares the (key, value) tuples from items() by their first element, and the keys are unique, so it already sorts by key by default. You can just use:
data = [[car] + list(regions.values()) for car, regions in sorted(car_dict.items())]
I would also make another slight change. If you wanted more explicit control over the region ordering (or wanted a different ordering), you could replace the [car] + list(regions.values()) with [car, regions['usa'], regions['uk']] like this:
data = [[car, regions['usa'], regions['uk']] for car, regions in sorted(car_dict.items())]
Of course, that means that if you added more regions you would have to change this, but I prefer setting the order explicitly.
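One way to keep the explicit ordering described above while making it easier to extend is to hold the region names in a single tuple. This is only a sketch; REGION_ORDER is a made-up name, not part of the original answers:

```python
car_dict = {
    'benz': {'usa': 876456, 'uk': 965471},
    'audi': {'usa': 523487, 'uk': 456879},
    'bmw': {'usa': 754235, 'uk': 543298},
}

# Hypothetical constant: the column order is stated once, in one place
REGION_ORDER = ('usa', 'uk')

data = [[car] + [regions[region] for region in REGION_ORDER]
        for car, regions in sorted(car_dict.items())]

for row in data:
    print(*row, sep=',')
```

Adding a region then means changing only the tuple, and a missing region raises a KeyError instead of silently shifting columns.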
What I want to do is group all similar strings in one column and sum their
corresponding counts; strings with no similar match should be left as they are.
This is somewhat similar to the post below, but I have not been able to apply it to my case:
How to group Pandas data frame by column with regex match
So far I have taken the following steps:
I wrote a function that prints the fuzz.WRatio for each row's string,
where each row does a linear search from the top to check whether any other row
holds a similar string. If the WRatio is > 90, I would like to sum those rows'
corresponding counts. Otherwise, leave them as they are.
I created test data that looks like this:
test_data=pd.DataFrame({
'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
'count':[4,3,2,6]
})
So what I want to do is make the result as a dataframe like:
result=pd.DataFrame({
'Nname':['Apple.Inc.','OMEGA'],
'Ncount':[9,6]
})
My function so far only gives me the fuzz ratio for each row
and, as I understand it,
each row is compared against itself three times (we have four rows here).
So my function's output looks like:
pd.DataFrame({
'Nname':['Apple.Inc.','Apple.Inc.','Apple.Inc.','apple.inc',\
'apple.inc','apple.inc'],
'Ncount':[4,4,4,3,3,3],
'FRatio': [100,100,100,100,100,100] })
This is just one portion of the whole output from the function I wrote with this test data.
And the last row "OMEGA" would give me a fuzz ratio about 18.
My function is like this:
def checkDupTitle2(data):
    Nname=[]
    Ncount=[]
    f_ratio=[]
    for i in range(0, len(data)):
        current=0
        count=0
        space=0
        for space in range(0, len(data)-1-current):
            ratio=fuzz.WRatio(str(data.loc[i]['name']).strip(), \
                              str(data.loc[current+space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
    df=pd.DataFrame({
        'Nname': Nname,
        'Ncount': Ncount,
        'FRatio': f_ratio
    })
    return df
After running this function and getting the output,
I tried to get what I eventually want.
Here I tried a group-by on the df created above:
output.groupby(output.FRatio>90).sum()
But this way I still need a "name" in my dataframe.
How can I decide which name to use for the total count (9 here):
"Apple.Inc.", "apple.inc" or "APPLE.INC"?
Or did I make it too complex?
Is there a way to group by "name" from the very start and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" as the same? That would solve my problem. I have been stumped for quite a while. Any help would be highly
appreciated! Thanks!
The following code uses my library RapidFuzz instead of FuzzyWuzzy, since it is faster and has a process method, extractIndices, which helps here. This solution is quite a bit faster, but since I do not work with pandas regularly I am sure there are still some things that could be improved :)
import pandas as pd
from rapidfuzz import process, utils
def checkDupTitle(data):
    values = data.values.tolist()
    companies = [company for company, _ in values]
    pcompanies = [utils.default_process(company) for company in companies]
    counts = [count for _, count in values]
    results = []
    while companies:
        company = companies.pop(0)
        pcompany = pcompanies.pop(0)
        count = counts.pop(0)
        duplicates = process.extractIndices(
            pcompany, pcompanies,
            processor=None, score_cutoff=90, limit=None)
        # Pop from the highest index down so earlier indices stay valid
        for (i, _) in sorted(duplicates, reverse=True):
            count += counts.pop(i)
            del pcompanies[i]
            del companies[i]
        results.append([company, count])
    return pd.DataFrame(results, columns=['Nname','Ncount'])
test_data=pd.DataFrame({
'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
'count':[4,3,2,6]
})
checkDupTitle(test_data)
The result is
pd.DataFrame({
'Nname':['Apple.Inc.','OMEGA'],
'Ncount':[9,6]
})
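For comparison, the same idea can be sketched with only the standard library. This swaps in difflib.SequenceMatcher for the fuzzy scorer, so its scores are not the same as WRatio; the normalize helper, the 0.9 threshold and the rule of keeping the first spelling seen as the group label are all assumptions made for illustration:

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case the name and drop everything except letters and digits."""
    return ''.join(ch for ch in name.lower() if ch.isalnum())

def group_counts(rows, threshold=0.9):
    """Merge rows whose normalized names are at least `threshold` similar.

    The first spelling seen is kept as the label of its group.
    """
    groups = []  # each entry: [label, normalized label, running total]
    for name, count in rows:
        key = normalize(name)
        for group in groups:
            if SequenceMatcher(None, key, group[1]).ratio() >= threshold:
                group[2] += count
                break
        else:
            # No existing group was similar enough, so start a new one
            groups.append([name, key, count])
    return [(label, total) for label, _, total in groups]

rows = [('Apple.Inc.', 4), ('apple.inc', 3), ('APPLE.INC', 2), ('OMEGA', 6)]
print(group_counts(rows))  # [('Apple.Inc.', 9), ('OMEGA', 6)]
```

Normalizing first means the three Apple spellings compare as identical strings, which also answers the question of which name to keep: here it is simply the first one encountered.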
I want to find the rows in a data frame that have the same values in certain columns (sex, work class, education).
new_row_data=df.head(20)
new_center_clusters =new_row_data.head(20)
for j in range(len(new_center_clusters)):
    row=[]
    for i in range(len(new_row_data)):
        if (new_center_clusters.iloc[j][5] == new_row_data.iloc[i][5]):
            if(new_center_clusters.iloc[j][2] == new_row_data.iloc[i][2]):
                if(new_center_clusters.iloc[j][3] == new_row_data.iloc[i][3]):
                    if(new_center_clusters.iloc[j][0] != new_center_clusters.iloc[i][0]):
                        row.append(new_center_clusters.iloc[j][0])
                        row.append(new_center_clusters.iloc[i][0])
    myset = list(set(row))
    myset.sort()
    print(myset)
I need a single list containing the IDs of all similar rows, but I cannot merge the separate lists into one.
I get this result:
I need to get something like this:
[1,12,8,17,3,18,4,19,5,13,6,9]
Thank you in advance.
If you want to combine the lists:
a=[1,3,4]
b=[2,4,1]
a.extend(b)
This changes a to:
[1,3,4,2,4,1]
Similarly, if you want to remove the duplicates, convert it to a set and back to a list:
c=list(set(a))
This keeps each value once, but a set does not preserve the original order; in CPython this example gives:
[1,2,3,4]
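If the order of first appearance matters, as in the desired output of the question, a sketch using itertools.chain and dict.fromkeys keeps it. The ID lists below are made up for illustration, and this assumes Python 3.7+, where dicts preserve insertion order:

```python
from itertools import chain

# Made-up per-row ID lists, similar to what the loop in the question prints
lists = [[1, 12], [8, 17], [1, 12], [3, 18]]

# Flatten all lists into one
merged = list(chain.from_iterable(lists))

# dict keys are unique and keep insertion order, so this drops
# duplicates while preserving the order of first appearance
unique_in_order = list(dict.fromkeys(merged))

print(merged)           # [1, 12, 8, 17, 1, 12, 3, 18]
print(unique_in_order)  # [1, 12, 8, 17, 3, 18]
```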
I would like to sort and group a dictionary by keys. The keys are currently full names, but I would like to group all last names that are similar together and combine their value pairs. An excerpt of the input dictionary is below:
facdict = {'Yimei Li': [' Ph.D.', 'Assistant Professor of Biostatistics', 'liy3#email.chop.edu'],
'Mingyao Li': [' Ph.D.', 'Associate Professor of Biostatistics', 'mingyao#mail.med.upenn.edu'],
'Hongzhe Li': [' Ph.D', 'Professor of Biostatistics', 'hongzhe#upenn.edu'],
'A. Russell Localio': [' JD MA MPH MS PhD', 'Associate Professor of Biostatistics', 'rlocalio#upenn.edu']}
The desired output is:
last_name_dict = {'Li': [[' Ph.D.', 'Assistant Professor of Biostatistics', 'liy3#email.chop.edu'], [' Ph.D.', 'Associate Professor of Biostatistics', 'mingyao#mail.med.upenn.edu'], [' Ph.D', 'Professor of Biostatistics', 'hongzhe#upenn.edu']],
'Localio': [' JD MA MPH MS PhD', 'Associate Professor of Biostatistics', 'rlocalio#upenn.edu']}
I have tried to use the following dictionary comprehension:
import re
search = re.compile(r"([A-Z]{1}[a-z]+)")
last_name_dict = {k.replace(k, search.findall(k)[-1:][0]): v for k, v in facdict.items()}
But this returns the last names of each entry with only the first value pair associated with it.
A dictionary comprehension can only produce single key-value pairs; any repeated pairs are not combined, and just replace the previous value for the same key.
Just use a regular loop, and initialise the outer list with dict.setdefault():
last_name_dict = {}
for k, v in facdict.items():
    last_name = k.replace(k, search.findall(k)[-1:][0])
    last_name_dict.setdefault(last_name, []).append(v)
dictionary.setdefault(key, []) looks for the key in the dictionary and returns its value. However, if the key is not yet set, the second argument is first stored as the value for that key, and that object is returned. So in the above code, last_name_dict.setdefault(...) always returns a list, so we can call .append(...) and add another entry.
If you don't mind that you won't get key errors for wrong keys, you could use a collections.defaultdict() object:
from collections import defaultdict
last_name_dict = defaultdict(list)
for k, v in facdict.items():
    last_name = k.replace(k, search.findall(k)[-1:][0])
    last_name_dict[last_name].append(v)
Take into account that last_name_dict[unknown_key] will create another list object and return that.
You can achieve the same using a dictionary comprehension if you first sort your input on last names and then group the input by the same last name value with itertools.groupby(), but this is not as efficient. The above solutions group the input in O(N) linear time; for 10 items you take 10 steps, for 100 items 100 steps, etc. Sorting takes O(N log N) quasilinear time, where 10 items take about 33 steps, 100 items take about 664 steps, etc. Even if each sorting step is faster, the number of steps grows faster with the input size when sorting is required, so the sorting approach ends up slower for large enough inputs.
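The sort-then-group variant mentioned above might look like the sketch below. The value lists are shortened placeholders, and the last_name helper (taking the last whitespace-separated word) is an assumption that stands in for the regular expression used earlier:

```python
from itertools import groupby

# Shortened placeholder values, not the full records from the question
facdict = {
    'Yimei Li': ['Assistant Professor'],
    'Mingyao Li': ['Associate Professor'],
    'A. Russell Localio': ['Associate Professor'],
    'Hongzhe Li': ['Professor'],
}

def last_name(full_name):
    """Hypothetical helper: treat the last word as the last name."""
    return full_name.split()[-1]

def item_key(item):
    return last_name(item[0])

# groupby() only merges adjacent items, so sort by the same key first
by_last_name = sorted(facdict.items(), key=item_key)

last_name_dict = {
    name: [value for _, value in group]
    for name, group in groupby(by_last_name, key=item_key)
}
print(last_name_dict)
```

Because sorted() is stable, entries that share a last name stay in their original relative order inside each group.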
I have a large list with elements as:
#1
#10
(on
)
0.0574
122-124
122A
Cat
Dog
elephant
elephant12
elephant-1
I want to search and be able to find only the following:
Cat
Dog
elephant
elephant12
elephant-1
i.e. elements that begin with a letter of the English alphabet.
Use a list comprehension:
import string
result = [item for item in my_list if item[0] in string.ascii_letters]
As @Jon commented, checking whether a character is a letter can simply be (note that isalpha() also accepts non-English letters):
result = [item for item in my_list if item[0].isalpha()]
The above works when all items are non-empty strings (item[0] raises an IndexError on an empty string) and you want the items with a leading English letter. Change the if part as needed, or even write a function if the condition gets too complex.
If you are looking for a memory-optimized version, consider a generator.
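A generator version could look like the sketch below. The function name starts_with_letter and the sample list (including the empty string) are made up for illustration, and str.isascii() assumes Python 3.7+:

```python
def starts_with_letter(items):
    """Yield, one at a time, the items whose first character is an ASCII letter."""
    for item in items:
        first = item[:1]  # slicing returns '' for an empty string instead of raising
        if first.isascii() and first.isalpha():
            yield item

# Made-up sample in the spirit of the question's list
my_list = ['#1', '(on', '0.0574', '122A', 'Cat', 'Dog', 'elephant', 'elephant-1', '']

# Items are produced lazily, so no second full list is built
for match in starts_with_letter(my_list):
    print(match)
```

Wrapping the call in list() gives the same result as the list comprehension when the full list is needed after all.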
You could work with a list of animals, for example:
animals = ["cat", "dog", "elephant", "fish", "bird", "snake"]
Then search for any of those strings in your input.
I know there are better methods, but I would do it with:
def search(text):
    # Collect every known animal name that occurs in the text
    result = []
    for item in animals:
        if item in text:
            result.append(item)
    return result
You could then make the match case-insensitive.
If you want the number associated with the animal name, you'll need to append the input item instead of the list entry. In that case, iterate over your input.
Of course, your list could grow to the size of a database if you need it to, for example to search for every existing word.
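Putting those two refinements together, a sketch that iterates over the input, ignores case and returns the full input items might look like this. The function name find_animals and the sample entries are assumptions for illustration:

```python
animals = ['cat', 'dog', 'elephant', 'fish', 'bird', 'snake']

def find_animals(entries, names=animals):
    """Return the entries that contain any known animal name, ignoring case."""
    lowered_names = [name.lower() for name in names]
    return [entry for entry in entries
            if any(name in entry.lower() for name in lowered_names)]

# Made-up sample in the spirit of the question's list
entries = ['#1', '(on', '122A', 'Cat', 'Dog', 'elephant12', 'elephant-1']
print(find_animals(entries))  # ['Cat', 'Dog', 'elephant12', 'elephant-1']
```

Note that this is a plain substring test, so an entry such as 'concatenate' would also match 'cat'.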
record=['MAT', '90', '62', 'ENG', '92','88']
course='MAT'
Suppose I want to get the marks for MAT or ENG. What do I do? I only know how to find the index of the course, which is new[4:10].index(course). I don't know how to get the marks.
Try this:
i = record.index('MAT')
grades = record[i+1:i+3]
In this case i is the index/position of the 'MAT' or whichever course, and grades are the items in a slice comprising the two slots after the course name.
You could also put it in a function:
def get_grades(course):
    i = record.index(course)
    return record[i+1:i+3]
Then you can just pass in the course name and get back the grades.
>>> get_grades('ENG')
['92', '88']
>>> get_grades('MAT')
['90', '62']
>>>
Edit
If you want to get a string of the two grades together instead of a list with the individual values you can modify the function as follows:
def get_grades(course):
    i = record.index(course)
    return ' '.join("'{}'".format(g) for g in record[i+1:i+3])
You can use the list index() method (see https://stackoverflow.com/a/176921/) and then read the positions that follow, but I think you should use a dictionary.
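A minimal sketch of the dictionary approach, assuming every course name in record is followed by exactly two marks; the name marks is made up for illustration:

```python
record = ['MAT', '90', '62', 'ENG', '92', '88']

# Assumption: the list is laid out as course, mark, mark, course, mark, mark, ...
marks = {record[i]: record[i + 1:i + 3] for i in range(0, len(record), 3)}

print(marks['MAT'])      # ['90', '62']
print(marks.get('ENG'))  # ['92', '88']
print(marks.get('PHY'))  # None, because the course is not in the record
```

Once the dictionary is built, each lookup is direct and does not need to scan the list again.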