Creating dictionary from excel data - python-3.x

I have data in excel and need to create a dictionary for those data.
expected output like below:-
d = [
{
"name":"dhdn",
"usn":1bm15mca13",
"sub":["c","java","python"],
"marks":[90,95,98]
},
{
"name":"subbu",
"usn":1bm15mca14",
"sub":["java","perl"],
"marks":[92,91]
},
{
"name":"paddu",
"usn":1bm15mca17",
"sub":["c#","java"],
"marks":[80,81]
}
]
Tried code but it is working for only two column
import pandas as pd
existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')
result = [{'name':k,'sub':g["sub"].tolist(),"marks":g["marks"].tolist()} for k,g in df_service.groupby(['name', 'usn'])]
print (result)
I am getting like below but I want as I expected like above.
[{'name': ('dhdn', '1bm15mca13'), 'sub': ['c', 'java', 'python'], 'marks': [90, 95, 98]}, {'name': ('paddu', '1bm15mca17'), 'sub': ['c#', 'java'], 'marks': [80, 81]}, {'name': ('subbu', '1bm15mca14'), 'sub': ['java', 'perl'], 'marks': [92, 91]}]

Finally, I solved.
import pandas as pd
from pprint import pprint
existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')
result = [{'name':k[0],'usn':k[1],'sub':v["sub"].tolist(),"marks":v["marks"].tolist()} for k,v in df_service.groupby(['name', 'usn'])]
pprint (result)
It is giving expected output as I expected.
[{'marks': [90, 95, 98],
'name': 'dhdn',
'sub': ['c', 'java', 'python'],
'usn': '1bm15mca13'},
{'marks': [80, 81],
'name': 'paddu',
'sub': ['c#', 'java'],
'usn': '1bm15mca17'},
{'marks': [92, 91],
'name': 'subbu',
'sub': ['java', 'perl'],
'usn': '1bm15mca14'}]

All right! I solved your question although it took me a while.
The first part is the same as your progress.
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.fillna(method='ffill')
Then we need to get the unique names and how many rows they cover. I'm assuming there are as many unique names as there are unique "usn's". I created a list that stores these 'counts'.
unique_names = df.name.unique()
unique_usn = df.usn.unique()
counts = []
for i in range(len(unique_names)):
counts.append(df.name.str.count(unique_names[i]).sum())
counts
[3,2,2] #this means that 'dhdn' covers 3 rows, 'subbu' covers 2 rows, etc.
Now we need a smart function that will let us obtain the necessary info from the other columns.
def get_items(column_number):
empty_list = []
lower_bound = 0
for i in range(len(counts)):
empty_list.append(df.iloc[lower_bound:sum(counts[:i+1]),column_number].values.tolist())
lower_bound = sum(counts[:i+1])
return empty_list
I leave it to you to understand what is going on. But basically we are recovering the necessary info. We now just need to apply that to get a list for subs and for marks, respectively.
list_sub = get_items(3)
list_marks = get_items(2)
Finally, we put it all into one list of dicts.
d = []
for i in range(len(unique_names)):
diction = {}
diction['name'] = unique_names[i]
diction['usn'] = unique_usn[i]
diction['sub'] = list_sub[i]
diction['marks'] = list_marks[i]
d.append(diction)
And voilĂ !
print(d)
[{'name': 'dhdn', 'usn': '1bm15mca13', 'sub': [90, 95, 98], 'marks': ['c', 'java', 'python']},
{'name': 'subbu', 'usn': '1bm15mca14', 'sub': [92, 91], 'marks': ['java', 'perl']},
{'name': 'paddu', 'usn': '1bm15mca17', 'sub': [80, 81], 'marks': ['c#', 'java']}]

Related

Python Pandas How to get rid of groupings with only 1 row?

In my dataset, I am trying to get the margin between two values. The code below runs perfectly if the fourth race was not included. After grouping based on a column, it seems that sometimes, there will be only 1 value, therefore, no other value to get a margin out of. I want to ignore these groupings in that case. Here is my current code:
import pandas as pd
data = {'Name':['A', 'B', 'B', 'C', 'A', 'C', 'A'], 'RaceNumber':
[1, 1, 2, 2, 3, 3, 4], 'PlaceWon':['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'], 'TimeRanInSec':[100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)
print(df)
def winning_margin(times):
times = list(times)
winner = min(times)
times.remove(winner)
return min(times) - winner
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
.groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']
winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')
avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins
How about returning a NaN if times does not have enough elements:
import numpy as np
def winning_margin(times):
if len(times) <= 1: # New code
return np.NaN # New code
times = list(times)
winner = min(times)
times.remove(winner)
return min(times) - winner
your code runs with this change and seem to produce sensible results. But you can furthermore remove NaNs later if you want eg in this line
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
.groupby('RaceNumber').agg(winning_margin).dropna() # note the addition of .dropna()
You could get the winner and margin in one step:
def get_margin(x):
if len(x) < 2:
return np.NaN
i = x['TimeRanInSec'].idxmin()
nl = x['TimeRanInSec'].nsmallest(2)
margin = nl.max()-nl.min()
return [x['Name'].loc[i], margin]
Then:
df.groupby('RaceNumber').apply(get_margin).dropna()
RaceNumber
1 [B, 2]
2 [C, 6]
3 [C, 5]
(the data has the 'First' indicator corresponding to the slower time in the data)

Trouble with a python loop

I'm having issues with a loop that I want to:
a. see if a value in a DF row is greater than a value from a list
b. if it is, concatenate the variable name and the value from the list as a string
c. if it's not, pass until the loop conditions are met.
This is what I've tried.
import pandas as pd
import numpy as np
df = {'level': ['21', '22', '23', '24', '25', '26', '27', '28', '29', '30']
, 'variable':'age'}
df = pd.DataFrame.from_dict(df)
knots = [0, 25]
df.assign(key = np.nan)
for knot in knots:
if df['key'].items == np.nan:
if df['level'].astype('int') > knot:
df['key'] = df['variable']+"_"+knot.astype('str')
else:
pass
else:
pass
However, this only yields the key column to have NaN values. I'm not sure why it's not placing the concatenation.
You can do something like this inside the for loop. No need of any if conditions:
df.loc[df['level'].astype('int') > 25, 'key'] = df.loc[df['level'].astype('int') > 25, 'variable'] + '_' + df.loc[df['level'].astype('int') > 25, 'level']

Extracting Rows by specific keyword in Python (Without using Pandas)

My csv file looks like this:-
ID,Product,Price
1,Milk,20
2,Bottle,200
3,Mobile,258963
4,Milk,24
5,Mobile,10000
My code of extracting row is as follow :-
def search_data():
fin = open('Products/data.csv')
word = input() # "Milk"
found = {}
for line in fin:
if word in line:
found[word]=line
return found
search_data()
While I run this above code I got output as :-
{'Milk': '1,Milk ,20\n'}
I want If I search for "Milk" I will get all the rows which is having "Milk" as Product.
Note:- Do this in only Python don't use Pandas
Expected output should be like this:-
[{"ID": "1", "Product": "Milk ", "Price": "20"},{"ID": "4", "Product": "Milk ", "Price": "24"}]
Can anyone tell me where am I doing wrong ?
In your script every time you assign found[word]=line it will overwrite the value that was before it. Better approach is load all the data and then do filtering:
If file.csv contains:
ID Product Price
1 Milk 20
2 Bottle 200
3 Mobile 10,000
4 Milk 24
5 Mobile 15,000
Then this script:
#load data:
with open('file.csv', 'r') as f_in:
lines = [line.split() for line in map(str.strip, f_in) if line]
data = [dict(zip(lines[0], l)) for l in lines[1:]]
# print only items with 'Product': 'Milk'
print([i for i in data if i['Product'] == 'Milk'])
Prints only items with Product == Milk:
[{'ID': '1', 'Product': 'Milk', 'Price': '20'}, {'ID': '4', 'Product': 'Milk', 'Price': '24'}]
EDIT: If your data are separated by commas (,), you can use csv module to read it:
File.csv contains:
ID,Product,Price
1,Milk ,20
2,Bottle,200
3,Mobile,258963
4,Milk ,24
5,Mobile,10000
Then the script:
import csv
#load data:
with open('file.csv', 'r') as f_in:
csvreader = csv.reader(f_in, delimiter=',', quotechar='"')
lines = [line for line in csvreader if line]
data = [dict(zip(lines[0], l)) for l in lines[1:]]
# # print only items with 'Product': 'Milk'
print([i for i in data if i['Product'].strip() == 'Milk'])
Prints:
[{'ID': '1', 'Product': 'Milk ', 'Price': '20'}, {'ID': '4', 'Product': 'Milk ', 'Price': '24'}]

Need help to convert some keys from a tuple dictionary in another dictionary

I have a csv file that I read with csv module in a csv.DictReader().
I have an output like this:
{'biweek': '1', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '526822.1365'}
{'biweek': '2', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '526995.246'}
{'biweek': '3', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '527170.1981'}
{'biweek': '4', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '527347.0136'}
And I need to get the 'loc' as key for a new dict and the count of the 'loc' as values for that new dict, as the 'loc' have a lot of repetitions in the file.
with open('Dalziel2015_data.csv') as fh:
new_dct = {}
cities = set()
cnt = 0
reader = csv.DictReader(fh)
for row in reader:
data = dict(row)
cities.add(data.get('loc'))
for (k, v) in data.items():
if data['loc'] in cities:
cnt += 1
new_dct[data['loc']] = cnt + 1
print(new_dct)
example_file:
biweek,year,loc,cases,pop
1,1906,BALTIMORE,NA,526822.1365
2,1906,BALTIMORE,NA,526995.246
3,1906,BALTIMORE,NA,527170.1981
4,1906,BALTIMORE,NA,527347.0136
5,1906,BALTIMORE,NA,527525.7134
6,1906,BALTIMORE,NA,527706.3183
4,1906,BOSTON,NA,630880.6579
5,1906,BOSTON,NA,631295.9457
6,1906,BOSTON,NA,631710.8403
7,1906,BOSTON,NA,632125.3403
8,1906,BOSTON,NA,632539.4442
9,1906,BOSTON,NA,632953.1503
10,1907,BRIDGEPORT,NA,91790.75578
11,1907,BRIDGEPORT,NA,91926.14732
12,1907,BRIDGEPORT,NA,92061.90153
13,1907,BRIDGEPORT,NA,92198.01976
14,1907,BRIDGEPORT,NA,92334.50335
15,1907,BRIDGEPORT,NA,92471.35364
17,1908,BUFFALO,NA,413661.413
18,1908,BUFFALO,NA,413934.7646
19,1908,BUFFALO,NA,414208.4097
20,1908,BUFFALO,NA,414482.3523
21,1908,BUFFALO,NA,414756.5963
22,1908,BUFFALO,NA,415031.1456
23,1908,BUFFALO,NA,415306.0041
24,1908,BUFFALO,NA,415581.1758
25,1908,BUFFALO,NA,415856.6646
6,1935,CLEVELAND,615,890247.9867
7,1935,CLEVELAND,954,890107.9192
8,1935,CLEVELAND,965,889967.7823
9,1935,CLEVELAND,872,889827.5956
10,1935,CLEVELAND,814,889687.3781
11,1935,CLEVELAND,717,889547.1492
12,1935,CLEVELAND,770,889406.9283
13,1935,CLEVELAND,558,889266.7346
I have done this. I got the keys alright, but I didn't get the count right.
My results:
{'BALTIMORE': 29, 'BOSTON': 59, 'BRIDGEPORT': 89, 'BUFFALO': 134, 'CLEVELAND': 174}
I know pandas is a very good tool but I need the code with csv module.
If any of you guys could help me to get the count done I appreciate.
Thank you!
Paulo
You can use collections.Counter to count occurrences of the cities in CSV file. Counter.keys() will also give you all cities found in CSV:
import csv
from collections import Counter
with open('csvtest.csv') as fh:
reader = csv.DictReader(fh)
c = Counter(row['loc'] for row in reader)
print(dict(c))
print('Cities={}'.format([*c.keys()]))
Prints:
{'BALTIMORE': 6, 'BOSTON': 6, 'BRIDGEPORT': 6, 'BUFFALO': 9, 'CLEVELAND': 8}
Cities=['BALTIMORE', 'BOSTON', 'BRIDGEPORT', 'BUFFALO', 'CLEVELAND']
You are updating a global counter and not the counter for the specific location. You are also iterating each column of each row and updating it for no reason.
Try this:
with open('Dalziel2015_data.csv') as fh:
new_dct = {}
cities = set()
reader = csv.DictReader(fh)
for row in reader:
data = dict(row)
new_dct[data['loc']] = new_dct.get(data['loc'], 0) + 1
print(new_dct)
This line: new_dct[data['loc']] = new_dct.get(data['loc'], 0) + 1 will get the last counter for that city and increment the number by one. If the counter does not exist yet, the function get will return 0.

Python3 TypeError: string indices must be integers

I am new to python programming. I'm trying to implement a code which reads data from a file and displays it in kind of a tabular format. However, when I try to run my code, it gives the error as:
TypeError: string indices must be integers
Here is my code:
from operator import itemgetter
emp_dict = dict()
emp_list = list()
with open('m04_lab_profiles','r') as people:
for p in people:
emp_list = p.strip().split(',')
emp_info = dict()
emp_info['Name'] = emp_list[0]
emp_info['Location'] = emp_list[1]
emp_info['Status'] = emp_list[2]
emp_info['Employer'] = emp_list[3]
emp_info['Job'] = emp_list[4]
emp_dict[emp_list[0]] = emp_list
emp_list.append(emp_info)
for info in emp_list:
print("{0:20} {1:25} {2:20} {3:20} {4:45}".format(int(info['Name'],info['Location'],info['Status'],info['Employer'],info['Job'])))
print("\n\n")
info_sorted = sorted(emp_list,key=itemgetter('Name'))
for x in info_sorted:
print("{0:20} {1:25} {2:20} {3:20}
{4:45}".format(emp_info['Name'],
emp_info['Address'],
emp_info['Status'],
emp_info['Employer'],
emp_info['Job']))
I've tried almost every other solution given for the same question title, but all went in vain. Please help
The issue is that you're using emp_list inside of your loop as well as outside. The result is that your list once you've loaded the file has some elements that are strings (which require an integer index) and some elements that are dicts (which have more flexible indexing rules). Specifically, with an example file that looks like
name,location,status,employer,job
name2,location2,status2,employer2,job2
After the loop, the emp_list looks like
In [3]: emp_list
Out[3]:
['name2',
'location2',
'status2',
'employer2',
'job2',
{'Name': 'name2',
'Location': 'location2',
'Status': 'status2',
'Employer': 'employer2',
'Job': 'job2'}]
The fix to this is to use a different temporary list as the output of your .split(',') call. I.e.
In [4]: from operator import itemgetter
...: emp_dict = dict()
...: emp_list = list()
...: with open('m04_lab_profiles','r') as people:
...: for p in people:
...: tmp = p.strip().split(',')
...: emp_info = dict()
...: emp_info['Name'] = tmp[0]
...: emp_info['Location'] = tmp[1]
...: emp_info['Status'] = tmp[2]
...: emp_info['Employer'] = tmp[3]
...: emp_info['Job'] = tmp[4]
...: emp_dict[tmp[0]] = emp_info
...: emp_list.append(emp_info)
...:
...:
In [5]: emp_list
Out[5]:
[{'Name': 'name',
'Location': 'location',
'Status': 'status',
'Employer': 'employer',
'Job': 'job'},
{'Name': 'name2',
'Location': 'location2',
'Status': 'status2',
'Employer': 'employer2',
'Job': 'job2'}]

Resources