I have data in Excel and need to create a dictionary from it. The expected output looks like this:
d = [
    {
        "name": "dhdn",
        "usn": "1bm15mca13",
        "sub": ["c", "java", "python"],
        "marks": [90, 95, 98]
    },
    {
        "name": "subbu",
        "usn": "1bm15mca14",
        "sub": ["java", "perl"],
        "marks": [92, 91]
    },
    {
        "name": "paddu",
        "usn": "1bm15mca17",
        "sub": ["c#", "java"],
        "marks": [80, 81]
    }
]
I tried the code below, but it only works for two columns:
import pandas as pd

existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')  # forward-fill blank cells left by merged rows
# grouping by two keys makes k a (name, usn) tuple
result = [{'name': k, 'sub': g["sub"].tolist(), "marks": g["marks"].tolist()} for k, g in df_service.groupby(['name', 'usn'])]
print(result)
I am getting the output below, where the key comes through as a (name, usn) tuple, but I want it as shown above.
[{'name': ('dhdn', '1bm15mca13'), 'sub': ['c', 'java', 'python'], 'marks': [90, 95, 98]}, {'name': ('paddu', '1bm15mca17'), 'sub': ['c#', 'java'], 'marks': [80, 81]}, {'name': ('subbu', '1bm15mca14'), 'sub': ['java', 'perl'], 'marks': [92, 91]}]
Finally, I solved it:
import pandas as pd
from pprint import pprint

existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')
# unpack the (name, usn) group key into separate fields
result = [{'name': k[0], 'usn': k[1], 'sub': v["sub"].tolist(), "marks": v["marks"].tolist()} for k, v in df_service.groupby(['name', 'usn'])]
pprint(result)
It gives the output exactly as expected:
[{'marks': [90, 95, 98],
  'name': 'dhdn',
  'sub': ['c', 'java', 'python'],
  'usn': '1bm15mca13'},
 {'marks': [80, 81],
  'name': 'paddu',
  'sub': ['c#', 'java'],
  'usn': '1bm15mca17'},
 {'marks': [92, 91],
  'name': 'subbu',
  'sub': ['java', 'perl'],
  'usn': '1bm15mca14'}]
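As an aside, the same shape can be produced by letting pandas aggregate each group into lists. A minimal sketch, using a hypothetical inline DataFrame in place of the Excel sheet:

import pandas as pd

# Hypothetical stand-in for the Excel sheet
df = pd.DataFrame({
    'name':  ['dhdn', 'dhdn', 'dhdn', 'subbu', 'subbu'],
    'usn':   ['1bm15mca13'] * 3 + ['1bm15mca14'] * 2,
    'sub':   ['c', 'java', 'python', 'java', 'perl'],
    'marks': [90, 95, 98, 92, 91],
})

# Collect each group's subjects and marks into lists,
# then emit one dict per (name, usn) group
result = (df.groupby(['name', 'usn'], as_index=False)
            .agg({'sub': list, 'marks': list})
            .to_dict('records'))
print(result)

Here agg({'sub': list, 'marks': list}) gathers each group's values and to_dict('records') emits one dict per group, so no comprehension is needed.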
All right! I solved your question although it took me a while.
The first part is the same as what you already have:
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.fillna(method='ffill')
Then we need to get the unique names and how many rows each one covers. I'm assuming there are as many unique names as there are unique usn's. I created a list that stores these counts.
unique_names = df.name.unique()
unique_usn = df.usn.unique()
counts = []
for i in range(len(unique_names)):
    counts.append(df.name.str.count(unique_names[i]).sum())
counts
[3, 2, 2]  # this means that 'dhdn' covers 3 rows, 'subbu' covers 2 rows, etc.
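Note that str.count does substring matching, so a name that appears inside a longer name would be overcounted. A hedged alternative that counts exact matches only:

counts = [(df.name == n).sum() for n in unique_names]  # exact-match row counts per name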
Now we need a smart function that will let us obtain the necessary info from the other columns.
def get_items(column_number):
    empty_list = []
    lower_bound = 0
    for i in range(len(counts)):
        empty_list.append(df.iloc[lower_bound:sum(counts[:i+1]), column_number].values.tolist())
        lower_bound = sum(counts[:i+1])
    return empty_list
I leave it to you to understand what is going on. But basically we are recovering the necessary info. We now just need to apply that to get a list for subs and for marks, respectively.
list_sub = get_items(2)
list_marks = get_items(3)
Finally, we put it all into one list of dicts.
d = []
for i in range(len(unique_names)):
    diction = {}
    diction['name'] = unique_names[i]
    diction['usn'] = unique_usn[i]
    diction['sub'] = list_sub[i]
    diction['marks'] = list_marks[i]
    d.append(diction)
And voilà!
print(d)
[{'name': 'dhdn', 'usn': '1bm15mca13', 'sub': ['c', 'java', 'python'], 'marks': [90, 95, 98]},
 {'name': 'subbu', 'usn': '1bm15mca14', 'sub': ['java', 'perl'], 'marks': [92, 91]},
 {'name': 'paddu', 'usn': '1bm15mca17', 'sub': ['c#', 'java'], 'marks': [80, 81]}]
I have the following table of data in a spreadsheet:
Name    Description    Value
foo     foobar         5
baz     foobaz         4
bar     foofoo         8
I'm reading the spreadsheet and passing the data as a dataframe.
I need to transform this table of data to json following a specific schema.
I have the following script:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row.to_dict())
which returns:
{'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}
I want to be able to filter out a specific column. For example, to return this:
{'Name': 'bar', 'Description': 'foofoo'}
I know that I can print only the columns I want with print(row['Name'], row['Description']), but that returns only the values, and I want the keys as well.
How can I do this?
I wrote this entire thing only to realize that #anky_91 had already suggested it. Oh well...
import pandas as pd
data = {
"name": ["foo", "abc", "baz", "bar"],
"description": ["foobar", "foofoo", "foobaz", "foofoo"],
"value": [5, 3, 4, 8],
}
df = pd.DataFrame(data=data)
print(df, end='\n\n')
rec_dicts = df.loc[df["description"] == "foofoo", ["name", "description"]].to_dict(
    "records"
)
print(rec_dicts)
Output:
name description value
0 foo foobar 5
1 abc foofoo 3
2 baz foobaz 4
3 bar foofoo 8
[{'name': 'abc', 'description': 'foofoo'}, {'name': 'bar', 'description': 'foofoo'}]
After converting the row to a dictionary, you can delete the key you don't need:
del row['Value']
Now the dictionary will have only Name and Description.
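A minimal sketch of that idea, assuming the same iterrows loop as in the question:

for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        row_dict = row.to_dict()
        del row_dict['Value']  # drop the unwanted key
        print(row_dict)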
You can try this:
import io
import pandas as pd
s="""Name,Description,Value
foo,foobar,5
baz,foobaz,4
bar,foofoo,8
"""
df = pd.read_csv(io.StringIO(s))
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row[['Name', 'Description']].to_dict())
Result:
{'Name': 'bar', 'Description': 'foofoo'}
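An equivalent variation (a sketch under the same assumptions) drops the unwanted label from the row Series instead of selecting the wanted ones:

for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row.drop('Value').to_dict())  # everything except 'Value'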
I have a CSV file that I read with the csv module's csv.DictReader().
I have an output like this:
{'biweek': '1', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '526822.1365'}
{'biweek': '2', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '526995.246'}
{'biweek': '3', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '527170.1981'}
{'biweek': '4', 'year': '1906', 'loc': 'BALTIMORE', 'cases': 'NA', 'pop': '527347.0136'}
I need a new dict with 'loc' as the key and the count of that 'loc' as the value, since each 'loc' is repeated many times in the file.
import csv

with open('Dalziel2015_data.csv') as fh:
    new_dct = {}
    cities = set()
    cnt = 0
    reader = csv.DictReader(fh)
    for row in reader:
        data = dict(row)
        cities.add(data.get('loc'))
        for (k, v) in data.items():
            if data['loc'] in cities:
                cnt += 1
                new_dct[data['loc']] = cnt + 1
print(new_dct)
Example file:
biweek,year,loc,cases,pop
1,1906,BALTIMORE,NA,526822.1365
2,1906,BALTIMORE,NA,526995.246
3,1906,BALTIMORE,NA,527170.1981
4,1906,BALTIMORE,NA,527347.0136
5,1906,BALTIMORE,NA,527525.7134
6,1906,BALTIMORE,NA,527706.3183
4,1906,BOSTON,NA,630880.6579
5,1906,BOSTON,NA,631295.9457
6,1906,BOSTON,NA,631710.8403
7,1906,BOSTON,NA,632125.3403
8,1906,BOSTON,NA,632539.4442
9,1906,BOSTON,NA,632953.1503
10,1907,BRIDGEPORT,NA,91790.75578
11,1907,BRIDGEPORT,NA,91926.14732
12,1907,BRIDGEPORT,NA,92061.90153
13,1907,BRIDGEPORT,NA,92198.01976
14,1907,BRIDGEPORT,NA,92334.50335
15,1907,BRIDGEPORT,NA,92471.35364
17,1908,BUFFALO,NA,413661.413
18,1908,BUFFALO,NA,413934.7646
19,1908,BUFFALO,NA,414208.4097
20,1908,BUFFALO,NA,414482.3523
21,1908,BUFFALO,NA,414756.5963
22,1908,BUFFALO,NA,415031.1456
23,1908,BUFFALO,NA,415306.0041
24,1908,BUFFALO,NA,415581.1758
25,1908,BUFFALO,NA,415856.6646
6,1935,CLEVELAND,615,890247.9867
7,1935,CLEVELAND,954,890107.9192
8,1935,CLEVELAND,965,889967.7823
9,1935,CLEVELAND,872,889827.5956
10,1935,CLEVELAND,814,889687.3781
11,1935,CLEVELAND,717,889547.1492
12,1935,CLEVELAND,770,889406.9283
13,1935,CLEVELAND,558,889266.7346
I have done this. I got the keys alright, but I didn't get the count right.
My results:
{'BALTIMORE': 29, 'BOSTON': 59, 'BRIDGEPORT': 89, 'BUFFALO': 134, 'CLEVELAND': 174}
I know pandas is a very good tool, but I need to do this with the csv module.
If any of you could help me get the count right, I'd appreciate it.
Thank you!
Paulo
You can use collections.Counter to count occurrences of the cities in the CSV file. Counter.keys() will also give you all the cities found in the CSV:
import csv
from collections import Counter
with open('csvtest.csv') as fh:
    reader = csv.DictReader(fh)
    c = Counter(row['loc'] for row in reader)

print(dict(c))
print('Cities={}'.format([*c.keys()]))
Prints:
{'BALTIMORE': 6, 'BOSTON': 6, 'BRIDGEPORT': 6, 'BUFFALO': 9, 'CLEVELAND': 8}
Cities=['BALTIMORE', 'BOSTON', 'BRIDGEPORT', 'BUFFALO', 'CLEVELAND']
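If you ever need the most frequent cities first, Counter can give a sorted view as well:

print(c.most_common(2))  # [('BUFFALO', 9), ('CLEVELAND', 8)]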
You are updating a global counter rather than a per-location counter, and you are iterating over every column of each row and updating it for no reason.
Try this:
import csv

with open('Dalziel2015_data.csv') as fh:
    new_dct = {}
    reader = csv.DictReader(fh)
    for row in reader:
        data = dict(row)
        new_dct[data['loc']] = new_dct.get(data['loc'], 0) + 1

print(new_dct)
The line new_dct[data['loc']] = new_dct.get(data['loc'], 0) + 1 fetches the current count for that city and increments it by one. If the city has no counter yet, get returns the default 0.
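An equivalent hedged sketch using collections.defaultdict, which supplies the missing 0 automatically:

import csv
from collections import defaultdict

with open('Dalziel2015_data.csv') as fh:
    new_dct = defaultdict(int)
    for row in csv.DictReader(fh):
        new_dct[row['loc']] += 1  # defaultdict starts each city at 0

print(dict(new_dct))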
I am trying to concatenate the CSV rows. I tried converting the CSV rows to lists with pandas, but 'nan' values get appended because some cells are empty.
I also tried using zip, but it concatenates column values.
with open(i) as f:
    lines = f.readlines()
    res = ""
    for i, j in zip(lines[0].strip().split(','), lines[1].strip().split(',')):
        res += "{} {},".format(i, j)
    print(res.rstrip(','))
    for line in lines[2:]:
        print(line)
I have data as below.
Input data: [Input CSV Data image]
Expected output: [Output CSV Data image]
There are more than 3 rows; only a sample is given here.
Please suggest a way to achieve this without creating a new file, and point to any specific function or sample code.
This assumes your first line contains the correct number of columns. It reads the whole file, ignores empty data (",,,,,,"), and accumulates enough data points to fill one row before switching to the next row.
Write test file:
with open ("f.txt","w")as f:
f.write("""Circle,Year,1,2,3,4,5,6,7,8,9,10,11,12
abc,2018,,,,,,,,,,,,
2.2,8.0,6.5,9,88,,,,,,,,,,
55,66,77,88,,,,,,,,,,
5,3.2,7
def,2017,,,,,,,,,,,,
2.2,8.0,6.5,9,88,,,,,,,,,,
55,66,77,88,,,,,,,,,,
5,3.2,7
""")
Process test file:
data = []  # all data
temp = []  # data storage until enough found, then put into data

with open("f.txt", "r") as r:
    # get header and its length
    title = r.readline().rstrip().split(",")
    lenTitel = len(title)
    data.append(title)

    # process all remaining lines of the file
    for l in r:
        t = l.rstrip().split(",")  # read one line's data
        temp.extend(x for x in t if x)  # eliminates all empty ',,' pieces, even in between
        # if enough data accumulated, put as sublist into data, keep the rest
        if len(temp) > lenTitel:
            data.append(temp[:lenTitel])
            temp = temp[lenTitel:]

if temp:
    data.append(temp)

print(data)
Output:
[['Circle', 'Year', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'],
['abc', '2018', '2.2', '8.0', '6.5', '9', '88', '55', '66', '77', '88', '5', '3.2', '7'],
['def', '2017', '2.2', '8.0', '6.5', '9', '88', '55', '66', '77', '88', '5', '3.2', '7']]
Remarks:
your file can't have leading newlines, else the size of the title row is incorrect.
newlines in between do no harm
you cannot have "empty" cells - they get eliminated
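Since the question asks to avoid creating a new file, a hedged sketch that writes the cleaned rows back over the same path using the csv module:

import csv

with open("f.txt", "w", newline="") as w:
    csv.writer(w).writerows(data)  # overwrite the original file with the repaired rows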
As long as nothing weird is going on in the files, something like this should work:
with open(i) as f:
    result = []
    for line in f:
        result += [x for x in line.strip().split(',') if x]  # skip empty cells
    print(result)
I've got this data structure coming from the Vimeo API:
{'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
I want to transform it into
[720, "sv", "http..", "incident..", "http..", "Warsaw", "New Europe.."]
to load it into a Google spreadsheet. I also need to maintain a consistent value order.
PS: I've seen similar questions, but the answers are not in Python 3.
Thanks
I'm going to use the csv module to create a CSV file like you've described out of your data.
First, we should use a header row for your file, so the order doesn't matter, only dict keys do:
import csv

# This defines the order they'll show up in the final file
fieldnames = [
    'name', 'link', 'duration', 'language',
    'user_name', 'user_link', 'user_location',
]

# Open the file with Python
with open('my_file.csv', 'w', newline='') as my_file:
    # Attach a CSV writer to the file with the desired fieldnames
    writer = csv.DictWriter(my_file, fieldnames)
    # Write the header row
    writer.writeheader()
Notice the DictWriter: it allows us to write dicts based on their keys instead of their order (dicts are unordered pre-3.6). The above code will end up with a file like this:
name,link,duration,language,user_name,user_link,user_location
Which we can then add rows to, but let's convert your data first, so the keys match the above field names:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
for key, value in data['user'].items():
    data['user_{}'.format(key)] = value
del data['user']
This ends up with the data dictionary like this:
data = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user_link': 'https://vimeo.com/neweuropefilmsales',
'user_location': 'Warsaw, Poland',
'user_name': 'New Europe Film Sales',
}
We can now simply insert this as a whole row to the CSV writer, and everything else is done automatically:
# Using the same writer from above, insert the data from above
writer.writerow(data)
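If the API hands back several such videos, the same writer accepts many rows at once; a sketch where videos is a hypothetical list of dicts flattened like data above:

videos = [data]  # hypothetical: one flattened dict per video
writer.writerows(videos)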
That's it, now just import this into your Google spreadsheets :)
This is a simple solution using recursion:
dictionary = {
'duration': 720,
'language': 'sv',
'link': 'https://vimeo.com/neweuropefilmsale/incidentbyabank',
'name': 'INCIDENT BY A BANK',
'user': {
'link': 'https://vimeo.com/neweuropefilmsales',
'location': 'Warsaw, Poland',
'name': 'New Europe Film Sales'
}
}
def flatten(current, result=None):
    if result is None:  # avoid a shared mutable default across calls
        result = []
    if isinstance(current, dict):
        for key in current:
            flatten(current[key], result)
    else:
        result.append(current)
    return result
result = flatten(dictionary)
print(result)
Explanation: We call flatten() until we reach a value of the dictionary that is not a dictionary itself (if isinstance(current, dict):). When we reach such a value, we append it to our result list. Since dicts preserve insertion order (Python 3.7+), the values come out in a consistent order, as required. It will work for any number of nested dictionaries.
See: How would I flatten a nested dictionary in Python 3?
I used the same solution, but I've changed the result collection to be a list.
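As a quick sanity check, the recursion also handles deeper nesting (a made-up two-level example):

print(flatten({'a': 1, 'b': {'c': 2, 'd': {'e': 3}}}))  # [1, 2, 3]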