Cosine Similarity of rows based on selected pandas columns - python-3.x

I am trying to build an item-item movie recommender. The movies dataset contains metadata about each movie, such as title, genres, directors, actors, producers, writers, year_of_release, etc. Currently I calculate the similarity from the genres column only, using a TF-IDF vectorizer after splitting the genres column into a list, and it works fine. Here is the code I am using:
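from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def vector_cosine(df, index):
    # min_df=1 (the default) keeps every term
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
    tfidf_matrix = tf.fit_transform(df['genres'])
    # TF-IDF rows are L2-normalised, so the linear kernel equals cosine similarity
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    return cosine_sim[index]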
# Return movie recommendations based on the cosine similarity of movie genres
def genre_recommendations(title, idx):
    sim_scores = list(enumerate(vector_cosine(movies, idx)))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is normally the movie itself
    sim_scores = sim_scores[1:6]
    movie_indices = [i[0] for i in sim_scores]
    recommendations = list(titles.iloc[movie_indices])
    if title in recommendations:
        recommendations.remove(title)
    return recommendations
And this is the sample data:
{"movie_id":"217d1207-effc-4605-bdef-899339615fe6","title":"Mamma Mia! Here We Go Again","release_date":"2018","actor_names":["Amanda Seyfried","Andy Garc\u00eda","Cher","Christine Baranski","Colin Firth","Dominic Cooper","Jessica Keenan Wynn","Julie Walters","Lily James","Meryl Streep","Pierce Brosnan","Stellan Skarsg\u00e5rd"],"director_names":["Ol Parker"],"producer_names":["Gary Goetzman","Judy Craymer"],"genres":"['Comedy', 'Music', 'Romance']","rating_aus":"PG","rating_nzl":"PG","rating_usa":null} {"movie_id":"59a0a1cd-5b83-401d-9b30-caa2c7e4f46c","title":"The Spy Who Dumped Me","release_date":"2018","actor_names":["Carolyn Pickles","Fred Melamed","Gillian Anderson","Hasan Minhaj","Ivanna Sakhno","James Fleet","Jane Curtin","Justin Theroux","Kate McKinnon","Mila Kunis","Paul Reiser","Sam Heughan"],"director_names":["Susanna Fogel"],"producer_names":null,"genres":"['Action', 'Comedy']","rating_aus":null,"rating_nzl":"R16","rating_usa":null} {"movie_id":"35e19192-6fb0-4c2b-9591-0d70deb30db6","title":"Beirut","release_date":"2018","actor_names":["Alon Aboutboul","Dean Norris","Douglas Hodge","Jon Hamm","Jonny Coyne","Kate Fleetwood","Larry Pine","Le\u00efla Bekhti","Mark Pellegrino","Rosamund Pike","Shea Whigham","Sonia Okacha"],"director_names":["Brad Anderson"],"producer_names":["Mike Weber","Monica Levinson","Shivani Rawat","Ted Field","Tony Gilroy"],"genres":"['Action', 'Thriller']","rating_aus":"MA15+","rating_nzl":"M","rating_usa":null} {"movie_id":"6e3c57d8-51c9-4b20-afe7-e09d949b2d7f","title":"Mile 22","release_date":"2018","actor_names":["Alexandra Vino","Iko Uwais","John Malkovich","Lauren Cohan","Lauren Mary Kim","Mark Wahlberg","Nikolai Nikolaeff","Poorna Jagannathan","Ronda Jean Rousey","Sala Baker","Sam Medina","Terry Kinney"],"director_names":["Peter Berg"],"producer_names":["Mark Wahlberg","Peter Berg","Stephen Levinson"],"genres":"['Action']","rating_aus":"MA15+","rating_nzl":"R16","rating_usa":null} 
{"movie_id":"cde85555-c944-44ef-a3b1-a5bf842646ad","title":"The Meg","release_date":"2018","actor_names":["Cliff Curtis","James Gaylyn","Jason Statham","Jessica McNamee","Li Bingbing","Masi Oka","Page Kennedy","Rainn Wilson","Robert Taylor","Ruby Rose","Tawanda Manyimo","Winston Chao"],"director_names":["Jon Turteltaub"],"producer_names":["Belle Avery","Colin Wilson","Lorenzo di Bonaventura"],"genres":"['Action', 'Science Fiction', 'Thriller']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"d1496722-2713-4b29-a323-18a0e2a0d6f3","title":"Monster's Ball","release_date":"2001","actor_names":["Amber Rules","Billy Bob Thornton","Charles Cowan Jr.","Coronji Calhoun","Gabrielle Witcher","Halle Berry","Heath Ledger","Peter Boyle","Sean Combs","Taylor LaGrange","Taylor Simpson","Yasiin Bey"],"director_names":["Marc Forster"],"producer_names":["Lee Daniels"],"genres":"['Drama', 'Romance']","rating_aus":null,"rating_nzl":"R16","rating_usa":null} {"movie_id":"844b309e-34ef-47bf-b981-eadc1d915886","title":"How to Be a Latin Lover","release_date":"2017","actor_names":["Anne McDaniels","Eugenio Derbez","Kristen Bell","Mckenna Grace","Michael Cera","Michaela Watkins","Raquel Welch","Rob Corddry","Rob Huebel","Rob Lowe","Rob Riggle","Salma Hayek"],"director_names":["Ken Marino"],"producer_names":null,"genres":"['Comedy']","rating_aus":"M","rating_nzl":"R13","rating_usa":null} {"movie_id":"991e3711-9918-41a0-b660-3c53ffa4901c","title":"Good Fortune","release_date":"2016","actor_names":null,"director_names":["Joshua Tickell","Rebecca Harrell Tickell"],"producer_names":null,"genres":"['Documentary']","rating_aus":"PG","rating_nzl":null,"rating_usa":null} {"movie_id":"7bf1935e-7d1a-437c-a71a-b4c70eb4f853","title":"Paper Heart","release_date":"2009","actor_names":["Charlyne Yi","Gill Summers","Jake Johnson","Martin Starr","Michael Cera","Seth Rogen"],"director_names":["Nicholas Jasenovec"],"producer_names":null,"genres":"['Comedy', 'Drama', 
'Romance']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"4cd0e423-acde-4bf7-bf93-f77777a4de6f","title":"Daybreakers","release_date":"2009","actor_names":["Claudia Karvan","Emma Randall","Ethan Hawke","Harriet Minto","Day","Isabel Lucas","Jay Laga'aia","Michael Dorman","Mungo McKay","Sam Neill","Tiffany Lamb","Vince Colosimo","Willem Dafoe"],"director_names":["Michael Spierig","Peter Spierig"],"producer_names":["Bryan Furst","Chris Brown","Sean Furst","Todd Fellman"],"genres":"['Action', 'Fantasy', 'Horror', 'Science Fiction']","rating_aus":"MA15+","rating_nzl":"R16","rating_usa":null} {"movie_id":"c0a84525-46a4-4977-86af-7fc8cb014683","title":"Requiem for a Dream","release_date":"2000","actor_names":["Charlotte Aronofsky","Christopher McDonald","Ellen Burstyn","Janet Sarno","Jared Leto","Jennifer Connelly","Joanne Gordon","Louise Lasser","Marcia Jean Kurtz","Mark Margolis","Marlon Wayans","Suzanne Shepherd"],"director_names":["Darren Aronofsky"],"producer_names":["Eric Watson","Palmer West"],"genres":"['Crime', 'Drama']","rating_aus":null,"rating_nzl":"R18","rating_usa":null} {"movie_id":"fe5367fe-b558-4fbe-872f-ab041ef58213","title":"Grizzly Man","release_date":"2005","actor_names":["David Letterman","Jewel Palovak","Kathleen Parker","Sam Egli","Timothy Treadwell","Warren Queeney","Werner Herzog","Willy Fulton"],"director_names":["Werner Herzog"],"producer_names":["Erik Nelson"],"genres":"['Documentary']","rating_aus":"M","rating_nzl":null,"rating_usa":null} {"movie_id":"6bd66f6b-2834-4013-884f-7eaf257e09fb","title":"The Great Buck Howard","release_date":"2008","actor_names":["Adam Scott","Colin Hanks","Debra Monk","Emily Blunt","Griffin Dunne","John Malkovich","Jonathan Ames","Patrick Fischler","Ricky Jay","Steve Zahn","Tom Hanks","Wallace Langham"],"director_names":["Sean McGinly"],"producer_names":["Gary Goetzman","Tom Hanks"],"genres":"['Comedy', 'Drama']","rating_aus":"G","rating_nzl":"G","rating_usa":null} 
{"movie_id":"78b22fc2-5069-40da-b4ac-790ec3902a32","title":"Boo! A Madea Halloween","release_date":"2016","actor_names":["Andre Hall","Bella Thorne","Brock O'Hurn","Cassi Davis","Diamond White","Jimmy Tatro","Kian Lawley","Lexy Panterra","Liza Koshy","Patrice Lovely","Tyler Perry","Yousef Erakat"],"director_names":["Tyler Perry"],"producer_names":null,"genres":"['Comedy', 'Drama', 'Horror']","rating_aus":"M","rating_nzl":null,"rating_usa":null} {"movie_id":"bc5c3635-bbfb-4dd7-b0ed-408787fd5f43","title":"Fantastic Beasts and Where to Find Them","release_date":"2016","actor_names":["Alison Sudol","Carmen Ejogo","Colin Farrell","Dan Fogler","Eddie Redmayne","Ezra Miller","Johnny Depp","Jon Voight","Katherine Waterston","Ron Perlman","Samantha Morton","Zo\u00eb Kravitz"],"director_names":["David Yates"],"producer_names":["David Heyman","J.K. Rowling","Lionel Wigram","Steve Kloves"],"genres":"['Adventure', 'Family', 'Fantasy']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"1ec3a043-a6c4-44a0-9fb1-c948eb07cf85","title":"Silver Linings Playbook","release_date":"2012","actor_names":["Anupam Kher","Bonnie Aarons","Bradley Cooper","Brea Bee","Chris Tucker","Dash Mihok","Jacki Weaver","Jennifer Lawrence","John Ortiz","Julia Stiles","Robert De Niro","Shea Whigham"],"director_names":["David O. Russell"],"producer_names":["Bruce Cohen","Donna Gigliotti","Jonathan Gordon","Mark Kamine"],"genres":"['Comedy', 'Drama', 'Romance']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"4c0bbde5-7a34-4556-a481-3357ef69b651","title":"The Equalizer","release_date":"2014","actor_names":["Alex Veadov","Bill Pullman","Chlo\u00eb Grace Moretz","David Harbour","David Meunier","Denzel Washington","E. 
Roger Mitchell","Haley Bennett","Johnny Skourtis","Marton Csokas","Melissa Leo","Vladimir Kulich"],"director_names":["Antoine Fuqua"],"producer_names":["Alex Siskin","Denzel Washington","Jason Blumenthal","Mace Neufeld","Michael Sloan","Richard Wenk","Steve Tisch","Todd Black","Tony Eldridge"],"genres":"['Action', 'Crime', 'Thriller']","rating_aus":"MA15+","rating_nzl":"R18","rating_usa":null} {"movie_id":"809f3131-7445-4ffb-8b77-6de6699c85c4","title":"The Notebook","release_date":"2004","actor_names":["David Thornton","Ed Grady","Gena Rowlands","James Garner","James Marsden","Jennifer Echols","Joan Allen","Kevin Connolly","Rachel McAdams","Ryan Gosling","Sam Shepard","Starletta DuPois"],"director_names":["Nick Cassavetes"],"producer_names":["Lynn Harris","Mark Johnson"],"genres":"['Drama', 'Romance']","rating_aus":"PG","rating_nzl":"PG","rating_usa":null} {"movie_id":"49652b6d-b818-4fc4-9c57-9ec7b5c346cc","title":"The Matrix","release_date":"1999","actor_names":["Anthony Ray Parker","Belinda McClory","Carrie","Anne Moss","Gloria Foster","Hugo Weaving","Joe Pantoliano","Julian Arahanga","Keanu Reeves","Laurence Fishburne","Marcus Chong","Paul Goddard","Robert Taylor"],"director_names":["Lana Wachowski","Lilly Wachowski"],"producer_names":["Joel Silver"],"genres":"['Action', 'Science Fiction']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"d356a087-4e89-420b-867c-618544969302","title":"The Hunger Games","release_date":"2012","actor_names":["Alexander Ludwig","Donald Sutherland","Elizabeth Banks","Isabelle Fuhrman","Jennifer Lawrence","Josh Hutcherson","Lenny Kravitz","Liam Hemsworth","Stanley Tucci","Toby Jones","Wes Bentley","Woody Harrelson"],"director_names":["Gary Ross"],"producer_names":["Jon Kilik","Nina Jacobson"],"genres":"['Adventure', 'Fantasy', 'Science Fiction']","rating_aus":"M","rating_nzl":"M","rating_usa":null} {"movie_id":"0ef84c8c-ffc5-4896-9e3b-5303acba0ff3","title":"The Wolf of Wall 
Street","release_date":"2013","actor_names":["Brian Sacca","Henry Zebrowski","Jon Bernthal","Jon Favreau","Jonah Hill","Kenneth Choi","Kyle Chandler","Leonardo DiCaprio","Margot Robbie","Matthew McConaughey","P. J. Byrne","Rob Reiner"],"director_names":["Martin Scorsese"],"producer_names":["Emma Tillinger Koskoff","Joey McFarland","Leonardo DiCaprio","Martin Scorsese","Riza Aziz"],"genres":"['Comedy', 'Crime', 'Drama']","rating_aus":"R18+","rating_nzl":"R18","rating_usa":null}
What I want next is to calculate the similarity based on multiple columns. How can I achieve that?

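You can use cosine_similarity from sklearn.metrics.pairwise. It expects a numeric feature matrix, so df here must already hold the encoded (vectorized) columns:
sim_df = cosine_similarity(df)
Then you can get the 10 most similar movies for a particular movie (the movie itself normally ranks first, with similarity 1) like:
sim_df[movie_index].argsort()[-10:][::-1]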
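To make the multi-column idea concrete, here is a minimal, dependency-free sketch. It assumes each movie is a dict shaped like the sample data, merges several list-valued columns into one bag of tokens, and computes plain cosine similarity on the token counts. The helper names (tokens_for, cosine, most_similar) and the choice of columns are illustrative assumptions, not part of the original post; the same merged tokens could just as well be joined into one string per movie and passed to TfidfVectorizer.

```python
import ast
import math
from collections import Counter

# Three trimmed records from the sample data above (only a few fields kept).
movies = [
    {"title": "Mile 22", "genres": "['Action']",
     "director_names": ["Peter Berg"], "actor_names": ["Mark Wahlberg", "John Malkovich"]},
    {"title": "Beirut", "genres": "['Action', 'Thriller']",
     "director_names": ["Brad Anderson"], "actor_names": ["Jon Hamm", "Rosamund Pike"]},
    {"title": "The Notebook", "genres": "['Drama', 'Romance']",
     "director_names": ["Nick Cassavetes"], "actor_names": ["Ryan Gosling", "Rachel McAdams"]},
]

def tokens_for(movie, columns=("genres", "director_names", "actor_names")):
    """Merge several columns into one bag of tokens (one token per name or genre)."""
    bag = Counter()
    for col in columns:
        value = movie.get(col) or []          # null columns become empty lists
        if isinstance(value, str):            # genres is stored as a stringified list
            value = ast.literal_eval(value)
        # Prefix with the column name so an actor and a director with the same
        # name stay distinct features; lower-case and join multi-word names.
        bag.update(f"{col}:{v.lower().replace(' ', '_')}" for v in value)
    return bag

def cosine(a, b):
    """Cosine similarity between two Counters of token counts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(index, k=5):
    """Titles of the k movies most similar to movies[index], excluding itself."""
    bags = [tokens_for(m) for m in movies]
    scores = [(cosine(bags[index], bag), i) for i, bag in enumerate(bags) if i != index]
    scores.sort(reverse=True)
    return [movies[i]["title"] for _, i in scores[:k]]

print(most_similar(0, k=2))
```

Prefixing each token with its column name is a design choice: it lets you later weight columns differently (for example, counting a shared director more than a shared actor) by repeating or scaling those tokens.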

Related

How do I analyze goodness of fit between two contingency tables?

I have two contingency tables that I performed a chi-square test on. I would like to know if these two tables have similar distributions/frequency of the data using a goodness of fit test. I'm not sure how to do this and how to format the data. Thanks in advance for your help!
discharge = data.frame(decreasing.sign = c(0, 0, 9, 7, 1, 1),
                       decreasing.trend = c(2, 3, 35, 27, 8, 6),
                       increase.trend = c(8, 27, 34, 16, 4, 3),
                       increase.sign = c(0, 2, 7, 0, 0, 0),
                       row.names = c("Ridge and Valley", "Blue Ridge", "Piedmont", "Southeastern Plains", "Middle Atlantic Coastal Plain", "Southern Coastal Plain"))
groundwater = data.frame(decreasing.sign = c(0, 1, 6, 45, 6, 16),
                         decreasing.trend = c(0, 1, 3, 28, 5, 5),
                         increase.trend = c(1, 5, 6, 32, 9, 5),
                         increase.sign = c(1, 0, 0, 4, 2, 20),
                         row.names = c("Ridge and Valley", "Blue Ridge", "Piedmont", "Southeastern Plains", "Middle Atlantic Coastal Plain", "Southern Coastal Plain"))
chisq = chisq.test(discharge)  # add ", simulate.p.value = TRUE" inside the parentheses if the table contains zeros
chisq
chisq2 = chisq.test(groundwater)  # add ", simulate.p.value = TRUE" inside the parentheses if the table contains zeros
chisq2

How to calculate mean of rouge1 and rougeL with precision, recall, and F-measure scores for text summarization?

I am using the following code to calculate the precision, recall, and F-measure scores for rouge1 and rougeL in a text summarization task, but I can't work out how to get the mean of these scores.
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog', 'The quick brown dog jumps on the log.')
scores1 = scorer.score('Something is good', 'Something went wrong')
print(scores)
print(scores1)
The result looks as follows
{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765), 'rouge2': Score(precision=0.2857142857142857, recall=0.25, fmeasure=0.26666666666666666), 'rouge3': Score(precision=0.16666666666666666, recall=0.14285714285714285, fmeasure=0.15384615384615383), 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}
{'rouge1': Score(precision=0.3333333333333333, recall=0.3333333333333333, fmeasure=0.3333333333333333), 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0), 'rouge3': Score(precision=0.0, recall=0.0, fmeasure=0.0), 'rougeL': Score(precision=0.3333333333333333, recall=0.3333333333333333, fmeasure=0.3333333333333333)}
Any help to find the mean among rouge1 and rougeL scores? Thanks in advance.
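One reading of the question is that the mean should be taken per metric across all scored pairs. The sketch below assumes that interpretation. To stay self-contained it uses a namedtuple stand-in with the same three fields as the Score objects printed above and copies the printed rouge1 and rougeL values; the helper name mean_scores is hypothetical. With rouge_score installed, the dictionaries returned by scorer.score(...) could be passed in directly instead.

```python
from collections import namedtuple
from statistics import mean

# Stand-in for the Score tuples printed above (same field names).
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

# Values copied from the two printed results (rouge1 and rougeL only).
scores = {
    "rouge1": Score(0.75, 0.6666666666666666, 0.7058823529411765),
    "rougeL": Score(0.625, 0.5555555555555556, 0.5882352941176471),
}
scores1 = {
    "rouge1": Score(0.3333333333333333, 0.3333333333333333, 0.3333333333333333),
    "rougeL": Score(0.3333333333333333, 0.3333333333333333, 0.3333333333333333),
}

def mean_scores(results, rouge_types=("rouge1", "rougeL")):
    """Average precision, recall and F-measure over a list of score dicts."""
    return {
        rt: Score(*(mean(getattr(r[rt], field) for r in results)
                    for field in Score._fields))
        for rt in rouge_types
    }

averaged = mean_scores([scores, scores1])
for rouge_type, score in averaged.items():
    print(rouge_type, score)
```

For a larger evaluation set, collect one score dict per (reference, summary) pair in a list and pass the whole list to the helper.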

In python, headings not in the same row

I extracted three columns from a larger data frame (recent_grads) as follows...
df = recent_grads.groupby('Major_category')[['Men', 'Women']].sum()
However, when I print df, it comes up as follows...
                                          Men     Women
Major_category
Agriculture & Natural Resources       40357.0   35263.0
Arts                                 134390.0  222740.0
Biology & Life Science               184919.0  268943.0
Business                             667852.0  634524.0
Communications & Journalism          131921.0  260680.0
Computers & Mathematics              208725.0   90283.0
Education                            103526.0  455603.0
Engineering                          408307.0  129276.0
Health                                75517.0  387713.0
Humanities & Liberal Arts            272846.0  440622.0
Industrial Arts & Consumer Services  103781.0  126011.0
Interdisciplinary                      2817.0    9479.0
Law & Public Policy                   91129.0   87978.0
Physical Sciences                     95390.0   90089.0
Psychology & Social Work              98115.0  382892.0
Social Science                       256834.0  273132.0
How do I get Major_category heading in the same row as Men and Women headings? I tried to put the three columns in a new data frame as follows...
df1 = df[['Major_category', 'Men', 'Women']].copy()
This gives me an error (Major_category not in index)
You should try reset_index (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html):
df = recent_grads.groupby('Major_category')[['Men', 'Women']].sum()
# Move Major_category from the index back into a regular column
md = df.reset_index()
print(md)
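The result of .sum() is already a DataFrame; Major_category is simply stored as its index, which is why selecting it as a column fails. To keep it as a regular column from the start, pass as_index=False to groupby:
df = recent_grads.groupby('Major_category', as_index=False)[['Men', 'Women']].sum()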
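A small runnable sketch of the index-versus-column distinction, assuming pandas is installed. The recent_grads frame here is a tiny stand-in with made-up numbers, used only to show where Major_category ends up before and after reset_index.

```python
import pandas as pd

# Tiny stand-in for recent_grads (values are made up for illustration).
recent_grads = pd.DataFrame({
    "Major_category": ["Arts", "Arts", "Health", "Health"],
    "Men": [10.0, 20.0, 5.0, 7.0],
    "Women": [30.0, 40.0, 9.0, 11.0],
})

# After groupby(...).sum(), Major_category is the index, not a column.
df = recent_grads.groupby("Major_category")[["Men", "Women"]].sum()
print(df.columns.tolist())

# reset_index() turns the index back into an ordinary first column,
# so all three headings print on the same row.
flat = df.reset_index()
print(flat.columns.tolist())
print(flat)
```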

Add a space in lambda function

I have two columns in my data frame, ASIN and keywords. I am trying to group by ASIN, and the groupby itself works fine:
   ASIN        keywords
0  B07GFGXMZZ  mangalagiri dress materials
1  B07GFGXMZZ  pure cotton dress materials for women
2  B07GFGXMZZ  suit material party wear for women
3  B076BL4CWB  dhakai jamdani
4  B076BL4CWB  jamdani
Groupby code:
df.groupby('ASIN').apply(lambda x: x.sum())
This concatenates the keywords of each ASIN without any separator. How can I add a space between the keywords of each group? I tried:
df.groupby('ASIN').apply(lambda x: " ".join(x.sum()))
But it didn't work; it gives:
ASIN
9801321261 98013212619801321261 cane mat with runnercane ...
B008YLNICE B008YLNICEB008YLNICEB008YLNICEB008YLNICEB008YL...
B00P81OJ26 B00P81OJ26B00P81OJ26B00P81OJ26B00P81OJ26B00P81...
B010SZBHEE B010SZBHEEB010SZBHEEB010SZBHEEB010SZBHEEB010SZ...
B01143KAY2 B01143KAY2B01143KAY2B01143KAY2B01143KAY2B01143...
B0157XMD4A B0157XMD4A elephant painted box
B0157XMRJ6 B0157XMRJ6B0157XMRJ6B0157XMRJ6B0157XMRJ6B0157X...
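One possible approach, sketched below under the assumption that pandas is installed: select only the keywords column before aggregating, so the ASIN values are not concatenated as well, and join each group's strings with a space. The frame is rebuilt from the five sample rows shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ASIN": ["B07GFGXMZZ", "B07GFGXMZZ", "B07GFGXMZZ", "B076BL4CWB", "B076BL4CWB"],
    "keywords": [
        "mangalagiri dress materials",
        "pure cotton dress materials for women",
        "suit material party wear for women",
        "dhakai jamdani",
        "jamdani",
    ],
})

# Select only the keywords column, then join each group's strings with a space.
joined = df.groupby("ASIN")["keywords"].agg(" ".join)
print(joined)
```

The earlier attempts used x.sum() on the whole group, which concatenates every column (including ASIN) before any separator can be inserted.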

Reformat csv file using python?

I have this csv file with only two entries. Here it is:
Meat One,['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']
The first entry is the title and the second is the business headings.
The problem lies with the second entry.
Here is my code:
import csv
with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)
    for row in reader:
        # the business headings are in the sixth column of the full file
        row5 = row[5].replace("[", "").replace("]", "")
        listt = [(''.join(row5))]
        print(listt[0])
It prints:
'Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers'
What I need is a list containing these words, so that I can print every item separately with a for loop, like this:
Abattoirs
Exporters
Food Delivery
Butchers Retail
Meat Dealers-Retail
Meat Freezer
Meat Packers
Actually I am trying to reformat my current csv file and clean it so it can be more precise and understandable.
Complete 1st line of csv is this:
Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']","[[' Outlets Address : Shop No. Z-10, Station Shopping Complex, MES Market, Malir-Cantt, Karachi. Landmarks : MES Market, Station Shopping Complex City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi. Landmarks : Nadra Chowrangi, Sky Garden, Tipu Sultan Road City : Karachi UAN : +92-21-111163281 '], ["" Outlets Address : Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi. Landmarks : Boat Basin, Jans Broast, Khayaban-e-Roomi City : Karachi UAN : +92-21-111163281 View Map ""], [' Outlets Address : Gulistan-e-Johar, Karachi. Landmarks : Perfume Chowk City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Tee Emm Mart, Creek Vista Appartments, Khayaban-e-Shaheen, Phase VIII, DHA, Karachi. Landmarks : Creek Vista Appartments, Nueplex Cinema, Tee Emm Mart, The Place City : Karachi Mobile : 0302-8333666 '], [' Outlets Address : Y-Block, DHA, Lahore. Landmarks : Y-Block City : Lahore UAN : +92-42-111163281 '], [' Outlets Address : Adj. PSO, Main Bhittai Road, Jinnah Supermarket, F-7 Markaz, Islamabad. Landmarks : Bhittai Road, Jinnah Super Market, PSO Petrol Pump City : Islamabad UAN : +92-51-111163281 ']]","Agriculture, fishing & Forestry > Farming equipment & services > Abattoirs in Pakistan"
First column is Name
Second column is Number
Third column is Owner
Fourth column is Business type
Fifth column is Y.O.E
Sixth column is Business Headings
Seventh column is Outlets (List of lists containing every branch address)
Eighth column is classification
I am not restricted to csv.reader; I am open to any technique available to clean my file.
Think of it in terms of two separate tasks:
Collect some data items from a ‘dirty’ source (this CSV file)
Store that data somewhere so that it’s easy to access and manipulate programmatically (according to what you want to do with it)
Processing dirty CSV
One way to do this is to have a function deserialize_business() that distills structured business information from each incoming line in your CSV. This function can be complex because that’s the nature of the task, but it’s still advisable to split it into smaller self-contained functions (such as get_outlets(), get_headings(), and so on). It can return a dictionary, but depending on what you want it can also be a [named] tuple, a custom object, etc.
This function would be an ‘adapter’ for this particular CSV data source.
Example of deserialization function:
def deserialize_business(csv_line):
    """
    Distills structured business information from given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """
    pieces = [piece.strip("[\"'] ") for piece in csv_line.strip().split(',')]
    name, phone, owner, btype, yoe = pieces[:5]
    # index of the first piece that starts an outlet description
    first_outlet = next(i for i, piece in enumerate(pieces)
                        if piece.startswith("Outlets Address"))
    # after yoe headings begin, until substring Outlets Address
    headings = pieces[5:first_outlet]
    # outlets go from substring Outlets Address until category
    # (note: a category that itself contains commas is cut down to its last piece)
    outlet_pieces = pieces[first_outlet:-1]
    # combine each individual outlet information into a string
    # and let ``deserialize_outlet()`` (not shown) deal with that
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet) for outlet in raw_outlets if outlet.strip()]
    # category is the last piece
    category = pieces[-1]
    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
Example of calling it:
import logging

log = logging.getLogger(__name__)

with open("phonebookCOMPK-Directory.csv") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            business = deserialize_business(line)
        except Exception:
            # Bad line formatting?
            log.exception("Failed to deserialize line #%s!", lineno)
        else:
            # All is well
            store_business(business)
Storing the data
You’ll have the store_business() function take your data structure and write it somewhere. Maybe it’ll be another CSV that’s better structured, maybe multiple CSVs, a JSON file, or you can make use of SQLite relational database facilities since Python has it built-in.
It all depends on what you want to do later.
Relational example
In this case your data would be split across multiple tables. (I’m using the word “table” but it can be a CSV file, although you can as well make use of an SQLite DB since Python has that built-in.)
Table identifying all possible business headings:
business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers
Table identifying all possible categories:
category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"
Table identifying businesses:
business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3
Table describing their outlets:
business ID, city, address, landmarks, phone
1, Karachi, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281
Table describing their headings:
business ID, business heading ID
1, 1
1, 2
1, 3
…
Handling all this would require a complex store_business() function. It may be worth looking into SQLite and some ORM framework, if going with relational way of keeping the data.
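As a rough illustration of the relational layout above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names, and the store_business(conn, business, category_id) signature, are assumptions made for this example. It only stores the business row and its headings; outlets and the category hierarchy would need similar inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB for the sketch; use a file path to persist
conn.executescript("""
CREATE TABLE heading (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE category (id INTEGER PRIMARY KEY,
                       parent_id INTEGER REFERENCES category(id),
                       name TEXT NOT NULL);
CREATE TABLE business (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, owner TEXT,
                       btype TEXT, yoe TEXT,
                       category_id INTEGER REFERENCES category(id));
CREATE TABLE outlet (business_id INTEGER REFERENCES business(id),
                     city TEXT, address TEXT, landmarks TEXT, phone TEXT);
CREATE TABLE business_heading (business_id INTEGER REFERENCES business(id),
                               heading_id INTEGER REFERENCES heading(id),
                               PRIMARY KEY (business_id, heading_id));
""")

def store_business(conn, business, category_id=None):
    """Insert one business dict and link its headings; returns the new business ID."""
    cur = conn.execute(
        "INSERT INTO business (name, phone, owner, btype, yoe, category_id) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (business["name"], business["phone"], business["owner"],
         business["btype"], business["yoe"], category_id),
    )
    business_id = cur.lastrowid
    for heading in business["headings"]:
        # Reuse an existing heading row if the name is already known.
        conn.execute("INSERT OR IGNORE INTO heading (name) VALUES (?)", (heading,))
        (heading_id,) = conn.execute(
            "SELECT id FROM heading WHERE name = ?", (heading,)).fetchone()
        conn.execute("INSERT INTO business_heading VALUES (?, ?)",
                     (business_id, heading_id))
    conn.commit()
    return business_id

business_id = store_business(conn, {
    "name": "Meat One", "phone": "+92-21-111163281", "owner": "Al Shaheer Corporation",
    "btype": "Retailers", "yoe": "2008", "headings": ["Abattoirs", "Exporters"],
})
rows = conn.execute("""
    SELECT h.name FROM business_heading bh
    JOIN heading h ON h.id = bh.heading_id
    WHERE bh.business_id = ? ORDER BY h.id""", (business_id,)).fetchall()
print(rows)
```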
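You can just replace the lines:
listt = [(''.join(row5))]
print (listt[0])
with the following, which splits the string on commas and strips the quotes before printing one item per line:
listt = [item.strip(" '") for item in row5.split(',')]
print(*listt, sep='\n')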
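Another option, sketched here with the standard library only, is to let csv.reader handle the quoted field and then parse it with ast.literal_eval, which turns the text "['Abattoirs', ...]" into a real Python list. The sample string holds just the first six columns of the line from the question (outlets and classification are left out), and the headings_by_name dictionary is a hypothetical container for the result.

```python
import ast
import csv
import io

# First six columns of the sample line; the remaining columns are left out.
sample = """Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']"
"""

headings_by_name = {}
for row in csv.reader(io.StringIO(sample)):
    # row[5] is the text "['Abattoirs', ...]"; literal_eval parses it into a list.
    headings = ast.literal_eval(row[5])
    headings_by_name[row[0]] = headings
    for heading in headings:
        print(heading)
```

With the real file, io.StringIO(sample) would be replaced by the open file object. literal_eval only accepts Python literals, so it is a safer choice than eval for data read from a file.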
