How to parse complicated CSV file - python-3.x

I received a CSV file that includes a combination of string and tuple elements and cannot find a way to parse it properly. Am I missing something obvious?
csvfile
"presentation_id","presentation_name","sectionId","sectionNumber","courseId","courseIdentifier","courseName","activity_id","activity_prompt","activity_content","solution","event_timestamp","answer_id","answer","isCorrect","userid","firstname","lastname","email","role"
"26cc7957-5a6b-4bde-a996-dd823f54ece7","3-Axial Skeleton F18","937c47b0-cc66-4938-81de-1b1b58388499","001","3b5b5e49-1798-4eab-86d7-186cf59149b4","MOVESCI 230","Human Musculoskeletal Anatomy","62d059e8-9ab4-41d4-9eb8-00ba67d9fac9","A blow to which side of the knee might tear the medial collateral ligament?","{"choices":["medial","lateral"],"type":"MultipleChoice"}","{"solution":[1],"selectAll":false,"type":"MultipleChoice"}","2018-09-30 23:54:16.000","7b5048e5-7460-49f8-a64a-763b7f62d771","{"solution":[1],"type":"MultipleChoice"}","1","57ba970d-d02b-4a10-a64d-56f02336ee08","Student","One","student1#example.com","Student"
"26cc7957-5a6b-4bde-a996-dd823f54ece7","3-Axial Skeleton F18","937c47b0-cc66-4938-81de-1b1b58388499","001","3b5b5e49-1798-4eab-86d7-186cf59149b4","MOVESCI 230","Human Musculoskeletal Anatomy","f82cb32b-45ce-4d3a-aa74-b3fa1a1038a2","What is the name of this movement?","{"choices":["right rotation","left rotation","right lateral rotation","left lateral rotation"],"type":"MultipleChoice"}","{"solution":[1],"selectAll":false,"type":"MultipleChoice"}","2018-09-30 23:20:33.000","d6cce4d9-37ae-409e-afc5-54ad79f86226","{"solution":[3],"type":"MultipleChoice"}","0","921d1b9b-f550-4289-89f1-2a805b27eeb3","Student","Two","student2#example.com","Student"
where the 1st row is the titles and the 2nd row starts the data
import csv

with open(filepathcsv) as csvfile:
    readCSV = csv.reader(csvfile)
    for row in readCSV:
        numcolumns = len(row)
        print(numcolumns, ": ", row)
yields:
20 : ['presentation_id', 'presentation_name', 'sectionId', 'sectionNumber', 'courseId', 'courseIdentifier', 'courseName', 'activity_id', 'activity_prompt', 'activity_content', 'solution', 'event_timestamp', 'answer_id', 'answer', 'isCorrect', 'userid', 'firstname', 'lastname', 'email', 'role']
25 : ['26cc7957-5a6b-4bde-a996-dd823f54ece7', '3-Axial Skeleton F18', '937c47b0-cc66-4938-81de-1b1b58388499', '001', '3b5b5e49-1798-4eab-86d7-186cf59149b4', 'MOVESCI 230', 'Human Musculoskeletal Anatomy', '62d059e8-9ab4-41d4-9eb8-00ba67d9fac9', 'A blow to which side of the knee might tear the medial collateral ligament?', '{choices":["medial"', 'lateral]', 'type:"MultipleChoice"}"', '{solution":[1]', 'selectAll:false', 'type:"MultipleChoice"}"', '2018-09-30 23:54:16.000', '7b5048e5-7460-49f8-a64a-763b7f62d771', '{solution":[1]', 'type:"MultipleChoice"}"', '1', '57ba970d-d02b-4a10-a64d-56f02336ee08', 'William', 'Muter', 'wmuter#umich.edu', 'Student']
27 : ['26cc7957-5a6b-4bde-a996-dd823f54ece7', '3-Axial Skeleton F18', '937c47b0-cc66-4938-81de-1b1b58388499', '001', '3b5b5e49-1798-4eab-86d7-186cf59149b4', 'MOVESCI 230', 'Human Musculoskeletal Anatomy', 'f82cb32b-45ce-4d3a-aa74-b3fa1a1038a2', 'What is the name of this movement?', '{choices":["right rotation"', 'left rotation', 'right lateral rotation', 'left lateral rotation]', 'type:"MultipleChoice"}"', '{solution":[1]', 'selectAll:false', 'type:"MultipleChoice"}"', '2018-09-30 23:20:33.000', 'd6cce4d9-37ae-409e-afc5-54ad79f86226', '{solution":[3]', 'type:"MultipleChoice"}"', '0', '921d1b9b-f550-4289-89f1-2a805b27eeb3', 'Noah', 'Willett', 'willettn#umich.edu', 'Student']
csv.reader is parsing each row differently because of the complicated structure with embedded curly-braced elements, but I expect 20 elements in each row.

The problem is in the records, not the code. Your code works fine. To solve the problem you need to fix the CSV file, because the fields with JSON content weren't serialized correctly.
Just change each embedded quote sign " to two quote signs "" to escape it.
Here is an example of a fixed CSV row:
"26cc7957-5a6b-4bde-a996-dd823f54ece7","3-Axial Skeleton F18","937c47b0-cc66-4938-81de-1b1b58388499","001","3b5b5e49-1798-4eab-86d7-186cf59149b4","MOVESCI 230","Human Musculoskeletal Anatomy","f82cb32b-45ce-4d3a-aa74-b3fa1a1038a2","What is the name of this movement?","{""choices"":[""right rotation"",""left rotation"",""right lateral rotation"",""left lateral rotation""],""type"":""MultipleChoice""}","{""solution"":[1],""selectAll"":false,""type"":""MultipleChoice""}","2018-09-30 23:20:33.000","d6cce4d9-37ae-409e-afc5-54ad79f86226","{""solution"":[3],""type"":""MultipleChoice""}","0","921d1b9b-f550-4289-89f1-2a805b27eeb3","Student","Two","student2#example.com","Student"
And the result of your code after the fix:
20 : ['26cc7957-5a6b-4bde-a996-dd823f54ece7', '3-Axial Skeleton F18', '937c47b0-cc66-4938-81de-1b1b58388499', '001', '3b5b5e49-1798-4eab-86d7-186cf59149b4', 'MOVESCI 230', 'Human Musculoskeletal Anatomy', 'f82cb32b-45ce-4d3a-aa74-b3fa1a1038a2', 'What is the name of this movement?', '{"choices":["right rotation","left rotation","right lateral rotation","left lateral rotation"],"type":"MultipleChoice"}', '{"solution":[1],"selectAll":false,"type":"MultipleChoice"}', '2018-09-30 23:20:33.000', 'd6cce4d9-37ae-409e-afc5-54ad79f86226', '{"solution":[3],"type":"MultipleChoice"}', '0', '921d1b9b-f550-4289-89f1-2a805b27eeb3', 'Student', 'Two', 'student2#example.com', 'Student']
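To see why the fix works: the csv module treats a doubled quote inside a quoted field as one literal quote character, so the escaped JSON parses back to its original form. A minimal self-contained sketch (with a shortened sample row):

```python
import csv
import io

# A quoted field containing doubled quotes ("") parses back to single quotes.
fixed = '"a","{""solution"":[1],""type"":""MultipleChoice""}"'
row = next(csv.reader(io.StringIO(fixed)))
print(row)  # ['a', '{"solution":[1],"type":"MultipleChoice"}']
```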

Thank you all for your suggestions!
Also, my apologies: I did not include the raw CSV file I was trying to parse (example below):
"b5ae18d3-b6dd-4d0a-84fe-7c43df472571"|"Climate_Rapid_Change_W18.pdf"|"18563b1e-a467-44b3-aed7-3607a1acd712"|"001"|"c86c8c8d-dca6-41cd-a010-a83e40d93e75"|"CLIMATE 102"|"Extreme Weather"|"278c4561-c834-4343-a770-3f544966f633"|"Which European city is at the same latitude as Ann Arbor?"|"{"choices":["Stockholm, Sweden","Berlin, Germany","London, England","Paris, France","Madrid, Spain"],"type":"MultipleChoice"}"|"{"solution":[4],"selectAll":false,"type":"MultipleChoice"}"|"2019-01-31 22:11:08.000"|"81392cd3-28e9-4e2e-8a33-018104b1f4d1"|"{"solution":[3,4],"type":"MultipleChoice"}"|"0"|"2db10c95-b507-4211-8244-394361148b22"|"Student"|"One"|"student1#umich.edu"|"Student"
"ee73fdaf-a926-4899-b0f7-9b942f1b44ad"|"6-Elbow, Wrist, Hand W19"|"48539109-529e-4359-83b9-2ae81be0532c"|"001"|"3b5b5e49-1798-4eab-86d7-186cf59149b4"|"MOVESCI 230"|"Human Musculoskeletal Anatomy"|"fcd7c673-d944-48c3-8a09-f458e03f8c44"|"What is the name of this movement?"|"{"choices":["first phalangeal joint","first proximal interphalangeal joint","first distal interphalangeal joint","first interphalangeal joint"],"type":"MultipleChoice"}"|"{"solution":[3],"selectAll":false,"type":"MultipleChoice"}"|"2019-01-31 22:07:32.000"|"9016f36c-41f5-4e14-84a9-78eea682c802"|"{"solution":[3],"type":"MultipleChoice"}"|"1"|"7184708d-4dc7-42e0-b1ea-4aca51f00fcd"|"Student"|"Two"|"student2#umich.edu"|"Student"
You are correct that the problem was the form of the CSV file.
I changed readCSV = csv.reader(csvfile) to readCSV = csv.reader(csvfile, delimiter="|", quotechar='|')
I then took the resulting list and removed the extraneous quotation marks from each element.
The rest of the program now works properly.
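The approach described above can be sketched as follows (a minimal example with an inlined sample line; here quoting is disabled with quoting=csv.QUOTE_NONE instead of setting quotechar, so the quotes become part of each field and are stripped afterwards):

```python
import csv
import io

# One pipe-delimited line in the same shape as the raw file.
raw = '"MOVESCI 230"|"001"|"{"solution":[3],"type":"MultipleChoice"}"'

# Split on '|' only; treat quotes as ordinary characters, then strip them.
reader = csv.reader(io.StringIO(raw), delimiter="|", quoting=csv.QUOTE_NONE)
row = next(reader)
cleaned = [field.strip('"') for field in row]
print(cleaned)
```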

Related

How to use Camelot-py to split rows when text exist on a specific column

I am trying to extract table information from a PDF using the Camelot-py library. Initially I used the stream flavor like this:
import camelot
tables = camelot.read_pdf('sample.pdf', flavor='stream', pages='1', columns=['110,400'], split_text=True, row_tol=10)
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df
However, I could not get the desired outcome, even after adjusting the columns value.
Then I switched to the lattice flavor. It now determines the columns accurately; however, because the PDF source does not separate rows with lines, the whole table content is extracted into one row.
Below using lattice:
import camelot
tables = camelot.read_pdf('sample_camelot_extract.pdf', flavor='lattice', pages='1')
tables.export('ipc_export.csv', f='csv', compress=True)
tables[0]
tables[0].parsing_report
tables[0].to_csv('ipc_export.csv')
tables[0].df
source file snapshot
The logic that I want to implement is that each new piece of text in the first column (FIG ITEM) should start a new row.
I have tried both flavors but am not sure which is the best approach.
Link for original file here:
Logic intended
Thank you.
You could try using pdfplumber - it allows you to customize all of its table extraction settings.
For example - changing just the default horizontal strategy to text produces:
# assuming: page = pdfplumber.open("sample.pdf").pages[0]
table = page.extract_table(table_settings={"horizontal_strategy": "text"})
[['FIG', '', '', 'EFFECT', 'UNITS'],
['ITEM', 'PART NUMBER', '1234567 NOMENCLATURE', 'FROM TO', 'PER\nASSY'],
['', '', '', '', ''],
['1', '', '', '', ''],
['', '', 'SYSTEM INSTL-AIR DISTR MIX', '', ''],
['', '', 'BAY (MAIN AIR', '', ''],
['', '', 'DISTRIBUTION ONLY)', '', ''],
You could play around with more settings to see if it's possible to extract the whole table the way you intend.
From here though - you could manually clean up and extract the column rows:
>>> import pandas as pd
>>> df = pd.DataFrame(table)
>>> (df.iloc[0] + " " + df.iloc[1]).str.replace("\n", " ").str.strip()
0 FIG ITEM
1 PART NUMBER
2 1234567 NOMENCLATURE
3 EFFECT FROM TO
4 UNITS PER ASSY
dtype: object
df.columns = (df.iloc[0] + " " + df.iloc[1]).str.replace("\n", " ").str.strip()
df = df.tail(-3)
You could then forward fill the FIG ITEM column and group on that - allowing you to combine the items.
df.groupby(df["FIG ITEM"].replace("", float("nan")).ffill()).agg({
    "PART NUMBER": "first",
    "1234567 NOMENCLATURE": "\n".join,
    "UNITS PER ASSY": "first",
})
PART NUMBER 1234567 NOMENCLATURE UNITS PER ASSY
FIG ITEM
- 1 M0DREF452754 SYSTEM INSTL-AIR DISTR MIX\nBAY (MAIN AIR\nDIS... RF
1 \nSYSTEM INSTL-AIR DISTR MIX\nBAY (MAIN AIR\nD...
10 BACS12GU3K8 .SCREW 12
15 BACS12GU3K9 .SCREW 18
20 BACB30NM3K15 .BOLT 15
27 BACB30NM3K17 .BOLT 2
28 BACB30NM3K20 .BOLT 1
30 NAS1149D0332J .WASHER 60
35 BACW10P44AL .WASHER 2
40 PLH53CD .NUT-\nSUPPLIER CODE:\nVF0224\nSPECIFICATION N... 2
45 SLT8LHC6 .STRAP-\nSUPPLIER CODE:\nV06383\nTRUE PART NUM... 7
5 BACS12GU3K7 .SCREW 12

add var with alphanumeric code in order of value

I have data from counties, and for a pseudonymized plot I want to add an alphanumeric code in the order of a sort variable. It is not so important what the code will look like, but I want it to start with a letter so that it will not be confused with the numeric information in the chart.
In the original data I have more than 26 observations, therefore the code needs two digits.
# example data
county <- c("all", "Berkshire", "Blackpool", "Bournemouth", "Bristol",
"Cambridgeshire", "Cheshire", "Devon", "Dorset", "Essex",
"Gloucestershire", "Hampshire", "Kent", "Lincolnshire",
"Norfolk", "Oxfordshire", "Suffolk", "Wiltshire", "Worcestershire",
"Yorkshire")
sort <- c(-2, 16.5, 400, 331, 375.2, 13.1, 400, 376.4,
128.3, 400, 48.6, 6.7, 113.5, 43.7, 295.9,400,
261.5, 100, 183.3, 400)
df <- data.frame(county, sort)
This is how I would like the result to look:
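For illustration, the rank-then-format idea can be sketched like this (in Python/pandas rather than R, on a subset of the question's data; the "A" prefix and two-digit format are one possible choice):

```python
import pandas as pd

# Subset of the question's data.
df = pd.DataFrame({
    "county": ["all", "Berkshire", "Blackpool", "Bournemouth"],
    "sort": [-2, 16.5, 400, 331],
})

# Rank by the sort variable, then format the rank as a letter-prefixed
# two-digit code so it cannot be confused with a number in the chart.
rank = df["sort"].rank(method="first").astype(int)
df["code"] = "A" + rank.astype(str).str.zfill(2)
print(df)
```

The same idea in R would use rank() or order() plus sprintf("A%02d", ...).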

Why does my PySpark regular expression not give more than the first row?

Taking inspiration from this answer: https://stackoverflow.com/a/61444594/4367851 I have been able to split my .txt file into columns in a Spark DataFrame. However, it only gives me the first game, even though the sample .txt file contains many more.
My code:
basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
withColumn("Event", col("new")[0]).\
withColumn("White", col("new")[2]).\
withColumn("Black", col("new")[3]).\
withColumn("Result", col("new")[4]).\
withColumn("UTCDate", col("new")[5]).\
withColumn("UTCTime", col("new")[6]).\
withColumn("WhiteElo", col("new")[7]).\
withColumn("BlackElo", col("new")[8]).\
withColumn("WhiteRatingDiff", col("new")[9]).\
withColumn("BlackRatingDiff", col("new")[10]).\
withColumn("ECO", col("new")[11]).\
withColumn("Opening", col("new")[12]).\
withColumn("TimeControl", col("new")[13]).\
withColumn("Termination", col("new")[14]).\
drop("new")
basefile.show()
Output:
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
| Event| White| Black| Result| UTCDate| UTCTime| WhiteElo| BlackElo| WhiteRatingDiff| BlackRatingDiff| ECO| Opening| TimeControl| Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
Input file:
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
[Event "Rated Classical game"]
.
.
.
Each game starts with [Event so I feel like it should be doable as the file has repeating structure, alas I can't get it to work.
Extra points:
I don't actually need the move list so if it's easier they can be deleted.
I only want the content of what is inside the " " for each new line once it has been converted to a Spark DataFrame.
Many thanks.
wholeTextFiles reads each file into a single record. If you read only one file, the result will be an RDD with only one row, containing the whole text file. The regexp logic in the question returns only one result per row, and this will be the first entry in the file.
Probably the best solution would be to split the file at the OS level into one file per game (for example here) so that Spark can read the multiple games in parallel. But if a single file is not too big, splitting the games can also be done within PySpark:
Read the file(s):
basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()
Create a list of columns and convert this list into a list of column expressions using regexp_extract:
from pyspark.sql import functions as F
cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"',1).alias(col) for col in cols]
Extract the data:
- split the whole file into an array of games
- explode this array into single records
- delete the line breaks within each record so that the regular expression works
- use the column expressions defined above to extract the data
basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
    .selectExpr("explode(game) as game") \
    .withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
    .select(cols) \
    .show(truncate=False)
Output (for an input file containing three copies of the game):
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event |White|Black |Result|UTCDate |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game2|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game3|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+

Dictionary text file Python

text
Donald Trump:
791697302519947264,1477604720,Ohio USA,Twitter for iPhone,5251,1895
Join me live in Springfield, Ohio!
Lit
<<<EOT
781619038699094016,1475201875,United States,Twitter for iPhone,31968,17246
While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...
<<<EOT
def read(text):
    with open(text, 'r') as f:
        for line in f:
Is there a way that I can separate the information for each candidate? For example, for Donald Trump it should be:
[
    ['Donald Trump'],
    [[791697302519947264, 1477604720, 'Ohio USA', 'Twitter for iPhone', 5251, 1895], ['Join me live in Springfield, Ohio! Lit']],
    [[781619038699094016, 1475201875, 'United States', 'Twitter for iPhone', 31968, 17246], ['While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...']]
]
The format of the file is the following:
ID,DATE,LOCATION,SOURCE,FAVORITE_COUNT,RETWEET_COUNT text(the tweet)
So basically, after the 6 headings, everything that follows is a tweet until '<<<EOT'.
Also, is there a way I can do this for every candidate in the file?
I'm not sure why you need a multi-dimensional list (I would pick tuples and dictionaries if possible) but this seems to produce the output you asked for:
>>> txt = """Donald Trump:
... 791697302519947264,1477604720,Ohio USA,Twitter for iPhone,5251,1895
... Join me live in Springfield, Ohio!
... Lit
... <<<EOT
... 781619038699094016,1475201875,United States,Twitter for iPhone,31968,17246
... While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney...
... <<<EOT
... Another Candidate Name:
... 12312321,123123213,New York USA, Twitter for iPhone,123,123
... This is the tweet text!
... <<<EOT"""
>>>
>>>
>>> buffer = []
>>> tweets = []
>>>
>>> for line in txt.split("\n"):
...     if not line.startswith("<<<EOT"):
...         buffer.append(line)
...     else:
...         if buffer[0].strip().endswith(":"):
...             tweets.append([buffer.pop(0).rstrip().replace(":", "")])
...         metadata = buffer.pop(0).split(",")
...         tweet = [" ".join(line for line in buffer).replace("\n", " ")]
...         tweets.append([metadata, tweet])
...         buffer = []
...
>>>
>>> from pprint import pprint
>>>
>>> pprint(tweets)
[['Donald Trump'],
[['791697302519947264',
'1477604720',
'Ohio USA',
'Twitter for iPhone',
'5251',
'1895'],
['Join me live in Springfield, Ohio! Lit']],
[['781619038699094016',
'1475201875',
'United States',
'Twitter for iPhone',
'31968',
'17246'],
['While Hillary profits off the rigged system, I am fighting for you! Remember the simple phrase: #FollowTheMoney... ']],
['Another Candidate Name'],
[['12312321',
'123123213',
'New York USA',
' Twitter for iPhone',
'123',
'123'],
['This is the tweet text!']]]
>>>
I am not quite understanding... but here is my example to read a file line by line and add each line to a string of text to post to Twitter.
candidates = open("C:\\users\\fox\\desktop\\candidates.txt")  # file path needs doubled backslashes
for candidate in candidates:
    candidate = candidate.rstrip('\n')  # removes the newline (this is mandatory)
    # the next line means post to Twitter
    post("propaganda here " + candidate + " more propaganda")
Note: for every line in that file this code will post to Twitter, e.g. 20 lines means twenty Twitter posts.

Reformat csv file using python?

I have this csv file with only two entries. Here it is:
Meat One,['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']
The first entry is a title and the second is business headings.
The problem lies with entry two.
Here is my code:
import csv

with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)
    for row in reader:
        row5 = row[5].replace("[", "").replace("]", "")
        listt = [(''.join(row5))]
        print(listt[0])
it prints:
'Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers'
What I need to do is create a list containing these words and then print each item separately with a for loop, like this:
Abattoirs
Exporters
Food Delivery
Butchers Retail
Meat Dealers-Retail
Meat Freezer
Meat Packers
Actually, I am trying to reformat my current CSV file and clean it so it can be more precise and understandable.
Complete 1st line of csv is this:
Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']","[[' Outlets Address : Shop No. Z-10, Station Shopping Complex, MES Market, Malir-Cantt, Karachi. Landmarks : MES Market, Station Shopping Complex City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi. Landmarks : Nadra Chowrangi, Sky Garden, Tipu Sultan Road City : Karachi UAN : +92-21-111163281 '], ["" Outlets Address : Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi. Landmarks : Boat Basin, Jans Broast, Khayaban-e-Roomi City : Karachi UAN : +92-21-111163281 View Map ""], [' Outlets Address : Gulistan-e-Johar, Karachi. Landmarks : Perfume Chowk City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Tee Emm Mart, Creek Vista Appartments, Khayaban-e-Shaheen, Phase VIII, DHA, Karachi. Landmarks : Creek Vista Appartments, Nueplex Cinema, Tee Emm Mart, The Place City : Karachi Mobile : 0302-8333666 '], [' Outlets Address : Y-Block, DHA, Lahore. Landmarks : Y-Block City : Lahore UAN : +92-42-111163281 '], [' Outlets Address : Adj. PSO, Main Bhittai Road, Jinnah Supermarket, F-7 Markaz, Islamabad. Landmarks : Bhittai Road, Jinnah Super Market, PSO Petrol Pump City : Islamabad UAN : +92-51-111163281 ']]","Agriculture, fishing & Forestry > Farming equipment & services > Abattoirs in Pakistan"
First column is Name
Second column is Number
Third column is Owner
Forth column is Business type
Fifth column is Y.O.E
Sixth column is Business Headings
Seventh column is Outlets (List of lists containing every branch address)
Eighth column is classification
There is no restriction of using csv.reader, I am open to any technique available to clean my file.
Think of it in terms of two separate tasks:
Collect some data items from a ‘dirty’ source (this CSV file)
Store that data somewhere so that it’s easy to access and manipulate programmatically (according to what you want to do with it)
Processing dirty CSV
One way to do this is to have a function deserialize_business() to distill structured business information from each incoming line in your CSV. This function can be complex because that’s the nature of the task, but still it’s advisable to split it into self-containing smaller functions (such as get_outlets(), get_headings(), and so on). This function can return a dictionary but depending on what you want it can be a [named] tuple, a custom object, etc.
This function would be an ‘adapter’ for this particular CSV data source.
Example of deserialization function:
def deserialize_business(csv_line):
    """
    Distills structured business information from given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """
    pieces = [piece.strip("[[\"\']] ") for piece in csv_line.strip().split(',')]
    name = pieces[0]
    phone = pieces[1]
    owner = pieces[2]
    btype = pieces[3]
    yoe = pieces[4]
    # after yoe, headings begin, until substring Outlets Address
    headings = pieces[4:pieces.index("Outlets Address")]
    # outlets go from substring Outlets Address until category
    outlet_pieces = pieces[pieces.index("Outlets Address"):-1]
    # combine each individual outlet's information into a string
    # and let ``deserialize_outlet()`` deal with that
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet) for outlet in raw_outlets]
    # category is the last piece
    category = pieces[-1]
    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
Example of calling it:
with open("phonebookCOMPK-Directory.csv") as f:
    lineno = 0
    for line in f:
        lineno += 1
        try:
            business = deserialize_business(line)
        except:
            # Bad line formatting?
            log.exception(u"Failed to deserialize line #%s!", lineno)
        else:
            # All is well
            store_business(business)
Storing the data
You’ll have the store_business() function take your data structure and write it somewhere. Maybe it’ll be another CSV that’s better structured, maybe multiple CSVs, a JSON file, or you can make use of SQLite relational database facilities since Python has it built-in.
It all depends on what you want to do later.
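For instance, a minimal store_business() sketch using Python's built-in sqlite3 module might look like this (the table layout is illustrative and only the scalar fields are stored; the dictionary shape matches what deserialize_business() returns):

```python
import sqlite3

def store_business(business, conn):
    # Illustrative schema: one row per business, scalar fields only.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS businesses
           (name TEXT, phone TEXT, owner TEXT,
            btype TEXT, yoe TEXT, category TEXT)"""
    )
    conn.execute(
        "INSERT INTO businesses VALUES (?, ?, ?, ?, ?, ?)",
        (business['name'], business['phone'], business['owner'],
         business['btype'], business['yoe'], business['category']),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
store_business({'name': 'Meat One', 'phone': '+92-21-111163281',
                'owner': 'Al Shaheer Corporation', 'btype': 'Retailers',
                'yoe': '2008', 'category': 'Abattoirs in Pakistan'}, conn)
print(conn.execute("SELECT name, owner FROM businesses").fetchall())
```

Headings and outlets would go into separate tables keyed by a business ID, as in the relational example below.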
Relational example
In this case your data would be split across multiple tables. (I’m using the word “table” but it can be a CSV file, although you can as well make use of an SQLite DB since Python has that built-in.)
Table identifying all possible business headings:
business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers
Table identifying all possible categories:
category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"
Table identifying businesses:
business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3
Table describing their outlets:
business ID, city, address, landmarks, phone
1, Karachi UAN, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi UAN, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281
Table describing their headings:
business ID, business heading ID
1, 1
1, 2
1, 3
…
Handling all this would require a complex store_business() function. It may be worth looking into SQLite and some ORM framework, if going with relational way of keeping the data.
You can just replace the line:
print(listt[0])
with:
import ast
print(*ast.literal_eval(row[5]), sep='\n')
Note that listt[0] is a single string, so unpacking it directly would print one character per line; ast.literal_eval parses the bracketed text into a real Python list first, and unpacking that list prints one heading per line.
