Data cleaning: this strange array format - python-3.x

I'm new to Python. I'm using Python in a Jupyter notebook and have imported pandas and pypostal.
This is my code:
import numpy as np
import pandas as pd
from postal.parser import parse_address

df = pd.read_csv("./file.csv").head(20)
# Parse each address into a list of (value, label) tuples
df['parse_addr'] = df['LongAddr'].apply(parse_address)
df.to_csv('./new_file.csv', index=False)
print("JOB DONE")
This is my file.csv:
   customer_key  Company_Code  Name    Address_Type  LongAddr
0  CHIT000001    ZY1           Terry   CHI           Nathan Road, Kowloon, Hong Kong
1  ENGT000002    BH6           Mary    ENG           Flat E, 19/F, Blk A, Hilton building
2  RCHIT000003   EG9           John.G  CHI           Marble Road Tai Koo Hong Kong
I have tried outputting as CSV, JSON, and XML, but the file format didn't change at all. I have no clue how to handle this format. The output turns out like this:
0 [(Hong Kong, state),(Kowloon, city),(Nathan Road, Road)]
1 [(flat E, unit), (19, level), (blk a hilton building, House)]
2 [(Hong Kong, state),(Tai Koo, city),(Marble Road, Road)]
All I want is a .csv or .xlsx file with output like this:
   customer_key, state, city, road, house, level, unit
0  CHIT000001, Hong Kong, Kowloon, Nathan Road, , ,
1  ENGT000002, , , , Blk A Hilton building, 19/F, Flat E
2  RCHIT000003, Hong Kong, Tai Koo, Marble Road, , ,

Create a dictionary from the resulting list of tuples by extracting state, city, road, and so on. Then create a new DataFrame from the dictionary and use to_csv() to export the file (use the appropriate file extension in to_csv()).
Please provide sample output next time; the steps to reproduce are not clear.
Refer to the link below:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
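A minimal sketch of those steps, assuming parse_address returns (value, label) tuples as the question's output suggests (the actual column names depend on the labels libpostal assigns):

import pandas as pd
from postal.parser import parse_address

df = pd.read_csv("./file.csv").head(20)

# Each parsed address is a list of (value, label) tuples; turn it into a
# {label: value} dict so every label becomes a column name.
parsed = df['LongAddr'].apply(
    lambda addr: {label: value for value, label in parse_address(addr)}
)

# Expand the dicts into one column per label and keep the customer key.
out = pd.concat([df[['customer_key']], pd.DataFrame(parsed.tolist())], axis=1)
out.to_csv('./new_file.csv', index=False)  # or out.to_excel('./new_file.xlsx')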

Related

joblib | PicklingError: Could not pickle the task to send it to the workers

I want to run many invocations of the GPT-2 neural network from Hugging Face.
It works fine stand-alone:
%%time
for invoc in invoc_list:
    val_list = query(invoc, entries)
val_list
>>> Wall time: 32.8 s
>>> ['"I think we all knew the first part of The First World War."\n\nThe question of how the people got to where they are in the United States isn\'t entirely clear, but it involves a series of complex and',
'from a lowly high school English teacher to the highest political executive in Israel when the Nazis captured that state in 1941, she wrote in her 1945 biography the "secret" book on the Nazi regime:\n\n"Among the Nazi',
'It is said that the woman of the North Sea was one of the most powerful of the Seven Sisters; and many of them were able to overcome other men. She is described as a "Great Great Woman from the',
'In a series of letters of his heredity to a classmate in a college near his home, she was asked by him to take her down to her home in Florida where she lives with her boyfriend, while she tried to']
My aim is to get the same output as above, but running in parallel via joblib. However, when converting my code to joblib for parallelisation, I get the following error:
%%time
my_list = Parallel(n_jobs=8)(delayed(query)(invoc, entries) for invoc in invoc_list)
my_list
>>> PicklingError: Could not pickle the task to send it to the workers.
Notebook:
import time

# GPT-2
from transformers import pipeline, set_seed
from joblib import Parallel, delayed  # needed for the parallel version below

generator = pipeline('text-generation', model='gpt2')
set_seed(42)
#import dill as pickle

def query(payload, entries):
    list_dict = generator(payload, min_length=25, max_length=50,
                          num_return_sequences=entries)
    output = [d['generated_text'].split(payload)[1].strip() for d in list_dict]
    return output

invoc_list = ['his her criminal charges include but are not limited to:',
              'the defendant was sentenced to:',
              'his her education history:']
entries = 4

%%time
for invoc in invoc_list:
    val_list = query(invoc, entries)
val_list

%%time
my_list = Parallel(n_jobs=8)(delayed(query)(invoc, entries) for invoc in invoc_list)
my_list
Note: you may need to install:
pip3 install joblib
pip3 install dill
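One commonly suggested workaround (an assumption on my part, not from the original question) is joblib's threading backend: threads share the process, so the task never has to be pickled. The GIL may limit the speed-up, though much of GPT-2 inference runs in native code:

from joblib import Parallel, delayed

# backend="threading" keeps everything in one process, so the closure over
# `generator` never needs to be pickled.
my_list = Parallel(n_jobs=8, backend="threading")(
    delayed(query)(invoc, entries) for invoc in invoc_list
)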

How to save nested JSON data as individual columns in a CSV file using Python 3?

I need to store the data in separate columns in a CSV file, particularly 'Professional experiance'.
The columns should look like this:
Date, Education, Email, Id, Job_position, Mobile_number, Name, Working_experiance1, Date1, Experiance1, Working_experiance2, Date2, Experiance2, Skills, Total_Experiance
Input
[{'Date': '12 12 2019',
  'Education': ['BSC'],
  'Email': None,
  'Id': None,
  'Job_position': [],
  'Mobile_number': None,
  'Name': 'Kenny Dosumu',
  'Professional experiance': [{'Date1': ['May 2016', 'Dec 2019'],
                               'Experiance1': 3,
                               'Working_Experiance1': ['Project1: Aetna Insurance May 2016 – Present Scrum Master Responsibilities Indian and Philippines']},
                              {'Date2': ['Jan 2013', 'Apr 2016'],
                               'Experiance2': 3,
                               'Working_Experiance2': ['Project2: Children’s Hospital of Philadelphia Jan 2013 – Apr 2016 Responsibilities Keeping the team together all the time to ensure successful sprints. Migrating projects from Waterfall to Scrum is major responsibility.']}],
  'Skills': ['Vision',
             'Matrix',
             'Product owner',
             'Scrum',
             'Documents'],
  'Total_experience': 6}]
To complete your task, you must first load your JSON file into Python. To do so, you can use something like this:
import json

with open('raw_data.json') as json_file:
    raw_data = json.load(json_file)
In my example I simply copy-pasted a couple of lines of your input into the script, but you should import the file. So here we go!
import xlwt

raw_data = [{'Date': '12 12 2019',
             'Education': ['BSC'],
             'Email': None,
             'Id': None}]
You then need to create a list to store all the info contained in the JSON file. The first line of this list will contain the title of every column:
excel_data = [['Date','Education','Email','Id']]
Then, for every dictionary stored in your JSON file, you get the information and store it in the previously created list.
for element in raw_data:
    date = element['Date']
    education = ', '.join(element['Education'])  # xlwt cannot write a raw list
    email = element['Email']
    identification = element['Id']
    excel_data.append([date, education, email, identification])
Finally, you build the Excel file using xlwt and the previously created list.
excel_file = xlwt.Workbook()
sheet = excel_file.add_sheet('Data')
for i, l in enumerate(excel_data):
    for j, col in enumerate(l):
        sheet.write(i, j, col)
# xlwt writes the legacy binary .xls format, so use an .xls extension
excel_file.save('ExcelFile.xls')
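The code above only covers the flat keys. As a sketch of how the nested 'Professional experiance' part could be flattened into the same row (my extension, not part of the original answer, assuming the Working_ExperianceN/DateN/ExperianceN keys shown in the question):

def flatten_record(element):
    row = {'Date': element['Date'],
           'Education': ', '.join(element['Education']),
           'Email': element['Email'],
           'Id': element['Id']}
    # Merge each nested job dict (Date1, Experiance1, ...) into the flat row
    for job in element.get('Professional experiance', []):
        for key, value in job.items():
            # Join list values such as ['May 2016', 'Dec 2019'] into one cell
            row[key] = ', '.join(map(str, value)) if isinstance(value, list) else value
    return row

Calling flatten_record on every element of raw_data gives one flat dict per person; its values (taken in a fixed key order) can then be appended to excel_data just as above.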
I hope this solves your problem. As you can see, your problem contained a bunch of different questions, namely:
1 - How to import a JSON file into Python (Import JSON data into Python)
2 - How to recover elements from a dictionary (Accessing elements of Python dictionary by index)
3 - How to write a Python list to Excel (Writing to an Excel spreadsheet)
Next time, try breaking your problem into small steps and quickly Googling how to accomplish each one if you are not sure. This will rapidly improve your programming skills and help you solve the problem at hand.
Happy Holidays!

How can Python get the difference between all pairs of rows across multiple columns?

I have two CSV files, each with multiple columns and rows, and I'd like to get the differences across all rows of both files. For example, if the Asset Tag Number differs between the files, the difference should be highlighted in some form (bold values or anything appropriate). The key here is Serial Number, which is unique in both files. Ideally the differing rows would go into a new .csv file with the differences highlighted, and identical rows removed.
Just for reference, both files have more than 100 columns.
My actual data columns look like this in both CSV files:
Columns: [Asset Tag Number_a, Serial Number_a, System Name_a, Domain_a, System manufacturer_a, Model Name_a, System Type_a, Critical Level_a, Purpose Level 1_a, Purpose2_a, ShareIndv_a, Site_a, Building_a, Room_a, Rack_a, serverCostCenter_a, User ID BU Grp Mgr_a, OS Name_a, OS Version_a, OS Type_a, Service Pack_a, Notification Group_a, Off The Network_a, First Name_a, Last Name_a, Manager Name_a, Status_a, BU Cost Center_a, BU CC Description_a, Organization Name_a, Higher Level BU_a, Business Contact_a, Description_a, Asset Type_a, System Type SW_a, Server _a, Host ID(Unix)_a, IP Address_a, MAC Address_a, Installed RAM_a, Disk Capacity_a, Installed Disk_a, Server Status _a, High Level Status_a, Lifecycle Status_a, EndOfLifeDate_a, Last Audit_a, AltVersion_a, BIOS Vendor_a, BIOS Version_a, BIOS Release Date_a, SMBIOS Enabled_a, SMBios Version_a, Region_a, Currency_a, Acquisition Cost USD_a, Net Book Value USD_a, CPU Type_a, CPU Speed_a, Acquisition Date_a, Age_a, DateModified_a, Altiris Exception_a, Inventory Owner_a, Last Logon User_a, Inventory Owner Last Logon User_a, Client Date_a, Reporting Status_a, Contact Status_a, Comments_a, Exception Reason_a, DNR_a, Asset Tag Number_b, Serial Number_b, System Name_b, Domain_b, System manufacturer_b, Model Name_b, System Type_b, Critical Level_b, Purpose Level 1_b, Purpose2_b, ShareIndv_b, Site_b, Building_b, Room_b, Rack_b, serverCostCenter_b, User ID BU Grp Mgr_b, OS Name_b, OS Version_b, OS Type_b, Service Pack_b, Notification Group_b, Off The Network_b, First Name_b, Last Name_b, Manager Name_b, Status_b, BU Cost Center_b, ...]
Index: []
As a newbie pandas learner I have applied a few approaches, but none seems a close fit, so I'm seeking help and suggestions.
1) First code I tried:
#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd
A = pd.read_csv('a.csv', index_col=0)
B = pd.read_csv('b.csv', index_col=0)
C = pd.merge(left=A,right=B, how='outer', left_index=True, right_index=True, suffixes=['_a', '_b'])
not_in_a = C.drop( A.index )
not_in_b = C.drop( B.index )
not_in_a.to_csv('not_in_a.csv')
not_in_b.to_csv('not_in_b.csv')
2) I tried another approach, but the output is so wide it's tough to read; this snippet should drop the duplicates and print only the rows that differ:
from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import csv
import pandas as pd
##### Python pandas, widen output display to see more columns. ####
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)
a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep='first', inplace=True)
c.reset_index(drop=True, inplace=True)
print(c)
I did some Google searching and found some Stack Overflow discussion on the topic. There are some decent solutions in those threads, but nothing I felt would meet my requirements, hence posting here.
3) A third approach using Python sets, which works partially:
#!/grid/common/pkgs/python/v3.6.1/bin/python3
orig = open('aa.csv', 'r')
new = open('bb.csv', 'r')
bigb = set(new) - set(orig)
print(bigb)
# Write to output file; the with statement closes file_out automatically
with open('different.csv', 'w') as file_out:
    for line in bigb:
        file_out.write(line)
orig.close()
new.close()
Below are two sample files that look similar to my data, where we can take Serial Number as the key for the logic and code.
My two CSV files, file1.csv and file2.csv:
File1:
wrkStaId Asset Tag Number Serial Number System Name
mac-ymatsuok2
PC-ABNER-W10
PC-ADAMLIN-W10
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485 ser900 dgs1sj
File2:
wrkStaId Asset Tag Number Serial Number System Name
mac-ymatsuok2
PC-Karn-W10
PC-ADAMLIN-W10
PC-ADRIANA-W10
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485 ser900 dgs1sj
Desired result: a commenter asked how I want the difference represented, as these are non-numeric values, and whether both rows should be printed into a new file when they differ and dropped when they are the same.
ANS: Yes.
Desired output:
Difference in File1 which is not in File2:
wrkStaId Asset Tag Number Serial Number System Name
PC-ABNER-W10
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456 ser345 A313819
Difference in File2 which is not in File1:
wrkStaId Asset Tag Number Serial Number System Name
PC-Karn-W10
PC-ADRIANA-W10
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709 ser345 A313819
Many thanks to @w-m; however, I'm still hoping to get some more ideas from the SO experts.
Your data seems to contain two parts: a list of System Names, and then a table of full rows. As the structures are quite different, I suggest you split the data into the list of System Names and the full rows and process them separately.
First extract the System Name lists:
l1 = df1[df1.wrkStaId == ""].System_Name
l2 = df2[df2.wrkStaId == ""].System_Name
You can get the difference with Python set difference code:
>>> set(l1).difference(set(l2))
{'PC-ABNER-W10'}
>>> set(l2).difference(set(l1))
{'PC-ADRIANA-W10', 'PC-Karn-W10'}
Now drop the empty wrkStaId entries:
df1 = df1[df1.wrkStaId != ""].set_index("wrkStaId")
df2 = df2[df2.wrkStaId != ""].set_index("wrkStaId")
The rest of the data now contains full rows with wrkStaId as the index.
df1:
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234.0 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456.0 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456.0 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485.0 ser900 dgs1sj
df2:
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{ED0CCFFD-28D6-4170-9DE9-0DFB83F49193} 1234.0 ser123 sfreder
{8AEAF485-A4FF-460C-91FA-0DFCAD79DD24} 3456.0 ser124 10210277
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709.0 ser345 A313819
{445EC096-A70C-47D1-91FF-0DFE747F762A} 4485.0 ser900 dgs1sj
You can now do the set differences on the pandas df's like this:
>>> df1[~df1.isin(df2).all(1)]
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 456.0 ser345 A313819
>>> df2[~df2.isin(df1).all(1)]
Asset_Tag_Number Serial_Number System_Name
wrkStaId
{E6204B69-DABB-4A1E-906B-0DFD2BCEDA41} 1709.0 ser345 A313819
You may need to adapt the code a little to get exactly what you want, but I hope this gets you going.
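For the 100-plus-column case in the question, a sketch of one way to adapt this (my assumption, not part of the original answer; it presumes both files are real CSVs with a filled, unique 'Serial Number' column): key both frames on Serial Number, align them, and write only the differing rows side by side:

import pandas as pd

a = pd.read_csv('file1.csv').set_index('Serial Number')
b = pd.read_csv('file2.csv').set_index('Serial Number')

# Align on the shared keys, then keep only the rows where any column differs.
# Note that NaN != NaN, so cells blank in both files also count as differing.
common = a.index.intersection(b.index)
a_c, b_c = a.loc[common], b.loc[common]
differs = (a_c != b_c).any(axis=1)

diff = a_c[differs].join(b_c[differs], lsuffix='_a', rsuffix='_b')
diff.to_csv('new.csv')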

Python: creating a pie chart using an existing object?

I'm working on a dataset called 'Crime Against Women in India'.
I got the dataset from a website and tidied up the data using Excel.
For data manipulation and visualization I'm using Python (3.0) in a Jupyter notebook (version 5.0.0). Here's the code I have so far.
# Importing libraries
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd

# Reading the CSV file into an object called crime
crime = pd.read_csv("C:\\Users\\aneeq\\Documents\\python assignment\\crime.csv",
                    index_col=None, skipinitialspace=True)
print(crime)
Now I can see my data. What I want to do next is find out which type of crime had the highest value against women in India in 2013. That's simple, and I did it with the following code:
Type = crime.loc[(crime.AreaName.isin(['All-India'])) & (crime.Year.isin([2013])) , ['Year', 'AreaName', 'Rape', 'Kidnapping', 'DowryDeaths', 'Assault', 'Insult', 'Cruelty']]
print(Type)
The result looks like this:
Year AreaName Rape Kidnapping DowryDeaths Assault Insult Cruelty
2013 All-India 33707 51881 8083 70739 12589 118866
Now, the next part is where I'm struggling. I want to make a pie chart of the crime types' values. You can see that Cruelty ('Cruelty by Husband or his relatives') has a higher value than the other crimes.
I want to display 'Rape', 'Kidnapping', 'DowryDeaths', 'Assault', 'Insult' and 'Cruelty' on the pie chart (using matplotlib), not 'Year' and 'AreaName'.
This is my code so far:
exp_val = Type.Rape, Type.Kidnapping, Type.DowryDeaths, Type.Assault, Type.Insult, Type.Cruelty
plt.pie(exp_val)
I'm not sure my code is right, but in any case I get an error saying KeyError: 0.
Can anyone help me with this? What is the right code for displaying a pie chart from an existing object?
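A sketch of one likely fix (an assumption on my part): Type is a one-row DataFrame, so Type.Rape is a Series rather than a scalar, which is what trips up plt.pie. Selecting the single row with .iloc[0] first gives plain numbers to plot:

crime_cols = ['Rape', 'Kidnapping', 'DowryDeaths', 'Assault', 'Insult', 'Cruelty']
row = Type.iloc[0]  # squeeze the one-row DataFrame into a Series of scalars
plt.pie(row[crime_cols], labels=crime_cols, autopct='%1.1f%%')
plt.title('Crimes against women in India, 2013')
plt.show()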

Relationship extraction between person and city/state

I'm trying to take a sentence and extract the relationship between a person (PER) and a place (GPE).
Sentence: "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
For the final person, she has both a city and a state that could get extracted as her place. So far, I've tried using nltk to do this, but have only been able to extract her city and not her state.
What I've tried:
import re
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels, rtuple

sentence = "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))
ISFROM = re.compile(r'.*\bfrom\b.*')
rels = extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM)
for rel in rels:
    print(rtuple(rel))
My output is:
[PER: 'John/NNP'] 'is/VBZ from/IN' [GPE: 'Ohio/NNP']
[PER: 'Michael/NNP'] 'is/VBZ from/IN' [GPE: 'Florida/NNP']
[PER: 'Rebecca/NNP'] 'is/VBZ from/IN' [GPE: 'Nashville/NNP']
The problem is Rebecca. How can I extract that both Nashville and Tennessee are part of her location? Or even just Tennessee alone?
It seems to me that you first have to extract the intra-location relationship (Nashville is in Tennessee), then transitively assign all locations to Rebecca (if Rebecca is from Nashville and Nashville is in Tennessee, then Rebecca is from both Nashville and Tennessee).
That would be one more relationship type, plus some logic for the above inference (things get complicated pretty quickly, but that is hard to avoid).
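A sketch of that idea (my assumption of the approach, not code from the answer): run a second extract_rels pass with an 'in' pattern to get GPE-to-GPE relations, then compose the two relation sets. The hardcoded sets below stand in for what the two passes would produce:

ISIN = re.compile(r'.*\bin\b.*')
# Reusing `chunked` from the question: 'which is in' links Nashville to Tennessee
loc_rels = extract_rels('GPE', 'GPE', chunked, corpus='ace', pattern=ISIN)

person_place = {('Rebecca', 'Nashville')}      # from the PER-GPE pass
place_in_place = {('Nashville', 'Tennessee')}  # from the GPE-GPE pass

# If PER is from X and X is in Y, then PER is also from Y.
inferred = {(per, outer)
            for per, inner in person_place
            for place, outer in place_in_place
            if inner == place}
print(person_place | inferred)
# {('Rebecca', 'Nashville'), ('Rebecca', 'Tennessee')}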
