Transform JSON to excel table - python-3.x

I have data in a csv file with 2 columns: the first column contains a member id and the second contains characteristics as key-value pairs (nested one under another).
I have seen code online that converts simple key-value pairs, but I am not able to transform data like what I have shown above.
I want to transform this data into an Excel table as below.
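Since the screenshots aren't attached, here is a made-up sample in the shape described. The key names 'age', 'matchruns', 'conditions' and 'type' are taken from the answer below; the member ids, sub-keys like 'test', 'odi' and 'weather', and all values are purely illustrative:

import csv

# write a small, made-up data.csv: member id in the first column,
# a nested key-value literal in the second
rows = [
    ['member_id', 'characteristics'],
    ['M001', "{'age': 27, 'matchruns': {'test': 1200, 'odi': 3400}, "
             "'conditions': {'type': ['Grass', 'Hardball'], 'weather': 'Dry'}}"],
    ['M002', "{'age': 31, 'matchruns': {'test': 800, 'odi': 2100}, "
             "'conditions': {'type': ['Turf'], 'weather': 'Humid'}}"],
]
with open('data.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)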

I did it with the XlsxWriter package, so first you have to install it by running the pip install XlsxWriter command.
import csv          # to read the csv file
import xlsxwriter   # to write the xlsx file
import ast

# you can change these names according to your local ones
csv_file = 'data.csv'
xlsx_file = 'data.xlsx'

# read the csv file and collect all the JSON values into the data list
data = []
with open(csv_file, 'r') as csvFile:
    # read the csv file line by line
    reader = csv.reader(csvFile)
    # convert every line into a list and select the JSON values
    for row in list(reader)[1:]:
        # csv fields are comma separated, so re-join all the necessary
        # parts of the JSON with commas
        json_to_str = ','.join(row[1:])
        # convert it to a python dictionary
        str_to_dict = ast.literal_eval(json_to_str)
        # append the completed JSON to the data list
        data.append(str_to_dict)

# define the excel file
workbook = xlsxwriter.Workbook(xlsx_file)
# create a sheet for our work
worksheet = workbook.add_worksheet()

# cell format for merged fields: bold, centred letters and a border
merge_format = workbook.add_format({
    'bold': 1,
    'border': 1,
    'align': 'center',
    'valign': 'vcenter'})
# other cell format that only draws the border
cell_format = workbook.add_format({
    'border': 1,
})

# create the header section dynamically
first_col = 0
last_col = 0
for index, value in enumerate(data[0].items()):
    if isinstance(value[1], dict):
        # this branch means the JSON key holds something other
        # than a single value, e.g. a dict or a list
        last_col += len(value[1].keys())
        worksheet.merge_range(first_row=0,
                              first_col=first_col,
                              last_row=0,
                              last_col=last_col,
                              data=value[0],
                              cell_format=merge_format)
        for k, v in value[1].items():
            # go one level deeper and write the sub-headers
            worksheet.write(1, first_col, k, merge_format)
            first_col += 1
        first_col = last_col + 1
    else:
        # 'age' has only one value, so this else branch
        # creates normal headers like 'age'
        worksheet.write(1, first_col, value[0], merge_format)
        first_col += 1

# now we know how many columns exist in the
# excel sheet, so set their width to 20
worksheet.set_column(first_col=0, last_col=last_col, width=20)

# fill the values into the excel file
for index, value in enumerate(data):
    last_col = 0
    for k, v in value.items():
        if isinstance(v, dict):
            # this handles values that are dictionaries
            for k1, v1 in v.items():
                if isinstance(v1, list):
                    # this captures the 'type' list (['Grass', 'Hardball'])
                    # inside 'conditions'
                    worksheet.write(index + 2, last_col, ', '.join(v1), cell_format)
                else:
                    # fill the other values that are not lists
                    worksheet.write(index + 2, last_col, v1, cell_format)
                last_col += 1
        else:
            # this handles single values that are neither dict nor list
            worksheet.write(index + 2, last_col, v, cell_format)
            last_col += 1

# finally close the workbook to create the excel file
workbook.close()
I commented most of the lines so the code is easier to understand and less complex, since you are very new to Python. If any point is unclear, let me know and I'll explain as much as I can. Additionally, I used the enumerate() built-in function; check the small example below, which I took directly from the official documentation. enumerate() is useful when numbering items in a list.
Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration. The __next__() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over iterable.
>>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']
>>> list(enumerate(seasons))
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
>>> list(enumerate(seasons, start=1))
[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]
Here is my csv file,
and here is the final output of the excel file. I just merged the duplicate header values (matchruns and conditions).
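If you want a quick sanity check of the generated file, reading it back with pandas (an optional extra that also needs openpyxl installed, which the code above does not use) should show the two-row header:

import pandas as pd

# read the two header rows back as a column MultiIndex
print(pd.read_excel('data.xlsx', header=[0, 1]))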

Related

How to extract many groups of cells separated by a specified number of rows in excel using python and write them to another file?

I have a csv file which has around 58 million cells containing numerical data. I want to extract data from every 16 cells which are 49 rows apart.
Let me describe it clearly.
The data I need to extract
The above image shows the first set of data to be extracted (rows 23 to 26, columns 92 to 95). This data has to be written to another csv file (preferably as a single row).
Then I move down 49 rows (to row 72) and extract another 4 rows x 4 columns, shown in the image below.
Next set of data
Similarly, I need to keep going till I reach the end of the file.
Third set
The next set will be the image shown above.
I have to keep going till I reach the end of the file and extract thousands of such data.
I have written code for this, but it's not working and I don't know where the mistake is. I will attach it here.
import pandas as pd
import numpy

df = pd.read_csv('TS_trace31.csv')
# print(numpy.shape(df))

arrY = []
ex = 0
for i in range(len(df)):
    if i == 0:
        for j in range(4):
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    else:
        for j in range(4):
            if j+22+i*(49) >= len(df):
                ex = 1
                break
            # print(j)
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    if ex == 1:
        break
# print(arrY)

a = []
for i in range(len(arrY) - 3):
    p = arrY[i]+arrY[i+1]+arrY[i+2]+arrY[i+3]
    a.append(p)

print(numpy.shape(a))
numpy.savetxt('myfile.csv', a, delimiter=',')
Using the above code, I didn't get the result I wanted.
Please help with this and correct where I have gone wrong.
I couldn't attach my csv file here; please use any sample sheet that you have, or create a simple one.
Thanks in advance! Have a great day.
I don't know exactly what you are doing in your code, but I wrote my own:
import csv
from itertools import chain

CSV_PATH = 'TS_trace31.csv'

new_data = []
with open(CSV_PATH, 'r') as csvfile:
    reader = csv.reader(csvfile)
    # row_num for storing the big jumps e.g. 23, 72, 121 ...
    row_num = 23
    # n for storing the group number 0 - 3;
    # with n we can find rows 23, 24, 25, 26
    n = 0
    # row_group for storing every group of 4 rows
    row_group = []
    # loop over every row in the main file
    for row in reader:
        if reader.line_num == row_num + n:
            # the first time this is going to be 23 + 0,
            # then we add one to n
            # so the next cycle will be 24 and so on
            n += 1
            print(reader.line_num)
            # add each row to its group
            row_group.append(row[91:95])
            # check if we are at the end of the group e.g. 26
            if n == 4:
                # reset the group number
                n = 0
                # add the jump to the main row number
                row_num += 49
                # combine the whole row_group into a single row
                new_data.append(list(chain(*row_group)))
                # clear the row_group for the next set of rows
                row_group.clear()
                print('='*50)
        else:
            continue

# and finally write all the rows to a new file
with open('myfile.csv', 'w') as new_csvfile:
    writer = csv.writer(new_csvfile)
    writer.writerows(new_data)
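For comparison, here is a minimal pandas-based sketch of the same extraction. It assumes the exact offsets from the question (the first 4x4 block starts at iloc row 21, i.e. spreadsheet row 23 once the header row is consumed, columns 91:95, and each block starts 49 rows after the previous one) and purely numerical cells, so treat it as an illustration rather than a drop-in replacement:

import pandas as pd
import numpy as np

df = pd.read_csv('TS_trace31.csv')

blocks = []
start = 21                                   # iloc index of spreadsheet row 23
while start + 4 <= len(df):
    block = df.iloc[start:start + 4, 91:95]  # 4 rows x 4 columns
    blocks.append(block.to_numpy().ravel())  # flatten the block into a single row
    start += 49                              # jump to the start of the next block

# assumes the selected cells are numeric; otherwise pass a suitable fmt
np.savetxt('myfile.csv', blocks, delimiter=',')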

How to color in red values that are different in adjacent columns?

I have the following dataframe, and I want to color in red the values that differ between each pair of adjacent columns. So, for example, for 'max', CRIM raw=88.98 and CRIM winsorized=41.53 should be in red, whereas for AGE they should remain black.
How can I do this? Attached is the CSV file.
,25%,25%,50%,50%,75%,75%,count,count,max,max,mean,mean,min,min,std,std
,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized
CRIM,0.08,0.08,0.26,0.26,3.68,3.68,506.0,506.0,88.98,41.53,3.61,3.38,0.01,0.01,8.6,6.92
ZN,0.0,0.0,0.0,0.0,12.5,12.5,506.0,506.0,100.0,90.0,11.36,11.3,0.0,0.0,23.32,23.11
INDUS,5.19,5.19,9.69,9.69,18.1,18.1,506.0,506.0,27.74,25.65,11.14,11.12,0.46,1.25,6.86,6.81
CHAS,0.0,0.0,0.0,0.0,0.0,0.0,506.0,506.0,1.0,1.0,0.07,0.07,0.0,0.0,0.25,0.25
NOX,0.45,0.45,0.54,0.54,0.62,0.62,506.0,506.0,0.87,0.87,0.55,0.55,0.38,0.4,0.12,0.12
RM,5.89,5.89,6.21,6.21,6.62,6.62,506.0,506.0,8.78,8.34,6.28,6.29,3.56,4.52,0.7,0.68
AGE,45.02,45.02,77.5,77.5,94.07,94.07,506.0,506.0,100.0,100.0,68.57,68.58,2.9,6.6,28.15,28.13
DIS,2.1,2.1,3.21,3.21,5.19,5.19,506.0,506.0,12.13,9.22,3.8,3.78,1.13,1.2,2.11,2.05
RAD,4.0,4.0,5.0,5.0,24.0,24.0,506.0,506.0,24.0,24.0,9.55,9.55,1.0,1.0,8.71,8.71
TAX,279.0,279.0,330.0,330.0,666.0,666.0,506.0,506.0,711.0,666.0,408.24,407.79,187.0,188.0,168.54,167.79
PTRATIO,17.4,17.4,19.05,19.05,20.2,20.2,506.0,506.0,22.0,21.2,18.46,18.45,12.6,13.0,2.16,2.15
B,375.38,375.38,391.44,391.44,396.22,396.22,506.0,506.0,396.9,396.9,356.67,356.72,0.32,6.68,91.29,91.14
LSTAT,6.95,6.95,11.36,11.36,16.96,16.96,506.0,506.0,37.97,34.02,12.65,12.64,1.73,2.88,7.14,7.08
MEDV,17.02,17.02,21.2,21.2,25.0,25.0,506.0,506.0,50.0,50.0,22.53,22.54,5.0,7.0,9.2,9.18
Nothing more, Nothing less :)
def highlight_cols(s):
    # input: s is a pd.Series with an attribute name
    #   s.name --> ('25%', 'raw')
    #              ('25%', 'winsorized')
    #              ...
    #
    # 1) Take the parent level of s.name (first value of the tuple), e.g. 25%
    # 2) Select that subset from df, given step 1
    #    --> this gives you the df: 25% - raw | 25% - winsorized
    # 3) Check if the number of unique values (for each row) > 1
    #    If so: return a red style
    #    If not: return an empty string
    #
    # Output: a list with the desired style for series s
    return ['background-color: red' if x else '' for x in df[s.name[0]].nunique(axis=1) > 1]

df.style.apply(highlight_cols)
You can do this comparison between columns using a groupby. Here's an example:
import pandas as pd
import io
s = """,25%,25%,50%,50%,75%,75%,count,count,max,max,mean,mean,min,min,std,std
,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized
CRIM,0.08,0.08,0.26,0.26,3.68,3.68,506.0,506.0,88.98,41.53,3.61,3.38,0.01,0.01,8.6,6.92
ZN,0.0,0.0,0.0,0.0,12.5,12.5,506.0,506.0,100.0,90.0,11.36,11.3,0.0,0.0,23.32,23.11
INDUS,5.19,5.19,9.69,9.69,18.1,18.1,506.0,506.0,27.74,25.65,11.14,11.12,0.46,1.25,6.86,6.81
CHAS,0.0,0.0,0.0,0.0,0.0,0.0,506.0,506.0,1.0,1.0,0.07,0.07,0.0,0.0,0.25,0.25
NOX,0.45,0.45,0.54,0.54,0.62,0.62,506.0,506.0,0.87,0.87,0.55,0.55,0.38,0.4,0.12,0.12
RM,5.89,5.89,6.21,6.21,6.62,6.62,506.0,506.0,8.78,8.34,6.28,6.29,3.56,4.52,0.7,0.68
AGE,45.02,45.02,77.5,77.5,94.07,94.07,506.0,506.0,100.0,100.0,68.57,68.58,2.9,6.6,28.15,28.13
DIS,2.1,2.1,3.21,3.21,5.19,5.19,506.0,506.0,12.13,9.22,3.8,3.78,1.13,1.2,2.11,2.05
RAD,4.0,4.0,5.0,5.0,24.0,24.0,506.0,506.0,24.0,24.0,9.55,9.55,1.0,1.0,8.71,8.71
TAX,279.0,279.0,330.0,330.0,666.0,666.0,506.0,506.0,711.0,666.0,408.24,407.79,187.0,188.0,168.54,167.79
PTRATIO,17.4,17.4,19.05,19.05,20.2,20.2,506.0,506.0,22.0,21.2,18.46,18.45,12.6,13.0,2.16,2.15
B,375.38,375.38,391.44,391.44,396.22,396.22,506.0,506.0,396.9,396.9,356.67,356.72,0.32,6.68,91.29,91.14
LSTAT,6.95,6.95,11.36,11.36,16.96,16.96,506.0,506.0,37.97,34.02,12.65,12.64,1.73,2.88,7.14,7.08
MEDV,17.02,17.02,21.2,21.2,25.0,25.0,506.0,506.0,50.0,50.0,22.53,22.54,5.0,7.0,9.2,9.18"""
df = pd.read_csv(io.StringIO(s), header=[0,1])
df = df.set_index(df.columns[0])
df.index.name = ''
def get_styles_inner(col):
    first_level_name = col.columns[0][0]
    # compare raw and winsorized
    match = col[(first_level_name, 'raw')] == col[(first_level_name, 'winsorized')]
    # flag both the raw and winsorized cells so they can be colored red if they don't match
    col[(first_level_name, 'raw')] = match
    col[(first_level_name, 'winsorized')] = match
    return col
def get_styles(df):
    # Grouping on the first level of the index of the columns, pass each
    # group to get_styles_inner.
    match_df = df.groupby(level=0, axis=1).apply(get_styles_inner)
    # Replace True with no style, and False with red
    style_df = match_df.applymap(lambda x: None if x else 'color:red;')
    return style_df

df.style.apply(get_styles, axis=None)
(The first block of code just loads your dataset; you can ignore it if you already have the dataset loaded.)
Here's the output:

Table from PrettyTable to pdf

I created a table using PrettyTable. I would like to save the output as a .pdf file but the only thing I can do is save it as .txt.
How can I save it as a .pdf file?
I installed the FPDF library, but I am stuck at this point.
# my table is saved under the 'data' variable name
# I saved the table ('data') as a .txt file
data = x.get_string()
with open('nameoffile.txt', 'w') as f:
    f.write(data)
print(data)
PrettyTable is not meant to export data to a pdf file; it is used to display ASCII tables.
The following code is a homemade method that answers your problem.
Let's assume you have this PrettyTable that you want to export:
from prettytable import PrettyTable
x = PrettyTable()
x.field_names = ["City name", "Area", "Population", "Annual Rainfall"]
x.add_row(["Adelaide", 1295, 1158259, 600.5])
x.add_row(["Brisbane", 5905, 1857594, 1146.4])
x.add_row(["Darwin", 112, 120900, 1714.7])
x.add_row(["Hobart", 1357, 205556, 619.5])
x.add_row(["Sydney", 2058, 4336374, 1214.8])
x.add_row(["Melbourne", 1566, 3806092, 646.9])
x.add_row(["Perth", 5386, 1554769, 869.4])
First, you need to get the content of your table. The module isn't supposed to work this way: it assumes that you already have table content that you want to display. Let's do the opposite:
def get_data_from_prettytable(data):
    """
    Get a list of lists from a PrettyTable table.
    Arguments:
        :param data: data table to process
        :type data: PrettyTable
    """
    def remove_space(liste):
        """
        Remove the spaces around each word in a list.
        Arguments:
            :param liste: list of strings
        """
        list_without_space = []
        for mot in liste:                                   # for each word in the list
            word_without_space = mot.replace(' ', '')       # word without spaces
            list_without_space.append(word_without_space)   # list of words without spaces
        return list_without_space

    # Get each row of the table
    string_x = str(data).split('\n')        # get a list of rows
    header = string_x[1].split('|')[1:-1]   # column names
    rows = string_x[3:len(string_x) - 1]    # list of data rows

    list_word_per_row = []
    for row in rows:                                         # for each row
        row_resize = row.split('|')[1:-1]                    # remove the first and last (empty) items
        list_word_per_row.append(remove_space(row_resize))   # remove the spaces
    return header, list_word_per_row
Then you can export it to a pdf file. Here is one solution:
from fpdf import FPDF

def export_to_pdf(header, data):
    """
    Create a table in a PDF file from a list of rows.
    :param header: column names
    :param data: list of rows (a row = a list of cells)
    """
    pdf = FPDF()                        # new pdf object
    pdf.set_font("Arial", size=12)      # font style
    epw = pdf.w - 2*pdf.l_margin        # effective width of the document
    col_width = pdf.w / 4.5             # column width in the table
    row_height = pdf.font_size * 1.5    # row height in the table
    spacing = 1.3                       # spacing factor for each cell

    pdf.add_page()                               # add a new page
    pdf.cell(epw, 0.0, 'My title', align='C')    # create the title cell
    pdf.ln(row_height*spacing)                   # move below the title line

    # Add the header row
    for item in header:                              # for each column
        pdf.cell(col_width, row_height*spacing,      # add a new cell
                 txt=item, border=1)
    pdf.ln(row_height*spacing)                       # new line after the header

    for row in data:                                 # for each row of the table
        for item in row:                             # for each cell in the row
            pdf.cell(col_width, row_height*spacing,  # add a cell
                     txt=item, border=1)
        pdf.ln(row_height*spacing)                   # new line at the end of the row

    pdf.output('simple_demo.pdf')    # create the pdf file
    pdf.close()                      # close the file
Finally, you just have to call the two functions:
header, data = get_data_from_prettytable(x)
export_to_pdf(header, data)

Parsing data with variable number of columns

I have several .txt files with 140k+ lines each. They all have three types of data, which are a mix of string and floats:
- 7 col
- 14 col
- 18 col
What is the best and fastest way to parse such data?
I tried to use numpy.genfromtxt with usecols=np.arange(0,7), but that obviously cuts out the 14- and 18-column data.
# for 7 col data
load = np.genfromtxt(filename, dtype=None, names=('day', 'tod', 'condition', 'code', 'type', 'state', 'timing'), usecols=np.arange(0,7))
I would like to parse the data as efficiently as possible.
The solution is rather simple and intuitive. We check whether the number of columns in each row matches one of the expected counts and append the line to the corresponding list. For easier analysis/modification of the data, we can then convert each list to a pandas DataFrame (or a NumPy array) as desired; below I show the conversion to DataFrames. The column counts in my dataset are 7, 14 and 18. I want my data labeled, so I pass a list of column names to each DataFrame via the columns argument.
import pandas as pd

filename = "textfile.txt"

labels_array1 = []  # the 7 column labels
labels_array2 = []  # the 14 column labels
labels_array3 = []  # the 18 column labels

array1 = []  # lines with 7 columns
array2 = []  # lines with 14 columns
array3 = []  # lines with 18 columns

with open(filename, "r") as f:
    lines = f.readlines()
    for line in lines:
        num_items = len(line.split())
        if num_items == 7:
            array1.append(line.rstrip())
        elif num_items == 14:
            array2.append(line.rstrip())
        elif num_items == 18:
            array3.append(line.rstrip())
        else:
            print("Detected a line with a different number of columns:", num_items)

df1 = pd.DataFrame([sub.split() for sub in array1], columns=labels_array1)
df2 = pd.DataFrame([sub.split() for sub in array2], columns=labels_array2)
df3 = pd.DataFrame([sub.split() for sub in array3], columns=labels_array3)
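As a hedged usage sketch: the 7-column labels below are taken from the genfromtxt call in the question, the 14- and 18-column label lists would be filled the same way with your real column names, and treating 'timing' as numeric is only an assumption:

labels_array1 = ['day', 'tod', 'condition', 'code', 'type', 'state', 'timing']

df1 = pd.DataFrame([sub.split() for sub in array1], columns=labels_array1)
# the cells are read as strings, so convert numeric columns as needed
df1['timing'] = pd.to_numeric(df1['timing'], errors='coerce')
print(df1.head())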

Python3 - using pandas to group rows, where two columns contain values in forward or reverse order: v1,v2 or v2,v1

I'm fairly new to python and pandas, but I've written code that reads an excel workbook, and groups rows based on the values contained in two columns.
So where Col_1=A and Col_2=B, or Col_1=B and Col_2=A, both would be assigned a GroupID=1.
sample spreadsheet data, with rows color coded for ease of visibility
I've managed to get this working, but I wanted to know if there's a simpler/more efficient/cleaner/less clunky way to do this.
import pandas as pd

df = pd.read_excel('test.xlsx')

# get the column values into a list
col_group = df.groupby(['Header_2', 'Header_3'])
original_list = list(col_group.groups)

# parse the list to remove 'reverse-duplicates'
new_list = []
for a, b in original_list:
    if (b, a) not in new_list:
        new_list.append((a, b))

# iterate through each row in the DataFrame and
# check whether the values in new_list exist, in forward or reverse order
for index, row in df.iterrows():
    for a, b in new_list:
        # if the values exist in the forward direction
        if (a in df.loc[index, "Header_2"]) and (b in df.loc[index, "Header_3"]):
            # GroupID value given, where value is the index of the pair in new_list
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1
        # else check if the values exist in the reverse direction
        if (b in df.loc[index, "Header_2"]) and (a in df.loc[index, "Header_3"]):
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1

# finally write the DataFrame to a new spreadsheet
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, 'Sheet1')
I know of the pandas.groupby([columnA, columnB]) option, but I couldn't figure out a way to create groups that contain both (v1, v2) and (v2, v1).
A boolean mask should do the trick:
import pandas as pd
df = pd.read_excel('test.xlsx')
mask = ((df['Header_2'] == 'A') & (df['Header_3'] == 'B') |
        (df['Header_2'] == 'B') & (df['Header_3'] == 'A'))
# Label each row in the original DataFrame with
# 1 if it matches the specified criteria, and
# 0 if it does not.
# This column can now be used in groupby operations.
df.loc[:, 'match_flag'] = mask.astype(int)
# Get rows that match the criteria
df[mask]
# Get rows that do not match the criteria
df[~mask]
EDIT: updated answer to address the groupby requirement.
I would do something like this.
import pandas as pd
df = pd.read_excel('test.xlsx')
#make the ordering consistent
df["group1"] = df[["Header_2","Header_3"]].max(axis=1)
df["group2"] = df[["Header_2","Header_3"]].min(axis=1)
#group them together
df = df.sort_values(by=["group1","group2"])
If you need to deal with more than two columns, I can write up a more general way to do this.
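As a hedged follow-up sketch: once the ordering is normalised into group1/group2, a numeric GroupID per unordered pair, like the one in the question, can be assigned directly with ngroup():

# one GroupID per unordered (Header_2, Header_3) pair
df["GroupID"] = df.groupby(["group1", "group2"], sort=False).ngroup() + 1
# drop the helper columns and write the result, mirroring the question's output step
df.drop(columns=["group1", "group2"]).to_excel('output.xlsx', index=False)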
