Generating test data in Excel for an EAV table - excel

This is a pretty complicated question so be prepared! I want to generate some test data in Excel for my EAV table. The columns I have are:
user_id, attribute, value
Each user_id repeats a random number of times between 1 and 4, and for each entry I want to pick a random attribute from a list, and then a random value which that attribute can take on. Lastly, I want the attributes for each id to be unique, i.e. I do not want more than one entry with the same id and attribute. Below is an example of what I mean:
user_id attribute value
100001 gender male
100001 religion jewish
100001 university imperial
100002 gender female
100002 course physics
Possible values:
attribute value
gender male
gender female
course maths
course physics
course chemistry
university imperial
university cambridge
university oxford
university ucl
religion jewish
religion hindu
religion christian
religion muslim
Sorry that the table above messed up. I don't know how to paste into here while retaining the structure! Hopefully you can see what I'm talking about otherwise I can get a screenshot.
How can I do this? In the past I have generated random data using a random number generator and a VLOOKUP but this is a bit out of my league.

My approach is to create a table with all four attributes for each ID and then filter that table randomly to get between one and four filtered rows per ID. I assigned a random value to each attribute. The basic setup looks like this:
To the left is the randomized EAV table and to the right is the lookup table used for the randomized values. Here are the formulas. Enter them and copy down:
Column A - Establishes a random number from 1 to 4 for every block of four rows (one block per ID). This determines the attribute row that must be selected:
=IF(COUNTIF(C$2:C2,C2)=1,RANDBETWEEN(1,4),A1)
Column B - Uses the formula in A to determine if row is included:
=IF(COUNTIF(C$2:C2,C2)=A2,TRUE,RANDBETWEEN(0,1)=1)
Column C - Creates the IDs, starting with 100,001:
=(INT((ROW()-2)/4)+100000)+1
Column D - Repeats the four attributes:
=CHOOSE(MOD(ROW()-2,4)+1,"gender","course","university","religion")
Column E - Finds the first occurrence of the Column D attribute in the lookup table and selects a randomly offset value:
=INDEX($H$2:$H$14,(MATCH(D2,$G$2:$G$14,0))+RANDBETWEEN(0,COUNTIF($G$2:$G$14,D2)-1))
When you filter on the TRUEs in Column B you'll get your list of one to four Attributes per ID. Disconcertingly, the filtering forces a recalculation, so the filtered list will no longer say TRUE for every cell in column B.
If this was mine I'd automate it a little more, perhaps by putting the "magic number" 4 (the count of attributes) in its own cell.
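For readers who find the formulas hard to follow, the same "build every attribute row, then filter at random" logic can be sketched in Python. This is only an illustration of the idea, not part of the spreadsheet: the function name eav_rows_by_filtering and the rng parameter are made up for the example, and the comments map each step to the column it mirrors.

```python
import random

ATTRIBUTES = ["gender", "course", "university", "religion"]
VALUES = {
    "gender": ["male", "female"],
    "course": ["maths", "physics", "chemistry"],
    "university": ["imperial", "cambridge", "oxford", "ucl"],
    "religion": ["jewish", "hindu", "christian", "muslim"],
}


def eav_rows_by_filtering(first_id, id_count, rng=random):
    """Build all attribute rows per ID, then keep a random subset (at least one)."""
    rows = []
    for user_id in range(first_id, first_id + id_count):
        # Column A: position of the one row that is always kept for this ID
        guaranteed = rng.randint(1, len(ATTRIBUTES))
        for position, attribute in enumerate(ATTRIBUTES, start=1):
            # Column B: keep the guaranteed row, otherwise flip a coin
            keep = position == guaranteed or rng.randint(0, 1) == 1
            if keep:
                # Column E: pick a random value for this attribute
                value = rng.choice(VALUES[attribute])
                rows.append((user_id, attribute, value))
    return rows


if __name__ == "__main__":
    for row in eav_rows_by_filtering(100001, 3):
        print(*row)
```

Because each ID only ever sees each attribute once, the uniqueness requirement holds by construction, and the guaranteed row ensures no ID ends up with zero rows.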

There are a number of ways to do this. You could use either Perl or Python. Both have modules for working with spreadsheets. In this case, I used Python and the openpyxl module.
# File: datagen.py
# Usage: python datagen.py <Excel (.xlsx) filename to store data>
# Example: python datagen.py myfile.xlsx
import sys
import random
from openpyxl import Workbook

# verify that the user specified an argument
if len(sys.argv) < 2:
    print("Specify an Excel filename to save the data, e.g. myfile.xlsx")
    sys.exit(1)

# get the Excel workbook and worksheet objects
wb = Workbook()
ws = wb.active

# Modify this line to specify the range of user ids
ids = range(100001, 100100)

# data structure for the attributes and values
data = {'gender': ['male', 'female'],
        'course': ['maths', 'physics', 'chemistry'],
        'university': ['imperial', 'cambridge', 'oxford', 'ucl'],
        'religion': ['jewish', 'hindu', 'christian', 'muslim']}

# Write column headers in the spreadsheet
ws.cell(row=1, column=1).value = 'user_id'
ws.cell(row=1, column=2).value = 'attribute'
ws.cell(row=1, column=3).value = 'value'
row = 1

# Loop through each user id
for user_id in ids:
    # randomly select how many attributes to use
    attr_cnt = random.randint(1, 4)
    # copy the keys into a list so selected attributes can be removed
    attributes = list(data.keys())
    for idx in range(attr_cnt):
        # randomly select attribute
        attr = random.choice(attributes)
        # remove the selected attribute from further selection for this user id
        attributes.remove(attr)
        # randomly select a value for the attribute
        value = random.choice(data[attr])
        row = row + 1
        # write the values for the current row in the spreadsheet
        ws.cell(row=row, column=1).value = user_id
        ws.cell(row=row, column=2).value = attr
        ws.cell(row=row, column=3).value = value

# save the spreadsheet using the filename specified on the cmd line
wb.save(filename=sys.argv[1])
print("Done!")
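As a variation on the script above, the choose-then-remove loop can be replaced by random.sample, which draws distinct items in one call. The sketch below uses only the standard library and returns plain tuples instead of writing a workbook; the names random_eav_rows and VALUES_BY_ATTRIBUTE are made up for this example.

```python
import random

VALUES_BY_ATTRIBUTE = {
    "gender": ["male", "female"],
    "course": ["maths", "physics", "chemistry"],
    "university": ["imperial", "cambridge", "oxford", "ucl"],
    "religion": ["jewish", "hindu", "christian", "muslim"],
}


def random_eav_rows(user_ids, values_by_attribute, rng=random):
    """Return (user_id, attribute, value) tuples with unique attributes per ID."""
    rows = []
    attribute_names = list(values_by_attribute)
    for user_id in user_ids:
        # how many attributes this ID gets: between 1 and all of them
        count = rng.randint(1, len(attribute_names))
        # sample() never repeats an item, so attributes are unique per ID
        for attribute in rng.sample(attribute_names, count):
            value = rng.choice(values_by_attribute[attribute])
            rows.append((user_id, attribute, value))
    return rows


if __name__ == "__main__":
    for row in random_eav_rows(range(100001, 100004), VALUES_BY_ATTRIBUTE):
        print(*row)
```

The resulting tuples could then be written to a worksheet row by row, as in the script above, or to a CSV file.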

Related

I need to pull a complete row from one Excel sheet to another Excel sheet, based on a cell value

I have 2 Excel worksheets. In the first I have a table with the columns "Sales Order" and "SO Item" for each row (product), plus some other columns. In this table I concatenate "Sales Order" and "SO Item" so that I have a Sales Order parent (xxxxxxx00) and also Sales Order children (xxxxxxx01, xxxxxxx02,...,xxxxxxx09). The second worksheet also has the "concatenation" column, but it contains only Sales Order parents. How can I pull the whole rows containing the children of each parent from worksheet 1 to worksheet 2?
I've tried to do it using VLOOKUP, but this returns only a single child value (xxxxxxx001), and it does not return the whole row where this code is located.
Table 1 is:
Sales Order | SO Item | Concatenation | Material Description | Feas Plan Date
2503319449 | 100 | 2503319449100 | SYS-7210 SAS-Mxp | Bundle Header
2503319449 | 101 | 2503319449101 | PS-7210 SAS-T/Mxp | 1/31/2023
2503319449 | 102 | 2503319449102 | SYS-7210 SAS-Mxp2VDC | Global Allocation
2503319449 | 200 | 2503319449200 | OS-7210 SAS-Mxp | 1/31/2023
Table 2 is:
Sales Order | SO Item | Concatenation | Material Description | Feas Plan Date
2503319449 | 100 | 2503319449100 | SYS-7210 SAS-Mxp | Bundle Header
2503319449 | 200 | 2503319449200 | OS-7210 SAS-Mxp | 1/31/2023
I want Table 2 to extract the missing "Concatenation" items from Table 1.
It is not clear from the question how the output should be presented. I assume Table 2 is your lookup table. With the sample data the result is the entire Table 1; I assume your real Table 1 has more rows and you want to extract only the rows that match the lookup table. Given the way the concatenation is built, only the SO Item column values are needed for the lookup. Put the following formula in G2:
=LET(tbA, A3:E4, tbB, A9:E12, soA, 1*INDEX(tbA,,2), soB, 1*INDEX(tbB,,2),
DROP(REDUCE("", soA, LAMBDA(ac,x, LET(f,
FILTER(tbB, (soB >= x) * (soB < x+100),""), IF(@f="", ac, VSTACK(ac,f))))),1))
Here is the output:
The condition:
IF(@f="", ac, VSTACK(ac,f))
only prevents an empty FILTER result (f) from being stacked. It is not strictly necessary if you include the parent (condition soB >= x, as in the formula), but you need it if you exclude the parent (soB > x). See my answer to the question "how to transform a table in Excel from vertical to horizontal but with different length" for how to use the DROP/REDUCE/VSTACK pattern. I multiply INDEX by 1 to convert the SO Item values to numbers in case the input data is stored as text; otherwise this is not necessary.
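The range test at the heart of the formula (keep a row when its SO Item lies between a parent value and that value plus 100) may be easier to see outside Excel. Below is a minimal Python sketch of the same test, using the sample rows from the question; the function name children_of_parents and the block_size parameter are made up for the illustration.

```python
def children_of_parents(parent_items, rows, block_size=100):
    """Return rows whose SO Item falls in [parent, parent + block_size)."""
    selected = []
    for parent in parent_items:
        parent = int(parent)  # tolerate text input, like 1*INDEX(...) in the formula
        for row in rows:
            if parent <= int(row["SO Item"]) < parent + block_size:
                selected.append(row)
    return selected


# Sample rows copied from Table 1 in the question (three columns only)
table1 = [
    {"SO Item": "100", "Concatenation": "2503319449100", "Material Description": "SYS-7210 SAS-Mxp"},
    {"SO Item": "101", "Concatenation": "2503319449101", "Material Description": "PS-7210 SAS-T/Mxp"},
    {"SO Item": "102", "Concatenation": "2503319449102", "Material Description": "SYS-7210 SAS-Mxp2VDC"},
    {"SO Item": "200", "Concatenation": "2503319449200", "Material Description": "OS-7210 SAS-Mxp"},
]

# SO Item values of the parents listed in Table 2
parents = [100, 200]

if __name__ == "__main__":
    for row in children_of_parents(parents, table1):
        print(row["Concatenation"], row["Material Description"])
```

Changing the comparison to parent < int(row["SO Item"]) would exclude the parent rows, which corresponds to the soB > x variant mentioned above.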

How to update one grid with information from another grid

I have one grid called Grid 1, I would like to pass the information to another grid called Grid M.
This Grid M may or may not contain previous data, so what I want is to overwrite these previous values and just leave the new data. Please note that both sheets have the same structure when it comes to column name and their formats.
This is my code:
# Grid IDs
grid1 = 6975487445624708
grid2 = 7306936514307972
grid3 = 1060505730213764
gridM = 4175140851345284

# Read Sheets
readSheet_Grid1 = smart.Sheets.get_sheet(grid1)
readColumn_Grid1 = readSheet_Grid1.get_columns().data
readSheet_GridM = smart.Sheets.get_sheet(gridM)
readColumn_GridM = readSheet_GridM.get_columns().data

# Get Column ID from Grid M
columntoRead = []
for column in readColumn_Grid1:
    columntoRead.append(column.id)
print("Column IDs from Grid M: ", columntoRead)

# Get row id from Grid M
rowtoRead_GridM = []
for MyRow_GridM in readSheet_GridM.rows:
    rowtoRead_GridM.append(MyRow_GridM.id)
print("Row IDs from Grid M: ",rowtoRead_GridM)

# Get values from Grid 1
celltoRead_Grid1 = []
celltoRead_GridM = []
for MyRow_Grid1 in readSheet_Grid1.rows:
    for MyCell_Grid1 in MyRow_Grid1.cells:
        celltoRead_Grid1.append(MyCell_Grid1.value)
print("Values from Grid 1: ",celltoRead_Grid1)

# Build new cell value
new_cell = smartsheet.models.Cell()
new_cell.column_id = columntoRead
new_cell.value = celltoRead_Grid1
new_cell.strict = False

# Build the row to update
new_row = smartsheet.models.Row()
new_row.cells.append(new_cell)
print(new_cell)
print(new_row)
This is the output:
Column IDs from Grid M: [7236841595791236, 1607342061578116, 6110941688948612, 8503502613309316, 3999902985938820, 3859141875263364, 8362741502633860, 1044392108156804]
Row IDs from Grid M: [7323028036380548, 1693528502167428, 6197128129537924, 3945328315852676, 8448927943223172]
Values from Grid 1: [3240099.0, 'James', 'Hamilton', 'Male', 197556.0, 18.0, 'Bachelor', 'Medic', 9615534.0, 'Miranda', 'Montgomery', 'Female', 158585.0, 20.0, 'Primary', 'Historian', 9119102.0, 'Vincent', 'Wells', 'Male', 182392.0, 29.0, 'Lower secondary', 'Agronomist', 4533161.0, 'Alen', 'Murray', 'Male', 140853.0, 30.0, 'Doctoral', 'Carpenter', 1010718.0, 'Frederick', 'Farrell', 'Male', 140403.0, 29.0, 'Primary', 'Jeweller']
This is where I start to get lost; the error is below.
ValueError Traceback (most recent call last)
Input In [5], in <cell line: 42>()
40 # Build new cell value
41 new_cell = smartsheet.models.Cell()
---> 42 new_cell.column_id = columntoRead
43 new_cell.value = celltoRead_Grid1
44 new_cell.strict = False
File ~\anaconda3\lib\site-packages\smartsheet\models\cell.py:70, in Cell.__setattr__(self, key, value)
68 self.format_ = value
69 else:
---> 70 super(Cell, self).__setattr__(key, value)
File ~\anaconda3\lib\site-packages\smartsheet\models\cell.py:78, in Cell.column_id(self, value)
76 @column_id.setter
77 def column_id(self, value):
---> 78 self._column_id.value = value
File ~\anaconda3\lib\site-packages\smartsheet\types.py:165, in Number.value(self, value)
163 self._value = value
164 else:
--> 165 raise ValueError("`{0}` invalid type for Number value".format(value))
ValueError: `[7236841595791236, 1607342061578116, 6110941688948612, 8503502613309316, 3999902985938820, 3859141875263364, 8362741502633860, 1044392108156804]` invalid type for Number value
It looks like new_cell.column_id accepts only a single integer, not a list. That makes me wonder: how do I let Smartsheet know that I want to update multiple rows in Grid M using the values from Grid 1?
If I replace the list with a specific Column ID, like in this code, new_cell.column_id = 7236841595791236 this is the output:
{"columnId": 7236841595791236, "strict": false}
{"cells": [{"columnId": 7236841595791236, "strict": false}]}
This is the desired output in Grid M:
ID Name Last Name Gender Salary Age Education Occupation
3240099 James Hamilton Male 197556 18 Bachelor Medic
9615534 Miranda Montgomery Female 158585 20 Primary Historian
9119102 Vincent Wells Male 182392 29 Lower secondary Agronomist
4533161 Alen Murray Male 140853 30 Doctoral Carpenter
1010718 Frederick Farrell Male 140403 29 Primary Jeweller
If I'm understanding your scenario correctly, the following things are true:
The structure of your source sheet and your destination sheet (number of columns, column types, column sequence) is identical.
Your objective is to delete ALL rows from the destination sheet and then copy all rows from the source sheet into the destination sheet.
You want the copied data to remain in the source sheet (i.e., you're copying rows from the source sheet into the destination sheet, not moving rows from the source sheet to the destination sheet).
The following code achieves the objective described above.
# specify source info
source_sheet_id = 5169244485773188

# specify destination info
destination_sheet_id = 2486208480733060

'''
STEP 1:
Get all rows from the source sheet and build list of Row IDs.
'''
sheet = smartsheet_client.Sheets.get_sheet(source_sheet_id)

# iterate through the rows array and build a list of row IDs
source_sheet_row_ids = []
for row in sheet.rows:
    source_sheet_row_ids.append(row.id)

'''
STEP 2:
Get all rows from the destination sheet and build list of Row IDs.
'''
sheet = smartsheet_client.Sheets.get_sheet(destination_sheet_id)

# iterate through the rows array and build a list of row IDs
destination_sheet_row_ids = []
for row in sheet.rows:
    destination_sheet_row_ids.append(row.id)

'''
STEP 3:
Delete ALL rows from the destination sheet (using Row IDs from STEP 2).
'''
response = smartsheet_client.Sheets.delete_rows(destination_sheet_id, destination_sheet_row_ids)

'''
STEP 4:
Copy all rows from the source sheet (using Row IDs from STEP 1) to the destination sheet.
'''
# copy rows from source sheet to (bottom of) destination sheet
# (include everything -- i.e., attachments, children, and discussions)
response = smartsheet_client.Sheets.copy_rows(
    source_sheet_id,
    smartsheet.models.CopyOrMoveRowDirective({
        'row_ids': source_sheet_row_ids,
        'to': smartsheet.models.CopyOrMoveRowDestination({
            'sheet_id': destination_sheet_id
        })
    }),
    'all'
)
It's important to note that this code will delete ALL rows from the destination sheet each time it runs (immediately before it copies all rows from the source sheet into the destination sheet). If you intend for the destination sheet to be the home of data from multiple sheets at some point in the future, then you'll want to modify the code such that it only deletes rows that originated from the specified source sheet. One way to do this would be to:
Add a column to the beginning of the source sheet AND the destination sheet called Source Sheet ID.
In the first row of source sheet, populate this column (cell) with the value of that sheet's ID. In each subsequent row of the source sheet, populate this column (cell) with a formula that pulls the value from that cell in the first row (i.e., =[Source Sheet ID]$1). This will make it so that this cell within any new rows that are added later will automatically be populated with that same value.
You might consider locking this column by using the Smartsheet UI, so it won't be editable (by non-admin users).
Then in the section of code that builds up the list of destination_sheet_row_ids, add some conditional logic to only append the current row ID if the value of the Source Sheet ID column for that row matches your source sheet ID. That way only rows that originated from the specified source sheet will be deleted from the destination sheet -- any rows there that originated from another sheet will remain untouched.
If you choose to implement this approach -- adding the Source Sheet ID column (containing the ID of the source sheet) as the first column in both the source sheet and the destination sheet -- replace STEP 2 from the code sample above with the following code instead.
'''
STEP 2:
Get all rows from the destination sheet and build list of Row IDs.
'''
destination_sheet = smartsheet_client.Sheets.get_sheet(destination_sheet_id)

# iterate through the rows array and build a list of row IDs
destination_sheet_row_ids = []
for row in destination_sheet.rows:
    # only include Row IDs for rows that originated from the specified Source sheet
    if row.cells[0].value == source_sheet_id:
        destination_sheet_row_ids.append(row.id)
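As a side note on the ValueError in the question: it comes from assigning a whole list to a single cell's column_id, whereas each cell carries exactly one column ID and one value. The sketch below shows only the pairing step, using plain dictionaries shaped like the JSON printed in the question rather than SDK objects. The helper name build_row_payloads is hypothetical, and it assumes the flat value list is ordered row by row and that the column IDs belong to the destination sheet. With the SDK, each inner dictionary would correspond to one smartsheet.models.Cell and each outer one to one smartsheet.models.Row.

```python
def build_row_payloads(column_ids, flat_values):
    """Split a flat list of values into rows and pair each value with a column ID."""
    width = len(column_ids)
    if width == 0 or len(flat_values) % width != 0:
        raise ValueError("values do not divide evenly into rows")
    payloads = []
    for start in range(0, len(flat_values), width):
        chunk = flat_values[start:start + width]
        # one cell per column: a single column ID and a single value each
        cells = [
            {"columnId": column_id, "value": value, "strict": False}
            for column_id, value in zip(column_ids, chunk)
        ]
        payloads.append({"cells": cells})
    return payloads


if __name__ == "__main__":
    # first three column IDs and the matching values, taken from the question's output
    column_ids = [7236841595791236, 1607342061578116, 6110941688948612]
    values = [3240099.0, "James", "Hamilton", 9615534.0, "Miranda", "Montgomery"]
    for payload in build_row_payloads(column_ids, values):
        print(payload)
```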

Change number format using headers - openpyxl

I have an Excel file in which I want to convert the number formatting from 'General' to 'Date'. I know how to do so for one column when referring to the column letter:
workbook = openpyxl.load_workbook('path\filename.xlsx')
worksheet = workbook['Sheet1']
for row in range(2, worksheet.max_row+1):
    ws["{}{}".format(ColNames['Report_date'], row)].number_format='yyyy-mm-dd;@'
As you can see, I now use the column letter "D" to point out the column that I want to be formatted differently. Now, I would like to use the header in row 1 called "Start_Date" to refer to this column. I tried a method from the following post to achieve this: select a column by its name - openpyxl. However, that resulted in a KeyError: "Start_Date":
# Create a dictionary of column names
ColNames = {}
Current = 0
for COL in worksheet.iter_cols(1, worksheet.max_column):
    ColNames[COL[0].value] = Current
    Current += 1
for row in range(2, worksheet.max_row+1):
    ws["{}{}".format(ColNames['Start_Date'], row)].number_format='yyyy-mm-dd;@'
EDIT
This method results in the following error:
AttributeError: 'tuple' object has no attribute 'number_format'
Additionally, I have more columns from which the number formatting needs to be changed. I have a list with the names of those columns:
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
Is there a way that I can use the list DateColumns so that I can save some lines of code?
Thanks in advance.
Please note that I posted a similar question earlier. The following post was referred to as an answer Python: Simulating CSV.DictReader with OpenPyXL. However, I don't see how the answers in that post can be adjusted to my needs.
You already know which columns need the new number format, and you have conveniently put them in a list, so why not just use that list?
Get the headers in your sheet, check whether each header is in the DateColumns list, and if so update every entry in that column from row 2 to the last row with the date format you want...
...
DateColumns = ['Start_Date', 'End_Date', 'Birthday']
for COL in worksheet.iter_cols(min_row=1,max_row=1):
    header = COL[0]
    if header.value in DateColumns:
        for row in range(2, worksheet.max_row+1):
            worksheet.cell(row, COL[0].column).number_format='yyyy-mm-dd;@'
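On the errors in the question: the ColNames dictionary there stores 0-based integers rather than column letters, so "{}{}".format(...) builds a string of digits instead of a reference such as "D2", which is presumably why a tuple of cells came back. If letter-based addressing is preferred, the headers can be mapped to letters first. The sketch below does that with the standard library only; column_letter and letters_by_header are names made up for the example, and the header list is hypothetical. openpyxl ships the same conversion as openpyxl.utils.get_column_letter.

```python
def column_letter(index):
    """Convert a 1-based column index to an Excel column letter (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def letters_by_header(headers):
    """Map each header in row 1 to the letter of the column it sits in."""
    return {header: column_letter(i) for i, header in enumerate(headers, start=1)}


if __name__ == "__main__":
    # hypothetical header row; in a real sheet these would be read from row 1
    headers = ["ID", "Name", "Birthday", "Start_Date", "End_Date"]
    lookup = letters_by_header(headers)
    for name in ["Start_Date", "End_Date", "Birthday"]:
        print(name, "->", "{}{}".format(lookup[name], 2))
```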

Filling randomly form fields from a list

Is there any way in Python to fill in the fields of a form (first name, last name) by picking them at random from 2 pre-filled lists (a name list and a surname list)? And, without taking data from a list, to randomise the date of birth automatically while keeping it within a certain age range (e.g. 19 to 32 years)?
Thanks in advance for the help.
Try using the choice function from the random module.
from random import choice
names = ['a','b','c']
random_name = choice(names) # use this for getting random names and surnames
For selecting a random date from a range, you can look at this post: Generate a random date between two other dates
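Putting both parts together, here is a minimal sketch using only the standard library. It assumes "19 to 32 years" means the person's age today should fall in that range inclusive; the function names and the sample name lists are made up for the illustration.

```python
import random
from datetime import date, timedelta


def shift_years(day, years):
    """Return the same calendar day a number of years away."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February
        return day.replace(year=day.year + years, day=28)


def random_birth_date(min_age, max_age, today=None, rng=random):
    """Pick a birth date so that the age on `today` is between min_age and max_age."""
    today = today or date.today()
    latest = shift_years(today, -min_age)  # turns min_age today
    earliest = shift_years(today, -(max_age + 1)) + timedelta(days=1)  # still max_age
    return earliest + timedelta(days=rng.randint(0, (latest - earliest).days))


def age_on(birth, today):
    """Age in whole years on the given day."""
    return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))


def random_person(first_names, last_names, min_age=19, max_age=32, today=None, rng=random):
    """Build one set of form values from the two lists plus a random birth date."""
    return {
        "first_name": rng.choice(first_names),
        "last_name": rng.choice(last_names),
        "birth_date": random_birth_date(min_age, max_age, today, rng),
    }


if __name__ == "__main__":
    # hypothetical lists; replace with your own pre-filled lists
    print(random_person(["Anna", "Marco", "Luca"], ["Rossi", "Bianchi"]))
```

The returned dictionary holds the values only; how they are typed into the form depends on the tool used to drive it, which the question does not specify.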

Take odd one out from a list when comparing with a filtered column using python

I have an excel as shown below:
Input File
Now I want to filter fruits first from the "Items" column and check which entry in the "list" column is not present in the "Name" column. For example, here "grapes" is not present in the "Name" column, so I want grapes as the output in the next column, as shown below.
Expected Output Shown
The same is to be done for every item, filtering one by one, as I have many items.
Please suggest an approach or give some hints so that I can start this code.
I am naming the excel as Book1
import pandas as pd

frame = pd.read_excel("Book1.xlsx")
# split each comma-separated "list" cell into a Python list
frame_list_as_String = frame["list"].tolist()
frame_list = [x.split(',') for x in frame_list_as_String]
frame_Name = frame["Name"].tolist()
frame_col3 = []
for item in frame_list:
    # entries of this row's list that never appear in the Name column
    frame_col3.append(list(set(item) - set(frame_Name)))
frame["col3"] = frame_col3
frame.to_excel("df.xlsx", index=False)
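The code above compares each list against the whole "Name" column and does not filter by "Items". A minimal sketch of the per-item version follows, written with plain dictionaries so the logic is easy to follow. The sample rows are hypothetical (the original sheet is only shown as an image), and it assumes every row has a comma-separated string in the "list" column.

```python
def missing_per_row(rows):
    """For each row, list the entries of "list" absent from the Names of the same Items group."""
    # collect the names that exist for each item group, e.g. all fruits
    names_by_item = {}
    for row in rows:
        names_by_item.setdefault(row["Items"], set()).add(row["Name"])

    result = []
    for row in rows:
        wanted = [part.strip() for part in row["list"].split(",") if part.strip()]
        present = names_by_item[row["Items"]]
        # keep the original order of the list column
        result.append([name for name in wanted if name not in present])
    return result


# hypothetical sample data standing in for the sheet in the question
sample_rows = [
    {"Items": "fruits", "Name": "apple", "list": "apple,banana,grapes"},
    {"Items": "fruits", "Name": "banana", "list": "apple,banana,grapes"},
    {"Items": "vegetables", "Name": "carrot", "list": "carrot, onion"},
]

if __name__ == "__main__":
    for row, missing in zip(sample_rows, missing_per_row(sample_rows)):
        print(row["Items"], row["Name"], missing)
```

With pandas, the rows could be obtained from the frame via frame.to_dict("records") and the result assigned back as a new column.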
