Is there a way to write to a Kiba CSV destination line by line or in batches instead of all at once? - kiba-etl

Kiba is really cool!
I'm trying to set up an ETL process in my Rails app where I'll dump a large amount of data from my SQL DB to a CSV file. If I were to implement this myself, I'd use something like find_each to load, say, 1000 records at a time and write them to the file in batches. Is there a way to do this using Kiba? From my understanding, by default all of the rows from the Source get passed to the Destination, which wouldn't be feasible for my situation.

Glad you like Kiba!
I'm going to make you happy by stating that your understanding is incorrect.
The rows are yielded & processed one by one in Kiba.
To see how things work exactly, I suggest you try out this code:
class MySource
  def initialize(enumerable)
    @enumerable = enumerable
  end

  def each
    @enumerable.each do |item|
      puts "Source is reading #{item}"
      yield item
    end
  end
end
class MyDestination
  def write(row)
    puts "Destination is writing #{row}"
  end
end
source MySource, (1..10)
destination MyDestination
Run this and you'll see that each item is read then written.
Now to your actual concrete case - the above means that you can implement your source this way:
class ActiveRecordSource
  def initialize(model:)
    @model = model
  end

  def each
    @model.find_each do |record|
      yield record
    end
  end
end
then you can use it like this:
source ActiveRecordSource, model: Person.where("age > 21")
(You could also leverage find_in_batches if you wanted each row to be an array of multiple records, but that's probably not what you need here).
Hope this properly answers your question!

Related

How to keep the share properties of an excel with python openpyxl?

I have trouble trying to keep the sharing properties of an Excel file. I tried the approach from "Python and openpyxl is saving my shared workbook as unshared", but the part with zout just cancels all the modifications I made with the script.
To explain the problem:
There's an Excel file that is shared, in which people can make modifications
Python reads and writes on it
When I save the workbook to the Excel file, it either automatically drops the sharing property, or, when I try to keep it, it just doesn't apply any of my modifications
Can someone help me please?
I'll get a little more precise, as requested.
The sharing mode is the one Microsoft provides (the Share button in Excel).
The Excel file is stored on a server. Several users can write to it at the same time, but when I launch my script, it automatically turns off the sharing property, so everyone who is writing to it can no longer make modifications, and every modification they made is lost.
First I treated my Excel file normally:
import openpyxl

DLT = openpyxl.load_workbook(myPath)
ws = DLT['DLT']
# ...my modifications on ws...
DLT.save(myPath)  # save() requires the target path
DLT.close()
But then I tried this (from "Python and openpyxl is saving my shared workbook as unshared"):
import zipfile
import openpyxl

DLT = openpyxl.load_workbook(myPath)
ws = DLT['DLT']

# Buffer every part of the original archive before it gets overwritten
zin = zipfile.ZipFile(myPath, 'r')
buffers = []
for item in zin.infolist():
    buffers.append((item, zin.read(item.filename)))
zin.close()

# ...my modifications on ws...

DLT.save(myPath)

# Write the buffered parts back on top of the saved file
zout = zipfile.ZipFile(myPath, 'w')
for item, buffer in buffers:
    zout.writestr(item, buffer)
zout.close()
DLT.close()
The second one just doesn't save my modifications on ws.
The thing I would like to do is not to get rid of the sharing property. I would need to keep it while I write to the file. I'm not sure if that is possible. I have one alternative solution, which is to use another file and just copy/paste the new data by hand from that file into the DLT one.
Well... after playing with it back and forth, for some weird reason zipfile.infolist() does contain the sheet data as well, so here's my way to fine-tune it, using the shared_pyxl_save example the previous gentleman provided.
Basically, instead of letting the old file override the sheet's data, keep the newly saved sheet data and restore everything else from the old file:
import zipfile
import openpyxl


def shared_pyxl_save(file_path, workbook):
    """
    `file_path`: path to the shared file you want to save
    `workbook`: the object returned by openpyxl.load_workbook()
    """
    # Buffer everything except the sheet data, which the save below rewrites
    zin = zipfile.ZipFile(file_path, 'r')
    buffers = []
    for item in zin.infolist():
        if "sheet1.xml" not in item.filename:
            buffers.append((item, zin.read(item.filename)))
    zin.close()

    workbook.save(file_path)

    # Loop through again to find sheet1.xml and put it into the buffer,
    # otherwise an error will show up
    zin2 = zipfile.ZipFile(file_path, 'r')
    for item in zin2.infolist():
        if "sheet1.xml" in item.filename:
            buffers.append((item, zin2.read(item.filename)))
    zin2.close()

    # Finally, save the file
    zout = zipfile.ZipFile(file_path, 'w')
    for item, buffer in buffers:
        zout.writestr(item, buffer)
    zout.close()
    workbook.close()
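For reference, a minimal usage sketch; the file name and sheet name here are assumptions mirroring the question:
import openpyxl

wb = openpyxl.load_workbook("DLT.xlsx")  # hypothetical path
ws = wb["DLT"]
ws["A1"] = "updated value"
# Save through the helper so the shared-workbook metadata is preserved
shared_pyxl_save("DLT.xlsx", wb)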

Creating SequenceTaggingDataset from list, not file

I would like to create a SequenceTaggingDataset from two lists that I have created dynamically inside my code - train_sentences and train_tags. I would want to write something like this:
train_data = SequenceTaggingDataset(examples=(zip(train_sentences, train_tags)))
However, the constructor must receive a path. And not only that - it looks from the code as though, even if I were to provide the examples, it will override those, and initialize examples to be an empty list.
For various reasons, I do not want to save the lists I created in a file from which the SequenceTaggingDataset could read. Is there any way around this, save defining my own custom class?
You will need to modify the source code for it (https://pytorch.org/text/_modules/torchtext/datasets/sequence_tagging.html#SequenceTaggingDataset). You can make a local copy and import it as your own module.
path is used in __init__. The important part is that it takes lines from the file and splits them using the given separator into a list named columns. This columns list is then fed, together with fields, into another class method to construct the examples list. Please read the example provided there to understand fields (note that UDPOS is called there to create a SequenceTaggingDataset).
What you need is columns, which you don't need to read from a file, as you already have all the components. You can feed it directly by simplifying the class's __init__:
def __init__(self, columns, fields, encoding="utf-8", separator="\t", **kwargs):
    examples = []
    examples.append(data.Example.fromlist(columns, fields))
    super(SequenceTaggingDataset, self).__init__(examples, fields, **kwargs)
columns is a nested list of lists: [[word], [UD_TAG], [PTB_TAG]]. This means you need to feed the following into the modified class:
train = SequenceTaggingDataset([train_sentences, train_tags], fields=...)
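Putting this together, here is a minimal sketch of such a subclass, assuming torchtext's legacy API (data.Example, data.Field); the class name and field names are illustrative assumptions:
from torchtext import data
from torchtext.datasets import SequenceTaggingDataset


class InMemorySequenceTaggingDataset(SequenceTaggingDataset):
    # Builds examples from in-memory columns instead of reading a file path.
    def __init__(self, columns, fields, **kwargs):
        examples = [data.Example.fromlist(columns, fields)]
        # Skip SequenceTaggingDataset.__init__ (which reads from a path)
        # and call data.Dataset.__init__ directly.
        super(SequenceTaggingDataset, self).__init__(examples, fields, **kwargs)


# Hypothetical usage with two columns: sentences and tags
WORD = data.Field()
TAG = data.Field()
train = InMemorySequenceTaggingDataset(
    [train_sentences, train_tags],
    fields=[("word", WORD), ("tag", TAG)],
)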

How to write this type of data to csv

I'm a noobie in Python. I want to get some data from a CSV file with pandas and afterwards write a new CSV file with extra data, in this format:
"type";"currency";"amount";"comment"
"type1";"currency1";"amount1";"comment1"
etc
import pandas as pd
import csv

req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
type = "type"
comment = "2week"
i = 0
while i < 3:
    Currency = req['Currency'].values[i]
    ReqAmount = req['Request'].values[i]
    r = round(ReqAmount, -1)
    i += 1
    data = [type, Currency, r, comment]
    #print(data)
csv_file = open('data2.csv', 'w')
with csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(data)
    print("DONE")
    writer.writerows(data)
_csv.Error: iterable expected, not numpy.float64
I have multiple things to criticize here. I hope it doesn't come across as mean, but rather educational. While your code would work regardless of these points, it is good coding style to follow them.
Variables should not start with a capital letter. Currency and ReqAmount should be currency and reqAmount.
Don't use built-in names for variables. type is a Python built-in.
Make sure your formatting doesn't get destroyed when posting here, especially for Python, which relies on indentation. Read here for more information: https://stackoverflow.com/editing-help#code
That said, let me try to go through your code and give you tips and tricks:
Don't run code in Python directly. Always use a main() method. It's just better coding style.
When looping, don't use the i=0; while i<3; i+=1 construct; rather, use for i in range(3). While the former works, it is not very pythonic and a lot harder to read.
Never assume anything in your code that cannot be guaranteed. In this case, you assume that the CSV file has at least 3 lines, otherwise your program would crash. Instead, read the number of lines from the CSV file with len(req).
data = [type, Currency, r, comment] keeps overwriting your data variable. You could either append to data and then write everything to the output file at the end, or directly write to the output file in every iteration.
Don't use open to create a variable (except when absolutely necessary). Instead, use open in a with statement. This ensures that the file gets closed properly. I have seen that you do use the with statement; nonetheless, you usually use it like with open(...) as variable_name:.
CSV files usually start with the column names. Therefore, you should write the column names before you write the data.
I won't fix this, because it would change the appearance of the program completely, but normally, don't mix different libraries. If you use pandas for CSV reading, also use it for writing; if you use the csv library for writing, also use it for reading. While it isn't wrong to mix them, it is bad style and creates more dependencies than necessary (see the pandas-only sketch after the fixed code below).
I don't really understand what your code is supposed to do, so I'll just guess and hope it goes in the right direction.
When fixing all those points, you might end up with something like this:
import pandas as pd
import csv


def main():
    req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
    transferType = "type"
    comment = "2week"
    # newline='' prevents blank lines between rows on some platforms
    with open('data2.csv', 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["type", "currency", "amount", "comment"])
        for i in range(len(req)):
            currency = req['Currency'].values[i]
            reqAmount = req['Request'].values[i]
            r = round(reqAmount, -1)
            data = [transferType, currency, r, comment]
            #print(data)
            writer.writerow(data)
    print("DONE")


# Whenever you run a program, __name__ will be set to '__main__' in the
# initial script. This makes it easier later when you work with multiple
# code files.
if __name__ == '__main__':
    main()
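Following the point above about not mixing libraries, here is a rough pandas-only sketch of the same program; the column names and the semicolon-separated, fully quoted output format are taken from the question, so treat this as an illustration rather than a drop-in replacement:
import csv

import pandas as pd


def main():
    req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
    out = pd.DataFrame({
        "type": "type",                      # constant column
        "currency": req["Currency"],
        "amount": req["Request"].round(-1),  # round to the nearest ten
        "comment": "2week",                  # constant column
    })
    # Semicolon separator and quotes around every field, as in the
    # format shown in the question.
    out.to_csv("data2.csv", sep=";", index=False, quoting=csv.QUOTE_ALL)


if __name__ == '__main__':
    main()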

Why is the same function in python-chess returning different results?

I'm new to working with python-chess and I was perusing the official documentation. I noticed this very weird thing I just can't make sense of. This is from the documentation:
import chess.pgn
pgn = open("data/pgn/kasparov-deep-blue-1997.pgn")
first_game = chess.pgn.read_game(pgn)
second_game = chess.pgn.read_game(pgn)
So as you can see, the exact same function, chess.pgn.read_game(), results in two different games showing up. I tried with my own PGN file and sure enough first_game == second_game resulted in False. I also tried third_game = chess.pgn.read_game(pgn) and sure enough that gave me the (presumably) third game from the PGN file. How is this possible? If I'm using the same function, shouldn't it return the same result every time for the same file? Why should the variable name matter (I'm assuming it does), unless programming languages changed overnight or there's a random function built in somewhere?
The only way that this can be possible is if some data is changing. This could be data that chess.pgn.read_game reads from elsewhere, or could be something to do with the object you're passing in.
In Python, file-like objects store where they are in the file. If they didn't, then this code:
with open("/home/wizzwizz4/Documents/TOPSECRET/diary.txt") as f:
    line = f.readline()
    while line:
        print(line, end="")
        line = f.readline()
would just print the first line over and over again. When data's read from a file, Python won't give you that data again unless you specifically ask for it.
There are multiple games in this file, stored one after another. You're passing in the same file object each time, but you're not resetting the read cursor to the beginning of the file (f.seek(0)) or closing and reopening the file, so it's going to read the next data available – i.e., the next game.
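A minimal sketch of the rewind approach, reusing the file name from the documentation snippet above:
import chess.pgn

with open("data/pgn/kasparov-deep-blue-1997.pgn") as pgn:
    first_game = chess.pgn.read_game(pgn)        # reads game 1; the cursor advances
    second_game = chess.pgn.read_game(pgn)       # reads game 2
    pgn.seek(0)                                  # rewind to the start of the file
    first_game_again = chess.pgn.read_game(pgn)  # reads game 1 again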

How to use a conditional statement while scraping?

I'm trying to scrape the MTA website and need a little help scraping the "Train Lines Row". (Website for reference: https://advisory.mtanyct.info/EEoutage/EEOutageReport.aspx?StationID=All)
The train line information is stored as image files (1 line subway, A line subway, etc.) describing each line that's accessible through a particular station. I've had success scraping info out of rows in which only one train passes through, but I'm having difficulty figuring out how to iterate through the columns that have multiple trains passing through them, using a conditional statement to test whether a row has one line or multiple lines.
tableElements = table.find_elements_by_tag_name('tr')
That's the table I'm iterating through.
tableElements[2].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt')
This successfully gives me the value if only one value exists in the particular column.
tableElements[8].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
This successfully gives me a list of values I can iterate through to extract the values I need.
Now I try to combine these lines of code in a for loop to extract all the information without stopping.
for info in tableElements[1:]:
    if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
        for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
            print(images.get_attribute('alt'))
    else:
        print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
I'm getting the error message "list index out of range". I don't know why, as every iteration done in isolation seems to work. My hunch is that I haven't used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1], that would mean there are multiple image texts for me to iterate through. Hence why I want to use this boolean operation.
Hi all, thanks so much for your help. I've uploaded my full code to GitHub and attached a link for your reference: https://github.com/tsp2123/MTA-Scraping/blob/master/MTA.ElevatorData.ipynb
The end goal is to put this information into a dataframe, using a for loop of some formulation that will extract the image information that I want.
dataframe = []
for elements in tableElements:
    row = {}
    columnName1 = find_element_by_class_name('td')
    ..
Your logic isn't off here.
"My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through."
The problem is it can't check if the statement is True if there's nothing in index position [1]. Hence the error at this point.
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
What you want to do is use try/except. So, something like:
for info in tableElements[1:]:
    try:
        images = (info.find_elements_by_tag_name('td')[1]
                      .find_element_by_tag_name('h4')
                      .find_elements_by_tag_name('img'))
        # Note: comparing a WebElement to True is always False, so test the
        # list length instead of `imgs[1] == True`.
        if len(images) > 1:
            for image in images:
                print(image.get_attribute('alt'))
        else:
            print(images[0].get_attribute('alt'))
    except IndexError:
        # do something else
        print('Nothing found in index position.')
Is it also possible to go back to your question and provide the full code? When I try this, I'm getting 11 table elements, so I want to test it with the specific table you're trying to scrape.
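For the dataframe end goal sketched earlier, a rough sketch along these lines might work once the alt-text extraction is settled (the column name and the choice to join multiple lines into one cell are assumptions):
import pandas as pd

rows = []
for info in tableElements[1:]:
    try:
        images = (info.find_elements_by_tag_name('td')[1]
                      .find_element_by_tag_name('h4')
                      .find_elements_by_tag_name('img'))
        # Join every line's alt text for this station into one cell
        rows.append({"train_lines": ", ".join(img.get_attribute('alt') for img in images)})
    except IndexError:
        rows.append({"train_lines": None})

dataframe = pd.DataFrame(rows)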
