Eland loading pandas dataframe to elasticsearch changes date - python-3.x

Greetings Stackoverflowers
I have been using eland to insert a pandas dataframe as an Elasticsearch index. The code I use is shown below and is closely based on the example in the linked URL:
import eland as ed

def save_to_elastic(data_df, elastic_engine, index_name, type_overrides_dict, chunk_size):
    """
    es_type_overrides example:
        {
            "fechaRegistro": "date",
            "fechaIncidente": "date"
        }
    """
    df = ed.pandas_to_eland(
        pd_df=data_df,
        es_client=elastic_engine,
        # Where the data will live in Elasticsearch
        es_dest_index=index_name,
        # Type overrides for certain columns; the default is keyword.
        # Here the date columns are mapped to the Elasticsearch "date" type.
        es_type_overrides=type_overrides_dict,
        # If the index already exists, replace it
        es_if_exists="replace",
        # Wait for data to be indexed before returning
        es_refresh=True,
        chunksize=chunk_size
    )
    return df
I then call it to insert the pandas dataframe into Elasticsearch as follows:
from snippets.elastic_utils import save_to_elastic, conect2elastic

es = conect2elastic(user='falconiel')
save_to_elastic(data_df=siaf_consumados_elk,
                type_overrides_dict={'fechaRegistro': "date",
                                     'fechaIncidente': "date"},
                elastic_engine=es,
                index_name='siaf_18032021_pc',
                chunk_size=1000)
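conect2elastic is my own helper and is not shown in the question; a minimal stand-in using the official Python client might look like the following (the host, port, and auth scheme are assumptions, not part of my actual module):

from elasticsearch import Elasticsearch
from getpass import getpass

def conect2elastic(user):
    """Hypothetical helper: connect to a local cluster with basic auth."""
    password = getpass(f"Password for {user}: ")
    # Host and auth style are assumptions; adjust to your own cluster setup.
    return Elasticsearch(hosts=["http://localhost:9200"],
                         http_auth=(user, password))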
Everything works fine, but once the document is in Elasticsearch, 26 dates turn out to have been inserted incorrectly. All my data starts on January 1, 2015, yet Elasticsearch shows some documents dated December 31, 2014. I haven't been able to find an explanation for why some rows whose date fields are correct in the pandas dataframe (starting at 2015-01-01) were shifted to the last day of December of the previous year during loading. I would appreciate any help or insight to correct this behavior.
The datetime columns in the pandas dataframe are typed as datetime. I have tried applying the following conversions before calling the function that saves the dataframe to Elasticsearch, but so far without success:
siaf_consumados_elk.fechaRegistro = pd.to_datetime(siaf_consumados_elk.fechaRegistro).dt.tz_localize(None)
siaf_consumados_elk.fechaRegistro = pd.to_datetime(siaf_consumados_elk.fechaRegistro, utc=True)

The problem was indeed UTC. I checked some of the rows in the pandas dataframe and their dates had been shifted back by almost a day: for instance, a record registered on 2021-01-02 (GMT-5) appeared as 2021-01-01. The solution was to apply the corresponding time zone before calling the function that saves the dataframe as an Elasticsearch document/index. So, following the good observation given by Mark Walkom, this is what I used before calling the function:
siaf_consumados_elk.fechaRegistro = siaf_consumados_elk.fechaRegistro.dt.tz_localize(tz='America/Guayaquil')
siaf_consumados_elk.fechaIncidente = siaf_consumados_elk.fechaIncidente.dt.tz_localize(tz='America/Guayaquil')
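To illustrate why the naive timestamps were shifted, a small check like the following (illustrative only, not part of the loading code) shows the difference between a naive local timestamp and its tz-aware equivalent:

import pandas as pd

ts = pd.Timestamp("2021-01-02 00:00:00")            # naive, really GMT-5 local time
localized = ts.tz_localize("America/Guayaquil")     # attach the real offset

print(localized)                    # 2021-01-02 00:00:00-05:00
print(localized.tz_convert("UTC"))  # 2021-01-02 05:00:00+00:00, same instant

# Without the offset, the naive value is taken as UTC, so converting it back
# to GMT-5 lands on the previous day (2021-01-01 19:00 local time).
print(ts.tz_localize("UTC").tz_convert("America/Guayaquil"))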
A list of the corresponding time zone names can be found at: python time zones
This allowed the dates to be indexed correctly.

Related

A DATETIME column in Synapse tables is loading date values that are a few hours into the past compared to the incoming value

I have a datetime column in Synapse called "load_day" which is being loaded through a PySpark dataframe (parquet). At runtime, the code adds a new column to the dataframe with an incoming date ('timestamp') in the format yyyy-mm-dd hh:mm:ss.
df = df.select(lit(incoming_date).alias("load_day"), "*")
Later we are writing this dataframe into a synapse table using a df.write command.
But what's strange is that every date value that is going into this load_day column is being written as a value that is a few hours into the past. This is happening with all the synapse tables in my database for all the new loads that I'm doing. To my knowledge, nothing in the code has changed from before.
Eg: If my incoming date is "2022-02-19 00:00:00" it's being written as 2022-02-18 22:00:00.000 instead of 2022-02-19 00:00:00.000. The hours part in the date is also not stable; sometimes it writes as 22:00:00.000 and sometimes 23:00:00.000
I debugged the code but the output of the variable looks totally fine. It just shows the value as 2022-02-19 00:00:00 as expected but the moment the data is getting ingested into the Synapse table, it goes back a couple of hours.
I'm not understanding why this might be happening or what to look for during debugging.
Did any of you face something like this before? Any ideas on how I can approach this to find out what's causing the erroneous dates?
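One thing worth checking (an assumption on my part; the question doesn't confirm the cause) is the Spark session time zone, since Spark interprets naive timestamp strings in that zone when writing them out:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, to_timestamp

spark = SparkSession.builder.getOrCreate()

# A mismatch between this setting and the zone you expect can shift the hours.
print(spark.conf.get("spark.sql.session.timeZone"))

# Pinning the session time zone before the write makes the interpretation
# explicit instead of depending on the cluster default.
spark.conf.set("spark.sql.session.timeZone", "UTC")

incoming_date = "2022-02-19 00:00:00"  # same shape as in the question
df = spark.createDataFrame([(1,)], ["id"]) \
          .select(to_timestamp(lit(incoming_date)).alias("load_day"), "*")
df.show(truncate=False)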

Python. Best way to filter array by date

I have a list of Rest objects. It's a Django model:
class Rest(models.Model):
    product = models.ForeignKey('Product', models.DO_NOTHING)
    date = models.DateTimeField()
    rest = models.FloatField()
I want to select objects from it for today's date. I do it like this. Maybe there is some more convenient and compact way?
for rest in rests_list:
    if rest.date.day == datetime.now().day:
        result.append(rest)
First - datetime.now().day will get you the day of the month (e.g. 18 if today is 18th March 2020), but you've said you want today's date. The below is on the assumption you want today's date, not the day of the month.
(NOTE: weAreStarDust pointed out that the field is a DateTimeField, not a DateField, which I missed and have updated the answer for)
The way you are doing it right now seems like it might be fetching all of the Rests from the database and then filtering them in your application code (assuming rests_list comes from something like Rest.objects.all()). Generally, you want to do as much filtering as possible in the database query itself, and as little filtering as possible in the client code.
In this case, what you probably want to do is:
from datetime import date
Rest.objects.filter(date__date=date.today())
That will bring back only the records that have a date of today, which are the ones you want.
If you already have all the rests somehow, and you just want to filter to the ones from today then you can use:
filter(lambda x: x.date.date() == date.today(), rests_list)
That will return a filter object containing only the items in rests_list that have date == date.today(). If you want it as a list, rather than an iterable, you can do:
list(filter(lambda x: x.date.date() == date.today(), rests_list))
or for a tuple:
tuple(filter(lambda x: x.date.date() == date.today(), rests_list))
NOTE:
If you actually want to be storing only a date, I would suggest considering use of a DateField (although not if you want to store any timezone information).
If you want to store a DateTime, I would consider renaming the field from date to datetime, or started_at - calling the field date but having a datetime can be a bit confusing and lead to errors.
As the docs say:
For datetime fields, casts the value as date. Allows chaining additional field lookups. Takes a date value.
from datetime import datetime
Rest.objects.filter(date__date=datetime.now().date())
You can use a Django queryset filter to get only today's data from your model. There is no need to fetch all the data first and then loop over it. You can write the query like this:
import datetime
Rest.objects.filter(date__date = datetime.date.today())
But make sure the time zone is the same for the database server and the web server.
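If USE_TZ is enabled in the project, a sketch along these lines filters against the project's local date rather than whatever the server clock happens to be (this is my own suggestion, not part of the answers above):

from django.utils import timezone

# localdate() converts "now" to the active time zone before taking the date,
# so the comparison matches the local calendar day rather than UTC.
Rest.objects.filter(date__date=timezone.localdate())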

Is there a way to get the oldest date of a Pandas dataframe when not all columns are dates?

Just to give a little background, I'm trying to loop through the dataframe and query each Asset by it's oldest date. Then I can trim the data with Pandas locally for each of the items that make the asset.
The dataframe that I'm trying to loop through looks something like this:
I've tried using query_date = df.min(axis=1) but it picks up the values rather than the dates.
query_date will be the start date for each query that will be inside a for loop.
Thanks in advance for the help
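For reference, one way to take the minimum over only the datetime columns, so the numeric values don't interfere (a minimal sketch; the column names are made up, since the dataframe itself isn't shown in the question):

import pandas as pd

# Hypothetical frame standing in for the one in the question.
df = pd.DataFrame({
    "Asset": ["A", "B"],
    "Value": [10.5, 3.2],
    "Start": pd.to_datetime(["2016-05-01", "2015-01-01"]),
    "End": pd.to_datetime(["2020-01-01", "2019-06-30"]),
})

# Restrict the minimum to datetime columns so other values are ignored.
date_cols = df.select_dtypes(include="datetime64[ns]")
query_date = date_cols.min(axis=1)      # oldest date per row
oldest_overall = date_cols.min().min()  # oldest date in the whole frame
print(query_date)
print(oldest_overall)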

How to handle dates in cx_oracle using python?

I'm trying to access an Oracle table using the cx_Oracle module and convert it to a dataframe. Everything is fine except that a couple of date columns have a format like "01-JAN-01"; Python treats it as datetime.datetime(1, 1, 1, 0, 0), and after creating the dataframe it shows as 0001-01-01 00:00:00. I am expecting the output to be 2001-01-01 00:00:00. Please help me with this. Thanks in advance.
You have a couple of choices. You could
* Retrieve it from the Oracle database with [read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html), specifying the date in a format (via TO_CHAR) more appropriate for the default date format of pandas
* Retrieve it from the database as a string (as above) and then convert it into a date in the pandas framework, as in the sketch after this list.
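A minimal sketch combining both ideas; the connection string, table, and column names are placeholders, not taken from the question:

import cx_Oracle
import pandas as pd

# Placeholder connection details
connection = cx_Oracle.connect("user/password@host:1521/service")

# Have Oracle emit the date as an explicit string (TO_CHAR), then parse it
# on the pandas side so the century is not lost.
df = pd.read_sql(
    "SELECT TO_CHAR(my_date, 'YYYY-MM-DD HH24:MI:SS') AS my_date FROM my_table",
    connection,
)
df["my_date"] = pd.to_datetime(df["my_date"], format="%Y-%m-%d %H:%M:%S")
print(df.dtypes)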

Feature extraction from data stored in PostgreSQL database

I have some data stored in a PostgreSQL database, which contains fields like cost, start date, end date, country, etc. Please take a look at the data here.
Now what I want to do is extract some of the important features/fields from this data and store them in a separate CSV file or pandas data frame so I can use the extracted data for analysis.
Is there any python script to do this task? Please let me know. Thanks.
First, load your PostgreSQL table into a dataframe, which can be done like this:
import psycopg2 as pg
import pandas as pd

# Get connected to the database
connection = pg.connect("dbname=mydatabase user=postgres")
dataframe = pd.read_sql_query("SELECT * FROM <tablename>", connection)
as explained here: https://gist.github.com/00krishna/9026574.
After that, you can select specific columns from the pandas dataframe:
df1 = dataframe[['projectfinancialtype', 'regionname']]
# select any number of feature columns available in your dataframe;
# here only 2 fields from your data are used
To write these feature columns to a CSV file:
df1.to_csv("pathofoutput.csv", columns=['projectfinancialtype', 'regionname'], index=False)
# this creates a CSV containing your feature columns
Hope this helps.
