Feature extraction from data stored in PostgreSQL database - python-3.x

I have some data stored in a PostgreSQL database, which contains fields like cost, start date, end date, country, etc. Please take a look at the data here.
What I want to do is extract some of the important features/fields from this data and store them in a separate CSV file or pandas dataframe so I can use the extracted data for analysis.
Is there any Python script to do this task? Please let me know. Thanks.

First, load your PostgreSQL table data into a pandas dataframe, which can be done like this:
import psycopg2 as pg
import pandas as pd
# get connected to the database and load the table into a dataframe
connection = pg.connect("dbname=mydatabase user=postgres")
dataframe = pd.read_sql_query("SELECT * FROM <tablename>", connection)
A similar example is shown here: https://gist.github.com/00krishna/9026574 .
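On recent pandas versions, read_sql_query also accepts a SQLAlchemy engine, which is the connection type the pandas docs recommend for databases other than SQLite. A minimal sketch of that variant, with placeholder credentials and the same placeholder table name:
import pandas as pd
from sqlalchemy import create_engine
# user, password, host, port and database name below are placeholders
engine = create_engine("postgresql+psycopg2://postgres:password@localhost:5432/mydatabase")
dataframe = pd.read_sql_query("SELECT * FROM <tablename>", engine)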
After that, you can select specific columns from the pandas dataframe. This can be done like this:
df1 = dataframe[['projectfinancialtype','regionname']]
# you can select as many feature columns as are available in your dataframe; only two fields from your data are used here
To write these feature columns to a CSV file, you can use code like this:
df1.to_csv("pathofoutput.csv", columns=['projectfinancialtype','regionname'], index=False)
# this creates a CSV containing only your feature columns
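As an alternative sketch (not from the original answer), you can also push the column selection into the SQL query itself, so only the feature columns are pulled from PostgreSQL; the table name is the same placeholder as above:
# select only the feature columns at the database level
query = "SELECT projectfinancialtype, regionname FROM <tablename>"
df1 = pd.read_sql_query(query, connection)
df1.to_csv("pathofoutput.csv", index=False)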
Hope this helps.

Related

How to fix NaN values when concatenating multiple tables in Jupyter Notebook

I am using Jupyter Notebook. I have concatenated multiple tables. When I run the head() command I cannot see the values in the age and gender columns of the table; instead it shows NaN against each user_id.
The following image_1 shows the output when I concatenated the two different tables.
How can I sort out this issue, or is there another way to concatenate the tables so that I can see all of the table values?
Or do I need to access the tables separately and apply operations on each one?
I am expecting to get the values in the age and gender columns rather than NaN values.
When I use these tables separately, they show the correct results, but I have a lot of data, so I need to combine the tables to access each of the feature columns. In the end, I can apply operations on the combined table's features.
I have been testing your problem with the two CSVs from your GitHub.
First of all, I loaded the two tables as 'df1' and 'df2' after importing the pandas library.
import pandas as pd
df1 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_1.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_2.csv')
Then, using the pandas library, you can merge both dataframes on the connecting column, in this case 'user_id'.
final_df = pd.merge(df1, df2, on='user_id')
Finally, we have 'final_df' with all the information from both tables and without NaNs.
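For context on why the original concatenation showed NaN: pd.concat stacks the frames along the row axis by default, so rows coming from one file get NaN in the columns that only exist in the other file, while merge aligns the two tables on the shared key. A small illustrative sketch of the difference:
# stacking rows: columns present in only one frame are padded with NaN
stacked = pd.concat([df1, df2])
# aligning on the shared key: one row per user_id, no NaN padding
# (assuming every user_id appears in both files)
final_df = pd.merge(df1, df2, on='user_id')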

Eland loading pandas dataframe to elasticsearch changes date

Greetings Stackoverflowers
I have been using eland to insert a pandas dataframe as an Elasticsearch document. The code used to make this happen is shown below and is strongly based on the one in the url:
import eland as ed

def save_to_elastic(data_df, elastic_engine, index_name, type_overrides_dict, chunk_size):
    """
    Example of type_overrides_dict:
    es_type_overrides={
        "fechaRegistro": "date",
        "fechaIncidente": "date"
    }
    """
    df = ed.pandas_to_eland(
        pd_df=data_df,
        es_client=elastic_engine,
        # Where the data will live in Elasticsearch
        es_dest_index=index_name,
        # Type overrides for certain columns, the default is keyword
        es_type_overrides=type_overrides_dict,
        # If the index already exists, replace it
        es_if_exists="replace",
        # Wait for data to be indexed before returning
        es_refresh=True,
        chunksize=chunk_size
    )
    return df
I have used it to insert the pandas dataframe into Elasticsearch as follows:
from snippets.elastic_utils import save_to_elastic, conect2elastic

es = conect2elastic(user='falconiel')
save_to_elastic(data_df=siaf_consumados_elk,
                type_overrides_dict={'fechaRegistro': "date",
                                     'fechaIncidente': "date"},
                elastic_engine=es,
                index_name='siaf_18032021_pc',
                chunk_size=1000)
Everything works fine, but once the document is in Elasticsearch, 26 dates turn out to have been inserted incorrectly. All my data starts on January 1 2015, but Elasticsearch shows some documents dated December 31 2014. I haven't been able to find an explanation for this: why were some rows of the pandas dataframe whose date field is correct (from 2015-01-01) changed during loading to the last day of December of the previous year? I would appreciate any help or insight to correct this behavior.
My datetime columns in the pandas dataframe are typed as datetime. I have tried the following conversions before calling the function that saves the dataframe to Elasticsearch, but they have not been productive so far:
siaf_consumados_elk.fechaRegistro = pd.to_datetime(siaf_consumados_elk.fechaRegistro).dt.tz_localize(None)
siaf_consumados_elk.fechaRegistro = pd.to_datetime(siaf_consumados_elk.fechaRegistro, utc=True)
In fact the problem was UTC. I checked some of the rows in the pandas dataframe and they were shifted back by almost one day. For instance, one record registered on 2021-01-02 GMT-5 appeared as 2021-01-01. The solution was to apply the corresponding time zone before calling the function that saves the dataframe as an Elasticsearch document/index. So, following the good observation given by Mark Walkom, this is what I used before calling the function:
siaf_consumados_elk.fechaRegistro = siaf_consumados_elk.fechaRegistro.dt.tz_localize(tz='America/Guayaquil')
siaf_consumados_elk.fechaIncidente = siaf_consumados_elk.fechaIncidente.dt.tz_localize(tz='America/Guayaquil')
A list with the corresponding time zones can be found at: python time zones
This allowed the dates to be indexed correctly.
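For anyone hitting the same thing, here is a small illustrative sketch (not part of the original code) that reproduces the shift: a tz-naive midnight timestamp treated as UTC lands on the previous calendar day when viewed in GMT-5, while localizing it to the real zone first keeps the date.
import pandas as pd
ts = pd.Timestamp("2015-01-01 00:00:00")  # tz-naive, as stored in the dataframe
# treated as UTC and then viewed in GMT-5, the date slips back a day
print(ts.tz_localize("UTC").tz_convert("America/Guayaquil"))  # 2014-12-31 19:00:00-05:00
# localized to the real zone first, the calendar date is preserved
print(ts.tz_localize("America/Guayaquil"))                    # 2015-01-01 00:00:00-05:00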

Using a for loop in pandas

I have 2 different tabular files in Excel format. I want to know if an id number from one of the columns in the first Excel file (the "ID" column) exists in a specific column of the proteome file (take "IHD" for example) and, if so, to display the value associated with it. Is there a way to do this, specifically in pandas, possibly using a for loop?
After loading the excel files with read_excel(), you should merge() the dataframes on ID and protein. This is the recommended approach with pandas rather than looping.
import pandas as pd
clusters = pd.read_excel('clusters.xlsx')
proteins = pd.read_excel('proteins.xlsx')
clusters.merge(proteins, left_on='ID', right_on='protein')
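If you only need to know which IDs exist in the proteome file and read off their associated values, rather than keeping the full merged table, a membership check works too. A small sketch, assuming (as in the question) that the proteome column to match against is called 'IHD'; adjust the column names to your actual files:
# which cluster IDs appear in the proteome file's 'IHD' column?
mask = clusters['ID'].isin(proteins['IHD'])
print(clusters.loc[mask, 'ID'])
# rows from both files for those IDs, with their associated values
matches = clusters.merge(proteins, left_on='ID', right_on='IHD')
print(matches)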

Create a Dataframe from an excel file

I want to create a Dataframe from an Excel file. I am using the pandas read_excel function. My requirement is to create a Dataframe with all elements where a column matches some value.
For example, below is my Excel file, and I want to create the Dataframe with all elements that have Module equal to 'DC-Prod'.
Excel File Image
Welcome, Saagar Sheth!
To make a Dataframe, just import pandas like so...
import pandas as pd
then create a variable for the file to access, like this:
file_var_pandas = 'customer_data.xlsx'
and then create its dataframe using read_excel:
customers = pd.read_excel(file_var_pandas,
                          sheet_name=0,
                          header=0,
                          index_col=False,
                          keep_default_na=True)
Finally, use the head() command like so:
customers.head()
If you want to know more, just go to this website!
Packet Pandas Dataframe
and have fun!
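Since the original question was about keeping only the rows where Module equals 'DC-Prod', a boolean filter on the dataframe does that after loading. A short sketch using the customers dataframe from above, assuming the column is literally named 'Module':
# keep only the rows whose Module column equals 'DC-Prod'
dc_prod = customers[customers['Module'] == 'DC-Prod']
dc_prod.head()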

How to write summary of spark sql dataframe to excel file

I have a very large Dataframe with 8000 columns and 50000 rows.
I want to write its summary statistics to an Excel file.
I think we can use the describe() method, but how do I write it to Excel in a good format? Thanks.
The return type of describe() is a PySpark dataframe. The easiest way to get the describe output into an Excel-readable format is to convert it to a pandas dataframe and then write that pandas dataframe out as a CSV file, as below:
import pandas
df.describe().toPandas().to_csv('fileOutput.csv')
If you want it in Excel format, you can try the below:
import pandas
df.describe().toPandas().to_excel('fileOutput.xls', sheet_name = 'Sheet1', index = False)
Note: the above requires the xlwt package to be installed (pip install xlwt on the command line).
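If you need a .xlsx file instead of the older .xls format, the same call can be pointed at the openpyxl engine (pip install openpyxl); a sketch of that variant:
df.describe().toPandas().to_excel('fileOutput.xlsx', sheet_name='Sheet1', index=False, engine='openpyxl')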
