Convert a field content to Pandas DataFrame - python-3.x

I have the following pandas dataframe:
The field '_source' contains a JSON structure. I'd like to convert this field into another dataframe with the corresponding columns.
The type of this field is Series:
type(df['_source'])
pandas.core.series.Series
What is the best way to convert this field ('_source') into a Pandas DataFrame?
Thanks in advance
Kind regards

You can use these lines of code to convert '_source' into the corresponding columns:
import json

subdf = df['_source'].apply(json.loads)
source_df = pd.DataFrame(subdf.tolist())
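For illustration, here is a self-contained sketch of the same approach; the dataframe contents below are made up, since the original data isn't shown:
import json
import pandas as pd

# Hypothetical stand-in for the original dataframe
df = pd.DataFrame({
    '_source': ['{"name": "a", "value": 1}', '{"name": "b", "value": 2}']
})

subdf = df['_source'].apply(json.loads)   # parse each JSON string into a dict
source_df = pd.DataFrame(subdf.tolist())  # expand the dicts into columns
print(source_df)  # columns: name, value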

Related

How to create column names in pandas dataframe?

I have exported gold price data from my broker. The file has no column names and looks like this:
2014.02.13,00:00,1291.00,1302.90,1286.20,1302.30,41906
2014.02.14,00:00,1301.80,1321.20,1299.80,1318.70,46244
2014.02.17,00:00,1318.20,1329.80,1318.10,1328.60,26811
2014.02.18,00:00,1328.60,1332.10,1312.60,1321.40,46226
When I read the CSV into a pandas dataframe, it takes the first row as the column names. How can I set the column names myself and still keep all the data?
Thank you
If the CSV has no header row, you can tell pandas not to treat the first line as one by passing the header parameter:
df = pd.read_csv(file_path, header=None)
To assign column names to the DataFrame manually, you can use:
df.columns = ["col1", "col2", ...]
I encourage you to go through the read_csv documentation to learn more about the options it provides.
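Both steps can also be combined by passing names= to read_csv. A minimal sketch against the data shown above; the file name and column names are guesses at what the broker export contains:
import pandas as pd

# Hypothetical names for the seven columns in the export
cols = ["date", "time", "open", "high", "low", "close", "volume"]
df = pd.read_csv("gold.csv", header=None, names=cols)
print(df.head())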

Python Pandas split to series groupBy

I have a dataframe on which I can run split on a specific column and get a series - but how do I then add my other columns back into this dataframe? Or do I somehow specify in the split that column a is the groupBy key and then split on column b?
input:
idx  _id     systemA        systemB
0    abc123  1703.0|1144.0  2172.0|735.0
output:
a pandas series (not expanded) for systemA and systemB, split on '|' and grouped by _id
It sounds like a regular .groupby will achieve what you are after:
for specific_value, subset_df in df.groupby(column_of_interest):
    ...
The subset_df will be a pandas dataframe containing only rows for which column_of_interest contains specific_value.
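For the example data above, here is a minimal sketch that keeps _id as the key while splitting the other columns; it assumes the goal is one list of split values per column per _id:
import pandas as pd

df = pd.DataFrame({
    '_id': ['abc123'],
    'systemA': ['1703.0|1144.0'],
    'systemB': ['2172.0|735.0'],
})

# Keep _id as the index, then split each pipe-delimited string into a list
split_df = df.set_index('_id')[['systemA', 'systemB']].apply(
    lambda col: col.str.split('|')
)
print(split_df)  # systemA -> [1703.0, 1144.0], systemB -> [2172.0, 735.0] per _id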

Converting datetime.datetime into timestamp in pandas

I have a column in pandas which contains datetime.datetime objects. For instance, the rows have the following format:
datetime.datetime(2017,12,31,0,0)
I want to convert this to TimeStamp such that I get:
Timestamp('2017-12-31 00:00:00')
as output. How does one do this?
Try: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_timestamp.html
So maybe: df['datetime'].to_timestamp()
Note, though, that to_timestamp is meant for converting period data; for a column of datetime.datetime objects, pd.to_datetime is usually what you want.
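A quick sketch of the pd.to_datetime route (the column name 'datetime' is an assumption; the sample value comes from the question):
import datetime
import pandas as pd

df = pd.DataFrame({'datetime': [datetime.datetime(2017, 12, 31, 0, 0)]})
df['datetime'] = pd.to_datetime(df['datetime'])
print(repr(df['datetime'].iloc[0]))  # Timestamp('2017-12-31 00:00:00')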

Convert float64 column to int64 column

I have a CSV which has a column that contains a long "ID" number such as 9075841942209708806 (int64). When I read this CSV file into a pandas data frame, this number turns into -9.191700e+18 (float64).
How can the ID of -9.191700e+18 (float64) be converted back to its original form, i.e. 9075841942209708806 (int64)?
To change the dtype of the column, use:
df['ID'] = df['ID'].astype('int64')
Documentation here:
LINK
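Note that astype can't restore digits that were already lost when the value passed through float64; to keep the full ID, fix the dtype at read time instead. A minimal sketch, with the file and column names assumed:
import pandas as pd

# Read the ID column as int64 (or str) so it never passes through float64
df = pd.read_csv("data.csv", dtype={"ID": "int64"})
print(df["ID"].dtype)  # int64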

Get Spark dataset metadata

I am trying to convert a Dataset&lt;Row&gt; into another object, possibly a java.util.List. And I need to extract the metadata for this dataset, like the number of columns, the column names, and the column types. Is there any way to do it?
Thank you
You can get the schema from the dataset as
ds.schema
This gives you a StructType which contains all the information.
ds.schema.fieldNames
This gives the list of column names.
ds.schema.fields
This gives you a list of StructField, each of which contains the column name, the datatype, and nullable as a boolean value.
ds.schema.size
This gives the total number of columns.
Also, you can see the details with ds.printSchema()
Hope this helps!
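The calls above are the Scala API; here is a minimal PySpark sketch of the same lookups, with the session setup and sample data made up for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "label"])  # hypothetical data

print(df.schema)        # StructType with the full column metadata
print(df.schema.names)  # list of column names
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)
print(len(df.schema.fields))  # number of columns
df.printSchema()
spark.stop()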
