How to convert a json object into a dataframe when arrays are of different lengths? - python-3.x

I am trying to convert a JSON response into a dataframe but I am getting an error saying "All arrays must be of the same length". Any advice is highly appreciated.
Here is the code I have:
import pandas as pd
import requests
import json
from pandas import json_normalize
resp = requests.get("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?indicator=7.2.1")
resp.json()
pd.DataFrame(resp.json())
ValueError: All arrays must be of the same length

Here is how to convert the entire json using Pandas from_dict and transpose (T) methods:
import pandas as pd
import requests
resp = requests.get(
"https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?indicator=7.2.1"
)
df = pd.DataFrame.from_dict(resp.json(), orient="index").T
print(df.info())
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   size           1 non-null      object
 1   totalElements  1 non-null      object
 2   totalPages     1 non-null      object
 3   pageNumber     1 non-null      object
 4   attributes     1 non-null      object
 5   dimensions     1 non-null      object
 6   data           1 non-null      object
dtypes: object(7)
memory usage: 184.0+ bytes
But if you are interested in specific values of your json, which, I presume, are the ones associated with the key data, then simply do:
df = pd.DataFrame(resp.json()["data"])
print(df.info())
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   goal               25 non-null     object
 1   target             25 non-null     object
 2   indicator          25 non-null     object
 3   series             25 non-null     object
 4   seriesDescription  25 non-null     object
 5   seriesCount        25 non-null     object
 6   geoAreaCode        25 non-null     object
 7   geoAreaName        25 non-null     object
 8   timePeriodStart    25 non-null     float64
 9   value              25 non-null     object
 10  valueType          25 non-null     object
 11  time_detail        0 non-null      object
 12  timeCoverage       0 non-null      object
 13  upperBound         0 non-null      object
 14  lowerBound         0 non-null      object
 15  basePeriod         0 non-null      object
 16  source             25 non-null     object
 17  geoInfoUrl         0 non-null      object
 18  footnotes          25 non-null     object
 19  attributes         25 non-null     object
 20  dimensions         25 non-null     object
dtypes: float64(1), object(20)
memory usage: 4.2+ KB
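Since the question already imports json_normalize, a roughly equivalent alternative is to point it at the records directly (a minimal sketch, not part of the answer above; it assumes the elements under the "data" key are dictionaries, with any nested dictionaries flattened into dotted column names):
import pandas as pd
import requests
from pandas import json_normalize

resp = requests.get(
    "https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?indicator=7.2.1"
)
# record_path="data" turns each element of the "data" list into one row
df = json_normalize(resp.json(), record_path="data")
print(df.shape)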

Related

How to parse correctly to date in python from specific ISO format

I connected to a table on a db where there are two columns with dates.
I had no problem to parse the column with values formatted as this: 2017-11-03
But I don't find a way to parse the other column with dates formatted as this: 2017-10-03 05:06:52.840 +02:00
My attempts
If I parse a single value through the strptime method
dt.datetime.strptime("2017-12-14 22:16:24.037 +02:00", "%Y-%m-%d %H:%M:%S.%f %z")
I get the correct output
datetime.datetime(2017, 12, 14, 22, 16, 24, 37000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
but if I try to use the same format code while reading the table into a dataframe, the column dtype is object:
Licenze_FromGY = pd.read_sql(query, cnxn, parse_dates={"EndDate":"%Y-%m-%d", "LastUpd":"%Y-%m-%d %H:%M:%S.%f %z"})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Tenant     1000 non-null   int64
 1   IdService  1000 non-null   object
 2   Code       1000 non-null   object
 3   Aggregate  1000 non-null   object
 4   Bundle     991 non-null    object
 5   Status     1000 non-null   object
 6   Value      1000 non-null   int64
 7   EndDate    258 non-null    datetime64[ns]
 8   Trial      1000 non-null   bool
 9   LastUpd    1000 non-null   object
I also tried to change the format code, either in the read_sql method or in the pd.to_datetime() method, but then all the values become NaT:
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], format="%Y-%m-%d %H:%M:%S.%fZ", errors="coerce")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Tenant     1000 non-null   int64
 1   IdService  1000 non-null   object
 2   Code       1000 non-null   object
 3   Aggregate  1000 non-null   object
 4   Bundle     991 non-null    object
 5   Status     1000 non-null   object
 6   Value      1000 non-null   int64
 7   EndDate    258 non-null    datetime64[ns]
 8   Trial      1000 non-null   bool
 9   LastUpd    0 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](2), int64(2), object(5)
memory usage: 71.4+ KB
None
Can anyone help?
pandas cannot handle mixed UTC offsets in one Series (column). Assuming you have data like this
import pandas as pd
df = pd.DataFrame({"datetime": ["2017-12-14 22:16:24.037 +02:00",
"2018-08-14 22:16:24.037 +03:00"]})
if you just parse to datetime,
df["datetime"] = pd.to_datetime(df["datetime"])
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: object
you get dtype object. The elements of the series are then plain Python datetime.datetime objects, which limits the datetime functionality compared to the pandas datetime64 dtype.
You can get the pandas dtype, e.g., by parsing to UTC:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)
df["datetime"]
0 2017-12-14 20:16:24.037000+00:00
1 2018-08-14 19:16:24.037000+00:00
Name: datetime, dtype: datetime64[ns, UTC]
You might set an appropriate time zone to re-create the UTC offset:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True).dt.tz_convert("Europe/Athens")
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: datetime64[ns, Europe/Athens]
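Applied to the column from the question, that could look like the following (a sketch; the time zone is an assumption, chosen only because it reproduces the +02:00/+03:00 offsets shown above):
# parse the mixed-offset strings into a single tz-aware UTC column,
# then convert to a zone that matches the original offsets (assumed here)
Licenze_FromGY["LastUpd"] = pd.to_datetime(
    Licenze_FromGY["LastUpd"], utc=True
).dt.tz_convert("Europe/Athens")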

Filtering multiple rows when fixing OutOfBoundsDatetime: Out of bounds nanosecond timestamp multiple values

I have tried the code below to convert one of the columns of my dataframe df, search_departure_date, into datetime format, but I get the following error.
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182005 entries, 0 to 182004
Data columns (total 19 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   date                   182005 non-null  datetime64[ns]
 1   device_type            182005 non-null  object
 2   search_origin          182005 non-null  object
 3   search_destination     182005 non-null  object
 4   search_route           182005 non-null  object
 5   search_adult_count     157378 non-null  float64
 6   search_child_count     157378 non-null  float64
 7   search_cabin_class     157378 non-null  object
 8   search_type            182005 non-null  object
 9   search_departure_date  182005 non-null  object
 10  search_arrival_date    97386 non-null   datetime64[ns]
df["search_departure_date"] = pd.to_datetime(df.loc[:, 'search_departure_date'], format='%Y-%m-%d')
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1478-06-14 17:17:56
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1479-03-23 17:17:56
So I am trying to filter out the rows with these timestamp values:
df.loc[df['search_departure_date'] != '1478-06-14 17:17:56']
df.loc[df['search_departure_date'] != '1479-03-23 17:17:56']
How can I do this for multiple timestamps? I noticed they all begin with 1478 or 1479, and joining lots of conditions with the pipe (|) operator is cumbersome.
You can try errors='coerce'
df["search_departure_date"] = pd.to_datetime(df['search_departure_date'], errors='coerce')
If you want to filter out the rows instead, you can use str.match:
m = df['search_departure_date'].str.match('147(8|9)')
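To actually drop those rows and then convert, the mask can be combined with boolean indexing (a small sketch building on the line above; astype(str) is used only so that str.match works whether the cells are strings or datetime objects):
# flag rows whose departure date starts with the out-of-range years 1478/1479
m = df['search_departure_date'].astype(str).str.match('147(8|9)')
# keep the remaining rows and parse them to datetime
df = df.loc[~m].copy()
df['search_departure_date'] = pd.to_datetime(df['search_departure_date'])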

Pandas csv reader - how to force a column to be a specific data type (and replace NaN with null)

I am just getting started with pandas and I am reading a CSV file using the read_csv() method. The difficulty I am having is that I need to set a column to a specific data type.
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False)
my df looks like this:
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   product_code     4 non-null      object
 1   store_code       4 non-null      int64
 2   cost1            4 non-null      float64
 3   start_date       4 non-null      int64
 4   end_date         4 non-null      int64
 5   quote_reference  0 non-null      float64
 6   min1             4 non-null      int64
 7   cost2            2 non-null      float64
 8   min2             2 non-null      float64
 9   cost3            1 non-null      float64
 10  min3             1 non-null      float64
dtypes: float64(6), int64(4), object(1)
memory usage: 480.0+ bytes
You can see that I have multiple 'min' columns: min1, min2, min3.
min1 is correctly detected as int64, but min2 and min3 are float64.
This is due to min1 being fully populated, whereas min2 and min3 are sparsely populated.
As you can see from the counts above, min2 has 2 NaN values.
Trying to change the data type using
df['min2'] = df['min2'].astype('int')
I get this error:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
Ideally I want to change the data type to Int, and have NaN replaced by a NULL (ie don't want a 0).
I have tried a variety of methods, ie fillna, but can't crack this.
All help greatly appreciated.
Since pandas 1.0, you can use a generic pandas.NA to replace numpy.nan. This is useful to serve as an integer NA.
To perform the conversion, use the "Int64" dtype (note the capital I).
df['min2'] = df['min2'].astype('Int64')
Example:
s = pd.Series([1, 2, None, 3])
s.astype('Int64')
Or:
pd.Series([1, 2, None, 3], dtype='Int64')
Output:
0       1
1       2
2    <NA>
3       3
dtype: Int64
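If you prefer to get the nullable integer type straight from the CSV reader, read_csv also accepts it through the dtype argument (a sketch reusing the file and column names from the question):
df = pd.read_csv(
    'test_data.csv',
    delimiter=',',
    index_col=False,
    # nullable Int64 keeps missing values as <NA> instead of forcing float64
    dtype={'min2': 'Int64', 'min3': 'Int64'},
)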

How does pandas.merge scale?

I have two dataframes A and B. A consumes about 15 times more memory and has about 30 times more rows than B. However, if I merge A to itself it takes only about 15 seconds, whereas merging B to itself takes about 15 minutes (pandas version 1.1.5).
# this chain of operations takes about 15 seconds with A
(A # with B here and in the line below it takes about 15 minutes
.merge(A, how='left', indicator=True)
.query("_merge == 'left_only'")
.drop(columns='_merge')
.reset_index(drop=True))
Does anyone know how merges scale in pandas? Why is the merge of the large dataframe with itself so much faster than that of the small one with itself?
More details on the two dataframes:
DataFrame A:
RangeIndex: 7250117 entries, 0 to 7250116
Data columns (total 6 columns):
 #   Column         Dtype
---  ------         -----
 0   Order_ID       int64
 1   Color          object
 2   Size           object
 3   Country        object
 4   pieces_demand  float64
 5   count          int64
dtypes: float64(1), int64(2), object(3)
memory usage: 331.9+ MB
DataFrame B:
RangeIndex: 234805 entries, 0 to 234804
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   Order_ID               234805 non-null  int64
 1   Color                  234805 non-null  object
 2   Size                   234805 non-null  object
 3   pieces_demand          234805 non-null  float64
 4   pieces_ordered         234805 non-null  float64
 5   Size_Index             234805 non-null  int64
 6   dist_demand            234805 non-null  float64
 7   dist_ordered           234805 non-null  float64
 8   dist_discrete_demand   234805 non-null  float64
 9   dist_discrete_ordered  234805 non-null  float64
 10  count                  234805 non-null  int64
dtypes: float64(6), int64(3), object(2)
memory usage: 19.7+ MB
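One thing worth checking (a diagnostic sketch, not a definitive explanation): a merge produces one output row per matching pair of key rows, so duplicated rows in B would multiply the size of the self-merge and its runtime.
# how often does each full row occur in B?
print(B.value_counts().describe())
# compare input and output sizes of the self-merge;
# a large blow-up means duplicate keys are multiplying the rows
merged = B.merge(B, how='left', indicator=True)
print(len(B), len(merged))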

Offset a part of a dataframe by hours

I have a portion of measurements where the instrument time was configured wrong, and this needs to be corrected.
So I'm trying to offset or shift a part of my dataframe by 2 hours, but I can't seem to get it to work with my following code:
dfs['2015-03-23 10:45:00':'2015-03-23 13:15:00'].shift(freq=datetime.timedelta(hours=2))
I don't know if it can be done this easily though.
Hope someone understands my issue :)
>>> dfs.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 11979 entries, 2015-03-23 10:45:05 to 2015-03-23 16:19:32
Data columns (total 11 columns):
CH-1[V]        11979 non-null float64
CH-2[V]        11979 non-null float64
CH-3[V]        11979 non-null float64
CH-4[V]        11979 non-null float64
CH-5[V]        11979 non-null float64
CH-6[V]        11979 non-null float64
CH-7[V]        11979 non-null float64
CH-9[C]        11979 non-null float64
CH-10[C]       11979 non-null float64
Event          11979 non-null int64
Unnamed: 11    0 non-null float64
dtypes: float64(10), int64(1)
memory usage: 1.1 MB
Pandas Indexes are not mutable, so we can't change the index in place.
We could, however, move the index into a DataFrame column, modify that column, and then set it back as the index:
import numpy as np
import pandas as pd
# Suppose this is your `dfs`:
index = pd.date_range('2015-03-23 10:45:05', '2015-03-23 16:19:32', freq='T')
N = len(index)
dfs = pd.DataFrame(np.arange(N), index=index)
# move the index into a column
dfs = dfs.reset_index()
mask = (index >= '2015-03-23 10:45:00') & (index <= '2015-03-23 13:15:00')
# shift the masked values in the column
dfs.loc[mask, 'index'] += pd.Timedelta(hours=2)
# use the index column as the index
dfs = dfs.set_index(['index'])
This shows the index has been shifted by 2 hours:
In [124]: dfs.iloc[np.where(mask)[0].max()-1:].head(5)
Out[124]:
                       0
index
2015-03-23 15:13:05  148
2015-03-23 15:14:05  149   <-- shifted by 2 hours
2015-03-23 13:15:05  150   <-- unchanged
2015-03-23 13:16:05  151
2015-03-23 13:17:05  152
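If you would rather not round-trip through reset_index/set_index, the same shift can be expressed directly on the index (a sketch under the same setup as above; since Index objects are immutable, this builds a new index and assigns it rather than modifying the old one in place):
# keep the old timestamp where the mask is False, add 2 hours where it is True
dfs.index = dfs.index.where(~mask, dfs.index + pd.Timedelta(hours=2))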
