Pandas converts some numbers into zeros or other fixed values - python-3.x

I'm using Python with Pandas through Google Colab for some data analysis. I was analyzing the data through plots and noticed some missing data. However, when I looked at the original Excel data before any Python work, there was no missing data in these places. Somehow the first four days of a month of hourly data are turned into zeros, but only for some of the files and some of the time periods. The zeros are also followed by a period of other constant values.
I have four similar data files and two of them seem to be working just fine, but the other two get these zeros at the start of SOME (consecutive) months, while nothing is wrong with the original data. Is there some feature in Pandas that could cause some numbers to turn into zeros or other constant values? The same code is used for all the different files, which are all in the same format.
I thought it could be just a problem with using 'resample' during plotting, but even when I just print the values without 'resample', the values are still missing. I included a figure here to show what the data problem looks like.
Function to read the data:
import pandas as pd

def read_elec_data(data_file_name):
    df = pd.read_excel(data_file_name)  # Read the original data

    # Convert the time value (30.11.2018 0:00-1:00) into a Pandas-compatible timestamp format (2018-11-30 0:00)
    new = df["Päivämäärä ja tunti"].str.split("-", n=1, expand=True)  # Split the time column at the delimiter into two new columns [0, 1]. The ending hour [1] can be ignored.
    time_data = new[0]
    time_data_fixed = pd.to_datetime(time_data)  # Convert the modified time data into datetime format
    df['Aika'] = time_data_fixed  # Add the new time column to the dataframe

    # Remove all columns except the new timestamp and energy consumption columns.
    # Rename the consumption column according to the building name.
    building_name = df['Kohde'][0]
    df.drop(columns=["Päivämäärä ja tunti", "Tunti", 'Kohde', 'Mittarin nimi'], inplace=True)
    df = df.rename(columns={'Kulutus[kWh]': building_name})
    df = df.set_index('Aika')  # Set the timestamp as the index for the final DataFrame that will be utilized in the calculations
    return df
Calling of the function:
all_electricity_data_list = []
for buildingname in list_of_electricity_data:
    df = read_elec_data(buildingname)  # Use the file reading and modification function
    all_electricity_data_list.append(df)
all_electricity_data = pd.concat(all_electricity_data_list, axis=1)
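To see where the zeros actually sit before any resampling, a rough diagnostic sketch like the one below can help (this is an illustration only, not part of the original code; the date range is a placeholder and all_electricity_data is the hourly frame built above):
# Count zero-valued hours per building and per month in the combined frame
zero_counts = (all_electricity_data == 0).resample("M").sum()
print(zero_counts)

# Inspect a suspect period straight after reading a single file, before
# pd.concat, to rule out index alignment effects during concatenation
single_df = read_elec_data(list_of_electricity_data[0])
print(single_df.loc["2018-11-01":"2018-11-04"])  # placeholder dates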
Some numbers are converted to zeros or other constant values even though the original data is fine:

Related

Python data source - first two columns disappear

I have started using PowerBI and am using Python as a data source with the code below. The source data can be downloaded from here (it's about 700 megabytes). The data is originally from here (contained in IOT_2019_pxp.zip).
import pandas as pd
import numpy as np
import os

path = '/path/to/file'

to_chunk = pd.read_csv(os.path.join(path, 'A.txt'), delimiter='\t', header=[0, 1], index_col=[0, 1],
                       iterator=True, chunksize=1000)

def chunker(to_chunk):
    to_concat = []
    for chunk in to_chunk:
        try:
            to_concat.append(chunk['BG'].loc['BG'])
        except:
            pass
    return to_concat

A = pd.concat(chunker(to_chunk))
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I - A), index=A.index, columns=A.columns)
The code simply:
Loads the file A.txt, which is a symmetrical matrix. This matrix has every sector in every region for both rows and columns. In pandas, these form a MultiIndex.
Filters just the region that I need which is BG. Since it's a symmetrical matrix, both row and column are filtered.
The inverse of (I - A) is calculated, giving us L, which I want to load into PowerBI. This matrix now has just a single regular Index for sector.
This is all well and good; however, when I load into PowerBI, the first column (the sector names for each row, i.e. the DataFrame index) disappears. When the query gets processed, it is as if it were never there. This is true for both dataframes A and L, so it's not an issue of data processing. The column of row names (the DataFrame index) is still there in Python; PowerBI just drops it for some reason.
I need this column so that I can link these tables to other tables in my data model. Any ideas on how to keep it from disappearing at load time?
For what it's worth, calling reset_index() removed the index from the dataframes and they got loaded like regular columns. For whatever reason, PBI does not properly load pandas indices.
For a regular 1D index, I had to do S.reset_index().
For a MultiIndex, I had to do L.reset_index(inplace=True).
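A rough sketch of what that looks like in the Python source script (my illustration; it assumes the A and L DataFrames built above):
# Promote the (Multi)Index levels to ordinary columns so PowerBI keeps them
A = A.reset_index()          # returns a new DataFrame with the index as columns
L.reset_index(inplace=True)  # same effect, modifying L in place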

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV, and if any data is missing I need to return a CSV with an additional column that indicates which rows are missing data. A colleague suggested that I import the CSV into a dataframe, then create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows to match up to "dfinput".
I have Googled "pandas csv return error column where data is missing", but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed way is the best way to go about this.
import pandas as pd

dfinput = None
try:
    dfinput = pd.read_csv(r"C:\file.csv")
except:
    print("Uh oh!")

if dfinput is None:
    print("Ack!")
    quit(10)

dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0, col_fill='')

dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=['foo', 'bar', 'baz'])
# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()
# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s:colnames[s], axis=1)
# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object)
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [colnames[x] for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(),df.columns,'')
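To tie this back to the original goal of a CSV with an extra comment column, a short sketch (the output file name and comment wording are placeholders I added) could be:
# Rows with no missing values get an empty comment
df['comment'] = np.where(df['nullcols'] != '', 'missing: ' + df['nullcols'], '')
df.to_csv('file_with_errors.csv', index=False)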

Pandas is messing with a high resolution integer on read_csv

EDIT: This was Excel's fault changing the data type, not Pandas.
When I read a CSV using pd.read_csv(file), a column of very long ints gets converted to a low-resolution float. These ints are datetimes in microseconds.
Example values from the CSV column:
15555071095204000
15555071695202000
15555072295218000
15555072895216000
15555073495207000
15555074095206000
15555074695212000
15555075295202000
15555075895210000
15555076495216000
15555077095230000
15555077695206000
15555078295212000
15555078895218000
15555079495209000
15555080095208000
15555080530515000
15555086531880000
15555092531889000
15555098531886000
15555104531886000
15555110531890000
15555116531876000
15555122531873000
15555128531884000
15555134531884000
15555140531887000
15555146531874000
pd.read_csv produces: 1.55551e+16
How do I get it to report the exact int?
I've tried using: float_precision='high'
It's possible that this is caused by the way Pandas handles missing values, meaning that your column is importing as floats, to allow the missing values to be coded as NaN.
A simple solution would be to force the column to import as a str, then impute or remove missing values, and then convert to int:
import pandas as pd
df = pd.read_csv(file, dtype={'col1': str})  # Edit to use appropriate column reference
# If you want to just remove rows with missing values, something like:
df = df[df.col1 != '']
# Then convert to integer
df.col1 = df.col1.astype('int64')
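If the column is guaranteed to have no missing values at all, an alternative (not from the original answer) is to request an integer dtype directly in read_csv; this will raise an error if any value is missing or non-numeric:
# Only works when the column has no NaNs or blanks,
# because int64 cannot represent missing values
df = pd.read_csv(file, dtype={'col1': 'int64'})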
With a Minimal, Complete and Verifiable Example we can pinpoint the problem and update the code to accurately solve it.

Transfer cell values from different columns and sheets from multiple excel files with same structure into a single dataframe

I have a reporting sheet in excel that contains a set of datapoints that I want to compile from multiple files with the same format into a master dataset.
The initial step I undertook was to extract the data points I need from multiple sheets into one pandas dataframe. See the steps below.
I initially imported the excel file and parsed it:
import pandas as pd
xl = pd.ExcelFile(r"C:\Users\Nicola\Desktop\ISP 2016-20 Ops-Technical Form.xlsm")
df = xl.parse("FSL, WASH, DRM") #name of sheet #1
Then I located the data points needed for synthesis
a=df.iloc[5:20,3:5]
a1=df.iloc[6:9,10:12]
b=df.iloc[31:35,3:5]
b1=df.iloc[31:35,10:12]
Then I concatenated and aligned the column positions to keep the whole list of values within the same columns:
dfcon = pd.concat([a, b])
dfcon2 = pd.concat([a1, b1])
new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
dfcont2 = dfcon2.append(dfcon.rename(columns=new_cols))
And lastly I created a dataframe with the string of values I need:
master=pd.DataFrame(dfcont2)
finalmaster=master.transpose()
The next two steps I wish to pursue are:
1) Replicate the same code for 50 excel files
2) Compile all the strings of values from this set of excel files into one single pandas dataframe, without having to run this code over again for each file and compile the results manually by exporting to excel.
Any support would be greatly appreciated. Thanks
I believe you need to loop over the file names returned by glob and concat everything together at the end (all the files have the same structure):
import glob

dfs = []
for f in glob.glob('*.xlsm'):
    df = pd.read_excel(io=f, sheet_name=1)
    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[6:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]
    dfcon = pd.concat([a, b])
    dfcon2 = pd.concat([a1, b1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = dfcon2.append(dfcon.rename(columns=new_cols))
    dfs.append(dfcont2.T)

out = pd.concat(dfs, ignore_index=True)
Found the solution that works for me, thank you for the input, jezrael.
To further explain:
1) Imported the files with the same structure from my Desktop directory, parsed them and selected the Excel sheet from which data can be extracted from different locations (iloc):
import glob

dfs = []
for f in glob.glob('C:/Users/Nicola/Desktop/OPS Form/*.xlsm'):
    df = pd.ExcelFile(f)
    df = df.parse("FSL, WASH, DRM")
    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[7:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]
    c = df.iloc[50:56, 3:5]
    c1 = df.iloc[38:39, 10:12]
    d = df.iloc[57:61, 3:5]
    e = df.iloc[63:71, 3:5]
2) Concatenated and repositioned the column order to compose the first version of the dataframe (output); these lines are still inside the for loop:
    dfcon = pd.concat([a, b, c, d, e])
    dfcon2 = pd.concat([a1, b1, c1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = dfcon2.append(dfcon.rename(columns=new_cols))
    dfs.append(dfcont2.T)
3) The output presented the same string of values twice [the label and the form-specific entry], a by-product of the repeated data pull-outs from the iloc locations:
output = pd.concat(dfs, ignore_index=True)
4) This last snippet allowed me to extract the label row only once and to select all the odd-numbered entry rows. With the last concatenation, I generated the dataframe I sought, ready to be processed analytically:
a = output[2:3]
b = output[1::2]
pd.concat([a, b], axis=0, ignore_index=True)
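If the compiled result then needs to be exported back to Excel, a final step (the file name below is a placeholder I added) could be:
# Keep the concat from step 4 in a variable and write it out
final = pd.concat([a, b], axis=0, ignore_index=True)
final.to_excel('master_dataset.xlsx', index=False)  # requires openpyxl or xlsxwriter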

Round pandas timestamp series to seconds - then save to csv without ms/ns resolution

I have a dataframe, df with index: pd.DatetimeIndex. The individual timestamps are changed from 2017-12-04 08:42:12.173645000 to 2017-12-04 08:42:12 using the excellent pandas rounding command:
df.index = df.index.round("S")
When stored to csv, this format is kept (which is exactly what I want). I also need a date-only column, and this is now easily created:
df = df.assign(DateTimeDay = df.index.round("D"))
When stored to a csv file using df.to_csv(), the entire timestamp (2017-12-04 00:00:00) is written out, except when it is the ONLY column to be saved. So, I add the following command before saving:
df["DateTimeDay"] = df["DateTimeDay"].dt.date
...and the csv-file looks nice again (2017-12-04)
Problem description
Now over to the question, I have two other columns with timestamps on the same format as above (but different - AND - with some very few NaNs). I want to also round these to seconds (keeping NaNs as NaNs of course), then make sure that when written to csv, they are not padded with zeros "below the second resolution". Whatever I try, I am simply not able to do this.
Additional information:
print(df.dtypes)
print(df.index.dtype)
...all results in datetime64[ns]. If I convert them to an index:
df["TimeCol2"] = pd.DatetimeIndex(df["TimeCol2"]).round("s")
df["TimeCol3"] = pd.DatetimeIndex(df["TimeCol3"]).round("s")
...it works, but the csv-file still pads them with unwanted and unnecessary zeros.
Optimal solution: No conversion of the columns (like above) or use of element-wise apply unless they are quick (100+ million rows). My dream command would be like this:
df["TimeCol2"] = df["TimeCol2"].round("s") # Raises TypeError: an integer is required (got type str)
You can specify the date format for datetime dtypes when calling to_csv:
In[170]:
df = pd.DataFrame({'date': [pd.to_datetime('2017-12-04 07:05:06.767')]})
df

Out[170]:
                     date
0 2017-12-04 07:05:06.767

In[171]:
df.to_csv(date_format='%Y-%m-%d %H:%M:%S')

Out[171]: ',date\n0,2017-12-04 07:05:06\n'
If you want to round the values, you need to round prior to writing to csv:
In[173]:
df1 = df['date'].dt.round('s')
df1.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[173]: '0,2017-12-04 07:05:07\n'
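Applying this to the extra columns in the original question, a sketch (my addition; TimeCol2 and TimeCol3 are the asker's column names) might look like the following. Series.dt.round('s') leaves NaT values as NaT, and date_format controls how the non-null timestamps are written:
# Round both timestamp columns to whole seconds; NaT entries stay NaT
df["TimeCol2"] = df["TimeCol2"].dt.round("s")
df["TimeCol3"] = df["TimeCol3"].dt.round("s")

# Write without sub-second padding
df.to_csv("rounded.csv", date_format="%Y-%m-%d %H:%M:%S")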
