extract sub dataframe loc in for loop - python-3.x

I am having a problem slicing a Pandas dataframe into subsets and then joining them in a for loop.
The original dataframe looks like this:
(screenshot: original DataFrame content)
*Forgive me for blocking out some details for data security/policy reasons; the message should still come across.
This is a query problem: for each "Surveyname" in a separate list of survey names, I need to look up its rows and output a table of all surveys (in the order of the list) with their information from selected columns.
The original DataFrame has these columns:
Index(['Surveyname', 'Surveynumber', 'Datasetstatus', 'Datasetname',
'Datasetprocessname', 'Datasetdatatype', 'Datasetseismicformat',
'Datasettapedisplay', 'Inventoryid', 'Inventoryname',
'Inventorybarcode', 'Inventoryremarks', 'Datashipmentnumber',
'Facilityid', 'Facilityname', 'Inventoryfullpathfilename',
'Inventorytype', 'Mmsdatasetname', 'Inventorymediatype',
'Inventoryoriglocation', 'Inventoryreceiveddate',
'Inventorydataattribdesc', 'Firstalternatedesc', 'Secondalternatedesc',
'Thirdalternatedesc', 'field26', 'field27', 'field28', 'field29',
'field30', 'field31', 'field32'],
dtype='object')
I am selecting only these columns for the output:
cols =['Surveyname','Surveynumber','Datasettapedisplay','Inventoryid','Inventorybarcode','Inventoryoriglocation']
I set up an empty dataframe at the start and try to append each "queried" subset dataframe to it, hoping it will grow with the for loop.
The code looks like this:
f = open('EmptySurveyList2.txt', 'r')
cols = ['Surveyname', 'Surveynumber', 'Datasettapedisplay', 'Inventoryid', 'Inventorybarcode', 'Inventoryoriglocation']
setdf = pd.DataFrame(columns=cols)  # create an empty DataFrame
for line in f:
    print(line)
    # check by string content
    df0 = df_MIG.loc[df_MIG['Surveyname'] == line, cols]
    print(df_MIG.loc[df_MIG['Surveyname'] == line, cols])
    # check by string length for exact match
    df0 = df0.loc[df0['Surveyname'].str.len() == len(line), cols]
    print(df0.loc[df0['Surveyname'].str.len() == len(line), cols])
    print('df0:', len(df0))
    setdf = setdf.append(df0)
    print('setdf:', len(setdf))
However, this code still gives me only a few rows, from the very last survey, in the 'setdf' dataframe.
I went on debugging and found that, in the for loop, df0 does not find the survey information in the main df_MIG for any of the surveys in the list except the last one. Printing out the length of df0 and setdf:
>...Centauro
>
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>df0: 0
>
>setdf: 0
>
>Blueberry
>
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>df0: 0
>
>setdf: 0
>
>Baha (G)
> Surveyname Surveynumber Datasettapedisplay Inventoryid Inventorybarcode \
>219 Baha (G) 329130 FIN 1538554 4210380
>
>Inventoryoriglocation
>219 /wgdisk/hn0016/mc03/BAHA_329130/MIGFIN_639_256...
> Surveyname Surveynumber Datasettapedisplay Inventoryid Inventorybarcode \
>219 Baha (G) 329130 FIN 1538554 4210380
>
>Inventoryoriglocation
>219 /wgdisk/hn0016/mc03/BAHA_329130/MIGFIN_639_256...
>df0: 1
>
>setdf: 1
Whereas if I do the query outside of the loop,
a = "Blueberry"
df0=df_MIG.loc[df_MIG['Surveyname']==a,cols]
df0=df0.loc[df0['Surveyname'].str.len()==len(a),cols]
setdf=setdf.append(df0)
it works fine with no issues: it finds the rows with that survey name and adds them to setdf.
(screenshot: debugging outside the loop)
This is quite a mystery to me. Can anyone clarify why, or suggest a better alternative?
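A likely explanation, though not confirmed in the post: iterating over a file yields each line with its trailing '\n', so df_MIG['Surveyname'] == line never matches, except possibly for the last line of the file, which may lack a newline. A minimal sketch of that fix, assuming df_MIG is the dataframe above:

cols = ['Surveyname', 'Surveynumber', 'Datasettapedisplay',
        'Inventoryid', 'Inventorybarcode', 'Inventoryoriglocation']
with open('EmptySurveyList2.txt') as f:
    # strip the trailing newline (and surrounding whitespace) from each name
    names = [line.strip() for line in f]
# one vectorized lookup instead of growing a frame inside a loop
setdf = df_MIG.loc[df_MIG['Surveyname'].isin(names), cols]

Note that isin() keeps df_MIG's original row order; if the output must follow the order of the survey list, a merge against a keyed frame built from names would preserve that order instead.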

Related

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV; if any data is missing, I need to return a CSV with an additional column indicating which rows are missing data. A colleague suggested that I import the CSV into a dataframe, then create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows to match up to "dfinput".
I have Googled "pandas csv return error column where data is missing", but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed approach is the best way to go about this.
import pandas as pd

dfinput = None
try:
    dfinput = pd.read_csv(r"C:\file.csv")
except:
    print("Uh oh!")
if dfinput is None:
    print("Ack!")
    quit(10)
dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0,
                    col_fill='')
dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=['foo', 'bar', 'baz'])
# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()
# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s: colnames[s], axis=1)
# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object)
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [colnames[x] for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(),df.columns,'')
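To tie this back to the original goal, here is a minimal sketch that builds the requested comment column and writes out an error CSV; dferr, the comment wording, and the output path are illustrative, and it assumes dfinput is the frame loaded above:

nulls = dfinput.isnull()
dferr = dfinput.copy()
# join the names of the null columns for each row into one comment string
dferr['comment'] = nulls.apply(lambda s: nulls.columns[s], axis=1).str.join(', ')
dferr.loc[dferr['comment'] != '', 'comment'] = 'missing: ' + dferr['comment']
dferr.to_csv(r"C:\file_with_errors.csv", index=False)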

Wrote a function to quantify nulls in each column of data frame, but doesn't drop

I'm writing a function to automatically check the quantity of null values per column in a data frame; if the amount of nulls is less than or equal to 2000, it should drop the rows containing null values.
I've written some code that successfully outputs the text strings marking which column it has analyzed:
def drop_na(df, cols):
    for i in cols:
        missing_vals = df[i].isnull().sum()
        if missing_vals <= 2000:
            df = df.dropna(subset=[i])
        print(f'finished checking column "{i}"')
    print('FINISHED ALL!')
I check whether the null-containing rows have been dropped with data.isnull().sum() after running the code successfully (where data is the name of my data frame), but the same null counts still exist in the columns.
I call the function with drop_na(data, data.columns)
It looks like you are only deleting the rows inside the function: the reassignment to df is local and never affects the dataframe you passed in. Doing it in place solves the problem, as the following code works:
def drop_na(data):
    cols = data.columns
    subset = []
    # Determine bad columns, and store them in the `subset` list.
    for i in cols:
        missing_vals = data[i].isnull().sum()
        if missing_vals <= 2000:
            subset.append(i)
    # Now drop the rows with nulls in those columns, all at once, but inplace.
    data.dropna(subset=subset, inplace=True)
    print('FINISHED ALL!')
If you don't want to do it in place, then returning the df and assigning the returned value to a new variable (df2 = drop_na(data)) works, as sketched below. Do not forget to re-index the new dataframe if you need to.
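A minimal sketch of that non-inplace variant; the name drop_na_copy and the threshold parameter are illustrative:

def drop_na_copy(data, threshold=2000):
    subset = [c for c in data.columns if data[c].isnull().sum() <= threshold]
    # dropna returns a new dataframe; reset_index renumbers its rows
    return data.dropna(subset=subset).reset_index(drop=True)

df2 = drop_na_copy(data)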

How do I give a text key to a dataframe stored as a value in a dictionary?

So I have 3 dataframes: df1, df2, df3. I'm trying to loop through each dataframe so that I can run some preprocessing (set the datetime, extract the hour to a separate column, etc.). However, I'm running into some issues:
If I store the df in a dict as in df_dict = {'df1' : df1, 'df2' : df2, 'df3' : df3} and then loop through it as in
for k, v in df_dict.items():
    if k == 'df1':
        v['Col1']....
    else:
        v['Coln']....
I get a NameError: name 'df1' is not defined
What am I doing wrong? I initially thought I was not reading the df1..3 data in, but that part seems to operate OK (it doesn't fail, and it is clearly reading the data in, given the time lag; they are big files). The code preceding it (for the load) is:
DF_DATA = { 'df1': 'df1.csv', 'df2': 'df2.csv', 'df3': 'df3.csv' }
for k, v in DF_DATA.items():
    print(k, v)         # this works to print out both key and value
    k = pd.read_csv(v)  # this does not
I am thinking this may be the cause, but I'm not sure. I'm expecting the load loop to create the 3 dataframes and put them into memory. Then, for the loop at the top of the page, I want to reference the string key in my if-block condition so that each df can get a slightly different preprocessing treatment.
Thanks very much in advance for your assist.
You didn't create df_dict correctly: k = pd.read_csv(v) only rebinds the loop variable, so no dataframes named df1..df3 are ever created or stored anywhere. Try this:
DF_DATA = { 'df1': 'df1.csv','df2': 'df2.csv', 'df3': 'df3.csv' }
df_dict= {k:pd.read_csv(v) for k,v in DF_DATA.items()}
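With the dict built this way, the preprocessing loop from the top of the question can key off the string. A minimal sketch; the datetime conversion and the 'Col1'/'Coln'/'hour' column names are only placeholders carried over from the question:

import pandas as pd

DF_DATA = {'df1': 'df1.csv', 'df2': 'df2.csv', 'df3': 'df3.csv'}
df_dict = {k: pd.read_csv(v) for k, v in DF_DATA.items()}

for k, v in df_dict.items():
    if k == 'df1':
        v['Col1'] = pd.to_datetime(v['Col1'])
        v['hour'] = v['Col1'].dt.hour
    else:
        v['Coln'] = pd.to_datetime(v['Coln'])
        v['hour'] = v['Coln'].dt.hour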

Convert huge number of lists to pandas dataframe

User-defined function: my_fun(x) returns a list.
XYZ = a file with LOTS of lines.
pandas_frame = pd.DataFrame()  # Created empty data frame
for index in range(0, len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code takes a very long time to run, as in days. How do I speed it up?
I think you need to apply the function to each row, build a list with a list comprehension, and then call the DataFrame constructor only once:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
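Each pandas_frame.append copies the whole accumulated frame, so the loop's cost grows roughly quadratically with the number of lines; building a plain list first and constructing the DataFrame once avoids that. For illustration only, with toy stand-ins for my_fun and XYZ (the real ones aren't shown in the post):

import pandas as pd

XYZ = ['1,2,3', '4,5,6', '7,8,9']            # toy stand-in for the real file contents
def my_fun(line):                            # toy stand-in for the real function
    return [int(v) for v in line.split(',')]

L = [my_fun(x) for x in XYZ]   # build all rows first
df = pd.DataFrame(L)           # construct the frame once
print(df)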

List iterations and regex, what is the better way to remove the text I don't need?

We handle data from volunteers; that data is entered into a form using ODK. When the data is downloaded, the header (column names) row contains a lot of 'stuff' we don't need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200) or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is the faster or better way to do this?
Is the output reliable in terms of sort order? As of now, it appears that the results list is in the same order as the original list.
y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
import re
r = []
for x in y:
    m = re.findall(r'G\d+', x)
    r.append(m)
# the comprehension below is to flatten it
# appending m gives me a list of lists (each list has one item)
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need and replace (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
                           'Group1/most common/G24',
                           'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []

df.columns = df.columns.str.extract(r'(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution is rsplit, selecting the last values with [-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
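If a renamed copy is preferred over modifying df in place (the question mentions both options), a minimal sketch; df2 is just an illustrative name, and it assumes every column name contains a G-code:

import re
df2 = df.rename(columns=lambda c: re.search(r'G\d+', c).group(0))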
