Possible corner case: pandas.read_csv - python-3.x

Why are all dots stripped from strings that consist of numbers and dots, but only when engine='python', and even though dtype is defined?
The unexpected behaviour occurs when processing a CSV file that:
- has strings consisting solely of numbers, with single dots spread throughout the string
- is read with the read_csv parameters engine='python' and thousands='.'
Sample test code:
import pandas as pd # version 1.5.2
import io
data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""
df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
df1 output: column a is as desired and expected
| | a | b | c |
|---:|:---------------|------:|-----:|
| 0 | 0000.7995 | 16000 | 0 |
| 1 | 3.03.001.00514 | 0 | 4000 |
| 2 | 4923.600.041 | 23000 | 131 |
df2 output: column a is not as expected
| | a | b | c |
|---:|------------:|------:|-----:|
| 0 | 00007995 | 16000 | 0 |
| 1 | 30300100514 | 0 | 4000 |
| 2 | 4923600041 | 23000 | 131 |
Even though dtype={'a': str}, it seems that engine='python' handles it differently from engine='c'. dtype={'a': object} yields the same result.
I have spent quite some time getting to know all the read_csv settings, and I can't see any other option I could set to alter this behaviour.
Is there anything I missed or is this behaviour 'normal'?

Looks like a bug (it wasn't reported yet, so I filed it). I was only able to come up with a clumsy workaround:
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)
| a | b | c |
|:---------------|------:|-----:|
| 0000.7995 | 16000 | 0 |
| 3.03.001.00514 | 0 | 4000 |
| 4923.600.041 | 23000 | 131 |
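A possibly less clumsy alternative, as a sketch only (the strip_dots helper and the choice of columns are ad hoc, and I haven't verified this against every pandas version): drop thousands='.' altogether and strip the dots only in the genuinely numeric columns via read_csv's converters parameter, so column a is never touched.
import io
import pandas as pd

data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""

# Hypothetical helper: remove the dots and parse the remainder as int.
def strip_dots(value):
    return int(value.replace('.', ''))

# No thousands='.' here, so the python engine leaves column a alone;
# b and c are cleaned value by value through the converters instead.
df = pd.read_csv(
    io.StringIO(data), sep=';', engine='python',
    dtype={'a': str},
    converters={'b': strip_dots, 'c': strip_dots},
)
print(df)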

Related

Removing unwanted characters in python pandas

I have a pandas dataframe column like below:
| ColumnA |
+-------------+
| ABCD(!) |
| <DEFG>(23) |
| (MNPQ. ) |
| 32.JHGF |
| "QWERT" |
The aim is to remove the special characters and produce the output below:
| ColumnA |
+------------+
| ABCD |
| DEFG |
| MNPQ |
| JHGF |
| QWERT |
I tried using the replace method like below, but without success:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z\d\_]+", "", regex=True)
print(df)
So, how can I replace the special characters using replace method in pandas?
Your pattern also keeps digits (\d) and the underscore (_), so remove them from the character class to strip everything that is not a letter:
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print(df)
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
The regex should be r'[^a-zA-Z]+', which means: remove everything that is not a character from A to Z or a to z.
import pandas as pd
# | ColumnA |
# +-------------+
# | ABCD(!) |
# | <DEFG>(23) |
# | (MNPQ. ) |
# | 32.JHGF |
# | "QWERT" |
# create a dataframe from a list
df = pd.DataFrame(['ABCD(!)', '<DEFG>(23)', '(MNPQ. )', '32.JHGF', '"QWERT"'], columns=['ColumnA'])
# | ColumnA |
# +------------+
# | ABCD |
# | DEFG |
# | MNPQ |
# | JHGF |
# | QWERT |
# keep only the characters that are from A to Z, a-z
df['ColumnB'] = df['ColumnA'].str.replace(r'[^a-zA-Z]+', '', regex=True)
print(df['ColumnB'])
Result:
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT
Your suggested code works fine on my installation, except that it also keeps digits, so you only need to update your regex to r"[^a-zA-Z]+". If that doesn't work, then maybe try updating your pandas:
import pandas as pd
d = {'ColumnA': [' ABCD(!)', '<DEFG>(23)', '(MNPQ. )', ' 32.JHGF', '"QWERT"']}
df = pd.DataFrame(d)
df['ColumnA'] = df['ColumnA'].str.replace(r"[^a-zA-Z]+", "", regex=True)
print(df)
Output
ColumnA
0 ABCD
1 DEFG
2 MNPQ
3 JHGF
4 QWERT

Nodes taken from pandas DataFrame are not fully connected by NetworkX

I have a dataframe that I use to calculate the weight of distribution between OS and limit.
| customer_id | limit_reference | OS | limit |
| ----------- | ------------------ | ------ | ------- |
| 1111 | 1111.A.1::1111.B.1 | 0.1 | 5 |
| 1111 | 1111.A.1 | .08 | 5 |
| 9012 | 1111.B.1::9012.B.1 | .15 | 5 |
The value in limit_reference is of the form: ID.contract_reference.
I need to match all values in customer_id and limit_reference and group them together. So if I create another group_id column, I would have:
| customer_id | limit_reference | OS | limit | group_id |
| ----------- | ------------------ | ------ | ------- | ---------|
| 1111 | 1111.A.1::1111.B.1 | 0.1 | 5 | 1 |
| 1111 | 1111.A.1 | .08 | 5 | 1 |
| 9012 | 1111.B.1::9012.B.1 | .15 | 5 | 1 |
The problem is that networkx treats 1111.A.1::1111.B.1 and 1111.B.1::9012.B.1 as two different nodes even though they share the element 1111.B.1.
I have tried to split limit_reference, but the resulting list is unhashable. Here's the code I tried:
import numpy as np
import pandas as pd
import networkx as nx
df_ = pd.read_excel('sample2.xlsx')
G = nx.Graph()
G = nx.from_pandas_edgelist(df_, 'customer_id', 'limit_reference')
cnc = nx.connected_components(G)
pos = nx.spring_layout(G, scale=20, k=2/np.sqrt(G.order()))
lookup = {i: component for i, component in enumerate(cnc, 1)}
df_['group_id'] = [label for node in df_.customer_id for label, component in lookup.items() if node in component]
nx.draw(G, pos, node_color='lightgreen', node_size=1000, with_labels=True)
You can split limit_reference with pandas' Series.str.split() method, e.g.
import pandas as pd
import networkx as nx
df_ = pd.DataFrame({'customer_id': [1111, 1111, 9012],
'limit_reference': ['1111.A.1::1111.B.1',
'1111.A.1',
'1111.B.1::9012.B.1']})
G = nx.Graph()
limit_reference_split = [x for sublist in df_['limit_reference'].str.split('::')
for x in sublist]
G.add_nodes_from(limit_reference_split)
list(G.nodes)
['1111.A.1', '1111.B.1', '9012.B.1']
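To go on and produce the group_id column the question asks for, one possible sketch (using pandas' explode, available since pandas 0.25, and the column names from the question) is to turn every '::'-separated reference into its own edge and then label each row by its connected component:
import pandas as pd
import networkx as nx

df_ = pd.DataFrame({'customer_id': [1111, 1111, 9012],
                    'limit_reference': ['1111.A.1::1111.B.1',
                                        '1111.A.1',
                                        '1111.B.1::9012.B.1']})

# One row per (customer_id, individual reference) pair.
edges = (df_.assign(ref=df_['limit_reference'].str.split('::'))
            .explode('ref'))

# Customers sharing any reference end up in the same component.
G = nx.from_pandas_edgelist(edges, 'customer_id', 'ref')
node_to_group = {node: i
                 for i, component in enumerate(nx.connected_components(G), 1)
                 for node in component}

df_['group_id'] = df_['customer_id'].map(node_to_group)
print(df_)
For this sample data all three rows fall into a single component, so group_id is 1 everywhere, matching the expected output above.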

Problem in using contains and udf in Pyspark: AttributeError: 'NoneType' object has no attribute 'lower'

I have 2 DataFrames, df1 and df2:
df1:
+-------------------+----------+------------+
| df1.name |df1.state | df1.pincode|
+-------------------+----------+------------+
| CYBEX INTERNATION| HOUSTON | 00530 |
| FLUID POWER| MEDWAY | 02053 |
| REFINERY SYSTEMS| FRANCE | 072234 |
| K N ENTERPRISES| MUMBAI | 100010 |
+-------------------+----------+------------+
df2:
+--------------------+------------+------------+
| df2.name |df2.state | df2.pincode|
+--------------------+------------+------------+
|FLUID POWER PVT LTD | MEDWAY | 02053 |
| CYBEX INTERNATION | HOUSTON | 02356 |
|REFINERY SYSTEMS LTD| MUMBAI | 072234 |
+--------------------+------------+------------+
My task is to validate whether the data in df1 is present in df2: if it is, Validated = 1, else Validated = 0.
I am joining on state and pincode, and for the string comparison I first convert each name to lower case, sort its characters, and use Python's sequence matching.
Expected Output is:
+-------------------+-------------------+----------+------------+------------+
| df1.name|df2.name |df1.state | df1.pincode| Validated |
+-------------------+-------------------+----------+------------+------------+
| CYBEX INTERNATION| NULL |HOUSTON | 00530 | 0 |
| FLUID POWER|FLUID POWER PVT LTD|MEDWAY | 02053 | 1 |
| REFINERY SYSTEMS| NULL |FRANCE | 072234 | 0 |
| K N ENTERPRISES| NULL |MUMBAI | 100010 | 0 |
+-------------------+-------------------+----------+------------+------------+
I have my code:
from pyspark.sql.types import *
from difflib import SequenceMatcher
from pyspark.sql.functions import col,when,lit,udf
contains = udf(lambda s, q: SequenceMatcher(None,"".join(sorted(s.lower())), "".join(sorted(q.lower()))).ratio()>=0.9, BooleanType())
join_condition = ((col("df1.pincode") == col("df2.pincode")) & (col("df1.state") == col("df2.state")))
result_df = df1.alias("df1").join(df2.alias("df2"), join_condition , "left").where(contains(col("df1.name"), col("df2.name")))
result = result_df.select("df1.*",when(col("df2.name").isNotNull(), lit(1)).otherwise(lit(0)).alias("validated"))
result.show()
But running it gives me:
AttributeError: 'NoneType' object has no attribute 'lower'
I know the unmatched column is null, which is why s.lower() and q.lower() fail, but how do I tackle this problem? I only want to use this contains condition for the filtering step.
Also, I need the df2.name column in the result, so I am passing the column names as a list:
cols = ["df1.name","df2.name","df1.state","df1.pincode"]
result = result_df.select(*cols,when(col("df2.name").isNotNull(), lit(1)).otherwise(lit(0)).alias("validated"))
But again I am getting an error:
SyntaxError: only named arguments may follow *expression
Any help will be appreciated. Thanks.
In your UDF you are using the .lower method, which is a method of str objects. Apparently your DataFrame has some None values somewhere in df1.name or df2.name.
Replace your current UDF with something like this to handle None:
contains = udf(
    lambda s, q: SequenceMatcher(
        None,
        "".join(sorted((s or "").lower())),
        "".join(sorted((q or "").lower()))
    ).ratio() >= 0.9,
    BooleanType()
)
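For the second error, SyntaxError: only named arguments may follow *expression: that is a restriction of older Python versions on *-unpacking in a function call. A sketch that sidesteps it, reusing result_df, the column list, and the validated expression from the question, is to pass select() a single plain list instead of unpacking it:
from pyspark.sql.functions import col, when, lit

cols = ["df1.name", "df2.name", "df1.state", "df1.pincode"]
validated = when(col("df2.name").isNotNull(), lit(1)).otherwise(lit(0)).alias("validated")

# select() accepts one list of column names / expressions,
# so no *-unpacking is needed after the join from the question.
result = result_df.select(cols + [validated])
result.show()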

Fill dataframe cells entry using dataframe column names and index

I try to fill a dataframe using the following approach:
- generate an m x n dataframe
- column names for the dataframe are A to N and are read from a list passed to the method
- define the index for the dataframe
- fill the dataframe entries with column name + '_' + index
import numpy as np
import pandas as pd
from tabulate import tabulate
def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows == 4:
        d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
        df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    else:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        df = None
    # df.loc[[],'Z'] = 3
    return tabulate(df, headers='keys', tablefmt='psql')
a = generate_data(4,2, ['A', 'B'])
print(a)
Out:
+---------+-----+-----+
| | A | B |
|---------+-----+-----|
| RU19-24 | 0 | 0 |
| RU13-18 | 0 | 0 |
| RU7-12 | 0 | 0 |
| RU1-6 | 0 | 0 |
+---------+-----+-----+
Is it possible to take the index and concatenate it with the column names to get the following output?
+---------+-------------+-------------+
| | A | B |
|---------+-------------+-------------|
| RU19-24 | A_RU19-24 | B_RU19-24 |
| RU13-18 | A_RU13-18 | B_RU13-18 |
| RU7-12 | A_RU7-12 | B_RU7-12 |
| RU1-6 | A_RU1-6 | B_RU1-6 |
+---------+-------------+-------------+
IIUC, you can use apply, which takes each column of the dataframe as a pd.Series with an index (the dataframe index) and a name (the dataframe column header):
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df.apply(lambda x: x.name+'_'+x.index)
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
or use np.add.outer
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df_out = pd.DataFrame(np.add.outer(df.columns+'_',df.index).T, index=df.index, columns=df.columns)
df_out
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
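A third, equivalent sketch relies on plain NumPy broadcasting over the label arrays; it assumes the index and columns hold strings (object dtype), as in the example above:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6'], columns=['A', 'B'])

# Broadcast column labels against index labels:
# shape (n_cols,) + shape (n_rows, 1) -> (n_rows, n_cols).
values = (df.columns + '_').values + df.index.values[:, None]
df_out = pd.DataFrame(values, index=df.index, columns=df.columns)
print(df_out)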

Cannot convert column when using FeatureTools to normalize_entity with timestamps

I'm attempting to use FeatureTools to normalize a table for feature synthesis. My table is similar to Max-Kanter's response from How to apply Deep Feature Synthesis to a single table. I'm hitting an exception and would appreciate some help working around it.
The exception originates in featuretools.entityset.entity.entityset_convert_variable_type, which doesn't seem to handle time types.
What is the nature of the exception, and can I work around it?
The Table, df:
| PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12345 | 5642903 | F | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 67890 | 3902943 | M | 2016-03-18 | 2016-04-29 | 44 | Other Nbh | 1 | 1 | 0 | 0 | 0 | 0 | Yes |
...
My Code:
appointment_entity_set = ft.EntitySet('appointments')
appointment_entity_set.entity_from_dataframe(
    dataframe=df, entity_id='appointments',
    index='AppointmentID', time_index='AppointmentDay')
# error generated here
appointment_entity_set.normalize_entity(base_entity_id='appointments',
                                        new_entity_id='patients',
                                        index='PatientId')
ScheduledDay and AppointmentDay are of type pandas._libs.tslib.Timestamp, as is the case in Max-Kanter's response.
The Exception:
~/.virtualenvs/trane/lib/python3.6/site-packages/featuretools/entityset/entity.py in entityset_convert_variable_type(self, column_id, new_type, **kwargs)
474 df = self.df
--> 475 if df[column_id].empty:
476 return
477 if new_type == vtypes.Numeric:
Exception: Cannot convert column first_appointments_time to <class 'featuretools.variable_types.variable.DatetimeTimeIndex'>
featuretools==0.1.21
This dataset is from the Kaggle Show or No Show competition
The error that’s showing up seems to be a problem with the way the AppointmentDay variable is being read by pandas. We actually have an example Kaggle kernel with that dataset. There, we needed to use pandas.read_csv with parse_dates:
data = pd.read_csv("data/KaggleV2-May-2016.csv", parse_dates=['AppointmentDay', 'ScheduledDay'])
That returns a pandas Series whose values are of type numpy.datetime64, which should load into Featuretools fine.
Also, can you make sure you have the latest version of Featuretools from pip? There is a set trace command in that stack trace that isn’t in the latest release.
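Putting the parse_dates suggestion together with the question's own entity-set code, a rough end-to-end sketch might look like this (the CSV path is the one from the Kaggle kernel and may differ locally; the Featuretools calls are copied from the question's 0.1.x-era code):
import featuretools as ft
import pandas as pd

# Parse the two date columns up front so they arrive as datetime64 values.
df = pd.read_csv("data/KaggleV2-May-2016.csv",
                 parse_dates=['AppointmentDay', 'ScheduledDay'])

appointment_entity_set = ft.EntitySet('appointments')
appointment_entity_set.entity_from_dataframe(
    dataframe=df, entity_id='appointments',
    index='AppointmentID', time_index='AppointmentDay')
appointment_entity_set.normalize_entity(base_entity_id='appointments',
                                        new_entity_id='patients',
                                        index='PatientId')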