I have a dataframe that I use to calculate the weight of the distribution between OS and limit.
| customer_id | limit_reference | OS | limit |
| ----------- | ------------------ | ------ | ------- |
| 1111 | 1111.A.1::1111.B.1 | 0.1 | 5 |
| 1111 | 1111.A.1 | .08 | 5 |
| 9012 | 1111.B.1::9012.B.1 | .15 | 5 |
The values in limit_reference are of the form ID.contract_reference.
I need to match all the values across customer_id and limit_reference and group them together. So if I created another group_id column, I would have:
| customer_id | limit_reference | OS | limit | group_id |
| ----------- | ------------------ | ------ | ------- | ---------|
| 1111 | 1111.A.1::1111.B.1 | 0.1 | 5 | 1 |
| 1111 | 1111.A.1 | .08 | 5 | 1 |
| 9012 | 1111.B.1::9012.B.1 | .15 | 5 | 1 |
The problem I have is that networkx treats 1111.A.1::1111.B.1 and 1111.B.1::9012.B.1 as two different nodes, even though they share the element 1111.B.1.
I have tried to split limit_reference, but the resulting list is unhashable. Here's the code I tried:
import numpy as np
import pandas as pd
import networkx as nx

df_ = pd.read_excel('sample2.xlsx')
G = nx.from_pandas_edgelist(df_, 'customer_id', 'limit_reference')
cnc = nx.connected_components(G)
lookup = {i: component for i, component in enumerate(cnc, 1)}
df_['group_id'] = [label for node in df_.customer_id for label, component in lookup.items() if node in component]
pos = nx.spring_layout(G, scale=20, k=2/np.sqrt(G.order()))
nx.draw(G, pos, node_color='lightgreen', node_size=1000, with_labels=True)
You can split limit_reference with pandas' Series.str.split() method, e.g.
import pandas as pd
import networkx as nx
df_ = pd.DataFrame({'customer_id': [1111, 1111, 9012],
'limit_reference': ['1111.A.1::1111.B.1',
'1111.A.1',
'1111.B.1::9012.B.1']})
G = nx.Graph()
# flatten the '::'-separated references into individual nodes
limit_reference_split = [x for sublist in df_['limit_reference'].str.split('::')
                         for x in sublist]
G.add_nodes_from(limit_reference_split)
list(G.nodes)
# ['1111.A.1', '1111.B.1', '9012.B.1']
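Building on that, here is a minimal, self-contained sketch of one way to turn the split references into graph edges, so that connected components give the group_id column asked for in the question (using the same sample data):

import pandas as pd
import networkx as nx

df_ = pd.DataFrame({'customer_id': [1111, 1111, 9012],
                    'limit_reference': ['1111.A.1::1111.B.1',
                                        '1111.A.1',
                                        '1111.B.1::9012.B.1']})

G = nx.Graph()
# connect each customer to every individual reference on its row, so rows that
# share a reference (e.g. '1111.B.1') end up in the same connected component
for cust, refs in zip(df_['customer_id'], df_['limit_reference'].str.split('::')):
    for ref in refs:
        G.add_edge(cust, ref)

# number the components and label every row through its customer_id
lookup = {node: i
          for i, component in enumerate(nx.connected_components(G), start=1)
          for node in component}
df_['group_id'] = df_['customer_id'].map(lookup)
print(df_)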
Related
Why are all dots stripped from strings that consist of numbers and dots, but only when engine='python', even though dtype is defined?
The unexpected behaviour occurs when processing a CSV file that:
- has strings that consist solely of numbers, with single dots spread throughout the string, and
- is read with the read_csv parameters engine='python' and thousands='.'.
Sample test code:
import pandas as pd  # version 1.5.2
import io

data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""

# identical calls except for the parser engine
df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
df1 output: column a is as desired and expected
| | a | b | c |
|---:|:---------------|------:|-----:|
| 0 | 0000.7995 | 16000 | 0 |
| 1 | 3.03.001.00514 | 0 | 4000 |
| 2 | 4923.600.041 | 23000 | 131 |
df2 output: column a is not as expected
| | a | b | c |
|---:|------------:|------:|-----:|
| 0 | 00007995 | 16000 | 0 |
| 1 | 30300100514 | 0 | 4000 |
| 2 | 4923600041 | 23000 | 131 |
Even though dtype={'a': str} is set, engine='python' seems to handle it differently from engine='c'; dtype={'a': object} yields the same result.
I have spent quite some time getting to know all of the pandas read_csv settings, and I can't see any other option I could set to alter this behaviour.
Is there anything I missed, or is this behaviour 'normal'?
Looks like a bug (it wasn't reported, so I filed it). I was only able to come up with a clumsy workaround:
# read everything as strings so no thousands handling is applied,
# then strip the '.' separators from the numeric columns manually
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)
| a              |     b |    c |
|:---------------|------:|-----:|
| 0000.7995      | 16000 |    0 |
| 3.03.001.00514 |     0 | 4000 |
| 4923.600.041   | 23000 |  131 |
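A variant of the same workaround (just a sketch, not a fix for the underlying bug) is to drop thousands='.' entirely and hand the numeric columns to converters, so the Python engine never gets a chance to strip the dots from column a:

df = pd.read_csv(
    io.StringIO(data), sep=';', engine='python',
    dtype={'a': str},
    # strip the '.' thousands separators ourselves for the integer columns
    converters={'b': lambda s: int(s.replace('.', '')),
                'c': lambda s: int(s.replace('.', ''))},
)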
I'm using the Django ORM with PostgreSQL. Is it possible to query using both group_by and order_by?
Given this table:
| id | b_id | others |
|----|------|--------|
| 1  | 2    | hh     |
| 2  | 2    | hhh    |
| 3  | 6    | h      |
| 4  | 7    | hi     |
| 5  | 7    | i      |
I want the query result to look like this:
| id | b_id | others |
|----|------|--------|
| 1  | 2    | hh     |
| 3  | 6    | h      |
| 4  | 7    | hi     |
or
| id | b_id | others |
|----|------|--------|
| 4  | 7    | hi     |
| 3  | 6    | h      |
| 1  | 2    | hh     |
I tried:
Table.objects.annotate(count=Count('b_id')).values('b_id', 'id', 'others')
Table.objects.values('b_id', 'id', 'others').annotate(count=Count('b_id'))
Table.objects.extra(order_by=['id']).values('b_id','id', 'others')
Try a window function and a subquery in the following way:
from django.db.models import Window, F, Subquery, Count
from django.db.models.functions import FirstValue

# keep only the row holding the first id within each b_id partition
queryset = A.objects.annotate(count=Count('b_id')).filter(pk__in=Subquery(
    A.objects.annotate(
        first_id=Window(expression=FirstValue('id'),
                        partition_by=[F('b_id')],
                        order_by=F('id'))
    ).values('first_id')
))
You can try this:
from django.db.models import Count

result = (Table.objects
          .values('b_id')
          .annotate(count=Count('b_id')))
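If the goal is one complete row per b_id in id order, as in the example output, a possible PostgreSQL-only alternative is a sketch relying on DISTINCT ON via QuerySet.distinct() with a field name (the import path below is hypothetical):

from myapp.models import Table  # hypothetical import path for the model above

# one row per b_id, keeping the lowest id in each group (PostgreSQL only)
result = Table.objects.order_by('b_id', 'id').distinct('b_id')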
The data frame I am working on has a column named "Phone", and I want to split it on "/" or "," so that I get the data frame shown below, with the parts in separate columns. For example, the first row is 0674-2537100/101 and I want to split it on "/" into two columns with the values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------+----------------+
| Phone           | Phone1         |
+-----------------+----------------+
| 0674-2537100    | 0674-2537101   |
| 0674-2725627    |                |
| 0671 – 2647509  |                |
| 2392229         |                |
| 2586198         | 2583361        |
| 0663-2542855    | 0663-2405168   |
| 0674 – 2563832  | 0674-2590796   |
| 0671-6520579    | 0671-3200479   |
+-----------------+----------------+
Here I came up with a solution where I take the lengths of the strings on both sides of the separator ("/"), take their difference, and copy the substring of the first column up to [:difference-1] into the second column.
So far my progress is:
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error, and the column has only NaN values after I run this. Please help me out here.
Considering you're only going to have at most two values separated by '/' in the 'Phone' column, here's what you can do:
import pandas as pd

def split_phone_number(row):
    """
    This function takes in a row of the dataframe and returns the row with appropriate values.
    """
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with the appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row
# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349', '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, add that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
For further reading:
dataframe.apply()
Hope this helps. Cheers!
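The split above leaves the short second number (e.g. 101) as-is rather than expanding it to 0674-2537101 as in the desired output. Here is a rough sketch of the prefix-copying idea from the question, assuming it is acceptable to normalise the spaces and en-dashes to a plain '-':

import pandas as pd

df = pd.DataFrame({'Phone': ['0674-2537100/101', '0674-2725627', '0671 – 2647509',
                             '2392229', '2586198/2583361', '0663-2542855/2405168',
                             '0674 – 2563832/0674-2590796', '0671-6520579/3200479']})

# normalise separators, then split on '/'
cleaned = df['Phone'].str.replace(' ', '', regex=False).str.replace('–', '-', regex=False)
parts = cleaned.str.split('/', expand=True)
df['Phone'], df['Phone1'] = parts[0], parts[1]

def complete_second(row):
    """Prepend the prefix the short second number is missing, e.g. '101' -> '0674-2537101'."""
    first, second = row['Phone'], row['Phone1']
    if second is None or second == '':
        return second
    if len(second) < len(first):
        return first[: len(first) - len(second)] + second
    return second

df['Phone1'] = df.apply(complete_second, axis=1)
print(df)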
I have a dataframe which contains some products, a date, and a value. The dates have gaps of varying size between recorded values, and I want to fill them so that there is a value for every hour from the first time a product was seen to the last; where there is no record, I want to carry the latest value forward.
So, I have a dataframe like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
I want to create a new dataframe that looks like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 1 | 2020-03-12T02:00:00.000+0000 | 2 |
| 1 | 2020-03-12T03:00:00.000+0000 | 2 |
| 1 | 2020-03-12T04:00:00.000+0000 | 2 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T02:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
My code so far:
def generate_date_series(start, stop):
start = datetime.strptime(start, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
stop = datetime.strptime(stop, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
return [start + datetime.timedelta(hours=x) for x in range(0, (stop-start).hours + 1)]
spark.udf.register("generate_date_series", generate_date_series, ArrayType(TimestampType()))
df = df.withColumn("max", max(col("Date")).over(Window.partitionBy("ProductId"))) \
.withColumn("min", min(col("Date")).over(Window.partitionBy("ProductId"))) \
.withColumn("Dato", explode(generate_date_series(col("min"), col("max"))) \
.over(Window.partitionBy("ProductId").orderBy(col("Dato").desc())))
window_over_ids = (Window.partitionBy("ProductId").rangeBetween(Window.unboundedPreceding, -1).orderBy("Date"))
df = df.withColumn("Value", last("Value", ignorenulls=True).over(window_over_ids))
Error:
TypeError: strptime() argument 1 must be str, not Column
So the first question is obviously how do I create and call the udf correctly so I don't run into the above error.
The second question is how do I complete the task, such that I get my desired dataframe?
After some searching and experimenting I found a solution: I defined a UDF that returns the date range between two dates at 1-hour intervals, and then did a forward fill.
I fixed the issue with the following code:
import sys
from datetime import timedelta
from pyspark.sql.functions import col, explode, lag, last, lit, udf
from pyspark.sql.types import ArrayType, TimestampType
from pyspark.sql.window import Window

def missing_hours(t1, t2):
    # hourly timestamps strictly between the two recorded timestamps
    return [t1 + timedelta(hours=x) for x in range(1, int((t2 - t1).total_seconds() / 3600))]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))

window = Window.partitionBy("ProductId").orderBy("Date")
df_missing = df.withColumn("prev_timestamp", lag(col("Date"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("Date", explode(missing_hours_udf(col("prev_timestamp"), col("Date")))) \
    .withColumn("Value", lit(None)) \
    .drop("prev_timestamp")

# combine the original rows with the generated gap rows
df = df.union(df_missing)

window = Window.partitionBy("ProductId").orderBy("Date") \
    .rowsBetween(-sys.maxsize, 0)
# define the forward-filled column
filled_values_column = last(df['Value'], ignorenulls=True).over(window)
# do the fill
df = df.withColumn('Value', filled_values_column)
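For reference, a sketch of an alternative that avoids the UDF by using Spark's built-in sequence() function (available since Spark 2.4); it assumes df is the original frame from the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# hourly grid from each product's first to last timestamp
bounds = df.groupBy("ProductId").agg(F.min("Date").alias("start"), F.max("Date").alias("stop"))
grid = bounds.withColumn("Date", F.explode(F.expr("sequence(start, stop, interval 1 hour)"))) \
             .select("ProductId", "Date")

# left-join the recorded values onto the grid, then forward-fill per product
w = Window.partitionBy("ProductId").orderBy("Date").rowsBetween(Window.unboundedPreceding, 0)
filled = grid.join(df, ["ProductId", "Date"], "left") \
             .withColumn("Value", F.last("Value", ignorenulls=True).over(w))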
I have a csv file with the following data. I want to know how to multiply the values in the Qty column by those in the Avg cost column and then sum the results.
| Instrument | Qty | Avg cost |
|------------|------|-----------|
| APLAPOLLO | 1 | 878.2 |
| AVANTIFEED | 2 | 488.95 |
| BALAMINES | 3 | 308.95 |
| BANCOINDIA | 5 | 195.2 |
| DCMSHRIRAM | 4 | 212.95 |
| GHCL | 4 | 241.75 |
| GIPCL | 9 | 102 |
| JAMNAAUTO | 5 | 178.8 |
| JBCHEPHARM | 3 | 348.65 |
| KEI | 8 | 121 |
| KPRMILL | 2 | 592.65 |
| KRBL | 3 | 274.45 |
| MPHASIS | 2 | 519.75 |
| SHEMAROO | 2 | 400 |
| VOLTAMP | 1 | 924 |
Try this:
with open('yourfile.csv', 'r') as f:
    next(f)  # skip the header row
    temp_sum = 0
    for line in f:
        word = line.split(',')
        temp_sum = temp_sum + float(word[1]) * float(word[2])
print(temp_sum)
import pandas

colnames = ['Instrument', 'Qty', 'Avg_cost']
data = pandas.read_csv('test.csv', names=colnames, skiprows=1)
qty = data.Qty.tolist()
avg = data.Avg_cost.tolist()
mult = []
for i in range(len(qty)):
    mult.append(qty[i] * avg[i])
sum_all = sum(mult)
print(sum_all)
print(mult)
I saved the file as test.csv and did the following
import csv

with open('/tmp/test.csv', 'r') as f:
    next(f)  # skip the header row
    total = sum(int(row[1]) * float(row[2]) for row in csv.reader(f))
    print('The total is {}'.format(total))
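For completeness, a short pandas sketch of the same calculation, assuming the file keeps the header row shown above (note the space in the 'Avg cost' column name):

import pandas as pd

df = pd.read_csv('test.csv')
total = (df['Qty'] * df['Avg cost']).sum()
print(total)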