I have two CSV files: Contacts and Users.
How I load the data into dataframes and merge them
First, I load a dataframe with the names of the users:
import pandas as pd
import numpy as np
df_users= pd.read_csv('./Users_001.csv',sep=',',usecols=[0,2,3])
Then I load the contact information for each user:
df_contacts = pd.read_csv('./Contacts_001.csv',sep=',',usecols=[0,1,5,48,55,56,57,83,58])
The df_users column names are: user_id, Name, Surname
The df_contacts column names are: Contact ID, id user owner, fullname, qualification, ...
I want to merge both dataframes using user_id and 'id user owner', since they represent the same information. To do this, I first rename the columns of df_contacts and then merge:
df_contacts.columns = ['ID de Contacto','user_id','fullname','qualification','accesibility', ... ]
df_us_cont = pd.merge(df_contacts, df_users, on='user_id')
Now df_us_cont has the information from users and contacts.
What I want to do
There are only 18 user_ids but 500 contacts. For each user I want to know:
Number of contacts with qualification < 100
Of the contacts with qualification < 100, how many have accesibility >= 4 (accesibility is a discrete number from 0 to 5)
Number of contacts with qualification > 100 and < 300
Number of contacts with qualification > 300
What I have tried and failed
df_qua_lower100 = df_us_cont[df_us_cont['qualification']<100]
df_qua_lower100['user_id'].value_counts()
So far with this I am able to get, for each user_id, how many contacts have qualification < 100, but I am unable to also check how many of those have accesibility >= 4.
I have tried to explain this as best I could.
First of all, you can merge without changing the column names:
df_us_cont = df_contacts.merge(df_users, left_on='id user owner', right_on='user_id')
You can add as many conditions as you want if you use loc.
Number of contacts with qualification < 100 and accesibility >= 4:
df_us_cont.loc[(df_us_cont['qualification'] < 100) & (df_us_cont['accesibility'] >= 4), 'user_id'].value_counts()
Number of contacts with qualification > 100 and < 300:
df_us_cont.loc[(df_us_cont['qualification'] > 100) & (df_us_cont['qualification'] < 300), 'user_id'].value_counts()
Number of contacts with qualification > 300:
df_us_cont.loc[df_us_cont['qualification'] > 300, 'user_id'].value_counts()
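If you want all four per-user counts in a single table, here is a minimal sketch using boolean indicator columns and groupby().sum(); it assumes df_us_cont and the column names from above, and values of exactly 100 or 300 fall outside every bucket, matching the strict inequalities in the question:
counts = (
    df_us_cont.assign(
        qual_lt_100=df_us_cont['qualification'] < 100,
        lt_100_acc_ge_4=(df_us_cont['qualification'] < 100) & (df_us_cont['accesibility'] >= 4),
        qual_100_to_300=(df_us_cont['qualification'] > 100) & (df_us_cont['qualification'] < 300),
        qual_gt_300=df_us_cont['qualification'] > 300,
    )
    .groupby('user_id')[['qual_lt_100', 'lt_100_acc_ge_4', 'qual_100_to_300', 'qual_gt_300']]
    .sum()  # True counts as 1, so the sum is the per-user count for each condition
)
print(counts)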
Related
I want to read data in Spark SQL by joining two very large tables, but I only need a fixed number of rows (let's say 500) from the resulting dataframe.
For example:
SELECT id, name, employee.deptno, deptname
FROM employee INNER JOIN department ON employee.deptno = department.deptno
Here I can use the head(500) or limit(500) function on the resulting dataframe to limit its rows, but it is still going to read the full data from both tables first and only then apply the limit to the result.
Is there a way I can avoid reading the full data before applying the limit?
Something like this:
employee = spark.sql('select id, name, deptno from employee limit 500')
department = spark.sql('select deptno, deptname from department limit 500')
employee = employee.join(department, on = 'deptno', how = 'inner')
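For reference, the same idea written with the DataFrame API instead of spark.sql (this only restates the snippet above; note that taking 500 rows from each table before the join is not guaranteed to produce 500 rows after the join):
# Sketch of the approach above with the DataFrame API; 'spark' is the existing SparkSession.
employee = spark.table('employee').select('id', 'name', 'deptno').limit(500)
department = spark.table('department').select('deptno', 'deptname').limit(500)
employee = employee.join(department, on='deptno', how='inner')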
I am trying to write a Python pandas function to calculate the number of instances against "Customer Name" and "Customer ID", i.e. the "Number of tickets".
I have noticed that in some rows the customer ID was not present despite the row having a ticket and a customer name.
I want a table with the count of tickets (rows) that were opened under a customer name but had no customer ID.
I expect to get a table similar to the one below:
I used a lambda function:
for CustomerID in CustomerIDList:
    CustomerID_Series = ExlReport.apply(lambda x: x['Account Name'] == CustomerID, axis=1)
    numOfRows = len(CustomerID_Series[CustomerID_Series == True].index)
    countList.append(numOfRows)
I get the output below:
So my question:
How do I use lambda with multiple conditions based on Customer Name and Customer ID, and sum the row counts / ticket numbers for:
1) Customer Name is present but the Customer ID is missing.
2) Both Customer Name and Customer ID are present.
I thought of using a for loop over the customer names instead of the customer ID list, but I am not sure how to do so.
Thanks
I did not get any answer to this question; however, I solved it myself without using lambda and without for loops. Adding the solution for reference:
I replaced the empty cells in 'Customer ID' with the string 'N/A' and used groupby and size() to get the ticket count.
Solution as below:
1- Sort the df so the empty cells (NaN values) come first:
df = (df[['Customer Name','Customer ID']]).sort_values(by=["Customer Name","Customer ID"],axis = 0,ascending = [True,True],na_position='first')
2- Replace the empty cells in the 'Customer ID' column with the string 'N/A':
df['Customer ID'] = df['Customer ID'].replace(np.nan, 'N/A')
3- Use groupby to group the df by the two columns; priority is given to "Customer Name"
4- size() is used to count the rows
5- reset_index is needed to add the column name "Ticket Count" to the df:
df = df.groupby(["Customer Name","Customer ID"]).size().reset_index(name="Ticket Count")
6- Replace the 'N/A' back with '':
df["Customer ID"] = df["Customer ID"].replace('N/A','')
7- Output similar to the below:
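Putting steps 1-6 together, a minimal end-to-end sketch (the column names come from the question; the sample frame below is only illustrative):
import numpy as np
import pandas as pd

# Illustrative data: one ticket per row, some rows missing the Customer ID.
df = pd.DataFrame({
    "Customer Name": ["Acme", "Acme", "Acme", "Bolt"],
    "Customer ID":   ["A1", np.nan, np.nan, "B7"],
})

df = df[["Customer Name", "Customer ID"]].sort_values(
    by=["Customer Name", "Customer ID"], ascending=[True, True], na_position="first")
df["Customer ID"] = df["Customer ID"].replace(np.nan, "N/A")
df = df.groupby(["Customer Name", "Customer ID"]).size().reset_index(name="Ticket Count")
df["Customer ID"] = df["Customer ID"].replace("N/A", "")
print(df)  # one row per (Customer Name, Customer ID) pair, with its ticket count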
I have not worked with Pandas before and I am seeking guidance on the best course of action.
Currently, I have an Excel (.xlsx) spreadsheet that I am reading into a pandas DataFrame. The spreadsheet contains account data, document control number, contract id, manufacturer contract id, series number, include exclude, start date, end date, and vendors customer id.
From that data, every account number needs to be copied to every row of the contract data (document key co, document control number, contract id, manufacturer contract id, series number, include exclude, start date, end date, and vendors customer id).
Here is a sample of the data:
I've read in the DataFrame and iterated over the DataFrame with the following code:
#reads in template data. Keeps leading zeros in column B and prevents "NaN" from appearing in blank cells
df = pd.read_excel('Contracts.xlsx', converters = {'document_key_co' : lambda x: str(x)}, na_filter = False)
#iterates over rows
for row in df.itertuples():
    print(row)
After doing those things, that is where I am stuck. The desired outcome is this:
As you can see, the three accounts are copied to each of the contract ids.
Reading through the pandas documentation, I considered separating each account into a separate DataFrame and concat/merging it into another DataFrame containing document key co through vendors customer id, but that felt like a lot of extra code when there is likely a better solution.
I was able to accomplish the task utilizing this snippet of code:
concats = []
for x in df.account.values:        # one copy of the frame per account number
    concats.append(df.copy())
    concats[-1].account = x        # overwrite the account column with that number
df = pd.concat(concats)
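For reference, an alternative sketch using a cross join; this assumes the account numbers sit in an 'account' column and every row carries the full set of contract columns, as in the loop above. It is not byte-for-byte identical (it deduplicates the account numbers first) and needs pandas >= 1.2 for how='cross':
# Pair every distinct account number with every contract row.
accounts = df[['account']].drop_duplicates()
contracts = df.drop(columns='account')
expanded = accounts.merge(contracts, how='cross')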
I have the following two Access tables
Employees
id  Name
1   bob smith
2   james bird
3   jane big
Events
id  emp_id  Notes
1   1       fell down the stairs
2   3       paper cut in the break room
I also have the following Excel file that I would like to import into the Events table. The problem is that the data needs to be correlated on the name/emp_id field, and I'm not sure of the best way to do this.
Excel_sheet
Employee Name  Notes
bob smith      fell asleep while driving
The Access table references the Employees table by id, whereas the Excel sheet uses names. What are my options for bringing this Excel sheet into the Events table and converting the names (bob smith) into their associated ids from the Employees table?
Assuming names are consistently spelled in both datasets and only one person exists for each name, try:
INSERT INTO Events(emp_ID, Notes) SELECT ID, Notes FROM excel_sheet INNER JOIN Employees ON Employees.Name=excel_sheet.[Employee Name];
Build that SQL in a query object or run in VBA:
CurrentDb.Execute "INSERT INTO Events(emp_ID, Notes) " & _
"SELECT ID, Notes FROM excel_sheet " & _
"INNER JOIN Employees ON Employees.Name=excel_sheet.[Employee Name];"
I suggest you test with a copy of the database.
Name is a reserved word, and you really should not use reserved words as names for anything.
Let's assume we have a big CSV/Excel file with a large number of records against the following fields:
1. Email
2. First Name
3. Last Name
4. Phone Number, etc.
Among these records, we need to identify the duplicate records based on matching Email, First Name, and Last Name.
For duplicate detection, some custom rules are defined which give a score for an individual record.
For example:
1. If the email is an exact match, the score is 100; else 0.
2. For First Name, Last Name, etc., the score is a similarity derived from the edit distance.
For example, let's assume the search parameters are the following:
Email: xyz@gmail.com, First Name: ABCD, Last Name: EFGH
The rows/records are:
1. Email: xyz@gmail.com, First Name: ABC, Last Name: EFGH
2. Email: 123xyz@gmail.com, First Name: ABC, Last Name: EFGH
For record 1, score = 100 (email) + 75 (first name) + 100 (last name) = 275 out of 300, i.e. 91.6%.
For record 2, score = 0 (email) + 75 (first name) + 100 (last name) = 175 out of 300, i.e. 58.3%.
The duplicate detection threshold is 75%, so record 1 is a duplicate and record 2 is not. This is fairly simple to implement when we have the input parameters and want to use them to find duplicates in a file.
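For clarity, a minimal sketch of this per-record scoring in Python; the name_score rule is an assumption (0-100 similarity = 100 * (1 - edit distance / longer length)), which reproduces the 75 for ABCD vs ABC:
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def name_score(a: str, b: str) -> float:
    # Assumed rule: similarity derived from the edit distance, on a 0-100 scale.
    longest = max(len(a), len(b)) or 1
    return 100 * (1 - levenshtein(a, b) / longest)

def record_score(query: dict, record: dict) -> float:
    email = 100 if query["Email"] == record["Email"] else 0
    first = name_score(query["First Name"], record["First Name"])
    last = name_score(query["Last Name"], record["Last Name"])
    return (email + first + last) / 300  # fraction of the maximum possible score

query = {"Email": "xyz@gmail.com", "First Name": "ABCD", "Last Name": "EFGH"}
rec1 = {"Email": "xyz@gmail.com", "First Name": "ABC", "Last Name": "EFGH"}
rec2 = {"Email": "123xyz@gmail.com", "First Name": "ABC", "Last Name": "EFGH"}
print(record_score(query, rec1))  # ~0.917 -> duplicate at a 75% threshold
print(record_score(query, rec2))  # ~0.583 -> not a duplicate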
But how do we apply this logic when all the records are in one file and we need to find out which of them are duplicates?
Here no input parameters are defined, and we need to compare each record against all other records to compute the relevance score.
How can this be achieved in Apache Spark?
Load the data into Spark and apply a group-by on the email column. After that, within each resulting bag, apply any distance algorithm on the first name and last name columns. This should be pretty straightforward in Spark.
val df = spark.read.option("header", "true").csv("hdfs:///path/of/data")  // load the records
df.rdd
  .map(row => (row.getAs[String]("Email"), row)) // key each record by its email
  .groupByKey()                                  // one bag of candidate duplicates per email
  .map { case (email, bag) => (email, bag) }     // apply the edit-distance scoring within each bag here