pyspark concatenate multiple csv files in one


I need to use the function concat(Path trg, Path[] psrcs) from org.apache.hadoop.fs with PySpark.
My code is:
orig1_fs = spark._jvm.org.apache.hadoop.fs.Path(f'{tmp_path}{filename1}')
orig2_fs = spark._jvm.org.apache.hadoop.fs.Path(f'{tmp_path}{filename2}')
dest_fs = spark._jvm.org.apache.hadoop.fs.Path(dest_path)
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.concat(dest_fs, list((orig1_fs , orig2_fs)))
But I get an error:
error
How can I use the function?

That's because the second argument of the concat method is a Java array (Path[]), not an ArrayList. A Python list is passed to the JVM as an ArrayList, so the array has to be built explicitly:
# build a Java `Path[]` from the Python list of Path objects
py_paths = [orig1_fs, orig2_fs]
java_paths = spark.sparkContext._gateway.new_array(spark._jvm.org.apache.hadoop.fs.Path, len(py_paths))
for i in range(len(py_paths)):
    java_paths[i] = py_paths[i]
# the new array can now be passed to concat
fs.concat(dest_fs, java_paths)

Related

Why is only half my data being passed into my dictionary?

When I run this script I can verify that it loops through all of the values, but not all of them get passed into my dictionary.
import re
import PyPDF2
file = open('path', 'rb')
readFile = PyPDF2.PdfFileReader(file)
lineData = {}
totalPages = readFile.numPages
for i in range(totalPages):
    pageObj = readFile.getPage(i)
    # extractText is a method, so call it to get the page text
    pageText = pageObj.extractText()
    newTrans = re.compile(r'Jan \d{2,}')
    for line in pageText.split('\n'):
        if newTrans.match(line):
            newValue = re.split(r'Jan \d{2,}', line)
            newValueStr = ' '.join(newValue)
            newKey = newTrans.findall(line)
            newKeyStr = ' '.join(newKey)
            print(newKeyStr + newValueStr)
            lineData[newKeyStr] = newValueStr
print(len(lineData))
There are 80+ data pairs, but when I run this the dict only gets 37.

Duplicate keys, maybe? A dict keeps only the last value stored under each key. Try making lineData a list (lineData = []) and appending to it with lineData.append({newKeyStr: newValueStr}), then check how many records you get.
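To see why duplicate keys would shrink the dictionary, here is a minimal sketch. The sample lines and the names `lines`, `as_dict`, `as_list` and `grouped` are made up for illustration; the real lines come from the PDF, which is not shown in the question.

```python
import re
from collections import defaultdict

# Hypothetical sample lines; two of them share the key "Jan 03".
lines = [
    "Jan 03 Coffee 4.50",
    "Jan 03 Groceries 52.10",
    "Jan 04 Fuel 40.00",
]

new_trans = re.compile(r"Jan \d{2,}")

as_dict = {}                  # one value per key
as_list = []                  # every (key, value) pair
grouped = defaultdict(list)   # every value, grouped by key

for line in lines:
    m = new_trans.match(line)
    if m:
        key = m.group(0)
        value = line[m.end():].strip()
        as_dict[key] = value            # a repeated key overwrites the earlier value
        as_list.append((key, value))    # a list keeps every pair
        grouped[key].append(value)      # a dict of lists keeps all values per key

print(len(as_dict))   # number of distinct dates
print(len(as_list))   # number of matched lines
```

If the two counts differ for the real data, the missing entries were overwritten rather than skipped, and a list of pairs or a dict of lists would keep them all.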

Remove .0 from a string in python

I'm reading a file called MissingItems.txt, the contents of which are a list of bar codes that looks like this:
[3000000.0, 5000000.0, 6000000.0, 7000000.0, 8000000.0, 1234567.0, 1234568.0, 9876543.0, 3000001.0, 5000001.0, 6000001.0, 7000001.0, 8000001.0, 1234561.0, 1234561.0, 9876541.0, 6000002.0, 7000002.0, 8000002.0, 1234562.0, 1234562.0, 9876542.0,9876543.0,9876544.0]
I have replaced the square brackets and then split the line as below:
OpenFile = open(r"G:MissingItems.txt","r")
for line in OpenFile:
    remove = line.replace('[','')
    remove1 = remove.replace(']','')
    plates = remove1.split(",")
    Plate1 = plates[0]
    Plate2 = plates[1]
    Plate3 = plates[2]
    Plate4 = plates[3]
    Plate5 = plates[4]
    Plate6 = plates[5]
    Plate7 = plates[6]
    Plate8 = plates[7]
    Plate9 = plates[8]
    Plate10 = plates[9]
    Plate11 = plates[10]
    Plate12 = plates[11]
    Plate13 = plates[12]
    Plate14 = plates[13]
    Plate15 = plates[14]
    Plate16 = plates[15]
    Plate17 = plates[16]
    Plate18 = plates[17]
    Plate19 = plates[18]
    Plate20 = plates[19]
    Plate21 = plates[20]
    Plate22 = plates[21]
    Plate23 = plates[22]
    Plate24 = plates[23]
Is there a way to remove the .0 from the bar codes, preferably before splitting? That way I would get '3000000' rather than '3000000.0'. I've tried to use replace, but I'm not sure how to make it recognize only the .0 at the end of each bar code.

This is one approach using ast.literal_eval and int.
Ex:
import ast
with open(r"G:MissingItems.txt","r") as infile:
    for line in infile:
        plates = [int(i) for i in ast.literal_eval(line.strip())]
        print(plates)
# --> [3000000, 5000000, 6000000, 7000000, 8000000, 1234567, 1234568, 9876543, 3000001, 5000001, 6000001, 7000001, 8000001, 1234561, 1234561, 9876541, 6000002, 7000002, 8000002, 1234562, 1234562, 9876542, 9876543, 9876544]

Your file seems to have JSON formatted lines, so you could use a JSON parser:
import json
with open(r"G:MissingItems.txt","r") as openfile:
    for line in openfile:
        plate = json.loads(line)
        print(plate)
This makes plate a list of numbers (not strings), so the difference between 3000.0 and 3000 disappears (as they are representations of the same number). It is only when you would need to output them in a decimal representation that you would worry about the number of decimals to output.
Secondly, it is bad practice to create separate variables Plate1, Plate2, ... In such a scenario you should work with a list and access the values with plate[0], plate[1], ...
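As a small illustration of the point about output formatting, the sketch below parses a shortened copy of the line from the question (inlined as a string rather than read from the file) and produces the bar codes as strings without the trailing .0. The names `as_strings` and `formatted` are illustrative.

```python
import json

# A shortened sample of the line stored in MissingItems.txt.
line = "[3000000.0, 5000000.0, 1234567.0,9876544.0]"

plates = json.loads(line)                     # a list of floats

# Two ways to render the numbers without decimals:
as_strings = [str(int(p)) for p in plates]    # convert to int first
formatted = [f"{p:.0f}" for p in plates]      # or format with zero decimals

print(as_strings)
print(plates[0])   # index into the list instead of using Plate1, Plate2, ...
```

Both approaches assume the values are whole numbers, as bar codes are; `int()` would silently drop any fractional part.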

Featuretools - LookupError: Time index not found in dataframe

I have an input dataframe which I have split up into 3 entities based on the attributes. When I try to generate features using featuretools, I get the above-mentioned error.
input dataframe in_df = ['UniqueID', 'disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id', 'Current_pincode_ID', 'Date.of.Birth', 'Employment.Type', 'DisbursalDate', 'State_ID', 'Employee_code_ID', 'MobileNo_Avl_Flag', 'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'PERFORM_CNS.SCORE', 'PERFORM_CNS.SCORE.DESCRIPTION', 'PRI.NO.OF.ACCTS', 'PRI.ACTIVE.ACCTS', 'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.SANCTIONED.AMOUNT', 'PRI.DISBURSED.AMOUNT', 'SEC.NO.OF.ACCTS', 'SEC.ACTIVE.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.SANCTIONED.AMOUNT', 'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT', 'SEC.INSTAL.AMT', 'NEW.ACCTS.IN.LAST.SIX.MONTHS', 'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS', 'AVERAGE.ACCT.AGE', 'CREDIT.HISTORY.LENGTH', 'NO.OF_INQUIRIES', 'loan_default']
I have split this up into 3 entities based on the information available on the dataset:
cust_cols = ['UniqueID','Current_pincode_ID', 'Employment.Type', 'State_ID', 'MobileNo_Avl_Flag', 'branch_id',
             'Aadhar_flag', 'PAN_flag', 'VoterID_flag', 'Driving_flag', 'Passport_flag', 'asset_cost', 'Date.of.Birth']
customers_df = df_raw_train[cust_cols]
loan_info_cols = ['UniqueID', 'disbursed_amount', 'asset_cost', 'ltv', 'branch_id', 'supplier_id', 'manufacturer_id',
                  'Employee_code_ID', 'loan_default', 'DisbursalDate']
loan_info_df = df_raw_train[loan_info_cols]
bureau_cols = ['UniqueID','PERFORM_CNS.SCORE', 'PERFORM_CNS.SCORE.DESCRIPTION', 'PRI.NO.OF.ACCTS', 'PRI.ACTIVE.ACCTS',
               'PRI.OVERDUE.ACCTS', 'PRI.CURRENT.BALANCE', 'PRI.SANCTIONED.AMOUNT', 'PRI.DISBURSED.AMOUNT',
               'SEC.NO.OF.ACCTS', 'SEC.ACTIVE.ACCTS', 'SEC.OVERDUE.ACCTS', 'SEC.CURRENT.BALANCE', 'SEC.SANCTIONED.AMOUNT',
               'SEC.DISBURSED.AMOUNT', 'PRIMARY.INSTAL.AMT', 'SEC.INSTAL.AMT', 'NEW.ACCTS.IN.LAST.SIX.MONTHS',
               'DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS', 'NO.OF_INQUIRIES']
bureau_df = df_raw_train[bureau_cols]
customers_df.set_index(['UniqueID', 'branch_id'], inplace=True, append=True)
loan_info_df.set_index(['UniqueID', 'branch_id'], inplace=True, append=True)
entities = {"customers": (customers_df, "UniqueID", "branch_id"),
            "loans": (loan_info_df, "UniqueID", "branch_id"),
            "bureau": (bureau_df, "UniqueID")}
relationships = [("loans", "UniqueID", "bureau", "UniqueID"),
                 ("customers", "branch_id", "loans", "branch_id")]
feature_matrix_customers, features_defs = ft.dfs(entities=entities, relationships=relationships, target_entity="customers")
I am getting the error "LookupError: Time index not found in dataframe".
Can someone explain why this error occurs, given that the featuretools docs do not mention any need to specify the time index?

Got this resolved by creating entity sets from dataframes.

Python add ', ' to string and return:

fd = open(nom_fichier, 'r')
liste_chaine = fd.readlines()
liste_chaine2 = []
for item in liste_chaine:
    if item not in "'noir\n','blanc\n','Humain\n', 'Ordinateur\n', 'False\n', 'True\n":
        liste_chaine2.append(item)
liste_chaine2 = [i.replace('\n', '') for i in liste_chaine2]
return liste_chaine2
['3,3,blanc', '3,4,noir', '4,3,noir', '4,4,blanc']
I am reading a file and trying to return a string output exactly like:
3,3,blanc
4,3,noir
3,4,white
I cleaned the file with the code above, but I need to turn this list into the required output.

You can split your string and put it together again to meet your requirements:
string = '33blanc 34noir 43noir 44blanche'
result = '\n'.join(['{},{},{}'.format(v[0], v[1], v[2:]) for v in string.split()])
print(result)
3,3,blanc
3,4,noir
4,3,noir
4,4,blanche
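If the cleaned list already holds comma-separated entries, as in the list printed in the question, joining it with newlines may be all that is needed. A minimal sketch, reusing the question's `liste_chaine2` values as a literal:

```python
# The cleaned list shown in the question.
liste_chaine2 = ['3,3,blanc', '3,4,noir', '4,3,noir', '4,4,blanc']

# One entry per line, with no brackets or quotes around the values.
result = '\n'.join(liste_chaine2)
print(result)
```

Inside the original function, `return '\n'.join(liste_chaine2)` would return this string instead of the list.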

How to return a string without quotes

I have list:
itemid = ['113222408782', '113223652945', '113222268092', '113223761722', '113222277037', '113223676589', '113214024190', '113227956444', '113222400375', '113222383960', '113223749386', '113213898511', '113223653433', '113214060057', '113212059543', '113223647852', '113212403974', '113222230789', '113212110156', '113213917508', '113223748917', '113212088893', '113213936773', '113212282559', '113222369037', '113223645004', '113214034011', '113223647208', '113222397481', '113212052765', '113212136602', '113212037895', '113222210185', '113223752305', '113212049744', '113212400978', '113212274566', '113218830085', '113203034623', '113222199167', '113223648988', '113223646543', '113223651519', '113222200831', '113213996789', '113214000484', '113213890605', '113222232853', '113222298617', '113223753658', '113222238111', '113194336951', '113223631876', '113222242464', '113212123303', '113222215450', '113214000567', '113223642160', '113223639750', '113214060070', '113223644511', '113194332243', '113212139900', '113222207007', '113222374260', '113223719876', '113194339799', '113223677943', '113212417158', '113212433693', '113227977319', '113223607151', '113212409228', '113215809743', '113214051350']
This list contains 75 values. I'm cutting it into 20-item lists using the following method:
while len(itemid) > 0:
    slice = itertools.islice(itemid, 20)
    ha = []
    for x in slice:
        ha.append(format(x))
    var1 = ','.join(ha)
var1 is a string with 20 values like this:
113222408782,113223652945,113222268092,113223761722,113222277037,113223676589,113214024190,113227956444,113222400375,113222383960,113223749386,113213898511,113223653433,113214060057,113212059543,113223647852,113212403974,113222230789,113212110156,113213917508
And then I got stuck:
I'm using it for the eBay API and I want to build the following request:
api_request = {'ItemID': var1}
and it returns:
{'ItemID': '113223748917,113212088893,113213936773,113212282559,113222369037,113223645004,113214034011,113223647208,113222397481,113212052765,113212136602,113212037895,113222210185,113223752305,113212049744,113212400978,113212274566,113218830085,113203034623,113222199167'}
But I need to return the string var1 without quotes like this:
{'ItemID': 113223748917,113212088893,113213936773,113212282559,113222369037,113223645004,113214034011,113223647208,113222397481,113212052765,113212136602,113212037895,113222210185,113223752305,113212049744,113212400978,113212274566,113218830085,113203034623,113222199167}
How can I do this?

The quotes are not actually part of the string; they just signify that it is a string. Therefore you cannot remove them.

You do not need to convert the list to a string; pass the list itself, like this:
var1 = list(itertools.islice(itemid, 20))
api_request = {'ItemID': var1}
Why are you formatting it as a string if you need to pass it as an array?
Regards,
Igor Quirino
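Separately from the quoting issue, `itertools.islice(itemid, 20)` on a list always yields the first 20 items, so the loop in the question never advances. Below is a minimal sketch of batching by index instead. The IDs are generated placeholders, `chunks` and `api_requests` are hypothetical names, and whether the eBay API expects a list or a comma-separated string is an assumption to check against its documentation.

```python
# Placeholder IDs standing in for the 75 real item IDs.
itemid = [str(113200000000 + n) for n in range(75)]

def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

api_requests = []
for batch in chunks(itemid, 20):
    # Pass `batch` itself if the API accepts a list;
    # join it only if a comma-separated string is required.
    api_requests.append({'ItemID': batch})

print(len(api_requests))                 # number of batches
print(len(api_requests[-1]['ItemID']))   # size of the last batch
```

Slicing by index leaves `itemid` untouched and handles a final batch that is shorter than 20.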
