Schema for the csv file - apache-spark

I have one text file containing a single column, and another CSV file containing the data.
I need to read the schema from the text file and merge it with the CSV file.
Is this possible automatically, without using StructType or a case class? That is, can it just read the text file, take the whole column, transpose it, and paste it as the first row of that CSV file?
Text file
Column Header
Name
Age
Roll Number
Section
CSV File
Fred 25 123 A
Eyaz 26 456 B
Output
Name Age Roll_Number Section
Fred 25 123 A
Eyaz 26 456 B
Any help would be highly appreciated.
Thanks for the time!

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// y is the Dataset[String] of column names read from the text file
val dd = y.collect()
val schema = StructType(dd.map(fieldName => StructField(fieldName, StringType, nullable = true)))
println(schema)
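Outside Spark, the transpose-and-prepend idea from the question can be sketched in plain Python (the helper name and file names are illustrative, not from the question):

```python
import csv

def prepend_header(header_path, csv_path, out_path):
    """Read one column name per line from header_path and write those
    names as the first row of csv_path into out_path."""
    # Collect the single column of names from the text file
    with open(header_path) as f:
        header = [line.strip() for line in f if line.strip()]
    # Read the existing CSV rows
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    # Write the "transposed" names as the header row, then the data
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Within Spark itself, passing the collected names to StructType (as above) is, as far as I know, still the usual route, since the CSV reader has no option to take a header from a separate file.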

Related

Find and replace function in Alteryx - how it can be done in Azure Data Flow

I have a "Find and Replace" tool in Alteryx which finds a column value in CSV file 1 and replaces it using a lookup in CSV file 2, which has two columns:
Word and ReplacementWord.
Example:
Address is a column in CSV file 1 with values like St.Xyz,NY,100067
And CSV file 2 has
Word ReplacementWord
NY NewYork
ZBW Zimbabwe etc.
Now the final output should be
Address
St.Xyz,NewYork,100067
Any help would be appreciated.
I tried to reproduce your scenario in my environment; to achieve the desired output I followed the steps below.
In the Data Flow activity I took two sources:
Source 1 is the file which contains the actual address.
Source 2 is the file which contains the country codes with names.
After that I used a Lookup to merge the files on the country code. In the lookup condition I provided split(Address,',')[2] to split the address string on commas and take its second value (Data Flow arrays are 1-based), which is the country code in an address such as Xyz,NY,100067, and matched it against column_1 of the second source.
Lookup data preview:
Next I took a Derived Column and gave the column name as Address with the expression replace(Address, split(Address,',')[2], Column_2). It replaces the part we split out in the Lookup within the Address string with the value of Column_2.
Derived column preview:
Then I took a Select and deleted the unwanted columns.
Select preview:
Finally, I provided this to the sink dataset.
Output
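The split-and-replace logic above can be mirrored in a small Python sketch (data inlined from the example; the dictionary stands in for the lookup file):

```python
# Lookup table: Word -> ReplacementWord (from CSV file 2)
lookup = {"NY": "NewYork", "ZBW": "Zimbabwe"}

def replace_code(address, lookup):
    # Mirrors split(Address, ',')[2]: Data Flow arrays are 1-based,
    # so [2] is the second comma-separated part, index 1 in Python.
    code = address.split(",")[1]
    # Mirrors replace(Address, code, Column_2); falls back to the
    # original code when it has no replacement.
    return address.replace(code, lookup.get(code, code))

print(replace_code("St.Xyz,NY,100067", lookup))  # → St.Xyz,NewYork,100067
```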

How to select a column from a text file which has no header using python

I have a tab-separated text file. When I open the file in Python using pandas, it shows the file as containing only one column, although there are many columns in it. I've tried pd.DataFrame with sep='\s*' and sep='\t', but I can't select a column since pandas sees only one. I've even tried specifying the header, but the header moves to the far right and the whole file is still treated as a single column. I've also tried the .loc method with a specific column number, but it always returns rows. I want to select the first column (A, A), the third column (HIS, PRO) and the fourth column (0, 0), and print those specific columns to a CSV file.
Here is the code I have used along with some file components.
1) After opening the file using pd:
[599 rows x 1 columns]
2) The file format:
pdb_id: 1IHV
0 radii_filename: MD_threshold: 4
1 A 20 HIS 0 MaximumDistance
2 A 21 PRO 0 MaximumDistance
3 A 22 THR 0 MaximumDistance
3) code:
import pandas as pd
df= pd.read_table("file_path.txt", sep= '\t')
U= df.loc[:][2:4]
Any help will be highly appreciated.
If anybody gets a file like this, it can be opened and the columns can be selected using the following code:
f = open('file.txt', "r")
lines = f.readlines()
f.close()

result = []
for x in lines:
    # slice out the columns you want, e.g. zero-based columns 2-3
    result.append(x.split()[2:4])
for w in result:
    s = '\t'.join(w)
    print(s)
Adjust the slice to the columns you want to select.
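Staying in pandas, a whitespace regex separator together with usecols may be all that is needed; a sketch, with the column positions assumed from the excerpt above:

```python
import io
import pandas as pd

# Sample data mimicking the file excerpt (whitespace-separated, no header)
text = """1 A 20 HIS 0 MaximumDistance
2 A 21 PRO 0 MaximumDistance
3 A 22 THR 0 MaximumDistance
"""

# sep=r'\s+' splits on any run of whitespace; header=None treats the
# first line as data; usecols keeps only the wanted columns
# (zero-based positions 1, 3 and 4 here)
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, usecols=[1, 3, 4])
df.to_csv("selected_columns.csv", index=False, header=False)
print(df.values.tolist())  # → [['A', 'HIS', 0], ['A', 'PRO', 0], ['A', 'THR', 0]]
```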

How to delete the entire row if any of its cells are blank (or empty) in python

I have the following CSV data:
AAB BT 2 5 5
YUT HYT 89 52 3
JUI 10 2 3
HYIU 2 5 6
YHT JU 25 63 2
In the 3rd row (and 4th row), the first column (and 2nd column) elements are empty, so delete the entire 3rd (and 4th) row. Similarly, the entire 5th row is empty, so remove it as well.
My desired output should be:
AAB BT 2 5 5
YUT HYT 89 52 3
YHT JU 25 63 2
I use the following code, but it is not deleting them.
import csv

input_file = open(r'\\input.csv', 'r')
output = open(r'Outpur.csv', 'w', newline="")
writer = csv.writer(output)
for row in csv.reader(input_file):
    if any(field.strip() for field in row):
        writer.writerow(row)
input_file.close()
output.close()
Firstly, please see the comment from @RomanPerekhrest on your original question. This answer assumes that you are not intending to delete the 6th row of data in your example.
It looks like your line of code:
if any(field.strip() for field in row):
is returning true if any field in the row of data contains an entry.
I suspect that what you want (if I understand your requirement correctly), is for the row of data to be kept if all of the fields in the row contain an entry. Therefore, try changing your code to use:
if all(field.strip() for field in row):
Also, check your debugger to ensure that your code is correctly tokenizing the row by splitting it at each comma in the CSV file.
If field.strip() throws an exception (especially for empty fields), you might need to try employing a string length test instead. Something like:
if all(len(str(field).strip()) > 0 for field in row):
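Putting those corrections together, one possible version of the script (paths are parameters here rather than the hard-coded names from the question) is:

```python
import csv

def keep_complete_rows(in_path, out_path):
    """Copy only the rows in which every field is non-blank."""
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # keep the row only if it has fields and none are blank
            if row and all(field.strip() for field in row):
                writer.writerow(row)
```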
# The following line reads the entire data from your CSV file:
# data_list is a list with one entry per line, each containing the
# fields of that line separated by commas.
data_list = [[s.strip() for s in line.split(',')] for line in
             open('c:/python34/yourcsv.csv')]

# Now you can filter out any list inside data_list according to your
# choice, such as lists whose length is less than it should be.
filtered = [item for item in data_list if not (len(item) < 5)]

# Now you can overwrite the existing file, or simply save the result
# to a new file and do whatever you want; here we just print it.
for item in filtered:
    print(item)

Specific column import from text file

I have many text files. I want to import only one column from each text file and write it to an Excel file. How do I do this?
For example, text files 1 (1).txt and 1 (2).txt:
file 1 (1).txt has columns A, B, C, D
file 1 (2).txt has columns A1, B1, C1, D1
I would like to get
A.xls == B B1
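The question is light on detail, but assuming whitespace-separated text files and that each file should contribute one column of the output, a sketch (function name and paths are illustrative) could be:

```python
import csv

def collect_columns(paths, col, out_path):
    """Read column `col` (zero-based) from each text file and write
    the files' columns side by side; Excel opens the CSV directly."""
    cols = []
    for path in paths:
        with open(path) as f:
            cols.append([line.split()[col] for line in f if line.strip()])
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        # zip pairs up the per-file columns row by row
        for row in zip(*cols):
            writer.writerow(row)
```

Writing a real .xls/.xlsx would need a library such as openpyxl; a CSV is the dependency-free stand-in here.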

Excel formula by combining 2 sheets

I need help generating an Excel report. Can anyone please help me?
I have two Excel files. I have pasted their contents in the question.
file 1 (contains 3 columns: colA URLs, colB hits as numbers, colC some data):
Column A Column B Column C
----------------------------------------------------
$www.example1.com/ab 200 abc
file 2 (contains 2 columns: ColA URLs, ColB hits as numbers):
URL Hits
-----------------------------------------
$www.something.com/dir/abc 1000
$www.example1.com/ab 100
$www.example2.com/cd 50
$www.example1.com/ab 100
Steps:
Take ColA (URLs) from file 1 and search for each value in ColA (URL) of file 2.
Suppose we get 10 matches; I need the sum of all the corresponding ColB (Hits) values of file 2, placed in file 1's ColB for that row.
Any kind of hint would be helpful. I tried many options, but none of them worked.
This should be possible under the following conditions:
both files are open;
the URLs match exactly.
Then use a formula similar to this example:
=SUMIF([Name of file 2]NameOfSheet!$A$2:$C$6;A2;[Name of file 2]NameOfSheet!$B$2:$B$6)
Where $A$2:$C$6 is the range of data in file 2 and A2 is the cell with the value in file 1 and $B$2:$B$6 is the range of data to be summed up within file 2.
Hope this helps.
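Outside Excel, the same SUMIF-style aggregation can be sketched in Python (data inlined from the example above):

```python
from collections import defaultdict

# file 2 rows: (URL, Hits)
file2 = [
    ("$www.something.com/dir/abc", 1000),
    ("$www.example1.com/ab", 100),
    ("$www.example2.com/cd", 50),
    ("$www.example1.com/ab", 100),
]

# Sum the hits per URL, like SUMIF over the file 2 range
hits = defaultdict(int)
for url, n in file2:
    hits[url] += n

# Look up the URL taken from file 1
print(hits["$www.example1.com/ab"])  # → 200
```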
