How to read a specific text file in pandas

I want to read a specific line in a CSV file with pandas in Python.
Here is the structure of the file:
file:
example
What would be the best way to fill the values into a DataFrame, with the correct parameter names?
Thanks for the help.

Possible methods:
The pandas.read_table method seems to be a good way to read a tabular data file (optionally in chunks).
doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html
pandas also has a fast, compiled CSV reader, pandas.read_csv.
doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Ref Link: https://codereview.stackexchange.com/questions/152194/reading-from-a-txt-file-to-a-pandas-dataframe
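As a sketch of the two readers above (the CSV content and column names here are made up for illustration; with a real file you would pass its path instead of a StringIO):

```python
import io

import pandas as pd

# Made-up CSV content standing in for the file in the question.
csv_text = "name,value\na,1\nb,2\nc,3\n"

# read_csv parses the whole file into a DataFrame in one go.
df = pd.read_csv(io.StringIO(csv_text))

# read_table is the same reader with sep="\t" as its default,
# so pass sep="," for a comma-separated file.
df2 = pd.read_table(io.StringIO(csv_text), sep=",")

# Reading in chunks: chunksize=2 yields DataFrames of up to 2 rows each.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))

print(df.columns.tolist())  # column names taken from the header row
print(len(chunks))
```

Both calls produce the same DataFrame; read_table only differs in its default separator.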

Related

Is it possible in Pyspark to get the csv representation of a dataframe as a string?

I'm trying to get the same result as a pandas to_csv called without a path argument. Currently I'm saving the dataframe as a csv to then read it and I'd like to avoid this step.
path_or_buf: str or file handle, default None
File path or object. If None is provided the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a 'b'.
Because the dataset is large, the toPandas function doesn't work.
Does anyone know if this is possible in PySpark, or know a workaround?
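For reference, the pandas behavior being replicated here, to_csv with no path argument returning the CSV as a string, looks like this (the small DataFrame is made up for illustration):

```python
import pandas as pd

# A made-up DataFrame standing in for the real data.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# With path_or_buf=None (the default), to_csv returns the CSV text
# instead of writing a file.
csv_string = df.to_csv(index=False)
print(csv_string)
```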
You can use to_csv (note that this needs from pyspark.sql import functions as F, and F.to_csv is available from Spark 3.0):
csv_string = df.agg(F.concat_ws('\n', F.collect_list(F.to_csv(F.struct(df.columns))))).head()[0]
You can also use to_csv to convert the struct of columns to CSV row by row, as below:
from pyspark.sql import functions as f
df.select(f.to_csv(f.struct(df.columns))).show(truncate=False)

Python 3: Read text file that is in list format

I have one large text file that contains data in the form of a list, all on one line. See example:
Text file contents: [{"input": "data1"}, {"input": "data2"}, {"input": "data2"}]
I am reading this file with Python 3, and when I use the read() method I get one large string. I want to convert this string to a list while maintaining the same format as in the text file. Is there any way to achieve this? Most posts suggest the split method, which does not work for this case.
In JavaScript I generally use the stringify and parse methods for these kinds of conversions, but I can't find the equivalent in Python. Any help will be appreciated. Thank you.
You can load JSON from a file using Python's built-in json package.
>>> import json
>>> with open('foo.json') as f:
... data = json.load(f)
...
>>> print(data)
[{'input': 'data1'}, {'input': 'data2'}, {'input': 'data2'}]
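If the data is already in a string, the direct counterparts of the JavaScript methods mentioned in the question are json.loads (like JSON.parse) and json.dumps (like JSON.stringify):

```python
import json

text = '[{"input": "data1"}, {"input": "data2"}, {"input": "data2"}]'

# json.loads parses a JSON string into Python objects (like JSON.parse).
data = json.loads(text)
print(data[0]["input"])  # data1

# json.dumps serializes Python objects back to a JSON string
# (like JSON.stringify).
round_trip = json.dumps(data)
print(json.loads(round_trip) == data)  # True
```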

Python script that reads csv files

I need a script that reads CSV files, gets the headers, and filters by a specific column. I have tried researching this but haven't found anything of quality.
Any help will be deeply appreciated.
There's a standard csv library included with Python.
https://docs.python.org/3/library/csv.html
Its csv.DictReader yields one dictionary per row, using the first row of the CSV as the keys.
You can also use pandas.read_csv for the same task.
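A minimal sketch with csv.DictReader (the CSV content, the city column, and the filter value are made up for illustration; with a real file, replace the StringIO with open("data.csv", newline="")):

```python
import csv
import io

# Made-up CSV data standing in for the real file.
csv_text = "name,city\nAlice,Paris\nBob,London\nCara,Paris\n"

reader = csv.DictReader(io.StringIO(csv_text))
print(reader.fieldnames)  # headers, taken from the first row

# Keep only the rows whose "city" column matches a value.
rows = [row for row in reader if row["city"] == "Paris"]
print([row["name"] for row in rows])
```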

How to import data in csv format in J?

I want to know how I can import CSV data and then how I can work with it.
I have loaded the file but do not know how to read it.
'',' fixdsv dat ] load '/Users/apple/Downloads/data'
Assuming that the file /Users/apple/Downloads/data is a csv file then you should be able to load it into a J session as a boxed table like this:
load 'csv'
data=: readcsv '/Users/apple/Downloads/data'
If the file uses delimiters other than commas (e.g. Tabs) then you could use the tables/dsv addon.
data=: TAB readdsv '/Users/apple/Downloads/data'
See the J wiki for more information on the tables/csv and tables/dsv addons.
After loading the file, I would start by reading it into a variable and then working with that.
data=: 1!:1 <'filepath/filename' NB. the path and filename need to be a boxed string
http://www.jsoftware.com/help/dictionary/dx001.htm
You could also look at Jd, which is specifically a relational database system, if you are more focused on file management than data processing.
http://code.jsoftware.com/wiki/Jd/Index

Get HDFS file path in PySpark for files in sequence file format

My data on HDFS is in Sequence file format. I am using PySpark (Spark 1.6) and trying to achieve two things:
Data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles but I think that might not support Sequence file format.
How do I deal with the point above if I want to crunch data for a day and want to bring in the date into the data? In this case I would be loading data like yyyy/mm/dd/* format.
Appreciate any pointers.
If the stored types are compatible with SQL types and you use Spark 2.0, it is quite simple. Import input_file_name:
from pyspark.sql.functions import input_file_name
Read file and convert to a DataFrame:
df = sc.sequenceFile("/tmp/foo/").toDF()
Add file name:
df = df.withColumn("input", input_file_name())
If this solution is not applicable in your case, then a universal one is to list the files directly (for HDFS you can use the hdfs3 library):
files = ...
read one by one adding file name:
def read(f):
    """Just to avoid problems with late binding."""
    return sc.sequenceFile(f).map(lambda x: (f, x))
rdds = [read(f) for f in files]
and union:
sc.union(rdds)
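For the first point, once the path is attached to each record, the yyyy/mm/dd/hh timestamp can be pulled out of it with a plain regex (the sample path below is made up; this is ordinary Python and can be used inside a map over the (path, value) pairs as well):

```python
import re
from datetime import datetime

# Made-up HDFS path following the yyyy/mm/dd/hh layout from the question.
path = "hdfs://namenode/data/events/2016/03/14/09/part-00000"

# Capture the four date components from the path.
match = re.search(r"/(\d{4})/(\d{2})/(\d{2})/(\d{2})/", path)
year, month, day, hour = (int(g) for g in match.groups())
ts = datetime(year, month, day, hour)
print(ts.isoformat())
```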
