I have just started learning data analytics and am trying to understand data cleaning step by step. I have an example dataset from freeCodeCamp:
https://github.com/ine-rmotr-curriculum/FreeCodeCamp-Pandas-Real-Life-Example
What confuses me is "data/sales_data.csv": when I read it in Jupyter, the data is not read properly (the column headers come out renamed). This is what JupyterLab shows:
And this is what the data should look like, before I upload it to Jupyter:
Can anyone help me with this problem? I tried other CSV files and did not see it.
Related
I'm using Jupyter Notebook to run PySpark code that imports a CSV file into Cassandra v3.11.3. I'm getting the error below.
... 1 more
I have attached the PySpark code as a picture:
[image: pyspark_code]
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method; we would really need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "MODE" - "DROPMALFORMED" is not a valid C* connector option. DataFrame writer and reader options are source-specific, so unfortunately you cannot mix and match.
This makes me think that the data being written actually has a malformed date string or two, and that this code is dying when it attempts to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
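The idea of validating dates while reading, rather than at write time, can be sketched without Spark. What follows is a plain-Python stand-in built on the standard csv module, not the Spark CSV reader or the Cassandra connector. The column name event_date, the date format and the sample rows are all assumptions made up for illustration.

```python
import csv
import io
from datetime import datetime


def read_rows_dropping_bad_dates(lines, date_column, date_format="%Y-%m-%d"):
    """Parse CSV text and keep only rows whose date column parses cleanly."""
    good_rows = []
    dropped = 0
    for row in csv.DictReader(lines):
        try:
            # Cast the date on read; a malformed or missing value raises here.
            row[date_column] = datetime.strptime(row[date_column], date_format).date()
        except (ValueError, TypeError):
            dropped += 1
            continue
        good_rows.append(row)
    return good_rows, dropped


# Hypothetical sample data: the second record has a broken date string.
SAMPLE = "id,event_date\n1,2018-09-01\n2,not-a-date\n3,2018-09-03\n"

rows, dropped = read_rows_dropping_bad_dates(io.StringIO(SAMPLE), "event_date")
print(len(rows), "rows kept,", dropped, "dropped")
```

In Spark itself, the CSV reader's mode option accepts DROPMALFORMED, so the equivalent filtering would happen on the read side, before anything is handed to the Cassandra writer.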
I'm having trouble plotting METAR data from the THREDDS data server, which comes in .nc format, onto a map. I'm using Siphon to grab the data and Xarray to open the dataset with remote_access, but the problem comes when I attempt to convert the data to .csv or .txt. When I use NetCDF Dataset and attempt to read the NetCDF4 file, I get a "file does not exist" error. I'm not really sure where to start, as there doesn't seem to be much on the internet about this. Any help would be appreciated, as I'm also really new to this.
I am using Google Colab for my research in machine learning.
I do many variations on a network and run them saving the results.
I have a part of my notebook that used to be a separate file (network.py). At the start of a training session I used to save this file in the directory that holds the results, logs, etc. Now that this part of the code is in the notebook it is easier to edit, but I no longer have a file describing the model to copy to the output directory. How do I take a section of a Google Colab notebook and save the raw code as a Python file?
Things I have tried:
%%writefile "my_file.py" - this is able to write the file; however, the classes are then not available to the runtime.
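One possible way around this, offered as a sketch and not taken from the original post: write the source to a file and then import that same file, so the saved copy and the classes in the runtime come from one place. The helper write_and_import, the Network class and the temporary directory are hypothetical names used only for illustration.

```python
import importlib.util
import os
import tempfile
from pathlib import Path


def write_and_import(source, path, module_name):
    """Save source code to a .py file, then load that file as a module."""
    Path(path).write_text(source)
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file so its classes exist
    return module


# Hypothetical model definition that would otherwise live in a notebook cell.
NETWORK_SOURCE = '''
class Network:
    def __init__(self, layers):
        self.layers = layers
'''

# Assumed output directory; in practice this would be the results folder.
output_dir = tempfile.mkdtemp()
saved_path = os.path.join(output_dir, "my_file.py")

network = write_and_import(NETWORK_SOURCE, saved_path, "my_file")
model = network.Network(layers=3)
print(saved_path, model.layers)
```

The trade-off is that the model code lives in a string rather than a normal cell, so it loses notebook syntax highlighting, but the file in the output directory is guaranteed to match what was actually run.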
I have set up a Google Cloud account.
I want to run my deep learning much faster in a Jupyter notebook, but
I cannot find a way to read my CSV file.
I downloaded it with wget from my GitHub account and afterwards I tried
dataset = pd.read_csv('/home/user/.jupyter/SIEMENSTRAIN.csv')
but I get the following error
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
Why? When I read it on my laptop using my jupyter notebooks, everything runs well
Any suggestions?
I tried the recommended solutions for this error and got the following warning
/home/user/anaconda3/lib/python3.5/site-packages/ipykernel/main.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
if __name__ == '__main__':
When I ran dataset.head() this is what appeared
Any help please?
A number of things could be causing the problem. I would first make sure that your pandas (pd) version is up to date and compatible.
The more likely cause is that the CSV itself is not right, so pd.read_csv() cannot parse it (hence the parser error). This may have something to do with the headers, though I'm not sure what your original CSV file looks like. It's worth playing around with read_csv, for example:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
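This changes two things: the delimiter, and whether pd reads a header row from the CSV. Here 'delimiter' presumably stands for the file's actual separator (for example ',' or ';'). Passed literally, a multi-character string is treated as a regular expression, which forces the Python engine and produces the ParserWarning about regex separators.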
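Before guessing at read_csv arguments, it can help to look at what the file actually contains. The sketch below uses only the standard library. The helper inspect_csv and the sample file are hypothetical. One thing worth ruling out (an assumption, since the file itself is not shown) is that a file fetched with wget from a GitHub page URL is the HTML page and not the raw CSV, which would also produce a field-count error.

```python
import csv
import os
import tempfile


def inspect_csv(path, sample_chars=4096):
    """Return (looks_like_html, delimiter) based on the start of the file."""
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        sample = handle.read(sample_chars)
    head = sample.lstrip().lower()
    if head.startswith("<!doctype html") or head.startswith("<html"):
        return True, None  # a web page was saved, not the data
    try:
        # Only consider a few common separators when guessing.
        dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    except csv.Error:
        return False, None  # could not determine a delimiter
    return False, dialect.delimiter


# Hypothetical semicolon-separated file standing in for the real CSV.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.csv")
with open(demo_path, "w", encoding="utf-8") as out:
    out.write("a;b;c\n1;2;3\n4;5;6\n")

is_html, delimiter = inspect_csv(demo_path)
print(is_html, repr(delimiter))
```

If a delimiter is detected, it can be passed as the sep argument of pd.read_csv. If the file turns out to be HTML, the fix is to download the raw file again, not to change the parser options.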
I go through some pd.read_csv() stuff in my book about Stock Prediction (another cool Machine Learning problem) and Deep Learning, feel free to check it out.
Good Luck!
I tried what you proposed and this is what I got
So, any suggestions?
I suppose that the path is ok, but it just won't be read properly, or am I wrong?
I'm currently taking the Introduction to Spark course on edX.
Is it possible to save DataFrames from Databricks to my computer?
I'm asking because this course provides Databricks notebooks, which probably won't work after the course ends.
In the notebook, data is imported using the command:
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets',
'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')
I found this solution but it doesn't work:
df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')
Databricks runs a cloud VM and does not have any idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df) and there's an option to download the results.
You can also save it to the file store and download it via its handle, e.g.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")
You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...
The download URL in this case (for a Databricks West Europe instance) is:
https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv
I haven't tested it, but I would assume that the 1 million row limit you would hit when downloading via the mentioned answer from #MrChristine does not apply here.
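The mapping between the save path and the download URL in this answer follows a pattern: the part after dbfs:/FileStore/ is appended to /files/ on the workspace host. A minimal sketch of that mapping, assuming the pattern holds for your workspace; the helper filestore_download_url and the shortened part file name are hypothetical.

```python
def filestore_download_url(dbfs_path, workspace_host):
    """Map a dbfs:/FileStore/... path to its /files/... download URL."""
    prefix = "dbfs:/FileStore/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("expected a path under dbfs:/FileStore/")
    relative = dbfs_path[len(prefix):]
    return "https://" + workspace_host + "/files/" + relative


# Hypothetical part file name; the real one is shown in the Databricks GUI.
url = filestore_download_url(
    "dbfs:/FileStore/df/df.csv/part-00000-example.csv",
    "westeurope.azuredatabricks.net",
)
print(url)
```

Note that the save call writes a directory named df.csv, so the URL has to point at the part file inside it, not at df.csv itself.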
Try this.
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")
This will save the file to the local file system of the Unix server.
If you give only /home/yphani/datacsv, it looks for the path on HDFS.