Saving Spark Dataframe to csv has empty rows - apache-spark

LATER EDIT 2: I found the problem. I should probably delete this question, as the mistake I made is not related to what I'm asking; the source of the problem was somewhere else.
There are some nuggets of knowledge in it though, so I will leave it up unless the community decides to take it down.
LATER EDIT: Not sure why this did not come to me earlier, but the solution is to use dataframe.na.drop("all") to get rid of all the empty rows. I would still like to know why they appear, though. Other filters do not create these empty lines.
I can't find any answers or hints as to why this happens. I suspect filter is the culprit, but if so, I am not sure why, or how to fix it.
I define a dataframe as another dataframe filtered based on several conditions. Then I save it as CSV:
var dataframe = dataframe_raw.filter($"column1" !== $"column2" || $"column3"!==$"column4").drop($"column2").drop($"column4")
dataframe.write.mode("overwrite").option("header","true").csv("hdfs:///path/to/file/")
The problem is that the output "part" file(s) contain empty rows. Any idea why, and how to remove them?
Thank you.
Note: I also tried coalesce(1); that helps by producing a single output file, but it also contains empty rows.

I think the problem is related to operator precedence in Scala. To solve this, please try changing !== to =!=.
dataframe_raw.filter($"column1" =!= $"column2" || $"column3"=!=$"column4")
A second option is to add parentheses.
dataframe_raw.filter(($"column1" !== $"column2") || ($"column3"!==$"column4"))

Related

Turning a Dataframe into a Series with .squeeze("columns")

I'm studying how to work with data right now, and so I'm following along with a tutorial for working with Time Series data. Among the first things he does is call read_csv on the file path and then use squeeze=True to read it as a Series. Unfortunately (and as you probably know), squeeze has been deprecated from read_csv.
I've been reading documentation to figure out how to read a CSV as a Series, and everything I try fails. The documentation itself says to use pd.read_csv('filename').squeeze('columns'), but when I check the type afterward, it is always still a DataFrame.
I've looked up various other methods online, but none of them seem to work. I'm doing this on a Jupyter Notebook using Python3 (which the tutorial uses as well).
If anyone has any insights into why I cannot change the type in this way, I would appreciate it. I'm not sure if I've misunderstood the tutorial altogether or if I'm not understanding the documentation.
I do literally type .squeeze("columns") when I write this out because when I write a column name or index, it fails completely. Am I doing that correctly? Is this the correct method or am I missing a better method?
Thanks for the help!
shampoo = pd.read_csv('shampoo_with_exog.csv', index_col=[0], parse_dates=True).squeeze("columns")
I would start with this...
# Change the text between the quotes to the full file path of where your CSV is located.
df = pd.read_csv(r'c:\user\documents\shampoo_with_exog.csv')
To start, this will name your dataframe df, which is something of an unspoken industry standard, just like pd for pandas.
Additionally, this lets you use a "raw" string (the r prefix), which makes it easier to paste directory paths into your Python code.
Once you are able to run this successfully, you can simply put df in a separate cell in Jupyter; this will show you what the data from your CSV looks like. Once you have done all that, you can start manipulating your data. While you can use the fancier options in pd.read_csv(), I mostly just load the data and manipulate it from the code itself. Obviously there are reasons not to stop at a plain pd.read_csv(), but as you progress you can start adding things here and there. I almost never use squeeze, although I'm sure there will be those here to comment on how "essential" it is for whatever the specific case might be.
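On the squeeze question itself: .squeeze("columns") only collapses a DataFrame into a Series when exactly one column remains after reading, so if shampoo_with_exog.csv carries an extra (exogenous) column besides the index and the values, the call returns the DataFrame unchanged, which matches what is described above. A minimal sketch of both ways to end up with a Series:

import pandas as pd

df = pd.read_csv('shampoo_with_exog.csv', index_col=[0], parse_dates=True)
print(df.shape)              # more than one column here means squeeze('columns') is a no-op

s1 = df.squeeze('columns')   # a Series only if df has exactly one column
s2 = df.iloc[:, 0]           # or select one column explicitly to get a Series
print(type(s1), type(s2))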

merging two string variables results in empty value

I'd like to merge two string variables (STRING_VAR1 and STRING_VAR2) into one string variable STRING_ALL, such that the content of STRING_VAR1 or STRING_VAR2 is copied into STRING_ALL depending on which of those two variables contain any data (see example_dataset). If both variables STRING_VAR1 and STRING_VAR2 contain missing cases, STRING_ALL should be missing as well.
I've tried CONCAT (see code below) but that doesn't work for some reason and leaves me with only empty cases for STRING_ALL.
STRING STRING_ALL(A4)
COMPUTE STRING_ALL = CONCAT(STRING_VAR1, STRING_VAR2)
Thanks in advance!
Eli's suggestion gave you the necessary information to solve this specific issue. If you want to know why, check the Command Order topic in the SPSS Statistics Command Syntax Reference. It discusses the different types of commands and the fact that some of them, such as COMPUTE, do not take effect immediately, but are stored pending execution of a command that causes a data pass.
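As a concrete illustration of that point, a sketch of the syntax with an explicit data pass might look like the following; the A8 width and the RTRIM calls are assumptions (RTRIM strips the trailing blanks SPSS pads fixed-width strings with), while the EXECUTE is what forces the pending COMPUTE to actually run:

STRING STRING_ALL (A8).
COMPUTE STRING_ALL = CONCAT(RTRIM(STRING_VAR1), RTRIM(STRING_VAR2)).
EXECUTE.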

Python groupby returning single value between carets

Long time listener, first time caller. I am new to Python, about 3 days in, and I cannot figure out for the life of me what is happening in this particular instance.
I brought an XLSX file in as a dataframe called dfInvoice. I want to use groupby on two columns (indexes?), but something funky is happening, I think: I can't see my new grouped dataframe with the code below.
uniqueLocation = dfInvoice.groupby(['Location ID','Location'])
When I call uniqueLocation, all that is returned is this:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001B9A1C61198>
I have two questions from here.
1) What the heck is going on? I followed these steps almost identically to this tutorial (https://www.geeksforgeeks.org/python-pandas-dataframe-groupby).
2) What should I call this string of text between the carets? I didn't know how to search for what is happening because I don't exactly understand what this return is.
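For reference, that repr is what a lazy DataFrameGroupBy object looks like; nothing is computed until an aggregation is applied to it. A minimal sketch with made-up data:

import pandas as pd

dfInvoice = pd.DataFrame({
    'Location ID': [1, 1, 2],
    'Location': ['North', 'North', 'South'],
    'Amount': [10.0, 20.0, 5.0],
})

uniqueLocation = dfInvoice.groupby(['Location ID', 'Location'])
print(uniqueLocation)            # <pandas.core.groupby...DataFrameGroupBy object at 0x...>

print(uniqueLocation.size())     # applying an aggregation materializes something you can see
print(uniqueLocation['Amount'].sum())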

How can I add multiple GridBagLayout attributes in the 'Constraints' section of an element in Groovy (2.5.5)?

This is driving me mad at the moment; if anyone can help it would be much appreciated!! This is simple enough in Java, but when called from Groovy I cannot get multiple gbc properties defined in a single constraint.
I have read in a couple of old posts on the net that GridBagConstraints properties such as gridx etc. can be added as follows, from here.
code snippet of interest:
label(text:'Username', constraints:gbc(gridx:0,gridy:0,gridwidth:2))
However, this won't work for me, and I didn't expect it to, as the syntax appears to be from years ago, so I assume it is an old API. (Error message below when I try the above.)
Caught: java.lang.IllegalArgumentException: cannot add to layout: constraint must be a string (or null)
I can't see how this could work, as surely the format needs to be:
GridBagConstraints gbc = new GridBagConstraints()
label("Username: ", constraints:gbc.gridx=0)
The two lines of code above run, but then I have the problem that I can't add more than one entry in the 'constraints:' section, and obviously I need to add 'gridy=0' etc.
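In full, the multi-field version being described would look something like this (an untested sketch; it also assumes the enclosing container is actually laid out with gridBagLayout, since the object is only a valid constraint for that layout):

import java.awt.GridBagConstraints

def gbc = new GridBagConstraints()
gbc.gridx = 0
gbc.gridy = 0
gbc.gridwidth = 2
label("Username: ", constraints: gbc)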
Has anybody got any solution on how this should work?
Thanks
Taylor.

Cassandra comments data model

I am trying to store very simple comments in a wide row, but the problem is that I want to have top comments.
So at first I tried to use the UTF8 comparator type, where each column name would begin with the likes count and be followed by a timestamp, for example:
Comments_CF = {
    parent: {
        8_timestamp: comment,
        5_timestamp: comment,
        1_timestamp: comment,
        ...
    }
    ...
}
The problem with this approach is that, for example, 2_timestamp > 19_timestamp, because lexicographically "2" is greater than "19".
I could probably store top comments in a separate CF, but then I would need to do two queries instead of one, which I would really like to avoid. Any suggestions?
Two queries instead of one is usually not a big deal. You could also just store a composite value (number of likes + the comment) and sort the comments yourself. From what I have seen, there are never a lot of comments except on a few posts anyway, so that would be very quick.
There are other patterns that might spark ideas here as well...
https://github.com/deanhiller/playorm/wiki/Patterns-Page
Use a composite, where the first component is a long and the second is whatever type is appropriate for your timestamp format. This way the sorting will be correct.
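In CQL 3 terms, the same composite idea (likes first, then timestamp) could be sketched roughly as below; the table and column names are made up. Because likes is part of the clustering key, bumping a comment's like count means deleting and re-inserting the row, the same trade-off as with the composite column name:

CREATE TABLE comments (
    parent_id text,
    likes bigint,
    posted_at timestamp,
    comment text,
    PRIMARY KEY (parent_id, likes, posted_at)
) WITH CLUSTERING ORDER BY (likes DESC, posted_at DESC);

-- top comments for one parent, already sorted by likes, then recency
SELECT * FROM comments WHERE parent_id = 'post42' LIMIT 10;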
