Spark's .tgz File cannot be extracted on Google Colab? - apache-spark

Running these commands seems to work:
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
Then the following line does not work and produces this error:
!tar xf spark-3.3.0-bin-hadoop3.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
When I run !pwd, I am in the same folder where my spark-3.3.0-bin-hadoop3.tgz is located.

Extracted from OP's question:
For everyone else hitting this same error, forget the whole thing.
There is a much easier way that takes only five lines of code.
Run these instead and PySpark should be set up automatically in Google Colab:
!pip install pyspark
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark
# Import a Spark function
from pyspark.sql.functions import col
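If you want a quick sanity check that the session is actually usable, here is a minimal smoke test (only a sketch, assuming nothing beyond the standard PySpark API shown above; the two-row DataFrame is hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse (or create) the local Spark session
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Build a tiny hypothetical DataFrame and apply a trivial filter with col()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.filter(col("id") > 1).show()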

I think you should remove the -q flag from the wget command and see what's happening.
The thing is, I could only reproduce your problem with the following actions:
Suppose I accidentally tried to download Spark from the following link (redirector link):
!wget -q https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
The command above downloads a file, but it is basically HTML, saved under the filename spark-3.3.0-bin-hadoop3.tgz.
Suppose also that later I discovered my mistake, and decided to download from the proper link. I removed the -q flag to show what's happening:
!wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
2022-09-12 12:03:09 (224 MB/s) - ‘spark-3.3.0-bin-hadoop3.tgz.1’ saved [299321244/299321244]
Since a file named spark-3.3.0-bin-hadoop3.tgz already exists, wget saves the download under another filename (spark-3.3.0-bin-hadoop3.tgz.1).
So, when I try to unpack the file, basically I'm trying to unpack the first downloaded file, i.e. the wrong one:
!tar xf spark-3.3.0-bin-hadoop3.tgz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
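If you are unsure which of the two downloads you ended up with, one way to check (a small sketch in plain Python, run from a Colab cell; the filename is the one from the question) is to look at the file's first two bytes, since a genuine .tgz starts with the gzip magic number 0x1f 0x8b while an HTML page does not:
# Check whether the downloaded file is really gzip-compressed
with open("spark-3.3.0-bin-hadoop3.tgz", "rb") as f:
    magic = f.read(2)
print("gzip archive" if magic == b"\x1f\x8b" else "not gzip (probably an HTML page)")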

Related

Using pseudo to retain xattrs when extracting tar

I'm trying the following:
1. Use pseudo to pack an archive (bz2) that has files with security xattrs set.
2. Use pseudo again to unpack the archive and keep the security xattrs of the files.
Unfortunately, the extraction fails with the following message coming from pseudo:
$ pseudo tar -cjpf myarchive.tar.bz2 --xattrs --xattrs-include='*' myFile
# checked the contents in the meantime and looked good
$ pseudo tar -C unpack-folder/ -xpjf myarchive.tar.bz2 --xattrs --xattrs-include='*'
Warning: PSEUDO_PREFIX unset, defaulting to /home/user/tmp/.
got *at() syscall for unknown directory, fd 4
unknown base path for fd 4, path myFile
couldn't allocate absolute path for 'myFile'.
tar: myFile: Cannot open: Bad address
tar: Exiting with failure status due to previous errors
pseudo: 1.9.0
tar: 1.34
Do you have any idea what the problem could be, or another way to preserve the xattrs of the files when extracting the contents of the archive?

Pyspark: Load a tar.gz file into a dataframe and filter by filename

I have a tar.gz file that contains multiple files. The hierarchy looks as below. My intention is to read the tar.gz file and filter out the contents of b.tsv, as it is static metadata, while all the other files are actual records.
gzfile.tar.gz
|- a.tsv
|- b.tsv
|- thousand more files.
Using the PySpark load, I'm able to load the file into a dataframe. I used the command:
spark = SparkSession.\
builder.\
appName("Loading Gzip Files").\
getOrCreate()
input = spark.read.load('/Users/jeevs/git/data/gzfile.tar.gz',\
format='com.databricks.spark.csv',\
sep = '\t')
With the intention to filter, I added the filename
from pyspark.sql.functions import input_file_name
input.withColumn("filename", input_file_name())
Which now generates the data like so:
|_c0 |_c1 |filename |
|b.tsv0000666000076500001440035235677713575350214013124 0ustar netsaintusers1|Lynx 2.7.1|file:///Users/jeevs/git/data/gzfile.tar.gz|
|2|Lynx 2.7|file:///Users/jeevs/git/data/gzfile.tar.gz|
Of course, the filename field is being populated with the tar.gz file, making that approach useless.
A more irritating problem is that _c0 is getting populated with filename + garbage + the first row's values.
At this point, I'm wondering if the file read itself is going wrong because it is a tar.gz file. When we did v1 of this processing (Spark 0.9), we had another step that loaded the data from S3 onto an EC2 box, extracted it, and wrote it back to S3. I'm trying to get rid of those steps.
Thanks in advance!
Databricks does not support direct *.tar.gz iteration. In order to process the files, they have to be extracted into a temporary location first. Databricks supports bash cells, which can do the job.
%sh find $source -name "*.tar.gz" -exec tar -xvzf {} -C $destination \;
The command above extracts all files with the *.tar.gz extension from the source to the destination location.
If the path is passed via dbutils.widgets or set statically in %scala or %pyspark, it must be declared as an environment variable.
This can be done in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
Use the following method to load the file, assuming the content is in a *.csv file:
DF = spark.read.format('csv').options(header='true', inferSchema='true').option("mode","DROPMALFORMED").load('/mnt/dl/raw/source/sample.csv')
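If a %sh cell is inconvenient, roughly the same preprocessing can be sketched in plain Python with the standard tarfile module; this is only an illustration, and the source/destination paths below are hypothetical and would need to match your actual mount points:
import glob
import tarfile

source = '/dbfs/mnt/dl/raw/source/'          # hypothetical source directory
destination = '/dbfs/mnt/dl/raw/extracted/'  # hypothetical target directory

# Extract every *.tar.gz archive found in the source directory
for archive in glob.glob(source + '*.tar.gz'):
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(path=destination)
The extracted members can then be loaded with spark.read as in the example above.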

Executing unzip command programmatically

I have created a shell script containing the single statement unzip -o $1. When I run it through a terminal and pass a .zip file as a parameter, it works fine and takes about 5 seconds to create the unzipped folder. Now I am trying to do the same thing in Scala, and my code is as below:
object ZipExt extends App {
val process = Runtime.getRuntime.exec(Array[String]("/home/administrator/test.sh", "/home/administrator/MyZipFile_0.8.6.3.zip"))
process.waitFor
println("done")
}
Now, whenever I try to execute ZipExt, it gets stuck in process.waitFor forever and the print statement is never reached. I have tried this code both locally and on the server. I have also tried other possibilities, like creating a local variable inside the shell script, including exit statements inside the shell script, and unzipping .zip files other than mine; sometimes the print statement even executes, but no unzipped folder is created. So I am pretty sure there is something wrong with how I execute the unzip command programmatically, or there is some other way to unzip a zipped file programmatically. I have been stuck with this problem for about 2 days, so please help.
The information you have given us appears to be insufficient to reproduce the problem:
% mkdir 34088099
% cd 34088099
% mkdir junk
% touch junk/a junk/b junk/c
% zip -r junk.zip junk
updating: junk/ (stored 0%)
adding: junk/a (stored 0%)
adding: junk/b (stored 0%)
adding: junk/c (stored 0%)
% rm -r junk
% echo 'unzip -o $1' > test.sh
% chmod +x test.sh
% scala
Welcome to Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66).
Type in expressions to have them evaluated.
Type :help for more information.
scala> val process = Runtime.getRuntime.exec(Array[String]("./test.sh", "junk.zip"))
process: Process = java.lang.UNIXProcess@35432107
scala> process.waitFor
res0: Int = 0
scala> :quit
% ls junk
a b c
I would suggest trying this same reproduction on your own machine. If it succeeds for you too, then start systematically reducing the differences between the succeeding case and the failing case, a step at a time. This will help narrow down what the possible causes are.

Unzip the archive with more than one entry

I'm trying to decompress an ~8 GB .zip file piped from a curl command. Everything I have tried gets interrupted at <1 GB and returns the message:
... has more than one entry--rest ignored
I've tried funzip, gunzip, gzip -d, zcat, ..., also with different arguments - all end up with the above message.
The datafile is public, so it's easy to repro the issue:
curl -L https://archive.org/download/nycTaxiTripData2013/faredata2013.zip | funzip > datafile
Are you sure the mentioned file deflates to a single file? If it extracts to multiple files, you unfortunately cannot unzip it on the fly.
Zip is a container format as well as a compression format, and a streaming reader cannot tell where the next file begins (the central directory that indexes the entries sits at the end of the archive). You'll have to download the whole file first and then unzip it.
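A minimal sketch of that two-step approach in Python (standard library only; the URL is the one from the question, and the output directory name is hypothetical):
import urllib.request
import zipfile

url = 'https://archive.org/download/nycTaxiTripData2013/faredata2013.zip'

# Step 1: download the whole archive to disk; a zip cannot reliably be
# extracted as a stream because its central directory sits at the end
urllib.request.urlretrieve(url, 'faredata2013.zip')

# Step 2: extract every entry into a hypothetical output directory
with zipfile.ZipFile('faredata2013.zip') as zf:
    zf.extractall('faredata2013')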

python3 subprocess in Oracle Linux (wget -o)

I see there are several posts on python subprocess invoking bash shell commands. But I can't find an answer to my problem unless someone has a link that I'm missing.
So here is the start of my code:
import os;
import subprocess;
subprocess.call("wget ‐O /home/oracle/Downloads/puppet-repo.rpm https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm");
When I do
wget ‐O /home/oracle/Downloads/puppet-repo.rpm https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm
straight up in terminal, it works.
But my IDE gives me FileNotFoundError: [Errno 2] No such file or directory: 'wget'
Again, I'm new to invoking os/subprocess module within python and I would appreciate any insight on how to use these modules effectively.
{UPDATE: with miindlek's answer, I get these errors. 1st - subprocess.call(["wget", "‐O", "/home/oracle/Downloads/puppet-repo.rpm", "https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm"])}
--2015-06-07 17:14:37-- http://%E2%80%90o/
Resolving ‐o... failed: Temporary failure in name resolution.
wget: unable to resolve host address “‐o”
/home/oracle/Downloads/puppet-repo.rpm: Scheme missing.
--2015-06-07 17:14:52-- https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm
{with 2nd bash method subprocess.call("wget ‐O /home/oracle/Downloads/puppet-repo.rpm https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm", shell=True)}
Resolving yum.puppetlabs.com... 198.58.114.168, 2600:3c00::f03c:91ff:fe69:6bf0
Connecting to yum.puppetlabs.com|198.58.114.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10184 (9.9K) [application/x-redhat-package-manager]
Saving to: “puppetlabs-release-el-6.noarch.rpm.1”
0K ......... 100% 1.86M=0.005s
2015-06-07 17:14:53 (1.86 MB/s) - “puppetlabs-release-el-6.noarch.rpm.1” saved [10184/10184]
FINISHED --2015-06-07 17:14:53--
Downloaded: 1 files, 9.9K in 0.005s (1.86 MB/s)
Process finished with exit code 0
You should split your command string into a list of arguments:
import subprocess
subprocess.call(["wget", "-O", "/home/oracle/Downloads/puppet-repo.rpm", "https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm"])
You could also use the shell option as an alternative:
import subprocess
subprocess.call("wget -O /home/oracle/Downloads/puppet-repo.rpm https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm", shell=True)
By the way, in Python you don't need to add semicolons at the end of a line.
Update
The dash in the option ‐O is actually a UTF-8 hyphen character, not a regular dash. See for example:
>>> a = "‐" # utf8 hyphen
>>> b = "-" # dash
>>> str(a)
'\xe2\x80\x90'
>>> str(b)
'-'
You should delete your old hyphen and replace it with a normal dash. I updated the source code above; you can also copy it from there.
This happens primarily because your IDE is not launching that Python subprocess 'straight up in a terminal', so it may not see the same environment.
This will be a reading suggestion, rather than a direct answer to only this problem.
Check your IDE; read docs about how it launches stuff.
1 - In the terminal, run env in the same session where you tested wget.
2 - In the IDE, run:
import os; print(os.environ)
3 - Read here about shell and Popen:
https://docs.python.org/3/library/subprocess.html
Begin the learning process from there.
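As a concrete version of steps 1 and 2, a small check of what the IDE-launched Python process actually sees (a sketch using only the standard library) could be:
import os
import shutil

# Can this Python process find wget on its PATH at all?
print('PATH =', os.environ.get('PATH'))
print('wget resolves to:', shutil.which('wget'))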
I would even suggest replacing
subprocess.call("wget -O /home/oracle/Downloads/puppet-repo.rpm https://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm", shell=True)
with a clear declaration of which shell you want to use:
subprocess.Popen(['/bin/sh', '-c', 'wget <stuff>'])
to mitigate future IDE/shell/env assumption problems.
