Special characters become garbled when using mencoder or ffmpeg - Linux

I have subtitles encoded in ISO-8859-1, at least that's what file -bi says. They contain Turkish special characters such as ğ, ü, ş, ç, ö. When I try this command:
mencoder source.avi -sub source.srt -o output.avi -oac copy -ovc lavc \
-lavcopts vbitrate=1200
the Turkish characters either don't show up or come out garbled. I also tried iconv to convert the character encoding to UTF-8, but that didn't work either.
I have tried ffmpeg with -sub_charenc iso-8859-1, -sub_charenc cp1254, and -sub_charenc iso-8859-9; they all failed when I tried to make an .ass file. When I ran it like this:
ffmpeg -sub_charenc utf8 -i test.srt test1.srt
it showed the subtitle lines OK on the screen. So I know it can read the lines, but I couldn't render the video with an ISO-8859-9 (i.e. Turkish-character) subtitle.
Does anyone have an idea how I can do this? People who have done it for their own language may be able to help too; German and Spanish, for example, have their own special characters.

I have solved this problem by using mencoder like this:
-sub test.srt -subcp enca:tr:iso-8859-9
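For reference, combined with the command from the question, the full invocation should look something like this:
mencoder source.avi -sub source.srt -subcp enca:tr:iso-8859-9 \
-o output.avi -oac copy -ovc lavc -lavcopts vbitrate=1200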
I hope this helps.

Related

iconv command is not changing the encoding of a plain text file to another encoding

On Linux I created a plain text file. Using "file -i" I see the file encoding is "us-ascii". After trying the commands below, the output file encoding is still shown as "us-ascii". Could you please tell me how to change the encoding? Or is there some encoded file I can download that I can't read?
iconv -f US-ASCII -t ISO88592//TRANSLIT -o o.txt ip.txt
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT -o op.txt ip.txt
I am expecting that either iconv changes the encoding, or that I can download some encoded file.
If your file contains only ASCII characters, then there's no difference between the ASCII, UTF-8, and the various ISO-8859-x encodings. So after conversion, you end up with exactly the same file.
A text file does not store any information about which encoding was used. The file command therefore applies a few heuristics, but at the end of the day it's just a guess. And as the files are identical, the result will always be the same.
To see a difference, you must use characters that are encoded differently in the different encodings, or are not available at all in one of them, e.g. ă, € or 😊.
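A quick way to see this in action (a hypothetical demo; "café" contains the non-ASCII character "é"):
printf 'caf\xc3\xa9\n' > ip.txt    # "café" encoded as UTF-8
file -i ip.txt                     # reports charset=utf-8
iconv -f UTF-8 -t ISO-8859-1 -o op.txt ip.txt
file -i op.txt                     # reports charset=iso-8859-1
Now the two files really do differ, because "é" is two bytes in UTF-8 but only one byte in ISO-8859-1.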

ffmpeg being imprecise when trimming mp3 files

I want to use ffmpeg to trim some mp3s without re-encoding. The command I used was
ffmpeg -i "inputfile.mp3" -t 00:00:12.414 -c copy out.mp3
However, out.mp3 has a length of 12.460s, and when I load the file in Audacity I can see that it was cut at the wrong spot, and not at 12.414s.
Why is this? I googled a bit and tried some other commands like ffmpeg -i "inputfile.mp3" -ss 0 -to 00:00:12.414 -c copy out.mp3 (which interestingly results in a different length of 12.434s) but could never get the milliseconds to be cut right.
PS: I wasn't sure whether SO was the right place to ask, since it isn't technically programming related; however, most of what I found on trimming audio files with ffmpeg was Stack Overflow questions, e.g. ffmpeg trimming videos with millisecond precision.
You can't trim MP3 (or most other lossy codec output) with that level of precision. An MP3 frame or so of padding is added during encoding. (See also: https://wiki.hydrogenaud.io/index.php?title=Gapless, and all the hacks required to make this work.)
If you need precision timing, use something uncompressed like PCM in WAV, or a lossless compression like FLAC.
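For example, one possible approach (a sketch, assuming a WAV output is acceptable; re-encoding to PCM is lossless):
# decode the MP3 and trim; with PCM output the cut is no longer
# constrained to MP3 frame boundaries
ffmpeg -i "inputfile.mp3" -t 00:00:12.414 out.wav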
On Linux you can use mp3splt:
mp3splt -f <mp3file.mp3> <from> <to> -o <output file format>
Example:
mp3splt -f "/home/audio folder/test.mp3" 0.11.89 3.25.48 -o #f_trimmed
This will create "/home/audio folder/test_trimmed.mp3".
For more info on the parameters, check the mp3splt man page.
On Windows you can use mp3DirectCut
mp3DirectCut has a GUI, but it also has command-line support.

Convert files between UTF-8 and ISO-8859 on Linux

Every time that I get confronted with Unicode, nothing works. I'm on Linux, and I got these files from Windows:
$ file *
file1: UTF-8 Unicode text
file2: ISO-8859 text
file3: ISO-8859 text
Nothing was working until I found out that the files have different encodings. I want to make my life easy and have them all in the same format:
iconv -f UTF-8 -t ISO-8859 file1 > test
iconv: conversion to `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
I tried to convert to ISO because that's only one conversion, plus when I open those ISO files in gedit, the German letter "ü" is displayed just fine. Okay, next try:
iconv -f ISO-8859 -t UTF-8 file2 > test
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
but obviously that didn't work.
The ISO-8859-x encodings (e.g. ISO-8859-1, a.k.a. Latin-1) each contain only a very limited set of characters; you should always try to encode to UTF-8 to make life easier.
And UTF-8 (Unicode) is a superset of the ISO-8859 repertoires, so it should be no surprise that you could not convert UTF-8 to ISO-8859. Note also that iconv needs a specific variant name such as ISO-8859-1; plain "ISO-8859" is not a supported encoding, which is exactly what the error message says.
It seems the file command gives only very limited information about a file's encoding.
You could try to guess the source encoding: either ISO-8859-1 or ISO-8859-15, or another variant from 2 to 14, as suggested in the comment by @hobbs.
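For example, assuming the file really is ISO-8859-1, a direct conversion would be:
iconv -f ISO-8859-1 -t UTF-8 file2 > file2.utf8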
And you can get the list of encodings iconv supports with iconv -l.
If you have no luck guessing the real file encoding, this silly script might help you out :D
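That script isn't reproduced here, but a minimal sketch of the same trial-and-error idea could look like this (the candidate encodings are my assumption):
# try a few likely source encodings and eyeball the output
for enc in ISO-8859-1 ISO-8859-9 ISO-8859-15; do
    echo "== $enc =="
    iconv -f "$enc" -t UTF-8 file2 | head -n 3
done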
As in other answers, you can list the supported formats:
iconv -l | grep 8859
A grep will save you time finding which version(s) of your encoding are supported. You can grep for the number, as in my example, or for ISO or any other string you expect in the encoding name.

Encoding error in TopoJSON module

I'm seeing this error, and I just don't know whether it's a bug or something I'm doing wrong. When converting a GeoJSON file, generated by
ogr2ogr -f "GeoJSON" INPUT.json INPUT.shp
with the topojson module, and preserving the properties, some Spanish characters are not preserved:
topojson -p -o OUTPUT.json INPUT.json
For example: Castellón from the INPUT.json file (I checked, there are no errors in that file) ends up as Castell�n in the OUTPUT.json file. The properties are preserved well except for characters like á, í, ó, etc. (common in Spanish).
I've tried adding --shapefile-encoding utf8 without success.
Well, it's almost a year late, but I'll leave my solution here because more people will probably have this problem.
I solved it in 2 steps:
1. In the conversion from shp to GeoJSON, I encoded the output as UTF-8 instead of the default ISO-8859-1 (I used QGIS for this instead of the command-line ogr2ogr; a command-line sketch follows below).
2. In topojson I added the option --shapefile-encoding utf8, as Nacho did.
And all my beautiful accents and tildes are back.
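For those who prefer the command line, a hypothetical ogr2ogr equivalent of step 1 might be (assuming the source DBF is ISO-8859-1; SHAPE_ENCODING tells OGR how to interpret the shapefile's attribute table):
ogr2ogr -f "GeoJSON" --config SHAPE_ENCODING "ISO-8859-1" OUTPUT.json INPUT.shp
GeoJSON is defined as UTF-8, so the properties are re-encoded on the way out.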

How to remove non-UTF-8 characters from a text file

I have a bunch of Arabic, English, and Russian files which are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them.
Now I'm looking for a way to automatically remove these characters from the files.
Is there any way to do it?
This command:
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all the invalid characters.
-f is the source format
-t the target format
-c skips any invalid sequence
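Note that iconv writes the cleaned text to standard output, so redirect it to a new file:
iconv -f utf-8 -t utf-8 -c file.txt > file_clean.txt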
Any method has to read the file byte by byte and fully understand how characters are constructed from bytes. The simplest approach is to use an editor that will read anything but only output UTF-8 characters. TextPad is one choice.
iconv can do it:
iconv -f cp1252 foo.txt
(Without -t, iconv converts to your current locale's encoding.)
None of the methods here or in other similar questions worked for me.
In the end, what worked was simply opening the file in Sublime Text 2: go to File > Reopen with Encoding > UTF-8, copy the entire content of the file into a new file, and save it.
It may not be the expected solution, but I'm putting it out here in case it helps anyone, since I struggled with this for hours.
