TextEdit "Upgrade" - encoding, hidden characters etc.

Do you know any simple macOS App like Apple's TextEdit that would extends its capabilities by detection of inconsistent characters (bad encoding etc.), line terminations and hidden characters / spaces?

It could be either free or "reasonable" priced App.

Typical usage:

Encoding corrections including "bad" chars
Text corrections when copying .docx, .pdf, .pages text to LaTeX code.
"sorting" larger .csv data with errors.
Replacing certain characters when changing a syntax from Python to Matlab and vice versa (eventually C++) - in this case, I do not require sytax highlighting.

👍︎ 2

💬︎

👤︎ u/iBo0m

📅︎ Jan 16 2022

🚨︎ report

help with rare character encoding in fontforge

hey guys. first off, no idea if fontforge is considered a good tool, or what it is you guys use here, but if there's anyone who is familiar with it it would be greatly appreciated.

so my issue is, i'm trying to encode rare combinations of letters with diacritics, like t̄, r̀, v́, ǰ, and so on.

how on earth am i supposed to code these? lol

thank you in advance for all and any help.

👍︎ 7

💬︎

👤︎ u/Beleg__Strongbow

📅︎ Jan 08 2022

🚨︎ report

Most unicode symbols anywhere in a document imported into Scrivener cause an import failure of the whole document. (XmlException: Invalid character in the given encoding) Is there a way around this or is it a hard limitation?

Thanks!

👍︎ 2

💬︎

👤︎ u/graatch_ii

📅︎ Nov 25 2021

🚨︎ report

Parsing character encoding-dependent protocols with scodec hmemcpy.com/2021/11/parsi…

👍︎ 38

💬︎

👤︎ u/hmemcpy

📅︎ Nov 19 2021

🚨︎ report

do we need a character encoding similar to UTF-8 but for conlang writing systems?

tell me if it is a good idea and if there is any other suitable option for that.

👍︎ 18

💬︎

👤︎ u/erfanekm

📅︎ Oct 09 2021

🚨︎ report

Character Encoding in C++

I'm a little confused and don't quite get the point behind character encoding.

So, for example, I don't exactly understand the point behind char8_t or a UTF-8 encoding.

Is it there to be able to output Russian characters in the console, for example? I just don't understand how and what this is used for, and how it works internally. If I code something in normal char, does the corresponding variable only have a value in its memory cell that is within the scope of normal char coding? And if I have a char8_t variable, does this have a number as its value that can be larger than that of the normal char or something?

I'm happy about every answer!

👍︎ 12

💬︎

👤︎ u/Lana8888

📅︎ Oct 08 2021

🚨︎ report

Character encoding sets: Who remembers people typing '...' to drop other players?

This thought just crossed my mind and made me realize how far we've come. This game is likely encoded with UTF-8 while the original D2 likely had ASCII.

Korean players never had the "dot" character in their charset, and so their machines would not process it. Thus, the player would drop from game.

That said, the whole period of D2 history with that "whole thing" seems really weird by today's standards. Just blew my mind and thought I'd reminisce/share.

Happy Diablo 2's Eve everyone!

👍︎ 2

💬︎

👤︎ u/Megaman112

📅︎ Sep 23 2021

🚨︎ report

Yore - library for decoding/encoding character sets according to OEM code pages

Yore

Rust library for decoding/encoding character sets according to OEM code pages, CP850 etc.

crates.io github

This is my first rust library, please give me feedback. Most important is feedback about safety. I use a lot of unsafe for performance reasons.
Yes - I did benchmark it.

I found two other libraries for single-byte encoding: encoding and oem_cp.

Neither of them were a good fit for my use case(and I had to invent reasons to do something myself). I understand that their design considerations were different from mine.

I wanted great performance. By using our underappreciated friend Cow we can avoid doing any work when our input bytes are a subset of ascii.

This approach makes a huge difference for english text and source code.

I have done a lot of different things to improve performance, sometimes with surprising results. Perhaps I will do a blog-post about it someday.

Just general points:

Strings are not a sequence of chars.
Avoiding allocations with Cow is fantastic.
Batching work is important.
Precompute when possible.
Don't convert to char and then convert to utf8.
Unsafe is not unsafe(I await the inevitable bug report...).
Sometimes it is faster to do more(copy 4 bytes can be faster than copying 1-4 bytes).
Iterators aren't zero-cost-abstractions.
match is slow compared to lookup table.

👍︎ 13

💬︎

👤︎ u/bonega

📅︎ Jul 03 2021

🚨︎ report

Robocopy job file with unicode characters / encoding issue

Hey /r/sysadmin

Just wanted to reach out to see if anyone has suggestions here -

TL;DR - Need to exclude thousands of unicode files from robocopy

I'm running a robocopy job that copies from a source directory to a destination directory. It contains a JOB file with a list of files to exclude, some of which contain unicode characters. Unfortunately, it seems that even with the Job file saved as UTF-8, robocopy reads the job file in as ANSI. This causes it to incorrectly read the file paths in the job file for exclusion.

Command

robocopy "source" "destination" /e /MT:128 /r:1 /w:1 /copy:DAT /v /XX /np /ndl /bytes /tee /JOB:"jobpath" /unilog:"logfilepath"

the JOB file contains

/XF 
filetoExcludeWithUnicodeCharacter.zip

the /UNICODE flag seems to only change console output, and only in the options portion.
the /UNILOG flag only changes the log file output
saving the job as UCS-2 BE BOM caused it to not copy anything
changing the code page [chcp 65001] doesn't seem to make any impact on how robocopy reads the JOB file
writing /XF unicodefilepath.zip DOES work without issue...but that limits me to excluding only to the max command length

The root problem -appears- to be with how robocopy is parsing the job file internally

👍︎ 3

💬︎

👤︎ u/nuclearmage257

📅︎ Sep 03 2021

🚨︎ report

Character 202 in different encodings...?

Is there a place where I can easily see what character 202 corresponds to in different character encodings?

That is, I give it a byte and it spits out how it is interpreted by many different encodings.

👍︎ 3

💬︎

👤︎ u/dvorahtheexplorer

📅︎ Jul 16 2021

🚨︎ report

KDE KONSOLE Character encoding error or bug

[yigit@archlinux ~]$ sudo pacman -Qm | grep e
[sudo] password for yigit:
pacman: invalid option -- '�'
[yigit@archlinux ~]$ sudo pacman -Q�denemem | grep e

I don't have any information about this problem I use Latin alphabet and sometimes crashes frequently I use UTF-8 at my console

👍︎ 2

💬︎

👤︎ u/NO_NAME2002

📅︎ Aug 24 2021

🚨︎ report

Does anyone have an idea how to solve this character encoding bug in SceneBuilder? (more details in comments)

👍︎ 3

💬︎

👤︎ u/speedmotel

📅︎ Apr 17 2021

🚨︎ report

Every second day I am able to write this, and every second day it turns into an empty string (yes, I am able to write an empty message). Very mysterious! Seems they change character encoding every server save. I wonder why.

👍︎ 37

💬︎

👤︎ u/OtherwiseAd9733

📅︎ Dec 10 2020

🚨︎ report

Did Juniper get utf-8 wrong in "Understanding Character Encoding on Devices Running Junos OS"?

https://www.juniper.net/documentation/en_US/junos/topics/reference/general/junos-character-encoding.html

In the linked doc they tell us the ñ character maps to its equivalent UTF-8 decimal character reference "&#195;&#177;".

But as per, https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec&unicodeinhtml=dec, those values map to the characters 'Ã' - "&#195;" and ' ± ' - "&#177;" in the column "numerical HTML encoding of the Unicode character." That's what PyEZ thought they were prior to a bit of manipulation.

It's true however that '195 177' (without the escape characters) maps to 'ñ' in the UTF-8(dec.) column. Has Juniper erred by including the escape characters? At the same time, without some escape value how would you ever distinguish a utf-8 decimal value from a number? What's the practical usefulness of utf-8 decimal and should Juniper switch to hex which seems to be much more common?

👍︎ 2

💬︎

👤︎ u/0110101001100010

📅︎ Apr 01 2021

🚨︎ report

Character encoding in JavaScript made simple (really simple) youtube.com/watch?v=CN9fA…

👍︎ 118

💬︎

👤︎ u/3dCodeWorld

📅︎ Mar 07 2021

🚨︎ report

Desperately need help! Encoding problem while parsing HTML with beautiful soup, how to prevent missing characters (commas, apostrophes and dashes)?

Desperately need help, parsing HTML with code below causes characters such as apostrophes, commas and dashes to appear as "â " in the resulting data frames and .txt file. A very rushed data science module in college means I'm really not familiar with encoding issues, beautiful soup or how to use escape characters, could somebody please help me tweak the code to prevent these characters from appearing as "missing"? I would be eternally grateful.

code:

import requests 
import calendar
from bs4 import BeautifulSoup as bs
import pandas as pd

def get_data(soup, link): 
    rows = []

    for article in soup.select('.article:has(.metadata:nth-of-type(2):contains("Books","Music","Film"))'):
        title = article.select_one('a').text
        category = article.select_one('.metadata:nth-of-type(2)').text.replace('Category: ','')
        desc = article.select_one('.snippet').text
        rows.append([title, desc, category])
    return pd.DataFrame(rows)

if __name__ == '__main__':
    
    with requests.Session() as s:
        r = s.get('http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html')
        soup = bs(r.content, "html5lib")
        links = ['http://mlg.ucd.ie/modules/COMP41680/assignment2/' + i['href'] for i in soup.select('.list-group a')]
        urls = []
        results = []
 
        
        for number, link in enumerate(links, 1):
            soup = bs(s.get(link).text, "html5lib")
            pages = int(soup.select_one('.results').text.split('of ')[-1])
            #print(pages)
            results.append(get_data(soup, links))
        
            for day in range(2, pages + 1):
                urls.append(f'http://mlg.ucd.ie/modules/COMP41680/assignment2/month-{calendar.month_name[number][:3].lower()}-{str(day).zfill(3)}.html')

        for url in urls:
            df = get_data(soup, link)
            results.append(df) 
        
    final = pd.concat(results)
    final.columns = ['Title', 'Description', 'Category']
    print(final.head())
    final.to_csv(r"C:\Users\User\Documents\College - Data Science Python Folder\articles.txt", header=True, index=False, sep='\t', mode='a', encoding = "utf-8-sig")

👍︎ 5

💬︎

👤︎ u/MindOverMatte

📅︎ Apr 11 2021

🚨︎ report

Why does Google declare conflicting character encodings on their homepage?

I've been doing some research on declaring character encodings.

Specifically, do you really need the <meta charset="UTF-8"> tag?

You must declare a character encoding, but by default most servers include this in the http headers and that's actually better than using a <meta> tag — the earlier it's declared the sooner the page can render.

A micro-optimisation really.

On top of that, for HTML5 utf-8 is the only valid character encoding. So <!doctype html> is implicitly declaring the character encoding too.

<meta charset="UTF-8"> is considered sacred. So before I started telling people it's a useless 22 bytes. I thought I'd see what google do.

In the google homepage <head> tags they have:

&lt;meta content="text/html; charset=UTF-8" http-equiv="Content-Type"&gt;

But then in the http headers it's:

Content-Type: text/html; charset=ISO-8859-1

What's going on here?

Here's my guesses:

Maybe it's a backwards compatibility thing. Perhaps browsers that don't understand the <meta> tag also don't understand utf-8?
Maybe it's a performance optimization. Perhaps it's faster to parse the very first part of the document in ISO-8859-1 then switch to utf-8 for the rest.

What do you think?

What does google know that we don't (besides literally everything)?

Originally posted to dev.to but maybe here is a better fit.

👍︎ 2

💬︎

👤︎ u/HauntingTomatillo202

📅︎ Mar 17 2021

🚨︎ report

String Literals, Character Encodings, and Multiplatform C++ pspdfkit.com/blog/2021/st…

👍︎ 14

💬︎

👤︎ u/vormestrand

📅︎ Mar 03 2021

🚨︎ report

Possible to use UTF-8 character encoding in online chat in the English version?

In most lobbies I'm matched in there are always non-ASCII compatible characters typed in chat, rendered as boxes when I launch theNon-ASCII encoded characters, such as Chinese characters, are not drawn correctly and are shown as boxes. This is an issue because I realize over half of the players matched with me in the online servers communicate in Chinese. This becomes especially troubling during gameplay missions requiring communication and teamwork; for example during heists. I can read and type in Chinese because it is my secondary language, however I do not want for the entire game's UI to use Chinese, I only wish to be able to see it if it exists (when other players are trying to communicate with me).

Is there a R* endorsed method to use UTF-8 encoding with GTA V, or any other encoding that can render both English characters and Chinese texts while keeping the UI American English? Thanks.

👍︎ 4

💬︎

👤︎ u/TheSpicyGuy

📅︎ Feb 23 2021

🚨︎ report

Characters encoding demystified - Techblog - Hostmoz techblog.hostmoz.net/en/c…

👍︎ 5

💬︎

👤︎ u/backstageel

📅︎ Mar 18 2021

🚨︎ report

My settings text is garbled; fine when I paste it into a browser window. I've tried resetting the app settings; it's still showing. something to do with character encoding? No extensions installed.

👍︎ 2

💬︎

👤︎ u/townfox

📅︎ Dec 28 2020

🚨︎ report

Characters encoding demystified - Techblog - Hostmoz techblog.hostmoz.net/en/c…

👍︎ 7

💬︎

👤︎ u/backstageel

📅︎ Mar 18 2021

🚨︎ report

Help with encoding and decoding Chinese characters

I'm making a small program that can take the text from an .epub file and save it as a .txt. It will print the text without an issue, but when I try to write the same text to a file I get this error:

> return codecs.charmap_encode(input,self.errors,encoding_table)[0] > UnicodeEncodeError: 'charmap' codec can't encode character '\u5e74' in position 4: character maps to <undefined>

I found online that I need to encode the text, so I tried using utf-8. This works, but now my text file has a bunch of

> \xb8\xe6\xb2\xa1\xe8\xaf\xb4\xe4\xbb\x80\xe4\xb9\x88\xef\xbc\x8c\xe5\x8f\xaa\xe6\x98\xaf\xe7\xac\x91\xe4\xba\x86\xe7\xac\x91\xe3\x80

instead of the Chinese characters. How can I convert this back to Chinese characters in the .txt file?

👍︎ 5

💬︎

👤︎ u/xain1112

📅︎ Sep 22 2020

🚨︎ report

Is there a semantic encoding language, similar to Unicode but for meaning instead of characters?

A discussion about vague icons on mobile devices lead to the question of auto-translating "meaning codes" to local languages. Quote:

> Maybe we need something like Unicode for meaning instead of just pictograms. Pictograms can convey meaning, but are limited in that regard, especially for non-nouns.

> For example, the code 1230 could mean "show" or "display". 1234 could mean "all", 1235 could mean "next", and 1236 for "previous" etc.

> The UI designer would specify "[1230][1234]" and the device's local language settings would lookup the words to get "Show All" for English speakers, for example. I suppose the word order may be different in other languages, but it's still likely better than confusing icons...

I suppose we could use Esperanto as the standard, but that may require too much parsing. Plus, usage of parentheses and other grouping characters could reduce reference ambiguities often found in written languages, such as "I spotted a telescope on the hill" where "on" can refer to either noun. The equivalent of the sentence diagram could be unambiguous, at least in terms of a tree structure: "(I (location is on hill))(spotted)(one telescope)".

I couldn't find such a proposed standard on the "GoogleBings", but merely intermediate encodings used inside translating engines.

👍︎ 5

💬︎

👤︎ u/Zardotab

📅︎ Sep 29 2020

🚨︎ report

In this video walkthrough, we demonstrated how to bypass XSS filters with the use of character encoding. Also, we demonstrated bypassing command injection filters to gain access to the remote host youtube.com/watch?v=keJu7…

👍︎ 11

💬︎

👤︎ u/MotasemHa

📅︎ Jan 05 2021

🚨︎ report

In this video walkthrough, we demonstrated how to bypass XSS filters with the use of character encoding. Also, we demonstrated bypassing command injection filters to gain access to the remote host youtube.com/watch?v=keJu7…

👍︎ 15

💬︎

👤︎ u/MotasemHa

📅︎ Jan 05 2021

🚨︎ report

Display character encoding in OpenHab logviewer(ü, ä, ö = "?")

Hi,

does anyone have the same problem as I have and knows the solution?

All german vowel mutations are displayed as questionmarks in my Logviewer. I'd like to know how I can change this.

Any hints would be appreaciated.

thanks in advance

https://preview.redd.it/njurl7nvj6y51.png?width=946&format=png&auto=webp&s=09744bfb75998a4a6d5c608733ff88f7dc4c65f6

👍︎ 4

💬︎

👤︎ u/radza_

📅︎ Nov 09 2020

🚨︎ report

HackTheBox Holiday - we demonstrated how to bypass XSS filters with the use of character encoding. Also, we demonstrated bypassing command injection filters to gain access to the remote host youtube.com/watch?v=keJu7…

👍︎ 3

💬︎

👤︎ u/MotasemHa

📅︎ Jan 05 2021

🚨︎ report

Best practice: Character-types, encodings, multi-platform, multi-language?

Recently - when I was implementing a few encodings - I started thinking about and researching the topic described in the title of this post. With all the information I've found so far, I'm now unsure whether using char16_t and UTF-16 (chosen due to .NET, Java etc. taking that route) was the wisest choice and which option might be better suited for my needs - let me give a general overview of what I'm working on and what I do want to achieve as it might help in anwering my question:

I'm currently writing a framework to use as a basis for a backend (and possibly frontend). The framework itself contains functionality to connect to databases (MSSQL, Oracle etc.), read and write files (be it simple textfiles, pdfs, zipped files, office documents, xml, json etc.), communicate over a socket using different protocols (FTP, LDAP, HTTP etc.). While currently mostly writing code for windows, I'd like to make the framework available for android, linux and maybe macOS as well at some point.

Before doing too much work that I later down the road might need to redo, I'd like to ask for some insight by someone who maybe has some experience with the whole thing. I know there's no single solution that is best in every way but maybe there's one in terms of "best of both worlds", so let me simply ask:

Would it be wiser to use char and UTF-8 internally? Are there any arguments/current developements that might reduce the pool of types and encodings to choose from? Am I maybe not asking the right questions at all?

Thanks in advance and best regards,

waYne

👍︎ 7

💬︎

👤︎ u/waYneIsOn

📅︎ Aug 03 2020

🚨︎ report

Limitations of the <meta> tag for specifying character encoding seeblog.seenet.ca/2020/09…

👍︎ 12

💬︎

👤︎ u/MrDOS

📅︎ Sep 02 2020

🚨︎ report

digiKam 7.1.0 is released. Includes better Canon CR3 metadata support, new Batch Queue Manager plugins to fix Hot Pixels automatically and to apply texture over images, and better metadata management (improved IPTC compatibility with UTF-8 characters encoding). digikam.org/news/2020-09-…

👍︎ 59

💬︎

👤︎ u/anaxarchos

📅︎ Sep 08 2020

🚨︎ report

How much digital memory space would it take to have all books written in the last century, in the most basic character encoding? less then a Zebibyte?

👍︎ 2

💬︎

👤︎ u/MAOU_42

📅︎ Oct 31 2020

🚨︎ report

Character encoding issues? w/ random packages and files

Okay so for reference: I'm on Linux Mint 20 (on cinnamon) and I need to read a perl script. It runs into an issue with a certain package (line 3 says something along the lines of #Use CAre.pm and that's where it's not happy); I check the package file itself and the file is telling me about a character encoding issue.
I wasn't sure what the issue was but still the same issue when I'm on a cluster; and not only that, but I ran into this same "character encoding" issue when I tried to download a file from my university's IT department (that they asked me to download so that they could do a screenshare).
Does anyone have any ideas what could possibly be causing this? I've used linux before but not mint, and I'm honestly baffled as to how often this particular issue is coming up?

👍︎ 4

💬︎

👤︎ u/sailor_rini

📅︎ Oct 24 2020

🚨︎ report