A list of puns related to "Character Encoding"
Do you know of any simple macOS app, like Apple's TextEdit, that extends its capabilities with detection of inconsistent characters (bad encoding etc.), line terminations, and hidden characters/spaces?
It could be either free or a "reasonably" priced app.
Typical usage:
Hey guys. First off, no idea if FontForge is considered a good tool, or what it is you use here, but if there's anyone who is familiar with it, help would be greatly appreciated.
So my issue is, I'm trying to encode rare combinations of letters with diacritics, like tΜ, rΜ, vΜ, jΜ, and so on.
How on earth am I supposed to encode these? lol
Thank you in advance for any and all help.
Thanks!
Tell me if it is a good idea, and if there is any other suitable option for that.
I'm a little confused and don't quite get the point behind character encoding.
So, for example, I don't exactly understand the point of char8_t or UTF-8 encoding.
Is it there so you can output Russian characters in the console, for example? I just don't understand what this is used for and how it works internally. If I store something in a normal char, does the corresponding variable only hold a value in its memory cell that fits within the range of a normal char? And if I have a char8_t variable, can it hold a number larger than a normal char can, or something?
I'm happy about every answer!
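I can't speak to the C++ standard details here, but the core idea is easy to see in any language: UTF-8 represents one character (code point) as a sequence of one to four 8-bit code units, and char8_t is essentially a type that says "this 8-bit value is a UTF-8 code unit" rather than an arbitrary byte; the values themselves are never larger than a byte. A minimal sketch, in Python purely for illustration:

```python
# UTF-8 stores one *character* (code point) as one or more 8-bit units.
text = "Привет"              # Russian for "Hello"
data = text.encode("utf-8")  # the UTF-8 byte sequence

print(len(text))             # 6 characters (code points)
print(len(data))             # 12 bytes: each Cyrillic letter needs 2 UTF-8 bytes
print(list(data[:2]))        # [208, 159] - the two bytes that encode 'П'
```

So a console that expects UTF-8 can display Russian text because the multi-byte sequences are carried in ordinary 8-bit units; no single unit ever exceeds the range of a byte.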
This thought just crossed my mind and made me realize how far we've come. This game is likely encoded with UTF-8, while the original D2 likely used ASCII.
Korean players never had the "dot" character in their charset, so their machines would not process it. Thus, the player would drop from the game.
That said, the whole period of D2 history with that "whole thing" seems really weird by today's standards. Just blew my mind and thought I'd reminisce/share.
Happy Diablo 2's Eve everyone!
Rust library for decoding/encoding character sets according to OEM code pages, CP850 etc.
This is my first Rust library; please give me feedback.
Most important is feedback about safety: I use a lot of unsafe for performance reasons.
Yes - I did benchmark it.
I found two other libraries for single-byte encoding: encoding and oem_cp.
Neither of them was a good fit for my use case (and I had to invent reasons to do something myself). I understand that their design considerations were different from mine.
I wanted great performance.
By using our underappreciated friend Cow, we can avoid doing any work when our input bytes are a subset of ASCII. This approach makes a huge difference for English text and source code.
I have done a lot of different things to improve performance, sometimes with surprising results. Perhaps I will do a blog post about it someday.
Just general points:
Strings are not a sequence of chars.
Avoiding allocations with Cow is fantastic.
Batching work is important.
Precompute when possible.
Don't convert to char and then convert to UTF-8.
Unsafe is not unsafe (I await the inevitable bug report...).
Sometimes it is faster to do more (copying 4 bytes can be faster than copying 1-4 bytes).
Iterators aren't zero-cost abstractions.
match is slow compared to a lookup table (see the sketch below).
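The library itself is Rust, but the two tricks that mattered most for me (skip all work when the input is pure ASCII, and decode through a precomputed 256-entry table instead of a per-byte match) are easy to sketch in any language. A rough illustration in Python; the table here is deliberately truncated and only stands in for a real CP850 mapping:

```python
# Sketch of the ASCII fast path + lookup-table decode (not the real crate).
# A real table would fill all 256 entries from the CP850 spec; only two
# non-ASCII slots are shown here.
CP850_TABLE = [chr(i) for i in range(128)] + ["\ufffd"] * 128
CP850_TABLE[0x81] = "ü"   # CP850 0x81 -> U+00FC
CP850_TABLE[0x82] = "é"   # CP850 0x82 -> U+00E9

def decode_cp850(data: bytes) -> str:
    # Fast path: pure ASCII is already valid text, so return it without
    # building anything new (this is what Cow::Borrowed buys you in Rust).
    if all(b < 0x80 for b in data):
        return data.decode("ascii")
    # Slow path: one table lookup per byte, no per-byte match/branching.
    return "".join(CP850_TABLE[b] for b in data)

print(decode_cp850(b"plain ascii"))   # fast path, input passed through
print(decode_cp850(b"caf\x82"))       # -> 'café' via the lookup table
```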
Hey /r/sysadmin
Just wanted to reach out to see if anyone has suggestions here -
TL;DR - Need to exclude thousands of files with Unicode names from robocopy
I'm running a robocopy job that copies from a source directory to a destination directory. It uses a JOB file with a list of files to exclude, some of which contain Unicode characters. Unfortunately, even with the job file saved as UTF-8, robocopy seems to read it in as ANSI. This causes it to read the file paths in the job file incorrectly, so the exclusions don't match.
Command
robocopy "source" "destination" /e /MT:128 /r:1 /w:1 /copy:DAT /v /XX /np /ndl /bytes /tee /JOB:"jobpath" /unilog:"logfilepath"
the JOB file contains
/XF
filetoExcludeWithUnicodeCharacter.zip
The root problem appears to be with how robocopy is parsing the job file internally.
Is there a place where I can easily see what character byte 202 corresponds to in different character encodings?
That is, I give it a byte and it spits out how it is interpreted by many different encodings.
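I don't know of a single canonical site, but it's also quick to brute-force locally with any language's codec support. A small sketch in Python; the list of encodings is just a sample:

```python
# How is the single byte 0xCA (decimal 202) interpreted by various encodings?
byte = bytes([202])
for enc in ["latin-1", "cp1252", "cp850", "iso8859-7", "koi8-r", "mac-roman", "shift_jis"]:
    try:
        char = byte.decode(enc)
        print(f"{enc:10} -> {char!r}  (U+{ord(char):04X})")
    except UnicodeDecodeError:
        print(f"{enc:10} -> not a complete character on its own")
```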
[yigit@archlinux ~]$ sudo pacman -Qm | grep e
[sudo] password for yigit:
pacman: invalid option -- '�'
[yigit@archlinux ~]$ sudo pacman -Q�denemem | grep e
I don't have any more information about this problem. I use the Latin alphabet and UTF-8 in my console, and it still crashes like this fairly often.
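Hard to say from this alone, but invalid option -- '�' usually means an invisible or non-ASCII byte ended up in the command line (for example a pasted no-break space, or a terminal/locale mismatch). One way to check what the shell actually received is to dump the code points of the command string; a small sketch, where the sample string is just a made-up illustration:

```python
import sys
import unicodedata

# Print each character of a suspicious command string with its code point,
# so invisible or non-ASCII characters (e.g. a no-break space) stand out.
cmd = sys.argv[1] if len(sys.argv) > 1 else "pacman -Q\u00a0deneme"  # hypothetical example
for ch in cmd:
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X}  {name:<30}  {ch!r}")
```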
In the linked doc they tell us the ñ character maps to its equivalent UTF-8 decimal character reference "&#195;&#177;".
But as per https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec&unicodeinhtml=dec, those values map to the characters 'Ã' ("&#195;") and '±' ("&#177;") in the column "numerical HTML encoding of the Unicode character." That's what PyEZ thought they were prior to a bit of manipulation.
It's true however that '195 177' (without the escape characters) maps to 'ñ' in the UTF-8 (dec.) column. Has Juniper erred by including the escape characters? At the same time, without some escape value, how would you ever distinguish a UTF-8 decimal value from a number? What's the practical usefulness of UTF-8 decimal, and should Juniper switch to hex, which seems to be much more common?
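For what it's worth, the confusion is easy to reproduce: 195 and 177 are the two UTF-8 bytes of ñ, while the HTML decimal reference for ñ itself is &#241; (its Unicode code point). Escaping each byte as its own reference yields Ã± instead, which is exactly what PyEZ was showing. A quick check in Python:

```python
import html

# ñ is code point U+00F1 (241 decimal) but encodes to two UTF-8 bytes.
print(list("ñ".encode("utf-8")))      # [195, 177]

# The correct HTML decimal reference uses the code point, not the bytes.
print(html.unescape("&#241;"))        # ñ

# Escaping each UTF-8 byte as its own reference produces mojibake.
print(html.unescape("&#195;&#177;"))  # Ã±
```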
Desperately need help: parsing HTML with the code below causes characters such as apostrophes, commas and dashes to appear as "â€™" in the resulting data frames and .txt file. A very rushed data science module in college means I'm really not familiar with encoding issues, BeautifulSoup, or how to use escape characters. Could somebody please help me tweak the code to prevent these characters from appearing as "missing"? I would be eternally grateful.
code:
import requests
import calendar
from bs4 import BeautifulSoup as bs
import pandas as pd

def get_data(soup, link):
    rows = []
    for article in soup.select('.article:has(.metadata:nth-of-type(2):contains("Books","Music","Film"))'):
        title = article.select_one('a').text
        category = article.select_one('.metadata:nth-of-type(2)').text.replace('Category: ', '')
        desc = article.select_one('.snippet').text
        rows.append([title, desc, category])
    return pd.DataFrame(rows)

if __name__ == '__main__':
    with requests.Session() as s:
        r = s.get('http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html')
        soup = bs(r.content, "html5lib")
        links = ['http://mlg.ucd.ie/modules/COMP41680/assignment2/' + i['href'] for i in soup.select('.list-group a')]
        urls = []
        results = []
        for number, link in enumerate(links, 1):
            # Parse the raw bytes (.content) rather than .text so the parser
            # detects the page's UTF-8 encoding itself; decoding with requests'
            # guessed encoding is what turns apostrophes into "â€™".
            soup = bs(s.get(link).content, "html5lib")
            pages = int(soup.select_one('.results').text.split('of ')[-1])
            results.append(get_data(soup, link))   # pass the single link, not the whole list
            for day in range(2, pages + 1):
                urls.append(f'http://mlg.ucd.ie/modules/COMP41680/assignment2/month-{calendar.month_name[number][:3].lower()}-{str(day).zfill(3)}.html')
        for url in urls:
            # Each extra page has to be fetched and parsed itself, again from bytes.
            soup = bs(s.get(url).content, "html5lib")
            results.append(get_data(soup, url))
        final = pd.concat(results)
        final.columns = ['Title', 'Description', 'Category']
        print(final.head())
        # utf-8-sig keeps the characters intact and lets Excel/Notepad detect them.
        final.to_csv(r"C:\Users\User\Documents\College - Data Science Python Folder\articles.txt",
                     header=True, index=False, sep='\t', mode='a', encoding="utf-8-sig")
I've been doing some research on declaring character encodings. Specifically, do you really need the <meta charset="UTF-8"> tag?
You must declare a character encoding, but by default most servers include it in the HTTP headers, and that's actually better than using a <meta> tag: the earlier it's declared, the sooner the page can render. A micro-optimisation, really.
On top of that, for HTML5, utf-8 is the only valid character encoding. So <!doctype html> is implicitly declaring the character encoding too.
<meta charset="UTF-8"> is considered sacred, so before I start telling people it's a useless 22 bytes, I thought I'd see what Google does.
In the Google homepage <head> tags they have:
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
But then in the HTTP headers it's:
Content-Type: text/html; charset=ISO-8859-1
What's going on here? Here are my guesses: browsers old enough not to understand the <meta> tag also don't understand utf-8, so the header caters to them? Or they start with ISO-8859-1 and then switch to utf-8 for the rest. What do you think?
What does Google know that we don't (besides literally everything)?
Originally posted to dev.to, but maybe here is a better fit.
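If you want to check this yourself, both the header and the tag are easy to pull programmatically. A small sketch (the regex is just a rough way to spot the meta tag, and what Google returns may differ by region and user agent):

```python
import re
import requests

# Compare the charset declared in the HTTP header with the one in the markup.
resp = requests.get("https://www.google.com/")
print("HTTP header:", resp.headers.get("Content-Type"))

match = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', resp.text, re.IGNORECASE)
print("In the HTML:", match.group(1) if match else "no <meta charset> found")
```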
In most lobbies I'm matched into, there are always non-ASCII characters typed in chat, rendered as boxes when I launch the game: non-ASCII characters, such as Chinese characters, are not drawn correctly and are shown as boxes. This is an issue because over half of the players matched with me on the online servers communicate in Chinese. It becomes especially troubling during gameplay missions requiring communication and teamwork, for example during heists. I can read and type Chinese because it is my second language, but I do not want the entire game's UI to be in Chinese; I only wish to be able to see it when it appears (when other players are trying to communicate with me).
Is there an R*-endorsed method to use UTF-8 encoding with GTA V, or any other encoding that can render both English characters and Chinese text while keeping the UI in American English? Thanks.
I'm making a small program that can take the text from an .epub file and save it as a .txt. It will print the text without an issue, but when I try to write the same text to a file I get this error:
> return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u5e74' in position 4: character maps to <undefined>
I found online that I need to encode the text, so I tried using utf-8. This works, but now my text file has a bunch of
> \xb8\xe6\xb2\xa1\xe8\xaf\xb4\xe4\xbb\x80\xe4\xb9\x88\xef\xbc\x8c\xe5\x8f\xaa\xe6\x98\xaf\xe7\xac\x91\xe4\xba\x86\xe7\xac\x91\xe3\x80
instead of the Chinese characters. How can I convert this back to Chinese characters in the .txt file?
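If I'm reading this right, the original error comes from writing to a file opened with the platform's default (non-UTF-8) codec, and the \x.. noise appears when the encoded bytes object (or its repr) gets written instead of the text. Opening the output file with an explicit UTF-8 encoding and writing the string directly keeps the Chinese characters intact. A minimal sketch, where text stands in for whatever you extracted from the .epub:

```python
# 'text' is the already-extracted chapter text (a normal Python str).
text = "2021年"  # example containing the character from the error (\u5e74 = 年)

# Open the .txt file with an explicit UTF-8 encoding and write the str as-is;
# no manual .encode() call is needed, so no b'\xe5...' escapes end up in the file.
with open("book.txt", "w", encoding="utf-8") as f:
    f.write(text)
```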
A discussion about vague icons on mobile devices led to the question of auto-translating "meaning codes" to local languages. Quote:
> Maybe we need something like Unicode for meaning instead of just pictograms. Pictograms can convey meaning, but are limited in that regard, especially for non-nouns.
> For example, the code 1230 could mean "show" or "display". 1234 could mean "all", 1235 could mean "next", and 1236 for "previous" etc.
> The UI designer would specify "[1230][1234]" and the device's local language settings would lookup the words to get "Show All" for English speakers, for example. I suppose the word order may be different in other languages, but it's still likely better than confusing icons...
I suppose we could use Esperanto as the standard, but that may require too much parsing. Plus, usage of parentheses and other grouping characters could reduce reference ambiguities often found in written languages, such as "I spotted a telescope on the hill" where "on" can refer to either noun. The equivalent of the sentence diagram could be unambiguous, at least in terms of a tree structure: "(I (location is on hill))(spotted)(one telescope)".
I couldn't find such a proposed standard on the "GoogleBings", only intermediate encodings used inside translation engines.
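Just to make the quoted idea concrete, here is a toy sketch of such a lookup; the codes follow the example numbers in the quote, and the tables are entirely made up:

```python
# Hypothetical per-language tables mapping "meaning codes" to words.
MEANING_TABLES = {
    "en": {1230: "Show", 1234: "All", 1235: "Next", 1236: "Previous"},
    "es": {1230: "Mostrar", 1234: "Todo", 1235: "Siguiente", 1236: "Anterior"},
}

def render(codes, lang):
    # Naive left-to-right join; a real standard would need per-language
    # word-order and grammar rules, as the quote already concedes.
    table = MEANING_TABLES[lang]
    return " ".join(table[code] for code in codes)

print(render([1230, 1234], "en"))   # "Show All"
print(render([1230, 1234], "es"))   # "Mostrar Todo"
```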
Hi,
does anyone have the same problem as I do and know the solution?
All German umlauts are displayed as question marks in my Logviewer. I'd like to know how I can change this.
Any hints would be appreciated.
Thanks in advance
https://preview.redd.it/njurl7nvj6y51.png?width=946&format=png&auto=webp&s=09744bfb75998a4a6d5c608733ff88f7dc4c65f6
Recently, while implementing a few encodings, I started thinking about and researching the topic in the title of this post. With all the information I've found so far, I'm now unsure whether using char16_t and UTF-16 (chosen because .NET, Java etc. took that route) was the wisest choice, and which option might be better suited for my needs. Let me give a general overview of what I'm working on and what I want to achieve, as it might help in answering my question:
I'm currently writing a framework to use as a basis for a backend (and possibly a frontend). The framework itself contains functionality to connect to databases (MSSQL, Oracle etc.), read and write files (be it simple text files, PDFs, zipped files, Office documents, XML, JSON etc.), and communicate over a socket using different protocols (FTP, LDAP, HTTP etc.). While I'm currently mostly writing code for Windows, I'd like to make the framework available for Android, Linux and maybe macOS as well at some point.
Before doing too much work that I might later need to redo, I'd like to ask for some insight from someone who has experience with the whole thing. I know there's no single solution that is best in every way, but maybe there's one that is the "best of both worlds", so let me simply ask:
Would it be wiser to use char and UTF-8 internally? Are there any arguments or current developments that might reduce the pool of types and encodings to choose from? Am I maybe not asking the right questions at all?
Thanks in advance and best regards,
waYne
Okay, so for reference: I'm on Linux Mint 20 (Cinnamon) and I need to read a Perl script. It runs into an issue with a certain package (line 3 says something along the lines of #Use CAre.pm, and that's where it's not happy); when I check the package file itself, the file is telling me about a character encoding issue.
I wasn't sure what the issue was, but I get the same issue on a cluster; and not only that, I ran into this same "character encoding" issue when I tried to download a file from my university's IT department (one they asked me to download so that they could do a screen share).
Does anyone have any idea what could be causing this? I've used Linux before but not Mint, and I'm honestly baffled by how often this particular issue is coming up.