
Civicom

 


LAMP + PDF + I18N: UTF-8 and CID

Welcome to the TCPDF testing

2009·AUG·21, ed. 2009·OCT·04, Javier de Lorenzo-Cáceres

Thanks to Phil Daintree, Tim Schofield and Harald Ringehahn from webERP, Zhiguo Yuan from ulaszipper.com and Markus Kuhn from University of Cambridge Computer Laboratory http://www.cl.cam.ac.uk/~mgk25/ and of course, Nicola Asuni and Olivier Plathey, the authors of TCPDF and FPDF respectively.

MySQL, PHP, Apache (http://www.w3.org/International), Firefox and OOo support UTF-8 and are free. PDF is not free and handles Unicode in a special way, introducing a character-access type called Character Identifier, abbreviated as CID: a 16-bit code. I don't know FPDF well, especially its CJK part, but I guess it only supports ANSI and CJK, i.e., it supports neither UTF-8 nor any ISO charset other than Latin-1; hence it doesn't support Turkish and many other languages. TCPDF supports UTF-8 and, as PDF requires, it does some uni2cid conversions (UTF-8 to CID, which are lossy because of the 16-bit logic of CID) and uses some CID-keyed fonts. Also, some PHP functions like htmlspecialchars() and htmlentities() don't support any ISO charset other than ISO-8859-1 (or 15), which pushes towards UTF-8. I'm only testing TCPDF with UTF-8: languages, characters and glyphs, and CID-keyed fonts (CMaps and CIDFonts), compared with ISO-8859-X. All the examples are based on TCPDF Example #8, with many characters added.

This web page is UTF-8 encoded and was intended not to use a BOM, but as more languages were written, a BOM was added; the font comes from the user agent. For Hebrew (and Arabic in the future), the CSS2 properties 'direction' and 'unicode-bidi' are used.

I knew about ASCII, ANSI, ISO and Unicode, but CID is something new to me, although it has existed for about 15 years. Fortunately, documentation about CID is provided:

TCPDF Unicode to CID conversion table is from ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/ The files in this directory relate to Adobe CJK character collections for CID-keyed fonts, and also provide *real* CIDFonts for testing purposes.

Other CID-related Tech Notes:

People interested in CID-keyed font technology, or wishing to develop products based on it, should request a copy of the CID SDK (CID Software Developers Kit).

ISO-8859-X and webERP available languages

ISO was designed for information, not typography, so writing Dutch 'ij' or French 'oe' as two letters is considered correct. webERP's default language is English and it has been translated into 17 other languages. Although the number of locales is even greater, and there are 2 charsets for Chinese, only 11 of the 18 available languages have PDF support with FPDF. The following list shows all webERP languages, sorted by an index number that refers to the charsets table below, where webERP languages are written in uppercase. All languages except those under index 1 and CJK lack PDF reports. The two-letter language codes are ISO-639. Of the 18 webERP languages, 9 belong to ISO-1; hence webERP is multilanguage for these 9. As FPDF supports CJK, webERP is i18n compliant for Chinese and Japanese.

  • 1. English, Dutch (missed IJ ij), French, German, Indonesian, Italian, Spanish, Swedish, Portuguese
  • 2. Czech, Hungarian, Polish
  • 3. Turkish
  • 5. Russian
  • 7. Greek
  • 11. Chinese Big5
  • 12. Chinese simplified GB2312
  • 13. Japanese
  • 15. Persian (Farsi)

Charsets on web pages are still too often missing or incorrect or not the best suited. People learn to live with imperfections and new charsets are ignored, e.g., French and Finnish use ISO-1 instead of its ISO-15. I marked languages with an asterisk when a better ISO (bold type) exists. As a language may use several charsets, a language declaration is not enough for disambiguation, charset must be declared too.

Columns: Index · Former charset and its languages · Improved charset and its languages · Latter charset and its languages

Index 1
  Former: ISO-8859-1 West. It's superseded by ISO-15 unless you need one of the following 8 characters {¤, ¦, ¨, ´, ¸, ¼, ½, ¾}. It supports ENGLISH (en), DUTCH (nl) (missed IJ ij), FRENCH*(fr), GERMAN (de), INDONESIAN (id), ITALIAN (it), SPANISH (es) [inc. Basque (eu), Catalan (ca) and Galician (gl)], SWEDISH (sv), PORTUGUESE (pt), Afrikaans (af) and Swahili, Faroese (fo), Icelandic (is), Danish (da), Norwegian (no), Irish (ga), Scottish (gd), Finnish*(fi) (partial), Albanian*(sq).
  Improved: ISO-8859-15. ENGLISH (en), DUTCH (nl) (missed IJ ij), FRENCH (fr), GERMAN (de), INDONESIAN (id), ITALIAN (it), SPANISH (es) [inc. Basque (eu), Catalan (ca) and Galician (gl)], SWEDISH (sv), PORTUGUESE (pt), Afrikaans (af) and Swahili, Faroese (fo), Icelandic (is), Danish (da), Norwegian (no), Irish (ga), Scottish (gd), Finnish (fi), Estonian (et).

Index 2
  Former: ISO-8859-2 East. CZECH (cs), Slovak (sk), Bosnian (bs), Serbian (sr) latin, Croatian*(hr), Hungarian*(hu), Polish*(pl), Romanian*(ro), Slovenian*(sl).
  Improved: ISO-8859-16 South-East. Albanian (sq), Croatian (hr), HUNGARIAN (hu), POLISH (pl), Romanian (ro), Slovenian (sl), also French, German, Italian and Irish Gaelic.

Index 3
  Former: ISO-8859-3 South. Maltese (mt), Esperanto (eo). (scarce support in browsers)
  Improved: ISO-8859-9 Turkish. TURKISH (tr) (replaces Icelandic from ISO-1).

Index 4
  Former: ISO-8859-4 North. Estonian*(et), Latvian*(lv), Lithuanian*(lt), Greenlandic, Sami.
  Improved: ISO-8859-10 North. Inuit (Eskimo), Lapp (doesn't have a 2 letter code).
  Latter: ISO-8859-13 Baltic. Estonian (et), Latvian (lv), Lithuanian (lt).

Index 5: ISO-8859-5 Cyrillic. RUSSIAN (ru) (koi8-r), Bulgarian (bg), Byelorussian (be), Serbian (sr), Macedonian (mk), Ukrainian (uk).
Index 6: ISO-8859-6 Arabic. Arabic (ar).
Index 7: ISO-8859-7 Greek. Modern GREEK (el).
Index 8: ISO-8859-8 Hebrew. Hebrew (iw) (letters, not signs).
Index 9: ISO-8859-11 Thai. Thai.
Index -: ISO-8859-12 Devanagari.
Index 10: ISO-8859-14 Celtic. Gaelic, Welsh, Breton.
Index 11: Big5. CHINESE traditional.
Index 12: GB2312. CHINESE simplified.
Index 13: Shift JIS, ISO-2022-JP, EUC-JP. JAPANESE (ja).
Index 14: EUC-KR. Korean (ko).
Index 15: UTF-8. PERSIAN (Farsi).

How CID works

CID-Keyed fonts have three components: Character Collection, CMap and CIDFont.

Character Collection
It's a document that defines an ordered character set addressed with 2 bytes. Each character in the set is given a CID number or index (0-65535). Actual Collections are smaller than 65536 characters. The set of values covered by the Collection is called the Codespace. There is a rule for the Character Collection name: 'Registry-Ordering-Supplement', e.g., 'Adobe-GB1-5' means Registry=Adobe (the author), Ordering=GB1 (base character set order) and Supplement=5 (number of subsets added since the first release, supplement 0). Adobe Collections are based on Adobe Character Sets, which are based on standard charsets. Large Adobe charsets are: Western 2 (replaces ISO-Adobe), CE (Central European), Latin Extended (= Western2 + CE + Welsh, archaic Danish and Esperanto), Greek, Polytonic Greek, Cyrillic, Symbol/Pi, Adobe-Japan1-6 (or Pr6N), Adobe-GB1-5, Adobe-CNS1-4, Adobe-Korea1-2 and Vietnamese. Smaller ones are ISO-Adobe (ISO-1), Custom (subsets), and other subsets like Kana. The largest Collection is Japan1-6 (23058), which includes Western2, CE and others.
CIDFont
This is a set of glyphs that may draw all the characters of a specific Collection, but for large Collections it frequently covers only a subset; e.g., the Pro or Std indicator in the font name tells whether the font has the CE charset (Pro) or not (Std).
CMap
As a CID is not a character code, a relationship is needed to access glyphs from a CIDFont. This is the correspondence between character codes like UTF-8 or UTF-16 and CID numbers from the specific Collection. The Identity CMap is a y=x map for the least significant byte when the most significant byte is zero, and assigns zero to the least significant byte when it is not; ASCII and ANSI Collections need nothing more to be accessed by most character codes.
CMaps may do more than map to CIDFonts: they may also map to character codes for fonts other than CIDFonts and map to character names for PostScript fonts. CMaps may also act as a font, called a Rearranged Font; this is a font that consists only of a CMap file referencing other fonts (Adobe Type Composer). The referenced fonts must be in the user's system.
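The 'Registry-Ordering-Supplement' naming rule can be sketched in a few lines. This is only an illustration written in Python (it is not part of TCPDF, which is PHP); `CollectionName` and `parse_collection_name` are hypothetical names, and the sketch assumes that none of the three parts contains a hyphen.

```python
from typing import NamedTuple


class CollectionName(NamedTuple):
    registry: str    # the author, e.g. "Adobe"
    ordering: str    # the base character set order, e.g. "GB1"
    supplement: int  # number of added subsets since supplement 0


def parse_collection_name(name: str) -> CollectionName:
    """Split 'Registry-Ordering-Supplement' into its three parts."""
    registry, ordering, supplement = name.split("-")
    return CollectionName(registry, ordering, int(supplement))


gb = parse_collection_name("Adobe-GB1-5")
japan = parse_collection_name("Adobe-Japan1-6")
print(gb)
print(japan)

# Same registry and ordering with a lower supplement: an earlier release
# of the same Collection, not a different Collection.
older = parse_collection_name("Adobe-Japan1-4")
print(older.ordering == japan.ordering, older.supplement < japan.supplement)
```

Comparing registry and ordering is what tells whether two resources belong to the same Collection; the supplement only says how far that Collection has been extended.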

Tech Note 5092, page 13, says CMap files are "shared among all CID-keyed fonts"; it forgot to say "for a specific Character Collection". This also happens with Rearranged Fonts, since in this kind of font there is a main font, called the Template, that defines the Codespace; the other fonts are called Components, and they can substitute glyphs from the Template with consistent ones for a style change, and they can also add glyphs to cover empty pre-defined subsets, but they can't improve the Collection by enlarging the Codespace. Doing so would be the same as dealing with a new Collection, i.e., a new Collection/CMap/CIDFont must be defined. It's not so difficult, but it's copyrighted; I mean, may I enlarge the Codespace of an Adobe font?

Anyway, it is not possible to merge a CIDFont from the Adobe-Japan1-X Collection with another CIDFont from the Adobe-GB1-X Collection without creating a new Collection, because, although they belong to different Collections, they both use the same CID ranges (their CID Codespaces overlap) and they would conflict; i.e., it's not possible to add one font to the other as a supplement without changing the CID index. Let's see it.

While we are in the Unicode domain, we can write both:
计 = hex U+8ba1 = decimal 35745 = a character from the GB1 Collection.
塞 = hex U+585e = decimal 22622 = a character from the Japan1 Collection.

But in the CID domain both characters have the same CID index, 2105, each in its respective Collection. Collections are abstract notions; they materialize, becoming concrete, in CMaps and CIDFonts. Each CMap is written for one specific Collection.

If you look at Adobe CMaps:
UniGB-UTF16-H, you will see: <8ba1> 2105
UniJIS-UTF16-H, you will see: <585e> 2105

And the same is in TCPDF CMaps:
uni2cid_ag15.php, you will see: 35745=>2105
uni2cid_aj16.php, you will see: 22622=>2105

Despite the fact that TCPDF CMaps are decimal and sorted by CID index, and Adobe CMaps are sorted by character code expressed in hexadecimal, they are identical.
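The collision can be reproduced with the two entries quoted above. The following Python sketch is only an illustration: the two dictionaries are one-entry excerpts, not the real tables, and `cid_to_char` is a hypothetical helper.

```python
# One-entry excerpts of the two conversion tables quoted in the text
# (Unicode code point -> CID). The real tables hold thousands of entries.
UNI2CID_GB1 = {0x8BA1: 2105}     # as in uni2cid_ag15.php / UniGB-UTF16-H
UNI2CID_JAPAN1 = {0x585E: 2105}  # as in uni2cid_aj16.php / UniJIS-UTF16-H


def cid_to_char(table, cid):
    """Reverse lookup: which character does this CID stand for in this table?"""
    for code_point, mapped_cid in table.items():
        if mapped_cid == cid:
            return chr(code_point)
    return None


# In the Unicode domain the two characters are distinct code points.
print(hex(ord("计")), ord("计"))
print(hex(ord("塞")), ord("塞"))

# In the CID domain the same index means a different character,
# depending on which Collection's table is consulted.
print(cid_to_char(UNI2CID_GB1, 2105))
print(cid_to_char(UNI2CID_JAPAN1, 2105))
```

A bare CID such as 2105 is therefore meaningless until the Collection is known, which is why the two CIDFonts cannot simply be concatenated.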

Adobe is not expected to create a new Collection merging GB and Japan1, because both Collections may be used at the same time.
PDF using two CMaps/CIDFonts may show Chinese and Turkish at the same time
Maybe the Unified Ideograph subset we miss in the Japan1 Collection will be added, just as Japan2 was recently added to Japan1, which made Japan2 obsolete.

I expect the above to prove that making a CMap is not a solution.
The solution is one of three:

  1. To create a new Collection and its required resources, CMap and CIDFont.
  2. To use two Collections at the same time.
  3. Another one that I can't think of.

Neither of the two known solutions is ideal. The second implies scanning the characters and switching the Collection accordingly, while the first implies making a new Collection, CMap and CIDFont.
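What the second option implies can be sketched as splitting the text into runs, one per Collection. This Python sketch is only an illustration under loud assumptions: the `COVERAGE` sets are hypothetical and hold just the sample characters from this page (real coverage would come from the uni2cid tables), and falling back to Japan1 for everything else is an arbitrary choice.

```python
from itertools import groupby

# Hypothetical coverage sets: only a few sample characters, NOT real tables.
COVERAGE = {
    "GB1": set("传真运费总计"),
    "Japan1": set("いろはにほへと"),
}


def collection_for(ch, default="Japan1"):
    """Return the first Collection that covers ch, or the default one."""
    for name, chars in COVERAGE.items():
        if ch in chars:
            return name
    return default


def split_runs(text):
    """Group consecutive characters that can share one Collection/font."""
    return [(name, "".join(chars)) for name, chars in groupby(text, key=collection_for)]


for name, run in split_runs("いろは传真 fax"):
    print(name, repr(run))
```

Each run would then need its own font selection before being written, which is the extra work the second option carries.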

Bitstream Cyberbit

This is a rich Unicode font, almost pan-Unicode, and it's free. FontForge can generate CID versions of it from CMaps. This font might be used when creating a new Collection. Alternatively, several fonts may be combined to create the new CIDFont for the new Collection, e.g., DejaVu or FreeFont for the Western part and the free Arphic fonts for the Chinese and Unified subsets.

How TCPDF example #8 works, the very short story.

Example #8 looks like this:
require_once('../../tcpdf/config/lang/eng.php'); (the language array that declares utf-8, ltr, en, page)
require_once('../../tcpdf/tcpdf.php'); (the class)
$pdf = new TCPDF(PDF_PAGE_ORIENTATION, PDF_UNIT, PDF_PAGE_FORMAT, true, 'UTF-8', false); (the constructor defined in the class)
13 calls to set document information, header and footer, monospaced font, margins, auto page breaks, image scale factor and the previous language array.
SetFont
AddPage
file_get_contents (read the text file that will be the pdf contents)
SetFillColor (background color)
Write (write the text)
Output (Close and output PDF document)

Despite the language array, the key is the SetFont call. The TCPDF font file invoked is a PHP file. There are 5 TCPDF font kinds: Core, TrueType, Type 1, TrueTypeUnicode and CID (tcpdf.php line 3068), but only Core and CID fonts may be left unembedded.

TCPDF CID fonts php files may use one of the following four CMaps:
(I added some Adobe CIDFonts to the list to complete the needed resources.)

  • Adobe-Japan1-5 , UniJIS-UTF16-H , KozMinPr6N-Regular , KozGoProVI-Medium
  • Adobe-GB1-2 , UniGB-UTF16-H , AdobeHeitiStd-Regular, AdobeSongStd-Light
  • Adobe-CNS1-0 , UniCNS-UTF16-H , AdobeMingStd-Light
  • Adobe-Korea1-0 , UniKS-UTF16-H , AdobeMyungjoStd-Medium

(Don't ask me why UTF-16 CMaps are used instead of the UTF-8 ones, yet).

TCPDF also has the following CID font definition PHP files:

  • kozminproregular.php , Adobe-Japan1-4 , UniJIS-UCS2-H
  • kozgopromedium.php , Adobe-Japan1-4 , UniJIS-UCS2-H
  • stsongstdlight.php , Adobe-GB1-2 , UniGB-UCS2-H
  • msungstdlight.php, Adobe-CNS1-3 , UniCNS-UCS2-H
  • hysmyeongjostdmedium.php , Adobe-Korea1-1 , UniKS-UCS2-H

TCPDF Adobe-Japan1-4 CMap may be updated to Adobe-Japan1-6 with ease, if necessary.

UCS-2 (Universal Character Set ISO 10646) CMap files are now considered obsolete by Adobe, and are no longer being maintained. They have been replaced, for all character collections, with a suite of UTF-8, UTF-16 and UTF-32 CMap files.
(Again, don't ask me why UCS2 CMaps are used instead of the UTF-16 or UTF-8 ones).

Example 1 · ANSI

Characters

ANSI is similar to ISO-8859-1 (or 15), Western Europe. ISO-8859-1 was improved into ISO-8859-15 to fully support French {œ, Œ, Ÿ} (avoiding the use of oe and OE instead), and the euro sign {€} was added. This is well supported by PHP, FPDF, TCPDF and the Acrobat core fonts.
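The difference between the two charsets can be listed byte by byte. This is a small Python sketch using the standard codecs, given only as an illustration; `differing_bytes` is a hypothetical helper name.

```python
def differing_bytes(charset_a, charset_b):
    """List the byte values 0xA0-0xFF that decode differently in two charsets."""
    diffs = []
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        a = raw.decode(charset_a)
        b = raw.decode(charset_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs


diffs = differing_bytes("iso-8859-1", "iso-8859-15")
for value, old, new in diffs:
    print(f"0x{value:02X}: {old} -> {new}")
print(len(diffs), "positions differ")
```

Besides the French letters and the euro sign, the remaining changed positions hold Š, š, Ž and ž, which is why Finnish and Estonian also benefit from ISO-8859-15.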

ISO-8859-1 is the webERP's default charset, it supports:
(again, webERP languages are written uppercase)

ENGLISH (en)

These are the 26 letters of the basic modern Latin alphabet.

DUTCH (nl)

The only distinct letter of the Dutch alphabet is IJ ij (lange ij). ISO doesn't include it; it can be written as I followed by J or, as in the past, as Y.

FRENCH (fr)

{œ, Œ, Ÿ} to avoid the use of oe and OE instead.

Portez ce vieux whisky au juge blond qui fume sur son île intérieure, à côté de l'alcôve ovoïde, où les bûches se consument dans l'âtre, ce qui lui permet de penser à la cænogenèse de l'être dont il est question dans la cause ambiguë entendue à Moÿ, dans un capharnaüm qui, pense-t-il, diminue çà et là la qualité de son œuvre.

l'île exiguë Où l'obèse jury mûr Fête l'haï volapük, Âne ex aéquo au whist, Ôtez ce vœu déçu.

Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au delà des îles, près du mälström où brûlent les novæ.

GERMAN (de)

Falsches Üben von Xylophonmusik quält jeden größeren Zwerg
(= Wrongful practicing of xylophone music tortures every larger dwarf)

Zwölf Boxkämpfer jagten Eva quer über den Sylter Deich
(= Twelve boxing fighters hunted Eva across the dike of Sylt)

Heizölrückstoßabdämpfung
(= fuel oil recoil absorber)
(jqvwxy missing, but all non-ASCII letters in one word)

INDONESIAN (id)

The Indonesian alphabet is the same as the English one.

ITALIAN (it)

The Italian alphabet uses only 21 of the letters of the English alphabet.

SPANISH (es) (inc. catalán, gallego y euskera)

{ñ, Ñ, ü, Ü, á, é, í, ó, ú, Á, É, Í, Ó, Ú}. Like the Dutch ij, we have ch and ll, which form unique letters; hence there are 28 letters, but 29 are used since w was adopted.

SWEDISH (sv)

The characters {Å, å, Ä, ä, Ö, ö} are considered distinct letters in Swedish and are sorted after Z (unlike the German umlauts in the German alphabet).

PORTUGUESE (pt)

The alphabet is the basic modern Latin one. As in French, there are letters with diacritics like ã or ê. Brazilian Portuguese also has ç, which is included in most Latin keyboards.

Danish (da)

Danish alphabet includes {Æ, æ, Ø, ø, Å, å} sorted after Z.

Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Wolther spillede på xylofon.
(= Quiz contestants were eating strawberries with cream while Wolther the circus clown played the xylophone.)

Icelandic (is)

Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa

Sævör grét áðan því úlpan var ónýt (some ASCII letters missing)

Example 1: txt, odt and pdf.

  • PDF Font: Helvetica.
  • From: Acrobat Core Fonts.
  • Embedding: Core Fonts are never embedded.
  • Font styles (bold, italic and bold-italic) are allowed.

Core fonts are 14 = 3 faces (Helvetica or Myriad, Times-Roman or Minion and Courier) x 4 styles (regular, bold, italic and bold-italic) + 1 Symbol + 1 Adobe Pi (Zapf-Dingbats).
I mention this (already well known) to note that I believe CID fonts lack styles; they have a single weight from {light, regular, medium, bold}.

ISO-8859-1 and 15 don't support other languages, such as those of Eastern, Southern and Northern Europe, Greek, Turkish, etc. (see ISO-8859-2 to 16, except 15)

Example 2 · CID

Characters

When ISO-8859-1 (or 15) is changed to another ISO charset, languages other than English are lost because byte values 128 to 255 now accommodate other languages. Sometimes only a few characters change, but that is enough to require the ISO change.

Using UTF-8 solves this, but Acrobat doesn't support UTF-8; UTF-8 (variable length, from one to four bytes) is mapped to a two-byte CID through a CMap. This is a lossy conversion because 2 bytes can't accommodate the more than 100000 characters of the Unicode repertoire. There are many CMaps but TCPDF has only four: CCJK (two Chinese, Japanese and Korean). Neither Chinese CMap covers Turkish, so let's try the Japanese CMap.
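The size mismatch can be checked directly. This Python sketch is only an illustration of the arithmetic; the sample characters and the `utf8_length` helper are chosen here and do not come from TCPDF.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# One sample per UTF-8 length: ASCII, Latin-1, CJK, and a character
# beyond the 16-bit range (musical symbol G clef, U+1D11E).
samples = ["A", "ñ", "计", "𝄞"]
for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s): {ch.encode('utf-8').hex()}")

# A 16-bit CID can address at most this many glyphs per Collection.
CID_BITS = 16
max_cids = 2 ** CID_BITS
print(max_cids)

# Some code points do not even fit in 16 bits, so no single two-byte
# index can cover the whole Unicode repertoire.
print(ord("𝄞") > max_cids - 1)
```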

1. ISO-1 and 15, Western Europe.

The characters in Example 1 are also included in the generated Example 2 pdf file.

2. ISO-2 and 16, Eastern Europe.

Eastern Europe languages like Hungarian, Polish, etc. are also supported.

HUNGARIAN (hu)

Árvíztűrő tükörfúrógép
(= flood-proof mirror-drilling machine, only all non-ASCII letters)

POLISH (pl)

Pchnąć w tę łódź jeża lub ośm skrzyń fig
(= To push a hedgehog or eight bins of figs in this boat)

3. ISO-3 and 9, Turkish.

TURKISH (tr)

ISO-9's set of six Turkish characters {ı, İ, ş, Ş, ğ, Ğ} replaces the set of six Icelandic characters {ý, Ý, þ, Þ, ð, Ð} from ISO-1. Here we see both sets at the same time.
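The replacement can be seen by decoding the same six bytes with both charsets. This Python sketch uses the standard codecs and is given only as an illustration; `SIX_BYTES` and `can_encode` are names chosen here.

```python
# The six byte values where ISO-8859-9 differs from ISO-8859-1.
SIX_BYTES = bytes([0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE])

icelandic = SIX_BYTES.decode("iso-8859-1")
turkish = SIX_BYTES.decode("iso-8859-9")
print(icelandic)
print(turkish)


def can_encode(text, charset):
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Neither ISO charset holds both sets, but UTF-8 does.
both = icelandic + turkish
for charset in ("iso-8859-1", "iso-8859-9", "utf-8"):
    print(charset, can_encode(both, charset))
```

The same bytes read as Icelandic or as Turkish depending only on the declared charset, which is why the charset must be declared and why a single document holding both needs UTF-8.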

4. ISO-4 10 and 13, Northern Europe and Baltic.

I have no text from them but there aren't webERP translations either.

5. ISO-5, Cyrillic.

Languages like Russian, Bulgarian, etc. are also supported.

RUSSIAN (ru)

В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!
(= Would a citrus live in the bushes of south? Yes, but only a fake one!)

6. ISO-6, Arabic.

No text is available, and there are no webERP translations either.

7. ISO-7, Modern Greek.

The most used Greek math symbols are included in the 13th core font (Symbol); here is the actual script.

GREEK (el)

Γαζέες καὶ μυρτιὲς δὲν θὰ βρῶ πιὰ στὸ χρυσαφὶ ξέφωτο
(= No more shall I see acacias or myrtles in the golden clearing)

Ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
(= I uncover the soul-destroying abhorrence)

8. ISO-8, Hebrew.

ISO-8 covers hebrew letters but not signs; I don't know if the following has signs. This is a RTL writing.

Hebrew (iw)

? דג סקרן שט בים מאוכזב ולפתע מצא לו חברה איך הקליטה

9. ISO-11, Thai.

Thai has a stunning alignment. A mono-spaced font is used but it doesn't fit Thai.

Thai (th)


[--------------------------|------------------------]
 ๏ เป็นมนุษย์สุดประเสริฐเลิศคุณค่า  กว่าบรรดาฝูงสัตว์เดรัจฉาน
 จงฝ่าฟันพัฒนาวิชาการ           อย่าล้างผลาญฤๅเข่นฆ่าบีฑาใคร
 ไม่ถือโทษโกรธแช่งซัดฮึดฮัดด่า     หัดอภัยเหมือนกีฬาอัชฌาสัย
 ปฏิบัติประพฤติกฎกำหนดใจ        พูดจาให้จ๊ะๆ จ๋าๆ น่าฟังเอย ฯ

[The copyright for the Thai example is owned by The Computer Association of Thailand under the Royal Patronage of His Majesty the King.]

-. ISO-12, Devanagari.

ISO-12 Devanagari is deprecated/obsolete.

10. ISO-14, Gaelic.

Irish Gaelic (ga)

D'fhuascail Íosa, Úrmhac na hÓighe Beannaithe, pór Éava agus Ádhaimh

11-14. CJKV, Asian.

Up to here, I have covered all ISO charsets. Now let's look at CJK. Japanese has several scripts, such as Hiragana and Katakana.

Japanese (ja)

Hiragana: (Iroha)
いろはにほへとちりぬるを
わかよたれそつねならむ
うゐのおくやまけふこえて
あさきゆめみしゑひもせす

Katakana:
イロハニホヘト チリヌルヲ ワカヨタレソ ツネナラム
ウヰノオクヤマ ケフコエテ アサキユメミシ ヱヒモセスン

The Japanese CMap misses the Unified Ideographs; a Chinese CMap must be used.

CHINESE (zh)

自所不欲 勿施与人

CHINESE · Unified Ideograph as in gb2312
(first from fax, second from freight and both from total)

传真(fax) 运费(freight) 总计(total)

Thanks to Zhiguo Yuan for the following chars from Unicode Unified Ideograph:
U+4f20 summon; propagate, transmit
U+8d39 expenses, expenditures, fee
U+603b collect; overall, altogether
U+8ba1 plan, plot; stratagem; scheme
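The code points listed above can be recomputed from the three sample words. This Python sketch is only an illustration; the `WORDS` dictionary and the `picked` string are names chosen here, and the GB2312 check relies on Python's standard `gb2312` codec.

```python
# The three sample words from the example text.
WORDS = {"fax": "传真", "freight": "运费", "total": "总计"}

# First character of fax, second of freight, both of total.
picked = WORDS["fax"][0] + WORDS["freight"][1] + WORDS["total"]

for ch in picked:
    gb_bytes = ch.encode("gb2312")  # raises UnicodeEncodeError if not in GB2312
    print(f"{ch} U+{ord(ch):04x} is {len(gb_bytes)} bytes in GB2312: {gb_bytes.hex()}")

code_points = [f"{ord(ch):04x}" for ch in picked]
print(code_points)
```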

The above is based on Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/ -- 2001-09-02 Many improvements have been made. Please let me know if you find others! Special thanks to the people from all over the world who contributed these sentences.

Example 2 txt and odt

odt Issues

  • OOo 3.1 for Windows plugin doesn't work well with IE8, it did with IE7.

Example 2 Japanese pdf

Japanese Font

  • Font: Kozuka Mincho Pro 6 New Regular (Mincho= a serif, Pro= it has Central European glyphs, New= Adobe 9, Regular= normal weight).
  • From: Adobe Reader 9 (Reader 8 could make a double subst.) Japanese CID Font.
  • Embedding: not embedded.
  • Styles: I believe this font has no styles, like bold or italic.

Japanese Issues

  • Adobe Reader 8 has Kozuka Mincho Pro, called KozMinProVI-Regular, but Reader 9 has KozMinPr6N-Regular, which is Kozuka Mincho Pro 6 *New*. (see note below)
  • The Japanese CMap lacks the Unicode Unified Ideographs.
  • The Japanese CMap misses some Greek glyphs.
  • The Japanese CMap misses all Hebrew glyphs, and RTL support needs to be set in TCPDF.
  • The Japanese CMap misses all Thai glyphs, and Thai might need a special tabulation.

As only one font may be declared, and I declared the one from Reader 9, this causes a double font substitution in non-matching versions like Reader 8. To avoid it, copy the Kozuka New file from Reader 9 to Reader 8, or update Reader 8 to 9.

Example 2 Chinese GB pdf

Chinese GB Font

  • Font: Song Standard Light (Song= a serif, Std= it has no Central European glyphs, Light= weight is less than regular).
  • From: Adobe Reader 8 and 9 Chinese GB CID Font.
  • Embedding: not embedded.
  • Styles: I believe this font has no styles, like bold or italic.

Chinese GB Issues

To learn how Adobe handles fonts, see its documentation. In brief: if the font names match, Reader uses the installed font; if not, Reader uses the font description sent with the document. This is why the four sub-examples look different.

  • The Unified Ideograph range now looks correct, but almost all other languages are lost: not only the Central European ones, as expected from the Std label, but also most of Western 2.
  • #1 and #3 lack the apostrophe. In #2 and #4 the apostrophe is right.
  • Letters are more widely spaced in #2; especially the vowels 'a' and 'e' have extra space on their right, and kerning is horrible; also, special characters such as the Spanish ones are out of place. (compare with the pdf from Example 1)

Appendix 0. CIDFonts

Heiti

Adobe Reader comes with a very small font collection. Looking at the font names we can see the Std or Pro label, which shows their support for Central European languages, and looking at the file sizes we can guess their capabilities; this is how AdobeHeitiStd-Regular caught my attention. But it wasn't simple to find out which Collection the font is suited for, at least for me (maybe Asians see this easily). The files are binary and as large as 10 MB, so before opening one with hexeditor or hexplorer I did a search, and it was not fast. Then I made this example to make it evident. I thought of deleting this part, but it gives an idea of the inner CMap-CIDFont relationship.

TCPDF using the Adobe Heiti Std Font with right and wrong CMaps.

Appendix 1. Unicode Ranges

Bitstream Cyberbit (a free 13 MB serif font) has these Unicode ranges:
Basic Latin,
Latin-1 Supplement,
Latin Extended-A, *** Here is Turkish: S with cedilla, G with breve, Capital I with dot above and dotless small i ***
Latin Extended-B,
Spacing Modifier Letters,
Greek,
Cyrillic,
Hebrew Extended (A and B blocks combined),
Thai, Latin Extended Additional,
General Punctuation,
Currency Symbols,
Letterlike Symbols,
Number Forms,
Arrows,
Mathematical Operators,
Miscellaneous Technical,
Box Drawing, Block Elements,
Geometric Shapes,
Miscellaneous Dingbats,
Alphabetic Presentation Forms,
Combining Diacritical Marks,
Enclosed Alphanumerics,
Arabic,
Arabic Presentation Forms-A and -B,
CJK (Chinese, Japanese, Korean) Symbols and Punctuation,
Hiragana, Katakana,
Bopomofo,
Hangul Compatibility Jamo,
Enclosed CJK Letters and Months,
CJK Compatibility,
Hangul,
CJK Unified Ideographs, *** I think the missing Chinese picts belong to this group ***
CJK Compatibility Ideographs,
CJK Compatibility Forms,
Small Form Variants,
and Halfwidth and Fullwidth Forms.

Appendix 2. ISO

  • iso-8859-1 (latin-1) Western Europe: English, German, Spanish (inc. Catalan, Galician and Basque), Italian, Portuguese, Swedish, Danish, Albanian (see 16), etc.
  • iso-8859-2 (latin-2) Eastern Europe: Hungarian, Polish, Czech, Slovak, Bosnian, Croatian, Serbian (Latin script), Romanian, etc.
  • iso-8859-3 (latin-3) Southern Europe: Maltese.
  • iso-8859-4 (latin-4) Northern Europe: Estonian, Latvian, Lithuanian, Greenlandic and Sami.
  • iso-8859-5 (latin/Cyrillic): Russian, Bulgarian, Belarusian, Serbian and Macedonian.
  • iso-8859-6 (latin/Arabic)
  • iso-8859-7 (latin/Greek): Modern Greek.
  • iso-8859-8 (latin/Hebrew): covers the Hebrew letters but not the signs.
  • iso-8859-9 (latin-5), an improvement on latin-3 for Turkish. Based on latin-1, it replaces Icelandic with Turkish {ı (lowercase I), İ (uppercase i), ş, Ş, ğ and Ğ}.
  • iso-8859-10 (latin-6), an improvement on latin-4 for Nordic languages.
  • iso-8859-11 (latin/Thai)
  • iso-8859-12 (latin/Devanagari), abandoned.
  • iso-8859-13 (latin-7): an improvement on latin-4 and latin-6 for Baltic languages.
  • iso-8859-14 (latin-8), Celtic languages: Gaelic, Welsh and Breton.
  • iso-8859-15 (latin-9) Western Europe, an improvement on latin-1; it completes French {œ, Œ and Ÿ} and adds the euro sign {€}. (Based on and similar to latin-1)
  • iso-8859-16 (latin-10) South-Eastern Europe: Albanian, Croatian, Hungarian, Polish, Romanian, Slovenian and also French, German, Italian and Irish Gaelic.
  • gb2312: Simplified Chinese.

Appendix 3. UTF-8

A 15K pdf file grows to 450K when moving from ISO to UTF-8 if subsets of the UTF-8 fonts used in the document are embedded, but it can then contain characters from any language. If only the subset of characters used in the document is embedded, the file cannot be edited to add any new character, and if the whole font is embedded the file may gain 13 to 23 MB of extra weight, depending on whether Bitstream Cyberbit or Arial Unicode MS is used. For now, webERP only produces pdf reports in ISO-8859 + CJK (Chinese, Japanese and Korean).

