LAMP + PDF + I18N: UTF-8 and CID
Welcome to the TCPDF testing
2009·AUG·21, ed. 2009·OCT·04, Javier de Lorenzo-Cáceres
Thanks to Phil Daintree, Tim Schofield and Harald Ringehahn from webERP, Zhiguo Yuan from ulaszipper.com and Markus Kuhn from University of Cambridge Computer Laboratory http://www.cl.cam.ac.uk/~mgk25/ and of course, Nicola Asuni and Olivier Plathey, the authors of TCPDF and FPDF respectively.
MySQL, PHP, Apache (http://www.w3.org/International), Firefox and OOo support UTF-8 and are free. PDF is not free and handles Unicode in a special way: it introduces a character-access type called Character Identifier, abbreviated as CID, a 16-bit code. I don't know FPDF well, especially its CJK support, but I guess it only supports ANSI and CJK, i.e., it supports neither UTF-8 nor any ISO charset other than Latin-1; hence, it doesn't support Turkish and many other languages. TCPDF supports UTF-8 and, as required by PDF, it does some uni2cid conversions (UTF-8 to CID, which are lossy because of the 16-bit logic of CID) and uses some CID-Keyed fonts. Also, some PHP functions like htmlspecialchars() and htmlentities() don't support any ISO charset other than ISO-8859-1 (or 15), which pushes towards UTF-8.
I'm only testing TCPDF with UTF-8: languages, characters and glyphs, and CID-Keyed fonts (CMaps and CIDFonts), comparing them to ISO-8859-X. All the examples are based on TCPDF Example #8; many chars have been added. This web page is UTF-8 coded and was intended not to use a BOM, but as more languages were written, a BOM was added; the font is from the user agent. For Hebrew (and Arabic in the future), the CSS2 properties 'direction' and 'unicode-bidi' are used.
I knew about ASCII, ANSI, ISO and Unicode, but CID is something new to me, although it has existed for 15 years. Fortunately, documentation about CID is provided:
TCPDF Unicode to CID conversion table is from ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/ The files in this directory relate to Adobe CJK character collections for CID-keyed fonts, and also provide *real* CIDFonts for testing purposes. Other CID-related Tech Notes:
People interested in CID-keyed font technology, or wishing to develop products based on it, should request a copy of the CID SDK (CID Software Developers Kit).
ISO-8859-X and webERP available languages
ISO was designed for information, not typography, so writing Dutch 'ij' or French 'oe' as two separate letters is considered to be right. The webERP default language is English and it has been translated into 17 other languages. Although the number of locales is even greater and there are 2 charsets for Chinese, only 11 of the 18 available languages have pdf support with FPDF. The next list shows all webERP languages sorted by an index number that corresponds to the following charsets table, where webERP languages are written in uppercase. All but index 1 and CJK lack pdf reports. The two-letter language codes are ISO-639. Of the 18 webERP languages, 9 belong to ISO-1; hence, webERP is multilanguage for these 9. As FPDF supports CJK, webERP is i18n compliant for Chinese and Japanese.
Charsets on web pages are still too often missing, incorrect or not the best suited. People learn to live with imperfections and newer charsets are ignored; e.g., French and Finnish pages use ISO-1 instead of the better-suited ISO-15. I marked languages with an asterisk when a better ISO (bold type) exists. As a language may use several charsets, a language declaration is not enough for disambiguation; the charset must be declared too.
How CID works
CID-Keyed fonts have three components: Character Collection, CMap and CIDFont.
Tech Note 5092, page 13, says CMap files are "shared among all CID-keyed fonts"; it forgot to say "for a specific Character Collection". The same applies to Rearranged Fonts: in this kind of font there is a main font, called the Template, that defines the Codespace. The other fonts are called Components; they can substitute glyphs from the Template with consistent ones for a style change, and they can add glyphs to cover empty pre-defined subsets, but they can't improve the Collection by enlarging the Codespace. Doing so would be the same as dealing with a new Collection, i.e., a new Collection/CMap/CIDFont must be defined. It's not so difficult, but it's copyrighted; I mean, may I enlarge the Codespace of an Adobe font? Anyway, it is not possible to merge a CIDFont from the Adobe-Japan1-X Collection with a CIDFont from the Adobe-GB1-X Collection without creating a new Collection: since they belong to different Collections, they both use the same CID ranges (their CID Codespaces are not totally different from each other) and they would conflict. That is, it's not possible to add one font to the other as a supplement without changing the CID indexes. Let's see it.
While we are in the Unicode domain, we can write both:
But in the CID domain both characters have the same CID index, 2105, each in its respective Collection. Collections are abstract notions; they materialize, becoming concrete, in CMaps/CIDFonts. Each CMap is tailored to one Collection.
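The clash described above can be sketched in a few lines of Python (this page's own tests use TCPDF and PHP; Python is used here only for illustration). The two tables are toy data: the code points are placeholders chosen for the example, not values copied from the Adobe CMaps, and `cid_conflicts` is a hypothetical helper. Only the shared CID 2105 echoes the text.

```python
# Toy Unicode -> CID tables, one per Character Collection.
# The code points are illustrative placeholders, NOT copied from Adobe CMaps;
# only the shared CID 2105 echoes the text.
japan1 = {0x6F22: 2105, 0x5B57: 2106}
gb1 = {0x4F20: 2105, 0x771F: 2107}


def cid_conflicts(first, second):
    """Return {cid: (code_point_in_first, code_point_in_second)} for clashing CIDs."""
    by_cid_first = {cid: cp for cp, cid in first.items()}
    by_cid_second = {cid: cp for cp, cid in second.items()}
    shared = by_cid_first.keys() & by_cid_second.keys()
    return {
        cid: (by_cid_first[cid], by_cid_second[cid])
        for cid in shared
        if by_cid_first[cid] != by_cid_second[cid]
    }


# In the Unicode domain both characters coexist in one string.
both = chr(0x6F22) + chr(0x4F20)

# In the CID domain the two tables cannot simply be merged:
# the same CID would have to name two different characters.
conflicts = cid_conflicts(japan1, gb1)
for cid, (cp_a, cp_b) in conflicts.items():
    print(f"CID {cid}: U+{cp_a:04X} vs U+{cp_b:04X}")
```

Any merge of the two tables would have to renumber one side, which is exactly the "changing CID index" step that amounts to defining a new Collection.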
If you look at Adobe CMaps:
And the same is found in the TCPDF CMaps:
Although TCPDF CMaps are decimal and sorted by CID index, while Adobe CMaps are sorted by character code expressed in hexadecimal, they are identical.
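To make the "identical despite the layout" point concrete, here is a minimal Python sketch that expands Adobe-style `begincidrange` entries (hexadecimal codes, sorted by character code) into a decimal table sorted by CID index. The `sample` ranges are made up for the illustration and `expand_cidranges` is a hypothetical helper, not a TCPDF or Adobe function.

```python
import re

# Matches one "<lo> <hi> cid" entry of a begincidrange block.
RANGE_RE = re.compile(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>\s+(\d+)")


def expand_cidranges(cmap_text):
    """Expand '<lo> <hi> cid' entries into a {decimal_code: cid} dict."""
    table = {}
    for lo, hi, cid in RANGE_RE.findall(cmap_text):
        start, end, first_cid = int(lo, 16), int(hi, 16), int(cid)
        for offset in range(end - start + 1):
            table[start + offset] = first_cid + offset
    return table


# Made-up ranges in Adobe CMap syntax (hex codes, sorted by code).
sample = """
3 begincidrange
<0020> <0022> 1
<0041> <0042> 70
<00a5> <00a5> 61
endcidrange
"""

table = expand_cidranges(sample)
by_code = sorted(table.items())                     # Adobe-like order
by_cid = sorted(table.items(), key=lambda kv: kv[1])  # decimal, CID order

print(by_code)
print(by_cid)
```

Both listings hold the same pairs; only the ordering and the number base differ.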
Adobe is not expected to create a new Collection to merge GB and Japan1, because both Collections may be used at the same time.
I expect the above to prove that making a CMap is not a solution.
Neither of the two current solutions is ideal. The second requires looking at the characters and changing the Collection accordingly, while the first requires making a new Collection, CMap and CIDFont.
Bitstream Cyberbit
This is a rich Unicode font, almost pan-Unicode, and it's free. FontForge can generate CID versions of it from CMaps. This font might be used when creating a new Collection. Alternatively, several fonts may be combined to create the new CIDFont for the new Collection, e.g., DejaVu or FreeFont for the Western part and the free Arphic fonts for the Chinese and Unified subsets.
How TCPDF example #8 works, the very short story.
Example #8 looks like this:
The language array aside, the crux is the SetFont call. The TCPDF font file it invokes is a PHP file. There are 5 TCPDF font kinds: Core, TrueType, Type 1, TrueTypeUnicode and CID (tcpdf.php line 3068), but only Core and CID fonts may be left unembedded.
TCPDF CID fonts php files may use one of the following four CMaps:
(Don't ask me why UTF-16 CMaps are used instead of the UTF-8 ones, yet.)
TCPDF also has the following CID font definition PHP files:
TCPDF Adobe-Japan1-4 CMap may be updated to Adobe-Japan1-6 with ease, if necessary.
UCS-2 (Universal Character Set, ISO 10646) CMap files are now considered obsolete by Adobe, and are no longer being maintained. They have been replaced, for all character collections, with a suite of UTF-8, UTF-16 and UTF-32 CMap files.
Example 1 · ANSI
Characters
ANSI is similar to ISO-8859-1 (or 15), Western Europe. ISO-8859-1 was improved into ISO-8859-15 to fully support French {œ, Œ, Ÿ} (avoiding the use of oe and OE instead), and the euro sign {€} was added. This is well supported by PHP, FPDF, TCPDF and the Acrobat core fonts.
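The difference between ISO-8859-1 and ISO-8859-15 can be checked with Python's standard codecs. This is only a sketch outside the TCPDF tests of this page; `encodable` is a hypothetical helper name.

```python
def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# The four characters that ISO-8859-15 added over ISO-8859-1.
added = "œŒŸ€"

in_latin1 = [encodable(ch, "iso-8859-1") for ch in added]
in_latin9 = [encodable(ch, "iso-8859-15") for ch in added]

print(in_latin1)  # none of them fits ISO-8859-1
print(in_latin9)  # all of them fit ISO-8859-15

# Each one is a single byte in ISO-8859-15.
euro_byte = "€".encode("iso-8859-15")
print(euro_byte)
```

Text without these four characters, such as plain English, encodes in both charsets.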
ISO-8859-1 is webERP's default charset; it supports:
ENGLISH (en)
These are the 26 letters of the basic modern Latin alphabet.
DUTCH (nl)
The only distinct letter of the Dutch alphabet is IJ ij (lange ij). ISO doesn't include it; it can be written as IJ, or as Y as in the past.
FRENCH (fr)
{œ, Œ, Ÿ}, to avoid the use of oe and OE instead.
Portez ce vieux whisky au juge blond qui fume sur son île intérieure, à côté de l'alcôve ovoïde, où les bûches se consument dans l'âtre, ce qui lui permet de penser à la cænogenèse de l'être dont il est question dans la cause ambiguë entendue à Moÿ, dans un capharnaüm qui, pense-t-il, diminue çà et là la qualité de son œuvre.
l'île exiguë Où l'obèse jury mûr Fête l'haï volapük, Âne ex aéquo au whist, Ôtez ce vœu déçu.
Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au delà des îles, près du mälström où brûlent les novæ.
GERMAN (de)
Falsches Üben von Xylophonmusik quält jeden größeren Zwerg
Zwölf Boxkämpfer jagten Eva quer über den Sylter Deich
Heizölrückstoßabdämpfung
INDONESIAN (id)
The Indonesian alphabet is just the same as the English one.
ITALIAN (it)
The Italian alphabet is a set of only 21 letters from the English alphabet.
SPANISH (es) (inc. catalán, gallego y euskera)
{ñ, Ñ, ü, Ü, á, é, í, ó, ú, Á, É, Í, Ó, Ú}. Like Dutch ij, we have ch and ll forming unique letters; hence there are 28 letters, but 29 are used since w was adopted.
SWEDISH (sv)
The characters {Å, å, Ä, ä, Ö, ö} are considered distinct letters in Swedish and are sorted after Z (unlike the German umlauts in the German alphabet).
PORTUGUESE (pt)
The alphabet is the basic modern Latin one. As in French, there are special diacritics like ã or ê. Brazilian Portuguese also has ç, included in most Latin keyboards.
Danish (da)
The Danish alphabet includes {Æ, æ, Ø, ø, Å, å}, sorted after Z.
Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Wolther spillede på xylofon.
Icelandic (is)
Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa
Sævör grét áðan því úlpan var ónýt (some ASCII letters missing)
Example 1: txt, odt and pdf.
Core fonts are 14 = 3 faces (Helvetica or Myriad, Times-Roman or Minion, and Courier) x 4 styles (regular, bold, italic and bold-italic) + 1 Symbol + 1 Adobe Pi (Zapf-Dingbats). ISO-8859-1 and 15 don't support other languages, such as those from Eastern, Southern and Northern Europe, like Greek, Turkish, etc. (see ISO-8859-2 to 16, except 15).
Example 2 · CID
Characters
When ISO-8859-1 (or 15) is replaced with another ISO charset, languages other than English are lost because byte values 128 to 255 now accommodate other languages. Sometimes only a few characters change, but that is enough to warrant the ISO change. Using UTF-8 solves it, but Acrobat doesn't support UTF-8; it maps UTF-8 (variable length, from one to four bytes) to a CID (two bytes) through a CMap. This is a lossy conversion because 2 bytes can't accommodate the more than 100000 characters of the Unicode repertoire. There are many CMaps but TCPDF has only four: CCJK (two Chinese, one Japanese and one Korean). Both Chinese CMaps lack Turkish, so let's give the Japanese CMap a try.
1. ISO-1 and 15, Western Europe.
The characters in Example 1 are also included in the generated Example 2 pdf file.
2. ISO-2 and 16, Eastern Europe.
Eastern European languages like Hungarian, Polish, etc. are also supported.
HUNGARIAN (hu)
Árvíztűrő tükörfúrógép
POLISH (pl)
Pchnąć w tę łódź jeża lub ośm skrzyń fig
3. ISO-3 and 9, Turkish.
TURKISH (tr)
ISO-9's set of 6 Turkish characters {ı, İ, ş, Ş, ğ, Ğ} replaces the set of 6 Icelandic characters {ý, Ý, þ, Þ, ð, Ð} from ISO-1. Here we see both 6-character sets at the same time.
4. ISO-4, 10 and 13, Northern Europe and Baltic.
I have no text for them, but there are no webERP translations either.
5. ISO-5, Cyrillic.
Languages like Russian, Bulgarian, etc. are also supported.
RUSSIAN (ru)
В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!
6. ISO-6, Arabic.
No text available, and no webERP translations either.
7. ISO-7, Modern Greek.
The most used Greek math symbols are included in the 13th core font; here is the writing.
GREEK (el)
Γαζέες καὶ μυρτιὲς δὲν θὰ βρῶ πιὰ στὸ χρυσαφὶ ξέφωτο
Ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
8. ISO-8, Hebrew.
ISO-8 covers Hebrew letters but not signs; I don't know if the following has signs. This is an RTL script.
Hebrew (iw)?
דג סקרן שט בים מאוכזב ולפתע מצא לו חברה איך הקליטה
9. ISO-11, Thai.
Thai has a stunning alignment. A mono-spaced font is used but it doesn't fit Thai.
Thai (th)
[The copyright for the Thai example is owned by The Computer Association of Thailand under the Royal Patronage of His Majesty the King.]
-. ISO-12, Devanagari.
ISO-12 Devanagari is deprecated/obsolete.
10. ISO-14, Gaelic.
Irish Gaelic (ga)
D'fhuascail Íosa, Úrmhac na hÓighe Beannaithe, pór Éava agus Ádhaimh
11-14. CJKV, Asian.
So far, I have covered all ISO charsets. Now let's look at CJK. Japanese has several scripts, such as Hiragana and Katakana.
Japanese (jp)
Hiragana: (Iroha)
Katakana:
The Japanese CMap misses Unified Ideographs; a Chinese CMap must be used.
CHINESE (cn)
自所不欲 勿施与人
CHINESE · Unified Ideograph as in gb2312
传真(fax) 运费(freight) 总计(total)
Thanks to Zhiguo Yuan for the following chars from Unicode Unified Ideograph:
The above is based on Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/ -- 2001-09-02. Many improvements have been made. Please let me know if you find others! Special thanks to the people from all over the world who contributed these sentences.
Example 2 txt and odt
odt Issues
Example 2 Japanese pdf
Japanese Font
Japanese Issues
As only one font may be declared, and I declared the one from Reader 9, this will cause a double font substitution in non-matching versions like Reader 8. To avoid it, copy the Kozuka font file from Reader 9 to Reader 8, or update Reader 8 to 9.
Example 2 Chinese GB pdf
Chinese GB Font
Chinese GB Issues
To learn how Adobe handles fonts, visit the docs sites. In brief: if the font names match, Reader uses the installed font; if not, Reader uses the font description from the sender. This is why the four sub-example cases look different.
Appendix 0. CIDFonts
Heiti
Adobe Reader comes with a very small font collection. Looking at the font names, we may see a Std or Pro label that indicates support for Central European languages, and from the file sizes we can guess their capabilities; this is how AdobeHeitiStd-Regular caught my attention. But it wasn't simple to find out which Collection the font is suited for, at least for me (maybe Asians find this easy). The files are binary and as large as 10 MB, so before opening one with hexeditor or hexplorer I did a search, and it was not fast. Then I made this to make it evident. I thought of deleting this part, but it gives an idea of the inner CMap-CIDFont relationship.
TCPDF using the Adobe Heiti Std Font with right and wrong CMaps.
Appendix 1. Unicode Ranges
BitStream Cyberbit (a free 13 MB serif font) has these Unicode ranges:
Appendix 2. ISO
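As a small illustration of how two ISO charsets reuse the same byte values, the Turkish item of Example 2 (ISO-9's six Turkish characters taking the place of ISO-1's six Icelandic ones) can be reproduced with Python's standard codecs. This is a sketch outside the TCPDF tests of this page; the variable names are arbitrary.

```python
# The six byte values where ISO-8859-9 differs from ISO-8859-1.
raw = bytes([0xFD, 0xDD, 0xFE, 0xDE, 0xF0, 0xD0])

# Same bytes, two readings: the charset declaration decides the language.
as_iso1 = raw.decode("iso-8859-1")   # Icelandic letters
as_iso9 = raw.decode("iso-8859-9")   # Turkish letters

for byte, icelandic, turkish in zip(raw, as_iso1, as_iso9):
    print(f"0x{byte:02X}: ISO-1 {icelandic}  ISO-9 {turkish}")

# A single-byte charset cannot hold both sets at once; UTF-8 can.
both_sets = (as_iso1 + as_iso9).encode("utf-8")
```

This is why a page that needs both sets at the same time has to move to UTF-8.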
Appendix 3. UTF-8
A 15K pdf file grows to 450K when going from ISO to UTF-8 if subsets of the UTF-8 fonts used in the document are embedded, but it can then contain characters from any language. If only the subset of characters used in the document is embedded, the file cannot be edited with any new character, and if the full font is embedded the overhead can be 13 to 23 MB, depending on whether Bitstream Cyberbit or Arial Unicode MS is used. For now, webERP only produces pdf reports in ISO-8859 + CJK (Chinese, Japanese and Korean).
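Example 2 states that UTF-8 takes from one to four bytes per character, while a CID is a two-byte code. A minimal Python sketch, with sample characters picked only for illustration, shows both facts:

```python
# One sample character for each UTF-8 length (chosen for illustration).
samples = ["A", "ñ", "€", "\U0001D11E"]  # the last one is U+1D11E, a musical symbol

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)

# A two-byte code can number at most 2**16 distinct entries.
two_byte_capacity = 2 ** 16
print(two_byte_capacity)

# Code points above U+FFFF cannot be numbered directly with 16 bits.
fits_in_16_bits = [ord(ch) <= 0xFFFF for ch in samples]
print(fits_in_16_bits)
```

Since a two-byte code offers 65536 values and the Unicode repertoire is larger than that, any single 16-bit table has to leave some characters out.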