developer.jelix.org is no longer used and exists only for its history. Post new tickets on the GitHub account.

Opened 10 years ago

Last modified 8 years ago

#1020 delayed enhancement

jUrl should transliterate instead of applying the replacement character

Reported by: tudorilisoi
Owned by: laurentj
Priority: lowest
Milestone:
Component: jelix:core
Version: 1.1.4
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Documentation needed: no
Hosting Provider:
PHP version:

Description

Hi again! I think jUrl could generate more useful URIs by transliterating accented characters. You can add this code in lib/jelix/core/jUrl.class.php before line 209:

if (function_exists('iconv')) {
    $str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
}

This will replace accented characters with their ASCII counterparts; for example, 'ö' becomes 'o'.

Change History (14)

comment:1 Changed 10 years ago by tudorilisoi

  • Summary changed from jUrl shoul transliterate instead of applying the replacement character to jUrl should transliterate instead of applying the replacement character

comment:2 Changed 10 years ago by tudorilisoi

Oh, and line 216 should be replaced by:

$str=str_replace($str, $url_escape_to, $str); // removes accented characters, quotes, and double quotes

comment:3 Changed 10 years ago by tudorilisoi

I was stupidly hurrying! This code should be inserted before line 204, in the original file, without the prior modifications described above:

if (function_exists('iconv')) {
    $str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
}

Additionally, I found that the conversion is locale-dependent, which makes things worse: http://www.php.net/manual/en/function.iconv.php#94481
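To see the locale dependence on a given machine, a minimal sketch along these lines may help. The helper name translit_by_locale and the list of locales are illustrative assumptions and not part of jUrl. The output is deliberately not shown, because it depends on the installed locales and on the system's iconv implementation.

```php
<?php
// Hypothetical helper (not part of jUrl): transliterate $text under each
// locale and collect the results, so they can be compared side by side.
function translit_by_locale($text, array $locales)
{
    $results = array();
    $previous = setlocale(LC_CTYPE, '0'); // "0" only queries the current locale
    foreach ($locales as $locale) {
        if (setlocale(LC_CTYPE, $locale) === false) {
            $results[$locale] = null; // locale not installed on this system
            continue;
        }
        // iconv() returns false when it cannot convert the string.
        $results[$locale] = function_exists('iconv')
            ? @iconv('UTF-8', 'ASCII//TRANSLIT', $text)
            : false;
    }
    setlocale(LC_CTYPE, $previous); // restore the original locale
    return $results;
}

// The locale names below are examples; adjust them to what your system provides.
var_dump(translit_by_locale('Oaxen Skärgårdskrog', array('C', 'en_US.UTF-8', 'de_DE.UTF-8')));
```

If the entries differ from one locale to another, or from one machine to another, the iconv approach cannot produce stable URLs.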

I think it is worth a try, although my submitted code does not cover all the issues.

comment:4 Changed 10 years ago by tudorilisoi

commonbricks also supplies a deaccent function, but I suspect it only covers a part of the possible accented characters

comment:5 Changed 10 years ago by tudorilisoi

OK, I'm flooding you with comments, but I finally found a high-performance implementation which takes care of everything just right: http://sourceforge.net/projects/phputf8/files/utf8_to_ascii/

It does not depend on PHP locale settings; it just does its best. The author claims the average request time for the example went from ~0.46 ms to 0.41 ms, and the example has several hundred words. Best of all, I see no extension dependencies, just pure PHP.

comment:6 Changed 10 years ago by foxmask

Just a question: why transliterate those characters?

Besides, ICANN will permit domain names with Unicode characters.

comment:7 Changed 10 years ago by tudorilisoi

First of all, I apologize for the messy post and comments. The main issue here is for URLs to be human-friendly and provide SEO value. Here's an example for this string, which is an article title:

"Oaxen Skärgårdskrog - 09/05/2009" without transliteration: 1) /news/6751-oaxen-sk-rg-rdskrog-09-05-2009 with transliteration: 2) /news/6751-oaxen-skargardskrog-09-05-2009

Yes, URLs can be encoded like Wikipedia does, but they are a mess when doing something like 'copy link location' (in the browser) or similar, and are prone to processing errors. This solution is clean and simple to implement, and the overhead is less than a millisecond. Just notice the difference between the mutilated URL 1), which has pretty much lost its meaning for humans, and 2), which is fine.

Google knows how to (reverse)transliterate, it will take the urls into account for searches for the accented AND unaccented words.

utf8 would be nice, but there are a lot of systems which will handle such URIs incorrectly, and changing those platforms will take quite some time.

comment:8 Changed 10 years ago by tudorilisoi

First of all, I apologize for the messy post and comments.

The main issue here is for URLs to be human-friendly and provide SEO value. Here's an example for this string, which is an article title:

"Oaxen Skärgårdskrog - 09/05/2009"

without transliteration:

1) /news/6751-oaxen-sk-rg-rdskrog-09-05-2009

with transliteration:

2) /news/6751-oaxen-skargardskrog-09-05-2009

Yes, URLs can be encoded like Wikipedia does, but they are a mess when doing something like 'copy link location' (in the browser) or similar, and are prone to processing errors.

This solution is clean and simple to implement, and the overhead is less than a millisecond. Just notice the difference between the mutilated url 1), which has pretty much lost its meaning for humans, and 2), which is fine.
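The difference between URLs 1) and 2) can be reproduced with a small sketch. The function make_slug and its $map parameter are hypothetical names used only for illustration; this is not jUrl's actual code, and it assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper, not jUrl's actual implementation.
function make_slug($title, array $map = array())
{
    // Optional transliteration step: replace each mapped character.
    $str = str_replace(array_keys($map), array_values($map), $title);
    // Every run of bytes that is not an ASCII letter or digit becomes one dash.
    $str = preg_replace('/[^A-Za-z0-9]+/', '-', $str);
    return strtolower(trim($str, '-'));
}

$title = 'Oaxen Skärgårdskrog - 09/05/2009';

// Without transliteration the accented letters are lost.
echo make_slug($title), "\n";                                  // oaxen-sk-rg-rdskrog-09-05-2009

// With a (tiny, illustrative) transliteration map the word stays readable.
echo make_slug($title, array('ä' => 'a', 'å' => 'a')), "\n";   // oaxen-skargardskrog-09-05-2009
```

The map here covers only the two characters needed for the example; a real table would have to be much larger.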

Google knows how to (reverse)transliterate, it will take the urls into account for searches for the accented AND unaccented words.

utf8 would be nice, but there are a lot of systems which will handle such URIs incorrectly, and changing those platforms will take quite some time.

Offtopic: And, of course, a light visual editor may be better than learning wiki formatting syntax ;)

comment:9 Changed 10 years ago by foxmask

  • Type changed from bug to enhancement

You seem to be missing a point: there is a Preview button to avoid making noise here :/ The off-topic remark could be avoided too, because I could say: if you don't like the wiki syntax, tell Trac's author.

And finally, we are drifting away from the main subject of this ticket.

I will let the boss explain himself what he thinks of this behavior in jUrl.

regards

comment:10 Changed 10 years ago by tudorilisoi

mea culpa! Thank you very much for your time, and sorry for the mess!

comment:11 Changed 10 years ago by laurentj

  • Component changed from jelix to jelix:core
  • Owner set to laurentj
  • Priority changed from normal to lowest

First, provide a true patch, with a diff utility or generated with the sources retrieved from the mercurial repository (hg diff). It will be better for us to understand where and how you want to add/remove lines of code.

  • iconv is not a solution, since it is system-dependent. We don't want it for jelix. And it seems that the result is not the same on each machine. On my laptop, your example returns 'Oaxen Sk"argardskrog'.
  • utf8_to_ascii: strange licence, uses trigger_error instead of exceptions, uses the old way of accessing characters in a string (the $str{i} syntax is deprecated by PHP), and is apparently a dead project. It is also 38 times slower than iconv (even if the current solution in jelix is 7 times slower than iconv :-)). This is not acceptable, since we can have several URLs on the same page, and jelix is used on heavily loaded web sites.

We cannot use mb_convert_encoding either, because it doesn't support transliteration.

This is why we have our own transliteration. It doesn't support all characters. However, you can add new characters by overloading the locale file format.UTF-8.properties, which is stored in the "jelix" module (lib/jelix/core-modules/jelix/); see http://jelix.org/articles/en/manual-1.1/overloads. In this properties file, add your characters to url_escape_from and url_escape_to.
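As a rough illustration of how a pair of tables such as url_escape_from and url_escape_to can drive the transliteration, here is a minimal sketch. It assumes each value is a space-separated list and that the two lists line up position by position. That format is an assumption and should be checked against the actual properties file. The function translit_with_tables is a hypothetical name, not a jelix API.

```php
<?php
// Assumed format: two space-separated lists of equal length, where the
// n-th entry of "from" is replaced by the n-th entry of "to".
$url_escape_from = 'é è à ä å ö';
$url_escape_to   = 'e e a a a o';

// Hypothetical helper, not part of jelix.
function translit_with_tables($str, $from, $to)
{
    $fromList = explode(' ', $from);
    $toList = explode(' ', $to);
    if (count($fromList) !== count($toList)) {
        throw new InvalidArgumentException('from/to lists must have the same length');
    }
    // str_replace works on byte strings, so multi-byte UTF-8 characters are
    // matched as whole sequences; no locale or extension is involved.
    return str_replace($fromList, $toList, $str);
}

echo translit_with_tables('Oaxen Skärgårdskrog', $url_escape_from, $url_escape_to), "\n"; // Oaxen Skargardskrog
```

Supporting a new character then amounts to appending it to the "from" list and its ASCII replacement to the "to" list, which is what the overloaded properties file is meant to allow.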

comment:12 Changed 10 years ago by laurentj

  • Blocking 1027 added

(In #1027) test

comment:13 Changed 10 years ago by laurentj

  • Blocking 1027 removed

comment:14 Changed 8 years ago by laurentj

  • Status changed from new to delayed