This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: length in gawk returns wrong value
On Jul 19 11:27, Ralf wrote:
> Corinna Vinschen <corinna-cygwin <at> cygwin.com> writes:
>
> >
> > Uh oh. 1.7.9 is old. Please update.
> >
> > > 0000000 R 374 c k e n \r \n
> > > 0000010
> > > Length: 1
> > >
> > > What can I do to get the correct length in gawk without changing
> > > ttt.txt?
> >
> > Dunno. This is not what I see. What did you have $LANG and $LC_CTYPE
> > set to? Here's what I see:
> >
> > $ uname -a
> > CYGWIN_NT-6.1 vmbert7 1.7.16(0.261/5/3) 2012-07-09 14:51 i686 Cygwin
> >
> > $ echo $LANG
> > C.UTF-8
> >
> > $ echo "Rücken" > ttt.txt
> > $ od -c ttt.txt
> > 0000000 R 303 274 c k e n \n
> > 0000010
> >
> > $ gawk '{print "Length: " length($0)}' ttt.txt
> > Length: 6
> >
> > $ gawk --version | head -1
> > GNU Awk 4.0.1
> >
> > Corinna
> >
>
> After updating I added the following lines at the top of my script:
> export LANG=C.UTF-8
> echo LANG: $LANG
> echo LC_CTYPE: $LC_TYPE
> c:/unix/bin/gawk --version | head -1
>
> And this is my output:
> LANG: C.UTF-8
> LC_CTYPE:
> GNU Awk 4.0.1
> CYGWIN_NT-6.0-WOW64 WIESWEG 1.7.15(0.260/5/3) 2012-05-09 10:25 i686 Cygwin
> 0000000 R 374 c k e n \r \n
> 0000010
> Length: 5
>
> Very strange!
Not at all. The file contains a character that is invalid in UTF-8.
The byte 0374 (octal) is the umlaut-u (ü) in the ISO-8859-1 and
ISO-8859-15 codesets, so the file has to be read under a matching
locale. Try this:

$ LC_ALL=de_DE gawk '{print "Length: " length($0)}' ttt.txt
Length: 6

When you create the file under the UTF-8 codeset, you'll get:

0000000 R 303 274 c k e n \n
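
If the file may be converted rather than left as-is, one possible route is to re-encode it with iconv. The following is only a sketch: it assumes iconv is installed, it recreates the ISO-8859-1 bytes with printf (without the trailing \r from the original file), and the output name ttt-utf8.txt is made up for illustration.

```shell
# Recreate the ISO-8859-1 content from the thread:
# "R", the single byte 0374 (octal), "cken", newline.
printf 'R\374cken\n' > ttt.txt

# Re-encode from ISO-8859-1 to UTF-8. The single byte 0374
# becomes the two-byte sequence 303 274.
iconv -f ISO-8859-1 -t UTF-8 ttt.txt > ttt-utf8.txt

# Inspect the result; the dump should now show 303 274 after the R.
od -c ttt-utf8.txt
```

Running the gawk command from above on ttt-utf8.txt under a UTF-8 locale such as C.UTF-8 should then count the ü as one character, matching the UTF-8 example earlier in the thread.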
Corinna
--
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
Please send mails regarding Cygwin to cygwin AT cygwin DOT com