Unicode filenames on Windows?

Hello,

Does anyone know how to open or create an HDF5 file on Windows with a
Unicode name? From what I've been able to determine, it looks like
it's possible to use file names which contain bytes > 127, and
interpret them according to locale settings. However, there doesn't
seem to be any mechanism for generic (multi-byte) Unicode file names
like those supported by NTFS. Is this possible in HDF5?

Thanks,
Andrew Collette


----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

On Wednesday 03 June 2009 05:26:25, Andrew Collette wrote:

Hello,

Does anyone know how to open or create an HDF5 file on Windows with a
Unicode name? From what I've been able to determine, it looks like
it's possible to use file names which contain bytes > 127, and
interpret them according to locale settings. However, there doesn't
seem to be any mechanism for generic (multi-byte) Unicode file names
like those supported by NTFS. Is this possible in HDF5?

I don't know for sure how to do that in pure C, but if you are using Python
(and I think that's the case), you can encode the file name using the
underlying filesystem encoding. The following function:

import sys

def encode_filename(filename):
  """Return the encoded filename in the filesystem encoding."""
  if type(filename) is unicode:
    # Encode unicode names with whatever the OS reports as its filesystem
    # encoding ('utf-8' on most UNIX setups, 'mbcs' on Windows).
    encoding = sys.getfilesystemencoding()
    encname = filename.encode(encoding)
  else:
    # Already a byte string: pass it through untouched.
    encname = filename
  return encname

works well on every filesystem that I've tested (including NTFS).

Once you've got the encoded file name, it is just a matter of passing it to the
relevant HDF5 function (H5Fcreate/H5Fopen).
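
For illustration only, a minimal usage sketch (PyTables is assumed here purely
as an example; any code that eventually hands the byte string to
H5Fcreate/H5Fopen would use the helper the same way):

import tables

# Encode first, then pass the resulting byte string to the library as usual.
fname = encode_filename(u'se\xf1ales.h5')   # n-tilde in the name
h5file = tables.openFile(fname, mode='w')
h5file.close()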

Hope that helps,


--
Francesc Alted


Hi Francesc,

I don't know for sure how to do that in pure C, but if you are using Python
(and I think that's the case), you can encode the file name using the
underlying filesystem encoding. The following function:

def encode_filename(filename):
  """Return the encoded filename in the filesystem encoding."""
  if type(filename) is unicode:
    encoding = sys.getfilesystemencoding()
    encname = filename.encode(encoding)
  else:
    encname = filename
  return encname

works well on every filesystem that I've tested (including NTFS).

Yes, this is how my Unicode handling works at the moment; it seems
fine on UNIX (UTF-8 encoding) and with common characters on Windows.
However, trying to encode certain Unicode characters doesn't work on
Windows; for example, u'\u1201'. It seems that "mbcs" can only encode
characters in the current code page.
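
A minimal Windows-only sketch of the failure (hedged: depending on the Python
version and the active code page, the encode step either raises
UnicodeEncodeError or silently substitutes '?', which then breaks the name):

import sys

name = u'\u1201.h5'                      # a character outside Western code pages
encoding = sys.getfilesystemencoding()   # 'mbcs' on Windows
try:
  encoded = name.encode(encoding)
except UnicodeEncodeError:
  print "cannot encode %r with %r at all" % (name, encoding)
else:
  if encoded.decode(encoding) != name:
    print "%r does not survive the round trip through %r" % (name, encoding)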

Unfortunately it looks like Windows does Unicode with a separate,
wide-character API, so I may be out of luck. It would be nice if HDF5
simply took UTF-8 everywhere and called the appropriate low-level API.
:)

Andrew



Hi Andrew,


On Jun 3, 2009, at 1:10 PM, Andrew Collette wrote:

Hi Francesc,

I don't know for sure how to do that in pure C, but if you are using Python
(and I think that's the case), you can encode the file name using the
underlying filesystem encoding. The following function:

def encode_filename(filename):
  """Return the encoded filename in the filesystem encoding."""
  if type(filename) is unicode:
    encoding = sys.getfilesystemencoding()
    encname = filename.encode(encoding)
  else:
    encname = filename
  return encname

works well on every filesystem that I've tested (including NTFS).

Yes, this is how my Unicode handling works at the moment; it seems
fine on UNIX (UTF-8 encoding) and with common characters on Windows.
However, trying to encode certain Unicode characters doesn't work on
Windows; for example, u'\u1201'. It seems that "mbcs" can only encode
characters in the current code page.

Unfortunately it looks like Windows does Unicode with a separate,
wide-character API, so I may be out of luck. It would be nice if HDF5
simply took UTF-8 everywhere and called the appropriate low-level API.
:)

  Hmm, I don't think we do anything special to the strings we pass to the file system. Is there some particular problem you are seeing?

  Quincey

Hi Quincey,

Hmm, I don't think we do anything special to the strings we pass to
the file system. Is there some particular problem you are seeing?

I can't figure out how to take an arbitrary sequence of Unicode code
points and create an HDF5 file with that name on Windows.

I have limited experience with Windows Unicode support, but I know
that the way Microsoft implements Unicode is through a series of
wide-character (2-byte "UCS-2") APIs. Unlike most UNIX platforms,
where you simply pass in a UTF-8 (or whatever) string through a char*,
I think you actually have to call a separate function (e.g. fopen vs.
_wfopen) to be able to handle generic Unicode filenames on Windows.
Otherwise Windows treats a simple char* string as extended-ASCII,
according to the current locale settings. So if I'm on a French
computer, I can get HDF5 to generate an e-with-an-accent, but not (for
example) a name with Cyrillic letters.
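
To make the distinction concrete, here is a small Windows-only sketch. It
assumes Python 2, whose built-in open() goes through the wide-character CRT
(_wfopen) when given a unicode object but through the narrow, code-page
dependent CRT (fopen) when given a byte string:

import sys

cyrillic = u'\u0444\u0430\u0439\u043b.h5'   # Cyrillic "fajl.h5"

# Wide-character route: the name reaches NTFS intact whatever the locale is.
open(cyrillic, 'w').close()

# Narrow char* route (what the HDF5 Windows driver gets today): the bytes are
# interpreted through the active code page, so this only works when that code
# page covers the characters -- fine on cp1251, broken on cp1252.
open(cyrillic.encode(sys.getfilesystemencoding()), 'w').close()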

Currently, as far as I've found out in my investigations, there's no
way to encode a generic Unicode string to char* on Windows and have it
work with the filesystem; you have to use the UCS-2 functions. I've
peeked at H5FDwindows.c and it looks like you're using the traditional
char* API.

I realize it's probably not a priority for HDF5 development, but it
would be nice if HDF5 could handle the full extent of names allowed by
the filesystem. It seems like the correct place for that is the
Windows file driver. One way would be to have two modes, perhaps set
by the file access property list; in the first, it passes the raw
bytes through to the filesystem (as is done now), and in the other, it
performs two-way translation between UTF-8 strings (HDF5 user side)
and the UCS-2/wchar API (Windows platform side). This would have the
additional benefit of maintaining HDF5's internal standardization on
UTF-8.
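
For illustration only, a rough Python/ctypes sketch of that translation (the
function name is made up; a real driver would presumably do the equivalent in
C with MultiByteToWideChar plus _wfopen or CreateFileW):

import ctypes
from ctypes import wintypes

GENERIC_WRITE         = 0x40000000
CREATE_ALWAYS         = 2
FILE_ATTRIBUTE_NORMAL = 0x80

CreateFileW = ctypes.windll.kernel32.CreateFileW
CreateFileW.restype = wintypes.HANDLE
CloseHandle = ctypes.windll.kernel32.CloseHandle

def create_from_utf8(name_utf8):
  """Take a UTF-8 byte string (the HDF5 user side) and create the file
  through the wide-character Win32 API (the Windows platform side)."""
  wide_name = name_utf8.decode('utf-8')             # UTF-8 -> wide string
  handle = CreateFileW(ctypes.c_wchar_p(wide_name),
                       GENERIC_WRITE, 0, None,
                       CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, None)
  if handle == wintypes.HANDLE(-1).value:           # INVALID_HANDLE_VALUE
    raise ctypes.WinError()
  CloseHandle(handle)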

Andrew



Hi Andrew,


On Jun 3, 2009, at 4:38 PM, Andrew Collette wrote:

Hi Quincey,

Hmm, I don't think we do anything special to the strings we pass to
the file system. Is there some particular problem you are seeing?

I can't figure out how to take an arbitrary sequence of Unicode code
points and create an HDF5 file with that name on Windows.

I have limited experience with Windows Unicode support, but I know
that the way Microsoft implements Unicode is through a series of
wide-character (2-byte "UCS-2") APIs. Unlike most UNIX platforms,
where you simply pass in a UTF-8 (or whatever) string through a char*,
I think you actually have to call a separate function (e.g. fopen vs.
_wfopen) to be able to handle generic Unicode filenames on Windows.
Otherwise Windows treats a simple char* string as extended-ASCII,
according to the current locale settings. So if I'm on a French
computer, I can get HDF5 to generate an e-with-an-accent, but not (for
example) a name with Cyrillic letters.

Currently, as far as I've found out in my investigations, there's no
way to encode a generic Unicode string to char* on Windows and have it
work with the filesystem; you have to use the UCS-2 functions. I've
peeked at H5FDwindows.c and it looks like you're using the traditional
char* API.

I realize it's probably not a priority for HDF5 development, but it
would be nice if HDF5 could handle the full extent of names allowed by
the filesystem. It seems like the correct place for that is the
Windows file driver. One way would be to have two modes, perhaps set
by the file access property list; in the first, it passes the raw
bytes through to the filesystem (as is done now), and in the other, it
performs two-way translation between UTF-8 strings (HDF5 user side)
and the UCS-2/wchar API (Windows platform side). This would have the
additional benefit of maintaining HDF5's internal standardization on
UTF-8.

  Seems like a reasonable idea. I've filed a bug in our bug tracker and it'll get prioritized with the other things there, but we'd be happy to accept a well-tested patch from the community also.

  Quincey