Favoring Doxygen for documentation over independent wiki documentations

werner · July 14, 2020, 12:07pm

It is great to see a new approach to improving HDF5’s reference manual. Actually, I still prefer the ancient pages such as https://support.hdfgroup.org/HDF5/doc/RM/RM_H5D.html#Dataset-Write over the newer ones such as https://portal.hdfgroup.org/display/HDF5/H5D_WRITE . One reason is that the old ones come up right away in a Google search for “H5Dwrite”, whereas the new ones are not found by it; one has to go through the HDF5 website to find them in the first place. It’s also confusing that on the new pages the “procedure” is written in all uppercase, so it looks like Fortran documentation at first glance, and only later do the C details follow. All that is redundant information taking up unnecessary space, while the old version gets right to the point of providing the actual information without gimmicks around it.

There are now two prominent, competing examples of code documentation in the C++ world, en.cppreference.com and www.cplusplus.com, which document the same functions in different styles, e.g.:

https://en.cppreference.com/w/cpp/container/forward_list
https://www.cplusplus.com/reference/forward_list/forward_list/

I’m not even sure yet myself which of these styles I prefer. Both have their pros and cons. The one style is more minimalistic, allowing to focus, the other one more comprehensive, allowing easier navigation to similar functions.

For HDF5, I’d very much favor documentation to be autogenerated from the code such that both options are possible. Doxygen as documentation generator allows for multiple such styles via its configuration, thus there is no requirement to be limited to only one layout. I’d consider such direct integration with code better than manually written documentation in a wiki, but addon-pages such as describing examples can still be managed through some community-driven repository. Maybe a separate SVN or GIT repository that is pulled by an automated nightly build system for the HDF5 documentation?

On that aspect, I would disagree with the non-functional requirements as listed in https://hdf5.wiki/index.php/RM:Requirements#Non-Functional_Requirements . It is not the “site” that should be editable such that “Authenticated and authorized users shall have the ability to create new documents and modify existing documents”, but rather the sources that are used to automatically build that site. Those should be editable and modifiable by authorized users, like a GIT repository for documentation addons that is pulled in addition to the HDF5 main code to create and update the documentation site.

That way everyone can build the documentation on their own as well, use it offline, and also use different documentation styles, not being limited to the one particular layout chosen by the main site. Building documentation, rather than writing it directly bound to a specific style, gives the same options as those two competing C++ documentation websites.

gheber · July 14, 2020, 9:48pm

Unless I’m mistaken, they are both implemented using Mediawiki.

The biggest challenges in “direct integration with code” are culture and the (artificial?) coupling of two life cycles that have very different dynamics, tooling, and roles. Is it time for a cultural change? Perhaps. But unless the rewards are tangible and the risks acceptable (See https://hdf5.wiki/index.php/RM:Requirements#Lessons_Learned) it’d be too much of a disruption in one go.

The assumption that all content has to be written manually is inaccurate. With tools such as bots (on MediaWiki) and pandoc, content can be created automagically.

I think that’s a matter of optics. Why the additional level of indirection (Git)? The Git “funnel” might apply to a small subset such as the reference manual or the file format spec., but let’s not forget that MOST HDF5 documentation has nothing to do with particular API functions or code. It also puts the burden on the writer to emulate/replicate the hosting environment to get an accurate picture of how a piece of documentation will be rendered. I’m not denying that there are pieces of software that might require this kind of “gated community approach,” but HDF5 has no history of it, and what are users and potential contributors getting in return for this added complexity?

I’m with you on the offline use, but it’s not a beauty contest, and I’m not sure how much weight I would put on the requirement for customizability to match a study’s wallpaper. We used to distribute offline versions of the reference manual (PDF and HTML) with our release tarballs. If I had to speculate, the simple reason we dropped that was, and we have to be honest about it, lack of respect for documentation. At a minimum, we should get back to that going forward. How we “mechanize” it is another question. (It’s actually easy to do for MediaWiki.)

The whole question of (documentation) versioning is a separate topic for another thread. What would add value?

G.

werner · July 15, 2020, 8:46am

I was referring to it just as an example of two styles making sense, while not being sure which one is the better one, so the option to write content independent of style is a benefit; otherwise you also run into a style decision. I’d be curious which one of the two styles HDF5 users would prefer, but I guess that would become its own topic of discussion, which should rather be separate from the work on the content. Ok, maybe via Mediawiki it is possible to provide multiple styles for the same database as well, since they feed the content from a database server anyway.

Particularly for the reference manual I’d see it as big benefit to actually enforce coupling of the code’s life cycle to its documentation. From the HDF5 sources, it already contains documentation, just not in doxygen style, but it wouldn’t even look like that much of a change to actually use it for building the documentation, and thus to ensure that every change of code is reflected in its documentation.

Is the cultural difference for HDF5 that documentation is written independently of the code, such that the code has to fit the abstract requirements defined in the documentation? That approach could work as well, but it would still work if those abstract requirements are placed directly in the code’s comments, or at least co-located.

There may be more abstract documentations beyond the reference manual such as design documents; but even those can be written in a markup language, such as LaTeX, or doxygen as abstraction that can also produce LaTeX, and others.

gheber (quoting werner: “On that aspect, I would disagree with the non-functional requirements as listed in…”):

I think that’s a matter of optics. Why the additional level of indirection (Git)? The Git “funnel” might apply to a small subset such as the reference manual or the file format spec., but let’s not forget that MOST HDF5 documentation has nothing to do with particular API functions or code. It also puts the burden on the writer to emulate/replicate the hosting environment to get an accurate picture of how a piece of documentation will be rendered. I’m not denying that there are pieces of software that might require this kind of “gated community approach,” but HDF5 has no history of it, and what are users and potential contributors getting in return for this added complexity?

I was primarily thinking about the reference manual, as that one was referred to as the first use case and it’s also the most important documentation that I keep looking up myself when in need. With respect to “gated community” access for contributing documentation, is there a difference between needing an authorized account to edit a wiki page versus needing an authorized account on a git server? I don’t even see added complexity in having the ability to work on documentation offline after checkout. It also allows replicating the documentation easily to local, in-house servers, providing independence from what would otherwise be the world’s one and only documentation server.

Even if there is more abstract documentation independently from particular API functions, wouldn’t even that benefit from direct references to code and linked examples?

I disagree that it’s not a beauty contest: the prettier it is, the easier it can be read and understood, and the better it is to use. Beauty of documentation has its merits. Doxygen got its breakthrough because it makes it easy to write pretty documentation even for coders who are otherwise lazy about providing documentation at all.

The html pages created can be viewed with just a webbrowser in the file system. No need to set up a mediawiki server and database to feed content to it.

If it’s related to code that is versioned, yes. Otherwise, can you really write documentation that is totally independent of a release version and will remain valid for decades?

gheber · July 15, 2020, 11:56am

We agree to disagree, but let’s build on our joint interest to create the best HDF5 RM ever! I propose we be practical and specific: Let’s use the H5Literate or H5Oget_info family of calls and mock-up a Doxygen solution! Because of their complexity and chequered history, they make good benchmarks for any candidate. Would you care to outline the coupling with library releases and branches, and how the underlying documentation content is embedded and “anchored” in the source code? I would encourage you to create a Wiki page, but I’d be happy to scrape this thread for nutritious morsels and transfer them.

G.

steven · July 15, 2020, 12:33pm

If the documentation is in the source code, does it follow a regular grammar? Or the other way round: does the existing documentation in HTML follow an XML language?
Either way, is a mechanical transformation, with a tool, being considered?

In the first case YACC/BISON (or a similar CFG parser) should be able to handle the input and transform the source code. With an LLVM-based solution one can do even more, as is demonstrated by the H5CPP clang-libtooling-based source code transformation tool, or by VIM you-complete-me.

If the HTML version is considered to have the highest value, then similarly a transfer function may be applied to take the material into the target space. XSLT may be considered if the source is already in some XML language; alternatively, a custom parser written in FLEX/YACC, or a similar context-free grammar parser such as PEG, can replace error-prone copy and paste.

werner · July 15, 2020, 4:59pm

Steven, doxygen uses a syntax that is similar to LaTeX that is put into the comments of the source code, while it already understands the source code itself as well. Even without annotation in comments, it can create full-featured documentation of functions, parameters, and their relationships, from the pure source code. At least the early versions used Yacc/Bison as far as I remember, not sure if they still use it in the most recent release. It nevertheless understands the C++ language as well as a couple of others, and C of course as well.

Producing HTML is one possible output, it can also output LaTeX that can then be compiled into a PDF document, or XML for tags, or a couple of other output options, as specified in its configuration file.
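To make the LaTeX-like syntax concrete, here is a minimal sketch of an annotated C function. The function demo_clamp() is purely hypothetical and not part of HDF5; the backslash commands (\brief, \param, \return, \p) are standard Doxygen commands, and the same commands can be written with an @ prefix instead.

```c
#include <assert.h>

/**
 * \brief Clamps a value to the closed interval [lo, hi].
 *
 * Hypothetical helper, used only to show Doxygen markup; not part of HDF5.
 *
 * \param value  The value to clamp.
 * \param lo     Lower bound (must not exceed \p hi).
 * \param hi     Upper bound.
 * \return       \p lo if \p value is below it, \p hi if it is above,
 *               otherwise \p value unchanged.
 */
int demo_clamp(int value, int lo, int hi)
{
    assert(lo <= hi); /* precondition stated in the \param text above */
    if (value < lo)
        return lo;
    if (value > hi)
        return hi;
    return value;
}
```

Without the comment block, Doxygen can still list the function and its parameters if the EXTRACT_ALL option is enabled in its configuration file; the comment only adds the descriptive text.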

werner · July 15, 2020, 5:03pm

Just to understand your request: you’d like me to show how the H5Literate() function would need to be documented using Doxygen syntax so as to produce the same result as you have on the wiki page now. Do I understand you correctly? I can do that. In practice, Doxygen would of course produce more than the documentation of a single function; it would also produce the pages that show the interconnections between related functions, link to examples using this function, and such.

gheber · July 15, 2020, 6:04pm

I imagine the net result will be the Doxygen-generated superset of (including the) versions of H5Literate, H5Literate1, and H5Literate2. To get there we’ll have to annotate the H5L* source code. Since HDF5 1.8.22, 1.10.7, 1.12.1, etc. will be upon us sooner or later, we’ll need to think about what has to go into which branches. Unfortunately, but realistically, the story is non-trivial. We have deprecations, renamings, etc. going on. (Strictly speaking, we would need to know how the library API-compatibility is configured to create accurate release documentation, but let’s ignore that.) I think any artifacts/outlines you can produce that help us understand what it really takes start to finish to make this happen would be helpful. If Doxygen creates a lot of other stuff automagically, that’d be a bonus, but let’s drill a bit into the nitty-gritty details. OK? G.

werner · July 15, 2020, 6:24pm

Ok, I’ll give it a run. As I would do it, I’d document H5Literate2(), since that is the most modern function and thus the one that is recommended to be used. From there, there can be a reference to the documented H5Literate1() function, which is annotated with a warning that it should no longer be used as it’s outdated, and a comment that an H5Literate() define exists which maps to either of those functions depending on the library’s compilation defines, such as those defined by the respective config.h or as provided by the build system (which could override the settings in the config.h). This documentation would be part of the respective release branches anyway, so each release gets its own address once placed on the web. If that address is fixed, then the documentation of 1.12.1 can also link to the documentation of 1.10.7, but I am not sure if that is useful.
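A rough sketch of how that layout could look in annotated C, under loud assumptions: all names here (demo_iterate1, demo_iterate2, demo_iterate, DEMO_USE_V1, demo_count_op) are hypothetical stand-ins, not the real HDF5 symbols or compile defines. Only the Doxygen commands (@deprecated, @note, @def) are real.

```c
#include <assert.h>
#include <stddef.h>

/** Callback type shared by both hypothetical versions. */
typedef int (*demo_op_t)(const char *name, void *op_data);

/**
 * @brief Visits each name once (stand-in for the current "version 2" call).
 * @param names   Array of names to visit.
 * @param count   Number of entries in @p names.
 * @param op      Callback invoked once per name.
 * @param op_data Passed through to @p op unchanged.
 * @return Number of names visited.
 * @note Also available through the demo_iterate() macro.
 */
int demo_iterate2(const char *const *names, size_t count, demo_op_t op, void *op_data)
{
    size_t i;

    assert(op != NULL);
    for (i = 0; i < count; i++)
        op(names[i], op_data);
    return (int)count;
}

/**
 * @brief Older variant that takes a signed count.
 * @deprecated Outdated; use demo_iterate2() instead.
 */
int demo_iterate1(const char *const *names, int count, demo_op_t op, void *op_data)
{
    /* Negative counts are treated as empty input. */
    return demo_iterate2(names, (size_t)(count < 0 ? 0 : count), op, op_data);
}

/**
 * @def demo_iterate
 * @brief Maps to demo_iterate1() if DEMO_USE_V1 is defined at compile time,
 *        otherwise to demo_iterate2().
 */
#ifdef DEMO_USE_V1
#define demo_iterate demo_iterate1
#else
#define demo_iterate demo_iterate2
#endif

/** Example callback: counts invocations in the int pointed to by op_data. */
int demo_count_op(const char *name, void *op_data)
{
    (void)name;
    ++*(int *)op_data;
    return 0;
}
```

Doxygen collects everything tagged @deprecated into a separate "Deprecated List" page, so the outdated variants remain findable without cluttering the main description.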

werner · July 15, 2020, 8:52pm

I’m not sure how to create an account for the HDF5 wiki in order to edit a page there (“The supplied credentials could not be used for account creation.”), but here’s a possible doxygen version for the H5L group of functions, where I annotated H5Literate2() for use with doxygen; it references the H5Literate() macro as well:

https://www.fiberbundle.net/hdf5-1.12.0/group___h5_l.html

The comment in H5L.c in doxygen-compatible form then looks like the following, which is a rather small change to the current status from the HDF distribution:

/**@ingroup H5L
 * Iterates over links in a group, with user callback routine,
 *              according to the order within an index.
 *
 *              Same pattern of behavior as H5Giterate.
 *
 * @return      Success:    The return value of the first operator that
 *                          returns non-zero, or zero if all members were
 *                          processed with no operator returning non-zero.
 *              Failure:    Negative if something goes wrong within the
 *                          library, or the negative value returned by one
 *                          of the operators.
 *
 * @param group_id The group ID to iterate over
 * @param idx_type The index type
 * @param order    Specifies the order how to iterate
 * @param idx_p    Pointer to an iteration index such to allow continuing
 *                 a previous iteration
 * @param op       Function pointer for a callback operation to be invoked
 *                 at each iteration
 * @param op_data  User-defined data structure that will be passed on to
 *                 the callback function
 *
 *
 * @see H5Literate_by_name(), H5Lvisit2(), H5Lvisit_by_name()
 *
 * @note This function is also available through the H5Literate() macro.
 *
 *
 *-------------------------------------------------------------------------
 */
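The return-value contract described in that comment (stop at the first callback that returns non-zero, pass that value back, otherwise return zero) can be mimicked in a few lines of plain C. This is only a sketch with hypothetical names (demo_iterate_links, demo_stop_at) that iterates over an array of strings rather than a real HDF5 group; how the index behaves on an early stop is an assumption of this sketch, not a statement about the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*demo_visit_t)(const char *name, void *op_data);

/*
 * Hypothetical stand-in for the documented iteration contract:
 * returns the first non-zero callback result, or zero if every
 * callback returned zero. *idx_p is where iteration starts and,
 * on return, where a later call would resume.
 */
int demo_iterate_links(const char *const *names, size_t count, size_t *idx_p,
                       demo_visit_t op, void *op_data)
{
    size_t i   = (idx_p != NULL) ? *idx_p : 0;
    int    ret = 0;

    assert(op != NULL);
    while (i < count) {
        ret = op(names[i], op_data);
        i++; /* advance past the element just visited */
        if (ret != 0)
            break; /* positive: early stop; negative: callback failure */
    }
    if (idx_p != NULL)
        *idx_p = i;
    return ret;
}

/* Example callback: returns 1 (stop) when the name equals the target string. */
int demo_stop_at(const char *name, void *op_data)
{
    const char *target = (const char *)op_data;
    return strcmp(name, target) == 0 ? 1 : 0;
}
```

Iterating over "alpha", "beta", "gamma" with the target "beta" stops at the second element; calling again with the same index continues with "gamma" and returns zero.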

btw @steven Doxygen uses Bison for source code parsing by default, but it also has the option to enable a clang-based parser.

gheber · July 15, 2020, 8:55pm

https://www.hdfgroup.org/register/

gheber · July 15, 2020, 8:59pm

https://hdf5.wiki/index.php/RM:Requirements/Doxygen

werner · July 15, 2020, 9:05pm

Seems my older account worked to log in, but I cannot edit that page:

You do not have permission to edit this page, for the following reason:
The action you have requested is limited to users in the group: writer.

Does it allow uploading files? I’d upload the doxygen config file and the modified H5L.c then, if that’s of interest.

gheber · July 15, 2020, 10:17pm

Yes (upload) https://hdf5.wiki/index.php/Special:Upload

werner · July 15, 2020, 10:19pm

No:

Permitted file types: png, gif, jpg, jpeg, webp.

Can’t use this for text files such as source code or the config file.

gheber · July 15, 2020, 10:39pm

OK, great start. Let’s maybe dig a little into the details. If you compare https://www.fiberbundle.net/hdf5-1.12.0/group___h5_l.html and https://hdf5.wiki/index.php/RM:H5Literate2 you notice that there’s quite a bit more context on the latter in the sense that I can see all the enumerations, structures, and callback types on the same page. Yes, I can click the corresponding links on the former, but that’s a bit of a cognitive challenge, because, by the time I get to the destination, I can’t remember the context from where I came. Notice also, that these are NOT copies (which would just be an invitation for spreading errors), but transclusions, i.e., there’s only one copy, and if that changes, it changes everywhere. Does Doxygen have a notion of transclusion or (user-defined) templates? The HDF5 documentation is just full of references and stereotypes like that, and you don’t wanna repeat tables such as https://hdf5.wiki/index.php/Template:file_access_flags. G.

werner · July 16, 2020, 2:52pm

Yes, I noticed in the wiki version you had the documentation of the structures used by the function directly under the function. Actually, I found that rather annoying as I missed the option to click on the function parameters to get the information about the parameter types right there, as I am so much used to that way. It is a widespread way of documentation also for the QT library, for instance:

https://doc.qt.io/qt-5/qwidget.html#grab

Using the QT library’s documentation tool, qtdoc, would be an option as well. It has well grown over the years to handle large codes, but I am not sure whether it is as flexible as doxygen. In the past, qtdoc was not freely available, that is why doxygen was created.

Back to the function documentation at hand: There are a couple of ways to group functions and data structures together. So far I have not explored all those options, as the point is just to explore in which direction the HDF5 documentation wants to go. In general, doxygen is rather object-oriented, focusing on the data structures, and then allows functions to be related to a data structure. So this design is somewhat the other way round from the current wiki page for H5Literate, which starts with the function and then lists the data structures related to this function. It may be possible to tweak the doxygen docu more in that direction, but it’s a somewhat unusual way compared to what I am used to from other libraries.

If you say “if it changes it changes everywhere”, does that mean the structure documentation on that page is replicated in multiple places for other functions as well? So if those pages are printed, the printed versions redundantly contain the same information multiple times, as it’s the same for multiple functions? The doxygen / QTdoc approach avoids redundancy, as there is one page per structure that uniquely defines that data structure, and the functions using that data structure just refer to that one. I find that much easier than redundancy, but I could try to look into how far such replicated, inline documentation is doable via doxygen, if that is a requirement. Doxygen also allows for macros and conditionals, so there may be ways.

A table like the file access flags is like the enum documentation in QT, e.g.

https://doc.qt.io/qt-5/qt.html#GestureType-enum

Doxygen does the same:

https://www.fiberbundle.net/hdf5-1.12.0/_h5public_8h.html#a6a6ddd1504d1ed61939d46d91d9441b9

Of course, it can be annotated more.
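As a sketch of what "annotated more" could mean, and of one possible answer to the transclusion question: each enum value can carry its own inline description, and the \copydetails command (a standard Doxygen command, like \copydoc and \copybrief) pulls the detailed description of one entity into another without duplicating the text in the sources. The enum and function below are hypothetical and only loosely modelled on the idea of an index type; whether the rendered result is close enough to the wiki's transclusion would need to be checked with the Doxygen version in use.

```c
#include <assert.h>
#include <stddef.h>

/**
 * @brief Kinds of index a group can be traversed by (hypothetical enum).
 *
 * An index determines the order in which entries are visited. This
 * paragraph is the single source for that explanation; other comments
 * reuse it via \copydetails instead of repeating it.
 */
typedef enum demo_index_t {
    DEMO_INDEX_NAME      = 0, /**< Entries ordered by name. */
    DEMO_INDEX_CRT_ORDER = 1, /**< Entries ordered by creation order. */
    DEMO_INDEX_N         = 2  /**< Number of index kinds (not a valid kind). */
} demo_index_t;

/**
 * @brief Returns a printable label for an index kind.
 *
 * @copydetails demo_index_t
 *
 * @param idx_type Index kind to describe.
 * @return A static string; "unknown" for values outside the enum.
 */
const char *demo_index_label(demo_index_t idx_type)
{
    static const char *const labels[] = {"name", "creation order"};

    /* Keeps the label table in sync with the enum. */
    assert((size_t)DEMO_INDEX_N == sizeof labels / sizeof labels[0]);

    if ((int)idx_type < 0 || (int)idx_type >= (int)DEMO_INDEX_N)
        return "unknown";
    return labels[idx_type];
}
```

With this arrangement a correction to the enum's description is made in one place in the header, and every comment that uses \copydetails picks it up at the next documentation build.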

werner · July 16, 2020, 4:18pm

Here is a version of the documentation with the data structure placed into the same H5L group, so it’s on the same group page as the function, and in a style with a site navigation panel on the left:

https://www.fiberbundle.net/hdf5-1.12.0-v2/group___h5_l.html

gheber · July 16, 2020, 7:54pm

Others should chime in, but I was just following (HDF) precedent. I think that as long as we are not switching to a different page to get the related type info, that would work for me.

What I mean is this. If you look at the source of RM:H5Literate2 you won’t see, for example, the definition of H5_index_t. That’s transcluded in the source as {{H5_index_t}}, which is a reference to Template:H5_index_t. The source (as text) exists only there. If there’s a change or correction to that, I only have to make it in one place and it “spreads” from there, because there are no static pages. (There’s caching, but it’s all rendered per request if any of the dependencies changes.) H5_index_t appears all over the place and it would be costly to maintain occurrences-by-copy. Transclusion solves that problem. It also works in reverse, i.e., if I would like to know which pages transclude a given template I just go to Special:WhatLinksHere and type in the template name, and that returns all pages using that template.

I’m not sure I understand what you mean by that. If a page contains multiple transclusions of the same content, it will be rendered that many times. (And printed if the user presses the print button.) We don’t have any pages like that, but that’s what would happen.

G.

gheber · July 16, 2020, 7:57pm

I like the second attempt much better. What’s involved annotation-wise in the code and the doxygen file? I can create a Git repo and you can check’em in if you like. Let me know! G.
