[ Wren's log of random thoughts on his f/oss projects ]

24 October 2009 : Gettin' some hate on for PythonView as RSS

So, I'm working on a Python project again. And, like usual, I'm finding a dearth of helpful tutorials for advanced hacking. The current problem? I have a module that I want to break up into a package in order to make it easier to mess with, but I want foo/__init__.py to export the same API as the old foo.py did— that is, not breaking old code by requiring extra namespace qualifiers.

I've managed to hack something together that mostly works, but I'm far from pleased. Client code can import the package like usual, and that seems to work. But the library's doctest tests don't work anymore because of an ImportError from python not being able to find submodules, despite having found them just fine for the client programs.

And for trying to debug this the internet is useless. All the searches I've come up with give a pasted copy of the official module tutorial, or are specific projects' FAQ pages giving such helpful explanations as telling me Python can't find the module (duh) or offering such helpful advice as adjusting the Python path so it can ((a) duh, (b) that's not the problem). I've yet to find any resources on how the module/package system actually works to know where things are going wrong.

music of the moment: Letting You ~ NIN (Halo 27-The Slip)

categories: python

11 March 2009 : Hugs, Bugs, and Complaints

I've been working on sprucing up some of my libraries on Hackage. In particular I've been working on making sure the code is portable to Hugs as well as GHC, since I'm not relying on GHC-specific extensions. This is made tricky, however, because I am using both CPP and the FFI. Along the way I've been stumbling into Dead Code Land, the place where code that was never documented goes when the project dies. Perhaps my charting of these territories will help anyone else misguided by portability.

Some background. The latest release of Hugs is September 2006, which was contemporary with, though prior to, GHC 6.6. Now that GHC 6.10 has been released, GHC 6.8 is quickly nearing the end of it's life cycle, though many people still use it (since installation is always a headache, or since 6.10 still has a few unfortunate performance regressions). For my code I'd like to support Hugs Sept06, GHC 6.8, and GHC 6.10. I'm fine to call GHC 6.6 dead; porting the code to it shouldn't be too difficult, but there are some things like LANGUAGE pragma which are just too nice to loose. There are other compilers out there like nhc98, yhc, jhc, and lhc though they tend to be niche and don't support all the standard extensions to Haskell 98.

Beyond compilers. Cabal is the de-facto packaging tool for Haskell. On GHC 6.8 we have Cabal 1.2.3.0, though because of the rapid evolution of the project, this version is ancient and all it's version-specific documentation has been blown away by the documentation for newer versions. Haddock is the de-facto documentation tool. Here we have version 2.0.0.0 which is pretty recent, though all the documentation is on how to write doc code, rather than how to run the program that compiles that into html.

And now on down to the dirt. The use of CPP for conditional compilation of Haskell has a long, if dubious, history. There's a cpphs project which aims to clean up C's original cpp, in order to make it easier to use for Haskell code which is sensitive to whitespace and other things that C isn't. The only other project even vaguely in this domain is Template Haskell, but that's GHC-only and it does far more besides. Using CPP with GHC is trivially easy. For Hugs it's a bit more difficult. Hugs doesn't have a --cpp flag, but instead has flags for being able to filter your code with any arbitrary program. A canonical invocation looks something like this:

hugs -F'cpp -P -traditional -D__HUGS__=200609' MyProgram.hs

Some tricky things to note. First is that C's cpp requires the -P flag in order to work in "pipe mode" rather than reading a file directly from disk. Second is the -traditional flag which turns off some newer cpp features which can interfere with Haskell code. And finally, for our conditional code, we define a macro that says we're running under Hugs Sept06.

Another tricky thing to note is that this will work on the command line, but Cabal 1.2.3.0 is hard-coded to use cpphs instead. In order to mix things up, cpphs has a different set of flags than cpp does, and so telling Cabal to just use cpp instead won't work. Another tricky bit is that Hugs installs a version under the name cpphs-hugs, so you may need to set up an alias or tell cabal --with-cpphs=/blah/bin/cpphs-hugs during configure.

Haddock (2.0.0.0 re Cabal 1.2.3.0) also has some misfeatures regarding CPP. In your .cabal file you can set the field cpp-options in order to define your macros, and from here they get included in the preprocessor runs for GHC or Hugs. They do not, however, get called when haddock is called. Thus, since haddock interprets the code in order to extract type information, you'll need to set the ghc-options field, which does get passed to haddock. To complicate things further, recent updates to Hackage have filters to detect setting CPP macros in ghc-options and will reject the package, telling you to use the cpp-options field instead. (A good practice in general, but doesn't work as well as it should.)

Update (15 March 2009): This misfeature re Haddock has been fixed in Cabal 1.6.

The foreign function interface was an early extension to the venerable Haskell 98 spec. Compiling to use the FFI is again trivial in GHC. For Hugs it's a bit more complicated since Hugs is an interpreter, not a compiler. The solution is, before calling hugs, just call ffihugs with all the same options and it will compile and link a DLL that gets loaded in when the source code is interpreted.

Again, this will work on the command line, but Cabal 1.2.3.0 has other ideas. To be fair, Cabal does automatically detect if you need to call ffihugs and will do so on your behalf. The problems are, again, more to do with option passing. The call to ffihugs does not inherit the hugs-options field in your .cabal file. There is no ffihugs-options field, but you can pass the --ffihugs-options="-98 +o" flag to configure. (Similarly there is no haddock-options field, but you can pass --haddock-options=-D__BLAH__ in order to get around the Hackage restriction.) The other problem is that cpphs is not called before ffihugs is called, therefore you must pass in the flag for the filter. This part is especially tricksy.

Scripting 101: if something has too many (or too few) levels of quoting, blame your shell first. Blame the program later, if ever.

The secret incantation is --ffihugs-option=-F'cpp -P -traditional -D__HUGS__=200609'. First trick is to use --ffihugs-option instead of --ffihugs-options. The plural does some whitespace tokenizing for being able to pass multiple flags at once, but it goes awry here. Second trick is to use single ticks in order to protect the spaces, but do not embed those within double quotes (like we usually do with --prog-options="..."). If you have that second level of quoting, then Cabal will end up looking for the program which is named by the whole quoted token, instead of evaluating it as cpp and some arguments passed to it.

Through much trial and tribulation, I can finally get Cabal to compile my packages for Hugs. Haddock still doesn't work, since it must be run with the same version (of GHC) as it was compiled with, but that's a whole other can of worms. The big sticking point now, is that all those CPP macros should be set in the .cabal file once and for all (using Cabal flags), and the user shouldn't have to know anything about them. At this point I'm thinking that there's simply no way around it without requiring a newer version of Cabal which fixes the option-passing bugs. So I guess it's back to README files to tell folks how to compile it. That's too bad really.

music of the moment: #1 Crush ~ Garbage (William Shakespeare's Romeo + Juliet)

categories: haskell, hugs, cabal

9 March 2009 : New Releases

Since this website seems to be fairly dead at the moment, I thought I'd mention a pointer to my new lightweight site for Haskell projects. Also, since I'd like to avoid spamming Haskell Café every time there's a new release, I think I'll start announcing them here for the folks who care.

logfloat-0.11.1 — added a new instance IArray UArray LogFloat, by popular demand. Honestly, for the library's use case I should've thought of it earlier.

categories: haskell, release

17 July 2008 : ICFP OMG WTF BBQ

Who needs functional languages when you have TeX?

categories: haskell

11 July 2008 : For Want of Difference Types

So I ran into a discussion on Lambda the Ultimate wherein was raised the topic of what the "real" mathematical type for (integral) division should be since :: Int -> Int -> Int doesn't cut it. Someone proposed :: Int -> Int -> Int+1, which is the normal way it's implemented— that is, the result is an integer or a singleton value representing (the single divide-by-zero) error it can generate. In a total functional world this has the problem of how to percolate exceptions up, since the caller isn't allowed to fail either.

That type is flagrantly wrong.

Invoking the abstract of the paper presented there (the site seems offline, so I can't read the paper), the reason it's flagrantly wrong is because it's not mathematical. Mathematics doesn't define an "undefined" value which would be permissible to return, it simply fails to define any value whatsoever. So if extending the result type is both mathematically incorrect and also a programming infrastructure nightmare, then what is the right answer?

The right answer is to restrict the inputs to the function. The usual way of doing this involves using dependent types to carry extra information about values in the types. Dependent types have many benefits (and many many issues), but we don't need even need them for this. All we need is difference types :: Int -> (Int \\ 0) -> Int.

In an ADT-based language like Haskell or ML, a type is a sum of products, for example the :: Maybe a type is either Nothing or Just a which can be expressed as 1+a. But in Haskell neither "Nothing" nor "Just a" are types, those names only exist in the data layer of the language, not the type layer. Sometimes we know for certain which branch of the type a value is in and I've often wanted to be able to treat the individual summands of a type as a type itself.[1]

Adding such an extension would not be too difficult, nor would it break the type inference algorithms in use, though it would require an, er, refinement of them[2]. In the underlying abstract machine of GHC, the case statement represents evaluation and hence is the only way to get the value from an expression. Case statements also give us proofs about which branch of a sum type a value resides in. That is, the following code is type correct.

> f :: Nothing -> b
> g :: Just a  -> b
>
> test :: Maybe a -> b
> test mx = case mx of
>           Nothing -> f mx
>           Just x  -> g mx
>
> -- This gives an identical proof
> test' mx@Nothing  = f mx
> test' mx@(Just x) = g mx

Coming into the test function we only know the maximal type that mx could have, and this is still inferable from looking at the heads of the case statement. But, once we've evaluated mx we have the proof necessary to refine the type to either :: Nothing or :: Just a. The type signatures of f and g are more restrictive, but we know that it is type safe to call them.

For the Maybe type this kind of refinement doesn't help very much because once we've refined things, the value is either unit (Nothing) or simply a (the Just tag gives us no additional information since Maybe contains exactly one extra bit of information and we already have that information from our proof). But for other types it can be quite helpful. Many of the standard list functions are not defined for empty lists and so being able to distinguish :: [] from :: (a : [a]) would be quite helpful in cleaning up the list library.[3].

Similarly, being able to give compile-time proofs that an integer is not 0 lets us not only make many functions correct, but lets us do so without any run-time overhead for verifying this proof. In the same way, if we can give laws expressing the zeros and identities of functions, the compiler can use this information along with proofs that a particular value must be one of those special values in order to optimize the function call away all together. GHC already gives us some of that ability via rewrite rules, but the barrier to entry can be high and it would only work when the special value argument is a literal.

In addition to removing the errors thrown by basic library functions, difference types would enable us to do some more extreme hackery when it comes to coproduct types. The :: Either a b type is defined as Left a or Right b and is another frequent monad for error-prone functions (where a is the type of all possible errors, generally a string for simplicity). When dealing with this and other monads for capturing errors, we often find that the monad comes to take over the world, either because we need to propagate those errors or because we want a family of functions to return the same type for consistency's sake. In the latter case, declaring that a function returns Right b would allow the programmer to see uniform code, and let the compiler know that it can probably optimize the type tag away. Similarly, we often want to take an :: Either a b and convert it to an :: Either a c. As things stand, if we have a Left a we'll have to case match out the a and then rewrap it to obtain Left a. This extra work is unnecessary because the type Left a can freely and safely be cast up to forall b. Either a b; since the b doesn't exist in the value, it is free to be whatever is needed.

Having difference types would not make Haskell total. It would take full dependent types to restrict the inputs to the lookup function on maps and lists so that it's guaranteed to succeed. But having case-directed safe downcasting from a sum type to a refinement of that sum, and free upcasting from refined sums to less refined sums, would be really nice. And if we could write the :: Int -> (Int \\ 0) -> Int version of division, it would be easy to turn it into :: Int -> Int -> Maybe Int for the folks who like that sort of thing.

[1] The only way to do this in Haskell is to define each of the branches individually and then to define their union as another type. Doing this has many downsides including the double tagging of types, no free casting from the individual types to the union type, and not restricting the individual types to only being used in this particular union.

[2] Since a parameter's type may be narrowed in subexpressions rather than being constant in the whole function body.

[3] Note that the second argument to cons is still the full list type. The fixed point on types breaks things and allowing us to unroll the list's type further may lead to some of the problems encountered in dependent type systems. All I'm advocating here is that we have difference types as the strict inverse of union types so we can say :: ([a] \\ []) which is exactly :: (a : [a]).

music of the moment: Beauty Queen-Horses ~ Tori Amos (Boys For Pele)

categories: haskell

10 July 2008 : Lawful Programming, Lazy Testing

Okay, so my plans for weekly postings about the papers I've been reading failed. Mostly because I haven't been reading as many these last few weeks than I had been. But I left off with a comment about testing in Haskell and I figured I should pop that off my stack before going on to a different post.

One of the glorious things about having a pure, side-effect free, language is that you can state properties of your program as mathematical laws. When you, the programmer, know your code should adhere to some law like x + y == y + x, the usual thing most languages force you to do is enumerate a few hundred unit tests stating the law over and over with different values for x and y. But what if you could just state the law itself and let your test harness make up the examples? While mathematics is obviously amenable to such laws, the realm of lawful programming extends far beyond mathematics.

Consider for example that you have some function you want to write which has an implementation that's entirely straightforward yet painfully slow, and another implementation that's horrifically complex but fast. Naturally we'd like to use the fast version, but how can we know it's implemented correctly? The slow implementation is easy to decipher so it's easy to spot check for being correct. The easiest way to test the complex version is to implement both, dictate a law stating they're equal (i.e. they give the same output for the same inputs), and then let our test harness try a few thousand arbitrary inputs and check that the outputs are indeed the same.

If that idea sounds so outrageous, so unfathomable, so simple that it just might work then QuickCheck is the test harness that made it happen. For those who've been around the Haskell community for a while, you may be wondering why I'd bring up this oft-trotted-out example of the beauty of pure functional programming. Well, other than introducing the new folks, QuickCheck ain't the only lawful kid on the block.

QuickCheck randomly generates values to feed into your laws. While this is good for informally testing code, there's a lot to beware of. For recursive types like lists or trees, usually the values QuickCheck generates are fairly small so it may not actually be as rigorous as you think. Moreover, because the values are also random, often you'll be testing the same small inputs over and over. You can tweak the functions for generating these values to bias them to be longer, which is usually enough to feel secure that passing the trial is meaningful. But it would be nice to be more systematic about it all.

SmallCheck takes the same laws as QuickCheck but, instead of passing random inputs, it will exhaustively try all values up to some depth (e.g. list length, tree height). If there are any errors, even corner cases, chances are they'll show up within that search space. Unfortunately, most of the time the search space increases exponentially with the depth, which means you can only test up to a rather limited depth.

LazySmallCheck does the same thing as SmallCheck but it employs laziness to prune the search space and so it can search deeper, faster. Let's say you have a function like this: f xs = if null xs then False else (head xs == 1). Because the function only looks at the first element of the list, it's possible to simultaneously test every list with a given first element. In situations like this LazySmallCheck can test a potentially infinite number of values in constant time. (Eventually LazySmallCheck is intended to merge with the strict SmallCheck.)

Speaking of laziness, it's often the case that being too strict means you're doing too much work and taking up too much memory. So how can we be sure a function we've written is as lazy as possible? It's still in an experimental state but, StrictCheck [hs, slides] aims to test just that. Moreover, if it finds that the function is not least-strict then it will suggest ways to make it less strict.

StrictCheck is built using the ChasingBottoms library. ChasingBottoms is certainly "unsafe" in the unsafePerformIO sense because it gives you access to information about the state of Haskell's runtime environment —information which can allow you to break the semantics of the language—, and it's GHC-specific (sorry Hugs, nhc98, Yhc, and Jhc folks), but when you're trying to test the laziness abstraction itself, sometimes you gotta murder a few people.

But away from murder, let's get back to the lawful. One of the nice features of QuickCheck is that you can place guards on your laws to say "if these preconditions hold, then this law holds". One problem is that if the precondition is too restrictive, QuickCheck will exhaust its allocation of samples without finding enough valid inputs for testing the law. We can raise the allocation limit, but when the valid inputs are sparse, even that won't help enough. Enter SparseCheck. Using the paradigm of logic programming, SparseCheck lets you define generators of arbitrary values which pass the preconditions, without even trying any values which don't.

While formal validation is an excellent sledgehammer only pure languages can succeed at, sometimes it's not enough. Sometimes you have to deal with the RealWorld, and sometimes you want to write explicit regression tests to trap the known corner cases where bugs have been found. When it comes to such unit testing HUnit is the framework to use.

This just in, delving one step further towards traditional testing methods, HTF offers a single framework for combining HUnit and QuickCheck along with doing file-based tests (`runMyProgram | diff - expectedOutput`).

So go forth and make agile with the test-driven development my friends!

music of the moment: Maahinen Neito (Earth-Maiden) ~ Värttinä (Iki)

categories: links, haskell

8 June 2008 : The PLS group are made of awesome.

Now that the summer's here I've been hoping to get back into posting more regularly. Over the last year I've accumulated a large number of "to read" papers on various topics having to do with functional and logic programming. And now that the summer's here I'm finally getting a chance to go back and read them. That sounds like two great tastes that taste great together, so let the linkblogging begin.

Functional programming is historically maligned with having less efficient implementations than imperative languages do, mostly because historically that's been the case. Note the emphasis on "historically". However, if you're not careful, you may be perpetuating that sad state of history. If you think —being a non-academic programmer— that academic papers don't affect you, quoth dons on #haskell: programming is not for the naïve.

In functional languages the list data structure is foundationally ubiquitous, and in lazy languages like Haskell they also abstract over the recursion pattern of loops in imperative languages. Unfortunately this excessive use of lists places an enormous burden on allocating and reclaiming memory, which is a major source of historical inefficiencies in functional language implementations. Fortunately, much research has been done to overcome these issues... if you take advantage of it.

One of the great claims of high-level (mathematically-rigorous) languages is that the altitude gives the programmer the ability to program prettily and idiomatically, while making the true invariants more visible to the compiler thus allowing it the freedom to optimize as far as (or further than) illegible programming would've let you. Today's papers are about approaching that goal through "deforestation" techniques which are a subset of "fusion" optimizations. All of today's papers are about automating these optimizations, but compilers are complex and it takes time to work current research into them. So today's post is also a PSA: only you can start forest fires, use libraries which optimize your lists away.

@InProceedings{ CSL2006:rewriting,
    author    = "Duncan Coutts and Don Stewart and Roman Leshchinskiy",
    title     = "Rewriting {H}askell {S}trings",
    booktitle = "PADL 2007: Practical Aspects of Declarative Languages
                 8th International Symposium",
    year      = 2007,
    month     = jan,
    pages     = "50--64",
    publisher = "Springer-Verlag",
    url = {http://www.cse.unsw.edu.au/~dons/papers/CSL06.html},
    url = {http://www.cse.unsw.edu.au/~dons/fps.html},

    Abstract = {
    The Haskell \textit{String} type is notoriously inefficient.
    We introduce a new data type, \textit{ByteString}, based on
    lazy lists of byte arrays, combining the speed benefits of
    strict arrays with lazy evaluation. Equational transformations
    based on term rewriting are used to deforest intermediate
    ByteStrings automatically. We describe novel fusion combinators
    with improved expressivity and performance over previous
    functional array fusion strategies. A library for ByteStrings
    is implemented, providing a purely functional interface, and
    approaches the speed of low-level mutable arrays in C.}
}

If your code does any string processing of note, you'll want to use Data.ByteString available on Hackage. This library provides the standard C implementation of unboxed arrays of characters, but with the standard Haskell interface of high-level functions for treating them like lists. This paper is also one of the best introductions to the field of deforestation I've seen so it's a good read to get a handle on the issues and the solutions out there.

@InProceedings{ Gill93:shortcut,
    author    = "Andrew Gill and John Launchbury and Simon~L. Peyton Jones",
    title     = "A Short Cut to Deforestation",
    booktitle = "FPCA '93: Conference on Functional Programming
                 Languages and Computer Architecture",
    location  = "Copenhagen, Denmark",
    pages     = "223--232",
    month     = jun,
    year      = 1993,
    publisher = "ACM Press",
    address   = "New York, NY, USA",
    
    isbn = "0-89791-595-X",
    url  = {http://citeseer.ist.psu.edu/gill93short.html},
    url  = {http://portal.acm.org/citation.cfm?id=165214},
    doi  = {http://doi.acm.org/10.1145/165180.165214},
    
    Abstract = {
    Lists are often used as ``glue'' to connect separate parts of
    a program together. We propose an automatic technique for
    improving the efficiency of such programs, by removing many of
    these intermediate lists, based on a single, simple, local
    transformation. We have imoplemented the method in the Glasgow
    Haskell compiler.}
}

This is an older paper and one of the first to present a general method for all permissible programs rather than a more restricted subset. The optimizations mentioned here are already implemented for stock GHC lists so you don't even need a library to take advantage of them. This is more of a background paper for contextualizing the PLS papers.

@InProceedings{ CLS2007:stream,
    author    = "Duncan Coutts and Roman Leshchinskiy and Don Stewart",
    title     = "Stream {F}usion: From Lists to Streams to Nothing at All",
    booktitle = "ICFP 2007: Proceedings of the ACM SIGPLAN International
                 Conference on Functional Programming",
    year      = 2007,
    month     = apr,
    note      = "To appear",
    url = {http://www.cse.unsw.edu.au/~dons/papers/CLS07.html},
    url = {http://www.cse.unsw.edu.au/~dons/streams.html},
    
    Abstract = {
    This paper presents an automatic fusion system, \textit{stream
    fusion}, based on equational transformations, that fuses a wider
    range of functions than existing short-cut fusion systems. In
    particular, stream fusion is able to fuse zips, left folds and
    functions over nested lists, including list comprehensions. A
    distinguishing feature of the stream fusion framework is its
    simplicity: by transforming list functions to expose their
    structure, intermediate values are eliminated by general purpose
    compiler optimisations.
    
    We have reimplemented the entire Haskell standard list library
    on top of our framework, providing stream fusion for Haskell
    lists. By allowing a wider range of functions to fuse, we see
    an increase in the number of occurrences of fusion in typical
    Haskell programs. We present benchmarks documenting time and
    space improvements.}
}

This paper is a followup on Rewriting which seeks to apply the fusion techniques to all lists (though not the unboxed arrays part of Data.ByteString). This too is provided as a library Data.List.Stream also on hackage. The plan is for this library to eventually replace the whole Data.List module and to ship with GHC itself.

One of the reasons the PLS folks are made of awesome is the empirical benchmarking they give along with the theory. Sadly, due to some limitations in GHC's general optimization techniques, this library is still experimental and still somewhat hit or miss (as of 2007 at least). For the majority of programs there's marginal gain over the foldr/build approach, for some there's amazing gains (because foldr/build can't apply), but for some deeply nested loops those limitations in GHC means we can't completely clean up after ourselves and so there's actually performance degradation. The moral of the story is to profile your code and use this library whenever possible (and help the maintainers when it breaks). As for the profiling, stay tuned...

music of the moment: Di Me ~ Gipsy Kings (Compas)

categories: I for one welcome our functional overlords, links, haskell

25 April 2008 : .agenda

Another stage in The Plan has been revealed to me:

Previous stages I'm permitted to share with the uninitiated:

music of the moment: Happiness Is A Warm Gun ~ Tori Amos (Strange Little Girls)

categories: I for one welcome our functional overlords, links

16 April 2008 : Escaping monads and topological spaces of computation

So I got to thinking about monads for computation, in particular about the "no escape" clause. In practice you actually can escape from most monads (ST, State, List, Logic (for backtracking search), Maybe, Cont, even IO) and these monads all have some "run" function for doing so (though it may be "unsafe"). These functions all have different signatures because there are different ways of catamorphing the computation away in arriving at your result, but in principle it's possible to subtype these computations into having appropriate generic execution functions.

The simplest computation is a constant computation which successfully returns a value, ST and Identity are just such computations. One way of extending this is to say that a computation is nondeterministic and can return many answers[1]. A different way of extending it is to say that the computation might fail to return any value, Maybe is just such a monad. And of course we could take both those extensions and either fail to return any answers or return nondeterministically many answers, List and Logic are our examples here. Another extension is that a computation may have different kinds of failure, like (Either e), and this is where things start to get interesting.

To generalize what I've said so far we can talk about the topological space of a computation's result. For a computation m a a constant computation has a result in a (we can think of the type a as the alphabet of possible kinds of success). An infallible nondeterministic computation has a result in ak where k ranges over the positive natural numbers and indicates all the different ways we can succeed.

A fallible computation could be thought of as ak where k is either 0 or 1, and where the empty result signifies failure. A different way of thinking fallibility which doesn't rely on empty results is as e | a where e is our error type. From this perspective the difference between Maybe and (Either e) is that Maybe has a unit type for failures, whereas (Either e) has a vocabulary/alphabet of e kinds of failure. Simply-fallible nondeterministic computations like List have results in the space a(0|k) == () | ak. For nondeterministic computations which may have some variety of possible failures, their result is in the space e | ak for an e with cardinality greater than 1. (Infallible computations also have this form, but for an e of cardinality 0.)

To run a constant computation your function is of the type m a → a, easy as pie. For simply-fallible computations you need to have some default value to return on failure in order to make it safe, hence m a → a → a. For nondeterministic computations you need a way of combining successes, hence m a → (a → a → a) → a. And for simply-fallible nondeterministic computations you need both: m a → (a → a → a) → a → a (which should look suspiciously similar to fold, though we've abstracted away the mapping aspect of fold which allows us to inject the values of a into values of a space b). For computations which have a variety of failure modes, instead of a default value, we need a function from errors to default values.

But let's put aside the idea of escaping monads and go back to the more interesting track of the typology of result spaces for computations. From symmetry, one may wonder whether it's possible to have a computation which may fail nondeterministically many times, i.e. has results in the space ej | ak for some j larger than one. Certainly it's conceivable. In fact, one reasonable example of this shows up in things like compilers. A good compiler does not simply exit after finding one error but rather tries to continue as long as possible in order to return as many errors as it can. Since there are different varieties of errors, compilers have results in ej | a for some non-unit vocabulary of errors e and some vocabulary of compiled programs a (for compilers that batch separate compilation we may rather think of it as in the full ej | ak where each object file is a separate value in a).

With simply-fallible nondeterministic computations we noticed that a0 == ()1, one may wonder then how to think of many errors in a simply-fallible computation. I posit that one way of thinking of multiple unit failures is as aj for some negative natural number j. In some interesting areas of formal language theory and typology theory the idea of such negative length strings is explored in some depth, though unfortunately I'm not familiar enough with these extensions, but I'm intrigued by the connection if it does in fact hold. This whole thing is also somewhat evocative of imaginary and complex numbers, another area I'm insufficiently schooled in to make any astute observations.

One thing such theories allow is the intermingling of positive and negative letters (or real and imaginary parts). So far I've presented the space of computations as ej | ak but why do we need to have the disjunctive union there? Could a nondeterministic computation return both success and also failure? Usually when we think of nondeterminism we implicitly assume we're going to throw away all the errors, leaving zero successes as our only error result. There's no reason why we couldn't keep the errors as well as the successes, though there's some question about whether such a thing would be helpful as a particular model of computation.

Again I'd argue that such a thing may indeed be helpful. One of the core ideas in quantum mechanics is the notion that things can exist in superposition. Nondeterminism captures the idea of a superposition among many choices. The notorious example from quantum is that something may be in a superposition not only between different locations but also between existing and not existing (or living and not living for the sake of some poor felines). I'd posit that a computation which nondeterministically returns both success and also failure captures just such a situation. Again, I know too little about quantum to say much of import here, but I think the connection is real. With many kinds of failure (non-unit error type) and the ability to fail in many ways (non-unit exponent) we can capture an entire multidimensional space of nonexistence/failure which could go in superposition with all the kinds and ways of existence/success.

Fuzzy set theory and fractal decision boundaries (and hence all of chaos theory) are other areas which may benefit from the ability to have computations that succeed while failing. This in turn brings us back to the theory of computation. ej | ak obviously models the ability of a turing machine, but does ej ak have any extra formal power? That the latter model allows for contradictions (both succeeding and also failing at the same time) may seem to imply that it does, but of course our first-blush intuitions about formal power don't always match reality (e.g. nondeterminism only gives formal power at certain levels of the language/automata hierarchy, but not at other higher levels).

In all, an interesting thought experiment. It also makes clear to me that I need to learn more higher math in order to flesh out the connections to complex numbers, negative length strings, and quantum mechanics.

[1] Of which I can conspicuously think of no examples offhand which aren't also simply-fallible. I'm sure there're some out there. We could construct one by having a new kind of List which has a singleton constructor instead of nil, and some tree ADTs are similar, though I'm not sure how "natural" these are for expressing any particular variety of computation in practice.

categories: monads, theory of computation, computer science

25 April 2007 : On Composition and Syntax: Thoughts

One of the classes I'm taking this term alternates between Haskell and Smalltalk in trying to teach a bunch of seniors and graduate students "extreme programming" and how coding in the real world is different from in school. In one of the exercises we were working with an attempt to formulate Haskell-like tuples and lists in Smalltalk, in particular trying to debug the implementation we were given. We found numerous issues with the implementation, but one in particular has been nagging me. It indicates inherent limitations in the syntax of Smalltalk, but mulling it over, it seems to be an even deeper issue than that.

Part of the implementation for tuples[1] was overloading the comma operator (normally string concatenation) to create pairs like (a, b) or triples like (a, b, c) etc. The problem was this, how do we do tuples of tuples? Using code like ((a, b), (c, d)) does not give us a pair of pairs, but rather is equivalent to (a, b, (c, d)). I thought, at first, it was a problem of associativity; when the parser sees the second comma, the one after the b, it takes the preceding object and combines it with what follows, in effect it's acting like an operator for constructing lists. Reversing the associativity of the operator just gives us the same problem in the other direction yielding ((a, b), c, d). This is not an issue for Haskell because the parentheses are required and so they let us know for certain when we're done making a tuple. But in Smalltalk as with most languages, the parentheses are only there as suggestions on how to create a parse tree.

All this diagnosis I did for the exercise, but I've just struck something. There is a deep seated difference between "destructive" and "constructive" operations in any language. Operators like addition, subtraction, and so on are destructive operations. The results they come up with are different from the arguments they start with. When we say 3 + 5 there is nothing of the three or the five in our answer of eight; certainly the answer is derived from the inputs, but the 8 is a unique entity, we could have made it by adding two and six, or subtracting four from twelve, or any number of other operations. Destructive operations are a form of compression, we had two objects, now we have one, something was lost.

But there are other operations too, such as the comma we were trying to make and such as the colon, period, and dollar in Haskell. These constructive operators create a new value which is a composition of it's arguments or a container for them. The list constructed by 1:2:3:[] is not merely derived from the arguments given to the operators, those arguments themselves are a part of the list they are in, they are pieces of the composite whole, we can't make that list without those values (even if the exact sequence of operations to arrive at those values can vary).

This idea of constructive operations should not be new to any computer scientist. It is merely a rephrasing of the idea of abstract algebras. But the distinction between constructive and destructive operations should give light to the very real differences between abstract algebras and real ones. Generally, real and abstract algebras do not interact in the rarified world of mathematics, or to the extent they do they're considered isomorphic.[2] But in the real world, in computers, the two do coexist and they behave very differently.

The problem then is this: how to we define the boundary of a composition? For the case of tuples in Haskell the answer is to make the parentheses required on the periphery and forbidden within, thus in effect making a series of unique operators: the binary (,), the ternary (,,), the 4-ary (,,,), etc. But the problem is a systemic one, and it can be rephrased as this: when do we treat a composition as a single whole vs when do we treat it as a collection of parts? Is a list within a list just a longer list, or is it a list with a composite value? The answers, of course, depend on the language[3] but it's something which always needs specifying and which is always fraught with tradeoffs. In imperative and object-oriented languages this is also known as the problem of when should we use a pointer or reference to a composition vs when should we use the composition itself? In languages with pass-by-reference as well as pass-by-value the whole issue becomes even more hairy because we can offer the illusion of treating compositions as wholes while retaining the syntax of treating them by parts, forcing one to "just know" which functions have which behaviors.[4]

Aside from the question of when to treat a composition as a whole vs a collection of parts, there's another question which is striking. In dealing with the real algebras of destructive operations, parentheses have the meaning of affecting the order of operations, or being a serialized representation of a tree structure. Which means parentheses that don't change the order from it's parentheses-less default can be discarded, and since intermediate values are destroyed, once a parse tree is constructed all of them can be discarded (just as they're unneeded in prefix-only or postfix-only languages). With the abstract algebras of constructive operations, however, parentheses mean something entirely different: they represent the boundaries of the composition, they declare the semantic shift from where we should treat a composition as a collection of parts (that is, as being in the process of being constructed) to treating it as a unified whole (that is, as a single value to be used in other operations). Because they indicate a shift in meaning they can (almost) never be discarded.

And yet, no language I'm familiar with actually makes a distinction between constructive and destructive operations nor makes a distinction between the very different kinds of parentheses.

[1] For those unfamiliar with the term/datastructure, tuples are objects that contain a fixed number (and order) of values which may be of different types; somewhat like a vector. The fixed size is important and makes it different from lists which have variable and mutable size or arrays which have variable (though generally immutable) size (and both of which are typically limited to being collections of a single type of variable). That is, a 2-tuple or pair is an entirely different type than a 3-tuple or triple, unlike different length lists or arrays which can generally be treated the same. For those familiar with C-like languages, tuples are a lot like records or structs though their components don't have names, just positions.

[2] So for example, the natural numbers can be thought of as the value 0 and the value received from applying a successor function to other natural numbers. Hence succ(0) is analogous to "1", succ(succ(0)) is analogous to "2", etc. In this way they're considered isomorphic by mathematicians since we can find a mapping between the inductive/constructive/abstract representation and the normal/destructive representation. But you can start to see the difference if you think about how we would go about writing that successor function. For the abstract structure to hold, there don't exist bit patterns to represent the natural numbers as we're used to seeing them which we could return, instead the successor function (like the zero function) are just data constructors like when we call new to create an instance of an object.

[3] In Perl, it's just a longer list. (Yes, I mean "list" (or perhaps "sequence") and not "array". Perl does have both, though only the latter can be stored as a datatype. But if we're talking about Perl's list-like "arrays", then the answer becomes an array with composite values (stored as references).) In Haskell it's always a list of compositions, and all the scalar values the thing ultimately contains must be of the same monomorphic type. In our Smalltalk assignment above, it's both, depending on where it's at in the list.

[4] Which is not just an idle question in, for example again, Perl. Normally arrays are expanded into lists of their components for passing to functions, hence given a call like foo(@a, @b) the function cannot know where the first array ends and the second begins, it just gets a list of all the values together. But by using prototypes you can get pass-by-reference behavior and so suddenly you can do things like that, but they must be real arrays or hashes not references or (such as they exist) literals. There isn't the same problem in C/C++/Java because in those languages the name of an array is actually a pointer/reference and you never really have the array as a collection of individual components.

music of the moment: Narrow Your Eyes ~ They Might Be Giants (Apollo 18), She's Dead ~ Jim's Big Ego

categories: design, computer science

1 August 2006 : LDAP: authentication on linux

Today's victory: getting a linux client to authenticate users (i.e. let them log in)

This tutorial covers most of the nasty bits. It turns out that I was nearly there with my earlier attempts at setting up the client, I just needed to mess with /etc/pam.d/* stuff. That tutorial will only enable logging in via normal means such as using su. To use ssh you also need to add the following to /etc/pam.d/ssh:

auth       sufficient   /lib/security/pam_ldap.so
auth       required     /lib/security/pam_env.so
auth       required     /lib/security/pam_unix.so shadow nodelay
account    sufficient   /lib/security/pam_ldap.so
account    sufficient   /lib/security/pam_ldap.so
account    required     /lib/security/pam_unix.so
password   required     /lib/security/pam_cracklib.so
password   required     /lib/security/pam_unix.so shadow nullok use_authtok
session    required     /lib/security/pam_unix.so
		

Note that some tutorials say to use pam_pwdb.so instead of pam_unix.so, but pam_unix.so seems to obviate the other so far as I can tell. On Solaris it looks like that in turn has been obviated by other modules. If you experience problems with PAM, yet another log to take a look into is tail /var/log/auth.log .

categories: ldap

18 July 2006 : LDAP: linux clients

Today's task: screw Solaris, let's try to get a linux client running.

So far I'm not sure if linux boxen can use profiles, though surely there must be a way. Instead they use /etc/ldap/ldap.conf. If all else fails we can make a script to update that file from the profile. All you have to do is set the BASE and URI appropriately and ldapsearch works perfectly.

Second task is to set it up to use ldap instead of nis for automounting. Since my old ldifautomnt.pl script was using Solaris' nisMap and nisObject objectclasses I needed to rewrite the script to use the automountMap and automount objects linux expects. Alas, the OID for automount given in the schema file I mentioned earlier conflicts with that of memberNisNetgroup. Jury's still out to lunch on how to properly proceed, but for now I just made up a new OID in RedHat's space. This site offers a listing of OIDs, but I haven't been able to find where automount belongs.

categories: ldap

17 July 2006 : LDAP: client profiles

Today's travail: trying to set up an ldap client using profiles.

Instead of just killing smelly (the Solaris client test box), this time I tried setting up profiles to make it easier to kill him next time. Now, technically, using profiles isn't "required" but a quick read through ldapclient's manpage and through Sun's docs shows it to be preferred over manual configuration. Naturally, since it isn't technically required, Sun doesn't include the appropriate schemas to use it.

This page has some pretty good documentation on setting up a client on Solaris, and it also includes the schema for nisDomainObject in the LDIF format that SunONE understands. With that schema and the one for DUAConfigProfile mentioned last time (and in need of conversion from OpenLDAP to SunONE) you're armed to begin setting profiles up.

When you run ldapclient init -a profileName=profile-name, it looks for a nisDomainObject with a nisDomain attribute matching the current NIS domain[1]. Then it assumes an ou called profile as a child of that object and searches under there for a DUAConfigProfile with a cn attribute matching the profile you're trying to load[2]. Which means that you need to have that ou=profile underneath any nisDomainObjects you may have.

So now that you know how ldapclient works, this is when we run into another problem. We're using profiles so that we don't need to manually configure every client, but instead can use ldapclient to retrieve the client file from the ldap server for us. When running the command on Solaris 10, however, we run into another glitch: Solaris 10's Service Management Facility. When smf attempts to turn on and off the different services to initialize ldap it fails to start network/ldap/client:default. After looking through many a log file we eventually identified that the problem is:

ldap_cachemgr[5459]: [ID 293258 daemon.error] libsldap: Status: 0 Mesg: Configuration Error: No entry for 'NS_LDAP_BINDDN' found.
ldap_cachemgr[5458]: [ID 703877 daemon.error] ldap_cachemgr: failed (rc = 255).

I.e. there's no client file. Now, the whole point of running ldapclient init ... is to generate such a file! At this point I'm inclined to believe that this is a bug in Sun's code. Absolutely no diagnostics are presented for why the file might not be created, nor does any log file mention anything other than that the file fails to exist. I would be elated if someone could point out a solution to this problem. For further reference, trying ldapclient manual ... also fails with the exact same unreported error.

[1] Or (&(objectClass=nisDomainObject)(nisDomain=current-nis-domain)) if you prefer.

[2] According to the logs it actually looks under ou=profile,... for (&(|(objectClass=SolarisNamingProfile) (objectClass=DUAConfigProfile)) (cn=profile-name)) which seems to imply that a SolarisNamingProfile could be used as well. Since ldapclient genprofile ... creates a DUAConfigProfile however, I'm disinclined to tempt fate.

mood of the moment: disgruntled

categories: ldap

16 July 2006 : LDAP: Introductions, client configuration, schema loading, and autofs

So at work they have me switching over the network from NIS to LDAP for a network of a couple hundred machines on a mixed network of Solaris10 and linux (Ubuntu). Now, the basic idea of LDAP is pretty easy and there are a number of books that'll teach it to you like O'Reilly's take on the subject. Unfortunately once you get past the basic introduction to what it is, the documentation peters out.

Documentation is important. In fact that's the very First Law of Language Design: a language is only as good as its documentation. The documentation for LDAP is like the documentation for an extremely proprietary product intended only for use by governments and very large corporations. First of all there's very little of it. Secondly what documentation there is goes into excruciating and technical detail of very specific and often esoteric facets of the product, but makes no attempt whatsoever to answer basic questions nor mention how one is to actually use it. Thirdly, what brief discussions there are of how to do half of a basic task only discuss a different model (openldap) than the one we're using, which is of course incompatible with ours (SunONE DS5.2).

Back when I started this blog I started it with the hopes of collecting my thoughts and experiences as I navigate the world of IT and F/OSS in particular. While oft I've been defunct at doing so, I think for this project especially it's important to keep a log of my discoveries so that others in my position may learn from my tribulations. And now, onward to the first colossi to be slain.

...

In truth, the first colossus was just getting the thing up and running enough to use the gui console to look at the (now sparse) directory tree and to get the thing to accept input in LDIF format. I didn't keep much documentation of that but it should be relatively straightforward even though you'll doubtless encounter some stumbling block along your way. For importing LDIF the manpage for ldapmodify is a helpful one. This page from Sun's documentation has a little more, but I also suggest reading what the O'Reilly book has to say in order to get an understanding of LDIF format itself. This is a script I wrote for converting passwd files into LDIF.

We'll skip the second and as yet undefeated colossus for now and bound on to the third one mortally wounded just this past friday. This site has the bulk of what little documentation I've seen for setting up clients, this page from Sun has a tiny bit more; all other documentation only looks at the server end of things. There's a complication however. From a basic install the SunONE server doesn't understand the DUAConfigProfile object class that ldapclient generates and requires. Other than that you can just use ldapclient genprofile to output a profile, just save the output and import it; see ldapclient's manpage for more info.

In order to load new object classes first you need to get ahold of the schema for them. The schema for DUAConfigProfile is here (in openldap's schema format). Naturally the SunONE server doesn't understand OpenLDAP's schema files. This script takes in an openldap schema file and prints out a version Sun's server can understand. This page from Sun's documentation gives a brief overview of adding new schema in. The big hangup it overlooks is that the files must have an extension of .ldif. If you save the output from that script with a *.schema suffix, the server will ignore it. Other than those two tricks installing new schema is easy: find the schema you need, convert it to Sun's format with the script, save it to the ServerRoot/slapd-serverID/config/schema directory in the 9* range with an *.ldif suffix, and restart the server.

The second colossus is trying to set up autofs so that NFS partitions can be mounted automatically on need across your network. Last time I tried this I ended up hosing the box; I'll prolly hose another on monday when I try it again. Naturally, again, the automounter on linux is different from the one on Solaris. This page gives an overview on how to set things up for linux.

In short, for those who may be unfamiliar with autofs. There's an automounter file called auto.master which maps the directory containing mountpoints to a different kind of map; that second map maps the individual mountpoints to where they come from over nfs. In ldap you have a single automountMap object with a child automount object for every line of the file. Now we have the parent dirs point to ldap queries, each of those queries points to an ou which represents the second kind of map and has automount children for each line of the old file.

On Solaris the file is called auto_master instead (and linux can be configured to use underscore for compatibility) and instead of automountMap and automount, nisMap and nisObject are used instead, although I've heard (though I've mislaid the link) that as of Solaris10 the openldap/redhat objects should be the default. Naturally the schema for the openldap/redhat objects doesn't ship with SunONE, though they can be found in openldap format here with prerequisites here and here. Note that the core and cosine prereqs have some schema already declared by Sun's schemas. My Sun formatted versions with those duplicates commented out are here and here. Also I have some scripts I used for converting the old files and NIS maps here and here.

music of the moment: Dead Boy's Poem ~ Nightwish (Wishmaster)

categories: ldap

16 July 2006 : Eng: On Sanity

So here's another crazy idea I have for Eng. When writing code, particularly at a low level, there's frequently call for doing basic sanity checks. That is, you often see code like this if ($foo == nil) { omg die! } elsif ($foo->bar()) { blah blah; } scattered liberally around everywhere. And especially with method calls when you got your object from some potentially unreliable source you see if ($foo and $foo->bar()) all over. Wouldn't it be nice if we could abstract that away?

I've had a few different ideas in the general area of different semantics of "return values" — for example you have different senses of "return value" when you're talking about the evaluation of a function/method/operator, the error code for failed evaluation, and returning "this" object so you can chain method calls together — but I recently had an idea that might be easier to implement in a similar field: we could do sanity checking with a special postfix operator ?. You would use it like $foo?->bar() and it would have short-circuit semantics so that if $foo passes the sanity check it behaves like a no-op, but if $foo fails sanity checking then it will skip evaluating bar() because trying to evaluate it will probably explode, what with $foo not being sane and all. In case there are different types of unsanity it would also be helpful to have some special variable ($? perhaps) that stores the type of unsanity, or have perhaps some other way to deal with unsanity handling.

The challenge is where exactly to short-circuit to. If we have a statement that's a chain of method or field calls like $foo->bar()->baz->bif()->zorch()->quux; then it makes sense, wherever the ? is, to skip everything from there to the end of the chain/statement. But in other circumstances things get more complicated.

For instance, what do we do if it's in a conditional statement? If it's just a basic conditional statement like if ($foo?) or if ($foo?->bar) then it would make sense to have the whole statement return false. But if it's in a compound conditional statement like if ($foo?->bar() or $zorch->bazzle()) then it would make sense to skip the call to bar(), fail the first subexpression, and so evaluate the second. We could say that in boolean contexts the expression returns false, but that is contingent on what we mean by "expression" here.

Another example of complexity is how it interacts with non-method calls, such as arithmetic expressions. If we have $foo?->numeric() + 5 and $foo isn't sane, then what should the expression return? Well maybe the whole greater expression should fail (i.e. be skipped), that sounds sensible. Now what happens in non-algebraic expressions, like assignment? Should $foo = $bar?->baz() skip the assignment too, or should it set $foo to a known-bad value like nil? In case that question seems too straight forward, compare it against foo($bar?->baz(), $bif); should we call foo() with a known-bad value, or should we short-circuit out of ever calling foo()? Also, since ? is intended to simplify code, we must expect that callers won't necessarily test $? unless they really care about why ? failed.

A brief digression into implementation details. When calling ? we obviously need to pass it the thing we're testing. But at an assembly level, in order to know where to short-circuit to we also need to pass a label to jump back to. If during the course of evaluating sanity we determine the object's not sane, we can't just do a normal return because that wouldn't give us the short-circuit semantics we desire. The easiest way to get our short-circuiting would be to overwrite the memory location that stores the return address with the label we wish to jump to, and then when we return we'll return to somewhere else. In the event that some architectures disallow overwriting the return address, we'll have to manually pop a stack frame and do an unconditional jump, or use an inline function instead of a real function call, or devise some other method. If we allow operator overloading, overloading ? will have to be treated specially since we're not using a normal return.

Back to semantics. So far, I can identify six contexts that to be specified: method chains, boolean expressions, arithmetic expressions, string expressions, function calls, and assignment. And two general approaches: returning nil/false/zero/lambda or skipping evaluation. Since the whole point of this sanity checking operator is to avoid Doing Bad Things(tm) I'm thinking that we should err on the side of skipping evaluation, but there are certain places where just jumping to the next statement is jumping further than we may like. Inside conditional expressions — particularly disjunctions — is the big one, but also boolean expressions used as shorthand for control flow (like Perl's famous open($fh, ">", $file) or die "omg!";). Perhaps when the ? is buried in a boolean context it will skip the rest of evaluation for that whole subexpression and return false to the boolean operator, but in all other situations it just skips to the next statement. That sounds like a sensible enough place to start for Doing The Right Thing.

music of the moment: Maahinen Neito (Earth-Maiden) ~ Värttinä (Iki)

categories: eng, design