Posts Tagged ‘Haskell’
Functional (Meta)?Programming Stunts for Ruby and Groovy (and a Little Perl) June 24, 2008 | 11:04 am

After I learned OCaml, my coding mindset was totally distorted. I started writing Java code that looked like this:

  public Collection<Foo> getCertainFoos() {
    return
      CollectionUtils.select(getFoos(), new Predicate() { 
         public boolean evaluate(Object o) {
            return SOME_CONSTANT.equals(((Foo)o).getProperty());
         }
      });
  }

This is kinda ugly in Java, but it's simply what came out when I was thinking this in OCaml:

List.find_all (fun i -> SOME_CONSTANT = i#getProperty()) #getFoos()

I also started slapping final everywhere — see Yet Another Reason final Is Your Friend. A ubiquitous use of final actually gave some nice patterns (in the "macro" sense of patterns), but raised all kinds of eyebrows and made my code unmistakable. This led to a unique coding style which you can see in my most involved open source project, JConch. Meanwhile, my co-blogger was talking about "The Hole in the Middle" Pattern, which is also a lay-up within FP circles but required some backflips to implement in Java (functional interfaces) and C# (delegates).

It wasn’t until the advent of Ruby and Groovy, though, that functional programming skills really became easier to use. Basically, because of the inline closure’s succinct syntax (and ability to access non-final variables), I could suddenly do all kinds of really fun stuff. This “fun stuff” was exactly the kind of stunts I was pulling in Perl back in the day (see the BEGIN block in my Text::Shift code for a reasonably accessible example), and it was part of the reason I loved Perl so much at the time.

So, I thought I’d share some more of these cute stunts with you.
Read the rest of this entry

How Ocaml Can Be Improved May 7, 2008 | 06:49 pm

UPDATE: At the request of damned near everyone, I’ve changed the title of this post to be both less inflammatory and more accurate.

One of the ways to not fall into the blub fallacy is to regularly consider those ways in which your favorite language is inferior, and could be improved- preferably radically improved. Now, it should come as a surprise to no one that my favorite language is (currently) Ocaml. So as an intellectual exercise I want to list at least some of the ways that Ocaml falls short as a language.

I will note that if you use this post as a reason to not use Ocaml, you are a fool and are missing the point of this post. With the possible exception of Haskell (Edit: And Jocaml), no other language I know of comes close to getting all the things right that Ocaml gets right.

So, with that in mind, let’s begin.

Read the rest of this entry

7 Actually Useful Things You Didn’t Know Static Typing Could Do: An Introduction for the Dynamic Language Enthusiast April 14, 2008 | 06:52 am

Introduction

One of the things that has consistently been difficult in the whole dynamic typing/static typing conversation is that people don't seem to understand what a real statically typed language can do: here's a classic example (and someone else who was also annoyed). The dynamic vs. static typing conversation usually boils down to Java's type system vs. Ruby's type system, which simply isn't fair. So, in the spirit of advancing discourse and helping people understand why I enjoy Ocaml so much, let me present…

7 Actually Useful Things You Didn’t Know Static Typing Could Do

Read the rest of this entry

For Those of You Holding Your Breath: Adlib I/O Update October 15, 2007 | 06:46 am

Going to be checking in the first swing at Adlib I/O code to SourceForge’s CVS (yes, CVS. Ugh.) tomorrow between work and my lodge meeting.

So far, I’ve just got a couple of input streams implemented (FileInStream, ByteListInStream). The library is geared towards extensibility, though, which means you can implement a whole new stream in a dozen lines of code or so.

Projects Update: Gem Work and AdLib I/O October 11, 2007 | 05:59 am

Anyone out there have experience getting a Gem plugged into RubyForge’s system? I’m having a bit of trouble, and I’m looking for some help: I’m offering compensation to the amount of my eternal gratitude and a beer. Drop me a line via Jabber at “[email protected]” or leave a comment here with your actual e-mail address in the e-mail field.

Also, I’m implementing Java-style streams in Ocaml as part of the AdLib project that Brian and I are working on. Stay tuned for more information: I should have some code soon.

Thoughts on Parallelism February 8, 2007 | 07:50 pm

I was going to write a blog post flaming this blog post, but I've changed my mind. The only real sin he's committed, in my opinion, is making up his mind before all the facts are in.

But I don't think he's right. Not because message passing is bad or is going to go away- it's not. It's that this isn't an either/or choice. The more I think about it, the more I think the future is going to be a combination of STM and message passing. Both have their advantages and disadvantages.

The advantage of message passing is that it's distributable. You don't care if the other process you're sending a message to is on the same CPU or the other side of the planet; it doesn't matter. The disadvantage of message passing is that it requires at least one copy of the data, and often multiple copies, plus task switching and cache flushing. It is heavyweight, at least compared to STM and shared-memory multithreading.

This is a non-trivial problem. Parallelism comes in a gradient, based on how much data needs to be transmitted between the processing elements per time period. On one extreme end of the gradient you have very little data that needs to be distributed. The gold standard here is the kind of stuff distributed.net takes on- brute-force crypto key searches and the like. We're talking about a few hundred bytes every couple of days. This makes it trivial to distribute the computational load over widely dispersed machines.

Note that this also explains why supercomputers died in the '90s, but big iron database servers didn't. Supercomputers were, at heart, about solving either large numbers of small linear algebra problems, or a small number of large linear algebra problems. In either case, the problems were easily enough distributed, with small enough communication demands- a few megabytes per second- that supercomputers could be replaced by racks of PCs communicating via ethernet. Databases, however, require more communication between nodes (especially if you want to parallelize a single query, or allow multiple queries to access the same table simultaneously), and as such didn't distribute well over ethernet. So Sun and IBM are still profitable, while Cray and SGI are dead.

Distributed databases are possible, for at least some workloads, I'll note (before half a dozen people do). The problem is that only certain types of workloads distribute- write a query that doesn't distribute, and all your work has to be done on a single box. Which is why they haven't caught on- and Sun and IBM are still in business.

The problem is that everything needs to be parallelized, but being distributed isn't quite as big an issue. What's driving the sudden obsession with parallelization is the sudden appearance of multicore chips. But notice something about the multicore chips- how close the CPUs are to each other. Often they even share cache. If two processes can share a data structure by just sharing a pointer to it, communication speeds are nearly infinite. This allows for multithreaded programs that have incredibly huge communication requirements. Tightly coupled programs are possible in this environment.

Tightly coupled parallelism is most likely to arise with some form of automatic parallelism. I think the fullest power of STM will be found in a Haskell-like purely functional language (with STM built into the monads automatically, meaning you can't accidentally forget the STM) combined with cilk-style microthreading. Lazy evaluation and parallelism are very similar concepts- both are dealing with asynchronous computation. Lazy evaluation moves the computation to the first time the value is needed, while with parallel evaluation the evaluation happens (conceptually, at least) in parallel. But this level of sharing requires very tight binding and very high communication rates- I'm not doing a lot of computation per byte of communication. This is even more the case if the different CPUs are, in fact, running on an SMT CPU (aka Hyperthreading by Intel).

I actually think that a lot of problems will be amenable to a combination of both approaches. Do message passing, a la MPI or Erlang, at the top level, breaking the problem up into big chunks which can then be distributed over a local network or a highly non-uniform ccNUMA architecture. Each chunk is then solved with a microthread+STM approach, allowing for good utilization of local resources and tightly coupled parallel processes.

As a side note, the “emergent complexity” he worries about having with STM is also a problem with message passing. Emergent complexity is always a problem- and the fundamental simplicity of the individual components doesn’t help (much). And, unlike STM, message passing does admit the possibility of a deadlock- thread A waiting for a message from thread B which is waiting for a message from thread A. It’s much less likely than in the classic (and unarguably awful) threads+locks paradigm, but it’s still possible. The only advantage message passing has over STM in the emergent complexity department is that it encourages bigger chunks per thread (due to the relatively high cost of communication)- fewer interacting components means less emergent complexity.

But fewer interacting components also means less parallelism, and less performance on future hardware. The new Moore's law is that the number of cores will double every couple of years now. However parallel your program is today, it'll need to be twice as parallel in a couple of years- and dozens to thousands of times more parallel in a decade or so! Actually, the situation is even worse than this. Once multithreaded programming becomes common, the new performance metric will be performance per transistor. If I can trade off 10% performance and fit the CPU in 1/5th the die size on the same process, I can fit 5 cores where 1 used to fit, and the CPU is 4.5x faster for programs that can take advantage of 5 cores.

Of course, 5 cores doesn't mean I get 5x the memory bandwidth. It actually becomes more important in this case to be tightly bound, so gaggles of threads can share cache efficiently. Five threads all talking to memory and not to each other would die on performance. On the other hand, this also means that moderately sized machines will start being more non-uniform in their memory architecture. Talking to the thread on the next core over is very cheap- but talking to that thread microseconds away is painful and slow. The former is a job for STM, the latter for message passing.

That said, I could be wrong. I know what damned well doesn't work, but I don't know what does work yet. One good thing does arise out of this debate, I'll note. Whether it's Erlang and message passing, Haskell and STM, or some hybrid, it looks like the solution is going to be functional programming plus something else. What really drove the adoption of OO programming was GUIs. Note that Doug Engelbart was one of the inventors of OO programming at the same time he was inventing the modern GUI interface. Heck, the first Smalltalk dates from 1972. But it wasn't until GUIs hit the mainstream in the mid/late '80s that OO programming also took off. In a similar way, I think parallelism is going to drive the adoption of functional programming. Which will be for the betterment of programming in general.

Java’s Failure to be Lazy January 2, 2007 | 11:10 am

Now that BHurt has done a nice job talking about why Laziness is cool, I’ve got a bit of an ax to grind.

The problem I have is that Java all but completely fails to be lazy, which is a shame considering how useful the trick is.

The Sun team themselves seem to be of the opinion that functional programming generally isn't something one should be doing. The core library has a complete lack of functional support, so those who want to use functional coding methods are relegated to defining lots of interfaces and inner classes, which really clutters up the API and looks ugly. C# did a somewhat nicer job by basically stealing C++'s "function pointer" concept (delegates), and I'd like to see Java implement something similar eventually.

The bigger issue is that even in the Jakarta Commons framework, there is no implementation as Brian described it. What they have as a "LazyList" in Jakarta Commons-Collections isn't terribly useful: instead of generating a "next object" like the theoretical lazy list does, it generates an object based on no input (not even the index of the array!). There isn't a way to wrap a List in a series of lazily-evaluated (get-time) transforms: the TransformedList is all put-time. There is no "next-function-based" Iterator. The two major useful pieces of lazy functionality they do have are the TransformIterator (which does enable you to chain transforms at get-time) and the LazyMap (which enables transformation-based lazy generation).

Now, there is work that can be done to extend the Commons-Collections to implement that kind of functionality, but I haven’t even begun to test the waters to see if it would be something the core development team would be willing to accept. I’m a bit busy devoting my free time elsewhere, so if someone else would pick up this ball and run with it, I’d love to see it in. Otherwise, it’s something I’ll get around to whenever I get back to spending my limited free time on Java work.

Sorting Lazy Lists January 1, 2007 | 03:31 pm

Update: I fixed the formatting. Also, a half-completed version of this article got posted to the front page when it wasn't supposed to be; it has been deleted as well.

Now for some real meat. I've come up with a neat algorithm for sorting lazy lists. I'm not sure if it's original or not, but it's a neat enough trick that I thought I'd pass it on. It has a couple of interesting properties: forcing the first element of the sorted list costs only O(N) time/space complexity, and forcing each additional element costs only O(log N). Which means that if you force the entire list, it's your standard O(N log N) sort, but if you only need the first k+1 elements of the sorted list, it's O(N + k log N), which (for small k) is significantly less than doing a full sort. This is important in a lazy environment where we may not need the entire list sorted.

I’m assuming you’re familiar with Ocaml and have read my introduction to lazy lists earlier today.
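For the impatient, here is a rough sketch (mine, not necessarily the algorithm in the full article) of one way to get those bounds in Ocaml: pour the list into a pairing heap, which costs O(N), and then let each forced element of the result pay for one delete-min, which is O(log N) amortized. The lazy_list type and the function names here are purely illustrative.

(* A sketch of a lazy sort with the stated costs, using a pairing heap. *)
type 'a lazy_list = Nil | Cons of 'a * 'a lazy_list Lazy.t

type 'a heap = Empty | Node of 'a * 'a heap list

let meld h1 h2 =
  match h1, h2 with
  | Empty, h | h, Empty -> h
  | Node (x, xs), Node (y, ys) ->
      if x <= y then Node (x, h2 :: xs) else Node (y, h1 :: ys)

let insert x h = meld (Node (x, [])) h

let rec meld_pairs = function
  | [] -> Empty
  | [ h ] -> h
  | h1 :: h2 :: rest -> meld (meld h1 h2) (meld_pairs rest)

(* Nothing happens until the head of the result is forced: that first force
   builds the heap (O(N)); every later element costs one delete-min. *)
let lazy_sort lst =
  let rec build h = function
    | Nil -> h
    | Cons (x, tl) -> build (insert x h) (Lazy.force tl)
  in
  let rec drain = function
    | Empty -> Nil
    | Node (x, hs) -> Cons (x, lazy (drain (meld_pairs hs)))
  in
  lazy (drain (build Empty lst))

Forcing the whole result is still O(N log N), but forcing only the first k+1 elements is the O(N + k log N) described above.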

Read the rest of this entry

Ocaml Lazy Lists- An Introduction January 1, 2007 | 02:02 pm

[Edit: I fixed the links on this article. I am never changing the link format again. Ever. Blargh. ~~ The Mgmt.]

Solidifying this blog's iron-fisted hold on being the leading provider of information for Ocaml Lazy Lists (hope you weren't drinking there), I thought I'd add in some old-fashioned meanderings on the subject. There's some new information in a later post for those already familiar with the subject, but I think I'll start with an introduction. Some passing knowledge of Ocaml is assumed, however. Lazy lists are, I believe, a powerful and useful concept. Unfortunately, they are virtually unknown outside of the functional programming community, and, I think, underappreciated even within it. Their use and power deserve to be popularized.

Read the rest of this entry

The Functional-Relational Impedance Match June 17, 2006 | 09:52 am

One of the big topics of discussion recently among developers has been the Object-Relational Impedance Mismatch. Or maybe not- OO developers may have just decided that it's a permanent fact of life, like the primacy of the church of Rome, the naval power of Spain, or smallpox. Something you live with and survive (or not, as the case may be), but not anything you can really do anything about. Until, of course, some fool does go and do something about it.

Generally, how that fool (and he's obviously a fool for trying to change the unchangeable) changes the unchangeable is by first changing the paradigm- changing how the problem is thought about. When the plague was God's punishment for the wickedness of the earth, there wasn't really much of anything you could do about it (except try not to torque off God quite so much). But once the plague was understood to be caused by bacteria, killing off the bacteria and not the host (and thus curing the plague) becomes a possibility. The important point here is that a change in paradigm can change an impossibility into a possibility.

Those who know me know where I’m going with this. I’ve been spending the last few months working with SQL databases and Ocaml. The true mark of a genius is the ability to create new paradigms, explicitly to turn a given impossibility into a possibility- I claim no such great intellect here. Instead, I have my new paradigm already created by greater minds than I- the functional paradigm. What follows is just my first attempt to view the problem of interfacing a "normal" programming language (for loose enough definitions of normal that include Ocaml) to a relational/SQL database. What follows should not be considered the last word on this subject, but only the first word.

As the title of the article should suggest, I think that the new paradigm clears up and cleans up a number of deep problems that the classic Object Oriented approaches to databases struggle with. Along the way I hope to shed new light on the mistakes made by most (all, to date, to my knowledge) Object Oriented database libraries- the mistakes that give rise to what is known as the Object-Relational Impedance Mismatch.

But first, some review. Specifically, some review of Relational theory- trust me, this is important. The core of relational calculus operates on relations, aka tables. There are three main operators, all of which work on relations/tables-

  • join, which takes two relations and creates a third relation which consists of all possible pairings of the rows of the two original relations,
  • selection, which takes a relation and creates a new relation with only a subset of rows of the original relation, and
  • projection, which takes a relation and creates a new relation with a subset of the columns of the old relation

These three operations are the core, the heart and soul, of relational calculus- which is the core of SQL. To faithfully model SQL we must, on some level, faithfully model the relational calculus. And this is where I think the Object Oriented programmers go astray in trying to interface to SQL. In their hurry to make things into objects, they immediately (and without fail) declare the rows to be objects- and thus miss the fact that relational calculus and thus SQL is about relations, not rows. They’re abstracting at the wrong level.

The allure of making rows objects can’t be denied- especially as it encourages the programmer to think of the database as a persistent object store- a magical storage space where objects don’t go away when the program dies. The problem is that this is something that SQL is manifestly not. And this fundamental misconception leads to a number of symptoms. For example- when are changed objects written back to the database? If the application is too eager in storing objects back into the database after changing them, performance suffers as the number of slow SQL queries skyrockets. Don’t be eager enough in writing them back, and a sudden crash of the program could lose too much data.

Another problem that the database-as-a-persistent-object-store incurs is the inheritance problem. All rows may well be objects- but not all objects are rows. I am human, Socrates is human, therefore I am Socrates. One important restriction the relational calculus and SQL impose is that all rows have exactly the same structure- the same elements. This flies directly in the face of a fundamental assumption of object oriented programming- that two objects may be of the same type even if they have different internal representations. So, for example, lines and points may have different internal representations- different numbers, and even types, of internal members. But in OO programming, they can quite happily cohabitate in a list of Drawables. However, SQL strictly forbids different rows of a table from having different structures, from having different numbers or types of members, and discourages those members from even having different meanings from one row to the next.

But the even bigger problem with this approach to databases is one I've already alluded to- it abstracts at the wrong level. By concentrating on the row level and demoting relations to second-class citizens, it totally misses the power and point of relational calculus. Again, there is a cause for this failure- relations are data structures, and the relational operations are operations on data structures as a whole. In the terms used by functional programmers, relations are composable data structures. While not something disallowed by Object Oriented programming, creating and working with composable data structures is not something it encourages. OK, set implementations generally have some composable operations (union, intersection, difference)- but most data structure operations are accessors- insertions, deletions, and retrievals of various flavors. Operating on whole data structures is not generally done.

And this is the first major advantage approaching the problem in the functional paradigm brings- because operating on whole data structures is natural and intuitive to a functional programmer. Operating on whole data structures started as a necessary optimization. Retrieving all elements from a tree one at a time, altering them, and inserting them into another tree is O(N log N). But I can generally do the same operation in O(N) by operating on the whole tree, giving orders of magnitude speed up. But even though this paradigm shift was originally born of necessity, it is now bearing fruit in a different realm. The natural way to model SQL in a functional language is to directly model the relational calculus, modelling relations/tables as abstract composable data structures.
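To make the whole-structure point concrete, here is a tiny illustrative sketch (mine) in Ocaml: a map over a binary tree visits each node exactly once, O(N), instead of paying an O(log N) re-insertion cost for every element. (For a search tree this assumes the function preserves the ordering of the keys.)

(* Whole-structure operation: one O(N) pass, no per-element re-insertion. *)
type 'a tree = Leaf | Branch of 'a tree * 'a * 'a tree

let rec tree_map f = function
  | Leaf -> Leaf
  | Branch (l, x, r) -> Branch (tree_map f l, f x, tree_map f r)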

Let me be somewhat more specific here. At a 10,000 foot level, the SQL interface would have some abstract data type representing a relation, call it a relation_t. There would be a join function with type relation_t -> relation_t -> relation_t, there would be a selection function of type relation_t -> sql_selection_t -> relation_t, and there would be a projection function relation_t -> column_selection_t -> relation_t. The definition of the types sql_selection_t and column_selection_t obviously need to be defined (I’m still working on that).

These functions wouldn't actually fire anything off to the database; they'd just build up local data structures gathering the information that eventually gets turned into the appropriate SQL query. Again, I'd like to emphasize that there is nothing here that couldn't be done in an object oriented fashion- it just isn't done this way.
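As a minimal sketch of what that 10,000 foot interface might look like as an Ocaml signature- the table function is a placeholder of mine, and the two selection types are deliberately left abstract since their definitions are still open:

(* Nothing in this signature touches the database; each function only
   builds up a description of the eventual query. *)
module type RELATION = sig
  type relation_t          (* an un-executed relational expression *)
  type sql_selection_t     (* a row predicate - representation still open *)
  type column_selection_t  (* a set of columns to keep - also still open *)

  (* hypothetical: some way to name a base table as a starting relation *)
  val table : string -> relation_t

  (* all possible pairings of the rows of the two relations *)
  val join : relation_t -> relation_t -> relation_t

  (* keep only the rows satisfying the selection *)
  val selection : relation_t -> sql_selection_t -> relation_t

  (* keep only a subset of the columns *)
  val projection : relation_t -> column_selection_t -> relation_t
end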

How does one actually get elements (rows) out of these abstract data structures? Here again, the natural inclinations of a functional programmer give a radically different answer. In this situation, the functional programmer will reach for a lazily evaluated list.

Some introduction for those who aren't familiar with lazy evaluation is in order. In Ocaml, a lazily evaluated expression is introduced with the lazy keyword. What follows the lazy keyword is an expression- which isn't evaluated at that point, but instead is saved, creating a lazy value, or thunk. The first time the thunk is forced (by calling Lazy.force on it), the expression is evaluated and the result is cached inside the thunk- so on the second and all subsequent forcings the cached value is returned.

An example makes this more obvious:

# let f () =
        print_string "Enter a number: ";
        flush stdout;
        int_of_string (read_line ())
  ;;
val f : unit -> int = <fun>
# f ();;
Enter a number: 3
- : int = 3
# let z = lazy (f ());;
val z : int lazy_t = <lazy>
# Lazy.force z;;
Enter a number: 3
- : int = 3
# Lazy.force z;;
- : int = 3
# Lazy.force z;;
- : int = 3
#

Notice how the prompt to enter a number is not issued when the lazy thunk is created (on the let z = ... line), nor on the second or later times when it’s forced- it’s only on the first time it’s forced that the expression is evaluated, the function is called, and the user is prompted to enter a value. Lazy evaluation is a form of mutable data that "plays nice with" purely functional programming. A lazy list, then, is a singly linked list where the next element of a node is not a reference, but instead a lazy thunk that can be forced to get the next element.

Lazy lists are in many ways not unlike the iterators and enumerators of classic OO programming, but with one critical difference- many operations on the lazy list are applied lazily. Specifically, maps and filters can be applied lazily. A map just takes a conversion function that converts from type a to type b, and converts a lazy list (or other data structure) of type a into a lazy list of type b. But rather than going through and changing all the elements immediately, the map function is just folded into the lazy thunk- forcing an element of the list of type b first forces the next element of the list of type a, and then applies the mapping function to it.
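To make that concrete, here is a minimal sketch (names mine, not any particular library's) of a lazy list with get-time map and filter:

type 'a lazy_list = Nil | Cons of 'a * 'a lazy_list Lazy.t

(* The conversion function is folded into the thunk: no element of the
   source list is converted until the corresponding element is forced. *)
let rec map f = function
  | Nil -> Nil
  | Cons (x, tl) -> Cons (f x, lazy (map f (Lazy.force tl)))

(* filter works the same way, skipping rejected elements as they are
   demanded rather than up front. *)
let rec filter p = function
  | Nil -> Nil
  | Cons (x, tl) ->
      if p x then Cons (x, lazy (filter p (Lazy.force tl)))
      else filter p (Lazy.force tl)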

This is a critical difference because it moves when the computation happens. So we have two functions- one function that converts a relation_t into a lazy list of elements, which actually issues the select statement to the database, and a function which inserts a lazy list of elements into a relation_t, which performs an insert or update. So we start with a base table or two, perform various joins, selects, and projections on them to produce our source and destination relations. We convert the source relation to a lazy list. We then perform several maps and filters on the lazy list, and insert the result into our destination relation. Now the library goes to work- pulling items from the select, passing them through the various maps and filters, and inserting them into the destination table.
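Sketched as a signature (again, hypothetical names of mine, reusing the lazy_list type from the sketch above, with row_t standing in for whatever a row ends up looking like):

module type RELATION_IO = sig
  type relation_t
  type row_t

  (* issues the SELECT, pulling rows from the database only as the list
     is forced *)
  val to_lazy_list : relation_t -> row_t lazy_list

  (* consumes a lazy list, performing the inserts/updates as it pulls *)
  val insert_rows : relation_t -> row_t lazy_list -> unit
end

The pipeline then reads roughly to_lazy_list source |> filter keep |> map fix_up |> insert_rows destination: no rows move until insert_rows starts pulling on the list.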

It is the sign of a good abstraction layer that new forms of optimization can be added easily and automatically. And that's what we're seeing here. Many databases- including PostgreSQL- have the idea of a cursor, which allows us to bring the results of a query back in small chunks, rather than all at once. This is the natural thing to use when converting a select into a lazy list- only pull in elements as they're needed. On the insert side, PostgreSQL lets you do a COPY, which is basically a form of insert where multiple elements are inserted at once, and which has much higher performance than individual inserts. If you can batch multiple inserts into a single COPY, you get much better performance. Note that all of this can go on without the intervention of the user of the library- the optimization happens under the covers.
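A sketch (mine) of what that batching might look like, reusing the lazy_list type from above- chop the lazy list into chunks of, say, a thousand rows, and hand each chunk to a single COPY:

(* take n elements (or fewer, at the end) off the front of a lazy list,
   returning the chunk and the remaining list *)
let rec take n lst =
  match n, lst with
  | 0, rest | _, (Nil as rest) -> [], rest
  | n, Cons (x, tl) ->
      let chunk, rest = take (n - 1) (Lazy.force tl) in
      (x :: chunk, rest)

(* feed each chunk to f - e.g. a hypothetical copy_into destination - so one
   COPY replaces a thousand individual INSERTs *)
let rec iter_chunks chunk_size f lst =
  match take chunk_size lst with
  | [], _ -> ()
  | chunk, rest -> f chunk; iter_chunks chunk_size f rest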

Another advantage functional programming confers on interacting with a database is immutable data structures- the elements of an instantiated relation/table are immutable, and thus the question of what happens when you change a row of a table is completely avoided: you can't change a row except by explicitly inserting a new one.

Again, there is nothing here that can't be done in an object oriented language- but functional programming encourages it to a greater extent. Like composable data structures, lazy evaluation- especially lazy evaluation of data structures- was originally forced on functional programming by its own self-imposed limitations. But the paradigm is more broadly applicable than just overcoming those limitations. It is truly a different paradigm- in that it makes the impossible, or at least the excessively difficult, easy, or at least easier. Several things still need to be worked out- for example, the form of the sql_selection_t and column_selection_t types, and what a row data structure (the member of the lazy lists casually tossed around above) looks like. But I think the broad strokes I have laid down here are fundamentally correct, and will lead to an interface layer that looks less like Hibernate, and more like SQL.