A real-world example of refactoring with Java 8 streams

Java 8 introduced lambdas and streams, two tools that seem to offer many ways for improve our code, in particular for what concerns conciseness and readability. There are a lot of samples around, focusing often on over-simplified contexts. In this post I'm describing a small refactoring that I applied to a portion of code in a project I'm developing. With its own limitations, I think it's an interesting real-world case to analyse. In this post, I'm assuming that the reader knows the basic concepts and syntax of some Java 8 features, such as lamdas, Streams and Optional<T>.

First, let's give the relevant context. I have a MediaItem interface, which represents e.g. an audio file, which exposes a Metadata interface:

public interface Metadata
  {
    @Nullable
    public <T> get (@Nonnull Key<T> key);

    @Nonnull
    public Set<Map.Entry<Key<?>, ?>> getEntries();
    …
  }

It's basically a type-safe Map, that there is a set of pre-defined Key items which carry the type information of the value they're associated with. So there's no need to cast the values coming out from the get() method.

Now I have to extract metadata items and store them into a semantic repository. You don't need to know a lot about semantic repositories, but the fact that they contain statements, or triples, in the form subject-predicate-object. Subjects or objects, when they refer to an entity and not a literal value, are basically the identifiers of the related entities (which are URIs in the semantic world). That's all you really need to know.

Indeed, for the scope of this post, you could pretend that the semantic repository is just a plain relational database in which I have to insert some SQL statements. The relevant point is that I'm going to insert them by firing an asynchronous message in a publish-and-subscribe architecture; a subscriber somewhere will be notified of the request and perform the insertion. For the sake of performance, I'm not firing a message for each statement, but I'm grouping them together (this might also have to do with transaction control: perhaps I want to process them as atomic batches). On this purpose, there's an AddStatementRequest message with its fluent Builder. The fluent nature of this object was a small problem with streams that I needed to solve.

Here it is the version of the code at the beginning of the refactoring:

public void importMediaItemEmbeddedMetadata (final @Nonnull MediaItem mediaItem, final @Nonnull URI mediaItemUri)
  {
    log.info("importMediaItemEmbeddedMetadata({}, {})", mediaItem, mediaItemUri);
    
    final MediaItem.Metadata metadata = mediaItem.getMetadata();
    final Integer trackNumber = metadata.get(MediaItem.Metadata.TRACK);
    final Integer sampleRate = metadata.get(MediaItem.Metadata.SAMPLE_RATE);
    final Integer bitRate = metadata.get(MediaItem.Metadata.BIT_RATE);
    final Duration duration = metadata.get(MediaItem.Metadata.DURATION);

    AddStatementsRequest.Builder builder = AddStatementsRequest.build();
    
    if (sampleRate != null)
      {
        builder = builder.with(mediaItemUri, MO.SAMPLE_RATE, literalFor(sampleRate));
      }

    if (bitRate != null)
      {
        builder = builder.with(mediaItemUri, MO.BITS_PER_SAMPLE, literalFor(bitRate));
      }

    if (trackNumber != null)
      {
        builder = builder.with(mediaItemUri, MO.TRACK_NUMBER, literalFor(trackNumber));
      }

    if (duration != null)
      {
        builder = builder.with(mediaItemUri, MO.DURATION, literalFor((float)duration.toMillis()));
      }
    
    messageBus.publish(builder.create());
  }

I think it is quite readable. There are two immediate code smells: the sequence of if statements, and the fact that I'm dealing with nulls.

Let's start from the latter issue. Java 8 offers the Optional<T> class which represents a value that can be present or not. I can refactor Metadata as follows:

public interface Metadata
  {
    @Nunnull
    public <T> Optional<T> get (@Nonnull Key<T> key);

    @Nonnull
    public Set<Map.Entry<Key<?>, ?>> getEntries();
    …
  }

So get() now is guaranteed to return a non null value. The method can be rewritten as follows:

public void importMediaItemEmbeddedMetadata (final @Nonnull MediaItem mediaItem, final @Nonnull URI mediaItemUri)
  {
    log.info("importMediaItemEmbeddedMetadata({}, {})", mediaItem, mediaItemUri);
    
    final MediaItem.Metadata metadata = mediaItem.getMetadata();
    final Optional<Integer> trackNumber = metadata.get(MediaItem.Metadata.TRACK);
    final Optional<Integer> sampleRate = metadata.get(MediaItem.Metadata.SAMPLE_RATE);
    final Optional<Integer> bitRate = metadata.get(MediaItem.Metadata.BIT_RATE);
    final Optional<Duration> duration = metadata.get(MediaItem.Metadata.DURATION);

    AddStatementsRequest.Builder builder = AddStatementsRequest.build();
    
    if (sampleRate.isPresent())
      {
        builder = builder.with(mediaItemUri, MO.SAMPLE_RATE, literalFor(sampleRate.get()));
      }

    if (bitRate.isPresent())
      {
        builder = builder.with(mediaItemUri, MO.BITS_PER_SAMPLE, literalFor(bitRate.get()));
      }

    if (trackNumber.isPresent())
      {
        builder = builder.with(mediaItemUri, MO.TRACK_NUMBER, literalFor(trackNumber.get()));
      }

    if (duration.isPresent())
      {
        builder = builder.with(mediaItemUri, MO.DURATION, literalFor((float)duration.get().toMillis()));
      }
    
    messageBus.publish(builder.create());
  }

Honestly, this is an almost neglectable improvement. Sure, isPresent() is marginally better readable than != null, but there's no much beef. The if statements are still needed, and while there can't be NPEs any longer, I can still get it wrong: if a presence check is wrong, sampleRate.get() will throw NoSuchElementException. So,  I've just exchanged an exception with another. While there's plenty of code samples like this, and flame wars about whether it's good or not, to me it's quite evident that this usage of Optional is basically meaningless.

Optional offers a more functionally oriented method, such as map() to apply a Function. It would eliminate conditionals, but wouldn't solve an issue, described below, with the fact that builder is not final. That's why I did not consider this route.

Now, a fundamental point is that, if you want quality in your code then you need testing. But a compiler of a statically-typed language could help in avoiding some blatant errors and, in the end, save some time. By introducing Optional it has not been the case.

Basically, the null-related issue is secondary. The problem here is conditioned by the major code-smell, which is the sequence of conditional statements. It's also an approach that badly scales: what if I need to add mapping for another dozen of metadata items?

Let's think of a Java 8-oriented refactoring that will bring the code down to 4 lines. Well, I'm lying: the fundamental lines will be 4+4, plus some other stuff. But improving the quality of code doesn't (always) mean literally shrinking lines, rather having better lines.

First, we realize that, rather than having a sequence of special-cases, we need a collection of metadata mappings, and iterating over it. In Java 8 we have lambda expressions that allow code an iteration in a single line. For instance, something such as the following pseudo-code:

AddStatementsRequest.Builder builder = AddStatementsRequest.build();
mapping.stream().forEach(() -> builder = builder.with(… /* do something smart */));
messageBus.publish(builder.create());

Unfortunately, it can't work whatever smart the thing is conceived. The problem is that a lambda is a sort of masked inner class, and the usual rule applies: if it accesses local variables declared outside its scope, they must be final. But a fluent Builder can't be referenced by a final variable (it prevents from writing builder = builder.with(...)). While at first sight one might think: “ok, these are the classic devil's details that jeopardise a could-be elegant approach, and I give up”, the problem is that we're looking at a very limited number of features offered by Java 8. Functional often requires that you re-think some details. For instance, a fluent Builder is a sort of 'collector' of things that are incrementally added. But Java 8 streams offer their own version of a Collector: an object that 'terminates' a stream, collecting all the elements in it into a single object. Perhaps we can replace our collector with their collector, and find that their collector fits better?

I received some feedback about this passage. It was pointed out that the problem is not only related to the limitation about lambdas and final references pointing to immutable objects such as Builder. Indeed from the design point of view a good functional approach should avoid side-effects.

So our code becomes:

Collector<Statement, List<Statement>, AddStatementsRequest> collector = ...;
AddStatementRequest request = mapping.stream(). … .collect(collector);
messageBus.publish(request);

Even better, taking advantage of static methods, in a single line:

messageBus.publish(mapping.stream(). … .collect(toAddStatementsRequest()));

Now, what is the cost of writing a custom Collector? Something like that:

@Nonnull
public static Collector<Statement, List<Statement>, AddStatementsRequest> toAddStatementsRequest()
  {
    return new ArrayListCollectorSupport<Statement, AddStatementsRequest>()
      {
        @Override @Nonnull
        public Function<List<Statement>, AddStatementsRequest> finisher() 
          {
            return statements -> new AddStatementsRequest(statements);
          }
      };  
  }

public abstract class ArrayListCollectorSupport<COLLECTED_TYPE, COLLECTING_TYPE> 
        implements Collector<COLLECTED_TYPE, List<COLLECTED_TYPE>, COLLECTING_TYPE>
  {
    @Override @Nonnull 
    public Supplier<List<COLLECTED_TYPE>> supplier() 
      {
        return ArrayList::new;
      } 

    @Override @Nonnull 
    public BiConsumer<List<COLLECTED_TYPE>, COLLECTED_TYPE> accumulator() 
      {
        return List::add;
      }

    @Override @Nonnull 
    public BinaryOperator<List<COLLECTED_TYPE>> combiner() 
      {
        return (left, right) -> { left.addAll(right); return left; };
      }

    @Override @Nonnull 
    public Set<Characteristics> characteristics() 
      {
        // Not CONCURRENT since ArrayList is not thread-safe
        return Collections.emptySet();
      }
  }

Honestly, we're adding a lot of lines of code. But:

  1. ArrayListCollectorSupport is a very reusable class for many kinds of programmer-defined Collectors. Actually, it was already in my kitchen-sink library of small useful things that I reuse everywhere.
  2. toAddStatementsRequest() is very specific to AddStatementsRequest, but I can reuse it a few times. 

I think most people working with Java 8 know Collectors up to a certain degree, since they offer some facility methods to easily re-pack a Stream into a List, or format it to a String with controllable formatting. But I see that only a few ones understand that they could write their own Collector for their classes, and this kind of usage of the Java 8 APIs can be useful for achieving a real benefit. Sometimes you have complex logics in the collection strategy that don't fit with a simply iterative approach. For instance, within a Stream you can filter() item by item, but you cannot take a decision based on the whole collection of items accumulated so far. In a earlier version of the code related to this post it was possible to override statements with the same subject and predicate, because a certain piece of medatada was first extracted by an audio file and later superseded by a better piece of information retrieved from the web. This was implemented in a custom Collector.

This requirement went away with other refactorings, so now the custom Collector was removed. I'm leaving it in this post just for didactic purposes - let's just recall that in some more complex cases it can help us. Ignoring that the Java 8 Streams API is made not only of Streams might make you give up a refactoring and preclude an interesting way to re-shape your code.

Now, let's focus on the stream. We can start with the whole set of metadata entries:

mediaItem.getMetadata().getEntries().stream()

and then apply a filter for picking only those that I want to store to my semantic repository:

filter(e -> MAPPER.containsKey(e.getKey())

Now the trickiest part. I have to 'translate' a Metadata Key in the corresponding Statement. A Statement, I repeat, is a triple subject-predicate-object. In my case the subject is the id (i.e. the URI) of the mediaItem. This is a variable part for each invocation, so can't be a responsiblity of the MAPPER. The predicate and object are fixed for each metadata key, so they can be a responsibility of the MAPPER. First, we need to introduce a Pair class which encapsulates predicate and object:

@Immutable @RequiredArgsConstructor @ToString
static class Pair
  {
    @Nonnull
    private final URI predicate;
    
    @Nonnull
    private final Value object;
  }

I'm taking advantage of Lombok for saving some plumbing code (constructor and getters). At this point I can define the MAPPER as a Map from Key to a Function transforming an Object (the metadata item value) to a Pair of predicate and object:

private static final Map<Key<?>, Function<Object, Pair>> MAPPER = new HashMap<>();

static
  {
    MAPPER.put(Metadata.TRACK,       v -> new Pair(MO.P_TRACK_NUMBER,    literalFor((int)v)));
    MAPPER.put(Metadata.SAMPLE_RATE, v -> new Pair(MO.P_SAMPLE_RATE,     literalFor((int)v)));
    MAPPER.put(Metadata.BIT_RATE,    v -> new Pair(MO.P_BITS_PER_SAMPLE, literalFor((int)v)));
    MAPPER.put(Metadata.DURATION,    v -> new Pair(MO.P_DURATION,
                                                literalFor((float)((Duration)v).toMillis())));
  }

Note that here we are already losing something. While in the original code, with the “unrolled” conditional statements, we were taking advantage of the type information brought by Key, here we have lost it. For this reason, we are forced to put explicit casts in the parameter of literalFor(). I don't see solutions, and the cause is the limited support of type information, not reified, in Java generics.

The final touch is to add to Pair a method to create a Statement with a given subject:

@Nonnull
public Statement createStatementWithSubject (final @Nonnull URI subject)
  {
    return ...createStatement(subject, predicate, object);
  }

How to use that stuff? By applying a mapping to the stream, in which we first get the relevant Function inside the MAPPER, then apply it to the metadata value, retrieve the Pair and create the Statement:

stream().map(e -> MAPPER.get(e.getKey()).apply(e.getValue()).createStatementWithSubject(mediaItemUri))

Putting all together:

public void importTrackMetadata (final @Nonnull MediaItem mediaItem, final @Nonnull URI mediaItemUri)
  {
    log.debug("importTrackMetadata({}, {})", mediaItem, mediaItemUri);
    messageBus.publish(mediaItem.getMetadata().getEntries().stream()
        .filter(e -> MAPPER.containsKey(e.getKey()))
        .map(e -> MAPPER.get(e.getKey()).apply(e.getValue()).createStatementWithSubject(mediaItemUri))
        .collect(toAddStatementsRequest()));
  }

So, let's recap. The original method had 15 lines (excluding logging, blanks and braces). The refactored method has got 4, but part of the information has been moved to the further 4 lines that populate MAPPER. In the end, we have 8 lines: we saved 7 lines. But we had to introduce Pair (4 lines, and a few one saved by Lombok): 4 further lines. Roughly, the number of lines didn't change. Was it worth the effort?

First, the refactored code scales much better: for each new metadata item to map, we need to add one more line (we can't do better than this!); in the original code we needed to add a whole new if statement. Not only it's made by 2 lines, but it's less readable. In the end, we split the original blob into an imperative part (the one in the method, with the stream iteration), which will stay the same forever. Then we have a purely declarative part, which is the code populating MAPPER. There's not much to test in it, but the correctness of the mapping. So we have split an original blob of two responsibilities into two separate sections, each one with a single responsibility. We can understand each section separately, focusing only on a few lines at a time.

On the other hand, I think that many would argue that they find the new solution more “complex” (whatever the word means...) than the original.

The most controversial line is clearly this:

.map(e -> MAPPER.get(e.getKey()).apply(e.getValue()).createStatementWithSubject(mediaItemUri))

Is there any further room for improvement? It can be refactored as:

.map(e -> MAPPER.forItem(e).createStatementWithSubject(mediaItemUri))

but with a non neglectable cost, as we need to create a subclass:

static class Mapper extends HashMap<Key<?>, Function<Object, Pair>>
  {
    @Nonnull
    public Pair forItem (final @Nonnull Map.Entry<Key<?>, ?> entry)
      {
        return get(entry.getKey()).apply(entry.getValue());
      }
  }

We're not saving any line here, instead we're introducing 4 more lines. And they are not reusable. In the end I applied this refactoring as well, but I'm not sure it's worth while.


What conclusion to draw? This is a specific problem, but a real case one. Undoubtedly, Java 8 was useful for refactoring my code in a way that has better responsibility separation, readability and maintainability. In terms of lines of code, there's not really a real gain, even considering some of the new lines I wrote are at least partially reusable.

So, if we limit ourselves to the mere LoC count, we could be easily displeased. When taking a broader horizon in consideration, the advantage is evident.

There are other advantages, in addition to readability. Streams are parallelizable, which means that if I had a large numbers of items to map, the refactored code would be faster, by taking advantage of multiple cores.

PS. While I was writing this post, my code evolved further. In particular, I found that the semantic schema I was applying was not correct: some metadata items have to be related to a certain entity, while others to a different one. The current code is:

static class Mapper extends HashMap<Key<?>, Function<Object, Pair>>
  {
    @Nonnull
    public Pair forItem (final @Nonnull Map.Entry<Key<?>, ?> entry)
      {
        return get(entry.getKey()).apply(entry.getValue());
      }
  }

private static final Mapper SIGNAL_MAPPER = new Mapper();
private static final Mapper TRACK_MAPPER = new Mapper();

static
  {
    TRACK_MAPPER. put(Metadata.TRACK,       v -> new Pair(MO.P_TRACK_NUMBER,    literalFor((int)v)));

    SIGNAL_MAPPER.put(Metadata.SAMPLE_RATE, v -> new Pair(MO.P_SAMPLE_RATE,     literalFor((int)v)));
    SIGNAL_MAPPER.put(Metadata.BIT_RATE,    v -> new Pair(MO.P_BITS_PER_SAMPLE, literalFor((int)v)));
    SIGNAL_MAPPER.put(Metadata.DURATION,    v -> new Pair(MO.P_DURATION,
                                                       literalFor((float)((Duration)v).toMillis())));
  }

public void importAudioFileMetadata (final @Nonnull MediaItem mediaItem, 
                                     final @Nonnull URI signalUri,
                                     final @Nonnull URI trackUri)
  {
    log.debug("importAudioFileMetadata({}, {}, {})", mediaItem, signalUri, trackUri);
    final Set<Map.Entry<Key<?>, ?> entries = mediaItem.getMetadata().getEntries();
    messageBus.publish(entries.stream()
                              .filter(e -> SIGNAL_MAPPER.containsKey(e.getKey()))
                              .map(e -> SIGNAL_MAPPER.forItem(e).createStatementWithSubject(signalUri))
                              .collect(toAddStatementsRequest()));
    messageBus.publish(entries.stream()
                              .filter(e -> TRACK_MAPPER.containsKey(e.getKey()))
                              .map(e -> TRACK_MAPPER.forItem(e).createStatementWithSubject(trackUri))
                              .collect(toAddStatementsRequest()));
  }

The readability improvement due do the introduction of MAPPER is now even more evident: the original sequence of conditional statements would easily induce errors in mis-matching the metadata items with the URIs of the two entities, while now we have two MAPPERs with their own configuration.

I have to further stress the point, anyway, that the final warranty of quality can't be given by syntax, but I need testing. Actually I have an integration test running over a few thousands audio files of my personal collection that makes sure data is properly translated; I couldn't have applied this radical refactoring without being supported by this test that assured me I was not introducing regressions at each step.

The better syntax, however, might help in making fewer errors and spending less time to run tests, detect problems and fix them.

comments powered by Disqus