Java Serialization is Fun

Serialization is the process of taking an in-memory representation of data and transforming it to a representation suitable for sending to another location.

Deserialization is the reverse of that process. Code takes a structured representation of data from some location and transforms it to a representation in-memory.

Every programming language has a myriad of approaches for performing these tasks. These approaches vary greatly depending on the semantics of the language, the semantics of the output format, and the culture surrounding both.

What sets Java's serialization mechanism apart is that the semantics of the language map extremely closely to those of the output format.

To fully appreciate the implications of this, allow me to take you on a bit of a tour of some other data formats.

CSV

Data is written one line at a time, with each value in a "row" separated by commas.

By convention, the very first row is sometimes used to store a "label" of what each "column" means.

While labels can add contextual information, the actual "data model" that is directly encoded here is just rows of strings. Interpretation of these rows is dependent on a combination of convention and "out of band" information.
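
To make "rows of strings" concrete, here is a minimal sketch of reading CSV text. The CsvRows class and its naive comma split are assumptions made for this example; real CSV also needs handling for quoted fields, which this ignores.

```java
import java.util.List;

class CsvRows {
    // Naive parse: assumes no quoted fields and no commas or newlines inside values.
    static List<List<String>> parse(String text) {
        return text.lines()
                // A limit of -1 keeps trailing empty values instead of dropping them.
                .map(line -> List.of(line.split(",", -1)))
                .toList();
    }

    public static void main(String[] args) {
        List<List<String>> rows = parse("name,cats\nbob,3");
        // The header is just another row of strings; only convention marks it as labels.
        System.out.println(rows.get(0)); // [name, cats]
        System.out.println(rows.get(1)); // [bob, 3]
    }
}
```

Nothing in the parsed result says that "3" is a number or that the first row is special. That knowledge lives outside the file.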

CSV is popular in quite a few domains. It's easy to import and export to Spreadsheets, write out from sensors on an Arduino, and feed into Machine Learning libraries.

But its data model is not close to how most programs represent data. Going from a representation in memory to CSV is almost always a "lossy" process. Going from CSV back to that same representation in memory requires knowledge about how to interpret the order of elements in a row, what each element means, etc.

import java.time.LocalDate;
import java.util.List;

record Person(
        // have to assume that the first element is the name
        String name,
        // have to assume that the second element is this
        int numberOfCats,
        // How should a boolean be encoded?
        boolean taxFraud,
        // What format is the date in?
        // What is done when no value is known?
        LocalDate upcomingCourtDate
) {
    static Person fromCsvRow(List<String> row) {
        // Code here could be autogenerated if you assume
        // conventions, but it probably won't be
         
        if (row.size() != 4) { ... }
         
        String name = row.get(0);
        
        int numberOfCats;
        try {
            numberOfCats = Integer.parseInt(row.get(1));
        } catch (NumberFormatException __) {
             ...
        }
         
        // ... and so on ...
         
        return new Person(name, numberOfCats, ...);
    }
    
    List<String> toCsvRow() {
        // ...
        return List.of(this.name, ...);
    }
}
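
For comparison, here is one way the elided parts could be filled in. This is only a sketch: CsvPerson is a hypothetical stand-in for the record above, and every convention in it (booleans as "true"/"false", ISO-8601 dates, an empty string for an unknown date) is an assumption, not something CSV prescribes.

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical stand-in for Person with one set of assumed conventions.
record CsvPerson(String name, int numberOfCats, boolean taxFraud, LocalDate upcomingCourtDate) {

    static CsvPerson fromCsvRow(List<String> row) {
        if (row.size() != 4) {
            throw new IllegalArgumentException("expected 4 columns, got " + row.size());
        }
        // Throws NumberFormatException (an IllegalArgumentException) on bad input.
        int numberOfCats = Integer.parseInt(row.get(1));
        // Assumed convention: "true" (ignoring case) means true, anything else false.
        boolean taxFraud = Boolean.parseBoolean(row.get(2));
        // Assumed convention: ISO-8601 date, empty string when no date is known.
        LocalDate courtDate = row.get(3).isEmpty() ? null : LocalDate.parse(row.get(3));
        return new CsvPerson(row.get(0), numberOfCats, taxFraud, courtDate);
    }

    List<String> toCsvRow() {
        return List.of(
                name,
                Integer.toString(numberOfCats),
                Boolean.toString(taxFraud),
                upcomingCourtDate == null ? "" : upcomingCourtDate.toString());
    }
}
```

A different program could pick "Y"/"N" for booleans or a different date format and be just as valid, which is exactly the "out of band" problem.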

JSON

"JavaScript Object Notation" is a format derived from the syntax of declaring object literals in JavaScript.

{ 
     "stockName": "IDK",
     "stockPrice": "100USD",
     "twitterComments": [
          {
               "retweets": 10,
               "text": "..."
          },
          {
               "retweets": 20,
               "text": "..."
          }
     ]
}

Compared to CSV it is way more expressive. Instead of just rows of strings the data model includes dedicated representations for booleans, numbers, lists, and more.

This makes it somewhat of a "lowest common denominator" data format. Most modern languages have support for these data types and the structure can represent nested data much more ergonomically than "flat" formats like CSV.

The translation from a model in memory to JSON is still "lossy" in quite a few common cases though.

record Recruiter(
        // Often enums will be translated to Strings
        TellsYouTheSalary tellsYouTheSalary,
        // Times might be put into an ISO-8601 format String
        // or a Unix Time integer
        Instant postedFirstCringeStatus,
        // Sets aren't representable, so often
        // they will be encoded as lists
        Set<ReservationsAtDorsier> reservations,
        // Multiple possibilities with overlapping fields need a
        // convention for representing which is present
        LovedOne lovedOne
) {}

enum TellsYouTheSalary {
     UP_FRONT,
     IF_YOU_ASK,
     NEVER
}

sealed interface LovedOne {}
record Cat(String name) implements LovedOne {}
record Dog(String name) implements LovedOne {}
record NoOne() implements LovedOne {}

// Both of these would be valid representations 
// depending on your conventions
//
// { "tellsYouTheSalary": "UP_FRONT",
//   "postedFirstCringeStatus": 1234,
//   "reservations": [],
//   "lovedOne": {"type": "cat", "name": "fred" } }
//
// { "tells_you_the_salary": "up_front",
//   "posted_first_cringe_status": "2020-07-10 15:00:00.000",
//   "reservations": {"kind": "set", "contents": []},
//   "loved_one": {"kind": "cat", "name": "fred"} }
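
The same ambiguity shows up as soon as you write the encoding code. The sketch below hand-builds JSON fragments without a library so the choices are visible; JsonConventions and its method names are made up for this example.

```java
import java.time.Instant;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class JsonConventions {
    // Convention A: seconds since the Unix epoch, written as a JSON number.
    static String instantAsEpochSeconds(Instant instant) {
        return Long.toString(instant.getEpochSecond());
    }

    // Convention B: ISO-8601 text, written as a JSON string.
    static String instantAsIsoString(Instant instant) {
        return "\"" + instant + "\"";
    }

    // Sets have no JSON counterpart, so this writes a JSON array.
    // Sorting first only makes the output predictable.
    static String intSetAsJsonArray(Set<Integer> values) {
        return new TreeSet<>(values).stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        Instant posted = Instant.ofEpochSecond(1234);
        System.out.println(instantAsEpochSeconds(posted)); // 1234
        System.out.println(instantAsIsoString(posted));    // "1970-01-01T00:20:34Z"
        System.out.println(intSetAsJsonArray(Set.of(3, 1, 2))); // [1,2,3]
    }
}
```

A reader that receives [1,2,3] has no way to tell from the JSON alone whether it started life as a Set or a List.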

EDN

"Extensible Data Notation" is a format that came out of the syntax of the Clojure programming language.

{ 
     :teethLeft       #{5 12 14 23}
     :countryOfOrigin "United States of America"
     :whelped         #inst "2006-04-12T00:00:00.000-00:00"
     :parents         #{#pokemon "Skitty" 
                        #pokemon "Wailord"}
     :moves           [:quick-attack :tail-whip]
}

More likely than not you have not heard of it. That's a shame because it's pretty cool.

Compared to JSON it has a larger base set of types and a defined mechanism for extending that set.

The key capability for the purposes of this discussion is that you are able to attach an arbitrary tag to any EDN value.

This serves the same purpose as the { "type": ..., "data": ... } pattern in JSON, but by virtue of being part of the format that encoding is not "positional".

As an example of what I mean, in JSON the way you know that a given field contains a moment in time is by knowing implicitly that the string under a specific name like "createdAt" will be formatted as a timestamp.

{ "createdAt": "2020-08-12T00:00:00.000-00:00" }

In EDN if you know how a given tag like #inst should be interpreted then you can automatically do that interpretation no matter where in the structure of the document it appears.

{ "createdAt" #inst"2020-08-12T00:00:00.000-00:00" }
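
A rough sketch of what "no matter where it appears" means in code. This is not a real EDN library: Tagged models an already-parsed tagged element, and interpret, handlers, and the class name are all hypothetical. The timestamp in main uses the "Z" form so that Instant.parse accepts it on any recent JDK.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TagSketch {
    // Hypothetical model of a parsed but not yet interpreted element such as #inst "..."
    record Tagged(String tag, Object value) {}

    // Walks lists and maps, applying the handler for a tag wherever that tag occurs.
    static Object interpret(Object node, Map<String, Function<Object, Object>> handlers) {
        if (node instanceof Tagged tagged) {
            Object inner = interpret(tagged.value(), handlers);
            Function<Object, Object> handler = handlers.get(tagged.tag());
            // Unknown tags are kept as-is rather than guessed at.
            return handler == null ? new Tagged(tagged.tag(), inner) : handler.apply(inner);
        }
        if (node instanceof List<?> list) {
            List<Object> out = new ArrayList<>();
            for (Object item : list) {
                out.add(interpret(item, handlers));
            }
            return out;
        }
        if (node instanceof Map<?, ?> map) {
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : map.entrySet()) {
                out.put(interpret(entry.getKey(), handlers), interpret(entry.getValue(), handlers));
            }
            return out;
        }
        return node;
    }

    public static void main(String[] args) {
        Function<Object, Object> inst = text -> Instant.parse((String) text);
        Object parsed = Map.of("createdAt", new Tagged("inst", "2020-08-12T00:00:00Z"));
        System.out.println(interpret(parsed, Map.of("inst", inst)));
    }
}
```

The handler table knows nothing about field names like "createdAt". It keys off the tag alone.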

This means that translation to and from EDN doesn't have to be lossy in the same way JSON serialization is. If you have a custom aggregate, you can define a tag for that aggregate and include whatever data is needed to reconstruct it.

package some.pack;

sealed interface Mascot {}
record Gecko(int age) implements Mascot {}
record Sailor(int age, boolean captain) implements Mascot {}

// This could be encoded as
// #some.pack.Gecko{:age 12}
// #some.pack.Sailor{:age 35 :captain true}

You can also have non-string keys {{:map "key"} "whatever value"}. Y'all are missing out.

Java's Serialization Format

"Java Serialization" is a mechanism by which any object in memory can be serialized to and deserialized from a sequence of bytes while preserving the same semantics that object had in memory.

For regular classes, it accomplishes this by recursively scraping the fields of the class and producing bytes in the layout defined by the Java Object Serialization Specification. Then when the bytes are read back in, it reconstructs the object by doing the reverse.

For "special" classes (Strings, Enums, and Records) there are slightly different rules, but the effect is essentially the same.

This is exceedingly hard to properly communicate with words, so here is a quick walk-through.

Step 1. Make a Serializable class

Implement the Serializable marker interface and make sure every field of your class is either a primitive or Serializable itself.

import java.io.Serializable;

public class LabeledPosition implements Serializable {
    private String label;
    private int x;
    private int y;
    
    public LabeledPosition(String label, int x, int y) {
        this.label = label;
        this.x = x;
        this.y = y;
    }
    
    @Override
    public String toString() {
        return "LabeledPosition[label=" + this.label +
                ", x=" + this.x +
                ", y=" + this.y +
                "]";
    }
}
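
To see why the rule about fields matters, here is a small sketch of what happens when a field's type is not Serializable, and how marking the field transient sidesteps it. The class names (FieldRules, Connection, Broken, Fixed) are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class FieldRules {
    // Deliberately NOT Serializable.
    static class Connection {}

    static class Broken implements Serializable {
        Connection connection = new Connection();
    }

    static class Fixed implements Serializable {
        int id = 7;
        // transient fields are skipped on write and come back as null / 0 / false.
        transient Connection connection = new Connection();
    }

    // Returns "ok", or the simple name of the exception that writing threw.
    static String tryWrite(Object value) {
        try (var out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return "ok";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    static Fixed roundTrip(Fixed value) {
        try {
            var bytes = new ByteArrayOutputStream();
            try (var out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Fixed) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tryWrite(new Broken())); // NotSerializableException
        System.out.println(tryWrite(new Fixed()));  // ok
        System.out.println(roundTrip(new Fixed()).connection); // null
    }
}
```

Note that the failure only shows up at runtime, when writeObject reaches the offending field.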

Step 2. Make an ObjectOutputStream

You can make this special class by wrapping any existing OutputStream. This is where the bytes of your serialized form will be written.

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

var byteArrayOutputStream = new ByteArrayOutputStream();
var objectOutputStream = new ObjectOutputStream(
        byteArrayOutputStream
);

Step 3. Write your object to the ObjectOutputStream

This is a binary format, so there isn't any fun visual aid, but you can inspect and see that indeed we have written some bytes.

import java.util.Arrays;

objectOutputStream.writeObject(new LabeledPosition("bob", 9, 1));

byte[] bytes = byteArrayOutputStream.toByteArray();
System.out.println(Arrays.toString(bytes));
// [-84, -19, 0, ..., 98, 111, 98]
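
Those first bytes are not arbitrary. Every serialization stream begins with a two-byte magic number and a two-byte version, both exposed as constants on java.io.ObjectStreamConstants. The sketch below checks for them; StreamHeader and its method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;

class StreamHeader {
    static byte[] serialize(Object value) {
        var bytes = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream flushes anything it has buffered.
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // True when the bytes begin with the magic number 0xACED followed by version 5.
    static boolean startsWithStreamHeader(byte[] bytes) {
        if (bytes.length < 4) {
            return false;
        }
        int magic = ((bytes[0] & 0xFF) << 8) | (bytes[1] & 0xFF);
        int version = ((bytes[2] & 0xFF) << 8) | (bytes[3] & 0xFF);
        return magic == (ObjectStreamConstants.STREAM_MAGIC & 0xFFFF)
                && version == ObjectStreamConstants.STREAM_VERSION;
    }

    public static void main(String[] args) {
        System.out.println(startsWithStreamHeader(serialize("bob"))); // true
    }
}
```

As signed Java bytes, 0xAC and 0xED print as -84 and -19, which is where the start of the array above comes from.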

Step 4. Create an ObjectInputStream

This is very similar to how we wrote the object out. Wrap any existing InputStream.

import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;

var byteArrayInputStream = new ByteArrayInputStream(bytes);
var objectInputStream = new ObjectInputStream(byteArrayInputStream);

Step 5. Read in the object you wrote out

var labeledPosition = 
        (LabeledPosition) objectInputStream.readObject();

System.out.println(labeledPosition);
// LabeledPosition[label=bob, x=9, y=1]
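
Steps 2 through 5 can be folded into one helper, sketched below. The RoundTrip class, its roundTrip method, and the Color and Percentage types are invented for this example. It also exercises two of the "special" cases mentioned earlier: an enum constant comes back as the very same constant, and a record is rebuilt through its canonical constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class RoundTrip {
    // Enums are Serializable without any extra declaration.
    enum Color { RED, GREEN }

    record Percentage(int value) implements Serializable {
        Percentage {
            if (value < 0 || value > 100) {
                throw new IllegalArgumentException("out of range: " + value);
            }
        }
    }

    // Writes the value to an in-memory buffer and immediately reads it back.
    static <T extends Serializable> T roundTrip(T value, Class<T> type) {
        try {
            var bytes = new ByteArrayOutputStream();
            try (var out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return type.cast(in.readObject());
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("bob", String.class)); // bob
        System.out.println(roundTrip(Color.RED, Color.class) == Color.RED); // true
        System.out.println(roundTrip(new Percentage(40), Percentage.class)); // Percentage[value=40]
    }
}
```

The try-with-resources blocks make sure both streams are flushed and closed, which the step-by-step snippets skip for brevity.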

Step 6. Make another Serializable class

import java.util.List;

record TwoLists(
     List<Integer> listOne,
     List<Integer> listTwo
) implements Serializable {}

Step 7. Make a mutable object

So here we will make an instance of this TwoLists record where each List is the exact same list in memory.

This means that if we add to either listOne or listTwo both will be updated.

import java.util.ArrayList;

var theList = new ArrayList<>(List.of(1, 2, 3));
var twoLists = new TwoLists(theList, theList);

System.out.println(twoLists);
// TwoLists[listOne=[1, 2, 3], listTwo=[1, 2, 3]]

twoLists.listOne().add(4);
System.out.println(twoLists);
// TwoLists[listOne=[1, 2, 3, 4], listTwo=[1, 2, 3, 4]]

Step 8. Write that mutable object to an ObjectOutputStream

var byteArrayOutputStream = new ByteArrayOutputStream();
var objectOutputStream = new ObjectOutputStream(
        byteArrayOutputStream
);
objectOutputStream.writeObject(twoLists);
byte[] bytes = byteArrayOutputStream.toByteArray();

Step 9. Read that mutable object from an ObjectInputStream

var byteArrayInputStream = new ByteArrayInputStream(bytes);
var objectInputStream = new ObjectInputStream(byteArrayInputStream);

var roundTripped = (TwoLists) objectInputStream.readObject();

Step 10. Oh no

System.out.println(roundTripped);
// TwoLists[listOne=[1, 2, 3, 4], listTwo=[1, 2, 3, 4]]

System.out.println(roundTripped.listOne() == roundTripped.listTwo());
// true

roundTripped.listOne().add(5);
System.out.println(roundTripped);
// TwoLists[listOne=[1, 2, 3, 4, 5], listTwo=[1, 2, 3, 4, 5]]

If you have the same object in two places in the "object graph" of something you are serializing, the fact that those two places hold the same object is preserved.

Because of this, you can even seamlessly serialize things like circular linked lists.

class CircularThing implements Serializable {
    CircularThing next;
}

// How would you write this in JSON?
var circular = new CircularThing();
circular.next = circular;
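
Here is a runnable sketch of that claim. Cycles and Node are stand-in names for this example; the point is that the copy's cycle points back at the copy, not at the original.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class Cycles {
    static class Node implements Serializable {
        Node next;
    }

    static Node roundTrip(Node node) {
        try {
            var bytes = new ByteArrayOutputStream();
            try (var out = new ObjectOutputStream(bytes)) {
                out.writeObject(node);
            }
            try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Node) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A two-node ring: a -> b -> a
        Node a = new Node();
        Node b = new Node();
        a.next = b;
        b.next = a;

        Node copy = roundTrip(a);
        System.out.println(copy != a);              // true: a brand new object
        System.out.println(copy.next != copy);      // true: still two distinct nodes
        System.out.println(copy.next.next == copy); // true: the cycle closes on the copy
    }
}
```

The stream records a back-reference the second time it meets an object it has already written, which is what keeps the write from looping forever.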

What is this good for?

Since you can save any arbitrary object and there is no extra code needed to make that just "work", Java Serialization can be a very useful crutch for getting code working quickly.

In the Python world, a similar utility (pickle) is often used to save the results of training ML models. It's easy to imagine that Java Serialization could see similar use if Data Science ever took off on the JVM in the same way.

What is this bad for?

While you can version serialized objects, doing so is non-obvious and error-prone. Making a class serializable, especially in a library, can therefore be a fairly large maintenance problem.
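
The usual first step in versioning is declaring a serialVersionUID. If a class does not declare one, the runtime computes one from the shape of the class, so an innocent-looking edit can change it and make previously written data fail to load with an InvalidClassException. The sketch below just reads the value back; Versioning, Pinned, and Unpinned are names invented for this example.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

class Versioning {
    static class Pinned implements Serializable {
        // Declared explicitly, so it stays 1 until someone changes it on purpose.
        private static final long serialVersionUID = 1L;
        int x;
    }

    static class Unpinned implements Serializable {
        // No serialVersionUID: the runtime derives one from the class's structure.
        int x;
    }

    // Assumes the given class is Serializable; lookup returns null otherwise.
    static long uidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(uidOf(Pinned.class));   // 1
        System.out.println(uidOf(Unpinned.class)); // some computed hash
    }
}
```

Pinning the value only stops the automatic mismatch. Deciding what happens when fields are added, removed, or retyped is still on you.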

If you read serialized data that you did not write, that is a giant security hole. There is more nuance to it, but basically if you read untrusted serialized data then any hacker can get full access to your system. I'm not going to go into every way you can exploit serialization, but this talk should give you a basic idea.
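
One mitigation available since Java 9 is an ObjectInputFilter, which gets a chance to reject classes before they are instantiated. The sketch below installs a simple allow-list; FilteredRead, Allowed, Forbidden, and outcome are names invented for this example. Treat it as an illustration of the API, not as a complete defense.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.List;

class FilteredRead {
    static class Allowed implements Serializable { int value = 1; }
    static class Forbidden implements Serializable { int value = 2; }

    static byte[] serialize(Object value) {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the simple name of the class that was read,
    // or of the exception thrown while reading.
    static String outcome(byte[] bytes, Class<?>... allowed) {
        List<Class<?>> allowList = List.of(allowed);
        try (var in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // The filter must be installed before the first object is read.
            in.setObjectInputFilter(info -> {
                Class<?> type = info.serialClass();
                if (type == null) {
                    // Checks that carry no class (references, limits): no opinion.
                    return ObjectInputFilter.Status.UNDECIDED;
                }
                return allowList.contains(type)
                        ? ObjectInputFilter.Status.ALLOWED
                        : ObjectInputFilter.Status.REJECTED;
            });
            return in.readObject().getClass().getSimpleName();
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(serialize(new Allowed()), Allowed.class));   // Allowed
        System.out.println(outcome(serialize(new Forbidden()), Allowed.class)); // InvalidClassException
    }
}
```

An allow-list narrows what an attacker can instantiate, but the safest option is still to not deserialize data you do not trust.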

Because serialized objects are stored in a binary format, they are impossible to read without special tooling and prohibitively hard to write by hand.

While technically you could write a parser for the binary format in your language of choice and recover the information, you would likely be the first. If you need to share values with programs in other languages, falling back to a "lowest common denominator" like JSON is a better strategy.

Part of what made writing this so hard for me is that most people I've seen introduced to serialization were shown it very early in their curriculums. It's hard to explain nuance around the object model and encapsulation to someone who learned what classes are two weeks ago, so I left most of that out.

Leave a comment below if anything was unclear, incorrect, or you would like to learn more.