Serialization is the process of taking an in-memory representation of data and transforming it to a representation suitable for sending to another location.
Deserialization is the reverse of that process. Code takes a structured representation of data from some location and transforms it into a representation in memory.
Every programming language has a myriad of approaches for performing these tasks. These approaches vary greatly depending on the semantics of the language, the semantics of the output format, and the culture surrounding both.
What sets Java's serialization mechanism apart is that the semantics of the language map extremely closely to those of the output format.
To fully appreciate the implications of this, allow me to take you on a bit of a tour of some other data formats.
CSV, Comma Separated Values, is one of the most "basic" data formats out there.
Data is written one line at a time, with each value in a "row" separated by commas.
frankie,25,yes,"Jun 8, 2023"
casca,63,no,none
By convention sometimes the very first row is used to store a "label" of what each "column" means.
First Name,Number of Cats,Tax Fraud?,Upcoming Court Date
frankie,25,yes,"Jun 8, 2023"
casca,63,no,none
While labels can add contextual information, the actual "data model" that is directly encoded here is just rows of strings. Interpretation of these rows is dependent on a combination of convention and "out of band" information.
CSV is
- a list of Rows
A Row is
- a list of strings
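To make that data model concrete, here is a minimal sketch of a reader that produces exactly "a list of rows, where each row is a list of strings". The class and method names (CsvSketch, parseRow, parse) are made up for illustration, and it assumes the common convention that double quotes wrap a value containing a comma. It is not a complete CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvSketch {
    // Splits one line into a row of strings. Double quotes let a value
    // contain commas, and a doubled quote ("") stands for a literal quote.
    static List<String> parseRow(String line) {
        List<String> row = new ArrayList<>();
        StringBuilder value = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes && c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                value.append('"');
                i++; // skip the second quote of the escaped pair
            } else if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                row.add(value.toString());
                value.setLength(0);
            } else {
                value.append(c);
            }
        }
        row.add(value.toString());
        return row;
    }

    // CSV is a list of rows. Assumption: no value spans multiple lines.
    static List<List<String>> parse(String csv) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : csv.split("\\R")) {
            rows.add(parseRow(line));
        }
        return rows;
    }
}
```

Notice that nothing in the result says which string is a number, a boolean, or a date. That knowledge has to come from somewhere else.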
CSV is popular in quite a few domains. It's easy to import and export to Spreadsheets, write out from sensors on an Arduino, and feed into Machine Learning libraries.
But its data model is not close to how most programs represent data. Going from a representation in memory to CSV is almost always a "lossy" process. Going from CSV back to that same representation in memory requires knowledge about how to interpret the order of elements in a row, what each element means, etc.
import java.time.LocalDate;
import java.util.List;

record Person(
    // have to assume that the first element is the name
    String name,
    // have to assume that the second element is this
    int numberOfCats,
    // How should a boolean be encoded?
    boolean taxFraud,
    // What format is the date in?
    // What is done when no value is known?
    LocalDate upcomingCourtDate
) {
    static Person fromCsvRow(List<String> row) {
        // Code here could be autogenerated if you assume
        // conventions, but it probably won't be
        if (row.size() != 4) { ... }

        String name = row.get(0);

        int numberOfCats;
        try {
            numberOfCats = Integer.parseInt(row.get(1));
        } catch (NumberFormatException __) {
            ...
        }

        // ... and so on ...

        return new Person(name, numberOfCats, ...);
    }

    List<String> toCsvRow() {
        // ...
        return List.of(this.name, ...);
    }
}
"JavaScript Object Notation" is a format derived from the syntax of declaring object literals in JavaScript.
{
  "stockName": "IDK",
  "stockPrice": "100USD",
  "twitterComments": [
    {
      "retweets": 10,
      "text": "..."
    },
    {
      "retweets": 20,
      "text": "..."
    }
  ]
}
Compared to CSV it is way more expressive. Instead of just rows of strings the data model includes dedicated representations for booleans, numbers, lists, and more.
JSON is one of
- null
- a string
- a number
- a boolean
- a list of JSON
- a map of string to JSON
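If it helps to see that list as code, here is one way the same data model could be written down in Java. This is only a sketch: the names (Json, Str, Num, Arr, Obj, sampleComment) are mine, not from any JSON library, and I am assuming a double is good enough to stand in for "a number".

```java
import java.util.List;
import java.util.Map;

// JSON is one of the following six shapes.
sealed interface Json {
    record Null() implements Json {}
    record Str(String value) implements Json {}
    record Num(double value) implements Json {}
    record Bool(boolean value) implements Json {}
    record Arr(List<Json> items) implements Json {}
    record Obj(Map<String, Json> fields) implements Json {}

    // The first comment from the stock example, built by hand.
    static Json sampleComment() {
        return new Obj(Map.of(
            "retweets", new Num(10),
            "text", new Str("...")));
    }
}
```

Anything your program holds in memory that is not one of these six shapes needs a convention to be squeezed into them.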
This makes it somewhat of a "lowest common denominator" data format. Most modern languages have support for these data types and the structure can represent nested data much more ergonomically than "flat" formats like CSV.
The translation from a model in memory to JSON is still "lossy" in quite a few common cases though.
record Recruiter(
    // Often enums will be translated to Strings
    TellsYouTheSalary tellsYouTheSalary,
    // Times might be put into a ISO-8601 format String
    // or a Unix Time integer
    Instant postedFirstCringeStatus,
    // Sets aren't representable, so often
    // they will be encoded as lists
    Set<ReservationsAtDorsier> reservations,
    // Multiple possibilities with overlapping fields need a
    // convention for representing which is present
    LovedOne lovedOne
) {}

enum TellsYouTheSalary {
    UP_FRONT,
    IF_YOU_ASK,
    NEVER
}

sealed interface LovedOne {}
record Cat(String name) implements LovedOne {}
record Dog(String name) implements LovedOne {}
record NoOne() implements LovedOne {}

// Both of these would be valid representations
// depending on your conventions
//
// { "tellsYouTheSalary": "UP_FRONT",
// "postedFirstCringeStatus": 1234,
// "reservations": [],
// "lovedOne": {"type": "cat", "name": "fred" } }
//
// { "tells_you_the_salary": "up_front",
// "posted_first_cringe_status": "2020-07-10 15:00:00.000",
// "reservations": {"kind": "set", "contents": []},
// "loved_one": {"kind": "cat", "name": "fred"} }
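Here is a rough sketch of what hand-writing one of those conventions looks like for the LovedOne case. It follows the first encoding (a "type" field) and uses a plain Map as a stand-in for a JSON object. The class name LovedOneJson, the encode and decode methods, and the "no_one" label are all my own inventions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LovedOneJson {
    sealed interface LovedOne {}
    record Cat(String name) implements LovedOne {}
    record Dog(String name) implements LovedOne {}
    record NoOne() implements LovedOne {}

    // Writes a "type" field so a reader can tell which case is present.
    static Map<String, Object> encode(LovedOne lovedOne) {
        Map<String, Object> fields = new LinkedHashMap<>();
        if (lovedOne instanceof Cat cat) {
            fields.put("type", "cat");
            fields.put("name", cat.name());
        } else if (lovedOne instanceof Dog dog) {
            fields.put("type", "dog");
            fields.put("name", dog.name());
        } else {
            fields.put("type", "no_one"); // assumed label for NoOne
        }
        return fields;
    }

    // Only works if the writer followed the same convention.
    static LovedOne decode(Map<String, Object> fields) {
        Object type = fields.get("type");
        if ("cat".equals(type)) {
            return new Cat((String) fields.get("name"));
        } else if ("dog".equals(type)) {
            return new Dog((String) fields.get("name"));
        } else if ("no_one".equals(type)) {
            return new NoOne();
        }
        throw new IllegalArgumentException("Unknown type: " + type);
    }
}
```

A reader expecting "kind" instead of "type" would fail on this output, which is the whole problem: the convention lives in the code on both ends, not in the format.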
"Extensible Data Notation" is a format that came out of the syntax of the Clojure programming language.
{ :teethLeft #{5 12 14 23}
  :countryOfOrigin "United States of America"
  :whelped #inst "2006-04-12T00:00:00.000-00:00"
  :parents #{#pokemon "Skitty"
             #pokemon "Wailord"}
  :moves [:quick-attack :tail-whip]
}
More likely than not you have not heard of it. That's a shame because it's pretty cool.
Compared to JSON it has a larger base set of types and a defined mechanism for extending that set.
EDN is one of
- null
- a string
- an integer
- a vector of EDN
- a map of EDN to EDN
- a set of EDN
- a keyword
- a symbol
- an element with a tag and an EDN value
... and a few other base types ...
The key capability for the purposes of this discussion is that you are able to attach an arbitrary tag to any EDN value.
This serves the same purpose as the { "type": ..., "data": ... } pattern in JSON, but by virtue of being part of the format that encoding is not "positional".
As an example of what I mean, in JSON the way you know that a given field contains a moment in time is by knowing implicitly that the string under a specific name like "createdAt" will be formatted as a timestamp.
{ "createdAt": "2020-08-12T00:00:00.000-00:00" }
In EDN if you know how a given tag like #inst should be interpreted then you can automatically do that interpretation no matter where in the structure of the document it appears.
{ "createdAt" #inst "2020-08-12T00:00:00.000-00:00" }
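To show what "no matter where in the structure" buys you, here is a small sketch of a walker that interprets tags wherever they appear. It does not parse EDN text. It assumes some parser has already produced nested Lists, Maps, and a hypothetical Tagged record, and that the reader table only knows #inst. All of the names here are mine.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class TaggedWalk {
    // Hypothetical in-memory form of a tagged element like #inst "..."
    record Tagged(String tag, Object value) {}

    // Assumed reader table: only "inst" is known, and it expects a
    // timestamp string that Instant.parse accepts.
    static Map<String, Function<Object, Object>> defaultReaders() {
        return Map.of("inst", value -> Instant.parse((String) value));
    }

    // Recursively replaces every known tagged element with its
    // interpreted value. Unknown tags are left in place.
    static Object resolve(Object node, Map<String, Function<Object, Object>> readers) {
        if (node instanceof Tagged tagged) {
            Object inner = resolve(tagged.value(), readers);
            Function<Object, Object> reader = readers.get(tagged.tag());
            return reader == null ? new Tagged(tagged.tag(), inner) : reader.apply(inner);
        }
        if (node instanceof List<?> list) {
            List<Object> out = new ArrayList<>();
            for (Object item : list) {
                out.add(resolve(item, readers));
            }
            return out;
        }
        if (node instanceof Map<?, ?> map) {
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : map.entrySet()) {
                out.put(resolve(entry.getKey(), readers), resolve(entry.getValue(), readers));
            }
            return out;
        }
        return node;
    }
}
```

The walker never looks at a key name like "createdAt". The tag alone is enough to know an Instant should come out.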
This means that translation to and from EDN doesn't have to be lossy in the same way JSON serialization is. If you have a custom aggregate, you can define a tag for that aggregate and include whatever data is needed to reconstruct it.
package some.pack;

sealed interface Mascot {}
record Gecko(int age) implements Mascot {}
record Sailor(int age, boolean captain) implements Mascot {}

// This could be encoded as
// #some.pack.Gecko{:age 12}
// #some.pack.Sailor{:age 35 :captain true}
You can also have non-string keys: {{:map "key"} "whatever value"}. Y'all are missing out.
"Java Serialization" is a mechanism by which any object in memory can be serialized to and deserialized from a sequence of bytes while preserving the same semantics that object had in memory.
For regular classes, it accomplishes this by recursively scraping the fields of the class and producing bytes as laid out in the Java Object Serialization Specification. Then when the bytes are read back in, it reconstructs the object by doing the reverse.
For "special" classes (Strings, Enums, and Records) there are slightly different rules, but the effect is essentially the same.
This is exceedingly hard to properly communicate with words, so here is a quick walk-through.
You can follow along by pasting each snippet into JShell.
(If you have Java installed, run jshell on the command line)
Implement the Serializable marker interface and make sure every field of your class does as well or is a primitive.
import java.io.Serializable;
public class LabeledPosition implements Serializable {
    private String label;
    private int x;
    private int y;

    public LabeledPosition(String label, int x, int y) {
        this.label = label;
        this.x = x;
        this.y = y;
    }

    @Override
    public String toString() {
        return "LabeledPosition[label=" + this.label +
            ", x=" + this.x +
            ", y=" + this.y +
            "]";
    }
}
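If you are curious what happens when a field breaks that rule, here is a small sketch. The classes (Engine, Car, ParkedCar) and the canSerialize helper are invented for this example. It also uses the transient keyword, which I have not introduced yet: it tells serialization to skip a field entirely.

```java
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class FieldCheckSketch {
    // Does not implement Serializable.
    static class Engine {}

    // Serializable itself, but holds a field that is not.
    static class Car implements Serializable {
        Engine engine = new Engine();
    }

    // Same shape, but the offending field is skipped when writing.
    static class ParkedCar implements Serializable {
        transient Engine engine = new Engine();
    }

    // Tries to write the value to a stream that discards everything.
    static boolean canSerialize(Object value) {
        try (var out = new ObjectOutputStream(OutputStream.nullOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The compiler will not warn you about Car. The failure only shows up at runtime, when writeObject reaches the Engine field.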
You can make an ObjectOutputStream by wrapping any existing OutputStream. This is where the bytes of your serialized form will be written.
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

var byteArrayOutputStream = new ByteArrayOutputStream();
var objectOutputStream = new ObjectOutputStream(
    byteArrayOutputStream);
This is a binary format, so there isn't any fun visual aid, but you can inspect and see that indeed we have written some bytes.
objectOutputStream.writeObject(new LabeledPosition("bob", 9, 1));

byte[] bytes = byteArrayOutputStream.toByteArray();
System.out.println(Arrays.toString(bytes));
// [-84, -19, 0, ..., 98, 111, 98]
Reading the object back in is very similar to how we wrote it out. Wrap any existing InputStream in an ObjectInputStream.
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;

var byteArrayInputStream = new ByteArrayInputStream(bytes);
var objectInputStream = new ObjectInputStream(byteArrayInputStream);

var labeledPosition = (LabeledPosition) objectInputStream.readObject();
System.out.println(labeledPosition);
// LabeledPosition[label=bob, x=9, y=1]
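Outside of JShell you would probably bundle those steps into a pair of helpers. Here is one possible shape for them. The names (RoundTrip, serialize, deserialize) are mine, and wrapping the checked exceptions in unchecked ones is just a choice I made to keep the call sites short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class RoundTrip {
    // Writes the object graph rooted at value into a byte array.
    static byte[] serialize(Serializable value) {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds an object graph from bytes produced by serialize.
    // Only call this on bytes you wrote yourself.
    static Object deserialize(byte[] data) {
        try (var in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With these, a full round trip is a one-liner like RoundTrip.deserialize(RoundTrip.serialize(value)), plus a cast back to the type you expect.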
Bear with me here, this gets good.
record TwoLists(
    List<Integer> listOne,
    List<Integer> listTwo
) implements Serializable {}
So here we will make an instance of this TwoLists record where each List is the exact same list in memory.
This means that if we add to either listOne or listTwo both will be updated.
var theList = new ArrayList<>(List.of(1, 2, 3));
var twoLists = new TwoLists(theList, theList);

System.out.println(twoLists);
// TwoLists[listOne=[1, 2, 3], listTwo=[1, 2, 3]]

twoLists.listOne().add(4);
System.out.println(twoLists);
// TwoLists[listOne=[1, 2, 3, 4], listTwo=[1, 2, 3, 4]]
var byteArrayOutputStream = new ByteArrayOutputStream();
var objectOutputStream = new ObjectOutputStream(
    byteArrayOutputStream);

objectOutputStream.writeObject(twoLists);
byte[] bytes = byteArrayOutputStream.toByteArray();

var byteArrayInputStream = new ByteArrayInputStream(bytes);
var objectInputStream = new ObjectInputStream(byteArrayInputStream);

var roundTripped = (TwoLists) objectInputStream.readObject();
Oh yeah.
System.out.println(roundTripped);
// TwoLists[listOne=[1, 2, 3, 4], listTwo=[1, 2, 3, 4]]

System.out.println(roundTripped.listOne() == roundTripped.listTwo());
// true

roundTripped.listOne().add(5);
System.out.println(roundTripped);
// TwoLists[listOne=[1, 2, 3, 4, 5], listTwo=[1, 2, 3, 4, 5]]
If you have the same object two places in the "object graph" of something you are serializing, the fact that those two places hold the same object is preserved.
Because of this, you can even seamlessly serialize things like circular linked lists.
class CircularThing implements Serializable {
    CircularThing next;
}

// How would you write this in JSON?
var circular = new CircularThing();
circular.next = circular;
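To check that the cycle really survives, here is a sketch that round-trips a two-node loop. Node and roundTrip are names I made up, and the exception wrapping is just to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class CircularDemo {
    static class Node implements Serializable {
        Node next;
    }

    // Serializes the node to bytes in memory and reads it straight back.
    static Node roundTrip(Node node) {
        try {
            var bytes = new ByteArrayOutputStream();
            try (var out = new ObjectOutputStream(bytes)) {
                out.writeObject(node);
            }
            try (var in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Node) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If you build a loop with a.next = b and b.next = a and pass a through roundTrip, the copy you get back is a different object from a, but following next twice from the copy lands you back on the copy itself.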
Since you can save any arbitrary object and there is no extra code needed to make that just "work", Java Serialization can be a very useful crutch for getting code working quickly.
In the Python world, a similar utility is often used to save the results of training ML models. It's easy to imagine that Java Serialization could see similar use if Data Science ever took off on the JVM in the same way.
Spark uses this mechanism for distributing Java objects across different nodes.
While you can version serialized objects, doing so is non-obvious and error-prone. Making a class serializable, especially in a library, can therefore be a fairly large maintenance problem.
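The usual starting point for versioning is the serialVersionUID field. If you do not declare one, the runtime computes a value from the details of the class, so an innocent-looking edit can make old bytes unreadable. Here is a small sketch of the difference. The class names are hypothetical. ObjectStreamClass.lookup is the standard way to ask what version a class currently has.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

final class VersioningSketch {
    // Version pinned by hand. It stays 1 until you change it.
    static class Pinned implements Serializable {
        private static final long serialVersionUID = 1L;
        String label;
    }

    // No explicit version. The runtime derives one from the class,
    // and it can change when the class changes.
    static class Unpinned implements Serializable {
        String label;
    }

    // Returns the version that would be written into the stream.
    static long versionOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }
}
```

Pinning the number only gets you past the version check. Deciding what should happen to fields that were added, removed, or renamed is still on you.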
If you read serialized data that you did not write, that is a giant security hole. There is more nuance to it, but basically if you read untrusted serialized data then any hacker can get full access to your system. I'm not going to go into every way you can exploit serialization, but this talk should give you a basic idea.
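One mitigation the JDK offers (Java 9 and later) is ObjectInputFilter, which lets you say which classes a stream is allowed to produce. Below is a rough sketch, not a complete defense. The allow list, the class name FilterSketch, and its helper methods are all assumptions I picked for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class FilterSketch {
    // Assumed policy: anything in java.lang, plus ArrayList, and
    // nothing else ("!*" rejects every other class).
    static final String ALLOW_LIST = "java.lang.*;java.util.ArrayList;!*";

    static byte[] serialize(Serializable value) {
        var bytes = new ByteArrayOutputStream();
        try (var out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns true if the data could be read under the allow list.
    // A rejected class surfaces as an InvalidClassException, which is
    // an IOException, so any read failure counts as "not accepted".
    static boolean accepts(byte[] data) {
        ObjectInputFilter filter = ObjectInputFilter.Config.createFilter(ALLOW_LIST);
        try (var in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.setObjectInputFilter(filter);
            in.readObject();
            return true;
        } catch (IOException | ClassNotFoundException e) {
            return false;
        }
    }
}
```

Under this policy a serialized ArrayList of Integers is accepted, while a serialized java.util.Date is rejected before it is ever constructed. The safest option is still to not deserialize data you do not trust.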
This was a crucial part of the Log4Shell vulnerability.
Because serialized objects are stored in a binary format, they are impossible to read without special tooling and prohibitively hard to write by hand.
While technically you could write a parser for the binary format in your language of choice and recover the information, you would likely be the first. If you need to share values with programs in other languages, falling back to a "lowest common denominator" like JSON is a better strategy.
Part of what made writing this so hard for me is that most people I've seen introduced to serialization were shown it very early in their curriculums. It's hard to explain nuance around the object model and encapsulation when talking to someone who learned what classes are two weeks back, so I left most of that out.
Leave a comment below if anything was unclear, incorrect, or you would like to learn more.