Skip to content

Un-Buffered Kryo Serialization#20

Open
rjhall wants to merge 6 commits into
Cascading:masterfrom
rjhall:master
Open

Un-Buffered Kryo Serialization#20
rjhall wants to merge 6 commits into
Cascading:masterfrom
rjhall:master

Conversation

@rjhall
Copy link
Copy Markdown

@rjhall rjhall commented Nov 18, 2013

Currently, cascading.kryo writes to a byte buffer in memory which is then written out to the output format. When serializing large objects this is a waste of memory and will contribute to OOM errors. This patch introduces an option whereby cascading.kryo instead writes directly to the output stream (note that since the SequenceFile.writer itself has the same sort of buffer, this is only half of the solution to the above problem).

This comes at the cost of CPU, since it is still crucial to know the size of the serialized object in bytes prior to writing (both for compatibility with the current "buffered" serialization, and because on deserialization, the size must be known -- to prevent kryo from reading beyond object boundaries and screwing hadoop). Thus I invoke kryo serialization twice, once to write to a fake output stream which just counts how many bytes would have been written, and then again to the hadoop-owned output stream for actual writing. This way the size of the serialized object, and its serialization are both computed without the necessity of a memory buffer large enough to hold the thing.

The flag is called "cascading.kryo.unbuffered" and defaults to false. Note that since the bytes written to disk are the same irrespective of this flag, it is possible to read and write sequence files generated under a different value of this flag.

@johnynek
Copy link
Copy Markdown
Contributor

We are using chill-hadoop now and have kind of abandoned this project.

Scalding 0.9.0 does not depend on this. Do you want to consider moving this to chill-hadoop?

@rjhall
Copy link
Copy Markdown
Author

rjhall commented Nov 19, 2013

I figured that from looking at the source. we are still using scalding 8.2 here, I wil look at how the new one works since I anticipate we will move to it eventually

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants