Un-Buffered Kryo Serialization by rjhall · Pull Request #20 · Cascading/cascading.kryo

rjhall · 2013-11-18T20:38:47Z

Currently, cascading.kryo serializes each object into an in-memory byte buffer, which is then written out to the output format. For large objects this wastes memory and can contribute to OOM errors. This patch introduces an option whereby cascading.kryo instead writes directly to the output stream. Note that since SequenceFile.Writer itself has the same sort of buffer, this is only half of the solution to the problem above.

This comes at the cost of CPU, since it is still crucial to know the size of the serialized object in bytes prior to writing. The size is needed both for compatibility with the current "buffered" serialization and because deserialization must know it, to prevent Kryo from reading beyond object boundaries and breaking Hadoop's reads. Thus I invoke Kryo serialization twice: once to write to a fake output stream which just counts how many bytes would have been written, and then again to the Hadoop-owned output stream for the actual write. This way both the size of the serialized object and its serialized form are computed without needing a memory buffer large enough to hold the whole thing.
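The two-pass idea described above can be sketched in plain Java. This is an illustrative sketch, not the code in this patch: `ObjectWriter`, `CountingOutputStream`, and `UnbufferedSketch` are hypothetical names, the `ObjectWriter` callback stands in for the Kryo serialization call, and the 4-byte length prefix is an assumption about the framing rather than the format cascading.kryo actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a Kryo serialization call; not part of cascading.kryo. */
interface ObjectWriter<T> {
    void write(T value, OutputStream out) throws IOException;
}

/** "Fake" stream: discards every byte and only remembers how many were written. */
final class CountingOutputStream extends OutputStream {
    private long count = 0;

    @Override
    public void write(int b) {
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
        count += len;
    }

    long getCount() {
        return count;
    }
}

final class UnbufferedSketch {
    /** Pass 1 counts the bytes; pass 2 writes the size prefix followed by the payload. */
    static <T> void serialize(T value, ObjectWriter<T> writer, DataOutputStream out)
            throws IOException {
        CountingOutputStream counter = new CountingOutputStream();
        writer.write(value, counter);                        // first pass: nothing is stored
        out.writeInt(Math.toIntExact(counter.getCount()));   // assumed 4-byte size prefix
        writer.write(value, out);                            // second pass: straight to the real stream
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        ObjectWriter<String> writer = (s, o) -> o.write(s.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the Hadoop-owned stream
        serialize("hello", writer, new DataOutputStream(sink));
        System.out.println(sink.size()); // 4-byte prefix + 5 payload bytes
    }
}
```

The trade-off is visible in `serialize`: the writer callback runs twice, so CPU cost roughly doubles, but no intermediate buffer proportional to the object size is allocated.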

The flag is called "cascading.kryo.unbuffered" and defaults to false. Note that since the bytes written to disk are the same irrespective of this flag, sequence files written under one value of the flag can be read under the other.
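To illustrate why files written in either mode are interchangeable, here is a small self-contained sketch in which a buffered writer and a two-pass unbuffered writer produce byte-identical records that the same reader consumes. Everything in it is hypothetical: the class and method names are invented, the payload encoding (each int as 4 bytes) stands in for Kryo, and the 4-byte size prefix is an assumed framing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

final class FlagCompatibilitySketch {
    /** Hypothetical payload encoding standing in for Kryo: each int as 4 bytes. */
    static void writePayload(int[] values, DataOutputStream out) throws IOException {
        for (int v : values) {
            out.writeInt(v);
        }
    }

    /** Buffered mode: the whole payload sits in memory before anything is emitted. */
    static byte[] writeBuffered(int[] values) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writePayload(values, new DataOutputStream(buffer));
        byte[] payload = buffer.toByteArray();

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        out.writeInt(payload.length); // assumed size prefix
        out.write(payload);
        out.flush();
        return sink.toByteArray();
    }

    /** Unbuffered mode: one pass to count, one pass to write. */
    static byte[] writeUnbuffered(int[] values) throws IOException {
        OutputStream discard = new OutputStream() {
            @Override
            public void write(int b) { /* drop the byte */ }
        };
        DataOutputStream counter = new DataOutputStream(discard);
        writePayload(values, counter); // counting pass; DataOutputStream tracks bytes written

        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the Hadoop stream
        DataOutputStream out = new DataOutputStream(sink);
        out.writeInt(counter.size()); // same prefix as the buffered path
        writePayload(values, out);    // writing pass
        out.flush();
        return sink.toByteArray();
    }

    /** Reader: uses the size prefix to consume exactly one record's bytes. */
    static int[] read(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        int size = in.readInt();
        byte[] payload = new byte[size]; // buffered here only to keep the sketch short
        in.readFully(payload);
        DataInputStream body = new DataInputStream(new ByteArrayInputStream(payload));
        int[] values = new int[size / 4];
        for (int i = 0; i < values.length; i++) {
            values[i] = body.readInt();
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        int[] values = {1, 2, 3};
        byte[] a = writeBuffered(values);
        byte[] b = writeUnbuffered(values);
        System.out.println(Arrays.equals(a, b));      // true: identical records
        System.out.println(Arrays.toString(read(b))); // [1, 2, 3]
    }
}
```

Because the reader sees only the size prefix and the payload, it cannot tell which mode produced a record.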

This reverts commit 06a2b3b.

johnynek · 2013-11-18T23:14:44Z

We are using chill-hadoop now and have kind of abandoned this project.

Scalding 0.9.0 does not depend on this. Do you want to consider moving this to chill-hadoop?

rjhall · 2013-11-19T00:17:04Z

I figured that from looking at the source. We are still using scalding 8.2 here. I will look at how the new one works, since I anticipate we will move to it eventually.

Sam Ritchie and others added 6 commits May 17, 2013 14:38

bump to snapshot version.

49e381e

Add flag to shortcut Kryo and use java serialization.

06a2b3b

Merge remote-tracking branch 'upstream/master'

fa343d4

Revert "Add flag to shortcut Kryo and use java serialization."

f62aa9c

This reverts commit 06a2b3b.

Add buffer free kryo serializers.

bf89b76

works

717892d

rjhall wants to merge 6 commits into Cascading:master from rjhall:master
