Un-Buffered Kryo Serialization#20
Open
rjhall wants to merge 6 commits into
Open
Conversation
This reverts commit 06a2b3b.
Contributor
|
We are using chill-hadoop now and have kind of abandoned this project. Scalding 0.9.0 does not depend on this. Do you want to consider moving this to chill-hadoop? |
Author
|
I figured that from looking at the source. we are still using scalding 8.2 here, I wil look at how the new one works since I anticipate we will move to it eventually |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Currently, cascading.kryo writes to a byte buffer in memory which is then written out to the output format. When serializing large objects this is a waste of memory and will contribute to OOM errors. This patch introduces an option whereby cascading.kryo instead writes directly to the output stream (note that since the SequenceFile.writer itself has the same sort of buffer, this is only half of the solution to the above problem).
This comes at the cost of CPU, since it is still crucial to know the size of the serialized object in bytes prior to writing (both for compatibility with the current "buffered" serialization, and because on deserialization, the size must be known -- to prevent kryo from reading beyond object boundaries and screwing hadoop). Thus I invoke kryo serialization twice, once to write to a fake output stream which just counts how many bytes would have been written, and then again to the hadoop-owned output stream for actual writing. This way the size of the serialized object, and its serialization are both computed without the necessity of a memory buffer large enough to hold the thing.
The flag is called "cascading.kryo.unbuffered" and defaults to false. Note that since the bytes written to disk are the same irrespective of this flag, it is possible to read and write sequence files generated under a different value of this flag.