Fixed broken links in header.html file. #4
Open
mandadipavan wants to merge 581 commits into
This patch makes some minor performance improvements to the new builtin function for functional dependency discovery, including vectorized handling of known dependencies, less indexing overhead, removal of unnecessary operations (rmEmpty, distinct, agg), and fixed-size table computation. On a scenario of 10K x 1K columns (499500 pairs) with domain 1:100, this patch improved performance from 96.5s to 62.2s. In the future, we should restrict ourselves to integer/boolean types, stick to simple pairs, and only enumerate candidates of relevant pairs.
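As a rough illustration of what pairwise functional dependency discovery checks, the following Python sketch tests whether one column determines another and enumerates all ordered column pairs. It is not the builtin's implementation: the function names `fd_holds` and `discover_pairwise_fds` are hypothetical, and the sketch ignores the mask, threshold, and vectorization aspects entirely.

```python
def fd_holds(xs, ys):
    """Return True if each value in xs always co-occurs with the same value in ys (X -> Y)."""
    seen = {}
    for x, y in zip(xs, ys):
        # setdefault stores y for a new x, otherwise returns the previously stored value
        if seen.setdefault(x, y) != y:
            return False
    return True


def discover_pairwise_fds(columns):
    """Check all ordered pairs (i, j), i != j, and keep those where column i -> column j."""
    n = len(columns)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and fd_holds(columns[i], columns[j])
    ]
```

The number of candidate pairs grows quadratically with the number of columns, which is why restricting the enumeration to relevant pairs matters for wide inputs.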
This patch contains: 1. Refactoring of function result caching 2. Code to skip caching if a function contains Rand/Sample 3. A few bug fixes.
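The idea of skipping the cache for functions with non-deterministic operations can be sketched in Python as follows. This is a simplified illustration, not the SystemDS code: the `FunctionCache` class and the `deterministic` flag are assumptions, where the flag stands in for a compile-time check that the function body contains no Rand/Sample operation.

```python
class FunctionCache:
    """Caches results by (function name, arguments) unless the call is non-deterministic."""

    def __init__(self):
        self._results = {}
        self.hits = 0

    def call(self, fn, *args, deterministic=True):
        # Non-deterministic functions must always be re-executed, never cached.
        if not deterministic:
            return fn(*args)
        key = (fn.__name__, args)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        result = fn(*args)
        self._results[key] = result
        return result
```

Caching a function that draws random numbers would silently return the same sample on every call, which is why such functions bypass the cache.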
- Added entry for multilevel caching - Fixed bugs in rewrite stats
Outlier detection using standard deviation, and repair using row deletion, mean imputation, or median imputation. Closes apache#89.
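A minimal Python sketch of this kind of detection and repair on a single column is shown below. The function names, the threshold parameter `k`, and the choice of the population standard deviation are assumptions for illustration and may differ from the builtin.

```python
import statistics


def detect_outliers_sd(values, k=3.0):
    """Flag values that lie more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [abs(v - mean) > k * sd for v in values]


def repair(values, mask, method="mean"):
    """Repair flagged values by deletion, or by mean/median imputation from the inliers."""
    inliers = [v for v, bad in zip(values, mask) if not bad]
    if method == "delete":
        return inliers
    if method == "mean":
        fill = statistics.fmean(inliers)
    elif method == "median":
        fill = statistics.median(inliers)
    else:
        raise ValueError(f"unknown repair method: {method}")
    return [fill if bad else v for v, bad in zip(values, mask)]
```

Note that a single extreme value inflates both the mean and the standard deviation, so a smaller `k` may be needed on short columns.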
Outlier detection using IQR - Initial commit Closes apache#91.
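For illustration, IQR-based detection on a single column could look like the following Python sketch. The function name and the multiplier `k=1.5` are assumptions, and the quartile method is whatever `statistics.quantiles` uses by default, which may not match the builtin.

```python
import statistics


def detect_outliers_iqr(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]
```

Because quartiles are robust to extreme values, this check is less sensitive to the outliers themselves than a standard-deviation-based check.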
Optimizes rbind and cbind to only append federated metadata for the result. Closes apache#92.
This patch adds the unary builtin functions is.na (NA or NaN), is.nan (NaN), and is.infinite (-INF, +INF). All matrix text readers are now aware of NAs, but convert them to NaNs, which fall under the definition of NAs. Furthermore, this patch also removes unnecessary builtin function reuse for all individual builtin operations, which has no performance impact but reduces the code size.
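The described semantics can be mimicked in Python as follows. This is only a sketch of the behavior stated above: `parse_cell` is a hypothetical stand-in for a text reader, and because NAs are converted to NaNs on read, `is_na` and `is_nan` coincide on parsed values.

```python
import math


def parse_cell(token):
    """Parse one text cell; the literal NA is converted to NaN, as a matrix reader would."""
    token = token.strip()
    return math.nan if token == "NA" else float(token)


def is_nan(x):
    return math.isnan(x)


def is_na(x):
    # After reading, NA is represented as NaN, so both are covered by the same check.
    return math.isnan(x)


def is_infinite(x):
    # True for both -INF and +INF.
    return math.isinf(x)
```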
Basic JSONL Reader Implementation. Basic JSONL Writer Implementation. Basic Parallel JSONL Reader/Writer Implementation. Test Utils and WriteRead Tests DIA project. Closes apache#93.
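As a reminder of the JSONL format itself (one JSON document per line), here is a minimal Python round-trip sketch. It is unrelated to the actual SystemDS reader/writer classes; the function names are hypothetical.

```python
import json


def write_jsonl(rows, fp):
    """Write each row as one JSON document per line."""
    for row in rows:
        fp.write(json.dumps(row) + "\n")


def read_jsonl(fp):
    """Read one JSON document per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]
```

Since every line is an independent document, a file can be split at line boundaries and parsed in parallel, which is what makes the format attractive for parallel readers and writers.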
DIA project, part 2 Closes apache#94.
1. Included Mask and threshold as input parameters 2. Output Matrix FD contains the scores for FDs Closes apache#95.
DIA project data augmentation for data cleaning (outliers, missing values, typos, swapped columns) Closes apache#101.
- Upgrade the startup of the Federated Environment - Support for Default Port - Relative and static file path URL for Federated Worker - Minor Startup cleanup in Federated Worker - No need for extra file argument to start a federated worker Closes apache#98.
1) Fixed corrupted print of stop/parse issues to stderr 2) Fixed missing handling of tensors in federated instruction wrapping
- Builtin function for Multinomial Logistic Regression - Function test verifying integration Closes apache#107
- Added scripts for deployment of SystemDS on the Amazon EMR - Located in /scripts/aws/* Closes apache#112
Closes apache#954. Closes apache#961.
This patch fixes bugs in handling of multi-level cache duplicates, eviction and reading from disk. This also adds a new test, l2svm.
Minor change to the pom to update and re-enable code coverage in testing. When testing with the following command (replace ??? with the package name) `mvn test -DskipTests=false -Dtest=org.apache.sysds.???`, jacoco produces a folder containing a webpage in `target/site` that shows coverage. Closes apache#956
Adds Federated prefix to instructions, so the statistics returned show federated instruction executions just like Spark or GPU instructions. Minor fix in Startup of worker allowing log4j to work again. Closes apache#970
This commit moves the documentation back to the master branch. It also cleans up the previous documentation (by deleting it) so that we have a clean start. Furthermore, this commit merges the documentation on master back into the webpage documentation. Related PRs: apache#949 apache#922 Discussion Mails: https://tinyurl.com/yal7fd3r https://preview.tinyurl.com/yal7fd3r
This patch enables reuse for rand(matrix) and few more instructions. Furthermore, it fixes a bug in eviction logic that was forming cycles in the linked lists.
This patch contains a rewrite to reuse tsmm result in lmDS if called after PCA incrementally for increasing number of columns.
Extend Python API with more operations: - rev, t, order, cholesky, trigonometric ops (sin, cos, tan, asin, acos, atan, sinh, cosh, tanh) Also including Test cases and Docs update. Closes apache#975.
This patch fixes the logic of IPA scalar propagation into functions with multiple function calls. Similar to sizes, we check if literal function arguments have consistent values and propagate valid ones. However, this check had a logic problem of only checking if the first call was a literal. This missed cases where the first call had a scalar variable but the second call a valid scalar literal that could have been propagated individually.
This patch adds msvm w/ remote_spark parfor workers to the test suite and fixes missing support for tak+ operators in the recompute-by-lineage utility.
Adds support for the Protobuf file format, for both reads and writes. AMLS project SS2020, part 1 Closes apache#971
This patch adds basic lineage support to the MLContext API. Since in-memory objects are directly bound to the symbol table, lineage tracing viewed these objects as literals and incorrectly reused intermediates even if different in-memory objects were used in subsequent mlcontext invocations.
This patch makes a major refactoring of the lineage deduplication framework, including removed indirections and support for while loops and nested if program blocks. We now drop support for nested loops but this is fine as they are anyway split into many items and the biggest benefit comes from the last-level loop. In contrast, nested if blocks are critical in practice and this required a more generic collection of the lineage patches for all distinct paths (which we still do in a single pass over the loop body program). Additionally, we now support while loops with an integration very similar to for loops.
This patch fixes size propagation issues during parsing and recompilation for rbind/cbind operations over lists into a single matrix. Together with other rewrites, the incorrect size propagation led to invalid runtime plans. However, the additional tests with CV-lm still require an assertion to allow function inlining as a precondition for the fold-rewrite to eliminate redundancy. Solving this remaining issue requires a principled size propagation approach for matrix objects in lists.
This patch adds a new rewrite to partially reuse tsmm results in StepLM (forward).
- Privacy Constraint support for GLM - Privacy tests for GLM - Improved exception handling of federated responses - Log of checked privacy constraints This is to give the federated master information about which privacy constraints were violated and to be able to throw the actual exception on the master side. Add Initial Implementation of Checked Privacy Constraints Log This will enable the user to check which privacy constraints were retrieved during handling of federated instruction. This is an initial implementation since the checked privacy constraints are added to the federated response, but this is never retrieved by the federated master. If the privacy constraint is null for a checked data object, this is currently not logged. This could easily be changed by moving the put operation before the privacy constraint null check in the PrivacyMonitor. Closes apache#946
- New builtin for identifying cells which violate length constraints. - Replacing OutputInfo.CSVOutputInfo with Types.FileFormat.CSV 1. Operations are now consistent with their semantics, i.e., dropInvalidLength and dropInvalidType 2. Instead of identifying the invalid cells, "dropInvalidLength" now replaces the invalid values with null and returns a frame 3. Binary method changed from MMBinaryMethod.MR_BINARY_R to MMBinaryMethod.MR_BINARY_M 4. Spark broadcast replaced with PartitionedBroadcast
exclude protobuf in Jdocs Closes apache#923
Add FederatedWorkerHandlerException And Improved Handling of Exceptions in FederatedWorkerHandler
This patch makes the following performance improvements in the context of basic lineage tracing and lineage-based reuse probing: 1) Avoid string handling: Materialize the flag indicating whether a createvar instruction has a persistent-read prefix in the name, which avoids unnecessary string comparisons for ALL createvar instructions, so almost 30% of all instructions. 2) Apply the existing constant folding rewrite not just during static rewrites but now also as a cleanup rewrite in order to remove remaining constant expressions (introduced by rewrites) inside loops. This has an especially large impact on lineage tracing because constructing the lineage item is more expensive than the entire scalar operation. 3) Leverage the materialized hash code in lineage items as an early-out condition in the recursive equals check of lineage DAGs. This is especially useful where all lineage DAGs have the same repeated structure (e.g., from unrolled iterations) but a different input. The equals check would go all the way to the first difference, while the hash codes (aggregates over all inputs) very likely differ earlier. On a mini-batch scenario of 250,000 iterations with batch size 8 and 40 operations per iteration, the runtime w/o lineage was 65.6s, and the changes (1) and (2) improved the runtime with lineage tracing from 76.5s to 72.6s. Furthermore, we have also seen some improvements for reuse probing in this scenario, but this requires more work too.
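The hash-based early-out from item (3) can be illustrated with a small Python sketch. The `LineageItem` class below is a simplified, hypothetical stand-in for the Java class, assuming each item consists of an opcode and a tuple of input items.

```python
class LineageItem:
    """Simplified lineage DAG node with a hash materialized at construction time."""

    def __init__(self, opcode, inputs=()):
        self.opcode = opcode
        self.inputs = tuple(inputs)
        # The hash aggregates the opcode and the (already materialized) input hashes.
        self._hash = hash((opcode, tuple(i._hash for i in self.inputs)))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, LineageItem):
            return NotImplemented
        # Early out: different hashes imply different DAGs, so no recursion is needed.
        if self._hash != other._hash:
            return False
        # Equal hashes may still be a collision, so fall back to the recursive check.
        return self.opcode == other.opcode and self.inputs == other.inputs
```

For two long chains that differ only in their leaf input, the comparison typically stops at the root because the root hashes already differ, instead of descending through every level first.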
This patch fixes an interesting performance bug caused by the recursive hash computation of lineage items. Due to repeated operation sequences (from loop iterations) and integer overflows during the hash computation, there were systematic hash sequences within one lineage DAG. This in turn led to less pruning power on recursive equals computations, and collisions in the lineage cache, leading to even more recursive equals comparisons. The fix is simple. We now handle such overflows on hash aggregation (e.g., hash(int,int)) with a long instead of int hash function on demand. On the following test scenario `for(i in 1:1000) X = ((X + X) * 2 - X) / 3`, the previous runtime was 162s, while with this patch it is reduced to 0.244s. Even with 10K iterations, the runtime is still 1.1s, which suggests that any super-linear behavior has been eliminated.
This patch makes some minor performance improvements to the lineage reuse probing and cache put operations. Specifically, we now avoid unnecessary lineage hashing and comparisons by using lists instead of hash maps, move the time computations into the reuse path (to not affect the code path without lineage reuse), avoid unnecessary branching, and materialize the score of cache entries to avoid repeated computation for the log N comparisons per add/remove/contains operation. For 100K iterations and ~40 ops per iteration, lineage tracing w/ reuse improved from 41.9s to 38.8s (pure lineage tracing: 27.9s).
This patch makes a minor performance improvement to the important partial rewrite tsmm(cbind(X,v)) to tsmm(X) + compensation plan, by avoiding cbind(X, v)[,1:n-1] to extract X if X is still available in the lineage cache. This avoids unnecessary allocation and copies.
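The algebra behind this partial rewrite can be sketched in plain Python, assuming tsmm(X) denotes t(X) %*% X. The helper names are hypothetical, and the sketch uses nested lists rather than any SystemDS data structure: the result for cbind(X, v) is assembled from the cached tsmm(X) plus the compensation terms t(X) %*% v and t(v) %*% v.

```python
def tsmm(X):
    """Compute t(X) %*% X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def tsmm_append_column(XtX, X, v):
    """Build tsmm(cbind(X, v)) from a cached tsmm(X) plus compensation terms."""
    # t(X) %*% v: one dot product per existing column of X
    Xtv = [sum(row[j] * vi for row, vi in zip(X, v)) for j in range(len(XtX))]
    # t(v) %*% v: the new bottom-right entry
    vtv = sum(vi * vi for vi in v)
    # Extend each cached row by one entry and append the new last row.
    return [r + [Xtv[j]] for j, r in enumerate(XtX)] + [Xtv + [vtv]]
```

With n existing columns, the compensation needs only n + 1 dot products instead of recomputing all (n + 1)^2 entries.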
simply updated systemml links to systemds in header.html file
Simply changed systemml links to systemds in header.html file.