ML code to sample a file based on cheap, easily attainable features of that file.
python xtract_sampler_main.py --mode train --classifier ex1 --feature ex2 --label_csv ex3
- `ex1` should be either rf, svc, or logit for a random forest, support vector classification, or logistic regression model.
- `ex2` should be either head, rand, or randhead to set the features as bytes from the head of the file, random bytes, or a mixture of both.
- `ex3` is the path to a .csv file with the file path, file size, and file label for files to train on.
- Additional `--head_bytes` and `--rand_bytes` parameters can be passed to specify the number of bytes to take from the file (the default is 512 bytes if these parameters aren't passed).
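
To make the three feature types concrete, here is a minimal sketch of how head, rand, and randhead byte features could be read from a file. It is not the code in xtract_sampler_main.py: the helper name `extract_byte_features`, the zero-padding of short files, the `seed` argument, and the ordering of head bytes before random bytes are all assumptions made for illustration.

```python
import os
import random


def extract_byte_features(path, feature="head", head_bytes=512, rand_bytes=512, seed=None):
    """Return a fixed-length list of byte values (0-255) for one file.

    Hypothetical helper: the padding policy and sampling scheme are assumptions.
    """
    if feature not in ("head", "rand", "randhead"):
        raise ValueError(f"unknown feature type: {feature}")

    size = os.path.getsize(path)
    rng = random.Random(seed)
    values = []

    with open(path, "rb") as fh:
        if feature in ("head", "randhead"):
            head = fh.read(head_bytes)
            # Pad short files with zeros so every file yields the same vector length.
            values.extend(head.ljust(head_bytes, b"\x00"))
        if feature in ("rand", "randhead"):
            for _ in range(rand_bytes):
                if size == 0:
                    # An empty file has nothing to sample; fall back to zero.
                    values.append(0)
                    continue
                # Jump to a random offset and read a single byte.
                fh.seek(rng.randrange(size))
                values.append(fh.read(1)[0])
    return values
```

With the defaults, head and rand each give a vector of 512 values, and randhead gives 1024 (512 head bytes followed by 512 random bytes).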
python xtract_sampler_main.py --mode predict --trained_classifier ex1 --feature ex2 --predict_file ex3
- `ex1` is the path to a trained classifier, trained using the training mode of xtract_sampler_main.py.
- `ex2` is the type of feature that `ex1` was trained on (head, rand, randhead).
  - Note: If a `--head_bytes` or `--rand_bytes` value was passed during training, the same value should be passed during predicting.
- `ex3` is the path to the file to predict on.
  - Alternatively, to predict on a directory, use `--dirname ex3` instead of `--predict_file ex3`.
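
Because the byte counts used in training must be repeated at prediction time, it can help to build the prediction command from the same values. The sketch below only assembles an argument list from the flags described above; `byte_flags` and `predict_command` are hypothetical helper names, and choosing `--dirname` by checking whether the target is an existing directory is an assumption of this sketch.

```python
import os


def byte_flags(head_bytes=None, rand_bytes=None):
    """Return the optional byte-count flags, omitting any that were not set."""
    flags = []
    if head_bytes is not None:
        flags += ["--head_bytes", str(head_bytes)]
    if rand_bytes is not None:
        flags += ["--rand_bytes", str(rand_bytes)]
    return flags


def predict_command(model_path, feature, target, head_bytes=None, rand_bytes=None):
    """Build the argument list for prediction mode (hypothetical helper)."""
    # Use --dirname for a directory and --predict_file for anything else.
    target_flag = "--dirname" if os.path.isdir(target) else "--predict_file"
    return [
        "python", "xtract_sampler_main.py",
        "--mode", "predict",
        "--trained_classifier", model_path,
        "--feature", feature,
        target_flag, target,
    ] + byte_flags(head_bytes, rand_bytes)
```

The returned list can be passed to `subprocess.run`, with the same `head_bytes` and `rand_bytes` values that were used for training.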
Two-phase automated training allows users to generate labels and save features for multiple directories before training on those features and labels.
1. python xtract_sampler_main.py --mode labels_features --dirname ex1 --features_outfile ex2 --csv_outfile ex3 --features ex4
   - `ex1` is the directory to generate labels from and to grab features from.
   - `ex2` is the name/path to the .pkl file to write file features to.
   - `ex3` is the name/path to the .csv file to write labels to.
   - `ex4` should be either head, rand, or randhead to set the features as bytes from the head of the file, random bytes, or a mixture of both.
   - Additional `--head_bytes` and `--rand_bytes` parameters can be passed to specify the number of bytes to take from the file (the default is 512 bytes if these parameters aren't passed).
2. Repeat step 1 with as many directories as you want. However, `--features_outfile` and `--features` must always be the same. Additionally, if `--head_bytes` or `--rand_bytes` is passed, they must stay the same too.
3. python xtract_sampler_main.py --mode train --classifier ex1 --features ex2 --features_outfile ex3
   - `ex1` should be either rf, svc, or logit for a random forest, support vector classification, or logistic regression model.
   - `ex2` should be either head, rand, or randhead for the features to be bytes from the head of the file, random bytes, or a mixture of both.
   - `ex3` is the name/path of the .pkl file passed to `--features_outfile` in steps 1 and 2.
   - Note: If a `--head_bytes` or `--rand_bytes` value was passed during steps 1 and 2, the same value should be passed here.
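
The three steps above can be sketched as a small planner that emits one labels_features command per directory followed by a single train command, all sharing the same feature settings. `two_phase_commands` is a hypothetical helper, and passing one `--csv_outfile` per directory is an assumption, since the steps only require `--features_outfile` and `--features` to stay fixed.

```python
def two_phase_commands(dir_csv_pairs, features_outfile, features, classifier,
                       head_bytes=None, rand_bytes=None):
    """Return argument lists for steps 1-3 of two-phase training (hypothetical helper)."""
    if classifier not in ("rf", "svc", "logit"):
        raise ValueError(f"unknown classifier: {classifier}")
    if features not in ("head", "rand", "randhead"):
        raise ValueError(f"unknown feature type: {features}")

    # Settings that must be identical in every command.
    shared = ["--features_outfile", features_outfile, "--features", features]
    if head_bytes is not None:
        shared += ["--head_bytes", str(head_bytes)]
    if rand_bytes is not None:
        shared += ["--rand_bytes", str(rand_bytes)]

    base = ["python", "xtract_sampler_main.py"]
    commands = []
    # Steps 1 and 2: one labels_features run per directory.
    for dirname, csv_outfile in dir_csv_pairs:
        commands.append(base + ["--mode", "labels_features",
                                "--dirname", dirname,
                                "--csv_outfile", csv_outfile] + shared)
    # Step 3: train once on the accumulated features.
    commands.append(base + ["--mode", "train", "--classifier", classifier] + shared)
    return commands
```

Building every command from one `shared` list keeps the settings from drifting between directories.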
- Models created using the training mode will be saved under the name `classifier-feature-date.pkl`, where classifier and feature are the values passed to the command line and date is the current date. Training a model will also create a .json file named `classifier-feature-date.json` that will contain training times and accuracy results about the trained model. To change the model name, pass `--model_name ex1`, where ex1 is the name of the file to save the model to.
- Predictions from the prediction mode will be saved under the name `sampler_results.json`. To change this, pass `--results_file ex1`, where ex1 is the name of the file to save prediction results to.
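
As a rough illustration of the default naming scheme, the sketch below derives the model and report file names from the classifier, the feature type, and a date. `default_model_paths` is a hypothetical helper, and the ISO `YYYY-MM-DD` date format is an assumption; the format the script actually uses may differ.

```python
from datetime import date


def default_model_paths(classifier, feature, today=None):
    """Return (model_file, report_file) names following classifier-feature-date."""
    # Assumed date format: ISO 8601 (YYYY-MM-DD).
    stamp = (today or date.today()).isoformat()
    stem = f"{classifier}-{feature}-{stamp}"
    return stem + ".pkl", stem + ".json"
```

For example, `default_model_paths("rf", "head", date(2020, 1, 2))` gives `("rf-head-2020-01-02.pkl", "rf-head-2020-01-02.json")` under that assumed date format.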