
Conversation

@AdamBanham (Owner) commented Dec 21, 2022

We have two modules for generating logs. dtlog is a quick and simple way to generate simplified logs without any fuss, while generate offers alternative approaches with many options for generating traces and/or specifying the data that should be generated for events.

Features

  • dtlog to remain the same as before (a minimal sketch of its use follows this list).
  • generate to offer more complex trace generation patterns.
    • two such patterns are introduced: augmented delimited patterns and grammar-based generation.
    • generate.generate_log allows for the creation of a log using augmented patterns (see example 1).
    • generate.generate_from_grammar allows for the creation of a log using a grammar-based approach (see example 2).
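
For reference, here is a minimal sketch of dtlog usage, built on the convert helper that is also imported in example 2 below; treat it as an illustrative snippet rather than a full walkthrough.

from pmkoalas.dtlog import convert

# build a simplified log straight from delimited traces,
# with no augments and no event data attached
log = convert(
    "a b e f",
    "a b e c d b f",
)
print(log)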

Augmented patterns

Example 1

See the code snippet below for an example. It shows the augmented patterns for multiplying a trace (^X) and rolling for a data issue (%dX); these can be combined to generate many traces, each with its own roll for a data issue.

from pmkoalas.generate import generate_log

# generate from lists
variant_a = ["a b e f || ^20"]
variant_b = ["a b e c d b f || ^30"]
# each generated trace could have a data issue
variant_c = ["a b c e d b f || ^20 %d25"] 
variants = variant_a + variant_b + variant_c
log = generate_log(*variants)
# or 
log = generate_log(
    "a b e f || ^20",
    "a b e c d b f || ^30",
    "a b c e d b f || ^20 %d25"
)
print(log)

# show some __repr__
print(log.__repr__())
print(log.language().pop().__repr__())
print(log.directly_follow_relations().__repr__())

Grammar-based approach

TODOs

  • need to implement the 'limit' shift in the grammar, for all types.
  • add a data issue chance?
    • should this be on the pattern or the set of patterns?
  • allow for many 'log' elements in the grammar?

Grammar

<system> :: <log> | <domain> <log> | <log> <issue> | <domain> <log> <issue>

<log> :: [Patterns]{<nonzero>} <trace>
<trace> :: [ <event> ]{<nonzero>} | [<event>]{<nonzero>} <trace>
<event> :: <event> <event> | <word> | <word>{<data> }
<word> :: <ascii> | <ascii><word>
<data> :: <attr> | <attr>|<shift> | <data>, <data>
<attr> :: d_<alldigits>
<shift> :: <limit> | <lshift> | <rshift> | <mshift>
<lshift> :: <halfnumber>%-left
<rshift> :: <halfnumber>%-right 
<mshift> :: <halfnumber>%-m-<halfnumber>%
<limit> :: <<<number> | >><number>
<halfnumber> :: <nonzerodigits> | <halfdigits><halfdigits>
<number> :: <alldigits> | <number><number>
<nonzero> :: <nonzerodigits> | <halfdigits><alldigits>
<alldigits> :: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<nonzerodigits> :: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<halfdigits> :: 1 | 2 | 3 | 4 | 5
<ascii> :: a | b | c | ... | x | y | z ** anything that matches [a-zA-Z0-9_]*

<domain> :: [Domains] <attribute>
<attribute> :: <attribute> <attribute> | <attr>-<type> | <attr>-<type>-<dist>
<type> :: int | float | string | bool
<dist> :: <disttype> | <disttype>-<number>
<disttype> :: normal | uniform
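
To make the grammar concrete, below is a small hypothetical system string annotated against the rules above; the attribute name, counts, and percentages are illustrative assumptions, and CASE_SYSTEM in example 2 remains the reference usage.

from pmkoalas.generate import generate_from_grammar

# a small, hypothetical <system> instance, read against the grammar above:
#   Domains: ...        -> <domain>, declaring <attr>-<type>-<dist> entries
#   d_1-int-normal-50   -> attribute d_1 is an int drawn from a normal distribution (parameter 50)
#   Patterns:{10}       -> <log>, where {10} is presumably how many traces to sample
#   [ A{...} B ]{2}     -> a <trace> of two <event>s with a relative weight of 2
#   d_1|25%-left        -> <attr>|<lshift>, shifting d_1's samples 25% towards the lower end
MINIMAL_SYSTEM = """
Domains:
    d_1-int-normal-50

Patterns:{10}
    [ A{d_1|25%-left, } B ]{2}
    [ A{d_1|25%-right, } C ]{1}
"""

log = generate_from_grammar(MINIMAL_SYSTEM)
print(log)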

Example 2

See the example code snippet below for using the grammar to generate a log. This example has a system with two xor choices: the first has a discriminative cut on d_1, and the second has a somewhat discriminative cut on d_4. Histograms are shown afterwards to showcase the effect of each shift on process attributes.

from pmkoalas.generate import generate_from_grammar
from pmkoalas.complex import ComplexEventLog
from pmkoalas.dtlog import convert

from string import ascii_lowercase

from tqdm import tqdm

from matplotlib import pyplot as plt
from matplotlib.cm import get_cmap

viridis = get_cmap("viridis_r")

TRACES_A_B = convert( 
    "A B E F",
    "A B E G",
    "A B E H",
)
TRACES_A_C = convert( 
    "A C E F",
    "A C E G",
    "A C E H",
)
TRACES_A_D = convert( 
    "A D E F",
    "A D E G",
    "A D E H",
)
TRACES_A_F = convert(
    "A B E F",
    "A C E F",
    "A D E F",
)
TRACES_A_G = convert(
    "A B E G",
    "A C E G",
    "A D E G",
)
TRACES_A_H = convert( 
    "A B E H",
    "A C E H",
    "A D E H",
)

CASE_SYSTEM = """
Domains: 
    d_1-int-normal-50
    d_2-string-normal
    d_3-string-uniform
    d_4-float-normal-50
    d_5-float-uniform
    d_6-bool-uniform
    
Patterns:{5000} 
    [ A{d_1|25%-left, d_2|5%-left,  } B E{d_4|10%-left, d_5, } F ]{4} 
    [ A{d_1|25%-left, d_2|10%-left,  } B E{d_4|10%-left, d_6, } G ]{2} 
    [ A{d_1|25%-left, d_2|15%-left,  } B E{d_4|10%-left,      } H ]{1} 
    [ A{d_1|20%-m-20%, d_3, } C E{d_4|25%-m-25%, d_5, } F ]{4} 
    [ A{d_1|20%-m-20%, d_3, } C E{d_4|25%-m-25%, d_6, } G ]{2} 
    [ A{d_1|20%-m-20%, d_3, } C E{d_4|25%-m-25%,      } H ]{1} 
    [ A{d_1|25%-right, d_3, } D E{d_4|10%-right, d_5, } F ]{4} 
    [ A{d_1|25%-right, d_3, } D E{d_4|10%-right, d_6, } G ]{2} 
    [ A{d_1|25%-right, d_3, } D E{d_4|10%-right,      } H ]{1}
"""

def creation_example():
    # collect data and make some histograms to show that the shifts work
    values_d1_a_b = []
    values_d1_a_c = []
    values_d1_a_d = []
    values_d2_a_f = []
    values_d2_a_g = []
    values_d2_a_h = []
    values_d2_a_c = []
    values_d2_a_d = []
    values_d4_a_f = []
    values_d4_a_g = []
    values_d4_a_h = []
    # make a lot of batches
    for _ in tqdm(range(100)):
        tqdm.write("making batch...")
        log: ComplexEventLog = generate_from_grammar(CASE_SYSTEM)
        tqdm.write("made batch log...")
        tqdm.write("with variants...")
        # show variant count
        # print(f"#variants = {log.get_nvariants()}")
        scores = []
        for trace, insts in log:
            scores.append([str(trace),len(insts) ])
        scores.sort(key=lambda x : x[0])
        for trace, size in scores:
            tqdm.write(f"\t...variant : {trace} x {size}")
        # add points to containers
        for trace, insts in log:
            if trace in TRACES_A_B.language():
                for inst in insts:
                    values_d1_a_b.append(inst[0].data()['d_1'])
            if trace in TRACES_A_C.language():
                for inst in insts:
                    values_d1_a_c.append(inst[0].data()['d_1'])
                    letter = inst[0].data()['d_3']
                    val = ascii_lowercase.index(letter)
                    values_d2_a_c.append(val+0.5) 
            if trace in TRACES_A_D.language():
                for inst in insts:
                    values_d1_a_d.append(inst[0].data()['d_1'])
                    letter = inst[0].data()['d_3']
                    val = ascii_lowercase.index(letter)
                    values_d2_a_d.append(val+0.5) 
            if trace in TRACES_A_F.language():
                for inst in insts:
                    values_d4_a_f.append(inst[2].data()['d_4'])
                    if trace in TRACES_A_B.language():
                        letter = inst[0].data()['d_2']
                        val = ascii_lowercase.index(letter)
                        values_d2_a_f.append(val+0.5)
            if trace in TRACES_A_G.language():
                for inst in insts:
                    values_d4_a_g.append(inst[2].data()['d_4'])
                    if trace in TRACES_A_B.language():
                        letter = inst[0].data()['d_2']
                        val = ascii_lowercase.index(letter)
                        values_d2_a_g.append(val+0.5)
            if trace in TRACES_A_H.language():
                for inst in insts:
                    values_d4_a_h.append(inst[2].data()['d_4'])
                    if trace in TRACES_A_B.language():
                        letter = inst[0].data()['d_2']
                        val = ascii_lowercase.index(letter)
                        values_d2_a_h.append(val+0.5)
    # set up figures
    fig_d1 = plt.figure(0, figsize=(4,3), dpi=200)
    bins = [ n for n in range(20,80,2)]
    ax = fig_d1.subplots(1,1)
    ax.hist([values_d1_a_b,values_d1_a_c,values_d1_a_d], 
            alpha=0.88, label=["a -> b", "a -> c", "a -> d"], bins=bins, 
            rwidth=0.8, histtype='barstacked')
    ax.legend(fontsize=6)
    ax.set_title("Analysis of d_1")
    ax.set_xlim([19,81])
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(7)
    fig_d1.show()
    plt.show()
    fig_d2 = plt.figure(0, figsize=(4,3), dpi=200)
    ax = fig_d2.subplots(1,1)
    ax.hist([ values_d2_a_c,values_d2_a_d,
        values_d2_a_f,values_d2_a_g, values_d2_a_h], 
            label=["a -> c", "a -> d","a -> f","a -> g", "a -> h", ],
            color=[ viridis((1/5.0) * n) for n in range(5) ],
            bins=[ n for n in range(27)],
            align='mid', alpha=0.88,  
            rwidth=0.8, histtype='barstacked')
    ax.legend(fontsize=6)
    ax.set_xlim([-1,27])
    ax.set_xticks(
        [ n + 0.5 for n in range(26) ]
    )
    ax.set_xticklabels(
        [ l for l in ascii_lowercase ]
    )
    ax.set_title("Analysis of d_2")
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(7)
    fig_d2.show()
    plt.show()
    fig_d4 = plt.figure(1, figsize=(4,3), dpi=200, facecolor=None)
    bins = [ n for n in range(20,80,2)]
    ax = fig_d4.subplots(1,1)
    ax.hist([values_d4_a_f,values_d4_a_g,values_d4_a_h], 
            alpha=0.88, label=["a -> f", "a -> g", "a -> h"], bins=bins, 
            rwidth=0.8, histtype='barstacked')
    ax.legend(fontsize=6)
    ax.set_title("Analysis of d_4")
    ax.set_xlim([19,81])
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(7)
    fig_d4.show()
    plt.show()

if __name__ == "__main__":
    creation_example()

Histogram for d_1 after 'A' (image: pmkoalas_d1_histogram)
Histogram for d_2 and d_3 after 'A' (image: pmkoalas_d2_histogram)
Histogram for d_4 after 'E' (image: pmkoalas_d4_histogram)

This seems more appropriate for general audiences and future-proofs the package API.
@AdamBanham added the 'enhancement' (New feature or request) label Dec 21, 2022
@AdamBanham added this to the v0.0.1 milestone Dec 21, 2022
@adamburkegh (Collaborator)

But it's not generation, it's conversion. All the information required is in the input. I wouldn't rename until there is a clear second case to generalize from.

@AdamBanham added the 'holding' (revisiting later) label Dec 21, 2022
@AdamBanham (Owner, Author)

Fair enough, holding off on the merge until then. We can circle back at a later date.

  • a multiply trace sequence, ^X.
  • a data issue chance, %dX.

These can work in combination with each other.
@AdamBanham (Owner, Author)

So I have made some additions to the form of the delimited traces. Each delimited trace can have some augments attached.
I have made two augments: one for multiplying the number of traces (as I found myself repeating a single sequence) and another for having a data issue occur with a given X% chance. I made the assumption that each augment is triggered in an ordered manner, but in testing it I also liked the idea of there being no fixed order. The latter could mean that users construct complex generation sequences, e.g. multiply the starting sequence 5 times, then apply a data issue at 25% to each of these, and then multiply each sequence by 50 (a hypothetical sketch of such a chained pattern follows).
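
A hypothetical sketch of such a chained pattern is shown below, assuming augments are applied left to right; since this ordering behaviour is exactly what is up for discussion, the pattern string is illustrative only.

from koalas.generate import gen_log

# hypothetical: multiply the base sequence by 5, roll a 25% data issue
# on each copy, then multiply every resulting sequence by 50
log = gen_log("a b e f || ^5 %d25 ^50")
print(log)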

Thoughts on this direction?

from koalas.generate import gen_log

# generate from lists
variant_a = ["a b e f || ^20"]
variant_b = ["a b e c d b f || ^30"]
# each generated trace could have a data issue
variant_c = ["a b c e d b f || ^20 %d25"] 
variants = variant_a + variant_b + variant_c
log = gen_log(*variants)
print(log)

# show some __repr__
print(log.__repr__())
print(log.language().pop().__repr__())
print(log.directly_follow_relations().__repr__())

Which produces the following:

[<a,b,e,f>^20,<a,b,e,c,d,b,f>^30,<a,b,c,e,d,b,f>^16,<a,b,c,e,b,f>^1,<e,b,c,a,d,b,f>^1,<a,b,c,d,e,b,f>^1,<a,b,c,e,d,f,b>^1]
EventLog(
	[Trace(['a','b','e','f'])] * 20+
	[Trace(['a','b','e','c','d','b','f'])] * 30+
	[Trace(['a','b','c','e','d','b','f'])] * 16+
	[Trace(['a','b','c','e','b','f'])] * 1+
	[Trace(['e','b','c','a','d','b','f'])] * 1+
	[Trace(['a','b','c','d','e','b','f'])] * 1+
	[Trace(['a','b','c','e','d','f','b'])] * 1
)
Trace(['a','b','c','e','d','f','b'])
FlowLanguage([
	DirectlyFlowsPair(left='SOURCE',right='a',freq=69),
	DirectlyFlowsPair(left='a',right='b',freq=69),
	DirectlyFlowsPair(left='b',right='e',freq=50),
	DirectlyFlowsPair(left='e',right='f',freq=20),
	DirectlyFlowsPair(left='f',right='END',freq=69),
	DirectlyFlowsPair(left='e',right='c',freq=30),
	DirectlyFlowsPair(left='c',right='d',freq=31),
	DirectlyFlowsPair(left='d',right='b',freq=47),
	DirectlyFlowsPair(left='b',right='f',freq=49),
	DirectlyFlowsPair(left='b',right='c',freq=20),
	DirectlyFlowsPair(left='c',right='e',freq=18),
	DirectlyFlowsPair(left='e',right='d',freq=17),
	DirectlyFlowsPair(left='e',right='b',freq=3),
	DirectlyFlowsPair(left='SOURCE',right='e',freq=1),
	DirectlyFlowsPair(left='c',right='a',freq=1),
	DirectlyFlowsPair(left='a',right='d',freq=1),
	DirectlyFlowsPair(left='d',right='e',freq=1),
	DirectlyFlowsPair(left='d',right='f',freq=1),
	DirectlyFlowsPair(left='f',right='b',freq=1),
	DirectlyFlowsPair(left='b',right='END',freq=1),
])

@adamburkegh (Collaborator) commented Jan 4, 2023 via email

@AdamBanham changed the title from 'renaming dtlog to new generate module' to 'Adding generate module for complex grammars' Jun 23, 2023
