
Conversation

@AdamBanham (Owner) commented Dec 21, 2022

We have two modules for generating logs. dtlog is a quick and simple way to generate simplified logs without any fuss, while generate offers alternative approaches with many options for generating traces and/or specifying the data that should be generated for events.

Features

  • dtlog to remain the same as before (a minimal sketch of its use follows this list).
  • generate to offer more complex trace generation patterns.
    • two such patterns are introduced: augmented delimited patterns and grammar-based generation.
    • generate.generate_log allows for the creation of a log using augmented patterns (see example 1).
    • generate.generate_from_grammar allows for the creation of a log using a grammar-based approach (see example 2).
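
For reference, here is a minimal sketch of dtlog usage, built on the convert helper that is also imported in example 2 below; treat it as an illustrative snippet rather than a full walkthrough.

from pmkoalas.dtlog import convert

# build a simplified log straight from delimited traces,
# with no augments and no event data attached
log = convert(
    "a b e f",
    "a b e c d b f",
)
print(log)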

Augmented patterns

Example 1

See the code snippet below for an example. It shows the augmented patterns for multiplying a trace (^X) and rolling for a data issue (%dX); these can be combined to generate many traces, each with its own roll for a data issue.

from pmkoalas.generate import generate_log

# generate from lists
variant_a = ["a b e f || ^20"]
variant_b = ["a b e c d b f || ^30"]
# each generated trace could have a data issue
variant_c = ["a b c e d b f || ^20 %d25"] 
variants = variant_a + variant_b + variant_c
log = generate_log(*variants)
# or 
log = generate_log(
    "a b e f || ^20",
    "a b e c d b f || ^30",
    "a b c e d b f || ^20 %d25"
)
print(log)

# show some __repr__
print(log.__repr__())
print(log.language().pop().__repr__())
print(log.directly_follow_relations().__repr__())

Grammar-based approach

TODOs

  • need to implement the 'limit' shift in the grammar, for all types.
  • add a data issue chance?
    • should this be on the pattern or the set of patterns?
  • allow for many 'log' elements in the grammar?

Grammar

<system> :: <log> | <domain> <log> | <log> <issue> | <domain> <log> <issue>

<log> :: [Patterns]{<nonzero>} <trace>
<trace> :: [ <event> ]{<nonzero>} | [<event>]{<nonzero>} <trace>
<event> :: <event> <event> | <word> | <word>{<data> }
<word> :: <ascii> | <ascii><word>
<data> :: <attr> | <attr>|<shift> | <data>, <data>
<attr> :: d_<alldigits>
<shift> :: <limit> | <lshift> | <rshift> | <mshift>
<lshift> :: <halfnumber>%-left
<rshift> :: <halfnumber>%-right 
<mshift> :: <halfnumber>%-m-<halfnumber>%
<limit> :: <<<number> | >><number>
<halfnumber> :: <nonzerodigits> | <halfdigits><halfdigits>
<number> :: <alldigits> | <number><number>
<nonzero> :: <nonzerodigits> | <halfdigits><alldigits>
<alldigits> :: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<nonzerodigits> :: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<halfdigits> :: 1 | 2 | 3 | 4 | 5
<ascii> :: a | b | c | ... | x | y | z ** anything that matches [a-zA-Z0-9_]*

<domain> :: [Domains] <attribute>
<attribute> :: <attribute> <attribute> | <attr>-<type> | <attr>-<type>-<dist>
<type> :: int | float | string | bool
<dist> :: <disttype> | <disttype>-<number>
<disttype> :: normal | uniform
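
To make the grammar concrete, below is a small hypothetical system string annotated against the rules above; the attribute name, counts, and percentages are illustrative assumptions, and CASE_SYSTEM in example 2 remains the reference usage.

from pmkoalas.generate import generate_from_grammar

# a small, hypothetical <system> instance, read against the grammar above:
#   Domains: ...        -> <domain>, declaring <attr>-<type>-<dist> entries
#   d_1-int-normal-50   -> attribute d_1 is an int drawn from a normal distribution (parameter 50)
#   Patterns:{10}       -> <log>, where {10} is presumably how many traces to sample
#   [ A{...} B ]{2}     -> a <trace> of two <event>s with a relative weight of 2
#   d_1|25%-left        -> <attr>|<lshift>, shifting d_1's samples 25% towards the lower end
MINIMAL_SYSTEM = """
Domains:
    d_1-int-normal-50

Patterns:{10}
    [ A{d_1|25%-left, } B ]{2}
    [ A{d_1|25%-right, } C ]{1}
"""

log = generate_from_grammar(MINIMAL_SYSTEM)
print(log)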

Example 2

See the example code snippet below for using the grammar to generate a log. This example has a system with two xor choices: the first has a discriminative cut on d_1, and the second has a somewhat discriminative cut on d_4. Histograms are shown afterwards to showcase the effect of each shift on process attributes.

from pmkoalas.generate import generate_from_grammar
from pmkoalas.complex import ComplexEventLog
from pmkoalas.dtlog import convert

from string import ascii_lowercase

from tqdm import tqdm

from matplotlib import pyplot as plt
from matplotlib.cm import get_cmap

viridis = get_cmap("viridis_r")

TRACES_A_B = convert( 
    "A B E F",
    "A B E G",
    "A B E H",
)
TRACES_A_C = convert( 
    "A C E F",
    "A C E G",
    "A C E H",
)
TRACES_A_D = convert( 
    "A D E F",
    "A D E G",
    "A D E H",
)
TRACES_A_F = convert(
    "A B E F",
    "A C E F",
    "A D E F",
)
TRACES_A_G = convert(
    "A B E G",
    "A C E G",
    "A D E G",
)
TRACES_A_H = convert( 
    "A B E H",
    "A C E H",
    "A D E H",
)

CASE_SYSTEM = """
Domains: 
    d_1-int-normal-50
    d_2-string-normal
    d_3-string-uniform
    d_4-float-normal-50
    d_5-float-uniform
    d_6-bool-uniform
    
Patterns:{5000} 
    [ A{d_1|25%-left, d_2|5%-left,  } B E{d_4|10%-left, d_5, } F ]{4} 
    [ A{d_1|25%-left, d_2|10%-left,  } B E{d_4|10%-left, d_6, } G ]{2} 
    [ A{d_1|25%-left, d_2|15%-left,  } B E{d_4|10%-left,      } H ]{1} 
    [ A{d_1|20%-m-20%, d_3, } C E{d_4|25%-m-25%, d_5, } F ]{4} 
    [ A{d_1|20%-m-20%, d_3, } C E{d_4|25%-m-25%, d_6, } G ]{2} 
    [ A{d_1|20%-m-20%, d_3, } C E{d_4|25%-m-25%,      } H ]{1} 
    [ A{d_1|25%-right, d_3, } D E{d_4|10%-right, d_5, } F ]{4} 
    [ A{d_1|25%-right, d_3, } D E{d_4|10%-right, d_6, } G ]{2} 
    [ A{d_1|25%-right, d_3, } D E{d_4|10%-right,      } H ]{1}
"""

def creation_example():
    # collect data and make some histograms to show that the shifts work
    values_d1_a_b = []
    values_d1_a_c = []
    values_d1_a_d = []
    values_d2_a_f = []
    values_d2_a_g = []
    values_d2_a_h = []
    values_d2_a_c = []
    values_d2_a_d = []
    values_d4_a_f = []
    values_d4_a_g = []
    values_d4_a_h = []
    # make a lot of batches
    for _ in tqdm(range(100)):
        tqdm.write("making batch...")
        log: ComplexEventLog = generate_from_grammar(CASE_SYSTEM)
        tqdm.write("made batch log...")
        tqdm.write("with variants...")
        # show variant count
        # print(f"#variants = {log.get_nvariants()}")
        scores = []
        for trace, insts in log:
            scores.append([str(trace),len(insts) ])
        scores.sort(key=lambda x : x[0])
        for trace, size in scores:
            tqdm.write(f"\t...variant : {trace} x {size}")
        # add points to containers
        for trace, insts in log:
            if trace in TRACES_A_B.language():
                for inst in insts:
                    values_d1_a_b.append(inst[0].data()['d_1'])
            if trace in TRACES_A_C.language():
                for inst in insts:
                    values_d1_a_c.append(inst[0].data()['d_1'])
                    letter = inst[0].data()['d_3']
                    val = ascii_lowercase.index(letter)
                    values_d2_a_c.append(val+0.5) 
            if trace in TRACES_A_D.language():
                for inst in insts:
                    values_d1_a_d.append(inst[0].data()['d_1'])
                    letter = inst[0].data()['d_3']
                    val = ascii_lowercase.index(letter)
                    values_d2_a_d.append(val+0.5) 
            if trace in TRACES_A_F.language():
                for inst in insts:
                    values_d4_a_f.append(inst[2].data()['d_4'])
                    if trace in TRACES_A_B.language():
                        letter = inst[0].data()['d_2']
                        val = ascii_lowercase.index(letter)
                        values_d2_a_f.append(val+0.5)
            if trace in TRACES_A_G.language():
                for inst in insts:
                    values_d4_a_g.append(inst[2].data()['d_4'])
                    if trace in TRACES_A_B.language():
                        letter = inst[0].data()['d_2']
                        val = ascii_lowercase.index(letter)
                        values_d2_a_g.append(val+0.5)
            if trace in TRACES_A_H.language():
                for inst in insts:
                    values_d4_a_h.append(inst[2].data()['d_4'])
                    if trace in TRACES_A_B.language():
                        letter = inst[0].data()['d_2']
                        val = ascii_lowercase.index(letter)
                        values_d2_a_h.append(val+0.5)
    # set up figures
    fig_d1 = plt.figure(0, figsize=(4,3), dpi=200)
    bins = [ n for n in range(20,80,2)]
    ax = fig_d1.subplots(1,1)
    ax.hist([values_d1_a_b,values_d1_a_c,values_d1_a_d], 
            alpha=0.88, label=["a -> b", "a -> c", "a -> d"], bins=bins, 
            rwidth=0.8, histtype='barstacked')
    ax.legend(fontsize=6)
    ax.set_title("Analysis of d_1")
    ax.set_xlim([19,81])
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(7)
    fig_d1.show()
    plt.show()
    fig_d2 = plt.figure(0, figsize=(4,3), dpi=200)
    ax = fig_d2.subplots(1,1)
    ax.hist([ values_d2_a_c,values_d2_a_d,
        values_d2_a_f,values_d2_a_g, values_d2_a_h], 
            label=["a -> c", "a -> d","a -> f","a -> g", "a -> h", ],
            color=[ viridis((1/5.0) * n) for n in range(5) ],
            bins=[ n for n in range(27)],
            align='mid', alpha=0.88,  
            rwidth=0.8, histtype='barstacked')
    ax.legend(fontsize=6)
    ax.set_xlim([-1,27])
    ax.set_xticks(
        [ n + 0.5 for n in range(26) ]
    )
    ax.set_xticklabels(
        [ l for l in ascii_lowercase ]
    )
    ax.set_title("Analysis of d_2")
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(7)
    fig_d2.show()
    plt.show()
    fig_d4 = plt.figure(1, figsize=(4,3), dpi=200, facecolor=None)
    bins = [ n for n in range(20,80,2)]
    ax = fig_d4.subplots(1,1)
    ax.hist([values_d4_a_f,values_d4_a_g,values_d4_a_h], 
            alpha=0.88, label=["a -> f", "a -> g", "a -> h"], bins=bins, 
            rwidth=0.8, histtype='barstacked')
    ax.legend(fontsize=6)
    ax.set_title("Analysis of d_4")
    ax.set_xlim([19,81])
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(7)
    fig_d4.show()
    plt.show()

if __name__ == "__main__":
    creation_example()

Histogram for d_1 after 'A' (image: pmkoalas_d1_histogram)
Histogram for d_2 and d_3 after 'A' (image: pmkoalas_d2_histogram)
Histogram for d_4 after 'E' (image: pmkoalas_d4_histogram)

This seems more appropriate for general audiences and future-proofs the package API.
@AdamBanham added the 'enhancement' (New feature or request) label Dec 21, 2022
@AdamBanham added this to the v0.0.1 milestone Dec 21, 2022
@adamburkegh (Collaborator)

But it's not generation, it's conversion. All the information required is in the input. I wouldn't rename until there is a clear second case to generalize from.

@AdamBanham added the 'holding' (revisiting later) label Dec 21, 2022
@AdamBanham (Owner, Author)

Fair enough, holding off on the merge until then. We can circle back at a later date.

  • a multiply trace sequence, ^X.
  • a data issue chance, %dX.

These can work in combination with each other.
@AdamBanham (Owner, Author)

So I have made some additions to the form of the delimited traces. Each delimited trace can have some augments attached.
I have made two augments: one for multiplying the number of traces (as I found myself repeating a single sequence) and another for having a data issue occur with a given X% chance. I made the assumption that each augment is triggered in an ordered manner, but in testing it I also liked the idea of there being no fixed order. The latter could mean that users construct complex generation sequences, e.g. multiply the starting sequence 5 times, then apply a data issue at 25% to each of these, and then multiply each sequence by 50 (a hypothetical sketch of such a chained pattern follows).
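
A hypothetical sketch of such a chained pattern is shown below, assuming augments are applied left to right; since this ordering behaviour is exactly what is up for discussion, the pattern string is illustrative only.

from koalas.generate import gen_log

# hypothetical: multiply the base sequence by 5, roll a 25% data issue
# on each copy, then multiply every resulting sequence by 50
log = gen_log("a b e f || ^5 %d25 ^50")
print(log)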

Thoughts on this direction?

from koalas.generate import gen_log

# generate from lists
variant_a = ["a b e f || ^20"]
variant_b = ["a b e c d b f || ^30"]
# each generated trace could have a data issue
variant_c = ["a b c e d b f || ^20 %d25"] 
variants = variant_a + variant_b + variant_c
log = gen_log(*variants)
print(log)

# show some __repr__
print(log.__repr__())
print(log.language().pop().__repr__())
print(log.directly_follow_relations().__repr__())

Which produces the following:

[<a,b,e,f>^20,<a,b,e,c,d,b,f>^30,<a,b,c,e,d,b,f>^16,<a,b,c,e,b,f>^1,<e,b,c,a,d,b,f>^1,<a,b,c,d,e,b,f>^1,<a,b,c,e,d,f,b>^1]
EventLog(
	[Trace(['a','b','e','f'])] * 20+
	[Trace(['a','b','e','c','d','b','f'])] * 30+
	[Trace(['a','b','c','e','d','b','f'])] * 16+
	[Trace(['a','b','c','e','b','f'])] * 1+
	[Trace(['e','b','c','a','d','b','f'])] * 1+
	[Trace(['a','b','c','d','e','b','f'])] * 1+
	[Trace(['a','b','c','e','d','f','b'])] * 1
)
Trace(['a','b','c','e','d','f','b'])
FlowLanguage([
	DirectlyFlowsPair(left='SOURCE',right='a',freq=69),
	DirectlyFlowsPair(left='a',right='b',freq=69),
	DirectlyFlowsPair(left='b',right='e',freq=50),
	DirectlyFlowsPair(left='e',right='f',freq=20),
	DirectlyFlowsPair(left='f',right='END',freq=69),
	DirectlyFlowsPair(left='e',right='c',freq=30),
	DirectlyFlowsPair(left='c',right='d',freq=31),
	DirectlyFlowsPair(left='d',right='b',freq=47),
	DirectlyFlowsPair(left='b',right='f',freq=49),
	DirectlyFlowsPair(left='b',right='c',freq=20),
	DirectlyFlowsPair(left='c',right='e',freq=18),
	DirectlyFlowsPair(left='e',right='d',freq=17),
	DirectlyFlowsPair(left='e',right='b',freq=3),
	DirectlyFlowsPair(left='SOURCE',right='e',freq=1),
	DirectlyFlowsPair(left='c',right='a',freq=1),
	DirectlyFlowsPair(left='a',right='d',freq=1),
	DirectlyFlowsPair(left='d',right='e',freq=1),
	DirectlyFlowsPair(left='d',right='f',freq=1),
	DirectlyFlowsPair(left='f',right='b',freq=1),
	DirectlyFlowsPair(left='b',right='END',freq=1),
])

@adamburkegh (Collaborator) commented Jan 4, 2023 via email

@AdamBanham changed the title from 'renaming dtlog to new generate module' to 'Adding generate module for complex grammars' Jun 23, 2023
