This repo contains various scripts for curating and working with data from the Swedish Riksdag. It is "internal" in some sense: we make no effort to maintain compatibility or to provide really thorough documentation, and it is not intended as part of the project's API. Nevertheless, we feel that users might find some utility in these example scripts.
The general recommendation is to set up a Python virtual environment for working with this data set and these scripts. Do that however you like; below is just one example of how it can be done. We work with Python 3.8 due to compatibility issues with e.g. TensorFlow.
Set up a conda environment: follow the steps here.
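For example, a minimal conda setup might look like the following (the environment name `tf` is arbitrary; it matches the activation-script path used below):

```shell
# One way to create the recommended Python 3.8 environment
# (assumes conda/Miniconda is already installed)
conda create -n tf python=3.8
conda activate tf
```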
With the environment active, install the pyriksdagen module, either from PyPI
pip install pyriksdagen
or from a local copy of the pyriksdagen repo
pip install .
The `LazyArchive()` class attempts to connect to the KB labs in the laziest way possible. If you use the scripts often, it's worthwhile to set three environment variables:
KBLMYLAB=https://betalab.kb.se
KBLUSER=
KBLPASS=
We are phasing out reliance on the kblab servers, and this will soon be deprecated.
They can be set in a conda activation script, e.g. `~/miniconda3/envs/tf/etc/conda/activate.d/env_vars.sh`. If they are not present, you will be prompted for the username and password.
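For example, the activation script could export the variables like this (the credentials are placeholders):

```shell
# ~/miniconda3/envs/tf/etc/conda/activate.d/env_vars.sh
export KBLMYLAB=https://betalab.kb.se
export KBLUSER=your_username   # placeholder
export KBLPASS=your_password   # placeholder
```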
Most scripts take `--start YEAR` and `--end YEAR` arguments to define a span of time to operate on. Other options are noted with each file below.
1. Create a new curation branch from `dev`:

   ```
   git checkout -b curation-<decade_start_year>s dev
   ```
2. Generate an input csv by querying protocol packages with `scripts/query2csv.py`.
   - this creates `input/protocols/scanned.csv` or `input/protocols/digital_originals.csv`, to be read by `scripts/pipeline.py`
   - with the `-m` option the script will create year directories in `corpus/protocols/` if they don't already exist
   - obs.: unlike the other scripts' use of `--start` and `--end`, this defined a range of dates exclusive of the end year; it has been updated to behave like the other scripts
   - obs. 2: a potential problem is that this doesn't handle the two-year formats, e.g. 199495
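Handling the two-year session format could look something like the sketch below; `split_session_years` is a hypothetical helper, not a pyriksdagen function:

```python
def split_session_years(session: str) -> tuple:
    """Split a Riksdag session string like '199495' into (1994, 1995).

    Plain four-digit years like '1999' are returned as (1999, 1999).
    Illustrative sketch only, not part of pyriksdagen.
    """
    if len(session) == 6:
        # e.g. '199495': first four digits are the start year,
        # the last two digits share the start year's century
        start = int(session[:4])
        end = int(session[:2] + session[4:])
        if end < start:  # century wrap, e.g. '199900' -> 1999, 2000
            end += 100
        return start, end
    return int(session), int(session)

print(split_session_years("199495"))  # (1994, 1995)
```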
3. Compile ParlaClarin for the years queried in the previous step with `scripts/pipeline.py`; make sure `input/raw/` exists.
4. Look for introductions with `scripts/classify_intros.py`.
   - this creates `input/segmentation/intros.csv`
   - nb. we had to add `/home/bob/miniconda3/envs/tf/lib/python3.9/site-packages/nvidia/cublas/lib/` to `$LD_LIBRARY_PATH`
5. Run `scripts/resegment.py` to segment and label introductions in the `corpus/protocols/<year>/*.xml` files.
6. Run `scripts/add_uuid.py` to make sure any new segments have a uuid.
7. Run `scripts/find_dates.py` to find marginal notes with dates and add the dates to the metadata.
8. Run `scripts/build_classifier.py` (the classifier doesn't need to be built every time).
   - nb. different args than the other scripts: `--datapath` needs a file currently at `input/curation/classifier_data.csv` (how this file is generated is a mystery; it just exists), and `--epochs` can be left at its default
   - writes to `segment-classifier/`. How does it relate to years of protocols? It doesn't: the classifier is apparently trained generally, and `scripts/reclassify.py` lets you specify which years are operated on.
9. Run `scripts/reclassify.py` to reclassify utterances and notes.
   - nb. `build_classifier.py` writes to `segment-classifier/`, but this reads from `input/segment-classifier/`, so the output needs to be moved, or we can fix the discrepancy
   - do this one year at a time for dolan's sakie:

     ```
     for year in {START..END}; do python scripts/reclassify.py -s $year -e $year; done
     ```
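Until the path discrepancy is fixed, one workaround (assuming you run from the repo root) is simply to move the classifier output into place:

```shell
# build_classifier.py writes to segment-classifier/, but reclassify.py reads
# from input/segment-classifier/; move the trained model if it's present.
mkdir -p input
if [ -d segment-classifier ] && [ ! -e input/segment-classifier ]; then
    mv segment-classifier input/segment-classifier
fi
```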
10. Run `scripts/add_uuid.py` again.
11. Run `scripts/dollar_sign_replace.py` to replace dollar signs.
12. Run `scripts/fix_capitalized_dashes.py`.
13. Run `scripts/wikidata_process.py` (makes metadata available for `redetect.py`).
14. Run `scripts/redetect.py`.
15. Run `scripts/split_into_sections.py`.
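Taken together, the run-script steps above could be driven by a small wrapper like this sketch. It only echoes the commands (a dry run); it assumes every listed script accepts `--start`/`--end`, which the note above only guarantees for most scripts, so check each one before removing the `echo`:

```shell
#!/bin/sh
# Dry run of the post-segmentation curation steps for one decade.
START=1920
END=1929

for script in resegment.py add_uuid.py find_dates.py; do
    echo python "scripts/$script" --start "$START" --end "$END"
done

# build_classifier.py is omitted: it doesn't need to be built every time.
# reclassify.py is run one year at a time (see the note above).
for year in $(seq "$START" "$END"); do
    echo python scripts/reclassify.py -s "$year" -e "$year"
done

for script in add_uuid.py dollar_sign_replace.py fix_capitalized_dashes.py \
              wikidata_process.py redetect.py split_into_sections.py; do
    echo python "scripts/$script" --start "$START" --end "$END"
done
```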
1. Generate a sample for each decade with `sample_pages_new.py`.
   - this generates a csv file at `input/quality_control/sample_<decade-start-year>.csv` and a list of the protocols in the sample at `input/quality_control/sample_<decade-start-year>.txt`
2. Add (`git-add_QC-sample.sh` for the lazy) and commit the sample to the working branch.
3. Populate the quality control csv file with `populate-QC-sample-test.py`.
   - the sampled protocols need to be on the local machine where the script is run; since it pops open protocols on GitHub and the originals on betalab in a browser, this script doesn't play nicely over ssh
   - QC should distinguish between the same segment classes that `scripts/reclassify.py` produces; other classes may become relevant later
4. Does the data pass the QC test? If yes, add and push the rest of the protocols.