OpenCPU

Objective

Sometimes, it is practically or legally not possible to move corpus data to a local machine. This vignette explains the usage of CWB corpora that are hosted on an OpenCPU server.

library(polmineR)
## polmineR is throttled to use 2 cores as required by CRAN Repository Policy. To get full performance:
## * Use `n_cores <- parallel::detectCores()` to detect the number of cores available on your machine
## * Set number of cores using `options('polmineR.cores' = n_cores - 1)` and `data.table::setDTthreads(n_cores - 1)`
## 
## Attaching package: 'polmineR'
## The following object is masked from 'package:base':
## 
##     use

Remote Corpora

Publicly Available Corpora

The GermaParl corpus is hosted on an OpenCPU server with the IP 132.252.238.66 (subject to change). To use the corpus, use the corpus()-method. The only difference is that you will need to supply the IP address using the argument server.

gparl <- corpus("GERMAPARL", server = "http://opencpu.politik.uni-due.de")

The gparl object is an object of class remote_corpus.

is(gparl)

Using polmineR core functionality

The polmineR at this stage exposes a limited set of its functionality for remote corpora. Simple investigations in the remote corpus are possible.

Get corpus size

size(gparl)

Get structural annotation (metadata)

s_attributes(gparl)

Subsetting

gparl2006 <- subset(gparl, year == "2006")

The returned object has the class remote_subcorpus.

is(gparl2006)

Simple count

count(gparl, query = "Integration")

The count()-method works for remote_subcorpus objects, too.

count(gparl2006, query = "Integration")

KWIC

kwic(gparl, query = "Islam", left = 15, right = 15, meta = c("speaker", "party", "date"))

Works for the remote_subcorpus, too.

kwic(gparl2006, query = "Islam", left = 15, right = 15, meta = c("speaker", "party", "date"))

Restricted Corpora

  1. Create directory for registry file-style files with credentials

  2. Create file with credentials for your corpus in this directory

Note: Filename is corpus id in lowercase

##
## registry entry for corpus GERMAPARLSAMPLE
##

# long descriptive name for the corpus
NAME "GermaParlSample"
# corpus ID (must be lowercase in registry!)
ID   germaparlsample
# path to binary data files
HOME http://localhost:8005
# optional info file (displayed by ",info;" command in CQP)
INFO https://zenodo.org/record/3823245#.XsrU-8ZCT_Q 

# corpus properties provide additional information about the corpus:
##:: user = "YOUR_USER_NAME"
##:: password = "YOUR_PASSWORD"
  1. Set environment variable “OPENCPU_REGISTRY” in .Renviron to dir just mentioned

  2. Get server whereabouts

x <- corpus("MIGPRESS_FAZ", server = "YOURSERVER", restricted = TRUE)

Next steps

Upcoming versions of polmineR will expose further functionality. This is a simple proof-of-concept!