We are directly analyzing linguistic datasets and monitoring resources as they are published, in order to deduce metadata about the availability, technical quality and content of language resources. In this way, Prêt-à-LLOD will support the discovery of datasets along many axes. Some example queries that could be answered by Prêt-à-LLOD’s search are:
“Give me all bilingual dictionaries where German is one of the languages.”
“Give me all datasets, licensed for reuse with modification, and tagged with the Penn TreeBank tagset.”
“Give me all services that implement a part-of-speech tagger for German using the STTS tagset.”
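To illustrate how such queries could be answered over deduced metadata, here is a minimal sketch in Python. The metadata records and field names below are invented for illustration and are not the actual Prêt-à-LLOD schema; a real deployment would query structured (e.g., RDF) metadata rather than in-memory dictionaries.

```python
# Hypothetical metadata records for a handful of language resources.
# All identifiers and field names are illustrative only.
DATASETS = [
    {"id": "dict-de-en", "type": "dictionary", "languages": {"de", "en"},
     "license": "CC-BY-4.0", "tagset": None},
    {"id": "corpus-ptb", "type": "corpus", "languages": {"en"},
     "license": "CC-BY-SA-4.0", "tagset": "PennTreeBank"},
    {"id": "dict-fr-es", "type": "dictionary", "languages": {"fr", "es"},
     "license": "CC-BY-ND-4.0", "tagset": None},
]

def bilingual_dictionaries_with(lang):
    """All bilingual dictionaries where `lang` is one of the languages."""
    return [d["id"] for d in DATASETS
            if d["type"] == "dictionary"
            and len(d["languages"]) == 2
            and lang in d["languages"]]

def reusable_with_modification(tagset):
    """Datasets tagged with `tagset` whose license permits modification
    (here crudely approximated as 'no NoDerivatives clause')."""
    return [d["id"] for d in DATASETS
            if d["tagset"] == tagset and "ND" not in d["license"]]

print(bilingual_dictionaries_with("de"))           # → ['dict-de-en']
print(reusable_with_modification("PennTreeBank"))  # → ['corpus-ptb']
```

The point of the sketch is that once availability, licensing and annotation metadata are explicit, each of the example queries above reduces to a simple filter over that metadata.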
Dataset transformation currently depends heavily on manual transformation languages such as XSLT or R2RML. To move beyond this, the use of ontologies such as OLiA has shown benefit in integrating dataset schemas, and these approaches can be combined into efficient methods. We are developing a three-step methodology:
- Transformation from source formats into an isomorphic RDF rendering of the original format,
- Transformation of native RDF into a community-approved LLOD data model,
- Manipulation of LLOD data, including linking and enrichment with LLOD term bases.
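The three steps above can be sketched as follows. The example is a toy: the two-column source format, the URIs, and the target predicate names (loosely modeled on OntoLex/LexInfo vocabulary) are all illustrative, and triples are represented as plain Python tuples rather than a full RDF library.

```python
# Toy two-column lexicon serving as the "source format".
SOURCE_TSV = "Haus\tnoun\nlaufen\tverb"

EX = "http://example.org/"  # illustrative namespace

def step1_isomorphic_rdf(tsv):
    """Step 1: render each source row as triples that mirror the
    original format one-to-one (column -> predicate)."""
    triples = []
    for i, row in enumerate(tsv.splitlines()):
        form, pos = row.split("\t")
        subj = f"{EX}row/{i}"
        triples.append((subj, f"{EX}col/form", form))
        triples.append((subj, f"{EX}col/pos", pos))
    return triples

def step2_to_community_model(triples):
    """Step 2: map native predicates onto a community-approved model
    (predicate names here are only OntoLex-like stand-ins)."""
    mapping = {f"{EX}col/form": "ontolex:writtenRep",
               f"{EX}col/pos": "lexinfo:partOfSpeech"}
    return [(s, mapping.get(p, p), o) for s, p, o in triples]

def step3_enrich(triples, termbase):
    """Step 3: link written representations to an external LLOD term base."""
    links = [(s, "owl:sameAs", termbase[o])
             for s, p, o in triples
             if p == "ontolex:writtenRep" and o in termbase]
    return triples + links

native   = step1_isomorphic_rdf(SOURCE_TSV)
modelled = step2_to_community_model(native)
enriched = step3_enrich(modelled, {"Haus": f"{EX}term/Haus"})
```

Separating the isomorphic rendering (step 1) from the model mapping (step 2) keeps the lossy, schema-dependent decisions in one small mapping table instead of spreading them through a monolithic XSLT or R2RML script.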
Prêt-à-LLOD Data Manager
We are investigating:
- the representation of rights information of Prêt-à-LLOD resources as ODRL policies, covering copyright law, database law and the GDPR;
- a methodology for manipulating policies and provenance information (PROV-O) that ensures lawful consumption of resources and services, not only with respect to copyright law but also implementing a privacy-by-design approach capable of dealing with personal data;
- new license composition algorithms using deontic reasoning techniques.
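As a hedged illustration of license composition, the sketch below combines two ODRL-style policies under the simplified deontic rule that a prohibition overrides a permission. The policy structure only loosely follows the ODRL Information Model, and real composition algorithms must handle constraints, duties and assignees that are omitted here.

```python
# Two simplified ODRL-style policies over abstract actions.
# Structure and rule are illustrative simplifications.
policy_a = {"permission": {"use", "modify"}, "prohibition": set()}
policy_b = {"permission": {"use"}, "prohibition": {"modify"}}

def compose(p, q):
    """Compose two policies: an action is permitted in the result only
    if some input policy permits it and no input policy prohibits it
    (prohibition overrides permission)."""
    prohibited = p["prohibition"] | q["prohibition"]
    permitted = (p["permission"] | q["permission"]) - prohibited
    return {"permission": permitted, "prohibition": prohibited}

combined = compose(policy_a, policy_b)
# combined permits only "use": policy_b's prohibition of "modify" wins.
```

This "most restrictive wins" rule is one common baseline; the deontic reasoning under investigation is considerably richer, but the composition interface is the same: policies in, one policy out.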
We are investigating how state-of-the-art NLP techniques can be applied to the linking process and combined with the constraints typically found in ontology alignment. We will also move beyond simply linking homogeneous resources such as ontologies by looking at linking across linguistic data modalities, in particular corpora, lexicons, thesauri and ontologies. A particular focus will be placed on lexicon-corpus linking, where elements of a dictionary are linked to text, and on ontology lexicalization, where links are established between an ontology and a dictionary.
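A minimal sketch of lexicon-corpus linking: dictionary entries are attached to corpus token positions by lemma matching. The lexicon identifiers are invented, and the inflection handling (stripping a final "-s") is a crude stand-in for the NLP lemmatization a real linking system would use.

```python
# Hypothetical lexicon: lemma -> dictionary-entry identifier.
LEXICON = {"house": "lex:house-n", "run": "lex:run-v"}

def link_tokens(tokens, lexicon):
    """Return (position, token, entry) for each token matched to a
    lexicon entry. Stripping a trailing 's' crudely approximates
    lemmatization of plurals and 3rd-person-singular verbs."""
    links = []
    for i, tok in enumerate(tokens):
        lemma = tok.lower()
        if lemma not in lexicon and lemma.endswith("s"):
            lemma = lemma[:-1]
        if lemma in lexicon:
            links.append((i, tok, lexicon[lemma]))
    return links

links = link_tokens(["She", "runs", "to", "the", "house"], LEXICON)
# → [(1, 'runs', 'lex:run-v'), (4, 'house', 'lex:house-n')]
```

Even this toy version shows why corpus evidence matters for the linking task: the surface form "runs" never occurs in the dictionary, so linking requires normalization on the corpus side before any alignment constraint can apply.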
We are building on our recently proposed methodology, called Teanga, which will allow the deployment of language technology pipelines on the cloud, increasing the interoperability, and thus also the exchangeability, of services and datasets. This will be achieved by using containerization technology to remove the need to fit a single technical platform (e.g., Java), and semantic descriptions of web services to enable their easy integration.
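To show how semantic service descriptions enable easy integration, the sketch below chains containerized services by matching their declared input and output types. The service names and type strings are illustrative and do not reflect Teanga's actual description format; a real system would resolve such descriptions against ontologies rather than exact string matching.

```python
# Hypothetical semantic descriptions of three containerized services:
# each declares what it consumes and what it produces.
SERVICES = [
    {"name": "tokenizer",  "consumes": "text/plain",    "produces": "teanga/tokens"},
    {"name": "pos-tagger", "consumes": "teanga/tokens", "produces": "teanga/tagged"},
    {"name": "ner",        "consumes": "teanga/tagged", "produces": "teanga/entities"},
]

def build_pipeline(start_type, goal_type, services):
    """Greedily chain services whose declared types line up, from
    start_type until goal_type is produced."""
    pipeline, current = [], start_type
    while current != goal_type:
        nxt = next((s for s in services if s["consumes"] == current), None)
        if nxt is None:
            raise ValueError(f"no service consumes {current}")
        pipeline.append(nxt["name"])
        current = nxt["produces"]
    return pipeline

pipeline = build_pipeline("text/plain", "teanga/entities", SERVICES)
# → ['tokenizer', 'pos-tagger', 'ner']
```

Because each service runs in its own container and is described only by its interface, any implementation (Java, Python, or otherwise) that honors the declared types can be swapped into the pipeline without changing the other services.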