Introduction to Datalog

  • Published 30 Sep 2024

COMMENTS • 9

  • @haonanqiu4251
    @haonanqiu4251 2 years ago +6

    00:00: Introduction
    2:06: A rule-based query language
    7:45: Datalog Syntax
    15:55: Datalog Semantics (1)
    20:50: Datalog Semantics (2)
    22:54: Datalog Semantics (3)
    33:04: Using Datalog on RDF
    37:53: Queries beyond SPARQL
    41:37: Datalog Complexity

  • @thegeniusfool
    @thegeniusfool 3 months ago

    As soon as a language is intriguing, the folks get German or French accents.

  • @ben_jammin242
    @ben_jammin242 1 year ago +2

    Coming from a programmer's perspective, I think pseudocode would have been better; this custom syntax is quite mathematical.

  • @nilofarmoradifarisar3171
    @nilofarmoradifarisar3171 2 years ago

    Great course! Thank you so much, Prof. Markus Krotzsch.

  • @jonaskoelker
    @jonaskoelker 2 years ago

    At 43:16 it is asserted that answering datalog queries is P-complete, yet there are queries in P that cannot be answered. Instead of P-complete, should the slides say P-hard? That is, every problem in P can be reduced to a datalog query, but not every datalog query can be reduced to a problem in P?
    Also, I'm guessing here: is the interpretation of "P-hard in data complexity" the following: "for every decision problem p1 in P there exists a datalog query plus ruleset (q, r1, ..., rn) and a log-space Turing machine T which outputs datalog facts (not rules or queries) such that for every instance x of p1 it is true that rundatalog(q, r1, ..., rn, output of T(x)) returns a non-empty result set if and only if x is a yes-instance of p1"?
    I guess for Exptime-completeness the Turing machine T' outputs a mix of facts, rules and a query and you rundatalog(T'(x)), i.e. the query and rules can be a function not just of the problem type (e.g. 3-SAT vs. 3-colorability) but also a function of the particular instance.

    • @knowledge-basedsystemstudr9413
      @knowledge-basedsystemstudr9413  2 years ago +3

      Right, so let me answer the second two paragraphs first. Your idea of what P-hard for data complexity (and Exp-hard for combined complexity) means seems to be correct. The actual rules that are used to put this into practice mimic the computation of a Turing machine in Datalog. For the most part, this is not too hard (we often describe Turing machine computations using rule-like statements anyway, as in "if the machine is in state q and sees symbol a, then it changes to state p, writes a b, and moves to the left"). To make this work in Datalog, the main idea is that the relational structure (database) that gets computed should look like a "trace" of the TM run: the tape (memory) will be a chain-like structure where every element is a cell, and we will have a new tape for every moment in time (so every time point, from start to end, is in the model). The challenge is to come up with a good encoding for things like "cell 23 at time step 42".
      If the computation is not too long (polynomial), we can simply assume that the relevant coordinates (like "23" and "42") are given to us with the input and we merely create pairs of them during the computation. That's the idea for the P-hardness proof. If we want to show Exp-hardness, we need to create a lot more "cells" in some way. The idea for doing this is to use facts with long lists of parameters (e.g., if I have 10 parameters in a predicate and 2 constants, I can already represent 2^10 facts). Indeed, the number of facts one can get is exponential, but the exponent is given by the arity of predicates. For Exp-hardness, the input should determine the exponent, so the arity of predicates must grow if we want to process larger input instances. This is why this is no longer "data complexity". Other than this complication (which turns "addresses" of cells into lists of constants), the rules that describe the actual Turing machine behaviour are basically the same as in the PTime case.
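      To make the "trace" idea above concrete, a single machine transition might be encoded roughly like this (a hypothetical sketch; the predicate names state/3, tape/3, nextTime/2, nextCell/2 and diff/2 are my own, assumed to be supplied as input facts over the given coordinates):

      ```
      % state(T, Q, C): at time T the machine is in state Q, head on cell C
      % tape(T, C, S):  at time T, cell C contains symbol S
      % Transition "in state q reading a: write b, move left, enter state p":
      state(T2, p, C2) :- state(T1, q, C), tape(T1, C, a), nextTime(T1, T2), nextCell(C2, C).
      tape(T2, C, b)   :- state(T1, q, C), tape(T1, C, a), nextTime(T1, T2).
      % Cells away from the head keep their contents (diff is cell inequality,
      % given as input facts):
      tape(T2, C, S)   :- tape(T1, C, S), state(T1, Q, C2), diff(C, C2), nextTime(T1, T2).
      ```

      For P-hardness the time and cell coordinates (and hence nextTime, nextCell, diff) come with the polynomial-size input; for Exp-hardness a cell address instead becomes a tuple of constants filling a predicate of higher arity, as described above.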
      If you want a more formal explanation, the complete Datalog programs needed for these encodings can be found in a 2001 survey paper "Complexity and Expressive Power of Logic Programming" by Evgeny Dantsin and colleagues. It should be easy to find and free to access online.
      Now to the first question. It is really true and there is no error in the video here. Datalog is P-complete for data complexity, yet there are problems over databases that can be decided in PTime but not with a Datalog program. A simple example of the latter would be: "Find out if the database does not contain the fact q(a)". There is nothing contradictory in this situation. P-hardness (for data complexity) just means that we could find a "cheap" reduction from any problem in P to Datalog query answering (with a fixed rule set). The ability (or, as it is here, inability) to express all problems in a complexity class directly (without any reduction) is known as the *descriptive complexity* of a database query language. If we can express all problems in P with a query language, then we say that the language *captures P*. As for Datalog, it does not capture P, but what it captures is still difficult enough to be P-hard. If we want to capture P, we can do that by adding some more features: input negation and successor ordering. I have some more details on these in my course on database theory (iccl.inf.tu-dresden.de/web/Database_Theory_(SS2022)/en, lecture "Datalog Complexity"). A more detailed explanation of what "expressing" a query in a query language means is given in our work "Capturing Homomorphism-Closed Decidable Queries with Existential Rules" (Camille Bourgeaux et al.; it's about a different query language, but the introduction to expressivity applies to Datalog just the same).
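      As a tiny illustration of what input negation buys: assuming a Datalog dialect with stratified negation over input (EDB) predicates (the nullary predicate name "answer" is mine), the query that plain Datalog cannot express becomes a one-liner:

      ```
      % Derive "answer" exactly when the input database lacks the fact q(a):
      answer :- not q(a).
      ```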

    • @jonaskoelker
      @jonaskoelker 2 years ago

      @@knowledge-basedsystemstudr9413 > If the computation is not too long (polynomial), [...]
      Right, I think I understand. I think we could do the same in a less direct way: the Cook-Levin theorem (3-SAT is NP-complete) builds a boolean circuit bounded in space and time which simulates a polynomial-time Turing machine. Instead of asking "is there an input such that the machine accepts?" we're asking "does the machine accept this particular input?", since we're working with deterministic p-time TMs.
      I think I can figure out how to convert logic gates to Datalog. We could also reduce directly from the circuit value problem (which is P-complete).
      I guess the datalog is something like:
      canOutput(GATE, true) :- hasType(GATE, and), hasInput(GATE, X), hasInput(GATE, Y), canOutput(X, true), canOutput(Y, true).
      canOutput(GATE, true) :- hasType(GATE, or), hasInput(GATE, X), canOutput(X, true).
      canOutput(GATE, false) :- hasType(GATE, and), hasInput(GATE, X), canOutput(X, false).
      canOutput(GATE, false) :- hasType(GATE, or), hasInput(GATE, X), hasInput(GATE, Y), canOutput(X, false), canOutput(Y, false).
      [canOutput if hasType(_, not) is left as an exercise.]
      ... and then the reduction on "true and not true" produces:
      canOutput(gensym1, true).
      canOutput(gensym2, true).
      hasType(gensym3, not).
      hasInput(gensym3, gensym2).
      hasType(gensym4, and).
      hasInput(gensym4, gensym1).
      hasInput(gensym4, gensym3).
      ... and this database can be computed in log space by walking over the input string repeatedly. Maybe I need to insist that the circuit is presented in a way that's easy (enough) to parse.
      Since no input value canOutput both true and false, it follows that no gate can output both true and false, and thus by induction canOutput(the circuit's output gate, true) is in the database iff the circuit value is true, so that shall be the query.
      [The "fixpoint of (database
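      For completeness, the not-gate rules left as an exercise above would presumably look like this (following the same canOutput/hasType/hasInput pattern as the rules in the comment):

      ```
      canOutput(GATE, true)  :- hasType(GATE, not), hasInput(GATE, X), canOutput(X, false).
      canOutput(GATE, false) :- hasType(GATE, not), hasInput(GATE, X), canOutput(X, true).
      % Query: canOutput(gensym4, true)?  For "true and not true" the fixpoint
      % contains canOutput(gensym4, false) but not canOutput(gensym4, true),
      % so the query returns an empty result, matching the circuit value.
      ```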

    • @kellymoses8566
      @kellymoses8566 8 months ago

      @@knowledge-basedsystemstudr9413 This might be the smartest comment on YouTube.

  • @dansplain2393
    @dansplain2393 2 years ago

    Just the right level of theory, thank you