Friday, October 31, 2014

2015 class hour schedule

News: The 2015 class hour schedule will be on Fridays 2.30pm-5.45pm. BUT: we will discuss tomorrow whether we can move it to 4-7pm.

Friday, June 6, 2014

Lecture 12: Statistical Machine Translation

Introduction to Machine Translation. Rule-based vs. Statistical MT. Statistical MT: the noisy channel model. The language model and the translation model. The phrase-based translation model. Learning a model from training data. Phrase-translation tables. Parallel corpora. Extracting phrases from word alignments. Word alignments.

IBM models for word alignment. Many-to-one and many-to-many alignments. IBM model 1 and the HMM alignment model. Training the alignment models: the Expectation Maximization (EM) algorithm. Symmetrizing alignments for phrase-based MT: symmetrizing by intersection; the growing heuristic. Calculating the phrase translation table. Decoding: stack decoding. Evaluation of MT systems. BLEU. Log-linear models for MT.
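
As a rough illustration of how EM trains a word alignment model, here is a minimal sketch of IBM Model 1. It is not the code used in class: the function name `ibm_model1`, the three-sentence toy corpus and the number of iterations are assumptions made for this example, and the NULL word is omitted for brevity.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with EM.

    corpus: list of (foreign_tokens, english_tokens) pairs.
    Simplification (assumption of this sketch): no NULL word on the English side.
    """
    f_vocab = {f for fs, _ in corpus for f in fs}
    # Uniform initialization over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked at all
        # E-step: distribute each foreign word over the English words of its sentence.
        for fs, es in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Hypothetical toy parallel corpus (German-English).
toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = ibm_model1(toy_corpus)
print(t[("das", "the")], t[("Haus", "the")])
```

On this toy corpus the probability of "das" given "the" grows with every iteration, because "das" is the only word that co-occurs with "the" in both sentences containing it.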

Monday, June 2, 2014

Lecture 11: Homework 1 correction + homework Q&A + Combinatory Categorial Grammar (CCG)

Homework 1 correction. Q&A on the other two homeworks. Combinatory Categorial Grammar (CCG).


Lecture 10: NLP research at LCL, Sapienza

Wednesday, May 21, 2014

IMPORTANT: about Friday's lecture

Dear students,

Due to a planned blackout in our building, Friday's class will be in the mathematics building, second floor, aula IV, same hour. The mathematics building "Guido Castelnuovo" is on the main campus.

Please make sure that you know where to go in order to avoid starting late.

Sunday, May 18, 2014

Lecture 9: Wheels for the mind of the language producer: microscopes, macroscopes, semantic maps and a good compass (prof. Michael Zock)

Languages are not only means of expression, but also vehicles of thought, allowing us to discover new ideas (brainstorming) or clarify existing ones by refining, expanding, illustrating more or less well specified thoughts. Of course, all this must be learned, and to this end we need resources, tools and knowledge on how to use them.

Knowledge can be encoded at various levels of abstraction, considering different units (words, sentences, texts). While semantic maps represent words and their relations at a micro-level, schematic maps (tree banks, pattern libraries) represent them combined, in larger chunks (macro-level). We all are familiar with microscopes, maps, and navigational tools, and we normally associate them with professions having little to do with NLP. I will argue during my talk that this does not need to be so. Metaphorically speaking, we do use the very same tools to process language, regardless of the task (analysis vs. generation) and the processor (machine vs. human brain).

Dictionaries are resources, but they can also be seen as microscopes as they reveal in more detail the hidden meanings, nutshelled in a word. This kind of information display can be achieved nowadays by a simple mouse-click, even for languages whose script we cannot read (e.g. oriental languages for most Europeans). A corpus query system like Sketch Engine can reveal additionally very precious information: a word’s grammatical and collocational behaviour in texts.

Unlike inverted spyglasses, which only reduce size, macroscopes are tools that allow us to get the big picture. Even though badly needed, they are not yet available in hardware stores, but they do exist in some scientists’ minds. They are known under the headings of pattern recognition, feature detectors, etc. The resulting abstractions, models or blueprints (frames, scripts, patterns) are useful for a great number of tasks. I will illustrate this point for patterns via two examples related to real-time language production and foreign language learning (acquisition of fluency via a self-extending speakable phrasebook).

Semantic maps (wordnets, thesauri, ontologies, encyclopedias) are excellent tools for organizing words and knowledge in a huge multidimensional meaning space. Nevertheless, in order to be truly useful, i.e. to guarantee access to the stored and desired information, maps are insufficient: we also need some navigational tool(s). To illustrate this point I will present some of my ongoing work devoted to the building of a lexical compass. The assumption is that people have a highly connected conceptual-lexical network in their mind. Finding a word thus amounts to entering the network at any point by giving a related word (source word) and then following the links (associations) until one has reached the target word. To allow for this kind of navigation, we try to build an association matrix that contains on one axis the target words and on the other the trigger words. Once built, this kind of tool should allow the user to navigate quickly and naturally, by starting from anywhere, to reach in very few steps the desired word, with the search being based on whatever knowledge is available at the onset of search.

Short bio: Prof. Michael Zock, H.PHD, is research director at CNRS, LIF (Laboratory of Fundamental Informatics). His research interests lie in cognitive science and language production, including the development of tools to assist language production (L1 + L2) and its acquisition, and understanding and simulating the cognitive processes underlying discourse planning (automatic creation of an outline) and word access. He is the author of hundreds of international publications in the field.

Wednesday, May 14, 2014

Lecture 8: Semantic relatedness

What is semantic relatedness? String-based similarity measures. Longest common substring/subsequence; n-gram overlap. Knowledge-based approaches: Lesk; Leacock & Chodorow; Wu & Palmer. Corpus-based approaches: Vector-space models, Explicit Semantic Analysis (ESA). Align, Disambiguate and Walk. Cross-level semantic similarity.
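
To make the string-based measures concrete, here is a small sketch of two of them: the length of the longest common subsequence and character n-gram overlap. The function names are made up for this example, and scoring the n-gram overlap with the Jaccard coefficient is one possible choice among several, not necessarily the one presented in class.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def char_ngrams(s, n):
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_overlap(a, b, n=2):
    """Jaccard coefficient between the character n-gram sets of a and b."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

print(lcs_length("kitten", "sitting"))  # 4 ("ittn")
print(ngram_overlap("night", "nacht"))  # 1/7: only "ht" is shared
```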

Sunday, May 4, 2014

Lecture 7: Word Sense Disambiguation + homework/project presentation

Introduction to Word Sense Disambiguation (WSD). Motivation. The typical WSD framework. Lexical sample vs. all-words. WSD viewed as lexical substitution and cross-lingual lexical substitution. Knowledge resources. Representation of context: flat and structured representations. Main approaches to WSD: Supervised, unsupervised and knowledge-based WSD. Two important dimensions: supervision and knowledge. Supervised Word Sense Disambiguation: pros and cons. Vector representation of context. Main supervised disambiguation paradigms: decision trees, neural networks, instance-based learning, Support Vector Machines. Unsupervised Word Sense Disambiguation: Word Sense Induction. Context-based clustering. Co-occurrence graphs: curvature clustering, HyperLex. Knowledge-based Word Sense Disambiguation. The Lesk and Extended Lesk algorithm. Structural approaches: similarity measures and graph algorithms. Conceptual density. Structural Semantic Interconnections. Evaluation: precision, recall, F1, accuracy. Baselines.
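
A minimal sketch of the simplified variant of the Lesk algorithm may help fix the idea of gloss overlap: choose the sense whose gloss shares the most words with the context. The sense labels, glosses and stopword list below are invented for this example and are not taken from WordNet or any other real inventory.

```python
def simplified_lesk(context, senses, stopwords=frozenset()):
    """Return the sense whose gloss overlaps most with the context words.

    context: list of tokens surrounding the target word.
    senses: dict mapping a sense label to its gloss (a plain string).
    """
    context_words = {w.lower() for w in context} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        gloss_words = set(gloss.lower().split()) - stopwords
        overlap = len(context_words & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical two-sense inventory for "bank".
bank_senses = {
    "bank#finance": "an institution that accepts deposits and lends money",
    "bank#river": "sloping land beside a body of water such as a river",
}
stop = frozenset({"a", "an", "and", "of", "the"})
sentence = "he sat on the bank of the river and watched the water".split()
print(simplified_lesk(sentence, bank_senses, stop))  # bank#river
```

Ties are resolved in favour of the first sense listed, which in a real system would typically be the most frequent sense.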

Presentation of homeworks 2 and 3 + 7 projects!

Saturday, April 12, 2014

Lecture 6: Semantics

Introduction to computational semantics. Syntax-driven semantic analysis. Semantic attachments. First-Order Logic. Lambda notation and lambda calculus for semantic representation. Lexicon, lemmas and word forms. Word senses: monosemy vs. polysemy. Special kinds of polysemy. Computational sense representations: enumeration vs. generation. Graded word sense assignment. Encoding word senses: paper dictionaries, thesauri, machine-readable dictionary, computational lexicons. WordNet. Wordnets in other languages. BabelNet.

Lecture 5: Syntax

Introduction to syntax. Context-free grammars and languages. Treebanks. Normal forms. Dependency grammars. Syntactic parsing: top-down and bottom-up. Structural ambiguity. Backtracking vs. dynamic programming for parsing. The CKY algorithm. The Earley algorithm. Probabilistic CFGs (PCFGs). PCFGs for disambiguation: the probabilistic CKY algorithm. PCFGs for language modeling.
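
The following sketch shows the CKY algorithm used as a recognizer for a grammar in Chomsky normal form. The toy grammar, the lexicon and the function name `cky_recognize` are assumptions made for illustration only; a parser would additionally store backpointers, and the probabilistic version would keep the best score per nonterminal.

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """Return True if the CNF grammar derives the sentence.

    lexical: dict word -> set of nonterminals A with a rule A -> word.
    binary: dict (B, C) -> set of nonterminals A with a rule A -> B C.
    """
    n = len(words)
    # chart[(i, j)] holds the nonterminals that span words[i:j].
    chart = defaultdict(set)
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((b, c), set())
    return start in chart[(0, n)]

# Hypothetical toy grammar: S -> NP VP, VP -> V NP, NP -> Det N.
binary_rules = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "sees": {"V"}}

print(cky_recognize("the dog sees the cat".split(), lexicon, binary_rules))  # True
print(cky_recognize("dog the sees".split(), lexicon, binary_rules))          # False
```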

Saturday, March 29, 2014

Lecture 4: Part-of-Speech Tagging

Introduction to part-of-speech (POS) tagging. POS tagsets: the Penn Treebank tagset and the Google Universal Tagset. Rule-based POS tagging. Stochastic part-of-speech tagging. Hidden Markov models. Deleted interpolation. Linear and logistic regression: Maximum Entropy models. Transformation-based POS tagging. Handling out-of-vocabulary words.
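
To give an idea of how a hidden Markov model assigns tags, here is a sketch of Viterbi decoding, the standard algorithm for finding the most probable tag sequence under an HMM. The two-tag model and all probabilities below are invented for this example. The sketch multiplies raw probabilities for readability; on longer sentences one would normally sum log probabilities to avoid underflow.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence for a non-empty list of words under a bigram HMM."""
    # Each column maps a tag to (probability of the best path ending in it, backpointer).
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]

# Hypothetical toy model with two tags.
toy_tags = ["N", "V"]
toy_start = {"N": 0.8, "V": 0.2}
toy_trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
toy_emit = {"N": {"fish": 0.6, "sleep": 0.4}, "V": {"fish": 0.3, "sleep": 0.7}}

print(viterbi(["fish", "sleep"], toy_tags, toy_start, toy_trans, toy_emit))  # ['N', 'V']
```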

Monday, March 24, 2014

Lecture 3: language modeling (2)

The third lecture was about language models. You discovered how important language models are and how we can approximate real language with them. N-gram models (unigrams, bigrams, trigrams) were discussed, together with their probability modeling and issues. We discussed perplexity and its close relationship with entropy, and we introduced smoothing and interpolation techniques to deal with the issue of data sparsity.
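
As a small illustration of these ideas, the sketch below trains a bigram model, applies add-one (Laplace) smoothing and computes perplexity. The function names and the two-sentence corpus are made up for this example, and add-one smoothing is used only because it is the simplest technique to show; it is not necessarily the one required by the homework.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev); the vocabulary includes <s> and </s>."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def perplexity(sentences, unigrams, bigrams):
    """Perplexity = 2 ** cross-entropy, with the cross-entropy in bits per token."""
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log2(bigram_prob(prev, word, unigrams, bigrams))
            n += 1
    return 2 ** (-log_prob / n)

# Hypothetical toy training corpus.
uni, bi = train_bigram([["a", "b"], ["a", "c"]])
print(perplexity([["a", "b"]], uni, bi))  # word order seen in training: lower perplexity
print(perplexity([["b", "a"]], uni, bi))  # unseen word order: higher perplexity
```

Thanks to smoothing, the unseen sentence still receives a non-zero probability, so its perplexity is finite.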

In the second part of the class, we discussed the first homework in more detail.

Tuesday, March 18, 2014

Lecture 2: morphology and language modeling (1)

We introduced words and morphemes. Before delving into morphology and morphological analysis, we introduced regular expressions as a powerful tool to deal with different forms of a word. We also introduced finite state transducers for encoding the lexicon and orthographic rules. The second part of the lecture was about language models. We discussed the importance of language models and how we can approximate real language with them. We also introduced N-gram models (unigrams, bigrams, trigrams), together with their probability modeling and issues.

In the last part I talked about the first part of homework 1 (deadline: April 30th)! Be sure you know all the details by participating in the discussions on the Google group. Don't miss the next class on Friday 21st!

Friday, March 7, 2014

Lecture 1: introduction

We gave an introduction to the course and the field it is focused on, i.e., Natural Language Processing, with a focus on the Turing Test as a tool to understand whether "machines can think". We also discussed the pitfalls of the test, including Searle's Chinese Room argument. We then provided examples of tasks in desperate need of accurate NLP: machine translation, summarization, machine reading, question answering, information retrieval.

First class: Today at 4pm!

Dear students,

the first class will be this afternoon at 4pm, in Viale Regina Elena, 295, yellow building (informatica e statistica), third floor, room G50. I have already invited all the students who signed up to the Google discussion group. See you later!

Friday, February 21, 2014

Final schedule!!!

The final schedule for the class is Friday, 4-7pm, in Viale Regina Elena, 295, yellow building (informatica e statistica), third floor, room G50! Please register here.