Data mining day
Centre de Développement des Technologies Avancées (CDTA)
9h00-10h00 : Revisiting maximal pattern mining problems
10h00-11h00 : Frequent string mining from large scale noisy data
Takeaki Uno - NII, Tokyo, Japon
11h30-12h30 : Une approche globale pour la fouille de graphes
14h00-15h30 : Discussion : Datamining and related areas
Revisiting maximal pattern mining problems
Professeur Jean-Marc Petit - LIRIS, INSA de Lyon, France
Abstract. The framework
of Mannila and Toivonen \cite{MT97} classifies pattern mining problems
that are (isomorphically) equivalent to frequent itemset mining
(FIM). Moreover, the notion of dualization turns out to be crucial to
characterize the complexity of maximal patterns mining problems.
Nevertheless, many pattern mining problems cannot benefit from this
framework since the isomorphism requirement is too restrictive for many
patterns such as sequences, episodes or graphs to mention a few. In
this setting, our ambition is to extend their framework to take into
account such complex patterns and studying their complexity. First, we
identify and classify pattern structures, which are not boolean
lattices, for which the dualization problem remains
quasi-polynomial. We introduce convex embedding and poset
reduction as key notions to obtain a new classification of pattern
mining problems. Second, we point out how to adapt the seminal Dualize
& Advance algorithm for these classes. Last but not least, we point
out how known instances of complex pattern mining problems, including
sequences and conjunctive queries, fit into our framework.
Biography.
Jean-Marc Petit is a professor in
computer science at the engineering school INSA Lyon. From september
1997 to august 2005, he was an associate professor at the University
Blaise Pascal, Clermont-Ferrand. He received his PhD in Computer
Science from the University Claude Bernard in 1996 and graduated in
computer science from the University Claude Bernard and ENS Lyon
("Magistère") in 1993. Since 2005, he is the head of the “pervasive
information system” group (15 members) and has developed scientific
partnerships with major companies (e.g. EDF R&D or Alcatel- Lucent)
on new database applications. His main research interests concern
databases and data mining and leaded three main scientific projects:
DBA Compnion, Generules and iZi.
Frequent string mining from large scale noisy data
Takeaki Uno - NII, Tokyo, Japon
Abstract.
Finding frequent string is a fundamental issue in data mining
and information processing. The problem is actually easy from
computation side, thanks to suffix array algorithms. However,
once we allow errors for the occurrence, i.e., if we consider a
string appears at a place if the string is *similar* (in short
distance) to the substring beginning from the place, it turns to
be very difficult problem. The reason is that quite many short
strings can appear in many places if we allow errors of at most
some constant k; on the other hand, if we allow errors according
to the length, the monotone property is violated, and search
algorithm will not be efficient. In the existing studies, we can
say that no algorithm is efficient for this task. Instead of
usual hill climbing approach, here we propose an approach based
on similarity detection. In the approach, we first find all pairs
of similar strings, and then find frequent strings according to
the groups of similar strings. We also explain how to compute the
similarity efficiency. As the result, from genome sequence we can
extract frequent strings that are very long compared to existing
researches, in quite short time, say 10 min.
Takeaki Uno received the PhD degree (Doctor of Science) from Department of Systems Science, Tokyo Institute of Technology Japan, 1998. He was an assistant professor in Department of Industrial and Management Science in Tokyo Institute of Technology from 1998 to 2001, and have been an associate professor of National Institute of Informatics Japan, from 2001. His research topic is discrete algorithms, especially enumeration algorithms, algorithms on graph classes, and data mining algorithms. On the theoretical part, he studies low degree polynomial time algorithms, and hardness proofs. In the application area, he works on the paradigm of constructing practically efficient algorithms for large scale data that are data oriented and theoretically supported. In an international frequent pattern mining competition in 2004 he won the best implementation award. He got Young Scientists' Prize of The Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology in Japan, 2010.
Retour au programme de la journée
Une approche globale pour la fouille de graphes
Yahia Lebbah- Université d'Oran, Algérie
Abstract. Nous
introduisons la problématique de la fouille de graphes et ses
différentes approches de résolution, en mettant en avant les
étapes coûteuses dans le processus de génération des sous-graphes
fréquents dans une base de graphes. Nous verrons notamment que la
fouille de graphes fait appel aux problématiques combinatoires de
l'isomorphisme de sous-graphes et de graphes. Nous montrerons
comment relâcher des problèmes combinatoires difficiles de la fouille
de graphes, pour se ramener à une méthode globale qui génère un
encadrement rigoureux des sous-graphes fréquents. Nous montrerons aussi
une évaluation expérimentale de l'ensemble de l'approche, en
explicitant le type de graphes où des gains intéressants sont
obtenus.
Yahia Lebbah received the PhD degree in
computer science from Ecole des Mines de Nantes, France, 1999, and then
joined university of Oran in Algeria as assistant professor. Since
2009, he is a professor. From 2010, he is the head of the LITIO
laboratory in computer science at the university of Oran. He is also
conducting the creation of the high performance computing center at the
same university. He has a long time collaboration with the CeP team,
I3S/CNRS laboratory at the university of Nice - Sophia Antipolis in
France, where he is associate researcher. He has collaborated with
COPRIN team at INRIA Sophia Antipolis and Celtique team at INRIA
Rennes. His main research area is constraint programming (CP). He has
particularly worked on solving constraints and continuous optimisation
problems. His current research interests is the use of CP and
optimisation techniques in global optimization, graph mining, software
verification and multicriteria decision making.
Retour au programme de la journée