Industry 4.0, Digital Factory, Internet of Things, Digital Economy: all of these terms name the fusion of the physical, virtual and digital worlds. Data has become a key asset in the global digital economy; the amount of collectible, collected and available data keeps growing at a tremendous pace, while data sources and dimensions (temporal/spatial, numerical/textual) become more and more complex.
1) To gain an overview of the data management supply chain (from production to analysis and communication of results) and be able to manage a data science project from end to end
2) To know methods and apply tools for data handling, cleaning and querying
3) To know methods and apply tools to explore and analyze data
4) To know methods and apply tools specific to data characterized by two dimensions of industrial big data: spatial data and temporal data
The course covers the whole data supply chain: data collection and production, storage and organization, management, exploitation and analysis, and communication. Big data and dynamic analysis processes need transparent, repeatable and reproducible techniques. Information and knowledge production will be presented 'backward', starting from the needs of decision tools.
The course consists of four parts on the following topics. Each part aims, first, to identify the needs of specific Big Data Management situations and, second, to give an overview of the relevant methods and tools.
1) *Big-Data Management*: Data production and storage rely on infrastructures and platforms dedicated to the specificities of big data. These are designed to meet the needs for agile tools and for data updates. Repeatability, transparency and traceability of processes will be discussed throughout the data cleaning, querying and extraction operations.
2) *Exploration of complex data with high dimensionality* (characterized by a large number of variables of different natures, possibly structured and/or latent): The main methods of data exploration (classification, segmentation, etc.) will be presented from the perspectives of AI, mathematics and statistics, and algorithmics.
3) *Analysis of complex data with temporal and/or spatial dimensions* (e.g. duration of a process, duration between events in the factory, in logistics, in consumer analysis, etc.): Methods for duration data and temporal data will be presented, such as time series, survival analysis, and duration modeling.
4) *Visualization and communication*: Visualization for big data exploration and presentation of results will be discussed. The spatial dimension of big data will be explored using spatial data visualization tools (geographic information systems). Throughout the data management and analysis processes, attention will be paid to answering the needs, which means adopting an integrated view of the operations, from data production to delivery of the analysis using communication tools. Coordination and integration of tools will address the needs for reproducibility and transparency of the processes.
Because knowledge production is approached starting from the needs of decision tools, Visualization and Reporting will be the first step of the course, raising the questions of data architecture, handling, exploration and analysis.
Future industrial practices will rely more deeply on information and data management and analytics. The spread of digitization, new communication tools, and measurement and observation tools (sensors, cameras, smartphones, ...) increase the need to collect, store, organize, secure, consult, extract and analyze data. These new data are characterized by their volume, variety and velocity.
Assessing the relevance of data and selecting the *right data for business decisions* is a key strategic capability.
Analyzing complex and big data, and temporal and spatial data, requires specific skills to search for and extract the relevant information and to analyze it in accordance with its specific dimensions.
General Introduction to the “Big Data analytics” process
From the sensor and survey collection to the visualization of analytics results
Elements for a reproducible and transparent data analysis (introduced in the project sessions)
PART I Visualization
Reproducible report: smart and efficient document editing
Advanced visual analytics: interactive GIS - dismantle a map before building one
Decision tools: dashboards, apps, etc.
PART II Big Data Architecture
Distributed data management
System definition and architecture; data distribution and sharding; distributed querying and map-reduce; data models; distributed transactions
Tools: MongoDB (Ensimag), MongoDB sharding (Docker), Population in USA data set
Profiling, cleaning and preparation of large fragmented data sets
Tools: Kaggle Data Lab (http://www.kaggle.com), Python networkx library, Neo4j graph querying (both in a Docker container)
Data processing and analytics at different scales
Big Data Analytics Stacks and AI/ML studios
Tools: PySpark (docker) on Zeppelin, Azure ML Studio, Kaggle
PART III Big Data Exploration
Definitions. Notions of point-to-point, point-to-group and group-to-group
distances. Methods and algorithms: partitioning (K-means, K-medoids,
CLARA), hierarchical (HAC, CURE), density-based (DBSCAN), grid-based
Definitions. Metrics. Rule generation. Methods and
algorithms: Apriori, AprioriTID, Apriori Partition, dynamic counting, FP-Tree,
RAM, random forest. Rule redundancy.
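As an illustration of the rule metrics mentioned above, here is a minimal sketch of support and confidence, the two quantities on which Apriori-style algorithms typically rely. The function names and the toy transactions are assumptions made for this example and are not taken from the course material.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


# Hypothetical toy basket data, for illustration only.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

if __name__ == "__main__":
    print(support(BASKETS, {"bread", "milk"}))
    print(confidence(BASKETS, {"bread"}, {"milk"}))
```

A rule is usually retained only when both values exceed user-chosen thresholds; the thresholds themselves are a modeling choice.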
Automatic classification. Definitions. Tree construction. Impurity measures.
Pruning, overfitting. Discrete and continuous variables.
Missing data. Methods and algorithms: CART, ID3, C4.5, SPRINT. Validation.
Measuring the quality of a partition.
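The impurity measures used in tree construction can be sketched as follows. This is a minimal example assuming the two most common choices, Gini impurity (used by CART) and Shannon entropy (used by ID3 and C4.5); the function names are hypothetical and the course may present other measures.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_impurity(left, right, measure=gini):
    """Size-weighted impurity of a binary split; lower means a purer split."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    print(gini(labels), entropy(labels))
    print(split_impurity(["a", "a"], ["b", "b"]))
```

A tree-building algorithm evaluates candidate splits with such a function and keeps the one that reduces impurity the most.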
Neural Network and Bayesian Network
Definitions. Perceptron: learning by error correction
and by gradient descent. Multilayer network: learning by
gradient backpropagation. Determination of estimation
and classification models. Bayes' formula. Naive Bayes classifiers.
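Perceptron learning by error correction can be sketched in a few lines. The example below is a minimal illustration that assumes a step activation, a learning rate of 1 and the logical AND function as toy data; all names are hypothetical and none of this is taken from the course material.

```python
def predict(weights, bias, x):
    """Step activation: output 1 when the weighted sum is strictly positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Error-correction rule: w <- w + lr * (target - prediction) * x."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:  # every sample is classified correctly: stop early
            break
    return weights, bias


# Toy data (assumption): the logical AND function, which is linearly separable.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

if __name__ == "__main__":
    w, b = train_perceptron(AND_DATA)
    print([predict(w, b, x) for x, _ in AND_DATA])
```

The rule converges only when the classes are linearly separable, which is one reason multilayer networks trained by backpropagation are needed for harder problems.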
PART IV Categorical, Temporal and Geospatial Data Analysis
Categorical Data Analysis
Definitions (binomial and multinomial vectors), decision modeling and data analysis
Logistic regression, multinomial logit, random parameter logit
Ordered logit; sequential logit; nested logit
Temporal Data Analysis
Definitions (activities vs. states; Markov and semi-Markov processes).
Markov model with memory, model of durations between states
Ordered, sequential and dated data. Modeling order and time.
Time series, DTW and cross-correlation.
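Dynamic time warping (DTW) compares two series that may be locally stretched or compressed in time. The following is a minimal sketch of the classic dynamic-programming formulation, assuming an absolute-difference local cost and no window constraint; the function name is hypothetical.

```python
def dtw_distance(a, b):
    """DTW distance between two numeric sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays on the same point
                cost[i][j - 1],      # b advances, a stays on the same point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    # The second series repeats one value; DTW absorbs the stretch.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(dtw_distance([0, 0], [1, 1]))
```

This version runs in time proportional to the product of the two lengths; practical implementations often add a warping window to bound the cost.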
Geospatial Data Analysis
Geographic information systems (QGIS and R)
Introduction to Cartography (coordinates, projections, semiology, etc.)
Spatial databases: polygon layers, joins between
attribute data and spatial data
Fixed and/or variable positions and attributes.
Spatial analysis: descriptive spatial statistics
Spatial relationships and correlation (notion of neighborhood)
Gravity model, spatial autocorrelation model
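As a small illustration of the gravity model, the sketch below assumes the classic specification in which the flow between two places is proportional to the product of their masses (for example populations) and inversely proportional to a power of the distance between them. The parameter names `k` and `beta` and the numeric values are assumptions for the example; the course may use a different specification.

```python
def gravity_flow(mass_i, mass_j, distance, k=1.0, beta=2.0):
    """Predicted flow T_ij = k * M_i * M_j / d_ij**beta (assumed classic form)."""
    if distance <= 0:
        raise ValueError("distance must be strictly positive")
    return k * mass_i * mass_j / distance ** beta


if __name__ == "__main__":
    # Hypothetical values: two places with masses 100 and 200, 10 units apart.
    print(gravity_flow(100, 200, 10))
    # Doubling the distance divides the flow by 2**beta.
    print(gravity_flow(100, 200, 20))
```

In applied work `k` and `beta` are estimated from observed flows, often by regression on the log-transformed equation.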
Students must have knowledge of, and be able to demonstrate skills in:
Probability, statistics, introduction to data analysis, introduction to R programming, relational databases, the declarative language SQL2.
If a student lacks these prerequisites, teachers may decide to exclude the student from the course (or, under certain conditions, may ask the student to learn some of these elements independently).
Individual evaluation (one for each of the 4 parts), e.g. in-class work (TP), multiple-choice questions or closed-ended quizzes (E1 to E4)
Application project carried out in groups (P)
Second session exam (E6)
N1 = (E1 + E2 + E3 + E4 + P)/5
N1 = (E6)
This weighting is compatible with remote teaching and examination
Individual evaluation (one for each of the 2 parts), e.g. in-class work (TP), multiple-choice questions or closed-ended quizzes (E1 to E2)
Application project carried out in groups (P)
Second session exam (E6)
N1 = (E1 + E2 + P)/3
N1 = (E6)
The exam is given in English only
Course ID : 5GUC3500
Greene, W.H. 2008. Econometric Analysis, 6th ed. Prentice Hall.
Hayter, A.J. 2012. Probability and Statistics for Engineers and Scientists. Cengage Learning. https://books.google.fr/books?id=Z3lr7UHceYEC.
Hougaard, P. 2000. Analysis of Multivariate Survival Data. Springer Verlag.
Kantardzic, Mehmed. n.d. Data Mining: Concepts, Models, Methods, and Algorithms, Second Edition. Wiley.
Lawless, J.F. 2003. Statistical Models and Methods for Lifetime Data. John Wiley & Sons, New York.
Listwon, A., and P. Saint-Pierre. 2013. “SemiMarkov: An R Package for Parametric Estimation in Multi-State Semi-Markov Models.” Working Paper -
Ma, Y., and P. B. Seetharaman. 2004. “Multivariate Hazard Models for Multicategory Purchase Timing Behavior.” Working Paper, Rice University.
Martinussen, T., and T.H. Scheike. 2006. Dynamic Regression Models for Survival Data. Springer-Verlag New York.
McFadden, D., and K.E. Train. 2000. “Mixed MNL Models for Discrete Response.” Journal of Applied Econometrics 64: 207–40.
Train, K. 2009. Discrete Choice Methods with Simulation (2nd Ed.). Cambridge, UK: Cambridge University Press.
Date of update July 13, 2023