Génie industriel - Education section - 2022

UE Smart Analytics for Big Data - 5GUC3500

Number of hours
- Lectures 24.0
- Projects -
- Tutorials 24.0
- Internship -
- Laboratory works -
- Written tests -
ECTS
ECTS 6.0

Goal(s)

Motivation

Artificial Intelligence, Industry 4.0, Digital Factory, Internet of Things, Digital Economy: all of these terms refer to the fusion of the physical, virtual and digital worlds. Data has become a key asset in the global digital economy; the amount of collectible, collected and available data keeps growing at a tremendous pace, while data sources and dimensions (temporal/spatial, numerical/textual, visual) become increasingly complex.

Description of the course

1. Skills of the learning program

1) To acquire an overview of the data management pipeline (from production to analysis and communication of results) and be able to manage a data science project from end to end
2) To know methods and to apply tools for data handling, cleaning and querying
3) To know methods and to apply tools to explore and analyze data
4) To know methods and to apply tools specific to data characterized by two dimensions specific to industrial big data: spatial data and temporal data

2. Organisation of the course

The course covers all stages of the data management pipeline: data collection and production, storage and organization, management, exploitation and analysis, and communication. Big data and dynamic analysis processes need transparent, repeatable and reproducible techniques. Information and knowledge production will be presented 'backward', starting from the needs of decision tools.

The course is made up of four parts on the following topics. Each of them aims, first, to identify the needs of specific Big Data Management situations and, second, to give an overview of the relevant methods and tools.

1) *Big-Data Management*: Data production and storage rely on infrastructures and platforms dedicated to the specificities of big data. These are designed to meet the need for agile tools and for data updates. Repeatability, transparency and traceability of processes will be discussed throughout the data cleaning, querying and extraction operations.

2) *Exploration of complex data with high dimensionality* (characterized by a large number of variables of different natures, possibly structured and/or latent): The main methods of data exploration (classification, segmentation, etc.) will be presented from the perspectives of AI, mathematics and statistics, and algorithmics.

3) *Applied Data Science* on industrial data (i.e. process durations, durations between events in the factory, in logistics, in consumer analysis, etc.): Data science methods (ML and DL algorithms) will be presented to deal with prediction, classification, survival analysis, duration modeling and consumer analysis.

4) *Visualization and communication*: Visualization for big data exploration and for the presentation of results will be discussed. The spatial dimension of big data will be explored using spatial data visualization tools (geographical information systems). Throughout the data management and analysis processes, attention will be paid to meeting the stated needs, which means adopting an integrated view of the operations, from data production to delivery of the analysis through communication tools. Coordination and integration of tools will address the need for reproducible and transparent processes.

Because information and knowledge production is presented 'backward', starting from the needs of decision tools, Visualization and Reporting will be the first step of the course, raising questions of data architecture, handling, exploration and analysis.
Future industrial practices will rely more deeply on information and data management and analytics. The spread of digitization, new communication tools, and measurement and observation tools (sensors, cameras, smartphones, ...) increase the need to collect, store, organize, secure, consult, extract and analyze data. These new data are characterized by their volume, variety and velocity.

Assessing the relevance of data and selecting the *right data for business decisions* is a key strategic capability.

Analyzing complex and big data, including temporal and spatial data, requires specific skills to search for and extract the relevant information and to analyze it in accordance with its specific dimensions.

Responsible(s)

Iragael JOLY

Content(s)

General Introduction to the “Data Science for Big Data” process
From the sensor and survey collection to the visualization of analytics results
Elements for a reproducible and transparent data analysis (introduced in the project sessions)
PART I Visualization
Reproducible report: smart and efficient document editing
Advanced visual analytics: interactive GIS - dismantle a map before building one
Decision tools: dashboards, apps, etc.
PART II Big Data Architecture
Distributed data management
System definition and architecture; data distribution and sharding; distributed querying and map-reduce; data models; distributed transactions
Tools: MongoDB, sharding docker, R and Python
Data engineering
Profiling, cleaning and preparation on fragmented large data sets
Tools: Kaggle Data Lab (http://www.kaggle.com), Python networkx library
Data processing and analytics at different scales
Big Data Analytics Stacks and AI/ML studios
Tools: Kaggle, R and Python
PART III Big Data Exploration
Segmentation, clustering
Definitions, methods and algorithms (K-means, KNN)
Regression tools
Definitions, methods and algorithms (OLS, SEM, Logistic)
Goodness of Fit and residuals diagnostics
Variable selection
Decision Tree
Automatic Classification. Definitions. Tree building. Impurity Measures.
Artificial Neural Networks
Definitions. Perceptron. Multilayer networks: learning by gradient backpropagation
Categorical Data Analysis
Definitions (Binomial and Multinomial vector), Decision modeling and Data Analysis
Logistic regression, Multinomial logit, Random parameter logit
Ordered logit; Sequential logit; Nested logit

Prerequisites

Students are expected to have, and be able to demonstrate, knowledge of:
probability, statistics, introductory data analysis, introductory programming in R and/or Python, relational databases, and the declarative language SQL2.

If a student lacks these prerequisites, teachers may decide to exclude the student from the course (or, conditionally, the student may be asked to learn some of these elements independently).

Test

This weighting is compatible with distance teaching and examination.

Individual evaluation: final exam (E)

In-class projects: P = average of all project grades

Group application project: Defense (D) and Report (R)

Second session: individual examination grade EX (based on a written or oral evaluation): (E6)

N1 = 0.45*E + 0.225*R + 0.225*D + 0.1*P

N2 = E6
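
As a worked illustration of the first-session weighting above, the short sketch below computes N1 from a set of component grades. The function name, the sample grades and the 0-20 scale are assumptions made for the example; only the weights come from the formula above.

```python
def first_session_grade(exam, report, defense, projects):
    """Weighted first-session grade N1, using the weights stated in the syllabus."""
    return 0.45 * exam + 0.225 * report + 0.225 * defense + 0.1 * projects


# Hypothetical grades on an assumed 0-20 scale (not taken from the course).
n1 = first_session_grade(exam=12.0, report=14.0, defense=16.0, projects=15.0)
print(round(n1, 2))
```

The four weights sum to 1, so N1 stays on the same scale as its components; the final exam alone accounts for 45% of the grade, and the group project (report plus defense) for another 45%.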

The exam is given in English only.

Calendar

The course exists in the following branches:

Curriculum - Engineer student Master SCM - Semester 9 (this course is given in English only)
Curriculum - Engineer IPID apprentice program - Semester 9 (this course is given in English only)
Curriculum - Master 2 GI SIE program major SPD - Semester 9 (this course is given in English only)
Curriculum - Engineer student Master PD - Semester 9 (this course is given in English only)
Curriculum - Master 2 GI GID major GOD - Semester 9 (this course is given in English only)
Curriculum - Master 2 GI GID major DPD - Semester 9 (this course is given in English only)
Curriculum - Master 2 GI SIE program major SOM - Semester 9 (this course is given in English only)


Additional Information

Course ID : 5GUC3500
Course language(s):


Bibliography

Greene, W.H. 2008. Econometric Analysis, 6th ed. Prentice Hall.
Hayter, A.J. 2012. Probability and Statistics for Engineers and Scientists. Cengage Learning. https://books.google.fr/books?id=Z3lr7UHceYEC.
Hougaard, P. 2000. Analysis of Multivariate Survival Data. Springer Verlag.
Kantardzic, Mehmed. n.d. Data Mining: Concepts, Models, Methods, and Algorithms, Second Edition. Wiley.
Lawless, J.F. 2003. Statistical Models and Methods for Lifetime Data. John Wiley & Sons, New York.
Listwon, A., and P. Saint-Pierre. 2013. “SemiMarkov: An R Package for Parametric Estimation in Multi-State Semi-Markov Models.” Working Paper.
Ma, Y., and P. B. Seetharaman. 2004. “Multivariate Hazard Models for Multicategory Purchase Timing Behavior.” Working Paper, Rice University.
Martinussen, T., and T.H. Scheike. 2006. Dynamic Regression Models for Survival Data. Springer-Verlag New York.
McFadden, D., and K.E. Train. 2000. “Mixed MNL Models for Discrete Response.” Journal of Applied Econometrics 64: 207–40.
Train, K. 2009. Discrete Choice Methods with Simulation (2nd Ed.). Cambridge, UK: Cambridge University Press.

Update - 19/06/2025
