- Number of hours
  - Lectures: 24.0
  - Projects: -
  - Tutorials: 24.0
  - Internship: -
  - Laboratory works: -
  - Written tests: -
- ECTS: 6.0
Goal(s)
Motivation
Artificial Intelligence, Industry 4.0, Digital Factory, Internet of Things, Digital Economy: all of these terms name the fusion of the physical, virtual and digital worlds. Data has become a key asset in the global digital economy. The amount of collectible, collected and available data keeps growing at a tremendous pace, while data sources and dimensions (temporal/spatial, numerical/textual, visual) become more and more complex.
- Description of the course
- Skills of the learning program
 
1) To acquire an overview of the data management pipeline (from production to analysis and communication of results) and be able to manage a data science project from end to end
2) To know methods and apply tools for data handling, cleaning and querying
3) To know methods and apply tools to explore and analyze data
4) To know methods and apply tools specific to two dimensions of industrial big data: spatial data and temporal data
- Organisation of the course
 
The course covers the whole data management pipeline: data collection and production, storage and organization, management, exploitation and analysis, and communication. Big data and dynamic analysis processes need transparent, repeatable and reproducible techniques. Information and knowledge production will be presented 'backward', starting from the needs of decision tools.
The course is made of four parts on the following topics. Each of them aims, first, to identify the needs of specific Big Data Management situations and, second, to give an overview of the relevant methods and tools.
1) *Big-Data Management*: Data production and storage rely on infrastructures and platforms dedicated to the specificities of big data. These are designed to meet the needs for agile tools and for frequent data updates. Repeatability, transparency and traceability of processes will be discussed throughout the data cleaning, querying and extraction operations.
2) *Exploration of complex data with high dimensionality* (characterized by a large number of variables of different natures, possibly structured and/or latent): The main methods of data exploration (classification, segmentation, etc.) will be presented from the perspectives of AI, mathematics and statistics, and algorithmics.
3) *Applied Data Science* on industrial data (e.g. duration of processes, duration between events, in the factory, in logistics, in consumer analysis, etc.): Data science methods (ML and DL algorithms) will be presented to deal with prediction, classification, survival analysis, duration modeling and consumer analysis.
4) *Visualization and communication*: Visualization for big data exploration and for the presentation of results will be discussed. The spatial dimension of big data will be explored using spatial data visualization tools (geographical information systems). Throughout the data management and analysis processes, attention will be paid to meeting the needs identified at the outset, which means adopting an integrated view of the operations, from data production to delivery of the analysis using communication tools. Coordination and integration of tools will address the needs for reproducibility and transparency of the processes.
Because the course is presented ‘backward’, starting from the needs of decision tools, Visualization and Reporting will be the first step of the course, raising questions of data architecture, handling, exploration and analysis.
Future industrial practices will rely more heavily on information and data management and analytics. The spread of digitization, new communication tools, and measurement and observation tools (sensors, cameras, smartphones, ...) increase the need to collect, store, organize, secure, consult, extract and analyze data. These new data are characterized by their volume, variety and velocity.
Assessing the relevance of data and selecting the *right data for business decisions* is a key strategic capability.
Analyzing complex and big data, including temporal and spatial data, requires specific skills to search for and extract the relevant information and to analyze it in accordance with its specific dimensions.
Content(s)
General Introduction to the “Data Science for BigData” process
	From the sensor and survey collection to the visualization of analytics results
	Elements for a reproducible and transparent data analysis (introduced in the project sessions)
PART I	Visualization
	Reproducible report	Smart and efficient document editing
	Advanced visual analytics	Interactive GIS - Dismantle a map before building one
	Decision tools	Dashboard, apps, etc.
PART II	Big Data Architecture
   Distributed data management
	System definition and architecture; data distribution 
	and sharding; distributed querying and map-reduce; Data models; distributed transactions
	Tools: MongoDB, sharding, Docker, R and Python
   Data engineering
	Profiling, cleaning and preparation on fragmented large data sets
	Tools: Kaggle Data Lab (http://www.kaggle.com), Python networkx library
   Data processing and analytics at different scales
	Big Data Analytics Stacks and AI/ML studios
	Tools: Kaggle, R and Python
PART III	Big Data Exploration
   Segmentation, clustering
	Definitions, methods and algorithms: K-means, KNN
   Regression tools
	Definitions, methods and algorithms: OLS, SEM, Logistic
	Goodness of Fit and residuals diagnostics
	Variable selection
   Decision Tree
	Automatic Classification. Definitions. Tree building. Impurity Measures.
   Artificial Neural Networks 
	Definitions. Perceptron. Multilayer networks: learning by gradient backpropagation
   Categorical Data Analysis
	Definitions (Binomial and Multinomial vector), Decision modeling and Data Analysis
	Logistic regression, Multinomial logit, Random parameter logit
	Ordered logit; Sequential logit; Nested logit
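As a small illustration of the multinomial logit listed above, the sketch below computes choice probabilities with the standard formula P_j = exp(V_j) / sum_k exp(V_k). It is not taken from the course material: the function name and the example utilities are hypothetical, chosen only for illustration.

```python
import math


def multinomial_logit_probabilities(utilities):
    """Return choice probabilities P_j = exp(V_j) / sum_k exp(V_k).

    `utilities` is a list of systematic utilities V_j, one per alternative.
    """
    if not utilities:
        raise ValueError("at least one alternative is required")
    # Subtract the largest utility before exponentiating: this avoids
    # overflow and leaves the probabilities unchanged.
    v_max = max(utilities)
    exp_values = [math.exp(v - v_max) for v in utilities]
    total = sum(exp_values)
    return [e / total for e in exp_values]


# Hypothetical utilities for three alternatives (illustrative values only).
probabilities = multinomial_logit_probabilities([1.0, 0.5, 0.0])
```

With only two alternatives and the second utility fixed at 0, the first probability reduces to the logistic function 1 / (1 + exp(-V)), which is the binary logistic regression case.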
 
Students are expected to know and be able to demonstrate skills in the following prerequisites:
Probability, statistics, introduction to data analysis, introduction to programming in R and/or Python, relational databases, and the declarative query language SQL2.
If a student lacks these prerequisites, teachers may decide to exclude the student from the course or, under conditions, may require the student to learn some of these elements independently.
This weighting is compatible with distance teaching and examination.
Individual evaluation: final exam (E)
In-class projects: P = average of all project grades
Application project carried out in groups: Defense (D) and Report (R)
Second session: individual examination grade (E6), based on a written or oral evaluation
N1 = 0.45*E + 0.225*R + 0.225*D + 0.1*P
N2 = E6
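The weighting above can be sketched as a short calculation, where E is the final exam, R the report, D the defense, P the project average and E6 the second-session exam. The function names and the sample grades are hypothetical and serve only to illustrate the arithmetic.

```python
def first_session_grade(exam, report, defense, projects):
    """N1 = 0.45*E + 0.225*R + 0.225*D + 0.1*P (the weights sum to 1)."""
    return 0.45 * exam + 0.225 * report + 0.225 * defense + 0.1 * projects


def second_session_grade(resit_exam):
    """N2 = E6: the second-session grade is the resit exam grade alone."""
    return resit_exam


# Hypothetical grades, for illustration only.
n1 = first_session_grade(exam=12.0, report=14.0, defense=16.0, projects=10.0)
n2 = second_session_grade(11.0)
```

Because the four weights sum to 1, a student with the same grade on every component receives that grade as N1.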
The exam is given in English only.
The course exists in the following branches:
- Curriculum - Engineer student Master SCM - Semester 9 (this course is given in English only)
- Curriculum - Engineer IPID apprentice program - Semester 9 (this course is given in English only)
- Curriculum - Master 2 GI SIE program major SPD - Semester 9 (this course is given in English only)
- Curriculum - Engineer student Master PD - Semester 9 (this course is given in English only)
- Curriculum - Master 2 GI GID major GOD - Semester 9 (this course is given in English only)
- Curriculum - Master 2 GI GID major DPD - Semester 9 (this course is given in English only)
- Curriculum - Master 2 GI SIE program major SOM - Semester 9 (this course is given in English only)
Course ID : 5GUC3500
Course language(s): English
Greene, W.H. 2008. Econometric Analysis, 6th ed. Prentice Hall.
Hayter, A.J. 2012. Probability and Statistics for Engineers and Scientists. Cengage Learning. https://books.google.fr/books?id=Z3lr7UHceYEC.
Hougaard, P. 2000. Analysis of Multivariate Survival Data. Springer Verlag.
Kantardzic, Mehmed. n.d. Data Mining: Concepts, Models, Methods, and Algorithms, Second Edition. Wiley.
Lawless, J.F. 2003. Statistical Models and Methods for Lifetime Data. John Wiley & Sons, New York.
Listwon, A., and P. Saint-Pierre. 2013. “SemiMarkov: An R Package for Parametric Estimation in Multi-State Semi-Markov Models.” Working Paper - 
Ma, Y., and P. B. Seetharaman. 2004. “Multivariate Hazard Models for Multicategory Purchase Timing Behavior.” Working Paper, Rice University.
Martinussen, T., and T.H. Scheike. 2006. Dynamic Regression Models for Survival Data. Springer-Verlag New York.
McFadden, D., and K.E. Train. 2000. “Mixed MNL Models for Discrete Response.” Journal of Applied Econometrics 64: 207–40.
Train, K. 2009. Discrete Choice Methods with Simulation (2nd Ed.). Cambridge, UK: Cambridge University Press.
 
       
      
    