Time-series data prediction using integrated network models and a generalized linear model

Ying Hu1, Chunhua Yan1, Chih-Hao Hsu1, Yu Liu2, George Komatsoulis1, and Daoud Meerzaman1

1. Center for Biomedical Informatics & Information Technology, National Cancer Institute, Rockville, MD 20850, USA;
2. State Key Laboratory of Molecular Oncology, Cancer Institute of Chinese Academy of Medical Sciences, Beijing 100021, China

0. Summary Sentence: include 1 short sentence description of your method that can be used in a main-text table of the Challenge overview paper. Please be very explicit and concise, and use specific terminology describing methods.

--Time-series prediction was performed with generalized linear models whose predictors were defined by the network structure.

1- What approach did you use to predict the time-series? 
Please describe your methods so a reader can reproduce them. You can include figures to illustrate your method, but don't include figures to report on your results. The purpose of this document is to describe your method and the rationale behind it, not to analyze your results.

--A generalized linear model (glm) was used for the prediction. In the glm, the dependent variables (Y) are the child nodes in the network and the independent variables (X) are their parent nodes.
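The child-on-parents regression described above can be sketched as follows. This is only an illustration under stated assumptions: the submission used R's glm function, whereas this pure-Python stand-in covers just the Gaussian, identity-link special case (ordinary least squares). The write-up does not state which glm family was used. The helper names (`solve_linear_system`, `fit_child_on_parents`, `predict_child`) and the toy data are hypothetical.

```python
# Sketch only: stands in for R's glm() with a Gaussian family and identity
# link, which reduces to ordinary least squares. Names and data are
# hypothetical and not taken from the original submission.

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_child_on_parents(data, child, parents):
    """Least-squares fit of child ~ intercept + parents via normal equations."""
    y = data[child]
    rows = [[1.0] + [data[p][i] for p in parents] for i in range(len(y))]
    k = len(parents) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # [intercept, coef per parent]


def predict_child(coefs, parent_values):
    """Predict the child value from parent values and fitted coefficients."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], parent_values))


# Hypothetical toy data: node "C" has parents "A" and "B" in the network.
toy = {
    "A": [0.0, 1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 0.0, 2.0, 1.0, 3.0],
}
toy["C"] = [0.5 + 2.0 * a - 1.0 * b for a, b in zip(toy["A"], toy["B"])]

coefs = fit_child_on_parents(toy, "C", ["A", "B"])
prediction = predict_child(coefs, [5.0, 2.0])
```

In a full pipeline, one such fit would presumably be made for each child node in each of the final networks, with the parent set read from the selected links.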

1.a. Did you use a network model learned in sub-challenge1?

--Yes. Network structures were generated by three algorithms: 1) a dynamic Bayesian network using first-order conditional dependencies (DBN); 2) graphical Gaussian models (GM); and 3) max-min hill-climbing (MMHC). All data were split into 20 sub-datasets corresponding to the 20 combinations of inhibitor and stimulus conditions, so 60 networks were created in total (3 algorithms x 20 sub-datasets). For each sub-dataset, the 3 networks were merged into one, giving 20 networks. These 20 networks were then merged into 8 networks according to the 8 given stimulus types. The relationship between the network models and the prediction tables is as follows:

Network ID	Stimulus	Sub data set ID	Prediction Table Column
1	hiLIG1	3 11 15 19	2
2	hiLIG1_hiLIG2	4	8
3	hiLIG1_loLIG2	5	7
4	hiLIG2	1 9 13 17	4
5	loLIG1	6 12 16 20	1
6	loLIG1_hiLIG2	7	6
7	loLIG1_loLIG2	8	5
8	loLIG2	2 10 14 18	3

For each of the 8 networks, the links (edges, i.e., node pairs) were scored by their frequency in the un-merged networks. Links were then selected for each network using a cutoff of 30-35, chosen from the score distribution. The selected links formed the final network.
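The frequency-based link scoring and selection can be sketched as below. This is a minimal illustration, not the authors' code. It assumes that links are directed (parent, child) pairs and that a link's score is the number of un-merged networks containing it; the write-up does not spell out either detail. The function name `merge_networks`, the toy edge lists, and the toy cutoff of 2 are hypothetical, and the toy cutoff is unrelated to the 30-35 cutoff reported above. The stimulus-to-sub-dataset mapping is copied from the table.

```python
from collections import Counter

# Mapping from the table above: stimulus -> sub-dataset IDs merged into it.
STIMULUS_TO_SUBDATASETS = {
    "hiLIG1": [3, 11, 15, 19],
    "hiLIG1_hiLIG2": [4],
    "hiLIG1_loLIG2": [5],
    "hiLIG2": [1, 9, 13, 17],
    "loLIG1": [6, 12, 16, 20],
    "loLIG1_hiLIG2": [7],
    "loLIG1_loLIG2": [8],
    "loLIG2": [2, 10, 14, 18],
}


def merge_networks(networks, cutoff):
    """Score each edge by the number of networks containing it and
    keep the edges whose score is at least `cutoff` (assumed rule)."""
    scores = Counter()
    for edges in networks:
        scores.update(set(edges))  # count an edge at most once per network
    kept = {edge for edge, score in scores.items() if score >= cutoff}
    return scores, kept


# Hypothetical un-merged networks, each a list of (parent, child) edges.
unmerged = [
    [("A", "B"), ("B", "C")],
    [("A", "B"), ("A", "C")],
    [("A", "B"), ("B", "C"), ("B", "C")],
]

scores, final_edges = merge_networks(unmerged, cutoff=2)
```

Here `final_edges` holds the links that would define the parent sets for the regression step.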

1.b. Please indicate what kind of method you used (Neural Networks, Model-based, Regression-based, etc.) and expand on whether there are special methodological innovations in your algorithm. 

--A regression-based method (generalized linear model) was used.

1.c. Please specify what programming languages you used and indicate if you used or extended existing tools, packages or algorithms.

--The glm function in R was used.

1.d. Did your algorithm have any tuning parameters (e.g., penalization, hyper-parameters, thresholds...). If so, how many, and how did you set them?

--No, we don't have any tuning parameters.

1.e. How did you model the actions of the inhibitors and the stimuli?

--We did not model them.

1.f. Did you apply the same or different methodologies for predicting the time-courses in the experimental sub-challenge and the in-silico sub-challenge? 

--We used the same methods.

1.g. Have you considered any other approaches, and why did you decide against them?

--We considered differential equation-based methods. We did not use them because there are few time points and the value distributions are not as complex as we had expected.

1.h Please indicate how much computation time was required for your algorithm to obtain results for sub-challenges 2A and 2B, and specify the computing resources you used.

--On a regular laptop, it takes only a few minutes.

2-What portions of the provided data did you use? 

2.a. Did you use all cell lines, stimuli, time points and inhibitors? For the experimental sub-challenge, did you use only the "Main" dataset or did you also use the extra antibodies and time points in the "Full" dataset? Did you lump all data together, or use it in specific combinations? Please describe.

--We used all cell lines, stimuli, time points and inhibitors, but only the "Main" dataset. We did not use any of the extra antibodies or time points in the "Full" dataset. Data from different conditions were first processed separately and then merged.

2.b. Did you further process the data in some way?

--Time in all datasets was converted to a log2 scale.
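A minimal sketch of this transformation is given below. The write-up does not say how a time point of 0 was handled, so this hypothetical helper (`log2_time`) simply rejects non-positive values rather than guessing at an offset. The listed time points are illustrative and are not the challenge's actual sampling times.

```python
import math


def log2_time(times):
    """Convert strictly positive time points to a log2 scale."""
    if any(t <= 0 for t in times):
        # Handling of t = 0 is not described in the write-up; fail loudly.
        raise ValueError("log2 is undefined for non-positive time points")
    return [math.log2(t) for t in times]


# Hypothetical time points (arbitrary units).
example_times = [1, 2, 4, 8, 16]
log_times = log2_time(example_times)
```

On a log2 scale, time points that double at each step become evenly spaced, which can make a linear predictor in time easier to fit.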

3-What external information did you use, and how did you use it? 

--We did not use any external information for this prediction.

4-What cross-validation scheme did you use? Did you model randomness in your cross-validation method? 

--We didn't use any cross-validation scheme.

5-Please discuss the results that you obtained in cross-validation and in the leader board.

5.a. Did you participate on all rounds on leader board, and if not in how many?

--No. We participated only in the last two rounds of the leader board.

5.b. Were the results of the leader board useful for further model development?

--Yes, they helped us see how to improve the performance of our model.

6-What is the background of your team?

Ying Hu: General biology, bioinformatics, statistics and systems biology
Chunhua Yan: Structural biology, bioinformatics
Chih-Hao Hsu: Computer science, bioinformatics
Yu Liu: General biology and bioinformatics 
Daoud Meerzaman: Molecular biology
George Komatsoulis: Molecular Biology and Biochemistry

6.a. What fields do you and your team members work on (network inference, regression, parameter estimation, network biology, other)?

--Bioinformatics, biostatistics, computer science, systems biology

6.b. How many hours did you invest into the challenge?

Ying Hu: 160
Chunhua Yan: 160
Chih-Hao Hsu: 160
Yu Liu: 80 
George Komatsoulis, Daoud Meerzaman: 20

Conclusion/Discussion

This section should include a short summary and any insights gained during the algorithm development/challenge. You can include future directions, or alternative aspects you did not manage to explore.  

--In this study, the time-course distributions of protein activity were not very complex, so a regression-based method was adequate for prediction. Future datasets may have different distributions, in which case other approaches, such as differential equation modeling, may be needed.

References
Maximum 3 references including a hyperlink

Authors Statement

Please list all authors' contributions
Ying Hu: study design, data analysis and write-up 
Chunhua Yan: study design, data analysis and write-up 
Chih-Hao Hsu: study design, data analysis and write-up 
Yu Liu: data analysis
George Komatsoulis, Daoud Meerzaman: study design

