Metadata-Version: 2.1
Name: learning_orchestra_client
Version: 0.4.1
Summary: Learning Orchestra client for Python
Home-page: https://github.com/riibeirogabriel/learningOrchestra
Author: Gabriel Ribeiro
Author-email: gabbriel.rribeiro@gmail.com
License: UNKNOWN
Description: # learningOrchestra client package
        
        ## Installation
        Ensure that you have Python 3 installed on your machine, then run:
        ```
        pip install learning_orchestra_client
        ```
        
        ## Documentation
        
        After installing the package, import all classes:
        
        ```python
        from learning_orchestra_client import *
        ```
        
        Create a Context object, passing the IP of your cluster as the 
        constructor parameter:
        
        ```python
        cluster_ip = "34.95.222.197"
        Context(cluster_ip)
        ```
        
        After creating a Context object, you are able to use learningOrchestra. 
        Each learningOrchestra functionality is contained in its own class, so 
        to use a specific functionality, after instantiating and configuring 
        the Context class, you instantiate the class of interest and call its 
        methods. Below are all the classes and their methods, followed by an 
        example of a workflow using this package in a Python script.
        
        ## DatabaseApi
        
        ### read_resume_files
        
        ```python
        read_resume_files(pretty_response=True)
        ```
        
        Reads all metadata files in learningOrchestra.
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
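
        For example, assuming a Context has already been created for your 
        cluster, a minimal sketch looks like this (the output depends on the 
        files stored in your cluster):

        ```python
        database_api = DatabaseApi()

        # Print an indented string with the metadata of every stored file
        print(database_api.read_resume_files())

        # Get the same metadata as a dict instead
        metadata = database_api.read_resume_files(pretty_response=False)
        ```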
        
        ### read_file
        
        ```python
        read_file(filename_key, skip=0, limit=10, query={}, pretty_response=True)
        ```
        
        * filename_key: filename of the file to read
        * skip: number of rows to skip in pagination (default 0)
        * limit: number of rows to return in pagination (default 10, 
        maximum of 20 rows per request)
        * query: MongoDB query (default: empty query)
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
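
        A sketch, assuming a file named "titanic_training" is already stored 
        in the cluster (the filename and query below are illustrative):

        ```python
        database_api = DatabaseApi()

        # Read rows 10-19 of the file, filtered by a MongoDB query
        print(database_api.read_file(
            "titanic_training",
            skip=10,
            limit=10,
            query={"Sex": "female"}))
        ```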
        
        ### create_file
        
        ```python
        create_file(filename, url, pretty_response=True)
        ```
        
        * filename: filename of the file to be created
        * url: URL of the CSV file
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
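
        A minimal sketch; the filename and URL below are illustrative, and 
        the URL must point to a reachable CSV file:

        ```python
        database_api = DatabaseApi()

        # Download the CSV at the given URL and store it as "my_dataset"
        print(database_api.create_file(
            "my_dataset",
            "https://example.com/my_dataset.csv"))
        ```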
        
        ### delete_file
        
        ```python
        delete_file(filename, pretty_response=True)
        ```
        
        * filename: filename of the file to be deleted
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
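
        For example, to remove the hypothetical "my_dataset" file created 
        above:

        ```python
        database_api = DatabaseApi()

        # Delete the stored file and its metadata
        print(database_api.delete_file("my_dataset"))
        ```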
        
        ## Projection
        
        ### create_projection
        
        ```python
        create_projection(filename, projection_filename, fields, pretty_response=True)
        ```
        
        * filename: filename of the file to project from
        * projection_filename: filename of the projection to be created
        * fields: list of fields to include in the projection
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
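
        A sketch, assuming a stored file named "titanic_training" that 
        contains the listed fields (all names below are illustrative):

        ```python
        projection = Projection()

        # Create "titanic_training_projection" containing only these fields
        print(projection.create_projection(
            "titanic_training",
            "titanic_training_projection",
            ["PassengerId", "Age", "Fare", "Survived"]))
        ```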
        
        ## DataTypeHandler
        
        ### change_file_type
        
        ```python
        change_file_type(filename, fields_dict, pretty_response=True)
        ```
        
        * filename: filename of the file whose field types will be changed
        * fields_dict: dictionary with "field": "number" or "field": "string" 
        entries
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
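
        A sketch with an illustrative filename and fields, casting two fields 
        to numbers and one to a string:

        ```python
        data_type_handler = DataTypeHandler()

        # Cast "Age" and "Fare" to numbers and "Name" to a string
        print(data_type_handler.change_file_type(
            "titanic_training_projection",
            {"Age": "number", "Fare": "number", "Name": "string"}))
        ```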
        
        ## Histogram
        ### create_histogram
        ```python
        create_histogram(filename, histogram_filename, fields, 
                         pretty_response=True)
        ```
        
        * filename: filename of the file to build the histogram from
        * histogram_filename: filename of the histogram to be created
        * fields: list of fields to include in the histogram
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
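
        A sketch with illustrative filenames and fields:

        ```python
        histogram = Histogram()

        # Create "titanic_training_histogram" from the "Age" and "Fare" fields
        print(histogram.create_histogram(
            "titanic_training_projection",
            "titanic_training_histogram",
            ["Age", "Fare"]))
        ```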
        
        ## Tsne
        ### create_image_plot
        ```python
        create_image_plot(tsne_filename, parent_filename,
                          label_name=None, pretty_response=True)
        ```
        
        * tsne_filename: filename of the image plot to be created
        * parent_filename: filename of the file to plot
        * label_name: label name for datasets with labeled tuples (default 
        None, for datasets without labeled tuples) 
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
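
        A sketch with illustrative filenames, assuming a t-SNE result named 
        "titanic_tsne" and a labeled parent dataset:

        ```python
        tsne = Tsne()

        # Plot "titanic_tsne", coloring points by the "Survived" label of
        # the parent dataset
        print(tsne.create_image_plot(
            "titanic_tsne",
            "titanic_training_projection",
            label_name="Survived"))
        ```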
        
        ## ModelBuilder
        
        ### create_model
        
        ```python
        create_model(training_filename, test_filename, preprocessor_code, 
                     model_classificator, pretty_response=True)
        ```
        
        * training_filename: filename of the file to be used in training
        * test_filename: filename of the file to be used in testing
        * preprocessor_code: Python 3 code for the pySpark preprocessing model
        * model_classificator: list of initials of the classifiers to be used 
        in the model
        * pretty_response: return an indented string for visualization 
        (default True; if False, return a dict)
        
        #### model_classificator
        
        * "lr": LogisticRegression
        * "dt": DecisionTreeClassifier
        * "rf": RandomForestClassifier
        * "gb": Gradient-boosted tree classifier
        * "nb": NaiveBayes
        
        To send a request with the LogisticRegression and NaiveBayes 
        classifiers:
        
        ```python
        create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"])
        ```
        
        #### preprocessor_code environment
        
        The Python 3 preprocessing code must use the environment instances 
        below:
        
        * training_df (Instantiated): Spark DataFrame instance for the 
        training file
        * testing_df (Instantiated): Spark DataFrame instance for the testing 
        file
        
        The preprocessing code must instantiate the variables below; all 
        instances must be transformed by the pySpark VectorAssembler:
        
        * features_training (Not Instantiated): Spark DataFrame instance used 
        to train the model
        * features_evaluation (Not Instantiated): Spark DataFrame instance 
        used to evaluate the trained model's accuracy
        * features_testing (Not Instantiated): Spark DataFrame instance used 
        to test the model
        
        In case you don't want to evaluate the model's predictions, define 
        features_evaluation as None.
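
        A minimal preprocessor_code sketch under these rules, assuming the 
        training dataset has a "Survived" target column and only numeric 
        feature columns (the column name is illustrative):

        ```python
        preprocessor_code = '''
        from pyspark.ml.feature import VectorAssembler
        from pyspark.sql.functions import lit

        # Spark expects the target column to be named "label"
        training_df = training_df.withColumnRenamed("Survived", "label")
        testing_df = testing_df.withColumn("label", lit(0))

        # Assemble every column except "label" into a single features vector
        assembler = VectorAssembler(
            inputCols=[c for c in training_df.columns if c != "label"],
            outputCol="features")
        assembler.setHandleInvalid("skip")

        features_training = assembler.transform(training_df)
        # Skip the accuracy evaluation step
        features_evaluation = None
        features_testing = assembler.transform(testing_df)
        '''
        ```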
        
        ##### Handy methods
        
        ```python
        self.fields_from_dataframe(dataframe, is_string)
        ```
        
        * dataframe: DataFrame instance
        * is_string: boolean; if True, the method returns the string fields of 
        the dataframe, otherwise it returns the numeric fields
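
        For example, inside the preprocessor_code environment (where the 
        method is available on self), a hypothetical use is:

        ```python
        # Split the injected training_df columns by type: string fields need
        # indexing before they can feed a VectorAssembler; numeric fields don't
        string_fields = self.fields_from_dataframe(training_df, is_string=True)
        numeric_fields = self.fields_from_dataframe(training_df, is_string=False)
        ```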
        
        ## learning_orchestra_client usage example
        
        Below is a Python script using the package with the 
        [Titanic challenge datasets](https://www.kaggle.com/c/titanic/overview):
        
        ```python
        from learning_orchestra_client import *
        
        cluster_ip = "34.95.187.26"
        
        Context(cluster_ip)
        
        database_api = DatabaseApi()
        
        print(database_api.create_file(
            "titanic_training",
            "https://filebin.net/rpfdy8clm5984a4c/titanic_training.csv?t=gcnjz1yo"))
        print(database_api.create_file(
            "titanic_testing",
            "https://filebin.net/mguee52ke97k0x9h/titanic_testing.csv?t=ub4nc1rc"))
        
        print(database_api.read_resume_files())
        
        
        projection = Projection()
        non_required_columns = ["Name", "Ticket", "Cabin",
                                "Embarked", "Sex", "Initial"]
        print(projection.create_projection("titanic_training",
                                           "titanic_training_projection",
                                           non_required_columns))
        print(projection.create_projection("titanic_testing",
                                           "titanic_testing_projection",
                                           non_required_columns))
        
        
        data_type_handler = DataTypeHandler()
        type_fields = {
            "Age": "number",
            "Fare": "number",
            "Parch": "number",
            "PassengerId": "number",
            "Pclass": "number",
            "SibSp": "number"
        }
        
        print(data_type_handler.change_file_type(
            "titanic_testing_projection",
            type_fields))
        
        type_fields["Survived"] = "number"
        
        print(data_type_handler.change_file_type(
            "titanic_training_projection",
            type_fields))
        
        
        preprocessing_code = '''
        from pyspark.ml import Pipeline
        from pyspark.sql.functions import (
            mean, col, split,
            regexp_extract, when, lit)
        
        from pyspark.ml.feature import (
            VectorAssembler,
            StringIndexer
        )
        
        TRAINING_DF_INDEX = 0
        TESTING_DF_INDEX = 1
        
        training_df = training_df.withColumnRenamed('Survived', 'label')
        testing_df = testing_df.withColumn('label', lit(0))
        datasets_list = [training_df, testing_df]
        
        for index, dataset in enumerate(datasets_list):
            dataset = dataset.withColumn(
                "Initial",
                regexp_extract(col("Name"), "([A-Za-z]+)\.", 1))
            datasets_list[index] = dataset
        
        
        misspelled_initials = ['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 'Countess',
                               'Jonkheer', 'Col', 'Rev', 'Capt', 'Sir', 'Don']
        correct_initials = ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 'Mrs',
                            'Other', 'Other', 'Other', 'Mr', 'Mr', 'Mr']
        for index, dataset in enumerate(datasets_list):
            dataset = dataset.replace(misspelled_initials, correct_initials)
            datasets_list[index] = dataset
        
        
        initials_age = {"Miss": 22,
                        "Other": 46,
                        "Master": 5,
                        "Mr": 33,
                        "Mrs": 36}
        for index, dataset in enumerate(datasets_list):
            for initial, initial_age in initials_age.items():
                dataset = dataset.withColumn(
                    "Age",
                    when((dataset["Initial"] == initial) &
                         (dataset["Age"].isNull()), initial_age).otherwise(
                            dataset["Age"]))
                datasets_list[index] = dataset
        
        
        for index, dataset in enumerate(datasets_list):
            dataset = dataset.na.fill({"Embarked": 'S'})
            datasets_list[index] = dataset
        
        
        for index, dataset in enumerate(datasets_list):
            dataset = dataset.withColumn("Family_Size", col('SibSp')+col('Parch'))
            dataset = dataset.withColumn('Alone', lit(0))
            dataset = dataset.withColumn(
                "Alone",
                when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"]))
            datasets_list[index] = dataset
        
        
        text_fields = ["Sex", "Embarked", "Initial"]
        for column in text_fields:
            for index, dataset in enumerate(datasets_list):
                dataset = StringIndexer(
                    inputCol=column, outputCol=column+"_index").\
                        fit(dataset).\
                        transform(dataset)
                datasets_list[index] = dataset
        
        
        training_df = datasets_list[TRAINING_DF_INDEX]
        testing_df = datasets_list[TESTING_DF_INDEX]
        
        assembler = VectorAssembler(
            inputCols=training_df.columns[1:],
            outputCol="features")
        assembler.setHandleInvalid('skip')
        
        features_training = assembler.transform(training_df)
        (features_training, features_evaluation) =\
            features_training.randomSplit([0.1, 0.9], seed=11)
        features_testing = assembler.transform(testing_df)
        '''
        
        model_builder = ModelBuilder()
        
        print(model_builder.create_model(
            "titanic_training_projection",
            "titanic_testing_projection",
            preprocessing_code,
            ["lr", "dt", "gb", "rf", "nb"]))
        ```
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
